LABIC - Bioinformatics and Computational Intelligence Laboratory

Local Repository of Research Datasets


OSVidCAP: a Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario


 

  1. Introduction

Automatically understanding and describing the visual content of videos in natural language is a challenging task. Most current approaches are designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video.
OSVidCap is a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and detects unknown ones. It is based on the encoder-decoder framework and uses an object detection-and-tracking mechanism followed by a background blurring method to focus on specific targets in a video. Additionally, the TI3D network combined with the Extreme Value Machine (EVM) is used to learn representations and recognize unknown actions.
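For illustration only, the snippet below sketches the kind of open-set decision rule described above: given per-class inclusion probabilities (such as those an EVM fitted on TI3D features could produce), a sample is assigned to the best-scoring known class or rejected as unknown. The function name, threshold value, and probabilities are assumptions made for clarity and do not reproduce the actual OSVidCap implementation.

    import numpy as np

    UNKNOWN = "unknown"

    def openset_predict(inclusion_probs, class_names, threshold=0.5):
        """Return the best-scoring known class, or 'unknown' when no class
        reaches the inclusion threshold."""
        best = int(np.argmax(inclusion_probs))
        if inclusion_probs[best] < threshold:
            return UNKNOWN            # rejected: treated as an unseen action
        return class_names[best]      # accepted as a known action

    # Toy usage with made-up inclusion probabilities for the 10 known classes
    classes = [f"class_{i}" for i in range(1, 11)]
    print(openset_predict(np.array([0.10, 0.05, 0.70] + [0.02] * 7), classes))  # class_3
    print(openset_predict(np.full(10, 0.10), classes))                          # unknown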

  2. Dataset Description
In our experiments, we use the LIRIS human activities dataset. It was designed for recognizing complex and realistic actions in videos and was made available for the ICPR-HARL'2012 competition. The full dataset contains 828 actions (including discussing, telephone calls, giving an item, etc.) performed by 21 different people in 10 different classes. It is organized into two independent subsets: the D1 subset, with depth and grayscale images, and the D2 subset, with color images. The dataset also contains unannotated actions, such as walking, running, whiteboard writing, book leafing, etc.

In this work, we used the D2 subset, which contains 367 annotated actions from 167 videos. Each action consists of one or more people performing one or more different activities. In addition, 116 video segments covering 15 different unannotated actions were extracted from the original videos and considered as unknown classes. Each new video segment was also annotated with spatial, temporal, and description information.

Videos are available on the Liris human activity website: https://projet.liris.cnrs.fr/voir/activities-dataset/download.html.
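Each annotated action segment carries spatial (bounding box), temporal (frame range), and, in this work, description information. The sketch below is only an illustrative in-memory representation of such an entry; the field names are our assumptions and do not reproduce the dataset's actual annotation file format.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class ActionAnnotation:
        video_id: str                  # e.g. "video_0001" (illustrative)
        class_id: int                  # 1-10 for known classes; -1 for unknown segments
        start_frame: int               # temporal extent of the action
        end_frame: int
        boxes: Dict[int, Tuple[int, int, int, int]] = field(default_factory=dict)
        # frame index -> (x, y, width, height) spatial annotation
        description: str = ""          # natural-language description of the action

    # Example entry for a known action (values are made up)
    ann = ActionAnnotation(
        video_id="video_0001", class_id=8, start_frame=120, end_frame=180,
        boxes={120: (40, 60, 90, 200), 121: (41, 60, 90, 200)},
        description="Two people shake hands in the corridor.",
    )
    print(ann.class_id, ann.description)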


D2-subset action histograms: number of annotated actions per class in the training and test splits.

Label                                                             Training   Test
 1 - Discussion between two or more people                              23     15
 2 - Give an object to another person                                   13      6
 3 - Put/take an object into/from a box/desk                            41     20
 4 - Enter/leave a room (pass through a door) without unlocking         56     28
 5 - Try to enter a room (unsuccessfully)                               16      7
 6 - Unlock and enter (or leave) a room                                 14      7
 7 - Leave baggage unattended                                           13      8
 8 - Handshaking                                                        21     15
 9 - Typing on a keyboard                                               30     13
10 - Telephone conversation                                             15      6
Total                                                                  242    125

 


D2-subset unknown-action histograms: number of unannotated action segments per class.

Label                                  N. Actions
Cleaning whiteboard                             2
Closing door                                    3
Standing still / doing nothing                  7
Holding and flipping an object                  1
Flipping through a book                         3
Opening laptop for work                         1
Picking something from the pocket               2
Putting something into the pocket               2
Putting something on the wall                   1
Running                                         2
Taking off a backpack                           7
Using cellphone                                 1
Walking                                        76
Walking and flipping through a book             2
Writing on a whiteboard                         6
Total                                         116
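As a small illustration of how the tables above translate into an open-set protocol, the sketch below maps the ten annotated class ids to their labels and collapses every unannotated category into a single "unknown" label at evaluation time; treating all unannotated categories as one rejected label is our assumption here, and the variable and function names are ours, not part of the released code.

    # Known classes of the D2 subset, as listed in the first table above.
    KNOWN_CLASSES = {
        1: "Discussion between two or more people",
        2: "Give an object to another person",
        3: "Put/take an object into/from a box/desk",
        4: "Enter/leave a room (pass through a door) without unlocking",
        5: "Try to enter a room (unsuccessfully)",
        6: "Unlock and enter (or leave) a room",
        7: "Leave baggage unattended",
        8: "Handshaking",
        9: "Typing on a keyboard",
        10: "Telephone conversation",
    }

    UNKNOWN_LABEL = "unknown"

    def to_openset_label(class_id):
        """Map a dataset class id to its open-set evaluation label."""
        return KNOWN_CLASSES.get(class_id, UNKNOWN_LABEL)

    print(to_openset_label(8))    # "Handshaking"
    print(to_openset_label(42))   # "unknown"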
  3. Link to the Annotations

        Code and data are available on the paper's Papers with Code page.

 

  4. Paper

If you use this data or code, please cite the following paper:

Inácio, A. S., Gutoski, M., Lazzaretti, A. E., and Lopes, H. S., "OSVidCap: a Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario," IEEE Access, vol. 9, pp. 137029-137041, 2021. DOI: 10.1109/ACCESS.2021.3116882