LABIC - Bioinformatics and Computational Intelligence Laboratory

Local Repository of Research Datasets


OSVidCAP: a Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario


 

  1. Introduction

Automatically understanding and describing the visual content of videos in natural language is a challenging task. Most current approaches are designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video.
OSVidCap is a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and detects unknown ones. It is based on the encoder-decoder framework and uses an object detection-and-tracking mechanism followed by a background blurring method to focus on specific targets in a video. Additionally, the TI3D network combined with the Extreme Value Machine (EVM) is used to learn representations and recognize unknown actions.
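For illustration only, the snippet below sketches the kind of open-set decision rule described above: given per-class inclusion probabilities (such as those an EVM fitted on TI3D features could produce), a sample is assigned to the best-scoring known class or rejected as unknown. The function name, threshold value, and probabilities are assumptions made for clarity and do not reproduce the actual OSVidCap implementation.

    import numpy as np

    UNKNOWN = "unknown"

    def openset_predict(inclusion_probs, class_names, threshold=0.5):
        """Return the best-scoring known class, or 'unknown' when no class
        reaches the inclusion threshold."""
        best = int(np.argmax(inclusion_probs))
        if inclusion_probs[best] < threshold:
            return UNKNOWN            # rejected: treated as an unseen action
        return class_names[best]      # accepted as a known action

    # Toy usage with made-up inclusion probabilities for the 10 known classes
    classes = [f"class_{i}" for i in range(1, 11)]
    print(openset_predict(np.array([0.10, 0.05, 0.70] + [0.02] * 7), classes))  # class_3
    print(openset_predict(np.full(10, 0.10), classes))                          # unknown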

  2. Dataset Description
In our experiments, we use the LIRIS human activities dataset. It was designed for recognizing complex and realistic actions in videos and was made available for the ICPR-HARL'2012 competition. The full dataset contains 828 actions (including discussing, telephone calls, giving an item, etc.) performed by 21 different people in 10 different classes. It is organized into two independent subsets: the D1 subset, with depth and grayscale images, and the D2 subset, with color images. The dataset also contains unannotated actions, such as walking, running, whiteboard writing, book leafing, etc.

In this work, we used the D2 subset, which contains 367 annotated actions from 167 videos. Each action consists of one or more people performing one or more different activities. In addition, 116 video segments covering 15 different unannotated actions were extracted from the original videos and considered as unknown classes. Each new video segment was also annotated with spatial, temporal, and description information.

Videos are available on the Liris human activity website: https://projet.liris.cnrs.fr/voir/activities-dataset/download.html.
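Each annotated action segment carries spatial (bounding box), temporal (frame range), and, in this work, description information. The sketch below is only an illustrative in-memory representation of such an entry; the field names are our assumptions and do not reproduce the dataset's actual annotation file format.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class ActionAnnotation:
        video_id: str                  # e.g. "video_0001" (illustrative)
        class_id: int                  # 1-10 for known classes; -1 for unknown segments
        start_frame: int               # temporal extent of the action
        end_frame: int
        boxes: Dict[int, Tuple[int, int, int, int]] = field(default_factory=dict)
        # frame index -> (x, y, width, height) spatial annotation
        description: str = ""          # natural-language description of the action

    # Example entry for a known action (values are made up)
    ann = ActionAnnotation(
        video_id="video_0001", class_id=8, start_frame=120, end_frame=180,
        boxes={120: (40, 60, 90, 200), 121: (41, 60, 90, 200)},
        description="Two people shake hands in the corridor.",
    )
    print(ann.class_id, ann.description)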


D2-subset action histograms: number of annotated actions per class in the training and test splits.

Label                                                             Training   Test
 1 - Discussion between two or more people                              23     15
 2 - Give an object to another person                                   13      6
 3 - Put/take an object into/from a box/desk                            41     20
 4 - Enter/leave a room (pass through a door) without unlocking         56     28
 5 - Try to enter a room (unsuccessfully)                               16      7
 6 - Unlock and enter (or leave) a room                                 14      7
 7 - Leave baggage unattended                                           13      8
 8 - Handshaking                                                        21     15
 9 - Typing on a keyboard                                               30     13
10 - Telephone conversation                                             15      6
Total                                                                  242    125

 


D2-subset unknown-action histograms: number of unannotated action segments per class.

Label                                  N. Actions
Cleaning whiteboard                             2
Closing door                                    3
Standing still / doing nothing                  7
Holding and flipping an object                  1
Flipping through a book                         3
Opening laptop for work                         1
Picking something from the pocket               2
Putting something into the pocket               2
Putting something on the wall                   1
Running                                         2
Taking off a backpack                           7
Using cellphone                                 1
Walking                                        76
Walking and flipping through a book             2
Writing on a whiteboard                         6
Total                                         116
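As a small illustration of how the tables above translate into an open-set protocol, the sketch below maps the ten annotated class ids to their labels and collapses every unannotated category into a single "unknown" label at evaluation time; treating all unannotated categories as one rejected label is our assumption here, and the variable and function names are ours, not part of the released code.

    # Known classes of the D2 subset, as listed in the first table above.
    KNOWN_CLASSES = {
        1: "Discussion between two or more people",
        2: "Give an object to another person",
        3: "Put/take an object into/from a box/desk",
        4: "Enter/leave a room (pass through a door) without unlocking",
        5: "Try to enter a room (unsuccessfully)",
        6: "Unlock and enter (or leave) a room",
        7: "Leave baggage unattended",
        8: "Handshaking",
        9: "Typing on a keyboard",
        10: "Telephone conversation",
    }

    UNKNOWN_LABEL = "unknown"

    def to_openset_label(class_id):
        """Map a dataset class id to its open-set evaluation label."""
        return KNOWN_CLASSES.get(class_id, UNKNOWN_LABEL)

    print(to_openset_label(8))    # "Handshaking"
    print(to_openset_label(42))   # "unknown"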
  3. Link to the Annotations

        Code and data are available on the paper's Papers with Code page.

 

  4. Paper

If you use this data or code, please cite the following paper:

Inácio, A. S., Gutoski, M., Lazzaretti, A. E., and Lopes, H. S., "OSVidCap: a Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario," IEEE Access, vol. 9, pp. 137029-137041, 2021. DOI: 10.1109/ACCESS.2021.3116882