Are you interested in multimodal applications but bored with image captioning? Then try video captioning with Atomic Visual Actions (AVA). AVA is a new dataset that provides multiple action labels for each person in extended video sequences. It consists of URLs for publicly available YouTube videos, annotated with a set of 80 atomic actions (e.g. “walk”, “kick (an object)”, “shake hands”), for a total of 210k action labels.
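
To make "multiple action labels for each person" concrete, here is a minimal Python sketch of how such annotations might be grouped. The record layout (`video_id`, `timestamp`, `person_id`, `action`), the `group_actions` helper, and the sample rows are all assumptions made up for illustration; they are not the dataset's actual file format or real annotations.

```python
from collections import defaultdict
from typing import NamedTuple


class ActionLabel(NamedTuple):
    """One hypothetical annotation row: a single action for a single person."""
    video_id: str      # placeholder identifier for a YouTube video
    timestamp: float   # seconds into the video
    person_id: int     # distinguishes people within the same video
    action: str        # one atomic action, e.g. "walk"


# Illustrative rows only; these are not real AVA annotations.
labels = [
    ActionLabel("vid_001", 902.0, 0, "walk"),
    ActionLabel("vid_001", 902.0, 0, "shake hands"),
    ActionLabel("vid_001", 902.0, 1, "shake hands"),
    ActionLabel("vid_001", 903.0, 1, "kick (an object)"),
]


def group_actions(rows):
    """Collect the set of actions for each (video, timestamp, person) key."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[(row.video_id, row.timestamp, row.person_id)].add(row.action)
    return dict(grouped)


actions_by_person = group_actions(labels)
```

In this sketch, person 0 at timestamp 902.0 ends up with two labels at once ("walk" and "shake hands"), which is the multi-label property that sets this kind of data apart from single-label video classification.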