Visual temporal attention

Figure: Video frames from the Parallel Bars action category in the UCF-101 dataset.[1] (a) The four highest-ranked frames by video temporal attention weight, in which the athlete is performing on the parallel bars; (b) the four lowest-ranked frames, in which the athlete is standing on the ground. All weights are predicted by the ATW CNN algorithm.[2] The highly weighted video frames generally capture the most distinctive movements relevant to the action category.

Visual temporal attention is a special case of visual attention that involves directing attention to specific instants in time. Like its spatial counterpart, visual spatial attention, these attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance and human-interpretable explanations[3] of deep learning models.

Just as the visual spatial attention mechanism allows human and computer vision systems to focus on semantically significant regions in space, visual temporal attention modules enable machine learning algorithms to emphasize the most critical video frames in video analytics tasks, such as human action recognition. In convolutional neural network-based systems, the prioritization introduced by the attention mechanism is typically implemented as a linear weighting layer whose parameters are learned from labeled training data.[3]
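The following is a minimal sketch of such a temporal attention layer, written in PyTorch. It is an illustration of the general technique described above, not the actual ATW CNN implementation; all class and variable names here are hypothetical. Given per-frame feature vectors from a backbone CNN, a learned linear layer scores each frame, a softmax over the time axis turns the scores into attention weights, and the weighted sum yields a single clip-level feature.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Scores each frame, normalizes scores with a softmax over time,
    and returns the attention-weighted video-level feature."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # Linear weighting layer; its parameters are learned from labeled data.
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, frame_features: torch.Tensor):
        # frame_features: (batch, num_frames, feature_dim)
        scores = self.score(frame_features).squeeze(-1)  # (batch, num_frames)
        weights = torch.softmax(scores, dim=1)           # temporal attention weights
        # Weighted sum over frames -> one clip-level feature per video.
        video_feature = (weights.unsqueeze(-1) * frame_features).sum(dim=1)
        return video_feature, weights

# Usage: a batch of 2 clips, each with 8 frames of 512-dim CNN features.
attn = TemporalAttention(feature_dim=512)
feats = torch.randn(2, 8, 512)
clip_feat, w = attn(feats)  # w ranks frames by predicted relevance

Sorting the frames of a clip by the learned weights w is what produces rankings like those in the figure above: frames containing the discriminative motion receive high weights, while uninformative frames receive low ones.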

  1. ^ "Center for Research in Computer Vision". University of Central Florida. 2013.
  2. ^ Zang, Jinliang; Wang, Le; Liu, Ziyi; Zhang, Qilin; Hua, Gang; Zheng, Nanning (2018). "Attention-Based Temporal Weighted Convolutional Neural Network for Action Recognition". Artificial Intelligence Applications and Innovations. IFIP Advances in Information and Communication Technology. Springer. pp. 97–108.
  3. ^ a b "Interpretable ML Symposium". NIPS 2017.