(46): VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Popularity: 87 | Published: 2023-11-17 07:40:59
- Abstract
- 1. Introduction
- 2. Related work
- 2.1. Transformers in Vision
- 2.2. Self-Supervised Learning
- 3. Approach
- 3.1. Tokenization and Positional Encoding
- 3.1.1 DropToken
- 3.2. The Transformer Architecture
- 3.3. Common Space Projection
- 3.4. Multimodal Contrastive Learning