
【Music】Video Soundtracks | Multimodal Retrieval: Content-based video–music retrieval (CBVMR) Using Soft Intra-Modal Structure Constraint (Notes)

Published: 2024-02-28 05:46:13

2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint


bidirectional retrieval


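  • Design a cross-modal model that does not require metadata
  • Matched video–music pairs are hard to obtain, and the matching criteria between video and music are more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)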


  • Content-based, cross-modal embedding network
    • introduce VM-NET, a two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
    • train the network via inter-modal ranking loss
      such that videos and music with similar semantics end up close together in the embedding space

However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.

  • devise a novel soft intra-modal structure constraint
    takes advantage of the relative distance relationship of samples within each modality
    does not require ground truth pair information within individual modality.

Large-scale video–music pair dataset

  • Hong–Im Music–Video 200K (HIMV-200K)
    composed of 200,500 video–music pairs.


  • Recall@K
  • subjective user evaluation

Related work

A. Video–Music Related Tasks

conventional approaches can be divided into three categories according to the task:

  • generation,
  • classification
  • matching


B. Two-branch Neural Networks


Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)


A. Music Feature Extraction

  1. decompose an audio signal into harmonic and percussive components

  2. apply log-amplitude scaling to each component
    to avoid numerical underflow

  3. slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame.
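Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' extraction code: the function names, the frame length of 2048, the hop of 512, and the `amin` floor are all assumptions, and the harmonic/percussive split of step 1 is not shown (a library routine such as `librosa.effects.hpss` could supply it).

```python
import numpy as np

def log_amplitude(power, amin=1e-10):
    """Convert a non-negative power array to decibels, flooring at `amin`.

    The floor keeps log10 away from zero, which is the numerical
    safeguard the notes refer to. `amin` is an assumed value.
    """
    return 10.0 * np.log10(np.maximum(power, amin))

def slice_into_frames(signal, frame_length, hop_length):
    """Return a (n_frames, frame_length) array of overlapping local frames."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])

# Toy one-second signal standing in for one already-separated component.
component = np.sin(2 * np.pi * 440 * np.arange(12000) / 12000.0)
frames = slice_into_frames(component, frame_length=2048, hop_length=512)
scaled = log_amplitude(frames ** 2)
```

Each row of `frames` is one local frame; the frame-level features described next would be computed per row.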

Frame-level features.

  1. Spectral features
    The first type of audio features are derived from spectral analyses.
  • first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
  • From the magnitude spectral results
    • compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first and second order polynomial features of a spectrogram
  2. Mel-scale features
  • compute the Mel-scale spectrogram of each frame as well as the Mel-frequency Cepstral Coefficients (MFCC)
    to extract more meaningful features

  • use delta-MFCC features (the first- and second-order differences in MFCC features over time)
    capture variations of timbre over time

  3. Chroma features
  • use chroma short-time Fourier transform as well as chroma energy normalized statistics (CENS)
    While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes
  4. Etc.
  • use the number of time domain zero-crossings as an audio feature
    in order to detect the amount of noise in the audio signal.
  • use the root-mean-square energy for each frame
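Three of the simpler frame-level features above (spectral centroid, zero-crossing count, RMS energy) can be computed directly with NumPy. The sketch below is only an illustration under assumed settings (a 12 kHz sample rate, 2048-sample frames, and a hypothetical `frame_features` helper); the Mel, MFCC, and chroma features would normally come from an audio library and are not reproduced here.

```python
import numpy as np

def frame_features(frame, sample_rate):
    """Return (spectral centroid in Hz, zero-crossing count, RMS energy) for one frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Spectral centroid: magnitude-weighted mean frequency.
    centroid = float((freqs * magnitude).sum() / max(magnitude.sum(), 1e-12))
    # Zero crossings: number of sign changes between consecutive samples.
    signs = np.signbit(frame)
    zero_crossings = int(np.count_nonzero(signs[1:] != signs[:-1]))
    # Root-mean-square energy of the frame.
    rms = float(np.sqrt(np.mean(frame ** 2)))
    return centroid, zero_crossings, rms

sample_rate = 12000
t = np.arange(2048) / sample_rate
tone = np.sin(2 * np.pi * 375.0 * t)                  # clean 375 Hz tone
noise = np.random.default_rng(0).standard_normal(2048)  # white noise

tone_centroid, tone_zc, tone_rms = frame_features(tone, sample_rate)
noise_centroid, noise_zc, noise_rms = frame_features(noise, sample_rate)
```

The noise frame crosses zero far more often than the tone, which is why the zero-crossing count is a cheap indicator of noisiness.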

B. Video Feature Extraction

Frame-level features

  • The HIMV-200K dataset contains a large amount of data, and training a CNN from scratch would take too long
    so an Inception network pretrained on ImageNet is used to extract frame-level features

  • whitened principal component analysis (WPCA)
    the normalized features are approximately multivariate Gaussian with zero mean and identity covariance
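A whitened PCA can be sketched with an eigendecomposition of the feature covariance. The code below is a generic textbook version, not the paper's implementation; the helper names, the `eps` regularizer, and the toy dimensions (16 input dimensions reduced to 8) are assumptions.

```python
import numpy as np

def fit_wpca(features, n_components, eps=1e-8):
    """Fit whitened PCA; returns the feature mean and a whitening projection."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest components
    # Dividing each principal axis by sqrt(eigenvalue) gives unit variance.
    projection = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mean, projection

def apply_wpca(features, mean, projection):
    """Center the features and project them onto the whitened axes."""
    return (features - mean) @ projection

rng = np.random.default_rng(0)
raw = rng.standard_normal((500, 16)) @ rng.standard_normal((16, 16))  # correlated toy features
mean, projection = fit_wpca(raw, n_components=8)
whitened = apply_wpca(raw, mean, projection)
```

On the data used for fitting, `whitened` has zero mean and (up to `eps`) identity covariance, matching the property stated above.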

Video-level features

concatenation of the music-level features
a global normalization process (subtracts the mean of vectors from all the features)
principal component analysis (PCA)

L2 normalization

C. Multimodal Embedding

The final step is to embed the separately extracted features of the heterogeneous music and video modalities into a shared embedding space.

The two-branch neural network


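  • Video features are extracted from a pretrained CNN
  • Music features are just a simple concatenation of statistics of low-level audio features

To compensate for the relatively low-level audio features, we make the audio branch of the network deeper than the video branch.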
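The asymmetry between the two branches can be illustrated with a forward pass in NumPy. Everything concrete here is hypothetical: the layer widths, the input dimensions (64 for video, 96 for music), the ReLU activations, and the final L2 normalization are assumptions chosen for the sketch, and no training is shown.

```python
import numpy as np

def init_branch(layer_sizes, rng):
    """Create (weight, bias) pairs for a stack of fully connected layers."""
    return [(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def embed(x, layers):
    """Run features through one branch and L2-normalize the result."""
    for index, (weight, bias) in enumerate(layers):
        x = x @ weight + bias
        if index < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
video_branch = init_branch([64, 48, 32], rng)      # 2 layers (hypothetical sizes)
music_branch = init_branch([96, 64, 48, 32], rng)  # 3 layers: deeper audio branch

video_emb = embed(rng.standard_normal((4, 64)), video_branch)
music_emb = embed(rng.standard_normal((4, 96)), music_branch)
```

Both branches end in the same dimensionality (32 here), so distances between `video_emb` and `music_emb` rows are well defined.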


Inter-modal ranking constraint

Inspired by the triplet ranking loss

  • a positive cross-modal sample
    a ground truth pair item separated from the same music video
  • a negative sample
    not paired with the anchor


  • vi (anchor)
    video of the i-th music video
  • mi (positive sample)
    music of the i-th music video
  • mj (negative sample)
    the music feature obtained from the j-th music video
  • d(v,m)
    distance (e.g., Euclidean distance)
  • e
    a margin constant

video input

music input
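With the symbols listed above, the constraints for the two query directions can be written as the usual triplet-ranking inequalities. This is a reconstruction from the definitions of vi, mi, mj, d and e in these notes, so the exact form is an assumption:

```latex
\begin{align}
d(v_i, m_i) + e &< d(v_i, m_j) \quad \forall j \neq i && \text{(video input)} \\
d(m_i, v_i) + e &< d(m_i, v_j) \quad \forall j \neq i && \text{(music input)}
\end{align}
```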



  • top Q most violated cross-modal matches in each mini-batch

selecting a maximum of Q violating negative matches that are closer to the positive pair (i.e., a ground truth video–music pair) in the embedding space.
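A hinge-style version of this constraint with top-Q selection might look as follows. It is a sketch under assumptions: Euclidean distance, row i of each batch forming the ground-truth pair, and hypothetical function names; the paper's actual training code may differ.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def inter_modal_ranking_loss(video_emb, music_emb, margin, top_q):
    """Return (video-to-music, music-to-video) hinge losses for one mini-batch.

    Row i of `video_emb` and row i of `music_emb` are assumed to be a
    ground-truth pair; every other row serves as a negative.
    """
    dist = pairwise_distances(video_emb, music_emb)  # dist[i, j] = d(v_i, m_j)
    positive = np.diag(dist)                         # d(v_i, m_i)
    off_diagonal = ~np.eye(len(dist), dtype=bool)

    def one_direction(d):
        # Hinge term max(0, d(anchor, positive) - d(anchor, negative) + margin).
        violation = np.maximum(0.0, positive[:, None] - d + margin)
        violation = np.where(off_diagonal, violation, 0.0)
        # Keep only the Q largest violations per anchor.
        worst = -np.sort(-violation, axis=1)[:, :top_q]
        return float(worst.sum())

    return one_direction(dist), one_direction(dist.T)

videos = np.array([[0.0, 0.0], [3.0, 0.0]])
aligned = inter_modal_ranking_loss(videos, videos.copy(), margin=1.0, top_q=1)
swapped = inter_modal_ranking_loss(videos, videos[::-1].copy(), margin=1.0, top_q=1)
```

When each video sits on its own music track (`aligned`), no negative violates the margin; swapping the music rows (`swapped`) makes every negative closer than the positive and the loss becomes large.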

Soft intra-modal structure constraint

If only the inter-modal ranking constraint is used, the modality-specific characteristics may be lost:

  • in music
    rhythm, tempo, or timbre
  • in videos
    brightness, color, or texture

To address the collapse of the structure within each modality, we devise a soft intra-modal structure constraint.

video input

music input

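A distance relationship between music features should hold in the multimodal space if the same relationship holds between the music features before embedding (and likewise for video features).

Unlike the inter-modal ranking constraint, this constraint does not use the margin constant.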
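Written out, the conditional constraint could take the following form. The tilde notation for features before embedding is an assumption introduced here for illustration; it does not appear in these notes.

```latex
\begin{align}
d(v_i, v_j) &< d(v_i, v_k) \quad \text{if } d(\tilde{v}_i, \tilde{v}_j) < d(\tilde{v}_i, \tilde{v}_k) && \text{(video input)} \\
d(m_i, m_j) &< d(m_i, m_k) \quad \text{if } d(\tilde{m}_i, \tilde{m}_j) < d(\tilde{m}_i, \tilde{m}_k) && \text{(music input)}
\end{align}
```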

Embedding network loss

  • inter-modal ranking constraint
    two types of triplets (vi, mi, mj) and (mi, vi, vj)

  • soft intra-modal structure constraint
    two types of triplets (vi, vj, vk) and (mi, mj, mk)

sign(x)=\begin{cases}1, & x>0 \\ 0, & x=0 \\ -1, & x<0\end{cases}
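One way the sign function can turn the soft intra-modal constraint into a loss is to weight each distance gap by the difference between its sign after embedding and its sign before embedding, so that triplets whose ordering is preserved contribute nothing. The sketch below follows that idea; the exact coefficient, the function names, and the absence of any loss weight are assumptions rather than a transcription of the paper's formula.

```python
import numpy as np

def self_distances(x):
    """Euclidean distance between every pair of rows in `x`."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def soft_intra_modal_loss(embedded, original):
    """Penalize triplets (i, j, k) whose distance ordering changed after embedding."""
    d_emb = self_distances(embedded)
    d_org = self_distances(original)
    # gap[i, j, k] = d(x_i, x_j) - d(x_i, x_k)
    gap_emb = d_emb[:, :, None] - d_emb[:, None, :]
    gap_org = d_org[:, :, None] - d_org[:, None, :]
    # Zero when the ordering agrees, +/-2 when it is reversed.
    coefficient = np.sign(gap_emb) - np.sign(gap_org)
    idx = np.arange(len(embedded))
    distinct = ((idx[:, None, None] != idx[None, :, None])
                & (idx[:, None, None] != idx[None, None, :])
                & (idx[None, :, None] != idx[None, None, :]))
    return float((coefficient * gap_emb)[distinct].sum())

original = np.array([[0.0], [1.0], [3.0]])
preserved = soft_intra_modal_loss(2.0 * original, original)  # same ordering, rescaled
reversed_order = soft_intra_modal_loss(np.array([[0.0], [3.0], [1.0]]), original)
```

No margin appears anywhere in this term, consistent with the note that the soft constraint does not use the margin constant. Each unordered pair (j, k) is visited twice, which only rescales the value.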

Dataset and implementation details

A. Construction of the Dataset




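After downloading all videos tagged "music video", FFmpeg can be used to split each one into a video track and an audio track. In the end we obtained 205,000 video–music pairs; the training, validation, and test sets comprise 200K, 4K, and 1K pairs, respectively.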
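The split into a silent video file and an audio file can be scripted by building two FFmpeg command lines per clip. The notes do not give the exact commands, so the output naming scheme and the helper `build_split_commands` below are hypothetical; only the standard FFmpeg options `-i`, `-an` (drop audio), `-vn` (drop video), and `-c:v copy` are used.

```python
from pathlib import Path

def build_split_commands(source, out_dir):
    """Return (video_cmd, audio_cmd) argument lists for splitting one music video."""
    source = Path(source)
    out_dir = Path(out_dir)
    video_out = out_dir / f"{source.stem}_video.mp4"  # hypothetical naming scheme
    audio_out = out_dir / f"{source.stem}_audio.wav"
    # -an drops the audio stream; the video stream is copied without re-encoding.
    video_cmd = ["ffmpeg", "-i", str(source), "-an", "-c:v", "copy", str(video_out)]
    # -vn drops the video stream; the audio is written as a WAV file.
    audio_cmd = ["ffmpeg", "-i", str(source), "-vn", str(audio_out)]
    return video_cmd, audio_cmd

video_cmd, audio_cmd = build_split_commands("downloads/clip.mp4", "split")
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.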

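To release our HIMV-200K dataset publicly without infringing copyright, we provide the URLs of the YouTube videos under the "online video" category, together with the feature extraction code for the videos and music tracks, in our online repository (https://github.com/csehong/VM-NET).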

B. Implementation Details

Therefore, we trimmed the audio signals to 29.12 s at the center of the songs and downsampled them from 22.05 kHz to 12 kHz following [36].
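The trimming arithmetic is easy to check in code: 29.12 s is 642,096 samples at 22.05 kHz and 349,440 samples at 12 kHz. The sketch below uses hypothetical helpers, and its resampler is a crude linear interpolation without an anti-aliasing filter, used only to keep the example dependency-free; a proper polyphase or band-limited resampler should replace it in practice.

```python
import numpy as np

def center_trim(signal, sample_rate, duration_s):
    """Keep `duration_s` seconds around the middle of the signal."""
    n_keep = int(round(duration_s * sample_rate))
    if len(signal) <= n_keep:
        return signal
    start = (len(signal) - n_keep) // 2
    return signal[start:start + n_keep]

def resample_linear(signal, source_rate, target_rate):
    """Crude linear-interpolation resampler (no anti-aliasing filter)."""
    n_out = int(round(len(signal) * target_rate / source_rate))
    source_times = np.arange(len(signal)) / source_rate
    target_times = np.arange(n_out) / target_rate
    return np.interp(target_times, source_times, signal)

# One minute of noise at 22.05 kHz as a stand-in for a song.
song = np.random.default_rng(0).standard_normal(22050 * 60)
clip = center_trim(song, sample_rate=22050, duration_s=29.12)
clip_12k = resample_linear(clip, source_rate=22050, target_rate=12000)
```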

  • Decompose the audio signal into harmonic and percussive components and extract a large number of audio features frame by frame.

video–music retrieval

followed the implementation details in [40].



Experimental results

A. The Recall@K Metric

1K test set



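For a given value of K, Recall@K measures the percentage of queries in the test set for which at least one correct ground-truth match is ranked among the top K matches. For example, for video queries that ask for suitable music, Recall@10 tells us the percentage of video queries whose top ten results contain the ground-truth music match.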
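Given a query-by-candidate distance matrix, Recall@K reduces to counting how many candidates rank ahead of the ground truth. The sketch below assumes one ground-truth candidate per query, stored at the same index as the query; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def recall_at_k(distances, k):
    """Fraction of queries whose ground-truth match is among the K nearest candidates.

    distances[i, j] is the distance from query i to candidate j, and the
    ground-truth candidate for query i is assumed to be candidate i.
    """
    n = len(distances)
    true_dist = distances[np.arange(n), np.arange(n)]
    # Rank of the ground truth = number of candidates strictly closer than it.
    rank = (distances < true_dist[:, None]).sum(axis=1)
    return float((rank < k).mean())

toy = np.array([[0.1, 0.5, 0.9],
                [0.2, 0.8, 0.1],
                [0.3, 0.2, 0.1]])
r1 = recall_at_k(toy, 1)
r3 = recall_at_k(toy, 3)
```

Here queries 0 and 2 rank their ground truth first while query 1 ranks it third, so Recall@1 is 2/3 and Recall@3 is 1. For music-to-video retrieval the same function can be applied to the transposed matrix.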

[30] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[33] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5005–5013.

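Giving λ1 relatively more weight than λ2 generally improves performance. Empirically, however, we confirmed that setting λ1 to 5 or greater does not improve Recall@K.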

B. A Human Preference Test



