Robust Similarity from Vision-Language Models for Learning with Noisy Labels
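Research Background
The pre-training and fine-tuning (PT-FT) paradigm has become mainstream in natural language processing and multimodal learning. For vision-language models, fine-tuning a pre-trained model on a downstream dataset via prompt learning has been widely shown to generalize very well. In many downstream scenarios, however, the collected data is heavily noisy, and manual annotation and correction are time-consuming. Given the need to transfer models quickly, we aim to explore a robust prompt learning mechanism that is more resistant to noise and relies on few-shot learning, so that vision-language models can be better fine-tuned for downstream datasets.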
Research Areas
- Noisy Label Learning
- Few-shot Learning
- Prompt Learning
- Vision-Language Models (VLMs)
Research Foundations
- Data: the downstream dataset contains noise (labeling errors);
- Model: Vision-Language Models (VLMs);
- Goal: learn a robust model on noisy downstream tasks;
- Approach: fine-tune a pre-trained model with few-shot learning;
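Research Motivation
- How can we construct multiple decision-makers so that an ensemble decision strategy mitigates or avoids the bias of any single model?
- How can we use sample features to recover each sample's latent label and assist the decision?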
Double Similarities Supervision For Filtering Noisy Samples
Prompt Similarities By Matrix Learners
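- Step 1: construct $m$ prompt blocks for each class to form a prompt matrix;
- Step 2: use the frozen encoders, guided by the multiple decision-makers, to obtain a feature matrix for each sample;
- Step 3: ensemble the decisions of all decision-makers to obtain each sample's prompt-similarity;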
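The three steps above can be sketched roughly as follows. This is only an illustrative sketch, not the project's implementation: it assumes each prompt block acts as one decision-maker, that each decision-maker scores classes by temperature-scaled cosine similarity followed by a softmax, and that the ensemble is a plain average. The names `prompt_similarity`, `prompt_matrix`, and `temperature` are hypothetical, and the feature vectors stand in for the outputs of the frozen encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_similarity(image_feat, prompt_matrix, temperature=0.01):
    """Average the class probabilities predicted by every prompt block.

    prompt_matrix[c][j] is the (assumed) text feature of prompt block j
    for class c, so each column j plays the role of one decision-maker.
    """
    num_classes = len(prompt_matrix)
    num_blocks = len(prompt_matrix[0])
    ensemble = [0.0] * num_classes
    for j in range(num_blocks):
        # One decision-maker: compare the image with block j of every class.
        logits = [cosine(image_feat, prompt_matrix[c][j]) / temperature
                  for c in range(num_classes)]
        probs = softmax(logits)
        for c in range(num_classes):
            ensemble[c] += probs[c] / num_blocks
    return ensemble

# Toy example: 2 classes, m = 2 prompt blocks, 2-dimensional features.
toy_prompt_matrix = [
    [[1.0, 0.1], [0.9, 0.0]],  # class 0
    [[0.0, 1.0], [0.1, 1.0]],  # class 1
]
toy_sims = prompt_similarity([1.0, 0.0], toy_prompt_matrix)
```

Averaging probabilities rather than raw logits keeps each decision-maker's vote on the same scale; other ensembling rules (voting, weighted averaging) would fit the same interface.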
Feature Similarities By Mutual Distance
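- Step 1: extract features for all image samples with the visual encoder;
- Step 2: group the extracted features by their noisy labels;
- Step 3: compute the mutual distance matrix among the samples in each class to obtain each sample's feature-distance;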
- Step 4: minimizing the mutual distance is equivalent to maximizing within-class similarity, so feature-similarity = - feature-distance;
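A minimal sketch of these four steps is given below. The text does not specify the distance metric or how a row of the mutual distance matrix is reduced to a single number, so this sketch assumes Euclidean distance and the mean distance to the other samples sharing the same noisy label. The function name `feature_similarity` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def feature_similarity(features, noisy_labels):
    """Return one feature-similarity score per sample.

    Samples are grouped by their (possibly wrong) label; each sample's
    feature-distance is its mean Euclidean distance to the other members
    of its group, and feature-similarity is the negated distance.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(noisy_labels):
        groups[label].append(idx)

    similarity = [0.0] * len(features)
    for members in groups.values():
        for i in members:
            others = [j for j in members if j != i]
            if not others:
                continue  # a singleton group has no mutual distance
            mean_dist = sum(math.dist(features[i], features[j])
                            for j in others) / len(others)
            similarity[i] = -mean_dist
    return similarity

# Toy example: sample 2 carries label 0 but lies far from the other two.
toy_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [10.0, 10.0]]
toy_labels = [0, 0, 0, 1]
toy_feature_sims = feature_similarity(toy_features, toy_labels)
```

In the toy data the outlying sample receives the lowest similarity within its group, which is the behavior one would want from a score intended to flag likely noisy samples.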
Robust Similarity Construction
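A sample's prompt similarity depends heavily on CLIP's predictive ability. Early in the transfer to a downstream task, the model still needs improvement and its label predictions are not very reliable, whereas the relationships among sample features directly reflect the difference between noisy and clean samples (relative to the joint feature distribution of the majority of clean samples, the feature distribution of noisy samples is comparatively isolated, with a noticeably different mean and variance). We therefore construct a robust similarity that balances the two, i.e. a trade-off between the model's learning ability and the latent structure of the sample features: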
$$
\begin{aligned}
G_i &= \alpha \cdot \tilde{y}_i + (1-\alpha) \cdot g_i, \qquad i = 1, \dots, D \\
\alpha &= 0.2 \cdot e^{\mathrm{epoch}/35} \in (0.2,\ 0.8325)
\end{aligned}
$$
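- Early in training, the model's learning ability is weak and the pseudo-labels are unreliable, so the robust similarity comes mainly from the feature similarity formed by the latent structure of the samples;
- Late in training, the model's learning ability gradually improves and the pseudo-labels become more reliable, so the robust similarity comes mainly from the prompt similarity predicted by the model ensemble;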
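The weighting schedule can be illustrated with a short sketch. Following the two bullets above, it reads $\tilde{y}_i$ as the prompt similarity (pseudo-label) and $g_i$ as the feature similarity; that reading is inferred from the text rather than stated explicitly. It also assumes both scores have already been brought to a comparable scale. The function names are hypothetical.

```python
import math

def alpha_schedule(epoch, base=0.2, tau=35.0):
    """Weight on the prompt similarity: alpha = 0.2 * exp(epoch / 35)."""
    return base * math.exp(epoch / tau)

def robust_similarity(prompt_sim, feature_sim, epoch):
    """Blend the two scores: G = alpha * y_tilde + (1 - alpha) * g.

    prompt_sim stands for y_tilde and feature_sim for g; both are assumed
    to be scalars on a comparable scale.
    """
    alpha = alpha_schedule(epoch)
    return alpha * prompt_sim + (1.0 - alpha) * feature_sim

# At epoch 0 the blend leans on the feature similarity (weight 0.8);
# the weight on the prompt similarity grows as training proceeds.
early = robust_similarity(prompt_sim=1.0, feature_sim=0.0, epoch=0)
later = robust_similarity(prompt_sim=1.0, feature_sim=0.0, epoch=30)
```

The schedule is not clipped in this sketch, so it only stays a convex combination while $\alpha \le 1$; a longer training run would need an explicit cap.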
How to run
Requirements
To verify the model's principles, we used a single GPU (RTX 2080 Ti) and trained only the prompt learner. The following shell commands are for a single run.
- Screen 0: `screen -S cuda0`
Shell Script
We configure all experiments in a single shell script, which makes it convenient to run the validation and ablation experiments. After the experiments finish, the script immediately runs the result analysis.
Results
Ablation Study for POMA: prompt blocks
- dataset: Dtd
- noise rate: 0 | 12.5% | 25% | 50%
- backbone: Text: ViT-B/32-PT, Visual: RN50-PT
Prompt Blocks | Noise 0 | Noise 12.5% | Noise 25% | Noise 50% | MeanAcc |
---|---|---|---|---|---|
PTNL | 62.86% | 58.90% | 53.62% | 46.19% | 55.39% |
1 | 61.90% ± 1.29% | 59.77% ± 1.02% | 57.68% ± 0.76% | 49.39% ± 0.31% | 57.19% |
2 | 62.73% ± 1.00% | 60.92% ± 0.45% | 59.65% ± 1.50% | 49.84% ± 0.89% | 58.28% |
4 | 62.80% ± 0.51% | 62.61% ± 0.91% | 60.56% ± 0.41% | 52.40% ± 1.10% | 59.59% |
6 | 63.95% ± 0.54% | 62.77% ± 0.59% | 61.17% ± 1.02% | 53.74% ± 1.67% | 60.41% |
Raw Materials
Model training logs can be found in log.txt under each experiment directory.
Parsing results can be found in the following files:
- Dataset: Dtd
Conclusions
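- The prompt matrix effectively mitigates model bias and further improves CLIP's performance on downstream tasks, with more pronounced gains under high noise;
- The constructed robust similarity better combines model characteristics with the structure of the sample features, achieving better transfer performance and noise robustness.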