SIFT-50M: New Data Supercharges Speech LLMs, Improves Understanding

This is a Plain English Papers summary of a research paper called SIFT-50M: New Data Supercharges Speech LLMs, Improves Understanding. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

Bridging the Speech-Text Gap: How SIFT-50M Improves LLMs' Understanding of Speech

Speech-text large language models (LLMs) represent an exciting frontier in AI, integrating audio processing capabilities with the reasoning powers of language models. However, these models have faced a significant limitation: they typically train on datasets designed for automatic speech recognition (ASR) rather than natural language instruction following. This has restricted their ability to respond flexibly to diverse speech understanding tasks.

A new dataset called SIFT-50M (Speech Instruction Fine-Tuning) addresses this gap with 50 million examples of instruction-based training data spanning five languages. This represents a major advancement similar to how instruction tuning datasets for code have transformed code-generating models.

The Speech Instruction Challenge

Recent speech-text LLMs have made significant progress by connecting audio encoders with language models. Some projects like SALMONN integrate Whisper and BEATs encoders with pre-trained LLMs, while others like Qwen-Audio train on multiple speech and audio tasks. Models like AudioPaLM and LauraGPT extend LLM vocabulary with discrete audio tokens to support both understanding and generation.

Despite these advances, all these models face the same fundamental limitation: they rely on datasets primarily designed for speech recognition rather than instruction following. This inhibits their ability to generalize to a wide range of speech understanding tasks, similar to challenges addressed in multilingual instruction fine-tuning work.

SIFT-50M: A Comprehensive Speech Instruction Dataset

SIFT-50M represents a major leap forward, with 50 million examples built from publicly available speech corpora containing 14,000 hours of speech across five languages. The dataset includes diverse instruction types for both speech understanding and controllable speech generation.

| Category | #Samples (train / dev / EvalSIFT) |
| --- | --- |
| Closed-Ended: Acoustic-level | 17.8M / 100K / 2.5K |
| Closed-Ended: Content-level | 14.5M / 80K / 2.5K |
| Closed-Ended: Word-Align | 9.8M / 40K / 2.5K |
| Closed-Ended: Comparison | 3.6M / 100K / 2.5K |
| Open-Ended | 4.3M / 100K / 10K |
| Controllable Generation | 5.6M / 50K / 10K |
| Total | 55.6M / 470K / 30K |

Table 1: Number of SIFT-50M samples (train / dev / EvalSIFT) for each instruction category.

The dataset features several distinct instruction categories:

  1. Closed-Ended Instructions: Tasks with specific, verifiable answers including:

    • Acoustic-level questions about audio characteristics like speaker identification and emotion
    • Content-level questions focusing on spoken content and its meaning
    • Word-Align instructions requesting information about specific words or phrases
    • Comparison instructions asking to compare elements across different audio segments
  2. Open-Ended Instructions: Tasks requiring deeper reasoning and interpretation of speech

  3. Controllable Generation Instructions: Prompts for generating speech with specified characteristics like pitch, speaking rate, or emotion

This comprehensive approach enables models to develop a full range of speech processing capabilities, similar to how medical language models have benefited from domain-specific instruction datasets.
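To make these categories concrete, here is a minimal sketch of what instruction entries in this style might look like. The field names and example values below are illustrative assumptions, not the actual SIFT-50M schema.

```python
# Illustrative sketch of instruction-tuning entries in the style of SIFT-50M.
# Field names and values are hypothetical; the real dataset schema may differ.

closed_ended_acoustic = {
    "audio": "clip_0001.wav",                      # path to the speech segment (placeholder)
    "instruction": "What emotion does the speaker convey in this recording?",
    "answer": "The speaker sounds frustrated.",    # specific, verifiable answer
    "category": "closed_ended/acoustic_level",
}

word_align = {
    "audio": "clip_0002.wav",
    "instruction": "At what time does the word 'tomorrow' occur in the audio?",
    "answer": "Between 2.4 and 2.9 seconds.",
    "category": "closed_ended/word_align",
}

controllable_generation = {
    "instruction": "Say 'the meeting starts at noon' slowly, with low pitch and a calm tone.",
    "target_speech": "generated_0003.wav",         # reference audio with the requested style
    "category": "controllable_generation",
}
```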

Experimental Methodology: Training and Evaluating SIFT-LLM

Using the SIFT-50M dataset, the researchers trained SIFT-LLM, a speech-text LLM designed to excel at instruction following while maintaining strong performance on foundational speech tasks. The training process included two stages (a conceptual sketch follows the list):

  1. Speech Understanding: Fine-tuning on speech understanding instructions
  2. Controllable Generation: Further training to enable speech generation capabilities
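As a rough conceptual sketch of this staged fine-tuning (not the paper's actual model, data, or hyperparameters), the two stages can be viewed as sequential passes over different instruction mixes, with stage two continuing from stage one's weights:

```python
# Conceptual sketch of two-stage instruction fine-tuning. The model, data, and
# hyperparameters are toy stand-ins, not the paper's actual setup.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

model = nn.Linear(16, 4)                      # stand-in for a speech-text LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def make_split(n):                            # synthetic placeholder data
    return TensorDataset(torch.randn(n, 16), torch.randint(0, 4, (n,)))

understanding = make_split(256)               # stage 1: speech-understanding instructions
generation = make_split(256)                  # stage 2: controllable-generation instructions

def train_stage(dataset, epochs=1):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()

train_stage(understanding)                                   # stage 1
train_stage(ConcatDataset([understanding, generation]))       # stage 2, continuing from stage 1
```

Whether stage two replays stage-one data or trains purely on generation instructions is a design choice; the mixed variant shown here is one common way to limit forgetting, and is an assumption rather than the paper's stated recipe.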

For evaluation, they introduced EvalSIFT, a new benchmark specifically designed to test speech-text LLMs' instruction-following abilities. They also used established benchmarks including Dynamic-SUPERB and AIR-Bench Chat.

The evaluation methodology employed objective metrics for closed-ended tasks and human evaluation (supplemented by LLM-based scoring) for open-ended questions. This approach parallels efficient test-time learning methods in its rigorous evaluation focus.
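For intuition, here is a hedged sketch of how the two evaluation modes could be scored. EvalSIFT's exact prompts, answer-matching rules, and judge model are not specified here, so the functions below are illustrative only.

```python
# Illustrative scoring for closed-ended (accuracy) and open-ended (judge score) tasks.
from typing import Callable

def closed_ended_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy (%) for closed-ended instructions."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def open_ended_score(responses: list[str], instructions: list[str],
                     judge: Callable[[str, str], float]) -> float:
    """Average 0-10 score for open-ended instructions.
    `judge` is a caller-supplied grader (e.g. a human rater or an LLM judge)."""
    scores = [judge(inst, resp) for inst, resp in zip(instructions, responses)]
    return sum(scores) / len(scores)

# Example with a trivial stand-in judge:
print(closed_ended_accuracy(["happy", "male"], ["happy", "female"]))                 # 50.0
print(open_ended_score(["..."], ["Describe the speaker's tone."], lambda i, r: 7.0))  # 7.0
```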

Performance Results: SIFT-LLM Sets New Benchmarks

SIFT-LLM demonstrates substantial improvements over existing speech-text LLMs on instruction-following benchmarks:

| Model | DS-1 (Closed) | EvalSIFT (Closed) | AB-Chat (Open) | EvalSIFT (Open) | Dynamic-SUPERB task breakdown |
| --- | --- | --- | --- | --- | --- |
| SALMONN-7B | 34.7 | 21.9 | 6.4 | 6.0 | 31.7 / 30.5 / 47.5 / 30.0 / 45.2 |
| Qwen2-Audio-Inst. | 48.0 | 25.1 | 7.2 | 7.3 | 53.5 / 28.9 / 40.3 / 43.9 / 70.6 |
| O-ASQA-LLM | 45.9 | 22.9 | 6.6 | 4.7 | 28.5 / 30.0 / 38.6 / 45.9 / 72.3 |
| SIFT-LLM (ours) | 57.4 | 46.1 | 7.3 | 7.8 | 37.5 / 42.8 / 51.3 / 63.6 / 75.6 |

Table 3: Evaluation results of speech-text LLMs on Dynamic-SUPERB (DS-1), AIR-Bench Chat (AB-Chat), and EvalSIFT (English). Accuracy (in %) is reported for closed-ended evaluations and an LLM score (0 to 10) for open-ended evaluations. The final column gives the per-category breakdown of Dynamic-SUPERB performance across its task groups (Audio, PL, Semt., Degrd., Content, Speaker).

SIFT-LLM achieves significant improvements on the EvalSIFT benchmark, with 46.1% accuracy on closed-ended tasks compared to 25.1% for the next best model. It also performs well on Dynamic-SUPERB and maintains competitive performance on foundational speech tasks.

Importantly, SIFT-LLM also demonstrates strong multilingual capabilities:

| Model | German | French | Italian | Spanish |
| --- | --- | --- | --- | --- |
| SALMONN-7B | 15.0 / 4.3 | 16.3 / 5.0 | 14.3 / 5.0 | 16.7 / 5.4 |
| Qwen2-Audio-Instruct | 18.6 / 6.0 | 18.8 / 6.8 | 18.2 / 7.2 | 21.2 / 7.3 |
| SIFT-LLM | 39.0 / 6.6 | 34.3 / 7.1 | 33.2 / 7.5 | 35.6 / 7.0 |

Results on non-English languages from EvalSIFT (each cell: closed-ended accuracy / open-ended LLM score). SIFT-LLM outperforms SALMONN-7B and Qwen2-Audio-Instruct on closed-ended evaluations across languages, though absolute accuracy is lower on non-English languages than on English.

Controllable Speech Generation: A Step Toward Expressive AI

Beyond understanding speech, SIFT-LLM also demonstrates the ability to generate speech with specified characteristics:

| Feature | MAE (↓) | QWK (↑) |
| --- | --- | --- |
| Pitch variation | 0.99 ± 1.05e-2 | 0.15 ± 2.17e-2 |
| Speaking rate | 0.65 ± 0.35e-2 | 0.46 ± 0.49e-2 |
| Intensity | 0.18 ± 0.16e-2 | 0.02 ± 0.57e-2 |

Table 6: Evaluation of SIFT-LLM GEN on the controllable generation set of EvalSIFT. Mean absolute error (MAE, lower is better) and quadratic weighted kappa (QWK, higher is better) compare speech characteristics measured from SIFT-LLM GEN's generated audio with those specified in the instructions.

The model shows the most promising control over speaking rate (highest QWK); intensity errors are small in absolute terms but rank agreement is near zero, and pitch variation proves hardest to control. This controllable speech generation capability nonetheless represents an important step toward more expressive AI communication systems.
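MAE here is the mean absolute error between the requested and measured attribute values, and QWK is the quadratic weighted kappa, which rewards predictions that are close in rank order. Assuming each attribute is discretized into ordinal levels (an assumption for illustration; the paper's exact measurement pipeline may differ), the two metrics can be computed as follows:

```python
# Sketch of MAE and QWK for controllable generation, assuming each attribute
# (pitch, rate, intensity) is mapped to ordinal levels such as low/medium/high.
import numpy as np
from sklearn.metrics import cohen_kappa_score

requested = np.array([0, 2, 1, 1, 2, 0])   # levels specified in the instruction
measured  = np.array([0, 1, 1, 2, 2, 0])   # levels measured from the generated audio

mae = np.mean(np.abs(requested - measured))                        # lower is better
qwk = cohen_kappa_score(requested, measured, weights="quadratic")  # higher is better

print(f"MAE: {mae:.2f}  QWK: {qwk:.2f}")
```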

Analysis: The Importance of Diverse Instructions

Ablation studies reveal several important findings about instruction tuning for speech LLMs:

| Setup | DS-1 | EvalSIFT (Closed) | EvalSIFT (Open) |
| --- | --- | --- | --- |
| Default | 57.3 | 45.4 | 8.0 |
| Init. from 200K ckpt | 52.5 | 42.3 | 7.7 |
| No pre-training | 56.7 | 43.7 | 7.9 |
| LoRA rank = 16 | 58.4 | 45.1 | 8.0 |
| LoRA rank = 32 | 55.7 | 43.9 | 7.8 |
| No open-ended data | 54.3 | 42.2 | 6.1 |
| No word-align data | 57.1 | 41.5 | 8.0 |
| No comparison data | 56.1 | 34.3 | 7.3 |
Ablation study results showing the impact of different training configurations and data subsets on model performance.

Key findings include:

  • Removing comparison data sharply reduces EvalSIFT closed-ended accuracy (34.3 vs. 45.4 for the default setup)
  • Omitting open-ended data lowers the open-ended score (6.1 vs. 8.0)
  • Pre-training provides only modest benefits for instruction-following capabilities
  • The choice of LoRA rank matters, with rank 16 giving the best Dynamic-SUPERB score (see the configuration sketch after this list)
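As a reference point for the LoRA ablation, here is a hypothetical configuration sketch using the Hugging Face peft library. GPT-2 is only a stand-in for the speech-text LLM, and the alpha, dropout, and target modules are assumed values rather than the paper's settings; only the rank comes from the ablation table.

```python
# Hypothetical LoRA setup illustrating the ablated rank; GPT-2 is a stand-in base model
# and all hyperparameters besides r are assumptions, not the paper's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the speech-text LLM
lora_config = LoraConfig(
    r=16,                         # rank 16 scored best on Dynamic-SUPERB in the ablation
    lora_alpha=32,                # scaling factor (assumed)
    lora_dropout=0.05,            # (assumed)
    target_modules=["c_attn"],    # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```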

Conclusion: A New Era for Speech-Text LLMs

SIFT-50M represents a significant advancement in speech-text LLM training, providing a large-scale, diverse dataset of instruction examples spanning multiple languages. The resulting SIFT-LLM model demonstrates superior instruction-following capabilities while maintaining strong performance on foundational speech tasks.

The introduction of EvalSIFT as a benchmark also provides a valuable tool for evaluating future speech-text LLMs, helping drive progress in this rapidly evolving field. Similar to the impact of large instruction datasets in code generation, SIFT-50M could help spur a new wave of more capable and flexible speech-text AI systems.

The research highlights the critical importance of instruction tuning for speech understanding models and provides a valuable resource for future work in this domain. As speech interfaces become increasingly important for human-AI interaction, the ability to follow diverse instructions will be crucial for creating useful and adaptable systems.

Click here to read the full summary of this paper