Massive Sound Embedding Benchmark: Improving AI Sound Intelligence

Sound is fundamental to how we understand the world. For any intelligent system, whether a voice assistant or a security monitor, to act on its environment, it must handle many kinds of audio tasks: transcribing speech, classifying sounds, retrieving audio, reasoning over audio, segmenting it, clustering similar sounds, reranking candidate results, and reconstructing audio.

These diverse tasks all depend on transforming raw audio into an intermediate representation called an embedding. Yet research on improving machine sound understanding has been fragmented, and big questions remain open. How can we compare representations of human speech against representations of natural sounds? How much headroom is left in machine sound understanding? Could a single sound embedding serve all of these tasks?

To help answer these questions and accelerate progress in machine sound understanding, we built the Massive Sound Embedding Benchmark (MSEB), which we presented at NeurIPS 2025.

MSEB helps answer these questions by:

  • Standardizing evaluation across eight real-world tasks that intelligent systems need to perform.
  • Providing an extensible framework that researchers can use and build on, capable of evaluating any model, from conventional pipelines to emerging multimodal architectures.
  • Establishing clear performance targets that expose research opportunities beyond today's state of the art.

Our initial experiments show that current approaches to sound understanding are far from saturated: there is substantial room for improvement across all eight tasks.

MSEB: A unified framework for sound understanding

MSEB is built on three pillars, which together give the community the tools to build better sound-understanding AI.

1. Diverse data for real-world conditions

A benchmark is only as good as its data. MSEB features carefully curated data that reflects how people around the world use sound. At its core is the Simple Voice Questions (SVQ) dataset: over 177,000 short spoken questions in 26 languages, recorded in four acoustic environments (quiet, background speech, traffic noise, and media noise) and annotated with speaker metadata and salient terms. We have released this data publicly on Hugging Face.
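
As a hedged sketch, the SVQ data could be loaded with the Hugging Face `datasets` library as shown below. The repository id, split name, and field names are assumptions for illustration; check the dataset card for the published values.

```python
# Minimal sketch of loading SVQ via the Hugging Face `datasets` library.
# ASSUMPTIONS: the repo id "google/svq", the split name, and the field
# names in the comment are illustrative; verify against the dataset card.
from datasets import load_dataset

svq = load_dataset("google/svq", split="train")  # hypothetical repo id / split
example = svq[0]
print(example.keys())  # e.g., audio, locale, acoustic environment, speaker info
```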

MSEB also draws on established public datasets that cover other sound types:

  • Speech-MASSIVE: multilingual spoken-language understanding and intent detection.
  • FSD50K: a large dataset for recognizing diverse environmental sounds, with 200 classes drawn from the AudioSet ontology.
  • BirdSet: a large dataset for bioacoustics research, with recordings of complex natural soundscapes.

We are continuing to add useful, large-scale datasets to MSEB. Share your ideas or propose a collaboration on our GitHub page.

2. Eight essential sound tasks

MSEB is designed around the premise that future sound-capable AI will be multimodal. Every task takes audio as the primary input, but may combine it with other signals, such as text or background knowledge, to mirror real-world use.

MSEB covers eight core tasks that an intelligent system needs (a minimal sketch of the contract they share follows the list):

  • Retrieval (voice search): retrieve relevant information from a spoken query, as in voice search.
  • Reasoning (intelligent assistants): test whether the model can locate the correct answer within a document given a spoken question.
  • Classification (monitoring/security): assign sounds to categories, such as speaker, intent, acoustic environment, or specific sound events.
  • Transcription: convert audio into text, as in dictation apps.
  • Segmentation (indexing): identify the most salient terms in an audio clip and mark where they start and end.
  • Clustering (organization): group audio samples that share traits, such as speaker or acoustic environment, without labels.
  • Reranking (improving hypotheses): reorder a list of ambiguous text hypotheses to better match the original spoken query.
  • Reconstruction (generative AI): measure how faithfully the original audio can be regenerated from the embedding, gauging how much information the embedding retains.
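
What these tasks share is a single contract: map raw audio to a fixed-dimensional embedding, then run task-specific logic on the vectors. The sketch below is our illustration of that contract, not MSEB's actual API; the class and function names are hypothetical.

```python
# Illustrative only: all eight tasks consume an audio -> embedding map and
# differ in what they do with the vectors. NOT the actual MSEB interface.
import numpy as np

class SoundEmbedder:
    def embed(self, waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        """Return a fixed-dimensional embedding for one audio clip."""
        raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_emb: np.ndarray, doc_embs: list) -> list:
    """Rank candidate documents by similarity to a spoken query (retrieval)."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: -scores[i])
```
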
[Infographic: Massive Sound Embedding Benchmark (MSEB), with icons for the eight audio tasks, including retrieval, classification, and transcription.]

MSEB's tasks span information access (retrieval, reranking, reasoning), foundational perception (classification, transcription, segmentation), and higher-level organization and generation (clustering, reconstruction).

We plan to add more real-world tasks in new domains, such as music and settings where audio is combined with vision.

3. A rigorous evaluation protocol that measures headroom

MSEB's central goal is to set strong baselines and quantify how much current models can still improve. We evaluate along two main axes:

  • Semantic fidelity (e.g., retrieval, reasoning): do models capture what the spoken words mean, even when the audio is noisy?
  • Acoustic fidelity (e.g., classification, clustering): do models correctly capture who is speaking and what the sound is, independent of meaning?

MSEB can evaluate many kinds of models, from legacy cascade systems to modern sound-understanding models, all within a single evaluation harness.

How we compare

We used the MSEB framework to evaluate current sound embedding models, asking how close they come to robust, general-purpose sound intelligence.

For semantic tasks, we compared models against results computed on the ground-truth text. For non-semantic tasks, we compared them against the best available task-specific baseline. This establishes a strong reference point that new, general-purpose models must beat.

Limitations of current sound representations

The results show that current models have clear deficiencies across all of the key sound-understanding tasks, which underscores the need for an evaluation framework like MSEB.

[Bar chart: performance for text versus sound inputs across SuperTasks such as retrieval, reasoning, and classification.]

MSEB evaluates models on the key tasks, revealing substantial gaps and room for improvement. The evaluations use metrics including MRR, F1, mAP, accuracy, WER, NDCG, V-measure, and FAD.
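
To make one of these metrics concrete, mean reciprocal rank (MRR) averages, over all queries, the reciprocal rank of the first correct result. A small self-contained computation (with made-up data) is shown below.

```python
# Mean Reciprocal Rank (MRR): average of 1/rank of the first correct
# result per query; a query contributes 0 if no correct result appears.
def mean_reciprocal_rank(ranked_results, gold):
    total = 0.0
    for results, g in zip(ranked_results, gold):
        rank = next((i + 1 for i, r in enumerate(results) if r == g), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_results)

# Toy data: the correct doc ranks 2nd for query 1 and 1st for query 2.
print(mean_reciprocal_rank([["d3", "d1"], ["d7", "d2"]], ["d1", "d7"]))  # 0.75
```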

The evaluation surfaces five core limitations of current sound-capable AI:

1. Loss of meaning

For tasks that depend on understanding the words (retrieval, reasoning, reranking), the speech-to-text stage is a consistent bottleneck: transcription errors strip away the semantics the downstream task needs.

2. Misaligned objectives

The standard approach in speech technology is a cascade: first convert speech to text, then run every downstream task on the transcript. This optimizes for the wrong objective. The speech-to-text stage is trained only to minimize word errors, yet real-world use often requires capturing intent, being factually correct, or reasoning well even when individual words are imperfect.
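
The cascade pattern looks roughly like the sketch below; `asr` and `text_encoder` are hypothetical stand-ins for real models, and the point is that everything downstream sees only the transcript.

```python
# Hedged sketch of the cascade pattern critiqued above. `asr` and
# `text_encoder` are hypothetical stand-ins, not real library objects.
def cascade_embed(waveform, sample_rate, asr, text_encoder):
    transcript = asr.transcribe(waveform, sample_rate)  # trained to minimize WER
    # Every downstream task sees only this transcript, so any dropped or
    # misheard word is unrecoverable, whatever the downstream goal is.
    return text_encoder.encode(transcript)
```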

3. Uneven multilingual coverage

Models are not reliable across languages: performance varies widely, and systems work well only for large, high-resource languages. On lower-resource languages, speech-to-text quality degrades sharply, causing major failures in retrieval, reranking, and segmentation.

4. Fragility under noise

Reconstruction quality degrades with noise. When background noise is added, models struggle to recover the original speech and its acoustic environment. This shows they handle complex real-world scenes, such as a busy office or a noisy street, poorly.
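
Robustness probes of this kind typically mix noise into clean speech at a controlled signal-to-noise ratio. A minimal sketch of such a mix (our own illustration, not MSEB's code):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)  # tile/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```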

5. Unnecessary complexity

For simple tasks that do not require semantic understanding (such as identifying the speaker), highly complex models are no better than features computed directly from the waveform. This wastes effort on heavyweight models where simple signal-level features work just as well.
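
A simple signal-level baseline here means something like a time-averaged log-mel spectrogram; whether MSEB's own baseline is exactly this is our assumption. A sketch using the standard librosa API ("clip.wav" is a placeholder path):

```python
# Simple signal-level baseline: a time-averaged log-mel spectrogram.
# "clip.wav" is a placeholder; the librosa calls themselves are standard.
import librosa

y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
baseline_embedding = librosa.power_to_db(mel).mean(axis=1)  # one 64-dim vector
```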

Conclusion

The results reveal a substantial gap between how well current general-purpose sound representations perform and what the eight tasks demand. This widespread shortfall points to the need for more research into robust, unified sound representations to close the gap in machine sound understanding.

We want MSEB to be a hub for everyone working with sound. We invite you to evaluate your sound representations on MSEB, to contribute new tasks and datasets to the benchmark, and to join us in pushing the limits of what AI can do with sound.
