Investigating Modality Contribution in Audio LLMs for Music

1: New York University, Music and Audio Research Lab
2: New York University, Integrated Design and Media


Abstract

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model’s prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.

Summary of the work

In this work, we investigate how Audio LLMs use audio information to answer multiple-choice questions. An Audio LLM is a Large Language Model that is capable of processing audio data. These models usually receive as input an audio file (.wav, .mp3, etc.) and a text prompt. In theory, the prompt can be any text. We use the MuChoMusic benchmark as our dataset; in this benchmark, the prompts are multiple-choice questions about the input audio.

Listen to the toy example audio and try to answer the question:



What is the sound that happens in the audio?
Options: (A) Crickets (B) Jackhammer (C) Birds chirping (D) Car horn
The correct answer is:

If you chose option (B) Jackhammer, you're correct and already doing better than a lot of models. How was that? Difficult? Were you able to answer without listening to the audio?

I know this last question sounds counter-intuitive, but current benchmarks like MuChoMusic, MMAU, and MMAU-Pro are telling us that Audio LLM performance will not change if you replace the actual audio with silence or noise. And this is what we want to investigate with this project: are models really using the audio input? If so, how much?

So our objective here is to measure how much models use the audio and the text information. To achieve this, we first calculate Shapley values for our audio and text features. Think of it like this: if you and your friends win a game, how much did each of you contribute to the win? Shapley values provide a tool to assign credit to each player. Translating this to our problem, Shapley values tell us how much the audio and text features contributed to the answer a model produces. Once we have these feature importances, we can use them to calculate each modality's contribution. This method is called MM-SHAP.
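To make the idea concrete, here is a minimal Python sketch of how per-feature Shapley values can be turned into a modality contribution score: sum the absolute Shapley values of each modality and normalize by the total. The function name and the toy numbers are ours for illustration only; see the paper for how the Shapley values themselves are estimated.

import numpy as np

def mm_shap(audio_shap, text_shap):
    """Per-modality contribution (MM-SHAP style) from Shapley values.

    Both arguments are 1-D arrays with one Shapley value per input feature
    (audio segment or text token), either for a single output token or
    summed over output tokens for the aggregate view.
    """
    audio_mass = np.abs(audio_shap).sum()  # absolute values, as in the plots
    text_mass = np.abs(text_shap).sum()
    total = audio_mass + text_mass
    a_shap = audio_mass / total  # audio contribution in [0, 1]
    t_shap = text_mass / total   # text contribution; a_shap + t_shap == 1
    return a_shap, t_shap

# Toy example with made-up values: 4 audio segments and 5 question tokens.
audio_phi = np.array([0.10, -0.05, 0.00, 0.30])
text_phi = np.array([0.20, 0.00, -0.10, 0.05, 0.15])
a_shap, t_shap = mm_shap(audio_phi, text_phi)
print(f"A-SHAP = {a_shap:.2f}, T-SHAP = {t_shap:.2f}")  # A-SHAP = 0.47, T-SHAP = 0.53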

We test this approach with two models evaluated in the original MuChoMusic paper: Qwen-Audio and MU-LLaMA. Our paper shows that both models use audio, but the best-performing model uses less audio than we expected.

We designed this interactive demo so you can explore more examples on your own. In the paper we were able to discuss only one example, but here we provide both good and bad examples. Have fun!


Interactive examples

We divide our examples into two categories: those in which the answer is a single-sounding event (which we inspected and annotated with ground truth) and randomly selected examples. For the random examples, we do not have ground-truth annotations, as the answer usually requires long-term understanding of the audio. As discussed in our paper, how to annotate those examples remains an open question.

In the subsection below, we provide a brief explanation of the plot components. Feel free to skip it and play with the examples.

How to read the plots

Each plot will have a similar structure to this one:



Let's go over the different components in this plot:
  • Experiment name: a combination of the model + input type. So, in our example "Qwen-Audio MC-PI", we have the model Qwen-Audio and the input type Multiple-Choice with Previous Instructions.
  • Current view: whether we are looking at the aggregate view, i.e., the sum across output tokens, or the values for a single output token.
  • Statistics: a brief view of the actual feature values. This view is optional.
  • Question: the prompt that the model receives, displayed as the list of input tokens the model uses. For readability, we only highlight question tokens whose Shapley value is greater than 80% of the highest value.
  • Model Answer: the model output. Here is where the interactivity shines: you can click on any token to see how much the audio and the input text contributed to that generated token. To go back to the original aggregated view, just click the "Reset view" button.
  • Waveform: the input waveform. You can click on a timestamp to update the audio player. Here we have only one audio player and one plot, but on the comparison page there is one audio player and four plots, and the playhead is shared among them.
  • Absolute value: the absolute values of this sample's Shapley values. In practice, this is what is used to calculate the modality contribution score; it weighs positive and negative values equally (see the sketch at the end of this subsection).
  • Positive Only: only the positive Shapley values, i.e., the features that contributed positively to the given output. If a region is "activated" in this subplot, that region is important to the output.
  • Negative Only: as before, but only the negative Shapley values, so it shows which features contributed negatively to the output.

Note that the values are zero whenever a feature has no contribution. In the audio, all silent regions have zero contribution, regardless of the output token or the aggregated view. (:
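If it helps, here is a minimal Python sketch (with made-up numbers, not our actual pipeline) of how the views above relate to the raw Shapley values: the aggregate view sums over output tokens, and the three subplots take the absolute, positive-only, and negative-only parts of the values for the current view.

import numpy as np

# Hypothetical Shapley values with shape (output_tokens, input_features):
# one row per generated token, one column per input feature
# (audio frame or question token). Values are made up for illustration.
phi = np.array([
    [ 0.20, -0.10, 0.00, 0.35],
    [-0.05,  0.15, 0.00, 0.10],
])

aggregate = phi.sum(axis=0)          # "aggregate view": sum across output tokens
single = phi[0]                      # view for a single output token

absolute = np.abs(single)            # "Absolute value" subplot
positive = np.clip(single, 0, None)  # "Positive Only" subplot
negative = np.clip(single, None, 0)  # "Negative Only" subplot

# Question tokens are highlighted when their value exceeds 80% of the
# highest value in the current view (absolute values assumed here; the
# exact rule in the demo may differ).
highlighted = absolute > 0.8 * absolute.max()
print(absolute, positive, negative, highlighted)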

Single-sounding events

These are events for which the answer is well localized in the audio. We refer again to our paper if you want to see the methodology in detail. To see the comparison across experiments, click on the question.


Random Examples

These examples were chosen randomly to show that the method can be harder to interpret depending on the scenario. Here we have questions whose answer is not a single sound; in MuChoMusic, these questions are the majority. To see the comparison across experiments, click on the question.


Citation

@misc{morais2025investigatingmodalitycontributionaudio,
  title={Investigating Modality Contribution in Audio LLMs for Music},
  author={Giovana Morais and Magdalena Fuentes},
  year={2025},
  eprint={2509.20641},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.20641},
}

Disclaimer

The visualizations were developed in D3.js with the support of Gemini. As the lead developer of this demo page, I (Giovana) am responsible for the implementation and testing of this code.
If you find anything wrong, let me know by sending an email to giovana.morais@nyu.edu or by creating an issue in the GitHub repo.

Muito obrigada (thank you very much) and have a nice day!