Constructing post-hoc interpretations for audio classification models

Yulia Pak, Irina Teryokhina

Abstract


This paper addresses the task of constructing interpretations for machine learning models that classify audio data. The proposed approach produces interpretations in both visual and listenable form by masking spectrograms according to feature attribution maps and subsequently reconstructing the signal. The feature attribution maps are generated with Saliency, Grad-CAM, LIME, and SHAP; these methods are general-purpose and can be applied to a variety of architectures. The effectiveness of the approach is evaluated in terms of the fidelity of the interpretations to the model's behavior and their perceptual simplicity. Experiments were conducted with different types of masks as well as with added background noise. The results show that the main challenge of the proposed approach is striking a balance between faithfully reflecting the model's behavior and producing simple, interpretable explanations. While adding noise does not change the global trends, the type of noise affects model behavior and, consequently, the characteristics of the corresponding interpretations.
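As a concrete illustration of the pipeline the abstract describes, the following is a minimal sketch of attribution-based spectrogram masking with signal reconstruction. It assumes a complex STFT front end (via librosa), reuses the original phase for inversion, and binarizes the attribution map at a top-quantile threshold; the function name, the keep_ratio parameter, and the thresholding rule are illustrative choices, not the authors' implementation (the actual code is in the repository cited in the references).

import numpy as np
import librosa

def listenable_explanation(y, attribution, keep_ratio=0.2, n_fft=1024, hop=256):
    # Complex STFT of the input waveform.
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)
    # `attribution` is assumed to be already resampled to the spectrogram grid
    # (same shape as `mag`). Keep only the top `keep_ratio` fraction of
    # time-frequency bins, zeroing out the rest.
    thr = np.quantile(attribution, 1.0 - keep_ratio)
    mask = (attribution >= thr).astype(mag.dtype)
    # Mask the magnitude and invert with the original phase; Griffin-Lim
    # could be used instead when the phase is not retained.
    y_expl = librosa.istft(mag * mask * np.exp(1j * phase),
                           hop_length=hop, length=len(y))
    return y_expl

Soft (continuous-valued) masks and the addition of background noise, as explored in the experiments, fit the same skeleton: replace the binary mask with the raw attribution weights, or mix a noise signal into the reconstructed waveform.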


Full Text: PDF (Russian)

References


Z. Bai, X.-L. Zhang, Speaker recognition based on deep learning: An overview. Neural Networks, 2021, vol. 140, pp. 65–99. DOI: 10.1016/j.neunet.2021.03.004.

A. Srivastava, S. Jain, R. Miranda, S. Patil, S. Pandya, K. Kotecha, Deep learning based respiratory sound analysis for detection of chronic obstructive pulmonary disease. PeerJ Computer Science, 2021, vol. 7, p. e369. DOI: 10.7717/peerj-cs.369.

E. Dufourq, C. Batist, R. Foquet, I. Durbach, Passive acoustic monitoring of animal populations with transfer learning. Ecological Informatics, 2022, vol. 70, p. 101688. DOI: 10.1016/j.ecoinf.2022.101688.

L. K. D. Katsis, A. P. Hill, E. Piña-Covarrubias, P. Prince, A. Rogers, C. P. Doncaster, J. L. Snaddon, Automated detection of gunshots in tropical forests using convolutional neural networks. Ecological Indicators, 2022, vol. 141, p. 109128. DOI: 10.1016/j.ecolind.2022.109128.

F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, 2017, url: https://arxiv.org/abs/1702.08608. DOI: 10.48550/arXiv.1702.08608.

C. Molnar, Interpretable machine learning: A guide for making black-box models explainable. Leanpub, 2020. ISBN: 978-0244768522.

A. V. Oppenheim, R. W. Schafer, Discrete-time signal processing. Prentice Hall, 1989. ISBN: 978-0132162920.

A. Akman, B. W. Schuller, Audio explainable artificial intelligence: A review. Intelligent Computing, 2024, vol. 3, p. 0074. DOI: 10.34133/icomputing.0074.

A. N. Zereen, A. Das, J. Uddin, Machine fault diagnosis using audio sensors data and explainable AI techniques: LIME and SHAP. Computers, Materials and Continua, 2024, vol. 80, no. 3, pp. 3463–3484. DOI: 10.32604/cmc.2024.054886.

M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144. DOI: 10.1145/2939672.2939778.

S. M. Lundberg, S. I. Lee, A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4768–4777.

J. Kim, J. Oh, T. Y. Heo, Acoustic scene classification and visualization of beehive sounds using machine learning algorithms and Grad-CAM. Mathematical Problems in Engineering, 2021, vol. 2021, no. 1, p. 5594498. DOI: 10.1155/2021/5594498.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626. DOI: 10.1109/ICCV.2017.74.

J. Vielhaben, S. Lapuschkin, G. Montavon, W. Samek, Explainable AI for time series via virtual inspection layers. Pattern Recognition, 2024, vol. 150, p. 110309. DOI: 10.1016/j.patcog.2024.110309.

A. Wullenweber, A. Akman, B. W. Schuller, CoughLIME: Sonified explanations for the predictions of COVID-19 cough classifiers. 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2022, pp. 1342–1345. DOI: 10.1109/EMBC48229.2022.9871291.

V. Haunschmid, E. Manilow, G. Widmer, audioLIME: Listenable explanations using source separation, 2020. url: https://arxiv.org/abs/2008.00582. DOI: 10.48550/arXiv.2008.00582.

S. Mishra, B. L. Sturm, S. Dixon, Local interpretable model-agnostic explanations for music content analysis. Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 537–543. DOI: 10.5281/zenodo.1417387.

J. Parekh, S. Parekh, P. Mozharovskyi, F. d'Alché-Buc, G. Richard, Listen to interpret: Post-hoc interpretability for audio networks with NMF. Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 35270–35283.

F. Paissan, M. Ravanelli, C. Subakan, Listenable maps for audio classifiers. International Conference on Machine Learning (ICML), 2024.

K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013. url: https://arxiv.org/abs/1312.6034. DOI: 10.48550/arXiv.1312.6034.

Y. Pak, Constructing post-hoc interpretations for audio classification models, 2025. url: https://github.com/exile8/audio-xai.

A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences. Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, vol. 70, pp. 3145–3153.

A. Shrikumar, P. Greenside, A. Shcherbina, A. Kundaje, Not just a black box: Learning important features through propagating activation differences, 2016. url: https://arxiv.org/abs/1605.01713. DOI: 10.48550/arXiv.1605.01713.

K. J. Piczak, ESC: Dataset for environmental sound classification. Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1015–1018. DOI: 10.1145/2733373.2806390.

Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, vol. 28, pp. 2880–2894. DOI: 10.1109/TASLP.2020.3030497.

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore et al., Audio Set: An ontology and human-labeled dataset for audio events. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780. DOI: 10.1109/ICASSP.2017.7952261.

