ISMIR Bengaluru 2022 - Interesting papers

I had the opportunity to attend ISMIR 2022 this year, which was held in Bengaluru (the first ISMIR to be held in India!). It was my first time attending a research conference, and I had an absolutely amazing time there. I learnt so much and met some incredible people, and I'm really grateful for the opportunity.

I wanted to write about some of the papers presented at this ISMIR that really excited me (and there were quite a few).

Musika! Fast Infinite Waveform Music Generation

Musika is a GAN-based music generation system made by Marco Pasini and Jan Schlüter, with blazing-fast inference. It can generate music on a GPU at ~1000x real time (meaning it can produce 1 minute 40 seconds of audio in just 100 ms). In comparison, OpenAI's Jukebox required more than 8 hours to generate a single minute of audio.

Musika achieves this speed by ditching auto-regression (a paradigm where a system's later outputs depend on its previous outputs) and instead using a latent coordinate system to preserve the sense of temporality in the music it generates. As a result, it can also generate audio of arbitrary length.
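To make that idea concrete, here is a minimal conceptual sketch of coordinate-conditioned generation. This is my own illustration, not Musika's actual architecture: `generator`, the frame rate, and the latent shapes are all assumptions. The point is that each latent frame depends only on a shared style vector and a continuous time coordinate, so any span of frames can be generated in parallel, at any length.

```python
import torch

def generate_latents(generator, style, t_start, n_frames, frame_rate=10.0):
    # `style` is a [1, d] latent shared by every frame; `coords` are
    # continuous time positions (in seconds) acting as the coordinates.
    coords = t_start + torch.arange(n_frames) / frame_rate
    coords = coords.unsqueeze(-1)                          # [n_frames, 1]
    style = style.expand(n_frames, -1)                     # [n_frames, d]
    # Every frame depends only on (style, coordinate), so a whole window
    # is produced in one parallel call: no autoregression involved.
    return generator(torch.cat([style, coords], dim=-1))   # latent frames
```

Since no frame depends on a previously generated frame, inference cost scales with available parallelism rather than with sequence length.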

The samples are just incredible; check them out here! You can also try out the online demo here.

Equivariant Self-Supervision for Musical Tempo Estimation

This is a paper by Elio Quinton. What I like about this paper is the simplicity and elegance of the technique used for self-supervised learning.

A sample clip of music is taken and time-stretched by two different ratios. A rhythm representation is then computed for each stretched clip, and from each representation a tempo is predicted. If the tempos predicted for the two versions are correct, their ratio should match the ratio of the stretch factors. This equivariance principle defines the loss used to train the whole model, with no tempo labels needed.
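Here's a minimal sketch of that objective, assuming `predict_tempo` is the model being trained (audio in, scalar BPM out). It illustrates the principle rather than reproducing the paper's exact loss or rhythm representation.

```python
import numpy as np
import librosa

def equivariance_loss(y, predict_tempo):
    """Self-supervised tempo loss: no ground-truth BPM is needed."""
    r1, r2 = np.random.uniform(0.7, 1.4, size=2)
    # librosa's time_stretch plays audio back `rate` times faster,
    # so the clip's true tempo scales by the same factor.
    y1 = librosa.effects.time_stretch(y, rate=r1)
    y2 = librosa.effects.time_stretch(y, rate=r2)
    t1, t2 = predict_tempo(y1), predict_tempo(y2)
    # If both estimates are consistent, t1 / t2 == r1 / r2;
    # penalize any deviation from that known ratio.
    return (t1 / t2 - r1 / r2) ** 2
```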

Raga Classification From Vocal Performances Using Multimodal Analysis

Multimodal deep learning is on the rise, and I'm all for it! In this paper by Martin Clayton, Preeti Rao, et al., the authors propose using pose information extracted from video as an additional modality, on top of audio, to identify the raga being performed.

The gestures that Hindustani classical vocalists (the focus of the paper) use while performing are often idiosyncratic, yet adding this modality still improves the accuracy of the raga classification model.

Contrastive Audio-Language Learning for Music

This paper by Ilaria Manco et al. sits at the intersection of music, language, and computation, precisely where my research interest lies.

The authors propose MusCALL, a framework for Music Contrastive Audio-Language Learning: a CLIP-like architecture that learns a joint embedding space for text and audio, enabling text-to-audio and audio-to-text retrieval, and generalizing to the zero-shot transfer scenario.

There's a clever weighting scheme in the loss, a modification of InfoNCE, which leverages similarity within the text modality to weigh the contribution of negative pairs during contrastive learning.
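Here's a minimal sketch of what such a similarity-weighted InfoNCE could look like. The specific down-weighting form, the temperature, and `caption_sim` (a precomputed text-text similarity matrix, e.g. from a sentence encoder) are my own assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_infonce(audio_emb, text_emb, caption_sim, tau=0.07):
    """Contrastive loss where negatives with near-duplicate captions
    are pushed apart less hard than genuinely unrelated pairs."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = torch.exp(a @ t.T / tau)          # exponentiated similarities
    # Down-weight off-diagonal (negative) pairs whose captions are
    # semantically close to the anchor's caption.
    w = (1.0 - caption_sim).clamp(min=0.0)
    w.fill_diagonal_(1.0)                   # positives keep full weight
    weighted = sim * w
    pos = sim.diagonal()
    loss_a2t = -torch.log(pos / weighted.sum(dim=1)).mean()
    loss_t2a = -torch.log(pos / weighted.sum(dim=0)).mean()
    return (loss_a2t + loss_t2a) / 2
```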

Multi-instrument Music Synthesis with Spectrogram Diffusion

In this paper by Curtis Hawthorne et al. at Google Brain, the authors train a model to generate audio from MIDI, using a Denoising Diffusion Probabilistic Model (DDPM) on spectrograms.
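For intuition, here's a generic DDPM training step on spectrograms, conditioned on a MIDI encoding. This is the standard noise-prediction objective, not the paper's specific architecture; `denoiser`, `midi_cond`, and the schedule handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(spec, midi_cond, denoiser, alphas_cumprod):
    """One training step: teach `denoiser` to predict the injected noise.

    spec: clean spectrograms [batch, freq, time]; midi_cond: an encoding
    of the MIDI to condition on; alphas_cumprod: the noise schedule [T].
    """
    b = spec.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,))     # random timesteps
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(spec)
    # Forward (noising) process: blend the clean spectrogram with noise.
    noisy = a_bar.sqrt() * spec + (1.0 - a_bar).sqrt() * noise
    # The model sees the noisy spectrogram, timestep, and MIDI, and must
    # recover the noise; at sampling time this is run in reverse.
    return F.mse_loss(denoiser(noisy, t, midi_cond), noise)
```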

Why would you want to do that when you could just use a soundfont? Well, firstly, the model learns to pick up on things that are not explicitly specified in the MIDI, like fret noise on guitar or a performer's breaths on wind instruments.

In addition, MIDI-conditioned generation can act as a simpler, controlled spectrogram diffusion task, allowing the control to later be shifted from MIDI to something more general, like text (did someone say Riffusion?).

Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations

This paper is special because it's from IIIT, my institute, and it even won the Brave New Idea Award! In it, Jaidev Sriram, Vinoo Alluri, and Makarand Tapaswi introduce a novel method for constructing a soundtrack for a book: take the soundtrack from its movie adaptation, apply heuristics to segment it and align it with the text, and stitch the segments back together into a dense soundtrack for the book.

At ISMIR, I heard Siddharth Saxena specifically mention that he would put on the soundtrack of the Witcher video games while reading the books, so there are definitely people already looking for something like this!

You can check out the online demo here!