Automatic multitrack mixing with a differentiable mixing console of neural audio effects


Christian Steinmetz, Jordi Pons, Santiago Pascual, Joan Serrà

Abstract


Applications of deep learning to automatic multitrack mixing are largely unexplored. This is partly due to the limited available data, coupled with the fact that such data is relatively unstructured and variable. To address these challenges, we propose a domain-inspired model with a strong inductive bias for the mixing task. We achieve this with the application of pre-trained sub-networks and weight sharing, as well as with a sum/difference stereo loss function. The proposed model can be trained with a limited number of examples, is permutation invariant with respect to the input ordering, and places no limit on the number of input sources. Furthermore, it produces human-readable mixing parameters, allowing users to manually adjust or refine the generated mix. Results from a perceptual evaluation involving audio engineers indicate that our approach generates mixes that outperform baseline approaches. To the best of our knowledge, this work demonstrates the first approach to learning multitrack mixing conventions from real-world data at the waveform level, without knowledge of the underlying mixing parameters.
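As a concrete illustration of the sum/difference stereo loss mentioned in the abstract, the sketch below applies a base audio loss to the sum (L+R) and difference (L-R) of the stereo channels. The use of an L1 base loss and the equal weighting of the two terms are assumptions made for illustration; the paper combines this formulation with its own audio loss.

    import torch
    import torch.nn.functional as F

    def sum_diff_loss(pred, target):
        """Sum/difference (mid/side) stereo loss sketch.

        pred, target: (batch, 2, samples) tensors with channels (left, right).
        The L1 base loss and equal term weighting are illustrative choices.
        """
        pred_sum, pred_diff = pred[:, 0] + pred[:, 1], pred[:, 0] - pred[:, 1]
        targ_sum, targ_diff = target[:, 0] + target[:, 1], target[:, 0] - target[:, 1]
        return F.l1_loss(pred_sum, targ_sum) + F.l1_loss(pred_diff, targ_diff)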

Transformation Network


The transformation network is tasked with emulating a series connection of digital audio effects. In this case, we train the transformation network to model an EQ, dynamic range compressor, and reverb. To train this network, we randomly configure the parameters of the signal chain and process clean examples from the dataset to generate input and target pairs. In the samples below, we show the input signal and the target signal produced by the original signal chain under a number of random configurations, where EQ, compression, and reverb are all applied. The remaining examples are the outputs of three trained models: TCN-10, TCN-20, and TCN-30. All examples here come from the test set of the SignalTrain LA2A Dataset.
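To make this data-generation step concrete, the sketch below samples random settings for a toy EQ -> compressor -> reverb chain and processes a clean excerpt into an (input, target) pair. The effect implementations are deliberately simplified stand-ins (a one-pole lowpass, a static compression curve, a noise-burst reverb), not the effects used in the paper, and the parameter ranges are assumed.

    import numpy as np
    from scipy.signal import fftconvolve, lfilter

    def apply_eq(x, cutoff_hz, sr=44100):
        # One-pole lowpass as a simplified stand-in for a parametric EQ.
        a = np.exp(-2.0 * np.pi * cutoff_hz / sr)
        return lfilter([1.0 - a], [1.0, -a], x)

    def apply_compressor(x, threshold_db, ratio):
        # Static (instantaneous) compression curve, no attack/release smoothing.
        level_db = 20.0 * np.log10(np.abs(x) + 1e-8)
        gain_db = -np.maximum(level_db - threshold_db, 0.0) * (1.0 - 1.0 / ratio)
        return x * 10.0 ** (gain_db / 20.0)

    def apply_reverb(x, wet, decay_s=0.5, sr=44100):
        # Exponentially decaying noise burst as a toy impulse response.
        t = np.arange(int(decay_s * sr)) / sr
        ir = np.random.randn(t.size) * np.exp(-6.0 * t / decay_s)
        return (1.0 - wet) * x + wet * fftconvolve(x, ir)[: x.size]

    def make_pair(clean, rng, sr=44100):
        """Create one (input, target) pair from a clean mono excerpt."""
        params = {
            "cutoff_hz": rng.uniform(1000.0, 12000.0),   # assumed ranges
            "threshold_db": rng.uniform(-40.0, -10.0),
            "ratio": rng.uniform(2.0, 8.0),
            "wet": rng.uniform(0.1, 0.5),
        }
        y = apply_eq(clean, params["cutoff_hz"], sr)
        y = apply_compressor(y, params["threshold_db"], params["ratio"])
        y = apply_reverb(y, params["wet"], sr=sr)
        return clean, y, params  # the network can then be conditioned on params

A pair is drawn with, for example, make_pair(clean_audio, np.random.default_rng(0)).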

[Audio players: passages A-F, one example per condition (Input, Target, TCN-10, TCN-20, TCN-30)]

All listening samples have been loudness normalized to -23 LUFS and encoded as VBR MP3.
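The normalization described above can be reproduced with the pyloudnorm package; the sketch below measures integrated loudness and normalizes a file to -23 LUFS. The file name is purely illustrative.

    import soundfile as sf
    import pyloudnorm as pyln

    audio, rate = sf.read("passage_a_input.wav")       # illustrative file name
    meter = pyln.Meter(rate)                           # ITU-R BS.1770 meter
    loudness = meter.integrated_loudness(audio)        # measured loudness in LUFS
    normalized = pyln.normalize.loudness(audio, loudness, -23.0)
    sf.write("passage_a_input_norm.wav", normalized, rate)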

Differentiable mixing console


After training the transformation network, we use the pre-trained weights from this model, along with the encoder, to build the differentiable mixing console (DMC). The weights of these models, along with those of the post-processor, are shared across all input channels. Here we show examples generated from songs in the test sets of the ENST-Drums and MedleyDB datasets. Our DMC model is compared against the mono mix and random mix baselines, as well as an adaptation of Demucs. We also include boxplots showing the results from a perceptual evaluation involving 16 audio engineers, who were asked to listen to the same passages shown below and provide a rating for each on a scale from 0 to 1.
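The sketch below illustrates the weight-sharing idea in PyTorch: the same encoder, post-processor, and pre-trained transformation network are applied to every input track, and the processed stereo stems are summed into the mix, which keeps the model permutation invariant and agnostic to the number of sources. The module interfaces, tensor shapes, and the mean-pooled context vector are assumptions for illustration, not the published architecture.

    import torch
    import torch.nn as nn

    class DMCSketch(nn.Module):
        """Weight-sharing sketch: one set of modules applied to every track."""

        def __init__(self, encoder, post_processor, transformation_network):
            super().__init__()
            self.encoder = encoder                          # shared across tracks
            self.post_processor = post_processor            # predicts mix parameters
            self.transformation_network = transformation_network  # pre-trained

        def forward(self, tracks):                 # tracks: (num_tracks, 1, samples)
            embeddings = self.encoder(tracks)      # (num_tracks, embed_dim)
            context = embeddings.mean(dim=0, keepdim=True).expand_as(embeddings)
            params = self.post_processor(torch.cat([embeddings, context], dim=-1))
            stems = self.transformation_network(tracks, params)  # (num_tracks, 2, samples)
            return stems.sum(dim=0)                # sum processed stems into the stereo mix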

ENST-drums basic mixing task

[Audio players: passages A-E, one example per condition (DMC (ours), Mono, Random, Target, Demucs)]

All listening samples have been loudness normalized to -23 LUFS and encoded as VBR MP3.

MedleyDB full mixing task

[Audio players: passages A-F, one example per condition (DMC (ours), Mono, Random, Target)]

All listening samples have been loudness normalized to -23 LUFS and encoded as VBR MP3.

Citation


                
    @inproceedings{steinmetz2020mixing,
        title={Automatic multitrack mixing with a differentiable mixing console of neural audio effects},
        author={Steinmetz, Christian J. and Pons, Jordi and Pascual, Santiago and Serrà, Joan},
        booktitle={ICASSP},
        year={2021}
    }