Efficient neural networks for real-time modeling
of analog dynamic range compression

Christian J. Steinmetz and Joshua D. Reiss

Centre for Digital Music, Queen Mary University of London

Paper · Code · Plugin · Video

Abstract


Deep learning approaches have demonstrated success in modeling analog audio effects. Nevertheless, challenges remain in modeling more complex effects that involve time-varying nonlinear elements, such as dynamic range compressors. Previous approaches for modeling analog compressors either ignore the device parameters, do not attain sufficient accuracy, or otherwise require large noncausal models prohibiting real-time operation. In this work, we propose a modification to temporal convolutional networks (TCNs) that enables greater efficiency without sacrificing performance. By utilizing very sparse convolutional kernels through rapidly growing dilations, our proposed model attains a significant receptive field using fewer layers, reducing computation. Through a detailed evaluation we demonstrate our efficient and causal approach achieves state-of-the-art performance in modeling the analog LA-2A compressor, is capable of real-time operation on CPU, and only requires 10 minutes of training data.
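The receptive-field claim above can be made concrete. A stack of causal dilated 1-D convolutions where block n uses dilation g^n has a receptive field of 1 + Σₙ (k − 1)·gⁿ samples, so a growth factor g > 1 makes the receptive field grow exponentially with depth. Below is a minimal sketch of this calculation; the kernel size, growth factors, and block counts are illustrative values, not the exact TCN-300 hyperparameters.

```python
def receptive_field(kernel_size: int, dilation_growth: int, n_blocks: int) -> int:
    """Receptive field (in samples) of stacked causal dilated 1-D convolutions,
    where block n uses dilation = dilation_growth ** n."""
    rf = 1
    for n in range(n_blocks):
        rf += (kernel_size - 1) * dilation_growth ** n
    return rf

# Conventional TCN: dilation doubles at each block (growth factor 2)
slow = receptive_field(kernel_size=13, dilation_growth=2, n_blocks=4)   # 181 samples

# Rapidly growing dilations (growth factor 10): the same four sparse layers
# cover a far larger context, reducing the computation needed for a given
# receptive field
fast = receptive_field(kernel_size=13, dilation_growth=10, n_blocks=4)  # 13333 samples

print(slow, fast)
print(f"{fast / 44100 * 1000:.0f} ms at 44.1 kHz")  # on the order of 300 ms
```

With these illustrative settings, four blocks at growth factor 10 already span roughly 300 ms of audio at 44.1 kHz, which is the scale of context a compressor model needs to capture attack and release behavior.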

Results


Here we compare the test performance of three models. All examples come from the test set of the SignalTrain LA2A Dataset, which contains audio that was not seen during training. The featured TCN-300-C model is causal, was trained using only 1% of the SignalTrain dataset, and is capable of running in real-time on CPU. Results from a MUSHRA-style perceptual evaluation using these audio samples are shown below.

[Each example below is presented on the project page with audio players for the Input, the Reference (Analog LA-2A), SignalTrain (Hawley et al.), TCN-300-C (Ours), and LSTM-32 (Ours).]

Example   Limit   Peak Reduction
Song        1          80
PnoA        0          65
PnoB        1          80
EGtr        0          65
AcGtr       1          65

No normalization has been applied to the listening samples, but they have been encoded to VBR mp3.


We found that the LSTM-32 and TCN-300-C models produce perceptually similar results. However, the results also indicate that while both models are rated significantly closer to the reference than SignalTrain, listeners were still able to hear subtle differences between the reference and the model outputs. From participant comments, and from our own listening, we found that this discrepancy was often due to a subtle difference in how transients were handled by the neural network approaches. It appears that both the LSTM-32 and TCN-300-C models are often less aggressive in suppressing some transients when large amounts of gain reduction are applied.

Try it out


[Controls: Limit · Peak Reduction]

In this demo you can interact with some pre-processed outputs to get a feel for the quality of the models. First, select a source from the drop-down menu above. You can then toggle between the input (unprocessed) and the output of the model (TCN-300-C trained with 1% of the training data). Note that as you increase the peak reduction, the output level is reduced significantly.

Note: This interface is a bit buggy.

Citation


    @inproceedings{steinmetz2021efficient,
        title={Efficient neural networks for real-time modeling of analog dynamic range compression},
        author={Steinmetz, Christian J. and Reiss, Joshua D.},
        booktitle={152nd Audio Engineering Society Convention},
        year={2022}}