flowEQ uses a disentangled variational autoencoder (β-VAE) in order to provide a new modality for modifying the timbre of recordings via a parametric equalizer. By traversing the learned latent space of the trained decoder network, the user can more quickly search through the configurations of a five band parametric equalizer. This methodology promotes using one’s ears to determine the proper EQ settings over looking at transfer functions or specific frequency controls. Two main modes of operation are provided (Traverse and Semantic), which allow users to sample from the latent space of the 12 trained models.
Just download the proper plugin for your platform and move it to the plugin directory for your OS.
Windows
VST3: C:\Program Files\Common Files\VST3
macOS
AU: Macintosh HD/Library/Audio/Plug-Ins/Components
VST3: Macintosh HD/Library/Audio/Plug-Ins/VST3
Skip to the Controls section to learn about how to use the plugin, or read on to learn more about how it all works.
flowEQ is open source and all of the code is available on GitHub. You can build the plugin yourself using MATLAB or modify and train the models yourself, with easy tools for importing them into the plugin.
In addition to the main plugin there is also a lite version. The lite version is a stripped-down version of the full plugin, with the goal of simplifying the interface for novice users. It features only the Traverse mode with the 3 dimensional model (x, y, z controls), using a β value of 0.001. Download this plugin with one of the links below.
The parametric equalizer is a staple in the audio engineer's toolbox for shaping recordings. First introduced in the seminal AES paper by George Massenburg in 1972, the parametric equalizer has become the de facto format for providing control over timbral shaping. These equalizers often feature multiple bands, each with its own center frequency, gain, and Q controls. This provides the audio engineer with great freedom over the shape of the filter's transfer function.
While the parametric equalizer provides a well-designed interface, it requires a number of skills on the part of the audio engineer to be utilized effectively. These include an understanding of the relationship between individual frequency ranges and different timbres, as well as of filter shapes (peaking, shelving, and Q factor). Learning to use this powerful timbral shaping tool is time consuming and requires a great deal of experience.
The goal of flowEQ is to provide a high-level interface to a traditional five band parametric equalizer that enables novice users to effectively apply timbral processing. In addition, this interface gives experienced engineers a new method of searching across multiple timbral profiles very quickly. This also has the potential to unlock new creative effects that would be challenging to achieve otherwise.
The plugin is built using the MATLAB Audio Toolbox and Python. The autoencoder models are first trained using Keras (tf.keras in TensorFlow 2.0 beta). These trained models are later converted to MATLAB code and incorporated into the plugin. Parameters are exposed that allow the user to choose among these different models and then directly interact with them in real-time. The MATLAB Audio Toolbox provides a means to create VST and AU plugins directly from MATLAB code, enabling their use in common Digital Audio Workstations (DAWs).
The EQ features three modes of operation, which are selected using the EQ Mode control at the top of the plugin.
The Traverse mode allows the user to freely investigate the latent space of the models. In this mode, the three x, y, z sliders are used to traverse the latent space of the decoder. Each latent vector decodes to a set of values for the thirteen parameters of the five band equalizer. Enabling the Extend mode extends the limits of the sliders by a factor of 2: a slider value of -2 will be decoded as -4, and so forth. This allows more of the latent space to be explored, but may produce stranger and less desirable results.
Semantic mode provides a different method of sampling from the latent space. The x, y, z sliders are deactivated, and the Embedding A and Embedding B combo boxes are used along with the Interpolate slider. After training, the semantic labels are used to identify relevant clusters within the latent space. These clusters represent areas of the latent space associated with certain semantic descriptors. The Interpolate control allows users to move seamlessly between the two semantic descriptors in the latent space. By default the value is set to 0, which means that the decoder uses the latent vector specified by Embedding A. As the user increases this value toward 1, a new latent vector is calculated, which lies somewhere between A and B. When set to 1, the latent vector of B is used as input to the decoder.
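One natural way to read the Interpolate control is as a linear blend between the latent codes of the two selected embeddings (the exact interpolation scheme is not spelled out here, and the centroid values below are purely hypothetical):

```python
import numpy as np

def interpolate_latent(z_a, z_b, alpha):
    """Blend two latent codes: alpha=0 gives Embedding A, alpha=1 gives Embedding B."""
    z_a, z_b = np.asarray(z_a), np.asarray(z_b)
    return (1.0 - alpha) * z_a + alpha * z_b

# example: move halfway between a "warm" and a "bright" cluster centroid (2D model)
z_warm = np.array([-1.2, 0.4])    # hypothetical centroid
z_bright = np.array([0.9, -0.3])  # hypothetical centroid
z = interpolate_latent(z_warm, z_bright, 0.5)   # this code is then fed to the decoder
```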
Manual mode provides the user with direct control of the five band parametric equalizer using the controls at the bottom of the plugin. Unfortunately, the framework does not currently provide a means to link the parameters of the manual equalizer with the intelligent equalizer. In a future implementation (via JUCE), the user will be able to interact with the decoder and see those parameters updated in real-time on the full manual parametric equalizer below. This will enable users to quickly search the latent space with the decoder for a relevant timbre and then tweak it further with the manual controls. Each of the five bands features an Active checkbox. Un-checking one of these will deactivate the respective band. This applies in Manual mode as well as in Traverse and Semantic, although it may prove less useful in the latter two.
The Latent control allows the user to switch between models with a different number of latent dimensions (1, 2, or 3). For example, with the default setting of 2, only the x and y values are used. Increasing the latent dimension gives more control over the shape of the generated EQ curve but requires tuning more parameters. Decreasing the latent dimension makes searching through the latent space faster, but at the cost of finer control.
This control (a.k.a. disentanglement) allows the user to sample from models with varying levels of latent space disentanglement. Setting this to a lower value decreases the regularization of the latent space, meaning that movement along a given dimension is tied less to a specific feature of the equalizer curve (the dimensions are more entangled). A greater β means the dimensions of the latent space are more closely tied to specific features (in this case warm and bright). It is recommended to leave this control at the default. The intuition behind this control is outlined further in the Theory section.
This control allows the user to decrease the intensity, or strength, of the equalizer by simply scaling the gains for each band. A Strength setting of 1 will result in the equalizer being applied with the exact gains for each band as produced by the decoder. Lowering this value will scale the gain values downward (toward 0 dB), making the equalizer's effect less prominent. A setting of 0 will turn all gains to 0 dB, bypassing the equalizer altogether.
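As a sketch of this behavior (the plugin itself is written in MATLAB; the Python below simply illustrates the scaling, assuming the decoded band gains are expressed in dB):

```python
def apply_strength(band_gains_db, strength):
    # Scale each band gain toward 0 dB.
    # strength = 1.0 -> gains applied exactly as decoded
    # strength = 0.0 -> all gains become 0 dB, effectively bypassing the EQ
    return [strength * g for g in band_gains_db]

print(apply_strength([3.5, -2.0, 6.0, -4.5, 1.0], 0.5))
# [1.75, -1.0, 3.0, -2.25, 0.5]
```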
Since it is difficult to examine differences in signals that are perceived at different levels, an automatic gain compensation feature is included. When enabled, this control monitors the difference in the perceived loudness (ITU-R BS.1770 Short-term loudness) between the input (unprocessed audio) and the output (post-EQ). A gain offset is then applied to force the output level to match the input. This makes comparing different equalizer settings easier.
The default settings work well, but the user can adjust the Update Rate, which changes how quickly the gain adjustment value is moved (if it moves too quickly, clicks may become audible). The Gain Range setting limits the maximum amount of gain compensation applied in either direction (a setting of 12 dB allows for +12 dB or -12 dB of adjustment). Make sure to disable this control after setting the equalizer, or you may hear fluctuations in volume as audio is played.
These are very straightforward controls that allow the user to adjust the level of the audio both before it enters the equalizer and after it is equalized. These controls prove useful when additional headroom is needed, or to change the output level to match another source for comparison. Both sliders are active in all modes of operation.
Since the MATLAB Audio Toolbox does not provide a means for more advanced visualizations within the GUI, a second, small MATLAB program is provided to make visualizations in real-time. This works by sending data from the plugin over UDP to the visualizer. This program features two windows.
Using the dsp.DynamicFilterVisualizer, the current filter coefficients from flowEQ are displayed on a magnitude response plot. This helps to provide greater visual feedback to the user on the equalizer parameters, since they cannot be linked to the knobs within the plugin itself. Traversing the latent space and observing these transfer functions lends insight into the structure of the latent space for each trained model.
This visualization shows the physical location of the current latent code within the N dimensional latent space of the current model. As shown in the animation above, when the user changes the Latent control in the plugin, the plot transitions from a 2D to a 3D plot. When using a one dimensional model, a 2D plot is shown but the code only moves across the x-axis.
flowEQ uses a disentangled variational autoencoder (β-VAE) in order to provide an intelligent interface for using a five band parametric equalizer. This takes advantage of the assumption that certain locations within the parameter space of the equalizer are more relevant than others. In this case we utilize the SAFE-DB Equalizer dataset, which features a collection of settings from a five band parametric equalizer along with semantic descriptors for each setting, to train our autoencoder.
During training, our model learns to reconstruct the thirteen parameters of the equalizer after passing the original input through a lower dimensional bottleneck (1, 2, or 3 dimensions). The overall structure of the full model is shown below for the 2 dimensional case. On the left side, the input is a 13 dimensional vector made up of the original equalizer parameters. This is passed through a hidden layer and then through a 2 dimensional bottleneck. The decoder then takes this low dimensional vector as input and attempts to reconstruct the original 13 parameters.
While this may not seem like a useful task, we find that if we use the decoder portion of the model, which takes as input a low dimensional vector, we can reconstruct a wide range of equalizer curves using only a very small number of knobs (1, 2, or 3, depending on the dimensionality of the latent space). The diagram below demonstrates this operation. Here we have discarded the encoder; we sample points from a 2 dimensional plane and feed these points to the decoder, which then attempts to reconstruct the full 13 parameters. This lower dimensional latent space provides an easy way to search across the space of possible equalizer parameters.
We can call our encoder function $f_\theta$, where $\theta$ represents the weights of the encoder. We can call our decoder function $g_\phi$, where $\phi$ represents the weights of the decoder. The whole autoencoder is then given by $$ \large \hat{x} = g_\phi(f_\theta(x)),$$ and the reconstruction loss, with a simple MSE metric, can be represented as $$ \large L_{AE}(\theta,\phi) = \frac{1}{n}\sum_{i=1}^{n} (x_i - g_\phi(f_\theta(x_i)))^2.$$
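A minimal Keras sketch of this vanilla autoencoder is shown below (the 1024-unit hidden layers mirror the decoder described later and are otherwise an assumption; the models actually shipped with flowEQ are β-VAEs, introduced next):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2   # size of the bottleneck (1, 2, or 3 in flowEQ)

# encoder f_theta: 13 normalized EQ parameters -> latent code
x_in = layers.Input(shape=(13,))
h_enc = layers.Dense(1024, activation="relu")(x_in)
z = layers.Dense(latent_dim)(h_enc)

# decoder g_phi: latent code -> 13 reconstructed parameters in [0, 1]
h_dec = layers.Dense(1024, activation="relu")(z)
x_hat = layers.Dense(13, activation="sigmoid")(h_dec)

autoencoder = Model(x_in, x_hat)
autoencoder.compile(optimizer="adam", loss="mse")   # the L_AE reconstruction loss above
```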
While the VAE overall structure looks very similar to the vanilla autoencoder, it is founded on a different statistical methodology: variational inference. We start by examining a sample from some data distribution ($p_{data}$), in our case a 13 dimensional vector of equalizer parameters called $\mathbf{x}$. We then want to determine some latent variable $\mathbf{z}$ which generates $\mathbf{x}$ via the probability distribution $p_{data}$, i.e. $p_{data}(\mathbf{z}|\mathbf{x})$. Bayes theorem tells us we can find this with $$ \large p_{data}(\mathbf{z}|\mathbf{x}) = \frac{p_{data}(\mathbf{x}|\mathbf{z}) p_{data}(\mathbf{z})}{p_{data}(\mathbf{x})}. $$ The issue is we need to obtain $p_{data}(\mathbf{x})$, which is given by $$ \large p_{data}(\mathbf{x}) = \int{p_{data}(\mathbf{x}|\mathbf{z})p_{data}(\mathbf{z})}d\mathbf{z},$$ but it turns out this is intractable.
Instead we attempt to approximate $p_{data}(\mathbf{z}|\mathbf{x})$ using $q_{\theta}(\mathbf{z}|\mathbf{x})$, another distribution, preferably a tractable one. If we parameterize $q_{\theta}(\mathbf{z}|\mathbf{x})$ with $\theta$ so that it is very similar, we can use it as $p_{data}(\mathbf{z}|\mathbf{x})$, and this is precisely what we attempt to do during training. To achieve this we add a new term to the loss function introduced for the vanilla autoencoder. We want to minimize the difference between these distributions by minimizing the KL divergence, $$ \large \min D_{KL} (q(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})).$$
The KL divergence measures how much information is lost if we use $q(\mathbf{z}|\mathbf{x})$ to represent $p(\mathbf{z}|\mathbf{x})$, and is therefore an appropriate quantity to minimize in our loss function. We will not go through a complete derivation of the loss function (refer to the good explanation in this video), but we arrive at a new loss function given by $$ \large \mathcal{L}_{VAE}(\pmb{\theta},\pmb{\phi},\mathbf{x}^{(i)}) = - \mathbb{E}_{\mathbf{z} \sim q_{\theta}(\mathbf{z}|\mathbf{x}^{(i)})} [ {\log{p_\phi(\mathbf{x}^{(i)} | \mathbf{z})}} ] + D_{KL} (q_\theta(\mathbf{z} | \mathbf{x}^{(i)}) || p(\mathbf{z})). $$
While this looks fairly different from our previous loss function, it is not entirely unrelated. The first term acts similarly to our previous reconstruction loss: we are minimizing the expected negative log-likelihood of the data under the decoder, $p_{\phi}(\mathbf{x}^{(i)} | \mathbf{z})$, given a latent representation from the encoder, $\mathbf{z} \sim q_{\theta}(\mathbf{z}|\mathbf{x}^{(i)})$. The second term is the KL divergence introduced above.
In our implementation, we specify the prior $p(\mathbf{z})$ as an isotropic multivariate Gaussian, $p(\mathbf{z}) = \mathcal{N}(\mathbf{z};0,\mathbf{I})$. This means that our KL divergence term penalizes the encoder when it places codes far outside the bounds of this Gaussian distribution. Without this restriction, as in the vanilla autoencoder, the model may place each data point at an arbitrary location in $\mathbf{z}$. In the case of the VAE, the learned space is constrained and a sample, $\mathbf{x}^{(i)}$, is more likely to be placed near other similar samples, a very desirable property in our application.
We can extend the VAE further by adding a weighting factor, $\beta$, to the loss function. It is simply a weight applied to the KL divergence term (shown below in a simplified version of the original VAE loss). $$ \large \mathcal{L}_{\beta-VAE}(\pmb{\theta},\pmb{\phi},\mathbf{x}^{(i)}) = - \mathbb{E}_{\mathbf{z} \sim q_{\theta}(\mathbf{z}|\mathbf{x}^{(i)})} [ {\log{p_\phi(\mathbf{x}^{(i)} | \mathbf{z})}} ] + \beta \: D_{KL} (q_\theta(\mathbf{z} | \mathbf{x}^{(i)}) || p(\mathbf{z})). $$ This parameter allows us to adjust the importance of the KL divergence during training, and this has an interesting effect: as $\beta$ is increased, the model becomes more regularized and the latent dimensions are encouraged to disentangle.
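A rough sketch of how such a model and loss can be assembled in tf.keras, following the add_loss pattern from the classic Keras VAE example (the layer sizes, the loss scaling, and the helper name build_vae are assumptions rather than flowEQ's exact training code; newer TensorFlow versions may prefer a custom train_step instead):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

def build_vae(latent_dim=2, beta=0.001, hidden=1024):
    """Minimal beta-VAE over the 13 normalized EQ parameters (sketch only)."""
    x_in = layers.Input(shape=(13,))
    h = layers.Dense(hidden, activation="relu")(x_in)
    z_mean = layers.Dense(latent_dim)(h)
    z_log_var = layers.Dense(latent_dim)(h)

    def sample(args):
        # reparameterization trick: z = mu + sigma * epsilon
        mu, log_var = args
        eps = K.random_normal(shape=K.shape(mu))
        return mu + K.exp(0.5 * log_var) * eps

    z = layers.Lambda(sample)([z_mean, z_log_var])

    h_dec = layers.Dense(hidden, activation="relu")(z)
    x_out = layers.Dense(13, activation="sigmoid")(h_dec)

    vae = Model(x_in, x_out)

    # reconstruction term plus the beta-weighted KL divergence to the unit Gaussian prior
    recon = 13 * tf.keras.losses.mse(x_in, x_out)   # sum-style scaling, a common convention
    kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    vae.add_loss(K.mean(recon + beta * kl))
    vae.compile(optimizer="adam")
    return vae
```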
Disentanglement is the notion that each latent variable more directly controls a single ground truth factor of the output. For example, in the case of images of faces, a single dimension might encode the age of the subject. Greater disentanglement is often desired, as the latent space may become more interpretable and the representation more efficient, using the minimal number of latent variables required. A large $\beta$ may cause some latent dimensions to end up unused, as the entirety of the output can be modeled with fewer dimensions.
In the case of the parametric equalizer we are not concerned with disentanglement in the same way as with that of generating faces. Since we are using models with 1, 2, and 3 latent dimensions, we are naturally limited to a small number of possible disentangled controls. What we desire is to arrange the latent space in such a way that makes searching for the desired timbre more efficient. To expand the possible options multiple models are trained, each with a different value of $\beta$ (0.0, 0.001, 0.01, 0.02). The user then has the ability to adjust this parameter and note how the structure of the latent space changes.
The block diagram above outlines the steps in building the plugin: pre-processing, training, and finally incorporating the trained models into the plugin.
A number of steps are taken to prepare the raw dataset for the training procedure. First we normalize all of the equalizer parameters in the dataset so each is scaled between 0 and 1. We achieve this using the following normalization equation $$ \large x_{normalized} = \frac{x_{unnormalized} - x_{min} }{x_{max} - x_{min} } , $$ where $x_{min}$ and $x_{max}$ are the minimum and maximum values the parameter can take on, respectively. The model operates only on normalized parameters, so when we want to convert its outputs back to their unnormalized form we use the following equation $$ \large x_{unnormalized} = [x_{normalized} \cdot (x_{max} - x_{min})] + x_{min} . $$
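In code, these two mappings are just the usual min-max scaling pair (the ±12 dB gain range in the example is illustrative):

```python
def normalize(x, x_min, x_max):
    # map a parameter value into [0, 1]
    return (x - x_min) / (x_max - x_min)

def denormalize(x_norm, x_min, x_max):
    # map a normalized value back to its original parameter range
    return x_norm * (x_max - x_min) + x_min

# example: a band gain of +3 dB with a +/-12 dB range
g = normalize(3.0, -12.0, 12.0)      # 0.625
print(denormalize(g, -12.0, 12.0))   # 3.0
```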
Next, we account for the fact that the same equalizer transfer function can be represented by different sets of parameters, since the order of the three mid bands can be swapped without changing the resulting curve. To rectify this, we sort these bands in ascending order of center frequency before training, which helps the model learn more efficiently.
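A sketch of this sorting step for a single sample, assuming a (frequency, gain, Q) layout for the three mid bands rather than the actual SAFE column ordering:

```python
import numpy as np

def sort_mid_bands(mid_params):
    """Sort the three mid bands of one sample by ascending center frequency.

    `mid_params` is assumed to be laid out as three consecutive
    (frequency, gain, Q) triples; the real dataset's column order differs,
    so the indices here are illustrative only.
    """
    bands = np.asarray(mid_params).reshape(3, 3)   # three (freq, gain, Q) triples
    order = np.argsort(bands[:, 0])                # ascending center frequency
    return bands[order].flatten()

print(sort_mid_bands([2000, 3.0, 0.7,  500, -2.0, 1.0,  1000, 1.5, 0.5]))
```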
In order to more effectively group the semantic descriptors, which generally have little consistency, we apply some light processing to clean them and group them based on core meaning. This is an area that could be expanded in the future to produce more effective groupings. After this we count the frequency of each descriptor and save a new dataframe that contains these descriptors and their frequencies. Finally we shuffle all of the samples in the dataset before training and save the new dataframe to file.
Training involves fitting all 12 models, each with different hyperparameters. The details for these models are shown in the table below. During training, a new Keras model is constructed for each case and is trained for 200 epochs. A batch size of 8 was used, as this was found to lead to the lowest loss at convergence.
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Latent | 1D | 1D | 1D | 1D | 2D | 2D | 2D | 2D | 3D | 3D | 3D | 3D |
β | 0.000 | 0.001 | 0.01 | 0.02 | 0.000 | 0.001 | 0.01 | 0.02 | 0.000 | 0.001 | 0.01 | 0.02 |
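The grid above maps directly onto a pair of loops in the training script; build_vae refers to the sketch in the Theory section, and x_train stands in for the pre-processed parameter matrix:

```python
import numpy as np

# hyperparameter grid matching the table above
latent_dims = [1, 2, 3]
betas = [0.000, 0.001, 0.01, 0.02]

x_train = np.load("normalized_eq_params.npy")   # hypothetical pre-processed dataset

models = {}
for latent_dim in latent_dims:
    for beta in betas:
        vae = build_vae(latent_dim=latent_dim, beta=beta)   # sketch from the Theory section
        vae.fit(x_train, epochs=200, batch_size=8, verbose=0)
        models[(latent_dim, beta)] = vae
```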
During training, a method known as KL annealing is utilized. It has been shown that when training VAEs, the KL loss term often dominates at the start of the training process, which ultimately prevents the model from learning. To solve this, we initialize the KL weight to 0, and then, after training on only the reconstruction loss for a number of epochs, we increase it to the final value given by $\beta$.
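One way to implement this schedule in Keras is a callback that holds the KL weight at zero for an initial warm-up period; this assumes the loss was built with a backend variable kl_weight in place of the fixed β used in the earlier sketch, and the 50-epoch warm-up length is illustrative, not flowEQ's exact schedule:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

class KLAnnealing(tf.keras.callbacks.Callback):
    """Hold the KL weight at 0 for `warmup` epochs, then jump to its final value."""

    def __init__(self, kl_weight, final_beta, warmup=50):
        super().__init__()
        self.kl_weight = kl_weight      # a K.variable referenced inside the model's loss
        self.final_beta = final_beta
        self.warmup = warmup

    def on_epoch_begin(self, epoch, logs=None):
        K.set_value(self.kl_weight, 0.0 if epoch < self.warmup else self.final_beta)

# usage, assuming the loss was built as recon + kl_weight * kl with kl_weight = K.variable(0.0):
# vae.fit(x_train, epochs=200, batch_size=8,
#         callbacks=[KLAnnealing(kl_weight, final_beta=0.001, warmup=50)])
```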
Since in our application we want to learn the space that is most conducive to applying parametric equalization, traditional metrics may not be very applicable. Instead we are largely interested in examining the structure of the latent space to see whether it appears to have learned useful representations. In the future, a more thorough evaluation of the performance of these models can be undertaken alongside a user study in order to uncover which kinds of latent space representations are most desirable.
The first set of animations above show points sampled from the 2 dimensional latent space during training, each with different $\beta$ values. The 2 dimensional latent code at each point has been passed through the decoder, converted to a 13 point vector, and then its transfer function plotted. We observe that as $\beta$ increases the latent space becomes more regularized, and for the case where $\beta = 0.02$, many of the points appear to decode to the same equalizer transfer function. This is an example of what might be considered over-regularization.
Even though we have represented the data efficiently (using just a single dimension, as will be evident in the next set of animations), the reconstruction error is greater and we are no longer able to represent as much diversity among samples. Meanwhile, in the case where $\beta = 0.000$, we can see the structure of the latent space is less neatly organized. It appears that the cases where $\beta = 0.001$ and $\beta = 0.01$ may be the most useful, as they both show diversity in samples and are well structured. From this we choose $\beta = 0.001$ as the default within the plugin, but the user is free to adjust it.
In this set of animations, instead of sampling from the latent space with the decoder, we pass the training data through the encoder, projecting it into the latent space. Since only parameters with warm and bright descriptors were used, we observe how these two classes are organized.
As we noted above, in the case of $\beta=0.02$, it is clear that the data is being projected largely onto the y-axis only. We observe similar trends to the previous set of animations: the models with $\beta = 0.001$ and $\beta = 0.01$ appear to have regularized and well organized latent spaces, with the $\beta = 0.01$ model perhaps looking the best.
With the models trained and examined, we now need to connect them to a parametric equalizer and build an interface for users. To achieve this, we use the MATLAB Audio Toolbox, which provides the ability to turn MATLAB code into VST/AU plugins. While this toolbox has some limitations (few GUI options and special considerations for code generation), it provides a convenient avenue to implement and prototype our plugin. The diagram below provides an overview of the structure of the plugin.
The audio processing path is fairly straightforward and is made up of three gain blocks and the five bands of the parametric equalizer. The input and output gain blocks are controlled by the user directly with sliders on the GUI. The compensation gain block scales the output in an effort to keep the output signal at a perceptual loudness level equal to that of the input, post input gain.
As mentioned in the Controls section, the ITU-R BS.1770 Short-term loudness is measured before and after the equalizer section. The post-EQ loudness is subtracted from the pre-EQ loudness to determine a correction factor in dB. This works because the loudness measurement, in dB LUFS, corresponds directly to changes in dBFS. The Update rate parameter determines how many frames of audio are sampled before the gain compensation value is updated. The Gain range parameter determines the maximum amount of gain compensation to apply: if the computed value exceeds this setting, it is capped at the maximum. This prevents situations where the gain changes rapidly and causes undesirable shifts in output volume.
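The core of the compensation logic reduces to a loudness difference plus a clamp; a sketch with the short-term loudness measurement itself left abstract (the plugin performs this in MATLAB):

```python
def update_compensation_gain(pre_eq_lufs, post_eq_lufs, gain_range_db=12.0):
    """Compute the make-up gain (dB) that matches post-EQ loudness to pre-EQ loudness.

    `pre_eq_lufs` and `post_eq_lufs` are short-term loudness measurements
    (ITU-R BS.1770); how they are obtained is outside the scope of this sketch.
    """
    correction_db = pre_eq_lufs - post_eq_lufs
    # clamp to the user-selected Gain Range to avoid large jumps in level
    return max(-gain_range_db, min(gain_range_db, correction_db))

print(update_compensation_gain(-23.0, -18.5))   # -4.5 dB (EQ made the signal louder)
```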
The five equalizer bands are implemented using biquad filter design equations for shelving and peaking filters. The implementation included in the plugin is based directly on the JUCE library IIR filter implementation, since the original SAFE Equalizer used this library. Their implementation appears to be based on the well-known Audio EQ Cookbook formulae by Robert Bristow-Johnson. Update flags are used to signal whenever the user has changed a parameter that controls the equalizer, which forces the coefficients for all the filters to be updated based on the new parameters. Additionally, bypass controls are provided for each band; when a band is bypassed, its stage is simply skipped during the audio processing loop.
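For reference, the cookbook coefficients for the peaking case look like the following (the shelving bands use the corresponding shelf formulas; the plugin computes these in MATLAB, but the arithmetic is identical):

```python
import math

def peaking_biquad(fs, f0, gain_db, q):
    """Peaking EQ biquad coefficients following the RBJ Audio EQ Cookbook.

    Returns normalized (b0, b1, b2, a1, a2) with a0 divided out.
    """
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)

    b0 = 1.0 + alpha * a
    b1 = -2.0 * cos_w0
    b2 = 1.0 - alpha * a
    a0 = 1.0 + alpha / a
    a1 = -2.0 * cos_w0
    a2 = 1.0 - alpha / a
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

# example: +6 dB bell at 1 kHz, Q = 0.71, 44.1 kHz sample rate
print(peaking_biquad(44100, 1000.0, 6.0, 0.71))
```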
In order to provide an interface that allows for the user to sample from the 12 trained models, the plugin needs to implement the decoder network.
This then allows for the user to provide latent codes, which can then be converted directly to equalizer parameters.
Ideally, the easiest method to achieve this would be to use the importKerasNetwork
function from the Deep Learning Toolbox.
This would load the HDF5 model weights and architecture after training in Keras and implement the model as a DAGNetwork
object.
Unfortunately, there is limited code generation support for these objects, and focus has been placed on targeting specific hardware and GPUs.
(Although, there is some cool functionality, like the ability to compile code for ARM targets like the Raspberry Pi)
This won't work for our use case though. We need the ability to compile plugins for many different CPU targets across operating systems.
To solve this, we simply implement the network in pure MATLAB code.
This is fairly simple since our network is relatively small.
A small network is not only easier to implement; it is also much faster to run.
We could attempt to use the legacy Neural Network tools,
which are supported for general C/C++ code generation using the genFunction
function.
The issue is that this API does not offer all of the same modern functionality as Keras (namely the same activation functions), so it would be challenging to reuse the weights from the Keras models.
Instead, we simply load the weights as matrices and create an object that has the functions needed to implement the network exactly as it is implemented in Keras. The decoder network is a simple densely connected neural network, with a hidden layer of 1024 units and an output layer of 13 units. We can represent the first layer of the decoder with the following function, $$ z = \text{ReLU}(x W_1 + b_1) $$ where $x$ is our input latent vector of 1, 2, or 3 dimensions, $W_1$ is the weight matrix for the first layer, $b_1$ is a vector of biases, and ReLU is the rectified linear activation function. $$ \text{ReLU}(x) = \max(0,x) $$ The output of this layer, $z$, is then fed as input to the output layer, $$\hat{y} = \sigma(z W_2 + b_2) $$ where $\hat{y}$ is the output, a 13 dimensional vector of normalized equalizer parameters, $W_2$ and $b_2$ are the weights and biases of the second layer, and $\sigma(x)$ is the sigmoid activation function. $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
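Equivalently, the same forward pass takes only a few lines of NumPy (the weight names and random example weights below are purely illustrative; in the plugin the exported Keras weights are used):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(x, W1, b1, W2, b2):
    """Decoder forward pass: latent code -> 13 normalized EQ parameters."""
    z = relu(x @ W1 + b1)          # hidden layer, 1024 units
    return sigmoid(z @ W2 + b2)    # output layer, 13 units in [0, 1]

# example with random weights just to show the shapes (2D latent model)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 1024)), np.zeros(1024)
W2, b2 = rng.normal(size=(1024, 13)), np.zeros(13)
print(decode(np.array([0.5, -1.0]), W1, b1, W2, b2).shape)   # (13,)
```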
This is all implemented in a fairly straightforward MATLAB class called Decoder
.
Since the network is so small, inference times (a single forward pass through the decoder) are very fast, at around 300 microseconds per forward pass.
This makes real-time operation of the plugin feasible, to the point where the user doesn't even notice that a neural network is running as they interact with the sliders.
This is one of the biggest challenges with building real-time audio tools that incorporate machine learning.
As larger models are used they become nearly impossible to run in real-time audio applications without significant GPU acceleration.
flowEQ is still very much a proof of concept, and the current implementation is somewhat limited by the MATLAB framework. Below are some future areas of development to further improve the plugin and expand its functionality.