In this work, we pose and address the following ``cover song analogies'' problem: given a song A by artist 1 and a cover song A' of this song by artist 2, and given a different song B by artist 1, synthesize a song B' which is a cover of B in the style of artist 2. Normally, such a polyphonic style transfer problem would be quite challenging, but we show how the cover song pair (A, A') constrains the problem, making it easier to solve. First, we extract the longest common beat-synchronous subsequence between A and A', and we time stretch the corresponding beat intervals in A' so that they align with A. We then derive a version of joint 2D convolutional NMF, which we apply to the constant-Q spectrograms of the synchronized segments to learn a translation dictionary of sound templates from A to A'. Finally, we apply the learned templates as filters to the song B, and we mash up the translated filtered components into the synthesized song B' using audio mosaicing. We showcase our algorithm on several examples, including a synthesized cover version of Michael Jackson's ``Bad'' by Alien Ant Farm, learned from the latter's ``Smooth Criminal'' cover.
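The beat-synchronization step above can be sketched as follows. This is a simplified illustration, not the paper's exact implementation: it assumes we already have matched beat onsets (as sample indices) for the common subsequence in A and A', and it resamples each beat interval of A' to span the corresponding interval in A. The function and variable names (`stretch_beats`, `src_beats`, `tgt_beats`) are hypothetical.

```python
import numpy as np
from scipy.signal import resample

def stretch_beats(x, src_beats, tgt_beats):
    """Time-stretch each beat interval of x (delimited by sample indices
    src_beats) so that it spans the matching interval in tgt_beats."""
    out = []
    for i in range(len(src_beats) - 1):
        seg = x[src_beats[i]:src_beats[i + 1]]
        n_target = tgt_beats[i + 1] - tgt_beats[i]
        # FFT-based resampling of this beat interval to the target length
        out.append(resample(seg, n_target))
    return np.concatenate(out)
```

Note that plain resampling shifts pitch along with tempo; a phase-vocoder stretch (e.g. librosa's time stretching) preserves pitch better. This sketch only shows the per-beat bookkeeping.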
We can listen to the components of W1 and W2 individually by applying the Griffin-Lim algorithm in the CQT domain and then inverting back to the time domain. This helps us gain insight into which sounds the filters are actually picking up on.
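librosa provides `librosa.griffinlim_cqt` for doing this directly in the CQT domain; as a self-contained illustration of the same idea, the sketch below runs the generic Griffin-Lim iteration on an STFT magnitude using SciPy (the function name and parameters here are illustrative, not the paper's code).

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, fs=22050, nperseg=1024, noverlap=768):
    """Recover a time-domain signal whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(0)
    # start from the target magnitude with random phase
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # invert to time domain, then re-analyze to estimate a consistent phase
        _, x = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, est = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        est = est[:, :mag.shape[1]]  # guard against off-by-one frame counts
        # keep the estimated phase, re-impose the target magnitude
        spec = mag * np.exp(1j * np.angle(est))
    _, x = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```

Applied to the magnitude reconstruction of a single NMF component, this produces audio we can listen to for that component alone.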
As described in the paper, we can use the learned decomposition to derive soft masks on the original audio and separate it into corresponding components between A and A'. This gives us our "per-instrument translation dictionary," which we use to synthesize cover songs.
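A minimal sketch of the masking step, assuming we already have the complex CQT `C` of a song and an NMF factorization whose templates are grouped into per-instrument blocks (the grouping into lists `Ws`/`Hs` is our illustrative convention): each component's reconstruction, normalized by the total reconstruction, gives a soft mask on the original spectrogram.

```python
import numpy as np

def separate_components(C, Ws, Hs, eps=1e-12):
    """Split a complex spectrogram C into sources via NMF soft masks.

    Ws: list of nonnegative template matrices (n_freq x n_templates_k)
    Hs: list of matching activation matrices (n_templates_k x n_frames)
    Returns one complex spectrogram per component; they sum back to C.
    """
    recons = [W @ H for W, H in zip(Ws, Hs)]
    total = sum(recons) + eps  # eps avoids division by zero
    return [(R / total) * C for R in recons]
```

Inverting each masked spectrogram (e.g. with Griffin-Lim as above) yields the isolated per-instrument audio for both A and A'.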
Now that we have the translation dictionaries, we can split B up into components B1, B2, and B3 using W1, and we can reconstruct each component from grains of A1, A2, and A3, respectively, using Driedger's audio mosaicing. Since the same activation coefficients index the corresponding grains in A'1, A'2, and A'3, we can then swap in those translated grains to form the final components B'1, B'2, and B'3, which are mixed into the synthesized song B'.
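The translation step can be sketched as follows; this is a simplified plain-NMF stand-in for Driedger's mosaicing, with hypothetical names. We hold the templates `W` learned on A fixed, fit activations `H` to B's magnitude spectrogram with multiplicative updates, and then rebuild with the translated templates `Wp` (learned on A') to get B'.

```python
import numpy as np

def fit_activations(V, W, n_iter=100, eps=1e-12):
    """Multiplicative-update NMF activations H for V ≈ W @ H, with W held
    fixed (Euclidean/Frobenius-norm update rule)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def translate(V_B, W, Wp):
    """Filter B through templates learned on A, then resynthesize the
    magnitude spectrogram with the corresponding A' templates."""
    H = fit_activations(V_B, W)
    return Wp @ H  # magnitude spectrogram of the translated component B'
```

Driedger's actual updates add repetition, sparsity, and continuity constraints so that activations behave like grain selections; the plain NMF above only shows the swap of W for Wp that performs the translation.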