MusicLM: Google AI generates music in various genres at 24 kHz

An AI-generated image of an exploding ball of music.
On Thursday, Google researchers announced a new generative AI model called MusicLM that can create 24 kHz musical audio from text descriptions, such as “a soothing violin melody backed by a distorted guitar riff.” It can also transform a hummed melody into a different musical style and produce music for several minutes at a time.
MusicLM uses an AI model trained on what Google calls “a large unlabeled music dataset,” along with captions from MusicCaps, a new dataset consisting of 5,521 music-text pairs. MusicCaps gets its text descriptions from human experts and its corresponding audio clips from Google’s AudioSet, a collection of over 2 million tagged 10-second audio clips taken from YouTube videos.
Broadly speaking, MusicLM works in two main parts: first, it takes a sequence of audio tokens (chunks of sound) and maps them to semantic tokens (tokens that represent higher-level meaning) drawn from the captions used during training. The second part takes the user’s caption and/or input audio and generates acoustic tokens (the chunks of sound that make up the resulting song output). The system builds on an earlier Google AI model called AudioLM (introduced in September) as well as other components such as SoundStream and MuLan.
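Google has not released MusicLM’s code or models, but the two-stage flow described above can be illustrated with a short, purely hypothetical Python sketch. Every name, function, and token size below is invented for illustration; only the overall pipeline, from a MuLan-style text embedding to semantic tokens to SoundStream-style acoustic tokens decoded into a 24 kHz waveform, reflects the paper’s description.

# Hypothetical sketch of MusicLM's two-stage token pipeline.
# None of these functions exist in a released library; they are placeholders
# standing in for the real MuLan, semantic/acoustic transformers, and SoundStream.
import numpy as np

def mulan_text_embedding(caption: str) -> np.ndarray:
    # Placeholder: MuLan maps a text caption into a joint music/text embedding space.
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.normal(size=128)

def generate_semantic_tokens(conditioning: np.ndarray, length: int = 50) -> np.ndarray:
    # Stage 1 (hypothetical): a model predicts coarse "semantic" tokens that capture
    # the long-term structure of the music, conditioned on the caption embedding.
    rng = np.random.default_rng(int(abs(conditioning.sum() * 1e6)) % (2**32))
    return rng.integers(0, 1024, size=length)

def generate_acoustic_tokens(semantic: np.ndarray, conditioning: np.ndarray) -> np.ndarray:
    # Stage 2 (hypothetical): a second model expands semantic tokens into fine-grained
    # "acoustic" tokens, i.e. the codec codes that describe the actual sound.
    rng = np.random.default_rng(int(semantic.sum()) % (2**32))
    return rng.integers(0, 1024, size=len(semantic) * 4)

def soundstream_decode(acoustic: np.ndarray, sample_rate: int = 24_000) -> np.ndarray:
    # Placeholder decoder: the real SoundStream decoder reconstructs a 24 kHz waveform
    # from its codec tokens; here we simply return silence of a plausible length.
    return np.zeros(len(acoustic) * (sample_rate // 100), dtype=np.float32)

caption = "a soothing violin melody backed by a distorted guitar riff"
text_emb = mulan_text_embedding(caption)
semantic = generate_semantic_tokens(text_emb)
acoustic = generate_acoustic_tokens(semantic, text_emb)
waveform = soundstream_decode(acoustic)
print(f"Generated {len(waveform) / 24_000:.1f} s of (placeholder) audio at 24 kHz")

The point of the sketch is simply the separation of concerns: one stage decides what the music should be about, a second stage decides how it should sound, and a neural audio codec turns those codes into a waveform.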
Google claims that MusicLM surpasses previous AI music generators in both audio quality and adherence to text descriptions. On the MusicLM demo page, Google provides plenty of examples of the AI model in action, creating audio from “rich captions” that describe the feel of the music and even vocals (which, so far, come out as gibberish). Here is an example of a rich caption they provide:
Reggae song with a slow tempo, bass, and drums. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back, very expressive feel.
Google is also showing off MusicLM’s “long generation” (creating five-minute pieces of music from a single prompt), “story mode” (which takes a sequence of text prompts and turns it into a morphing series of musical pieces), “text and melody conditioning” (which takes hummed or whistled audio input and modifies it to match the style described in a prompt), and generating music that matches the mood of captions of paintings.
A block diagram of the MusicLM AI music generation model, taken from its academic paper.
Further down the examples page, Google dives into MusicLM’s ability to recreate particular instruments (e.g. flute, cello, guitar), different genres of music, various levels of musician experience, places (e.g. escape from prison, gymnasium), time periods (a club in the 1950s), and more.
AI-generated music isn’t a new idea, but AI music generation methods of previous decades often created musical notation that was then played by hand or through a synthesizer, whereas MusicLM generates the raw audio of the music itself. Also, in December, we covered Riffusion, a hobbyist AI project that can similarly create music from text descriptions, though not in high fidelity. Google references Riffusion in its MusicLM academic paper, saying that MusicLM surpasses it in quality.
In the MusicLM paper, its creators describe the potential impacts of MusicLM, including “potential misappropriation of creative content” (i.e. copyright issues), potential biases against cultures underrepresented in the training data, and potential issues of cultural appropriation. As a result, Google stresses the need for more work on addressing these risks and is holding back the code: “We have no plans to release any models at this time.”
Google researchers are already considering future improvements: “Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus. Modeling the music at a higher sample rate is an additional goal.”
It’s probably no exaggeration to suggest that AI researchers will continue to improve music generation technology until anyone can create studio-quality music in any style simply by describing it, although nobody can yet predict exactly when that goal will be reached or how it will affect the music industry. Stay tuned for future developments.