In short
The Atlantic has created a searchable database of music datasets tied to AI training, surfacing millions of tracks and raising fresh questions about consent and licensing. The project makes it easier for artists and the public to see how far-reaching AI training data has become.
- The Atlantic built a searchable database of four music datasets tied to AI training.
- Two of the collections reportedly contain 12 million and 9 million tracks.
- Some datasets rely on link lists from platforms like YouTube and Spotify.
- Google and Stability have both reportedly cited some of the datasets in research.
- The project raises new questions about licensing, consent and platform rules.
The argument over AI training data has often centered on books, news articles, images and code. Music, however, has become another major battleground — and one that is far less visible to the public. The Atlantic has now put that hidden layer on display with a searchable database that lets readers explore four large music datasets being used, or potentially used, to train artificial intelligence systems.
Reported by Atlantic journalist Alex Reisner, the project is notable not just because it names individual songs and artists, but because it reveals the scale at which music has been collected, assembled and repurposed for model training. Two of the datasets contain staggering numbers of tracks — roughly 12 million and 9 million songs — while the other two each hold more than 100,000 tracks. For musicians, rights holders and AI developers, the archive raises difficult questions about consent, licensing, platform rules and the future economics of creative work.
The new database does not settle the legal status of any one dataset. Instead, it makes the broader landscape easier to inspect. That matters because the use of music in AI development has largely happened out of public view, with many datasets distributed across research repositories, link lists and archived collections that are difficult for non-specialists to track.
Why The Atlantic’s database matters
For years, debates about generative AI have focused on whether companies trained models on protected works without permission. The discussion has usually centered on text and images because those examples are easier to illustrate and, in many cases, easier to find. Music training data has been harder to document, even though it can be just as valuable to AI developers building systems that generate audio, recommend songs, classify sound, or support creative tools for musicians.
Reisner’s project changes that by giving the public a way to search through song titles, artists and dataset names rather than relying on vague descriptions buried in technical papers or dataset documentation. The result is a more concrete picture of what is feeding AI systems, and whose work may have been included.
The database is also important because it helps bridge a gap between technical research and public understanding. AI companies often discuss training data in broad terms, while creators ask more specific questions: Which songs were used? Were they licensed? Were artists compensated? Did the developers respect platform terms? A searchable archive can’t answer every one of those questions, but it can make them harder to ignore.
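To make the idea of a searchable archive concrete, here is a minimal sketch of a lookup across title, artist and dataset name. It is not The Atlantic's implementation, which the reporting does not describe. The `TrackRecord` type, the `search` function and every record are hypothetical placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackRecord:
    """One row of a hypothetical index: a track and the dataset listing it."""
    title: str
    artist: str
    dataset: str


# Placeholder rows only; these are not real entries from any dataset.
RECORDS = [
    TrackRecord("Example Song One", "Example Artist", "Dataset A"),
    TrackRecord("Another Example", "Example Artist", "Dataset B"),
    TrackRecord("Example Song Two", "Different Artist", "Dataset A"),
]


def search(records, query):
    """Return records whose title, artist or dataset contains the query.

    Matching is a case-insensitive substring test; an empty query
    returns no results rather than every record.
    """
    q = query.strip().casefold()
    if not q:
        return []
    return [
        r for r in records
        if any(q in field.casefold() for field in (r.title, r.artist, r.dataset))
    ]


matches = search(RECORDS, "example artist")
for record in matches:
    print(f"{record.artist}: {record.title} ({record.dataset})")
```

Even a lookup this simple answers only the first of the creators' questions, whether a track is listed. Whether it was licensed or paid for is not something an index of titles can show.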
The scale of the datasets
The most striking detail in The Atlantic’s reporting is the sheer size of the collections Reisner identified. Two datasets alone reportedly contain 12 million and 9 million tracks, placing them among the largest music resources ever tied to AI development. The smaller collections are still substantial, each exceeding 100,000 songs.
That scale matters for more than simple volume. Large datasets can shape how a model learns musical structure, genre patterns, instrumentation, melody, rhythm and production style. In other words, the choice of training data is not incidental — it can determine the output style and capabilities of the model itself.
Here is a quick summary of the four datasets Reisner surfaced. The A to D labels are placeholders, since the datasets are not named here:
| Dataset group | Approximate size | Notable detail |
|---|---|---|
| Dataset A | 12 million tracks | One of the largest collections identified |
| Dataset B | 9 million tracks | Another massive archive linked to AI training |
| Dataset C | More than 100,000 tracks | Smaller, but still a major corpus |
| Dataset D | More than 100,000 tracks | Another significant training set |
Those figures are important because they suggest the use of music in AI is not a niche experiment. It is happening at industrial scale.
What kind of music is in the datasets?
Reisner’s database includes a broad range of artists, spanning mainstream pop, indie rock, hip-hop and experimental electronic music. Among the names that appear are Lady Gaga, Fred Again.., Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen and experimental composer Hainbach.
That mix illustrates an important reality: AI training sets do not only include obscure material. They can also contain highly recognizable, commercially valuable recordings by globally known artists. For rights holders, that raises the stakes considerably. If a dataset includes chart-topping hits or signature catalog tracks, it can influence both legal exposure and the optics of AI development.
It also underscores a broader issue in modern data collection. Once music is scraped, linked or archived at scale, the line between “available online” and “available for AI training” can become blurred, especially when datasets are redistributed through research channels that are not intuitive to the average listener or creator.
How the data was gathered
The Atlantic’s reporting suggests that some of the music collections are not simply bundles of downloadable files. In three of the four cases, the datasets are presented as lists of links to tracks hosted on major platforms such as YouTube or Spotify. AI developers then use automated tools to retrieve the audio itself.
This distinction is critical. A link list may look harmless on paper, but the downstream process can involve software that bypasses normal platform features — including ads, logins or other mechanisms that help fund creators and streaming services. According to Reisner’s explanation, such tools can violate the terms of service of the platforms involved.
That creates a messy intersection between AI training, platform governance and music rights. A developer might argue that the dataset is publicly available. A platform might say that the automated harvesting violates its rules. A creator may say that the underlying use of the music was never authorized in the first place.
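The difference between a link list and a bundle of audio files can be illustrated with a short sketch. The entries below are hypothetical placeholders, not rows from any dataset The Atlantic identified, and the `count_by_host` helper is an assumption made for illustration. The code only inspects the URLs themselves and retrieves nothing.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical link-list entries: the dataset stores references, not audio.
# The track IDs are placeholders and do not point at real content.
LINK_LIST = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://open.spotify.com/track/EXAMPLE_ID_2",
    "https://www.youtube.com/watch?v=EXAMPLE_ID_3",
]


def count_by_host(urls):
    """Count how many entries point at each hosting platform."""
    return Counter(urlparse(url).netloc.lower() for url in urls)


hosts = count_by_host(LINK_LIST)
for host, count in hosts.most_common():
    print(f"{host}: {count}")
```

A file like this contains no music at all, which is why it can look harmless. The contested step is the separate retrieval of the audio behind each link.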
Why link-based datasets are controversial
Link-based datasets are controversial because they can appear to sidestep direct copying while still enabling large-scale extraction. From a developer’s perspective, a list of URLs can be a practical way to organize source material. From a creator’s perspective, it may feel like a workaround that ignores the spirit, if not the letter, of existing protections.
For regulators and courts, these cases are especially difficult because the legal status can depend on jurisdiction, platform policy, the exact method of retrieval and how the material is ultimately used. The Atlantic’s database does not resolve those questions, but it makes them easier to investigate.
Reisner’s reporting highlights a central concern: some AI developers may be using tools that automatically pull audio from platforms in ways that can bypass ads, logins and other creator-supporting mechanisms.
Google, Stability and the research pipeline
One of the most revealing points in the reporting is that, according to The Atlantic, Google and Stability have each acknowledged using some of the identified datasets in research papers. That acknowledgment is important because it connects the collections to major industry players rather than leaving them as obscure repositories with unclear relevance.
Even when a company uses a dataset for research rather than a commercial product, that does not eliminate controversy. Research training can still shape model capabilities, inform future commercial systems and normalize the use of large-scale collections that creators may never have intended to support.
This also reflects the way AI development often works in practice. Datasets appear in academic or semi-academic contexts first, then become embedded in more widely deployed tools later. By the time the public notices, the underlying data pipeline may already be deeply established.
What this means for musicians and rights holders
For artists, songwriters, labels and publishers, the existence of a public search tool is both useful and unsettling. It offers visibility into a process that has often seemed opaque, but it also lends weight to a suspicion many creators already had: that their work is being absorbed into AI systems at massive scale, sometimes without direct permission.
The practical implications are broad:
- Artists may want to know whether their catalog appears in training datasets.
- Labels and publishers may use the database to assess potential infringement or licensing gaps.
- Legal teams may rely on the information to build cases or evaluate settlement strategies.
- Policy advocates may cite the data in debates over AI transparency and compensation.
The database also highlights a core tension in the music industry. Streaming taught the public that songs could be accessed instantly and globally, but AI training goes a step further: the value of a track may now extend beyond listening into machine learning, recommendation systems and generative outputs. That widens the debate from royalties alone to the larger economics of data extraction.
From streaming rights to AI rights
Music licensing was already complicated before the AI boom. Streaming services, synchronization rights, mechanical royalties, neighboring rights and international agreements have long formed a dense legal landscape. AI adds another layer by introducing a new kind of use case: not public playback, but model training.
That difference matters. In many cases, a service may have the right to stream a song to listeners but not to repurpose it as machine learning input. The question now is whether existing copyright frameworks can handle that distinction or whether lawmakers will need to build new rules specifically for AI.
The technical challenge behind music AI
Music is a particularly demanding domain for AI systems because it is structured yet highly expressive. A model may need to learn tempo, harmony, timbre, genre conventions, lyrical patterns and production techniques all at once. That makes large, diverse datasets especially attractive to developers.
But the technical usefulness of a dataset does not erase the ethical or legal questions around how it was assembled. In fact, the better a collection performs, the more incentive there is to gather data aggressively. That is one reason the fight over training sets has become so intense across the broader AI industry.
Unlike text, which can be handled in plain files, music datasets often depend on metadata, link structures and access layers that are harder to inspect. The Atlantic’s project reduces some of that opacity by surfacing specific names and songs, but it also exposes how much of the AI pipeline remains outside ordinary public scrutiny.
Timeline of the reporting and its implications
The Atlantic’s search tool is the latest development in a longer arc of scrutiny over AI data practices. The following timeline shows, in broad strokes, how the issue has evolved:
| Period | Development | Significance |
|---|---|---|
| Early AI model era | Researchers increasingly rely on large datasets scraped from the web | Training data practices become central to model performance |
| Generative AI boom | Creators begin questioning whether their work was used without permission | Copyright and consent debates intensify |
| Recent reporting | The Atlantic identifies four music datasets tied to AI training | Music enters the public training-data debate more visibly |
| Current moment | A searchable database makes the collections publicly inspectable | Transparency and accountability become harder to avoid |
That sequence suggests the industry is moving from abstraction to documentation. Once the datasets can be searched, they become part of a public record, not just a technical footnote.
Transparency may change the conversation, but not the conflict
Greater visibility is an important step, but it is unlikely to calm the dispute. If anything, public databases can sharpen it by giving creators concrete evidence to point to. That can lead to demands for compensation, licensing reform, takedown requests or new disclosure obligations for AI companies.
At the same time, developers may argue that training on large, diverse collections is essential for innovation, and that strict limits could slow research or make it impossible to build competitive systems. That argument is not new, but transparency makes it easier to test against real-world examples rather than abstract principles.
The heart of the conflict is no longer whether AI systems need data. They do. The question is what counts as fair access to cultural work, who should be paid, and what obligations AI developers should have when they use creative material at scale.
What comes next
The release of The Atlantic’s database is likely to become a reference point in future debates about music and AI. It gives journalists, lawyers, policymakers and creators a practical tool for examining the overlap between model training and copyrighted music.
Several developments are likely to follow:
- More scrutiny of datasets referenced in AI research papers.
- Greater pressure on platforms to explain how their content is accessed by third-party tools.
- Renewed calls for licensing systems that cover AI training explicitly.
- More artists checking whether their work appears in public or semi-public training collections.
At a minimum, the project shifts the conversation from speculation to evidence. It shows that the debate over AI is not just about text generators, image systems or chatbots. Music — one of the world’s most valuable forms of creative expression — is also part of the training-data economy.
And now, in a more accessible form than before, the public can see some of the tracks that may have helped teach machines how to listen, categorize and create.
The larger lesson
The Atlantic’s searchable database is more than a media feature. It is a reminder that the AI boom rests on vast, often invisible collections of human-made work. For music, that invisible layer may include songs people streamed, purchased, uploaded or archived under assumptions that had nothing to do with machine learning.
By making those collections searchable, Reisner has turned a hidden infrastructure problem into a public conversation. Whether that leads to better licensing, new regulation or more aggressive legal fights remains unclear. What is clear is that the debate over AI training data has moved another step closer to the music industry’s core catalog — and to the creators whose names are now appearing inside the machines.









