Pierre-Carl Langlais
Why LLM developers have to open their data (again)
Current closed and open LLMs alike do not disclose their training data, largely because of the risks and liabilities associated with training on copyrighted content. Recent developments in large open datasets with permissive licenses, along with new demands for regulation and reproducibility, are pushing for a change. This blog post discusses the landscape and importance of open data in the ML ecosystem.
The release of Llama 3 is a breakthrough for open LLMs: a model that can be hosted anywhere with few license restrictions is now nearing the capabilities of frontier models like GPT-4 or Claude 3. Provided they have enough computing power, any person or organization can host their own powerful version of ChatGPT.
Despite their massive advantages for end use, open-weight models like Llama, Mistral, or Qwen still fall short on the other dimensions of openness. A language model is not just a set of parameters. It’s complex scientific infrastructure that intermingles data, code, and architecture.
Data is the most egregious case. If they exist at all, sections on "training data" only mention in passing that the model was trained on vague "publicly available data" or a wide selection of "books, websites, and code". The authors of GPT-4 explicitly state that "competitive and safety considerations" outweigh "the scientific value of further transparency" (p. 2).
Enhancements to LLMs are largely attributed to "better data". And yet, we know next to nothing about the training sets: where do they come from? Only Common Crawl, or additional sources? What has been selected? According to which criteria? Which languages are represented?
Given that LLMs are largely “cultural models,” these are not just technical questions. They determine the model's nature, biases, and impact on society at large.
From a culture of openness to trade secrets
Training data has not always been closed. LLM research is, in fact, one of the few fields where open and transparent norms have regressed. By 2024, the open science movement had gradually expanded to include a wide variety of research artifacts: publications, data, code, reviews, and intermediary processes. Open science has repeatedly been shown to benefit science, through enhanced reproducibility, and society as a whole, as research circulates freely beyond specialized academic circles.
In 2018-2020, frontier models like BERT, GPT-2, or T5 were extensively documented, to the point where they could be cited as positive examples of open science. Researchers from universities and private labs like Google or OpenAI released not only the model weights but also the training code, intermediary documentation, and even the dataset used for training, or, at the very least, enough information to reconstruct it. This openness largely contributed to the quick integration of a model like BERT into major NLP pipelines and industrial processes.
Fast-forward a few years, and major LLM research papers have become secretive. The big releases from Google, OpenAI, Anthropic, or even committed open-weight companies like Mistral are essentially covered by "non-papers" that say nothing about the details that actually matter: the data used for training, the architecture of the model, the hyperparameters.
The rising copyright problem of LLM data
There is a common explanation for the lack of data transparency in LLM training: the datasets are so big that models can only be trained on unreleasable data. Llama 3 was trained on 15 trillion tokens, and likely as much, if not more, for GPT-4 (we don't even know the size!). This is big enough to fit 1,000-2,000 editions of the English Wikipedia. Consequently, the models have to draw on a wide range of problematic sources, most of them under copyright or outright pirated. And while it is still highly controversial whether a model can legally be trained on proprietary sources, releasing the data creates added layers of liability.
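As a back-of-envelope check (the token count assumed here for one English Wikipedia edition is my own loose estimate; it varies a lot depending on what is counted and which tokenizer is used):

```python
# Back-of-envelope: how many English Wikipedias fit in Llama 3's training budget?
# The 8-15B token range for one edition is an assumption, not an official figure.
llama3_tokens = 15e12             # 15 trillion tokens reported for Llama 3
wikipedia_tokens = (8e9, 15e9)    # rough range for one English Wikipedia edition

for wiki in wikipedia_tokens:
    print(f"~{llama3_tokens / wiki:,.0f} editions at {wiki / 1e9:.0f}B tokens each")
# -> roughly 1,000-2,000 editions, the order of magnitude cited above
```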
Even before questioning this line of reasoning, we should ask why it was considered admissible in the first place. Using pirated content would have been massively convenient for many industries in the past; it is simply something that is not done.
Back in 2015, the very first components of the LLM stack were trained exclusively on open content. I remember training my first word embedding model on a selection of 100 million words from Wikipedia (without a GPU!). Years later, I was introduced to the first real "proto-GPTs", LSTMs, through tutorials on public-domain texts from 19th-century philosophers. Researchers, engineers, early users, and companies strove to use open, shareable content… until they stopped caring.
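For readers who never touched that generation of tools, here is a minimal sketch of such a word-embedding run with gensim. The corpus file name and the hyperparameters are illustrative assumptions, not the original 2015 setup:

```python
# Minimal word-embedding training sketch (illustrative, not the original setup).
# Assumes a plain-text Wikipedia extract, one sentence per line, in
# "wikipedia_sample.txt".
from gensim.models import Word2Vec

with open("wikipedia_sample.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore rare words
    workers=4,         # CPU threads: no GPU needed
)

# Inspect nearest neighbours of a word present in the sample vocabulary
print(model.wv.most_similar("philosophy", topn=5))
```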
It was a slow drift: questionable data sources were first used with many precautions, then the precautions were gradually lifted and the sources repurposed as "free" content that was merely accessible. Two tales are emblematic:
BookCorpus is a compilation of self-published e-books from SmashWords.com. While the non-professional authors provided the books for free, they never used a free license that would allow republication. In 2015, about 10,000 works were randomly selected for a sentence similarity task, under the unclear and erroneous claim that they were "free books". In 2018, BookCorpus was one of the two main corpora behind BERT, alongside the English Wikipedia, and despite now being dwarfed by massive pre-training datasets, it still seems to be in use for "quality" training stages (late pre-training data, fine-tuning, etc.).
Web archives were primarily intended for long-term preservation and were naturally covered by fair use and similar exceptions. In the early 2010s, web archives started to be used for training on "transformative" derivatives, like ngrams. Ngrams do not make it possible to recreate the original text and, provided they are completely shuffled, can be shared without copyright concerns while remaining of great value for classification use cases (see the sketch below). In 2018, OpenAI started to experiment with building a filtered version from large collections of web archives: WebText contains 8 million "qualitative" documents, selected from links that received at least 3 karma on Reddit. Web archives are now the absolute backbone of LLM pre-training.
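To make the "transformative" argument concrete, here is a minimal sketch of my own (not the historical pipeline) showing how a text can be reduced to shuffled ngrams: the counts remain useful for classification, but the original running text cannot be reassembled from them.

```python
# Minimal sketch of ngram extraction and shuffling (illustrative only).
# Shuffled ngrams keep lexical signal for classification while destroying
# the document ordering needed to reconstruct the source text.
import random


def shuffled_ngrams(text: str, n: int = 3) -> list[tuple[str, ...]]:
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    random.shuffle(ngrams)   # destroy the original ordering before sharing
    return ngrams


sample = "Web archives were primarily intended for long-term preservation of the web"
for ngram in shuffled_ngrams(sample)[:3]:
    print(" ".join(ngram))
```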
In both cases, the increasing sophistication of model training eroded the guardrails put in place to avoid potential misuse and blurred the meaning of open and usable training data. This was driven not only by the need for more data but for more "expansive" data as context windows lengthened: not just ngrams or short sentences, but full texts. Since GPT-3, LLMs have needed full samples of thousands of words, which would not fit into any copyright exception for short quotations.
The copyright issue goes beyond "grey" areas like web archives. Many rumors circulate about shadow libraries like LibGen or Anna's Archive and other pirated content being used as sources for major LLMs. This is especially the case for scientific content (coded as a "STEM" corpus in the little public information released by LLM companies), which provides a major source of reasoning data. With less legal exposure, Chinese LLM developers like DeepSeek openly admit to training on 800,000 Chinese scientific books from Anna's Archive. Once more, the lengthening of context size must be a major incentive: web archives are poor in long texts, and models able to ingest 1 million tokens are hungry for books.
Building a pre-training commons
In March 2024, PleIAs coordinated the release of Common Corpus, the largest available open corpus for pre-training to date: about 500 billion words in a wide variety of European languages. This is already sufficient to train a model like Llama 2, since corpora are apparently repeated several times during pre-training (again, something we can only guesstimate, not know!).
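A rough sanity check of that claim (the tokens-per-word ratio is an approximation of mine; Llama 2's reported training budget is about 2 trillion tokens):

```python
# Rough sanity check: can ~500B open words cover a Llama 2-scale training run?
# The tokens-per-word ratio is an assumed average, not a measured value.
common_corpus_words = 500e9
tokens_per_word = 1.3            # rough average for English-like text
llama2_budget_tokens = 2e12      # ~2 trillion tokens reported for Llama 2

corpus_tokens = common_corpus_words * tokens_per_word
print(f"Corpus size: ~{corpus_tokens / 1e12:.2f}T tokens")
print(f"Repetitions needed: ~{llama2_budget_tokens / corpus_tokens:.1f} epochs")
# -> roughly 3 passes over the corpus, consistent with how often pre-training
#    data appears to be repeated in practice
```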
Using exclusively content under a permissive license is not only an ethical commitment but a major scientific initiative to ensure reproducible, high-quality research on LLMs. Until now, released pre-training collections have always been vulnerable to the potential liabilities associated with publishing copyrighted content: in the summer of 2023, one of the most popular accessible pre-training datasets, The Pile, was removed following DMCA notices.
The lion's share of open content is made up of documents with expired copyright ("public domain") or produced for public use ("open data" in Europe and the older "federal public domain" in the United States): this is not just a few large-scale projects, but massive amounts of text simply lying there, waiting for years to be collected and properly processed.
Other initiatives will reinforce this emerging "pre-training data commons" in the months to come. Common Corpus will be significantly expanded, as a large share of available text is still awaiting release until its copyright status can be thoroughly checked. EleutherAI is to release a new version of The Pile with a major focus on permissively licensed content. Other ongoing initiatives are being prepared by Allen AI, Together AI, Cohere, and Spawning AI.
At this point, it is fairly obvious that there is enough open content online to train a model like GPT-4 or Llama 3: 3-5 trillion tokens, repeated 3-4 times, amounts to roughly 12-20 trillion training tokens, in the range of Llama 3's reported 15 trillion. The shocking thing is not that this is possible but that it has never been attempted. All this open content has been hiding in plain sight for years.
Conclusion
The recent emergence of an open data movement in LLM research is an important development that will increase the reproducibility of model training and favor better scientific standards of data use. It is also a crucial step toward ensuring the social acceptability of generative AI and its integration into existing norms and regulations. The expansion of fully open, permissively licensed datasets is a potential paradigm shift that could limit the questionable use of copyrighted content and bring back much-needed transparency over model training.
As the landscape of LLM development evolves, it is crucial for the industry to commit to ethical and responsible AI practices. One key aspect of this is the pre-training of foundational models. However, pre-training is not the only stage that matters; ethical concerns affect the entire lifecycle of a model. With the rise of fine-tuning, adapting existing models to specific tasks and knowledge domains is equally important. There are several expert data providers in this field, and Toloka is one of them: they develop sophisticated technologies for collecting high-quality datasets and performing in-depth evaluations of LLMs responsibly. Book a demo if you're interested.