AI Takes a Step Towards Ethical Training with Public Domain Data

In an era where artificial intelligence (AI) continually reshapes the boundaries of technology and creativity, a groundbreaking development emerges, challenging long-held assumptions about the essential materials for training AI models. The heart of this evolution lies within the Common Corpus initiative, marking a monumental stride towards ethical AI development without infringing on copyrighted content. This initiative symbolizes a leap in AI practices and rekindles debates surrounding copyright issues in AI model training.


Traditionally, the AI industry, represented by leading entities such as OpenAI, argued that including copyrighted materials was indispensable for developing advanced AI models. This stance, made clear in a statement to the UK Parliament in 2023, ignited a series of legal and ethical discussions. Yet the landscape began to shift with the introduction of the Common Corpus, a collaborative effort led by Pleias alongside a consortium of researchers focused on LLM pretraining, AI ethics, and cultural heritage. The project unveiled the most extensive public domain dataset dedicated to training large language models (LLMs), demonstrating that high-quality AI models can be developed without relying on copyrighted materials and offering a new paradigm for AI development.


The Common Corpus stands as a beacon of diversity and multilingual support, underscoring the potential of ethical AI training methodologies. In parallel, Fairly Trained, a pioneering non-profit organization within the AI realm, has awarded its first certification for an LLM developed without any copyright infringement to KL3M. Behind KL3M is 273 Ventures, a Chicago-based startup whose legal tech consultancy expertise has proven that models like KL3M can be developed in adherence to fair AI practices. Ed Newton-Rex, CEO of Fairly Trained, has voiced his belief in the feasibility of training LLMs ethically, marking a significant endorsement of copyright-compliant AI development.


The certification process led by Fairly Trained scrutinizes adherence to copyright law during the AI model training phase. A shining example of this commitment is the Kelvin Legal DataPack, an expansive collection of legal documents curated by 273 Ventures to ensure copyright compliance. Although smaller than the datasets amassed by industry giants, its careful curation and adherence to legal standards have benefited the performance of the model trained on it, showcasing the effectiveness of meticulously curated datasets in powering AI models.


273 Ventures’ KL3M model, fueled by the Kelvin Legal DataPack, is a testament to the potential locked within copyright-compliant datasets. The startup is now at the forefront, offering access to this groundbreaking resource, highlighting the growing demand for and recognition of ethically trained AI models. The development and certification of the KL3M model, alongside the creation of the Common Corpus, signify a pivotal shift towards more responsible AI practices, advocating for the rights of creators while fostering innovation.


This movement towards fairer AI is not confined to LLMs alone. Fairly Trained’s recent certifications extend to various applications, from voice-modulation technologies to AI-powered music bands, indicating a broader application of ethical training principles across the AI spectrum. However, challenges remain, especially concerning the reliance on public domain data, which may not always provide the most current information because of the long copyright terms in regions like the US.


The Common Corpus initiative and the achievements of projects like KL3M represent more than just technical milestones; they embody a growing consciousness within the AI community about the importance of ethical, copyright-compliant development practices. As these initiatives continue to gain traction, they challenge existing norms and open new avenues for innovation, setting a new standard for the AI industry at large. The conversation around the use of copyrighted material in AI training is evolving, with these developments highlighting the balance between innovation and ethical responsibility.

Source: Marktechpost

