Machine learning systems are profoundly influenced by the methods of data collection and labelling used to create them. Yet little research has examined how training data is constructed and used. Since our first Data Genesis workshop in 2017, the AI Now Institute has been developing new approaches to study and understand the role of training data in the machine learning field. Key research questions include: What type of information is used as training data? Who generates and collects it, and for what purpose? What segments of society does it reflect? Who and what does it exclude? And how does that affect the functioning of AI systems themselves?

The Data Genesis program’s goal is to answer and demystify these questions through three core components:

– Archiving and analyzing the origin and construction of key datasets that serve as foundations for today’s AI systems;
– Producing visualizations, maps, and other designs to help crystallize and contextualize what this data is and what it means to communities, practitioners, companies, and policymakers; and
– Convening experts from across disciplines to help build a field around this topic.

The rapid proliferation of AI into social and political contexts demands a thorough understanding of the data that these systems are trained on, including the biases and flaws this data may encode. Our Data Genesis program will investigate the complex foundation on which AI is built and will call into question the perception of AI as a magical force that is superior to human judgement.

This project is funded by the Alfred P. Sloan Foundation. It will be overseen by AI Now Co-Founders Kate Crawford and Meredith Whittaker, with support from a group of forthcoming hires, including a Research Lead, a Researcher, and a Designer, among others.

We are currently accepting applications for the Research Lead role; you can learn more and apply [here](