Illustration by Somnath Bhatt

Contextualizing ‘Open Data’ and AI

A guest post by Mehtab Khan. Mehtab is a Resident Fellow at the Yale Information Society Project, and Lead for the Yale/Wikimedia Initiative on Intermediaries and Information. Twitter: @Mehtabkn

This essay is part of our ongoing “AI Lexicon” project, a call for contributions to generate alternate narratives, positionalities, and understandings to the better known and widely circulated ways of talking about AI.

The “open” nature of data is often seen as vital to supporting digital technologies by connecting people and generating applications.¹ Open-source code helps people collaboratively build technology. Openness is necessary for a decentralized and equitable internet. Open-access information allows people to share knowledge freely. Open-source frameworks like Google’s TensorFlow and Facebook’s PyTorch allow anyone to share and build on algorithms. Open data has been especially crucial in developing responses to the COVID-19 pandemic.²

Despite the potential benefits of open data, there has been little research or discussion on the assumptions and applications of open data in the context of AI technologies, specifically how data is collected and made available. To provide an overview of the complexity of open data and the development of AI technologies, this essay explores four often overlooked areas: 1: How datasets are developed; 2: How datasets are made available and used; 3: Power imbalances and the Global South; and 4: Accountability mechanisms for open data.

1: How datasets are developed

The development and training of AI technologies often require large-scale datasets consisting of millions of documents, images, videos and words. For example, large-scale computer vision datasets (LSCVDs) such as ImageNet are used in a wide range of academic and industry contexts, and to train applications such as facial and object recognition.³

Open datasets are built by scraping “publicly available information” found on social media platforms and open knowledge sources like Wikipedia. While open datasets form the foundations of numerous AI applications, they also pose a number of privacy risks. Data collection, sharing, and analysis rely on a tradeoff between openness and individual privacy. While there are a number of technical measures to overcome some of the privacy risks, they are not foolproof.

Furthermore, “publicly available information” is highly contextual and often contested and negotiated by policymakers and dataset curators based on jurisdiction. A further complication in using “publicly available information” is that the privacy regulations pertaining to publicly available data differ widely across the US, the EU, and Global South countries. For example, a growing number of states in the US are now providing biometric privacy protection,⁴ and while the EU and US rely on the General Data Protection Regulation,⁵ Illinois’s Biometric Information Privacy Act,⁶ and the California Consumer Privacy Act,⁷ there are few Global South countries with equally resilient and uniform regulatory structures. Some have little to no data protection infrastructure.⁸ This has led to an inherent power imbalance between data collectors and those represented in the datasets,⁹ and means that the data collected from open sources and made openly downloadable invariably sidelines those without proper legal infrastructure to protect their privacy rights or to contest being a part of open data pipelines.

2: How datasets are made available and used

Even if open datasets do manage to address privacy concerns, there is little to stop people from developing and using the datasets in ways that cause harm. For example, ImageNet consists of millions of pictures of people, animals, and everyday objects, collected from the web and compiled into one comprehensive database. ImageNet is used for “object recognition,” meaning it is used to train AI to describe what an image depicts (e.g., a cat or dog). But ImageNet has also been used for more consequential applications like training systems that are used for military target identification and surveillance. There are well-documented instances of the dataset containing biased and discriminatory representations and labeling of millions of images. More recently, ImageNet was in the news when ImageNet Roulette, an app built on the dataset, applied racist, homophobic, and sexist labels to photos uploaded by users.¹⁰ Some of these labels associated faces with attributes like “wrongdoer” that had consequences for criminal justice applications.¹¹

Such issues with open access datasets are exacerbated when the datasets are used as “benchmarks” for the AI community. Taking the example of LSCVDs forward, ImageNet is a “benchmark” in the sense that it is used to evaluate how well an AI model has been trained and how it would perform, and therefore industry and academic researchers require easy access.¹² What is overlooked about benchmarks is the assumption that they are reliable measures of AI software performance,¹³ yet as discussed above, they are often discriminatory by nature and therefore flawed.

Building on the idea of dataset development, there are also issues with how large datasets are made available, and under what terms (if any). Many LSCVD benchmarks are proprietary, but some benchmark datasets are open and freely downloadable. They are available on platforms that host repositories of datasets, such as GitHub, as well as on individually hosted websites. Some LSCVDs are accompanied by Terms of Service (ToS) agreements, but others have no terms of any kind attached to the dataset before downloading. ImageNet, for instance, is currently openly available for anyone to download, with only a brief and somewhat vague terms-of-use notice.

Because of the contingencies involved in the dataset development process, the fact that many benchmark datasets are openly available with limited or no terms should warrant a more critical examination. The curators and collectors of data used to build LSCVDs make decisions at every step of the process, such as what data to include, how to navigate potential legal hurdles, and how to label and annotate the data. As illustrated above, collection and labeling decisions have consequences later when applications are being developed. There is a need to clarify the legal obligations of dataset curators and hosting websites in how datasets are developed and distributed in order to mitigate harms.

3: Power Imbalances and the Global South

While openly available datasets can be useful in building AI technologies, they may lead to perilous consequences for marginalized communities, especially in the Global South. For example, datasets may embed social hierarchies and exclude or marginalize the representations of certain groups. When benchmark datasets are made openly available without any regard to terms or restrictions, we make the context and decisions that went into a particular dataset the default. What is included and left out, and what is ignored and unquestioned, is a form of power held by the parties involved in the data pipeline. Scholars have explained that dataset development is a contextual process based on the practices and perceptions of those involved.¹⁴ When we overlook the social process of dataset development, we fail to adequately address the biases, hierarchies, and politics encoded in the final products.¹⁵

While the very nature of open data means that anyone anywhere can use the data, often with few stipulations, little attention is given to the ways in which powerful companies in the Global North use these datasets to develop surveillance technologies. Consider the use and sale of surveillance tools like Clearview AI’s proprietary software, which is being sold to countries around the world, including those in the Global South.¹⁶ For instance, Clearview AI’s tools have been sold to state agencies in the United Arab Emirates where there are discriminatory laws against the LGBTQ community.¹⁷

The proprietary nature of the technology combined with the lack of transparency over its development means that individuals in Global South countries can be both data subjects and targets of the resulting technology, and have little agency or recourse over its development and use. While there is an ongoing struggle to define the boundaries of Clearview AI’s liability in the US,¹⁸ the court cases are grounded in a unique US First Amendment tradition. It is not clear how, if at all, this would provide remedy for a person in a Global South country being surveilled using Clearview AI’s exported technology.

Large language models have also exposed the consequences of decontextualized open data used to build AI technologies. These models are built using openly available data from sites such as Reddit and Wikipedia,¹⁹ and used to train natural language processing applications. OpenAI, which has made large-scale datasets open source in the past, initially opted not to release its full GPT-2 model over concerns about “malicious” use of the technology.²⁰ Its successor, GPT-3, was initially found to make racist and misogynistic associations, with prompts involving “Muslim” leading to text insinuating “violence.”²¹

4: Accountability mechanisms for open data

How do we build accountability into datasets and systems that purportedly prioritize transparency and ease of download? As a community of researchers who use these datasets, we should be asking why certain classifications exist, who they serve, and who holds the power to classify people and objects within open data systems. Designers and engineers need to show their work and clarify the motivations, routines, and norms of creating classifications and annotations. For instance, scholars have recommended requiring dataset curators to fill out datasheets in the process of dataset development, including reflexive questions such as whether a dataset identifies any subpopulations.²²

More openness does not necessarily lead to equitable outcomes for all people, especially marginalized and persecuted groups in Global South countries who are excluded from critical decision-making processes. Wikipedia, which constitutes a major source of data for datasets, has well-documented issues with representation of minority groups, languages, and communities.²³ Relatedly, a salient example of the harmful effects of openness on marginalized groups is the appropriation of traditional and indigenous knowledge, which is sometimes facilitated by open licenses. For example, when cultural depictions, such as art or music, are deemed open by virtue of an open license, the indigenous community loses control and may not be able to preserve the meanings and applications of that knowledge.²⁴ On the internet, it is easier for traditional knowledge to be mislabeled as being part of the public domain by default, erasing any form of agency that a community could have exercised over a piece of knowledge.²⁵ By extension, the concern with open datasets is that they may appropriate, cover, or obfuscate contextual knowledge and information from communities that lack agency over its collection and use.

Openness is only as egalitarian as the infrastructures it operates on. If a person wants to contest the use of their data in an LSCVD, they would ordinarily rely on legal recourse, especially in the absence of terms of use. But this recourse is only available to those with a corresponding national legal framework and the resources to access the legal system. Transparency over what ImageNet is used for might instigate both judicial scrutiny and third-party auditing in the US, but not equally so in a country without the legal and technical infrastructures to do either. Does an audit conducted in the US for a US-based entity hold the same weight as an audit conducted in a Global South country? How is a dataset ‘open’ if the people included in it are closed to meaningful action when harmed?

These points of inquiry should lead us to more critical discussion of what “open” means in the context of AI technologies, and make us reflect on how openness becomes a co-opted phrase, and by whom. Openness is a term of use that facilitates the development of AI technologies, whether it is through open license terms, open code sharing, or open downloads. Openness is also a structural feature of the AI ecosystem, shaping collection, curation, training, and accountability. Openness is both a means to exacerbate existing inequities in the global AI system and a way to collaborate and innovate, and acknowledging these tensions can help us move towards more informed and inclusive conversations about resolving them.

The pressing challenge now is to negotiate and build accountability measures within this ecosystem. It remains to be seen how the Clearview AI lawsuit will proceed and the implications it will have for privacy, domestically and internationally. However, the issues with the open AI ecosystem go beyond privacy fixes. We need to acknowledge broader contexts of openness and engage more with how to create and facilitate inclusive accountability measures for communities around the world.


[1] Jennifer Yokoyama, “Closing the Data Divide: the Need for Open Data”, Microsoft On the Issues, May 4, 2021; Data Europe, “AI and Open Data: a crucial combination”, July 4, 2018.

[2] For example, the development of effective vaccines can be credited in part to early data sharing among scientists around the world: Scientific Data Collection, “Open data in the COVID-19 pandemic”, Aug 27, 2020.

[3] Jason Brownlee, “A Gentle Introduction to the ImageNet Challenge (ILSVRC)”, May 1, 2019.

[4] Kyle Wiggers, “AI has a privacy problem, but these techniques could fix it,” Dec 21, 2019.

[5] General Data Protection Regulation.

[6] Illinois Biometric Information Privacy Act.

[7] California Consumer Privacy Act.

[8] UNCTAD, Data Protection and Privacy Legislation Worldwide.

[9] Shira Ovide, “The Internet is Splintering,” Feb 17, 2021.

[10] Julia Carrie Wong, “The viral selfie app ImageNet Roulette seemed fun — until it called me a racist slur”, Sept 18, 2019.

[11] Xiaolin Wu and Xi Zhang (2016) Automated Inference on Criminality using Face Images; Lutz Finger, “It’s the Data Stupid! Why AI Might Get It Wrong”, May 28, 2020.

[12] Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, Morgan Klaus Scheuerman (2020) Bringing the People Back In: Contesting Benchmark Machine Learning Datasets.

[13] Alex Hanna, Emily Denton, Razvan Amironesei, Andrew Smart, Hilary Nicole, “Lines of Sight”, Logic Magazine, Dec 20, 2020.

[14] Madeleine C. Elish & danah boyd (2018) Situating methods in the magic of Big Data and AI, Communication Monographs, 85:1, 57–80, DOI: 10.1080/03637751.2017.1375130.

[15] Kate Crawford and Trevor Paglen. 2019. Excavating AI: The Politics of Images in Machine Learning Training Sets; Catherine D’Ignazio and Lauren F. Klein. 2020. Data feminism. The MIT Press, Cambridge, Massachusetts.

[16] Corinne Reichert, “Clearview AI probed over facial recognition sales to foreign governments,” March 3, 2020.

[17] Ryan Mac, Caroline Haskins, Logan McDonald, “Clearview’s Facial Recognition App Has Been Used By The Justice Department, ICE, Macy’s, Walmart, And The NBA”, Feb 27, 2020.

[18] Jameel Jaffer and Ramya Krishnan, “Clearview AI’s First Amendment Theory Threatens Privacy — and Free Speech, Too”, Nov 17, 2020.

[19] Khari Johnson, “OpenAI and Stanford researchers call for urgent action to address harms of large language models like GPT-3”, Feb 9, 2021. (Large amounts of text are often scraped from sites like Reddit or Wikipedia.)

[20] OpenAI Blog, “Better Language Models and Their Implications”, Feb 14, 2019.

[21] Khari Johnson, “OpenAI and Stanford researchers call for urgent action to address harms of large language models like GPT-3”, Feb 9, 2021.

[22] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010 [cs] (March 2020).

[23] Carwil Bjork-James (2021) New maps for an inclusive Wikipedia: decolonial scholarship and strategies to counter systemic bias, DOI: 10.1080/13614568.2020.1865463.

[24] Mehtab Khan, “Traditional Knowledge and Creative Commons: White Paper”, (Sept 2018).

[25] Mehtab Khan, “Traditional Knowledge and the Commons: The Open Movement, Listening, and Learning”, Creative Commons Blog, Sept 18, 2018.