Let’s begin by removing ‘black box’ algorithms from core public agencies

Today we released our second annual research report on the state of artificial intelligence. Since last year’s report, we’ve seen early-stage AI technologies continue to filter into many everyday systems: from scanning faces at airport security, to recommending whom to hire, to granting someone bail, to denying someone a loan. This report was developed for our annual AI Now Experts’ Workshop, which included 100 invited researchers across relevant domains, and it reflects a range of views that were discussed at the event.

While AI holds significant promise, we’re seeing serious challenges in the rapid push to integrate these systems into high stakes domains. In criminal justice, a team at ProPublica, and multiple academics since, have investigated how an algorithm used by courts and law enforcement to predict recidivism in criminal defendants may be introducing significant bias against African Americans. In a healthcare setting, a study at the University of Pittsburgh Medical Center observed that an AI system used to triage pneumonia patients was missing a major risk factor for severe complications. In the education field, teachers in Texas successfully sued their school district for evaluating them based on a ‘black box’ algorithm, which was shown to be deeply flawed.

This handful of examples is just the start — there’s much more we do not yet know. Part of the challenge is that the industry currently lacks standardized methods for testing and auditing AI systems to ensure they are safe and not amplifying bias. Yet early-stage AI systems are being introduced simultaneously across multiple areas, including healthcare, finance, law, education, and the workplace. These systems are increasingly being used to predict everything from our taste in music, to our likelihood of experiencing mental illness, to our fitness for a job or a loan.

The problem here is not the willful misuse of AI. It’s that AI and related technologies are being used without processes or standards to ensure safety or fairness, and without deeper consideration of their complex social interactions. When a new drug is released into the marketplace, it must first undergo rigorous scientific trials and testing, and continued monitoring of its medium and long-term effects. Care and caution are paramount in this domain, because if things go wrong, many people experience significant harm. The same is true for AI systems in high stakes domains.

As part of our report, we are offering ten recommendations for the AI industry, researchers, and policy makers. We’ve listed these recommendations below, along with some additional context for each. These recommendations aren’t the solution; they are a starting place for much-needed further work. While the deployment of AI products is moving quickly, research into bias and fairness is in its early stages, and there is much to be done if we’re going to ensure that AI systems are deployed and managed responsibly. That will require a joint effort. For our part, we are committed to further research on these issues and to sharing that work with the wider community. We think it’s urgently needed. Finally, if you’re interested in pursuing a postdoctoral fellowship centered on the social implications of AI, we hope you’ll consider joining us in this effort.

Recommendations

1 — Core public agencies, such as those responsible for criminal justice, healthcare, welfare, and education (i.e., “high stakes” domains) should no longer use ‘black box’ AI and algorithmic systems. This includes the unreviewed or unvalidated use of pre-trained models, AI systems licensed from third-party vendors, and algorithmic processes created in-house. The use of such systems by public agencies raises serious due process concerns, and at a minimum such systems should be available for public auditing, testing, and review, and subject to accountability standards.

This would represent a significant shift: our recommendation reflects the major decisions that AI and related systems are already influencing, and the multiple studies over the last twelve months providing evidence of bias (as detailed in our report). Others are also moving in this direction, from the ruling in favor of teachers in Texas, to the process now underway in New York City, where the City Council is considering a bill to ensure transparency and testing of algorithmic decision-making systems.

2 — Before releasing an AI system, companies should run rigorous pre-release trials to ensure that it will not amplify biases and errors due to any issues with the training data, algorithms, or other elements of system design. As this is a rapidly changing field, the methods and assumptions by which such testing is conducted, along with the results, should be openly documented and publicly available, with clear versioning to accommodate updates and new findings.
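To make the spirit of this recommendation concrete, below is a minimal sketch of what a documented, versioned pre-release check might look like. It assumes hypothetical validation data with a sensitive-attribute column, and uses one simple metric (the gap in per-group error rates) purely as an illustration, not as a complete or standard fairness test.

```python
import json
from datetime import date

import pandas as pd
from sklearn.metrics import accuracy_score


def pre_release_bias_report(df: pd.DataFrame, group_col: str, label_col: str,
                            pred_col: str, model_version: str) -> dict:
    """Summarize per-group error rates for a candidate system before release.

    `df` is hypothetical validation data holding true labels, model predictions,
    and a sensitive attribute; all column names are placeholders.
    """
    report = {
        "model_version": model_version,
        "date": date.today().isoformat(),
        "metric": "per-group error rate (1 - accuracy)",
        "assumption": "validation data is representative of the deployment population",
        "groups": {},
    }
    for group, rows in df.groupby(group_col):
        error_rate = 1.0 - accuracy_score(rows[label_col], rows[pred_col])
        report["groups"][str(group)] = {"n": int(len(rows)), "error_rate": round(error_rate, 4)}
    rates = [g["error_rate"] for g in report["groups"].values()]
    report["max_error_gap"] = round(max(rates) - min(rates), 4)
    return report


# Example: write a versioned record that could be published alongside the release.
# df = pd.read_csv("validation_with_predictions.csv")  # hypothetical file
# with open("bias_report_v0.3.1.json", "w") as f:
#     json.dump(pre_release_bias_report(df, "group", "label", "prediction", "0.3.1"), f, indent=2)
```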

We believe that those who develop and profit from these systems should be responsible for leading testing and assurance, including pre-release trials. We recognize that the field is a long way from standardized methods, which is why we recommend that these methods and assumptions be open for scrutiny and discussion. This openness will be crucial if the AI field is to develop robust testing standards over time. We also recognize that testing “in a lab”, even with standardized methods, may not catch all errors and blind spots, which leads us to Recommendation 3.

3 — After releasing an AI system, companies should continue to monitor its use across different contexts and communities. The methods and outcomes of monitoring should be defined through open, academically rigorous processes, and should be accountable to the public. Particularly in high stakes decision-making contexts, the views and experiences of traditionally marginalized communities should be prioritized.

Ensuring that AI and algorithmic systems are safe is extraordinarily complex, and needs to be an ongoing process through the life cycle of a given system. It’s not a compliance checkbox that can be completed and forgotten. Monitoring across dynamic use cases and contexts is needed to ensure AI systems don’t introduce errors and bias as cultural assumptions and domains shift and change. It is also important to note that many AI models and systems are “general purpose”: products might use plug-and-play add-ons like emotion detection or facial recognition capabilities. This means that those offering general-purpose AI models could also consider licensing them only for ‘approved uses’, where potential downsides and risks have been considered.
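As a rough illustration of what ongoing, context-aware monitoring could look like, here is a minimal sketch that tracks how outcome rates diverge across groups within each deployment context over time, and flags slices for human review. The column names, the threshold, and the disparity metric are all assumptions made for the example, not an established standard.

```python
import pandas as pd

DISPARITY_THRESHOLD = 0.10  # illustrative tolerance, not an established standard


def monitor_outcome_disparity(log: pd.DataFrame) -> list:
    """Flag (month, context) slices where positive-outcome rates diverge across groups.

    `log` is a hypothetical production log with columns 'month', 'context'
    (e.g. deployment region or product line), 'group', and a binary 'outcome'.
    """
    alerts = []
    for (month, context), rows in log.groupby(["month", "context"]):
        rates = rows.groupby("group")["outcome"].mean()
        disparity = float(rates.max() - rates.min())
        if disparity > DISPARITY_THRESHOLD:
            alerts.append({
                "month": month,
                "context": context,
                "disparity": round(disparity, 3),
                "group_rates": rates.round(3).to_dict(),
            })
    return alerts


# Example usage with a hypothetical decision log:
# log = pd.read_csv("decision_log.csv")
# for alert in monitor_outcome_disparity(log):
#     print("flag for human review:", alert)
```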

4 — More research and policy making is needed on the use of AI systems in workplace management and monitoring, including hiring and HR. This research will complement the existing focus on worker replacement via automation. Specific attention should be paid to the potential impact on labor rights and practices, especially the potential for behavioral manipulation and the unintended reinforcement of bias in hiring and promotion.

The debate around AI and labor usually focuses on the displacement of human workers, which is a very serious concern. However, we think it’s just as important to track how AI and algorithmic systems are used within today’s workplaces, for everything from behavioral nudging, to surveillance, to rating performance. For example, a company called HireVue recently deployed an AI-based video interviewing service, which analyzes a job applicant’s speech, body language, and tone to determine whether the applicant matches the model of “top performers” at a given company. Given the potential of these systems to reduce diversity and entrench existing biases, more work is needed to fully understand how AI is being integrated into management, hiring, scheduling, and the structures and practices of everyday workplaces.

5 — Develop standards to track the provenance, development, and use of training datasets throughout their life cycle. This is necessary to better understand and monitor issues of bias and representational skews. In addition to developing better records for how a training dataset was created and maintained, social scientists and measurement researchers within the AI bias research field should continue to examine existing training datasets, and work to understand potential blind spots and biases that may already be at work.

AI relies on large-scale data in order to detect patterns and make predictions. That data reflects human history, and so it inevitably carries the biases and prejudices of that history into the training dataset. Machine learning techniques excel at picking up such statistical patterns, often omitting diverse outliers in an attempt to generalize from the common cases. This is why it is important that research into bias not take data at face value, and that such research begin by understanding where the data used to train AI systems came from, tracking how that data is used across systems, and validating the methods and assumptions that shape a given dataset over time. With this understanding, we can better identify the errors and bias reflected in data, and develop ways of recognizing and possibly mitigating them during data creation and collection.
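One simple way to begin tracking provenance, sketched below, is to attach a structured, versioned record to each training dataset as it is created and revised. The fields and example values here are illustrative assumptions about what such a record might capture, not a finished standard.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DatasetProvenanceRecord:
    """A minimal, versioned record of where a training dataset came from and how it has changed."""
    name: str
    version: str
    collected_from: str                     # source of the raw data (illustrative field)
    collection_method: str                  # how and by whom the data was gathered and labeled
    known_skews: list = field(default_factory=list)   # documented gaps or over-/under-representation
    intended_uses: list = field(default_factory=list)
    revision_history: list = field(default_factory=list)

    def record_revision(self, note: str) -> None:
        """Append a dated note each time the dataset is cleaned, filtered, or extended."""
        self.revision_history.append({"date": date.today().isoformat(), "note": note})


# Example with hypothetical values:
record = DatasetProvenanceRecord(
    name="loan_decisions",
    version="2.0",
    collected_from="internal application records, 2010-2016",
    collection_method="exported from a legacy system; labels reviewed by two annotators",
    known_skews=["urban applicants over-represented", "few records for applicants over 65"],
    intended_uses=["research on credit-scoring bias"],
)
record.record_revision("removed duplicate applications found during audit")
print(json.dumps(asdict(record), indent=2))
```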

6 — Expand AI bias research and mitigation strategies beyond a narrowly technical approach. Bias issues are long term and structural, and contending with them necessitates deep interdisciplinary research. Technical approaches that look for a one-time “fix” for fairness risk oversimplifying the complexity of social systems. Within each domain — such as education, healthcare or criminal justice — legacies of bias and movements toward equality have their own histories and practices. Legacies of bias cannot be “solved” without drawing on domain expertise. Addressing fairness meaningfully will require interdisciplinary collaboration and methods of listening across different disciplines.

The recent increase in work on AI and algorithmic bias is an excellent sign, but we caution against taking a purely technical approach. Otherwise, there is a risk that systems are merely ‘optimized’ without a clear account of what they should be optimized for. Computer scientists can learn more about the underlying structural inequalities that shape data, and about the contextual integration of AI systems, by collaborating with domain experts in fields like law, medicine, sociology, anthropology, and communication.

7 — Strong standards for auditing and understanding the use of AI systems “in the wild” are urgently needed. Creating such standards will require the perspectives of diverse disciplines and coalitions. The process by which such standards are developed should be publicly accountable, academically rigorous and subject to periodic review and revision.

Currently, there are no established methods for measuring and assessing the impacts of AI systems as they are used in specific social contexts. This is a significant problem, given the decisions that early-stage AI systems are already influencing across multiple high stakes domains. Developing such standards and methods should be an urgent priority for the AI field.

8 — Companies, universities, conferences and other stakeholders in the AI field should release data on the participation of women, minorities and other marginalized groups within AI research and development. Many now recognize that the current lack of diversity in AI is a serious issue, yet the data on the scope of the problem is not granular enough to measure progress. Beyond this, we need a deeper assessment of workplace cultures in the technology industry, which requires going beyond simply hiring more women and minorities, toward building more genuinely inclusive workplaces.

The assumptions and perspectives of those who create AI systems will necessarily shape them. AI developers are often male and white, with similar backgrounds in education and training. We have already seen evidence that this causes problems, from voice recognition systems that don’t “hear” women, to AI assistants that fail to give information on women’s health. However, beyond general tech industry diversity statistics, there are few efforts to quantify and better understand the issue of diversity in the AI field specifically. If AI is to be safe, fair, and widely relevant, efforts need to be made not only to track diversity and inclusion, but also to ensure that the culture in which AI is being designed and developed is welcoming to women, minorities, and other marginalized groups.

9 — The AI industry should hire experts from disciplines beyond computer science and engineering and ensure they have decision-making power. As AI moves into diverse social and institutional domains, influencing increasingly high stakes decisions, efforts must be made to integrate social scientists, legal scholars, and others with domain expertise who can guide the creation and integration of AI into long-standing systems with established practices and norms.

Just as we wouldn’t expect a lawyer to optimize a deep neural network, we shouldn’t expect technical AI researchers and engineers to be experts in criminal justice, or any of the other social domains where technical systems are being integrated. We need domain experts to be at the table, to help lead decision making and ensure AI systems don’t naively misunderstand the complex processes, histories, and contexts in areas like law, health, and education.

10 — Ethical codes meant to steer the AI field should be accompanied by strong oversight and accountability mechanisms. More work is needed on how to substantively connect high-level ethical principles and guidelines for best practices to everyday development processes, promotion and product release cycles.

Several computing industry groups are developing ethical codes to help ensure the development of safe and fair AI (detailed further in our report). However, these codes are voluntary, and generally high-level, asking AI developers to prioritize the common good. But how should the common good be determined, and by whom? In addition to questions of representation, such codes will need to be connected to clear systems of accountability, while also remaining conscious of the incentive structures and power asymmetries at work in the AI industry.

You can read the full report here.