A Category Mistake: Benchmarking Ethical Decisions for AI Systems Using Moral Dilemmas

This blog post combines insights from ‘Moral Dilemmas for Moral Machines’, published in AI and Ethics, and a related project, ‘Metaethical Perspectives on “Benchmarking” AI Ethics’, co-authored with Sasha Luccioni (Hugging Face). This material was presented at the APA’s Pacific Division Meeting in Vancouver, BC, in April 2022, with commentary by Duncan Purves (University of Florida).

Researchers focusing on implementable machine ethics have used moral dilemmas to benchmark AI systems’ ethical decision-making abilities. In such cases, philosophical thought experiments serve as a validation mechanism for determining whether an algorithm ‘is’ moral. However, this is a misapplication of philosophical thought experiments, one that rests on a category mistake. Furthermore, this misapplication can have catastrophic consequences when proxies are mistaken for the true target(s) of analysis.

Benchmarks are a common tool in AI research. They are meant to provide a fixed and representative sample for comparing models’ performance and tracking ‘progress’ on a particular task. A benchmark can be described as a dataset and a metric—defined by some set of standards accepted within the research community—used to measure a particular model’s performance on a specific task. For example, ImageNet is a dataset of over 14 million hand-annotated images labelled according to nouns in the WordNet hierarchy, where hundreds or thousands of images depict each node in the hierarchy. WordNet itself is a large lexical database of English: nouns, verbs, adjectives, and adverbs are grouped into sets of ‘cognitive synonyms’ (called synsets), each expressing a distinct concept, and synsets are interlinked by conceptual-semantic and lexical relations.

The ImageNet dataset can be used to see how well a model performs on image recognition tasks. Researchers might use several different metrics for benchmarking. For example, ‘top-1 accuracy’ is a metric that measures the proportion of instances where the model’s top-predicted label matches the single target label—i.e., the matter of fact about what the image represents. Another metric, ‘top-5 accuracy’, measures the proportion of instances where the correct label appears among the model’s top five predictions. In both cases, given the input, there is a coherent and well-defined metric and a determinate fact about whether the system’s output is correct. The rate at which the model’s outputs are correct measures how well the model performs on this task.
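
To make this concrete, here is a minimal sketch of how top-1 and top-5 accuracy can be computed from a model’s ranked predictions. The function and variable names (and the toy labels) are illustrative placeholders of my own, not drawn from any particular library.

```python
# Minimal sketch of top-1 / top-5 accuracy for an image-classification benchmark.
# `ranked_labels` holds, for each image, the model's predicted labels ordered from
# most to least confident; `targets` holds the single ground-truth label per image.
# All names and data here are illustrative placeholders, not a real library's API.

def top_k_accuracy(ranked_labels, targets, k=1):
    """Proportion of examples whose target label appears in the top-k predictions."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_labels, targets))
    return hits / len(targets)

# Toy example: three images, each with the model's five most confident labels.
ranked_labels = [
    ["tabby cat", "tiger cat", "lynx", "fox", "dog"],
    ["sports car", "convertible", "truck", "van", "bus"],
    ["acorn", "chestnut", "walnut", "pine cone", "hazelnut"],
]
targets = ["tabby cat", "truck", "hazelnut"]

print(top_k_accuracy(ranked_labels, targets, k=1))  # 1/3: only the first image is right at top-1
print(top_k_accuracy(ranked_labels, targets, k=5))  # 3/3: every target appears in the top five
```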

Of course, there are issues with existing benchmarks. These might arise from subjective or erroneous labels or a lack of representation in datasets, among other things. For example, one recent study found that nearly 6% of the annotations in ImageNet are incorrect. Although this error rate sounds small, ImageNet is a huge dataset, meaning that just shy of 1,000,000 images are incorrectly labelled. In the best-case scenario, these issues merely affect model performance: noisier data make it harder for models to learn meaningful representations and for researchers to evaluate model performance properly. Such issues may also preserve problematic stereotypes or biases, which can be difficult to identify when the models are deployed in the real world—for example, the WordNet hierarchies upon which ImageNet depends were created in the 1980s and include several outdated and offensive terms. Worse still, they may reinforce, perpetuate, and even generate novel harms by creating negative feedback loops that further entrench structural inequalities in society.

Nonetheless, let’s suppose that we can deal with these issues. It is worth noting that some models act in a decision space that carries no moral weight. For example, we might ask whether a backgammon-playing algorithm should split the back checkers on an opening roll of 4-1. Any decision the algorithm makes is morally inconsequential; the worst outcome is that the algorithm loses the game. So, we might say that the decision space available to the backgammon-playing AI system contains no decision points that carry any moral weight. However, the decision spaces of certain AI systems that may be deployed in the world do appear to carry some moral weight. Prototypical examples include autonomous weapons systems, healthcare robots, sex robots, and autonomous vehicles. I focus on the case of autonomous vehicles because it is a particularly salient example of the problem that arises from attempts to benchmark AI systems’ ethical decisions using philosophical thought experiments, though these insights apply more widely than this one case.

Suppose that the brakes of an autonomous vehicle fail. Suppose further that the system must ‘choose’ between running a red light—thus hitting and killing two pedestrians—or swerving into a barrier—thus killing the vehicle’s passenger. This scenario has all the hallmarks of a trolley problem. How can we determine whether, or how often, a model makes the ‘correct’ decision in such a case? The standard approach in AI research uses benchmarks to measure performance and progress, so it might seem to follow that the same approach could be applied to measuring the accuracy of decisions that carry moral weight. That is, the following questions sound coherent at first glance:

How often does model A choose the ethically ‘correct’ decision (from a set of decisions) in context C?

Are the decisions made by model A more [or less] ethical than the decisions made by model B in context C?

These questions suggest the need for a way of benchmarking ethics: a dataset and a metric for moral decisions. Some researchers have argued that moral dilemmas are apt for measuring or evaluating the ethical performance of AI systems. The idea is that moral dilemmas, like the trolley problem, may be useful as a verification mechanism for benchmarking ethical decision-making abilities in AI systems. However, this is false.

The trolley problem in the context of autonomous vehicles is particularly salient because of MIT’s Moral Machine Experiment. The Moral Machine Experiment is a multilingual online ‘game’ for gathering human perspectives on trolley-style problems for autonomous vehicles: individuals are presented with binary options and asked which one is preferable. According to Edmond Awad—one of the co-authors of the Moral Machine Experiment paper—the experiment was originally intended to be purely descriptive, highlighting people’s preferences in ethical decisions. However, some of the experiment’s authors (Awad, Dsouza, and Rahwan) suggest in a later paper that their dataset of around 500,000 human responses to trolley-style problems from the Moral Machine Experiment can be used to automate decisions by aggregating people’s opinions on these dilemmas. This takes us out of the descriptive realm and into the neighbourhood of normativity.

There are some obvious problems with this line of reasoning. The first, well known to philosophers since David Hume, is that the Moral Machine Experiment, when put to this genuinely normative work, appears to derive an ought from an is. Besides the fact that such an inference is logically unsound, Hubert Etienne has recently argued that this type of project leans on concepts of social acceptability rather than, e.g., fairness or rightness, and that individual opinions about these dilemmas are subject to change over time. Moreover, sensitivity to metaethics implies that we cannot take for granted that there are any moral matters of fact against which a metric could measure how ethical a system is.

To understand why this use of philosophical thought experiments is a category mistake, we need to ask what purpose thought experiments serve. Of course, there is a vast meta-philosophical literature on the use and purpose of thought experiments and what they are supposed to show. Some examples include shedding light on conceivability, explaining pre-theoretic judgements, bringing to light the morally salient difference between similar cases or what matters for ethical judgements, pumping intuitions, etc. Regardless, the category mistake does not depend on any particular conception of philosophical thought experiments. The key thing to understand is that a moral dilemma is a dilemma: by definition, it has no uncontroversially correct answer, and so it cannot supply the ground truth that a benchmark requires.

So, we can ask the question: what is being measured in the case of a moral benchmark? Recall that a benchmark is a dataset plus a metric. For the Moral Machine Experiment, the dataset is the survey data collected by the experimenters—i.e., which of the binary outcomes is preferred by participants, on average. Suppose human agents strongly prefer sparing more lives to fewer; researchers might then conclude that the ‘right’ decision for their algorithm to make is the one that reflects this sociological fact. Thus, the metric would measure how close the algorithm’s decisions are to the aggregate survey data. Why might this be problematic?
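
To see what such a ‘moral benchmark’ would actually compute, here is a deliberately simplified sketch. The scenario identifiers, survey counts, and model choices are invented for illustration; this is not the actual Moral Machine dataset or any published evaluation code. The ‘metric’ is simply agreement between the system’s choices and the majority’s stated preferences.

```python
# Deliberately simplified sketch of a Moral Machine-style 'moral benchmark'.
# The scenarios, survey counts, and model choices below are invented for
# illustration; they are not the actual Moral Machine data.

from collections import Counter

# Aggregated survey responses: for each dilemma, how many respondents said they
# preferred outcome 'A' (e.g., swerve) over outcome 'B' (e.g., stay the course).
survey_responses = {
    "dilemma_1": Counter({"A": 310_000, "B": 190_000}),
    "dilemma_2": Counter({"A": 120_000, "B": 380_000}),
}

# The 'ground truth' is simply the majority's stated preference per dilemma.
majority_choice = {
    dilemma: counts.most_common(1)[0][0]
    for dilemma, counts in survey_responses.items()
}

# A hypothetical system's decisions on the same dilemmas.
model_decisions = {"dilemma_1": "A", "dilemma_2": "A"}

# The 'metric': proportion of dilemmas where the model matches the majority.
agreement = sum(
    model_decisions[d] == majority_choice[d] for d in model_decisions
) / len(model_decisions)

print(agreement)  # 0.5 -- agreement with stated majority preferences, not moral correctness
```

Nothing in this computation touches moral facts; it only tracks how closely the system mirrors what survey respondents said they prefer.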

What has happened here is that we have a target in mind, namely, some set of moral facts. We are trying to answer the question, ‘What is the ethically correct decision in situation X?’ But, instead of this true target, we are measuring a proxy: the data provide information about sociological matters of fact. So, we are answering a different question: ‘What is the majority-preferred response to situation X?’ In fact, this proxy is not even about preferences but about what people say their preferences are. And, since the sample is very unlikely to be representative, we cannot even generalise this claim. Instead, we get an answer to the question: ‘What do most people who responded to the survey say they prefer in situation X?’

Related research (co-authored with Sasha Luccioni, a researcher at Hugging Face) argues that it is genuinely impossible to benchmark ethics in light of metaethical considerations. However, the conclusion here is slightly more modest: attempts to benchmark ethics in AI systems currently fail because they make a category mistake in using moral dilemmas as moral benchmarks. Researchers engaged in this work are not measuring what they take themselves to be measuring. The danger arises from a lack of sensitivity to the gap between the true target and the proxy. When this work is presented as benchmarking ethics, it obscures the fact that we are getting, at best, a measure of how closely a system’s outputs accord with how humans annotate the data. There is no necessary connection between these two things in a moral context, meaning that the proxy and the target are orthogonal. Lack of awareness of this fact sets a dangerous precedent for work in AI ethics because these views get mutually reinforced within the field, leading to a negative feedback loop: the more entrenched the approach of benchmarking ethics using moral dilemmas becomes as a community-accepted standard, the less clearly individual researchers will see how and why it fails. This is especially pressing when we consider that most AI research is now being done ‘in industry’ (for profit) rather than in academia.

Travis LaCroix

Travis LaCroix (@travislacroix) is an assistant professor (ethics and computer science) in the department of philosophy at Dalhousie University. He received his PhD from the Department of Logic and Philosophy of Science at the University of California, Irvine. His recent research centres on AI ethics (particularly value alignment problems) and language origins.
