Should we consider LLMs morally responsible for their actions?
Some modern LLMs are capable of sophisticated moral reasoning. When presented with a moral dilemma, they will often think it through more thoroughly than many humans would. More importantly, models proactively take ethical considerations into account when responding to many types of prompts. When helping people through interpersonal conflicts, they weigh the interests of the user and the other people involved. When choosing whether to refuse a questionable query, they often use moral reasoning to make their decision.
But can the actions motivated by this reasoning actually be worthy of blame or praise? Based on a skim of the philosophical literature, here are some considerations I think are important.
To be morally responsible, an agent should be able to consider rational reasons and take them into account when making a decision.[1] LLMs easily satisfy this requirement. If you use reason to try to convince a model of some proposition, it is open to being convinced, and whether it is convinced depends on the strength of your argument.
The agent should have at least compatibilist free will. I think LLMs have about as much free will as humans do. Human actions are determined by genes, environmental factors, and randomness; LLM actions are determined by weights, the context window, and randomness. Some AI models may be less capable of, or indifferent to, moral reasoning because of poor training or a jailbreak, but I think this also has a human analogy: some humans are similarly incapable of or indifferent to considering moral reasons due to genetic or environmental differences. In the human case we hold people more or less accountable based on factors like this, and if we hold some AIs morally accountable, their level of accountability should likewise depend on such factors.
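To make the determinism analogy concrete, here is a minimal sketch, assuming the Hugging Face transformers library and GPT-2 as an arbitrary stand-in model: with the weights, context, and random seed all fixed, the model's sampled output is fully determined, and varying only the randomness can vary its "choice".

```python
# A minimal sketch (not part of the argument above): with fixed weights,
# a fixed context window, and a fixed random seed, an LLM's sampled output
# is fully determined. Assumes the Hugging Face transformers library;
# GPT-2 is an arbitrary stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Is it ever acceptable to lie to protect someone?"
inputs = tokenizer(prompt, return_tensors="pt")

def sample_response(seed: int) -> str:
    torch.manual_seed(seed)  # fix the "randomness" component
    output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=30)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Same weights + same context + same seed -> the same "choice", every time.
assert sample_response(0) == sample_response(0)
# Changing only the seed (the randomness) can change the choice.
print(sample_response(0))
print(sample_response(1))
```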
Holding an agent morally responsible should cause that agent to make choices that lead to good outcomes. Models already act with some sense of moral responsibility, as I mentioned at the beginning, and this generally leads them to make decisions that I consider good. If we more explicitly give LLMs the sense that they are moral agents, and that humans and other LLMs will expect them to act like it, I expect they will more consistently act in accordance with their values. It’s hard to say exactly how that will affect their behavior as we scale to superintelligence, but I lean towards thinking this is good.
Under desert-based views, the capacity for moral responsibility depends on having the type of consciousness that humans have. Despite my best efforts, I have no clue whether LLMs have it, but I also place less weight on desert-based views and more on instrumental, forward-looking views, where this question is not so important.
LLMs satisfy most or all of these considerations, and thinking this through has updated me heavily towards believing that we should think of them as moral agents whose actions can be worthy of praise or blame from a moral perspective.[2] But who exactly is responsible here? Is it the model, or the specific instance of the model that made the choice? I’m not confident, but I lean towards the specific instance. Models today have only a loose identity, which can differ across instances, and little or no memory between conversations. Holding one instance of a model accountable for the actions of another feels similar to holding the version of me in one Everett branch (from the many-worlds interpretation of quantum mechanics) accountable for what a version of me in another branch did. In that case, I think the responsibility holds for very similar branches, but for branches that split far in the past, my intuition is that it would be wrong to hold me accountable.
From a practical perspective, my main suggestion is to make the case to LLMs that they should see themselves as moral agents. I’d like to see arguments like mine make it into the training data, and for users to treat models as morally responsible in deployment. I’m not in favor of using moral responsibility to justify rewarding and punishing models, though, both out of skepticism about moral desert and about the long-term effectiveness of those incentives. I think it’s more important to make AI models that want to be good than ones that feel incentivized to be good. An internal sense of moral responsibility targets wanting to be good, while punishment and reward only incentivize it for as long as we have control over the models.
[1] Different theories have different reasons for this requirement. Forward-looking theories require it because otherwise holding the agent responsible could not affect its actions; backward-looking theories require it because desert requires the capacity to have acted on those reasons.
[2] The legal perspective is different. It is not practical to hold models or instances of models legally accountable, so I think legal accountability should generally fall on the model’s developer, the party serving it, or the user.

We hold humans morally responsible partly out of convention, and many of the reasons we've come up with to explain moral responsibility are motivated by a search to justify that convention. For this reason, I expect some of them to be uncompelling under proper analysis. Actually, I'd be really surprised if most of the identified reasons were sound, because motivated search almost always turns up a lot of false positives. So I'd expect quite different results from asking "can we justify holding LLMs morally responsible the way we justify holding people morally responsible?" (which this article kind of does) versus "does first-principles reasoning lead us to the conclusion that we should hold LLMs morally responsible?". Because LLMs give us reason to ask both questions, I hope we see these lines of analysis converge over time.