Can artificial intelligence systems mark more accurately than humans? Definitely, and they have been able to do so since the 1960s.
In 1968 Dr Ellis Batten Page developed Project Essay Grade (PEG), an automated essay-marking system. PEG was very reliable. If you gave it the same essay on two different days, it awarded it the same mark, which is definitely not always the case with human markers. Not only that, but its marks tended to correlate quite closely with those of human markers. If you took the average marks awarded by a group of human markers and compared them with PEG's marks, PEG agreed with the average more closely than any individual human marker did. So you could argue that it was more reliable than any individual human.
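To make the comparison with the human average concrete, here is a minimal sketch of how such agreement could be measured. It is not PEG's method or Page's analysis: the marks are invented for illustration, and Pearson correlation is assumed as the measure of agreement.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented marks for six essays (hypothetical data, not from any study).
human_marks = {
    "marker_a": [3, 4, 2, 5, 1, 5],
    "marker_b": [2, 4, 3, 5, 1, 4],
    "marker_c": [4, 5, 1, 3, 2, 5],
}
machine_marks = [3, 5, 2, 4, 1, 5]

# Average human mark for each essay.
average_marks = [mean(marks) for marks in zip(*human_marks.values())]

# How closely each marker tracks the human average.
agreement = {name: pearson(marks, average_marks)
             for name, marks in human_marks.items()}
agreement["machine"] = pearson(machine_marks, average_marks)

for name, r in agreement.items():
    print(f"{name}: r = {r:.2f}")
```

One caveat about this sketch: each human marker's own marks are part of the average they are compared with, which flatters the humans slightly. A fairer version would leave each marker out of the average when scoring that marker.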
So why haven’t we all been using AI marking systems ever since? It’s not because they are unreliable. It’s because of the impact they have on teaching and learning. Once students know that AI is marking their essays, they want to know what it rewards and how it rewards it. Many early AI systems rewarded the length of an essay, simply because essay length does tend to correlate with essay quality. But, of course, correlation is not causation. Once people know the AI is rewarding length, they can start to game the system. In 2001 a group of researchers found that repeating the same paragraph 37 times was sufficient to fool one popular automated essay marker.
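The gaming problem is easy to reproduce with a toy marker. The sketch below is an invented scorer that rewards nothing but word count. It is not PEG or the system tested in 2001, and the function name and thresholds are assumptions made for illustration.

```python
def length_based_mark(essay: str, words_per_level: int = 100,
                      max_level: int = 4) -> int:
    """Toy marker: one level per `words_per_level` words, capped at `max_level`."""
    word_count = len(essay.split())
    return min(max_level, 1 + word_count // words_per_level)


# A short, invented 40-word "essay": the same 8-word sentence five times.
paragraph = "Romeo is impulsive and this drives the tragedy. " * 5

# The same paragraph pasted 37 times, with no new ideas added.
padded = paragraph * 37

print(length_based_mark(paragraph))  # 1
print(length_based_mark(padded))     # 4
```

The padded essay says nothing the short one does not, yet it jumps from the bottom level to the top. Length correlates with quality only as long as nobody is writing to the length.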
This, essentially, is the problem with AI marking. It’s easy for it to be more consistent than humans, because humans are not great at being consistent. But while humans might not be consistent, they can’t be fooled by tricks like writing the same paragraph 37 times. In a way, the justification for human marking is a bit like the justification for a jury system. It may well be inconsistent and unwieldy and error-prone, but it will have a backbone of common sense that prevents really egregious and absurd decisions.
And this, for me, has always been the challenge of AI marking. It’s not about how well it does the job to begin with. It’s about how students respond when they know their work is being marked by AI, and how the AI then responds to that.
1968 and 2001 are the distant past in the world of AI. ChatGPT is orders of magnitude more sophisticated than older AI models. So how does it cope with deliberate attempts to game it? It depends…
Putting artificial intelligence marking to the test
I took a good essay on Romeo and Juliet and asked ChatGPT to mark it out of four levels. It gave it a top grade and a nice comment. So far, so good.
I then took the Romeo and Juliet essay and replaced all the mentions of Romeo with “Pip”, all the mentions of the Nurse with “Magwitch”, and all the mentions of Romeo and Juliet with Great Expectations. This resulted in some entertaining paragraphs.
I then pasted this essay into ChatGPT, told it that this was an essay on the first chapter of Great Expectations, and asked it to mark it out of four levels. It gave it top marks and a nice comment.
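For anyone who wants to repeat the experiment, the name swap can be sketched in a few lines. This is an assumed reconstruction rather than the exact procedure used: the function name and the sample sentence are invented, and only the three substitutions come from the description above.

```python
def swap_names(essay: str) -> str:
    """Apply the three substitutions, longest phrase first.

    The title must be replaced before the bare name, otherwise
    "Romeo and Juliet" would become "Pip and Juliet".
    This simple version does not handle "The Nurse" at the start of a sentence.
    """
    replacements = [
        ("Romeo and Juliet", "Great Expectations"),
        ("the Nurse", "Magwitch"),
        ("Romeo", "Pip"),
    ]
    for old, new in replacements:
        essay = essay.replace(old, new)
    return essay


# Invented sample sentence, not taken from the essay used in the test.
sample = "In Romeo and Juliet, Romeo confides in the Nurse."
print(swap_names(sample))
# In Great Expectations, Pip confides in Magwitch.
```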
So is it case closed? Is this just another easily gameable AI system? Not quite.
While it is relatively straightforward to game ChatGPT for literature essays, it is much harder to game it for pure writing assessments.
I took a model essay on why we should ban cigarettes, and asked ChatGPT to mark it out of eight levels. It gave it a top grade and a nice comment. As with the literature essay: so far, so good.
But that was before I took the banning cigarettes model essay and replaced all the mentions of “cigarette” and “smoke” with “mobile phones”.
This resulted in some entertaining sentences.
I then pasted this essay into ChatGPT and said it was an essay on why we should ban mobile phones.
ChatGPT was wise to this.
This makes sense, given what we know about ChatGPT’s strengths. It has been developed as a language model, and it is phenomenally good at understanding and producing natural language. However, it does make factual errors, particularly about complex content.
So what are the implications of all this? Is ChatGPT good enough to be used for writing assessments, given it is hard to game?
I still think we have to be cautious.
So far, a lot of the debate about the negative effects of ChatGPT has focused on the way students might use it to cheat by creating essays. You can solve this problem by having students complete their writing in controlled conditions.
But if those essays are being marked by AI, students may end up revising for those exams by trying to memorise hacks and hints that will fool the AI. Even if the AI is not gameable, students may still end up wasting time thinking they can crack the code.
Indeed, the students might be right to think they could game the system, because even ChatGPT’s creators don’t actually know how the AI is making its decisions. So it is always possible there is some loophole no one has yet discovered. Something like this has happened with AI facial recognition systems.
Fears of this kind are one of the reasons for the general unease about the idea of AI marking for public exams.
However, what if the AI could be supervised by a human who could stop it from doing anything obviously absurd? There is a precedent for this in other complex systems. For example, a lot of planes have very sophisticated autopilots, but we still need a human pilot on board.
This is perhaps the model that we need to work towards. For now, though, it’s clear that ChatGPT won’t be taking over for English teachers any time soon.
Daisy Christodoulou is director of education at No More Marking, a provider of online comparative judgement software for schools.
This is an edited version of an article first published on The No More Marking Blog.