But it’s still no match for cutting-edge humanoids.
Your Back Page scribbler is conscious of the fact we write a lot about artificial intelligence.
So we promise this will be the last digital rectangle on the subject for a month or so.
The thing is, for someone whose default setting is arch scepticism, the shortcomings of this over-hyped but underperforming technology really are the gift that keeps on giving.
One of the key issues in developing the large language models that power AI tools such as ChatGPT, Gemini and Claude is measuring how well they actually perform.
To do that you need benchmarks. What’s more, as the AIs are learning fast, you need to develop benchmarks that set the bar ever higher to gauge any meaningful progress.
Which is where a team of Australian university boffins, supported by the Center for AI Safety and Scale AI, comes in.
These clever folks have developed what they call Humanity’s Last Exam, which they describe as an “expert-level closed-ended academic benchmark”, consisting of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences.
Most importantly, while the exam questions are from the “frontier of human knowledge”, each question has a “known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval”.
So how do you reckon the AI tools did on the exam?
Being intelligent TMR readers, you’ve no doubt already guessed that ChatGPT and comrades did not cover themselves in glory. What might surprise you, however, is just how incredibly poorly they performed.
Publishing their results this week in the journal Nature, the research team revealed that there exists a “marked gap” between “current LLM capabilities and the expert human frontier on closed-ended academic questions”.
That’s putting it politely. The AI tools flunked badly. Very badly.
Many of the AIs scored less than 10% on the exam, while the highest result was a paltry 25.3% achieved by GPT-5.
To be fair, your correspondent did have a look at some of the questions posed in Humanity’s Last Exam, and they seemed pretty jolly difficult for our tiny brains.
On the other hand, we are not claiming, as the AI spruikers are doing, to be so smart we can make human work “optional” or “irrelevant”. Nor do we suggest we have the “capacity to solve major global challenges, including curing diseases”.
By the looks of these results, it could be a long wait before that ever happens.
But at least we now have a clear measure and a common reference point for assessing AI progress and capabilities.
And we have flesh and blood human beings to thank for that.
Prove your humanity by sending story tips to Holly@medicalrepublic.com.au.
