HOW DO CHATBOTS AND OTHER AI REALLY PERFORM?

A team of more than 400 researchers recently released an open-access study on the performance of recent, popular text-based AI architectures such as GPT, the Pathways Language Model (PaLM), the (recently controversial) LaMDA architecture, and sparse expert models. The study, built around a benchmark titled "Beyond the Imitation Game," or BIG-bench, aims to provide a general measure of the state of text-based AI, how it compares to humans on the same tasks, and the effect of model size on task performance.
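
To make the setup concrete, here is a minimal sketch of what scoring a model on one such task could look like. BIG-bench distributes its tasks as JSON files of input/target examples, and the sketch assumes that general shape; the query_model function is a hypothetical stand-in for a real model call, not the benchmark's actual API.

import json

def query_model(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client."""
    raise NotImplementedError

def evaluate_task(task_path: str) -> float:
    """Exact-match accuracy over one task's examples.
    (BIG-bench also supports other metrics, e.g. multiple choice.)"""
    with open(task_path) as f:
        task = json.load(f)
    correct = 0
    for example in task["examples"]:
        reply = query_model(example["input"])
        correct += int(reply.strip() == example["target"].strip())
    return correct / len(task["examples"])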

First, many of the results were interesting, though not surprising:

● In all categories, the best humans outperformed the best AIs (though that edge was smallest on translation problems from the International Linguistics Olympiad).
● Bigger models generally showed better results.
● For some tasks, the improvement was linear with model size. These were primarily knowledge-based tasks where the explicit answer was already somewhere in the training data.
● Some tasks ("breakthrough" tasks) required a very large model before any progress could be made at all. These were mostly what the team called "composite" tasks, where two different skills must be combined or multiple steps followed to reach the right answer; a toy contrast of the two curve shapes appears after this list.
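
The functions below mimic the two scaling patterns described above: steady improvement with model size versus a "breakthrough" jump past a size threshold. The numbers are invented for illustration and are not data from the study.

import math

def knowledge_task_accuracy(params: float) -> float:
    """Knowledge-retrieval tasks: accuracy climbs steadily with scale."""
    return min(1.0, 0.1 * math.log10(params))

def breakthrough_task_accuracy(params: float) -> float:
    """Composite tasks: near-random until a size threshold, then a jump."""
    return 1.0 / (1.0 + math.exp(-4.0 * (math.log10(params) - 10.0)))

for params in (1e8, 1e9, 1e10, 1e11):
    print(f"{params:.0e} params: "
          f"knowledge={knowledge_task_accuracy(params):.2f}, "
          f"breakthrough={breakthrough_task_accuracy(params):.2f}")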

However, some results were more surprising. The researchers found that models of every size were highly sensitive to how a question was phrased. For some phrasings, answers improved as models grew larger; for others, results were no better than random, no matter the model size.
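
A probe for this kind of sensitivity can be sketched as follows: pose the same underlying question under several phrasings and score each phrasing separately. The templates below and the query_model call are illustrative assumptions, not items from the benchmark itself.

def query_model(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client."""
    raise NotImplementedError

TEMPLATES = [
    "Q: What is {x} plus {y}?\nA:",
    "Compute the sum of {x} and {y}.",
    "{x} + {y} =",
]

def accuracy_by_phrasing(problems):
    """Score each template on the same (x, y, answer) triples."""
    results = {}
    for template in TEMPLATES:
        correct = sum(
            query_model(template.format(x=x, y=y)).strip() == str(answer)
            for x, y, answer in problems
        )
        results[template] = correct / len(problems)
    return results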
