UF researchers evaluate academic performance of chatbots
News  |  March 12, 2024
University of Florida (UF) Health News
By Leah Buletti

As people increasingly turn to artificial intelligence (AI) tools like the popular chatbot ChatGPT to answer all kinds of questions, concerns remain about the accuracy of their responses in areas such as academics and health care.

In a new study, University of Florida researchers found OpenAI’s GPT-4 — the latest version of the model that powers ChatGPT — performed better than the student average on seven of nine graduate-level exams in the biomedical sciences. But they found its performance on the free-text assessments was limited for some types of complex questions, raising concerns about irrelevant data and plagiarism.

The study, published on March 7 in the journal Scientific Reports, highlights that GPT-4's responses need further evaluation for trustworthiness and accuracy across many subjects before the model can be used as a reference resource.

“Although responses on expert-level topics had very high accuracy on average, we would not recommend relying yet on ChatGPT/GPT-4 to accurately provide information as a sole resource,” said lead author Daniel Stribling, a UF College of Medicine M.D.-Ph.D. trainee in the lab of Rolf Renne, Ph.D., associate director for basic sciences at the UF Health Cancer Center and the paper’s senior author.

The researchers noted GPT-4 had a surprising capability to answer expert-level questions across biomedical science disciplines without any additional model training. It performed well on fill-in-the-blank, short-answer and essay questions, and it correctly answered several questions about figures sourced from published manuscripts. However, GPT-4 performed poorly on questions with figures containing simulated data and on those that required a hand-drawn answer. The team also observed convincing "hallucinations," in which the model invented fictional data to support real scientific findings.

The study highlights the need for open discussions about the appropriate use of these new tools in science and education, Stribling said.

“Similar to the advent of the printing press, in the chatbot era we may need to adapt our paradigms to these new technologies and critically evaluate whether there is now a distinguishable border between ‘editing tool’ and ‘co-author,’ which will have significant implications in educational assessments moving forward,” he said.

In addition to Stribling and Renne, co-authors included Kiley Graim, Ph.D., an assistant professor in the department of computer and information science and engineering in the Herbert Wertheim College of Engineering and a member of the UF Health Cancer Center, and Connie Mulligan, Ph.D., a professor in the department of anthropology.