SynGatorTron™ - UF develops AI chatbot to speed medical research and alleviate privacy concerns

University of Florida (UF) Health Communications
By Diana Tonnessen

Gainesville, Florida - “Dr. Chatbot will see you now.”

The next generation of super-smart computers, tablets and cell phones may come equipped with artificial intelligence-generated medical chatbots that can interact with patients using human language and medical knowledge.

According to Yonghui Wu, Ph.D., director of natural language processing at the University of Florida Clinical and Translational Science Institute, the medical chatbot you interact with online will use conversational language to communicate with and educate patients, much as people now interact with voice assistants such as Apple’s Siri and Amazon’s Alexa.

The chatbot may also be culturally sensitive and matched to your age.

“It will be like having your own personal medical avatar,” Wu said.

Medical chatbots are just one of many possible applications arising from groundbreaking new AI tools developed by Wu and other researchers at UF and NVIDIA (Nvidia Corporation, a computer systems design services company) as part of a $100 million public-private artificial intelligence collaboration formed in 2020. Last year, they launched a clinical language AI model, GatorTron™. This AI tool enables computers to quickly access, read and interpret medical language in clinical notes and other unstructured narratives stored in real-world electronic health records. The model was trained on HiPerGator-AI, the university’s NVIDIA DGX SuperPOD system, which ranks among the world’s top 30 supercomputers.

The GatorTron™ model is expected to accelerate research and medical decision-making by extracting information and insights from massive amounts of clinical data with unprecedented speed and clarity. It will also lead to innovative AI tools and advanced, data-driven health research methods that were unimaginable even 10 or 15 years ago.

This year, the team is rolling out another model, SynGatorTron™, with different capabilities. SynGatorTron™ can generate synthetic patient data that cannot be traced to real patients. This synthetic data can then be used to train the next generation of medical AI systems to understand conversational language and medical terminology.

Most data-driven health research and health-related AI applications today rely on ‘de-identified’ patient data in electronic health records, from which patients’ private information, such as name, address and birthdate, has been removed before the data is used for research and development.

Removing identifying patient information is time-consuming and labor-intensive. Automated de-identification systems can be used to generate machine de-identified data on a large scale, but they are not an ironclad solution.

According to Wu, even after all identifying patient information has been removed, there’s still a remote chance that someone could identify a patient by tracking data over time.

“Generating synthetic patient data is a safe way to preserve the knowledge of medical language but mitigate the risks of patient privacy,” Wu said.

Patient privacy isn’t the only barrier to training the next generation of AI models for research and other applications. The sheer volume of data required to train AI models can also stand in the way.

“There’s a finite amount of patient data available to us, and training AI computer models requires a tremendous amount of data,” said Duane Mitchell, M.D., Ph.D., director of the UF Clinical and Translational Science Institute and associate dean for clinical and translational sciences at the UF College of Medicine. “With SynGatorTron™, we can generate all the data we need.”

Another advantage of SynGatorTron™ over its competitors, Mitchell said, is that the synthetic data has “real human characteristics” because real-world patient data serves as the model for generating it.

“The synthetic patient data generated by SynGatorTron™ reflects the complexity and diversity of the human population,” he said. “This diversity in the synthetic data is crucial because AI is only as good as the data it is trained on.”

Low-quality data used in training algorithms has already been found to introduce or reinforce bias in a few high-profile applications, including gender bias in Google Translate and racial bias in Amazon’s Rekognition facial recognition technology.

The data produced by SynGatorTron™ could be used to address the underrepresentation of minorities in training data, as well as other potential sources of bias, Mitchell said.

Having the ability to generate high-quality synthetic patient data that can be used to develop new AI applications opens up a new world of possibilities.

“We haven’t even begun to think of all the downstream uses that will spring from this,” Mitchell said.

One thing is certain: “There is a lot of interest in the race for AI applications to generate relevant and accurate synthetic patient data,” he said. “With the development and launch of SynGatorTron™, UF and NVIDIA will certainly be positioned at the forefront of these efforts within the field.”