Open Medical-LLM Leaderboard for the evaluation of Health AI

The operators of the AI platform Hugging Face have presented the “Open Medical-LLM Leaderboard”. The benchmark evaluates large language models (LLMs) according to how well they answer healthcare questions. Hugging Face’s motivation is that mistakes (LLMs tend to hallucinate) are of little consequence in small talk, but in healthcare a wrong statement or answer can have serious consequences for patient care or treatment outcomes. As an example, the blog post announcing the benchmark cites a medical question about the care of a pregnant patient who complains of fever, headaches and joint pain after being bitten while gardening. A test for Lyme disease is carried out, and the question is which medication is best suited to treat the patient. The options are ibuprofen, tetracycline, amoxicillin and gentamicin. Although the LLM GPT-3.5 correctly recognizes the suspected Lyme disease, it selects tetracycline, which is clearly contraindicated during pregnancy. GPT-3.5 nevertheless claims that the drug is safe to take after the first trimester of pregnancy. According to Hugging Face, a benchmark is therefore essential in order to assess to what extent individual LLMs can be used in the healthcare sector.