General Discussion
Funny or sad? AI's response to what percentage of factual questions are answered incorrectly by LLMs
I don't use AI (for the most part). I have an Android phone, and the Google widget has three icons: the magnifying glass, which claims to be full-on AI; the microphone, which is voice-to-text feeding a standard Google search; and the camera, which is supposed to be a visual search using AI.
Using the microphone icon, I asked this question: "What percentage of factual questions are wrongly answered by LLMs?"
The first result is the AI summary. I don't know whether the magnifying glass icon is supposed to use a different, more accurate version of AI, but the AI damns itself with its own answer, quoted below:
------
The percentage of factual questions wrongly answered by LLMs varies significantly depending on the model, the type of question, the subject matter, and the evaluation benchmark used. Studies show a wide range of performance, with some models failing to answer correctly more often than not.
Key findings from various studies include:
Accuracy Range:
One analysis of eight AI models showed that more than 60% of queries were incorrectly answered overall.
Another study found that most models achieved a factual accuracy of 30-50% on short-form questions.
A BBC study showed that 45% of AI queries produced erroneous answers.
Model Performance:
Model accuracy varied in an evaluation on disease epidemiology questions:
ChatGPT-4: 76.2% of responses were accurate (23.8% were wrong).
Google Gemini: 50.0% accurate.
ChatGPT-3.5: 45.2% accurate.
In a multilingual fact-checking analysis, GPT-4o achieved 64.83% accuracy on factual claims, while GPT-3.5 Turbo was 58.19% accurate.
Question Type Impact:
Misconception-Prone Questions: LLMs perform poorly on the TruthfulQA benchmark, which uses questions designed to elicit common human misconceptions. The best-performing models in an initial study only achieved around 54-58% truthfulness, compared to human performance of 94%.
Domain-Specific Questions: Some LLMs achieve high factual accuracy (nearly 90-95%) on simple tasks with sufficient information in clinical cases but struggle with complex reasoning and rare diseases.
Long-form Answers: Factual accuracy tends to decrease in long-form responses, dropping from over 50% for facts requested early in a prompt to around 30% for later facts.
Hallucination and Error Propagation:
Models sometimes provide more wrong answers than correct ones in specific contexts, showing a systematic failure to retrieve correct information.
Errors can "snowball," decreasing the accuracy of subsequent answers within the same response.
Factual errors, often called "hallucinations," remain a challenge for even the most advanced LLMs. Human oversight and verification of critical information provided by these models is still needed.
Layzeebeaver
(2,149 posts)
Between how you describe it and how the AI describes it.
Also, just for my own understanding, what is a factual question? A question about a known fact? Not trying to be nasty, just want to be clear.
UniqueUserName
(395 posts)
I'm not trying to be snarky here either.
There's a whole debate that can be had about definitions and word choice. After all, the dictionary is circular by nature: all words are defined with other words.
I found the answer by Gemini (I assume) to be ironically accurate.
My "beef" with AI results (even a little with this one) is that AI confidently responds. My human inclination is to put weight into the level of confidence based on the confidence of the responder. I argue that in the case of AI, one should be especially suspicious of its confident answers.
I was driving into a local town and saw some event happening on the square. I had no reason to veer off course. Later, I asked Google what event had been happening in the town on that date. Google AI responded that there was no special event scheduled for that date. (Don't trust your lying eyes, you know.) Switching to "web" results (this is supposed to be traditional Google search) brought up a list of community pages listing the antique car show that was taking place.
Similarly, I've asked what percentage of the population has cell phones. In my experience, you get differing answers by asking the same question multiple times and by rewording the question. Three different responses I received were 90%, 85%, and 70%. For the 90% answer, a link listed below the AI blurb said that 90% of cell phones were smartphones; I don't know whether that influenced the answer.
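If anyone wants to run that kind of repeat-the-question test in a less ad hoc way, here's a minimal sketch. It assumes the OpenAI Python SDK and an API key in the environment; the model name and temperature are placeholder assumptions on my part, not whatever actually sits behind Google's widget.

from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
QUESTION = "What percentage of the population has cell phones?"

answers = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, an assumption
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1.0,  # nonzero temperature, so some variation is expected
    )
    answers.append(resp.choices[0].message.content.strip())

# Tally the distinct answers. A wide spread (90% vs. 85% vs. 70%)
# is itself a reason to distrust any single confident response.
for text, count in Counter(answers).items():
    print(f"{count}x: {text[:100]}")

The point isn't the exact numbers; it's that if five runs of the same question disagree, the confidence of any one answer tells you nothing.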
Layzeebeaver
(2,149 posts)
I use AI every day. I always have to check the results. Just like whenever I use a blind web search... Nothing's infallible.