I guess you can ask any doctor or paramedic and they’ll tell you the tricky part of the job is asking the right questions, and doing some detective work to find out the important symptoms, the serious drugs they’re having for breakfast every day because of some preexisting condition… A patient will casually omit mentioning all of that about half of the time. Calculating some Glasgow Coma Scale on them, the way it’s done on the exam, is the easy part of the job.
Bingo!
And then steelman whatever results you get (along with checking the sources that agree with them).
I’ve found steelmanning produces impressive results - but you must also have knowledge in the area you’re researching.
My questions always require sources - after validating via the sources, I then steelman the results.
Having done this in areas where I am an expert, I can say it does work so long as you approach this the same as anything else: with skepticism and strong criticism.
LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings.
This is an interesting distinction. Intuitively, it feels like something similar is going on with programming. Gemini is apparently passing all these crazy benchmarks, but I couldn’t even get it to one-shot a game of snake in C.
LLMs perform poorly when assisting humans with medical diagnosis; study blames the human participants for not working with them.
GIGO.
Good, competent questions are key. I’d bet doctors, biochemists, etc., get better results because they have deep knowledge.
I get great results in areas where I have expertise, because I know immediately whether I’m getting reasonable results or not.