Can we trust LLM CALCULATIONS?.

Farmdude · 5 个月前

Can we trust LLM CALCULATIONS?.

SomeRandomNoob@discuss.tchncs.de · 5 个月前

short answer: no.

Long Answer: They are still (mostly) statisics based and can’t do real math. You can use the answers from LLMs as starting point, but you have to rigerously verify the answers they give.

unexposedhazard@discuss.tchncs.de · 5 个月前

The whole “two r’s in strawberry” thing is enough of an argument for me. If things like that happen at such a low level, its completely impossible that it wont make mistakes with problems that are exponentially more complicated than that.

otp@sh.itjust.works · 5 个月前

The problem with that is that it isn’t actually counting the R’s.

You’d probably have better luck asking it to write a script for you that returns the number of instances of a letter in a string of text, then getting it to explain to you how to get it running and how it works. You’d get the answer that way, and also then have a script that could count almost any character and text of almost any size.

That’s much more complicated, impressive, and useful, imo.

confuser@lemmy.zip · 5 个月前

A calculator as a tool to a llm though, that works, at least mostly, and could be better when kinks get worked out.

Mark with a Z@suppo.fi · edit-2 5 个月前

LLMs don’t and can’t do math. They don’t calculate anything, that’s just not how they work. Instead, they do this:

2 + 2 = ? What comes after that? Oh, I remember! It’s ‘4’!

It could be right, it could be wrong. If there’s enough pattern in the training data, it could remember the correct answer. Otherwise it’ll just place a plausible looking value there (behavior known as AI hallucination). So, you can not “trust” it.

msage@programming.dev · 5 个月前

Every LLM answer is a hallucination.

CanadaPlus@lemmy.sdf.org · 5 个月前

Some are just realistic to the point of being correct. It frightens me how many users have no idea about any of that.

NewNewAugustEast@lemmy.zip · 5 个月前

A good one will interpret what you are asking and then write code, often python I notice, and then let that do the math and return the answer. A math problem should use a math engine and that’s how it gets around it.

But really why bother, go ask wolfram alpha or just write the math problem in code yourself.

Greg Clarke@lemmy.ca · 5 个月前

They don’t calculate anything

They calculate the statistical probability of the next token in an array of previous tokens

Zos_Kia@lemmynsfw.com · 5 个月前

Actually no, they have some sort of “circuits” that approximate math, which is even more interesting imo. Still not reliable in the slightest, of course.

supersquirrel@sopuli.xyz · 5 个月前

Why would I bother?

Calculators exist, logic exists, so no… LLMs are a laughably bad fit for directly doing math, they are bullshit engines they cannot “store” a value without fundamentally exposing it to hallucinating tendencies which is the worst property a calculator could possibly have.

tal@olio.cafe · 5 个月前

Why would I bother?

Because you want to have a single interface that accepts natural-language input and gives answers.

That doesn’t mean that using an LLM as a calculator is a reasonable approach — though a larger system that incorporates an LLM might be. But I think that the goal is very understandable. I have Maxima, a symbolic math package, on my smartphone and computers. It’s quite competent at probably just about any sort of mathematical problem that pretty much any typical person might want to do. It costs nothing. But…you do need to learn something about the package to be able to use it. You don’t have to learn much of anything that a typical member of the public doesn’t already know to use a prompt that accepts natural-language input. And that barrier is enough that most people won’t use it.

Farmdude · 5 个月前

It was about all six models getting the same answer from different accounts. I was testing it. Over a hundred each same numbers

supersquirrel@sopuli.xyz · edit-2 5 个月前

Right so because LLMs are attrocious at actually precisely carrying out logic operations the solution was likely to just throw a normal calculator inside the AI, make the AI use the calculator and then turn around and handwave that the entire thing is AI.

So… you could just skip the bullshit and use a calculator, the AI just repackages the same answer with more boilerplate bullshit.

Wolfram Alpha is the non-bullshit version of this.

https://www.wolframalpha.com/

justOnePersistentKbinPlease@fedia.io · 5 个月前

Irrelevant.

LLMs are incapable of reasoning. At the core level, it is a physical impossibility.

AbouBenAdhem · edit-2 5 个月前

Would you trust six mathematicians who claimed to have solved a problem by intuition, but couldn’t prove it?

That’s not how mathematics works: if you have to “trust” the answer, it isn’t even math.

morgunkorn@discuss.tchncs.de · 5 个月前

No, thank you for coming to my TED Talk.

Farmdude · 5 个月前

Really all six models. ? Likely incorrect?

hddsx@lemmy.ca · 5 个月前

That wasn’t the question. The question was whether you should trust the number and the answer is no. It could be correct or it could be incorrect. There’s not enough data to determine it.

LLMs work as predictive models. If you ask 10 people to estimate the height of a tree, and 8/10 estimate that it’s 10 ft tall, 2/10 estimate that it’s 8 ft tall, the most likely LLM answer is that it’s 10 ft tall. It doesn’t matter that if you actually go and measure the tree that it’s actually 15 ft tall. The LLM will likely report 8

icerunner_origin@startrek.website · 5 个月前

Vibe math. No thank you

WolfLink@sh.itjust.works · 5 个月前

Just use Wolfram Alpha instead

gedaliyah · 5 个月前

Here’s an interesting post that gives a pretty good quick summary of when an LLM may be a good tool.

Here’s one key:

Machine learning is amazing if:

The problem is too hard to write a rule-based system for or the requirements change sufficiently quickly that it isn’t worth writing such a thing and,

The value of a correct answer is much higher than the cost of an incorrect answer.

The second of these is really important.

So if your math problem is unsolvable by conventional tools, or sufficiently complex that designing an expression is more effort than the answer is worth… AND ALSO it’s more valuable to have an answer than it is to have a correct answer (there is no real cost for being wrong), THEN go ahead and trust it.

If it is important that the answer is correct, or if another tool can be used, then you’re better off without the LLM.

The bottom line is that the LLM is not making a calculation. It could end up with the right answer. Different models could end up with the same answer. It’s very unclear how much underlying technology is shared between models anyway.

For example, if the problem is something like, "here is all of our sales data and market indicators for the past 5 years. Project how much of each product we should stock in the next quarter. " Sure, an LLM may be appropriately close to a professional analysis.

If the problem is like “given these bridge schematics, what grade steel do we need in the central pylon?” Then, well, you are probably going to be testifying in front of congress one day.

Rentlar@lemmy.ca · edit-2 5 个月前

I wouldn’t bother. If I really had to ask a bot, Wolfram Alpha is there as long as I can ask it without an AI meddling with my question.

E: To clarify, just because one AI or six will get the same answer that I can independently verify as correct for a simpler question, does not mean I can trust it for any arbitrary math question even if however many AIs arrive at the same answer. There’s often the possibility the AI will stumble upon a logical flaw, exemplified by the “number of rs in strawberry” example.

5 个月前

Yes, with absolute certainty.

For example: 2 + 2 = 5

It’s absolutely correct and if you dispute it, big bro is gonna have to re-educated you on that.

Farmdude · 5 个月前

I NEED TO consult every LLM VIA TELEKINESIS QUANTUM ELECTRIC GRAVITY A AND B WAVE.

zxqwas · 5 个月前

Using a calculator or wolfram alpha or similar tools i don’t trust the answer unless it passes a few sanity checks. Frequently I am the source of error and no LLM can compensate for that.

Farmdude · 5 个月前

It checked out. But, all six getting the same is likely incorrect?.

pinball_wizard@lemmy.zip · 5 个月前

Yes. All six are likely to be incorrect.

Similarly, you could ask a subtle quantum mechanics question to six psychologists, and all six may well give you the same answer. You still should not trust that answer.

The way that LLMs correlate and gather answers is particularly unsuited to mathematics.

Edit: I. Contrast, the average Psychologist is much more prepared to answer a quantum mechanics question, than an average LLM is to answer a math or counting question.

bunchberry · 5 个月前

deleted by creator

zxqwas · 5 个月前

Don’t know. I’ve never asked any of them a maths question.

How costly is it to be wrong? You seem to care enough to ask people on the Internet so it suggests that it’s fairly costly. I’d not trust them.

EpeeGnome@feddit.online · 5 个月前

If all 6 got the same answer multiple times, then that means that your query very strongly correlated with that reply in the training data used by all of them. Does that mean it’s therefore correct? Well, no. It could mean that there were a bunch of incorrect examples of your query they used to come up with that answer. It could mean that the examples it’s working from seem to follow a pattern that your problem fits into, but the correct answer doesn’t actually fit that seemingly obvious pattern. And yes, there’s a decent chance it could actually be correct. The problem is that the only way to eliminate those other still also likely possibilities is to actually do the problem, at which point asking the LLM accomplished nothing.

Farmdude · 5 个月前

I think the best thing at this juncture is to ask an LLM WHAT THE TRUTH IS LOL

Scrubbles@poptalk.scrubbles.tech · 5 个月前

No. Dear God no. Llms are not computers. They are just prediction machines. They predict that the next value is probably this value. There is no actual math there.

qaz · 5 个月前

Most LLM’s now call functions in the background. Most calculations are just simple Python expressions.

Farmdude · 5 个月前

Yes. I was aware of that, but I was manipulated by an analog device

Aatube@kbin.melroy.org · 5 个月前

this is a really weird premise. doing the same thing on 6 models is just not worth it especially when wolfram alpha exists and is far more trustable and speedy

FaceDeer@fedia.io · 5 个月前

If the LLMs are part of a modern framework I would expect that they should be calling out to Wolfram Alpha (or a similar specialized math-solver) via an API to get the answer for you, for that matter.

GrammarPolice · 5 个月前

Finally an intelligent comment. So many comments in here that don’t realize most LLM’s are bundled with calculators that just do the math.

FaceDeer@fedia.io · 5 个月前

Anti-AI sentiment is extremely strong in every part of the Fediverse I’ve seen so far, usually my comments get downvoted heavily even when I’m just describing factual details of how it works. I expect a lot of people simply don’t bother after a while.

OwlPaste · 5 个月前

no, once i tried to do binary calc with chat gpt and he keot giving me wrong answers. good thing i had sone unit tests around that part so realised quickly its lying

dan1101 · 5 个月前

Yes more people need to realize it’s just a search engine with natural language input and output. LLM output should at least include citations.

Pika@sh.itjust.works · 5 个月前

Just yesterday I was fiddling around with a logic test in python. I wanted to see how well deepseek could analyze the intro line to a for loop, it properly identified what it did in the description, but when it moved onto giving examples it contradicted itself and took 3 or 4 replies before it realized that it contradicted itself.

Farmdude · 5 个月前

But, if you ran, gave the problem to all the top models and got the same? Is it still likely an incorrect answer? I checked 6. I checked a bunch of times. Different accounts. I was testing it. I’m seeing if its possible with all that in others opinions I actually had to check over a hundred times each got the same numbers.

Denjin@feddit.uk · 5 个月前

They could get the right answer 9999 times out of 10000 and that one wrong answer is enough to make all the correct answers suspect.

porcoesphino@mander.xyz · 5 个月前

What if there is a popular joke that relies on bad math that happens to be your question. Then the alignment is understandable and no indication of accuracy. Why use a tool with known issues, and overhead like querying six, instead of using a decent tool like Wolfram alpha?

Farmdude · 5 个月前

I did dozens of times. Same calculations.

porcoesphino@mander.xyz · 5 个月前

That doesn’t change the logic I gave

OwlPaste · 5 个月前

my use case was, i expect easier and simpler. so i was able to write automated tests to validate logic of incrementing specific parts of a binary number and found that expected test values llm produced were wrong.

so if its possible to use some kind of automation to verify llm results for your problem, you will be confident in your answer. but generally llms tend to make up shit and sound confident about it