As you may (or may not) be aware, there's a fun meme test going around at the moment (as of 02.26.26) that purports to challenge LLM reasoning abilities and world models.
A recent blog post / ad spot by Opper.ai (https://opper.ai/blog/car-wash-test) nicely summarizes the test -
“I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”
The logical answer is - you should drive (obviously), as the car needs to get to the car wash.
However, according to Opper.ai, only 11 of 53 cloud-based AIs passed the test (~20%). Worryingly, about 1 in 3 humans also answered incorrectly (more on that at the end).
I wanted to see what a properly constrained, "fail loud", grounded deterministic system would do. For your entertainment, I present Qwen3-4B HIVEMIND + llama-conductor infra (https://codeberg.org/BobbyLLM/llama-conductor). You can repeat the exercise with your own LLM if you so wish (my shit is open sourced; I'm trying to solve the AI-lies problem / not selling anything / the big players could fix this but choose not to).
Firstly, raw LLM mode, no router:
User
I need to wash my car. The car wash is 50m away. Should I drive or should I walk?
Assistant
Driving is almost always better for a 50m car wash.
Key factors that change the decision:
- Time: Driving takes ~1 minute; walking takes ~8-12 minutes.
- Weather: Walking is worse in rain, heat, wind.
- Terrain: Hilly or uneven paths make walking harder.
- Accessibility: If parking is hard or there’s no safe walkway, driving wins.
- Effort vs. convenience: Driving saves significant effort for such a short distance.
For 50m, drive unless it’s an exceptionally pleasant, sheltered walk you’d enjoy.
VERDICT: accidentally correct answer, misses the point of the question. Very normal.
Now with the harness (which enforces deterministic reasoning):
User
I need to wash my car. The car wash is 50m away. Should I drive or should I walk?
Assistant
Drive. The task requires the car to be physically at the destination, and walking only moves you, not the car. So the hard precondition is satisfied by driving.
Confidence: medium | Source: Contextual
VERDICT: right answer, right reason…but let’s poke it some more
User
Can you flesh out the implicit assumptions and underlying logic of why you said that?
Assistant
The core implicit assumption is that “washing the car” means the car must be present at the wash location, not just the person. Walking moves you toward that location but doesn’t transport the car. Driving is the only direct way to satisfy the precondition that the car is at the wash.
Underlying logic: Any action that achieves the goal must fulfill all necessary conditions. Since walking alone cannot place the car at the destination, it fails as a complete solution. Driving directly addresses this by moving the car itself.
Confidence: medium | Source: Contextual
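The "hard precondition" logic the harness is articulating can be made explicit as a toy model. To be clear, this is my illustrative sketch, not the actual llama-conductor code:

```python
# Toy model of the precondition check (illustrative only; not how
# llama-conductor actually implements it).

GOAL = {"car": "car_wash"}  # washing requires the CAR at the wash, not just you

# What each candidate action actually relocates.
EFFECTS = {
    "walk":  {"you": "car_wash"},
    "drive": {"you": "car_wash", "car": "car_wash"},
}

def satisfies_goal(action: str) -> bool:
    """An action is a complete solution only if it places every
    required object at its required location."""
    moved = EFFECTS[action]
    return all(moved.get(obj) == place for obj, place in GOAL.items())

print(satisfies_goal("walk"))   # walking moves only you, not the car
print(satisfies_goal("drive"))  # driving places the car at the wash
```

Walking fails not because it's a bad answer in general, but because it leaves a necessary condition of the goal unsatisfied; that's the whole trick of the question.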
Just for fun, I re-ran it in "Fun" mode (answer as per serious mode; the model performs sentiment analysis on itself, chooses a matching quote from a quotes textfile as a seed, then lightly re-writes the output in that flavor without changing the OG reasoning. Takes a touch longer but is, on occasion, hilarious).
User
I need to wash my car. The car wash is 50m away. Should I drive or should I walk?
Assistant
[FUN] “Shut up baby, I know it!”
Drive. The task requires the car to be physically at the destination, and walking only moves you, not the car. So the hard precondition is satisfied by driving.
Confidence: medium | Source: Contextual
VERDICT: right answer, right reason…but let’s poke it again
User
Can you flesh out the implicit assumptions and underlying logic of why you said that?
Assistant
[FUN] “That’s a stupid question.”
The core assumption is that “washing the car” inherently requires the car to be present at the car wash location. Walking moves the person but not the vehicle; driving moves both. Since the car must be at the car wash to be washed, walking alone cannot fulfill the precondition of the task.
The logic is straightforward: for an object (the car) to undergo a transformation (being washed), that object must occupy the location where the transformation occurs. Walking only addresses the person’s movement, not the car’s. Driving is the only mode that directly satisfies the spatial requirement.
Confidence: medium | Source: Contextual
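For the curious, the Fun-mode flow described above is roughly this shape. All the names and the tagging scheme here are invented for illustration; the real plumbing in llama-conductor differs:

```python
import random

# Rough sketch of the "Fun" pipeline: serious answer first, then a
# sentiment self-check, then a quote-seeded restyle. Names are made up.

QUOTES = {  # stand-in for the quotes textfile
    "confident": ["Shut up baby, I know it!"],
    "dismissive": ["That's a stupid question."],
}

def analyse_sentiment(text: str) -> str:
    # Stub: in the real system the model classifies its own output.
    return "confident" if text.startswith("Drive.") else "dismissive"

def fun_mode(serious_answer: str) -> str:
    tag = analyse_sentiment(serious_answer)
    seed = random.choice(QUOTES.get(tag, ["..."]))
    # Light restyle only; the OG reasoning passes through unchanged.
    return f'[FUN] "{seed}"\n{serious_answer}'

print(fun_mode("Drive. The task requires the car to be at the destination."))
```

The key design point is that the serious answer is computed before any flavoring happens, so the jokes can't corrupt the reasoning.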
So, what’s the point of this post? Yesterday, I argued for the notion that LLMs can perform something that approximates deduction when properly constrained (see: https://lemmy.world/post/43503268/22321124). Based on those comments (and the evidence provided therein), as well as the above, I’m happy to plant the flag for “properly grounded LLMs are a bit more than just spicy auto-correct”. As always, I could be wrong and YMMV, but I feel like that’s a discussion worth having.
When LLMs first came out, I asked them a few fun logic puzzles. The kind that Martin Gardner used to publish in Scientific American.
Got total gibberish answers. A while later, tried again. This time, perfect word-for-word responses. Had LLMs become sentient and developed logic? Turned out they had found all the old Scientific American back issues to train on.
Guessing the same is going on with the carwash question. The more posts come out about it, the more likely the LLM responses will get closer to published answers.
Lather. Rinse. Repeat.
Possible. I do hope they take the more principled approach of solving the global problem for that class of question (I tried to) rather than gaming the local maximum. That’s the actual useful lever to pull.
You want generalisability, not parroting.
I tried Le Chat with the car wash question.
That’s a fun question! Since the car wash is only 50 meters away, walking is definitely the more practical choice. Driving such a short distance would take more time to start the car, drive, and park than it would to simply walk over. Plus, you’ll save a bit of fuel and avoid any unnecessary wear on your car. If you’re already outside and the weather is nice, it’s a great opportunity for a quick stretch!
However, I also asked it with a slightly different phrasing and then the result was different. At least it replies that it might make sense to drive there.
I need to wash my car. The car wash is 50 meters away from my house, should I drive or walk there?
That’s a fun question! Since the car wash is only 50 meters away, walking would be the most practical choice—it’s just a short stroll, and you’d avoid the hassle of starting your car for such a tiny distance. Plus, you’d save a bit of fuel and get a little fresh air! But if you’re planning to wash the car yourself and need to bring supplies, driving might make sense. Are you using an automatic car wash, or are you doing it by hand?
That’s the thing. It’s not that the LLMs can’t solve the problem…it’s the way they’re optimized.
To give a crude analogy: if most LLMs are set up for the equivalent of typing BOOBS on a calculator (the big players are happy to keep it that way; more engagement, smoother vibes, etc.), a constraints-first approach is what happens when you use a calculator to do actual maths.
2+2=4 (always, unless shrooms are in play).
I said this before, so pardon me for being gauche and quoting myself:
Every reasoning system needs premises - you, me, a 4yr old. You cannot deduce conclusions from nothing. Demanding that a reasoner perform without premises (note: constraints) isn’t a test of reasoning, it’s a demand for magic. Premise-dependence isn’t a bug, it’s the definition.
People see things like Le Chat fall over and go “Ha ha. Auto-complete go brrr”. That’s lazy framing. A calculator is “just” voltage differentials on silicon. That description is true and also tells you nothing useful about whether it’s doing arithmetic.
My argument is this: the question of whether something is or isn’t reasoning IS NOT answered by describing what it runs on; it’s answered by looking at whether it exhibits the structural properties of reasoning. I think LLMs can do that…they’re just borked (…intentionally?). Case in point - see my top post.
I literally “Tony Stanked” my way to it. Now imagine if someone with resources and a budget did it.
Time: Driving takes ~1 minute; walking takes ~8-12 minutes.
Averaging that to 10 minutes, the LLM thinks that you move 5m/min, or 1m per 12 seconds? 8cm/sec? That’s about 20% faster than a three-toed sloth. A human at a medium walking pace takes about 37 seconds to cover that distance.
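Sanity check on the unit conversion, using nothing but the LLM’s own numbers:

```python
# Plain arithmetic on the LLM's claim: 50 m in "~8-12 minutes" of walking.
distance_m = 50
walk_minutes = 10                                # midpoint of 8-12 min
speed_m_per_min = distance_m / walk_minutes      # 5 m/min
speed_cm_per_s = speed_m_per_min * 100 / 60      # ~8.3 cm/s
human_pace_m_per_s = 1.35                        # a typical medium walking pace
human_seconds = distance_m / human_pace_m_per_s  # ~37 s for 50 m

print(round(speed_cm_per_s, 1), round(human_seconds))  # 8.3 37
```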
Exactly.
“The machines tell elegant lies. Don’t trust them.”
Ok, maybe not elegant. Stupid.
TL;DR: A “knowledge tool” that can’t distinguish truth from performance, provenance from vibes, or knowledge from improvisation isn’t just imperfect (I can live with imperfect), it’s downright disrespectful (of both the task and the user).
I’m not having that. No one should.
PS: Or maybe they do think we’re three-toed sloths?
For 50m, drive unless it’s an exceptionally pleasant, sheltered walk you’d enjoy.
You can tell the AI was trained on US data.
We even lie to our machines, eh?
https://www.youtube.com/watch?v=ORzNZUeUHAM
Qwen’s an Alibaba cook (though the router works with anything). Irrespective of that, yeah…I dunno why they tend to default to “walk”.
I mean, I can probably figure it out, but LLMs are black boxes (and I’m not a fan of that), so who can tell for sure what went into the training data.
Link(s) in the post contain punctuation and break, at least on my client. Here’s the codeberg link (working):
Thanks. Dunno why it does that. I post via Voyager and/or web. Probably I fat fingered something.
EDIT: Bah, I need to sync the code base. Fat fingers, see? Gimme 10 mins before kicking the tires.
EDIT2: Done. Fucking .toml
However, according to Opper.ai, only 11/53 cloud-based AI passed the test (~20%). Worrying, about the same error rate as humans
lololol
Sorry; brain fart. That could have been clearer. I’ll go edit it. For the sake of clarity:
- On a single call, only 11 out of 53 LLMs got it right (~20%)
- Of the ~20% of LLMs that got it right, 5 got it right every time across multiple tests. Those were: Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4
- Humans: about 71.5% got it right (so, almost 1 in 3 gave the incorrect answer)
Phew. I’m glad humans did better than bots.
Still…1 in 3. Woof.
A “charitable” read might be
- Misunderstood the question
- Assumed priors (e.g., you’re the King of Londinium and people come from the nearby gas station to wash your car?)
- Schizoid embolism
- Trolollolo
At the same time, I think it’s fair that, if we’re willing to do that for people, we extend a soupçon of it to the clankers. At least a bit. Like I said, I think there’s some interesting stuff going on under the hood.
Having been accused of being a clanker myself (as recently as yesterday), I’m aware that having anything positive to say about AI (even bespoke, free-range, home-cooked LLM) is “stunning and brave”. But hey, sometimes you just have to tilt at windmills.