Humans don't make mistakes nearly as often, the mistakes they do make are far more predictable (and therefore easier to spot in code review), and they rarely make the kind of catastrophic mistake that could sink a business. LLMs also tend to make codebases deteriorate rapidly, since even very disciplined reviewers can miss the strange and unpredictable things an LLM will do. Redundant code isn't evident in a diff, and neither are tautological tests, or useless tests that mock everything and only actually test the mocks. They'll also write piles of duplicated code, because they aggressively avoid code reuse unless you're very specific about it.
The real problem is just that they don't have brains, and can't think. They generate text that is optimized to look the most right, but not to be the most right. That means they're deceptive right off the bat. When a human is wrong, it usually looks wrong. When an LLM is wrong, it's generating the most correct looking thing it possibly could while still being wrong, with no consideration for actual correctness. It has no idea what "correctness" even means, or any ideas at all, because it's a computer doing matmul.
They are text-summarizing, pattern-matching, regurgitation machines. They regurgitate summaries of things seen in their training data, and that training data was written by humans who can think. We just let ourselves get duped into believing the machine is where the thinking comes from, and not the (likely uncompensated) authors whose work was regurgitated for you.
Yeah, I remember how in every large corporation the specs were perfectly interpreted and with no issues, at all. Humans are great at communication and understanding each other.
>The real problem is just that they don't have brains, and can't think.
That would have carried more weight if you hadn't just described junior developer behavior beforehand.
"LLMs can't think" is anthropocentric cope. It's the old AI effect all over again - people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.
> That would have had more weight if you haven't just described junior developer behavior beforehand.
Effectively saying that junior developers "don't have brains" is in very bad taste and offensively wrong.
> people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.
Would you like to elaborate on this?
I was told that McDonalds employees would have been replaced by now, that self-driving cars would be driving the streets, and that new medicines would have been discovered.
It's been a couple of years since "AI" came out, and still no singularity.
LLMs use the same type of "abstract thinking" process as humans. Which is why they can struggle with 6-digit multiplication (unlike computer code, very much like humans), but not with parsing out metaphors or describing what love is (unlike computer code, very much like humans). The capability profile of an LLM is amusingly humanlike.
Setting the bar for "AI" at "singularity" is a bit like setting requirements for "fusion" at "creating a star more powerful than the Sun". Very good for dismissing all existing fusion research, but not any good for actually understanding fusion.
If we had two humans, one with IQ 80 and another with IQ 120, we wouldn't say that one of them isn't "thinking". It's just that one of them is much worse at "thinking" than the other. Which is where a lot of LLMs are currently at. They are, for all intents and purposes, thinking. Are they any good at it though? Depends on what you want from them. Sometimes they're good enough, and sometimes they aren't.
> LLMs use the same type of "abstract thinking" process as humans
It's surprising you say that, considering we don't actually understand the mechanisms behind how humans think.
We do know that human brains are so good at patterns that they'll even see patterns that aren't actually there.
LLMs are a pile of statistics that can mimic human speech patterns if you don't tax them too hard. Anyone who thinks otherwise is just Clever Hans-ing themselves.
We understand the outcomes well enough. LLMs converge onto a similar process by being trained on human-made text. Is LLM reasoning a 1:1 replica of what the human brain does? No, but it does something very similar in function.
I see no reason to think that humans are anything more than "a pile of statistics that can mimic human speech patterns if you don't tax them too hard". Humans can get offended when you point it out though. It's too dismissive of their unique human gift of intelligence that a chatbot clearly doesn't have.
We do not, in fact, "understand the outcomes well enough" lol.
I don't really care if you want to have an AI waifu or whatever. I'm pointing out that you're vastly underestimating the complexity behind human brains and cognition.
And that complex human brain of yours is attributing behaviors to a statistical model that the model does not, in fact, possess.
I think both "LLMs can produce outcomes akin to those produced by human intelligence (in many but not all cases)" and "LLMs are intelligent" are fairly defensible.
> I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does.
I don't think this follows though. Birds and planes can both fly, but a bird and a plane are clearly not doing the same thing to achieve flight. Interestingly, both birds and planes excel at different aspects of flight. It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans, and that that might manifest as some aspects of intelligence being accessible to LLMs but not humans and vice versa.
> It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans
Intelligence isn’t "implemented" in an LLM at all. The model doesn’t carry a reasoning engine or a mental model of the world. It generates tokens by mathematically matching patterns: each new token is chosen to best fit the statistical patterns it learned from its training data and the immediate context you give it. In effect, it’s producing a compressed, context-aware summary of the most relevant pieces of its training data, one token at a time.
The training data is where the intelligence happened, and that's because it was generated by human brains.
There doesn't seem to be much consensus on defining what intelligence is. For the definitions of at least some reasonable people of sound mind, I think it is defensible to call them intelligent, even if I don't necessarily agree. I sometimes call them "intelligent" because many of the things they do seem to me like they should require intelligence.
That said, to whatever extent they're intelligent or not, by almost any definition of intelligence, I don't think they're achieving it through the same mechanism that humans do. That is my main argument. I think confident arguments that "LLMs think just like humans" are very bad, given that we clearly don't understand how humans achieve intelligence, and given the vastly different substrates and constraints that humans and LLMs are working with.
I guess to me: how is the ability to represent the statistical distribution of outcomes of almost any combination of scenarios, represented as textual data, not a form of world model?
I think you're looking at it too abstractly. An LLM isn't representing anything; it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it does matrix operations with them in order to randomly select a token from a softmax distribution, one at a time, until the EOS token is generated.
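That loop can be caricatured in a few lines of Python (a toy with made-up weights, nothing like a real transformer, but the control flow is the one described):

```python
import numpy as np

# Toy caricature of autoregressive generation: a "bag of numbers" (W),
# matrix math, a softmax, and sampling one token at a time until <eos>.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "<eos>"]
W = rng.normal(size=(len(vocab), len(vocab)))  # stand-in for trained weights

def next_token(token_id):
    logits = W[token_id]                  # "matrix operations" on the context
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax distribution over the vocab
    return int(rng.choice(len(vocab), p=p))

tok, out = 0, []
for _ in range(20):                       # cap length in case <eos> never samples
    tok = next_token(tok)
    if vocab[tok] == "<eos>":
        break
    out.append(vocab[tok])
print(" ".join(out))
```

Nothing in that loop knows what any token means; it only knows which numbers tend to follow which.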
If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. [This video](https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but if they are, it's only because they've been further trained on the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).
That video is a poor and misunderstood analysis of an old version of ChatGPT.
Analyzing the image-generation failure modes of the DALL-E family of models isn't really helpful in understanding whether the invoking LLM has a robust world model or not.
The point of me sharing the video was to use the full glass of wine as an example of how generative AI models lack a true world model at inference time. The example is just as relevant now as it was then, and it applies to inference done by LMs and SD models in the same way. Nothing has fundamentally changed in how these models work. Getting better at edge cases doesn't give them a world model.
That's the point though. Look at any end-to-end image model. Currently I think nano banana (Gemini 2.5 Flash) is probably the best in prod. (Looks like ChatGPT has regressed the image pipeline right now with GPT-5, but not sure)
SD models have a much higher propensity to fixate on proximal, in-distribution solutions because of the way they de-noise.
For example, you can ask nano banana for a "completely full wine glass in zero g", which I'm pretty sure is way more out of distribution, and the model does a reasonable job of approximating what that might look like.
That's a fairly bad example. They don't have any trouble taking unrelated things and sticking them together. A world model isn't required for you to take two unrelated things and stick them together. If I ask it to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.
But what it won't be able to do, which does require a world model, is put a frog on the moon, and be able to imagine what that frog's body would look like on the moon in the vacuum of space as it dies a horrible death.
Your example is a good one. The frog won't work easily, because ethically the model won't want to show a dead frog, BUT if you ask nano-banana for:
"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."
> "We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."
Completely false equivalence. Back then we did in fact understand "how a bird works", i.e. how the physics of flight work. The problem with getting man-made flying vehicles off the ground was mostly about not having good enough materials to build one (plus some economics-related issues).
Whereas in case of AI, we are very far from even slightly understanding how our brains work, how the actual thinking happens.
One of the Wright brothers' achievements was to realize that the published tables of flight physics were wrong, and to carefully redo them with their own wind tunnel until they had a correct model from which to design a flying vehicle.
https://humansofdata.atlan.com/2019/07/historical-humans-of-...
"Anthropocentric cope >:(" is one of the funniest things I've read this week, so genuinely thank you for that.
"LLMs think like people do" is the equivalent of flat earth theory or UFO bros.
Flerfers run on ignorance, misunderstanding and oppositional defiant disorder. You can easily prove the earth is round in quite a lot of ways (the Greeks did it) but the flerfers either don't know them or refuse to apply them.
There are quite a lot of reasons to believe brains work differently than LLMs (and ways to prove it); you just don't know them or refuse to believe them.
It's neat tech, and I use them. They're just wayyyyyyyy overhyped and we don't need to anthropomorphize them lol
This is wrong on so many levels. I feel like this is what I would have said if I never took a neuroscience class, or actually used an LLM for any real work beyond just poking around ChatGPT from time to time between TED talks.
There is no actual object-level argument in your reply, making it pretty useless. I’m left trying to infer what you might be talking about, and frankly it’s not obvious to me.
For example, what relevance is neuroscience here? Artificial neural nets and real brains are entirely different substrates. The “neural net” part is a misnomer. We shouldn’t expect them to work the same way.
What's relevant is the psychology literature. Do artificial minds behave like real minds? In many ways they do: LLMs exhibit the same sorts of fallacies and biases as human minds. Not exactly 1:1, but surprisingly close.
I didn't say brains and ANNs are the same, in fact I am making quite the opposite argument here.
LLMs exhibit these biases and fallacies because they regurgitate the biases and fallacies that were written by the humans that produced their training data.
Maybe. That's not an obvious conclusion in the strong sense that you mean it here. If you train an LLM on transcripts of multiplying very large numbers, machine-generated and perfectly accurate transcripts, the LLM still exhibits the same sorts of mental-math errors that people make.
Math, logical reasoning, etc. are cultural knowledge, not architecturally built-in. These biases and fallacies arise because of how we process higher order concepts via language-like mechanisms. It should not be surprising that LLMs, which mimic human-like natural language abilities (at the culture/learned level of abstraction, if not computation substrate) exhibit the same sorts of errors.
Living in Silicon Valley, there are MANY self driving cars driving around right now. At the stop light the other day, I was between 3 of them without any humans in them.
It is so weird when people pull self driving cars out as some kind of counter example. Just because something doesn't happen on the most optimistic time scale, doesn't mean it isn't happening. They just happen slowly and then all at once.
15 years ago they said truck drivers would be obsolete in 1-2 years. They are still not obsolete, and they aren't on track to be any time soon, either.
Given that they all use pseudo-random (and not actually random) numbers, they are "deterministic" in the sense that given a fixed seed, they will produce a fixed result...
But perhaps that's not what was meant by deterministic. Something like an understandable process producing an answer rather than a pile of linear algebra?
I was thinking the exact same thing: if you don't change the weights, use identical "temperature", etc., the same prompt will yield the same output. Under the hood it's still deterministic code running on a deterministic machine.
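A toy sampler makes the point. This is a sketch of the sampling step only (real serving stacks can still vary across runs due to batching and floating-point reduction order), but with fixed weights, temperature, and seed, the computation is reproducible:

```python
import numpy as np

def sample_token(logits, temperature, seed):
    # The "randomness" is a PRNG: fix the seed and the choice is fixed too.
    rng = np.random.default_rng(seed)
    p = np.exp(logits / temperature)
    p /= p.sum()                       # softmax at the given temperature
    return int(rng.choice(len(p), p=p))

logits = np.array([2.0, 1.0, 0.5])     # stand-in for a model's output scores
a = sample_token(logits, temperature=0.8, seed=42)
b = sample_token(logits, temperature=0.8, seed=42)
assert a == b  # same weights, same prompt, same seed: same token
```

The sampling only *looks* nondeterministic because APIs draw a fresh seed per request.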
You can just change your definition of "AI". Back in the 60s the pinnacle of AI was things like automatic symbolic integration, and those were certainly completely deterministic. Nowadays people associate "AI" with stuff like LLMs and diffusion etc. that have randomness included to make them seem "organic", but it doesn't have to be that way.
I actually think a large part of people's amazement with the current breed of AI is the random aspect. It's long been known that random numbers are cool (see Knuth volume 2, in particular where he says randomness makes computer-generated graphics and music more "appealing"). Unfortunately, being amazed by graphics and music (and now text) output is one thing; making logical decisions with real consequences is quite another.
Not really: code, even in high-level languages, is always lower-level than English, just for computer-nonsense reasons. Example: "read a CSV file and add a column containing the multiple of the price and quantity columns".
That's about 20 words. Show me the programming language that can express that entire feature in 20 words. Even very English-like languages like Python or Kotlin might just about do it; if you're working in something else like C++, then no.
In practice, this spec will expand to changes to your dependency lists (and therefore you must know what library is used for CSV parsing in your language, the AI knows this stuff better than you), then there's some file handling, error handling if the file doesn't exist, maybe some UI like flags or other configuration, working out what the column names are, writing the loop, saving it back out, writing unit tests. Any reasonable programmer will produce a very similar PR given this spec but the diff will be much larger than the spec.
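Here's the same feature in pandas (the file names, column names, and sample data are my choices, since the prompt never gives them):

```python
import pandas as pd
from pathlib import Path

# Sample data standing in for the real file; all names below are my guesses.
Path("input.csv").write_text("product,price,quantity\nwidget,2.50,3\ngadget,1.25,8\n")

df = pd.read_csv("input.csv")
df["total"] = df["price"] * df["quantity"]  # the entire "feature" in one line
df.to_csv("output.csv", index=False)
```
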
Not only is this shorter, but it contains all of the critical information that you left out of your english prompt: where is the csv? what are the input columns named? what are output columns named? what do you want to do with the output?
I also find it easier to read than your english prompt.
You have to count the words in the functions you call to get the correct length of the implementation, which in this case is far, far more than 20 words. read_csv has more than 20 arguments; you can't even write the function definition in under 20 words.
Otherwise, I can run every program by importing one function (or an object with a single method, or what have you) and just running that function. That is obviously a stupid way to count.
It isn't a joke: you need the Kolmogorov complexity of the code that implements the feature, which has nothing to do with the fact that you're using someone else's solution. You may not have to think about all the code needed to parse a CSV, but someone did, and that's a cost of the feature, whether you want to think about it or not.
Again, if someone else writes a 100,000 line function for you, and they wrap it in a "do_the_thing()" method, you calling it is still calling a 100,000 line function, the computer still has to run those lines and if something goes wrong, SOMEONE has to go digging in it. Ignoring the costs you don't pay is ridiculous.
We are comparing between a) asking an LLM to write code to parse a csv and b) writing code to parse a csv.
In both cases, they'll use a csv library, and a bajillion items of lower-level code. Application code is always standing on the shoulders of giants. Nobody is going to manually write assembly or machine code to parse a csv.
The original contention, which I was refuting, is that it's quicker and easier to use an LLM to write the python than it is to just write the python.
Kolmogorov complexity seems pretty irrelevant to this question.
>"read a CSV file and add a column containing the multiple of the price and quantity columns"
This is underspecified if you want to reliably and repeatably produce similar code.
The biggest difference is that some developers will read the whole CSV into memory before doing the computations. In practice the difference between those implementations is huge.
Another big difference is how you represent the price field. If you parse them as floats and the quantity is big enough, you'll end up with errors. Even if quantity is small, you'll have to deal with rounding in your new column.
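That float drift is easy to demonstrate (Python's `decimal` module is the usual fix):

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so errors creep into money math.
print(0.1 * 3)             # 0.30000000000000004
print(Decimal("0.1") * 3)  # 0.3, exact decimal arithmetic
```
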
You didn't even specify the name of the new column, so the name is going to be different every time you run the LLM.
What happens if you run this on a file the program has already been run on?
And these are just a few of the reasonable ways of fitting that spec but producing wildly different programs. Making a spec that has a good chance of producing a reasonably similar program each time looks more like:
“Read input.csv (UTF-8, comma-delimited, header row). Read it line by line, do not load the entire file into memory. Parse the price and quantity columns as numbers, stripping currency symbols and thousands separators; interpret decimals using a dot (.). Treat blanks as null and leave the result null for such rows. Compute per-row line_total = round(Decimal(price) * Decimal(quantity), 2). Append line_total as the last column (name the column "Total") without reordering existing columns, and write to output.csv, preserving quoting and delimiter. Do not overwrite existing columns. Do not evaluate or emit spreadsheet formulas.”
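Even a faithful implementation of that spec is only a screenful of Python (a sketch: currency-symbol stripping is omitted, and in-memory buffers stand in for the real files so it's self-contained):

```python
import csv
import io
from decimal import Decimal, InvalidOperation

def add_line_total(src, dst):
    """Stream rows, appending line_total = round(price * quantity, 2) as "Total"."""
    reader, writer = csv.reader(src), csv.writer(dst)
    header = next(reader)
    p, q = header.index("price"), header.index("quantity")
    writer.writerow(header + ["Total"])
    for row in reader:  # line by line: never loads the whole file into memory
        try:
            total = str((Decimal(row[p]) * Decimal(row[q])).quantize(Decimal("0.01")))
        except (InvalidOperation, IndexError):
            total = ""  # blanks stay null, per the spec
        writer.writerow(row + [total])

src = io.StringIO("sku,price,quantity\na,19.99,3\nb,,2\n")
dst = io.StringIO()
add_line_total(src, dst)
print(dst.getvalue())
```
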
And even then you couldn't just check this in and expect the same code to be generated each time; you'd need a large test suite just to constrain the LLM. And even then the LLM would still occasionally find ways to generate code that passes the tests but does things you don't want it to.
But why would I want to reliably produce similar code? The underspec is deliberate. Maybe I don't care about the name of the column as long as it's reasonable.
How to represent prices: same. This is computer nonsense. There's one right way to do it, the LLM knows that way, it should do it.
How to do it scalably: same. If the file is named the agent can just look at its size to decide on the best implementation.
Your alternative spec is too detailed and has many details that can be easily inferred by the AI, like defaulting to UTF-8 and comma delimited. This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.
>But why would I want to reliably produce similar code?
If you're doing a one-shot CSV, then an LLM or a custom program is the wrong way to do it. Any spreadsheet editor can do this task instantly with 4 symbols.
Assuming you want a repeatable process you need to define that repeatable process with enough specificity to make it repeatable and reliable.
You can do this in a formal language created for the purpose, or you can invent your own English-like specification language.
You can create a very loose specification and let someone else, a programmer or an LLM define the reliable, repeatable process for you. If you go with a junior programmer or an LLM though, you have to verify that the process they designed is actually reliable and repeatable. Many times it won't be and you'll need to make changes.
It's easier to write a few lines of python than to go through that process--unless you don't already know how to program, in which case you can't verify the output anyway.
That's not to say that I don't see beneficial use cases for AI, this just isn't one of them.
>This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.
If you don't actually care about implementation quality or correctness, sure. You should, and LLMs can not reliably pick the correct implementation details. They aren't even close to being able to do that.
The only people who are able to produce working software with LLMs are writing very, very detailed specifications, to the point where they aren't operating at a much higher level than Python.
Btw I had a Claude Sonnet 4 agent try your prompt.
It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.
> I had a Claude Sonnet 4 agent try your prompt. It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.
OK, that ups the stakes :)
I'm working on my own agent at the moment and gave it this task. I first had it generate a 10M row CSV with randomized product codes, prices and quantities.
It has two modes: fast and high quality. In fast mode I gave it the task "add to products.csv a column containing the multiple of the price and quantity columns". In 1m21s it wrote an AWK script that processed the file in a streaming manner and used it to add the column, with a backup file. So the solution did scale but it didn't avoid the other edge cases.
Then I tried the higher quality mode with the slightly generalized prompt "write a program that adds a column to a CSV file containing the multiple of the price and quantity columns". In this mode it generates a spec from the task, then reviews its own spec looking for potential bugs and edge cases, then incorporates its own feedback to update the spec, then implements the spec (all in separate contexts). This is with GPT-5.
The spec it settled on takes into account all those edge cases and many more, e.g. it thought about byte order marks, non-float math, safe overwrite, scientific notation, column name collisions, exit codes to use and more. It considered dealing with currency symbols but decided to put that out of scope (I could have edited the spec to override its decision here, but didn't). Time elapsed:
1. Generating the test set, 1m 9sec
2. Fast mode, 1m 21sec (it lost time due to a header quoting issue it then had to circle back and fix)
3. Quality mode, 48sec on initial spec, 2m on reviewing the spec, 1m 30sec on updating the spec (first attempt to patch in place failed, it tried again by replacing the entire file), 4m on implementing the spec - this includes time in which it tested its own solution and reviewed the output.
I believe the results to be correct and the program to tackle not only all the edge cases raised in this thread but others too. And yet the prompt was no more complex than the one I gave originally, and the results are higher quality than I'd have bothered to write myself.
I don't know which agent you used but right now we're not model intelligence constrained. Claude is a smart model, I'm sure it could have done the same, but the workflows the agents are implementing are full of very low hanging fruit.
Your spec isn’t actually a spec because it doesn’t produce the same software between runs.
The prompt is fantasy; all the "computer stuff" is reality. The computer stuff is the process that is actually running. If it's not possible to look at your prompt and know fairly accurately what the final process is going to look like, you are not operating at a higher level of abstraction; you are asking a Genie to do your work for you and hoping it gets it right.
Your prompt produces a spec—the actual code. Now that code is the spec, but you need to spend the time reading it well enough to understand what the spec actually is since you didn’t write the spec.
Then you need to go through the new spec and make sure you're happy with all of the decisions the LLM made. Do they make sense? Are there any requirements you need that it missed? Do you actually need to handle all of the edge cases it did handle?
>many more
The resulting code is almost certainly over-engineered if it's handling "many more": byte order marks, name collisions, etc. What you should do is settle on the column names beforehand.
This is a very common issue with junior developers. I call it "what-if driven development". Which, again, is why the only people having success with LLM coding are writing highly detailed specs that are very close to a programming language, or are generating something small, like a function at a time.
If you spend time to write out requirements in English in a way that cannot be misinterpreted in any way you end up with programming language.