I like how in spite of the author explaining why (father of two small children that occupy his free time), you jumped to the most negative set of possibilities. Instead, it sounds like when he's with his children, he is focusing on them instead of on productivity, which is the opposite of what you're suggesting.
Also, if he instead chose to occupy his drive time with listening to a comedy podcast, or NPR, or even a technical podcast, I can't help but imagine you wouldn't give it a second thought, in spite of that being just as "productive" and "avoiding thinking about the tough things".
I will say, I find it fascinating that there are some philosophers and consciousness researchers who seem to be less certain. I just listened to Chris Hayes interview David Chalmers this week, whose position seemed to be that these models are probably not conscious, but that we can't be certain. And more than that: he seemed open to the idea that they may become conscious under further scaling/training/advancements.
Funny enough, the models seemingly go insane and decohere into noise output in the absence of sensory input, which is remarkably similar to what would happen to a human.
That said, I'm not sure I follow what you're actually asking here? I'll also note that I'm not taking a position one way or the other, just sharing a podcast and noting that an extremely reputable scholar on the subject of consciousness seems to have a bit more uncertainty and humility than many commenting here. ;)
LLMs just wait for a prompt, so they do nothing and are just frozen in place.
I'll find time to listen to your link, it sounds interesting. My objection is the strange idea that humans are automatons that are keyed off input like a clockwork machine and operate sequentially. This is clearly not the case.
>LLMs just wait for a prompt, so they do nothing and are just frozen in place.
I'm not sure that's a compelling argument. Humans can be put into a similar state where they are unconscious and not thinking. Think of someone in a coma, for example, where we actually measure and confirm that there is no brain activity where they're in that state.
They are not actively conscious, but that doesn't nullify their consciousness from when they were awake, right?
>My objection is the strange idea that humans are automatons that are keyed off input like a clockwork machine and operate sequentially. This is clearly not the case.
Well, a few thoughts here. First, it's worth noting that the argument isn't necessarily that AI are conscious in the way that humans are, nor that humans are strictly automatons.
But I think the more interesting thing is that our understanding about consciousness has evolved quite a bit in just the last fifty to one hundred years. We used to think that only humans were conscious, but assumed that primates, cows, dogs, and other mammals were just automatons. Then we started to think: okay, maybe primates are conscious. Then eventually: well, dogs also seem to have consciousness, and then rodents, etc.
This has continued such that most people in the study of consciousness think all mammals are conscious, and the debate is shifting down to insects and other creatures that we do think/have thought of more as automatons. We don't actually know where to draw the line, because it's essentially impossible to really feel/know the inner states of other living beings.
In the face of all this uncertainty, Chalmers just points out that since we understand consciousness so little, we should probably be less definitive in pronouncing which things do or do not have it.
> I'm not sure that's a compelling argument. Humans can be put into a similar state where they are unconscious and not thinking. Think of someone in a coma, for example, where we actually measure and confirm that there is no brain activity where they're in that state.
He was responding to your comment
> Funny enough, the models seemingly go insane and decohere into noise output in the absence of sensory input
The assumption being that "sensory input" is a prompt. What did you mean by sensory input?
Yeah, I have to admit to finding it somewhat ironic that some individuals accuse the "pro AI" folks of magical thinking, when it seems that escalating levels of magical thinking are being used by the "anti" crowd to suggest that the models can never achieve something akin to human intelligence (particularly in light of the fact that they have on certain dimensions done exactly that).
It's pretty clear that there are significant differences between their intelligence and human intelligence. But that doesn't mean there isn't some sort of intelligence here.
>If anything AI will be used to correct all the crappy human made code that is still being pushed due to the vanity of coders still pretending that they are better than AI at coding.
In my organization, this is already happening. We've been using LLMs to boost our test coverage without touching our human code, then using that coverage as a scaffold to let them go through and refactor, clean up, and optimize, and then validating the results against both our tests and gold standard test datasets.
In our case, it's made a legacy codebase far more readable to our junior engineers, and the performance improvements (from using an autoresearch-style approach) have resulted in a six-figure decrease in our compute spend for the production service we trialed this on.
I'm not sure an article that gives one paragraph summaries of the common anti-LLM talking points is really a substantial contribution to the conversation. This is essentially a snarked up version of a "Criticisms" section one would expect to find in a Wikipedia article on modern generative AI. It's fairly hollow unless you've had your head in the sand and are just getting up to speed on the current conversation.
Your statement seems to be implying (correctly) that LLMs can program, but just not as well as humans. If they're able to program presumably without "thinking" as you seem to be (implicitly) narrowly defining it, then why do you think that limits them to always being sub-par?
It seems like if they can do it, there's no reason they can't eventually be trained to do it better, up to and beyond human performance. It seems strange to suggest that thinking unlocks some nominal margin of "better" specifically that can't be overcome.
All of that aside, even if they can't outperform the top human programmers...what if they get to within a margin where they're still better than most? Isn't a 95th percentile programmer that can run 24/7 and continuously refine its work still going to ultimately come out on top?
I'm more interested in the conclusion that programming doesn't require thinking. And that's where the argument breaks. It seems so obvious, but sometimes the most obvious things are the least true.
>I'm more interested in the conclusion that programming doesn't require thinking.
I suspect it largely has to do with how one defines "thinking". It seems like people like to implicitly define it in such a way as to require a human (or animal), but there are many examples of thinking/intelligence in nature that don't require a brain or even neurons.
I'm genuinely curious: without using the word "think" with all of its ambiguity, can you articulate what it is that we're doing that these models are not capable of? Because it's pretty clear (to me, at least) from the research, particularly a lot of the mechanistic interpretability work coming out of Anthropic, that the models are at least doing something akin to what we think of as thinking, even if it appears foreign to us.
>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.
There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.
But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.
I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.
I think this is exactly right. Basically when I am coding, having an agent that roughly matches my intelligence is a feature, not a bug. Having one that is 10x as smart would actively slow me down because I would have to spend the time understanding what it is doing or hand over all architecture to it and just vibe code everything, hoping that it doesn’t do the PhD version of fizzbuzz instead of the maintainable one.
But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
Maybe. I can’t imagine what the solutions of a software engineer who is 10x smarter than any human who has ever lived would be like, by definition. All I know is that its optimal way of solving a problem may well be too clever for me to understand, and as long as I must verify its work, I must be able to fully understand the code it writes.
Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.
4.8 is demonstrating simplicity, hence it's smarter?? It just refactored my 4.6 generated code (4.8 is very slow on difficult tasks - urgh! - without burning tokens - yey!) but the output was wow! Simple, elegant and exactly what I wanted to see.
It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].
Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.
So first - these are terrific papers and I'd not seen some of them before.
Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".
A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.
I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.
Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights
To the previous poster's point, soft distributions are useful, even saving the top 10 logits is significantly more training signal than just the final token.
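To make the "more training signal" point concrete, here is a minimal Python sketch comparing the three kinds of target discussed in this subthread: a single sampled token (hard target), the teacher's full next-token distribution (soft target), and only the teacher's top-k logits. The function names, the toy logits, and the choice to renormalize the top-k entries are my own assumptions for illustration, not something any lab is confirmed to do.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one token id (what text-only synthetic data gives you)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the whole vocabulary (classic soft targets)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top_k_soft_target_loss(student_logits, teacher_logits, k):
    """Cross-entropy against only the teacher's top-k entries, renormalized.

    Assumption: the saved top-k logits are renormalized among themselves;
    other truncation schemes are possible.
    """
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    p_top = softmax([teacher_logits[i] for i in top])
    q = softmax(student_logits)
    return -sum(p * math.log(q[i]) for p, i in zip(p_top, top))


# Toy 4-token vocabulary; the numbers are made up.
teacher = [2.0, 1.0, 0.1, -1.0]
student = [0.5, 0.4, 0.3, 0.2]
print(hard_target_loss(student, 0))
print(soft_target_loss(student, teacher))
print(top_k_soft_target_loss(student, teacher, k=2))
```

With k=1 the top-k loss collapses to the hard-target loss on the teacher's most likely token, so the two approaches sit on a spectrum rather than being strictly different techniques.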
Could you share some recent articles or papers comparing both methods, especially for the language modelling case?
I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing that: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.
I have preferred synthetic datasets since the first day I heard about distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language areas).
Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)
The teacher distillation is a corpus of text, and the "next token after the context" would be looking-up the context in the corpus, and for each occurrence the label is what followed in the corpus, scaled down by the number of occurrences of the context. The teacher is moot on contexts outside of the corpus though, unlike the usual teacher model in distillation.
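A rough Python sketch of the "corpus as teacher" view described above. It assumes a fixed-length context and pre-split tokens, which are simplifications of mine; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_corpus_teacher(tokens, context_len):
    """Count which token follows each context of length `context_len`."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts


def teacher_distribution(counts, context):
    """Next-token labels for a context, scaled by how often the context occurs.

    Returns None when the context never appears in the corpus, which is
    where this "teacher" has nothing to say, unlike a real teacher model.
    """
    followers = counts.get(tuple(context))
    if not followers:
        return None
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}


tokens = "the cat sat the cat ran the dog sat".split()
counts = build_corpus_teacher(tokens, context_len=2)
print(teacher_distribution(counts, ["the", "cat"]))  # seen twice in the corpus
print(teacher_distribution(counts, ["dog", "ran"]))  # never seen
```

The second call returns None, which is the "moot on contexts outside of the corpus" case.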
> I don't disagree, but how much of this ends up being distillation?
A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.
I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.
> I don't disagree, but how much of this ends up being distillation?
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data
Raw training data is raw. A really big model trained on it has already done a first-pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for marginal quality improvement over distilling from the large model.
Distilling from a larger model is not only probably cheaper than from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they are training from.
Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.
On the architectures side, I'm a lot more interested in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?
Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).
MTP will still be highly valuable for interactive use of course.
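The Amdahl's law remark about prefill can be worked through with a small Python sketch. The 20% prefill share below is purely an assumed number for illustration, and the function name is hypothetical; only the 2x and 3x decode figures come from the comment above.

```python
def overall_speedup(prefill_fraction, decode_speedup):
    """Amdahl's law: prefill is the part that does not get faster.

    prefill_fraction: share of baseline wall-clock time spent in prefill (0..1).
    decode_speedup: aggregate decode tok/s gain from batching.
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


# Assumption: prefill takes 20% of baseline wall-clock time.
for decode_gain in (2.0, 3.0):
    total = overall_speedup(0.2, decode_gain)
    print(f"decode {decode_gain:.0f}x -> overall {total:.2f}x")
```

Under that assumption, a 2x decode gain gives roughly 1.67x overall and a 3x decode gain roughly 2.14x, with prefill taking up a growing share of the remaining time.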
It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.
Too much personality, if you ask me. My biggest use case for an LLM is as a tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.
haven't verified, but attributed to Askell:
"I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.
The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me, it's more helpful to have a colleague role who will sort of jump out and say "nah thats dumb what if we throw out that whole thing and do this completely different angle instead".
Given the incentives that bring out the personalities in various occupations, I would guess other personas would be better suited to getting a task done than 'therapist' or 'tech HR rep'.
For example, that of an explosive ordnance disposal technician, a surgeon, or a salvage saturation diver.
4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.
Just speculating but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes for one thing, its "personality", is less human and more fatiguing-AI-slop like.
You don't need to fry with RLAF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.
That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.
I must admit that I am going to find it fascinating when we hit the point where it becomes nearly impossible to deny the efficacy of these tools. I have straight up had people, even in real life, suggest that I'm lying about my productivity gains or what I'm able to accomplish with them.
Like, I understand the reasonable arguments against (I even agree with a few), but it's clear that some people have fully inserted their head into the sand and just don't want to believe any of this could be true. Which will be harsh, since I think getting hit with this train all at once in the future is going to be a rougher ride than a slower coming-to-terms-with, even if the result is one we're unhappy with.
What is the motivation for us users to lie about our experiences? It's gotten to the point that people simply refuse to believe that I'm honestly describing my experiences with these tools.
I understand the motivations for the labs to lie, but what do you think mine is?
Oh, basic counting is now arithmetic? But I was told they were superintelligent and were going to cause an apocalypse because they can do pretty much everything? Somehow because they can excrete a lot of text, we were told they can do everything else too?
I work in big tech and probably 90% of code over the last month has been written by AI. And I suspect it's probably higher within Anthropic, which is probably what he's basing his opinion on.
So, he's closer to correct than not.
That said, your recollection is also flawed. It was in mid-March, and here are the relevant quotes:
>I think we’ll be there in three to six months—where AI is writing 90 percent of the code. And then in twelve months, we may be in a world where AI is writing essentially all of the code.
[...]
>But the programmer still needs to specify, you know, what are—what are the conditions of what you’re doing, what—you know, what is the overall app you’re trying to make, what’s the overall design decision? How do we collaborate with other code that’s been written? You know, how do we have some common sense on whether this is a secure design or an insecure design?
[...]
>So as long as there are these small pieces that a programmer, a human programmer, needs to do, the AI isn’t good at, I think human productivity will actually be enhanced. But on the other hand, I think that eventually all those little islands will get picked off by AI systems.
With another 3-4 months left on the clock, his prediction seems remarkably on point for at least certain organizations and domains.
I welcome you to also hold yourself accountable in the coming months if this trend continues. ;)
Yep! We have a review process where we have a few agents, each tuned to a particular domain of expertise (security, code quality, etc) which iterate until the feedback meets a certain threshold, at which point it goes over to humans for (hopefully) final review.
That said, I generally agree that you're correct: writing code in many ways has not been the biggest bottleneck. However, by removing much of that writing, it frees up engineers to work on the uniquely human things that are larger bottlenecks.
I had a few comments in a thread here touching on where I think most of the value has come from for us (which is largely search/understanding of our dependencies and making away team work far more viable, which aids with cutting through bureaucracy and the tendency for teams to push back on work): https://news.ycombinator.com/item?id=48298731
Haven't you heard - these days they just throw slop generated by LLM agents over to other LLM agents which cosplay as internal QA. They know it works because they write really strict .MD files where they instruct agents in English language to 'never do this' and 'always do that'.
This is really what you think happens at large tech companies? You don't think it's possible this is maybe even slightly overly simplifying what the relevant processes are?
Your comment suggests you don’t really seek to know how things work here, and that you can't imagine the Occam’s razor explanation: agents are more useful than you think they are.
> I welcome you to also hold yourself accountable in the coming months if this trend continues. ;)
My company did not swallow hundreds of billions in shady investment deals and is not publicly traded. We work with real money, and the revenue on our books is the revenue that is actually booked, not fake revenue that we hope might materialize in two years' time. So no, I am not going to hold myself accountable. But people who work with other people's money should be absolutely held accountable when their wild imaginations don't come true, repeatedly, quarter after quarter, year after year!
Mate, for 5 years I've been hearing that crap. I am not predicting anything; on the contrary, the AI-boosting bunch is. When are your predictions coming true?
AFAIK, most predictions from several years ago were for...approximately now to within the next few years. Can you be more specific?
You criticized a very specific (and fake/misquoted) prediction, ignored the correction, and are now criticizing vague hand-wavey "predictions" that you have left unspecified.
Can you please stop with the angry/ranty replies and actually have a real conversation grounded in actual facts?
Now, having said all of the above...I'll also point out that these are predictions, not promises/guarantees. These people are being asked to forecast and are doing so. I hardly think they should be held responsible for not being literal oracles, but even so--please, at least quote them correctly/at all.
In short: be better than the hallucinations you seem to call out from the models.
I will note that you have essentially not responded to anything specific in my comment, nor at least acknowledged that you misstated Dario Amodei's actual prediction.