> You just know nothing about math and are happy to parrot bullshit AI salesmen are selling you.
Not the parent poster here. I do know things about math. I wrote a few papers related to the unit distance problem (https://arxiv.org/abs/2311.10069, https://arxiv.org/abs/2406.15317) and spent quite some time trying to solve it. I had no chance of coming up with the proof that the spicy autocomplete came up with. Dumb benchmark, sure.
I would genuinely be interested in knowing what you're doing that led you to this conclusion.
I would be shocked if I was unable to solve 4th grade math homework with any of the contemporary frontier models. I spend most days using them to do significantly more complex things than that.
If they took a blurry photo of the piece of paper and uploaded it to ChatGPT saying "solve this" then I would totally believe it. The frontier models are mostly obnoxiously bad at OCR and properly ingesting what's on an image of a page.
If you write out the 4th grade math problem, they would have no trouble.
Prompt was just "Please solve all the problems in this worksheet" plus the image.
I got this:
1. 86 → 1, 2, 43, 86
2. 7 → 1, 7 (prime)
3. 12 → 1, 2, 3, 4, 6, 12
4. 38 → 1, 2, 19, 38
5. 52 → 1, 2, 4, 13, 26, 52
6. 9 → 1, 3, 9
7. 73 → 1, 73 (prime)
8. 98 → 1, 2, 7, 14, 49, 98
9. 24 → 1, 2, 3, 4, 6, 8, 12, 24
10. 6 → 1, 2, 3, 6
11. 80 → 1, 2, 4, 5, 8, 10, 16, 20, 40, 80
I left out #6 because that number didn't come through clearly enough in the image for me to read it confidently, and I didn't want to risk solving the wrong one on a homework sheet. If you let me know what it is, I'll factor it right away.
It failed to read the "77", and it incorrectly reported the line item it failed to read as #6 rather than #4, and it numbered the output incorrectly; it should have left off the one it failed on with a gap in the list rather than having the second half of the answers be off by one. It did actually factor everything correctly though.
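For anyone who wants to double-check the factor lists above without trusting the model, here is a minimal Python sketch. The names `divisors` and `claimed` are mine, not from the transcript, and the 77 entry is simply the number the model could not read.

```python
def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:          # avoid listing a square root twice
                large.append(n // d)
        d += 1
    return small + large[::-1]

# Factor lists as quoted in the model's answer above.
claimed = {
    86: [1, 2, 43, 86],
    7: [1, 7],
    12: [1, 2, 3, 4, 6, 12],
    38: [1, 2, 19, 38],
    52: [1, 2, 4, 13, 26, 52],
    9: [1, 3, 9],
    73: [1, 73],
    98: [1, 2, 7, 14, 49, 98],
    24: [1, 2, 3, 4, 6, 8, 12, 24],
    6: [1, 2, 3, 6],
    80: [1, 2, 4, 5, 8, 10, 16, 20, 40, 80],
}

for n, factors in claimed.items():
    assert divisors(n) == factors, f"mismatch for {n}"

# The number the model skipped.
print(divisors(77))
```

Every assertion passes, which matches the observation that the arithmetic was fine and only the reading and numbering went wrong.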
Yep, fair enough. So pretty far from perfect still! But quite good. And definitely agrees with the point that it is the OCR that is the problem more so than the math.
If your math does not involve multiplying 20 digit numbers, modern LLMs can "do" math even without a Python tool despite the counterintuition of next token prediction.
They can definitely recognize the problem class and build programs to do math. So what's the difference?
It's like saying that people can't turn high torque nuts on machine bolts, because you can't use your fingers to do it. But you can use a wrench, so effectively, we can turn high torque nuts on machine bolts even though it isn't something we can natively do unaided.
Again, I'm very interested in your methodology here. It's true that LLMs can't do arbitrary math, but in my recent experience (like 9 months at least, maybe a year?), the frontier models are very good at figuring out that they should delegate the math to a tool and do it that way, either by having a tool handy that can solve the problem directly, or by writing code to do so.
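As a rough illustration of what "writing code to do so" buys you: Python integers are arbitrary precision, so a 20-digit multiplication is exact once it is delegated. The two operands below are arbitrary values I made up for the example.

```python
# Two made-up 20-digit operands (hypothetical, not from any worksheet).
a = 12345678901234567890
b = 98765432109876543210

product = a * b  # Python ints do not overflow, so this is exact

# Sanity check: dividing the product by one factor recovers the other.
assert product % a == 0 and product // a == b

print(product)
print(len(str(product)), "digits")
```

The model only has to recognize the problem class and emit something like this; the interpreter does the digit-by-digit work.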
Reasoning models with access to Python have been able to solve 4th grade math homework for over a year now. Prove me wrong: show me a 4th grade math problem they can't handle.
Fourth graders typically don't have access to Python for their homework assignments. To be fair to the kids, I tried it first without Python: Opus 4.6 (Feb 2026) with default Medium effort. https://claude.ai/share/1533a3e4-6757-4614-b95d-0743350a6598
It got questions 2 (Shop D) and 5 (280) wrong. It got question 3 right but the work it showed has the numbers for each shop wrong. My fourth grade teacher would have taken off points for that (shout out Mrs. Van Bladel).
This time it used Python to "check its work", and answered the same questions incorrectly (2 and 5). To the model's credit, it did show the correct work on answer 3 this time.
That's more of a test of vision LLM ability to correctly identify and count things in an image than it is of mathematical reasoning.
If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes.
I've been testing vision LLMs on counting the number of pelicans in a photo for a while; they're very unreliable at that.
The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering, you need to explicitly cause it to use its image segmentation mechanism.
Fourth grade math† students are learning geometry and how to draw simple plots. Vision ability (or tactile ability, for visually impaired students) is pretty important to understanding and solving those homework problems.
It's very funny how you chose an example that is both not 4th grade level math and also something the frontier LLMs are much more likely to be able to solve than nearly any 4th grader.
This is a counterexample to your argument, not evidence for your claim. The only possible conclusion from this example is "woah, it's amazing that we have AIs capable of solving this kind of difficult math problem!", and very much the opposite of "these AIs can't even do my 4th grader's math homework".
GPT-5.5 found a solution only after assuming that you're allowed to concatenate numbers together e.g. 8 7 becomes 87 (it complained at first that it was "under-specified") - using Python it brute-forced a solution (actually finding 13): https://chatgpt.com/share/6a1db54f-7ab8-8333-9218-86a469c284...
I questioned OP's "there is an answer online" claim so I checked and the only source found for the original question was a 5th grade Russian school for mathematics.
Apparently there is a way to solve this without brute forcing all the combinations. It has to do with looking at how many even and odd numbers there are, and taking into account that the goal number is odd. And then thinking through the combinations [even-even=even, even-odd=odd,…]
Though this is obviously not something I would expect a 4th grader to solve.
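The puzzle itself isn't quoted in this thread, so the following is only a generic sketch of the parity idea, with made-up numbers and the assumption that only + and - may go between terms (no concatenation). The function name `reachable_parities` is hypothetical.

```python
from itertools import product

def reachable_parities(nums):
    """Return the set of parities (0 = even, 1 = odd) over all +/- sign choices."""
    parities = set()
    for signs in product((1, -1), repeat=len(nums)):
        total = sum(s * n for s, n in zip(signs, nums))
        parities.add(total % 2)  # Python's % is non-negative for a positive modulus
    return parities

# Hypothetical numbers, not the puzzle's actual ones.
nums = [8, 7, 6, 5, 4, 3, 2, 1]
odd_count = sum(n % 2 for n in nums)

# Flipping a sign never changes parity, so every result has the same
# parity as the count of odd terms.
assert reachable_parities(nums) == {odd_count % 2}
print(odd_count, reachable_parities(nums))
```

With four odd terms every reachable total is even, so an odd target can be ruled out by counting rather than by enumerating all the sign combinations.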
TL;DR - "cleaner and more polished" implies no vibe coded slop, and the only reason people start these projects in 2026 is because they're addicted to vibe slop coding.
Funny speculative question: psychosis is evidently a gradient. Does AI just highlight latent general psychosis (i.e. in the simplified interpretation of a worldview shaped more by unchecked belief and fantasy than observation) in otherwise largely functional people?
What if the problem is that we train people too much to take things that are being said at face value without questioning/observing them, increasing the psychosis problem?
Everyone is susceptible to addictions or psychosis to some degree.
What matters is when the stimulus presented exceeds their resistance.
Extended AI use is a highly attractive stimulus that exceeds most people's resistance, especially when sycophantically interacted with in an echo chamber (human-AI, with no other humans in the room).
So yes, it's dangerous in the same way that cigarettes and social media are.
Just because some people can avoid slipping into it, doesn't mean we should ignore population-as-a-whole outcomes.
The problem is that both camps treat their positions as religious righteousness, which lobotomizes their ability to have productive pros-and-cons discussions about the matters at hand.
The internet/apps of the last 20 years have not exactly boosted people's ability to think critically and set aside their passions though.
Much easier to keep eyeballs glued and sell them ads if you encourage their baser impulses.
This is rsync we are talking about. A bug in rsync basically means lost data and/or unreliable backups.
I think it's normal to be pissed at lost data. Maybe it's not socially acceptable to spit in the face of a volunteer but it's 100% human to feel annoyed by an obvious drop in code quality.
There must be some degree of communication from customers to developers. Even if it is a free volunteer service.
Poor communication results in professionals firing the customer as well. None of this is exclusive to OSS or volunteer effort. But the communication in general is necessary.
This is just product management and communication issues. There is a perceived problem and the problem MUST be communicated.
Problems aren't solved by shutting up and ignoring things. And based on the discussion in this topic, it's clear there's a lot of people who are worried about rsync code quality here.
Look, it wasn't that long ago that we had the xz malware. The pattern is always the same. The maintainer of the project is doing X, people start to pressure them to do something else, the maintainer gives up and opens the project up to other maintainers, and then many things can happen. If there is any lesson from the incident, it's that open source maintainers should never allow the pressure to happen: ignore it, and if it's too strong, block people. Rsync has been maintained for a very long time. Bugs happen, even regression bugs happen. People don't get to dictate how the volunteer should do development.
If I were the rsync maintainer after this I'd unpublish it everywhere I had control over, delete the repo and turn off my computer to go walk in the park. The linked thread is insane.
Again, this is not work and they are not customers.
This is somebody spending their free time on code they enjoy and then putting the result online.
The reason businesses are careful about which customers they fire is because they want to keep having customers. Open source maintainers have no reason to deal with that shit.
And it seems like regressions that lead to rsync losing data are just as serious.
Again: we are talking about rsync here. This new methodology being used this year seems to be associated with a regression (i.e. data loss, since this is rsync after all) that likely wouldn't have happened any other year.
Or at least: the regressions at play consist of thousands of lines of changes that were only navigated by Claude further down in the discussion.
We are reaching the point of AI-developed code that requires AI itself to analyze. One step at a time. It's right for the open source customers, who are used to understanding changes and to smaller patches than this, to be worried.
Before you call yourself a customer of an FOSS project, perhaps show us the receipt that a monetary transaction had actually taken place between you and the developer.
Otherwise, you're just a beggar. And beggars don't get to choose.
Customers pay money for goods and services. They thus get a bunch of social, ethical, and legal positions in terms of their relationship with the seller.
Rsync is an open source project that its maintainers put onto the Internet. People who use it are not customers, and they do not have the right to expectations around how the maintainers will change the software or change how they develop it.
You've never had a customer in your professional setting who didn't pay money for goods and/or services? Yet it was very important for your boss (and therefore you, as a programmer) to service their every whim?
Customers are customers. Whether they're paying or not. Not all customers are worth servicing (even with infinite money offered, "firing a customer" is important to keep the community in check).
But this isn't a situation where the rsync maintainer should fire the customer. There's a LOT of backlash to this release. Even if this one particular customer is a bit of an ass, there are plenty of good users in that 90+ comment chain (hundreds now?) where this regression has clearly struck a nerve.
This is not a professional setting. This is an open source project that somebody published to the internet. Using it does not make you a customer, and it doesn’t matter if it “struck a nerve” with users.
Well in my professional setting, I deal with non-paying customers all the time. They're still customers and I'm still expected to listen to them.
It was better when a dedicated PM was shielding me from this crap but here we are. Deciding who and who not to listen to is just part of project management.
If committing thousands of lines of unreviewed AI generated code is "doing their best", I'd argue that them not contributing anymore would be a net benefit for the project.
The first comment, which is a screenshot from Mastodon, is perfectly acceptable. There is a clear regression between newer versions of rsync.
Then egos got bruised and things left the realm of reason soon after. But coming with a request saying "Version X worked while version Y doesn't", with maybe some degree of annoyance, is fine.
This doesn’t seem related to my comment. Did you mean to reply to me upthread?
Saying ~“maybe it’s not ok to do <thing> but <reasons they might do thing>” is nothing like your example and does imply it’s acceptable to the speaker to sometimes do that thing.
But we’re past that now because the person I was discussing this with has gone ahead and clarified that telling an open source maintainer to please stop fucking up isn’t an angry comment.
You might want to use a different term. After all, Trump derangement syndrome turned out to be "people who actually listen to him and say the truth about him".
The X-derangement thing is not used in reference to people who are wrong or lying, but in reference to people who are making correct observations.
(I need a better model to translate from llmese.)