Hacker News | userbinator's comments

The key point (which I believe static analysers these days can easily check for) is to check the sizes of the source and destination.

It's really not surprising that letting websites run arbitrary code on your machine, even in a sandbox, would lead to things like this.

There's no such thing as a sandbox "on your machine" when you really think about it. The code still runs on the same hardware and there are tons of ways to fiddle with said hardware that could be exploited (like rowhammer). The only "real" sandbox is fully dedicated hardware down to bare metal with zero connections to sensitive systems.

And now that Google's Web Environment Integrity is getting repackaged into captchas, it seems we won't even be able to try to block such things in the future...

It might've been off when packed, but all the vibration turned it on at some point.




It does happen, even to products being shipped new from the factory.

This reminds me of the story I read of someone trying to take a bomb calorimeter (https://en.wikipedia.org/wiki/Calorimeter#Bomb_calorimeters) onto a flight, in the pre-9/11 era. Fortunately he was allowed to after some questioning, but it did raise some eyebrows. I imagine trying to ship one of those would also arouse some attention.

> but Hama has a similarly named device

...I mentally appended an "s" to that, and was momentarily very confused.


> And if you think you discovered a bomb accidentally left discoverable, you don’t ask for it to be please turned off

That was the most hilarious part for me.


Turning it off would have solved the bureaucratic problem for flight crew. Sadly, the passengers (collectively) failed to accomplish this basic task.

> Turning it off would have solved the bureaucratic problem

The article says two Bluetooth radios weren’t turned off. Do we know if one of those was “the bomb?”


You can't really turn off most BLE devices with internal batteries; "off" means low-power mode nowadays. Some of them are still discoverable in Wireshark when they are "off".

It could've been in checked luggage and turned itself on from the movement. No way for the passengers to get to it. Unfortunately it didn't turn itself off (although if it did, and then later turned on again, that would've been even worse.)

The passenger may not have even known. I've certainly renamed friends' phones as a goof, although not to something that would get them into trouble.

I don't think it's as silly as people are making out. It at least proves a passenger is in control of the device, rather than it being stashed / hidden in the cabin. A device in the cabin not owned by any passenger broadcasting a signal is definitely more suspicious than one with a passenger in control. We don't know what their next step would have been - they might have asked everyone from row X to Y to turn their Bluetooth back on to narrow down the search etc. They probably didn't expect that anybody would fail to respond to the first instruction.

Also, how do we even know they're really "AI scrapers", or just a deliberate DDoS to push sites into using CF or other "anti-bot" providers?

They showed up when the AI money did. The evidence is circumstantial, but… some of them are remarkably well engineered (from a “how difficult is it to identify this traffic” perspective), in a way that never existed before. (I have been running a quite sizeable site for 8 years, with over 200k registered users, and you don’t need to register to use 99% of it.)

I run a quite large website and there are a few patterns.

The usage is extremely quick, and follows easy-to-spot patterns. We noticed a spike in bounce rate.

They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could do.

Then there's the crazy spikes in visits from specific countries, pretty much scraping the entire content. Often from pools of IPs. In some cases we had 30% unexplained (meaning: it wasn't viral or a marketing campaign) random sustained increases in traffic.

There's also the fact they don't interact with the complicated widgets, so zero XHR requests other than analytics pings.

They also don't cause spikes in Google Analytics, so I assume it's blocked, but they show up in logs and in the internal analytics.

It's not enough to DDoS the website at all, but it's a lot of noise in statistics that we gotta learn to filter.


> They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could do.

I’ve triggered this kind of “bot protection” right here on Hacker News many times. I did that by having a bunch of Hacker News pages open and then closing and reopening my browser. I’ve also triggered it by opening a bunch of links in the background too quickly. I’ve also triggered it by reading the article, then clicking back and upvoting/favouriting too quickly. I’m also located in Singapore, which people have started to advocate for blocking here recently.

A single non-bot legitimate user can easily trigger these kinds of heuristics just by using the site in a way you don’t expect. This can affect some users disproportionately more than others, e.g. disabled people who need to use assistive technology.


Oh I also do this all the time.

What I mean by "too fast" is opening 50 pages in the span of two or three milliseconds.

Either way, I'm not blocking. The CDN is handling the traffic alright.


I hate that sort of thing - when I rolled my own proof-of-work bot protection (providers wanted $$$$), I set it up so that

A) you'd have to open >200 tabs, and B) if any tab solves the proof-of-work, any that are still waiting to do so reload in the background.
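The comment above doesn't say how its proof-of-work was built, so the following is only a generic hashcash-style sketch. It assumes SHA-256 and a "leading zero bits" target. The names `make_challenge`, `solve`, `verify` and the value of `DIFFICULTY_BITS` are made up for illustration, and the tab-reloading behaviour is not modelled.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 12  # assumed difficulty; each extra bit doubles client work


def make_challenge() -> str:
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)


def _hash_value(challenge: str, nonce: int) -> int:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def verify(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """Server side: a single hash checks that the top `bits` bits are zero."""
    return _hash_value(challenge, nonce) >> (256 - bits) == 0


def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Client side: try nonces in order until one meets the target."""
    nonce = 0
    while not verify(challenge, nonce, bits):
        nonce += 1
    return nonce


challenge = make_challenge()
nonce = solve(challenge)
print(verify(challenge, nonce))  # True
```

The asymmetry is the point: the client needs about 2^bits hashes on average, while the server checks the answer with one.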


Yes, circumstantial is exactly the point; it's easy to use AI as a scapegoat because it's something popular to hate on.

It's circumstantial evidence, but Occam's Razor also applies.

It's not a hostile DOS in the traditional sense (I've mitigated a few of those) - no "pay us to make it stop", no pattern to the requests other than "fetch every unique URL a few times".

It wasn't happening until financial incentives to gather large datasets for AI training appeared.

Bad actors (using residential proxies & claiming to be a real browser) mostly showed up after folk started blocking ones that identified themselves as AI scrapers.

It's obvious to blame AI training because there's a shortage of better explanations. Who else would be paying for these (expensive) residential botnets, only to use them to (e.g.) web-scrape Wikipedia (which offers free downloads of its content in a structured format)?

The simplest explanation of the technical behavior is "a bot coded to follow every link it sees & save the results", and the simplest explanation of the motive to run such a bot is "to train a large language model".


> no "pay us to make it stop"

"use Cloudflare to make it stop"


Or Fastly, or Akamai, or Bunny, or any number of other providers.

Cloudflare are merely the cheapest of the bunch.


Exactly. They (and most of all, Big G) stand to profit greatly from this browser discrimination. What better than to make more sites use them by launching DDoS attacks in the name of "AI scraping".

"If they know you're spoofing, you're not spoofing hard enough."

This stupid "war against bots" is going to lead to the downfall of the Internet and effectively turn it into another walled garden where only "approved" (anti-)user agents are allowed. Don't fall for the nonsense about "AI scrapers" --- it's just a way to manufacture consent.


Idk, if bots are hammering your server then set up rate limits. If you have content that you don't want others to have access to, don't serve it with a webserver.
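For reference, a per-IP rate limit can be as small as the sketch below. It is a sliding-window counter kept in memory. The class name, the limit and the window are arbitrary choices for illustration, and the addresses come from the reserved documentation ranges.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of allowed hits

    def allow(self, key, now):
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window=10.0)

# One client, five requests at t = 0, 1, 2, 3 and 12 seconds.
results = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 12.0)]
print(results)  # [True, True, True, False, True]

# Five requests at the same moment, each from a different address.
rotating = [limiter.allow(f"198.51.100.{i}", 13.0) for i in range(5)]
print(rotating)  # [True, True, True, True, True]
```

The second run shows the weakness raised elsewhere in this thread: a limiter keyed on the IP address never sees a client that changes address on every request.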

I used to just start giving any IP downloading way too much a redirect to multi-TB NASA images. This was a long time ago but it was surprising how many would follow redirects and never time out. Wouldn't see a request again for hours and then it's right back to downloading a new part of the sky.

Those images also used to crash all the early GUI irc and chat clients that showed inline images without size checks...


How do you know it followed the redirect and downloaded the image?

Because it didn't come back for hours.

How were you tracking each IP address's data usage? Did you parse the logs every request? Store usage in a database? At the application or webserver level?

Webalizer! I'm not sure there were really any other options at the time other than writing your own. It parsed the Apache logs and gave you pretty detailed results, and you could see the usage (in KB, which tells you how long ago this was!) broken down by date and IP.

Once you added a redirect rule for the IP to Apache you'd just check your log and see the IP that was hitting you every couple of minutes poofed for a good few hours.
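The per-date, per-IP breakdown described above can be approximated in a few lines. This is a rough sketch, not how Webalizer itself works. It assumes Apache's Common Log Format, and the sample lines are invented.

```python
import re
from collections import Counter

# Common Log Format: ip ident user [day:time zone] "request" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-)'
)


def bytes_by_day_and_ip(lines):
    """Sum response sizes per (day, client IP)."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        size = match["size"]
        totals[(match["day"], match["ip"])] += 0 if size == "-" else int(size)
    return totals


# Invented sample lines using documentation IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2000:13:55:36 -0700] "GET /a.jpg HTTP/1.0" 200 2326',
    '203.0.113.7 - - [10/Oct/2000:13:55:40 -0700] "GET /b.jpg HTTP/1.0" 200 1000',
    '198.51.100.2 - - [10/Oct/2000:13:56:01 -0700] "GET / HTTP/1.0" 304 -',
]
totals = bytes_by_day_and_ip(sample)
for (day, ip), total in totals.most_common():
    print(day, ip, total)
```

Sorting by total puts the heaviest downloaders first, which is all you need to pick the IPs to redirect.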


Now that's a name I've not heard in a long time.

That's nuts. I suppose you had Webalizer on a minutely cron job. It might have been drawing more resources than Apache itself!


This. What even is the point of blocking scrapers if Google consumes your content anyway and serves it as an AI answer?

These are sad times we're living in as far as openness of the web goes. People would have less of a scraping problem if their websites didn't ship with 20MB of JS.


> What even is the point of blocking scrapers if Google consumes your content anyway and serves it as an AI answer?

Googlebot is generally fairly well behaved, but this is not the case for all scrapers and it can cause significant traffic (and expense).


I have blocked several Asian countries because their IP ranges kept sending stupid scrapers that repeatedly downloaded the same image with a made-up query, bursting through the basic cache setup. Now a billion or so people can't access my server.
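One common mitigation for the made-up-query trick is to build the cache key only from query parameters the site actually uses. The sketch below assumes a hypothetical image endpoint that honours only `w` and `h`. It is not a description of this commenter's setup.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: the only query parameters this hypothetical site honours.
ALLOWED_PARAMS = {"w", "h"}


def cache_key(url: str) -> str:
    """Build a cache key from the path plus known query parameters only."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS
    )
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path


print(cache_key("https://example.com/img/cat.jpg?x=81723"))
# /img/cat.jpg
print(cache_key("https://example.com/img/cat.jpg?h=100&junk=1&w=200"))
# /img/cat.jpg?h=100&w=200
```

With a key like this, every random `?x=...` variant of the same image maps to one cache entry instead of missing the cache each time.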

Rate limits didn't work because they kept rotating IP addresses.

I'm pretty sure Turnstile would allow more people through than my current solution, but this was quick and easy. I expect to have to ban more ASNs from other countries in the future but the worst bots are now gone.
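Blocking by range usually comes down to a prefix membership test. A minimal sketch with Python's standard `ipaddress` module follows. The networks listed are placeholders from the reserved documentation blocks, and turning an ASN or a country into a list of prefixes needs external data that isn't shown here.

```python
import ipaddress

# Assumption: placeholder ranges from the documentation address blocks,
# standing in for whatever networks an operator decides to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_blocked(address: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


print(is_blocked("203.0.113.42"))  # True
print(is_blocked("192.0.2.10"))    # False
```

A linear scan is fine for a handful of prefixes. Country-sized blocklists are normally pushed down into the firewall or CDN instead.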


There is something to be said for "one way indexes."

Imagine you run a company register for a local government. You want to let people look up companies by their registration number (which they must disclose in all communications to you) to see if they're legit and whether any warnings have been raised against them. You don't want unscrupulous marketers to just be able to `SELECT * FROM companies WHERE type='nail_salon' AND city='london'`.

If you aren't super strict about scraping, some shadowy business in Neverland, completely unconcerned with following your laws, will build that database.
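A rough sketch of what a "one way index" could look like in code: the full table exists internally, but the only public query is an exact match on the registration number. The schema, the rows and the `lookup` function are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (reg_no TEXT PRIMARY KEY, name TEXT, "
    "type TEXT, city TEXT, warnings INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?, ?, ?)",
    [
        ("A1001", "Example Nails Ltd", "nail_salon", "london", 0),
        ("A1002", "Sample Plumbing Ltd", "plumber", "leeds", 2),
    ],
)


def lookup(reg_no: str):
    """The only public query: exact match on the registration number."""
    return conn.execute(
        "SELECT reg_no, name, warnings FROM companies WHERE reg_no = ?",
        (reg_no,),
    ).fetchone()  # None when the number is unknown


print(lookup("A1002"))       # ('A1002', 'Sample Plumbing Ltd', 2)
print(lookup("nail_salon"))  # None
```

Nothing here filters by type or city, but if registration numbers are guessable a scraper can still walk through them one by one, which is why the comment ties the idea to being strict about scraping.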


> Imagine you run a company register for a local government.

Is this data not public for some reason? I think it will not hurt if there are multiple copies spread between public offices and private companies. What really hurts is a private company hammering your webserver for their own profit. They should get their own copy.


If the purpose of the index is to allow people to look up registrations and warnings, probably just serve the list. This is public information and doesn't need to be gated. The CSV header could be:

Reg_no, status, no_warnings_last_12m


Rate limits don’t work if bots rotate IPs from residential blocks on every request.

I would LOVE to be able to use rate limits (well actually, since I'm dealing with fraud not scraping, I'd ban the IP).

I can't, because every request comes from a new IP!!!


Yeah. I can already see the future. Only computers that pass remote attestation will be able to connect to the internet at all.

I found an article about finding a seashell in the middle of the desert on GitHub...

More seriously, I wonder if there's anything inside. Somewhat reminds me of the Coso artifact (https://en.wikipedia.org/wiki/Coso_artifact).


They're open-sourcing things either because they get no value from them anymore, or just want more unpaid "community" labour.

OK, well, that's the whole "open source" model. It's not some Microsoft perversion of it. The reason they moved from "free software" to "open source" was specifically rejecting the ideological stuff that would prevent business exploitation.
