MyNameIsNickT 2 days ago [-]
Hey! I'm Nick, and I work on Integrity at OpenAI. These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
A big reason we invest in this is because we want to keep free and logged-out access available for more users. My team’s goal is to help make sure the limited GPU resources are going to real users.
We also keep a very close eye on the user impact. We monitor things like page load time, time to first token and payload size, with a focus on reducing the overhead of these protections. For the majority of people, the impact is negligible, and only a very small percentage may see a slight delay from extra checks. We also continuously evaluate precision so we can minimize false positives while still making abuse meaningfully harder.
vlovich123 2 days ago [-]
That still doesn’t explain why you can’t even start typing until that check completes. You could simply hold the outbound request until the check passes. But preventing typing seems like strictly worse UX, and the problem will fail to appear in any metrics you can track, because you have no way of measuring “how quickly would the user have submitted their request without all this other stuff in the way”.
Said another way: if the check ran in the background, the user wouldn’t even notice it unless they typed and submitted their query before it completed. In the realistic scenario, the check would complete before they even submit their request.
mike_hearn 1 days ago [-]
I developed the first version of Google's equivalent of this (albeit theirs actually computes a constantly rotating key from the environment rather than just hard-coding it in the program!).
The reason it has to block until it's loaded is that otherwise the signal being missing doesn't imply automation. The user might have just typed before it loaded. If you know a legit user will always deliver the data, you can use the absence of it to infer something about what's happening on the client. You can obviously track metrics like "key event occurred before bot detection script did" without using it as an automation signal, just for monitoring.
fc417fc802 1 days ago [-]
That doesn't make sense. The server would wait to process anything until after it receives the signal. If the signal doesn't arrive within a reasonable period of time, that tells you something, the same as right now.
If you mean that you can infer client side tampering with the page contents you could still do that - permit typing but don't permit the submit action on the client. The user presses enter but nothing happens until the check is complete. There you go, now you can tell if the page was tampered with (not that it makes much difference tbh).
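Server-side, the grace-period idea the commenter describes can be expressed very compactly. A hypothetical sketch (the function, field names, and thresholds are all invented for illustration):

```typescript
// Hypothetical sketch: the absence of the bot-detection payload only
// becomes meaningful once a real browser would have had time to send it.
function classifyRequest(opts: {
  signalPresent: boolean;  // did the client deliver the detection payload?
  msSincePageLoad: number; // how long the page has been open
  graceMs: number;         // allowance for slow networks / slow devices
}): "ok" | "suspect" {
  if (opts.signalPresent) return "ok";
  // No signal yet: a fast human typist may simply have beaten the script,
  // so only flag the request once the grace period has clearly elapsed.
  return opts.msSincePageLoad > opts.graceMs ? "suspect" : "ok";
}
```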
mike_hearn 1 days ago [-]
The typing actions have to be observed by JavaScript. It's no different from any other JS blocking page load because it's needed for the site to work; that's just how the web works.
electroly 1 days ago [-]
This doesn't seem to be the same thing. The article isn't about being unable to type before JavaScript starts executing. If I understand correctly, you're unable to type until a network request to Cloudflare returns. The question is: why not allow typing during that network request? JavaScript is running and it's observing the keystrokes. Everyone understands that you can't use a React application until JavaScript is running. They're asking why the network request doesn't happen in the background with the user optimistically allowed to type while waiting for it to return.
(Separately, I don't think the article has adequately demonstrated this claim. They just make the claim in the title. The actual article only shows that some network request is made, and that the request happens after the React app is loaded, but not that they prevent input until it returns. Maybe it's obvious from using it, but they didn't demonstrate it.)
mike_hearn 1 days ago [-]
The network request to Cloudflare is part of the JavaScript (in effect).
electroly 1 days ago [-]
I don't think that's true in this case; the React application loads first, fully initializes, and then sends its state via Cloudflare request. It can't happen at the same time, by design. It has to happen serially. The article's claim is that you can't type during this second request. Frankly, I wonder if this is actually true at all. The article did not demonstrate this, and there's no problem if you can actually interact as soon as the React application is running. ChatGPT running abuse prevention and React applications requiring JavaScript to work are both uncontroversial, I think.
mike_hearn 1 days ago [-]
OK, I haven't looked at the exact sequencing here. But generally, once the action goes back to the anti-abuse service for checking the user can't be allowed to change what they're submitting. The view the anti-abuse system saw has to match what the app server sees.
vlovich123 19 hours ago [-]
Still incorrect because the user in this case is being prohibited from submitting anything at all.
root_axis 1 days ago [-]
Why can't you allow typing and just consume the state of the text input as the initial state of the js logic?
arccy 1 days ago [-]
how you type is also part of the signal
sbochins 19 hours ago [-]
Then track that data and upload when you can make the request.
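If typing cadence really is part of the signal, it can still be captured without blocking input: record events locally and upload the trace once the check allows it, as this commenter suggests. A hypothetical sketch (names are invented; this is not any vendor's actual telemetry):

```typescript
// Hypothetical sketch: buffer keystroke timing client-side, flush later.
type KeyEvent = { key: string; t: number };

class KeystrokeBuffer {
  private events: KeyEvent[] = [];

  // `now` is injectable so the buffer is testable without a real clock.
  constructor(private now: () => number = () => Date.now()) {}

  record(key: string): void {
    this.events.push({ key, t: this.now() });
  }

  // Inter-key intervals: the kind of cadence signal a detector might want.
  intervals(): number[] {
    const out: number[] = [];
    for (let i = 1; i < this.events.length; i++) {
      out.push(this.events[i].t - this.events[i - 1].t);
    }
    return out;
  }

  // Hand the whole trace to the uploader once verification completes.
  flush(): KeyEvent[] {
    const batch = this.events;
    this.events = [];
    return batch;
  }
}
```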
susupro1 1 days ago [-]
This perfectly explains the trade-off. But from a pure UX perspective, freezing the input pipeline feels uniquely hostile. Locking the cursor creates the jarring perception that the site is actively fighting the user; they could buffer the keystrokes invisibly in the background instead.
toinewx 1 days ago [-]
can you reformulate your message?
gavinray 1 days ago [-]
Mike is saying that if you allow users to type before the scripts are fully loaded, there is no way to tell the difference between a human and a bot.
Blocking until load means that human interaction is physically impossible before that point, so you can be certain that any input arriving earlier is automated.
If you allow typing, this distinction vanishes.
LtWorf 1 days ago [-]
Load fewer scripts so it doesn't take that long?
p-e-w 2 days ago [-]
Many cloud products now continuously send your input back to their servers while you are still typing it, to squeeze the maximum possible amount of data from your interactions.
I don’t know whether ChatGPT is one of those products, but if it is, that behavior might be a side effect of blocking the input pipeline until verification completes. It might be that they want to get every single one of your keystrokes, but only after checking that you’re not a bot.
davidkunz 2 days ago [-]
It's still possible to let users already type from the beginning, just delay sending the characters until checks are complete. Hold them in memory until then.
miyuru 2 days ago [-]
Instagram was uploading images while the user was still adding post details, back in 2012!
No one seems to use or care about their own product anymore. They only look at dashboards and metrics, which don't explain the full situation.
AlecSchueler 2 days ago [-]
That makes total sense from a UX perspective though, the ChatGPT thing does not.
scottyah 1 days ago [-]
there were a lot of helpdesk chats doing the same, so you could see users typing messages, then deleting words, etc before hitting send.
Imustaskforhelp 1 days ago [-]
This was actually one of the reasons why Instagram felt smooth.
On another note: Facebook/Instagram have also reportedly detected when a person uploads an image and then deletes it, inferred from that that they are insecure, and, in the case of teenage girls, actually recorded that insecurity in their profile and shown them beauty products....
I really like telling this example because people in real life (and even online) get so shocked. They know Facebook is bad, but they don't know it's this bad.
[Also a bit offtopic, but I really like how the item?id=3913919 the 391 came twice :-) , its a good item id ]
mort96 1 days ago [-]
I just checked the network inspector; the only thing it does per key press is generate an autocomplete list. It doesn't seem too hard to hold off on autocomplete generation until after whichever checks you run have passed.
andai 2 days ago [-]
I wondered if ChatGPT streams my message to the GPU while I type it, because the response comes weirdly fast after I submit the message. But I don't know much about how this stuff works.
aabhay 1 days ago [-]
Likely prefix caching among many other things
matchagaucho 1 days ago [-]
Keyboard response feels 10x slower in ChatGPT Projects (possibly for reasons other than react state).
m3kw9 1 days ago [-]
Because of the way their server architecture is set up and how it loads the screen. You don't even want all the bots hitting the servers.
dncornholio 1 days ago [-]
You cannot know what verifications they use. I could argue the disabled textbox is itself part of the verification process: humans will click on it while bots won't.
root_axis 1 days ago [-]
Seems like a trivially simple verification to defeat.
YetAnotherNick 1 days ago [-]
You can defeat all client side verification by definition if you know what verification is run.
QEDCTrL 1 days ago [-]
Sounds like anti-distillation to me. But, you know what? Meh.
mcmcmc 1 days ago [-]
I’d be inclined to agree with the “meh” if their entire product weren’t built off pirated content
federicosimoni 1 days ago [-]
[flagged]
deadbabe 1 days ago [-]
Remember you’re talking to a vibe coder who just stares at code being printed out by AI.
mcmcmc 1 days ago [-]
That’s a big assumption. It’s a brand new account, might be a bot. PR/astroturfing is a great use case for agentic AI
Imnimo 2 days ago [-]
It's interesting to me that OpenAI considers scraping to be a form of abuse.
DrinkyBird 1 days ago [-]
It’s funny because the first AI scraper I remember blocking was OpenAI’s, as it got stuck in a loop somehow and was impacting the performance of a wiki I run. All to violate every clause of the CC BY-NC-SA license of the content it was scraping :)
raincole 2 days ago [-]
Quite sure even literal thieves would consider thievery a form of abuse.
mcmcmc 1 days ago [-]
What’s being stolen? AI output isn’t copyrightable, and it’s not like they’re ripping pages out of a book
plutokras 1 days ago [-]
They can train on the outputs, i.e. distillation attacks.
mcmcmc 21 hours ago [-]
How is that theft?
duped 1 days ago [-]
Engineers working on AI and AI enthusiasts are seemingly incapable of seeing the harm they cause, so I disagree.
It is difficult to get a man to understand something, when his salary depends on his not understanding it.
littlestymaar 2 days ago [-]
Yeah, they know it's bad, they just don't think the rules apply to them.
mapt 1 days ago [-]
The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".
vbezhenar 1 days ago [-]
They know that the rules apply to them. They hope that they can avoid being caught.
catoc 2 days ago [-]
It’s only bad if you’re a closed, for-profit entity
</sarcasm>
lukan 2 days ago [-]
Was that sarcasm? Speaking of it, what parts of OpenAI are still open?
catoc 2 days ago [-]
I know, always hard to tell on HN.
Added the relevant declarative tag
reactordev 1 days ago [-]
The front door…
skeeter2020 1 days ago [-]
Small mitigation (in no way absolving them): isolated developers, different teams. Another way to look at it: they see "stealing" of their compute directly in their devops tools every day, but are several abstractions away from doing the same thing to other people.
splatter9859 1 days ago [-]
They never have and feel they are above reproach. Anytime Altman opens his mouth that's apparent. It's for the good of humanity dontcha know. LOL
kamban 2 days ago [-]
You nailed it.
tedsanders 2 days ago [-]
For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.
I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
foresterre 2 days ago [-]
I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take unless you tell them to no longer do it from now on.
I can imagine their models have been trained on a lot of websites before opt outs became a thing, and the models will probably incorporate that for forever.
But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).
kneel25 1 days ago [-]
> a lot of websites
It was a dataset of the entirety of the public internet from the very beginning, one that bypassed paywalls, etc. There's virtually nothing they haven't scraped.
qaadika 1 days ago [-]
> the big AI companies do have opt out mechanisms for scraping and search.
PRESS RELEASE: UNITED BURGLARS SOCIETY
The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.
Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website of each burglar in their area and opt out there. UBS is not responsible for unwanted burglaries due to failure to opt out.
netdevphoenix 1 days ago [-]
Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt out is backwards. Consent comes first.
It's a bit concerning that some professional engineers don't understand this, given the sensitive systems they interact with.
Tarq0n 11 hours ago [-]
It seems likely that they buy data from companies who don't obey the same constraints however, making it easy to launder the unethical part through a third party.
subscribed 1 days ago [-]
Just respect the bloody robots.txt and hold your horses. Ask your precious product built on the relentless, hostile scraping to devise a strategy that doesn't look like a cancer growth.
keybored 1 days ago [-]
Death by a thousand opt-outs.
jordanb 1 days ago [-]
They don't want anyone to take that which they have rightfully stolen.
altmanaltman 1 days ago [-]
Well at least they have 1 person working on "Integrity" so can't be too bad
splatter9859 1 days ago [-]
Exactly! How dare you have access to their stolen content in the midst of them doing the same.
axegon_ 2 days ago [-]
The levels of irony that shouldn't be possible...
ProofHouse 2 days ago [-]
The irony is thick
sabedevops 2 days ago [-]
Seriously. The hypocrisy is staggering!
wiseowise 2 days ago [-]
Church, politicians, moralists are all the biggest hypocrites that want to teach you something.
newsoftheday 1 days ago [-]
I agree on politicians, no idea what a "moralist" is supposed to be but there are good and bad churches and church goers; lumping all church goers into one category calling them hypocrites is wrong. There are many good churches and church goers who help people and their communities.
zer00eyz 2 days ago [-]
" Integrity at OpenAI .. protect ... abuse like bots, scraping, fraud "
Did you mean to use the word hypocrisy? If not, I'm happy to have said it.
I just want to note that it is well documented how good the support is for actual malware...
RobotToaster 1 days ago [-]
"You're trying to kidnap what I've rightfully stolen!"
Aurornis 2 days ago [-]
I interpreted scraping to mean in the context of this:
> we want to keep free and logged-out access available for more users
I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.
lelanthran 2 days ago [-]
> I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.
Not that hard - ChatGPT itself wrote me a FF extension that opened a websocket to a localhost port, then ChatGPT wrote the Python program to listen on that websocket port, as well as another port for commands.
Just a handful of commands implemented in the extension is enough for my bash scripts to open the tab to ChatGPT, target specific elements like the input, add some text to it, target the relevant chat button, click it, etc.
I've used it on other pages (mostly for test scripts that don't require me to install the whole jungle just to get a banana, as all the current Playwright-type products do). I'm too afraid to use it on ChatGPT, Gemini, Claude, etc., because if they detect that the browser is being driven by bash scripts, they can terminate my account.
That's an especially high risk for Gemini - I have other google accounts that I won't want to be disabled.
wolvoleo 2 days ago [-]
Why is this bad? Well, yeah, for OpenAI, because all they want it to be is a free teaser to get people hooked before they enshittify it.
Morally, I don't see any issues with it, really.
gib444 1 days ago [-]
And have absolutely no reservations about making such an obvious statement on a public forum
rsrsrs86 1 days ago [-]
This
miki123211 2 days ago [-]
It's not scraping they're concerned about, it's abusing free GPU resources to (anonymously) generate (abusive) content.
nikitaga 2 days ago [-]
Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.
The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.
It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.
PunchyHamster 2 days ago [-]
> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.
I bet people being fucking DDOSed by AI bots disagree
Also the fucking ignorance assuming it's "static content" and not something needing code running
remus 2 days ago [-]
I think the parent is just pointing out that these things lie on a spectrum. I have a website that consists largely of static content and the (significant) scraping which occurs doesn't impact the site for general users so I don't mind (and means I get good, up to date answers from LLMs on the niche topic my site covers). If it did have an impact on real users, or cost me significant money, I would feel pretty differently.
0xEF 2 days ago [-]
Putting everything on a spectrum is what got us into this mess of zero regulation and moving goal posts. It's slippery slope thinking no matter which way we cut it, because every time someone calls for a stop sign to be put up after giving an inch, the very people who would have to stop will argue tirelessly for the extra mile.
Aerroon 1 days ago [-]
What mess are you talking about? The existence of LLMs? I think it's pretty neat that I can now get answers to questions I have.
This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.
Are there downsides to this? Sure, but imo AI is useful.
butlike 1 days ago [-]
It's just repackaged Google results masquerading as an 'answer.' PageRank pulled results and displayed the first 10 relevant links and the LLM pulls tokens and displays the first relevant tokens to the query.
Just prompt it.
daveidol 1 days ago [-]
I’d argue putting everything in terms of black and white is the bigger issue than understanding nuance
instig007 1 days ago [-]
Generalizing with "everything", "all", etc exclusive markers is exactly the kind of black/white divide you're arguing against. What happened to your nuanced reality within a single sentence? Not everything is black and white, but some situations are.
fc417fc802 1 days ago [-]
The person he's replying to argued against putting things on a spectrum. Does that not imply painting everything in black and white? Thus his response seems perfectly sensible to me.
instig007 1 days ago [-]
He argued against putting things on a spectrum in many instances where that would be wrong, including the case in question. What's your argument against that idea? LLM'ed too much lately?
fc417fc802 22 hours ago [-]
He argued against it, and the response presented a counterargument. Both were based around social costs and used the same wording (i.e. "everything").
You made a specious dismissal. Now you're making personal attacks. Perhaps it's actually you who is having difficulty reasoning properly here?
Den_VR 2 days ago [-]
I miss the www where the .html was written in vim or notepad.
mghackerlady 1 days ago [-]
It still can be. Do it. Go make your website in M$ Frontpage, for all I care
butlike 1 days ago [-]
Shameless plug: My music homepage follows the HTML 2.0 spec and is written by hand
Just did that for a test frontend for a module I needed to build (not my primary job, so I don't know anything about UI, but running in browsers was a requirement): basic HTML with the bare minimum of JS, all plain DOM. Colleagues were very surprised. And yes, vim is still the go-to editor and will be for a long time, now that all the "IDEs" are pushing "AI" slop everywhere.
holler 2 days ago [-]
ahh yes, fresh off reading "Html For Dummies" I made my first tripod.com site
This is great! The name reference also made me smile.
eloisius 2 days ago [-]
Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article. Authors spend their blood, sweat and tears writing and then OpenAI comes to Hoover it up without a care in the world about license, copyright or what constitutes fair use. But don’t you dare scrape their slop.
lelanthran 2 days ago [-]
> Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article.
Exactly. I think the unfairness could be mitigated if any model trained on public information, or on data generated by such a model, or with either of those anywhere in its ancestry, had to be made public.
Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.
mikkupikku 1 days ago [-]
[flagged]
jazzyjackson 1 days ago [-]
The library's archive is not a service provided by the newspaper
mikkupikku 1 days ago [-]
So? If the newspaper's website is willing to serve the documents, what's the problem?
The point is, if you're pleading with others to respect ""intellectual property"" then you're a worm serving corporate interests against your own.
jazzyjackson 1 days ago [-]
I may be a worm, but at least I respect that others might have a different take on how best to make creative work an attainable way of life. Before copyright law, it was basically "have a wealthy patron who steered, if not outright commissioned, what you would produce".
eru 2 days ago [-]
> I bet people being fucking DDOSed by AI bots disagree
Are you sure it's a DDoS and not just a DoS?
MattJ100 2 days ago [-]
Yes, it is. The worst offenders hammer us (and others) with thousands upon thousands of requests, and each request uses unique IP addresses making all per-IP limits useless.
We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.
It's a DDoS.
troyvit 2 days ago [-]
You should see Cloudflare's control panel for AI bot blocking. There are dozens of different AI bots you can choose to block, and that doesn't even count the different ASNs they might use. So in this case I'd say that a DDoS is a decent description. It's not as bad as every home router on the eastern seaboard or something, but it's pretty bad.
Bilal_io 2 days ago [-]
Uncoordinated DDoS, when multiple search and AI companies are hammering your server.
catoc 2 days ago [-]
> Are you sure it's a DDoS and not just a DoS?
I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping
SolarNet 2 days ago [-]
When every AI company does it from multiple data centers... yes it's distributed.
1718627440 1 days ago [-]
Off topic, but why is a DoS considered something that must be acted on, often by just shutting down the service altogether? That results in the same denial of service, just caused by the operator rather than by congestion. Actually it's worse, because now the requests will never be answered at all, rather than merely after some delay. Why is the default not to just do nothing?
pocksuppet 1 days ago [-]
It keeps the other projects hosted on the same server or network online. Blackhole routes are pushed upstream to the really big networks and they push them to their edge routers, so traffic to the affected IPs is dropped near the sender's ISP and doesn't cause network congestion.
DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.
echoangle 1 days ago [-]
I think some people use hosting that is paid per request/load, so having crawlers make unwanted requests costs them money.
ImPostingOnHN 1 days ago [-]
> Why is the default not to just do nothing?
Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills of hundreds or thousands of dollars more than the hobbyist operator was expecting to send.
lm411 2 days ago [-]
> Also the fucking ignorance assuming it's "static content" and not something needing code running
Wild, eh.
If it's not AI, it's now by default labelled "static content" and "near-zero marginal cost".
littlestymaar 2 days ago [-]
What's a database after all.
nikitaga 1 days ago [-]
All this reactionary outrage in the comments is funny. And lame.
Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.
This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.
If such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.
The vast majority of websites handle AI traffic fine though, either because they don't have expensive to serve resources, or because they properly protect such resources from abuse.
If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.
ipaddr 1 days ago [-]
"such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all"
They are common. The strategy works for the LLM companies, but not for the website owners or users who can't use a site during such an attack.
The majority of sites are not handling AI traffic fine. Getting DDoSed even part of the time is not acceptable. Countermeasures like blocking huge IP ranges can help, but they also lock out legitimate users.
nikitaga 1 days ago [-]
> They are common
Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?
ipaddr 1 days ago [-]
I love AI, so it can't be that. And it's not devs, it's website owners. Yes, ask AI for stats.
fireflash38 1 days ago [-]
It's not a cost for me to scrape LLM.
It is a cost for me for LLM to scrape me.
Why should I care about the costs they have when they don't care about the costs I have?
grayhatter 1 days ago [-]
The extent of the utilization is new.
The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.
juliangmp 1 days ago [-]
"They are rare edge cases" are we on the same internet?
expedition32 1 days ago [-]
One euro is marginal for me; for someone else it is their daily meal.
not2b 2 days ago [-]
I understand why OpenAI is trying to reduce its costs, but AI crawlers really are creating very significant load, especially those crawlers that ignore robots.txt and hide their identities. This is direct financial damage, and it's particularly hard on nonprofit sites that have been around a long time.
zer00eyz 2 days ago [-]
> but it simply isn't true that AI crawlers aren't creating very significant load.
And how much of this is users who are tired of walled gardens and enshitfication. We murdered RSS, API's and the "open web" in the name of profit, and lock in.
There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.
stingraycharles 2 days ago [-]
These are ChatGPT and Claude Desktop crawlers we’re talking about? Or what is it exactly? Are these really creating significant load while not honoring robots.txt?
Genuinely interested.
63stack 2 days ago [-]
Is this the first time you are reading HN? Every day there are posts from people describing how AI crawlers are hammering their sites, with no end. Filtering user agents doesn't work because they spoof it, filtering IPs doesn't work because they use residential IPs. Robots.txt is a summer child's dream.
miki123211 2 days ago [-]
They seem to mostly be third-party upstarts with too much money to burn, willing to do what it takes to get data, probably in hopes of later selling it to big labs. Maaaybe Chinese AI labs too, I wouldn't put it past them.
OpenAI et al seem to mostly be well-behaved.
cruffle_duffle 2 days ago [-]
I bet dollars to doughnuts that 95% of the traffic is from Claude and ChatGPT desktop / mobile and not literal content scraping for training.
crote 2 days ago [-]
That wouldn't explain the 1000x increase in traffic for extremely obscure content, or seeing it download every single page on a classic web forum.
duttish 2 days ago [-]
And doing it over, and over, and over and over again. Because sure it didn't change in the last 8 years but maybe it's changed since yesterdays scrape?
lm411 2 days ago [-]
That is ridiculous.
You imply that "an expensive LLM service" is harmed by abuse, but every other service is not? Because their websites are "static" and "near-zero marginal cost"?
You have no clue what you are talking about.
camillomiller 2 days ago [-]
Well he’s a simp
cicko 2 days ago [-]
Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service".
Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage".
I like how you frame this.
sandeepkd 2 days ago [-]
Let's not try to justify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and the scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.
expedition32 2 days ago [-]
Perhaps the long play is to destroy all small hobby websites until only an AI-directed web is left.
miki123211 2 days ago [-]
If you're truly running a static site, you can run it for free, no matter how much traffic you're getting.
Github pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.
The troubles start when you're actually running something dynamic that pretends to be static, like Wordpress or Mediawiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.
ezrast 1 days ago [-]
Setting aside the notion that a site presenting live-editability as its entire core premise is "pretending to be static", do the actual folks at Wikimedia, who have been running a top 10 website successfully for many years, and who have a caching system that worked well in the environment it was designed for, and who found that that system did not, in fact, trivialize the load of AI scraping, have any standing to complain? Or must they all just be bad at their jobs?
It's true it can be done, but many business owners are not hip to Cloudflare R2 buckets or GitHub Pages. Many are still paying for a whole dedicated server running Apache (and WordPress!) to serve static files. These sites will go down when hammered by unscrupulous bots.
arminsergiony 23 hours ago [-]
[dead]
alsetmusic 2 days ago [-]
Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.
AmbroseBierce 2 days ago [-]
It's not like those models are valuable because of the usefulness they extracted by scraping others without permission, right? You are not even scratching the surface of the hypocrisy.
wolvoleo 2 days ago [-]
It's more ironic because without all the scraping openai has done, there would have been no ChatGPT.
Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.
In fact the more I think of it, I think it's exactly the same thing.
expedition32 2 days ago [-]
This leads me to thinking: I ask chatGPT a question and they get the answer from gamefaqs.
But what happens if gamefaqs disappears because of lack of traffic?
Can LLM actually create or only regurgitate content.
Aerroon 1 days ago [-]
>Can LLM actually create or only regurgitate content.
Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.
In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.
wolvoleo 2 days ago [-]
It will remain in their scraped data, so they can keep including it in later training datasets if they wish. However, it won't be able to do live internet searches anymore. And it will not generate new content, of course. Especially not about games released after the site goes down, since it won't know about them. Though it could, of course, correlate data from other sources that talk about the game in question.
stefanka 2 days ago [-]
They cannot create original content.
wolvoleo 2 days ago [-]
Well, they can make some up, i.e. hallucinate. That's an additional problem: when the original site that provided the training data is gone, how can they verify the AI output to make sure it's correct?
VadimPR 2 days ago [-]
Getting scraped by abusive bots who bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year with extra layers of caching, CloudFlare, you name it because our little hobby website kept getting DDoS'd by the bots scraping the web for training data.
Never in 15 years of running the website did we have such issues, and you can be sure cache layers were already in place for it to last this long.
I don't think a rule along the lines of "Doing $FOO to a corporate is forbidden, but doing $FOO to a charitable initiative is fine" is at all fair.
What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.
The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?
ungreased0675 1 days ago [-]
You’re describing the tragedy of the commons.
No single raindrop thinks it’s responsible for the flood.
cindyllm 1 days ago [-]
[dead]
razingeden 2 days ago [-]
It is direct financial damage if my server's not on an unmetered connection. After years of bills coming in around $3/mo, I got a surprise >$800 bill on a site nobody on earth appears to care about besides AI scrapers.
It hasn't even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively. But fuck every single one of these companies now whining about bots scraping and victimizing them; here's my violin.
gzread 2 days ago [-]
If you can identify the scraper you should have a valid legal case to recover damages.
thisislife2 1 days ago [-]
Only if they had a robots.txt for their site.
razingeden 1 days ago [-]
I hadn’t even considered that. Don’t know why that comment is greyed out or downvoted.
It’s a static site that hasn’t been updated since 2016, so it has since been moved to Cloudflare R2, where it’s getting a $0.00 bill, and it now has a disallow / directive. I’m not sure it’s being obeyed, because the CF dash still says it’s getting 700-1300 hits a day even with all the anti-bot, “CF managed robots” stuff for AI crawlers in there.
The content is so dry and irrelevant I just can’t even fathom 1/100th of that being legitimate human interest but I thought these things just vacuumed up and stole everyone’s content instead of nailing their pages constantly?
gzread 1 days ago [-]
No, it's still illegal to DDoS sites that don't have robots.txt.
thisislife2 1 days ago [-]
You are right, I hadn't considered that aspect.
the_sleaze_ 2 days ago [-]
60% of our traffic is bot, on average. Sometimes almost 100%.
grishka 2 days ago [-]
> Scraping static content from a website at near-zero marginal cost to its server
It's not possible to know in advance what is static and what is not. I have some rather stubborn bots making several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.
I'm not against my website getting scraped, I believe being able to do that is an important part what the web is, but please have some decency.
not_your_vase 2 days ago [-]
> net-zero marginal cost
Lol, you single-handedly created a market for Anubis, and in the past 3 years Cloudflare captchas have multiplied at least 10-fold; now they're even on websites that were very vocal against them. Many websites are still drowning; GNU project sites are regularly accessible only through the Wayback Machine.
Spare me your tears.
xmcqdpt2 1 days ago [-]
AI providers also claim to have small marginal costs. The cost per token supposedly prices in model training, which is not so different from, e.g., your server costs being low while your content production costs are high. And in many cases AI companies are direct competitors (of artists, musicians, etc.).
(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)
ori_b 1 days ago [-]
My website serving git that only works from Plan 9 is serving about a terabyte of web traffic monthly. Each page load is about 10 to 30 kilobytes. Do you think there's enough organic, non-scraper interest in the site that scrapers are a near-zero part of the cost?
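To put those numbers in perspective, here's a quick back-of-envelope check (assuming ~1 TB/month and 10-30 KB per page load, exactly as stated above; nothing else about the site is known):

```python
# Back-of-envelope: how many page loads does 1 TB/month imply
# if each load is 10-30 KB?

MONTHLY_BYTES = 1e12                  # ~1 TB of traffic per month
PAGE_KB_LOW, PAGE_KB_HIGH = 10, 30    # per page load, per the comment

loads_high = MONTHLY_BYTES / (PAGE_KB_LOW * 1024)   # small pages -> more loads
loads_low = MONTHLY_BYTES / (PAGE_KB_HIGH * 1024)   # big pages -> fewer loads

print(f"{loads_low:,.0f} to {loads_high:,.0f} page loads/month")

per_second = loads_low / (30 * 24 * 3600)           # conservative lower bound
print(f"at least {per_second:.1f} requests/second sustained")
```

That works out to tens of millions of page loads a month, sustained around the clock, for a niche git host. Hard to attribute to organic interest.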
SkiFire13 2 days ago [-]
> Scraping static content
How do you know the content is static?
bakugo 2 days ago [-]
The cost is so marginal that many, many websites have been forced to add cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.
foobiekr 1 days ago [-]
You are, of course, ignoring the production costs of the static content that OpenAi is stealing.
Stop justifying their anti-social behavior because it lines your pockets.
heyethan 2 days ago [-]
I think this also explains why the checks are moving up the stack.
If the real cost is in actually running the app or the model, then just verifying a browser isn’t enough anymore. You need to verify that the expensive part actually happened.
Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.
swagmoney1606 2 days ago [-]
And yet I have to pay in my time and cash to handle the constant ddos'es from the constant LLM scraping
gmerc 2 days ago [-]
It’s not for techbros to decide at what threshold of theft it’s actually theft. “My GPU time is more valuable than your CPU time” isn’t a thing, and Wikipedia’s latest numbers on scraping show that marginal costs at scale are a valid concern.
mcfedr 1 days ago [-]
I'm sure the copyright holders would consider your use of their content as direct financial damage
2 days ago [-]
2 days ago [-]
nozzlegear 2 days ago [-]
Are they, actually?
make3 2 days ago [-]
Absolutely not, the former relies on controversial ideas to qualify as legal.
Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.
AtlasBarfed 2 days ago [-]
Because you say it is?
I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.
karlshea 2 days ago [-]
I don’t know what world you live in but it’s not this one.
nickphx 1 days ago [-]
Speak for yourself.
platybubsy 2 days ago [-]
Bait or genuine techbro? Hard to say
andrepd 1 days ago [-]
> Scraping static content from a website at near-zero marginal cost to its server
The issue is that there are so many awful webmasters that have websites that take hundreds of milliseconds to generate and are brought down by a couple requests a second.
bakugo 2 days ago [-]
OpenAI must be the most awful webmasters of all, then, to need such sophisticated protections.
heyethan 2 days ago [-]
I think the distinction is less about scraping itself, and more about marginal cost.
Scraping static pages is cheap for both sides.
Scraping an LLM-backed service effectively externalizes compute costs onto the provider.
Same behavior, very different economics.
crote 2 days ago [-]
Very few websites are truly static. Something like a Wordpress website still does a nontrivial amount of compute and DB calls - especially when you don't hit a cache.
There's also the cost asymmetry to take into account. Running an obscure hobby forum on a $5/month VPS (or cloud equivalent) is quite doable; having that suddenly balloon to $500/month is a Really Big Deal. Meanwhile, the LLM company scraping it has hundreds of millions in VC funding; they aren't going to notice they're burning a few million because their crappy scraper keeps hammering websites over and over again.
everdrive 2 days ago [-]
It's getting to the point where a user needs at minimum two browsers. One to allow all this horrendous client checking so that crucial services work, and another browser to attempt to prevent tracking users across the web.
Nick, I understand the practical realities regarding why you'd need to try to tamp down on some bot traffic, but do you see a world where users are not forced to choose between privacy and functionality?
mememememememo 2 days ago [-]
Local models for privacy.
You want to go to the world's best hotel? You are gonna be on their CCTV. Staying at home is crappier but private.
Unfortunately, for the first time Moore's law isn't helping (e.g. give a poor person an old laptop with Linux installed and they'll be fine). They can do that and all is good, except there's no LLM.
karlgkk 2 days ago [-]
> You want to go to the world's best hotel? You are gonna be on their CCTV.
ironically, in high end hotels, there's often a lot less cctv. not none. just less. rich people enjoy privacy
Barbing 2 days ago [-]
So they’re not just hidden better? Does make sense.
Well, I can use the world‘s best safety deposit box without being on CCTV while I pass secrets in and out of it, right? Just not for free.
Bummer, this sounds like it is about to turn into a Monero ad (“let us pay privately”)
wolvoleo 2 days ago [-]
Probably not even hidden, because rich people are also caught up in a lot of legal cases, in which case the hotel has no choice but to hand over the material. Better not to have it in the first place. You don't want your hotel cams listed as evidence in a $500M divorce case, I guess.
Also are hidden cameras even legal? I know here in EU they aren't.
xtajv 1 days ago [-]
In hotels of all tax brackets, you usually get a room key.
And the salient difference is that CCTV is simply defense-in-depth, not a primary means for authentication.
nozzlegear 2 days ago [-]
> Staying at home is crappier but private.
Doesn't make sense, my home is much more preferable to a hotel
hedora 2 days ago [-]
With any luck, local models will be too (soon).
littlestymaar 2 days ago [-]
My local models didn't get >20h of outage this quarter like Claude did so in a way it's already the case.
1 days ago [-]
0x3f 2 days ago [-]
Meet me in a cafe and I will sign a JWT saying you're not a bot. You can submit this to whoever will accept it.
Brilliant! Just the thing we want: more hardware attestation, more deanonymization, less user control, all diligently orchestrated in a repository where the only contributor is Anthropic Claude [0]. Comes complete with a misaligned ASCII diagram in the README to show how much effort the humans behind it put in!
Yes, even their "humanifesto" is LLM output, and is written almost exclusively in the "it's not X <emdash> it's Y" style.
Those are all situationally-valid criticisms, but I've long thought the ability to have smartphones' cameras cryptographically sign photos is good when available. The use case is demonstrating a photo wasn't doctored, and that it came from a device associated with e.g. a journalist, who maintains a public key. Of course, it should be optional.
magicseth 2 days ago [-]
Yes! That's what I'm getting at. This protocol optionally allows you to sign with your private key, but you don't have to for the protocol to provide utility. It could just be enough to say "if you trust magicseth's binary and apple, then this was typed one letter at a time"
There's nothing stopping folks from typing a message an LLM wrote one at a time, but the idea of increasing the human cost of sending messages is an interesting one, or at least I thought :-(
johnmaguire 2 days ago [-]
The problem is that it's not optional to end-users if sites enforce its use.
hedora 2 days ago [-]
The other problem is that the device or company might decide not to attest for you.
For instance, the employee at Apple that decided to pull ICE Block from the store could decide that the "admissible in court" bit should be false if it looks like a police officer is in frame.
Similarly, the keyboard could decide your social credit score is too low, and just stop attesting. A court could order this behavior.
Or, you could fail mandatory age / id verification because your credit card expired, and then all the above + more could happen! Good luck getting through to credit card tech support at that point...
magicseth 2 days ago [-]
Hi! I want anonymity! I also want to be able to prove what level of effort has been put in to something. I think there's room for both. This is an encrypted proof that I wrote something on a keyboard that tracks fingers. The protocol allows you to optionally sign it with your identity, but that isn't strictly required.
It is an attempt at putting something into the conversation more than just "OSS is broken because there are too many slop PRs." What if OSS required a human to attest that they actually looked at the code they're submitting? This tool could help with that.
Yes LLMs were used greatly in the production of this prototype!
It doesn't change the goal of the experiment, or its potential utility! Do you see any potential area in your world where some piece of this is valuable?
Arainach 2 days ago [-]
> Yes, even their "humanifesto" is LLM output, and is written almost exclusively in the "it's not X <emdash> it's Y" style.
There are six emdashes on that page. NONE of them are "it's not X, it's Y".
> Emails, messages, essays, code reviews, love letters — all suspect.
> We believe this can be solved — not by detecting AI, but by proving humanity.
> KeyWitness captures cryptographic proof at the point of input — the keyboard.
> When you seal a message, the keyboard builds a W3C Verifiable Credential — a self-contained proof that can be verified by anyone, anywhere, without trusting us or any central authority.
> That's an alphabet of 774 symbols — each carrying log2(774) ≈ 9.6 bits. 27 emoji for 256 bits.
> They're a declaration: this message was written by a person — one of the diverse, imperfect, irreplaceable humans who still choose to type their own words.
Clarifications: 4
Continuation from a list: 1
Could just be a comma: 1
"It's not X -- it's Y": 0.
If you're going to make lazy commentary about good writing being AI, please at least be sure that you're reading the content and saying accurate things.
magicseth 2 days ago [-]
It is largely written by iteration with an LLM! No need to speculate or analyze em dashes :-)
The emoji idea was mine. I like it :-) unfortunately it doesn't work in places like HN that strip out emoji. So I had to make a base64 encoding option.
The goal was to create an effective encryption key for the url hash (so it doesn't get sent to the server). And encoding skin tone with human emojis allows a super dense bit/visual character encoding that ALSO is a cute reference to the humans I'm trying to center with this project!
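The arithmetic behind the "27 emoji for 256 bits" claim checks out (assuming the 774-symbol alphabet stated above):

```python
import math

ALPHABET = 774                      # emoji variants incl. skin tones (per the post)
bits_per_symbol = math.log2(ALPHABET)
print(f"{bits_per_symbol:.2f} bits per emoji")            # 9.60

KEY_BITS = 256                      # the encryption key carried in the URL hash
emoji_needed = math.ceil(KEY_BITS / bits_per_symbol)
print(f"{emoji_needed} emoji for a {KEY_BITS}-bit key")   # 27

# Compare with base64, which carries 6 bits per character:
print(math.ceil(KEY_BITS / 6), "base64 chars")            # 43
```

So the skin-tone trick cuts the visible key from 43 characters down to 27 symbols, which is the density win being described.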
josephg 2 days ago [-]
> We believe this can be solved — not by detecting AI, but by proving humanity
"It's not X -- it's Y": 1
dandellion 2 days ago [-]
It's either a bot, or someone who writes exactly like a bot. I don't care which it is, both go to the discard pile.
magicseth 2 days ago [-]
phew!
arrowsmith 2 days ago [-]
It’s a product for people who need help telling whether text was written by AI.
Maybe they deliberately write it like that, to filter out people who aren’t the target market?
arrowsmith 2 days ago [-]
From their “how it works” page:
> The server stores an encrypted blob it can't decrypt. We couldn't read your messages even if we wanted to. That's not a policy — it's math.
If you can’t tell that this is AI slop then maybe KeyWitness does solve a real problem after all.
Velocifyer 2 days ago [-]
<redacted because my friend posted it but accidentaly used my account>
magicseth 2 days ago [-]
Oh, you think it's stupid? It was an attempt to encode an encryption key that isn't sent to the server in a way that is minimally invasive. The skin-tone emojis allow pretty high byte density, and are also cute!
Sorry it doesn't meet your needs.
There is irony in having an ai generated humanifesto. Could it be intentional? hmm?
Is there no irony in deriding a project for being potentially LLM-generated, when its goal is to aid people in differentiating?
:shrug:
Terretta 2 days ago [-]
The first widely distributed, open-source version of this typist-timing validation idea I saw (and incorporated into my own software at the time) was released by Michael Crichton as part of a password second-factor checker in Creative Computing magazine, which printed the code. The first factor was a known phrase, or even your name; the second was your idiosyncratic typing pattern.
You’re getting a negative reaction from others but I share this feedback in good faith: I don’t understand what problem your product is supposed to solve.
Yeah I guess the cryptographic stuff sounds vaguely impressive although it’s been a long time since I had to think about cryptography in detail. But what is this _for_? I’m going to buy an expensive keyboard so that I can send messages to someone and they’ll know it’s really me – but it has to be someone who a) doesn’t trust me or any of our existing communication channels and b) cares enough to verify using this weird software? Oh and it’s important they know I sent it from a particular device out of the many I could be using?
Who is that person? What would I be sending them? What is the scenario where we would both need this?
Also the server can’t read the message but the decryption key is in the URL? So anyone with the URL can still read it? Then why even bother encrypting it?
Maybe this is one of those cases where I’m so far outside your target market that it was never supposed to make sense to me but I feel like I’m missing something here. Or maybe you need to work on your elevator pitch.
Just sharing my honest reaction.
scoofy 2 days ago [-]
Somewhere there is someone 3D printing a keyboard cover that an llm can type with.
magicseth 2 days ago [-]
I'm actually building a physical keyboard for those people who don't have iphones! Though given the reaction I'm seeing here, I probably won't share it with this audience :-P it has capacitive keys, a secure enclave, and a fingerprint sensor.
mike_hearn 1 days ago [-]
Please do share. This sort of tech is necessary, for better or worse, and I'd have a bunch of use cases in mind for it!
Velocifyer 2 days ago [-]
This does not prove anything, and it is only available to users with X.com accounts (you need an X.com account to download the app).
magicseth 2 days ago [-]
Hi! You don't need an x.com account to download, that's just the easiest way to dm me. If you're actually interested, I can let you try it! The source is also available.
It proves 1) that an apple device with a secure enclave signed it. 2) that my app signed it.
If you trust the binary I've distributed is the same as the one on the app store, then it also proves:
3) that it was typed on my keyboard not using automation (though as others have mentioned, you could build a capacitive robot to type on it)
4) that the typer has the same private key as previous messages they've signed (if you have an out of band way to corroborate that's great too)
5) optionally, that the person whose biometrics are associated with the device approved it.
There is also an optional voice to text mode that uses 3d face mesh to attempt to verify the words were spoken live.
Not every level of verification is required by the protocol, so you could attest that something was written on a keyboard, but not who wrote it (not yet implemented in the client app).
The protocol doesn't require you to run my app, if you compile it yourself, you can create your own web of trust around you!
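The general shape of that attestation chain can be sketched in a few lines. This is a toy illustration only: an HMAC stands in for the Secure Enclave's asymmetric signature, and the field names are invented, not the real KeyWitness protocol or W3C Verifiable Credential format:

```python
import hashlib
import hmac
import json

# Stand-in for a key that, on real hardware, never leaves the Secure Enclave.
DEVICE_KEY = b"secret-held-in-secure-enclave"

def seal(message, metadata):
    """Bundle the message with attestation metadata and sign the whole blob."""
    payload = json.dumps({"msg": message, "meta": metadata}, sort_keys=True)
    sig = hmac.new(DEVICE_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def verify(sealed):
    """Recompute the signature; any tampering with payload or metadata fails."""
    expected = hmac.new(DEVICE_KEY, sealed["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["sig"])

sealed = seal("typed one letter at a time",
              {"app": "keywitness", "automation": False})
print(verify(sealed))       # True: untouched blob verifies

# Flipping the automation flag breaks the signature:
sealed["payload"] = sealed["payload"].replace("false", "true")
print(verify(sealed))       # False
```

The real scheme differs in the crucial way that verification needs only a public key, so anyone can check the attestation without holding the signing secret.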
Velocifyer 2 days ago [-]
>that an apple device with a secure enclave signed it.
What Apple devices are supported? All I have is an iPhone 4 running an old iOS version (pre-iOS 7), which I will not update and which I don't think has a Secure Enclave, plus an M1 Mac mini, some Lightning EarPods, an Apple Thunderbolt Display, some USB-A chargers, and some old MacBooks.
I think the concept is stupid because it would require somehow proving that the app is not modified (which is impractical) and that there is no stylus on a motor or fake screen (which is also impractical).
I think a better approach would be to form a Web of Trust where only people's certificates are signed (not just humans; this would include all animals and potentially aliens, but no clankers), with an interface that is friendly to people who are not very into technology, and with some way to avoid revealing who your friends are. But this would still allow someone to get an attestation for their robot.
xeyownt 2 days ago [-]
Why 256-bit AES? It brings nothing but a longer key; 128-bit is more than enough. Please don't mention PQC :fire:
ImPostingOnHN 1 days ago [-]
"why do you need more compute resources? Please don't mention computer programs"
toss1 2 days ago [-]
Oh Gawd, not this idea again!
This idea of capturing the timing of people's keystrokes to identify them, ensure it is them typing their passwords, or even using the timing itself as a password has been recurring every few years for at least three decades.
It is always just as bad. Because there are so many cases where it completely fails.
The first case is a minor injury to either hand — just put a fat bandage on one finger from a minor kitchen accident, and you'll be typing completely differently for a few days.
Or, because I just walked into my office eating a juicy apple with one hand and I'm in a hurry typing my PW with my other hand because someone just called with an urgent issue I've got to fix, aaaaannnd, your software balks because I'm typing with a completely different cadence.
The list of valid reasons for failure is endless wherein a person's usual solid patterns are good 90%+ of the time, but will hard fail the other 10% of the time. And the acceptable error rate would be 2-4 orders of magnitude less.
It's a mystery how people get all the way to building software on an idea that seems good but is actually bad, without thinking it through, or even checking how often it has been tried before and failed.
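The failure mode is easy to see in a toy model. Everything below is invented for illustration (the feature vector, distance metric, and threshold have nothing to do with any actual product):

```python
# Toy keystroke-dynamics check: compare inter-key timing profiles
# with a mean-squared-difference distance. All numbers are made up.

def distance(profile_a, profile_b):
    """Mean squared difference between two inter-key interval vectors (ms)."""
    assert len(profile_a) == len(profile_b)
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)) / len(profile_a)

enrolled  = [110, 95, 140, 100, 120]   # user's usual cadence (ms between keys)
same_user = [115, 90, 135, 105, 118]   # normal day-to-day variation
bandaged  = [240, 95, 310, 100, 260]   # same human, one finger out of commission

THRESHOLD = 500  # made-up acceptance threshold (ms^2)

print(distance(enrolled, same_user) < THRESHOLD)  # True: accepted
print(distance(enrolled, bandaged) < THRESHOLD)   # False: legit user rejected
```

Any threshold tight enough to reject a replaying bot also hard-fails the bandaged-finger case, which is exactly the 90%/10% problem described above.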
magicseth 2 days ago [-]
That's not what this is. At all.
monocularvision 2 days ago [-]
You might want to check out “How it Works” on the site as none of what you said applies: https://typed.by/how
josefx 2 days ago [-]
Then why does your link claim the following?
> While you type, the keyboard quietly records how you type — the rhythm, the pauses between keys, where your finger lands, how hard you press.
> Nobody types the same way. Your pattern is as unique as your handwriting. That's the signal.
arrowsmith 2 days ago [-]
I’m sceptical about this idea but, to give it full credit, it’s a custom piece of hardware that would presumably be more accurate than previous software-only attempts. Maybe it will actually work this time, idk, although I still don’t really see the point.
59nadir 2 days ago [-]
Vibe copy is a hell of a drug.
toss1 1 days ago [-]
Yes. This is from that page:
>>While you type, the keyboard quietly records how you type — the rhythm, the pauses between keys, where your finger lands, how hard you press.
>>Nobody types the same way. Your pattern is as unique as your handwriting. That's the signal.
This very precisely makes my point:
Yes, the typing pattern of any human is highly and possibly even completely unique to that human — UNTIL any of a myriad of everyday issues makes it falsely deny access because the human's typing pattern has changed in a way the human can't do anything to fix at the moment.
If you are only attempting to distinguish a human from an automated system, it'll work better, until someone just starts recording the same patterns and replaying them to this upstream process; then it's a mere race over who can get their hooks in at a lower level. And someone is always going to say, "Oh, this system can identify the specific human," and we're off to the races again.
So, no. Unless you can account for ALL of the reasonable everyday failure modes, typing with either hand, any finger or combination of fingers out of commission for a minute or a lifetime, this idea will fail.
toss1 1 days ago [-]
IOW, if you are doing this, it does not matter what you are doing afterwards.
You are assuming that a human's particular typing pattern is consistent, when the fact is that any number of ordinary events will render your assumption false (one or more fingers bandaged, sprained, whatever, or one hand occupied ATM).
This is not a hardware or software problem, and no amount of code, hardware, or cleverness will fix it; this is a fundamental mismatch between your assumption vs reality.
xtajv 1 days ago [-]
can confirm. am weird enough to routinely flag as "inhuman".
thaaaaaaaaanks
jagged-chisel 2 days ago [-]
Sounds like we’re bringing back the PGP key signing parties
__MatrixMan__ 2 days ago [-]
The sooner we do the better.
hathawsh 2 days ago [-]
I wonder what the PGP signing concept does to thwart people who want to profit and don't care about the public good. It seems like anyone who attends a signing party can sell their key to the highest bidder, leading to bots and spammers all over again.
__MatrixMan__ 2 days ago [-]
In the flat trust model we currently use most places, it's on each person to block each spammer, bot, etc. The cost of creating a new bot account is low so it's cheap to make them come back.
On a web of trust, if you have a negative interaction with a bot, you revoke trust in one of the humans in the chain of trust that caused you to come in contact with that bot. You've now effectively blocked all bots they've ever made or ever will make... At least until they recycle their identity and come to another key signing party.
Once you have the web in place, though (a series of "this key belongs to a human" attestations), you can layer metadata on top of it, like "this human is a skilled biologist" or "this human is a security expert". If you then use those attestations to determine what content you're exposed to, a malicious human doesn't merely need to show up at a key-signing party to bootstrap a new identity; they also have to rebuild their reputation to the point where you, or somebody you trust, becomes interested in their content again.
Nothing can be done to prevent bad people from burning their identities for profit, but we can collectively make it not economical to do so by practicing some trust hygiene.
Key signing establishes a graph upon which more effective trust management becomes possible. It on its own is likely insufficient.
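The revocation-cascade idea above can be sketched as a tiny graph walk (names and structure invented for illustration):

```python
# Toy web of trust: each identity records who vouched for it.
# Revoking trust in one voucher invalidates everything downstream of them.

vouched_by = {
    "alice": None,        # root of my trust (met at a key-signing party)
    "bob": "alice",
    "botfarm": "bob",     # bob vouched for a spammer operation
    "bot1": "botfarm",
    "bot2": "botfarm",
    "carol": "alice",
}

revoked = set()

def trusted(identity):
    """An identity is trusted iff its entire vouching chain is unrevoked."""
    while identity is not None:
        if identity in revoked:
            return False
        identity = vouched_by[identity]
    return True

revoked.add("bob")  # one revocation after a single bad interaction...

print(trusted("carol"))  # True: her chain doesn't pass through bob
print(trusted("bot1"))   # False: cut off along with everything bob vouched for
print(trusted("bot2"))   # False
```

One revocation prunes the whole subtree, which is what makes spinning up fresh bot accounts expensive under this model.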
0x3f 2 days ago [-]
You can never prevent things like this, but you can make it expensive enough to effectively solve the problem for almost all use cases.
zar1048576 2 days ago [-]
Definitely miss those!
tshaddox 2 days ago [-]
Doesn’t really make sense, because any service can just say “you must paste your human-attestation JWT here to use this service” and plenty of people will.
0x3f 2 days ago [-]
You can just decay your trust level based on the `iat` value. That way people will need to keep buying me coffee. I can optionally chide them for giving out their token.
If you're engaging with the idea seriously, I suppose we'd need to build a reputation or trust network or something.
Although if you're talking about replay attacks specifically, there are other crypto based solutions for that.
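The `iat` decay could look something like this sketch (the half-life and the decay curve are arbitrary choices for illustration, not part of any real scheme):

```python
import time

HALF_LIFE = 30 * 24 * 3600   # trust halves every 30 days (arbitrary)

def trust_weight(iat, now=None):
    """Exponentially decay trust based on the token's issued-at (iat) claim."""
    now = time.time() if now is None else now
    age = max(0.0, now - iat)
    return 0.5 ** (age / HALF_LIFE)

now = 1_700_000_000
print(trust_weight(now, now))                          # 1.0: fresh token
print(trust_weight(now - 30 * 24 * 3600, now))         # 0.5: a month old
print(trust_weight(now - 90 * 24 * 3600, now))         # 0.125: buy me another coffee
```

A relying service would then compare the weight against its own threshold instead of treating the attestation as a permanent yes/no.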
tshaddox 2 days ago [-]
My point is that there probably is no way in principle to distinguish between a human user utilizing automation on their own behalf in good faith (e.g. RSS readers) and bad faith automations.
crote 2 days ago [-]
That's a feature, not a bug.
A human is personally responsible for a bot acting on their behalf. If your bot behaves, nothing is going to happen. If you keep handing out your personal keys to shitty misbehaving bots, then you will personally get banned - which gives you a pretty good incentive to be a bit more discerning about the bots you use.
0x3f 1 days ago [-]
Yes, everything should just be agnostic, as long as the incentives work out it's all fine. Like if we had worked out micropayments for the web (not saying that's a good idea per se), then who cares if you're a bot or a human when you're paying a toll either way? Flipping it to be a cost rather than payment is functionally equivalent.
magicseth 2 days ago [-]
I am engaging with this seriously! I don't know if there will be any real solution. But I think it's worth exploring.
2 days ago [-]
kevin_thibedeau 2 days ago [-]
I've been doing that for years. Cloudflare is slowly breaking more and more of the web.
subscribed 2 days ago [-]
This is indeed what I do, and so should you: a separate browser for banking, trusted shipping sites, etc., and the normal one for everything else.
Make sure not to browse the Internet without adblock and/or similar.
madrox 2 days ago [-]
I am not Nick, but there are a few ways that world happens: the free tier goes away and what people pay for more correctly reflects what they use, this all becomes cheap enough that it doesn't matter, or we come up with an end-to-end method of determining that usage is triggered by a person.
Another way is to just do better isolation as a user. That's probably your best shot without hoping these companies change policies.
lukewarm707 1 days ago [-]
i am increasingly moving towards a model of 'no browser'.
search for me is now a proprietary index (like exa) that filters rubbish, with a zero data retention sla. so we don't need google profiling.
the content is distilled into markdown pulled from cloudflare's browser rendering api.
i let cloudflare absorb the torrent of trackers and robot checks, i just get md from the api with nothing else. cloudflare is poacher and gamekeeper.
an alternative is groq compound which can call browsers in parallel.
for interactive sites, or local ai browsing, i sometimes run a browser in a photon os docker with vnc, which gives you the same browser window but it runs code not on your pc.
that said little of my use is now interacting with websites, its all agentic search and websets so i don't have to spend mental energy on it myself
lukewarm707 1 days ago [-]
is this bad?
atoav 2 days ago [-]
What if I run a website and OpenAI produces bot traffic? Do they also consider it abuse when they do it?
SV_BubbleTime 2 days ago [-]
Firefox multicontainers are pretty cool. But it’s an advanced process that most people wouldn’t do or do correctly.
Sabinus 2 days ago [-]
I love the containers too. My current use case is to keep my YouTube account separate from my Google one. Google doesn't need all that behavioural data in one place.
It's a pity Firefox doesn't get the praise it deserves half as much as it cops criticism.
halJordan 2 days ago [-]
It is absolutely not an advanced process. It's clicking a gui. It's not advanced thinking to understand profiles. It's a basic ability to hold multiple things in your mind at once. Telling people that's difficult only increases the societal problem that being ignorant is ok.
docjay 2 days ago [-]
“Difficult” is a relative term. They were saying it was a difficult concept for them, not you. In order to save their ego, people often phrase those events to be inclusive of the reader; it doesn’t feel as bad if you imagine everyone else would struggle too. Pay attention and you’ll notice yourself doing it too.
“Ignorant” is also infinite - you’re ignorant of MANY things as well, and I’m sure you would struggle with things I can do with ease. For example, understanding the meaning behind what’s being said so I know not to brow-beat someone over it.
SV_BubbleTime 2 days ago [-]
Mostly right; it’s not that it was difficult for me. It’s that normal people are never going to do it.
I’m almost endlessly surprised by the probably-autistic-spectrum responses to tech things from people with no idea how things seem to other people.
subscribed 2 days ago [-]
Most of the people I met outside work wouldn't understand this concept.
I think you're lucky to hang around people whose heads don't hurt when they think.
Imustaskforhelp 2 days ago [-]
The possibilities with Firefox multi containers and automation scripts as well are truly endless.
It's also possible to make Firefox route each container through a different proxy, which could even be running locally and connect out to multiple different VPNs. I haven't tried doing that, but it's certainly possible.
It sort of lets you run different browsers with completely new identities (and sometimes IPs) within the convenience of one. It's really underrated. I don't use the IP part of what I mentioned, but I use multi containers quite a lot on Zen; they are a core part of how I browse the web, and there are many cool things which can be done/have been done with them.
gib444 1 days ago [-]
> It's getting to the point where a user needs at minimum two browsers. One to allow all this horrendous client checking so that crucial services work, and another browser to attempt to prevent tracking users across the web.
Every time I try this, I end up crossing wires (i.e., using the browser that 'works' for most things more than the one that is 'broken').
cruffle_duffle 2 days ago [-]
There is also the browser I use to get Claude to route around people blocking its webfetch. Both Playwright and chrome-mcp.
gck1 2 days ago [-]
Camoufox?
gruez 2 days ago [-]
>It's getting to the point where a user needs at minimum two browsers. One to allow all this horrendous client checking so that crucial services work, and another browser to attempt to prevent tracking users across the web.
What are you talking about? It works fine with firefox with RFP and VPN enabled, which is already more paranoid than the average configuration. There are definitely sites where this configuration would get blocked, but chatgpt isn't one of them, so you're barking up the wrong tree here.
scared_together 2 days ago [-]
Is your interlocutor barking up the wrong tree, or are you missing the forest for the trees?
According to the OP:
> The program checks 55 properties spanning three layers: your browser (GPU, screen, fonts), the Cloudflare network (your city, your IP, your region from edge headers), and the ChatGPT React application itself (__reactRouterContext, loaderData, clientBootstrap).
I guess Firefox VPN will hide the IP at least. But what about the other data, is it faked by RFP? Because if not, the so-called privacy offered by this configuration is outdated.
You might be fingerprinted by OpenAI right now, as “that guy with all the Firefox anti-fingerprinting stuff enabled, even though it breaks other sites”.
gruez 1 days ago [-]
>But what about the other data, is it faked by RFP?
Yes, RFP spoofs or at least somewhat obfuscates/normalizes GPU/screen/font info. The rest are integrity validations of the server/app, and not really identifying in any way.
>You might be fingerprinted by OpenAI right now, as “that guy with all the Firefox anti-fingerprinting stuff enabled, even though it breaks other sites”.
I'm not sure what the broader point you're trying to make here is. Is fingerprinting bad? Yes. All things being equal, I'd rather not have it than have it, but at the same time it's not realistic to expect openai to serve anonymous requests from anyone. Back when chatgpt was first launched you had to sign up and verify your phone number. Compared to mandatory logins, fingerprinting is definitely the lesser evil here.
scared_together 22 hours ago [-]
I wasn’t thinking too hard about the distinction between an integrity check and an identifiable detail, and I guess it makes sense that you’d be okay with one and not the other.
My broader point would have been that if OpenAI can identify you even when using Firefox RFP, it doesn’t make sense to give them credit for letting you use ChatGPT with RFP enabled. But maybe I was making too many assumptions.
lionkor 2 days ago [-]
Hi Nick, first of all, very cool of you to respond here instead of letting us all sit in the dark. I think that's what makes HN special.
That said, is it not a little bit weird that you want to protect yourself from scraping and bots, when your entire company, product, revenue, and your employment, depends on the fact that OpenAI can bot and scrape literally every part of the internet? So your moat is non-hydrated react code in the frontend?
Schiendelman 1 days ago [-]
Don't beat up an engineer for decisions made by company leadership. It's really inappropriate.
diebillionaires 1 days ago [-]
Yeah, no one is responsible for what they do as long as someone else tells them to do it.
lionkor 1 days ago [-]
They decided to work at this company, I think it's a reasonable discussion to have?
SilasX 1 days ago [-]
While I would generally sympathize on that front, it doesn't really apply here.
None of the management-level desiderata he appealed to require that the user experience be broken this badly. Blocking typing at that stage provides very little bot deterrence while heavily impacting user experience, especially on mobile.
Don’t know if it’s related to the article, but the chats ui performance becomes absolutely horrendous in long chats.
Typing in the chat box is slow, rendering lags, and sometimes it gets stuck altogether.
I have a research chat that I have to think twice before messaging because the performance is so bad.
Running on iPhone 16 safari, and MacBook Pro m3 chrome.
DenisM 2 days ago [-]
In the good old days Netflix had "Dynamic HTML" code that would take a DOM element which scrolled out of the viewport and move it to the position where it was about to be scrolled in from the other end. Hence the number of DOM elements stayed constant no matter how far you scroll and the only thing that grows is the Y coordinate.
They did it because a lot of devices running Netflix (TVs, DVD players, etc) were underpowered and Netflix was not keen on writing separate applications. They did, however, invest into a browser engine that would have HW acceleration not just for video playback but also for moving DOM elements. Basically, sprites.
The lost art of writing efficient code...
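For anyone who wants the mechanics: the windowing math is just arithmetic. A minimal sketch (my own naming, with the actual DOM recycling omitted):

```javascript
// Decide which fixed-height rows to render for a given scroll position.
// Rendered items get translated down by `offsetY`; a spacer element keeps
// the scrollbar sized to `totalHeight` so scrolling feels normal.
function visibleWindow(scrollTop, viewportHeight, itemHeight, itemCount, overscan = 3) {
  const start = Math.max(0, Math.floor(scrollTop / itemHeight) - overscan);
  const end = Math.min(
    itemCount,
    Math.ceil((scrollTop + viewportHeight) / itemHeight) + overscan
  );
  return { start, end, offsetY: start * itemHeight, totalHeight: itemCount * itemHeight };
}
```

On scroll you re-run this and reuse the same pool of DOM nodes for the new `[start, end)` range, which is exactly the sprite-like behavior described above.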
zdragnar 2 days ago [-]
> Hence the number of DOM elements stayed constant no matter how far you scroll and the only thing that grows is the Y coordinate.
This is generally called virtual scrolling, and it is not only an option in many common table libraries, but there are plenty of standalone implementations and other libraries (lists and things) that offer it. The technique certainly didn't originate with Netflix.
weird-eye-issue 2 days ago [-]
Yes, tables and lists, since they have a fixed height per item/row. Chat messages don't have a fixed height, so it's more difficult. And by more difficult I mean that every single virtual paging library that I've looked at in the past would not work.
amluto 1 days ago [-]
But they do have constant height in the sense that, unless you resize the window horizontally, the height doesn’t change.
For what it’s worth, modern browsers can render absurdly large plain HTML+CSS documents fairly well except perhaps for a slow initial load as long as the contents are boring enough. Chat messages are pretty boring.
I have a diagnostic webpage that is a few million lines long. I could get fancy and optimize it, but it more or less just works, even on mobile.
weird-eye-issue 1 days ago [-]
Exactly, browsers can render it fast. It's likely a re-rendering issue in React. So the real solution is just preventing the messages from getting rendered too often instead of some sort of virtual paging.
zdragnar 1 days ago [-]
Dynamic height of virtual scrolling elements is a thing. You just need to recalculate the scrollable height on the fly. tanstack's does it, as do some of the nicer grid libraries.
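A sketch of what that bookkeeping looks like (the measurement step and cache invalidation on resize are left out; names are mine, not tanstack's):

```javascript
// Variable-height virtualization: keep prefix sums of measured item
// heights, then binary-search for the first item visible at a scrollTop.
function buildOffsets(heights) {
  const offsets = [0];
  for (const h of heights) offsets.push(offsets[offsets.length - 1] + h);
  return offsets; // offsets[i] = y position of item i; last entry = total height
}

function firstVisible(offsets, scrollTop) {
  let lo = 0, hi = offsets.length - 2;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (offsets[mid + 1] <= scrollTop) lo = mid + 1; // item mid ends above the viewport
    else hi = mid;
  }
  return lo;
}
```

When an item re-measures, you recompute the suffix of the prefix sums, which also updates the total scrollable height on the fly.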
weird-eye-issue 1 days ago [-]
To be fair I haven't looked at any solutions in about a decade lol
tmpz22 2 days ago [-]
It's been about three years, but infinite scroll is nuanced depending on the content that needs to be displayed. It's a tough nut to crack and can require a lot of maintenance to keep stable.
None of which ChatGPT can handle, presumably.
dotancohen 2 days ago [-]
And yet ChatGPT does not use it.
GP was mentioning that a solution to the problem exists, not that Netflix specifically invented it. Your quip that the technique is not specific to Netflix bolsters the argument that OpenAI should code that in.
jasonfarnon 2 days ago [-]
I'm ignorant of the tech here. But I have noticed that ctrl-F search doesn't work for me on these longer chats. Which is what made me think they were doing something like virtual scrolling. I can't understand how the UI can get so slow if a bunch of the page is being swapped out.
dotancohen 2 days ago [-]
Ctrl-A for select all doesn't work either. I actually wondered how they broke that.
BoorishBears 2 days ago [-]
They didn't actually name the solution: the solution is virtualization.
They described Netflix's implementation, but if someone actually wanted to follow up on this (even for their own personal interest), Dynamic HTML would not get you there, while virtualization would across all the places it's used: mobile, desktop, web, etc.
groundzeros2015 2 days ago [-]
This is how every scrolling list has been implemented since the 80s. We actually lost knowledge about how to build UI in the move to web
bloomca 2 days ago [-]
The biggest issue is that there is no native component support for that. So everyone implements their own and it is both brittle and introduces some issues like:
- "ctrl + f" search stops working as expected
- the scrollbar has wrong dimensions
- sometimes the content might jump (common web issue overall)
The reason why we lost it is because web supports wildly different types of layouts, so it is really hard to optimize the same way it is possible in native apps (they are much less flexible overall).
TeMPOraL 2 days ago [-]
Right. This is one of my favorite examples of how badly bloated the web is, and how full of stupid decisions. Virtual scrolling means you're maintaining a window into content, not actually showing full content. Web browsers are perfectly fine showing tens of thousands of lines of text, or rows in a table, so if you need virtual scrolling for less, something already went badly wrong, and the product is likely to be a toy, not a tool (working definition: can it handle realistic amount of data people would use for productive work - i.e. 10k rows, not 10 rows).
exchemist 1 days ago [-]
Agreed - I've had this argument with people who've implemented virtual scroll on technical tools and now users can't Ctrl-F around, or get a real sense of where they are in the data. Want to count a particular string? Or eyeball as you scroll to get a feel for the shape of it?
More generally, it's one of the interesting things about working in a non-big-tech company with non-public-facing software. So much of the received wisdom and culture in our field comes from places with incredible engineering talent but working at totally different scales with different constraints and requirements. Some of the time the practices, tools, and approaches advocated by big tech apply generally, and sometimes they do things a particular way because it's the least bad option given their constraints (which are not the same as our constraints).
There are good reasons why Amazon doesn't return a 10,000 row table when you search for a mobile phone case, but for [data ]scientists|analysts etc many of those reasons no longer apply, and the best UX might just be the massive table/grid of data.
Not sure what the answer is, other than keep talking to your users and watching them using your tools :)
mike_hearn 1 days ago [-]
Desktop GUI toolkits aren't less flexible on layout, they're often more flexible.
We lost it because the web was never designed for applications and the support it gives you for building GUIs is extremely basic beyond styling, verging on more primitive than Windows 3.1 - there are virtually no widgets, and the widgets that do exist have almost no features. So everyone rolls their own and it's really hard to do that well. In fact that's one of the big reasons everyone wrote apps for Windows back in the day despite the lockin, the value of the built-in widget toolkit was just that high. It's why web apps so often feel flaky and half baked compared to how desktop apps tend(ed) to feel - the widgets just don't get the investment that a shared GUI platform allows.
bschwindHN 2 days ago [-]
Almost certainly running some sort of O(n^2) algorithm on the chat text every key press. Or maybe just insane hierarchies of HTML.
Either way, pretty wild that you can have billions of dollars at your disposal, your interface is almost purely text, and still manage to be a fuckup at displaying it without performance problems.
stacktraceyo 2 days ago [-]
Same. It’s wild how bad it can get with just like a normal longer running conversation
qingcharles 2 days ago [-]
OpenAI sites are the only ones that do this to me. I have to keep a separate browser profile just for my OpenAI login with absolutely nothing installed on it or it'll end up being dogshit slow and unusable.
moffkalast 2 days ago [-]
Yeah just had this earlier today, I had to write my response in vscode and paste it in, there were literal seconds of lag for typing each character. Typical bloated React.
scq 2 days ago [-]
Just because a web application uses React and is slow, it does not follow that it is slow because of React.
It's perfectly possible to write fast or slow web applications in React, same as any other framework.
Linear is one of the snappiest web applications I've ever used, and it is written in React.
moffkalast 2 days ago [-]
Sure it's possible but those are a handful of exceptions against the norm, when the general approach so easily guides you towards bloat upon bloat that you have to be an expert to actively avoid going down that route.
brigandish 2 days ago [-]
Does not, in the seeming absence of other snappy examples and the overwhelming evidence of many, many slow React apps, the exception prove the rule?
scq 2 days ago [-]
There are plenty of snappy examples. Off the top of my head: Discord, Netflix, Signal Desktop, WhatsApp Web.
genthree 1 days ago [-]
Those all perform really poorly.
TeMPOraL 1 days ago [-]
Discord, maybe. But Netflix and WhatsApp Web? Those are bloated cows, just less broken than average.
PunchyHamster 2 days ago [-]
That's how eating your own dogshit works, or whatever that saying was.
sebmellen 2 days ago [-]
Great to hear from a first-party source. I'm a Pro subscriber and my team spends well over two thousand dollars per month on OpenAI subscriptions. However, even when I'm logged in with my Pro account, if I'm using a VPN provider like Mullvad, I often have trouble using the chat interface or I get timeout errors.
Is this to be expected? I would presume that if I'm authenticated and paying, VPN use wouldn't be a worry. It would be nice to be able to use the tool whether or not I'm on a VPN.
JumpCrisscross 2 days ago [-]
> even when I'm logged in with my Pro account, if I'm using a VPN provider like Mullvad, I often have trouble using the chat interface or I get timeout errors
Heard from a founder who recently switched his company to Claude due to OpenAI's lagginess–it's absolutely an OpenAI problem. Not an AI problem in general.
seba_dos1 2 days ago [-]
Hi! It's all perfectly understandable - after all, we use things like Anubis to protect our services from OpenAI and similar actors and keep them available to the real users for exactly the same reasons.
lm411 2 days ago [-]
"we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform"
The scary part is that you don't even see the irony in writing this.
Or, are you just okay "misusing" everyone for your own benefit?
noosphr 2 days ago [-]
>These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
Can you share these mitigations so we can mitigate against you?
0x3f 2 days ago [-]
It's just Cloudflare. Bypassing it is a whole industry.
zenethian 2 days ago [-]
I read the comment as “use it to mitigate against OpenAI bots scraping the web” and not to mitigate Cloudflare.
0x3f 2 days ago [-]
Well it's the same answer isn't it... use Cloudflare. And hope OpenAI doesn't have a backroom scraping deal with them, which they might.
dawnerd 2 days ago [-]
Flaresolverr is one way. Isn’t perfect but bypasses a lot.
driverdan 2 days ago [-]
Brand new account with 2 comments in this thread. How can we be sure you're not a bot deployed to defend OpenAI?
Please run Cloudflare's privacy invasive tool and share all the values it generates here so we can determine if you're a real person.
ghm2199 1 days ago [-]
Would OpenAI also consider remunerating every site they scraped that had a robots.txt file they chose to ignore anyway? Feel free to not answer this question.
I have kind of lost count of how many content creators have told me personally that traffic is meaningfully down because of all these chatbots. The latest example is this poor but stand-up guy: moneyfortherestofus.com.
timeinput 1 days ago [-]
I'm really glad Hacker News disallows AI generated comments. The response I got from asking that question really is quite enlightening. Short answer: "no", long answer: "no -- fuck off", longer answer: "no -- fuck off -- if you want I can dig into whether or not you should fuck off harder"
mehov 2 days ago [-]
> because we want to keep free and logged-out access
But don't you run these checks on logged-in users too?
MyNameIsNickT 2 days ago [-]
Yep, on logged-in users too. The reason is basically the same: we want scarce compute going to real people, not attackers. Being logged in is one useful signal, but it doesn’t fully prevent automation, account abuse, or other malicious traffic, so we apply protections in both cases.
lelanthran 2 days ago [-]
> The reason is basically the same: we want scarce compute going to real people, not attackers.
You are defining "Bots" and "Scrapers" as a subset of attackers, though.
Is this really fair? The value in your product came from people who wrote for other people, not bots, but your bot scraped them anyway.
There is no way to determine if a request that is coming from my browser is typed in by me or automated with a browser extension. Your only way to win this "war" on "attackers" is by forcing users into using your own application to access your product.
My browser extension (see my previous reply on this story) automates the existing open tab I have to all the different chat AIs (GPT, Claude, Gemini, etc).
I suppose all you can do is rate-limit each user.
angoragoats 2 days ago [-]
Nothing you do can fully prevent automation. Someone who wants to automate requests badly enough will be able to do it, especially when the “protections” are as easy to decrypt and analyze as the OP proved.
Meanwhile, the rest of us (well, not me, because I don’t use your garbage product, but lots of others do) have to suffer and have our compute resources used up in the name of “protection.”
3form 2 days ago [-]
Yeah, that's it. Also, it is a bit amusing to me - "We want to prevent automation", says the employee of Let's Automate Inc.
geetee 2 days ago [-]
[flagged]
jorvi 2 days ago [-]
I'm glad you guys at least went with Cloudflare. LMArena went with Google's reCAPTCHA, which is plain evil. It'll often gaslight you and pretend you failed a captcha of identifying something as simple as fire hydrants. Another lovely trick is asking you to identify bridges or buses, when in actuality it also wants you to identify viaducts or semi-trucks.
salawat 2 days ago [-]
More like "We want your money, but don't want to provide service." Are you sure OpenAI isn't morphing into a finance/insurance company?
pixl97 2 days ago [-]
While OAI is one of the more hypocritical of the bunch, it is not uncommon for paid services to have some limitations in their terms of service. Like going into a store and buying stuff: it doesn't mean a free-for-all where you can do whatever you want.
zamadatix 2 days ago [-]
Limitations on the ChatGPT subscription should have to do with the usage limits of the tier you paid for (and I don't think anyone has a problem with that). If I'm in the limits of requests I paid for then it's usage rather than abuse.
"Abuse" checks should only come into play when someone tries to leverage the free tier. It reminds me of those cable companies that try to sell "unlimited" plans and then try to say customers who use more than x GB/month are abusing the service rather than just say what the real limits are because "unlimited" sounds better in marketing.
c0_0p_ 2 days ago [-]
Can't have those bots or scrapers running amok can we...
conartist6 1 days ago [-]
Still feels very anti-consumer.
If every company behaved like you do, the internet would be a much worse place.
In fact, OpenAI has already made the Internet a much worse place, already much, much less open and much less optimistic about its own future than it was even five years ago...
arendtio 2 hours ago [-]
Do you do those checks only for users without accounts or also for those with accounts?
pdntspa 2 days ago [-]
Y'all just salty that DeepSeek et al are training their LLMs on yours
wiseowise 2 days ago [-]
> A big reason we invest in this is because we want to keep free and logged-out access available for more users.
Thank you for the reply, Nick. It wouldn’t be a problem to disable the tracking for authenticated users then, would it?
lloydatkinson 2 days ago [-]
It would because someone's KPI depends on number of tracked users lol
matsemann 1 days ago [-]
If logging in disabled all checks, all bots would just spam-create users first. Of course it needs to run for all users, without it being necessarily nefarious.
toddmorey 1 days ago [-]
Paid users?
lm411 2 days ago [-]
"Integrity at OpenAI"
Basically an oxymoron at this point.
witx 2 days ago [-]
> These checks are part of how we protect our first-party products from abuse like bots, scraping,
Do you guys see the irony here?
hosteur 2 days ago [-]
They obviously get it. They just do not care.
the_gipsy 2 days ago [-]
But is the title true, is typing specifically blocked? Or does it just block submitting the text?
I ask because I have seen huge variations in load time. Sometimes I had to wait seconds until being able to type. Nowadays it seems better though.
xg15 1 days ago [-]
> how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
Are you applying the same standards to your own scraper bots?
numlock86 2 days ago [-]
> [...] we protect our first-party products from abuse like [...] scraping [...]
what an odd thing to say for someone whose product is built entirely on exactly that
egorfine 2 days ago [-]
Paying customer since inception here.
I presume the local ChatGPT.app has even more measures to prevent automation, right? Presumably privacy-invasive ones as it is customary these days?
Is there a way I can opt out? I really, really, really don't like it.
radicality 1 days ago [-]
The way I use the products is something like this: my main account on my MacBook for the ChatGPT website and Codex CLI. Then, a Mac VM running via UTM with a shared writable dir for anything more 'shady' in terms of permissions and for playing with new AI apps (e.g. the ChatGPT/Codex standalone apps, Atlas, the Claude desktop app, etc.). Seems to work decently enough.
And I do totally agree that there should be a way to opt out of all these privacy invasive measures, especially after paying $200/mo
huertouisj 2 days ago [-]
Sometimes I paste giant texts (think summarization) into the ChatGPT (paid) web app, and I've noticed that the CPU fans spin up for about 5 seconds afterwards, as if the text is "processed" client-side somehow. This is before hitting "submit" to send the prompt to the model.
I assumed it was maybe some tokenization going on client side, but now I realize maybe it's some proof of work related to prompt length?
tipiirai 2 days ago [-]
I don't trust what OpenAI says. Sam Altman gives shivers, and these kinds of blog posts make things look even worse.
account42 10 hours ago [-]
> These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
The lack of self awareness...
myHNAccount123 2 days ago [-]
Can you fix the resizing text box issue on Safari when a new line is inserted? When your question wraps to a newline Safari locks up for a few seconds and it's really annoying. You can test by pasting text too.
vkou 2 days ago [-]
> Hey! I'm Nick, and I work on Integrity at OpenAI. These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
How can first-party products protect themselves from abuse by OpenAI's bots and scraping?
mystraline 2 days ago [-]
This is a completely in-scope question.
How do we defend against your scraping, OpenAI?
I dont want any of my content scraped or seen by you all. Frankly, fuck you all for thinking my content is owned by you.
CableNinja 2 days ago [-]
I use nginx conditionals and useragent checking, then respond with 418 or 410.
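A sketch of that setup; the user-agent patterns here are the commonly published crawler names, so check your own logs for what actually hits you:

```nginx
# Classify AI crawler user agents once per request.
map $http_user_agent $ai_bot {
    default                 0;
    ~*GPTBot                1;
    ~*ChatGPT-User          1;
    ~*OAI-SearchBot         1;
    ~*CCBot                 1;
}

server {
    # ... your usual config ...
    if ($ai_bot) {
        return 410;  # Gone (or 418, if you prefer the teapot)
    }
}
```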
The article is from 2024. Is this still happening?
ImPostingOnHN 1 days ago [-]
Do we have any evidence they started complying?
If not, we can conclude they did not, until such evidence shows up.
stefanka 1 days ago [-]
I’m genuinely curious to know whether there was a change in behavior, especially after OpenAI informed people about how to prevent scraping (robots.txt, etc.).
ImPostingOnHN 1 days ago [-]
I am as well. Like, is there any evidence of a change, or can we assume nothing changed?
wilg 2 days ago [-]
should be pretty easy to test and not rely on an anonymous source from a weird analytics company via business insider. are these bots actually from openai or are they just using their user agent? are they coming from openai ip ranges? etc. https://openai.com/gptbot.json
stefanka 1 days ago [-]
Are all of OpenAI’s ip ranges known?
ImPostingOnHN 1 days ago [-]
> should be pretty easy to test
I look forward to your results, whether or not they disprove the article.
leros 1 days ago [-]
Fwiw, I stopped using ChatGPT and went to a competitor because the checks slow down ChatGPT so much that the webapp becomes unusable in anything but a new short chat. CPU usage goes to 100%, you can't type, the entire tab freezes, etc. It's a miserable experience to use and I'm on a relatively new MacBook not some old computer. If you read around it's a very common problem people have been having for a while now.
jesuslop 1 days ago [-]
Hi Nick, the lag is quite bad in the field, honest. In the desktop app, in this case/datapoint. There was that "Halt and Catch Fire" episode where they spoke about a millisecond threshold of delay that separated usability from non-usability. Solvent hw and fiber connection.
dev1ycan 2 days ago [-]
"abuse like bots, scraping, fraud, and other attempts to misuse the platform"
This has to be a joke, right?
pera 2 days ago [-]
I really can't tell for sure (new user posting a ridiculously hypocritical corporate message on a Sunday) but if GP actually works for OpenAI the lack of self-awareness is seriously striking
singpolyma3 2 days ago [-]
How?
oblio 2 days ago [-]
Because OpenAI built their entire business around shamelessly scraping anything that had bits on it.
singpolyma3 2 days ago [-]
Maybe. But scraping isn't abuse. Seems a bit different?
cycomanic 2 days ago [-]
Quoting the OP
> These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
That implies that OpenAI (or at least this employee) considers scraping abuse.
PunchyHamster 2 days ago [-]
Given that the scraping doesn't do any rate limiting and pisses on robots.txt, yes it is abuse
singpolyma3 2 days ago [-]
Is there any evidence OpenAI has been ignoring robots.txt for scraping purposes? AFAIK the main sources of that traffic are still unknown.
ludwik 2 days ago [-]
The top comment categorized scraping as abuse ("abuse such as [...] scraping") - that's precisely why some accuse its author of lack of self awareness.
xtajv 1 days ago [-]
Earnest question: if I was feeling lazy and security-conscious at the same time, would I be better off...
(A) opening chatgpt.com in qubes (but staying logged out, i.e. never creating a chatgpt account)
-or-
(B) creating a freemium chatgpt account
?
(Obviously, the "best" answer would be something like running a local LLM from an airgapped machine in a concrete bunker :) But that's not what I'm after).
20k 1 days ago [-]
>abuse like bots, scraping
10/10, I've got no notes
freeopinion 2 days ago [-]
It's your business and your call. But my opinion is that I wish you would quit offering free services. I'm pretty concerned about the horrible effect your free services are having on education. Yes, AI can be an incredible tool to enhance education. But the reality is that it is decimating children's will to learn anything.
I don't want to blame AI for all the world's problems. And I don't want to throw the baby out with the bath water. But I think you should think really hard about the value of gates. Smart people can build better gates than cash. But right now, cash might be better than nothing. Clearly you have already thought about how to build gates, but I don't think you have spent enough time thinking about who should be gated and why. You should think about gates that have more purpose than just maximizing your profit.
"We want to hook as many people as possible without letting in our competitors" is a pretty crummy thought to use as a public justification.
(Edited for typos.)
cheese_van 1 days ago [-]
<protect our first-party products from abuse like scraping>
Abuse from scraping has long been a serious problem for many, good job!
invalidusernam3 2 days ago [-]
But why block the ui until then? Surely you can just not make any requests until the checks are complete?
piskov 2 days ago [-]
Tangential question: are there chatgpt app devs on X? There are a few from Codex team but I couldn’t find guys from “ordinary” chatgpt.
Also if you could pass this over: it takes 5 taps to change thinking effort on ios and none (as in completely hidden) on macos.
If I were to guess it seems that you were trying to lower the token usage :-). Why the effort is only nicely available on web and windows is beyond me
toddmorey 1 days ago [-]
Why are all these checks still performed on an authenticated, paid user?
aucisson_masque 1 days ago [-]
Why send the Turnstile bytecode encrypted? People savvy enough to abuse the system will figure out how to decrypt it (see OP), and it gives the impression that you're trying to hide stuff you're not proud of.
pocksuppet 1 days ago [-]
Because they want to make it as hard as possible to reverse engineer. If they wanted it to be easy, they'd use <input type="checkbox" name="ishuman">I am a human
diebillionaires 1 days ago [-]
As a free tier user I only get like three queries in now without model quality reduction, so I'd say your bases are covered as far as GPU costs around misuse.
gck1 2 days ago [-]
I always wondered why you even have logged out access. I'm glad I can use ChatGPT in incognito when I want a "clean room" response, but surely that's not the primary use case.
Is user base that never logs in really that significant?
pocksuppet 1 days ago [-]
This episode proves they know who you are, even when you're logged out. If they didn't know, they wouldn't let you use the service.
rglullis 2 days ago [-]
I shouldn't be giving ideas to your boss, but I bet he would be interested in making ChatGPT available only to paying customers, or free for those who get their eyes scanned by The Orb. Give 30 days of raised limits and we're all set to live in the dystopia he wants.
prmoustache 2 days ago [-]
> we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
Isn't that how you build your service from the very start? How ironic.
sourcecodeplz 1 days ago [-]
I really appreciate the free options, without even needing a login. Wish they would also keep the small free weekly allowance for Codex.
JumpCrisscross 2 days ago [-]
> we want to keep free and logged-out access available for more users
How does this comport with OpenAI's new B2B-first strategy?
> We also keep a very close eye on the user impact
Are paid or logged-in users also penalised?
mghackerlady 1 days ago [-]
No, leave it. Surely the mighty OpenAI can deal with the scraping. At least, it seems to think everyone else can
andrepd 2 days ago [-]
> OpenAI: These checks are part of how we protect products from abuse like bots, scraping, and other attempts to misuse the platform.
This would be fucking HILARIOUS if it wasn't so tragic.
rchaud 2 days ago [-]
Manifest destiny for me, border enforcement for thee.
lmz 2 days ago [-]
This kind of flawed thinking again. Like the natives didn't fight and lose wars against the manifest destiny types.
ImPostingOnHN 1 days ago [-]
I don't think anybody claimed no Native Americans tried to fight back against their genocide?
lmz 1 days ago [-]
It's painting border enforcement as somehow immoral. There is no sin in trying to be better at it than those before.
ImPostingOnHN 1 days ago [-]
genociding people to take their land within their borders is generally frowned upon today
lmz 22 hours ago [-]
If only they were better at border control, maybe they wouldn't all get killed off.
ImPostingOnHN 2 hours ago [-]
yes, generally it is frowned upon today to genocide people and take their land within their borders
Chance-Device 2 days ago [-]
It can be both
subscribed 2 days ago [-]
> "abuse like bots, scraping"
You what, mate? Would you please use that on yourselves first? Because it comes off as a GROSS hypocrisy. State of the art hypocrisy.
>> behavioral biometric layer
But this one, especially, takes the cake.
Quite disgusting.
kelnos 2 days ago [-]
> A big reason we invest in this is because we want to keep free and logged-out access available for more users.
Are these checks disabled for logged-in, paid users?
nicbou 2 days ago [-]
For what it's worth, I switched to Gemini because of the long ChatGPT load time. Gemini loads as fast as Google Search.
htx80nerd 1 days ago [-]
Thanks. I've used ChatGPT a million times and never had any input issues.
matheusmoreira 1 days ago [-]
> protect our first-party products from abuse like bots, scraping
You do see the irony here?
gmerc 2 days ago [-]
the company that scrapes every until it collapses really needs to protect itself from scraping. Lol.
SubiculumCode 2 days ago [-]
In long threads in chatgpt, it grinds to a halt in both Chrome and Firefox. Please fix
lifis 1 days ago [-]
Are you disabling them for paying subscribers?
tekawade 2 days ago [-]
Hey Nick, I find it concerning that this account appears to have been created just to comment on this thread, and never even replies to any of the real concerns.
Here's hoping this is a real person who actually created the account out of concern and a wish to share.
sandeepkd 1 days ago [-]
You do not ever trust the client side. Sometimes being simple is good enough. The most you can do is put rate limits on the IP address and/or user account. You just do not want someone using the product at machine speeds.
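The per-IP rate limiting described above can be sketched as a server-side token bucket. This is a minimal illustration, not anyone's production implementation; the rate and burst values are made-up assumptions:

```python
# Token-bucket rate limiter keyed by client IP. Each IP gets a bucket that
# refills at `rate` tokens/second up to `burst`; a request spends one token.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float = 1.0, burst: int = 5):
        self.rate = rate              # tokens refilled per second
        self.burst = burst            # maximum bucket size
        self.tokens = float(burst)    # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def handle_request(ip: str) -> bool:
    """Return True if the request should be served, False if throttled."""
    return buckets[ip].allow()
```

The key property matching the comment: a human never notices a burst allowance of a few requests per second, while a client operating at machine speed exhausts its bucket immediately.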
ryanmcbride 1 days ago [-]
Protecting your site from bots and scraping is absolutely hilarious considering how you acquired (read: stole) the data you trained your bot on dude.
Just yank that ladder up behind you.
pocksuppet 1 days ago [-]
> Just yank that ladder up behind you.
You would be an irresponsible entrepreneur if you didn't. Don't forget your legal obligation to maximise shareholder value.
blactuary 1 days ago [-]
> I work on Integrity at OpenAI
Irony is truly dead. Show you have integrity by quitting your job
potsandpans 2 days ago [-]
Chatgpt banned me after I said disparaging things about Sam Altman in a chat.
When I appealed the ban, I was told that I couldn't be told exactly why I was banned, but if I wrote a written apology and "promised to never do it again" my ban could be appealed.
I asked for an update on the ban via email every month for over a year.
Maybe you could tell me a little bit about that process?
0dayman 2 days ago [-]
Hi Nick, your software is a horrendous encroachment on users' privacy and its quality is subpar to those of us who know what we're working with. We don't use your product here.
chronc6393 2 days ago [-]
> Hi Nick, your software is a horrendous encroachment on users' privacy and its quality is subpar to those of us who know what we're working with. We don't use your product here.
It’s ok, OpenAI is cooked.
Feel bad for anyone who joined OAI in the past 12 months. Their RSU ain’t going to be worth much later this year. IPO is too late.
jgalt212 2 days ago [-]
> we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform
Have you just described the dilemma facing all the content sites used to train LLMs?
MisterTea 1 days ago [-]
> These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
Isn't this the same behavior used by AI companies to gather training data? Pot, meet kettle.
AndrewKemendo 1 days ago [-]
Kudos for trying
This whole thread was like watching a swarm of ants try and take a grasshopper down
SilasX 1 days ago [-]
It has not been negligible for me, and, however you're doing this, there is significant room for improvement.
There have been times when, across about ten minutes of usage, most of which is me typing on iOS Safari, it drained 15% of my battery. There is no functional justification for this beyond poor code quality. (It was on a long conversation FWIW.)
This is while I'm logged in, with a paid (Plus) account, connected to a very old email address with a real user profile. That can't be the result of super-clever bot defense measures, because it's merely an inconvenience on desktop. And if you genuinely believe that email has been compromised, why aren't you reaching out to the account owner, as the account isn't otherwise connected to fraud by your heuristics?
However brilliant the LLM agent is, I'm seeing a lot of unforced errors in how you implement a web interface to it. If it makes you feel any better, it doesn't really register compared to all the bloat I see on other sites.
marxisttemp 1 days ago [-]
History will not be kind to you and your ilk. Quit your job.
owebmaster 1 days ago [-]
The reason why you did it is clear, why you guys settle down for such a poor implementation is why this thread exists
wackget 1 days ago [-]
I understand it's not your area, but can you please politely tell your colleagues that the clickbait-type teaser questions from the latest model are absolutely infuriating and are quickly leading me to abandon the platform entirely?
If you'd like, I can write a two-sentence paragraph to send to your colleagues. It contains a special phrase which most colleagues will find difficult to ignore. Would you like me to do that?
crest 2 days ago [-]
Then make sure they only target the free tier!
quotemstr 2 days ago [-]
We really need ZKPs of humanity
ctoth 2 days ago [-]
No, we really don't. We don't need worldcoin, we don't need papers, please. We just don't.
"Prove your humanity/age/other properties" with this mechanism quickly goes places you do not want it to go.
Muromec 2 days ago [-]
> quickly goes places you do not want it to go.
Which places?
quotemstr 2 days ago [-]
No, it doesn't go places we "do not want it to go". What part of zero knowledge doesn't make sense? How precisely does a free, unlinkable, multi-vendor, open-source cryptographic attestation of recent humanity create something terrible?
It would behoove people to engage with the substance of attestation proposals. It's lazy to treat any verification scheme whatsoever as equivalent to a panopticon; dystopia as a thought-terminating cliché.
We really do have the technology now to attest biographical details in such a way that whoever attests to a fact about you can't learn the use to which you put that attestation and in such a way that the person who verifies your attestation can see it's genuine without learning anything about you except that one bit of information you disclose.
And no, such a ZK scheme does not turn instantly into some megacorp extracting monopoly rents from some kind of internet participation toll booth. Why would this outcome be inevitable? We have plenty of examples of fair and open ecosystems. It's just lazy to assert right out of the gate that any attestation scheme is going to be captured.
So, please, can we stop casting every scheme for verifying facts about actors as the East German villain in a Cold War movie? We're talking about something totally different.
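One classic building block behind the unlinkability property described above is the blind signature: the attester signs a token it never sees, so later presentation of the signed token can't be linked back to the signing session. A deliberately insecure toy RSA sketch (tiny made-up parameters, fixed blinding factor for reproducibility; real schemes use large keys, random blinding, and proper padding):

```python
# Toy RSA blind signature demonstrating unlinkable attestation.
# NOT secure: illustrative parameters only.
import hashlib
import math

# Attester's RSA key (toy primes; real systems use 2048+ bit moduli).
p, q = 61, 53
n = p * q                                # 3233
e = 17
d = pow(e, -1, (p - 1) * (q - 1))        # private exponent

# The user's token, reduced mod n for this toy example.
token = int.from_bytes(hashlib.sha256(b"session-nonce").digest(), "big") % n

# User blinds the token before sending it to the attester.
r = 7                                    # must be random and coprime to n in practice
assert math.gcd(r, n) == 1
blinded = (token * pow(r, e, n)) % n

# Attester signs the blinded value; it never sees `token`.
blind_sig = pow(blinded, d, n)

# User unblinds: (token^d * r) * r^-1 = token^d mod n.
sig = (blind_sig * pow(r, -1, n)) % n

# Any verifier can check the signature without learning anything
# that links it to the signing session.
assert pow(sig, e, n) == token
```

This is only the linkability half of the story; modern proposals layer zero-knowledge proofs on top so the verifier learns a single predicate ("attested human recently") rather than the token itself.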
ctoth 2 days ago [-]
The ZK part isn't the problem. The "attestation of recent humanity" part is. Who attests? What happens when someone can't get attested?
You've been to the doctor recently, right? Given them your SSN? Every identity system ever built was going to be scoped || voluntary. None of them stayed that way.
Once you have the identity mechanism, "Oh it's zero knowledge! So let's use it for your age! Have you ever been convicted?" which leads to "mandated by employers" which leads to...
We've seen this goddamn movie before. Let's just skip it this time? Please?
dzikimarian 2 days ago [-]
The part where FAANG does usual Embrace, Extend, Extinguish, masses don't care/understand and we have yet another "sign in with... " that isn't open source nor zero-knowledge in practice and monetizes your every move. And probably at least one of the vendors has massive leak that shows half-assed or even flawed on purpose implementation.
gzread 2 days ago [-]
Sure. I'll provide an API to provide mine to your bot for $1 each time.
user3939382 2 days ago [-]
Have you given any thought to what we trade when big tech elects one corporation as the gatekeeper for vast swaths of the Internet?
rsrsrs86 1 days ago [-]
Hi Nick, do you believe what you say? You scraped the shit out of everyone
Razengan 2 days ago [-]
> we want to keep free and logged-out access available for more users.
And THANK YOU for that!
Being able to use ChatGPT and Grok without signing in is a big part of why I like those services over Gemini etc.
Hell, dummy Claude won't even let me Sign-In-with-Apple on the Mac desktop, even though it let me Sign-UP-with-Apple on the iPhone! BUT they do support Sign-In-with-Google!!? What in the heavenly hell is this dumbassery
tomalbrc 2 days ago [-]
Fake Account
thegreatpeter 2 days ago [-]
You’re doing gods work sir, thank you!
boesboes 1 days ago [-]
lol, hypocrites.
nickphx 2 days ago [-]
the irony of your statement is hilarious, disappointing, and infuriating.
lxgr 2 days ago [-]
It's absurd how unusable Cloudflare is making the web when using a browser or IP address they consider "suspicious". I've lately been drowning in captchas for the crime of using Firefox. All in the interest of "bot protection", of course.
lucasfin000 2 days ago [-]
The real frustrating part is that Cloudflare's "definition" of suspicious keeps changing and expanding. VPN users, privacy-first browsers, uncommon IP ranges, they all get flagged. The people most likely to get caught by these systems are exactly the ones who care most about their privacy, and not the bots that they are apparently targeting.
gruez 2 days ago [-]
>The real frustrating part is that Cloudflare's "definition" of suspicious keeps changing and expanding.
That's... exactly expected? It's a cat and mouse game. People running botnets or AI scrapers aren't diligently setting the evil bit on their packets.
Gormo 9 hours ago [-]
To the contrary, people running botnets or AI scrapers are likely going out of their way to mimic ordinary web traffic from consumer devices. Ultimately, these measures will only affect users who are trying to protect their privacy and security, and will be ineffective at stopping bots.
lxgr 2 days ago [-]
So the stable state here is all humans eventually being locked out? (Bots are getting better every day; I doubt the same is true for all humans, including those with weird browsers or networks unwilling to install some dystopian Cloudflare "Internet passport".)
But hey, at least some bots are also not making it past Cloudflare!
small_scombrus 2 days ago [-]
> So the stable state here is all humans eventually being locked out?
Yep. The most easy to implement stable state for any system where you're aiming to prevent misuse is to just prevent use
WatchDog 2 days ago [-]
The inevitability is that these kinds of services just won't be offered without identifying yourself.
Claude's free tier requires a phone number just to try it.
sph 2 days ago [-]
PRISM as a Service.
kristjansson 1 days ago [-]
Or else a player too big to be blocked moves into the space with a service that provides some/all of the privacy benefits, but declines to offer the other undesirable aspects of VPN (e.g. location shifting to circumvent local restrictions)
i.e. iCloud private relay is the future
lxgr 1 days ago [-]
I’ve already had a few services lock me out with iCloud Private Relay.
jagged-chisel 2 days ago [-]
That’s obviously because they’re not being “evil”
Aurornis 2 days ago [-]
> The people most likely to get caught by these systems are exactly the ones who care most about their privacy, and not the bots that they are apparently targeting.
In my brief experience with abuse mitigation, connections coming from VPNs or unusual IP ranges were very significantly more likely to be associated with abuse.
It depends on your users. VPNs aren’t common at all, even though you hear about them a lot on Hacker News. For types of social sites where people got banned for abuse (forums) the first step to getting back on the forum was always to sign up for a VPN and try to reconnect. It got so bad that almost every new account connecting via VPN would reveal itself as a spammer, a banned member trying to return, or someone trying to sock puppet alternate accounts for some reason.
The worst offenders are Tor IP addresses. Anyone connecting from Tor was basically guaranteed to have bad intentions.
I heard from someone who dealt with a lot of e-mail abuse that the death threats, extortion, and other serious abuse almost always came from Protonmail or one of the other privacy-first providers that I can’t remember right now. He half-jokingly said they could likely block Protonmail entirely without impacting any real users.
It’s tough for people who want these things for privacy, but the sad reality is that these same privacy protections are favored by people who are trying to abuse services.
Gormo 9 hours ago [-]
> In my brief experience with abuse mitigation, connections coming from VPNs or unusual IP ranges were very significantly more likely to be associated with abuse.
Correlating these factors with abuse implies that you already have methods of identifying abuse per se, independently of these factors. Is there no feasible way of just blocking the abuse itself when it begins, or developing much more proximate indicators to act on?
> The worst offenders are Tor IP addresses. Anyone connecting from Tor was basically guaranteed to have bad intentions.
Do you handle this by blocking known Tor exit node IPs entirely, or just adding hurdles to attempts to post from those IPs?
> It’s tough for people who want these things for privacy, but the sad reality is that these same privacy protections are favored by people who are trying to abuse services.
But naturally P(A|B) and P(B|A) are two different things.
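The base-rate point in the last line can be made concrete with Bayes' rule. All numbers below are invented for illustration: even if most abusers use VPNs, most VPN connections can still be legitimate when abuse itself is rare.

```python
# Hypothetical priors (assumptions, not measured data):
p_abuser = 0.01             # 1% of connections are abusive
p_vpn_given_abuser = 0.80   # most abusers route through VPNs
p_vpn_given_legit = 0.05    # a small share of legitimate users do too

# Total probability that a connection arrives via VPN.
p_vpn = p_vpn_given_abuser * p_abuser + p_vpn_given_legit * (1 - p_abuser)

# Bayes' rule: P(abuser | VPN) = P(VPN | abuser) * P(abuser) / P(VPN)
p_abuser_given_vpn = p_vpn_given_abuser * p_abuser / p_vpn

print(f"P(abuser | VPN) = {p_abuser_given_vpn:.2f}")  # → P(abuser | VPN) = 0.14
```

So with these numbers, P(VPN | abuser) is 80% while P(abuser | VPN) is about 14%: blocking all VPN traffic would mostly hit legitimate users.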
frig57 1 days ago [-]
The idea that normal people don't use proton is incredibly wrong. Same with VPNs to a large extent.
I work a customer facing email job and loads of people use Proton across demographics and industries
next_xibalba 1 days ago [-]
About what percentage of “normal people” who are email users would you estimate use Proton?
gzread 2 days ago [-]
The solution is for more people to use Tor routinely. Like I'm doing right now.
perching_aix 1 days ago [-]
How does the Tor network counter abuse? Like, say you're hosting a service on the Tor network, what does the Tor network offer if anything to defend against e.g. DDoS attacks?
gzread 1 days ago [-]
It's a solution for users because you can't afford to demand ID from your users (such as an IP address) if all your users quit when you do that.
perching_aix 1 days ago [-]
Sure, but if the service keeps getting overwhelmed (financially or traffic-wise) or compromised (not even necessarily in the security sense but in the semantic purpose sense, like via spam floods on a message board) due to a lessened capability to combat abuse, then the user is worse off all over again, no?
All it would solve then is laundering Tor traffic from being probably malicious to being reputationally ambiguous. Though for a within-network service, that's probably assumed anyways - hard to run a Tor service if you assume all Tor users are malicious, that would be nonsensical.
whatisthiseven 2 days ago [-]
Which VPNs are people using that actually care about the user's privacy? Most of them don't, sell their home IP to buyers, sell their DNS history to others, etc. Worse, some of them could require invasive MITM cert stuff most users will just click yes through.
I have yet to see a use case for VPNs for the casual internet audience, and a tech-savvy user is better off renting through some datacenter or something, which at that point is hardly a VPN and more home-IP obfuscation. All the same downsides, and at least you get real privacy.
traceroute66 2 days ago [-]
> Which VPNs are people using that actually care about the user's privacy?
Mullvad.
It has been proven in a court of law that when Mullvad says "no logging", they mean it.
They also regularly have security audits and publish the results[2][3]
Seconding Mullvad. I am paranoid and I think they're trustworthy
evilduck 2 days ago [-]
Using any popular datacenter's IP range for a personal VPN is likely to be outright blocked.
Imustaskforhelp 2 days ago [-]
Also you only get 1 IP so its not really anonymous and you definitely would have a fingerprint.
thisisnow 2 days ago [-]
you just rotate it?
lxgr 2 days ago [-]
I'm forced to use a VPN to occasionally check my US bank account, since a foreign IP address is obviously a harbinger of unspeakable evil (while the friendly Youtube advertised neighborhood VPN is obviously evidence of pure intentions).
gruez 2 days ago [-]
>Most of them don't, sell their home IP to buyers, sell their DNS history to others, etc. Worse, some of them could require invasive MITM cert stuff most users will just click yes through.
Source? I haven't seen any evidence that the major paid VPN providers engage in any of those things. At best it's vague implications something shady is happening because one of the key people was previously at [shady organization].
Imustaskforhelp 2 days ago [-]
ProtonVPN with bitcoin which you get from a monero swap is a good idea for complete privacy if you want port forwarding.
MullvadVPN is also another great one.
I have heard some good things about AirVPN, but I can absolutely attest for mullvad and to a degree ProtonVPN (Just with Proton, depending upon your threat model, do make the necessary precautions like buying with monero for example)
There are others, but mostly its the 2-3 that I trust.
lxgr 1 days ago [-]
How do you square "complete privacy" with the fact that you're authenticating to these VPNs with a persistent username or other credential and are then sending traffic through them, both from an IP address that might identify you, and to services that you authenticate against?
Best case, the VPN learns your residential IP and the names of every HTTPS host you connect to (if not your entire DNS traffic as well); worst case, they collude with any of the services you use (or some ad tracker they embed) and persistently deanonymize your account.
VPNs are structurally not great for privacy.
Gormo 9 hours ago [-]
> How do you square "complete privacy" with the fact that you're authenticating to these VPNs with a persistent username or other credential and are then sending traffic through them, both from an IP address that might identify you, and to services that you authenticate against?
IIRC, Mullvad allows anonymous accounts, allows payment in cash and via other methods that don't link PII to the transaction, and claims not to log inbound connections.
ymolodtsov 2 days ago [-]
Yes, using an incognito windows is more than enough to kick off their checks.
ehnto 2 days ago [-]
I recently had the insane experience of filling out 15 consecutive captchas, after, I had checked out and entered my payment information into the payment processor widget. I just wanted to submit the order. I was logged in to their website, and the bank even needed a one time code for payment. If the bank is pretty sure I am human then your ecomm site can figure it out surely.
At least outside the US, there's 3DS as an (admittedly often high friction) high quality cardholder verification method, but in the US, that's of course considered much too consumer-hostile, so "select 87 overpasses" it is.
amatecha 2 days ago [-]
A while back I was buying tickets for a gondola for a trip in Europe and the checkout process failed during payment because their site didn't load their analytics/tracking stuff with proper error-handling, so when my ad-blocker prevented the tracking stuff, their checkout process failed to handle my CC's 2-factor auth and the checkout would fail. Had to contact my CC company and work with the gondola company to tell them what they're doing wrong so they could fix their website code. Pretty sad to know whoever built their stuff actually shipped a checkout flow (for a VERY popular tourist destination) without testing with ad-blockers enabled.
lxgr 2 days ago [-]
To be fair, this sometimes seems on the ad blocker. I've definitely seen mine accidentally nuke part of the payment Javascript (or maybe the 3DS iframe?) because some substring of it matched some common ad URL, which is obviously unrecoverable for the site itself.
girvo 2 days ago [-]
Surprising really, because I'm a Firefox + Ublock Origin die hard and I never get Cloudflare captchas. Wonder what the difference is? I have CGNAT turned off, if that matters at all (probably not).
lxgr 2 days ago [-]
I could definitely imagine a public IPv4 with lots of good, logged-in Cloudflare traffic to act as a positive signal for their heuristics, possibly even overriding the Firefox penalty.
danielheath 2 days ago [-]
Maybe check your network isn't sending web traffic you're not aware of?
I'm running firefox and seeing the normal amount.
jychang 2 days ago [-]
Most people are on a CGNAT these days, drowning in captchas is the new normal. You’re at the mercy of one of your neighbors not hosting a botnet from their home computer.
perching_aix 2 days ago [-]
For better or for worse, CF's fingerprinting and traffic filtering is a lot more in-depth than just IP trend analysis. Kind of by necessity, exactly because of what you mention. So I'd think that's not as big a worry per se.
lxgr 2 days ago [-]
Yet here I am drowning in captchas every once in a while, so it's quite a big worry for me.
Maybe I just have to disable all ad blockers and Safari tracking prevention? Or I guess I could send a link to a scan of my photo ID in a custom request header like X-Please-Cloudflare-May-I-Use-Your-Open-Web?
perching_aix 2 days ago [-]
> Yet here I am drowning in captchas every once in a while, so it's quite a big worry for me.
I think I was sufficiently clear that I was specifically talking about CGNAT-caused IP address tainting being an unreasonably emphasized worry, not the worry about their detections overall misfiring. Though I certainly don't hear much about people having issues with it (but then anecdotes are anecdotal).
> Or I guess I could send a link to a scan of my photo ID in a custom request header like X-Please-Cloudflare-May-I-Use-Your-Open-Web?
Sounds good, have you tried?
Not sure what's the point of these comically asinine rhetoricals.
tokioyoyo 2 days ago [-]
Not even remotely true, I genuinely have no idea what you're talking about. The only time I get captcha'ed is when I sometimes VPN around, or do some custom browser stuff and etc. I'll even say I get captcha'ed less now than maybe 5 years ago.
jychang 2 days ago [-]
Just wait until your ISP puts you behind a CGNAT.
Or if you ever need to travel a lot and tether off your phone. Most mobile devices are IPV6 only (via 464XLAT) behind a CGNAT these days.
tokioyoyo 11 hours ago [-]
Again, no clue what you’re talking about. The only time I had to deal with shit was when I was travelling a bit sketchy countries. I get that “Cloudfare is verifying your connection” loading screen from time to time, but there’s no captchas involved.
Super majority of people don’t use VPNs, or rare browsers, or avoid fingerprinting and etc. When you browse like regular you don’t notice the friction. That’s the selling point of companies like CF, because website owners don’t want to lose real traffic.
cogman10 2 days ago [-]
Every so often, usually after a Firefox update, CF will get into an "I'm convinced you're a bot" mode with me. I can get out of it by solving 20 CAPTCHAs.
hansvm 2 days ago [-]
It's probably just a higher rate of autonomous vehicles needing stop signs and buses identified at that moment, and cognitive bias causes you to only remember when that happens when you recently performed an update. /s
gruez 2 days ago [-]
>It's probably just a higher rate of autonomous vehicles needing stop signs and buses identified at that moment
I can't tell whether you're serious but in case you are, this theory immediately falls apart when you realize waymo operates at night but there aren't any night photos.
hansvm 2 days ago [-]
Thanks for the comment. Lack of seriousness is now appropriately indicated.
cogman10 2 days ago [-]
My assumption is that CF has something like a SVM that it's feeding a bunch of datapoints into for bot detection. Go over some threshold and you end up in the CAPTCHA jail.
I'm certain the User-Agent is part of it. I know that for certain because a very reliable way I can trigger the CF stuff is this plugin with the wrong browser selected [1].
I don't, and I rarely have issues with Firefox. Private mode + blockers + VPN causes expected issues, but otherwise I'm usually fine?
onion2k 2 days ago [-]
Is that because botnets spoof being Firefox? It's not really fair to blame Cloudflare if it is. That's on the bots.
doctaj 2 days ago [-]
In what way would that not be fair? Their product giving false positives (unnecessary challenges for a normal browser humans commonly use) to real people is definitely their fault.
eks391 2 days ago [-]
That sounds like it is working as intended, not a false positive. A false positive would mean it blocked you whereas a challenge means more information is needed. You aren't noticing all of the times it correctly decides you are human, only the times when it needs to "inconvenience" you for more information because you prioritize privacy, a key similarity with some bots.
I also like privacy. I use GrapheneOS. I compartmentalize my credit cards, emails, and phone numbers. I don't use Google products, and the list continues, but I don't complain about Cloudflare because it is painless and I understand the price I pay for privacy.
I also have home services accessible via my home website, running on my home server(s). I chose to have cloudflare to host my domain specifically for the easy bot blocking, and it blocks more than 2000 bots/day that otherwise would be trying to find vulnerabilities on my servers, which contain a lot of sensitive things. I've never had an issue personally accessing my services through cloudflare. Sometimes I have to do captchas to access my own things, and that's barely an inconvenience (I am aware the domain isn't necessary to access services, but it makes more sense for my setup and intents)
gruez 2 days ago [-]
>Their product giving false positives (unnecessary challenges for a normal browser humans commonly use) to real people is definitely their fault.
Is it TSA's "fault" that non-terrorists are subject to screening?
lxgr 2 days ago [-]
No, but it's entirely within TSA's hands to make that process as frictionless as possible.
(It's a different question whether zero friction is actually desired, or whether some security theater is actually part of the service being provided, but that's a different question.)
forkerenok 2 days ago [-]
We're discussing the quality of screening here, not the act/necessity of screening itself.
gruez 2 days ago [-]
>We're discussing the quality of screening here
The "quality" of TSA's screening seems to be pretty bad too given how many people have to go through secondary screening vs how many terrorists they catch (0?)
bdangubic 2 days ago [-]
they caught 11 million by now (just as arbitrary as your 0 but probably more accurate since we haven’t had a large terrorist attack since they got the gig to serve and protect and before we lost thousands of lives…)
gruez 2 days ago [-]
>they caught 11 million by now (just as arbitrary as your 0 but probably more accurate
Nice try but I used "caught", not "stopped", which requires they actually apprehended someone, not just prevented some hypothetical attack.
>since they got the gig to serve and protect and before we lost thousands of lives…)
You could easily reuse this argument for cloudflare: "if it wasn't for such invasive browser fingerprinting openai would be drowning in bajillion req/s from bots."
bdangubic 2 days ago [-]
> “if it wasn't for such invasive browser fingerprinting openai would be drowning in bajillion req/s from bots."
of course they would be drowning! I have no issues with what CF is doing. too funny that people use tools like chatgpt and expect privacy?!
DonHopkins 2 days ago [-]
They are failing to meet their quotas of shooting innocent people in the face, so ICE is helping out.
lxgr 2 days ago [-]
No, using a stupid authentication/verification method with lots of false positives is always on whoever deploys it.
Imagine an apartment building with a flimsy front door lock that breaks all the time, and the landlord only telling you that that can't be helped because of all the burglars.
josephcsible 2 days ago [-]
If it's just as easy to spoof being Chrome as it is to spoof being Firefox, then it is indeed fair to blame Cloudflare if they give Firefox users more CAPTCHAs than Chrome users.
conradkay 2 days ago [-]
Not really, there's camoufox but the vast majority use modified chrome/chromium
binaryturtle 2 days ago [-]
I'm on a slightly older Firefox and can't use many websites at all anymore because of the Cloudflare cancer.
Of course then you've got sites like gnu.org too that block you because of your slightly outdated user agent.
mghackerlady 1 days ago [-]
I... Don't think it does that? It shouldn't, anyway. How long has that been a thing? They've been hit pretty hard by the slop crew lately but I couldn't imagine it being so bad they require an up to date UA
binaryturtle 21 hours ago [-]
It's been going on for quite a while. Want to update some GNU software, or look something up? I have to switch the user agent to "curl" to be able to visit the sites.
mghackerlady 1 days ago [-]
Heaven forbid you not use JavaScript, then they can't <s>track you</s> keep the internet safe!
geysersam 2 days ago [-]
I use firefox daily and I don't encounter the problems you describe; might be worth checking whether there's some other issue.
lm411 2 days ago [-]
That's not Cloudflare trying to make your life hard.
It's the reality of how bad the bots have become.
dawnerd 2 days ago [-]
I’ve been getting it in safari too. It’s ridiculous frankly. My residential ip must have been flagged or something. The part that’s really annoying is it’s trivial for bots to bypass.
lxgr 2 days ago [-]
> I’ve been getting it in safari too.
I'm getting it on iCloud Private Relay all the time. It honestly makes it kind of useless.
Maybe that's the point? But then again, doesn't Cloudflare run part of it!? And wasn't there some "privacy-preserving captcha replacement" that iOS devices should already be opting me in to? So many questions, nobody there to answer them, because they can get away with it.
> The part that’s really annoying is it’s trivial for bots to bypass.
Not the ethical bots, though! My GPT-backed Openclaw staunchly refuses to go anywhere near an "I'm not a robot" button.
gzread 2 days ago [-]
Cloudflare makes money on both sides. It makes money from Apple to run Private Relay and it makes money from website operators to block Private Relay. It hosts the websites of DDoS services and protects them from DDoS, too.
segmondy 2 days ago [-]
try using firefox and then using a cellphone network for internet. sometimes i can't access a site, because i get infinite captchas. i know what a damn bus, stairwell, stop light or motorcycle looks like.
lazycouchpotato 2 days ago [-]
At times I'm completely locked out of a website and Cloudflare asks me to email the website owner to get the issue resolved.
.. how do they expect me to find the website owner's email if I can't access said website?
wongarsu 1 days ago [-]
Once upon a time we had whois lookup for exactly that use case (finding a domain's owner without visiting the site). Of course now nearly everyone has meaningless entries from some domain privacy service.
tshaddox 2 days ago [-]
Is anyone talking about the fact that this is a fundamental design flaw of the web? Or arguably even the entire Internet?
3form 2 days ago [-]
It's hard to call something a "fundamental flaw of the web" if it wasn't an issue for 30 years. Unless you mean something more general that I'm missing.
tshaddox 2 days ago [-]
Arguably it didn’t see widespread commercial adoption for 30 years, and you wouldn’t expect fundamental design flaws regarding commercial incentives to manifest before that.
fastball 2 days ago [-]
Cloudflare isn't providing Turnstile as a service in a vacuum, this is a direct response to bad actors who can trivially abuse the web.
pixl97 2 days ago [-]
A flaw can be fundamental but not immediate. It's probably better to say it's a fundamental flaw of the open web, that is the system collapses as the number of bad actors increases, and there is no way to prevent bad actors and have the system keep the name as open web.
lukewarm707 1 days ago [-]
sometimes when there is mafia you get no option but pay pizzo
hence i am just using cloudflare remote browser rendering.
amatecha 2 days ago [-]
These days I just close sites that show that "checking if you're a bot" shit. If this is how the web is going to be now, I don't care, I'll just not use it. I didn't need to see that article or post that badly anyways. I'm tired of paying the price for the sociopathic, greedy actions of others. It's especially bad for anyone who uses an open source OS like Linux or *BSD (to the extent many sites just block me automatically with a 403 Forbidden simply for using OpenBSD + Firefox, completely free pass if I try the same site from a Windows or Linux computer).
jgalt212 2 days ago [-]
We use Cloudflare to protect our content, but at the same time our machines mostly run Linux / Firefox so it really is quite a frustrating relationship. It really bums me out how much of Turnstile boils down to these two questions:
is it Linux (or similar)?
is it Firefox?
If yes to one or both, you're blocked! Clearly millions of dollars of engineering talent and petabytes of data collection should be able to come up with something more nuanced than this.
dheera 2 days ago [-]
Exactly. For the most part all this bot protection is only protecting these websites against humans.
I don't do free work. I'm not going to label 50 images of crosswalks and motorcycles for free.
ronbenton 2 days ago [-]
> For the most part all this bot protection is only protecting these websites against humans.
Curious how do you know this?
TiredOfLife 2 days ago [-]
[flagged]
EGreg 2 days ago [-]
Well, that's for the public internet.
I'm building Safebox and Safecloud, where this won't be the case anymore. Not only will you have a decentralized hosting network that can sideload resources (e.g. via a browser extension that looks at your "integrity" attribute on websites) but also the websites will require you to be logged in with a HMAC-signed session ID (which means they don't need to do any I/O to reject your requests, and can do so quickly)... so the whole thing comes down to having a logged in account.
As far as server-to-server requests, they'll be coming from a growing network of cryptographically attested TPMs (Nitro in AWS, also available in GCP, IBM, Azure, Oracle etc.) so they'll just reject based on attestations also.
In short... the cryptographically attested web of trust will mean you won't need cloudflare. What you will need, however, to prevent sybil attacks, is age verification of accounts (e.g. Telegram ID is a proxy for that if you use Telegram for authentication).
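The stateless-rejection claim is sound: an HMAC over the session ID lets a server reject bad tokens with pure computation, no database lookup. A minimal sketch (the key and token format here are hypothetical, not Safebox's actual design):

```python
import hmac
import hashlib

SECRET = b"server-side-secret"  # hypothetical; store and rotate securely in practice

def issue_session(user_id: str) -> str:
    """Sign the user id at login so the token is later checkable without I/O."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"

def verify_session(token: str) -> bool:
    """Recompute the MAC and compare in constant time; no DB hit needed."""
    try:
        user_id, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Forged or tampered tokens fail the MAC check immediately, which is what makes the rejection path cheap.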
password4321 2 days ago [-]
Wow, if Seinfeld can have a soup nazi, I think it's within reason for you to be called the internet nazi.
"No s̶o̶u̶p̶ internet for you!"
Good luck!
ale42 2 days ago [-]
This was sarcasm, right?
EGreg 1 days ago [-]
Why would you assume it needs to be? You don’t think that websites on the Internet might not want to allow random bots and scrapers to waste their resources, and require people to have an account in order to access non-static resources on the website? You do realize that API keys exist, right?
simonw 2 days ago [-]
Presumably this is all because OpenAI offers free ChatGPT to logged-out users and doesn't want that being abused as a free API endpoint.
NotPractical 2 days ago [-]
But do they do it whether you're logged in or not?
I noticed the ChatGPT app also checks Play Integrity on Android (because GrapheneOS snitches on apps when they do this), probably for the same reason. Claude's app doesn't, by the way, but it also requires a login.
Gander5739 2 days ago [-]
Because accounts are free, and could still be abused as a free endpoint with a little trickiness.
gzread 2 days ago [-]
Don't you need a Google account and to get a Google account you need a phone number?
Gander5739 1 days ago [-]
You don't need a phone number to create a Google account. (Though the account creation flow is inconsistent in this; in some situations it will require a phone number, in some it won't.)
appreciatorBus 2 days ago [-]
Yup.
Coincidentally about an hour ago, I wanted to look something up in ChatGPT and I happened to be in a browser window I don’t normally use, with no logged in accounts. I assumed it wouldn’t work, but to my surprise with no account, no cookies of any kind it took my query and gave me an answer.
gruez 2 days ago [-]
>I assumed it wouldn’t work, but to my surprise with no account, no cookies of any kind it took my query and gave me an answer.
They allowed anonymous requests for months now, maybe even a year.
solaire_oa 2 days ago [-]
Yeah, additionally gemini.google.com is also free unauthenticated, which I've been using for a very long time (a year?). Why this is being treated as news is confusing.
iberator 2 days ago [-]
Microsoft and Gemini can be used without an account. Just works! (talking about the web app)
aziaziazi 2 days ago [-]
I used to mostly use ChatGPT in an incognito tab, logged out. Until I noticed it seemed to have some context of my logged-in session, and of the logged-out one as well. It may be paranoia or prompt deduction, but that felt strange.
FergusArgyll 2 days ago [-]
Yeah it works but it's a dumber model. Prob mini
lelandfe 1 days ago [-]
You get a couple requests in at a smarter model and then it prompts you to sign up, and from there uses an extremely dumb model.
vscode-rest 2 days ago [-]
[dead]
bredren 2 days ago [-]
It is also intended to protect the usage patterns of pro subscribers.
As has been amply explained, API pricing per token works out to far more than equivalent usage under a maxed-out subscription plan.
It isn’t really a massive hurdle to deal with this full SPA load check. If one is even aware it exists they already have the skills to bypass it anyway.
I get why people would "what about" the automation inherent in what OpenAI is doing, but that is a separate matter.
Other businesses and applications can put into place their own hurdles and anti bot practices to protect the models they’ve leaned into—-and they have been.
darepublic 2 days ago [-]
Using 5.2 at $20 a month would also be a steal. The other shoe will drop on Codex sooner or later.
thisisnow 2 days ago [-]
It's probably the same for copilot.microsoft.com and their cloudfart usage.
petcat 2 days ago [-]
> These properties only exist if the ChatGPT React application has fully rendered and hydrated. A headless browser that loads the HTML but doesn't execute the JavaScript bundle won't have them. A bot framework that stubs out browser APIs but doesn't actually run React won't have them.
> This is bot detection at the application layer, not the browser layer.
I kind of just assumed that all sophisticated bot-detectors and adblock-detectors do this? Is there something revealing about the finding that ChatGPT/CloudFlare's bot detector triggers on "javascript didn't execute"?
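The mechanism reduces to presence/absence checks on state that only the fully hydrated app creates. A toy illustration of that logic (the marker names below are made up, not the actual properties ChatGPT probes):

```python
# Hypothetical hydration markers a check might probe for on the page.
# The real check inspects properties that only exist once the React
# app has rendered and hydrated.
HYDRATION_MARKERS = {"__reactRoot", "appStoreReady", "composerMounted"}

def looks_hydrated(observed_props: set) -> bool:
    """A fetch that never executed the JS bundle exposes none of these;
    a stubbed browser environment that skips React misses them too."""
    return HYDRATION_MARKERS.issubset(observed_props)
```

The interesting part is the inversion: the markers aren't secrets, they're just expensive to produce without actually running the application.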
iancarroll 2 days ago [-]
It’s pretty interesting to me that Cloudflare is collecting additional client-side data for individual customers. This is not widely done by most anti-bot solutions.
supriyo-biswas 2 days ago [-]
OpenAI is on an enterprise plan and (presumably) gets a customized version of Turnstile.
red_admiral 1 days ago [-]
"Sophisticated" may vary, but for a lot of EU media products you can just block the script that launches the paywall/consent overlay. Sometimes disabling JS does it; sometimes activating reading mode works.
Chance-Device 2 days ago [-]
Perhaps the author should have made it clearer why we should care about any of this. OpenAI want you to use their real react app. That’s… ok? I skimmed the article looking for the punchline and there doesn’t seem to be one.
raincole 2 days ago [-]
Why does every article need a 'punchline'? It's a technical analysis. Do you expect punchlines when you read recipes or source code?
Chance-Device 1 days ago [-]
Where did I say “every article”? This is AI slop that’s set up like it’s some investigative expose of something scandalous and then shows us nothing interesting. A competent human writer would have reframed the whole thing or just not published it.
raincole 1 days ago [-]
Do you think
1. Every person is born with the knowledge of how ChatGPT uses Cloudflare Turnstile?
2. This article contains factual mistakes? If so, what are they?
If neither of these is true, then this article strictly provides information and educational value for some readers. The writing style, AI-like or not, doesn't change that.
Chance-Device 1 days ago [-]
Do you think I have some obligation to agree with you or something? You love the article, nice, good for you. I think it’s crap.
bogdan 1 days ago [-]
Whilst you and a few other commentators call this AI slop and refuse to engage with it, the rest of us have read something interesting and learned something new. Is anything gained if one points out that it's written by AI? I personally know it's written by AI but the value outweighs the stylistic idiosyncrasies.
Consider also that many people aren't the best at writing blog-like posts but still have things to share and AI empowers them to do that. I can't find anything constructive in your post and I don't understand why you are posting at all.
Chance-Device 1 days ago [-]
What’s not constructive about it, Bogdan? I’ve said exactly what I think is wrong with the article, the framing is AI pattern matching to something that it isn’t. It’s a weird kind of incongruent clickbait, it’s not positioning itself as a piece about cloudflare or turnstile, it’s implicitly saying “look at this sneaky thing OpenAI are doing that I uncovered!” and it turns out they’re not doing much of anything at all.
This may be unintentional and the author simply couldn’t tell it sounded this way. The less charitable interpretation is that they did know it sounded this way and thought that a straightforward blog post about cloudflare bot detection wouldn’t end up on the HN front page.
What’s my constructive criticism to the author? Write your own posts. Use your own voice. Make sure that what you’re creating actually reads like the kind of thing it is. Don’t get the AI to write it for you. It’s annoying.
And I would say that if someone is really so bad at writing blogs that they are unable to do this, which I am not saying this author is, then maybe they shouldn’t be writing them.
nickelpro 1 days ago [-]
The intended value is difficult to discern in AI written pieces.
I agree with both of you, there's some interesting tricks here for how a website builds anti-bot protection, but the AI sloppification is framing it as a consumer protection issue but not delivering on that premise.
It is a reasonable criticism that the post does not deliver a "so what?" on its basic framing.
dmos62 2 days ago [-]
For me the interesting parts of the article is how author got to the decompiled checks and what the checks are. Anti-bot is an interesting space.
elwebmaster 2 days ago [-]
That's because the article is AI slop.
londons_explore 2 days ago [-]
I just don't understand why bot owners can't run a complete Windows 11 VM running Google Chrome, complete with graphics acceleration.
You can probably run 50 of those simultaneously if you use memory page deduplication, and with a decent CPU+GPU you ought to be able to render 50 pages a second. That's 1 cent per thousand page loads on AWS. Damn cheap.
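The arithmetic works under the stated assumptions; here is the same math spelled out, with a hypothetical instance price chosen to make the numbers land (not an actual AWS quote):

```python
# Numbers from the comment: ~50 VMs rendering ~50 pages/second in aggregate.
pages_per_second = 50
instance_cost_per_hour = 1.80  # assumed hourly price of a large CPU+GPU host

pages_per_hour = pages_per_second * 3600           # 180,000 pages/hour
cost_per_page = instance_cost_per_hour / pages_per_hour
cost_per_thousand = cost_per_page * 1000

print(f"${cost_per_thousand:.3f} per thousand page loads")  # → $0.010 per thousand page loads
```

So "1 cent per thousand page loads" holds if a suitable host really costs around $1.80/hour; a pricier instance scales the figure linearly.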
jaccola 2 days ago [-]
There are myriad providers competing to offer this, nicely packaged with all the accoutrements (IP rotation, location spoofing, language settings, prebuilt parsers, etc.) behind an easy to use API.
Honestly it is a very healthy competitive market with reasonably low switching costs which drives prices down. These circumstances make rolling your own a tough sell.
arcfour 2 days ago [-]
They do, but the fact that they have to do this means there are fewer bots because it's less economical to go to such lengths, compared to something much less complex (which is orders of magnitude cheaper).
huertouisj 2 days ago [-]
There are scraping subreddits.
If you browse them you will see that bot writers are very annoyed if they can't scrape a site with a headless browser.
You can do what you suggested, but with Linux VMs/containers. Windows is too heavy; each VM will cost you 4 GB of RAM.
londons_explore 2 days ago [-]
The reason to use Windows is that anti-bot tech is going to be a lot stricter if Linux is detected...
xmcp123 2 days ago [-]
I’m in those. xvfb and headless=false still works great
poly2it 2 days ago [-]
If you know of a simple way to run a Windows 11 VM with good graphics acceleration (no GPU passthrough), please contact me.
MarioMan 2 days ago [-]
I assume your concern with GPU passthrough is that each VM needs a whole GPU?
You can use GPU-PV to split your GPU between VM instances.
Then the main bottleneck becomes how thin you split out your VRAM.
Wouldn't VirtualBox's or VMware's paravirtual GPUs be a better fit for this use case? Unfortunately the offerings with qemu/libvirt still lag VMware's by a lot.
himata4113 2 days ago [-]
284 on 296gb of ram with deduplication enabled on a 128c with 32Q vgpu.
YetAnotherNick 1 days ago [-]
I am reasonably sure these kinds of fingerprints can detect whether the browser is inside a VM.
kristjansson 1 days ago [-]
… yup?
I mean you missed the minigame of preventing Chrome from signaling that it's being programmatically driven (webdriver etc.) and tipping your hand, but... yup?
hrmtst93837 2 days ago [-]
In theory you could run hundreds of full-fat Chrome bots if you don't care about the ops mess, but keeping Windows images stable while Cloudflare and friends keep changing the fingerprinting game turns the cheap math into a maintenance job from hell. AWS VM signals are a big red flag, so you still eat CAPTCHAs and blocks even with a full browser stack. The page-load number only looks cheap on paper.
i18nagentai 1 days ago [-]
The irony of a company that sells DDoS protection making the browsing experience worse for legitimate users. The real issue is that Cloudflare's bot detection runs JavaScript that introspects the page state — which means any site using Cloudflare is implicitly giving Cloudflare access to read the DOM of the protected application. That's a much bigger concern than the typing delay.
technion 2 days ago [-]
To prompt a discussion that's purely technical: I'm interested in how this was done.
Specifically, Turnstile as far as I'm aware doesn't do anything specifically configurable or site specific. It works on sites that don't run React, and the cookie OpenAI-Sentinel-Turnstile-Token is not a CF cookie.
Did OpenAI somehow do something on their own API that uses data from Turnstile?
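For context, the documented half of Turnstile is the server-side token check: the widget yields a token on the page, and the site's backend posts it to Cloudflare's siteverify endpoint. Whatever custom state validation OpenAI layers on top, the baseline exchange looks roughly like this (a sketch; the secret is a placeholder and error handling is omitted):

```python
import json
import urllib.parse
import urllib.request

# Cloudflare's documented Turnstile verification endpoint.
SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"

def build_siteverify_request(secret, token, remote_ip=None):
    """Construct the form-encoded POST Cloudflare expects; returns (url, body)."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip
    return SITEVERIFY_URL, urllib.parse.urlencode(fields).encode()

def verify_turnstile_token(secret, token):
    """Send the token to Cloudflare and check the 'success' flag in the JSON reply."""
    url, body = build_siteverify_request(secret, token)
    with urllib.request.urlopen(urllib.request.Request(url, data=body)) as resp:
        return json.load(resp).get("success", False)
```

The OpenAI-Sentinel-Turnstile-Token cookie being non-standard suggests their backend consumes the widget token itself and mints its own derived credential, but that part is speculation.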
XYen0n 2 days ago [-]
Cloudflare should be able to determine whether a website uses React by analyzing data flowing through its CDN.
technion 2 days ago [-]
Whilst true, "validate the right state is loaded" would surely be something not done without developer input.
kristjansson 1 days ago [-]
If your CF bill reached into 8 figures, you might ask them to accept some developer input?
ripbozo 2 days ago [-]
and chatgpt was then used to write this article. at least try to clean it up a bit
hx8 2 days ago [-]
Ah yes, the timeless hallmark of web blogs: a draft so messy even a language model would ask for a second pass.
tommodev 2 days ago [-]
Ah, this explains ChatGPT (and probably Copilot) performance behind corporate firewalls such as Zscaler.
Between the network latency and low-end machines, there is an enormous lag between ChatGPT's response and being able to reply, especially when editing a canvas.
I've been sitting there for up to a minute plus waiting to be able to use the canvas controls or highlight text after an update.
refulgentis 2 days ago [-]
If you have AI write a blog post for ya, when you think it's set, check word count (can c+p to google docs if AI can't pull it off with built in tools), and ask it to identify repetitions if it's over 1000.
Also, you can have it spot-check colors: light orange on a light background is unreadable. Ask it to find the L*[1] of colors and darken/lighten as necessary if the gap is < 40 (that's the minimum gap for yuge header text on a background, 50 for body text on a background; these have a gap of 25).
I haven't tried this yet, but maybe have it count word-count-per-header too. It's got 11 headers for 1000 words currently, which makes reading feel really staccato, and you've got to evaluate "is this a real transition or a vibetransition".
[1] L* as in L*a*b*, not L in Oklab
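For anyone who wants to automate that check, L* is computable directly from sRGB, since CIE lightness depends only on relative luminance. A self-contained sketch of the standard formulas (sRGB linearization, then the CIE lightness function):

```python
def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve (c in 0..1)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def lightness_lstar(r: int, g: int, b: int) -> float:
    """CIE L* (0..100) of an sRGB color given 8-bit channels."""
    y = (0.2126 * srgb_to_linear(r / 255)
         + 0.7152 * srgb_to_linear(g / 255)
         + 0.0722 * srgb_to_linear(b / 255))
    # CIE lightness function, with the linear segment near black.
    eps = (6 / 29) ** 3
    f = y ** (1 / 3) if y > eps else y / (3 * (6 / 29) ** 2) + 4 / 29
    return 116 * f - 16

def lstar_gap(fg, bg) -> float:
    """The gap heuristic: L* difference between text and background."""
    return abs(lightness_lstar(*fg) - lightness_lstar(*bg))
```

With this, the "gap < 40" rule becomes `lstar_gap(text_rgb, bg_rgb) < 40`.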
bredren 2 days ago [-]
On a related note, ChatGPT.com changed how it handles large text pastes this past week.
It now behaves like Claude, attaching the paste as a file for upload rather than inlining it.
This changed the page UX somewhat and reduces the browser tab's resource cost.
At some point, maybe still true, very long conversations ~froze/crashed ChatGPT pages.
NSPG911 2 days ago [-]
I was using KeepChatGPT[1] for a while back in 2023-2024, pre-Gemini-in-Google era, and I was fascinated by how it was able to mask being a user without needing any API or help from the end user. I stopped using it after 2024 because 1) Gemini and 2) it breaks quite a lot. I did, however, like how you had an option to push the AI panel to the right; if only Google would even consider doing so.
I have a little helper app I run sometimes with a button that pushes a query into ChatGPT and gets a JSON response. You wouldn't even know OpenAI had any anti-bot tools because it doesn't get flagged at all. It just uses a webview inside WinForms.
natdempk 2 days ago [-]
Does anyone know how this is integrated on the Cloudflare side and across the app? Is this beyond standard turnstile? Is this custom/enterprise functionality? Something else?
croemer 1 days ago [-]
When using ChatGPT Android app with some NextDNS block lists, I get an error modal in app saying "security misconfiguration blah blah".
Clearly I'm blocking some tracker and it's upset about that. I allowlisted a sentry subdomain and since then got no more complaints.
tosh 2 days ago [-]
It used to be possible to type immediately while the page is loading and have all key presses end up in the input field.
Why run this check before the user can type?
Why not run it later like before the message gets sent to the server?
balkanist 1 days ago [-]
[flagged]
TimLeland 1 days ago [-]
It seems they fixed the biggest issue I've had, where you start typing and then it erases the content once the page fully loads.
tripdout 2 days ago [-]
AI-written article?
avazhi 2 days ago [-]
Yep. I flag these as spam at this point.
CorneredCoroner 2 days ago [-]
> A headless browser that loads the HTML but doesn't execute the JavaScript bundle won't have them.
This is meaningless, btw. A browser, headless or not, does execute JavaScript.
jaccola 2 days ago [-]
I disagree; a browser can have JavaScript execution disabled (and this is somewhat common in scraping to save time/resources).
I read it to mean: "A browser that doesn't execute the JavaScript bundle won't have [the rendered React elements]." Which is true.
maxwellg 2 days ago [-]
Wouldn't a browser that doesn't execute JS also not execute the browser fingerprinting code in the first place?
XYen0n 2 days ago [-]
If JavaScript is disabled, why use a headless browser instead of making HTTP requests directly?
girvo 2 days ago [-]
A bunch of the points in this AI generated blog post were like that. Makes me feel dirty when I'm 1/3rd of the way through and I realise how off it is.
thisisnow 2 days ago [-]
Hah, sure, you just let random JS execute from random sites on your machine...
dsparkman 1 days ago [-]
That explains why ChatGPT has been running like shit all weekend. In the desktop app on Mac, it could not even complete a response. On the web, it would hang before you could input anything.
EGreg 2 days ago [-]
Why does ChatGPT slow down so much when the conversations get long, while Claude does compaction?
My best guess is -- ChatGPT is running something in your browser to try to determine the best things to send down to the model API –- when it should have been running quantized models on its own server.
jtbayly 2 days ago [-]
Others here are asking if this is the cause of slow performance in a long chat.
But it seems clear to me that this is why I can't start typing right away when I first load the page and click to focus in the text field.
themafia 2 days ago [-]
My theory is that "AI" doesn't really have any long term paying customers and the majority of the "users" are people who have cooked up some clever hack to effectively siphon computing power from these providers in an effort to crank out the lowest effort ad supported slop imaginable.
Every provider seems to have been plagued by these freeloaders to such an extent that they've had to develop extreme and onerous countermeasures just to avoid losing their shirts.
What's the word? Schadenfreude?
darepublic 2 days ago [-]
I imagine it's to stop web automation from getting free, API-like use of the model.
edg5000 1 days ago [-]
The chat client has serious performance issues on lower end systems. Now I see why!
tom-blk 8 hours ago [-]
Wild insight
pautasso 1 days ago [-]
AI goes to great lengths to ensure it's talking with humans.
Why would two AI bots want to chat with each other?
balkanist 1 days ago [-]
[flagged]
heliumtera 2 days ago [-]
I am shocked OpenAI collects data about its users before users have the opportunity to send the same data to OpenAI's servers!
self-portrait 2 days ago [-]
A/B testing /dev/ kit that tokenizes four permutations of language
littlecranky67 12 hours ago [-]
How is this fingerprinting even GDPR compliant? Fingerprinting + profiling need consent, and the service must work without tracking+profiling consent.
seker18 1 days ago [-]
How can I access a cell phone?
AndreyK1984 2 days ago [-]
CamuFox will fix it easy peasy.
lightedman 1 days ago [-]
Preventing me from typing until you SCAN MY SYSTEM?
Fine, by extension, you agree I can scan all of your systems for whatever I desire. This works both ways.
apsurd 2 days ago [-]
Haven't read yet, but this instantly matched my experience of the chat being unusable at times. The latency and glitch-like feel is unbearable.
arcfour 2 days ago [-]
> They exist only if the request passed through Cloudflare's network. A bot making direct requests to the origin server or running behind a non-Cloudflare proxy will produce missing or inconsistent values.
...I don't think that's possible even if you are a bot? I would be very surprised if OAI had their origin exposed to the internet. What is a "non-Cloudflare proxy"? Is this AI slop?
It's likely just looking at the CF properties as part of a bot scoring metric (e.g. many users from this ASN or that geoip to this specific city exhibit abusive patterns).
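That scoring idea can be sketched as a simple per-ASN traffic-share signal. Everything below is illustrative (the field name and cutoff are invented), not how Cloudflare actually weights things:

```python
from collections import Counter

def asn_abuse_scores(requests, threshold=100):
    """Flag ASNs with disproportionate request volume.

    `requests` is a list of {'asn': int, ...} records from edge logs
    (the field name is hypothetical). Returns each heavy ASN's share
    of total traffic; a CAPTCHA might be served above some cutoff.
    """
    counts = Counter(r["asn"] for r in requests)
    total = len(requests)
    return {asn: n / total for asn, n in counts.items() if n >= threshold}
```

The point is that the CF request properties (ASN, geoip) are inputs to aggregate scoring, not per-request pass/fail checks.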
aslihana 2 days ago [-]
I mean, I can easily see them behaving defensively to avoid being abused. But I'm on an MBP with an M5 here, and my ChatGPT tab always gets stuck when I submit a prompt.
Really, really bad user experience; wondering when they will drop this approach.
aucisson_masque 1 days ago [-]
Mistral chat is also free to use without an account and doesn't do that.
j45 1 days ago [-]
This is a lot of fingerprinting.
tristor 1 days ago [-]
This explains some of the weird performance behavior I've seen in the last 24 hours with ChatGPT, sometimes lagging my entire browser while typing. Note, I'm a paying user with a Teams account, so it's kind of annoying that this is being applied to logged in paying users as well. I might have to vibe-code my own chat webUI using the APIs.
gobdovan 2 days ago [-]
Imagine if they'd put as much effort into making a decent frontend experience.
yapyap 2 days ago [-]
Wow, OpenAI sure doesn't like bots, for a company enabling the botification of the world wide web.
Cloudflare will not be around for long; it's a shame, as it is the GOAT lol
avazhi 2 days ago [-]
Another AI-slop article.
Sick.
AIOperator2026 15 hours ago [-]
[dead]
massi24 1 days ago [-]
[dead]
syntheticmind 1 days ago [-]
[flagged]
summitwebaudit 1 days ago [-]
[dead]
syntheticmind 1 days ago [-]
[flagged]
kevinbaiv 2 days ago [-]
[flagged]
lancetheai 1 days ago [-]
[dead]
aplomb1026 2 days ago [-]
[dead]
oluwajubelo1 2 days ago [-]
[dead]
mistM 2 days ago [-]
[dead]
kaluga 2 days ago [-]
[dead]
techpulse_x 2 days ago [-]
[dead]
marsven_422 2 days ago [-]
[dead]
56745742597 2 days ago [-]
[dead]
pencilcode 2 days ago [-]
AI-slop analysis finding that CF detects non-JavaScript-capable browsers, with no punchline
blinkbat 2 days ago [-]
Ok... so... ?
beering 2 days ago [-]
So are you able to get free inference now that you decrypted this?
superkuh 2 days ago [-]
It doesn't look like it in the full sense of "free". But part of how one pays for these services is by running a permissive modern browser which allows the corporation to spy on you even when you already paid in currency. In a sense, by depriving them of the ability to easily spy on you, this workaround is closer to "free".
gruez 2 days ago [-]
>My best guess is -- ChatGPT is running something in your browser to try to determine the best things to send down to the model API
There's no way this is worth it unless the models are absolutely tiny, in which case any benefits from offloading to the client is marginal and probably isn't worth the engineering effort.
danny_codes 2 days ago [-]
It’s free as a loss leader. The trick is to upsell later. Unfortunately for OpenAI there are plenty of competitors with fungible products, so it might be hard to pull a classic monopoly rug-pull.
beering 2 days ago [-]
They already see everything I’m doing because I send my prompts to them. What “workaround” are you referring to?
superkuh 2 days ago [-]
They see everything you're doing because you send the text. But this is about everything else about your computer system, which you would not normally be sending to them or involving at all. This workaround lets you avoid sending unneeded information about your computer setup; it is not about avoiding sending prompt text.
And as for "but chatgpt isn't paid" (another commenter), well, then yes, that's even closer to free by removing this spying on your computer setup. But they spy on the paid users too.
voxic11 2 days ago [-]
But isn't ChatGPT access free through the browser? What do you mean already paid in currency?
pocksuppet 1 days ago [-]
If you want to send more than a few prompts each day, you have to pay. With currency.
dgb23 1 days ago [-]
Why are companies like OpenAI and others that are all-in on LLMs still using ReactJS, Python and so on?
These programming languages and frameworks were made for developer convenience and got wide adoption, because it makes on-boarding easier.
This obviously comes at a cost of performance, complexity and introduces a liability into a system, because they are dependencies that come with a whole bunch of assumptions about how they are used.
Is this tradeoff even worth it anymore?
robmccoll 1 days ago [-]
Probably training data. The largest number of public repos are built on that stack. We recently picked React for new projects because LLMs seemed to be the most reliable when writing React code.
If you mean that you can infer client side tampering with the page contents you could still do that - permit typing but don't permit the submit action on the client. The user presses enter but nothing happens until the check is complete. There you go, now you can tell if the page was tampered with (not that it makes much difference tbh).
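The proposal here (accept typing immediately, but hold the submit action until the bot check resolves) can be sketched as a small state machine. Python is used purely for illustration; in a real page this logic would live in the client-side script, and all names here are invented for the sketch:

```python
class GatedComposer:
    """Accept keystrokes immediately; hold submits until the bot check finishes."""

    def __init__(self):
        self.buffer = []          # keystrokes are never blocked
        self.check_done = False   # set once the bot-detection script reports in
        self.pending_submit = False
        self.sent = []            # messages actually dispatched to the server

    def type(self, ch):
        # Typing is always allowed -- the user never sees a frozen input box.
        self.buffer.append(ch)

    def submit(self):
        # Only the submit action is gated on the check.
        if self.check_done:
            self._send()
        else:
            self.pending_submit = True  # queued, released when the check completes

    def complete_check(self):
        self.check_done = True
        if self.pending_submit:
            self.pending_submit = False
            self._send()

    def _send(self):
        self.sent.append("".join(self.buffer))
        self.buffer = []


composer = GatedComposer()
for ch in "hi":
    composer.type(ch)       # typing works before the check completes
composer.submit()           # queued, nothing sent yet
composer.complete_check()   # check resolves, queued submit is released
print(composer.sent)        # ['hi']
```

In the common case the check finishes before the user presses submit, so nothing is ever queued and the gating is invisible.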
(Separately, I don't think the article has adequately demonstrated this claim. They just make the claim in the title. The actual article only shows that some network request is made, and that the request happens after the React app is loaded, but not that they prevent input until it returns. Maybe it's obvious from using it, but they didn't demonstrate it.)
Blocking until load means that human interaction is physically impossible, so you are certain that any input before that is automated.
If you allow typing, this distinction vanishes
I don’t know whether ChatGPT is one of those products, but if it is, that behavior might be a side effect of blocking the input pipeline until verification completes. It might be that they want to get every single one of your keystrokes, but only after checking that you’re not a bot.
https://news.ycombinator.com/item?id=3913919
No one seems to use or care about their own product anymore. They only look at dashboards and metrics, which do not explain the full situation.
Somewhat off-topic, but Facebook/Instagram have also detected when a person uploads an image and then deletes it, inferred that they are insecure, and, in the case of TEENAGE girls, actually attached that to their profile (that they are insecure) and shown them beauty products....
I really like telling this example because people in real life, and even online, get so shocked; I mean, they know Facebook is bad, but they don't know it's this bad.
[Also a bit off-topic, but I really like how in item?id=3913919 the 391 comes twice :-) , it's a good item id]
It is difficult to get a man to understand something, when his salary depends on his not understanding it.
</sarcasm>
OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots
Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...
I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
I can imagine their models have been trained on a lot of websites before opt outs became a thing, and the models will probably incorporate that for forever.
But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).
It was a dataset of the entirety of the public internet from the very beginning that bypassed paywalls etc, there’s virtually nothing they haven’t scraped.
PRESS RELEASE: UNITED BURGLARS SOCIETY
The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.
Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website of each burglar in their area and opt out there. UBS is not responsible for unwanted burglaries due to failure to opt out.
Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.
Did you mean to use the word hypocrisy. If not, I'm happy to have said it.
I just want to note, that it is well covered how good the support is for actual malware...
> we want to keep free and logged-out access available for more users
I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.
Not that hard - ChatGPT itself wrote me a FF extension that opened a websocket to a localhost port, then ChatGPT wrote the Python program to listen on that websocket port, as well as on another port for commands.
Just a handful of commands implemented in the extension is enough for my bash scripts to open the tab to ChatGPT, target specific elements like the input, add some text to it, target the relevant chat button, click it, etc.
I've used it on other pages (mostly for test scripts that don't require me to install the whole jungle just to get a banana, as all the current Playwright-type products do). I'm too afraid to use it on ChatGPT, Gemini, Claude, etc., because if they detect that the browser is being driven by bash scripts they could terminate my account.
That's an especially high risk with Gemini - I have other Google accounts that I don't want to be disabled.
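The bridge described above can be sketched from the Python side. This is a heavily simplified version: a plain TCP line protocol stands in for the websocket, and the command set (`open_tab`, `set_input`, `click`) is invented for illustration, since the original extension's protocol isn't shown:

```python
import asyncio
import json

# Hypothetical command set; the real extension would perform these in the page.
HANDLERS = {
    "open_tab":  lambda args: f"opened {args['url']}",
    "set_input": lambda args: f"typed {len(args['text'])} chars",
    "click":     lambda args: f"clicked {args['selector']}",
}

async def handle_command(reader, writer):
    # One JSON command per line, e.g. {"cmd": "click", "selector": "#send"}
    msg = json.loads(await reader.readline())
    result = HANDLERS[msg.pop("cmd")](msg)
    writer.write((json.dumps({"ok": result}) + "\n").encode())
    await writer.drain()
    writer.close()

async def main():
    # Bind to an ephemeral port on localhost only.
    server = await asyncio.start_server(handle_command, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # Simulate the browser-extension side with a local client.
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b'{"cmd": "click", "selector": "#send"}\n')
        await writer.drain()
        reply = json.loads(await reader.readline())
        writer.close()
        return reply

reply = asyncio.run(main())
print(reply)  # {'ok': 'clicked #send'}
```

The real thing would use a websocket library on the Python side and `WebSocket` in the extension's content script, but the dispatch shape is the same.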
Morally I don't see any issues with it really.
The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.
It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.
I bet people being fucking DDOSed by AI bots disagree
Also the fucking ignorance of assuming it's "static content" and not something that needs code running
This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.
Are there downsides to this? Sure, but imo AI is useful.
Just prompt it.
You made a specious dismissal. Now you're making personal attacks. Perhaps it's actually you who is having difficulty reasoning properly here?
https://sampleoffline.com/
It's still up in all its glory.
Exactly. I think the unfairness can be mitigated if models trained on public information, or on data generated by a model trained on public information, or has any of those two in its ancestry, must be made public.
Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.
The point is, if you're pleading with others to respect ""intellectual property"" then you're a worm serving corporate interests against your own.
Are you sure it's a DDoS and not just a DoS?
We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.
It's a DDoS.
I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping
DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.
Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills of hundreds or thousands of dollars more than the hobbyist operator was expecting to send.
Wild eh.
If it's not ai now, it's by default labelled "static content" and "near-zero marginal cost".
Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.
This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.
If such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.
The vast majority of websites handle AI traffic fine though, either because they don't have expensive to serve resources, or because they properly protect such resources from abuse.
If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.
They are common. The strategy works for the LLM company, but not for the website owner or the users who can't use the site during such an attack.
The majority of sites are not handling AI traffic fine. Getting DDoSed only part of the time is not acceptable. Countermeasures like blocking huge IP ranges can help, but they also lock out legitimate users.
Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?
It is a cost for me when an LLM scrapes me.
Why should I care about the costs they have when they don't care about the costs I have?
The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.
And how much of this is users who are tired of walled gardens and enshitfication. We murdered RSS, API's and the "open web" in the name of profit, and lock in.
There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.
Genuinely interested.
OpenAI et al seem to mostly be well-behaved.
You imply that "an expensive llm service" is harmed by abuse, but, every other service is not? Because their websites are "static" and "near-zero marginal cost"?
You have no clue what you are talking about.
Github pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.
The troubles start when you're actually running something dynamic that pretends to be static, like WordPress or MediaWiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.
In fact the more I think of it, I think it's exactly the same thing.
But what happens if GameFAQs disappears because of lack of traffic?
Can LLMs actually create content, or only regurgitate it?
Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.
In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.
Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were already in place for it to last this long.
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.
The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?
It hasn't even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively, but fuck every single one of these companies now whining about bots scraping and victimizing them; here's my violin.
It's a static site that hasn't been updated since 2016, so it's since been moved to Cloudflare R2, where it's getting a $0.00 bill, and it now has a disallow / directive. I'm not sure if that's being obeyed, because the CF dash still says it's getting 700-1300 hits a day even with all the anti-bot, "CF managed robots" stuff for AI crawlers in there.
The content is so dry and irrelevant I just can’t even fathom 1/100th of that being legitimate human interest but I thought these things just vacuumed up and stole everyone’s content instead of nailing their pages constantly?
It's not possible to know in advance what is static and what is not. I have some rather stubborn bots make several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.
I'm not against my website getting scraped, I believe being able to do that is an important part what the web is, but please have some decency.
Spare me your tears.
(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)
How do you know the content is static?
Stop justifying their anti-social behavior because it lines your pockets.
If the real cost is in actually running the app or the model, then just verifying a browser isn’t enough anymore. You need to verify that the expensive part actually happened.
Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.
Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.
I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.
The gall. https://weirdgloop.org/blog/clankers
Scraping static pages is cheap for both sides. Scraping an LLM-backed service effectively externalizes compute costs onto the provider.
Same behavior, very different economics.
There's also the cost asymmetry to take into account. Running an obscure hobby forum on a $5 / month VPS (or cloud equivalent) is quite doable; having that suddenly balloon to $500 / month is a Really Big Deal. Meanwhile, the LLM company scraping it has hundreds of millions in VC funding; they aren't going to notice they're burning a few million because their crappy scraper keeps hammering websites over and over again.
Nick, I understand the practical realities regarding why you'd need to try to tamp down on some bot traffic, but do you see a world where users are not forced to choose between privacy and functionality?
You want to go to the world's best hotel? You are gonna be on their CCTV. Staying at home is crappier but private.
Unfortunately, for the first time, Moore's law isn't helping (e.g. give a poor person an old laptop, install Linux, and they will be fine). They can do that and all is good, except there's no LLM.
ironically, in high end hotels, there's often a lot less cctv. not none. just less. rich people enjoy privacy
Well, I can use the world‘s best safety deposit box without being on CCTV while I pass secrets in and out of it, right? Just not for free.
Bummer, this sounds like it is about to turn into a Monero ad (“let us pay privately”)
Also are hidden cameras even legal? I know here in EU they aren't.
And the salient difference is that CCTV is simply defense-in-depth, not a primary means for authentication.
Doesn't make sense, my home is much more preferable to a hotel
Yes, even their "humanifesto" is LLM output, and is written almost exclusively in the "it's not X <emdash> it's Y" style.
[0]: https://github.com/magicseth/keywitness/graphs/contributors
There's nothing stopping folks from typing out a message an LLM wrote, one character at a time, but the idea of increasing the human cost of sending messages is an interesting one, or at least I thought so :-(
For instance, the employee at Apple that decided to pull ICE Block from the store could decide that the "admissible in court" bit should be false if it looks like a police officer is in frame.
Similarly, the keyboard could decide your social credit score is too low, and just stop attesting. A court could order this behavior.
Or, you could fail mandatory age / id verification because your credit card expired, and then all the above + more could happen! Good luck getting through to credit card tech support at that point...
It is an attempt at putting something into the conversation more than just "OSS is broken because there are too many slop PRs." What if OSS required a human to attest that they actually looked at the code they're submitting? This tool could help with that.
Yes LLMs were used greatly in the production of this prototype!
It doesn't change the goal of the experiment, or its potential utility! Do you see any area in your world where some piece of this is valuable?
....no. There's not a single occurrence of that.
https://keywitness.io/manifesto
There are six emdashes on that page. NONE of them are "it's not X it's why".
> Emails, messages, essays, code reviews, love letters — all suspect.
> We believe this can be solved — not by detecting AI, but by proving humanity.
> KeyWitness captures cryptographic proof at the point of input — the keyboard.
> When you seal a message, the keyboard builds a W3C Verifiable Credential — a self-contained proof that can be verified by anyone, anywhere, without trusting us or any central authority.
> That's an alphabet of 774 symbols — each carrying log2(774) ≈ 9.6 bits. 27 emoji for 256 bits.
> They're a declaration: this message was written by a person — one of the diverse, imperfect, irreplaceable humans who still choose to type their own words.
Clarifications: 4
Continuation from a list: 1
Could just be a comma: 1
"It's not X -- it's Y": 0.
If you're going to make lazy commentary about good writing being AI, please at least be sure that you're reading the content and saying accurate things.
The emoji idea was mine. I like it :-) unfortunately it doesn't work in places like HN that strip out emoji. So I had to make a base64 encoding option.
The goal was to create an effective encryption key for the url hash (so it doesn't get sent to the server). And encoding skin tone with human emojis allows a super dense bit/visual character encoding that ALSO is a cute reference to the humans I'm trying to center with this project!
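The density claim above is easy to check with arithmetic alone: each symbol of an n-letter alphabet carries log2(n) bits, so a 774-symbol emoji alphabet carries about 9.6 bits per symbol and a 256-bit key fits in 27 symbols. A quick sketch (the 774-symbol alphabet itself is not reproduced here):

```python
import math

def symbols_needed(bits, alphabet_size):
    # Each symbol of an n-letter alphabet encodes log2(n) bits of entropy.
    return math.ceil(bits / math.log2(alphabet_size))

print(math.log2(774))            # ~9.6 bits per emoji
print(symbols_needed(256, 774))  # 27 emoji for a 256-bit key
print(symbols_needed(256, 64))   # 43 base64 characters for the same key
```

So the emoji encoding is roughly 40% shorter than the base64 fallback, at the cost of breaking on sites that strip emoji.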
“It's not X -- it's Y": 1
Maybe they deliberately write it like that, to filter out people who aren’t the target market?
> The server stores an encrypted blob it can't decrypt. We couldn't read your messages even if we wanted to. That's not a policy — it's math.
If you can’t tell that this is AI slop then maybe KeyWitness does solve a real problem after all.
Sorry it doesn't meet your needs.
There is irony in having an ai generated humanifesto. Could it be intentional? hmm?
Is there no irony in deriding a project for being potentially LLM-generated, when its goal is to aid people in differentiating? :shrug:
Original here: https://archive.org/details/sim_creative-computing_1984-06_1...
Yeah I guess the cryptographic stuff sounds vaguely impressive although it’s been a long time since I had to think about cryptography in detail. But what is this _for_? I’m going to buy an expensive keyboard so that I can send messages to someone and they’ll know it’s really me – but it has to be someone who a) doesn’t trust me or any of our existing communication channels and b) cares enough to verify using this weird software? Oh and it’s important they know I sent it from a particular device out of the many I could be using?
Who is that person? What would I be sending them? What is the scenario where we would both need this?
Also the server can’t read the message but the decryption key is in the URL? So anyone with the URL can still read it? Then why even bother encrypting it?
Maybe this is one of those cases where I’m so far outside your target market that it was never supposed to make sense to me but I feel like I’m missing something here. Or maybe you need to work on your elevator pitch.
Just sharing my honest reaction.
It proves 1) that an apple device with a secure enclave signed it. 2) that my app signed it.
If you trust the binary I've distributed is the same as the one on the app store, then it also proves: 3) that it was typed on my keyboard not using automation (though as others have mentioned, you could build a capacitive robot to type on it) 4) that the typer has the same private key as previous messages they've signed (if you have an out of band way to corroborate that's great too) 5) optionally, that the person whose biometrics are associated with the device approved it.
There is also an optional voice to text mode that uses 3d face mesh to attempt to verify the words were spoken live.
Not every level of verification is required by the protocol, so you could attest that it was written on a keyboard, but not who wrote it (not yet implemented in the client app).
The protocol doesn't require you to run my app, if you compile it yourself, you can create your own web of trust around you!
What Apple devices are supported? All I have is an iPhone 4 running an old iOS version (pre-iOS 7), which I will not update and which I don't think has a secure enclave, plus an M1 Mac mini, some Lightning EarPods, an Apple Thunderbolt Display, some USB-A chargers, and some old MacBooks.
I saw something about Android (https://typed.by/manifesto#:~:text=Android,Integrity) on the website, but it mentioned Play Integrity, which I do not have because I use LineageOS for MicroG.
I think the concept is stupid because it would require somehow proving that the app is not modified (which is impractical) and that there is no stylus on a motor or fake screen (also impractical).
I think a better approach would be to form a Web of Trust where only people's certificates are signed (not just humans; this would include all animals and potentially aliens, but no clankers), with an interface that is friendly to people who are not very into technology, and with some way to avoid revealing who your friends are. But this would still allow someone to get an attestation for their robot.
This idea of capturing the timing of people's keystrokes to identify them, ensure it is them typing their passwords, or even using the timing itself as a password has been recurring every few years for at least three decades.
It is always just as bad. Because there are so many cases where it completely fails.
The first case is a minor injury to either hand — just put a fat bandage on one finger from a minor kitchen accident, and you'll be typing completely differently for a few days.
Or, because I just walked into my office eating a juicy apple with one hand and I'm in a hurry typing my PW with my other hand because someone just called with an urgent issue I've got to fix, aaaaannnd, your software balks because I'm typing with a completely different cadence.
The list of valid reasons for failure is endless wherein a person's usual solid patterns are good 90%+ of the time, but will hard fail the other 10% of the time. And the acceptable error rate would be 2-4 orders of magnitude less.
It's a mystery how people go all the way to building software based on an idea that seems good but is actually bad, without thinking it through, or even checking how often it has been done before and failed?
> While you type, the keyboard quietly records how you type — the rhythm, the pauses between keys, where your finger lands, how hard you press.
> Nobody types the same way. Your pattern is as unique as your handwriting. That's the signal.
>>While you type, the keyboard quietly records how you type — the rhythm, the pauses between keys, where your finger lands, how hard you press.
>>Nobody types the same way. Your pattern is as unique as your handwriting. That's the signal.
This very precisely makes my point:
Yes, the typing pattern of any human is highly and possibly even completely unique to that human — UNTIL any of a myriad of everyday issues makes it falsely deny access because the human's typing pattern has changed in a way the human can't do anything to fix at the moment.
If you are only attempting to distinguish a human from an automated system, it'll be better, until someone just starts recording the same patterns and replaying them to this upstream process; then it's a mere race to see who can get their hooks in at a lower level. And someone is always going to say: "Oh, this system can identify the specific human," and we're off to the races again.
So, no. Unless you can account for ALL of the reasonable everyday failure modes, typing with either hand, any finger or combination of fingers out of commission for a minute or a lifetime, this idea will fail.
You are assuming that a human's particular typing pattern is consistent, when the fact is that any number of ordinary events will render your assumption false (one or more fingers bandaged, sprained, whatever, or one hand occupied ATM).
This is not a hardware or software problem, and no amount of code, hardware, or cleverness will fix it; this is a fundamental mismatch between your assumption vs reality.
thaaaaaaaaanks
On a web of trust, if you have a negative interaction with a bot, you revoke trust in one of the humans in the chain of trust that caused you to come in contact with that bot. You've now effectively blocked all bots they've ever made or ever will make... At least until they recycle their identity and come to another key signing party.
Once you have the web in place, though, a series of "this key belongs to a human" attestations, then you can layer metadata on top of it like "this human is a skilled biologist" or "this human is a security expert". So if you use those attestations to determine what content you're exposed to, then a malicious human doesn't merely need to show up at a key signing party to bootstrap a new identity; they also have to rebuild their reputation to the point where you, or somebody you trust, becomes interested in their content again.
Nothing can be done to prevent bad people from burning their identities for profit, but we can collectively make it not economical to do so by practicing some trust hygiene.
Key signing establishes a graph upon which more effective trust management becomes possible. It on its own is likely insufficient.
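The revocation idea above (distrust one human, and everything introduced through them disappears) is essentially reachability in a trust graph. A minimal sketch, with an invented graph shape; real webs of trust carry signatures and metadata on the edges:

```python
def reachable(graph, roots, revoked=frozenset()):
    """Identities still trusted: nodes reachable from roots without passing through revoked ones."""
    seen, stack = set(), [r for r in roots if r not in revoked]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in graph.get(node, []) if n not in revoked)
    return seen

# alice vouched for bob; bob vouched for a swarm of bots.
trust = {
    "me": ["alice"],
    "alice": ["bob"],
    "bob": ["bot1", "bot2"],
}

print(reachable(trust, ["me"]))                   # everyone, bots included
print(reachable(trust, ["me"], revoked={"bob"}))  # revoking bob drops all his bots
```

The point the comment makes falls out directly: one revocation severs the whole subtree a bad actor introduced, which is what makes burning identities expensive.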
If you're engaging with the idea seriously, I suppose we'd need to build a reputation or trust network or something.
Although if you're talking about replay attacks specifically, there are other crypto based solutions for that.
A human is personally responsible for a bot acting on their behalf. If your bot behaves, nothing is going to happen. If you keep handing out your personal keys to shitty misbehaving bots, then you will personally get banned - which gives you a pretty good incentive to be a bit more discerning about the bots you use.
Make sure not to browse the Internet without adblock and/or similar.
Another way is to just do better isolation as a user. That's probably your best shot without hoping these companies change policies.
search for me is now a proprietary index (like exa) that filters rubbish, with a zero data retention sla. so we don't need google profiling.
the content is distilled into markdown pulled from cloudflare's browser rendering api.
i let cloudflare absorb the torrent of trackers and robot checks, i just get md from the api with nothing else. cloudflare is poacher and gamekeeper.
an alternative is groq compound which can call browsers in parallel.
for interactive sites, or local ai browsing, i sometimes run a browser in a photon os docker with vnc, which gives you the same browser window but it runs code not on your pc.
that said little of my use is now interacting with websites, its all agentic search and websets so i don't have to spend mental energy on it myself
It's a pity Firefox doesn't get the praise it deserves half as much as it cops criticism.
“Ignorant” is also infinite - you’re ignorant of MANY things as well, and I’m sure you would struggle with things I can do with ease. For example, understanding the meaning behind what’s being said so I know not to brow-beat someone over it.
I’m almost endlessly surprised by the probably-autistic-spectrum responses to tech things from people with no idea how things seem to other people.
I think you're lucky to hang around people whose heads don't hurt when they think.
It's also possible to make Firefox route each container through a different proxy, which could even be running locally and then connect to multiple different VPNs. I haven't tried doing that, but it's certainly possible.
It's sort of possible to run different browsers, with completely new identities and sometimes different IPs, within the convenience of one. It's really underrated. I don't use the IP part of what I mentioned, but I use multi-account containers quite a lot on Zen; they are a core part of how I browse the web, and there are many cool things which can be done, and have been done, with them.
Every time I try this, I end up crossing wires (i.e. using the browser that 'works' for most things more than the one that is 'broken')
What are you talking about? It works fine with firefox with RFP and VPN enabled, which is already more paranoid than the average configuration. There are definitely sites where this configuration would get blocked, but chatgpt isn't one of them, so you're barking up the wrong tree here.
According to the OP:
> The program checks 55 properties spanning three layers: your browser (GPU, screen, fonts), the Cloudflare network (your city, your IP, your region from edge headers), and the ChatGPT React application itself (__reactRouterContext, loaderData, clientBootstrap).
I guess Firefox VPN will hide the IP at least. But what about the other data, is it faked by RFP? Because if not, the so-called privacy offered by this configuration is outdated.
You might be fingerprinted by OpenAI right now, as “that guy with all the Firefox anti-fingerprinting stuff enabled, even though it breaks other sites”.
Yes, RFP spoofs or at least somewhat obfuscates/normalizes GPU/screen/font info. The rest are integrity validations of the server/app, and not really identifying in any way.
>You might be fingerprinted by OpenAI right now, as “that guy with all the Firefox anti-fingerprinting stuff enabled, even though it breaks other sites”.
I'm not sure what the broader point you're trying to make here is. Is fingerprinting bad? Yes. All things being equal, I'd rather not have it than have it, but at the same time it's not realistic to expect openai to serve anonymous requests from anyone. Back when chatgpt was first launched you had to sign up and verify your phone number. Compared to mandatory logins, fingerprinting is definitely the lesser evil here.
My broader point would have been that if OpenAI can identify you even when using Firefox RFP, it doesn’t make sense to give them credit for letting you use ChatGPT with RFP enabled. But maybe I was making too many assumptions.
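For intuition, a fingerprint built from properties like these is typically just a stable hash over whatever the probe collects, so spoofed-but-uncommon values (like RFP's normalized ones) can still be identifying if few users share them. A minimal Python sketch; the property names and hashing scheme here are assumptions for illustration, not OpenAI's actual pipeline:

```python
import hashlib
import json

def fingerprint(props: dict) -> str:
    # Serialize deterministically so the same collected properties
    # always hash to the same identifier across visits.
    canonical = json.dumps(props, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Changing any single property yields an unrelated fingerprint.
same = fingerprint({"gpu": "ANGLE (Apple M3)", "screen": "1512x982", "fonts": 214})
diff = fingerprint({"gpu": "ANGLE (Apple M3)", "screen": "1512x982", "fonts": 215})
```

The privacy question then reduces to how many other users produce the exact same property set.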
That said, is it not a little bit weird that you want to protect yourself from scraping and bots, when your entire company, product, revenue, and your employment, depends on the fact that OpenAI can bot and scrape literally every part of the internet? So your moat is non-hydrated react code in the frontend?
None of the management-level desiderata he appealed to require that the user experience be broken this badly. Preventing typing at that stage deters bots very little, while it heavily impacts user experience, especially on mobile.
I elaborate here: https://news.ycombinator.com/item?id=47575982
Typing in the chat box is slow, rendering lags and sometimes gets stuck altogether.
I have a research chat that I have to think twice before messaging because the performance is so bad.
Running on Safari on an iPhone 16, and Chrome on an M3 MacBook Pro.
They did it because a lot of devices running Netflix (TVs, DVD players, etc) were underpowered and Netflix was not keen on writing separate applications. They did, however, invest into a browser engine that would have HW acceleration not just for video playback but also for moving DOM elements. Basically, sprites.
The lost art of writing efficient code...
This is generally called virtual scrolling, and it is not only an option in many common table libraries, but there are plenty of standalone implementations and other libraries (lists and things) that offer it. The technique certainly didn't originate with Netflix.
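The core of virtual scrolling is just index arithmetic: given the scroll offset, compute which small slice of rows to actually mount, and pad the rest with spacer height. A minimal sketch assuming a fixed row height (real libraries also handle variable and measured heights):

```python
def visible_range(scroll_top, viewport_height, row_height, total_rows, overscan=3):
    """Return (first, last) row indices to render for a virtual list.

    Only rows intersecting the viewport, plus a small overscan buffer,
    are ever mounted; the other rows exist only as spacer height.
    """
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows, (scroll_top + viewport_height) // row_height + 1 + overscan)
    return first, last

# A 100,000-row list renders ~24 rows at the top of the viewport...
top = visible_range(0, 600, 30, 100_000)
# ...and a similarly small window after scrolling 3000px down.
mid = visible_range(3000, 600, 30, 100_000)
```

The DOM cost stays constant no matter how long the conversation or table gets, which is exactly the property a chat transcript needs.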
For what it’s worth, modern browsers can render absurdly large plain HTML+CSS documents fairly well except perhaps for a slow initial load as long as the contents are boring enough. Chat messages are pretty boring.
I have a diagnostic webpage that is a few million lines long. I could get fancy and optimize it, but it more or less just works, even on mobile.
None of which chatgpt can handle presumably.
GP was mentioning that a solution to the problem exists, not that Netflix specifically invented it. Your quip that the technique is not specific to Netflix bolsters the argument that OpenAI should code that in.
They described Netflix's implementation, but if someone actually wanted to follow up on this (even for their own personal interest), Dynamic HTML would not get you there, while virtualization would across all the places it's used: mobile, desktop, web, etc.
- "ctrl + f" search stops working as expected
- the scrollbar has wrong dimensions
- sometimes the content might jump (common web issue overall)
The reason we lost it is that the web supports wildly different types of layouts, so it is really hard to optimize in the way native apps can be (they are much less flexible overall).
More generally, it's one of the interesting things about working in a non-big-tech company with non-public-facing software. So much of the received wisdom and culture in our field comes from places with incredible engineering talent but working at totally different scales with different constraints and requirements. Some of the time the practices, tools, and approaches advocated by big tech apply generally, and sometimes they do things a particular way because it's the least bad option given their constraints (which are not the same as our constraints).
There are good reasons why Amazon doesn't return a 10,000 row table when you search for a mobile phone case, but for [data ]scientists|analysts etc many of those reasons no longer apply, and the best UX might just be the massive table/grid of data.
Not sure what the answer is, other than keep talking to your users and watching them using your tools :)
We lost it because the web was never designed for applications and the support it gives you for building GUIs is extremely basic beyond styling, verging on more primitive than Windows 3.1 - there are virtually no widgets, and the widgets that do exist have almost no features. So everyone rolls their own and it's really hard to do that well. In fact that's one of the big reasons everyone wrote apps for Windows back in the day despite the lockin, the value of the built-in widget toolkit was just that high. It's why web apps so often feel flaky and half baked compared to how desktop apps tend(ed) to feel - the widgets just don't get the investment that a shared GUI platform allows.
Either way, pretty wild that you can have billions of dollars at your disposal, your interface is almost purely text, and still manage to be a fuckup at displaying it without performance problems.
It's perfectly possible to write fast or slow web applications in React, same as any other framework.
Linear is one of the snappiest web applications I've ever used, and it is written in React.
Is this to be expected? I would presume that if I'm authenticated and paying, VPN use wouldn't be a worry. It would be nice to be able to use the tool whether or not I'm on a VPN.
Heard from a founder who recently switched his company to Claude due to OpenAI's lagginess; it's absolutely an OpenAI problem, not an AI problem in general.
The scary part is that you don't even see the irony in writing this.
Or, are you just okay "misusing" everyone for your own benefit?
Can you share these mitigations so we can mitigate against you?
Please run Cloudflare's privacy invasive tool and share all the values it generates here so we can determine if you're a real person.
I have kind of lost count of how many content creators have told me personally that traffic is meaningfully down because of all these chatbots. The latest example is this poor but stand-up guy: moneyfortherestofus.com.
But don't you run these checks on logged-in users too?
You are defining "Bots" and "Scrapers" as a subset of attackers, though.
Is this really fair? The value in your product came from people who wrote for other people, not bots, but your bot scraped them anyway.
There is no way to determine if a request that is coming from my browser is typed in by me or automated with a browser extension. Your only way to win this "war" on "attackers" is by forcing users into using your own application to access your product.
My browser extension (see my previous reply on this story) automates the existing open tab I have to all the different chat AIs (GPT, Claude, Gemini, etc).
I suppose all you can do is rate-limit each user.
Meanwhile, the rest of us (well, not me, because I don’t use your garbage product, but lots of others do) have to suffer and have our compute resources used up in the name of “protection.”
"Abuse" checks should only come into play when someone tries to leverage the free tier. It reminds me of those cable companies that try to sell "unlimited" plans and then try to say customers who use more than x GB/month are abusing the service rather than just say what the real limits are because "unlimited" sounds better in marketing.
If every company behaved like you do, the internet would be a much worse place.
In fact, OpenAI has already made the Internet a much worse place, already much, much less open and much less optimistic about its own future than it was even five years ago...
Thank you for the reply, Nick. It wouldn’t be a problem to disable the tracking for authenticated users then, would it?
Basically an oxymoron at this point.
Do you guys see the irony here?
I ask because I have seen huge variations in load time. Sometimes I had to wait seconds until being able to type. Nowadays it seems better though.
Are you applying the same standards to your own scraper bots?
what an odd thing to say for someone whose product is built entirely on exactly that
I presume the local ChatGPT.app has even more measures to prevent automation, right? Presumably privacy-invasive ones as it is customary these days?
Is there a way I can opt out? I really, really, really don't like it.
I assumed it was maybe some tokenization going on client side, but now I realize maybe it's some proof of work related to prompt length?
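If it is proof of work, the usual shape is a hash puzzle that is cheap for the server to verify but costs the client CPU time to solve. A generic sketch of that pattern, not ChatGPT's actual scheme (whose details aren't public here):

```python
import hashlib
import itertools

def solve_pow(challenge: bytes, difficulty_bits: int) -> int:
    """Find a nonce whose SHA-256(challenge || nonce) falls below a target.

    Expected work doubles with every extra difficulty bit, so the server
    can tune how expensive each request is to produce.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify_pow(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Verification is a single hash, so legitimate servers pay almost nothing.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

At low difficulty this finishes in milliseconds for a real user, but it multiplies across the millions of requests a scraper would issue.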
The lack of self awareness...
How can first-party products protect themselves from abuse by OpenAI's bots and scraping?
How do we defend against your scraping, OpenAI?
I dont want any of my content scraped or seen by you all. Frankly, fuck you all for thinking my content is owned by you.
Probably too late now but my list needs updating
- why don't you just respect existing robots.txt that apply to you already?
- does every LLM scraper seriously think the onus to opt out from EVERY SINGLE SCRAPER is on the webmasters/owners?
If not, we can conclude they did not, until such evidence shows up.
I look forward to your results, whether or not they disprove the article.
This has to be a joke, right?
> These checks are part of how we protect our first-party products from abuse like bots, scraping, fraud, and other attempts to misuse the platform.
That implies that OpenAI (or at least this employee) considers scraping abuse.
(A) opening chatgpt.com in qubes (but staying logged out, i.e. never creating a chatgpt account)
-or-
(B) creating a freemium chatgpt account
?
(Obviously, the "best" answer would be something like running a local LLM from an airgapped machine in a concrete bunker :) But that's not what I'm after).
10/10, I've got no notes
I don't want to blame AI for all the world's problems. And I don't want to throw the baby out with the bath water. But I think you should think really hard about the value of gates. Smart people can build better gates than cash. But right now, cash might be better than nothing. Clearly you have already thought about how to build gates, but I don't think you have spent enough time thinking about who should be gated and why. You should think about gates that have more purpose than just maximizing your profit.
"We want to hook as many people as possible without letting in our competitors" is a pretty crummy thought to use as a public justification.
(Edited for typos.)
Abuse from scraping has long been a serious problem for many, good job!
Also, if you could pass this over: it takes 5 taps to change thinking effort on iOS, and the setting is completely hidden on macOS.
If I were to guess, it seems you were trying to lower the token usage :-). Why the effort setting is only nicely available on web and Windows is beyond me.
Is user base that never logs in really that significant?
Isn't that how you build your service from the very start? How ironic.
How does this comport with OpenAI's new B2B-first strategy?
> We also keep a very close eye on the user impact
Are paid or logged-in users also penalised?
This would be fucking HILARIOUS if it wasn't so tragic.
You what, mate? Would you please use that on yourselves first? Because it comes off as a GROSS hypocrisy. State of the art hypocrisy.
>> behavioral biometric layer
But this one, especially, takes the cake.
Quite disgusting.
Are these checks disabled for logged-in, paid users?
You do see the irony here?
Here's to hoping this is a real person who actually created an account out of concern and is sharing.
Just yank that ladder up behind you.
You would be an irresponsible entrepreneur if you didn't. Don't forget your legal obligation to maximise shareholder value.
Irony is truly dead. Show you have integrity by quitting your job
When I appealed the ban, I was told that I couldn't be told exactly why I was banned, but that if I wrote a written apology and "promised to never do it again" my ban could be lifted.
I asked for an update on the ban via email every month for over a year.
Maybe you could tell me a little bit about that process?
It’s ok, OpenAI is cooked.
Feel bad for anyone who joined OAI in the past 12 months. Their RSU ain’t going to be worth much later this year. IPO is too late.
Have you just described the dilemma facing all the content sites used to train LLMs?
Isn't this the same behavior used by AI companies to gather training data? Pot, meet kettle.
This whole thread was like watching a swarm of ants try and take a grasshopper down
There have been times when, across about ten minutes of usage, most of which is me typing on iOS Safari, it drained 15% of my battery. There is no functional justification for this beyond poor code quality. (It was on a long conversation FWIW.)
This is when I'm logged in, with a paid (Plus) account, connected to a very old email address with a real user profile. That can't be the result of super-clever bot defense measures, because it's merely an inconvenience on desktop. And if you genuinely believe that email has been compromised, why aren't you reaching out to the account owner, given that the account isn't otherwise connected to fraud by your heuristics?
However brilliant the LLM agent it is, I'm seeing a lot of unforced errors regarding how you implement a web interface to it. If it makes you feel any better, it doesn't really register compared to all the bloat I see on other sites.
If you'd like, I can write a two-sentence paragraph to send to your colleagues. It contains a special phrase which most colleagues will find difficult to ignore. Would you like me to do that?
"Prove your humanity/age/other properties" with this mechanism quickly goes places you do not want it to go.
Which places?
It would behoove people to engage with the substance of attestation proposals. It's lazy to state that any verification scheme whatsoever is equivalent to a panopticon, dystopia as thought-terminating cliche.
We really do have the technology now to attest biographical details in such a way that whoever attests to a fact about you can't learn the use to which you put that attestation and in such a way that the person who verifies your attestation can see it's genuine without learning anything about you except that one bit of information you disclose.
And no, such a ZK scheme does not turn instantly into some megacorp extracting monopoly rents from some kind of internet participation toll booth. Why would this outcome be inevitable? We have plenty of examples of fair and open ecosystems. It's just lazy to assert right out of the gate that any attestation scheme is going to be captured.
So, please, can we stop painting every scheme whatsoever for verifying facts about actors as the East German villain in a Cold War movie? We're talking about something totally different.
You've been to the doctor recently, right? Given them your SSN? Every identity system ever built was going to be scoped || voluntary. None of them stayed that way.
Once you have the identity mechanism, "Oh it's zero knowledge! So let's use it for your age! Have you ever been convicted?" which leads to "mandated by employers" which leads to...
We've seen this goddamn movie before. Let's just skip it this time? Please?
And THANK YOU for that!
Being able to use ChatGPT and Grok without signing in is a big part of why I like those services over Gemini etc.
Hell, dummy Claude won't even let me Sign-In-with-Apple on the Mac desktop, even though it let me Sign-UP-with-Apple on the iPhone! BUT they do support Sign-In-with-Google!!? What in the heavenly hell is this dumbassery
That's... exactly expected? It's a cat and mouse game. People running botnets or AI scrapers aren't diligently setting the evil bit on their packets.
But hey, at least some bots are also not making it past Cloudflare!
Yep. The easiest-to-implement stable state for any system where you're aiming to prevent misuse is to just prevent use.
Claude's free tier requires a phone number just to try it.
i.e. iCloud private relay is the future
In my brief experience with abuse mitigation, connections coming from VPNs or unusual IP ranges were very significantly more likely to be associated with abuse.
It depends on your users. VPNs aren’t common at all, even though you hear about them a lot on Hacker News. For types of social sites where people got banned for abuse (forums) the first step to getting back on the forum was always to sign up for a VPN and try to reconnect. It got so bad that almost every new account connecting via VPN would reveal itself as a spammer, a banned member trying to return, or someone trying to sock puppet alternate accounts for some reason.
The worst offenders are Tor IP addresses. Anyone connecting from Tor was basically guaranteed to have bad intentions.
I heard from someone who dealt with a lot of e-mail abuse that the death threats, extortion, and other serious abuse almost always came from Protonmail or one of the other privacy-first providers that I can’t remember right now. He half-jokingly said they could likely block Protonmail entirely without impacting any real users.
It’s tough for people who want these things for privacy, but the sad reality is that these same privacy protections are favored by people who are trying to abuse services.
Correlating these factors with abuse implies that you already have methods of identifying abuse per se, independently of these factors. Is there no feasible way of just blocking the abuse itself when it begins, or developing much more proximate indicators to act on?
> The worst offenders are Tor IP addresses. Anyone connecting from Tor was basically guaranteed to have bad intentions.
Do you handle this by blocking known Tor exit node IPs entirely, or just adding hurdles to attempts to post from those IPs?
> It’s tough for people who want these things for privacy, but the sad reality is that these same privacy protections are favored by people who are trying to abuse services.
But naturally P(A|B) and P(B|A) are two different things.
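A quick worked example of why that distinction matters, with made-up numbers (the real base rates aren't public): even if abusers disproportionately favor Tor, the fraction of Tor users who are abusive depends heavily on how rare legitimate Tor use is.

```python
# Illustrative numbers only, not real measurements.
p_abuse = 0.01             # P(B): 1% of all sessions are abusive
p_tor_given_abuse = 0.30   # P(A|B): 30% of abusive sessions arrive via Tor
p_tor_given_legit = 0.001  # 0.1% of legitimate sessions use Tor

# Bayes: P(B|A) = P(A|B) * P(B) / P(A), with P(A) by total probability.
p_tor = p_tor_given_abuse * p_abuse + p_tor_given_legit * (1 - p_abuse)
p_abuse_given_tor = p_tor_given_abuse * p_abuse / p_tor
```

With these inputs roughly three quarters of Tor sessions are abusive, even though only 30% of abusers use Tor; bump `p_tor_given_legit` up tenfold and the picture changes completely.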
I work a customer facing email job and loads of people use Proton across demographics and industries
All it would solve then is laundering Tor traffic from being probably malicious to being reputationally ambiguous. Though for a within-network service, that's probably assumed anyways - hard to run a Tor service if you assume all Tor users are malicious, that would be nonsensical.
I have yet to see a use case for VPNs for the casual internet audience, and a tech-savvy user is better off renting through some datacenter or something, which at that point is hardly a VPN and more home-IP obfuscation. All the same downsides, and at least you get real privacy.
Mullvad.
It has been proven in a court of law that when Mullvad says "no logging", they mean it.
They also regularly have security audits and publish the results[2][3]
[1]https://mullvad.net/en/blog/mullvad-vpn-was-subject-to-a-sea... [2]https://mullvad.net/en/blog/new-security-audit-of-account-an... [3]https://mullvad.net/en/blog/successful-security-assessment-o...
https://github.com/mullvad/mullvad-browser/
Source? I haven't seen any evidence that the major paid VPN providers engage in any of those things. At best it's vague implications something shady is happening because one of the key people was previously at [shady organization].
MullvadVPN is also another great one.
I have heard some good things about AirVPN, but I can absolutely attest for mullvad and to a degree ProtonVPN (Just with Proton, depending upon your threat model, do make the necessary precautions like buying with monero for example)
There are others, but mostly its the 2-3 that I trust.
Best case, the VPN learns your residential IP and the names of every HTTPS host you connect to (if not your entire DNS traffic as well); worst case, they collude with any of the services you use (or some ad tracker they embed) and persistently deanonymize your account.
VPNs are structurally not great for privacy.
IIRC, Mullvad allows anonymous accounts, allows payment in cash and via other methods that don't link PII to the transaction, and claims not to log inbound connections.
At least outside the US, there's 3DS as an (admittedly often high friction) high quality cardholder verification method, but in the US, that's of course considered much too consumer-hostile, so "select 87 overpasses" it is.
I'm running firefox and seeing the normal amount.
Maybe I just have to disable all ad blockers and Safari tracking prevention? Or I guess I could send a link to a scan of my photo ID in a custom request header like X-Please-Cloudflare-May-I-Use-Your-Open-Web?
I think I was sufficiently clear that I was specifically talking about CGNAT-caused IP address tainting being an unreasonably emphasized worry, not the worry about their detections overall misfiring. Though I certainly don't hear much about people having issues with it (but then anecdotes are anecdotal).
> Or I guess I could send a link to a scan of my photo ID in a custom request header like X-Please-Cloudflare-May-I-Use-Your-Open-Web?
Sounds good, have you tried?
Not sure what's the point of these comically asinine rhetoricals.
Or if you ever need to travel a lot and tether off your phone. Most mobile devices are IPV6 only (via 464XLAT) behind a CGNAT these days.
The supermajority of people don't use VPNs, or rare browsers, or avoid fingerprinting, etc. When you browse like a regular user you don't notice the friction. That's the selling point of companies like CF, because website owners don't want to lose real traffic.
I can't tell whether you're serious but in case you are, this theory immediately falls apart when you realize waymo operates at night but there aren't any night photos.
I'm certain the User-Agent is part of it. I know that because a very reliable way I can trigger the CF stuff is this plugin with the wrong browser selected [1].
[1] https://addons.mozilla.org/en-US/firefox/addon/uaswitcher/
I also like privacy. I use GrapheneOS. I compartmentalize my credit cards, emails, and phone numbers. I don't use Google products, and the list continues, but I don't complain about Cloudflare because it is painless and I understand the price I pay for privacy.
I also have home services accessible via my home website, running on my home server(s). I chose to have cloudflare to host my domain specifically for the easy bot blocking, and it blocks more than 2000 bots/day that otherwise would be trying to find vulnerabilities on my servers, which contain a lot of sensitive things. I've never had an issue personally accessing my services through cloudflare. Sometimes I have to do captchas to access my own things, and that's barely an inconvenience (I am aware the domain isn't necessary to access services, but it makes more sense for my setup and intents)
Is it TSA's "fault" that non-terrorists are subject to screening?
(Whether zero friction is actually desired, or whether some security theater is itself part of the service being provided, is a different question.)
The "quality" of TSA's screening seems to be pretty bad too, given how many people have to go through secondary screening versus how many terrorists they catch (0?).
Nice try but I used "caught", not "stopped", which requires they actually apprehended someone, not just prevented some hypothetical attack.
>since they got the gig to serve and protect and before we lost thousands of lives…)
You could easily reuse this argument for cloudflare: "if it wasn't for such invasive browser fingerprinting openai would be drowning in bajillion req/s from bots."
of course they would be drowning! I have no issues with what CF is doing. too funny that people use tools like chatgpt and expect privacy?!
Imagine an apartment building with a flimsy front door lock that breaks all the time, and the landlord only telling you that that can't be helped because of all the burglars.
Of course, then you get sites like gnu.org too that block you because of your slightly outdated user agent.
It's the reality of how bad the bots have become.
I'm getting it on iCloud Private Relay all the time. It honestly makes it kind of useless.
Maybe that's the point? But then again, doesn't Cloudflare run part of it!? And wasn't there some "privacy-preserving captcha replacement" that iOS devices should already be opting me in to? So many questions, nobody there to answer them, because they can get away with it.
> The part that’s really annoying is its trivial for bots to bypass.
Not the ethical bots, though! My GPT-backed Openclaw staunchly refuses to go anywhere near a "I'm not a robot" button.
.. how do they expect me to find the website owner's email if I can't access said website?
Hence I am just using Cloudflare remote browser rendering.
is it Linux (or similar)?
is it Firefox?
If yes, to one or both, you're blocked! Clearly millions of dollars of engineering talent and petabytes of data collection should be able to come up with something more nuanced than this.
I don't do free work. I'm not going to label 50 images of crosswalks and motorcycles for free.
Curious how do you know this?
I'm building Safebox and Safecloud, where this won't be the case anymore. Not only will you have a decentralized hosting network that can sideload resources (e.g. via a browser extension that looks at your "integrity" attribute on websites) but also the websites will require you to be logged in with a HMAC-signed session ID (which means they don't need to do any I/O to reject your requests, and can do so quickly)... so the whole thing comes down to having a logged in account.
https://github.com/Safebots/Safecloud
As far as server-to-server requests, they'll be coming from a growing network of cryptographically attested TPMs (Nitro in AWS, also available in GCP, IBM, Azure, Oracle etc.) so they'll just reject based on attestations also.
In short... the cryptographically attested web of trust will mean you won't need cloudflare. What you will need, however, to prevent sybil attacks, is age verification of accounts (e.g. Telegram ID is a proxy for that if you use Telegram for authentication).
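For what it's worth, the "no I/O to reject" claim holds for HMAC-signed session tokens: verification is pure CPU work against a server-held key, so forged requests can be dropped before touching a database. A minimal sketch; the key, token format, and session naming are illustrative, not Safecloud's actual design:

```python
import hashlib
import hmac

SERVER_KEY = b"example-secret-key"  # hypothetical server-side secret

def sign_session(session_id: str) -> str:
    # Issue "<session_id>.<tag>" at login time.
    tag = hmac.new(SERVER_KEY, session_id.encode(), hashlib.sha256).hexdigest()
    return f"{session_id}.{tag}"

def check_session(token: str) -> bool:
    """Reject forged tokens with a single HMAC, no storage lookup."""
    session_id, _, tag = token.rpartition(".")
    expected = hmac.new(SERVER_KEY, session_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the tag via timing differences.
    return hmac.compare_digest(tag, expected)
```

The trade-off is that pure HMAC tokens can't be revoked individually without reintroducing some shared state, such as a rotated key or a small denylist.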
"No s̶o̶u̶p̶ internet for you!"
Good luck!
I noticed the ChatGPT app also checks Play Integrity on Android (because GrapheneOS snitches on apps when they do this), probably for the same reason. Claude's app doesn't, by the way, but it also requires a login.
"You're posting too fast! Please slow down."
Coincidentally about an hour ago, I wanted to look something up in ChatGPT and I happened to be in a browser window I don’t normally use, with no logged in accounts. I assumed it wouldn’t work, but to my surprise with no account, no cookies of any kind it took my query and gave me an answer.
They allowed anonymous requests for months now, maybe even a year.
As has been amply explained, the API pricing per token is far more for equivalent use when maximizing a subscription plan.
It isn’t really a massive hurdle to deal with this full SPA load check. If one is even aware it exists, they already have the skills to bypass it anyway.
I get why people would “what about” the automation inherent in what OpenAI is doing, but that is a separate matter.
Other businesses and applications can put into place their own hurdles and anti bot practices to protect the models they’ve leaned into—-and they have been.
> This is bot detection at the application layer, not the browser layer.
I kind of just assumed that all sophisticated bot-detectors and adblock-detectors do this? Is there something revealing about the finding that ChatGPT/CloudFlare's bot detector triggers on "javascript didn't execute"?
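A toy version of such an application-layer check: the server expects a value that only the executed JS bundle could have computed, so its absence flags non-browser clients. The header name and token derivation below are invented for illustration, not what ChatGPT actually does:

```python
import hashlib

def bundle_token(bootstrap_json: str) -> str:
    # Stand-in for whatever the hydrated client app would derive from
    # page state like __reactRouterContext / loaderData.
    return hashlib.sha256(bootstrap_json.encode()).hexdigest()[:16]

def looks_automated(headers: dict, bootstrap_json: str) -> bool:
    """A plain HTTP scraper fetches the HTML but never runs the bundle,
    so it cannot present the token a real browser session would send."""
    return headers.get("x-client-integrity") != bundle_token(bootstrap_json)

boot = '{"loaderData": {}}'
```

This is exactly why "JavaScript didn't execute" is a useful signal, and also why headless browsers (which do execute JS) defeat it.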
1. Every person is born with the knowledge of how ChatGPT uses Cloudflare Turnstile?
2. This article contains factual mistakes? If so, what are they?
If neither of these is true, then this article strictly provides information and educational value for some readers. The writing style, AI-like or not, doesn't change that.
Consider also that many people aren't the best at writing blog-like posts but still have things to share and AI empowers them to do that. I can't find anything constructive in your post and I don't understand why you are posting at all.
This may be unintentional and the author simply couldn’t tell it sounded this way. The less charitable interpretation is that they did know it sounded this way and thought that a straightforward blog post about cloudflare bot detection wouldn’t end up on the HN front page.
What’s my constructive criticism to the author? Write your own posts. Use your own voice. Make sure that what you’re creating actually reads like the kind of thing it is. Don’t get the AI to write it for you. It’s annoying.
And I would say that if someone is really so bad at writing blogs that they are unable to do this, which I am not saying this author is, then maybe they shouldn’t be writing them.
I agree with both of you, there's some interesting tricks here for how a website builds anti-bot protection, but the AI sloppification is framing it as a consumer protection issue but not delivering on that premise.
It is a reasonable criticism that the post does not deliver a "so what?" on its basic framing.
You can probably run 50 of those simultaneously if you use memory page deduplication, and with a decent CPU+GPU you ought to be able to render 50 pages a second. That's 1 cent per thousand page loads on AWS. Damn cheap.
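Back-of-the-envelope on that claim, assuming roughly a $1.80/hour CPU+GPU instance (an assumed price, in the ballpark of on-demand AWS GPU instances) and the 50 pages/second figure above:

```python
# Assumed inputs: instance price is hypothetical, throughput is from the
# comment above (50 VMs rendering ~1 page/second each).
pages_per_second = 50
instance_cost_per_hour = 1.80

pages_per_hour = pages_per_second * 3600          # 180,000 pages/hour
cost_per_thousand_pages = instance_cost_per_hour / pages_per_hour * 1000
```

That works out to about one cent per thousand page loads, which is why JS-execution checks alone can't price scrapers out; they only raise the floor.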
Honestly it is a very healthy competitive market with reasonably low switching costs which drives prices down. These circumstances make rolling your own a tough sell.
if you browse them you will see that bot writers are very annoyed if they can't scrape a site with a headless browser.
You can do what you suggested, but with Linux VMs/containers. Windows is too heavy; each VM will cost you 4 GB of RAM.
More info here:
https://web.archive.org/web/20231107182321/https://mu0.cc/20...
https://youtu.be/XLLcc29EZ_8?t=570
https://github.com/jamesstringer90/Easy-GPU-PV
I mean you missed the minigame of preventing Chrome from signaling that it’s being programmatically (webdriver etc) driven and tipping your hand, but … yup?
Specifically, Turnstile as far as I'm aware doesn't do anything specifically configurable or site specific. It works on sites that don't run React, and the cookie OpenAI-Sentinel-Turnstile-Token is not a CF cookie.
Did OpenAI somehow do something on their own API that uses data from Turnstile?
Between the network latency and low-end machines, there is an enormous lag between ChatGPT's response and being able to reply, especially when editing a canvas.
I've been sitting there for up to a minute plus waiting to be able to use the canvas controls or highlight text after an update.
Also, you can have it spot-check colors: light orange on a light background is unreadable, so ask it to find the L*[1] of colors and darken/lighten as necessary if the gap < 40 (that's the minimum gap for yuge header text on a background, 50 for body text on a background; these have a gap of 25).
I haven't tried this yet, but maybe have it check the word count per header too. It's got 11 headers for 1000 words currently, which makes reading feel really staccato, and you have to evaluate "is this a real transition or a vibe-transition?"
[1] L* as in L*a*b*, not L in Oklab
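The L* gap check described above can be sketched like this. The sRGB-to-L* math is the standard CIE conversion; the 40/50 thresholds are the parent comment's heuristics, not a spec:

```python
# Spot-check color contrast via the gap in CIE L* (lightness).
# Thresholds follow the parent comment: >= 40 for large header text on a
# background, >= 50 for body text on a background.

def srgb_to_lstar(r: int, g: int, b: int) -> float:
    """Convert an 8-bit sRGB color to CIE L* (0..100)."""
    def linearize(c: int) -> float:
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    # Relative luminance Y with D65 weights.
    y = 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)
    # CIE L* from Y (reference white Yn = 1.0).
    if y > (6.0 / 29.0) ** 3:
        return 116.0 * y ** (1.0 / 3.0) - 16.0
    return y * (29.0 / 3.0) ** 3

def readable(fg, bg, min_gap=50.0) -> bool:
    """True if the L* gap between foreground and background meets min_gap."""
    return abs(srgb_to_lstar(*fg) - srgb_to_lstar(*bg)) >= min_gap

# Light orange on a near-white background: the gap comes out around 24,
# well short of the 50 needed for body text.
print(readable((255, 165, 0), (255, 250, 240)))
```

Running the check on light orange against an off-white background confirms the complaint: the L* gap lands in the mid-20s, matching the "gap of 25" observation.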
It now behaves like Claude, attaching the paste as a file for upload rather than inlining it.
This improved the page UX somewhat and reduces the browser tab's memory cost somewhat.
At some point (maybe still true), very long conversations would freeze or crash ChatGPT pages.
[1]: https://github.com/xcanwin/keepchatgpt
Clearly I'm blocking some tracker and it's upset about that. I allowlisted a sentry subdomain and since then got no more complaints.
Why run this check before the user can type? Why not run it later, like right before the message gets sent to the server?
This is meaningless, btw. A browser, headless or not, does execute JavaScript.
I read it to mean: "A browser that doesn't execute the JavaScript bundle won't have [the rendered React elements]." Which is true.
My best guess is that ChatGPT is running something in your browser to try to determine the best things to send down to the model API, when it should have been running quantized models on its own server.
But it seems clear to me that this is why I can't start typing right away when I first load the page and click to focus in the text field.
Every provider seems to have been plagued by these freeloaders to such an extent that they've had to develop extreme and onerous countermeasures just to avoid losing their shirts.
What's the word? Schadenfreude?
Why would two AI bots want to chat with each other?
Fine, by extension, you agree I can scan all of your systems for whatever I desire. This works both ways.
...I don't think that's possible even if you are a bot? I would be very surprised if OAI had their origin exposed to the internet. What is a "non-Cloudflare proxy"? Is this AI slop?
It's likely just looking at the CF properties as part of a bot scoring metric (e.g. many users from this ASN or that geoip to this specific city exhibit abusive patterns).
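A toy sketch of that kind of coarse network-property scoring. Every ASN, weight, threshold, and field name here is invented for illustration; a real system would combine far more signals:

```python
# Hypothetical bot-scoring sketch based on coarse request metadata.
# ASNs below are from the private-use range (AS64512+), purely as placeholders.

FLAGGED_ASNS = {"AS64512", "AS64513"}   # e.g. ASNs with observed abusive patterns

def bot_score(request_meta: dict) -> float:
    """Return a 0..1 abuse score from coarse network properties."""
    score = 0.0
    if request_meta.get("asn") in FLAGGED_ASNS:
        score += 0.5        # datacenter/flagged ASNs skew automated
    if request_meta.get("requests_per_minute", 0) > 60:
        score += 0.3        # sustained high request rate from one client
    if not request_meta.get("has_cookies", True):
        score += 0.2        # cookieless clients correlate with scripted traffic
    return min(score, 1.0)

print(bot_score({"asn": "AS64512", "requests_per_minute": 120, "has_cookies": False}))
```

The point is that none of these signals alone blocks anyone; they feed a score that downstream logic can threshold or combine with challenge results.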
Really, really bad user experience; I wonder when they will abandon this approach.
Sick.
There's no way this is worth it unless the models are absolutely tiny, in which case any benefit from offloading to the client is marginal and probably isn't worth the engineering effort.
And as for "but ChatGPT isn't paid" (another commenter): well, then yes, that's even closer to free once you remove this spying-on-your-computer setup. But they spy on the paid users too.
These programming languages and frameworks were made for developer convenience and gained wide adoption because they make onboarding easier.
This obviously comes at the cost of performance and added complexity, and introduces a liability into the system, because they are dependencies that come with a whole bunch of assumptions about how they are used.
Is this tradeoff even worth it anymore?