How an OpenAI bot crushed this seven-person company’s website ‘like a DDoS attack’

On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked like some kind of distributed denial-of-service attack.

He soon discovered that the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, and every product has a page,” Tomchuk told TechCrunch. “Each page contains at least three photos.”

OpenAI was sending “tens of thousands” of server requests as it tried to download all of it: hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IP addresses to scrape data, and we’re still analyzing logs from last week, so it’s probably a lot more than that,” he said of the IP addresses the bot used as it attempted to consume his site.

“Their crawlers were crashing our site, it was basically a DDoS attack,” he said.

Triplegangers’ website is its business. The seven-employee company has spent more than a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells 3D object files, as well as images — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, and anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, which have their own tags, according to its information page on its crawlers.)

The robots.txt file, part of the Robots Exclusion Protocol, was created to tell search engine crawlers what not to crawl as they index the web. OpenAI says on its information page that it respects such files when they are configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.

As Tomchuk learned, if a site isn’t using robots.txt properly, OpenAI and others take that to mean they can scrape it to their hearts’ content. It is not an opt-in system.
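
For reference, telling OpenAI’s crawlers to stay away amounts to listing their published user-agent names, GPTBot, ChatGPT-User, and OAI-SearchBot, in robots.txt. A minimal sketch, assuming a site owner wants to block all three from the entire site, might look like this:

    # robots.txt: example rules blocking OpenAI's published crawlers from the whole site
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /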

To make matters worse, not only was Triplegangers taken offline by the OpenAI bot during US business hours, but Tomchuk expects a higher AWS bill thanks to all the CPU and download activity from the bot.

Robots.txt also isn’t fail-safe. AI companies comply with it voluntarily. Another AI startup, Perplexity, was called out last summer by a Wired investigation when evidence suggested Perplexity wasn’t honoring it.

A Triplegangers product page: each scan is a product, with a page that includes multiple images. Used with permission. Image credits: Triplegangers

No way to know for sure what was taken

By Wednesday, after days of the OpenAI bot returning, Triplegangers had a properly configured robots.txt file in place, as well as a Cloudflare account set up to block GPTBot and several other bots it had discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he has blocked crawlers from other AI model companies. The site did not crash on Thursday morning, he said.

But Tomchuk still has no reasonable way of knowing exactly what OpenAI managed to take, or of getting that material removed. He couldn’t find a way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is a particularly thorny problem for Triplegangers. “We are in a business where rights are a serious issue, because we scan actual people,” he said. Under laws like Europe’s General Data Protection Regulation, “they cannot just take a photo of anyone on the web and use it.”

Triplegangers’ site was also a particularly delicious find for AI crawlers. Multibillion-dollar startups, like Scale AI, have been built on humans painstakingly tagging images to train AI, and Triplegangers’ photos are tagged in detail: race, age, tattoos vs. scars, all body types, and so on.

The irony is that it was the OpenAI bot’s greed that alerted Triplegangers to how exposed it was. Had the bot scraped more gently, Tomchuk would never have known, he said.

“It’s scary, because there seems to be a loophole these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how relentlessly the OpenAI bot accessed the site, from hundreds of IP addresses. Used with permission.

He wants other small online businesses to know that the only way to discover whether an AI bot is taking a website’s copyrighted property is to actively look. He is certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem also grew in 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024, that is, traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” Tomchuk warns. “Now we have to monitor log activity daily to spot these bots.”
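
For site owners who want to do that kind of monitoring, one rough approach is to tally requests in the web server’s access log by user agent. The sketch below assumes a combined-format log at the hypothetical path access.log and a hand-maintained list of crawler names; both are illustrative assumptions, not anything Triplegangers described using:

    # Count requests per suspected AI-crawler user agent in a combined-format access log.
    # The log path and the crawler name list are illustrative assumptions.
    import re
    from collections import Counter

    CRAWLER_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

    hits = Counter()
    ips = {token: set() for token in CRAWLER_TOKENS}

    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            # In the combined log format, the user agent is the last quoted field.
            match = re.search(r'^(\S+).*"([^"]*)"\s*$', line)
            if not match:
                continue
            ip, user_agent = match.group(1), match.group(2)
            for token in CRAWLER_TOKENS:
                if token in user_agent:
                    hits[token] += 1
                    ips[token].add(ip)

    for token, count in hits.most_common():
        print(f"{token}: {count} requests from {len(ips[token])} distinct IPs")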

If you think about it, the whole model works a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking for permission, not just collecting data,” Tomchuk says.
