I check if a user agent has gptbot, and if it does I 302 it to web.sp.am.
Science Memes
Welcome to c/science_memes @ Mander.xyz!
A place for majestic STEMLORD peacocking, as well as memes about the realities of working in a lab.
Rules
- Don't throw mud. Behave like an intellectual and remember the human.
- Keep it rooted (on topic).
- No spam.
- Infographics welcome, get schooled.
This is a science community. We use the Dawkins definition of meme.
Research Committee
Other Mander Communities
Science and Research
Biology and Life Sciences
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- !reptiles and [email protected]
Physical Sciences
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
Humanities and Social Sciences
Practical and Applied Sciences
- !exercise-and [email protected]
- [email protected]
- !self [email protected]
- [email protected]
- [email protected]
- [email protected]
Memes
Miscellaneous
I'm pretty sure no one knows my blog and wiki exist, but it sure is popular, getting multiple hits per second 24/7 in a tangle of wiki articles I autogenerated to tell me trivia like whether the Great Fire of London started on a Sunday or Thursday.
When I was a kid I thought computers would be useful.
They are. Its important to remember that in a capitalist society what is useful and efficient is not the same as profitable.
I've suggested things like this before. Scrapers grab data to train their models. So feed them poison.
Things like counter factual information, distorted images / audio, mislabeled images, outright falsehoods, false quotations, booby traps (that you can test for after the fact), fake names, fake data, non sequiturs, slanderous statements about people and brands etc.. And choose esoteric subjects to amplify the damage caused to the AI.
You could even have one AI generate the garbage that another ingests and shit out some new links every night until there is an entire corpus of trash for any scraper willing to take it all in. You can then try querying AIs about some of the booby traps and see if it elicits a response - then you could even sue the company stealing content or publicly shame them.
Kind of reminds me of paper towns in map making.
What if we just fed TimeCube into the AI models. Surely that would turn them inside out in no time flat.
OK but why is there a vagina in a petri dish
I believe that's a close-up of the inside of a pitcher plant. Which is a plant that sits there all day wafting out a sweet smell of food, waiting around for insects to fall into its fluid filled "belly" where they thrash around fruitlessly until they finally die and are dissolved, thereby nourishing the plant they were originally there to prey upon.
Fitting analogy, no?
I was going to say something snarky and stupid, like "all traps are vagina-shaped," but then I thought about venus fly traps and bear traps and now I'm worried I've stumbled onto something I'm not supposed to know.
There should be a federated system for blocking IP ranges that other server operators within a chain of trust have already identified as belonging to crawlers. A bit like fediseer.com, but possibly more decentralized.
(Here's another advantage of Markov chain maze generators like Nepenthes: Even when crawlers recognize that they have been served garbage and they delete it, one still has obtained highly reliable evidence that the requesting IPs are crawlers.)
Also, whenever one is only partially confident in a classification of an IP range as a crawler, instead of blocking it outright one can serve proof-of-works tasks (à la Anubis) with a complexity proportional to that confidence. This could also be useful in order to keep crawlers somewhat in the dark about whether they've been put on a blacklist.
You might want to take a look at CrowdSec if you don't already know it.
Holy shit, those prices. Like, I wouldn’t be able to afford any package at even 10% the going rate.
Anything available for the lone operator running a handful of Internet-addressable servers behind a single symmetrical SOHO connection? As in, anything for the other 95% of us that don’t have literal mountains of cash to burn?
They do seem to have a free tier of sorts. I don't use them personally, I only know of their existence and I've been meaning to give them a try. Seeing the pricing just now though, I might not even bother, unless the free tier is worth anything.
Thanks. Makes sense that things roughly along those lines already exist, of course. CrowdSec's pricing, which apparently start at 900$/months, seem forbiddingly expensive for most small-to-medium projects, though. Do you or does anyone else know a similar solution for small or even nonexistent budgets? (Personally I'm not running any servers or projects right now, but may do so in the future.)
There are many continuously updated IP blacklists on GitHub. Personally I have an automation that sources 10+ of such lists and blocks all IPs that appear on like 3 or more of them. I'm not sure there are any blacklists specific to "AI", but as far as I know, most of them already included particularly annoying scrapers before the whole GPT craze.
--recurse-depth=3 --max-hits=256
Some details. One of the major players doing the tar pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional, nit LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.
Making nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a non zero portion of training data poisonous you'd have to spend increasingly many resources to filter it out. The better the nonsense, the harder to detect. Cloudflare is known it use small LLMs to generate the nonsense, hence requiring systems at least that complex to differentiate it.
So in short the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do not scrape instructions.
The fact the internet runs on lava lamps makes me so happy.