this post was submitted on 26 Jul 2024
660 points (97.4% liked)

Technology

57895 readers
4958 users here now

This is a most excellent place for technology news and articles.


founded 1 year ago
[–] [email protected] 1 points 1 month ago

I don't have any more info on it, but I can prove it

[–] [email protected] 1 points 1 month ago

Net neutrality?

[–] [email protected] 3 points 1 month ago (1 children)

It's still possible to search with "site:reddit.com ..."

Has it been implemented yet, or are they blocking non-flagged searches? That would seem odd.

[–] [email protected] 0 points 1 month ago (1 children)

You shouldn't be getting any new results if you do that; older posts may remain indexed, though.

[–] [email protected] 1 points 1 month ago

Aha. I was wondering about that possibility.

[–] [email protected] 4 points 1 month ago (1 children)

I'm kind of curious to understand how they're blocking other search engines. I was under the impression that search engines just viewed the same pages we do to search through, and the only way to 'hide' things from them was to not have them publicly available. Is this something that other search engines could choose to circumvent if they decided to?

[–] [email protected] 11 points 1 month ago (1 children)

Search engine crawlers identify themselves (via their user agent), so they can be blocked both through an honor-based system (robots.txt) and through active blocking (returning HTTP 403 or similar) when they try anyway.
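Both mechanisms can be sketched in a few lines. This is illustrative only: the robots.txt rules and blocked user agents here are made up, not Reddit's actual configuration.

```python
# Sketch of the two blocking mechanisms described above.
from urllib.robotparser import RobotFileParser

# 1) Honor-based: a polite crawler parses robots.txt and obeys it.
#    (These rules are hypothetical: allow Googlebot, disallow everyone else.)
robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# 2) Active blocking: the server checks the User-Agent itself and
#    returns 403 regardless of whether the crawler is polite.
BLOCKED_AGENTS = ("Bingbot", "DuckDuckBot")  # illustrative list

def handle_request(user_agent: str) -> int:
    """Return an HTTP status code for a request with this User-Agent."""
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        return 403  # Forbidden
    return 200
```

The honor-based rules only work on crawlers that choose to respect them; the 403 path is what actually stops an identified crawler.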

[–] [email protected] 2 points 1 month ago (1 children)

Thank you, I understand better now. So in theory, if one of the other search engines chose not to have its crawler identify itself, it would be more difficult to block.

[–] [email protected] 1 points 1 month ago

This is where you get into the whole webscraping debate you also have with LLM "datasets".

If you, as a website host, detect a ton of requests coming from a single IP, you can block that address. There are ways around that, like making the requests from different IP addresses, but there are ways to detect that too!
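The per-IP detection described here is usually a sliding-window counter. A minimal sketch, with made-up thresholds:

```python
# Hypothetical rate limiter: count each IP's requests in a sliding
# window and flag addresses that exceed the limit.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 100  # illustrative threshold

_requests: dict = defaultdict(deque)

def is_blocked(ip: str, now: float = None) -> bool:
    """Record one request from `ip` and report whether it exceeds the limit."""
    now = time.monotonic() if now is None else now
    window = _requests[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

A scraper hammering one address trips the limit quickly, which is exactly why scrapers spread requests across many IPs.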

I'm not sure whether Reddit would try to sue Microsoft or DDG if they started serving results anyway through such methods; I don't believe it's explicitly disallowed. But if you were hoping to deal with Reddit in any way in the future, I doubt a move like this would get you into their good graces.

All that is to say: I won't visit Reddit at all anymore now that its results won't even show up when I search for something. This is a terrible move, and it will likely fracture the internet even more as other websites look to replicate this additional source of revenue.

[–] [email protected] 3 points 1 month ago (2 children)

They're also blocking posts by users who aren't banned or even got a warning. It appears to the user as though it's been posted, but it hasn't.

[–] [email protected] 14 points 1 month ago

Shadowbanning is a totally different issue that's existed for a long time, though.

[–] [email protected] 1 points 1 month ago (2 children)

Shadowbanning? Do you have more info on this?

[–] [email protected] 1 points 1 month ago

I didn't know there was a name for it. I don't have any more info on it, but I can show examples of it happening.

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago) (1 children)

They’ve done this for a long time. It’s supposedly only used on bots, but in practice it definitely isn’t limited to them.

[–] [email protected] 1 points 1 month ago

It definitely is in practice 100%

[–] [email protected] 3 points 1 month ago (1 children)

Hot take here.

I do believe in free information.

Instead of investing money in stopping crawlers, why not make the data they're trying to crawl available to everyone for free, so we can have a better world together?

[–] [email protected] 6 points 1 month ago (1 children)

Data transfer isn't free. It costs real money and energy to respond to queries. Don't be surprised to see ~50% of all requests made to your server be from bots which you may have no interest in servicing outside of search engine indexers.
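The "~50% of requests are bots" claim is easy to check against your own access logs, at least for bots that identify themselves. A rough sketch (the log lines and the substring heuristic are made up for illustration):

```python
# Naive estimate of what share of logged requests come from
# self-identified bots. Real logs and real heuristics are messier.
LOG_LINES = [
    '1.2.3.4 "GET /post/1" "Mozilla/5.0 (Windows NT 10.0)"',
    '5.6.7.8 "GET /post/1" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '9.9.9.9 "GET /post/2" "GPTBot/1.0"',
    '1.2.3.4 "GET /post/2" "Mozilla/5.0 (X11; Linux)"',
]

BOT_MARKERS = ("bot", "crawler", "spider")  # crude substring heuristic

def bot_share(lines: list) -> float:
    """Fraction of requests whose User-Agent looks like a bot."""
    hits = sum(any(m in line.lower() for m in BOT_MARKERS) for line in lines)
    return hits / len(lines)
```

Note this only catches honest bots; crawlers that spoof a browser User-Agent are invisible to it, which is the harder half of the cost.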

[–] [email protected] 2 points 1 month ago

If you published your data in a friendly manner, bots would have no need to crawl your site.

Data that is especially interesting and requested a lot could even be served over P2P.

This model would generate less cost than dealing with constant bot scrapers.

It's not a technical discussion, or a discussion about associated costs. It's a discussion about morals and economic models.

[–] [email protected] 4 points 1 month ago

Honestly? I'd be happy to not see their trash in any search engine I use.

[–] [email protected] 16 points 1 month ago (1 children)

I work for a different sort of company that hosts some publicly available user generated content. And honestly the crawlers can be a serious engineering cost for us, and supporting them is simply not part of our product offering.

I can see how reddit users might have different expectations. But I just wanted to offer a perspective. (I'm not saying it's the right or best path.)

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago) (3 children)

Can you use something like a DDoS filter to prevent automated AI scraping (too many requests per second)?

I'm not a tech person so probably don't even know what I'm talking about.

[–] [email protected] 2 points 1 month ago* (last edited 1 month ago) (1 children)

Blocking bots is hard, because with some work they can be made to look like users, down to simulating curved mouse movements from one button to the next if you are really ambitious.
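The "curved mouse movement" trick is just path math: instead of jumping in a straight line between two buttons, a bot replays points along a curve. A sketch using a quadratic Bézier curve (the coordinates are hypothetical; a real bot would feed these points to a browser-automation tool):

```python
# Generate a curved path between two screen positions, the kind of
# trajectory a bot could replay to look less like a straight-line machine.
def bezier_path(start, end, control, steps=20):
    """Points along a quadratic Bezier curve from start to end."""
    (x0, y0), (x1, y1), (cx, cy) = start, end, control
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x, y))
    return points

# Hypothetical move from one button to another, bowed toward a control point.
path = bezier_path((100, 400), (600, 120), control=(200, 100))
```

The path starts and ends exactly on the two buttons but never moves in a straight line between them, which defeats the crudest "perfectly linear movement" heuristics.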

[–] [email protected] 1 points 1 month ago (1 children)

So you're saying Reddit's activity analytics can't necessarily tell the difference between human activity and bot activity?

So the actual number of people using Reddit vs. bots isn't very clear. Someone should tell Reddit's shareholders that there's no way to tell whether the advertisements are actually being viewed by people, and no way to tell how much the activity reports have been inflated by bots. I bet they wouldn't like that very much.

[–] [email protected] 3 points 1 month ago

Always has been. Technically, the server sees no difference between what a browser does and what a bot does: downloading files and submitting requests.

[–] [email protected] 2 points 1 month ago

We have a variety of tactics and are always adding more.

[–] [email protected] 5 points 1 month ago* (last edited 1 month ago)

I worked with a company that used product data from competitors (you can debate the morals of it, but everyone is doing it). Their crawlers were set up so that each new batch of requests came from a new IP. I don’t recall the name of the service, and it wasn’t that many unique IPs, but it did allow their crawlers to run unhindered.

They didn’t do IP banning for the same reason, but they did notice that one of their competitors did not alter its IP when scraping them. With malicious intent, they could have changed the data around for that IP only, e.g. increasing or decreasing the prices so the competitor ended up with bad data.

I’d imagine companies like OpenAI have many times as many IPs and could do something similar, meaning that if you try to ban IPs, you might hit real users as well, which would be unfortunate.
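The rotation scheme described above is usually just a pool of exit addresses cycled per request. A minimal sketch (the proxy addresses are made up; a real setup would plug these into an HTTP client or use a commercial rotating-proxy service):

```python
# Round-robin rotation through a pool of proxy addresses, so consecutive
# requests appear to come from different IPs. No real network I/O here.
from itertools import cycle

PROXY_POOL = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
_next_proxy = cycle(PROXY_POOL)

def fetch_via_next_proxy(url: str):
    """Pair each request with the next proxy in the pool (illustrative)."""
    proxy = next(_next_proxy)
    return proxy, url
```

With only three addresses each one still recurs every third request, which is the detectable pattern the commenter's competitor fell into; commercial services rotate through far larger pools.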
