I was hoping to play around with the dataset over the weekend to toy with some text-embedding techniques, but they’ve pulled the cord on the download links.
Anyone have a copy of the full archive they’re willing to share, or a magnet link?
404? another source please? I don't trust them on this exact thing.
I was hoping people would do this!!!
If they were on OPEN servers, I doubt they cared that much.
"anonymized" sure. I highly doubt they read every message. I'm sure there is lots of de-anonymizing information in the messages themselves
For example--
Anon1: "hey jeff, wanna play Minecraft?"
Anon2: "sure"
Thus we know Anon2's name is Jeff. I imagine there's a lot of this.
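To make that concrete, here's a rough sketch of how little it takes. The sample messages, the greeting regex, and the `guess_names` helper are all made up for illustration (nothing here comes from the actual dataset or how it was processed), and a real attempt would need far better heuristics than this.

```python
import re

# Hypothetical sample of "anonymized" messages: usernames replaced, text untouched.
messages = [
    ("Anon1", "hey jeff, wanna play Minecraft?"),
    ("Anon2", "sure"),
]

# Assumed heuristic: the word right after a greeting is probably a name.
GREETING = re.compile(r"\b(?:hey|hi|hello|thanks)\s+([a-z]+)\b", re.IGNORECASE)


def guess_names(log):
    """Guess each replier's name from how the previous speaker greeted them."""
    guesses = {}
    # Walk over consecutive message pairs: (greeting message, reply).
    for (prev_author, prev_text), (author, _) in zip(log, log[1:]):
        match = GREETING.search(prev_text)
        if match and author != prev_author:
            guesses[author] = match.group(1).capitalize()
    return guesses


print(guess_names(messages))
```

On the two-line exchange above this ties Anon2 to "Jeff". It would also misfire constantly ("hey guys", "thanks man"), which is kind of the point: even a sloppy pass pulls names out of text nobody scrubbed.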
Shit. My name is Jeff. Now they know
"scraped" via API? I don't think It means what you think it means.
wtf…… going to get worse after IPO!
If you don't want strangers knowing what you say, don't join open servers. It's pretty easy.
Open or closed, it's going to get worse!
If they release closed Discord chats, they may as well go out of business. People will flee.
They and other companies are already doing so.
That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secure and probably far more useful than in Discord’s hands alone.
Not for messages that are supposed to be private lol. Let me just make a copy of all texts you've sent over the last decade, for "archiving".
If you think messages you post anywhere on the internet are private, you're in for a bad time.
Texts are sent in plain-text and I wouldn't recommend discussing anything you'd like to keep private via text.
This says it was done via the API so they wouldn't be private messages.
Great news for open source AI.
Ooh! Do Teams next
I see a lot of drama here in the thread, people decrying data leaks, how Discord is very very bad, and a number of people wanting the "good old days" of forums.
Yes. I like forums too, but, uh...
These researchers scraped publicly posted messages. Keyword here being "public". How would anything similarly public, like a forum, be better?
I actually remember the times when forums were at their peak. I hung out on BZPower for Bionicle things, and the Relic News Forum for Homeworld modding. You know what they had? Google bots that scraped messages, looked for certain words, and populated websites with advertisements based on what they could scrape from forums.
Pretty sure Lemmy doesn't do encryption either, unless there's some very special, private Lemmy server that nobody has access to. So the researchers could've just as well scraped the fediverse.
People saw “scraped Discord messages” and immediately jumped to “oh shit fuck my private chats have been leaked everybody panic”.
People in general have no idea and just want to get spun up on drama and manufactured outrage.
Same thing happened when people started scraping Twitter 10-15 years ago.
How would anything similarly public, like a forum, be better?
Forums were the primary way that groups would talk with one another pre-global scale social media.
They could contain public subforums, but the majority of all of the forums that I've been a part of were not viewable without an account, which was manually approved or required a small payment (to make bans have a chance to actually stick).
Yeah this being just as easy on bb forums or literally any webpage with a public comment section was my first thought as well..
Isn't most of the internet scraped anyways, by the internet archive? The concerning part is that this is 100% going to be used to train some coomer brained AI. Scraping, botting, scamming: all those things are going to happen on large public communities.
Yeah, a lot of this push is about ushering in new laws to prevent data scraping.
Propaganda spreads easily through fake accounts—but how do we detect large-scale operations if they’re constantly creating and deleting accounts or trying to blend in with the rest of us? We’d need access to massive data sets to mine for patterns and expose coordinated behavior.
But the powers that benefit from shaping the narrative are the same ones pushing the idea that all scraping is bad. They want people to hate it, so they can justify laws that lock down access. That’s the end game.
So basically discord finally got a usable search. I count that as a win.
Saving this article for the next time someone says "Just message me on Discord, it's easier".