this post was submitted on 01 Oct 2024
89 points (82.0% liked)

Asklemmy

(page 2) 42 comments
[–] [email protected] 2 points 2 weeks ago

I think I read this post wrong.

I was thinking the sentence "We could be saving the world!" meant 'we' as in humans only.

No need to be training AI. No need to do anything with AI at all. Humans simply start saving the world. Our Research Papers can train on Reddit. We cannot be training, we are saving the world. Let the Research Papers run a train on Reddit AI. Humanity Saves World.

No cynical replies please.

[–] [email protected] 6 points 2 weeks ago

Because they are looking for conversations.

[–] [email protected] 1 points 2 weeks ago

Most research papers are likely as valid as an average reddit post.

Getting published is a circlejerk; papers are rarely properly tested, and hardly anyone actually reads them.

[–] [email protected] 17 points 2 weeks ago

Training it on research papers wouldn’t make it smarter, it would just make it better at mimicking their writing style.

Don’t fall for the hype.

[–] [email protected] 10 points 2 weeks ago (1 children)

They already do that. You're being a troglodyte.

[–] [email protected] 8 points 2 weeks ago (2 children)

Hmmm. Not sure if I'm being insulted. Is that one of those fish fossils that looks kind of like a horseshoe crab?

[–] [email protected] -2 points 2 weeks ago (1 children)

From Oxford Languages: troglodyte (noun): 1. (especially in prehistoric times) a person who lived in a cave. 2. a hermit. 3. a person who is regarded as being deliberately ignorant or old-fashioned.

[–] [email protected] 3 points 2 weeks ago (1 children)

Tons of people already are. The following site is useful for searching papers using AI: https://consensus.app/

[–] [email protected] 1 points 2 weeks ago

Thank you! That was thoughtful

[–] [email protected] 2 points 2 weeks ago* (last edited 2 weeks ago)

AuroraGPT. They are trying to do it.

It's because the number of people who can read, understand, and then create the dataset needed to train and test an LLM on research papers is very, very small, whereas pop-culture data is much easier to source.

[–] [email protected] 2 points 2 weeks ago (1 children)

Because the broken English and rigidly structured style of research papers would make for even worse training data than reddit posts.

[–] [email protected] 2 points 2 weeks ago

Came to wonder about this.

The few I've seen weren't shining examples of the language, and could have used some editing.

Rumours also abound that a lot of papers are available before review, and that's likely to cause some harm if we trust a model making predictions from bad data.

(Yes, I know: reddit isn't going to be better, but it comes with its own warning because, well, Reddit.)

[–] [email protected] 5 points 2 weeks ago (1 children)

Nobody wants an AI that talks like that.

[–] [email protected] 1 points 2 weeks ago (4 children)

I kind of think my question is WHY ARE WE FOCUSING ON TALKING TO IT?

[–] [email protected] 3 points 2 weeks ago

Because "AI" as we colloquially know it today means language models: they train on and produce language; that's what they're designed for. Yes, they can also produce images and videos, but they don't have any form of real knowledge or understanding; they only predict the next word or the next pixel based on their prompt and their vast trove of example words and images. You can only talk to them because that's what they are for.

Feeding it research papers will make it spit out research-sounding words, which will probably contain some correct information, but at best an LLM trained on that would be useful for searching through existing research; it would not be able to produce new research.
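
If it helps, here's a toy sketch of that next-word mechanism (a hypothetical mini-corpus, purely for illustration; real LLMs use neural networks over subword tokens, but the objective has the same shape):

```python
import random
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in a tiny corpus,
# then sample the next word from those counts. Note that nothing here
# checks whether the generated text is true, only that it is frequent.
corpus = "the reactor is made of cheese . the reactor is offline .".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word(prev):
    words, counts = zip(*following[prev].items())
    return random.choices(words, weights=counts)[0]  # sample by frequency

print(next_word("is"))  # "made" or "offline": plausible, not fact-checked
```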

[–] [email protected] 3 points 2 weeks ago

How does that help disempower the fossil fuel mafia?

[–] [email protected] 4 points 2 weeks ago* (last edited 2 weeks ago)

Part of it is the same "human speech" aspect that has plagued NLP work over the past few years. Nobody (except the poor postdoctoral bastard who is running the paper farm for their boss) actually speaks the way scholarly articles are written because... that should be obvious.

This combines with the decades of work by right wing fascists to vilify intellectuals and academia. If you have ever seen (or written) a comment that boils down to "This youtuber sounds smug" or "They are presenting their opinion as fact" then you see why people prefer "natural human speech" over actual authoritatively researched and tested statements.

And... while not all pay to publish journals are trash, I feel confident saying that most are. And filtering those can be shockingly hard by design.

But the big one? Most of the owners of the various journals are REALLY fucking litigious and will go scorched earth on anyone who is using their work (because Elsevier et al own your work) to train a model.

[–] [email protected] 21 points 2 weeks ago

You could feed all the research papers in the world to an LLM and it will still have zero understanding of what you trained it on. It will still make shit up, it can't save the world.

[–] [email protected] 6 points 2 weeks ago

We are. I just read an article yesterday about how Microsoft paid research publishers so they could use the papers to train AI, with or without the consent of the papers' authors. The publishers also reduced the peer review window so they could publish papers faster and get more money from Microsoft. So... expect AI to be trained on a lot of sloppy, poorly-reviewed research papers because of corporate greed.

[–] [email protected] 6 points 2 weeks ago

Anyone running a webserver and looking at their logs will know AI is being trained on EVERYTHING. There are so many crawlers for AI that are literally ripping the internet wholesale. Reddit just got in on charging the AI companies for access to freely contributed content. For everyone else, they're just outright stealing it.

[–] [email protected] 5 points 2 weeks ago

I saw an article about one trained on research papers. (Built by Meta, maybe?) It also spewed out garbage: it would make up answers that mimicked the style of the papers but had its own fabricated content! Something about the largest nuclear reactor made of cheese in the world...

[–] [email protected] 13 points 2 weeks ago (1 children)

Redditors are always right, peer reviewed papers always wrong. Pretty obvious really. :D

[–] [email protected] 1 points 2 weeks ago

Dank memes > science

  • tech bros, probably
[–] [email protected] 7 points 2 weeks ago (1 children)

The Ghost of Aaron Swartz

[–] [email protected] 3 points 2 weeks ago

What he was fighting for was an awful lot more important than a tool to write your emails while causing a ginormous tech bubble.

[–] [email protected] 7 points 2 weeks ago

They're trained on technical material too.

[–] [email protected] 85 points 2 weeks ago (2 children)

AI isn't saving the world lol

[–] [email protected] 21 points 2 weeks ago* (last edited 2 weeks ago)

Machine learning has some pretty cool potential in certain areas, especially in the medical field. Unfortunately the predominant use of it now is slop produced by copyright laundering shoved down our throats by every techbro hoping they'll be the next big thing.

[–] [email protected] 10 points 2 weeks ago (2 children)

It's marketing hype, even in the name. It isn't "AI" as decades of the actual AI field would define it, but credulous nerds really want their cyberpunkerino fantasies to come true so they buy into the hype label.

[–] [email protected] 12 points 2 weeks ago

The term AI was coined in 1956 at a computer science conference and was used to refer to a broad range of topics that certainly would include machine learning and neural networks as used in large language models.

I don't get the "it's not really AI" point that keeps being brought up in discussions like this. Are you thinking of AGI, perhaps? That's the sci-fi "artificial person" variety, which LLMs aren't able to manage. But that's just a subset of AI.

[–] [email protected] 4 points 2 weeks ago (1 children)

Yeah, these are pattern reproduction engines. They can predict the most likely next thing in a sequence, whether that's words or pixels or numbers or whatever. There's nothing intelligent about it and this bubble is destined to pop.

[–] [email protected] 2 points 2 weeks ago

That "Frightful Hobgoblin" computer toucher would insist otherwise, claiming that a sufficient number of Game Boys bolted together equals or even exceeds human sapience, but I think that user is currently too busy being a bigoted sex pest.

[–] [email protected] 9 points 2 weeks ago

Papers are, most importantly, documentation of exactly what procedure was performed and how; putting a vagueness filter over that only destroys their value.

The real question is why we are using generative AI at all (it gets money out of idiot rich people).

[–] [email protected] 26 points 2 weeks ago (2 children)

Because AI needs a lot of training data to reliably generate something appropriate. It's easier to get millions of reddit posts than millions of research papers.

Even then, LLMs simply generate text but have no idea what the text means. It just knows those words have a high probability of matching the expected response. It doesn't check that what was generated is factual.
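
A minimal sketch of why (hypothetical numbers): the training loss only scores whether the model predicted the token that actually appeared in the data, so factuality never enters the objective.

```python
import math

# One next-token training signal, boiled down (hypothetical values).
# Cross-entropy is low whenever the model matches the training text,
# regardless of whether that text was true.
predicted_probs = {"cheese": 0.7, "uranium": 0.2, "graphite": 0.1}
observed_next = "cheese"  # whatever the (possibly wrong) source text said

loss = -math.log(predicted_probs[observed_next])  # cross-entropy loss
print(f"loss = {loss:.3f}")  # ~0.357: a "good" prediction, true or not
```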

[–] [email protected] 1 points 2 weeks ago

Who says they’re not?

[–] [email protected] 39 points 2 weeks ago (1 children)

Both are happening. For generating an article, though, samples of casual writing are more valuable than research papers.

[–] [email protected] 8 points 2 weeks ago (1 children)

Yeah. Scientific papers may teach an AI about science, but Reddit posts teach AI how to interact with people and "talk" to them. Both are valuable.

[–] [email protected] 8 points 2 weeks ago (3 children)

Hopefully not too pedantic, but no one is “teaching” AI anything. They’re just feeding it data in the hopes that it can learn probabilities for certain types of output. It “understands” neither the Reddit post nor the scientific paper.

[–] [email protected] -1 points 2 weeks ago (1 children)

Describe how you 'learned' to speak. How do you know which word comes after the next? Until you can describe this process in a way that isn't exclusively 'human' or 'biological', it's no different. The only thing they can't do is adjust their weights dynamically, but that's a limitation we gave them, not one intrinsic to the system.

[–] [email protected] 5 points 2 weeks ago

I inherited brain structures that are natural language processors. As well as the ability to understand and repeat any language sounds. Over time, my brain focused in on only the language sounds I heard the most and through trial and repetition learned how to understand and make those sounds.

AI - as it currently exists - is essentially a babbling infant with none of the structures necessary to do anything more than repeat sounds back without understanding any of them. Anyone who tells you different is selling you something.

[–] [email protected] 2 points 2 weeks ago

Money. There's no money in saving the world; there's lots of money in not saving the world.

Greed will be humanity's downfall.

[–] [email protected] 4 points 2 weeks ago

Brain damage is cheaper than professionals
