First of all, let me explain what "hapax legomena" are: the term refers to words (and, by extension, concepts) that occur just once throughout an entire corpus of text. An example is the word "hebenon", occurring just once within Shakespeare's Hamlet. Therefore, "hebenon" is a hapax legomenon. The "hapax legomenon" concept itself is a kind of hapax legomenon, IMO.
According to Wikipedia, hapax legomena are generally discarded from NLP as they hold "little value for computational techniques". By extension, the same applies to LLMs, I guess.
While "hapax legomena" originally refers to words/tokens, I'm extending it to entire concepts, described by these extremely unknown words.
I am a curious mind, actively seeking knowledge, and I'm constantly trying to learn a myriad of "random" topics across the many fields of human knowledge, especially rare/unknown concepts (that's how I learnt about "hapax legomena", for example). I use three LLMs on a daily basis (GPT-3, Llama and Gemini), expecting to get to know about words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge, but I now know, according to Wikipedia, that general LLMs won't point me to anything "obscure" enough.
This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax? Are there LLMs that favor less frequent data over more frequent data?
I don't think LLMs are the right tool for this. They're built to find statistically likely correlations, patterns etc. That's why they tend to give the correct answer (at least to simple questions) and why they produce legible output in the first place. And you want kind of the opposite of that. Something that's unlikely. But that goes against how they work. Of course you can ask them to surprise you and do something unexpected. And they'll try to do something with that. But I doubt it'll get you anywhere, and I think it's a fundamental limitation. A better tool would be traditional statistics: go through datasets counting frequencies, and you can find your hapax legomena precisely. And I mean some linguists have probably already studied this and you could also read their publications...
I've also fooled around with LLMs and in my experience, they don't perform well on uncommon things. If something isn't in their dataset, or appears just once, they'll struggle with the concept.
Me too. They hallucinate. And sometimes I learn things through these hallucinations, when I ask them about an uncommon thing. However, they won't give the uncommon thing, I'm the one who usually feeds the prompt with the uncommon thing for them to hallucinate. Indeed, what I'm seeking is likely the exact opposite of what's expected for LLMs: the extremely uncommon, close to complete hallucination and stochastic behavior.
I usually do it in a laughable "poor man's way" via Node.js + RegEx + a JS key-value dictionary object (whose key is the token and whose value is a number that increases each time this token is found during iteration), downloading some JSON/TXT/CSV dataset, reading and parsing it, then iterating over its tokens. It consumes a lot of memory, time and CPU (though I try to use a sleep/delay every N iterations in order to free the CPU from high loads). I know there are better ways, and a temperature/param-adjustable LLM seemed to me like a better way, hoping that there's some exception across the many LLMs publicly available that wouldn't discard hapaxes.
The things I'm willing to discover and learn weren't/aren't so well studied. I mean, human knowledge is a really vast universe of concepts, names and ideas, some of them got buried by time (sometimes centuries or millennia). Someone has to dig them because they could hold value, knowledge value. One of my purposes with this inquiry over the unknown is to find these really forgotten ideas and concepts, things never studied before, and try and study them, learn about them. That's how things were rediscovered throughout the entire human history: treasures are buried by the passage of time, and a curious person digs them, and humanity gets to know them once again. And a potential source of knowledge lurking in oblivion is the big data, or big datasets.
Is going through text really that computationally expensive? I guess the English language only uses a few thousand words frequently, plus some names and rare words. I'd imagine you can comfortably keep them in RAM next to a counter variable for each bucket. That should allow going through practically any book on earth on a regular computer, if I'm not mistaken. I'm not sure if that's I/O bound or CPU bound, but it shouldn't be that hard. It's something that gets taught in the first 3 semesters of computer science at university.
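To make the "counter per word in RAM" idea concrete, here is a minimal Python sketch. It assumes plain-text input and a naive regex tokenizer; the function names, the regex and the sample sentences are made up for illustration and are not taken from any particular tool. It streams the text line by line, so memory grows with the number of distinct words rather than with the size of the text.

```python
import re
from collections import Counter

# Naive tokenizer (an assumption for illustration): runs of letters,
# optionally joined by internal apostrophes. Real corpora need more care.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_tokens(lines):
    """Count lowercased tokens from any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def hapax_legomena(counts):
    """Return, in sorted order, the tokens that occur exactly once."""
    return sorted(token for token, n in counts.items() if n == 1)


# Made-up sample text standing in for a real corpus.
sample = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]
counts = count_tokens(sample)
hapaxes = hapax_legomena(counts)
print(hapaxes)
```

For a real file, an open file handle can be passed instead of the list, since iterating over a file yields one line at a time (for example a hypothetical corpus.txt opened with encoding="utf-8"). Whether a word counts as a hapax still depends entirely on how the text is tokenized and normalized.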
Regarding the hallucinations: There are two use-cases: If you want some creative output that doesn't need to be correct, you're fine. You'll be doing art like the people who manipulate electronic children's toys and musical instruments to coerce some strange sounds out of them. I think that's called "circuit bending". You could also de-tune the parameters of an LLM, tinker around a bit and mess with the settings. Feed it random garbage prompts and see what it'll do. I guess that's an interesting art project.
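One of the settings usually meant by "de-tuning" is the sampling temperature. As a rough sketch of the idea only (the scores below are made up, and this is not the code of any actual LLM library), temperature divides the model's raw scores before they are turned into probabilities, so a higher value flattens the distribution and gives unlikely tokens a bigger chance of being picked:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens: common, uncommon, rare.
logits = [5.0, 2.0, 0.0]
p_default = softmax_with_temperature(logits, temperature=1.0)
p_hot = softmax_with_temperature(logits, temperature=3.0)
print(p_default)
print(p_hot)
```

With the higher temperature the rare token gets picked more often, but only because the output becomes more random. Nothing here makes the model know more about rare things.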
But if you want something that has to do with factuality or needs to be correct, the hallucinations will get in your way. A "hapax legomenon", or unique word, is a well-defined (objective) thing. It doesn't really help if the LLM returns some pretend answer. It might look interesting at first, but it won't be a unique word by real-world definition. And that's why I don't think an LLM can help in this case.
I've tried asking it the title and author of some children's story which I heard at a first communion ceremony at church. I tried googling that but all the church pages attribute the story to some random authors. So I tried asking Llama and ChatGPT but they wholeheartedly make something up. I've tried like 20 times but all they return is made up and false. So it doesn't help. And I guess those more contemporary religious books just aren't in any dataset. And the LLM will just do something random in this case. As it'll do with everything that's rare or missing in the dataset (and it can't infer it).
Another thing I did (concerning language) is ask AI about idioms and figures of speech. Initially I did this because I'm not a native English speaker and figures of speech are very nice concepts. They can make your text more flowery or funny, and they always come with some interesting story of origin. But you have to learn and memorize them for later use, because they vary widely from country to country. And LLMs are really good at translating. And they indeed do well with that. And occasionally they'll hallucinate some idiom. Which can be hilarious. It won't be something that fits the definition of the term. But it definitely sparks my creativity at times. At least it makes me laugh.
And writing prose and longer stories with AI also shows their preference for likely things. They always try to push my stories towards some lame and common story arcs. Do super obvious plot twists. And lots of models (not all of them) always push towards resolving story arcs and a happy ending. And it's difficult to impossible to overcome. It tends to get better with their size and "intelligence", but I don't think any of the current LLMs is close to being useful with that.
So summed up: You said in another comment that computational linguistics discards unique words because they have little value and additionally they get in the way. There is probably some reason for that; computer scientists generally aren't stupid and I suppose they tried, and put some thought into it. An LLM just can't make sense of the concept. It needs more training data to learn something. A unique word will just mess with the weights and shift them into some random direction. Likely degrading the LLM in some minuscule way. That's why they discard them. And even if they didn't do it, the LLM couldn't memorize a word if it's only there once. And if you put it into the dataset multiple times, an LLM could learn it... But it won't be unique anymore. So I don't see how it'd work. And also my experience tells me they generally don't do well with rare things.