this post was submitted on 10 Mar 2024
139 points (97.3% liked)

Privacy


I feel like with the rise of AI something that anonymizes writing styles should exist. For example it could look for differences in American versus British spelling like color versus colour or contextual things like soccer versus football and make edits accordingly. ChatGPT could be fed a prompt that says "Rewrite the following paragraphs as if they were written by an Australian" but I don't know if it would have a good enough grasp on the objective or if it would start shoehorning in references to koalas and fairy floss.
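
Even without an LLM, the spelling-and-vocabulary part could be a simple find-and-replace pass. A toy Python sketch of the idea (the word list here is made up for illustration; a real tool would need a much larger dictionary plus grammar- and idiom-level rewrites):

```python
import re

# Illustrative US -> UK substitutions only; not an exhaustive list.
US_TO_UK = {
    "color": "colour",
    "flavor": "flavour",
    "organize": "organise",
    "soccer": "football",
    "truck": "lorry",
}

def britishise(text: str) -> str:
    """Replace American spellings/terms with British ones, preserving
    the capitalization of the first letter."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = US_TO_UK[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    pattern = re.compile(
        r"\b(" + "|".join(US_TO_UK) + r")\b", flags=re.IGNORECASE
    )
    return pattern.sub(swap, text)

print(britishise("The color of the truck"))  # -> "The colour of the lorry"
```

Of course this only scratches the surface; it does nothing about sentence structure, idiom, or the contextual tells a real stylometric analysis would pick up on.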

I tried searching online to see if something like this existed and found a few articles from around the 2010s such as Software Helps Identify Anonymous Writers or Helps Them Stay That Way by the New York Times. It talks about stylometry and Anonymouth but it seems like Anonymouth hasn't been updated in years. All recent articles seem to be about plagiarism and AI.

For context what got me thinking about the topic was remembering JK Rowling being revealed to be the author of a mystery novel called The Cuckoo’s Calling. Smithsonian wrote an article about it called How Did Computers Uncover J.K. Rowling’s Pseudonym?. I thought it could make for a neat post here.

top 50 comments
[–] [email protected] 16 points 8 months ago

There is a program built into Whonix, I believe it's called kloak, that randomizes your keyboard input timings so you can't be identified via keystroke-timing JavaScript. There's also research into defeating stylometric analysis, such as Anonymouth, but I'm sure there are plenty of newer tools; if anyone finds any that work well, please reply here, as I haven't looked in some years. 'Stylometric analysis' is the key phrase to search for.
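
The idea behind kloak can be sketched in a few lines: buffer each key event and release it after a randomized delay, so the inter-key intervals an in-page keylogger observes no longer reflect your natural typing rhythm. This is an illustration of the concept only; the real tool works at the Linux input-event layer, and the parameter names here are made up:

```python
import random

def jitter_delay(base_ms: float, max_jitter_ms: float = 50.0) -> float:
    """Return the delay (in ms) before releasing a buffered keystroke.

    Adding uniform random noise to every key event decorrelates the
    timings observable by JavaScript from the user's actual cadence.
    """
    return base_ms + random.uniform(0.0, max_jitter_ms)

# Example: natural inter-key gaps get masked by up to 50 ms of noise.
delays = [jitter_delay(10.0) for _ in range(5)]
```

The trade-off is latency: the more jitter you add, the harder timing attribution gets, but the laggier typing feels.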

With AI this will get worse (better identification based on typing styles), but it will also get better, because you can set up a local LLM and ask it to re-write your text in a certain style. Touching on this, everyone uses a combination of unique phrases and misspellings or mis-spellings (see?) of words, and with enough text from a given account the statistical confidence in an attribution is very high. It's how the Unabomber was identified after his manifesto was published: he used a very distinctive phrase, his brother recognized it, and his brother's wife convinced him to call the FBI tip line.
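
The statistical attribution described above can be illustrated with a toy comparison of function-word frequencies (a simplified take on Burrows-style stylometry; real tools use hundreds of features, this word list is just for illustration):

```python
import math
from collections import Counter

# Tiny set of function words; real stylometry uses hundreds of the most
# frequent words plus punctuation, character n-grams, and more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(text_a: str, text_b: str) -> float:
    """Euclidean distance between two texts' function-word profiles;
    smaller means more similar writing styles."""
    pa, pb = profile(text_a), profile(text_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

Given enough text, an analyst compares a questioned document's profile against known samples and ranks candidate authors by distance, which is roughly how The Cuckoo's Calling was attributed.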

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago) (1 children)

Yeah, it would need to be a browser extension adding a button to scramble every text input field. Or maybe even at the OS level, opening an input field above the browser's, so the original text is never entered into the browser at all.

[–] [email protected] 12 points 8 months ago (1 children)

Translate to some foreign language. Then translate to some other foreign language. Then translate back to your language. Congrats, your writing style changed.

[–] [email protected] 8 points 8 months ago (1 children)

Ah, the classic game of Google Translate Telephone.

[–] [email protected] 3 points 8 months ago (1 children)

Better to do the translations locally, so the original never leaves your device.

[–] [email protected] 3 points 8 months ago

Yes. I would use the privacy-focused ones (there are several on F-Droid). If your threat model includes anonymity against a state actor, such that they will be attempting to trace your writing style, you can be certain they could and would also just subpoena Google for matching translation requests. It would be a lot easier to work backwards and identify you that way.

[–] [email protected] 8 points 8 months ago (1 children)

My coworkers use ChatGPT for this. Since it always answers in the same generic way, it's helpful for anonymizing their peer reviews.

[–] [email protected] 3 points 8 months ago (2 children)

I don't understand people who want to anonymize their writing but then use ChatGPT to do it. OpenAI is not exactly a business I would trust with that.

[–] [email protected] 3 points 8 months ago

The concern here is less about OpenAI knowing; they worry more about their coworkers identifying them.

[–] [email protected] 2 points 8 months ago (1 children)

You can run your own offline instance. Not that randos are likely to, but still.

[–] [email protected] 2 points 8 months ago

You can look here; maybe you'll find something useful if you search for AI (150k+ apps).

[–] [email protected] 10 points 8 months ago* (last edited 8 months ago)

I wouldn't just trust random Lemmy users (no offense) but instead check the actual fields, e.g. stylometry or writeprints, and from there check the state of the art. Not being an expert makes that tricky, so I would take a recently published paper, e.g. https://arxiv.org/abs/2203.11849, to understand the challenge. As is always the case, it will review the field (e.g. section 2 here) and clarify the two sides of the arms race, in this case obfuscation/deobfuscation. For the former, section 3.2 mentions techniques the authors estimate to be good starting points, e.g. writeprintsRFC. I'd then search for such tools if the paper doesn't directly link to an open-source repository, e.g. theirs: https://github.com/reginazhai/Authorship-Deobfuscation. I would then try a recent one that I can easily set up, e.g. via Docker, and give it a go. Finally, I would read the rest of the paper, see who cites it, and try to get a more up-to-date picture.

TL;DR: I don't know, but there is dedicated research whose results I'd trust more than the opinions of strangers who are probably not experts.

[–] [email protected] 4 points 8 months ago (1 children)

Just ask ChatGPT to paraphrase.

[–] [email protected] 7 points 8 months ago (1 children)

Not a great solution unless you think you can trust OpenAI and their security implementation (which you shouldn't). We have seen simple PHP-scripted prompts in the past make the AI recount an entire conversation from another user. Not safe at all.

[–] [email protected] 1 points 8 months ago

Fair point. It depends on what the document is used for, I suppose, and whether such security is an issue (versus simply anonymising the style).

[–] [email protected] 3 points 8 months ago (2 children)

I wonder if Google Translate through multiple languages can do the trick?

[–] [email protected] 1 points 8 months ago

This, but translated offline for privacy.

[–] [email protected] 3 points 8 months ago

I feel like if someone wanted to give the impression that they were a non-English speaker, that might work. I think it would be limited to a surface level, though. Whoever used it would likely miss out on a lot of the common pitfalls someone learning a new language runs into, like mixing up the order of adjectives.

That and the content that is being run through a translator multiple times might get warped. I am not sure if going back and forth messes things up as badly as it did 10 years ago though.

[–] [email protected] 19 points 8 months ago* (last edited 8 months ago) (1 children)

There was a talk about detecting patterns and writing styles at Chaos Computer Congress a bunch of years ago.

The researchers also presented a tool to anonymize text as far as I can remember.

I will go look for the talk.

Edit: Found it!

https://media.ccc.de/v/31c3_-_6173_-_en_-_saal_g_-_201412291715_-_source_code_and_cross-domain_authorship_attribution_-_aylin_-_greenie_-_rebekah_overdorf

They talk about their software to find who wrote what, but also how to use that knowledge to write software that attempts to anonymize text.

[–] [email protected] 3 points 8 months ago

The New York Times article I linked mentioned that. I will have to watch that video though so I can get a better understanding of the mechanics of it. Thanks for the link.

[–] [email protected] 8 points 8 months ago

Non-native English speakers tend to mix up various styles; you could ask someone to paraphrase your text.

[–] [email protected] 10 points 8 months ago (1 children)

Probably the classic cut-and-paste-from-magazines approach, except you copy and paste the sentences you want to use. A lot of extra work, but no AI to rat you out.

[–] [email protected] 6 points 8 months ago

Serial killer style

[–] [email protected] 3 points 8 months ago (1 children)

Autocorrect?

If you use it before it has learned your writing idiosyncrasies?

[–] [email protected] 1 points 8 months ago

That would be an interesting way of doing it. Someone could probably couple that with predictive text for decent results.

[–] [email protected] 4 points 8 months ago* (last edited 8 months ago) (1 children)

ChatGPT will probably remember it was you who asked and doxx you in retaliation when it discovers you've plagiarized ChatGPT.

Another thought is to translate it into Scottish. But then again, you probably still want to be understood.

Changing dialect may be too small of a change. But if you could say write this like 1-2 generations younger/older using high school slang of the time you might get a useful difference.

[–] [email protected] 4 points 8 months ago

> Changing dialect may be too small of a change. But if you could say write this like 1-2 generations younger/older using high school slang of the time you might get a useful difference.

I feel like knowing the correct use of slang for a demographic would be a challenge and require a lot of constant research. Even if someone was to go off of slang younger people were using I feel like there's a risk of it being a regional term.

Trying to force it I'd probably end up with something like "Those elf bars be dripping but that extra popcorn lung was a vibe check on god" which gives off "How Do You Do, Fellow Kids?" vibes.

[–] [email protected] 4 points 8 months ago

I had asked for the same thing a while back but didn't really get much. The roundabout method I have found is to fine-tune FOSS LLMs on the data you want them to imitate (largely text) and then dive into some prompt engineering to get them to say something you like.

However, I haven't been able to find a test that can accurately show a text carries no trace of the specific weights it relies on. Cue the attacks on GPT-4 which de-anonymise data it was trained on. You might also want to read the literature on DPT and shadowing techniques for red-teaming LLMs and LLM-generated text.

Cheers