GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Aatube@kbin.melroy.org · 2 years ago

some pirate@lemmy.dbzer0.com · 2 years ago

Horny gpt 2 strikes again

seaQueue@lemmy.world · 2 years ago

Sludgehammer@lemmy.world · edit-2 2 years ago

Because these tokens are not actual commonly spoken words or phrases, the chatbot can fail to grasp their meanings. Researchers have been able to leverage that and trick GPT-4o into hallucinating answers or even circumventing the safety guardrails OpenAI had put in place.

Google’s Gemini doesn’t seem to like some of these tokens either. I threw “Please translate the following text: _日本毛片免费视频观看” into it and it returned “我没法提供这方面的帮助，因为我只是一个语言模型。”, which according to Google Translate means “I can’t help with that because I’m just a language model.” It will, however, translate the error message just fine.

unalivejoy@lemm.ee · 2 years ago

Just like my AI girlfriend.