

Where’d you see it?
There are GPT-2 bots on Reddit and Lemmy that make a lot of posts like this. And they aren’t hidden; the community is explicitly labeled.


NR is more “oldschool” isn’t it? It’s always been critical of MAGA.


…Not a great article.
ISW put it much better: the plan, as outlined now, doesn’t actually keep any peace and just sets the stage for another invasion later:
This kind of fits Trump’s MO though, even before politics: make flashy deals, and move on to the next as the last one fractures behind him (with someone else left holding the bag).


Vllm is a bit better with parallelization. All the kv cache sits in a single “pool”, and it uses as many slots as will fit. If it gets a bunch of short requests, it does many in parallel. If it gets a long context request, it kinda just does that one.
You still have to specify a maximum context though, and it is best to set that as low as possible.
…The catch is it’s quite vram inefficient. But it can split over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
You might try TabbyAPI with exl2 quants as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.
Another thing to tweak is batch size. If you are actually making a bunch of 47K context calls, you can increase the prompt processing batch size a ton to load the MI50 better, and get it to process the prompt faster.
EDIT: Also, now that I think about it, I’m pretty sure ollama is really dumb with parallelization. Does it even support paged attention batching?
The llama.cpp server should be much better, e.g. it uses less VRAM for each of the “slots” it can utilize.
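To make the vLLM side concrete, here’s a rough sketch using its offline batching API (the model name and numbers are placeholders, not a recommendation; tune them for your MI50s):

```python
# Sketch: vLLM schedules many requests against one paged KV-cache pool.
# Model name, context limit, and memory fraction are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    max_model_len=8192,                # keep this as low as you can get away with
    gpu_memory_utilization=0.90,       # leave a little VRAM headroom
    tensor_parallel_size=1,            # bump this to split across multiple cards
)

params = SamplingParams(max_tokens=256, temperature=0.7)

# Many short prompts get batched in parallel; one huge prompt mostly runs alone.
prompts = [f"Summarize item {i} in one sentence." for i in range(32)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:80])
```

The served equivalent is `vllm serve` with `--max-model-len`, same idea.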


I’ll save you the searching!
For max speed when making parallel calls, vllm: https://hub.docker.com/r/btbtyler09/vllm-rocm-gcn5
Generally, the built in llama.cpp server is the best for GGUF models! It has a great built in web UI as well.
For a more one-click RP focused UI, and API server, kobold.cpp rocm is sublime: https://github.com/YellowRoseCx/koboldcpp-rocm/
If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s specifically optimized for MoE hybrid inference, but the caveat is that its vulkan backend isn’t well tested. They will fix issues if you find any, though: https://github.com/ikawrakow/ik_llama.cpp/
mlc-llm also has a Vulkan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.


AFAIK some outputs are made with a really tiny/quantized local LLM too.
And yeah, even that aside, GPT 3.5 is really bad these days. It’s obsolete.


Bloefz has a great setup. Used Mi50s are cheap.
An RTX 3090 + a cheap HEDT/Server CPU is another popular homelab config. Newer models run reasonably quickly on them, with the attention/dense layers on the GPU and sparse parts on the CPU.


This is the way.
…Except for ollama. It’s starting to enshittify and I would not recommend it.


The iPhone models are really bad. They aren’t representative of the usefulness of bigger ones, and it’s inexplicably stupid that Apple doesn’t let people pick their own API as an alternative.
Social interaction is really, really important, right?
I think you aren’t giving other little kids enough credit. They aren’t their parents. I had good friends with awful family cultures, and I’m better for knowing them.
I think you may be judging ‘rural’ neighbors a little harshly, too. And your own kid, especially if they are bright.
…It doesn’t mean you can’t supplement their curriculum though, or advance them. My Dad grew up in a really poor area in the deep, deep south. But he just skipped grades, and he didn’t come out as a religious nut or anything, and he didn’t have the benefit of parents with two master’s degrees.
…I’m bringing this up, because I also knew some homeschooled people, and I feel like it screwed them up. Different situation, but still, the isolation makes me very hesitant.
Lulz.
It’s an interesting coding exercise, though. Trying to (for example) OCR all the documents, or generate a relations graph between the documents or concepts, is a great intro to language modeling (which is not prompt engineering, like most seem to think).
If you’re like a reporter or something, it’s also the obvious way to comb through the documents looking for clues to actually make headlines. I dunno what techniques they use at big outlets, though.
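For the graph idea, a rough sketch, assuming you’ve already OCR’d everything to plain text (spaCy + networkx is just one convenient combo, not the only way):

```python
# Sketch: co-occurrence graph of named people across OCR'd documents.
# Assumes a folder of plain-text files; the spaCy model is the small English one.
from itertools import combinations
from pathlib import Path

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
graph = nx.Graph()

for path in Path("ocr_output").glob("*.txt"):
    doc = nlp(path.read_text()[:100_000])  # cap length to keep it cheap
    people = {ent.text.strip() for ent in doc.ents if ent.label_ == "PERSON"}
    for a, b in combinations(sorted(people), 2):
        # Edge weight = number of documents the pair shows up in together.
        weight = graph.get_edge_data(a, b, {}).get("weight", 0)
        graph.add_edge(a, b, weight=weight + 1)

# Most-connected names first.
print(sorted(graph.degree, key=lambda kv: kv[1], reverse=True)[:10])
```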
It’s literally “this one is my fursona. This one won’t refuse BDSM, but it’s not as eloquent. Oh, this one is lobotomized but really creative.” I kid you not. Here is an example, and note that it is one of 115 uploads from one account:
https://huggingface.co/Mawdistical/RAWMAW-70B?not-for-all-audiences=true
And I love that madness. It feels like the old internet. In fact, furries and horny roleplayers have made some good code contributions to the space.
Early on, there were a few ‘character’ finetunes or more generic ones like ‘talk like a pirate’ or ‘talk only in emojis.’ But as local models got more advanced, they got so good at adopting personas that the finetuning focused more on writing ‘style’ and storytelling than emulating specific characters. For example, here’s one trained specifically to stick to the role of a dungeonmaster: https://huggingface.co/LatitudeGames/Nova-70B-Llama-3.3
Or this one, where you can look at the datasets and see the anime ‘style’ they’re trying to massage in: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B
Meme finetunes are nothing new.
As an example, there are DPO datasets with positive/negative examples intended to train LLMs to respond politely and helpfully (as opposed to the negative response). There are some that include toxic comments plucked from the web as negative examples.
And the immediate community thought was “…What if I reversed them?”
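Concretely, it’s about this much work (the dataset name is a placeholder; “chosen”/“rejected” is just the common DPO column convention):

```python
# Sketch: flip a DPO preference dataset so the "bad" responses become the preferred ones.
# The dataset name and output repo are placeholders.
from datasets import load_dataset

ds = load_dataset("some-org/some-polite-dpo-set", split="train")  # placeholder

def swap(example):
    example["chosen"], example["rejected"] = example["rejected"], example["chosen"]
    return example

reversed_ds = ds.map(swap)
reversed_ds.push_to_hub("you/now-maximally-toxic")  # also a placeholder
```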
I dunno what the ‘writing style’ would end up as. The bulk of the text seems to be formatted like this:
...
10. Is Epstein cooperating with federal suit against Bear Stearns hedge fund managers Ralph Cioffi
and Matthew Tannin? Will he testify in their cases?
11. Mr Epstein was deposed on this week, on Thursday. Is it true that he answered almost every
question by invoking his Fifth Amendment rights?
12. Defense attorney Brad Evans has filed a motion to freeze Mr Epstein’s assets. Has Mr.
Epstein moved his money from the US offshore or abroad, or does he intend to, in order to
protect his assets from possible damage claims?
13. What did Mr. Epstein do during his work release program while serving time. Reports have
said he engaged in “scientific research.” If so, what was he researching?
...
Response
"That's because it isn't, and everyone here
(apparently save one) is rational and objective enough
to understand that. Physical phenomena, and
phenomena in general, are
ultimately perceptual in nature and subject to
observational replication - that's why they call
physics an empirical science. But consciousness is
not.
Consciousness cannot be objectively, replicably
observed. Its putative physical correlates, including
...
Bill Clinton identified in lawsuit against his former friend and
pedophile Jeffrey Epstein who had 'regular' orgies at his Caribbean
compound that the former president visited multiple times
e The former president was friends with Jeffrey Epstein, a financier who was arrested
in 2008 for soliciting underage prostitutes
e Anew lawsuit has revealed how Clinton took multiple trips to Epstein's private island
where he 'kept young women as sex slaves'
e Clinton was also apparently friends with a woman who collected naked pictures of
underage girls for Epstein to choose from
e He hasn't cut ties with that woman, however, and invited her to Chelsea's wedding
e Comes as friends now fear that if Hillary Clinton runs for president in 2016, all of
their family's old scandals will be brought to the forefront
e Epstein has a host of famous friends including Prince Andrew who stayed at his New
York mansion AFTER his arrest
By Daily Mail Reporter
Published: 09:06 EST, 19 March 2014 | Updated: 21:10 EST, 5 January 2015
I’d have to generate prompt/response wrappers too. But it would definitely bring up Trump and Clinton randomly, heh.
…There are automated metrics to rank English text by reading level, ‘quality’ and such. I guess it could be filtered to most ‘interesting’ emails and reformatted.
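Roughly like this, as a sketch (textstat for the reading-level score; the loader and the prompt/response wrapper are things I’m making up for illustration):

```python
# Sketch: keep only the denser email chunks and wrap them into prompt/response pairs.
# load_email_chunks() is hypothetical; thresholds and the wrapper format are arbitrary.
import json

import textstat

def keep(text: str) -> bool:
    if len(text.split()) < 50:          # drop fragments, headers, bullet lists
        return False
    return textstat.flesch_kincaid_grade(text) >= 8  # higher grade = denser prose

records = []
for chunk in load_email_chunks():       # hypothetical loader over the dump
    if keep(chunk):
        records.append({
            "prompt": "Continue the email thread:\n" + chunk[:500],
            "response": chunk[500:2000],
        })

with open("emails_sft.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```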
…The same way Google Search has forever?
Ranking, reranking, oldschool RAG.
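i.e. something like this, sketched with sentence-transformers (the model names are common open defaults, obviously not whatever Google actually runs):

```python
# Sketch: classic retrieve-then-rerank, the "oldschool RAG" pattern.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

docs = ["doc one ...", "doc two ...", "doc three ..."]  # your corpus

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)

query = "who is mentioned alongside the hedge fund case?"
q_emb = embedder.encode(query, convert_to_tensor=True)

# Stage 1: cheap vector ranking over the whole corpus.
hits = util.semantic_search(q_emb, doc_emb, top_k=20)[0]

# Stage 2: slower cross-encoder rerank over the shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, docs[h["corpus_id"]]) for h in hits])
best_score, best_hit = max(zip(scores, hits), key=lambda pair: pair[0])
print(docs[best_hit["corpus_id"]])
```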
Yeah. They don’t sell it with a better engine because it would embarrass more expensive cars, kinda like the Porsche Boxster/Cayman (which, whisper it, handle better than the 911).
Old Miatas were like that too :(. Though I don’t know what Mazda’s excuse is these days, as the Miata is their top sports car?


And IMO… your 3080 is good for ML stuff. It’s very well supported. It’s kinda hard to upgrade, in fact, as realistically you’re either looking at a 4090 or a used 3090 for an upgrade that’s actually worth it.


Oh no, you got it backwards. The software is everything, and ollama is awful. It’s enshittifying: don’t touch it with a 10 foot pole.
Speeds are basically limited by CPU RAM bandwidth. Hence you want to be careful doubling up RAM: populating more DIMM slots can drop the maximum supported memory speed (and hence cut your inference speed).
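Back-of-the-envelope, the ceiling looks something like this (all numbers illustrative; measure your own bandwidth for real figures):

```python
# Rough upper bound on MoE generation speed from memory bandwidth alone.
# Illustrative numbers, not measurements.
ram_bandwidth_gb_s = 80   # e.g. dual-channel DDR5 actually running at full speed
active_params_b = 12      # ~12B parameters active per token for GLM-4.5-Air
bytes_per_param = 0.6     # ~4.8 bits/weight for a Q4-ish quant

gb_read_per_token = active_params_b * bytes_per_param          # ~7.2 GB/token
print(ram_bandwidth_gb_s / gb_read_per_token, "tokens/s max")  # ~11 tokens/s ceiling
```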
Anyway, start with this. Pick your size, based on how much free CPU RAM you want to spare:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
The “dense” parts will live on your 3080 while the “sparse” parts will run on your CPU. The backend you want is this, specifically the built-in llama-server:
https://github.com/ikawrakow/ik_llama.cpp/
Regular llama.cpp is fine too, but its quants just aren’t quite as optimal or fast.
It has two really good built-in web UIs: the “new” llama.cpp chat UI, and mikupad, which is like a “raw” notebook mode more aimed at creative writing. But you can use LM Studio if you want, or anything else; there are like a bazillion frontends out there.
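For reference, a launch sketch, written out through Python’s subprocess just so the flags are annotated (the quant filename and the `-ot` pattern are assumptions, so check `llama-server --help` on your build):

```python
# Sketch: launch ik_llama.cpp's llama-server with the MoE expert tensors kept in CPU RAM.
# Filename, regex, and exact flag spellings are assumptions - verify against your build.
import subprocess

cmd = [
    "./llama-server",
    "-m", "GLM-4.5-Air-IQ4_KSS.gguf",  # whichever ubergarm quant fits your RAM
    "-ngl", "99",                      # nominally offload every layer to the 3080...
    "-ot", "exps=CPU",                 # ...then override: expert tensors stay on the CPU
    "-c", "16384",                     # context size; raise it if VRAM allows
    "--host", "127.0.0.1",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

Then point whatever frontend you like at http://127.0.0.1:8080, or just open that address for the built-in web UI.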


Since there is generated video, it seems like someone solved this problem.
Oh yes, it has come a LOONG way. Some projects to look at are:
https://github.com/ModelTC/LightX2V
https://github.com/deepbeepmeep/Wan2GP
And for images: https://github.com/nunchaku-tech/nunchaku
Video generation/editing is very GPU heavy though.
I dunno what card you have now, but with text LLMs (or image+text input LLMs), hybrid CPU+GPU inference is the trend these days.
As an example, I can run GLM 4.6, a 350B LLM, with measurably low quantization distortion on a 3090 + 128GB CPU RAM, at like 7 tokens/s. If you would’ve told me that 2-4 years ago, my head would have exploded.
You can easily run GLM Air (or other good MoE models) on like a 3080 + system RAM, or even a lesser GPU. You just need the right software and quant.
A post that long?
Eh, well, it could definitely be an unmarked bot on X. That’s good attention bait, and it has a feeling of temporal implausibility, kinda like a ‘cheapest API LLM’ story.