Ha ha! I actually finished it over the weekend. Now it's on to the documentation... ICBF lol
I just tried to get shit GPT to do it this morning, as it's generally pretty ok for that. As always, it produces real "page turners". Here is its idea of a "lay explainer":
Mixture of Assholes: Llama-swap + “MoA router”: making small local models act reliably (without pretending they’re bigger)
This project is a harness for local inference: llama-swap is the model traffic cop, and the router is the conductor that decides what kind of work you want done (straight answer, self-critique loop, style rewrite, vision/OCR), when, and with what context. Vodka acts as the memory layer and context re-roll.
The goal isn't to manufacture genius. It's to make local models behave predictably under hardware constraints.
The shape is simple:
UI → Router (modes + RAG + memory plumbing) → llama-swap (model switching) → answer.
The “what”: one OpenAI-style endpoint that routes workflows, not just models
At the front is an OpenAI-compatible `POST /v1/chat/completions` endpoint. From the client's point of view, it's "just chat completions" (optionally streaming). From the router's point of view, each request can become a different workflow. It also accepts OpenAI-style multimodal message blocks (text + image_url), which matters for the vision/OCR paths.
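For illustration, a request could look like this from any OpenAI-compatible client. The port and model name here are placeholders, and the `## ocr` prefix is the per-turn selector mechanism described further down:

```python
import requests

# Address and model name are illustrative, not the project's defaults.
URL = "http://localhost:8080/v1/chat/completions"

resp = requests.post(URL, json={
    "model": "router",
    "stream": False,
    "messages": [{
        "role": "user",
        "content": [  # OpenAI-style multimodal blocks
            {"type": "text", "text": "## ocr What does this receipt say?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
})
print(resp.json()["choices"][0]["message"]["content"])
```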
Under the hood, the router does three things: it picks a pipeline (Serious / Mentats / Fun / Vision) from the per-turn selector and session state, builds the context (FACTS block, memory, constraints), and dispatches to the right model via llama-swap.
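A minimal sketch of that dispatch logic; every helper name here (handle_session_command, build_context, and the run_* pipeline functions) is a hypothetical stand-in, not the project's actual API:

```python
def route(session, message: str):
    # 1) Session commands (>>) mutate router state instead of generating text
    if message.startswith(">>"):
        return handle_session_command(session, message)  # e.g. >>attach linux

    # 2) Per-turn selectors (##) pick the pipeline for this one message
    if message.startswith("## mentats"):
        pipeline = run_mentats
    elif message.startswith("## fun"):
        pipeline = run_fun
    else:
        pipeline = run_serious  # the default

    # 3) Build the context (FACTS block, memory, constraints) and dispatch;
    #    llama-swap loads whichever model the chosen pipeline asks for.
    context = build_context(session, message)
    return pipeline(context)
```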
The “why”: small models fail less when you make the seams visible
A lot of local "agent" setups fail in the same boring ways: context gets injected invisibly, retrieval fires when you didn't ask for it, persona styling bleeds into the reasoning, and "memory" behaves like vibes rather than plumbing.
This design makes those seams legible and user-controlled: retrieval is opt-in and previewable (`>>peek`), mode changes are explicit commands, styling is a separate post-processing pass, and memory is deterministic.
The philosophy is basically: if the system is going to influence the answer, it should be inspectable, not mystical.
The “what’s cool”: you’re routing workflows (Serious / Mentats / Fun / Vision)
There are two layers of control:
A) Session commands (`>>…`): change the router state

These change how the router behaves across turns (things like sticky fun mode, which KBs are attached, and some retrieval observability):

- `>>status` — show session state (sticky mode, attached KBs, last RAG query/hits)
- `>>fun` / `>>fun off` — toggle sticky fun mode
- `>>attach <kb>` / `>>detach <kb|all>` / `>>list_kb` — manage KBs per session
- `>>ingest <kb>` / `>>ingest_all` — ingest markdown into Qdrant
- `>>peek <query>` — preview the would-be FACTS block

B) Per-turn selectors (`##…`): choose the pipeline for one message

- `## mentats …` — deep 3-pass "draft → critique → final"
- `## fun …` — answer, then rewrite in a persona voice
- `## vision …` / `## ocr …` — image paths

The three main pipelines (what they actually do)
1) Serious: the default “boring, reliable” answer
Serious is the default when you don't ask for anything special. It can inject a FACTS block (RAG) and it receives a constraints block (currently a V1 placeholder). It also appends a confidence/source line if the model omits one.
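As a sketch of what that enforcement could look like; only the "append it if missing" behaviour comes from the description, the marker wording is an assumption:

```python
import re

CONFIDENCE_RE = re.compile(r"(?im)^confidence\s*:")

def enforce_confidence_line(answer: str) -> str:
    # If the model already emitted a "Confidence:" line, leave the answer
    # alone; otherwise append a default one. The exact wording is made up.
    if CONFIDENCE_RE.search(answer):
        return answer
    return answer.rstrip() + "\n\nConfidence: unstated (no sources cited)"
```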
Docs vs implementation (minor note): the docs describe Serious as “query + blocks” oriented. The current implementation also has a compact context/transcript shaping step as part of prompt construction. Treat the code as the operational truth; the docs are describing the intended shape and may lag slightly in details as things settle.
2) Mentats: explicit 3-pass “think → critique → final”
This is the "make the model check itself" harness:

- Pass 1 (draft): answer the question straight
- Pass 2 (critique): attack the draft for errors, gaps, and unsupported claims
- Pass 3 (final): rewrite the answer using the critique
If the pipeline can’t complete cleanly (protocol errors), the router falls back to Serious.
3) Fun: answer first, then do the performance
Fun is deliberately a post-processing transform: first produce a straight answer (the same path as Serious), then rewrite that answer in the requested persona voice.
This keeps “voice” from leaking into reasoning or memory. It’s: get it right first, then style it.
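Reusing the chat() helper from the Mentats sketch above, the shape is roughly this (the persona prompt wording is invented):

```python
def fun(question: str, persona: str) -> str:
    # Stage 1: get the content right (same path as a Serious answer)
    answer = chat([{"role": "user", "content": question}])
    # Stage 2: style-only rewrite; the persona never touches the reasoning pass
    return chat([{
        "role": "user",
        "content": f"Rewrite the following in the voice of {persona}. "
                   f"Keep every fact intact; change only the style.\n\n{answer}",
    }])
```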
RAG, but practical: Qdrant + opt-in KB (knowledge base) attach + “peek what you’re feeding me”
KBs are opt-in per session
Nothing is retrieved unless you attach KBs (`>>attach linux`, etc.). The FACTS block is built only from attached KBs, and the router tracks the last query/hit counts for debugging.
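A sketch of how a FACTS block like that could be assembled with the qdrant-client library; the collection name, payload fields, and block format are guesses, not the project's schema:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

def facts_block(query_vector: list[float], attached_kbs: list[str],
                limit: int = 6) -> str:
    # Search only chunks tagged with a currently attached KB.
    hits = client.search(
        collection_name="docs",
        query_vector=query_vector,
        query_filter=models.Filter(must=[
            models.FieldCondition(key="kb", match=models.MatchAny(any=attached_kbs)),
        ]),
        limit=limit,
    )
    if not hits:
        return ""  # nothing attached or nothing relevant: inject no FACTS block
    return "FACTS:\n" + "\n".join(f"- {h.payload['text']}" for h in hits)
```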
Ingestion: "KB folder → chunks → vectors in Qdrant"

Ingestion walks the markdown, chunks it, embeds the chunks, and inserts them into Qdrant tagged by KB. It's simple and operational: turn a folder of docs into something you can retrieve from reliably.
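A minimal version of that pipeline, assuming sentence-transformers for embeddings and a flat "docs" collection; the chunking strategy, embedding model, and all names are placeholders:

```python
import pathlib
import uuid

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption
client = QdrantClient("localhost", port=6333)

def ingest_kb(kb_name: str, folder: str, chunk_chars: int = 1200) -> None:
    points = []
    for md in pathlib.Path(folder).rglob("*.md"):
        text = md.read_text(encoding="utf-8")
        # Naive fixed-size chunking; the real chunker is presumably smarter.
        for i in range(0, len(text), chunk_chars):
            chunk = text[i:i + chunk_chars]
            points.append(models.PointStruct(
                id=str(uuid.uuid4()),
                vector=embedder.encode(chunk).tolist(),
                payload={"kb": kb_name, "text": chunk, "source": str(md)},
            ))
    client.upsert(collection_name="docs", points=points)
```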
The KB refinery: SUMM → DISTILL → ingest
This is one of the more interesting ideas: treat the KB as a product, not a dump.
The key point: DISTILL isn’t “a nicer summary.” It’s explicitly trying to produce retrieval-friendly material.
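Reusing the chat() helper again, the refinery stage could be sketched like this; the prompt wording is invented, and the point is only that SUMM and DISTILL are two distinct passes with different goals:

```python
def refine(doc: str) -> list[str]:
    # SUMM: compress the raw document into a faithful summary
    summary = chat([{
        "role": "user",
        "content": f"Summarize this document faithfully, losing no key facts:\n\n{doc}",
    }])
    # DISTILL: rewrite the summary as standalone, retrieval-friendly statements,
    # each one understandable without the surrounding document
    distilled = chat([{
        "role": "user",
        "content": "Rewrite the summary below as standalone factual statements, "
                   f"one per line, with no pronouns that need outside context:\n\n{summary}",
    }])
    return [line.lstrip("- ").strip() for line in distilled.splitlines() if line.strip()]
```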
Vodka: deterministic memory plumbing (not “AI memory vibes”)
Vodka does two jobs:

- Explicit memory commands (`!!store`, `??recall`, plus cleanup commands)
- TTL (facts expire unless used)

It can also leave internal breadcrumb markers and later expand them when building a transcript/context — those IDs aren't surfaced unless you deliberately show them.
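As a toy illustration of the TTL idea (not Vodka's actual implementation; the names and the refresh-on-use policy are assumptions drawn from the description above):

```python
import time

class FactStore:
    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        self._facts: dict[str, tuple[str, float]] = {}  # key -> (value, last_used)

    def store(self, key: str, value: str) -> None:
        # Corresponds to an explicit !!store command.
        self._facts[key] = (value, time.time())

    def recall(self, key: str) -> str | None:
        # Corresponds to ??recall; using a fact refreshes its TTL clock.
        hit = self._facts.get(key)
        if hit is None:
            return None
        value, _ = hit
        self._facts[key] = (value, time.time())
        return value

    def sweep(self) -> None:
        # Cleanup pass: facts expire unless used within the TTL window.
        now = time.time()
        self._facts = {k: v for k, v in self._facts.items() if now - v[1] < self.ttl}
```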
Roadmap reality check: what's left for V1.1