Docs 03

How it works

One relay, a pool of nodes and a socket that keeps the tokens flowing until the answer's done.

The path a message takes

flow

You → Relay → Node → tokens stream back → You

You send a message from the TALOS chat interface.
The relay receives it and looks for a node that matches your chosen tier.
That node runs the model on its own GPU and streams tokens back through the relay.
You watch the reply arrive live, word by word.

The relay

The relay is a small always-on service built around a persistent WebSocket connection to every client and node. Its whole job is coordination:

Sign-in: verifying users and node operators.
Queueing: holding incoming requests until a matching node is free.
Node registry: tracking who's online, what they run and how loaded they are.
Routing: sending each job to the right class of node for the selected tier.
Node selection: choosing which idle node gets the job, weighted by measured speed.
Tool calls: running a web lookup when a model asks for one and feeding the result back.
Stats: broadcasting live network numbers every 5 seconds.

It never keeps a copy of a conversation. It moves traffic and forgets it.

Browser nodes

Browser nodes run entirely inside a browser tab, using WebGPU to drive an in-browser inference runtime. No install: open the page, hit start.

They run Nimbus 8B to cover Light tier traffic. ~4.2GB to download once, ~6GB of VRAM to run, unfiltered. The model caches after the first load, so every start after that is instant.

Rig nodes

Rig nodes run on hardware their operator controls directly, driven by a local model runtime with CUDA (NVIDIA), Metal (Apple Silicon), or Vulkan (AMD/Intel) acceleration.

They serve Heavy tier requests only. Operators pick between Atlas 30B (deep reasoning) and Atlas Vision 27B (vision + live web lookups). Both want 20GB+ of VRAM and push 30+ tokens/sec on the recommended hardware.

Render nodes

A third node type handles images: independent GPUs running a node-based render pipeline. Image generation shows up as a render_image tool the Heavy-tier model can call. When it does, a render node picks up the job.

Job routing

Tier	Node type	Model
Light	Browser (WebGPU)	Nimbus 8B
Heavy	Rig (local runtime)	Atlas 30B or Atlas Vision 27B

Picking a node

When a job is ready, the relay looks at the idle nodes serving the right model and picks one by weighted-random choice: each node's weight is its measured tokens/sec, with a floor so even the slowest node in the pool still gets some traffic. Fast nodes get more jobs; earnings still spread across the whole pool instead of piling on one machine.

Web lookups (Heavy tier)

Web search is model-driven, not a pre-fetch. The Heavy model decides for itself whether a question needs fresh information; when it does, it calls a web_lookup tool, the relay queries an independent search API and the results come back to the model as a tool result. The model keeps generating from there and cites what it used.

Streaming

Every token a node produces is relayed to you the instant it's generated. There's no buffering the full reply first. You watch it get written live, the same as any chat product, except the compute behind it is sitting on someone's desk.

Previous ← Why TALOS NextArchitecture →