flux,
one GPU, one queue, one frame at a time.
A local FLUX.1 image-generation FastAPI worker on 127.0.0.1:8421. Single-GPU job queue, SQLite gallery, and a small surface that MasterAgent proxies through.
flux owns the GPU. Generation jobs are serialized through one worker thread because 16 GB of VRAM has headroom for one FLUX pass at a time, and parallelism here would be a regression. The MCP surface that other projects on this machine call (flux_generate, flux_edit, flux_fill, flux_variation, flux_structural, flux_search_gallery) routes through MasterAgent on :8420 and into this loopback service. Every result lands in a SQLite gallery with prompt, model, params, and thumbnail.
I. Tech scope
- Five pipeline kinds, each one entry in MODEL_REGISTRY in worker/config.py: text2img (FLUX.1-schnell or -dev for prompt-only generation), kontext (image-to-image rewrite that preserves subject identity), fill (mask-based inpainting / outpainting), control (depth, edge, or pose conditioning), and redux (image-conditioned variation that keeps composition while perturbing style). Adding a sixth kind means a new registry entry, a new Pydantic schema, and an updated cancellation callback, in that order, never partially.
- config.ensure_dirs() runs before any HuggingFace import. HF_HOME is set explicitly to $DATA_ROOT/hf-cache, which is what stops the diffusers libraries from caching to ~/.cache on the system disk and exhausting the OS partition during a model swap. The rule for new modules is rigid: any file that imports transformers or diffusers at the top level must be loaded after import config, because the env-var dance has to happen first or the cache directories are wrong by the time the libraries try to read them (see the import-order sketch after this list).
- Cancellation is cooperative. The API sets a per-job threading.Event on receipt of flux_cancel_job; the diffusion pipeline's per-step callback (callback_on_step_end) checks the event and raises Cancelled() on the next step boundary, which the worker translates into a clean job-state transition. Adding a new pipeline kind requires wiring this callback or cancellation silently breaks: a job marked cancelled would still run to completion and bill the GPU for it.
- The prompt composer (compose.py plus prompts/<pack>/) reads pack files (pack.json, fixed_block.txt, variable_*.txt, subjects/*.txt) and assembles FIXED + SUBJECT + VARIABLE + EXTRA. The assembled prompt is capped at 2,000 chars, then run through a negation-aware forbidden-pattern audit before queuing: "no people" inside the variable block must actually be respected, not silently overridden by a "people" token in the fixed block. /compose/generate is the single entry point that performs this whole sequence atomically (see the composer sketch after this list).
- The MCP layer is the only public surface. flux_generate, flux_edit, flux_fill, flux_variation, flux_structural, flux_search_gallery, flux_get_job, and flux_cancel_job are routed through MasterAgent on :8420 into the loopback FastAPI on :8421. The FastAPI bind is 127.0.0.1, never an external interface; the GPU is not accessible from the network even by accident.
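A minimal sketch of that import-order rule. The names DATA_ROOT, MODEL_REGISTRY, and ensure_dirs come from the layout below; the registry value shape (a dict keyed by repo id) is an assumption, not the real schema.

```python
# worker/config.py -- must be imported before transformers/diffusers anywhere.
import os
from pathlib import Path

DATA_ROOT = Path(os.environ.get("DATA_ROOT", "./data"))

# Pin the HuggingFace cache to the data disk *before* any HF library reads it.
os.environ.setdefault("HF_HOME", str(DATA_ROOT / "hf-cache"))

# Registry shape is illustrative; one entry per pipeline kind.
MODEL_REGISTRY = {
    "text2img": {"repo": "black-forest-labs/FLUX.1-schnell"},
    # "kontext": ..., "fill": ..., "control": ..., "redux": ...
}

def ensure_dirs() -> None:
    """Create runtime directories before any pipeline or cache access."""
    for sub in ("hf-cache", "images", "uploads"):
        (DATA_ROOT / sub).mkdir(parents=True, exist_ok=True)
```

A module that does `from diffusers import FluxPipeline` before `import config` would freeze the cache path at its default; that is the failure mode the load-order rule exists to prevent.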
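And a sketch of the assemble-cap-audit sequence. The function signature, the forbidden-term list, and the 16-character negation window are illustrative; the real audit rules live in compose.py.

```python
import re

MAX_PROMPT_CHARS = 2000

def compose(fixed: str, subject: str, variable: str, extra: str,
            forbidden: list[str]) -> str:
    """Assemble FIXED + SUBJECT + VARIABLE + EXTRA, cap, then audit."""
    prompt = ", ".join(p for p in (fixed, subject, variable, extra) if p)
    prompt = prompt[:MAX_PROMPT_CHARS]
    for term in forbidden:
        # Negation-aware: "no people" is an acceptable occurrence of
        # "people"; a bare "people" is not. Only un-negated hits fail.
        for m in re.finditer(rf"\b{re.escape(term)}\b", prompt):
            window = prompt[max(0, m.start() - 16):m.start()]
            if not re.search(r"\b(no|without)\s*$", window):
                raise ValueError(f"forbidden term {term!r} in prompt")
    return prompt
```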
II. Layout
worker/
├── main.py       # FastAPI entry, lifespan, /healthz, route registration
├── jobs.py       # single-thread job queue + cancellation
├── pipeline.py   # diffusers pipeline loaders, keyed by MODEL_REGISTRY[kind]
├── compose.py    # prompt assembly + audit (used by /compose/* routes)
├── gallery.py    # SQLite persistence
├── progress.py   # per-step progress callbacks
├── schemas.py    # Pydantic request models for every route
├── config.py     # paths, ports, MODEL_REGISTRY
└── deploy/       # register-service.ps1
prompts/<pack>/   # composer packs
data/             # runtime: gallery.sqlite, images/, uploads/, hf-cache/
The split is functional, not nominal. main.py is the FastAPI surface and the lifespan owner; jobs.py is the only file that holds queue state; pipeline.py is the only file that imports diffusers loaders. That layering is what lets the worker swap a model without touching the queue or the API surface, and it is what lets the test harness import compose.py without dragging the whole CUDA stack into a unit test.
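The shape of main.py that split implies, as a minimal sketch; start_worker is an assumed helper (sketched under Job lifecycle below), and the lifespan body is illustrative.

```python
# worker/main.py -- FastAPI surface and lifespan owner; no queue or CUDA state here.
from contextlib import asynccontextmanager

from fastapi import FastAPI

import config   # must come first: sets HF_HOME before any HF import
import jobs     # the only module holding queue state

@asynccontextmanager
async def lifespan(app: FastAPI):
    config.ensure_dirs()
    jobs.start_worker()   # single daemon worker thread; dies with the process
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
def healthz() -> dict:
    return {"ok": True}
```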
III. Job lifecycle
From the caller’s side, a generation is a request and a job id. From the worker’s side, the lifecycle is fixed: compose assembles and audits the prompt; schemas validates the request payload through Pydantic; the queue hands the job to the single worker thread (queue depth is observable, but the consumer is one); pipeline loads or reuses the diffusers pipeline keyed by MODEL_REGISTRY[kind]; the per-step callback runs the progress and cancellation hook; gallery writes the result row to gallery.sqlite with prompt, model, seed, params, and a thumbnail; and the MCP return hands the job id and result path back to the caller.
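A sketch of the queue half of that lifecycle, assuming a Job type and a run_job dispatcher defined elsewhere in jobs.py:

```python
# worker/jobs.py -- the only file that holds queue state.
import queue
import threading

_jobs: "queue.Queue[Job]" = queue.Queue()   # depth is observable via qsize()

def submit(job: "Job") -> None:
    """Called by the API layer; never blocks on the GPU."""
    _jobs.put(job)

def _worker_loop() -> None:
    # Exactly one consumer: generations are serialized by construction,
    # because 16 GB of VRAM fits one FLUX pass at a time.
    while True:
        job = _jobs.get()
        try:
            run_job(job)   # dispatches on MODEL_REGISTRY[job.kind]; not shown
        finally:
            _jobs.task_done()

def start_worker() -> threading.Thread:
    t = threading.Thread(target=_worker_loop, daemon=True, name="flux-worker")
    t.start()
    return t
```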
Sixteen GB of VRAM has headroom for one FLUX pass at a time. A second concurrent generation regresses on every axis: VRAM thrash, lower throughput, and a higher chance of an OOM mid-step. One worker thread is the consequence of the hardware budget, not a workaround for a concurrency bug, and it is also why a new pipeline kind has to wire the cancellation callback: the per-step callback is the only thing protecting an operator from a five-minute uncancellable generation when they meant to abort.
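The cancellation contract, sketched against diffusers' callback_on_step_end hook. Cancelled() is the exception named above; make_step_callback and the per-job event storage are assumed wiring, not the worker's actual code.

```python
import threading

class Cancelled(Exception):
    """Raised from the step callback to abort at a step boundary."""

def make_step_callback(cancel_event: threading.Event):
    def on_step_end(pipe, step: int, timestep, callback_kwargs: dict) -> dict:
        if cancel_event.is_set():
            raise Cancelled()      # the next denoising step never runs
        return callback_kwargs     # diffusers expects the kwargs back
    return on_step_end

# Wiring, per generation -- every pipeline kind must pass the callback:
# cancel_event = threading.Event()   # stored per job id, set by flux_cancel_job
# image = pipe(prompt,
#              callback_on_step_end=make_step_callback(cancel_event)).images[0]
```

The exception propagates out of the pipe() call at the next step boundary, which is what lets the worker translate it into a clean job-state transition instead of killing the process.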
IV. Why local
The image work in every other project on this site — including meshgen’s featured images and Pinterest pins — runs through this worker. The headline difference is zero API tokens for the pixel data: at the volumes meshgen needs, even a small per-image fee compounds. But the cost story is only part of it. The structural reason is that the prompt pipeline, the seeds, the audit rules, and the gallery all stay on the same disk as the projects that consume them — so a regenerated image six months from now lands the same composition because the prompt + seed + model + LoRA tuple is committed in the consumer’s repo, not in a hosted vendor’s database.
The HuggingFace cache lives at $DATA_ROOT/hf-cache and is treated as the only durable model store. Pinning HF_HOME there means a model upgrade is a deliberate huggingface-cli download followed by a registry edit, not an opaque background refresh; reproducibility starts at the model snapshot.
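What a deliberate upgrade looks like under that rule, sketched with huggingface_hub; the repo id is an example, and the registry edit follows only after the download succeeds.

```python
import os
from pathlib import Path

# HF_HOME must already point at the durable cache; config.py normally does this.
os.environ.setdefault(
    "HF_HOME", str(Path(os.environ.get("DATA_ROOT", "./data")) / "hf-cache"))

# Imported after the env var is set, per the load-order rule.
from huggingface_hub import snapshot_download

# Deliberate, foreground fetch into $DATA_ROOT/hf-cache; no background refresh.
snapshot_download(repo_id="black-forest-labs/FLUX.1-schnell")
```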
V. Surface
The surface presented to other projects on this machine is the MCP layer: flux_generate, flux_edit, flux_fill, flux_variation, flux_structural, plus flux_search_gallery for browsing prior runs and the job-lifecycle pair flux_get_job / flux_cancel_job. Calls route through MasterAgent on :8420 into the loopback FastAPI on :8421; the FastAPI bind is 127.0.0.1, never an external interface, so the GPU is unreachable from the network even by accident. From the caller’s side it is a request and a job id; from the worker’s side it is a serialized queue of one.
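From a consumer on this machine, the round trip is a request and a poll. A sketch against the loopback API: /compose/generate is the documented entry point, while the /jobs/{id} route and the payload fields shown are hypothetical stand-ins for the flux_get_job path.

```python
import time

import requests

BASE = "http://127.0.0.1:8421"

# Queue a composed generation (payload fields are illustrative).
job = requests.post(f"{BASE}/compose/generate",
                    json={"pack": "default", "extra": "golden hour"}).json()

# Poll until the single worker thread gets to it and finishes.
while True:
    state = requests.get(f"{BASE}/jobs/{job['job_id']}").json()  # hypothetical route
    if state["status"] in ("done", "failed", "cancelled"):
        break
    time.sleep(1.0)
```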
VI. Numbers
Five pipeline kinds is a hard count, not a snapshot. Adding a sixth means a new MODEL_REGISTRY entry, a new Pydantic schema, and wiring the per-step cancellation callback, in that order, because skipping the callback wiring leaves cancellation silently broken (a cancelled job runs to completion and bills the GPU for it). Sixteen GB of VRAM has headroom for one FLUX pass at a time; a second concurrent generation regresses throughput and VRAM stability and risks an OOM mid-step. One worker thread is the consequence of the hardware budget, not a concurrency workaround. The prompt composer caps at 2,000 chars after assembly and runs a negation-aware forbidden-pattern audit before queuing: “no people” in the variable block actually has to be respected, not silently overridden by a “people” token in the fixed block.
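What those three steps touch, sketched for a hypothetical upscale kind; the registry shape, repo id, and schema fields are all illustrative.

```python
# 1. worker/config.py -- add the registry entry (repo id is illustrative):
#    MODEL_REGISTRY["upscale"] = {"repo": "example-org/FLUX.1-upscale"}

# 2. worker/schemas.py -- add the request model for the new route:
from pydantic import BaseModel, Field

class UpscaleRequest(BaseModel):
    image_path: str
    scale: int = Field(default=2, ge=2, le=4)
    seed: int | None = None

# 3. worker/pipeline.py -- wire the per-step cancellation callback, or a
#    cancelled job still runs to completion:
#    pipe(..., callback_on_step_end=make_step_callback(cancel_event))
```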