Serverless Inference with Image Generation Models
This tutorial walks you through deploying FLUX.1-Dev — a high-quality text-to-image model by Black Forest Labs — as a production-ready serverless endpoint on YottaLabs. The deployment uses the official YottaLabs ComfyUI runtime image, requiring zero custom Docker builds.
What You Will Build
A serverless FLUX.1-Dev endpoint on YottaLabs that:
Runs FLUX.1-Dev on an NVIDIA RTX 5090 (32 GB VRAM)
Serves a ComfyUI HTTP API on port 8188
Accepts text prompts and returns Base64-encoded images
Can be called from any HTTP client or Python script
Prerequisites
- A YottaLabs account with an API key (`x-api-key`)
- A Hugging Face token (`HF_TOKEN`) with read access approved for: `black-forest-labs/FLUX.1-dev` and `comfyanonymous/flux_text_encoders`
- `curl` and `jq` installed locally
Approving HuggingFace access: Visit https://huggingface.co/black-forest-labs/FLUX.1-dev and click Request access. Approval is typically granted within a few minutes.
Architecture Overview
The official YottaLabs image handles everything inside the container automatically: model downloading, text encoder setup, and ComfyUI startup. You only need to submit prompts via the ComfyUI HTTP API.
Step 1 — Create the Serverless Endpoint
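A sketch of the create call is below. The API host, path, and body schema are assumptions inferred from the parameter names in this guide (check the YottaLabs API reference for the authoritative shapes); only the parameter values themselves come from the parameter list that follows.

```shell
# Hypothetical sketch: the host, path, and JSON field layout are assumptions,
# not the documented YottaLabs schema. YOTTA_API_KEY is a local variable
# holding the key sent in the x-api-key header.
cat > endpoint.json <<'EOF'
{
  "image": "yottalabsai/flux1.dev:comfyui-...",
  "serviceMode": "ALB",
  "containerVolumeInGb": 100,
  "expose": {"port": 8188},
  "env": {"HF_TOKEN": "hf_your_token_here", "ENABLE_FLUX_VAE": "true"},
  "initializationCommand": "sudo -E /start.sh"
}
EOF
curl -s -X POST "https://api.yottalabs.ai/v1/serverless/endpoints" \
  -H "x-api-key: ${YOTTA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @endpoint.json
```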
Key parameters explained:

| Parameter | Value | Purpose |
| --- | --- | --- |
| `image` | `yottalabsai/flux1.dev:comfyui-...` | Official YottaLabs FLUX image with ComfyUI runtime |
| `serviceMode` | `ALB` | Direct proxy mode — requests go straight to ComfyUI |
| `containerVolumeInGb` | `100` | FLUX.1-Dev weights + text encoders are ~24 GB total |
| `expose.port` | `8188` | ComfyUI's default HTTP port |
| `HF_TOKEN` | your token | Required to download gated FLUX.1-Dev weights |
| `ENABLE_FLUX_VAE` | `true` | Downloads and loads the VAE (`ae.safetensors`) |
| `initializationCommand` | `sudo -E /start.sh` | Runs the official startup script (model download + ComfyUI) |
Save the id and domain from the response — you will need them for all subsequent calls.
Step 2 — Wait for the Endpoint to Be Ready
The first startup takes 10–20 minutes because the container must download ~24 GB of model weights from HuggingFace. Subsequent starts are fast if weights are already cached on persistent storage.
Check the endpoint status:
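A hypothetical sketch of the status call; the path and field names are assumptions about the YottaLabs API, with `ENDPOINT_ID` holding the `id` saved from the create response:

```shell
# Hypothetical sketch: endpoint path and response fields are assumptions;
# consult the YottaLabs API reference for the real schema.
curl -s -H "x-api-key: ${YOTTA_API_KEY}" \
  "https://api.yottalabs.ai/v1/serverless/endpoints/${ENDPOINT_ID}" | jq .status
```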
Wait for "status": "RUNNING".
Then verify ComfyUI is up by checking system stats:
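Through the ALB domain this is a plain GET to ComfyUI's standard `/system_stats` route (`DOMAIN` is the domain saved from the create response):

```shell
# ComfyUI exposes /system_stats as part of its built-in HTTP API
curl -s "https://${DOMAIN}/system_stats" | jq .
```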
A healthy response looks like:
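An abridged sketch of a healthy reply; the exact fields vary by ComfyUI version and the numeric values here are illustrative:

```json
{
  "system": {
    "os": "posix",
    "comfyui_version": "…",
    "python_version": "…"
  },
  "devices": [
    {
      "name": "NVIDIA GeForce RTX 5090",
      "type": "cuda",
      "vram_total": 34359738368,
      "vram_free": 33000000000
    }
  ]
}
```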
Check the model download log if startup is taking long:
Step 3 — Submit an Image Generation Request
ComfyUI uses a workflow JSON format to describe the generation pipeline. Below is a minimal FLUX.1-Dev workflow that takes a text prompt and generates a 1024×1024 image.
3.1 — The Workflow Payload
Save this as flux_workflow.json:
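Below is one minimal sketch of such a graph in ComfyUI's API (`/prompt`) format. The prompt text, seed, and the model filenames (`flux1-dev.safetensors`, `t5xxl_fp16.safetensors`, `clip_l.safetensors`) are assumptions; adjust them to match the files the container actually downloaded under `models/`.

```json
{
  "1": {"class_type": "UNETLoader",
        "inputs": {"unet_name": "flux1-dev.safetensors", "weight_dtype": "default"}},
  "2": {"class_type": "DualCLIPLoader",
        "inputs": {"clip_name1": "t5xxl_fp16.safetensors", "clip_name2": "clip_l.safetensors", "type": "flux"}},
  "3": {"class_type": "VAELoader",
        "inputs": {"vae_name": "ae.safetensors"}},
  "4": {"class_type": "CLIPTextEncode",
        "inputs": {"text": "a cinematic photo of a red fox in a snowy forest", "clip": ["2", 0]}},
  "5": {"class_type": "CLIPTextEncode",
        "inputs": {"text": "", "clip": ["2", 0]}},
  "6": {"class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
  "7": {"class_type": "KSampler",
        "inputs": {"seed": 42, "steps": 20, "cfg": 1.0,
                   "sampler_name": "euler", "scheduler": "simple", "denoise": 1.0,
                   "model": ["1", 0], "positive": ["4", 0], "negative": ["5", 0],
                   "latent_image": ["6", 0]}},
  "8": {"class_type": "VAEDecode",
        "inputs": {"samples": ["7", 0], "vae": ["3", 0]}},
  "9": {"class_type": "SaveImage",
        "inputs": {"images": ["8", 0], "filename_prefix": "flux"}}
}
```

FLUX.1-Dev is distilled to run without classifier-free guidance, which is why `cfg` is 1.0 and the negative prompt is left empty.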
3.2 — Submit the Prompt
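Wrap the graph under a top-level `prompt` key, as ComfyUI's `/prompt` endpoint expects:

```shell
# POST the workflow to ComfyUI's /prompt endpoint via the ALB domain
curl -s -X POST "https://${DOMAIN}/prompt" \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat flux_workflow.json)}"
```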
Response:
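A typical body returned by `/prompt` (values illustrative); an empty `node_errors` object means the workflow validated cleanly:

```json
{
  "prompt_id": "3a6f…",
  "number": 0,
  "node_errors": {}
}
```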
Save the prompt_id — you will poll with it in the next step.
Step 4 — Poll for the Result
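Poll ComfyUI's `/history` route with the saved ID; re-run the call (or loop with a short sleep) until the entry appears:

```shell
# /history/<prompt_id> is part of ComfyUI's standard HTTP API; the response
# is empty until the generation finishes
curl -s "https://${DOMAIN}/history/${PROMPT_ID}" | jq .
```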
When generation is complete, the response contains the output image filename:
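An abridged sketch of a finished history entry; the outer key is your `prompt_id`, the `"9"` key is the node ID of the `SaveImage` node, and the filename is illustrative:

```json
{
  "<prompt_id>": {
    "outputs": {
      "9": {
        "images": [
          {"filename": "flux_00001_.png", "subfolder": "", "type": "output"}
        ]
      }
    },
    "status": {"completed": true, "status_str": "success"}
  }
}
```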
Step 5 — Fetch the Image as Base64
Use the filename from the history response to download and encode the image:
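ComfyUI serves generated files from its `/view` route; the `filename` and `type` query parameters come straight from the history entry:

```shell
FILENAME="flux_00001_.png"   # taken from the history response
curl -s "https://${DOMAIN}/view?filename=${FILENAME}&type=output" -o output.png
base64 -w 0 output.png > output_b64.txt   # GNU base64; on macOS use: base64 -i output.png
```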
To get Base64 directly:
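Or pipe the download straight into `base64` without touching disk:

```shell
IMAGE_B64=$(curl -s "https://${DOMAIN}/view?filename=${FILENAME}&type=output" | base64 -w 0)
printf '%s\n' "$IMAGE_B64" | head -c 60   # print the first characters as a sanity check
```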
Step 6 — Full Python Client
This script wraps the entire flow — submit prompt, poll until done, return Base64:
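A self-contained sketch of such a client using only the standard library. `DOMAIN` and the `/view` query parameters follow the earlier steps; error handling and timeouts are omitted for brevity:

```python
"""Minimal ComfyUI client for the FLUX endpoint (stdlib only)."""
import base64
import json
import time
import urllib.request

DOMAIN = "your-endpoint-domain"  # the `domain` field from the create response


def _get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def submit_prompt(workflow):
    """POST the workflow graph to /prompt; returns the prompt_id."""
    req = urllib.request.Request(
        f"https://{DOMAIN}/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]


def wait_for_result(prompt_id, poll_secs=5.0):
    """Poll /history/<id> until an output image appears; returns its filename."""
    while True:
        history = _get_json(f"https://{DOMAIN}/history/{prompt_id}")
        outputs = history.get(prompt_id, {}).get("outputs", {})
        for node in outputs.values():
            for image in node.get("images", []):
                return image["filename"]
        time.sleep(poll_secs)


def fetch_image_b64(filename):
    """Download the finished image via /view and return it Base64-encoded."""
    url = f"https://{DOMAIN}/view?filename={filename}&type=output"
    with urllib.request.urlopen(url) as resp:
        return base64.b64encode(resp.read()).decode()


def generate(workflow):
    """Submit, poll until done, and return the image as a Base64 string."""
    return fetch_image_b64(wait_for_result(submit_prompt(workflow)))


# Usage (against a RUNNING endpoint):
#   workflow = json.load(open("flux_workflow.json"))
#   image_b64 = generate(workflow)
```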
Managing Your Endpoint
Stop the endpoint (pause billing):
Restart:
Check the generation queue:
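Sketches of the three calls above. The stop/start paths are assumptions about the YottaLabs API; the `/queue` route is part of ComfyUI's standard HTTP API:

```shell
# Hypothetical: stop/start paths are assumptions, not the documented schema
curl -s -X POST -H "x-api-key: ${YOTTA_API_KEY}" \
  "https://api.yottalabs.ai/v1/serverless/endpoints/${ENDPOINT_ID}/stop"
curl -s -X POST -H "x-api-key: ${YOTTA_API_KEY}" \
  "https://api.yottalabs.ai/v1/serverless/endpoints/${ENDPOINT_ID}/start"

# ComfyUI's /queue returns queue_running and queue_pending arrays
curl -s "https://${DOMAIN}/queue" \
  | jq '{running: (.queue_running | length), pending: (.queue_pending | length)}'
```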
Common Issues and Fixes
Model weights not downloaded yet (generation fails immediately)
Check the download log and wait:
The full download is ~24 GB and takes 10–20 minutes on first start.
403 on HuggingFace during download
Your HF_TOKEN does not have approved access to black-forest-labs/FLUX.1-dev. Visit the model page and request access, then recreate the endpoint with the correct token.
node_errors in the prompt response
A model file is missing or named differently. Check that all files exist under /home/ubuntu/ComfyUI/models/:
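For example, from a shell inside the container; the filenames shown in the comments are typical for this setup rather than guaranteed:

```shell
# Expected layout (filenames are assumptions; verify against your container)
ls -lh /home/ubuntu/ComfyUI/models/unet/   # flux1-dev.safetensors
ls -lh /home/ubuntu/ComfyUI/models/clip/   # t5xxl_*.safetensors, clip_l.safetensors
ls -lh /home/ubuntu/ComfyUI/models/vae/    # ae.safetensors
```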
Generation is slow (>60s per image)
Normal for FLUX.1-Dev at 20 steps in FP16. Options to speed up:

- Reduce `steps` to 10–15 (quality tradeoff)
- Use the Nunchaku-optimized image instead: `yottalabsai/flux1.dev:comfyui-nunchaku-cuda12.8.1-ubuntu22.04-2025102101`
Performance Reference
| Steps | Resolution | Approx. time per image |
| --- | --- | --- |
| 10 | 512×512 | ~8s |
| 20 | 1024×1024 | ~25s |
| 20 | 1024×1024 (Nunchaku) | ~12s |
Nunchaku Variant (Faster Inference)
YottaLabs also provides a quantized version of the image with Nunchaku acceleration, which roughly halves generation time on RTX 5090:
Everything else — environment variables, ports, API calls — remains identical. Switch the image name and redeploy.
Next Steps
- LoRA support: Place `.safetensors` LoRA files under `ComfyUI/models/loras/` and add a `LoraLoader` node to the workflow
- Batch generation: Set `batch_size` > 1 in the `EmptyLatentImage` node to generate multiple images per request
- Different resolutions: FLUX.1-Dev supports arbitrary resolutions. Common choices: 768×1344 (portrait), 1344×768 (landscape), 1024×1024 (square)
- img2img: Add an `ImageToLatent` node before the KSampler and set `denoise` < 1.0