Serverless Inference with Image Generation Models

This tutorial walks you through deploying FLUX.1-Dev — a high-quality text-to-image model by Black Forest Labs — as a production-ready serverless endpoint on YottaLabs. The deployment uses the official YottaLabs ComfyUI runtime image, requiring zero custom Docker builds.


What You Will Build

A serverless FLUX.1-Dev endpoint on YottaLabs that:

  • Runs FLUX.1-Dev on an NVIDIA RTX 5090 (32 GB VRAM)

  • Serves a ComfyUI HTTP API on port 8188

  • Accepts text prompts and returns Base64-encoded images

  • Can be called from any HTTP client or Python script


Prerequisites

  • A YottaLabs account with an API key (x-api-key)

  • A Hugging Face token (HF_TOKEN) with read access approved for:

    • black-forest-labs/FLUX.1-dev

    • comfyanonymous/flux_text_encoders

  • curl and jq installed locally

Approving Hugging Face access: Visit https://huggingface.co/black-forest-labs/FLUX.1-dev and click Request access. Approval is typically granted within a few minutes.


Architecture Overview

The official YottaLabs image handles everything inside the container automatically: model downloading, text encoder setup, and ComfyUI startup. You only need to submit prompts via the ComfyUI HTTP API.


Step 1 — Create the Serverless Endpoint
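The exact request shape depends on the YottaLabs serverless API; the sketch below is a hypothetical example. The host `api.yottalabs.ai`, the `/v1/serverless/endpoints` path, and the payload field names mirror the parameter table in this section and should be verified against the official API reference before use.

```shell
# Hypothetical creation request: confirm host, path, and field names
# against the YottaLabs API reference.
curl -s -X POST "https://api.yottalabs.ai/v1/serverless/endpoints" \
  -H "x-api-key: $YOTTA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "flux1-dev",
    "image": "yottalabsai/flux1.dev:comfyui-...",
    "serviceMode": "ALB",
    "containerVolumeInGb": 100,
    "expose": { "port": 8188 },
    "env": {
      "HF_TOKEN": "'"$HF_TOKEN"'",
      "ENABLE_FLUX_VAE": "true"
    },
    "initializationCommand": "sudo -E /start.sh"
  }' | jq '{id, domain}'
```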

Key parameters explained:

| Parameter | Value | Notes |
| --- | --- | --- |
| image | yottalabsai/flux1.dev:comfyui-... | Official YottaLabs FLUX image with ComfyUI runtime |
| serviceMode | ALB | Direct proxy mode — requests go straight to ComfyUI |
| containerVolumeInGb | 100 | FLUX.1-Dev weights + text encoders are ~24 GB total |
| expose.port | 8188 | ComfyUI's default HTTP port |
| HF_TOKEN | your token | Required to download gated FLUX.1-Dev weights |
| ENABLE_FLUX_VAE | true | Downloads and loads the VAE (ae.safetensors) |
| initializationCommand | sudo -E /start.sh | Runs the official startup script (model download + ComfyUI) |

Save the id and domain from the response — you will need them for all subsequent calls.


Step 2 — Wait for the Endpoint to Be Ready

The first startup takes 10–20 minutes because the container must download ~24 GB of model weights from HuggingFace. Subsequent starts are fast if weights are already cached on persistent storage.

Check the endpoint status:
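A status check along these lines, assuming the same hypothetical API path as the creation request (verify against the YottaLabs API reference):

```shell
# $ENDPOINT_ID is the id returned by the creation call
curl -s "https://api.yottalabs.ai/v1/serverless/endpoints/$ENDPOINT_ID" \
  -H "x-api-key: $YOTTA_API_KEY" | jq -r '.status'
```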

Wait for "status": "RUNNING".

Then verify ComfyUI is up by checking system stats:
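`/system_stats` is a standard ComfyUI server route, so this call works against the endpoint's domain directly:

```shell
# $DOMAIN is the domain returned when the endpoint was created
curl -s "https://$DOMAIN/system_stats" | jq .
```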

A healthy response looks like:
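An abridged, illustrative shape (exact fields and values will differ with your ComfyUI version):

```json
{
  "system": { "os": "posix", "comfyui_version": "...", "python_version": "..." },
  "devices": [
    {
      "name": "cuda:0 NVIDIA GeForce RTX 5090",
      "type": "cuda",
      "vram_total": 34342961152,
      "vram_free": 33822867456
    }
  ]
}
```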

Check the model download log if startup is taking long:
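Assuming the platform exposes container logs over the API (the route below is hypothetical; the YottaLabs console may be the simpler option):

```shell
# Hypothetical logs route: confirm in the YottaLabs API reference
curl -s "https://api.yottalabs.ai/v1/serverless/endpoints/$ENDPOINT_ID/logs" \
  -H "x-api-key: $YOTTA_API_KEY" | tail -n 20
```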


Step 3 — Submit an Image Generation Request

ComfyUI uses a workflow JSON format to describe the generation pipeline. Below is a minimal FLUX.1-Dev workflow that takes a text prompt and generates a 1024×1024 image.

3.1 — The Workflow Payload

Save this as flux_workflow.json:
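A minimal sketch in ComfyUI's API workflow format. The model filenames (flux1-dev.safetensors, clip_l.safetensors, t5xxl_fp16.safetensors) are assumptions based on a standard ComfyUI FLUX layout; check them against the files the image actually downloads. FLUX effectively ignores the negative prompt, so the same text encoding is wired into both KSampler conditioning inputs, and cfg is left at 1.0 as is typical for FLUX:

```json
{
  "prompt": {
    "1": { "class_type": "UNETLoader",
           "inputs": { "unet_name": "flux1-dev.safetensors", "weight_dtype": "default" } },
    "2": { "class_type": "DualCLIPLoader",
           "inputs": { "clip_name1": "clip_l.safetensors",
                       "clip_name2": "t5xxl_fp16.safetensors", "type": "flux" } },
    "3": { "class_type": "VAELoader", "inputs": { "vae_name": "ae.safetensors" } },
    "4": { "class_type": "CLIPTextEncode",
           "inputs": { "text": "a cinematic photo of a red fox in a snowy forest",
                       "clip": ["2", 0] } },
    "5": { "class_type": "EmptyLatentImage",
           "inputs": { "width": 1024, "height": 1024, "batch_size": 1 } },
    "6": { "class_type": "KSampler",
           "inputs": { "model": ["1", 0], "positive": ["4", 0], "negative": ["4", 0],
                       "latent_image": ["5", 0], "seed": 42, "steps": 20, "cfg": 1.0,
                       "sampler_name": "euler", "scheduler": "simple", "denoise": 1.0 } },
    "7": { "class_type": "VAEDecode", "inputs": { "samples": ["6", 0], "vae": ["3", 0] } },
    "8": { "class_type": "SaveImage", "inputs": { "images": ["7", 0], "filename_prefix": "flux" } }
  }
}
```

Edit the text input of node 4 to change the prompt.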

3.2 — Submit the Prompt
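`/prompt` is ComfyUI's standard submission route; the payload above already wraps the graph in the `prompt` key it expects:

```shell
curl -s -X POST "https://$DOMAIN/prompt" \
  -H "Content-Type: application/json" \
  -d @flux_workflow.json
```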

Response:
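An illustrative example (the prompt_id is a generated UUID):

```json
{
  "prompt_id": "7f9c2a3e-1d4b-4c8a-9e2f-0b5d6a7c8e9f",
  "number": 0,
  "node_errors": {}
}
```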

Save the prompt_id — you will poll with it in the next step.


Step 4 — Poll for the Result
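ComfyUI records finished jobs at GET /history/&lt;prompt_id&gt;; the response is an empty object until the job completes. A poll loop sketch, assuming $DOMAIN and the prompt_id from Step 3:

```shell
PROMPT_ID="<prompt_id from step 3>"
# /history/<id> returns {} until done, then an object keyed by the prompt_id
until curl -s "https://$DOMAIN/history/$PROMPT_ID" \
    | jq -e --arg id "$PROMPT_ID" 'has($id)' >/dev/null; do
  sleep 2
done
curl -s "https://$DOMAIN/history/$PROMPT_ID" | jq .
```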

When generation is complete, the response contains the output image filename:
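An abridged, illustrative excerpt (the node id and filename depend on your workflow):

```json
{
  "<prompt_id>": {
    "outputs": {
      "8": {
        "images": [
          { "filename": "flux_00001_.png", "subfolder": "", "type": "output" }
        ]
      }
    },
    "status": { "completed": true }
  }
}
```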


Step 5 — Fetch the Image as Base64

Use the filename from the history response to download and encode the image:
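`/view` is ComfyUI's standard image-serving route; substitute the filename actually returned in your history response (the one below is illustrative):

```shell
curl -s -o output.png \
  "https://$DOMAIN/view?filename=flux_00001_.png&subfolder=&type=output"
```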

To get Base64 directly:
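Piping the same request through base64 (on macOS, plain `base64` without `-w0` already emits unwrapped output):

```shell
curl -s "https://$DOMAIN/view?filename=flux_00001_.png&subfolder=&type=output" \
  | base64 -w0 > image.b64
```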


Step 6 — Full Python Client

This script wraps the entire flow — submit prompt, poll until done, return Base64:
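A minimal standard-library sketch. The DOMAIN placeholder and the flux_workflow.json filename are the values from earlier steps; the routes used (/prompt, /history/&lt;id&gt;, /view) are standard ComfyUI server routes:

```python
"""Minimal FLUX.1-Dev client for a YottaLabs ComfyUI endpoint."""
import base64
import json
import time
import urllib.request

DOMAIN = "your-endpoint-domain"  # from the endpoint creation response


def submit_prompt(workflow: dict) -> str:
    """POST the workflow to ComfyUI's /prompt route; return the prompt_id."""
    req = urllib.request.Request(
        f"https://{DOMAIN}/prompt",
        data=json.dumps(workflow).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]


def extract_filename(history: dict, prompt_id: str) -> str:
    """Pull the first output image filename out of a /history response."""
    for node_output in history[prompt_id]["outputs"].values():
        for image in node_output.get("images", []):
            return image["filename"]
    raise RuntimeError("no output images in history entry")


def wait_for_image(prompt_id: str, poll_seconds: float = 2.0) -> str:
    """Poll /history until the job finishes; return the image filename."""
    while True:
        with urllib.request.urlopen(f"https://{DOMAIN}/history/{prompt_id}") as resp:
            history = json.load(resp)
        if prompt_id in history:  # empty object until generation completes
            return extract_filename(history, prompt_id)
        time.sleep(poll_seconds)


def fetch_base64(filename: str) -> str:
    """Download the finished image via /view and return it Base64-encoded."""
    url = f"https://{DOMAIN}/view?filename={filename}&subfolder=&type=output"
    with urllib.request.urlopen(url) as resp:
        return base64.b64encode(resp.read()).decode()


def generate(workflow_path: str = "flux_workflow.json") -> str:
    """Full flow: submit the workflow, wait for completion, return Base64."""
    with open(workflow_path) as f:
        workflow = json.load(f)
    return fetch_base64(wait_for_image(submit_prompt(workflow)))
```

Call `print(generate())` after setting DOMAIN to run the full flow.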


Managing Your Endpoint

Stop the endpoint (pause billing):
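Assuming a stop route of this shape (the exact path should be confirmed in the YottaLabs API reference):

```shell
curl -s -X POST "https://api.yottalabs.ai/v1/serverless/endpoints/$ENDPOINT_ID/stop" \
  -H "x-api-key: $YOTTA_API_KEY"
```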

Restart:
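Again assuming a symmetric, hypothetical start route:

```shell
curl -s -X POST "https://api.yottalabs.ai/v1/serverless/endpoints/$ENDPOINT_ID/start" \
  -H "x-api-key: $YOTTA_API_KEY"
```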

Check the generation queue:
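`/queue` is a standard ComfyUI route that lists running and pending jobs:

```shell
curl -s "https://$DOMAIN/queue" \
  | jq '{running: (.queue_running | length), pending: (.queue_pending | length)}'
```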


Common Issues and Fixes

Model weights not downloaded yet (generation fails immediately)

Check the download log and wait:

The full download is ~24 GB and takes 10–20 minutes on first start.

403 on HuggingFace during download

Your HF_TOKEN does not have approved access to black-forest-labs/FLUX.1-dev. Visit the model page and request access, then recreate the endpoint with the correct token.

node_errors in the prompt response

A model file is missing or named differently. Check that all files exist under /home/ubuntu/ComfyUI/models/:
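From a shell inside the container, the standard ComfyUI subdirectory layout can be checked like this (the expected filenames are assumptions apart from ae.safetensors, which this image's ENABLE_FLUX_VAE flag controls):

```shell
ls -lh /home/ubuntu/ComfyUI/models/unet/  # expect flux1-dev.safetensors
ls -lh /home/ubuntu/ComfyUI/models/clip/  # expect clip_l and t5xxl encoders
ls -lh /home/ubuntu/ComfyUI/models/vae/   # expect ae.safetensors
```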

Generation is slow (>60s per image)

Normal for FLUX.1-Dev at 20 steps on FP16. Options to speed up:

  • Reduce steps to 10–15 (quality tradeoff)

  • Use the Nunchaku-optimized image instead: yottalabsai/flux1.dev:comfyui-nunchaku-cuda12.8.1-ubuntu22.04-2025102101


Performance Reference

| Steps | Resolution | RTX 5090 (approx.) |
| --- | --- | --- |
| 10 | 512×512 | ~8s |
| 20 | 1024×1024 | ~25s |
| 20 | 1024×1024 (Nunchaku) | ~12s |


Nunchaku Variant (Faster Inference)

YottaLabs also provides a quantized version of the image with Nunchaku acceleration, which roughly halves generation time on RTX 5090:
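```
yottalabsai/flux1.dev:comfyui-nunchaku-cuda12.8.1-ubuntu22.04-2025102101
```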

Everything else — environment variables, ports, API calls — remains identical. Switch the image name and redeploy.


Next Steps

  • LoRA support: Place .safetensors LoRA files under ComfyUI/models/loras/ and add a LoraLoader node to the workflow

  • Batch generation: Set batch_size > 1 in the EmptyLatentImage node to generate multiple images per request

  • Different resolutions: FLUX.1-Dev supports arbitrary resolutions. Common choices: 768×1344 (portrait), 1344×768 (landscape), 1024×1024 (square)

  • img2img: Load a source image with a LoadImage node, convert it to a latent with VAEEncode before the KSampler, and set denoise < 1.0
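As a sketch of the LoRA wiring mentioned above (the node id, LoRA filename, and strengths are hypothetical), a LoraLoader node sits between the model/CLIP loaders and the nodes that consume them; downstream nodes then reference its model output ["10", 0] and CLIP output ["10", 1]:

```json
"10": {
  "class_type": "LoraLoader",
  "inputs": {
    "model": ["1", 0],
    "clip": ["2", 0],
    "lora_name": "my_lora.safetensors",
    "strength_model": 0.8,
    "strength_clip": 0.8
  }
}
```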
