location-arrowGuide to Elastic Deployment

Introduction

This comprehensive guide explains how to use YottaLabs’ Elastic Deployment feature to quickly and reliably deploy Qwen3-0.6B model as containerized services in a production environment.


1. Background

Yotta Labs is a one-stop platform for AI application development. This Elastic Deployment feature is specifically designed for low-latency, high-concurrency inference services, supporting:

  • Auto-scaling

  • Self-healing (Fault tolerance)

  • Multi-GPU scheduling

Unlike traditional static deployments, Elastic Deployment abstracts computing units (Workers) into stateless service instances that can be dynamically scaled.


2. Core Deployment Steps

Step 1: Preparation – API Key & Console Access

img

Before deploying, you must obtain your identity credentials:

  1. Log in to the YottaLabs Console.

  2. Navigate to Settings → Access Keys.

  3. Generate and copy your API Key. This key is required for all automated operations and management interfaces.

  4. Return to the main menu and click on Elastic Deployment to begin.

Step 2: Image Configuration – Public vs. Private Registries

img

The image source is the foundation of your deployment. Yotta Labs supports two categories:

  • Docker Hub (Public): Simply provide the full image name, e.g., myorg/qwen-vllm:2.3.1-cu121.

  • Other (Private): For private registries , you must provide the registry URL and valid Credentials.

Step 3: Service Modes & GPU Allocation

Choose modes that matches your business logic:

SERVICE MODE

BEST USE CASE

LOGIC

ALB

Web APIs / Chatbots

Automatically distributes traffic.

Queue

Batch Processing

Workers pull tasks from a queue; ideal for non-real-time tasks.

Custom

Advanced Integration

Opens raw ports for users with their own load balancers.

For more information here, see our official docs of service mode

GPU Selection:

You can specify the GPU Type (e.g. RTX 4090), GPU Count per worker (1–3), and VRAM requirements.

Step 4: Elastic Mechanism – Worker Lifecycle

  • Declarative Scaling: If you set the target to 2 Workers, YottaLabs ensures 2 Workers are always running.

  • Automatic Recovery: If a Worker crashes/is terminated accidentally, the system will automatically spins up a new instance within seconds.


3. Deployment and validation

Once the status changes to Running, YottaLabs generates a unique HTTPS URL. Copy the url provided in the box above.

Use a curl command to test the service, for example:

A successful response confirms that the VLLM inference engine, Tokenizer, and CUDA acceleration path are all functioning correctly.

Last updated

Was this helpful?