awsLLM Serving on Yotta Labs with AWS Trainium

This tutorial walks you through deploying an LLM inference service on the Yotta Labs platform using an AWS Trainium Pod. Reference: AWS Neuron Docs — vLLM on Neuronarrow-up-right (Neuron SDK 2.29.0)

1. Platform Overview & Hardware Selection

Log in to https://yottalabs.aiarrow-up-right, go to the Pods page, and click Deploy.

In the hardware selector, switch to the AWS tab (NVIDIA is the default):

NVIDIA  |  [AWS]   ← click to switch

Select Trainium1:

Spec
Value

VRAM

32 GB

RAM

32 GB

vCPU

8

Price

$1.40 / Hr

2. AWS Trainium vs NVIDIA GPU — Key Differences

Aspect
NVIDIA GPU
AWS Trainium1

Compute backend

CUDA / cuDNN

AWS Neuron SDK (neuronx-distributed-inference)

Compilation

JIT / eager

AOT compilation — model is compiled to .neff format on first run

First startup time

1–3 min (download weights only)

~20 min (includes Neuron AOT compilation)

Subsequent startup

Same as above

Seconds–1 min (compiled artifacts cached)

vLLM integration

Native (pip install vllm)

vllm-neuron plugin — pre-installed in the Yotta Neuron image

Backend flag

None (cuda by default)

VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference

Parallelism

--tensor-parallel-size = # GPUs

--tensor-parallel-size = # NeuronCores

Memory mgmt

torch.cuda.*

torch_xla + Neuron Runtime

Monitoring

nvidia-smi, nvtop

neuron-top, neuron-ls, neuron-monitor

Compiled model cache

Not needed

Set NEURON_COMPILED_ARTIFACTS to persist across restarts

Weight sharding

Re-sharded every run

Can be persisted with save_sharded_checkpoint: true

CUDA libs

Required

Not present — libcuda.so warnings on import are expected and harmless

Key Takeaways

  • The vllm-neuron plugin exposes the same OpenAI-compatible /v1 API as standard vLLM — no client-side changes needed.

  • The Neuron plugin is pre-installed in the yottalabsai/pytorch-inference-vlm-neuronx image. No manual installation required.

  • The Triton warning (0 active driver(s) found) and libcuda.so import warning are expected on Trainium and do not affect inference.

3. Launching a Trainium Pod

3.1 Select the Image

Field
Value

Image Source

Docker Hub

Image Type

Public Image

Image

yottalabsai/pytorch-inference-vlm-neuronx:0.13.0-neuronx-py312-sdk2.27.1-ubuntu24.04

This image includes:

  • torch-neuronx 2.9.0 + neuronxcc (Neuron compiler)

  • neuronx-distributed (NxD Inference library)

  • vllm 0.13.0 with the vllm-neuron plugin pre-installed and auto-activated

  • transformers 4.56.2, torch_xla 2.9.0

3.2 Ports

Expose these ports in addition to the defaults (22, 8080):

Port
Service

8000

vLLM OpenAI-compatible API

8888

JupyterLab

3.3 Deploy

Review the pricing summary (~$1.40/Hr + storage) and click Deploy. The Pod reaches Running state in 1–2 minutes (image pull).

4. Connecting & Verifying the Environment

4.1 SSH into the Pod

4.2 Activate the Conda Environment

circle-exclamation

To make this permanent across SSH sessions:

4.3 Verify Neuron Devices

Expected neuron-ls output:

4.4 Verify All Required Packages

Run the following one-shot check:

Expected output (DeprecationWarnings on neuronx_distributed are normal and can be ignored):

circle-exclamation

5. Using the Prestarted vLLM Service

Reference: Quickstart: Serve models online with vLLM on Neuronarrow-up-right

circle-exclamation

5.1 Check if vLLM is already running

Run:

If you see a response like:

Then the service is already running.

5.2 Confirm the active model

The active model is:

5.3 Run inference (curl)

Quick Reference

Verified Package Versions

Package
Version

torch

2.9.0+cu128

torch_neuronx

2.9.0.2.11.19912

torch_xla

2.9.0

neuronx_distributed

(installed, no __version__)

transformers

4.56.2

accelerate

1.13.0

vllm

0.13.0 (with vllm-neuron plugin)

Reference: AWS Neuron Docs — vLLM on Neuronarrow-up-right

Last updated

Was this helpful?