Vision-Language-Action (VLA)

A unified end-to-end model that sees the road, understands language, and acts. It is trained on some of the world's most challenging driving environments so that it can generalize to roads far beyond them.

Packed with everything you need

End-to-End Driving Policy

From raw camera pixels to steering commands — no hand-crafted rules, no modular pipelines. One unified model drives.

Language-Guided Navigation

Natural language instructions are fused directly into the action policy, enabling passengers to direct the vehicle conversationally.
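
As a rough illustration of what fusing an instruction into the action policy can mean, the sketch below concatenates a visual feature vector with an instruction feature vector and maps the result to a steering and throttle command through a linear head. Everything in it is hypothetical: the `Action` type, the `fuse` and `linear_policy` functions, the feature values, and the weights are placeholders, and concatenation is only one simple fusion scheme, not a description of the production model.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Action:
    steering: float  # normalized to [-1, 1]; negative means left (assumed convention)
    throttle: float  # normalized to [0, 1]


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def fuse(visual: Sequence[float], text: Sequence[float]) -> List[float]:
    """Late fusion by concatenation: one joint vector for the policy head."""
    return [*visual, *text]


def linear_policy(features: Sequence[float],
                  steer_w: Sequence[float],
                  throttle_w: Sequence[float]) -> Action:
    """Toy linear head; a real policy would be a learned network."""
    if not (len(features) == len(steer_w) == len(throttle_w)):
        raise ValueError("feature and weight lengths must match")
    steer = sum(f * w for f, w in zip(features, steer_w))
    throttle = sum(f * w for f, w in zip(features, throttle_w))
    return Action(clamp(steer, -1.0, 1.0), clamp(throttle, 0.0, 1.0))


# Hypothetical features: two from a camera encoder, two from an instruction encoder.
visual = [0.2, -0.1]
text = [1.0, 0.0]  # stand-in encoding of an instruction such as "turn left"
action = linear_policy(fuse(visual, text),
                       steer_w=[0.5, 0.5, -2.0, 0.0],
                       throttle_w=[1.0, 0.0, 0.25, 0.0])
```

With these placeholder weights the instruction feature dominates the steering output, which is clamped to full left, while the throttle stays moderate. The point is only that the instruction enters the same vector the policy acts on, rather than being handled by a separate rule layer.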

Open-Source Foundation

Built on publicly available VLA checkpoints, reducing training CapEx by an order of magnitude versus proprietary model development.

Continuous Fine-Tuning

New dashcam footage from our global taxi network flows into nightly fine-tuning cycles, keeping the model sharp on emerging edge cases.

Brain-Aligned Perception for Safer Decisions

Unlike purely data-driven deep learning models, our VLA incorporates NeuroAI principles: representations shaped by research on the human visual cortex. As a result, the model's perceptual errors tend to be interpretable and predictable rather than random, which supports safer decisions in unpredictable urban environments.

Adversarially robust visual representations

Human-aligned failure modes

Reduced false-negative pedestrian risk

Robust Against Out-of-Distribution Scenarios

Trained on Lima's chaotic intersections, Cusco's mountain roads, and Cajamarca's rural routes, our VLA has seen conditions that break many other models. Aggressive lane changes, unmapped speed bumps, and tuk-tuks are standard training fare, not edge cases.

Validated in 22+ districts across Lima, Peru

Covers motorized and non-motorized vehicle classes

Benchmarked against YOLOv8, EVA-01, and Detectron2

Powered by NVIDIA GPU Infrastructure

Every VLA training run, every inference call, and every real-time decision aboard our vehicles is accelerated by NVIDIA's GPU platform — the same hardware stack that underlies the world's most capable AI systems.

NVIDIA H100 Training Cluster

Multi-node H100 NVLink clusters reduce full VLA fine-tuning runs from weeks to hours, enabling rapid iteration on new regional driving data.

TensorRT Real-Time Inference

NVIDIA TensorRT optimizes the deployed VLA to run at 30 fps on in-vehicle GPUs, meeting the strict latency requirements of autonomous navigation.
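
The 30 fps target translates into a per-frame time budget of roughly 33 ms (1000 ms / 30). The sketch below computes that budget and checks whether a set of per-stage latencies fits inside it. The stage names and millisecond figures are illustrative assumptions, not measurements from the deployed system.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def fits_budget(stage_latencies_ms: dict, fps: float) -> bool:
    """True if the summed stage latencies fit within one frame interval."""
    return sum(stage_latencies_ms.values()) <= frame_budget_ms(fps)


# Hypothetical per-stage latencies in ms; real numbers depend on model and hardware.
stages = {"preprocess": 4.0, "vla_inference": 22.0, "postprocess": 3.0}

budget = frame_budget_ms(30)               # about 33.3 ms per frame
headroom = budget - sum(stages.values())   # about 4.3 ms left in this example
ok = fits_budget(stages, fps=30)
```

Any stage that grows past the remaining headroom pushes the pipeline below 30 fps, which is why inference time is usually the first thing to optimize.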

CUDA-Accelerated Data Pipeline

Raw dashcam video from thousands of taxis is decoded, augmented, and tokenized using CUDA kernels, feeding a continuous stream of fresh training examples to the model.
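
The decode, augment, and tokenize stages can be pictured as a chain of streaming steps. The sketch below shows only the data flow, in plain CPU-side Python rather than CUDA kernels, and every detail is an assumption made for illustration: the function names, the grayscale frame layout, the brightness augmentation, and the patch-based tokenization are not taken from the production pipeline.

```python
from typing import Iterable, Iterator, List

Frame = List[List[int]]  # grayscale frame as rows of 0-255 pixel values


def decode(raw_frames: Iterable[bytes], width: int) -> Iterator[Frame]:
    """Stand-in for video decoding: reshape raw bytes into rows of pixels."""
    for raw in raw_frames:
        yield [list(raw[i:i + width]) for i in range(0, len(raw), width)]


def augment(frames: Iterable[Frame], gain: float) -> Iterator[Frame]:
    """Brightness augmentation: scale each pixel and clip at 255."""
    for frame in frames:
        yield [[min(255, int(p * gain)) for p in row] for row in frame]


def tokenize(frames: Iterable[Frame], patch: int) -> Iterator[List[List[int]]]:
    """Split each frame into non-overlapping patch x patch tiles, flattened row-major.

    Assumes frame height and width are multiples of `patch`.
    """
    for frame in frames:
        tokens = []
        for top in range(0, len(frame), patch):
            for left in range(0, len(frame[0]), patch):
                tokens.append([frame[r][c]
                               for r in range(top, top + patch)
                               for c in range(left, left + patch)])
        yield tokens


# One hypothetical 4x4 frame with pixel values 0..15.
raw = [bytes(range(16))]
tokens = next(tokenize(augment(decode(raw, width=4), gain=2.0), patch=2))
# tokens == [[0, 2, 8, 10], [4, 6, 12, 14], [16, 18, 24, 26], [20, 22, 28, 30]]
```

Because each stage is a generator, frames flow through one at a time instead of being buffered in full, which mirrors the continuous-stream behavior described above.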