Loading...
Loading...
Loading...
**Video Title:** "Train AI World Models on Tokyo City Data - Complete Cloud GPU Tutorial"
# YouTube Video Script: Train OccWorld on Tokyo 3D City Data with Lambda Cloud
**Video Title:** "Train AI World Models on Tokyo City Data - Complete Cloud GPU Tutorial"
**Thumbnail Text:** "TRAIN AI ON TOKYO 3D" + GPU icon + Lambda logo
**Video Length:** ~15-20 minutes
**Target Audience:** ML engineers, robotics researchers, autonomous vehicle developers
---
## VIDEO OUTLINE
```
0:00 - Hook & Intro
1:30 - What is OccWorld?
3:00 - Project Overview
5:00 - Lambda Cloud Setup
8:00 - Download Models & Data
11:00 - Start Training
14:00 - Monitor & Results
17:00 - Next Steps & Outro
```
---
## SCRIPT
### [0:00 - 1:30] HOOK & INTRO
**[VISUAL: Drone flying through 3D Tokyo cityscape, then cut to terminal with training logs]**
**NARRATOR:**
> What if you could train an AI to predict the future of a 3D world?
>
> That's exactly what OccWorld does - it's a world model that learns to predict how 3D environments will change over time. And today, I'm going to show you how to train it on Tokyo city data using cloud GPUs.
**[VISUAL: Split screen - left shows Gazebo simulation, right shows occupancy prediction visualization]**
> By the end of this video, you'll have a complete pipeline for:
> - Generating training data from 3D city simulations
> - Fine-tuning OccWorld on custom environments
> - Running everything on Lambda Cloud for about 40 bucks
**[VISUAL: Face cam or animated avatar]**
> I'm [YOUR NAME], and let's dive in.
**[TITLE CARD: "Training OccWorld on Tokyo 3D City Data"]**
---
### [1:30 - 3:00] WHAT IS OCCWORLD?
**[VISUAL: OccWorld paper screenshot, architecture diagram]**
**NARRATOR:**
> OccWorld is a 3D world model published at ECCV 2024. It takes a sequence of 3D occupancy grids - basically voxelized representations of the world - and predicts what the future will look like.
**[VISUAL: Animation showing occupancy grid prediction over time]**
> Think of it like video prediction, but in 3D. You give it the last few frames of a 3D scene, and it predicts the next several frames.
**[VISUAL: Diagram showing: Cameras + LiDAR → Occupancy Grid → World Model → Future Prediction]**
> This is incredibly useful for autonomous vehicles and robots. If you can predict how the world will change, you can plan safer paths.
**[VISUAL: nuScenes dataset examples]**
> The original OccWorld was trained on nuScenes - a dataset of real driving data. But what if you want to train on YOUR environment? Maybe a warehouse, a factory, or in our case - a Japanese city.
**[VISUAL: Tokyo PLATEAU 3D model flythrough]**
> That's where this project comes in.
---
### [3:00 - 5:00] PROJECT OVERVIEW
**[VISUAL: GitHub repo page, scrolling through files]**
**NARRATOR:**
> We've built a complete pipeline that connects three things:
**[VISUAL: Diagram with three boxes connected by arrows]**
> First - Tokyo PLATEAU. This is a free, incredibly detailed 3D model of Tokyo released by the Japanese government. We're talking building-level accuracy for all 23 wards.
**[VISUAL: PLATEAU website, 3D model examples]**
> Second - Gazebo simulation. We load these Tokyo models into a physics simulator, spawn drones and robots, and record sensor data as they navigate the city.
**[VISUAL: Gazebo with drone flying, sensor visualizations]**
> Third - OccWorld training. We take that simulated data and use it to fine-tune OccWorld, teaching it the patterns of urban environments.
**[VISUAL: Terminal showing training progress]**
> The beauty is - once trained, this model can help real robots navigate real cities. Simulation to reality transfer.
**[VISUAL: File tree structure]**
> Let me show you the project structure real quick:
> - `train.py` - our training script
> - `config/` - configuration files
> - `dataset/` - custom data loader for Gazebo data
> - `scripts/` - automation scripts for everything
> All the code is open source. Link in the description.
---
### [5:00 - 8:00] LAMBDA CLOUD SETUP
**[VISUAL: Lambda Cloud website]**
**NARRATOR:**
> Alright, let's get our hands dirty. We're using Lambda Cloud because they have A100 GPUs for about $1.79 per hour. No long-term commitments, pay by the minute.
**[VISUAL: Screen recording - Lambda Cloud signup]**
> Step one - create an account at cloud.lambda.ai. You'll need a credit card, but you only pay for what you use.
**[VISUAL: SSH key generation in terminal]**
> Step two - set up SSH keys. If you already have one, great. If not, let's generate one:
```bash
ssh-keygen -t ed25519 -C "lambda-occworld" -f ~/.ssh/lambda_key
```
**[VISUAL: Lambda Cloud SSH keys page]**
> Copy your public key and add it to Lambda Cloud in the SSH Keys section.
**[VISUAL: Lambda Cloud instance launch page]**
> Step three - launch an instance. I recommend:
> - 1x A100 80GB for training - about $1.79/hour
> - Or 1x H100 if you want faster training - about $3.29/hour
> Select your SSH key, pick a region close to you, and hit Launch.
**[VISUAL: Waiting animation, then instance ready]**
> It takes about 3-5 minutes to boot. Once it's ready, you'll see an IP address.
**[VISUAL: Terminal - SSH connection]**
> Connect via SSH:
```bash
ssh -i ~/.ssh/lambda_key ubuntu@YOUR_IP_ADDRESS
```
**[VISUAL: Terminal showing successful connection, nvidia-smi output]**
> And boom - we're in. Check out that A100. 80 gigs of VRAM ready to go.
---
### [8:00 - 11:00] DOWNLOAD MODELS & DATA
**[VISUAL: Terminal on Lambda instance]**
**NARRATOR:**
> Now let's set up the project. First, clone the repo:
```bash
git clone https://github.com/YOUR_USERNAME/VeryLargeWeebModel.git
cd VeryLargeWeebModel
```
**[VISUAL: Running setup script]**
> We have an automated setup script that installs everything:
```bash
chmod +x scripts/lambda_setup.sh
./scripts/lambda_setup.sh
```
**[VISUAL: Script output, progress bars]**
> This installs PyTorch, the MMDetection3D stack, and all the dependencies. Takes about 5 minutes.
**[VISUAL: Download script options]**
> Next, let's download the pretrained models and Tokyo data:
```bash
./scripts/download_and_prepare_data.sh --all
```
**[VISUAL: Download progress for PLATEAU data]**
> This downloads:
> - Tokyo PLATEAU 3D models - about 2 gigs of building meshes
> - Creates the Gazebo world files
> - Sets up the data directories
**[VISUAL: Tsinghua Cloud website]**
> Now here's the one manual step. The OccWorld pretrained weights are hosted on Tsinghua Cloud, which doesn't support direct wget downloads.
> Go to this URL - I'll put it in the description - and download:
> - `vqvae_epoch_125.pth`
> - `occworld_latest.pth`
**[VISUAL: SCP command in terminal]**
> Then upload them to your Lambda instance:
```bash
scp -i ~/.ssh/lambda_key ~/Downloads/occworld_latest.pth \
ubuntu@YOUR_IP:~/VeryLargeWeebModel/pretrained/occworld/
```
**[VISUAL: File listing showing pretrained models]**
> Perfect. We have everything we need.
---
### [11:00 - 14:00] START TRAINING
**[VISUAL: Terminal with tmux]**
**NARRATOR:**
> Before we start training, super important tip - use tmux. If your SSH connection drops, your training will keep running.
```bash
tmux new -s training
```
**[VISUAL: Activating conda environment]**
> Activate the environment:
```bash
conda activate occworld
cd ~/VeryLargeWeebModel
```
**[VISUAL: Training command]**
> And start training:
```bash
python train.py \
--config config/finetune_tokyo.py \
--work-dir ~/checkpoints/occworld_tokyo
```
**[VISUAL: Training logs starting, loss values appearing]**
> And we're training! Let's break down what's happening:
>
> The model loads the pretrained OccWorld weights, then fine-tunes on our Tokyo Gazebo data. We're using a learning rate of 1e-4 - lower than training from scratch because we want to preserve the knowledge from nuScenes.
**[VISUAL: Config file highlights]**
> The config file sets up:
> - 4 history frames, 6 future frames to predict
> - Extended altitude range for drone data - up to 150 meters
> - Both drone and ground robot data
**[VISUAL: Training loss curve graphic]**
> Training takes about 12-24 hours depending on your dataset size. But you'll see the loss dropping within the first few epochs.
**[VISUAL: tmux detach command]**
> To detach from tmux and let it run:
```
Ctrl+B, then D
```
> You can close your laptop, grab coffee, come back tomorrow. The training keeps going.
---
### [14:00 - 17:00] MONITOR & RESULTS
**[VISUAL: Multiple terminal windows]**
**NARRATOR:**
> Let's talk about monitoring. SSH back in and check GPU usage:
```bash
nvtop
```
**[VISUAL: nvtop showing GPU utilization]**
> You should see near 100% GPU utilization. If it's low, you might have a data loading bottleneck.
**[VISUAL: TensorBoard in browser]**
> For detailed metrics, start TensorBoard:
```bash
tensorboard --logdir ~/checkpoints --port 6006 --bind_all
```
> Then open your browser to `http://YOUR_IP:6006`
**[VISUAL: TensorBoard graphs - training loss, validation loss, learning rate]**
> Here you can see:
> - Training loss going down
> - Validation loss - watch for overfitting
> - Learning rate schedule
**[VISUAL: Reattaching to tmux]**
> To check the actual training output:
```bash
tmux attach -t training
```
**[VISUAL: Training completion message]**
> When training finishes, you'll see checkpoints saved in `~/checkpoints/occworld_tokyo/`
**[VISUAL: SCP downloading checkpoints]**
> Download your trained model:
```bash
scp -i ~/.ssh/lambda_key -r \
ubuntu@YOUR_IP:~/checkpoints/occworld_tokyo/best.pth \
./my_tokyo_model.pth
```
**[VISUAL: Lambda Cloud terminate button - BIG AND RED]**
> And SUPER IMPORTANT - terminate your instance when you're done! Lambda charges by the minute. Don't leave it running overnight unless you're training.
**[VISUAL: Cost breakdown graphic]**
> My training run took about 24 hours on an A100 80GB. Total cost: about $43. Not bad for a custom world model.
---
### [17:00 - END] NEXT STEPS & OUTRO
**[VISUAL: Inference visualization - predicted vs actual occupancy]**
**NARRATOR:**
> So what can you do with this trained model?
**[VISUAL: List with icons]**
> - Run inference on new Tokyo environments
> - Use it for robot path planning
> - Fine-tune further on real sensor data
> - Transfer to similar urban environments
**[VISUAL: GitHub repo, documentation pages]**
> All the code, documentation, and scripts are open source. Links in the description:
> - GitHub repo
> - Training guide
> - Lambda deployment guide
> - OccWorld paper
**[VISUAL: Face cam / avatar]**
> If you found this useful, smash that like button and subscribe for more robotics and ML content.
> Got questions? Drop them in the comments. I read everything.
> And if you train your own model on a cool environment - a factory, a warehouse, your city - share it! Tag me on Twitter/X.
**[VISUAL: End screen with subscribe button, related videos]**
> Thanks for watching, and I'll see you in the next one.
**[END CARD: 10 seconds with subscribe animation and video suggestions]**
---
## B-ROLL SUGGESTIONS
1. **Tokyo cityscape drone footage** (stock or PLATEAU visualization)
2. **Gazebo simulation recordings** - drones flying, rovers driving
3. **Terminal recordings** - all commands with clear, large fonts
4. **Architecture diagrams** - animated arrows showing data flow
5. **TensorBoard graphs** - real training run data
6. **GPU visualization** - nvidia-smi, nvtop
7. **3D occupancy grid visualizations** - from OccWorld paper
8. **Cost calculator animation** - showing hourly rates adding up
---
## DESCRIPTION TEMPLATE
```
Train AI World Models on Tokyo 3D City Data - Complete Cloud GPU Tutorial
In this video, I show you how to train OccWorld - a 3D world model for autonomous vehicles - on Tokyo city simulation data using Lambda Cloud GPUs.
📚 RESOURCES:
- GitHub Repo: [LINK]
- Training Guide: [LINK]
- Lambda Cloud: https://cloud.lambda.ai
- OccWorld Paper: https://arxiv.org/abs/2311.16038
- Tokyo PLATEAU Data: https://www.geospatial.jp/ckan/dataset/plateau-tokyo23ku
- Pretrained Models: https://cloud.tsinghua.edu.cn/d/ff4612b2453841fba7a5/
⏱️ TIMESTAMPS:
0:00 - Introduction
1:30 - What is OccWorld?
3:00 - Project Overview
5:00 - Lambda Cloud Setup
8:00 - Download Models & Data
11:00 - Start Training
14:00 - Monitor & Results
17:00 - Next Steps
💰 ESTIMATED COSTS:
- A100 80GB: $1.79/hr (~$43 for 24hr training)
- H100: $3.29/hr (~$40 for 12hr training)
🔧 COMMANDS USED:
```bash
# SSH into Lambda
ssh -i ~/.ssh/lambda_key ubuntu@YOUR_IP
# Setup
git clone https://github.com/USER/VeryLargeWeebModel.git
cd VeryLargeWeebModel
./scripts/lambda_setup.sh
./scripts/download_and_prepare_data.sh --all
# Train
tmux new -s training
conda activate occworld
python train.py --config config/finetune_tokyo.py --work-dir ~/checkpoints/occworld_tokyo
```
#MachineLearning #AutonomousVehicles #WorldModels #Robotics #CloudGPU
```
---
## THUMBNAIL DESIGN
**Layout:**
- Left 1/3: Stylized Tokyo skyline (purple/cyan gradient)
- Center: Large text "TRAIN AI" (white, bold)
- Right 1/3: GPU chip with glow effect
- Bottom: "ON TOKYO 3D DATA" (smaller text)
- Corner: Lambda Labs logo (small)
**Colors:** Dark background (#1a1a2e), cyan accents (#00d9ff), purple highlights (#9d4edd)
---
## TAGS
```
occworld, world model, 3d prediction, autonomous driving, robotics, machine learning, deep learning, pytorch, lambda cloud, gpu training, cloud gpu, tokyo, japan, plateau, gazebo simulation, occupancy prediction, transformer, computer vision, 3d vision, point cloud, lidar, sensor fusion, sim2real, simulation, urban environment, city model, digital twin, ai training, ml tutorial, gpu cloud
```
_Collection of CVPR 2017, including titles, links, authors, abstracts and my own comments. Created by Michael Liang, NUDT. All my work are based on http://www.cvpapers.com/cvpr2017.html
Notes about some of the important parts of the system.
**Authors**: Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive
title: Loss Functions