Why end-to-end?
- Encoding a human value function is very hard. (Here the "value function" means the loss terms and their weights.)
- The real world is filled with tiny trolley problems.
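- Multi-module pipelines lose information when it is passed across interfaces, and the interfaces are ill-conditioned (decisions jump abruptly, and long-tail samples cannot be handled).
- When several obstacles pass in succession, it is hard to handle each one individually. [Example: chickens and geese]
- A single module is simple, scales easily, and handles the long tail more easily.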
- Homogeneous compute with deterministic latency
- More in line with the bitter lesson
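The "decisions jump" point above can be illustrated with a toy sketch. Everything here is a made-up assumption for illustration: the function names, the 0.5 threshold, and the idea that the brake command is a single number. It only shows how a hard yes/no interface discards confidence and makes the downstream output discontinuous.

```python
def modular_brake(p_obstacle: float, threshold: float = 0.5) -> float:
    """Hypothetical two-module pipeline: perception emits one bit, planning sees only that bit."""
    obstacle_detected = p_obstacle >= threshold  # the confidence value is discarded here
    return 1.0 if obstacle_detected else 0.0


def continuous_brake(p_obstacle: float) -> float:
    """Hypothetical stand-in for a single model: the command varies smoothly with the evidence."""
    return p_obstacle


if __name__ == "__main__":
    for p in (0.49, 0.51):
        print(p, modular_brake(p), continuous_brake(p))
```

A change of 0.02 in the perception score flips the modular output from no braking to full braking, while the continuous mapping moves by the same 0.02.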
Main Challenges of learning pixels → control
- Curse of dimensionality
- Interpretability and safety guarantees
- Evaluation
1. Curse of dimensionality
- Input context length of 2 billion tokens:
- 7 cameras × 36 FPS × 5 megapixels × 30 s history / (5×5-pixel patch)
- Navigation maps and route for next few miles
- 100 Hz kinematic data such as speed, IMU, odometry, etc.
- 48 kHz audio data
- Output tokens:
- Next steering and acceleration
- Need to learn the correct causal mapping of 2 billion tokens -> 2 tokens
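As a rough check of the input size, the figures above can be multiplied out. This is a back-of-the-envelope sketch, not the talk's own calculation: it assumes one token per 5×5 patch, and (purely as an assumption) one token per sample for a single kinematic channel and a single audio channel over the same 30 s window. Map and route tokens are not counted.

```python
# Figures taken from the notes above.
CAMERAS = 7
FPS = 36
PIXELS_PER_FRAME = 5_000_000  # 5 megapixels
HISTORY_S = 30
PATCH_PIXELS = 5 * 5          # assumed: one token per 5x5-pixel patch


def video_tokens() -> int:
    """Patch tokens across all cameras for the full history window."""
    return CAMERAS * FPS * HISTORY_S * PIXELS_PER_FRAME // PATCH_PIXELS


def kinematic_samples(rate_hz: int = 100) -> int:
    """Samples of one kinematic channel over the window (assumed one token each)."""
    return rate_hz * HISTORY_S


def audio_samples(rate_hz: int = 48_000) -> int:
    """Samples of one audio channel over the window (assumed one token each)."""
    return rate_hz * HISTORY_S


if __name__ == "__main__":
    total = video_tokens() + kinematic_samples() + audio_samples()
    print(f"video:     {video_tokens():,}")
    print(f"kinematic: {kinematic_samples():,}")
    print(f"audio:     {audio_samples():,}")
    print(f"total:     {total:,}")
```

Under these assumptions the video term alone is about 1.5 billion tokens and dominates everything else, which matches the order of magnitude of the quoted 2 billion.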
The Tesla fleet can provide 500 years of driving data every single day.
But most driving data is boring.
- Sophisticated trigger-based data collection
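One way to picture trigger-based collection is as a set of predicates evaluated on each clip, keeping only the clips where at least one fires. The sketch below is hypothetical throughout: the `Clip` fields, trigger names, and thresholds are invented for illustration and are not Tesla's actual triggers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Clip:
    """Hypothetical per-clip summary signals."""
    clip_id: str
    max_decel_mps2: float    # peak braking deceleration in the clip
    driver_intervened: bool  # driver took over from the system
    min_gap_m: float         # closest distance to any other road user


Trigger = Callable[[Clip], bool]

# Illustrative triggers with made-up thresholds.
TRIGGERS: Dict[str, Trigger] = {
    "hard_brake": lambda c: c.max_decel_mps2 > 4.0,
    "intervention": lambda c: c.driver_intervened,
    "close_call": lambda c: c.min_gap_m < 1.0,
}


def fired_triggers(clip: Clip, triggers: Dict[str, Trigger] = TRIGGERS) -> List[str]:
    """Names of all triggers that fire on this clip."""
    return [name for name, trigger in triggers.items() if trigger(clip)]


def select_clips(clips: List[Clip], triggers: Dict[str, Trigger] = TRIGGERS) -> List[Clip]:
    """Keep only clips on which at least one trigger fires; boring clips are dropped."""
    return [c for c in clips if fired_triggers(c, triggers)]


if __name__ == "__main__":
    clips = [
        Clip("boring", 0.8, False, 12.0),
        Clip("brake", 5.2, False, 6.0),
        Clip("takeover", 1.0, True, 0.7),
    ]
    for c in select_clips(clips):
        print(c.clip_id, fired_triggers(c))
```

Keeping the triggers in a named dictionary makes it easy to add or retire a trigger and to record which one caused a clip to be uploaded.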