Path to AGI
Incomplete Ideas towards Artificial General Intelligence
What changed in the last decade?
- Large-scale pre-training data.
- Bigger models with more parameters.
- More compute (FLOPs).
- General-purpose learning algorithms and architectures over task-specific methods, e.g., Transformers.
- Old-school optimization techniques (e.g., Gradient Descent) and loss functions (e.g., Cross-Entropy) are enough; a minimal sketch follows below.
- Fewer human-engineered labels and features, e.g., Self-Supervised Learning from raw data.
“The Bitter Lesson” by Richard Sutton (2019).
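As a rough illustration of the last two bullets, here is a minimal PyTorch sketch of self-supervised next-token prediction trained with plain gradient descent and a cross-entropy loss. The tiny model and the random token stream are illustrative assumptions, not a real pre-training setup.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: a tiny vocabulary and a random token stream
# stand in for large-scale raw pre-training data.
vocab_size, d_model, seq_len = 256, 64, 32
tokens = torch.randint(0, vocab_size, (8, seq_len + 1))  # batch of raw sequences

# A deliberately minimal "language model": embedding -> linear head.
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # old-school gradient descent
loss_fn = nn.CrossEntropyLoss()                    # old-school loss function

# Self-supervision: the labels come from the data itself (shift by one).
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)                             # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```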
What is Missing
- Sparsity for less expensive models (better resource allocation / compute per token), e.g., Sparse Attention; see the windowed-attention sketch after this list.
- Dynamic compute (FLOPs per token) depending on the difficulty of the task, e.g., Routing Networks / Mixture of Experts (MoE); a minimal router sketch follows this list as well.
- Embodiment and interaction with the physical world, e.g., robotics and simulation environments (Embodied AI + Sim2Real).
- Unsupervised and self-supervised learning for generalization and one-shot learning.
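To make the sparsity bullet concrete, here is a minimal sketch of local (windowed) causal attention: each token attends only to a small window of recent positions rather than the whole sequence. The dense mask below only illustrates the access pattern; an efficient kernel would skip the masked scores entirely. Window size and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_causal_attention(q, k, v, window=4):
    """Sparse attention sketch: each position attends only to the last
    `window` positions (including itself), not the whole sequence."""
    t = q.size(-2)
    i = torch.arange(t).unsqueeze(1)   # query positions
    j = torch.arange(t).unsqueeze(0)   # key positions
    # Allowed: causal (j <= i) AND inside the local window (i - j < window).
    mask = (j <= i) & (i - j < window)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 16, 8)      # (batch, seq, head_dim), illustrative
out = local_causal_attention(q, k, v)  # O(seq * window) useful work per layer
```

For the dynamic-compute bullet, a minimal top-1 Mixture-of-Experts router: a learned gate sends each token to a single expert, so the active FLOPs per token stay roughly constant while total capacity scales with the expert count. The expert type and counts are assumptions for illustration; real MoE layers add load-balancing losses and capacity limits.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Minimal sketch: route each token to a single expert (top-1 gating)."""
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x):                       # x: (tokens, d_model)
        weights = self.gate(x).softmax(dim=-1)  # routing probabilities
        best = weights.argmax(dim=-1)           # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = best == e                     # tokens routed to expert e
            if sel.any():
                # Scale by the gate weight so the gate receives gradient.
                out[sel] = expert(x[sel]) * weights[sel, e].unsqueeze(-1)
        return out

x = torch.randn(10, 64)                         # 10 tokens, illustrative
y = Top1MoE()(x)                                # each token used one expert
```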
Pitfalls
- Human-Engineered Feature Methods
- Task-Specific Datasets for Training
- Human-Designed Heuristics and Domain-Knowledge Methods
- Anthropomorphizing AI
What GPT-2 Does
GPT-2 learns a probability distribution over a high-dimensional space; the model can then sample from this distribution, which represents the training data. The training task is generalized: predict the next frame's representation (a token embedding).
During inference, we sample auto-regressively from this learned distribution; in other words, we generate new data by sampling from it. Instead of sampling word tokens, we sample image tokens, which can then be classified into categories of actions, objects, etc. A minimal sampling loop is sketched below.
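A minimal sketch of that autoregressive sampling loop, assuming a generic `model` that maps a token prefix of shape (batch, seq) to logits of shape (batch, seq, vocab); whether the ids index word tokens or image tokens does not change the loop. The stub model at the end is a placeholder for illustration.

```python
import torch

@torch.no_grad()
def sample(model, prefix, n_new, temperature=1.0):
    """Autoregressive sampling: repeatedly draw the next token from the
    learned distribution and feed it back in as context."""
    tokens = prefix.clone()                    # (1, prefix_len) of token ids
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]       # distribution over next token
        probs = (logits / temperature).softmax(dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens                              # prefix plus sampled continuation

# Usage with any (batch, seq) -> (batch, seq, vocab) model; a random-logit
# stub stands in here for a trained network.
class Stub(torch.nn.Module):
    def forward(self, t):
        return torch.randn(t.size(0), t.size(1), 256)

out = sample(Stub(), torch.zeros(1, 1, dtype=torch.long), n_new=8)
```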