Lectures & Resources
Work through the modules in order. Each module includes lecture slides, recordings, and recommended readings.
Module 1: Introduction & Prerequisites
| Topic | Slides | Recording |
|---|---|---|
| Course Overview & Introduction to Agents | YouTube | |
| RL Review | YouTube | |
| LLM Review | YouTube | |
| VLM Review | YouTube | |
Reinforcement Learning
- Reinforcement Learning — Sutton & Barto (free textbook)
- Stanford Deep RL Course
- Berkeley Deep RL Course
Large Language Models
- Utah NLP Course
- Stanford NLP Course
- HuggingFace LLM Tutorial
- RLHF Book
- Attention is All You Need
- The Annotated Transformer
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- OLMo 3
Vision-Language Models
- Berkeley Large Scale Vision and Language Models
- Stanford Deep CV Course
- Matt Gormley Generative Model Course
- Luke Zettlemoyer — Mixed-modal Language Modeling Keynote
- CVPR 2025 Saturday Keynote — Llama Herd of Models
- CLIP: Learning Transferable Visual Models from Natural Language Supervision
- Flamingo: A Visual Language Model for Few-Shot Learning
- LLaVA: Visual Instruction Tuning
Module 2: Agent Frameworks
| Topic | Slides | Recording |
|---|---|---|
| Agent Frameworks | YouTube | |
Readings
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- ReAct: Synergizing Reasoning and Acting in Language Models
- Reflexion: Language Agents with Verbal Reinforcement Learning
- Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Module 5: Code Agents
| Topic | Slides | Recording |
|---|---|---|
| Coding Agents | Not available | |
Readings
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
- How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
- CodeNav: Beyond tool-use to using real-world codebases with LLM agents
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
- Executable Code Actions Elicit Better LLM Agents
- D3: A Dataset for Training Code LMs to Act Diff-by-Diff
- LocAgent: Graph-Guided LLM Agents for Code Localization
- When “Correct” Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?
Module 7: Assistant Agents
| Topic | Slides | Recording |
|---|---|---|
| Assistant Agents | YouTube | |
Readings
- GAIA: a benchmark for General AI Assistants
- AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
Module 9: Computer Use
| Topic | Slides | Recording |
|---|---|---|
| Computer Use | YouTube | |
Readings
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- AndroidEnv: A Reinforcement Learning Platform for Android
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- Mind2Web: Towards a Generalist Agent for the Web
- Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning
- A data-driven approach for learning to control computers
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- OpenCUA: Open Foundations for Computer-Use Agents
- WebGPT: Browser-assisted question-answering with human feedback
- Computer Use Survey - A Visual Survey of Computer Use Agents
Module 10: Robotics
| Topic | Slides | Recording |
|---|---|---|
| Robotics | YouTube | |
Readings
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- Code as Policies: Language Model Programs for Embodied Control
- π0.5: a Vision-Language-Action Model with Open-World Generalization
- π0.6: a VLA That Learns From Experience
- Emergence of Human to Robot Transfer in Vision-Language-Action Models
- Gemini Robotics: Bringing AI into the Physical World
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models
- SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
- R3M: A Universal Visual Representation for Robot Manipulation
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots