Multimodal LLM Agents — A Self-Paced Course

This is a free, self-paced version of CS 6960: Multimodal LLM Agents, a graduate course taught by Kenneth Marino at the University of Utah. All lecture slides, readings, and exercises are available here.


Course Description

This course explores the rapidly developing area of multimodal large language models in embodied settings. Students learn the foundations of reinforcement learning and large language models, then build on them to understand how large-scale models can be deployed in multimodal environments.

Topics include control flow and scaffolding for agents (ReAct, tool use), coding agents, game-playing agents, computer use, and robotics. The course combines lecture slides, paper readings, and programming exercises.


Topics Covered

Module                        Topics
Introduction & Prerequisites  Course overview, RL basics, LLM basics, VLM basics
Agent Frameworks              ReAct, chain-of-thought, reflection, reasoning
Retrieval and Memory          RAG, dense retrieval, memory-augmented agents
Tool Use                      Tool-augmented LMs, Toolformer, function calling, MCP
Code Agents                   SWE-bench, coding agent systems, code execution
Agent Evaluation              Benchmarks, evaluation methodology, LLM-as-judge
Assistant Agents              General-purpose agents, web agents, task benchmarks
Game Agents                   Game-playing, interactive environments
Computer Use                  GUI agents, computer use benchmarks
Robotics                      Embodied agents, robot learning

Prerequisites

There are no formal prerequisites, but some familiarity with the following will help:

  • Machine learning fundamentals
  • Basics of natural language processing
  • Python programming

How to Use This Course

This is a self-paced course — there are no deadlines or grades. Work through the modules in order, read the recommended papers, and complete the optional exercises if you’d like hands-on practice.

Each module on the Lectures & Resources page contains lecture slides, recordings, and recommended readings.

Get started → Lectures & Resources



About the Instructor

Kenneth Marino is a faculty member at the Kahlert School of Computing at the University of Utah, where he teaches CS 6960 and researches multimodal AI and agents. This site is maintained as a public companion to that course.