Artem Zholus
I am a third-year PhD student at Mila and
Polytechnique Montréal, supervised by Prof. Sarath Chandar.
I am also a visiting researcher at FAIR @ Meta, advised by
Mido Assran.
My ultimate research goal is to build adaptive and autonomous agents that solve open-ended tasks.
As steps toward this goal, I use Language and World Models, Structured Communication,
and (Distributed) Reinforcement Learning.
👉 Here is what I mean by each of these ideas:
- Language and World Models. Learning large models that forecast the future
(e.g., via next-token prediction), thereby learning causality, among other things.
These models lay the foundation for autonomy (e.g., through self-interaction and self-learning),
adaptiveness (e.g., via continual adaptation), and generality.
- (Distributed) Reinforcement Learning. I am a proponent of the Reward Hypothesis,
but I believe that, in the end, what matters is the data the agent trains on.
Thus, to build truly capable agents we need powerful sampling/training loops that work at both small and large scales.
- Structured Communication. To me, the most exciting thing about the current state
of foundation models is how easily they can be combined (in parameter space, activation space,
or even token space) with each other or with themselves. By "connecting" models in the right way,
we can make them generalize to new settings or even enable new types of learning.
A good example of this is RLHF. I am not limiting myself to a specific type of learning.
I call this "Structured Communication": any information exchange between models
in which the way they exchange information, and the way they use it, stays fixed.
This also encompasses gradient-free learning enabled by LLMs.
My current research project focuses on scaling world models.
My past research covered Model-based Reinforcement Learning
via World Models, interactive learning of embodied agents,
and boosting in-context memory of Model-based RL agents.
I also spent some time in industry as an ML Engineer, doing
drug discovery with RL and Language Models.
Previously, I was a student researcher at
Google DeepMind
with Ross Goroshin.
Also, I had two internships at EPFL: at the
LIONS lab (in RL theory) under
Prof. Volkan Cevher and at VILAB
(in Multimodal Representation Learning) under Prof.
Amir Zamir. I obtained my Master's
degree at MIPT, studying AI, ML, and Cognitive Sciences
and working at the CDS lab
under Prof. Aleksandr Panov
on task generalization in model-based reinforcement learning.
I received my BSc degree from ITMO University
majoring in Computer Science and Applied Mathematics.
News
- 📄 June 2025 — V-JEPA 2 paper is out! Check out our blog post!
- 📄 April 2025 — "TAPNext" paper accepted to ICCV 2025!
- 📄 December 2024 — "BindGPT" paper accepted to AAAI 2025 with Best Poster Award!
- 📢 October 2024 — Started an internship at FAIR @ Meta!
- 📢 Apr–Sep 2024 — Internship at Google DeepMind!
- 📄 January 2024 — "Mastering Memory Tasks with World Models" paper accepted to ICLR 2024 with oral (top 1.2% of accepted papers)!
- 📄 May 2022 — "IGLU Gridworld" paper accepted to a CVPR workshop!
- 📄 April 2022 — "IGLU 2022" paper accepted to the NeurIPS Competition Track!
- 📄 March 2022 — "Factorized World Models" paper accepted to an ICLR workshop!
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran*, Adrien Bardes*, David Fan*, Quentin Garrido*, Russell Howes*, Mojtaba Komeili*, Matthew Muckley*, Ammar Rizvi*, Claire Roberts*, Koustuv Sinha*, Artem Zholus*, Sergio Arnaud*, Abha Gejji*, Ada Martin*, Francois Robert Hogan*, Daniel Dugas*, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier*, Yann LeCun*, Michael Rabbat*, and Nicolas Ballas*
Technical Report, 2025
website /
arxiv /
code /
blogpost /
huggingface /
By scaling world-model pretraining to over a million hours of internet video, we build V-JEPA 2, which excels at motion understanding, human-action anticipation, and video question answering. We show how action-conditioned post-training on just 62 hours of unlabeled robot videos enables zero-shot generalization in robotic control, through planning in latent space for tasks such as pick-and-place.
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
ICCV, 2025
website /
arxiv /
video /
code /
A new model for the point tracking task. It achieves state-of-the-art accuracy by a large margin while offering significantly faster online inference. Our approach is drastically different from anything that existed before for this task, showing that the scale of compute and data is what matters here.
BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning
Artem Zholus, Maksim Kuznetsov, Roman Schutski, Rim Shayakhmetov, Daniil Polykovskiy, Sarath Chandar, Alex Zhavoronkov
AAAI with Best Poster Award, 2025
website /
arxiv /
video /
huggingface /
BindGPT is a new framework for building drug-discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of LMs. This allows BindGPT to train a single pretrained model that achieves state-of-the-art performance in 3D Molecule Generation, 3D Conformer Generation, and Pocket-Conditioned 3D Molecule Generation by treating them as downstream tasks, whereas previous methods build task-specialized models without task-transfer abilities.
Mastering Memory Tasks with World Models
Mohammad Reza Samsami*, Artem Zholus*, Janarthanan Rajendran, Sarath Chandar
ICLR with oral (top 1.2% of accepted papers), 2024
website /
arxiv /
openreview /
code /
The new state of the art in a diverse set of memory-intensive reinforcement learning domains: bsuite (tabular, low-dimensional), POPGym (tabular, high-dimensional), and Memory Maze (3D, embodied, high-dimensional, long-term). Importantly, we reach superhuman performance in Memory Maze!
IGLU Gridworld: Simple and Fast Environment for Embodied Dialog Agents
Artem Zholus, Alexey Skrynnik, Shrestha Mohanty, Zoya Volovikova, Julia Kiseleva, Artur Szlam, Marc-Alexandre Coté, Aleksandr I. Panov
Embodied AI workshop @ CVPR, 2022
arxiv /
code /
slides /
A lightweight reinforcement learning environment for building language-conditioned embodied agents tasked with building 3D structures in a Minecraft-like world.
IGLU 2022: Interactive Grounded Language Understanding in a Collaborative Environment at NeurIPS 2022
Julia Kiseleva*, Alexey Skrynnik*, Artem Zholus*, Shrestha Mohanty*, Negar Arabzadeh*, Marc-Alexandre Côté*, Mohammad Aliannejadi, Milagro Teruel, Ziming Li, Mikhail Burtsev, Maartje ter Hoeve, Zoya Volovikova, Aleksandr Panov, Yuxuan Sun, Kavya Srinet, Arthur Szlam, Ahmed Awadallah
NeurIPS, Competition Track, 2022
website /
arxiv /
code /
An AI competition where the goal is to follow a contextual language instruction while embodied in a 3D blocks world (RL track) and to ask clarifying questions in cases of ambiguity (NLP track).
Factorized World Models for Learning Causal Relationships
Artem Zholus, Yaroslav Ivchenkov, and Aleksandr Panov
OSC workshop @ ICLR, 2022
arxiv /
code /
An RL agent that generalizes its behavior to unseen tasks by learning a structured world model and constraining task-specific information.