Abstract
The rise of the Large Language Model (LLM) in the field of machine learning has greatly transformed multiple industries, replacing multiple parts of the work pipeline that typically require complex logic. However, alongside this meteoric revolution is a concern for reliability and flexibility, especially when multiple studies warn against LLM hallucinations, where the model invents certain statements and assumes them as fact. Furthermore, like any other machine learning model, LLMs have their facts and worldview restricted to their training set. For example, a model trained on data up to 2025 would not be able to predict any events occurring in 2026. This greatly decreases the strength of the LLMs as knowledge bases, as they are incredibly good at synthesizing information but struggle with obtaining the necessary information outside of their training set.
A solution to this dilemma is to provide the LLM with a set of tools to help it obtain the right information to solve a user query. For example, a web search tool can allow the LLM to query for knowledge about the current day, and append the new knowledge to the context window to allow the LLM to operate on it. This addition of a single tool call upgrades the LLM from being a historical data synthesis to a more powerful search engine. With the presence of multiple tools, the LLM can not only retrieve information, but can also use that information to perform actions inside of designated environments such as banking and Slack, where it might process transactions or send messages to users. Instead of producing an answer to the user in a traditional conversation with an LLM, for every human query there is an entire conversation history where the LLM interacts with the environment and tools. Together, this addition of tool calls is referred to as Agentic AI.
The power of Agentic AI comes at a cost. Compared to previous concerns of an LLM hallucinating, now an LLM in an agentic environment performing the wrong action could cause much greater harm by directly manipulating objects in the environment, especially in sensitive industries such as banking. As a result, although an LLM has increased reliability for information retrieval in agentic contexts, the threshold for an acceptable agentic LLM is much higher, demanding strong logic and planning capabilities. While autoregressive models like ChatGPT and Gemini have consistently dominated in natural language contexts, agentic environments require a higher level of reasoning that a model which generates a response in a linear fashion may struggle with. Indeed, recently another type of model architecture, called diffusion models, have risen to challenge their autoregressive counterparts. Diffusion models are still weaker in natural language compared to autoregressive ones, but their ability to generate a response out-of-order and in bidirectional fashion allow them to excel at symbolic reasoning tasks and tasks that require complex planning. Thus, both the STS and technical parts of this report investigate how autoregressive and diffusion models compare in agentic environments.
Additionally, the technical report also goes into depth about several diffusion-centered techniques that help diffusion models in agentic contexts. These primarily target the bidirectional nature of the model generation process and boost the model’s capability in selecting the right tool calls. The STS research paper instead focuses on why the fundamental difference in architecture will result in autoregressive models being outperformed by diffusion models at some point in the future, drawing upon the framework of technological determinism to explain the gap in the high-level reasoning capabilities between the two. Overall, both papers discuss how different model architectures behave in agentic environments and how they can be improved.