Image by author

Generative AI has just put usability testing insights at the center of product development.

It’s fair to say that usability testing is now the most valuable research method in your toolkit. There has never been a better time to up your game and lean into this opportunity. What is this all about?

About: Zsombor Varnagy-Toth is a Sr UX Researcher at SAP with a background in machine learning and cognitive science. Author of A Knack for Usability Testing.

If you have ever tried to use an LLM such as GPT-4o or Claude 3.5 Sonnet directly for work, you have surely noticed that its output is just not ready. Turning it into professional-grade material takes a lot of pre- and post-work. This is why, as Sequoia Capital highlights, “Models have largely failed to make it into the application layer as breakout products…”

Many software companies have also noticed this gap between the LLMs’ output and real-world value and have wrapped LLMs in additional layers to enhance their outputs. The wrapper might be as simple as an application-specific prompt template or user interface. These simple additions made the general-purpose LLMs more useful in specific applications.
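As a rough illustration of how thin such a wrapper can be, here is a minimal sketch of an application-specific prompt template around a generic model call. All names here (`call_llm`, `summarize_ticket`, `TICKET_TEMPLATE`) are hypothetical, and the LLM call is stubbed out; this is not any particular company's implementation.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (e.g., GPT-4o or Claude)."""
    return f"[model response to: {prompt[:40]}...]"

# The "wrapper": a task-specific prompt template for one application.
TICKET_TEMPLATE = (
    "You are a support engineer. Summarize the ticket below in two "
    "sentences and suggest a next step.\n\nTicket:\n{ticket}"
)

def summarize_ticket(ticket: str) -> str:
    # The wrapper's only job: embed the raw input in the template
    # before handing it to the general-purpose model.
    return call_llm(TICKET_TEMPLATE.format(ticket=ticket))
```

A wrapper this simple is exactly why such products were initially seen as easy to replicate: the entire application layer is one template string.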

However, these so-called “last mile application providers” have long been looked down on, as the understanding was that the majority of the value comes from the LLMs and the application-specific UI is just a dash of glitter sprinkled on top. Additionally, since such UIs can be replicated in a matter of hours, these “wrapper companies” were seen as not even defensible, let alone investment-worthy, businesses.

This view has changed dramatically during 2024. Read Sequoia Capital’s article “Generative AI’s Act o1” by Sonya Huang and Pat Grady for a thoughtful analysis of this phenomenon. I will only summarize the main thoughts relevant to usability testing practitioners.

To bridge the gap between LLMs and real-life value, product teams need to build entire architectures around the LLM. The goal of such an architecture is to place the LLM in a rich ecosystem of various memory and processing units, databases, and other utilities that complement each other. Together, these components have everything needed to carry out the task, but each component is only called upon when it is its turn to work on the task.

Such systems are often referred to as “cognitive architectures,” a term that emphasizes the analogy with the human brain, which also has multiple processing components that are called upon in different orders depending on the task. The components take turns with the input until a sufficiently good output is achieved.
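The turn-taking described above can be sketched as a simple orchestration loop. The component names below (`retrieve`, `draft`, `critique`) are hypothetical stand-ins for the memory, LLM, and evaluation units such an architecture might contain; each is stubbed out so the sketch is self-contained.

```python
def retrieve(task: str) -> str:
    """Memory component: fetch relevant context for the task (stubbed)."""
    return f"context for '{task}'"

def draft(task: str, context: str) -> str:
    """LLM component: produce a candidate answer (stubbed)."""
    return f"draft answer to '{task}' using {context}"

def critique(answer: str) -> bool:
    """Evaluator component: decide if the output is good enough (stubbed)."""
    return "using context" in answer

def solve(task: str, max_rounds: int = 3) -> str:
    """Orchestrator: each component takes its turn, looping until
    the output passes the quality check or rounds run out."""
    answer = ""
    for _ in range(max_rounds):
        context = retrieve(task)
        answer = draft(task, context)
        if critique(answer):
            break
    return answer
```

The value of a real system lies in which components exist, what each checks for, and the order in which they are invoked, none of which is visible from the user interface.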

As it turns out, this is where most of the customer value is created. LLMs are smart, but in and of themselves, they can only produce, say, 20% of the value, while a cognitive architecture is needed to produce the remaining 80%. This realization is also supported by my team’s user research.

Moreover, these cognitive architectures make the product very defensible. Since such architectures are highly application-specific, platform companies such as OpenAI won’t and can’t go after them, because there are just too many of them. Cognitive architectures are also very difficult for application-level competitors to replicate. Firstly, the architecture is hidden from view; competitors can’t just look at the user interface and figure out what they are missing. Secondly, developing such an application-specific architecture takes considerable domain expertise, including a lot of user research. A “fast follow” copycat strategy is just not possible here.

In other words, the “last mile” turns out to be the most valuable mile, both in terms of customer and business value. Unsurprisingly, Sequoia Capital predicts these products will create the next wave of value in the AI revolution.

So, how does usability testing come into the picture? Well, when product teams scramble to turn the LLMs’ raw output into something useful, they often look to human cognitive architectures for inspiration. What processing steps does a human professional take to turn a piece of input into quality work output? Those are precisely the insights usability testing delivers. We uncover the necessary architectural components and how they should be orchestrated to carry out a certain task.

In fact, the think-aloud method was developed in the 1970s for this exact purpose. Herbert A. Simon and Allen Newell sought to understand human problem-solving strategies, so they gave tasks to participants and asked them to think out loud as they executed the tasks. Simon and Newell wanted to peek into the human cognitive architecture, isolate the processing steps, map the cognitive components that take part in solving the task, and understand their interplay.

Later on, this methodology was adapted for usability testing purposes because it not only reveals the cognitive processes during computer use but also marks those points where the cognitive processes break down, i.e., the usability issues. However, discovering usability glitches is merely a useful byproduct of the think-aloud protocol. The main goal of the method is to uncover the components and inner workings of a cognitive architecture.

This is how usability testing, specifically the think-aloud protocol, delivers the cognitive architecture insights that enable the building of cognitive architectures for AI applications. This is how AI applications go from 20% to 100% customer value. This is how last mile application providers become the most valuable, enduring, and defensible businesses.

Usability testing, and we UX researchers, are at the center of that value creation.


GenAI has just made usability testing the most valuable research method was originally published in UX Planet on Medium, where people are continuing the conversation by highlighting and responding to this story.