Building Usable Conversations: Conversational Usability Testing


In this fifth installment of our series on conversational usability, our focus shifts to usability testing: the process of evaluating and improving conversational interfaces, which often differ significantly from the visual and physical interfaces we normally test with end users.

Due to the unique complexities and the striking diversity of conversational interfaces, from voice assistants to interactive chatbots to run-of-the-mill messaging bots, conversational usability testing requires a nuanced approach and, sometimes, new and untested solutions.

Many of the techniques that we are most accustomed to as usability researchers still serve us well in the conversational context. For instance, we can conduct think-aloud testing whenever a subject is interacting with a chatbot, and eye tracking can also be relevant for those chatbots that are more visually complex. Nonetheless, for voice assistants and other voice-driven interfaces, we must resort to other techniques that still capture compelling data.

In this Experience Express column, we dig into the complications—and the unexpected benefits—of conversational usability testing.

Chatbot and messaging bot usability

Conversational usability testing is a relatively unexplored and misunderstood area, partially because of the diversity of conversational interfaces available on the market today. Some interfaces are solely aural and verbal with no visual or physical component, unlike interfaces that are typically evaluated for usability, such as websites, mobile applications, device hardware, and other manually manipulated interfaces.

Fortunately, chatbots with extensive visual interfaces can be tested with the same techniques used on websites, such as eye tracking (especially for conversational forms or interfaces embedded in chatbot messages) and think-aloud (which does not conflict with the typed exchanges taking place with the interface). Nonetheless, several areas of interest are specific to chatbots and messaging bots.

Because chatbots and textbots benefit from a guided, unidirectional flow to articulate the information architecture contained therein, organizations interested in exploring conversational approaches should give particular consideration to how users navigate across different states of the application. For example, usability test results for chatbots will very quickly reveal issues of navigation and wayfinding that would not otherwise be easily discovered on more visual interfaces, due to the need for explicit instructions that map a trajectory forward for the user.

In addition, informational and transactional conversational interfaces may call for different evaluation methods. A usability test assessing the outcome of an informational conversation, for instance, would present subjects with tasks that require them to recall or recognize specific information gleaned from interactions with the interface. Meanwhile, a usability test evaluating the result of a transactional conversation would ask the user to achieve a particular transactional outcome.
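To make the distinction concrete, here is a minimal Python sketch of how the two kinds of tasks might be scored. The `TaskResult` fields, the substring check for informational tasks, and the state names are all assumptions made for illustration, not an established scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    kind: str      # "informational" or "transactional"
    expected: str  # the fact to recall, or the end state to reach
    observed: str  # the participant's answer, or the final state logged


def task_succeeded(result: TaskResult) -> bool:
    """Score one task: recall for informational, end state for transactional."""
    expected = result.expected.strip().lower()
    observed = result.observed.strip().lower()
    if result.kind == "informational":
        # The participant only needs to mention the expected fact.
        return expected in observed
    if result.kind == "transactional":
        # The session must end in exactly the expected state.
        return observed == expected
    raise ValueError(f"unknown task kind: {result.kind}")


# Hypothetical results from two sessions.
recall = TaskResult("informational", "8 a.m. to 5 p.m.",
                    "I think it said 8 a.m. to 5 p.m. on weekdays")
booking = TaskResult("transactional", "appointment_booked",
                     "appointment_abandoned")
print(task_succeeded(recall), task_succeeded(booking))
```

In practice a researcher would likely judge informational answers by hand; the substring match simply stands in for that judgment.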

Voice assistant and voice interface usability

All of the usual calculations that go into a usability test for a chatbot go out the window when it comes to voice assistants and other voice-driven interfaces. First and foremost, most voice interfaces lack physical or visual components and rely instead on the subject's voice. This means that think-aloud, a common method in website usability testing, cannot be leveraged: the subject's running commentary can interfere with the voice interface's ability to respond and can cloud tests with bad data that jeopardizes results.

In addition, eye tracking is only a realistic proposition for those voice assistants that include a screen or some visual component, such as the Amazon Echo Show. While it may be interesting from the standpoint of the psychology of users working with conversational interfaces to evaluate where users look when they are speaking with an unseen conversational partner, eye tracking loses all usefulness in usability testing for screenless smart speakers like the Amazon Echo and Google Home.

As such, we should look to other ways to maintain consistency across usability tests for voice assistants and voice-driven interfaces that do not have the potential to complicate the testing process. One of the lesser-known techniques in usability testing is retrospective probing, which has received less attention as of late due to the faulty memories all humans have when asked questions well after they have interacted with an interface. In the context of voice-driven interfaces, however, it can be quite the boon.

Retrospective probing for voice assistants

Some of the most common usability testing approaches, as we've seen, are unable to capture the sort of data that usability researchers seek in voice-driven interfaces. Voice assistants are easily invoked, even if the invocation is accidental and bears little real resemblance to "Alexa" or "OK Google." Concurrent probing (CP), for instance, requires the evaluator to ask questions while the test is in progress, and a user describing a particular positive or negative aspect of the interface can very easily utter "Alexa" and trigger an interaction where one was not intended.
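One modest safeguard is to screen the evaluator's question script for wake phrases before a session begins. The Python sketch below shows the idea; the phrase list and the function name `find_wake_phrases` are illustrative assumptions, and real devices may also respond to near-matches that simple text matching cannot catch.

```python
import re

# Assumed wake phrases for illustration; adjust to the devices under test.
WAKE_PHRASES = ["alexa", "ok google", "hey google"]


def find_wake_phrases(lines, wake_phrases=WAKE_PHRASES):
    """Return (line_number, phrase) pairs for each wake phrase in a script."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Lowercase and collapse punctuation so "OK, Google" still matches.
        normalized = re.sub(r"[^a-z0-9]+", " ", line.lower()).strip()
        padded = f" {normalized} "
        for phrase in wake_phrases:
            if f" {phrase} " in padded:
                hits.append((number, phrase))
    return hits


# Hypothetical probing questions an evaluator plans to ask.
script = [
    "What did you expect to hear after the greeting?",
    "How did Alexa respond when you asked for office hours?",
    "Was the 'OK, Google' prompt easy to remember?",
]
print(find_wake_phrases(script))
```

Any flagged question can then be reworded ("the assistant" instead of the wake word) or saved for after the interaction ends.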

Meanwhile, retrospective probing (RP) is a technique that allows a conversational experience to proceed to completion without intrusions such as think-aloud statements. It asks the user to answer questions or to give their impressions about a user experience after the entirety of their interaction with the interface has finished. Retrospective probing also has the added benefit of allowing the user to give their full impressions about an interface rather than offering them piecemeal, as in concurrent probing.

Nonetheless, retrospective probing has one key disadvantage, which also explains its relatively rare use in web usability testing. Users have notoriously poor memories and are often unable to recall characteristics of interfaces or occurrences that transpired only a few minutes before retrospective questions are asked. Just as concurrent probing during interactions with a conversational interface can introduce bad data, retrospective probing can also cause users to offer false recollections or to accidentally misinterpret the result of their conversation. This risk should always be taken into account when designing such usability tests.

How to conduct a voice usability test

While usability tests involving chatbots and other messaging bots can proceed in much the same surroundings and under the same specifications as web usability tests, voice usability tests must meet certain conditions in order to succeed. As usual, the principle "test early, test often" applies, and each milestone in the project should see another round of usability testing.

One of the most obvious requirements for those who are building voice interfaces is the need to avoid interference from other sources of sound and noise. If you have a soundproof room or recording studio within your organization, that is an excellent place to hold a voice usability test, as silence is of paramount importance for both the user's concentration and data collection. Moreover, because voice-driven interfaces usually require a different sort of interaction from the user than a website does, users may be unaccustomed to sitting in silence before the test and at certain moments during it.

The ideal voice usability test aims not only to evaluate whether a user understands how to get to a particular point in the interface but also to assess if a user is able to traverse the entirety of the experience to arrive at a successful destination, whether that is the acquisition of information or the completion of a transaction. A robust usability test for conversational interfaces will present the user with a task that requires the user to hit all major touchpoints, which can often be mapped easily to Erika Hall's key moments in conversational interfaces.

As for the tasks presented to the user, there are two ways forward: either allowing the user to proceed at their own volition and discover touchpoints organically or prescribing a particular task for the user that requires them to engage fully with key features of the interface. In the former case, the volume of usability tests you are able to conduct will determine how comprehensive the test's results will be. In the latter case, if your designated tasks do not adequately cover all of the possible trajectories through your interface, it's time to go back to the drawing board.
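For prescribed tasks, a quick coverage check can reveal gaps before any participant arrives. The following Python sketch assumes you have listed the touchpoints of your interface and noted which ones each task exercises; the touchpoint and task names are hypothetical.

```python
def uncovered_touchpoints(touchpoints, tasks):
    """Return the touchpoints that no prescribed task exercises, sorted."""
    covered = set()
    for exercised in tasks.values():
        covered.update(exercised)
    return sorted(set(touchpoints) - covered)


# Hypothetical touchpoints for an informational voice interface.
touchpoints = {
    "greeting", "search", "result",
    "clarification", "error recovery", "goodbye",
}

# Hypothetical tasks mapped to the touchpoints each one should reach.
tasks = {
    "Find office hours": ["greeting", "search", "result", "goodbye"],
    "Recover from a misheard query": ["greeting", "search", "error recovery"],
}

print(uncovered_touchpoints(touchpoints, tasks))  # ['clarification']
```

An empty list means every touchpoint is reached by at least one task; anything else is a prompt to add or revise tasks.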


Conversational usability testing is an area fraught with potential complications and substantial risks, but with a carefully considered approach that addresses the limitations of certain types of conversational interfaces like voice assistants, organizations can ensure their conversational interfaces are useful for all people. In sum, whereas more "traditional" conversational approaches like chatbots have the benefit of a visual and physical interface that can embrace usual usability testing approaches, voice assistants require other techniques such as retrospective probing.

In the final installment of this series on building conversational interfaces, we closely inspect a case study that brings all of these best practices together in a single implementation, namely the Ask GeorgiaGov skill for Amazon Alexa, an informational interface that helps citizens of the state of Georgia tackle common tasks involving state government. In that final column, we will look at each of the topics we have covered in turn and apply our new knowledge in evaluating the advantages and disadvantages of various decisions made during the process. All aboard!