Have you ever tested a bot and thought, “Wow, that was really great!”, or possibly, “OMG, that was the worst experience ever!”?
Well, you’re not alone. Working with customers across all industries and countries, we needed a way to systematically understand what makes a bot good and what makes a bot bad. However, good and bad feel really subjective. This is our business; we need to be objective and tell our clients where to optimize and what to prioritize. So instead of trying to compare apples to oranges, we developed a system to objectively evaluate a bot’s ability to quickly and easily answer user questions. This became our Bot Audit Checklist.
Our audit checklist works on a 50-point scale and focuses on the four main pillars that are essential for a working bot:

1. NLP quality (20 points, 40% of the score)
2. Conversational design (15 points, 30%)
3. Structure & flows (10 points, 20%)
4. Visual design (5 points, 10%)
NLP quality evaluates your bot’s knowledge and how well it can correctly recognize user questions. The highest possible score for this section is 20 points and it accounts for 40% of the audit score. This section has the heaviest weight because if the NLP model doesn’t work properly, then the bot doesn’t work properly.
Before testing the bot, you need a list of 15 questions that the bot SHOULD be able to answer based on its introduction, its placement (the context of a specific website, app, etc.) and what an average customer would ask. If you’re struggling to come up with questions, here are 10 generic questions to get you started:
From there, take your questions and adapt them based on the company’s FAQ page. FAQ pages are a great source of information because they offer a complete set of frequently asked questions. They’re also usually the starting point for a bot’s knowledge: the info is already organized, and since these questions are already identified as high-volume, automating them is sure to save human time.
That’s the only prep work you need to do! Now it’s onto the testing. Here’s a breakdown of every item to test when evaluating the bot’s NLP quality.
Ask the bot your 15 questions. For every true positive, a question that the bot answers with an on-topic response that makes sense, you award 1 point. For every question that the bot doesn’t answer correctly, you’ll need to follow up with further questioning to decide whether it’s a false positive, false negative or true negative.
A false positive occurs when the bot falsely understands the question and gives the wrong response. These are the easiest to spot.
A false negative occurs when the bot incorrectly responds with a “not understood” message. How do you know if the bot really doesn’t know the answer? Follow up with rephrased variations of the question. If one of your rephrased questions gets a correct answer, then it’s a false negative: the bot did know the answer to your initial question, it just didn’t understand the specific phrasing you used. If none of your rephrased questions get a correct answer and the bot keeps responding with “not understood”, then it’s a true negative, since the bot really doesn’t know the answer.
Subtract 0.5 points for every false positive and false negative that you experience. Award 0.5 points for every true negative that you uncover.
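If it helps to see the arithmetic in one place, here’s a minimal Python sketch of this scoring rule (the labels and function are ours, purely for illustration):

```python
# Tally the NLP-quality score from your 15 test questions.
# Label each question yourself while testing: "tp", "fp", "fn" or "tn".
POINTS = {"tp": 1.0, "fp": -0.5, "fn": -0.5, "tn": 0.5}

def nlp_question_score(labels):
    return sum(POINTS[label] for label in labels)

# Example: 11 true positives, 2 false positives, 1 false negative, 1 true negative
print(nlp_question_score(["tp"] * 11 + ["fp", "fp", "fn", "tn"]))  # 10.0
```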
Note down the results on your checklist and move on to the rest of the audit. You’ll total up everything at the end.
Here you’re testing various types of input. If there are buttons within the chat, then the bot should recognize the button both when it’s clicked and when its label is typed. If there are number menus, then the bot should recognize a number both as a word (“three”) and as a digit (“3”). If the bot can recognize these different types of input, award 1 point. If it can’t, award 0.
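To make “recognizing different input types” concrete, here’s a hypothetical sketch of the kind of normalization a bot might do so that a clicked button, its typed label, and “3” vs “three” all resolve the same way (every name here is made up for illustration):

```python
# Map spelled-out numbers to digits so menus accept both forms.
WORD_TO_DIGIT = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize_input(text, button_labels):
    cleaned = text.strip().lower()
    cleaned = WORD_TO_DIGIT.get(cleaned, cleaned)  # "three" -> "3"
    for label in button_labels:
        if cleaned == label.strip().lower():       # typed label matches a button
            return label
    return cleaned

print(normalize_input("Three", ["1", "2", "3"]))  # -> "3"
print(normalize_input("talk to an agent", ["Check my order", "Talk to an agent"]))
```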
Entities help the NLP identify variations within an intent. It’s easiest to think of them as keywords that correspond to a list of synonyms. If the bot uses entities, award 1 point; if not, award 0 points.
! TIP ! Looking for entities takes a bit of practice, but an easy way to check if a bot uses entities is to ask a question that has different answers based on a specifier. For example, “I broke something” gives a generic answer with a follow-up question asking what you broke. “I broke my laptop” gives a specific answer related to breaking your laptop, and “I broke my phone” gives a specific answer related to breaking your phone. Laptop and phone are synonyms under an “item” entity.
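If you’re curious what that looks like under the hood, here’s one way to picture the “item” entity from the tip above, assuming the keywords-plus-synonyms model (the synonym lists are hypothetical):

```python
# A hypothetical "item" entity: each value maps to its synonym list.
ITEM_ENTITY = {
    "laptop": ["laptop", "notebook", "macbook"],
    "phone": ["phone", "mobile", "smartphone", "iphone"],
}

def match_entity(message, entity):
    """Return the entity value whose synonym appears in the message, if any."""
    words = message.lower().split()
    for value, synonyms in entity.items():
        if any(syn in words for syn in synonyms):
            return value
    return None

print(match_entity("I broke my laptop", ITEM_ENTITY))  # "laptop" -> specific answer
print(match_entity("I broke something", ITEM_ENTITY))  # None -> generic follow-up
```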
There are 55 countries in the world with more than one official language, and approximately 60% of the world’s population speaks at least two languages. Switching from one language to another is a normal habit for lots of people. However, very few chatbots can detect, let alone accommodate, that. Award 1 point if the chatbot can detect that you’re speaking another language. It doesn’t need to offer to chat with you in that language; it just needs to understand that you’re speaking another language and give a response that manages expectations about which language(s) it can speak. If there’s no language detection, then award 0 points.
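As a sketch of what that detection step might look like, a bot could wrap an off-the-shelf detector such as the open-source langdetect package; the guard function and supported-language set below are our own illustration, not any particular platform’s API:

```python
from langdetect import detect  # pip install langdetect

SUPPORTED_LANGUAGES = {"en"}

def language_guard(message):
    """Return an expectation-managing reply for unsupported languages,
    or None so the normal NLP pipeline can take over."""
    lang = detect(message)  # ISO 639-1 code, e.g. "fr"
    if lang not in SUPPORTED_LANGUAGES:
        return "It looks like you're writing in another language. For now, I can only chat in English!"
    return None

print(language_guard("Bonjour, j'ai perdu mon pull"))  # flags French input
```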
Randomly smashing the keys on your keyboard results in gibberish that no one can or should be able to decipher. This happens more frequently than you might think in chatbot conversations. Instead of responding with a generic not-understood response, some bots respond with a specific message encouraging users to stick to an official language that the bot knows. Award 1 point if the bot can detect gibberish and 0 points if it can’t.
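Production bots typically use a trained classifier for this, but even a crude heuristic shows the idea; the vowel-ratio check below is purely a sketch:

```python
def looks_like_gibberish(text):
    """Keyboard mashing tends to contain very few vowels."""
    letters = [c for c in text.lower() if c.isalpha()]
    if len(letters) < 4:
        return False  # too short to judge
    vowel_ratio = sum(c in "aeiouy" for c in letters) / len(letters)
    return vowel_ratio < 0.2

print(looks_like_gibberish("asdfghjkl"))         # True
print(looks_like_gibberish("I broke my phone"))  # False
```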
Only using NLP will quickly become a thing of the past. Many companies are already experimenting with NLG (natural language generation) to provide more human, personalized responses. As the technology advances, this section of the audit will probably grow, but for now, award 1 point if you discover the bot using NLG and 0 points if not.
Conversational design focuses on the user experience and how the conversation is written and engineered. It’s what elevates a bot from a functional question-answer machine to an interactive experience. The highest possible score for this section is 15 points, which accounts for 30% of the audit score.
Within the scope of conversational design, we evaluate 12 different criteria. This is probably the least objective section of the audit, so we’ll go through each criterion in detail and explain exactly what to look for and how to award points.
Conversationalized copy is text (or bot responses) written the way people actually converse. If a bot’s responses are short (~125 characters per bubble), break longer messages into multiple bubbles and average fewer than four bubbles per response, then you can consider it conversationalized copy.
There are always certain responses that can be improved, but if a majority of the copy is conversationalized, then you should award 1 point. If there's little to no conversationalized copy, then award 0 points.
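Since this criterion comes with numbers attached, you can even express it as a rough, hypothetical yardstick in code:

```python
def is_conversationalized(bubbles):
    """bubbles: the list of chat bubbles making up one bot response."""
    return len(bubbles) < 4 and all(len(b) <= 125 for b in bubbles)

response = ["Got it!", "I'll reset your password now.", "Check your inbox in a minute."]
print(is_conversationalized(response))  # True
```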
Personalization refers to the bot’s ability to use the knowledge that it already has. If the bot is placed in a logged-in environment, then the bot should already know all the personalized information associated with your account. This could be something as basic as your email, all the way down to your subscription plan, most recent purchases, etc.
Context refers to the bot’s ability to understand what’s been said throughout the conversation and where you are in the conversation. For example, if you tell the bot you lost your sweater and it correctly detects that you lost something, then the next question shouldn’t be: What did you lose? The bot should already know what you lost based on your initial message.
If the bot uses personalization and/or context, then award 1 point. If there’s no personalization and the context doesn’t work, then award 0 points.
! TIP ! An easy way to test for context is to find a button menu and type one of the button responses. Then return to the main menu and type the same thing you typed while in the flow. Compare those answers. If they’re the same, then the bot probably isn’t using context. If they’re different, then the bot is using context.
Copy variations make a bot feel less rigid and more spontaneous. For example, instead of always answering “Do you have another question?”, the bot sometimes says, “Is there something else I can help you with?”, or “Can I help you with anything else?”
Award 1 point if you find copy variations, and 0 points if you don’t find any.
! TIP ! The most common response variations to check are: Hello, Thank you, Another question, Bye.
Writing from the user’s perspective implies that the bot uses first-person pronouns. For example, when listing different options, instead of saying “Your number,” it says “My number,” since the user clicks the button and it appears as their own input response. Award 1 point if the bot writes in the first person, and 0 points if it doesn’t.
It’s crucial for a chatbot to introduce itself as a bot. Not only will it soon be a law in Europe, but it’s also key for managing user expectations. If a bot clearly introduces itself as a chatbot, then you should award 1 point. If there's no clear introduction or it's not clear that the chat experience is run by a bot, then you should award 0 points.
Chatbots are all about expectation management. Giving a user a keyboard opens up the possibility to ask anything and everything. If a chatbot offers a menu or describes the types of topics or questions it can answer, then it’s narrowing the scope and guiding the user on what to ask. Award 1 point if the bot manages expectations and 0 points if it doesn’t.
There's nothing more frustrating than being trapped in a flow where the keyboard is blocked and none of the buttons match what you need. There are some cases when blocking a keyboard can be useful, however, those moments are rare. If the bot doesn’t unnecessarily block the keyboard, then award 1 point. If you find a blocked keyboard that doesn't allow an escape or alternative, then award 0 points.
Lots of companies use bots to automate data capture and conversationalize forms. A crucial part of accurate data capture is input validation. Award 1 point when you find some form of data/input validation, whether it be an automated address, phone number or email check, or a simple "review-and-confirm" from the user. Award 0 points if there’s no data/input validation.
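An email check, for instance, can be as simple as a regular expression plus a review-and-confirm step. This is a minimal sketch, not a production-grade validator:

```python
import re

def is_valid_email(text):
    # Deliberately simple pattern: something@something.tld
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}", text) is not None

def confirm(value):
    return f"I have {value}. Is that correct? (yes/no)"  # review-and-confirm

print(is_valid_email("alexis@example.com"))  # True
print(is_valid_email("not-an-email"))        # False
print(confirm("alexis@example.com"))
```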
Implicit confirmation is one of the best tools when it comes to conversational design. It allows the bot to confirm that it correctly understood the user while smoothly moving the conversation forward. For example, if the user asks for a coffee, the bot might respond with “Would you like milk with your coffee?” The bot isn’t literally saying, “Do I understand correctly that you asked for a coffee?” Instead, it’s saying, “I understood that you ordered a coffee,” it moves the conversation forward by asking if you’d like milk, and it gives the user the opportunity to correct it if they didn’t order a coffee. Award 1 point if the bot uses implicit confirmation and 0 points if it doesn’t.
Integrations are technical connections between the bot and other platforms. They enable the bot to give more personalized and accurate responses. There are simple integrations, such as storing user feedback in an online spreadsheet, and very complex integrations such as connecting to a user's bank account and checking his/her current balance. Integrations bring a bot to the next level, so for every integration that you spot, award 2 points. Bots can score a maximum of 5 points for integrations.
The ultimate failure of conversational design is getting stuck in an endless loop or a flow that a user can’t escape. Not only is it frustrating, it means that the chatbot failed to do its job. Subtract 2 points every time you get trapped in a flow or loop, capping the deduction at 10 points.
The structure of a chatbot is the information architecture and the flows are specific scenarios where the bot guides the user through a process. The highest possible score for this section is 10 points and it accounts for 20% of the audit score. The scoring here is quite straightforward.
If there’s a clear information architecture with a knowledge hierarchy, such as a main menu and/or sub-menus, then you should award 1 point. If not, then 0 points.
Fallbacks are the most complex part of this section and can use entities, database searches or offloads to ensure that the bot moves the conversation forward without saying, “I’m sorry, I don’t understand.” Award 1 point every time you experience a fallback instead of a “not-understood” message. Bots can score a maximum of 5 points for fallbacks.
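Here’s one possible shape for such a fallback chain, with stub handlers standing in for the real intent matching, entity matching, database search and offload (all names are hypothetical):

```python
def respond(message):
    """Try each fallback in order before handing off to a human."""
    for attempt in (match_intent, match_entity, search_knowledge_base):
        reply = attempt(message)
        if reply is not None:  # a fallback caught it: no "not understood"
            return reply
    return "Let me connect you with a colleague who can help."  # offload

# Stub handlers, purely illustrative:
def match_intent(msg): return None  # NLP found no confident intent
def match_entity(msg): return None  # no entity keyword matched
def search_knowledge_base(msg):
    return "This article might help: ..." if "password" in msg else None

print(respond("I forgot my password"))  # the knowledge-base fallback fires
print(respond("hmm"))                   # falls through to the offload
```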
A bot can’t and shouldn’t answer everything. That’s why you should award 1 point to bots that propose some sort of offload, whether it be a live chat handover, ticket creation, phone number or email address. If there’s no alternative form of communication to the bot, then award 0 points.
Bots can always improve and learn. Award 1 point to bots that have some version of a feedback flow, whether it’s a thumbs up/thumbs down, star rating, smiley ranking or open feedback. If none of these exist, award 0 points.
In 99% of cases, when the bot identifies the same intent twice in a row, the answer is wrong. This is frustrating for the user: they’re trying to rephrase the question, but still not triggering the right answer. In this case, the best practice is for the bot to alert the user that it’s about to give the same answer twice in a row and ask whether they want that answer anyway, or propose some sort of offload instead. Award 1 point if the bot recognizes a double intent, and 0 points if it doesn’t.
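In code, the check can be as simple as remembering the last matched intent in the session; the session shape and the copy below are our own sketch:

```python
ANSWERS = {"lost_item": "Sorry to hear that! What did you lose?"}

def reply_for(intent, session):
    if session.get("last_intent") == intent:
        return ("It looks like I'm about to give you the same answer again. "
                "Want to see it anyway, or shall I connect you with a colleague?")
    session["last_intent"] = intent
    return ANSWERS[intent]

session = {}
print(reply_for("lost_item", session))  # normal answer
print(reply_for("lost_item", session))  # same intent twice: offer an offload
```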
Since a majority of the population is at least bilingual, it’s very normal for a user to end up on a chatbot in his/her non-preferred language. This is why you should award 1 point to bots that can seamlessly switch from one language to another. Award 0 points if there's no language switch.
This section evaluates how the bot looks and it has the smallest impact on the final score since design is always subjective. It’s possible to score 5 points in this area and it accounts for 10% of the final score.
Try to keep your evaluation of visualization as objective as possible by awarding 1 point to bots that use rich media to show instead of tell. Rich text and media include emojis, icons, images, videos, gifs, etc. Award 0 points if this is missing.
Award 1 point to bots that use the best formats to simplify the user experience, such as lists, carousels, buttons, links, etc. If these are missing, award 0 points.
Finally, award 1 point per device (desktop, tablet and mobile) where the chat widget is responsive and works properly.
Now it’s time to calculate the total score! Add up the scores from each section. The final score is out of 50 points, so the higher the score, the better the bot performed.
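In other words (the section scores here are just example numbers):

```python
# The four pillars and their maximums, per this checklist.
sections = {
    "NLP quality": 16.5,          # out of 20
    "Conversational design": 11,  # out of 15
    "Structure & flows": 7,       # out of 10
    "Visual design": 4,           # out of 5
}

total = sum(sections.values())
print(f"Final score: {total} / 50")  # Final score: 38.5 / 50
```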
There you have it! That’s how you can use our Bot Audit Checklist to identify what’s working in your bot and what needs improvement. If you don’t have the checklist yet, feel free to download it here 👇
Fill out your info on the right to receive a download link for an editable version of the checklist!
If you want to discuss AI in more detail, then reach out to Alexis.
He's ready to chat in French, English and Greek.