Have you ever tested a bot and thought, “Wow, that was really great!”, or possibly, “OMG, that was the worst experience ever!”?
Well, you’re not alone. Working with customers across all industries and countries, we needed a way to systematically understand what makes a bot good and what makes a bot bad. However, good and bad feel really subjective. This is our business; we need to be objective and tell our clients where to optimize and what to prioritize. So instead of trying to compare apples to oranges, we developed a system to objectively evaluate a bot’s ability to quickly and easily answer user questions. This became our Bot Audit Checklist.
Our audit checklist works on a 50-point scale and focuses on the four main pillars that are essential for a working bot:

1. NLP quality (20 points, 40% of the score)
2. Conversational design (15 points, 30%)
3. Structure & flows (10 points, 20%)
4. Visual design (5 points, 10%)
NLP quality evaluates your bot’s knowledge and how well it can correctly recognize user questions. The highest possible score for this section is 20 points and it accounts for 40% of the audit score. This section has the heaviest weight because if the NLP model doesn’t work properly, then the bot doesn’t work properly.
Before testing the bot, you need a list of 15 questions that the bot SHOULD be able to answer based on its introduction, its placement (the context of a specific website, app, etc.) and what an average customer would ask. If you’re struggling to come up with questions, here are 10 generic questions to get you started:
From there, take your questions and adapt them based on the company’s FAQ page. FAQ pages are a great source of information because they offer a complete set of frequently asked questions. They’re also usually the starting point for a bot’s knowledge: the info is already organized, and since these questions are already identified as high-volume, automating them is sure to save human time.
That’s the only prep work you need to do! Now it’s onto the testing. Here’s a breakdown of every item to test when evaluating the bot’s NLP quality.
Ask the bot your 15 questions. For every true positive, a question that the bot answers with an on-topic response that makes sense, you award 1 point. For every question that the bot doesn’t answer correctly, you’ll need to follow up with further questioning to decide whether it’s a false positive, false negative or true negative.
A false positive occurs when the bot falsely understands the question and gives the wrong response. These are the easiest to spot.
A false negative occurs when the bot incorrectly responds with a “not understood” message. How do you know if the bot really doesn’t know the answer? Follow up with rephrased variations of the question. If one of your rephrased questions gets a correct answer, then it’s a false negative: the bot did know the answer to your initial question, it just didn’t understand the specific phrasing you used. If none of your rephrased questions get a correct answer and the bot keeps responding with “not understood”, then it’s a true negative, since the bot really doesn’t know the answer.
Subtract 0.5 points for every false positive and false negative that you experience. Award 0.5 points for every true negative that you uncover.
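If it helps to see the arithmetic in one place, here’s a minimal Python sketch of this scoring rule (the labels and function are ours, purely for illustration):

```python
# Tally the NLP-quality score from your 15 test questions.
# Label each question yourself while testing: "tp", "fp", "fn" or "tn".
POINTS = {"tp": 1.0, "fp": -0.5, "fn": -0.5, "tn": 0.5}

def nlp_question_score(labels):
    return sum(POINTS[label] for label in labels)

# Example: 11 true positives, 2 false positives, 1 false negative, 1 true negative
print(nlp_question_score(["tp"] * 11 + ["fp", "fp", "fn", "tn"]))  # 10.0
```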
Note down the results on your checklist and move on to the rest of the audit. You’ll total up everything at the end.
Here you’re testing various types of input. If there are buttons within the chat, then the bot should recognize the button both when it’s clicked and when its label is typed. If there are number menus, then the bot should recognize a number both as a word (“three”) and as a digit (“3”). If the bot can recognize these different types of input, award 1 point. If it can’t, award 0.
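To make “recognizing different input types” concrete, here’s a hypothetical sketch of the kind of normalization a bot might do so that a clicked button, its typed label, and “3” vs “three” all resolve the same way (every name here is made up for illustration):

```python
# Map spelled-out numbers to digits so menus accept both forms.
WORD_TO_DIGIT = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize_input(text, button_labels):
    cleaned = text.strip().lower()
    cleaned = WORD_TO_DIGIT.get(cleaned, cleaned)  # "three" -> "3"
    for label in button_labels:
        if cleaned == label.strip().lower():       # typed label matches a button
            return label
    return cleaned

print(normalize_input("Three", ["1", "2", "3"]))  # -> "3"
print(normalize_input("talk to an agent", ["Check my order", "Talk to an agent"]))
```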
Entities help the NLP identify variations within an intent. It’s easiest to think of them as keywords that correspond to a list of synonyms. If the bot uses entities, award 1 point; if not, award 0 points.
! TIP ! Looking for entities takes a bit of practice, but an easy way to check if a bot uses entities is to ask a question that has different answers based on a specifier. For example, “I broke something” gives a generic answer with a follow-up question asking what you broke. “I broke my laptop” gives a specific answer related to breaking your laptop, and “I broke my phone” gives a specific answer related to breaking your phone. Laptop and phone are synonyms under an “item” entity.
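If you’re curious what that looks like under the hood, here’s one way to picture the “item” entity from the tip above, assuming the keywords-plus-synonyms model (the synonym lists are hypothetical):

```python
# A hypothetical "item" entity: each value maps to its synonym list.
ITEM_ENTITY = {
    "laptop": ["laptop", "notebook", "macbook"],
    "phone": ["phone", "mobile", "smartphone", "iphone"],
}

def match_entity(message, entity):
    """Return the entity value whose synonym appears in the message, if any."""
    words = message.lower().split()
    for value, synonyms in entity.items():
        if any(syn in words for syn in synonyms):
            return value
    return None

print(match_entity("I broke my laptop", ITEM_ENTITY))  # "laptop" -> specific answer
print(match_entity("I broke something", ITEM_ENTITY))  # None -> generic follow-up
```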
There are 55 countries in the world with more than one official language, and approximately 60% of the world’s population speaks at least two languages. Switching from one language to another is a normal habit for lots of people. However, very few chatbots can detect, let alone accommodate, that. Award 1 point if the chatbot can detect that you’re speaking another language. It doesn’t need to offer to chat with you in that language; it just needs to understand that you’re speaking another language and give a response that manages expectations about which language(s) it can speak. If there’s no language detection, then award 0 points.
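As a sketch of what that detection step might look like, a bot could wrap an off-the-shelf detector such as the open-source langdetect package; the guard function and supported-language set below are our own illustration, not any particular platform’s API:

```python
from langdetect import detect  # pip install langdetect

SUPPORTED_LANGUAGES = {"en"}

def language_guard(message):
    """Return an expectation-managing reply for unsupported languages,
    or None so the normal NLP pipeline can take over."""
    lang = detect(message)  # ISO 639-1 code, e.g. "fr"
    if lang not in SUPPORTED_LANGUAGES:
        return "It looks like you're writing in another language. For now, I can only chat in English!"
    return None

print(language_guard("Bonjour, j'ai perdu mon pull"))  # flags French input
```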
Randomly smashing the keys on your keyboard results in gibberish that no one can or should be able to decipher. This happens more frequently than you might think in chatbot conversations. Instead of responding with a generic not-understood response, some bots respond with a specific message encouraging users to stick to an official language that the bot knows. Award 1 point if the bot can detect gibberish and 0 points if it can’t.
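Production bots typically use a trained classifier for this, but even a crude heuristic shows the idea; the vowel-ratio check below is purely a sketch:

```python
def looks_like_gibberish(text):
    """Keyboard mashing tends to contain very few vowels."""
    letters = [c for c in text.lower() if c.isalpha()]
    if len(letters) < 4:
        return False  # too short to judge
    vowel_ratio = sum(c in "aeiouy" for c in letters) / len(letters)
    return vowel_ratio < 0.2

print(looks_like_gibberish("asdfghjkl"))         # True
print(looks_like_gibberish("I broke my phone"))  # False
```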
Only using NLP will quickly become a thing of the past. Many companies are already experimenting with NLG (natural language generation) to provide more human, personalized responses. As the technology advances, this section of the audit will probably grow, but for now, award 1 point if you discover the bot using NLG and 0 points if not.
Conversational design focuses on the user experience and how the conversation is written and engineered. It’s what elevates a bot from a functional question-answer machine to an interactive experience. The highest possible score for this section is 15 points, which accounts for 30% of the audit score.
Within the scope of conversational design, we evaluate 12 different criteria. This is probably the least objective section of the audit, so we’ll go through each criterion in detail and explain exactly what to look for and how to award points.
Conversationalized copy is text (or bot responses) written the way people actually converse. If a bot’s responses are short (~125 characters per bubble), break longer messages into multiple bubbles and average fewer than four bubbles per response, then you can consider it conversationalized copy.
There are always certain responses that can be improved, but if a majority of the copy is conversationalized, then you should award 1 point. If there's little to no conversationalized copy, then award 0 points.
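Since this criterion comes with numbers attached, you can even express it as a rough, hypothetical yardstick in code:

```python
def is_conversationalized(bubbles):
    """bubbles: the list of chat bubbles making up one bot response."""
    return len(bubbles) < 4 and all(len(b) <= 125 for b in bubbles)

response = ["Got it!", "I'll reset your password now.", "Check your inbox in a minute."]
print(is_conversationalized(response))  # True
```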
Personalization refers to the bot’s ability to use the knowledge that it already has. If the bot is placed in a logged-in environment, then the bot should already know all the personalized information associated with your account. This could be something as basic as your email, all the way down to your subscription plan, most recent purchases, etc.
Context refers to the bot’s ability to understand what’s been said throughout the conversation and where you are in the conversation. For example, if you tell the bot you lost your sweater and it correctly detects that you lost something, then the next question shouldn’t be: What did you lose? The bot should already know what you lost based on your initial message.
If the bot uses personalization and/or context, then award 1 point. If there’s no personalization and the context doesn’t work, then award 0 points.
! TIP ! An easy way to test for context is to find a button menu and type one of the button responses. Then return to the main menu and type the same thing you typed while in the flow. Compare those answers. If they’re the same, then the bot probably isn’t using context. If they’re different, then the bot is using context.
Copy variations make a bot feel less rigid and more spontaneous. For example, instead of always answering “Do you have another question?”, the bot sometimes says, “Is there something else I can help you with?”, or “Can I help you with anything else?”
Award 1 point if you find copy variations, and 0 points if you don’t find any.
! TIP ! The most common response variations to check are: Hello, Thank you, Another question, Bye.
Writing from the user’s perspective implies that the bot uses first-person pronouns. For example, when listing different options, instead of saying “Your number,” it says “My number,” since the user clicks the button and it appears as their own input response. Award 1 point if the bot writes in the first person, and 0 points if it doesn’t.
It’s crucial for a chatbot to introduce itself as a bot. Not only will it soon be a law in Europe, but it’s also key for managing user expectations. If a bot clearly introduces itself as a chatbot, then you should award 1 point. If there's no clear introduction or it's not clear that the chat experience is run by a bot, then you should award 0 points.
Chatbots are all about expectation management. Giving a user a keyboard opens up the possibility to ask anything and everything. If a chatbot offers a menu or describes the types of topics or questions it can answer, then it’s narrowing the scope and guiding the user on what to ask. Award 1 point if the bot manages expectations and 0 points if it doesn’t.
There's nothing more frustrating than being trapped in a flow where the keyboard is blocked and none of the buttons match what you need. There are some cases when blocking a keyboard can be useful, however, those moments are rare. If the bot doesn’t unnecessarily block the keyboard, then award 1 point. If you find a blocked keyboard that doesn't allow an escape or alternative, then award 0 points.
Lots of companies use bots to automate data capture and conversationalize forms. A crucial part of accurate data capture is input validation. Award 1 point when you find some form of data/input validation, whether it be an automated address, phone number or email check, or a simple "review-and-confirm" from the user. Award 0 points if there’s no data/input validation.
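An email check, for instance, can be as simple as a regular expression plus a review-and-confirm step. This is a minimal sketch, not a production-grade validator:

```python
import re

def is_valid_email(text):
    # Deliberately simple pattern: something@something.tld
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}", text) is not None

def confirm(value):
    return f"I have {value}. Is that correct? (yes/no)"  # review-and-confirm

print(is_valid_email("alexis@example.com"))  # True
print(is_valid_email("not-an-email"))        # False
print(confirm("alexis@example.com"))
```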
Implicit confirmation is one of the best tools when it comes to conversational design. It allows the bot to confirm that it correctly understood the user while smoothly moving the conversation forward. For example, if the user asks for a coffee, the bot might respond with “Would you like milk with your coffee?” The bot isn’t literally saying, “Do I understand correctly that you asked for a coffee?” Instead, it’s saying, “I understood that you ordered a coffee,” it moves the conversation forward by asking if you’d like milk, and it gives the user the opportunity to correct it if they didn’t order a coffee. Award 1 point if the bot uses implicit confirmation and 0 points if it doesn’t.
Integrations are technical connections between the bot and other platforms. They enable the bot to give more personalized and accurate responses. There are simple integrations, such as storing user feedback in an online spreadsheet, and very complex integrations such as connecting to a user's bank account and checking his/her current balance. Integrations bring a bot to the next level, so for every integration that you spot, award 2 points. Bots can score a maximum of 5 points for integrations.
The ultimate failure of conversational design is getting stuck in an endless loop or a flow that a user can’t escape. Not only is it frustrating, it means that the chatbot failed to do its job. Subtract 2 points every time you get trapped in a flow or loop, capping the deduction at 10 points.
The structure of a chatbot is the information architecture and the flows are specific scenarios where the bot guides the user through a process. The highest possible score for this section is 10 points and it accounts for 20% of the audit score. The scoring here is quite straightforward.
If there’s a clear information architecture with a knowledge hierarchy, such as a main menu and/or sub-menus, then you should award 1 point. If not, then 0 points.
Fallbacks are the most complex part of this section and can use entities, database searches or offloads to ensure that the bot moves the conversation forward without saying, “I’m sorry, I don’t understand.” Award 1 point every time you experience a fallback instead of a “not-understood” message. Bots can score a maximum of 5 points for fallbacks.
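Here’s one possible shape for such a fallback chain, with stub handlers standing in for the real intent matching, entity matching, database search and offload (all names are hypothetical):

```python
def respond(message):
    """Try each fallback in order before handing off to a human."""
    for attempt in (match_intent, match_entity, search_knowledge_base):
        reply = attempt(message)
        if reply is not None:  # a fallback caught it: no "not understood"
            return reply
    return "Let me connect you with a colleague who can help."  # offload

# Stub handlers, purely illustrative:
def match_intent(msg): return None  # NLP found no confident intent
def match_entity(msg): return None  # no entity keyword matched
def search_knowledge_base(msg):
    return "This article might help: ..." if "password" in msg else None

print(respond("I forgot my password"))  # the knowledge-base fallback fires
print(respond("hmm"))                   # falls through to the offload
```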
A bot can’t and shouldn’t answer everything. That’s why you should award 1 point to bots that propose some sort of offload, whether it be a live chat handover, ticket creation, phone number or email address. If there’s no alternative form of communication to the bot, then award 0 points.
Bots can always improve and learn. Award 1 point to bots that have some version of a feedback flow, whether it’s a thumbs up/thumbs down, star rating, smiley ranking or open feedback. If none of these exist, award 0 points.
In 99% of cases, when the bot identifies the same intent twice in a row, the answer is wrong. This is frustrating for the user: they’re trying to rephrase the question, but still not triggering the right answer. In this case, the best practice is for the bot to alert the user that it’s about to give the same answer twice in a row and ask whether they want that answer anyway, or propose some sort of offload instead. Award 1 point if the bot recognizes a double intent, and 0 points if it doesn’t.
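In code, the check can be as simple as remembering the last matched intent in the session; the session shape and the copy below are our own sketch:

```python
ANSWERS = {"lost_item": "Sorry to hear that! What did you lose?"}

def reply_for(intent, session):
    if session.get("last_intent") == intent:
        return ("It looks like I'm about to give you the same answer again. "
                "Want to see it anyway, or shall I connect you with a colleague?")
    session["last_intent"] = intent
    return ANSWERS[intent]

session = {}
print(reply_for("lost_item", session))  # normal answer
print(reply_for("lost_item", session))  # same intent twice: offer an offload
```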
Since a majority of the population is at least bilingual, it’s very normal for a user to end up on a chatbot in his/her non-preferred language. This is why you should award 1 point to bots that can seamlessly switch from one language to another. Award 0 points if there's no language switch.
This section evaluates how the bot looks and it has the smallest impact on the final score since design is always subjective. It’s possible to score 5 points in this area and it accounts for 10% of the final score.
Try to keep your evaluation of visualization as objective as possible by awarding 1 point to bots that use rich media to show instead of tell. Rich text and media include emojis, icons, images, videos, gifs, etc. Award 0 points if this is missing.
Award 1 point to bots that use the best formats to simplify the user experience, such as lists, carousels, buttons, links, etc. If these are missing, award 0 points.
Finally, award 1 point per device (desktop, tablet and mobile) where the chat widget is responsive and works properly.
Now it’s time to calculate the total score! Add up the scores from each section. The final score is out of 50 points, so the higher the score, the better the bot performed.
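In other words (the section scores here are just example numbers):

```python
# The four pillars and their maximums, per this checklist.
sections = {
    "NLP quality": 16.5,          # out of 20
    "Conversational design": 11,  # out of 15
    "Structure & flows": 7,       # out of 10
    "Visual design": 4,           # out of 5
}

total = sum(sections.values())
print(f"Final score: {total} / 50")  # Final score: 38.5 / 50
```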
There you have it! That’s how you can use our Bot Audit Checklist to identify what’s working in your bot and what needs improvement. If you don’t have the checklist yet, feel free to download it here 👇
Fill out your info on the right to receive a download link for an editable version of the checklist!
If you want to discuss AI in more detail, then reach out to Alexis.
He's ready to chat in French, English and Greek.