Updated by Michelli Silva

After adding a lot of examples for each intent in your dataset, you may want to know if your intelligence is working well, right?

To check if your training data is behaving as expected you can use BotHub's Test section!

This feature evaluates your intelligence by running your test sentences against the model trained on your training dataset and comparing the predicted intents with the expected ones.

Users can add test sentences that simulate final user inputs to evaluate the quality of the training data or of the selected algorithm. Because the test sentences differ from the training sentences used in the intelligence, the resulting graphics and metrics show where the intelligence can be improved.

In this article, you will learn how to run a test in BotHub and how to analyze its results.

Running a Test

To run a test, go to Test -> Sentences in the intelligence you want to evaluate, and add test sentences for each intent.

Once you have enough sentences to test, select the language that you want and click on the Run Test button.

BotHub will redirect you to the Results screen, where all the relevant data from that test will show up!


You can find a list of previous results for the tests made in this intelligence by going to the Test -> Results page.

Select one of these results and you will find the following graphics and metrics for the given testing dataset:

  • Sentence details
  • Precision and Recall reports
  • Intent confusion matrix
  • Intent confidence distribution

Sentence details

This is a list of all the sentences tested by the algorithm and whether they were correctly predicted or not. Click one of them to see details about the test, such as the confidence and the predicted intent for each sentence.

Precision and recall reports

A Precision score of 1.0 for an intention X means that all sentences labeled as belonging to intention X actually belong to intention X (but it says nothing about the remaining sentences from intention X that were not correctly classified).

A Recall of 1.0 means that all sentences from intention X have been labeled as belonging to intention X (but it says nothing about how many sentences from other intentions have been incorrectly classified as belonging to intention X).

Precision answers the following question: "In the set of all the sentences labeled as intention X (correct and wrong), how many were correctly classified?"

  • An intention that has no false positive has a Precision of 1.0

Recall answers the following question: "In the set of all sentences belonging to an intention X, how many were correctly classified as X?"

  • An intention that has no false negative has a Recall of 1.0
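The two questions above can be sketched in plain Python. Everything below is hypothetical for illustration (the intent names and the list of predictions are made up, not BotHub output): precision counts over the sentences *predicted* as an intent, recall counts over the sentences that *belong* to it.

```python
# Hypothetical test results as (expected_intent, predicted_intent) pairs.
results = [
    ("greeting", "greeting"),
    ("greeting", "farewell"),   # a "greeting" sentence mistaken for "farewell"
    ("farewell", "farewell"),
    ("farewell", "farewell"),
    ("greeting", "greeting"),
]

def precision(intent, results):
    # Of all sentences *predicted* as `intent`, how many truly belong to it?
    expected_of_predicted = [exp for exp, pred in results if pred == intent]
    if not expected_of_predicted:
        return 0.0
    return expected_of_predicted.count(intent) / len(expected_of_predicted)

def recall(intent, results):
    # Of all sentences that *belong* to `intent`, how many were predicted as it?
    predicted_of_expected = [pred for exp, pred in results if exp == intent]
    if not predicted_of_expected:
        return 0.0
    return predicted_of_expected.count(intent) / len(predicted_of_expected)

print(round(precision("farewell", results), 3))  # 2 of 3 "farewell" predictions are right
print(recall("farewell", results))               # every true "farewell" sentence was found: 1.0
```

Note how the single mistaken sentence hurts the precision of "farewell" (a false positive for it) and the recall of "greeting" (a false negative for it) at the same time.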

In the context of intent classification, a false positive for an intention X is a sentence that was classified as X but actually belongs to a different intention, while a false negative for X is a sentence that belongs to X but was classified as some other intention.

For example, regarding a farewell intent: a greeting sentence classified as farewell is a false positive for farewell, and a farewell sentence classified as greeting is a false negative for farewell.

Intent confusion matrix

The confusion matrix shows you which intents are mistaken for others. The vertical axis lists the intents that the bot should have predicted, and the horizontal axis lists the intents that the bot actually predicted. In the confusion matrix, the ideal distribution of the data lies on the diagonal, because then all the sentences would have been predicted correctly.

The matrix above shows us that one of the test sentences failed. The sentence has the intent outside, and was predicted by the algorithm as the intent inside.
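The construction of such a matrix can be sketched in plain Python. The inside/outside results below are hypothetical, mirroring the example in the text (this is not BotHub's implementation, just the counting idea): tallying (expected, predicted) pairs gives the matrix, and any off-diagonal cell marks a mistake.

```python
from collections import Counter

# Hypothetical (expected, predicted) pairs from a test run; one sentence
# with the "outside" intent was wrongly predicted as "inside".
results = [
    ("inside", "inside"),
    ("inside", "inside"),
    ("outside", "outside"),
    ("outside", "inside"),  # the single failing test sentence
]

intents = sorted({intent for pair in results for intent in pair})
counts = Counter(results)  # maps (expected, predicted) -> number of sentences

# Print the matrix: rows are expected intents, columns are predicted intents.
print("expected \\ predicted " + " ".join(f"{i:>8}" for i in intents))
for expected in intents:
    cells = " ".join(f"{counts[(expected, predicted)]:>8}" for predicted in intents)
    print(f"{expected:>20} {cells}")
```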

Intent confidence distribution

The histogram allows you to visualize the confidence distribution for all predictions, with the volume of correct and incorrect predictions being displayed by green and red bars respectively.

Improving the quality of your training data will move the green histogram bars to the right and the red histogram bars to the left of the plot since, ideally, incorrectly classified phrases should have low confidence.

In the histogram above, we can see that most of the correctly predicted sentences had 95% confidence; however, some sentences were predicted with 100% confidence and were wrong, which is not a good sign.
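The bucketing behind such a histogram can be sketched in plain Python (the confidence values below are hypothetical, not taken from a real BotHub run): correct and incorrect predictions are split into two series, then each confidence falls into one of ten equal-width bins over [0, 1].

```python
# Hypothetical predictions as (confidence, was_correct) pairs.
predictions = [
    (0.95, True), (0.97, True), (0.96, True), (0.62, True),
    (1.00, False), (0.55, False),
]

def histogram(confidences, n_bins=10):
    # Bucket confidences into n_bins equal-width bins over [0, 1].
    bins = [0] * n_bins
    for c in confidences:
        bins[min(int(c * n_bins), n_bins - 1)] += 1  # clamp 1.0 into the last bin
    return bins

green = histogram([c for c, ok in predictions if ok])    # correct predictions
red = histogram([c for c, ok in predictions if not ok])  # wrong predictions

print(green)  # most correct predictions land in the 0.9-1.0 bin
print(red)    # one wrong prediction sits in the 0.9-1.0 bin: a bad sign
```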

Good practices

To get a better idea of whether your bot is really smart, try adding real test sentences that the model has never seen (sentences different from the trained ones). This lets you check whether your intelligence is really able to abstract and understand the meaning of the trained intentions well enough to successfully classify a phrase it has never seen before.
