Chatbots are no longer a black box
Most of us would have interacted with a chatbot in one way or the other, like Siri, Google Assistant, etc., and most of us would have wondered how it was able to understand what we were saying in the first place. Even my first experience with a chatbot (LUIS) was a magical moment, until I learnt what sequential modelling is. So in this post we will learn about sequential modelling, a widely used modelling approach for dealing with sequence-related problems like speech recognition, machine translation, DNA sequence analysis, named entity recognition, etc. After reading this post, I hope the next time you interact with a chatbot you will know what the deal is.
User sentence: "Book a flight from Chennai to Bangalore on Friday 10 am"

The intent of this sentence: Book Flight

The information extracted:
Source: Chennai
Destination: Bangalore
Time: Friday 10 am
The two most important tasks performed the moment a chatbot encounters a user sentence are: (i) classifying the given sentence into its underlying intent (book a ticket, buy a pizza, etc.), and (ii) extracting the information from the given sentence (place, person, organization, etc.). This extraction of information is, in NLP terms, called Information Extraction. The first task is a typical text classification problem (a supervised class of problem), while the second is a sequential labelling problem (again a supervised class of problem).
What is Sequential Modelling?
Sequence models have garnered a lot of attention because most of the data in the current world is in the form of sequences: it can be a number sequence, an image pixel sequence, a video frame sequence or an audio sequence. Let us now try to understand the basic principle of a sequential model.
Consider a sequence A B C B D B E B
What is the probability of the token B coming after A?
What is the probability of the token C coming after B?
That is, what is the probability of the next token given the previous token?
Coming up with a set of such probabilities for a given sequence is the fundamental principle of sequential modelling. It goes without saying that most machine learning models are built around probability and statistics.
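To make this concrete, here is a minimal Python sketch (my own illustration, not from the original post) that estimates these next-token probabilities simply by counting adjacent pairs in the sequence A B C B D B E B:

```python
from collections import Counter

sequence = "A B C B D B E B".split()
pairs = Counter(zip(sequence, sequence[1:]))   # counts of (previous, next) pairs

def prob_next(prev, nxt):
    """P(next token = nxt | previous token = prev), estimated from counts."""
    from_prev = sum(c for (p, _), c in pairs.items() if p == prev)
    return pairs[(prev, nxt)] / from_prev

print(prob_next("A", "B"))   # 1.0   -> B always follows A
print(prob_next("B", "C"))   # 0.33... -> B is followed by C, D or E equally often
```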
There are a few algorithms designed to solve sequence problems; some of the most widely used ones are:
i. Hidden Markov Model or HMM
ii. Maximum Entropy Markov Model or MEMM
iii. Conditional Random Field or CRF
iv. Recurrent Neural Network or RNN
In this post we will unravel how a Hidden Markov Model is used for the Named Entity Recognition task, i.e. identifying named entities (person, organization, product, etc.) in a given sentence (a sequence of words/tokens).
What is a Markov Model?
In order to understand the Hidden Markov Model, we need to understand what a Markov model is. The Russian mathematician Andrey Markov assumed that the next state in a sequence depends solely on the previous state. This is called the Markov property, and any model built on this assumption is called a Markov model. Let us try to understand it with an example. Consider the sentence
Sharukh khan salman khan amir khan zaheer khan
In the above sentence there are 8 words in total (tokens, in NLP terms) and 5 distinct words. If I ask you what the next word in this sequence would be, you might guess "khan", because in the given sentence "khan" appears 4 times out of 8 words.
*START* → Sharukh, Sharukh → khan, khan → salman, salman → khan, khan → amir,
amir → khan, khan → zaheer, zaheer → khan, khan → *END*
This is the same sequence written out with its transitions. One can clearly see that the word "khan" can be followed by any of [salman, amir, zaheer, *END*], and similarly for the other words. Let us consolidate:
(*START*) → (Sharukh)
(Sharukh) → (khan)
(khan) → ([salman, amir, zaheer, *END*])
(salman) → (khan)
(amir) → (khan)
(zaheer) → (khan)
The above tells you that if the current state is *START* there is a 100% chance the sentence begins with "Sharukh"; if the current state is "Sharukh" there is a 100% chance that the next word will be "khan"; and if the current state is "khan" there is a 25% chance the next word will be "salman", a 25% chance it will be "amir", a 25% chance it will be "zaheer" and a 25% chance that it is the *END* of the sequence (khan occurs 4 times and is followed once by each).
Let us put these numbers in a state diagram
The above state diagram, together with the start and transition probabilities, is called a Markov chain.
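As a quick check on these numbers, here is a small sketch (again my own illustration, not part of the original post) that builds the start and transition probabilities of this Markov chain directly from the sentence, padding it with *START* and *END* markers:

```python
from collections import Counter, defaultdict

def markov_chain(tokens):
    """Estimate transition probabilities for a single padded token sequence."""
    padded = ["*START*"] + tokens + ["*END*"]
    counts = defaultdict(Counter)
    for current, nxt in zip(padded, padded[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(counter.values()) for nxt, c in counter.items()}
        for state, counter in counts.items()
    }

chain = markov_chain("Sharukh khan salman khan amir khan zaheer khan".split())
print(chain["*START*"])  # {'Sharukh': 1.0}
print(chain["khan"])     # {'salman': 0.25, 'amir': 0.25, 'zaheer': 0.25, '*END*': 0.25}
```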
Hidden Markov Model
Now that we have learnt the Markov model, let us discuss the Hidden Markov Model, a variation of the Markov model in which the states themselves are not directly observable (they are hidden); instead, each hidden state emits an observable symbol. Let us try to understand this with an example.
Bob and Sarah are friends who live 100 km apart. Bob's mood on any given day depends on the weather that day, and Sarah is aware of this.
Rainy – Bob is sad
Sunny – Bob is happy
One day Bob called Sarah and told her that he is happy, so Sarah inferred that it must be sunny at Bob's place, and vice versa.
Here, what Sarah observed is Bob's mood (whether he is happy or sad), and what is hidden are the states Sunny and Rainy. This is the fundamental premise of how a Hidden Markov Model works. An HMM consists of two stochastic processes: an invisible process of hidden states and a visible process of observable symbols. In speech recognition, for example, we are interested in predicting the uttered word from a recorded speech signal, so the speech recognizer tries to find the sequence of phonemes (states) that gave rise to the actual uttered sound (observations).
In order to solve an HMM problem we need three probabilities:
i. Initial State Probability (The start probability of a state)
ii. Transition Probability (The probability of making a transition from state i to j)
iii. Emission Probability (The probability of the observed symbols given the hidden state)
X = {X1, X2, …, Xn} is the observed sequence
Y = {Y1, Y2, …, Yn} is the hidden state sequence
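In code, these three probability tables are often written down as simple dictionaries. The sketch below does this for the Bob and Sarah example; note that the actual numbers are illustrative assumptions of mine, since the post does not specify them:

```python
# Hidden states Y and observable symbols X for the Bob/Sarah example.
states = ["Rainy", "Sunny"]
observations = ["sad", "happy"]

# Initial state probability: how likely each weather state is on day one (assumed values).
start_probability = {"Rainy": 0.4, "Sunny": 0.6}

# Transition probability: P(tomorrow's weather | today's weather) (assumed values).
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.2, "Sunny": 0.8},
}

# Emission probability: P(Bob's mood | weather) (assumed values).
emission_probability = {
    "Rainy": {"sad": 0.9, "happy": 0.1},
    "Sunny": {"sad": 0.2, "happy": 0.8},
}
```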
Example
Let us now understand how named entity recognition in a chatbot works using an HMM.
All supervised problems require labelled data for training, called training data; in NLP, this labelled data is called a corpus. We have 3 sentences in our corpus for the NER task:
i. I want to buy chair
ii. Buy a laptop backpack
iii. Let me buy a chair
Step 1: (Annotate this corpus)
Input: Raw file
Output: Annotated text (In this case named entities)
I/OTHER want/OTHER to/OTHER buy/OTHER chair/PRODUCT
Buy/OTHER a/OTHER laptop/PRODUCT backpack/PRODUCT
Let/OTHER me/OTHER buy/OTHER a/OTHER chair/PRODUCT
Step 2: (Parameter Estimation)
1. Find states
a. Here the hidden states are C = (OTHER, PRODUCT)
2. Initial Start Probability: The probability of a state being at the start of the sequence
Start probability = number of sentences starting with a particular tag / total number of sentences in the corpus
a. OTHER → 3/3 = 1 (i.e. 3 out of 3 sentences start with the OTHER state)
b. PRODUCT → 0/3 = 0 (i.e. no sentence starts with the PRODUCT state)
3. Transition Probability matrix: The probability of making a transition from one state to another
Transition probability = total number of transitions from state i to state j / total number of occurrences of state i
| From \ To | OTHER      | PRODUCT    |
| OTHER     | 7/10 = 0.7 | 3/10 = 0.3 |
| PRODUCT   | 0/4 = 0    | 1/4 = 0.25 |
4. Emission Probability: The probability of an observed symbol given the state
Emission probability = number of occurrences of a word with a given tag / total occurrences of that tag
| Distinct word | OTHER      | PRODUCT    |
| I             | 1/10 = 0.1 | 0/4 = 0    |
| want          | 1/10 = 0.1 | 0/4 = 0    |
| to            | 1/10 = 0.1 | 0/4 = 0    |
| buy           | 3/10 = 0.3 | 0/4 = 0    |
| chair         | 0/10 = 0   | 2/4 = 0.5  |
| a             | 2/10 = 0.2 | 0/4 = 0    |
| laptop        | 0/10 = 0   | 1/4 = 0.25 |
| backpack      | 0/10 = 0   | 1/4 = 0.25 |
| let           | 1/10 = 0.1 | 0/4 = 0    |
| me            | 1/10 = 0.1 | 0/4 = 0    |
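The whole of Step 2 can be reproduced with a few lines of Python. The sketch below is my own illustration (with the word "chairs" normalized to "chair", as the emission table does): it counts tags, tag transitions and word/tag pairs in the annotated corpus and turns them into the three probability tables above.

```python
from collections import Counter, defaultdict

# The annotated corpus from Step 1, as (word, tag) pairs.
corpus = [
    [("I", "OTHER"), ("want", "OTHER"), ("to", "OTHER"), ("buy", "OTHER"), ("chair", "PRODUCT")],
    [("Buy", "OTHER"), ("a", "OTHER"), ("laptop", "PRODUCT"), ("backpack", "PRODUCT")],
    [("Let", "OTHER"), ("me", "OTHER"), ("buy", "OTHER"), ("a", "OTHER"), ("chair", "PRODUCT")],
]

start_counts, tag_counts = Counter(), Counter()
transition_counts, emission_counts = defaultdict(Counter), defaultdict(Counter)

for sentence in corpus:
    tags = [tag for _, tag in sentence]
    start_counts[tags[0]] += 1
    for word, tag in sentence:
        emission_counts[tag][word.lower()] += 1
        tag_counts[tag] += 1
    for prev, nxt in zip(tags, tags[1:]):
        transition_counts[prev][nxt] += 1

start_prob = {tag: start_counts[tag] / len(corpus) for tag in tag_counts}
transition_prob = {p: {n: transition_counts[p][n] / tag_counts[p] for n in tag_counts}
                   for p in tag_counts}
emission_prob = {tag: {w: c / tag_counts[tag] for w, c in emission_counts[tag].items()}
                 for tag in tag_counts}

print(start_prob)                  # {'OTHER': 1.0, 'PRODUCT': 0.0}
print(transition_prob["OTHER"])    # {'OTHER': 0.7, 'PRODUCT': 0.3}
print(transition_prob["PRODUCT"])  # {'OTHER': 0.0, 'PRODUCT': 0.25}
print(emission_prob["PRODUCT"])    # {'chair': 0.5, 'laptop': 0.25, 'backpack': 0.25}
```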
Step 3 (Testing a sentence)
Test Sentence – Buy chair
We need to find the state sequence that maximizes the likelihood of this observation.
For "Buy chair", the possible state sequences are:
(OTHER OTHER)
(OTHER PRODUCT)
(PRODUCT OTHER)
(PRODUCT PRODUCT)
We need to find the sequence with the maximum probability out of these possible sequences.
Instead of evaluating all the possible sequences to find the maximum-probability one, we can use a dynamic programming technique called the Viterbi algorithm.
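For completeness, here is one possible sketch of the Viterbi algorithm (my own illustration, not code from the post). It reuses the start_prob, transition_prob and emission_prob dictionaries estimated in the earlier sketch and returns the most probable tag sequence for the test sentence:

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(words[t], 0.0), p)
                for p in states
            )
            best[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.insert(0, state)
    return path

print(viterbi(["buy", "chair"], ["OTHER", "PRODUCT"], start_prob, transition_prob, emission_prob))
# ['OTHER', 'PRODUCT']  ->  Buy/OTHER chair/PRODUCT
```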
Result
Buy/OTHER chair/PRODUCT
Conclusion
Now we have learnt how a chatbot uses the probabilities estimated from the training corpus to arrive at the most likely sequence of named entities. This is one of the simplest sequential modelling techniques, and the same approach can be used for part-of-speech tagging as well. Though an HMM is simple to implement, it has its own problems:
1. Predicting the sequence of states for an observation that is not contained in the training corpus
For example, if the test sentence were "Buy Gloves": the word "Gloves" never appears in the training corpus, so its emission probability is zero for every state and the probability of any candidate sequence becomes zero. This can be solved using the Laplace smoothing technique, which is beyond the scope of this post.
2. A Markov model does not capture long-term dependencies, since the Markov assumption states that each state depends only on the previous state
For example, “In France, I had a great time and learnt some of the ________ language”
If I ask you to fill in the blank, you would fill it with "French". How were you able to do that? Because you looked back at the word "France" at the beginning of the sentence and hence were able to fill the blank. But if our model only looks back at the last word, there is no way it will be able to guess the word. For this purpose, we use Recurrent Neural Networks.
In my next post we will unravel other models such as MEMM and CRF, which are more widely used for sequential modelling than HMM. But without a complete understanding of how an HMM models a sequence, it is difficult to learn those models, so knowledge of HMM is the basic prerequisite.