Chatbots are no longer a black box
Most of us would have interacted with a chatbot in one way or the other, like Siri, Google Assistant, etc., and most of us would have wondered how it was able to understand what we were saying in the first place. Even my first experience with a chatbot (LUIS) was a magical moment, until I learnt what sequential modelling is. So in this post we will learn about sequential modelling, a widely used modelling approach for dealing with sequence-related problems like speech recognition, machine translation, DNA sequence analysis, named entity recognition, etc. After reading this post, I hope the next time you interact with a chatbot you will know what the deal is.
User sentence: "Book a flight from Chennai to Bangalore on Friday 10 am"

The intent of this sentence: Book Flight

The information extracted:
Source: Chennai
Destination: Bangalore
Time: Friday 10 am
The two most important tasks performed the moment a chatbot encounters a user sentence are: (i) classifying the given sentence into its underlying intent (book a ticket, buy a pizza, etc.), and (ii) extracting the information from the given sentence (place, person, organization, etc.). This extraction of information is, in NLP terms, called Information Extraction. The first task is a typical text classification problem (a supervised class of problem), while the second is a sequential labelling problem (again a supervised class of problem).
What is Sequential Modelling?
Sequence models have garnered a lot of attention because most of the data in the current world is in the form of sequences: it can be a number sequence, an image pixel sequence, a video frame sequence or an audio sequence. Let us now try to understand the basic principle of a sequential model.
Consider a sequence A B C B D B E B
What is the probability of the token B coming after A?
What is the probability of the token C coming after B?
That is, what is the probability of the next token given the previous token?
Coming up with a set of such probabilities for a given sequence is the fundamental principle of sequential modelling. It goes without saying that most machine learning models are built around probability and statistics.
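To make this concrete, here is a minimal Python sketch (my own illustration, not from the original post) that estimates these next-token probabilities simply by counting adjacent pairs in the sequence A B C B D B E B:

```python
from collections import Counter

sequence = "A B C B D B E B".split()
pairs = Counter(zip(sequence, sequence[1:]))   # counts of (previous, next) pairs

def prob_next(prev, nxt):
    """P(next token = nxt | previous token = prev), estimated from counts."""
    from_prev = sum(c for (p, _), c in pairs.items() if p == prev)
    return pairs[(prev, nxt)] / from_prev

print(prob_next("A", "B"))   # 1.0   -> B always follows A
print(prob_next("B", "C"))   # 0.33... -> B is followed by C, D or E equally often
```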
There are a few algorithms designed to solve sequence problems; some of the most widely used ones are:
i. Hidden Markov Model or HMM
ii. Maximum Entropy Markov Model or MEMM
iii. Conditional Random Field or CRF
iv. Recurrent Neural Network or RNN
In this post we will unravel how a Hidden Markov Model is used for the Named Entity Recognition task, i.e. identifying named entities (person, organization, product, etc.) in a given sentence (a sequence of words/tokens).
What is a Markov Model?
In order to understand the Hidden Markov Model, we need to understand what a Markov model is. The Russian mathematician Andrey Markov assumed that the next state in a sequence depends solely on the previous state. This is called the Markov property, and any model built on this assumption is called a Markov model. Let us try to understand it with an example. Consider the sentence
Sharukh khan salman khan amir khan zaheer khan
In the above sentence there are 8 words in total (tokens, in NLP terms) and 5 distinct words. If I ask you what the next word in this sequence would be, you might guess "khan", because in the given sentence "khan" appears 4 times out of 8 words.
*START* → Sharukh, Sharukh → khan, khan → salman, salman → khan, khan → amir,
amir → khan, khan → zaheer, zaheer → khan, khan → *END*
This is the same sequence written out with its transitions. One can clearly see that the word "khan" can be followed by any of [salman, amir, zaheer, *END*], and similarly for the other words. Let us consolidate:
(*START*) → (Sharukh)
(Sharukh) → (khan)
(khan) → ([salman, amir, zaheer, *END*])
(salman) → (khan)
(amir) → (khan)
(zaheer) → (khan)
The above tells you that if the current state is *START* there is a 100% chance the sentence begins with "Sharukh"; if the current state is "Sharukh" there is a 100% chance that the next word will be "khan"; and if the current state is "khan" there is a 25% chance the next word will be "salman", a 25% chance it will be "amir", a 25% chance it will be "zaheer" and a 25% chance that it is the *END* of the sequence (khan occurs 4 times and is followed once by each).
Let us put these numbers in a state diagram
The above state diagram, together with the start and transition probabilities, is called a Markov chain.
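As a quick check on these numbers, here is a small sketch (again my own illustration, not part of the original post) that builds the start and transition probabilities of this Markov chain directly from the sentence, padding it with *START* and *END* markers:

```python
from collections import Counter, defaultdict

def markov_chain(tokens):
    """Estimate transition probabilities for a single padded token sequence."""
    padded = ["*START*"] + tokens + ["*END*"]
    counts = defaultdict(Counter)
    for current, nxt in zip(padded, padded[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(counter.values()) for nxt, c in counter.items()}
        for state, counter in counts.items()
    }

chain = markov_chain("Sharukh khan salman khan amir khan zaheer khan".split())
print(chain["*START*"])  # {'Sharukh': 1.0}
print(chain["khan"])     # {'salman': 0.25, 'amir': 0.25, 'zaheer': 0.25, '*END*': 0.25}
```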
Hidden Markov Model
Now that we have learnt the Markov model, let us discuss the Hidden Markov Model, a variation of the Markov model in which the states themselves are not directly observable (they are hidden); instead, each hidden state emits an observable symbol. Let us try to understand this with an example.
Bob and Sarah are friends who live 100 km apart. Bob's mood on any given day depends on the weather that day, and Sarah is aware of this.
Rainy – Bob is sad
Sunny – Bob is happy
One day Bob called Sarah and told her that he is happy, so Sarah inferred that it must be sunny at Bob's place, and vice versa.
Here, what Sarah observed is Bob's mood (whether he is happy or sad), and what is hidden are the states Sunny and Rainy. This is the fundamental premise of how a Hidden Markov Model works. An HMM consists of two stochastic processes: an invisible process of hidden states and a visible process of observable symbols. In speech recognition, for example, we are interested in predicting the uttered word from a recorded speech signal, so the speech recognizer tries to find the sequence of phonemes (states) that gave rise to the actual uttered sound (observations).
In order to solve an HMM problem we need three probabilities:
i. Initial State Probability (The start probability of a state)
ii. Transition Probability (The probability of making a transition from state i to j)
iii. Emission Probability (The probability of the observed symbols given the hidden state)
X = {X1, X2, …, Xn} is the observed sequence
Y = {Y1, Y2, …, Yn} is the hidden state sequence
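In code, these three probability tables are often written down as simple dictionaries. The sketch below does this for the Bob and Sarah example; note that the actual numbers are illustrative assumptions of mine, since the post does not specify them:

```python
# Hidden states Y and observable symbols X for the Bob/Sarah example.
states = ["Rainy", "Sunny"]
observations = ["sad", "happy"]

# Initial state probability: how likely each weather state is on day one (assumed values).
start_probability = {"Rainy": 0.4, "Sunny": 0.6}

# Transition probability: P(tomorrow's weather | today's weather) (assumed values).
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.2, "Sunny": 0.8},
}

# Emission probability: P(Bob's mood | weather) (assumed values).
emission_probability = {
    "Rainy": {"sad": 0.9, "happy": 0.1},
    "Sunny": {"sad": 0.2, "happy": 0.8},
}
```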
Example
Let us now understand how named entity recognition in a chatbot works using an HMM.
All supervised problems require labelled data for training, called training data; in NLP, this labelled data is called a corpus. We have 3 sentences in our corpus for the NER task:
i. I want to buy chair
ii. Buy a laptop backpack
iii. Let me buy a chair
Step 1: (Annotate this corpus)
Input: Raw file
Output: Annotated text (In this case named entities)
I/OTHER want/OTHER to/OTHER buy/OTHER chair/PRODUCT
Buy/OTHER a/OTHER laptop/PRODUCT backpack/PRODUCT
Let/OTHER me/OTHER buy/OTHER a/OTHER chair/PRODUCT
Step 2: (Parameter Estimation)
1. Find states
a. Here the hidden states are C = (OTHER, PRODUCT)
2. Initial Start Probability: The probability of a state being at the start of the sequence
Start probability = number of sentences starting with a particular tag / total number of sentences in the corpus
a. OTHER → 3/3 = 1 (i.e. 3 out of 3 sentences start with the OTHER state)
b. PRODUCT → 0/3 = 0 (i.e. no sentence starts with the PRODUCT state)
3. Transition Probability matrix: The probability of making a transition from one state to another
Transition probability = total number of transitions from state i to state j / total number of occurrences of state i
| From \ To | OTHER      | PRODUCT    |
| OTHER     | 7/10 = 0.7 | 3/10 = 0.3 |
| PRODUCT   | 0/4 = 0    | 1/4 = 0.25 |
4. Emission Probability: The probability of an observed symbol given the state
Emission probability = number of occurrences of a word with a given tag / total occurrences of that tag
| Distinct word | OTHER      | PRODUCT    |
| I             | 1/10 = 0.1 | 0/4 = 0    |
| want          | 1/10 = 0.1 | 0/4 = 0    |
| to            | 1/10 = 0.1 | 0/4 = 0    |
| buy           | 3/10 = 0.3 | 0/4 = 0    |
| chair         | 0/10 = 0   | 2/4 = 0.5  |
| a             | 2/10 = 0.2 | 0/4 = 0    |
| laptop        | 0/10 = 0   | 1/4 = 0.25 |
| backpack      | 0/10 = 0   | 1/4 = 0.25 |
| let           | 1/10 = 0.1 | 0/4 = 0    |
| me            | 1/10 = 0.1 | 0/4 = 0    |
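The whole of Step 2 can be reproduced with a few lines of Python. The sketch below is my own illustration (with the word "chairs" normalized to "chair", as the emission table does): it counts tags, tag transitions and word/tag pairs in the annotated corpus and turns them into the three probability tables above.

```python
from collections import Counter, defaultdict

# The annotated corpus from Step 1, as (word, tag) pairs.
corpus = [
    [("I", "OTHER"), ("want", "OTHER"), ("to", "OTHER"), ("buy", "OTHER"), ("chair", "PRODUCT")],
    [("Buy", "OTHER"), ("a", "OTHER"), ("laptop", "PRODUCT"), ("backpack", "PRODUCT")],
    [("Let", "OTHER"), ("me", "OTHER"), ("buy", "OTHER"), ("a", "OTHER"), ("chair", "PRODUCT")],
]

start_counts, tag_counts = Counter(), Counter()
transition_counts, emission_counts = defaultdict(Counter), defaultdict(Counter)

for sentence in corpus:
    tags = [tag for _, tag in sentence]
    start_counts[tags[0]] += 1
    for word, tag in sentence:
        emission_counts[tag][word.lower()] += 1
        tag_counts[tag] += 1
    for prev, nxt in zip(tags, tags[1:]):
        transition_counts[prev][nxt] += 1

start_prob = {tag: start_counts[tag] / len(corpus) for tag in tag_counts}
transition_prob = {p: {n: transition_counts[p][n] / tag_counts[p] for n in tag_counts}
                   for p in tag_counts}
emission_prob = {tag: {w: c / tag_counts[tag] for w, c in emission_counts[tag].items()}
                 for tag in tag_counts}

print(start_prob)                  # {'OTHER': 1.0, 'PRODUCT': 0.0}
print(transition_prob["OTHER"])    # {'OTHER': 0.7, 'PRODUCT': 0.3}
print(transition_prob["PRODUCT"])  # {'OTHER': 0.0, 'PRODUCT': 0.25}
print(emission_prob["PRODUCT"])    # {'chair': 0.5, 'laptop': 0.25, 'backpack': 0.25}
```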
Step 3 (Testing a sentence)
Test Sentence – Buy chair
We need to find the state sequence that maximizes the likelihood of this observation.
For "Buy chair", the possible state sequences are:
(OTHER OTHER)
(OTHER PRODUCT)
(PRODUCT OTHER)
(PRODUCT PRODUCT)
We need to find the sequence with the maximum probability out of these possible sequences.
Instead of evaluating all the possible sequences to find the maximum-probability one, we can use a dynamic programming technique called the Viterbi algorithm.
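For completeness, here is one possible sketch of the Viterbi algorithm (my own illustration, not code from the post). It reuses the start_prob, transition_prob and emission_prob dictionaries estimated in the earlier sketch and returns the most probable tag sequence for the test sentence:

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(words[t], 0.0), p)
                for p in states
            )
            best[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.insert(0, state)
    return path

print(viterbi(["buy", "chair"], ["OTHER", "PRODUCT"], start_prob, transition_prob, emission_prob))
# ['OTHER', 'PRODUCT']  ->  Buy/OTHER chair/PRODUCT
```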
Result
Buy/OTHER chair/PRODUCT
Conclusion
Now we have learnt how a chatbot uses the probabilities estimated from the training corpus to arrive at the most likely sequence of named entities. This is one of the simplest sequential modelling techniques, and the same approach can be used for part-of-speech tagging as well. Though an HMM is simple to implement, it has its own problems:
1. Predicting the sequence of states for an observation that is not contained in the training corpus
For example, if the test sentence were "Buy Gloves": the word "Gloves" never appears in the training corpus, so its emission probability is zero for every state and the probability of any candidate sequence becomes zero. This can be solved using the Laplace smoothing technique, which is beyond the scope of this post.
2. A Markov model does not capture long-term dependencies, since the Markov assumption states that each state depends only on the previous state
For example, “In France, I had a great time and learnt some of the ________ language”
If I ask you to fill in the blank, you would fill it with "French". How were you able to do that? Because you looked back at the word "France" at the beginning of the sentence and hence were able to fill the blank. But if our model only looks back at the last word, there is no way it will be able to guess the word. For this purpose, we use Recurrent Neural Networks.
In my next post we will unravel other models such as MEMM and CRF, which are more widely used for sequential modelling than HMM. But without a complete understanding of how an HMM models a sequence, it is difficult to learn those models, so knowledge of HMM is the basic prerequisite.