
Data Anonymization: A Crucial Task of Data Security for Any AI Engine

Definition
Data anonymization is the conversion of personally identifiable data into anonymized data by applying one or more anonymization techniques.
Typically, data anonymization is an irreversible process: no transformation of the anonymized data should be able to bring back the original data.
In an anonymized dataset, the risk of re-identifying individuals should be negligible.

Factors deciding Anonymization techniques
The anonymization techniques applied vary with the data, for example streaming data versus static data. Some of the factors taken into account when deciding on a technique are:
  • Nature and type of data
  • Risk of re-identification of data
  • Risk-utility trade-off

Terminologies
Some of the most common terms that one has to know to understand data anonymization are:
  • Adversary – A party attempting to re-identify an individual’s data from a dataset that is supposed to be anonymized.
  • Direct Identifier – A data attribute that on its own can be used to identify an individual (e.g. a fingerprint).
  • Indirect Identifier – Also known as a quasi-identifier. An attribute that does not identify an individual on its own but may help to identify one when combined with other information or attributes.
  • Equivalence class – The records in a dataset that share the same values for certain attributes.
  • Non-Identifier – An attribute that is neither a direct nor an indirect identifier. Such attributes need not undergo anonymization.
  • Pseudonymization – The technique of replacing an identifier with an unrelated yet typically unique value (e.g. replacing ‘Joshua’ with 45896).

Things to remember before Applying Anonymization  
Purpose of Anonymization and Utility: Anonymization should be tailored to the purpose at hand because, regardless of the technique applied, it reduces the utility of the dataset. There is also a risk that an adversary who knows which technique was applied, and at what level of granularity, can use that knowledge to understand the data better.
Characteristics of Anonymization technique: Different anonymization techniques modify the data in different ways, and their suitability depends on the nature of the data being anonymized. For example, data perturbation works well for data that is continuous in nature.
  • Character masking – Modifies only part of an attribute value and tends to reveal the actual length of the data, which can help re-identification.
  • Aggregation – Replaces the values of an attribute across multiple records.
  • Pseudonymization – Replaces the entire attribute value with unrelated but consistent information.
  • Suppression – Removes the attribute entirely.

Subject matter expert: Having an expert in the subject matter of the data helps an organization anonymize the data in a way that makes sense.
Expert in Anonymization process: Anonymization is complex, so the process should be undertaken by people well versed in anonymization techniques and principles.
Tools: Because of the complexity and computation required, software tools such as ARX and Wizuda are useful for executing anonymization techniques.

Disclosure Risks
Three kinds of disclosure risk are generally seen when anonymizing data:
Identity Disclosure: Determining the identity of the individual behind a specific record. This can happen through pseudonym reversal (for example, guessing a naive scheme that replaces ‘1’ with ‘a’ and ‘2’ with ‘b’), through insufficient anonymization, or through re-identification by linking with other data.
Attribute Disclosure: Determining that an attribute described in the dataset belongs to a specific individual, even if the individual’s record cannot be distinguished. For example, an anonymized dataset of doctor A’s patients reveals that all of his patients below the age of 30 have undergone surgery. With this information, we know that any individual who is below 30 and is a patient of doctor A has very likely undergone surgery, even though that individual’s record cannot be distinguished from the others in the anonymized dataset.
Inference Disclosure: Using the statistical properties of the dataset to make a confident guess about an individual, even if the individual is not part of that dataset.

Anonymization Techniques
There are many ways to anonymize a dataset; let us discuss some of the best-known techniques here.
ATTRIBUTE SUPPRESSION
Description: Removal of an entire attribute (column) from a dataset.
When to use: When an attribute is not required in the anonymized dataset. Mostly applied at the start of an anonymization process.
How to apply: Delete the attribute permanently rather than merely hiding it. You may also derive a new attribute from the original attribute and then discard the original.
Example:
Before Anonymization
Student | Trainer | Test Score
John    | Mary    | 84
James   | Mary    | 73
Melina  | Cena    | 68

After Anonymization
Trainer | Test Score
Mary    | 84
Mary    | 73
Cena    | 68
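
As a rough illustration, attribute suppression can be sketched in plain Python. The record layout (a list of dictionaries) and the helper name suppress_attributes are assumptions made for this sketch, not part of any particular tool.

```python
# Hypothetical records mirroring the example data in this section.
records = [
    {"Student": "John", "Trainer": "Mary", "Test Score": 84},
    {"Student": "James", "Trainer": "Mary", "Test Score": 73},
    {"Student": "Melina", "Trainer": "Cena", "Test Score": 68},
]


def suppress_attributes(rows, attributes):
    """Return new rows with the given attributes (columns) removed."""
    drop = set(attributes)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# The suppressed column is absent from the output, not just hidden.
anonymized = suppress_attributes(records, ["Student"])
print(anonymized)
```

The function builds new rows instead of editing in place, so the caller decides when the original data is discarded.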






RECORD SUPPRESSION
Description: Removal of an entire row/record from a dataset, which affects multiple attributes at the same time.
When to use: When outlier records need to be removed.
How to apply: Delete the entire record permanently rather than merely hiding it.
Example:
Before Anonymization
Student | Trainer | Test Score
1234    | Mary    | 84
5988    | Mary    | 73
4321    | Cena    | 68

After Anonymization
Student | Trainer | Test Score
1234    | Mary    | 84
5988    | Mary    | 73
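
One possible way to express this in code is sketched below. The rule for what counts as an outlier (a Trainer value that occurs fewer than min_count times) and the name suppress_rare_records are assumptions for illustration; a real project would pick its own criterion.

```python
from collections import Counter

# Hypothetical records mirroring the example data in this section.
records = [
    {"Student": 1234, "Trainer": "Mary", "Test Score": 84},
    {"Student": 5988, "Trainer": "Mary", "Test Score": 73},
    {"Student": 4321, "Trainer": "Cena", "Test Score": 68},
]


def suppress_rare_records(rows, attribute, min_count=2):
    """Drop every record whose value for `attribute` occurs fewer than min_count times."""
    counts = Counter(row[attribute] for row in rows)
    return [row for row in rows if counts[row[attribute]] >= min_count]


# The only record trained by "Cena" is unique, so it is removed entirely.
anonymized = suppress_rare_records(records, "Trainer")
print(anonymized)
```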





CHARACTER MASKING
Description: Changing the characters of a data value by using a constant symbol such as “x” or “*”. Masking is typically applied only partially and does not hide the length of the data, which poses a risk of re-identification.
When to use: When the data is a string and is a direct identifier.
How to apply: Replace some of the characters of the data value with a chosen symbol.
Example:
Before Anonymization
Name  | Age | Phone
Rahul | 29  | 9790858141
Ravan | 35  | 8880044214

After Anonymization
Name  | Age | Phone
Rahul | 29  | 9xxxxxxxxx
Ravan | 35  | 8xxxxxxxxx
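
A minimal sketch of character masking follows. The function name mask and its parameters keep_prefix and symbol are hypothetical choices for this illustration.

```python
def mask(value, keep_prefix=1, symbol="x"):
    """Keep the first keep_prefix characters and replace the rest with symbol."""
    text = str(value)
    hidden = max(len(text) - keep_prefix, 0)
    return text[:keep_prefix] + symbol * hidden


phones = ["9790858141", "8880044214"]
masked = [mask(phone) for phone in phones]
print(masked)  # ['9xxxxxxxxx', '8xxxxxxxxx']
```

Note that the masked value has the same length as the original, which is exactly the length leak mentioned in the description.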






PSEUDONYMISATION
Description: Replacement of identifying data with made-up pseudonymous values, which are typically irreversible. Also referred to as coding. The replacement can be random or consistent.
When to use: When data values need to be uniquely distinguished or when no information about the original data should be revealed.
How to apply: (i) Generate a list of random values and pick a replacement from that list at random. Make sure that the generated values have no relationship to the original data, to avoid the risk of re-identification.
(ii) By using encryption.
(iii) By using format-preserving encryption, where the replacement pseudonym has the same format as the original data.
Example:
Before Anonymization
Person  | Age | Rank
Rahul   | 29  | 4
Ravan   | 35  | 3
Kaushik | 22  | 1
Jaithri | 20  | 2

After Anonymization
Person | Age | Rank
423214 | 29  | 4
337461 | 35  | 3
425980 | 22  | 1
256812 | 20  | 2

For reversible pseudonyms we keep a secure identity database.
Identity Database (Single level decoding)
Pseudonym | Person
423214    | Rahul
337461    | Ravan
425980    | Kaushik
256812    | Jaithri

For added security we can use double-level decoding, with a linking database that is held by a trusted third party.
Linking Database
Pseudonym | Interim Pseudonym
423214    | QXYACD
337461    | EIGLMK
425980    | PHTGKM
256812    | RJPOTM

Identity Database
Interim Pseudonym | Person
QXYACD            | Rahul
EIGLMK            | Ravan
PHTGKM            | Kaushik
RJPOTM            | Jaithri
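
The random-replacement approach with a single-level identity database could look roughly like this. The function name pseudonymize, the six-digit pseudonym format and the in-memory dictionaries are assumptions for the sketch; in practice the identity database would be stored securely and separately from the pseudonymized data.

```python
import secrets


def pseudonymize(values, digits=6):
    """Assign each distinct value a random, unique numeric pseudonym.

    Returns (forward, identity_db):
      forward     maps original value -> pseudonym (consistent replacement)
      identity_db maps pseudonym -> original value (keep this secret)
    Assumes far fewer distinct values than 10**digits possible pseudonyms.
    """
    forward = {}
    identity_db = {}
    for value in values:
        if value in forward:
            continue  # same input always gets the same pseudonym
        while True:
            pseudonym = str(secrets.randbelow(10 ** digits)).zfill(digits)
            if pseudonym not in identity_db:
                break  # retry on the rare collision
        forward[value] = pseudonym
        identity_db[pseudonym] = value
    return forward, identity_db


names = ["Rahul", "Ravan", "Kaushik", "Jaithri", "Rahul"]
forward, identity_db = pseudonymize(names)
pseudonymized = [forward[name] for name in names]
print(pseudonymized)
```

Because the pseudonyms are drawn at random, they carry no relationship to the names. Discarding identity_db makes the replacement irreversible.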







GENERALISATION
Description: Deliberately reducing the precision of an attribute value while retaining the utility of the attribute.
When to use: When an attribute is needed for the purpose at hand but can be anonymized by reducing its precision.
How to apply: Group the values or assign each value to a range. The ranges should be neither too large (the data loses utility) nor too small (the data is barely changed and stays easy to re-identify).

Example:
Before Anonymization
Person | Age | Address
Rahul  | 29  | #44, Raman St, Chennai
Ravan  | 31  | #33, Haddows Rd, Chennai
Raj    | 28  | #19, Ram St, Chennai

After Anonymization
Person | Age   | Address
Rahul  | 25-30 | #44, Raman St, Chennai
Ravan  | 30-35 | #33, Haddows Rd, Chennai
Raj    | 25-30 | #19, Ram St, Chennai
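
Generalizing a numeric attribute into ranges can be sketched as below. The helper name generalize_age and the width of 5 are assumptions. Unlike the ranges 25-30 and 30-35 in the example, which share the boundary value 30, this sketch uses non-overlapping labels such as 25-29 and 30-34, so every age falls into exactly one range.

```python
def generalize_age(age, width=5):
    """Map an age to a non-overlapping range label of the given width."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = {"Rahul": 29, "Ravan": 31, "Raj": 28}
generalized = {person: generalize_age(age) for person, age in ages.items()}
print(generalized)  # {'Rahul': '25-29', 'Ravan': '30-34', 'Raj': '25-29'}
```

A larger width gives more privacy and less utility, which is the trade-off described above.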

                                                                    

SWAPPING / SHUFFLING / PERMUTATION
Description: Rearranging the values of an attribute within a dataset so that the individual values are still present but are no longer linked to their original records.
When to use: When the analysis does not depend on the relationship between attributes at the record level.
How to apply: Identify the attributes to be swapped, and then, for each attribute, swap or reassign its values among the records.

Example:
Before Anonymization
Person | Age | Phone
Rahul  | 25  | 7837561042
Ravan  | 31  | 9333276014
Raj    | 33  | 9790858141
Nisha  | 29  | 8877654310

After Anonymization
Person | Age | Phone
Rahul  | 29  | 9790858141
Ravan  | 33  | 8877654310
Raj    | 31  | 7837561042
Nisha  | 25  | 9333276014
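
A simple way to shuffle attributes is sketched here. The function name shuffle_attribute and the fixed seed are assumptions made so the sketch is repeatable; a real run would normally not fix the seed. A plain random shuffle can leave some values in their original rows by chance, so a stricter process may want to check for that.

```python
import random

records = [
    {"Person": "Rahul", "Age": 25, "Phone": "7837561042"},
    {"Person": "Ravan", "Age": 31, "Phone": "9333276014"},
    {"Person": "Raj", "Age": 33, "Phone": "9790858141"},
    {"Person": "Nisha", "Age": 29, "Phone": "8877654310"},
]


def shuffle_attribute(rows, attribute, rng=None):
    """Return new rows in which the values of one attribute are randomly permuted."""
    rng = rng or random.Random()
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(rows, values)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
shuffled = shuffle_attribute(records, "Age", rng)
shuffled = shuffle_attribute(shuffled, "Phone", rng)
print(shuffled)
```

Each column keeps exactly the same set of values, so column-level statistics such as the average age are unchanged.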

DATA PERTURBATION
Description: The data values in the dataset are modified slightly so that they differ from the originals. It is typically applied to numbers and dates. As with generalization, the degree of perturbation should be neither too small nor too large. This technique should not be used where data accuracy is crucial.
When to use: When slight changes in the data values are acceptable and data accuracy is not that crucial.
How to apply: (i) By rounding to the nearest multiple of a base x.
(ii) By adding some random noise.

Example:
Before Anonymization
Person | Weight | Height
A      | 51     | 173
B      | 64     | 164
C      | 75     | 155

After Anonymization
Person | Weight | Height
A      | 50     | 170
B      | 65     | 160
C      | 75     | 160
Weight -> rounded to base 5
Height -> rounded to base 10
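
Both ways of applying perturbation can be sketched in a few lines. The names round_to_base and add_noise are hypothetical, the rounding helper assumes non-negative integers with halves rounded up, and the noise range is an arbitrary choice for illustration.

```python
import random


def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of base (halves round up)."""
    return base * ((value + base // 2) // base)


def add_noise(value, max_delta, rng=None):
    """Shift a value by a random integer between -max_delta and +max_delta."""
    rng = rng or random.Random()
    return value + rng.randint(-max_delta, max_delta)


people = [("A", 51, 173), ("B", 64, 164), ("C", 75, 155)]

# Weight rounded to base 5, height rounded to base 10, as in the example.
perturbed = [(p, round_to_base(w, 5), round_to_base(h, 10)) for p, w, h in people]
print(perturbed)  # [('A', 50, 170), ('B', 65, 160), ('C', 75, 160)]

# Alternative: random noise of at most 2 units on the weight.
noisy_weights = [add_noise(w, 2) for _, w, _ in people]
print(noisy_weights)
```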




SYNTHETIC DATA / DUMMY DATA
Description: Generating synthetic data from the patterns found in the original dataset, rather than modifying the original dataset itself.
When to use: When the original dataset at hand is not sufficient for the analysis or when a large amount of data is required.
How to apply: Study the patterns in the original dataset and generate the synthetic data from them. The utility of the synthetic dataset is lower because it is generated from a preconceived model.

Example:
Original Dataset
Customer | Date  | Purch. Amt
A        | 1-APR | 200
A        | 2-APR | 350
B        | 1-APR | 400
B        | 2-APR | 375
C        | 1-APR | 300
C        | 2-APR | 450
...      | ...   | ...

Statistics obtained from data
Purch. Amt Range | #Customers
200-250          | 100
150-200          | 175
250-300          | 98
350-400          | 73
...              | ...

Synthetic Data (1-Day)
Customer | Date  | Purch. Amt
1001     | 1-APR | 178
1002     | 1-APR | 155
1003     | 1-APR | 74
1004     | 1-APR | 105
...      | ...   | ...
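
One very simple model for generating synthetic rows is to sample from a histogram of the original data, as sketched below. The histogram holds only the four ranges listed in the example statistics (the elided rows are not filled in), and the function name synthesize, the uniform draw within a range and the customer numbering are all assumptions for illustration.

```python
import random

# Partial histogram from the example statistics: (low, high) -> number of customers.
histogram = {(150, 200): 175, (200, 250): 100, (250, 300): 98, (350, 400): 73}


def synthesize(histogram, n, start_id=1001, date="1-APR", rng=None):
    """Generate n synthetic purchase rows by sampling ranges in proportion to their counts."""
    rng = rng or random.Random()
    ranges = list(histogram)
    weights = list(histogram.values())
    rows = []
    for i in range(n):
        low, high = rng.choices(ranges, weights=weights, k=1)[0]
        amount = rng.randint(low, high - 1)  # uniform within the chosen range
        rows.append({"Customer": start_id + i, "Date": date, "Purch. Amt": amount})
    return rows


synthetic = synthesize(histogram, 5)
for row in synthetic:
    print(row)
```

None of the generated rows corresponds to a real customer, but the synthetic data can only reproduce the patterns that the model captures, which is why its utility is lower.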

DATA AGGREGATION
Description: Summarizing the attribute values by performing micro-aggregation.
When to use: When individual data values are not required and aggregated data is sufficient.
How to apply: (i) By grouping and aggregating.
(ii) By computing the average, median, mode, etc.
When a group contains only a few records, suppress it.

Example:
Before Anonymization
Employee | JL  | Salary
A        | 3   | 25000
B        | 3   | 27000
C        | 5   | 72000
D        | 5   | 79500
E        | 5   | 81000
F        | 3   | 23500
G        | 4   | 42000
H        | 4   | 43500
I        | 4   | 48000
...      | ... | ...

After Anonymization
JL | Avg. Salary
3  | 25166
4  | 44500
5  | 77500
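
Grouping, averaging and suppressing small groups can be sketched as follows. The function name average_salary_by_level and the threshold min_group_size are assumptions, and the average is truncated to a whole number to match the example figures.

```python
from collections import defaultdict

# (Employee, JL, Salary) rows listed in the example.
employees = [
    ("A", 3, 25000), ("B", 3, 27000), ("C", 5, 72000),
    ("D", 5, 79500), ("E", 5, 81000), ("F", 3, 23500),
    ("G", 4, 42000), ("H", 4, 43500), ("I", 4, 48000),
]


def average_salary_by_level(rows, min_group_size=2):
    """Average salary per JL, suppressing groups with fewer than min_group_size records."""
    groups = defaultdict(list)
    for _employee, level, salary in rows:
        groups[level].append(salary)
    return {
        level: sum(salaries) // len(salaries)  # truncated to a whole number
        for level, salaries in sorted(groups.items())
        if len(salaries) >= min_group_size
    }


aggregated = average_salary_by_level(employees)
print(aggregated)  # {3: 25166, 4: 44500, 5: 77500}
```

A level with a single employee would be left out of the result, because its "average" would simply reveal that employee's salary.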






Risk measurement
Any anonymization process should have a risk threshold determined beforehand and should ensure that the risk in the anonymized data does not exceed that threshold. We discuss the most popular measure of risk here.
K-Anonymity
Description: K-anonymity is often thought of as an anonymization technique, but it is rather a measure of risk used to check that the risk threshold is satisfied. K-anonymity ensures that the identifying attribute values of every record are shared by at least k-1 other records.
How it works: Decide on a value of k. After the anonymization techniques have been applied to the dataset, check that each record has at least k-1 other records with the same identifying attribute values. In other words, every equivalence class must contain at least k records. One important thing to note is that as the value of k increases, the utility of the dataset decreases.
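
The check described above can be sketched by counting the size of each equivalence class. The sample rows, the choice of Age and Trainer as indirect identifiers, and the function names are assumptions for illustration.

```python
from collections import Counter


def smallest_equivalence_class(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values(), default=0)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    return smallest_equivalence_class(rows, quasi_identifiers) >= k


# Hypothetical anonymized rows; "Age" and "Trainer" are treated as indirect identifiers.
rows = [
    {"Age": "25-29", "Trainer": "Mary", "Test Score": 84},
    {"Age": "25-29", "Trainer": "Mary", "Test Score": 73},
    {"Age": "30-34", "Trainer": "Cena", "Test Score": 68},
    {"Age": "30-34", "Trainer": "Cena", "Test Score": 91},
    {"Age": "30-34", "Trainer": "Cena", "Test Score": 77},
]

print(smallest_equivalence_class(rows, ["Age", "Trainer"]))  # 2
print(is_k_anonymous(rows, ["Age", "Trainer"], k=3))  # False
```

If the check fails for the chosen k, the usual options are to generalize further or to suppress the records in the undersized classes, then check again.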
Differential privacy is another privacy measure that has recently become popular among people performing data anonymization tasks.
