Definition
Data Anonymization is the conversion of personally identifiable data into anonymized data by applying anonymization techniques. Typically, Data Anonymization is an irreversible process, meaning that no transformation of the anonymized data can bring back the original data. In an anonymized dataset, the risk of re-identification should be negligible.
Factors deciding Anonymization techniques
The anonymization technique applied varies with the kind of data, for example streaming data versus static data. Some of the factors taken into account while deciding on a technique are:
- Nature and type of data
- Risk of re-identification of data
- Risk-Utility trade off
Terminologies
Some of the most common terms one has to be aware of to understand Data Anonymization are:
- Adversary – A party attempting to re-identify individuals' data from a dataset that is supposed to be anonymized.
- Direct Identifier – A data attribute that on its own can be used to identify an individual (e.g. a fingerprint).
- Indirect Identifier – Also known as a quasi-identifier; on its own it does not identify an individual, but combined with other information/attributes it may help in identifying an individual.
- Equivalence class – Records in a dataset that share the same values for certain attributes.
- Non-Identifier – Attributes which are neither direct nor indirect identifiers. Such attributes need not undergo anonymization.
- Pseudonymization – The technique of replacing an identifier with an unrelated yet typically unique value (e.g. replacing 'Joshua' with 45896).
Things to remember before Applying Anonymization
Purpose of Anonymization and Utility: Anonymization should be tailored to the purpose at hand, because the process of anonymization, regardless of the technique applied, reduces the utility of the dataset. If an adversary learns which technique was applied and at what level of granularity, that knowledge might help them understand the data better.
Characteristics of Anonymization technique: Different anonymization techniques modify the data in different ways, depending on the nature of the data under anonymization. For example, techniques like Data Perturbation work well for data that is continuous in nature.
- Character masking – Modifies only part of an attribute and is prone to revealing the actual length of the data, which helps in re-identification.
- Aggregation – Replaces the values of an attribute across multiple records with a summary value.
- Pseudonymization – Replaces the entire attribute value with unrelated but consistent information.
- Suppression – Removes the attribute entirely.
Subject matter expert: The presence of an expert on the data at hand helps an organization anonymize the data in a way that still makes sense.
Expert in Anonymization process: Anonymization is complex, and hence the anonymization process should be undertaken by people well-versed in anonymization techniques and principles.
Tools: Due to the complexity and computation required, software tools like ARX and Wizuda are useful in executing anonymization techniques.
Disclosure Risks
There are three kinds of disclosure risk that are generally seen when anonymizing data.
Identity Disclosure: A weak pseudonymization scheme may be reversed, for example one that replaces '1's with 'a' and '2's with 'b', allowing an adversary to guess the identity. Insufficient anonymization and re-identification by linking can also lead to identity disclosure.
Attribute Disclosure: Learning something about a specific individual even if that individual's record cannot be distinguished. For example, an anonymized dataset of doctor A's patients reveals that all patients below the age of 30 have undergone surgery before. With this information, we know that any individual below 30 who is a patient of doctor A has likely undergone surgery, even though that individual's record cannot be distinguished from the others in the anonymized dataset.
Inference Disclosure: Using the statistical properties of the dataset to make guesses about an individual, even one who is not part of that dataset.
Anonymization Techniques
There are many ways of anonymizing a dataset; let us discuss some of the most popular techniques here.
ATTRIBUTE SUPPRESSION
Description: The removal of an entire part of the data (a column) from a dataset.
When to use: When an attribute is not required in the anonymized dataset. Mostly applied at the start of an anonymization process.
How to apply: Delete the attribute permanently, not just by hiding it. You may also derive a new attribute from the original attribute and discard the original.
Example:
Before Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
John    | Mary    | 84
James   | Mary    | 73
Melina  | Cena    | 68

After Anonymization

Trainer | Test Score
------- | ----------
Mary    | 84
Mary    | 73
Cena    | 68
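A minimal sketch of attribute suppression in Python with pandas; the DataFrame and column names simply mirror the example above and are purely illustrative:

```python
import pandas as pd

# Illustrative dataset mirroring the example above.
df = pd.DataFrame({
    "Student": ["John", "James", "Melina"],
    "Trainer": ["Mary", "Mary", "Cena"],
    "Test Score": [84, 73, 68],
})

# Attribute suppression: permanently drop the direct identifier.
anonymized = df.drop(columns=["Student"])
print(anonymized)
```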
RECORD SUPPRESSION
Description: Removal of an entire row/record from a dataset, which affects multiple attributes at the same time.
When to use: When we want to remove outliers.
How to apply: Delete the entire record permanently, not just by hiding it.
Example:
Before Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
1234    | Mary    | 84
5988    | Mary    | 73
4321    | Cena    | 68

After Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
1234    | Mary    | 84
5988    | Mary    | 73
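A minimal sketch of record suppression in pandas, under the illustrative assumption that a record whose "Trainer" value appears only once is treated as an outlier:

```python
import pandas as pd

df = pd.DataFrame({
    "Student": [1234, 5988, 4321],
    "Trainer": ["Mary", "Mary", "Cena"],
    "Test Score": [84, 73, 68],
})

# Record suppression: drop outlier records entirely, here the single
# record whose "Trainer" value forms an equivalence class of size 1.
counts = df["Trainer"].value_counts()
rare = counts[counts < 2].index
anonymized = df[~df["Trainer"].isin(rare)].reset_index(drop=True)
print(anonymized)
```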
CHARACTER MASKING
Description: Changing some characters of a data value to a constant symbol such as "x" or "*". It is typically applied partially and does not hide the length of the data, which poses a risk of re-identification.
When to use: When the data is a string and is a direct identifier.
How to apply: Replace some of the characters of the data with a chosen symbol.
Example:
Before Anonymization

Name  | Age | Phone
----- | --- | ----------
Rahul | 29  | 9790858141
Ravan | 35  | 8880044214

After Anonymization

Name  | Age | Phone
----- | --- | ----------
Rahul | 29  | 9xxxxxxxxx
Ravan | 35  | 8xxxxxxxxx
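A minimal masking helper in Python; the function name and the choice to keep only the first character are illustrative assumptions:

```python
def mask_value(value: str, keep: int = 1, symbol: str = "x") -> str:
    # Keep the first `keep` characters and mask the rest. Note that
    # the masked string still reveals the original length, which is
    # exactly the re-identification caveat mentioned above.
    return value[:keep] + symbol * (len(value) - keep)

print(mask_value("9790858141"))  # 9xxxxxxxxx
```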
PSEUDONYMISATION
Description: Replacement of data with made-up pseudonymous values, typically in an irreversible way. Also referred to as coding. The replacement can be random or consistent.
When to use: When data values need to be uniquely distinguished or when no information about the original data should be revealed.
How to apply: (i) Generate a list of random values and randomly select a value from the list as a replacement. Make sure that the generated values and the original data have no relationship, to avoid the risk of re-identification.
(ii) By using encryption.
(iii) By using format-preserving encryption, where the replacement pseudonym has the same format as the original data.
Example:
Before Anonymization

Person  | Age | Rank
------- | --- | ----
Rahul   | 29  | 4
Ravan   | 35  | 3
Kaushik | 22  | 1
Jaithri | 20  | 2

After Anonymization

Person | Age | Rank
------ | --- | ----
423214 | 29  | 4
337461 | 35  | 3
425980 | 22  | 1
256812 | 20  | 2
For reversible pseudonyms we keep a secure identity database.
Identity Database (Single level decoding)

Pseudonym | Person
--------- | -------
423214    | Rahul
337461    | Ravan
425980    | Kaushik
256812    | Jaithri
For added security we can have double-level decoding by keeping a linking database with a trusted third party.
Linking Database

Pseudonym | Interim Pseudonym
--------- | -----------------
423214    | QXYACD
337461    | EIGLMK
425980    | PHTGKM
256812    | RJPOTM

Identity Database

Interim Pseudonym | Person
----------------- | -------
QXYACD            | Rahul
EIGLMK            | Ravan
PHTGKM            | Kaushik
RJPOTM            | Jaithri
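A minimal sketch of consistent random pseudonymization in Python; the helper name and the use of `secrets.token_hex` for generating pseudonyms are illustrative assumptions:

```python
import secrets

def pseudonymize(values):
    # Consistent replacement: the same original value always maps to
    # the same random, unrelated pseudonym. The mapping acts as the
    # identity database and must be stored securely (or discarded to
    # make the scheme irreversible).
    identity_db = {}
    out = []
    for v in values:
        if v not in identity_db:
            identity_db[v] = secrets.token_hex(4)
        out.append(identity_db[v])
    return out, identity_db

names = ["Rahul", "Ravan", "Kaushik", "Rahul"]
pseudonyms, identity_db = pseudonymize(names)
print(pseudonyms)  # 'Rahul' receives the same pseudonym both times
```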
GENERALISATION
Description: Generalizing an attribute value by deliberately reducing the precision of the data, while not losing the utility of the attribute.
When to use: When the attributes are needed for the purpose but can be anonymized by reducing their precision.
How to apply: By grouping values or assigning each value to a range. The ranges assigned should be neither too large nor too small.
Example:
Before Anonymization

Person | Age | Address
------ | --- | ------------------------
Rahul  | 29  | #44, Raman St, Chennai
Ravan  | 31  | #33, Haddows Rd, Chennai
Raj    | 28  | #19, Ram St, Chennai

After Anonymization

Person | Age   | Address
------ | ----- | ------------------------
Rahul  | 25-30 | #44, Raman St, Chennai
Ravan  | 30-35 | #33, Haddows Rd, Chennai
Raj    | 25-30 | #19, Ram St, Chennai
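A minimal sketch of generalisation with pandas, binning exact ages into 5-year bands as in the example (the bin edges are illustrative):

```python
import pandas as pd

ages = pd.Series([29, 31, 28], name="Age")

# Generalisation: reduce precision by mapping exact ages to ranges.
# The band width trades re-identification risk against utility.
bands = pd.cut(ages, bins=[25, 30, 35, 40], right=False,
               labels=["25-30", "30-35", "35-40"])
print(bands.tolist())  # ['25-30', '30-35', '25-30']
```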
SWAPPING / SHUFFLING / PERMUTATION
Description: Swapping the data of an attribute within a dataset such that the individual data values are still present but are no longer linked to their original records.
When to use: When the analysis does not depend on the relationship between attributes at the record level.
How to apply: Identify the attribute to be swapped, then swap its values among the records.
Example:
Before Anonymization

Person | Age | Phone
------ | --- | ----------
Rahul  | 25  | 7837561042
Ravan  | 31  | 9333276014
Raj    | 33  | 9790858141
Nisha  | 29  | 8877654310

After Anonymization

Person | Age | Phone
------ | --- | ----------
Rahul  | 29  | 9790858141
Ravan  | 33  | 8877654310
Raj    | 31  | 7837561042
Nisha  | 25  | 9333276014
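A minimal sketch of column shuffling with NumPy and pandas; shuffling each column independently also breaks the Age-Phone linkage (the fixed seed is only for reproducibility):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Person": ["Rahul", "Ravan", "Raj", "Nisha"],
    "Age": [25, 31, 33, 29],
    "Phone": ["7837561042", "9333276014", "9790858141", "8877654310"],
})

# Shuffle each sensitive column independently so the values survive
# but are no longer linked to their original records.
rng = np.random.default_rng(seed=42)
for col in ["Age", "Phone"]:
    df[col] = rng.permutation(df[col].to_numpy())
print(df)
```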
DATA PERTURBATION
Description: The data values in the dataset are modified slightly so that they look different. It is typically applied to numbers and dates. As with the Generalization technique, the degree of perturbation should be neither too small nor too large. This technique should not be employed in cases where utility is crucial.
When to use: When slight changes in data values are acceptable and data accuracy is not crucial.
How to apply: (i) By rounding to the nearest base-x.
(ii) By adding some random noise.
Example:
Before Anonymization

Person | Weight | Height
------ | ------ | ------
A      | 51     | 173
B      | 64     | 164
C      | 75     | 155

After Anonymization

Person | Weight | Height
------ | ------ | ------
A      | 50     | 170
B      | 65     | 160
C      | 75     | 160

Weight -> rounded to base 5
Height -> rounded to base 10
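A minimal sketch of base-x rounding in Python with NumPy (the helper name is illustrative):

```python
import numpy as np

def round_to_base(values, base):
    # Perturb numeric values by rounding each to the nearest
    # multiple of `base`.
    return (np.round(np.asarray(values) / base) * base).astype(int)

print(round_to_base([51, 64, 75], base=5))      # [50 65 75]
print(round_to_base([173, 164, 155], base=10))  # [170 160 160]
```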
SYNTHETIC DATA / DUMMY DATA
Description: Generating synthetic data by studying the patterns of the original dataset, without modifying the original dataset itself.
When to use: When the original dataset at hand does not suffice for the analysis, or when a large amount of data is required.
How to apply: By studying the patterns available in the original dataset and then generating synthetic data from them. The utility of such a dataset decreases because it is generated from a pre-conceived model.
Example:
Original Dataset

Customer | Date  | Purch. Amt
-------- | ----- | ----------
A        | 1-APR | 200
A        | 2-APR | 350
B        | 1-APR | 400
B        | 2-APR | 375
C        | 1-APR | 300
C        | 2-APR | 450
...      | ...   | ...

Statistics obtained from data

Purch. Amt Range | #Customers
---------------- | ----------
200-250          | 100
150-200          | 175
250-300          | 98
350-400          | 73
...              | ...

Synthetic Data (1-Day)

Customer | Date  | Purch. Amt
-------- | ----- | ----------
1001     | 1-APR | 178
1002     | 1-APR | 155
1003     | 1-APR | 74
1004     | 1-APR | 105
...      | ...   | ...
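A minimal sketch of generating synthetic records from summary statistics like those above; the range-to-count mapping and the uniform sampling within each range are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Illustrative statistics from the original data:
# (low, high) purchase-amount range -> number of customers in it.
stats = {(150, 200): 175, (200, 250): 100, (250, 300): 98, (350, 400): 73}

rows = []
customer_id = 1001
for (low, high), n in stats.items():
    # Sample n synthetic purchase amounts uniformly within the range.
    for amount in rng.integers(low, high, size=n):
        rows.append({"Customer": customer_id, "Date": "1-APR",
                     "Purch. Amt": int(amount)})
        customer_id += 1

synthetic = pd.DataFrame(rows)
print(synthetic.head())
```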
DATA AGGREGATION
Description: Summarizing the attribute values by micro-aggregation.
When to use: When individual data values are not required and aggregated data is sufficient.
How to apply: (i) By grouping and aggregating.
(ii) By average, median, mode, etc.
When a group contains only a few records, suppress it.
Example:
Before Anonymization

Employee | JL  | Salary
-------- | --- | ------
A        | 3   | 25000
B        | 3   | 27000
C        | 5   | 72000
D        | 5   | 79500
E        | 5   | 81000
F        | 3   | 23500
G        | 4   | 42000
H        | 4   | 43500
I        | 4   | 48000
...      | ... | ...

After Anonymization

JL | Avg. Salary
-- | -----------
3  | 25166
4  | 44500
5  | 77500
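A minimal sketch of aggregation with a small-group suppression check in pandas; the minimum group size of 3 is an illustrative threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": list("ABCDEFGHI"),
    "JL": [3, 3, 5, 5, 5, 3, 4, 4, 4],
    "Salary": [25000, 27000, 72000, 79500, 81000,
               23500, 42000, 43500, 48000],
})

MIN_GROUP_SIZE = 3  # assumed suppression threshold

# Aggregate salaries per job level, then suppress small groups.
grouped = df.groupby("JL")["Salary"].agg(["mean", "size"])
grouped = grouped[grouped["size"] >= MIN_GROUP_SIZE]
print(grouped["mean"].astype(int).rename("Avg. Salary"))
```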
Risk measurement
Any anonymization process should have a risk threshold determined in advance and should ensure that the anonymized data does not exceed that threshold. We will discuss the most popular measure of risk here.
K-Anonymity
Description: K-Anonymity is often thought of as an anonymization technique, but it is rather a measure of risk used to ensure that the risk threshold is satisfied. K-anonymity ensures that any record's combination of direct or indirect identifier values is shared by at least k-1 other records.
How it works: Decide on a value of k. After anonymization techniques have been applied to the dataset, check that each record has at least k-1 other records with the same identifying attribute values; in other words, every equivalence class must have a size of at least k. One important thing to note is that as the value of k increases, the utility of the dataset decreases.
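A minimal k-anonymity check in pandas; the function name and the quasi-identifier columns are illustrative:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # Every combination of quasi-identifier values must be shared by
    # at least k records, i.e. each record has k-1 "twins".
    class_sizes = df.groupby(quasi_identifiers).size()
    return bool((class_sizes >= k).all())

df = pd.DataFrame({"Age": ["25-30", "30-35", "25-30"],
                   "City": ["Chennai", "Chennai", "Chennai"]})
print(is_k_anonymous(df, ["Age", "City"], k=2))  # False: '30-35' is unique
```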
Differential Privacy is another measure of risk that has recently become popular among people performing Data Anonymization tasks.