Definition
Data Anonymization is the conversion of personally identifiable data into anonymized data by applying anonymization techniques. Typically, Data Anonymization is an irreversible process, meaning that no transformation of the anonymized data can bring back the original data. In an anonymized dataset, the risk of re-identification should be negligible.
Factors deciding Anonymization techniques
The anonymization technique applied varies with the kind of data, for example streaming data versus static data. Some of the factors taken into account while deciding on a technique are:
- Nature and type of data
- Risk of re-identification of data
- Risk-Utility trade off
Terminologies
Some of the most common terms one has to be aware of to understand Data Anonymization are:
- Adversary – A party attempting to re-identify individuals' data from a dataset that is supposed to be anonymized.
- Direct Identifier – A data attribute that on its own can be used to identify an individual (e.g. a fingerprint).
- Indirect Identifier – Also known as a quasi-identifier; on its own it does not identify an individual, but combined with other information/attributes it may help in identifying an individual.
- Equivalence class – Records in a dataset that share the same values for certain attributes.
- Non-Identifier – Attributes which are neither direct nor indirect identifiers. Such attributes need not undergo anonymization.
- Pseudonymization – The technique of replacing an identifier with an unrelated yet typically unique value (e.g. replacing 'Joshua' with 45896).
Things to remember before Applying Anonymization
Purpose of Anonymization and Utility: Anonymization should be tailored to the purpose at hand, because the process of anonymization, regardless of the technique applied, reduces the utility of the dataset. If an adversary learns which technique was applied and at what level of granularity, that knowledge might help them understand the data better.
Characteristics of Anonymization technique: Different anonymization techniques modify the data in different ways, depending on the nature of the data under anonymization. For example, techniques like Data Perturbation work well for data that is continuous in nature.
- Character masking – Modifies only part of an attribute and is prone to revealing the actual length of the data, which helps in re-identification.
- Aggregation – Replaces the values of an attribute across multiple records with a summary value.
- Pseudonymization – Replaces the entire attribute value with unrelated but consistent information.
- Suppression – Removes the attribute entirely.
Subject matter expert: The presence of an expert on the data at hand helps an organization anonymize the data in a way that still makes sense.
Expert in Anonymization process: Anonymization is complex, and hence the anonymization process should be undertaken by people well-versed in anonymization techniques and principles.
Tools: Due to the complexity and computation required, software tools like ARX and Wizuda are useful in executing anonymization techniques.
Disclosure Risks
There are three kinds of disclosure risk that are generally seen when anonymizing data.
Identity Disclosure: A weak pseudonymization scheme may be reversed, for example one that replaces '1's with 'a' and '2's with 'b', allowing an adversary to guess the identity. Insufficient anonymization and re-identification by linking can also lead to identity disclosure.
Attribute Disclosure: Learning something about a specific individual even if that individual's record cannot be distinguished. For example, an anonymized dataset of doctor A's patients reveals that all patients below the age of 30 have undergone surgery before. With this information, we know that any individual below 30 who is a patient of doctor A has likely undergone surgery, even though that individual's record cannot be distinguished from the others in the anonymized dataset.
Inference Disclosure: Using the statistical properties of the dataset to make guesses about an individual, even one who is not part of that dataset.
Anonymization Techniques
There are many ways of anonymizing a dataset; let us discuss some of the most popular techniques here.
ATTRIBUTE SUPPRESSION
Description: The removal of an entire part of the data (a column) from a dataset.
When to use: When an attribute is not required in the anonymized dataset. Mostly applied at the start of an anonymization process.
How to apply: Delete the attribute permanently, not just by hiding it. You may also derive a new attribute from the original attribute and discard the original.
Example:
Before Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
John    | Mary    | 84
James   | Mary    | 73
Melina  | Cena    | 68

After Anonymization

Trainer | Test Score
------- | ----------
Mary    | 84
Mary    | 73
Cena    | 68
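A minimal sketch of attribute suppression in Python with pandas; the DataFrame and column names simply mirror the example above and are purely illustrative:

```python
import pandas as pd

# Illustrative dataset mirroring the example above.
df = pd.DataFrame({
    "Student": ["John", "James", "Melina"],
    "Trainer": ["Mary", "Mary", "Cena"],
    "Test Score": [84, 73, 68],
})

# Attribute suppression: permanently drop the direct identifier.
anonymized = df.drop(columns=["Student"])
print(anonymized)
```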
RECORD SUPPRESSION
Description: Removal of an entire row/record from a dataset, which affects multiple attributes at the same time.
When to use: When we want to remove outliers.
How to apply: Delete the entire record permanently, not just by hiding it.
Example:
Before Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
1234    | Mary    | 84
5988    | Mary    | 73
4321    | Cena    | 68

After Anonymization

Student | Trainer | Test Score
------- | ------- | ----------
1234    | Mary    | 84
5988    | Mary    | 73
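A minimal sketch of record suppression in pandas, under the illustrative assumption that a record whose "Trainer" value appears only once is treated as an outlier:

```python
import pandas as pd

df = pd.DataFrame({
    "Student": [1234, 5988, 4321],
    "Trainer": ["Mary", "Mary", "Cena"],
    "Test Score": [84, 73, 68],
})

# Record suppression: drop outlier records entirely, here the single
# record whose "Trainer" value forms an equivalence class of size 1.
counts = df["Trainer"].value_counts()
rare = counts[counts < 2].index
anonymized = df[~df["Trainer"].isin(rare)].reset_index(drop=True)
print(anonymized)
```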
CHARACTER MASKING
Description: Changing some characters of a data value to a constant symbol such as "x" or "*". It is typically applied partially and does not hide the length of the data, which poses a risk of re-identification.
When to use: When the data is a string and is a direct identifier.
How to apply: Replace some of the characters of the data with a chosen symbol.
Example:
Before Anonymization

Name  | Age | Phone
----- | --- | ----------
Rahul | 29  | 9790858141
Ravan | 35  | 8880044214

After Anonymization

Name  | Age | Phone
----- | --- | ----------
Rahul | 29  | 9xxxxxxxxx
Ravan | 35  | 8xxxxxxxxx
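A minimal masking helper in Python; the function name and the choice to keep only the first character are illustrative assumptions:

```python
def mask_value(value: str, keep: int = 1, symbol: str = "x") -> str:
    # Keep the first `keep` characters and mask the rest. Note that
    # the masked string still reveals the original length, which is
    # exactly the re-identification caveat mentioned above.
    return value[:keep] + symbol * (len(value) - keep)

print(mask_value("9790858141"))  # 9xxxxxxxxx
```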
PSEUDONYMISATION
Description: Replacement of data with made-up pseudonymous values, typically in an irreversible way. Also referred to as coding. The replacement can be random or consistent.
When to use: When data values need to be uniquely distinguished or when no information about the original data should be revealed.
How to apply: (i) Generate a list of random values and randomly select a value from the list as a replacement. Make sure that the generated values and the original data have no relationship, to avoid the risk of re-identification.
(ii) By using encryption.
(iii) By using format-preserving encryption, where the replacement pseudonym has the same format as the original data.
Example:
Before Anonymization

Person  | Age | Rank
------- | --- | ----
Rahul   | 29  | 4
Ravan   | 35  | 3
Kaushik | 22  | 1
Jaithri | 20  | 2

After Anonymization

Person | Age | Rank
------ | --- | ----
423214 | 29  | 4
337461 | 35  | 3
425980 | 22  | 1
256812 | 20  | 2
For reversible pseudonyms we keep a secure identity database.
Identity Database (Single level decoding)

Pseudonym | Person
--------- | -------
423214    | Rahul
337461    | Ravan
425980    | Kaushik
256812    | Jaithri
For added security we can have double-level decoding by keeping a linking database with a trusted third party.
Linking Database

Pseudonym | Interim Pseudonym
--------- | -----------------
423214    | QXYACD
337461    | EIGLMK
425980    | PHTGKM
256812    | RJPOTM

Identity Database

Interim Pseudonym | Person
----------------- | -------
QXYACD            | Rahul
EIGLMK            | Ravan
PHTGKM            | Kaushik
RJPOTM            | Jaithri
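A minimal sketch of consistent random pseudonymization in Python; the helper name and the use of `secrets.token_hex` for generating pseudonyms are illustrative assumptions:

```python
import secrets

def pseudonymize(values):
    # Consistent replacement: the same original value always maps to
    # the same random, unrelated pseudonym. The mapping acts as the
    # identity database and must be stored securely (or discarded to
    # make the scheme irreversible).
    identity_db = {}
    out = []
    for v in values:
        if v not in identity_db:
            identity_db[v] = secrets.token_hex(4)
        out.append(identity_db[v])
    return out, identity_db

names = ["Rahul", "Ravan", "Kaushik", "Rahul"]
pseudonyms, identity_db = pseudonymize(names)
print(pseudonyms)  # 'Rahul' receives the same pseudonym both times
```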
GENERALISATION
Description: Generalizing an attribute value by deliberately reducing the precision of the data, while not losing the utility of the attribute.
When to use: When the attributes are needed for the purpose but can be anonymized by reducing their precision.
How to apply: By grouping values or assigning each value to a range. The ranges assigned should be neither too large nor too small.
Example:
Before Anonymization

Person | Age | Address
------ | --- | ------------------------
Rahul  | 29  | #44, Raman St, Chennai
Ravan  | 31  | #33, Haddows Rd, Chennai
Raj    | 28  | #19, Ram St, Chennai

After Anonymization

Person | Age   | Address
------ | ----- | ------------------------
Rahul  | 25-30 | #44, Raman St, Chennai
Ravan  | 30-35 | #33, Haddows Rd, Chennai
Raj    | 25-30 | #19, Ram St, Chennai
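A minimal sketch of generalisation with pandas, binning exact ages into 5-year bands as in the example (the bin edges are illustrative):

```python
import pandas as pd

ages = pd.Series([29, 31, 28], name="Age")

# Generalisation: reduce precision by mapping exact ages to ranges.
# The band width trades re-identification risk against utility.
bands = pd.cut(ages, bins=[25, 30, 35, 40], right=False,
               labels=["25-30", "30-35", "35-40"])
print(bands.tolist())  # ['25-30', '30-35', '25-30']
```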
SWAPPING / SHUFFLING / PERMUTATION
Description: Swapping the data of an attribute within a dataset such that the individual data values are still present but are no longer linked to their original records.
When to use: When the analysis does not depend on the relationship between attributes at the record level.
How to apply: Identify the attribute to be swapped, then swap its values among the records.
Example:
Before Anonymization

Person | Age | Phone
------ | --- | ----------
Rahul  | 25  | 7837561042
Ravan  | 31  | 9333276014
Raj    | 33  | 9790858141
Nisha  | 29  | 8877654310

After Anonymization

Person | Age | Phone
------ | --- | ----------
Rahul  | 29  | 9790858141
Ravan  | 33  | 8877654310
Raj    | 31  | 7837561042
Nisha  | 25  | 9333276014
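A minimal sketch of column shuffling with NumPy and pandas; shuffling each column independently also breaks the Age-Phone linkage (the fixed seed is only for reproducibility):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Person": ["Rahul", "Ravan", "Raj", "Nisha"],
    "Age": [25, 31, 33, 29],
    "Phone": ["7837561042", "9333276014", "9790858141", "8877654310"],
})

# Shuffle each sensitive column independently so the values survive
# but are no longer linked to their original records.
rng = np.random.default_rng(seed=42)
for col in ["Age", "Phone"]:
    df[col] = rng.permutation(df[col].to_numpy())
print(df)
```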
DATA PERTURBATION
Description: The data values in the dataset are modified slightly so that they look different. It is typically applied to numbers and dates. As with the Generalization technique, the degree of perturbation should be neither too small nor too large. This technique should not be employed in cases where utility is crucial.
When to use: When slight changes in data values are acceptable and data accuracy is not crucial.
How to apply: (i) By rounding to the nearest base-x.
(ii) By adding some random noise.
Example:
Before Anonymization

Person | Weight | Height
------ | ------ | ------
A      | 51     | 173
B      | 64     | 164
C      | 75     | 155

After Anonymization

Person | Weight | Height
------ | ------ | ------
A      | 50     | 170
B      | 65     | 160
C      | 75     | 160

Weight -> rounded to base 5
Height -> rounded to base 10
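A minimal sketch of base-x rounding in Python with NumPy (the helper name is illustrative):

```python
import numpy as np

def round_to_base(values, base):
    # Perturb numeric values by rounding each to the nearest
    # multiple of `base`.
    return (np.round(np.asarray(values) / base) * base).astype(int)

print(round_to_base([51, 64, 75], base=5))      # [50 65 75]
print(round_to_base([173, 164, 155], base=10))  # [170 160 160]
```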
SYNTHETIC DATA / DUMMY DATA
Description: Generating synthetic data by studying the patterns of the original dataset, without modifying the original dataset itself.
When to use: When the original dataset at hand does not suffice for the analysis, or when a large amount of data is required.
How to apply: By studying the patterns available in the original dataset and then generating synthetic data from them. The utility of such a dataset decreases because it is generated from a pre-conceived model.
Example:
Original Dataset

Customer | Date  | Purch. Amt
-------- | ----- | ----------
A        | 1-APR | 200
A        | 2-APR | 350
B        | 1-APR | 400
B        | 2-APR | 375
C        | 1-APR | 300
C        | 2-APR | 450
...      | ...   | ...

Statistics obtained from data

Purch. Amt Range | #Customers
---------------- | ----------
200-250          | 100
150-200          | 175
250-300          | 98
350-400          | 73
...              | ...

Synthetic Data (1-Day)

Customer | Date  | Purch. Amt
-------- | ----- | ----------
1001     | 1-APR | 178
1002     | 1-APR | 155
1003     | 1-APR | 74
1004     | 1-APR | 105
...      | ...   | ...
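A minimal sketch of generating synthetic records from summary statistics like those above; the range-to-count mapping and the uniform sampling within each range are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Illustrative statistics from the original data:
# (low, high) purchase-amount range -> number of customers in it.
stats = {(150, 200): 175, (200, 250): 100, (250, 300): 98, (350, 400): 73}

rows = []
customer_id = 1001
for (low, high), n in stats.items():
    # Sample n synthetic purchase amounts uniformly within the range.
    for amount in rng.integers(low, high, size=n):
        rows.append({"Customer": customer_id, "Date": "1-APR",
                     "Purch. Amt": int(amount)})
        customer_id += 1

synthetic = pd.DataFrame(rows)
print(synthetic.head())
```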
DATA AGGREGATION
Description: Summarizing the attribute values by micro-aggregation.
When to use: When individual data values are not required and aggregated data is sufficient.
How to apply: (i) By grouping and aggregating.
(ii) By average, median, mode, etc.
When a group contains only a few records, suppress it.
Example:
Before Anonymization

Employee | JL  | Salary
-------- | --- | ------
A        | 3   | 25000
B        | 3   | 27000
C        | 5   | 72000
D        | 5   | 79500
E        | 5   | 81000
F        | 3   | 23500
G        | 4   | 42000
H        | 4   | 43500
I        | 4   | 48000
...      | ... | ...

After Anonymization

JL | Avg. Salary
-- | -----------
3  | 25166
4  | 44500
5  | 77500
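A minimal sketch of aggregation with a small-group suppression check in pandas; the minimum group size of 3 is an illustrative threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": list("ABCDEFGHI"),
    "JL": [3, 3, 5, 5, 5, 3, 4, 4, 4],
    "Salary": [25000, 27000, 72000, 79500, 81000,
               23500, 42000, 43500, 48000],
})

MIN_GROUP_SIZE = 3  # assumed suppression threshold

# Aggregate salaries per job level, then suppress small groups.
grouped = df.groupby("JL")["Salary"].agg(["mean", "size"])
grouped = grouped[grouped["size"] >= MIN_GROUP_SIZE]
print(grouped["mean"].astype(int).rename("Avg. Salary"))
```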
Risk measurement
Any anonymization process should have a risk threshold determined in advance and should ensure that the anonymized data does not exceed that threshold. We will discuss the most popular measure of risk here.
K-Anonymity
Description: K-Anonymity is often thought of as an anonymization technique, but it is rather a measure of risk used to ensure that the risk threshold is satisfied. K-anonymity ensures that any record's combination of direct or indirect identifier values is shared by at least k-1 other records.
How it works: Decide on a value of k. After anonymization techniques have been applied to the dataset, check that each record has at least k-1 other records with the same identifying attribute values; in other words, every equivalence class must have a size of at least k. One important thing to note is that as the value of k increases, the utility of the dataset decreases.
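A minimal k-anonymity check in pandas; the function name and the quasi-identifier columns are illustrative:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # Every combination of quasi-identifier values must be shared by
    # at least k records, i.e. each record has k-1 "twins".
    class_sizes = df.groupby(quasi_identifiers).size()
    return bool((class_sizes >= k).all())

df = pd.DataFrame({"Age": ["25-30", "30-35", "25-30"],
                   "City": ["Chennai", "Chennai", "Chennai"]})
print(is_k_anonymous(df, ["Age", "City"], k=2))  # False: '30-35' is unique
```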
Differential Privacy is another measure of risk that has recently become popular among people performing Data Anonymization tasks.