Pseudonymization
Pseudonymisation is a procedure by which the most identifying fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. There can be a single pseudonym for a collection of replaced fields or a pseudonym per replaced field. The purpose is to render the data record less identifying and thereby reduce customer or patient objections to its use. Data in this form is suitable for extensive analytics and processing.
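The two replacement styles described above (one pseudonym per field, or one pseudonym for a group of fields) can be sketched as follows. This is a minimal illustration, not a prescribed method: the article does not say how pseudonyms are generated, so the keyed hash (HMAC-SHA256), the secret key, the field names and the sample record are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical secret held by the data custodian (an assumption, not from the article).
SECRET_KEY = b"example-key-held-by-the-data-custodian"


def make_pseudonym(value: str) -> str:
    """Derive a stable artificial identifier from a value using a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymise_per_field(record: dict, fields: list) -> dict:
    """Replace each listed field with its own pseudonym."""
    out = dict(record)
    for name in fields:
        # The field name is mixed in so equal values in different fields differ.
        out[name] = make_pseudonym(f"{name}={record[name]}")
    return out


def pseudonymise_combined(record: dict, fields: list) -> dict:
    """Drop the listed fields and add one shared pseudonym for all of them."""
    out = {k: v for k, v in record.items() if k not in fields}
    joined = "|".join(f"{name}={record[name]}" for name in fields)
    out["pseudonym"] = make_pseudonym(joined)
    return out


# Invented sample record for illustration only.
record = {"nhs_number": "0000000000", "name": "Example Person", "diagnosis": "asthma"}
per_field = pseudonymise_per_field(record, ["nhs_number", "name"])
combined = pseudonymise_combined(record, ["nhs_number", "name"])
```

In both variants the non-identifying field (`diagnosis`) is left untouched, so the record remains usable for analysis. Because the pseudonym is derived deterministically, the same person receives the same pseudonym across records, which is what makes longitudinal analysis possible.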
The choice of which data fields to pseudonymise is partly subjective, but should include all fields that are highly selective, such as the NHS number in the UK. Less selective fields, such as Birth Date or Postal Code, are often also included because they are usually available from other sources and therefore make a record easier to identify. Pseudonymising these less identifying fields removes most of their analytic value and should therefore be accompanied by the introduction of new derived and less identifying forms, such as Year of Birth or a larger Postal Code region.
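The derived forms mentioned above can be sketched as a small generalisation step. This is an illustrative sketch only: the field names are hypothetical, and the postal region is assumed to be the outward part of a UK-style postcode written with a space (for example "SW1A" from "SW1A 1AA"), which the article does not specify.

```python
from datetime import date


def year_of_birth(birth_date: date) -> int:
    """Reduce a full birth date to the year only."""
    return birth_date.year


def postal_region(postal_code: str) -> str:
    """Keep only the first part of a space-separated postcode (assumed format)."""
    return postal_code.strip().upper().split()[0]


def generalise(record: dict) -> dict:
    """Replace the precise fields with their coarser derived forms."""
    out = {k: v for k, v in record.items() if k not in ("birth_date", "postal_code")}
    out["year_of_birth"] = year_of_birth(record["birth_date"])
    out["postal_region"] = postal_region(record["postal_code"])
    return out


# Invented sample record for illustration only.
record = {"pseudonym": "a1b2c3", "birth_date": date(1980, 5, 17), "postal_code": "sw1a 1aa"}
generalised = generalise(record)
```

The coarser fields keep some analytic value (age bands, regional comparisons) while being shared by many more people than the exact values.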
Data fields that are less identifying, such as Date of Attendance, are usually not pseudonymised. It is important to realise that this is because too much statistical utility is lost in doing so, not because the data cannot be identified. For example, given prior knowledge of a few attendance dates, it is easy to identify someone's data in a pseudonymised dataset by selecting only those people with that pattern of dates. This is an example of an inference attack.
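The attendance-date example above can be made concrete with a toy dataset. The pseudonyms and dates below are invented for illustration; the sketch only shows how each additional known date narrows the set of candidate records.

```python
from datetime import date

# Invented pseudonymised attendance data: pseudonym -> set of attendance dates.
attendances = {
    "p01": {date(2020, 1, 6), date(2020, 2, 14), date(2020, 3, 2)},
    "p02": {date(2020, 1, 6), date(2020, 2, 20)},
    "p03": {date(2020, 2, 14), date(2020, 3, 2), date(2020, 4, 9)},
}


def matching_pseudonyms(attendances: dict, known_dates: list) -> list:
    """Return the pseudonyms whose attendance dates include every known date."""
    known = set(known_dates)
    return sorted(p for p, dates in attendances.items() if known <= dates)


# Knowing one date leaves two candidates; knowing two dates leaves one.
one_date = matching_pseudonyms(attendances, [date(2020, 2, 14)])
two_dates = matching_pseudonyms(attendances, [date(2020, 1, 6), date(2020, 2, 14)])
```

Here a single known date matches `p01` and `p03`, while two known dates match only `p01`, at which point every other field in that record is attributable to the person.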
The vulnerability of pseudonymised data to inference attacks is commonly overlooked. A famous example is the AOL search data scandal. This example illustrates that there is no universal way to protect pseudonymised data while still allowing general analysis of it.
Protecting statistically useful pseudonymised data from re-identification requires:
- a sound information security base
- controlling the risk that the analysts, researchers or other data workers cause a privacy breach
The pseudonym allows tracking back of data to its origins, which distinguishes pseudonymisation from anonymisation, where all person-related data that could allow backtracking has been purged. Pseudonymisation is relevant, for example, to patient-related data that has to be passed on securely between clinical centers.
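The ability to track a pseudonym back to its origin can be sketched with a lookup table kept by the data custodian. This is one possible arrangement, assumed for illustration: the class name, method names and the use of random tokens are hypothetical and not taken from the article.

```python
import secrets


class PseudonymTable:
    """Hypothetical custodian-held mapping between identifiers and pseudonyms."""

    def __init__(self) -> None:
        self._to_pseudonym = {}
        self._to_identifier = {}

    def pseudonym_for(self, identifier: str) -> str:
        """Return the existing pseudonym for an identifier, or issue a new random one."""
        if identifier not in self._to_pseudonym:
            pseudonym = secrets.token_hex(8)  # 16 hex characters
            self._to_pseudonym[identifier] = pseudonym
            self._to_identifier[pseudonym] = identifier
        return self._to_pseudonym[identifier]

    def reidentify(self, pseudonym: str) -> str:
        """Track a pseudonym back to the original identifier."""
        return self._to_identifier[pseudonym]


table = PseudonymTable()
pseudonym = table.pseudonym_for("patient-0001")
original = table.reidentify(pseudonym)
```

As long as the table exists, the data is pseudonymised rather than anonymised: whoever holds the table can reverse the replacement. Destroying the table removes this route back, although, as noted earlier, the remaining fields may still allow identification by inference.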
See also
- Pseudonym, privacy, clinical information system, FLAIM
- Pseudonymisation of Information for Privacy in e-Health