Synthetic data
Synthetic data are "any production data applicable to a given situation that are not obtained by direct measurement" according to the McGraw-Hill Dictionary of Scientific and Technical Terms; where Craig S. Mullins, an expert in data management, defines production data as "information that is persistently stored and used by professionals to conduct business processes.".

The creation of synthetic data is an involved process of data anonymization; that is to say, synthetic data are a subset of anonymized data. Synthetic data are used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. Often these aspects appear in the form of personal information (e.g., name, home address, IP address, telephone number, social security number, credit card number).

Usefulness

Synthetic data are generated to meet specific needs or certain conditions that may not be found in the original, real data. This can be useful when designing any type of system, because the synthetic data serve as a simulation or as a theoretical value, situation, etc. This makes it possible to account for unexpected results and to have a basic solution or remedy if the results prove unsatisfactory. Synthetic data are often generated to represent the authentic data and allow a baseline to be set.[4] Another use of synthetic data is to protect the privacy and confidentiality of authentic data. As stated previously, synthetic data are used in testing and creating many different types of systems; below is a quote from the abstract of an article describing software that generates synthetic data for testing fraud detection systems, which further explains its use and importance:
"This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud
Fraud
In criminal law, a fraud is an intentional deception made for personal gain or to damage another individual; the related adjective is fraudulent. The specific legal definition varies by legal jurisdiction. Fraud is a crime, and also a civil law violation...

 detection system itself, thus creating the necessary adaptation of the system to a specific environment." 4

History

The history of synthetic data generation dates back to 1993, when the idea of fully synthetic data was introduced by Donald Rubin. Rubin originally designed this approach to synthesize the Decennial Census long-form responses for the short-form households. He then released samples that did not include any actual long-form records, thereby preserving the anonymity of the households. Later that year, the idea of partially synthetic data was introduced by Little, who used it to synthesize the sensitive values on the public-use file.

In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayes bootstrap) to perform the sampling.[5] Later important contributors to the development of synthetic data generation include Raghunathan, Reiter, Rubin, Abowd, and Woodcock. Collectively they devised a solution for treating partially synthetic data with missing data, and they also developed the technique of Sequential Regression Multivariate Imputation.[5]

Applications

Synthetic data are used in the process of data mining. Fraud detection systems, confidentiality systems, and many other kinds of systems are tested and trained using synthetic data. Although synthetic data may seem to be just a compilation of "made up" data, specific algorithms and generators are designed to create realistic data.[6] Such synthetic data assist in teaching a system how to react to certain situations or criteria. Researchers conducting clinical trials or other studies may also generate synthetic data to aid in creating a baseline for future work and testing. For example, intrusion detection software is tested using synthetic data. These data are a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allow the software to recognize these situations and react accordingly. If synthetic data were not used, the software would only be trained to react to the situations present in the authentic data, and it might not recognize other types of intrusion.[4]
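As a minimal illustration of this idea (not drawn from the cited systems; the feature layout, the attack profile, and all distribution parameters below are assumptions), synthetic intrusion-like records can be generated and mixed into an authentic training set:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed toy feature layout: [packets/sec, mean packet size].
# "Authentic" traffic: what was actually observed.
authentic = rng.normal(loc=[50.0, 500.0], scale=[10.0, 80.0], size=(1000, 2))
authentic_labels = np.zeros(len(authentic))  # 0 = benign

# Synthetic intrusions: drawn from a hypothetical attacker profile that
# does not appear in the authentic data (here, a flooding attack with
# many small packets), so the detector can learn to recognize it.
synthetic_attacks = rng.normal(loc=[400.0, 60.0], scale=[50.0, 15.0], size=(200, 2))
attack_labels = np.ones(len(synthetic_attacks))  # 1 = intrusion

# Combined training set for the detection system.
X = np.vstack([authentic, synthetic_attacks])
y = np.concatenate([authentic_labels, attack_labels])
print(X.shape, y.mean())  # (1200, 2), fraction of intrusion examples
```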

Synthetic data are also used to protect the privacy and confidentiality of a set of data. Real data often contain personal, private, or confidential information that a programmer, software creator, or research project may not want disclosed.[7] Synthetic data hold no personal information and cannot be traced back to any individual; therefore, the use of synthetic data reduces confidentiality and privacy concerns.

Calculations

Researchers test frameworks on synthetic data, which are "the only source of ground truth on which they can objectively assess the performance of their algorithms".[10]

"Synthetic data can be generated with random orientations and positions."8 Datasets can be get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data the best. This model or equation will be called a synthesizer build. This build can be used to generate more data.9

Constructing a synthesizer build involves constructing a statistical model. In a linear regression example, the original data can be plotted and a best-fit line created from them. This line is a synthesizer created from the original data. The next step is to generate more synthetic data from the synthesizer build, i.e., from the fitted line equation. In this way, the new data can be used for studies and research while protecting the confidentiality of the original data.[9]
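A minimal sketch of this procedure, with illustrative data and an assumed Gaussian residual model rather than any specific published method:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Original (confidential) data: y roughly linear in x.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, size=200)

# Synthesizer build: the best-fit line y = a*x + b,
# plus the spread of the residuals around it.
a, b = np.polyfit(x, y, deg=1)
residual_std = np.std(y - (a * x + b))

# Generate synthetic data from the build; no original record is reused.
x_syn = rng.uniform(x.min(), x.max(), size=500)
y_syn = a * x_syn + b + rng.normal(0, residual_std, size=500)
```

Because the synthetic points are drawn from the fitted equation rather than copied from the original records, they preserve the overall relationship in the data without exposing any individual observation.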

David Jensen of the Knowledge Discovery Laboratory explains how to generate synthetic data in chapter 6 of his "Proximity 4.3 Tutorial": "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, Proximity can generate synthetic data having one of several types of graph structure:[10] random graphs, generated by some random process; lattice graphs having a ring structure; and lattice graphs having a grid structure.
In all cases, data generation follows the same two steps:
1. Generate the empty graph structure.
2. Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.[10]
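The following is a minimal sketch of this two-step process, using networkx graph generators as a stand-in for Proximity's own (the attribute name, the prior probabilities, and the simple neighbor-based re-sampling pass are illustrative assumptions):

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(seed=0)

# Step 1: generate the empty graph structure.
g_random = nx.gnp_random_graph(100, 0.05, seed=0)  # random graph
g_ring = nx.cycle_graph(100)                       # lattice graph, ring structure
g_grid = nx.grid_2d_graph(10, 10)                  # lattice graph, grid structure

# Step 2: generate attribute values from user-supplied prior probabilities.
values, priors = ["A", "B"], [0.7, 0.3]
for node in g_random.nodes:
    g_random.nodes[node]["label"] = rng.choice(values, p=priors)

# Because an object's attribute values may depend on those of related
# objects, a collective pass can re-sample each node toward its neighbors'
# majority value, inducing autocorrelation in the synthetic data.
for node in g_random.nodes:
    neighbor_labels = [g_random.nodes[n]["label"] for n in g_random.neighbors(node)]
    if neighbor_labels and rng.random() < 0.5:  # assumed mixing rate
        g_random.nodes[node]["label"] = max(set(neighbor_labels), key=neighbor_labels.count)
```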

References

Fienberg, S. E. (1994). "Conflicts between the needs for access to statistical information and demands for confidentiality," Journal of Official Statistics, 10, 115–132.

Little, R. (1993). "Statistical Analysis of Masked Data," Journal of Official Statistics, 9, 407–426.

Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). "Multiple Imputation for Statistical Disclosure Limitation," Journal of Official Statistics, 19, 1–16.

Reiter, J. P. (2004). "Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation," Survey Methodology, 30, 235–242.

External links

The datgen synthetic data generator: http://www.datasetgenerator.com