Big data
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization. The trend continues because of the benefits of working with larger and larger datasets, which allow analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data. Scientists regularly encounter this problem in meteorology, genomics, connectomics, complex physics simulations, biological and environmental research, Internet search, finance and business informatics. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), "software logs, cameras, microphones, RFID readers, wireless sensor networks and so on." Every day, 2.5 quintillion bytes of data are created, and 90% of the data in the world today was created within the past two years.
One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead "massively parallel software running on tens, hundreds, or even thousands of servers." The size of "big data" varies depending on the capabilities of the organization managing the set. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
Definition
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. In a 2001 research report and related conference presentations, then-META Group (now Gartner) analyst Doug Laney defined data growth challenges (and opportunities) as three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner continues to use this model for describing "big data."
Examples
Examples include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale eCommerce.
Technologies
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems. Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
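The divide-and-merge pattern these technologies share can be sketched in miniature. The sketch below is illustrative only: the function names and the three toy "shards" are invented, and in a real MPP database or distributed file system the map step would run concurrently on many servers rather than in a Python loop.

```python
from collections import Counter
from functools import reduce

def map_count(shard):
    """'Map' step: each node counts words in its own shard of the corpus."""
    return Counter(shard.split())

def merge(a, b):
    """'Reduce' step: partial counts from the nodes are merged pairwise."""
    return a + b

# Three shards standing in for data spread across three servers.
shards = ["big data big", "data everywhere", "big deal"]
partials = [map_count(s) for s in shards]  # parallel across nodes in a real cluster
totals = reduce(merge, partials)
print(totals["big"])   # 3
print(totals["data"])  # 2
```

The key property is that `merge` is associative, so partial results can be combined in any order and on any node, which is what lets the work scale out to "tens, hundreds, or even thousands of servers."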
Impact
In 2000, the Sloan Digital Sky Survey collected more data in its first few weeks than had been collected in the entire history of astronomy. Since that time, it has amassed 140 terabytes of information. Its successor, the Large Synoptic Survey Telescope, will come online in 2016 and will acquire that amount of data every five days. Wal-Mart handles more than 1 million customer transactions every hour, which are imported into databases estimated at more than 2.5 petabytes, the equivalent of 167 times the books in America's Library of Congress. Facebook handles 40 billion photos from its user base. Decoding the human genome originally took 10 years; it can now be achieved in one week. The impact of big data has increased the demand for information management specialists: Oracle, IBM, Microsoft, and SAP have spent more than $15 billion on software firms specializing in data management and analytics. This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
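The Library of Congress comparison is easy to sanity-check with back-of-the-envelope arithmetic. The figure of roughly 15 terabytes for the Library's digitized book collection is a commonly cited estimate and an assumption here, not a number stated in the article:

```python
# Wal-Mart's transaction databases: ~2.5 petabytes (figure from the text above).
walmart_bytes = 2.5e15

# Commonly cited estimate for the Library of Congress book collection (~15 TB);
# this value is an assumption used only to check the 167x comparison.
loc_bytes = 15e12

print(round(walmart_bytes / loc_bytes))  # 167
```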
Big data has emerged because society now produces more of everything. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet; more people are interacting with data and information than ever before. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, and as people grow wealthier they also become more literate, which in turn drives information growth. Cisco predicts that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.
Critique
Concerns have been raised about the use of big data in science when researchers, preoccupied with handling the huge amounts of data, neglect principles such as choosing a representative sample. As a result, findings are often biased in one way or another. Integration across heterogeneous data resources - some that might be considered "big data" and others not - presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most exciting new frontiers in science.
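The sampling concern can be illustrated with a small simulation: a large "convenience" sample drawn from an unrepresentative slice of a population gives a systematically worse estimate than a much smaller random sample. The population and the "easy to reach" slice below are invented purely for illustration.

```python
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

# Invented population: some quantity distributed around 50.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Representative: a simple random sample of the whole population.
representative = random.sample(population, 1_000)

# Biased "convenience" sample: only the individuals who are easiest to
# reach, modeled here as the upper half of the population.
easy_to_reach = sorted(population)[50_000:]
convenience = random.sample(easy_to_reach, 1_000)

print(round(mean(population), 1))      # true mean, ~50
print(round(mean(representative), 1))  # close to the true mean
print(round(mean(convenience), 1))     # systematically too high, ~58
```

Volume does not cure the bias: drawing a million records from the skewed slice would only reproduce the same wrong answer more precisely, which is the critique's point.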
See also
- Cloud computing
- Data assimilation
- Database theory
- Database-centric architecture
- Data Intensive Computing
- Data structure
- Online database
- Real-time database
- Relational database
- Social data revolution
- Supercomputer
- Tuple space
Architecture Comparison
- Survey Distributed Databases
- Marin Dimitrov's Comparison on PNUTS, Dynamo, Voldemort, BigTable, HBase, Cassandra and CouchDB May 2010
- Big Data Architecture: Comparing Aster Data, Greenplum, Gluster etc 2009
- HBase vs. Cassandra: NoSQL Battle!
- Why Pick Cassandra for Real-time Transaction
- Why Use HBase-1: from Million Mark to Billion Mark
- Why Use HBase-2: Demystifying HBase Data integrity, Availability and Performance
- HBase MapReduce 101 - Part I
- HBase Architecture 101 - Write-ahead-Log
- HBase Architecture 101 - Storage
- Beyond Hadoop: Next-Generation Big Data Architectures by Bill McColl, Oct. 23, 2010, about "Not Only Hadoop".
- MPI and BSP: see the wiki articles on Bulk Synchronous Parallel and Apache HAMA on a Hadoop cluster.