Graph database
A graph database uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency: every element holds a direct reference to its adjacent elements, so no index lookup is needed to move from one to the next. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.
Structure
Graph databases are based on graph theory. Graph databases employ nodes, properties, and edges. Nodes are very similar in nature to the objects that object-oriented programmers will be familiar with.
Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of.
Properties are pertinent pieces of information that relate to nodes. For instance, if "Wikipedia" were one of the nodes, it might be tied to properties such as "website", "reference material", or "word that starts with the letter 'w'", depending on which aspects of "Wikipedia" are pertinent to the particular database.
Edges are the lines that connect nodes to nodes or nodes to properties and they represent the relationship between the two. Most of the important information is really stored in the edges. Meaningful patterns emerge when one examines the connections and interconnections of nodes, properties, and edges.
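The node, property, and edge model described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular graph database is implemented; the class names `Node` and `Edge`, the helpers `connect` and `neighbors`, and the sample people and edge labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)
class Node:
    """An entity with its properties and direct references to its outgoing edges."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)
    # repr=False avoids an endless repr cycle (node -> edge -> node -> ...).
    out_edges: List["Edge"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Edge:
    """A labeled, directed relationship between two nodes."""
    label: str
    source: Node
    target: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def connect(source: Node, label: str, target: Node, **properties: Any) -> Edge:
    """Create an edge and store it on its source node."""
    edge = Edge(label, source, target, dict(properties))
    source.out_edges.append(edge)
    return edge


def neighbors(node: Node, label: str) -> List[Node]:
    """Follow the node's own edge references; no global index is consulted."""
    return [edge.target for edge in node.out_edges if edge.label == label]


# Hypothetical sample data, loosely following the "Wikipedia" example.
wikipedia = Node("Wikipedia", {"kind": "website", "category": "reference material"})
alice = Node("Alice", {"kind": "person"})
bob = Node("Bob", {"kind": "person"})

connect(alice, "READS", wikipedia, since=2010)
connect(alice, "KNOWS", bob)

print([n.name for n in neighbors(alice, "READS")])
```

Because each node keeps references to its own edges, finding its neighbors touches only that node's edges and not the rest of the data set, which is the idea behind index-free adjacency.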
Properties
Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are more suitable to manage ad-hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements.
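The contrast between join operations and graph traversal can be illustrated with a "friends of friends" query. The sketch below is only a toy: the data and both function names are hypothetical, and the nested loop stands in for an unindexed self-join, whereas a real relational engine would normally use indexes and a query planner. It shows the difference in shape between the two approaches and is not a benchmark.

```python
from collections import defaultdict

# Relational-style storage: one row per relationship (hypothetical data).
knows_rows = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "dave"),
    ("carol", "dave"),
    ("carol", "erin"),
]


def friends_of_friends_join(rows, person):
    """Naive self-join: pair every row with every row and keep the matches."""
    return {
        b2
        for (a1, b1) in rows
        for (a2, b2) in rows
        if a1 == person and b1 == a2 and b2 != person
    }


# Graph-style storage: each node lists its neighbors directly.
adjacency = defaultdict(list)
for a, b in knows_rows:
    adjacency[a].append(b)


def friends_of_friends_traversal(adj, person):
    """Two hops along stored adjacency lists; only reachable data is visited."""
    return {
        fof
        for friend in adj.get(person, [])
        for fof in adj.get(friend, [])
        if fof != person
    }


print(sorted(friends_of_friends_join(knows_rows, "alice")))
print(sorted(friends_of_friends_traversal(adjacency, "alice")))
```

Both functions return the same set. The join version examines every pair of rows in the table, while the traversal version visits only the neighbors of the starting node and their neighbors.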
Graph databases are a powerful tool for graph-like queries, for example computing the shortest path between two nodes in the graph. Other graph-like queries, such as computing the graph's diameter or detecting communities, can also be performed over a graph database in a natural way.
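As an illustration of a shortest-path query, the following sketch runs a breadth-first search over an unweighted, directed graph held as a plain adjacency dictionary. The function name `shortest_path` and the sample graph are assumptions for the example; graph databases expose such queries through their own APIs or query languages, and weighted graphs would call for a different algorithm such as Dijkstra's.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return a path with the fewest edges from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in previous:
                continue  # already reached by a path that is no longer
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical sample graph: each key lists the nodes it points to.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

print(shortest_path(graph, "A", "F"))  # ['A', 'B', 'D', 'F']
```

Breadth-first search explores nodes in order of their distance from the start, so the first time the goal is reached the path found has the minimum number of edges.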
Graph database projects
The following is a list of several well-known graph database projects:
- AllegroGraph - a scalable, high-performance RDF and graph database.
- Bigdata - a highly scalable RDF/graph database capable of 10B+ edges on a single node or clustered deployment for very high throughput.
- CloudGraph - a disk- and memory-based, fully transactional .NET graph database that uses graphs and key/value pairs to store data.
- Cytoscape - open-source platform, outgrowth of bioinformatics
- DEX - a high-performance graph database from Sparsity Technologies, a technology transition company from DAMA-UPC
- Filament - graph persistence framework and associated toolkits based on a navigational query style.
- GraphBase - a customizable, distributed, small-footprint, high-performance graph store with a rich tool set from FactNexus
- Graphd, the proprietary backend of Freebase
- Horton - a graph database from Microsoft Research Extreme Computing Group (XCG) based on the cloud programming infrastructure Orleans
- HyperGraphDB - an open-source (LGPL) graph database supporting generalized hypergraphs where edges can point to other edges
- InfiniteGraph - a highly scalable, distributed and cloud-enabled commercial product with flexible licensing for startups.
- InfoGrid - an open-source / commercial (AGPLv3, free for small entities) graph database with web front end and configurable storage engines (MySQL, PostgreSQL, Files, Hadoop)
- Neo4j - an open-source / commercial (GPLv3 community edition, AGPLv3 advanced and enterprise edition) graph database
- OrientDB - a high-performance open source document-graph database
- OQGRAPH - graph computation engine (GPLv2 licensed) for MySQL, MariaDB and Drizzle
- sones GraphDB - an open-source / commercial (AGPLv3) graph database and universal access layer (funded by Deutsche Telekom AG)
- VertexDB - high performance graph database server that supports automatic garbage collection.
- Virtuoso Universal Server - a clustered high performance and scalable RDF graph database server
- R2DF - R2DF framework for ranked path queries over weighted RDF graphs
Distributed Graph Processing (mostly in-memory-only)
- Angrapa - graph package in Hama, a bulk synchronous parallel (BSP) platform
- FlockDB - an open source distributed, fault-tolerant graph database based on MySQL and the Gizzard framework for managing Twitter-like graph data (single-hop relationships) at webscale.
- Giraph - a graph processing infrastructure that runs on Hadoop (see Pregel).
- GoldenOrb - Pregel implementation built on top of Apache Hadoop
- Phoebus - Pregel implementation written in Erlang
- Pregel - Google's internal graph processing platform; details were published in an ACM paper.
- Trinity - Distributed in-memory graph engine under development at Microsoft Research Labs.
APIs and Graph Query/Programming Languages
- Blueprints - a Java API for Property Graphs from TinkerPop and supported by a few graph database vendors.
- Blueprints.NET - a C#/.NET API for generic Property Graphs.
- Cypher - a Property Graph Query Language developed by Neo4j.
- Gremlin - an open-source graph programming language that works over various graph database systems.
- Pacer - a Ruby dialect/implementation of the Gremlin graph traversal language.
- Pipes - a lazy dataflow framework written in Java that forms the foundation for various property graph traversal languages.
- Pipes.NET - a data flow framework for C#/.NET for processing generic graphs and Property Graphs.
- PYBlueprints - a Python API for Property Graphs.
- Rexster - an HTTP/REST API for accessing remote graph databases and supported by a few graph database vendors.
See also
- NoSQL (concept)
- Document-oriented database
- Structured storage
- Object database
- Resource Description Framework (RDF) - framework to express node-edge graphs
- Graph transformation for a complementary topic (rule-based in-memory manipulation of graphs instead of transaction-safe persistence).