Shapefile
The Esri Shapefile, or simply a shapefile, is a popular geospatial vector data format for geographic information systems software. It is developed and regulated by Esri as a (mostly) open specification for data interoperability among Esri and other software products. Shapefiles spatially describe geometries: points, polylines, and polygons. These, for example, could represent water wells, rivers, and lakes, respectively. Each item may also have attributes that describe the items, such as the name or temperature.
Overview
A shapefile is a digital vector storage format for storing geometric location and associated attribute information. This format lacks the capacity to store topological information. The shapefile format was introduced with ArcView GIS version 2 in the early 1990s. It is now possible to read and write shapefiles using a variety of free and non-free programs.
Shapefiles are simple because they store only primitive geometric data types: points, lines, and polygons. These primitives are of limited use without attributes that specify what they represent, so a table of records stores the properties (attributes) of each primitive shape in the shapefile. Shapes together with their data attributes can represent a wide variety of geographic data, and these representations support powerful and accurate computations.
While the term "shapefile" is quite common, a "shapefile" is actually a set of several files. Three individual files are mandatory to store the core data that comprises a shapefile: ".shp", ".shx", and ".dbf". These and the optional files with other extensions share a common prefix name (e.g., "lakes.*"). The actual shapefile relates specifically to the file with the ".shp" extension, but this file alone is incomplete for distribution, as the other supporting files are required.
There are also several optional files, which primarily store index data to improve performance. Each individual file should conform to the MS-DOS 8.3 filename convention (8-character filename prefix, period, 3-character filename suffix, such as shapefil.shp) in order to be compatible with past applications that handle shapefiles, though many recent software applications accept files with longer names. For this same reason, all files should be located in the same folder.
Mandatory files:
- .shp: shape format; the feature geometry itself
- .shx: shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly
- .dbf: attribute format; columnar attributes for each shape, in dBase IV format
Optional files:
- .prj: projection format; the coordinate system and projection information, a plain text file describing the projection using well-known text format
- .sbn and .sbx: a spatial index of the features
- .fbn and .fbx: a spatial index of the features for shapefiles that are read-only
- .ain and .aih: an attribute index of the active fields in a table or a theme's attribute table
- .ixs: a geocoding index for read-write shapefiles
- .mxs: a geocoding index for read-write shapefiles (ODB format)
- .atx: an attribute index for the .dbf file in the form of shapefile.columnname.atx (ArcGIS 8 and later)
- .shp.xml: geospatial metadata in XML format, such as ISO 19115 or other schemas
- .cpg: used to specify the code page (only for .dbf) for identifying the character encoding to be used
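The file lists above suggest a simple completeness check before a shapefile is distributed. The following is a minimal Python sketch, not part of any Esri tool: the function name `missing_components` is hypothetical, and it assumes the component files sit next to the .shp file and differ only in extension (lower or upper case).

```python
from pathlib import Path

# The three mandatory extensions named in the list above.
MANDATORY_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_components(shp_path):
    """Return the mandatory extensions that have no file next to shp_path.

    Hypothetical helper: it only checks that the files exist, not that
    their contents are valid.
    """
    base = Path(shp_path)
    missing = []
    for suffix in MANDATORY_SUFFIXES:
        # Accept either lower-case or upper-case extensions.
        candidates = (base.with_suffix(suffix), base.with_suffix(suffix.upper()))
        if not any(p.exists() for p in candidates):
            missing.append(suffix)
    return missing
```

An empty result means all three mandatory files are present; optional files such as .prj could be checked in the same way if a workflow depends on them.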
In each of the .shp, .shx, and .dbf files, the shapes correspond to each other in sequence. That is, the first record in the .shp file corresponds to the first record in the .shx and .dbf files, and so on. The .shp and .shx files have various fields with different endianness, so an implementor of the file formats must be very careful to respect the endianness of each field and treat it properly.
Shapefiles deal with coordinates in terms of X and Y, although these often store longitude and latitude, respectively.
Shapefile shape format (.shp)
The main file (.shp) contains the primary geographic reference data in the shapefile. The file consists of a single fixed-length header followed by one or more variable-length records. Each of the variable-length records includes a record header component and a record contents component. A detailed description of the file format is given in the Esri Shapefile Technical Description. This format should not be confused with the AutoCAD shape font source format, which shares the .shp extension.
The main file header is fixed at 100 bytes in length and contains 17 fields: nine 4-byte (32-bit signed integer, or int32) fields followed by eight 8-byte (double) signed floating point fields:
Bytes | Type | Endianness | Usage |
---|---|---|---|
0–3 | int32 | big | File code (always hex value 0x0000270a) |
4–23 | int32 | big | Unused; five uint32 |
24–27 | int32 | big | File length (in 16-bit words, including the header) |
28–31 | int32 | little | Version |
32–35 | int32 | little | Shape type (see reference below) |
36–67 | double | little | Minimum bounding rectangle (MBR) of all shapes contained within the shapefile; four doubles in the following order: min X, min Y, max X, max Y |
68–83 | double | little | Range of Z; two doubles in the following order: min Z, max Z |
84–99 | double | little | Range of M; two doubles in the following order: min M, max M |
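The mixed endianness in the header table can be handled with explicit byte-order prefixes. The sketch below uses Python's standard `struct` module and follows the byte ranges in the table; the function name `parse_main_header` and the dictionary keys are illustrative choices, not part of the specification.

```python
import struct

SHP_FILE_CODE = 0x0000270A  # file code from the header table


def parse_main_header(buf):
    """Decode the 100-byte main file header described in the table."""
    if len(buf) < 100:
        raise ValueError("need at least 100 bytes for the header")
    # Bytes 0-27: seven big-endian int32 values
    # (file code, five unused fields, file length in 16-bit words).
    file_code, *_unused, length_words = struct.unpack(">7i", buf[0:28])
    if file_code != SHP_FILE_CODE:
        raise ValueError("unexpected file code: %#x" % file_code)
    # Bytes 28-35: two little-endian int32 values.
    version, shape_type = struct.unpack("<2i", buf[28:36])
    # Bytes 36-99: eight little-endian doubles.
    xmin, ymin, xmax, ymax, zmin, zmax, mmin, mmax = struct.unpack(
        "<8d", buf[36:100]
    )
    return {
        "file_length_bytes": 2 * length_words,  # words are 16 bits wide
        "version": version,
        "shape_type": shape_type,
        "bbox": (xmin, ymin, xmax, ymax),
        "z_range": (zmin, zmax),
        "m_range": (mmin, mmax),
    }
```

Reading the first 100 bytes of a .shp file in binary mode and passing them to this function would give the declared shape type and bounding rectangle without touching any record.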
The file then contains any number of variable-length records. Each record is prefixed with a record-header of 8 bytes:
Bytes | Type | Endianness | Usage |
---|---|---|---|
0–3 | int32 | big | Record number (1-based) |
4–7 | int32 | big | Record length (in 16-bit words) |
Following the record header is the actual record:
Bytes | Type | Endianness | Usage |
---|---|---|---|
0–3 | int32 | little | Shape type (see reference below) |
4– | - | - | Shape content |
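Taken together, the record header and record contents tables describe how to walk through the file one record at a time. The following Python sketch assumes the whole .shp file has been read into memory as bytes and that the record length counts only the contents, not the 8-byte record header (the reading given in the Esri technical description). The name `iter_records` is hypothetical.

```python
import struct


def iter_records(data):
    """Yield (record_number, shape_type, content) from raw .shp bytes.

    `content` is the shape content that follows the 4-byte shape type.
    """
    pos = 100  # skip the fixed-length main file header
    while pos + 8 <= len(data):
        # Record header: two big-endian int32 values.
        rec_num, length_words = struct.unpack(">2i", data[pos:pos + 8])
        start = pos + 8
        end = start + 2 * length_words  # length is in 16-bit words
        if length_words < 2 or end > len(data):
            raise ValueError("truncated or malformed record %d" % rec_num)
        # Record contents begin with a little-endian int32 shape type.
        (shape_type,) = struct.unpack("<i", data[start:start + 4])
        yield rec_num, shape_type, data[start + 4:end]
        pos = end
```

Because each record carries its own length, a reader can skip over shapes it does not understand without decoding their contents.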
The variable length record contents depend on the shape type. The following are the possible shape types:
Value | Shape type | Fields |
---|---|---|
0 | Null shape | None |
1 | Point | X, Y |
3 | Polyline | MBR, Number of parts, Number of points, Parts, Points |
5 | Polygon | MBR, Number of parts, Number of points, Parts, Points |
8 | MultiPoint | MBR, Number of points, Points |
11 | PointZ | X, Y, Z, M |
13 | PolylineZ | Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array Optional: M range, M array |
15 | PolygonZ | Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array Optional: M range, M array |
18 | MultiPointZ | Mandatory: MBR, Number of points, Points, Z range, Z array Optional: M range, M array |
21 | PointM | X, Y, M |
23 | PolylineM | Mandatory: MBR, Number of parts, Number of points, Parts, Points Optional: M range, M array |
25 | PolygonM | Mandatory: MBR, Number of parts, Number of points, Parts, Points Optional: M range, M array |
28 | MultiPointM | Mandatory: MBR, Number of points, Points Optional: M range, M array |
31 | MultiPatch | Mandatory: MBR, Number of parts, Number of points, Parts, Part types, Points, Z range, Z array Optional: M range, M array |
In common use, shapefiles containing Point, Polyline, and Polygon are extremely popular. The "Z" types are three-dimensional. The "M" types contain a user-defined measurement which coincides with the point being referenced. Three-dimensional shapefiles are rather uncommon, and the measurement functionality has been largely superseded by more robust databases used in conjunction with the shapefile data.
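The shape type table maps directly onto a lookup structure. As a hedged illustration, the Python sketch below encodes the table and decodes the three point-like types; it assumes the coordinate values are stored as little-endian doubles, as in the file header, and the names `SHAPE_TYPES` and `decode_point` are made up for this example.

```python
import struct

# Shape type values and names, copied from the table above.
SHAPE_TYPES = {
    0: "Null shape", 1: "Point", 3: "Polyline", 5: "Polygon",
    8: "MultiPoint", 11: "PointZ", 13: "PolylineZ", 15: "PolygonZ",
    18: "MultiPointZ", 21: "PointM", 23: "PolylineM", 25: "PolygonM",
    28: "MultiPointM", 31: "MultiPatch",
}

# Field order for the point-like types, as listed in the table.
POINT_FIELDS = {
    1: ("x", "y"),
    11: ("x", "y", "z", "m"),
    21: ("x", "y", "m"),
}


def decode_point(shape_type, content):
    """Decode the shape content of a Point, PointZ, or PointM record.

    `content` is the record contents after the 4-byte shape type.
    Assumes little-endian doubles (an assumption of this sketch).
    """
    try:
        names = POINT_FIELDS[shape_type]
    except KeyError:
        raise ValueError("not a point-like shape type: %r" % shape_type)
    values = struct.unpack("<%dd" % len(names), content[:8 * len(names)])
    return dict(zip(names, values))
```

The multi-part types (Polyline, Polygon, and their Z and M variants) need more work, since their Parts and Points arrays have lengths given by earlier fields in the same record.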
Shapefile shape index format (.shx)
The shapefile index contains the same 100-byte header as the .shp file, followed by any number of 8-byte fixed-length records which consist of the following two fields:
Bytes | Type | Endianness | Usage |
---|---|---|---|
0–3 | int32 | big | Record offset (in 16-bit words) |
4–7 | int32 | big | Record length (in 16-bit words) |
Using this index, it is possible to seek backwards in the shapefile by seeking backwards first in the shape index (which is possible because it uses fixed-length records), reading the record offset, and using that to seek to the correct position in the .shp file. It is also possible to seek forwards an arbitrary number of records by using the same method.
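The seeking method described above amounts to a small calculation on the index. This Python sketch assumes the .shx file has been read into memory as bytes and that records are addressed with a zero-based index; `record_span` is a hypothetical name.

```python
import struct


def record_span(shx_bytes, index):
    """Return (byte_offset, byte_length) for the record at a zero-based index.

    Both values are stored in the .shx file as big-endian int32 counts of
    16-bit words, so they are doubled to obtain bytes.
    """
    pos = 100 + 8 * index  # 100-byte header, then fixed 8-byte entries
    if index < 0 or pos + 8 > len(shx_bytes):
        raise IndexError("record index out of range: %d" % index)
    offset_words, length_words = struct.unpack(">2i", shx_bytes[pos:pos + 8])
    return 2 * offset_words, 2 * length_words
```

With the returned offset, a reader can seek straight to that position in the .shp file and read the record there, which is what makes random access possible despite the variable-length records.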
Shapefile attribute format (.dbf)
Attributes for each shape are stored in the dBase format. An alternative format that can also be used is the xBase format, which has an open specification and is used in open source shapefile libraries, such as the Shapefile C Library.
Shapefile projection format (.prj)
The information contained in the .prj file specifies the geographic coordinate system of the geometric data in the .shp file.
Although optional, it is usually provided, as it is not necessarily possible to guess the coordinate system of any given points.
The file is created in well-known text (WKT) format when generated with ArcGIS Desktop versions 9 and later.
Previous ArcGIS versions and some third-party software generate it in another format, shown here:
Older projection file format example:
Projection UTM
Zunits NO
Units METERS
Spheroid CLARKE1866
Xshift 0.0000000000
Yshift -4000000.0000000000
Parameters
-108 0 0.000 /* longitude
36 0 0.000 /* latitude
New WKT format example:
GEOGCS["GCS_North_American_1927",DATUM["D_North_American_1927",SPHEROID["Clarke_1866",6378206.4,294.9786982]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]]
The information contained in the .prj file specifies the:
- Name of geographic coordinate system or map projection
- Datum
- Spheroid
- Prime meridian
- Units used
- Parameters necessary to define the map projection, for example:
  - Latitude of origin
  - Scale factor
  - Central meridian
  - False northing
  - False easting
  - Standard parallels
Shapefile spatial index format (.sbn)
This is a binary spatial index file, which is used only by Esri software. The format is not documented by Esri. However, it has been reverse-engineered and documented by the open source community. It is not currently implemented by other vendors. The .sbn file is not strictly necessary, since the .shp file contains all of the information necessary to successfully parse the spatial data.
Topology and shapefiles
Shapefiles do not have the ability to store topological information. ArcInfo coverages and Personal/File/Enterprise Geodatabases do have the ability to store feature topology.
Spatial representation
The edges of a polyline or polygon are defined using points. The spacing of the points implicitly determines the scale for which the data are useful. Exceeding that scale results in jagged representation of features. Additional points would be required to achieve smooth shapes at greater scales. For features better represented by smooth curves, the polygon representation requires much more data storage than, for example, splines, which can capture smoothly varying shapes efficiently. None of the shapefile types supports splines.
Data storage
The size of either the .shp or the .dbf component file cannot exceed 2 GB (or 2^31 bytes). This translates to, at best, about 70 million point features. The maximum number of features for other geometry types varies depending on the number of vertices used.
The attribute database format for the .dbf component file is based on an older dBase standard. This database format inherently has a number of limitations, including:
- While the current dBase standard and GDAL/OGR, the main open source software library for reading and writing shapefiles, support null values, Esri software represents these values as zeros. This is a very serious issue for analyzing quantitative data, as it may skew representation and statistics if null quantities are represented as 0.
- Poor support for Unicode field names or field storage
- Maximum length of field names is 10 characters
- Maximum number of fields is 255
- Supported field types are: floating point (13 character storage), integer (4 or 9 character storage), date (no time storage; 8 character storage), and text (maximum 254 character storage)
- Floating point numbers may contain rounding errors since they are stored as text
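The 2 GB limit given at the start of this section can be turned into a rough upper bound for point features by counting bytes per record. The Python sketch below is a back-of-the-envelope calculation for the .shp file only, using the layouts from the shape format section: a 100-byte file header, an 8-byte record header, a 4-byte shape type, and two 8-byte doubles per point. The constant and function names are illustrative.

```python
HEADER_BYTES = 100             # fixed-length main file header
RECORD_HEADER_BYTES = 8        # record number + record length
POINT_CONTENT_BYTES = 4 + 8 + 8  # shape type + X + Y
MAX_FILE_BYTES = 2**31         # the 2 GB limit quoted above


def max_point_records(max_bytes=MAX_FILE_BYTES):
    """Upper bound on Point records that fit in a .shp file of max_bytes."""
    per_record = RECORD_HEADER_BYTES + POINT_CONTENT_BYTES  # 28 bytes
    return (max_bytes - HEADER_BYTES) // per_record
```

This bound comes out somewhat above 70 million. The .dbf file is subject to the same size limit, so the attribute rows may impose a lower ceiling in practice, depending on how wide each row is.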
Mixing shape types
Because the shape type precedes each record, a shapefile is physically capable of storing a mixture of different shape types. However, the specification states, "All the non-Null shapes in a shapefile are required to be of the same shape type." Therefore this ability to mix shape types must be limited to interspersing null shapes with the single shape type declared in the file's header. A shapefile must not contain both Polyline and Polygon data, for example, and the descriptions for a well (point), a river (polyline) and a lake (polygon) would be stored in three separate files.
See also
- Geographic information system
- Open Geospatial Consortium
- Open Source Geospatial Foundation (OSGeo)
- List of geographic information systems software
- Comparison of geographic information systems software
External links
- Shapefile file extensions – Esri Webhelp docs for ArcGIS 10.0 (2010)
- Esri Shapefile Technical Description – Esri White Paper, July 1998
- Esri – Understanding Topology and Shapefiles
- shapelib.maptools.org - Free C library for reading/writing shapefiles
- Simpliest Shapefile Viewer - for Win95+
- Python Shapefile Library - Open Source (MIT License) Python library for reading/writing shapefiles