DjVu
DjVu is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.
DjVu has been promoted as an alternative to PDF, promising smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70 kB, black-and-white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy and paste and text search operations.
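As a rough illustration of what these reported figures imply, the following sketch divides the 500 kB JPEG reference size by each reported DjVu size range. The sizes come from the developers' figures quoted above; the names `JPEG_KB`, `DJVU_KB`, and `ratio_range` are hypothetical and exist only for this example.

```python
# Sizes in kB as reported by the DjVu developers (see the text above).
JPEG_KB = 500
DJVU_KB = {
    "color magazine page": (40, 70),
    "black-and-white technical paper": (15, 40),
    "ancient manuscript": (100, 100),
}


def ratio_range(low_kb, high_kb, reference_kb=JPEG_KB):
    """Return (smallest, largest) size ratio of the reference to DjVu."""
    # The largest DjVu file gives the smallest ratio, and vice versa.
    return reference_kb / high_kb, reference_kb / low_kb


for kind, (low, high) in DJVU_KB.items():
    worst, best = ratio_range(low, high)
    print(f"{kind}: {worst:.1f}x to {best:.1f}x smaller than JPEG")
```

Under these figures a magazine page comes out roughly 7 to 12.5 times smaller than the JPEG reference, a technical paper roughly 12.5 to 33 times smaller, and a manuscript about 5 times smaller. These are ratios of reported typical sizes, not measurements.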
Free browser plug-ins and desktop viewers from different developers are available from the djvu.org website. DjVu is supported by a number of multi-format document viewers and e-book reader software on Linux (Okular, Evince), Android (VuDroid), and iPhone/iPad (Stanza).
History
The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Labs from 1996 to 2001.
Due to its declared higher compression ratio (and thus smaller file size) and the ease of converting large volumes of text into DjVu format, and because it is an open file format, some independent technologists (such as Brewster Kahle) have historically considered it superior to PDF.
Release history
The DjVu library, distributed as part of the open source package DjVuLibre, has become the reference implementation for the DjVu format. DjVuLibre has been maintained and updated by the original developers of DjVu since 2002. The DjVu file format specification has gone through a number of revisions.
Compression
DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2). The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page it occurs.
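The three-layer model can be illustrated with a small sketch of how a viewer might recombine the layers: wherever the high-resolution mask is set, the pixel takes the foreground color, and elsewhere it takes the background color. This is a simplified illustration, not the DjVuLibre implementation. It assumes the layers are already decoded, uses nearest-neighbor upsampling for the low-resolution layers, and stands in single letters for colors; the function names `upsample` and `composite` are hypothetical.

```python
def upsample(layer, factor):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor."""
    out = []
    for row in layer:
        # Repeat every pixel `factor` times horizontally...
        wide = [px for px in row for _ in range(factor)]
        # ...and every widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


def composite(mask, foreground, background, factor):
    """Rebuild a page from a bilevel mask and two low-resolution layers."""
    fg = upsample(foreground, factor)
    bg = upsample(background, factor)
    return [
        [fg[y][x] if mask[y][x] else bg[y][x] for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


# A 3x6 mask at full resolution; the color layers are 3x coarser (1x2),
# mirroring the 300 dpi mask and 100 dpi layers mentioned above.
mask = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
]
foreground = [["K", "R"]]  # ink colors: black, red
background = [["w", "y"]]  # paper colors: white, yellow

page = composite(mask, foreground, background, factor=3)
for row in page:
    print("".join(row))
```

The sharp edges of the text come entirely from the mask, which is why the color layers can be stored at a much lower resolution without making the text harder to read.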
Optionally, these shapes may be mapped to ASCII codes (either by hand or potentially by a text recognition system), and stored in the DjVu file. If this mapping exists, it is possible to select and copy text.
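The shape-dictionary idea and the optional character mapping can be sketched together. The example below is a deliberately simplified illustration and not the JB2 codec: it matches shapes only when their bitmaps are exactly equal (JB2 also merges nearly identical shapes), it does not compress the bitmaps, and it reads text in placement order. The names `encode_page` and `extract_text` are hypothetical.

```python
def encode_page(glyphs):
    """Store each distinct bitmap once and record where it is placed.

    `glyphs` is a list of (bitmap, x, y) tuples, where a bitmap is a
    tuple of row strings. Returns (shapes, placements).
    """
    shapes = []       # unique bitmaps, in order of first appearance
    index_of = {}     # bitmap -> index into `shapes`
    placements = []   # (shape_index, x, y) for every occurrence
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(shapes)
            shapes.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return shapes, placements


def extract_text(placements, shape_to_char):
    """Turn placements into text using an optional shape-to-character map."""
    # Shapes without a mapping are shown as "?".
    return "".join(shape_to_char.get(i, "?") for i, _, _ in placements)


# Two toy 3x3 bitmaps standing in for the letters "e" and "l".
E = ("###", "#..", "###")
L = ("#..", "#..", "###")

shapes, placements = encode_page([(E, 0, 0), (E, 4, 0), (L, 8, 0)])
print(len(shapes), placements)
print(extract_text(placements, {0: "e", 1: "l"}))
```

Here the page contains three glyphs but only two stored shapes; the second "e" costs only a position record. With the mapping supplied, the placements can be read back as text, which is what makes selecting and copying possible.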
Format licensing
DjVu is an open file format. The file format specification is published, as is the source code for the reference library. The original authors distribute an open source implementation named "DjVuLibre" under the GNU General Public License. The ownership rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T, LizardTech, Celartem, and Caminova.
In 2002, the DjVu file format was chosen by the Internet Archive as the format in which its Million Book Project provides scanned public domain books online (along with TIFF and PDF).
External links
- DjVu.org: Commercial website for DjVu-related software.
- DjVuLibre: open source viewers and compressors for many platforms.
- WinDjView: DjVu viewer for Windows and Mac.
- MiniDjVu: open source DjVu compressor for Linux/Unix and Windows.
- Caminova.net: commercial and free viewers and tools.
- Any2DjVu.org: online DjVu compression server.
- DjVuOutline: open source DjVu outline (contents, bookmarks) editor.
- DJVU web database
- STDU Viewer: DjVu viewer for Windows.
- DjVu Viewer: free DjVu viewer for Windows.