SIMILE
SIMILE is a joint research project run by the World Wide Web Consortium (W3C), the Massachusetts Institute of Technology (MIT) Libraries and MIT CSAIL, and funded by the Andrew W. Mellon Foundation. The project focuses on developing tools to increase the interoperability of disparate digital collections; much of its technical work is oriented towards Semantic Web technology and standards such as the Resource Description Framework (RDF).
History
SIMILE stands for Semantic Interoperability of Metadata and Information in unLike Environments. It was born out of DSpace, the open source digital repository system for scholarly materials developed at MIT. DSpace, which is now used at a number of research institutions, archives scholarly publications and makes them accessible. The aim of DSpace is to make it possible to federate the collections of the various holding libraries, so that the contents of each digital library are not entombed within its individual research community. In order to grow, and to enable its users to find research material which has been described in various domain-specific ways, DSpace needs the ability to support metadata schemas beyond Dublin Core. The challenge for DSpace and other digital libraries is to help communities deal with different schemes, vocabularies, ontologies and metadata, and to provide research services to their users.
RDF-based tools
The Resource Description Framework (RDF) is used to represent metadata about resources on the web, and is intended for situations where information is processed by applications rather than by human beings. Specifically, the SIMILE tools assist in the storage, querying, transformation and mapping of very large collections of RDF data. The tools developed within SIMILE are meant to allow people who are not Semantic Web developers to create ontologies which describe their specialized metadata, create RDF and convert other types of metadata into RDF. These open source tools are designed to be scalable and to provide for cross-community sharing of metadata at low cost.
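As a rough illustration of what storing and querying RDF-style metadata involves, the sketch below holds statements as subject-predicate-object triples and answers pattern queries over them. It is a minimal in-memory toy, not part of any SIMILE tool; the identifiers (ex:thesis42, ex:report7), the sample values and the function name match are all invented for the example, and the dc: prefix merely stands in for Dublin Core elements.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample metadata; none of these identifiers come from SIMILE.
TRIPLES: List[Triple] = [
    ("ex:thesis42", "dc:title", "Faceted Browsing of RDF"),
    ("ex:thesis42", "dc:creator", "A. Student"),
    ("ex:report7", "dc:title", "Metadata Interoperability"),
    ("ex:report7", "dc:creator", "A. Student"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every non-None part of the pattern."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # Which resources list "A. Student" as creator?
    for subject, _, _ in match(TRIPLES, p="dc:creator", o="A. Student"):
        print(subject)
```

Leaving a position as None acts as a wildcard, which is the basic idea behind triple-pattern queries; a real store would add indexing to cope with very large collections.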
Longwell
Longwell is a faceted browser which enables the user to visualize and browse any RDF data set, allowing the user to quickly build a user-friendly web site out of the RDF data without writing any RDF code. Facets are metadata fields considered important for a given data set. In its default configuration, the collection of facets is shown along the right-hand side of the page, and clicking on any facet adds a restriction and refines the facets in relation to the data retrieved. Longwell then displays only the subset of the data which meets those restrictions; this subset appears on the left-hand side of the page. Previously selected restrictions can be removed, which broadens the subset of items displayed.
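The restrict-and-refine cycle described above can be sketched in a few lines. This is only an illustration of faceted browsing in general, not Longwell's implementation; the sample items, the facet names (subject, year) and the functions restrict and facet_counts are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence

Item = Dict[str, str]

# Hypothetical data set; each item is a flat record of metadata fields.
ITEMS: List[Item] = [
    {"title": "Paper A", "subject": "physics", "year": "2004"},
    {"title": "Paper B", "subject": "physics", "year": "2005"},
    {"title": "Paper C", "subject": "biology", "year": "2005"},
]
FACETS = ("subject", "year")  # fields considered important for this data set


def restrict(items: List[Item], restrictions: Dict[str, str]) -> List[Item]:
    """Keep only the items that satisfy every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in restrictions.items())
    ]


def facet_counts(items: List[Item], facets: Sequence[str]) -> Dict[str, Counter]:
    """Recompute, for each facet, how many remaining items carry each value."""
    return {
        facet: Counter(item[facet] for item in items if facet in item)
        for facet in facets
    }


if __name__ == "__main__":
    selected = {"year": "2005"}          # the user clicks a facet value
    subset = restrict(ITEMS, selected)   # narrowed list of items
    print([item["title"] for item in subset])
    print(facet_counts(subset, FACETS))  # refined facets for the subset
    selected.pop("year")                 # removing the restriction broadens again
    print(len(restrict(ITEMS, selected)))
```

Recomputing the facet counts from the restricted subset is what keeps the facet panel consistent with the items on display; removing a restriction simply reruns the same two steps with a smaller set of conditions.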
Piggy Bank
Piggy Bank is a Firefox extension which enables the user to collect information from the Web, save it for future use, tag it with keywords, search and browse the collected information, retrieve saved information, share collected information and install screen scrapers. Piggy Bank gathers RDF data where it is available; where it is not, it generates RDF from HTML by using screen scrapers. This incremental approach to realizing the Semantic Web vision allows users to save and tag information gathered from web pages without having to cut, paste and label the various products of their browsing. By clicking on the keyword they have used to tag particular types of item, users can view all of those items together within their browser, without having to open other applications. Users can also deposit saved data in the Semantic Bank, where other users can browse it and add their own contributions. This pooling of keywords underlies services such as Flickr and del.icio.us, where communities can collaborate to build a taxonomy for shared data. These taxonomies, which emerge as information is accumulated, are known as folksonomies.
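A toy sketch of how pooled keywords can give rise to a shared index is shown below. It does not reflect how Piggy Bank or the Semantic Bank store data; the (user, item, tag) record shape, the sample values and the function pool_tags are hypothetical, and folding tags to lower case is a design choice of this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Saved = Tuple[str, str, str]  # (user, item, tag), a hypothetical record shape


def pool_tags(saved: Iterable[Saved]) -> Dict[str, Set[str]]:
    """Merge every user's tags into one index from tag to tagged items."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for _user, item, tag in saved:
        index[tag.strip().lower()].add(item)
    return dict(index)


if __name__ == "__main__":
    contributions = [
        ("alice", "http://example.org/recipe/1", "Dinner"),
        ("bob", "http://example.org/recipe/2", "dinner"),
        ("bob", "http://example.org/recipe/2", "vegetarian"),
    ]
    pooled = pool_tags(contributions)
    # Clicking on a keyword corresponds to looking up all items that carry it.
    print(sorted(pooled["dinner"]))
```

No vocabulary is fixed in advance: the set of keys in the pooled index is whatever the contributors happened to use, which is the sense in which the classification emerges as information accumulates.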
Solvent
Solvent is a Firefox extension that enables the user to write screen scrapers for Piggy Bank.
Gadget
Gadget is an XML inspector which enables the user to condense large amounts of well-formed XML data.
Welkin
Welkin is a graph-based RDF visualizer. It graphs RDF data sets, allowing users to visualize the global shape and clustering characteristics of the data, which can help them to model it mentally, see how it connects and identify mappings between the data set and possible ontologies. A particular data cluster which stands out when graphed might well be missed when browsed at closer range.
Fresnel
Fresnel is a vocabulary for specifying how RDF graphs are presented. Fresnel addresses the problem that, currently, each RDF browser and visualization tool decides on an ad hoc basis what information in an RDF graph is presented and how to present it. Fresnel uses the concepts of lenses and formats. Lenses determine which properties are displayed and how they are ordered. Formats control how resources and properties are presented.
Timeline
Timeline is a tool for visualizing events over time. It can be populated by pointing it at an XML file.
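To give a feel for what an XML file of events might look like, the sketch below writes one with Python's standard library. The element and attribute names used here (data, event, start, title) and the function build_event_file are assumptions made for illustration, not a statement of Timeline's actual schema, which should be taken from the tool's own documentation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_event_file(events: List[Dict[str, str]]) -> str:
    """Serialize a list of events as one XML document (assumed layout)."""
    root = ET.Element("data")  # assumed root element name
    for event in events:
        node = ET.SubElement(
            root,
            "event",                 # assumed element name
            start=event["start"],    # assumed attribute names
            title=event["title"],
        )
        node.text = event.get("description", "")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_event_file(
        [
            {"start": "2005-03-01", "title": "First release", "description": "Hypothetical event."},
            {"start": "2006-07-15", "title": "Second release"},
        ]
    )
    print(xml_text)
```

Keeping the events in a separate data file means the page that displays them does not have to change when events are added or corrected.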
Exhibit
Exhibit is a technology that enables developers to provide browsing of faceted classifications in a web browser.
Referee
Referee is a program that crawls the links that point to its user's pages. It extracts metadata from the referring pages and from the text around the links, converting it, if need be, into RDF. Referee distinguishes between the pages that refer to the user's pages and the comments, meaning the text immediately surrounding each link. It generates a data graph, which lets it show that, for example, exactly the same comment about the user's pages appears on more than one page, each page being a container of the comment. A page can have more than one comment, and a comment can appear on more than one page. This many-to-many relationship can be represented in a data graph, but not in a data tree such as the one generated by the XML data model.
RDFizer
The RDFizer project is a directory of tools for converting various data formats into RDF. MIT Libraries provides a home for some of these tools. Given a database of interest, these tools can often convert the data into an RDF representation without human intervention when the data formats are highly structured, first determining what ontology to use to express the information. Where semantic relationships are implicit, the RDFizers will not be as successful without human input.
The SIMILE project has built RDFizers that convert from the following formats:
- JPEG: Joint Photographic Experts Group (digital photo metadata).
- MARC: United States Library of Congress MAchine-Readable Cataloging of bibliographic data.
- MODS: Metadata Object Description Schema for bibliographic element sets.
- OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting.
- OCW: OpenCourseWare.
- EMail
- BibTeX: a tool for formatting lists of references, usually associated with LaTeX documents.
- Flat
- Weather
- Java: an object-oriented applications programming language.
- Javadoc: a tool for generating API documentation in HTML format from Java source code.
- Subversion (SVN): a software revision control system.
- Random
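The general idea behind such a converter can be sketched as a field-to-property mapping applied to a structured record. The code below is not one of the SIMILE RDFizers; the function rdfize, the sample record and the choice to map author to creator and year to date are assumptions for illustration. Only the Dublin Core element namespace and the N-Triples line layout are taken from the published standards.

```python
from typing import Dict, List, Tuple

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Assumed mapping from record fields to properties; a real converter
# would choose its ontology according to the source format.
FIELD_TO_PROPERTY = {
    "title": DC + "title",
    "author": DC + "creator",
    "year": DC + "date",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rdfize(subject_uri: str, record: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (N-Triples lines, fields with no mapping that need human input)."""
    lines: List[str] = []
    unmapped: List[str] = []
    for field, value in record.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is None:
            unmapped.append(field)
            continue
        lines.append(f'<{subject_uri}> <{prop}> "{escape_literal(value)}" .')
    return lines, unmapped


if __name__ == "__main__":
    entry = {"title": "An Example Entry", "author": "A. Author", "note": "to read"}
    triples, leftover = rdfize("http://example.org/doc/1", entry)
    print("\n".join(triples))
    print("needs review:", leftover)
```

Fields without an entry in the mapping are returned separately rather than guessed at, which mirrors the point above that implicit semantics call for human input.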
Crowbar
Crowbar is a web scraping environment based on a server-side headless Mozilla-based browser. It is used as a research prototype to investigate how to run Piggy Bank JavaScript scrapers from the command line and thus automate web site scraping.
See also
- Haystack, a related project from MIT which mostly concentrates on personal information management
External links
- SIMILE Project
- W3C Semantic Web Activity
- DSpace. Digital repository system which archives and makes accessible digital research material.