Condor cycle scavenger
Condor is an open source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks.
It can be used to manage workload on a dedicated cluster of computers, to farm out work to idle desktop computers (so-called cycle scavenging), or both. Condor runs on Linux, Unix, Mac OS X, FreeBSD, and contemporary Windows operating systems. Condor can seamlessly integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines (cycle scavenging) into one computing environment.
Condor is developed by the Condor team at the University of Wisconsin–Madison and is freely available for use. Condor follows an open source philosophy (it is licensed under the Apache License 2.0). It can be downloaded from the team's web site or obtained through the Fedora Linux distribution. It is also available from the repositories of other platforms, such as Ubuntu.
By way of example, the NASA Advanced Supercomputing facility (NAS) Condor pool consists of approximately 350 SGI and Sun workstations purchased and used for software development, visualization, email, document preparation, etc.
Each workstation runs a daemon that watches user I/O and CPU load. When a workstation has been idle for two hours, a job from the batch queue is assigned to the workstation and will run until the daemon detects a keystroke, mouse motion, or high non-Condor CPU usage. At that point, the job will be removed from the workstation and placed back on the batch queue.
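The start-and-evict behavior described above can be sketched as a small decision function. This is only an illustration of the policy in the NAS example, not Condor's actual daemon code: the `Workstation` fields, the `decide` function, and the CPU load cutoff are all assumptions introduced here.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_S = 2 * 60 * 60   # two hours of idleness, as in the NAS example
CPU_BUSY_THRESHOLD = 0.3         # hypothetical cutoff for "high" non-Condor CPU load


@dataclass
class Workstation:
    idle_seconds: float      # time since the last keystroke or mouse motion
    input_detected: bool     # keystroke or mouse motion seen since the last poll
    non_condor_load: float   # CPU load from processes other than the batch job
    running_job: bool        # whether a batch job currently occupies the machine


def decide(ws: Workstation) -> str:
    """Return 'start', 'evict', or 'wait' for one polling step."""
    owner_active = ws.input_detected or ws.non_condor_load > CPU_BUSY_THRESHOLD
    if ws.running_job:
        # An evicted job goes back on the batch queue.
        return "evict" if owner_active else "wait"
    if ws.idle_seconds >= IDLE_THRESHOLD_S and not owner_active:
        return "start"
    return "wait"
```

A scheduler loop would call `decide` for each workstation on every poll, assigning a queued job on `start` and requeueing the job on `evict`.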
Condor can run both sequential and parallel jobs. Sequential jobs can be run in several different "universes", including the "vanilla" universe, which can run most "batch ready" programs, and the "standard" universe, in which the target application is re-linked with the Condor I/O library to provide remote job I/O and job checkpointing. Condor also provides a "local" universe, which allows jobs to run on the submit host.
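To make the idea of a universe concrete, the following sketch renders a minimal submit description for a vanilla universe job. The helper `render_submit_description` and the file names are hypothetical, chosen only for illustration. The keywords it writes (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard submit description commands, but the set needed for a real job depends on the site and the Condor version.

```python
def render_submit_description(executable, arguments="", universe="vanilla", count=1):
    """Build the text of a minimal submit description file (illustrative only)."""
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        # $(Process) expands to the index of each queued job instance.
        "output     = job.$(Process).out",
        "error      = job.$(Process).err",
        "log        = job.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "analyze.sh" and its arguments are placeholder names.
    print(render_submit_description("analyze.sh", "--input data.txt", count=3))
```

The resulting text would typically be saved to a file and handed to `condor_submit`.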
For parallel jobs, Condor supports the standard MPI and PVM (Goux et al. 2000) in addition to its own Master Worker ("MW") library for extremely parallel tasks.
Condor-G allows Condor jobs to use resources not under its direct control. It is mostly used to talk to Grid and Cloud resources, like pre-WS and WS Globus, Nordugrid ARC, UNICORE and Amazon EC2. But it can also be used to talk to other batch systems, like Torque/PBS and LSF. Support for Sun Grid Engine is currently under development as part of the EGEE project.
Condor supports the DRMAA job API. This allows DRMAA compliant clients to submit and monitor Condor jobs. The SAGA C++ Reference Implementation provides a Condor plug-in (adaptor), which makes Condor job submission and monitoring available via SAGA's Python and C++ APIs.
Other Condor features include "DAGMan", which provides a mechanism to describe job dependencies.
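Describing job dependencies amounts to defining a directed acyclic graph in which a job may start only after all of its parents have finished. The sketch below computes one valid execution order for such a graph using Kahn's algorithm. It is a generic illustration, not DAGMan's input format or its scheduling logic, and the function `submission_order` and the job names are hypothetical.

```python
from collections import deque


def submission_order(parents):
    """Return one valid run order.

    `parents` maps each job name to the jobs that must finish before it starts.
    """
    # Include jobs that appear only as a dependency of another job.
    jobs = set(parents) | {dep for deps in parents.values() for dep in deps}
    remaining = {job: set(parents.get(job, ())) for job in jobs}
    children = {job: [] for job in jobs}
    for job, deps in remaining.items():
        for dep in deps:
            children[dep].append(job)

    # Jobs with no unfinished parents are ready to run.
    ready = deque(sorted(job for job, deps in remaining.items() if not deps))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(children[job]):
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    # Diamond-shaped workflow: A runs first, then B and C, then D.
    print(submission_order({"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}))
```

In a real workflow manager, jobs that become ready at the same time (B and C here) can run concurrently; the flat list only shows that every job appears after its parents.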
Condor is one of the job scheduler mechanisms supported by GRAM (Grid Resource Allocation Manager), a component of the Globus Toolkit.
Condor was the scheduler software used to distribute jobs for the first draft assembly of the Human Genome.
Whilst Condor makes good use of unused computing time, leaving computers turned on for use with Condor will increase energy consumption and associated costs. The University of Liverpool has demonstrated an effective solution for this problem using a mixture of Wake-on-LAN and the commercial power management software PowerMAN. Starting from version 7.1.1, Condor can hibernate and wake machines based on user-specified policies without the need for third-party software.