The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA) is a project, started in 2005, to catalogue the genetic mutations responsible for cancer using genome analysis techniques. TCGA represents an effort in the War on Cancer that applies recently developed high-throughput genome analysis techniques and seeks to improve our ability to diagnose, treat, and prevent cancer through a better understanding of the molecular basis of the disease.
In 2006 the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) selected the people and laboratories that would participate in the project. The goal of the project was to provide systematic, comprehensive genomic characterization and sequence analysis of three types of human cancer: glioblastoma multiforme, lung cancer, and ovarian cancer.
The project is unique in the size of the patient cohort interrogated (500 patient samples are scheduled, far more than in most genomics studies) and in the number of different techniques used to analyze the patient samples. Techniques in use include gene expression profiling, copy number variation profiling, SNP genotyping, genome-wide DNA methylation profiling, microRNA profiling, and exon sequencing of at least 1,200 genes. Recently the group organizing TCGA announced that it would sequence the entire genomes of some tumors and at least 6,000 candidate genes and microRNA sequences. This targeted sequencing is being performed by all three sequencing centers using hybrid-capture technology. A gene list is available on the TCGA website. In Phase II, TCGA will perform whole-exome sequencing on 80% of the cases and whole-genome sequencing on 80% of the cases used in the project.
TCGA expanded in 2009 from a pilot to a large-scale project. Over the next five years TCGA will provide genomic characterization and sequence analysis of 20 to 25 different tumor types. In FY 2010 a number of new centers were funded to characterize these new tumor types: Genome Characterization Centers (GCCs) and Genome Data Analysis Centers (GDACs) have been funded to move the project into its next phase. The fact that the Request for Applications (RFA) for the expanded phase of TCGA specifically funded these analysis cores reflects the growing need for dedicated bioinformatics funding in large-scale programs of this kind.
Project Goals
The goal of the pilot project was to demonstrate that a team of scientists from various institutions could use advanced genomic technologies to draw statistically and biologically significant conclusions from the diverse genomic data set generated by the project. Two tumor types were explored during the pilot phase: glioblastoma multiforme (GBM) and cystadenocarcinoma of the ovary. The goal of TCGA Phase II is to extend the success of the pilot project to more cancer types, providing a large, statistically significant data set for further discovery. More information about TCGA is available at the TCGA home page (http://cancergenome.nih.gov/), and TCGA data can be accessed through the TCGA Data Portal (http://tcga-data.nci.nih.gov/tcga/).
TCGA Project Management
TCGA is co-managed by a team of scientists and managers from the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). With the expansion of TCGA from the pilot phase to Phase II in October 2009, the NCI created a TCGA Program Office. This office, formerly directed by Joe Vockley, Ph.D., is responsible for the operation of six Genome Characterization Centers, seven Genome Data Analysis Centers, two Biospecimen Core Resource centers, the Data Coordinating Center, and approximately one third of the sequencing done for the project by the three Genome Sequencing Centers. In addition, the TCGA Program Office is responsible for coordinating the accrual of tissues for the project. Brad Ozenberger, Ph.D., project manager from the NHGRI, directs the other two thirds of the sequencing at the Genome Sequencing Centers.
The project is managed by a team composed of members from the NCI (Anna Barker, Ph.D., Joe Vockley, Ph.D., Kenna Shaw, Ph.D. and Carl Schaffer, Ph.D.) and the NHGRI (Mark Guyer, Ph.D., Brad Ozenberger, Ph.D., Peter Good, Ph.D. and Jane Peterson, Ph.D.). This team, along with a group consisting of all principal investigators funded by the project, makes up the Steering Committee. The Steering Committee is tasked with overseeing the scientific validity of the project, while the NCI/NHGRI project team ensures that the scientific progress and goals of the project are met, that the project is completed on time and on budget, and that the various components of the project are coordinated.
Tissue Accrual
Tissue requirements vary from tissue type to tissue type and from cancer type to cancer type. Disease experts from the project’s Disease Working Groups help to define the characteristics of the typical tissue samples accrued as “standard of care” in the United States and how TCGA can best use the tissue. For example, the Brain Disease Working Group determined that samples containing more than 50% necrosis would not be suitable for TCGA and that 80% tumor nuclei were required in the viable portion of the tumor. TCGA follows some general guidelines as a starting point for collecting samples from any type of tumor: a minimum sample size of 200 mg, no less than 80% tumor nuclei, and a matched source of germline DNA (such as blood or purified DNA). In addition, institutions submitting tissues to TCGA must have a minimal clinical data set as defined by the Disease Working Group, signed consents approved by their institution’s IRB, and a material transfer agreement with TCGA.
Recently, the NCI removed approximately $130M of ARRA funds from the NCI’s “Prime Contract” with Science Applications International Corporation (SAIC) to fund tissue accrual and a variety of other activities through the NCI Office of Acquisition. $42M is available for tissue accrual through the NCI, which uses “Requests for Quotations” (RFQs) and “Requests for Proposals” (RFPs) to generate purchase orders and contracts, respectively. RFQs are primarily used for the collection of retrospective samples from established banks, while RFPs are used for the prospective collection of samples.
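The general sample guidelines in this section (at least 200 mg of tissue, at least 80% tumor nuclei, a matched germline DNA source, and, for brain tumors, no more than 50% necrosis) amount to a simple screening rule. The following Python sketch shows one way such a rule could be written. The class, field, and function names are hypothetical and are not taken from any TCGA software; applying the necrosis limit to every sample is also an assumption, since the text states it only for the Brain Disease Working Group.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the text. Names are hypothetical, for illustration only.
MIN_MASS_MG = 200
MIN_TUMOR_NUCLEI_PCT = 80
MAX_NECROSIS_PCT = 50  # stated for brain tumors; applied to all samples here


@dataclass
class Sample:
    sample_id: str
    mass_mg: float
    tumor_nuclei_pct: float
    necrosis_pct: float
    has_matched_germline: bool


def qc_failures(sample: Sample) -> List[str]:
    """Return the reasons a sample fails the guidelines (empty list if it passes)."""
    failures = []
    if sample.mass_mg < MIN_MASS_MG:
        failures.append("mass below 200 mg")
    if sample.tumor_nuclei_pct < MIN_TUMOR_NUCLEI_PCT:
        failures.append("tumor nuclei below 80%")
    if sample.necrosis_pct > MAX_NECROSIS_PCT:
        failures.append("necrosis above 50%")
    if not sample.has_matched_germline:
        failures.append("no matched germline DNA")
    return failures


if __name__ == "__main__":
    # Made-up example samples.
    print(qc_failures(Sample("S1", 250, 85, 10, True)))
    print(qc_failures(Sample("S2", 150, 70, 60, False)))
```

Returning the list of failed criteria, rather than a single pass/fail flag, keeps the reason for each exclusion visible, which matters when a large share of submitted samples is rejected.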
Institutions that contribute samples to TCGA are paid for their samples. In addition, the contributing institution has access to all of the molecular data generated on their samples, while maintaining a link between the TCGA unique identifier and their own unique identifier. This permits contributing institutions to link back to the clinical data for their samples and to enter into collaborations with other institutions that have similar data on TCGA samples, thus increasing the power of outcome analysis.
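The identifier link described above can be pictured as a lookup table that a contributing institution keeps privately. The sketch below is a minimal illustration, assuming the molecular data are keyed by a TCGA identifier and the clinical records by the institution's own identifier. All identifiers, field names, and values are made up and do not follow the real TCGA identifier format.

```python
from typing import Dict


def join_molecular_to_clinical(
    id_link: Dict[str, str],
    molecular: Dict[str, dict],
    clinical: Dict[str, dict],
) -> Dict[str, dict]:
    """Merge molecular and clinical records through the TCGA-to-local ID link.

    Only samples present in all three tables are returned.
    """
    merged = {}
    for tcga_id, local_id in id_link.items():
        if tcga_id in molecular and local_id in clinical:
            merged[tcga_id] = {**molecular[tcga_id], **clinical[local_id]}
    return merged


if __name__ == "__main__":
    # Hypothetical example data.
    id_link = {"TCGA-X1": "LOCAL-17", "TCGA-X2": "LOCAL-42"}
    molecular = {"TCGA-X1": {"idh1_mutated": True}, "TCGA-X2": {"idh1_mutated": False}}
    clinical = {"LOCAL-17": {"survival_months": 30}}
    print(join_molecular_to_clinical(id_link, molecular, clinical))
```

Because the link table stays with the contributing institution, the merged view is available only to that institution and to collaborators it chooses to share it with.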
TCGA Funding
The NCI and NHGRI equally co-funded the Pilot Project with $50M for the first three years. The NCI has committed $25M/year of appropriated funds for five years for TCGA Phase II. The NHGRI has committed $25M/year of appropriated funds for two years. The beginning of the second phase of the project coincides with President Obama’s American Recovery and Reinvestment Act of 2009 (ARRA), which provides $153.5M of additional funding to the NCI beyond its appropriated funds. The Office of the Director of the NIH has provided another $25M of ARRA funds dedicated to sequence analysis, and will provide a further $25M of ARRA funds in the second year of Phase II if substantial progress is made during the first year. In all, $150M will be spent on sequencing. Another $70M will be spent on tissue accrual, sample QC and biomolecule (DNA and RNA) isolation.
Organization of the Project
TCGA has a number of different types of centers that are funded to generate and analyze data. TCGA is the first large-scale genomics project funded by the NIH to devote significant resources to bioinformatic discovery. The NCI has devoted 50% of TCGA appropriated funds, approximately $12M/year, to this purpose. Genome Characterization Centers and Genome Sequencing Centers generate data. Two types of Genome Data Analysis Centers use the data for bioinformatic discovery. Two centers are funded to isolate biomolecules from patient samples, and one center is funded to store the data. For more information on TCGA project organization, see http://cancergenome.nih.gov/newsevents/multimedialibrary/interactives/howitworks.
Biospecimen Core Resource (BCR)
There are currently two BCRs funded by the NCI: Nationwide Children’s Hospital and the International Genomics Consortium. These two centers are responsible for verifying the quality and quantity of tissue shipped by tissue source sites, isolating DNA and RNA from the samples, performing quality control on these biomolecules, and shipping samples to the GSCs and GCCs. The BCR contracts are currently being recompeted; more information is available at www.fbo.gov. The due date for proposals is June 4, 2010.
Genome Sequencing Centers (GSC)
There are three GSCs co-funded by the NCI and NHGRI: the Broad Institute, The Genome Center at Washington University, and Baylor College of Medicine. All three of these sequencing centers have shifted from Sanger sequencing to next-generation sequencing (NGS), although a variety of NGS technologies are being implemented simultaneously.
Genome Characterization Centers (GCC)
There are six GCCs funded by the NCI: the Broad Institute, Harvard, University of North Carolina, University of Southern California, Baylor College of Medicine and the British Columbia Cancer Center.
Data Coordinating Center (DCC)
The Data Coordinating Center is the central repository for TCGA data. It is also responsible for the quality control of data entering the TCGA database, and it maintains the TCGA Data Portal, where users access TCGA data. This work is performed under contract by bioinformatics scientists and developers from SRA International, Inc.
Genome Data Analysis Centers (GDAC)
There are seven GDACs funded by the NCI/NHGRI. The GDACs are responsible for the integration of data across all characterization and sequencing centers as well as the biological interpretation of TCGA data. The GDACs are the Broad Institute, University of North Carolina, the Lawrence Berkeley National Laboratory, University of California at Santa Cruz, MD Anderson Cancer Center, Memorial Sloan Kettering Cancer Center, and the Institute for Systems Biology. All seven GDACs work together to develop a pipeline for automated data analysis.
List of Tumors and Entrance of a Tumor Type into TCGA
A preliminary list of tumors for TCGA to study was generated by compiling incidence and survival statistics from the SEER Cancer Statistics website (http://seer.cancer.gov/). In addition, the current U.S. “Standard of Care” was considered when choosing the top 25 tumor types, as TCGA is targeting tumor types where resection prior to adjunct therapy is the standard of care. Availability of samples also plays a critical role in determining which tumor types to study and the order in which tumor projects are started. The more common a tumor is, the more likely it is that samples will be accrued quickly, so common tumor types, such as colon, lung and breast cancer, become the first tumor types entered into the project, before rare tumor types.
TCGA Targeted Tumors: lung squamous cell carcinoma, kidney papillary carcinoma, clear cell kidney carcinoma, breast ductal carcinoma, diffuse large B-cell lymphoma, renal cell carcinoma, cervical cancer (squamous), colon adenocarcinoma, stomach adenocarcinoma, rectal carcinoma, hepatocellular carcinoma, astrocytoma, head and neck (oral) squamous cell carcinoma, thyroid carcinoma, bladder urothelial carcinoma (nonpapillary), uterine corpus (endometrial carcinoma), invasive urothelial bladder cancer, pancreatic ductal adenocarcinoma, acute myeloid leukemia, prostate adenocarcinoma, lung adenocarcinoma, cutaneous melanoma, breast lobular carcinoma and multiple myeloma.
TCGA is accruing samples for all of these tumor types simultaneously. As samples become available, the tumor types with the most samples accrued will be entered into production. Rarer tumor types, tumor types for which samples are difficult to accrue, and tumor types for which TCGA cannot identify a source of high-quality samples will enter the “TCGA production pipeline” in the second year of the project. This will give the TCGA Program Office additional time to accrue sufficient samples. If TCGA plans to characterize 20 tumor types in five years and there are 25 potential tumor types on the list, five types of cancer will not be studied unless additional funds are made available.
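The entry rule described above (the tumor types with the most accrued samples go first, and only as many types as the budget allows are studied) can be sketched as a ranking with a cutoff. The function name and the sample counts below are hypothetical; the text gives no actual accrual figures.

```python
from typing import Dict, List, Tuple


def production_order(accrued: Dict[str, int], capacity: int) -> Tuple[List[str], List[str]]:
    """Rank tumor types by accrued sample count and split at the funded capacity.

    Ties are broken alphabetically so that the result is deterministic.
    Returns (types entering production, types left out).
    """
    ranked = sorted(accrued.items(), key=lambda kv: (-kv[1], kv[0]))
    names = [name for name, _ in ranked]
    return names[:capacity], names[capacity:]


if __name__ == "__main__":
    # Made-up accrual counts for illustration.
    counts = {
        "colon adenocarcinoma": 120,
        "lung adenocarcinoma": 95,
        "thyroid carcinoma": 40,
        "multiple myeloma": 10,
    }
    selected, left_out = production_order(counts, capacity=3)
    print(selected)
    print(left_out)
```

With a list of 25 candidate types and a capacity of 20, the second list would contain the five types that are not studied unless additional funds are made available.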
TCGA is currently utilizing ARRA funding to accrue both retrospectively and prospectively collected cases.
Glioblastoma Multiforme (GBM)
TCGA recently published its first results on GBM in Nature. These first results were based on 91 tumor-normal matched pairs, although the paper indicates that 587 biospecimens were collected for the study. The significant loss of samples, from 587 to 91, was due to the strict quality control placed on the specimens. These controls included the requirement that tumor samples contain at least 80% tumor nuclei and no more than 50% necrosis. Moreover, a secondary pathology assessment had to confirm the original diagnosis of GBM. A last batch of tumor-normal matched samples was excluded because the DNA or RNA collected was not of sufficient quality or quantity to be analyzed by all of the different platforms used in the study.
All of the data from the paper, as well as data collected since its publication, are publicly available at the Data Coordinating Center (DCC).
Most TCGA data are completely open access. However, one tier of data, containing information that could identify a specific patient, is protected. This Clinically Controlled-Access data can be accessed only by individuals who apply. Approval is granted on a case-by-case basis and requires the end user to submit an application to the Data Access Committee (DAC). This Data Use Certification provides evidence that the end user is a bona fide researcher asking a legitimate scientific question that merits access to individual-level data. The process is similar to that of other NIH-funded programs, including dbGaP.
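The attrition reported for the GBM study can be put in proportion with a short calculation. The sketch below uses only the two figures given in the text (587 biospecimens collected, 91 tumor-normal matched pairs analyzed); the function name is hypothetical.

```python
def retention_rate(collected: int, analyzed: int) -> float:
    """Fraction of collected biospecimens that survived quality control."""
    return analyzed / collected


if __name__ == "__main__":
    collected, analyzed = 587, 91
    excluded = collected - analyzed
    print(f"{excluded} excluded, {retention_rate(collected, analyzed):.1%} retained")
    # prints "496 excluded, 15.5% retained"
```

Roughly one specimen in six passed every quality control step, which helps explain why sample accrual is treated as a limiting factor throughout the project.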
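The two access tiers described above can be summarized as a small decision rule. This is a sketch only: the tier names and the single approval flag are simplifying assumptions, and it does not reproduce the actual DCC or DAC software.

```python
from enum import Enum


class Tier(Enum):
    OPEN = "open"              # no patient-identifying information
    CONTROLLED = "controlled"  # could identify a specific patient


def may_access(tier: Tier, has_dac_approval: bool) -> bool:
    """Open data is available to everyone; controlled data needs DAC approval."""
    return tier is Tier.OPEN or has_dac_approval


if __name__ == "__main__":
    print(may_access(Tier.OPEN, has_dac_approval=False))
    print(may_access(Tier.CONTROLLED, has_dac_approval=False))
    print(may_access(Tier.CONTROLLED, has_dac_approval=True))
```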
Since the publication of the first marker paper, several analysis groups within the TCGA Network have presented more detailed analyses of the glioblastoma data. An analysis group led by Roel Verhaak, PhD, Katie Hoadley, PhD, and Neil Hayes, MD, successfully correlated glioma gene expression subtypes with genomic abnormalities. The DNA methylation data analysis team, headed by Houtan Noushmehr, PhD and Peter Laird, PhD, identified a distinct subset of glioma samples that displays concerted hypermethylation at a large number of loci, indicating the existence of a glioma-CpG island methylator phenotype (G-CIMP). G-CIMP tumors belong to the proneural subgroup and were tightly associated with IDH1 somatic mutations.
Serous Ovarian
Opening a new era in cancer genome sequencing, TCGA reported in Nature in June 2011 on the exome sequencing of 316 tumor samples of high-grade serous ovarian cancer.
See also
- Cancer Genome Project at the Wellcome Trust Sanger Institute
- International Cancer Genome Consortium