TY - GEN
T1 - Linked cancer genome atlas database
AU - Saleem, Muhammad
AU - Padmanabhuni, Shanmukha S.
AU - Ngonga Ngomo, Axel-Cyrille
AU - Almeida, Jonas S.
AU - Decker, Stefan
AU - Deus, Helena F.
PY - 2013
Y1 - 2013
AB - The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional pilot project to create an atlas of genetic mutations responsible for cancer. One of the aims of this project is to develop an infrastructure for making the cancer-related data publicly accessible, to enable cancer researchers anywhere in the world to make and validate important discoveries. However, data in the Cancer Genome Atlas are organized as text archives in a set of directories. Devising bioinformatics applications to analyse such data is still challenging, as it requires downloading very large archives and parsing the relevant text files in order to collect the critical covariates necessary for analysis. Furthermore, the various types of experimental results are not connected biologically, i.e., to truly exploit the data in the genome-wide context in which the TCGA project was devised, the data need to be converted into a structured representation and made publicly available for remote querying and virtual integration. In this work, we address these issues by RDFizing data from TCGA and linking its elements to the Linked Open Data (LOD) Cloud. The outcome is the largest LOD data source (to the best of our knowledge), comprising over 30 billion triples. This data source can be exploited through publicly available SPARQL endpoints, thus providing an easy-to-use, time-efficient, and scalable solution for accessing the Cancer Genome Atlas. We also describe showcases that are enabled by the new linked data representation of the Cancer Genome Atlas presented in this paper.
KW - LOD
KW - SPARQL
KW - TCGA
UR - http://www.scopus.com/inward/record.url?scp=84885204211&partnerID=8YFLogxK
U2 - 10.1145/2506182.2506200
DO - 10.1145/2506182.2506200
M3 - Conference contribution
AN - SCOPUS:84885204211
SN - 9781450319720
T3 - ACM International Conference Proceeding Series
SP - 129
EP - 134
BT - Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS 2013
T2 - 9th International Conference on Semantic Systems, I-SEMANTICS 2013
Y2 - 4 September 2013 through 6 September 2013
ER -