Motivation: The volume and diversity of omic data have brought biological and biomedical research and
its applications into the big data era. Most statistical methods currently used to analyze omic data were
not designed to handle data at this scale. Principal component analysis and the multivariate methods used
to integrate multi-omic data are examples. Efficient and scalable functions are therefore required to
exploit the large amount of omic data that is currently available.
Results: We developed a library called BigDataStatMeth, which includes functions to perform basic
matrix operations and linear algebra on big matrices using HDF5 and Bioconductor's DelayedArray
infrastructure. We tested its performance by comparing its computation time with that of base R
functions. Our results showed that our implementation outperforms existing functions and that the
improvement grows as sample size increases. This package can serve as the basis for implementing
statistical methods required for omic data with a large number of samples or features. As a
proof of concept, we implemented PCA and Lasso regression within the same package, and we also
created another Bioconductor package, mgcca, which implements Generalized Canonical Correlation
Analysis (GCCA), a method used in multi-omic data integration. The implemented algorithm allows
individuals to be missing from one or more tables. The implemented methods have been used to analyze
real omic data. We first used PCA to call genotype inversions in more than 400K individuals from
UK Biobank. We then used TCGA data to integrate multiple omic layers with GCCA.