The Trans-Meta Platform is a completely updated version of the Parallel-Meta pipeline, designed for more efficient and comprehensive taxonomic and functional analyses of microbial communities. Firstly, modules for establishing, clustering and visualizing species-species co-occurrence networks have been developed and optimized. Secondly, multi-thread CPU and GPU computation has been optimized, yielding speed-ups of orders of magnitude. Thirdly, the analytical units in the package have been modularized more cleanly, with clearer connections defined through a configuration file, and APIs have been added for connecting external databases and tools, allowing more flexible operation on metagenomic data. Trans-Meta can thus facilitate an efficient and better understanding of the ecological patterns present in metagenomic data.
Fig. 1. General pipeline of Trans-Meta and its new analytical units. (A) User-defined runtime configuration file. (B) Import of QIIME and MOTHUR intermediate output. (C) Multiple algorithms for quantifying co-occurrence relationships between species, as well as for species-species network establishment and clustering. (D) Interactive network visualization based on HTML5 technologies. (E) Acceleration approaches for parallel computation.
1 Modularization of analytical units and customized platform
We re-designed the whole analytical package, separating the different analytical units into independent modules. APIs were defined to connect the modules in sequential or parallel order. Most modules can be invoked separately or recombined into different pipelines for specific tasks. The behavior of Trans-Meta and its modules is determined by parameters specified in the configuration file (Fig. 1(A)). By editing the configuration file, users can generate a customized pipeline for different purposes. To improve compatibility, we also developed modules for linking to external databases and resources, such as QIIME and MOTHUR intermediate output (Fig. 1(B)). Trans-Meta thus gives researchers more flexible options for further analysis of these outputs.
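The idea of a configuration file that selects and orders independent modules can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the section names, the keys (`modules`, `method`, `threshold`) and the three stand-in module functions are assumptions made for illustration and do not reflect the actual Trans-Meta configuration format or API.

```python
import configparser

# Hypothetical configuration text; Trans-Meta's real file format may differ.
CONFIG_TEXT = """
[pipeline]
modules = preprocess, profile, network

[network]
method = spearman
threshold = 0.6
"""

def preprocess(data, options):
    # Stand-in for pre-processing: drop empty sample names.
    return [sample for sample in data if sample]

def profile(data, options):
    # Stand-in for sample profiling: map each sample name to its length.
    return {sample: len(sample) for sample in data}

def network(data, options):
    # Stand-in for network establishment: record the chosen settings.
    return {
        "method": options.get("method", "pearson"),
        "threshold": float(options.get("threshold", "0.5")),
        "input": data,
    }

# Registry of available modules (illustrative names only).
MODULES = {"preprocess": preprocess, "profile": profile, "network": network}

def build_pipeline(config_text):
    """Read the configuration and return an ordered list of (function, options)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = [name.strip() for name in parser["pipeline"]["modules"].split(",")]
    steps = []
    for name in names:
        if name not in MODULES:
            raise KeyError(f"unknown module: {name}")
        options = dict(parser[name]) if parser.has_section(name) else {}
        steps.append((MODULES[name], options))
    return steps

def run_pipeline(steps, data):
    """Feed the output of each module into the next one."""
    for func, options in steps:
        data = func(data, options)
    return data

result = run_pipeline(build_pipeline(CONFIG_TEXT), ["sampleA", "", "sampleB2"])
```

Changing the `modules` line in the configuration text reorders or drops steps without touching the code, which is the kind of flexibility the modular design aims for.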
2 Multiple algorithms for species network establishment
Identifying species-species co-occurrence relationships in microbial communities is an important task for interpreting a community's ecological patterns. To uncover ambiguous correlations among species, and to further infer indirect associations, we adopted several algorithms for calculating species-species co-occurrence relationships. Firstly, we calculated the traditional Pearson correlation coefficient and Spearman's rank correlation coefficient between species. Then, based on this information, we adopted a graph-based method called FS-weight, originally developed for inferring protein interactions from protein-protein interaction network topology and interaction weight, to further explore indirect associations between species. Combining the two steps gives the Pearson-FS-weight and Spearman-FS-weight calculation schemes. In parallel, we proposed another pipeline for network analysis, in which species-species co-occurrence is inferred from sample composition data while accounting for non-linear dependence and the topological structure of the network, by employing the path consistency algorithm (PCA) based on part mutual information (PMI). In PCA-PMI, the conditional dependence between a pair of species is represented by the PMI between them, which overcomes the overestimation and underestimation problems, especially for tightly associated variables in a network (Fig. 1(C)).
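The first step described above, pairwise Pearson and Spearman correlation followed by edge selection, can be sketched in plain Python as follows. This is a minimal illustration only: the abundance values are made up, the function names and the absolute-value threshold rule are assumptions, and the FS-weight and PCA-PMI steps are not implemented here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated (an assumption)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cooccurrence_edges(abundance, method=pearson, threshold=0.8):
    """Return (species_a, species_b, r) for pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = method(abundance[a], abundance[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges

# Made-up abundances of three species across five samples.
abundance = {
    "sp1": [1, 2, 3, 4, 5],
    "sp2": [1, 4, 9, 16, 25],  # increases monotonically with sp1
    "sp3": [5, 3, 4, 1, 2],
}
edges = cooccurrence_edges(abundance, method=spearman, threshold=0.9)
```

Because sp2 is a monotone but non-linear function of sp1, the pair is perfectly correlated under Spearman's rank correlation but not under Pearson's, which shows why offering both measures can be useful.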
4 Efficient processing based on parallel computation
As the volume of metagenomic data is increasing rapidly, even a small analysis step can become computationally demanding. In microbiome data analyses, the most computationally demanding modules include network establishment and analysis, and several of these modules can be computed in parallel to speed up the whole process. In Trans-Meta, we have firstly optimized multi-thread programming to speed up parallel sample profiling: the pre-processing module, the sample profiling module and the function prediction module were implemented as independent threads and can run in parallel through multi-thread CPU processing. Secondly, we have optimized the network establishment process with GPU-based CUDA programming as follows: co-occurrence relationships among species are computed in parallel, and the load is balanced across threads. We have also reduced GPU memory latency by coalescing global memory accesses (Fig. 1(E)).
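The per-sample parallelism described above can be sketched with a thread pool from Python's standard library. This is an illustrative sketch only: `profile_sample`, `profile_all` and the toy read lists are hypothetical, and the code does not reproduce Trans-Meta's own multi-thread or CUDA implementation. In CPython, threads mainly help with I/O-bound work, so the sketch shows the structure of the approach rather than its achievable speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def profile_sample(sample):
    """Count how often each taxon label occurs in one sample (toy profiling)."""
    name, reads = sample
    counts = {}
    for taxon in reads:
        counts[taxon] = counts.get(taxon, 0) + 1
    return name, counts

def profile_all(samples, workers=4):
    """Profile samples independently; each sample is handled by one worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input samples.
        return dict(pool.map(profile_sample, samples))

# Toy input: each sample is a name plus a list of taxon labels assigned to its reads.
samples = [
    ("gut_1", ["Bacteroides", "Prevotella", "Bacteroides"]),
    ("gut_2", ["Prevotella"]),
    ("soil_1", []),
]
profiles = profile_all(samples, workers=2)
```

Because no sample depends on another, the work splits cleanly across workers, which is the property that makes the profiling stage suitable for multi-thread execution.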
Trans-Meta referred to or integrated the following databases and tools:
1. Greengenes database: the 16S rRNA gene database. DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen. 2006. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72:5069-72.
2. Tara Oceans Project. The scientific activities of the Tara Oceans expedition, led by EMBL senior scientist Eric Karsenti, represent an unprecedented effort that resulted in 35,000 samples containing millions of small organisms, collected at more than 210 ocean stations chosen for their climatic significance or biodiversity.