Analyze your

mycobiome

with DAnIEL

DAnIEL (Describing, Analyzing and Integrating fungal Ecology to effectively study the systems of Life) is a web server to analyze ITS amplicon sequencing data beginning from raw reads to publication ready figures and tables. Interactions between fungal species and diseases, bacterial species, and immune system components can be explored. A knowledge base of clinical relevant species provides additional annotation to biological findings

Pipeline Overview

Overview

  1. Upload raw reads and meta data of samples (optional)
  2. Set parameters and start the pipeline
  3. Integration of existing projects and download of NCBI SRA runs
  4. Quality Control: Trimming primers and low quality bases using
  5. Denoising: Getting representative sequences by calling OTUs or ASVs
  6. Phylotyping: Taxonomic annotation of denoised sequences
  7. Feature generation: Grouping denoised sequence by taxonomic rank and count normalization
  8. Correlation networks: Identification of co-abundant species aware of sparseness and composition
  9. Machine Learning: Phenotype classification using fungal abundance profiles
  10. Statistics: Identification of significantly differentially abundant features
  11. Annotation: Integrating significant features with known insights from existing projects and literature
  12. Reporting: Creation of a single html file summarizing all results and methods applied
fungal databases

Fungal knowledge base

  • Manually curated database about fungal interactions and their associations to diseases, bacterial species, and biological molecules focused on immune system compounds
  • List of species found in clinical samples with a suspicious fungal infection from the German reference center NRZMyk
plots

Publication ready plots

  • Customizable and interactive plots
  • Single html file report summarizing both results and methods

Cite us

DAnIEL: a user-friendly web server for fungal ITS amplicon sequencing data
Daniel Loos, Lu Zhang, Christine Beemelmanns, Oliver Kurzai, and Gianni Panagiotou
Frontiers in Microbiology, doi: 10.3389/fmicb.2021.720513

Contact us

Maintainer and Developer: Daniel Loos daniel.loos@leibniz-hki.de
Head: Gianni Panagiotou gianni.panagiotou@leibniz-hki.de
Website: https://www.leibniz-hki.de/en/systembiologie-und-bioinformatik.html

Systems Biology and Bioinformatics
Leibniz Institute for Natural Product Research and Infection Biology
Hans Knöll Institute (HKI)

Adolf-Reichwein-Straße 23
07745 Jena
Germany
tutorial.knit

Tutorial

Here we give a brief introduction about how to use this web server to analyze ITS amplicon data.

The aim of this tutorial is to explain the overall workflow and not to exercise a real world ITS analysis. For instance, the provided samples are sub sampled to speed up the analysis.

The Biology of ITS sequencing

The Internal transcribed spacer (ITS) is a region in the eukaryotic genome between the genes coding for the small-subunit rRNA and the large-subunit rRNA. The ITS1 sub region in fungi is located between 18S and 5.8S rRNA genes, whereas ITS2 is between 5.8S and 28S rRNA genes. ITS regions are widely used in taxonomy due to these reasons:

  • The sequence has a high degree of variation, especially between closely related species
  • The size of one sub region can be easily covered with PE NGS reads
  • There are universal primers available in which most fungal clades can be easily amplified

Often there are multiple copies of ITS in the same genome. However, they tend to undergo concerted evolution via unequal crossing-over events. This preserves the sequence homogeneity within a particular genome.

Most projects in the field of metagenomics have only one the two sub regions sequenced. It is not recommended to compare abundance levels between projects using different sub regions due to PCR amplification bias. Besides ITS, other marker sequences e.g. translational elongation factor 1α (TEF1α) are also used for taxonomic profiling. These markers are not covered by this webserver. However, sometimes multiple marker genes have to be sequenced for precisely taxonomic annotation up to the species level. This is also up to the clade of interest: It is usually more difficult to classify molds than yeasts using ITS region only.

Input: Uploading samples

In this tutorial we will demonstrate how to analyze the different fungal communities in fecal and environmental samples. Totaling 38 samples will be analyzed from these cohorts:

  • 5 fecal samples from cancer patients (ENA study PRJEB33756)
  • 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017)
  • 13 environmental samples from a neonatal intensive care unit (NICU) (Heisel et al., 2019)

Start a new project

If you just want to learn about the workflow, you do not need to start a new project. All results can be already accessed on the left sidebar if the current project ID is example.

  1. Click on Home on the sidebar and then on Start project to create a new project.
  2. This will create a new ID for your project. This ID can be used to access the results later on by clicking on Resume project. You can also create a bookmark of any page: The ID is always included in the URL.
  3. Download the example raw data used in this tutorial:

Example raw data (ZIP)

Composing the sample meta data table

The meta data table is a spreadsheet describing the samples. It must be provided either as a Excel spreadsheet (.xlsx) or as a comma-separated values text file (.csv) It must contain at least one column called sample_id listing the ids of all uploaded samples. For machine learning, column names must contain alphanumerical characters only (e.g. no + or /).

Additional samples can be added from databases like the NCBI Sequence Read Archive (SRA). The meta data table will be automatically updated for those samples. However, one can annotate external samples using the meta data table. Here, sample_id must be the NCBI SRA id e.g. SRR5098710. We will do this for the 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017). For instance, we will assign these samples to the project healthy feces.

Raw read files can contain multiple samples, because this webserver is able to perform demultiplexing. However, one has to declare the multiplexed samples in the meta data table. Let’s have a look at the first lines of the multiplexed raw read file muxed_S4_S5_1.fq.gz in FASTQ format:

@FCBKRH9:1:2119:15517:24531 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@FCBKRH9:1:2112:8633:25105 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGFCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGG<CFGGGGGGGGGG
@FCBKRH9:1:2112:6602:7309 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC

The bar code sequence is located at the end of the header of each read (every 4th line). We need a mapping table to know which bar code belongs to which sample. In this example, bar code CAGTGCATATGC belongs to sample S4. This is usually determined during library preparation. We have to add two new columns to the meta data table for all multiplexed samples:

Part of meta data table describing the demultiplexing
sample_id barcode_seq barcode_file
S4 CAGTGCATATGC muxed_S4_S5
S5 TCACGGGAGTTG muxed_S4_S5

Column barcode_seq describes the bar code sequence to look for. Tis can be any substring of the header line. Column barcode_file indicates in which file the sample was sequenced.

In order to define groups e.g. for statistics, we define also a few other columns. The columns can contain character strings (e.g. project), binary values (FALSE or TRUE e.g. low_seqdepth) or numbers (e.g. sequencing depth seqdepth). If a column contains unique values only (e.g. library), it will not be used in any statistical analysis due to lacking replica. This columns will be ignored instead. We can provide another column called pair_id used for paired testing of differences. This is needed e.g. if the same subjects were sequenced multiple times.

Here is the content of the meta data file samples.xlsx to be uploaded:

Uploaded meta data table (samples.xlsx, first 10 rows)
sample_id barcode_seq barcode_file project disease low_seqdepth seqdepth library
S1 NA NA cancer feces cancer FALSE 83070 cancer_01A1
S2 NA NA cancer feces cancer FALSE 161697 cancer_01B1
S3 NA NA cancer feces cancer FALSE 166414 cancer_08A1
S4 CAGTGCATATGC muxed_S4_S5 cancer feces cancer FALSE 87166 cancer_09A1
S5 TCACGGGAGTTG muxed_S4_S5 cancer feces cancer FALSE 42744 cancer_10A1
SRR5098710 NA NA healthy feces healthy TRUE 935 HMP_SRR5098710
SRR5098580 NA NA healthy feces healthy TRUE 187 HMP_SRR5098580
SRR5098736 NA NA healthy feces healthy FALSE 102113 HMP_SRR5098736
SRR5098677 NA NA healthy feces healthy FALSE 104592 HMP_SRR5098677
SRR5098500 NA NA healthy feces healthy FALSE 104812 HMP_SRR5098500

Raw read format

Raw reads must be in gzip-compressed FASTQ format. In the reads directory of example.zip, there are 8 files containing raw sequencing data from all 6 fecal samples from cancer patients:

muxed_S4_S5_1.fq.gz
muxed_S4_S5_2.fq.gz
S1_1.fq.gz
S1_2.fq.gz
S2_1.fq.gz
S2_2.fq.gz
S3_1.fq.gz
S3_2.fq.gz

Please note:

  • Raw reads must be originated from illumina® paired-end (PE) sequencing machines.
  • Forward and reverse mates must be uploaded in two separated files.
  • Only paired FASTQ files following pattern {sample id}_{mate}.{file extension} were accepted. The file extension must be one of .fastq, .fq, fastq.gz or fq.gz. The mate muste be either 1 or 2 indicating the forward and reverse mate, respectively. Example for single samples: sample1_1.fq.gz. Example for multiplexed samples: muxed_S4_S5_2.fq.gz.

Uploading own local samples

We will only upload the 5 fecal samples from cancer patients. Click on Input to upload new samples to your project. Here we can upload both the raw read files generated by the sequencing machine and the table containing meta data about the samples:

Upload local samples

Please wait until all files are being uploaded. A progress bar underneath the button will be provided to indicate upload status.

Adding external samples

The other two cohorts have already been uploaded to the NCBI Sequence Read Archive (SRA). This webserver is able to access this database directly. We have two options to add samples from external cohorts:

SRA Accessions

Every sequencing run stored in the SRA has a unique ID starting with either SRR, ERR or DRR followed by a number. We must enter one sample per line in the text box SRA accessions. In this example, we insert all 20 fecal samples from healthy individuals:

SRR5098710
SRR5098580
SRR5098736
SRR5098677
SRR5098500
SRR5098394
SRR5098370
SRR5098413
SRR5098489
SRR5098436
SRR5098698
SRR5098401
SRR5098363
SRR5098721
SRR5098511
SRR5098498
SRR5098398
SRR5098371
SRR5098341
SRR5098421

External projects

Many ITS projects of the SRA are already downloaded and preprocessed. We will select the project hospital metagenome Targeted loci environmental (PRJNA505509) in the drop down menu External projects by clicking at the box and typing hospital to search for the study. Please click Projects on the sidebar to browse through all available projects.

We can subset samples from external cohorts by any meta data variable as displayed at Projects. This can be archived using a filter query. We will enter the query scientific_name == 'hospital metagenome' & total_spots > 30e3 to the text box Filter query for external projects. This will only include environmental samples of the NICU project which have at least 30*10³ raw reads. Character strings must be single quoted. == is used to indicate equality. > and < can be used to filter numerical attributes. Single terms can be combined using & or | to yield the sample-wise intersection or union over these terms, respectively. Please note that both %>% and parenthesis are blocked in this filed for security reasons. We can always filter samples according to these attributes:

Attributes available in all external projects to filter samples
Attribute Description
total_spots Number of raw reads
total_bases Number of raw bases
size File size in bytes
biosample_id NCBI accession of the biological sample starting with SAM
scientific_name Name of taxa of the sample e.g. human gut metagenome

A click on Add external samples will add these sample to the project. The column project will be filled with the NCBI BioProject id.

Upload local samples

Start: Setting parameters and starting the pipeline

First, click on Start on the sidebar. Here we can run the pipeline with default parameters by just clicking on Start pipeline. This will take around half an hour for the example cohort to run on our servers using 5 CPU threads. Depending on the number of samples and parameters, this can take up to a couple of hours. Please create a bookmark of the page or remember the project ID to view the results later on. Results can be accessed on the sidebar above the results menu once analyzed. We also have the option to create and select own parameter sets. A complete reference about all input elements and parameters is given at Reference on the sidebar.

Adjust the parameters

We can customize the analysis workflow by creating our own parameter set for each step. Each parameter set must be given a unique name consisting of alphanumerical characters and spaces.

The default ITS sub region is ITS1. If you have ITS2 data, you have to adjust the denoising parameter set!

Here, a briefly summary about changed parameters in the example project is given. A complete description about all parameters can be accessed by clicking Reference on the sidebar.

  • Quality Control
    • Aim: Remove sequencing errors by trimming and filtering the raw reads

    • We want to have our own primers ITS3 and ITS4 to be removed. Therefore we enter these primers in the Sequences to trim text box:

      >Forward primer ITS3
      GCATCGATGAAGAACGCAGC
      >Reverse primer ITS4
      TCCTCCGCTTATTGATATGC

      These sequences must be in FASTA format.

      • To remove the primers on both strands, ensure that the checkbox Include reverse complement of primer sequences is checked
  • Denoising
    • We will use Amplicon Sequencing Variants (ASV) as a method for denoising. Operational taxonomic units (OTU) can be used as well.
    • Our example project covers the ITS2 sub region. Therefore we click on the radio button ITS2.
  • Phylotyping
    • Aim of this step: Taxonomic annotation of these denoised representative sequences
    • We stay with the default parameters here
  • Feature generation
    • Aim of this step: Normalization, filtering and pooling denoised abundance profiles
    • We stay with the default parameters here. This includes to pool the abundances at genus rank, because many fungal clades can not determined up to the species rank using the ITS region only. However, it is possible to select the species rank if the data is trusted.
  • Analysis
    • Aim of this step: Creating statistics, Machine Learning, and correlation networks based on these features
    • Performing statistics and Machine Learning on hundreds of sample attributes can be very time consuming. Therefore, we can select the particular analysis groupings of interest. In this case: project, disease, and low_seqdepth.
    • Building correlation networks on all samples is meaningless in heterogeneous cohorts like ours. We want to compare correlations between healthy and diseased samples. Thus we select disease as the correlation grouping.

Logs

Once the pipeline has started, we can view the logs of the workflow engine snakemake by clicking Logs on the sidebar. This is an optional feature of this webserver and useful for debugging processes.

Result pages

Once a step is finished, a result page will be appear at the results section on the sidebar menu. All result pages follow the same structure:

  • Brief description about the methods and parameters applied
  • Results including summary bullet points and interactive figures
  • An optional notes section with comments how to interpret the results and which errors might occurred
  • A list of literature references of used tools
  • Optional panels for interactive analysis and to download the results

Quality Control (QC)

Quality Control (QC) is the process of removing low quality bases from the raw reads. This is done by utilizing the Phred quality score as calculated during base calling. Furthermore, primer and adapter sequences are cut.

This is the first analysis result which will appear at the result section. The test matrix indicates whether a sample failed, raised a warning, or passed a particular QC test.

QC tests by sample

By default and in our example project, samples will be excluded from downstream analysis, if any of these tests failed:

  • Adapter content is too high
  • Per base N content is too high
  • Per base sequence quality is too bad
  • Per sequence quality scores are too low
  • Minimum number of QC reads < 1000

All tests except the last one are calculated using FastQC In this table, we can see that 3 samples were excluded due to too shallow sequencing. Indeed, samples SRR5098710 and SRR5098580 do not satisfy this requirement even on raw read basis. They were included in this example project to demonstrate both the functionality of this webserver and the fact that not every sample uploaded in the SRA database is of well quality.

excluded samples
sample_id reason
S4 min_qc_read_count_failed
SRR5098580 min_qc_read_count_failed

The effect of QC can be impressively visualized by downloading the MultiQC reports at the bottom of this page (Ewels et al., 2016). Here we can see that the adapter content and per base sequence quality has been drastically improved after QC. This is typical for ITS analyzes. If the length of the ITS sub region is smaller than the fragment length during sequencing, not only the ITS subregion itself but also the adapter of the other read mate is being sequenced. This results in high adapter content and low sequence quality due to steric hindering during the process of sequencing by synthesis.

Raw Clean
Adapter clean Adapter clean
Adapter clean Adapter clean

Denoising

Denoising is the process of identifying correct biological sequences from quality controlled reads. This is done by removing potential sequencing errors.

As a sanity check, ITSx was used on the ASV sequences to see if the selected subregion ITS2 was indeed present in the raw reads:

Sanity checks

A tree of all representative sequences was generated using QIIME2, MAFFT, and FastTree (Bolyen et al., 2019; Katoh and Standley, 2013; Price et al., 2009). Here is a multiple sequence alignment of the 30 most prevalent ASVs: Multiple sequence alignment of ASVs Please note that this tree might not reflect proper phylogeny. To do so, one need usually a much larger amplicon e.g. a whole subunit of the rRNA genes. This tree was not used anywhere else in the analysis and is only to visualize differences between all inferred representative sequences.

DADA2

This step is specific to ASV profiling. They will not be used if OTU profiling was selected in the denoising parameter set.

This webserver process ASV profiling according to the official DADA2 ITS Pipeline Workflow (1.8) in a sample-wise manner. Sequencing errors are modeled for each sample and base transition separately. See the DADA2 paper for more details (Callahan et al., 2016). Here we can see the distribution of error frequency across all samples corresponding to the consensus quality score: DADA2 error model

During the steps of the DADA2 workflow, reads undergo additional filtering and trimming. Not every qc read can be assigned to a final ASV representative sequence. The read loss can be assessed here: Read loss during denoising

Phylotyping

Phylotyping is the process of taxonomic annotation of the denoised representative sequences.

59% of the AVS representative sequences were not annotated even at phylum level: Phylotyped sequences

However, taking taxon abundance into account, the majority of the reads are indeed taxonomically assigned: 92% and 94% at species and phylum rank, respectively. Most representative sequences are unclassified, but very low abundant. ASV profiling searches for very subtile sequence differences up to a single base. This results into a larger number of representative sequences compared to OTU profiling. Phylotyped abundances

Feature generation

Feature generation is the process of pooling denoised sequences at a given taxonomic rank. Furthermore, abundance profiles were normalized to make them comparable between differentially sequenced samples.

Features can be anything from species abundances to phylum abundances depending on the parameter set chosen. In our example, we have genus features normalized using centered log-ratio transformation (CLR). Usually there are hundreds of features remaining after feature generation. To visualize the abundance of all samples and all features, one can either generate a heatmap or reduce the dimensionality of the data set using a process called ordination. Let’s look at the results of the latter one:

Bray-Curtis unweighted UniFrac
Bray-Curtis ordination UniFrac ordination

The webserver offers an interactive way to create ordination plots. Sample groupings, dissimilarities and ordination methods can be individually selected. In this example, Principal Coordinates Analysis (PCoA) using the dissimilarities Bray-Curtis and unweighted UniFrac was performed.

The project groups were significantly differentially abundant in both ordinations (Adonis PERMANOVA, \(p < 0.001\)). With UniFrac, however, we are able to explain 54.3% of the total variance in the first ordination axis PCo1. On the other hand, we can only explain 30.6% of it using Bray-Curtis dissimilarity. Furthermore, the sample groups tend to be more compact using UniFrac ordination.

Both dissimilarity methods aim to express the distance between two samples considering the abundances of all taxa. Bray-Curtis treats taxa as disconnected sets, whereas UniFrac takes phylogeny into account.

Let \(i\) and \(j\) be two sample groups, \(S\) the total number of species, and \(C_{ij}\) the the sum of the lesser values for only those features in common between both sample groups. Then the distances are calculated using:

\[ d_{Bray-Curtis} = 1 - \frac{2C_{ij}}{S_i + S_j} \]

\[ d_{UniFrac} = \frac{\sum \textrm{unshared branch lengths}}{\sum \textrm{all tree branch lengths}} \]

Both distances do not make a difference how much a particular feature is abundant. They only consider the pure existence. We can use weighted UniFrac to also incorporate abundance. UniFrac requires a phylogenetic tree. In this example, the tree consist of the ranks genus, family, etc. up to the kingdom. The distance of all child nodes to their parents is set to 1. This is because sequence identity from a multiple sequence alignment of denoised sequences does not correlate with phylogenetic distance in the fungal kingdom using only the ITS region.

Correlation Networks

Correlation is used to assess co-abundant features.

We used the correlation grouping disease which gave us two networks for the healthy and the diseased individuals. Here is the SparCC correlation network of the healthy individuals with default thresholds for effect size and significance (Friedman and Alm, 2012; Watts et al., 2019):

Correlation network

We can hover the cursor over an edge to get a detailed description of the correlation. Moreover, the network can be also displayed as a table by clicking on the particular tab.

For example, Malassezia is significantly negativeley correlated to Cyberlindnera (\(c_{SparCC}=-0.402, p = 0.0249\)). The p values were obtained using permutation. Other correlation methods such as BAnOCC provide confidence intervals instead of p values.

Spurious correlations

Be always aware of potential false positive correlations. These spurious correlations can be omnipresent in microbiome data.

The abundance profile of Dekkera seems to be highly zero-inflated (see abundance boxplot at section Statistics). Only one sample got an outstanding high abundance of this genus. The correlation to Saccharomyces we observed above might be driven by this outlier. It is recommended to use a high prevalence threshold during feature generation. This implies avoiding the use of Spearman’s rho correlation. Furthermore, one should use a correlation method aware of sparseness (e.g. SparCC) and composition (e.g. BAnOCC). Different correlation methods can differ drastically in specificity and sensitivity (Weiss et al., 2016). A review emphasizing issues regarding the compositionality of relative count data is (Gloor et al., 2017).

Statistics

Here, statistics refers to comparisons of feature abundance between sample groups.

Testing the same subjects multiple times requires paired testing. Subject ids can be mapped to sample ids by providing a column pair_id in the sample meta data table. We do not have to provide this pairing information in our example. We can create box plots with significance labels of any feature and sample grouping:

Two means comparison may means comparison

Two means tests

Testing differences in the disease grouping is very simple, because disease is a binary variable (it can be either healthy or cancer)

Significant two means comparisons

Three or more means tests

The project sample grouping has three unique levels: cancer feces, healthy feces, and the environmental samples of BioProject PRJNA505509. We have to apply a two staged test here: First, Kruskal-Wallis rank sum test on normalized abundances was applied to test if there is any difference between the project groups.

Significant three or more means comparisons

Then, Dunn’s test was applied to test which groups are significantly differentially abundant. This is called post hoc testing.

Significant post hoc tests

Machine Learning (ML)

Machine Learning (ML) is the automated process of using features to build a model able to distinguish between any given classes. This model can be used to predict properties of new data sets.

Model construction

Both random forests and Support Vector Machines (SVM) were used as models. A random forest consist of multiple decision trees. Only a subset of samples and features are used in each decision tree. These models tend to be very robust. A SVM is a hyperplane dividing the features in two areas. The feature space was stretched using a Gaussian function (rbf kernel) to account for non-linear relationships.

These models have different hyperparameters to tune:

Hyperparameters
Model type Hyperparameter Description
random forest mtry Number of variables available for splitting at each tree node
random forest ntree Number of decision trees
SVM C Regularization penalty for the trade of between low testing and low training error

Not all features were used in all models: Feature selection was applied using both recursive feature selection and selection by filtering. The aim of this subsetting is to prevent over-fitting so that the model generalizes well at the prediction of new data sets. In recursive feature selection, features are removed iteratively whereas in selection by filtering, ANOVA was used to filter features not related to the property we want to classify.

Each set of hyper parameters and selected features were validated using nested cross-validation.

Model performance

The sample property we want to classify is called target class. ML models for all categorical variables of the samples meta data table were trained. Random forests performed best in the classification of both disease and project sample target property. Feature selection improved the validation performance: All best models used recursive feature selection.

Feature importance

Multiple metrics were calculated to validate model performance:

  • Sensitivity (a.k.a recall or true positive rate)
  • Specificity (a.k.a true negative rate)
  • Area under the receiver operating characteristic curve (AUC)

Let’s imagine a model which classifies all samples with the label cancer. This model is very sensitive: Obviously, all true cancer samples are classified correctly. But this is also for all healthy samples the case. This is why the specificity of this imagined model is 0. The performance metric to choose depends on the use case: We want a sensitive classifier for diagnosing a very severe disease for which a drug with no side effects is available. In this situation we do not care so much about false positives. However, we should focus on building a specific classifier whenever a misclassification has severe implications.

These metrics were generalized for targets with more than two levels (e.g. project): Here multiple models are trained e.g. cancer vs (healthy and environment together). Then, the performance values of the “one versus all” statistics are averaged.

Feature importance

Feature importance implies neither biological importance nor causality. Important features in ML models are only hints for new possible biological insights.

One can also identify features which are important in this classification process. In random forest models, we can count how often a particular feature was used at any node in the decision trees. The mean decrease in gini importance is one way to measure this feature importance. The gini importance value of Candida is very high in random forest models classifying the project. Therefore Candida might also play a different biological rule in healthy, cancer, and environmental samples. Feature importance

Summary

You can download the full findings summary either as Excel table or HTML report at the download section of the Summary webpage.

Here, significant results from all tests are combined. Furthermore, significant features were annotated with known interactions and related literature.

Let’s see whether we have findings for Wallemia: Summary of findings

This is a summary of all significant group-wise comparisons and correlations. Furthermore, features were annotated with FUNGuild and two more databases: One consist of clinical isolates with suspicious fungal infections from the NRZMyk reference database. It can be accessed by clicking at Infections on the sidebar.

The other one is about known interactions to other fungi, diseases, and immune system related molecules. This database was created using manual curation of article abstracts of fungi prevalent in fecal and skin samples. It can be accessed by clicking at Interactions on the sidebar.

References

Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857.

Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., and Holmes, S.P. (2016). DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods 13, 581–583.

Ewels, P., Magnusson, M., Lundin, S., and Kaller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England) 32, 3047–3048.

Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic survey data. PLoS Computational Biology 8, e1002687.

Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., and Egozcue, J.J. (2017). Microbiome Datasets Are Compositional: And This Is Not Optional. Front Microbiol 8, 2224.

Heisel, T., Nyaribo, L., Sadowsky, M.J., and Gale, C.A. (2019). Breastmilk and nicu surfaces are potential sources of fungi for infant mycobiomes. Fungal Genetics and Biology : FG & B 128, 29–35.

Katoh, K., and Standley, D.M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780.

Nash, A.K., Auchtung, T.A., Wong, M.C., Smith, D.P., Gesell, J.R., Ross, M.C., Stewart, C.J., Metcalf, G.A., Muzny, D.M., Gibbs, R.A., et al. (2017). The gut mycobiome of the human microbiome project healthy cohort. Microbiome 5, 153.

Price, M.N., Dehal, P.S., and Arkin, A.P. (2009). FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650.

Watts, S.C., Ritchie, S.C., Inouye, M., and Holt, K.E. (2019). FastSpar: Rapid and scalable correlation estimation for compositional data. Bioinformatics (Oxford, England) 35, 1064–1066.

Weiss, S., Treuren, W.V., Lozupone, C., Faust, K., Friedman, J., Deng, Y., Xia, L.C., Xu, Z.Z., Ursell, L., Alm, E.J., et al. (2016). Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J 10, 1669–1681.

reference.utf8

Reference

This page is about the specification of the server. A detailed description about tools, parameters, and tested browsers is given.

Tools

The general framework is based on Snakemake (Koster and Rahmann, 2012), R, Tidyverse, and Docker.

Tools
Pipeline step Tool Description Version Citation Source
qc cutadapt QC reads by filtering and trimming 0.39 Martin (2011) Anaconda
FastQC Assessing sequence read quality 0.11.8 Anaconda
MultiQC Combining FastQC results from multiple samples 1.7 Ewels et al. (2016) Anaconda
denoising DADA2 Denoising reads to get Amplicon Sequencing Variants (ASV) 1.14 Callahan et al. (2016) Anaconda
PIPITS Denoising reads to get Operational Taxonomic Units (OTU) 2.4 Gweon et al. (2015) Anaconda
phylotyping QIIME2 Taxonomic classifications of representative sequences, tree building r2019.7 Bolyen et al. (2019) Anaconda
analysis BAnOCC compositionality aware correlations between species 1.1 Schwager et al. (2017) Anaconda
caret R package for Machine Learning 6 Anaconda
phyloseq R package for handling samples, features, and trees McMurdie and Holmes (2013) Anaconda
SparCC / FastSpar sparseness aware correlations between species 0.0.1 Friedman and Alm (2012), Watts et al. (2019) Anaconda
vegan Diversity analysis 2 Anaconda
grabseqs Downlaoding raw reads from NCBI SRA 0.5 Anaconda

Parameters

In the input and start section, the following parameters can be set to customize the workflow:

Phred Score Format
Format of phred score used to indicate per base sequence quality of raw FASTQ read fiels.

Type: nominal
Default: phred33

Sequences to Trim (Adapters and Primers in FASTA format)
Primers and other sequences to remove from the reads. Multiple sequences are allowed. Must be provided in FASTA format. Sequence name after > is only for ducumentation purposes.

Type: string
Default: NA

Minimum read length (bp)
Minimum length of each read to pass QC. This value refers to unjoined reads. Used by trimmomatic to filter raw reads.

Type: integer
Default: 50

Include reverse complement of primer sequences
This is to indicate whether the reverse complement of sequences provided in the field ‘Sequences to Trim’ should be used as well for trimming. Highly recommended.

Type: binary
Default: TRUE

Minimim quality of trailing bases (Phred)
Minimim phred quality score of trailing bases. Used by trimmomatic for trimming trailing bases.

Type: integer
Default: 25

Minimim quality of leading bases (Phred)
Minimim phred quality score of leading bases. Used by trimmomatic for trimming leading bases.

Type: integer
Default: 25

Additional adapters
Additional adapter files to include in QC. These are the default files provided by trimmomatic.

Type: nominal
Default:

  • NexteraPE-PE.fa
  • TruSeq2-PE.fa
  • TruSeq3-PE-2.fa

Minimum number of QC reads
Minimum number of QC reads per sample. This is to remove samples which are not deep enough sequenced. Only applied if test is set in sample exclusion criteria.

Type: integer
Default: 100

Exclude sample if any of these tests failed
Criteria to exclude samples during QC from downstream analysis. Test are provided by FastQC. Per tile test is not available for NCBI SRA samples.

Type: nominal
Default:

  • fastqc_adapter_content_failed
  • fastqc_per_base_n_content_failed
  • fastqc_per_base_sequence_quality_failed
  • fastqc_per_sequence_quality_scores_failed
  • fastqc_per_sequence_length_distribution_failed
  • min_qc_read_count_failed

Maximum allowed error rate
Maximum allowed error rate in QC trimming (no. of errors divided by length of matching region)

Type: integer
Default: 0.1

ITS region
ITS region to extract. Used by PIPITS. If this value is set wrong, abundance profile can not be generated.

Type: nominal
Default: NA

Denoising method
Method used for denoising. This parameter determines whether Amplicon sequencing variants (ASV) or Operational Taxonomic Units (OTU) are generated.

Type: nominal
Default: asv_dada2

PIPITS include singletons
Indication to include denoised sequence clusters with only one single sequence. Not recommended.

Type: binary
Default: FALSE

PIPITS identity threshold
Required sequence identity to cluster sequences in the same OTU. This parameters is used by VSEARCH and only applies to OTU profiling.

Type: Float [0,1]
Default: 0.97

DADA2 maximal number of undefined bases
After truncation, sequences with more than maxN Ns will be discarded. Note that dada does not allow Ns. Only used by DADA2.

Type: Float
Default: 0

DADA2 minimal base quality
After truncation, reads contain a quality score less than minQ will be discarded. Only used by DADA2.

Type: integer
Default: 0

DADA2 maximal expected errors
Default Inf (no EE filtering). After truncation, reads with higher than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)). Two values for each mate are separated by a comma. Only used by DADA2.

Type: float,float
Default: 2,2

DADA2 truncate quality
Truncate reads at the first instance of a quality score less than or equal to truncQ. Only used by DADA2.

Type: integer
Default: 2

Reference database
Reference database used for taxonomic classification of denoised representative sequences.

Type: nominal
Default: Unite_8.2_dynamic

Sequence classifier
Reference database used for taxonomic classification of denoised representative sequences. The classifier will use the reference data base.

Type: nominal
Default: qiime2_blast

Minimum BLAST identity (fraction)
Minimum sequence identity required to classify denoised representative sequences. Only applied if BLAST sequence classifier is used.

Type: Float [0,1]
Default: 0.8

Taxonomic rank
Taxonomic rank at which all denoised clusters were pooled to. Used by all analysis steps as features.

Type: nominal
Default: genus

Phylogeny
Phylogenetic tree used for visualization and UniFrac based beta diversity analysis. Trees with equal edge length are allowed.

Type: nominal
Default: index_fungorum_2016

Normalization method
Normalization method used to generate features. Used in as most analysis steps as possible. SparCC and BanOCC correlation, however, rely on unnormalized counts.

Type: nominal
Default: clr

Minimum abundance to count prevalence (%)
Minimum abundance of a feature to be count as prevalent (in %). Only used for prevalence filtering.

Type: Float [0,100]
Default: 0.01

Minimum prevalence (%)
Minimum fraction of all samples a feature must be prevalent (in %) to be included in downstream analysis.

Type: Float [0,100]
Default: 5

Prevalence group
A feature is prevalent if it is prevalent in any group specified by the sample meta data column here.

Type: nominal
Default: all

Unknown strategy
Strategy to treat unassigned sequences classified as unknown.

Type: nominal
Default: infer

Inter feature correlation method
Method used to estimate correlations between pairs of features.

Type: nominal
Default: sparcc

Samples meta data
Optional file describing meta data about samples. One sample per row. Must include column sample_id. Must include columns barcode_seq and barcode_file if multiplexed files are used. CSV and XLSX (Excel) allowed.

Type: file
Default: NA

SRA Accessions (one SSR id per line)
List of SRR IDs from NCBI SRA to add. One ID per line.

Type: list of strings
Default: NA

Filter query for existing projects
Query to describe samples used to filter existing cohorts. Column names of samples meta data table must be used. Invalid characters: %, (, )

Type: string
Default: NA

Existing projects
List of projects used to add to the project. Samples selected here will be filtered by the Filter query for existing projects.

Type: list of strings
Default: NA

Local raw read files
Files to upload. File names must follow pattern {sample_id}_{mate}.{file extension} where {mate} is either 1 or 2. {sample_id} is used to map attributes in samples meta data table if provided. Allowed file extensions: fastq, fq, fq.gz, fastq.gz

Type: list of files
Default: NA

Analysis groupings
Columns of sample meta data table used to group samples in machine learning and statistics.

Type: nominal
Default: NA

Correlation grouping
Column name used to group samples to build correlation networks separately. Use all to build just one correlation network using all samples.

Type: nominal
Default: all

BanOCC chains
Number of Markov chains in BanOCC.

Type: integer
Default: 10

BanOCC total iterations
Number of total iterations for each chain (including warmup) in BanOCC.

Type: integer
Default: 10000

BanOCC warmup iterations
Number of warmup iterations for each chain in BanOCC. Must not be larger than total number of iterations.

Type: integer
Default: 5000

BanOCC alpha confidence
Fraction of the posterior density outside the credible interval.

Type: Float [0,1]
Default: 0.95

SparCC iterations and bootstraps
Number of iterations for convergence and bootstraps for p value calculation in SparCC.

Type: integer
Default: 200

Email address
Optional Email address. A HTML analysis report will be sent to this address after the pipeline has finished. You can also save this page as a bookmark to access the project.

Type: string
Default: NA

Browsers

DAnIEL was sucessfully tested with the following browers:

Browsers
Browser OS Versions
Firefox GNU/Linux 70.0

References

Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. (United States).

Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., and Holmes, S.P. (2016). DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods 13, 581–583.

Ewels, P., Magnusson, M., Lundin, S., and Kaller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England) 32, 3047–3048.

Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic survey data. PLoS Computational Biology 8, e1002687.

Gweon, H.S., Oliver, A., Taylor, J., Booth, T., Gibbs, M., Read, D.S., Griffiths, R.I., and Schonrogge, K. (2015). PIPITS: An automated pipeline for analyses of fungal internal transcribed spacer sequences from the illumina sequencing platform. Methods in Ecology and Evolution 6, 973–980.

Koster, J., and Rahmann, S. (2012). Snakemake–a scalable bioinformatics workflow engine. Bioinformatics (Oxford, England) 28, 2520–2522.

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12.

McMurdie, P.J., and Holmes, S. (2013). Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PloS One 8, e61217.

Schwager, E., Mallick, H., Ventz, S., and Huttenhower, C. (2017). A bayesian method for detecting pairwise associations in compositional data. PLoS Computational Biology 13, e1005852.

Watts, S.C., Ritchie, S.C., Inouye, M., and Holt, K.E. (2019). FastSpar: Rapid and scalable correlation estimation for compositional data. Bioinformatics (Oxford, England) 35, 1064–1066.

Database of fungal projects

Loading...
Loading...

Database of fungal interactions

We created a manually curated database about fungal interactions. Published papers were queried for the fungal species name and any of these terms: Disease, bacteria, immune system, and cytokine. All interactions reported in these papers were added to this database

Browse database

Loading...

Download database

The database of fungal interactions consists of a table and a BibTex file describing the interactions and their references, respectively. Medical Subject Headings (MeSH) are used whenever applicable.

Interactions (CSV) References (BibTeX)

Database of fungal infections

This database consists of clinical isolates with suspicious fungal infections from the NRZMyk database

Browse database

Loading...

Download database

infections (CSV)

Input

This page is about uploading own samples (FASTQ raw reads) and it's meta data (Excel or CSV). Furthermore, external samples from the data base NCBI SRA can be added to the project.

Upload local samples

Add external samples

Samples overview

Loading...
Loading...

Start

This page is about starting the analysis pipeline.Parameter sets can be created here to customize the workflow.

Start pipeline

Customize parameter

Click on the panel to customize parameters of this particular step
Include reverse complement of primer sequences
This is to indicate whether the reverse complement of sequences provided in the field ‘Sequences to Trim’ should be used as well for trimming. Highly recommended.
Default: TRUE

Logs

This page shows pipeline logs for debugging purposes.

Loading...

              

Interactive analysis

Alignment of denoised sequences

For the sake of clarity, only the most prevalent sequences were shown. The tree, however, was calculated based on all representative sequences

Loading...

Alpha diversity

Representative sequences (ASV or OTUs) of denoised reads are used to calculate diversity for each sample.

Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.

Loading...
Loading...

Download

Denoised sequences (FASTA) Denoised profile (XLSX)

Download

Phylotypes (XLSX)

Phylotypes

Loading...

Interactive abundance

Taxon abundance profile clustered by sample simillarity using Manhattan distance. Displayed sparseness can be controlled using the minimum number of taxa per sample.

Loading...
Loading...

Interactive alpha diversity

Feature counts pooled at provided taxonomic rank are used to calculate diversity for each sample.

Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.

Loading...
Loading...

Interactive beta diversity

Feature counts pooled at provided taxonomic rank are used to calculate dissimilarity for each pair of samples. This dissimilarity is approximately plotted in the two dimensional plane.
Loading...
Loading...

Download

Raw feature profile (XLSX) Normalized feature profile (XLSX) Feature meta data (XLSX)

Interactive analysis

Loading...
Loading...

Download

Model download can take up to a couple of minutes.
BAnOCC models (RDS)

Analyze a single feature

Abundance distribution is plotted here using any combination of feature and sample grouping. Significance is indicated by stars (p < 0.001: ★★★, p < 0.01: ★★, p < 0.05: ★) P values are taken from post hoc tests if multiple test statistics are available.

Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.

Loading...
Loading...

Download

Statistics summary table (XLSX)

Analyze a single model

Mean AUC over all folds and it's standard deviation are shown in the table.
Test samples of all folds are pooled together to calculate AUC as shown in the plots.

Loading...
Loading...
Loading...
Loading...