
Analyze your
mycobiome
with DAnIEL
DAnIEL (Describing, Analyzing and Integrating fungal Ecology to effectively study the systems of Life) is a web server to analyze ITS amplicon sequencing data beginning from raw reads to publication ready figures and tables. Interactions between fungal species and diseases, bacterial species, and immune system components can be explored. A knowledge base of clinical relevant species provides additional annotation to biological findings
Overview
- Upload raw reads and meta data of samples (optional)
- Set parameters and start the pipeline
- Integration of existing projects and download of NCBI SRA runs
- Quality Control: Trimming primers and low quality bases using
- Denoising: Getting representative sequences by calling OTUs or ASVs
- Phylotyping: Taxonomic annotation of denoised sequences
- Feature generation: Grouping denoised sequence by taxonomic rank and count normalization
- Correlation networks: Identification of co-abundant species aware of sparseness and composition
- Machine Learning: Phenotype classification using fungal abundance profiles
- Statistics: Identification of significantly differentially abundant features
- Annotation: Integrating significant features with known insights from existing projects and literature
- Reporting: Creation of a single html file summarizing all results and methods applied
Fungal knowledge base
- Manually curated database about fungal interactions and their associations to diseases, bacterial species, and biological molecules focused on immune system compounds
- List of species found in clinical samples with a suspicious fungal infection from the German reference center NRZMyk
Publication ready plots
- Customizable and interactive plots
- Single html file report summarizing both results and methods
Cite us
DAnIEL: a user-friendly web server for fungal ITS amplicon sequencing dataDaniel Loos, Lu Zhang, Christine Beemelmanns, Oliver Kurzai, and Gianni Panagiotou
Frontiers in Microbiology, doi: 10.3389/fmicb.2021.720513
Contact us
Maintainer and Developer: Daniel Loos
daniel.loos@leibniz-hki.de
Head: Gianni Panagiotou
gianni.panagiotou@leibniz-hki.de
Website:
https://www.leibniz-hki.de/en/systembiologie-und-bioinformatik.html
Systems Biology and Bioinformatics
Leibniz Institute for Natural Product Research and Infection Biology
Hans Knöll Institute (HKI)
07745 Jena
Germany
Tutorial
Here we give a brief introduction about how to use this web server to analyze ITS amplicon data.
The aim of this tutorial is to explain the overall workflow and not to exercise a real world ITS analysis. For instance, the provided samples are sub sampled to speed up the analysis.
- The Biology of ITS sequencing
- Input: Uploading samples
- Start: Setting parameters and starting the pipeline
- Logs
- Result pages
- Quality Control (QC)
- Denoising
- Phylotyping
- Feature generation
- Correlation Networks
- Statistics
- Machine Learning (ML)
- Summary
- References
The Biology of ITS sequencing
The Internal transcribed spacer (ITS) is a region in the eukaryotic genome between the genes coding for the small-subunit rRNA and the large-subunit rRNA. The ITS1 sub region in fungi is located between 18S and 5.8S rRNA genes, whereas ITS2 is between 5.8S and 28S rRNA genes. ITS regions are widely used in taxonomy due to these reasons:
- The sequence has a high degree of variation, especially between closely related species
- The size of one sub region can be easily covered with PE NGS reads
- There are universal primers available in which most fungal clades can be easily amplified
Often there are multiple copies of ITS in the same genome. However, they tend to undergo concerted evolution via unequal crossing-over events. This preserves the sequence homogeneity within a particular genome.
Most projects in the field of metagenomics have only one the two sub regions sequenced. It is not recommended to compare abundance levels between projects using different sub regions due to PCR amplification bias. Besides ITS, other marker sequences e.g. translational elongation factor 1α (TEF1α) are also used for taxonomic profiling. These markers are not covered by this webserver. However, sometimes multiple marker genes have to be sequenced for precisely taxonomic annotation up to the species level. This is also up to the clade of interest: It is usually more difficult to classify molds than yeasts using ITS region only.
Input: Uploading samples
In this tutorial we will demonstrate how to analyze the different fungal communities in fecal and environmental samples. Totaling 38 samples will be analyzed from these cohorts:
- 5 fecal samples from cancer patients (ENA study PRJEB33756)
- 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017)
- 13 environmental samples from a neonatal intensive care unit (NICU) (Heisel et al., 2019)
Start a new project
If you just want to learn about the workflow, you do not need to start a new project. All results can be already accessed on the left sidebar if the current project ID is
example
.
- Click on Home on the sidebar and then on Start project to create a new project.
- This will create a new ID for your project. This ID can be used to access the results later on by clicking on Resume project. You can also create a bookmark of any page: The ID is always included in the URL.
- Download the example raw data used in this tutorial:
Composing the sample meta data table
The meta data table is a spreadsheet describing the samples. It must be provided either as a Excel spreadsheet (.xlsx) or as a comma-separated values text file (.csv) It must contain at least one column called sample_id
listing the ids of all uploaded samples. For machine learning, column names must contain alphanumerical characters only (e.g. no +
or /
).
Additional samples can be added from databases like the NCBI Sequence Read Archive (SRA). The meta data table will be automatically updated for those samples. However, one can annotate external samples using the meta data table. Here, sample_id
must be the NCBI SRA id e.g. SRR5098710
. We will do this for the 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017). For instance, we will assign these samples to the project healthy feces
.
Raw read files can contain multiple samples, because this webserver is able to perform demultiplexing. However, one has to declare the multiplexed samples in the meta data table. Let’s have a look at the first lines of the multiplexed raw read file muxed_S4_S5_1.fq.gz
in FASTQ format:
@FCBKRH9:1:2119:15517:24531 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@FCBKRH9:1:2112:8633:25105 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGFCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGG<CFGGGGGGGGGG
@FCBKRH9:1:2112:6602:7309 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
The bar code sequence is located at the end of the header of each read (every 4th line). We need a mapping table to know which bar code belongs to which sample. In this example, bar code CAGTGCATATGC
belongs to sample S4. This is usually determined during library preparation. We have to add two new columns to the meta data table for all multiplexed samples:
sample_id | barcode_seq | barcode_file |
---|---|---|
S4 | CAGTGCATATGC | muxed_S4_S5 |
S5 | TCACGGGAGTTG | muxed_S4_S5 |
Column barcode_seq
describes the bar code sequence to look for. Tis can be any substring of the header line. Column barcode_file
indicates in which file the sample was sequenced.
In order to define groups e.g. for statistics, we define also a few other columns. The columns can contain character strings (e.g. project
), binary values (FALSE
or TRUE
e.g. low_seqdepth
) or numbers (e.g. sequencing depth seqdepth
). If a column contains unique values only (e.g. library
), it will not be used in any statistical analysis due to lacking replica. This columns will be ignored instead. We can provide another column called pair_id
used for paired testing of differences. This is needed e.g. if the same subjects were sequenced multiple times.
Here is the content of the meta data file samples.xlsx
to be uploaded:
sample_id | barcode_seq | barcode_file | project | disease | low_seqdepth | seqdepth | library |
---|---|---|---|---|---|---|---|
S1 | NA | NA | cancer feces | cancer | FALSE | 83070 | cancer_01A1 |
S2 | NA | NA | cancer feces | cancer | FALSE | 161697 | cancer_01B1 |
S3 | NA | NA | cancer feces | cancer | FALSE | 166414 | cancer_08A1 |
S4 | CAGTGCATATGC | muxed_S4_S5 | cancer feces | cancer | FALSE | 87166 | cancer_09A1 |
S5 | TCACGGGAGTTG | muxed_S4_S5 | cancer feces | cancer | FALSE | 42744 | cancer_10A1 |
SRR5098710 | NA | NA | healthy feces | healthy | TRUE | 935 | HMP_SRR5098710 |
SRR5098580 | NA | NA | healthy feces | healthy | TRUE | 187 | HMP_SRR5098580 |
SRR5098736 | NA | NA | healthy feces | healthy | FALSE | 102113 | HMP_SRR5098736 |
SRR5098677 | NA | NA | healthy feces | healthy | FALSE | 104592 | HMP_SRR5098677 |
SRR5098500 | NA | NA | healthy feces | healthy | FALSE | 104812 | HMP_SRR5098500 |
Raw read format
Raw reads must be in gzip-compressed FASTQ format. In the reads
directory of example.zip
, there are 8 files containing raw sequencing data from all 6 fecal samples from cancer patients:
muxed_S4_S5_1.fq.gz
muxed_S4_S5_2.fq.gz
S1_1.fq.gz
S1_2.fq.gz
S2_1.fq.gz
S2_2.fq.gz
S3_1.fq.gz
S3_2.fq.gz
Please note:
- Raw reads must be originated from illumina® paired-end (PE) sequencing machines.
- Forward and reverse mates must be uploaded in two separated files.
- Only paired FASTQ files following pattern
{sample id}_{mate}.{file extension}
were accepted. The file extension must be one of .fastq, .fq, fastq.gz or fq.gz. The mate muste be either1
or2
indicating the forward and reverse mate, respectively. Example for single samples:sample1_1.fq.gz
. Example for multiplexed samples:muxed_S4_S5_2.fq.gz
.
Uploading own local samples
We will only upload the 5 fecal samples from cancer patients. Click on Input to upload new samples to your project. Here we can upload both the raw read files generated by the sequencing machine and the table containing meta data about the samples:
Upload local samples
Please wait until all files are being uploaded. A progress bar underneath the button will be provided to indicate upload status.
Adding external samples
The other two cohorts have already been uploaded to the NCBI Sequence Read Archive (SRA). This webserver is able to access this database directly. We have two options to add samples from external cohorts:
SRA Accessions
Every sequencing run stored in the SRA has a unique ID starting with either SRR
, ERR
or DRR
followed by a number. We must enter one sample per line in the text box SRA accessions
. In this example, we insert all 20 fecal samples from healthy individuals:
SRR5098710
SRR5098580
SRR5098736
SRR5098677
SRR5098500
SRR5098394
SRR5098370
SRR5098413
SRR5098489
SRR5098436
SRR5098698
SRR5098401
SRR5098363
SRR5098721
SRR5098511
SRR5098498
SRR5098398
SRR5098371
SRR5098341
SRR5098421
External projects
Many ITS projects of the SRA are already downloaded and preprocessed. We will select the project hospital metagenome Targeted loci environmental (PRJNA505509) in the drop down menu External projects
by clicking at the box and typing hospital
to search for the study. Please click Projects on the sidebar to browse through all available projects.
We can subset samples from external cohorts by any meta data variable as displayed at Projects. This can be archived using a filter query. We will enter the query scientific_name == 'hospital metagenome' & total_spots > 30e3
to the text box Filter query for external projects
. This will only include environmental samples of the NICU project which have at least 30*10³ raw reads. Character strings must be single quoted. ==
is used to indicate equality. >
and <
can be used to filter numerical attributes. Single terms can be combined using &
or |
to yield the sample-wise intersection or union over these terms, respectively. Please note that both %>%
and parenthesis are blocked in this filed for security reasons. We can always filter samples according to these attributes:
Attribute | Description |
---|---|
total_spots | Number of raw reads |
total_bases | Number of raw bases |
size | File size in bytes |
biosample_id | NCBI accession of the biological sample starting with SAM |
scientific_name | Name of taxa of the sample e.g. human gut metagenome |
A click on Add external samples will add these sample to the project. The column project
will be filled with the NCBI BioProject id.
Upload local samples
Start: Setting parameters and starting the pipeline
First, click on Start on the sidebar. Here we can run the pipeline with default parameters by just clicking on Start pipeline. This will take around half an hour for the example cohort to run on our servers using 5 CPU threads. Depending on the number of samples and parameters, this can take up to a couple of hours. Please create a bookmark of the page or remember the project ID to view the results later on. Results can be accessed on the sidebar above the results menu once analyzed. We also have the option to create and select own parameter sets. A complete reference about all input elements and parameters is given at Reference on the sidebar.
Adjust the parameters
We can customize the analysis workflow by creating our own parameter set for each step. Each parameter set must be given a unique name consisting of alphanumerical characters and spaces.
The default ITS sub region is ITS1. If you have ITS2 data, you have to adjust the denoising parameter set!
Here, a briefly summary about changed parameters in the example project is given. A complete description about all parameters can be accessed by clicking Reference on the sidebar.
- Quality Control
Aim: Remove sequencing errors by trimming and filtering the raw reads
We want to have our own primers ITS3 and ITS4 to be removed. Therefore we enter these primers in the
Sequences to trim
text box:>Forward primer ITS3 GCATCGATGAAGAACGCAGC >Reverse primer ITS4 TCCTCCGCTTATTGATATGC
These sequences must be in FASTA format.
- To remove the primers on both strands, ensure that the checkbox
Include reverse complement of primer sequences
is checked
- To remove the primers on both strands, ensure that the checkbox
- Denoising
- We will use Amplicon Sequencing Variants (ASV) as a method for denoising. Operational taxonomic units (OTU) can be used as well.
- Our example project covers the ITS2 sub region. Therefore we click on the radio button
ITS2
.
- Phylotyping
- Aim of this step: Taxonomic annotation of these denoised representative sequences
- We stay with the default parameters here
- Feature generation
- Aim of this step: Normalization, filtering and pooling denoised abundance profiles
- We stay with the default parameters here. This includes to pool the abundances at genus rank, because many fungal clades can not determined up to the species rank using the ITS region only. However, it is possible to select the species rank if the data is trusted.
- Analysis
- Aim of this step: Creating statistics, Machine Learning, and correlation networks based on these features
- Performing statistics and Machine Learning on hundreds of sample attributes can be very time consuming. Therefore, we can select the particular analysis groupings of interest. In this case:
project
,disease
, andlow_seqdepth
. - Building correlation networks on all samples is meaningless in heterogeneous cohorts like ours. We want to compare correlations between healthy and diseased samples. Thus we select
disease
as the correlation grouping.
Logs
Once the pipeline has started, we can view the logs of the workflow engine snakemake by clicking Logs on the sidebar. This is an optional feature of this webserver and useful for debugging processes.
Result pages
Once a step is finished, a result page will be appear at the results section on the sidebar menu. All result pages follow the same structure:
- Brief description about the methods and parameters applied
- Results including summary bullet points and interactive figures
- An optional notes section with comments how to interpret the results and which errors might occurred
- A list of literature references of used tools
- Optional panels for interactive analysis and to download the results
Quality Control (QC)
Quality Control (QC) is the process of removing low quality bases from the raw reads. This is done by utilizing the Phred quality score as calculated during base calling. Furthermore, primer and adapter sequences are cut.
This is the first analysis result which will appear at the result section. The test matrix indicates whether a sample failed, raised a warning, or passed a particular QC test.
QC tests by sample
By default and in our example project, samples will be excluded from downstream analysis, if any of these tests failed:
- Adapter content is too high
- Per base N content is too high
- Per base sequence quality is too bad
- Per sequence quality scores are too low
- Minimum number of QC reads < 1000
All tests except the last one are calculated using FastQC In this table, we can see that 3 samples were excluded due to too shallow sequencing. Indeed, samples SRR5098710 and SRR5098580 do not satisfy this requirement even on raw read basis. They were included in this example project to demonstrate both the functionality of this webserver and the fact that not every sample uploaded in the SRA database is of well quality.
sample_id | reason |
---|---|
S4 | min_qc_read_count_failed |
SRR5098580 | min_qc_read_count_failed |
The effect of QC can be impressively visualized by downloading the MultiQC reports at the bottom of this page (Ewels et al., 2016). Here we can see that the adapter content and per base sequence quality has been drastically improved after QC. This is typical for ITS analyzes. If the length of the ITS sub region is smaller than the fragment length during sequencing, not only the ITS subregion itself but also the adapter of the other read mate is being sequenced. This results in high adapter content and low sequence quality due to steric hindering during the process of sequencing by synthesis.
Raw | Clean |
---|---|
Denoising
Denoising is the process of identifying correct biological sequences from quality controlled reads. This is done by removing potential sequencing errors.
As a sanity check, ITSx was used on the ASV sequences to see if the selected subregion ITS2 was indeed present in the raw reads:
Sanity checks
A tree of all representative sequences was generated using QIIME2, MAFFT, and FastTree (Bolyen et al., 2019; Katoh and Standley, 2013; Price et al., 2009). Here is a multiple sequence alignment of the 30 most prevalent ASVs: Please note that this tree might not reflect proper phylogeny. To do so, one need usually a much larger amplicon e.g. a whole subunit of the rRNA genes. This tree was not used anywhere else in the analysis and is only to visualize differences between all inferred representative sequences.
DADA2
This step is specific to ASV profiling. They will not be used if OTU profiling was selected in the denoising parameter set.
This webserver process ASV profiling according to the official DADA2 ITS Pipeline Workflow (1.8) in a sample-wise manner. Sequencing errors are modeled for each sample and base transition separately. See the DADA2 paper for more details (Callahan et al., 2016). Here we can see the distribution of error frequency across all samples corresponding to the consensus quality score:
During the steps of the DADA2 workflow, reads undergo additional filtering and trimming. Not every qc read can be assigned to a final ASV representative sequence. The read loss can be assessed here:
Phylotyping
Phylotyping is the process of taxonomic annotation of the denoised representative sequences.
59% of the AVS representative sequences were not annotated even at phylum level:
However, taking taxon abundance into account, the majority of the reads are indeed taxonomically assigned: 92% and 94% at species and phylum rank, respectively. Most representative sequences are unclassified, but very low abundant. ASV profiling searches for very subtile sequence differences up to a single base. This results into a larger number of representative sequences compared to OTU profiling.
Feature generation
Feature generation is the process of pooling denoised sequences at a given taxonomic rank. Furthermore, abundance profiles were normalized to make them comparable between differentially sequenced samples.
Features can be anything from species abundances to phylum abundances depending on the parameter set chosen. In our example, we have genus features normalized using centered log-ratio transformation (CLR). Usually there are hundreds of features remaining after feature generation. To visualize the abundance of all samples and all features, one can either generate a heatmap or reduce the dimensionality of the data set using a process called ordination. Let’s look at the results of the latter one:
Bray-Curtis | unweighted UniFrac |
---|---|
The webserver offers an interactive way to create ordination plots. Sample groupings, dissimilarities and ordination methods can be individually selected. In this example, Principal Coordinates Analysis (PCoA) using the dissimilarities Bray-Curtis and unweighted UniFrac was performed.
The project groups were significantly differentially abundant in both ordinations (Adonis PERMANOVA, \(p < 0.001\)). With UniFrac, however, we are able to explain 54.3% of the total variance in the first ordination axis PCo1. On the other hand, we can only explain 30.6% of it using Bray-Curtis dissimilarity. Furthermore, the sample groups tend to be more compact using UniFrac ordination.
Both dissimilarity methods aim to express the distance between two samples considering the abundances of all taxa. Bray-Curtis treats taxa as disconnected sets, whereas UniFrac takes phylogeny into account.
Let \(i\) and \(j\) be two sample groups, \(S\) the total number of species, and \(C_{ij}\) the the sum of the lesser values for only those features in common between both sample groups. Then the distances are calculated using:
\[ d_{Bray-Curtis} = 1 - \frac{2C_{ij}}{S_i + S_j} \]
\[ d_{UniFrac} = \frac{\sum \textrm{unshared branch lengths}}{\sum \textrm{all tree branch lengths}} \]
Both distances do not make a difference how much a particular feature is abundant. They only consider the pure existence. We can use weighted UniFrac to also incorporate abundance. UniFrac requires a phylogenetic tree. In this example, the tree consist of the ranks genus, family, etc. up to the kingdom. The distance of all child nodes to their parents is set to 1. This is because sequence identity from a multiple sequence alignment of denoised sequences does not correlate with phylogenetic distance in the fungal kingdom using only the ITS region.
Correlation Networks
Correlation is used to assess co-abundant features.
We used the correlation grouping disease
which gave us two networks for the healthy and the diseased individuals. Here is the SparCC correlation network of the healthy individuals with default thresholds for effect size and significance (Friedman and Alm, 2012; Watts et al., 2019):
Correlation network
We can hover the cursor over an edge to get a detailed description of the correlation. Moreover, the network can be also displayed as a table by clicking on the particular tab.
For example, Malassezia is significantly negativeley correlated to Cyberlindnera (\(c_{SparCC}=-0.402, p = 0.0249\)). The p values were obtained using permutation. Other correlation methods such as BAnOCC provide confidence intervals instead of p values.
Spurious correlations
Be always aware of potential false positive correlations. These spurious correlations can be omnipresent in microbiome data.
The abundance profile of Dekkera seems to be highly zero-inflated (see abundance boxplot at section Statistics). Only one sample got an outstanding high abundance of this genus. The correlation to Saccharomyces we observed above might be driven by this outlier. It is recommended to use a high prevalence threshold during feature generation. This implies avoiding the use of Spearman’s rho correlation. Furthermore, one should use a correlation method aware of sparseness (e.g. SparCC) and composition (e.g. BAnOCC). Different correlation methods can differ drastically in specificity and sensitivity (Weiss et al., 2016). A review emphasizing issues regarding the compositionality of relative count data is (Gloor et al., 2017).
Statistics
Here, statistics refers to comparisons of feature abundance between sample groups.
Testing the same subjects multiple times requires paired testing. Subject ids can be mapped to sample ids by providing a column pair_id
in the sample meta data table. We do not have to provide this pairing information in our example. We can create box plots with significance labels of any feature and sample grouping:
Two means comparison | may means comparison |
---|---|
Two means tests
Testing differences in the disease grouping is very simple, because disease is a binary variable (it can be either healthy
or cancer
)
Significant two means comparisons
Three or more means tests
The project sample grouping has three unique levels: cancer feces, healthy feces, and the environmental samples of BioProject PRJNA505509. We have to apply a two staged test here: First, Kruskal-Wallis rank sum test on normalized abundances was applied to test if there is any difference between the project groups.
Significant three or more means comparisons
Then, Dunn’s test was applied to test which groups are significantly differentially abundant. This is called post hoc testing.
Significant post hoc tests
Machine Learning (ML)
Machine Learning (ML) is the automated process of using features to build a model able to distinguish between any given classes. This model can be used to predict properties of new data sets.
Model construction
Both random forests and Support Vector Machines (SVM) were used as models. A random forest consist of multiple decision trees. Only a subset of samples and features are used in each decision tree. These models tend to be very robust. A SVM is a hyperplane dividing the features in two areas. The feature space was stretched using a Gaussian function (rbf kernel) to account for non-linear relationships.
These models have different hyperparameters to tune:
Model type | Hyperparameter | Description |
---|---|---|
random forest | mtry | Number of variables available for splitting at each tree node |
random forest | ntree | Number of decision trees |
SVM | C | Regularization penalty for the trade of between low testing and low training error |
Not all features were used in all models: Feature selection was applied using both recursive feature selection and selection by filtering. The aim of this subsetting is to prevent over-fitting so that the model generalizes well at the prediction of new data sets. In recursive feature selection, features are removed iteratively whereas in selection by filtering, ANOVA was used to filter features not related to the property we want to classify.
Each set of hyper parameters and selected features were validated using nested cross-validation.
Model performance
The sample property we want to classify is called target class. ML models for all categorical variables of the samples meta data table were trained. Random forests performed best in the classification of both disease and project sample target property. Feature selection improved the validation performance: All best models used recursive feature selection.
Feature importance
Multiple metrics were calculated to validate model performance:
- Sensitivity (a.k.a recall or true positive rate)
- Specificity (a.k.a true negative rate)
- Area under the receiver operating characteristic curve (AUC)
Let’s imagine a model which classifies all samples with the label cancer. This model is very sensitive: Obviously, all true cancer samples are classified correctly. But this is also for all healthy samples the case. This is why the specificity of this imagined model is 0. The performance metric to choose depends on the use case: We want a sensitive classifier for diagnosing a very severe disease for which a drug with no side effects is available. In this situation we do not care so much about false positives. However, we should focus on building a specific classifier whenever a misclassification has severe implications.
These metrics were generalized for targets with more than two levels (e.g. project): Here multiple models are trained e.g. cancer vs (healthy and environment together). Then, the performance values of the “one versus all” statistics are averaged.
Feature importance
Feature importance implies neither biological importance nor causality. Important features in ML models are only hints for new possible biological insights.
One can also identify features which are important in this classification process. In random forest models, we can count how often a particular feature was used at any node in the decision trees. The mean decrease in gini importance is one way to measure this feature importance. The gini importance value of Candida is very high in random forest models classifying the project. Therefore Candida might also play a different biological rule in healthy, cancer, and environmental samples.
Summary
You can download the full findings summary either as Excel table or HTML report at the download section of the Summary webpage.
Here, significant results from all tests are combined. Furthermore, significant features were annotated with known interactions and related literature.
Let’s see whether we have findings for Wallemia:
This is a summary of all significant group-wise comparisons and correlations. Furthermore, features were annotated with FUNGuild and two more databases: One consist of clinical isolates with suspicious fungal infections from the NRZMyk reference database. It can be accessed by clicking at Infections on the sidebar.
The other one is about known interactions to other fungi, diseases, and immune system related molecules. This database was created using manual curation of article abstracts of fungi prevalent in fecal and skin samples. It can be accessed by clicking at Interactions on the sidebar.
References
Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857.
Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., and Holmes, S.P. (2016). DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods 13, 581–583.
Ewels, P., Magnusson, M., Lundin, S., and Kaller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England) 32, 3047–3048.
Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic survey data. PLoS Computational Biology 8, e1002687.
Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., and Egozcue, J.J. (2017). Microbiome Datasets Are Compositional: And This Is Not Optional. Front Microbiol 8, 2224.
Heisel, T., Nyaribo, L., Sadowsky, M.J., and Gale, C.A. (2019). Breastmilk and nicu surfaces are potential sources of fungi for infant mycobiomes. Fungal Genetics and Biology : FG & B 128, 29–35.
Katoh, K., and Standley, D.M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780.
Nash, A.K., Auchtung, T.A., Wong, M.C., Smith, D.P., Gesell, J.R., Ross, M.C., Stewart, C.J., Metcalf, G.A., Muzny, D.M., Gibbs, R.A., et al. (2017). The gut mycobiome of the human microbiome project healthy cohort. Microbiome 5, 153.
Price, M.N., Dehal, P.S., and Arkin, A.P. (2009). FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650.
Watts, S.C., Ritchie, S.C., Inouye, M., and Holt, K.E. (2019). FastSpar: Rapid and scalable correlation estimation for compositional data. Bioinformatics (Oxford, England) 35, 1064–1066.
Weiss, S., Treuren, W.V., Lozupone, C., Faust, K., Friedman, J., Deng, Y., Xia, L.C., Xu, Z.Z., Ursell, L., Alm, E.J., et al. (2016). Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J 10, 1669–1681.
Legal information
An overview of data protection
General
The following gives a simple overview of what happens to your personal information when you visit our website. Personal information is any data with which you could be personally identified. Detailed information on the subject of data protection can be found in our privacy policy found below.
Data collection on our website
Who is responsible for the data collection on this website? The data collected on this website are processed by the website operator. The operator’s contact details can be found below. How do we collect your data? Some data are collected when you provide it to us. This could, for example, be data you enter on a contact form. Other data are collected automatically by our IT systems when you visit the website. These data are primarily technical data such as the browser and operating system you are using or when you accessed the page. These data are collected automatically as soon as you enter our website. What do we use your data for? Part of the data is collected to ensure the proper functioning of the website. Other data can be used to analyse how visitors use the site. What rights do you have regarding your data? You always have the right to request information about your stored data, its origin, its recipients, and the purpose of its collection at no charge. You also have the right to request that it be corrected, blocked, or deleted. You can contact us at any time using the address given below if you have further questions about the issue of privacy and data protection. You may also, of course, file a complaint with the competent regulatory authorities.
Analytics and third-party tools
When visiting our website, statistical analyses may be made of your surfing behaviour. This happens primarily using cookies and analytics. The analysis of your surfing behaviour is usually anonymous, i.e. we will not be able to identify you from this data. You can object to this analysis or prevent it by not using certain tools. Detailed information can be found in the following privacy policy. You can object to this analysis. We will inform you below about how to exercise your options in this regard.
General information and mandatory information
Data protection
The operators of this website take the protection of your personal data very seriously. We treat your personal data as confidential and in accordance with the statutory data protection regulations and this privacy policy. If you use this website, various pieces of personal data will be collected. Personal information is any data with which you could be personally identified. This privacy policy explains what information we collect and what we use it for. It also explains how and for what purpose this happens. Please note that data transmitted via the internet, e.g. via email communication, may be subject to security breaches. Complete protection of your data from third party access is not possible.
Data protection
Notice concerning the party responsible for this website The party responsible for processing data on this website is:
Leibniz-Institut für Naturstoff-Forschung und Infektionsbiologie e. V. Hans-Knöll-Institut (HKI) Adolf-Reichwein-Straße 23 07745 Jena
Email: gianni.panagiotou@leibniz-hki.de
The responsible party is the natural or legal person who alone or jointly with others decides on the purposes and means of processing personal data e.g. names, email addresses, etc..
Revocation of your consent to the processing of your data
Many data processing operations are only possible with your express consent. You may revoke your consent at any time with future effect. An informal email making this request is sufficient. The data processed before we receive your request may still be legally processed.
Right to data portability
You have the right to have data which we process based on your consent or in fulfilment of a contract automatically delivered to yourself or to a third party in a standard, machine readable format. If you require the direct transfer of data to another responsible party, this will only be done to the extent technically feasible.
SSL or TLS encryption
This site uses SSL or TLS encryption for security reasons and for the protection of the transmission of confidential content, such as the inquiries you send to us as the site operator. You can recognize an encrypted connection in your browser’s address line when it changes from “http://” to “https://” and the lock icon is displayed in your browser’s address bar. If SSL or TLS encryption is activated, the data you transfer to us cannot be read by third parties.
Information, blocking, deletion
As permitted by law, you have the right to be provided at any time with information free of charge about any of your personal data that is stored as well as its origin, the recipient and the purpose for which it has been processed. You also have the right to have this data corrected, blocked or deleted. You can contact us at any time using the address given in our legal notice if you have further questions on the topic of personal data.
Opposition to promotional emails
We hereby expressly prohibit the use of contact data published in the context of website legal notice requirements with regard to sending promotional and informational materials not expressly requested. The website operator reserves the right to take specific legal action if unsolicited advertising material, such as email spam, is received.
Data protection officer
Statutory data protection officer We have appointed a data protection officer for our company.
Dipl.-Ing. Wolfgang Schmidt-Heck Leibniz-Institut für Naturstoff-Forschung und Infektionsbiologie e. V. Hans-Knöll-Institut (HKI) Adolf-Reichwein-Straße 23 07745 Jena
Telephone: +49 3641 532-1084 Email: wolfgang.schmidt-heck@leibniz-hki.de
Data collection on our website
Server log files
The website provider automatically collects and stores information that your browser automatically transmits to us in “server log files”. These are:
- Browser type and browser version
- Operating system used
- Referrer URL
- Host name of the accessing computer
- Time of the server request
- IP address
These data will not be combined with data from other sources. The basis for data processing is Art. 6 (1) (b) DSGVO which allows the processing of data to fulfil a contract or to implement measures preliminary to a contract.
Reference
This page is about the specification of the server. A detailed description about tools, parameters, and tested browsers is given.
Tools
The general framework is based on Snakemake (Koster and Rahmann, 2012), R, Tidyverse, and Docker.
Pipeline step | Tool | Description | Version | Citation | Source |
---|---|---|---|---|---|
qc | cutadapt | QC reads by filtering and trimming | 0.39 | Martin (2011) | Anaconda |
FastQC | Assessing sequence read quality | 0.11.8 | Anaconda | ||
MultiQC | Combining FastQC results from multiple samples | 1.7 | Ewels et al. (2016) | Anaconda | |
denoising | DADA2 | Denoising reads to get Amplicon Sequencing Variants (ASV) | 1.14 | Callahan et al. (2016) | Anaconda |
PIPITS | Denoising reads to get Operational Taxonomic Units (OTU) | 2.4 | Gweon et al. (2015) | Anaconda | |
phylotyping | QIIME2 | Taxonomic classifications of representative sequences, tree building | r2019.7 | Bolyen et al. (2019) | Anaconda |
analysis | BAnOCC | compositionality aware correlations between species | 1.1 | Schwager et al. (2017) | Anaconda |
caret | R package for Machine Learning | 6 | Anaconda | ||
phyloseq | R package for handling samples, features, and trees | McMurdie and Holmes (2013) | Anaconda | ||
SparCC / FastSpar | sparseness aware correlations between species | 0.0.1 | Friedman and Alm (2012), Watts et al. (2019) | Anaconda | |
vegan | Diversity analysis | 2 | Anaconda | ||
grabseqs | Downlaoding raw reads from NCBI SRA | 0.5 | Anaconda |
Parameters
In the input and start section, the following parameters can be set to customize the workflow:
Phred Score Format
Format of phred score used to indicate per base sequence quality of raw FASTQ read fiels.
Type: nominal
Default: phred33
Sequences to Trim (Adapters and Primers in FASTA format)
Primers and other sequences to remove from the reads. Multiple sequences are allowed. Must be provided in FASTA format. Sequence name after > is only for ducumentation purposes.
Type: string
Default: NA
Minimum read length (bp)
Minimum length of each read to pass QC. This value refers to unjoined reads. Used by trimmomatic to filter raw reads.
Type: integer
Default: 50
Include reverse complement of primer sequences
This is to indicate whether the reverse complement of sequences provided in the field ‘Sequences to Trim’ should be used as well for trimming. Highly recommended.
Type: binary
Default: TRUE
Minimim quality of trailing bases (Phred)
Minimim phred quality score of trailing bases. Used by trimmomatic for trimming trailing bases.
Type: integer
Default: 25
Minimim quality of leading bases (Phred)
Minimim phred quality score of leading bases. Used by trimmomatic for trimming leading bases.
Type: integer
Default: 25
Additional adapters
Additional adapter files to include in QC. These are the default files provided by trimmomatic.
Type: nominal
Default:
- NexteraPE-PE.fa
- TruSeq2-PE.fa
- TruSeq3-PE-2.fa
Minimum number of QC reads
Minimum number of QC reads per sample. This is to remove samples which are not deep enough sequenced. Only applied if test is set in sample exclusion criteria.
Type: integer
Default: 100
Exclude sample if any of these tests failed
Criteria to exclude samples during QC from downstream analysis. Test are provided by FastQC. Per tile test is not available for NCBI SRA samples.
Type: nominal
Default:
- fastqc_adapter_content_failed
- fastqc_per_base_n_content_failed
- fastqc_per_base_sequence_quality_failed
- fastqc_per_sequence_quality_scores_failed
- fastqc_per_sequence_length_distribution_failed
- min_qc_read_count_failed
Maximum allowed error rate
Maximum allowed error rate in QC trimming (no. of errors divided by length of matching region)
Type: integer
Default: 0.1
ITS region
ITS region to extract. Used by PIPITS. If this value is set wrong, abundance profile can not be generated.
Type: nominal
Default: NA
Denoising method
Method used for denoising. This parameter determines whether Amplicon sequencing variants (ASV) or Operational Taxonomic Units (OTU) are generated.
Type: nominal
Default: asv_dada2
PIPITS include singletons
Indication to include denoised sequence clusters with only one single sequence. Not recommended.
Type: binary
Default: FALSE
PIPITS identity threshold
Required sequence identity to cluster sequences in the same OTU. This parameters is used by VSEARCH and only applies to OTU profiling.
Type: Float [0,1]
Default: 0.97
DADA2 maximal number of undefined bases
After truncation, sequences with more than maxN Ns will be discarded. Note that dada does not allow Ns. Only used by DADA2.
Type: Float
Default: 0
DADA2 minimal base quality
After truncation, reads contain a quality score less than minQ will be discarded. Only used by DADA2.
Type: integer
Default: 0
DADA2 maximal expected errors
Default Inf (no EE filtering). After truncation, reads with higher than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)). Two values for each mate are separated by a comma. Only used by DADA2.
Type: float,float
Default: 2,2
DADA2 truncate quality
Truncate reads at the first instance of a quality score less than or equal to truncQ. Only used by DADA2.
Type: integer
Default: 2
Reference database
Reference database used for taxonomic classification of denoised representative sequences.
Type: nominal
Default: Unite_8.2_dynamic
Sequence classifier
Reference database used for taxonomic classification of denoised representative sequences. The classifier will use the reference data base.
Type: nominal
Default: qiime2_blast
Minimum BLAST identity (fraction)
Minimum sequence identity required to classify denoised representative sequences. Only applied if BLAST sequence classifier is used.
Type: Float [0,1]
Default: 0.8
Taxonomic rank
Taxonomic rank at which all denoised clusters were pooled to. Used by all analysis steps as features.
Type: nominal
Default: genus
Phylogeny
Phylogenetic tree used for visualization and UniFrac based beta diversity analysis. Trees with equal edge length are allowed.
Type: nominal
Default: index_fungorum_2016
Normalization method
Normalization method used to generate features. Used in as most analysis steps as possible. SparCC and BanOCC correlation, however, rely on unnormalized counts.
Type: nominal
Default: clr
Minimum abundance to count prevalence (%)
Minimum abundance of a feature to be count as prevalent (in %). Only used for prevalence filtering.
Type: Float [0,100]
Default: 0.01
Minimum prevalence (%)
Minimum fraction of all samples a feature must be prevalent (in %) to be included in downstream analysis.
Type: Float [0,100]
Default: 5
Prevalence group
A feature is prevalent if it is prevalent in any group specified by the sample meta data column here.
Type: nominal
Default: all
Unknown strategy
Strategy to treat unassigned sequences classified as unknown.
Type: nominal
Default: infer
Inter feature correlation method
Method used to estimate correlations between pairs of features.
Type: nominal
Default: sparcc
Samples meta data
Optional file describing meta data about samples. One sample per row. Must include column sample_id. Must include columns barcode_seq and barcode_file if multiplexed files are used. CSV and XLSX (Excel) allowed.
Type: file
Default: NA
SRA Accessions (one SSR id per line)
List of SRR IDs from NCBI SRA to add. One ID per line.
Type: list of strings
Default: NA
Filter query for existing projects
Query to describe samples used to filter existing cohorts. Column names of samples meta data table must be used. Invalid characters: %, (, )
Type: string
Default: NA
Existing projects
List of projects used to add to the project. Samples selected here will be filtered by the Filter query for existing projects.
Type: list of strings
Default: NA
Local raw read files
Files to upload. File names must follow pattern {sample_id}_{mate}.{file extension} where {mate} is either 1 or 2. {sample_id} is used to map attributes in samples meta data table if provided. Allowed file extensions: fastq, fq, fq.gz, fastq.gz
Type: list of files
Default: NA
Analysis groupings
Columns of sample meta data table used to group samples in machine learning and statistics.
Type: nominal
Default: NA
Correlation grouping
Column name used to group samples to build correlation networks separately. Use all to build just one correlation network using all samples.
Type: nominal
Default: all
BanOCC chains
Number of Markov chains in BanOCC.
Type: integer
Default: 10
BanOCC total iterations
Number of total iterations for each chain (including warmup) in BanOCC.
Type: integer
Default: 10000
BanOCC warmup iterations
Number of warmup iterations for each chain in BanOCC. Must not be larger than total number of iterations.
Type: integer
Default: 5000
BanOCC alpha confidence
Fraction of the posterior density outside the credible interval.
Type: Float [0,1]
Default: 0.95
SparCC iterations and bootstraps
Number of iterations for convergence and bootstraps for p value calculation in SparCC.
Type: integer
Default: 200
Email address
Optional Email address. A HTML analysis report will be sent to this address after the pipeline has finished. You can also save this page as a bookmark to access the project.
Type: string
Default: NA
Browsers
DAnIEL was sucessfully tested with the following browers:
Browser | OS | Versions |
---|---|---|
Firefox | GNU/Linux | 70.0 |
References
Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. (United States).
Callahan, B.J., McMurdie, P.J., Rosen, M.J., Han, A.W., Johnson, A.J.A., and Holmes, S.P. (2016). DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods 13, 581–583.
Ewels, P., Magnusson, M., Lundin, S., and Kaller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England) 32, 3047–3048.
Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic survey data. PLoS Computational Biology 8, e1002687.
Gweon, H.S., Oliver, A., Taylor, J., Booth, T., Gibbs, M., Read, D.S., Griffiths, R.I., and Schonrogge, K. (2015). PIPITS: An automated pipeline for analyses of fungal internal transcribed spacer sequences from the illumina sequencing platform. Methods in Ecology and Evolution 6, 973–980.
Koster, J., and Rahmann, S. (2012). Snakemake–a scalable bioinformatics workflow engine. Bioinformatics (Oxford, England) 28, 2520–2522.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12.
McMurdie, P.J., and Holmes, S. (2013). Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PloS One 8, e61217.
Schwager, E., Mallick, H., Ventz, S., and Huttenhower, C. (2017). A bayesian method for detecting pairwise associations in compositional data. PLoS Computational Biology 13, e1005852.
Watts, S.C., Ritchie, S.C., Inouye, M., and Holt, K.E. (2019). FastSpar: Rapid and scalable correlation estimation for compositional data. Bioinformatics (Oxford, England) 35, 1064–1066.
Database of fungal projects
Database of fungal interactions
Browse database
Download database
The database of fungal interactions consists of a table and a BibTex file describing the interactions and their references, respectively. Medical Subject Headings (MeSH) are used whenever applicable.
Interactions (CSV) References (BibTeX)Database of fungal infections
Browse database
Download database
infections (CSV)Input
Upload local samples
Add external samples
Samples overview
Start
Start pipeline
Customize parameter
Logs
This page shows pipeline logs for debugging purposes.
Interactive analysis
Alignment of denoised sequences
For the sake of clarity, only the most prevalent sequences were shown. The tree, however, was calculated based on all representative sequences
Alpha diversity
Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.
Download
Denoised sequences (FASTA) Denoised profile (XLSX)Interactive abundance
Taxon abundance profile clustered by sample simillarity using Manhattan distance. Displayed sparseness can be controlled using the minimum number of taxa per sample.
Interactive alpha diversity
Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.
Interactive beta diversity
Download
Raw feature profile (XLSX) Normalized feature profile (XLSX) Feature meta data (XLSX)Interactive analysis
Analyze a single feature
Abundance distribution is plotted here using any combination of feature and sample grouping. Significance is indicated by stars (p < 0.001: ★★★, p < 0.01: ★★, p < 0.05: ★) P values are taken from post hoc tests if multiple test statistics are available.
Significance is calculated using Wilcoxon test for binary outcomes and using Kruskal-Wallis and Dunn's test otherwise.