# Analyze your mycobiome with DAnIEL

DAnIEL (Describing, Analyzing and Integrating fungal Ecology to effectively study the systems of Life) is a web server for analyzing ITS amplicon sequencing data, from raw reads to publication-ready figures and tables. Interactions between fungal species and diseases, bacterial species, and immune system components can be explored. A knowledge base of clinically relevant species provides additional annotation for biological findings.

## Overview

1. Set parameters and start the pipeline
2. Quality Control: Trimming primers and low-quality bases
3. Denoising: Getting representative sequences by calling OTUs or ASVs
4. Phylotyping: Taxonomic annotation of denoised sequences
5. Feature generation: Grouping denoised sequences by taxonomic rank and count normalization
6. Correlation networks: Identification of co-abundant species, aware of sparseness and composition
7. Machine Learning: Phenotype classification using fungal abundance profiles
8. Statistics: Identification of significantly differentially abundant features
9. Annotation: Integrating significant features with known insights from existing projects and literature
10. Reporting: Creation of a single HTML file summarizing all results and methods applied

## Fungal knowledge base

• Manually curated database of fungal interactions and their associations with diseases, bacterial species, and biological molecules, focused on immune system components
• List of species found in clinical samples with a suspected fungal infection, from the German reference center NRZMyk

• Customizable and interactive plots
• Single HTML file report summarizing both results and methods

## Cite us

DAnIEL: a user-friendly web server for fungal ITS amplicon sequencing data
Daniel Loos, Lu Zhang, Christine Beemelmanns, Oliver Kurzai, and Gianni Panagiotou
Frontiers in Microbiology, doi: 10.3389/fmicb.2021.720513

Maintainer and Developer: Daniel Loos daniel.loos@leibniz-hki.de
Website: https://www.leibniz-hki.de/en/systembiologie-und-bioinformatik.html

Systems Biology and Bioinformatics
Leibniz Institute for Natural Product Research and Infection Biology
Hans Knöll Institute (HKI)

07745 Jena
Germany

# Tutorial

Here we give a brief introduction to using this web server to analyze ITS amplicon data.

The aim of this tutorial is to explain the overall workflow, not to exercise a real-world ITS analysis. For instance, the provided samples are subsampled to speed up the analysis.

## The Biology of ITS sequencing

The internal transcribed spacer (ITS) is a region in the eukaryotic genome between the genes coding for the small-subunit rRNA and the large-subunit rRNA. In fungi, the ITS1 subregion is located between the 18S and 5.8S rRNA genes, whereas ITS2 lies between the 5.8S and 28S rRNA genes. ITS regions are widely used in taxonomy for several reasons:

• The sequence has a high degree of variation, especially between closely related species
• The size of one subregion can easily be covered by paired-end (PE) NGS reads
• Universal primers are available with which most fungal clades can be easily amplified

Often there are multiple copies of ITS in the same genome. However, they tend to undergo concerted evolution via unequal crossing-over events. This preserves the sequence homogeneity within a particular genome.

Most projects in the field of metagenomics have only one of the two subregions sequenced. Comparing abundance levels between projects using different subregions is not recommended due to PCR amplification bias. Besides ITS, other marker sequences, e.g. translation elongation factor 1α (TEF1α), are also used for taxonomic profiling; these markers are not covered by this web server. However, sometimes multiple marker genes have to be sequenced for precise taxonomic annotation down to the species level. This also depends on the clade of interest: it is usually more difficult to classify molds than yeasts using the ITS region alone.

In this tutorial we will demonstrate how to analyze the different fungal communities in fecal and environmental samples. In total, 38 samples will be analyzed from these cohorts:

• 5 fecal samples from cancer patients (ENA study PRJEB33756)
• 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017)
• 13 environmental samples from a neonatal intensive care unit (NICU) (Heisel et al., 2019)

### Start a new project

If you just want to learn about the workflow, you do not need to start a new project: all results can already be accessed via the left sidebar if the current project ID is example.

1. Click on Home on the sidebar and then on Start project to create a new project.
2. This will create a new ID for your project. This ID can be used to access the results later on by clicking on Resume project. You can also create a bookmark of any page: The ID is always included in the URL.

### Composing the sample meta data table

The meta data table is a spreadsheet describing the samples. It must be provided either as an Excel spreadsheet (.xlsx) or as a comma-separated values text file (.csv). It must contain at least one column called sample_id listing the IDs of all uploaded samples. For machine learning, column names must contain alphanumerical characters only (e.g. no + or /).
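As a minimal illustration of these requirements, a header check could look like the following sketch (a hypothetical helper, not part of DAnIEL):

```python
import re

def check_header(columns):
    """Sketch of the meta data requirements above (hypothetical helper,
    not the actual DAnIEL validator): a sample_id column must exist, and
    column names should contain alphanumerical characters only
    (underscores allowed, as in sample_id itself)."""
    if "sample_id" not in columns:
        raise ValueError("meta data table needs a 'sample_id' column")
    # return the column names that would be rejected, e.g. 'pH+'
    return [c for c in columns if not re.fullmatch(r"[A-Za-z0-9_]+", c)]
```

Running `check_header(["sample_id", "project"])` returns an empty list, while a column such as `pH+value` would be reported.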

Additional samples can be added from databases like the NCBI Sequence Read Archive (SRA). The meta data table will be updated automatically for those samples. However, external samples can also be annotated via the meta data table; in that case, sample_id must be the NCBI SRA ID, e.g. SRR5098710. We will do this for the 20 fecal samples from healthy individuals of the HMP cohort (Nash et al., 2017), assigning them to the project healthy feces.

Raw read files can contain multiple samples, because this web server is able to perform demultiplexing. However, the multiplexed samples have to be declared in the meta data table. Let’s have a look at the first lines of the multiplexed raw read file muxed_S4_S5_1.fq.gz in FASTQ format:

@FCBKRH9:1:2119:15517:24531 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@FCBKRH9:1:2112:8633:25105 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC
+
CCGGGGGGGGFCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGG<CFGGGGGGGGGG
@FCBKRH9:1:2112:6602:7309 1:N:0:CAGTGCATATGC
GCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATCGAATCTTTGAACGCAC

The bar code sequence is located at the end of the header of each read (every 4th line). We need a mapping table to know which bar code belongs to which sample. In this example, bar code CAGTGCATATGC belongs to sample S4. This is usually determined during library preparation. We have to add two new columns to the meta data table for all multiplexed samples:

Part of the meta data table describing the demultiplexing:

| sample_id | barcode_seq | barcode_file |
|-----------|--------------|--------------|
| S4 | CAGTGCATATGC | muxed_S4_S5 |
| S5 | TCACGGGAGTTG | muxed_S4_S5 |

Column barcode_seq gives the bar code sequence to look for; this can be any substring of the header line. Column barcode_file indicates in which file the sample was sequenced.
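The demultiplexing idea described here can be sketched as follows (an illustration of the logic, not DAnIEL's actual implementation):

```python
def demultiplex_counts(fastq_lines, barcode_to_sample):
    """Count reads per sample by matching each bar code as a substring
    of the FASTQ header line (every 4th line, starting at the first)."""
    counts = {sample: 0 for sample in barcode_to_sample.values()}
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # skip sequence, '+', and quality lines
            continue
        for barcode, sample in barcode_to_sample.items():
            if barcode in line:  # e.g. CAGTGCATATGC -> S4
                counts[sample] += 1
                break
    return counts
```

For a real file, the lines could come from e.g. `gzip.open("muxed_S4_S5_1.fq.gz", "rt")`.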

In order to define groups, e.g. for statistics, we also define a few other columns. Columns can contain character strings (e.g. project), binary values (FALSE or TRUE, e.g. low_seqdepth), or numbers (e.g. sequencing depth seqdepth). If a column contains only unique values (e.g. library), it will be ignored in statistical analyses due to the lack of replicates. We can provide another column called pair_id used for paired testing of differences; this is needed e.g. if the same subjects were sequenced multiple times.

Here is the content of the meta data file samples.xlsx to be uploaded:

Uploaded meta data table (samples.xlsx, first 10 rows):

| sample_id | barcode_seq | barcode_file | project | disease | low_seqdepth | seqdepth | library |
|-----------|--------------|--------------|---------|---------|--------------|----------|---------|
| S1 | NA | NA | cancer feces | cancer | FALSE | 83070 | cancer_01A1 |
| S2 | NA | NA | cancer feces | cancer | FALSE | 161697 | cancer_01B1 |
| S3 | NA | NA | cancer feces | cancer | FALSE | 166414 | cancer_08A1 |
| S4 | CAGTGCATATGC | muxed_S4_S5 | cancer feces | cancer | FALSE | 87166 | cancer_09A1 |
| S5 | TCACGGGAGTTG | muxed_S4_S5 | cancer feces | cancer | FALSE | 42744 | cancer_10A1 |
| SRR5098710 | NA | NA | healthy feces | healthy | TRUE | 935 | HMP_SRR5098710 |
| SRR5098580 | NA | NA | healthy feces | healthy | TRUE | 187 | HMP_SRR5098580 |
| SRR5098736 | NA | NA | healthy feces | healthy | FALSE | 102113 | HMP_SRR5098736 |
| SRR5098677 | NA | NA | healthy feces | healthy | FALSE | 104592 | HMP_SRR5098677 |
| SRR5098500 | NA | NA | healthy feces | healthy | FALSE | 104812 | HMP_SRR5098500 |

Raw reads must be in gzip-compressed FASTQ format. In the reads directory of example.zip, there are 8 files containing raw sequencing data from all 5 fecal samples from cancer patients:

muxed_S4_S5_1.fq.gz
muxed_S4_S5_2.fq.gz
S1_1.fq.gz
S1_2.fq.gz
S2_1.fq.gz
S2_2.fq.gz
S3_1.fq.gz
S3_2.fq.gz

• Raw reads must originate from Illumina® paired-end (PE) sequencing machines.
• Forward and reverse mates must be uploaded in two separate files.
• Only paired FASTQ files following the pattern {sample id}_{mate}.{file extension} are accepted. The file extension must be one of .fastq, .fq, .fastq.gz or .fq.gz. The mate must be either 1 or 2, indicating the forward and reverse mate, respectively. Example for a single sample: sample1_1.fq.gz. Example for multiplexed samples: muxed_S4_S5_2.fq.gz.
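The file name convention above can be checked with a small regular expression (a sketch; the exact pattern used internally by DAnIEL is an assumption):

```python
import re

# Accepted pattern: {sample id}_{mate}.{file extension},
# where mate 1 is the forward and mate 2 the reverse read file.
READ_FILE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.(fastq|fq)(\.gz)?$")

def parse_read_file(name):
    """Return (sample_id, mate) for an accepted file name, else None."""
    match = READ_FILE.match(name)
    if match is None:
        return None
    return match.group("sample"), int(match.group("mate"))
```

For example, `parse_read_file("muxed_S4_S5_2.fq.gz")` yields `("muxed_S4_S5", 2)`.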

We will only upload the 5 fecal samples from cancer patients. Click on Input to upload new samples to your project. Here we can upload both the raw read files generated by the sequencing machine and the table containing meta data about the samples:

Please wait until all files have been uploaded. A progress bar underneath the button indicates the upload status.

The other two cohorts have already been uploaded to the NCBI Sequence Read Archive (SRA). This webserver is able to access this database directly. We have two options to add samples from external cohorts:

#### SRA Accessions

Every sequencing run stored in the SRA has a unique ID starting with either SRR, ERR or DRR followed by a number. We must enter one sample per line in the text box SRA accessions. In this example, we insert all 20 fecal samples from healthy individuals:

SRR5098710
SRR5098580
SRR5098736
SRR5098677
SRR5098500
SRR5098394
SRR5098370
SRR5098413
SRR5098489
SRR5098436
SRR5098698
SRR5098401
SRR5098363
SRR5098721
SRR5098511
SRR5098498
SRR5098398
SRR5098371
SRR5098341
SRR5098421
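A quick way to sanity-check such a list before pasting it (a hypothetical helper, not part of the web server):

```python
import re

def is_sra_run(accession):
    """Check the run ID format described above: SRR, ERR or DRR
    followed by a number."""
    return re.fullmatch(r"(SRR|ERR|DRR)\d+", accession) is not None
```

Note that BioProject accessions such as PRJNA505509 are not run IDs and would be rejected here.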

#### External projects

Many ITS projects of the SRA are already downloaded and preprocessed. We will select the project hospital metagenome Targeted loci environmental (PRJNA505509) in the drop-down menu External projects by clicking on the box and typing hospital to search for the study. Please click Projects on the sidebar to browse through all available projects.

We can subset samples from external cohorts by any meta data variable displayed at Projects. This can be achieved using a filter query. We will enter the query scientific_name == 'hospital metagenome' & total_spots > 30e3 into the text box Filter query for external projects. This will include only the environmental samples of the NICU project that have at least 30×10³ raw reads. Character strings must be single-quoted. == is used to indicate equality. > and < can be used to filter numerical attributes. Single terms can be combined using & or | to yield the sample-wise intersection or union of these terms, respectively. Please note that both %>% and parentheses are blocked in this field for security reasons. We can always filter samples according to these attributes:

Attributes available in all external projects to filter samples:

| Attribute | Description |
|-----------|-------------|
| total_bases | Number of raw bases |
| size | File size in bytes |
| biosample_id | NCBI accession of the biological sample, starting with SAM |
| scientific_name | Taxon name of the sample, e.g. human gut metagenome |
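The filter query operators resemble pandas' query syntax; the following toy table (made-up run names, not real SRA data) illustrates what the query above would select, assuming such a backend:

```python
import pandas as pd

# Toy sample table with made-up run names (illustration only)
samples = pd.DataFrame({
    "run": ["run1", "run2", "run3"],
    "scientific_name": ["hospital metagenome", "hospital metagenome",
                        "human gut metagenome"],
    "total_spots": [50_000, 10_000, 80_000],
})

# Same operators as the web server's filter query: ==, >, <, &, |
selected = samples.query(
    "scientific_name == 'hospital metagenome' & total_spots > 30e3"
)
```

Only the first row survives both conditions: run2 is too shallow, run3 has a different scientific_name.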

A click on Add external samples will add these samples to the project. The column project will be filled with the NCBI BioProject ID.

## Start: Setting parameters and starting the pipeline

First, click on Start on the sidebar. Here we can run the pipeline with default parameters by just clicking on Start pipeline. This takes around half an hour for the example cohort on our servers using 5 CPU threads; depending on the number of samples and parameters, it can take up to a couple of hours. Please bookmark the page or remember the project ID to view the results later on. Results can be accessed on the sidebar above the results menu once analyzed. We also have the option to create and select our own parameter sets. A complete reference for all input elements and parameters is given at Reference on the sidebar.

We can customize the analysis workflow by creating our own parameter set for each step. Each parameter set must be given a unique name consisting of alphanumerical characters and spaces.

The default ITS sub region is ITS1. If you have ITS2 data, you have to adjust the denoising parameter set!

Here is a brief summary of the parameters changed in the example project. A complete description of all parameters can be accessed by clicking Reference on the sidebar.

• Quality Control
• Aim: Remove sequencing errors by trimming and filtering the raw reads

• We want our own primers, ITS3 and ITS4, to be removed. Therefore we enter these primers in the Sequences to trim text box:

>Forward primer ITS3
GCATCGATGAAGAACGCAGC
>Reverse primer ITS4
TCCTCCGCTTATTGATATGC

These sequences must be in FASTA format.

• To remove the primers on both strands, ensure that the checkbox Include reverse complement of primer sequences is checked
• Denoising
• We will use Amplicon Sequence Variants (ASVs) as the denoising method. Operational taxonomic units (OTUs) can be used as well.
• Our example project covers the ITS2 sub region. Therefore we click on the radio button ITS2.
• Phylotyping
• Aim of this step: Taxonomic annotation of these denoised representative sequences
• We stay with the default parameters here
• Feature generation
• Aim of this step: Normalization, filtering and pooling denoised abundance profiles
• We stay with the default parameters here. This includes pooling the abundances at genus rank, because many fungal clades cannot be determined down to the species rank using the ITS region alone. However, it is possible to select the species rank if the data is trusted.
• Analysis
• Aim of this step: Creating statistics, Machine Learning, and correlation networks based on these features
• Performing statistics and Machine Learning on hundreds of sample attributes can be very time-consuming. Therefore, we can select the particular analysis groupings of interest; in this case: project, disease, and low_seqdepth.
• Building correlation networks on all samples is meaningless in heterogeneous cohorts like ours. We want to compare correlations between healthy and diseased samples. Thus we select disease as the correlation grouping.
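The Include reverse complement of primer sequences option mentioned under Quality Control can be illustrated with a short sketch (assumed behavior: the reverse complement of each primer is also searched for and trimmed):

```python
# Reverse complement of a DNA primer (standard definition)
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

# Forward primer ITS3 from the FASTA box above
its3_rc = reverse_complement("GCATCGATGAAGAACGCAGC")
```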

## Logs

Once the pipeline has started, we can view the logs of the workflow engine snakemake by clicking Logs on the sidebar. This is an optional feature of the web server and useful for debugging.

## Result pages

Once a step is finished, a result page will appear in the results section of the sidebar menu. All result pages follow the same structure:

• Brief description of the methods and parameters applied
• Results, including summary bullet points and interactive figures
• An optional notes section with comments on how to interpret the results and which errors might have occurred
• A list of literature references for the tools used

## Quality Control (QC)

Quality Control (QC) is the process of removing low-quality bases from the raw reads. This is done by utilizing the Phred quality score as calculated during base calling. Furthermore, primer and adapter sequences are cut.

This is the first analysis result which will appear at the result section. The test matrix indicates whether a sample failed, raised a warning, or passed a particular QC test.

By default and in our example project, samples will be excluded from downstream analysis, if any of these tests failed:

• Adapter content is too high
• Per base N content is too high
• Per base sequence quality is too low
• Per sequence quality scores are too low
• Minimum number of QC reads < 1000

All tests except the last one are calculated using FastQC. In this table, we can see that 3 samples were excluded due to too shallow sequencing. Indeed, samples SRR5098710 and SRR5098580 do not satisfy this requirement even at the raw-read level. They were included in this example project to demonstrate both the functionality of this web server and the fact that not every sample uploaded to the SRA database is of good quality.
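The last test, the minimum read count, is simple enough to sketch directly (seqdepth values taken from the meta data table above):

```python
def passes_min_reads(read_counts, threshold=1000):
    """Flag samples whose number of QC reads falls below the threshold;
    flagged samples are excluded from downstream analysis."""
    return {sample: n >= threshold for sample, n in read_counts.items()}

# SRR5098710 (935 reads) and SRR5098580 (187 reads) fail this test
flags = passes_min_reads({"SRR5098710": 935, "SRR5098580": 187, "S1": 83070})
```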

excluded samples
sample_id reason