This QIIME 2 plugin supports methods for assessing and controlling the quality of feature and sequence data.

version: 2024.10.0
website: https://github.com/qiime2/q2-quality-control
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org

Actions

Name                       Type        Short Description
exclude-seqs               method      Exclude sequences by alignment
filter-reads               method      Filter demultiplexed sequences by alignment to reference database.
bowtie2-build              method      Build bowtie2 index from reference sequences.
decontam-identify          method      Identify contaminants
decontam-remove            method      Remove contaminants
evaluate-composition       visualizer  Evaluate expected vs. observed taxonomic composition of samples
evaluate-seqs              visualizer  Compare query (observed) vs. reference (expected) sequences.
evaluate-taxonomy          visualizer  Evaluate expected vs. observed taxonomic assignments
decontam-score-viz         visualizer  Generate a histogram representation of the scores
decontam-identify-batches  pipeline    Identify contaminants in Batch Mode

Artifact Classes

FeatureData[DecontamScore]

Formats

DecontamScoreFormat
DecontamScoreDirFmt


quality-control exclude-seqs

This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified perc_identity, evalue, and perc_query_aligned. This method could be used to define a positive filter, e.g., extract only feature sequences that align to a certain clade of bacteria; or to define a negative filter, e.g., identify sequences that align to contaminant or human DNA sequences that should be excluded from subsequent analyses. Note that filtering is performed based on the perc_identity, perc_query_aligned, and evalue thresholds (the latter only if method==BLAST and an evalue is set). Set perc_identity==0 and/or perc_query_aligned==0 to disable these filtering thresholds as necessary.
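
The thresholding described above can be sketched in plain Python. This is an illustration only, not the plugin's implementation: the Alignment record and the partition_hits function are hypothetical names, and the sketch assumes each query has at most one best alignment, summarised by its fractional identity, the fraction of the query that aligned, and an optional E value.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Alignment:
    """Hypothetical summary of one query's best alignment to the reference."""
    identity: float        # fraction of identical positions, 0.0 to 1.0
    query_aligned: float   # fraction of the query covered by the alignment
    evalue: Optional[float] = None  # only meaningful for BLAST-style searches


def partition_hits(
    query_ids: Set[str],
    best_alignments: Dict[str, Alignment],
    perc_identity: float = 0.97,
    perc_query_aligned: float = 0.97,
    evalue: Optional[float] = None,
) -> Tuple[Set[str], Set[str]]:
    """Split query IDs into (hits, misses) using the three thresholds."""
    hits = set()
    for qid in query_ids:
        aln = best_alignments.get(qid)
        if aln is None:
            continue  # no alignment at all: ends up in the misses
        if aln.identity < perc_identity:
            continue  # identity below threshold
        if aln.query_aligned < perc_query_aligned:
            continue  # too little of the query aligned
        # The E value filter applies only when a threshold is set.
        if evalue is not None and aln.evalue is not None and aln.evalue > evalue:
            continue
        hits.add(qid)
    return hits, query_ids - hits
```

Passing perc_identity=0.0 and perc_query_aligned=0.0 makes every aligned query a hit, which mirrors the advice above on disabling those thresholds.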

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

method: Str % Choices('blast', 'blastn-short') | Str % Choices('vsearch')

Alignment method to use for matching feature sequences against reference sequences[default: 'blast']

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to reference is lower. Must be in range [0.0, 1.0][default: 0.97]

evalue: Float

BLAST expectation (E) value threshold for saving hits. Reject if E value is higher than threshold. This threshold is disabled by default.[optional]

perc_query_aligned: Float

Percent of query sequence that must align to reference in order to be accepted as a hit.[default: 0.97]

threads: Threads

Number of threads to use. Only applies to vsearch method.[default: 1]

left_justify: Bool % Choices(False) | Bool

Reject match if the pairwise alignment begins with gaps[default: False]

Outputs

sequence_hits: FeatureData[Sequence]

Subset of feature sequences that align to reference sequences[required]

sequence_misses: FeatureData[Sequence]

Subset of feature sequences that do not align to reference sequences[required]


quality-control filter-reads

Filter out (or keep) demultiplexed single- or paired-end sequences that align to a reference database, using bowtie2 and samtools. This method can be used to filter out human DNA sequences and other contaminants in any FASTQ sequence data (e.g., shotgun genome or amplicon sequence data), or alternatively (when exclude_seqs is False) to only keep sequences that do align to the reference.
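
The effect of exclude_seqs can be sketched as follows. This is a simplified illustration rather than the plugin's code: filter_read_ids is a hypothetical helper, the set of aligned read IDs is assumed to have been computed elsewhere (the plugin itself relies on bowtie2 and samtools), and paired-end handling is ignored.

```python
from typing import Iterable, List, Set


def filter_read_ids(
    read_ids: Iterable[str],
    aligned_ids: Set[str],
    exclude_seqs: bool = True,
) -> List[str]:
    """Return the read IDs that survive filtering.

    exclude_seqs=True drops reads that aligned to the reference;
    exclude_seqs=False keeps only the reads that aligned.
    """
    if exclude_seqs:
        return [r for r in read_ids if r not in aligned_ids]
    return [r for r in read_ids if r in aligned_ids]
```

With a host genome as the reference, the default removes host reads; with a target clade as the reference, exclude_seqs=False retains only reads from that clade.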

Citations

Langmead & Salzberg, 2012; Li et al., 2009

Inputs

demultiplexed_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The sequences to be filtered.[required]

database: Bowtie2Index

Bowtie2 indexed database.[required]

Parameters

n_threads: Threads

Number of alignment threads to launch.[default: 1]

mode: Str % Choices('local', 'global')

Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']

sensitivity: Str % Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')

Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']

ref_gap_open_penalty: Int % Range(1, None)

Reference gap open penalty.[default: 5]

ref_gap_ext_penalty: Int % Range(1, None)

Reference gap extend penalty.[default: 3]

exclude_seqs: Bool

Exclude sequences that align to reference. Set this option to False to exclude sequences that do not align to the reference database.[default: True]

Outputs

filtered_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The resulting filtered sequences.[required]


quality-control bowtie2-build

Build bowtie2 index from reference sequences.

Citations

Langmead & Salzberg, 2012

Inputs

sequences: FeatureData[Sequence]

Reference sequences used to build bowtie2 index.[required]

Parameters

n_threads: Threads

Number of threads to launch.[default: 1]

Outputs

database: Bowtie2Index

Bowtie2 index.[required]


quality-control decontam-identify

This method identifies contaminant sequences from an OTU or ASV table and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Method used to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants. Frequency: uses sample concentration information to identify contaminants. Combined: uses both the Prevalence and Frequency methods to identify contaminants.[default: 'prevalence']

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Identifier that marks control samples in the control column (e.g., "control" or "blank")[optional]

Outputs

decontam_scores: FeatureData[DecontamScore]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]
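
The optional column parameters appear to pair up with the chosen method. The sketch below checks that pairing before a run. It is an assumption drawn from the parameter descriptions above, not a documented rule of the plugin, and REQUIRED_BY_METHOD and missing_parameters are hypothetical names.

```python
from typing import Dict, List, Optional

# Assumed mapping from method to the optional parameters it relies on.
REQUIRED_BY_METHOD: Dict[str, List[str]] = {
    "prevalence": ["prev_control_column", "prev_control_indicator"],
    "frequency": ["freq_concentration_column"],
    "combined": [
        "prev_control_column",
        "prev_control_indicator",
        "freq_concentration_column",
    ],
}


def missing_parameters(method: str, **params: Optional[str]) -> List[str]:
    """List the parameters the chosen method needs but that were not given."""
    if method not in REQUIRED_BY_METHOD:
        raise ValueError(f"Unknown method: {method!r}")
    return [name for name in REQUIRED_BY_METHOD[method] if not params.get(name)]
```

An empty list means every column the method is assumed to need has been supplied.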


quality-control decontam-remove

Remove contaminant sequences from a feature table and the associated representative sequences.

Inputs

decontam_scores: FeatureData[DecontamScore]

Per-feature decontam scores.[required]

table: FeatureTable[Frequency]

Feature table from which contaminants will be removed.[required]

rep_seqs: FeatureData[Sequence]

Feature representative sequences from which contaminants will be removed.[required]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Decontam score threshold. Features with a score less than or equal to this threshold will be removed.[default: 0.1]

Outputs

filtered_table: FeatureTable[Frequency]

Feature table with contaminants removed.[required]

filtered_rep_seqs: FeatureData[Sequence]

Feature representative sequences with contaminants removed.[required]
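
The threshold rule for decontam-remove can be sketched on plain dictionaries. This is an illustration under stated assumptions, not the plugin's code: remove_contaminants is a hypothetical function, the table is modelled as sample -> {feature ID: count}, and features without a score (None) are assumed to be kept.

```python
from typing import Dict, Optional, Tuple

Table = Dict[str, Dict[str, int]]


def remove_contaminants(
    scores: Dict[str, Optional[float]],
    table: Table,
    rep_seqs: Dict[str, str],
    threshold: float = 0.1,
) -> Tuple[Table, Dict[str, str]]:
    """Drop features whose score is less than or equal to the threshold."""
    contaminants = {
        fid for fid, score in scores.items()
        if score is not None and score <= threshold
    }
    filtered_table = {
        sample: {fid: n for fid, n in counts.items() if fid not in contaminants}
        for sample, counts in table.items()
    }
    filtered_seqs = {
        fid: seq for fid, seq in rep_seqs.items() if fid not in contaminants
    }
    return filtered_table, filtered_seqs
```

Note the direction of the comparison: low scores indicate likely contaminants, so raising the threshold removes more features.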


quality-control evaluate-composition

This visualizer compares the feature composition of pairs of observed and expected samples that share the same sample ID in two separate feature tables. Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations. Taxon accuracy rate, taxon detection rate, and linear regression scores between expected and observed observations are calculated at each semicolon-delimited rank, and plots of per-level accuracy and observation correlations are generated. A histogram of distance between false positive observations and the nearest expected feature is also generated, where distance equals the number of rank differences between the observed feature and the nearest common lineage in the expected feature. This visualizer is most suitable for testing per-run data quality on sequencing runs that contain mock communities or other samples with known composition. It is also suitable for sanity checks of bioinformatics pipeline performance.
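
As a rough sketch of the per-rank scoring, the following computes taxon accuracy rate (TAR) and taxon detection rate (TDR) at a given depth, using the definitions given for the plot_tar and plot_tdr parameters. The helper names truncate and tar_tdr are hypothetical, and treating each sample as a plain set of lineage strings is a simplifying assumption; the plugin works on feature tables.

```python
from typing import Iterable, Tuple


def truncate(taxon: str, depth: int) -> str:
    """Keep the first `depth` semicolon-delimited ranks of a lineage."""
    ranks = [r.strip() for r in taxon.split(";")]
    return ";".join(ranks[:depth])


def tar_tdr(
    expected: Iterable[str], observed: Iterable[str], depth: int
) -> Tuple[float, float]:
    """Return (TAR, TDR) for lineages collapsed to the given depth."""
    exp = {truncate(t, depth) for t in expected}
    obs = {truncate(t, depth) for t in observed}
    tp = len(exp & obs)   # true positives: expected and observed
    fp = len(obs - exp)   # false positives: observed but not expected
    fn = len(exp - obs)   # false negatives: expected but not observed
    tar = tp / (tp + fp) if (tp + fp) else 0.0
    tdr = tp / (tp + fn) if (tp + fn) else 0.0
    return tar, tdr
```

Calling tar_tdr once per depth from 1 up to the depth parameter gives the per-level scores that the score plot summarises.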

Citations

Bokulich et al., 2018

Inputs

expected_features: FeatureTable[RelativeFrequency]

Expected feature compositions[required]

observed_features: FeatureTable[RelativeFrequency]

Observed feature compositions[required]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[default: 7]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

plot_tar: Bool

Plot taxon accuracy rate (TAR) on score plot. TAR is the number of true positive features divided by the total number of observed features (TAR = true positives / (true positives + false positives)).[default: True]

plot_tdr: Bool

Plot taxon detection rate (TDR) on score plot. TDR is the number of true positive features divided by the total number of expected features (TDR = true positives / (true positives + false negatives)).[default: True]

plot_r_value: Bool

Plot expected vs. observed linear regression r value on score plot.[default: False]

plot_r_squared: Bool

Plot expected vs. observed linear regression r-squared value on score plot.[default: True]

plot_bray_curtis: Bool

Plot expected vs. observed Bray-Curtis dissimilarity scores on score plot.[default: False]

plot_jaccard: Bool

Plot expected vs. observed Jaccard distance scores on score plot.[default: False]

plot_observed_features: Bool

Plot observed features count on score plot.[default: False]

plot_observed_features_ratio: Bool

Plot ratio of observed:expected features on score plot.[default: True]

metadata: MetadataColumn[Categorical]

Optional sample metadata that maps observed_features sample IDs to expected_features sample IDs.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-seqs

This action aligns a set of query (e.g., observed) sequences against a set of reference (e.g., expected) sequences to evaluate the quality of alignment. The intended use is to align observed sequences against expected sequences (e.g., from a mock community) to determine the frequency of mismatches between observed sequences and the most similar expected sequences, e.g., as a measure of sequencing/method error. However, any sequences may be provided as input to generate a report on pairwise alignment quality against a set of reference sequences.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

show_alignments: Bool

Option to plot pairwise alignments of query sequences and their top hits.[default: False]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-taxonomy

This visualizer compares a pair of observed and expected taxonomic assignments to calculate precision, recall, and F-measure at each taxonomic level, up to the maximum level specified by the depth parameter. These metrics are calculated at each semicolon-delimited rank. This action is useful for comparing the accuracy of taxonomic assignment, e.g., between different taxonomy classifiers or other bioinformatics methods. Expected taxonomies should be derived from simulated or mock community sequences that have known taxonomic affiliations.
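
One way to picture these metrics is the sketch below. The counting scheme is an assumption made for illustration and may differ from what the plugin does: a feature counts at a level only if its lineage reaches that depth, a true positive is a feature whose observed and expected lineages match at that depth, and the scores are unweighted. The names pairs_at_depth and precision_recall_f are hypothetical.

```python
from typing import Dict, Set, Tuple


def pairs_at_depth(taxa: Dict[str, str], depth: int) -> Set[Tuple[str, str]]:
    """Return (feature ID, lineage) pairs for features classified to `depth`."""
    pairs = set()
    for feature_id, lineage in taxa.items():
        ranks = [r.strip() for r in lineage.split(";") if r.strip()]
        if len(ranks) >= depth:
            pairs.add((feature_id, ";".join(ranks[:depth])))
    return pairs


def precision_recall_f(
    expected: Dict[str, str], observed: Dict[str, str], depth: int
) -> Tuple[float, float, float]:
    """Unweighted precision, recall, and F-measure at one taxonomic depth."""
    exp = pairs_at_depth(expected, depth)
    obs = pairs_at_depth(observed, depth)
    tp = len(exp & obs)
    precision = tp / len(obs) if obs else 0.0
    recall = tp / len(exp) if exp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Under this scheme an underclassified feature (observed lineage shallower than the depth being tested) lowers recall without lowering precision.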

Citations

Bokulich et al., 2018

Inputs

expected_taxa: FeatureData[Taxonomy]

Expected taxonomic assignments[required]

observed_taxa: FeatureData[Taxonomy]

Observed taxonomic assignments[required]

feature_table: FeatureTable[RelativeFrequency]

Optional feature table containing relative frequency of each feature, used to weight accuracy scores by frequency. Must contain all features found in expected and/or observed taxa. Features found in the table but not the expected/observed taxa will be dropped prior to analysis.[optional]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[required]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

require_exp_ids: Bool

Require that all features found in observed taxa must be found in expected taxa or raise error.[default: True]

require_obs_ids: Bool

Require that all features found in expected taxa must be found in observed taxa or raise error.[default: True]

sample_id: Str

Optional sample ID to use for extracting frequency data from feature table, and for labeling accuracy results. If no sample_id is provided, feature frequencies are derived from the sum of all samples present in the feature table.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-score-viz

Creates a histogram based on the output of decontam-identify.

Inputs

decontam_scores: Collection[FeatureData[DecontamScore]]

Output from decontam-identify to be visualized[required]

table: Collection[FeatureTable[Frequency]]

Raw OTU/ASV table that was used as input to decontam-identify[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float % Range(0.0, 1.0, inclusive_end=True)

Select bin size for the histogram[default: 0.02]

Outputs

visualization: Visualization

<no description>[required]
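
The roles of bin_size and weighted can be sketched with a small binning routine. This illustrates the general idea only and is not the visualizer's code: score_histogram is a hypothetical function, scores are assumed to lie in [0, 1], and the weighting is assumed to use each feature's total read count.

```python
from typing import Dict, List


def score_histogram(
    scores: Dict[str, float],
    read_counts: Dict[str, int],
    bin_size: float = 0.02,
    weighted: bool = True,
) -> List[float]:
    """Bin decontam scores over [0, 1]; optionally weight by read count."""
    if not 0.0 < bin_size <= 1.0:
        raise ValueError("bin_size must be in (0, 1]")
    n_bins = max(1, round(1.0 / bin_size))
    hist = [0.0] * n_bins
    for feature_id, score in scores.items():
        # Clamp so that a score of exactly 1.0 falls in the last bin.
        idx = min(int(score / bin_size), n_bins - 1)
        hist[idx] += read_counts.get(feature_id, 0) if weighted else 1
    return hist
```

With weighting on, a single abundant feature can dominate a bin; with weighting off, each feature contributes one count regardless of abundance.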


quality-control decontam-identify-batches

This pipeline splits an OTU or ASV table into batches based on the given metadata, identifies contaminant sequences in each batch, and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

split_column: Str

Input metadata columns by which to subset the ASV table. Note: Column names must be in quotes and delimited by a space.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Method used to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants. Frequency: uses sample concentration information to identify contaminants. Combined: uses both the Prevalence and Frequency methods to identify contaminants.[required]

filter_empty_features: Bool

If true, features which are not present in a split feature table are dropped.[optional]

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Identifier that marks control samples in the control column (e.g., "control" or "blank")[optional]

threshold: Float

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float

Select bin size for the histogram[default: 0.02]

Outputs

batch_subset_tables: Collection[FeatureTable[Frequency]]

Directory where feature tables split based on metadata and parameter split_column values should be written.[required]

decontam_scores: Collection[FeatureData[DecontamScore]]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]

score_histograms: Visualization

The visualizer histograms for all decontam score objects generated by the pipeline[required]
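
The batching step can be pictured with the sketch below, which groups samples by a single metadata column and subsets the table per batch. It is a simplified illustration, not the pipeline's implementation: split_samples and subset_table are hypothetical names, metadata is modelled as sample ID -> {column: value}, and the table as sample ID -> {feature ID: count}.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Table = Dict[str, Dict[str, int]]


def split_samples(
    metadata: Dict[str, Dict[str, str]], split_column: str
) -> Dict[str, List[str]]:
    """Group sample IDs by their value in split_column."""
    batches = defaultdict(list)
    for sample_id, row in metadata.items():
        if split_column not in row:
            raise KeyError(f"{split_column!r} missing for sample {sample_id!r}")
        batches[row[split_column]].append(sample_id)
    return dict(batches)


def subset_table(
    table: Table, sample_ids: Iterable[str], filter_empty_features: bool = True
) -> Table:
    """Keep the given samples; optionally drop features absent from all of them."""
    subset = {s: dict(table[s]) for s in sample_ids if s in table}
    if filter_empty_features:
        present = {
            f for counts in subset.values() for f, n in counts.items() if n > 0
        }
        subset = {
            s: {f: n for f, n in counts.items() if f in present}
            for s, counts in subset.items()
        }
    return subset
```

Each per-batch table would then be scored separately, which is why the outputs above are collections rather than single artifacts.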

This QIIME 2 plugin supports methods for assessing and controlling the quality of feature and sequence data.

version: 2024.10.0
website: https://github.com/qiime2/q2-quality-control
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org

Actions

NameTypeShort Description
exclude-seqsmethodExclude sequences by alignment
filter-readsmethodFilter demultiplexed sequences by alignment to reference database.
bowtie2-buildmethodBuild bowtie2 index from reference sequences.
decontam-identifymethodIdentify contaminants
decontam-removemethodRemove contaminants
evaluate-compositionvisualizerEvaluate expected vs. observed taxonomic composition of samples
evaluate-seqsvisualizerCompare query (observed) vs. reference (expected) sequences.
evaluate-taxonomyvisualizerEvaluate expected vs. observed taxonomic assignments
decontam-score-vizvisualizerGenerate a histogram representation of the scores
decontam-identify-batchespipelineIdentify contaminants in Batch Mode

Artifact Classes

FeatureData[DecontamScore]

Formats

DecontamScoreFormat
DecontamScoreDirFmt


quality-control exclude-seqs

This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified perc_identity, evalue, and perc_query_aligned. This method could be used to define a positive filter, e.g., extract only feature sequences that align to a certain clade of bacteria; or to define a negative filter, e.g., identify sequences that align to contaminant or human DNA sequences that should be excluded from subsequent analyses. Note that filtering is performed based on the perc_identity, perc_query_aligned, and evalue thresholds (the latter only if method==BLAST and an evalue is set). Set perc_identity==0 and/or perc_query_aligned==0 to disable these filtering thresholds as necessary.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

method: Str % Choices('blast', 'blastn-short') | Str % Choices('vsearch')

Alignment method to use for matching feature sequences against reference sequences[default: 'blast']

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to reference is lower. Must be in range [0.0, 1.0][default: 0.97]

evalue: Float

BLAST expectation (E) value threshold for saving hits. Reject if E value is higher than threshold. This threshold is disabled by default.[optional]

perc_query_aligned: Float

Percent of query sequence that must align to reference in order to be accepted as a hit.[default: 0.97]

threads: Threads

Number of threads to use. Only applies to vsearch method.[default: 1]

left_justify: Bool % Choices(False) | Bool

Reject match if the pairwise alignment begins with gaps[default: False]

Outputs

sequence_hits: FeatureData[Sequence]

Subset of feature sequences that align to reference sequences[required]

sequence_misses: FeatureData[Sequence]

Subset of feature sequences that do not align to reference sequences[required]


quality-control filter-reads

Filter out (or keep) demultiplexed single- or paired-end sequences that align to a reference database, using bowtie2 and samtools. This method can be used to filter out human DNA sequences and other contaminants in any FASTQ sequence data (e.g., shotgun genome or amplicon sequence data), or alternatively (when exclude_seqs is False) to only keep sequences that do align to the reference.

Citations

Langmead & Salzberg, 2012; Li et al., 2009

Inputs

demultiplexed_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The sequences to be trimmed.[required]

database: Bowtie2Index

Bowtie2 indexed database.[required]

Parameters

n_threads: Threads

Number of alignment threads to launch.[default: 1]

mode: Str % Choices('local', 'global')

Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']

sensitivity: Str % Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')

Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']

ref_gap_open_penalty: Int % Range(1, None)

Reference gap open penalty.[default: 5]

ref_gap_ext_penalty: Int % Range(1, None)

Reference gap extend penalty.[default: 3]

exclude_seqs: Bool

Exclude sequences that align to reference. Set this option to False to exclude sequences that do not align to the reference database.[default: True]

Outputs

filtered_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The resulting filtered sequences.[required]


quality-control bowtie2-build

Build bowtie2 index from reference sequences.

Citations

Langmead & Salzberg, 2012

Inputs

sequences: FeatureData[Sequence]

Reference sequences used to build bowtie2 index.[required]

Parameters

n_threads: Threads

Number of threads to launch.[default: 1]

Outputs

database: Bowtie2Index

Bowtie2 index.[required]


quality-control decontam-identify

This method identifies contaminant sequences from an OTU or ASV table and reports them to the user

Inputs

table: FeatureTable[Frequency]

Feature table which contaminate sequences will be identified from[required]

Parameters

metadata: Metadata

metadata file indicating which samples in the experiment are control samples, assumes sample names in file correspond to the table input parameter[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select how to which method to id contaminants with; Prevalence: Utilizes control ASVs/OTUs to identify contaminants, Frequency: Utilizes sample concentration information to identify contaminants, Combined: Utilizes both Prevalence and Frequency methods when identifying contaminants[default: 'prevalence']

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

indicate the control sample identifier (e.g. "control" or "blank")[optional]

Outputs

decontam_scores: FeatureData[DecontamScore]

The resulting table of scores from the decontam algorithm which scores each feature on how likely they are to be a contaminant sequence[required]


quality-control decontam-remove

Remove contaminant sequences from a feature table and the associated representative sequences.

Inputs

decontam_scores: FeatureData[DecontamScore]

Pre-feature decontam scores.[required]

table: FeatureTable[Frequency]

Feature table from which contaminants will be removed.[required]

rep_seqs: FeatureData[Sequence]

Feature representative sequences from which contaminants will be removed.[required]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Decontam score threshold. Features with a score less than or equal to this threshold will be removed.[default: 0.1]

Outputs

filtered_table: FeatureTable[Frequency]

Feature table with contaminants removed.[required]

filtered_rep_seqs: FeatureData[Sequence]

Feature representative sequences with contaminants removed.[required]


quality-control evaluate-composition

This visualizer compares the feature composition of pairs of observed and expected samples containing the same sample ID in two separate feature tables. Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations. Taxon accuracy rate, taxon detection rate, and linear regression scores between expected and observed observations are calculated at each semicolon-delimited rank, and plots of per-level accuracy and observation correlations are plotted. A histogram of distance between false positive observations and the nearest expected feature is also generated, where distance equals the number of rank differences between the observed feature and the nearest common lineage in the expected feature. This visualizer is most suitable for testing per-run data quality on sequencing runs that contain mock communities or other samples with known composition. Also suitable for sanity checks of bioinformatics pipeline performance.

Citations

Bokulich et al., 2018

Inputs

expected_features: FeatureTable[RelativeFrequency]

Expected feature compositions[required]

observed_features: FeatureTable[RelativeFrequency]

Observed feature compositions[required]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[default: 7]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

plot_tar: Bool

Plot taxon accuracy rate (TAR) on score plot. TAR is the number of true positive features divided by the total number of observed features (TAR = true positives / (true positives + false positives)).[default: True]

plot_tdr: Bool

Plot taxon detection rate (TDR) on score plot. TDR is the number of true positive features divided by the total number of expected features (TDR = true positives / (true positives + false negatives)).[default: True]

plot_r_value: Bool

Plot expected vs. observed linear regression r value on score plot.[default: False]

plot_r_squared: Bool

Plot expected vs. observed linear regression r-squared value on score plot.[default: True]

plot_bray_curtis: Bool

Plot expected vs. observed Bray-Curtis dissimilarity scores on score plot.[default: False]

plot_jaccard: Bool

Plot expected vs. observed Jaccard distances scores on score plot.[default: False]

plot_observed_features: Bool

Plot observed features count on score plot.[default: False]

plot_observed_features_ratio: Bool

Plot ratio of observed:expected features on score plot.[default: True]

metadata: MetadataColumn[Categorical]

Optional sample metadata that maps observed_features sample IDs to expected_features sample IDs.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-seqs

This action aligns a set of query (e.g., observed) sequences against a set of reference (e.g., expected) sequences to evaluate the quality of alignment. The intended use is to align observed sequences against expected sequences (e.g., from a mock community) to determine the frequency of mismatches between observed sequences and the most similar expected sequences, e.g., as a measure of sequencing/method error. However, any sequences may be provided as input to generate a report on pairwise alignment quality against a set of reference sequences.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

show_alignments: Bool

Option to plot pairwise alignments of query sequences and their top hits.[default: False]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-taxonomy

This visualizer compares a pair of observed and expected taxonomic assignments to calculate precision, recall, and F-measure at each taxonomic level, up to maximum level specified by the depth parameter. These metrics are calculated at each semicolon-delimited rank. This action is useful for comparing the accuracy of taxonomic assignment, e.g., between different taxonomy classifiers or other bioinformatics methods. Expected taxonomies should be derived from simulated or mock community sequences that have known taxonomic affiliations.

Citations

Bokulich et al., 2018

Inputs

expected_taxa: FeatureData[Taxonomy]

Expected taxonomic assignments[required]

observed_taxa: FeatureData[Taxonomy]

Observed taxonomic assignments[required]

feature_table: FeatureTable[RelativeFrequency]

Optional feature table containing relative frequency of each feature, used to weight accuracy scores by frequency. Must contain all features found in expected and/or observed taxa. Features found in the table but not the expected/observed taxa will be dropped prior to analysis.[optional]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[required]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

require_exp_ids: Bool

Require that all features found in observed taxa must be found in expected taxa or raise error.[default: True]

require_obs_ids: Bool

Require that all features found in expected taxa must be found in observed taxa or raise error.[default: True]

sample_id: Str

Optional sample ID to use for extracting frequency data from feature table, and for labeling accuracy results. If no sample_id is provided, feature frequencies are derived from the sum of all samples present in the feature table.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-score-viz

Creates histogram based on the output of decontam identify

Inputs

decontam_scores: Collection[FeatureData[DecontamScore]]

Output from decontam identify to be visualized[required]

table: Collection[FeatureTable[Frequency]]

Raw OTU/ASV table that was used as input to decontam-identify[required]

rep_seqs: FeatureData[Sequence]

Representative Sequences table which contaminate sequences will be removed from[optional]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

weight the decontam scores by their associated read number[default: True]

bin_size: Float % Range(0.0, 1.0, inclusive_end=True)

Select bin size for the histogram[default: 0.02]

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-identify-batches

This method breaks an ASV table into batches based on the given metadata and identifies contaminant sequences from an OTU or ASV table and reports them to the user

Inputs

table: FeatureTable[Frequency]

Feature table which contaminate sequences will be identified from[required]

rep_seqs: FeatureData[Sequence]

Representative Sequences table which contaminate seqeunces will be removed from[optional]

Parameters

metadata: Metadata

metadata file indicating which samples in the experiment are control samples, assumes sample names in file correspond to the table input parameter[required]

split_column: Str

input metadata columns that you wish to subset the ASV table byNote: Column names must be in quotes and delimited by a space[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select how to which method to id contaminants with; Prevalence: Utilizes control ASVs/OTUs to identify contaminants, Frequency: Utilizes sample concentration information to identify contaminants, Combined: Utilizes both Prevalence and Frequency methods when identifying contaminants[required]

filter_empty_features: Bool

If true, features which are not present in a split feature table are dropped.[optional]

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Indicate the control sample identifier (e.g. "control" or "blank")[optional]

threshold: Float

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float

Select bin size for the histogram[default: 0.02]

Outputs

batch_subset_tables: Collection[FeatureTable[Frequency]]

Directory where feature tables split based on metadata and parameter split_column values should be written.[required]

decontam_scores: Collection[FeatureData[DecontamScore]]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]

score_histograms: Visualization

The visualizer histograms for all decontam score objects generated from the pipeline[required]
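
The batching step described above (splitting a table by the values of split_column, optionally dropping features absent from a batch) can be sketched in plain Python. This is an illustrative sketch only, not the pipeline's code: the function name split_table and the nested-dictionary representation of the table and metadata are assumptions made for the example.

```python
from collections import defaultdict


def split_table(table, metadata, split_column, filter_empty_features=True):
    """Split a feature table into batches by a metadata column.

    table:    dict of sample ID -> {feature ID: count} (hypothetical layout)
    metadata: dict of sample ID -> {column name: value}
    Returns a dict of batch value -> sub-table with the same layout.
    """
    batches = defaultdict(dict)
    for sample_id, counts in table.items():
        # Sample IDs in the table are assumed to appear in the metadata.
        batch = metadata[sample_id][split_column]
        batches[batch][sample_id] = dict(counts)

    if not filter_empty_features:
        return dict(batches)

    filtered = {}
    for batch, samples in batches.items():
        # Features observed at least once anywhere in this batch.
        present = {
            feature_id
            for counts in samples.values()
            for feature_id, count in counts.items()
            if count > 0
        }
        filtered[batch] = {
            sample_id: {f: c for f, c in counts.items() if f in present}
            for sample_id, counts in samples.items()
        }
    return filtered


# Toy data, invented for this example only.
example_table = {
    "s1": {"f1": 3, "f2": 0},
    "s2": {"f1": 1, "f2": 0},
    "s3": {"f1": 0, "f2": 7},
}
example_metadata = {
    "s1": {"run": "A"},
    "s2": {"run": "A"},
    "s3": {"run": "B"},
}

batches = split_table(example_table, example_metadata, "run")
unfiltered = split_table(example_table, example_metadata, "run",
                         filter_empty_features=False)
```

In this toy case, feature f2 is dropped from batch A and f1 from batch B when filter_empty_features is true, because neither is observed in any sample of that batch.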

This QIIME 2 plugin supports methods for assessing and controlling the quality of feature and sequence data.

version: 2024.10.0
website: https://github.com/qiime2/q2-quality-control
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org

Actions

NameTypeShort Description
exclude-seqsmethodExclude sequences by alignment
filter-readsmethodFilter demultiplexed sequences by alignment to reference database.
bowtie2-buildmethodBuild bowtie2 index from reference sequences.
decontam-identifymethodIdentify contaminants
decontam-removemethodRemove contaminants
evaluate-compositionvisualizerEvaluate expected vs. observed taxonomic composition of samples
evaluate-seqsvisualizerCompare query (observed) vs. reference (expected) sequences.
evaluate-taxonomyvisualizerEvaluate expected vs. observed taxonomic assignments
decontam-score-vizvisualizerGenerate a histogram representation of the scores
decontam-identify-batchespipelineIdentify contaminants in Batch Mode

Artifact Classes

FeatureData[DecontamScore]

Formats

DecontamScoreFormat
DecontamScoreDirFmt


quality-control exclude-seqs

This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified perc_identity, evalue, and perc_query_aligned. This method could be used to define a positive filter, e.g., extract only feature sequences that align to a certain clade of bacteria; or to define a negative filter, e.g., identify sequences that align to contaminant or human DNA sequences that should be excluded from subsequent analyses. Note that filtering is performed based on the perc_identity, perc_query_aligned, and evalue thresholds (the latter only if method==BLAST and an evalue is set). Set perc_identity==0 and/or perc_query_aligned==0 to disable these filtering thresholds as necessary.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

method: Str % Choices('blast', 'blastn-short') | Str % Choices('vsearch')

Alignment method to use for matching feature sequences against reference sequences[default: 'blast']

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to reference is lower. Must be in range [0.0, 1.0][default: 0.97]

evalue: Float

BLAST expectation (E) value threshold for saving hits. Reject if E value is higher than threshold. This threshold is disabled by default.[optional]

perc_query_aligned: Float

Percent of query sequence that must align to reference in order to be accepted as a hit.[default: 0.97]

threads: Threads

Number of threads to use. Only applies to vsearch method.[default: 1]

left_justify: Bool % Choices(False) | Bool

Reject match if the pairwise alignment begins with gaps[default: False]

Outputs

sequence_hits: FeatureData[Sequence]

Subset of feature sequences that align to reference sequences[required]

sequence_misses: FeatureData[Sequence]

Subset of feature sequences that do not align to reference sequences[required]


quality-control filter-reads

Filter out (or keep) demultiplexed single- or paired-end sequences that align to a reference database, using bowtie2 and samtools. This method can be used to filter out human DNA sequences and other contaminants in any FASTQ sequence data (e.g., shotgun genome or amplicon sequence data), or alternatively (when exclude_seqs is False) to only keep sequences that do align to the reference.

Citations

Langmead & Salzberg, 2012; Li et al., 2009

Inputs

demultiplexed_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The sequences to be trimmed.[required]

database: Bowtie2Index

Bowtie2 indexed database.[required]

Parameters

n_threads: Threads

Number of alignment threads to launch.[default: 1]

mode: Str % Choices('local', 'global')

Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']

sensitivity: Str % Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')

Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']

ref_gap_open_penalty: Int % Range(1, None)

Reference gap open penalty.[default: 5]

ref_gap_ext_penalty: Int % Range(1, None)

Reference gap extend penalty.[default: 3]

exclude_seqs: Bool

Exclude sequences that align to reference. Set this option to False to exclude sequences that do not align to the reference database.[default: True]

Outputs

filtered_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The resulting filtered sequences.[required]


quality-control bowtie2-build

Build bowtie2 index from reference sequences.

Citations

Langmead & Salzberg, 2012

Inputs

sequences: FeatureData[Sequence]

Reference sequences used to build bowtie2 index.[required]

Parameters

n_threads: Threads

Number of threads to launch.[default: 1]

Outputs

database: Bowtie2Index

Bowtie2 index.[required]


quality-control decontam-identify

This method identifies contaminant sequences from an OTU or ASV table and reports them to the user

Inputs

table: FeatureTable[Frequency]

Feature table which contaminate sequences will be identified from[required]

Parameters

metadata: Metadata

metadata file indicating which samples in the experiment are control samples, assumes sample names in file correspond to the table input parameter[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select how to which method to id contaminants with; Prevalence: Utilizes control ASVs/OTUs to identify contaminants, Frequency: Utilizes sample concentration information to identify contaminants, Combined: Utilizes both Prevalence and Frequency methods when identifying contaminants[default: 'prevalence']

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

indicate the control sample identifier (e.g. "control" or "blank")[optional]

Outputs

decontam_scores: FeatureData[DecontamScore]

The resulting table of scores from the decontam algorithm which scores each feature on how likely they are to be a contaminant sequence[required]


quality-control decontam-remove

Remove contaminant sequences from a feature table and the associated representative sequences.

Inputs

decontam_scores: FeatureData[DecontamScore]

Pre-feature decontam scores.[required]

table: FeatureTable[Frequency]

Feature table from which contaminants will be removed.[required]

rep_seqs: FeatureData[Sequence]

Feature representative sequences from which contaminants will be removed.[required]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Decontam score threshold. Features with a score less than or equal to this threshold will be removed.[default: 0.1]

Outputs

filtered_table: FeatureTable[Frequency]

Feature table with contaminants removed.[required]

filtered_rep_seqs: FeatureData[Sequence]

Feature representative sequences with contaminants removed.[required]


quality-control evaluate-composition

This visualizer compares the feature composition of pairs of observed and expected samples containing the same sample ID in two separate feature tables. Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations. Taxon accuracy rate, taxon detection rate, and linear regression scores between expected and observed observations are calculated at each semicolon-delimited rank, and plots of per-level accuracy and observation correlations are plotted. A histogram of distance between false positive observations and the nearest expected feature is also generated, where distance equals the number of rank differences between the observed feature and the nearest common lineage in the expected feature. This visualizer is most suitable for testing per-run data quality on sequencing runs that contain mock communities or other samples with known composition. Also suitable for sanity checks of bioinformatics pipeline performance.

Citations

Bokulich et al., 2018

Inputs

expected_features: FeatureTable[RelativeFrequency]

Expected feature compositions[required]

observed_features: FeatureTable[RelativeFrequency]

Observed feature compositions[required]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[default: 7]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

plot_tar: Bool

Plot taxon accuracy rate (TAR) on score plot. TAR is the number of true positive features divided by the total number of observed features (TAR = true positives / (true positives + false positives)).[default: True]

plot_tdr: Bool

Plot taxon detection rate (TDR) on score plot. TDR is the number of true positive features divided by the total number of expected features (TDR = true positives / (true positives + false negatives)).[default: True]

plot_r_value: Bool

Plot expected vs. observed linear regression r value on score plot.[default: False]

plot_r_squared: Bool

Plot expected vs. observed linear regression r-squared value on score plot.[default: True]

plot_bray_curtis: Bool

Plot expected vs. observed Bray-Curtis dissimilarity scores on score plot.[default: False]

plot_jaccard: Bool

Plot expected vs. observed Jaccard distances scores on score plot.[default: False]

plot_observed_features: Bool

Plot observed features count on score plot.[default: False]

plot_observed_features_ratio: Bool

Plot ratio of observed:expected features on score plot.[default: True]

metadata: MetadataColumn[Categorical]

Optional sample metadata that maps observed_features sample IDs to expected_features sample IDs.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-seqs

This action aligns a set of query (e.g., observed) sequences against a set of reference (e.g., expected) sequences to evaluate the quality of alignment. The intended use is to align observed sequences against expected sequences (e.g., from a mock community) to determine the frequency of mismatches between observed sequences and the most similar expected sequences, e.g., as a measure of sequencing/method error. However, any sequences may be provided as input to generate a report on pairwise alignment quality against a set of reference sequences.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

show_alignments: Bool

Option to plot pairwise alignments of query sequences and their top hits.[default: False]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-taxonomy

This visualizer compares a pair of observed and expected taxonomic assignments to calculate precision, recall, and F-measure at each taxonomic level, up to maximum level specified by the depth parameter. These metrics are calculated at each semicolon-delimited rank. This action is useful for comparing the accuracy of taxonomic assignment, e.g., between different taxonomy classifiers or other bioinformatics methods. Expected taxonomies should be derived from simulated or mock community sequences that have known taxonomic affiliations.

Citations

Bokulich et al., 2018

Inputs

expected_taxa: FeatureData[Taxonomy]

Expected taxonomic assignments[required]

observed_taxa: FeatureData[Taxonomy]

Observed taxonomic assignments[required]

feature_table: FeatureTable[RelativeFrequency]

Optional feature table containing relative frequency of each feature, used to weight accuracy scores by frequency. Must contain all features found in expected and/or observed taxa. Features found in the table but not the expected/observed taxa will be dropped prior to analysis.[optional]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[required]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

require_exp_ids: Bool

Require that all features found in observed taxa must be found in expected taxa or raise error.[default: True]

require_obs_ids: Bool

Require that all features found in expected taxa must be found in observed taxa or raise error.[default: True]

sample_id: Str

Optional sample ID to use for extracting frequency data from feature table, and for labeling accuracy results. If no sample_id is provided, feature frequencies are derived from the sum of all samples present in the feature table.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-score-viz

Creates histogram based on the output of decontam identify

Inputs

decontam_scores: Collection[FeatureData[DecontamScore]]

Output from decontam identify to be visualized[required]

table: Collection[FeatureTable[Frequency]]

Raw OTU/ASV table that was used as input to decontam-identify[required]

rep_seqs: FeatureData[Sequence]

Representative Sequences table which contaminate sequences will be removed from[optional]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

weight the decontam scores by their associated read number[default: True]

bin_size: Float % Range(0.0, 1.0, inclusive_end=True)

Select bin size for the histogram[default: 0.02]

Outputs

visualization: Visualization

<no description>[required]
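
To make the interaction of bin_size and weighted concrete, here is a minimal sketch under stated assumptions: scores lie in [0, 1], each feature has a read count, and bins have a fixed width. It is not the visualizer's actual code; `score_histogram` and the sample data are hypothetical.

```python
def score_histogram(scores, reads, bin_size=0.02, weighted=True):
    """Bin scores in [0, 1] into fixed-width bins and return per-bin totals."""
    n_bins = max(1, round(1.0 / bin_size))
    totals = [0.0] * n_bins
    for feature_id, score in scores.items():
        # Clamp the index so that a score of exactly 1.0 lands in the last bin.
        index = min(int(score / bin_size), n_bins - 1)
        # Weighted: add the feature's read count; unweighted: count the feature once.
        totals[index] += reads[feature_id] if weighted else 1
    return totals


# Hypothetical scores and read counts for four features.
scores = {"a": 0.1, "b": 0.3, "c": 0.9, "d": 1.0}
reads = {"a": 10, "b": 5, "c": 1, "d": 4}
print(score_histogram(scores, reads, bin_size=0.25, weighted=True))
print(score_histogram(scores, reads, bin_size=0.25, weighted=False))
```

With weighting enabled, a single abundant feature can dominate a bin, whereas the unweighted form shows how many distinct features fall in each score range.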


quality-control decontam-identify-batches

This pipeline splits an OTU or ASV table into batches based on the given metadata, identifies contaminant sequences in each batch, and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

split_column: Str

Input metadata columns by which to subset the ASV table. Note: Column names must be in quotes and delimited by a space.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select which method to use to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants. Frequency: uses sample concentration information to identify contaminants. Combined: uses both the Prevalence and Frequency methods to identify contaminants.[required]

filter_empty_features: Bool

If true, features which are not present in a split feature table are dropped.[optional]

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Value that identifies control samples (e.g. "control" or "blank")[optional]

threshold: Float

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float

Select bin size for the histogram[default: 0.02]

Outputs

batch_subset_tables: Collection[FeatureTable[Frequency]]

Directory where feature tables split based on metadata and parameter split_column values should be written.[required]

decontam_scores: Collection[FeatureData[DecontamScore]]

The resulting tables of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]

score_histograms: Visualization

The visualizer histograms for all decontam score objects generated by the pipeline[required]
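
The batching step described for split_column and filter_empty_features can be sketched as follows. This is a simplified illustration, assuming the table is a plain dictionary of per-sample feature counts and each sample maps to one batch value. The function `split_table_by_batch` and all sample data are hypothetical, and the plugin's own implementation may differ.

```python
from collections import defaultdict


def split_table_by_batch(table, batch_of_sample, filter_empty_features=True):
    """Split {sample_id: {feature_id: count}} into one sub-table per batch."""
    batches = defaultdict(dict)
    for sample_id, counts in table.items():
        batches[batch_of_sample[sample_id]][sample_id] = dict(counts)

    if filter_empty_features:
        for batch_table in batches.values():
            # Features with at least one nonzero count within this batch.
            present = {
                feature_id
                for counts in batch_table.values()
                for feature_id, count in counts.items()
                if count > 0
            }
            for sample_id in list(batch_table):
                batch_table[sample_id] = {
                    f: c for f, c in batch_table[sample_id].items() if f in present
                }
    return dict(batches)


# Hypothetical feature table and batch assignment (e.g. a sequencing-run column).
table = {
    "s1": {"f1": 3, "f2": 0},
    "s2": {"f1": 1, "f2": 0},
    "s3": {"f1": 0, "f2": 7},
}
batch_of_sample = {"s1": "run1", "s2": "run1", "s3": "run2"}
print(split_table_by_batch(table, batch_of_sample))
```

Dropping features that are absent from a batch keeps each sub-table limited to the features that can actually be scored in that batch.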

This QIIME 2 plugin supports methods for assessing and controlling the quality of feature and sequence data.

version: 2024.10.0
website: https://github.com/qiime2/q2-quality-control
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org

Actions

NameTypeShort Description
exclude-seqsmethodExclude sequences by alignment
filter-readsmethodFilter demultiplexed sequences by alignment to reference database.
bowtie2-buildmethodBuild bowtie2 index from reference sequences.
decontam-identifymethodIdentify contaminants
decontam-removemethodRemove contaminants
evaluate-compositionvisualizerEvaluate expected vs. observed taxonomic composition of samples
evaluate-seqsvisualizerCompare query (observed) vs. reference (expected) sequences.
evaluate-taxonomyvisualizerEvaluate expected vs. observed taxonomic assignments
decontam-score-vizvisualizerGenerate a histogram representation of the scores
decontam-identify-batchespipelineIdentify contaminants in Batch Mode

Artifact Classes

FeatureData[DecontamScore]

Formats

DecontamScoreFormat
DecontamScoreDirFmt


quality-control exclude-seqs

This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified perc_identity, evalue, and perc_query_aligned. This method could be used to define a positive filter, e.g., extract only feature sequences that align to a certain clade of bacteria; or to define a negative filter, e.g., identify sequences that align to contaminant or human DNA sequences that should be excluded from subsequent analyses. Note that filtering is performed based on the perc_identity, perc_query_aligned, and evalue thresholds (the latter only if method==BLAST and an evalue is set). Set perc_identity==0 and/or perc_query_aligned==0 to disable these filtering thresholds as necessary.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

method: Str % Choices('blast', 'blastn-short') | Str % Choices('vsearch')

Alignment method to use for matching feature sequences against reference sequences[default: 'blast']

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to reference is lower. Must be in range [0.0, 1.0][default: 0.97]

evalue: Float

BLAST expectation (E) value threshold for saving hits. Reject if E value is higher than threshold. This threshold is disabled by default.[optional]

perc_query_aligned: Float

Percent of query sequence that must align to reference in order to be accepted as a hit.[default: 0.97]

threads: Threads

Number of threads to use. Only applies to vsearch method.[default: 1]

left_justify: Bool % Choices(False) | Bool

Reject match if the pairwise alignment begins with gaps[default: False]

Outputs

sequence_hits: FeatureData[Sequence]

Subset of feature sequences that align to reference sequences[required]

sequence_misses: FeatureData[Sequence]

Subset of feature sequences that do not align to reference sequences[required]


quality-control filter-reads

Filter out (or keep) demultiplexed single- or paired-end sequences that align to a reference database, using bowtie2 and samtools. This method can be used to filter out human DNA sequences and other contaminants in any FASTQ sequence data (e.g., shotgun genome or amplicon sequence data), or alternatively (when exclude_seqs is False) to only keep sequences that do align to the reference.

Citations

Langmead & Salzberg, 2012; Li et al., 2009

Inputs

demultiplexed_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The sequences to be trimmed.[required]

database: Bowtie2Index

Bowtie2 indexed database.[required]

Parameters

n_threads: Threads

Number of alignment threads to launch.[default: 1]

mode: Str % Choices('local', 'global')

Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']

sensitivity: Str % Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')

Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']

ref_gap_open_penalty: Int % Range(1, None)

Reference gap open penalty.[default: 5]

ref_gap_ext_penalty: Int % Range(1, None)

Reference gap extend penalty.[default: 3]

exclude_seqs: Bool

Exclude sequences that align to reference. Set this option to False to exclude sequences that do not align to the reference database.[default: True]

Outputs

filtered_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The resulting filtered sequences.[required]


quality-control bowtie2-build

Build bowtie2 index from reference sequences.

Citations

Langmead & Salzberg, 2012

Inputs

sequences: FeatureData[Sequence]

Reference sequences used to build bowtie2 index.[required]

Parameters

n_threads: Threads

Number of threads to launch.[default: 1]

Outputs

database: Bowtie2Index

Bowtie2 index.[required]


quality-control decontam-identify

This method identifies contaminant sequences from an OTU or ASV table and reports them to the user

Inputs

table: FeatureTable[Frequency]

Feature table which contaminate sequences will be identified from[required]

Parameters

metadata: Metadata

metadata file indicating which samples in the experiment are control samples, assumes sample names in file correspond to the table input parameter[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select how to which method to id contaminants with; Prevalence: Utilizes control ASVs/OTUs to identify contaminants, Frequency: Utilizes sample concentration information to identify contaminants, Combined: Utilizes both Prevalence and Frequency methods when identifying contaminants[default: 'prevalence']

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

indicate the control sample identifier (e.g. "control" or "blank")[optional]

Outputs

decontam_scores: FeatureData[DecontamScore]

The resulting table of scores from the decontam algorithm which scores each feature on how likely they are to be a contaminant sequence[required]


quality-control decontam-remove

Remove contaminant sequences from a feature table and the associated representative sequences.

Inputs

decontam_scores: FeatureData[DecontamScore]

Pre-feature decontam scores.[required]

table: FeatureTable[Frequency]

Feature table from which contaminants will be removed.[required]

rep_seqs: FeatureData[Sequence]

Feature representative sequences from which contaminants will be removed.[required]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Decontam score threshold. Features with a score less than or equal to this threshold will be removed.[default: 0.1]

Outputs

filtered_table: FeatureTable[Frequency]

Feature table with contaminants removed.[required]

filtered_rep_seqs: FeatureData[Sequence]

Feature representative sequences with contaminants removed.[required]


quality-control evaluate-composition

This visualizer compares the feature composition of pairs of observed and expected samples containing the same sample ID in two separate feature tables. Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations. Taxon accuracy rate, taxon detection rate, and linear regression scores between expected and observed observations are calculated at each semicolon-delimited rank, and plots of per-level accuracy and observation correlations are plotted. A histogram of distance between false positive observations and the nearest expected feature is also generated, where distance equals the number of rank differences between the observed feature and the nearest common lineage in the expected feature. This visualizer is most suitable for testing per-run data quality on sequencing runs that contain mock communities or other samples with known composition. Also suitable for sanity checks of bioinformatics pipeline performance.

Citations

Bokulich et al., 2018

Inputs

expected_features: FeatureTable[RelativeFrequency]

Expected feature compositions[required]

observed_features: FeatureTable[RelativeFrequency]

Observed feature compositions[required]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[default: 7]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

plot_tar: Bool

Plot taxon accuracy rate (TAR) on score plot. TAR is the number of true positive features divided by the total number of observed features (TAR = true positives / (true positives + false positives)).[default: True]

plot_tdr: Bool

Plot taxon detection rate (TDR) on score plot. TDR is the number of true positive features divided by the total number of expected features (TDR = true positives / (true positives + false negatives)).[default: True]

plot_r_value: Bool

Plot expected vs. observed linear regression r value on score plot.[default: False]

plot_r_squared: Bool

Plot expected vs. observed linear regression r-squared value on score plot.[default: True]

plot_bray_curtis: Bool

Plot expected vs. observed Bray-Curtis dissimilarity scores on score plot.[default: False]

plot_jaccard: Bool

Plot expected vs. observed Jaccard distances scores on score plot.[default: False]

plot_observed_features: Bool

Plot observed features count on score plot.[default: False]

plot_observed_features_ratio: Bool

Plot ratio of observed:expected features on score plot.[default: True]

metadata: MetadataColumn[Categorical]

Optional sample metadata that maps observed_features sample IDs to expected_features sample IDs.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-seqs

This action aligns a set of query (e.g., observed) sequences against a set of reference (e.g., expected) sequences to evaluate the quality of alignment. The intended use is to align observed sequences against expected sequences (e.g., from a mock community) to determine the frequency of mismatches between observed sequences and the most similar expected sequences, e.g., as a measure of sequencing/method error. However, any sequences may be provided as input to generate a report on pairwise alignment quality against a set of reference sequences.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

show_alignments: Bool

Option to plot pairwise alignments of query sequences and their top hits.[default: False]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-taxonomy

This visualizer compares a pair of observed and expected taxonomic assignments to calculate precision, recall, and F-measure at each taxonomic level, up to maximum level specified by the depth parameter. These metrics are calculated at each semicolon-delimited rank. This action is useful for comparing the accuracy of taxonomic assignment, e.g., between different taxonomy classifiers or other bioinformatics methods. Expected taxonomies should be derived from simulated or mock community sequences that have known taxonomic affiliations.

Citations

Bokulich et al., 2018

Inputs

expected_taxa: FeatureData[Taxonomy]

Expected taxonomic assignments[required]

observed_taxa: FeatureData[Taxonomy]

Observed taxonomic assignments[required]

feature_table: FeatureTable[RelativeFrequency]

Optional feature table containing relative frequency of each feature, used to weight accuracy scores by frequency. Must contain all features found in expected and/or observed taxa. Features found in the table but not the expected/observed taxa will be dropped prior to analysis.[optional]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the greengenes reference sequence database).[required]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

require_exp_ids: Bool

Require that all features found in observed taxa must be found in expected taxa or raise error.[default: True]

require_obs_ids: Bool

Require that all features found in expected taxa also be found in observed taxa; otherwise raise an error.[default: True]

sample_id: Str

Optional sample ID to use for extracting frequency data from feature table, and for labeling accuracy results. If no sample_id is provided, feature frequencies are derived from the sum of all samples present in the feature table.[optional]

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-score-viz

Creates a histogram based on the output of decontam-identify.

Inputs

decontam_scores: Collection[FeatureData[DecontamScore]]

Output from decontam-identify to be visualized[required]

table: Collection[FeatureTable[Frequency]]

Raw OTU/ASV table that was used as input to decontam-identify[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read number[default: True]

bin_size: Float % Range(0.0, 1.0, inclusive_end=True)

Select bin size for the histogram[default: 0.02]

Outputs

visualization: Visualization

<no description>[required]
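
As a rough illustration of how the weighted and bin_size parameters could interact, the sketch below bins decontam scores in the range 0.0 to 1.0 and either counts features or sums their reads per bin. This is an assumption-laden sketch, not the visualizer's plotting code: the function `score_histogram` and the plain-dictionary inputs are hypothetical, and the threshold parameter is not modeled here.

```python
import math


def score_histogram(scores, reads, bin_size=0.02, weighted=True):
    """Bin scores in [0, 1]; sum reads per bin if weighted, else count features."""
    n_bins = math.ceil(1.0 / bin_size)
    bins = [0] * n_bins
    for feature_id, score in scores.items():
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score / bin_size), n_bins - 1)
        bins[index] += reads[feature_id] if weighted else 1
    return bins


# Hypothetical per-feature scores and total read counts.
scores = {"a": 0.05, "b": 0.2, "c": 0.6, "d": 0.9}
reads = {"a": 10, "b": 30, "c": 5, "d": 55}

weighted_bins = score_histogram(scores, reads, bin_size=0.25, weighted=True)
unweighted_bins = score_histogram(scores, reads, bin_size=0.25, weighted=False)
```

With weighting, a few abundant features can dominate a bin even when most features fall elsewhere, which is why the two views can look quite different.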


quality-control decontam-identify-batches

This pipeline splits an OTU or ASV table into batches based on the given metadata, identifies contaminant sequences in each batch, and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

split_column: Str

Metadata columns by which to subset the ASV table. Note: column names must be in quotes and delimited by a space.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select which method to use to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants. Frequency: uses sample concentration information to identify contaminants. Combined: uses both the Prevalence and Frequency methods when identifying contaminants.[required]

filter_empty_features: Bool

If true, features which are not present in a split feature table are dropped.[optional]

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Indicate the control sample identifier (e.g., "control" or "blank")[optional]

threshold: Float

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read number[default: True]

bin_size: Float

Select bin size for the histogram[default: 0.02]

Outputs

batch_subset_tables: Collection[FeatureTable[Frequency]]

Directory where feature tables split based on metadata and parameter split_column values should be written.[required]

decontam_scores: Collection[FeatureData[DecontamScore]]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]

score_histograms: Visualization

The visualizer histograms for all decontam score objects generated from the pipeline[required]
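
The batch-splitting step of this pipeline can be sketched in plain Python. The code below is only a conceptual illustration under stated assumptions: the table is modeled as a nested dictionary (sample ID to feature counts) rather than a QIIME 2 artifact, the function name `split_table` is hypothetical, and filter_empty_features is interpreted as dropping features that have zero counts across every sample in a batch.

```python
def split_table(table, metadata, split_column, filter_empty_features=True):
    """Group samples into per-batch tables keyed by a metadata column value."""
    batches = {}
    for sample_id, counts in table.items():
        batch = metadata[sample_id][split_column]
        batches.setdefault(batch, {})[sample_id] = dict(counts)

    if filter_empty_features:
        for batch_table in batches.values():
            # Features observed at least once somewhere in this batch.
            present = {
                feature
                for counts in batch_table.values()
                for feature, n in counts.items()
                if n > 0
            }
            for sample_id, counts in batch_table.items():
                batch_table[sample_id] = {
                    f: n for f, n in counts.items() if f in present
                }
    return batches


# Hypothetical table and metadata with a "run" column used for splitting.
table = {
    "s1": {"a": 5, "b": 0},
    "s2": {"a": 1, "b": 0},
    "s3": {"a": 0, "b": 7},
}
metadata = {
    "s1": {"run": "run1"},
    "s2": {"run": "run1"},
    "s3": {"run": "run2"},
}
batches = split_table(table, metadata, "run")
unfiltered = split_table(table, metadata, "run", filter_empty_features=False)
```

Each resulting batch table would then be scored separately, which is what produces a collection of score tables rather than a single one.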


quality-control exclude-seqs

This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified perc_identity, evalue, and perc_query_aligned. This method could be used to define a positive filter, e.g., extract only feature sequences that align to a certain clade of bacteria; or to define a negative filter, e.g., identify sequences that align to contaminant or human DNA sequences that should be excluded from subsequent analyses. Note that filtering is performed based on the perc_identity, perc_query_aligned, and evalue thresholds (the latter only if method==BLAST and an evalue is set). Set perc_identity==0 and/or perc_query_aligned==0 to disable these filtering thresholds as necessary.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Sequences to test for exclusion[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

method: Str % Choices('blast', 'blastn-short') | Str % Choices('vsearch')

Alignment method to use for matching feature sequences against reference sequences[default: 'blast']

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to reference is lower. Must be in range [0.0, 1.0][default: 0.97]

evalue: Float

BLAST expectation (E) value threshold for saving hits. Reject if E value is higher than threshold. This threshold is disabled by default.[optional]

perc_query_aligned: Float

Percent of query sequence that must align to reference in order to be accepted as a hit.[default: 0.97]

threads: Threads

Number of threads to use. Only applies to vsearch method.[default: 1]

left_justify: Bool % Choices(False) | Bool

Reject match if the pairwise alignment begins with gaps[default: False]

Outputs

sequence_hits: FeatureData[Sequence]

Subset of feature sequences that align to reference sequences[required]

sequence_misses: FeatureData[Sequence]

Subset of feature sequences that do not align to reference sequences[required]
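
The threshold logic described for this method can be sketched as follows. This is an illustrative sketch, not the plugin's code: the `Alignment` record, `is_hit`, and `partition` are hypothetical names, and the best alignment for each query is assumed to have been computed already by BLAST or VSEARCH.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    identity: float       # fraction of identical positions, 0.0 to 1.0
    query_aligned: float  # fraction of the query covered by the alignment
    evalue: float         # BLAST expectation value of the alignment


def is_hit(aln, perc_identity=0.97, perc_query_aligned=0.97, evalue=None):
    """Apply the three thresholds; a query with no alignment is a miss."""
    if aln is None:
        return False
    if aln.identity < perc_identity:
        return False
    if aln.query_aligned < perc_query_aligned:
        return False
    # The E-value threshold is only applied when one is set.
    if evalue is not None and aln.evalue > evalue:
        return False
    return True


def partition(best_alignments, **thresholds):
    """Split query IDs into (hits, misses)."""
    hits, misses = [], []
    for seq_id, aln in best_alignments.items():
        (hits if is_hit(aln, **thresholds) else misses).append(seq_id)
    return hits, misses


# Hypothetical best alignments per query sequence.
alignments = {
    "seq1": Alignment(0.99, 1.00, 1e-50),
    "seq2": Alignment(0.90, 1.00, 1e-20),
    "seq3": None,
    "seq4": Alignment(0.98, 0.50, 1e-5),
}
default_hits, default_misses = partition(alignments)
relaxed_hits, relaxed_misses = partition(
    alignments, perc_identity=0.0, perc_query_aligned=0.0
)
```

Setting both fractional thresholds to 0 effectively disables them, so any query with an alignment becomes a hit; only the query with no alignment at all remains a miss.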


quality-control filter-reads

Filter out (or keep) demultiplexed single- or paired-end sequences that align to a reference database, using bowtie2 and samtools. This method can be used to filter out human DNA sequences and other contaminants in any FASTQ sequence data (e.g., shotgun genome or amplicon sequence data), or alternatively (when exclude_seqs is False) to only keep sequences that do align to the reference.

Citations

Langmead & Salzberg, 2012; Li et al., 2009

Inputs

demultiplexed_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The sequences to be filtered.[required]

database: Bowtie2Index

Bowtie2 indexed database.[required]

Parameters

n_threads: Threads

Number of alignment threads to launch.[default: 1]

mode: Str % Choices('local', 'global')

Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']

sensitivity: Str % Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')

Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']

ref_gap_open_penalty: Int % Range(1, None)

Reference gap open penalty.[default: 5]

ref_gap_ext_penalty: Int % Range(1, None)

Reference gap extend penalty.[default: 3]

exclude_seqs: Bool

Exclude sequences that align to reference. Set this option to False to exclude sequences that do not align to the reference database.[default: True]

Outputs

filtered_sequences: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]

The resulting filtered sequences.[required]


quality-control bowtie2-build

Build bowtie2 index from reference sequences.

Citations

Langmead & Salzberg, 2012

Inputs

sequences: FeatureData[Sequence]

Reference sequences used to build bowtie2 index.[required]

Parameters

n_threads: Threads

Number of threads to launch.[default: 1]

Outputs

database: Bowtie2Index

Bowtie2 index.[required]


quality-control decontam-identify

This method identifies contaminant sequences in an OTU or ASV table and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select which method to use to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants. Frequency: uses sample concentration information to identify contaminants. Combined: uses both the Prevalence and Frequency methods when identifying contaminants.[default: 'prevalence']

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Indicate the control sample identifier (e.g., "control" or "blank")[optional]

Outputs

decontam_scores: FeatureData[DecontamScore]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]


quality-control decontam-remove

Remove contaminant sequences from a feature table and the associated representative sequences.

Inputs

decontam_scores: FeatureData[DecontamScore]

Per-feature decontam scores.[required]

table: FeatureTable[Frequency]

Feature table from which contaminants will be removed.[required]

rep_seqs: FeatureData[Sequence]

Feature representative sequences from which contaminants will be removed.[required]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Decontam score threshold. Features with a score less than or equal to this threshold will be removed.[default: 0.1]

Outputs

filtered_table: FeatureTable[Frequency]

Feature table with contaminants removed.[required]

filtered_rep_seqs: FeatureData[Sequence]

Feature representative sequences with contaminants removed.[required]
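
The removal rule stated for the threshold parameter (features with a score less than or equal to the threshold are removed) can be illustrated with a short sketch. The data structures and the function name `remove_contaminants` are hypothetical stand-ins for the real artifacts, and keeping features that have no score is an assumption made here, not documented behavior.

```python
def remove_contaminants(scores, table, rep_seqs, threshold=0.1):
    """Drop features whose decontam score is <= threshold."""
    contaminants = {
        feature_id
        for feature_id, score in scores.items()
        # Assumption: features without a score (None) are kept.
        if score is not None and score <= threshold
    }
    filtered_table = {
        sample_id: {f: n for f, n in counts.items() if f not in contaminants}
        for sample_id, counts in table.items()
    }
    filtered_rep_seqs = {
        f: seq for f, seq in rep_seqs.items() if f not in contaminants
    }
    return filtered_table, filtered_rep_seqs


# Hypothetical inputs: feature "a" sits exactly on the threshold.
scores = {"a": 0.1, "b": 0.5, "c": None}
table = {"s1": {"a": 3, "b": 8, "c": 1}, "s2": {"a": 0, "b": 2, "c": 4}}
rep_seqs = {"a": "ACGT", "b": "GGCC", "c": "TTAA"}

filtered_table, filtered_rep_seqs = remove_contaminants(scores, table, rep_seqs)
```

Because the comparison is inclusive, a feature scoring exactly 0.1 is removed at the default threshold.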


quality-control evaluate-composition

This visualizer compares the feature composition of pairs of observed and expected samples containing the same sample ID in two separate feature tables. Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations. Taxon accuracy rate, taxon detection rate, and linear regression scores between expected and observed observations are calculated at each semicolon-delimited rank, and plots of per-level accuracy and observation correlations are generated. A histogram of distance between false positive observations and the nearest expected feature is also generated, where distance equals the number of rank differences between the observed feature and the nearest common lineage in the expected feature. This visualizer is most suitable for testing per-run data quality on sequencing runs that contain mock communities or other samples with known composition. It is also suitable for sanity checks of bioinformatics pipeline performance.

Citations

Bokulich et al., 2018

Inputs

expected_features: FeatureTable[RelativeFrequency]

Expected feature compositions[required]

observed_features: FeatureTable[RelativeFrequency]

Observed feature compositions[required]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the Greengenes reference sequence database).[default: 7]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

plot_tar: Bool

Plot taxon accuracy rate (TAR) on score plot. TAR is the number of true positive features divided by the total number of observed features (TAR = true positives / (true positives + false positives)).[default: True]

plot_tdr: Bool

Plot taxon detection rate (TDR) on score plot. TDR is the number of true positive features divided by the total number of expected features (TDR = true positives / (true positives + false negatives)).[default: True]

plot_r_value: Bool

Plot expected vs. observed linear regression r value on score plot.[default: False]

plot_r_squared: Bool

Plot expected vs. observed linear regression r-squared value on score plot.[default: True]

plot_bray_curtis: Bool

Plot expected vs. observed Bray-Curtis dissimilarity scores on score plot.[default: False]

plot_jaccard: Bool

Plot expected vs. observed Jaccard distance scores on score plot.[default: False]

plot_observed_features: Bool

Plot observed features count on score plot.[default: False]

plot_observed_features_ratio: Bool

Plot ratio of observed:expected features on score plot.[default: True]

metadata: MetadataColumn[Categorical]

Optional sample metadata that maps observed_features sample IDs to expected_features sample IDs.[optional]

Outputs

visualization: Visualization

<no description>[required]
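
The TAR and TDR formulas given for plot_tar and plot_tdr can be worked through on a small example. The sketch below is illustrative only: `collapse` and `tar_tdr` are hypothetical helper names, features are treated as simply present or absent, and collapsing lineages to a rank by truncation is an assumption about how per-rank comparison could be done.

```python
def collapse(features, level):
    """Truncate each semicolon-delimited lineage to its first `level` ranks."""
    return {
        ";".join(r.strip() for r in lineage.split(";")[:level])
        for lineage in features
    }


def tar_tdr(expected, observed):
    """Return (TAR, TDR) for two sets of feature labels."""
    tp = len(expected & observed)  # true positives
    fp = len(observed - expected)  # false positives
    fn = len(expected - observed)  # false negatives
    tar = tp / (tp + fp) if tp + fp else 0.0
    tdr = tp / (tp + fn) if tp + fn else 0.0
    return tar, tdr


# Hypothetical mock community (expected) and sequencing result (observed).
expected = {
    "k__Bacteria;p__Firmicutes;g__Bacillus",
    "k__Bacteria;p__Firmicutes;g__Lactobacillus",
    "k__Bacteria;p__Proteobacteria;g__Escherichia",
}
observed = {
    "k__Bacteria;p__Firmicutes;g__Bacillus",
    "k__Bacteria;p__Proteobacteria;g__Salmonella",
}

tar_rank3, tdr_rank3 = tar_tdr(collapse(expected, 3), collapse(observed, 3))
tar_rank2, tdr_rank2 = tar_tdr(collapse(expected, 2), collapse(observed, 2))
```

At rank 3 there is one true positive, one false positive, and two false negatives, giving TAR = 1/2 and TDR = 1/3. At rank 2 both phyla are recovered, so both rates are 1.0.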


quality-control evaluate-seqs

This action aligns a set of query (e.g., observed) sequences against a set of reference (e.g., expected) sequences to evaluate the quality of alignment. The intended use is to align observed sequences against expected sequences (e.g., from a mock community) to determine the frequency of mismatches between observed sequences and the most similar expected sequences, e.g., as a measure of sequencing/method error. However, any sequences may be provided as input to generate a report on pairwise alignment quality against a set of reference sequences.

Citations

Camacho et al., 2009

Inputs

query_sequences: FeatureData[Sequence]

Query sequences to align against the reference sequences[required]

reference_sequences: FeatureData[Sequence]

Reference sequences to align against feature sequences[required]

Parameters

show_alignments: Bool

Option to plot pairwise alignments of query sequences and their top hits.[default: False]

Outputs

visualization: Visualization

<no description>[required]


quality-control evaluate-taxonomy

This visualizer compares a pair of observed and expected taxonomic assignments to calculate precision, recall, and F-measure at each taxonomic level, up to the maximum level specified by the depth parameter. These metrics are calculated at each semicolon-delimited rank. This action is useful for comparing the accuracy of taxonomic assignment, e.g., between different taxonomy classifiers or other bioinformatics methods. Expected taxonomies should be derived from simulated or mock community sequences that have known taxonomic affiliations.

Citations

Bokulich et al., 2018

Inputs

expected_taxa: FeatureData[Taxonomy]

Expected taxonomic assignments[required]

observed_taxa: FeatureData[Taxonomy]

Observed taxonomic assignments[required]

feature_table: FeatureTable[RelativeFrequency]

Optional feature table containing relative frequency of each feature, used to weight accuracy scores by frequency. Must contain all features found in expected and/or observed taxa. Features found in the table but not the expected/observed taxa will be dropped prior to analysis.[optional]

Parameters

depth: Int

Maximum depth of semicolon-delimited taxonomic ranks to test (e.g., 1 = root, 7 = species for the Greengenes reference sequence database).[required]

palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'terrain', 'rainbow')

Color palette to utilize for plotting.[default: 'Set1']

require_exp_ids: Bool

Require that all features found in observed taxa also be found in expected taxa; otherwise raise an error.[default: True]

require_obs_ids: Bool

Require that all features found in expected taxa also be found in observed taxa; otherwise raise an error.[default: True]

sample_id: Str

Optional sample ID to use for extracting frequency data from feature table, and for labeling accuracy results. If no sample_id is provided, feature frequencies are derived from the sum of all samples present in the feature table.[optional]
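
The effect of sample_id on the frequency weights can be sketched as below. This is only an illustration under assumptions: the feature table is represented as a plain dict of sample ID to per-feature relative frequencies (a hypothetical stand-in for FeatureTable[RelativeFrequency]), and the summed frequencies are not renormalized, which may differ from what the plugin does internally.

```python
def feature_weights(table, sample_id=None):
    """Return per-feature weights from a {sample: {feature: frequency}} table.

    With a sample_id, use that sample's frequencies; without one, sum the
    frequencies of each feature across all samples.
    """
    if sample_id is not None:
        return dict(table[sample_id])
    totals = {}
    for frequencies in table.values():
        for feature_id, value in frequencies.items():
            totals[feature_id] = totals.get(feature_id, 0.0) + value
    return totals


table = {
    "s1": {"f1": 0.75, "f2": 0.25},
    "s2": {"f1": 0.5, "f2": 0.5},
}
single_sample = feature_weights(table, sample_id="s1")
all_samples = feature_weights(table)
```

Selecting one sample weights the accuracy scores by that sample's composition alone, whereas omitting sample_id pools every sample in the table.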

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-score-viz

Creates a histogram based on the output of decontam-identify.

Inputs

decontam_scores: Collection[FeatureData[DecontamScore]]

Output from decontam-identify to be visualized[required]

table: Collection[FeatureTable[Frequency]]

Raw OTU/ASV table that was used as input to decontam-identify[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

threshold: Float % Range(0.0, 1.0, inclusive_end=True)

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float % Range(0.0, 1.0, inclusive_end=True)

Select bin size for the histogram[default: 0.02]
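
How threshold, weighted, and bin_size interact can be sketched as follows. This is a simplified illustration, not the visualizer's code. It assumes that scores lie in the range 0 to 1, that features scoring strictly below the threshold are the ones flagged as contaminants (the convention of the decontam R package), and that weighting means each feature contributes its read count to a bin rather than a count of one. The function name and the example data are hypothetical, and a coarse bin size is used only to keep the output short.

```python
def score_histogram(scores, reads, threshold=0.1, bin_size=0.02, weighted=True):
    """Bin decontam scores and total the weight flagged as contaminant.

    scores: dict of feature ID -> decontam score in [0, 1]
    reads:  dict of feature ID -> total read count for that feature
    """
    n_bins = round(1.0 / bin_size)
    bins = [0] * n_bins
    flagged = 0
    for feature_id, score in scores.items():
        weight = reads[feature_id] if weighted else 1
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(score / bin_size), n_bins - 1)
        bins[index] += weight
        if score < threshold:
            flagged += weight
    return bins, flagged


scores = {"asv1": 0.05, "asv2": 0.30, "asv3": 0.60, "asv4": 0.90}
reads = {"asv1": 120, "asv2": 30, "asv3": 500, "asv4": 350}

weighted_bins, weighted_flagged = score_histogram(
    scores, reads, threshold=0.1, bin_size=0.25, weighted=True)
plain_bins, plain_flagged = score_histogram(
    scores, reads, threshold=0.1, bin_size=0.25, weighted=False)
```

In the weighted case the histogram reflects how many reads sit on each side of the threshold; in the unweighted case it reflects how many features do.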

Outputs

visualization: Visualization

<no description>[required]


quality-control decontam-identify-batches

This pipeline splits an OTU or ASV table into batches based on the given metadata, identifies contaminant sequences in each batch, and reports them to the user.

Inputs

table: FeatureTable[Frequency]

Feature table from which contaminant sequences will be identified[required]

rep_seqs: FeatureData[Sequence]

Representative sequences from which contaminant sequences will be removed[optional]

Parameters

metadata: Metadata

Metadata file indicating which samples in the experiment are control samples. Sample names in the file are assumed to correspond to those in the table input.[required]

split_column: Str

Input metadata columns by which to subset the ASV table. Note: Column names must be in quotes and delimited by a space.[required]

method: Str % Choices('combined', 'frequency', 'prevalence')

Select which method to use to identify contaminants. Prevalence: uses control ASVs/OTUs to identify contaminants; Frequency: uses sample concentration information to identify contaminants; Combined: uses both the Prevalence and Frequency methods to identify contaminants.[required]

filter_empty_features: Bool

If true, features which are not present in a split feature table are dropped.[optional]

freq_concentration_column: Str

Input column name that has concentration information for the samples[optional]

prev_control_column: Str

Input column name containing experimental or control sample metadata[optional]

prev_control_indicator: Str

Indicate the control sample identifier (e.g., "control" or "blank")[optional]

threshold: Float

Select threshold cutoff for decontam algorithm scores[default: 0.1]

weighted: Bool

Weight the decontam scores by their associated read counts[default: True]

bin_size: Float

Select bin size for the histogram[default: 0.02]
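
The batching step controlled by split_column and filter_empty_features can be sketched as below. This is a rough illustration rather than the pipeline's implementation: the table and metadata are plain dicts (hypothetical stand-ins for FeatureTable[Frequency] and Metadata), the column name `run` is invented for the example, and contaminant identification within each batch is omitted.

```python
def split_table(table, metadata, split_column, filter_empty_features=True):
    """Group samples into batches by the value of a metadata column.

    table:    dict of sample ID -> {feature ID: count}
    metadata: dict of sample ID -> {column name: value}
    Returns a dict of batch value -> sub-table with the same layout.
    """
    batches = {}
    for sample_id, counts in table.items():
        batch = metadata[sample_id][split_column]
        batches.setdefault(batch, {})[sample_id] = dict(counts)

    if filter_empty_features:
        for samples in batches.values():
            # Features with at least one nonzero count in this batch.
            present = {
                feature_id
                for counts in samples.values()
                for feature_id, n in counts.items()
                if n > 0
            }
            for sample_id in samples:
                samples[sample_id] = {
                    feature_id: n
                    for feature_id, n in samples[sample_id].items()
                    if feature_id in present
                }
    return batches


table = {
    "s1": {"asv1": 10, "asv2": 0},
    "s2": {"asv1": 5, "asv2": 0},
    "s3": {"asv1": 0, "asv2": 7},
}
metadata = {"s1": {"run": "A"}, "s2": {"run": "A"}, "s3": {"run": "B"}}

filtered = split_table(table, metadata, "run")
unfiltered = split_table(table, metadata, "run", filter_empty_features=False)
```

With filtering enabled, batch A keeps only `asv1` and batch B keeps only `asv2`, since the other feature has no reads in that batch; with filtering disabled, every batch retains the full feature set.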

Outputs

batch_subset_tables: Collection[FeatureTable[Frequency]]

Directory where feature tables split based on metadata and parameter split_column values should be written.[required]

decontam_scores: Collection[FeatureData[DecontamScore]]

The resulting table of scores from the decontam algorithm, which scores each feature on how likely it is to be a contaminant sequence[required]

score_histograms: Visualization

The visualizer histograms for all decontam score objects generated by the pipeline[required]