This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2024.10.0
website: https://github.com/qiime2/q2-feature-classifier
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations:
Bokulich et al., 2018

Actions

Name                             Type      Short Description
fit-classifier-sklearn           method    Train an almost arbitrary scikit-learn classifier
classify-sklearn                 method    Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes       method    Train the naive_bayes classifier
extract-reads                    method    Extract reads from reference sequences.
find-consensus-annotation        method    Find consensus among multiple annotations.
makeblastdb                      method    Make BLAST database.
blast                            method    BLAST+ local alignment search.
vsearch-global                   method    VSEARCH global alignment search
classify-consensus-blast         pipeline  BLAST+ consensus taxonomy classifier
classify-consensus-vsearch       pipeline  VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn  pipeline  ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes

BLASTDB
TaxonomicClassifier

Formats

BLASTDBDirFmtV5
TaxonomicClassifierDirFmt
TaxonomicClassiferTemporaryPickleDirFmt


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
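The "auto" rule for reads_per_batch can be sketched as below. This is a minimal illustration and not the plugin's own code: the function name auto_reads_per_batch is hypothetical, and both the rounding up and the handling of n_jobs=0 through os.cpu_count() are assumptions.

```python
import math
import os


def auto_reads_per_batch(n_queries: int, n_jobs: int = 1, cap: int = 20000) -> int:
    """Illustrative 'auto' batch size: min(n_queries / n_jobs, cap)."""
    if n_jobs == 0:
        # Assumption: 0 means "use all CPUs", as described for n_jobs.
        n_jobs = os.cpu_count() or 1
    # Assumption: round up so every read lands in a batch; never return 0.
    per_worker = max(1, math.ceil(n_queries / n_jobs))
    return min(per_worker, cap)
```

With 1,000 query sequences and 4 jobs this gives batches of 250 reads; with a million sequences the cap of 20,000 applies.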

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
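A rough sketch of how a confidence threshold can limit taxonomic depth is shown below. It assumes each rank of a predicted lineage carries its own confidence value that does not increase with depth; the function name, the per-rank input, and the "Unassigned" fallback are illustrative assumptions rather than the plugin's actual implementation.

```python
from typing import List, Optional, Tuple, Union


def truncate_by_confidence(
    ranks: List[str],
    rank_confidences: List[float],
    confidence: Union[float, str] = 0.7,
) -> Tuple[str, Optional[float]]:
    """Keep the deepest run of ranks whose confidence meets the threshold."""
    if confidence == "disable":
        # No confidence is calculated; the full lineage is returned as-is.
        return ";".join(ranks), None
    kept: List[str] = []
    kept_conf: Optional[float] = None
    for rank, conf in zip(ranks, rank_confidences):
        if conf < confidence:
            break  # stop at the first rank below the threshold
        kept.append(rank)
        kept_conf = conf
    if not kept:
        # Assumption: nothing passed, so this sketch reports no confidence.
        return "Unassigned", None
    return ";".join(kept), kept_conf
```

With a threshold of 0 every rank passes, so the full lineage is kept while a confidence value is still reported, which matches the behavior described for this parameter.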

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
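The "same" and "reverse-complement" choices can be illustrated with a short sketch. The helper names are hypothetical, only unambiguous DNA bases are complemented (other characters, including IUPAC ambiguity codes, pass through unchanged), and the "auto" detection step is deliberately not modeled.

```python
# Translation table for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq: str, read_orientation: str = "same") -> str:
    """Apply a fixed orientation choice to a read before classification."""
    if read_orientation == "same":
        return seq
    if read_orientation == "reverse-complement":
        return reverse_complement(seq)
    # "auto" depends on classifier confidence estimates and is out of scope.
    raise ValueError(f"unsupported read_orientation: {read_orientation!r}")
```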

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]
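Several of the Str defaults above ('null', '[7, 7]') look like JSON-encoded values. Assuming that is how they are meant to be read, the sketch below decodes such strings with the standard json module and leaves plain strings such as 'char_wb' untouched. The helper decode_params is hypothetical and is not part of the plugin.

```python
import json
from typing import Any, Dict


def decode_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Decode JSON-looking string values; keep everything else unchanged."""
    decoded: Dict[str, Any] = {}
    for name, value in raw.items():
        if isinstance(value, str):
            try:
                decoded[name] = json.loads(value)  # 'null' -> None, '[7, 7]' -> [7, 7]
            except json.JSONDecodeError:
                decoded[name] = value  # plain strings such as 'char_wb' or 'l2'
        else:
            decoded[name] = value
    return decoded
```

Under this reading, feat_ext__ngram_range '[7, 7]' becomes the list [7, 7] (7-mers only) and classify__class_prior 'null' becomes None.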

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:

1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
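Steps 3 to 7 of the order described above can be sketched as follows. This is an illustration of the documented ordering only: the function trim_and_filter is hypothetical, it starts from an amplicon whose primers have already been removed, and it does not reproduce the plugin's primer matching.

```python
from typing import Optional


def trim_and_filter(
    amplicon: str,
    trim_right: int = 0,
    trunc_len: int = 0,
    trim_left: int = 0,
    min_length: int = 50,
    max_length: int = 0,
) -> Optional[str]:
    """Return the processed read, or None if a length filter discards it."""
    # Step 3: max_length is checked before any trimming (0 disables it).
    if max_length > 0 and len(amplicon) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        amplicon = amplicon[:-trim_right]
    # Step 5: truncate to trunc_len nucleotides.
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        amplicon = amplicon[trim_left:]
    # Step 7: min_length is checked after all trimming (0 disables it).
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon
```

Because min_length is applied last, a 100-nucleotide amplicon trimmed with trim_right=10, trunc_len=60, and trim_left=10 ends up at exactly 50 nucleotides and is only just retained under the default min_length.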

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

Forward primer sequence (5' -> 3').[required]

r_primer: Str

Reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]

trim_right: Int

trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

Read is truncated to trunc_len nucleotides if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]

identity: Float

Minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of separate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
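The least-common-ancestor idea can be sketched as below. This is a simplified illustration, not the plugin's code: the function name lca_consensus, the 0.0 score for the unassignable case, and the comparison of whole lineage prefixes at each rank are assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple


def lca_consensus(
    annotations: List[str],
    min_consensus: float = 0.51,
    unassignable_label: str = "Unassigned",
) -> Tuple[str, float]:
    """Walk down the ranks while a large enough fraction of hits agrees."""
    if not 0.5 < min_consensus <= 1.0:
        raise ValueError("min_consensus must be in (0.5, 1.0]")
    lineages = [tuple(r.strip() for r in a.split(";")) for a in annotations]
    if not lineages:
        return unassignable_label, 0.0
    total = len(lineages)
    consensus: Tuple[str, ...] = ()
    score = 0.0
    depth = min(len(lineage) for lineage in lineages)
    for level in range(1, depth + 1):
        # Most common lineage prefix at this depth and how many hits share it.
        prefix, count = Counter(l[:level] for l in lineages).most_common(1)[0]
        fraction = count / total
        if fraction < min_consensus:
            break
        consensus, score = prefix, fraction
    if not consensus:
        return unassignable_label, 0.0
    return ";".join(consensus), score
```

For four hits where three share "k__Bacteria; p__Firmicutes" but only two share the same class, the sketch returns the phylum-level lineage with a score of 0.75.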

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
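The interplay of maxaccepts and maxrejects described above can be illustrated with a toy search loop. This is a sketch under simplifying assumptions and not VSEARCH's algorithm: the identity function is a position-by-position stand-in for pairwise alignment, None stands for "all", and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a: str, b: str) -> float:
    """Toy stand-in for alignment: matching positions over the longer length."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def search(
    query: str,
    targets: Dict[str, str],
    perc_identity: float = 0.8,
    maxaccepts: Optional[int] = 10,
    maxrejects: Optional[int] = None,
    k: int = 8,
) -> List[Tuple[str, float]]:
    """Examine targets in order of shared k-mers; stop on accept/reject limits."""
    query_kmers = kmers(query, k)
    ranked = sorted(
        targets,
        key=lambda tid: len(query_kmers & kmers(targets[tid], k)),
        reverse=True,
    )
    hits: List[Tuple[str, float]] = []
    rejects = 0
    for tid in ranked:
        ident = identity(query, targets[tid])
        if ident >= perc_identity:
            hits.append((tid, ident))
            if maxaccepts is not None and len(hits) >= maxaccepts:
                break  # enough accepted hits
        else:
            rejects += 1
            if maxrejects is not None and rejects >= maxrejects:
                break  # too many failed candidates
    return hits
```

Setting both limits to None makes the loop visit every target, which corresponds to searching the complete database.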

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-blast

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
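The difference between "first N hits" and "top N hits" can be shown with a small sketch. The hit tuples and function names are made up for illustration, and the threshold is treated as inclusive, following the perc_identity description ("Reject match if percent identity to query is lower").

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (reference ID, fraction identity to the query)


def first_n_passing(hits_in_db_order: List[Hit], perc_identity: float = 0.8,
                    maxaccepts: int = 10) -> List[Hit]:
    """Keep the first N hits that pass the threshold, in database order."""
    passing = [h for h in hits_in_db_order if h[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n(hits: List[Hit], perc_identity: float = 0.8,
          maxaccepts: int = 10) -> List[Hit]:
    """Keep the N best-scoring hits that pass the threshold."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:maxaccepts]
```

With hits of 0.82, 0.79, 0.99, and 0.90 identity in database order and maxaccepts=2, the first function keeps the 0.82 and 0.99 hits, while the second keeps the 0.99 and 0.90 hits.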

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, performs a rough positive filter to remove artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, performs a VSEARCH exact match between query and reference_reads to find exact matches, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.
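The exact-match-then-fallback flow of the hybrid approach can be sketched in a few lines. This is only an outline of the control flow: the prefilter step is omitted, the exact-match lookup is modeled as a dictionary from reference sequence to consensus taxonomy, and the fallback callable stands in for a pre-trained classifier. All names are hypothetical.

```python
from typing import Callable, Dict


def hybrid_classify(
    queries: Dict[str, str],
    exact_match_taxonomy: Dict[str, str],
    fallback: Callable[[str], str],
) -> Dict[str, str]:
    """Use the exact-match taxonomy when available, else the fallback classifier."""
    results: Dict[str, str] = {}
    for query_id, seq in queries.items():
        if seq in exact_match_taxonomy:
            # An exact full-length match exists among the reference reads.
            results[query_id] = exact_match_taxonomy[seq]
        else:
            # No exact match: defer to the (stand-in) trained classifier.
            results[query_id] = fallback(seq)
    return results
```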

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2024.10.0
website: https://github.com/qiime2/q2-feature-classifier
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations:
Bokulich et al., 2018

Actions

NameTypeShort Description
fit-classifier-sklearnmethodTrain an almost arbitrary scikit-learn classifier
classify-sklearnmethodPre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayesmethodTrain the naive_bayes classifier
extract-readsmethodExtract reads from reference sequences.
find-consensus-annotationmethodFind consensus among multiple annotations.
makeblastdbmethodMake BLAST database.
blastmethodBLAST+ local alignment search.
vsearch-globalmethodVSEARCH global alignment search
classify-consensus-blastpipelineBLAST+ consensus taxonomy classifier
classify-consensus-vsearchpipelineVSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearnpipelineALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes

BLASTDB
TaxonomicClassifier

Formats

BLASTDBDirFmtV5
TaxonomicClassifierDirFmt
TaxonomicClassiferTemporaryPickleDirFmt


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.
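
The order of steps 3 to 7 above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the plugin's implementation: the function name is hypothetical, the amplicon is assumed to have been extracted with its primers already removed (steps 1 and 2), and a filtered-out read is represented by None.

```python
from typing import Optional


def trim_and_filter(amplicon: str, trim_right: int = 0, trunc_len: int = 0,
                    trim_left: int = 0, min_length: int = 50,
                    max_length: int = 0) -> Optional[str]:
    """Apply steps 3-7 to an amplicon whose primers are already removed.

    Returns the processed read, or None if the read is filtered out.
    """
    # Step 3: discard amplicons longer than max_length (0 disables the filter).
    if max_length > 0 and len(amplicon) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        amplicon = amplicon[:-trim_right]
    # Step 5: truncate to trunc_len nucleotides.
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        amplicon = amplicon[trim_left:]
    # Step 7: discard reads shorter than min_length (0 disables the filter).
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon
```

Because trim_left is applied after trunc_len, a 120-nt amplicon processed with trim_right=20, trunc_len=90, and trim_left=10 ends up 80 nt long, not 90.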

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

Forward primer sequence (5' -> 3').[required]

r_primer: Str

Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]

trim_right: Int

trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

The read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]

identity: Float

Minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of separate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to be even, that is, every annotation is assumed to have the same number of ranks.

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']
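
One way to picture how min_consensus and unassignable_label interact is the sketch below. It is a simplified model, not the plugin's implementation: the function name, the "; " output delimiter, and the inclusive threshold comparison are all assumptions. It looks for the deepest shared lineage prefix supported by at least min_consensus of the annotations.

```python
from collections import Counter
from typing import List


def consensus_annotation(annotations: List[str], min_consensus: float = 0.51,
                         unassignable_label: str = "Unassigned") -> str:
    """Deepest lineage prefix shared by at least min_consensus of annotations."""
    if not annotations:
        return unassignable_label
    # Split each annotation into its ranks, e.g. "k__A; p__B" -> ["k__A", "p__B"].
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    depth = min(len(ranks) for ranks in split)
    # Walk from the deepest rank toward the root and stop at the first
    # prefix that enough annotations agree on.
    for d in range(depth, 0, -1):
        prefixes = Counter(tuple(ranks[:d]) for ranks in split)
        prefix, count = prefixes.most_common(1)[0]
        # Assumption: the threshold comparison is inclusive.
        if count / len(split) >= min_consensus:
            return "; ".join(prefix)
    return unassignable_label


hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
]
print(consensus_annotation(hits))                     # two of three agree at class level
print(consensus_annotation(hits, min_consensus=1.0))  # falls back to phylum level
```

With the default threshold, two of the three hypothetical hits agree down to class level, so the full lineage is kept; requiring unanimous agreement truncates the result to the shared phylum.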

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
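
The difference between "first N hits" and "top N hits" can be shown with a small sketch. The hit list, reference IDs, and function names below are hypothetical, and the functions only model the selection rule described above rather than BLAST itself.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (reference ID, fractional identity to the query)


def first_n_hits(hits_in_db_order: List[Hit], perc_identity: float,
                 maxaccepts: int) -> List[Hit]:
    """Keep the first N hits, in database order, that pass the threshold."""
    passing = [h for h in hits_in_db_order if h[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n_hits(hits_in_db_order: List[Hit], perc_identity: float,
               maxaccepts: int) -> List[Hit]:
    """Keep the N most similar hits that pass the threshold."""
    passing = [h for h in hits_in_db_order if h[1] >= perc_identity]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:maxaccepts]


hits = [("ref1", 0.82), ("ref2", 0.75), ("ref3", 0.99), ("ref4", 0.90)]
print(first_n_hits(hits, perc_identity=0.8, maxaccepts=2))
print(top_n_hits(hits, perc_identity=0.8, maxaccepts=2))
```

In this example the first-N rule keeps ref1 and ref3, while a top-N rule would keep ref3 and ref4, so a weaker match (ref1) can displace a stronger one (ref4) purely because of database order.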

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
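
The interplay of maxaccepts and maxrejects described above can be sketched as a simple loop. This is a toy model under stated assumptions, not VSEARCH's implementation: candidates carry a precomputed shared k-mer count and identity (a real search would align each target in turn), None stands in for "all", and all names are hypothetical.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, int, float]  # (reference ID, shared k-mers, identity)


def heuristic_search(candidates: List[Candidate],
                     perc_identity: float = 0.8,
                     maxaccepts: Optional[int] = 10,
                     maxrejects: Optional[int] = None) -> List[str]:
    """Return accepted reference IDs; None for a limit means "all"."""
    # Examine targets in order of decreasing shared k-mer count.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepted: List[str] = []
    rejected = 0
    for ref_id, _shared_kmers, identity in ordered:
        if identity >= perc_identity:
            accepted.append(ref_id)
            # Stop once enough hits have been accepted.
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break
        else:
            rejected += 1
            # Stop once too many targets have failed the threshold.
            if maxrejects is not None and rejected >= maxrejects:
                break
    return accepted


targets = [("refA", 40, 0.97), ("refB", 55, 0.70),
           ("refC", 50, 0.99), ("refD", 10, 0.85)]
print(heuristic_search(targets, maxaccepts=1))                   # ['refC']
print(heuristic_search(targets, maxaccepts=1, maxrejects=1))     # []
print(heuristic_search(targets, maxaccepts=None))                # ['refC', 'refA', 'refD']
```

The second call shows why a small maxrejects can miss good hits: the target sharing the most k-mers (refB) fails the identity threshold, and the search stops before refC is examined.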

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not counted against maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-blast

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not counted against maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, the pipeline performs a rough positive filter to remove artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, it performs a VSEARCH exact-match search between query and reference_reads, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.
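
The exact-match-then-fallback flow described above can be sketched as follows. This is a conceptual outline only, not the pipeline's code: the prefilter step and the maxaccepts limit are omitted, sequences and taxonomies are plain dictionaries, and the consensus and fallback callables are hypothetical stand-ins for the consensus assignment and the pre-trained sklearn classifier.

```python
from typing import Callable, Dict, List


def hybrid_classify(queries: Dict[str, str],
                    reference_reads: Dict[str, str],
                    reference_taxonomy: Dict[str, str],
                    consensus: Callable[[List[str]], str],
                    fallback: Callable[[str], str]) -> Dict[str, str]:
    """Exact match first, then a fallback classifier for unmatched queries."""
    # Index reference taxonomies by sequence so exact matching is a lookup.
    by_sequence: Dict[str, List[str]] = {}
    for ref_id, seq in reference_reads.items():
        by_sequence.setdefault(seq, []).append(reference_taxonomy[ref_id])

    classification: Dict[str, str] = {}
    for query_id, seq in queries.items():
        matches = by_sequence.get(seq)
        if matches:
            # One or more exact matches: derive a consensus taxonomy.
            classification[query_id] = consensus(matches)
        else:
            # No exact match: defer to the trained classifier.
            classification[query_id] = fallback(seq)
    return classification
```

Keeping the two stages behind separate callables makes it clear which queries were resolved by exact matching and which were passed to the trained classifier.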

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
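
The "auto" rule can be written out as a short function. This is a sketch of the stated formula only: the function name is hypothetical, rounding up and the floor of one read per batch are assumptions, and threads is assumed to have already been resolved to a positive count (a value of 0, meaning all CPUs, would need to be translated first).

```python
import math
from typing import Union


def resolve_reads_per_batch(reads_per_batch: Union[int, str],
                            n_queries: int, threads: int) -> int:
    """Resolve "auto" to min(number of query sequences / threads, 20000)."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # Assumption: round up, and never return fewer than one read per batch.
    return max(1, min(math.ceil(n_queries / threads), 20000))


print(resolve_reads_per_batch("auto", n_queries=100000, threads=4))  # 20000
print(resolve_reads_per_batch("auto", n_queries=1000, threads=4))    # 250
```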

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
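
The relationship between sample_size and randseed can be illustrated with Python's standard random module. This is only an analogy for the behavior described above, not how the pipeline performs its subsampling; the function name and the use of reference IDs as the sampled items are assumptions.

```python
import random
from typing import Iterable, List


def sample_reference_ids(reference_ids: Iterable[str],
                         sample_size: int = 1000,
                         randseed: int = 0) -> List[str]:
    """Draw up to sample_size IDs; a non-zero randseed makes the draw repeatable."""
    # A seed of 0 means "no fixed seed": let the generator seed itself.
    rng = random.Random(randseed if randseed != 0 else None)
    # Sort first so the result depends only on the seed, not on input order.
    ids = sorted(reference_ids)
    return rng.sample(ids, min(sample_size, len(ids)))
```

Calling the function twice with the same non-zero randseed returns the same subset, while randseed=0 may return a different subset on every call.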

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2024.10.0
website: https://github.com/qiime2/q2-feature-classifier
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations:
Bokulich et al., 2018

Actions

NameTypeShort Description
fit-classifier-sklearnmethodTrain an almost arbitrary scikit-learn classifier
classify-sklearnmethodPre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayesmethodTrain the naive_bayes classifier
extract-readsmethodExtract reads from reference sequences.
find-consensus-annotationmethodFind consensus among multiple annotations.
makeblastdbmethodMake BLAST database.
blastmethodBLAST+ local alignment search.
vsearch-globalmethodVSEARCH global alignment search
classify-consensus-blastpipelineBLAST+ consensus taxonomy classifier
classify-consensus-vsearchpipelineVSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearnpipelineALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes

BLASTDB
TaxonomicClassifier

Formats

BLASTDBDirFmtV5
TaxonomicClassifierDirFmt
TaxonomicClassiferTemporaryPickleDirFmt


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

forward primer sequence (5' -> 3').[required]

r_primer: Str

reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]

trim_right: Int

trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]

identity: Float

minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of seperate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format[required]

reference_taxonomy: FeatureData[Taxonomy]

reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
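
As a usage sketch, the parameters above can be assembled into a command line from Python. The helper name `build_blast_command` and the file names in the usage note are hypothetical. The flag spellings assume the usual QIIME 2 command-line convention (`--i-` for inputs, `--p-` for parameters, `--o-` for outputs, underscores written as hyphens), and the check that exactly one of reference_reads or blastdb is supplied is an assumption drawn from the "incompatible" notes.

```python
def build_blast_command(query, search_results, reference_reads=None,
                        blastdb=None, maxaccepts=10, perc_identity=0.8,
                        query_cov=0.8, strand="both", evalue=0.001,
                        output_no_hits=True, num_threads=1):
    """Build an argv list for `qiime feature-classifier blast`.

    Hypothetical helper for illustration; it only assembles strings.
    """
    # reference_reads and blastdb are documented as incompatible.
    # Requiring exactly one of them is an assumption of this sketch.
    if (reference_reads is None) == (blastdb is None):
        raise ValueError("Provide exactly one of reference_reads or blastdb.")
    if maxaccepts < 1:
        raise ValueError("maxaccepts must be at least 1.")
    for name, value in (("perc_identity", perc_identity),
                        ("query_cov", query_cov)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0.0, 1.0].")
    if strand not in ("both", "plus", "minus"):
        raise ValueError("strand must be 'both', 'plus', or 'minus'.")

    command = ["qiime", "feature-classifier", "blast", "--i-query", query]
    if reference_reads is not None:
        command += ["--i-reference-reads", reference_reads]
    else:
        command += ["--i-blastdb", blastdb]
    command += [
        "--p-maxaccepts", str(maxaccepts),
        "--p-perc-identity", str(perc_identity),
        "--p-query-cov", str(query_cov),
        "--p-strand", strand,
        "--p-evalue", str(evalue),
        "--p-num-threads", str(num_threads),
    ]
    # Boolean parameters are assumed to be passed as paired on/off flags.
    command.append("--p-output-no-hits" if output_no_hits
                   else "--p-no-output-no-hits")
    command += ["--o-search-results", search_results]
    return command
```

For example, `build_blast_command("query.qza", "hits.qza", blastdb="db.qza")` returns a list that could be passed to `subprocess.run`, with the database taken from a previously built BLASTDB artifact.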


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, when maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
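
The interplay of maxaccepts and maxrejects described above can be sketched as a loop over candidate targets ordered by shared k-mer count. This is a simplified illustration, not VSEARCH's implementation: the function name `kmer_guided_search` is hypothetical, candidates are assumed to arrive as precomputed `(target_id, shared_kmers, identity)` tuples, and a hit is accepted when its identity is at least perc_identity.

```python
def kmer_guided_search(candidates, perc_identity=0.8, maxaccepts=10,
                       maxrejects="all"):
    """Return accepted (target_id, identity) pairs for one query.

    Hypothetical helper: candidates are (target_id, shared_kmers,
    identity) tuples. Targets sharing more k-mers are examined first.
    """
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepted, rejects = [], 0
    for target_id, _shared_kmers, identity in ordered:
        if identity >= perc_identity:
            accepted.append((target_id, identity))
            # Stop once enough hits have been accepted.
            if maxaccepts != "all" and len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            # Stop once too many candidates have failed the threshold.
            if maxrejects != "all" and rejects >= maxrejects:
                break
    return accepted
```

With both limits set to "all" the loop never exits early, so every candidate is examined, which corresponds to searching the complete database.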


feature-classifier classify-consensus-blast

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
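
The difference between "the first N hits that pass the threshold" and "the top N hits" can be shown with a toy example. Both function names are hypothetical, the hit list stands in for reference sequences in database order, and a hit is taken to pass when its identity is at least perc_identity.

```python
def first_n_hits(hits, perc_identity, n):
    """Keep the first n passing hits in database order."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return passing[:n]


def top_n_hits(hits, perc_identity, n):
    """Keep the n passing hits with the highest identity."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:n]


# (reference_id, identity) pairs in database order; values are made up.
database_order = [("ref1", 0.85), ("ref2", 0.70),
                  ("ref3", 0.99), ("ref4", 0.90)]
first_two = first_n_hits(database_order, perc_identity=0.8, n=2)
top_two = top_n_hits(database_order, perc_identity=0.8, n=2)
```

Here `first_two` contains ref1 and ref3 while `top_two` contains ref3 and ref4, so the consensus taxonomy can differ between the two strategies when the database holds many sequences above the threshold.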


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, when maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
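
The hand-off between the search step and the consensus step can be sketched by grouping BLAST6 rows per query and looking up each subject's taxonomy. This is an illustrative sketch only: the function name `annotations_per_query` is hypothetical, the input is assumed to be the standard 12-column tab-separated BLAST6 layout, and queries without hits are assumed to be written with `*` as the subject ID.

```python
import csv
import io
from collections import defaultdict

# Standard BLAST tabular (outfmt 6) column order.
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore"]


def annotations_per_query(blast6_text, taxonomy):
    """Map each query ID to the taxonomy strings of its hits.

    Hypothetical helper: taxonomy is a dict of reference ID to lineage.
    Queries reported without hits map to an empty list.
    """
    grouped = defaultdict(list)
    reader = csv.reader(io.StringIO(blast6_text), delimiter="\t")
    for row in reader:
        if not row:
            continue
        record = dict(zip(BLAST6_COLUMNS, row))
        # Touch the key first so that no-hit queries are still listed.
        hits = grouped[record["qseqid"]]
        # "*" as the no-hit placeholder is an assumption of this sketch.
        if record["sseqid"] != "*":
            hits.append(taxonomy[record["sseqid"]])
    return dict(grouped)
```

Each resulting list is what a consensus step would reduce to a single lineage, and an empty list corresponds to a query that receives the unassignable_label.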


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, performs a rough positive filter to remove artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, performs a VSEARCH exact match between query and reference_reads to find exact matches, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, when maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]
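
The two-stage flow of this pipeline (exact match first, trained classifier as fallback) can be sketched as follows. This is a simplified illustration, not the pipeline's code: the names `hybrid_classify` and `shared_ranks` are hypothetical, the prefilter step is omitted, the consensus step is replaced by a stand-in that keeps only ranks shared by every exact match (equivalent to requiring full agreement), and the classifier is represented by any callable that maps a sequence to a lineage.

```python
def shared_ranks(taxonomies, unassignable_label="Unassigned"):
    """Stand-in consensus: keep leading ranks shared by all lineages."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared) if shared else unassignable_label


def hybrid_classify(queries, reference_reads, reference_taxonomy,
                    fallback, consensus=shared_ranks):
    """Classify queries by exact match, else with a fallback classifier.

    queries and reference_reads map IDs to sequences; reference_taxonomy
    maps reference IDs to lineages; fallback maps a sequence to a lineage.
    """
    # Index reference lineages by exact sequence for constant-time lookup.
    exact_index = {}
    for ref_id, sequence in reference_reads.items():
        exact_index.setdefault(sequence, []).append(reference_taxonomy[ref_id])

    results = {}
    for query_id, sequence in queries.items():
        matches = exact_index.get(sequence)
        if matches:
            # One or more identical references: resolve by consensus.
            results[query_id] = consensus(matches)
        else:
            # No exact match: defer to the trained classifier.
            results[query_id] = fallback(sequence)
    return results
```

Since only identical sequences count as matches in the first stage, query and reference reads need to cover exactly the same region for that stage to contribute anything.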

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2024.10.0
website: https://github.com/qiime2/q2-feature-classifier
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations:
Bokulich et al., 2018

Actions

NameTypeShort Description
fit-classifier-sklearnmethodTrain an almost arbitrary scikit-learn classifier
classify-sklearnmethodPre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayesmethodTrain the naive_bayes classifier
extract-readsmethodExtract reads from reference sequences.
find-consensus-annotationmethodFind consensus among multiple annotations.
makeblastdbmethodMake BLAST database.
blastmethodBLAST+ local alignment search.
vsearch-globalmethodVSEARCH global alignment search
classify-consensus-blastpipelineBLAST+ consensus taxonomy classifier
classify-consensus-vsearchpipelineVSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearnpipelineALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes

BLASTDB
TaxonomicClassifier

Formats

BLASTDBDirFmtV5
TaxonomicClassifierDirFmt
TaxonomicClassiferTemporaryPickleDirFmt


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

forward primer sequence (5' -> 3').[required]

r_primer: Str

reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]

trim_right: Int

trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]

identity: Float

minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of seperate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format[required]

reference_taxonomy: FeatureData[Taxonomy]

reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to FALSE to mirror default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table must be filtered to exclude unclassified sequences; otherwise you may run into downstream errors from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with a percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, a high perc_identity value can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
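
The way maxaccepts and maxrejects bound the search, as described in the maxaccepts parameter, can be illustrated with a minimal Python sketch. This is not VSEARCH's implementation: toy_identity is a hypothetical stand-in for a real global alignment, k=3 is an arbitrary k-mer size chosen for the tiny example, and None stands in for "all".

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(a, b):
    """Fraction of matching positions; a stand-in for a real global alignment."""
    longest = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / longest if longest else 0.0


def search(query, targets, perc_identity=0.8, maxaccepts=10, maxrejects=None, k=3):
    """Toy heuristic search. targets maps reference ID -> sequence."""
    query_kmers = kmers(query, k)
    # Rank targets by shared k-mer count, a cheap proxy for similarity.
    ranked = sorted(
        targets.items(),
        key=lambda item: len(query_kmers & kmers(item[1], k)),
        reverse=True,
    )
    accepted, rejected = [], 0
    for ref_id, seq in ranked:
        identity = toy_identity(query, seq)
        if identity >= perc_identity:
            accepted.append((ref_id, identity))
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break  # enough hits: stop searching for this query
        else:
            rejected += 1
            if maxrejects is not None and rejected >= maxrejects:
                break  # too many failures: give up on this query
    return accepted


targets = {"exact": "ACGTACGTAC", "one_off": "ACGTACGTAA", "far": "TTTTTTTTTT"}
print(search("ACGTACGTAC", targets, maxaccepts=1))
print(search("ACGTACGTAC", targets, maxaccepts=None))
```

With maxaccepts=1 the loop stops at the first passing target; with both limits set to None every target is aligned, which corresponds to searching the complete database.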


feature-classifier classify-consensus-blast

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns each query sequence a consensus taxonomy drawn from up to maxaccepts hits, at least a min_consensus fraction of which must share that taxonomic assignment. Note that maxaccepts selects the first N hits that meet the perc_identity threshold, not the top N matches. For the top N hits, use classify-consensus-vsearch.
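
The difference between "first N hits" and "top N hits" can be shown with a short Python sketch. It is only an illustration of the selection rule, not BLAST+ or VSEARCH code: the identities are hypothetical precomputed values listed in database order, and both function names are made up for this example.

```python
def first_n_hits(identities, perc_identity, maxaccepts):
    """Keep the first N passing hits in database order."""
    hits = []
    for ref_id, identity in identities:
        if identity >= perc_identity:
            hits.append((ref_id, identity))
            if len(hits) == maxaccepts:
                break
    return hits


def top_n_hits(identities, perc_identity, maxaccepts):
    """Score every passing hit, then keep the N most similar."""
    passing = [(r, i) for r, i in identities if i >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


# Hypothetical identities, in the order the references appear in the database.
database = [("r1", 0.82), ("r2", 0.79), ("r3", 0.99), ("r4", 0.85), ("r5", 0.97)]
print(first_n_hits(database, 0.8, 2))  # r1 and r3
print(top_n_hits(database, 0.8, 2))    # r3 and r5
```

Here the first-N rule keeps r1 even though r5 is a much closer match, which is why the choice of method can change the consensus taxonomy.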

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table must be filtered to exclude unclassified sequences; otherwise you may run into downstream errors from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns each query sequence a consensus taxonomy drawn from up to maxaccepts top hits, at least a min_consensus fraction of which must share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep every hit that meets the perc_identity threshold. Note that if strand=both, maxaccepts keeps N hits for each direction (if searches in the opposite direction yield results that meet the minimum perc_identity); in that case, use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they share with the query sequence, using that count as a proxy for sequence similarity, and aligns them in that order. With maxaccepts set to 1, the first target sequence that passes the acceptance criteria is accepted as the best hit and the search stops for that query. With a higher value, more hits are accepted before the search stops. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table must be filtered to exclude unclassified sequences; otherwise you may run into downstream errors from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with a percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, a high perc_identity value can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
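
As a small illustration of the top_hits_only option described in the parameters, the following Python sketch keeps only the hits tied for the highest percent identity for one query. The function name and the (reference_id, identity) tuple layout are assumptions made for this example, not part of the plugin.

```python
def keep_top_hits(hits):
    """hits: list of (reference_id, identity) pairs for a single query."""
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    # Equally scored top hits are all retained for consensus assignment.
    return [(ref_id, identity) for ref_id, identity in hits if identity == best]


print(keep_top_hits([("a", 0.99), ("b", 0.97), ("c", 0.99)]))
```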


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, performs a rough positive filter to remove artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, performs a VSEARCH exact-match search between query and reference_reads, followed by least-common-ancestor consensus taxonomy assignment from among up to maxaccepts top hits, at least a min_consensus fraction of which must share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.
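
The exact-match-then-fallback flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the pipeline's code: it omits the prefilter and strand handling, the fallback callable is a hypothetical stand-in for the pre-trained sklearn classifier, and the consensus helper is a minimal least-common-ancestor rule written for this example.

```python
from collections import Counter


def consensus(annotations, min_consensus=0.51, unassignable_label="Unassigned"):
    """Deepest lineage prefix shared by at least min_consensus of annotations."""
    split = [a.split("; ") for a in annotations]
    for level in range(min(len(s) for s in split), 0, -1):
        prefix, count = Counter(tuple(s[:level]) for s in split).most_common(1)[0]
        if count / len(split) >= min_consensus:
            return "; ".join(prefix)
    return unassignable_label


def hybrid_classify(queries, reference_reads, reference_taxonomy, fallback,
                    min_consensus=0.51):
    """queries and reference_reads map ID -> sequence."""
    # Index references by sequence so exact matches are dictionary lookups.
    by_sequence = {}
    for ref_id, seq in reference_reads.items():
        by_sequence.setdefault(seq, []).append(ref_id)
    results = {}
    for query_id, seq in queries.items():
        ref_ids = by_sequence.get(seq)
        if ref_ids:
            lineages = [reference_taxonomy[r] for r in ref_ids]
            results[query_id] = consensus(lineages, min_consensus)
        else:
            # No exact match: defer to the (hypothetical) trained classifier.
            results[query_id] = fallback(seq)
    return results
```

A query identical to two references with conflicting phyla would be assigned only the rank they share, while a query with no exact match is passed to the fallback.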

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep every hit that meets the perc_identity threshold. Note that if strand=both, maxaccepts keeps N hits for each direction (if searches in the opposite direction yield results that meet the minimum perc_identity); in that case, use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they share with the query sequence, using that count as a proxy for sequence similarity, and aligns them in that order. With maxaccepts set to 1, the first target sequence that passes the acceptance criteria is accepted as the best hit and the search stops for that query. With a higher value, more hits are accepted before the search stops. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]
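
The "auto" setting of reads_per_batch is described as min(number of query sequences / threads, 20000). A minimal Python sketch of that rule follows. The integer division, the floor of one read per batch, and the use of os.cpu_count() when threads is 0 are assumptions of this example; the plugin's exact rounding may differ.

```python
import os


def auto_reads_per_batch(n_queries, threads=1, cap=20000):
    """Batch size per the documented rule, capped at 20000 reads."""
    if threads == 0:
        # 0 means "one per available CPU" in the parameter description.
        threads = os.cpu_count() or 1
    return max(1, min(n_queries // threads, cap))


print(auto_reads_per_batch(100000, threads=4))  # capped at 20000
print(auto_reads_per_batch(1000, threads=4))    # 250 reads per batch
```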


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
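
The read_orientation choices can be illustrated with a short Python sketch. The reverse_complement helper is standard; detect_orientation is only a guess at how an "auto" mode could work, using a hypothetical confidence_fn callable that scores a read, and is not the plugin's actual decision rule.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq, read_orientation="same"):
    """Apply the 'same' or 'reverse-complement' setting to one read."""
    if read_orientation == "same":
        return seq
    if read_orientation == "reverse-complement":
        return reverse_complement(seq)
    raise ValueError(f"unsupported orientation: {read_orientation!r}")


def detect_orientation(reads, confidence_fn, sample_size=100):
    """Pick the orientation with the higher total confidence on a sample."""
    sample = reads[:sample_size]
    same = sum(confidence_fn(r) for r in sample)
    flipped = sum(confidence_fn(reverse_complement(r)) for r in sample)
    return "same" if same >= flipped else "reverse-complement"


print(reverse_complement("AACG"))  # CGTT
```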

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]
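
To illustrate how a confidence threshold can limit taxonomic depth, the following Python sketch truncates a lineage at the first rank whose confidence falls below the threshold. It assumes hypothetical per-rank confidence values that never increase with depth; the function name, the return format, and the handling of fully unassigned reads are choices made for this example rather than the classifier's behavior.

```python
def limit_depth(ranks, confidences, confidence=0.7, unassigned="Unassigned"):
    """Return (lineage, confidence of the deepest rank kept)."""
    if confidence == "disable":
        return "; ".join(ranks), None
    kept = 0
    for rank_confidence in confidences:
        if rank_confidence < confidence:
            break
        kept += 1
    if kept == 0:
        return unassigned, None
    return "; ".join(ranks[:kept]), confidences[kept - 1]


ranks = ["k__Bacteria", "p__Firmicutes", "g__Bacillus"]
per_rank = [0.99, 0.85, 0.6]
print(limit_depth(ranks, per_rank, confidence=0.7))  # stops at phylum
print(limit_depth(ranks, per_rank, confidence=0))    # full depth, confidence kept
```

Setting the threshold to 0 keeps every rank while still reporting a confidence value, matching the distinction the parameter draws between 0 and "disable".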


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]

Outputs

classifier: TaxonomicClassifier

<no description>[required]
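
The feat_ext__ parameter names (analyzer, ngram_range, n_features, alternate_sign) appear to mirror the options of a hashing-based text vectorizer, with the default ngram_range of [7, 7] corresponding to 7-mers hashed into 8192 buckets. The idea can be sketched in plain Python as below. This is a conceptual illustration only: CRC32 is used as an arbitrary hash, and the sketch ignores the word-boundary padding of the 'char_wb' analyzer, lowercasing, and the 'l2' normalization.

```python
import zlib


def hashed_kmer_counts(seq, k=7, n_features=8192):
    """Count k-mers of seq after hashing each into one of n_features buckets."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


features = hashed_kmer_counts("ACGTACGTACGT")
print(sum(features.values()))  # number of 7-mers in a 12-nt read
```

Hashing keeps the feature vector at a fixed width regardless of how many distinct k-mers occur, at the cost of occasional collisions between unrelated k-mers.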


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:

1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

Forward primer sequence (5' -> 3').[required]

r_primer: Str

Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]

trim_right: Int

If trim_right is positive, trim_right nucleotides are removed from the 3' end. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

If trunc_len is positive, the read is cut to trunc_len nucleotides. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

If trim_left is positive, trim_left nucleotides are removed from the 5' end. Applied after trim_right and trunc_len.[default: 0]

identity: Float

Minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of separate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]
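
The order of the length filters and trimming steps for extract-reads (steps 3 to 7 in the action description) can be sketched in Python as follows. The function assumes it receives an amplicon whose primers have already been removed; the function name and the use of None to mark a discarded read are assumptions of this example.

```python
def trim_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                  min_length=50, max_length=0):
    """Apply the documented filter/trim order; return None if discarded."""
    # Step 3: the max_length filter runs before any trimming.
    if max_length > 0 and len(seq) > max_length:
        return None
    # Step 4: remove nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # Step 5: truncate to a fixed length.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # Step 6: remove nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # Step 7: the min_length filter runs after all trimming.
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


amplicon = "A" * 10 + "C" * 80 + "G" * 10  # 100 nt
print(trim_amplicon(amplicon, trim_right=10, trunc_len=60, trim_left=10))
```

Because min_length is checked last, a read that starts well above the minimum can still be discarded once trimming shortens it, as the min_length description warns.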


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]
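
The least-common-ancestor consensus used by find-consensus-annotation can be sketched in Python as below. This is a minimal illustration, not the plugin's implementation: it walks from the deepest rank upward and returns the first lineage prefix shared by at least a min_consensus fraction of annotations. Truncating to the shortest annotation and rounding the score to three decimals are assumptions of this example.

```python
from collections import Counter


def lca_consensus(annotations, min_consensus=0.51,
                  unassignable_label="Unassigned"):
    """Return (consensus lineage, fraction of annotations supporting it)."""
    if not annotations:
        return unassignable_label, 0.0
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    depth = min(len(ranks) for ranks in split)
    for level in range(depth, 0, -1):
        # Compare whole prefixes so identical labels under different
        # parents are never merged.
        prefixes = Counter(tuple(ranks[:level]) for ranks in split)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / len(split)
        if fraction >= min_consensus:
            return "; ".join(prefix), round(fraction, 3)
    return unassignable_label, 0.0


hits = [
    "k__Bacteria; p__Firmicutes; g__A",
    "k__Bacteria; p__Firmicutes; g__A",
    "k__Bacteria; p__Firmicutes; g__B",
]
print(lca_consensus(hits, min_consensus=0.51))
print(lca_consensus(hits, min_consensus=0.7))
```

Raising min_consensus trades taxonomic depth for agreement: at 0.51 two of the three hits are enough to assign the genus, while at 0.7 the assignment falls back to the phylum that all three share.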


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table must be filtered to exclude unclassified sequences; otherwise you may run into downstream errors from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in pair with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-blast

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

reference taxonomy labels.[required]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to FALSE to mirror default BLAST search.[default: True]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]
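
As a rough illustration of what a BLAST6 search result holds, the sketch below parses tab-separated rows, assuming the standard 12-column BLAST tabular layout (outfmt 6). The column names follow BLAST's defaults; the parse_blast6 helper and the example row are invented for illustration and are not part of the plugin.

```python
import csv
import io

# Default column order of BLAST tabular output (outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(handle):
    """Parse tab-separated BLAST6 rows into a list of dictionaries."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(BLAST6_COLUMNS, fields))
        # Convert the numeric columns most often used for filtering.
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up example row, not real search output.
example = "query1\tref42\t98.5\t250\t3\t1\t1\t250\t10\t259\t1e-120\t440\n"
hits = parse_blast6(io.StringIO(example))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["pident"])
```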


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
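
The interplay of maxaccepts and maxrejects described above can be sketched as a simple loop. This is a simplified model written for illustration, not VSEARCH's implementation: the candidate tuples, the precomputed identity values (standing in for pairwise alignment), and the function name search_one_query are all assumptions.

```python
def search_one_query(candidates, perc_identity, maxaccepts, maxrejects):
    """Model of an accept/reject search for a single query.

    candidates: list of (target_id, shared_kmers, identity) tuples, where
    identity stands in for the result of a pairwise alignment.
    maxaccepts / maxrejects: a positive int or the string "all".
    """
    accept_limit = float("inf") if maxaccepts == "all" else maxaccepts
    reject_limit = float("inf") if maxrejects == "all" else maxrejects
    accepted, rejects = [], 0
    # Targets are examined in order of decreasing shared k-mer count.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    for target_id, _shared_kmers, identity in ordered:
        if len(accepted) >= accept_limit or rejects >= reject_limit:
            break  # enough hits found, or too many failures: stop searching
        if identity >= perc_identity:
            accepted.append((target_id, identity))
        else:
            rejects += 1
    return accepted


toy = [("t1", 50, 0.95), ("t2", 48, 0.70), ("t3", 47, 0.99), ("t4", 30, 0.85)]
stop_early = search_one_query(toy, 0.8, maxaccepts=1, maxrejects="all")
one_reject = search_one_query(toy, 0.8, maxaccepts=2, maxrejects=1)
exhaustive = search_one_query(toy, 0.8, maxaccepts="all", maxrejects="all")
```

In this toy data the best match (t3) shares fewer k-mers than t1 and t2, so it is only found when the limits allow the search to continue past the first reject.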

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
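
A minimal sketch of the tie-keeping behavior described for top_hits_only, assuming hits are available as (target_id, identity) pairs. The helper name keep_top_hits and the toy values are hypothetical.

```python
def keep_top_hits(hits):
    """Keep every hit tied for the highest identity; drop the rest."""
    if not hits:
        return []
    best = max(identity for _target_id, identity in hits)
    return [hit for hit in hits if hit[1] == best]


toy_hits = [("refA", 0.99), ("refB", 0.97), ("refC", 0.99)]
tied_top = keep_top_hits(toy_hits)
print(tied_top)  # refA and refC share the highest identity
```

When several references tie, all of them feed into the consensus step, so the resulting assignment may be truncated to their shared ancestor.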

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not counted toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, performs a rough positive filter to remove artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, performs a VSEARCH exact-match search between query and reference_reads, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.
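
The control flow described above (exact match first, classifier as fallback) can be sketched as follows. This is a schematic under stated assumptions, not the pipeline's code: the prefilter step is omitted, exact_matches is a hypothetical lookup from sequence to an already-resolved consensus taxonomy, and fallback_classifier is a hypothetical callable standing in for the pre-trained sklearn classifier.

```python
def hybrid_classify(queries, exact_matches, fallback_classifier):
    """Assign taxonomy by exact match where possible, else by a fallback.

    queries: dict mapping query ID to sequence.
    exact_matches: dict mapping sequence to consensus taxonomy (hypothetical).
    fallback_classifier: callable taking a sequence and returning a taxonomy.
    """
    results = {}
    for query_id, seq in queries.items():
        if seq in exact_matches:
            # An exact match was found in the reference: use its taxonomy.
            results[query_id] = exact_matches[seq]
        else:
            # No exact match: defer to the trained classifier.
            results[query_id] = fallback_classifier(seq)
    return results


toy_queries = {"q1": "ACGTACGT", "q2": "TTTTGGGG"}
toy_exact = {"ACGTACGT": "k__Bacteria; g__ExactGenus"}
assigned = hybrid_classify(toy_queries, toy_exact, lambda seq: "k__Bacteria")
```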

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
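
The seed semantics described for sample_size and randseed can be illustrated with Python's standard random module. This sketch does not reproduce the random number generator used by the underlying tool, so its samples will differ from the pipeline's; the function name sample_reference_ids and the toy IDs are hypothetical.

```python
import random


def sample_reference_ids(reference_ids, sample_size=1000, randseed=0):
    """Draw a random subset of reference IDs.

    randseed=0 is treated as "pick an unpredictable seed"; any other value
    makes the draw repeatable.
    """
    seed = None if randseed == 0 else randseed
    rng = random.Random(seed)
    ids = list(reference_ids)
    if sample_size >= len(ids):
        return ids  # nothing to subsample
    return rng.sample(ids, sample_size)


toy_ids = [f"ref{i}" for i in range(100)]
run_a = sample_reference_ids(toy_ids, sample_size=5, randseed=42)
run_b = sample_reference_ids(toy_ids, sample_size=5, randseed=42)
print(run_a == run_b)  # the same non-zero seed gives the same subset
```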

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]


feature-classifier fit-classifier-sklearn

Train a scikit-learn classifier to classify reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classifier_specification: Str

<no description>[required]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier classify-sklearn

Classify reads by taxon using a fitted classifier.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reads: FeatureData[Sequence]

The feature data to be classified.[required]

classifier: TaxonomicClassifier

The taxonomic classifier for classifying the reads.[required]

Parameters

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
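
The "auto" rule above amounts to a one-line calculation. The sketch below assumes n_jobs has already been resolved to a positive worker count and rounds the quotient up; the rounding choice and the function name auto_reads_per_batch are assumptions, not confirmed plugin behavior.

```python
import math


def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Return min(n_queries / n_jobs, cap), rounded up to a whole read count."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be resolved to a positive integer first")
    return min(math.ceil(n_queries / n_jobs), cap)


print(auto_reads_per_batch(100000, 4))  # 25000 per job exceeds the cap
print(auto_reads_per_batch(1000, 4))    # small inputs are split evenly
```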

n_jobs: Threads

The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]

pre_dispatch: Str

"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
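
One plausible reading of "limiting taxonomic depth" is sketched below: starting from the most probable lineage, ranks are dropped from the end until the summed probability of all classes sharing the remaining prefix reaches the threshold. This page does not spell out the algorithm, so treat the logic, the function name limit_depth, the "Unassigned" fallback, and the toy probabilities as assumptions.

```python
def limit_depth(class_probs, confidence=0.7, separator=";",
                unassigned="Unassigned"):
    """Truncate the best lineage to the deepest rank with enough support.

    class_probs: dict mapping a full lineage string to its probability.
    Returns (lineage, support); support is None when confidence is disabled.
    """
    best = max(class_probs, key=class_probs.get)
    if confidence == "disable":
        return best, None  # no confidence calculation at all
    ranks = best.split(separator)
    for depth in range(len(ranks), 0, -1):
        prefix = ranks[:depth]
        # Sum the probability of every class that shares this prefix.
        support = sum(
            prob for lineage, prob in class_probs.items()
            if lineage.split(separator)[:depth] == prefix
        )
        if support >= confidence:
            return separator.join(prefix), support
    return unassigned, 0.0


toy_probs = {
    "k__B;p__F;g__A": 0.5,
    "k__B;p__F;g__B": 0.3,
    "k__B;p__P;g__C": 0.2,
}
lineage, support = limit_depth(toy_probs, confidence=0.7)
print(lineage)  # genus is dropped because no single genus reaches 0.7
```

With confidence set to 0 the full lineage is always returned together with its own probability, which matches the "calculate but do not apply" behavior described above.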

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
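
For readers unfamiliar with the operation, the sketch below shows what "reversed and complemented" means for a DNA read. It handles only unambiguous bases (A, C, G, T) and does not implement "auto", which depends on classifier confidence; the function names are hypothetical.

```python
# Translation table for unambiguous DNA bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_read(seq, read_orientation):
    """Apply the 'same' or 'reverse-complement' setting to a single read."""
    if read_orientation == "same":
        return seq
    if read_orientation == "reverse-complement":
        return reverse_complement(seq)
    raise ValueError(f"unsupported orientation in this sketch: {read_orientation}")


print(orient_read("ATGC", "reverse-complement"))  # GCAT
```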

Outputs

classification: FeatureData[Taxonomy]

<no description>[required]


feature-classifier fit-classifier-naive-bayes

Create a scikit-learn naive_bayes classifier for reads.

Citations

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs

reference_reads: FeatureData[Sequence]

<no description>[required]

reference_taxonomy: FeatureData[Taxonomy]

<no description>[required]

class_weight: FeatureTable[RelativeFrequency]

<no description>[optional]

Parameters

classify__alpha: Float

<no description>[default: 0.001]

classify__chunk_size: Int

<no description>[default: 20000]

classify__class_prior: Str

<no description>[default: 'null']

classify__fit_prior: Bool

<no description>[default: False]

feat_ext__alternate_sign: Bool

<no description>[default: False]

feat_ext__analyzer: Str

<no description>[default: 'char_wb']

feat_ext__binary: Bool

<no description>[default: False]

feat_ext__decode_error: Str

<no description>[default: 'strict']

feat_ext__encoding: Str

<no description>[default: 'utf-8']

feat_ext__input: Str

<no description>[default: 'content']

feat_ext__lowercase: Bool

<no description>[default: True]

feat_ext__n_features: Int

<no description>[default: 8192]

feat_ext__ngram_range: Str

<no description>[default: '[7, 7]']

feat_ext__norm: Str

<no description>[default: 'l2']

feat_ext__preprocessor: Str

<no description>[default: 'null']

feat_ext__stop_words: Str

<no description>[default: 'null']

feat_ext__strip_accents: Str

<no description>[default: 'null']

feat_ext__token_pattern: Str

<no description>[default: '(?u)\\b\\w\\w+\\b']

feat_ext__tokenizer: Str

<no description>[default: 'null']

verbose: Bool

<no description>[default: False]

Outputs

classifier: TaxonomicClassifier

<no description>[required]


feature-classifier extract-reads

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.
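
Steps 3 through 7 of the order above can be sketched as a single function applied to an amplicon whose primers have already been removed. Primer matching (steps 1 and 2) is not modeled, and the function name postprocess_amplicon is hypothetical; only the ordering of the length filters and trimming operations is taken from the description.

```python
def postprocess_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                         min_length=50, max_length=0):
    """Apply length filters and trimming in the documented order.

    Returns the processed read, or None if the read is filtered out.
    """
    # Step 3: discard amplicons longer than max_length (0 disables).
    if max_length > 0 and len(seq) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # Step 5: truncate to trunc_len.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # Step 7: discard reads shorter than min_length (0 disables).
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


amplicon = "A" * 10 + "C" * 80 + "G" * 10  # 100 nt toy amplicon
read = postprocess_amplicon(amplicon, trim_right=10, trunc_len=60, trim_left=10)
print(len(read))  # 50 nt remain, just enough to pass min_length=50
```

Because max_length is checked before trimming and min_length after, an amplicon can be rejected for being too long even if trimming would have shortened it, and trimming alone can push a read below min_length.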

Citations

Bokulich et al., 2018

Inputs

sequences: FeatureData[Sequence]

<no description>[required]

Parameters

f_primer: Str

Forward primer sequence (5' -> 3').[required]

r_primer: Str

Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]

trim_right: Int

trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]

trunc_len: Int

The read is truncated to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]

trim_left: Int

trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]

identity: Float

Minimum combined primer match identity threshold.[default: 0.8]

min_length: Int % Range(0, None)

Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]

max_length: Int % Range(0, None)

Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]

n_jobs: Int % Range(1, None)

Number of separate processes to run.[default: 1]

batch_size: Int % Range(1, None) | Str % Choices('auto')

Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']

read_orientation: Str % Choices('both', 'forward', 'reverse')

Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs

reads: FeatureData[Sequence]

<no description>[required]


feature-classifier find-consensus-annotation

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
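
A least-common-ancestor consensus with a min_consensus threshold can be sketched rank by rank, as below. This is an illustrative approximation, not the plugin's implementation: the function name lca_consensus, the returned fraction, and the "; " join used in the output are assumptions.

```python
from collections import Counter


def lca_consensus(annotations, min_consensus=0.51,
                  unassignable_label="Unassigned"):
    """Return (consensus_lineage, fraction) for semicolon-delimited lineages."""
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    if not lineages:
        return unassignable_label, 0.0
    total = len(lineages)
    consensus, fraction = [], 0.0
    candidates = lineages
    depth = min(len(lineage) for lineage in lineages)
    for level in range(depth):
        # Most common label at this rank among lineages still in agreement.
        counts = Counter(lineage[level] for lineage in candidates)
        label, count = counts.most_common(1)[0]
        if count / total < min_consensus:
            break  # not enough agreement: stop at the previous rank
        consensus.append(label)
        fraction = count / total
        candidates = [lin for lin in candidates if lin[level] == label]
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), fraction


toy_annotations = [
    "k__Bacteria; p__Firmicutes; g__A",
    "k__Bacteria; p__Firmicutes; g__B",
    "k__Bacteria; p__Proteobacteria; g__C",
]
consensus_lineage, consensus_fraction = lca_consensus(toy_annotations)
print(consensus_lineage)  # agreement holds through phylum but not genus
```

Raising min_consensus to 1.0 on the same toy input would stop at the kingdom rank, since only that rank is shared by every annotation.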

Citations

Bokulich et al., 2018

Inputs

search_results: FeatureData[BLAST6]

Search results in BLAST6 output format[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

unassignable_label: Str

Annotation given when no consensus is found.[default: 'Unassigned']

Outputs

consensus_taxonomy: FeatureData[Taxonomy]

Consensus taxonomy and scores.[required]


feature-classifier makeblastdb

Make BLAST database from custom sequence collection.

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

sequences: FeatureData[Sequence]

Input reference sequences.[required]

Outputs

database: BLASTDB

Output BLAST database.[required]


feature-classifier blast

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations

Bokulich et al., 2018; Camacho et al., 2009

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences. Incompatible with blastdb.[optional]

blastdb: BLASTDB

BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters

maxaccepts: Int % Range(1, None)

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]

strand: Str % Choices('both', 'plus', 'minus')

Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']

evalue: Float

BLAST expectation value (E) threshold for saving hits.[default: 0.001]

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]

num_threads: Threads

Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier vsearch-global

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not counted toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-consensus-vsearch

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations

Bokulich et al., 2018; Rognes et al., 2016

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if percent identity to query is lower.[default: 0.8]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

search_exact: Bool

Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]

top_hits_only: Bool

Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]

maxhits: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to show once the search is terminated.[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

Maximum number of non-matching target sequences to consider before stopping the search. This option works in pair with maxaccepts (see maxaccepts description for details).[default: 'all']

output_no_hits: Bool

Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]

weak_id: Float % Range(0.0, 1.0, inclusive_end=True)

Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
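
To make the role of min_consensus more concrete, here is a simplified Python sketch of a least-common-ancestor style consensus over semicolon-delimited annotations. The function name, the rank-by-rank voting rule, and the example lineages are assumptions for illustration; the plugin's own implementation may differ in its details.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return the deepest lineage prefix supported by >= min_consensus of hits."""
    if not annotations:
        return unassignable_label
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    total = len(lineages)
    consensus = []
    for depth in range(max(len(lineage) for lineage in lineages)):
        # Only lineages that agree with the consensus so far may vote.
        candidates = [
            lineage[depth] for lineage in lineages
            if len(lineage) > depth and lineage[:depth] == consensus
        ]
        if not candidates:
            break
        label, count = Counter(candidates).most_common(1)[0]
        if count / total < min_consensus:
            break  # Support drops below the threshold: stop at the parent rank.
        consensus.append(label)
    return "; ".join(consensus) if consensus else unassignable_label


hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]
relaxed = consensus_annotation(hits, min_consensus=0.51)
strict = consensus_annotation(hits, min_consensus=0.9)
```

With the default threshold, two of three hits agreeing on c__Bacilli is enough to keep that rank; with a threshold of 0.9 the assignment stops at p__Firmicutes.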

unassignable_label: Str

Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

search_results: FeatureData[BLAST6]

Top hits for each query.[required]


feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid classifier. First, a rough positive filter removes artifact and low-coverage sequences (use the "prefilter" parameter to toggle this step on or off). Second, a VSEARCH exact-match search between query and reference_reads finds exact matches, followed by least common ancestor consensus taxonomy assignment from among the maxaccepts top hits, min_consensus of which share that taxonomic assignment. Finally, query sequences without an exact match are classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.
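
The control flow of the last two steps can be sketched in a few lines of Python. Everything here is a stand-in: `exact_matches` is a hypothetical lookup that plays the role of the VSEARCH exact-match plus consensus step, `fallback_classifier` is a hypothetical callable that plays the role of the pre-trained sklearn classifier, and the prefilter step is omitted.

```python
def hybrid_classify(queries, exact_matches, fallback_classifier):
    """Assign taxonomy by exact match first, then by a fallback classifier.

    queries: dict mapping query ID -> sequence
    exact_matches: dict mapping reference sequence -> consensus taxonomy
        (hypothetical stand-in for the exact-match and consensus step)
    fallback_classifier: callable taking a sequence and returning a taxonomy
        (hypothetical stand-in for the sklearn classifier)
    """
    classification = {}
    for query_id, sequence in queries.items():
        if sequence in exact_matches:
            classification[query_id] = exact_matches[sequence]
        else:
            # No exact match: defer to the trained classifier.
            classification[query_id] = fallback_classifier(sequence)
    return classification


queries = {"q1": "ACGTACGT", "q2": "TTTTCCCC"}
exact_matches = {"ACGTACGT": "k__Bacteria; p__Firmicutes"}
result = hybrid_classify(queries, exact_matches, lambda seq: "k__Bacteria")
```

Here q1 is resolved by the exact-match lookup, while q2 falls through to the fallback classifier.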

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
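
The interplay of maxaccepts and maxrejects can be illustrated with a toy search loop in Python. This is a sketch of the idea described above and not VSEARCH's algorithm: the k-mer size, the `accept` callable (a stand-in for pairwise alignment plus the identity and coverage checks), and the use of None to mean "all" are assumptions for this example.

```python
def kmer_set(seq, k=3):
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def heuristic_search(query, targets, accept, maxaccepts=1, maxrejects=None, k=3):
    """Return IDs of accepted targets; None for a limit means 'all'."""
    query_kmers = kmer_set(query, k)
    # Rank targets by decreasing number of k-mers shared with the query.
    ordered = sorted(
        targets.items(),
        key=lambda item: len(query_kmers & kmer_set(item[1], k)),
        reverse=True,
    )
    accepted, rejected = [], 0
    for target_id, target_seq in ordered:
        if accept(query, target_seq):
            accepted.append(target_id)
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break  # Enough hits found: stop searching for this query.
        else:
            rejected += 1
            if maxrejects is not None and rejected >= maxrejects:
                break  # Too many failures: give up on this query.
    return accepted


def identity_at_least_80(query, target):
    """Toy acceptance test: positional identity of equal-length sequences."""
    matches = sum(a == b for a, b in zip(query, target))
    return len(query) == len(target) and matches / len(query) >= 0.8


targets = {"r1": "ACGTACGT", "r2": "ACGTTCGT", "r3": "TTTTTTTT"}
first_only = heuristic_search("ACGTACGT", targets, identity_at_least_80)
all_hits = heuristic_search("ACGTACGT", targets, identity_at_least_80,
                            maxaccepts=None, maxrejects=None)
```

With maxaccepts=1 the loop stops after r1; with both limits lifted, every target is examined and both r1 and r2 are accepted.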

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
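
The "auto" rule can be written out as a small Python helper. The function name, the use of integer division, and the floor of one read per batch are assumptions for this sketch; it also assumes that threads has already been resolved to a positive number (a value of 0, meaning one thread per CPU, would need to be translated first).

```python
def auto_reads_per_batch(n_queries, threads, cap=20000):
    """Approximate min(number of query sequences / threads, 20000)."""
    if threads < 1:
        raise ValueError("threads must be resolved to a positive integer")
    # Integer division and a minimum of 1 are assumptions of this sketch.
    return max(1, min(n_queries // threads, cap))


large_run = auto_reads_per_batch(100000, 4)  # 25000 per thread, capped
small_run = auto_reads_per_batch(1000, 4)
```

For 100,000 queries on 4 threads the cap applies and each batch holds 20,000 reads; for 1,000 queries on 4 threads each batch holds 250.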

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
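
The three modes of this parameter can be illustrated with a toy Python function. The per-rank confidence values, the function name, and the return conventions (None when confidence is disabled, "Unassigned" when no rank passes) are assumptions for this sketch and are not taken from the plugin's code.

```python
def truncate_by_confidence(ranks, confidences, threshold):
    """Trim a lineage at the first rank whose confidence is below threshold.

    ranks: labels ordered from the broadest to the most specific rank
    confidences: hypothetical per-rank confidence estimates
    threshold: a float in [0, 1], or the string "disable"
    """
    if threshold == "disable":
        # No confidence is calculated; the full lineage is returned.
        return "; ".join(ranks), None
    kept, kept_confidence = [], None
    for label, confidence in zip(ranks, confidences):
        if confidence < threshold:
            break
        kept.append(label)
        kept_confidence = confidence
    if not kept:
        return "Unassigned", None
    return "; ".join(kept), kept_confidence


ranks = ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales"]
confidences = [0.99, 0.95, 0.80, 0.55]
default_call = truncate_by_confidence(ranks, confidences, 0.7)
untrimmed = truncate_by_confidence(ranks, confidences, 0)
disabled = truncate_by_confidence(ranks, confidences, "disable")
```

With a threshold of 0.7 the lineage stops at c__Bacilli. With a threshold of 0 the full lineage is kept and its confidence is still reported, whereas "disable" skips the confidence estimate altogether.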

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
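
For readers unfamiliar with the terminology, the two explicit orientations can be sketched in Python as follows. The helper names are hypothetical, only unambiguous DNA bases are handled, and the "auto" mode is left out because it depends on the classifier's confidence estimates.

```python
# Translation table for unambiguous DNA bases (IUPAC codes are not covered).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_read(seq, read_orientation):
    """Apply the 'same' or 'reverse-complement' orientation to a read."""
    if read_orientation == "same":
        return seq
    if read_orientation == "reverse-complement":
        return reverse_complement(seq)
    raise ValueError(f"unsupported orientation in this sketch: {read_orientation}")


unchanged = orient_read("AACG", "same")
flipped = orient_read("AACG", "reverse-complement")
```

Here the read AACG is returned as-is for "same" and as CGTT for "reverse-complement".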

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
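
The effect of a fixed seed on the reference subsample can be demonstrated with Python's standard library. This sketch only mirrors the documented convention (a seed of 0 means no fixed seed); the function name and the use of `random.Random` are assumptions and do not reflect how the pipeline draws its sample internally.

```python
import random


def sample_reference_ids(reference_ids, sample_size, randseed=0):
    """Draw up to sample_size reference IDs, reproducibly if randseed != 0."""
    # Mirror the documented convention: 0 means "do not fix the seed".
    rng = random.Random() if randseed == 0 else random.Random(randseed)
    ids = sorted(reference_ids)  # Sort so the result depends only on the seed.
    return rng.sample(ids, min(sample_size, len(ids)))


reference_ids = [f"ref{i}" for i in range(100)]
first_draw = sample_reference_ids(reference_ids, 5, randseed=42)
second_draw = sample_reference_ids(reference_ids, 5, randseed=42)
```

Both draws with randseed=42 contain the same five IDs, which is the replicability property the parameter description refers to.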

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]

feature-classifier classify-hybrid-vsearch-sklearn

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using hybrid classifier. First performs rough positive filter to remove artifact and low-coverage sequences (use "prefilter" parameter to toggle this step on or off). Second, performs VSEARCH exact match between query and reference_reads to find exact matches, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.

Citations

Bokulich et al., 2018

Inputs

query: FeatureData[Sequence]

Query Sequences.[required]

reference_reads: FeatureData[Sequence]

Reference sequences.[required]

reference_taxonomy: FeatureData[Taxonomy]

Reference taxonomy labels.[required]

classifier: TaxonomicClassifier

Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters

maxaccepts: Int % Range(1, None) | Str % Choices('all')

Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]

perc_identity: Float % Range(0.0, 1.0, inclusive_end=True)

Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]

query_cov: Float % Range(0.0, 1.0, inclusive_end=True)

Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]

strand: Str % Choices('both', 'plus')

Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)

Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]

maxhits: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

maxrejects: Int % Range(1, None) | Str % Choices('all')

<no description>[default: 'all']

reads_per_batch: Int % Range(1, None) | Str % Choices('auto')

Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']

confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')

Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

read_orientation: Str % Choices('same', 'reverse-complement', 'auto')

Direction of reads with respect to reference sequences in pre-trained sklearn classifier. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

threads: Threads

Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

prefilter: Bool

Toggle positive filter of query sequences on or off.[default: True]

sample_size: Int % Range(1, None)

Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]

randseed: Int % Range(0, None)

Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
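
The interaction between sample_size and randseed can be sketched in Python. This is an illustration only: the function sample_reference is hypothetical and uses Python's random module as a stand-in for whatever subsampling the pipeline performs internally; capping the sample at the database size is also an assumption.

```python
import random


def sample_reference(reference_ids, sample_size=1000, randseed=0):
    """Draw a random subset of reference IDs for prefiltering (sketch)."""
    # randseed=0 is documented as "use a pseudo-random seed"; here that maps
    # to an unseeded generator. Any other value gives a repeatable draw.
    rng = random.Random(randseed) if randseed != 0 else random.Random()
    ids = list(reference_ids)
    k = min(sample_size, len(ids))  # assumption: never ask for more than exist
    return rng.sample(ids, k)


ids = [f"ref{i}" for i in range(5000)]
first = sample_reference(ids, sample_size=1000, randseed=42)
second = sample_reference(ids, sample_size=1000, randseed=42)
print(first == second)  # True: the same non-zero seed repeats the draw
```

Fixing a non-zero seed makes the prefilter sample repeatable across runs, which is the replicability property the parameter description refers to.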

Outputs

classification: FeatureData[Taxonomy]

Taxonomy classifications of query sequences.[required]
