q2-feature-classifier

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
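
The "auto" rule for reads_per_batch can be sketched in a few lines. This is only an illustration of the documented formula, not the plugin's code: the function name auto_reads_per_batch is hypothetical, and rounding up, the lower bound of 1, and the requirement that n_jobs is already a positive integer are all assumptions.

```python
import math


def auto_reads_per_batch(n_query_sequences, n_jobs, cap=20000):
    """Illustrate min(number of query sequences / n_jobs, 20000)."""
    if n_jobs < 1:
        # Assumption: n_jobs=0 ("all CPUs") has been resolved to a real count.
        raise ValueError("this sketch expects a positive n_jobs")
    # Assumption: round up so no read is left out, and never return 0.
    per_job = max(1, math.ceil(n_query_sequences / n_jobs))
    return min(per_job, cap)


print(auto_reads_per_batch(5000, 4))        # small input: one batch per job
print(auto_reads_per_batch(1_000_000, 8))   # large input: capped at 20000
```

Under these assumptions, small inputs are split evenly across workers, while large inputs are capped at 20000 reads per batch.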

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:

1. reads are extracted in the specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
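
The order of the trimming and filtering steps matters, so a small sketch may help. It starts from an amplicon whose primers have already been removed (steps 1 and 2 are not modeled) and applies steps 3 to 7 in the documented order. The function name trim_and_filter is hypothetical, and leaving reads shorter than trunc_len unchanged is an assumption of this sketch, not a statement about the plugin.

```python
def trim_and_filter(amplicon, trim_right=0, trunc_len=0, trim_left=0,
                    min_length=50, max_length=0):
    """Apply steps 3-7 to a primer-free amplicon; return None if discarded."""
    # Step 3: max_length is checked before any trimming (0 disables it).
    if max_length > 0 and len(amplicon) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        amplicon = amplicon[:-trim_right]
    # Step 5: cut to trunc_len (assumption: shorter reads pass unchanged).
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        amplicon = amplicon[trim_left:]
    # Step 7: min_length is checked after all trimming (0 disables it).
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon


read = "ACGTACGTACGTACGTACGT"  # 20 nt toy amplicon
print(trim_and_filter(read, trim_right=2, trunc_len=10, trim_left=3, min_length=5))
```

Because min_length is applied last, aggressive trimming can push an otherwise valid amplicon below the threshold, which is the retention effect the min_length description warns about.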

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: forward primer sequence (5' -> 3').[required]
r_primer: Str: reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
trim_right: Int: trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
identity: Float: minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']
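
As a rough illustration of how min_consensus and unassignable_label interact, the sketch below walks the ranks of semicolon-delimited annotations from the top and keeps a rank only while its most common label is supported by at least min_consensus of all annotations. This is a simplified model written for this page, not the plugin's implementation; the function name consensus_annotation and the "; " output delimiter are assumptions.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return the deepest lineage supported by >= min_consensus of annotations."""
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    total = len(lineages)
    consensus = []
    candidates = lineages
    depth = 0
    while True:
        # Labels at this depth among lineages that agree with the consensus so far.
        labels = [lin[depth] for lin in candidates if len(lin) > depth]
        if not labels:
            break
        label, count = Counter(labels).most_common(1)[0]
        if count / total < min_consensus:
            break
        consensus.append(label)
        candidates = [lin for lin in candidates
                      if len(lin) > depth and lin[depth] == label]
        depth += 1
    return "; ".join(consensus) if consensus else unassignable_label


hits = ["k__A; p__B; g__C", "k__A; p__B; g__D", "k__A; p__B; g__C"]
print(consensus_annotation(hits))                     # 2 of 3 agree at genus
print(consensus_annotation(hits, min_consensus=0.8))  # stops at phylum
```

Raising min_consensus trades taxonomic depth for agreement: the same three hits resolve to genus at 0.51 but only to phylum at 0.8.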

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
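
The maxaccepts note distinguishes the first N hits that pass the threshold from the top N hits. The toy comparison below makes that difference concrete. It does not call BLAST+; the hit list, identities, and function names are made up for illustration.

```python
def first_n_hits(hits, perc_identity, n):
    """Keep the first n hits, in database order, that pass the threshold."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return passing[:n]


def top_n_hits(hits, perc_identity, n):
    """Keep the n passing hits with the highest identity."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:n]


# Hypothetical (reference ID, identity) pairs in database order.
hits = [("ref1", 0.85), ("ref2", 0.99), ("ref3", 0.82),
        ("ref4", 0.97), ("ref5", 0.70)]

print(first_n_hits(hits, perc_identity=0.8, n=2))
print(top_n_hits(hits, perc_identity=0.8, n=2))
```

With a small maxaccepts, a weaker hit that happens to come early in the database can displace a stronger hit that comes later, which is why the note suggests care when interpreting the returned hits.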

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
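
The interplay of maxaccepts and maxrejects described above can be modeled with a short simulation. It is a simplified sketch of the stopping rule only, not VSEARCH itself: shared k-mer counts and identities are supplied as made-up numbers, None stands in for "all", and the function name simulate_search is hypothetical.

```python
def simulate_search(candidates, perc_identity=0.8, maxaccepts=1, maxrejects=None):
    """candidates: (target ID, shared k-mer count, alignment identity) tuples."""
    # Targets are examined in order of decreasing shared k-mers (the proxy
    # for similarity), not in order of true identity.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepted = []
    rejects = 0
    for target_id, _shared_kmers, identity in ordered:
        if identity >= perc_identity:
            accepted.append(target_id)
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break  # enough hits accepted: stop searching
        else:
            rejects += 1
            if maxrejects is not None and rejects >= maxrejects:
                break  # too many non-matching targets: stop searching
    return accepted


candidates = [("a", 5, 0.75), ("b", 9, 0.95), ("c", 7, 0.60),
              ("d", 6, 0.90), ("e", 4, 0.99)]

print(simulate_search(candidates, maxaccepts=2))                # stops after b and d
print(simulate_search(candidates, maxaccepts=2, maxrejects=1))  # stops at first reject
print(simulate_search(candidates, maxaccepts=None))             # examines every target
```

In this toy data the best-matching target ("e") shares the fewest k-mers, so it is only found when the whole database is searched. This illustrates why setting both options to "all" is the exhaustive, and slowest, configuration.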

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-blast¶

Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-vsearch¶

Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-hybrid-vsearch-sklearn¶

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://forum.qiime2.org! Assign taxonomy to query sequences using hybrid classifier. First performs rough positive filter to remove artifact and low-coverage sequences (use "prefilter" parameter to toggle this step on or off). Second, performs VSEARCH exact match between query and reference_reads to find exact matches, followed by least common ancestor consensus taxonomy assignment from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Query sequences without an exact match are then classified with a pre-trained sklearn taxonomy classifier to predict the most likely taxonomic lineage.

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
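
The two-stage flow of this pipeline (exact match first, then a trained classifier for everything else) can be outlined as follows. This is a conceptual sketch only: it skips the prefilter step, uses a dictionary lookup in place of VSEARCH, a strict shared-prefix rule in place of the min_consensus step, and a stub in place of the sklearn classifier. All function names here are hypothetical.

```python
def shared_prefix(annotations):
    """Stand-in consensus: keep only ranks on which every annotation agrees."""
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    prefix = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return "; ".join(prefix) if prefix else "Unassigned"


def hybrid_classify(queries, reference_reads, reference_taxonomy, fallback):
    """queries and reference_reads map IDs to sequences; fallback maps seq -> taxonomy."""
    # Index reference sequences so exact, full-length matches are a lookup.
    exact_index = {}
    for ref_id, seq in reference_reads.items():
        exact_index.setdefault(seq, []).append(reference_taxonomy[ref_id])

    results = {}
    for query_id, seq in queries.items():
        if seq in exact_index:
            # Stage 1: consensus over all exactly matching references.
            results[query_id] = shared_prefix(exact_index[seq])
        else:
            # Stage 2: no exact match, defer to the trained classifier (stubbed).
            results[query_id] = fallback(seq)
    return results


reference_reads = {"r1": "ACGT", "r2": "ACGT", "r3": "GGGG"}
reference_taxonomy = {"r1": "k__A; g__X", "r2": "k__A; g__Y", "r3": "k__B"}
queries = {"q1": "ACGT", "q2": "TTTT"}

print(hybrid_classify(queries, reference_reads, reference_taxonomy,
                      fallback=lambda seq: "k__Fallback"))
```

Here q1 matches two references exactly and receives their shared lineage, while q2 has no exact match and is passed to the fallback.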

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. Both will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. auto will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: Forward primer sequence (5' -> 3').[required]
r_primer: Str: Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
trim_right: Int: trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: The read is truncated to trunc_len nucleotides if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
identity: Float: Minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
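
The order of operations described by the parameters above (max_length check, then trim_right, trunc_len, trim_left, and finally the min_length check) can be sketched as follows. This is an illustration of the documented order only, not the plugin's implementation; the function name `trim_amplicon` is hypothetical and primer matching is left out entirely.

```python
def trim_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                  min_length=50, max_length=0):
    """Apply the documented trimming order to one extracted amplicon.

    Returns the trimmed sequence, or None if the amplicon is discarded.
    Hypothetical helper for illustration only.
    """
    # max_length is applied before trimming and truncation (0 disables it).
    if max_length > 0 and len(seq) > max_length:
        return None
    # 1. Remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # 2. Cut the read to trunc_len nucleotides.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # 3. Remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # min_length is applied after trimming and truncation (0 disables it).
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


amplicon = "ACGT" * 25  # 100 nucleotides
trimmed = trim_amplicon(amplicon, trim_right=10, trunc_len=80, trim_left=5)
print(len(trimmed))
```

Because max_length is checked first, a 100-nucleotide amplicon is discarded with max_length=90 even if trunc_len would have shortened it to 60.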

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']
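
One way to read min_consensus is as a per-rank majority vote over the taxonomy labels of a query's hits. The sketch below illustrates that reading; it is not the plugin's implementation. It assumes semicolon-delimited taxonomy strings, the function name `consensus_annotation` is hypothetical, and returning a score of 0.0 when no consensus is found is a choice made for this sketch.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return (consensus taxonomy, fraction of hits supporting it).

    Hypothetical helper for illustration only.
    """
    if not annotations:
        return unassignable_label, 0.0
    lineages = [tuple(rank.strip() for rank in a.split(";"))
                for a in annotations]
    total = len(lineages)
    consensus, score = (), 0.0
    for depth in range(1, max(len(l) for l in lineages) + 1):
        # Count how many hits share each lineage prefix at this depth.
        counts = Counter(l[:depth] for l in lineages if len(l) >= depth)
        prefix, count = counts.most_common(1)[0]
        if count / total < min_consensus:
            break  # no sufficient agreement at this rank; stop here
        consensus, score = prefix, count / total
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), score


hits = [
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Listeria",
    "k__Bacteria; p__Proteobacteria; g__Escherichia",
]
print(consensus_annotation(hits))
```

Since min_consensus must be greater than 0.5, at most one label can reach the threshold at any rank, so the accepted prefix at each depth always extends the one accepted at the previous depth.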

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
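
The note on maxaccepts distinguishes "the first N hits that pass the threshold" from "the top N hits". The toy comparison below makes the difference concrete. It is a simplified model of the behavior described above, not BLAST's search algorithm; the function names and identity values are invented for illustration.

```python
def first_n_hits(reference_hits, perc_identity=0.8, maxaccepts=10):
    """Keep the first N passing hits in database order (the described behavior)."""
    kept = []
    for subject_id, identity in reference_hits:
        if identity >= perc_identity:
            kept.append((subject_id, identity))
            if len(kept) == maxaccepts:
                break
    return kept


def top_n_hits(reference_hits, perc_identity=0.8, maxaccepts=10):
    """Keep the N best passing hits (what a reader might wrongly expect)."""
    passing = [hit for hit in reference_hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


# Hypothetical (subject_id, identity) pairs, listed in database order.
database_order = [("ref1", 0.85), ("ref2", 0.70), ("ref3", 0.99), ("ref4", 0.90)]
print(first_n_hits(database_order, maxaccepts=2))
print(top_n_hits(database_order, maxaccepts=2))
```

With maxaccepts=2, the first-N strategy keeps ref1 and ref3, while a true top-N ranking would keep ref3 and ref4. A larger maxaccepts makes it less likely that better hits later in the database are missed.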

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity; otherwise this option will be ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
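
The interplay between maxaccepts and maxrejects described above can be sketched as a loop over targets sorted by shared k-mer count. This is a simplified model for illustration, not VSEARCH's implementation: identities are supplied up front rather than computed by alignment, and the function name and data are invented.

```python
def search_targets(targets, perc_identity=0.8, maxaccepts=10, maxrejects="all"):
    """Return accepted target IDs for one query.

    targets: (target_id, shared_kmers, identity) tuples.
    Hypothetical helper for illustration only.
    """
    # Targets are examined by decreasing number of shared k-mers,
    # which serves as a proxy for sequence similarity.
    ordered = sorted(targets, key=lambda t: t[1], reverse=True)
    accepted, rejected = [], 0
    for target_id, _shared_kmers, identity in ordered:
        if identity >= perc_identity:
            accepted.append(target_id)
            if maxaccepts != "all" and len(accepted) >= maxaccepts:
                break  # enough hits accepted
        else:
            rejected += 1
            if maxrejects != "all" and rejected >= maxrejects:
                break  # too many non-matching targets seen
    return accepted


targets = [("t1", 40, 0.97), ("t2", 38, 0.75), ("t3", 35, 0.92), ("t4", 10, 0.99)]
print(search_targets(targets, maxaccepts=1))
print(search_targets(targets, maxaccepts="all", maxrejects="all"))
```

In this toy data, t4 has the highest identity but the fewest shared k-mers, so it is only found when the search is allowed to continue through the whole list.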

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
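
The search_results output can be inspected as tab-separated text. The sketch below assumes the standard 12-column BLAST tabular layout (outfmt 6), in which pident is a percentage from 0 to 100, whereas perc_identity in this plugin is a fraction from 0 to 1. The function name `hits_by_query` and the example rows are invented for illustration.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def hits_by_query(blast6_text, perc_identity=0.8):
    """Group subject IDs by query, keeping hits at or above perc_identity.

    Hypothetical helper for illustration only.
    """
    hits = defaultdict(list)
    for line in blast6_text.strip().splitlines():
        row = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        # pident is a percentage; perc_identity is a fraction.
        if float(row["pident"]) / 100.0 >= perc_identity:
            hits[row["qseqid"]].append(row["sseqid"])
    return dict(hits)


example = (
    "q1\tref1\t98.5\t250\t3\t0\t1\t250\t1\t250\t1e-120\t440\n"
    "q1\tref2\t79.0\t250\t52\t0\t1\t250\t1\t250\t1e-40\t180\n"
    "q2\tref3\t91.2\t250\t22\t0\t1\t250\t1\t250\t1e-90\t350\n"
)
print(hits_by_query(example))
```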

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity; otherwise this option will be ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
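
The top_hits_only description says that multiple equally scored top hits are all kept and then used for the consensus assignment. A minimal sketch of that tie handling follows, assuming hits are available as (subject_id, identity) pairs; the function name `top_hits` and the data are invented for illustration.

```python
def top_hits(hits):
    """Return the subject IDs that share the highest identity for one query.

    Hypothetical helper for illustration only.
    """
    if not hits:
        return []
    best = max(identity for _subject_id, identity in hits)
    # Every hit tied for the best identity is retained, not just one.
    return [subject_id for subject_id, identity in hits if identity == best]


query_hits = [("ref1", 0.99), ("ref2", 0.97), ("ref3", 0.99)]
print(top_hits(query_hits))
```

Here ref1 and ref3 are tied, so both would contribute their taxonomy labels to the consensus step, where min_consensus decides how deep the shared assignment goes.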

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
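
The "auto" setting of reads_per_batch is described as min(number of query sequences / threads, 20000). A small sketch of that arithmetic follows. The integer division, the floor of one read per batch, and resolving threads=0 through the CPU count are assumptions made here; the function name `auto_reads_per_batch` is hypothetical.

```python
import os


def auto_reads_per_batch(n_queries, threads=1, cap=20000):
    """Sketch of the documented autoscaling rule for reads_per_batch."""
    if threads == 0:
        # Assumption: 0 means one thread per available CPU.
        threads = os.cpu_count() or 1
    # Assumption: integer division, with at least one read per batch.
    return max(1, min(n_queries // threads, cap))


print(auto_reads_per_batch(100000, threads=4))
print(auto_reads_per_batch(30000, threads=4))
```

With 100000 queries on 4 threads the per-thread share is 25000, so the cap of 20000 applies; with 30000 queries the batch size is 7500.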

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
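
How deep each classification goes depends on the confidence parameter described above. One way to picture the effect is sketched below: starting from the most probable class, the taxonomy is shortened rank by rank until the summed probability of all classes sharing that prefix reaches the threshold. This is an illustrative model only and is not taken from the plugin's source; the function name, the placeholder label, and the probabilities are invented.

```python
def limit_taxonomic_depth(class_probabilities, confidence=0.7,
                          unclassified_label="Unassigned"):
    """Return (taxonomy, confidence) for one read.

    class_probabilities: {semicolon-delimited taxonomy: probability}.
    Hypothetical helper; unclassified_label is a placeholder for this sketch.
    """
    top = max(class_probabilities, key=class_probabilities.get)
    if confidence == "disable":
        # No confidence calculation: report the most probable class as is.
        return top, None
    ranks = {taxon: [r.strip() for r in taxon.split(";")]
             for taxon in class_probabilities}
    top_ranks = ranks[top]
    # Walk from the deepest rank towards the root.
    for depth in range(len(top_ranks), 0, -1):
        prefix = top_ranks[:depth]
        support = sum(p for taxon, p in class_probabilities.items()
                      if ranks[taxon][:depth] == prefix)
        if support >= confidence:
            return "; ".join(prefix), support
    return unclassified_label, None


probabilities = {
    "k__Bacteria; p__Firmicutes; g__Bacillus": 0.5,
    "k__Bacteria; p__Firmicutes; g__Listeria": 0.25,
    "k__Bacteria; p__Proteobacteria; g__Escherichia": 0.25,
}
print(limit_taxonomic_depth(probabilities, confidence=0.7))
print(limit_taxonomic_depth(probabilities, confidence=0))
```

In this model a threshold of 0 still computes a confidence value but never shortens the taxonomy, which matches the documented distinction between 0 and "disable".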

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. Both will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. auto will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: forward primer sequence (5' -> 3').[required]
r_primer: Str: reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
trim_right: Int: trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
identity: Float: minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of seperate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format[required]
reference_taxonomy: FeatureData[Taxonomy]: reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to FALSE to mirror default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in pair with maxaccepts (see maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
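
The way perc_identity, maxaccepts, maxrejects, and weak_id interact can be pictured with a small simulation. This is only a sketch, not VSEARCH's implementation: the function `search` is hypothetical, candidates are given as precomputed identities in the order they would be examined, and two behaviours are assumptions (a weak hit still counts toward maxrejects, and a weak_id of 0.0 means "off").

```python
def search(identities, perc_identity, maxaccepts, maxrejects, weak_id=0.0):
    """Walk candidate identities in order; return (accepted, weak) hits."""
    accepted, weak, rejects = [], [], 0
    # weak_id is ignored unless it is below perc_identity.
    # Assumption: the default of 0.0 disables weak-hit reporting.
    weak_active = 0.0 < weak_id < perc_identity
    for identity in identities:
        if identity >= perc_identity:
            accepted.append(identity)
            if len(accepted) >= maxaccepts:
                break  # enough accepted hits: the search stops
            continue
        # Assumption: a candidate below perc_identity counts as a reject
        # even when it is also reported as a weak hit.
        rejects += 1
        if weak_active and identity >= weak_id:
            weak.append(identity)  # reported, but not counted as an accept
        if rejects >= maxrejects:
            break
    return accepted, weak


accepted, weak = search([0.99, 0.85, 0.97, 0.60, 0.98],
                        perc_identity=0.97, maxaccepts=2,
                        maxrejects=10, weak_id=0.8)
print(accepted, weak)
```

In this toy run the 0.85 candidate is reported as a weak hit while the search still stops after two candidates at or above 0.97.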

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
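
The note on maxaccepts (first N passing hits rather than the top N) is easy to misread, so here is a small illustration. It is a sketch under stated assumptions, not BLAST's algorithm: `first_n_passing` and `top_n` are hypothetical helpers, and hits are represented as (reference ID, identity) pairs listed in database order.

```python
def first_n_passing(hits, perc_identity, maxaccepts):
    """Keep the first `maxaccepts` hits, in database order, that pass."""
    kept = []
    for ref_id, identity in hits:
        if identity >= perc_identity:
            kept.append((ref_id, identity))
            if len(kept) == maxaccepts:
                break
    return kept


def top_n(hits, perc_identity, maxaccepts):
    """For comparison: the `maxaccepts` best-scoring passing hits."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:maxaccepts]


db_order = [("ref1", 0.82), ("ref2", 0.75), ("ref3", 0.90), ("ref4", 0.99)]
print(first_n_passing(db_order, 0.8, 2))  # ref1 and ref3
print(top_n(db_order, 0.8, 2))            # ref4 and ref3
```

With a small maxaccepts, the best match (ref4 here) can be missed entirely, which is worth keeping in mind when interpreting the consensus.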

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percent identity of at least this value, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
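
The maxaccepts description says targets are examined in order of decreasing shared k-mer count with the query. A rough picture of that ordering step follows. It is a sketch, not VSEARCH's code: `kmers` and `rank_targets` are hypothetical names, and the k-mer length is an illustrative choice.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets(query, targets, k):
    """Order target IDs by decreasing number of k-mers shared with the query.

    targets: dict mapping target ID -> sequence.
    """
    query_kmers = kmers(query, k)
    shared = {tid: len(query_kmers & kmers(seq, k))
              for tid, seq in targets.items()}
    return sorted(shared, key=lambda tid: shared[tid], reverse=True)


targets = {"a": "ACGTACGA", "b": "TTTTACGT", "c": "GGGGGGGG"}
print(rank_targets("ACGTACGT", targets, k=3))
```

Only the candidates near the front of this ranking are aligned before maxaccepts or maxrejects ends the search, which is why the shared k-mer count acts as a cheap proxy for similarity.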

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
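
The "auto" rule for reads_per_batch, min(number of query sequences / threads, 20000), can be written out as a few lines of arithmetic. This is a sketch only: `auto_reads_per_batch` is a hypothetical helper, and rounding up, treating threads=0 as "all CPUs", and the floor of one read per batch are assumptions that go beyond the formula quoted above.

```python
import math
import os


def auto_reads_per_batch(n_queries, threads, cap=20000):
    """Evaluate min(n_queries / threads, cap) as a whole batch size."""
    if threads == 0:
        # Assumption: 0 means one thread per available CPU.
        threads = os.cpu_count() or 1
    per_thread = math.ceil(n_queries / threads)  # assumption: round up
    # Assumption: never return less than 1, matching Range(1, None).
    return max(1, min(per_thread, cap))


print(auto_reads_per_batch(100000, 4))  # capped at 20000
print(auto_reads_per_batch(1000, 4))    # 250
```

For large inputs the cap dominates, so adding threads changes the number of batches in flight rather than the batch size.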

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification; "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence; "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]
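
To make the confidence parameter concrete, the sketch below shows one way a threshold can limit taxonomic depth. It is illustrative only and not the plugin's code: `truncate_by_confidence` is a hypothetical function, the per-level confidence values are assumed to be given, the "; " separator is an assumption about the taxonomy string format, and the "Unassigned" fallback is borrowed from the default unassignable_label of the consensus classifiers.

```python
def truncate_by_confidence(levels, level_confidence, confidence=0.7,
                           unassigned="Unassigned"):
    """Keep the deepest run of levels whose confidence meets the threshold.

    levels: taxon names from the highest rank downward.
    level_confidence: one confidence value per level.
    Returns (taxonomy string, confidence of the last level kept).
    """
    if confidence == "disable":
        # No confidence is calculated; the full assignment is kept.
        return "; ".join(levels), None
    kept, kept_conf = [], None
    for name, conf in zip(levels, level_confidence):
        if conf < confidence:
            break  # this level and everything below it is dropped
        kept.append(name)
        kept_conf = conf
    if not kept:
        return unassigned, kept_conf  # assumption: fallback label
    return "; ".join(kept), kept_conf


levels = ["k__Bacteria", "p__Firmicutes", "c__Bacilli"]
print(truncate_by_confidence(levels, [0.99, 0.85, 0.55]))
print(truncate_by_confidence(levels, [0.99, 0.85, 0.55], confidence=0))
```

With the default of 0.7 the class-level call is dropped; with 0 every level is kept and the confidence is still reported.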

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]
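
The parameter names above follow a `step__option` pattern, and several string defaults ('null', '[7, 7]') look like JSON-encoded values. The sketch below shows how such a flat parameter set could be grouped by step and decoded. It is an assumption-laden illustration, not the plugin's code: `build_spec` is hypothetical, and whether the plugin decodes strings exactly this way is not stated in this reference.

```python
import json


def build_spec(params):
    """Group 'step__option' names by step and JSON-decode string values."""
    spec = {}
    for name, value in params.items():
        step, sep, option = name.partition("__")
        if not sep:
            continue  # e.g. 'verbose' does not belong to a step
        if isinstance(value, str):
            try:
                value = json.loads(value)  # 'null' -> None, '[7, 7]' -> [7, 7]
            except json.JSONDecodeError:
                pass  # plain strings such as 'char_wb' stay as they are
        spec.setdefault(step, {})[option] = value
    return spec


defaults = {
    "classify__alpha": 0.001,
    "classify__class_prior": "null",
    "feat_ext__ngram_range": "[7, 7]",
    "feat_ext__analyzer": "char_wb",
    "verbose": False,
}
print(build_spec(defaults))
```

Read this way, the defaults describe a feature-extraction step working on 7-character n-grams followed by a classification step with a smoothing value of 0.001.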

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: Forward primer sequence (5' -> 3').[required]
r_primer: Str: Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
trim_right: Int: trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: The read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
identity: Float: Minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]
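
The order in which the trimming and length filters apply matters, so it may help to see it spelled out. The sketch below follows the order stated in the parameter descriptions (max_length, then trim_right, trunc_len, trim_left, then min_length). `trim_read` and `extract` are hypothetical names, and primer matching is left out entirely.

```python
def trim_read(seq, trim_right=0, trunc_len=0, trim_left=0):
    """Apply trim_right, then trunc_len, then trim_left."""
    if trim_right > 0:
        seq = seq[:-trim_right]  # drop bases from the 3' end
    if trunc_len > 0:
        seq = seq[:trunc_len]    # cut the read to trunc_len
    if trim_left > 0:
        seq = seq[trim_left:]    # drop bases from the 5' end
    return seq


def extract(amplicon, min_length=50, max_length=0, **trim):
    """Return the trimmed read, or None if a length filter discards it."""
    # max_length is checked before trimming and truncation.
    if max_length > 0 and len(amplicon) > max_length:
        return None
    read = trim_read(amplicon, **trim)
    # min_length is checked after trimming and truncation.
    if min_length > 0 and len(read) < min_length:
        return None
    return read


print(trim_read("ACGTACGTAC", trim_right=2, trunc_len=6, trim_left=1))
print(extract("A" * 60, min_length=50, trim_left=15))
```

The second call returns None: a 60-nt amplicon passes on its own, but trimming 15 nt leaves 45, which falls under the default min_length of 50.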

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]
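
A minimal sketch of how min_consensus might be applied across a query's hits is shown below. It is not the plugin's implementation: `consensus` is a hypothetical function, "top hit" is interpreted as the most frequent lineage at each depth, the "; " separator is an assumption about the taxonomy string format, and the score returned for an unassignable query is an arbitrary choice.

```python
from collections import Counter


def consensus(annotations, min_consensus=0.51,
              unassignable_label="Unassigned"):
    """Return (consensus taxonomy, fraction of hits supporting it)."""
    lineages = [tuple(a.split("; ")) for a in annotations]
    best, score = (), 0.0
    max_depth = max((len(lin) for lin in lineages), default=0)
    for depth in range(1, max_depth + 1):
        # Count lineage prefixes so deeper levels stay consistent
        # with the levels already accepted.
        counts = Counter(lin[:depth] for lin in lineages
                         if len(lin) >= depth)
        prefix, count = counts.most_common(1)[0]
        fraction = count / len(lineages)
        if fraction < min_consensus:
            break  # no sufficient agreement at this depth
        best, score = prefix, fraction
    if not best:
        return unassignable_label, score
    return "; ".join(best), score


hits = [
    "k__Bacteria; p__Firmicutes; g__A",
    "k__Bacteria; p__Firmicutes; g__B",
    "k__Bacteria; p__Proteobacteria; g__C",
]
print(consensus(hits))
```

Here two of three hits agree down to the phylum, which clears the default 0.51, while no genus reaches it, so the assignment stops at the phylum.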

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
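
To show how perc_identity and query_cov relate to a row of tabular search results, the sketch below checks one line against both thresholds. It assumes the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), with pident on a 0-100 scale. `passes` is a hypothetical helper, and query lengths must be supplied separately because that layout does not include them.

```python
def passes(blast6_line, query_lengths, perc_identity=0.8, query_cov=0.8):
    """Check one tab-separated hit against the identity and coverage cutoffs."""
    fields = blast6_line.rstrip("\n").split("\t")
    qseqid = fields[0]
    pident = float(fields[2])                 # percent, 0-100
    qstart, qend = int(fields[6]), int(fields[7])
    # Coverage of the query by this single high-scoring pair.
    coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
    # The parameters are fractions (0-1), so rescale pident.
    return pident / 100 >= perc_identity and coverage >= query_cov


line = "q1\tref9\t97.5\t200\t5\t0\t1\t200\t10\t209\t1e-100\t350"
print(passes(line, {"q1": 250}))
print(passes(line, {"q1": 250}, query_cov=0.9))
```

The hit covers 200 of 250 query positions, so it passes at the default query_cov of 0.8 but not at 0.9.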

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percent identity of at least this value, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
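
The top_hits_only behaviour, keeping the highest-identity hit and any ties with it, can be sketched in a few lines. `top_hits` is a hypothetical helper operating on (target ID, identity) pairs; it illustrates the description above and is not VSEARCH's own logic.

```python
def top_hits(hits):
    """Keep every hit whose identity equals the best identity observed."""
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return [hit for hit in hits if hit[1] == best]


hits = [("refA", 0.99), ("refB", 0.97), ("refC", 0.99)]
print(top_hits(hits))
```

Because refA and refC tie, both survive and both would feed into a consensus assignment, provided maxaccepts is greater than 1.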

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
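
The output_no_hits warning comes down to a bookkeeping problem: with the option set to False, queries without hits are absent from the classification, so other data keyed by feature ID must be filtered to match. The sketch below shows that check in plain Python. `split_by_classification` is a hypothetical helper working on lists of IDs, not a QIIME 2 API.

```python
def split_by_classification(feature_ids, classified_ids):
    """Separate feature IDs into those with and without a classification."""
    classified = set(classified_ids)
    kept = [fid for fid in feature_ids if fid in classified]
    missing = [fid for fid in feature_ids if fid not in classified]
    return kept, missing


table_ids = ["f1", "f2", "f3", "f4"]
taxonomy_ids = ["f1", "f3"]  # f2 and f4 had no hits and were not reported
kept, missing = split_by_classification(table_ids, taxonomy_ids)
print(kept, missing)
```

Any ID that ends up in `missing` would need to be removed from the sequences and feature table before steps that join them with the taxonomy.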

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percent identity of at least this value, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
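
The search_exact note about trimming query and reference reads to the same locus is easier to see with a toy example. The sketch below treats exact full-length matching as a dictionary lookup on the whole sequence. `exact_matches` is a hypothetical helper, not VSEARCH's implementation.

```python
def exact_matches(queries, references):
    """Map each query ID to the reference IDs with an identical sequence."""
    index = {}
    for ref_id, seq in references.items():
        index.setdefault(seq, []).append(ref_id)
    return {qid: index.get(seq, []) for qid, seq in queries.items()}


references = {
    "r1": "ACGTACGT",       # trimmed to the same region as the queries
    "r2": "TTGACGTACGTAA",  # same region, but with untrimmed flanks
}
queries = {"q1": "ACGTACGT", "q2": "ACGTACG"}
print(exact_matches(queries, references))
```

q1 matches r1 but not r2, even though r2 contains the same region, and q2 matches nothing because it is one base shorter. This is why both sets must cover exactly the same locus.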

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in pre-trained sklearn classifier. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. Both will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence."auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
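
The "auto" rule for reads_per_batch can be sketched in a few lines of Python. This is an illustrative helper, not the plugin's code: the function names are hypothetical, and the use of integer division and a floor of one read per batch are assumptions.

```python
import os

MAX_READS_PER_BATCH = 20000  # cap quoted in the reads_per_batch description


def resolve_reads_per_batch(reads_per_batch, n_reads, n_jobs):
    """Return a concrete batch size (hypothetical helper, not plugin code)."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # n_jobs == 0 is documented as "all CPUs are used".
    workers = n_jobs if n_jobs > 0 else (os.cpu_count() or 1)
    # min(number of query sequences / n_jobs, 20000); the floor of 1 and
    # the integer division are assumptions made for this sketch.
    return max(1, min(n_reads // workers, MAX_READS_PER_BATCH))


def batches(reads, size):
    """Yield consecutive slices holding at most `size` reads each."""
    for start in range(0, len(reads), size):
        yield reads[start:start + size]
```

With 1,000 query sequences and n_jobs=4 this gives batches of 250 reads; with a million sequences the 20000 cap applies.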

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]
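
The feat_ext__ defaults above (ngram_range '[7, 7]', n_features 8192, lowercase True) suggest that each sequence is turned into overlapping 7-mers that are hashed into a fixed number of feature buckets. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not the plugin's feature extractor: crc32 stands in for whatever hash function is really used, and the word-boundary padding implied by the 'char_wb' analyzer is ignored.

```python
import zlib
from collections import Counter


def char_kmers(sequence, k=7):
    """Return every overlapping k-mer of a sequence, lowercased."""
    seq = sequence.lower()  # mirrors feat_ext__lowercase = True
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hashed_kmer_counts(sequence, k=7, n_features=8192):
    """Count k-mers per hash bucket (crc32 is a stand-in hash)."""
    counts = Counter()
    for kmer in char_kmers(sequence, k):
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] += 1
    return counts
```

A sequence shorter than k yields no k-mers and therefore an empty feature vector.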

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: Forward primer sequence (5' -> 3').[required]
r_primer: Str: Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
trim_right: Int: If positive, trim_right nucleotides are removed from the 3' end. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: If positive, the read is cut to trunc_len nucleotides. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: If positive, trim_left nucleotides are removed from the 5' end. Applied after trim_right and trunc_len.[default: 0]
identity: Float: Minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
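
The order in which the length-related parameters apply is easy to misread, so here is a minimal sketch of it. The function is hypothetical and operates on an amplicon that has already been extracted between the primers; only the ordering (max_length, then trim_right, trunc_len, trim_left, then min_length) is taken from the descriptions above.

```python
def trim_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                  min_length=50, max_length=0):
    """Return the trimmed amplicon, or None if a length filter discards it."""
    # max_length is checked before any trimming; 0 disables the filter.
    if max_length > 0 and len(seq) > max_length:
        return None
    if trim_right > 0:   # remove nucleotides from the 3' end first
        seq = seq[:-trim_right]
    if trunc_len > 0:    # then cut the read to trunc_len
        seq = seq[:trunc_len]
    if trim_left > 0:    # finally remove nucleotides from the 5' end
        seq = seq[trim_left:]
    # min_length is checked after trimming; 0 disables the filter.
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq
```

Because min_length is evaluated last, aggressive trimming can push an otherwise acceptable amplicon below the threshold and cause it to be discarded.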

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']
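
One way to picture min_consensus is a rank-by-rank majority vote over the taxonomy strings of a query's hits. The sketch below is an illustration under assumptions, not the plugin's algorithm: the function name, the semicolon-delimited input format, and the way the agreement fraction is reported are all hypothetical.

```python
from collections import Counter


def consensus_taxonomy(annotations, min_consensus=0.51,
                       unassignable_label="Unassigned"):
    """Return (consensus label, agreement fraction) for one query."""
    if not annotations:
        return unassignable_label, 0.0
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    consensus, fraction = [], 0.0
    for depth in range(max(len(ranks) for ranks in split)):
        # Count each lineage prefix that reaches this depth.
        prefixes = Counter(tuple(ranks[:depth + 1])
                           for ranks in split if len(ranks) > depth)
        prefix, count = prefixes.most_common(1)[0]
        share = count / len(split)
        if share < min_consensus:
            break  # agreement too low: stop at the previous rank
        consensus, fraction = list(prefix), share
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), fraction
```

Since min_consensus must be greater than 0.5, at most one lineage can pass the threshold at any rank, so the vote never has to break a tie between two accepted labels.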

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
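
The note on maxaccepts distinguishes "the first N hits that pass the threshold" from "the top N hits". The toy sketch below makes that difference concrete. It assumes identities have already been computed for each database entry and uses hypothetical function names; it does not call BLAST.

```python
def first_n_hits(database, perc_identity, maxaccepts):
    """Keep the first `maxaccepts` entries, in database order, that pass."""
    hits = []
    for ref_id, identity in database:
        if identity >= perc_identity:
            hits.append((ref_id, identity))
            if len(hits) == maxaccepts:
                break
    return hits


def top_n_hits(database, perc_identity, maxaccepts):
    """Keep the `maxaccepts` best-scoring passing entries (for contrast)."""
    passing = [hit for hit in database if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


# Toy database: (reference ID, identity to the query), in database order.
database = [("ref1", 0.85), ("ref2", 0.70), ("ref3", 0.99), ("ref4", 0.92)]
print(first_n_hits(database, 0.8, 2))  # ref1 and ref3
print(top_n_hits(database, 0.8, 2))    # ref3 and ref4
```

In this example a small maxaccepts misses ref4 even though it is a better match than ref1, which is why a larger maxaccepts or a stricter perc_identity may be worth considering.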

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits with similarity above perc_identity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least weak_id, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
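
The interplay of maxaccepts and maxrejects described above can be sketched as a simple loop: rank the references by shared k-mers, compare them to the query in that order, and stop once either limit is reached. Everything here is a toy under stated assumptions. The helper names are hypothetical, the identity function compares positions without aligning, and the k-mer length of 4 is arbitrary; none of it is VSEARCH's implementation.

```python
def kmer_set(seq, k=4):
    """Distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(a, b):
    """Matching positions over the longer length (no real alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def search(query, references, perc_identity=0.8,
           maxaccepts=10, maxrejects="all"):
    """Return accepted (reference ID, identity) pairs for one query."""
    accept_limit = len(references) if maxaccepts == "all" else maxaccepts
    reject_limit = len(references) if maxrejects == "all" else maxrejects
    # Shared k-mer count serves as a cheap proxy for similarity.
    ranked = sorted(references.items(),
                    key=lambda item: len(kmer_set(query) & kmer_set(item[1])),
                    reverse=True)
    accepted, rejected = [], 0
    for ref_id, seq in ranked:
        if len(accepted) >= accept_limit or rejected >= reject_limit:
            break  # enough hits found, or too many failures seen
        score = toy_identity(query, seq)
        if score >= perc_identity:
            accepted.append((ref_id, score))
        else:
            rejected += 1
    return accepted
```

Setting both limits to "all" makes the loop visit every reference, which matches the statement that the complete database is then searched.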

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
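
For readers who export search_results and want to inspect it, the sketch below parses tab-separated hit lines. It assumes the standard 12-column layout of BLAST tabular output (format 6); how the artifact is exported to a text file, and how queries without hits are encoded, are not covered here. The function name and the sample line are made up for illustration.

```python
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(text):
    """Parse BLAST6-style lines into dictionaries (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        # Convert the numeric columns used most often for filtering.
        for column in ("pident", "evalue", "bitscore"):
            row[column] = float(row[column])
        rows.append(row)
    return rows


# Made-up example line, not real search output.
sample = "q1\tref3\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
for hit in parse_blast6(sample):
    print(hit["qseqid"], hit["sseqid"], hit["pident"])
```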

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits with similarity above perc_identity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least weak_id, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported through weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits with similarity above perc_identity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if maxaccepts is 1 and the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
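
The sample_size and randseed parameters describe seeded random subsampling of the reference database for the prefilter. A minimal sketch of that behaviour follows, assuming Python's random module as a stand-in for whatever generator the pipeline actually uses. The function name is hypothetical, and sorting the IDs first is a choice made here so that a given seed is reproducible regardless of input order.

```python
import random


def sample_reference(ref_ids, sample_size=1000, randseed=0):
    """Pick up to `sample_size` reference IDs (hypothetical helper)."""
    # randseed == 0 is documented as "use a pseudo-random seed", so the
    # generator is left unseeded in that case.
    rng = random.Random(randseed if randseed != 0 else None)
    ids = sorted(ref_ids)  # fixed starting order for reproducibility
    return rng.sample(ids, min(sample_size, len(ids)))


reference_ids = ["ref%d" % i for i in range(50)]
first = sample_reference(reference_ids, sample_size=5, randseed=42)
second = sample_reference(reference_ids, sample_size=5, randseed=42)
print(first == second)  # the same non-zero seed gives the same sample
```

A non-zero randseed is therefore the setting to use when a run needs to be repeated exactly.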

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support:: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations:: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. Both will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. auto will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]
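
Several of the Str defaults above ('null', '[7, 7]') look like JSON-encoded values rather than plain text. The listing does not say how these strings are interpreted, so the following is only a sketch under the assumption that each Str value is parsed as JSON where possible and otherwise kept as a plain string. The helper name `decode_param` is hypothetical.

```python
import json


def decode_param(value):
    """Parse a Str parameter as JSON, falling back to the raw string."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value  # plain strings such as 'char_wb' are kept unchanged


# Defaults copied from the parameter listing above.
str_defaults = {
    "classify__class_prior": "null",
    "feat_ext__ngram_range": "[7, 7]",
    "feat_ext__analyzer": "char_wb",
}

for name, raw in str_defaults.items():
    print(f"{name}: {raw!r} -> {decode_param(raw)!r}")
```

Under this assumption, 'null' becomes Python's None and '[7, 7]' becomes a two-element list, which is the form an n-gram range would need in order to describe 7-mers only.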

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: Forward primer sequence (5' -> 3').[required]
r_primer: Str: Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
trim_right: Int: If positive, trim_right nucleotides are removed from the 3' end. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: If positive, the read is cut to trunc_len nucleotides. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: If positive, trim_left nucleotides are removed from the 5' end. Applied after trim_right and trunc_len.[default: 0]
identity: Float: Minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
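
The order in which the length-related parameters apply is easy to misread, so the following sketch spells it out: max_length first, then trim_right, trunc_len, and trim_left, and finally min_length. It operates on an amplicon that has already been extracted; primer matching is out of scope. The function name `trim_amplicon` is hypothetical, and this is an illustration of the documented order, not the plugin's code.

```python
def trim_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                  min_length=50, max_length=0):
    """Return the trimmed amplicon, or None if a length filter discards it."""
    # max_length is checked before any trimming or truncation.
    if max_length > 0 and len(seq) > max_length:
        return None
    # 1. Remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # 2. Cut the read to trunc_len nucleotides.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # 3. Remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # min_length is checked after trimming and truncation.
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


amplicon = "ACGT" * 30  # 120 nucleotides
trimmed = trim_amplicon(amplicon, trim_right=10, trunc_len=100, trim_left=20)
print(len(trimmed))
```

Here the 120-nucleotide amplicon shrinks to 110, then 100, then 80 nucleotides. Raising min_length above 80 would discard it, even though the untrimmed amplicon was long enough.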

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']
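
One plausible reading of min_consensus is sketched below: starting at the deepest rank, take the most common lineage among the hits and accept it if it is shared by at least min_consensus of them; otherwise back off one rank and try again. This is an assumption about the behavior, not a transcription of the plugin's algorithm, and the function name `consensus_annotation` and the example lineages are hypothetical.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return (consensus label, supporting fraction) for taxonomy strings."""
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    if not split:
        return unassignable_label, 0.0
    max_depth = max(len(ranks) for ranks in split)
    # Walk from the deepest rank towards the root and stop at the first
    # depth where one lineage prefix is shared by enough annotations.
    for depth in range(max_depth, 0, -1):
        prefixes = Counter(tuple(ranks[:depth]) for ranks in split
                           if len(ranks) >= depth)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / len(split)
        if fraction >= min_consensus:
            return "; ".join(prefix), fraction
    return unassignable_label, 0.0


hits = [
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Listeria",
    "k__Bacteria; p__Proteobacteria; g__Escherichia",
]
print(consensus_annotation(hits))
```

In this example the genus level has only 2 of 4 hits in agreement (0.5, below the 0.51 default), so the sketch falls back to the phylum, where 3 of 4 hits agree.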

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
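
The note on maxaccepts (first N passing hits versus top N hits) can be illustrated with a toy example. The sketch below does not call BLAST; it takes a hypothetical list of (reference ID, identity) pairs in database order and keeps the first maxaccepts entries that reach perc_identity. The function name `first_n_hits` and the data are made up for illustration.

```python
def first_n_hits(db_identities, perc_identity=0.8, maxaccepts=10):
    """Keep the first maxaccepts entries that pass, in database order."""
    hits = []
    for ref_id, identity in db_identities:
        if identity >= perc_identity:
            hits.append((ref_id, identity))
            if len(hits) == maxaccepts:
                break  # later, possibly better, entries are never examined
    return hits


# Hypothetical identities, listed in database order (not sorted by similarity).
db = [("ref1", 0.85), ("ref2", 0.70), ("ref3", 0.82),
      ("ref4", 0.99), ("ref5", 0.97)]

first_two = first_n_hits(db, perc_identity=0.8, maxaccepts=2)
top_two = sorted(db, key=lambda hit: hit[1], reverse=True)[:2]

print("first 2 passing hits:", first_two)
print("top 2 hits by identity:", top_two)
```

The two lists differ: the first two passing entries are ref1 and ref3, while the two most similar references are ref4 and ref5. A larger maxaccepts reduces the chance of missing the best references.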

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity similarity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least weak_id, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
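
The interplay of maxaccepts and maxrejects described above can be sketched as a simple loop. This is a toy model, not VSEARCH itself: each target carries a precomputed shared k-mer count and identity (both hypothetical), whereas the real tool computes identity by pairwise alignment. The function name `search` is also hypothetical.

```python
def search(targets, perc_identity=0.8, maxaccepts=10, maxrejects="all"):
    """targets: list of (ref_id, shared_kmers, identity) tuples."""
    # Sort by decreasing shared k-mer count, the proxy for similarity.
    ordered = sorted(targets, key=lambda t: t[1], reverse=True)
    accepted, rejected = [], 0
    for ref_id, _, identity in ordered:
        if identity >= perc_identity:
            accepted.append(ref_id)
        else:
            rejected += 1
        # Stop as soon as either limit is reached; "all" disables a limit.
        if maxaccepts != "all" and len(accepted) >= maxaccepts:
            break
        if maxrejects != "all" and rejected >= maxrejects:
            break
    return accepted


targets = [("a", 40, 0.95), ("b", 38, 0.70), ("c", 35, 0.90),
           ("d", 20, 0.60), ("e", 10, 0.85)]

print(search(targets, maxaccepts=1))
print(search(targets, maxaccepts=10, maxrejects=2))
print(search(targets, maxaccepts="all", maxrejects="all"))
```

With maxaccepts=1 the search stops at the first passing target. With maxrejects=2 it stops after the second failing target, so the passing target "e" is never reached. Only when both limits are "all" is every target examined.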

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity similarity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with a percentage of identity of at least weak_id, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity similarity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" classifies reads unchanged; "reverse-complement" reverses and complements reads prior to classification; "both" classifies each sequence unchanged and in reverse-complement and retains the classification with the higher confidence; "auto" autodetects orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
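
Two of the parameters above lend themselves to a short sketch: the "auto" value of reads_per_batch and the seeded random sample used for prefiltering. The listing gives the formula min(number of query sequences / threads, 20000); the integer division and the floor of 1 below are assumptions. The sampling helper uses Python's random module purely as a stand-in to show the effect of randseed and does not reflect how the plugin draws its sample. Both function names are hypothetical.

```python
import random


def auto_reads_per_batch(n_queries, threads=1, cap=20000):
    """Autoscale the batch size as min(n_queries / threads, cap)."""
    # Integer division and the minimum of 1 are assumptions of this sketch.
    return max(1, min(n_queries // threads, cap))


def sample_reference(ref_ids, sample_size=1000, randseed=0):
    """Draw a random subset of reference IDs for prefiltering."""
    # randseed=0 means "use a pseudo-random seed"; any other value
    # makes the draw reproducible.
    rng = random.Random(None if randseed == 0 else randseed)
    return rng.sample(ref_ids, min(sample_size, len(ref_ids)))


print(auto_reads_per_batch(100000, threads=4))  # capped at 20000
print(auto_reads_per_batch(1000, threads=4))    # 250 reads per batch

ref_ids = [f"ref{i}" for i in range(50)]
first = sample_reference(ref_ids, sample_size=5, randseed=42)
second = sample_reference(ref_ids, sample_size=5, randseed=42)
print(first == second)  # same seed, same sample
```

Setting randseed to a fixed non-zero value is what makes a prefiltered run repeatable; with the default of 0, two runs may sample different reference sequences.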

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. Both will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. auto will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: forward primer sequence (5' -> 3').[required]
r_primer: Str: reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
trim_right: Int: trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
identity: Float: minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of seperate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format[required]
reference_taxonomy: FeatureData[Taxonomy]: reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to FALSE to mirror default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id are not counted against maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
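
As an illustration of what the search results contain and how top_hits_only can be read, the following minimal sketch parses BLAST6 text and keeps, for each query, only the hits tied for the highest percent identity. It assumes the standard 12-column tab-separated BLAST tabular layout; the function names and example rows are hypothetical and are not part of the plugin.

```python
import csv
import io
from collections import defaultdict

# Standard BLAST tabular (outfmt 6) column order; assumed for this sketch.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(text):
    """Parse tab-separated BLAST6 text into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(BLAST6_COLUMNS, fields))
        row["pident"] = float(row["pident"])
        rows.append(row)
    return rows


def top_hits(rows):
    """Keep, per query, only the hits tied for the highest percent identity."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qseqid"]].append(row)
    kept = {}
    for query, hits in by_query.items():
        best = max(hit["pident"] for hit in hits)
        kept[query] = [hit["sseqid"] for hit in hits if hit["pident"] == best]
    return kept


# Hypothetical search results for a single query.
example = (
    "q1\tref_a\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
    "q1\tref_b\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
    "q1\tref_c\t95.0\t250\t12\t0\t1\t250\t1\t250\t1e-100\t400\n"
)
print(top_hits(parse_blast6(example)))
```

Both ref_a and ref_b are retained because they tie for the highest identity, which is the case where multiple top hits would feed into consensus assignment.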

feature-classifier classify-consensus-blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
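
The maxaccepts note above distinguishes "the first N hits that pass the threshold" from "the top N hits". The sketch below contrasts the two on a toy list of (reference ID, identity) pairs given in database order. It is only an illustration of that distinction under simplified assumptions; it does not call BLAST, and the function names and data are hypothetical.

```python
def first_n_hits(db_hits, perc_identity, maxaccepts):
    """First `maxaccepts` hits, in database order, that pass the threshold."""
    accepted = []
    for ref_id, identity in db_hits:
        if identity >= perc_identity:
            accepted.append(ref_id)
            if len(accepted) == maxaccepts:
                break
    return accepted


def top_n_hits(db_hits, perc_identity, maxaccepts):
    """Best `maxaccepts` hits by identity among those that pass the threshold."""
    passing = [(ref_id, ident) for ref_id, ident in db_hits if ident >= perc_identity]
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return [ref_id for ref_id, _ in passing[:maxaccepts]]


# Hypothetical hits listed in database order, not sorted by similarity.
hits = [("ref_1", 0.82), ("ref_2", 0.85), ("ref_3", 0.99), ("ref_4", 0.97)]

print(first_n_hits(hits, perc_identity=0.8, maxaccepts=2))  # ['ref_1', 'ref_2']
print(top_n_hits(hits, perc_identity=0.8, maxaccepts=2))    # ['ref_3', 'ref_4']
```

With a small maxaccepts, the better matches ref_3 and ref_4 are never reached in the first case, so raising maxaccepts or perc_identity may be worth considering when closer matches are expected deeper in the database.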

feature-classifier classify-consensus-vsearch¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id are not counted against maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given to sequences without any hits.[default: 'Unassigned']

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
search_results: FeatureData[BLAST6]: Top hits for each query.[required]
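
The interplay of maxaccepts and maxrejects described in the parameters above can be sketched as a simple loop: candidates are visited in order of decreasing shared k-mer count, and the search stops once enough targets have been accepted or rejected. This is a simplified model under stated assumptions, not VSEARCH's implementation; the shared k-mer counts and identities are supplied as precomputed, hypothetical values rather than derived from real alignments.

```python
def heuristic_search(candidates, perc_identity, maxaccepts, maxrejects):
    """Model of an accept/reject search.

    candidates: list of (target_id, shared_kmers, identity) tuples.
    maxaccepts, maxrejects: a positive int, or the string "all".
    """
    # Visit targets with the most shared k-mers first (similarity proxy).
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepts, rejects = [], 0
    for target_id, _shared, identity in ordered:
        if identity >= perc_identity:
            accepts.append(target_id)
            if maxaccepts != "all" and len(accepts) >= maxaccepts:
                break  # enough hits found
        else:
            rejects += 1
            if maxrejects != "all" and rejects >= maxrejects:
                break  # too many failures, give up on this query
    return accepts


# Hypothetical candidates: (target, shared k-mers, alignment identity).
candidates = [
    ("t1", 40, 0.99),
    ("t2", 35, 0.70),
    ("t3", 30, 0.95),
    ("t4", 10, 0.90),
]

print(heuristic_search(candidates, 0.8, maxaccepts=1, maxrejects="all"))
print(heuristic_search(candidates, 0.8, maxaccepts=2, maxrejects=1))
print(heuristic_search(candidates, 0.8, maxaccepts="all", maxrejects="all"))
```

In this toy data, setting maxrejects to 1 stops the search at t2 before t3 is examined, while setting both options to "all" visits every candidate.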

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]
classifier: TaxonomicClassifier: Pre-trained sklearn taxonomic classifier for classifying the reads.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
maxhits: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): <no description>[default: 'all']
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
prefilter: Bool: Toggle positive filter of query sequences on or off.[default: True]
sample_size: Int % Range(1, None): Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
randseed: Int % Range(0, None): Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]

Outputs¶

classification: FeatureData[Taxonomy]: Taxonomy classifications of query sequences.[required]
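
The "auto" setting of reads_per_batch is described above as min(number of query sequences / threads, 20000). The sketch below works that formula through and then splits a list of reads into batches of the resulting size. Rounding up and the lower bound of 1 are assumptions made for this illustration, and the helper names are hypothetical.

```python
import math


def auto_reads_per_batch(n_query, threads, cap=20000):
    """Batch size per the documented rule: min(n_query / threads, cap)."""
    if threads < 1:
        raise ValueError("this sketch expects threads >= 1")
    # Rounding up and the floor of 1 are assumptions of this sketch.
    return min(max(1, math.ceil(n_query / threads)), cap)


def batches(reads, size):
    """Yield consecutive slices of `reads` with at most `size` items each."""
    for start in range(0, len(reads), size):
        yield reads[start:start + size]


reads = [f"read_{i}" for i in range(1000)]
size = auto_reads_per_batch(len(reads), threads=4)
print(size, sum(1 for _ in batches(reads, size)))

# A large query set hits the cap regardless of thread count.
print(auto_reads_per_batch(1_000_000, threads=4))
```

For 1000 reads and 4 threads this gives one batch per thread; for a million reads the cap of 20000 takes over, so each thread processes several batches.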

This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.

version: 2025.10.0.dev0
website: https://github.com/qiime2/q2-feature-classifier
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Bokulich et al., 2018

Actions¶

Name	Type	Short Description
fit-classifier-sklearn	method	Train an almost arbitrary scikit-learn classifier
classify-sklearn	method	Pre-fitted sklearn-based taxonomy classifier
fit-classifier-naive-bayes	method	Train the naive_bayes classifier
extract-reads	method	Extract reads from reference sequences.
find-consensus-annotation	method	Find consensus among multiple annotations.
makeblastdb	method	Make BLAST database.
blast	method	BLAST+ local alignment search.
vsearch-global	method	VSEARCH global alignment search
classify-consensus-blast	pipeline	BLAST+ consensus taxonomy classifier
classify-consensus-vsearch	pipeline	VSEARCH-based consensus taxonomy classifier
classify-hybrid-vsearch-sklearn	pipeline	ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Artifact Classes¶

Formats¶

feature-classifier fit-classifier-sklearn¶

Train a scikit-learn classifier to classify reads.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classifier_specification: Str: <no description>[required]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]

feature-classifier classify-sklearn¶

Classify reads by taxon using a fitted classifier.

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reads: FeatureData[Sequence]: The feature data to be classified.[required]
classifier: TaxonomicClassifier: The taxonomic classifier for classifying the reads.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
pre_dispatch: Str: "all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
read_orientation: Str % Choices('same', 'reverse-complement', 'auto', 'both'): Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "both" will classify sequences unchanged and in reverse-complement and retain the classification with higher confidence. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']

Outputs¶

classification: FeatureData[Taxonomy]: <no description>[required]
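
To make "confidence threshold for limiting taxonomic depth" concrete, the sketch below truncates a ranked taxonomy at the deepest level whose confidence still meets the threshold. It assumes per-level confidence values are already available and do not increase with depth; how the classifier derives those values is not covered here, and the function name and numbers are hypothetical.

```python
def limit_depth(levels, confidences, threshold=0.7):
    """Return (kept levels, confidence at the deepest kept level).

    levels: taxonomy ranks from shallowest to deepest.
    confidences: one value per level (assumed non-increasing with depth).
    """
    kept, reported = [], None
    for level, conf in zip(levels, confidences):
        if conf < threshold:
            break  # stop at the first level that falls below the threshold
        kept.append(level)
        reported = conf
    return kept, reported


# Hypothetical per-level confidences for one read.
levels = ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales"]
confidences = [0.99, 0.95, 0.72, 0.41]

print(limit_depth(levels, confidences, threshold=0.7))
print(limit_depth(levels, confidences, threshold=0))
```

With the default of 0.7 the assignment stops at class level; with a threshold of 0 the full depth is kept while the confidence value is still reported, matching the behaviour described for confidence=0.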

feature-classifier fit-classifier-naive-bayes¶

Create a scikit-learn naive_bayes classifier for reads

Citations¶

Bokulich et al., 2018; Pedregosa et al., 2011

Inputs¶

reference_reads: FeatureData[Sequence]: <no description>[required]
reference_taxonomy: FeatureData[Taxonomy]: <no description>[required]
class_weight: FeatureTable[RelativeFrequency]: <no description>[optional]

Parameters¶

classify__alpha: Float: <no description>[default: 0.001]
classify__chunk_size: Int: <no description>[default: 20000]
classify__class_prior: Str: <no description>[default: 'null']
classify__fit_prior: Bool: <no description>[default: False]
feat_ext__alternate_sign: Bool: <no description>[default: False]
feat_ext__analyzer: Str: <no description>[default: 'char_wb']
feat_ext__binary: Bool: <no description>[default: False]
feat_ext__decode_error: Str: <no description>[default: 'strict']
feat_ext__encoding: Str: <no description>[default: 'utf-8']
feat_ext__input: Str: <no description>[default: 'content']
feat_ext__lowercase: Bool: <no description>[default: True]
feat_ext__n_features: Int: <no description>[default: 8192]
feat_ext__ngram_range: Str: <no description>[default: '[7, 7]']
feat_ext__norm: Str: <no description>[default: 'l2']
feat_ext__preprocessor: Str: <no description>[default: 'null']
feat_ext__stop_words: Str: <no description>[default: 'null']
feat_ext__strip_accents: Str: <no description>[default: 'null']
feat_ext__token_pattern: Str: <no description>[default: '(?u)\\b\\w\\w+\\b']
feat_ext__tokenizer: Str: <no description>[default: 'null']
verbose: Bool: <no description>[default: False]

Outputs¶

classifier: TaxonomicClassifier: <no description>[required]
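
The feat_ext__ defaults above (ngram_range '[7, 7]', n_features 8192, lowercase True) suggest that reads are represented as hashed 7-mer counts. The sketch below illustrates that idea in plain Python. It is only an approximation under stated assumptions: it uses zlib.crc32 as a stand-in hash rather than the hashing used by scikit-learn, it ignores the word-boundary padding implied by the 'char_wb' analyzer, and it applies no normalisation. The function name is hypothetical.

```python
import zlib


def kmer_hash_counts(sequence, k=7, n_features=8192):
    """Count overlapping k-mers, hashed into `n_features` buckets."""
    seq = sequence.lower()  # mirrors feat_ext__lowercase = True
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # crc32 is a stand-in hash chosen for this sketch only.
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


features = kmer_hash_counts("ACGTACGTACGT")
print(len(features), sum(features.values()))
```

A 12-nucleotide read yields six overlapping 7-mers, so the counts always sum to six here even if two k-mers happen to share a bucket.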

feature-classifier extract-reads¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: <no description>[required]

Parameters¶

f_primer: Str: Forward primer sequence (5' -> 3').[required]
r_primer: Str: Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
trim_right: Int: If positive, trim_right nucleotides are removed from the 3' end. Applied before trunc_len and trim_left.[default: 0]
trunc_len: Int: If positive, the read is cut to trunc_len nucleotides. Applied after trim_right but before trim_left.[default: 0]
trim_left: Int: If positive, trim_left nucleotides are removed from the 5' end. Applied after trim_right and trunc_len.[default: 0]
identity: Float: Minimum combined primer match identity threshold.[default: 0.8]
min_length: Int % Range(0, None): Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
max_length: Int % Range(0, None): Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
n_jobs: Int % Range(1, None): Number of separate processes to run.[default: 1]
batch_size: Int % Range(1, None) | Str % Choices('auto'): Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
read_orientation: Str % Choices('both', 'forward', 'reverse'): Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']

Outputs¶

reads: FeatureData[Sequence]: <no description>[required]
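
The order in which trim_right, trunc_len, and trim_left are applied matters for the final read. The sketch below applies them in the documented order (trim_right, then trunc_len, then trim_left) and then checks min_length. It operates on an already extracted amplicon string and does not perform primer matching; the function names are hypothetical.

```python
def trim_read(read, trim_right=0, trunc_len=0, trim_left=0):
    """Apply the three trimming steps in the documented order."""
    if trim_right > 0:
        read = read[:-trim_right]   # remove bases from the 3' end
    if trunc_len > 0:
        read = read[:trunc_len]     # cut to a fixed length
    if trim_left > 0:
        read = read[trim_left:]     # remove bases from the 5' end
    return read


def passes_min_length(read, min_length=50):
    """min_length of zero disables the filter; shorter reads are discarded."""
    return min_length == 0 or len(read) >= min_length


amplicon = "AAACCCGGGTTT"
trimmed = trim_read(amplicon, trim_right=3, trunc_len=6, trim_left=2)
print(trimmed, passes_min_length(trimmed, min_length=4))
```

Here the 12-base amplicon becomes "AAACCCGGG" after trim_right, "AAACCC" after trunc_len, and "ACCC" after trim_left, which shows why min_length should be chosen with the trimming settings in mind.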

feature-classifier find-consensus-annotation¶

Citations¶

Inputs¶

search_results: FeatureData[BLAST6]: Search results in BLAST6 output format[required]
reference_taxonomy: FeatureData[Taxonomy]: Reference taxonomy labels.[required]

Parameters¶

min_consensus: Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True): Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
unassignable_label: Str: Annotation given when no consensus is found.[default: 'Unassigned']

Outputs¶

consensus_taxonomy: FeatureData[Taxonomy]: Consensus taxonomy and scores.[required]
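
One way to picture consensus assignment with min_consensus is a rank-by-rank majority vote over the taxonomies of a query's hits. The sketch below walks down the ranks and stops at the first rank where the most common lineage falls below min_consensus. It is a simplified illustration, not the plugin's code: it assumes ";"-separated rank strings, and the function name and example annotations are hypothetical.

```python
from collections import Counter


def consensus_taxonomy(annotations, min_consensus=0.51,
                       unassignable_label="Unassigned"):
    """Return (consensus lineage, fraction of annotations supporting it)."""
    if not annotations:
        return unassignable_label, 0.0
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    total = len(split)
    consensus, score = [], 0.0
    for depth in range(1, max(len(s) for s in split) + 1):
        # Count lineages truncated to the current depth.
        prefixes = Counter(tuple(s[:depth]) for s in split if len(s) >= depth)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / total
        if fraction < min_consensus:
            break  # no sufficiently supported lineage at this rank
        consensus, score = list(prefix), fraction
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), score


# Hypothetical taxonomies of four hits for one query.
annotations = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
]
print(consensus_taxonomy(annotations))
print(consensus_taxonomy([]))
```

Because min_consensus must exceed 0.5, at most one lineage can pass at each rank, so the accepted lineage at a deeper rank always extends the one accepted above it. In the example, class-level support is only 2 of 4, so the consensus stops at phylum with a score of 0.75.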

feature-classifier makeblastdb¶

Make BLAST database from custom sequence collection.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Input reference sequences.[required]

Outputs¶

database: BLASTDB: Output BLAST database.[required]

feature-classifier blast¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences. Incompatible with blastdb.[optional]
blastdb: BLASTDB: BLAST indexed database. Incompatible with reference_reads.[optional]

Parameters¶

maxaccepts: Int % Range(1, None): Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
strand: Str % Choices('both', 'plus', 'minus'): Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
evalue: Float: BLAST expectation value (E) threshold for saving hits.[default: 0.001]
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
num_threads: Threads: Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]
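
The perc_identity and query_cov thresholds above are fractions between 0 and 1, whereas BLAST tabular output typically reports identity as a percentage. The sketch below shows one way to check a single high-scoring pair against both thresholds. It assumes coverage is the aligned span on the query divided by the query length, which is how per-HSP query coverage is commonly defined; the function name and example values are hypothetical.

```python
def hsp_passes(qstart, qend, query_length, pident,
               perc_identity=0.8, query_cov=0.8):
    """Check one HSP against identity and query coverage thresholds.

    qstart, qend: 1-based alignment coordinates on the query.
    pident: percent identity on a 0-100 scale, as in BLAST tabular output.
    """
    coverage = (abs(qend - qstart) + 1) / query_length
    identity = pident / 100.0  # convert percent to a fraction
    return identity >= perc_identity and coverage >= query_cov


print(hsp_passes(1, 250, query_length=250, pident=99.0))  # full-length, high identity
print(hsp_passes(1, 150, query_length=250, pident=99.0))  # covers only 60% of the query
print(hsp_passes(1, 250, query_length=250, pident=75.0))  # identity below threshold
```

Only the first alignment passes with the defaults; the second is rejected on coverage and the third on identity.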

feature-classifier vsearch-global¶

Citations¶

Inputs¶

query: FeatureData[Sequence]: Query Sequences.[required]
reference_reads: FeatureData[Sequence]: Reference sequences.[required]

Parameters¶

maxaccepts: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to keep for each query. Set to "all" to keep all hits above the perc_identity threshold. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
perc_identity: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if percent identity to query is lower.[default: 0.8]
query_cov: Float % Range(0.0, 1.0, inclusive_end=True): Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
strand: Str % Choices('both', 'plus'): Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
search_exact: Bool: Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
top_hits_only: Bool: Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
maxhits: Int % Range(1, None) | Str % Choices('all'): Maximum number of hits to show once the search is terminated.[default: 'all']
maxrejects: Int % Range(1, None) | Str % Choices('all'): Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
output_no_hits: Bool: Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
weak_id: Float % Range(0.0, 1.0, inclusive_end=True): Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id are not counted against maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
threads: Threads: Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]

Outputs¶

search_results: FeatureData[BLAST6]: Top hits for each query.[required]

feature-classifier classify-hybrid-vsearch-sklearn¶

Citations¶