This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.
- version:
2024.10.0
- website: https://github.com/qiime2/q2-feature-classifier
- user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
- citations:
Bokulich et al., 2018
Actions¶
Name | Type | Short Description |
---|---|---|
fit-classifier-sklearn | method | Train an almost arbitrary scikit-learn classifier |
classify-sklearn | method | Pre-fitted sklearn-based taxonomy classifier |
fit-classifier-naive-bayes | method | Train the naive_bayes classifier |
extract-reads | method | Extract reads from reference sequences. |
find-consensus-annotation | method | Find consensus among multiple annotations. |
makeblastdb | method | Make BLAST database. |
blast | method | BLAST+ local alignment search. |
vsearch-global | method | VSEARCH global alignment search |
classify-consensus-blast | pipeline | BLAST+ consensus taxonomy classifier |
classify-consensus-vsearch | pipeline | VSEARCH-based consensus taxonomy classifier |
classify-hybrid-vsearch-sklearn | pipeline | ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier |
Artifact Classes¶
BLASTDB |
TaxonomicClassifier |
Formats¶
BLASTDBDirFmtV5 |
TaxonomicClassifierDirFmt |
TaxonomicClassiferTemporaryPickleDirFmt |
feature-classifier fit-classifier-sklearn¶
Train a scikit-learn classifier to classify reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classifier_specification:
Str
<no description>[required]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
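The "auto" rule for reads_per_batch and the effect of the confidence threshold can be illustrated with a small stand-alone sketch. It does not call the plugin. The function names, the rounding choice, the use of "Unassigned" as a fallback label, and the example confidence values are all assumptions made for illustration; only the min(number of query sequences / n_jobs, 20000) formula and the meaning of "disable" and 0 come from the parameter descriptions above.

```python
import math
import os


def autoscale_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Hypothetical helper: min(n_queries / n_jobs, cap), rounded up, at least 1."""
    if n_jobs == 0:
        # Per the n_jobs description, 0 means "use all CPUs".
        n_jobs = os.cpu_count() or 1
    per_job = math.ceil(n_queries / n_jobs)  # rounding up is an assumption
    return max(1, min(per_job, cap))


def truncate_by_confidence(ranks, confidences, threshold):
    """Hypothetical helper: keep ranks (root first) while confidence >= threshold."""
    if threshold == "disable":
        # Confidence is not applied at all; the full assignment is reported.
        return ";".join(ranks)
    kept = []
    for rank, conf in zip(ranks, confidences):
        if conf < threshold:
            break
        kept.append(rank)
    # Fallback label is a placeholder choice for this sketch.
    return ";".join(kept) if kept else "Unassigned"


ranks = ["k__Bacteria", "p__Firmicutes", "c__Bacilli"]
confidences = [0.99, 0.85, 0.60]  # made-up values, non-increasing with depth

print(autoscale_reads_per_batch(1000, 4))
print(truncate_by_confidence(ranks, confidences, 0.7))
print(truncate_by_confidence(ranks, confidences, 0))
```

With a threshold of 0 every rank passes the comparison, which matches the documented behaviour of calculating confidence without using it to limit taxonomic depth.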
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default: 0.001]
- classify__chunk_size:
Int
<no description>[default: 20000]
- classify__class_prior:
Str
<no description>[default: 'null']
- classify__fit_prior:
Bool
<no description>[default: False]
- feat_ext__alternate_sign:
Bool
<no description>[default: False]
- feat_ext__analyzer:
Str
<no description>[default: 'char_wb']
- feat_ext__binary:
Bool
<no description>[default: False]
- feat_ext__decode_error:
Str
<no description>[default: 'strict']
- feat_ext__encoding:
Str
<no description>[default: 'utf-8']
- feat_ext__input:
Str
<no description>[default: 'content']
- feat_ext__lowercase:
Bool
<no description>[default: True]
- feat_ext__n_features:
Int
<no description>[default: 8192]
- feat_ext__ngram_range:
Str
<no description>[default: '[7, 7]']
- feat_ext__norm:
Str
<no description>[default: 'l2']
- feat_ext__preprocessor:
Str
<no description>[default: 'null']
- feat_ext__stop_words:
Str
<no description>[default: 'null']
- feat_ext__strip_accents:
Str
<no description>[default: 'null']
- feat_ext__token_pattern:
Str
<no description>[default: '(?u)\\b\\w\\w+\\b']
- feat_ext__tokenizer:
Str
<no description>[default: 'null']
- verbose:
Bool
<no description>[default: False]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
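The feat_ext__* parameters carry no descriptions on this page, but their names match the options of scikit-learn's HashingVectorizer. Under that reading, the defaults describe hashing 7-character n-grams into 8192 buckets and applying L2 normalisation. The sketch below is a rough stdlib-only illustration of that idea, not the plugin's or scikit-learn's implementation: the function name is hypothetical, zlib.crc32 stands in for the real hash function, and the whitespace padding that the 'char_wb' analyzer applies at word boundaries is ignored.

```python
import math
import zlib
from collections import Counter


def kmer_hash_features(seq, k=7, n_features=8192, lowercase=True):
    """Hypothetical helper: hashed, L2-normalised k-mer counts for one sequence."""
    if lowercase:
        seq = seq.lower()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # crc32 is a stand-in hash chosen only because it is in the stdlib.
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] += 1
    norm = math.sqrt(sum(v * v for v in counts.values()))
    # Sequences shorter than k yield no k-mers and therefore no features.
    return {b: v / norm for b, v in counts.items()} if norm else {}


features = kmer_hash_features("ACGTACGTACGTAC")
print(len(features), "non-zero buckets out of 8192")
```

Because the feature space has a fixed width, distinct k-mers can collide in the same bucket; a larger n_features lowers the collision rate at the cost of memory.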
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:
1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
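The order of the trimming and filtering steps matters, because max_length is checked before any trimming while min_length is checked after all of it. A minimal sketch of steps 3 to 7, assuming the amplicon has already been extracted and had its primers removed (steps 1 and 2 are not modelled), might look like this. The function name trim_and_filter is hypothetical and the code is an illustration of the documented order, not the plugin's implementation.

```python
def trim_and_filter(amplicon, trim_right=0, trunc_len=0, trim_left=0,
                    min_length=50, max_length=0):
    """Apply steps 3-7 to a primer-free amplicon; return None if it is discarded."""
    # Step 3: max_length filter runs before any trimming (0 disables it).
    if max_length > 0 and len(amplicon) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        amplicon = amplicon[:-trim_right]
    # Step 5: cut the read down to trunc_len.
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        amplicon = amplicon[trim_left:]
    # Step 7: min_length filter runs after all trimming (0 disables it).
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon


read = "A" * 10 + "C" * 80 + "G" * 10  # toy 100-nt amplicon
kept = trim_and_filter(read, trim_right=10, trunc_len=60, trim_left=10)
print(len(kept) if kept else "discarded")
```

In this toy case the read shrinks from 100 to 90, then 60, then 50 nucleotides, so it only just survives the default min_length of 50; raising min_length by one would discard it.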
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
forward primer sequence (5' -> 3').[required]
- r_primer:
Str
reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
- trunc_len:
Int
read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
- identity:
Float
minimum combined primer match identity threshold.[default: 0.8]
- min_length:
Int % Range(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
- max_length:
Int % Range(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
- n_jobs:
Int % Range(1, None)
Number of separate processes to run.[default: 1]
- batch_size:
Int % Range(1, None) | Str % Choices('auto')
Number of sequences to process in a batch. The "auto" option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
- read_orientation:
Str % Choices('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given when no consensus is found.[default: 'Unassigned']
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
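The consensus rule can be sketched as a walk down the hierarchy that stops at the first rank where no single lineage reaches min_consensus. The code below is an illustrative stand-in, not the plugin's implementation: the function name, the root-to-tip traversal, the ";" output delimiter, and the 0.0 score returned for unassignable queries are assumptions. The annotations are toy data.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Hypothetical helper: deepest lineage shared by >= min_consensus of the hits."""
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    if not lineages:
        return unassignable_label, 0.0  # placeholder score
    depth = min(len(lineage) for lineage in lineages)
    consensus, score = [], 0.0
    for level in range(1, depth + 1):
        # Count full lineage prefixes so that equal labels under different
        # parents are not merged.
        prefixes = Counter(tuple(lineage[:level]) for lineage in lineages)
        best, count = prefixes.most_common(1)[0]
        fraction = count / len(lineages)
        # min_consensus > 0.5 guarantees at most one prefix can pass.
        if fraction < min_consensus:
            break
        consensus, score = list(best), fraction
    if not consensus:
        return unassignable_label, 0.0  # placeholder score
    return ";".join(consensus), score


hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]
print(consensus_annotation(hits))
print(consensus_annotation(hits, min_consensus=0.9))
```

Raising min_consensus makes the result more conservative: with the same three hits, a stricter threshold stops one rank earlier because only two of the three agree at the class level.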
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
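The maxaccepts note distinguishes the first N hits that pass the threshold from the top N hits. The difference is easy to see on toy data. The sketch below does not model BLAST itself; the function names and the (reference ID, identity) tuples are made up for illustration, with the list order standing in for database order.

```python
def first_n_hits(hits, perc_identity, maxaccepts):
    """First N passing hits in database order (the behaviour the note describes)."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n_hits(hits, perc_identity, maxaccepts):
    """Best N passing hits by identity, shown for contrast."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


# (reference ID, identity to query), listed in database order
hits = [("ref1", 0.82), ("ref2", 0.99), ("ref3", 0.75), ("ref4", 0.90)]

print(first_n_hits(hits, perc_identity=0.8, maxaccepts=2))
print(top_n_hits(hits, perc_identity=0.8, maxaccepts=2))
```

Here the first-N selection keeps ref1 even though ref4 is a closer match, which is why a low maxaccepts combined with a permissive perc_identity can pull weaker references into the result.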
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
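The interaction between maxaccepts and maxrejects described in the parameters above can be sketched as a loop over targets sorted by shared k-mer count that stops once either counter reaches its limit. This is a simplified model for illustration only: the function name is hypothetical, the identities are supplied up front rather than computed by alignment, and weak_id, maxhits, and strand handling are left out.

```python
def heuristic_search(targets, perc_identity=0.8, maxaccepts=10, maxrejects="all"):
    """Hypothetical model of the accept/reject stopping rule.

    targets: list of (reference ID, shared k-mer count, identity to query).
    """
    # Targets are examined in order of decreasing shared k-mer count,
    # which serves as a proxy for similarity.
    ordered = sorted(targets, key=lambda t: t[1], reverse=True)
    accepted, rejects = [], 0
    for ref_id, _shared_kmers, identity in ordered:
        if identity >= perc_identity:
            accepted.append((ref_id, identity))
            if maxaccepts != "all" and len(accepted) >= maxaccepts:
                break  # enough hits found, stop searching
        else:
            rejects += 1
            if maxrejects != "all" and rejects >= maxrejects:
                break  # too many failures, stop searching
    return accepted


targets = [("a", 50, 0.95), ("b", 48, 0.70), ("c", 45, 0.99), ("d", 40, 0.85)]

print(heuristic_search(targets, maxaccepts=1))
print(heuristic_search(targets, maxaccepts=2, maxrejects=1))
print(heuristic_search(targets, maxaccepts="all", maxrejects="all"))
```

In the first call the search stops at target "a" and never reaches "c", even though "c" has the higher identity. This is the trade-off between speed and sensitivity that the maxaccepts and maxrejects descriptions refer to.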
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
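The sample_size and randseed parameters together describe a seeded random subsample of the reference database. A minimal sketch of that behaviour, using Python's random module as a stand-in for whatever tool the pipeline uses internally, is shown below. The function name and the capping of sample_size at the database size are assumptions.

```python
import random


def sample_reference_ids(ref_ids, sample_size=1000, randseed=0):
    """Hypothetical helper: draw a reproducible subsample of reference IDs."""
    # Following the randseed description, 0 means "do not fix the seed".
    rng = random.Random(randseed if randseed != 0 else None)
    k = min(sample_size, len(ref_ids))  # capping is an assumption
    return rng.sample(list(ref_ids), k)


ref_ids = [f"ref{i}" for i in range(100)]  # toy reference database

first = sample_reference_ids(ref_ids, sample_size=5, randseed=42)
second = sample_reference_ids(ref_ids, sample_size=5, randseed=42)
print(first == second)
```

A non-zero seed makes the prefilter sample, and therefore the downstream classification, repeatable between runs on the same reference database.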
This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.
- version:
2024.10.0
- website: https://
github .com /qiime2 /q2 -feature -classifier - user support:
- Please post to the QIIME 2 forum for help with this plugin: https://
forum .qiime2 .org - citations:
- Bokulich et al., 2018
Actions¶
Name | Type | Short Description |
---|---|---|
fit-classifier-sklearn | method | Train an almost arbitrary scikit-learn classifier |
classify-sklearn | method | Pre-fitted sklearn-based taxonomy classifier |
fit-classifier-naive-bayes | method | Train the naive_bayes classifier |
extract-reads | method | Extract reads from reference sequences. |
find-consensus-annotation | method | Find consensus among multiple annotations. |
makeblastdb | method | Make BLAST database. |
blast | method | BLAST+ local alignment search. |
vsearch-global | method | VSEARCH global alignment search |
classify-consensus-blast | pipeline | BLAST+ consensus taxonomy classifier |
classify-consensus-vsearch | pipeline | VSEARCH-based consensus taxonomy classifier |
classify-hybrid-vsearch-sklearn | pipeline | ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier |
Artifact Classes¶
BLASTDB |
TaxonomicClassifier |
Formats¶
BLASTDBDirFmtV5 |
TaxonomicClassifierDirFmt |
TaxonomicClassiferTemporaryPickleDirFmt |
feature-classifier fit-classifier-sklearn¶
Train a scikit-learn classifier to classify reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classifier_specification:
Str
<no description>[required]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default:
'2*n_jobs'
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]- read_orientation:
Str
%
Choices
('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default:
'auto'
]
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default:
0.001
]- classify__chunk_size:
Int
<no description>[default:
20000
]- classify__class_prior:
Str
<no description>[default:
'null'
]- classify__fit_prior:
Bool
<no description>[default:
False
]- feat_ext__alternate_sign:
Bool
<no description>[default:
False
]- feat_ext__analyzer:
Str
<no description>[default:
'char_wb'
]- feat_ext__binary:
Bool
<no description>[default:
False
]- feat_ext__decode_error:
Str
<no description>[default:
'strict'
]- feat_ext__encoding:
Str
<no description>[default:
'utf-8'
]- feat_ext__input:
Str
<no description>[default:
'content'
]- feat_ext__lowercase:
Bool
<no description>[default:
True
]- feat_ext__n_features:
Int
<no description>[default:
8192
]- feat_ext__ngram_range:
Str
<no description>[default:
'[7, 7]'
]- feat_ext__norm:
Str
<no description>[default:
'l2'
]- feat_ext__preprocessor:
Str
<no description>[default:
'null'
]- feat_ext__stop_words:
Str
<no description>[default:
'null'
]- feat_ext__strip_accents:
Str
<no description>[default:
'null'
]- feat_ext__token_pattern:
Str
<no description>[default:
'(?u)\\b\\w\\w+\\b'
]- feat_ext__tokenizer:
Str
<no description>[default:
'null'
]- verbose:
Bool
<no description>[default:
False
]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
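The feat_ext__ defaults above (n-grams of length 7 hashed into 8192 features, lowercased input) suggest that each read is represented as a vector of hashed k-mer counts. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, zlib.crc32 is a stand-in hash rather than whatever scikit-learn uses internally, and the 'char_wb' padding and 'l2' normalization are not reproduced.

```python
import zlib


def kmers(seq: str, k: int = 7) -> list:
    """Return all overlapping k-mers of a sequence, lowercased."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hashed_kmer_counts(seq: str, k: int = 7, n_features: int = 8192) -> list:
    """Count k-mers in a fixed-length vector by hashing each k-mer to a bucket.

    Assumption: CRC32 is used only for illustration; different k-mers may
    collide in the same bucket, which is inherent to feature hashing.
    """
    counts = [0] * n_features
    for kmer in kmers(seq, k):
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] += 1
    return counts


if __name__ == "__main__":
    vector = hashed_kmer_counts("ACGTACGTA")
    print(len(vector), sum(vector))
```

The fixed vector length is what keeps memory bounded regardless of how many distinct k-mers occur in the reference reads.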
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
forward primer sequence (5' -> 3').[required]
- r_primer:
Str
reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default:
0
]- trunc_len:
Int
read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default:
0
]- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default:
0
]- identity:
Float
minimum combined primer match identity threshold.[default:
0.8
]- min_length:
Int
%
Range
(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default:
50
]- max_length:
Int
%
Range
(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default:
0
]- n_jobs:
Int
%
Range
(1, None)
Number of separate processes to run.[default:
1
]- batch_size:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of sequences to process in a batch. The "auto" option is calculated from the number of sequences and number of jobs specified.[default:
'auto'
]- read_orientation:
Str
%
Choices
('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default:
'both'
]
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
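The order of the length filters and trimming steps described for extract-reads (steps 3 to 7) can be sketched as follows. This is a minimal illustration that assumes the primers have already been located and removed; the function name process_read is hypothetical and the code is not the plugin's implementation.

```python
from typing import Optional


def process_read(seq: str, trim_right: int = 0, trunc_len: int = 0,
                 trim_left: int = 0, min_length: int = 50,
                 max_length: int = 0) -> Optional[str]:
    """Apply the documented filter/trim order to one primer-free amplicon.

    Returns the processed read, or None if a length filter discards it.
    A value of zero disables the corresponding step.
    """
    # Step 3: max_length is checked before any trimming or truncation.
    if max_length > 0 and len(seq) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # Step 5: cut the read to trunc_len.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # Step 7: min_length is checked after trimming and truncation.
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


if __name__ == "__main__":
    amplicon = "A" * 10 + "C" * 80 + "G" * 10
    print(process_read(amplicon, trim_right=10, trunc_len=60, trim_left=10))
```

Because min_length is applied last, aggressive trimming can cause reads to be discarded that would otherwise have passed, which is the retention caveat noted in the min_length description.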
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default:
0.51
]- unassignable_label:
Str
Annotation given when no consensus is found.[default:
'Unassigned'
]
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
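One way to picture the consensus step in find-consensus-annotation is to walk the ranks of the semicolon-delimited annotations from the top and keep extending the consensus while the most common lineage still reaches min_consensus. The sketch below is a simplified illustration under that assumption, not the plugin's code; the function name consensus_annotation and the returned score (the fraction of annotations agreeing at the deepest accepted rank) are hypothetical choices.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return (consensus, score) for a list of semicolon-delimited annotations."""
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    if not lineages:
        return unassignable_label, 0.0

    consensus, score = [], 0.0
    # Only compare ranks that every annotation provides.
    depth = min(len(lineage) for lineage in lineages)
    for level in range(depth):
        # Count complete lineage prefixes so that identical labels under
        # different parents are not merged.
        counts = Counter(tuple(lineage[:level + 1]) for lineage in lineages)
        prefix, count = counts.most_common(1)[0]
        fraction = count / len(lineages)
        if fraction < min_consensus:
            break  # no sufficient agreement at this rank; stop descending
        consensus, score = list(prefix), fraction

    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), score


if __name__ == "__main__":
    hits = [
        "k__Bacteria; p__Firmicutes; g__A",
        "k__Bacteria; p__Firmicutes; g__B",
        "k__Bacteria; p__Firmicutes; g__A",
    ]
    print(consensus_annotation(hits, min_consensus=0.51))
    print(consensus_annotation(hits, min_consensus=0.7))
```

Raising min_consensus makes the result more conservative: in the example, a threshold of 0.7 stops at the phylum because only two of three annotations agree on the genus.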
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default:
'both'
]- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default:
0.001
]- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default:
True
]- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default:
1
]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
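The maxaccepts note for blast distinguishes the first N hits that pass the perc_identity threshold from the top N hits. The toy model below makes that difference concrete. It is not how BLAST searches a database; hits are assumed to be (reference ID, identity) pairs listed in database order, and both function names are hypothetical.

```python
def first_n_passing(hits, perc_identity, n):
    """Keep the first n hits, in database order, that meet the threshold."""
    kept = []
    for ref_id, identity in hits:
        if identity >= perc_identity:
            kept.append((ref_id, identity))
            if len(kept) == n:
                break
    return kept


def top_n(hits, perc_identity, n):
    """Keep the n highest-identity hits among those that meet the threshold."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:n]


if __name__ == "__main__":
    hits = [("r1", 0.85), ("r2", 0.70), ("r3", 0.99), ("r4", 0.90)]
    print(first_n_passing(hits, 0.8, 2))  # r1 and r3: first two that pass
    print(top_n(hits, 0.8, 2))            # r3 and r4: best two that pass
```

The two selections differ whenever a weaker passing hit appears earlier in the database than a stronger one.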
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default:
'both'
]- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default:
False
]- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default:
False
]- maxhits:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to show once the search is terminated.[default:
'all'
]- maxrejects:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default:
'all'
]- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default:
True
]- weak_id:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default:
0.0
]- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default:
1
]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
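The interplay of maxaccepts and maxrejects described above can be sketched as a simple stopping rule. This is a toy model under stated assumptions, not VSEARCH itself: targets are assumed to be (target ID, identity) pairs already sorted by decreasing shared k-mer count, and the function name search_targets is hypothetical.

```python
def search_targets(targets, perc_identity=0.8, maxaccepts=10, maxrejects="all"):
    """Return accepted target IDs, stopping once either limit is reached."""
    accept_limit = None if maxaccepts == "all" else maxaccepts
    reject_limit = None if maxrejects == "all" else maxrejects

    accepted, rejected = [], 0
    for target_id, identity in targets:
        if identity >= perc_identity:
            accepted.append(target_id)
            if accept_limit is not None and len(accepted) >= accept_limit:
                break  # enough hits accepted for this query
        else:
            rejected += 1
            if reject_limit is not None and rejected >= reject_limit:
                break  # too many non-matching targets examined
    return accepted


if __name__ == "__main__":
    targets = [("a", 0.95), ("b", 0.60), ("c", 0.90), ("d", 0.50), ("e", 0.85)]
    print(search_targets(targets, maxaccepts=1))
    print(search_targets(targets, maxaccepts=2, maxrejects=1))
    print(search_targets(targets, maxaccepts="all", maxrejects="all"))
```

With both limits set to "all" the loop never stops early, which corresponds to searching the complete database.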
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default:
'both'
]- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default:
0.001
]- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default:
True
]- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default:
0.51
]- unassignable_label:
Str
Annotation given to sequences without any hits.[default:
'Unassigned'
]- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default:
1
]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
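The search_results output uses BLAST6 tabular data, which can be inspected with a few lines of code once exported to a text file. The sketch below assumes the standard twelve tab-separated BLAST tabular columns and groups hits by query; the function name parse_blast6 is hypothetical, and rows that deviate from the twelve-column layout are not handled.

```python
from collections import defaultdict

# Standard column order of BLAST tabular (outfmt 6) output.
BLAST6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_blast6(lines):
    """Group BLAST6 rows by query ID, converting score columns to floats."""
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        record = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        for column in ("pident", "evalue", "bitscore"):
            record[column] = float(record[column])
        hits[record["qseqid"]].append(record)
    return dict(hits)


if __name__ == "__main__":
    rows = [
        "q1\tref1\t99.2\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450",
        "q1\tref2\t97.6\t250\t6\t0\t1\t250\t1\t250\t1e-110\t420",
        "q2\tref9\t88.0\t250\t30\t0\t1\t250\t5\t254\t1e-70\t280",
    ]
    for query, records in parse_blast6(rows).items():
        print(query, [r["sseqid"] for r in records])
```

Note that the pident column in BLAST tabular output is a percentage, whereas the perc_identity parameter above is expressed as a fraction between 0 and 1.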
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default:
'both'
]- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default:
False
]- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default:
False
]- maxhits:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to show once the search is terminated.[default:
'all'
]- maxrejects:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default:
'all'
]- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default:
True
]- weak_id:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that are not deducted from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default:
0.0
]- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default:
1
]- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default:
0.51
]- unassignable_label:
Str
Annotation given to sequences without any hits.[default:
'Unassigned'
]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
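The top_hits_only behaviour described above, where only the hits tied at the highest percent identity are kept for a query, can be sketched in a few lines. This is an illustration rather than the VSEARCH implementation; hits are assumed to be (reference ID, identity) pairs for a single query, and the function name top_hits is hypothetical.

```python
def top_hits(hits):
    """Keep only the hits that share the highest identity for one query."""
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return [(ref_id, identity) for ref_id, identity in hits if identity == best]


if __name__ == "__main__":
    hits = [("r1", 0.97), ("r2", 0.99), ("r3", 0.99), ("r4", 0.90)]
    print(top_hits(hits))  # r2 and r3 are tied for the best identity
```

When several hits tie, all of them would feed into the consensus step, so a tie between references with different taxonomy can still shorten the assigned lineage.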
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default:
0.5
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default:
'both'
]- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default:
0.51
]- maxhits:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
<no description>[default:
'all'
]- maxrejects:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
<no description>[default:
'all'
]- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default:
'auto'
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]- read_orientation:
Str
%
Choices
('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default:
'auto'
]- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default:
1
]- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default:
True
]- sample_size:
Int
%
Range
(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default:
1000
]- randseed:
Int
%
Range
(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default:
0
]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
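The "auto" setting of reads_per_batch is described as min(number of query sequences / threads, 20000). A minimal sketch of that arithmetic follows. The function name auto_reads_per_batch is hypothetical, and rounding up and enforcing a minimum batch size of 1 are assumptions made here so that the result is always a usable integer; the plugin may round differently.

```python
import math
import os


def auto_reads_per_batch(n_queries: int, threads: int = 1, cap: int = 20000) -> int:
    """Estimate the batch size implied by the documented "auto" formula."""
    if threads == 0:
        # The docs state that 0 means one thread per available CPU.
        threads = os.cpu_count() or 1
    # Assumption: round up and never return fewer than one read per batch.
    return max(1, min(math.ceil(n_queries / threads), cap))


if __name__ == "__main__":
    print(auto_reads_per_batch(100000, threads=4))  # capped at 20000
    print(auto_reads_per_batch(1000, threads=4))    # 250
```

Small inputs are therefore split evenly across threads, while large inputs are processed in batches of at most 20000 reads.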
This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.
- version:
2024.10.0
- website: https://
github .com /qiime2 /q2 -feature -classifier - user support:
- Please post to the QIIME 2 forum for help with this plugin: https://
forum .qiime2 .org - citations:
- Bokulich et al., 2018
Actions¶
Name | Type | Short Description |
---|---|---|
fit-classifier-sklearn | method | Train an almost arbitrary scikit-learn classifier |
classify-sklearn | method | Pre-fitted sklearn-based taxonomy classifier |
fit-classifier-naive-bayes | method | Train the naive_bayes classifier |
extract-reads | method | Extract reads from reference sequences. |
find-consensus-annotation | method | Find consensus among multiple annotations. |
makeblastdb | method | Make BLAST database. |
blast | method | BLAST+ local alignment search. |
vsearch-global | method | VSEARCH global alignment search |
classify-consensus-blast | pipeline | BLAST+ consensus taxonomy classifier |
classify-consensus-vsearch | pipeline | VSEARCH-based consensus taxonomy classifier |
classify-hybrid-vsearch-sklearn | pipeline | ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier |
Artifact Classes¶
BLASTDB |
TaxonomicClassifier |
Formats¶
BLASTDBDirFmtV5 |
TaxonomicClassifierDirFmt |
TaxonomicClassiferTemporaryPickleDirFmt |
feature-classifier fit-classifier-sklearn¶
Train a scikit-learn classifier to classify reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classifier_specification:
Str
<no description>[required]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default:
'2*n_jobs'
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]- read_orientation:
Str
%
Choices
('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default:
'auto'
]
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default:
0.001
]- classify__chunk_size:
Int
<no description>[default:
20000
]- classify__class_prior:
Str
<no description>[default:
'null'
]- classify__fit_prior:
Bool
<no description>[default:
False
]- feat_ext__alternate_sign:
Bool
<no description>[default:
False
]- feat_ext__analyzer:
Str
<no description>[default:
'char_wb'
]- feat_ext__binary:
Bool
<no description>[default:
False
]- feat_ext__decode_error:
Str
<no description>[default:
'strict'
]- feat_ext__encoding:
Str
<no description>[default:
'utf-8'
]- feat_ext__input:
Str
<no description>[default:
'content'
]- feat_ext__lowercase:
Bool
<no description>[default:
True
]- feat_ext__n_features:
Int
<no description>[default:
8192
]- feat_ext__ngram_range:
Str
<no description>[default:
'[7, 7]'
]- feat_ext__norm:
Str
<no description>[default:
'l2'
]- feat_ext__preprocessor:
Str
<no description>[default:
'null'
]- feat_ext__stop_words:
Str
<no description>[default:
'null'
]- feat_ext__strip_accents:
Str
<no description>[default:
'null'
]- feat_ext__token_pattern:
Str
<no description>[default:
'(?u)\\b\\w\\w+\\b'
]- feat_ext__tokenizer:
Str
<no description>[default:
'null'
]- verbose:
Bool
<no description>[default:
False
]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity
). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length
are removed; 4. reads are trimmed with trim_right
; 5. reads are truncated to trunc_len
; 6. reads are trimmed with trim_left
; 7. reads shorter than min_length
are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
forward primer sequence (5' -> 3').[required]
- r_primer:
Str
reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default:
0
]- trunc_len:
Int
read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default:
0
]- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default:
0
]- identity:
Float
minimum combined primer match identity threshold.[default:
0.8
]- min_length:
Int
%
Range
(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default:
50
]- max_length:
Int
%
Range
(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default:
0
]- n_jobs:
Int
%
Range
(1, None)
Number of seperate processes to run.[default:
1
]- batch_size:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of sequences to process in a batch. The
auto
option is calculated from the number of sequences and number of jobs specified.[default:'auto'
]- read_orientation:
Str
%
Choices
('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default:
'both'
]
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default:
0.51
]- unassignable_label:
Str
Annotation given when no consensus is found.[default:
'Unassigned'
]
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
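The maxaccepts note distinguishes the first N hits that pass perc_identity from the top N hits. The sketch below only illustrates that distinction on a toy list. The function names and identity values are invented for the example, and it does not model how BLAST itself traverses a database.

```python
def first_n_hits(hits, perc_identity=0.8, maxaccepts=10):
    """Keep the first N passing hits in database order."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n_hits(hits, perc_identity=0.8, maxaccepts=10):
    """Keep the N passing hits with the highest identity."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:maxaccepts]


# (reference ID, fractional identity) pairs in database order; made-up values.
hits = [("ref1", 0.82), ("ref2", 0.75), ("ref3", 0.99), ("ref4", 0.91)]

print(first_n_hits(hits, maxaccepts=2))  # [('ref1', 0.82), ('ref3', 0.99)]
print(top_n_hits(hits, maxaccepts=2))    # [('ref3', 0.99), ('ref4', 0.91)]
```

With a small maxaccepts the two strategies can return different hit sets, which is why a larger maxaccepts or a stricter perc_identity may be worth considering when database order is arbitrary.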
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and, with maxaccepts set to 1, the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Report only the top hits between the query and reference sequence sets. For each query, the top hit is the one with the highest percent identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not counted toward maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
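The interplay of maxaccepts and maxrejects described in the maxaccepts entry can be sketched as a toy search loop. This is only an illustration under stated assumptions: the function names are hypothetical, `difflib.SequenceMatcher` stands in for a real pairwise global alignment, and the sequences are made up. It is not how VSEARCH is implemented.

```python
from difflib import SequenceMatcher


def kmers(seq, k=4):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_guided_search(query, targets, perc_identity=0.8,
                       maxaccepts=1, maxrejects=2, k=4):
    """Toy accept/reject search over a dict of {name: sequence}."""
    query_kmers = kmers(query, k)
    # Candidates are examined by decreasing number of shared k-mers.
    ordered = sorted(targets.items(),
                     key=lambda item: len(query_kmers & kmers(item[1], k)),
                     reverse=True)
    accepted, rejects = [], 0
    for name, seq in ordered:
        # Stand-in for alignment identity (an assumption of this sketch).
        identity = SequenceMatcher(None, query, seq, autojunk=False).ratio()
        if identity >= perc_identity:
            accepted.append((name, identity))
            if len(accepted) >= maxaccepts:
                break  # enough hits accepted
        else:
            rejects += 1
            if rejects >= maxrejects:
                break  # too many failed candidates
    return accepted


query = "ACGTTGCAAGCT"
targets = {
    "exact": "ACGTTGCAAGCT",
    "near": "ACGTTGCAAGGT",
    "far": "TTTTTTTTTTTT",
}
print([name for name, _ in kmer_guided_search(query, targets, maxaccepts=1)])
print([name for name, _ in kmer_guided_search(query, targets,
                                              maxaccepts=3, maxrejects=1)])
```

The first call stops after a single accepted hit. The second keeps going until one candidate is rejected, so it returns both close matches.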
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns a consensus taxonomy to each query sequence from among its maxaccepts hits, provided that at least a min_consensus fraction of those hits share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
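The search_results output is in BLAST6 format, i.e., the standard 12-column tab-separated BLAST tabular layout. A minimal sketch of grouping such rows per query before a consensus step might look as follows. The helper name, the example rows, and the use of "*" in the subject column to mark a query without hits are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Column names of the standard BLAST tabular (outfmt 6) layout.
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def hits_per_query(blast6_text, taxonomy):
    """Map each query ID to the taxonomy strings of its hits."""
    grouped = defaultdict(list)
    reader = csv.DictReader(io.StringIO(blast6_text),
                            fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        hits = grouped[row["qseqid"]]  # registers queries that have no hits
        if row["sseqid"] != "*":       # assumption: "*" marks a no-hit row
            hits.append(taxonomy[row["sseqid"]])
    return dict(grouped)


# Made-up search results and reference taxonomy.
rows = [
    ["q1", "refA", "98.0", "250", "5", "0", "1", "250", "1", "250", "1e-120", "430"],
    ["q1", "refB", "97.2", "250", "7", "0", "1", "250", "1", "250", "1e-118", "425"],
    ["q2", "*", "0.0", "0", "0", "0", "0", "0", "0", "0", "0", "0"],
]
blast6_text = "\n".join("\t".join(r) for r in rows)
taxonomy = {
    "refA": "k__Bacteria; p__Firmicutes",
    "refB": "k__Bacteria; p__Bacteroidetes",
}
grouped = hits_per_query(blast6_text, taxonomy)
print(grouped)
```

A query that ends up with an empty list here corresponds to the case in which unassignable_label would be reported.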
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns a consensus taxonomy to each query sequence from among its maxaccepts top hits, provided that at least a min_consensus fraction of those hits share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and, with maxaccepts set to 1, the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Report only the top hits between the query and reference sequence sets. For each query, the top hit is the one with the highest percent identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not counted toward maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
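The effect of top_hits_only on the hits that feed the consensus step can be sketched as a simple filter. The helper name and the identity values below are hypothetical, and the sketch assumes that ties are detected by exact equality of the reported identity.

```python
def keep_top_hits(hits):
    """Keep only the hits tied for the highest percent identity."""
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return [hit for hit in hits if hit[1] == best]


# (reference ID, percent identity) pairs for one query; made-up values.
hits = [("refA", 100.0), ("refB", 100.0), ("refC", 97.5)]
print(keep_top_hits(hits))  # [('refA', 100.0), ('refB', 100.0)]
```

Both tied hits survive, so a consensus over them would still be needed when their taxonomies differ.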
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and, with maxaccepts set to 1, the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
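The "auto" rule for reads_per_batch, min(number of query sequences / threads, 20000), can be written out as a small helper. This is a sketch only: the function name is hypothetical, and rounding the quotient up and resolving threads=0 through `os.cpu_count()` are assumptions rather than documented behavior.

```python
import math
import os


def resolve_reads_per_batch(reads_per_batch, n_queries, threads=1):
    """Return the batch size implied by the reads_per_batch setting."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)  # an explicit value is used as given
    # threads=0 means one thread per available CPU.
    workers = threads if threads > 0 else (os.cpu_count() or 1)
    # Assumption: the quotient is rounded up and never drops below 1.
    return max(1, min(math.ceil(n_queries / workers), 20000))


print(resolve_reads_per_batch("auto", 100000, threads=4))  # 20000 (capped)
print(resolve_reads_per_batch("auto", 1000, threads=4))    # 250
print(resolve_reads_per_batch(500, 1000, threads=4))       # 500
```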
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
- pre_dispatch:
Str
The number of batches (of tasks) to be pre-dispatched: "all" or an expression such as "3*n_jobs".[default: '2*n_jobs']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
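How a confidence threshold can limit taxonomic depth is sketched below. This is an illustration only, not the classifier's actual procedure: the helper name, the per-rank confidence values, and the fallback to "Unassigned" when even the first rank fails the threshold are all assumptions.

```python
def limit_depth(ranks, confidences, confidence=0.7, unassigned="Unassigned"):
    """Truncate a lineage at the first rank whose confidence is too low.

    `ranks` and `confidences` are parallel lists ordered from the
    shallowest to the deepest rank (hypothetical inputs).
    """
    if confidence == "disable":
        return "; ".join(ranks), None  # no confidence is reported at all
    kept = []
    for rank, rank_confidence in zip(ranks, confidences):
        if rank_confidence < confidence:
            break
        kept.append((rank, rank_confidence))
    if not kept:
        return unassigned, None  # assumption of this sketch
    return "; ".join(rank for rank, _ in kept), kept[-1][1]


ranks = ["k__Bacteria", "p__Firmicutes", "c__Bacilli"]
confidences = [0.99, 0.85, 0.60]  # made-up values

print(limit_depth(ranks, confidences, 0.7))        # stops at the phylum
print(limit_depth(ranks, confidences, 0))          # full lineage, with its confidence
print(limit_depth(ranks, confidences, "disable"))  # full lineage, no confidence
```

The three calls mirror the three documented modes: a positive threshold, 0 (confidence is calculated but does not limit depth), and "disable".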
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default: 0.001]
- classify__chunk_size:
Int
<no description>[default: 20000]
- classify__class_prior:
Str
<no description>[default: 'null']
- classify__fit_prior:
Bool
<no description>[default: False]
- feat_ext__alternate_sign:
Bool
<no description>[default: False]
- feat_ext__analyzer:
Str
<no description>[default: 'char_wb']
- feat_ext__binary:
Bool
<no description>[default: False]
- feat_ext__decode_error:
Str
<no description>[default: 'strict']
- feat_ext__encoding:
Str
<no description>[default: 'utf-8']
- feat_ext__input:
Str
<no description>[default: 'content']
- feat_ext__lowercase:
Bool
<no description>[default: True]
- feat_ext__n_features:
Int
<no description>[default: 8192]
- feat_ext__ngram_range:
Str
<no description>[default: '[7, 7]']
- feat_ext__norm:
Str
<no description>[default: 'l2']
- feat_ext__preprocessor:
Str
<no description>[default: 'null']
- feat_ext__stop_words:
Str
<no description>[default: 'null']
- feat_ext__strip_accents:
Str
<no description>[default: 'null']
- feat_ext__token_pattern:
Str
<no description>[default: '(?u)\\b\\w\\w+\\b']
- feat_ext__tokenizer:
Str
<no description>[default: 'null']
- verbose:
Bool
<no description>[default: False]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
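The parameters above carry no descriptions, but the feat_ext__ defaults (character n-grams with range '[7, 7]' and 8192 features) suggest that each read is represented by overlapping 7-mers hashed into a fixed number of columns. The following standard-library sketch illustrates that idea only. It assumes the Str defaults are JSON-encoded, uses `zlib.crc32` as a stand-in hash, and ignores the word-boundary padding of the 'char_wb' analyzer, so it will not reproduce scikit-learn's feature values.

```python
import json
import zlib
from collections import Counter


def hashed_kmer_counts(sequence, ngram_range="[7, 7]", n_features=8192,
                       lowercase=True):
    """Count overlapping k-mers of a read in `n_features` hashed buckets."""
    shortest, longest = json.loads(ngram_range)  # e.g. '[7, 7]' -> 7, 7
    seq = sequence.lower() if lowercase else sequence
    counts = Counter()
    for n in range(shortest, longest + 1):
        for start in range(len(seq) - n + 1):
            kmer = seq[start:start + n]
            # Stand-in hash; scikit-learn uses a different hash function.
            bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
            counts[bucket] += 1
    return counts


features = hashed_kmer_counts("ACGTACGTACGT")
print(sum(features.values()))  # 6 overlapping 7-mers in a 12-nt read
```

Because the number of buckets is fixed, reads of any length map to vectors of the same width, which is what allows a single classifier to be trained on all reference reads.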
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:
1. reads are extracted in the specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
Forward primer sequence (5' -> 3').[required]
- r_primer:
Str
Reverse primer sequence (5' -> 3'). Do not use the reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
- trunc_len:
Int
The read is cut to trunc_len nucleotides if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
- identity:
Float
Minimum combined primer match identity threshold.[default: 0.8]
- min_length:
Int % Range(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
- max_length:
Int % Range(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
- n_jobs:
Int % Range(1, None)
Number of separate processes to run.[default: 1]
- batch_size:
Int % Range(1, None) | Str % Choices('auto')
Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
- read_orientation:
Str % Choices('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
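The order in which length filtering, trimming, and truncation are applied (steps 3 to 7 of the description) matters for which reads survive. The sketch below applies those steps to a single read whose primers are assumed to have been removed already. The function name and the example read are hypothetical, and primer matching itself is not modeled.

```python
def trim_and_filter(read, trim_right=0, trunc_len=0, trim_left=0,
                    min_length=50, max_length=0):
    """Apply steps 3-7 to one primer-free read; return None if discarded."""
    # Step 3: max_length is checked before any trimming.
    if max_length > 0 and len(read) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        read = read[:-trim_right]
    # Step 5: cut the read to trunc_len nucleotides.
    if trunc_len > 0:
        read = read[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        read = read[trim_left:]
    # Step 7: min_length is checked after trimming and truncation.
    if min_length > 0 and len(read) < min_length:
        return None
    return read


amplicon = "A" * 10 + "C" * 80 + "G" * 10  # made-up 100-nt read

# 100 nt -> 90 (trim_right) -> 50 (trunc_len) -> 40 (trim_left)
print(trim_and_filter(amplicon, trim_right=10, trunc_len=50, trim_left=10,
                      min_length=30))
# The same settings with the default min_length of 50 discard the read.
print(trim_and_filter(amplicon, trim_right=10, trunc_len=50, trim_left=10))
```

Because min_length is evaluated last, aggressive trimming can silently drop reads that passed every earlier step.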
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default:
0.51
]- unassignable_label:
Str
Annotation given when no consensus is found.[default:
'Unassigned'
]
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default:
'both'
]- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default:
0.001
]- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless if you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to FALSE to mirror default BLAST search.[default:
True
]- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default:
1
]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default:
0.8
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default:
0.8
]- strand:
Str
%
Choices
('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default:
'both'
]- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported, and this mode is much faster than the default search. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Report only the top hits between the query and reference sequence sets. For each query, the top hit is the one with the highest percent identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with a percent identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. Logically, weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
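The search_results output stores hits in BLAST6 format. The following is a minimal sketch, assuming the standard 12-column tab-separated layout of BLAST tabular output; the helper names parse_blast6 and top_hits are hypothetical and are not part of the plugin. It groups hits by query and keeps the tied best hits, in the spirit of what top_hits_only describes.

```python
import csv
import io
from collections import defaultdict

# Standard BLAST tabular (outfmt 6) column order; assumed for this sketch.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(text):
    """Group (subject ID, percent identity) pairs by query ID."""
    hits = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(BLAST6_COLUMNS, row))
        hits[record["qseqid"]].append(
            (record["sseqid"], float(record["pident"])))
    return dict(hits)


def top_hits(query_hits):
    """Keep only the hits tied for the highest percent identity."""
    best = max(pident for _, pident in query_hits)
    return [sseqid for sseqid, pident in query_hits if pident == best]


# Made-up example rows, not real search output.
example = (
    "q1\tref1\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-100\t450\n"
    "q1\tref2\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-100\t450\n"
    "q1\tref3\t97.5\t250\t6\t0\t1\t250\t1\t250\t1e-95\t430\n"
)
hits = parse_blast6(example)
print(top_hits(hits["q1"]))
```

Here ref1 and ref2 tie for the highest identity, so both would feed into a consensus assignment while ref3 is dropped.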
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among the maxaccepts hits, at least a min_consensus fraction of which must share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower than this value.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower than this value. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
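The difference between "first N hits" and "top N hits" described for maxaccepts can be illustrated with a small sketch. This is not how BLAST+ or VSEARCH work internally. The function names and hit data are hypothetical, and treating the perc_identity threshold as inclusive is an assumption.

```python
def first_n_hits(hits, perc_identity, n):
    """Return the first n passing hits in database order."""
    kept = []
    for ref_id, identity in hits:
        if identity >= perc_identity:  # inclusive threshold is an assumption
            kept.append(ref_id)
            if len(kept) == n:
                break
    return kept


def top_n_hits(hits, perc_identity, n):
    """Return the n passing hits with the highest identity."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    passing.sort(key=lambda hit: hit[1], reverse=True)
    return [ref_id for ref_id, _ in passing[:n]]


# Made-up hits listed in database order: (reference ID, fractional identity).
hits = [("ref1", 0.82), ("ref2", 0.99), ("ref3", 0.85), ("ref4", 0.97)]
print(first_n_hits(hits, perc_identity=0.8, n=2))
print(top_n_hits(hits, perc_identity=0.8, n=2))
```

With these made-up values the first-N selection keeps ref1 and ref2, while the top-N selection keeps ref2 and ref4, which is why the choice of method can change the consensus assignment.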
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
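How min_consensus and unassignable_label interact can be sketched as follows. This is a simplified illustration, not the plugin's implementation: the function name consensus_taxonomy is hypothetical, annotations are assumed to be semicolon-delimited lineages, and agreement is measured rank by rank on lineage prefixes.

```python
from collections import Counter


def consensus_taxonomy(annotations, min_consensus=0.51,
                       unassignable_label="Unassigned"):
    """Return (consensus label, fraction supporting its deepest rank)."""
    if not annotations:
        return unassignable_label, 0.0
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    consensus, support = [], 0.0
    for depth in range(max(len(ranks) for ranks in split)):
        # Count lineage prefixes down to this depth among hits that reach it.
        prefixes = Counter(
            tuple(ranks[:depth + 1]) for ranks in split if len(ranks) > depth)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / len(split)
        if fraction < min_consensus:
            break  # agreement too weak; stop at the previous rank
        consensus, support = list(prefix), fraction
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), support


# Made-up taxonomy strings for the hits of one query.
hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]
print(consensus_taxonomy(hits))
print(consensus_taxonomy(hits, min_consensus=0.9))
```

At the default threshold two of three hits agree on the class, so the full lineage is kept; raising min_consensus to 0.9 truncates the assignment at the phylum, where all hits agree.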
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among the maxaccepts top hits, at least a min_consensus fraction of which must share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit; with maxaccepts set to 1, the search process then stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower than this value.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower than this value.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported, and this mode is much faster than the default search. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Report only the top hits between the query and reference sequence sets. For each query, the top hit is the one with the highest percent identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see the maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with a percent identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak_id reports weak hits that do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. Logically, weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
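The interplay of maxaccepts and maxrejects described above can be sketched as a stopping rule. This is only an illustration of the description, not VSEARCH's actual code. The function name, the precomputed identities, and the candidate data are hypothetical, and a real search computes each alignment only when a target is examined.

```python
def vsearch_like_search(candidates, perc_identity, maxaccepts=10,
                        maxrejects="all"):
    """candidates: iterable of (target ID, shared k-mer count, identity)."""
    # Examine targets in order of decreasing shared k-mers.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepted, rejected = [], 0
    for target_id, _, identity in ordered:
        if identity >= perc_identity:
            accepted.append(target_id)
            if maxaccepts != "all" and len(accepted) >= maxaccepts:
                break  # enough hits accepted
        else:
            rejected += 1
            if maxrejects != "all" and rejected >= maxrejects:
                break  # too many non-matching targets examined
    return accepted


# Made-up candidates: (target ID, shared k-mers, fractional identity).
candidates = [("a", 50, 0.99), ("b", 48, 0.70), ("c", 45, 0.97), ("d", 10, 0.95)]
print(vsearch_like_search(candidates, 0.8, maxaccepts=1))
print(vsearch_like_search(candidates, 0.8, maxaccepts=3, maxrejects=1))
print(vsearch_like_search(candidates, 0.8, maxaccepts="all", maxrejects="all"))
```

In this toy data a low maxrejects ends the search after target b fails, so targets c and d are never considered even though both would pass the identity threshold.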
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignment, if the first target sequence passes the acceptance criteria, it is accepted as the best hit; with maxaccepts set to 1, the search process then stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower than this value. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower than this value. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in the pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
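The "auto" rule for reads_per_batch is a simple formula, min(number of query sequences / threads, 20000). The sketch below works through it under stated assumptions: the function name is hypothetical, integer division and a floor of one read per batch are guesses about rounding, and threads=0 is resolved to the CPU count as the threads description suggests.

```python
import os


def auto_reads_per_batch(n_query, threads, cap=20000):
    """Approximate the documented 'auto' batch size."""
    if threads == 0:
        # 0 means one thread per available CPU.
        threads = os.cpu_count() or 1
    # Integer division and the floor of 1 are assumptions of this sketch.
    return max(1, min(n_query // threads, cap))


print(auto_reads_per_batch(100000, 4))
print(auto_reads_per_batch(1000, 4))
```

With 100000 queries on 4 threads the cap of 20000 applies, while 1000 queries on 4 threads gives batches of 250 reads.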
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
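One plausible reading of "confidence threshold for limiting taxonomic depth" is sketched below. It is an assumption-laden illustration rather than the plugin's confirmed algorithm: the function name, the input dictionary of per-lineage probabilities, and the rule of summing probabilities over lineages that share a prefix are all hypothetical.

```python
def truncate_by_confidence(class_probs, confidence=0.7, separator=";"):
    """class_probs maps a full lineage string to its predicted probability."""
    best = max(class_probs, key=class_probs.get)
    if confidence == "disable":
        return best, None  # no confidence is calculated
    best_ranks = best.split(separator)
    kept, kept_support = [], 0.0
    for depth in range(1, len(best_ranks) + 1):
        prefix = best_ranks[:depth]
        # Pool the probability of every lineage sharing this prefix.
        support = sum(
            prob for lineage, prob in class_probs.items()
            if lineage.split(separator)[:depth] == prefix)
        if support < confidence:
            break  # deeper ranks are not supported well enough
        kept, kept_support = prefix, support
    if not kept:
        return "Unassigned", 0.0
    return separator.join(kept), kept_support


# Made-up class probabilities for a single read.
probs = {"k__A;p__B;c__C": 0.5, "k__A;p__B;c__D": 0.3, "k__A;p__E;c__F": 0.2}
print(truncate_by_confidence(probs, confidence=0.7))
print(truncate_by_confidence(probs, confidence=0))
```

Under this reading the default threshold of 0.7 stops at the phylum, where the pooled probability is about 0.8, whereas a threshold of 0 still computes a confidence but keeps the full lineage.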
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among the maxaccepts top hits, at least a min_consensus fraction of which must share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. With maxaccepts set to 1, the first target sequence that passes the acceptance criteria after pairwise alignment is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because the weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
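The interplay between maxaccepts and maxrejects described above can be sketched in a few lines of Python. This is an illustrative simplification and not VSEARCH's implementation: `kmers`, `identity`, and `search` are hypothetical helpers, identity is a naive position-by-position comparison rather than a global alignment, and the k-mer size of 3 is chosen only to keep the toy data small.

```python
def kmers(seq, k=3):
    """Return the set of k-mers found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Naive identity: matching positions over the longer length (no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def search(query, targets, perc_identity=0.8, maxaccepts=10, maxrejects=None):
    """Toy accept/reject loop; None stands in for the "all" setting."""
    q = kmers(query)
    # Rank targets by decreasing number of shared k-mers (a proxy for similarity).
    ranked = sorted(targets.items(),
                    key=lambda kv: len(q & kmers(kv[1])),
                    reverse=True)
    accepted, rejected = [], 0
    for name, seq in ranked:
        if identity(query, seq) >= perc_identity:
            accepted.append(name)
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break  # enough hits accepted
        else:
            rejected += 1
            if maxrejects is not None and rejected >= maxrejects:
                break  # too many non-matching targets seen
    return accepted
```

With a small maxrejects the loop can stop before reaching a good target that happens to rank lower by shared k-mers, which is why setting both options to "all" is the only way to guarantee a full database search.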
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. With maxaccepts set to 1, the first target sequence that passes the acceptance criteria after pairwise alignment is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
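The sample_size and randseed semantics described above (a reproducible random subset of the reference, with 0 meaning "no fixed seed") can be illustrated with a minimal Python sketch. How the pipeline performs the sampling internally is not stated here, so `sample_reference` is a hypothetical helper that only mirrors the documented behaviour of the two parameters.

```python
import random


def sample_reference(ref_ids, sample_size=1000, randseed=0):
    """Draw a random subset of reference IDs for prefiltering (illustrative only)."""
    # Following the parameter description, a seed of 0 means "do not fix the seed".
    rng = random.Random(randseed if randseed != 0 else None)
    ref_ids = list(ref_ids)
    # Never ask for more sequences than the reference contains.
    n = min(sample_size, len(ref_ids))
    return rng.sample(ref_ids, n)
```

Calling the function twice with the same non-zero randseed returns the same subset, which is the replicability property the parameter description refers to.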
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
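The "auto" rule for reads_per_batch stated above, min(number of query sequences / n_jobs, 20000), can be written out as a small Python sketch. The function name `resolve_reads_per_batch`, the use of integer division, and the lower bound of 1 are assumptions made for this illustration; the plugin's own rounding behaviour may differ.

```python
import os


def resolve_reads_per_batch(reads_per_batch, n_queries, n_jobs=1):
    """Resolve the effective batch size from the documented "auto" rule."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # n_jobs == 0 means "use all CPUs" according to the n_jobs description.
    workers = n_jobs if n_jobs > 0 else (os.cpu_count() or 1)
    # Integer division and the floor of 1 are assumptions of this sketch.
    return max(1, min(n_queries // workers, 20000))
```

For example, 1,000 queries split over 4 jobs give batches of 250 reads, while 100,000 queries over 4 jobs are capped at 20,000 reads per batch.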
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default: 0.001]
- classify__chunk_size:
Int
<no description>[default: 20000]
- classify__class_prior:
Str
<no description>[default: 'null']
- classify__fit_prior:
Bool
<no description>[default: False]
- feat_ext__alternate_sign:
Bool
<no description>[default: False]
- feat_ext__analyzer:
Str
<no description>[default: 'char_wb']
- feat_ext__binary:
Bool
<no description>[default: False]
- feat_ext__decode_error:
Str
<no description>[default: 'strict']
- feat_ext__encoding:
Str
<no description>[default: 'utf-8']
- feat_ext__input:
Str
<no description>[default: 'content']
- feat_ext__lowercase:
Bool
<no description>[default: True]
- feat_ext__n_features:
Int
<no description>[default: 8192]
- feat_ext__ngram_range:
Str
<no description>[default: '[7, 7]']
- feat_ext__norm:
Str
<no description>[default: 'l2']
- feat_ext__preprocessor:
Str
<no description>[default: 'null']
- feat_ext__stop_words:
Str
<no description>[default: 'null']
- feat_ext__strip_accents:
Str
<no description>[default: 'null']
- feat_ext__token_pattern:
Str
<no description>[default: '(?u)\\b\\w\\w+\\b']
- feat_ext__tokenizer:
Str
<no description>[default: 'null']
- verbose:
Bool
<no description>[default: False]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
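The parameters above carry no descriptions, but the feat_ext__ names match those of scikit-learn's HashingVectorizer, so the defaults feat_ext__ngram_range='[7, 7]' and feat_ext__n_features=8192 can be read as "count 7-mers in 8192 hash buckets". The Python sketch below illustrates that idea only. It is an assumption-laden toy: `hashed_kmer_counts` is a hypothetical helper, it uses CRC32 instead of the hash function scikit-learn uses, and it skips the word-boundary padding of the 'char_wb' analyzer and the 'l2' normalization.

```python
import zlib
from collections import Counter


def hashed_kmer_counts(seq, k=7, n_features=8192):
    """Count the k-mers of a sequence in a fixed number of hash buckets."""
    counts = Counter()
    seq = seq.lower()  # mirrors feat_ext__lowercase=True
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # CRC32 is a stand-in hash chosen for determinism, not scikit-learn's.
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] += 1
    return counts
```

A fixed number of buckets keeps the feature vector the same size regardless of how many distinct 7-mers occur in the reference, at the cost of occasional collisions between different 7-mers.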
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:
1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
forward primer sequence (5' -> 3').[required]
- r_primer:
Str
reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
- trunc_len:
Int
read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
- identity:
Float
minimum combined primer match identity threshold.[default: 0.8]
- min_length:
Int % Range(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
- max_length:
Int % Range(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
- n_jobs:
Int % Range(1, None)
Number of separate processes to run.[default: 1]
- batch_size:
Int % Range(1, None) | Str % Choices('auto')
Number of sequences to process in a batch. The auto option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
- read_orientation:
Str % Choices('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
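The order in which the length filters and trimming options are applied matters, because max_length is checked before trimming and min_length after it. The following Python sketch walks through steps 3 to 7 of the extract-reads description for a single amplicon whose primers have already been removed. `trim_amplicon` is a hypothetical helper written for this illustration, not the plugin's code.

```python
def trim_amplicon(seq, trim_right=0, trunc_len=0, trim_left=0,
                  min_length=50, max_length=0):
    """Apply the documented filter/trim order; return None if the read is dropped."""
    # Step 3: discard amplicons longer than max_length (0 disables the filter).
    if max_length > 0 and len(seq) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # Step 5: truncate to trunc_len.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # Step 7: discard amplicons shorter than min_length (0 disables the filter).
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq
```

Note that a read truncated to fewer than min_length nucleotides is dropped even though the original amplicon was long enough, which is the retention effect the min_length description warns about.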
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given when no consensus is found.[default: 'Unassigned']
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
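A least-common-ancestor consensus over semicolon-delimited annotations, governed by min_consensus, can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `consensus_annotation` is a hypothetical helper, it walks the ranks from shallowest to deepest and stops at the first rank where the most common lineage falls below min_consensus, and it reports a fraction of 0.0 when nothing reaches consensus.

```python
from collections import Counter


def consensus_annotation(annotations, min_consensus=0.51,
                         unassignable_label="Unassigned"):
    """Return (consensus lineage, supporting fraction) for a list of annotations."""
    if not annotations:
        return unassignable_label, 0.0
    split = [[rank.strip() for rank in a.split(";")] for a in annotations]
    depth = min(len(s) for s in split)
    consensus, fraction = [], 0.0
    for level in range(1, depth + 1):
        # Most common lineage when every annotation is truncated to this depth.
        prefix, count = Counter(tuple(s[:level]) for s in split).most_common(1)[0]
        if count / len(split) < min_consensus:
            break
        consensus, fraction = list(prefix), count / len(split)
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), fraction
```

Raising min_consensus makes the result more conservative: with three hits of which two agree at class level, a threshold of 0.51 keeps the class, whereas 0.9 falls back to the phylum shared by all three.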
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
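The distinction drawn in the maxaccepts description, between the first N hits that pass the threshold and the top N hits, is easy to see on toy data. In the Python sketch below each hit is a hypothetical (reference ID, identity) pair listed in database order; `first_n_hits` and `top_n_hits` are illustrative helpers and do not call BLAST+.

```python
def first_n_hits(hits, perc_identity=0.8, maxaccepts=10):
    """Keep the first N hits passing the threshold, in database order."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n_hits(hits, perc_identity=0.8, maxaccepts=10):
    """Keep the N best hits passing the threshold, ranked by identity."""
    passing = [h for h in hits if h[1] >= perc_identity]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:maxaccepts]
```

With a small maxaccepts the two selections can differ substantially, so a low-identity reference that happens to sit early in the database may be kept while a near-perfect match further down is not.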
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. With maxaccepts set to 1, the first target sequence that passes the acceptance criteria after pairwise alignment is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because the weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity; otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
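The effect of top_hits_only, keeping every hit tied for the highest identity per query, can be sketched on rows shaped like the first three BLAST6 columns (query ID, subject ID, percent identity). `top_hits_only` below is a hypothetical post-filter written for illustration; in the plugin the filtering is presumably performed by VSEARCH itself.

```python
def top_hits_only(rows):
    """Keep, for each query, all hits tied for the highest percent identity."""
    rows = list(rows)
    best = {}
    for qseqid, _sseqid, pident in rows:
        # Track the best identity seen so far for each query.
        best[qseqid] = max(best.get(qseqid, pident), pident)
    return [row for row in rows if row[2] == best[row[0]]]
```

Because ties are all retained, a query can still contribute several hits to a downstream consensus step even when only top hits are reported.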
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences; otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
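To make the perc_identity and query_cov thresholds concrete, the Python sketch below evaluates them against a single line of standard 12-column BLAST6 output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). It is an approximation for illustration: BLAST+ applies qcov_hsp_perc internally, `query_coverage` and `passes_thresholds` are hypothetical helpers, and the query length must be supplied separately because it is not part of the BLAST6 row.

```python
def query_coverage(blast6_line, query_length):
    """Fraction of the query covered by one high-scoring pair."""
    fields = blast6_line.rstrip("\n").split("\t")
    qstart, qend = int(fields[6]), int(fields[7])
    # Coordinates are 1-based and inclusive, hence the + 1.
    return (abs(qend - qstart) + 1) / query_length


def passes_thresholds(blast6_line, query_length,
                      perc_identity=0.8, query_cov=0.8):
    """Check one hit against the identity and coverage cut-offs."""
    # BLAST6 reports identity as a percentage; the plugin uses a fraction.
    pident = float(blast6_line.split("\t")[2]) / 100.0
    return (pident >= perc_identity
            and query_coverage(blast6_line, query_length) >= query_cov)
```

Note the unit difference this sketch has to bridge: the search_results artifact stores identity on a 0 to 100 scale, while perc_identity is given on a 0 to 1 scale.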
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
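The "auto" setting of reads_per_batch amounts to a capped division. The following is a minimal sketch of that rule, not the plugin's code: the function name `auto_reads_per_batch` is hypothetical, integer floor division and a floor of one read per batch are assumptions, and `threads` stands for an already-resolved worker count (not the special value 0).

```python
def auto_reads_per_batch(n_queries: int, threads: int, cap: int = 20000) -> int:
    """Illustrative autoscaling: min(n_queries / threads, cap)."""
    if n_queries < 1 or threads < 1:
        raise ValueError("n_queries and threads must be positive")
    # Assumption: round down, but never go below one read per batch.
    return max(1, min(n_queries // threads, cap))


print(auto_reads_per_batch(50_000, 4))     # 12500
print(auto_reads_per_batch(1_000_000, 8))  # 20000
```

Small inputs are split evenly across workers, while large inputs are capped at 20000 reads per batch.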
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
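One way to picture how a confidence threshold limits taxonomic depth is the sketch below. It is an illustration under stated assumptions, not the plugin's implementation: the function name, the per-rank cumulative confidence values, the taxon labels, and the "Unassigned" fallback are all made up for the example.

```python
from typing import List, Optional, Tuple

Rank = Tuple[str, float]  # (rank label, cumulative confidence at that rank)


def limit_taxonomic_depth(lineage: List[Rank],
                          confidence: float = 0.7) -> Tuple[str, Optional[float]]:
    """Keep ranks from the root down until confidence drops below the threshold."""
    kept: List[Rank] = []
    for label, score in lineage:
        if score < confidence:
            break  # deeper ranks are not reported
        kept.append((label, score))
    if not kept:
        # Assumption: no rank passed, so fall back to a placeholder label.
        return "Unassigned", None
    return "; ".join(label for label, _ in kept), kept[-1][1]


# Hypothetical lineage with made-up confidence values.
lineage = [
    ("k__Bacteria", 0.99),
    ("p__Firmicutes", 0.95),
    ("c__Bacilli", 0.80),
    ("o__Lactobacillales", 0.62),
]
print(limit_taxonomic_depth(lineage, 0.7))
# ('k__Bacteria; p__Firmicutes; c__Bacilli', 0.8)
```

With a threshold of 0 every rank is kept, which mirrors the documented behavior of calculating confidence without using it to limit depth.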
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default: 0.001]
- classify__chunk_size:
Int
<no description>[default: 20000]
- classify__class_prior:
Str
<no description>[default: 'null']
- classify__fit_prior:
Bool
<no description>[default: False]
- feat_ext__alternate_sign:
Bool
<no description>[default: False]
- feat_ext__analyzer:
Str
<no description>[default: 'char_wb']
- feat_ext__binary:
Bool
<no description>[default: False]
- feat_ext__decode_error:
Str
<no description>[default: 'strict']
- feat_ext__encoding:
Str
<no description>[default: 'utf-8']
- feat_ext__input:
Str
<no description>[default: 'content']
- feat_ext__lowercase:
Bool
<no description>[default: True]
- feat_ext__n_features:
Int
<no description>[default: 8192]
- feat_ext__ngram_range:
Str
<no description>[default: '[7, 7]']
- feat_ext__norm:
Str
<no description>[default: 'l2']
- feat_ext__preprocessor:
Str
<no description>[default: 'null']
- feat_ext__stop_words:
Str
<no description>[default: 'null']
- feat_ext__strip_accents:
Str
<no description>[default: 'null']
- feat_ext__token_pattern:
Str
<no description>[default: '(?u)\\b\\w\\w+\\b']
- feat_ext__tokenizer:
Str
<no description>[default: 'null']
- verbose:
Bool
<no description>[default: False]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order:
1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than max_length are removed;
4. reads are trimmed with trim_right;
5. reads are truncated to trunc_len;
6. reads are trimmed with trim_left;
7. reads shorter than min_length are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
Forward primer sequence (5' -> 3').[required]
- r_primer:
Str
Reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
- trunc_len:
Int
Read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
- identity:
Float
Minimum combined primer match identity threshold.[default: 0.8]
- min_length:
Int % Range(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
- max_length:
Int % Range(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
- n_jobs:
Int % Range(1, None)
Number of separate processes to run.[default: 1]
- batch_size:
Int % Range(1, None) | Str % Choices('auto')
Number of sequences to process in a batch. The "auto" option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
- read_orientation:
Str % Choices('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
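Because the filtering and trimming order affects which reads survive, it can help to see it as code. The sketch below applies the max_length, trim_right, trunc_len, trim_left, and min_length steps in the documented order to one amplicon whose primers are assumed to be already removed. The function name `trim_amplicon` is hypothetical and this is not the plugin's implementation.

```python
from typing import Optional


def trim_amplicon(seq: str, *, trim_right: int = 0, trunc_len: int = 0,
                  trim_left: int = 0, min_length: int = 50,
                  max_length: int = 0) -> Optional[str]:
    """Return the processed read, or None if a length filter discards it."""
    # max_length is checked before any trimming or truncation.
    if max_length > 0 and len(seq) > max_length:
        return None
    # trim_right removes bases from the 3' end.
    if trim_right > 0:
        seq = seq[:-trim_right]
    # trunc_len cuts the read to a fixed length.
    if trunc_len > 0:
        seq = seq[:trunc_len]
    # trim_left removes bases from the 5' end.
    if trim_left > 0:
        seq = seq[trim_left:]
    # min_length is checked last, after all trimming.
    if min_length > 0 and len(seq) < min_length:
        return None
    return seq


amplicon = "ACGT" * 30  # 120 nt toy amplicon
read = trim_amplicon(amplicon, trim_right=10, trunc_len=100, trim_left=20)
print(len(read))  # 80
print(trim_amplicon(amplicon, max_length=100))  # None
```

In the first call the read goes 120 -> 110 -> 100 -> 80 nt and passes the default min_length of 50. In the second call the 120 nt amplicon is discarded before any trimming happens.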
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that the annotation hierarchy is assumed to have an even number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given when no consensus is found.[default: 'Unassigned']
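A consensus over semicolon-delimited annotations can be sketched rank by rank. The following is a simplified illustration, not the plugin's implementation: the function name, the "; " join, the 0.0 score for unassigned queries, and the example labels are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def consensus_annotation(annotations: List[str], min_consensus: float = 0.51,
                         unassignable_label: str = "Unassigned") -> Tuple[str, float]:
    """Return the deepest lineage prefix shared by >= min_consensus of annotations."""
    if not annotations:
        return unassignable_label, 0.0
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    total = len(lineages)
    consensus: List[str] = []
    score = 0.0
    for level in range(min(len(lineage) for lineage in lineages)):
        # Count whole prefixes so support reflects agreement at every rank so far.
        counts = Counter(tuple(lineage[:level + 1]) for lineage in lineages)
        prefix, n = counts.most_common(1)[0]
        if n / total < min_consensus:
            break
        consensus, score = list(prefix), n / total
    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), score


hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]
print(consensus_annotation(hits, 0.51)[0])  # k__Bacteria; p__Firmicutes; c__Bacilli
print(consensus_annotation(hits, 0.9))      # ('k__Bacteria; p__Firmicutes', 1.0)
```

Raising min_consensus shortens the assignment: two of three hits agree on the class, which clears 0.51 but not 0.9.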
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
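The distinction between "first N hits" and "top N hits" in the maxaccepts description can be shown with a toy example. This sketch only illustrates the selection rule described above; it does not model BLAST's internals, and the function names and hit values are made up.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (reference id, fractional identity to the query)


def first_n_hits(hits_in_db_order: List[Hit], perc_identity: float,
                 maxaccepts: int) -> List[Hit]:
    """Keep the first N hits, in database order, that pass the threshold."""
    passing = [hit for hit in hits_in_db_order if hit[1] >= perc_identity]
    return passing[:maxaccepts]


def top_n_hits(hits_in_db_order: List[Hit], perc_identity: float,
               maxaccepts: int) -> List[Hit]:
    """Keep the N best-scoring hits that pass the threshold."""
    passing = [hit for hit in hits_in_db_order if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


hits = [("ref1", 0.85), ("ref2", 0.81), ("ref3", 0.99), ("ref4", 0.97)]
print(first_n_hits(hits, 0.8, 2))  # [('ref1', 0.85), ('ref2', 0.81)]
print(top_n_hits(hits, 0.8, 2))    # [('ref3', 0.99), ('ref4', 0.97)]
```

With a small maxaccepts, the first-N rule can miss the best matches entirely if they sit later in the database, which is why the order of the reference matters here.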
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity, otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
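The interplay of maxaccepts and maxrejects described above can be sketched as a loop over k-mer-ranked targets. This is a toy model under stated assumptions, not VSEARCH's algorithm: `toy_identity` is a position-by-position stand-in for a real pairwise alignment, `None` plays the role of "all", and the function names, k value, and sequences are made up.

```python
from typing import Dict, List, Optional, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(query: str, target: str) -> float:
    # Stand-in for alignment identity: fraction of matching positions.
    matches = sum(a == b for a, b in zip(query, target))
    return matches / max(len(query), len(target))


def search(query: str, targets: Dict[str, str], perc_identity: float = 0.8,
           maxaccepts: Optional[int] = 1, maxrejects: Optional[int] = None,
           k: int = 4) -> List[Tuple[str, float]]:
    query_kmers = kmers(query, k)
    # Rank targets by the number of k-mers shared with the query.
    ordered = sorted(targets,
                     key=lambda name: len(query_kmers & kmers(targets[name], k)),
                     reverse=True)
    accepted: List[Tuple[str, float]] = []
    rejected = 0
    for name in ordered:
        identity = toy_identity(query, targets[name])
        if identity >= perc_identity:
            accepted.append((name, identity))
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break  # enough hits: stop searching for this query
        else:
            rejected += 1
            if maxrejects is not None and rejected >= maxrejects:
                break  # too many failures: give up on this query
    return accepted


targets = {"t3": "TTTTTTTTTT", "t2": "ACGTACTTAC", "t1": "ACGTACGTAC"}
query = "ACGTACGTAC"
print(search(query, targets, maxaccepts=1))     # [('t1', 1.0)]
print(search(query, targets, maxaccepts=None))  # [('t1', 1.0), ('t2', 0.9)]
```

With both limits set to None the loop visits every target, which corresponds to searching the complete database.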
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). Because weak hits reported by weak_id do not count toward maxaccepts, high perc_identity values can be used, preserving both speed and sensitivity. weak_id must be smaller than perc_identity, otherwise this option is ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
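The effect of top_hits_only, keeping only the hits tied for the highest percent identity per query, can be sketched as follows. This is an illustration, not the plugin's or VSEARCH's code: the function name is hypothetical and the rows are a made-up, three-column reduction of BLAST6 output (query id, subject id, percent identity).

```python
from typing import Dict, Iterable, List, Tuple

Blast6Row = Tuple[str, str, float]  # (query id, subject id, percent identity)


def keep_top_hits(rows: Iterable[Blast6Row]) -> Dict[str, List[str]]:
    """For each query, keep every subject tied for the highest percent identity."""
    best_score: Dict[str, float] = {}
    best_hits: Dict[str, List[str]] = {}
    for query, subject, pident in rows:
        if query not in best_score or pident > best_score[query]:
            # New best score: discard any previously kept subjects.
            best_score[query] = pident
            best_hits[query] = [subject]
        elif pident == best_score[query]:
            best_hits[query].append(subject)
    return best_hits


rows = [
    ("q1", "refA", 99.0),
    ("q1", "refB", 99.0),
    ("q1", "refC", 97.5),
    ("q2", "refD", 88.0),
]
print(keep_top_hits(rows))  # {'q1': ['refA', 'refB'], 'q2': ['refD']}
```

The tied subjects for q1 are the ones that would then feed into the consensus step governed by min_consensus.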
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in pair with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default:
10
]- perc_identity:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if
prefilter
is disabled.[default:0.5
]- query_cov:
Float
%
Range
(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if
prefilter
is disabled.[default:0.8
]- strand:
Str
%
Choices
('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default:
'both'
]- min_consensus:
Float
%
Range
(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments must match top hit to be accepted as consensus assignment.[default:
0.51
]- maxhits:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
<no description>[default:
'all'
]- maxrejects:
Int
%
Range
(1, None)
|
Str
%
Choices
('all')
<no description>[default:
'all'
]- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default:
'auto'
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]- read_orientation:
Str
%
Choices
('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. same will cause reads to be classified unchanged; reverse-complement will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default:
'auto'
]- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default:
1
]- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default:
True
]- sample_size:
Int
%
Range
(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if
prefilter
is disabled.[default:1000
]- randseed:
Int
%
Range
(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if
prefilter
is disabled.[default:0
]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
This QIIME 2 plugin supports taxonomic classification of features using a variety of methods, including Naive Bayes, vsearch, and BLAST+.
- version:
2024.10.0
- website: https://
github .com /qiime2 /q2 -feature -classifier - user support:
- Please post to the QIIME 2 forum for help with this plugin: https://
forum .qiime2 .org - citations:
- Bokulich et al., 2018
Actions¶
Name | Type | Short Description |
---|---|---|
fit-classifier-sklearn | method | Train an almost arbitrary scikit-learn classifier |
classify-sklearn | method | Pre-fitted sklearn-based taxonomy classifier |
fit-classifier-naive-bayes | method | Train the naive_bayes classifier |
extract-reads | method | Extract reads from reference sequences. |
find-consensus-annotation | method | Find consensus among multiple annotations. |
makeblastdb | method | Make BLAST database. |
blast | method | BLAST+ local alignment search. |
vsearch-global | method | VSEARCH global alignment search |
classify-consensus-blast | pipeline | BLAST+ consensus taxonomy classifier |
classify-consensus-vsearch | pipeline | VSEARCH-based consensus taxonomy classifier |
classify-hybrid-vsearch-sklearn | pipeline | ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier |
Artifact Classes¶
BLASTDB |
TaxonomicClassifier |
Formats¶
BLASTDBDirFmtV5 |
TaxonomicClassifierDirFmt |
TaxonomicClassiferTemporaryPickleDirFmt |
feature-classifier fit-classifier-sklearn¶
Train a scikit-learn classifier to classify reads.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classifier_specification:
Str
<no description>[required]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
feature-classifier classify-sklearn¶
Classify reads by taxon using a fitted classifier.
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reads:
FeatureData[Sequence]
The feature data to be classified.[required]
- classifier:
TaxonomicClassifier
The taxonomic classifier for classifying the reads.[required]
Parameters¶
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
- pre_dispatch:
Str
"all" or expression, as in "3*n_jobs". The number of batches (of tasks) to be pre-dispatched.[default: '2*n_jobs']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
Outputs¶
- classification:
FeatureData[Taxonomy]
<no description>[required]
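As a rough illustration of how the confidence parameter can limit taxonomic depth, the sketch below truncates a lineage at the first rank whose confidence falls below the threshold. This is a minimal sketch, not the plugin's implementation: the function name limit_taxonomic_depth, the per-rank confidence list, and the "Unassigned" label returned when no rank passes are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def limit_taxonomic_depth(
    ranks: Sequence[str],
    rank_confidences: Sequence[float],
    confidence: float = 0.7,
) -> Tuple[str, Optional[float]]:
    """Keep ranks from the root down while confidence >= threshold.

    Hypothetical helper: assumes one confidence value per rank, ordered
    from the shallowest rank to the deepest.
    """
    if len(ranks) != len(rank_confidences):
        raise ValueError("ranks and rank_confidences must have equal length")
    kept: List[str] = []
    kept_confidence: Optional[float] = None
    for rank, value in zip(ranks, rank_confidences):
        if value < confidence:
            break  # stop at the first rank that is not confident enough
        kept.append(rank)
        kept_confidence = value
    if not kept:
        return "Unassigned", None  # assumed label when no rank passes
    return ";".join(kept), kept_confidence


# Made-up lineage and confidences, for illustration only.
lineage = ["k__Bacteria", "p__Firmicutes", "c__Bacilli"]
scores = [0.99, 0.85, 0.60]
print(limit_taxonomic_depth(lineage, scores, confidence=0.7))
print(limit_taxonomic_depth(lineage, scores, confidence=0))
```

With a threshold of 0 every rank passes, so the full lineage is kept while a confidence value is still reported, which is consistent with the parameter description.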
feature-classifier fit-classifier-naive-bayes¶
Create a scikit-learn naive_bayes classifier for reads
Citations¶
Bokulich et al., 2018; Pedregosa et al., 2011
Inputs¶
- reference_reads:
FeatureData[Sequence]
<no description>[required]
- reference_taxonomy:
FeatureData[Taxonomy]
<no description>[required]
- class_weight:
FeatureTable[RelativeFrequency]
<no description>[optional]
Parameters¶
- classify__alpha:
Float
<no description>[default: 0.001]
- classify__chunk_size:
Int
<no description>[default: 20000]
- classify__class_prior:
Str
<no description>[default: 'null']
- classify__fit_prior:
Bool
<no description>[default: False]
- feat_ext__alternate_sign:
Bool
<no description>[default: False]
- feat_ext__analyzer:
Str
<no description>[default: 'char_wb']
- feat_ext__binary:
Bool
<no description>[default: False]
- feat_ext__decode_error:
Str
<no description>[default: 'strict']
- feat_ext__encoding:
Str
<no description>[default: 'utf-8']
- feat_ext__input:
Str
<no description>[default: 'content']
- feat_ext__lowercase:
Bool
<no description>[default: True]
- feat_ext__n_features:
Int
<no description>[default: 8192]
- feat_ext__ngram_range:
Str
<no description>[default: '[7, 7]']
- feat_ext__norm:
Str
<no description>[default: 'l2']
- feat_ext__preprocessor:
Str
<no description>[default: 'null']
- feat_ext__stop_words:
Str
<no description>[default: 'null']
- feat_ext__strip_accents:
Str
<no description>[default: 'null']
- feat_ext__token_pattern:
Str
<no description>[default: '(?u)\\b\\w\\w+\\b']
- feat_ext__tokenizer:
Str
<no description>[default: 'null']
- verbose:
Bool
<no description>[default: False]
Outputs¶
- classifier:
TaxonomicClassifier
<no description>[required]
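The feat_ext__ defaults (ngram_range '[7, 7]', n_features 8192, lowercase True, alternate_sign False) suggest that each reference read is reduced to hashed counts of 7-character substrings before the naive Bayes step. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions: hashed_kmer_counts is a hypothetical helper, zlib.crc32 merely stands in for whatever hash function the real feature extractor uses, and the edge padding implied by 'char_wb' and the 'l2' normalization are left out.

```python
import zlib
from typing import List


def hashed_kmer_counts(sequence: str, k: int = 7, n_features: int = 8192) -> List[int]:
    """Count overlapping k-mers, folding each one into a fixed number of buckets.

    Hypothetical helper for illustration; crc32 is an assumed stand-in hash.
    """
    seq = sequence.lower()  # mirrors feat_ext__lowercase = True
    counts = [0] * n_features
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        bucket = zlib.crc32(kmer.encode("utf-8")) % n_features
        counts[bucket] += 1  # non-negative counts, as with alternate_sign = False
    return counts


features = hashed_kmer_counts("ACGTACGTAC")
print(len(features), sum(features))
```

A 10-nucleotide read contains 10 - 7 + 1 = 4 overlapping 7-mers, so the counts sum to 4 regardless of how the k-mers collide across buckets.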
feature-classifier extract-reads¶
Extract simulated amplicon reads from a reference database. Performs in-silico PCR to extract simulated amplicons from reference sequences that match the input primer sequences (within the mismatch threshold specified by identity). Both primer sequences must be in the 5' -> 3' orientation. Sequences that fail to match both primers will be excluded. Reads are extracted, trimmed, and filtered in the following order: 1. reads are extracted in specified orientation; 2. primers are removed; 3. reads longer than max_length are removed; 4. reads are trimmed with trim_right; 5. reads are truncated to trunc_len; 6. reads are trimmed with trim_left; 7. reads shorter than min_length are removed.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
<no description>[required]
Parameters¶
- f_primer:
Str
forward primer sequence (5' -> 3').[required]
- r_primer:
Str
reverse primer sequence (5' -> 3'). Do not use reverse-complemented primer sequence.[required]
- trim_right:
Int
trim_right nucleotides are removed from the 3' end if trim_right is positive. Applied before trunc_len and trim_left.[default: 0]
- trunc_len:
Int
read is cut to trunc_len if trunc_len is positive. Applied after trim_right but before trim_left.[default: 0]
- trim_left:
Int
trim_left nucleotides are removed from the 5' end if trim_left is positive. Applied after trim_right and trunc_len.[default: 0]
- identity:
Float
minimum combined primer match identity threshold.[default: 0.8]
- min_length:
Int % Range(0, None)
Minimum amplicon length. Shorter amplicons are discarded. Applied after trimming and truncation, so be aware that trimming may impact sequence retention. Set to zero to disable min length filtering.[default: 50]
- max_length:
Int % Range(0, None)
Maximum amplicon length. Longer amplicons are discarded. Applied before trimming and truncation, so plan accordingly. Set to zero (default) to disable max length filtering.[default: 0]
- n_jobs:
Int % Range(1, None)
Number of separate processes to run.[default: 1]
- batch_size:
Int % Range(1, None) | Str % Choices('auto')
Number of sequences to process in a batch. The "auto" option is calculated from the number of sequences and number of jobs specified.[default: 'auto']
- read_orientation:
Str % Choices('both', 'forward', 'reverse')
Orientation of primers relative to the sequences: "forward" searches for primer hits in the forward direction, "reverse" searches reverse-complement, and "both" searches both directions.[default: 'both']
Outputs¶
- reads:
FeatureData[Sequence]
<no description>[required]
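The order of operations listed in the extract-reads description (steps 3 to 7, applied once the primers have been removed) can be sketched as below. This is a minimal sketch under assumptions: trim_and_filter is a hypothetical helper, not the plugin's code, and it starts from an amplicon that has already been extracted and had its primers stripped.

```python
from typing import Optional


def trim_and_filter(
    amplicon: str,
    trim_right: int = 0,
    trunc_len: int = 0,
    trim_left: int = 0,
    min_length: int = 50,
    max_length: int = 0,
) -> Optional[str]:
    """Apply the documented order; return None when the read is discarded."""
    # Step 3: max_length is checked before any trimming (0 disables it).
    if max_length > 0 and len(amplicon) > max_length:
        return None
    # Step 4: remove trim_right nucleotides from the 3' end.
    if trim_right > 0:
        amplicon = amplicon[:-trim_right]
    # Step 5: truncate to trunc_len.
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    # Step 6: remove trim_left nucleotides from the 5' end.
    if trim_left > 0:
        amplicon = amplicon[trim_left:]
    # Step 7: min_length is checked after trimming (0 disables it).
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon


read = "A" * 10 + "C" * 80 + "G" * 10  # made-up 100 nt amplicon
result = trim_and_filter(read, trim_right=10, trunc_len=60, trim_left=10)
print(len(result))
```

Because min_length is applied last, a read that passes max_length can still be dropped after trimming, which is the retention caveat noted for min_length.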
feature-classifier find-consensus-annotation¶
Find consensus annotation for each query searched against a reference database, by finding the least common ancestor among one or more semicolon-delimited hierarchical annotations. Note that every annotation is assumed to have the same number of ranks.
Citations¶
Inputs¶
- search_results:
FeatureData[BLAST6]
Search results in BLAST6 output format[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
Parameters¶
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given when no consensus is found.[default: 'Unassigned']
Outputs¶
- consensus_taxonomy:
FeatureData[Taxonomy]
Consensus taxonomy and scores.[required]
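The least-common-ancestor idea behind find-consensus-annotation can be sketched as follows. This is an illustrative sketch rather than the plugin's implementation: consensus_annotation is a hypothetical helper, and the choices to narrow the annotations to the agreeing lineage at each rank and to report 0.0 as the score of an unassigned query are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def consensus_annotation(
    annotations: List[str],
    min_consensus: float = 0.51,
    unassignable_label: str = "Unassigned",
) -> Tuple[str, float]:
    """Walk down the ranks while the most common label keeps enough support."""
    if not annotations:
        return unassignable_label, 0.0
    lineages = [[rank.strip() for rank in a.split(";")] for a in annotations]
    total = len(lineages)
    depth = min(len(lineage) for lineage in lineages)
    consensus: List[str] = []
    score = 0.0
    for level in range(depth):
        label, count = Counter(l[level] for l in lineages).most_common(1)[0]
        fraction = count / total
        if fraction < min_consensus:
            break  # support dropped below the threshold: stop at the parent rank
        consensus.append(label)
        score = fraction
        # Assumption: deeper ranks are compared only within the agreeing lineage.
        lineages = [l for l in lineages if l[level] == label]
    if not consensus:
        return unassignable_label, 0.0
    return ";".join(consensus), score


# Made-up annotations for three hits of a single query.
hits = [
    "k__Bacteria;p__Firmicutes;g__Lactobacillus",
    "k__Bacteria;p__Firmicutes;g__Lactobacillus",
    "k__Bacteria;p__Firmicutes;g__Streptococcus",
]
print(consensus_annotation(hits))
print(consensus_annotation(hits, min_consensus=0.9))
```

With the default threshold, two of three hits agreeing on the genus is enough to keep it; raising min_consensus to 0.9 backs the assignment off to the phylum that all hits share.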
feature-classifier makeblastdb¶
Make BLAST database from custom sequence collection.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- sequences:
FeatureData[Sequence]
Input reference sequences.[required]
Outputs¶
- database:
BLASTDB
Output BLAST database.[required]
feature-classifier blast¶
Search for top hits in a reference database via local alignment between the query sequences and reference database sequences using BLAST+. Returns a report of the top M hits for each query (where M=maxaccepts).
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
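The maxaccepts note for blast distinguishes the first N hits that pass the threshold from the top N hits. The sketch below contrasts the two selections on made-up data. Both functions are hypothetical helpers written for illustration and do not call BLAST+; identities are expressed as fractions to match perc_identity.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (reference ID, fraction identity to the query)


def first_n_hits(hits_in_db_order: List[Hit], perc_identity: float = 0.8,
                 maxaccepts: int = 10) -> List[Hit]:
    """Keep hits in database order until maxaccepts passing hits are found."""
    accepted: List[Hit] = []
    for ref_id, identity in hits_in_db_order:
        if identity >= perc_identity:
            accepted.append((ref_id, identity))
            if len(accepted) == maxaccepts:
                break
    return accepted


def top_n_hits(hits: List[Hit], perc_identity: float = 0.8,
               maxaccepts: int = 10) -> List[Hit]:
    """Rank every passing hit by identity, then keep the best maxaccepts."""
    passing = [hit for hit in hits if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]


# Made-up hits, listed in the order they appear in the reference database.
database_hits = [("ref1", 0.82), ("ref2", 0.79), ("ref3", 0.99), ("ref4", 0.90)]
print(first_n_hits(database_hits, maxaccepts=2))
print(top_n_hits(database_hits, maxaccepts=2))
```

Here the first-N selection keeps ref1 even though ref4 is a closer match, which is why a small maxaccepts can change the consensus when the database is not sorted by similarity.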
feature-classifier vsearch-global¶
Search for top hits in a reference database via global alignment between the query sequences and reference database sequences using VSEARCH. Returns a report of the top M hits for each query (where M=maxaccepts or maxhits).
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
Outputs¶
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
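The interplay of maxaccepts and maxrejects described for vsearch-global can be sketched as a loop over targets ordered by shared k-mers. This is only a toy model under assumptions: kmer_ordered_search is a hypothetical helper, None stands in for "all", and positional_identity is a crude stand-in for the identity that a real global alignment would report.

```python
from typing import Dict, List, Optional, Tuple


def shared_kmers(a: str, b: str, k: int = 8) -> int:
    """Number of distinct k-mers that two sequences have in common."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def positional_identity(a: str, b: str) -> float:
    """Stand-in for alignment identity: matching positions / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def kmer_ordered_search(
    query: str,
    references: Dict[str, str],
    perc_identity: float = 0.8,
    maxaccepts: Optional[int] = 10,
    maxrejects: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Visit targets by decreasing shared k-mers; stop on either limit."""
    ranked = sorted(references,
                    key=lambda ref_id: shared_kmers(query, references[ref_id]),
                    reverse=True)
    accepted: List[Tuple[str, float]] = []
    rejected = 0
    for ref_id in ranked:
        identity = positional_identity(query, references[ref_id])
        if identity >= perc_identity:
            accepted.append((ref_id, identity))
            if maxaccepts is not None and len(accepted) >= maxaccepts:
                break
        else:
            rejected += 1
            if maxrejects is not None and rejected >= maxrejects:
                break
    return accepted


# Made-up 16 nt sequences for illustration.
query = "ACGTTGCAAGCTTAGC"
refs = {
    "far": "TTTTTTTTTTTTTTTT",
    "near": "ACGTTGCAAGCTTAGG",
    "exact": "ACGTTGCAAGCTTAGC",
}
print(kmer_ordered_search(query, refs, maxaccepts=1))
print(kmer_ordered_search(query, refs, maxaccepts=None, maxrejects=None))
```

With maxaccepts=1 the loop stops at the first accepted target; with both limits lifted it visits every reference, matching the statement that the complete database is searched when both are "all".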
feature-classifier classify-consensus-blast¶
Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts hits, min_consensus of which share that taxonomic assignment. Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches. For top N hits, use classify-consensus-vsearch.
Citations¶
Bokulich et al., 2018; Camacho et al., 2009
Inputs¶
- query:
FeatureData[Sequence]
Query sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
reference taxonomy labels.[required]
- blastdb:
BLASTDB
BLAST indexed database. Incompatible with reference_reads.[optional]
- reference_reads:
FeatureData[Sequence]
Reference sequences. Incompatible with blastdb.[optional]
Parameters¶
- maxaccepts:
Int % Range(1, None)
Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc_identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower. Note: this uses blastn's qcov_hsp_perc parameter, and may not behave identically to the query_cov parameter used by classify-consensus-vsearch.[default: 0.8]
- strand:
Str % Choices('both', 'plus', 'minus')
Align against reference sequences in forward ("plus"), reverse ("minus"), or both directions ("both").[default: 'both']
- evalue:
Float
BLAST expectation value (E) threshold for saving hits.[default: 0.001]
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs. Set to False to mirror the default BLAST search.[default: True]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
- num_threads:
Threads
Number of threads (CPUs) to use in the BLAST search. Pass 0 to use all available CPUs.[default: 1]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
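The search_results output is in BLAST6 format, the 12-column tab-separated layout of BLAST tabular output. A minimal sketch for grouping such rows by query is shown below. The helper name hits_by_query and the example rows are made up for illustration, and the sketch reads plain text rather than a QIIME 2 artifact.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Standard column order of BLAST tabular output (format 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def hits_by_query(blast6_text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Group (reference ID, percent identity) pairs by query ID."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for row in csv.reader(io.StringIO(blast6_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(BLAST6_COLUMNS):
            raise ValueError(f"expected 12 fields, found {len(row)}")
        record = dict(zip(BLAST6_COLUMNS, row))
        grouped[record["qseqid"]].append(
            (record["sseqid"], float(record["pident"]))
        )
    return dict(grouped)


# Made-up rows for illustration.
example = (
    "q1\tref1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-95\t350\n"
    "q1\tref2\t97.0\t200\t6\t0\t1\t200\t1\t200\t1e-90\t340\n"
    "q2\tref3\t91.0\t200\t18\t0\t1\t200\t1\t200\t1e-70\t280\n"
)
print(hits_by_query(example))
```

Grouping the hits per query is the natural intermediate step between the search and the consensus assignment, since each query's reference IDs can then be looked up in the reference taxonomy.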
feature-classifier classify-consensus-vsearch¶
Assign taxonomy to query sequences using VSEARCH. Performs VSEARCH global alignment between query and reference_reads, then assigns consensus taxonomy to each query sequence from among maxaccepts top hits, min_consensus of which share that taxonomic assignment. Unlike classify-consensus-blast, this method searches the entire reference database before choosing the top N hits, not the first N hits.
Citations¶
Bokulich et al., 2018; Rognes et al., 2016
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if percent identity to query is lower.[default: 0.8]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Reject match if query alignment coverage per high-scoring pair is lower.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- search_exact:
Bool
Search for exact full-length matches to the query sequences. Only 100% exact matches are reported and this command is much faster than the default. If True, the perc_identity, query_cov, maxaccepts, and maxrejects settings are ignored. Note: query and reference reads must be trimmed to the exact same DNA locus (e.g., primer site) because only exact matches will be reported.[default: False]
- top_hits_only:
Bool
Only the top hits between the query and reference sequence sets are reported. For each query, the top hit is the one presenting the highest percentage of identity. Multiple equally scored top hits will be used for consensus taxonomic assignment if maxaccepts is greater than 1.[default: False]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to show once the search is terminated.[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
Maximum number of non-matching target sequences to consider before stopping the search. This option works in tandem with maxaccepts (see maxaccepts description for details).[default: 'all']
- output_no_hits:
Bool
Report both matching and non-matching queries. WARNING: always use the default setting for this option unless you know what you are doing! If you set this option to False, your sequences and feature table will need to be filtered to exclude unclassified sequences, otherwise you may run into errors downstream from missing feature IDs.[default: True]
- weak_id:
Float % Range(0.0, 1.0, inclusive_end=True)
Show hits with percentage of identity of at least N, without terminating the search. A normal search stops as soon as enough hits are found (as defined by maxaccepts, maxrejects, and perc_identity). As weak_id reports weak hits that are not deduced from maxaccepts, high perc_identity values can be used, hence preserving both speed and sensitivity. Logically, weak_id must be smaller than the value indicated by perc_identity, otherwise this option will be ignored.[default: 0.0]
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- unassignable_label:
Str
Annotation given to sequences without any hits.[default: 'Unassigned']
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
- search_results:
FeatureData[BLAST6]
Top hits for each query.[required]
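The effect described for top_hits_only, keeping only the hits with the highest percent identity for each query and retaining ties, can be sketched as below. The helper keep_top_hits and the example hits are hypothetical and exist only to illustrate the filtering rule.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query ID, reference ID, percent identity)


def keep_top_hits(hits: List[Hit]) -> List[Hit]:
    """Keep, per query, every hit that ties for the highest identity."""
    best: Dict[str, float] = {}
    for query_id, _ref_id, identity in hits:
        best[query_id] = max(best.get(query_id, identity), identity)
    return [hit for hit in hits if hit[2] == best[hit[0]]]


# Made-up hits for illustration.
all_hits = [
    ("q1", "refA", 99.0),
    ("q1", "refB", 99.0),
    ("q1", "refC", 97.5),
    ("q2", "refD", 88.0),
]
print(keep_top_hits(all_hits))
```

Both 99.0% hits for q1 survive the filter, so a consensus step would still see two annotations for that query, in line with the note about equally scored top hits.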
feature-classifier classify-hybrid-vsearch-sklearn¶
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to https://
Citations¶
Inputs¶
- query:
FeatureData[Sequence]
Query Sequences.[required]
- reference_reads:
FeatureData[Sequence]
Reference sequences.[required]
- reference_taxonomy:
FeatureData[Taxonomy]
Reference taxonomy labels.[required]
- classifier:
TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.[required]
Parameters¶
- maxaccepts:
Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to "all" to keep all hits > perc_identity similarity. Note that if strand=both, maxaccepts will keep N hits for each direction (if searches in the opposite direction yield results that exceed the minimum perc_identity). In those cases use maxhits to control the total number of hits returned. This option works in tandem with maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as best hit and the search process stops for that query. If maxaccepts is set to a higher value, more hits are accepted. If maxaccepts and maxrejects are both set to "all", the complete database is searched.[default: 10]
- perc_identity:
Float % Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER. Reject match if percent identity to query is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.5]
- query_cov:
Float % Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject match if query alignment coverage per high-scoring pair is lower. Set to a lower value to perform a rough pre-filter. This parameter is ignored if prefilter is disabled.[default: 0.8]
- strand:
Str % Choices('both', 'plus')
Align against reference sequences in forward ("plus") or both directions ("both").[default: 'both']
- min_consensus:
Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True)
Minimum fraction of assignments that must match the top hit for it to be accepted as the consensus assignment.[default: 0.51]
- maxhits:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- maxrejects:
Int % Range(1, None) | Str % Choices('all')
<no description>[default: 'all']
- reads_per_batch:
Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn classification. If "auto", this parameter is autoscaled to min(number of query sequences / threads, 20000).[default: 'auto']
- confidence:
Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
- read_orientation:
Str % Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference sequences in pre-trained sklearn classifier. "same" will cause reads to be classified unchanged; "reverse-complement" will cause reads to be reversed and complemented prior to classification. "auto" will autodetect orientation based on the confidence estimates for the first 100 reads.[default: 'auto']
- threads:
Threads
Number of threads to use for job parallelization. Pass 0 to use one per available CPU.[default: 1]
- prefilter:
Bool
Toggle positive filter of query sequences on or off.[default: True]
- sample_size:
Int % Range(1, None)
Randomly extract the given number of sequences from the reference database to use for prefiltering. This parameter is ignored if prefilter is disabled.[default: 1000]
- randseed:
Int % Range(0, None)
Use integer as a seed for the pseudo-random generator used during prefiltering. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed. This parameter is ignored if prefilter is disabled.[default: 0]
Outputs¶
- classification:
FeatureData[Taxonomy]
Taxonomy classifications of query sequences.[required]
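The sample_size and randseed parameters describe drawing a reproducible random subset of the reference database for prefiltering. The sketch below illustrates that behaviour with Python's standard random module. It is an assumption-based illustration, not the pipeline's code: sample_references is a hypothetical helper, and treating a seed of 0 as "seed from system entropy" mirrors the parameter description rather than any confirmed implementation detail.

```python
import random
from typing import Iterable, List


def sample_references(reference_ids: Iterable[str], sample_size: int = 1000,
                      randseed: int = 0) -> List[str]:
    """Draw up to sample_size reference IDs; a nonzero seed is reproducible."""
    ids = list(reference_ids)
    if sample_size >= len(ids):
        return ids  # nothing to subsample
    # Assumption: randseed == 0 means "let the generator seed itself".
    rng = random.Random(randseed if randseed != 0 else None)
    return rng.sample(ids, sample_size)


# Made-up reference IDs for illustration.
reference_ids = [f"ref{i}" for i in range(20)]
first_draw = sample_references(reference_ids, sample_size=5, randseed=42)
second_draw = sample_references(reference_ids, sample_size=5, randseed=42)
print(first_draw == second_draw)
```

Fixing the seed makes the prefilter subset, and therefore which queries pass the positive filter, repeatable between runs.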
- Available Distros
- 2024.10
- 2024.10/amplicon
- 2024.10/metagenome
- 2024.10/pathogenome
- 2024.5
- 2024.5/amplicon
- 2024.5/metagenome
- 2024.2
- 2024.2/amplicon
- 2023.9
- 2023.9/amplicon
- 2023.7
- 2023.7/core