Reference sequence annotation and curation pipeline.
- version:
2024.10.0
- website: https://github.com/nbokulich/RESCRIPt
- user support:
- Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
- citations:
- Robeson et al., 2021
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies. |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies. |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so it should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'.[default:['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
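The "len" and "lca" modes described above can be illustrated with a minimal sketch. This is not RESCRIPt's implementation: the function names `split_ranks`, `merge_len`, and `merge_lca` are hypothetical, taxonomies are assumed to be plain semicolon-delimited strings, and rank-handle stripping is omitted.

```python
from typing import List


def split_ranks(taxonomy: str) -> List[str]:
    """Split a semicolon-delimited taxonomy string into stripped labels."""
    return [label.strip() for label in taxonomy.split(";") if label.strip()]


def merge_len(taxonomies: List[str]) -> str:
    """'len'-style merge: keep the annotation with the most elements.

    Ties go to the last taxonomy supplied, following the note above.
    """
    best = taxonomies[0]
    for tax in taxonomies[1:]:
        if len(split_ranks(tax)) >= len(split_ranks(best)):
            best = tax
    return best


def merge_lca(taxonomies: List[str],
              unclassified_label: str = "Unassigned") -> str:
    """'lca'-style merge: keep only the leading ranks shared by all inputs."""
    consensus = []
    for labels in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(labels)) != 1:
            break  # first disagreement ends the shared lineage
        consensus.append(labels[0])
    return "; ".join(consensus) if consensus else unclassified_label
```

For example, merging "g__Faecalibacterium" with "g__Faecalibacterium; s__prausnitzii" in this sketch returns the longer string under `merge_len` and the shared genus under `merge_lca`.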
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
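As a rough illustration of the "majority" mode, the sketch below groups exact-duplicate sequences and labels each group with its most common taxonomy. It is only a toy under stated assumptions: the real action clusters with VSEARCH, whereas here `dereplicate_majority` is a hypothetical helper that works on plain dictionaries and keeps the first ID seen as the representative.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def dereplicate_majority(
    seqs: Dict[str, str], taxa: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Collapse identical sequences; label each with its majority taxonomy."""
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq].append(seq_id)

    derep_seqs, derep_taxa = {}, {}
    for seq, ids in groups.items():
        rep_id = ids[0]  # assumption: first ID seen represents the group
        # most_common(1) breaks ties by first occurrence, i.e. arbitrarily
        label, _ = Counter(taxa[i] for i in ids).most_common(1)[0]
        derep_seqs[rep_id] = seq
        derep_taxa[rep_id] = label
    return derep_seqs, derep_taxa
```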
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
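The two screening criteria can be sketched as below. This is an illustrative assumption of how such a check might look, not the plugin's code: `should_cull` is a hypothetical name, the degenerate set is the IUPAC DNA codes other than A, C, G, and T, and a run of any repeated character (including N) is treated as a homopolymer.

```python
import re

# IUPAC DNA codes other than the four unambiguous bases
DEGENERATES = set("RYSWKMBDHVN")


def should_cull(seq: str, num_degenerates: int = 5,
                homopolymer_length: int = 8) -> bool:
    """Return True if a sequence fails either screening criterion."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    n_degenerate = sum(base in DEGENERATES for base in seq)
    if n_degenerate >= num_degenerates:
        return True
    # One character followed by at least (length - 1) repeats of itself
    pattern = r"(.)\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is not None
```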
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences. Essentially, 'unaligning' the sequences.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
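Degapping as described above amounts to deleting "-" and "." characters and then applying the minimum length. A minimal sketch, assuming sequences are held in a plain dictionary and using the hypothetical function name `degap`:

```python
from typing import Dict


def degap(aligned: Dict[str, str], min_length: int = 1) -> Dict[str, str]:
    """Remove gap ('-') and missing-data ('.') characters from each sequence.

    Sequences shorter than min_length after degapping are dropped.
    """
    degapped = {}
    for seq_id, seq in aligned.items():
        ungapped = seq.replace("-", "").replace(".", "")
        if len(ungapped) >= min_length:
            degapped[seq_id] = ungapped
    return degapped
```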
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
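The pairing of search and replacement strings, and the difference between literal and regular-expression matching, can be sketched as follows. This is a simplified assumption rather than the plugin's code: `edit_taxonomy` here is a hypothetical standalone function over a dictionary of taxonomy strings, and the replacement map file is left out.

```python
import re
from typing import Dict, List


def edit_taxonomy(taxonomy: Dict[str, str], search_strings: List[str],
                  replacement_strings: List[str],
                  use_regex: bool = False) -> Dict[str, str]:
    """Apply paired find-and-replace terms to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be equal length")
    edited = {}
    for feature_id, tax in taxonomy.items():
        # The i-th search string is replaced by the i-th replacement string
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                tax = re.sub(search, replacement, tax)
            else:
                tax = tax.replace(search, replacement)  # literal substring
        edited[feature_id] = tax
    return edited
```

With literal matching, a search term such as "^k__" is looked for verbatim and changes nothing; with `use_regex=True` the same term anchors to the start of the string.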
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
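To make the idea of orientation concrete, here is a toy sketch. It is not how VSEARCH works: instead of alignment it simply counts shared k-mers between the query (in both orientations) and a single reference, and the names `reverse_complement`, `kmers`, and `orient` are hypothetical.

```python
from typing import Optional, Set

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 4) -> Set[str]:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(seq: str, reference: str, k: int = 4) -> Optional[str]:
    """Return seq in the orientation sharing more k-mers with reference.

    Returns None when neither orientation shares any k-mer (unmatched).
    """
    ref_kmers = kmers(reference, k)
    forward_hits = len(kmers(seq, k) & ref_kmers)
    reverse_hits = len(kmers(reverse_complement(seq), k) & ref_kmers)
    if forward_hits == 0 and reverse_hits == 0:
        return None
    return seq if forward_hits >= reverse_hits else reverse_complement(seq)
```

In this sketch, queries returning None correspond loosely to the unmatched output, and calling `reverse_complement` alone corresponds to running without a reference.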
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
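The "most stringent threshold" rule for sequences matching several labels can be sketched as below. This is a hedged illustration only: `passes_length_filter` is a hypothetical helper, label matching is assumed to be a plain substring test on the taxonomy string, and the global thresholds are assumed to combine with the per-label thresholds in the same most-stringent way.

```python
from typing import List, Optional


def passes_length_filter(seq_len: int, taxonomy: str, labels: List[str],
                         min_lens: Optional[List[int]] = None,
                         max_lens: Optional[List[int]] = None,
                         global_min: Optional[int] = None,
                         global_max: Optional[int] = None) -> bool:
    """Apply the longest minimum and shortest maximum that match."""
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    for i, label in enumerate(labels):
        if label in taxonomy:  # assumption: substring match
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    if mins and seq_len < max(mins):
        return False
    if maxs and seq_len > min(maxs):
        return False
    return True
```

With labels "Bacteria" (minimum 1200) and "Faecalibacterium" (minimum 1400), a 1300 nt Faecalibacterium sequence is discarded in this sketch while a 1300 nt sequence from another bacterial genus is kept.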
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a Greengenes-like fixed-rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
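Rank propagation, as described for the rank_propagation parameter, can be sketched on a single taxonomy string. This is an illustrative assumption, not the parser itself: `propagate_ranks` is a hypothetical function, and it assumes every label has the form handle, double underscore, name.

```python
def propagate_ranks(taxonomy: str) -> str:
    """Fill empty ranks with the nearest upper-level name in the lineage."""
    filled = []
    last_name = ""
    for label in taxonomy.split(";"):
        handle, _, name = label.strip().partition("__")
        if name:
            last_name = name  # remember the most recent non-empty rank
        filled.append(f"{handle}__{name or last_name}")
    return "; ".join(filled)
```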
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
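In sequence-handling terms, reverse transcription here can be thought of as rewriting uracil as thymine while leaving everything else (including alignment gaps) untouched. A minimal sketch, assuming plain strings and the hypothetical function name `reverse_transcribe`:

```python
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna: str) -> str:
    """Replace U with T; gaps and other characters are left as they are."""
    return rna.translate(RNA_TO_DNA)
```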
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
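The order of operations stated above (include, then exclude, then ids_to_keep) can be sketched as follows. Several details are assumptions because the text does not specify them: matching is treated as case-sensitive substring search, ids_to_keep is treated as a final intersection, and `filter_taxa` here is a hypothetical standalone function over a dictionary.

```python
from typing import Dict, Iterable, Optional, Set


def filter_taxa(taxonomy: Dict[str, str],
                include: Optional[Iterable[str]] = None,
                exclude: Optional[Iterable[str]] = None,
                ids_to_keep: Optional[Set[str]] = None) -> Dict[str, str]:
    """Filter taxonomy: inclusion first, exclusion second, IDs last."""
    result = dict(taxonomy)
    if include:
        terms = list(include)
        result = {fid: tax for fid, tax in result.items()
                  if any(term in tax for term in terms)}
    if exclude:
        terms = list(exclude)
        result = {fid: tax for fid, tax in result.items()
                  if not any(term in tax for term in terms)}
    if ids_to_keep is not None:
        # assumption: keep only surviving IDs that are also listed
        result = {fid: tax for fid, tax in result.items()
                  if fid in ids_to_keep}
    return result
```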
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
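Seeded, fraction-based subsampling can be sketched with the standard library. This is one plausible reading, not the plugin's algorithm: `subsample` is a hypothetical function, the sample size is assumed to be the rounded product of the fraction and the sequence count, and IDs are sorted first so that the same seed gives the same sample.

```python
import random
from typing import Dict


def subsample(seqs: Dict[str, str], subsample_size: float = 0.1,
              random_seed: int = 1) -> Dict[str, str]:
    """Draw a reproducible random fraction of the input sequences."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in (0, 1]")
    n = round(len(seqs) * subsample_size)  # assumption: rounded count
    rng = random.Random(random_seed)       # local generator, seeded
    chosen = rng.sample(sorted(seqs), n)
    return {seq_id: seqs[seq_id] for seq_id in chosen}
```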
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
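The rank propagation rule described above can be illustrated with a minimal Python sketch. It is not RESCRIPt's implementation: the function name `propagate_ranks` is hypothetical, and the sketch assumes every rank carries a handle of the form `x__` and that ranks are separated by semicolons.

```python
def propagate_ranks(taxonomy: str) -> str:
    """Fill empty ranks with the nearest non-empty upper-level label.

    Assumption: each rank looks like '<handle>__<label>', e.g. 'g__Haemophilus'.
    """
    filled = []
    last_label = ""
    for rank in taxonomy.split(";"):
        handle, sep, label = rank.strip().partition("__")
        if label:
            last_label = label   # remember the most recent real label
        elif last_label:
            label = last_label   # propagate the upper-level label downward
        filled.append(f"{handle}{sep}{label}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```

With propagation switched off, the empty 'g__' label would presumably be left as it is.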
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
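The percent-encoding rules described for "rql_query" can be sketched with Python's standard library. The helper name `rql_in_query` is hypothetical and not part of RESCRIPt or BV-BRC. The sketch also assumes that only values containing whitespace need to be wrapped in encoded quotes, which is what the two examples above suggest.

```python
from urllib.parse import quote


def rql_in_query(data_field: str, values: list[str]) -> str:
    """Build an RQL 'in' query string with percent-encoded values."""
    encoded = []
    for value in values:
        if any(ch.isspace() for ch in value):
            # Assumption: values with spaces are wrapped in double quotes,
            # and both the quotes (%22) and the spaces (%20) are encoded.
            encoded.append(quote(f'"{value}"', safe=""))
        else:
            encoded.append(quote(value, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in_query("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in_query("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```

The resulting string is what would be passed as the "rql_query" value.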
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. As a third option, a metadata column can be provided to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
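To make the entropy statistics more concrete, here is a minimal sketch of sequence entropy and kmer entropy. It assumes both are Shannon entropies (log base 2) over the observed sequences and over all overlapping kmers, respectively. The exact definition used by the plugin is not stated here, so treat the helper names `shannon_entropy` and `kmers` as hypothetical illustrations rather than RESCRIPt's code.

```python
import math
from collections import Counter


def shannon_entropy(items) -> float:
    """Shannon entropy (bits) of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seqs = ["ACGT", "ACGT", "TTTT", "GGCC"]

# Entropy over whole sequences: higher when sequences are more distinct.
print(shannon_entropy(seqs))

# Entropy over 2-mers pooled from all sequences.
print(shannon_entropy(km for s in seqs for km in kmers(s, 2)))
```

Because the number of kmers grows with both sequence count and length, subsampling the input (as "subsample_kmers" allows) is a natural way to bound the cost of the second calculation.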
Outputs¶
- visualization:
Visualization
<no description>[required]
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
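The "auto" rule for "reads_per_batch" is simple arithmetic, sketched below. The function name is hypothetical. The sketch assumes integer (floor) division and that "n_jobs" has already been resolved to a positive worker count (the "0 means all CPUs" case is not handled).

```python
def autoscale_reads_per_batch(n_query_seqs: int, n_jobs: int, cap: int = 20000) -> int:
    """min(number of query sequences / n_jobs, 20000), using floor division."""
    if n_jobs < 1:
        raise ValueError("n_jobs must already be resolved to a positive worker count")
    return min(n_query_seqs // n_jobs, cap)


for n_seqs, jobs in [(10_000, 4), (1_000_000, 8)]:
    print(n_seqs, jobs, autoscale_reads_per_batch(n_seqs, jobs))
# 10000 4 2500
# 1000000 8 20000
```

Small inputs are split evenly across workers, while large inputs are capped at 20000 reads per batch.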
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
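The effect of stripping rank handles can be sketched with Python's `re` module. This is an illustration only: the helper `strip_rank_handles` is hypothetical, and the pattern used is '^[dkpcofgs]__' (including the trailing double underscore), which is an assumption about what a handle looks like in greengenes-style or SILVA-style labels.

```python
import re

# Assumed handle pattern: one rank letter followed by a double underscore.
RANK_HANDLE_REGEX = r"^[dkpcofgs]__"


def strip_rank_handles(taxonomy: str, regex: str = RANK_HANDLE_REGEX) -> list[str]:
    """Strip rank handles, then drop levels that are left empty."""
    labels = [re.sub(regex, "", rank.strip()) for rank in taxonomy.split(";")]
    return [label for label in labels if label]


print(strip_rank_handles("d__Bacteria; p__Firmicutes; g__; s__"))
# ['Bacteria', 'Firmicutes']
```

Once the handles are gone, empty levels such as 'g__' and 's__' no longer count as taxonomic information.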
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a
silva_sequences
output is still created, but contains no data.[default:True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use
auto
to automatically use all available cores)[default:1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with
rank_handle_regex
if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default:['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
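The "len" and "lca" merge modes can be sketched in a few lines of Python. This is a simplified illustration, not the plugin's code. The function names are hypothetical, and the inputs are assumed to be semicolon-delimited strings whose rank handles have already been stripped. Scores and the "majority" and "super" modes are not modeled.

```python
def split_ranks(taxonomy: str) -> list[str]:
    """Split a semicolon-delimited taxonomy into non-empty labels."""
    return [rank.strip() for rank in taxonomy.split(";") if rank.strip()]


def merge_len(taxonomies: list[str]) -> str:
    """Pick the taxonomy with the most elements; later inputs win ties."""
    best = taxonomies[0]
    for candidate in taxonomies[1:]:
        # ">=" lets the last taxonomy added take precedence on a tie.
        if len(split_ranks(candidate)) >= len(split_ranks(best)):
            best = candidate
    return best


def merge_lca(taxonomies: list[str], unclassified_label: str = "Unassigned") -> str:
    """Report the ranks shared by all taxonomies, from the top down."""
    shared = []
    for level in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return "; ".join(shared) if shared else unclassified_label


a = "Bacteria; Firmicutes; Clostridia"
b = "Bacteria; Firmicutes"
c = "Bacteria; Firmicutes; Bacilli"
print(merge_len([a, b]))        # Bacteria; Firmicutes; Clostridia
print(merge_lca([a, c]))        # Bacteria; Firmicutes
print(merge_lca([a, "Archaea"]))  # Unassigned
```

The last call shows where "unclassified_label" comes into play: when the inputs share no common ancestor, the LCA cannot be resolved.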
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
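The "lca" and "majority" modes described above can be illustrated with a minimal pure-Python sketch. This is not the plugin's implementation: the helper names `split_ranks`, `lca`, and `majority` are hypothetical, and tie-breaking by first occurrence is an assumption (the plugin only states that ties are resolved arbitrarily).

```python
from collections import Counter

def split_ranks(taxon):
    # Normalize to semicolon-delimited ranks without surrounding spaces.
    return [rank.strip() for rank in taxon.split(";")]

def lca(taxa):
    # Keep only the leading ranks shared by every taxonomy in the group.
    shared = []
    for ranks in zip(*(split_ranks(t) for t in taxa)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)

def majority(taxa):
    # Most common full label; ties fall to the first label seen (assumption).
    normalized = [";".join(split_ranks(t)) for t in taxa]
    return Counter(normalized).most_common(1)[0][0]

# Three taxonomies attached to one duplicated sequence (illustrative data).
group = [
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__",
]
print(lca(group))
print(majority(group))
```

In this toy group the two strategies disagree: the LCA stops at the genus, while the majority label keeps the species annotation.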
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
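The two screening criteria can be sketched in plain Python as follows. This is an illustrative approximation rather than the plugin's code: the function name `passes_cull` is hypothetical, and treating a run of a degenerate base as a homopolymer is an assumption of this sketch.

```python
import re

# IUPAC degenerate nucleotide codes (everything except A, C, G, T).
DEGENERATES = set("RYSWKMBDHVN")

def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    # RNA input is handled as DNA by converting U to T first.
    seq = seq.upper().replace("U", "T")
    n_degenerate = sum(base in DEGENERATES for base in seq)
    if n_degenerate >= num_degenerates:
        return False
    # One character followed by at least (homopolymer_length - 1) repeats.
    run = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return run.search(seq) is None

print(passes_cull("ACGT" * 5))
print(passes_cull("ACGTNNNNN"))
print(passes_cull("AC" + "A" * 8 + "GT"))
```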
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
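As a rough illustration of the degapping step, the sketch below strips "-" and "." characters and applies the minimum-length cutoff. The function `degap` and the dictionary-based input are assumptions made for this example; the plugin itself operates on QIIME 2 artifacts.

```python
def degap(aligned, min_length=1):
    # aligned maps sequence IDs to aligned sequence strings.
    degapped = {}
    for seq_id, seq in aligned.items():
        stripped = seq.replace("-", "").replace(".", "")
        # Drop sequences that end up shorter than min_length.
        if len(stripped) >= min_length:
            degapped[seq_id] = stripped
    return degapped

print(degap({"a": "AC--GT..", "b": "----"}))
```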
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
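The pairing of search and replacement strings can be pictured with a short pure-Python sketch. The function `edit_taxonomy` below is a hypothetical stand-in, not the plugin's code, and applying the pairs in list order is an assumption of this example.

```python
import re

def edit_taxonomy(taxonomy, search_strings, replacement_strings,
                  use_regex=False):
    # The i-th search string is paired with the i-th replacement string.
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists differ in length")
    edited = {}
    for feature_id, taxon in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                taxon = re.sub(search, replacement, taxon)
            else:
                # Literal substring replacement.
                taxon = taxon.replace(search, replacement)
        edited[feature_id] = taxon
    return edited

taxonomy = {"s1": "k__Bacteria; p__Firmicutes"}
print(edit_taxonomy(taxonomy, ["k__", "Firmicutes"], ["d__", "Bacillota"]))
```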
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
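When no reference is supplied, every input sequence is reverse-complemented. A minimal sketch of that operation is shown here; the helper name `reverse_complement` is hypothetical, and the IUPAC-aware complement table is an assumption of this example rather than something taken from the plugin.

```python
# Complement table covering standard bases and IUPAC degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    # Complement each base, then reverse the string.
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("AACG"))
```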
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
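The "most stringent threshold" rule for sequences matching several labels can be sketched as below. The function `keep_sequence` is hypothetical and uses simple substring matching on the taxonomy string, which is an assumption of this example.

```python
def keep_sequence(seq, taxon, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    # Collect every threshold that applies to this sequence.
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum.
    if mins and len(seq) < max(mins):
        return False
    if maxs and len(seq) > min(maxs):
        return False
    return True

labels = ["Bacteria", "Lactobacillus"]
min_lens = [1200, 1400]
print(keep_sequence("A" * 1300, "d__Bacteria; g__Lactobacillus",
                    labels, min_lens=min_lens))
print(keep_sequence("A" * 1300, "d__Bacteria; g__Escherichia",
                    labels, min_lens=min_lens))
```

Here the nested filter holds the genus to a stricter minimum (1400) than the rest of the domain (1200), so the same 1300 nt length is rejected for one taxon and accepted for the other.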
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
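Rank propagation, as described for the rank_propagation parameter, can be illustrated on a plain taxonomy string. The helper `propagate_ranks` is a hypothetical sketch that assumes rank handles of the form 'x__' separated by semicolons; it is not the plugin's implementation.

```python
def propagate_ranks(taxon):
    filled = []
    last_name = ""
    for rank in taxon.split(";"):
        # Split 'g__Name' into the handle 'g' and the name 'Name'.
        handle, _, name = rank.strip().partition("__")
        if name:
            last_name = name
        # An empty rank inherits the closest named upper-level rank.
        filled.append(f"{handle}__{name or last_name}")
    return "; ".join(filled)

print(propagate_ranks("f__Pasteurellaceae; g__"))
```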
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
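The order of operations stated in the parameter descriptions (include, then exclude, then ids_to_keep) can be sketched as follows. The function below is a hypothetical pure-Python stand-in: case-sensitive substring matching, and treating ids_to_keep as a restriction on the rows that survive the earlier steps, are both assumptions of this example.

```python
def filter_taxa(taxonomy, ids_to_keep=None, include=None, exclude=None):
    kept = dict(taxonomy)
    # 1. Inclusion: keep taxa containing at least one include term.
    if include:
        kept = {i: t for i, t in kept.items()
                if any(term in t for term in include)}
    # 2. Exclusion: drop taxa containing any exclude term.
    if exclude:
        kept = {i: t for i, t in kept.items()
                if not any(term in t for term in exclude)}
    # 3. ID selection is applied last (assumed to restrict the result).
    if ids_to_keep is not None:
        kept = {i: t for i, t in kept.items() if i in ids_to_keep}
    return kept

taxonomy = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Proteobacteria",
    "c": "d__Archaea; p__Euryarchaeota",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"]))
```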
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
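One way to picture seeded fractional subsampling is the sketch below. It draws a fixed number of IDs, which is an assumption of this example; the plugin may select sequences differently (for instance, by deciding independently for each sequence), so the exact sample will not match. The function name `subsample` is hypothetical.

```python
import random

def subsample(seq_ids, subsample_size=0.1, random_seed=1):
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    # A dedicated generator keeps the draw reproducible for a given seed.
    rng = random.Random(random_seed)
    # Truncating to an integer count is a choice made for this sketch.
    n = int(len(seq_ids) * subsample_size)
    return rng.sample(list(seq_ids), n)

ids = [f"seq{i}" for i in range(100)]
picked = subsample(ids, subsample_size=0.1, random_seed=1)
print(len(picked))
```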
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If the number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
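Building an RQL "in" query with percent-encoded values, as described for the rql_query parameter, can be sketched with the standard library. The helper `rql_in` is hypothetical, and the rule used here to decide which values need quoting (anything beyond letters, digits, ".", "_" and "-") is an assumption of this example.

```python
from urllib.parse import quote

def rql_in(data_field, values):
    encoded = []
    for value in values:
        value = str(value)
        # Values with characters such as spaces are wrapped in quotes,
        # then percent-encoded (" becomes %22, space becomes %20).
        if any(not (ch.isalnum() or ch in "._-") for ch in value):
            value = quote(f'"{value}"', safe="")
        encoded.append(value)
    return f"in({data_field},({','.join(encoded)}))"

print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
```

The two calls reproduce the example queries given in the rql_query description.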
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of the specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: use an RQL query to refine your search and get targeted features; provide IDs/values and a corresponding data field to retrieve all features associated with those values in that data field; or provide a metadata column, so that metadata obtained with the action get-bv-brc-metadata serves as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
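The rank propagation behaviour described for rank_propagation can be illustrated with a small Python sketch. This is not RESCRIPt's implementation; the function name `propagate_ranks` is hypothetical, and it assumes every level carries a rank handle of the form `x__` and that levels are separated by semicolons.

```python
def propagate_ranks(taxonomy):
    """Fill empty rank labels with the nearest named upper-level label.

    Illustrative only: assumes each level looks like '<handle>__<name>'.
    """
    filled = []
    last_name = ""
    for label in taxonomy.split(";"):
        handle, _, name = label.strip().partition("__")
        if name:
            last_name = name
        # An empty name inherits the closest named rank above it
        filled.append(f"{handle}__{name or last_name}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```

Applied to the example from the parameter description, the empty genus level takes over the family name.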
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
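To make the kmer entropy statistic more concrete, the sketch below computes a Shannon entropy over pooled kmer counts. It is an assumption that this matches the measure used by evaluate-seqs; the function names, the pooling of all sequences, and the natural-log base are choices made for this illustration only.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def kmer_entropy(sequences, k):
    """Entropy of the kmer frequency distribution pooled over all sequences."""
    kmers = Counter()
    for seq in sequences:
        # Slide a window of width k across the sequence
        for i in range(len(seq) - k + 1):
            kmers[seq[i:i + k]] += 1
    return shannon_entropy(kmers.values())


print(kmer_entropy(["ACGTACGT", "ACGTTTTT"], k=3))
```

Counting every kmer in every sequence is what makes this step slow for large inputs, which is the motivation for subsample_kmers.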
Outputs¶
- visualization:
Visualization
<no description>[required]
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
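The "auto" rule for reads_per_batch, min(number of query sequences / n_jobs, 20000), can be written out as a short Python sketch. The function name is hypothetical, and the rounding up to a whole read and the lower bound of 1 are assumptions of this sketch rather than documented behaviour.

```python
def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Apply the documented rule min(n_queries / n_jobs, 20000).

    n_jobs must already be resolved to a positive worker count
    (the documentation treats 0 as "all CPUs").
    """
    if n_jobs < 1:
        raise ValueError("pass the resolved worker count (>= 1)")
    per_job = -(-n_queries // n_jobs)  # ceiling division (assumption)
    return max(1, min(per_job, cap))


print(auto_reads_per_batch(1000, 4))
print(auto_reads_per_batch(1_000_000, 4))
```

With few queries the batch size follows the per-worker share; with many queries it is capped at 20000.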
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so levels are labeled numerically instead of by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so levels are labeled numerically instead of by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
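The effect of rank_handle_regex can be sketched with Python's `re` module. This is an illustration, not the plugin's code: the function name is hypothetical, and the pattern used here includes the trailing double underscore, as in the default listed for merge-taxa.

```python
import re


def strip_rank_handles(taxonomy, rank_handle_regex=r"^[dkpcofgs]__"):
    """Strip rank handles and drop levels that are empty afterwards."""
    levels = []
    for label in taxonomy.split(";"):
        name = re.sub(rank_handle_regex, "", label.strip())
        if name:  # levels such as 'g__' become empty and are removed
            levels.append(name)
    return levels


print(strip_rank_handles("k__Bacteria; p__Firmicutes; g__; s__"))
```

Once the handles are stripped, empty levels no longer count toward the depth of a taxonomy, which is the "net effect" described in the parameter text.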
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]
- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if sequences are skipped, a silva_sequences output is still created, but contains no data.[default: True]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
Reference sequence annotation and curation pipeline.
- version:
2024.10.0
- website: https://
github .com /nbokulich /RESCRIPt - user support:
- Please post to the QIIME 2 forum for help with this plugin: https://
forum .qiime2 .org - citations:
- Robeson et al., 2021
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]
- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'.[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
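The difference between the "len" and "lca" modes can be shown with a minimal Python sketch. It is not the plugin's implementation: the function names are hypothetical, rank handles are assumed to have been stripped already, and the "score", "majority" and "super" modes are left out.

```python
def split_levels(taxonomy):
    """Split a semicolon-delimited taxonomy into non-empty levels."""
    return [lvl.strip() for lvl in taxonomy.split(";") if lvl.strip()]


def merge_len(taxonomies):
    """Pick the taxonomy with the most levels; on ties the last one wins."""
    best = []
    for tax in taxonomies:
        levels = split_levels(tax)
        if len(levels) >= len(best):
            best = levels
    return "; ".join(best)


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep only the leading levels shared by all taxonomies."""
    shared = []
    for column in zip(*(split_levels(tax) for tax in taxonomies)):
        if len(set(column)) != 1:
            break
        shared.append(column[0])
    return "; ".join(shared) if shared else unclassified_label


annotations = ["Bacteria; Firmicutes; Clostridia",
               "Bacteria; Firmicutes; Bacilli"]
print(merge_len(annotations))
print(merge_lca(annotations))
```

In this example "len" keeps one of the two conflicting class-level labels (the last, because of the tie rule), while "lca" backs off to the phylum both annotations agree on.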
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]
- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
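The two screening criteria can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: the function name is hypothetical, the input is assumed to be DNA already, and the degenerate set is the standard IUPAC ambiguity codes.

```python
import re

# IUPAC degenerate (ambiguous) DNA codes
DEGENERATES = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if a sequence survives both screening steps."""
    seq = seq.upper()
    # Step 1: reject sequences with num_degenerates or more ambiguous bases
    if sum(base in DEGENERATES for base in seq) >= num_degenerates:
        return False
    # Step 2: reject runs of one character of homopolymer_length or longer
    run = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return run.search(seq) is None


print(passes_cull("ACGT" * 5))
print(passes_cull("ACGT" + "A" * 8))
```

Both thresholds are inclusive, mirroring the "N, or more" wording of the parameter descriptions.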
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters, essentially 'unaligning' the sequences.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
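As a rough illustration of degapping, the Python sketch below removes both gap characters and applies min_length. The function name and the use of a plain dictionary of ID to sequence are assumptions made for this example.

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' from aligned sequences and drop short results.

    aligned maps sequence IDs to aligned sequence strings.
    """
    result = {}
    for seq_id, seq in aligned.items():
        ungapped = seq.replace("-", "").replace(".", "")
        # Sequences shorter than min_length after degapping are discarded
        if len(ungapped) >= min_length:
            result[seq_id] = ungapped
    return result


print(degap({"seq1": "AC--GT..", "seq2": "----"}))
```

A sequence consisting only of gaps has length 0 after degapping, so it is dropped even at the default min_length of 1.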
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
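The pairing of search and replacement strings can be sketched in Python. This is not the plugin's code: the function name is hypothetical, taxonomy is represented as a plain dictionary, and applying the pairs one after another in list order is an assumption of this sketch.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings,
                  use_regex=False):
    """Replace the i-th search string with the i-th replacement string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists differ in length")
    edited = {}
    for feature_id, label in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                label = re.sub(search, replacement, label)
            else:
                # Literal substring matching, as in the default behaviour
                label = label.replace(search, replacement)
        edited[feature_id] = label
    return edited


taxa = {"seq1": "k__Bacteria; p__Firmicutes"}
print(edit_taxonomy(taxa, ["Firmicutes"], ["Bacillota"]))
```

With use_regex left at False, characters such as "." or "$" in a search string are matched literally rather than as pattern syntax.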
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
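The "most stringent threshold" rule described for labels, min_lens and max_lens can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `passes_length_filter` is hypothetical, and a label is treated as matching when it occurs as a substring of the taxonomy string.

```python
def passes_length_filter(seq, taxon, labels=(), min_lens=None, max_lens=None,
                         global_min=None, global_max=None):
    """Return True if `seq` survives global and per-label length thresholds."""
    for lens in (min_lens, max_lens):
        if lens is not None and len(lens) != len(labels):
            raise ValueError("min_lens/max_lens must match the number of labels")

    # Start from the global thresholds, if any were given.
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []

    # Add the thresholds of every label found in the taxonomy string.
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens is not None:
                mins.append(min_lens[i])
            if max_lens is not None:
                maxs.append(max_lens[i])

    # Most stringent rule: longest minimum and shortest maximum win.
    length = len(seq)
    if mins and length < max(mins):
        return False
    if maxs and length > min(maxs):
        return False
    return True
```

With labels `["k__Bacteria", "g__Bacillus"]` and min_lens `[1200, 1400]`, a 1300 nt Bacillus sequence is discarded because both labels match and the longer minimum applies, whereas a 1300 nt sequence from another bacterial genus is kept.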
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
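Rank propagation, as described for the rank_propagation parameter, can be illustrated with a few lines of Python. This is a sketch under assumptions, not the plugin's code: the function name `propagate_ranks` is hypothetical, and ranks are assumed to be separated by '; ' with a double-underscore rank handle.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the name of the nearest upper-level rank."""
    filled = []
    last_name = None
    for entry in taxonomy.split(sep):
        handle, _, name = entry.partition("__")
        if name:
            last_name = name      # remember the most recent known name
        elif last_name is not None:
            name = last_name      # propagate it downward
        filled.append(f"{handle}__{name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```

An empty rank at the very top of a lineage has nothing above it to borrow from, so this sketch leaves it empty.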
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
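Conceptually, the conversion amounts to a character substitution. The sketch below assumes that reverse transcription here means replacing U with T while leaving order, case, and alignment gap characters untouched (no reverse complement); the function name is illustrative only.

```python
# Translation table mapping uracil to thymine, preserving case.
_RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    """Return the DNA form of an RNA string; gaps ('-', '.') pass through."""
    return rna.translate(_RNA_TO_DNA)


print(reverse_transcribe("ACGU--UUGC"))
# ACGT--TTGC
```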
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]
- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]
- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]
- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]
- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]
- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]
- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
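The interplay of subsample_size and random_seed can be sketched as seeded sampling without replacement. How the plugin rounds the sample count and draws the sample is not stated here, so both the rounding rule and the helper name `subsample_ids` are assumptions.

```python
import random


def subsample_ids(ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random fraction of sequence IDs."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    ids = list(ids)
    # Assumed rounding rule: nearest integer, but at least one sequence.
    n = min(len(ids), max(1, round(len(ids) * subsample_size)))
    rng = random.Random(random_seed)  # private generator, seeded for repeatability
    return rng.sample(ids, n)
```

Calling the function twice with the same seed returns the same IDs, which is the practical purpose of exposing a seed parameter.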
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]
- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]
- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]
- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]
- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]
- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]
- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]
- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
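The percent encoding required for rql_query values can be produced with Python's standard library. The helper `rql_in` below is a hypothetical convenience, not part of the plugin; it only builds the "in" expressions shown in the parameter description.

```python
from urllib.parse import quote


def rql_in(data_field, values, quote_values=False):
    """Build an RQL 'in' expression with percent-encoded values.

    Set quote_values=True for values that must be wrapped in double quotes,
    such as species names containing spaces.
    """
    encoded = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        # safe="" forces encoding of every reserved character, e.g. " and space.
        encoded.append(quote(text, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"],
             quote_values=True))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```

The resulting string can be passed as the rql_query value; building it programmatically avoids hand-encoding mistakes when many values are involved.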
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. As a third option, a metadata column can be provided to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
Outputs¶
- visualization:
Visualization
<no description>[required]
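To give a feel for what a kmer entropy value expresses, the following sketch computes Shannon entropy over the k-mer frequencies of a sequence set. The exact formula and logarithm base used by the action are not stated here, so base 2 and the function name `kmer_entropy` are assumptions.

```python
from collections import Counter
from math import log2


def kmer_entropy(seqs, k):
    """Shannon entropy (bits) of the k-mer frequency distribution."""
    counts = Counter(
        seq[i:i + k]
        for seq in seqs
        for i in range(len(seq) - k + 1)  # empty range if seq is shorter than k
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(kmer_entropy(["ACGT"], 1))
# 2.0
```

Four equally frequent 1-mers give 2 bits, the maximum for a four-letter alphabet; a homopolymer gives 0. Higher values indicate a more diverse k-mer content in the sequence set.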
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
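The "auto" setting of reads_per_batch is described as min(number of query sequences / n_jobs, 20000). A minimal sketch of that arithmetic follows; integer division, the floor of one read per batch, and the requirement that n_jobs is already resolved to a positive worker count are assumptions of this example.

```python
def auto_reads_per_batch(n_queries, n_jobs=1, cap=20000):
    """Autoscale the batch size as min(n_queries / n_jobs, cap).

    Assumes n_jobs has already been resolved to a positive worker count
    (the special value 0, meaning all CPUs, is not handled here).
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be a positive worker count")
    return min(max(1, n_queries // n_jobs), cap)


print(auto_reads_per_batch(100_000, n_jobs=4))  # 25000 per job, capped
# 20000
print(auto_reads_per_batch(1_000, n_jobs=4))
# 250
```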
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
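The "auto" rule for reads_per_batch can be illustrated with a minimal sketch. The function name, the ceiling division, and the floor of one read per batch are assumptions made for illustration; only the min(number of query sequences / n_jobs, 20000) rule comes from the parameter description, and this is not the plugin's own code.

```python
def resolve_reads_per_batch(reads_per_batch, n_queries, n_jobs, cap=20000):
    """Return the batch size implied by the documented 'auto' rule.

    Hypothetical helper: assumes n_jobs is a positive integer (the special
    value 0, meaning "all CPUs", is not resolved here).
    """
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # Documented rule: min(number of query sequences / n_jobs, 20000).
    # Rounding up and never returning less than 1 are assumptions.
    per_job = -(-n_queries // max(n_jobs, 1))  # ceiling division
    return max(1, min(per_job, cap))
```

With 100000 query sequences and 4 jobs the per-job share (25000) exceeds the cap, so the batch size is 20000; with 1000 query sequences it is 250.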
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Outputs an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so it is labeled numerically instead of by rank; for 7-level taxonomies these levels typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
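The labeling rule for the labels parameter can be sketched as follows. The helper name is hypothetical, and numbering unnamed inputs by their 1-based position is an assumption; the description only says that they are labeled numerically in sequential order.

```python
def assign_labels(n_inputs, labels=None):
    """Pair each input with a label, following the documented fallback rule."""
    labels = list(labels or [])
    # Extra labels are ignored; inputs without a label get a number.
    # Using the 1-based input position as that number is an assumption.
    return [labels[i] if i < len(labels) else str(i + 1)
            for i in range(n_inputs)]
```

For three inputs and the single label "silva", this sketch yields "silva", "2", "3".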
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize them as interactive line plots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so it is labeled numerically instead of by rank; for 7-level taxonomies these levels typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
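As a rough illustration of the per-level statistics named above, the sketch below counts unique labels, unclassified features, and label entropy at each level. It assumes semicolon-delimited lineages, Shannon entropy with the natural logarithm, and that a feature is unclassified at a level when its lineage is shorter than that level. These are assumptions for illustration, not the plugin's exact definitions, and the function name is hypothetical.

```python
import math
from collections import Counter


def level_stats(taxonomy, n_levels):
    """taxonomy: dict mapping feature ID -> semicolon-delimited lineage."""
    stats = []
    for level in range(1, n_levels + 1):
        labels = []
        unclassified = 0
        for lineage in taxonomy.values():
            ranks = [r.strip() for r in lineage.split(";") if r.strip()]
            if len(ranks) < level:
                unclassified += 1  # lineage stops above this level
            else:
                labels.append(";".join(ranks[:level]))
        counts = Counter(labels)
        total = sum(counts.values())
        # Shannon entropy of the label distribution at this level.
        entropy = 0.0
        for count in counts.values():
            p = count / total
            entropy -= p * math.log(p)
        stats.append({"level": level, "unique": len(counts),
                      "unclassified": unclassified, "entropy": entropy})
    return stats
```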
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a
silva_sequences
output is still created, but contains no data.[default:True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
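The rank_propagation behavior can be sketched in a few lines. This is a minimal illustration that assumes "; "-delimited labels with a "__" rank handle separator; the function name is hypothetical and the code is not taken from the plugin.

```python
def propagate_ranks(taxonomy, delimiter="; "):
    """Fill empty ranks with the nearest named rank above them."""
    filled = []
    last_name = None
    for label in taxonomy.split(delimiter):
        handle, sep, name = label.strip().partition("__")
        if sep and not name and last_name is not None:
            name = last_name  # propagate the upper-level name downward
        if name:
            last_name = name
        filled.append(handle + sep + name)
    return delimiter.join(filled)
```

Applied to the example from the parameter description, 'f__Pasteurellaceae; g__' becomes 'f__Pasteurellaceae; g__Pasteurellaceae'.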
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers located within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use
auto
to automatically use all available cores)[default:1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
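Position-based trimming amounts to slicing every aligned sequence at the same columns. The sketch below assumes that position_start and position_end are 1-based and inclusive, which the parameter descriptions do not state explicitly; the function name is hypothetical and primer-based trimming is not covered.

```python
def trim_alignment_positions(alignment, position_start=None, position_end=None):
    """alignment: dict mapping sequence ID -> aligned sequence string."""
    # Convert the assumed 1-based inclusive start to a 0-based slice index.
    start = position_start - 1 if position_start else 0
    # A missing end leaves the alignment untrimmed at the end.
    return {seq_id: seq[start:position_end]
            for seq_id, seq in alignment.items()}
```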
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
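Two of these selection strategies, longest annotation and least common ancestor, can be sketched as below. The function names are hypothetical. The sketch assumes semicolon-delimited taxonomies, does not strip or replace rank handles, and does not cover the score-based or majority-based modes. It is an illustration, not the plugin's implementation.

```python
def split_ranks(taxonomy):
    """Split a semicolon-delimited taxonomy into stripped, non-empty ranks."""
    return [r.strip() for r in taxonomy.split(";") if r.strip()]


def merge_len(taxonomies):
    """Pick the taxonomy with the most ranks; later inputs win ties."""
    best = []
    for taxonomy in taxonomies:
        ranks = split_ranks(taxonomy)
        if len(ranks) >= len(best):  # ">=" lets the last tied input win
            best = ranks
    return "; ".join(best)


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep only the leading ranks shared by every taxonomy."""
    shared = []
    for level in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return "; ".join(shared) if shared else unclassified_label
```

For "d__Bacteria; p__Firmicutes; c__Bacilli" and "d__Bacteria; p__Firmicutes; c__Clostridia", the LCA sketch returns "d__Bacteria; p__Firmicutes".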
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with
rank_handle_regex
if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default:['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
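The "majority" behavior for exact duplicates can be sketched as follows. This is a simplified illustration: it groups only identical sequences (no vsearch clustering), keeps the first feature ID seen as the representative (an assumption), and all names are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(taxonomy):
    """Coerce to semicolon delimiters without leading/trailing spaces."""
    return ";".join(rank.strip() for rank in taxonomy.split(";"))


def dereplicate_majority(sequences, taxonomy):
    """sequences, taxonomy: dicts keyed by feature ID."""
    clusters = defaultdict(list)
    for seq_id, seq in sequences.items():
        clusters[seq].append(seq_id)
    derep_seqs, derep_taxa = {}, {}
    for seq, ids in clusters.items():
        labels = Counter(normalize(taxonomy[i]) for i in ids)
        # most_common breaks ties by first occurrence, i.e. arbitrarily.
        winner, _ = labels.most_common(1)[0]
        derep_seqs[ids[0]] = seq
        derep_taxa[ids[0]] = winner
    return derep_seqs, derep_taxa
```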
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
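The two screening criteria can be sketched with regular expressions. The sketch assumes the standard IUPAC degenerate nucleotide codes and that only runs of A, C, G, or T count as homopolymers; both are assumptions, the function name is hypothetical, and the code is not the plugin's implementation.

```python
import re

# IUPAC degenerate nucleotide codes (everything other than A, C, G, T).
DEGENERATE = re.compile(r"[RYSWKMBDHVN]")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if a sequence survives both screening criteria."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    if len(DEGENERATE.findall(seq)) >= num_degenerates:
        return False
    # A base followed by (length - 1) or more copies of itself.
    homopolymer = re.compile(r"([ACGT])\1{%d,}" % (homopolymer_length - 1))
    return homopolymer.search(seq) is None
```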
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
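Degapping is simple enough to sketch directly. The function name and the dict-based input are hypothetical; the characters removed and the min_length filter follow the description above.

```python
def degap(aligned, min_length=1):
    """aligned: dict mapping sequence ID -> aligned sequence string."""
    kept = {}
    for seq_id, seq in aligned.items():
        ungapped = seq.replace("-", "").replace(".", "")
        # Drop sequences that end up shorter than min_length.
        if len(ungapped) >= min_length:
            kept[seq_id] = ungapped
    return kept
```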
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with the corresponding string in 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
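The pairing of search and replacement strings can be sketched as below. The function name is hypothetical, and applying the pairs sequentially in list order is an assumption; the length check and the literal-versus-regex toggle follow the parameter descriptions.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings,
                  use_regex=False):
    """taxonomy: dict mapping feature ID -> taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be equal length")
    edited = {}
    for feature_id, label in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                label = re.sub(search, replacement, label)
            else:
                label = label.replace(search, replacement)  # literal match
        edited[feature_id] = label
    return edited
```

For example, replacing the literal "k__" with "d__" turns "k__Bacteria; p__Firmicutes" into "d__Bacteria; p__Firmicutes".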
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
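When no reference is given, the action reverse-complements every input sequence. A minimal sketch of that operation, including IUPAC degenerate codes, is shown below; the function name is hypothetical and the reference-based orientation performed by VSEARCH is not covered.

```python
# Complement table for DNA, including IUPAC degenerate codes.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```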
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
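The "most stringent threshold" rule for sequences that match several labels can be sketched as follows. The function name is hypothetical, and matching a label as a plain substring of the taxonomy string is an assumption made for illustration.

```python
def passes_length_filter(length, taxonomy, labels, min_lens=None,
                         max_lens=None, global_min=None, global_max=None):
    """Return True if a sequence of this length and taxonomy is retained."""
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    for i, label in enumerate(labels):
        if label in taxonomy:  # assumed substring match
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum apply.
    if mins and length < max(mins):
        return False
    if maxs and length > min(maxs):
        return False
    return True
```

With labels "Bacteria" (minimum 1200) and "Lactobacillus" (minimum 1400), a 1300-nt Lactobacillus sequence is discarded, while a 1300-nt sequence from another bacterial genus is kept.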
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels
is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
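At the string level, reverse transcription here means replacing uracil with thymine while leaving every other character, including alignment gaps, untouched. The sketch below is a hypothetical helper that illustrates this under the assumption that no other characters change.

```python
def reverse_transcribe(rna):
    """Convert an RNA string to DNA by swapping U for T (case-preserving)."""
    return rna.replace("U", "T").replace("u", "t")
```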
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]
- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]
- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]
- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]
- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]
- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting
ids_to_keep
.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting
ids_to_keep
.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
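The order of operations described for `include` and `exclude` can be sketched in plain Python. This is only an illustration, not RESCRIPt's implementation: the function name `filter_taxonomy`, the dictionary layout, and the use of case-sensitive substring matching are all assumptions, and the final `ids_to_keep` selection step is deliberately not modeled.

```python
def filter_taxonomy(taxonomy, include=None, exclude=None):
    """Apply inclusion filtering first, then exclusion filtering.

    taxonomy: dict mapping feature ID -> taxonomy string (hypothetical layout).
    Matching is a case-sensitive substring search (an assumption).
    """
    kept = dict(taxonomy)
    if include:
        # Retain taxa containing at least one inclusion term.
        kept = {fid: tax for fid, tax in kept.items()
                if any(term in tax for term in include)}
    if exclude:
        # Then drop taxa containing at least one exclusion term.
        kept = {fid: tax for fid, tax in kept.items()
                if not any(term in tax for term in exclude)}
    return kept


taxa = {
    "seq1": "k__Bacteria; p__Firmicutes; g__Bacillus",
    "seq2": "k__Bacteria; p__Proteobacteria; g__Escherichia",
    "seq3": "k__Archaea; p__Euryarchaeota; g__Methanobrevibacter",
}
result = filter_taxonomy(taxa, include=["k__Bacteria"], exclude=["Firmicutes"])
print(sorted(result))
```

Because exclusion runs after inclusion, `seq1` passes the inclusion step but is then removed by the exclusion term.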
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]
- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
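How `subsample_size` and `random_seed` interact can be illustrated with a small standard-library sketch. It is not the plugin's code: the helper name `subsample_ids`, the rounding down of the sample count, and the use of Python's `random` module are assumptions made for illustration.

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Pick a reproducible random fraction of sequence IDs (hypothetical helper)."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the range (0, 1]")
    # Assumption: the sample count is rounded down to a whole number.
    n_keep = int(len(seq_ids) * subsample_size)
    rng = random.Random(random_seed)
    return rng.sample(list(seq_ids), n_keep)


ids = [f"seq{i}" for i in range(100)]
picked = subsample_ids(ids, subsample_size=0.1, random_seed=1)
print(len(picked))
```

Repeating the call with the same seed returns the same sample, which is the point of exposing `random_seed` as a parameter.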
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]
- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]
- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]
- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]
- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]
- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]
- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]
- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
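The rank propagation behavior described for `rank_propagation` can be sketched as follows. This is a minimal illustration rather than RESCRIPt's own code: the function name `propagate_ranks`, the `'; '` separator, and the `x__` handle format are assumptions taken from the example in the parameter description.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the nearest upper-level name (illustrative only)."""
    filled = []
    last_name = ""
    for label in taxonomy.split(sep):
        # Split a label such as 'g__Bacillus' into its handle and name.
        handle, _, name = label.strip().partition("__")
        if name:
            last_name = name
        # An empty rank inherits the most recent non-empty name above it.
        filled.append(f"{handle}__{name or last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```

Ranks that already carry a name are left untouched, so fully annotated lineages pass through unchanged.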
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
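The percent-encoding rule described for `rql_query` can be reproduced with Python's standard library. The helper below is a hypothetical convenience, not part of RESCRIPt or the BV-BRC API; it assumes that any value needing encoding should also be wrapped in encoded double quotes, as in the species example above.

```python
from urllib.parse import quote


def rql_in(data_field, values):
    """Build an RQL 'in' query string, percent-encoding values where needed."""
    encoded = []
    for value in values:
        if quote(value, safe="") == value:
            # Nothing to encode (e.g. genome IDs such as 224308.43).
            encoded.append(value)
        else:
            # Wrap in double quotes, then encode quotes (%22) and spaces (%20).
            encoded.append(quote(f'"{value}"', safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```

The resulting string is what would be supplied as the `rql_query` value.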
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
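To give a sense of what a kmer entropy measurement involves, the sketch below computes Shannon entropy over pooled k-mer counts. It is an assumption-laden illustration, not the statistic as implemented by this action: the function name `kmer_entropy`, the pooling of k-mers across all sequences, and the use of base-2 logarithms are all choices made here for clarity.

```python
import math
from collections import Counter


def kmer_entropy(sequences, k):
    """Shannon entropy (bits) of the pooled k-mer frequency distribution."""
    counts = Counter(
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        # No sequence is long enough to contain a k-mer of this length.
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(kmer_entropy(["ACGT"], 1))
print(kmer_entropy(["AAAA"], 2))
```

The number of k-mers grows with both sequence count and length, which is why the description warns that these calculations may be slow and offers `subsample_kmers` to reduce the input.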
Outputs¶
- visualization:
Visualization
<no description>[required]
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate
. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
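The "auto" rule given for `reads_per_batch` can be written out as a small function. This is a sketch of the stated formula only: the helper name `resolve_reads_per_batch` is hypothetical, integer division is an assumption, and `n_jobs` is assumed to be at least 1 (the "0 means all CPUs" case is not handled).

```python
def resolve_reads_per_batch(reads_per_batch, n_queries, n_jobs):
    """Return the batch size, applying the 'auto' scaling rule when requested."""
    if reads_per_batch == "auto":
        # min(number of query sequences / n_jobs, 20000), as described above.
        return min(n_queries // n_jobs, 20000)
    return reads_per_batch


print(resolve_reads_per_batch("auto", 100000, 4))  # 20000
print(resolve_reads_per_batch("auto", 1000, 4))    # 250
```

Under this rule the batch size is capped at 20000 reads, while smaller query sets are split evenly across jobs.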
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]
- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a
silva_sequences
output is still created, but contains no data.[default:True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers located within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use
auto
to automatically use all available cores)[default:1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]
- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so it should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'.[default:['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
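Two of the merge modes described above, "len" and "lca", can be sketched in a few lines. This is an illustration under stated assumptions rather than the plugin's code: the helper names are hypothetical, taxonomies are plain semicolon-delimited strings, and the output is joined with "; ".

```python
def split_ranks(taxon):
    """Split a semicolon-delimited taxonomy string into non-empty ranks."""
    return [rank.strip() for rank in taxon.split(";") if rank.strip()]


def merge_len(taxa):
    """Pick the annotation with the most ranks; on a tie the later one wins."""
    best = taxa[0]
    for taxon in taxa[1:]:
        if len(split_ranks(taxon)) >= len(split_ranks(best)):
            best = taxon
    return best


def merge_lca(taxa, unclassified_label="Unassigned"):
    """Keep only the leading ranks on which every annotation agrees."""
    shared = []
    for level in zip(*(split_ranks(taxon) for taxon in taxa)):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return "; ".join(shared) if shared else unclassified_label


genus = "d__Bacteria; p__Firmicutes; g__Faecalibacterium"
species = "d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__prausnitzii"
other = "d__Bacteria; p__Bacteroidota"

print(merge_len([genus, species]))
print(merge_lca([species, other]))
```

The "len" mode favors the most detailed annotation, while "lca" is conservative and falls back to the unclassified label when the inputs disagree at the first rank.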
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
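The difference between the "uniq" and "majority" modes can be shown with a small sketch. It assumes exact-match dereplication only (the equivalent of perc_identity 1.0), uses hypothetical plain dictionaries instead of artifacts, and keeps the first identifier in each group as the representative, which is an arbitrary choice for illustration.

```python
from collections import Counter, defaultdict


def dereplicate(seqs, taxa, mode="uniq"):
    """Return {kept_id: taxonomy} after exact-match dereplication."""
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq].append(seq_id)  # identical sequences share a group

    kept = {}
    for ids in groups.values():
        if mode == "uniq":
            # Keep one representative per distinct taxonomy in the group.
            seen = set()
            for seq_id in ids:
                if taxa[seq_id] not in seen:
                    seen.add(taxa[seq_id])
                    kept[seq_id] = taxa[seq_id]
        elif mode == "majority":
            # Keep one representative labeled with the most common taxonomy.
            label, _count = Counter(taxa[i] for i in ids).most_common(1)[0]
            kept[ids[0]] = label
        else:
            raise ValueError("this sketch only covers 'uniq' and 'majority'")
    return kept


seqs = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxa = {"a": "g__X; s__one", "b": "g__X; s__one",
        "c": "g__X; s__two", "d": "g__Y; s__three"}
print(dereplicate(seqs, taxa, mode="uniq"))
print(dereplicate(seqs, taxa, mode="majority"))
```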
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]
- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
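The two screening criteria can be expressed as a short predicate. This is a sketch, not the plugin's code: the function name `passes_cull` is hypothetical, and it assumes that only runs of A, C, G, or T count as homopolymers.

```python
import re

# IUPAC degenerate nucleotide codes (everything except A, C, G, T/U).
DEGENERATE = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence survives both screening steps."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    if sum(base in DEGENERATE for base in seq) >= num_degenerates:
        return False
    # A base followed by at least (homopolymer_length - 1) copies of itself.
    pattern = r"([ACGT])\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is None


print(passes_cull("ACGTACGT"))
print(passes_cull("ACGTNNNNNACGT"))
print(passes_cull("ACG" + "A" * 8 + "CGT"))
```

Both thresholds are inclusive: a sequence with exactly `num_degenerates` degenerate bases, or a homopolymer of exactly `homopolymer_length`, is removed.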
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
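Degapping amounts to deleting two characters and applying a length check. A minimal sketch, using a hypothetical `degap` function over a plain dictionary of aligned sequences:

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' characters and drop sequences that end up too short."""
    strip_gaps = str.maketrans("", "", "-.")
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.translate(strip_gaps)
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


aligned = {"s1": "AC--GT..A", "s2": "----..", "s3": "A-C"}
print(degap(aligned, min_length=3))
```

With the default `min_length` of 1, only sequences that consist entirely of gap characters are dropped.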
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e., 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search / replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
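The pairing of search and replacement strings, and the effect of the regular-expression toggle, can be sketched as follows. The function `edit_taxonomy` and its dictionary input are hypothetical stand-ins for the real artifacts, and replacements are assumed to be applied in list order.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply each search/replacement pair, in order, to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, taxon in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                taxon = re.sub(search, replacement, taxon)
            else:
                taxon = taxon.replace(search, replacement)  # literal match
        edited[feature_id] = taxon
    return edited


taxonomy = {"id1": "k__Bacteria; p__Firmicutes", "id2": "k__Archaea; p__"}
print(edit_taxonomy(taxonomy, ["k__"], ["d__"]))
print(edit_taxonomy(taxonomy, [r"p__$"], ["p__unclassified"], use_regex=True))
```

With `use_regex=False`, the pattern `p__$` is searched for literally and therefore matches nothing in this example.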
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
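When no reference is supplied, the action simply reverse-complements every input sequence. That operation can be sketched as below; the function name is hypothetical, and the translation table follows the standard IUPAC complement pairs.

```python
# IUPAC complements: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W, N map to themselves.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))
print(reverse_complement("ATGCN"))
```

Applying the function twice returns the original (upper-cased) sequence, which is a quick sanity check for any complement table.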
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
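The "most stringent threshold" rule for sequences that match several labels can be sketched as a single predicate. This is an illustration only: `keep_sequence` is a hypothetical helper, and labels are assumed to match by simple substring search in the taxonomy string.

```python
def keep_sequence(length, taxon, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    """Return True if a sequence of this length passes all applicable filters."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for index, label in enumerate(labels):
        if label in taxon:  # substring match is an assumption of this sketch
            if min_lens:
                mins.append(min_lens[index])
            if max_lens:
                maxs.append(max_lens[index])
    # Most stringent: the longest minimum and the shortest maximum apply.
    if mins and length < max(mins):
        return False
    if maxs and length > min(maxs):
        return False
    return True


labels = ["Bacteria", "Faecalibacterium"]
min_lens = [1200, 1400]
nested = "d__Bacteria; g__Faecalibacterium"
print(keep_sequence(1300, nested, labels, min_lens=min_lens))
print(keep_sequence(1300, "d__Bacteria; g__Bacillus", labels, min_lens=min_lens))
```

In the example, a 1300 nt Faecalibacterium sequence is discarded because the genus-level minimum (1400) is stricter than the domain-level minimum (1200), whereas a Bacillus sequence of the same length is kept.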
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing from 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
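Rank propagation, as described for the rank_propagation parameter, can be sketched on a plain taxonomy string. The helper `propagate_ranks` is hypothetical; it assumes ranks are separated by semicolons and that handles end in a double underscore.

```python
def propagate_ranks(taxon):
    """Fill empty ranks with the name of the nearest named upper-level rank."""
    filled = []
    last_name = None
    for rank in taxon.split(";"):
        rank = rank.strip()
        handle, separator, name = rank.partition("__")
        if not separator:
            filled.append(rank)  # no rank handle: leave the label untouched
            continue
        if name:
            last_name = name
        elif last_name is not None:
            rank = handle + "__" + last_name
        filled.append(rank)
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
print(propagate_ranks("d__Bacteria; p__; c__Bacilli; o__"))
```

An empty rank at the very top of a lineage stays empty, because there is no upper-level name to propagate.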
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
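At the sequence level, this conversion can be sketched as a character substitution. The sketch assumes that reverse transcription here means replacing uracil with thymine while leaving every other character, including alignment gaps, unchanged; the function name is hypothetical.

```python
def reverse_transcribe(rna):
    """Replace U with T (preserving case); gaps and other bases are kept."""
    return rna.translate(str.maketrans("Uu", "Tt"))


print(reverse_transcribe("AUG-CU.u"))
```

Because gap characters pass through untouched, the same function works for aligned and unaligned input, which mirrors the paired input and output types listed above.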
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]
- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]
- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]
- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]
- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]
- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
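The order of operations stated in the parameter descriptions (include, then exclude, then ids_to_keep) can be sketched as three successive filters. The function below is hypothetical; it assumes case-sensitive substring matching and treats ids_to_keep as a final restriction on whatever survived the first two steps.

```python
def filter_taxa(taxonomy, ids_to_keep=None, include=None, exclude=None):
    """Apply inclusion, then exclusion, then ID selection to a taxonomy mapping."""
    kept = dict(taxonomy)
    if include:
        kept = {fid: taxon for fid, taxon in kept.items()
                if any(term in taxon for term in include)}
    if exclude:
        kept = {fid: taxon for fid, taxon in kept.items()
                if not any(term in taxon for term in exclude)}
    if ids_to_keep is not None:
        kept = {fid: taxon for fid, taxon in kept.items() if fid in ids_to_keep}
    return kept


taxonomy = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Proteobacteria",
    "c": "d__Archaea; p__Euryarchaeota",
    "d": "d__Bacteria; p__Firmicutes; c__Bacilli",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"]))
```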
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]
- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
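Fraction-based, seeded subsampling can be sketched with the standard library. The function `subsample` is hypothetical, and truncating the sample size to an integer is an assumption; the plugin may round differently.

```python
import random


def subsample(seqs, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random subset containing a fraction of the input."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    rng = random.Random(random_seed)  # local generator, so the seed is isolated
    count = int(len(seqs) * subsample_size)
    chosen = rng.sample(sorted(seqs), count)  # sort IDs for a stable starting order
    return {seq_id: seqs[seq_id] for seq_id in chosen}


seqs = {"seq%d" % i: "ACGT" * (i + 1) for i in range(20)}
sample = subsample(seqs, subsample_size=0.25, random_seed=42)
print(sorted(sample))
```

Using the same seed on the same input yields the same subset, which keeps downstream evaluations comparable between runs.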
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]
- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]
- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]
- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID or name. Otherwise, assemblies from the taxonomy subtree are also included.[default:
False
]- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If the number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
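The rank propagation described for 'rank_propagation' can be illustrated with a minimal sketch. The helper name `propagate_ranks` is hypothetical and is not part of the plugin; the sketch assumes every rank is written as `handle__name` and that ranks are separated by "; ".

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the nearest non-empty upper-level name.

    Hypothetical helper for illustration only; assumes 'handle__name' labels.
    """
    filled = []
    last_name = ""
    for rank in taxonomy.split(sep):
        # Split 'g__Name' into the handle ('g') and the name ('Name').
        handle, _, name = rank.strip().partition("__")
        if name:
            last_name = name
        # Empty ranks inherit the most recent non-empty name.
        filled.append(f"{handle}__{last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```

With the example from the parameter description, the empty 'g__' rank inherits the family name.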
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
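The percent encoding required for 'rql_query' values can be produced with Python's standard library. This is a minimal sketch: the helper name `rql_in` and its `quote_values` flag are hypothetical, and only the "in" operator shown in the parameter description is covered.

```python
from urllib.parse import quote


def rql_in(field, values, quote_values=False):
    """Build an RQL 'in' query string with percent-encoded values.

    Hypothetical helper for illustration; not part of the plugin.
    """
    encoded = []
    for value in values:
        # Wrap free-text values in double quotes before encoding,
        # so that '"' becomes %22 and ' ' becomes %20.
        text = f'"{value}"' if quote_values else str(value)
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"],
             quote_values=True))
```

The two calls reproduce the two example queries given in the 'rql_query' description.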
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
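The three query routes and the pairing of 'data_field' with 'ids' can be checked before a request is sent. The sketch below is illustrative only: `resolve_query_mode` is a hypothetical helper, and the rule that exactly one route may be used per call is an assumption, not something the parameter descriptions state.

```python
def resolve_query_mode(rql_query=None, data_field=None, ids=None,
                       ids_metadata=None):
    """Return which query route a set of parameters selects.

    Hypothetical helper; assumes the three routes are mutually exclusive.
    """
    # 'data_field' and 'ids' only make sense together.
    if (data_field is None) != (ids is None):
        raise ValueError("'data_field' and 'ids' must be provided together.")

    routes = {
        "rql_query": rql_query is not None,
        "ids": ids is not None,
        "ids_metadata": ids_metadata is not None,
    }
    chosen = [name for name, used in routes.items() if used]
    if len(chosen) != 1:
        raise ValueError(f"Expected exactly one query route, got: {chosen}")
    return chosen[0]


print(resolve_query_mode(data_field="genome_id", ids=["224308.43"]))
```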
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
Outputs¶
- visualization:
Visualization
<no description>[required]
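To give a sense of what a kmer entropy measures, here is a minimal sketch that computes the Shannon entropy (in bits) of the kmer frequency distribution across a set of sequences. The function name `kmer_entropy` is hypothetical, and the plugin's exact entropy definition may differ from this one.

```python
from collections import Counter
from math import log2


def kmer_entropy(sequences, k):
    """Shannon entropy (bits) of kmer frequencies over all sequences.

    Illustrative definition only; not the plugin's implementation.
    """
    counts = Counter(
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


seqs = ["ACGTACGT", "ACGTTTTT"]
print(len(set(seqs)))         # number of unique sequences
print(kmer_entropy(seqs, 4))  # entropy of the 4-mer distribution
```

Longer kmers and larger inputs increase the number of distinct kmers to count, which is why the 'subsample_kmers' parameter can help on large sequence sets.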
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
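The "auto" setting of 'reads_per_batch' is described as min(number of query sequences / n_jobs, 20000). A minimal sketch of that arithmetic follows; the function name is hypothetical, and rounding the quotient up is an assumption, since the description does not say how fractional results are handled.

```python
from math import ceil


def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Approximate the 'auto' batch size from the parameter description.

    Hypothetical helper; n_jobs must be a positive worker count here
    (the special value 0, meaning all CPUs, is not resolved).
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1 in this sketch.")
    return min(ceil(n_queries / n_jobs), cap)


print(auto_reads_per_batch(100000, 4))
print(auto_reads_per_batch(1000, 4))
```

For large inputs the cap of 20000 applies; for small inputs the reads are split roughly evenly across workers.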
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
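Stratified k-fold splitting needs every label to occur at least k times, which is why rare labels are truncated. The sketch below shows one simple way such truncation could work: labels seen fewer than k times lose their last rank until they are frequent enough. This is an illustration under that assumption, with a hypothetical function name, and is not the plugin's actual procedure.

```python
from collections import Counter


def truncate_for_stratification(labels, k, sep=";"):
    """Trim trailing ranks from labels that occur fewer than k times.

    Hypothetical helper for illustration only.
    """
    current = [label.split(sep) for label in labels]
    changed = True
    while changed:
        changed = False
        counts = Counter(tuple(ranks) for ranks in current)
        for i, ranks in enumerate(current):
            # Rare labels are shortened by one rank per pass.
            if counts[tuple(ranks)] < k and len(ranks) > 1:
                current[i] = ranks[:-1]
                changed = True
    return [sep.join(ranks) for ranks in current]


labels = ["k__A;g__X;s__a", "k__A;g__X;s__a", "k__A;g__X;s__b", "k__A;g__Y;s__c"]
print(truncate_for_stratification(labels, k=2))
```

This is why the 'expected_taxonomy' output may contain shorter labels than the input reference taxonomy.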
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
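The numbered taxonomic levels on the x-axis can be understood through a minimal per-level comparison. The sketch below reports, for each level, the fraction of features whose observed lineage matches the expected lineage down to that level. The function name is hypothetical and the metric is a simplification; the visualization itself may report other accuracy measures.

```python
def accuracy_by_level(expected, observed, n_levels=7, sep=";"):
    """Fraction of features matching the expected lineage at each level.

    Hypothetical helper; 'expected' and 'observed' map feature IDs to
    taxonomy strings.
    """
    results = {}
    for level in range(1, n_levels + 1):
        hits = 0
        for feature_id, exp in expected.items():
            exp_ranks = [r.strip() for r in exp.split(sep)][:level]
            obs = observed.get(feature_id, "")
            obs_ranks = [r.strip() for r in obs.split(sep)][:level]
            # Count a hit only if the expected label reaches this level.
            if len(exp_ranks) == level and exp_ranks == obs_ranks:
                hits += 1
        results[level] = hits / len(expected)
    return results


expected = {"s1": "d__Bacteria; p__Firmicutes", "s2": "d__Bacteria; p__Proteobacteria"}
observed = {"s1": "d__Bacteria; p__Firmicutes", "s2": "d__Bacteria; p__Bacteroidota"}
print(accuracy_by_level(expected, observed, n_levels=2))
```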
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
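The effect of 'rank_handle_regex' can be sketched with Python's `re` module: the handle is stripped from each rank, so empty levels such as 's__' become empty strings and can be ignored. The function below counts distinct non-empty names per level; its name is hypothetical, the default pattern is chosen for this example, and the plugin's own statistics may be computed differently.

```python
import re


def unique_labels_per_level(taxonomies, rank_handle_regex=r"^[dkpcofgs]__"):
    """Count distinct non-empty names at each taxonomic level.

    Hypothetical helper for illustration only.
    """
    levels = {}
    for taxonomy in taxonomies:
        for depth, rank in enumerate(taxonomy.split(";"), start=1):
            # Strip the rank handle; empty ranks such as 's__' become ''.
            name = re.sub(rank_handle_regex, "", rank.strip())
            if name:
                levels.setdefault(depth, set()).add(name)
    return {depth: len(names) for depth, names in sorted(levels.items())}


taxa = ["g__Faecalibacterium; s__prausnitzii", "g__Faecalibacterium; s__"]
print(unique_labels_per_level(taxa))
```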
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if sequence download is disabled, a silva_sequences output is still created, but contains no data.[default:
True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
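The 'version' and 'target' choices are listed in three groups each. Assuming the groups pair up in the order listed (an interpretation of the type signature, not something stated in the descriptions), the allowed combinations can be written as a small lookup. The names `SILVA_TARGETS` and `allowed_targets` are hypothetical.

```python
# Assumed pairing of version groups with target groups, in listed order.
SILVA_TARGETS = {
    ("128", "132"): ("SSURef_NR99", "SSURef", "LSURef"),
    ("138",): ("SSURef_NR99", "SSURef"),
    ("138.1", "138.2"): ("SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"),
}


def allowed_targets(version):
    """Return the targets assumed to be valid for a SILVA version."""
    for versions, targets in SILVA_TARGETS.items():
        if version in versions:
            return targets
    raise ValueError(f"Unknown SILVA version: {version}")


print(allowed_targets("138.2"))
```

Checking a combination up front avoids starting a download that would be rejected.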
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default:
1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
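Position-based trimming amounts to slicing every aligned sequence at the same columns. The sketch below assumes positions are 1-based and that the end position is included in the output; both are assumptions, and `trim_by_positions` is a hypothetical helper that does not cover the primer-based mode.

```python
def trim_by_positions(aligned, position_start=None, position_end=None):
    """Slice all aligned sequences to the same column range.

    Hypothetical helper; assumes 1-based, end-inclusive positions.
    """
    start = 0 if position_start is None else position_start - 1
    end = None if position_end is None else position_end
    return {seq_id: seq[start:end] for seq_id, seq in aligned.items()}


alignment = {"s1": "AC-GT-A", "s2": "ACCGTTA"}
print(trim_by_positions(alignment, position_start=2, position_end=5))
```

Leaving either position unset keeps that end of the alignment untrimmed, as the parameter descriptions state.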
Reference sequence annotation and curation pipeline.
- version: 2024.10.0
- website: https://github.com/nbokulich/RESCRIPt
- user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
- citations: Robeson et al., 2021
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank_handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'.[default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
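Two of the merge modes, "len" and "lca", can be sketched for a single feature as follows. The function names are hypothetical, and the sketch is simplified: it does not strip rank handles, so an empty rank such as "s__" still counts as an element here.

```python
def split_ranks(taxonomy):
    """Split a taxonomy string into its non-empty ranks."""
    return [rank.strip() for rank in taxonomy.split(";") if rank.strip()]


def merge_len(taxonomies):
    """Pick the taxonomy with the most ranks; on ties the last one wins."""
    best = taxonomies[0]
    for candidate in taxonomies[1:]:
        if len(split_ranks(candidate)) >= len(split_ranks(best)):
            best = candidate
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep the leading ranks shared by all taxonomies."""
    common = []
    for level in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return "; ".join(common) if common else unclassified_label


a = "d__Bacteria; p__Firmicutes; c__Bacilli"
b = "d__Bacteria; p__Firmicutes; c__Clostridia"
print(merge_len([a, "d__Bacteria"]))
print(merge_lca([a, b]))
```

When no rank is shared, the LCA sketch falls back to the label given by 'unclassified_label'.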
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
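The "lca" and "majority" modes described above can be illustrated with a minimal Python sketch. This is not RESCRIPt's implementation: the helper names (`split_ranks`, `lca`, `majority`) are hypothetical, and joining ranks with a bare semicolon is an assumption about how the coerced delimiter looks.

```python
from collections import Counter


def split_ranks(taxonomy):
    """Split a taxonomy string on semicolons and strip surrounding spaces."""
    return [rank.strip() for rank in taxonomy.split(";")]


def lca(taxonomies):
    """Return the leading ranks shared by every taxonomy string."""
    shared = []
    for ranks in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(ranks)) != 1:
            break  # first disagreement ends the common ancestry
        shared.append(ranks[0])
    return ";".join(shared)


def majority(taxonomies):
    """Return the most common taxonomy string (ties are broken arbitrarily)."""
    normalized = [";".join(split_ranks(t)) for t in taxonomies]
    return Counter(normalized).most_common(1)[0][0]


# Three identical sequences carrying slightly different annotations.
cluster = [
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__",
]
print(lca(cluster))
print(majority(cluster))
```

In this toy cluster the LCA stops at the genus because the species labels disagree, while the majority vote keeps the species label carried by two of the three members.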
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
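The two screening criteria can be sketched in a few lines of Python. This is an illustration only, not the plugin's code: the function name `should_cull` is hypothetical, and the set of degenerate codes is the standard IUPAC DNA ambiguity alphabet.

```python
import re

# IUPAC degenerate (ambiguous) DNA codes; A, C, G, and T are excluded.
DEGENERATE_BASES = set("RYSWKMBDHVN")


def should_cull(sequence, num_degenerates=5, homopolymer_length=8):
    """Return True if a DNA sequence fails either screening criterion."""
    seq = sequence.upper()
    degenerate_count = sum(base in DEGENERATE_BASES for base in seq)
    if degenerate_count >= num_degenerates:
        return True
    # One character followed by (homopolymer_length - 1) or more copies of
    # itself is a homopolymer run of at least homopolymer_length.
    run = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return run.search(seq) is not None


print(should_cull("ACGT" + "A" * 8))  # homopolymer of length 8
print(should_cull("ACGTNNNN"))        # only four degenerate bases
```

The defaults mirror the documented defaults of 5 degenerate bases and a homopolymer length of 8.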
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
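As a rough illustration of what degapping does, the following Python sketch removes "-" and "." characters and applies the minimum-length cutoff. The function name `degap` and the dictionary input are assumptions made for the example; the plugin itself operates on QIIME 2 artifacts.

```python
def degap(aligned, min_length=1):
    """Drop '-' and '.' characters; discard sequences shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


alignment = {"seq1": "AC--GT..", "seq2": "--------"}
print(degap(alignment))
```

A sequence consisting only of gap characters has length zero after degapping, so it is dropped even at the default minimum length of 1.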
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
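The pairing of search and replacement strings, and the difference between literal and regular-expression matching, can be sketched as follows. This is a hypothetical helper (`edit_taxonomy_strings`) written for illustration, and applying the pairs in list order is an assumption rather than documented behavior.

```python
import re


def edit_taxonomy_strings(taxonomy, search_strings, replacement_strings,
                          use_regex=False):
    """Apply paired find-and-replace terms to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("Need one replacement string per search string.")
    edited = {}
    for feature_id, taxon in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                taxon = re.sub(search, replacement, taxon)
            else:
                # Literal substring matching: '.' and '*' have no special meaning.
                taxon = taxon.replace(search, replacement)
        edited[feature_id] = taxon
    return edited


taxa = {"seq1": "k__Bacteria; p__Firmicutes"}
print(edit_taxonomy_strings(taxa, ["k__", "Firmicutes"], ["d__", "Bacillota"]))
```

With `use_regex=True`, a pattern such as `s__.*$` would match everything from the species handle to the end of the string, which literal matching cannot express.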
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than with the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
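When no reference is supplied, every input sequence is simply reverse-complemented. The sketch below shows that operation in plain Python, including IUPAC ambiguity codes. It does not reproduce the VSEARCH-based orientation step; the function name and the upper-casing of input are assumptions of this example.

```python
# Complement table for DNA, including IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```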
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
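The "most stringent threshold" rule for sequences matching several labels can be sketched as below. This is an illustrative reading of the parameter descriptions, not the plugin's code: the function name is hypothetical, labels are matched as plain substrings of the taxonomy string, and global thresholds are assumed to combine with label thresholds in the same most-stringent fashion.

```python
def passes_length_filter(length, taxon, labels, min_lens=None, max_lens=None,
                         global_min=None, global_max=None):
    """Return True if a sequence of this length survives all thresholds."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens is not None:
                mins.append(min_lens[i])
            if max_lens is not None:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum apply.
    if mins and length < max(mins):
        return False
    if maxs and length > min(maxs):
        return False
    return True


labels = ["k__Bacteria", "g__Bacillus"]
min_lens = [900, 1200]
print(passes_length_filter(1000, "k__Bacteria; g__Bacillus", labels, min_lens))
print(passes_length_filter(1000, "k__Bacteria; g__Other", labels, min_lens))
```

Here a 1000 nt Bacillus sequence is rejected by the stricter genus-level minimum, while a 1000 nt sequence from another bacterial genus passes the kingdom-level minimum.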
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing from 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
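Rank propagation, as described for the rank_propagation parameter, can be sketched in a few lines. The helper below is hypothetical and only handles strings that already use double-underscore rank handles separated by semicolons; it is not the parser used by the plugin.

```python
def propagate_ranks(taxonomy):
    """Fill empty ranks with the nearest upper-level name in the lineage."""
    filled = []
    last_name = ""
    for rank in taxonomy.split(";"):
        handle, _, name = rank.strip().partition("__")
        if name:
            last_name = name  # remember the most recent non-empty label
        filled.append(f"{handle}__{last_name}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```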
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
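At the sequence level, reverse transcription here amounts to replacing uracil with thymine while leaving every other character, including alignment gaps, untouched. A minimal sketch, assuming plain strings as input (the function name is hypothetical):

```python
# Map uracil to thymine in both upper and lower case.
_RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna_sequence):
    """Convert an RNA string to DNA; gap characters pass through unchanged."""
    return rna_sequence.translate(_RNA_TO_DNA)


print(reverse_transcribe("AUG-CU."))
```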
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
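The documented order of operations (inclusion, then exclusion, then ID selection) can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, search terms are treated as plain substrings, and "selecting" ids_to_keep is read as keeping only those IDs that survived the earlier steps.

```python
def filter_taxonomy(taxonomy, include=None, exclude=None, ids_to_keep=None):
    """Apply include, exclude, and ID selection in the documented order."""
    kept = dict(taxonomy)
    if include:
        kept = {i: t for i, t in kept.items()
                if any(term in t for term in include)}
    if exclude:
        kept = {i: t for i, t in kept.items()
                if not any(term in t for term in exclude)}
    if ids_to_keep is not None:
        kept = {i: t for i, t in kept.items() if i in ids_to_keep}
    return kept


taxonomy = {
    "s1": "d__Bacteria; p__Bacillota",
    "s2": "d__Bacteria; p__Pseudomonadota",
    "s3": "d__Archaea; p__Halobacteriota",
}
print(filter_taxonomy(taxonomy, include=["Bacteria"], exclude=["Pseudomonadota"]))
```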
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
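Seeded fractional subsampling can be illustrated with Python's standard library. The sketch below draws an exact count of sequences; whether the plugin draws an exact count or keeps each sequence with the given probability is not stated here, so treat that choice (and the function name) as an assumption.

```python
import random


def subsample_sequences(sequences, subsample_size=0.1, random_seed=1):
    """Return a seeded random subset holding a fraction of the sequences."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1].")
    rng = random.Random(random_seed)
    n_keep = round(len(sequences) * subsample_size)
    # Sort the IDs first so the draw depends only on the seed, not dict order.
    chosen = rng.sample(sorted(sequences), n_keep)
    return {seq_id: sequences[seq_id] for seq_id in chosen}


pool = {f"seq{i:02d}": "ACGT" for i in range(20)}
subset = subsample_sequences(pool, subsample_size=0.1, random_seed=1)
print(len(subset))
```

Reusing the same seed yields the same subset, which is the purpose of the random_seed parameter.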
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If the number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing from 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent-encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing from 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
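The percent-encoding required for the rql_query parameter can be produced with Python's standard `urllib.parse.quote`. The helper `build_rql_in` below is hypothetical and only assembles the "in" operator shown in the parameter description; the `quote_values` switch is an assumption of this sketch, used to wrap text values in encoded double quotes.

```python
from urllib.parse import quote


def build_rql_in(data_field, values, quote_values=False):
    """Build an RQL 'in' query string with percent-encoded values."""
    parts = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        # safe="" forces every reserved character to be percent-encoded.
        parts.append(quote(text, safe=""))
    joined = ",".join(parts)
    return f"in({data_field},({joined}))"


print(build_rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(build_rql_in("species",
                   ["Bacillus subtilis", "Bacteroidales bacterium"],
                   quote_values=True))
```

The two calls reproduce the example queries given in the rql_query description.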
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent-encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
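The percent-encoding rule described for "rql_query" can be sketched in Python. This is a minimal illustration only: the helper name `rql_in` is hypothetical and not part of RESCRIPt or BV-BRC, and the choice to quote only values that contain spaces is an assumption of this sketch.

```python
from urllib.parse import quote


def rql_in(data_field, values):
    """Build an RQL "in" query string (hypothetical helper, not a RESCRIPt API)."""
    encoded = []
    for value in values:
        # Assumption: only values containing spaces are wrapped in double quotes.
        text = f'"{value}"' if " " in value else value
        # quote() with safe="" turns '"' into %22 and ' ' into %20.
        encoded.append(quote(text, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
```

The two calls reproduce the two query strings quoted in the parameter description.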
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here, "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. For example, the values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20): "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
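The "rank_propagation" behaviour can be illustrated with a short Python sketch. The function name `propagate_ranks` and the "; " separator are assumptions made for illustration; the plugin's own implementation may differ.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the nearest non-empty upper-level name (illustrative only)."""
    filled = []
    last_name = ""
    for label in taxonomy.split(sep):
        # Split a label such as 'g__Pasteurella' into handle 'g' and name 'Pasteurella'.
        handle, _, name = label.partition("__")
        if name:
            last_name = name
        # An empty name inherits the most recent non-empty upper-level name.
        filled.append(f"{handle}__{name or last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```

This reproduces the example from the parameter description, where 'g__' becomes 'g__Pasteurellaceae'.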
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
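As a rough illustration of what kmer entropy measures, the sketch below computes Shannon entropy (in bits) over all kmers pooled across a set of sequences. The function name `kmer_entropy` and the pooling of kmers across sequences are assumptions of this sketch; the visualizer's exact calculation may differ.

```python
from collections import Counter
from math import log2


def kmer_entropy(sequences, k):
    """Shannon entropy of the pooled kmer frequency distribution (illustrative only)."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    # Sum of -p * log2(p) over all observed kmers.
    return -sum((n / total) * log2(n / total) for n in counts.values())


print(kmer_entropy(["ACGT", "ACGA", "TTTT"], k=2))
```

Larger values indicate a more even spread of kmers; a set made of a single repeated kmer has entropy 0.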
Outputs¶
- visualization:
Visualization
<no description>[required]
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
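The "auto" setting of "reads_per_batch" is described as min(number of query sequences / n_jobs, 20000). A minimal sketch of that arithmetic follows; the function name is hypothetical, and rounding up, the floor of 1, and n_jobs being at least 1 are assumptions not stated in the description.

```python
def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Autoscaled batch size: min(n_queries / n_jobs, cap) (illustrative only)."""
    # Ceiling division is an assumption; the description does not state the rounding.
    per_job = -(-n_queries // n_jobs)
    return max(1, min(per_job, cap))


print(auto_reads_per_batch(1000, 4))
print(auto_reads_per_batch(1_000_000, 4))
```

With few queries the batch size shrinks so that every job receives work; with many queries it is capped at 20000.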
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, which are labeled numerically instead of by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
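To make the numbered x-axis concrete, the sketch below computes a simplified per-level accuracy: the fraction of features whose expected and observed labels agree when both are truncated to a given level. This is an illustration only; the function name is hypothetical and the metrics actually reported by the visualization may differ.

```python
def accuracy_by_level(expected, observed, sep=";"):
    """Fraction of shared features with matching labels at each level (illustrative only)."""
    def ranks(label):
        return [r.strip() for r in label.split(sep)]

    shared = expected.keys() & observed.keys()
    depth = max(len(ranks(expected[f])) for f in shared)
    result = {}
    for level in range(1, depth + 1):
        # Compare the first `level` ranks of the expected and observed labels.
        hits = sum(
            ranks(expected[f])[:level] == ranks(observed[f])[:level]
            for f in shared
        )
        result[level] = hits / len(shared)
    return result


expected = {"s1": "k__A; p__B; c__C", "s2": "k__A; p__B; c__D"}
observed = {"s1": "k__A; p__B; c__C", "s2": "k__A; p__X; c__D"}
print(accuracy_by_level(expected, observed))
```

Level 1 corresponds to the first rank in the taxonomy strings (domain/kingdom in a typical 7-level taxonomy), level 2 to the second, and so on.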
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, which are labeled numerically instead of by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
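The effect of stripping rank handles can be sketched as follows: once the handle is removed, an empty remainder marks an unclassified level. The function name is hypothetical, and the pattern "^[dkpcofgs]__" (handle letter plus double underscore) is chosen here for illustration.

```python
import re


def classified_per_level(taxonomies, rank_handle_regex="^[dkpcofgs]__", sep=";"):
    """Count features with a non-empty label at each level (illustrative only)."""
    counts = {}
    for label in taxonomies:
        for level, rank in enumerate(label.split(sep), start=1):
            # Strip the rank handle; what remains is the actual taxon name.
            name = re.sub(rank_handle_regex, "", rank.strip())
            if name:
                counts[level] = counts.get(level, 0) + 1
    return counts


labels = ["d__Bacteria; p__Firmicutes; c__", "d__Bacteria; p__"]
print(classified_per_level(labels))
```

Levels that never carry a name after stripping (level 3 in this input) do not appear in the result.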
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]
- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if sequences are not downloaded, a silva_sequences output is still created, but contains no data.[default:
True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
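The "version" and "target" choices are listed as three paired groups. Assuming the groups pair up in the order shown (an inference from the listing, not something the text states outright), the allowed combinations can be sketched as a lookup table. The names below are hypothetical.

```python
# Assumed pairing of version groups with target groups, read in listed order.
_TARGETS_128_132 = {"SSURef_NR99", "SSURef", "LSURef"}
_TARGETS_138 = {"SSURef_NR99", "SSURef"}
_TARGETS_138_X = {"SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"}

ALLOWED_TARGETS = {
    "128": _TARGETS_128_132,
    "132": _TARGETS_128_132,
    "138": _TARGETS_138,
    "138.1": _TARGETS_138_X,
    "138.2": _TARGETS_138_X,
}


def is_valid_combination(version="138.2", target="SSURef_NR99"):
    """Check a version/target pair against the assumed pairing (illustrative only)."""
    return target in ALLOWED_TARGETS.get(version, set())


print(is_valid_combination("138", "LSURef"))
print(is_valid_combination("138.1", "LSURef_NR99"))
```

Under this reading, the defaults ('138.2' with 'SSURef_NR99') form a valid pair, while version '138' offers no LSU target.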
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default:
1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
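Position-based trimming amounts to slicing every aligned sequence at the same columns. The sketch below assumes 1-based positions with an inclusive end; neither convention is stated in the description, and the function name is hypothetical.

```python
def trim_alignment(aligned, position_start=None, position_end=None):
    """Slice all aligned sequences at the same columns (illustrative only)."""
    # Assumption: positions are 1-based and position_end is inclusive.
    start = position_start - 1 if position_start else 0
    return {seq_id: seq[start:position_end] for seq_id, seq in aligned.items()}


alignment = {"seq1": "AC-GT-A", "seq2": "ACTGTCA"}
print(trim_alignment(alignment, position_start=2, position_end=5))
```

Omitting either position leaves that end of the alignment untrimmed, matching the behaviour described for "position_start" and "position_end".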
Reference sequence annotation and curation pipeline.
- version:
2024.10.0
- website: https://github.com/nbokulich/RESCRIPt
- user support:
- Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
- citations:
- Robeson et al., 2021
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]
- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'.[default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
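Two of the merge modes, "len" and "lca", can be sketched in a few lines of Python. This is a simplified illustration: the function names are hypothetical, rank handles are not stripped or re-added, and the real method operates on feature taxonomy tables rather than bare strings.

```python
def split_ranks(label):
    return [rank.strip() for rank in label.split(";")]


def merge_len(labels):
    """Pick the label with the most ranks; on a tie, the last one wins."""
    best = labels[0]
    for label in labels[1:]:
        # ">=" implements "the last taxonomy added takes precedence" on ties.
        if len(split_ranks(label)) >= len(split_ranks(best)):
            best = label
    return best


def merge_lca(labels, unclassified_label="Unassigned"):
    """Return the shared leading ranks of all labels (least common ancestor)."""
    common = []
    for ranks in zip(*(split_ranks(label) for label in labels)):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return "; ".join(common) if common else unclassified_label


print(merge_len(["g__Faecalibacterium", "g__Faecalibacterium; s__prausnitzii"]))
print(merge_lca(["d__X; p__Y; c__Z", "d__X; p__Y; c__W"]))
```

When the inputs share no leading rank at all, the LCA sketch falls back to the "unclassified_label" value.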
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
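The difference between the "uniq" and "majority" modes can be sketched for exact-match duplicates. This is an illustration under stated assumptions: the function name is hypothetical, clustering is reduced to exact string identity rather than VSEARCH, and keeping the first ID of each group as its representative is an arbitrary choice made here.

```python
from collections import Counter, defaultdict


def dereplicate(sequences, taxa, mode="uniq"):
    """Return {kept_id: taxonomy} for identical sequences (illustrative only)."""
    groups = defaultdict(list)
    for seq_id, seq in sequences.items():
        groups[seq].append(seq_id)

    kept = {}
    for ids in groups.values():
        if mode == "uniq":
            # Keep one sequence per distinct taxonomy within the duplicate group.
            seen = set()
            for seq_id in ids:
                if taxa[seq_id] not in seen:
                    seen.add(taxa[seq_id])
                    kept[seq_id] = taxa[seq_id]
        elif mode == "majority":
            # Keep one sequence, labeled with the most common taxonomy.
            label, _ = Counter(taxa[i] for i in ids).most_common(1)[0]
            kept[ids[0]] = label
        else:
            raise ValueError(f"mode not covered by this sketch: {mode}")
    return kept


sequences = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "TTTT"}
taxa = {"a": "g__X; s__y", "b": "g__X; s__y", "c": "g__X; s__z", "d": "g__Q"}
print(dereplicate(sequences, taxa, mode="uniq"))
print(dereplicate(sequences, taxa, mode="majority"))
```

In "uniq" mode the duplicate with a distinct species label survives; in "majority" mode only one representative per sequence remains.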
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]
- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
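The two screening criteria can be sketched as a single predicate. The function name is hypothetical, the set of degenerate IUPAC codes is written out explicitly, and whether runs of degenerate bases count as homopolymers is an assumption of this sketch. RNA input is not handled here.

```python
import re

# IUPAC degenerate (ambiguous) nucleotide codes.
DEGENERATE = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """True if a sequence passes both screening criteria (illustrative only)."""
    seq = seq.upper()
    # Remove sequences with num_degenerates or more degenerate bases.
    if sum(base in DEGENERATE for base in seq) >= num_degenerates:
        return False
    # One character followed by at least (length - 1) repeats of itself.
    homopolymer = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return homopolymer.search(seq) is None


print(passes_cull("ACGTNNNNN"))
print(passes_cull("ACG" + "A" * 7 + "CGT"))
```

With the defaults, five degenerate bases or a run of eight identical bases is enough to remove a sequence.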
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences. Essentially, 'unaligning' the sequences.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
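Degapping is simple enough to sketch directly: drop "-" and "." characters, then discard anything shorter than "min_length". The function name is hypothetical.

```python
def degap(aligned, min_length=1):
    """Remove gap and missing-data characters, then filter by length (illustrative only)."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"a": "AC--GT..", "b": "----"}))
```

A sequence consisting only of gaps has length 0 after degapping and is dropped even at the default "min_length" of 1.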
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
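The pairing of search and replacement strings, and the effect of "use_regex", can be sketched as follows. The function name is hypothetical and the replacement map file is left out; this only illustrates the positional pairing described above.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply paired find/replace terms to taxonomy strings (illustrative only)."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, label in taxonomy.items():
        # The first search string pairs with the first replacement, and so on.
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                label = re.sub(search, replacement, label)
            else:
                label = label.replace(search, replacement)
        edited[feature_id] = label
    return edited


taxonomy = {"s1": "g__Lactobacillus; s__sp."}
print(edit_taxonomy(taxonomy, ["s__sp."], ["s__"]))
print(edit_taxonomy(taxonomy, [r"s__.*$"], ["s__"], use_regex=True))
```

Both calls yield the same edited label here; the regular expression form would also catch other species placeholders.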
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
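When no reference is provided, every input sequence is reverse-complemented. A minimal Python sketch of that operation follows, including IUPAC degenerate codes; the function name is hypothetical and this is not the plugin's own code.

```python
# Complement table for DNA, including IUPAC degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence (illustrative only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```

Applying the function twice returns the original sequence.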
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual threshold for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
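The "most stringent threshold" rule described for the labels parameter can be sketched as follows. This is an illustration only, not the plugin's implementation: the function name passes_length_filter is hypothetical, labels are assumed to match as plain substrings of the taxonomy string, and global thresholds are assumed to combine with label thresholds in the same most-stringent manner.

```python
from typing import Optional, Sequence


def passes_length_filter(
    taxon: str,
    seq_len: int,
    labels: Sequence[str] = (),
    min_lens: Optional[Sequence[int]] = None,
    max_lens: Optional[Sequence[int]] = None,
    global_min: Optional[int] = None,
    global_max: Optional[int] = None,
) -> bool:
    """Return True if a sequence of length `seq_len` would be kept."""
    for lens in (min_lens, max_lens):
        if lens is not None and len(lens) != len(labels):
            raise ValueError("min_lens/max_lens must match labels in length")
    # Start from the global thresholds (None means "no limit").
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    # Collect the thresholds of every label found in the taxonomy string.
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens is not None:
                mins.append(min_lens[i])
            if max_lens is not None:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum.
    if mins and seq_len < max(mins):
        return False
    if maxs and seq_len > min(maxs):
        return False
    return True


labels = ["k__Bacteria", "g__Bacillus"]
min_lens = [900, 1200]
# Matches both labels, so the stricter minimum (1200) applies.
print(passes_length_filter("k__Bacteria; p__Firmicutes; g__Bacillus",
                           1000, labels, min_lens))  # False
# Matches only the kingdom label, so the minimum is 900.
print(passes_length_filter("k__Bacteria; p__Proteobacteria",
                           1000, labels, min_lens))  # True
```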
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
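The rank_propagation behavior can be illustrated with a short sketch. It assumes ranks are separated by "; " and labeled with a "<handle>__" prefix as in the example given for that parameter; the function name propagate_ranks is hypothetical and this is not the plugin's own code.

```python
def propagate_ranks(taxonomy: str, sep: str = "; ") -> str:
    """Fill empty ranks (e.g. 'g__') with the nearest named upper-level rank."""
    filled = []
    last_name = ""
    for rank in taxonomy.split(sep):
        # Split "f__Pasteurellaceae" into the handle "f" and the name.
        prefix, _, name = rank.partition("__")
        if name:
            last_name = name
        # An empty name inherits the most recent non-empty name above it.
        filled.append(f"{prefix}__{name or last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```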
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
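As a rough illustration, reverse transcription in this sense can be sketched as a character substitution. The sketch assumes that the action replaces uracil (U) with thymine (T) without reverse-complementing, which is consistent with aligned input being accepted; the function name is hypothetical.

```python
# Map RNA uracil to DNA thymine, preserving case. All other characters,
# including alignment gaps such as '-' and '.', pass through unchanged.
_RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna: str) -> str:
    """Return the DNA equivalent of an (optionally aligned) RNA string."""
    return rna.translate(_RNA_TO_DNA)


print(reverse_transcribe("AUG-GCU.A"))  # ATG-GCT.A
```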
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]
- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]
- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]
- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]
- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]
- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]
- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
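Seeded, fraction-based subsampling can be sketched as below. Whether the action draws an exact number of sequences or samples each sequence independently is not stated here; this sketch assumes an exact count of round(fraction × total), and the function name subsample_sequences is hypothetical.

```python
import random
from typing import Dict


def subsample_sequences(seqs: Dict[str, str], subsample_size: float = 0.1,
                        random_seed: int = 1) -> Dict[str, str]:
    """Return a reproducible random subset holding a fraction of `seqs`."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in (0, 1]")
    rng = random.Random(random_seed)
    # Assumption: keep at least one sequence when the input is non-empty.
    n = max(1, round(len(seqs) * subsample_size)) if seqs else 0
    chosen = set(rng.sample(sorted(seqs), n))
    # Preserve the original record order in the output.
    return {sid: seq for sid, seq in seqs.items() if sid in chosen}


seqs = {f"seq{i}": "ACGT" for i in range(100)}
print(len(subsample_sequences(seqs, 0.1, random_seed=1)))  # 10
```

Using the same random_seed on the same input yields the same subset, which is what makes the sample reproducible.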
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]
- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]
- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]
- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]
- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]
- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]
- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]
- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
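The page_size behavior, where requests are repeated until all assemblies are fetched, follows a common pagination pattern. The sketch below shows that pattern against an in-memory stand-in; fetch_all, make_fake_source, and the token scheme are all hypothetical and do not reflect the NCBI Datasets API.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def fetch_all(fetch_page: Callable[[int, Optional[str]], Page],
              page_size: int = 20) -> List[str]:
    """Repeat paged requests until no continuation token is returned."""
    if not 20 <= page_size <= 1000:
        raise ValueError("page_size must be between 20 and 1000")
    results: List[str] = []
    token: Optional[str] = None
    while True:
        items, token = fetch_page(page_size, token)
        results.extend(items)
        if token is None:
            return results


def make_fake_source(accessions: List[str]):
    """Hypothetical in-memory stand-in for a paged remote service."""
    def fetch_page(page_size: int, token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = start + page_size
        # A token is returned only while more items remain.
        next_token = str(end) if end < len(accessions) else None
        return accessions[start:end], next_token
    return fetch_page


source = make_fake_source([f"GCF_{i:09d}" for i in range(45)])
print(len(fetch_all(source, page_size=20)))  # 45, collected over three requests
```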
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
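The percent-encoding described for rql_query can be done with Python's standard library. The following sketch builds the two example queries given for that parameter; the helper name rql_in_query and its quoted flag are hypothetical conveniences, not part of the plugin or of BV-BRC.

```python
from typing import Iterable
from urllib.parse import quote


def rql_in_query(data_field: str, values: Iterable[str],
                 quoted: bool = False) -> str:
    """Build an RQL 'in' query; use quoted=True for values containing spaces."""
    encoded = []
    for value in values:
        if quoted:
            value = f'"{value}"'
        # safe="" percent-encodes every reserved character,
        # including double quotes (%22) and spaces (%20).
        encoded.append(quote(value, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in_query("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in_query("species",
                   ["Bacillus subtilis", "Bacteroidales bacterium"],
                   quoted=True))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```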
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
Outputs¶
- visualization:
Visualization
<no description>[required]
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
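The "auto" rule documented for reads_per_batch, min(number of query sequences / n_jobs, 20000), can be written out as a small helper. This is a sketch under stated assumptions: the function name resolve_reads_per_batch is hypothetical, rounding up to a whole number of reads is an assumption, and n_jobs of 0 is resolved to the local CPU count as the n_jobs description suggests.

```python
import math
import os
from typing import Union


def resolve_reads_per_batch(reads_per_batch: Union[int, str],
                            n_queries: int, n_jobs: int = 1) -> int:
    """Resolve 'auto' to min(n_queries / n_jobs, 20000); pass integers through."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # n_jobs == 0 is documented as "use all CPUs".
    workers = n_jobs if n_jobs > 0 else (os.cpu_count() or 1)
    # Rounding up to a whole number of reads is an assumption of this sketch.
    return min(math.ceil(n_queries / workers), 20000)


print(resolve_reads_per_batch("auto", 1000, n_jobs=4))       # 250
print(resolve_reads_per_batch("auto", 1_000_000, n_jobs=2))  # 20000
```

The cap of 20000 bounds the size of each batch for large inputs, while small inputs are split evenly across workers.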
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
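Why labels get truncated before stratification can be illustrated with a small sketch. It assumes one plausible strategy (repeatedly drop the deepest rank from any label that occurs fewer than k times); the pipeline's actual procedure may differ, and `truncate_rare_labels` is a hypothetical name.

```python
from collections import Counter


def truncate_rare_labels(taxonomy, k):
    """Shorten labels until each occurs at least k times (or has one rank left)."""
    labels = {
        fid: [rank.strip() for rank in label.split(";")]
        for fid, label in taxonomy.items()
    }
    while True:
        counts = Counter(tuple(ranks) for ranks in labels.values())
        # A label is "rare" if it cannot be spread across k stratified folds.
        rare = [
            fid for fid, ranks in labels.items()
            if counts[tuple(ranks)] < k and len(ranks) > 1
        ]
        if not rare:
            break
        for fid in rare:
            labels[fid] = labels[fid][:-1]  # drop the deepest rank
    return {fid: "; ".join(ranks) for fid, ranks in labels.items()}


example = {
    "s1": "d__B; p__P; g__A",
    "s2": "d__B; p__P; g__A",
    "s3": "d__B; p__P; g__C",
    "s4": "d__B; p__Q; g__D",
}
print(truncate_rare_labels(example, k=2))
```

In this toy input, s3 and s4 end up labeled only as "d__B", which is why expected_taxonomy labels may be shorter than the input labels.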
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Outputs an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so levels are labeled numerically rather than by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
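The idea of scoring each numbered taxonomic level can be sketched as follows. This is a simplified illustration using plain exact-match accuracy; the metrics actually reported in the visualization may be different, and `accuracy_by_level` is a hypothetical helper.

```python
def accuracy_by_level(expected, observed):
    """Fraction of shared features whose labels agree down to each level."""
    def split(label):
        return [rank.strip() for rank in label.split(";")]

    shared = sorted(set(expected) & set(observed))
    if not shared:
        raise ValueError("expected and observed share no feature IDs")
    depth = max(len(split(expected[fid])) for fid in shared)
    scores = {}
    for level in range(1, depth + 1):
        # Compare the lineage prefix up to and including this level.
        matches = sum(
            split(expected[fid])[:level] == split(observed[fid])[:level]
            for fid in shared
        )
        scores[level] = matches / len(shared)
    return scores


expected = {
    "f1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "f2": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
}
observed = {
    "f1": "d__Bacteria; p__Firmicutes; c__Clostridia",
    "f2": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
}
print(accuracy_by_level(expected, observed))  # {1: 1.0, 2: 1.0, 3: 0.5}
```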
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so levels are labeled numerically rather than by rank. For 7-level taxonomies these typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
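The per-level statistics named above (unique labels, entropy, unclassified features) can be sketched in plain Python. This is an illustration under stated assumptions: entropy is Shannon entropy with the natural logarithm, and a feature counts as unclassified at a level when its label has fewer ranks than that level. `taxonomy_stats` is a hypothetical name.

```python
import math
from collections import Counter


def taxonomy_stats(taxonomy):
    """Per-level unique label count, Shannon entropy, and unclassified count."""
    split = {
        fid: [rank.strip() for rank in label.split(";")]
        for fid, label in taxonomy.items()
    }
    depth = max(len(ranks) for ranks in split.values())
    stats = {}
    for level in range(1, depth + 1):
        # Only features annotated at least to this depth are "classified".
        labels = [
            "; ".join(ranks[:level])
            for ranks in split.values() if len(ranks) >= level
        ]
        counts = Counter(labels)
        total = len(labels)
        entropy = -sum(
            (n / total) * math.log(n / total) for n in counts.values()
        )
        stats[level] = {
            "unique_labels": len(counts),
            "entropy": entropy,
            "unclassified": len(split) - total,
        }
    return stats


print(taxonomy_stats({"a": "d__B; p__X", "b": "d__B; p__Y", "c": "d__B"}))
```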
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a
silva_sequences
output is still created, but contains no data.[default:True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
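The rank_propagation behavior described above can be sketched as a small string transformation. This is an illustration only; it assumes every rank is written as a handle followed by "__" and a possibly empty name, and `propagate_ranks` is a hypothetical helper.

```python
def propagate_ranks(taxonomy):
    """Fill empty ranks with the nearest named upper-level rank."""
    filled = []
    last_name = ""
    for rank in taxonomy.split(";"):
        handle, _, name = rank.strip().partition("__")
        if name:
            last_name = name  # remember the most recent non-empty label
        filled.append(f"{handle}__{name or last_name}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```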
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use
auto
to automatically use all available cores)[default:1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
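Position-based trimming amounts to slicing every aligned sequence at the same columns. The sketch below assumes that position_start and position_end are 1-based and inclusive (suggested by the lower bound of 1, but not stated explicitly); `trim_alignment` here is a hypothetical function, not the plugin's implementation, and it does not cover primer-based trimming.

```python
def trim_alignment(alignment, position_start=None, position_end=None):
    """Slice all aligned sequences to the same column range."""
    if (position_start is not None and position_end is not None
            and position_start >= position_end):
        raise ValueError("position_start must be smaller than position_end")
    # Convert 1-based inclusive positions to Python slice bounds.
    start = 0 if position_start is None else position_start - 1
    end = position_end  # None leaves the 3' end untrimmed
    return {seq_id: seq[start:end] for seq_id, seq in alignment.items()}


alignment = {"s1": "--ACGT-A", "s2": "TTACG-CA"}
print(trim_alignment(alignment, position_start=3, position_end=6))
# {'s1': 'ACGT', 's2': 'ACG-'}
```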
Reference sequence annotation and curation pipeline.
- version: 2024.10.0
- website: https://github.com/nbokulich/RESCRIPt
- user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
- citations: Robeson et al., 2021
Actions¶
Name | Type | Short Description |
---|---|---|
merge-taxa | method | Compare taxonomies and select the longest, highest scoring, or find the least common ancestor. |
dereplicate | method | Dereplicate features with matching sequences and taxonomies. |
cull-seqs | method | Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length. |
degap-seqs | method | Remove gaps from DNA sequence alignments. |
edit-taxonomy | method | Edit taxonomy strings with find and replace terms. |
orient-seqs | method | Orient input sequences by comparison against reference. |
filter-seqs-length-by-taxon | method | Filter sequences by length and taxonomic group. |
filter-seqs-length | method | Filter sequences by length. |
parse-silva-taxonomy | method | Generates a SILVA fixed-rank taxonomy. |
reverse-transcribe | method | Reverse transcribe RNA to DNA sequences. |
get-ncbi-data | method | Download, parse, and import NCBI sequences and taxonomies |
get-ncbi-data-protein | method | Download, parse, and import NCBI protein sequences and taxonomies |
get-gtdb-data | method | Download, parse, and import SSU GTDB reference data. |
get-unite-data | method | Download and import UNITE reference data. |
filter-taxa | method | Filter taxonomy by list of IDs or search criteria. |
subsample-fasta | method | Subsample an indicated number of sequences from a FASTA file. |
extract-seq-segments | method | Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value. |
get-ncbi-genomes | method | Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets. |
get-bv-brc-genomes | method | Get genome sequences from the BV-BRC database. |
get-bv-brc-metadata | method | Fetch BV-BRC metadata. |
get-bv-brc-genome-features | method | Fetch genome features from BV-BRC. |
evaluate-seqs | visualizer | Compute summary statistics on sequence artifact(s). |
evaluate-fit-classifier | pipeline | Evaluate and train naive Bayes classifier on reference sequences. |
evaluate-cross-validate | pipeline | Evaluate DNA sequence reference database via cross-validated taxonomic classification. |
evaluate-classifications | pipeline | Interactively evaluate taxonomic classification accuracy. |
evaluate-taxonomy | pipeline | Compute summary statistics on taxonomy artifact(s). |
get-silva-data | pipeline | Download, parse, and import SILVA database. |
trim-alignment | pipeline | Trim alignment based on provided primers or specific positions. |
Artifact Classes¶
FeatureData[SILVATaxonomy] |
FeatureData[SILVATaxidMap] |
Formats¶
SILVATaxonomyFormat |
SILVATaxonomyDirectoryFormat |
SILVATaxidMapFormat |
SILVATaxidMapDirectoryFormat |
rescript merge-taxa¶
Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.
Citations¶
Inputs¶
- data:
List
[
FeatureData[Taxonomy]
]
Two or more feature taxonomies to be merged.[required]
Parameters¶
- mode:
Str
%
Choices
('len', 'lca', 'score', 'super', 'majority')
How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'len'
]- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default:
'^[dkpcofgs]__'
]- new_rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with
rank_handle_regex
if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default:['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- unclassified_label:
Str
Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default:
'Unassigned'
]
Outputs¶
- merged_data:
FeatureData[Taxonomy]
<no description>[required]
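The "len" and "lca" modes described above can be sketched for a single feature. This is a simplified illustration: it ignores rank-handle stripping and scores, and the function names are hypothetical. The tie rule (the last taxonomy wins) follows the note in the action description.

```python
def split_ranks(label):
    """Split a semicolon-delimited taxonomy into non-empty ranks."""
    return [rank.strip() for rank in label.split(";") if rank.strip()]


def merge_len(labels):
    """Pick the label with the most ranks; later labels win ties."""
    best = labels[0]
    for label in labels[1:]:
        if len(split_ranks(label)) >= len(split_ranks(best)):
            best = label
    return best


def merge_lca(labels, unclassified_label="Unassigned"):
    """Keep the shared lineage prefix (least common ancestor)."""
    common = []
    for level in zip(*(split_ranks(label) for label in labels)):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return "; ".join(common) if common else unclassified_label


genus = "d__Bacteria; p__Firmicutes; g__Faecalibacterium"
species = genus + "; s__prausnitzii"
print(merge_len([genus, species]))  # the species-level label
print(merge_lca([genus, species]))  # the genus-level label
```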
rescript dereplicate¶
Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be dereplicated[required]
- taxa:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated[required]
Parameters¶
- mode:
Str
%
Choices
('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default:
'uniq'
]- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
1.0
]- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- rank_handles:
List
[
Str
%
Choices
('disable')
]
|
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default:
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]- derep_prefix:
Bool
Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default:
False
]
Outputs¶
- dereplicated_sequences:
FeatureData[Sequence]
<no description>[required]
- dereplicated_taxa:
FeatureData[Taxonomy]
<no description>[required]
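The difference between the "uniq" and "majority" modes can be sketched for exact duplicates. This illustration groups sequences by exact string identity only (the equivalent of perc_identity 1.0) and does not call vsearch; `dereplicate` here is a hypothetical function, and ties in "majority" are resolved arbitrarily, as the parameter text warns.

```python
from collections import Counter, defaultdict


def dereplicate(sequences, taxa, mode="uniq"):
    """Collapse identical sequences; returns {id: (sequence, taxonomy)}."""
    groups = defaultdict(list)
    for seq_id, seq in sequences.items():
        groups[seq].append(seq_id)

    kept = {}
    for seq, ids in groups.items():
        if mode == "uniq":
            # Keep one representative per distinct taxonomy label.
            seen = set()
            for seq_id in ids:
                if taxa[seq_id] not in seen:
                    seen.add(taxa[seq_id])
                    kept[seq_id] = (seq, taxa[seq_id])
        elif mode == "majority":
            # Keep one sequence, labeled with the most common taxonomy.
            label, _ = Counter(taxa[i] for i in ids).most_common(1)[0]
            kept[ids[0]] = (seq, label)
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return kept


sequences = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxa = {"a": "g__X; s__one", "b": "g__X; s__two",
        "c": "g__X; s__two", "d": "g__Y; s__three"}
print(sorted(dereplicate(sequences, taxa, "uniq")))      # ['a', 'b', 'd']
print(dereplicate(sequences, taxa, "majority")["a"][1])  # g__X; s__two
```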
rescript cull-seqs¶
Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence]
DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]
Parameters¶
- num_degenerates:
Int
%
Range
(1, None)
Sequences with N, or more, degenerate bases will be removed.[default:
5
]- homopolymer_length:
Int
%
Range
(2, None)
Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default:
8
]- n_jobs:
Int
%
Range
(1, None)
Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default:
1
]
Outputs¶
- clean_sequences:
FeatureData[Sequence]
The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
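The two screening criteria can be sketched with a character count and a regular expression. This is an illustration, not the plugin's implementation: `passes_cull` is a hypothetical helper, and treating every IUPAC code other than A, C, G, and T as degenerate is an assumption.

```python
import re

# IUPAC nucleotide codes that stand for more than one base.
DEGENERATES = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence survives both screening criteria."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    n_degenerate = sum(base in DEGENERATES for base in seq)
    if n_degenerate >= num_degenerates:
        return False
    # One character followed by at least (length - 1) copies of itself.
    homopolymer = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return homopolymer.search(seq) is None


print(passes_cull("ACGTACGT"))              # True
print(passes_cull("ACGT" + "N" * 5))        # False: 5 degenerate bases
print(passes_cull("GC" + "A" * 8 + "GT"))   # False: homopolymer of length 8
```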
rescript degap-seqs¶
This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA Sequences to be degapped.[required]
Parameters¶
- min_length:
Int
%
Range
(1, None)
Minimum length of sequence to be returned after degapping.[default:
1
]
Outputs¶
- degapped_sequences:
FeatureData[Sequence]
The resulting unaligned (degapped) DNA sequences.[required]
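Degapping is a plain character deletion followed by a length check, which can be sketched as below. The function name `degap` is hypothetical, and dropping sequences shorter than min_length is how this sketch reads the parameter description.

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' characters; drop sequences shorter than min_length."""
    table = str.maketrans("", "", "-.")
    degapped = {}
    for seq_id, seq in aligned.items():
        seq = seq.translate(table)
        if len(seq) >= min_length:
            degapped[seq_id] = seq
    return degapped


print(degap({"s1": "AC--GT..", "s2": "----"}))  # {'s1': 'ACGT'}
```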
rescript edit-taxonomy¶
A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can either provide two separate lists of strings, i.e., 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search/replacement strings and a replacement map file are provided, they will be merged.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy strings data to be edited.[required]
Parameters¶
- replacement_map:
MetadataColumn
[
Categorical
]
A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
- search_strings:
List
[
Str
]
Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
- replacement_strings:
List
[
Str
]
Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
- use_regex:
Bool
Toggle regular expressions. By default, only literal substring matching is performed.[default:
False
]
Outputs¶
- edited_taxonomy:
FeatureData[Taxonomy]
Taxonomy in which the original strings are replaced by user-supplied strings.[required]
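The pairing of search strings with replacement strings, and the effect of use_regex, can be sketched as follows. This is an illustration with a hypothetical `edit_taxonomy` function; applying the pairs in list order is an assumption, and the replacement map file is not modeled.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings,
                  use_regex=False):
    """Apply each search/replacement pair, in order, to every label."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("need one replacement string per search string")
    edited = {}
    for fid, label in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                label = re.sub(search, replacement, label)
            else:
                label = label.replace(search, replacement)  # literal match
        edited[fid] = label
    return edited


taxonomy = {"f1": "k__Bacteria; p__Firmicutes"}
print(edit_taxonomy(taxonomy, ["k__"], ["d__"]))
# {'f1': 'd__Bacteria; p__Firmicutes'}
print(edit_taxonomy(taxonomy, [r"p__\w+"], ["p__Bacillota"], use_regex=True))
# {'f1': 'k__Bacteria; p__Bacillota'}
```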
rescript orient-seqs¶
Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be oriented.[required]
- reference_sequences:
FeatureData[Sequence]
Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
Parameters¶
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]- dbmask:
Str
%
Choices
('none', 'dust', 'soft')
Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
- relabel:
Str
Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_keep:
Bool
When relabeling, keep the original identifier in the header after a space.[optional]
- relabel_md5:
Bool
When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
- relabel_self:
Bool
Relabel sequences using the sequence itself as a label.[optional]
- relabel_sha1:
Bool
When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
- sizein:
Bool
In de novo mode, abundance annotations (pattern
[>;]size=integer[;]
) present in sequence headers are taken into account.[optional]- sizeout:
Bool
Add abundance annotations to the output FASTA files.[optional]
Outputs¶
- oriented_seqs:
FeatureData[Sequence]
Query sequences in same orientation as top matching reference sequence.[required]
- unmatched_seqs:
FeatureData[Sequence]
Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
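When no reference is supplied, every input sequence is reverse-complemented. A minimal sketch of that operation is shown below; `reverse_complement` is a hypothetical helper that uses the standard IUPAC complement pairs and does not involve VSEARCH.

```python
# Standard IUPAC complements: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H;
# S, W, and N are their own complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # CGTT
print(reverse_complement("ATGR"))  # YCAT
```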
rescript filter-seqs-length-by-taxon¶
Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.
Citations¶
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomic classifications of sequences to be filtered.[required]
Parameters¶
- labels:
List
[
Str
]
One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
- min_lens:
List
[
Int
%
Range
(1, None)
]
Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
- max_lens:
List
[
Int
%
Range
(1, None)
]
Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
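The "most stringent threshold" rule for sequences that match several labels can be sketched as below. This is an illustration with a hypothetical function name; matching a label by substring search in the taxonomy string, and folding the global thresholds into the same most-stringent comparison, are assumptions.

```python
def filter_by_length(sequences, taxonomy, labels, min_lens=None,
                     max_lens=None, global_min=None, global_max=None):
    """Split sequences into (filtered, discarded) by length thresholds."""
    filtered, discarded = {}, {}
    for seq_id, seq in sequences.items():
        mins = [global_min] if global_min is not None else []
        maxs = [global_max] if global_max is not None else []
        for i, label in enumerate(labels):
            if label in taxonomy[seq_id]:
                if min_lens:
                    mins.append(min_lens[i])
                if max_lens:
                    maxs.append(max_lens[i])
        # Most stringent: the longest minimum and the shortest maximum.
        too_short = bool(mins) and len(seq) < max(mins)
        too_long = bool(maxs) and len(seq) > min(maxs)
        target = discarded if (too_short or too_long) else filtered
        target[seq_id] = seq
    return filtered, discarded


sequences = {"s1": "A" * 1200, "s2": "A" * 950, "s3": "A" * 1300}
taxonomy = {"s1": "d__Bacteria; g__Bacillus",
            "s2": "d__Bacteria; g__Bacillus",
            "s3": "d__Archaea; g__Haloferax"}
kept, dropped = filter_by_length(
    sequences, taxonomy, labels=["Bacteria", "Bacillus"], min_lens=[900, 1000])
print(sorted(kept), sorted(dropped))  # ['s1', 's3'] ['s2']
```

Here s2 passes the kingdom-level minimum (900) but fails the nested genus-level minimum (1000), so it is discarded.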
rescript filter-seqs-length¶
Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- sequences:
FeatureData[Sequence]
Sequences to be filtered by length.[required]
Parameters¶
- global_min:
Int
%
Range
(1, None)
The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
- global_max:
Int
%
Range
(1, None)
The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- filtered_seqs:
FeatureData[Sequence]
Sequences that pass the filtering thresholds.[required]
- discarded_seqs:
FeatureData[Sequence]
Sequences that fall outside the filtering thresholds.[required]
rescript parse-silva-taxonomy¶
Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels
is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Inputs¶
- taxonomy_tree:
Phylogeny[Rooted]
SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
- taxonomy_map:
FeatureData[SILVATaxidMap]
SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
- taxonomy_ranks:
FeatureData[SILVATaxonomy]
SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]
Parameters¶
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
The resulting fixed-rank formatted SILVA taxonomy.[required]
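The fixed-rank output format (six ranks, or seven with species labels) can be sketched as a string builder over the rank handles listed in the description. This is an illustration only: `fixed_rank_taxonomy` and the dictionary-based lineage input are hypothetical, and handles for ranks outside the default set are not covered.

```python
# Rank handles as listed in the action description.
RANK_HANDLES = {
    "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}
DEFAULT_RANKS = ("domain", "phylum", "class", "order", "family", "genus")


def fixed_rank_taxonomy(lineage, ranks=DEFAULT_RANKS, species=None):
    """Build a fixed-rank string; ranks missing from the lineage stay empty."""
    parts = [RANK_HANDLES[rank] + lineage.get(rank, "") for rank in ranks]
    if species is not None:
        # Mirrors include_species_labels: append a seventh, species-level rank.
        parts.append(RANK_HANDLES["species"] + species)
    return "; ".join(parts)


lineage = {"domain": "Bacteria", "phylum": "Proteobacteria",
           "class": "Gammaproteobacteria", "order": "Pasteurellales",
           "family": "Pasteurellaceae"}
print(fixed_rank_taxonomy(lineage))
```

The genus rank is empty in this example ("g__"), which is the situation rank_propagation is meant to address.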
rescript reverse-transcribe¶
Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.
Citations¶
Inputs¶
- rna_sequences:
FeatureData[AlignedRNASequence¹ | RNASequence²]
RNA Sequences to reverse transcribe to DNA.[required]
Outputs¶
- dna_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Reverse-transcribed DNA sequences.[required]
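At the string level, converting RNA to DNA replaces uracil with thymine and leaves every other character, including alignment gaps, untouched. A minimal sketch with a hypothetical helper name:

```python
U_TO_T = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    """Replace U with T; gap characters in aligned input are preserved."""
    return rna.translate(U_TO_T)


print(reverse_transcribe("AUG-CU."))  # ATG-CT.
```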
rescript get-ncbi-data¶
Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Nucleotide database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Nucleotide database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[Sequence]
Sequences from the NCBI Nucleotide database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-ncbi-data-protein¶
Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database.
Please be aware of the NCBI Disclaimer and Copyright notice (https://
Citations¶
Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012
Parameters¶
- query:
Str
Query on the NCBI Protein database[optional]
- accession_ids:
Metadata
List of accession ids for sequences in the NCBI Protein database.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
Propagate known ranks to missing ranks if true[default:
True
]
- logging_level:
Str
%
Choices
('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
- n_jobs:
Int
%
Range
(1, None)
Number of concurrent download connections. More is faster until you run out of bandwidth.[default:
1
]
Outputs¶
- sequences:
FeatureData[ProteinSequence]
Sequences from the NCBI Protein database[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database[required]
rescript get-gtdb-data¶
Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://
Citations¶
Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021
Parameters¶
- version:
Str
%
Choices
('202.0', '207.0', '214.0', '214.1', '220.0')
GTDB database version to download.[default:
'220.0'
]
- domain:
Str
%
Choices
('Both', 'Bacteria', 'Archaea')
SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default:
'Both'
]
- db_type:
Str
%
Choices
('All', 'SpeciesReps')
'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default:
'SpeciesReps'
]
Outputs¶
- gtdb_taxonomy:
FeatureData[Taxonomy]
SSU GTDB reference taxonomy.[required]
- gtdb_sequences:
FeatureData[Sequence]
SSU GTDB reference sequences.[required]
rescript get-unite-data¶
Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://
Citations¶
Robeson et al., 2021; Nilsson et al., 2018
Parameters¶
- version:
Str
%
Choices
('10.0', '9.0', '8.3', '8.2')
UNITE version to download.[default:
'10.0'
]
- taxon_group:
Str
%
Choices
('fungi', 'eukaryotes')
Download a database with only 'fungi' or including all 'eukaryotes'.[default:
'eukaryotes'
]
- cluster_id:
Str
%
Choices
('99', '97', 'dynamic')
Percent similarity at which sequences in the database were clustered.[default:
'99'
]
- singletons:
Bool
Include singleton clusters in the database.[default:
False
]
Outputs¶
- taxonomy:
FeatureData[Taxonomy]
UNITE reference taxonomy.[required]
- sequences:
FeatureData[Sequence]
UNITE reference sequences.[required]
rescript filter-taxa¶
Filter taxonomy by list of IDs or search criteria.
Citations¶
Inputs¶
- taxonomy:
FeatureData[Taxonomy]
Taxonomy to filter.[required]
Parameters¶
- ids_to_keep:
Metadata
List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
- include:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
- exclude:
List
[
Str
]
List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
Outputs¶
- filtered_taxonomy:
FeatureData[Taxonomy]
The filtered taxonomy.[required]
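The filtering order described above (inclusion first, then exclusion) can be sketched in plain Python. This is only an illustration: `filter_labels` is a hypothetical helper, not part of RESCRIPt, and it assumes case-sensitive substring matching, which the parameter descriptions do not specify. The final `ids_to_keep` selection is left out because the text does not say how it combines with the search results.

```python
def filter_labels(taxonomy, include=None, exclude=None):
    """Apply inclusion filtering, then exclusion filtering, to an ID -> taxon dict.

    Hypothetical helper; substring matching is an assumption.
    """
    kept = dict(taxonomy)
    if include:
        # Retain taxa containing one or more of the include terms.
        kept = {i: t for i, t in kept.items()
                if any(term in t for term in include)}
    if exclude:
        # Then drop taxa containing one or more of the exclude terms.
        kept = {i: t for i, t in kept.items()
                if not any(term in t for term in exclude)}
    return kept


taxonomy = {
    "seq1": "k__Bacteria; p__Firmicutes",
    "seq2": "k__Bacteria; p__Proteobacteria",
    "seq3": "k__Archaea; p__Euryarchaeota",
}
result = filter_labels(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"])
print(sorted(result))
```

Because exclusion runs second, a taxon that matches both an include term and an exclude term is removed.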
rescript subsample-fasta¶
Subsample a set of sequences (either plain or aligned DNA) based on a fraction of original sequences.
Citations¶
Inputs¶
- sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sequences to subsample from.[required]
Parameters¶
- subsample_size:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Size of the random sample as a fraction of the total count[default:
0.1
]
- random_seed:
Int
%
Range
(1, None)
Seed to be used for random sampling.[default:
1
]
Outputs¶
- sample_sequences:
FeatureData[AlignedSequence¹ | Sequence²]
Sample of original sequences.[required]
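As a rough illustration of fraction-based subsampling with a fixed seed, the sketch below draws a reproducible sample of sequence IDs. The helper `subsample_ids` is hypothetical, and drawing exactly `round(n * subsample_size)` items is an assumption; the plugin may select sequences differently (for example, by keeping each one independently with that probability).

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Return a reproducible random sample of IDs (hypothetical helper)."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    rng = random.Random(random_seed)  # private generator, seeded for repeatability
    n_keep = round(len(seq_ids) * subsample_size)  # assumed sample-size rule
    return rng.sample(list(seq_ids), n_keep)


ids = [f"seq{i}" for i in range(100)]
sample = subsample_ids(ids, subsample_size=0.1, random_seed=1)
print(len(sample))
```

Re-running with the same `random_seed` returns the same IDs, which is the point of exposing the seed as a parameter.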
rescript extract-seq-segments¶
This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.
Citations¶
Robeson et al., 2021; Rognes et al., 2016
Inputs¶
- input_sequences:
FeatureData[Sequence]
Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
- reference_segment_sequences:
FeatureData[Sequence]
Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]
Parameters¶
- perc_identity:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default:
0.7
]
- target_coverage:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default:
0.9
]
- min_seq_len:
Int
%
Range
(1, None)
Minimum length of sequence allowed for searching. Any sequence less than this will be discarded. If not set, default program settings will be used.[optional]
- threads:
Int
%
Range
(1, 256)
Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default:
1
]
Outputs¶
- extracted_sequence_segments:
FeatureData[Sequence]
Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
- unmatched_sequences:
FeatureData[Sequence]
Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]
rescript get-ncbi-genomes¶
Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.
Citations¶
Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020
Parameters¶
- taxon:
Str
NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.[required]
- assembly_source:
Str
%
Choices
('refseq', 'genbank', 'all')
Fetch only RefSeq or GenBank genome assemblies.[default:
'refseq'
]
- assembly_levels:
List
[
Str
%
Choices
('complete_genome', 'chromosome', 'scaffold', 'contig')
]
Fetch only genome assemblies that are one of the specified assembly levels.[default:
['complete_genome']
]
- only_reference:
Bool
Fetch only reference and representative genome assemblies.[default:
True
]
- only_genomic:
Bool
Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default:
False
]
- tax_exact_match:
Bool
If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default:
False
]
- page_size:
Int
%
Range
(20, 1000, inclusive_end=True)
The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default:
20
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default:
['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genome_assemblies:
FeatureData[Sequence]
Nucleotide sequences of requested genomes.[required]
- loci:
GenomeData[Loci]
Loci features of requested genomes.[required]
- proteins:
GenomeData[Proteins]
Protein sequences originating from requested genomes.[required]
- taxonomies:
FeatureData[Taxonomy]
Taxonomies of requested genomes.[required]
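The rank propagation rule described for `rank_propagation` can be sketched as follows. This is a minimal illustration, not the plugin's code: `propagate_ranks` is a hypothetical helper that assumes labels of the form `prefix__Name` joined by "; ".

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the nearest named upper-level rank (hypothetical helper)."""
    last_name = ""
    filled = []
    for label in taxonomy.split(sep):
        prefix, _, name = label.partition("__")
        if name:
            last_name = name  # remember the most recent non-empty rank
        filled.append(f"{prefix}__{name or last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```

With the example from the parameter description, the empty genus label receives the family name.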
rescript get-bv-brc-genomes¶
Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent-encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genomes:
GenomeData[DNASequence]
Genome sequences for specified query.[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
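The percent-encoding described for `rql_query` can be done with Python's standard library. The sketch below builds the two example queries from the parameter description; `rql_in` is a hypothetical helper, not part of RESCRIPt or BV-BRC.

```python
from urllib.parse import quote


def rql_in(field, values, quote_values=False):
    """Build an RQL "in" query with percent-encoded values (hypothetical helper)."""
    encoded = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        # safe="" encodes every reserved character: '"' -> %22, ' ' -> %20
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"],
             quote_values=True))
```

Numeric IDs pass through unchanged because digits and periods do not need encoding; only the species names pick up `%22` and `%20`.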
rescript get-bv-brc-metadata¶
Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- data_type:
Str
%
Choices
('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology')
BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent-encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
Outputs¶
- metadata:
ImmutableMetadata
BV-BRC metadata of specified data type.[required]
rescript get-bv-brc-genome-features¶
Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://
Citations¶
Robeson et al., 2021; Olson et al., 2023
Parameters¶
- ids_metadata:
MetadataColumn
[
Numeric
|
Categorical
]
A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
- rql_query:
Str
Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. It is important to percent-encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
- data_field:
Str
Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
- ids:
List
[
Str
]
IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')
]
List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
Outputs¶
- genes:
GenomeData[Genes]
Gene[required]
- proteins:
GenomeData[Proteins]
proteins[required]
- taxonomy:
FeatureData[Taxonomy]
Taxonomy data for all sequences.[required]
- loci:
GenomeData[Loci]
loci[required]
rescript evaluate-seqs¶
Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.
Citations¶
Inputs¶
- sequences:
List
[
FeatureData[Sequence]
]
One or more sets of sequences to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- kmer_lengths:
List
[
Int
%
Range
(1, None)
]
Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
- subsample_kmers:
Float
%
Range
(0, 1, inclusive_start=False, inclusive_end=True)
Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default:
1.0
]
- palette:
Str
%
Choices
('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic')
Color palette to use for plotting evaluation results.[default:
'viridis'
]
Outputs¶
- visualization:
Visualization
<no description>[required]
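To make the entropy statistics mentioned above more concrete, here is a small Shannon entropy sketch over whole sequences and over kmers. The helpers `shannon_entropy` and `kmers` are hypothetical, and the choice of log base 2 is an assumption; the plugin's exact definition and log base may differ.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seqs = ["ACGT", "ACGT", "TTTT", "GGGG"]
# Sequence entropy: how evenly the distinct sequences are represented.
print(shannon_entropy(seqs))
# Kmer entropy for k=2, pooled over all sequences.
print(shannon_entropy([km for s in seqs for km in kmers(s, 2)]))
```

A set made of one repeated sequence has zero entropy, while a set of all-distinct sequences reaches the maximum for its size.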
rescript evaluate-fit-classifier¶
Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- classifier:
TaxonomicClassifier
Trained naive Bayes taxonomic classifier.[required]
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
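The "auto" rule for `reads_per_batch`, min(number of query sequences / n_jobs, 20000), can be written out as a short function. `auto_reads_per_batch` is a hypothetical helper; using integer division and requiring `n_jobs >= 1` are assumptions made for this sketch.

```python
def auto_reads_per_batch(n_query_seqs, n_jobs, cap=20000):
    """Batch size under the documented 'auto' rule (hypothetical helper)."""
    if n_jobs < 1:
        raise ValueError("this sketch expects an explicit n_jobs >= 1")
    # Integer division is an assumption; the documented formula is
    # min(number of query sequences / n_jobs, 20000).
    return min(n_query_seqs // n_jobs, cap)


print(auto_reads_per_batch(100000, 2))
print(auto_reads_per_batch(1000, 4))
```

For large inputs the cap of 20000 dominates; for small inputs each worker receives roughly an equal share.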
rescript evaluate-cross-validate¶
Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- sequences:
FeatureData[Sequence]
Reference sequences to use for classifier training/testing.[required]
- taxonomy:
FeatureData[Taxonomy]
Reference taxonomy to use for classifier training/testing.[required]
Parameters¶
- k:
Int
%
Range
(2, None)
Number of stratified folds.[default:
3
]
- random_state:
Int
%
Range
(0, None)
Seed used by the random number generator.[default:
0
]
- reads_per_batch:
Int
%
Range
(1, None)
|
Str
%
Choices
('auto')
Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default:
'auto'
]
- n_jobs:
Threads
The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default:
1
]
- confidence:
Float
%
Range
(0, 1, inclusive_end=True)
|
Str
%
Choices
('disable')
Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default:
0.7
]
Outputs¶
- expected_taxonomy:
FeatureData[Taxonomy]
Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
- observed_taxonomy:
FeatureData[Taxonomy]
Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
- evaluation:
Visualization
Visualization of cross-validated accuracy results.[required]
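The effect of the `confidence` parameter on taxonomic depth can be illustrated with a toy example. Everything here is hypothetical: `limit_depth` is not a plugin function, the per-rank confidence values are invented, and truncating at the first rank that falls below the threshold is an assumed reading of "limiting taxonomic depth".

```python
def limit_depth(labels, confidences, confidence=0.7):
    """Keep ranks until the confidence drops below the threshold (hypothetical helper)."""
    if confidence == "disable":
        return list(labels)  # no confidence calculation, keep the full assignment
    kept = []
    for label, conf in zip(labels, confidences):
        if conf < confidence:
            break  # stop at the first rank that is not confident enough
        kept.append(label)
    return kept


labels = ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales"]
confs = [1.0, 0.95, 0.8, 0.55]  # made-up per-rank confidences
print(limit_depth(labels, confs, confidence=0.7))
print(limit_depth(labels, confs, confidence=0))
```

With a threshold of 0 every rank passes, which matches the description of calculating confidence without using it to limit depth.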
rescript evaluate-classifications¶
Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018
Inputs¶
- expected_taxonomies:
List
[
FeatureData[Taxonomy]
]
True taxonomic labels for one or more sets of features.[required]
- observed_taxonomies:
List
[
FeatureData[Taxonomy]
]
Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
Outputs¶
- evaluation:
Visualization
Visualization of classification accuracy results.[required]
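The labeling rule for `labels` (inputs labeled in order, missing labels filled numerically, extra labels ignored) can be sketched as below. `assign_labels` is a hypothetical helper, and numbering unnamed inputs by their 1-based position is an assumption; the description only says they are labeled numerically in sequential order.

```python
def assign_labels(n_inputs, labels=None):
    """Pair each input with a label, filling gaps numerically (hypothetical helper)."""
    labels = list(labels or [])
    return [
        labels[i] if i < len(labels) else str(i + 1)  # assumed 1-based numbering
        for i in range(n_inputs)
    ]


print(assign_labels(3, ["silva"]))
print(assign_labels(2, ["silva", "gtdb", "unused"]))
```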
rescript evaluate-taxonomy¶
Compute summary statistics on taxonomy artifact(s) and visualize as interactive lineplots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies so are labeled numerically instead of by rank, but typically for 7-level taxonomies these will represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.
Citations¶
Robeson et al., 2021; Bokulich et al., 2018
Inputs¶
- taxonomies:
List
[
FeatureData[Taxonomy]
]
One or more taxonomies to evaluate.[required]
Parameters¶
- labels:
List
[
Str
]
List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
- rank_handle_regex:
Str
Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]
Outputs¶
- taxonomy_stats:
Visualization
<no description>[required]
rescript get-silva-data¶
Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://
Citations¶
Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013
Parameters¶
- version:
Str
%
Choices
('128', '132')
|
Str
%
Choices
('138')
|
Str
%
Choices
('138.1', '138.2')
SILVA database version to download.[default:
'138.2'
]
- target:
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef')
|
Str
%
Choices
('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef')
Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default:
'SSURef_NR99'
]
- include_species_labels:
Bool
Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default:
False
]
- rank_propagation:
Bool
If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default:
True
]
- ranks:
List
[
Str
%
Choices
('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')
]
List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
- download_sequences:
Bool
Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default:
True
]
Outputs¶
- silva_sequences:
FeatureData[RNASequence]
SILVA reference sequences.[required]
- silva_taxonomy:
FeatureData[Taxonomy]
SILVA reference taxonomy.[required]
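The `version` and `target` parameters each list three groups of choices. The sketch below assumes those groups pair up positionally (first version group with first target group, and so on); that pairing is inferred from the order of the listing and is not stated explicitly in the text. `SILVA_TARGETS` and `is_supported` are hypothetical names used only for illustration.

```python
# Assumed pairing of version groups with target groups, in listed order.
SILVA_TARGETS = {
    ("128", "132"): {"SSURef_NR99", "SSURef", "LSURef"},
    ("138",): {"SSURef_NR99", "SSURef"},
    ("138.1", "138.2"): {"SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"},
}


def is_supported(version, target):
    """Check a version/target combination against the assumed pairing."""
    for versions, targets in SILVA_TARGETS.items():
        if version in versions:
            return target in targets
    return False  # unknown version


print(is_supported("138.2", "LSURef_NR99"))
print(is_supported("138", "LSURef"))
```

Under this reading, LSU targets are unavailable for version 138 and the NR99 LSU target exists only from 138.1 onward.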
rescript trim-alignment¶
Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over the positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers within the new alignment are identified and used for slicing the original alignment.
Citations¶
Inputs¶
- aligned_sequences:
FeatureData[AlignedSequence]
Aligned DNA sequences.[required]
Parameters¶
- primer_fwd:
Str
Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
- primer_rev:
Str
Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
- position_start:
Int
%
Range
(1, None)
Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
- position_end:
Int
%
Range
(1, None)
Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
- n_threads:
Int
%
Range
(1, None)
Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default:
1
]
Outputs¶
- trimmed_sequences:
FeatureData[AlignedSequence]
Trimmed sequence alignment.[required]
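Position-based trimming amounts to slicing the same column range out of every aligned sequence. The sketch below is a plain-Python illustration; `trim_alignment_columns` is a hypothetical helper, and treating positions as 1-based with an inclusive end is an assumption not confirmed by the parameter descriptions.

```python
def trim_alignment_columns(alignment, position_start=None, position_end=None):
    """Slice alignment columns by position (hypothetical helper).

    Positions are assumed to be 1-based and inclusive.
    """
    start = position_start - 1 if position_start else 0  # untrimmed start if omitted
    end = position_end if position_end else None         # untrimmed end if omitted
    return {seq_id: seq[start:end] for seq_id, seq in alignment.items()}


alignment = {"a": "AC-GTT", "b": "ACCG-T"}
print(trim_alignment_columns(alignment, position_start=2, position_end=4))
```

Every sequence is cut at the same columns, so the result remains a valid alignment of equal-length rows.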