RESCRIPt

Formats¶

rescript merge-taxa¶

Compare taxonomy annotations and choose the best one. Can select the longest taxonomy annotation, the highest scoring, or the least common ancestor. Note: when a tie occurs, the last taxonomy added takes precedence.

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]
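
The "lca" and "len" modes described above can be illustrated with a minimal Python sketch. This is not RESCRIPt's implementation: the function names `lca` and `longest` are hypothetical, taxonomies are assumed to be plain semicolon-delimited strings, and rank-handle stripping (rank_handle_regex) is omitted for brevity.

```python
def lca(taxonomies, delimiter=";", unclassified_label="Unassigned"):
    """Return the shared leading ranks of all taxonomies (hypothetical helper)."""
    split = [[rank.strip() for rank in t.split(delimiter)] for t in taxonomies]
    consensus = []
    # Walk rank by rank and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return "; ".join(consensus) if consensus else unclassified_label


def longest(taxonomies, delimiter=";"):
    """Return the taxonomy with the most ranks; on a tie the last one wins."""
    best = None
    for t in taxonomies:
        if best is None or len(t.split(delimiter)) >= len(best.split(delimiter)):
            best = t
    return best


print(lca(["d__Bacteria; p__Firmicutes; c__Clostridia",
           "d__Bacteria; p__Firmicutes; c__Bacilli"]))
# prints "d__Bacteria; p__Firmicutes"
```

The `>=` comparison in `longest` mirrors the note that the last taxonomy added takes precedence when a tie occurs.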

rescript dereplicate¶

Dereplicate FASTA format sequences and taxonomies wherever sequences and taxonomies match; duplicated sequences and taxonomies are dereplicated using the "mode" parameter to either: retain all sequences that have unique taxonomic annotations even if the sequences are duplicates (uniq); or return only dereplicated sequences labeled by either the least common ancestor (lca) or the most common taxonomic label associated with sequences in that cluster (majority). Note: all taxonomy strings will be coerced to semicolon delimiters without any leading or trailing spaces. If this is not desired, please use 'rescript edit-taxonomy' to make any changes.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]
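
As a rough illustration of the "majority" mode, the sketch below groups identical sequences and labels each group with its most common taxonomy. It is a simplified stand-in, not the plugin's VSEARCH-based method: `dereplicate_majority` is a hypothetical name, inputs are assumed to be plain dicts keyed by sequence ID, and only exact (100% identity) matches are clustered.

```python
from collections import Counter, defaultdict


def dereplicate_majority(seqs, taxa):
    """Collapse identical sequences; label each by its most common taxonomy."""
    clusters = defaultdict(list)
    for seq_id, seq in seqs.items():
        clusters[seq].append(seq_id)

    dereplicated = {}
    for seq, ids in clusters.items():
        # most_common(1) breaks ties by first occurrence here; the plugin
        # documents its own tie-breaking as arbitrary.
        label, _ = Counter(taxa[i] for i in ids).most_common(1)[0]
        # Keep the first ID seen as the cluster representative (assumption).
        dereplicated[ids[0]] = (seq, label)
    return dereplicated


seqs = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "TTTT"}
taxa = {"a": "g__X;s__y", "b": "g__X;s__y", "c": "g__X;s__z", "d": "g__Q"}
print(dereplicate_majority(seqs, taxa))
```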

rescript cull-seqs¶

Filter DNA or RNA sequences that contain ambiguous bases and homopolymers, and output filtered DNA sequences. Removes DNA sequences that have the specified number, or more, of IUPAC compliant degenerate bases. Remaining sequences are removed if they contain homopolymers equal to or longer than the specified length. If the input consists of RNA sequences, they are reverse transcribed to DNA before filtering.

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
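
The two screening criteria can be sketched in plain Python as follows. This is an illustrative approximation rather than the plugin's code: `passes_cull` is a hypothetical name, and restricting homopolymer detection to A, C, G, and T runs is an assumption.

```python
import re

# IUPAC degenerate (ambiguous) nucleotide codes.
DEGENERATES = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence survives both screening criteria."""
    # Treat RNA input as DNA by converting U to T first.
    seq = seq.upper().replace("U", "T")
    if sum(base in DEGENERATES for base in seq) >= num_degenerates:
        return False
    # A base followed by (homopolymer_length - 1) or more copies of itself.
    pattern = r"([ACGT])\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is None


print(passes_cull("ACGT" * 5))        # prints True
print(passes_cull("A" * 8 + "CGT"))   # prints False
```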

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences. Essentially, 'unaligning' the sequences.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]
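
A minimal sketch of the degapping step, assuming sequences are held in a plain dict keyed by ID (the function name `degap` is hypothetical and this is not the plugin's code):

```python
def degap(aligned, min_length=1):
    """Strip '-' and '.' characters, dropping results shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"s1": "AC--GT..", "s2": "----"}))
# prints {'s1': 'ACGT'}
```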

rescript edit-taxonomy¶

A method that allows the user to edit taxonomy strings. This is often used to fix inconsistent and/or incorrect taxonomic annotations. The user can provide two separate lists of strings, i.e. 'search-strings' and 'replacement-strings', on the command line, and/or a single tab-delimited replacement map file containing a list of these strings. In both cases the number of search strings must match the number of replacement strings. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. If both search / replacement strings and a replacement map file are provided, they will be merged.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]
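
The pairing of search and replacement strings can be sketched as below. This is a simplified illustration, not the plugin's implementation: `edit_taxonomy` here is a hypothetical standalone function operating on a dict, and the replacement map file is not modeled.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply each search/replacement pair, in order, to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, tax in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                tax = re.sub(search, replacement, tax)
            else:
                # Literal substring matching, as in the default behavior.
                tax = tax.replace(search, replacement)
        edited[feature_id] = tax
    return edited


print(edit_taxonomy({"x": "k__Bacteria; p__Firmicutes"},
                    ["k__", "Firmicutes"], ["d__", "Bacillota"]))
# prints {'x': 'd__Bacteria; p__Bacillota'}
```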

rescript orient-seqs¶

Orient input sequences by comparison against a set of reference sequences using VSEARCH. This action can also be used to quickly filter out sequences that (do not) match a set of reference sequences in either orientation. Alternatively, if no reference sequences are provided as input, all input sequences will be reverse-complemented. In this case, no alignment is performed, and all alignment parameters (dbmask, relabel, relabel_keep, relabel_md5, relabel_self, relabel_sha1, sizein, sizeout and threads) are ignored.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
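
When no reference is supplied, every input sequence is reverse-complemented. A minimal sketch of that operation in plain Python (the helper name `reverse_complement` is hypothetical; the plugin itself relies on VSEARCH when a reference is given):

```python
# Complement table covering the standard bases and IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # prints CGTT
```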

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Filter sequences by length. Can filter both globally by minimum and/or maximum length, and set individual thresholds for individual taxonomic groups (using the "labels" option). Note that filtering can be performed for multiple taxonomic groups simultaneously, and nested taxonomic filters can be applied (e.g., to apply a more stringent filter for a particular genus, but a less stringent filter for other members of the kingdom). For global length-based filtering without conditional taxonomic filtering, see filter_seqs_length.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]
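
The "most stringent threshold" rule for sequences that match several labels can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: `keep_sequence` is a hypothetical name, labels are matched as plain substrings of the taxonomy string, and global thresholds are simply pooled with the label-specific ones.

```python
def keep_sequence(length, taxon, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    """Return True if a sequence of this length passes all applicable thresholds."""
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum apply.
    if mins and length < max(mins):
        return False
    if maxs and length > min(maxs):
        return False
    return True


labels = ["Bacteria", "Lactobacillus"]
min_lens = [900, 1200]
print(keep_sequence(1000, "d__Bacteria; g__Lactobacillus", labels, min_lens))
# prints False
print(keep_sequence(1000, "d__Bacteria; g__Bacillus", labels, min_lens))
# prints True
```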

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Parses several files from the SILVA reference database to produce a GreenGenes-like fixed rank taxonomy that is 6 or 7 ranks deep, depending on whether or not include_species_labels is applied. The generated ranks (and the rank handles used to label these ranks in the resulting taxonomy) are: domain (d__), phylum (p__), class (c__), order (o__), family (f__), genus (g__), and species (s__). NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://www.arb-silva.de/silva-license-information/ FOR MORE INFORMATION and be aware that earlier versions may be released under a different license.

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]
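
The effect of rank_propagation can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the parser used by the plugin; `propagate_ranks` is a hypothetical name and ranks are assumed to use 'x__' style handles separated by '; '.

```python
def propagate_ranks(taxonomy, delimiter="; "):
    """Fill empty ranks with the nearest non-empty label from a higher rank."""
    filled = []
    last_name = None
    for rank in taxonomy.split(delimiter):
        handle, _, name = rank.partition("__")
        if name:
            last_name = name
        elif last_name is not None:
            name = last_name
        filled.append(f"{handle}__{name}")
    return delimiter.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# prints f__Pasteurellaceae; g__Pasteurellaceae
```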

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
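
In sequence terms, reverse transcription here amounts to replacing uracil with thymine while leaving every other character (including alignment gaps) untouched. A minimal sketch, with the hypothetical helper name `reverse_transcribe`:

```python
# Map U to T in either case; all other characters pass through unchanged.
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    """Convert an RNA string (aligned or unaligned) to its DNA equivalent."""
    return rna.translate(RNA_TO_DNA)


print(reverse_transcribe("ACGU-U."))  # prints ACGT-T.
```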

rescript get-ncbi-data¶

Download and import sequences from the NCBI Nucleotide database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database. Please be aware of the NCBI Disclaimer and Copyright notice (https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests". As a rough guide, if you are downloading more than 125,000 sequences, only run this method at those times. The NCBI servers can be capricious but reward polite persistence. If the download fails and gives you a message that contains the words "Last exception was ReadTimeout", you should probably try again, maybe with more connections. If it fails for any other reason, please create an issue at https://github.com/bokulich-lab/RESCRIPt.

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Download and import sequences from the NCBI Protein database and download, parse, and import the corresponding taxonomies from the NCBI Taxonomy database. Please be aware of the NCBI Disclaimer and Copyright notice (https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests". As a rough guide, if you are downloading more than 125,000 sequences, only run this method at those times. The NCBI servers can be capricious but reward polite persistence. If the download fails and gives you a message that contains the words "Last exception was ReadTimeout", you should probably try again, maybe with more connections. If it fails for any other reason, please create an issue at https://github.com/bokulich-lab/RESCRIPt.

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Download, parse, and import SSU GTDB files, given a version number. Downloads data directly from GTDB, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM GTDB. SEE https://gtdb.ecogenomic.org/about FOR MORE INFORMATION and be aware that earlier versions may be released under a different license.

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Download and import ITS sequences and taxonomy from the UNITE database, given a version number and taxon_group, with the option to select a cluster_id and include singletons. Downloads data directly from UNITE's PlutoF REST API. NOTE: THIS ACTION ACQUIRES DATA FROM UNITE, which is licensed under CC BY-SA 4.0. To learn more, please visit https://unite.ut.ee/cite.php and https://creativecommons.org/licenses/by-sa/4.0/.

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Download, parse, and import SSU PR2 files, given a version number. Downloads data directly from PR2, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM PR2, which is licensed under MIT. To learn more, please visit https://pr2-database.org/ and https://github.com/pr2database/.

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Download and import a variety of mitochondrial DNA gene sequences along with their associated taxonomy from the MIDORI 2 database. Simply provide the database version, the mitochondrial gene of interest, the reference sequence type, and whether reference sequences with unspecified species information should be downloaded. NOTE: THIS ACTION ACQUIRES DATA FROM MIDORI2. To learn more, please visit https://www.reference-midori.info/.

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]
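
The order of inclusion and exclusion filtering can be sketched as follows. This is a partial illustration only: `filter_taxa` here is a hypothetical standalone function, terms are assumed to be case-sensitive substrings, and the final ids_to_keep step is deliberately left out.

```python
def filter_taxa(taxonomy, include=None, exclude=None):
    """Apply inclusion filtering first, then exclusion filtering."""
    kept = dict(taxonomy)
    if include:
        kept = {fid: tax for fid, tax in kept.items()
                if any(term in tax for term in include)}
    if exclude:
        kept = {fid: tax for fid, tax in kept.items()
                if not any(term in tax for term in exclude)}
    return kept


taxonomy = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Proteobacteria",
    "c": "d__Archaea",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"]))
# prints {'a': 'd__Bacteria; p__Firmicutes'}
```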

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]
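
Seeded fractional subsampling can be sketched as below. The plugin's actual sampling strategy is not described on this page, so this is only one plausible approach: `subsample` is a hypothetical name, and drawing a fixed-size sample with `random.Random.sample` is an assumption.

```python
import random


def subsample(seqs, subsample_size=0.1, random_seed=1):
    """Return a reproducible random subset containing a fraction of the input."""
    rng = random.Random(random_seed)
    n = round(len(seqs) * subsample_size)
    # Sort IDs first so the draw does not depend on dict insertion order.
    chosen = rng.sample(sorted(seqs), n)
    return {seq_id: seqs[seq_id] for seq_id in chosen}


seqs = {f"seq{i}": "ACGT" for i in range(100)}
sample = subsample(seqs, subsample_size=0.1, random_seed=1)
print(len(sample))  # prints 10
```

Using the same random_seed on the same input yields the same subset, which is the purpose of exposing the seed as a parameter.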

rescript extract-seq-segments¶

This action provides the ability to extract a region, or segment, of sequence without the need to specify primer pairs. This is very useful in cases when one or more of the primer sequences are not present within the target sequences, which prevents extraction of the (amplicon) region through primer-pair searching. Here, VSEARCH is used to extract these segments based on a reference pool of sequences that only span the region of interest.

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]
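
The effect of rank_propagation described above can be sketched in a few lines. This is an illustrative helper only (propagate_ranks is a hypothetical name, and the '; ' separator and 'x__' handle layout are assumed from the example in the parameter description), not the plugin's own code.

```python
def propagate_ranks(taxon, sep="; "):
    """Fill empty ranks with the nearest non-empty upper-level label."""
    filled = []
    last_label = ""  # most recent non-empty label seen while walking downward
    for rank in taxon.split(sep):
        # Split 'g__Name' into the handle 'g' and the label 'Name'.
        handle, _, label = rank.strip().partition("__")
        if label:
            last_label = label
        # An empty rank inherits the label of the closest upper-level rank.
        filled.append(f"{handle}__{last_label}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# -> f__Pasteurellaceae; g__Pasteurellaceae
```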

rescript get-bv-brc-genomes¶

Fetch genome sequences from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted genomes. By providing IDs/values and a corresponding data field, you can retrieve all genomes associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://www.bv-brc.org/api/doc/ for documentation.

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
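
The percent-encoding rules given for rql_query can be applied programmatically. The sketch below builds an "in" query string with Python's standard urllib.parse.quote; the helper name build_rql_in_query and its quote_values flag are assumptions for this example and are not part of the plugin or the BV-BRC API.

```python
from urllib.parse import quote


def build_rql_in_query(data_field, values, quote_values=False):
    """Build an RQL 'in' query with percent-encoded values (hypothetical helper)."""
    encoded = []
    for value in values:
        # Values containing spaces are wrapped in double quotes before encoding,
        # so '"' becomes %22 and ' ' becomes %20.
        text = f'"{value}"' if quote_values else str(value)
        encoded.append(quote(text, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(build_rql_in_query("genome_id", ["224308.43", "2030927.4755"]))
print(build_rql_in_query(
    "species",
    ["Bacillus subtilis", "Bacteroidales bacterium"],
    quote_values=True,
))
```

The two calls reproduce the example queries from the rql_query description; the resulting string would then be passed as the rql_query parameter.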

rescript get-bv-brc-metadata¶

Fetch BV-BRC metadata for a specific data type. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted results. By providing IDs/values and a corresponding data field, you can retrieve all metadata associated with those specific values in that data field. And as a third option a metadata column can be provided, to use the results from other data types as a new query. Check https://www.bv-brc.org/api/doc/ for documentation.

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Fetch DNA and protein sequences of genome features from BV-BRC. BV-BRC (Bacterial and Viral Bioinformatics Resource Center) is a database for bacterial and viral genomes, annotations, and metadata. There are three ways to query data: You can use an RQL query to refine your search and get targeted features. By providing IDs/values and a corresponding data field, you can retrieve all features associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query. Check https://www.bv-brc.org/api/doc/ for documentation.

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

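genes: GenomeData[Genes]: Gene (DNA) sequences of the fetched genome features.[required]
proteins: GenomeData[Proteins]: Protein sequences of the fetched genome features.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: Loci of the fetched genome features.[required]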

rescript evaluate-seqs¶

Compute summary statistics on sequence artifact(s) and visualize. Summary statistics include the number of unique sequences, sequence entropy, kmer entropy, and sequence length distributions. This action is useful for both reference taxonomies and classification results.

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']

Outputs¶

visualization: Visualization: <no description>[required]
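
To make the "sequence entropy" and "kmer entropy" statistics more concrete, here is a minimal sketch that computes Shannon entropy over whole sequences and over k-mers. The function names, the base-2 logarithm, and the use of overlapping k-mers are assumptions for illustration; the source does not state how the plugin computes these values.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (base 2, an assumption) of the frequency of each item."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def kmer_entropy(sequences, k):
    """Entropy of all overlapping k-mers pooled across the sequences."""
    kmers = (
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    return shannon_entropy(kmers)


seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
print("unique sequences:", len(set(seqs)))
print("sequence entropy:", shannon_entropy(seqs))
print("4-mer entropy:", kmer_entropy(seqs, 4))
```

Because the number of k-mers grows with both sequence count and length, a sketch like this also shows why subsample_kmers can reduce run time on large sequence sets.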

rescript evaluate-fit-classifier¶

Train a naive Bayes classifier on a set of reference sequences, then test performance accuracy on this same set of sequences. This results in a "perfect" classifier that "knows" the correct identity of each input sequence. Such a leaky classifier indicates the upper limit of classification accuracy based on sequence information alone, as misclassifications are an indication of unresolvable kmer profiles. This test simulates the case where all query sequences are present in a fully comprehensive reference database. To simulate more realistic conditions, see evaluate_cross_validate. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS PRODUCTION-READY and can be re-used for classification of other sequences (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Evaluate DNA sequence reference database via cross-validated taxonomic classification. Unique taxonomic labels are truncated to enable appropriate label stratification. See the cited reference (Bokulich et al. 2018) for more details.

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Evaluate taxonomic classification accuracy by comparing one or more sets of true taxonomic labels to the predicted taxonomies for the same set(s) of features. Output an interactive line plot of classification accuracy for each pair of expected/observed taxonomies. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so the levels are labeled numerically instead of by rank; for 7-level taxonomies these will typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of the same sets of features, input in the same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Compute summary statistics on taxonomy artifact(s) and visualize as interactive line plots. Summary statistics include the number of unique labels, taxonomic entropy, and the number of features that are (un)classified at each taxonomic level. This action is useful for both reference taxonomies and classification results. The x-axis in these plots represents the taxonomic levels present in the input taxonomies, so the levels are labeled numerically instead of by rank; for 7-level taxonomies these will typically represent: 1 = domain/kingdom, 2 = phylum, 3 = class, 4 = order, 5 = family, 6 = genus, 7 = species.

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]
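
The numeric taxonomic levels on the x-axis can be understood with a small sketch that counts unique labels at each level of a set of taxonomy strings. This is a simplified illustration: the helper name unique_labels_per_level, the ';' separator, and the choice to count cumulative lineages (rather than bare rank names) are assumptions, and the plugin may compute its statistics differently.

```python
def unique_labels_per_level(taxa, sep=";"):
    """Map level number (1 = first rank) to the count of unique lineages."""
    levels = {}
    for taxon in taxa:
        ranks = [rank.strip() for rank in taxon.split(sep)]
        for depth in range(1, len(ranks) + 1):
            # Use the lineage up to this depth so identical names under
            # different parents are counted separately.
            levels.setdefault(depth, set()).add(tuple(ranks[:depth]))
    return {depth: len(labels) for depth, labels in sorted(levels.items())}


taxa = [
    "d__Bacteria; p__Firmicutes",
    "d__Bacteria; p__Proteobacteria",
    "d__Archaea",
]
print(unique_labels_per_level(taxa))
# -> {1: 2, 2: 2}
```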

rescript get-silva-data¶

Download, parse, and import SILVA database files, given a version number and reference target. Downloads data directly from SILVA, parses the taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts. REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM THE SILVA DATABASE. SEE https://www.arb-silva.de/silva-license-information/ FOR MORE INFORMATION and be aware that earlier versions may be released under a different license.

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Trim an existing alignment based on provided primers or specific, pre-defined positions. Primers take precedence over positions, i.e., if both are provided, positions will be ignored. When using primers in combination with a DNA alignment, a new alignment will be generated to locate primer positions. Subsequently, the start (5'-most) and end (3'-most) positions of the forward and reverse primers located within the new alignment are identified and used for slicing the original alignment. The retention of alignment positions that span the primer locations can be toggled. WARNING: finding alignment positions via primer search can be inefficient for very large alignments and is only recommended for small alignments. For large alignments, providing specific alignment positions is ideal.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]
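
The "len" and "lca" modes can be illustrated with a minimal Python sketch. This is not RESCRIPt's implementation: the function names (split_ranks, merge_len, merge_lca) are hypothetical, taxonomies are assumed to be semicolon-delimited strings, and rank handles are left in place rather than stripped with rank_handle_regex.

```python
def split_ranks(taxonomy):
    """Split a semicolon-delimited taxonomy string into trimmed labels."""
    return [label.strip() for label in taxonomy.split(";") if label.strip()]


def merge_len(taxonomies):
    """Pick the annotation with the most elements; on a tie the last one wins."""
    best = taxonomies[0]
    for candidate in taxonomies[1:]:
        if len(split_ranks(candidate)) >= len(split_ranks(best)):
            best = candidate
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep only the leading ranks that every annotation agrees on."""
    shared = []
    for labels in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(labels)) != 1:
            break  # first disagreement ends the common ancestry
        shared.append(labels[0])
    return "; ".join(shared) if shared else unclassified_label


genus_level = "d__Bacteria; p__Firmicutes; g__Faecalibacterium"
species_level = "d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__prausnitzii"
print(merge_len([genus_level, species_level]))
print(merge_lca([genus_level, "d__Bacteria; p__Bacteroidota"]))
```

In this sketch an empty consensus falls back to the unclassified label, mirroring the role that unclassified_label plays for the LCA modes.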

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]
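
To make the difference between "majority" and the substring collapsing used by "super" concrete, here is a small Python sketch. It is only one possible reading of the parameter description, not the plugin's code; majority_taxon and collapse_substrings are hypothetical names, and the input is assumed to be the list of taxonomy strings attached to one dereplicated sequence.

```python
from collections import Counter


def majority_taxon(taxa):
    """Return the most frequent label; Counter breaks ties by first occurrence."""
    return Counter(taxa).most_common(1)[0][0]


def collapse_substrings(taxa):
    """Drop any label that is a strict prefix of another label in the list."""
    return [
        taxon for taxon in taxa
        if not any(other != taxon and other.startswith(taxon) for other in taxa)
    ]


taxa = [
    "g__Faecalibacterium; s__",
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__",
]
print(majority_taxon(taxa))
print(collapse_substrings(taxa))
```

Here the plain majority vote favors the less specific label, while collapsing first leaves only the more specific, non-contradicting label.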

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
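
The two screening criteria can be sketched in Python as follows. This is an illustrative approximation rather than the plugin's code: passes_cull is a hypothetical name, degenerate bases are assumed to be the IUPAC ambiguity codes, and a homopolymer is assumed to be any single character repeated consecutively.

```python
import re

# IUPAC nucleotide codes other than the unambiguous bases A, C, G, T and U.
DEGENERATE_BASES = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence survives both screening criteria."""
    seq = seq.upper()
    degenerate_count = sum(base in DEGENERATE_BASES for base in seq)
    if degenerate_count >= num_degenerates:
        return False
    # One character followed by at least (homopolymer_length - 1) copies of itself.
    homopolymer = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return homopolymer.search(seq) is None


print(passes_cull("ACGTACGTACGT"))
print(passes_cull("ACGT" + "A" * 8))
```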

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]
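
Degapping as described above amounts to deleting two characters and applying a length cutoff. A minimal Python sketch, assuming sequences are held in a dict keyed by ID (the degap function and this data layout are illustrative, not the plugin's API):

```python
def degap(aligned, min_length=1):
    """Strip '-' and '.' from each sequence and drop results shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        stripped = seq.replace("-", "").replace(".", "")
        if len(stripped) >= min_length:
            degapped[seq_id] = stripped
    return degapped


alignment = {"seq1": "AC--GT..", "seq2": "--------"}
print(degap(alignment))
```

A sequence that consists only of gap characters has length zero after stripping, so it is dropped even with the default min_length of 1.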

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]
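
The pairing of search and replacement strings, and the effect of use_regex, can be sketched in Python. This is a simplified illustration under the assumption that replacements are applied in list order to each taxonomy string; edit_taxonomy here is a hypothetical stand-in, not the plugin function itself.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply each search/replacement pair, in order, to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, taxon in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                taxon = re.sub(search, replacement, taxon)
            else:
                taxon = taxon.replace(search, replacement)  # literal match only
        edited[feature_id] = taxon
    return edited


taxonomy = {"seq1": "d__Bacteria; p__Firmicutes"}
print(edit_taxonomy(taxonomy, ["p__Firmicutes"], ["p__Bacillota"]))
print(edit_taxonomy(taxonomy, [r"p__\w+"], ["p__Unknown"], use_regex=True))
```

With use_regex left at False, a pattern such as `p__\w+` is treated as plain text and matches nothing in the example.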

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than with the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
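
When no reference_sequences are supplied, every input sequence is reverse complemented. The operation itself can be sketched in a few lines of Python; this is a generic illustration (reverse_complement is a hypothetical helper, not the code path the plugin or VSEARCH uses), with IUPAC ambiguity codes mapped to their complements and any other character left unchanged.

```python
# Complement table for unambiguous bases and IUPAC ambiguity codes.
# S, W and N are their own complements, so they need no entry.
COMPLEMENT = str.maketrans(
    "ACGTRYKMBDHVacgtrykmbdhv",
    "TGCAYRMKVHDBtgcayrmkvhdb",
)


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```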

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]
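
The rule for sequences that match several labels (the longest minimum and the shortest maximum apply) can be sketched in Python. The helper names below are hypothetical and label matching is assumed to be a plain substring test; global thresholds are folded in because a sequence must satisfy them as well.

```python
def effective_thresholds(taxon, labels, min_lens=None, max_lens=None,
                         global_min=None, global_max=None):
    """Return the most stringent (min, max) pair that applies to this taxon."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxon:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the largest minimum and the smallest maximum.
    return (max(mins) if mins else None, min(maxs) if maxs else None)


def keep_sequence(seq, taxon, labels, **thresholds):
    """True if the sequence length lies within the effective thresholds."""
    low, high = effective_thresholds(taxon, labels, **thresholds)
    length = len(seq)
    return (low is None or length >= low) and (high is None or length <= high)


taxon = "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"
labels = ["p__Proteobacteria", "c__Gammaproteobacteria"]
print(effective_thresholds(taxon, labels, min_lens=[1200, 1400], max_lens=[1600, 1500]))
```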

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]
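
Rank propagation, as described for the rank_propagation parameter, can be sketched in Python. This is a simplified illustration on an already formatted taxonomy string, not the plugin's implementation; propagate_ranks is a hypothetical name, and rank handles are assumed to end in a double underscore.

```python
def propagate_ranks(taxonomy):
    """Fill empty ranks with the nearest named rank above them."""
    filled = []
    last_name = None
    for label in taxonomy.split(";"):
        label = label.strip()
        handle, sep, name = label.partition("__")
        if not sep:
            # No rank handle found: leave the label untouched.
            filled.append(label)
            continue
        if name:
            last_name = name
        elif last_name is not None:
            name = last_name  # inherit from the closest upper-level rank
        filled.append(f"{handle}__{name}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```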

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
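
At the string level, the conversion can be sketched as replacing uracil with thymine while leaving every other character, including alignment gaps, untouched. The snippet below is an assumption-laden illustration (reverse_transcribe is a hypothetical helper and plain Python strings stand in for sequence objects), not the plugin's own code.

```python
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    """Replace U with T; gaps and all other characters pass through unchanged."""
    return rna.translate(RNA_TO_DNA)


print(reverse_transcribe("AUG-CU.u"))
```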

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]
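
The order of operations (include, then exclude, then ids_to_keep) can be sketched in Python. This is a rough illustration under two assumptions that the parameter text does not confirm: search terms are matched as case-sensitive substrings, and ids_to_keep restricts the result to IDs that survived the earlier steps. The function below is a hypothetical stand-in operating on a plain dict.

```python
def filter_taxa(taxonomy, ids_to_keep=None, include=None, exclude=None):
    """Apply inclusion, then exclusion, then ID selection."""
    kept = dict(taxonomy)
    if include:
        kept = {fid: taxon for fid, taxon in kept.items()
                if any(term in taxon for term in include)}
    if exclude:
        kept = {fid: taxon for fid, taxon in kept.items()
                if not any(term in taxon for term in exclude)}
    if ids_to_keep is not None:
        kept = {fid: taxon for fid, taxon in kept.items() if fid in ids_to_keep}
    return kept


taxonomy = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Bacteroidota",
    "c": "d__Archaea; p__Euryarchaeota",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Firmicutes"]))
```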

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]
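
Seeded fractional subsampling can be sketched in Python as follows. This is only one way to realize the two parameters and may differ from how the plugin draws its sample: subsample_ids is a hypothetical helper, and it assumes the sample size is the rounded product of the fraction and the total count.

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random subset of sequence IDs."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    rng = random.Random(random_seed)  # private generator, so results repeat
    count = round(len(seq_ids) * subsample_size)
    return rng.sample(sorted(seq_ids), count)


ids = [f"seq{i}" for i in range(100)]
print(len(subsample_ids(ids, subsample_size=0.1, random_seed=1)))
```

Sorting the IDs before sampling keeps the result independent of input order, so the same seed gives the same subset.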

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If the number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
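
The percent-encoding rules described for rql_query can be sketched with Python's standard library. The rql_in helper is hypothetical and covers only the "in" operator from the examples: values made up solely of URL-safe characters are passed through, and any other value is wrapped in double quotes and then percent-encoded.

```python
from urllib.parse import quote


def rql_in(data_field, values):
    """Build an RQL in(...) query string with percent-encoded values."""
    encoded = []
    for value in values:
        text = str(value)
        if quote(text, safe="") == text:
            encoded.append(text)  # nothing to escape, e.g. a genome ID
        else:
            # Wrap in double quotes, then encode quotes (%22) and spaces (%20).
            encoded.append(quote(f'"{text}"', safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
```

The two calls reproduce the query strings quoted in the rql_query description.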

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
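
The effect of rank_propagation can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not RESCRIPt's implementation; the function name propagate_ranks is hypothetical, and the sketch assumes every level carries a handle of the form 'x__'.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the label of the nearest upper-level rank."""
    filled = []
    last_label = ""
    for level in taxonomy.split(sep):
        # Split "g__Name" into the handle ("g") and the label ("Name").
        handle, _, label = level.partition("__")
        if label:
            last_label = label
        filled.append(f"{handle}__{last_label}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# prints: f__Pasteurellaceae; g__Pasteurellaceae
```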

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']
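
To give a sense of what a kmer entropy measurement with optional subsampling involves, here is a minimal sketch. It assumes Shannon entropy (in bits) over the pooled kmer counts; the exact calculation used by evaluate-seqs is not specified in this reference, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def kmer_entropy(sequences, k):
    """Shannon entropy (bits) of the pooled k-mer frequency distribution."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def subsample(sequences, fraction, seed=0):
    """Randomly keep a fraction (0 < fraction <= 1) of the input sequences."""
    n = max(1, round(len(sequences) * fraction))
    return random.Random(seed).sample(sequences, n)


seqs = ["ACGT", "ACGA", "TTGA", "CCGT"]
print(kmer_entropy(subsample(seqs, 0.5), k=2))
```

Because every kmer of every sequence is counted, the cost grows with both the number of sequences and their length, which is why subsampling helps on large sequence sets.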

Outputs¶

visualization: Visualization: <no description>[required]

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
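
The "auto" rule for reads_per_batch is simple arithmetic. The sketch below restates it in Python; the function name is hypothetical, and rounding the quotient up to a whole number of reads is an assumption made here rather than something stated in this reference.

```python
import math


def autoscale_reads_per_batch(n_query_sequences, n_jobs, cap=20000):
    """Sketch of the "auto" rule: min(number of query sequences / n_jobs, cap).

    Rounding the quotient up to a whole read is an assumption of this sketch.
    """
    if n_jobs < 1:
        raise ValueError("this sketch expects an explicit worker count >= 1")
    return min(math.ceil(n_query_sequences / n_jobs), cap)


print(autoscale_reads_per_batch(100000, 4))  # capped at 20000
print(autoscale_reads_per_batch(1000, 4))    # 250
```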

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. [optional]
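
The handle stripping described for rank_handle_regex can be illustrated with Python's re module. This is a sketch only: the function name strip_handles is hypothetical, and the pattern ^[dkpcofgs]__ (a one-letter rank code followed by two underscores) is used as an example value.

```python
import re

RANK_HANDLE = re.compile(r"^[dkpcofgs]__")


def strip_handles(taxonomy, sep=";"):
    """Strip rank handles and drop levels that are empty afterwards."""
    levels = (RANK_HANDLE.sub("", level.strip()) for level in taxonomy.split(sep))
    return [level for level in levels if level]


print(strip_handles("g__Faecalibacterium; s__"))
# prints: ['Faecalibacterium']
print(strip_handles("g__Faecalibacterium; s__prausnitzii"))
# prints: ['Faecalibacterium', 'prausnitzii']
```

Once the handles are gone, an empty level such as "s__" no longer counts as taxonomic information.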

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]
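
Position-based trimming amounts to taking the same column range from every aligned sequence. The sketch below illustrates this; the function name is hypothetical, and treating position_start and position_end as 1-based, inclusive columns is an assumption of this example.

```python
def trim_alignment(aligned, position_start=None, position_end=None):
    """Trim every aligned sequence to the same column range.

    Positions are treated as 1-based and inclusive (an assumption here).
    A missing position leaves that end of the alignment untrimmed.
    """
    start = 0 if position_start is None else position_start - 1
    return {seq_id: seq[start:position_end] for seq_id, seq in aligned.items()}


alignment = {"seq1": "--ACGT--", "seq2": "TTACGAGG"}
print(trim_alignment(alignment, position_start=3, position_end=6))
# prints: {'seq1': 'ACGT', 'seq2': 'ACGA'}
```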

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']
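
The "len" and "lca" modes can be sketched as follows. This is a simplified illustration, not RESCRIPt's code: the function names are hypothetical, taxonomies are plain strings rather than artifacts, and handles are stripped with the default rank_handle_regex before comparison.

```python
import re

HANDLE = re.compile(r"^[dkpcofgs]__")


def parse(taxonomy):
    """Split a taxonomy string into its non-empty, handle-free levels."""
    levels = [HANDLE.sub("", level.strip()) for level in taxonomy.split(";")]
    return [level for level in levels if level]


def merge_len(taxonomies):
    """Pick the taxonomy with the most levels; on a tie the last one wins."""
    best = []
    for levels in map(parse, taxonomies):
        if len(levels) >= len(best):
            best = levels
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep only the leading levels on which every taxonomy agrees."""
    shared = []
    for labels in zip(*(parse(t) for t in taxonomies)):
        if len(set(labels)) != 1:
            break
        shared.append(labels[0])
    return shared or [unclassified_label]


a = "d__Bacteria; g__Faecalibacterium; s__prausnitzii"
b = "d__Bacteria; g__Faecalibacterium; s__"
c = "d__Bacteria; g__Roseburia; s__hominis"
print(merge_len([a, b]))  # ['Bacteria', 'Faecalibacterium', 'prausnitzii']
print(merge_lca([a, c]))  # ['Bacteria']
```

When the inputs share no leading level at all, merge_lca falls back to the unclassified label, mirroring the role of unclassified_label above.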

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]
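
The two screening criteria can be expressed compactly. The sketch below is an illustration under stated assumptions: the function name should_cull is hypothetical, and "degenerate" is taken to mean any IUPAC ambiguity code (R, Y, S, W, K, M, B, D, H, V, N).

```python
import re

DEGENERATE = set("RYSWKMBDHVN")


def should_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if a sequence fails either screening criterion."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE for base in seq)
    # A homopolymer of length N is one base followed by N - 1 repeats of it.
    homopolymer = re.search(r"(.)\1{%d,}" % (homopolymer_length - 1), seq)
    return n_degenerate >= num_degenerates or homopolymer is not None


print(should_cull("ACGT" + "A" * 8))  # True: homopolymer of length 8
print(should_cull("ACGTNNNN"))        # False: only 4 degenerate bases
```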

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences. Essentially, 'unaligning' the sequences.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]
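
Degapping as described above is a character removal followed by a length filter. A minimal sketch, using a plain dictionary of sequences in place of a QIIME 2 artifact (the function name is hypothetical):

```python
def degap(aligned, min_length=1):
    """Remove gap ("-") and missing-data (".") characters from sequences.

    Sequences shorter than min_length after degapping are not returned.
    """
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"seq1": "AC--GT..", "seq2": "----"}))
# prints: {'seq1': 'ACGT'}
```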

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]
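
The pairing of search and replacement strings, and the difference between literal and regular-expression matching, can be sketched like this. The function is a hypothetical stand-in that edits a single taxonomy string; it is not the plugin's implementation.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply paired find-and-replace terms to one taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    for search, replacement in zip(search_strings, replacement_strings):
        if use_regex:
            taxonomy = re.sub(search, replacement, taxonomy)
        else:
            # Literal substring matching: regex metacharacters have no meaning.
            taxonomy = taxonomy.replace(search, replacement)
    return taxonomy


print(edit_taxonomy("k__Bacteria; p__Firmicutes", ["k__"], ["d__"]))
# prints: d__Bacteria; p__Firmicutes
print(edit_taxonomy("d__Bacteria; s__sp. 1", [r"s__sp\..*"], ["s__"], use_regex=True))
# prints: d__Bacteria; s__
```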

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
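
When no reference is supplied, every sequence is simply reverse complemented. A minimal sketch of that operation for unambiguous DNA bases (the function name is hypothetical, and characters other than A, C, G, and T are left unchanged here for brevity):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
# prints: CGTT
```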

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
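
The rule that the most stringent thresholds apply when a sequence matches several labels can be sketched as follows. The function names are hypothetical, labels are matched as substrings of the taxonomy string, and folding the global thresholds into the same comparison is an assumption of this example.

```python
def length_limits(taxonomy, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    """Return the most stringent (min, max) length limits for one taxonomy."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxonomy:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum.
    return (max(mins) if mins else None, min(maxs) if maxs else None)


def passes(seq, taxonomy, labels, **limits):
    """True if the sequence length lies within the applicable limits."""
    lo, hi = length_limits(taxonomy, labels, **limits)
    return (lo is None or len(seq) >= lo) and (hi is None or len(seq) <= hi)


taxonomy = "d__Bacteria; p__Firmicutes"
labels = ["Bacteria", "Firmicutes"]
print(length_limits(taxonomy, labels, min_lens=[1200, 1400], max_lens=[1600, 1500]))
# prints: (1400, 1500)
```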

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
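
As a rough illustration, converting RNA to DNA sequences can be thought of as replacing uracil with thymine while leaving gap characters in aligned input untouched. This sketch assumes that interpretation (no reverse complementing); the function name is hypothetical.

```python
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    """Replace U with T; gaps and all other characters are kept as-is."""
    return rna.translate(RNA_TO_DNA)


print(reverse_transcribe("ACGU-U."))
# prints: ACGT-T.
```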

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]
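
As a rough illustration of how subsample_size and random_seed interact, the sketch below draws a reproducible random subset of sequence IDs. It is not the plugin's implementation: the helper name subsample_ids and the rounding rule (nearest whole sequence, at least one kept) are assumptions made for this example.

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Return a reproducible random subset of seq_ids (hypothetical helper)."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    seq_ids = list(seq_ids)
    if not seq_ids:
        return []
    # Assumption: round to the nearest whole sequence and keep at least one.
    n_keep = max(1, round(len(seq_ids) * subsample_size))
    # A dedicated generator keeps the draw independent of global random state.
    rng = random.Random(random_seed)
    return rng.sample(seq_ids, n_keep)


ids = [f"seq{i}" for i in range(50)]
subset = subsample_ids(ids, subsample_size=0.1, random_seed=1)
```

Calling the helper twice with the same seed returns the same subset, which is the point of exposing random_seed.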

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID or name. Otherwise, assemblies from the taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
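
The rank_propagation behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's code: the function name propagate_ranks is hypothetical, and it assumes labels use the 'x__Name' handle format and are joined with '; '.

```python
def propagate_ranks(taxonomy, delimiter="; "):
    """Fill empty ranks with the nearest named upper-level rank (sketch)."""
    filled = []
    last_name = None
    for label in taxonomy.split(delimiter):
        handle, sep, name = label.partition("__")
        if not sep:
            # Assumption: labels without a rank handle are passed through as-is.
            filled.append(label)
            continue
        if name:
            last_name = name
        elif last_name is not None:
            # Empty rank: reuse the closest named rank above it.
            name = last_name
        filled.append(f"{handle}{sep}{name}")
    return delimiter.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```

An empty rank with no named rank above it is left empty, since there is nothing to propagate.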

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
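
Percent-encoding rql_query values by hand is error-prone, so the following sketch builds the "in" queries quoted in the rql_query description using Python's standard library. The helper rql_in is hypothetical and only covers the "in" operator; it is not part of RESCRIPt or the BV-BRC API.

```python
from urllib.parse import quote


def rql_in(field, values, quote_values=False):
    """Build an RQL 'in' query with percent-encoded values (sketch)."""
    encoded = []
    for value in values:
        # String values such as species names are wrapped in double quotes,
        # which are then encoded as %22; spaces become %20.
        text = f'"{value}"' if quote_values else str(value)
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


genome_query = rql_in("genome_id", ["224308.43", "2030927.4755"])
species_query = rql_in(
    "species",
    ["Bacillus subtilis", "Bacteroidales bacterium"],
    quote_values=True,
)
print(genome_query)
print(species_query)
```

The resulting strings match the two example queries given in the parameter description and can be passed as the rql_query value.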

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']
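
To give a feel for what a kmer entropy measurement involves, the sketch below computes Shannon entropy (in bits) over the pooled kmer frequencies of a set of sequences. The text above does not state which entropy definition evaluate-seqs uses, so the Shannon formula, the pooling across sequences, and the function name kmer_entropy are all assumptions for illustration.

```python
from collections import Counter
from math import log2


def kmer_entropy(sequences, k):
    """Shannon entropy (bits) of the pooled kmer distribution (sketch)."""
    # Count every overlapping kmer of length k across all sequences.
    counts = Counter(
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


low = kmer_entropy(["AAAA"], 2)   # a single distinct kmer
high = kmer_entropy(["ACGT"], 1)  # four equally frequent kmers
```

The number of kmers to count grows with both sequence count and length, which is consistent with the warning that these calculations can be slow on large sequence sets and with the option to subsample first.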

Outputs¶

visualization: Visualization: <no description>[required]

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
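
The "auto" setting of reads_per_batch is described above as min(number of query sequences / n_jobs, 20000). A minimal sketch of that arithmetic follows; the function name, the use of integer division, and the floor of one read per batch are assumptions, and the n_jobs=0 ("all CPUs") case is deliberately left out.

```python
def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Autoscale the batch size from the query count and worker count (sketch)."""
    if n_jobs < 1:
        raise ValueError("this sketch expects an explicit n_jobs >= 1")
    # Assumption: integer division, with at least one read per batch.
    return max(1, min(n_queries // n_jobs, cap))


small = auto_reads_per_batch(10_000, n_jobs=4)
large = auto_reads_per_batch(1_000_000, n_jobs=4)
```

Small query sets are split evenly across workers, while large ones are capped at 20000 reads per batch.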

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
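
One way to picture how a confidence threshold limits taxonomic depth is to truncate an assignment at the first rank whose confidence falls below the threshold. The sketch below assumes per-rank confidence values are available as a list, which is a simplification for illustration; the function name is hypothetical and the classifier's actual bookkeeping may differ.

```python
def truncate_by_confidence(levels, confidences, threshold=0.7):
    """Keep ranks until the confidence first drops below threshold (sketch)."""
    kept = []
    for level, conf in zip(levels, confidences):
        if conf < threshold:
            break
        kept.append(level)
    return kept


levels = ["d__Bacteria", "p__Firmicutes", "c__Clostridia"]
confidences = [0.99, 0.80, 0.50]  # hypothetical per-rank confidences
trimmed = truncate_by_confidence(levels, confidences, threshold=0.7)
untrimmed = truncate_by_confidence(levels, confidences, threshold=0)
```

With a threshold of 0 nothing is truncated, which mirrors the documented option of calculating confidence without using it to limit depth.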

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__' then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']
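
To make the difference between the "len" and "lca" modes concrete, here is a minimal Python sketch that strips rank handles with the default rank_handle_regex and then merges two annotations. It is not RESCRIPt's implementation: the helper names (strip_handles, merge_len, merge_lca) are hypothetical, and truncating a lineage at its first empty level is an assumption made to keep the example short.

```python
import re


def strip_handles(taxonomy, rank_handle_regex=r"^[dkpcofgs]__"):
    """Split a taxonomy string and strip the rank handle from each label.

    Assumption: the lineage is truncated at the first empty label.
    """
    labels = []
    for level in taxonomy.split(";"):
        label = re.sub(rank_handle_regex, "", level.strip())
        if not label:
            break
        labels.append(label)
    return labels


def merge_len(taxonomies):
    """'len'-style choice: keep the annotation with the most levels.

    Using >= means a later entry wins a tie.
    """
    best = []
    for labels in taxonomies:
        if len(labels) >= len(best):
            best = labels
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """'lca'-style choice: keep only the leading labels shared by all inputs."""
    consensus = []
    for level in zip(*taxonomies):
        if len(set(level)) != 1:
            break
        consensus.append(level[0])
    return consensus or [unclassified_label]


a = strip_handles("d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__prausnitzii")
b = strip_handles("d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__")
print("; ".join(merge_len([a, b])))  # species-level annotation is kept
print("; ".join(merge_lca([a, b])))  # consensus stops at the genus
```

When the inputs share no leading label at all, the sketch falls back to unclassified_label, which mirrors the role that parameter plays in the LCA modes.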

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]
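
The "collapsing substrings into superstrings" step of the "super" mode can be pictured with a small Python sketch. This is only an illustration of the idea described above, not RESCRIPt's code: the function name collapse_substrings is hypothetical, and lineages are assumed to be lists of labels with rank handles and empty levels already removed.

```python
def collapse_substrings(lineages):
    """Drop any lineage that is a strict prefix of a longer lineage.

    A shorter lineage that agrees with a longer one at every shared level
    does not contradict it, so only the more specific lineage is kept.
    """
    kept = []
    for lineage in lineages:
        is_prefix = any(
            len(other) > len(lineage) and other[:len(lineage)] == lineage
            for other in lineages
        )
        if not is_prefix and lineage not in kept:
            kept.append(lineage)
    return kept


compatible = [["Faecalibacterium"], ["Faecalibacterium", "prausnitzii"]]
conflicting = [["Faecalibacterium", "prausnitzii"], ["Faecalibacterium", "duncaniae"]]
print(collapse_substrings(compatible))   # only the species-level lineage remains
print(collapse_substrings(conflicting))  # both remain; a consensus step is still needed
```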

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]
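
The two screening criteria can be sketched in a few lines of Python. This is a hedged illustration rather than the plugin's implementation: the function name passes_cull is hypothetical, and treating every IUPAC ambiguity code (including N) as a degenerate base is an assumption.

```python
import re

# IUPAC ambiguity codes; assumed here to be what counts as "degenerate".
DEGENERATE_BASES = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence survives both screening criteria."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE_BASES for base in seq)
    if n_degenerate >= num_degenerates:
        return False
    # (.)\1{n-1,} matches a run of n or more identical characters.
    run = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return run.search(seq) is None


print(passes_cull("ACGT" * 5))        # no degenerates, no long runs
print(passes_cull("ACGT" + "A" * 8))  # contains a homopolymer of length 8
print(passes_cull("ACGTNNNNN"))       # contains 5 degenerate bases
```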

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]
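
Degapping amounts to deleting two characters and then applying a length check. The following Python sketch shows that idea under simple assumptions: sequences are held in a plain dict keyed by ID, and the function name degap is hypothetical.

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' from each sequence; drop results shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


aligned = {"s1": "AC--GT..A", "s2": "----A....", "s3": "........."}
print(degap(aligned))                # s3 is empty after degapping and is dropped
print(degap(aligned, min_length=3))  # s2 is now too short as well
```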

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]
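
The pairing of search strings with replacement strings, and the effect of use_regex, can be sketched as follows. This is an illustrative Python example, not the plugin's code: the function name edit_taxonomy_strings is hypothetical and the taxonomy is assumed to be a plain dict of feature ID to taxonomy string.

```python
import re


def edit_taxonomy_strings(taxonomy, search_strings, replacement_strings,
                          use_regex=False):
    """Apply each (search, replacement) pair, in order, to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must have equal length")
    edited = {}
    for feature_id, lineage in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                lineage = re.sub(search, replacement, lineage)
            else:
                lineage = lineage.replace(search, replacement)
        edited[feature_id] = lineage
    return edited


taxonomy = {"seq1": "d__Bacteria; p__Firmicutes"}
literal = edit_taxonomy_strings(taxonomy, ["Firmicutes"], ["Bacillota"])
regex = edit_taxonomy_strings(taxonomy, [r"p__\w+"], ["p__Unknown"], use_regex=True)
print(literal["seq1"])
print(regex["seq1"])
```

With use_regex left at False, a pattern such as `p__\w+` is treated as literal text and matches nothing in the example above.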

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]
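
When no reference is supplied, every sequence is simply reverse complemented. A minimal Python sketch of that operation is shown below; the function name reverse_complement is hypothetical, and handling of IUPAC ambiguity codes is included as an assumption about what a typical input may contain.

```python
# Complement table for DNA, including IUPAC ambiguity codes, in both cases.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
print(reverse_complement("AAGG"))  # CCTT
```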

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
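
The "most stringent threshold" rule for sequences matching several labels can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function names are hypothetical, and a label is considered to match when it occurs as a substring of the taxonomy string.

```python
def effective_bounds(taxonomy, labels, min_lens=None, max_lens=None,
                     global_min=None, global_max=None):
    """Return the (minimum, maximum) length bounds that apply to one taxonomy.

    The longest applicable minimum and the shortest applicable maximum win.
    """
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxonomy:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    return (max(mins) if mins else None, min(maxs) if maxs else None)


def passes_length_filter(seq, taxonomy, labels, **thresholds):
    """Return True if the sequence length lies within the effective bounds."""
    lower, upper = effective_bounds(taxonomy, labels, **thresholds)
    length = len(seq)
    return (lower is None or length >= lower) and (upper is None or length <= upper)


lineage = "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"
labels = ["Bacteria", "Proteobacteria"]
bounds = effective_bounds(lineage, labels, min_lens=[900, 1200], max_lens=[2000, 1600])
print(bounds)  # (1200, 1600)
```

Because the global thresholds apply to every sequence, the sketch folds them into the same comparison, so a global_min larger than any matching label minimum becomes the effective lower bound.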

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' label will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
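
Rank propagation can be illustrated with a short Python sketch that fills each empty rank with the nearest label above it. This is a simplified picture of the behavior described for rank_propagation, not the plugin's code: the function name propagate_ranks is hypothetical and the input is assumed to use 'x__label' levels separated by semicolons.

```python
def propagate_ranks(taxonomy):
    """Fill empty ranks with the most recent non-empty label from a higher rank."""
    filled = []
    last_label = ""
    for level in taxonomy.split(";"):
        handle, _, label = level.strip().partition("__")
        if label:
            last_label = label
        filled.append(f"{handle}__{last_label}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# f__Pasteurellaceae; g__Pasteurellaceae
```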

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
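
At the sequence level, converting RNA to DNA in this sense amounts to replacing uracil with thymine while leaving any alignment characters untouched. The Python sketch below illustrates this under that assumption; the function name rna_to_dna is hypothetical and no reverse complementing is performed.

```python
def rna_to_dna(rna_sequence):
    """Replace U with T (both cases); gap characters are left as they are."""
    return rna_sequence.replace("U", "T").replace("u", "t")


print(rna_to_dna("AUGC-U"))  # ATGC-T
```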

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]
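
The interplay of subsample_size and random_seed can be sketched in Python as follows. This is only an illustration: the function name subsample_ids is hypothetical, and drawing a fixed number of sequences (the rounded fraction of the total) is an assumption, since the plugin may select sequences differently.

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random sample covering a fraction of the IDs."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    ids = list(seq_ids)
    n_samples = round(len(ids) * subsample_size)
    rng = random.Random(random_seed)  # a local generator keeps the draw repeatable
    return rng.sample(ids, n_samples)


all_ids = [f"seq{i}" for i in range(100)]
first = subsample_ids(all_ids)
second = subsample_ids(all_ids)
print(len(first), first == second)
```

Reusing the same random_seed yields the same sample, which is what makes a subsampled reference set reproducible across runs.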

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' label will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' label will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
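
Building an rql_query string with correctly percent-encoded values is easy to get wrong by hand, so a small helper may be useful. The Python sketch below reproduces the two example queries given for rql_query using the standard library's urllib.parse.quote. The function name rql_in is hypothetical, and wrapping a value in quotes only when it contains a space is an assumption drawn from those two examples.

```python
from urllib.parse import quote


def rql_in(field, values):
    """Build an RQL 'in' query, quoting and percent-encoding values with spaces."""
    encoded = []
    for value in values:
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # becomes %22...%22 after encoding
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```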

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
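
The percent encoding described for rql_query can be sketched with the Python standard library. The helper name rql_in and its quote_values flag are hypothetical and exist only for this illustration; they are not part of RESCRIPt or the BV-BRC API.

```python
from urllib.parse import quote


def rql_in(field, values, quote_values=False):
    """Build an RQL "in" query string (illustrative helper, not a RESCRIPt function).

    When quote_values is True, each value is wrapped in double quotes before
    percent encoding, so quotes become %22 and spaces become %20.
    """
    encoded = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        # safe="" makes quote() encode every reserved character.
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"],
             quote_values=True))
```

The two calls reproduce the example queries given in the rql_query description.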

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy [default: 'kingdom, phylum, class, order, family, genus, species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
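
The effect of rank_propagation can be illustrated with a minimal sketch. It assumes taxonomy strings use "; " as the separator and "__" between rank handle and name; the function propagate_ranks is hypothetical and is not the plugin's implementation.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty ranks with the nearest named upper-level rank (illustration only)."""
    filled = []
    last_name = ""
    for label in taxonomy.split(sep):
        handle, _, name = label.partition("__")
        if name:
            last_name = name
        # An empty rank inherits the most recent non-empty name above it.
        filled.append(f"{handle}__{name or last_name}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```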

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']
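
As a rough illustration of what a kmer entropy measurement involves, the sketch below computes Shannon entropy over kmer frequencies pooled across all sequences. Whether evaluate-seqs uses exactly this definition is an assumption; the function kmer_entropy is hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(sequences, k):
    """Shannon entropy (in bits) of kmer frequencies pooled over all sequences."""
    counts = Counter(
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(kmer_entropy(["ACGT"], 1))
```

The number of kmers grows with both sequence count and length, which is why subsample_kmers can help on large sequence sets.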

Outputs¶

visualization: Visualization: <no description>[required]

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
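
The "auto" setting of reads_per_batch follows the formula min(number of query sequences / n_jobs, 20000). A minimal sketch of that arithmetic is shown here; rounding up to a whole number and requiring n_jobs of at least 1 are assumptions of this illustration, and auto_reads_per_batch is a hypothetical name.

```python
import math


def auto_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Autoscaled batch size: queries per job, capped at 20000 (illustration only)."""
    if n_jobs < 1:
        raise ValueError("this sketch expects n_jobs >= 1")
    # Rounding up (an assumption) keeps every query assigned to some batch.
    return min(max(1, math.ceil(n_queries / n_jobs)), cap)


print(auto_reads_per_batch(1000, 4))
```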

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
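
To show what k stratified folds means in practice, the sketch below spreads the members of each taxonomic label evenly across k folds using a seeded shuffle. It is a standard-library illustration only; the pipeline's actual splitting strategy may differ, and stratified_folds is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, random_state=0):
    """Assign feature IDs to k folds so each label is spread evenly across folds."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for feature_id, label in labels.items():
        by_label[label].append(feature_id)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        # Deal the shuffled IDs out to the folds in round-robin order.
        for i, feature_id in enumerate(ids):
            folds[i % k].append(feature_id)
    return folds


labels = {f"a{i}": "g__Bacillus" for i in range(6)}
labels.update({f"b{i}": "g__Listeria" for i in range(3)})
for fold in stratified_folds(labels, k=3, random_state=0):
    print(sorted(fold))
```

A label with fewer members than k cannot appear in every fold, which is one reason expected taxonomic labels may be truncated during cross-validation.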

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]
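
Position-based trimming can be pictured as slicing every aligned sequence at the same columns. The sketch below assumes positions are 1-based and inclusive, which this reference does not state explicitly; trim_alignment_positions is a hypothetical helper, not the plugin's code.

```python
def trim_alignment_positions(aligned, position_start=None, position_end=None):
    """Slice each aligned sequence to the columns position_start..position_end.

    Positions are treated as 1-based and inclusive (an assumption).
    A missing position leaves that end of the alignment untrimmed.
    """
    start = 0 if position_start is None else position_start - 1
    return {seq_id: seq[start:position_end] for seq_id, seq in aligned.items()}


alignment = {"seq1": "--ACGT--", "seq2": "TTACGAAA"}
print(trim_alignment_positions(alignment, position_start=3, position_end=6))
```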

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank_handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']
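
The "len" mode together with rank_handle_regex can be sketched as follows: rank handles are stripped, empty levels are ignored, and the taxonomy with the most remaining elements wins, with ties going to the last taxonomy added. This is a simplified illustration under those assumptions; merge_len is a hypothetical function and not the plugin's implementation.

```python
import re


def merge_len(taxonomies, rank_handle_regex="^[dkpcofgs]__"):
    """Pick the taxonomy with the most non-empty levels; ties go to the last one."""
    def depth(taxonomy):
        labels = [re.sub(rank_handle_regex, "", label.strip())
                  for label in taxonomy.split(";")]
        return sum(1 for label in labels if label)

    best = None
    for taxonomy in taxonomies:
        # ">=" lets a later taxonomy replace an earlier one of equal depth.
        if best is None or depth(taxonomy) >= depth(best):
            best = taxonomy
    return best


print(merge_len(["g__Faecalibacterium; s__",
                 "g__Faecalibacterium; s__prausnitzii"]))
```

Without stripping the handles, "s__" would count as a level, and both annotations in the example would appear equally long.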

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]
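
The "lca" mode can be illustrated with a short sketch that keeps the leading ranks shared by every taxonomy attached to the same sequence. It assumes semicolon-delimited taxonomy strings; the function lca is hypothetical and omits details such as rank handle backfilling.

```python
def lca(taxonomies):
    """Return the least common ancestor: the shared leading ranks of all inputs."""
    split = [[label.strip() for label in taxonomy.split(";")]
             for taxonomy in taxonomies]
    common = []
    for labels in zip(*split):
        # Stop at the first rank where the taxonomies disagree.
        if len(set(labels)) != 1:
            break
        common.append(labels[0])
    return "; ".join(common)


print(lca(["d__Bacteria; p__Firmicutes; c__Bacilli",
           "d__Bacteria; p__Firmicutes; c__Clostridia"]))
```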

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]
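
The two screening criteria can be sketched in a few lines. The sketch assumes that any character other than A, C, G, T, or U counts as a degenerate base; should_cull is a hypothetical helper, not the plugin's implementation.

```python
import re


def should_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if a sequence fails degenerate-base or homopolymer screening."""
    seq = seq.upper()
    # Assumption: everything outside ACGTU is treated as a degenerate base.
    degenerate_count = sum(1 for base in seq if base not in "ACGTU")
    # A character followed by at least (length - 1) repeats of itself.
    homopolymer = re.search(r"(.)\1{%d,}" % (homopolymer_length - 1), seq)
    return degenerate_count >= num_degenerates or homopolymer is not None


print(should_cull("ACGTNNNNNACGT"))
print(should_cull("ACGTACGTACGT"))
```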

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gaps ("-") and missing data (".") characters from the sequences. Essentially, 'unaligning' the sequences.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]
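
Degapping amounts to removing gap ("-") and missing data (".") characters, then dropping sequences shorter than min_length. A minimal sketch, with degap as a hypothetical function name:

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' from aligned sequences and drop those that end up too short."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"seq1": "..AC-GT..", "seq2": "---------"}))
```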

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]
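
The pairing of search and replacement strings, and the difference between literal and regular expression matching, can be sketched as follows. The function edit_taxonomy_string is hypothetical and only illustrates the behavior described above.

```python
import re


def edit_taxonomy_string(taxonomy, search_strings, replacement_strings,
                         use_regex=False):
    """Apply paired find-and-replace terms to one taxonomy string (illustration)."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    for search, replacement in zip(search_strings, replacement_strings):
        if use_regex:
            taxonomy = re.sub(search, replacement, taxonomy)
        else:
            # Literal substring matching: no special characters are interpreted.
            taxonomy = taxonomy.replace(search, replacement)
    return taxonomy


print(edit_taxonomy_string("k__Bacteria; p__Firmicutes",
                           ["k__", "Firmicutes"], ["d__", "Bacillota"]))
print(edit_taxonomy_string("g__Foo; s__", [r"; s__$"], [""], use_regex=True))
```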

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
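
The "most stringent threshold" rule described for labels can be sketched as below: when several labels match a taxonomy, the longest minimum and the shortest maximum apply, alongside any global thresholds. Label matching is treated as plain substring matching, which is an assumption; passes_length_filter is a hypothetical helper.

```python
def passes_length_filter(seq_len, taxonomy, labels, min_lens=None, max_lens=None,
                         global_min=None, global_max=None):
    """Return True if a sequence length satisfies all applicable thresholds."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxonomy:  # assumption: substring match against the taxonomy
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent wins: the longest minimum and the shortest maximum.
    if mins and seq_len < max(mins):
        return False
    if maxs and seq_len > min(maxs):
        return False
    return True


print(passes_length_filter(1300, "d__Bacteria; p__Proteobacteria",
                           labels=["Bacteria", "Proteobacteria"],
                           min_lens=[1200, 1400]))
```

In the example the sequence matches both labels, so the 1400 minimum applies and the 1300-base sequence is rejected.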

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
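
As a rough picture of the conversion, the sketch below replaces U with T and leaves every other character, including alignment gaps, untouched. That the action does no more than this (for example, no complementing) is an assumption; reverse_transcribe_string is a hypothetical helper.

```python
def reverse_transcribe_string(rna):
    """Convert an RNA string to DNA by replacing U with T (gaps are preserved)."""
    return rna.translate(str.maketrans("Uu", "Tt"))


print(reverse_transcribe_string("ACGU-U."))
```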

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]
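
The interaction of subsample_size and random_seed can be sketched as follows. This is one plausible reading, not the plugin's implementation: it assumes a fixed number of sequences is drawn without replacement and that the count is rounded up. The name `subsample_ids` is hypothetical.

```python
import math
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random fraction of sequence IDs without replacement."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    ids = list(seq_ids)
    # Assumption: round the sample size up so a non-empty input never yields zero.
    n_keep = math.ceil(len(ids) * subsample_size)
    return random.Random(random_seed).sample(ids, n_keep)


ids = [f"seq{i}" for i in range(20)]
first = subsample_ids(ids, subsample_size=0.25, random_seed=1)
second = subsample_ids(ids, subsample_size=0.25, random_seed=1)
print(len(first), first == second)
```

Using the same seed twice returns the same sample, which is the point of exposing random_seed as a parameter.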

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]
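
To get a feel for how page_size affects the number of requests, the sketch below simply counts pages. It does not call NCBI Datasets, and the helper name `requests_needed` is hypothetical; only the 20 to 1000 range and the default of 20 come from the parameter description.

```python
import math


def requests_needed(n_assemblies: int, page_size: int = 20) -> int:
    """Number of paged requests required to fetch n_assemblies results."""
    if not 20 <= page_size <= 1000:
        raise ValueError("page_size must be between 20 and 1000")
    return math.ceil(n_assemblies / page_size)


print(requests_needed(45))        # 3
print(requests_needed(45, 1000))  # 1
```

A larger page_size means fewer round trips when many assemblies match the query.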

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
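
Percent encoding an rql_query by hand is error-prone, so a small helper can build the "in(...)" form shown in the parameter description. This is a sketch using only the Python standard library; the function name `rql_in` and the rule of quoting only values that contain whitespace are assumptions made for illustration.

```python
from urllib.parse import quote


def rql_in(field: str, values: list[str]) -> str:
    """Build an RQL 'in' query with percent-encoded values."""
    encoded = []
    for value in values:
        # Assumption: values with whitespace are wrapped in double quotes first.
        needs_quotes = any(ch.isspace() for ch in value)
        text = f'"{value}"' if needs_quotes else value
        # safe="" encodes the quotes as %22 and the spaces as %20.
        encoded.append(quote(text, safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```

The resulting string is what would be passed as the rql_query value.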

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']

Outputs¶

visualization: Visualization: <no description>[required]
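
The labelling rules for the labels parameter can be sketched in a few lines. The helper name `assign_labels` is hypothetical, and numbering unnamed inputs by their 1-based position is an assumption; the description only says they are labeled numerically in sequential order.

```python
def assign_labels(n_inputs, labels=None):
    """Pair each input with a label, falling back to numbers when labels run out."""
    labels = list(labels or [])
    # Extra labels beyond n_inputs are never reached, so they are ignored.
    return [labels[i] if i < len(labels) else str(i + 1) for i in range(n_inputs)]


print(assign_labels(3, ["silva"]))
print(assign_labels(2, ["silva", "gtdb", "unused"]))
```

The first call shows the fallback for unnamed inputs and the second shows an extra label being dropped.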

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]
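
The "auto" setting of reads_per_batch follows the formula min(number of query sequences / n_jobs, 20000). The sketch below works through that arithmetic. The function name is hypothetical, and the integer division, the lower bound of 1, and the use of os.cpu_count() when n_jobs is 0 are assumptions, not documented behaviour.

```python
import os


def resolve_reads_per_batch(reads_per_batch, n_queries, n_jobs=1):
    """Return the batch size implied by reads_per_batch ('auto' or an integer)."""
    if reads_per_batch != "auto":
        return int(reads_per_batch)
    # n_jobs == 0 is documented as "use all CPUs".
    workers = n_jobs if n_jobs > 0 else (os.cpu_count() or 1)
    return max(1, min(n_queries // workers, 20000))


print(resolve_reads_per_batch("auto", 100_000, n_jobs=4))  # 20000
print(resolve_reads_per_batch("auto", 1_000, n_jobs=4))    # 250
```

For large reference sets the cap of 20000 applies; for small ones the queries are split evenly across workers.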

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]
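
The effect of rank_handle_regex can be seen by applying the pattern with Python's re module. This is a sketch and not the plugin's code: the helper name `strip_rank_handles`, the ";" separator, and the choice to drop labels that become empty are assumptions based on the description above. The pattern used is the '^[dkpcofgs]__' form that merge-taxa lists as its default.

```python
import re


def strip_rank_handles(taxonomy, rank_handle_regex=r"^[dkpcofgs]__", sep=";"):
    """Strip rank handles from each label and drop labels left empty."""
    pattern = re.compile(rank_handle_regex)
    labels = [pattern.sub("", label.strip()) for label in taxonomy.split(sep)]
    return [label for label in labels if label]


print(strip_rank_handles("g__Faecalibacterium; s__prausnitzii"))
# ['Faecalibacterium', 'prausnitzii']
print(strip_rank_handles("g__Faecalibacterium; s__"))
# ['Faecalibacterium']
```

After stripping, the empty species level no longer counts as a level, so the two annotations differ in depth.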

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]
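
The version and target signatures above each list three alternatives. The sketch below assumes these alternatives pair up positionally, so that, for example, version '138' only accepts the SSU targets. That pairing is an inference from the type signatures, and the names `VALID_TARGETS` and `is_valid_silva_request` are hypothetical.

```python
OLD_TARGETS = {"SSURef_NR99", "SSURef", "LSURef"}
SSU_ONLY = {"SSURef_NR99", "SSURef"}
ALL_TARGETS = {"SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"}

# Assumed pairing of each version group with its allowed targets.
VALID_TARGETS = {
    "128": OLD_TARGETS,
    "132": OLD_TARGETS,
    "138": SSU_ONLY,
    "138.1": ALL_TARGETS,
    "138.2": ALL_TARGETS,
}


def is_valid_silva_request(version="138.2", target="SSURef_NR99"):
    """Check whether a version/target combination is allowed under the assumed pairing."""
    return target in VALID_TARGETS.get(version, set())


print(is_valid_silva_request("138.2", "LSURef_NR99"))  # True
print(is_valid_silva_request("138", "LSURef"))         # False
```

Checking the combination up front avoids starting a download that the action would reject.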

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']
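
The "len" and "lca" modes described above can be illustrated with a small, self-contained sketch. This is not RESCRIPt's implementation: the helper names (`split_ranks`, `merge_len`, `merge_lca`) are hypothetical, taxonomies are assumed to be semicolon-delimited strings, and only the default rank_handle_regex and unclassified_label values are taken from the text.

```python
import re

# Default rank_handle_regex from the parameter description above.
RANK_HANDLE = re.compile(r"^[dkpcofgs]__")


def split_ranks(taxonomy):
    """Split a taxonomy string, strip rank handles, drop trailing empty levels."""
    labels = [RANK_HANDLE.sub("", level.strip()) for level in taxonomy.split(";")]
    while labels and labels[-1] == "":
        labels.pop()
    return labels


def merge_len(taxonomies):
    """'len'-style choice: keep the annotation with the most non-empty levels.

    Using >= means that, on a tie, the last taxonomy wins.
    """
    best = None
    for tax in taxonomies:
        if best is None or len(split_ranks(tax)) >= len(split_ranks(best)):
            best = tax
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """'lca'-style choice: keep only the leading ranks shared by all annotations."""
    shared = []
    for labels in zip(*(split_ranks(t) for t in taxonomies)):
        if len(set(labels)) != 1:
            break
        shared.append(labels[0])
    return "; ".join(shared) if shared else unclassified_label


a = "d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__prausnitzii"
b = "d__Bacteria; p__Firmicutes; g__Faecalibacterium; s__"
c = "d__Bacteria; p__Bacteroidota"
print(merge_len([a, b]))  # the species-level annotation is longer once 's__' is stripped
print(merge_lca([a, c]))  # only the domain is shared
```

Note that the sketch returns LCA labels without rank handles; in the real action, new_rank_handles controls whether handles are prepended again.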

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]
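
As a rough illustration of the "uniq" mode at the default perc_identity of 1.0, the sketch below collapses records that share both an identical sequence and an identical taxonomy, while keeping identical sequences that carry different taxonomies. The function name `dereplicate_uniq` and the dictionary-based inputs are assumptions made for the example; the real action delegates clustering to vsearch.

```python
def dereplicate_uniq(sequences, taxa):
    """Collapse records sharing both sequence and taxonomy; the first ID seen is kept."""
    representative = {}
    for seq_id, seq in sequences.items():
        representative.setdefault((seq, taxa[seq_id]), seq_id)
    kept = list(representative.values())
    return ({i: sequences[i] for i in kept}, {i: taxa[i] for i in kept})


sequences = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGT", "s4": "TTGA"}
taxa = {
    "s1": "g__A; s__x",
    "s2": "g__A; s__x",  # duplicate of s1 in both sequence and taxonomy
    "s3": "g__A; s__y",  # same sequence, different taxonomy: retained
    "s4": "g__B; s__z",
}
derep_seqs, derep_taxa = dereplicate_uniq(sequences, taxa)
print(sorted(derep_seqs))
```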

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]
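
The two screening criteria can be sketched as a single predicate. This is an illustrative approximation, not the plugin's code: `should_cull` is a hypothetical name, the input is assumed to be DNA, and "degenerate" is assumed to mean any IUPAC ambiguity code other than A, C, G, or T.

```python
import re

# IUPAC nucleotide ambiguity codes (assumed definition of "degenerate").
DEGENERATE = set("RYSWKMBDHVN")


def should_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence meets either removal criterion."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE for base in seq)
    # A run of one character repeated at least homopolymer_length times.
    run = re.search(r"(.)\1{%d,}" % (homopolymer_length - 1), seq)
    return n_degenerate >= num_degenerates or run is not None


print(should_cull("ACGTACGTACGT"))            # clean sequence
print(should_cull("ACGTNNNNNACGT"))           # five degenerate bases
print(should_cull("ACG" + "A" * 8 + "CGT"))   # homopolymer of length 8
```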

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]
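
A minimal sketch of the degapping step, assuming sequences are held in a plain dictionary keyed by ID (the function name `degap` is hypothetical):

```python
def degap(aligned, min_length=1):
    """Strip '-' and '.' characters; drop sequences shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        ungapped = seq.replace("-", "").replace(".", "")
        if len(ungapped) >= min_length:
            degapped[seq_id] = ungapped
    return degapped


aligned = {"a": "AC--GT..", "b": "----A---"}
print(degap(aligned, min_length=2))  # 'b' degaps to a single base and is dropped
```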

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of 'replacement-strings'.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]
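
The pairing of search and replacement strings, and the effect of use_regex, can be sketched as follows. The function `edit_taxonomy` here is a hypothetical stand-in that operates on a plain dictionary; it assumes pairs are applied in order and that every match in a taxonomy string is replaced.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply each (search, replacement) pair, in order, to every taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, tax in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                tax = re.sub(search, replacement, tax)
            else:
                tax = tax.replace(search, replacement)  # literal substring match
        edited[feature_id] = tax
    return edited


taxonomy = {"x": "d__Bacteria; p__Firmicutes", "y": "g__Foo; s__Foo_sp."}
print(edit_taxonomy(taxonomy, ["p__Firmicutes"], ["p__Bacillota"]))
print(edit_taxonomy(taxonomy, [r"s__.*_sp\.$"], ["s__"], use_regex=True))
```

With literal matching, characters such as "." are matched as-is; with use_regex enabled they take on their regular-expression meaning and must be escaped to match literally.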

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]
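
When no reference is supplied, the action reverse complements every sequence. A minimal sketch of that operation is shown below; `reverse_complement` is a hypothetical helper, and the handling of IUPAC ambiguity codes is an assumption added for completeness.

```python
# Complement table for DNA, including IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
print(reverse_complement("AACGR"))
```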

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
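
The "most stringent threshold" rule for sequences matching several labels can be sketched as below. The helper names are hypothetical, labels are assumed to be matched as plain substrings of the taxonomy string, and global thresholds are simply combined with the label-specific ones.

```python
def length_bounds(taxonomy, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    """Return the (minimum, maximum) length bounds that apply to one taxonomy."""
    mins = [global_min] if global_min is not None else []
    maxs = [global_max] if global_max is not None else []
    for i, label in enumerate(labels):
        if label in taxonomy:
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum.
    return (max(mins) if mins else None, min(maxs) if maxs else None)


def passes(seq, taxonomy, labels, **thresholds):
    """True if the sequence length falls within the applicable bounds."""
    low, high = length_bounds(taxonomy, labels, **thresholds)
    return (low is None or len(seq) >= low) and (high is None or len(seq) <= high)


tax = "d__Bacteria; p__Firmicutes"
labels = ["Bacteria", "Firmicutes", "Archaea"]
print(length_bounds(tax, labels, min_lens=[1200, 1400, 900]))
print(passes("A" * 1300, tax, labels, min_lens=[1200, 1400, 900]))
```

Here the sequence matches both "Bacteria" and "Firmicutes", so the longer minimum of 1400 applies and a 1300-base sequence would be discarded.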

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter-seqs-length-by-taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
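
Rank propagation, as described for rank_propagation above, can be sketched on a single taxonomy string. The function `propagate_ranks` is hypothetical and assumes rank handles of the form 'x__' separated by semicolons.

```python
def propagate_ranks(taxonomy):
    """Fill each empty rank with the nearest non-empty label above it."""
    last_label = ""
    filled = []
    for level in taxonomy.split(";"):
        handle, _, label = level.strip().partition("__")
        if label:
            last_label = label
        filled.append(f"{handle}__{last_label}")
    return "; ".join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
```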

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
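
As a sketch only, and assuming that "reverse transcribe" here means a character-level substitution of U with T that leaves gap characters untouched (so aligned input stays aligned), the conversion could look like this. The function name is hypothetical.

```python
def reverse_transcribe(rna):
    """Replace uracil with thymine; gaps and other characters are left as-is."""
    return rna.replace("U", "T").replace("u", "t")


print(reverse_transcribe("ACGU--GU."))
```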

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]
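
The order of operations stated above (inclusion, then exclusion, then ids_to_keep) can be sketched as three successive filters. This is an approximation under stated assumptions: `filter_taxa` here is a hypothetical stand-in working on a dictionary, search terms are matched as case-sensitive substrings, and ids_to_keep is treated as a final intersection.

```python
def filter_taxa(taxonomy, include=None, exclude=None, ids_to_keep=None):
    """Apply inclusion, then exclusion, then ID selection."""
    kept = dict(taxonomy)
    if include:
        kept = {i: t for i, t in kept.items() if any(term in t for term in include)}
    if exclude:
        kept = {i: t for i, t in kept.items() if not any(term in t for term in exclude)}
    if ids_to_keep is not None:
        kept = {i: t for i, t in kept.items() if i in ids_to_keep}
    return kept


taxonomy = {
    "a": "d__Bacteria; p__Firmicutes",
    "b": "d__Bacteria; p__Proteobacteria",
    "c": "d__Archaea; p__Euryarchaeota",
    "d": "d__Bacteria; p__Firmicutes; c__Bacilli",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"],
                  ids_to_keep={"a", "c"}))
```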

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]
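
A seeded, fraction-based subsample can be sketched as below. How the plugin converts the fraction into a sample (a fixed count versus a per-sequence draw) is not stated here, so the rounding to a fixed count is an assumption, as is the helper name `subsample_ids`.

```python
import random


def subsample_ids(ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random sample containing a fraction of the IDs."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in (0, 1]")
    rng = random.Random(random_seed)  # local generator, so the seed is self-contained
    n = round(len(ids) * subsample_size)
    return rng.sample(list(ids), n)


ids = [f"seq{i}" for i in range(100)]
sample = subsample_ids(ids, subsample_size=0.1, random_seed=1)
print(len(sample))
```

Calling the function again with the same seed returns the same sample, which is the purpose of random_seed.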

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
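
The percent-encoding rule described for rql_query can be sketched with the Python standard library. The helper `rql_in` is hypothetical; it only builds the query string in the form shown in the parameter text and does not contact BV-BRC.

```python
from urllib.parse import quote


def rql_in(field, values):
    """Build an RQL 'in' query, quoting and percent-encoding values that need it."""
    encoded = []
    for value in values:
        text = str(value)
        if quote(text, safe="") != text:
            # Contains characters such as spaces: wrap in double quotes,
            # then percent-encode (quotes become %22, spaces become %20).
            text = quote(f'"{text}"', safe="")
        encoded.append(text)
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
```

Values made up only of letters, digits, and ".", "-", "_", or "~" are passed through unchanged, which reproduces the genome_id example above.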

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
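
The percent-encoding rule given for rql_query can be reproduced with Python's standard library. The sketch below builds the two example queries quoted above; the helper name `rql_in` and its `quote_values` flag are hypothetical and exist only for this illustration.

```python
from urllib.parse import quote


def rql_in(data_field, values, quote_values=False):
    """Build an RQL "in" query string (hypothetical helper).

    quote_values=True wraps each value in double quotes before encoding,
    so quotes become %22 and spaces become %20.
    """
    encoded = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        encoded.append(quote(text, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"],
             quote_values=True))
```

The resulting string is what would be passed as the rql_query value.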

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']
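
The kmer_lengths parameter refers to a kmer entropy calculation without defining it. One common reading is the Shannon entropy of the kmer frequency distribution; the sketch below assumes that definition (with log base 2), which may differ from what the plugin actually computes. The function name `kmer_entropy` is hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(sequences, k):
    """Shannon entropy (bits) of all overlapping k-mers in the sequences.

    Assumed definition for illustration only.
    """
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no k-mers of this length could be extracted
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(kmer_entropy(["ACGT", "ACGA"], 2))
```

Because every overlapping k-mer of every sequence is counted, the cost grows with both sequence count and length, which is consistent with the warning about large sequence sets and the subsample_kmers option.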

Outputs¶

visualization: Visualization: <no description>[required]

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
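
The "auto" rule for reads_per_batch, min(number of query sequences / n_jobs, 20000), can be written out as follows. This is a sketch under stated assumptions: integer division, a floor of one read per batch, and resolving n_jobs=0 to the CPU count are choices made here, not details confirmed by the documentation.

```python
import os


def autoscale_reads_per_batch(n_queries, n_jobs=1, cap=20000):
    """Hypothetical helper mirroring the documented 'auto' rule."""
    if n_jobs == 0:
        # Assumption: 0 means "all CPUs", as described for n_jobs
        n_jobs = os.cpu_count() or 1
    # Assumption: integer division, and never fewer than one read per batch
    return max(1, min(n_queries // n_jobs, cap))


print(autoscale_reads_per_batch(100000, n_jobs=4))
print(autoscale_reads_per_batch(1000, n_jobs=4))
```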

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]
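
The version and target signatures above each list three groups of choices in the same order. Reading them as paired position by position (an assumption, not something the text states outright), the permitted combinations can be expressed as a small lookup. The names `VALID_TARGETS` and `is_valid_combination` are hypothetical.

```python
# Assumed pairing of `version` groups with `target` groups, read off the
# order in which the two signatures list their choices.
VALID_TARGETS = {
    ("128", "132"): {"SSURef_NR99", "SSURef", "LSURef"},
    ("138",): {"SSURef_NR99", "SSURef"},
    ("138.1", "138.2"): {"SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"},
}


def is_valid_combination(version, target):
    """Return True if `target` is listed for the group containing `version`."""
    for versions, targets in VALID_TARGETS.items():
        if version in versions:
            return target in targets
    return False  # unknown version


print(is_valid_combination("138", "LSURef"))
print(is_valid_combination("138.2", "LSURef_NR99"))
```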

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]
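
Position-based trimming, as described for position_start and position_end, amounts to slicing every aligned sequence at the same columns. The sketch below assumes 1-based, inclusive positions (consistent with the lower bound of 1, but not stated explicitly) and uses a plain dict in place of an alignment artifact; `trim_alignment` is a hypothetical name.

```python
def trim_alignment(aligned, position_start=None, position_end=None):
    """Slice every aligned sequence to the same column range.

    Assumption: positions are 1-based and inclusive. A missing bound
    leaves that end of the alignment untrimmed.
    """
    start = 0 if position_start is None else position_start - 1
    return {seq_id: seq[start:position_end] for seq_id, seq in aligned.items()}


alignment = {"seq1": "--ACGT--", "seq2": "TTAC-TGG"}
print(trim_alignment(alignment, position_start=3, position_end=6))
```

Primer-based trimming would first locate these two positions by aligning the primers, then apply the same slice.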

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace the rank handles.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']
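
The "len" and "lca" modes, together with the effect of rank_handle_regex, can be illustrated with a short sketch. This is a simplified reading of the parameter descriptions, not the plugin's code: taxonomies are handled as plain strings, and the helper names `strip_handles`, `merge_len` and `merge_lca` are hypothetical.

```python
import re


def strip_handles(taxonomy, rank_handle_regex=r"^[dkpcofgs]__"):
    """Split a taxonomy string, strip rank handles, and drop empty levels."""
    labels = (re.sub(rank_handle_regex, "", level.strip())
              for level in taxonomy.split(";"))
    return [label for label in labels if label]


def merge_len(taxonomies):
    """'len' mode: most levels wins; on a tie, the last annotation wins."""
    return max(reversed(taxonomies), key=len)


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """'lca' mode: keep the leading levels on which all annotations agree."""
    consensus = []
    for levels in zip(*taxonomies):
        if len(set(levels)) > 1:
            break
        consensus.append(levels[0])
    return consensus or [unclassified_label]


a = strip_handles("g__Faecalibacterium; s__prausnitzii")
b = strip_handles("g__Faecalibacterium; s__")
print(merge_len([a, b]))
print(merge_lca([a, b]))
```

Stripping the handles first is what lets the empty "s__" level drop out, so the species-level annotation is recognised as the longer one.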

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]
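
The two screening criteria can be sketched as a single predicate. The example assumes that "degenerate bases" means the IUPAC ambiguity codes other than A, C, G, T and U; the function name `should_cull` is hypothetical and the plugin's own implementation may differ.

```python
import re

# IUPAC ambiguity codes (assumed meaning of "degenerate bases")
DEGENERATE = set("RYSWKMBDHVN")


def should_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence fails either screening criterion."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE for base in seq)
    # A run of `homopolymer_length` or more identical characters
    run = re.search(r"(.)\1{%d,}" % (homopolymer_length - 1), seq)
    return n_degenerate >= num_degenerates or run is not None


print(should_cull("ACGTNNNNNACGT"))
print(should_cull("ACGTAAAAAAAACGT"))
print(should_cull("ACGTACGTACGT"))
```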

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]
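
Degapping as described above is a character removal followed by a length filter. A minimal sketch, using a plain dict in place of a sequence artifact (the name `degap` is hypothetical):

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' characters, then drop sequences that are too short."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"seq1": "..AC-GT..", "seq2": "....-...."}))
```

With the default min_length of 1, a sequence consisting only of gap characters is dropped because nothing remains after degapping.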

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with a string in the list of 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must be equal to the number of replacement strings.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]
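
The pairing of search strings with replacement strings, and the effect of use_regex, can be sketched as follows. This operates on a single taxonomy string and is only an illustration; `edit_taxonomy_string` is a hypothetical name and the order in which the plugin applies replacements is assumed to follow the list order.

```python
import re


def edit_taxonomy_string(taxonomy, search_strings, replacement_strings,
                         use_regex=False):
    """Apply each search/replacement pair in order to one taxonomy string."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    for search, replacement in zip(search_strings, replacement_strings):
        if use_regex:
            taxonomy = re.sub(search, replacement, taxonomy)
        else:
            taxonomy = taxonomy.replace(search, replacement)  # literal match
    return taxonomy


print(edit_taxonomy_string("k__Bacteria; p__Firmicutes", ["k__"], ["d__"]))
print(edit_taxonomy_string("g__Bacillus; s__", [r"; s__$"], [""],
                           use_regex=True))
```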

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
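
The "most stringent threshold" rule for sequences matching several labels can be sketched as below. Two assumptions are made for the illustration: a label matches when it occurs as a substring of the taxonomy string, and the global thresholds are combined with the label-specific ones in the same most-stringent fashion. The helper names are hypothetical.

```python
def length_bounds(taxonomy, labels, min_lens=None, max_lens=None,
                  global_min=None, global_max=None):
    """Return the (minimum, maximum) length bounds applying to one sequence."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in taxonomy:  # assumption: substring match
            if min_lens:
                mins.append(min_lens[i])
            if max_lens:
                maxs.append(max_lens[i])
    # Most stringent: the longest minimum and the shortest maximum
    return (max(mins) if mins else None, min(maxs) if maxs else None)


def passes(seq, bounds):
    """Check a sequence against (minimum, maximum) bounds; None means no bound."""
    low, high = bounds
    return (low is None or len(seq) >= low) and (high is None or len(seq) <= high)


bounds = length_bounds("d__Bacteria; p__Firmicutes",
                       labels=["Bacteria", "Firmicutes"],
                       min_lens=[1200, 1400], max_lens=[1600, 1500])
print(bounds)
```

Here the sequence matches both labels, so it is held to the 1400 minimum and the 1500 maximum.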

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]
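
The rank_propagation behaviour can be pictured with a small sketch. This is not the plugin's code: the function name `propagate_ranks` and the assumption that ranks are written as 'x__Label' joined by '; ' are illustrative choices.

```python
def propagate_ranks(taxonomy, sep="; "):
    """Fill empty rank labels with the nearest upper-level label.

    Assumes ranks look like 'x__Label' and are joined by '; '
    (an assumption made for this sketch).
    """
    filled = []
    last_label = ""
    for rank in taxonomy.split(sep):
        handle, _, label = rank.partition("__")
        if label:
            # Remember the most recent non-empty label in the lineage.
            last_label = label
        filled.append(f"{handle}__{last_label}")
    return sep.join(filled)


print(propagate_ranks("f__Pasteurellaceae; g__"))
# prints: f__Pasteurellaceae; g__Pasteurellaceae
```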

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
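
Conceptually, this conversion amounts to replacing uracil with thymine while leaving every other character, including alignment gaps, in place. The sketch below shows that idea on plain strings; the helper name is hypothetical, and treating '-' and '.' as gap characters is an assumption.

```python
def reverse_transcribe(rna):
    """Convert an RNA string to DNA by replacing U with T.

    Gap characters such as '-' and '.' pass through unchanged, so an
    aligned input stays aligned. Illustrative helper only.
    """
    return rna.translate(str.maketrans("Uu", "Tt"))


print(reverse_transcribe("AUG-CUU.GA"))
# prints: ATG-CTT.GA
```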

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]
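
The documented order of operations (inclusion first, then exclusion) can be sketched as follows. The helper name and the use of case-sensitive substring matching are assumptions. The ids_to_keep step is left out because the parameter text does not spell out how the selected IDs combine with the search results.

```python
def filter_by_terms(taxonomy, include=None, exclude=None):
    """Apply inclusion and then exclusion filtering to {id: taxon}.

    Assumption: search terms are matched as case-sensitive substrings.
    """
    result = dict(taxonomy)
    if include:
        # Keep taxa that contain one or more of the include terms.
        result = {i: t for i, t in result.items()
                  if any(term in t for term in include)}
    if exclude:
        # Then drop taxa that contain one or more of the exclude terms.
        result = {i: t for i, t in result.items()
                  if not any(term in t for term in exclude)}
    return result


taxonomy = {
    "s1": "k__Bacteria; p__Firmicutes",
    "s2": "k__Bacteria; p__Proteobacteria",
    "s3": "k__Archaea; p__Euryarchaeota",
}
print(filter_by_terms(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"]))
```

Because exclusion runs second, a taxon that matches both an include and an exclude term is removed in this sketch.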

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of the original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]
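
A minimal sketch of fraction-based sampling with a fixed seed is shown below. It uses only Python's standard library; the helper name and the rounding of the sample count are assumptions, and the plugin may select sequences differently.

```python
import random


def subsample_ids(seq_ids, subsample_size=0.1, random_seed=1):
    """Draw a reproducible random subset of sequence IDs.

    Assumption: the number of sampled IDs is the rounded product of the
    fraction and the total count.
    """
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    ids = list(seq_ids)
    n = round(len(ids) * subsample_size)
    rng = random.Random(random_seed)  # same seed gives the same sample
    return rng.sample(ids, n)


ids = [f"seq{i}" for i in range(100)]
sample = subsample_ids(ids, subsample_size=0.1, random_seed=1)
print(len(sample))
```

Running the function twice with the same random_seed returns the same IDs, which is the point of exposing the seed as a parameter.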

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
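
The percent-encoding rules described for rql_query can be reproduced with Python's standard `urllib.parse.quote`. The helper `build_in_query` below is a hypothetical convenience for this example and is not part of the plugin; it only assembles the query string and does not contact BV-BRC.

```python
from urllib.parse import quote


def build_in_query(data_field, values, quote_values=False):
    """Build an RQL 'in' query string with percent-encoded values.

    Set quote_values=True for values containing spaces, which need
    surrounding double quotes (encoded as %22).
    """
    encoded = []
    for value in values:
        text = f'"{value}"' if quote_values else str(value)
        # safe="" encodes every reserved character, including spaces.
        encoded.append(quote(text, safe=""))
    return f"in({data_field},({','.join(encoded)}))"


print(build_in_query("genome_id", ["224308.43", "2030927.4755"]))
# prints: in(genome_id,(224308.43,2030927.4755))
print(build_in_query("species",
                     ["Bacillus subtilis", "Bacteroidales bacterium"],
                     quote_values=True))
# prints: in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```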

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']

Outputs¶

visualization: Visualization: <no description>[required]
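
For readers unfamiliar with k-mer entropy, the sketch below computes the Shannon entropy of a pooled k-mer frequency distribution. It is a generic textbook formulation; whether the plugin pools counts across sequences or uses log base 2 is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(seqs, k):
    """Shannon entropy (in bits) of the k-mer frequency distribution.

    Assumption: k-mers from all sequences are pooled into one count table.
    """
    counts = Counter(
        seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(kmer_entropy(["ACGT"], 1))
print(kmer_entropy(["AAAA"], 1))
```

Four equally frequent 1-mers give 2 bits of entropy, while a homopolymer gives none. The number of windows grows with both sequence count and length, which is why subsampling before the calculation can save time.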

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min( number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]
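
The labelling rules for the labels parameter can be sketched as follows. The helper name is hypothetical, and numbering unnamed inputs by their 1-based position is an assumption; the parameter text only says they are labelled numerically in sequential order.

```python
def assign_labels(n_inputs, labels=None):
    """Pair each input with a label, in input order.

    Missing labels fall back to a number and extra labels are ignored.
    Assumption: fallback numbers are 1-based input positions.
    """
    labels = list(labels or [])
    return [
        labels[i] if i < len(labels) else str(i + 1)
        for i in range(n_inputs)
    ]


print(assign_labels(3, ["silva"]))
print(assign_labels(2, ["a", "b", "c"]))
```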

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]" will recognize greengenes or silva rank handles. [optional]

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing, as in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]
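
The version and target signatures each list three groups of choices. The sketch below assumes these groups pair positionally, so that the first group of versions goes with the first group of targets, and so on. That pairing is an inference from the signatures rather than something the parameter text states, and the lookup table and function name are hypothetical.

```python
# Assumption: the i-th version group pairs with the i-th target group.
ALLOWED_TARGETS = {
    ("128", "132"): ("SSURef_NR99", "SSURef", "LSURef"),
    ("138",): ("SSURef_NR99", "SSURef"),
    ("138.1", "138.2"): ("SSURef_NR99", "SSURef", "LSURef_NR99", "LSURef"),
}


def is_valid_combination(version, target):
    """Check a version/target pair against the assumed pairing."""
    for versions, targets in ALLOWED_TARGETS.items():
        if version in versions:
            return target in targets
    return False


print(is_valid_combination("138.2", "LSURef_NR99"))
print(is_valid_combination("138", "LSURef"))
```

Under this reading, 'SSURef_NR99' and 'SSURef' are available for every listed version, while the large subunit targets depend on the release.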

rescript trim-alignment¶

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA sequences.[required]

Parameters¶

primer_fwd: Str: Forward primer used to find the start position for alignment trimming. Provide as 5'-3'.[optional]
primer_rev: Str: Reverse primer used to find the end position for alignment trimming. Provide as 5'-3'.[optional]
position_start: Int % Range(1, None): Position within the alignment where the trimming will begin. If not provided, alignment will not be trimmed at the beginning. If forward primer is specified this parameter will be ignored.[optional]
position_end: Int % Range(1, None): Position within the alignment where the trimming will end. If not provided, alignment will not be trimmed at the end. If reverse primer is specified this parameter will be ignored.[optional]
keep_primer_location: Bool: Retain the alignment positions of the primer binding location. Note: the primers themselves will be removed, but the alignment positions where the primers align will be retained in the alignment.[default: False]
n_threads: Int % Range(1, None): Number of threads to use for primer-based trimming, otherwise ignored. (Use auto to automatically use all available cores)[default: 1]

Outputs¶

trimmed_sequences: FeatureData[AlignedSequence]: Trimmed sequence alignment.[required]
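
Position-based trimming can be pictured as taking the same column slice from every aligned sequence. The sketch below covers only the position_start and position_end case, not primer-based trimming. Treating positions as 1-based and inclusive is an assumption, as is the helper name.

```python
def trim_alignment(aligned, position_start=None, position_end=None):
    """Slice every aligned sequence to the same column range.

    Assumption: positions are 1-based and inclusive. A missing bound
    leaves that end of the alignment untrimmed.
    """
    start = position_start - 1 if position_start else 0
    return {seq_id: seq[start:position_end] for seq_id, seq in aligned.items()}


alignment = {"a": "--ACGT--", "b": "TTACGA-C"}
print(trim_alignment(alignment, position_start=3, position_end=6))
# prints: {'a': 'ACGT', 'b': 'ACGA'}
```

Because the same slice is applied to every row, the trimmed sequences keep equal length and remain a valid alignment.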

Reference sequence annotation and curation pipeline.

version: 2025.10.0.dev0
website: https://github.com/nbokulich/RESCRIPt
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Robeson et al., 2021

Actions¶

Name	Type	Short Description
merge-taxa	method	Compare taxonomies and select the longest, highest scoring, or find the least common ancestor.
dereplicate	method	Dereplicate features with matching sequences and taxonomies.
cull-seqs	method	Removes sequences that contain at least the specified number of degenerate bases and/or homopolymers of a given length.
degap-seqs	method	Remove gaps from DNA sequence alignments.
edit-taxonomy	method	Edit taxonomy strings with find and replace terms.
orient-seqs	method	Orient input sequences by comparison against reference.
orient-reads	method	Orient FASTQ reads against reference.
filter-seqs-length-by-taxon	method	Filter sequences by length and taxonomic group.
filter-seqs-length	method	Filter sequences by length.
parse-silva-taxonomy	method	Generates a SILVA fixed-rank taxonomy.
reverse-transcribe	method	Reverse transcribe RNA to DNA sequences.
get-ncbi-data	method	Download, parse, and import NCBI sequences and taxonomies
get-ncbi-data-protein	method	Download, parse, and import NCBI protein sequences and taxonomies
get-gtdb-data	method	Download, parse, and import SSU GTDB reference data.
get-unite-data	method	Download and import UNITE reference data.
get-pr2-data	method	Download, parse, and import SSU PR2 reference data.
get-midori2-data	method	Download and import MIDORI 2 reference data.
filter-taxa	method	Filter taxonomy by list of IDs or search criteria.
subsample-fasta	method	Subsample an indicated number of sequences from a FASTA file.
extract-seq-segments	method	Use reference sequences to extract shorter matching sequence segments from longer sequences based on a user-defined 'perc-identity' value.
get-ncbi-genomes	method	Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.
get-bv-brc-genomes	method	Get genome sequences from the BV-BRC database.
get-bv-brc-metadata	method	Fetch BV-BRC metadata.
get-bv-brc-genome-features	method	Fetch genome features from BV-BRC.
evaluate-seqs	visualizer	Compute summary statistics on sequence artifact(s).
evaluate-fit-classifier	pipeline	Evaluate and train naive Bayes classifier on reference sequences.
evaluate-cross-validate	pipeline	Evaluate DNA sequence reference database via cross-validated taxonomic classification.
evaluate-classifications	pipeline	Interactively evaluate taxonomic classification accuracy.
evaluate-taxonomy	pipeline	Compute summary statistics on taxonomy artifact(s).
get-silva-data	pipeline	Download, parse, and import SILVA database.
trim-alignment	pipeline	Trim alignment based on provided primers or specific positions.

Artifact Classes¶

Formats¶

rescript merge-taxa¶

Citations¶

Inputs¶

data: List[FeatureData[Taxonomy]]: Two or more feature taxonomies to be merged.[required]

Parameters¶

mode: Str % Choices('len', 'lca', 'score', 'super', 'majority'): How to merge feature taxonomies: "len" will select the taxonomy with the most elements (e.g., species level will beat genus level); "lca" will find the least common ancestor and report this consensus taxonomy; "score" will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe. "majority" finds the LCA consensus while giving preference to majority labels. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'len']
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles. Note that rank handles are removed but not replaced; use the new_rank_handles parameter to replace them.[default: '^[dkpcofgs]__']
new_rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles to prepend to taxonomic labels at each rank. Note that merged taxonomies will only contain as many levels as there are handles if this parameter is used. This will trim all taxonomies to the given levels, even if longer annotations exist. Note that this parameter will prepend rank handles whether or not they already exist in the taxonomy, so should ALWAYS be used in conjunction with rank_handle_regex if rank handles exist in any of the inputs. Use 'disable' to prevent prepending 'new_rank_handles'[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
unclassified_label: Str: Specifies what label should be used for taxonomies that could not be resolved (when LCA modes are used).[default: 'Unassigned']

Outputs¶

merged_data: FeatureData[Taxonomy]: <no description>[required]
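
The "len" and "lca" modes and the effect of rank_handle_regex can be sketched in a few lines of plain Python. This is an illustrative approximation, not RESCRIPt's implementation: the helper names are hypothetical, taxonomies are plain semicolon-delimited strings, and dropping only trailing empty levels is an assumption. Ties in "len" go to the last taxonomy, following the note in the action description.

```python
import re


def strip_rank_handles(taxonomy, rank_handle_regex=r"^[dkpcofgs]__"):
    """Split a taxonomy string into labels with rank handles removed."""
    labels = [re.sub(rank_handle_regex, "", label.strip())
              for label in taxonomy.split(";")]
    # Assumption: trailing empty levels (e.g. "s__") carry no information.
    while labels and not labels[-1]:
        labels.pop()
    return labels


def merge_len(taxonomies):
    """Pick the annotation with the most levels; on a tie the last one wins."""
    best = []
    for taxonomy in taxonomies:
        labels = strip_rank_handles(taxonomy)
        if len(labels) >= len(best):
            best = labels
    return best


def merge_lca(taxonomies, unclassified_label="Unassigned"):
    """Keep the leading levels on which all annotations agree."""
    stripped = [strip_rank_handles(t) for t in taxonomies]
    consensus = []
    for level in zip(*stripped):
        if len(set(level)) != 1:
            break
        consensus.append(level[0])
    return consensus or [unclassified_label]


print(merge_len(["g__Faecalibacterium; s__",
                 "g__Faecalibacterium; s__prausnitzii"]))
print(merge_lca(["d__Bacteria; p__Firmicutes; g__A",
                 "d__Bacteria; p__Firmicutes; g__B"]))
```

Stripping the handles first is what lets "g__Faecalibacterium; s__" count as one level rather than two, so the more complete annotation wins under "len".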

rescript dereplicate¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be dereplicated[required]
taxa: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be dereplicated[required]

Parameters¶

mode: Str % Choices('uniq', 'lca', 'majority', 'super'): How to handle dereplication when sequences map to distinct taxonomies. "uniq" will retain all sequences with unique taxonomic affiliations. "lca" will find the least common ancestor among all taxa sharing a sequence. "majority" will find the most common taxonomic label associated with that sequence; note that in the event of a tie, "majority" will pick the winner arbitrarily. "super" finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"[default: 'uniq']
perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 1.0]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]
rank_handles: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: Specifies the set of rank handles used to backfill missing ranks in the resulting dereplicated taxonomy. Use 'disable' to prevent applying 'rank_handles'. [default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
derep_prefix: Bool: Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.[default: False]

Outputs¶

dereplicated_sequences: FeatureData[Sequence]: <no description>[required]
dereplicated_taxa: FeatureData[Taxonomy]: <no description>[required]
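
To make the difference between the "uniq" and "lca" modes concrete, here is a minimal sketch for exact (100% identity) duplicates only. It assumes taxonomies are already split into lists of labels and uses a hypothetical `dereplicate` helper; the actual action clusters with vsearch and also supports "majority" and "super", which are not modelled here.

```python
from collections import defaultdict


def dereplicate(records, mode="uniq"):
    """records: iterable of (sequence, taxonomy-as-list-of-labels) tuples."""
    by_seq = defaultdict(list)
    for seq, taxonomy in records:
        if taxonomy not in by_seq[seq]:
            by_seq[seq].append(taxonomy)  # keep each distinct taxonomy once

    result = []
    for seq, taxonomies in by_seq.items():
        if mode == "uniq":
            # One record per distinct (sequence, taxonomy) pair.
            result.extend((seq, taxonomy) for taxonomy in taxonomies)
        elif mode == "lca":
            # One record per sequence, labelled with the shared leading ranks.
            consensus = []
            for level in zip(*taxonomies):
                if len(set(level)) != 1:
                    break
                consensus.append(level[0])
            result.append((seq, consensus))
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return result


records = [
    ("ACGT", ["Bacteria", "Firmicutes", "A"]),
    ("ACGT", ["Bacteria", "Firmicutes", "B"]),
    ("ACGT", ["Bacteria", "Firmicutes", "A"]),
    ("GGCC", ["Archaea"]),
]
print(dereplicate(records, mode="uniq"))
print(dereplicate(records, mode="lca"))
```

With "uniq" the shared sequence is kept twice (once per distinct taxonomy); with "lca" it is kept once and labelled only down to the rank where the annotations agree.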

rescript cull-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence | RNASequence]: DNA or RNA Sequences to be screened for removal based on degenerate base and homopolymer screening criteria.[required]

Parameters¶

num_degenerates: Int % Range(1, None): Sequences with N, or more, degenerate bases will be removed.[default: 5]
homopolymer_length: Int % Range(2, None): Sequences containing a homopolymer sequence of length N, or greater, will be removed.[default: 8]
n_jobs: Int % Range(1, None): Number of concurrent processes to use while processing sequences. More is faster but typically should not be higher than the number of available CPUs. Output sequence order may change when using multiple jobs.[default: 1]

Outputs¶

clean_sequences: FeatureData[Sequence]: The resulting DNA sequences that pass degenerate base and homopolymer screening criteria.[required]
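
The two screening criteria can be expressed as a simple predicate. The sketch below is an assumption-laden illustration, not the plugin's code: `should_cull` is a hypothetical name, degenerate bases are taken to be the IUPAC ambiguity codes, and homopolymers are detected with a regular-expression backreference.

```python
import re

DEGENERATE = set("RYSWKMBDHVN")  # IUPAC ambiguity codes for nucleotides


def should_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if the sequence fails either screening criterion."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE for base in seq)
    if n_degenerate >= num_degenerates:
        return True
    # A character followed by at least (homopolymer_length - 1) copies of itself.
    pattern = r"(.)\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is not None


print(should_cull("ACG" + "A" * 8 + "CGT"))  # homopolymer of length 8
print(should_cull("ACGTNNNNACGT"))           # only 4 degenerate bases
```

Both thresholds are inclusive: a sequence with exactly num_degenerates ambiguous bases, or a homopolymer of exactly homopolymer_length, is removed.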

rescript degap-seqs¶

This method converts aligned DNA sequences to unaligned DNA sequences by removing gap ("-") and missing data (".") characters from the sequences, essentially 'unaligning' them.

Citations¶

Inputs¶

aligned_sequences: FeatureData[AlignedSequence]: Aligned DNA Sequences to be degapped.[required]

Parameters¶

min_length: Int % Range(1, None): Minimum length of sequence to be returned after degapping.[default: 1]

Outputs¶

degapped_sequences: FeatureData[Sequence]: The resulting unaligned (degapped) DNA sequences.[required]
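
A minimal sketch of the degapping step, assuming the alignment is held as a dict of plain strings (the `degap` helper is hypothetical and stands in for the plugin's own I/O):

```python
def degap(aligned, min_length=1):
    """Remove '-' and '.' characters; drop results shorter than min_length."""
    degapped = {}
    for seq_id, seq in aligned.items():
        unaligned = seq.replace("-", "").replace(".", "")
        if len(unaligned) >= min_length:
            degapped[seq_id] = unaligned
    return degapped


print(degap({"a": "..AC-GT..", "b": "----"}))
```

With the default min_length of 1, rows that consist only of gap characters are dropped, since nothing remains after degapping.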

rescript edit-taxonomy¶

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy strings data to be edited.[required]

Parameters¶

replacement_map: MetadataColumn[Categorical]: A tab-delimited metadata file in which the strings in the 'id' column are replaced by the 'replacement-strings' in the second column. All strings in the 'id' column must be unique![optional]
search_strings: List[Str]: Only used in conjunction with 'replacement-strings'. Each string in this list is searched for and replaced with the string at the same position in 'replacement-strings'. That is, the first string in 'search-strings' is replaced with the first string in 'replacement-strings', and so on. The number of 'search-strings' must equal the number of 'replacement-strings'.[optional]
replacement_strings: List[Str]: Only used in conjunction with 'search-strings'. This must contain the same number of replacement strings as search strings. See 'search-strings' parameter text for more details.[optional]
use_regex: Bool: Toggle regular expressions. By default, only literal substring matching is performed.[default: False]

Outputs¶

edited_taxonomy: FeatureData[Taxonomy]: Taxonomy in which the original strings are replaced by user-supplied strings.[required]
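
The pairing of search and replacement strings, and the effect of use_regex, can be sketched as follows. This is an illustration under simple assumptions (taxonomy as a dict of feature ID to lineage string, hypothetical function name), not the plugin's implementation.

```python
import re


def edit_taxonomy(taxonomy, search_strings, replacement_strings, use_regex=False):
    """Apply each (search, replacement) pair, in order, to every lineage."""
    if len(search_strings) != len(replacement_strings):
        raise ValueError("search and replacement lists must be the same length")
    edited = {}
    for feature_id, lineage in taxonomy.items():
        for search, replacement in zip(search_strings, replacement_strings):
            if use_regex:
                lineage = re.sub(search, replacement, lineage)
            else:
                lineage = lineage.replace(search, replacement)  # literal match
        edited[feature_id] = lineage
    return edited


taxonomy = {"s1": "k__Bacteria; p__Firmicutes", "s2": "k__Archaea; p__"}
print(edit_taxonomy(taxonomy, ["k__"], ["d__"]))
print(edit_taxonomy(taxonomy, [r"; p__$"], [""], use_regex=True))
```

The second call shows why the toggle matters: as a regular expression, "; p__$" removes only an empty trailing phylum, whereas as a literal string it matches nothing because no lineage contains a "$" character.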

rescript orient-seqs¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against. If no reference is provided, all the sequences will be reverse complemented and all parameters will be ignored.[optional]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]
relabel: Str: Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to construct the new headers. Use --sizeout to conserve the abundance annotations.[optional]
relabel_keep: Bool: When relabeling, keep the original identifier in the header after a space.[optional]
relabel_md5: Bool: When relabeling, use the MD5 digest of the sequence as the new identifier. Use --sizeout to conserve the abundance annotations.[optional]
relabel_self: Bool: Relabel sequences using the sequence itself as a label.[optional]
relabel_sha1: Bool: When relabeling, use the SHA1 digest of the sequence as the new identifier. The probability of a collision is smaller than the MD5 algorithm.[optional]
sizein: Bool: In de novo mode, abundance annotations (pattern [>;]size=integer[;]) present in sequence headers are taken into account.[optional]
sizeout: Bool: Add abundance annotations to the output FASTA files.[optional]

Outputs¶

oriented_seqs: FeatureData[Sequence]: Query sequences in same orientation as top matching reference sequence.[required]
unmatched_seqs: FeatureData[Sequence]: Query sequences that fail to match at least one reference sequence in either + or - orientation. This will be empty if no reference is provided.[required]
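
When no reference is provided, every sequence is simply reverse complemented. A small stand-alone sketch of that operation, with IUPAC ambiguity codes handled (the helper name and the upper-casing of input are choices made for this example):

```python
# Complement table for DNA, including IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```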

rescript orient-reads¶

Orient input reads (FASTQ) by comparison against a set of reference sequences using VSEARCH. This action is useful for orienting reads that are in mixed orientations prior to denoising or clustering.

Citations¶

Inputs¶

sequences: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Sequence reads to be oriented.[required]
reference_sequences: FeatureData[Sequence]: Reference sequences to orient against.[required]

Parameters¶

dbmask: Str % Choices('none', 'dust', 'soft'): Mask regions in the target database sequences using the dust method, or do not mask (none). When using soft masking, search commands become case sensitive.[optional]

Outputs¶

oriented_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Oriented reads.[required]
unmatched_reads: SampleData[PairedEndSequencesWithQuality¹ | JoinedSequencesWithQuality²]: Reads that fail to match at least one reference sequence in either + or - orientation.[required]

rescript filter-seqs-length-by-taxon¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomic classifications of sequences to be filtered.[required]

Parameters¶

labels: List[Str]: One or more taxonomic labels to use for conditional filtering. For example, use this option to set different min/max filter settings for individual phyla. Must input the same number of labels as min_lens and/or max_lens. If a sequence matches multiple taxonomic labels, this method will apply the most stringent threshold(s): the longest minimum length and/or the shortest maximum length that is associated with the matching labels.[required]
min_lens: List[Int % Range(1, None)]: Minimum length thresholds to use for filtering sequences associated with each label. If any min_lens are specified, must have the same number of min_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are less than the specified length.[optional]
max_lens: List[Int % Range(1, None)]: Maximum length thresholds to use for filtering sequences associated with each label. If any max_lens are specified, must have the same number of max_lens as labels. Sequences that contain this label in their taxonomy will be removed if they are more than the specified length.[optional]
global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]
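
The "most stringent threshold" rule for sequences that match several labels is easiest to see in code. The sketch below is a hypothetical helper, assuming labels are matched as plain substrings of the lineage; it is not the plugin's implementation.

```python
def passes_length_filter(seq, lineage, labels, min_lens=None, max_lens=None,
                         global_min=None, global_max=None):
    """Return True if seq satisfies the global and all matching label thresholds."""
    mins = [] if global_min is None else [global_min]
    maxs = [] if global_max is None else [global_max]
    for i, label in enumerate(labels):
        if label in lineage:  # assumption: simple substring match
            if min_lens is not None:
                mins.append(min_lens[i])
            if max_lens is not None:
                maxs.append(max_lens[i])
    # Most stringent thresholds: the longest minimum and the shortest maximum.
    if mins and len(seq) < max(mins):
        return False
    if maxs and len(seq) > min(maxs):
        return False
    return True


labels = ["Bacteria", "Firmicutes"]
min_lens = [1200, 1400]
max_lens = [1600, 1500]
lineage = "d__Bacteria; p__Firmicutes"
print(passes_length_filter("A" * 1300, lineage, labels, min_lens, max_lens))
print(passes_length_filter("A" * 1450, lineage, labels, min_lens, max_lens))
```

A 1300 nt Firmicutes sequence matches both labels, so it is held to the stricter 1400 nt minimum and is discarded, while the same length would pass for a lineage that matches only "Bacteria".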

rescript filter-seqs-length¶

Filter sequences by length with VSEARCH. For a combination of global and conditional taxonomic filtering, see filter_seqs_length_by_taxon.

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Sequences to be filtered by length.[required]

Parameters¶

global_min: Int % Range(1, None): The minimum length threshold for filtering all sequences. Any sequence shorter than this length will be removed.[optional]
global_max: Int % Range(1, None): The maximum length threshold for filtering all sequences. Any sequence longer than this length will be removed.[optional]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

filtered_seqs: FeatureData[Sequence]: Sequences that pass the filtering thresholds.[required]
discarded_seqs: FeatureData[Sequence]: Sequences that fall outside the filtering thresholds.[required]

rescript parse-silva-taxonomy¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Inputs¶

taxonomy_tree: Phylogeny[Rooted]: SILVA hierarchical taxonomy tree. The SILVA release filename typically takes the form of: 'tax_slv_ssu_X.tre', where 'X' is the SILVA version number.[required]
taxonomy_map: FeatureData[SILVATaxidMap]: SILVA taxonomy map. This file contains a mapping of the sequence accessions to the numeric taxonomy identifiers and species label information. The SILVA release filename is typically in the form of: 'taxmap_slv_ssu_ref_X.txt', or 'taxmap_slv_ssu_ref_nr_X.txt' where 'X' is the SILVA version number.[required]
taxonomy_ranks: FeatureData[SILVATaxonomy]: SILVA taxonomy file. This file contains the taxonomic rank information for each numeric taxonomy identifier and the taxonomy. The SILVA filename typically takes the form of: 'tax_slv_ssu_X.txt', where 'X' is the SILVA version number.[required]

Parameters¶

rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: The resulting fixed-rank formatted SILVA taxonomy.[required]
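
Rank propagation as described for rank_propagation can be sketched as a single pass down the lineage, carrying the last non-empty label forward. The representation (a list of handle and label pairs) and the function names are assumptions made for this example.

```python
def propagate_ranks(lineage):
    """lineage: list of (handle, label) pairs ordered from highest rank down."""
    propagated = []
    last_label = ""
    for handle, label in lineage:
        if label:
            last_label = label
        # An empty rank inherits the nearest labelled rank above it.
        propagated.append((handle, label or last_label))
    return propagated


def format_lineage(lineage):
    return "; ".join(handle + label for handle, label in lineage)


lineage = [("f__", "Pasteurellaceae"), ("g__", "")]
print(format_lineage(propagate_ranks(lineage)))
```

In this sketch, an empty rank with nothing labelled above it stays empty, since there is no upper-level label to propagate.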

rescript reverse-transcribe¶

Reverse transcribe RNA to DNA sequences. Accepts aligned or unaligned RNA sequences as input.

Citations¶

Robeson et al., 2021; Parks et al., 2020; Parks et al., 2021

Inputs¶

rna_sequences: FeatureData[AlignedRNASequence¹ | RNASequence²]: RNA Sequences to reverse transcribe to DNA.[required]

Outputs¶

dna_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Reverse-transcribed DNA sequences.[required]
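
At the string level, reverse transcription amounts to replacing uracil with thymine while leaving every other character, including alignment gaps, in place. A minimal sketch with a hypothetical helper (the sequence is not reversed or complemented):

```python
# Map uracil to thymine in either case; all other characters pass through.
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def reverse_transcribe(rna):
    return rna.translate(RNA_TO_DNA)


print(reverse_transcribe("AUG-CU.U"))
```

Because gap characters are untouched, an aligned RNA input yields an aligned DNA output with identical column positions.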

rescript get-ncbi-data¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Nucleotide database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Nucleotide database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[Sequence]: Sequences from the NCBI Nucleotide database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-ncbi-data-protein¶

Citations¶

Robeson et al., 2021; Coordinators, 2018; Benson et al., 2012

Parameters¶

query: Str: Query on the NCBI Protein database[optional]
accession_ids: Metadata: List of accession ids for sequences in the NCBI Protein database.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: Propagate known ranks to missing ranks if true[default: True]
logging_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'): Logging level, set to INFO for download progress or DEBUG for copious verbosity[optional]
n_jobs: Int % Range(1, None): Number of concurrent download connections. More is faster until you run out of bandwidth.[default: 1]

Outputs¶

sequences: FeatureData[ProteinSequence]: Sequences from the NCBI Protein database[required]
taxonomy: FeatureData[Taxonomy]: Taxonomies from the NCBI Taxonomy database[required]

rescript get-gtdb-data¶

Citations¶

Parameters¶

version: Str % Choices('202.0', '207.0', '214.0', '214.1', '220.0', '226.0'): GTDB database version to download.[default: '226.0']
domain: Str % Choices('Both', 'Bacteria', 'Archaea'): SSU sequence and taxonomy data to download from a given microbial domain from GTDB. 'Both' will fetch both bacterial and archaeal data. 'Bacteria' will only fetch bacterial data. 'Archaea' will only fetch archaeal data. This only applies to 'db-type SpeciesReps'.[default: 'Both']
db_type: Str % Choices('All', 'SpeciesReps'): 'All': All SSU data that pass the quality-control of GTDB, but are not clustered into representative species. 'SpeciesReps': SSU gene sequences identified within the set of representative species. Note: if 'All' is used, the 'domain' parameter will be ignored as GTDB does not maintain separate domain-level files for these non-clustered data.[default: 'SpeciesReps']
url_type: Str % Choices('Primary', 'Mirror'): Toggle download URL. 'Primary' will download data from the primary GTDB URL. 'Mirror' will download data from the GTDB data mirror. Use 'Mirror' if downloads from 'Primary' are slow.[default: 'Primary']

Outputs¶

gtdb_taxonomy: FeatureData[Taxonomy]: SSU GTDB reference taxonomy.[required]
gtdb_sequences: FeatureData[Sequence]: SSU GTDB reference sequences.[required]

rescript get-unite-data¶

Citations¶

Parameters¶

version: Str % Choices('2025-02-19', '2024-04-04', '2023-07-18', '2022-10-16', '2021-05-10', '2020-02-20'): UNITE version to download.[default: '2025-02-19']
taxon_group: Str % Choices('fungi', 'eukaryotes'): Download a database with only 'fungi' or including all 'eukaryotes'.[default: 'eukaryotes']
cluster_id: Str % Choices('99', '97', 'dynamic'): Percent similarity at which sequences in the database were clustered.[default: '99']
singletons: Bool: Include singleton clusters in the database.[default: False]

Outputs¶

taxonomy: FeatureData[Taxonomy]: UNITE reference taxonomy.[required]
sequences: FeatureData[Sequence]: UNITE reference sequences.[required]

rescript get-pr2-data¶

Citations¶

Parameters¶

version: Str % Choices('5.0.0', '4.14.0'): PR2 database version to download.[default: '5.0.0']
ranks: List[Str % Choices('domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species')]: List of taxonomic ranks for building a taxonomy from the PR2 Taxonomy database. Ranks can be provided as multiple separate flags, e.g.: --p-ranks genus --p-ranks species, or with a single flag delimited by a space: --p-ranks genus species. [default: 'domain', 'supergroup', 'division', 'subdivision', 'class', 'order', 'family', 'genus', 'species'][optional]

Outputs¶

pr2_sequences: FeatureData[Sequence]: SSU PR2 reference sequences.[required]
pr2_taxonomy: FeatureData[Taxonomy]: SSU PR2 reference taxonomy.[required]

rescript get-midori2-data¶

Citations¶

Parameters¶

mito_gene: List[Str % Choices('A6', 'A8', 'CO1', 'CO2', 'CO3', 'Cytb', 'ND1', 'ND2', 'ND3', 'ND4L', 'ND4', 'ND5', 'ND6', 'lrRNA', 'srRNA', 'all')]: Download the mitochondrial gene(s) of interest. Specify the respective gene(s), or download all genes using 'all'.[required]
version: Str % Choices('GenBank265_2025-03-08', 'GenBank264_2024-12-14', 'GenBank263_2024-10-13', 'GenBank262_2024-08-16', 'GenBank261_2024-06-15', 'GenBank260_2024-04-15'): MIDORI 2 version to download.[default: 'GenBank265_2025-03-08']
ref_seq_type: Str % Choices('uniq', 'longest'): 'uniq': contains all unique haplotypes associated with each species. 'longest': contains the longest sequence for each species.[default: 'uniq']
unspecified_species: Bool: Download reference sequences that contain species that are left unspecified. That is, any reference sequences that lack binomial species-level description.[default: False]

Outputs¶

midori2_sequences: Collection[FeatureData[Sequence]]: MIDORI 2 reference sequence output directory.[required]
midori2_taxonomy: Collection[FeatureData[Taxonomy]]: MIDORI 2 reference taxonomy output directory.[required]

rescript filter-taxa¶

Filter taxonomy by list of IDs or search criteria.

Citations¶

Inputs¶

taxonomy: FeatureData[Taxonomy]: Taxonomy to filter.[required]

Parameters¶

ids_to_keep: Metadata: List of IDs to keep (as Metadata). Selecting these IDs occurs after inclusion and exclusion filtering.[optional]
include: List[Str]: List of search terms. Taxa containing one or more of these terms will be retained. Inclusion filtering occurs prior to exclusion filtering and selecting ids_to_keep.[optional]
exclude: List[Str]: List of search terms. Taxa containing one or more of these terms will be excluded. Exclusion filtering occurs after inclusion filtering and prior to selecting ids_to_keep.[optional]

Outputs¶

filtered_taxonomy: FeatureData[Taxonomy]: The filtered taxonomy.[required]
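
The order of inclusion and exclusion filtering can be illustrated with a short sketch. It assumes case-sensitive substring matching and a taxonomy held as a dict of feature ID to lineage string; the final ids_to_keep selection is deliberately left out, because the description above states only when it happens, not how it combines with the other filters.

```python
def filter_taxa(taxonomy, include=None, exclude=None):
    """Apply inclusion filtering first, then exclusion filtering."""
    kept = dict(taxonomy)
    if include:
        kept = {fid: lineage for fid, lineage in kept.items()
                if any(term in lineage for term in include)}
    if exclude:
        kept = {fid: lineage for fid, lineage in kept.items()
                if not any(term in lineage for term in exclude)}
    return kept


taxonomy = {
    "s1": "d__Bacteria; p__Firmicutes",
    "s2": "d__Bacteria; p__Proteobacteria",
    "s3": "d__Archaea; p__Euryarchaeota",
}
print(filter_taxa(taxonomy, include=["Bacteria"], exclude=["Proteobacteria"]))
```

Because exclusion runs second, a taxon that matches both an include term and an exclude term is removed.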

rescript subsample-fasta¶

Subsample a set of sequences (either plain or aligned DNA) based on a fraction of original sequences.

Citations¶

Inputs¶

sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sequences to subsample from.[required]

Parameters¶

subsample_size: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Size of the random sample as a fraction of the total count[default: 0.1]
random_seed: Int % Range(1, None): Seed to be used for random sampling.[default: 1]

Outputs¶

sample_sequences: FeatureData[AlignedSequence¹ | Sequence²]: Sample of original sequences.[required]
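
A seeded, fraction-based subsample might look like the sketch below. It is only one possible reading of the parameters: the sample size is fixed at the rounded fraction of the input (at least one sequence), whereas the plugin may instead decide per sequence, in which case the output size would vary around the requested fraction.

```python
import random


def subsample(sequences, subsample_size=0.1, random_seed=1):
    """Return a reproducible random subset of a dict of sequences."""
    if not 0 < subsample_size <= 1:
        raise ValueError("subsample_size must be in the interval (0, 1]")
    rng = random.Random(random_seed)  # local generator, so the seed is isolated
    ids = sorted(sequences)           # fixed order makes the draw reproducible
    k = min(len(ids), max(1, round(len(ids) * subsample_size)))
    return {seq_id: sequences[seq_id] for seq_id in rng.sample(ids, k)}


seqs = {f"s{i}": "ACGT" for i in range(50)}
sample = subsample(seqs, subsample_size=0.1, random_seed=1)
print(len(sample))
```

Re-running with the same random_seed returns the same subset, which is the point of exposing the seed as a parameter.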

rescript extract-seq-segments¶

Citations¶

Robeson et al., 2021; Clark et al., 2016; O'Leary et al., 2016; Schoch et al., 2020

Inputs¶

input_sequences: FeatureData[Sequence]: Sequences from which matching shorter sequence segments (regions) can be extracted. Sequences containing segments that match those from 'reference-segment-sequences' will have those segments extracted and written to file.[required]
reference_segment_sequences: FeatureData[Sequence]: Reference sequence segments that will be used to search for and extract matching segments from 'input-sequences'.[required]

Parameters¶

perc_identity: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The percent identity at which clustering should be performed. This parameter maps to vsearch's --id parameter.[default: 0.7]
target_coverage: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): The minimum fraction of coverage that 'reference-segment-sequences' must have in order to extract matching segments from 'input-sequences'.[default: 0.9]
min_seq_len: Int % Range(1, None): Minimum length of reference sequence segment allowed for searching. Any sequence less than this will be discarded.[default: 32]
max_seq_len: Int % Range(1, None): Maximum length of reference sequence segment allowed for searching. Any sequence greater than this will be discarded.[default: 50000]
threads: Int % Range(1, 256): Number of computation threads to use (1 to 256). The number of threads should be less than or equal to the number of available CPU cores.[default: 1]

Outputs¶

extracted_sequence_segments: FeatureData[Sequence]: Extracted sequence segments from 'input-sequences' that successfully aligned to 'reference-segment-sequences'.[required]
unmatched_sequences: FeatureData[Sequence]: Sequences in 'input-sequences' that did not have matching sequence segments within 'reference-segment-sequences'.[required]

rescript get-ncbi-genomes¶

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences and protein/gene annotations will be fetched and supplemented with full taxonomy of every sequence.

Citations¶

Parameters¶

taxa: List[Str]: NCBI Taxonomy IDs or names (common or scientific) at any taxonomic rank.[required]
assembly_source: Str % Choices('refseq', 'genbank', 'all'): Fetch only RefSeq or GenBank genome assemblies.[default: 'refseq']
assembly_levels: List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')]: Fetch only genome assemblies that are one of the specified assembly levels.[default: ['complete_genome']]
only_reference: Bool: Fetch only reference and representative genome assemblies.[default: True]
only_genomic: Bool: Exclude plasmid, mitochondrial and chloroplast molecules from the final results (i.e., keep only genomic DNA).[default: False]
tax_exact_match: Bool: If true, only return assemblies with the given NCBI Taxonomy ID, or name. Otherwise, assemblies from taxonomy subtree are included, too.[default: False]
page_size: Int % Range(20, 1000, inclusive_end=True): The maximum number of genome assemblies to return per request. If number of genomes to fetch is higher than this number, requests will be repeated until all assemblies are fetched.[default: 20]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy database.[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if we are missing the genus label for 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become: 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genome_assemblies: FeatureData[Sequence]: Nucleotide sequences of requested genomes.[required]
loci: GenomeData[Loci]: Loci features of requested genomes.[required]
proteins: GenomeData[Proteins]: Protein sequences originating from requested genomes.[required]
taxonomies: FeatureData[Taxonomy]: Taxonomies of requested genomes.[required]

rescript get-bv-brc-genomes¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". While "in" is an RQL operator, "genome_id" is a data field and "224308.43,2030927.4755" are the values. It is important to percent encode values if they contain illegal characters like spaces. The values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent encoded quotes (%22) and spaces (%20) like this: "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_sequence". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all genomes associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_sequence for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
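
Building a percent-encoded rql_query by hand is error-prone, so the following Python sketch shows one way to assemble an "in" query string. The helper name `rql_in` is hypothetical and not part of RESCRIPt or BV-BRC. The rule "wrap a value in quotes only when it contains characters that need encoding" is an assumption inferred from the two examples in the rql_query description.

```python
from urllib.parse import quote


def rql_in(field: str, values: list[str]) -> str:
    """Build an RQL 'in' query with percent-encoded values.

    Hypothetical helper for illustration only.
    """
    encoded = []
    for value in values:
        if quote(value, safe="") == value:
            # Nothing to encode (e.g. a genome_id such as 224308.43).
            encoded.append(value)
        else:
            # Wrap in double quotes, then encode quotes (%22) and spaces (%20).
            encoded.append(quote(f'"{value}"', safe=""))
    return f"in({field},({','.join(encoded)}))"


print(rql_in("genome_id", ["224308.43", "2030927.4755"]))
# in(genome_id,(224308.43,2030927.4755))
print(rql_in("species", ["Bacillus subtilis", "Bacteroidales bacterium"]))
# in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))
```

The resulting string is what would be passed as the rql_query parameter.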

Outputs¶

genomes: GenomeData[DNASequence]: Genome sequences for specified query.[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]

rescript get-bv-brc-metadata¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
data_type: Str % Choices('antibiotics', 'enzyme_class_ref', 'epitope', 'epitope_assay', 'experiment', 'bioset', 'bioset_result', 'gene_ontology_ref', 'genome', 'strain', 'genome_amr', 'feature_sequence', 'genome_feature', 'genome_sequence', 'id_ref', 'misc_niaid_sgc', 'pathway', 'pathway_ref', 'ppi', 'protein_family_ref', 'sequence_feature', 'sequence_feature_vt', 'sp_gene', 'sp_gene_ref', 'spike_lineage', 'spike_variant', 'structured_assertion', 'subsystem', 'subsystem_ref', 'taxonomy', 'protein_structure', 'protein_feature', 'surveillance', 'serology'): BV-BRC data type for which metadata should be downloaded. Check https://www.bv-brc.org/api/doc/ for documentation.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here, "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. For example, the values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20): "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the specified "data-type". This parameter can only be used in conjunction with the "ids" parameter. Retrieves metadata associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/ for allowed data fields in the specified "data-type".[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]

Outputs¶

metadata: ImmutableMetadata: BV-BRC metadata of specified data type.[required]

rescript get-bv-brc-genome-features¶

Citations¶

Robeson et al., 2021; Olson et al., 2023

Parameters¶

ids_metadata: MetadataColumn[Numeric | Categorical]: A metadata column obtained with the action get-bv-brc-metadata that can be used as a query.[optional]
rql_query: Str: Query in RQL format. To download all data for genome_ids "224308.43" and "2030927.4755", the RQL query looks like this: "in(genome_id,(224308.43,2030927.4755))". Here, "in" is an RQL operator, "genome_id" is a data field, and "224308.43,2030927.4755" are the values. Values must be percent-encoded if they contain illegal characters such as spaces. For example, the values "Bacillus subtilis" and "Bacteroidales bacterium" have to be provided with percent-encoded quotes (%22) and spaces (%20): "in(species,(%22Bacillus%20subtilis%22,%22Bacteroidales%20bacterium%22))". Check https://www.bv-brc.org/api/doc/ for documentation on data types and corresponding data fields.[optional]
data_field: Str: Data field of the data type "genome_feature". This parameter can only be used in conjunction with the "ids" parameter. Retrieves all data associated with the IDs/values specified in parameter "ids" in this data field. Check https://www.bv-brc.org/api/doc/genome_feature for allowed data fields.[optional]
ids: List[Str]: IDs/values of the corresponding data field. This parameter can only be used in conjunction with the "data-field" parameter. Retrieves all data associated with these IDs/values in the specified data field.[optional]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]: List of taxonomic ranks for building a taxonomy. [default: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'][optional]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]

Outputs¶

genes: GenomeData[Genes]: Gene[required]
proteins: GenomeData[Proteins]: proteins[required]
taxonomy: FeatureData[Taxonomy]: Taxonomy data for all sequences.[required]
loci: GenomeData[Loci]: loci[required]

rescript evaluate-seqs¶

Citations¶

Inputs¶

sequences: List[FeatureData[Sequence]]: One or more sets of sequences to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
kmer_lengths: List[Int % Range(1, None)]: Sequence kmer lengths to optionally use for entropy calculation. Warning: kmer entropy calculations may be time-consuming for large sequence sets.[optional]
subsample_kmers: Float % Range(0, 1, inclusive_start=False, inclusive_end=True): Optionally subsample sequences prior to kmer entropy measurement. A fraction of the input sequences will be randomly subsampled at the specified value.[default: 1.0]
palette: Str % Choices('Set1', 'Set2', 'Set3', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'tab10', 'tab20', 'tab20b', 'tab20c', 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'terrain', 'rainbow', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic'): Color palette to use for plotting evaluation results.[default: 'viridis']
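
To make the kmer_lengths and subsample_kmers parameters more concrete, here is a rough Python sketch of a Shannon entropy calculation over k-mer frequencies with optional subsampling. It only illustrates the general idea and is not necessarily what evaluate-seqs computes. The function name `kmer_entropy`, the `seed` argument, and the choice of base-2 Shannon entropy are all assumptions.

```python
import math
import random
from collections import Counter


def kmer_entropy(seqs: list[str], k: int, subsample: float = 1.0, seed: int = 0) -> float:
    """Shannon entropy (bits) of the k-mer frequency distribution.

    Hypothetical helper for illustration only.
    """
    if not 0 < subsample <= 1:
        raise ValueError("subsample must be in (0, 1]")
    if subsample < 1.0:
        # Randomly keep a fraction of the input sequences (at least one).
        n_keep = max(1, round(len(seqs) * subsample))
        seqs = random.Random(seed).sample(seqs, n_keep)
    # Count every overlapping k-mer across the retained sequences.
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(kmer_entropy(["ACGT"], k=1))  # four equally frequent 1-mers -> 2.0 bits
```

The cost grows with both the number of sequences and the number of k-mer lengths requested, which is why subsampling can help on large sequence sets.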

Outputs¶

visualization: Visualization: <no description>[required]

rescript evaluate-fit-classifier¶

Citations¶

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]
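
The "auto" rule for reads_per_batch can be written out as a short Python sketch. The function name `auto_reads_per_batch` is hypothetical. The use of integer division, the lower bound of 1, and resolving n_jobs=0 via `os.cpu_count()` are assumptions of this sketch rather than documented behaviour.

```python
import os


def auto_reads_per_batch(n_queries: int, n_jobs: int = 1, cap: int = 20000) -> int:
    """Approximate min(number of query sequences / n_jobs, 20000).

    Hypothetical helper for illustration only.
    """
    # n_jobs == 0 means "use all CPUs" according to the n_jobs description.
    workers = n_jobs if n_jobs > 0 else (os.cpu_count() or 1)
    # Integer division and the floor of 1 are assumptions of this sketch.
    return max(1, min(n_queries // workers, cap))


print(auto_reads_per_batch(100_000, n_jobs=4))  # 25000 per job, capped at 20000
print(auto_reads_per_batch(1_000, n_jobs=4))    # 250
```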

Outputs¶

classifier: TaxonomicClassifier: Trained naive Bayes taxonomic classifier.[required]
evaluation: Visualization: Visualization of classification accuracy results.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by the trained classifier.[required]

rescript evaluate-cross-validate¶

Citations¶

Robeson et al., 2021; Bokulich et al., 2018; Bokulich et al., 2018

Inputs¶

sequences: FeatureData[Sequence]: Reference sequences to use for classifier training/testing.[required]
taxonomy: FeatureData[Taxonomy]: Reference taxonomy to use for classifier training/testing.[required]

Parameters¶

k: Int % Range(2, None): Number of stratified folds.[default: 3]
random_state: Int % Range(0, None): Seed used by the random number generator.[default: 0]
reads_per_batch: Int % Range(1, None) | Str % Choices('auto'): Number of reads to process in each batch. If "auto", this parameter is autoscaled to min(number of query sequences / n_jobs, 20000).[default: 'auto']
n_jobs: Threads: The maximum number of concurrent worker processes. If 0, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'): Confidence threshold for limiting taxonomic depth. Set to "disable" to disable confidence calculation, or 0 to calculate confidence but not apply it to limit the taxonomic depth of the assignments.[default: 0.7]

Outputs¶

expected_taxonomy: FeatureData[Taxonomy]: Expected taxonomic label for each input sequence. Taxonomic labels may be truncated due to k-fold CV and stratification.[required]
observed_taxonomy: FeatureData[Taxonomy]: Observed taxonomic label for each input sequence, predicted by cross-validation.[required]
evaluation: Visualization: Visualization of cross-validated accuracy results.[required]

rescript evaluate-classifications¶

Citations¶

Inputs¶

expected_taxonomies: List[FeatureData[Taxonomy]]: True taxonomic labels for one or more sets of features.[required]
observed_taxonomies: List[FeatureData[Taxonomy]]: Predicted classifications of same sets of features, input in same order as expected_taxonomies.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
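
The labelling rule for the labels parameter (labels applied in input order, missing labels filled numerically, extra labels ignored) could look like the Python sketch below. The function name `assign_labels` is hypothetical, and numbering unnamed inputs by their 1-based position is an assumption; the plugin may number them differently.

```python
from typing import Optional


def assign_labels(n_inputs: int, labels: Optional[list[str]] = None) -> list[str]:
    """Pair each input with a label, in input order.

    Hypothetical helper for illustration only.
    """
    # Extra labels beyond the number of inputs are ignored.
    assigned = list(labels or [])[:n_inputs]
    # Unnamed inputs get sequential numeric labels (1-based position assumed).
    assigned += [str(i) for i in range(len(assigned) + 1, n_inputs + 1)]
    return assigned


print(assign_labels(3, ["silva"]))       # ['silva', '2', '3']
print(assign_labels(2, ["a", "b", "c"]))  # ['a', 'b']
```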

Outputs¶

evaluation: Visualization: Visualization of classification accuracy results.[required]

rescript evaluate-taxonomy¶

Citations¶

Inputs¶

taxonomies: List[FeatureData[Taxonomy]]: One or more taxonomies to evaluate.[required]

Parameters¶

labels: List[Str]: List of labels to use for labeling evaluation results in the resulting visualization. Inputs are labeled with labels in the order that each is input. If there are fewer labels than inputs (or no labels), unnamed inputs are labeled numerically in sequential order. Extra labels are ignored.[optional]
rank_handle_regex: Str: Regular expression indicating which taxonomic rank a label belongs to; this handle is stripped from the label prior to operating on the taxonomy. The net effect is that ambiguous or empty levels can be removed prior to comparison, enabling selection of taxonomies with more complete taxonomic information. For example, "^[dkpcofgs]__" will recognize Greengenes or SILVA rank handles.[optional]
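
As an illustration of what stripping rank handles achieves, the Python sketch below removes the handle from each label and drops levels that end up empty. The function name `strip_rank_handles` and the ";" separator are assumptions, and the plugin's own handling of empty or ambiguous levels may differ.

```python
import re


def strip_rank_handles(taxonomy: str, rank_handle_regex: str, sep: str = ";") -> list[str]:
    """Strip rank handles and discard levels left empty.

    Hypothetical helper for illustration only.
    """
    labels = [re.sub(rank_handle_regex, "", level.strip()) for level in taxonomy.split(sep)]
    # A level such as 's__' becomes an empty string and is removed.
    return [label for label in labels if label]


print(strip_rank_handles("g__Faecalibacterium; s__", "^[dkpcofgs]__"))
# ['Faecalibacterium']
print(strip_rank_handles("g__Faecalibacterium; s__prausnitzii", "^[dkpcofgs]__"))
# ['Faecalibacterium', 'prausnitzii']
```

Once the empty species level is gone, the first annotation is visibly shorter than the second, so the comparison reflects how much information each one actually carries.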

Outputs¶

taxonomy_stats: Visualization: <no description>[required]

rescript get-silva-data¶

Citations¶

Robeson et al., 2021; Pruesse et al., 2007; Quast et al., 2013

Parameters¶

version: Str % Choices('128', '132') | Str % Choices('138') | Str % Choices('138.1', '138.2'): SILVA database version to download.[default: '138.2']
target: Str % Choices('SSURef_NR99', 'SSURef', 'LSURef') | Str % Choices('SSURef_NR99', 'SSURef') | Str % Choices('SSURef_NR99', 'SSURef', 'LSURef_NR99', 'LSURef'): Reference sequence target to download. SSURef = redundant small subunit reference. LSURef = redundant large subunit reference. SSURef_NR99 = non-redundant (clustered at 99% similarity) small subunit reference.[default: 'SSURef_NR99']
include_species_labels: Bool: Include species rank labels in taxonomy output. Note: species-labels may not be reliable in all cases.[default: False]
rank_propagation: Bool: If a rank has no taxonomy associated with it, the taxonomy from the upper-level rank of that lineage will be propagated downward. For example, if the genus label is missing in 'f__Pasteurellaceae; g__', then the 'f__' rank will be propagated to become 'f__Pasteurellaceae; g__Pasteurellaceae'.[default: True]
ranks: List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus')]: List of taxonomic ranks for building a taxonomy from the SILVA Taxonomy database. Use 'include_species_labels' to append the organism name as the species label. [default: 'domain', 'phylum', 'class', 'order', 'family', 'genus'][optional]
download_sequences: Bool: Toggle whether or not to download and import the SILVA reference sequences associated with the release. Skipping the sequences is useful if you only want to download and parse the taxonomy, e.g., a local copy of the sequences already exists or for testing purposes. NOTE: if this option is used, a silva_sequences output is still created, but contains no data.[default: True]

Outputs¶

silva_sequences: FeatureData[RNASequence]: SILVA reference sequences.[required]
silva_taxonomy: FeatureData[Taxonomy]: SILVA reference taxonomy.[required]

rescript trim-alignment¶

Citations¶