RSA-tools - retrieve-seq manual
Returns upstream, downstream or coding sequences for one or several genes.
- The first column indicates the ID or the name of a query gene.
- The second column indicates the organism to which the gene belongs.
For the organism names, all spaces must be replaced by the underscore
character, as shown in the example.
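The two-column query format can be illustrated with a short Python sketch that builds query lines from (gene, organism) pairs and replaces the spaces in organism names with underscores. The function name `format_query_line`, the tab separator and the example genes are assumptions made for this illustration; they are not part of retrieve-seq.

```python
def format_query_line(gene, organism):
    """Return one query line: the gene, then the organism name.

    Spaces in the organism name are replaced by underscores.
    The tab separator between columns is an assumption of this sketch.
    """
    organism_name = "_".join(organism.split())
    return f"{gene.strip()}\t{organism_name}"


# Hypothetical example queries.
queries = [
    ("PHO5", "Saccharomyces cerevisiae"),
    ("lacZ", "Escherichia coli K12"),
]
for gene, organism in queries:
    print(format_query_line(gene, organism))
```

Splitting on any whitespace before joining also collapses accidental double spaces in the organism name.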
- CDS: coding sequences (from start to stop codon, unspliced)
- mRNA: messenger RNA
- tRNA: transfer RNA
- rRNA: ribosomal RNA
The availability of some sequence types depends on the genome. For
example, some Genbank flat files contain annotations about CDSs but no
mRNA (e.g. bacterial annotations from the NCBI). Some other genomes
contain separate annotations for CDS and mRNA
(e.g. A.thaliana). When mRNAs are annotated in Genbank, their
coordinates are stored and can be used.
The advantage of using mRNA is that, if the mRNA is complete (which is
not always the case), the upstream regions are retrieved relative to the
transcription initiation site, rather than the start codon.
- One gene can be associated with multiple CDSs and with multiple mRNAs.
- Many annotated "mRNAs" actually seem to be CDSs (e.g. in June 2003,
12,000 out of 27,000 mRNAs from A.thaliana start with ATG).
- Upstream: sequences located upstream of the coding region. The
origin is at the start codon.
- Downstream: sequences located downstream of the coding
region. The origin is at the stop codon.
- Unspliced CDS: DNA sequences located between the start and
stop codons. WARNING: introns are not spliced out (this will be
implemented in future versions).
Sequence limits (from, to)
Limits of the region to retrieve. Coordinates are calculated relative
to the start of the coding sequence.
- Negative values return sequences located upstream of the origin.
- Positive values return sequences located downstream of the origin.
(The origin itself depends on the sequence type, see above.)
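The relation between the limits and the origin can be sketched in Python as follows. This is only an illustration, not the code of retrieve-seq: the function `retrieve_relative`, the strand codes "D" and "R", and the convention that position 0 is the origin base itself (so that -1 is the base immediately upstream) are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def retrieve_relative(genome, origin, strand, frm, to):
    """Return the bases at relative positions frm..to (inclusive).

    genome : chromosome sequence, read on the direct strand
    origin : 0-based index (on the direct strand) of the origin base
    strand : "D" (direct) or "R" (reverse); assumed codes
    frm, to: limits relative to the origin, e.g. -800 and -1
    """
    if frm > to:
        raise ValueError("'from' must not exceed 'to'")
    if strand == "D":
        left, right = origin + frm, origin + to
    else:
        # On the reverse strand, upstream lies at higher direct-strand indices.
        left, right = origin - to, origin - frm
    # Clip to the chromosome boundaries.
    left, right = max(left, 0), min(right, len(genome) - 1)
    if left > right:
        return ""
    region = genome[left:right + 1]
    return region if strand == "D" else reverse_complement(region)


# Toy chromosome with a start codon (ATG) at index 9 on the direct strand.
genome = "AAACCCGGGATGTTT"
print(retrieve_relative(genome, 9, "D", -3, -1))    # three bases upstream
print(retrieve_relative(genome, 9, "D", -800, -1))  # clipped at the chromosome start
```

For a gene on the reverse strand, the same relative limits select a region at higher direct-strand indices, and the result is reverse-complemented so that it reads 5' to 3' towards the gene.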
Default values for upstream sequence retrieval
- For yeast, we generally obtain good results with upstream
regions from -800 to -1. About 99% of the known upstream elements are
located within these limits (source: Transfac).
- For bacteria, the default is from -400 to -1 from the start codon
(since we currently do not have annotations about transcription
initiation). The distribution of regulatory sites depends on the mode
of regulation:
- Transcriptional repressors generally bind proximally, and
may overlap the transcription initiation site or even be located
downstream of it. A good guess is from -200 to +50.
- Binding sites for transcriptional activators have a more distal
distribution (-400 to -1).
- The default values for each organism can be obtained with the
Prevent overlap with neighbour genes:
It is quite frequent to find a predicted gene in close proximity
upstream from a query gene. If you want to discard these sequences from
your analysis, you should make sure this option is active.
When the option is checked, upstream sequences are automatically
clipped when a predicted gene is located within the range defined by
the "from" option. The actual size retained for the upstream
sequence is indicated in the sequence comments.
Note that in some cases a known regulatory element is located
upstream or within a predicted gene. This means either that the
predicted gene is an artifact, or that the same sequence is bifunctional
(coding and regulatory).
It is particularly important to activate this option when working with
bacteria, since many genes are located in operons, and have a very
close upstream neighbour.
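The clipping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the function `clip_upstream` and its arguments are hypothetical, only query genes on the direct strand are handled, and neighbour genes are represented solely by the index of their last base.

```python
def clip_upstream(gene_start, frm, to, neighbour_ends):
    """Clip the upstream range [frm, to] at the closest upstream neighbour.

    gene_start     : 0-based index of the origin (first base of the start codon)
    frm, to        : requested limits relative to the origin (e.g. -400, -1)
    neighbour_ends : 0-based indices of the last base of other genes
    Returns (clipped_frm, to), or None when no upstream base remains.
    """
    clipped = frm
    for end in neighbour_ends:
        rel = end - gene_start  # relative position of the neighbour's last base
        if rel < 0:  # the neighbour ends upstream of the origin
            clipped = max(clipped, rel + 1)
    if clipped > to:
        return None
    return clipped, to


# A neighbour ending 150 bp before the start codon shortens the region.
limits = clip_upstream(1000, -400, -1, [850])
print(limits, "retained size:", limits[1] - limits[0] + 1)
```

With a neighbour ending at relative position -150, the retained region runs from -149 to -1, which is the kind of size that would be reported in the sequence comments.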
Admit imprecise positions:
In the annotations of some genomes, the limits of some genes are
imprecisely specified, by indicating an upper limit (e.g. <555245) or
a lower limit (e.g. >898098) rather than a precise value. Such
annotations can be found for example in the genomes
of Schizosaccharomyces pombe and Arabidopsis thaliana.
By default, these genes are not loaded. The option "Admit imprecise
positions" allows you to retrieve sequences for these genes as well,
using the imprecise coordinate as reference position.
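As an illustration of this behaviour, the following Python sketch parses coordinates that may carry a "<" or ">" prefix and either skips or keeps the corresponding genes. The functions `parse_position` and `load_genes`, the `admit_imprecise` flag and the tuple layout of the features are assumptions made for the example; the gene names are invented.

```python
def parse_position(text):
    """Parse a coordinate such as "555245", "<555245" or ">898098".

    Returns (position, is_precise).
    """
    text = text.strip()
    if text[:1] in ("<", ">"):
        return int(text[1:]), False
    return int(text), True


def load_genes(features, admit_imprecise=False):
    """Keep genes with precise limits, or all genes if imprecise ones are admitted.

    features: iterable of (name, start_text, end_text) tuples (hypothetical layout).
    """
    loaded = []
    for name, start_text, end_text in features:
        start, start_ok = parse_position(start_text)
        end, end_ok = parse_position(end_text)
        if admit_imprecise or (start_ok and end_ok):
            # The imprecise coordinate is used as the reference position.
            loaded.append((name, start, end))
    return loaded


features = [
    ("geneA", "<555245", "556000"),
    ("geneB", "600000", "601200"),
    ("geneC", "897000", ">898098"),
]
print(load_genes(features))                        # precise genes only
print(load_genes(features, admit_imprecise=True))  # all three genes
```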
This option uses the genome version in which repeats are masked (i.e. replaced by 'N' characters).
The presence of repetitive elements hampers the detection of motifs, especially for vertebrate genomes, because the composition of these repetitive sequences is very distinct from that of the rest of the genome.
This option is only valid for organisms with annotated repeats.
List of organisms with annotated repeats
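Masking itself amounts to replacing every annotated repeat by 'N' characters while keeping the sequence length unchanged. A minimal Python sketch, assuming that repeats are given as 0-based (start, end) pairs with an exclusive end; the function `mask_repeats` is hypothetical and does not reflect how the masked genomes are actually produced:

```python
def mask_repeats(sequence, repeats):
    """Replace each annotated repeat by 'N' characters.

    repeats: iterable of (start, end) pairs, 0-based, end exclusive.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clip the interval to the sequence boundaries.
        start, end = max(start, 0), min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


print(mask_repeats("ACGTACGTACGT", [(2, 6)]))
```

Because the length is preserved, the coordinates of the retrieved sequences remain valid after masking.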
Output sequence format:
The result can be displayed in various sequence formats.
- raw: the raw sequence without any identifier or comment.
- multi: several raw sequences concatenated.
- IG: IntelliGenetics format.
- FastA: the sequence format used by FastA, BLAST, Gibbs sampler and many other bioinformatics programs.
- Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).
Sequences can be labeled (named) in different ways:
- gene identifier
- gene name
- gene id + gene name
- full: a concatenation of gene identifier, gene name, sequence type, from, to and strand. This option gives a full description of the conditions of sequence retrieval.
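The labelling options can be illustrated by a small Python sketch that builds a label and writes one record in FastA format. The mode names ("id", "name", "id_name", "full"), the "|" separator and the line width of 60 are assumptions made for this example; the exact label layout produced by retrieve-seq may differ.

```python
def make_label(gene_id, gene_name, seq_type, frm, to, strand, mode="full"):
    """Build a sequence label in one of the four modes listed above."""
    if mode == "id":
        return gene_id
    if mode == "name":
        return gene_name
    if mode == "id_name":
        return f"{gene_id}|{gene_name}"
    if mode == "full":
        fields = (gene_id, gene_name, seq_type, frm, to, strand)
        return "|".join(str(field) for field in fields)
    raise ValueError(f"unknown label mode: {mode}")


def to_fasta(label, sequence, width=60):
    """Format one record: a '>' header line, then the sequence wrapped at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{label}"] + lines)


# Example values chosen for illustration only.
label = make_label("YBR093C", "PHO5", "upstream", -800, -1, "R")
print(to_fasta(label, "ACGTACGTACGT", width=8))
```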
The program can be used through its web interface at:
retrieve-seq is a Perl script running on Unix machines (Sun and SGI
have been tested). The web interface is a Perl CGI script.