getSeq {BSgenome} | R Documentation |
A function for extracting a set of sequences (or subsequences) from a BSgenome object or other sequence container. This man page specifically documents the method for BSgenome objects.
getSeq(x, ...) ## S4 method for signature 'BSgenome': getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=TRUE)
x |
A BSgenome object.
See the available.genomes function for how
to install a genome.
|
names |
The names of the sequences in x from where to extract the
subsequences, or a GRanges object,
or a RangedData object,
or a named RangesList object,
or a named Ranges object.
The RangesList or Ranges object
must be named according to the sequences in x from where
to extract the subsequences.
If
See |
start, end, width |
Vector of integers (eventually with NAs) specifying the locations
of the subsequences to extract.
These are not needed (and therefore it's an error to supply them)
when names is a GRanges,
RangedData, RangesList
or Ranges object.
|
strand |
A vector containing "+" s or/and "-" s.
This is not needed (and therefore it's an error to supply it)
when names is a GRanges object
or a RangedData object with a strand column.
|
as.character |
TRUE or FALSE . Should the extracted sequences
be returned in a standard character vector?
|
... |
Additional arguments. (Currently ignored.) |
L, the number of sequences to extract, is determined as follow:
names
is a GRanges or
Ranges object then L = length(names)
.
names
is a RangedData object
then L = nrow(names)
.
names
is a RangesList object
then L = length(unlist(names))
.
names
,
start
, end
and width
and all these
arguments are recycled to this length.
NA
s and negative values in these 3 arguments are
solved according to the rules of the SEW (Start/End/Width)
interface (see ?solveUserSEW
for
the details).
If names
is neither a GRanges object
or a RangedData object with a strand column,
then the strand
argument is also recycled to length L.
Here is how the lookup between the names passed to the names
argument and the sequences in x
is performed.
For each name
in names
:
x
contains a single sequence with that name
then this sequence is used for extraction;
name
is treated
as a regular expression and grep
is
used for this search. If exactly one sequence is found,
then it is used for extraction, otherwise an error is raised.
A character vector of length L when as.character=TRUE
.
A DNAString or DNAStringSet
object when as.character=FALSE
.
More precisely the returned value is a DNAString
object if L = 1 and names
is not a GRanges,
RangedData, RangesList or
Ranges object.
Otherwise it's a DNAStringSet object.
The default value for the as.character
argument is TRUE
even though getSeq
is much more efficient (in terms of speed
and memory usage) when used with as.character=FALSE
.
Be aware that using as.character=TRUE
can be very inefficient
when extracting a high number of sequences (hundreds of thousands)
or long sequences (> 1 million letters).
For this reason the plan is to change the default value to FALSE
in the near future.
Note that the masks in x
, if any, are always ignored. In other
words, masked regions in the genome are extracted in the same way as
unmasked regions (this is achieved by dropping the masks before extraction).
See ?`MaskedXString-class`
for more
information about masked sequences.
H. Pages; improvements suggested by Matt Settles and others
available.genomes
,
BSgenome-class,
seqnames,BSgenome-method,
mseqnames,BSgenome-method,
[[,BSgenome-method,
DNAString-class,
DNAStringSet-class,
MaskedDNAString-class,
GRanges-class,
Ranges-class,
RangesList-class,
subseq,XVector-method,
grep
## --------------------------------------------------------------------- ## A. EXTRACTING A SMALL NUMBER OF SEQUENCES FROM THE C. elegans GENOME ## --------------------------------------------------------------------- # Load the Caenorhabditis elegans genome (UCSC Release ce2): library(BSgenome.Celegans.UCSC.ce2) # Look at the index of sequences: Celegans # Get chromosome V as a DNAString object: getSeq(Celegans, "chrV", as.character=FALSE) # which is in fact the same as doing: Celegans$chrV # Never try this: #getSeq(Celegans, "chrV") # or this (even worse): #getSeq(Celegans) # Get the first 20 bases of each chromosome: getSeq(Celegans, end=20) # Get the last 20 bases of each chromosome: getSeq(Celegans, start=-20) # Extracting small sequences from different chromosomes: myseqs <- data.frame( chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"), start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30), end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11), strand=c("+", "-", "+", "+", "-", "-", "+", "-") ) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end) getSeq(Celegans, myseqs$chr, start=myseqs$start, end=myseqs$end, strand=myseqs$strand) # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000 # multiple sequence) as a character string: s1 <- getSeq(Celegans, "NM_058280_up_1000") # or a DNAString object (more efficient): s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE) getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1 # TRUE getSeq(Celegans, "NM_058280_up_5000", start=-1000, as.character=FALSE) == s2 # TRUE ## --------------------------------------------------------------------- ## B. RANDOMLY EXTRACTING A HIGH NUMBER OF 40-MERS FROM THE C. elegans ## GENOME ## --------------------------------------------------------------------- extractRandomReads <- function(x, density, readlength) { if (!is.integer(readlength)) readlength <- as.integer(readlength) start <- lapply(seqnames(x), function(name) { seqlength <- seqlengths(x)[name] sample(seqlength - readlength + 1L, seqlength * density, replace=TRUE) }) names <- rep.int(seqnames(x), elementLengths(start)) ranges <- IRanges(start=unlist(start), width=readlength) strand <- strand(sample(c("+", "-"), length(names), replace=TRUE)) gr <- GRanges(seqnames=names, ranges=ranges, strand=strand) getSeq(x, gr, as.character=FALSE) } library(BSgenome.Celegans.UCSC.ce2) ## With a density of 1 read every 100 genome bases, the total number of ## extracted 40-mers is about 1 million: rndreads <- extractRandomReads(Celegans, 0.01, 40) ## Notes: ## - The short sequences in 'rndreads' can be seen as the result of a ## simulated high-throughput sequencing experiment. A non-realistic ## one though because: ## (a) It assumes that the underlying technology is perfect (the ## generated reads have no technology induced errors). ## (b) It assumes that the sequenced genome is exactly the same as ## the reference genome. ## (c) The simulated reads can contain IUPAC ambiguity letters only ## because the reference genome contains them. In a real ## high-throughput sequencing experiment, the sequenced genome ## of course doesn't contain those letters, but the sequencer ## can introduce them in the generated reads to indicate ## ambiguous base-calling. ## - Those reads are coming from the plus and minus strands of the ## chromosomes. ## - With a density of 0.01 and the reads being only 40-base long, the ## average coverage of the genome is only 0.4 which is low. The total ## number of reads is about 1 million and it takes less than 10 sec. ## to generate them. ## - A higher coverage can be achieved by using a higher density and/or ## longer reads. For example, with a density of 0.1 and 100-base reads ## the average coverage is 10. The total number of reads is about 10 ## millions and it takes less than 1 minute to generate them. ## - Those reads could easily be mapped back to the reference by using ## an efficient matching tool like matchPDict() for performing exact ## matching (see ?matchPDict for more information). Typically, a ## small percentage of the reads (4 to 5% in our case) will hit the ## reference at multiple locations. This is especially true for such ## short reads, and, in a lower proportion, is still true for longer ## reads, even for reads as long as 300 bases.