XStringSet-io {Biostrings}        R Documentation

Read/write an XStringSet or XStringViews object from/to a file

Description

Functions to read/write an XStringSet or XStringViews object from/to a file.

Usage

  ## Read FASTA (or FASTQ) files into an XStringSet object:
  read.BStringSet(filepath, format="fasta")
  read.DNAStringSet(filepath, format="fasta")
  read.RNAStringSet(filepath, format="fasta")
  read.AAStringSet(filepath, format="fasta")

  ## Extract basic information about FASTA (or FASTQ) files
  ## without loading them:
  fasta.info(filepath, use.descs=TRUE)
  fastq.geometry(filepath)

  ## Write an XStringSet object to a FASTA (or FASTQ) file:
  write.XStringSet(x, file="", append=FALSE, format="fasta", width=80)

  ## Serialize an XStringSet object:
  save.XStringSet(x, objname, dirpath=".", save.dups=FALSE, verbose=TRUE)

  ## Legacy functions:
  read.XStringViews(filepath, format="fasta", subjectClass, collapse="")
  write.XStringViews(x, file="", append=FALSE, format="fasta", width=80)
  FASTArecordsToCharacter(FASTArecs, use.names=TRUE)
  CharacterToFASTArecords(x)
  FASTArecordsToXStringViews(FASTArecs, subjectClass, collapse="")
  XStringSetToFASTArecords(x)

Arguments

filepath: A character vector containing the paths to the input files.
format: Either "fasta" (the default) or "fastq". Note that write.XStringSet and write.XStringViews only support "fasta" for now.
use.descs: Should the returned vector be named with the description lines found in the FASTA records?
x: For write.XStringSet and write.XStringViews, the object to write to file. For CharacterToFASTArecords, the (possibly named) character vector to be converted to a list of FASTA records like the one returned by readFASTA. For XStringSetToFASTArecords, the XStringSet object to be converted to a list of FASTA records like the one returned by readFASTA.
file: A connection, or a character string naming the file to write to. If "" (the default), print to the standard output connection (generally the console) unless redirected by sink.
append: TRUE or FALSE. If TRUE, output is appended to file; otherwise, it overwrites the contents of file. See ?cat for the details.
width: Only relevant if format is "fasta". The maximum number of letters per line of sequence.
objname: The name of the serialized object.
dirpath: The path to the directory in which to save the serialized object.
save.dups: TRUE or FALSE. If TRUE, the Dups object describing how duplicated elements in x are related to each other is saved too. For advanced users only.
verbose: TRUE or FALSE.
subjectClass: The class to be given to the subject of the XStringViews object created and returned by the function. Must be the name of one of the direct XString subclasses, i.e. "BString", "DNAString", "RNAString" or "AAString".
collapse: An optional character string to be inserted between the views of the XStringViews object created and returned by the function.
FASTArecs: A list of FASTA records like the one returned by readFASTA.
use.names: Whether or not the description line preceding each FASTA record should be used to set the names of the returned object.

Details

Only FASTA and FASTQ files are supported for now. The identifiers and qualities stored in the FASTQ records are ignored (only the sequences are returned).

The reading functions read.BStringSet, read.DNAStringSet, read.RNAStringSet, read.AAStringSet and read.XStringViews load sequences from an input file (or set of input files) into an XStringSet or XStringViews object. (Note that, for now, read.XStringViews can only read one FASTA file at a time; this limitation is to be addressed.) When multiple input files are specified, they are read in the order given and their data are stored in the returned object in that order. Note that when multiple input FASTQ files are specified, they must all have the same "width" (i.e. all their sequences must have the same length).

The fasta.info utility returns an integer vector with one element per FASTA record in the input files. Each element is the length of the sequence found in the corresponding record. If use.descs is TRUE (the default) then the returned vector is named with the description lines found in the FASTA records.
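To make the shape of that return value concrete, here is a minimal base R sketch. The function fasta_info_sketch is hypothetical and is not part of Biostrings; it assumes that description lines start with ">" and that the ">" is dropped from the names. It only illustrates the documented result and is not how fasta.info is implemented.

```r
# Hypothetical helper (not part of Biostrings): mimics the documented
# return value of fasta.info() using base R only.
fasta_info_sketch <- function(filepath, use.descs = TRUE) {
  # Read all input files in the order given
  lines <- unlist(lapply(filepath, readLines))
  lines <- lines[nzchar(lines)]               # drop blank lines
  is_desc <- startsWith(lines, ">")           # description lines (assumed ">" prefix)
  rec_id <- cumsum(is_desc)                   # record index of every line
  letters_per_line <- ifelse(is_desc, 0L, nchar(lines))
  # Sum the letters of all sequence lines belonging to each record
  lens <- as.integer(tapply(letters_per_line, rec_id, sum))
  if (use.descs)
    names(lens) <- sub("^>", "", lines[is_desc])
  lens
}

# A small made-up FASTA file with two records
tmp_fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1 first record", "ACGTAC", "GT", ">seq2", "TTTT"), tmp_fa)
fasta_info_sketch(tmp_fa)
```

With use.descs=FALSE the same lengths come back as an unnamed integer vector.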

The fastq.geometry utility returns an integer vector describing the "geometry" of the FASTQ files, i.e. a vector of length 2 where the first element is the total number of FASTQ records in the files and the second element is the common "width" of these files (this width is NA if the files contain no FASTQ records or records with different "widths").
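The "geometry" described above can be sketched in base R as follows. The function fastq_geometry_sketch is hypothetical and is not part of Biostrings. It assumes the simple FASTQ layout with exactly four lines per record, the sequence being the second line, which real files may not always follow.

```r
# Hypothetical helper (not part of Biostrings): computes the number of
# records and the common sequence width, as documented for fastq.geometry().
fastq_geometry_sketch <- function(filepath) {
  lines <- unlist(lapply(filepath, readLines))
  # Assumption: 4 lines per record, sequence on the 2nd line of each record
  seqs <- lines[seq_along(lines) %% 4 == 2]
  widths <- unique(nchar(seqs))
  # The width is NA when there are no records or when widths differ
  width <- if (length(widths) == 1L) widths else NA_integer_
  c(length(seqs), width)
}

# Two small made-up FASTQ files with different sequence widths
fq1 <- tempfile(fileext = ".fastq")
fq2 <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"), fq1)
writeLines(c("@r3", "ACG", "+", "III"), fq2)

fastq_geometry_sketch(fq1)          # one file, a single common width
fastq_geometry_sketch(c(fq1, fq2))  # mixed widths, so the width is NA
```

This also shows why the reading functions require a common width across FASTQ files: a set of files whose geometry has an NA width does not satisfy that constraint.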

The writing functions write.XStringSet and write.XStringViews write an XStringSet or XStringViews object to a file or connection. They only support the FASTA format for now.
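The effect of the file, append and width arguments can be illustrated with a base R sketch that works on a named character vector. The function write_fasta_sketch is hypothetical and is not part of Biostrings; the exact output layout of write.XStringSet may differ in details.

```r
# Hypothetical helper (not part of Biostrings): writes a named character
# vector as FASTA, wrapping each sequence at 'width' letters per line.
write_fasta_sketch <- function(x, file = "", append = FALSE, width = 80) {
  wrap <- function(s) {
    # Start positions of the successive chunks of at most 'width' letters
    starts <- seq(1, max(nchar(s), 1), by = width)
    substring(s, starts, pmin(starts + width - 1, nchar(s)))
  }
  records <- lapply(seq_along(x), function(i)
    c(paste0(">", names(x)[i]), wrap(x[[i]])))
  # As with cat(), file = "" prints to the console
  cat(paste0(unlist(records), "\n"), file = file, sep = "", append = append)
}

seqs <- c(seq1 = "ACGTACGTAC", seq2 = "TTT")
write_fasta_sketch(seqs, width = 4)   # prints to the console

out_fa <- tempfile(fileext = ".fa")
write_fasta_sketch(seqs, file = out_fa, width = 4)
# append = TRUE adds records after the existing contents
write_fasta_sketch(c(seq3 = "GG"), file = out_fa, append = TRUE, width = 4)
readLines(out_fa)
```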

Serializing an XStringSet object with save.XStringSet is equivalent to using the standard save mechanism, except that it first tries to reduce the size of x in memory before calling save. Most of the time this leads to a much smaller file on disk.
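For comparison, the standard save mechanism mentioned above looks roughly like this in base R. The function save_object_sketch is hypothetical, the ".rda" file name is an assumption made for this illustration, and the size reduction step performed by save.XStringSet is not reproduced here.

```r
# Hypothetical helper (not part of Biostrings): saves 'x' under the name
# 'objname' in directory 'dirpath' using the standard save() mechanism.
save_object_sketch <- function(x, objname, dirpath = ".") {
  env <- new.env()
  assign(objname, x, envir = env)      # bind the object to the requested name
  # Assumption: the file is named after the object, with an ".rda" extension
  path <- file.path(dirpath, paste0(objname, ".rda"))
  save(list = objname, envir = env, file = path)
  invisible(path)
}

reads <- c("ACGTACGTAC", "TTGACCATGA")
rda_path <- save_object_sketch(reads, "my_reads", dirpath = tempdir())

# load() restores the object under the name it was saved with
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
loaded_names
```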

FASTArecordsToCharacter, CharacterToFASTArecords, FASTArecordsToXStringViews and XStringSetToFASTArecords are helper functions used internally by write.XStringSet and read.XStringViews for switching between different representations of the same object.

See Also

readFASTA, writeFASTA, XStringSet-class, XStringViews-class, BString-class, DNAString-class, RNAString-class, AAString-class

Examples

  ## ---------------------------------------------------------------------
  ## A. READ/WRITE FASTA FILES
  ## ---------------------------------------------------------------------
  filepath <- system.file("extdata", "someORF.fa", package="Biostrings")
  fasta.info(filepath)
  x <- read.DNAStringSet(filepath)
  x
  write.XStringSet(x)  # writes to the console

  ## ---------------------------------------------------------------------
  ## B. READ FASTQ FILES
  ## ---------------------------------------------------------------------
  filepath <- system.file("extdata", "s_1_sequence.txt", package="Biostrings")
  fastq.geometry(filepath)
  ## Only the FASTQ sequences are returned (identifiers and qualities
  ## are dropped):
  read.DNAStringSet(filepath, format="fastq")

  ## ---------------------------------------------------------------------
  ## C. SERIALIZATION
  ## ---------------------------------------------------------------------
  library(BSgenome.Celegans.UCSC.ce2)
  ## Create a "sliding window" on chr I:
  sw_start <- seq.int(1, length(Celegans$chrI)-50, by=50)
  sw <- Views(Celegans$chrI, start=sw_start, width=10)
  my_fake_shortreads <- as(sw, "XStringSet")
  save.XStringSet(my_fake_shortreads, "my_fake_shortreads", dirpath=tempdir())

  ## ---------------------------------------------------------------------
  ## D. SOME RELATED HELPER FUNCTIONS
  ## ---------------------------------------------------------------------
  ## Converting 'x'...
  ## ... to a list of FASTA records (like the one returned by the "readFASTA" function)
  x1 <- XStringSetToFASTArecords(x)
  ## ... to a named character vector
  x2 <- FASTArecordsToCharacter(x1) # same as 'as.character(x)'

[Package Biostrings version 2.16.9 Index]