cigar-utils {GenomicRanges} | R Documentation |
Utility functions for low-level CIGAR manipulation.
cigarOpTable(cigar) cigarToQWidth(cigar, before.hard.clipping=FALSE) cigarToWidth(cigar) cigarQNarrow(cigar, start=NA, end=NA, width=NA) cigarNarrow(cigar, start=NA, end=NA, width=NA) cigarToIRanges(cigar, drop.D.ranges=FALSE, merge.ranges=TRUE) cigarToIRangesListByAlignment(cigar, pos, flag=NULL, drop.D.ranges=FALSE) cigarToIRangesListByRName(cigar, rname, pos, flag=NULL, drop.D.ranges=FALSE, merge.ranges=TRUE) queryLoc2refLoc(qloc, cigar, pos=1) queryLocs2refLocs(qlocs, cigar, pos, flag=NULL) splitCigar(cigar) cigarToRleList(cigar) cigarToCigarTable(cigar) summarizeCigarTable(x)
cigar |
A character vector/factor containing the extended CIGAR
string for each read.
For cigarToIRanges and queryLoc2refLoc ,
this must be a single string (i.e. a character vector/factor
of length 1).
|
before.hard.clipping |
Should the returned widths be the lengths of the reads before
or after "hard clipping"? Hard clipping of a read is
encoded with an H in the CIGAR.
If NO (before.hard.clipping=FALSE , the default),
then the returned widths are the lengths of the query
sequences stored in the SAM/BAM file.
If YES (before.hard.clipping=TRUE ), then the returned
widths are the lengths of the original reads.
|
start,end,width |
Vectors of integers. NAs and negative values are accepted and
"solved" according to the rules of the SEW (Start/End/Width)
interface (see ?solveUserSEW for the details).
|
drop.D.ranges |
Should the ranges corresponding to a deletion from the
reference (encoded with a D in the CIGAR) be dropped?
By default we keep them to be consistent with the pileup tool
from SAMtools.
Note that, when drop.D.ranges is TRUE , then Ds
and Ns in the CIGAR are equivalent.
|
merge.ranges |
Should adjacent ranges coming from the same cigar be merged or not?
Using TRUE (the default) can significantly reduce the
size of the returned object.
|
pos |
An integer vector containing the 1-based leftmost position/coordinate for each (eventually clipped) read sequence. |
flag |
NULL or an integer vector containing the SAM flag for
each read.
According to the SAM specs, flag bits 0x004 and 0x400 have
the following meaning: when bit 0x004 is ON then "the query
sequence itself is unmapped" and when bit 0x400 is ON then
"the read is either a PCR duplicate or an optical duplicate".
When flag is provided,
cigarToIRangesListByAlignment and
cigarToIRangesListByRName ignore these reads.
|
rname |
A character vector/factor containing the name of the reference sequence associated with each read (i.e. the name of the sequence the read has been aligned to). |
qloc |
An integer vector containing "query-based locations" i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file. |
qlocs |
A list of the same length as cigar where each
element is an integer vector containing "query-based
locations" i.e. 1-based locations relative to the corresponding
query sequence stored in the SAM/BAM file.
|
x |
A DataFrame produced by cigarToCigarTable .
|
For cigarOpTable
: An integer matrix with number of rows equal
to the length of cigar
and seven columns, one for each extended
CIGAR operation.
For cigarToQWidth
: An integer vector of the same length
as cigar
where each element is the width of the query (i.e.
the length of the query sequence) as inferred from the corresponding
element in cigar
(NAs in cigar
will produce NAs in the
returned vector).
For cigarQNarrow
and cigarNarrow
: A character vector
of the same length as cigar
containing the narrowed cigars.
In addition the vector has an "rshift" attribute which is an integer
vector of the same length as cigar
. It contains the values that
would need to be added to the POS field of a SAM/BAM file as a
consequence of this cigar narrowing.
For cigarToWidth
: An integer vector of the same length
as cigar
where each element is the width of the alignment
(i.e. its total length on the reference, gaps included)
as inferred from the corresponding element in cigar
(NAs in cigar
will produce NAs in the returned vector).
For cigarToIRanges
: An IRanges object
describing where the bases in the read align with respect to
an imaginary reference sequence assuming that the leftmost
aligned base is at position 1 in the reference (i.e. at the
first position).
For cigarToIRangesListByAlignment
: A
CompressedNormalIRangesList object of the same length
as cigar
.
For cigarToIRangesListByRName
: A named IRangesList
object with one element (IRanges) per unique reference
sequence.
For queryLoc2refLoc
: An integer vector of the same length as
qloc
containing the "reference-based locations" (i.e. the
1-based locations relative to the reference sequence) corresponding
to the "query-based locations" passed in qloc
.
For queryLocs2refLocs
: A list of the same length as
qlocs
where each element is an integer vector containing
the "reference-based locations" corresponding to the "query-based
locations" passed in the corresponding element in qlocs
.
For splitCigar
: A list of the same length as cigar
where each element is itself a list with 2 elements of the same
lengths, the 1st one being a raw vector containing the CIGAR
operations and the 2nd one being an integer vector containing
the lengths of the CIGAR operations.
For cigarToRleList
: A CompressedRleList object.
For cigarToCigarTable
: A frequency table of the CIGARs
in the form of a DataFrame with two columns:
cigar
(a CompressedRleList) and count
(an integer).
For summarizeCigarTable
: A list with two elements:
AlignedCharacters
(integer) and Indels
(matrix)
H. Pages and P. Aboyoun
http://samtools.sourceforge.net/
IRanges-class,
IRangesList-class,
coverage
,
RleList-class
## --------------------------------------------------------------------- ## A. SIMPLE EXAMPLES ## --------------------------------------------------------------------- ## With a cigar vector of length 1: cigar1 <- "3H15M55N4M2I6M2D5M6S" ## cigarToQWidth()/cigarToWidth(): cigarToQWidth(cigar1) cigarToQWidth(cigar1, before.hard.clipping=TRUE) cigarToWidth(cigar1) ## cigarQNarrow(): cigarQNarrow(cigar1, start=4, end=-3) cigarQNarrow(cigar1, start=10) cigarQNarrow(cigar1, start=19) cigarQNarrow(cigar1, start=24) ## cigarNarrow(): cigarNarrow(cigar1) # only drops the soft/hard clipping cigarNarrow(cigar1, start=10) cigarNarrow(cigar1, start=15) cigarNarrow(cigar1, start=15, width=57) cigarNarrow(cigar1, start=16) #cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar) cigarNarrow(cigar1, start=71) cigarNarrow(cigar1, start=72) cigarNarrow(cigar1, start=75) ## cigarToIRanges(): cigarToIRanges(cigar1) cigarToIRanges(cigar1, merge.ranges=FALSE) cigarToIRanges(cigar1, drop.D.ranges=TRUE) ## With a cigar vector of length 4: cigar2 <- c("40M", cigar1, "2S10M2000N15M", "3H25M5H") pos <- c(1, 1001, 1, 351) cigarToIRangesListByAlignment(cigar2, pos) rname <- c("chr6", "chr6", "chr2", "chr6") cigarToIRangesListByRName(cigar2, rname, pos) cigarOpTable(cigar2) splitCigar(cigar2) cigarToRleList(cigar2) cigarToCigarTable(cigar2) cigarToCigarTable(cigar2)[,"cigar"] cigarToCigarTable(cigar2)[,"count"] summarizeCigarTable(cigarToCigarTable(cigar2)) ## --------------------------------------------------------------------- ## B. PERFORMANCE ## --------------------------------------------------------------------- if (interactive()) { ## We simulate 20 millions aligned reads, all 40-mers. 95% of them ## align with no indels. 5% align with a big deletion in the ## reference. In the context of an RNAseq experiment, those 5% would ## be suspected to be "junction reads". set.seed(123) nreads <- 20000000L njunctionreads <- nreads * 5L / 100L cigar3 <- character(nreads) cigar3[] <- "40M" junctioncigars <- paste( paste(10:30, "M", sep=""), paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""), paste(30:10, "M", sep=""), sep="") cigar3[sample(nreads, njunctionreads)] <- junctioncigars some_fake_rnames <- paste("chr", c(1:6, "X"), sep="") rname <- sample(some_fake_rnames, nreads, replace=TRUE) pos <- sample(80000000L, nreads, replace=TRUE) ## The following takes < 5 sec. to complete: system.time(rglist <- cigarToIRangesListByAlignment(cigar3, pos)) ## The following takes < 10 sec. to complete: system.time(irl <- cigarToIRangesListByRName(cigar3, rname, pos)) ## Internally, cigarToIRangesListByRName() turns 'rname' into a factor ## before starting the calculation. Hence it will run sligthly ## faster if 'rname' is already a factor. rname2 <- as.factor(rname) system.time(irl2 <- cigarToIRangesListByRName(cigar3, rname2, pos)) ## The sizes of the resulting objects are about 240M and 160M, ## respectively: object.size(rglist) object.size(irl) } ## --------------------------------------------------------------------- ## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE ## --------------------------------------------------------------------- ## The information stored in a BAM file can be used to compute the ## "coverage" of the mapped reads i.e. the number of reads that hit any ## given position in the reference genome. ## The following function takes the path to a BAM file and returns an ## object representing the coverage of the mapped reads that are stored ## in the file. The returned object is an RleList object named with the ## names of the reference sequences that actually receive some coverage. extractCoverageFromBAM <- function(file) { ## This ScanBamParam object allows us to load only the necessary ## information from the file. param <- ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE), what=c("rname", "pos", "cigar")) bam <- scanBam(file, param=param)[[1]] ## Note that unmapped reads and reads that are PCR/optical duplicates ## have already been filtered out by using the ScanBamParam object above. irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) irl <- irl[elementLengths(irl) != 0] # drop empty elements coverage(irl) } library(Rsamtools) f1 <- system.file("extdata", "ex1.bam", package="Rsamtools") extractCoverageFromBAM(f1)