Based on the buildref function in Inigo Martincorena's package dNdScv, this function takes in gene/transcript/CDS definitions and creates a dNdScv-style RefCDS object, along with an associated GenomicRanges object that is also required to run dNdScv.
Usage
build_RefCDS(
gtf,
genome,
use_all_transcripts = TRUE,
cds_ranges_lack_stop_codons = TRUE,
cores = 1,
additional_essential_splice_pos = NULL,
numcode = 1,
chromosome_style = "NCBI"
)
Arguments
- gtf
Path of a Gencode-style GTF file, or an equivalently formatted data table. See details for required columns (features). It's possible to build such a table using data pulled from biomaRt, but it's easier to use a GTF.
- genome
Genome assembly name (e.g., "hg19"); an associated BSgenome object must be available to load. Alternatively, supply a BSgenome object directly.
- use_all_transcripts
Logical (default TRUE). If TRUE, use all complete transcripts; if FALSE, use only the longest complete transcript for each gene.
- cds_ranges_lack_stop_codons
The CDS records in Gencode GTFs don't include the stop codons in their genomic intervals. If your input does include the stop codons within CDS records, set to FALSE.
- cores
Number of cores to use for parallel computations (default 1).
- additional_essential_splice_pos
Usually not needed. A list of additional essential splice site positions to combine with those calculated automatically by this function. Each element of the list should have a name matching a protein_id in the input and consist of a numeric vector of additional positions. This option exists so that mutations at chr17:7579312 on TP53 are treated as splice site mutations in cancereffectsizeR's default hg19 reference data set. (Variants at this coding position, which are always synonymous, have validated effects on splicing, even though the position misses automatic "essential splice" annotation by 1 base.)
- numcode
Leave at the default. NCBI genetic code number; currently only code 1, the standard genetic code, is supported.
- chromosome_style
Chromosome naming style to use. Defaults to "NCBI". For the human genome, that means 1, 2, ..., as opposed to "UCSC" style (chr1, chr2, ...). The value gets passed to GenomeInfoDb's seqlevelsStyle().
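As an illustration of the shape expected for additional_essential_splice_pos, here is a minimal sketch of a named list of numeric position vectors. The name EXAMPLE_PROTEIN_ID is a hypothetical placeholder (it is not a real identifier and must be replaced by a protein_id present in your input); the position is the TP53 coordinate mentioned above. The checks are only a sanity test of the documented structure, not something the package is known to run.

```r
# Hypothetical protein ID: replace with a protein_id that exists in your input.
extra_splice_pos <- list(
  EXAMPLE_PROTEIN_ID = c(7579312)
)

# Sanity checks mirroring the documented requirements:
# a named list whose elements are numeric vectors of positions.
stopifnot(
  is.list(extra_splice_pos),
  !is.null(names(extra_splice_pos)),
  all(nzchar(names(extra_splice_pos))),
  all(vapply(extra_splice_pos, is.numeric, logical(1)))
)
```

A list built this way would then be supplied as the additional_essential_splice_pos argument.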
Value
A two-item list. The first item, RefCDS, is itself a large list in which each entry contains information on one coding sequence (CDS). The second item is a GRanges object that defines the genomic intervals covered by each CDS.
Details
Required columns are seqnames, start, end, strand, gene_name, gene_id, protein_id, and type. Only rows that have type == "CDS" will be used. Strand should be "+" or "-".
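The column requirements above can be checked before calling the function. The following is a minimal sketch in base R, assuming a toy table with made-up identifiers and coordinates (GENE_A, G0001, P0001 are all hypothetical). A plain data.frame is used to keep the sketch dependency-free; if a data.table is needed, converting with data.table::as.data.table() is one option.

```r
required_cols <- c("seqnames", "start", "end", "strand",
                   "gene_name", "gene_id", "protein_id", "type")

# Toy records with made-up identifiers and coordinates.
gtf_table <- data.frame(
  seqnames   = c("1", "1", "1"),
  start      = c(100, 100, 400),
  end        = c(250, 250, 520),
  strand     = c("+", "+", "+"),
  gene_name  = c("GENE_A", "GENE_A", "GENE_A"),
  gene_id    = c("G0001", "G0001", "G0001"),
  protein_id = c(NA, "P0001", "P0001"),
  type       = c("exon", "CDS", "CDS"),
  stringsAsFactors = FALSE
)

# Fail early if any required column is absent.
missing_cols <- setdiff(required_cols, names(gtf_table))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Only rows with type == "CDS" are relevant; strand must be "+" or "-".
cds_rows <- gtf_table[gtf_table$type == "CDS", ]
stopifnot(all(cds_rows$strand %in% c("+", "-")))
```

Here the exon row is dropped by the filter, leaving the two CDS rows for the hypothetical protein P0001.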
By default (use_all_transcripts = TRUE), all complete transcripts are used, resulting in multiple RefCDS entries for some genes; in that case, you may want to first eliminate low-confidence or superfluous transcripts from the input data. If you set use_all_transcripts = FALSE, only the longest complete transcript from each gene in the input is used.
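One possible way to thin out transcripts before building the reference is to keep, for each gene, only the protein_id with the greatest total CDS length. The sketch below does this in base R on a toy table (all gene and protein identifiers are hypothetical). It is an illustration of pre-filtering the input, not the package's internal selection rule, which also takes transcript completeness into account.

```r
# Toy CDS records: GENE_A has two transcripts, GENE_B has one.
cds <- data.frame(
  gene_name  = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
  protein_id = c("P0001", "P0001", "P0002", "P0003"),
  start      = c(100, 400, 100, 900),
  end        = c(249, 519, 219, 1049),
  stringsAsFactors = FALSE
)

# Total CDS length per transcript, summed over its CDS intervals.
cds$width <- cds$end - cds$start + 1
tx_len <- aggregate(width ~ gene_name + protein_id, data = cds, FUN = sum)

# For each gene, keep the protein_id with the greatest total CDS length.
tx_len <- tx_len[order(tx_len$gene_name, -tx_len$width), ]
longest <- tx_len[!duplicated(tx_len$gene_name), ]

cds_filtered <- cds[cds$protein_id %in% longest$protein_id, ]
```

With these toy values, P0001 (270 bases) is kept over P0002 (120 bases) for GENE_A, and P0003 is kept for GENE_B. Other filters, such as dropping transcripts flagged as low-confidence in the GTF, could be applied in the same way.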