Read and verify MAF somatic mutation data — preload_maf

Reads MAF-formatted data from a text file or data table, checks for problems, and provides a few quality-control annotations (if available). If core MAF columns don't have standard names (Chromosome, Start_Position, etc., with Tumor_Sample_Barcode used as the sample ID column), you can supply your own column names. If the data you are loading is from a different genome build than the chosen reference data set (refset), you can use the chain_file option to supply a UCSC-style chain file, and your MAF coordinates will be automatically converted with rtracklayer's version of liftOver.

Usage

preload_maf(
  maf = NULL,
  refset = NULL,
  coverage_intervals_to_check = NULL,
  chain_file = NULL,
  sample_col = "Unique_Patient_Identifier",
  chr_col = "Chromosome",
  start_col = "Start_Position",
  ref_col = "Reference_Allele",
  tumor_allele_col = "guess",
  keep_extra_columns = FALSE,
  detect_hidden_mnv = TRUE
)

Arguments

maf: Path of tab-delimited text file in MAF format, or a data.table/data.frame with MAF data
refset: name of reference data set (refset) to use; run list_ces_refsets() for available refsets. Alternatively, the path to a custom reference data directory.
coverage_intervals_to_check: If available, a BED file or GRanges object representing the expected coverage intervals of the sequencing method used to generate the MAF data. Unless the coverage intervals are incorrect, most records will be covered. Output will show how far away uncovered records are from covered regions, which can inform whether to use the covered_regions_padding option in load_maf(). (For example, some variant callers will identify variants up to 100 bp outside the target regions, and you may want to pad the covered intervals to allow these variants to remain in your data. Alternatively, if all records are already covered, then the calls have probably already been trimmed to the coverage intervals, which means no padding should be added.)
chain_file: a LiftOver chain file (text format, name ends in .chain) to convert MAF records to the genome build used in the CESAnalysis.
sample_col: column name with patient ID; defaults to Unique_Patient_Identifier, or, in its absence, Tumor_Sample_Barcode
chr_col: column name with chromosome data (Chromosome)
start_col: column name with start position (Start_Position)
ref_col: column name with reference allele data (Reference_Allele)
tumor_allele_col: column name with alternate allele data; by default, values from Tumor_Seq_Allele2 and Tumor_Seq_Allele1 columns are used
keep_extra_columns: TRUE/FALSE to load data columns not needed by cancereffectsizeR, or a vector of column names to keep.
detect_hidden_mnv: Find same-sample adjacent SNVs and replace these records with DBS (doublet base substitution) records. Also, find groups of same-sample variants within 2 bp of each other and replace these records with MNV (multi-nucleotide variant) records.
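As a rough illustration of the column-name arguments above, the following sketch calls preload_maf() on a file whose columns do not use standard MAF names. The file path and the column names (sample_id, chrom, pos, ref, alt) are hypothetical placeholders, not names from the package. The call is skipped unless the input file, cancereffectsizeR, and the ces.refset.hg38 refset package are all available.

```r
# Hypothetical input path; replace with the path to your own MAF file.
maf_path = "my_variants.maf"

# Hypothetical non-standard column names, mapped onto preload_maf() arguments.
preload_args = list(
  maf = maf_path,
  refset = "ces.refset.hg38",
  sample_col = "sample_id",
  chr_col = "chrom",
  start_col = "pos",
  ref_col = "ref",
  tumor_allele_col = "alt",
  keep_extra_columns = FALSE
)

# Only attempt the call when the file and the required packages are present.
can_run = file.exists(maf_path) &&
  requireNamespace("cancereffectsizeR", quietly = TRUE) &&
  requireNamespace("ces.refset.hg38", quietly = TRUE)

if (can_run) {
  maf = do.call(cancereffectsizeR::preload_maf, preload_args)
  print(head(maf))
} else {
  message("Skipping preload_maf(): input file or required packages not available.")
}
```

If the file were aligned to a different genome build than the refset, chain_file could be added to the argument list in the same way.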

Value

a data.table of MAF data, with any problematic records flagged and a few quality-control annotations (if available with the chosen refset data).

Details

The ces.refset.hg19 and ces.refset.hg38 refsets provide three annotations that you may consider using for quality filtering of MAF records:

cosmic_site_tier: Indicates if the variant's position overlaps a mutation in COSMIC v92's Cancer Mutation Census. Mutations are classified as Tier 1, Tier 2, Tier 3, and Other. Note that the MAF mutation itself is not necessarily in the census. See COSMIC's website for tier definitions.
germline_variant_site: The variant's position overlaps a site of common germline variation. Roughly, this means that gnomAD 2.1.1 shows an overlapping germline variant at greater than 1% prevalence in some population.
repetitive_region: The variant overlaps a site marked as repetitive sequence by the RepeatMasker tool (data taken from UCSC Table Browser). Variant calls in repetitive sites frequently reflect sequencing or calling error.
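One possible way to use these three annotations for filtering is sketched below on a small mock table, so that it runs without the package or a refset. The mock rows, the variant_id column, and the encoding of the annotations (logical germline_variant_site and repetitive_region, and cosmic_site_tier stored as 1, 2, 3, or NA) are assumptions made for illustration; check the actual column types in your preload_maf() output before adapting this. The filtering rule itself is also just one reasonable choice, not a recommendation from the package.

```r
library(data.table)

# Mock stand-in for preload_maf() output (illustrative values and assumed encodings).
maf = data.table(
  variant_id = c("v1", "v2", "v3", "v4", "v5"),
  germline_variant_site = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  repetitive_region = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  cosmic_site_tier = c(NA, NA, 2L, 1L, NA)
)

# Drop records at common germline sites. Also drop records in repetitive
# regions, unless the site overlaps a COSMIC Tier 1-3 mutation.
filtered = maf[germline_variant_site == FALSE &
                 (repetitive_region == FALSE | cosmic_site_tier %in% 1:3)]

print(filtered$variant_id)
```

Using %in% for the tier check means records with a missing tier are treated as not being in Tiers 1 to 3, rather than propagating NA into the filter.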