Skip to contents

This tutorial applies some of the package’s key features to analyze publicly-available somatic variant data from tumor sequencing. We recommend that you start a fresh R/RStudio session. You can follow along by copy-and-pasting code into an R script.

Installation

Start by installing the latest release:

# Some dependencies are large, so we increase the download time limit to be safe
options(timeout = 600)
install.packages("remotes")
remotes::install_github("Townsend-Lab-Yale/cancereffectsizeR", dependencies = TRUE)

Regrettably, there is a bug in some versions of the GenomeInfoDb package that produces cryptic error messages in cancereffectsizeR, along the lines of !anyNA(m32) is not TRUE. If you encounter this issue, the simplest solution is to upgrade your Bioconductor version:

# Only necessary if GenomeInfoDb errors are causing problems.
BiocManager::install(version = "3.16") # or newer, when available

In addition to installing cancereffectsizeR, you need to install a reference data set, or refset. These refsets include genomic and gene annotations, mutational signature definitions, and more. Currently, refsets supporting the hg38 and hg19 builds of the human genome are available as separate data packages. (If you want to build your own refset to use a custom genome build or gene set for almost any species, you can.) For this tutorial, we’ll use the hg38 refset package.

options(timeout = 600)
remotes::install_github("Townsend-Lab-Yale/ces.refset.hg38@*release")

Restart R/RStudio after installation. Before continuing, you may want to create a directory for this tutorial to store all the data and output in one place.

# setwd() somewhere, if desired
dir.create("CES_tutorial")
setwd("CES_tutorial")

Quickstart

Theoretical overview

Very briefly, we extract mutational signatures from each sample’s SNV mutation profile using MutationalPatterns. The relative weights of biologically associated signatures are used to infer trinucleotide-context-specific relative rates of SNV mutations for each sample. Cohort-wide neutral gene mutation rates are calculated with dNdScv, with tissue-specific covariates provided by ces.refset.hg38. Combining this information, the rate of neutral mutation at a particular variant site is calculated by partitioning the gene mutation rate across all sites in the gene in accordance with the signature-informed relative rates. Comparing rates of observed and expected mutation under a model of somatic selection allows inference of scaled selection coefficients, which we also call cancer effects.

Example: Lung adenocarcinoma data from TCGA

Here is a straightforward cancereffectsizeR workflow: We load TCGA LUAD data, calculate mutation rates, and quantify selection. We also attribute selection to mutational signatures and find that mutations attributed to tobacco-associated signatures have disproportionately large selective effects.

(Show/hide quickstart)
library(cancereffectsizeR)
library(data.table)

# Download TCGA lung adenocarcinoma (LUAD) somatic variant data.
tcga_maf_file <- "TCGA-LUAD.maf.gz"
if (!file.exists(tcga_maf_file)) {
  get_TCGA_project_MAF(project = "LUAD", filename = "TCGA-LUAD.maf.gz")
}

# Prepare data
maf <- preload_maf(maf = tcga_maf_file, refset = "ces.refset.hg38")

# Create cancereffectsizeR analysis and load data
cesa <- CESAnalysis(refset = "ces.refset.hg38")
cesa <- load_maf(cesa = cesa, maf = maf)

# Infer trinculeotide-context-specific relative rates of SNV mutation from
# a mutational signature analysis (leaving out signatures not found in LUAD)
signature_exclusions <- suggest_cosmic_signature_exclusions(cancer_type = "LUAD", treatment_naive = TRUE)
cesa <- trinuc_mutation_rates(
  cesa = cesa, signature_set = ces.refset.hg38$signatures$COSMIC_v3.4,
  signature_exclusions = signature_exclusions
)

# Estimate neutral gene mutation rates using dNdScv, with tissue-specific mutation rate covariates.
cesa <- gene_mutation_rates(cesa, covariates = ces.refset.hg38$covariates$lung)

# Infer scaled selection coefficients under the default model of clonal selection.
# By default, inference is restricted to recurrent mutations.
cesa <- ces_variant(cesa, run_name = "example")

# Visualize top-effect variants.
plot_effects(effects = cesa$selection$example, color_by = "#DB382D", topn = 20)

# Attribute effects to mutational signatures
mut_effects <- mutational_signature_effects(cesa, cesa$selection$example)

# Plot a comparison of how signatures contribute to mutation vs. selection
plot_signature_effects(mut_effects, viridis_option = "F", num_sig_groups = 5)

# See the full tutorial for more details and a broader view of functionality!

Preparing data

Load cancereffectsizeR, as well as the data.table package. cancereffectsizeR makes extensive use of data tables, so it’s handy to have the package loaded.

For this tutorial, we’ll use somatic variant data produced from exome sequencing by the TCGA BRCA (breast carcinoma) project. We will supplement the WXS data with targeted sequencing data from the Metastatic Breast Cancer data set hosted at cBioPortal.

In cancereffectsizeR, data can be combined from multiple whole-exome, whole-genome, and targeted sequencing sources, although we always need at least one WXS or WGS source to anchor the analysis. Targeted sequencing data can’t be used for mutation rate inference: there are too few mutations, typically, and since they’re mostly in cancer hotspots, they don’t provide a baseline of what mutation rates are like in the absence of selection. Instead, mutation rates in TGS samples will be assumed to be similar to those calculated in WXS/WGS samples.

TCGA data (whole-exome)

We’ll begin by downloading variant data from TCGA. The function downloads patient MAF data from the latest TCGA data release and assembles a project-level MAF.

tcga_maf_file <- "TCGA-BRCA.maf.gz"
if (!file.exists(tcga_maf_file)) {
  get_TCGA_project_MAF(project = "BRCA", filename = tcga_maf_file)
}

Let’s also load and examine a table of patient information. Although the TCGA samples in the data set are all taken from primary tumors, you’ll see that some patients presented with metastatic disease (pM = M1). Hormone receptor status (progesterone or estrogen receptor positive) and HER2 amplification status are also recorded in a combined column.

tcga_clinical <- fread(system.file("tutorial/TCGA_BRCA_clinical.txt", package = "cancereffectsizeR"))

# Change patient identifier column name in clinical table to match the MAF.
setnames(tcga_clinical, "patient_id", "Unique_Patient_Identifier")

# Peek at data
tcga_clinical[1:5]
##    Unique_Patient_Identifier     pM receptor_status
##                       <char> <char>          <char>
## 1:              TCGA-3C-AAAU   <NA>       HR+/HER2-
## 2:              TCGA-3C-AALI     M0       HR+/HER2+
## 3:              TCGA-3C-AALJ     M0            <NA>
## 4:              TCGA-3C-AALK     M0       HR+/HER2+
## 5:              TCGA-4H-AAAK     M0            <NA>

The preload_maf() function takes in MAF data, extracts the columns needed by cancereffectsizeR, adds a couple of genomic annotations, and checks for common problems. (For your own analyses, see our MAF data tips.) This MAF file already uses the hg38 genome build, but if it didn’t, we could use the chain_file argument to convert records via liftOver.

tcga_maf <- preload_maf(maf = tcga_maf_file, refset = "ces.refset.hg38")

When MAFs created by are fed into , TCGA sample replicates are effectively merged, since the Unique_Patient_Identifier column supersedes Tumor_Sample_Barcode (the original sample identifiers). You will see a note that preload_maf() caught and handled the ensuing duplicate mutation records.

Metastatic data (TGS)

Our TGS data source (hosted on cBioPortal and recently published in Cancer Discovery), consists of metastatic tumors that were sequenced using various MSK-Impact panels. In the full data set, some patients have multiple samples sequenced; the data provided with the package has been subsetted to one sample per patient, and the genome build has been converted to hg38.

tgs_maf_file <- system.file("tutorial/metastatic_breast_2021_hg38.maf", package = "cancereffectsizeR")
tgs_maf <- preload_maf(maf = tgs_maf_file, refset = "ces.refset.hg38")

When combining data from multiple sources, check_sample_overlap() can detect unexpected sample duplication. If you feed in tcga_maf and tgs_maf and filter the output to variants_shared > 2 (with panel data, a couple of shared variants doesn’t imply sample duplication), you’ll see that everything looks fine.

Create CESAnalysis and load data

The CESAnalysis is the primary data structure of cancereffectsizeR. The cancereffectsizeR workflow consists of calling a series of functions that take a CESAnalysis as input and return an altered CESAnalysis.

Load whole-exome sequencing data

Let’s create a CESAnalysis and call load_maf() to load the TCGA data. Since we’re going to load more than one MAF, we’ll provide an optional maf_name.

cesa <- CESAnalysis(refset = "ces.refset.hg38")
cesa <- load_maf(cesa = cesa, maf = tcga_maf, maf_name = "BRCA")

You will see a message that some variants fall outside of the refset’s exome definitions. Since we don’t know exactly what exome capture techniques were used for the BRCA project (probably different methods at different study sites), this isn’t unexpected. When you do know the exact exome capture intervals of whatever data you’re using, you should supply those intervals with load_maf()’s covered_regions argument. For WGS data, set coverage = "genome".

To navigate a CESAnalysis, use the dollar sign ($). We can use these accessors to look at the MAF data as well as more detailed variant annotations, and if we want we can perform various filtering operations, such as identifying the most prevalent variants in the loaded data.

cesa$maf
cesa$variants
cesa$samples

# Let's see the top variants
(top_variants <- cesa$variants[order(-maf_prevalence)][1:10, .(variant_name, chr, start, end, maf_prevalence)])

Let’s load the clinical data into the analysis, too.

cesa <- load_sample_data(cesa, tcga_clinical)

Load targeted gene sequencing data

To load our TGS data, we need to be able to define its coverage. Why? To estimate the selection of a mutation, we need to know which samples have the mutation, which samples do not, and which are unknown due to lack of sequencing coverage. Our TGS samples were sequenced with multiple panels covering different genes, and unfortunately, the exact coverage (defined by genomic coordinates) is not publicly available for all of these. Therefore, for simplicity, we will filter the TGS data to mutations at a handful of top cancer genes that are covered in all of the panels. When we load the data, all records outside of these genes will be excluded, leaving us with greater power to infer selection in these genes without impacting our estimates outside these genes, which will use just the TCGA data. (As mentioned earlier, we shouldn’t filter WXS/WGS data this way, since it would interfere with mutation rate calculation, but TGS samples are not involved in mutation rate calculation.)

# Define coverage using the coding regions of these genes, as defined by the refset
top_tgs_genes <- c(
  "TP53", "PIK3CA", "ESR1", "CDH1", "GATA3", "KMT2C",
  "MAP3K1", "AKT1", "ARID1A", "FOXA1", "TBX3", "PTEN"
)
tgs_coverage <- ces.refset.hg38$gr_genes[ces.refset.hg38$gr_genes$gene %in% top_tgs_genes]

We don’t have a clinical file for the TGS data, but we do know that they’re all metastatic samples. Instead of creating another table and calling load_sample_data(), we can use the sample_data_cols argument in load_maf(). We will also add a little bit of padding (10 bp) to our coverage definitions to allow variants just outside coding regions to be considered covered.

tgs_maf$pM <- "M1"
cesa <- load_maf(cesa,
  maf = tgs_maf, sample_data_cols = "pM", maf_name = "MBC",
  coverage = "targeted", covered_regions = tgs_coverage,
  covered_regions_name = "top_genes", covered_regions_padding = 10
)

Mutational processes and relative mutation rates

For each (exome) sample in our data set, trinuc_mutation_rates() will perform mutational signature extraction in order to to attribute each sample’s set of SNVs to a linear combination of mutational processes. By default, the signature extraction is done using the MutationalPatterns package, and deconstructSigs is also supported. The signature attribution allows us to infer sample-specific relative rates of SNV mutation for all trinucleotide contexts. TGS samples will be assumed to have mutational processes matching the group-average mutational processes of the exome data. (This isn’t optimal, since the metastatic samples may be affected by mutational processes, such as chemotherapy, that are lacking in the primary tumor exome data, but it might be the best we can do without finding another data source.)

We will use signature definitions from the refset (it’s also possible to create your own). To improve the accuracy of signature extraction, we will exclude signatures that can safely be presumed absent from the samples; a helper function, suggest_cosmic_signature_exclusions(), can provide some guidance.

# We'll use all suggested exclusions (TCGA primary tumors are treatment-naive)
signature_exclusions <- suggest_cosmic_signature_exclusions(cancer_type = "BRCA", treatment_naive = TRUE)

cesa <- trinuc_mutation_rates(cesa,
  signature_set = ces.refset.hg38$signatures$COSMIC_v3.4,
  signature_exclusions = signature_exclusions
)

The trinuc_mutation_rates() run has added some useful information to the CESAnalysis:

  • snv_counts: A matrix of SNV counts by trinucleotide context. (You can also generate such a matrix from MAF data with trinuc_snv_counts().)
  • raw_attributions: Signature attributions as generated by the extractor. For MutationalPatterns, this matrix contains the number of mutations attributed to signature for each sample. Naturally, samples with more mutations will tend to have higher raw attributions.
  • biological_weights: The proportion of mutations attributed to each biologically-associated signature within each sample. Mutations attributed to signatures associated with sequencing/processing artifacts are left out. (Technical detail: Due to the instability of signature attributions on samples with few mutations, Samples with few MAF variants have their weights adjusted towards group-average weights, as indicated in the group_avg_blended column. If you want to make claims about subgroup differences in mutational processes, consider leaving these samples out or using the raw attributions. You should also leave out the TGS samples, which in this analysis are treated as having 0 mutations.)
  • trinuc_rates: Inferred relative rates of mutation, produced by matrix-multiplying biological_weights and signature definitions. (These rates will not equal empirical rates of observed mutations.)

Gene mutation rates

Next, we’ll use gene_mutation_rates() to estimate regional rates of mutation in the absence of selection. We’ll use the method provided in the dNdScv package, which uses dN/dS ratios and mutation rate covariates. Our refset has pre-computed covariates for a variety of tissue types. As with mutational processes, this analysis will not use the TGS samples, but the calculated rates will be assumed to hold for them.

cesa <- gene_mutation_rates(cesa, covariates = ces.refset.hg38$covariates$breast)

Here are the neutral gene mutation rates extracted from dNdScv’s regression:

head(cesa$gene_rates)
##       gene         rate rate_grp_1_95_low rate_grp_1_95_high
##     <char>        <num>             <num>              <num>
## 1:    A1BG 6.127126e-07      5.613872e-07       6.677908e-07
## 2:    A1CF 7.084595e-07      6.658288e-07       7.532645e-07
## 3:     A2M 7.896587e-07      7.532342e-07       8.269888e-07
## 4:   A2ML1 6.255639e-07      5.869469e-07       6.656543e-07
## 5: A3GALT2 5.508752e-07      5.100439e-07       5.946168e-07
## 6:  A4GALT 7.261875e-07      6.757761e-07       7.797770e-07

We can also look at dNdScv’s identification of selection at the gene level. We’ll filter results to q < .05.

# View top dNdScv genes, sorted by significance
dndscv_results <- cesa$dNdScv_results[[1]]
sig_genes <- dndscv_results[qallsubs_cv < .05][order(qallsubs_cv)][1:10]
gene_name n_syn n_mis n_non n_spl wmis_cv wnon_cv wspl_cv pmis_cv ptrunc_cv pallsubs_cv qmis_cv qtrunc_cv qallsubs_cv
TP53 0 198 46 22 188.063121 580.81334 580.81334 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
PIK3CA 5 336 0 0 49.946758 0.00000 0.00000 0.000000e+00 1.741979e-01 0.000000e+00 0.000000e+00 8.485461e-01 0.000000e+00
PTEN 0 15 11 3 9.088199 91.77341 91.77341 5.238135e-06 0.000000e+00 0.000000e+00 1.299282e-02 0.000000e+00 0.000000e+00
CDH1 3 16 30 13 4.475685 141.11002 141.11002 4.219075e-04 0.000000e+00 0.000000e+00 5.232557e-01 0.000000e+00 0.000000e+00
KMT2C 5 25 27 2 1.532738 14.61201 14.61201 2.523294e-01 3.663736e-15 2.220446e-16 7.803695e-01 1.188736e-11 7.710721e-13
MAP2K4 0 13 9 3 10.810199 76.15288 76.15288 2.433399e-06 4.107825e-15 8.881784e-16 7.041852e-03 1.188736e-11 2.570240e-12
MAP3K1 5 22 17 2 2.988223 24.31043 24.31043 3.202730e-03 2.331468e-15 2.253753e-14 7.803695e-01 1.012032e-11 5.590273e-11
AKT1 1 26 0 0 17.900089 0.00000 0.00000 2.093881e-13 5.882058e-01 9.905410e-13 1.211868e-09 8.485461e-01 2.149845e-09
ARID1A 2 12 13 2 2.100089 28.62643 28.62643 1.212432e-01 5.273559e-13 2.593925e-12 7.803695e-01 1.308069e-09 5.004258e-09
NCOR1 2 20 13 5 2.637482 18.67881 18.67881 2.289725e-02 8.482326e-12 5.398637e-11 7.803695e-01 1.840983e-08 9.373654e-08

See dNdScv’s documentation if you’re interested in interpretation of the dNdScv output.

Side note: If we had multiple tissue types in the analysis, we could call gene_mutation_rates() multiple times using the samples argument to specify sample groups, with appropriate covariates for each tissue type. Similar functionality is available in trinuc_mutation_rates().

Selection inference

The combination of cohort-level gene mutation rates and sample-level relative rates of trinucleotide-context-specific substitution allow us to estimate the rate at which any somatic substitution occurs in any patient’s tumor sample. To be clear, by “rates,” we don’t mean the frequency of mutations in our somatic variant calls (which we already know anyway). These rates represent how often the mutations can be expected to occur in individual cells. (More specifically, we assume that mutation events in each specific sample and site follow a Poisson distribution, and these rates are the Poisson rates.) It’s not typically necessary to look at these rates, but just to show that we can:

# Let's take the top 3 variants by MAF prevalence. These happen to all be in PIK3CA.
variants_to_check <- cesa$variants[order(-maf_prevalence), variant_id][1:3]

# A few random samples
samples_to_check <- c("TCGA-A2-A3Y0", "TCGA-XX-A89A", "P-0000224")

baseline_mutation_rates(cesa = cesa, variant_ids = variants_to_check, samples = samples_to_check)
Unique_Patient_Identifier PIK3CA_E542K_ENSP00000263967.3 PIK3CA_E545K_ENSP00000263967.3 PIK3CA_H1047R_ENSP00000263967.3
P-0000224 9.15e-06 9.15e-06 1.11e-06
TCGA-A2-A3Y0 1.84e-06 1.84e-06 8.73e-07
TCGA-XX-A89A 1.32e-05 1.32e-05 2.23e-08

Interestingly, PIK3CA H1047R, the most prevalent variant in our MAF data (246/1875 samples), is estimated to have a substantially lower rate of occurrence than E545K and E542K in these particular samples.

Default model

We can use our estimated rates and the MAF variant data to infer scaled selection coefficients (aka cancer effects, or selection intensity) under a model of selection. These cancer effects are directly proportional to the proliferative advantage provided by each variant. First, let’s calculate selection under the package’s default model of selection with ces_variant():

# Including an optional run_name
cesa <- ces_variant(cesa = cesa, run_name = "recurrents")

Let’s visualize the most selected variants with plot_effects().

plot_effects(effects = cesa$selection$recurrents)

PIK3CA and ESR1 variants take many of the top spots. The deconvolution of mutation rate from selection has revealed that some relatively low-prevalence variants have high effects. Interestingly, according to dNdScv, mutation in ESR1 at the gene level was not significantly greater than neutral in the TCGA cohort of primary tumors. The incorporation of metastatic TGS data and cancereffectsizeR’s assessment of selection at variant-level resolution confirm ESR1’s importance.

Here’s a look at the effects of all recurrent variants across some of the highest-effect genes.

plot_effects(cesa$selection$recurrents,
  group_by = "gene", topn = 10,
  label_individual_variants = FALSE
)

More options

By default, all recurrent variants (variants that appear at least twice in MAF data) are included in a ces_variant() run. As demonstrated, the variants argument can be used to specify which variants to test. Variants with prevalences of 1 or even 0 (perhaps helpful to establish an effect ceiling) can be included: variants = cesa$variants will include all annotated mutations in the analysis. Note that most single-hit variants will have over-estimated effects because most are probably neutral passengers, and the occurrence of any non-selected variant is inherently improbable.

For each variant site, only samples with coverage at the site will inform selection inference. As also shown above, the samples argument further limits the inference process to a subset of eligible samples. This could be helpful to assess and compare selection in various groups (e.g., smoking vs. non-smoking).

Besides the default model, ces_variant() supports user-supplied models of selection; see documentation for details. We hope that some enterprising users will come up with new models, and we are happy to discuss ideas and help work out any kinks.

The CompoundVariantSet feature, described in the next section, may sometimes provide improved resolution of somatic selection by batching related variants together.

Epistatic cancer effects

In the default model, a variant is assumed to have a single cancer effect across all samples. In reality, we expect a variant’s selection to be influenced by a complex combination of factors including a sample’s mutational background (from substitutions to structural variants), the epigenetic state of the tumor, and various environmental factors. The ces_epistasis() function allows us to assess selection for any pair of variants under a model of pairwise epistasis, in which the selection intensity for each variant is allowed to very depending on the state of the other site.

Variant-level epistasis

Let’s try two prevalent BRCA mutations, PIK3CA E545K (n=159) and AKT1 E17K (n=82). Let’s also test E545K with a nearby PIK3CA variant, E542K (n=93).

# Start by pulling full variant IDs (with protein identifier) from variants table
group1 <- cesa$variants[c("PIK3CA E545K", "AKT1 E17K"), variant_id, on = "variant_name"]
group2 <- cesa$variants[c("PIK3CA E545K", "PIK3CA E542K"), variant_id, on = "variant_name"]

cesa <- ces_epistasis(
  cesa = cesa, variants = list(group1, group2),
  conf = .95, run_name = "variant_epistasis_example"
)
cesa$epistasis$variant_epistasis_example
variant_A variant_B ces_A0 ces_B0 ces_A_on_B ces_B_on_A p_A_change p_B_change p_epistasis expected_nAB_epistasis expected_nAB_null AB_epistatic_ratio nA0 nB0 nAB n00 n_total ci_low_95_ces_A0 ci_high_95_ces_A0 ci_low_95_ces_B0 ci_high_95_ces_B0 ci_low_95_ces_A_on_B ci_high_95_ces_A_on_B ci_low_95_ces_B_on_A ci_high_95_ces_B_on_A
PIK3CA_E545K_ENSP00000263967.3 AKT1_E17K_ENSP00000497822.1 10855 26453 0.001 6541.475 0.17701 0.20577 6.76e-03 1.02e+00 7.54 1.35e-01 159 80 1 1614 1854 9257 12626 21071 32683 NA 8735 373 28908
PIK3CA_E545K_ENSP00000263967.3 PIK3CA_E542K_ENSP00000263967.3 10913 6343 0.001 0.001 0.00732 0.00671 4.94e-05 1.31e-06 9.34 1.40e-07 160 93 0 1601 1854 9307 12694 5139 7722 NA 4171 NA 2360

Confidence intervals tend to be wide in epistatic analyses. Even with the most prevalent somatic variants, both co-occurrence and mutual exclusivity can often be explained by chance. The output for PIK3CA E545K and E542K confirms the negative epistatic relationship that we would have expected given their mutual exclusivity in our data (and our knowledge of their biological effects). For E545K and AKT1 E17K, we see reduced selection for both after the acquisition of the other driver, but the confidence intervals leave open the possibility that selection for AKT1 E17K is unaffected by PIK3CA status. (Additionally, the NA’s on some of the lower bounds indicate that they fall below the lower limit of the optimization algorithm.)

Epistasis and CompoundVariantSets

We can probe this PIK3CA/AKT1 relationship further if we assume that all prevalent PIK3CA mutations share the same epistatic relationship with AKT1 E17K. The define_compound_variants() feature lets us combine arbitrary variants into “compound variants” that are treated as if they were single variants by cancereffectsizeR’s selection inference functions. (In brief, the mutation rate of the compound variant is equal to the sum of rates of constituent variants, and any sample with one or more of the constituent variants “has the compound variant.”) Below, we define a CompoundVariantSet with two compound variants: All PIK3CA variants with MAF prevalence > 1, and AKT E17K. We then pass the CompoundVariantSet to ces_epistasis(), which will test all pairs of compound variants (here, just the one pair).

# Collect all the variants that we want in the CompoundVariantSet into a table
top_PIK3CA <- cesa$variants[gene == "PIK3CA" & maf_prevalence > 1]
top_akt1 <- cesa$variants[variant_name == "AKT1 E17K"]
for_compound <- rbind(top_PIK3CA, top_akt1)

# see define_compound_variants() documentation for details on arguments
comp <- define_compound_variants(cesa = cesa, variant_table = for_compound, by = "gene", merge_distance = Inf)
cesa <- ces_epistasis(cesa = cesa, variants = comp, run_name = "AKT1_E17K_vs_PIK3CA")
cesa$epistasis$AKT1_E17K_vs_PIK3CA
variant_A variant_B ces_A0 ces_B0 ces_A_on_B ces_B_on_A p_A_change p_B_change p_epistasis expected_nAB_epistasis expected_nAB_null AB_epistatic_ratio nA0 nB0 nAB n00 n_total ci_low_95_ces_A0 ci_high_95_ces_A0 ci_low_95_ces_B0 ci_high_95_ces_B0 ci_low_95_ces_A_on_B ci_high_95_ces_A_on_B ci_low_95_ces_B_on_A ci_high_95_ces_B_on_A
PIK3CA AKT1 3756 30457 6.88 6215 0.000182 0.000392 4.87e-11 3.98 29.1 0.137 642 77 4 1131 1854 3471 4056 24153 37800 NA 1014 1874 14410

Note that the effect sizes are smaller for PIK3CA here, since we mixed in lots of lower-effect PIK3CA variants. While four samples carry both AKT1 E17K and a PIK3CA variant, this result strengthens the case for AKT1 mutation reducing selection for PIK3CA mutation.

Gene-level epistasis

The convenience function ces_gene_epistasis() provides a simpler way to apply the model of ces_epistasis() at the gene level. The variants argument provides three options for which variants from each gene to include in the inference: “recurrent” uses all recurrent variants; “nonsilent” uses nonsynonymous coding variants and any variants in essential splice sites; or, alternatively, supply a custom table of variants. In order to take advantage of the targeted sequencing data (which doesn’t quite cover all mutated sites in these genes), we will supply a custom table of variants that are covered across all samples.

genes <- c("AKT1", "PIK3CA", "TP53")

# Get consensus covered regions
combined_coverage <- intersect(cesa$coverage_ranges$exome$`exome+`, cesa$coverage_ranges$targeted$top_genes)

# Get variants in the genes of interest that have sequencing coverage in all samples
variants <- select_variants(cesa, genes = genes, gr = combined_coverage)

cesa <- ces_gene_epistasis(cesa = cesa, genes = genes, variants = variants, run_name = "gene_epistasis_example")
cesa$epistasis$gene_epistasis_example
variant_A variant_B ces_A0 ces_B0 ces_A_on_B ces_B_on_A p_A_change p_B_change p_epistasis expected_nAB_epistasis expected_nAB_null AB_epistatic_ratio nA0 nB0 nAB n00 n_total ci_low_95_ces_A0 ci_high_95_ces_A0 ci_low_95_ces_B0 ci_high_95_ces_B0 ci_low_95_ces_A_on_B ci_high_95_ces_A_on_B ci_low_95_ces_B_on_A ci_high_95_ces_B_on_A
AKT1 PIK3CA 3370 1739 0.001 547.624 0.00215 0.0014 2.00e-10 7.04 33.5 0.210 83 667 7 1097 1854 2719 4116 1609 1876 NA 886 234 1071
AKT1 TP53 2521 1616 3576.736 0.001 0.94556 0.8365 2.32e-01 15.36 21.6 0.712 75 426 15 1338 1854 1993 3136 1470 1773 2053 5723 NA 1192
PIK3CA TP53 1591 1983 2343.782 0.001 0.23271 0.0154 8.42e-06 126.90 161.9 0.784 551 318 123 862 1854 1460 1729 1802 2176 1936 2813 NA 319

Mutational selection in the two oncogenes appears to be maintained after mutation in TP53. In contrast, loss of TP53 is less selected after mutations in AKT1/PIK3CA. One explanation is that cell populations in which these powerful drivers have fixed must have already subverted tumor suppression mechanisms, making TP53 mutations superfluous. The two oncogenes show significant mutual exclusivity, in that selection for each is reduced after mutation in the other.

Here’s a plot showing the change in selection for mutated sites in each gene after mutation in the other gene. Asterisks indicate significant changes in selection for individual genes. Not infrequently, the epistatic model is significantly better than a no-epistasis model, but there is not significance for altered selection in either gene individually, since a strong enough change in selection in either one of the genes could sufficiently explain the data.

(Show/hide epistasis plot code)
require(grid) # Need the grid package for this plot
results <- cesa$epistasis$gene_epistasis_example
results <- results[, .(
  v1 = variant_A, v2 = variant_B, ces_A0, ces_B0, ces_A_on_B,
  ces_B_on_A, p_A_change, p_B_change, p_epistasis
)]

# By change, we mean fold-change of selection on mutant background over wildtype background
results[, change_in_v2 := ces_B_on_A / ces_B0]
results[, change_in_v1 := ces_A_on_B / ces_A0]

# Put in desired order for display
results <- results[, pairname := paste(v1, v2, sep = ".")]

results[, x := 1:.N]
results[, v1_x := x - .2]
results[, v2_x := x + .2]
results[, alpha := .6]
results[p_epistasis < .05, alpha := 1]

results[, v1_signif := ""]
results[p_A_change < .05, v1_signif := "*"]
results[p_A_change < .01, v1_signif := "**"]
results[p_A_change < .001, v1_signif := "***"]
results[, v1_signif_y := change_in_v1 + (.13 * sign(change_in_v1 - 1))]

results[, v2_signif := ""]
results[p_B_change < .05, v2_signif := "*"]
results[p_B_change < .01, v2_signif := "**"]
results[p_B_change < .001, v2_signif := "***"]
results[, v2_signif_y := change_in_v2 + (.13 * sign(change_in_v2 - 1))]

x_labels <- unlist(S4Vectors::zipup(results$v1, results$v2))
x_label_pos <- unlist(S4Vectors::zipup(results$v1_x, results$v2_x))

# Have to get fancy to depict significance nicely in legend.
draw_signif_key <- function(data, params, size) {
  grobTree(
    rectGrob(
      x = .25, y = .5, width = .5, height = 1,
      gp = gpar(col = NA, fill = alpha("plum4", data$alpha), lty = data$linetype)
    ),
    rectGrob(
      x = .75, y = .5, width = .5, height = 1,
      gp = gpar(col = NA, fill = alpha("sandybrown", data$alpha), lty = data$linetype)
    )
  )
}

ggplot(data = results) +
  # Put in a reference line depicting no change in selection
  geom_hline(yintercept = 1, color = "darkgrey") +
  geom_rect(aes(xmin = v1_x - .2, xmax = v1_x + .2, ymin = 1, ymax = change_in_v1, fill = "v1", alpha = alpha),
    show.legend = c(alpha = FALSE, fill = TRUE)
  ) +
  geom_rect(aes(xmin = 1, xmax = 1, ymin = 0, ymax = 0, alpha = alpha),
    show.legend = c(alpha = TRUE, fill = FALSE), key_glyph = draw_signif_key
  ) +
  geom_text(aes(x = v1_x, y = v1_signif_y, label = v1_signif), size = 7) +
  scale_alpha_identity(
    breaks = c(1, .6), labels = c("Significant", "Not significant"),
    guide = guide_legend(
      title = "Pairwise epistasis", override.aes = list(fill = "sandybrown", alpha = c(1, .6)),
      order = 1
    )
  ) +
  geom_rect(aes(xmin = v2_x - .2, xmax = v2_x + .2, ymin = 1, ymax = change_in_v2, fill = "v2", alpha = alpha),
    show.legend = c(alpha = FALSE, fill = TRUE)
  ) +
  geom_text(aes(x = v2_x, y = v2_signif_y, label = v2_signif), size = 7) +

  # Build legend
  scale_fill_manual(
    name = "Ratio of selection",
    breaks = c("v1", "v2"),
    labels = list(
      expression(frac("gene 1 on mutated gene 2", "gene 1 on wildtype gene 2")),
      expression(frac("gene 2 on mutated gene 1", "gene 2 on wildtype gene 1"))
    ),
    values = c("v1" = "plum4", "v2" = "sandybrown"),
    guide = guide_legend(label.theme = element_text(size = 6.5))
  ) +
  scale_x_continuous(breaks = x_label_pos, labels = x_labels) +
  scale_y_continuous(breaks = seq(from = 0, to = 3, by = .25)) +
  xlab("Gene pair") +
  ylab("Ratio of selection coefficients") +
  theme_classic() +
  theme(
    legend.position = "bottom", legend.title = element_text(size = 10),
    axis.ticks.length.x = unit(0, "cm")
  )

Save your work

Whether you’ve made it all the way through this tutorial–or simply want to take a break–you can save a commemorative (and space-efficient) copy of your CESAnalysis using save_cesa(). You can reload the analysis anytime with load_cesa().

save_cesa(cesa = cesa, "cancerffectsizeR_tutorial_analysis.rds")

# In some future R session...
library(cancereffectsizeR)
cesa <- load_cesa("cancerffectsizeR_tutorial_analysis.rds")