Running with custom reference data
Source:vignettes/custom_refset_instructions.Rmd
custom_refset_instructions.Rmd
A refset is a collection of all the reference data required to run a cancereffectsizeR analysis. Currently, refsets for the hg38 and hg19 human genome builds are available in the form as ces.refset.hg38 and ces.refset.hg19 data packages. It’s possible to generate your own reference data for pretty much any species or genome build. Here is the general method:
Find a GTF file containing coding sequence (CDS) definitions, filter to high-quality and desired CDS regions, and run
build_RefCDS()
to get a collection of gene/transcript/CDS information.Run
create_refset()
, which will require your RefCDS output and some additional information, to save reference data to an output directory.
The output of create_refset
is usable with
cancereffectsizeR, but there are two additional sources of data that you
will probably want to either add to your refset directory, or at least
have available to supply to cancereffectsizeR functions:
A CES Signature Set, containing SNV signature definitions and metadata, is required for mutational signature extraction. As long as you have a matrix of signature definitions (e.g., see those published by COSMIC), you can create your own set. You can include one or more signature set in your refset, or you can pass a signature set as an argument to
trinuc_mutation_rates()
. (In either case, you can also copy a signature set from an existing refset if the genome build is compatible.) See the details intrinuc_mutation_rates()
for how to make a signature set, then validate withvalidate_signature_set()
. Finally, to save with your refset, create a subdirectory called “signatures” in your refset directory, and then usesaveRDS
to save your signature set with the name “[set_name]_signatures.rds”.Tissue covariates data inform the calculation of gene mutation rates when cancereffectsizeR calls dNdScv. Covariates data can be saved in a refset subdirectory called “covariates”, or you can supply them as an argument to
gene_mutation_rates()
. (Example: “covariates/lung.rds” can be specified withcovariates = "lung"
.) This guide shows how to generate covariates data by combining and processing tissue-specific experimental data from several sources. As a last resort,gene_mutation_rates
can be run without covariates, but it’s best to use them if available.
For an example refset, see the ces.refset.hg38 data package’s refset directory here. If your refset also uses the hg19 or hg38, you can copy signatures and covariates from the refset packages into your own refset directory if desired.