Divide batches of variants into a CompoundVariantSet — define_compound

A CompoundVariantSet is a collection of "compound variants". A compound variant is an arbitrary group of variants that have sequencing coverage across some set of samples. Any of these samples with one or more of the constituent SNVs "has the compound variant"; samples with coverage at only some of the sites are not considered. The compound variants within a CompoundVariantSet are always disjoint: no individual variant appears in more than one compound variant. After collecting variants of interest into a table using select_variants(), and further subsetting or annotating the table as desired, use this function to produce a CompoundVariantSet that combines variants into distinct compound variants based on your criteria.
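The sample rule above can be illustrated with a minimal base R sketch. It is not the package's code: the two matrices, the sample and site names, and the use of NA for "not considered" are all assumptions made up for this example.

```r
# Hypothetical toy data: three samples, one compound variant spanning two sites.
samples <- c("sampleA", "sampleB", "sampleC")
sites <- c("site1", "site2")

# TRUE where the sample has sequencing coverage at the site (filled column-wise).
covered <- matrix(c(TRUE, TRUE, TRUE,
                    TRUE, TRUE, FALSE),
                  nrow = 3, dimnames = list(samples, sites))

# TRUE where the sample carries a constituent variant at the site.
has_variant <- matrix(c(TRUE, FALSE, TRUE,
                        FALSE, FALSE, FALSE),
                      nrow = 3, dimnames = list(samples, sites))

# Only samples covered at every site are considered at all.
eligible <- apply(covered, 1, all)

# Among eligible samples, one or more constituent variants means the sample
# "has the compound variant". Ineligible samples are set to NA (an assumed
# convention for this sketch).
status <- apply(has_variant, 1, any)
status[!eligible] <- NA
print(status)
```

Here sampleC carries a variant at site1 but lacks coverage at site2, so it is left out rather than counted as having the compound variant.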

Usage

define_compound_variants(cesa, variant_table, by = NULL, merge_distance = 0)

Arguments

cesa: CESAnalysis
variant_table: Data table of variants, in the style generated by select_variants().
by: One or more column names to use for initial splitting of the input table into variant groups. Each distinct group will then be further divided into compound variants based on merge_distance.
merge_distance: Maximum genomic distance between a given variant and the nearest variant in a compound variant for the variant to be merged into that compound variant (as opposed to being assigned to its own compound variant).

Details

This function first splits the input table by the columns given in by. For example, splitting on "gene" will split the table into gene-specific subtables. Then, each subtable is divided into compound variants based on merge_distance. All variants in a subtable that lie within the specified genomic distance of each other are merged into candidate compound variants, and candidates are then merged repeatedly until, for every pair of compound variants, the two nearest variants are more than merge_distance apart. Note that overlapping variants will always be merged unless you use by to separate them into different subtables (for example, by splitting on alt or aa_alt). If you use by to split variants by some functional annotation, you can set merge_distance very high to merge all same-chromosome sites (e.g., 1e9 for the human genome). To merge sites across chromosomes, set merge_distance = Inf.
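The split-then-merge procedure described above can be sketched in base R. This is an illustration only, not the package's implementation: the toy table, its column names (variant_id, gene, chr, start, end), the helper names merge_by_distance() and sketch_compound_groups(), and the definition of distance as the gap between a variant's start and the running end of the current group are all assumptions.

```r
# Hypothetical toy table; real tables from select_variants() have more columns.
variants <- data.frame(
  variant_id = c("v1", "v2", "v3", "v4", "v5"),
  gene  = c("A", "A", "A", "B", "B"),
  chr   = c("1", "1", "1", "1", "2"),
  start = c(100, 105, 500, 107, 107),
  end   = c(100, 105, 500, 107, 107)
)

# Assign a group index to each row of one subtable (hypothetical helper).
# Within a chromosome, a position-sorted variant joins the current group when
# the gap to the group's running end is <= merge_distance.
merge_by_distance <- function(tbl, merge_distance) {
  group <- integer(nrow(tbl))
  if (is.infinite(merge_distance)) {
    group[] <- 1L  # Inf merges everything, even across chromosomes
    return(group)
  }
  next_id <- 0L
  for (idx in split(seq_len(nrow(tbl)), tbl$chr)) {
    idx <- idx[order(tbl$start[idx])]
    reach <- -Inf  # furthest end position seen in the current group
    for (i in idx) {
      if (tbl$start[i] - reach > merge_distance) next_id <- next_id + 1L
      group[i] <- next_id
      reach <- max(reach, tbl$end[i])
    }
  }
  group
}

# Split on the `by` columns first, then merge within each subtable.
sketch_compound_groups <- function(tbl, by = NULL, merge_distance = 0) {
  key <- if (is.null(by)) {
    rep("all", nrow(tbl))
  } else {
    do.call(paste, c(unname(as.list(tbl[by])), sep = "|"))
  }
  label <- character(nrow(tbl))
  for (rows in split(seq_len(nrow(tbl)), key)) {
    sub_groups <- merge_by_distance(tbl[rows, , drop = FALSE], merge_distance)
    label[rows] <- paste(key[rows], sub_groups, sep = ".")
  }
  label
}

no_split  <- sketch_compound_groups(variants, merge_distance = 10)
by_gene   <- sketch_compound_groups(variants, by = "gene", merge_distance = 10)
same_chr  <- sketch_compound_groups(variants, by = "gene", merge_distance = 1e9)
cross_chr <- sketch_compound_groups(variants, by = "gene", merge_distance = Inf)
```

In this toy table, v4 sits two bases from v2, so without by the two end up in the same group; splitting on gene keeps them apart. With merge_distance = 1e9, the gene B variants on different chromosomes still form separate groups, and only merge_distance = Inf combines them.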