Population Prevalence of Deleterious SGCE Variants

Background: Myoclonus-Dystonia (M-D) is a pleiotropic neuropsychiatric disorder of variable penetrance. Pathogenic variants in SGCE, a maternally imprinted gene, are the most frequent known genetic cause of M-D. The population prevalence of SGCE-linked M-D is unknown, the pathogenicity of SGCE variants identified in patients with M-D may be indeterminate, and SGCE variants predicted to be deleterious by in silico analysis may appear in patients undergoing whole-exome or whole-genome sequencing for seemingly unrelated disorders. The Genome Aggregation Database (gnomAD) v2 provides variant data on 125,748 exomes and 15,708 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies. Methods: SGCE variants included in the gnomAD v2 dataset were analyzed with Combined Annotation Dependent Depletion (CADD) and the database for nonsynonymous single nucleotide polymorphisms’ functional predictions (dbNSFP). We determined the frequency of annotated SGCE variants, ranked by scores of deleteriousness, within the gnomAD v2 dataset. Deleteriousness scores were compared to a subset of published disease-associated SGCE pathogenic variants. Results: Within gnomAD v2, there were 56, 408, and 1250 alleles harboring SGCE variants with CADD scores greater than 30, 25, and 20, respectively. We estimate that approximately 1/348 individuals in the United States population harbors an SGCE variant with a CADD score ≥ 25. Discussion: SGCE M-D may be underdiagnosed due to pleiotropy, mild phenotypes, variable penetrance, and impaired access to genetic testing. Due to the high population prevalence of deleterious SGCE variants, caution should be used when asserting pathogenicity without co-segregation analyses and expert neurological examination of phenotypes within pedigrees.
Highlights
In silico analyses of a large population database of genetic variants revealed that over 0.2% of individuals in the United States harbor a highly deleterious SGCE variant. This finding suggests that M-D and minor phenotypic variants such as mild isolated myoclonus may be underdiagnosed.

the notion that the incidence of M-D is less than that of isolated cervical dystonia, which has been reported as 0.80/100,000 person-years in Northern California [12]. In predominantly European populations, prevalence estimates of cervical dystonia range from 28 to 183 cases/million [13]. Although commonly associated with childhood-onset generalized dystonia, DYT1, due to the classic ΔGAG deletion in TOR1A, may manifest as an M-D-like syndrome [14]. Based on analysis of 135,000 exomes, carrier prevalence of the ΔGAG deletion in TOR1A has been estimated at 176 to 261/million in the United States [15]. These epidemiological data from the most common forms of adult-onset focal dystonia and childhood-onset generalized dystonia provide reference points for interpretation of the SGCE M-D data reported herein.

Methods
All SGCE variants (short and structural) reported in the Genome Aggregation Database (gnomAD) v2 were included in my analysis. The v2 short variant dataset includes 125,748 exomes and 15,708 genomes from unrelated subjects sequenced as part of various disease-specific and population genetic studies, totaling 141,456 subjects, and is aligned against the GRCh37/hg19 reference genome. The v2 release (gnomAD v2) comprises a total of 16 million single nucleotide variants (SNVs) and 1.2 million indels from 125,748 exomes. In gnomAD, all genomic rearrangements involving at least 50 bp of DNA are defined as structural variants. Subjects known to be affected by severe pediatric disease, as well as their first-degree relatives, are not included in this dataset. Some adult subjects with severe disease may still be included in the datasets, but likely at a frequency equivalent to or lower than that seen in the general population. In addition, gnomAD v2 parcels out a filtered non-neuro subset comprising 104,068 exomes and 10,636 genomes.
I did not include the gnomAD v3 dataset for several reasons. The gnomAD v3 dataset contains 71,702 whole genomes (and no exomes), all mapped to a different reference sequence (GRCh38/hg38). The gnomAD v2 and v3 datasets are not independent since most of the genomes from v2 are included in v3. At present, gnomAD v3 does not include a structural variant dataset. Variants were grouped based on gnomAD annotations (stop, frameshift, splice, missense, synonymous, intronic, start lost, 5' untranslated region [UTR], 3' UTR, and in-frame insertion) for downstream analyses. I did not include dubious variants flagged and filtered by gnomAD. Flagged variants did not pass the gnomAD quality control process and include those in low complexity regions, variants predicted to disrupt splicing outside the canonical splice site, and multi-nucleotide variants found in phase with another variant.
Variants reported by gnomAD were analyzed with Combined Annotation Dependent Depletion (CADD). CADD integrates multiple annotations by contrasting variants that survived natural selection with simulated pathogenic variants [16,17]. For the analyses reported here, I focused on CADD "PHRED-scaled" scores, which represent the rank in order-of-magnitude terms rather than the precise rank itself. Reference genome single nucleotide variants in the top 10% of CADD scores are assigned to CADD PHRED-10, the top 1% to CADD PHRED-20, the top 0.1% to CADD PHRED-30, etc. For SGCE variants, CADD raw scores ranged from -0.812 to 7.119, and CADD PHRED scores ranged from 0.008 to 38. CADD raw and PHRED scores were highly correlated (r = 0.975).
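The PHRED scaling described above can be illustrated with a minimal sketch. It assumes the usual definition of a PHRED-like score, -10·log10(rank/total), where rank 1 is the most deleterious variant; the function names are illustrative and are not part of the CADD software.

```python
import math


def cadd_phred(rank: int, total: int) -> float:
    """PHRED-like scaling of a rank: -10 * log10(rank / total).

    rank 1 is the most deleterious variant out of `total` ranked variants.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(phred: float) -> float:
    """Fraction of all ranked variants scoring at or above a PHRED value."""
    return 10.0 ** (-phred / 10.0)


# A variant at the boundary of the top 10% maps to PHRED 10, top 1% to 20.
print(cadd_phred(10, 100), cadd_phred(1, 100))
# Thresholds used in this report, expressed as fractions of all variants.
for threshold in (20, 25, 30):
    print(threshold, top_fraction(threshold))
```

Under this scaling, a threshold of 25 falls between the top 1% and the top 0.1% of ranked variants (roughly the top 0.3%).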
In addition to CADD, dbNSFP [18] was used for functional prediction and annotation of non-synonymous single-nucleotide variants (nsSNVs). Its current version (dbNSFP v4.0) is based on the Gencode release 29/Ensembl version 94 and includes a total of 84,013,490 nsSNVs and ssSNVs (splicing-site SNVs). dbNSFP compiles prediction scores from 29 algorithms (SIFT, Polyphen2-HDIV, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, REVEL, PrimateAI, etc.), conservation scores, and includes other related information including allele frequencies observed in the 1000 Genomes Project phase 3 data, UK10K cohorts data, ExAC consortium data, gnomAD data and the NHLBI Exome Sequencing Project ESP6500 data, various gene IDs from different databases, functional descriptions of genes, gene expression and gene interaction information. Rare Exome Variant Ensemble Learner (REVEL) [19] and MetaLR [20] were used to predict the pathogenicity of nsSNVs. In comparison to most other prediction algorithms, REVEL and MetaLR show high overall performance and areas under the receiver operating characteristic curves [19].
The Human Gene Mutation Database® (HGMD) was used for identification of published SGCE variants reported in singletons and pedigrees with M-D. For comparison with population variants reported in gnomAD, I selected a subset of well-documented variants from multiple classes (missense, nonsense, splice, small deletions, and small insertions) included in 4 independent publications. Data from these 4 publications were reviewed to verify HGMD reporting. Of note, HGMD does not include all published SGCE variants, and no attempt was made to scour the entire published literature to analyze all known disease-associated variants in SGCE.

Analysis of published variants
Using the HGMD, I identified 25 disease-associated SGCE variants reported in 4 independent publications [4,21,22,23]. M-D has been linked to virtually all types of variants including interstitial deletions, single-exonic deletions, indels, in-frame deletions, non-synonymous single-nucleotide missense, and splice. The vast majority of indels lead to frameshifts and stops (Table 1), likely resulting in nonsense-mediated decay [24]. Two (c.812G>A and c.289C>T) of the 25 selected variants are also present in gnomAD, presumably in two single individuals (Table 1). CADD_PHRED scores ranged from 21.2 for an in-frame deletion to 41 for two nonsense variants. Six missense variants are included in Table 1. Their CADD_PHRED scores range from 23.8 to 35. All 6 of these missense variants are predicted to be disease causing by MetaLR, MetaSVM, and MutationTaster. REVEL_rankscores ranged from 0.852 (p.Thr36Arg) to 0.997 (p.Tyr115Cys). However, the p.Thr36Arg variant was classified as T (tolerated) by PrimateAI_pred [25], and B (benign) by Polyphen2_HDIV_pred. The male subject harboring the p.Thr36Arg variant had alcohol-responsive myoclonus but no dystonia or family history of M-D [4].

SGCE variants in gnomAD v2
Recognizing that some fraction of the 282,646 SGCE alleles could contain two or more variants, a maximum of 134,145 alleles within the gnomAD database harbored a short SGCE variant (Table 2). As reported in gnomAD, SGCE has 7 polymorphisms (minor allele frequency > 5%). One of these is a nonsynonymous variant (p.Ser434Arg). Two others are found in low complexity regions of SGCE (Supplemental Table 1). Four structural variants, all intronic insertions, ranging in size from 243 bp to 322 bp, are present in 34 subjects within the gnomAD v2 dataset. The potential effects of these variants on gene expression and splicing are not known.
Of the 780 short variants reported in gnomAD v2, 265 had CADD scores ≥ 20. This subset of variants was present in a maximum of 1250 alleles. Based on data contained within Table 1, a more restrictive group of 92 more deleterious and possibly pathogenic variants with CADD_PHRED scores ≥ 25 was found in 406 individuals. Extrapolating to the current population of the United States (US Census Bureau, www.census.gov), nearly one million individuals in the US (1/348 or 0.287%) harbor a highly deleterious and possibly pathogenic SGCE variant in their genome. Limiting the analysis to the most deleterious variants and correcting for imprinting, there could be an estimated 65,426 cases of M-D in the US.
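The extrapolation above can be retraced with simple arithmetic. The sketch below uses the 406 carriers and 141,456 subjects reported here, together with the 56 alleles with CADD scores greater than 30 given in the abstract. Two inputs are assumptions rather than values stated in the text: a round US population of 330 million, and an imprinting correction of 0.5 (on the reasoning that only paternally inherited alleles of a maternally imprinted gene are expressed). With these assumptions the calculation yields roughly 198 cases/million, the figure quoted in the Discussion, but the exact correction and census value used by the author are not spelled out.

```python
GNOMAD_V2_SUBJECTS = 141_456       # 125,748 exomes + 15,708 genomes
CARRIERS_CADD_GE_25 = 406          # individuals with a CADD_PHRED >= 25 variant
ALLELES_CADD_GT_30 = 56            # from the abstract
US_POPULATION = 330_000_000        # ASSUMPTION: round figure, not from the text
IMPRINTING_FACTOR = 0.5            # ASSUMPTION: only paternal alleles expressed

# Carrier frequency for CADD_PHRED >= 25 variants.
carrier_fraction = CARRIERS_CADD_GE_25 / GNOMAD_V2_SUBJECTS
one_in_n = GNOMAD_V2_SUBJECTS / CARRIERS_CADD_GE_25
us_carriers = carrier_fraction * US_POPULATION

# Most deleterious variants, corrected for imprinting.
cases_per_million = ALLELES_CADD_GT_30 / GNOMAD_V2_SUBJECTS * IMPRINTING_FACTOR * 1e6
us_cases = cases_per_million * US_POPULATION / 1e6

print(f"carrier frequency: {carrier_fraction:.3%} (about 1/{one_in_n:.0f})")
print(f"estimated US carriers: {us_carriers:,.0f}")
print(f"predicted cases per million: {cases_per_million:.0f}")
print(f"estimated US cases: {us_cases:,.0f}")
```

The absolute counts scale linearly with the population figure, so a slightly different census value shifts the estimated number of carriers and cases without changing the per-million rates.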
CADD_PHRED scores were also generated for the 720 SGCE variants present in the gnomAD non-neuro v2 dataset derived from 104,068 exomes and 10,636 genomes. A total of 254 variants had CADD scores ≥ 20: 22 ≥ 30, 67 ≥ 25 and < 30, and 162 ≥ 20 and < 25. Three variants were homozygous. The total number of variants and their CADD scores are proportional to the data derived from the entirety of gnomAD v2. Therefore, deleterious SGCE variants are not concentrated in "neuro" exomes and genomes within gnomAD v2.
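The proportionality between the full and non-neuro datasets can be checked directly from the counts reported above. This is a minimal sketch; the dictionary layout and function name are illustrative only.

```python
# Counts reported in the text for short SGCE variants.
full_v2 = {"variants": 780, "cadd_ge_20": 265}
non_neuro = {"variants": 720, "cadd_ge_20": 254}


def fraction_cadd_ge_20(counts: dict) -> float:
    """Share of short variants with a CADD score >= 20."""
    return counts["cadd_ge_20"] / counts["variants"]


print(f"gnomAD v2 (all):       {fraction_cadd_ge_20(full_v2):.1%}")
print(f"gnomAD v2 (non-neuro): {fraction_cadd_ge_20(non_neuro):.1%}")
```

Both shares are close to one third, consistent with the statement that deleterious variants are not concentrated in the "neuro" samples.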

Variant annotation
As gleaned from Tables 1 and 3, nonsense, frameshift and splice pathogenic variants leading to NMD or truncated proteins are associated with the highest CADD scores of deleteriousness. Of the 23 variants with CADD_PHRED scores ≥ 30, 3 were stops, 4 were frameshifts leading to premature termination, 4 were located within canonical splice sites, and 12 were missense (Table 3). In contrast, a much broader array of variants had CADD_PHRED scores between 20 and 25. In particular, among 173 variants in this grouping, 133 were missense, 7 were synonymous and 15 were in the 5' UTR. There was only 1 stop or frameshift variant with a CADD score between 20 and 25, supporting the notion that more deleterious variants are poorly tolerated and may be causally associated with neurological manifestations. It should be noted that synonymous and 5' UTR variants are important causes of disease. Synonymous variants can alter mRNA stability, splicing, and the rate of translation. Similarly, variants in the 5' UTR can exert deleterious effects on translation.

Missense variants
A maximum of 1216 gnomAD v2 alleles harbored a missense variant predicted to be disease-causing by MetaLR with a reliability_Index of 9 or 10 (Table 4). For this group of variants, MetaLR_rankscores ranged from 0.95 to 0.99 with a median value of 0.97. For comparative purposes, I correlated CADD_PHRED scores with MetaLR_rankscores and REVEL_rankscores (Supplemental Table 2). As a threshold value for deleteriousness, a CADD_PHRED score of 15 is generally recommended for identifying potentially pathogenic variants (cadd.gs.washington.edu/info). This value is near the low CADD_PHRED score of 14.45. Given that correlations among CADD_PHRED scores, MetaLR_rankscores and REVEL_rankscores were only moderate (r > 0.5), reliance on a single measure of deleteriousness could lead to missed assignments of pathogenicity (Supplemental Table 2).
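The kind of pairwise comparison described above can be sketched with a plain Pearson correlation. The score lists below are hypothetical values chosen for illustration; they are not taken from Table 4 or Supplemental Table 2, and the text does not state which correlation coefficient was actually used.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_x * ss_y)


# HYPOTHETICAL per-variant scores, for illustration only.
cadd_phred_scores = [16.0, 22.5, 24.0, 27.5, 31.0, 33.0]
revel_rankscores = [0.62, 0.90, 0.71, 0.95, 0.80, 0.97]

r = pearson_r(cadd_phred_scores, revel_rankscores)
print(f"r(CADD_PHRED, REVEL_rankscore) = {r:.2f}")
```

A moderate coefficient means that the tools rank some variants quite differently, which is why agreement across several predictors is more informative than any single score.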

Discussion
My analyses of the gnomAD database indicate that deleterious variants in SGCE are common in presumably normal populations. This finding should inform interpretation of whole-exome sequencing (WES) and whole-genome sequencing (WGS) performed on individuals and populations for other purposes. There are several interpretations of my findings. First, in silico analyses of deleteriousness may be weak predictors of pathogenicity, particularly for SGCE and M-D. Second, individuals with deleterious SGCE variants may have very mild, unrecognized, clinical manifestations or isolated psychiatric disease. Third, M-D may be misdiagnosed or underdiagnosed, and, in this regard, most subjects included in the gnomAD database did not undergo neurological examination by an expert in movement disorders.
Additional limitations of my work should be highlighted. First, WES and WGS are associated with small false positive and false negative error rates. Although uncommon, M-D can be caused by large interstitial deletions, which are often missed by short-read next-generation sequencing. These large structural variants would increase the predicted number of cases in the population. The single nucleotide and indel variants included in gnomAD were not confirmed with bidirectional Sanger sequencing. As such, it is possible that a small percentage of the reported variants were short-read errors. Caution should be used with examination of the gnomAD database and most other large genomic/genetic databases. Ideally, a neurological control genetic database should be restricted to neurologically and psychiatrically normal adults with no first- or second-degree relatives with neurological or psychiatric disease. The limitations of entirely in silico approaches are well established, and reliable, inexpensive, high-throughput functional assays for SGCE variants are not available. Rather than simple reliance on CADD, my analyses suggest that clinical geneticists should use multiple in silico tools and query population control databases when evaluating the potential pathogenicity of SGCE variants, particularly missense variants.
Using a minimum ΔGAG carrier prevalence of 176/million and penetrance of 35%, an estimated 62 cases of DYT1 dystonia/million are present in the United States [15]. However, the actual number of DYT1 cases seen by movement disorders experts in the United States would seem much lower. This suggests that population penetrance may be considerably less than the penetrance within individual families, perhaps driven by other variants in cis or trans. In the context of M-D, our data predict a population prevalence of 198 cases/million, which is somewhat higher than the maximum estimated prevalence of cervical dystonia in the United States [13]. True penetrance can only be determined by expert examination of carriers within a population, and variant databases like gnomAD are obviously limited in this regard.
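The DYT1 estimate is the product of carrier prevalence and penetrance, and it can be set beside the M-D prediction and the cervical dystonia range cited earlier. A minimal sketch using only figures quoted in the text (variable names are illustrative):

```python
# Figures quoted in the text, all expressed per million population.
DYT1_CARRIERS_PER_MILLION = 176       # minimum ΔGAG carrier prevalence [15]
DYT1_PENETRANCE = 0.35
MD_PREDICTED_PER_MILLION = 198        # predicted SGCE M-D prevalence
CERVICAL_DYSTONIA_RANGE = (28, 183)   # prevalence estimates [13]

dyt1_cases_per_million = DYT1_CARRIERS_PER_MILLION * DYT1_PENETRANCE
print(f"DYT1 cases per million: {dyt1_cases_per_million:.1f}")

exceeds_max_cd = MD_PREDICTED_PER_MILLION > CERVICAL_DYSTONIA_RANGE[1]
print(f"M-D prediction exceeds maximum cervical dystonia estimate: {exceeds_max_cd}")
```

The product rounds to the 62 cases/million given above, and the same structure makes it easy to see how strongly any predicted prevalence depends on the assumed penetrance.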
SGCE-associated neuropsychiatric disease may be underrecognized by pediatric neurologists and psychiatrists. Neurological manifestations can range from early mild gait dysfunction or subtle myoclonus to generalized dystonia with cognitive dysfunction. M-D may be misdiagnosed as Tourette syndrome, myoclonic epilepsy, dyskinetic cerebral palsy, or isolated cervical dystonia with associated appendicular tremor [26]. Consideration of a genetic etiology may be dismissed due to maternal imprinting and broad phenotypic variability in individual pedigrees. The positive effects of alcohol on motor manifestations are rarely identified in children. Timely diagnosis of SGCE-associated neuropsychiatric disease is important for individual patients and pedigrees given that effective treatments are available for myoclonus, dystonia, anxiety, depression, and other disease manifestations.
Given the phenotypic variability of dystonia and other disorders of the motor system along with the declining costs of next-generation sequencing (NGS), multi-gene panels, WES and WGS are being increasingly utilized for genetic diagnoses. For instance, Invitae (www.invitae.com) offers a comprehensive dystonia panel that includes SGCE and 17 other genes. Their panel is sequenced to high depth (50x minimum) to detect SNVs, indels, exon-level deletions/duplications, and large copy number variants. Among 1,910 patients with a clinical diagnosis of dystonia included in a recent report, 7.9% were given a molecular diagnosis and 11.8% were found to have a variant of unknown significance [27]. The genes with highest yield were SGCE (20.5%) and TOR1A (19.9%) [27]. For comparison, within the ClinVar database on July 3, 2020, there were 215 accessions associated with SGCE, but only 122 associated with TOR1A. This evidence suggests that SGCE-associated dystonia is perhaps more common in clinics than previously recognized.
In conclusion, we have shown that SGCE variants predicted to be highly deleterious are common in population and non-neurological disease controls. Accordingly, SGCE-associated neuropsychiatric disease may be underrecognized by clinicians. Alternatively, the population penetrance of deleterious variants in SGCE may be quite low. Ideally, expert examination of pedigrees and co-segregation should be used to establish the causality of SGCE variants identified by routine Sanger sequencing, next-generation multi-gene panels, WES or WGS. Future work should focus on environmental and genetic contributions to penetrance.

Ethics and Consent
Our secondary analyses of human genetic data do not meet the definition of human experimentation. No personally identifiable information is associated with gnomAD variant data.