Distinct sequence features underlie microdeletions and gross deletions
in the human genome
Abstract
Microdeletions and gross deletions are important causes
(~20%) of human inherited disease. Their genomic
locations are strongly influenced by the local DNA sequence environment.
Yet no systematic study has examined their generative mechanisms. Here, we
obtained 42,098 pathogenic microdeletions and gross deletions from the
Human Gene Mutation Database (HGMD) that together form a continuum of
germline deletions ranging in size from 1 bp to 28,394,429 bp. We
analyzed the sequence within 1 kb of the breakpoint junctions and found
that the frequencies of non-B DNA-forming repeats, GC content, and the
presence of seven of 78 specific sequence motifs in the vicinity of
pathogenic deletions correlated with deletion length for deletions of
length ≤30 bp. Furthermore, we found that DR, GQ, and STR repeats
appear to be important for the formation of longer deletions
(>30 bp) but not for the formation of shorter deletions
(≤30 bp). Significantly more microhomologies were identified flanking
short deletions than long deletions (chi-square test,
P-value < 2E-16). We provide evidence to
support a functional distinction between microdeletions and gross
deletions. A deletion length cut-off of 25-30 bp may serve as an
objective means to functionally distinguish microdeletions from gross
deletions.
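
As an illustration of the kind of comparison summarized above, the following
minimal Python sketch shows one way to measure microhomology at a deletion
breakpoint, split deletions at a length cut-off, and compare microhomology
occurrence between the two classes with a 2x2 chi-square test. The function
names, the shift-based definition of microhomology, the 30 bp default cut-off,
and the toy sequence are assumptions made for illustration only; they are not
taken from the study's own pipeline or data.

```python
import math


def microhomology_length(ref: str, start: int, length: int) -> int:
    """Count the bases by which the deletion ref[start:start+length] can be
    shifted left or right without changing the post-deletion sequence.
    This shift-based definition is an assumption, not the study's method."""
    end = start + length
    right = 0
    # Shifting right by one is neutral when the base leaving the window
    # equals the base entering it.
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right


def classify_deletion(length: int, cutoff: int = 30) -> str:
    """Label a deletion by size; the 30 bp default is an assumed cut-off."""
    return "microdeletion" if length <= cutoff else "gross deletion"


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], with its P-value for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy reference: "TCA" occurs at the start of the deleted segment and again
# immediately after it, so the breakpoint is ambiguous over those bases.
toy_ref = "GGATCAGGTTCACTT"
mh = microhomology_length(toy_ref, start=3, length=6)
label = classify_deletion(6)
```

In practice, rows of the 2x2 table would be the two size classes and columns
would be deletions with and without breakpoint microhomology; the counts
passed to `chi_square_2x2` would come from the annotated deletion set.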