The G4Hunter Algorithm
======================

This page provides a detailed explanation of the G4Hunter algorithm as described by Bedrat, Lacroix & Mergny (2016) [1]_. Understanding the algorithm helps interpret g4hunterpy3 output and choose appropriate parameters.

.. contents:: On this page
   :local:
   :depth: 2

What Are G-Quadruplexes?
------------------------

G-quadruplexes (G4) are four-stranded nucleic acid structures formed by guanine-rich sequences. Their building block is the **guanine quartet (G-quartet)**, a planar association of four guanines held together by Hoogsteen hydrogen bonds. Two or more stacked G-quartets, stabilized by coordinated cations (typically K\ :sup:`+`), form a G-quadruplex.

G4 structures have attracted significant biological interest because they:

- Are involved in **telomere biology** and chromosome end-capping.
- Play roles in **transcription regulation** (e.g., the *MYC* promoter G4).
- Affect **DNA replication**, acting as replication barriers.
- Are implicated in **genomic instability**: mitochondrial DNA deletion breakpoints map near G4-forming regions.
- Serve as **drug targets** for anticancer and antiviral therapeutics.
- Are found enriched in **gene promoters**, particularly proto-oncogenes.

Why G4Hunter?
-------------

Prior to G4Hunter, the dominant approach for predicting G4-forming sequences was **Quadparser**, which searched for patterns matching the motif ``G_n N_m G_n N_o G_n N_p G_n``, requiring fixed-length G-runs and constrained loop sizes. While useful, Quadparser has important limitations:

- **False negatives**: sequences that do form G4 structures experimentally but lack the canonical pattern are missed. Some G4 structures contain interrupted G-runs, bulges, or non-standard loop lengths.
- **False positives**: some sequences matching the consensus fail to form G4 structures *in vitro* because competing duplex formation (particularly in GC-alternating regions) is more favorable.
- **Binary output**: Quadparser provides only a yes/no answer, without a quantitative measure of G4-forming propensity.

G4Hunter addresses all three issues by providing a **continuous score** based on two sequence properties that are correlated with G4 formation:

1. **G-richness**: the density and clustering of guanines.
2. **G-skewness**: the asymmetry between G and C content on a given strand.

Per-Base Scoring
----------------

The foundation of G4Hunter is a per-base scoring scheme that assigns every nucleotide in a sequence a score between **-4** and **+4**:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Base
     - Score
     - Rationale
   * - A, T (U)
     - 0
     - Neutral: these bases neither contribute to nor compete with G4 formation.
   * - G in a single-G context
     - +1
     - Weak G-richness signal.
   * - Each G in a GG run
     - +2
     - Moderate G-richness.
   * - Each G in a GGG run
     - +3
     - Strong G-richness (the classic G-tract).
   * - Each G in a run of 4+ Gs
     - +4 (capped)
     - Very strong G-richness (capped at 4).
   * - C
     - Negative (same magnitude rules as G)
     - C-runs **oppose** G4 formation on the scored strand, reflecting competition from stable Watson-Crick duplexes.

**Worked example**:

.. code-block:: text

   Sequence:      G   G   G   A   T   T   C   C   G   G   G   G   A   T
   Run lengths:   3   3   3   -   -   -   2   2   4   4   4   4   -   -
   Scores:       +3  +3  +3   0   0   0  -2  -2  +4  +4  +4  +4   0   0

Key design properties:

- **Runs of G are rewarded superlinearly.** A ``GGG`` run scores 3 per base (total 9), far more than three isolated G's (total 3). This captures the physical reality that consecutive guanines are needed to form G-quartets.
- **C-runs receive negative scores.** This simultaneously (a) penalizes GC-alternating regions where stable duplex formation competes with G4, and (b) allows scoring the complementary strand: a highly negative score on one strand implies a highly positive score (and G4 potential) on the reverse complement.
- **Scores are capped at ±4.** Runs longer than 4 still receive a score of 4 per base. This prevents extremely long G-tracts from dominating the score disproportionately.

Sliding Window Scoring (G4Hscore)
---------------------------------

For genome-wide or sequence-level analysis, per-base scores are smoothed using a **sliding window**:

1. Fix a window size *k* (default: **25 nt**).
2. Slide the window one base at a time across the sequence.
3. For each window position, compute the **arithmetic mean** of the per-base scores within that window. This mean is the **G4Hscore** for that window.

.. math::

   \text{G4Hscore}(i) = \frac{1}{k} \sum_{j=i}^{i+k-1} \text{score}(j)

The G4Hscore has several useful properties:

- It is **centered on zero** for random sequences, independent of GC content. This is because G and C contributions cancel in sequences without strand asymmetry.
- **Positive values** indicate G-rich regions with G4-forming potential on the scored (forward) strand.
- **Negative values** indicate C-rich regions, implying G4-forming potential on the **complementary** strand.
- The **magnitude** reflects G4 propensity strength.

Window size considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^

The default window size of **25 nt** was chosen to match the typical length of experimentally characterized G-quadruplex-forming sequences (~26 nt mean in the Bedrat et al. reference dataset). The window size can be adjusted:

- **Smaller windows** (15–20 nt) increase sensitivity for short G4 motifs but may increase false positives, especially from single long G-runs that cannot form intramolecular G4 structures alone.
- **Larger windows** (30–100 nt) can identify broader regions where multiple G4 structures may form in tandem, which may be biologically relevant (e.g., for replication-associated DNA damage).

In g4hunterpy3, window size is set via ``window_size`` in the Python API or ``-w`` on the command line.
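The per-base scoring and sliding-window averaging described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the g4hunterpy3 implementation: the function names ``base_scores_sketch`` and ``window_means_sketch`` are hypothetical, and a window of 5 nt is used instead of the 25 nt default so that the short worked-example sequence yields more than one window.

```python
from itertools import groupby


def base_scores_sketch(seq: str) -> list[int]:
    """Hypothetical helper: score each base by the length of the G or C run it sits in."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if base == "G":
            value = min(n, 4)       # G runs score +1..+4, capped at 4
        elif base == "C":
            value = -min(n, 4)      # C runs mirror G runs with a negative sign
        else:
            value = 0               # A, T, U (and anything else) are neutral here
        scores.extend([value] * n)
    return scores


def window_means_sketch(scores: list[int], k: int = 25) -> list[float]:
    """Hypothetical helper: mean per-base score of every length-k window."""
    if k <= 0 or len(scores) < k:
        return []
    total = sum(scores[:k])
    means = [total / k]
    # Running sum: add the base entering the window, drop the one leaving it.
    for i in range(k, len(scores)):
        total += scores[i] - scores[i - k]
        means.append(total / k)
    return means


scores = base_scores_sketch("GGGATTCCGGGGAT")
print(scores)
print(window_means_sketch(scores, k=5))
```

For the worked-example sequence, the first call reproduces the per-base scores listed in the table above, and the window means change sign where the ``CC`` run outweighs the neighbouring guanines. The running sum keeps the cost linear in sequence length, which is one simple alternative to a convolution.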
Thresholding and Hit Calling
----------------------------

Windows whose **absolute** G4Hscore meets or exceeds a user-defined threshold are reported as candidate G4-forming regions. Taking the absolute value means that G4-forming potential on both strands is captured simultaneously.

.. math::

   |\text{G4Hscore}(i)| \geq \theta

The threshold :math:`\theta` controls the trade-off between sensitivity and specificity:

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Threshold
     - Precision
     - Recommended use
   * - 1.0
     - ~73%
     - Most inclusive; recovers the maximum number of true G4-forming sequences but with a higher false positive rate. Miss rate ~6%.
   * - 1.2
     - ~85%
     - **Recommended compromise**. Identifies many true G4 motifs while maintaining reasonable precision. Miss rate ~15%.
   * - 1.5
     - >90%
     - **High-confidence** predictions. Most sequences identified at this threshold are experimentally confirmed G4 formers.
   * - 1.75
     - ~95%
     - Very stringent; near-zero false positives on the mitochondrial genome validation set. Best for applications requiring certainty (e.g., PCR primer or DNA origami design).
   * - 2.0
     - >98%
     - Most stringent; identifies only the strongest G4-forming sequences.

These precision estimates are based on experimental validation of 209 sequences from the human mitochondrial genome using six independent biophysical methods (CD, NMR, TDS, IDS, UV-melting, and thioflavin T fluorescence).

In g4hunterpy3, the threshold is set via ``threshold`` in the Python API or ``-s`` on the command line.

Region Merging
--------------

In a genome-wide scan, consecutive windows with scores above the threshold typically overlap extensively (windows shifted by 1 nt share *k*-1 bases). These overlapping windows are **merged** into contiguous regions:

1. Sort window hits by start position.
2. Merge a window into the current region when its start falls within that region (i.e., next window start ≤ current region end - 1).
3. For each merged region, compute the **region score** as the mean of the per-base scores across the full region (not the mean of window scores).
4. Report the merged region with its coordinates, sequence, length, score, and number of contributing windows.

This produces a non-redundant set of candidate G4-forming sequences (G4FS) from the genome.

Scoring Both Strands Simultaneously
-----------------------------------

A key advantage of the G4Hunter scoring scheme is that **both strands of a DNA duplex are scored simultaneously**. Because C-runs receive negative scores while G-runs receive positive scores:

- A region with a **positive** G4Hscore has G4-forming potential on the **forward (scored) strand**.
- A region with a **negative** G4Hscore has G4-forming potential on the **reverse complement strand** (which is G-rich where the forward strand is C-rich).

This means a single pass through a sequence identifies G4 candidates on both strands. The ``--strand-agnostic`` option in the CLI takes absolute values of scores, treating both strands equally when plotting.

Comparison with Quadparser
--------------------------

Bedrat et al. systematically compared G4Hunter against Quadparser on a reference dataset of 392 sequences with experimentally known G4 formation status, and on 209 sequences from the human mitochondrial genome:

.. list-table::
   :header-rows: 1
   :widths: 30 25 25 20

   * - Method
     - True G4 found
     - False positive rate
     - Type
   * - G4Hunter (θ=1.0)
     - 281 / 298
     - 10.6%
     - Continuous score
   * - G4Hunter (θ=1.2)
     - 252 / 298
     - 6.4%
     - Continuous score
   * - Quadparser (QP37)
     - 196 / 298
     - 1.1%
     - Binary
   * - Quadparser (QP27)
     - More hits
     - Higher FPR
     - Binary

Key advantages of G4Hunter:

1. **Fewer false negatives.** G4Hunter identifies G4-forming sequences that lack the canonical ``(G₃₊N₁₋₇)₄`` pattern, such as sequences with bulges, interrupted G-runs, or non-standard loop lengths.
2. **Quantitative output.** The continuous score allows ranking sequences by G4-forming propensity and tuning sensitivity via the threshold.
3. **Both strands.** A single scan reports candidates on both strands.
4. **ROC AUC > 0.96** on the reference dataset, indicating excellent discriminating power.

Genomic Distribution of G4-Forming Sequences
--------------------------------------------

Application of G4Hunter to 20 genomes revealed:

- **Mammalian genomes** are the most G4-rich, with ~2.5 G4FS per kb at θ=1.0 and ~0.5 per kb at θ=1.5 (window size 25).
- G4FS density decreases exponentially with increasing threshold, with similar exponential slopes across mammalian species.
- **Promoter regions** are significantly enriched in G4FS (2–5.5-fold) compared to genomic background, particularly for stable G4 motifs (θ ≥ 1.5).
- Enrichment is also observed in **5' UTRs**, **first exons**, and **first exon/intron junctions**, predominantly on the coding strand.
- The number of G4-forming sequences in the human genome is estimated to be **2–10 times higher** than the ~376,000 previously reported using Quadparser.

Implementation in g4hunterpy3
-----------------------------

g4hunterpy3 implements the G4Hunter algorithm as described above with the following pipeline:

1. :func:`~g4hunterpy3.core.base_scores` assigns per-base scores using the run-length scoring scheme.
2. :func:`~g4hunterpy3.core.window_mean_scores` computes sliding-window means using fast O(n) convolution.
3. :func:`~g4hunterpy3.core.find_window_hits` identifies windows meeting the absolute-score threshold.
4. :func:`~g4hunterpy3.core.merge_overlapping_windows` merges overlapping hits into non-redundant regions with per-base-mean region scores.

The convenience function :func:`~g4hunterpy3.core.scan_sequence` runs all four steps, and :func:`~g4hunterpy3.core.scan_fasta` applies the pipeline to every record in a FASTA file.

References
----------

.. [1] Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. *Nucleic Acids Res.* **44**, 1746–1759 (2016). doi:10.1093/nar/gkw006