The G4Hunter Algorithm
======================

This page provides a detailed explanation of the G4Hunter algorithm as
described by Bedrat, Lacroix & Mergny (2016) [1]_. Understanding the
algorithm helps interpret g4hunterpy3 output and choose appropriate
parameters.


.. contents:: On this page
   :local:
   :depth: 2


What Are G-Quadruplexes?
------------------------

G-quadruplexes (G4) are four-stranded nucleic acid structures formed by
guanine-rich sequences. Their building block is the **guanine quartet (G-quartet)**
— a planar association of four guanines held together by Hoogsteen hydrogen
bonds. Two or more stacked G-quartets, stabilized by coordinated cations
(typically K\ :sup:`+`), form a G-quadruplex.

G4 structures have attracted significant biological interest because they:

- Are involved in **telomere biology** and chromosome end-capping.
- Play roles in **transcription regulation** (e.g., the *MYC* promoter G4).
- Affect **DNA replication**, acting as replication barriers.
- Are implicated in **genomic instability** — mitochondrial DNA deletion
  breakpoints map near G4-forming regions.
- Serve as **drug targets** for anticancer and antiviral therapeutics.
- Are found enriched in **gene promoters**, particularly proto-oncogenes.


Why G4Hunter?
-------------

Prior to G4Hunter, the dominant approach for predicting G4-forming sequences was
**Quadparser**, which searched for patterns matching the motif
``G_n N_m G_n N_o G_n N_p G_n``, requiring fixed-length G-runs and constrained
loop sizes. While useful, Quadparser has important limitations:

- **False negatives**: sequences that do form G4 structures experimentally but
  lack the canonical pattern are missed. Some G4 structures contain interrupted
  G-runs, bulges, or non-standard loop lengths.
- **False positives**: some sequences matching the consensus fail to form
  G4 structures *in vitro* because competing duplex formation (particularly in
  GC-alternating regions) is more favorable.
- **Binary output**: Quadparser provides only a yes/no answer, without a
  quantitative measure of G4-forming propensity.

G4Hunter addresses all three issues by providing a **continuous score** based
on two sequence properties that are correlated with G4 formation:

1. **G-richness** — the density and clustering of guanines.
2. **G-skewness** — the asymmetry between G and C content on a given strand.
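
For contrast, the pattern-matching approach can be sketched as a regular
expression. This is only an illustration, assuming the ``G3+ N1-7`` form of the
motif cited later on this page; it is not Quadparser's actual code, and the
name ``matches_pattern`` and the interrupted test sequence are hypothetical.

.. code-block:: python

    import re

    # Assumed Quadparser-style motif: four runs of 3+ G separated by 1-7 nt loops.
    QUADPARSER_LIKE = re.compile(
        r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
    )


    def matches_pattern(seq: str) -> bool:
        """Return True if seq contains at least one match to the motif."""
        return QUADPARSER_LIKE.search(seq.upper()) is not None


    canonical = "GGGTTAGGGTTAGGGTTAGGG"     # four intact GGG runs
    interrupted = "GGGTTAGGGTTAGGGTTAGTGG"  # hypothetical: last run broken by a T

    print(matches_pattern(canonical))    # True
    print(matches_pattern(interrupted))  # False

The answer is strictly yes or no: the interrupted sequence is rejected outright,
with no indication of how close it came to matching.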


Per-Base Scoring
----------------

The foundation of G4Hunter is a per-base scoring scheme that assigns every
nucleotide in a sequence a score between **-4** and **+4**:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Base
     - Score
     - Rationale
   * - A, T (U)
     - 0
     - Neutral — do not contribute to or compete with G4 formation.
   * - G in a single-G context
     - +1
     - Weak G-richness signal.
   * - Each G in a GG run
     - +2
     - Moderate G-richness.
   * - Each G in a GGG run
     - +3
     - Strong G-richness — the classic G-tract.
   * - Each G in a run of 4+ Gs
     - +4 (capped)
     - Very strong G-richness (capped at 4).
   * - C
     - Negative, -1 to -4 (same run-length rules as G)
     - C-runs **oppose** G4 formation on the scored strand, reflecting
       competition from stable Watson-Crick duplexes.

**Worked example**:

.. code-block:: text

    Sequence:      G   G   G   A   T   T   C   C   G   G   G   G   A   T
    Run lengths:   3   3   3   -   -   -   2   2   4   4   4   4   -   -
    Scores:       +3  +3  +3   0   0   0  -2  -2  +4  +4  +4  +4   0   0

Key design properties:

- **Runs of G are rewarded superlinearly.** A ``GGG`` run scores 3 per base
  (total 9), far more than three isolated Gs (1 each, total 3). This captures
  the physical reality that consecutive guanines are needed to form G-quartets.

- **C-runs receive negative scores.** This simultaneously (a) penalizes
  GC-alternating regions where stable duplex formation competes with G4,
  and (b) allows scoring the complementary strand — a highly negative score
  on one strand implies a highly positive score (and G4 potential) on the
  reverse complement.

- **Scores are capped at ±4.** Runs longer than 4 still receive a score of 4
  per base. This prevents extremely long G-tracts from dominating the score
  disproportionately.
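
The scoring rules above can be sketched in a few lines of Python. This is a
minimal illustration, not the g4hunterpy3 implementation: the name
``base_scores_sketch`` is hypothetical, and treating every symbol other than G
and C (including ``N``) as neutral is an assumption.

.. code-block:: python

    from itertools import groupby


    def base_scores_sketch(seq: str) -> list[int]:
        """Score each base by the length of its G or C run, capped at 4."""
        scores = []
        for base, run in groupby(seq.upper()):
            n = len(list(run))
            if base == "G":
                value = min(n, 4)
            elif base == "C":
                value = -min(n, 4)
            else:
                value = 0  # assumption: A, T, U and any other symbol are neutral
            scores.extend([value] * n)
        return scores


    print(base_scores_sketch("GGGATTCCGGGGAT"))
    # [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]

``itertools.groupby`` yields one group per run of identical characters, so each
base receives the (capped) length of the run it belongs to.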


Sliding Window Scoring (G4Hscore)
---------------------------------

For genome-wide or sequence-level analysis, per-base scores are smoothed
using a **sliding window**:

1. Fix a window size *k* (default: **25 nt**).
2. Slide the window one base at a time across the sequence.
3. For each window position, compute the **arithmetic mean** of the per-base
   scores within that window. This mean is the **G4Hscore** for that window.

.. math::

    \text{G4Hscore}(i) = \frac{1}{k} \sum_{j=i}^{i+k-1} \text{score}(j)

The G4Hscore has several useful properties:

- It is **centered on zero** for random sequences, independent of GC content.
  This is because G and C contributions cancel in sequences without strand
  asymmetry.
- **Positive values** indicate G-rich regions with G4-forming potential on the
  scored (forward) strand.
- **Negative values** indicate C-rich regions, implying G4-forming potential on
  the **complementary** strand.
- The **magnitude** reflects the strength of the predicted G4 propensity: the
  larger the absolute value, the stronger the prediction.
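
The window mean defined above can be computed in linear time with a prefix
sum. The sketch below is one possible approach and not necessarily how
g4hunterpy3 computes it; the name ``window_means_sketch`` is hypothetical, and
returning an empty list for sequences shorter than the window is an assumption.

.. code-block:: python

    from itertools import accumulate


    def window_means_sketch(scores: list[int], k: int = 25) -> list[float]:
        """Mean of every length-k window, using a prefix sum (O(n) overall)."""
        if k <= 0:
            raise ValueError("window size must be positive")
        if len(scores) < k:
            return []  # assumption: too-short sequences yield no windows
        prefix = [0, *accumulate(scores)]  # prefix[i] = sum of scores[:i]
        return [(prefix[i + k] - prefix[i]) / k for i in range(len(scores) - k + 1)]


    # Per-base scores of GGGATTCCGGGGAT, scored with a short window of 7 nt.
    per_base = [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
    means = window_means_sketch(per_base, k=7)
    print(len(means))           # 8 windows for a 14-nt sequence
    print(means[0], means[-1])  # 1.0 2.0

A sequence of length *n* produces *n* - *k* + 1 windows, and entry *i* of the
result is the G4Hscore of the window starting at position *i* (0-based).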

Window size considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^

The default window size of **25 nt** was chosen to match the typical length of
experimentally characterized G-quadruplex-forming sequences (~26 nt mean in the
Bedrat et al. reference dataset). The window size can be adjusted:

- **Smaller windows** (15–20 nt) increase sensitivity for short G4 motifs but
  may increase false positives, especially from single long G-runs that cannot
  form intramolecular G4 structures alone.
- **Larger windows** (30–100 nt) can identify broader regions where multiple
  G4 structures may form in tandem, which may be biologically relevant (e.g.,
  for replication-associated DNA damage).

In g4hunterpy3, window size is set via ``window_size`` in the Python API
or ``-w`` on the command line.


Thresholding and Hit Calling
-----------------------------

Windows whose **absolute** G4Hscore meets or exceeds a user-defined threshold
are reported as candidate G4-forming regions. Taking the absolute value captures
G4-forming potential on both strands in a single pass.

.. math::

    |\text{G4Hscore}(i)| \geq \theta

The threshold :math:`\theta` controls the trade-off between sensitivity and
specificity:

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Threshold
     - Precision
     - Recommended use
   * - 1.0
     - ~73%
     - Most inclusive; recovers the maximum number of true G4-forming
       sequences but with a higher false positive rate. Miss rate ~6%.
   * - 1.2
     - ~85%
     - **Recommended compromise**. Identifies many true G4 motifs while
       maintaining reasonable precision. Miss rate ~15%.
   * - 1.5
     - >90%
     - **High-confidence** predictions. Most sequences identified at this
       threshold are experimentally confirmed G4 formers.
   * - 1.75
     - ~95%
     - Very stringent; near-zero false positives on the mitochondrial genome
       validation set. Best for applications requiring certainty (e.g.,
       PCR primer or DNA origami design).
   * - 2.0
     - >98%
     - Most stringent; identifies only the strongest G4-forming sequences.

These precision estimates are based on experimental validation of 209
sequences from the human mitochondrial genome using six independent biophysical
methods (CD, NMR, TDS, IDS, UV-melting, and thioflavin T fluorescence).

In g4hunterpy3, the threshold is set via ``threshold`` in the Python API
or ``-s`` on the command line.
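
As a minimal sketch of hit calling, the inequality above amounts to a filter
over the window scores. The function name ``window_hits_sketch`` and the
example scores are hypothetical and do not come from g4hunterpy3.

.. code-block:: python

    def window_hits_sketch(window_scores, threshold=1.2):
        """Return (start_index, score) for every window with |score| >= threshold."""
        return [(i, s) for i, s in enumerate(window_scores) if abs(s) >= threshold]


    scores = [0.4, 1.3, 1.2, -0.2, -1.6, 0.9]  # made-up window scores
    print(window_hits_sketch(scores, threshold=1.2))
    # [(1, 1.3), (2, 1.2), (4, -1.6)]

The window at index 4 is kept even though its score is negative. The sign is
preserved in the output so that the strand of each hit can still be recovered.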


Region Merging
--------------

In a genome-wide scan, consecutive windows with scores above the threshold
typically overlap extensively (windows shifted by 1 nt share *k*-1 bases).
These overlapping windows are **merged** into contiguous regions:

1. Sort window hits by start position.
2. Merge a window into the current region when its start falls inside that
   region (i.e., next window start ≤ current region end - 1).
3. For each merged region, compute the **region score** as the mean of the
   per-base scores across the full region (not the mean of window scores).
4. Report the merged region with its coordinates, sequence, length, score,
   and number of contributing windows.

This produces a non-redundant set of candidate G4-forming sequences (G4FS)
from the genome.
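
The merging steps can be sketched as follows. This is an illustration under
stated assumptions, not the g4hunterpy3 code: the function name and dictionary
keys are hypothetical, coordinates are assumed to be 0-based and half-open
(``[start, end)``), and every hit is assumed to lie fully inside the sequence.

.. code-block:: python

    def merge_hits_sketch(hit_starts, per_base, k=25):
        """Merge overlapping window hits into regions scored by per-base mean."""
        regions = []
        for start in sorted(hit_starts):
            end = start + k  # exclusive end of this window
            if regions and start <= regions[-1]["end"] - 1:
                # Window overlaps the current region: extend it.
                regions[-1]["end"] = max(regions[-1]["end"], end)
                regions[-1]["n_windows"] += 1
            else:
                regions.append({"start": start, "end": end, "n_windows": 1})
        for region in regions:
            segment = per_base[region["start"]:region["end"]]
            # Region score: mean of per-base scores, not of window scores.
            region["score"] = sum(segment) / len(segment)
        return regions


    # Per-base scores of GGGATTCCGGGGAT; windows of 7 nt starting at 5, 6 and 7.
    per_base = [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
    for r in merge_hits_sketch([5, 6, 7], per_base, k=7):
        print(r["start"], r["end"], r["n_windows"], round(r["score"], 2))
    # 5 14 3 1.33

With half-open coordinates, two windows that merely touch (one ends where the
next begins) share no base and are therefore kept as separate regions.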


Scoring Both Strands Simultaneously
------------------------------------

A key advantage of the G4Hunter scoring scheme is that **both strands of a
DNA duplex are scored simultaneously**. Because C-runs receive negative scores
while G-runs receive positive scores:

- A region with a **positive** G4Hscore has G4-forming potential on the
  **forward (scored) strand**.
- A region with a **negative** G4Hscore has G4-forming potential on the
  **reverse complement strand** (which is G-rich where the forward strand
  is C-rich).

This means a single pass through a sequence identifies G4 candidates on
both strands. The ``--strand-agnostic`` option in the CLI takes absolute
values of scores, treating both strands equally when plotting.
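
The symmetry between the two strands can be checked directly. In the sketch
below, the helper names ``run_scores`` and ``reverse_complement`` are
hypothetical and the scoring function is a simplified stand-in for the scheme
described on this page, restricted to the DNA alphabet.

.. code-block:: python

    from itertools import groupby


    def run_scores(seq):
        """Per-base scores: +n for G runs, -n for C runs, capped at 4, else 0."""
        scores = []
        for base, run in groupby(seq.upper()):
            n = len(list(run))
            sign = {"G": 1, "C": -1}.get(base, 0)
            scores.extend([sign * min(n, 4)] * n)
        return scores


    def reverse_complement(seq):
        """Reverse complement of a DNA sequence (A, C, G, T only)."""
        return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


    forward = "GGGTTAGGGTTAGGGTTAGGG"
    reverse = reverse_complement(forward)

    fwd_mean = sum(run_scores(forward)) / len(forward)
    rev_mean = sum(run_scores(reverse)) / len(reverse)

    print(reverse)                # CCCTAACCCTAACCCTAACCC
    print(fwd_mean == -rev_mean)  # True

Scoring the reverse complement gives the forward scores reversed and negated,
which is why scanning one strand is enough to find candidates on both.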


Comparison with Quadparser
--------------------------

Bedrat et al. systematically compared G4Hunter against Quadparser on a
reference dataset of 392 sequences with experimentally known G4 formation
status, and on 209 sequences from the human mitochondrial genome:

.. list-table::
   :header-rows: 1
   :widths: 30 25 25 20

   * - Method
     - True G4 found
     - False positive rate
     - Type
   * - G4Hunter (θ=1.0)
     - 281 / 298
     - 10.6%
     - Continuous score
   * - G4Hunter (θ=1.2)
     - 252 / 298
     - 6.4%
     - Continuous score
   * - Quadparser (QP37)
     - 196 / 298
     - 1.1%
     - Binary
   * - Quadparser (QP27)
     - More hits
     - Higher FPR
     - Binary

Key advantages of G4Hunter:

1. **Fewer false negatives.** G4Hunter identifies G4-forming sequences that
   lack the canonical ``(G₃₊N₁₋₇)₄`` pattern — sequences with bulges,
   interrupted G-runs, or non-standard loop lengths.

2. **Quantitative output.** The continuous score allows ranking sequences by
   G4-forming propensity and tuning sensitivity via the threshold.

3. **Both strands.** A single scan reports candidates on both strands.

4. **ROC AUC > 0.96** on the reference dataset, indicating excellent
   discriminating power.


Genomic Distribution of G4-Forming Sequences
---------------------------------------------

Application of G4Hunter to 20 genomes revealed:

- **Mammalian genomes** are the most G4-rich, with ~2.5 G4FS per kb at
  θ=1.0 and ~0.5 per kb at θ=1.5 (window size 25).
- G4FS density decreases exponentially with increasing threshold, with
  similar exponential slopes across mammalian species.
- **Promoter regions** are significantly enriched in G4FS (2–5.5-fold) compared
  to genomic background, particularly for stable G4 motifs (θ ≥ 1.5).
- Enrichment is also observed in **5' UTRs**, **first exons**, and **first
  exon/intron junctions**, predominantly on the coding strand.
- The number of G4-forming sequences in the human genome is estimated to be
  **2–10 times higher** than the ~376,000 previously reported using
  Quadparser.


Implementation in g4hunterpy3
------------------------------

g4hunterpy3 implements the G4Hunter algorithm as described above with the
following pipeline:

1. :func:`~g4hunterpy3.core.base_scores` — assigns per-base scores using the
   run-length scoring scheme.
2. :func:`~g4hunterpy3.core.window_mean_scores` — computes sliding-window
   means using fast O(n) convolution.
3. :func:`~g4hunterpy3.core.find_window_hits` — identifies windows exceeding
   the absolute-score threshold.
4. :func:`~g4hunterpy3.core.merge_overlapping_windows` — merges overlapping
   hits into non-redundant regions with per-base-mean region scores.

The convenience function :func:`~g4hunterpy3.core.scan_sequence` runs all four
steps, and :func:`~g4hunterpy3.core.scan_fasta` applies the pipeline to every
record in a FASTA file.


References
----------

.. [1] Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex
   propensity with G4Hunter. *Nucleic Acids Res.* **44**, 1746–1759 (2016).
   `doi:10.1093/nar/gkw006 <https://doi.org/10.1093/nar/gkw006>`_