The G4Hunter Algorithm

This page provides a detailed explanation of the G4Hunter algorithm as described by Bedrat, Lacroix & Mergny (2016) [1]. Understanding the algorithm helps interpret g4hunterpy3 output and choose appropriate parameters.

What Are G-Quadruplexes?

G-quadruplexes (G4) are four-stranded nucleic acid structures formed by guanine-rich sequences. Their building block is the guanine quartet (G-quartet) — a planar association of four guanines held together by Hoogsteen hydrogen bonds. Two or more stacked G-quartets, stabilized by coordinated cations (typically K+), form a G-quadruplex.

G4 structures have attracted significant biological interest because they:

  • Are involved in telomere biology and chromosome end-capping.

  • Play roles in transcription regulation (e.g., the MYC promoter G4).

  • Affect DNA replication, acting as replication barriers.

  • Are implicated in genomic instability — mitochondrial DNA deletion breakpoints map near G4-forming regions.

  • Serve as drug targets for anticancer and antiviral therapeutics.

  • Are enriched in gene promoters, particularly those of proto-oncogenes.

Why G4Hunter?

Prior to G4Hunter, the dominant approach for predicting G4-forming sequences was Quadparser, which searched for patterns matching the motif G_n N_m G_n N_o G_n N_p G_n, requiring fixed-length G-runs and constrained loop sizes. While useful, Quadparser has important limitations:

  • False negatives: sequences that do form G4 structures experimentally but lack the canonical pattern are missed. Some G4 structures contain interrupted G-runs, bulges, or non-standard loop lengths.

  • False positives: some sequences matching the consensus fail to form G4 structures in vitro because competing duplex formation (particularly in GC-alternating regions) is more favorable.

  • Binary output: Quadparser provides only a yes/no answer, without a quantitative measure of G4-forming propensity.

G4Hunter addresses all three issues by providing a continuous score based on two sequence properties that are correlated with G4 formation:

  1. G-richness — the density and clustering of guanines.

  2. G-skewness — the asymmetry between G and C content on a given strand.

Per-Base Scoring

The foundation of G4Hunter is a per-base scoring scheme that assigns every nucleotide in a sequence a score between -4 and +4:

| Base | Score | Rationale |
| --- | --- | --- |
| A, T (U) | 0 | Neutral; do not contribute to or compete with G4 formation. |
| G in a single-G context | +1 | Weak G-richness signal. |
| Each G in a GG run | +2 | Moderate G-richness. |
| Each G in a GGG run | +3 | Strong G-richness: the classic G-tract. |
| Each G in a run of 4+ Gs | +4 (capped) | Very strong G-richness (capped at 4). |
| C | Negative (same magnitude rules as G) | C-runs oppose G4 formation on the scored strand, reflecting competition from stable Watson-Crick duplexes. |

Worked example:

Sequence:     G  G  G  A  T  T  C  C  G  G  G  G  A  T
Run lengths:  3  3  3  -  -  -  2  2  4  4  4  4  -  -
Scores:      +3 +3 +3  0  0  0 -2 -2 +4 +4 +4 +4  0  0

Key design properties:

  • Runs of G are rewarded superlinearly. A GGG run scores 3 per base (total 9), far more than three isolated G’s (total 3). This captures the physical reality that consecutive guanines are needed to form G-quartets.

  • C-runs receive negative scores. This simultaneously (a) penalizes GC-alternating regions where stable duplex formation competes with G4, and (b) allows scoring the complementary strand — a highly negative score on one strand implies a highly positive score (and G4 potential) on the reverse complement.

  • Scores are capped at ±4. Runs longer than 4 still receive a score of 4 per base. This prevents extremely long G-tracts from dominating the score disproportionately.
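The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the g4hunterpy3 source: the function name per_base_scores and its cap parameter are hypothetical, and any symbol other than G or C is assumed to be neutral.

```python
from itertools import groupby


def per_base_scores(seq, cap=4):
    """Return one run-length score per nucleotide (illustrative sketch)."""
    scores = []
    # groupby yields each maximal run of identical consecutive bases.
    for base, run in groupby(seq.upper()):
        length = len(list(run))
        if base == "G":
            value = min(length, cap)    # +1 .. +4 for every G in the run
        elif base == "C":
            value = -min(length, cap)   # -1 .. -4 for every C in the run
        else:
            value = 0                   # A, T, U and anything else (assumption)
        scores.extend([value] * length)
    return scores


print(per_base_scores("GGGATTCCGGGGAT"))
# [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
```

The printed list reproduces the worked example: every base in a run receives the same score, so a run's total contribution grows with the square of its length until the cap applies.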

Sliding Window Scoring (G4Hscore)

For genome-wide or sequence-level analysis, per-base scores are smoothed using a sliding window:

  1. Fix a window size k (default: 25 nt).

  2. Slide the window one base at a time across the sequence.

  3. For each window position, compute the arithmetic mean of the per-base scores within that window. This mean is the G4Hscore for that window.

\[\text{G4Hscore}(i) = \frac{1}{k} \sum_{j=i}^{i+k-1} \text{score}(j)\]

The G4Hscore has several useful properties:

  • Its expected value is zero for random sequences, regardless of overall GC content, because G and C contributions cancel on average when a strand has no G/C asymmetry.

  • Positive values indicate G-rich regions with G4-forming potential on the scored (forward) strand.

  • Negative values indicate C-rich regions, implying G4-forming potential on the complementary strand.

  • The magnitude reflects G4 propensity strength.
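As a rough sketch of the windowing step, the mean of every length-k window can be computed in a single pass using prefix sums. The function name window_means is hypothetical, and the prefix-sum approach is one possible way to get linear running time, not necessarily the one used in g4hunterpy3.

```python
from itertools import accumulate


def window_means(scores, k=25):
    """Mean per-base score of every length-k window (illustrative sketch).

    Returns an empty list when the sequence is shorter than the window.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # prefix[i] holds the sum of scores[:i], so any window sum is one subtraction.
    prefix = [0] + list(accumulate(scores))
    return [(prefix[i + k] - prefix[i]) / k for i in range(len(scores) - k + 1)]


# Per-base scores of GGGATTCCGGGGAT, scanned with a short window of 5 nt.
scores = [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
means = window_means(scores, k=5)
print(len(means))   # 10 windows for a 14-nt sequence
print(means[0])     # 1.8, i.e. (3 + 3 + 3 + 0 + 0) / 5
```

A sequence of length n yields n - k + 1 windows, so very short inputs produce few or no scores.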

Window size considerations

The default window size of 25 nt was chosen to match the typical length of experimentally characterized G-quadruplex-forming sequences (~26 nt mean in the Bedrat et al. reference dataset). The window size can be adjusted:

  • Smaller windows (15–20 nt) increase sensitivity for short G4 motifs but may increase false positives, especially from single long G-runs that cannot form intramolecular G4 structures alone.

  • Larger windows (30–100 nt) can identify broader regions where multiple G4 structures may form in tandem, which may be biologically relevant (e.g., for replication-associated DNA damage).

In g4hunterpy3, window size is set via window_size in the Python API or -w on the command line.
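The effect of window size on a single long G-run can be illustrated with a toy calculation. The score list below is hypothetical (one run of 12 Gs surrounded by neutral bases) and the helper max_window_mean is not part of g4hunterpy3.

```python
# Hypothetical per-base scores: a single run of 12 Gs (capped at +4 each)
# flanked by 20 neutral bases on either side.
scores = [0] * 20 + [4] * 12 + [0] * 20


def max_window_mean(scores, k):
    """Largest mean over all length-k windows (simple O(n*k) version)."""
    return max(sum(scores[i:i + k]) / k for i in range(len(scores) - k + 1))


best = {k: round(max_window_mean(scores, k), 2) for k in (15, 25, 50)}
print(best)  # {15: 3.2, 25: 1.92, 50: 0.96}
```

The same isolated G-run scores well above 1.5 in a 15-nt window but falls below 1.0 in a 50-nt window, which is why small windows are more prone to flagging lone G-tracts.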

Thresholding and Hit Calling

Windows whose absolute G4Hscore meets or exceeds a user-defined threshold θ are reported as candidate G4-forming regions. Taking the absolute value captures G4-forming potential on both strands at once.

\[|\text{G4Hscore}(i)| \geq \theta\]

The threshold θ controls the trade-off between sensitivity and specificity:

| Threshold | Precision | Recommended use |
| --- | --- | --- |
| 1.0 | ~73% | Most inclusive; recovers the maximum number of true G4-forming sequences but with a higher false positive rate. Miss rate ~6%. |
| 1.2 | ~85% | Recommended compromise. Identifies many true G4 motifs while maintaining reasonable precision. Miss rate ~15%. |
| 1.5 | >90% | High-confidence predictions. Most sequences identified at this threshold are experimentally confirmed G4 formers. |
| 1.75 | ~95% | Very stringent; near-zero false positives on the mitochondrial genome validation set. Best for applications requiring certainty (e.g., PCR primer or DNA origami design). |
| 2.0 | >98% | Most stringent; identifies only the strongest G4-forming sequences. |

These precision estimates are based on experimental validation of 209 sequences from the human mitochondrial genome using six independent biophysical methods (CD, NMR, TDS, IDS, UV-melting, and thioflavin T fluorescence).

In g4hunterpy3, the threshold is set via threshold in the Python API or -s on the command line.
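Hit calling amounts to a filter on the absolute window score. The sketch below assumes 0-based, half-open window coordinates and uses a hypothetical function name, find_hits; neither is taken from the g4hunterpy3 API. The window scores are made-up values chosen to show both strands.

```python
def find_hits(window_scores, k, threshold=1.2):
    """Return (start, end, score) for windows with |score| >= threshold.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    return [
        (i, i + k, score)
        for i, score in enumerate(window_scores)
        if abs(score) >= threshold
    ]


# Made-up window scores: two G-rich windows and one C-rich window pass.
window_scores = [0.4, 1.2, 1.6, -1.5, -0.8]
for start, end, score in find_hits(window_scores, k=25):
    strand = "forward" if score > 0 else "reverse complement"
    print(start, end, score, strand)
```

The sign of each retained score still indicates which strand carries the G-rich sequence, even though the comparison itself uses the absolute value.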

Region Merging

In a genome-wide scan, consecutive windows with scores above the threshold typically overlap extensively (windows shifted by 1 nt share k-1 bases). These overlapping windows are merged into contiguous regions:

  1. Sort window hits by start position.

  2. Merge each window into the current region when the two overlap (i.e., next window start ≤ current region end - 1); otherwise start a new region.

  3. For each merged region, compute the region score as the mean of the per-base scores across the full region (not the mean of window scores).

  4. Report the merged region with its coordinates, sequence, length, score, and number of contributing windows.

This produces a non-redundant set of candidate G4-forming sequences (G4FS) from the genome.
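The merging steps can be sketched as follows. The function name merge_hits, the dictionary keys, and the 0-based half-open coordinates are assumptions made for illustration; they are not the output format of g4hunterpy3.

```python
def merge_hits(hits, base_scores):
    """Merge overlapping (start, end) windows into regions (illustrative sketch).

    Coordinates are assumed 0-based and half-open. Each region's score is the
    mean of the per-base scores it spans, not the mean of its window scores.
    """
    merged = []  # each entry: [start, end, number_of_windows]
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] - 1:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current region
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])           # start a new region
    return [
        {
            "start": start,
            "end": end,
            "length": end - start,
            "score": sum(base_scores[start:end]) / (end - start),
            "n_windows": n,
        }
        for start, end, n in merged
    ]


base_scores = [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
hits = [(0, 5), (1, 6), (8, 13), (9, 14)]  # 5-nt windows, for brevity
for region in merge_hits(hits, base_scores):
    print(region)
```

Here four overlapping windows collapse into two regions, spanning positions 0 to 6 and 8 to 14, each built from two windows.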

Scoring Both Strands Simultaneously

A key advantage of the G4Hunter scoring scheme is that both strands of a DNA duplex are scored simultaneously. Because C-runs receive negative scores while G-runs receive positive scores:

  • A region with a positive G4Hscore has G4-forming potential on the forward (scored) strand.

  • A region with a negative G4Hscore has G4-forming potential on the reverse complement strand (which is G-rich where the forward strand is C-rich).

This means a single pass through a sequence identifies G4 candidates on both strands. The --strand-agnostic option in the CLI takes absolute values of scores, treating both strands equally when plotting.
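The strand symmetry can be checked directly: scoring the reverse complement gives the forward-strand scores reversed and negated. The helpers below are hypothetical and written only for this check; the scorer assumes DNA input and treats any symbol other than G or C as neutral.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def per_base_scores(seq, cap=4):
    """Run-length scores: +min(run, cap) per G, -min(run, cap) per C, else 0."""
    scores = [0] * len(seq)
    for match in re.finditer(r"G+|C+", seq.upper()):
        run = match.group()
        value = min(len(run), cap)
        if run[0] == "C":
            value = -value
        scores[match.start():match.end()] = [value] * len(run)
    return scores


forward = "GGGATTCCGGGGAT"
fwd_scores = per_base_scores(forward)
rev_scores = per_base_scores(reverse_complement(forward))

# The reverse-complement scores mirror the forward scores with opposite sign.
print(rev_scores == [-s for s in reversed(fwd_scores)])  # True
```

Because every window mean is just an average of these values, a window scoring -1.6 on the forward strand corresponds to a window scoring +1.6 on the reverse complement.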

Comparison with Quadparser

Bedrat et al. systematically compared G4Hunter against Quadparser on a reference dataset of 392 sequences with experimentally known G4 formation status, and on 209 sequences from the human mitochondrial genome:

| Method | True G4 found | False positive rate | Type |
| --- | --- | --- | --- |
| G4Hunter (θ=1.0) | 281 / 298 | 10.6% | Continuous score |
| G4Hunter (θ=1.2) | 252 / 298 | 6.4% | Continuous score |
| Quadparser (QP37) | 196 / 298 | 1.1% | Binary |
| Quadparser (QP27) | More hits | Higher FPR | Binary |

Key advantages of G4Hunter:

  1. Fewer false negatives. G4Hunter identifies G4-forming sequences that lack the canonical (G₃₊N₁₋₇)₄ pattern — sequences with bulges, interrupted G-runs, or non-standard loop lengths.

  2. Quantitative output. The continuous score allows ranking sequences by G4-forming propensity and tuning sensitivity via the threshold.

  3. Both strands. A single scan reports candidates on both strands.

  4. Strong discrimination. The ROC AUC exceeds 0.96 on the reference dataset.
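For contrast, a pattern-based search in the style of Quadparser can be approximated with a regular expression for the canonical (G₃₊N₁₋₇)₄ motif. This is an approximation written for illustration, not Quadparser's actual code, and the second test sequence is a hypothetical example of an interrupted G-run; no claim is made about whether it folds.

```python
import re

# Four runs of at least three Gs separated by loops of 1 to 7 nucleotides.
CANONICAL_G4 = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def matches_canonical_motif(seq):
    """True if the sequence contains the canonical motif (binary answer only)."""
    return CANONICAL_G4.search(seq.upper()) is not None


telomeric = "GGGTTAGGGTTAGGGTTAGGG"   # four clean GGG runs
interrupted = "GGGTGGAGTGGGTGGG"      # hypothetical: one run broken into GG, A, G

print(matches_canonical_motif(telomeric))    # True
print(matches_canonical_motif(interrupted))  # False
```

The pattern returns only a yes or no, and a single interruption in one G-run is enough to lose the match, whereas a score-based method still assigns such a sequence a graded value.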

Genomic Distribution of G4-Forming Sequences

Application of G4Hunter to 20 genomes revealed:

  • Mammalian genomes are the most G4-rich, with ~2.5 G4FS per kb at θ=1.0 and ~0.5 per kb at θ=1.5 (window size 25).

  • G4FS density decreases exponentially with increasing threshold, with similar exponential slopes across mammalian species.

  • Promoter regions are significantly enriched in G4FS (2–5.5-fold) compared to genomic background, particularly for stable G4 motifs (θ ≥ 1.5).

  • Enrichment is also observed in 5’ UTRs, first exons, and first exon/intron junctions, predominantly on the coding strand.

  • The number of G4-forming sequences in the human genome is estimated to be 2–10 times higher than the ~376,000 previously reported using Quadparser.

Implementation in g4hunterpy3

g4hunterpy3 implements the G4Hunter algorithm as described above with the following pipeline:

  1. base_scores() — assigns per-base scores using the run-length scoring scheme.

  2. window_mean_scores() — computes sliding-window means using fast O(n) convolution.

  3. find_window_hits() — identifies windows exceeding the absolute-score threshold.

  4. merge_overlapping_windows() — merges overlapping hits into non-redundant regions with per-base-mean region scores.

The convenience function scan_sequence() runs all four steps, and scan_fasta() applies the pipeline to every record in a FASTA file.

References