The G4Hunter Algorithm#

This page provides a detailed explanation of the G4Hunter algorithm as described by Bedrat, Lacroix & Mergny (2016) [1]. Understanding the algorithm helps interpret g4hunterpy3 output and choose appropriate parameters.

What Are G-Quadruplexes?#

G-quadruplexes (G4) are four-stranded nucleic acid structures formed by guanine-rich sequences. Their building block is the guanine quartet (G-quartet) — a planar association of four guanines held together by Hoogsteen hydrogen bonds. Two or more stacked G-quartets, stabilized by coordinated cations (typically K⁺), form a G-quadruplex.

G4 structures have attracted significant biological interest because they:

Are involved in telomere biology and chromosome end-capping.
Play roles in transcription regulation (e.g., the MYC promoter G4).
Affect DNA replication, acting as replication barriers.
Are implicated in genomic instability — mitochondrial DNA deletion breakpoints map near G4-forming regions.
Serve as drug targets for anticancer and antiviral therapeutics.
Are enriched in gene promoters, particularly those of proto-oncogenes.

Why G4Hunter?#

Prior to G4Hunter, the dominant approach for predicting G4-forming sequences was Quadparser, which searches for patterns matching the motif G_n N_m G_n N_o G_n N_p G_n: four G-runs separated by three loops of constrained size. While useful, Quadparser has important limitations:

False negatives: sequences that do form G4 structures experimentally but lack the canonical pattern are missed. Some G4 structures contain interrupted G-runs, bulges, or non-standard loop lengths.
False positives: some sequences matching the consensus fail to form G4 structures in vitro because competing duplex formation (particularly in GC-alternating regions) is more favorable.
Binary output: Quadparser provides only a yes/no answer, without a quantitative measure of G4-forming propensity.

G4Hunter addresses all three issues by providing a continuous score based on two sequence properties that are correlated with G4 formation:

G-richness — the density and clustering of guanines.
G-skewness — the asymmetry between G and C content on a given strand.

Per-Base Scoring #

The foundation of G4Hunter is a per-base scoring scheme that assigns every nucleotide in a sequence a score between -4 and +4:

Base	Score	Rationale
A, T (U)	0	Neutral — do not contribute to or compete with G4 formation.
G in a single-G context	+1	Weak G-richness signal.
Each G in a GG run	+2	Moderate G-richness.
Each G in a GGG run	+3	Strong G-richness — the classic G-tract.
Each G in a run of 4+ Gs	+4	Very strong G-richness; the per-base score is capped at 4.
C	Negative (same magnitude rules as G)	C-runs oppose G4 formation on the scored strand, reflecting competition from stable Watson-Crick duplexes.

Worked example:

Sequence:     G   G   G   A   T   T   C   C   G   G   G   G   A   T
Run lengths:  3   3   3   -   -   -   2   2   4   4   4   4   -   -
Scores:      +3  +3  +3   0   0   0  -2  -2  +4  +4  +4  +4   0   0

Key design properties:

Runs of G are rewarded superlinearly. A GGG run scores 3 per base (total 9), far more than three isolated G’s (total 3). This captures the physical reality that consecutive guanines are needed to form G-quartets.
C-runs receive negative scores. This simultaneously (a) penalizes GC-alternating regions where stable duplex formation competes with G4, and (b) allows scoring the complementary strand — a highly negative score on one strand implies a highly positive score (and G4 potential) on the reverse complement.
Scores are capped at ±4. Runs longer than four bases still score +4 (G) or -4 (C) per base. This prevents extremely long G-tracts or C-tracts from dominating the score disproportionately.
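The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the g4hunterpy3 source: the function name per_base_scores and its cap parameter are hypothetical, and the sketch assumes that any base other than G or C (including N) scores 0.

```python
from itertools import groupby


def per_base_scores(seq, cap=4):
    """Run-length scoring: +run length for G, -run length for C, 0 otherwise.

    Hypothetical helper for illustration; the magnitude is capped at `cap`.
    """
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))          # length of the current homopolymer run
        if base == "G":
            value = min(n, cap)
        elif base == "C":
            value = -min(n, cap)
        else:
            value = 0               # assumption: A, T, U, N and others are neutral
        scores.extend([value] * n)  # every base in the run gets the same score
    return scores


print(per_base_scores("GGGATTCCGGGGAT"))
# [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
```

Grouping the sequence into runs first keeps the sketch linear in sequence length, since each base is visited once.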

Sliding Window Scoring (G4Hscore)#

For genome-wide or sequence-level analysis, per-base scores are smoothed using a sliding window:

Fix a window size k (default: 25 nt).
Slide the window one base at a time across the sequence.
For each window position, compute the arithmetic mean of the per-base scores within that window. This mean is the G4Hscore for that window.

\[\text{G4Hscore}(i) = \frac{1}{k} \sum_{j=i}^{i+k-1} \text{score}(j)\]

The G4Hscore has several useful properties:

Its expected value is zero for random sequences in which G and C are equally frequent, whatever the overall GC content, because positive G contributions and negative C contributions cancel when there is no strand asymmetry.
Positive values indicate G-rich regions with G4-forming potential on the scored (forward) strand.
Negative values indicate C-rich regions, implying G4-forming potential on the complementary strand.
The magnitude reflects G4 propensity strength.

Window size considerations #

The default window size of 25 nt was chosen to match the typical length of experimentally characterized G-quadruplex-forming sequences (~26 nt mean in the Bedrat et al. reference dataset). The window size can be adjusted:

Smaller windows (15–20 nt) increase sensitivity for short G4 motifs but may increase false positives, especially from single long G-runs that cannot form intramolecular G4 structures alone.
Larger windows (30–100 nt) can identify broader regions where multiple G4 structures may form in tandem, which may be biologically relevant (e.g., for replication-associated DNA damage).

In g4hunterpy3, window size is set via window_size in the Python API or -w on the command line.
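The effect of window size on a single long G-run can be checked with simple arithmetic. The sketch below assumes a lone run of Gs flanked only by A and T, so every other base in the window scores 0; the function name lone_run_score is hypothetical.

```python
def lone_run_score(run_length, window, cap=4):
    """Best-case G4Hscore of one G-run surrounded by A/T only (illustrative).

    Each G in the run scores min(run_length, cap); all other bases score 0.
    """
    covered = min(run_length, window)  # Gs that fit inside the window
    return min(run_length, cap) * covered / window


for k in (15, 25, 100):
    print(k, round(lone_run_score(8, k), 2))
# 15 2.13
# 25 1.28
# 100 0.32
```

Under these assumptions, an isolated run of eight Gs passes even a threshold of 2.0 with a 15-nt window, although a single G-run cannot fold into an intramolecular G4 on its own.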

Thresholding and Hit Calling #

Windows whose absolute G4Hscore is at least a user-defined threshold are reported as candidate G4-forming regions. Taking the absolute value means that G4-forming potential on both strands is captured in the same pass.

\[|\text{G4Hscore}(i)| \geq \theta\]

The threshold θ controls the trade-off between sensitivity and specificity:

Threshold	Precision	Recommended use
1.0	~73%	Most inclusive; recovers the maximum number of true G4-forming sequences but with a higher false positive rate. Miss rate ~6%.
1.2	~85%	Recommended compromise. Identifies many true G4 motifs while maintaining reasonable precision. Miss rate ~15%.
1.5	>90%	High-confidence predictions. Most sequences identified at this threshold are experimentally confirmed G4 formers.
1.75	~95%	Very stringent; near-zero false positives on the mitochondrial genome validation set. Best for applications requiring certainty (e.g., PCR primer or DNA origami design).
2.0	>98%	Most stringent; identifies only the strongest G4-forming sequences.

These precision estimates are based on experimental validation of 209 sequences from the human mitochondrial genome using six independent biophysical methods (CD, NMR, TDS, IDS, UV-melting, and thioflavin T fluorescence).

In g4hunterpy3, the threshold is set via threshold in the Python API or -s on the command line.
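Hit calling with the absolute-value rule can be sketched as follows. The function name call_hits and the tuple layout are hypothetical, and the window scores are made-up values chosen only to show both signs.

```python
def call_hits(window_scores, threshold=1.2):
    """Return (start, score, strand) for windows with |score| >= threshold.

    Illustrative only: '+' marks G-rich windows on the scored strand,
    '-' marks C-rich windows (G-rich on the complementary strand).
    """
    hits = []
    for start, score in enumerate(window_scores):
        if abs(score) >= threshold:
            hits.append((start, score, "+" if score > 0 else "-"))
    return hits


toy_scores = [0.4, 1.3, 1.6, -0.2, -1.7, -1.25]
print(call_hits(toy_scores, threshold=1.2))
# [(1, 1.3, '+'), (2, 1.6, '+'), (4, -1.7, '-'), (5, -1.25, '-')]
print(call_hits(toy_scores, threshold=1.5))
# [(2, 1.6, '+'), (4, -1.7, '-')]
```

Raising the threshold from 1.2 to 1.5 halves the hits in this toy list, which mirrors the trade-off between sensitivity and precision described in the table.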

Region Merging #

In a genome-wide scan, consecutive windows with scores above the threshold typically overlap extensively (windows shifted by 1 nt share k-1 bases). These overlapping windows are merged into contiguous regions:

Sort window hits by start position.
Merge windows whose start positions overlap with existing regions (i.e., next window start ≤ current region end - 1).
For each merged region, compute the region score as the mean of the per-base scores across the full region (not the mean of window scores).
Report the merged region with its coordinates, sequence, length, score, and number of contributing windows.

This produces a non-redundant set of candidate G4-forming sequences (G4FS) from the genome.
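The merging steps above can be sketched as below. This is an illustration under stated assumptions rather than the library's code: the name merge_hits is hypothetical, all windows share one size k, and regions use half-open coordinates (start inclusive, end exclusive), so "start ≤ region end - 1" means the window shares at least one base with the region.

```python
def merge_hits(starts, k, scores):
    """Merge overlapping length-k windows into regions (hypothetical helper).

    starts: window start positions that passed the threshold
    scores: per-base scores for the whole sequence
    """
    regions = []  # each entry: [start, end (exclusive), number of windows]
    for start in sorted(starts):
        end = start + k
        if regions and start <= regions[-1][1] - 1:
            regions[-1][1] = max(regions[-1][1], end)  # extend current region
            regions[-1][2] += 1
        else:
            regions.append([start, end, 1])            # open a new region
    return [
        {
            "start": s,
            "end": e,
            "n_windows": n,
            # Region score: mean of per-base scores, not mean of window scores.
            "score": sum(scores[s:e]) / (e - s),
        }
        for s, e, n in regions
    ]


scores = [3, 3, 3, 0, 0, 0, -2, -2, 4, 4, 4, 4, 0, 0]
for region in merge_hits([0, 6, 7, 8, 9], k=5, scores=scores):
    print(region)
# {'start': 0, 'end': 5, 'n_windows': 1, 'score': 1.8}
# {'start': 6, 'end': 14, 'n_windows': 4, 'score': 1.5}
```

In the second region the four window scores average 2.5, while the per-base mean over the merged span is 1.5, which shows why the two definitions of region score are not interchangeable.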

Scoring Both Strands Simultaneously #

A key advantage of the G4Hunter scoring scheme is that both strands of a DNA duplex are scored simultaneously. Because C-runs receive negative scores while G-runs receive positive scores:

A region with a positive G4Hscore has G4-forming potential on the forward (scored) strand.
A region with a negative G4Hscore has G4-forming potential on the reverse complement strand (which is G-rich where the forward strand is C-rich).

This means a single pass through a sequence identifies G4 candidates on both strands. The --strand-agnostic option in the CLI takes absolute values of scores, treating both strands equally when plotting.
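The strand symmetry can be verified directly: scoring the reverse complement gives the forward scores negated and reversed. The sketch below is self-contained and uses hypothetical helper names (per_base_scores, reverse_complement); it assumes an uppercase or lowercase DNA alphabet of A, C, G, and T.

```python
from itertools import groupby


def per_base_scores(seq, cap=4):
    """Run-length scores: positive for G-runs, negative for C-runs, else 0."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        sign = {"G": 1, "C": -1}.get(base, 0)
        scores.extend([sign * min(n, cap)] * n)
    return scores


def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only, by assumption)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


seq = "GGGATTCCGGGGAT"
forward = per_base_scores(seq)
reverse = per_base_scores(reverse_complement(seq))

print(reverse_complement(seq))
# ATCCCCGGAATCCC
print(reverse == [-s for s in reversed(forward)])
# True
```

Because every window mean inherits this symmetry, a second scan of the reverse complement adds no information beyond the sign of the forward-strand score.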

Comparison with Quadparser #

Bedrat et al. systematically compared G4Hunter against Quadparser on a reference dataset of 392 sequences with experimentally known G4 formation status, and on 209 sequences from the human mitochondrial genome:

Method	True G4 found	False positive rate	Type
G4Hunter (θ=1.0)	281 / 298	10.6%	Continuous score
G4Hunter (θ=1.2)	252 / 298	6.4%	Continuous score
Quadparser (QP37)	196 / 298	1.1%	Binary
Quadparser (QP27)	More hits	Higher FPR	Binary
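The "True G4 found" column converts directly into sensitivity and miss rate. The short calculation below uses only the counts from the table; the helper name sensitivity is illustrative.

```python
def sensitivity(found, total):
    """Fraction of known G4-forming sequences that a method recovers."""
    return found / total


TOTAL_TRUE_G4 = 298  # denominator from the table above
found_counts = {
    "G4Hunter (θ=1.0)": 281,
    "G4Hunter (θ=1.2)": 252,
    "Quadparser (QP37)": 196,
}

for method, found in found_counts.items():
    s = sensitivity(found, TOTAL_TRUE_G4)
    print(f"{method}: sensitivity {s:.1%}, miss rate {1 - s:.1%}")
# G4Hunter (θ=1.0): sensitivity 94.3%, miss rate 5.7%
# G4Hunter (θ=1.2): sensitivity 84.6%, miss rate 15.4%
# Quadparser (QP37): sensitivity 65.8%, miss rate 34.2%
```

The two G4Hunter miss rates agree with the approximate values of 6% and 15% quoted for θ = 1.0 and θ = 1.2 in the threshold table.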

Key advantages of G4Hunter:

Fewer false negatives. G4Hunter identifies G4-forming sequences that lack the canonical (G₃₊N₁₋₇)₄ pattern — sequences with bulges, interrupted G-runs, or non-standard loop lengths.
Quantitative output. The continuous score allows ranking sequences by G4-forming propensity and tuning sensitivity via the threshold.
Both strands. A single scan reports candidates on both strands.
Strong discrimination. The ROC AUC exceeds 0.96 on the reference dataset, indicating excellent separation of G4-forming from non-forming sequences.

Genomic Distribution of G4-Forming Sequences #

Application of G4Hunter to 20 genomes revealed:

Mammalian genomes are the most G4-rich, with ~2.5 G4FS per kb at θ=1.0 and ~0.5 per kb at θ=1.5 (window size 25).
G4FS density decreases exponentially with increasing threshold, with similar exponential slopes across mammalian species.
Promoter regions are significantly enriched in G4FS (2–5.5-fold) compared to genomic background, particularly for stable G4 motifs (θ ≥ 1.5).
Enrichment is also observed in 5’ UTRs, first exons, and first exon/intron junctions, predominantly on the coding strand.
The number of G4-forming sequences in the human genome is estimated to be 2–10 times higher than the ~376,000 previously reported using Quadparser.

Implementation in g4hunterpy3 #

g4hunterpy3 implements the G4Hunter algorithm as described above with the following pipeline:

base_scores() — assigns per-base scores using the run-length scoring scheme.
window_mean_scores() — computes sliding-window means using fast O(n) convolution.
find_window_hits() — identifies windows exceeding the absolute-score threshold.
merge_overlapping_windows() — merges overlapping hits into non-redundant regions with per-base-mean region scores.

The convenience function scan_sequence() runs all four steps, and scan_fasta() applies the pipeline to every record in a FASTA file.