User Guide#

This guide covers how to use g4hunterpy3 in detail, including the Python API and the command-line interface.

Background: How G4Hunter Scoring Works#

G4Hunter assigns a per-base score to every nucleotide in a sequence:

  • Guanine (G): positive score equal to the length of the G-run (capped at 4). For example, each G in a GGG run scores +3.

  • Cytosine (C): negative score equal to the length of the C-run (capped at -4). For example, each C in a CCCC run scores -4.

  • A, T, and other bases: score 0.

These per-base scores are then averaged over a sliding window (default 25 nt) to produce a G4Hunter score for each window position. Windows whose absolute score meets or exceeds a threshold are reported as candidate G4-forming regions. Overlapping windows are merged into contiguous regions.

Recommended thresholds:

  • 1.2 — good compromise for identifying many true G4 motifs

  • 1.5 — high-confidence predictions (precision >90%)

  • 2.0 — very high propensity

For more information see Bedrat et al. 2016.

Python API#

The core functionality lives in g4hunterpy3.core.

Scanning a single sequence#

Use scan_sequence() for the simplest workflow:

from g4hunterpy3.core import scan_sequence

seq = "ATGGGGATTTTGGGGCCCGGGGATTTGGGG"
window_scores, hits, regions = scan_sequence(
    seq, window_size=25, threshold=1.2
)

This returns three objects:

  1. window_scores — a NumPy array of per-window mean scores.

  2. hits — a list of WindowHit objects, one per window that passes the threshold.

  3. regions — a list of Region objects formed by merging overlapping hits.

Working with WindowHits and Regions#

Each WindowHit has start, end (0-based, end-exclusive), and score attributes:

for h in hits:
    print(f"Window [{h.start}:{h.end}] score={h.score:.2f}")

Each Region adds sequence, length, and n_windows:

for r in regions:
    print(f"Region [{r.start}:{r.end}] len={r.length} "
          f"score={r.score:.2f} ({r.n_windows} windows merged)")
    print(f"  Sequence: {r.sequence}")

Scanning a FASTA file#

Use scan_fasta() to iterate over all records in a FASTA file:

from g4hunterpy3.core import scan_fasta

results = scan_fasta("sequences.fasta", window_size=25, threshold=1.2)

for record_id, (window_scores, hits, regions) in results.items():
    print(f">{record_id}: {len(hits)} hits, {len(regions)} regions")

Step-by-step API#

For more control, you can call the individual functions:

from g4hunterpy3.core import (
    base_scores,
    window_mean_scores,
    find_window_hits,
    merge_overlapping_windows,
)

seq = "GGGGTTTTGGGG"

# Step 1: per-base scores
bs = base_scores(seq)
# array([ 4,  4,  4,  4,  0,  0,  0,  0,  4,  4,  4,  4])

# Step 2: sliding-window means
ws = window_mean_scores(bs, window_size=4)

# Step 3: find windows above threshold
hits = find_window_hits(ws, window_size=4, threshold=1.0)

# Step 4: merge overlapping hits into regions
regions = merge_overlapping_windows(hits, seq, base_score_array=bs)

Plotting#

The g4hunterpy3.plotting module provides two visualization functions.

Simple plot — a line plot of sliding-window scores:

from g4hunterpy3.core import scan_sequence
from g4hunterpy3.plotting import simple_plot

ws, hits, regions = scan_sequence(seq, window_size=25, threshold=1.2)
simple_plot(ws, "output_scores.pdf")

Complex plot — a binned heatmap suitable for large genomes:

from g4hunterpy3.plotting import complex_plot

complex_plot(
    hits,
    genome_length=len(seq),
    out_pdf="output_complex.pdf",
    nbins=500,
    score=1.2,
    strand_agnostic=True,
    highlight_regions=[[1000, 2000], [5000, 6000]],
)

Command-Line Interface#

After installation, the g4hunterpy3 command is available in your terminal.

Basic usage#

g4hunterpy3 -i <input.fasta> -o <output_directory> [options]

CLI options#

Option

Short

Default

Description

--input

-i

(required)

Path to the input FASTA file.

--output

-o

(required)

Output directory (created if it doesn’t exist).

--window

-w

25

Sliding window size in bases.

--score

-s

1.2

Absolute score threshold for calling hits.

--info

off

Print sequence info (length, hit/region counts).

--simple-plot

off

Write a PDF line plot of sliding-window scores.

--complex-plot

off

Write a PDF binned heatmap (for large sequences).

--complex-plot-nbins

1000

Number of bins for the complex plot.

--complex-plot-percentile

95

Percentile for y-axis limit in complex plot.

--strand-agnostic

off

Use absolute scores (ignores strand) in complex plot.

--highlight-regions

Regions to highlight (START:END pairs, 1-based).

CLI examples#

Basic analysis:

g4hunterpy3 -i sequences.fasta -o results/

Custom window size and stricter threshold:

g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

Print sequence info:

g4hunterpy3 -i sequences.fasta -o results/ --info

Generate plots:

# simple line plot
g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

# complex binned heatmap for a genome
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

Highlight genomic regions on complex plot:

g4hunterpy3 -i genome.fasta -o results/ \
    --complex-plot \
    --highlight-regions 1000:2000 5000:6000 8000:9000

Strand-agnostic vs strand-specific:

# strand-specific (default): blue = C-rich, red = G-rich
g4hunterpy3 -i genome.fasta -o results/ --complex-plot

# strand-agnostic: all G4-forming regions in red
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic

Output files#

For each FASTA record, the CLI writes:

  1. Per-window hit file (<header>-W<k>-S<threshold>.txt) — tab-separated with columns: Start, End, Sequence, Length, Score (1-based coordinates).

  2. Merged region file (<header>-Merged.txt) — tab-separated with columns: Start, End, Sequence, Length, Score, NBR (1-based coordinates).

  3. Plot files (optional): - <header>-ScorePlot.pdf — simple line plot (with --simple-plot) - <header>-ComplexScorePlot.pdf — binned heatmap (with --complex-plot)

Understanding scores#

  • Positive scores → G-rich regions (G4-forming on the forward strand).

  • Negative scores → C-rich regions (G4-forming on the reverse strand).

  • Score magnitudes: - |score| ≥ 1.2 — moderate propensity - |score| ≥ 1.5 — high propensity - |score| ≥ 2.0 — very high propensity

How to Cite#

Please cite the original G4Hunter paper and link to the g4hunterpy3 repository:

Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 44, 1746–1759 (2016). doi:10.1093/nar/gkw006