User Guide#

This guide covers how to use g4hunterpy3 in detail, including the Python API and the command-line interface.

Background: How G4Hunter Scoring Works #

G4Hunter assigns a per-base score to every nucleotide in a sequence:

Guanine (G): positive score equal to the length of the G-run (capped at 4). For example, each G in a GGG run scores +3.
Cytosine (C): negative score equal to the length of the C-run (capped at -4). For example, each C in a CCCC run scores -4.
A, T, and other bases: score 0.

These per-base scores are then averaged over a sliding window (default 25 nt) to produce a G4Hunter score for each window position. Windows whose absolute score meets or exceeds a threshold are reported as candidate G4-forming regions. Overlapping windows are merged into contiguous regions.

Recommended thresholds:

1.2 — good compromise for identifying many true G4 motifs
1.5 — high-confidence predictions (precision >90%)
2.0 — very high propensity

For more information see Bedrat et al. 2016.

Python API #

The core functionality lives in g4hunterpy3.core.

Scanning a single sequence #

Use scan_sequence() for the simplest workflow:

from g4hunterpy3.core import scan_sequence

seq = "ATGGGGATTTTGGGGCCCGGGGATTTGGGG"
window_scores, hits, regions = scan_sequence(
    seq, window_size=25, threshold=1.2
)

This returns three objects:

window_scores — a NumPy array of per-window mean scores.
hits — a list of WindowHit objects, one per window that passes the threshold.
regions — a list of Region objects formed by merging overlapping hits.

Working with WindowHits and Regions#

Each WindowHit has start, end (0-based, end-exclusive), and score attributes:

for h in hits:
    print(f"Window [{h.start}:{h.end}] score={h.score:.2f}")

Each Region adds sequence, length, and n_windows:

for r in regions:
    print(f"Region [{r.start}:{r.end}] len={r.length} "
          f"score={r.score:.2f} ({r.n_windows} windows merged)")
    print(f"  Sequence: {r.sequence}")

Scanning a FASTA file #

Use scan_fasta() to iterate over all records in a FASTA file:

from g4hunterpy3.core import scan_fasta

results = scan_fasta("sequences.fasta", window_size=25, threshold=1.2)

for record_id, (window_scores, hits, regions) in results.items():
    print(f">{record_id}: {len(hits)} hits, {len(regions)} regions")

Step-by-step API #

For more control, you can call the individual functions:

from g4hunterpy3.core import (
    base_scores,
    window_mean_scores,
    find_window_hits,
    merge_overlapping_windows,
)

seq = "GGGGTTTTGGGG"

# Step 1: per-base scores
bs = base_scores(seq)
# array([ 4,  4,  4,  4,  0,  0,  0,  0,  4,  4,  4,  4])

# Step 2: sliding-window means
ws = window_mean_scores(bs, window_size=4)

# Step 3: find windows above threshold
hits = find_window_hits(ws, window_size=4, threshold=1.0)

# Step 4: merge overlapping hits into regions
regions = merge_overlapping_windows(hits, seq, base_score_array=bs)

Plotting #

The g4hunterpy3.plotting module provides two visualization functions.

Simple plot — a line plot of sliding-window scores:

from g4hunterpy3.core import scan_sequence
from g4hunterpy3.plotting import simple_plot

ws, hits, regions = scan_sequence(seq, window_size=25, threshold=1.2)
simple_plot(ws, "output_scores.pdf")

Complex plot — a binned heatmap suitable for large genomes:

from g4hunterpy3.plotting import complex_plot

complex_plot(
    hits,
    genome_length=len(seq),
    out_pdf="output_complex.pdf",
    nbins=500,
    score=1.2,
    strand_agnostic=True,
    highlight_regions=[[1000, 2000], [5000, 6000]],
)

Command-Line Interface #

After installation, the g4hunterpy3 command is available in your terminal.

Basic usage #

g4hunterpy3 -i <input.fasta> -o <output_directory> [options]

CLI options #

Option	Short	Default	Description
`--input`	`-i`	(required)	Path to the input FASTA file.
`--output`	`-o`	(required)	Output directory (created if it doesn’t exist).
`--window`	`-w`	25	Sliding window size in bases.
`--score`	`-s`	1.2	Absolute score threshold for calling hits.
`--info`		off	Print sequence info (length, hit/region counts).
`--simple-plot`		off	Write a PDF line plot of sliding-window scores.
`--complex-plot`		off	Write a PDF binned heatmap (for large sequences).
`--complex-plot-nbins`		1000	Number of bins for the complex plot.
`--complex-plot-percentile`		95	Percentile for y-axis limit in complex plot.
`--strand-agnostic`		off	Use absolute scores (ignores strand) in complex plot.
`--highlight-regions`			Regions to highlight (`START:END` pairs, 1-based).

CLI examples #

Basic analysis:

g4hunterpy3 -i sequences.fasta -o results/

Custom window size and stricter threshold:

g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

Print sequence info:

g4hunterpy3 -i sequences.fasta -o results/ --info

Generate plots:

# simple line plot
g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

# complex binned heatmap for a genome
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

Highlight genomic regions on complex plot:

g4hunterpy3 -i genome.fasta -o results/ \
    --complex-plot \
    --highlight-regions 1000:2000 5000:6000 8000:9000

Strand-agnostic vs strand-specific:

# strand-specific (default): blue = C-rich, red = G-rich
g4hunterpy3 -i genome.fasta -o results/ --complex-plot

# strand-agnostic: all G4-forming regions in red
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic

Output files #

For each FASTA record, the CLI writes:

Per-window hit file (<header>-W<k>-S<threshold>.txt) — tab-separated with columns: Start, End, Sequence, Length, Score (1-based coordinates).
Merged region file (<header>-Merged.txt) — tab-separated with columns: Start, End, Sequence, Length, Score, NBR (1-based coordinates).
Plot files (optional): - <header>-ScorePlot.pdf — simple line plot (with --simple-plot) - <header>-ComplexScorePlot.pdf — binned heatmap (with --complex-plot)

Understanding scores #

Positive scores → G-rich regions (G4-forming on the forward strand).
Negative scores → C-rich regions (G4-forming on the reverse strand).
Score magnitudes: - |score| ≥ 1.2 — moderate propensity - |score| ≥ 1.5 — high propensity - |score| ≥ 2.0 — very high propensity

How to Cite #

Please cite the original G4Hunter paper and link to the g4hunterpy3 repository:

Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 44, 1746–1759 (2016). doi:10.1093/nar/gkw006