User Guide
==========

This guide covers how to use g4hunterpy3 in detail, including the Python API and
the command-line interface.

.. contents:: On this page
   :local:
   :depth: 2

Background: How G4Hunter Scoring Works
---------------------------------------

G4Hunter assigns a per-base score to every nucleotide in a sequence:

- **Guanine (G):** positive score equal to the length of the G-run (capped at 4).
  For example, each G in a ``GGG`` run scores +3.
- **Cytosine (C):** negative score equal to the length of the C-run (capped at -4).
  For example, each C in a ``CCCC`` run scores -4.
- **A, T, and other bases:** score 0.

These per-base scores are then averaged over a sliding window (default 25 nt) to
produce a **G4Hunter score** for each window position. Windows whose absolute
score meets or exceeds a threshold are reported as candidate G4-forming regions.
Overlapping windows are merged into contiguous regions.

Recommended thresholds:

- **1.2**: good compromise for identifying many true G4 motifs
- **1.5**: high-confidence predictions (precision >90%)
- **2.0**: very high propensity

For more information see
`Bedrat et al. 2016 <https://doi.org/10.1093/nar/gkw006>`_.

Python API
----------

The core functionality lives in :mod:`g4hunterpy3.core`.

Scanning a single sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^

Use :func:`~g4hunterpy3.core.scan_sequence` for the simplest workflow:

.. code-block:: python

   from g4hunterpy3.core import scan_sequence

   seq = "ATGGGGATTTTGGGGCCCGGGGATTTGGGG"
   window_scores, hits, regions = scan_sequence(
       seq, window_size=25, threshold=1.2
   )

This returns three objects:

1. ``window_scores``: a NumPy array of per-window mean scores.
2. ``hits``: a list of :class:`~g4hunterpy3.core.WindowHit` objects, one per
   window that passes the threshold.
3. ``regions``: a list of :class:`~g4hunterpy3.core.Region` objects formed by
   merging overlapping hits.

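
To make the scoring rules from the Background section concrete, the following is
a minimal, library-independent sketch of the three steps: per-base scoring,
sliding-window averaging, and merging of overlapping windows. It is not the
g4hunterpy3 implementation. The ``sketch_*`` function names are hypothetical, it
works on plain Python lists rather than NumPy arrays, and it assumes that two
windows count as overlapping only when they share at least one base.

```python
from itertools import groupby


def sketch_base_scores(seq):
    """Score each base: +run length for G runs, -run length for C runs (capped at 4)."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if base == "G":
            value = min(n, 4)
        elif base == "C":
            value = -min(n, 4)
        else:
            value = 0  # A, T, and any other symbol
        scores.extend([value] * n)
    return scores


def sketch_window_means(scores, window_size):
    """Mean score of every full window, sliding one base at a time."""
    if window_size > len(scores):
        return []
    total = sum(scores[:window_size])
    means = [total / window_size]
    for i in range(window_size, len(scores)):
        # Update the running sum: add the entering base, drop the leaving one.
        total += scores[i] - scores[i - window_size]
        means.append(total / window_size)
    return means


def sketch_merge_hits(means, window_size, threshold):
    """Merge passing windows that share bases into (start, end) spans.

    Coordinates are 0-based and end-exclusive. Assumption: windows that merely
    touch (start == previous end) are kept separate.
    """
    regions = []
    for start, mean in enumerate(means):
        if abs(mean) < threshold:
            continue
        end = start + window_size
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


bs = sketch_base_scores("GGGGTTTTGGGG")
ws = sketch_window_means(bs, window_size=4)
print(ws)                                          # [4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(sketch_merge_hits(ws, 4, threshold=2.0))     # [(0, 6), (6, 12)]
```

With the stricter threshold of 2.0, the two G-runs yield two separate spans
because the windows in the middle of the sequence fall below the cutoff. The
library may treat adjacent windows differently, so compare against
:func:`~g4hunterpy3.core.scan_sequence` before relying on this behaviour.
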
Working with WindowHits and Regions
"""""""""""""""""""""""""""""""""""

Each :class:`~g4hunterpy3.core.WindowHit` has ``start``, ``end`` (0-based,
end-exclusive), and ``score`` attributes:

.. code-block:: python

   for h in hits:
       print(f"Window [{h.start}:{h.end}] score={h.score:.2f}")

Each :class:`~g4hunterpy3.core.Region` adds ``sequence``, ``length``, and
``n_windows``:

.. code-block:: python

   for r in regions:
       print(f"Region [{r.start}:{r.end}] len={r.length} "
             f"score={r.score:.2f} ({r.n_windows} windows merged)")
       print(f"  Sequence: {r.sequence}")

Scanning a FASTA file
^^^^^^^^^^^^^^^^^^^^^

Use :func:`~g4hunterpy3.core.scan_fasta` to iterate over all records in a FASTA
file:

.. code-block:: python

   from g4hunterpy3.core import scan_fasta

   results = scan_fasta("sequences.fasta", window_size=25, threshold=1.2)
   for record_id, (window_scores, hits, regions) in results.items():
       print(f">{record_id}: {len(hits)} hits, {len(regions)} regions")

Step-by-step API
^^^^^^^^^^^^^^^^

For more control, you can call the individual functions:

.. code-block:: python

   from g4hunterpy3.core import (
       base_scores,
       window_mean_scores,
       find_window_hits,
       merge_overlapping_windows,
   )

   seq = "GGGGTTTTGGGG"

   # Step 1: per-base scores
   bs = base_scores(seq)
   # array([ 4, 4, 4, 4, 0, 0, 0, 0, 4, 4, 4, 4])

   # Step 2: sliding-window means
   ws = window_mean_scores(bs, window_size=4)

   # Step 3: find windows above threshold
   hits = find_window_hits(ws, window_size=4, threshold=1.0)

   # Step 4: merge overlapping hits into regions
   regions = merge_overlapping_windows(hits, seq, base_score_array=bs)

Plotting
^^^^^^^^

The :mod:`g4hunterpy3.plotting` module provides two visualization functions.

**Simple plot**: a line plot of sliding-window scores:

.. code-block:: python

   from g4hunterpy3.core import scan_sequence
   from g4hunterpy3.plotting import simple_plot

   ws, hits, regions = scan_sequence(seq, window_size=25, threshold=1.2)
   simple_plot(ws, "output_scores.pdf")

**Complex plot**: a binned heatmap suitable for large genomes:

.. code-block:: python

   from g4hunterpy3.plotting import complex_plot

   complex_plot(
       hits,
       genome_length=len(seq),
       out_pdf="output_complex.pdf",
       nbins=500,
       score=1.2,
       strand_agnostic=True,
       highlight_regions=[[1000, 2000], [5000, 6000]],
   )

Command-Line Interface
-----------------------

After installation, the ``g4hunterpy3`` command is available in your terminal.

Basic usage
^^^^^^^^^^^

.. code-block:: bash

   g4hunterpy3 -i -o [options]

CLI options
^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 10 10 55

   * - Option
     - Short
     - Default
     - Description
   * - ``--input``
     - ``-i``
     - *(required)*
     - Path to the input FASTA file.
   * - ``--output``
     - ``-o``
     - *(required)*
     - Output directory (created if it doesn't exist).
   * - ``--window``
     - ``-w``
     - 25
     - Sliding window size in bases.
   * - ``--score``
     - ``-s``
     - 1.2
     - Absolute score threshold for calling hits.
   * - ``--info``
     -
     - off
     - Print sequence info (length, hit/region counts).
   * - ``--simple-plot``
     -
     - off
     - Write a PDF line plot of sliding-window scores.
   * - ``--complex-plot``
     -
     - off
     - Write a PDF binned heatmap (for large sequences).
   * - ``--complex-plot-nbins``
     -
     - 1000
     - Number of bins for the complex plot.
   * - ``--complex-plot-percentile``
     -
     - 95
     - Percentile for y-axis limit in complex plot.
   * - ``--strand-agnostic``
     -
     - off
     - Use absolute scores (ignores strand) in complex plot.
   * - ``--highlight-regions``
     -
     -
     - Regions to highlight (``START:END`` pairs, 1-based).

CLI examples
^^^^^^^^^^^^

**Basic analysis:**

.. code-block:: bash

   g4hunterpy3 -i sequences.fasta -o results/

**Custom window size and stricter threshold:**

.. code-block:: bash

   g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

**Print sequence info:**

.. code-block:: bash

   g4hunterpy3 -i sequences.fasta -o results/ --info

**Generate plots:**

.. code-block:: bash

   # simple line plot
   g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

   # complex binned heatmap for a genome
   g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

**Highlight genomic regions on complex plot:**

.. code-block:: bash

   g4hunterpy3 -i genome.fasta -o results/ \
       --complex-plot \
       --highlight-regions 1000:2000 5000:6000 8000:9000

**Strand-agnostic vs strand-specific:**

.. code-block:: bash

   # strand-specific (default): blue = C-rich, red = G-rich
   g4hunterpy3 -i genome.fasta -o results/ --complex-plot

   # strand-agnostic: all G4-forming regions in red
   g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic

Output files
^^^^^^^^^^^^

For each FASTA record, the CLI writes:

1. **Per-window hit file** (``-W-S.txt``): tab-separated with columns Start,
   End, Sequence, Length, Score (1-based coordinates).
2. **Merged region file** (``-Merged.txt``): tab-separated with columns Start,
   End, Sequence, Length, Score, NBR (1-based coordinates).
3. **Plot files** (optional):

   - ``-ScorePlot.pdf``: simple line plot (with ``--simple-plot``)
   - ``-ComplexScorePlot.pdf``: binned heatmap (with ``--complex-plot``)

Understanding scores
^^^^^^^^^^^^^^^^^^^^

- **Positive scores** → G-rich regions (G4-forming on the forward strand).
- **Negative scores** → C-rich regions (G4-forming on the reverse strand).
- Score magnitudes:

  - \|score\| ≥ 1.2: moderate propensity
  - \|score\| ≥ 1.5: high propensity
  - \|score\| ≥ 2.0: very high propensity

How to Cite
-----------

Please cite the original G4Hunter paper and link to the g4hunterpy3 repository:

   Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex
   propensity with G4Hunter. *Nucleic Acids Res.* **44**, 1746–1759 (2016).
   `doi:10.1093/nar/gkw006 <https://doi.org/10.1093/nar/gkw006>`_