User Guide
==========

This guide covers how to use g4hunterpy3 in detail, including the Python API
and the command-line interface.


.. contents:: On this page
   :local:
   :depth: 2


Background: How G4Hunter Scoring Works
---------------------------------------

G4Hunter assigns a per-base score to every nucleotide in a sequence:

- **Guanine (G):** positive score equal to the length of the G-run (capped at 4).
  For example, each G in a ``GGG`` run scores +3.
- **Cytosine (C):** negative score equal to the length of the C-run (capped at -4).
  For example, each C in a ``CCCC`` run scores -4.
- **A, T, and other bases:** score 0.

These per-base scores are then averaged over a sliding window (default 25 nt)
to produce a **G4Hunter score** for each window position. Windows whose
absolute score meets or exceeds a threshold are reported as candidate
G4-forming regions. Overlapping windows are merged into contiguous regions.

Recommended thresholds:

- **1.2** — good compromise for identifying many true G4 motifs
- **1.5** — high-confidence predictions (precision >90%)
- **2.0** — very high propensity

For more information see
`Bedrat et al. 2016 <https://doi.org/10.1093/nar/gkw006>`_.


Python API
----------

The core functionality lives in :mod:`g4hunterpy3.core`.

Scanning a single sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^

Use :func:`~g4hunterpy3.core.scan_sequence` for the simplest workflow:

.. code-block:: python

    from g4hunterpy3.core import scan_sequence

    seq = "ATGGGGATTTTGGGGCCCGGGGATTTGGGG"
    window_scores, hits, regions = scan_sequence(
        seq, window_size=25, threshold=1.2
    )

This returns three objects:

1. ``window_scores`` — a NumPy array of per-window mean scores.
2. ``hits`` — a list of :class:`~g4hunterpy3.core.WindowHit` objects, one per
   window that passes the threshold.
3. ``regions`` — a list of :class:`~g4hunterpy3.core.Region` objects formed by
   merging overlapping hits.

Working with WindowHits and Regions
"""""""""""""""""""""""""""""""""""

Each :class:`~g4hunterpy3.core.WindowHit` has ``start``, ``end`` (0-based,
end-exclusive), and ``score`` attributes:

.. code-block:: python

    for h in hits:
        print(f"Window [{h.start}:{h.end}] score={h.score:.2f}")

Each :class:`~g4hunterpy3.core.Region` adds ``sequence``, ``length``,
and ``n_windows``:

.. code-block:: python

    for r in regions:
        print(f"Region [{r.start}:{r.end}] len={r.length} "
              f"score={r.score:.2f} ({r.n_windows} windows merged)")
        print(f"  Sequence: {r.sequence}")


Scanning a FASTA file
^^^^^^^^^^^^^^^^^^^^^

Use :func:`~g4hunterpy3.core.scan_fasta` to iterate over all records in a
FASTA file:

.. code-block:: python

    from g4hunterpy3.core import scan_fasta

    results = scan_fasta("sequences.fasta", window_size=25, threshold=1.2)

    for record_id, (window_scores, hits, regions) in results.items():
        print(f">{record_id}: {len(hits)} hits, {len(regions)} regions")


Step-by-step API
^^^^^^^^^^^^^^^^

For more control, you can call the individual functions:

.. code-block:: python

    from g4hunterpy3.core import (
        base_scores,
        window_mean_scores,
        find_window_hits,
        merge_overlapping_windows,
    )

    seq = "GGGGTTTTGGGG"

    # Step 1: per-base scores
    bs = base_scores(seq)
    # array([ 4,  4,  4,  4,  0,  0,  0,  0,  4,  4,  4,  4])

    # Step 2: sliding-window means
    ws = window_mean_scores(bs, window_size=4)

    # Step 3: find windows above threshold
    hits = find_window_hits(ws, window_size=4, threshold=1.0)

    # Step 4: merge overlapping hits into regions
    regions = merge_overlapping_windows(hits, seq, base_score_array=bs)


Plotting
^^^^^^^^

The :mod:`g4hunterpy3.plotting` module provides two visualization functions.

**Simple plot** — a line plot of sliding-window scores:

.. code-block:: python

    from g4hunterpy3.core import scan_sequence
    from g4hunterpy3.plotting import simple_plot

    ws, hits, regions = scan_sequence(seq, window_size=25, threshold=1.2)
    simple_plot(ws, "output_scores.pdf")

**Complex plot** — a binned heatmap suitable for large genomes:

.. code-block:: python

    from g4hunterpy3.plotting import complex_plot

    complex_plot(
        hits,
        genome_length=len(seq),
        out_pdf="output_complex.pdf",
        nbins=500,
        score=1.2,
        strand_agnostic=True,
        highlight_regions=[[1000, 2000], [5000, 6000]],
    )


Command-Line Interface
-----------------------

After installation, the ``g4hunterpy3`` command is available in your terminal.

Basic usage
^^^^^^^^^^^

.. code-block:: bash

    g4hunterpy3 -i <input.fasta> -o <output_directory> [options]


CLI options
^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 10 10 55

   * - Option
     - Short
     - Default
     - Description
   * - ``--input``
     - ``-i``
     - *(required)*
     - Path to the input FASTA file.
   * - ``--output``
     - ``-o``
     - *(required)*
     - Output directory (created if it doesn't exist).
   * - ``--window``
     - ``-w``
     - 25
     - Sliding window size in bases.
   * - ``--score``
     - ``-s``
     - 1.2
     - Absolute score threshold for calling hits.
   * - ``--info``
     -
     - off
     - Print sequence info (length, hit/region counts).
   * - ``--simple-plot``
     -
     - off
     - Write a PDF line plot of sliding-window scores.
   * - ``--complex-plot``
     -
     - off
     - Write a PDF binned heatmap (for large sequences).
   * - ``--complex-plot-nbins``
     -
     - 1000
     - Number of bins for the complex plot.
   * - ``--complex-plot-percentile``
     -
     - 95
     - Percentile for y-axis limit in complex plot.
   * - ``--strand-agnostic``
     -
     - off
     - Use absolute scores (ignores strand) in complex plot.
   * - ``--highlight-regions``
     -
     -
     - Regions to highlight (``START:END`` pairs, 1-based).


CLI examples
^^^^^^^^^^^^

**Basic analysis:**

.. code-block:: bash

    g4hunterpy3 -i sequences.fasta -o results/

**Custom window size and stricter threshold:**

.. code-block:: bash

    g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

**Print sequence info:**

.. code-block:: bash

    g4hunterpy3 -i sequences.fasta -o results/ --info

**Generate plots:**

.. code-block:: bash

    # simple line plot
    g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

    # complex binned heatmap for a genome
    g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

**Highlight genomic regions on complex plot:**

.. code-block:: bash

    g4hunterpy3 -i genome.fasta -o results/ \
        --complex-plot \
        --highlight-regions 1000:2000 5000:6000 8000:9000

**Strand-agnostic vs strand-specific:**

.. code-block:: bash

    # strand-specific (default): blue = C-rich, red = G-rich
    g4hunterpy3 -i genome.fasta -o results/ --complex-plot

    # strand-agnostic: all G4-forming regions in red
    g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic


Output files
^^^^^^^^^^^^

For each FASTA record, the CLI writes:

1. **Per-window hit file** (``<header>-W<k>-S<threshold>.txt``)
   — tab-separated with columns: Start, End, Sequence, Length, Score
   (1-based coordinates).

2. **Merged region file** (``<header>-Merged.txt``)
   — tab-separated with columns: Start, End, Sequence, Length, Score, NBR
   (1-based coordinates).

3. **Plot files** (optional):
   - ``<header>-ScorePlot.pdf`` — simple line plot (with ``--simple-plot``)
   - ``<header>-ComplexScorePlot.pdf`` — binned heatmap (with ``--complex-plot``)


Understanding scores
^^^^^^^^^^^^^^^^^^^^

- **Positive scores** → G-rich regions (G4-forming on the forward strand).
- **Negative scores** → C-rich regions (G4-forming on the reverse strand).
- Score magnitudes:
  - \|score\| ≥ 1.2 — moderate propensity
  - \|score\| ≥ 1.5 — high propensity
  - \|score\| ≥ 2.0 — very high propensity


How to Cite
-----------

Please cite the original G4Hunter paper and link to the g4hunterpy3 repository:

    Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex
    propensity with G4Hunter. *Nucleic Acids Res.* **44**, 1746–1759 (2016).
    `doi:10.1093/nar/gkw006 <https://doi.org/10.1093/nar/gkw006>`_