Genomics Module

The genomics module provides specialized functions for biological sequence analysis using LZSS factorization.

FASTA Processing

FASTA file parsing and compression utilities.

This module provides functions for reading, parsing, and compressing FASTA files with proper handling of biological sequences and edge cases.

exception noLZSS.genomics.fasta.FASTAError

Bases: NoLZSSError

Raised when FASTA file parsing or validation fails.

noLZSS.genomics.fasta.read_nucleotide_fasta(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]]

Read and factorize nucleotide sequences from a FASTA file.

Only accepts sequences containing A, C, T, G (case insensitive). Sequences are converted to uppercase and factorized.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, factors) tuples where factors is the LZSS factorization

Raises:
  • FASTAError – If file format is invalid or contains invalid nucleotides

  • FileNotFoundError – If file doesn’t exist
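As a rough usage sketch, the (sequence_id, factors) pairs described above can be reduced to a per-sequence factor count. The helper names factor_counts and load_factor_counts are hypothetical; only read_nucleotide_fasta and its return shape are taken from this page.

```python
from typing import Dict, List, Tuple

Factor = Tuple[int, int, int]


def factor_counts(records: List[Tuple[str, List[Factor]]]) -> Dict[str, int]:
    """Map each sequence ID to the number of factors in its factorization."""
    return {seq_id: len(factors) for seq_id, factors in records}


def load_factor_counts(filepath: str) -> Dict[str, int]:
    """Hypothetical wrapper: read a nucleotide FASTA and count factors per record."""
    # Imported lazily so factor_counts stays usable without noLZSS installed.
    from noLZSS.genomics.fasta import read_nucleotide_fasta

    return factor_counts(read_nucleotide_fasta(filepath))
```

factor_counts works on any list with the documented shape, so it can be exercised on hand-built data before pointing load_factor_counts at a real file.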

noLZSS.genomics.fasta.read_protein_fasta(filepath: str | Path) → List[Tuple[str, str]]

Read amino acid sequences from a FASTA file.

Only accepts sequences containing canonical amino acids. Sequences are converted to uppercase.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, sequence) tuples

Raises:
  • FASTAError – If file format is invalid or contains invalid amino acids

  • FileNotFoundError – If file doesn’t exist

noLZSS.genomics.fasta.read_fasta_auto(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]] | List[Tuple[str, str]]

Read a FASTA file and automatically detect whether it contains nucleotide or amino acid sequences.

  • For nucleotide sequences: validates A, C, T, G only and returns factorized results

  • For amino acid sequences: validates canonical amino acids and returns sequences

Parameters:

filepath – Path to FASTA file

Returns:

  • For nucleotide FASTA: List of (sequence_id, factors) tuples

  • For amino acid FASTA: List of (sequence_id, sequence) tuples

Raises:
  • FASTAError – If file format is invalid or sequence type cannot be determined

  • FileNotFoundError – If file doesn’t exist
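Because read_fasta_auto returns one of two shapes, calling code usually needs to branch on the result. One possible way to tell them apart is to inspect the second element of the first record, as in the sketch below; result_kind is a hypothetical helper, not part of noLZSS.

```python
from typing import Sequence, Tuple, Union


def result_kind(records: Sequence[Tuple[str, Union[str, list]]]) -> str:
    """Return 'protein' for (id, sequence) records, 'nucleotide' for (id, factors)."""
    if not records:
        raise ValueError("no records to inspect")
    _, payload = records[0]
    # Amino acid results carry the sequence string; nucleotide results carry a factor list.
    return "protein" if isinstance(payload, str) else "nucleotide"
```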

noLZSS.genomics.fasta.write_factors_dna_w_reference_fasta_files_to_binary(reference_fasta_path: str | Path, target_fasta_path: str | Path, output_path: str | Path, sanitize_mode: str = 'remove_ambiguous') → int

Factorize DNA sequences from FASTA files with reference and write factors to binary file.

Reads DNA sequences from reference and target FASTA files, performs noLZSS factorization of the target using the reference, and writes the resulting factors to a binary file. Specialized for nucleotide sequences (A, C, T, G) with reverse complement matching capability.

Parameters:
  • reference_fasta_path – Path to reference FASTA file containing DNA sequences

  • target_fasta_path – Path to target FASTA file containing DNA sequences to factorize

  • output_path – Path to output file where binary factors will be written

  • sanitize_mode – FASTA DNA sanitization mode:

      - “remove_ambiguous” (default): remove non-ACGT characters before factorization

      - “strict”: raise an error if non-ACGT characters are present

Returns:

Number of factors written to the output file

Raises:
  • ValueError – If files contain empty sequences or invalid nucleotides

  • FileNotFoundError – If FASTA files do not exist

  • RuntimeError – If unable to read FASTA files, create output file, or processing errors occur

  • FASTAError – If C++ extension is not available

Note

  • Factor start positions are absolute positions in the combined reference+target string

  • Supports reverse complement matching for DNA sequences (indicated by MSB in reference field)

  • All sequences from both FASTA files are concatenated with sentinel separators

  • In default mode, non-ACGT symbols are removed before factorization

  • Sequence positions/lengths are based on sanitized sequences

  • This function overwrites the output file if it exists

Warning

Characters 1-251 are used as sentinel separators and must not appear in sequences.
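The two sanitize_mode values can be illustrated with a small pure-Python sketch. It is not the library's implementation: sanitize_dna is a hypothetical name, and uppercasing the input first is an assumption made here for simplicity.

```python
def sanitize_dna(seq: str, mode: str = "remove_ambiguous") -> str:
    """Illustrate the documented sanitization modes on a single sequence."""
    upper = seq.upper()  # assumption: case is normalized before checking
    if mode == "remove_ambiguous":
        # Drop every symbol that is not A, C, G, or T.
        return "".join(ch for ch in upper if ch in "ACGT")
    if mode == "strict":
        bad = sorted(set(upper) - set("ACGT"))
        if bad:
            raise ValueError(f"non-ACGT characters present: {bad}")
        return upper
    raise ValueError(f"unknown sanitize_mode: {mode!r}")
```

Since positions and lengths are reported against the sanitized sequences, a factor position in default mode will generally not line up with the original FASTA coordinates when ambiguous symbols were removed.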

Sequence Utilities

Sequence utilities for biological data.

This module provides functions for working with nucleotide and amino acid sequences, including validation, transformation, and analysis functions.

noLZSS.genomics.sequences.is_dna_sequence(data: str | bytes) → bool

Check if data appears to be a DNA sequence (A, T, G, C).

Parameters:

data – Input data to check

Returns:

True if data contains only DNA nucleotides (case insensitive)

noLZSS.genomics.sequences.is_protein_sequence(data: str | bytes) → bool

Check if data appears to be a protein sequence (20 standard amino acids).

Parameters:

data – Input data to check

Returns:

True if data contains only standard amino acid codes

noLZSS.genomics.sequences.detect_sequence_type(data: str | bytes) → str

Detect the likely type of biological sequence.

Parameters:

data – Input data to analyze

Returns:

String indicating sequence type: ‘dna’, ‘protein’, ‘text’, or ‘binary’
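The alphabet checks described above can be approximated in a few lines. This is a sketch under stated assumptions, not the library's code: the names below are hypothetical, empty input is treated as matching neither alphabet, and the ‘text’ versus ‘binary’ distinction is not modeled.

```python
from typing import Optional, Union

DNA_ALPHABET = frozenset("ACGT")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def _as_text(data: Union[str, bytes]) -> str:
    """Decode bytes so that str and bytes inputs share one code path."""
    return data.decode("ascii", errors="replace") if isinstance(data, bytes) else data


def looks_like_dna(data: Union[str, bytes]) -> bool:
    text = _as_text(data).upper()
    return bool(text) and set(text) <= DNA_ALPHABET


def looks_like_protein(data: Union[str, bytes]) -> bool:
    text = _as_text(data).upper()
    return bool(text) and set(text) <= PROTEIN_ALPHABET


def classify(data: Union[str, bytes]) -> Optional[str]:
    """Return 'dna', 'protein', or None when neither alphabet matches."""
    # DNA is tested first: A, C, G, and T are also valid amino acid codes.
    if looks_like_dna(data):
        return "dna"
    if looks_like_protein(data):
        return "protein"
    return None
```

The ordering matters because every pure-ACGT string also passes the protein check, so a detector that tested protein first would never report DNA.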

noLZSS.genomics.sequences.factorize_dna_w_reference_seq(reference_seq: str | bytes, target_seq: str | bytes, validate: bool = True)

Factorize target DNA sequence using a reference sequence with reverse complement awareness.

Concatenates a reference sequence and target sequence, then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins. This allows the target sequence to reference patterns in the reference sequence without factorizing the reference itself.

Parameters:
  • reference_seq – Reference DNA sequence (A, C, T, G - case insensitive)

  • target_seq – Target DNA sequence to be factorized (A, C, T, G - case insensitive)

  • validate – Whether to perform input validation (default: True)

Returns:

List of (start, length, ref, is_rc) tuples representing the factorization of target sequence

Raises:
  • ValueError – If sequences contain invalid nucleotides or are empty

  • TypeError – If input types are not supported

  • RuntimeError – If processing errors occur

Note

Factor start positions are relative to the beginning of the target sequence. Both sequences are converted to uppercase before factorization. The ref field is returned with RC_MASK cleared; the is_rc boolean indicates whether the factor is a reverse complement match.
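To make the relationship between a raw reference field and the (ref, is_rc) pair concrete, here is a minimal sketch. It assumes a 64-bit unsigned reference field whose most significant bit is the reverse-complement flag; the constant value and the helper names are assumptions, so check the library before relying on them.

```python
from typing import Tuple

RC_MASK = 1 << 63  # assumed: MSB of a 64-bit unsigned reference field


def split_ref(raw_ref: int) -> Tuple[int, bool]:
    """Separate a raw reference field into (ref with flag cleared, is_rc)."""
    return raw_ref & ~RC_MASK, bool(raw_ref & RC_MASK)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

Factors returned by factorize_dna_w_reference_seq already carry the decoded pair, so split_ref would only be needed when working from raw binary output.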

noLZSS.genomics.sequences.factorize_dna_w_reference_seq_file(reference_seq: str | bytes, target_seq: str | bytes, output_path: str | Path, validate: bool = True) → int

Factorize target DNA sequence using a reference sequence and write factors to binary file.

Concatenates a reference sequence and target sequence, then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins, and writes the resulting factors to a binary file.

Parameters:
  • reference_seq – Reference DNA sequence (A, C, T, G - case insensitive)

  • target_seq – Target DNA sequence to be factorized (A, C, T, G - case insensitive)

  • output_path – Path to output file where binary factors will be written

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors written to the output file

Raises:
  • ValueError – If sequences contain invalid nucleotides or are empty

  • TypeError – If input types are not supported

  • RuntimeError – If unable to create output file or processing errors occur

Note

Factor start positions are relative to the beginning of the target sequence. Binary format follows the same structure as other DNA factorization binary outputs. This function overwrites the output file if it exists.

Plotting and Visualization

FASTA file plotting utilities.

This module provides functions for creating plots and visualizations from FASTA files and their factorizations.

exception noLZSS.genomics.plots.PlotError

Bases: NoLZSSError

Raised when plotting operations fail.

noLZSS.genomics.plots.plot_single_seq_accum_factors_from_file(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, output_dir: str | Path | None = None, max_sequences: int | None = None, save_factors_text: bool = True, save_factors_binary: bool = False, min_factor_length: int = 1) → Dict[str, Dict[str, Any]]

Process a FASTA file or binary factors file, factorize sequences (if needed), create plots, and save results.

For each sequence:

  • If FASTA file: reads sequences, factorizes them, and saves factor data and plots

  • If binary factors file: reads existing factors and creates plots

Parameters:
  • fasta_filepath – Path to input FASTA file (mutually exclusive with factors_filepath)

  • factors_filepath – Path to binary factors file (mutually exclusive with fasta_filepath)

  • output_dir – Directory to save all output files (required for FASTA, optional for binary)

  • max_sequences – Maximum number of sequences to process (None for all)

  • save_factors_text – Whether to save factors as text files (only for FASTA input)

  • save_factors_binary – Whether to save factors as binary files (only for FASTA input)

  • min_factor_length – Minimum factor length to include in analysis (default: 1)

Returns:

Dictionary with processing results for each sequence:

    {
        ‘sequence_id’: {
            ‘sequence_length’: int,
            ‘num_factors’: int,
            ‘factors_file’: str,  # path to saved factors
            ‘plot_file’: str,  # path to saved plot
            ‘factors’: List[Tuple[int, int, int]]  # the factors
        }
    }

Raises:
  • PlotError – If file processing fails

  • FileNotFoundError – If input file doesn’t exist

  • ValueError – If both or neither input files are provided, or if output_dir is missing for FASTA input
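The returned dictionary lends itself to quick per-sequence summaries. The sketch below derives a mean factor length from the documented sequence_length and num_factors keys; mean_factor_length is a hypothetical helper, and the figure is only a rough indicator, since it assumes the counted factors cover the whole sequence (which may not hold when min_factor_length filters some out).

```python
from typing import Any, Dict


def mean_factor_length(results: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Approximate mean factor length per sequence from the results dictionary."""
    return {
        seq_id: info["sequence_length"] / info["num_factors"]
        for seq_id, info in results.items()
        if info["num_factors"] > 0  # skip sequences with no factors
    }
```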

noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_from_file(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False, min_factor_length: int = 1) → Any | None

Create an interactive Datashader/Panel factor plot for multiple DNA sequences from a FASTA file or binary factors file.

This function reads factors either from a FASTA file (by factorizing multiple DNA sequences) or from an enhanced binary factors file with metadata. It creates a high-performance interactive plot using Datashader and Panel with level-of-detail rendering, zoom/pan-aware decimation, hover functionality, and sequence boundaries visualization.

Parameters:
  • fasta_filepath – Path to the FASTA file containing DNA sequences (mutually exclusive with factors_filepath)

  • factors_filepath – Path to binary factors file with metadata (mutually exclusive with fasta_filepath)

  • name – Optional name for the plot title (defaults to input filename)

  • save_path – Optional path to save the plot (supports .html or .png; PNG export requires optional selenium-based dependencies)

  • show_plot – Whether to display/serve the plot

  • return_panel – Whether to return the Panel app for embedding

  • min_factor_length – Minimum factor length to include in analysis (default: 1)

Returns:

Panel app if return_panel=True, otherwise None

Raises:
  • PlotError – If plotting fails or input files cannot be processed

  • FileNotFoundError – If input file doesn’t exist

  • ImportError – If required dependencies are missing

  • ValueError – If both or neither input files are provided

noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_simple(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, min_factor_length: int = 1) → None

Create a simple matplotlib factor plot for multiple DNA sequences from a FASTA file or binary factors file.

This function reads factors either from a FASTA file (by factorizing multiple DNA sequences) or from an enhanced binary factors file with metadata. It creates a static plot using matplotlib with sequence boundaries visualization - a simplified alternative to the interactive Panel/Datashader version.

Parameters:
  • fasta_filepath – Path to the FASTA file containing DNA sequences (mutually exclusive with factors_filepath)

  • factors_filepath – Path to binary factors file with metadata (mutually exclusive with fasta_filepath)

  • name – Optional name for the plot title (defaults to input filename)

  • save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)

  • show_plot – Whether to display the plot

  • min_factor_length – Minimum factor length to include in analysis (default: 1)

Raises:
  • PlotError – If plotting fails or input files cannot be processed

  • FileNotFoundError – If input file doesn’t exist

  • ImportError – If matplotlib is not available

  • ValueError – If both or neither input files are provided

noLZSS.genomics.plots.plot_reference_seq_lz_factor_plot_simple(reference_seq: str | bytes | None = None, target_seq: str | bytes | None = None, factors: List[Tuple[int, int, int, bool]] | None = None, factors_filepath: str | Path | None = None, reference_name: str = 'Reference', target_name: str = 'Target', save_path: str | Path | None = None, show_plot: bool = True, factorization_mode: Literal['dna', 'general'] = 'dna') → None

Create a simple matplotlib factor plot for a sequence factorized with a reference sequence.

This function creates a plot compatible with the outputs of factorize_dna_w_reference_seq() or the general factorize_w_reference() wrapper. The plot shows the reference sequence at the beginning, concatenated with the target sequence, and uses distinct colors for reference vs target regions.

Parameters:
  • reference_seq – Reference DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).

  • target_seq – Target DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).

  • factors – Optional list of (start, length, ref, is_rc) tuples from factorize_dna_w_reference_seq() or factorize_w_reference(). If None, the function will compute factors automatically based on factorization_mode.

  • factors_filepath – Optional path to binary factors file (mutually exclusive with factors). When provided and sequences are not, parameters are inferred from the factors: first factor start = target_start, last factor end = total_length.

  • reference_name – Name for the reference sequence (default: “Reference”)

  • target_name – Name for the target sequence (default: “Target”)

  • save_path – Optional path to save the plot image

  • show_plot – Whether to display the plot

  • factorization_mode – Choose “dna” for reverse-complement-aware factorization or “general” for ASCII/general sequences without reverse complements

Raises:
  • PlotError – If plotting fails or input sequences are invalid

  • ValueError – If both factors and factors_filepath are provided, or if sequences are not provided and factors_filepath is not provided

  • ImportError – If matplotlib is not available
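The inference rule stated for factors_filepath (first factor start = target_start, last factor end = total_length) can be written out directly. infer_layout is a hypothetical helper that assumes factors are ordered by start position and use the (start, length, ref, is_rc) layout described above.

```python
from typing import List, Tuple

Factor = Tuple[int, int, int, bool]


def infer_layout(factors: List[Factor]) -> Tuple[int, int]:
    """Return (target_start, total_length) inferred from an ordered factor list."""
    if not factors:
        raise ValueError("cannot infer layout from an empty factor list")
    target_start = factors[0][0]
    last_start, last_length = factors[-1][0], factors[-1][1]
    # The last factor's end marks the end of the concatenated sequence.
    return target_start, last_start + last_length
```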

noLZSS.genomics.plots.plot_reference_seq_lz_factor_plot(reference_seq: str | bytes | None = None, target_seq: str | bytes | None = None, factors: List[Tuple[int, int, int, bool]] | None = None, factors_filepath: str | Path | None = None, reference_name: str = 'Reference', target_name: str = 'Target', save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False, factorization_mode: Literal['dna', 'general'] = 'dna') → Any | None

Create an interactive Datashader/Panel factor plot for a sequence factorized with a reference sequence.

This function creates a plot compatible with the outputs of factorize_dna_w_reference_seq() or the general factorize_w_reference() wrapper. The plot shows the reference sequence at the beginning, concatenated with the target sequence, and uses distinct colors for reference vs target regions.

Parameters:
  • reference_seq – Reference DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).

  • target_seq – Target DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).

  • factors – Optional list of (start, length, ref, is_rc) tuples from factorize_dna_w_reference_seq() or factorize_w_reference(). If None, the function will compute factors automatically based on factorization_mode.

  • factors_filepath – Optional path to binary factors file (mutually exclusive with factors). When provided and sequences are not, parameters are inferred from the factors: first factor start = target_start, last factor end = total_length.

  • reference_name – Name for the reference sequence (default: “Reference”)

  • target_name – Name for the target sequence (default: “Target”)

  • save_path – Optional path to save the plot image (PNG export)

  • show_plot – Whether to display/serve the plot

  • return_panel – Whether to return the Panel app for embedding

  • factorization_mode – Choose “dna” for reverse-complement-aware factorization or “general” for ASCII/general sequences without reverse complements

Returns:

Panel app if return_panel=True, otherwise None

Raises:
  • PlotError – If plotting fails or input sequences are invalid

  • ValueError – If both factors and factors_filepath are provided, or if sequences are not provided and factors_filepath is not provided

  • ImportError – If required dependencies are missing

noLZSS.genomics.plots.plot_strand_bias_heatmap(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, grid_size: int | Tuple[int, int] = 50, save_path: str | Path | None = None, show_plot: bool = True, min_factor_length: int = 1) → None

Visualize forward vs reverse-complement bias across the factor map.

The plot partitions the factor plane (target position vs reference position) into a square grid (default 50x50). Each bin accumulates nucleotide coverage from factors that overlap that bin; contributions are split when factors cross bin boundaries. Color encodes the log2 ratio between forward and reverse-complement coverage, normalized by the total coverage of each strand so that global strand imbalances are accounted for.

Parameters:
  • fasta_filepath – FASTA file to factorize (mutually exclusive with factors_filepath).

  • factors_filepath – Enhanced binary factors file with metadata (mutually exclusive with fasta_filepath).

  • name – Optional label for the plot title (defaults to input stem).

  • grid_size – Number of bins per axis (int) or explicit (x_bins, y_bins) tuple. Default: 50.

  • save_path – Optional path to save the heatmap image.

  • show_plot – Whether to display the plot.

  • min_factor_length – Minimum factor length to include in analysis (default: 1)
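The color value described above, a log2 ratio of per-strand coverage shares, can be sketched for already-binned coverage grids. This is an illustration rather than the function's actual code: strand_bias_log2 is a hypothetical name, the pseudocount used to avoid taking the log of zero is an assumption, and the step that splits factors across bin boundaries is not shown.

```python
import math
from typing import List

Grid = List[List[float]]


def strand_bias_log2(forward: Grid, reverse: Grid, pseudocount: float = 1.0) -> Grid:
    """Per-bin log2 ratio of forward to reverse-complement coverage shares."""
    n_bins = sum(len(row) for row in forward)
    # Smoothed totals: the pseudocount is added once per bin on each strand.
    total_fwd = sum(map(sum, forward)) + pseudocount * n_bins
    total_rev = sum(map(sum, reverse)) + pseudocount * n_bins
    result = []
    for fwd_row, rev_row in zip(forward, reverse):
        row = []
        for fwd, rev in zip(fwd_row, rev_row):
            fwd_share = (fwd + pseudocount) / total_fwd
            rev_share = (rev + pseudocount) / total_rev
            row.append(math.log2(fwd_share / rev_share))
        result.append(row)
    return result
```

Dividing by each strand's total means a genome with twice as much forward coverage everywhere yields zeros in every bin, so only local deviations from the global balance show up.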

noLZSS.genomics.plots.plot_factor_length_ccdf(factors_filepath: str | Path, save_path: str | Path | None = None, show_plot: bool = True, separate: bool = True, min_factor_length: int = 1) → None

Create an empirical CCDF plot of factor lengths on log-log axes from a binary factors file.

This function reads factors from a binary file and plots the complementary cumulative distribution function (CCDF) of factor lengths. Forward and reverse complement factors can be plotted separately or together on the same axes with different colors.

Parameters:
  • factors_filepath – Path to binary factors file with metadata

  • save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)

  • show_plot – Whether to display the plot

  • separate – Whether to plot forward and reverse complement factors separately (default: True). If False, both are plotted on the same axes with different colors.

  • min_factor_length – Minimum factor length to include in analysis (default: 1)

Raises:
  • PlotError – If file reading or plotting fails

  • FileNotFoundError – If factors file doesn’t exist

  • ImportError – If matplotlib is not available
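For readers unfamiliar with the empirical CCDF, the quantity being plotted can be computed as follows. The sketch uses the convention P(L ≥ x), which is an assumption (the function may use a strict inequality), and empirical_ccdf is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def empirical_ccdf(lengths: List[int]) -> List[Tuple[int, float]]:
    """Return sorted (length, fraction of factors with length >= that value)."""
    counts = Counter(lengths)
    total = len(lengths)
    remaining = total  # number of factors at least as long as the current value
    points = []
    for value in sorted(counts):
        points.append((value, remaining / total))
        remaining -= counts[value]
    return points
```

Plotting these points on log-log axes makes a heavy-tailed length distribution appear as a slowly decaying curve rather than a spike near length 1.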

noLZSS.genomics.plots.plot_space_scale_heatmap(factors_filepath: str | Path, save_path: str | Path | None = None, show_plot: bool = True, genome_bin_size: float = 1.0, length_log_base: float = 2.0, separate_strands: bool = True, show_marginal_ccdf: bool = True, sequence_index: int | None = None, cmap: str = 'viridis', min_factor_length: int = 1) → None

Create a space-scale heatmap showing factor length distribution across genomic positions.

This function creates a 2D heatmap where:

  • X-axis: genomic position (binned into windows)

  • Y-axis: factor length (log-binned)

  • Cell color: CCDF-weighted factor count (emphasizes rare long factors)

The heatmap uses CCDF normalization to address the heavy-tailed distribution of factor lengths. Each cell’s value is weighted by the inverse of the CCDF (complementary cumulative distribution function) at that length, making rare long factors as visible as abundant short factors.

Forward and reverse-complement factors can be plotted separately or together. Optional marginal CCDF plots show the global length distribution per strand.

Parameters:
  • factors_filepath – Path to binary factors file with metadata

  • save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)

  • show_plot – Whether to display the plot

  • genome_bin_size – Size of genomic position bins in megabases (default: 1.0 Mb)

  • length_log_base – Base for logarithmic binning of factor lengths (default: 2.0)

  • separate_strands – Whether to create separate heatmaps for forward and reverse complement factors (default: True). If False, combines both on one heatmap.

  • show_marginal_ccdf – Whether to add marginal CCDF plots showing global length distribution per strand (default: True)

  • sequence_index – Optional index to select a specific sequence from multi-sequence files (0-based). If None, uses all sequences concatenated.

  • cmap – Matplotlib colormap name (default: ‘viridis’)

  • min_factor_length – Minimum factor length to include in analysis (default: 1)

Raises:
  • PlotError – If file reading or plotting fails

  • FileNotFoundError – If factors file doesn’t exist

  • ImportError – If required dependencies are not available
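The two binning parameters, genome_bin_size and length_log_base, map each factor to a heatmap cell. The sketch below shows only that binning step under simple assumptions (bins start at position 0 and at length 1); the helper names are hypothetical and the CCDF weighting is left out.

```python
def position_bin(start: int, bin_size_mb: float = 1.0) -> int:
    """Index of the genomic window containing a factor start position."""
    return int(start // (bin_size_mb * 1_000_000))


def length_bin(length: int, base: float = 2.0) -> int:
    """Index of the logarithmic length bin, i.e. floor(log_base(length))."""
    if length < 1 or base <= 1:
        raise ValueError("length must be >= 1 and base must be > 1")
    index = 0
    upper_edge = base
    # Walk the bin edges instead of calling math.log to avoid rounding at exact powers.
    while upper_edge <= length:
        upper_edge *= base
        index += 1
    return index
```

With the default base of 2.0, lengths 1, 2 to 3, 4 to 7, and 8 to 15 fall into bins 0, 1, 2, and 3, so each row of the heatmap spans a doubling of factor length.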

Per-sequence Complexity Tables

noLZSS.genomics.batch_factorize now exposes a lightweight mode for computing the DNA LZSS complexity of each FASTA record with and without reverse complement awareness:

python -m noLZSS.genomics.batch_factorize my_sequences.fasta \
   --complexity-tsv results/complexity.tsv \
   --complexity-threads 8

The generated TSV contains three columns:

  1. sequence_id – the exact FASTA header for the sequence

  2. complexity_w_rc – factor count when reverse complements are allowed

  3. complexity_no_rc – factor count without reverse complement matching

The command accepts local files or URLs (with optional --download-dir). No factor files are written when --complexity-tsv is supplied.
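A generated table can be post-processed with the standard library. The sketch below assumes the TSV starts with a header row naming the three columns listed above, which is not confirmed here; read_complexity_tsv and SAMPLE_TSV are hypothetical, and the sample values are made up for illustration.

```python
import csv
import io
from typing import IO, List, Tuple

# Made-up sample in the assumed layout: header row, then one row per sequence.
SAMPLE_TSV = "sequence_id\tcomplexity_w_rc\tcomplexity_no_rc\nseq1\t80\t100\n"


def read_complexity_tsv(handle: IO[str]) -> List[Tuple[str, int, int, float]]:
    """Return (sequence_id, w_rc, no_rc, w_rc / no_rc) for each row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        w_rc = int(row["complexity_w_rc"])
        no_rc = int(row["complexity_no_rc"])
        ratio = w_rc / no_rc if no_rc else float("nan")
        rows.append((row["sequence_id"], w_rc, no_rc, ratio))
    return rows


rows = read_complexity_tsv(io.StringIO(SAMPLE_TSV))
```

A ratio well below 1 suggests that reverse-complement matches noticeably shorten the factorization of that sequence.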
