Genomics Module
The genomics module provides specialized functions for biological sequence analysis using LZSS factorization.
FASTA Processing
FASTA file parsing and compression utilities.
This module provides functions for reading, parsing, and compressing FASTA files with proper handling of biological sequences and edge cases.
- exception noLZSS.genomics.fasta.FASTAError
Bases: NoLZSSError
Raised when FASTA file parsing or validation fails.
- noLZSS.genomics.fasta.read_nucleotide_fasta(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]]
Read and factorize nucleotide sequences from a FASTA file.
Only accepts sequences containing A, C, T, G (case insensitive). Sequences are converted to uppercase and factorized.
- Parameters:
filepath – Path to FASTA file
- Returns:
List of (sequence_id, factors) tuples where factors is the LZSS factorization
- Raises:
FASTAError – If file format is invalid or contains invalid nucleotides
FileNotFoundError – If file doesn’t exist
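The validation rules described above (A, C, T, G only, case insensitive, uppercased before factorization) can be sketched in plain Python. This is a minimal illustration and not the library's parser: `parse_nucleotide_fasta` and `_finish_record` are hypothetical names, `ValueError` stands in for `FASTAError`, the text after `>` is used as the sequence ID by assumption, and no factorization is performed.

```python
from typing import Iterable, List, Optional, Tuple

VALID_NUCLEOTIDES = frozenset("ACGT")


def _finish_record(header: str, chunks: List[str]) -> Tuple[str, str]:
    """Join the sequence lines of one record, uppercase them, and validate."""
    sequence = "".join(chunks).upper()
    if not sequence:
        raise ValueError(f"record {header!r} has no sequence data")
    invalid = set(sequence) - VALID_NUCLEOTIDES
    if invalid:
        raise ValueError(f"record {header!r} contains invalid symbols: {sorted(invalid)}")
    return header, sequence


def parse_nucleotide_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (sequence_id, uppercase_sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        if line.startswith(">"):
            if header is not None:
                records.append(_finish_record(header, chunks))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append(_finish_record(header, chunks))
    return records
```

Feeding it `[">seq1", "acgt", "ACG"]` yields one record with the sequence joined and uppercased; a record containing `N` is rejected.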
- noLZSS.genomics.fasta.read_protein_fasta(filepath: str | Path) → List[Tuple[str, str]]
Read amino acid sequences from a FASTA file.
Only accepts sequences containing canonical amino acids. Sequences are converted to uppercase.
- Parameters:
filepath – Path to FASTA file
- Returns:
List of (sequence_id, sequence) tuples
- Raises:
FASTAError – If file format is invalid or contains invalid amino acids
FileNotFoundError – If file doesn’t exist
- noLZSS.genomics.fasta.read_fasta_auto(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]] | List[Tuple[str, str]]
Read a FASTA file and automatically detect whether it contains nucleotide or amino acid sequences.
For nucleotide sequences: validates A, C, T, G only and returns factorized results.
For amino acid sequences: validates canonical amino acids and returns sequences.
- Parameters:
filepath – Path to FASTA file
- Returns:
For nucleotide FASTA: List of (sequence_id, factors) tuples
For amino acid FASTA: List of (sequence_id, sequence) tuples
- Raises:
FASTAError – If file format is invalid or sequence type cannot be determined
FileNotFoundError – If file doesn’t exist
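Because the return shape depends on the detected sequence type, callers need to branch on the second element of each tuple. The sketch below shows one way to do that on data shaped like the documented return values; `describe_records` and the sample records are hypothetical and the library itself is not called.

```python
from typing import List, Tuple, Union

NucleotideRecords = List[Tuple[str, List[Tuple[int, int, int]]]]
ProteinRecords = List[Tuple[str, str]]


def describe_records(records: Union[NucleotideRecords, ProteinRecords]) -> List[str]:
    """Summarize each record, whichever of the two documented shapes it has."""
    lines = []
    for seq_id, payload in records:
        if isinstance(payload, str):
            # Amino acid FASTA: the payload is the validated sequence itself.
            lines.append(f"{seq_id}: protein, {len(payload)} residues")
        else:
            # Nucleotide FASTA: the payload is a list of factor tuples.
            lines.append(f"{seq_id}: nucleotide, {len(payload)} factors")
    return lines


# Hypothetical data in the two documented shapes.
nucleotide_like = [("chr_demo", [(0, 1, 0), (1, 3, 0)])]
protein_like = [("prot_demo", "MKVLA")]
print(describe_records(nucleotide_like))
print(describe_records(protein_like))
```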
- noLZSS.genomics.fasta.write_factors_dna_w_reference_fasta_files_to_binary(reference_fasta_path: str | Path, target_fasta_path: str | Path, output_path: str | Path, sanitize_mode: str = 'remove_ambiguous') → int
Factorize DNA sequences from FASTA files with reference and write factors to binary file.
Reads DNA sequences from reference and target FASTA files, performs noLZSS factorization of the target using the reference, and writes the resulting factors to a binary file. Specialized for nucleotide sequences (A, C, T, G) with reverse complement matching capability.
- Parameters:
reference_fasta_path – Path to reference FASTA file containing DNA sequences
target_fasta_path – Path to target FASTA file containing DNA sequences to factorize
output_path – Path to output file where binary factors will be written
sanitize_mode – FASTA DNA sanitization mode:
- “remove_ambiguous” (default): remove non-ACGT characters before factorization
- “strict”: raise an error if non-ACGT characters are present
- Returns:
Number of factors written to the output file
- Raises:
ValueError – If files contain empty sequences or invalid nucleotides
FileNotFoundError – If FASTA files do not exist
RuntimeError – If unable to read FASTA files, create output file, or processing errors occur
FASTAError – If C++ extension is not available
Note
Factor start positions are absolute positions in the combined reference+target string
Supports reverse complement matching for DNA sequences (indicated by MSB in reference field)
All sequences from both FASTA files are concatenated with sentinel separators
In default mode, non-ACGT symbols are removed before factorization
Sequence positions/lengths are based on sanitized sequences
This function overwrites the output file if it exists
Warning
Characters 1-251 are used as sentinel separators and must not appear in sequences.
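Two of the behaviors noted above lend themselves to a short sketch: the two sanitization modes, and the reverse complement flag stored in the most significant bit of the reference field. The code below is illustrative only. It assumes a 64-bit unsigned reference field (so the flag is bit 63) and assumes sequences are uppercased before sanitizing; neither detail is confirmed by this page, and `sanitize_dna` and `split_reference_field` are hypothetical names.

```python
from typing import Tuple

ACGT = frozenset("ACGT")
RC_MASK = 1 << 63  # assumption: 64-bit reference field, flag stored in bit 63


def sanitize_dna(sequence: str, mode: str = "remove_ambiguous") -> str:
    """Mimic the two documented sanitization modes on an in-memory string."""
    upper = sequence.upper()  # assumption: case is normalized first
    if mode == "remove_ambiguous":
        return "".join(ch for ch in upper if ch in ACGT)
    if mode == "strict":
        invalid = set(upper) - ACGT
        if invalid:
            raise ValueError(f"non-ACGT characters present: {sorted(invalid)}")
        return upper
    raise ValueError(f"unknown sanitize mode: {mode!r}")


def split_reference_field(raw_ref: int) -> Tuple[int, bool]:
    """Separate the reference position from the reverse complement flag."""
    is_rc = bool(raw_ref & RC_MASK)
    return raw_ref & ~RC_MASK, is_rc
```

Since positions and lengths are based on sanitized sequences, coordinates from the binary file refer to the output of a step like `sanitize_dna`, not to the raw FASTA text.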
Sequence Utilities
Sequence utilities for biological data.
This module provides functions for working with nucleotide and amino acid sequences, including validation, transformation, and analysis functions.
- noLZSS.genomics.sequences.is_dna_sequence(data: str | bytes) → bool
Check if data appears to be a DNA sequence (A, T, G, C).
- Parameters:
data – Input data to check
- Returns:
True if data contains only DNA nucleotides (case insensitive)
- noLZSS.genomics.sequences.is_protein_sequence(data: str | bytes) → bool
Check if data appears to be a protein sequence (20 standard amino acids).
- Parameters:
data – Input data to check
- Returns:
True if data contains only standard amino acid codes
- noLZSS.genomics.sequences.detect_sequence_type(data: str | bytes) → str
Detect the likely type of biological sequence.
- Parameters:
data – Input data to analyze
- Returns:
String indicating sequence type: ‘dna’, ‘protein’, ‘text’, or ‘binary’
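A rough approximation of this kind of classification is sketched below. The library's actual heuristics are not described here, so the rules are assumptions made for illustration: DNA is tested before protein (A, C, G, and T are all valid amino acid codes as well), printable ASCII counts as text, and everything else is binary. `classify` is a hypothetical name.

```python
import string
from typing import Union

DNA_SYMBOLS = frozenset("ACGT")
AMINO_SYMBOLS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def classify(data: Union[str, bytes]) -> str:
    """Return 'dna', 'protein', 'text', or 'binary' under the assumptions above."""
    if isinstance(data, bytes):
        try:
            text = data.decode("ascii")
        except UnicodeDecodeError:
            return "binary"
    else:
        text = data
    symbols = set(text.upper())
    # DNA must be checked first: every DNA string is also a valid protein string.
    if symbols and symbols <= DNA_SYMBOLS:
        return "dna"
    if symbols and symbols <= AMINO_SYMBOLS:
        return "protein"
    if all(ch in string.printable for ch in text):
        return "text"
    return "binary"
```

The check order matters: swapping the first two tests would label every DNA sequence as protein.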
- noLZSS.genomics.sequences.factorize_dna_w_reference_seq(reference_seq: str | bytes, target_seq: str | bytes, validate: bool = True)
Factorize target DNA sequence using a reference sequence with reverse complement awareness.
Concatenates a reference sequence and target sequence, then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins. This allows the target sequence to reference patterns in the reference sequence without factorizing the reference itself.
- Parameters:
reference_seq – Reference DNA sequence (A, C, T, G - case insensitive)
target_seq – Target DNA sequence to be factorized (A, C, T, G - case insensitive)
validate – Whether to perform input validation (default: True)
- Returns:
List of (start, length, ref, is_rc) tuples representing the factorization of target sequence
- Raises:
ValueError – If sequences contain invalid nucleotides or are empty
TypeError – If input types are not supported
RuntimeError – If processing errors occur
Note
Factor start positions are relative to the beginning of the target sequence.
Both sequences are converted to uppercase before factorization.
The ref field has RC_MASK cleared; the is_rc boolean indicates whether the factor is a reverse complement match.
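The (start, length, ref, is_rc) tuples can be post-processed with ordinary Python. The sketch below assumes only the tuple layout documented above; the helper names and the sample factors are hypothetical. It checks that the factors tile the target without gaps, which holds for any LZ-style factorization, and totals the covered length per strand.

```python
from typing import Dict, List, Tuple

Factor = Tuple[int, int, int, bool]  # (start, length, ref, is_rc)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def factors_tile_contiguously(factors: List[Factor], first_start: int = 0) -> bool:
    """True if each factor begins exactly where the previous one ended."""
    expected = first_start
    for start, length, _ref, _is_rc in factors:
        if start != expected or length <= 0:
            return False
        expected = start + length
    return True


def covered_length_by_strand(factors: List[Factor]) -> Dict[str, int]:
    """Total target length covered by forward and reverse complement factors."""
    totals = {"forward": 0, "reverse_complement": 0}
    for _start, length, _ref, is_rc in factors:
        totals["reverse_complement" if is_rc else "forward"] += length
    return totals


# Hypothetical factors, not produced by the library.
demo_factors = [(0, 3, 0, False), (3, 2, 1, True), (5, 1, 4, False)]
print(factors_tile_contiguously(demo_factors))
print(covered_length_by_strand(demo_factors))
```

`first_start` is a parameter so the same check also works for outputs whose positions are absolute in the combined reference+target string.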
- noLZSS.genomics.sequences.factorize_dna_w_reference_seq_file(reference_seq: str | bytes, target_seq: str | bytes, output_path: str | Path, validate: bool = True) → int
Factorize target DNA sequence using a reference sequence and write factors to binary file.
Concatenates a reference sequence and target sequence, then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins, and writes the resulting factors to a binary file.
- Parameters:
reference_seq – Reference DNA sequence (A, C, T, G - case insensitive)
target_seq – Target DNA sequence to be factorized (A, C, T, G - case insensitive)
output_path – Path to output file where binary factors will be written
validate – Whether to perform input validation (default: True)
- Returns:
Number of factors written to the output file
- Raises:
ValueError – If sequences contain invalid nucleotides or are empty
TypeError – If input types are not supported
RuntimeError – If unable to create output file or processing errors occur
Note
Factor start positions are relative to the beginning of the target sequence. Binary format follows the same structure as other DNA factorization binary outputs. This function overwrites the output file if it exists.
Plotting and Visualization
FASTA file plotting utilities.
This module provides functions for creating plots and visualizations from FASTA files and their factorizations.
- exception noLZSS.genomics.plots.PlotError
Bases: NoLZSSError
Raised when plotting operations fail.
- noLZSS.genomics.plots.plot_single_seq_accum_factors_from_file(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, output_dir: str | Path | None = None, max_sequences: int | None = None, save_factors_text: bool = True, save_factors_binary: bool = False, min_factor_length: int = 1) → Dict[str, Dict[str, Any]]
Process a FASTA file or binary factors file, factorize sequences (if needed), create plots, and save results.
For each sequence:
- If FASTA file: reads sequences, factorizes them, and saves factor data and plots
- If binary factors file: reads existing factors and creates plots
- Parameters:
fasta_filepath – Path to input FASTA file (mutually exclusive with factors_filepath)
factors_filepath – Path to binary factors file (mutually exclusive with fasta_filepath)
output_dir – Directory to save all output files (required for FASTA, optional for binary)
max_sequences – Maximum number of sequences to process (None for all)
save_factors_text – Whether to save factors as text files (only for FASTA input)
save_factors_binary – Whether to save factors as binary files (only for FASTA input)
min_factor_length – Minimum factor length to include in analysis (default: 1)
- Returns:
Dictionary with processing results for each sequence:
{
    ‘sequence_id’: {
        ‘sequence_length’: int,
        ‘num_factors’: int,
        ‘factors_file’: str,  # path to saved factors
        ‘plot_file’: str,  # path to saved plot
        ‘factors’: List[Tuple[int, int, int]]  # the factors
    }
}
- Raises:
PlotError – If file processing fails
FileNotFoundError – If input file doesn’t exist
ValueError – If both or neither input files are provided, or if output_dir is missing for FASTA input
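The data behind an accumulated-factors plot can be derived from the factor tuples alone. The sketch below assumes the plot shows the running factor count against the position where each factor ends, and that `min_factor_length` simply drops shorter factors before counting; both are assumptions about the plot, and `accumulated_factor_curve` is a hypothetical helper.

```python
from typing import List, Tuple


def accumulated_factor_curve(
    factors: List[Tuple[int, int, int]],
    min_factor_length: int = 1,
) -> List[Tuple[int, int]]:
    """Return (end_position, cumulative_factor_count) points for plotting."""
    points = []
    count = 0
    for start, length, _ref in factors:
        if length < min_factor_length:
            continue  # filtered factors do not advance the count
        count += 1
        points.append((start + length, count))
    return points


# Hypothetical (start, length, ref) factors for an 8-symbol sequence.
demo = [(0, 1, 0), (1, 1, 0), (2, 4, 0), (6, 2, 1)]
print(accumulated_factor_curve(demo))
print(accumulated_factor_curve(demo, min_factor_length=2))
```

A highly repetitive sequence produces a curve that flattens quickly, since long factors cover many positions while adding only one to the count.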
- noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_from_file(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False, min_factor_length: int = 1) → Any | None
Create an interactive Datashader/Panel factor plot for multiple DNA sequences from a FASTA file or binary factors file.
This function reads factors either from a FASTA file (by factorizing multiple DNA sequences) or from an enhanced binary factors file with metadata. It creates a high-performance interactive plot using Datashader and Panel with level-of-detail rendering, zoom/pan-aware decimation, hover functionality, and sequence boundaries visualization.
- Parameters:
fasta_filepath – Path to the FASTA file containing DNA sequences (mutually exclusive with factors_filepath)
factors_filepath – Path to binary factors file with metadata (mutually exclusive with fasta_filepath)
name – Optional name for the plot title (defaults to input filename)
save_path – Optional path to save the plot (supports .html or .png; PNG export requires optional selenium-based dependencies)
show_plot – Whether to display/serve the plot
return_panel – Whether to return the Panel app for embedding
min_factor_length – Minimum factor length to include in analysis (default: 1)
- Returns:
Panel app if return_panel=True, otherwise None
- Raises:
PlotError – If plotting fails or input files cannot be processed
FileNotFoundError – If input file doesn’t exist
ImportError – If required dependencies are missing
ValueError – If both or neither input files are provided
- noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_simple(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, min_factor_length: int = 1) → None
Create a simple matplotlib factor plot for multiple DNA sequences from a FASTA file or binary factors file.
This function reads factors either from a FASTA file (by factorizing multiple DNA sequences) or from an enhanced binary factors file with metadata. It creates a static plot using matplotlib with sequence boundaries visualization - a simplified alternative to the interactive Panel/Datashader version.
- Parameters:
fasta_filepath – Path to the FASTA file containing DNA sequences (mutually exclusive with factors_filepath)
factors_filepath – Path to binary factors file with metadata (mutually exclusive with fasta_filepath)
name – Optional name for the plot title (defaults to input filename)
save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)
show_plot – Whether to display the plot
min_factor_length – Minimum factor length to include in analysis (default: 1)
- Raises:
PlotError – If plotting fails or input files cannot be processed
FileNotFoundError – If input file doesn’t exist
ImportError – If matplotlib is not available
ValueError – If both or neither input files are provided
- noLZSS.genomics.plots.plot_reference_seq_lz_factor_plot_simple(reference_seq: str | bytes | None = None, target_seq: str | bytes | None = None, factors: List[Tuple[int, int, int, bool]] | None = None, factors_filepath: str | Path | None = None, reference_name: str = 'Reference', target_name: str = 'Target', save_path: str | Path | None = None, show_plot: bool = True, factorization_mode: Literal['dna', 'general'] = 'dna') → None
Create a simple matplotlib factor plot for a sequence factorized with a reference sequence.
This function creates a plot compatible with the outputs of factorize_dna_w_reference_seq() or the general factorize_w_reference() wrapper. The plot shows the reference sequence at the beginning, concatenated with the target sequence, and uses distinct colors for reference vs target regions.
- Parameters:
reference_seq – Reference DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).
target_seq – Target DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).
factors – Optional list of (start, length, ref, is_rc) tuples from factorize_dna_w_reference_seq() or factorize_w_reference(). If None, the function will compute factors automatically based on factorization_mode.
factors_filepath – Optional path to binary factors file (mutually exclusive with factors). When provided and sequences are not, parameters are inferred from the factors: first factor start = target_start, last factor end = total_length.
reference_name – Name for the reference sequence (default: “Reference”)
target_name – Name for the target sequence (default: “Target”)
save_path – Optional path to save the plot image
show_plot – Whether to display the plot
factorization_mode – Choose “dna” for reverse-complement-aware factorization or “general” for ASCII/general sequences without reverse complements
- Raises:
PlotError – If plotting fails or input sequences are invalid
ValueError – If both factors and factors_filepath are provided, or if sequences are not provided and factors_filepath is not provided
ImportError – If matplotlib is not available
- noLZSS.genomics.plots.plot_reference_seq_lz_factor_plot(reference_seq: str | bytes | None = None, target_seq: str | bytes | None = None, factors: List[Tuple[int, int, int, bool]] | None = None, factors_filepath: str | Path | None = None, reference_name: str = 'Reference', target_name: str = 'Target', save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False, factorization_mode: Literal['dna', 'general'] = 'dna') → Any | None
Create an interactive Datashader/Panel factor plot for a sequence factorized with a reference sequence.
This function creates a plot compatible with the outputs of factorize_dna_w_reference_seq() or the general factorize_w_reference() wrapper. The plot shows the reference sequence at the beginning, concatenated with the target sequence, and uses distinct colors for reference vs target regions.
- Parameters:
reference_seq – Reference DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).
target_seq – Target DNA sequence (A, C, T, G - case insensitive) or general ASCII text when factorization_mode is “general”. Optional if factors_filepath is provided (parameters will be inferred from factors).
factors – Optional list of (start, length, ref, is_rc) tuples from factorize_dna_w_reference_seq() or factorize_w_reference(). If None, the function will compute factors automatically based on factorization_mode.
factors_filepath – Optional path to binary factors file (mutually exclusive with factors). When provided and sequences are not, parameters are inferred from the factors: first factor start = target_start, last factor end = total_length.
reference_name – Name for the reference sequence (default: “Reference”)
target_name – Name for the target sequence (default: “Target”)
save_path – Optional path to save the plot image (PNG export)
show_plot – Whether to display/serve the plot
return_panel – Whether to return the Panel app for embedding
factorization_mode – Choose “dna” for reverse-complement-aware factorization or “general” for ASCII/general sequences without reverse complements
- Returns:
Panel app if return_panel=True, otherwise None
- Raises:
PlotError – If plotting fails or input sequences are invalid
ValueError – If both factors and factors_filepath are provided, or if sequences are not provided and factors_filepath is not provided
ImportError – If required dependencies are missing
- noLZSS.genomics.plots.plot_strand_bias_heatmap(fasta_filepath: str | Path | None = None, factors_filepath: str | Path | None = None, name: str | None = None, grid_size: int | Tuple[int, int] = 50, save_path: str | Path | None = None, show_plot: bool = True, min_factor_length: int = 1) → None
Visualize forward vs reverse-complement bias across the factor map.
The plot partitions the factor plane (target position vs reference position) into a square grid (default 50x50). Each bin accumulates nucleotide coverage from factors that overlap that bin; contributions are split when factors cross bin boundaries. Color encodes the log2 ratio between forward and reverse-complement coverage, normalized by the total coverage of each strand so that global strand imbalances are accounted for.
- Parameters:
fasta_filepath – FASTA file to factorize (mutually exclusive with factors_filepath).
factors_filepath – Enhanced binary factors file with metadata (mutually exclusive with fasta_filepath).
name – Optional label for the plot title (defaults to input stem).
grid_size – Number of bins per axis (int) or explicit (x_bins, y_bins) tuple. Default: 50.
save_path – Optional path to save the heatmap image.
show_plot – Whether to display the plot.
min_factor_length – Minimum factor length to include in analysis (default: 1)
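The binning and normalization described for this heatmap can be illustrated in one dimension. The sketch below bins coverage along a single axis only (the actual plot uses a 2D grid of target vs reference position), splits each interval across the bins it overlaps, and computes a per-bin log2 ratio of strand-normalized coverage. The pseudocount that avoids division by zero is an assumption, and both function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def bin_coverage(
    intervals: Sequence[Tuple[int, int]], total_length: int, n_bins: int
) -> List[float]:
    """Distribute each (start, length) interval over equal-width bins."""
    width = total_length / n_bins
    coverage = [0.0] * n_bins
    for start, length in intervals:
        end = start + length
        for b in range(n_bins):
            # Overlap between [start, end) and bin b; split at bin boundaries.
            overlap = min(end, (b + 1) * width) - max(start, b * width)
            if overlap > 0:
                coverage[b] += overlap
    return coverage


def log2_strand_bias(
    forward: Sequence[float], reverse: Sequence[float], pseudocount: float = 1e-9
) -> List[float]:
    """Per-bin log2 ratio after normalizing each strand by its total coverage."""
    f_total = sum(forward) or 1.0
    r_total = sum(reverse) or 1.0
    return [
        math.log2((f / f_total + pseudocount) / (r / r_total + pseudocount))
        for f, r in zip(forward, reverse)
    ]
```

Normalizing by each strand's total means a genome with three times more reverse-complement coverage overall still scores 0 in bins where the two strands follow the same spatial pattern.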
- noLZSS.genomics.plots.plot_factor_length_ccdf(factors_filepath: str | Path, save_path: str | Path | None = None, show_plot: bool = True, separate: bool = True, min_factor_length: int = 1) → None
Create an empirical CCDF plot of factor lengths on log-log axes from a binary factors file.
This function reads factors from a binary file and plots the complementary cumulative distribution function (CCDF) of factor lengths. Forward and reverse complement factors can be plotted separately or together on the same axes with different colors.
- Parameters:
factors_filepath – Path to binary factors file with metadata
save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)
show_plot – Whether to display the plot
separate – Whether to plot forward and reverse complement factors separately (default: True). If False, both are plotted on the same axes with different colors.
min_factor_length – Minimum factor length to include in analysis (default: 1)
- Raises:
PlotError – If file reading or plotting fails
FileNotFoundError – If factors file doesn’t exist
ImportError – If matplotlib is not available
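For reference, an empirical CCDF can be computed from a list of factor lengths in a few lines. The sketch below uses the convention P(L ≥ x); whether the library uses ≥ or > is not stated here, so treat that choice as an assumption. `empirical_ccdf` is a hypothetical helper, not part of noLZSS.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def empirical_ccdf(
    lengths: Sequence[int], min_factor_length: int = 1
) -> List[Tuple[int, float]]:
    """Return (length, P(L >= length)) pairs for each distinct length."""
    kept = [n for n in lengths if n >= min_factor_length]
    if not kept:
        return []
    counts = Counter(kept)
    total = len(kept)
    remaining = total  # number of observations >= the current value
    points = []
    for value in sorted(counts):
        points.append((value, remaining / total))
        remaining -= counts[value]
    return points


print(empirical_ccdf([1, 1, 2, 4]))
```

Plotting these points on log-log axes makes a heavy tail of long factors appear as a slowly decaying line rather than being compressed against the axis.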
- noLZSS.genomics.plots.plot_space_scale_heatmap(factors_filepath: str | Path, save_path: str | Path | None = None, show_plot: bool = True, genome_bin_size: float = 1.0, length_log_base: float = 2.0, separate_strands: bool = True, show_marginal_ccdf: bool = True, sequence_index: int | None = None, cmap: str = 'viridis', min_factor_length: int = 1) → None
Create a space-scale heatmap showing factor length distribution across genomic positions.
This function creates a 2D heatmap where:
- X-axis: genomic position (binned into windows)
- Y-axis: factor length (log-binned)
- Cell color: CCDF-weighted factor count (emphasizes rare long factors)
The heatmap uses CCDF normalization to address the heavy-tailed distribution of factor lengths. Each cell’s value is weighted by the inverse of the CCDF (complementary cumulative distribution function) at that length, making rare long factors as visible as abundant short factors.
Forward and reverse-complement factors can be plotted separately or together. Optional marginal CCDF plots show the global length distribution per strand.
- Parameters:
factors_filepath – Path to binary factors file with metadata
save_path – Optional path to save the plot image (PNG, PDF, SVG, etc.)
show_plot – Whether to display the plot
genome_bin_size – Size of genomic position bins in megabases (default: 1.0 Mb)
length_log_base – Base for logarithmic binning of factor lengths (default: 2.0)
separate_strands – Whether to create separate heatmaps for forward and reverse complement factors (default: True). If False, combines both on one heatmap.
show_marginal_ccdf – Whether to add marginal CCDF plots showing global length distribution per strand (default: True)
sequence_index – Optional index to select a specific sequence from multi-sequence files (0-based). If None, uses all sequences concatenated.
cmap – Matplotlib colormap name (default: ‘viridis’)
min_factor_length – Minimum factor length to include in analysis (default: 1)
- Raises:
PlotError – If file reading or plotting fails
FileNotFoundError – If factors file doesn’t exist
ImportError – If required dependencies are not available
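The grid behind such a heatmap can be sketched as follows. The code assigns each factor to a position bin (in megabases) and a logarithmic length bin, then adds a weight of 1 / CCDF(length) to that cell. How the library computes its weights exactly (per factor or per bin, and which CCDF convention) is not specified here, so those details are assumptions, and all function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def length_bin(length: int, base: float = 2.0) -> int:
    """Index k such that base**k <= length < base**(k + 1)."""
    if length < 1 or base <= 1:
        raise ValueError("length must be >= 1 and base must be > 1")
    k, threshold = 0, base
    while threshold <= length:
        threshold *= base
        k += 1
    return k


def position_bin(start: int, genome_bin_size_mb: float = 1.0) -> int:
    """Index of the genomic window that contains the factor start."""
    return int(start // (genome_bin_size_mb * 1_000_000))


def ccdf_weighted_grid(
    factors: Sequence[Tuple[int, int]],
    genome_bin_size_mb: float = 1.0,
    base: float = 2.0,
) -> Dict[Tuple[int, int], float]:
    """Accumulate 1 / P(L >= length) per (position_bin, length_bin) cell."""
    lengths: List[int] = sorted(length for _start, length in factors)
    total = len(lengths)
    grid: Dict[Tuple[int, int], float] = defaultdict(float)
    for start, length in factors:
        at_least = total - bisect_left(lengths, length)  # count of lengths >= length
        cell = (position_bin(start, genome_bin_size_mb), length_bin(length, base))
        grid[cell] += total / at_least
    return dict(grid)


# Hypothetical (start, length) factors.
print(ccdf_weighted_grid([(0, 1), (10, 1), (1_500_000, 8)]))
```

In this toy input the single length-8 factor receives weight 3, so its cell is as prominent as the cell holding two length-1 factors.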
Per-sequence Complexity Tables
noLZSS.genomics.batch_factorize now exposes a lightweight mode for computing the DNA LZSS complexity of each FASTA record with and without reverse complement awareness:
python -m noLZSS.genomics.batch_factorize my_sequences.fasta \
--complexity-tsv results/complexity.tsv \
--complexity-threads 8
The generated TSV contains three columns:
sequence_id – the exact FASTA header for the sequence
complexity_w_rc – factor count when reverse complements are allowed
complexity_no_rc – factor count without reverse complement matching
The command accepts local files or URLs (with optional --download-dir). No factor files are written when --complexity-tsv is supplied.
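Once the TSV exists, it can be read with the standard library. The sketch below assumes the file starts with a header row naming the three columns listed above, which is not confirmed here. `load_complexity_table` and the sample rows are hypothetical, and `rc_difference` is a derived value added for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO, Union


def load_complexity_table(handle: TextIO) -> List[Dict[str, Union[str, int]]]:
    """Parse the three-column complexity TSV (header row assumed)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        with_rc = int(row["complexity_w_rc"])
        without_rc = int(row["complexity_no_rc"])
        rows.append(
            {
                "sequence_id": row["sequence_id"],
                "complexity_w_rc": with_rc,
                "complexity_no_rc": without_rc,
                # Difference in factor count between the two modes.
                "rc_difference": without_rc - with_rc,
            }
        )
    return rows


# Hypothetical file contents; real values come from batch_factorize.
sample = io.StringIO(
    "sequence_id\tcomplexity_w_rc\tcomplexity_no_rc\n"
    "seq_a demo header\t120\t150\n"
    "seq_b\t75\t75\n"
)
for record in load_complexity_table(sample):
    print(record)
```

For a file on disk, pass `open("results/complexity.tsv", newline="")` instead of the `io.StringIO` object.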