Python API Reference
Core Functions
Core Python wrappers for noLZSS C++ functionality.
This module provides enhanced Python wrappers around the C++ factorization functions, adding input validation, error handling, and convenience features.
- noLZSS.core.factorize(data: str | bytes, validate: bool = True) → List[Tuple[int, int, int]]
Factorize a string or bytes object into LZ factors.
- Parameters:
data – Input string or bytes to factorize
validate – Whether to perform input validation (default: True)
- Returns:
List of (position, length, ref) tuples representing the factorization
- Raises:
ValueError – If the input is invalid (for example, empty)
TypeError – If input type is not supported
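To make the shape of the returned tuples concrete, the following is a minimal pure-Python sketch of a greedy non-overlapping LZ factorization. It is not the noLZSS C++ implementation and runs in cubic time. The function name `naive_nonoverlapping_lz` is hypothetical, and two points are assumptions rather than documented behavior: that `ref` is the start of the earlier copy, and that a previously unseen symbol becomes a length-1 factor whose `ref` equals its own position.

```python
from typing import List, Tuple


def naive_nonoverlapping_lz(text: bytes) -> List[Tuple[int, int, int]]:
    """Greedy non-overlapping LZ factorization; illustration only, O(n^3)."""
    factors: List[Tuple[int, int, int]] = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_ref = 0, i
        # Try every earlier start j. The copied span must end before i,
        # which is what makes the factorization non-overlapping.
        for j in range(i):
            k = 0
            while i + k < n and j + k < i and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_ref = k, j
        if best_len == 0:
            # Assumption: a new symbol is emitted as a length-1 factor
            # that refers to its own position.
            best_len, best_ref = 1, i
        factors.append((i, best_len, best_ref))
        i += best_len
    return factors


example_factors = naive_nonoverlapping_lz(b"abababab")
```

Under these assumptions the factor lengths always sum to the input length, and each factor starts where the previous one ended.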
- noLZSS.core.factorize_file(filepath: str | Path, reserve_hint: int = 0) → List[Tuple[int, int, int]]
Factorize the contents of a file into LZ factors.
- Parameters:
filepath – Path to the input file
reserve_hint – Optional hint for reserving space in output vector (0 = no hint)
- Returns:
List of (position, length, ref) tuples representing the factorization
- Raises:
FileNotFoundError – If the file doesn’t exist
- noLZSS.core.count_factors(data: str | bytes, validate: bool = True) → int
Count the number of factors in a string without computing the full factorization.
- Parameters:
data – Input string or bytes to analyze
validate – Whether to perform input validation (default: True)
- Returns:
Number of factors in the factorization
- Raises:
ValueError – If input is invalid
TypeError – If input type is not supported
- noLZSS.core.count_factors_file(filepath: str | Path, validate: bool = True) → int
Count the number of factors in a file without computing the full factorization.
- Parameters:
filepath – Path to the input file
validate – Whether to perform input validation (default: True)
- Returns:
Number of factors in the factorization
- Raises:
FileNotFoundError – If the file doesn’t exist
ValueError – If file contents are invalid
- noLZSS.core.write_factors_binary_file(data: str | bytes, output_filepath: str | Path) → None
Factorize input and write the factors to a binary file.
- Parameters:
data – Input string or bytes to factorize
output_filepath – Path where to write the binary factors
- Raises:
ValueError – If input is invalid
TypeError – If input type is not supported
OSError – If unable to write to output file
- noLZSS.core.factorize_with_info(data: str | bytes, validate: bool = True) → dict
Factorize input and return both factors and additional information.
- Parameters:
data – Input string or bytes to factorize
validate – Whether to perform input validation (default: True)
- Returns:
Dictionary containing:
‘factors’: List of (position, length, ref) tuples
‘alphabet_info’: Alphabet analysis results
‘input_size’: Size of input data
‘num_factors’: Number of factors
- Return type:
dict
- noLZSS.core.factorize_w_reference(reference_seq: str | bytes, target_seq: str | bytes, validate: bool = True) → List[Tuple[int, int, int]]
Factorize target sequence using a reference sequence without reverse complement.
Concatenates a reference sequence and target sequence, then performs noLZSS factorization starting from where the target sequence begins. This allows the target sequence to reference patterns in the reference sequence without factorizing the reference itself. Suitable for general text or amino acid sequences.
- Parameters:
reference_seq – Reference sequence (any text)
target_seq – Target sequence to be factorized (any text)
validate – Whether to perform input validation (default: True)
- Returns:
List of (start, length, ref) tuples representing the factorization of target sequence
- Raises:
ValueError – If sequences are empty
TypeError – If input types are not supported
RuntimeError – If processing errors occur
Note
Factor start positions are absolute positions in the combined reference+target string. No reverse complement matching is performed - suitable for text or amino acid sequences.
Warning
The sentinel character ‘\x01’ (ASCII 1) must not appear in either input sequence, as it is used internally to separate the reference and target sequences.
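The concatenate-then-factorize idea described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `naive_factorize_with_reference` is hypothetical, the layout `reference + sentinel + target` is assumed from the warning about the separator, and new symbols are assumed to become length-1 factors that refer to themselves.

```python
from typing import List, Tuple

SENTINEL = b"\x01"  # separator byte (ASCII 1) named in the warning above


def naive_factorize_with_reference(
    reference: bytes, target: bytes
) -> List[Tuple[int, int, int]]:
    """Factorize only the target; matches may point into the reference."""
    if not reference or not target:
        raise ValueError("reference and target must be non-empty")
    if SENTINEL in reference or SENTINEL in target:
        raise ValueError("inputs must not contain the sentinel byte 0x01")

    # Assumed layout: reference, then one sentinel byte, then target.
    combined = reference + SENTINEL + target
    i = len(reference) + 1  # first position of the target in the combined string
    n = len(combined)

    factors: List[Tuple[int, int, int]] = []
    while i < n:
        best_len, best_ref = 0, i
        # Earlier starts j include the whole reference and the already
        # factorized part of the target; the copy must end before i.
        for j in range(i):
            k = 0
            while i + k < n and j + k < i and combined[j + k] == combined[i + k]:
                k += 1
            if k > best_len:
                best_len, best_ref = k, j
        if best_len == 0:
            # Assumption: an unseen symbol is a length-1 self-referencing factor.
            best_len, best_ref = 1, i
        factors.append((i, best_len, best_ref))  # absolute positions
        i += best_len
    return factors


ref_factors = naive_factorize_with_reference(b"ACGT", b"ACGTAC")
```

Because the start positions are absolute, the first factor in this sketch starts at `len(reference) + 1` rather than at 0, and the factor lengths sum to the length of the target alone.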
- noLZSS.core.factorize_w_reference_file(reference_seq: str | bytes, target_seq: str | bytes, output_path: str | Path, validate: bool = True) → int
Factorize target sequence using a reference sequence and write factors to binary file.
Concatenates a reference sequence and target sequence, then performs noLZSS factorization starting from where the target sequence begins, and writes the resulting factors to a binary file. Suitable for general text or amino acid sequences.
- Parameters:
reference_seq – Reference sequence (any text)
target_seq – Target sequence to be factorized (any text)
output_path – Path to output file where binary factors will be written
validate – Whether to perform input validation (default: True)
- Returns:
Number of factors written to the output file
- Raises:
ValueError – If sequences are empty
TypeError – If input types are not supported
RuntimeError – If unable to create output file or processing errors occur
Note
Factor start positions are absolute positions in the combined reference+target string. No reverse complement matching is performed - suitable for text or amino acid sequences. This function overwrites the output file if it exists.
Warning
The sentinel character ‘\x01’ (ASCII 1) must not appear in either input sequence, as it is used internally to separate the reference and target sequences.
Utilities
Utility functions for input validation, alphabet analysis, file I/O helpers, and visualization.
This module provides reusable utilities for the noLZSS package, including input validation, sentinel handling, alphabet analysis, binary file I/O, and plotting functions.
- exception noLZSS.utils.NoLZSSError
Bases: Exception
Base exception for noLZSS-related errors.
- exception noLZSS.utils.InvalidInputError
Bases: NoLZSSError
Raised when input data is invalid for factorization.
- noLZSS.utils.validate_input(data: str | bytes) → bytes
Validate and normalize input data for factorization.
- Parameters:
data – Input string or bytes to validate
- Returns:
Normalized bytes data
- Raises:
InvalidInputError – If input is invalid
TypeError – If input type is not supported
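The validate-and-normalize contract above, together with the exception hierarchy, can be sketched as follows. This is a stand-in written for illustration, not the library's code: the classes are local copies of the documented names, `validate_input_sketch` is a hypothetical name, and both the UTF-8 encoding of `str` input and the rejection of empty input are assumptions.

```python
from typing import Union


class NoLZSSError(Exception):
    """Local stand-in for noLZSS.utils.NoLZSSError."""


class InvalidInputError(NoLZSSError):
    """Local stand-in for noLZSS.utils.InvalidInputError."""


def validate_input_sketch(data: Union[str, bytes]) -> bytes:
    """Return the input as bytes, or raise on unsupported or empty input."""
    if isinstance(data, str):
        # Assumption: text is encoded as UTF-8; the library may differ.
        data = data.encode("utf-8")
    elif not isinstance(data, bytes):
        raise TypeError(f"expected str or bytes, got {type(data).__name__}")
    if not data:
        # Assumption: empty input counts as invalid.
        raise InvalidInputError("input must not be empty")
    return data


normalized = validate_input_sketch("banana")
```

Since `InvalidInputError` derives from `NoLZSSError`, a caller can catch the base class to handle every package-specific failure in one place while still letting `TypeError` propagate.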
- noLZSS.utils.analyze_alphabet(data: str | bytes) → Dict[str, Any]
Analyze the alphabet of input data.
- Parameters:
data – Input string or bytes to analyze
- Returns:
Dictionary containing alphabet analysis:
‘size’: Number of unique characters/bytes
‘characters’: Set of unique characters/bytes
‘distribution’: Counter of character/byte frequencies
‘entropy’: Shannon entropy of the data
‘most_common’: List of (char, count) tuples for most frequent characters
- Return type:
Dict[str, Any]
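The keys listed above can be reproduced with the standard library, which may help when interpreting the returned values. The sketch below is not the library's implementation: `analyze_alphabet_sketch` and its `top` parameter are hypothetical, and measuring entropy in bits (base-2 logarithm) is an assumption.

```python
import math
from collections import Counter
from typing import Any, Dict, Union


def analyze_alphabet_sketch(data: Union[str, bytes], top: int = 10) -> Dict[str, Any]:
    """Build a dictionary with the keys documented for analyze_alphabet."""
    counts = Counter(data)  # characters for str, integer byte values for bytes
    total = sum(counts.values())
    # Shannon entropy, H = -sum(p * log2(p)); assumed to be in bits.
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return {
        "size": len(counts),
        "characters": set(counts),
        "distribution": counts,
        "entropy": entropy,
        "most_common": counts.most_common(top),
    }


info = analyze_alphabet_sketch("AACG")
```

For the input "AACG" the symbol probabilities are 1/2, 1/4, and 1/4, so the entropy in this sketch is 1.5 bits.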
- noLZSS.utils.read_factors_binary_file(filepath: str | Path) → List[Tuple[int, int, int]]
Read factors from a binary file written by write_factors_binary_file.
- Parameters:
filepath – Path to the binary factors file
- Returns:
List of (position, length, ref) tuples
- Raises:
NoLZSSError – If file cannot be read or has invalid format
- noLZSS.utils.read_binary_file_metadata(filepath: str | Path) → Dict[str, Any]
Read only metadata from a binary file without loading all factors.
This function efficiently reads just the metadata (sequence names, sentinel indices, and counts) from the footer of binary files, without loading the factor data. This is useful for quickly inspecting file contents or gathering statistics.
- Parameters:
filepath – Path to the binary factors file with metadata
- Returns:
Dictionary containing:
‘sentinel_factor_indices’: List of factor indices that are sentinels
‘sequence_names’: List of sequence names from FASTA headers
‘num_sequences’: Number of sequences
‘num_sentinels’: Number of sentinel factors
‘num_factors’: Total number of factors in the file
- Return type:
Dict[str, Any]
- Raises:
NoLZSSError – If file cannot be read or has invalid format
- noLZSS.utils.read_factors_binary_file_with_metadata(filepath: str | Path) → Dict[str, Any]
Read factors from an enhanced binary file with metadata (sequence names and sentinel indices).
This function reads binary files written by write_factors_binary_file_fasta_multiple_dna_* functions that contain metadata including sequence names and sentinel factor indices.
- Parameters:
filepath – Path to the binary factors file with metadata
- Returns:
Dictionary containing:
‘factors’: List of (start, length, ref, is_rc) tuples
‘sentinel_factor_indices’: List of factor indices that are sentinels
‘sequence_names’: List of sequence names from FASTA headers
‘num_sequences’: Number of sequences
‘num_sentinels’: Number of sentinel factors
- Return type:
Dict[str, Any]
- Raises:
NoLZSSError – If file cannot be read or has invalid format
- noLZSS.utils.plot_factor_lengths(factors_or_file: List[Tuple[int, int, int]] | str | Path, save_path: str | Path | None = None, show_plot: bool = True) → None
Plot the cumulative factor lengths vs factor index.
Creates a scatter plot where:
- X-axis: Cumulative sum of factor lengths
- Y-axis: Factor index (number of factors)
- Parameters:
factors_or_file – Either a list of (position, length, ref) tuples or path to binary factors file
save_path – Optional path to save the plot image (e.g., ‘plot.png’)
show_plot – Whether to display the plot (default: True)
- Raises:
NoLZSSError – If binary file cannot be read
TypeError – If input type is invalid
ValueError – If no factors to plot
- Warns:
UserWarning – If matplotlib is not installed (function returns gracefully)
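The plotted points can be derived from a factor list without matplotlib, which is useful for checking what the axes mean. The sketch below only computes the coordinates. `cumulative_length_points` is a hypothetical helper, and counting factors from 1 on the Y-axis is an assumption based on the "number of factors" wording.

```python
from itertools import accumulate
from typing import List, Tuple


def cumulative_length_points(
    factors: List[Tuple[int, int, int]]
) -> List[Tuple[int, int]]:
    """Return (cumulative length, factor count) pairs, one per factor."""
    lengths = [length for _, length, _ in factors]
    xs = list(accumulate(lengths))         # X-axis: running sum of lengths
    ys = list(range(1, len(factors) + 1))  # Y-axis: factors seen so far (1-based, assumed)
    return list(zip(xs, ys))


points = cumulative_length_points([(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)])
```

The last X value equals the total length covered by the factors, so a curve that flattens out indicates long factors, that is, highly repetitive input.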
Main Package
noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization.
A high-performance Python package with C++ core for computing non-overlapping LZ factorizations of strings and files.
Exception Classes
NoLZSSError
InvalidInputError
- class noLZSS.InvalidInputError
Bases: NoLZSSError
Raised when input data is invalid for factorization.