Python API Reference

Core Functions

Core Python wrappers for noLZSS C++ functionality.

This module provides enhanced Python wrappers around the C++ factorization functions, adding input validation, error handling, and convenience features.

noLZSS.core.factorize(data: str | bytes, validate: bool = True) -> List[Tuple[int, int, int]]

Factorize a string or bytes object into LZ factors.

Parameters:
  • data – Input string or bytes to factorize

  • validate – Whether to perform input validation (default: True)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:
  • ValueError – If input is invalid (empty, etc.)

  • TypeError – If input type is not supported
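To make the (position, length, ref) tuples concrete, here is a naive pure-Python sketch of one common definition of a non-overlapping LZ factorization: each factor is the longest prefix of the remaining text that also occurs entirely before the factor's start, or a single literal character when no such match exists. The function name, the choice that a literal references its own position, and the tie-breaking toward the earliest match are assumptions made for illustration. This is not the package's C++ algorithm and its output may differ from factorize in details.

```python
from typing import List, Tuple


def naive_nonoverlapping_lz(text: str) -> List[Tuple[int, int, int]]:
    """Greedy left-to-right factorization in O(n^3) time, for illustration only."""
    factors: List[Tuple[int, int, int]] = []
    i, n = 0, len(text)
    while i < n:
        # Assumption: a literal factor uses its own position as the reference.
        best_len, best_ref = 0, i
        for j in range(i):
            length = 0
            # "j + length < i" keeps the source copy entirely before position i,
            # which is what makes the factor non-overlapping.
            while (
                i + length < n
                and j + length < i
                and text[j + length] == text[i + length]
            ):
                length += 1
            if length > best_len:  # strict ">" keeps the earliest best match
                best_len, best_ref = length, j
        if best_len == 0:
            best_len = 1  # no earlier occurrence: emit one literal character
        factors.append((i, best_len, best_ref))
        i += best_len
    return factors


print(naive_nonoverlapping_lz("abab"))  # [(0, 1, 0), (1, 1, 1), (2, 2, 0)]
```

Under this definition the factors tile the input: each position equals the previous position plus the previous length, and the lengths sum to the input size.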

noLZSS.core.factorize_file(filepath: str | Path, reserve_hint: int = 0) -> List[Tuple[int, int, int]]

Factorize the contents of a file into LZ factors.

Parameters:
  • filepath – Path to the input file

  • reserve_hint – Optional hint for reserving space in output vector (0 = no hint)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:

FileNotFoundError – If the file doesn’t exist

noLZSS.core.count_factors(data: str | bytes, validate: bool = True) -> int

Count the number of factors in a string without computing the full factorization.

Parameters:
  • data – Input string or bytes to analyze

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:
  • ValueError – If input is invalid

  • TypeError – If input type is not supported

noLZSS.core.count_factors_file(filepath: str | Path, validate: bool = True) -> int

Count the number of factors in a file without computing the full factorization.

Parameters:
  • filepath – Path to the input file

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:
  • FileNotFoundError – If the file doesn’t exist

  • ValueError – If file contents are invalid

noLZSS.core.write_factors_binary_file(data: str | bytes, output_filepath: str | Path) -> None

Factorize input and write the factors to a binary file.

Parameters:
  • data – Input string or bytes to factorize

  • output_filepath – Path where to write the binary factors

Raises:
  • ValueError – If input is invalid

  • TypeError – If input type is not supported

  • OSError – If unable to write to output file

noLZSS.core.factorize_with_info(data: str | bytes, validate: bool = True) -> dict

Factorize input and return both factors and additional information.

Parameters:
  • data – Input string or bytes to factorize

  • validate – Whether to perform input validation (default: True)

Returns:

Dictionary containing:
  • 'factors': List of (position, length, ref) tuples

  • 'alphabet_info': Alphabet analysis results

  • 'input_size': Size of input data

  • 'num_factors': Number of factors
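As a hedged illustration of how the returned dictionary might be consumed, the sketch below derives two simple statistics from a dictionary that has the documented keys. The helper name summarize_info and the statistics themselves are hypothetical and not part of noLZSS. The sample dictionary is hand-written rather than produced by the library.

```python
from typing import Any, Dict


def summarize_info(info: Dict[str, Any]) -> Dict[str, float]:
    """Derive simple statistics from a factorize_with_info-style dictionary."""
    factors = info["factors"]
    num_factors = info["num_factors"]
    input_size = info["input_size"]
    if num_factors != len(factors):
        raise ValueError("'num_factors' does not match len(info['factors'])")
    total_length = sum(length for _, length, _ in factors)
    return {
        # Average number of symbols covered by one factor.
        "mean_factor_length": total_length / num_factors if num_factors else 0.0,
        # Lower values indicate more repetitive input.
        "factors_per_symbol": num_factors / input_size if input_size else 0.0,
    }


# Hand-written sample with the documented keys (not real library output).
sample = {
    "factors": [(0, 1, 0), (1, 1, 1), (2, 2, 0)],
    "alphabet_info": {},
    "input_size": 4,
    "num_factors": 3,
}
print(summarize_info(sample))
```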

noLZSS.core.factorize_w_reference(reference_seq: str | bytes, target_seq: str | bytes, validate: bool = True) -> List[Tuple[int, int, int]]

Factorize a target sequence against a reference sequence, without reverse-complement matching.

Concatenates a reference sequence and target sequence, then performs noLZSS factorization starting from where the target sequence begins. This allows the target sequence to reference patterns in the reference sequence without factorizing the reference itself. Suitable for general text or amino acid sequences.

Parameters:
  • reference_seq – Reference sequence (any text)

  • target_seq – Target sequence to be factorized (any text)

  • validate – Whether to perform input validation (default: True)

Returns:

List of (start, length, ref) tuples representing the factorization of target sequence

Raises:
  • ValueError – If sequences are empty

  • TypeError – If input types are not supported

  • RuntimeError – If processing errors occur

Note

Factor start positions are absolute positions in the combined reference+target string. No reverse-complement matching is performed, so the function is suitable for text or amino acid sequences.

Warning

The sentinel character '\x01' (ASCII 1) must not appear in either input sequence, as it is used internally to separate the reference and target sequences.
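The note and warning above suggest two small pieces of caller-side bookkeeping, sketched here in plain Python: checking inputs for the sentinel before calling the function, and shifting absolute factor starts so they are relative to the target. Both helper names are hypothetical. The reference does not state whether the sentinel counts toward the target's offset in the combined string, so target_start is left as a caller-supplied value rather than computed.

```python
from typing import List, Tuple, Union

Factor = Tuple[int, int, int]


def contains_sentinel(seq: Union[str, bytes]) -> bool:
    """Return True if the sequence contains the reserved '\\x01' character."""
    if isinstance(seq, str):
        return "\x01" in seq
    return b"\x01" in seq


def to_target_relative(factors: List[Factor], target_start: int) -> List[Factor]:
    """Shift absolute factor starts so that the target begins at position 0.

    Only the start is shifted; ref is left untouched because it may point
    into the reference part of the combined string.
    """
    return [(start - target_start, length, ref) for start, length, ref in factors]


print(contains_sentinel("ACGT"))            # False
print(contains_sentinel(b"AC\x01GT"))       # True
print(to_target_relative([(5, 2, 0), (7, 1, 3)], target_start=5))
```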

noLZSS.core.factorize_w_reference_file(reference_seq: str | bytes, target_seq: str | bytes, output_path: str | Path, validate: bool = True) -> int

Factorize target sequence using a reference sequence and write factors to binary file.

Concatenates a reference sequence and target sequence, then performs noLZSS factorization starting from where the target sequence begins, and writes the resulting factors to a binary file. Suitable for general text or amino acid sequences.

Parameters:
  • reference_seq – Reference sequence (any text)

  • target_seq – Target sequence to be factorized (any text)

  • output_path – Path to output file where binary factors will be written

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors written to the output file

Raises:
  • ValueError – If sequences are empty

  • TypeError – If input types are not supported

  • RuntimeError – If unable to create output file or processing errors occur

Note

Factor start positions are absolute positions in the combined reference+target string. No reverse-complement matching is performed, so the function is suitable for text or amino acid sequences. This function overwrites the output file if it exists.

Warning

The sentinel character '\x01' (ASCII 1) must not appear in either input sequence, as it is used internally to separate the reference and target sequences.

Utilities

Utility functions for input validation, alphabet analysis, file I/O helpers, and visualization.

This module provides reusable utilities for the noLZSS package, including input validation, sentinel handling, alphabet analysis, binary file I/O, and plotting functions.

exception noLZSS.utils.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

exception noLZSS.utils.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.

noLZSS.utils.validate_input(data: str | bytes) -> bytes

Validate and normalize input data for factorization.

Parameters:

data – Input string or bytes to validate

Returns:

Normalized bytes data

Raises:
  • InvalidInputError – If input is invalid

  • TypeError – If input type is not supported
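The documented contract (str or bytes in, bytes out, InvalidInputError for invalid data, TypeError for other types) can be sketched as below. Everything here is a stand-in: the two exception classes only mirror the documented hierarchy, the UTF-8 encoding is an assumption, and emptiness is the only validity check shown because it is the only one this reference mentions. The real function may apply further checks.

```python
from typing import Union


class NoLZSSError(Exception):
    """Stand-in mirroring the documented base exception."""


class InvalidInputError(NoLZSSError):
    """Stand-in mirroring the documented invalid-input exception."""


def validate_input_sketch(data: Union[str, bytes]) -> bytes:
    """Normalize input to bytes, following the documented contract."""
    if isinstance(data, str):
        normalized = data.encode("utf-8")  # assumption: encoding is not documented
    elif isinstance(data, bytes):
        normalized = data
    else:
        raise TypeError(f"expected str or bytes, got {type(data).__name__}")
    if not normalized:
        raise InvalidInputError("input is empty")
    return normalized


try:
    validate_input_sketch("")
except NoLZSSError as exc:
    # Catching the base class also catches InvalidInputError.
    print(f"rejected: {exc}")
```

Because InvalidInputError derives from NoLZSSError, a single except NoLZSSError clause handles both validation failures and other package errors.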

noLZSS.utils.analyze_alphabet(data: str | bytes) -> Dict[str, Any]

Analyze the alphabet of input data.

Parameters:

data – Input string or bytes to analyze

Returns:

Dictionary containing alphabet analysis:
  • 'size': Number of unique characters/bytes

  • 'characters': Set of unique characters/bytes

  • 'distribution': Counter of character/byte frequencies

  • 'entropy': Shannon entropy of the data

  • 'most_common': List of (char, count) tuples for most frequent characters
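The listed fields can be reproduced with the standard library, which may help when interpreting the values. The sketch below is an independent illustration, not the package's code: the function name is hypothetical, entropy is assumed to be measured in bits (base-2 logarithm), and the top_n cutoff for 'most_common' is an assumption because the reference does not say how many entries are returned.

```python
import math
from collections import Counter
from typing import Any, Dict, Union


def analyze_alphabet_sketch(data: Union[str, bytes], top_n: int = 10) -> Dict[str, Any]:
    """Compute alphabet statistics with the documented key names."""
    counts = Counter(data)  # iterating bytes yields ints, str yields characters
    total = sum(counts.values())
    # Shannon entropy: H = -sum(p * log2(p)) over the symbol frequencies p.
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return {
        "size": len(counts),
        "characters": set(counts),
        "distribution": counts,
        "entropy": entropy,
        "most_common": counts.most_common(top_n),
    }


info = analyze_alphabet_sketch("aabb")
print(info["size"], info["entropy"])  # 2 1.0
```

Two equally frequent symbols give exactly 1 bit of entropy, and a single repeated symbol gives 0.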

noLZSS.utils.read_factors_binary_file(filepath: str | Path) -> List[Tuple[int, int, int]]

Read factors from a binary file written by write_factors_binary_file.

Parameters:

filepath – Path to the binary factors file

Returns:

List of (position, length, ref) tuples

Raises:

NoLZSSError – If file cannot be read or has invalid format

noLZSS.utils.read_binary_file_metadata(filepath: str | Path) -> Dict[str, Any]

Read only metadata from a binary file without loading all factors.

This function efficiently reads just the metadata (sequence names, sentinel indices, and counts) from the footer of binary files, without loading the factor data. This is useful for quickly inspecting file contents or gathering statistics.

Parameters:

filepath – Path to the binary factors file with metadata

Returns:

Dictionary containing:
  • 'sentinel_factor_indices': List of factor indices that are sentinels

  • 'sequence_names': List of sequence names from FASTA headers

  • 'num_sequences': Number of sequences

  • 'num_sentinels': Number of sentinel factors

  • 'num_factors': Total number of factors in the file

Raises:

NoLZSSError – If file cannot be read or has invalid format

noLZSS.utils.read_factors_binary_file_with_metadata(filepath: str | Path) -> Dict[str, Any]

Read factors from an enhanced binary file with metadata (sequence names and sentinel indices).

This function reads binary files written by write_factors_binary_file_fasta_multiple_dna_* functions that contain metadata including sequence names and sentinel factor indices.

Parameters:

filepath – Path to the binary factors file with metadata

Returns:

Dictionary containing:
  • 'factors': List of (start, length, ref, is_rc) tuples

  • 'sentinel_factor_indices': List of factor indices that are sentinels

  • 'sequence_names': List of sequence names from FASTA headers

  • 'num_sequences': Number of sequences

  • 'num_sentinels': Number of sentinel factors

Raises:

NoLZSSError – If file cannot be read or has invalid format
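One plausible use of the sentinel indices is to group factors by sequence. The sketch below does this under a stated assumption: sentinel factors act as separators between consecutive sequences, so dropping them and cutting at their indices yields one group per sequence. The reference does not confirm this layout, and the helper name is hypothetical. The sample data is hand-written.

```python
from typing import List, Sequence, Tuple

Factor = Tuple[int, int, int, bool]  # (start, length, ref, is_rc)


def split_by_sentinels(
    factors: Sequence[Factor], sentinel_factor_indices: Sequence[int]
) -> List[List[Factor]]:
    """Cut the factor list at sentinel indices, discarding the sentinels."""
    sentinels = set(sentinel_factor_indices)
    groups: List[List[Factor]] = []
    current: List[Factor] = []
    for index, factor in enumerate(factors):
        if index in sentinels:
            groups.append(current)  # close the group that ended at this sentinel
            current = []
        else:
            current.append(factor)
    groups.append(current)  # factors after the last sentinel
    return groups


# Hand-written sample: the factor at index 2 is assumed to be a sentinel.
sample_factors = [(0, 1, 0, False), (1, 2, 0, False), (3, 1, 3, False), (4, 2, 0, True)]
for group in split_by_sentinels(sample_factors, [2]):
    print(group)
```

If the assumption holds, the number of groups should equal 'num_sequences', and the groups can be paired with 'sequence_names' using zip.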

noLZSS.utils.plot_factor_lengths(factors_or_file: List[Tuple[int, int, int]] | str | Path, save_path: str | Path | None = None, show_plot: bool = True) -> None

Plot the cumulative factor lengths vs factor index.

Creates a scatter plot where:
  • X-axis: Cumulative sum of factor lengths

  • Y-axis: Factor index (number of factors)

Parameters:
  • factors_or_file – Either a list of (position, length, ref) tuples or path to binary factors file

  • save_path – Optional path to save the plot image (e.g., ‘plot.png’)

  • show_plot – Whether to display the plot (default: True)

Raises:
  • NoLZSSError – If binary file cannot be read

  • TypeError – If input type is invalid

  • ValueError – If no factors to plot

Warns:

UserWarning – If matplotlib is not installed (function returns gracefully)
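The plotted coordinates can be computed without matplotlib, which is handy for inspecting the data or feeding another plotting tool. This is a minimal sketch under two assumptions: the helper name is hypothetical, and the factor index is taken to start at 1, reading "number of factors" as a running count.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def cumulative_length_points(
    factors: Sequence[Tuple[int, int, int]],
) -> Tuple[List[int], List[int]]:
    """Return (xs, ys): cumulative factor lengths and 1-based factor indices."""
    lengths = [length for _, length, _ in factors]
    xs = list(accumulate(lengths))            # running sum of factor lengths
    ys = list(range(1, len(lengths) + 1))     # assumption: counting starts at 1
    return xs, ys


xs, ys = cumulative_length_points([(0, 1, 0), (1, 1, 1), (2, 2, 0)])
print(xs)  # [1, 2, 4]
print(ys)  # [1, 2, 3]
```

A curve that flattens quickly, with few factors covering many symbols, indicates highly repetitive input.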

Main Package

noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization.

A high-performance Python package with C++ core for computing non-overlapping LZ factorizations of strings and files.

Exception Classes

NoLZSSError

class noLZSS.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

InvalidInputError

class noLZSS.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.