Python API Reference

Core Functions

Core Python wrappers for noLZSS C++ functionality.

This module provides enhanced Python wrappers around the C++ factorization functions, adding input validation, error handling, and convenience features.

noLZSS.core.factorize(data: str | bytes, validate: bool = True) → List[Tuple[int, int, int]]

Factorize a string or bytes object into LZ factors.

Parameters:

data – Input string or bytes to factorize
validate – Whether to perform input validation (default: True)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:

ValueError – If input is invalid (empty, etc.)
TypeError – If input type is not supported
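
To make the (position, length, ref) tuples concrete, the following is a minimal pure-Python sketch of a non-overlapping LZ factorization. It is a slow brute-force illustration, not the library's C++ algorithm. The name naive_nonoverlapping_lz is hypothetical, and the convention that a byte with no earlier match becomes a length-1 factor whose ref equals its own position is an assumption; this reference does not state how such factors are encoded.

```python
from typing import List, Tuple


def naive_nonoverlapping_lz(data: bytes) -> List[Tuple[int, int, int]]:
    """Brute-force non-overlapping LZ factorization (illustration only)."""
    factors: List[Tuple[int, int, int]] = []
    n = len(data)
    i = 0
    while i < n:
        best_len, best_ref = 0, i
        # Try every earlier start position j as the source of a copy.
        for j in range(i):
            length = 0
            # "j + length < i" keeps the source strictly before the factor,
            # which is what makes the factorization non-overlapping.
            while (i + length < n and j + length < i
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_ref = length, j
        if best_len == 0:
            # Assumed convention: no earlier match, emit a single byte
            # that refers to itself.
            best_len = 1
        factors.append((i, best_len, best_ref))
        i += best_len
    return factors


print(naive_nonoverlapping_lz(b"abab"))  # [(0, 1, 0), (1, 1, 1), (2, 2, 0)]
```

In this sketch the factor lengths always sum to the input length, and each factor starts where the previous one ended.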

noLZSS.core.factorize_file(filepath: str | Path, reserve_hint: int = 0) → List[Tuple[int, int, int]]

Factorize the contents of a file into LZ factors.

Parameters:

filepath – Path to the input file
reserve_hint – Optional hint for reserving space in output vector (0 = no hint)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:

FileNotFoundError – If the file doesn’t exist

noLZSS.core.count_factors(data: str | bytes, validate: bool = True) → int

Count the number of factors in a string or bytes object without computing the full factorization.

Parameters:

data – Input string or bytes to analyze
validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:

ValueError – If input is invalid
TypeError – If input type is not supported

noLZSS.core.count_factors_file(filepath: str | Path, validate: bool = True) → int

Count the number of factors in a file without computing the full factorization.

Parameters:

filepath – Path to the input file
validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:

FileNotFoundError – If the file doesn’t exist
ValueError – If file contents are invalid

noLZSS.core.write_factors_binary_file(data: str | bytes, output_filepath: str | Path) → None

Factorize input and write the factors to a binary file.

Parameters:

data – Input string or bytes to factorize
output_filepath – Path of the binary file to write the factors to

Raises:

ValueError – If input is invalid
TypeError – If input type is not supported
OSError – If unable to write to output file

noLZSS.core.factorize_with_info(data: str | bytes, validate: bool = True) → dict

Factorize input and return both factors and additional information.

Parameters:

data – Input string or bytes to factorize
validate – Whether to perform input validation (default: True)

Returns:

Dictionary containing:

'factors': List of (position, length, ref) tuples
'alphabet_info': Alphabet analysis results
'input_size': Size of input data
'num_factors': Number of factors

Return type:

dict
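
As an illustration of how a caller might consume a dictionary with the four documented keys, the sketch below derives two summary figures from it. The helper summarize_info is hypothetical, and the sample dictionary is written by hand for the example rather than produced by the library; its 'alphabet_info' entry is left as an empty placeholder because the exact contents are not specified here.

```python
from typing import Any, Dict


def summarize_info(info: Dict[str, Any]) -> Dict[str, Any]:
    """Derive simple statistics from a factorize_with_info-style dict."""
    factors = info["factors"]
    num_factors = info["num_factors"]
    return {
        # Average number of input symbols covered by one factor.
        "mean_factor_length": info["input_size"] / num_factors if num_factors else 0.0,
        # Length of the longest single factor (0 if there are none).
        "longest_factor": max((length for _, length, _ in factors), default=0),
    }


# Hand-written sample with the documented keys (not real library output).
sample = {
    "factors": [(0, 1, 0), (1, 1, 1), (2, 2, 0)],
    "alphabet_info": {},  # placeholder
    "input_size": 4,
    "num_factors": 3,
}
summary = summarize_info(sample)
```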

Utilities

Utility functions for input validation, alphabet analysis, file I/O helpers, and visualization.

This module provides reusable utilities for the noLZSS package, including input validation, sentinel handling, alphabet analysis, binary file I/O, and plotting functions.

exception noLZSS.utils.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

exception noLZSS.utils.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.

noLZSS.utils.validate_input(data: str | bytes) → bytes

Validate and normalize input data for factorization.

Parameters:

data – Input string or bytes to validate

Returns:

Normalized bytes data

Raises:

InvalidInputError – If input is invalid
TypeError – If input type is not supported
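
The validate-and-normalize contract described above can be sketched as follows. This is a stand-in, not the library's code: the two exception classes are local copies of the documented hierarchy, validate_input_sketch is a hypothetical name, UTF-8 encoding of str input is an assumption, and the only invalid case checked is empty input. The real function may apply further rules.

```python
from typing import Union


class NoLZSSError(Exception):
    """Local stand-in mirroring noLZSS.utils.NoLZSSError."""


class InvalidInputError(NoLZSSError):
    """Local stand-in mirroring noLZSS.utils.InvalidInputError."""


def validate_input_sketch(data: Union[str, bytes]) -> bytes:
    """Normalize str/bytes input to bytes and reject empty input."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # assumption: the encoding is not documented
    elif not isinstance(data, bytes):
        raise TypeError(f"expected str or bytes, got {type(data).__name__}")
    if not data:
        raise InvalidInputError("input must not be empty")
    return data


try:
    validate_input_sketch("")
except NoLZSSError as exc:
    # Catching the base class also catches InvalidInputError.
    caught = type(exc).__name__
```

Because InvalidInputError derives from NoLZSSError, a single except NoLZSSError clause handles both validation failures and other package errors.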

noLZSS.utils.analyze_alphabet(data: str | bytes) → Dict[str, Any]

Analyze the alphabet of input data.

Parameters:

data – Input string or bytes to analyze

Returns:

Dictionary containing alphabet analysis:

'size': Number of unique characters/bytes
'characters': Set of unique characters/bytes
'distribution': Counter of character/byte frequencies
'entropy': Shannon entropy of the data
'most_common': List of (char, count) tuples for most frequent characters

Return type:

Dict[str, Any]
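
The five documented fields can be computed with the standard library alone, as in the sketch below. It is an independent illustration, not the package's implementation: analyze_alphabet_sketch and its top parameter are hypothetical, and measuring entropy in bits (base-2 logarithm) is an assumption.

```python
import math
from collections import Counter
from typing import Any, Dict, Union


def analyze_alphabet_sketch(data: Union[str, bytes], top: int = 10) -> Dict[str, Any]:
    """Compute alphabet statistics with the same keys as documented above."""
    # Counter over str yields characters; over bytes it yields ints 0..255.
    distribution = Counter(data)
    total = sum(distribution.values())
    # Shannon entropy H = sum(p * log2(1 / p)) over symbol probabilities p.
    entropy = sum(
        (count / total) * math.log2(total / count)
        for count in distribution.values()
    ) if total else 0.0
    return {
        "size": len(distribution),
        "characters": set(distribution),
        "distribution": distribution,
        "entropy": entropy,
        "most_common": distribution.most_common(top),
    }


info = analyze_alphabet_sketch("aabb")
print(info["size"], info["entropy"])  # 2 1.0
```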

noLZSS.utils.read_factors_binary_file(filepath: str | Path) → List[Tuple[int, int, int]]

Read factors from a binary file written by write_factors_binary_file.

Parameters:

filepath – Path to the binary factors file

Returns:

List of (position, length, ref) tuples

Raises:

NoLZSSError – If file cannot be read or has invalid format

noLZSS.utils.plot_factor_lengths(factors_or_file: List[Tuple[int, int, int]] | str | Path, save_path: str | Path | None = None, show_plot: bool = True) → None

Plot the cumulative factor lengths vs factor index.

Creates a scatter plot where:

- X-axis: Cumulative sum of factor lengths
- Y-axis: Factor index (number of factors)

Parameters:

factors_or_file – Either a list of (position, length, ref) tuples or path to binary factors file
save_path – Optional path to save the plot image (e.g., ‘plot.png’)
show_plot – Whether to display the plot (default: True)

Raises:

NoLZSSError – If binary file cannot be read
TypeError – If input type is invalid
ValueError – If no factors to plot

Warns:

UserWarning – If matplotlib is not installed (function returns gracefully)
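
The coordinates described above can be derived from a factor list without any plotting library, as in the following sketch. The helper cumulative_length_points is hypothetical, and counting the factor index from 1 (so that it equals the number of factors seen so far) is an assumption based on the "number of factors" wording.

```python
from itertools import accumulate
from typing import List, Tuple


def cumulative_length_points(
    factors: List[Tuple[int, int, int]],
) -> Tuple[List[int], List[int]]:
    """Return (xs, ys): cumulative factor lengths and 1-based factor indices."""
    if not factors:
        # Mirrors the documented ValueError for an empty factor list.
        raise ValueError("no factors to plot")
    lengths = [length for _, length, _ in factors]
    xs = list(accumulate(lengths))           # running total of covered symbols
    ys = list(range(1, len(factors) + 1))    # factors seen so far
    return xs, ys


xs, ys = cumulative_length_points([(0, 1, 0), (1, 1, 1), (2, 2, 0)])
print(xs, ys)  # [1, 2, 4] [1, 2, 3]
```

The resulting lists could be passed to any scatter-plot routine; a curve that flattens out indicates long factors, i.e. highly repetitive input.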

Main Package

noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization.

A high-performance Python package with C++ core for computing non-overlapping LZ factorizations of strings and files.

Exception Classes

NoLZSSError

class noLZSS.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

InvalidInputError

class noLZSS.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.