Python API Reference

Core Functions

Core Python wrappers for noLZSS C++ functionality.

This module provides enhanced Python wrappers around the C++ factorization functions, adding input validation, error handling, and convenience features.

noLZSS.core.factorize(data: str | bytes, validate: bool = True) → List[Tuple[int, int, int]]

Factorize a string or bytes object into LZ factors.

Parameters:
  • data – Input string or bytes to factorize

  • validate – Whether to perform input validation (default: True)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:
  • ValueError – If input is invalid (for example, empty)

  • TypeError – If input type is not supported
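To make the (position, length, ref) tuples concrete, here is a minimal pure-Python sketch of a non-overlapping LZ factorization. It is a quadratic-time illustration, not the library's C++ algorithm. The name naive_factorize is hypothetical, and the convention that a previously unseen symbol becomes a length-1 factor whose ref equals its own position is an assumption that may differ from what noLZSS.core.factorize returns.

```python
from typing import List, Tuple


def naive_factorize(text: bytes) -> List[Tuple[int, int, int]]:
    """Quadratic-time reference sketch (hypothetical helper, not the noLZSS API)."""
    factors: List[Tuple[int, int, int]] = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_ref = 0, i
        # Try every earlier start j. The copied source text[j:j+length] must
        # end at or before i, so source and target never overlap.
        for j in range(i):
            length = 0
            while (i + length < n and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_ref = length, j
        if best_len == 0:
            # Assumed convention: a fresh symbol is a length-1 factor
            # that references its own position.
            factors.append((i, 1, i))
            i += 1
        else:
            factors.append((i, best_len, best_ref))
            i += best_len
    return factors


print(naive_factorize(b"abababab"))
# [(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)]
```

The factor lengths always sum to the input length, which gives a quick sanity check on any factor list.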

noLZSS.core.factorize_file(filepath: str | Path, reserve_hint: int = 0) → List[Tuple[int, int, int]]

Factorize the contents of a file into LZ factors.

Parameters:
  • filepath – Path to the input file

  • reserve_hint – Optional hint for reserving space in output vector (0 = no hint)

Returns:

List of (position, length, ref) tuples representing the factorization

Raises:

FileNotFoundError – If the file doesn’t exist

noLZSS.core.count_factors(data: str | bytes, validate: bool = True) → int

Count the number of factors in a string without computing the full factorization.

Parameters:
  • data – Input string or bytes to analyze

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:
  • ValueError – If input is invalid

  • TypeError – If input type is not supported

noLZSS.core.count_factors_file(filepath: str | Path, validate: bool = True) → int

Count the number of factors in a file without computing the full factorization.

Parameters:
  • filepath – Path to the input file

  • validate – Whether to perform input validation (default: True)

Returns:

Number of factors in the factorization

Raises:
  • FileNotFoundError – If the file doesn’t exist

  • ValueError – If file contents are invalid

noLZSS.core.write_factors_binary_file(data: str | bytes, output_filepath: str | Path) → None

Factorize input and write the factors to a binary file.

Parameters:
  • data – Input string or bytes to factorize

  • output_filepath – Path of the binary file to write the factors to

Raises:
  • ValueError – If input is invalid

  • TypeError – If input type is not supported

  • OSError – If unable to write to output file

noLZSS.core.factorize_with_info(data: str | bytes, validate: bool = True) → dict

Factorize input and return both factors and additional information.

Parameters:
  • data – Input string or bytes to factorize

  • validate – Whether to perform input validation (default: True)

Returns:

Dictionary containing:
  • 'factors': List of (position, length, ref) tuples

  • 'alphabet_info': Alphabet analysis results

  • 'input_size': Size of input data

  • 'num_factors': Number of factors

Return type:

dict
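As an illustration of how the returned dictionary might be consumed, the sketch below summarizes a result using only the documented keys 'input_size' and 'num_factors'. The helper summarize_info is hypothetical, and the example dictionary is hand-built for demonstration rather than produced by the library.

```python
from typing import Any, Dict


def summarize_info(info: Dict[str, Any]) -> str:
    """Format a one-line summary from a factorize_with_info-style dict."""
    input_size = info["input_size"]
    num_factors = info["num_factors"]
    # Average factor length; guard against an empty factorization.
    average = input_size / num_factors if num_factors else 0.0
    return (f"input size {input_size}, {num_factors} factors, "
            f"average factor length {average:.2f}")


# Hand-built stand-in with the documented keys (not real library output).
example = {
    "factors": [(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)],
    "alphabet_info": {"size": 2},
    "input_size": 8,
    "num_factors": 4,
}
print(summarize_info(example))
# input size 8, 4 factors, average factor length 2.00
```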

Utilities

Utility functions for input validation, alphabet analysis, file I/O helpers, and visualization.

This module provides reusable utilities for the noLZSS package, including input validation, sentinel handling, alphabet analysis, binary file I/O, and plotting functions.

exception noLZSS.utils.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

exception noLZSS.utils.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.

noLZSS.utils.validate_input(data: str | bytes) → bytes

Validate and normalize input data for factorization.

Parameters:

data – Input string or bytes to validate

Returns:

Normalized bytes data

Raises:
  • InvalidInputError – If input is invalid

  • TypeError – If input type is not supported
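The validate-and-normalize contract described above can be sketched roughly as follows. The two exception classes are local stand-ins that mirror the documented hierarchy, and validate_input_sketch is a hypothetical name. The UTF-8 encoding of str input and the rejection of empty input are assumptions; the real function may apply different or additional checks.

```python
from typing import Union


class NoLZSSError(Exception):
    """Local stand-in for the documented base exception."""


class InvalidInputError(NoLZSSError):
    """Local stand-in for the documented invalid-input exception."""


def validate_input_sketch(data: Union[str, bytes]) -> bytes:
    """Normalize str/bytes to bytes, rejecting unsupported or empty input."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # assumption: the encoding is not documented
    elif not isinstance(data, bytes):
        raise TypeError(f"expected str or bytes, got {type(data).__name__}")
    if not data:
        raise InvalidInputError("input must not be empty")  # assumed check
    return data


# Catching the base class also catches the more specific subclass.
try:
    validate_input_sketch("")
except NoLZSSError as exc:
    print(type(exc).__name__)  # InvalidInputError
```

Because InvalidInputError derives from NoLZSSError, callers can catch the base class to handle every package-specific failure in one place.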

noLZSS.utils.analyze_alphabet(data: str | bytes) → Dict[str, Any]

Analyze the alphabet of input data.

Parameters:

data – Input string or bytes to analyze

Returns:

Dictionary containing alphabet analysis:
  • 'size': Number of unique characters/bytes

  • 'characters': Set of unique characters/bytes

  • 'distribution': Counter of character/byte frequencies

  • 'entropy': Shannon entropy of the data

  • 'most_common': List of (char, count) tuples for most frequent characters

Return type:

Dict[str, Any]
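The fields listed above can be computed with the standard library alone. The sketch below is one way to do it, not the package's implementation: analyze_alphabet_sketch and its top parameter are hypothetical, and the entropy is the usual Shannon entropy in bits per symbol, H = -Σ p·log2(p).

```python
import math
from collections import Counter
from typing import Any, Dict, Union


def analyze_alphabet_sketch(data: Union[str, bytes], top: int = 10) -> Dict[str, Any]:
    """Compute alphabet statistics; `top` (hypothetical) caps 'most_common'."""
    distribution = Counter(data)  # iterating bytes yields ints, str yields chars
    total = sum(distribution.values())
    # Shannon entropy in bits per symbol.
    entropy = sum(
        -(count / total) * math.log2(count / total)
        for count in distribution.values()
    )
    return {
        "size": len(distribution),
        "characters": set(distribution),
        "distribution": distribution,
        "entropy": entropy,
        "most_common": distribution.most_common(top),
    }


info = analyze_alphabet_sketch(b"aabb")
print(info["size"], info["entropy"])
# 2 1.0
```

Two equally likely symbols give exactly one bit of entropy, which is why the example prints 1.0.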

noLZSS.utils.read_factors_binary_file(filepath: str | Path) → List[Tuple[int, int, int]]

Read factors from a binary file written by write_factors_binary_file.

Parameters:

filepath – Path to the binary factors file

Returns:

List of (position, length, ref) tuples

Raises:

NoLZSSError – If file cannot be read or has invalid format
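The on-disk layout of the binary factors file is not specified in this reference. Purely as an illustration of a write/read round trip, the sketch below assumes a hypothetical layout of three little-endian unsigned 64-bit integers per factor with no header. The function names are also hypothetical, and the sketch raises ValueError on a malformed file where the documented function raises NoLZSSError. Files written by write_factors_binary_file may well use a different format.

```python
import struct
import tempfile
from pathlib import Path
from typing import List, Tuple, Union

# Hypothetical record layout: (position, length, ref) as three uint64 values.
RECORD = struct.Struct("<QQQ")


def write_factors_sketch(factors: List[Tuple[int, int, int]],
                         path: Union[str, Path]) -> None:
    """Write each factor as one fixed-size record."""
    with open(path, "wb") as handle:
        for factor in factors:
            handle.write(RECORD.pack(*factor))


def read_factors_sketch(path: Union[str, Path]) -> List[Tuple[int, int, int]]:
    """Read back fixed-size records, rejecting truncated files."""
    raw = Path(path).read_bytes()
    if len(raw) % RECORD.size:
        raise ValueError("file size is not a multiple of the record size")
    return [RECORD.unpack_from(raw, offset)
            for offset in range(0, len(raw), RECORD.size)]


factors = [(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "factors.bin"
    write_factors_sketch(factors, target)
    restored = read_factors_sketch(target)
print(restored == factors)  # True
```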

noLZSS.utils.plot_factor_lengths(factors_or_file: List[Tuple[int, int, int]] | str | Path, save_path: str | Path | None = None, show_plot: bool = True) → None

Plot the cumulative factor lengths vs factor index.

Creates a scatter plot where:
  • X-axis: Cumulative sum of factor lengths

  • Y-axis: Factor index (number of factors)

Parameters:
  • factors_or_file – Either a list of (position, length, ref) tuples or path to binary factors file

  • save_path – Optional path to save the plot image (e.g., ‘plot.png’)

  • show_plot – Whether to display the plot (default: True)

Raises:
  • NoLZSSError – If binary file cannot be read

  • TypeError – If input type is invalid

  • ValueError – If no factors to plot

Warns:

UserWarning – If matplotlib is not installed (function returns gracefully)
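The plotted coordinates can be derived from a factor list without matplotlib. The sketch below computes them under the assumption that factor indices start at 1, so the last y value equals the number of factors. The helper name cumulative_length_points is hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def cumulative_length_points(
    factors: List[Tuple[int, int, int]],
) -> Tuple[List[int], List[int]]:
    """Return (xs, ys): cumulative factor lengths and 1-based factor indices."""
    if not factors:
        raise ValueError("no factors to plot")
    lengths = [length for _, length, _ in factors]
    xs = list(accumulate(lengths))           # running sum of factor lengths
    ys = list(range(1, len(factors) + 1))    # assumed 1-based factor index
    return xs, ys


xs, ys = cumulative_length_points([(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)])
print(xs, ys)
# [1, 2, 4, 8] [1, 2, 3, 4]
```

With matplotlib installed, the two lists could be passed to matplotlib.pyplot.scatter(xs, ys) to draw a comparable plot.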

Main Package

noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization.

A high-performance Python package with C++ core for computing non-overlapping LZ factorizations of strings and files.

Exception Classes

NoLZSSError

class noLZSS.NoLZSSError

Bases: Exception

Base exception for noLZSS-related errors.

InvalidInputError

class noLZSS.InvalidInputError

Bases: NoLZSSError

Raised when input data is invalid for factorization.