Genomics Module

The genomics module provides specialized functions for biological sequence analysis using LZSS factorization.

FASTA Processing

FASTA file parsing and compression utilities.

This module provides functions for reading, parsing, and compressing FASTA files with proper handling of biological sequences and edge cases.

exception noLZSS.genomics.fasta.FASTAError

Bases: NoLZSSError

Raised when FASTA file parsing or validation fails.

noLZSS.genomics.fasta.read_nucleotide_fasta(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]]

Read and factorize nucleotide sequences from a FASTA file.

Only accepts sequences containing A, C, T, G (case insensitive). Sequences are converted to uppercase and factorized.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, factors) tuples where factors is the LZSS factorization

Raises:

FASTAError – If file format is invalid or contains invalid nucleotides
FileNotFoundError – If file doesn’t exist
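
The validation rules described above (A, C, T, G only, case insensitive, converted to uppercase) can be illustrated with a small standalone sketch. This is not the library's parser: the helper name `parse_nucleotide_fasta_text`, the use of `ValueError` in place of `FASTAError`, and the choice of the first header token as the sequence ID are all assumptions made for illustration. It also stops before the factorization step.

```python
from typing import List, Optional, Tuple

NUCLEOTIDES = frozenset("ACGT")


def parse_nucleotide_fasta_text(text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: split FASTA text into (sequence_id, sequence) pairs.

    Sequences are uppercased and rejected if they contain anything
    other than A, C, G, T.
    """
    records: List[Tuple[str, str]] = []
    seq_id: Optional[str] = None
    chunks: List[str] = []

    def flush() -> None:
        # Validate and store the record collected so far, if any.
        if seq_id is None:
            return
        sequence = "".join(chunks).upper()
        if not sequence:
            raise ValueError(f"sequence {seq_id!r} is empty")
        invalid = set(sequence) - NUCLEOTIDES
        if invalid:
            raise ValueError(f"sequence {seq_id!r} contains {sorted(invalid)}")
        records.append((seq_id, sequence))

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            # Assumption: the ID is the first whitespace-delimited header token.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    flush()
    return records
```

Sequences that span several lines are joined before validation, so a record such as `>seq1` followed by `acgt` and `AC` yields `("seq1", "ACGTAC")`.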

noLZSS.genomics.fasta.read_protein_fasta(filepath: str | Path) → List[Tuple[str, str]]

Read amino acid sequences from a FASTA file.

Only accepts sequences containing canonical amino acids. Sequences are converted to uppercase.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, sequence) tuples

Raises:

FASTAError – If file format is invalid or contains invalid amino acids
FileNotFoundError – If file doesn’t exist

noLZSS.genomics.fasta.read_fasta_auto(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]] | List[Tuple[str, str]]

Read a FASTA file and automatically detect whether it contains nucleotide or amino acid sequences.

For nucleotide sequences: validates A, C, T, G only and returns factorized results.
For amino acid sequences: validates canonical amino acids and returns sequences.

Parameters:

filepath – Path to FASTA file

Returns:

For nucleotide FASTA: List of (sequence_id, factors) tuples
For amino acid FASTA: List of (sequence_id, sequence) tuples

Raises:

FASTAError – If file format is invalid or sequence type cannot be determined
FileNotFoundError – If file doesn’t exist
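
Because the return type is a union, calling code usually needs to branch on which shape it received. The sketch below shows one way to do that by inspecting the second element of each tuple. The helper name `summarize_records` is hypothetical, and the factor tuples in the usage lines are placeholder values rather than real factorization output.

```python
from typing import List, Sequence, Tuple, Union

Factor = Tuple[int, int, int]
Record = Tuple[str, Union[List[Factor], str]]


def summarize_records(records: Sequence[Record]) -> List[str]:
    """Hypothetical helper: describe each record by the shape of its payload."""
    lines = []
    for seq_id, payload in records:
        if isinstance(payload, str):
            # Amino acid FASTA: the payload is the sequence itself.
            lines.append(f"{seq_id}: protein sequence, {len(payload)} residues")
        else:
            # Nucleotide FASTA: the payload is a list of factor tuples.
            lines.append(f"{seq_id}: nucleotide sequence, {len(payload)} factors")
    return lines


# Hand-written records in the two documented shapes (placeholder values).
protein_records = [("p1", "MKV")]
nucleotide_records = [("n1", [(0, 1, 0), (1, 1, 1)])]
print(summarize_records(protein_records))
print(summarize_records(nucleotide_records))
```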

Sequence Utilities

Sequence utilities for biological data.

This module provides functions for working with nucleotide and amino acid sequences, including validation, transformation, and analysis functions.

noLZSS.genomics.sequences.is_dna_sequence(data: str | bytes) → bool

Check if data appears to be a DNA sequence (A, T, G, C).

Parameters:

data – Input data to check

Returns:

True if data contains only DNA nucleotides (case insensitive)

noLZSS.genomics.sequences.is_protein_sequence(data: str | bytes) → bool

Check if data appears to be a protein sequence (20 standard amino acids).

Parameters:

data – Input data to check

Returns:

True if data contains only standard amino acid codes

noLZSS.genomics.sequences.detect_sequence_type(data: str | bytes) → str

Detect the likely type of biological sequence.

Parameters:

data – Input data to analyze

Returns:

String indicating sequence type: 'dna', 'protein', 'text', or 'binary'
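
One subtlety in this kind of detection is that A, C, G, and T are also valid amino acid codes, so the DNA check has to run before the protein check. The sketch below illustrates that ordering. It is an assumption-laden stand-in, not the library's implementation: the function name `classify_sequence`, the case-insensitive protein check, and the rules used to separate 'text' from 'binary' (ASCII decoding and printability) are illustrative choices.

```python
from typing import Union

DNA_ALPHABET = frozenset("ACGT")
# The 20 standard amino acids, one-letter codes.
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(data: Union[str, bytes]) -> str:
    """Hypothetical classifier returning 'dna', 'protein', 'text', or 'binary'."""
    if isinstance(data, bytes):
        try:
            text = data.decode("ascii")
        except UnicodeDecodeError:
            # Assumption: bytes that are not ASCII are treated as binary.
            return "binary"
    else:
        text = data

    symbols = set(text.upper())
    # DNA first: every DNA letter is also a valid amino acid code.
    if symbols and symbols <= DNA_ALPHABET:
        return "dna"
    if symbols and symbols <= PROTEIN_ALPHABET:
        return "protein"
    # Assumption: printable input counts as text, anything else as binary.
    if text.isprintable():
        return "text"
    return "binary"
```

With this ordering, a short peptide made only of A, C, G, and T residues is reported as 'dna', which is an inherent ambiguity of alphabet-based detection.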

Plotting and Visualization

FASTA file plotting utilities.

This module provides functions for creating plots and visualizations from FASTA files and their factorizations.

exception noLZSS.genomics.plots.PlotError

Bases: NoLZSSError

Raised when plotting operations fail.

noLZSS.genomics.plots.plot_single_seq_accum_factors_from_fasta(fasta_filepath: str | Path, output_dir: str | Path, max_sequences: int | None = None, save_factors_text: bool = True, save_factors_binary: bool = False) → Dict[str, Dict[str, Any]]

Process a FASTA file, factorize all sequences, create plots, and save results.

For each sequence in the FASTA file:

- Factorizes the sequence
- Saves factor data (text and/or binary format)
- Creates and saves a plot of factor lengths

Parameters:

fasta_filepath – Path to input FASTA file
output_dir – Directory to save all output files
max_sequences – Maximum number of sequences to process (None for all)
save_factors_text – Whether to save factors as text files
save_factors_binary – Whether to save factors as binary files

Returns:

Dictionary with processing results for each sequence:

{
    'sequence_id': {
        'sequence_length': int,
        'num_factors': int,
        'factors_file': str,  # path to saved factors
        'plot_file': str,  # path to saved plot
        'factors': List[Tuple[int, int, int]]  # the factors
    }
}

Raises:

PlotError – If FASTA processing fails
FileNotFoundError – If input file doesn’t exist
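
The returned dictionary lends itself to quick post-processing. As a minimal sketch, the helper below computes the mean factor length per sequence from the documented `sequence_length` and `num_factors` keys. The helper name `mean_factor_lengths` is hypothetical, and the sample dictionary is hand-written with placeholder paths and an empty factor list rather than produced by the library.

```python
from typing import Any, Dict


def mean_factor_lengths(results: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Hypothetical helper: sequence_length / num_factors for each sequence."""
    summary: Dict[str, float] = {}
    for seq_id, info in results.items():
        num_factors = info["num_factors"]
        # Guard against division by zero for a sequence with no factors.
        summary[seq_id] = info["sequence_length"] / num_factors if num_factors else 0.0
    return summary


# Hand-written example in the documented result shape (placeholder values).
sample_results = {
    "seq1": {
        "sequence_length": 12,
        "num_factors": 4,
        "factors_file": "out/seq1_factors.txt",
        "plot_file": "out/seq1_plot.png",
        "factors": [],
    }
}
print(mean_factor_lengths(sample_results))
```

A lower mean factor length suggests a less repetitive sequence, since fewer long matches to earlier text were found.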

noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False) → panel.viewable.Viewable | None

Create an interactive Datashader/Panel factor plot for multiple DNA sequences from a FASTA file.

This function reads a FASTA file containing multiple DNA sequences, factorizes them with the multiple-sequence DNA algorithm that also matches reverse complements, and creates a high-performance interactive plot using Datashader and Panel. The visualization can handle millions of factors through level-of-detail (LOD) rendering, with zoom/pan-aware decimation and hover functionality.

Parameters:

fasta_filepath – Path to the FASTA file containing DNA sequences
name – Optional name for the plot title (defaults to FASTA filename)
save_path – Optional path to save the plot image (PNG export)
show_plot – Whether to display/serve the plot
return_panel – Whether to return the Panel app for embedding

Returns:

Panel app if return_panel=True, otherwise None

Raises:

PlotError – If plotting fails or FASTA file cannot be processed
FileNotFoundError – If FASTA file doesn’t exist
ImportError – If required dependencies are missing
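
Since this function raises ImportError when optional plotting dependencies are absent, it can be convenient to check for them up front. The sketch below uses the standard library's `importlib.util.find_spec` for that. The helper name `missing_modules` is hypothetical, and the module list is an assumption based on the Datashader and Panel mentions above; the library may require additional packages.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the top-level modules that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed dependency list, inferred from the description above.
missing = missing_modules(["datashader", "panel"])
if missing:
    print("Install before plotting:", ", ".join(missing))
else:
    print("Plotting dependencies found.")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages.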

noLZSS.genomics.plots.plot_multiple_seq_self_weizmann_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True) → None

Create a Weizmann factor plot for multiple DNA sequences from a FASTA file.

This function reads a FASTA file containing multiple DNA sequences, factorizes them with the multiple-sequence DNA algorithm that also matches reverse complements, and creates a specialized plot where each factor is represented as a line. The plot shows the relationship between factor positions and their reference positions.

Parameters:

fasta_filepath – Path to the FASTA file containing DNA sequences
name – Optional name for the plot title (defaults to FASTA filename)
save_path – Optional path to save the plot image
show_plot – Whether to display the plot

Raises:

PlotError – If plotting fails or FASTA file cannot be processed
FileNotFoundError – If FASTA file doesn’t exist
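
The idea of drawing each factor as a line relating its position to its reference position can be sketched as a conversion from factor tuples to line segments. This rests on a loud assumption: the documentation here does not state what the three integers in a factor mean, so the sketch treats them as (start, length, reference) purely for illustration. The helper name `factors_to_segments` is hypothetical, and reverse-complement factors are not handled.

```python
from typing import List, Tuple

Factor = Tuple[int, int, int]
Point = Tuple[int, int]
Segment = Tuple[Point, Point]


def factors_to_segments(factors: List[Factor]) -> List[Segment]:
    """Hypothetical helper: map each factor to a diagonal line segment.

    Assumes each factor is (start, length, reference); x is the position
    in the sequence and y is the position of the referenced earlier copy.
    """
    segments: List[Segment] = []
    for start, length, ref in factors:
        begin = (start, ref)
        end = (start + length, ref + length)
        segments.append((begin, end))
    return segments


# Placeholder factors, not real factorization output.
print(factors_to_segments([(0, 1, 0), (4, 3, 1)]))
```

Under that assumption, long repeats show up as long diagonal segments, and their vertical offset from the main diagonal indicates how far back the earlier copy lies.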

Genomics Package

Genomics-specific functionality for noLZSS.

This subpackage provides specialized tools for working with biological sequences, including FASTA file parsing, sequence validation, and genomics-aware compression.