Genomics Module

The genomics module provides specialized functions for biological sequence analysis using LZSS factorization.

FASTA Processing

FASTA file parsing and compression utilities.

This module provides functions for reading, parsing, and compressing FASTA files with proper handling of biological sequences and edge cases.

exception noLZSS.genomics.fasta.FASTAError

Bases: NoLZSSError

Raised when FASTA file parsing or validation fails.

noLZSS.genomics.fasta.read_nucleotide_fasta(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]]

Read and factorize nucleotide sequences from a FASTA file.

Only accepts sequences containing A, C, T, G (case insensitive). Sequences are converted to uppercase and factorized.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, factors) tuples where factors is the LZSS factorization

Raises:
  • FASTAError – If file format is invalid or contains invalid nucleotides

  • FileNotFoundError – If file doesn’t exist
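
The parsing and validation rules described above (A, C, G, T only, case insensitive, converted to uppercase) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `parse_nucleotide_records` and `SketchFASTAError` are hypothetical names, the whole header line after `>` is assumed to be the sequence ID, and the factorization step is left out.

```python
from typing import Iterable, List, Tuple

VALID_NUCLEOTIDES = frozenset("ACGT")


class SketchFASTAError(ValueError):
    """Hypothetical stand-in for FASTAError, used only in this sketch."""


def _validated(seq_id: str, chunks: List[str]) -> str:
    # Join the sequence lines, uppercase them, and reject foreign symbols.
    sequence = "".join(chunks).upper()
    invalid = sorted(set(sequence) - VALID_NUCLEOTIDES)
    if invalid:
        raise SketchFASTAError(f"{seq_id}: invalid nucleotides {invalid}")
    return sequence


def parse_nucleotide_records(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (sequence_id, uppercase_sequence) pairs from FASTA lines."""
    records: List[Tuple[str, str]] = []
    seq_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, _validated(seq_id, chunks)))
            # Assumption: the full header text serves as the sequence ID.
            seq_id = line[1:].strip()
            chunks = []
        elif seq_id is None:
            raise SketchFASTAError("sequence data before the first '>' header")
        else:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, _validated(seq_id, chunks)))
    return records
```

Passing an open file object works as well as a list of strings, since both iterate line by line.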

noLZSS.genomics.fasta.read_protein_fasta(filepath: str | Path) → List[Tuple[str, str]]

Read amino acid sequences from a FASTA file.

Only accepts sequences containing canonical amino acids. Sequences are converted to uppercase.

Parameters:

filepath – Path to FASTA file

Returns:

List of (sequence_id, sequence) tuples

Raises:
  • FASTAError – If file format is invalid or contains invalid amino acids

  • FileNotFoundError – If file doesn’t exist
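
As a rough illustration of the amino acid check, the sketch below assumes that "canonical amino acids" means the 20 standard one-letter codes. `normalize_protein` is a hypothetical helper, not part of noLZSS, and it raises a plain `ValueError` rather than `FASTAError`.

```python
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # 20 standard codes


def normalize_protein(sequence: str) -> str:
    """Uppercase a protein sequence and reject non-canonical symbols."""
    upper = sequence.upper()
    invalid = sorted(set(upper) - CANONICAL_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"invalid amino acid codes: {invalid}")
    return upper
```

Under this assumption, ambiguity codes such as B, X, and Z are rejected.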

noLZSS.genomics.fasta.read_fasta_auto(filepath: str | Path) → List[Tuple[str, List[Tuple[int, int, int]]]] | List[Tuple[str, str]]

Read a FASTA file and automatically detect whether it contains nucleotide or amino acid sequences.

For nucleotide sequences: validates A, C, T, G only and returns factorized results.

For amino acid sequences: validates canonical amino acids and returns sequences.

Parameters:

filepath – Path to FASTA file

Returns:

For nucleotide FASTA: List of (sequence_id, factors) tuples

For amino acid FASTA: List of (sequence_id, sequence) tuples

Raises:
  • FASTAError – If file format is invalid or sequence type cannot be determined

  • FileNotFoundError – If file doesn’t exist
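
Because the return value has one of two shapes, callers need to tell them apart. One possible approach, sketched below, is to check whether the second element of each tuple is a string or a list of factors. `summarize_records` is a hypothetical helper and the sample data is hand-written with placeholder values; nothing here calls the library.

```python
from typing import List, Tuple, Union

Factors = List[Tuple[int, int, int]]
Record = Tuple[str, Union[str, Factors]]


def summarize_records(result: List[Record]) -> List[str]:
    """Describe each record according to the shape of its payload."""
    lines = []
    for seq_id, payload in result:
        if isinstance(payload, str):
            # Amino acid FASTA: the payload is the sequence itself.
            lines.append(f"{seq_id}: protein, {len(payload)} residues")
        else:
            # Nucleotide FASTA: the payload is a list of factor triples.
            lines.append(f"{seq_id}: nucleotide, {len(payload)} factors")
    return lines


# Placeholder data mimicking the two documented return shapes.
nucleotide_like = [("seq1", [(0, 1, 0), (1, 1, 1)])]
protein_like = [("prot1", "MKVL")]
```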

Sequence Utilities

Sequence utilities for biological data.

This module provides functions for working with nucleotide and amino acid sequences, including validation, transformation, and analysis functions.

noLZSS.genomics.sequences.is_dna_sequence(data: str | bytes) → bool

Check if data appears to be a DNA sequence (A, T, G, C).

Parameters:

data – Input data to check

Returns:

True if data contains only DNA nucleotides (case insensitive)

noLZSS.genomics.sequences.is_protein_sequence(data: str | bytes) → bool

Check if data appears to be a protein sequence (20 standard amino acids).

Parameters:

data – Input data to check

Returns:

True if data contains only standard amino acid codes

noLZSS.genomics.sequences.detect_sequence_type(data: str | bytes) → str

Detect the likely type of biological sequence.

Parameters:

data – Input data to analyze

Returns:

String indicating sequence type: 'dna', 'protein', 'text', or 'binary'
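
A subtlety worth illustrating: A, C, G, and T are all valid amino acid codes too, so a detector has to test for DNA before protein. The sketch below shows one way the four categories could be assigned. It is an assumption-laden illustration, not the library's logic: `guess_sequence_type` is a hypothetical name, and the rules for 'text' versus 'binary' (ASCII-decodable and printable versus not) are guesses.

```python
from typing import Union

DNA_ALPHABET = frozenset("ACGT")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def guess_sequence_type(data: Union[str, bytes]) -> str:
    """Hypothetical classifier; not the library's implementation."""
    if isinstance(data, bytes):
        try:
            text = data.decode("ascii")
        except UnicodeDecodeError:
            return "binary"  # assumption: non-ASCII bytes count as binary
    else:
        text = data
    if not text:
        return "text"  # arbitrary choice for empty input in this sketch
    symbols = set(text.upper())
    # DNA must be tested first: its alphabet is a subset of the protein one.
    if symbols <= DNA_ALPHABET:
        return "dna"
    if symbols <= PROTEIN_ALPHABET:
        return "protein"
    if all(ch.isprintable() or ch.isspace() for ch in text):
        return "text"
    return "binary"
```

With this ordering, a short peptide made up only of A, C, G, and T would be reported as DNA, which is an inherent ambiguity of alphabet-based detection.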

Plotting and Visualization

FASTA file plotting utilities.

This module provides functions for creating plots and visualizations from FASTA files and their factorizations.

exception noLZSS.genomics.plots.PlotError

Bases: NoLZSSError

Raised when plotting operations fail.

noLZSS.genomics.plots.plot_single_seq_accum_factors_from_fasta(fasta_filepath: str | Path, output_dir: str | Path, max_sequences: int | None = None, save_factors_text: bool = True, save_factors_binary: bool = False) → Dict[str, Dict[str, Any]]

Process a FASTA file, factorize all sequences, create plots, and save results.

For each sequence in the FASTA file:
  • Factorizes the sequence

  • Saves factor data (text and/or binary format)

  • Creates and saves a plot of factor lengths

Parameters:
  • fasta_filepath – Path to input FASTA file

  • output_dir – Directory to save all output files

  • max_sequences – Maximum number of sequences to process (None for all)

  • save_factors_text – Whether to save factors as text files

  • save_factors_binary – Whether to save factors as binary files

Returns:

Dictionary with processing results for each sequence:

    {
        'sequence_id': {
            'sequence_length': int,
            'num_factors': int,
            'factors_file': str,  # path to saved factors
            'plot_file': str,  # path to saved plot
            'factors': List[Tuple[int, int, int]]  # the factors
        }
    }

Raises:
  • PlotError – If FASTA processing fails

  • FileNotFoundError – If input file doesn’t exist
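
To show how the returned dictionary might be consumed, the sketch below computes the number of factors per base for each sequence. The `sample_results` value is hand-built to mirror the documented keys; its file paths and factor values are placeholders, and `factor_density` is a hypothetical helper rather than part of noLZSS.

```python
from typing import Any, Dict


def factor_density(results: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Return factors per base for each sequence, skipping empty sequences."""
    return {
        seq_id: info["num_factors"] / info["sequence_length"]
        for seq_id, info in results.items()
        if info["sequence_length"] > 0
    }


# Placeholder data shaped like the documented return value.
sample_results = {
    "seq1": {
        "sequence_length": 8,
        "num_factors": 4,
        "factors_file": "out/seq1_factors.txt",  # hypothetical path
        "plot_file": "out/seq1_plot.png",  # hypothetical path
        "factors": [(0, 1, 0), (1, 1, 1), (2, 2, 0), (4, 4, 0)],
    }
}
```

A lower density generally indicates a more repetitive sequence, since fewer factors are needed to cover the same length.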

noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False) → panel.viewable.Viewable | None

Create an interactive Datashader/Panel factor plot for multiple DNA sequences from a FASTA file.

This function reads a FASTA file containing multiple DNA sequences, factorizes them using the multiple-sequence DNA algorithm with reverse-complement support, and creates a high-performance interactive plot using Datashader and Panel. The visualization can handle millions of factors through level-of-detail (LOD) rendering, with zoom/pan-aware decimation and hover functionality.

Parameters:
  • fasta_filepath – Path to the FASTA file containing DNA sequences

  • name – Optional name for the plot title (defaults to FASTA filename)

  • save_path – Optional path to save the plot image (PNG export)

  • show_plot – Whether to display/serve the plot

  • return_panel – Whether to return the Panel app for embedding

Returns:

Panel app if return_panel=True, otherwise None

Raises:
  • PlotError – If plotting fails or FASTA file cannot be processed

  • FileNotFoundError – If FASTA file doesn’t exist

  • ImportError – If required dependencies are missing
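
Since this function can raise ImportError when optional dependencies are absent, it may help to check for them up front. The sketch below uses only the standard library. `missing_dependencies` is a hypothetical helper, and the module names `datashader` and `panel` are taken from the description above; the function may require other packages that are not listed here.

```python
import importlib.util
from typing import Iterable, List


def missing_dependencies(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Assumed dependency list, based on the libraries named in the description.
missing = missing_dependencies(["datashader", "panel"])
if missing:
    print("Install before plotting:", ", ".join(missing))
```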

noLZSS.genomics.plots.plot_multiple_seq_self_weizmann_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True) → None

Create a Weizmann factor plot for multiple DNA sequences from a FASTA file.

This function reads a FASTA file containing multiple DNA sequences, factorizes them using the multiple-sequence DNA algorithm with reverse-complement support, and creates a specialized plot in which each factor is drawn as a line. The plot shows the relationship between factor positions and their reference positions.

Parameters:
  • fasta_filepath – Path to the FASTA file containing DNA sequences

  • name – Optional name for the plot title (defaults to FASTA filename)

  • save_path – Optional path to save the plot image

  • show_plot – Whether to display the plot

Raises:
  • PlotError – If plotting fails or FASTA file cannot be processed

  • FileNotFoundError – If FASTA file doesn’t exist
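
For readers unfamiliar with the term, the reverse complement of a DNA strand is obtained by complementing each base (A with T, C with G) and reversing the result. The sketch below shows the operation in plain Python. `reverse_complement` is a hypothetical helper for illustration and says nothing about how noLZSS computes or encodes reverse-complement matches internally.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence, uppercased."""
    # Complement each base, then reverse the whole string.
    return sequence.upper().translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("AACG")` gives `"CGTT"`.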

Genomics Package

Genomics-specific functionality for noLZSS.

This subpackage provides specialized tools for working with biological sequences, including FASTA file parsing, sequence validation, and genomics-aware compression.