Genomics Module
The genomics module provides specialized functions for biological sequence analysis using LZSS factorization.
FASTA Processing
FASTA file parsing and compression utilities.
This module provides functions for reading, parsing, and compressing FASTA files with proper handling of biological sequences and edge cases.
- exception noLZSS.genomics.fasta.FASTAError
Bases: NoLZSSError
Raised when FASTA file parsing or validation fails.
- noLZSS.genomics.fasta.read_nucleotide_fasta(filepath: str | Path) -> List[Tuple[str, List[Tuple[int, int, int]]]]
Read and factorize nucleotide sequences from a FASTA file.
Only accepts sequences containing A, C, T, G (case insensitive). Sequences are converted to uppercase and factorized.
- Parameters:
filepath – Path to FASTA file
- Returns:
List of (sequence_id, factors) tuples where factors is the LZSS factorization
- Raises:
FASTAError – If file format is invalid or contains invalid nucleotides
FileNotFoundError – If file doesn’t exist
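The validation rule described above (A, C, T, G only, case insensitive, converted to uppercase) can be illustrated with a minimal standalone sketch. It is not the library's implementation: `parse_nucleotide_fasta_text` and `_finish_record` are hypothetical names, `ValueError` stands in for `FASTAError`, the sequence ID is assumed to be the first whitespace-delimited token of the header, and the factorization step is omitted.

```python
from typing import List, Optional, Tuple

VALID_NUCLEOTIDES = frozenset("ACGT")


def _finish_record(seq_id: str, chunks: List[str]) -> Tuple[str, str]:
    """Join the collected lines, uppercase them, and reject non-ACGT symbols."""
    sequence = "".join(chunks).upper()
    invalid = set(sequence) - VALID_NUCLEOTIDES
    if invalid:
        # The library raises FASTAError here; ValueError is a stand-in.
        raise ValueError(f"{seq_id!r}: invalid nucleotides {sorted(invalid)}")
    return seq_id, sequence


def parse_nucleotide_fasta_text(text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: parse FASTA text into (sequence_id, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    seq_id: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                records.append(_finish_record(seq_id, chunks))
            header = line[1:].strip()
            # Assumption: the ID is the first token of the header line.
            seq_id = header.split()[0] if header else ""
            chunks = []
        elif seq_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if seq_id is not None:
        records.append(_finish_record(seq_id, chunks))
    return records
```

In this sketch a multi-line record such as `>s1 desc`, `acgt`, `AC` yields `("s1", "ACGTAC")`, while a record containing `N` is rejected.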
- noLZSS.genomics.fasta.read_protein_fasta(filepath: str | Path) -> List[Tuple[str, str]]
Read amino acid sequences from a FASTA file.
Only accepts sequences containing canonical amino acids. Sequences are converted to uppercase.
- Parameters:
filepath – Path to FASTA file
- Returns:
List of (sequence_id, sequence) tuples
- Raises:
FASTAError – If file format is invalid or contains invalid amino acids
FileNotFoundError – If file doesn’t exist
- noLZSS.genomics.fasta.read_fasta_auto(filepath: str | Path) -> List[Tuple[str, List[Tuple[int, int, int]]]] | List[Tuple[str, str]]
Read a FASTA file and automatically detect whether it contains nucleotide or amino acid sequences.
For nucleotide sequences: validates A, C, T, G only and returns factorized results.
For amino acid sequences: validates canonical amino acids and returns the sequences.
- Parameters:
filepath – Path to FASTA file
- Returns:
For nucleotide FASTA: List of (sequence_id, factors) tuples
For amino acid FASTA: List of (sequence_id, sequence) tuples
- Raises:
FASTAError – If file format is invalid or sequence type cannot be determined
FileNotFoundError – If file doesn’t exist
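Because read_fasta_auto returns one of two shapes, calling code usually has to branch on which one it received. The following is a minimal sketch of such a consumer; `describe_records` is a hypothetical helper, and it relies only on the documented return types (a string for amino acid records, a list of factor tuples for nucleotide records).

```python
from typing import List, Sequence, Tuple, Union

Factor = Tuple[int, int, int]
Record = Tuple[str, Union[str, List[Factor]]]


def describe_records(records: Sequence[Record]) -> List[str]:
    """Hypothetical helper: summarize records of either documented shape."""
    lines = []
    for seq_id, payload in records:
        if isinstance(payload, str):
            # Amino acid FASTA: the payload is the validated sequence.
            lines.append(f"{seq_id}: amino acid sequence, {len(payload)} residues")
        else:
            # Nucleotide FASTA: the payload is the list of factors.
            lines.append(f"{seq_id}: nucleotide sequence, {len(payload)} factors")
    return lines
```

Checking the payload type per record keeps the caller independent of how the library decided on the sequence type.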
Sequence Utilities
Sequence utilities for biological data.
This module provides functions for working with nucleotide and amino acid sequences, including validation, transformation, and analysis functions.
- noLZSS.genomics.sequences.is_dna_sequence(data: str | bytes) -> bool
Check if data appears to be a DNA sequence (A, T, G, C).
- Parameters:
data – Input data to check
- Returns:
True if data contains only DNA nucleotides (case insensitive)
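A check of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the library's code: `looks_like_dna` is a hypothetical name, bytes are assumed to be ASCII-encoded, and empty input is treated as not DNA (how the library handles empty input is not stated here).

```python
from typing import Union

DNA_LETTERS = frozenset("ACGT")


def looks_like_dna(data: Union[str, bytes]) -> bool:
    """Hypothetical helper: True if data contains only A, C, G, T (any case)."""
    if isinstance(data, bytes):
        try:
            data = data.decode("ascii")  # assumption: ASCII-encoded input
        except UnicodeDecodeError:
            return False
    if not data:
        return False  # assumption: empty input is not considered DNA
    # Case-insensitive check: every symbol must be one of the four letters.
    return set(data.upper()) <= DNA_LETTERS
```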
Plotting and Visualization
FASTA file plotting utilities.
This module provides functions for creating plots and visualizations from FASTA files and their factorizations.
- exception noLZSS.genomics.plots.PlotError
Bases: NoLZSSError
Raised when plotting operations fail.
- noLZSS.genomics.plots.plot_single_seq_accum_factors_from_fasta(fasta_filepath: str | Path, output_dir: str | Path, max_sequences: int | None = None, save_factors_text: bool = True, save_factors_binary: bool = False) -> Dict[str, Dict[str, Any]]
Process a FASTA file, factorize all sequences, create plots, and save results.
For each sequence in the FASTA file:
- Factorizes the sequence
- Saves factor data (text and/or binary format)
- Creates and saves a plot of factor lengths
- Parameters:
fasta_filepath – Path to input FASTA file
output_dir – Directory to save all output files
max_sequences – Maximum number of sequences to process (None for all)
save_factors_text – Whether to save factors as text files
save_factors_binary – Whether to save factors as binary files
- Returns:
Dictionary with processing results for each sequence:
{
    'sequence_id': {
        'sequence_length': int,
        'num_factors': int,
        'factors_file': str,  # path to saved factors
        'plot_file': str,  # path to saved plot
        'factors': List[Tuple[int, int, int]]  # the factors
    }
}
- Raises:
PlotError – If FASTA processing fails
FileNotFoundError – If input file doesn’t exist
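The returned dictionary lends itself to quick per-sequence summaries. The sketch below uses only the documented `sequence_length` and `num_factors` keys; `summarize_results` is a hypothetical helper, and the mean factor length assumes the factors cover the whole sequence without overlap.

```python
from typing import Any, Dict, List, Tuple


def summarize_results(results: Dict[str, Dict[str, Any]]) -> List[Tuple[str, int, float]]:
    """Hypothetical helper: (sequence_id, num_factors, mean factor length) rows."""
    summary = []
    for seq_id, info in results.items():
        num_factors = info["num_factors"]
        # Assumption: factors tile the sequence, so length / count is the mean.
        mean_length = info["sequence_length"] / num_factors if num_factors else 0.0
        summary.append((seq_id, num_factors, mean_length))
    # Most repetitive sequences (longest mean factor) first.
    summary.sort(key=lambda row: row[2], reverse=True)
    return summary
```

A longer mean factor length suggests a more repetitive sequence, since fewer factors are needed to cover it.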
- noLZSS.genomics.plots.plot_multiple_seq_self_lz_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True, return_panel: bool = False) -> panel.viewable.Viewable | None
Create an interactive Datashader/Panel factor plot for multiple DNA sequences from a FASTA file.
This function reads a FASTA file containing multiple DNA sequences, factorizes them using the multiple DNA with reverse complement algorithm, and creates a high-performance interactive plot using Datashader and Panel. The visualization can handle millions of factors with level-of-detail (LOD) rendering and includes zoom/pan-aware decimation with hover functionality.
- Parameters:
fasta_filepath – Path to the FASTA file containing DNA sequences
name – Optional name for the plot title (defaults to FASTA filename)
save_path – Optional path to save the plot image (PNG export)
show_plot – Whether to display/serve the plot
return_panel – Whether to return the Panel app for embedding
- Returns:
Panel app if return_panel=True, otherwise None
- Raises:
PlotError – If plotting fails or FASTA file cannot be processed
FileNotFoundError – If FASTA file doesn’t exist
ImportError – If required dependencies are missing
- noLZSS.genomics.plots.plot_multiple_seq_self_weizmann_factor_plot_from_fasta(fasta_filepath: str | Path, name: str | None = None, save_path: str | Path | None = None, show_plot: bool = True) -> None
Create a Weizmann factor plot for multiple DNA sequences from a FASTA file.
This function reads a FASTA file containing multiple DNA sequences, factorizes them using the multiple DNA with reverse complement algorithm, and creates a specialized plot where each factor is represented as a line. The plot shows the relationship between factor positions and their reference positions.
- Parameters:
fasta_filepath – Path to the FASTA file containing DNA sequences
name – Optional name for the plot title (defaults to FASTA filename)
save_path – Optional path to save the plot image
show_plot – Whether to display the plot
- Raises:
PlotError – If plotting fails or FASTA file cannot be processed
FileNotFoundError – If FASTA file doesn’t exist
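The idea of drawing each factor as a line relating its position to its reference position can be sketched without any plotting library. The layout of a factor tuple is not spelled out in this section, so the sketch assumes (start, length, ref); `factors_to_segments` is a hypothetical helper, and reverse-complement factors are ignored.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def factors_to_segments(
    factors: Sequence[Tuple[int, int, int]]
) -> List[Tuple[Point, Point]]:
    """Hypothetical helper: one line segment per factor.

    Assumption: each factor is (start, length, ref). The x axis is the
    position in the sequence and the y axis is the reference position.
    """
    segments = []
    for start, length, ref in factors:
        begin = (start, ref)
        end = (start + length, ref + length)
        segments.append((begin, end))
    return segments
```

Each segment could then be passed to a plotting call of choice; under the assumed layout, long diagonal segments correspond to long repeats.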
Genomics Package
Genomics-specific functionality for noLZSS.
This subpackage provides specialized tools for working with biological sequences, including FASTA file parsing, sequence validation, and genomics-aware compression.