Function noLZSS::factorize_fasta_dna_no_rc_per_sequence
Defined in File fasta_processor.cpp
Function Documentation
FastaPerSequenceFactorizationResult noLZSS::factorize_fasta_dna_no_rc_per_sequence(const std::string &fasta_path, FastaDnaSanitizationMode sanitization_mode)
Factorizes each DNA sequence in a FASTA file separately without reverse complement awareness.
Unlike factorize_fasta_multiple_dna_no_rc, which concatenates sequences with sentinels, this function factorizes each sequence independently. Each sequence gets its own compressed suffix tree and its own factorization, so the sentinel limitations of the concatenated approach do not apply and every factor refers only to the sequence it belongs to.
Note
Only the nucleotides A, C, T, and G are allowed (case-insensitive).
Note
Sequences are converted to uppercase before factorization.
Note
Reverse complement matches are NOT supported during factorization.
Note
Each sequence is factorized independently, so there are no cross-sequence matches.
- Parameters:
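fasta_path – Path to the FASTA file containing DNA sequences
sanitization_mode – Sanitization mode applied to the input sequences (see FastaDnaSanitizationMode)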
- Throws:
std::runtime_error – If the FASTA file cannot be opened or contains no valid sequences
std::invalid_argument – If invalid nucleotides are found in the sequences
- Returns:
FastaPerSequenceFactorizationResult containing per-sequence factors and sequence IDs
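The input rules in the notes above (uppercase conversion, A/C/G/T only, one independent result per record) can be illustrated with a small standalone sketch. It is not the noLZSS implementation: `normalize_dna` and `read_fasta_records` are hypothetical helpers invented for this example, the factorization step is left out entirely, and the real behavior may differ depending on `sanitization_mode`. The exception types simply mirror the ones documented under Throws.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of noLZSS): uppercase a sequence line and
// reject any character other than A, C, G, or T.
std::string normalize_dna(const std::string &raw) {
    std::string out;
    out.reserve(raw.size());
    for (unsigned char c : raw) {
        const char up = static_cast<char>(std::toupper(c));
        if (up != 'A' && up != 'C' && up != 'G' && up != 'T') {
            throw std::invalid_argument(std::string("invalid nucleotide: ") +
                                        static_cast<char>(c));
        }
        out.push_back(up);
    }
    return out;
}

// Hypothetical helper (not part of noLZSS): split a FASTA stream into
// (id, sequence) records. Each record is kept separate, so anything done
// with one sequence afterwards cannot see the others.
std::vector<std::pair<std::string, std::string>>
read_fasta_records(std::istream &in) {
    std::vector<std::pair<std::string, std::string>> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            // Assumption: the ID is the header text up to the first whitespace.
            const auto end = line.find_first_of(" \t");
            const auto len = (end == std::string::npos) ? std::string::npos : end - 1;
            records.emplace_back(line.substr(1, len), std::string());
        } else if (!records.empty()) {
            records.back().second += normalize_dna(line);
        }
        // Assumption: sequence lines before the first header are ignored.
    }
    if (records.empty()) {
        throw std::runtime_error("no valid sequences in FASTA input");
    }
    return records;
}
```

A caller would wrap the stream in `read_fasta_records` and then process each returned pair on its own, catching `std::invalid_argument` for bad nucleotides and `std::runtime_error` for empty input, in the same way a caller of `factorize_fasta_dna_no_rc_per_sequence` would handle the documented exceptions.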