Function noLZSS::factorize_fasta_dna_w_rc_per_sequence

Function Documentation

FastaPerSequenceFactorizationResult noLZSS::factorize_fasta_dna_w_rc_per_sequence(const std::string &fasta_path, FastaDnaSanitizationMode sanitization_mode)

Factorizes each DNA sequence in a FASTA file separately with reverse complement awareness.

Unlike factorize_fasta_multiple_dna_w_rc, which concatenates sequences with sentinels, this function factorizes each sequence independently. Each sequence gets its own compressed suffix tree and its own factorization, which avoids sentinel limitations and produces cleaner per-sequence results.
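
To illustrate what "per sequence" means in practice, here is a minimal sketch that splits FASTA text into independent records, each of which could then be factorized on its own. It is not the library's parser: FastaRecord and read_fasta_records are hypothetical names introduced for this example, and the sketch assumes the sequence ID is the header text up to the first whitespace.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not part of noLZSS).
struct FastaRecord {
    std::string id;        // header text after '>' up to the first whitespace
    std::string sequence;  // all sequence lines of the record, joined
};

// Splits FASTA text into separate records so that each sequence can be
// processed on its own, with no sentinel characters joining them.
std::vector<FastaRecord> read_fasta_records(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            // Start a new record; keep only the ID part of the header.
            const std::size_t end = line.find_first_of(" \t", 1);
            const std::string id = (end == std::string::npos)
                                       ? line.substr(1)
                                       : line.substr(1, end - 1);
            records.push_back(FastaRecord{id, std::string()});
        } else if (!records.empty()) {
            // Sequence data may span several lines; append to the current record.
            records.back().sequence += line;
        }
        // Sequence lines before any header are ignored in this sketch.
    }
    return records;
}
```

Because every record carries its own sequence string, a match found while processing one record can never point into another record, which is the property the description above relies on.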

Note

Only the nucleotides A, C, T, and G are allowed (case-insensitive).

Note

Sequences are converted to uppercase before factorization.

Note

Reverse complement matches are supported during factorization.

Note

Each sequence is factorized independently, so no matches are reported across sequences.
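
The notes above can be sketched as two small helpers: one that uppercases a sequence and rejects anything other than A, C, G, or T, and one that builds the reverse complement a factor may match against. This is an illustration only, not the library's code: normalize_dna and reverse_complement are hypothetical names, and the sketch models strict rejection, while the actual handling of invalid characters may depend on sanitization_mode.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: uppercases the sequence and rejects any character
// outside A, C, G, T (strict behavior assumed for this sketch).
std::string normalize_dna(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (char ch : seq) {
        const char up = static_cast<char>(
            std::toupper(static_cast<unsigned char>(ch)));
        if (up != 'A' && up != 'C' && up != 'G' && up != 'T') {
            throw std::invalid_argument(
                std::string("invalid nucleotide: ") + ch);
        }
        out.push_back(up);
    }
    return out;
}

// Hypothetical helper: reverse complement of an uppercase ACGT string.
// The sequence is read back to front and each base is swapped with its
// partner (A<->T, C<->G).
std::string reverse_complement(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (auto it = seq.rbegin(); it != seq.rend(); ++it) {
        switch (*it) {
            case 'A': out.push_back('T'); break;
            case 'C': out.push_back('G'); break;
            case 'G': out.push_back('C'); break;
            case 'T': out.push_back('A'); break;
            default:
                throw std::invalid_argument("expected uppercase A, C, G, or T");
        }
    }
    return out;
}
```

For example, normalize_dna("acGt") yields "ACGT", and reverse_complement("AACG") yields "CGTT". A factorization with reverse complement awareness can therefore describe an occurrence of "CGTT" as a reverse complement copy of an earlier "AACG" in the same sequence.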

Parameters:

fasta_path – Path to the FASTA file containing DNA sequences

sanitization_mode – FastaDnaSanitizationMode value selecting how sequence characters are sanitized before factorization

Throws:
  • std::runtime_error – If the FASTA file cannot be opened or contains no valid sequences

  • std::invalid_argument – If invalid nucleotides are found in a sequence
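
The two documented exception types can be told apart at the call site. The sketch below shows one possible pattern; run_and_classify is a hypothetical helper, and the callable passed to it stands in for a call to factorize_fasta_dna_w_rc_per_sequence so that the example stays self-contained. std::invalid_argument derives from std::logic_error, not from std::runtime_error, so the two handlers never shadow each other.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: runs a callable and reports which of the two
// documented exception types (if any) it raised.
template <typename Call>
std::string run_and_classify(Call&& call) {
    try {
        call();
        return "ok";
    } catch (const std::invalid_argument& e) {
        // Documented case: invalid nucleotides in the input.
        return std::string("invalid input: ") + e.what();
    } catch (const std::runtime_error& e) {
        // Documented case: file cannot be opened or has no valid sequences.
        return std::string("file error: ") + e.what();
    }
}
```

In real code the lambda would wrap the library call, for example `run_and_classify([&] { result = noLZSS::factorize_fasta_dna_w_rc_per_sequence(path, mode); })`, where path, mode, and result are variables supplied by the caller.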

Returns:

FastaPerSequenceFactorizationResult containing per-sequence factors and sequence IDs