Function noLZSS::prepare_multiple_dna_sequences_w_rc

Function Documentation

PreparedSequenceResult noLZSS::prepare_multiple_dna_sequences_w_rc(const std::vector<std::string> &sequences)

Prepares multiple DNA sequences for factorization with reverse complement and tracks sentinel positions.

Concatenates the input DNA sequences, each followed by a unique sentinel, then appends their reverse complements in reverse input order, again with unique sentinels, and records the position of every sentinel. The output format is compatible with nolzss_multiple_dna_w_rc(). For three sequences: S = T1!T2@T3$rt(T3)rt(T2)^rt(T1)&, where rt(Ti) denotes the reverse complement of Ti.
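The layout described above can be sketched as follows. This is an illustration only, not the library's implementation: `reverse_complement` and `layout_with_rc` are hypothetical helpers, the sentinel characters are supplied by the caller, and the sketch assumes that every block, including each reverse complement, is terminated by its own sentinel.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Reverse complement of an uppercase DNA string:
// reverse the string, then swap A<->T and C<->G.
std::string reverse_complement(const std::string &seq) {
    std::string out(seq.rbegin(), seq.rend());
    for (char &c : out) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: throw std::invalid_argument("not an uppercase nucleotide");
        }
    }
    return out;
}

// Hypothetical helper: forward sequences in input order, then reverse
// complements in reverse input order. Each block is followed by the next
// unused character of `sentinels` (assumption: two sentinels per sequence).
std::string layout_with_rc(const std::vector<std::string> &seqs,
                           const std::string &sentinels) {
    if (sentinels.size() < 2 * seqs.size())
        throw std::invalid_argument("need two sentinels per sequence");
    std::string s;
    std::size_t k = 0;
    for (const auto &t : seqs) {
        s += t;
        s += sentinels[k++];
    }
    for (auto it = seqs.rbegin(); it != seqs.rend(); ++it) {
        s += reverse_complement(*it);
        s += sentinels[k++];
    }
    return s;
}
```

Under these assumptions, calling `layout_with_rc` on the sequences `AAC` and `GT` with the sentinels `!@$&` yields `AAC!GT@AC$GTT&`.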

Note

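Sentinels avoid 0 and the uppercase nucleotide codes A (65), C (67), G (71), and T (84). Lowercase nucleotide characters are safe as sentinels because the prepared string contains only uppercase nucleotides.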
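As a rough check of the sentinel rule, the sketch below enumerates the byte values it leaves available. It assumes single-byte sentinels and two sentinels per sequence (one after the sequence, one after its reverse complement). Both `candidate_sentinels` and `enough_sentinels` are hypothetical names, and the order in which the library actually assigns sentinel values is not stated here.

```cpp
#include <cstddef>
#include <vector>

// Byte values allowed as sentinels under the stated rule:
// 1..255, excluding 'A' (65), 'C' (67), 'G' (71), and 'T' (84).
std::vector<unsigned char> candidate_sentinels() {
    std::vector<unsigned char> out;
    for (int v = 1; v <= 255; ++v) {
        if (v == 'A' || v == 'C' || v == 'G' || v == 'T') continue;
        out.push_back(static_cast<unsigned char>(v));
    }
    return out;
}

// Assumption: each sequence consumes two distinct sentinels.
bool enough_sentinels(std::size_t num_sequences) {
    return 2 * num_sequences <= candidate_sentinels().size();
}
```

Under these assumptions 251 values are available, so 125 sequences (250 sentinels) fit and 126 do not, which is consistent with the documented limit of 125 sequences.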

Note

The function validates that all sequences contain only valid DNA nucleotides

Note

Input sequences can be lowercase or uppercase; the output is always uppercase
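The validation and case handling described in these notes might look like the following minimal sketch. `normalize_dna` is a hypothetical helper, not part of the documented API, and throwing `std::invalid_argument` here is an assumption, since the Throws section lists two exception types for invalid input.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: uppercases a sequence and rejects any character
// that is not A, C, G, or T after uppercasing.
std::string normalize_dna(const std::string &seq) {
    std::string out;
    out.reserve(seq.size());
    for (unsigned char c : seq) {
        // std::toupper expects a value representable as unsigned char.
        const char u = static_cast<char>(std::toupper(c));
        if (u != 'A' && u != 'C' && u != 'G' && u != 'T')
            throw std::invalid_argument("invalid nucleotide in sequence");
        out.push_back(u);
    }
    return out;
}
```

With this sketch, `normalize_dna("acGt")` returns `ACGT`, while `normalize_dna("ACGN")` throws.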

Parameters:

sequences – Vector of DNA sequence strings (each should contain only A, C, G, T, in either case)

Throws:
  • std::invalid_argument – If there are too many sequences (more than 125) or invalid nucleotides are found

  • std::runtime_error – If sequences contain invalid characters

Returns:

PreparedSequenceResult containing:

  • prepared_string: The formatted string with sequences and reverse complements

  • original_length: Length of the original sequences part (before reverse complements)

  • sentinel_positions: Positions of all sentinels in the prepared string
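To show how the returned fields can be used together, the sketch below splits a prepared string back into its sentinel-free blocks. `PreparedSequenceResultSketch` is a stand-in with assumed field types, and `split_blocks` is a hypothetical helper. The sketch also assumes that `sentinel_positions` holds zero-based indices in ascending order. None of this is confirmed by the documentation above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for PreparedSequenceResult; the field types are assumptions.
struct PreparedSequenceResultSketch {
    std::string prepared_string;
    std::size_t original_length;
    std::vector<std::size_t> sentinel_positions;
};

// Hypothetical helper: returns the text between consecutive sentinels,
// i.e. the forward sequences followed by the reverse complements.
std::vector<std::string> split_blocks(const PreparedSequenceResultSketch &r) {
    std::vector<std::string> blocks;
    std::size_t start = 0;
    for (std::size_t pos : r.sentinel_positions) {
        blocks.push_back(r.prepared_string.substr(start, pos - start));
        start = pos + 1;  // skip the sentinel itself
    }
    return blocks;
}
```

For a prepared string `AAC!GT@AC$GTT&` with sentinel positions 3, 6, 9, and 13, this returns the blocks `AAC`, `GT`, `AC`, and `GTT`.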