Function noLZSS::factorize_dna_w_reference_seq

Function Documentation

std::vector<Factor> noLZSS::factorize_dna_w_reference_seq(const std::string &reference_seq, const std::string &target_seq)

Factorizes a target DNA sequence using a reference DNA sequence with reverse complement awareness.

Factors of the target DNA sequence can reference positions in a reference sequence or in its reverse complement. This is useful for:

  • Comparing related genomes (e.g., different strains of the same organism)

  • Identifying similarities and differences between sequences

  • Compression where a reference genome is available

Algorithm:

  1. Prepares both sequences with reverse complements: REF[s1]TARGET[s2]RC(TARGET)[s3]RC(REF)[s4]

  2. Builds a single suffix tree containing all sequences

  3. Factorizes starting from the TARGET sequence position

  4. Factors can reference any position in REF, TARGET, or their reverse complements
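Step 1 can be illustrated with a minimal sketch of the combined layout REF[s1]TARGET[s2]RC(TARGET)[s3]RC(REF)[s4]. The helper names `reverse_complement` and `prepare_combined` are hypothetical, and the sentinel bytes 1 to 4 are an assumption made for illustration; the library's actual preparation routine and sentinel values may differ.

```cpp
#include <stdexcept>
#include <string>

// Reverse complement of an uppercase A/C/G/T string.
std::string reverse_complement(const std::string &seq) {
    std::string rc(seq.rbegin(), seq.rend());  // reversed copy
    for (char &c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: throw std::invalid_argument("invalid nucleotide");
        }
    }
    return rc;
}

// Hypothetical helper that builds REF[s1]TARGET[s2]RC(TARGET)[s3]RC(REF)[s4].
// The sentinel bytes 0x01..0x04 are assumed for this sketch only.
std::string prepare_combined(const std::string &ref, const std::string &target) {
    std::string out;
    out.reserve(2 * (ref.size() + target.size()) + 4);
    out += ref;
    out += '\x01';
    out += target;
    out += '\x02';
    out += reverse_complement(target);
    out += '\x03';
    out += reverse_complement(ref);
    out += '\x04';
    return out;
}
```

Under these assumptions the target begins at index `ref.size() + 1` of the combined string, which is where factorization would start in step 3.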

The MSB of the reference field indicates reverse complement matches.
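A minimal sketch of decoding that flag follows. It assumes the reference field is a 64-bit unsigned integer whose most significant bit is the mask; the constant `RC_MASK_SKETCH` and both helper functions are hypothetical names introduced here, not the library's API.

```cpp
#include <cstdint>

// Assumption: the ref field is 64 bits wide and the mask is its top bit.
constexpr std::uint64_t RC_MASK_SKETCH = std::uint64_t{1} << 63;

// True if the factor's ref field marks a reverse complement match.
constexpr bool is_reverse_complement(std::uint64_t ref) {
    return (ref & RC_MASK_SKETCH) != 0;
}

// Position stored in the ref field with the flag bit cleared.
constexpr std::uint64_t ref_position(std::uint64_t ref) {
    return ref & ~RC_MASK_SKETCH;
}
```

Clearing the flag before using the value as an index avoids treating a flagged reference as an extremely large position.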

The reference and target sequences are concatenated (ref@target), and noLZSS factorization with reverse complement awareness starts at the position where the target sequence begins. The target can therefore reference patterns in the reference sequence, while the reference itself is not factorized.

See also

factorize_dna_w_reference_seq_file() for file output version

factorize_w_reference() for general (non-DNA) reference factorization

Note

Factors cover only the target sequence, but can reference the reference sequence

Note

The reference field in factors points to positions in the combined prepared string

Note

Both sequences are converted to uppercase

Note

Factor start positions are absolute positions in the combined reference+target string

Note

Both sequences should contain only A, C, T, G nucleotides (case insensitive)

Note

Reverse complement matches are encoded with RC_MASK in the ref field
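The notes on uppercase conversion and allowed nucleotides can be sketched as a small normalization step. The helper `normalize_dna` is a hypothetical name used for illustration; it only mirrors the documented behavior (uppercase conversion, std::invalid_argument on invalid nucleotides) and is not the library's internal code.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: uppercase a sequence and reject anything outside A, C, G, T.
std::string normalize_dna(const std::string &seq) {
    std::string out;
    out.reserve(seq.size());
    for (unsigned char c : seq) {  // unsigned char keeps std::toupper well-defined
        const char u = static_cast<char>(std::toupper(c));
        if (u != 'A' && u != 'C' && u != 'G' && u != 'T') {
            throw std::invalid_argument("sequence contains an invalid nucleotide");
        }
        out.push_back(u);
    }
    return out;
}
```

Running a check like this on both sequences before calling the function lets a caller report which input was invalid.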

Parameters:
  • reference_seq – Reference DNA sequence (should contain only A, C, T, G)

  • target_seq – Target DNA sequence to factorize (should contain only A, C, T, G)

Throws:
  • std::invalid_argument – If sequences contain invalid nucleotides

  • std::runtime_error – If sequence preparation fails

Returns:

Vector of Factor objects representing the factorization of the target sequence