Function noLZSS::factorize_dna_w_reference_seq
Defined in File factorizer.cpp
Function Documentation
std::vector<Factor> noLZSS::factorize_dna_w_reference_seq(const std::string &reference_seq, const std::string &target_seq)
Factorizes a target DNA sequence against a reference DNA sequence, with reverse complement awareness.
This function allows factorization of a target DNA sequence where factors can reference positions in a reference sequence (or its reverse complement). This is useful for:
Comparing related genomes (e.g., different strains of the same organism)
Identifying similarities and differences between sequences
Compression where a reference genome is available
Algorithm:
Prepares both sequences with reverse complements: REF[s1]TARGET[s2]RC(TARGET)[s3]RC(REF)[s4]
Builds a single suffix tree containing all sequences
Factorizes starting from the TARGET sequence position
Factors can reference any position in REF, TARGET, or their reverse complements
The MSB of the reference field indicates reverse complement matches.
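The prepared-string layout from step 1 can be illustrated with a minimal sketch. This is not the library's implementation: `reverse_complement`, `PreparedSketch`, and `prepare_ref_target_sketch` are hypothetical names, and the sentinel bytes 1 to 4 standing in for [s1] to [s4] are an assumption made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Reverse complement of an uppercase A/C/G/T string.
std::string reverse_complement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());  // reversed copy
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: throw std::invalid_argument("invalid nucleotide");
        }
    }
    return rc;
}

// Hypothetical result type: the combined string plus the offset where TARGET begins.
struct PreparedSketch {
    std::string text;
    std::size_t target_start;
};

// Builds REF[s1]TARGET[s2]RC(TARGET)[s3]RC(REF)[s4].
// Assumption: sentinels are the bytes 0x01..0x04; the real library may use other values.
PreparedSketch prepare_ref_target_sketch(const std::string& ref,
                                         const std::string& target) {
    PreparedSketch out;
    out.text = ref;
    out.text.push_back('\x01');              // [s1]
    out.target_start = out.text.size();      // factorization would start here
    out.text += target;
    out.text.push_back('\x02');              // [s2]
    out.text += reverse_complement(target);
    out.text.push_back('\x03');              // [s3]
    out.text += reverse_complement(ref);
    out.text.push_back('\x04');              // [s4]
    // Each sequence appears twice (forward and reverse complement), plus 4 sentinels.
    assert(out.text.size() == 2 * (ref.size() + target.size()) + 4);
    return out;
}
```

In this sketch, `target_start` is the offset from which a factorizer would begin emitting factors, so the reference is available for matching but is never factorized itself.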
Concatenates a reference sequence and target sequence (ref@target), then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins. This allows the target sequence to reference patterns in the reference sequence without factorizing the reference itself.
See also
factorize_dna_w_reference_seq_file() for file output version
factorize_w_reference() for general (non-DNA) reference factorization
Note
Factors cover only the target sequence, but they can reference positions in the reference sequence
The reference field in factors points to positions in the combined prepared string
Both sequences are converted to uppercase
Factor start positions are absolute positions in the combined reference+target string
Both sequences should contain only A, C, T, G nucleotides (case insensitive)
Reverse complement matches are encoded with RC_MASK in the ref field
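The RC_MASK encoding and the absolute start positions can be illustrated with a small sketch. Everything here is an assumption for illustration: `FactorSketch` stands in for the library's Factor type with guessed 64-bit fields, `RC_MASK_SKETCH` assumes the mask is the top bit of a 64-bit value, and `covers_target` assumes factors are contiguous and begin at the target's offset in the combined string.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout of a factor; the real noLZSS Factor type may differ.
struct FactorSketch {
    std::uint64_t start;   // absolute position in the combined string
    std::uint64_t length;  // number of characters covered
    std::uint64_t ref;     // referenced position, MSB flags a reverse complement match
};

// Assumed value: the most significant bit of a 64-bit ref field.
constexpr std::uint64_t RC_MASK_SKETCH = std::uint64_t{1} << 63;

// True when the factor matched a reverse complement.
bool is_reverse_complement(const FactorSketch& f) {
    return (f.ref & RC_MASK_SKETCH) != 0;
}

// Referenced position with the flag bit cleared.
std::uint64_t ref_position(const FactorSketch& f) {
    return f.ref & ~RC_MASK_SKETCH;
}

// Encodes a position as a reverse complement reference.
std::uint64_t make_rc_ref(std::uint64_t pos) {
    assert((pos & RC_MASK_SKETCH) == 0);  // position must fit in the low 63 bits
    return pos | RC_MASK_SKETCH;
}

// Checks that the factors tile [target_start, target_start + target_len) with no gaps.
bool covers_target(const std::vector<FactorSketch>& factors,
                   std::uint64_t target_start, std::uint64_t target_len) {
    std::uint64_t pos = target_start;
    for (const FactorSketch& f : factors) {
        if (f.start != pos || f.length == 0) return false;
        pos += f.length;
    }
    return pos == target_start + target_len;
}
```

A caller would typically test `is_reverse_complement` first and then use `ref_position` to recover the plain offset, since comparing the raw ref field against string positions would fail for flagged factors.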
- Parameters:
reference_seq – Reference DNA sequence string (should contain only A, C, T, G)
target_seq – Target DNA sequence string to be factorized (should contain only A, C, T, G)
- Throws:
std::invalid_argument – If sequences contain invalid nucleotides
std::runtime_error – If sequence preparation fails
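As a rough illustration of the std::invalid_argument condition combined with the uppercase conversion mentioned in the notes, input normalization might look like the sketch below. `normalize_dna_sketch` is a hypothetical helper, not part of the library, and the exact error message and validation order used by noLZSS are not known from this page.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: uppercases a DNA string and rejects anything
// other than A, C, G, T (case insensitive on input).
std::string normalize_dna_sketch(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (char c : seq) {
        // Cast through unsigned char: std::toupper is undefined for negative values.
        const char u = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        if (u != 'A' && u != 'C' && u != 'G' && u != 'T') {
            throw std::invalid_argument("sequence contains an invalid nucleotide");
        }
        out.push_back(u);
    }
    assert(out.size() == seq.size());  // normalization never changes the length
    return out;
}
```

Under this sketch, ambiguity codes such as N are rejected, so callers would need to filter or split sequences on such characters beforehand.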
- Returns:
Vector of Factor objects representing the factorization of the target sequence