Function noLZSS::factorize_w_reference
Defined in File factorizer.cpp
Function Documentation
-
std::vector<Factor> noLZSS::factorize_w_reference(const std::string &reference_seq, const std::string &target_seq)
Factorizes a target sequence against a reference sequence, without reverse-complement matching (general, non-DNA version).
This is the general-purpose version of reference-based factorization that works with any text, not just DNA. Unlike the DNA version, it does NOT consider reverse complements. This is useful for:
Non-genomic text compression with a reference document
Finding similarities between general text documents
Analyzing protein sequences or other non-DNA biological data
Algorithm:
Concatenates reference and target with a sentinel (ASCII 1) between them
Builds a compressed suffix tree on the combined string
Factorizes starting from the target sequence position
Factors can reference any position in the reference or target
The function concatenates the reference and target sequences (schematically ref@target, where @ marks the separator) and then runs noLZSS factorization starting at the position where the target begins. The target can therefore reuse patterns found in the reference, while the reference itself is never factorized. This makes the function suitable for general text or amino acid sequences.
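The layout described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `CombinedInput` and `build_combined` are hypothetical names, and the sketch assumes the combined string is exactly reference, sentinel (ASCII 1), target, as the algorithm steps state.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of noLZSS): mirrors the documented layout
// reference + sentinel (ASCII 1) + target.
struct CombinedInput {
    std::string text;          // reference, then sentinel, then target
    std::size_t target_start;  // index of the first target character in text
};

inline CombinedInput build_combined(const std::string& reference_seq,
                                    const std::string& target_seq) {
    const char sentinel = '\x01';
    CombinedInput out;
    out.text.reserve(reference_seq.size() + 1 + target_seq.size());
    out.text.append(reference_seq);
    out.text.push_back(sentinel);
    out.text.append(target_seq);
    // Factorization would begin here, just past the sentinel.
    out.target_start = reference_seq.size() + 1;
    return out;
}
```

Under this assumption, the first target character sits at index `reference_seq.size() + 1` of the combined string.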
See also
factorize_w_reference_file() for the file-output version
factorize_dna_w_reference_seq() for the DNA-specific version with reverse-complement matching
Note
Factors cover only the target sequence, but they can reference positions in the reference sequence.
Note
Both the start position and the reference field of each factor are absolute positions in the combined string (reference, sentinel, target).
Note
A sentinel character (ASCII 1) separates the reference from the target.
Note
No reverse-complement matching is performed, which suits text or amino acid sequences; use factorize_dna_w_reference_seq() for DNA.
Note
Sequences can contain any ASCII characters other than the sentinel (ASCII 1).
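Because positions are absolute in the combined string, a caller usually has to map them back to one of the two inputs. The sketch below shows one way to do that. `Region`, `Location`, and `locate` are hypothetical names introduced for illustration, and the mapping assumes a single sentinel character sits directly between the reference and the target.

```cpp
#include <cstddef>

// Hypothetical types (not part of noLZSS) for classifying a position
// in the combined string.
enum class Region { Reference, Sentinel, Target };

struct Location {
    Region region;
    std::size_t offset;  // offset within the region's own sequence
};

// Maps an absolute position in the combined string back to the input
// sequence it falls in, given the length of the reference sequence.
inline Location locate(std::size_t pos, std::size_t reference_len) {
    if (pos < reference_len) return {Region::Reference, pos};
    if (pos == reference_len) return {Region::Sentinel, 0};
    return {Region::Target, pos - reference_len - 1};
}
```

For example, with a reference of length 3, position 4 maps to offset 0 of the target.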
Warning
The sentinel character '\x01' (ASCII 1) must not appear in either input sequence, because it is used internally to separate the reference and target sequences.
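A caller can guard against this before invoking the function. The following is a minimal pre-check sketch. `contains_sentinel` and `require_no_sentinel` are hypothetical helpers, not part of noLZSS, and whether the library performs a similar check itself is not stated here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: true if the reserved sentinel (ASCII 1) occurs in seq.
inline bool contains_sentinel(const std::string& seq) {
    return seq.find('\x01') != std::string::npos;
}

// Hypothetical pre-check: throws if either input contains the sentinel.
inline void require_no_sentinel(const std::string& reference_seq,
                                const std::string& target_seq) {
    if (contains_sentinel(reference_seq) || contains_sentinel(target_seq)) {
        throw std::invalid_argument(
            "input contains the reserved sentinel character (ASCII 1)");
    }
}
```

Calling `require_no_sentinel(reference_seq, target_seq)` first turns a violated precondition into an explicit error.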
- Parameters:
reference_seq – Reference sequence string (any text)
target_seq – Target sequence string to be factorized (any text)
- Returns:
Vector of Factor objects representing the factorization of the target sequence
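One way to sanity-check a returned vector is to confirm that the factors tile the target region of the combined string. The sketch below uses a stand-in struct, `FactorView`, whose field names (`start`, `length`, `ref`) are assumptions; the real `Factor` type may differ. It also assumes factors are returned in order, are contiguous, and end exactly at the end of the target.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the library's Factor type; field names are assumptions.
struct FactorView {
    std::uint64_t start;   // absolute start in the combined string
    std::uint64_t length;  // number of characters covered
    std::uint64_t ref;     // absolute position of the referenced occurrence
};

// Returns true if the factors cover the target region exactly once,
// in order, starting just past the sentinel.
inline bool covers_target(const std::vector<FactorView>& factors,
                          std::size_t reference_len,
                          std::size_t target_len) {
    std::uint64_t expected = reference_len + 1;  // first target position
    for (const FactorView& f : factors) {
        if (f.length == 0 || f.start != expected) return false;
        expected += f.length;
    }
    return expected == reference_len + 1 + target_len;
}
```

To use this with real output, each returned Factor would first be copied into a `FactorView`; that step depends on the actual field names.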