Function noLZSS::factorize_w_reference

Function Documentation

std::vector<Factor> noLZSS::factorize_w_reference(const std::string &reference_seq, const std::string &target_seq)

Factorizes a target sequence against a reference sequence, without reverse-complement matching (general, non-DNA version).

This is the general-purpose version of reference-based factorization that works with any text, not just DNA. Unlike the DNA version, it does NOT consider reverse complements. This is useful for:

  • Non-genomic text compression with a reference document

  • Finding similarities between general text documents

  • Analyzing protein sequences or other non-DNA biological data

Algorithm:

  1. Concatenates reference and target with a sentinel (ASCII 1) between them

  2. Builds a compressed suffix tree on the combined string

  3. Factorizes starting from the target sequence position

  4. Factors can reference any position in the reference or target
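The steps above can be illustrated with a deliberately naive sketch. It is not the library's implementation: it replaces the compressed suffix tree of step 2 with a quadratic brute-force search, and it uses a plain greedy longest-earlier-match rule as a stand-in for the actual noLZSS rule. The names SketchFactor and naive_factorize, the field names, 0-based indexing, and the convention that a literal factor points at itself are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's Factor type (field names assumed).
struct SketchFactor {
    std::size_t start;   // absolute position in the combined string
    std::size_t length;  // number of characters covered
    std::size_t ref;     // absolute position of the earlier occurrence
};

std::vector<SketchFactor> naive_factorize(const std::string& reference_seq,
                                          const std::string& target_seq) {
    // Step 1: reference + sentinel (ASCII 1) + target.
    const std::string s = reference_seq + '\x01' + target_seq;

    // Step 3: start at the first target character, just past the sentinel.
    std::size_t i = reference_seq.size() + 1;
    std::vector<SketchFactor> factors;

    while (i < s.size()) {
        std::size_t best_len = 0;
        std::size_t best_ref = i;  // assumed convention: a literal points at itself

        // Step 4: candidate sources lie anywhere before i (reference or target).
        for (std::size_t j = 0; j < i; ++j) {
            std::size_t len = 0;
            // The copied source must end before i (no self-overlap).
            while (i + len < s.size() && j + len < i && s[j + len] == s[i + len]) {
                ++len;
            }
            if (len > best_len) {
                best_len = len;
                best_ref = j;
            }
        }

        if (best_len == 0) best_len = 1;  // no earlier match: emit a literal
        factors.push_back({i, best_len, best_ref});
        i += best_len;
    }

    assert(i == s.size());  // the factors tile the target exactly
    return factors;
}
```

With reference "abcabc" and target "bcax", the target begins at combined position 7, and this sketch emits one factor copying "bca" from position 1 followed by a literal for "x".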

The function concatenates the reference and target sequences (schematically ref@target, where @ stands for the sentinel) and runs noLZSS factorization from the position where the target begins. The target can therefore reuse patterns from the reference, while the reference itself is never factorized. This makes the function suitable for general text or amino acid sequences.

See also

factorize_w_reference_file() for the file-output version

factorize_dna_w_reference_seq() for the DNA-specific version with reverse-complement matching

Note

Factors cover only the target sequence, but can reference the reference sequence

Note

The reference field in factors points to positions in the combined string

Note

Sentinel character (ASCII 1) separates reference from target

Note

No reverse-complement matching is performed; use factorize_dna_w_reference_seq() for DNA

Note

Factor start positions are absolute positions in the combined reference+target string
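Because positions are reported against the combined string, callers usually need to translate them back to an offset in the reference or the target. The following is a minimal sketch of that translation, assuming 0-based positions and exactly one sentinel character between the two sequences; Location and locate are hypothetical names, not part of the library.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper type: where a combined-string position falls.
struct Location {
    bool in_reference;   // true: inside the reference; false: inside the target
    std::size_t offset;  // 0-based offset within that sequence
};

// Assumes the layout reference + one sentinel + target and 0-based positions.
Location locate(std::size_t combined_pos, std::size_t reference_len) {
    if (combined_pos < reference_len) {
        return {true, combined_pos};
    }
    if (combined_pos == reference_len) {
        // This index holds the sentinel, which belongs to neither sequence.
        throw std::out_of_range("position is the sentinel");
    }
    return {false, combined_pos - reference_len - 1};
}
```

Under these assumptions, with a reference of length 6, combined position 7 maps to target offset 0, and combined position 1 maps to reference offset 1.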

Note

Sequences can contain any ASCII characters except the sentinel (ASCII 1)

Warning

The sentinel character '\x01' (ASCII 1) must not appear in either input sequence, because it is used internally to separate the reference from the target
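A caller can guard against this before invoking the function. The sketch below shows one possible pre-check; kSentinel, contains_sentinel, and require_sentinel_free are hypothetical names, and the documentation does not say how the library itself reacts to a stray sentinel.

```cpp
#include <stdexcept>
#include <string>

// The separator documented for this function (ASCII 1).
constexpr char kSentinel = '\x01';

// Returns true if the sequence contains the sentinel character.
bool contains_sentinel(const std::string& seq) {
    return seq.find(kSentinel) != std::string::npos;
}

// Hypothetical pre-check: throws if either input would collide with the sentinel.
void require_sentinel_free(const std::string& reference_seq,
                           const std::string& target_seq) {
    if (contains_sentinel(reference_seq)) {
        throw std::invalid_argument("reference_seq contains the sentinel '\\x01'");
    }
    if (contains_sentinel(target_seq)) {
        throw std::invalid_argument("target_seq contains the sentinel '\\x01'");
    }
}
```

Calling require_sentinel_free(reference_seq, target_seq) immediately before factorize_w_reference() turns a silent input problem into an explicit error.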

Parameters:
  • reference_seq – Reference sequence string (any text)

  • target_seq – Target sequence string to be factorized (any text)

Returns:

Vector of Factor objects representing the factorization of the target sequence
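One way to sanity-check a returned vector is to confirm that the factors tile the target and nothing else. The sketch below assumes the factors are returned in order, that positions are 0-based, and that a factor exposes start and length fields; SketchFactor is a hypothetical stand-in for the library's Factor type, and covers_target is not a library function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's Factor type (field names assumed).
struct SketchFactor {
    std::size_t start;   // absolute position in the combined string
    std::size_t length;  // number of characters covered
    std::size_t ref;     // absolute position of the earlier occurrence
};

// Returns true if the factors cover exactly the target portion of the
// combined string, contiguously and in order.
bool covers_target(const std::vector<SketchFactor>& factors,
                   std::size_t reference_len, std::size_t target_len) {
    // The target starts just past the single sentinel character.
    std::size_t expected_start = reference_len + 1;
    for (const SketchFactor& f : factors) {
        if (f.length == 0 || f.start != expected_start) return false;
        expected_start += f.length;
    }
    return expected_start == reference_len + 1 + target_len;
}
```

For a reference of length 6 and a target of length 4, the factors {7, 3, 1} and {10, 1, 10} pass this check, while dropping the second factor makes it fail.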