Function noLZSS::factorize_w_reference_file

Function Documentation

size_t noLZSS::factorize_w_reference_file(const std::string &reference_seq, const std::string &target_seq, const std::string &out_path)

Factorizes a target sequence using a reference sequence and writes factors to a binary file (general version).

This is the file-output version of factorize_w_reference(). It performs general reference-based factorization (no reverse complement) and writes the results directly to a binary file in the noLZSS factor format, followed by a metadata footer.

The output file format:

  • Factors: Binary array of Factor structs (24 bytes each: start, length, reference)

  • Footer: Metadata including factor count, sequence count (2), sentinel count (1)
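
As a rough illustration of the layout above, the following sketch reads the leading factor array back from such a file. It assumes each 24-byte record is three unsigned 64-bit integers in the order start, length, reference, stored in the machine's native byte order; this page does not confirm the field widths or endianness. The footer layout is not specified here, so the sketch does not parse it and instead expects the caller to supply the factor count (for example, the function's return value). The names `FactorRecord` and `read_factors` are hypothetical and not part of noLZSS.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mirror of the on-disk record: ASSUMED to be three
// unsigned 64-bit fields (start, length, reference), 24 bytes total.
struct FactorRecord {
    std::uint64_t start;
    std::uint64_t length;
    std::uint64_t reference;
};
static_assert(sizeof(FactorRecord) == 24, "expected a 24-byte record");

// Reads `count` records from the beginning of the file at `path`.
// The metadata footer that follows the factor array is deliberately not parsed.
std::vector<FactorRecord> read_factors(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<FactorRecord> factors(count);
    if (count > 0) {
        const std::size_t bytes = count * sizeof(FactorRecord);
        in.read(reinterpret_cast<char*>(factors.data()),
                static_cast<std::streamsize>(bytes));
        if (static_cast<std::size_t>(in.gcount()) != bytes) {
            throw std::runtime_error("file is shorter than the requested factor count");
        }
    }
    return factors;
}
```

Passing the count explicitly keeps the sketch independent of the footer's exact byte layout, which should be taken from the library's own headers.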

Use cases:

  • Processing large non-DNA sequences without storing all factors in memory

  • Saving factorization results for later analysis

  • Comparing general text documents with a reference

Concatenates the reference and target sequences with a single sentinel character between them (written here as ref@target), factorizes the combined string starting at the position where the target begins, and writes the resulting factors to a binary file. Suitable for general text or amino acid sequences.
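
The concatenation step can be pictured with the sketch below. It is not the library's implementation: it assumes exactly one sentinel character ('\x01', as named in the warning on this page) sits between the two sequences and that nothing else is prepended or appended. `CombinedText` and `combine` are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Sentinel named in this page's warning: ASCII 1.
constexpr char kSentinel = '\x01';

// Hypothetical helper type: the combined text plus the index at which
// the target sequence begins inside it.
struct CombinedText {
    std::string text;
    std::size_t target_start;
};

// Builds reference + sentinel + target (ASSUMED layout) and records
// where factorization of the target would begin.
CombinedText combine(const std::string& reference_seq, const std::string& target_seq) {
    CombinedText c;
    c.text.reserve(reference_seq.size() + 1 + target_seq.size());
    c.text += reference_seq;
    c.text += kSentinel;
    c.text += target_seq;
    c.target_start = reference_seq.size() + 1;  // first character after the sentinel
    return c;
}
```

Under these assumptions, only positions at or after `target_start` receive factors, while matches may point anywhere earlier in the combined text.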

See also

factorize_w_reference() for in-memory version

See also

factorize_dna_w_reference_seq_file() for DNA-specific version with reverse complement

Note

Output file includes footer with num_sequences=2, num_sentinels=1

Note

File output uses buffered I/O (1 MB buffer) for performance

Note

The reference field in factors points to positions in the combined string

Note

No reverse complement awareness - this is for general text, not DNA

Note

Factor start positions are absolute positions in the combined reference+target string
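
Because positions are reported in combined-string coordinates, a consumer typically translates them before use. The sketch below assumes the target begins at reference length + 1 (one sentinel character after the reference); the helper names are hypothetical and the offset should be checked against the library before relying on it.

```cpp
#include <cstdint>
#include <stdexcept>

// ASSUMED offset of the target inside the combined string:
// the reference length plus one sentinel character.
std::uint64_t target_begin(std::uint64_t reference_len) {
    return reference_len + 1;
}

// Converts an absolute factor start into an offset relative to the target.
std::uint64_t to_target_offset(std::uint64_t start, std::uint64_t reference_len) {
    if (start < target_begin(reference_len)) {
        throw std::out_of_range("start lies before the beginning of the target");
    }
    return start - target_begin(reference_len);
}

// True when a factor's reference position falls inside the reference
// sequence rather than in the sentinel or the target itself.
bool points_into_reference(std::uint64_t reference_pos, std::uint64_t reference_len) {
    return reference_pos < reference_len;
}
```

For a 4-character reference, an absolute start of 5 maps to target offset 0, and a reference position of 3 still lies within the reference.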

Note

Binary format follows the same structure as other factorization binary outputs

Warning

The sentinel character ‘\x01’ (ASCII 1) must not appear in either input sequence, as it is used internally to separate the reference and target sequences
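
This page does not say what happens if the sentinel does occur in an input, so a caller-side check before invoking the function may be worthwhile. The following is a minimal sketch of such a check; `contains_sentinel` and `require_no_sentinel` are hypothetical helpers, not part of noLZSS.

```cpp
#include <stdexcept>
#include <string>

// Returns true if the reserved separator byte (ASCII 1) occurs in `s`.
bool contains_sentinel(const std::string& s) {
    return s.find('\x01') != std::string::npos;
}

// Throws if either input would collide with the internal separator.
void require_no_sentinel(const std::string& reference_seq,
                         const std::string& target_seq) {
    if (contains_sentinel(reference_seq) || contains_sentinel(target_seq)) {
        throw std::invalid_argument(
            "input contains the reserved sentinel character \\x01");
    }
}
```

Calling `require_no_sentinel(reference_seq, target_seq)` first turns a silent violation of the precondition into an explicit error.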

Warning

This function overwrites the output file if it exists

Parameters:
  • reference_seq – Reference sequence (any text)

  • target_seq – Target sequence to factorize (any text)

  • out_path – Path to output binary file (will be overwritten if exists)

Throws:

std::runtime_error – If output file cannot be created

Returns:

Number of factors written to the file
