Function noLZSS::factorize_dna_w_reference_seq_file

Function Documentation

size_t noLZSS::factorize_dna_w_reference_seq_file(const std::string &reference_seq, const std::string &target_seq, const std::string &out_path)

Factorizes a target DNA sequence using a reference sequence and writes factors to a binary file.

This is the file-output version of factorize_dna_w_reference_seq(). It performs the same reference-based DNA factorization but writes the results directly to a binary file in the noLZSS factor format, followed by a metadata footer.

The output file format:

  • Factors: Binary array of Factor structs (24 bytes each: start, length, reference)

  • Footer: Metadata including factor count, sequence count (2), sentinel count (1)
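As an illustration of the layout above, the following is a minimal sketch of reading factors back from such a file. It assumes (the documentation does not state this explicitly) that each of the three 24-byte record fields is a 64-bit unsigned integer in native byte order, and that the factor array sits at the start of the file with the footer after it. The `Factor` struct and `read_factors` helper here are hypothetical stand-ins, not the library's own declarations; `count` would typically be the value returned by factorize_dna_w_reference_seq_file().

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mirror of the on-disk record. Assumption: three 64-bit
// unsigned fields, which matches the documented 24 bytes per factor.
struct Factor {
    std::uint64_t start;   // absolute start position in the combined string
    std::uint64_t length;  // factor length
    std::uint64_t ref;     // reference position in the combined string
};
static_assert(sizeof(Factor) == 24, "Factor is expected to be 24 bytes");

// Reads the first `count` factors from the beginning of the file.
// Assumption: the factor array precedes the metadata footer.
std::vector<Factor> read_factors(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<Factor> factors(count);
    if (count > 0) {
        const auto bytes = static_cast<std::streamsize>(count * sizeof(Factor));
        in.read(reinterpret_cast<char*>(factors.data()), bytes);
        if (in.gcount() != bytes) {
            throw std::runtime_error("file is shorter than expected: " + path);
        }
    }
    return factors;
}
```

The footer is deliberately left unparsed here, since its exact field layout is not described in this section.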

This is useful for:

  • Processing large sequences without storing all factors in memory

  • Saving factorization results for later analysis

  • Feeding results into other tools that read noLZSS binary format

Internally, the function concatenates the reference and target sequences (ref@target), runs noLZSS factorization with reverse-complement awareness starting at the position where the target sequence begins, and writes the resulting factors to a binary file.
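To make the two ideas in this description concrete, here is a small sketch of a reverse complement and of the offset at which the target would begin in the combined string. Both helpers are hypothetical and are not part of the noLZSS API. The offset calculation assumes that exactly one single-character sentinel separates the reference from the target, which is suggested by the "ref@target" notation and the documented sentinel count of 1, but the library's actual string preparation may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>

// Complement of a single uppercase nucleotide.
char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  throw std::invalid_argument("invalid nucleotide");
    }
}

// Reverse complement: reverse the sequence, then complement every base.
std::string reverse_complement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Assumption: the combined string is reference + one sentinel + target,
// so the target begins one position past the end of the reference.
std::size_t target_start_offset(const std::string& reference_seq) {
    return reference_seq.size() + 1;
}
```

Under that assumption, a factor whose start equals `target_start_offset(reference_seq)` would be the first factor of the target.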

See also

factorize_dna_w_reference_seq() for in-memory version

See also

factorize_w_reference_file() for general (non-DNA) reference factorization to file

Note

The output file includes a footer with num_sequences=2 and num_sentinels=1

Note

The file is written with buffered I/O (1 MB buffer) for performance

Note

The reference field in factors points to positions in the combined prepared string

Note

Factor start positions are absolute positions in the combined reference+target string

Note

Both sequences should contain only A, C, T, G nucleotides (case insensitive)
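Callers who prefer to check their input before invoking the function could use something like the sketch below. `validate_dna` is a hypothetical helper, not a noLZSS function; it only mirrors the rule stated in this note (A, C, G, T in either case) and reports violations with std::invalid_argument, the same exception type the function itself is documented to throw.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Throws std::invalid_argument if `seq` contains anything other than
// A, C, G, or T (upper or lower case). `name` is used in the message only.
void validate_dna(const std::string& seq, const std::string& name) {
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const char c = static_cast<char>(
            std::toupper(static_cast<unsigned char>(seq[i])));
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            throw std::invalid_argument(
                name + " contains an invalid nucleotide at position " +
                std::to_string(i));
        }
    }
}
```

Checking up front gives an error message that names the offending sequence and position, which can be easier to act on than an exception raised partway through a long run.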

Note

Binary format follows the same structure as other DNA factorization binary outputs

Warning

This function overwrites the output file if it exists

Parameters:
  • reference_seq – Reference DNA sequence string (should contain only A, C, T, G)

  • target_seq – Target DNA sequence string to factorize (should contain only A, C, T, G)

  • out_path – Path to the output file where binary factors are written (overwritten if it exists)

Throws:
  • std::runtime_error – If output file cannot be created

  • std::invalid_argument – If sequences contain invalid nucleotides


Returns:

Number of factors written to the file
