Function noLZSS::factorize_dna_w_reference_seq_file
Defined in File factorizer.cpp
Function Documentation
size_t noLZSS::factorize_dna_w_reference_seq_file(const std::string &reference_seq, const std::string &target_seq, const std::string &out_path)
Factorizes a target DNA sequence using a reference sequence and writes factors to a binary file.
This is the file-output version of factorize_dna_w_reference_seq(). It performs the same reference-based DNA factorization but writes the results directly to a binary file in the noLZSS factor format, followed by a metadata footer.
The output file format:
Factors: Binary array of Factor structs (24 bytes each: start, length, reference)
Footer: Metadata including factor count, sequence count (2), sentinel count (1)
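The factor layout described above can be illustrated with a minimal sketch. It assumes each of the three fields is a 64-bit unsigned integer stored in native byte order (consistent with the stated 24 bytes, but not confirmed here), and the names Factor, ref, and read_factors are hypothetical, not part of the noLZSS API. Because the footer's exact byte layout is not specified in this section, the sketch takes the factor count from the caller (for example, the function's return value) instead of parsing the footer.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed layout: three 64-bit unsigned fields in the order the text lists them.
struct Factor {
    std::uint64_t start;   // start position of the factor
    std::uint64_t length;  // length of the factor
    std::uint64_t ref;     // position of the earlier occurrence it refers to
};
static_assert(sizeof(Factor) == 24, "Factor is expected to occupy 24 bytes");

// Reads `count` factors from the beginning of the file at `path`.
// The footer that follows the factor array is left unread.
std::vector<Factor> read_factors(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<Factor> factors(count);
    const auto bytes = static_cast<std::streamsize>(count * sizeof(Factor));
    in.read(reinterpret_cast<char*>(factors.data()), bytes);
    if (in.gcount() != bytes) {
        throw std::runtime_error("file ends before the requested factors");
    }
    return factors;
}
```

Passing the count explicitly keeps the sketch independent of the footer format; a real reader would take the count from the footer once its layout is known.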
This is useful for:
Processing large sequences without storing all factors in memory
Saving factorization results for later analysis
Feeding results into other tools that read noLZSS binary format
Concatenates a reference sequence and target sequence (ref@target), then performs noLZSS factorization with reverse complement awareness starting from where the target sequence begins, and writes the resulting factors to a binary file.
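To make the concatenation and reverse-complement ideas concrete, here is a minimal sketch. It is not the library's implementation: the helper names reverse_complement, Combined, and concat_with_sentinel are hypothetical, the '@' separator simply mirrors the ref@target notation used above, and the real prepared string may contain additional material (such as reverse-complemented copies) that this sketch omits.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the reverse complement of an uppercase DNA string.
std::string reverse_complement(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());  // reverse first
    for (char& c : out) {                       // then complement each base
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: throw std::invalid_argument("invalid nucleotide");
        }
    }
    return out;
}

// Combined text plus the offset at which the target begins.
struct Combined {
    std::string text;
    std::size_t target_start;
};

// Joins reference and target with a single separator character.
// Factorization would begin at target_start, so only the target is factorized.
Combined concat_with_sentinel(const std::string& ref, const std::string& target) {
    return {ref + '@' + target, ref.size() + 1};
}
```

With this layout, a reference of length n places the first target base at offset n + 1, which is why factor start positions are reported relative to the combined string.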
See also
factorize_dna_w_reference_seq() for in-memory version
See also
factorize_w_reference_file() for general (non-DNA) reference factorization to file
Note
The output file includes a footer with num_sequences=2 and num_sentinels=1
Note
File output uses buffered I/O (1 MB buffer) for performance
Note
The reference field in factors points to positions in the combined prepared string
Note
Factor start positions are absolute positions in the combined reference+target string
Note
Both sequences should contain only A, C, T, G nucleotides (case insensitive)
Note
Binary format follows the same structure as other DNA factorization binary outputs
Warning
This function overwrites the output file if it exists
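The buffered-I/O note and the overwrite warning can be illustrated together. The following sketch shows one common way to get this behavior with standard C++ streams; it is an assumption about the general technique, not a description of how noLZSS implements it, and write_buffered is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Writes 64-bit values to `path` through a 1 MB stream buffer.
// Opening with std::ios::trunc discards any existing file content.
std::size_t write_buffered(const std::string& path,
                           const std::vector<std::uint64_t>& values) {
    std::vector<char> buffer(1 << 20);  // 1 MB; declared first so it outlives the stream
    std::ofstream out;
    // The buffer must be installed before the file is opened to take effect.
    out.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        throw std::runtime_error("cannot create " + path);
    }
    for (std::uint64_t v : values) {
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
    }
    return values.size();
}
```

Because the file is truncated on open, calling the helper twice with the same path leaves only the second call's data, which matches the overwrite behavior the warning describes.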
- Parameters:
reference_seq – Reference DNA sequence string (should contain only A, C, T, G)
target_seq – Target DNA sequence string to factorize (should contain only A, C, T, G)
out_path – Path to the output binary file where factors are written (overwritten if it exists)
- Throws:
std::runtime_error – If output file cannot be created
std::invalid_argument – If sequences contain invalid nucleotides
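The validation behind std::invalid_argument can be sketched as follows. This assumes case-insensitive input is handled by converting to uppercase before checking, which is one plausible reading of the notes above; normalize_dna is a hypothetical helper, not a noLZSS function.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Uppercases a DNA string and rejects anything other than A, C, G, T.
std::string normalize_dna(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (char c : seq) {
        // Cast to unsigned char first: std::toupper is undefined for negative values.
        const char u = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        if (u != 'A' && u != 'C' && u != 'G' && u != 'T') {
            throw std::invalid_argument(std::string("invalid nucleotide: ") + c);
        }
        out.push_back(u);
    }
    return out;
}
```

Callers that cannot guarantee clean input (for example, sequences containing N or IUPAC ambiguity codes) may want to run a check like this, or catch std::invalid_argument, before starting a long factorization.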
- Returns:
Number of factors written to the file