Function noLZSS::write_factors_dna_w_reference_fasta_files_to_binary
Defined in File fasta_processor.cpp
Function Documentation
size_t noLZSS::write_factors_dna_w_reference_fasta_files_to_binary(const std::string &reference_fasta_path, const std::string &target_fasta_path, const std::string &out_path, FastaDnaSanitizationMode sanitization_mode = FastaDnaSanitizationMode::RemoveAmbiguous)
Writes noLZSS factors from DNA sequences in reference and target FASTA files to a binary output file.
This function reads DNA sequences from two FASTA files (reference and target), concatenates them with a sentinel separator, performs general factorization starting from the target sequences, and writes the resulting factors in binary format to an output file.
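The concatenation step described above can be sketched as follows. This is only an illustration, not the library's implementation: the sentinel character `$`, the `Concatenated` struct, and the helper names `normalize_dna` and `concat_reference_and_target` are all hypothetical, and the sketch ignores `sanitization_mode` by simply rejecting any character other than A, C, G, or T. It shows how placing the reference first yields a single offset (`target_start`) from which factorization of the target can begin.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sentinel; the separator actually used by noLZSS is not documented here.
constexpr char kSentinel = '$';

struct Concatenated {
    std::string text;          // reference + sentinel + target
    std::size_t target_start;  // index of the first target character in text
};

// Uppercases a sequence and rejects anything other than A, C, G, T.
// Assumption: strict rejection; the real function's behavior depends on sanitization_mode.
std::string normalize_dna(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (unsigned char c : seq) {
        const char u = static_cast<char>(std::toupper(c));
        if (u != 'A' && u != 'C' && u != 'G' && u != 'T') {
            throw std::invalid_argument("invalid nucleotide in sequence");
        }
        out.push_back(u);
    }
    return out;
}

// Joins reference and target with the sentinel and records where the target begins.
Concatenated concat_reference_and_target(const std::string& reference,
                                         const std::string& target) {
    Concatenated c;
    c.text = normalize_dna(reference);
    c.text.push_back(kSentinel);
    c.target_start = c.text.size();
    c.text += normalize_dna(target);
    return c;
}
```

With this layout, a factor that begins at or after `target_start` may still point back into the reference portion, which is what allows the target to be described in terms of the reference.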
Note
Binary format: each factor is 24 bytes (3 × uint64_t: start, length, ref)
Note
Only A, C, T, G nucleotides are allowed (case-insensitive)
Note
This function overwrites the output file if it exists
Note
Uses general factorization (no reverse complement awareness)
Note
Factorization starts from target sequence positions only
Warning
Ensure sufficient disk space for the output file
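A minimal sketch of reading the output back, based on the 24-byte record layout stated in the notes. It rests on assumptions that this page does not confirm: that the file contains only back-to-back records with no header or footer, and that the integers are stored in the byte order of the machine reading them. The `Factor` struct and `read_factors` helper are hypothetical names, not part of the noLZSS API.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed on-disk record: three uint64_t values in the documented order.
struct Factor {
    std::uint64_t start;
    std::uint64_t length;
    std::uint64_t ref;
};
static_assert(sizeof(Factor) == 24, "Factor must be exactly 24 bytes");

// Reads every 24-byte record from path.
// Assumption: no header or footer, and host byte order matches the writer's.
std::vector<Factor> read_factors(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    const std::streamoff size = in.tellg();
    if (size < 0 || size % 24 != 0) {
        throw std::runtime_error("file size is not a multiple of 24 bytes");
    }
    std::vector<Factor> factors(static_cast<std::size_t>(size) / sizeof(Factor));
    in.seekg(0, std::ios::beg);
    if (!factors.empty()) {
        in.read(reinterpret_cast<char*>(factors.data()), size);
    }
    if (!in) {
        throw std::runtime_error("short read from " + path);
    }
    return factors;
}
```

Under the same assumptions, the output occupies 24 bytes per factor, so the function's return value multiplied by 24 gives the expected file size, which is useful when estimating disk space.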
- Parameters:
reference_fasta_path – Path to FASTA file containing reference DNA sequences
target_fasta_path – Path to FASTA file containing target DNA sequences
out_path – Path to output file where binary factors will be written
sanitization_mode – Sanitization mode applied to the DNA input (default: FastaDnaSanitizationMode::RemoveAmbiguous)
- Throws:
std::runtime_error – If FASTA files cannot be opened or contain no valid sequences
std::invalid_argument – If the total number of sequences is too large or invalid nucleotides are found
- Returns:
Number of factors written to the output file