Function noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary
Defined in File parallel_fasta_processor.cpp
Function Documentation
size_t noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary(const std::string &reference_fasta_path, const std::string &target_fasta_path, const std::string &out_path, size_t num_threads = 0, FastaDnaSanitizationMode sanitization_mode = FastaDnaSanitizationMode::RemoveAmbiguous)
Parallel version of write_factors_dna_w_reference_fasta_files_to_binary.
Reads DNA sequences from the reference and target FASTA files, concatenates them with sentinels, and factorizes the result in parallel. Factorization starts from target sequence positions only, and the results are written to a binary output file with metadata.
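The concatenation step can be pictured with a minimal sketch. Everything in it is hypothetical and not part of noLZSS: the `ConcatenatedInput` type, the `concatenate_with_sentinels` helper, the `'$'` sentinel character, and the choice to place one sentinel between consecutive sequences. The library's actual sentinel encoding and layout may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not a noLZSS type).
struct ConcatenatedInput {
    std::string text;                             // reference sequences, then target sequences
    std::vector<std::size_t> sentinel_positions;  // index of every sentinel in `text`
    std::size_t target_start = 0;                 // index of the first target base
};

// Joins all reference sequences followed by all target sequences, placing
// one sentinel between consecutive sequences. The '$' sentinel is an
// assumption made for illustration only.
ConcatenatedInput concatenate_with_sentinels(
    const std::vector<std::string>& reference,
    const std::vector<std::string>& target,
    char sentinel = '$') {
    ConcatenatedInput out;
    auto append = [&out, sentinel](const std::string& seq) {
        if (!out.text.empty()) {
            out.sentinel_positions.push_back(out.text.size());
            out.text.push_back(sentinel);
        }
        out.text += seq;
    };

    for (const auto& seq : reference) append(seq);

    // The first target base sits just past the sentinel that follows the
    // last reference sequence (or at the current end if nothing precedes it
    // or there are no targets).
    const bool needs_sentinel = !out.text.empty() && !target.empty();
    out.target_start = out.text.size() + (needs_sentinel ? 1 : 0);

    for (const auto& seq : target) append(seq);
    return out;
}
```

In this sketch, a factorizer would emit factors only for positions at or after `target_start`, while still being free to point back into the reference part of `text`.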
Note
Binary format includes factors, sequence IDs, sentinel indices, and footer metadata
Note
Only A, C, T, G nucleotides are allowed (case insensitive)
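As an illustration of the alphabet rule, here is a small sketch of a case-insensitive check. The `normalize_dna` helper is hypothetical and not part of noLZSS. It assumes that any character outside A, C, G, T is rejected, and it does not model the `sanitization_mode` parameter, whose effect is not described in this section.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: uppercases a sequence and rejects anything that is
// not A, C, G, or T. How the library itself treats other characters may
// depend on sanitization_mode, which this sketch does not model.
std::string normalize_dna(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (unsigned char c : seq) {
        const char upper = static_cast<char>(std::toupper(c));
        if (upper != 'A' && upper != 'C' && upper != 'G' && upper != 'T') {
            throw std::invalid_argument(
                std::string("invalid nucleotide: ") + static_cast<char>(c));
        }
        out.push_back(upper);
    }
    return out;
}
```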
Note
This function overwrites the output file if it exists
Note
Reverse complement matches are supported during factorization
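For readers unfamiliar with the term, the reverse complement of a DNA string is obtained by reversing it and swapping A with T and C with G. The sketch below shows only that transformation. The `complement` and `reverse_complement` helpers are hypothetical names and say nothing about how noLZSS searches for or encodes such matches.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Watson-Crick complement of a single uppercase base.
char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  throw std::invalid_argument("unexpected nucleotide");
    }
}

// Reverses the sequence, then complements every base.
std::string reverse_complement(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    std::transform(out.begin(), out.end(), out.begin(), complement);
    return out;
}
```

Under this definition, a target substring `CGTT` can match a reference substring `AACG` read on the opposite strand.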
Note
Factorization starts from target sequence positions only
Note
For single-threaded execution (num_threads=1), no temporary files are created
Warning
Ensure sufficient disk space for the output file and temporary files
- Parameters:
reference_fasta_path – Path to FASTA file containing reference DNA sequences
target_fasta_path – Path to FASTA file containing target DNA sequences
out_path – Path to output file where binary factors will be written
num_threads – Number of threads to use (0 = auto-detect based on input size)
- Throws:
std::runtime_error – If FASTA files cannot be opened or contain no valid sequences
std::invalid_argument – If the total number of sequences is too large or invalid nucleotides are found
- Returns:
Number of factors written to the output file
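A call site will usually want to handle both documented exception types. The sketch below is one possible pattern, assuming C++17. `run_reporting_errors` is a hypothetical helper, not part of noLZSS, and it takes the factorization call as a callable so that the snippet compiles without the library. The file names in the comment are placeholders, and the header that declares the function is not named in this section.

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <stdexcept>

// Hypothetical helper: runs a callable that returns a factor count and turns
// the two documented exception types into an empty result plus a message.
template <typename Factorize>
std::optional<std::size_t> run_reporting_errors(Factorize&& factorize) {
    try {
        return factorize();
    } catch (const std::invalid_argument& e) {
        // Documented for too many sequences or invalid nucleotides.
        std::cerr << "invalid input: " << e.what() << '\n';
    } catch (const std::runtime_error& e) {
        // Documented for unreadable FASTA files or no valid sequences.
        std::cerr << "could not read input: " << e.what() << '\n';
    }
    return std::nullopt;
}

// Intended use with the library (needs the noLZSS header that declares the
// function; paths below are placeholders):
//
//   auto count = run_reporting_errors([] {
//       return noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary(
//           "reference.fasta", "target.fasta", "factors.bin", /*num_threads=*/0);
//   });
//   if (count) std::cout << *count << " factors written\n";
```

Wrapping the call in a lambda keeps the default arguments (`num_threads`, `sanitization_mode`) usable, which would be lost if the function were passed as a plain function pointer.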