Function noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary

Function Documentation

size_t noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary(const std::string &reference_fasta_path, const std::string &target_fasta_path, const std::string &out_path, size_t num_threads = 0, FastaDnaSanitizationMode sanitization_mode = FastaDnaSanitizationMode::RemoveAmbiguous)

Parallel version of write_factors_dna_w_reference_fasta_files_to_binary.

Reads DNA sequences from the reference and target FASTA files and concatenates them with sentinels. Factorization then runs in parallel, starting from target sequence positions only, and the results are written to a binary output file with metadata.

Note

Binary format includes factors, sequence IDs, sentinel indices, and footer metadata

Note

Only the nucleotides A, C, G, and T are allowed (case-insensitive)
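As a rough illustration of the alphabet restriction above, a caller could pre-check sequences before handing files to the function. The helper name `is_plain_dna` is hypothetical and not part of noLZSS; how the `sanitization_mode` parameter treats other characters is not described on this page, so this sketch only mirrors the note itself.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of noLZSS): returns true if every character
// of `seq` is A, C, G, or T in either case. An empty string passes trivially.
bool is_plain_dna(const std::string& seq) {
    for (unsigned char ch : seq) {
        switch (std::toupper(ch)) {
            case 'A':
            case 'C':
            case 'G':
            case 'T':
                break;         // allowed nucleotide
            default:
                return false;  // anything else, e.g. 'N', is rejected
        }
    }
    return true;
}
```

For instance, `is_plain_dna("acGT")` holds, while `is_plain_dna("ACGN")` does not.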

Note

This function overwrites the output file if it exists

Note

Reverse complement matches are supported during factorization
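For readers unfamiliar with the term, the reverse complement of a DNA string can be sketched as follows. This is a generic illustration, not the library's internal routine; the function names are hypothetical and the sketch assumes upper-case input.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: maps an upper-case nucleotide to its complement.
// Characters outside A/C/G/T are returned unchanged.
char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return base;
    }
}

// Reverses the sequence, then complements each base.
std::string reverse_complement(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    std::transform(out.begin(), out.end(), out.begin(), complement);
    return out;
}
```

A reverse complement match means a factor may refer to an earlier region whose reverse complement equals the current substring, for example `AACG` against `CGTT`.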

Note

Factorization starts from target sequence positions only
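The idea of concatenating reference and target sequences and starting factorization at the first target position can be pictured with the sketch below. It is only an illustration under stated assumptions: the `$` sentinel, the one-sentinel-per-sequence layout, and the names `ConcatenatedText` and `concatenate_with_sentinels` are all hypothetical and may differ from what the library actually does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the joined text plus the offset at which the
// first target sequence begins.
struct ConcatenatedText {
    std::string text;
    std::size_t target_start = 0;
};

// Assumed layout: every sequence is followed by one sentinel character,
// reference sequences first, then target sequences.
ConcatenatedText concatenate_with_sentinels(
        const std::vector<std::string>& reference,
        const std::vector<std::string>& target,
        char sentinel = '$') {
    ConcatenatedText result{};
    for (const auto& seq : reference) {
        result.text += seq;
        result.text += sentinel;
    }
    // Factorization would begin here; everything before this offset is
    // available only as a source of earlier matches.
    result.target_start = result.text.size();
    for (const auto& seq : target) {
        result.text += seq;
        result.text += sentinel;
    }
    return result;
}
```

With references `ACG` and `TT` and target `GA`, this layout gives the text `ACG$TT$GA$` and a target start offset of 7.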

Note

For single-threaded execution (num_threads=1), no temporary files are created

Warning

Ensure sufficient disk space for the output file and for any temporary files created during multi-threaded runs

Parameters:
  • reference_fasta_path – Path to FASTA file containing reference DNA sequences

  • target_fasta_path – Path to FASTA file containing target DNA sequences

  • out_path – Path to output file where binary factors will be written

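  • num_threads – Number of threads to use (0 = auto-detect based on input size)

  • sanitization_mode – Sanitization mode applied to the FASTA DNA input (default: FastaDnaSanitizationMode::RemoveAmbiguous)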

Throws:
  • std::runtime_error – If FASTA files cannot be opened or contain no valid sequences

  • std::invalid_argument – If the total number of sequences is too large or invalid nucleotides are found

Returns:

Number of factors written to the output file
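Since the function reports failures through the two exception types listed above, a caller may want to handle them separately. The wrapper below is a minimal sketch of that pattern; `try_factorize` is a hypothetical helper, not part of noLZSS, and it takes the actual call as a callable so the snippet compiles without the library.

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <stdexcept>

// Hypothetical helper (not part of noLZSS): runs `call`, which is expected to
// return the number of factors written, and turns the two documented
// exception types into an empty optional after logging them.
template <typename Call>
std::optional<std::size_t> try_factorize(Call&& call) {
    try {
        return call();
    } catch (const std::invalid_argument& e) {
        // Documented cause: too many sequences or invalid nucleotides.
        std::cerr << "invalid input: " << e.what() << '\n';
    } catch (const std::runtime_error& e) {
        // Documented cause: FASTA files unreadable or without valid sequences.
        std::cerr << "runtime failure: " << e.what() << '\n';
    }
    return std::nullopt;
}
```

With the library linked, the callable could be a lambda such as `[&] { return noLZSS::parallel_write_factors_dna_w_reference_fasta_files_to_binary(ref, tgt, out, 4); }`, where `ref`, `tgt`, and `out` are path strings supplied by the caller. Passing `num_threads = 1` instead would, per the note above, avoid temporary files.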