Function noLZSS::parallel_write_factors_binary_file_fasta_dna_w_rc_per_sequence

Function Documentation

size_t noLZSS::parallel_write_factors_binary_file_fasta_dna_w_rc_per_sequence(const std::string &fasta_path, const std::string &out_dir, size_t num_threads = 0, FastaDnaSanitizationMode sanitization_mode = FastaDnaSanitizationMode::RemoveAmbiguous)

Parallel version of write_factors_binary_file_fasta_dna_w_rc_per_sequence.

Reads a FASTA file, factorizes each sequence independently and in parallel with reverse-complement awareness, and writes the factors of each sequence to a separate binary file.

Note

Each sequence is factorized independently in parallel

Note

Creates a separate binary file for each sequence: <out_dir>/<seq_id>.bin
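To locate the output for a given sequence after the call, the naming pattern above can be rebuilt with std::filesystem. The following is a minimal sketch, not part of noLZSS: the helper name factor_file_path is hypothetical, and how seq_id is derived from the FASTA header is not specified here, so the identifier is simply taken as given.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not part of noLZSS): builds <out_dir>/<seq_id>.bin.
// The suffix is appended rather than set with replace_extension(), so that
// identifiers containing dots (for example versioned accessions) keep
// their full name.
std::filesystem::path factor_file_path(const std::string& out_dir,
                                       const std::string& seq_id) {
    return std::filesystem::path(out_dir) / (seq_id + ".bin");
}
```

For example, factor_file_path("out", "chr1") yields a path whose filename is chr1.bin inside the directory out.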

Note

Binary format per file: factors followed by a metadata footer

Note

Only the nucleotides A, C, G, and T are allowed (case-insensitive)

Note

Reverse complement matches are supported during factorization
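For readers unfamiliar with the term, the reverse complement of a DNA string is obtained by reversing it and swapping A with T and C with G. The sketch below only illustrates that definition; the function names are hypothetical and it says nothing about how noLZSS finds such matches internally.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Illustrative only (not part of noLZSS): complement of a single base.
// Accepts upper or lower case and always returns upper case.
char complement(char base) {
    switch (base) {
        case 'A': case 'a': return 'T';
        case 'C': case 'c': return 'G';
        case 'G': case 'g': return 'C';
        case 'T': case 't': return 'A';
        default: throw std::invalid_argument("invalid nucleotide");
    }
}

// Reverse the sequence, then complement every base.
std::string reverse_complement(const std::string& dna) {
    std::string rc(dna.rbegin(), dna.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}
```

With this definition, reverse_complement("AACG") is "CGTT", so a factor that copies "CGTT" can refer back to an earlier occurrence of "AACG" read on the opposite strand.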

Warning

Ensure sufficient disk space for the output files

Parameters:
  • fasta_path – Path to input FASTA file containing DNA sequences

  • out_dir – Path to output directory where binary factor files will be written

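  • num_threads – Number of threads to use (0 = auto-detect based on sequence count)

  • sanitization_mode – Sanitization mode applied to the input DNA (default: FastaDnaSanitizationMode::RemoveAmbiguous)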

Throws:
  • std::runtime_error – If the FASTA file cannot be opened or contains no valid sequences

  • std::invalid_argument – If invalid nucleotides are found

Returns:

Total number of factors written across all sequences
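A caller will usually want to handle both documented exceptions and use the returned factor count. The sketch below shows one possible pattern under stated assumptions: run_factorization is a hypothetical wrapper, not part of noLZSS, and its factorize argument stands in for a callable that forwards to noLZSS::parallel_write_factors_binary_file_fasta_dna_w_rc_per_sequence, so the example compiles without the library headers.

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (not part of noLZSS). `factorize` is any callable
// taking (fasta_path, out_dir, num_threads) and returning the total number
// of factors written. Returns std::nullopt if a documented exception occurs.
template <typename Factorize>
std::optional<std::size_t> run_factorization(Factorize&& factorize,
                                             const std::string& fasta_path,
                                             const std::string& out_dir,
                                             std::size_t num_threads = 0) {
    try {
        return factorize(fasta_path, out_dir, num_threads);
    } catch (const std::invalid_argument& e) {
        // Documented case: invalid nucleotides in the input.
        std::cerr << "invalid nucleotides in " << fasta_path << ": "
                  << e.what() << '\n';
    } catch (const std::runtime_error& e) {
        // Documented case: file cannot be opened or has no valid sequences.
        std::cerr << "cannot process " << fasta_path << ": "
                  << e.what() << '\n';
    }
    return std::nullopt;
}
```

In real use, factorize would be a lambda that forwards its three arguments to the noLZSS function. A lambda is preferable to passing the function name directly, because default arguments such as sanitization_mode do not apply when calling through a function pointer.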