Function noLZSS::write_factors_binary_file_fasta_dna_w_rc_per_sequence

Function Documentation

size_t noLZSS::write_factors_binary_file_fasta_dna_w_rc_per_sequence(const std::string &fasta_path, const std::string &out_dir, FastaDnaSanitizationMode sanitization_mode)

Writes factors from per-sequence DNA factorization with reverse complement to separate binary files.

Reads a FASTA file, factorizes each sequence independently with reverse complement awareness, and writes each sequence’s factors to a separate binary output file. File names include the sequence ID.

Note

Creates a separate binary file for each sequence: <out_dir>/<seq_id>.bin
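
The naming scheme above can be sketched with std::filesystem. The helper name output_path_for is hypothetical and not part of noLZSS. This page does not say how sequence IDs containing path separators or other special characters are handled, so the sketch assumes a plain ID.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not a noLZSS function): builds <out_dir>/<seq_id>.bin.
// Assumption: seq_id contains no path separators.
std::filesystem::path output_path_for(const std::filesystem::path& out_dir,
                                      const std::string& seq_id) {
    return out_dir / (seq_id + ".bin");
}
```

A caller could use such a helper to locate the file produced for a given sequence ID after the function returns.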

Note

Binary format per file: the factors, followed by a metadata footer

Note

Only A, C, T, G nucleotides are allowed (case insensitive)

Note

Reverse complement matches are supported during factorization
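
To make the two notes above concrete, here is a minimal sketch of computing a reverse complement while accepting only A, C, G, T in either case. It is illustrative only and is not the noLZSS implementation. The function name reverse_complement and the choice to return uppercase output are assumptions.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Illustrative only (not the noLZSS implementation): returns the reverse
// complement of a DNA string. Accepts A/C/G/T in either case, emits
// uppercase, and throws std::invalid_argument for any other character.
std::string reverse_complement(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    // Walk the input backwards and complement each base.
    for (auto it = seq.rbegin(); it != seq.rend(); ++it) {
        switch (std::toupper(static_cast<unsigned char>(*it))) {
            case 'A': out.push_back('T'); break;
            case 'C': out.push_back('G'); break;
            case 'G': out.push_back('C'); break;
            case 'T': out.push_back('A'); break;
            default:
                throw std::invalid_argument("invalid nucleotide in sequence");
        }
    }
    return out;
}
```

Under this definition the reverse complement of AAC is GTT, which is the kind of relationship a reverse-complement-aware factorization can match against.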

Warning

Ensure sufficient disk space for the output files

Parameters:
  • fasta_path – Path to input FASTA file containing DNA sequences

  • out_dir – Path to output directory where binary factor files will be written

  • sanitization_mode – Mode controlling how the FASTA DNA input is sanitized (see FastaDnaSanitizationMode)

Throws:
  • std::runtime_error – If the FASTA file cannot be opened or contains no valid sequences

  • std::invalid_argument – If invalid nucleotides are found

Returns:

Total number of factors written across all sequences
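
The return value and the two documented exceptions suggest a calling pattern like the sketch below. The wrapper run_per_sequence and its exit codes are hypothetical. The writer is passed in as a callable because the header name and the FastaDnaSanitizationMode enumerators are not shown on this page. Note that std::invalid_argument derives from std::logic_error, not std::runtime_error, so each needs its own handler.

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (not part of noLZSS). `write_factors` is any callable
// taking (fasta_path, out_dir) and returning the total factor count.
// Returns 0 on success, 1 on std::runtime_error, 2 on std::invalid_argument.
template <typename Writer>
int run_per_sequence(Writer&& write_factors,
                     const std::string& fasta_path,
                     const std::string& out_dir) {
    try {
        const std::size_t total = write_factors(fasta_path, out_dir);
        std::cout << "factors written: " << total << '\n';
        return 0;
    } catch (const std::invalid_argument& e) {
        // Documented case: invalid nucleotides found in the input.
        std::cerr << "invalid nucleotide data: " << e.what() << '\n';
        return 2;
    } catch (const std::runtime_error& e) {
        // Documented case: file cannot be opened or has no valid sequences.
        std::cerr << "cannot process FASTA file: " << e.what() << '\n';
        return 1;
    }
}
```

In real code, the callable would be a lambda that forwards to noLZSS::write_factors_binary_file_fasta_dna_w_rc_per_sequence with the chosen sanitization mode.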