Function noLZSS::write_factors_binary_file_fasta_multiple_dna_w_rc

Function Documentation

size_t noLZSS::write_factors_binary_file_fasta_multiple_dna_w_rc(const std::string &fasta_path, const std::string &out_path)

Writes noLZSS factors from multiple DNA sequences in a FASTA file with reverse complement awareness to a binary output file.

This function reads DNA sequences from a FASTA file, parses them into individual sequences, prepares them for factorization using prepare_multiple_dna_sequences_w_rc(), performs factorization with reverse complement awareness, and writes the resulting factors in binary format to an output file with metadata including sequence IDs and sentinel factor indices.

Each factor is written as three uint64_t values.

Note

Binary format: each factor is 24 bytes (3 × uint64_t: start, length, ref)
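As an illustration of the 24-byte record described above, here is a minimal sketch of writing and reading such triples through standard streams. The `FactorRecord` type and both helper functions are hypothetical names, not part of noLZSS. The sketch assumes host byte order and covers only the per-factor records, not the metadata (sequence IDs, sentinel factor indices), whose layout is not specified here.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical record mirroring the documented triple (start, length, ref).
struct FactorRecord {
    std::uint64_t start;
    std::uint64_t length;
    std::uint64_t ref;
};
static_assert(sizeof(FactorRecord) == 24, "three uint64_t values, no padding");

// Write one record as raw bytes in host byte order (an assumption).
void write_record(std::ostream& out, const FactorRecord& f) {
    out.write(reinterpret_cast<const char*>(&f), sizeof f);
}

// Read up to `count` records; the result holds only the complete records read.
std::vector<FactorRecord> read_records(std::istream& in, std::size_t count) {
    std::vector<FactorRecord> records(count);
    in.read(reinterpret_cast<char*>(records.data()),
            static_cast<std::streamsize>(count * sizeof(FactorRecord)));
    records.resize(static_cast<std::size_t>(in.gcount()) / sizeof(FactorRecord));
    return records;
}
```

A reader for real output files would also need to locate and skip the metadata, which this sketch does not attempt.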

Note

Only A, C, T, G nucleotides are allowed (case-insensitive)
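The alphabet restriction above can be checked before calling the function, to avoid a std::invalid_argument on bad input. The helper below is a hypothetical sketch, not the library's own validation. It treats an empty string as valid, which may differ from how noLZSS handles empty sequences.

```cpp
#include <cctype>
#include <string>

// Hypothetical pre-check: true if every character is A, C, G, or T,
// in either case. An empty string yields true.
bool is_valid_dna(const std::string& seq) {
    for (unsigned char c : seq) {
        switch (std::toupper(c)) {
            case 'A': case 'C': case 'G': case 'T':
                break;
            default:
                return false;
        }
    }
    return true;
}
```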

Note

This function overwrites the output file if it exists

Note

Reverse complement matches are supported during factorization
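For readers unfamiliar with the term, the reverse complement of a DNA string is the string read backwards with each base swapped for its pair (A with T, C with G). The sketch below only illustrates that definition. The function names are hypothetical, the input is assumed to be uppercase, and this is not how noLZSS computes or stores reverse complement matches internally.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Map one uppercase nucleotide to its complement.
char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:
            throw std::invalid_argument("expected an uppercase A, C, G, or T");
    }
}

// Reverse the sequence, then complement every base.
std::string reverse_complement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}
```

With this definition, a factor that matches the reverse complement of earlier text corresponds to a match on the opposite DNA strand.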

Warning

Ensure sufficient disk space for the output file (24 bytes per factor, plus metadata)

Parameters:
  • fasta_path – Path to input FASTA file containing DNA sequences

  • out_path – Path to output file where binary factors will be written

Throws:
  • std::runtime_error – If FASTA file cannot be opened or contains no valid sequences

  • std::invalid_argument – If the FASTA file contains more than 125 sequences or any invalid nucleotide

Returns:

Number of factors written to the output file