Function noLZSS::write_factors_binary_file_fasta_multiple_dna_w_rc
Defined in File fasta_processor.cpp
Function Documentation
size_t noLZSS::write_factors_binary_file_fasta_multiple_dna_w_rc(const std::string &fasta_path, const std::string &out_path)
Factorizes multiple DNA sequences from a FASTA file with reverse complement awareness and writes the resulting noLZSS factors to a binary output file.
This function reads DNA sequences from a FASTA file, parses them into individual sequences, prepares them for factorization using prepare_multiple_dna_sequences_w_rc(), and performs factorization with reverse complement awareness. It then writes the resulting factors in binary format to the output file, each factor as three uint64_t values, together with metadata including sequence IDs and sentinel factor indices.
Note
Binary format: each factor is 24 bytes (3 × uint64_t: start, length, ref)
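As an illustration of the 24-byte record described in this note, the following is a minimal sketch of decoding a buffer of factor records. The names FactorRecord and decode_factor_records are hypothetical and not part of noLZSS. The sketch assumes native byte order and a buffer that holds only factor records; where the metadata sits in the file is not specified here, so it is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical record type mirroring the documented (start, length, ref) triple.
struct FactorRecord {
    std::uint64_t start;
    std::uint64_t length;
    std::uint64_t ref;
};

constexpr std::size_t kFieldBytes = sizeof(std::uint64_t);
constexpr std::size_t kFactorBytes = 3 * kFieldBytes;
static_assert(kFactorBytes == 24, "each factor record is expected to be 24 bytes");

// Decode a buffer that holds only factor records, stored back to back.
// Assumptions: native byte order, no header or metadata inside `bytes`.
std::vector<FactorRecord> decode_factor_records(const std::vector<unsigned char>& bytes) {
    if (bytes.size() % kFactorBytes != 0) {
        throw std::runtime_error("buffer size is not a multiple of 24 bytes");
    }
    std::vector<FactorRecord> factors(bytes.size() / kFactorBytes);
    for (std::size_t i = 0; i < factors.size(); ++i) {
        const unsigned char* p = bytes.data() + i * kFactorBytes;
        // memcpy avoids unaligned reads and strict-aliasing problems.
        std::memcpy(&factors[i].start, p, kFieldBytes);
        std::memcpy(&factors[i].length, p + kFieldBytes, kFieldBytes);
        std::memcpy(&factors[i].ref, p + 2 * kFieldBytes, kFieldBytes);
    }
    return factors;
}
```

Copying field by field keeps the sketch independent of any padding the compiler might add to the struct.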
Note
Only A, C, T, G nucleotides are allowed (case insensitive)
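A caller may want to check a sequence against this rule before invoking the function. The following is a minimal sketch of such a check. The helper is_valid_dna is hypothetical and is not the library's own validation code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns true if every character is A, C, G, or T in either case.
// Hypothetical helper; an empty string is treated as valid here.
bool is_valid_dna(const std::string& seq) {
    for (unsigned char c : seq) {
        switch (std::toupper(c)) {
            case 'A':
            case 'C':
            case 'G':
            case 'T':
                break;
            default:
                return false;  // e.g. 'N', gaps, or whitespace
        }
    }
    return true;
}
```

Under this rule, ambiguity codes such as N are rejected, so inputs that contain them would need to be filtered or split beforehand.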
Note
This function overwrites the output file if it exists
Note
Reverse complement matches are supported during factorization
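For readers unfamiliar with the term, the reverse complement of a DNA string is the string read backwards with each base swapped for its pair (A with T, C with G). The sketch below only illustrates that definition. The function names are hypothetical, and it does not show how noLZSS finds or encodes such matches.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Complement of a single base; accepts either case and returns uppercase.
char complement_base(char base) {
    switch (base) {
        case 'A': case 'a': return 'T';
        case 'C': case 'c': return 'G';
        case 'G': case 'g': return 'C';
        case 'T': case 't': return 'A';
        default:
            throw std::invalid_argument("invalid nucleotide");
    }
}

// Reverse the sequence, then complement every base.
std::string reverse_complement(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    for (char& c : out) {
        c = complement_base(c);
    }
    assert(out.size() == seq.size());
    return out;
}
```

For example, the reverse complement of AACG is CGTT, so a later occurrence of CGTT could in principle be matched against an earlier AACG.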
Warning
Ensure sufficient disk space for the output file
- Parameters:
fasta_path – Path to input FASTA file containing DNA sequences
out_path – Path to output file where binary factors will be written
- Throws:
std::runtime_error – If the FASTA file cannot be opened or contains no valid sequences
std::invalid_argument – If the FASTA file contains too many sequences (>125) or invalid nucleotides are found
- Returns:
Number of factors written to the output file
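The following is a minimal sketch of how a caller might handle the two documented exception types. The wrapper try_write_factors is hypothetical. It takes the writer as a callable so that the snippet compiles without the library; with noLZSS linked, the function documented here would be passed in its place.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: `writer` stands in for
// noLZSS::write_factors_binary_file_fasta_multiple_dna_w_rc.
// Returns true and sets `factor_count` on success, false on a documented error.
template <typename Writer>
bool try_write_factors(Writer&& writer,
                       const std::string& fasta_path,
                       const std::string& out_path,
                       std::size_t& factor_count) {
    try {
        factor_count = writer(fasta_path, out_path);
        return true;
    } catch (const std::invalid_argument& e) {
        // Documented for too many sequences (>125) or invalid nucleotides.
        std::cerr << "Rejected input: " << e.what() << '\n';
    } catch (const std::runtime_error& e) {
        // Documented for an unreadable file or no valid sequences.
        std::cerr << "Could not process FASTA file: " << e.what() << '\n';
    }
    return false;
}
```

Both handlers are needed because std::invalid_argument derives from std::logic_error rather than std::runtime_error, so neither catch clause covers the other.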