Function noLZSS::write_factors_dna_w_reference_fasta_files_to_binary

Function Documentation

size_t noLZSS::write_factors_dna_w_reference_fasta_files_to_binary(const std::string &reference_fasta_path, const std::string &target_fasta_path, const std::string &out_path, FastaDnaSanitizationMode sanitization_mode = FastaDnaSanitizationMode::RemoveAmbiguous)

Writes noLZSS factors from DNA sequences in reference and target FASTA files to a binary output file.

This function reads DNA sequences from two FASTA files (reference and target), concatenates them with a sentinel separator, performs general factorization starting from the target sequences, and writes the resulting factors in binary format to an output file.
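The pipeline described above (concatenate, then factorize from target positions only) can be illustrated with a deliberately naive sketch. This is not the library's algorithm. It assumes a single reference and a single target sequence, a hypothetical `'$'` sentinel, greedy longest-match semantics where a copy source must end before the current position, and a literal convention of length 1 with `ref` equal to the factor's own position. The names `Factor` and `factorize_target_naive` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical factor triple, named after the documented fields.
struct Factor {
    std::uint64_t start;   // position in the concatenated text (assumption)
    std::uint64_t length;  // number of characters covered
    std::uint64_t ref;     // position of the earlier occurrence being copied
};

// Naive greedy factorization of `target` against reference + sentinel + target.
// Quadratic or worse in time; intended only to show the idea.
std::vector<Factor> factorize_target_naive(const std::string& reference,
                                           const std::string& target,
                                           char sentinel = '$') {
    const std::string text = reference + sentinel + target;
    const std::size_t target_begin = reference.size() + 1;

    std::vector<Factor> factors;
    std::size_t i = target_begin;  // factors start at target positions only
    while (i < text.size()) {
        std::size_t best_len = 0;
        std::size_t best_ref = i;
        for (std::size_t j = 0; j < i; ++j) {
            std::size_t len = 0;
            // The copied source must end before position i (non-overlapping).
            while (i + len < text.size() && j + len < i &&
                   text[j + len] == text[i + len]) {
                ++len;
            }
            if (len > best_len) {
                best_len = len;
                best_ref = j;
            }
        }
        if (best_len == 0) {
            best_len = 1;  // no earlier occurrence: emit a single-character literal
        }
        factors.push_back({static_cast<std::uint64_t>(i),
                           static_cast<std::uint64_t>(best_len),
                           static_cast<std::uint64_t>(best_ref)});
        i += best_len;
    }
    return factors;
}
```

Because the scan starts at `target_begin`, the reference contributes only as copy source material and never produces factors of its own.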

Note

Binary format: each factor is 24 bytes (3 × uint64_t: start, length, ref)
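As a rough sketch of how the 24-byte records might be consumed, the following decodes a byte buffer into factor triples. It assumes the data consists solely of consecutive records in the machine's native byte order, with no header or footer; the documentation above does not confirm either point. `FactorRecord`, `decode_factors`, and `read_factor_file` are hypothetical names, not part of noLZSS.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record mirroring the documented layout:
// three consecutive uint64_t values (start, length, ref).
struct FactorRecord {
    std::uint64_t start;
    std::uint64_t length;
    std::uint64_t ref;
};
static_assert(sizeof(FactorRecord) == 24, "record must be exactly 24 bytes");

// Decode a raw buffer into records.
// Assumption: the buffer holds only records, in native byte order.
std::vector<FactorRecord> decode_factors(const std::vector<char>& bytes) {
    if (bytes.size() % sizeof(FactorRecord) != 0) {
        throw std::runtime_error("buffer size is not a multiple of 24 bytes");
    }
    std::vector<FactorRecord> factors(bytes.size() / sizeof(FactorRecord));
    if (!factors.empty()) {
        std::memcpy(factors.data(), bytes.data(), bytes.size());
    }
    return factors;
}

// Read an entire file into memory and decode it with decode_factors.
std::vector<FactorRecord> read_factor_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_factors(bytes);
}
```

Under the same assumptions, the file size in bytes would be 24 times the factor count returned by the function.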

Note

Only A, C, T, G nucleotides are allowed (case insensitive)
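The alphabet rule can be checked before calling the function with a small helper like the one below. This is only an illustration of the stated rule (A, C, G, T in either case); it is not the library's sanitizer, and `is_acgt_only` is a hypothetical name.

```cpp
#include <cctype>
#include <string>

// Returns true if every character is A, C, G, or T, ignoring case.
// An empty string trivially satisfies the check.
bool is_acgt_only(const std::string& seq) {
    for (unsigned char c : seq) {
        switch (std::toupper(c)) {
            case 'A':
            case 'C':
            case 'G':
            case 'T':
                break;          // allowed nucleotide
            default:
                return false;   // anything else, e.g. 'N', is rejected
        }
    }
    return true;
}
```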

Note

This function overwrites the output file if it exists

Note

Uses general factorization, which is not reverse-complement aware

Note

Factorization starts from target sequence positions only

Warning

Ensure sufficient disk space for the output file

Parameters:
  • reference_fasta_path – Path to FASTA file containing reference DNA sequences

  • target_fasta_path – Path to FASTA file containing target DNA sequences

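  • out_path – Path to output file where binary factors will be written

  • sanitization_mode – Sanitization mode applied to the FASTA DNA input (default: FastaDnaSanitizationMode::RemoveAmbiguous)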

Throws:
  • std::runtime_error – If either FASTA file cannot be opened or contains no valid sequences

  • std::invalid_argument – If the total number of sequences is too large or invalid nucleotides are found

Returns:

Number of factors written to the output file