Class ParallelFactorizer

Class Documentation

class ParallelFactorizer

Parallel implementation of noLZSS factorization.

This class implements parallel factorization for noLZSS by dividing the input text among multiple threads and using convergence detection to efficiently merge results.

Public Functions

ParallelFactorizer() = default

Constructor.

~ParallelFactorizer() = default

Destructor.

size_t parallel_factorize(std::string_view text, const std::string &output_path, size_t num_threads = 0, size_t start_pos = 0)

Main parallel factorization function.

Factorizes the input text using multiple threads and writes results to output file

Performs non-overlapping LZSS factorization using multiple threads with convergence detection. The algorithm:

  1. Auto-detects optimal thread count based on input size (if num_threads=0)

  2. Builds a single compressed suffix tree (CST) shared by all threads

  3. Divides input text into chunks, one per thread

  4. Each thread factorizes its chunk independently, with thread 0 writing directly to the final output file for efficiency

  5. Threads detect convergence by monitoring when their factors align with the next thread’s factors, allowing early termination

  6. Merges results from thread 1+ into the output file (thread 0 already there)

  7. Appends binary footer with metadata

OPTIMIZATION: Thread 0 writes directly to output_path, avoiding a copy operation. For single-threaded execution, this means no temp files and no merge step.

Parameters:
  • text – Input text to factorize

  • output_path – Path to output binary factor file

  • num_threads – Number of threads to use (0 for auto-detection)

  • start_pos – Starting position in the text for factorization (default: 0)

  • text – Input text to factorize (string_view for zero-copy efficiency)

  • output_path – Path where the binary factor file will be written

  • num_threads – Number of threads to use (0 = auto-detect based on input size)

  • start_pos – Starting position in the text for factorization (default: 0)

Returns:

Number of factors produced

Throws:

std::runtime_error – If file I/O operations fail

Returns:

size_t Total number of factors produced

size_t parallel_factorize_file(const std::string &input_path, const std::string &output_path, size_t num_threads = 0, size_t start_pos = 0)

File-based parallel factorization.

Performs parallel factorization on a file.

Reads input from a file, factorizes using multiple threads, writes results to output file

Convenience wrapper that reads an entire file into memory and then factorizes it using the in-memory parallel_factorize() function.

NOTE: For very large files, this may consume significant memory. Consider using streaming approaches for files larger than available RAM.

Parameters:
  • input_path – Path to input text file

  • output_path – Path to output binary factor file

  • num_threads – Number of threads to use (0 for auto-detection)

  • start_pos – Starting position in the text for factorization (default: 0)

  • input_path – Path to the input text file

  • output_path – Path where the binary factor file will be written

  • num_threads – Number of threads to use (0 = auto-detect)

  • start_pos – Starting position in the text for factorization (default: 0)

Returns:

Number of factors produced

Throws:

std::runtime_error – If the input file cannot be opened

Returns:

size_t Total number of factors produced

size_t parallel_factorize_dna_w_rc(std::string_view text, const std::string &output_path, size_t num_threads = 0, size_t start_pos = 0)

DNA-specific parallel factorization with reverse complement support.

Wrapper function that prepares DNA text and calls the core parallel function.

Takes raw DNA text, prepares it with reverse complement, and calls the core parallel_factorize_multiple_dna_w_rc() function.

Takes raw DNA text, prepares it with reverse complement using prepare_multiple_dna_sequences_w_rc(), and then calls the core parallel_factorize_multiple_dna_w_rc() function.

Similar to the pattern of nolzss_dna_w_rc() calling nolzss_multiple_dna_w_rc().

Parameters:
  • text – Input DNA text (raw nucleotides)

  • output_path – Path to output binary factor file

  • num_threads – Number of threads to use (0 for auto-detection)

  • start_pos – Starting position in the text for factorization (default: 0)

  • text – Input DNA text to factorize (should contain only A, C, G, T)

  • output_path – Path where the binary factor file will be written

  • num_threads – Number of threads to use (0 = auto-detect)

  • start_pos – Starting position in the text for factorization (default: 0)

Returns:

Number of factors produced

Returns:

size_t Total number of factors produced

size_t parallel_factorize_multiple_dna_w_rc(const std::string &prepared_string, size_t original_length, const std::string &output_path, size_t num_threads = 0, size_t start_pos = 0)

Core parallel DNA factorization with reverse complement for prepared strings.

Performs parallel DNA factorization on already-prepared strings that include reverse complement. This is the core function that parallel_factorize_dna_w_rc() calls after preparing the input, and can be used directly by FASTA processors that have already prepared their strings.

The input string S should be in the format: T + sentinel + RC(T) + sentinel where T is the original sequence(s) concatenated with sentinels.

Performs non-overlapping LZSS factorization on already-prepared DNA sequences with reverse complement awareness using multiple threads. The algorithm:

  1. Takes prepared string: T + sentinel + RC(T) + sentinel (already prepared by caller)

  2. Auto-detects optimal thread count based on input size (if num_threads=0)

  3. Builds a single compressed suffix tree (CST) over the prepared string

  4. Divides the original text into chunks (not including RC portion)

  5. Each thread factorizes its chunk independently using both forward and RC matches

  6. Merges results with convergence detection

NOTE: This is the core function used by both parallel_factorize_dna_w_rc() and FASTA processors. It expects an already-prepared string with reverse complement.

Parameters:
  • prepared_string – The prepared string with reverse complement

  • original_length – Length of the original sequence (before RC, excluding final sentinel)

  • output_path – Path to output binary factor file

  • num_threads – Number of threads to use (0 for auto-detection)

  • start_pos – Starting position in the original sequence for factorization (default: 0)

  • prepared_string – Input prepared string (T + sentinel + RC(T) + sentinel)

  • original_length – Length of the original sequence (before RC, excluding final sentinel)

  • output_path – Path where the binary factor file will be written

  • num_threads – Number of threads to use (0 = auto-detect)

  • start_pos – Starting position in the original sequence for factorization (default: 0)

Returns:

Number of factors produced

Returns:

size_t Total number of factors produced

size_t parallel_factorize_file_dna_w_rc(const std::string &input_path, const std::string &output_path, size_t num_threads = 0)

File-based DNA-specific parallel factorization with reverse complement support.

Performs parallel DNA factorization with reverse complement on a file.

File-based wrapper for parallel_factorize_dna_w_rc(). Reads the entire file into memory and processes it.

Parameters:
  • input_path – Path to input DNA text file

  • output_path – Path to output binary factor file

  • num_threads – Number of threads to use (0 for auto-detection)

  • input_path – Path to the input DNA file

  • output_path – Path where the binary factor file will be written

  • num_threads – Number of threads to use (0 = auto-detect)

Returns:

Number of factors produced

Throws:

std::runtime_error – If the input file cannot be opened

Returns:

size_t Total number of factors produced

std::string create_temp_file_path(size_t thread_id)

Create temporary file path (public for use by fasta processor)

Creates a unique temporary file path for a worker thread.

Generates a unique temporary file path in the system’s temp directory using a timestamp and thread ID to ensure uniqueness across concurrent operations. Thread 0 doesn’t use this (writes directly to output), but threads 1+ use it.

Parameters:
  • thread_id – Thread identifier

  • thread_id – The ID of the thread requesting a temp file path

Returns:

Unique temporary file path

Returns:

std::string Absolute path to a unique temporary file

size_t merge_temp_files(const std::string &output_path, std::vector<ThreadContext> &contexts)

Merge temporary files into final output (public for use by fasta processor)

Merges temporary factor files and appends footer to create final output.

This function handles the final stage of parallel factorization by:

  1. Reading thread 0’s factors (already in output_path) to count them and find the end position

  2. If multiple threads and sequence not complete: appending factors from threads 1+ after finding the convergence point where their factorization aligns with prior threads

  3. Writing the binary footer with metadata (factor count, sequences, sentinels)

OPTIMIZATION: Thread 0 wrote directly to output_path during factorization, so we only need to append from other threads. For single-threaded execution, we just count factors and add the footer.

Convergence-based merging: For each subsequent thread, we skip factors until we find one that starts exactly where the previous thread’s last factor ended. From that point, we copy all remaining factors (they represent new coverage of the text).

Parameters:
  • output_path – Path to final output file

  • contexts – Thread contexts containing temporary file info

  • output_path – Path to the output file (already contains thread 0’s factors)

  • contexts – Vector of all thread contexts (for accessing temp file paths)

Returns:

Number of factors in final output

Throws:

std::runtime_error – If file I/O operations fail

Returns:

size_t Total number of factors in the final output

void cleanup_temp_files(const std::vector<ThreadContext> &contexts)

Clean up temporary files (public for use by fasta processor)

Removes temporary files created by worker threads.

Deletes temporary factor files created by threads 1 and higher. Thread 0’s “file” is actually the final output file, so it’s skipped during cleanup.

Errors during cleanup are logged as warnings but don’t cause failure, since the factorization has already completed successfully.

Parameters:
  • contexts – Thread contexts containing temporary file paths

  • contexts – Vector of all thread contexts (containing temp file paths)

void factorize_dna_w_rc_thread(const cst_t &cst, const sdsl::rmq_succinct_sct<> &rmqF, const sdsl::rmq_succinct_sct<> &rmqRcEnd, const sdsl::int_vector<64> &fwd_starts, const sdsl::int_vector<64> &rc_ends, uint64_t INF, size_t N, ThreadContext &ctx, std::vector<ThreadContext> &all_contexts, std::vector<std::mutex> &file_mutexes)

Thread worker function for DNA factorization with reverse complement (public for use by fasta processor)

Worker thread function for DNA factorization with reverse complement.

Worker thread that performs DNA-specific factorization with reverse complement awareness. Similar to factorize_thread() but uses the DNA w/ RC algorithm.

This function implements the DNA-specific factorization algorithm with reverse complement support for parallel execution. It’s based on nolzss_multiple_dna_w_rc but adapted to work on a chunk of the original sequence.

The algorithm:

  1. Initializes at the thread’s start position

  2. For each position, walks up the CST to find the best match (forward or RC)

  3. Checks both forward matches (using rmqF) and RC matches (using rmqRcEnd)

  4. Selects the longest non-overlapping match

  5. Writes factors and checks for convergence with the next thread

Parameters:
  • cst – The compressed suffix tree (shared by all threads)

  • rmqF – RMQ for forward starts

  • rmqRcEnd – RMQ for reverse complement ends

  • fwd_starts – Forward start positions vector

  • rc_ends – Reverse complement end positions vector

  • INF – Infinity value for invalid positions

  • N – Original sequence length

  • ctx – Thread context

  • all_contexts – All thread contexts (for convergence checking)

  • file_mutexes – Mutexes for protecting file access

  • cst – Compressed suffix tree built from T + sentinel + RC(T) + sentinel

  • rmqF – RMQ structure for finding minimum forward starts

  • rmqRcEnd – RMQ structure for finding minimum RC ends

  • fwd_starts – Vector of forward start positions (INF for RC positions)

  • rc_ends – Vector of RC end positions (INF for forward positions)

  • INF – Infinity value used for invalid positions

  • N – Length of the original sequence (not including RC or sentinels)

  • ctx – This thread’s context

  • all_contexts – Vector of all thread contexts (for convergence checking)

  • file_mutexes – Mutexes for file access

Public Static Attributes

static constexpr size_t MIN_CHARS_PER_THREAD = 100000

Minimum number of characters per thread to make parallelization worthwhile.

If the input text is smaller than this value multiplied by the requested thread count, fewer threads will be used. This prevents excessive overhead from thread management when the per-thread workload would be too small.

Can be adjusted based on hardware characteristics and typical input sizes.