Factor Length Significance Analysis

Introduction

Factor length significance analysis is a statistical method to distinguish meaningful (signal) factors from noise in genomic LZSS factorizations. By comparing factorization results from a real genome with results from a shuffled version of the same genome, we can determine the minimal factor length threshold at which factors are confidently considered non-random.

Why is this important?

In genomic analysis, not all LZSS factors represent biologically meaningful repetitive structures. Short factors may arise simply by chance, while longer factors are more likely to represent true biological repeats, conserved regions, or other significant genomic features. This analysis provides a principled, statistically-grounded way to separate signal from noise.

Key Concepts

  • Real Genome: The original, unmodified genome sequence

  • Shuffled Genome: A randomized version that preserves nucleotide composition but destroys biological structure

  • Factor Length: The length of each non-overlapping LZSS factor

  • Significance Threshold (L*): The minimum factor length at which factors are confidently non-random

  • Rarity Score: Per-factor probability of observing that length in the shuffled genome

Statistical Methodology

Clopper-Pearson Confidence Bounds

The analysis uses Clopper-Pearson confidence intervals to establish conservative upper bounds on the probability of observing factors of a given length in shuffled data. This is a robust method for binomial proportions that provides exact coverage probabilities.

For k factors of length ≥ L out of n total factors in the shuffled genome:

  • We compute the empirical tail probability: S₀(L) = k/n

  • We compute a conservative upper bound: S₀ᵁ(L) using Clopper-Pearson

The Clopper-Pearson upper bound is the (1-α) quantile of the Beta(k+1, n-k) distribution.

Threshold Selection

Given N factors in the real genome and a desired false positive threshold τ:

  1. For each unique length L in the shuffled data, compute expected false positives: E[FP] = N × S₀ᵁ(L)

  2. Select L* as the smallest length where E[FP] ≤ τ

This ensures that we expect at most τ false positives among factors with length ≥ L*.

Per-Factor Rarity Scores

Each factor in the real genome receives a rarity score s_i = S₀(L_i), representing the tail probability of observing a factor of that length or greater in the shuffled genome. Lower scores indicate rarer (more significant) factors.

Genome-Wide Exceedance Probability

Using a Poisson approximation, we estimate the probability of observing at least one factor of length ≥ L:

P(any factor ≥ L) ≈ 1 - exp(-N × S₀(L))

API Reference

Core Statistical Functions

clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float

Compute one-sided (1-α) Clopper-Pearson upper confidence bound for Binomial(n, p).

Parameters:

  • k (int): Number of successes (0 ≤ k ≤ n)

  • n (int): Number of trials (must be positive)

  • alpha (float): Confidence level (default: 0.05 for 95% confidence)

Returns:

  • float: Upper confidence bound for the binomial probability

Raises:

  • ValueError: If inputs are invalid

Example:

from noLZSS.genomics import clopper_pearson_upper

# 95% upper confidence bound for 5 successes out of 100 trials
upper = clopper_pearson_upper(5, 100, 0.05)
print(f"Upper bound: {upper:.4f}")  # ~0.11

extract_factor_lengths(factors) -> np.ndarray

Extract factor lengths from either a list of factor tuples or a binary factor file.

Parameters:

  • factors (Union[List[Tuple], str, Path]): Either a list of (pos, length, ref, …) tuples or path to binary factor file

Returns:

  • np.ndarray: Array of factor lengths (int64)

Raises:

  • ValueError: If input type is invalid or list contains invalid tuples

  • NoLZSSError: If binary file cannot be read

Example:

from noLZSS.genomics import extract_factor_lengths

# From tuple list
factors = [(0, 5, 0), (5, 3, 2), (8, 10, 1)]
lengths = extract_factor_lengths(factors)
print(lengths)  # [5, 3, 10]

# From binary file
lengths = extract_factor_lengths("genome_factors.bin")
print(f"Extracted {len(lengths)} factor lengths")

infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict

Core statistical inference using ONE shuffled genome.

Parameters:

  • real_lengths (array-like): Factor lengths from real genome

  • shuf_lengths (array-like): Factor lengths from shuffled genome

  • tau_expected_fp (float): Maximum expected false positives (default: 1.0)

  • alpha_cp (float): Confidence level for Clopper-Pearson (default: 0.05)

Returns: Dictionary with keys:

  • N_real (int): Number of real factors

  • N_shuf (int): Number of shuffled factors

  • L_star (int or None): Threshold length (None if no L meets criterion)

  • tau_expected_fp (float): Input parameter

  • alpha_cp (float): Input parameter

  • rarity_scores_real (np.ndarray): Per-factor tail probabilities

  • p_any_ge (Callable): Function(L) -> genome-wide exceedance probability

  • uniq_L (np.ndarray): Unique length values

  • S0 (np.ndarray): Empirical tail probabilities

  • S0_upper (np.ndarray): Conservative upper bounds

  • expected_fp_upper (np.ndarray): Expected FP counts

Example:

from noLZSS.genomics import infer_length_significance
import numpy as np

real_lens = np.array([50, 60, 70, 80, 90])
shuf_lens = np.array([5, 10, 15, 20, 25, 30])

result = infer_length_significance(real_lens, shuf_lens, tau_expected_fp=1.0)
print(f"Threshold length: {result['L_star']}")
print(f"Number of real factors: {result['N_real']}")

# Per-factor rarity scores
for i, score in enumerate(result['rarity_scores_real']):
    print(f"Factor {i} (len={real_lens[i]}): rarity={score:.4f}")

# Genome-wide exceedance probability
p_any_ge = result['p_any_ge']
print(f"P(any factor >= 50): {p_any_ge(50):.4f}")

High-Level API

calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict

Main user-facing function for significance analysis.

Parameters:

  • real_factors_file (str or Path): Path to binary factor file from real genome

  • shuffled_factors_file (str or Path): Path to binary factor file from shuffled genome

  • tau_expected_fp (float): Maximum expected false positives (default: 1.0)

  • alpha_cp (float): Confidence level (default: 0.05)

  • plot_output (str, Path, or None): Optional path to save visualization

Returns:

  • dict: Result dictionary from infer_length_significance()

Raises:

  • FileNotFoundError: If either input file doesn’t exist

  • NoLZSSError: If files have invalid format

Example:

from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    plot_output="significance_analysis.png"
)

if result['L_star'] is not None:
    print(f"Minimum significant factor length: {result['L_star']}")
    
    # Count significant factors
    significant = result['rarity_scores_real'] < 0.05
    print(f"Number of significant factors (p < 0.05): {significant.sum()}")
else:
    print("No threshold meets the false positive criterion")

Visualization

plot_significance_analysis(result, save_path=None, show_plot=True) -> None

Create visualization of significance analysis results.

Parameters:

  • result (dict): Output dictionary from infer_length_significance()

  • save_path (str, Path, or None): Optional path to save plot

  • show_plot (bool): Whether to display plot interactively (default: True)

Raises:

  • ValueError: If result dictionary is missing required keys

Warnings:

  • UserWarning: If matplotlib is not available

Example:

from noLZSS.genomics import infer_length_significance, plot_significance_analysis

result = infer_length_significance(real_lens, shuf_lens)

# Display plot
plot_significance_analysis(result)

# Save without displaying
plot_significance_analysis(result, save_path="analysis.png", show_plot=False)

Workflow Guide

Step-by-Step Analysis

1. Prepare Your Data

Start with a FASTA file containing your genome sequence(s):

import noLZSS

# Factorize the real genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome.fasta",
    "genome_real_factors.bin"
)

2. Generate Shuffled Genome

Create a shuffled version that preserves nucleotide composition but destroys biological structure. You can use external tools or write your own:

from pathlib import Path
import random

def shuffle_fasta(input_path, output_path):
    """Simple shuffler - preserves per-sequence composition."""
    from noLZSS.genomics import read_fasta_auto
    
    sequences = read_fasta_auto(input_path)
    
    with open(output_path, 'w') as f:
        for seq_id, seq_data in sequences:
            # Convert to list and shuffle
            seq_list = list(seq_data)
            random.shuffle(seq_list)
            shuffled = ''.join(seq_list)
            
            # Write FASTA
            f.write(f">{seq_id}_shuffled\n")
            # Write in 80-char lines
            for i in range(0, len(shuffled), 80):
                f.write(shuffled[i:i+80] + '\n')

shuffle_fasta("genome.fasta", "genome_shuffled.fasta")

3. Factorize Shuffled Genome

# Factorize the shuffled genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome_shuffled.fasta",
    "genome_shuffled_factors.bin"
)

4. Calculate Significance Threshold

from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    alpha_cp=0.05,
    plot_output="significance_analysis.png"
)

print(f"Analysis Results:")
print(f"  Real genome factors: {result['N_real']}")
print(f"  Shuffled genome factors: {result['N_shuf']}")
print(f"  Minimum significant length (L*): {result['L_star']}")

5. Filter Significant Factors

from noLZSS.utils import read_factors_binary_file

# Read real factors
factors = read_factors_binary_file("genome_real_factors.bin")

# Filter by threshold
if result['L_star'] is not None:
    significant_factors = [f for f in factors if f[1] >= result['L_star']]
    print(f"Significant factors: {len(significant_factors)}/{len(factors)}")
    
    # Or use rarity scores for finer control
    rarity_threshold = 0.05  # p < 0.05
    highly_significant = [
        f for f, score in zip(factors, result['rarity_scores_real'])
        if score < rarity_threshold
    ]
    print(f"Highly significant (p < 0.05): {len(highly_significant)}")

Parameter Guide

Choosing tau_expected_fp

The tau_expected_fp parameter controls the stringency of the threshold:

  • τ = 0.1: Very stringent, expect ≤0.1 false positives

  • τ = 1.0: Standard choice, expect ≤1 false positive (default)

  • τ = 5.0: Relaxed, expect ≤5 false positives

Recommendations:

  • Use τ = 0.1 to 1.0 for genomes with thousands of factors

  • Use τ = 5.0 to 10.0 for smaller analyses with hundreds of factors

  • Consider your tolerance for false discoveries in downstream analysis

Choosing alpha_cp

The alpha_cp parameter controls the confidence level for the Clopper-Pearson bounds:

  • α = 0.05: 95% confidence (default, standard)

  • α = 0.01: 99% confidence (more conservative)

  • α = 0.10: 90% confidence (less conservative)

Recommendations:

  • α = 0.05 is appropriate for most analyses

  • Use smaller α (more conservative) when false positives are very costly

  • Larger α provides less conservative bounds but may increase false positives

Interpretation

Understanding the Results

L* (Threshold Length)

  • L* = 150: Factors with length ≥150 are confidently non-random

  • L* = None: No length meets the criterion; factors are similar to shuffled

  • Lower L* indicates more distinctive real genome structure

  • Higher L* suggests less distinctive long-range structure

Rarity Scores

  • Score = 0.001: Very rare, only 0.1% of shuffled factors this long

  • Score = 0.05: Rare, only 5% of shuffled factors this long

  • Score = 0.50: Common, 50% of shuffled factors this long or longer

  • Score = 1.0: Length shorter than any shuffled factor

Expected False Positives

The plot shows E[FP] = N_real × S₀ᵁ(L):

  • Where the curve crosses τ is approximately L*

  • Steep drop indicates clear separation between real and shuffled

  • Gradual drop suggests overlapping length distributions

Biological Interpretation

High L* (e.g., L* > 200)

  • Strong biological signal in long factors

  • Clear distinction between structured and random sequences

  • Likely represents true biological repeats, gene families, etc.

Moderate L* (e.g., 50 < L* < 200)

  • Moderate biological structure

  • Mix of structured and random-like regions

  • May represent moderate repeat content

Low L* or None

  • Weak distinction from shuffled genome

  • May indicate:

    • Low repeat content

    • High sequence diversity

    • Short, less distinctive repeats

    • Need for different null model

Visualization Guide

Understanding the Plots

The plot_significance_analysis() function creates two panels:

Top Panel: Tail Probabilities

  • Blue line (S₀(L)): Empirical probability from shuffled data

  • Red dashed (S₀ᵁ(L)): Conservative upper bound

  • Green dotted (L*): Significance threshold

  • Y-axis is log scale to show full range

Interpretation:

  • Steep drop = clear length separation

  • Flat region = many factors in that length range

  • Gap between S₀ and S₀ᵁ = statistical uncertainty

Bottom Panel: Expected False Positives

  • Purple line: Expected FP = N_real × S₀ᵁ(L)

  • Orange dashed: τ threshold

  • Green dotted: L* where curves intersect

Interpretation:

  • Where purple crosses orange = L*

  • Curve below τ = acceptable FP region

  • Steeper drop = easier to find low L*

Examples

Example 1: Basic Analysis

import noLZSS
from noLZSS.genomics import calculate_factor_length_threshold

# Factorize real and shuffled genomes
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli.fasta",
    "ecoli_real.bin"
)

noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli_shuffled.fasta",
    "ecoli_shuffled.bin"
)

# Calculate threshold
result = calculate_factor_length_threshold(
    "ecoli_real.bin",
    "ecoli_shuffled.bin",
    tau_expected_fp=1.0,
    plot_output="ecoli_significance.png"
)

print(f"E. coli genome analysis:")
print(f"  Threshold length: {result['L_star']}")
print(f"  Real factors: {result['N_real']}")
print(f"  Shuffled factors: {result['N_shuf']}")

Example 2: Filtering Significant Factors

from noLZSS.utils import read_factors_binary_file
import numpy as np

# Read factors and perform analysis
factors = read_factors_binary_file("genome_real.bin")
result = calculate_factor_length_threshold(
    "genome_real.bin",
    "genome_shuffled.bin"
)

# Multiple filtering strategies
if result['L_star'] is not None:
    # Strategy 1: Length threshold
    length_filtered = [f for f in factors if f[1] >= result['L_star']]
    print(f"Length threshold: {len(length_filtered)} factors")
    
    # Strategy 2: Rarity score (more granular)
    rarity_scores = result['rarity_scores_real']
    rarity_filtered = [
        f for f, score in zip(factors, rarity_scores)
        if score < 0.05  # p < 0.05
    ]
    print(f"Rarity filtered: {len(rarity_filtered)} factors")
    
    # Strategy 3: Top N most significant
    top_n = 100
    top_indices = np.argsort(rarity_scores)[:top_n]
    top_factors = [factors[i] for i in top_indices]
    print(f"Top {top_n} factors by rarity")

Example 3: Multiple Genomes Comparison

from pathlib import Path
import pandas as pd

genomes = ["ecoli", "human_chr1", "yeast"]
results = []

for genome in genomes:
    real_file = f"{genome}_real.bin"
    shuf_file = f"{genome}_shuffled.bin"
    
    if Path(real_file).exists() and Path(shuf_file).exists():
        result = calculate_factor_length_threshold(
            real_file,
            shuf_file,
            tau_expected_fp=1.0
        )
        
        results.append({
            'genome': genome,
            'N_real': result['N_real'],
            'N_shuf': result['N_shuf'],
            'L_star': result['L_star'],
            'pct_significant': (
                100 * (result['rarity_scores_real'] < 0.05).sum() / result['N_real']
            )
        })

# Create summary table
df = pd.DataFrame(results)
print(df.to_string(index=False))

Example 4: Custom Analysis with Direct Functions

from noLZSS.genomics import (
    extract_factor_lengths,
    infer_length_significance,
    plot_significance_analysis
)

# Extract lengths
real_lengths = extract_factor_lengths("genome_real.bin")
shuf_lengths = extract_factor_lengths("genome_shuffled.bin")

# Run analysis with custom parameters
result = infer_length_significance(
    real_lengths,
    shuf_lengths,
    tau_expected_fp=0.5,  # Stringent
    alpha_cp=0.01  # Very conservative
)

# Generate custom visualization
plot_significance_analysis(
    result,
    save_path="custom_analysis.png",
    show_plot=True
)

# Access detailed statistics
print(f"Unique lengths in shuffled: {len(result['uniq_L'])}")
print(f"Min/Max shuffled length: {result['uniq_L'].min()}/{result['uniq_L'].max()}")

# Compute custom statistics
import numpy as np
median_rarity = np.median(result['rarity_scores_real'])
print(f"Median rarity score: {median_rarity:.4f}")

Advanced Topics

Limitations and Considerations

  1. Single Shuffled Genome: This is a “Tier-0” analysis using one shuffled replicate. For more robust results, consider generating multiple shuffled genomes and averaging.

  2. Shuffling Strategy: The shuffling method matters. Per-sequence shuffling preserves local composition. Whole-genome shuffling may be more appropriate for some applications.

  3. Length-Only Analysis: This analysis considers only factor length, not position or sequence content. Factors at the same length are treated identically.

  4. Conservative Bounds: Clopper-Pearson provides conservative (wide) confidence intervals. This is by design to avoid false significance claims.

Performance Notes

  • Memory usage is O(N_factors) for storing lengths

  • Computation time is dominated by reading binary files

  • Plot generation requires matplotlib (optional dependency)

  • Exact Clopper-Pearson requires scipy (falls back to Wilson score if unavailable)

Integration with Other Tools

This analysis integrates naturally with:

  • Factor position analysis (see noLZSS.genomics.plots)

  • Strand bias analysis (see strand-bias-heatmap)

  • Per-sequence factorization workflows

  • Downstream filtering pipelines

References

  1. Clopper, C.J. and Pearson, E.S. (1934). “The use of confidence or fiducial limits illustrated in the case of the binomial.” Biometrika, 26(4), 404-413.

  2. Wilson, E.B. (1927). “Probable inference, the law of succession, and statistical inference.” Journal of the American Statistical Association, 22(158), 209-212.

  3. Köppl, D. (2021). “Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees.” Algorithms, 14(2), 44.