Factor Length Significance Analysis

Introduction

Factor length significance analysis is a statistical method to distinguish meaningful (signal) factors from noise in genomic LZSS factorizations. By comparing factorization results from a real genome with results from a shuffled version of the same genome, we can determine the minimal factor length threshold at which factors are confidently considered non-random.

Why is this important?

In genomic analysis, not all LZSS factors represent biologically meaningful repetitive structures. Short factors may arise simply by chance, while longer factors are more likely to represent true biological repeats, conserved regions, or other significant genomic features. This analysis provides a principled, statistically-grounded way to separate signal from noise.

Key Concepts

Real Genome: The original, unmodified genome sequence
Shuffled Genome: A randomized version that preserves nucleotide composition but destroys biological structure
Factor Length: The length of each non-overlapping LZSS factor
Significance Threshold (L*): The minimum factor length at which factors are confidently non-random
Rarity Score: Per-factor probability of observing that length in the shuffled genome

Statistical Methodology

Clopper-Pearson Confidence Bounds

The analysis uses Clopper-Pearson confidence intervals to establish conservative upper bounds on the probability of observing factors of a given length in shuffled data. This is a robust method for binomial proportions that provides exact coverage probabilities.

For k factors of length ≥ L out of n total factors in the shuffled genome:

We compute the empirical tail probability: S₀(L) = k/n
We compute a conservative upper bound: S₀ᵁ(L) using Clopper-Pearson

The Clopper-Pearson upper bound is the (1-α) quantile of the Beta(k+1, n-k) distribution.

Threshold Selection

Given N factors in the real genome and a desired false positive threshold τ:

For each unique length L in the shuffled data, compute expected false positives: E[FP] = N × S₀ᵁ(L)
Select L* as the smallest length where E[FP] ≤ τ

This ensures that we expect at most τ false positives among factors with length ≥ L*.

Per-Factor Rarity Scores

Each factor in the real genome receives a rarity score s_i = S₀(L_i), representing the tail probability of observing a factor of that length or greater in the shuffled genome. Lower scores indicate rarer (more significant) factors.

Genome-Wide Exceedance Probability

Using a Poisson approximation, we estimate the probability of observing at least one factor of length ≥ L:

P(any factor ≥ L) ≈ 1 - exp(-N × S₀(L))

API Reference

Core Statistical Functions

`clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float`

Compute one-sided (1-α) Clopper-Pearson upper confidence bound for Binomial(n, p).

Parameters:

k (int): Number of successes (0 ≤ k ≤ n)
n (int): Number of trials (must be positive)
alpha (float): Confidence level (default: 0.05 for 95% confidence)

Returns:

float: Upper confidence bound for the binomial probability

Raises:

ValueError: If inputs are invalid

Example:

from noLZSS.genomics import clopper_pearson_upper

# 95% upper confidence bound for 5 successes out of 100 trials
upper = clopper_pearson_upper(5, 100, 0.05)
print(f"Upper bound: {upper:.4f}")  # ~0.11

`extract_factor_lengths(factors) -> np.ndarray`

Extract factor lengths from either a list of factor tuples or a binary factor file.

Parameters:

factors (Union[List[Tuple], str, Path]): Either a list of (pos, length, ref, …) tuples or path to binary factor file

Returns:

np.ndarray: Array of factor lengths (int64)

Raises:

ValueError: If input type is invalid or list contains invalid tuples
NoLZSSError: If binary file cannot be read

Example:

from noLZSS.genomics import extract_factor_lengths

# From tuple list
factors = [(0, 5, 0), (5, 3, 2), (8, 10, 1)]
lengths = extract_factor_lengths(factors)
print(lengths)  # [5, 3, 10]

# From binary file
lengths = extract_factor_lengths("genome_factors.bin")
print(f"Extracted {len(lengths)} factor lengths")

`infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict`

Core statistical inference using ONE shuffled genome.

Parameters:

real_lengths (array-like): Factor lengths from real genome
shuf_lengths (array-like): Factor lengths from shuffled genome
tau_expected_fp (float): Maximum expected false positives (default: 1.0)
alpha_cp (float): Confidence level for Clopper-Pearson (default: 0.05)

Returns: Dictionary with keys:

N_real (int): Number of real factors
N_shuf (int): Number of shuffled factors
L_star (int or None): Threshold length (None if no L meets criterion)
tau_expected_fp (float): Input parameter
alpha_cp (float): Input parameter
rarity_scores_real (np.ndarray): Per-factor tail probabilities
p_any_ge (Callable): Function(L) -> genome-wide exceedance probability
uniq_L (np.ndarray): Unique length values
S0 (np.ndarray): Empirical tail probabilities
S0_upper (np.ndarray): Conservative upper bounds
expected_fp_upper (np.ndarray): Expected FP counts

Example:

from noLZSS.genomics import infer_length_significance
import numpy as np

real_lens = np.array([50, 60, 70, 80, 90])
shuf_lens = np.array([5, 10, 15, 20, 25, 30])

result = infer_length_significance(real_lens, shuf_lens, tau_expected_fp=1.0)
print(f"Threshold length: {result['L_star']}")
print(f"Number of real factors: {result['N_real']}")

# Per-factor rarity scores
for i, score in enumerate(result['rarity_scores_real']):
    print(f"Factor {i} (len={real_lens[i]}): rarity={score:.4f}")

# Genome-wide exceedance probability
p_any_ge = result['p_any_ge']
print(f"P(any factor >= 50): {p_any_ge(50):.4f}")

High-Level API

`calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict`

Main user-facing function for significance analysis.

Parameters:

real_factors_file (str or Path): Path to binary factor file from real genome
shuffled_factors_file (str or Path): Path to binary factor file from shuffled genome
tau_expected_fp (float): Maximum expected false positives (default: 1.0)
alpha_cp (float): Confidence level (default: 0.05)
plot_output (str, Path, or None): Optional path to save visualization

Returns:

dict: Result dictionary from infer_length_significance()

Raises:

FileNotFoundError: If either input file doesn’t exist
NoLZSSError: If files have invalid format

Example:

from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    plot_output="significance_analysis.png"
)

if result['L_star'] is not None:
    print(f"Minimum significant factor length: {result['L_star']}")
    
    # Count significant factors
    significant = result['rarity_scores_real'] < 0.05
    print(f"Number of significant factors (p < 0.05): {significant.sum()}")
else:
    print("No threshold meets the false positive criterion")

Visualization

`plot_significance_analysis(result, save_path=None, show_plot=True) -> None`

Create visualization of significance analysis results.

Parameters:

result (dict): Output dictionary from infer_length_significance()
save_path (str, Path, or None): Optional path to save plot
show_plot (bool): Whether to display plot interactively (default: True)

Raises:

ValueError: If result dictionary is missing required keys

Warnings:

UserWarning: If matplotlib is not available

Example:

from noLZSS.genomics import infer_length_significance, plot_significance_analysis

result = infer_length_significance(real_lens, shuf_lens)

# Display plot
plot_significance_analysis(result)

# Save without displaying
plot_significance_analysis(result, save_path="analysis.png", show_plot=False)

Workflow Guide

Step-by-Step Analysis

1. Prepare Your Data

Start with a FASTA file containing your genome sequence(s):

import noLZSS

# Factorize the real genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome.fasta",
    "genome_real_factors.bin"
)

2. Generate Shuffled Genome

Create a shuffled version that preserves nucleotide composition but destroys biological structure. You can use external tools or write your own:

from pathlib import Path
import random

def shuffle_fasta(input_path, output_path):
    """Simple shuffler - preserves per-sequence composition."""
    from noLZSS.genomics import read_fasta_auto
    
    sequences = read_fasta_auto(input_path)
    
    with open(output_path, 'w') as f:
        for seq_id, seq_data in sequences:
            # Convert to list and shuffle
            seq_list = list(seq_data)
            random.shuffle(seq_list)
            shuffled = ''.join(seq_list)
            
            # Write FASTA
            f.write(f">{seq_id}_shuffled\n")
            # Write in 80-char lines
            for i in range(0, len(shuffled), 80):
                f.write(shuffled[i:i+80] + '\n')

shuffle_fasta("genome.fasta", "genome_shuffled.fasta")

3. Factorize Shuffled Genome

# Factorize the shuffled genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome_shuffled.fasta",
    "genome_shuffled_factors.bin"
)

4. Calculate Significance Threshold

from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    alpha_cp=0.05,
    plot_output="significance_analysis.png"
)

print(f"Analysis Results:")
print(f"  Real genome factors: {result['N_real']}")
print(f"  Shuffled genome factors: {result['N_shuf']}")
print(f"  Minimum significant length (L*): {result['L_star']}")

5. Filter Significant Factors

from noLZSS.utils import read_factors_binary_file

# Read real factors
factors = read_factors_binary_file("genome_real_factors.bin")

# Filter by threshold
if result['L_star'] is not None:
    significant_factors = [f for f in factors if f[1] >= result['L_star']]
    print(f"Significant factors: {len(significant_factors)}/{len(factors)}")
    
    # Or use rarity scores for finer control
    rarity_threshold = 0.05  # p < 0.05
    highly_significant = [
        f for f, score in zip(factors, result['rarity_scores_real'])
        if score < rarity_threshold
    ]
    print(f"Highly significant (p < 0.05): {len(highly_significant)}")

Parameter Guide

Choosing `tau_expected_fp`

The tau_expected_fp parameter controls the stringency of the threshold:

τ = 0.1: Very stringent, expect ≤0.1 false positives
τ = 1.0: Standard choice, expect ≤1 false positive (default)
τ = 5.0: Relaxed, expect ≤5 false positives

Recommendations:

Use τ = 0.1 to 1.0 for genomes with thousands of factors
Use τ = 5.0 to 10.0 for smaller analyses with hundreds of factors
Consider your tolerance for false discoveries in downstream analysis

Choosing `alpha_cp`

The alpha_cp parameter controls the confidence level for the Clopper-Pearson bounds:

α = 0.05: 95% confidence (default, standard)
α = 0.01: 99% confidence (more conservative)
α = 0.10: 90% confidence (less conservative)

Recommendations:

α = 0.05 is appropriate for most analyses
Use smaller α (more conservative) when false positives are very costly
Larger α provides less conservative bounds but may increase false positives

Interpretation

Understanding the Results

L* (Threshold Length)

L* = 150: Factors with length ≥150 are confidently non-random
L* = None: No length meets the criterion; factors are similar to shuffled
Lower L* indicates more distinctive real genome structure
Higher L* suggests less distinctive long-range structure

Rarity Scores

Score = 0.001: Very rare, only 0.1% of shuffled factors this long
Score = 0.05: Rare, only 5% of shuffled factors this long
Score = 0.50: Common, 50% of shuffled factors this long or longer
Score = 1.0: Length shorter than any shuffled factor

Expected False Positives

The plot shows E[FP] = N_real × S₀ᵁ(L):

Where the curve crosses τ is approximately L*
Steep drop indicates clear separation between real and shuffled
Gradual drop suggests overlapping length distributions

Biological Interpretation

High L* (e.g., L* > 200)

Strong biological signal in long factors
Clear distinction between structured and random sequences
Likely represents true biological repeats, gene families, etc.

Moderate L* (e.g., 50 < L* < 200)

Moderate biological structure
Mix of structured and random-like regions
May represent moderate repeat content

Low L* or None

Weak distinction from shuffled genome
May indicate:
- Low repeat content
- High sequence diversity
- Short, less distinctive repeats
- Need for different null model

Visualization Guide

Understanding the Plots

The plot_significance_analysis() function creates two panels:

Top Panel: Tail Probabilities

Blue line (S₀(L)): Empirical probability from shuffled data
Red dashed (S₀ᵁ(L)): Conservative upper bound
Green dotted (L*): Significance threshold
Y-axis is log scale to show full range

Interpretation:

Steep drop = clear length separation
Flat region = many factors in that length range
Gap between S₀ and S₀ᵁ = statistical uncertainty

Bottom Panel: Expected False Positives

Purple line: Expected FP = N_real × S₀ᵁ(L)
Orange dashed: τ threshold
Green dotted: L* where curves intersect

Interpretation:

Where purple crosses orange = L*
Curve below τ = acceptable FP region
Steeper drop = easier to find low L*

Examples

Example 1: Basic Analysis

import noLZSS
from noLZSS.genomics import calculate_factor_length_threshold

# Factorize real and shuffled genomes
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli.fasta",
    "ecoli_real.bin"
)

noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli_shuffled.fasta",
    "ecoli_shuffled.bin"
)

# Calculate threshold
result = calculate_factor_length_threshold(
    "ecoli_real.bin",
    "ecoli_shuffled.bin",
    tau_expected_fp=1.0,
    plot_output="ecoli_significance.png"
)

print(f"E. coli genome analysis:")
print(f"  Threshold length: {result['L_star']}")
print(f"  Real factors: {result['N_real']}")
print(f"  Shuffled factors: {result['N_shuf']}")

Example 2: Filtering Significant Factors

from noLZSS.utils import read_factors_binary_file
import numpy as np

# Read factors and perform analysis
factors = read_factors_binary_file("genome_real.bin")
result = calculate_factor_length_threshold(
    "genome_real.bin",
    "genome_shuffled.bin"
)

# Multiple filtering strategies
if result['L_star'] is not None:
    # Strategy 1: Length threshold
    length_filtered = [f for f in factors if f[1] >= result['L_star']]
    print(f"Length threshold: {len(length_filtered)} factors")
    
    # Strategy 2: Rarity score (more granular)
    rarity_scores = result['rarity_scores_real']
    rarity_filtered = [
        f for f, score in zip(factors, rarity_scores)
        if score < 0.05  # p < 0.05
    ]
    print(f"Rarity filtered: {len(rarity_filtered)} factors")
    
    # Strategy 3: Top N most significant
    top_n = 100
    top_indices = np.argsort(rarity_scores)[:top_n]
    top_factors = [factors[i] for i in top_indices]
    print(f"Top {top_n} factors by rarity")

Example 3: Multiple Genomes Comparison

from pathlib import Path
import pandas as pd

genomes = ["ecoli", "human_chr1", "yeast"]
results = []

for genome in genomes:
    real_file = f"{genome}_real.bin"
    shuf_file = f"{genome}_shuffled.bin"
    
    if Path(real_file).exists() and Path(shuf_file).exists():
        result = calculate_factor_length_threshold(
            real_file,
            shuf_file,
            tau_expected_fp=1.0
        )
        
        results.append({
            'genome': genome,
            'N_real': result['N_real'],
            'N_shuf': result['N_shuf'],
            'L_star': result['L_star'],
            'pct_significant': (
                100 * (result['rarity_scores_real'] < 0.05).sum() / result['N_real']
            )
        })

# Create summary table
df = pd.DataFrame(results)
print(df.to_string(index=False))

Example 4: Custom Analysis with Direct Functions

from noLZSS.genomics import (
    extract_factor_lengths,
    infer_length_significance,
    plot_significance_analysis
)

# Extract lengths
real_lengths = extract_factor_lengths("genome_real.bin")
shuf_lengths = extract_factor_lengths("genome_shuffled.bin")

# Run analysis with custom parameters
result = infer_length_significance(
    real_lengths,
    shuf_lengths,
    tau_expected_fp=0.5,  # Stringent
    alpha_cp=0.01  # Very conservative
)

# Generate custom visualization
plot_significance_analysis(
    result,
    save_path="custom_analysis.png",
    show_plot=True
)

# Access detailed statistics
print(f"Unique lengths in shuffled: {len(result['uniq_L'])}")
print(f"Min/Max shuffled length: {result['uniq_L'].min()}/{result['uniq_L'].max()}")

# Compute custom statistics
import numpy as np
median_rarity = np.median(result['rarity_scores_real'])
print(f"Median rarity score: {median_rarity:.4f}")

Advanced Topics

Limitations and Considerations

Single Shuffled Genome: This is a “Tier-0” analysis using one shuffled replicate. For more robust results, consider generating multiple shuffled genomes and averaging.
Shuffling Strategy: The shuffling method matters. Per-sequence shuffling preserves local composition. Whole-genome shuffling may be more appropriate for some applications.
Length-Only Analysis: This analysis considers only factor length, not position or sequence content. Factors at the same length are treated identically.
Conservative Bounds: Clopper-Pearson provides conservative (wide) confidence intervals. This is by design to avoid false significance claims.

Performance Notes

Memory usage is O(N_factors) for storing lengths
Computation time is dominated by reading binary files
Plot generation requires matplotlib (optional dependency)
Exact Clopper-Pearson requires scipy (falls back to Wilson score if unavailable)

Integration with Other Tools

This analysis integrates naturally with:

Factor position analysis (see noLZSS.genomics.plots)
Strand bias analysis (see strand-bias-heatmap)
Per-sequence factorization workflows
Downstream filtering pipelines

References

Clopper, C.J. and Pearson, E.S. (1934). “The use of confidence or fiducial limits illustrated in the case of the binomial.” Biometrika, 26(4), 404-413.
Wilson, E.B. (1927). “Probable inference, the law of succession, and statistical inference.” Journal of the American Statistical Association, 22(158), 209-212.
Köppl, D. (2021). “Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees.” Algorithms, 14(2), 44.

Factor Length Significance Analysis

Introduction

Why is this important?

Key Concepts

Statistical Methodology

Clopper-Pearson Confidence Bounds

Threshold Selection

Per-Factor Rarity Scores

Genome-Wide Exceedance Probability

API Reference

Core Statistical Functions

clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float

extract_factor_lengths(factors) -> np.ndarray

infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict

High-Level API

calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict

Visualization

plot_significance_analysis(result, save_path=None, show_plot=True) -> None

Workflow Guide

Step-by-Step Analysis

1. Prepare Your Data

2. Generate Shuffled Genome

3. Factorize Shuffled Genome

4. Calculate Significance Threshold

5. Filter Significant Factors

Parameter Guide

Choosing tau_expected_fp

Choosing alpha_cp

Interpretation

Understanding the Results

L* (Threshold Length)

Rarity Scores

Expected False Positives

Biological Interpretation

Visualization Guide

Understanding the Plots

Top Panel: Tail Probabilities

Bottom Panel: Expected False Positives

Examples

Example 1: Basic Analysis

Example 2: Filtering Significant Factors

Example 3: Multiple Genomes Comparison

Example 4: Custom Analysis with Direct Functions

Advanced Topics

Limitations and Considerations

Performance Notes

Integration with Other Tools

References

`clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float`

`extract_factor_lengths(factors) -> np.ndarray`

`infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict`

`calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict`

`plot_significance_analysis(result, save_path=None, show_plot=True) -> None`

Choosing `tau_expected_fp`

Choosing `alpha_cp`