Factor Length Significance Analysis
Introduction
Factor length significance analysis is a statistical method to distinguish meaningful (signal) factors from noise in genomic LZSS factorizations. By comparing factorization results from a real genome with results from a shuffled version of the same genome, we can determine the minimal factor length threshold at which factors are confidently considered non-random.
Why is this important?
In genomic analysis, not all LZSS factors represent biologically meaningful repetitive structures. Short factors may arise simply by chance, while longer factors are more likely to represent true biological repeats, conserved regions, or other significant genomic features. This analysis provides a principled, statistically-grounded way to separate signal from noise.
Key Concepts
Real Genome: The original, unmodified genome sequence
Shuffled Genome: A randomized version that preserves nucleotide composition but destroys biological structure
Factor Length: The length of each non-overlapping LZSS factor
Significance Threshold (L*): The minimum factor length at which factors are confidently non-random
Rarity Score: Per-factor probability of observing that length in the shuffled genome
Statistical Methodology
Clopper-Pearson Confidence Bounds
The analysis uses Clopper-Pearson confidence intervals to establish conservative upper bounds on the probability of observing factors of a given length in shuffled data. This is a robust method for binomial proportions that provides exact coverage probabilities.
For k factors of length ≥ L out of n total factors in the shuffled genome:
We compute the empirical tail probability: S₀(L) = k/n
We compute a conservative upper bound: S₀ᵁ(L) using Clopper-Pearson
The Clopper-Pearson upper bound is the (1-α) quantile of the Beta(k+1, n-k) distribution.
Threshold Selection
Given N factors in the real genome and a desired false positive threshold τ:
For each unique length L in the shuffled data, compute expected false positives: E[FP] = N × S₀ᵁ(L)
Select L* as the smallest length where E[FP] ≤ τ
This ensures that we expect at most τ false positives among factors with length ≥ L*.
Per-Factor Rarity Scores
Each factor in the real genome receives a rarity score s_i = S₀(L_i), representing the tail probability of observing a factor of that length or greater in the shuffled genome. Lower scores indicate rarer (more significant) factors.
Genome-Wide Exceedance Probability
Using a Poisson approximation, we estimate the probability of observing at least one factor of length ≥ L:
P(any factor ≥ L) ≈ 1 - exp(-N × S₀(L))
API Reference
Core Statistical Functions
clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float
Compute one-sided (1-α) Clopper-Pearson upper confidence bound for Binomial(n, p).
Parameters:
k(int): Number of successes (0 ≤ k ≤ n)n(int): Number of trials (must be positive)alpha(float): Confidence level (default: 0.05 for 95% confidence)
Returns:
float: Upper confidence bound for the binomial probability
Raises:
ValueError: If inputs are invalid
Example:
from noLZSS.genomics import clopper_pearson_upper
# 95% upper confidence bound for 5 successes out of 100 trials
upper = clopper_pearson_upper(5, 100, 0.05)
print(f"Upper bound: {upper:.4f}") # ~0.11
extract_factor_lengths(factors) -> np.ndarray
Extract factor lengths from either a list of factor tuples or a binary factor file.
Parameters:
factors(Union[List[Tuple], str, Path]): Either a list of (pos, length, ref, …) tuples or path to binary factor file
Returns:
np.ndarray: Array of factor lengths (int64)
Raises:
ValueError: If input type is invalid or list contains invalid tuplesNoLZSSError: If binary file cannot be read
Example:
from noLZSS.genomics import extract_factor_lengths
# From tuple list
factors = [(0, 5, 0), (5, 3, 2), (8, 10, 1)]
lengths = extract_factor_lengths(factors)
print(lengths) # [5, 3, 10]
# From binary file
lengths = extract_factor_lengths("genome_factors.bin")
print(f"Extracted {len(lengths)} factor lengths")
infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict
Core statistical inference using ONE shuffled genome.
Parameters:
real_lengths(array-like): Factor lengths from real genomeshuf_lengths(array-like): Factor lengths from shuffled genometau_expected_fp(float): Maximum expected false positives (default: 1.0)alpha_cp(float): Confidence level for Clopper-Pearson (default: 0.05)
Returns: Dictionary with keys:
N_real(int): Number of real factorsN_shuf(int): Number of shuffled factorsL_star(int or None): Threshold length (None if no L meets criterion)tau_expected_fp(float): Input parameteralpha_cp(float): Input parameterrarity_scores_real(np.ndarray): Per-factor tail probabilitiesp_any_ge(Callable): Function(L) -> genome-wide exceedance probabilityuniq_L(np.ndarray): Unique length valuesS0(np.ndarray): Empirical tail probabilitiesS0_upper(np.ndarray): Conservative upper boundsexpected_fp_upper(np.ndarray): Expected FP counts
Example:
from noLZSS.genomics import infer_length_significance
import numpy as np
real_lens = np.array([50, 60, 70, 80, 90])
shuf_lens = np.array([5, 10, 15, 20, 25, 30])
result = infer_length_significance(real_lens, shuf_lens, tau_expected_fp=1.0)
print(f"Threshold length: {result['L_star']}")
print(f"Number of real factors: {result['N_real']}")
# Per-factor rarity scores
for i, score in enumerate(result['rarity_scores_real']):
print(f"Factor {i} (len={real_lens[i]}): rarity={score:.4f}")
# Genome-wide exceedance probability
p_any_ge = result['p_any_ge']
print(f"P(any factor >= 50): {p_any_ge(50):.4f}")
High-Level API
calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict
Main user-facing function for significance analysis.
Parameters:
real_factors_file(str or Path): Path to binary factor file from real genomeshuffled_factors_file(str or Path): Path to binary factor file from shuffled genometau_expected_fp(float): Maximum expected false positives (default: 1.0)alpha_cp(float): Confidence level (default: 0.05)plot_output(str, Path, or None): Optional path to save visualization
Returns:
dict: Result dictionary frominfer_length_significance()
Raises:
FileNotFoundError: If either input file doesn’t existNoLZSSError: If files have invalid format
Example:
from noLZSS.genomics import calculate_factor_length_threshold
result = calculate_factor_length_threshold(
"genome_real_factors.bin",
"genome_shuffled_factors.bin",
tau_expected_fp=1.0,
plot_output="significance_analysis.png"
)
if result['L_star'] is not None:
print(f"Minimum significant factor length: {result['L_star']}")
# Count significant factors
significant = result['rarity_scores_real'] < 0.05
print(f"Number of significant factors (p < 0.05): {significant.sum()}")
else:
print("No threshold meets the false positive criterion")
Visualization
plot_significance_analysis(result, save_path=None, show_plot=True) -> None
Create visualization of significance analysis results.
Parameters:
result(dict): Output dictionary frominfer_length_significance()save_path(str, Path, or None): Optional path to save plotshow_plot(bool): Whether to display plot interactively (default: True)
Raises:
ValueError: If result dictionary is missing required keys
Warnings:
UserWarning: If matplotlib is not available
Example:
from noLZSS.genomics import infer_length_significance, plot_significance_analysis
result = infer_length_significance(real_lens, shuf_lens)
# Display plot
plot_significance_analysis(result)
# Save without displaying
plot_significance_analysis(result, save_path="analysis.png", show_plot=False)
Workflow Guide
Step-by-Step Analysis
1. Prepare Your Data
Start with a FASTA file containing your genome sequence(s):
import noLZSS
# Factorize the real genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
"genome.fasta",
"genome_real_factors.bin"
)
2. Generate Shuffled Genome
Create a shuffled version that preserves nucleotide composition but destroys biological structure. You can use external tools or write your own:
from pathlib import Path
import random
def shuffle_fasta(input_path, output_path):
"""Simple shuffler - preserves per-sequence composition."""
from noLZSS.genomics import read_fasta_auto
sequences = read_fasta_auto(input_path)
with open(output_path, 'w') as f:
for seq_id, seq_data in sequences:
# Convert to list and shuffle
seq_list = list(seq_data)
random.shuffle(seq_list)
shuffled = ''.join(seq_list)
# Write FASTA
f.write(f">{seq_id}_shuffled\n")
# Write in 80-char lines
for i in range(0, len(shuffled), 80):
f.write(shuffled[i:i+80] + '\n')
shuffle_fasta("genome.fasta", "genome_shuffled.fasta")
3. Factorize Shuffled Genome
# Factorize the shuffled genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
"genome_shuffled.fasta",
"genome_shuffled_factors.bin"
)
4. Calculate Significance Threshold
from noLZSS.genomics import calculate_factor_length_threshold
result = calculate_factor_length_threshold(
"genome_real_factors.bin",
"genome_shuffled_factors.bin",
tau_expected_fp=1.0,
alpha_cp=0.05,
plot_output="significance_analysis.png"
)
print(f"Analysis Results:")
print(f" Real genome factors: {result['N_real']}")
print(f" Shuffled genome factors: {result['N_shuf']}")
print(f" Minimum significant length (L*): {result['L_star']}")
5. Filter Significant Factors
from noLZSS.utils import read_factors_binary_file
# Read real factors
factors = read_factors_binary_file("genome_real_factors.bin")
# Filter by threshold
if result['L_star'] is not None:
significant_factors = [f for f in factors if f[1] >= result['L_star']]
print(f"Significant factors: {len(significant_factors)}/{len(factors)}")
# Or use rarity scores for finer control
rarity_threshold = 0.05 # p < 0.05
highly_significant = [
f for f, score in zip(factors, result['rarity_scores_real'])
if score < rarity_threshold
]
print(f"Highly significant (p < 0.05): {len(highly_significant)}")
Parameter Guide
Choosing tau_expected_fp
The tau_expected_fp parameter controls the stringency of the threshold:
τ = 0.1: Very stringent, expect ≤0.1 false positives
τ = 1.0: Standard choice, expect ≤1 false positive (default)
τ = 5.0: Relaxed, expect ≤5 false positives
Recommendations:
Use τ = 0.1 to 1.0 for genomes with thousands of factors
Use τ = 5.0 to 10.0 for smaller analyses with hundreds of factors
Consider your tolerance for false discoveries in downstream analysis
Choosing alpha_cp
The alpha_cp parameter controls the confidence level for the Clopper-Pearson bounds:
α = 0.05: 95% confidence (default, standard)
α = 0.01: 99% confidence (more conservative)
α = 0.10: 90% confidence (less conservative)
Recommendations:
α = 0.05 is appropriate for most analyses
Use smaller α (more conservative) when false positives are very costly
Larger α provides less conservative bounds but may increase false positives
Interpretation
Understanding the Results
L* (Threshold Length)
L* = 150: Factors with length ≥150 are confidently non-random
L* = None: No length meets the criterion; factors are similar to shuffled
Lower L* indicates more distinctive real genome structure
Higher L* suggests less distinctive long-range structure
Rarity Scores
Score = 0.001: Very rare, only 0.1% of shuffled factors this long
Score = 0.05: Rare, only 5% of shuffled factors this long
Score = 0.50: Common, 50% of shuffled factors this long or longer
Score = 1.0: Length shorter than any shuffled factor
Expected False Positives
The plot shows E[FP] = N_real × S₀ᵁ(L):
Where the curve crosses τ is approximately L*
Steep drop indicates clear separation between real and shuffled
Gradual drop suggests overlapping length distributions
Biological Interpretation
High L* (e.g., L* > 200)
Strong biological signal in long factors
Clear distinction between structured and random sequences
Likely represents true biological repeats, gene families, etc.
Moderate L* (e.g., 50 < L* < 200)
Moderate biological structure
Mix of structured and random-like regions
May represent moderate repeat content
Low L* or None
Weak distinction from shuffled genome
May indicate:
Low repeat content
High sequence diversity
Short, less distinctive repeats
Need for different null model
Visualization Guide
Understanding the Plots
The plot_significance_analysis() function creates two panels:
Top Panel: Tail Probabilities
Blue line (S₀(L)): Empirical probability from shuffled data
Red dashed (S₀ᵁ(L)): Conservative upper bound
Green dotted (L*): Significance threshold
Y-axis is log scale to show full range
Interpretation:
Steep drop = clear length separation
Flat region = many factors in that length range
Gap between S₀ and S₀ᵁ = statistical uncertainty
Bottom Panel: Expected False Positives
Purple line: Expected FP = N_real × S₀ᵁ(L)
Orange dashed: τ threshold
Green dotted: L* where curves intersect
Interpretation:
Where purple crosses orange = L*
Curve below τ = acceptable FP region
Steeper drop = easier to find low L*
Examples
Example 1: Basic Analysis
import noLZSS
from noLZSS.genomics import calculate_factor_length_threshold
# Factorize real and shuffled genomes
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
"ecoli.fasta",
"ecoli_real.bin"
)
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
"ecoli_shuffled.fasta",
"ecoli_shuffled.bin"
)
# Calculate threshold
result = calculate_factor_length_threshold(
"ecoli_real.bin",
"ecoli_shuffled.bin",
tau_expected_fp=1.0,
plot_output="ecoli_significance.png"
)
print(f"E. coli genome analysis:")
print(f" Threshold length: {result['L_star']}")
print(f" Real factors: {result['N_real']}")
print(f" Shuffled factors: {result['N_shuf']}")
Example 2: Filtering Significant Factors
from noLZSS.utils import read_factors_binary_file
import numpy as np
# Read factors and perform analysis
factors = read_factors_binary_file("genome_real.bin")
result = calculate_factor_length_threshold(
"genome_real.bin",
"genome_shuffled.bin"
)
# Multiple filtering strategies
if result['L_star'] is not None:
# Strategy 1: Length threshold
length_filtered = [f for f in factors if f[1] >= result['L_star']]
print(f"Length threshold: {len(length_filtered)} factors")
# Strategy 2: Rarity score (more granular)
rarity_scores = result['rarity_scores_real']
rarity_filtered = [
f for f, score in zip(factors, rarity_scores)
if score < 0.05 # p < 0.05
]
print(f"Rarity filtered: {len(rarity_filtered)} factors")
# Strategy 3: Top N most significant
top_n = 100
top_indices = np.argsort(rarity_scores)[:top_n]
top_factors = [factors[i] for i in top_indices]
print(f"Top {top_n} factors by rarity")
Example 3: Multiple Genomes Comparison
from pathlib import Path
import pandas as pd
genomes = ["ecoli", "human_chr1", "yeast"]
results = []
for genome in genomes:
real_file = f"{genome}_real.bin"
shuf_file = f"{genome}_shuffled.bin"
if Path(real_file).exists() and Path(shuf_file).exists():
result = calculate_factor_length_threshold(
real_file,
shuf_file,
tau_expected_fp=1.0
)
results.append({
'genome': genome,
'N_real': result['N_real'],
'N_shuf': result['N_shuf'],
'L_star': result['L_star'],
'pct_significant': (
100 * (result['rarity_scores_real'] < 0.05).sum() / result['N_real']
)
})
# Create summary table
df = pd.DataFrame(results)
print(df.to_string(index=False))
Example 4: Custom Analysis with Direct Functions
from noLZSS.genomics import (
extract_factor_lengths,
infer_length_significance,
plot_significance_analysis
)
# Extract lengths
real_lengths = extract_factor_lengths("genome_real.bin")
shuf_lengths = extract_factor_lengths("genome_shuffled.bin")
# Run analysis with custom parameters
result = infer_length_significance(
real_lengths,
shuf_lengths,
tau_expected_fp=0.5, # Stringent
alpha_cp=0.01 # Very conservative
)
# Generate custom visualization
plot_significance_analysis(
result,
save_path="custom_analysis.png",
show_plot=True
)
# Access detailed statistics
print(f"Unique lengths in shuffled: {len(result['uniq_L'])}")
print(f"Min/Max shuffled length: {result['uniq_L'].min()}/{result['uniq_L'].max()}")
# Compute custom statistics
import numpy as np
median_rarity = np.median(result['rarity_scores_real'])
print(f"Median rarity score: {median_rarity:.4f}")
Advanced Topics
Limitations and Considerations
Single Shuffled Genome: This is a “Tier-0” analysis using one shuffled replicate. For more robust results, consider generating multiple shuffled genomes and averaging.
Shuffling Strategy: The shuffling method matters. Per-sequence shuffling preserves local composition. Whole-genome shuffling may be more appropriate for some applications.
Length-Only Analysis: This analysis considers only factor length, not position or sequence content. Factors at the same length are treated identically.
Conservative Bounds: Clopper-Pearson provides conservative (wide) confidence intervals. This is by design to avoid false significance claims.
Performance Notes
Memory usage is O(N_factors) for storing lengths
Computation time is dominated by reading binary files
Plot generation requires matplotlib (optional dependency)
Exact Clopper-Pearson requires scipy (falls back to Wilson score if unavailable)
Integration with Other Tools
This analysis integrates naturally with:
Factor position analysis (see
noLZSS.genomics.plots)Strand bias analysis (see strand-bias-heatmap)
Per-sequence factorization workflows
Downstream filtering pipelines
References
Clopper, C.J. and Pearson, E.S. (1934). “The use of confidence or fiducial limits illustrated in the case of the binomial.” Biometrika, 26(4), 404-413.
Wilson, E.B. (1927). “Probable inference, the law of succession, and statistical inference.” Journal of the American Statistical Association, 22(158), 209-212.
Köppl, D. (2021). “Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees.” Algorithms, 14(2), 44.