# Factor Length Significance Analysis

## Introduction

Factor length significance analysis is a statistical method to distinguish meaningful (signal) factors from noise in genomic LZSS factorizations. By comparing factorization results from a real genome with results from a shuffled version of the same genome, we can determine the minimal factor length threshold at which factors are confidently considered non-random.

### Why is this important?

In genomic analysis, not all LZSS factors represent biologically meaningful repetitive structures. Short factors may arise simply by chance, while longer factors are more likely to represent true biological repeats, conserved regions, or other significant genomic features. This analysis provides a principled, statistically-grounded way to separate signal from noise.

### Key Concepts

- **Real Genome**: The original, unmodified genome sequence
- **Shuffled Genome**: A randomized version that preserves nucleotide composition but destroys biological structure
- **Factor Length**: The length of each non-overlapping LZSS factor
- **Significance Threshold (L\*)**: The minimum factor length at which factors are confidently non-random
- **Rarity Score**: Per-factor probability of observing that length in the shuffled genome

## Statistical Methodology

### Clopper-Pearson Confidence Bounds

The analysis uses **Clopper-Pearson** confidence intervals to establish conservative upper bounds on the probability of observing factors of a given length in shuffled data. This is a robust method for binomial proportions that provides exact coverage probabilities.

For k factors of length ≥ L out of n total factors in the shuffled genome:
- We compute the empirical tail probability: S₀(L) = k/n
- We compute a conservative upper bound: S₀ᵁ(L) using Clopper-Pearson

The Clopper-Pearson upper bound is the (1-α) quantile of the Beta(k+1, n-k) distribution.

### Threshold Selection

Given N factors in the real genome and a desired false positive threshold τ:
1. For each unique length L in the shuffled data, compute expected false positives: E[FP] = N × S₀ᵁ(L)
2. Select L\* as the smallest length where E[FP] ≤ τ

This ensures that we expect at most τ false positives among factors with length ≥ L\*.

### Per-Factor Rarity Scores

Each factor in the real genome receives a rarity score s_i = S₀(L_i), representing the tail probability of observing a factor of that length or greater in the shuffled genome. Lower scores indicate rarer (more significant) factors.

### Genome-Wide Exceedance Probability

Using a Poisson approximation, we estimate the probability of observing at least one factor of length ≥ L:

P(any factor ≥ L) ≈ 1 - exp(-N × S₀(L))

## API Reference

### Core Statistical Functions

#### `clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float`

Compute one-sided (1-α) Clopper-Pearson upper confidence bound for Binomial(n, p).

**Parameters:**
- `k` (int): Number of successes (0 ≤ k ≤ n)
- `n` (int): Number of trials (must be positive)
- `alpha` (float): Confidence level (default: 0.05 for 95% confidence)

**Returns:**
- `float`: Upper confidence bound for the binomial probability

**Raises:**
- `ValueError`: If inputs are invalid

**Example:**
```python
from noLZSS.genomics import clopper_pearson_upper

# 95% upper confidence bound for 5 successes out of 100 trials
upper = clopper_pearson_upper(5, 100, 0.05)
print(f"Upper bound: {upper:.4f}")  # ~0.11
```

---

#### `extract_factor_lengths(factors) -> np.ndarray`

Extract factor lengths from either a list of factor tuples or a binary factor file.

**Parameters:**
- `factors` (Union[List[Tuple], str, Path]): Either a list of (pos, length, ref, ...) tuples or path to binary factor file

**Returns:**
- `np.ndarray`: Array of factor lengths (int64)

**Raises:**
- `ValueError`: If input type is invalid or list contains invalid tuples
- `NoLZSSError`: If binary file cannot be read

**Example:**
```python
from noLZSS.genomics import extract_factor_lengths

# From tuple list
factors = [(0, 5, 0), (5, 3, 2), (8, 10, 1)]
lengths = extract_factor_lengths(factors)
print(lengths)  # [5, 3, 10]

# From binary file
lengths = extract_factor_lengths("genome_factors.bin")
print(f"Extracted {len(lengths)} factor lengths")
```

---

#### `infer_length_significance(real_lengths, shuf_lengths, tau_expected_fp=1.0, alpha_cp=0.05) -> Dict`

Core statistical inference using ONE shuffled genome.

**Parameters:**
- `real_lengths` (array-like): Factor lengths from real genome
- `shuf_lengths` (array-like): Factor lengths from shuffled genome
- `tau_expected_fp` (float): Maximum expected false positives (default: 1.0)
- `alpha_cp` (float): Confidence level for Clopper-Pearson (default: 0.05)

**Returns:**
Dictionary with keys:
- `N_real` (int): Number of real factors
- `N_shuf` (int): Number of shuffled factors
- `L_star` (int or None): Threshold length (None if no L meets criterion)
- `tau_expected_fp` (float): Input parameter
- `alpha_cp` (float): Input parameter
- `rarity_scores_real` (np.ndarray): Per-factor tail probabilities
- `p_any_ge` (Callable): Function(L) -> genome-wide exceedance probability
- `uniq_L` (np.ndarray): Unique length values
- `S0` (np.ndarray): Empirical tail probabilities
- `S0_upper` (np.ndarray): Conservative upper bounds
- `expected_fp_upper` (np.ndarray): Expected FP counts

**Example:**
```python
from noLZSS.genomics import infer_length_significance
import numpy as np

real_lens = np.array([50, 60, 70, 80, 90])
shuf_lens = np.array([5, 10, 15, 20, 25, 30])

result = infer_length_significance(real_lens, shuf_lens, tau_expected_fp=1.0)
print(f"Threshold length: {result['L_star']}")
print(f"Number of real factors: {result['N_real']}")

# Per-factor rarity scores
for i, score in enumerate(result['rarity_scores_real']):
    print(f"Factor {i} (len={real_lens[i]}): rarity={score:.4f}")

# Genome-wide exceedance probability
p_any_ge = result['p_any_ge']
print(f"P(any factor >= 50): {p_any_ge(50):.4f}")
```

---

### High-Level API

#### `calculate_factor_length_threshold(real_factors_file, shuffled_factors_file, tau_expected_fp=1.0, alpha_cp=0.05, plot_output=None) -> Dict`

Main user-facing function for significance analysis.

**Parameters:**
- `real_factors_file` (str or Path): Path to binary factor file from real genome
- `shuffled_factors_file` (str or Path): Path to binary factor file from shuffled genome
- `tau_expected_fp` (float): Maximum expected false positives (default: 1.0)
- `alpha_cp` (float): Confidence level (default: 0.05)
- `plot_output` (str, Path, or None): Optional path to save visualization

**Returns:**
- `dict`: Result dictionary from `infer_length_significance()`

**Raises:**
- `FileNotFoundError`: If either input file doesn't exist
- `NoLZSSError`: If files have invalid format

**Example:**
```python
from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    plot_output="significance_analysis.png"
)

if result['L_star'] is not None:
    print(f"Minimum significant factor length: {result['L_star']}")
    
    # Count significant factors
    significant = result['rarity_scores_real'] < 0.05
    print(f"Number of significant factors (p < 0.05): {significant.sum()}")
else:
    print("No threshold meets the false positive criterion")
```

---

### Visualization

#### `plot_significance_analysis(result, save_path=None, show_plot=True) -> None`

Create visualization of significance analysis results.

**Parameters:**
- `result` (dict): Output dictionary from `infer_length_significance()`
- `save_path` (str, Path, or None): Optional path to save plot
- `show_plot` (bool): Whether to display plot interactively (default: True)

**Raises:**
- `ValueError`: If result dictionary is missing required keys

**Warnings:**
- `UserWarning`: If matplotlib is not available

**Example:**
```python
from noLZSS.genomics import infer_length_significance, plot_significance_analysis

result = infer_length_significance(real_lens, shuf_lens)

# Display plot
plot_significance_analysis(result)

# Save without displaying
plot_significance_analysis(result, save_path="analysis.png", show_plot=False)
```

## Workflow Guide

### Step-by-Step Analysis

#### 1. Prepare Your Data

Start with a FASTA file containing your genome sequence(s):

```python
import noLZSS

# Factorize the real genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome.fasta",
    "genome_real_factors.bin"
)
```

#### 2. Generate Shuffled Genome

Create a shuffled version that preserves nucleotide composition but destroys biological structure. You can use external tools or write your own:

```python
from pathlib import Path
import random

def shuffle_fasta(input_path, output_path):
    """Simple shuffler - preserves per-sequence composition."""
    from noLZSS.genomics import read_fasta_auto
    
    sequences = read_fasta_auto(input_path)
    
    with open(output_path, 'w') as f:
        for seq_id, seq_data in sequences:
            # Convert to list and shuffle
            seq_list = list(seq_data)
            random.shuffle(seq_list)
            shuffled = ''.join(seq_list)
            
            # Write FASTA
            f.write(f">{seq_id}_shuffled\n")
            # Write in 80-char lines
            for i in range(0, len(shuffled), 80):
                f.write(shuffled[i:i+80] + '\n')

shuffle_fasta("genome.fasta", "genome_shuffled.fasta")
```

#### 3. Factorize Shuffled Genome

```python
# Factorize the shuffled genome
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "genome_shuffled.fasta",
    "genome_shuffled_factors.bin"
)
```

#### 4. Calculate Significance Threshold

```python
from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_real_factors.bin",
    "genome_shuffled_factors.bin",
    tau_expected_fp=1.0,
    alpha_cp=0.05,
    plot_output="significance_analysis.png"
)

print(f"Analysis Results:")
print(f"  Real genome factors: {result['N_real']}")
print(f"  Shuffled genome factors: {result['N_shuf']}")
print(f"  Minimum significant length (L*): {result['L_star']}")
```

#### 5. Filter Significant Factors

```python
from noLZSS.utils import read_factors_binary_file

# Read real factors
factors = read_factors_binary_file("genome_real_factors.bin")

# Filter by threshold
if result['L_star'] is not None:
    significant_factors = [f for f in factors if f[1] >= result['L_star']]
    print(f"Significant factors: {len(significant_factors)}/{len(factors)}")
    
    # Or use rarity scores for finer control
    rarity_threshold = 0.05  # p < 0.05
    highly_significant = [
        f for f, score in zip(factors, result['rarity_scores_real'])
        if score < rarity_threshold
    ]
    print(f"Highly significant (p < 0.05): {len(highly_significant)}")
```

## Parameter Guide

### Choosing `tau_expected_fp`

The `tau_expected_fp` parameter controls the stringency of the threshold:

- **τ = 0.1**: Very stringent, expect ≤0.1 false positives
- **τ = 1.0**: Standard choice, expect ≤1 false positive (default)
- **τ = 5.0**: Relaxed, expect ≤5 false positives

**Recommendations:**
- Use τ = 0.1 to 1.0 for genomes with thousands of factors
- Use τ = 5.0 to 10.0 for smaller analyses with hundreds of factors
- Consider your tolerance for false discoveries in downstream analysis

### Choosing `alpha_cp`

The `alpha_cp` parameter controls the confidence level for the Clopper-Pearson bounds:

- **α = 0.05**: 95% confidence (default, standard)
- **α = 0.01**: 99% confidence (more conservative)
- **α = 0.10**: 90% confidence (less conservative)

**Recommendations:**
- α = 0.05 is appropriate for most analyses
- Use smaller α (more conservative) when false positives are very costly
- Larger α provides less conservative bounds but may increase false positives

## Interpretation

### Understanding the Results

#### L\* (Threshold Length)

- **L\* = 150**: Factors with length ≥150 are confidently non-random
- **L\* = None**: No length meets the criterion; factors are similar to shuffled
- Lower L\* indicates more distinctive real genome structure
- Higher L\* suggests less distinctive long-range structure

#### Rarity Scores

- **Score = 0.001**: Very rare, only 0.1% of shuffled factors this long
- **Score = 0.05**: Rare, only 5% of shuffled factors this long
- **Score = 0.50**: Common, 50% of shuffled factors this long or longer
- **Score = 1.0**: Length shorter than any shuffled factor

#### Expected False Positives

The plot shows E[FP] = N_real × S₀ᵁ(L):
- Where the curve crosses τ is approximately L\*
- Steep drop indicates clear separation between real and shuffled
- Gradual drop suggests overlapping length distributions

### Biological Interpretation

**High L\* (e.g., L\* > 200)**
- Strong biological signal in long factors
- Clear distinction between structured and random sequences
- Likely represents true biological repeats, gene families, etc.

**Moderate L\* (e.g., 50 < L\* < 200)**
- Moderate biological structure
- Mix of structured and random-like regions
- May represent moderate repeat content

**Low L\* or None**
- Weak distinction from shuffled genome
- May indicate:
  - Low repeat content
  - High sequence diversity
  - Short, less distinctive repeats
  - Need for different null model

## Visualization Guide

### Understanding the Plots

The `plot_significance_analysis()` function creates two panels:

#### Top Panel: Tail Probabilities

- **Blue line (S₀(L))**: Empirical probability from shuffled data
- **Red dashed (S₀ᵁ(L))**: Conservative upper bound
- **Green dotted (L\*)**: Significance threshold
- Y-axis is log scale to show full range

**Interpretation:**
- Steep drop = clear length separation
- Flat region = many factors in that length range
- Gap between S₀ and S₀ᵁ = statistical uncertainty

#### Bottom Panel: Expected False Positives

- **Purple line**: Expected FP = N_real × S₀ᵁ(L)
- **Orange dashed**: τ threshold
- **Green dotted**: L\* where curves intersect

**Interpretation:**
- Where purple crosses orange = L\*
- Curve below τ = acceptable FP region
- Steeper drop = easier to find low L\*

## Examples

### Example 1: Basic Analysis

```python
import noLZSS
from noLZSS.genomics import calculate_factor_length_threshold

# Factorize real and shuffled genomes
noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli.fasta",
    "ecoli_real.bin"
)

noLZSS.write_factors_binary_file_fasta_multiple_dna_w_rc(
    "ecoli_shuffled.fasta",
    "ecoli_shuffled.bin"
)

# Calculate threshold
result = calculate_factor_length_threshold(
    "ecoli_real.bin",
    "ecoli_shuffled.bin",
    tau_expected_fp=1.0,
    plot_output="ecoli_significance.png"
)

print(f"E. coli genome analysis:")
print(f"  Threshold length: {result['L_star']}")
print(f"  Real factors: {result['N_real']}")
print(f"  Shuffled factors: {result['N_shuf']}")
```

### Example 2: Filtering Significant Factors

```python
from noLZSS.utils import read_factors_binary_file
import numpy as np

# Read factors and perform analysis
factors = read_factors_binary_file("genome_real.bin")
result = calculate_factor_length_threshold(
    "genome_real.bin",
    "genome_shuffled.bin"
)

# Multiple filtering strategies
if result['L_star'] is not None:
    # Strategy 1: Length threshold
    length_filtered = [f for f in factors if f[1] >= result['L_star']]
    print(f"Length threshold: {len(length_filtered)} factors")
    
    # Strategy 2: Rarity score (more granular)
    rarity_scores = result['rarity_scores_real']
    rarity_filtered = [
        f for f, score in zip(factors, rarity_scores)
        if score < 0.05  # p < 0.05
    ]
    print(f"Rarity filtered: {len(rarity_filtered)} factors")
    
    # Strategy 3: Top N most significant
    top_n = 100
    top_indices = np.argsort(rarity_scores)[:top_n]
    top_factors = [factors[i] for i in top_indices]
    print(f"Top {top_n} factors by rarity")
```

### Example 3: Multiple Genomes Comparison

```python
from pathlib import Path
import pandas as pd

genomes = ["ecoli", "human_chr1", "yeast"]
results = []

for genome in genomes:
    real_file = f"{genome}_real.bin"
    shuf_file = f"{genome}_shuffled.bin"
    
    if Path(real_file).exists() and Path(shuf_file).exists():
        result = calculate_factor_length_threshold(
            real_file,
            shuf_file,
            tau_expected_fp=1.0
        )
        
        results.append({
            'genome': genome,
            'N_real': result['N_real'],
            'N_shuf': result['N_shuf'],
            'L_star': result['L_star'],
            'pct_significant': (
                100 * (result['rarity_scores_real'] < 0.05).sum() / result['N_real']
            )
        })

# Create summary table
df = pd.DataFrame(results)
print(df.to_string(index=False))
```

### Example 4: Custom Analysis with Direct Functions

```python
from noLZSS.genomics import (
    extract_factor_lengths,
    infer_length_significance,
    plot_significance_analysis
)

# Extract lengths
real_lengths = extract_factor_lengths("genome_real.bin")
shuf_lengths = extract_factor_lengths("genome_shuffled.bin")

# Run analysis with custom parameters
result = infer_length_significance(
    real_lengths,
    shuf_lengths,
    tau_expected_fp=0.5,  # Stringent
    alpha_cp=0.01  # Very conservative
)

# Generate custom visualization
plot_significance_analysis(
    result,
    save_path="custom_analysis.png",
    show_plot=True
)

# Access detailed statistics
print(f"Unique lengths in shuffled: {len(result['uniq_L'])}")
print(f"Min/Max shuffled length: {result['uniq_L'].min()}/{result['uniq_L'].max()}")

# Compute custom statistics
import numpy as np
median_rarity = np.median(result['rarity_scores_real'])
print(f"Median rarity score: {median_rarity:.4f}")
```

## Advanced Topics

### Limitations and Considerations

1. **Single Shuffled Genome**: This is a "Tier-0" analysis using one shuffled replicate. For more robust results, consider generating multiple shuffled genomes and averaging.

2. **Shuffling Strategy**: The shuffling method matters. Per-sequence shuffling preserves local composition. Whole-genome shuffling may be more appropriate for some applications.

3. **Length-Only Analysis**: This analysis considers only factor length, not position or sequence content. Factors at the same length are treated identically.

4. **Conservative Bounds**: Clopper-Pearson provides conservative (wide) confidence intervals. This is by design to avoid false significance claims.

### Performance Notes

- Memory usage is O(N_factors) for storing lengths
- Computation time is dominated by reading binary files
- Plot generation requires matplotlib (optional dependency)
- Exact Clopper-Pearson requires scipy (falls back to Wilson score if unavailable)

### Integration with Other Tools

This analysis integrates naturally with:
- Factor position analysis (see `noLZSS.genomics.plots`)
- Strand bias analysis (see strand-bias-heatmap)
- Per-sequence factorization workflows
- Downstream filtering pipelines

## References

1. Clopper, C.J. and Pearson, E.S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika, 26(4), 404-413.

2. Wilson, E.B. (1927). "Probable inference, the law of succession, and statistical inference." Journal of the American Statistical Association, 22(158), 209-212.

3. Köppl, D. (2021). "Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees." Algorithms, 14(2), 44.