Test Runner Specification¶
Status: Canonical Reference
Scope: runner/test_runner.py - Main test execution engine
Last Updated: January 9, 2026 12:55:38
Created: December 24, 2025 12:39:00
Related: Runner README (Quick Reference)
The Test Runner is the core testing engine for executing solutions against test cases. It supports multi-solution benchmarking, random test generation, and complexity estimation.
Quick Start¶
# Run default solution
python runner/test_runner.py 0001_two_sum
# Run specific solution method
python runner/test_runner.py 0023 --method heap
# Compare all solutions with timing
python runner/test_runner.py 0023 --all --benchmark
Command Reference¶
Solution Selection¶
| Option | Description |
|---|---|
| (none) | Run the "default" solution |
| --method NAME | Run a specific solution |
| --all | Run all solutions in SOLUTIONS |
Test Generation¶
| Option | Description |
|---|---|
| --generate N | Static tests + N generated cases |
| --generate-only N | Skip static tests, generate N cases only |
| --seed N | Reproducible generation |
| --save-failed | Save failed cases to tests/ |

📌 Requires a generator file. See Generator Contract.
Analysis¶
| Option | Description |
|---|---|
| --benchmark | Show execution time per case (includes memory metrics if psutil is installed) |
| --estimate | Estimate time complexity |

📌 --estimate requires generate_for_complexity(n) and pip install big-O.
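A generator that supports --estimate might look like the following; this is a minimal sketch for a hypothetical generators/0011_container_with_most_water.py, assuming the contract returns a raw stdin string of size n (see the Generator Contract for the authoritative signature):

```python
# Minimal sketch of a generator supporting --estimate. Assumption: the
# contract expects a raw stdin string whose problem size is n.
import random

def generate_for_complexity(n: int) -> str:
    rng = random.Random(n)  # deterministic per size, so runs are comparable
    heights = [rng.randint(1, 10_000) for _ in range(n)]
    return " ".join(map(str, heights))
```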
Memory Profiling¶
| Option | Description |
|---|---|
| --memory-trace | Show run-level memory traces (sparklines) per method |
| --trace-compare | Multi-method memory comparison with ranking table |
| --memory-per-case | Debug: top-K cases by peak RSS |

📌 Memory profiling requires pip install psutil. Without it, memory columns show "Unavailable".
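For context, the peak RSS of a child process can be sampled with psutil roughly as below; this is an illustrative sketch, not the runner's actual implementation:

```python
# Poll a child process's resident set size until it exits.
import subprocess
import time

import psutil

def peak_rss(cmd: list[str], stdin_text: str) -> int:
    """Run cmd with stdin_text on stdin; return peak RSS in bytes."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
    proc.stdin.write(stdin_text)
    proc.stdin.close()
    peak = 0
    try:
        ps = psutil.Process(proc.pid)
        while proc.poll() is None:
            peak = max(peak, ps.memory_info().rss)  # bytes
            time.sleep(0.001)  # ~1 ms sampling interval
    except psutil.NoSuchProcess:
        pass  # process exited between poll() and memory_info()
    return peak
```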
Other¶
| Option | Description |
|---|---|
| --tests-dir DIR | Custom tests directory (default: tests) |
Usage Examples¶
Basic Testing¶
python runner/test_runner.py 0001_two_sum
python runner/test_runner.py 0023 --method heap
python runner/test_runner.py 0023 --all --benchmark
Random Testing¶
python runner/test_runner.py 0215 --generate 10
python runner/test_runner.py 0215 --generate 10 --seed 12345
python runner/test_runner.py 0215 --generate 100 --save-failed
Complexity Estimation¶
python runner/test_runner.py 0011 --method two_pointers --estimate
python runner/test_runner.py 0011 --all --estimate
Examples Gallery¶
This section shows actual output from various test runs to help you understand how to interpret results.
Example 1: Multi-Solution Benchmark (Trapping Rain Water)¶
Command:
python runner/test_runner.py 0042 --all --benchmark
Output:
╔══════════════════════════════════════════════╗
║ 0042_trapping_rain_water - Performance       ║
╠══════════════════════════════════════════════╣
║ default:    ████████████████████ 106ms       ║
║ stack:      ████████████████████ 104ms       ║
║ twopointer: ███████████████████  102ms       ║
║ dp:         ███████████████████  100ms       ║
╚══════════════════════════════════════════════╝
Method Avg Time Pass Rate Complexity Peak RSS
---------- ---------- ---------- -------------------- ----------
default 106.07ms 2/2 O(n) time, O(n) space 4.8MB
stack 103.54ms 2/2 O(n) time, O(n) space 4.7MB
twopointer 102.15ms 2/2 O(n) time, O(1) space 4.6MB
dp 100.35ms 2/2 O(n) time, O(n) space 4.6MB
How to Interpret:
- Bar length is proportional to execution time (longest = full bar)
- twopointer uses O(1) space while the others use O(n), a key insight for interviews
- All approaches are O(n) time but have different constant factors
Example 2: Four Solutions Comparison (3Sum)¶
Command:
python runner/test_runner.py 0015 --all --benchmark
Output:
╔════════════════════════════════════════════════╗
║ 0015_3sum - Performance                        ║
╠════════════════════════════════════════════════╣
║ default:      ███████████████████  103ms       ║
║ two_pointers: ████████████████████ 109ms       ║
║ hashset:      ███████████████████  102ms       ║
║ hash:         ███████████████████  102ms       ║
╚════════════════════════════════════════════════╝
Method Avg Time Pass Rate Complexity
------------ ---------- ---------- ---------------------------------
default      102.81ms   3/3        O(n²) time, O(1) extra space
two_pointers 108.52ms   3/3        O(n²) time, O(1) extra space
hashset      102.21ms   3/3        O(n²) time, O(n) space for set
hash         102.39ms   3/3        O(n²) time, O(n) space
How to Interpret:
- All four approaches have similar O(n²) time complexity
- hashset and hash trade space for simpler deduplication logic
- Similar times indicate the test cases may be small; use --generate for stress testing
Example 3: Random Test Generation¶
Command:
python runner/test_runner.py 0215 --generate 5 --seed 42
Output:
🎲 Generator: 5 cases, seed: 42
--- tests/ (static) ---
0215_kth_largest_element_in_an_array_1: ✅ PASS [judge]
0215_kth_largest_element_in_an_array_2: ✅ PASS [judge]
0215_kth_largest_element_in_an_array_3: ✅ PASS [judge]
--- generators/ (5 cases, seed: 42) ---
gen_1: ✅ PASS [generated]
gen_2: ✅ PASS [generated]
gen_3: ✅ PASS [generated]
gen_4: ✅ PASS [generated]
gen_5: ✅ PASS [generated]
Result: 8 / 8 cases passed.
├─ Static: 3/3
└─ Generated: 5/5
How to Interpret:
- Static tests run first (from the tests/ directory)
- Generated tests use JUDGE_FUNC for validation (no expected output file)
- Seed 42 makes tests reproducible: same seed = same test cases
- Use --save-failed to capture failing generated cases for debugging
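A judge for this problem could simply recompute the answer from the input. The following is a hedged sketch, assuming JUDGE_FUNC receives the raw input and the solution's output as strings and that the .in file holds the numbers on line 1 and k on line 2; the real shapes are defined by the Solution Contract and Test File Format docs:

```python
# Hypothetical JUDGE_FUNC for 0215_kth_largest_element_in_an_array.
def judge(input_str: str, actual: str) -> bool:
    lines = input_str.strip().splitlines()
    nums = [int(x) for x in lines[0].split()]
    k = int(lines[1])
    expected = sorted(nums, reverse=True)[k - 1]  # recompute the answer
    return int(actual.strip()) == expected

JUDGE_FUNC = judge
```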
Example 4: Memory Trace Visualization¶
Command:
python runner/test_runner.py <problem> --memory-trace
Output:
How to Interpret:
- The sparkline shows memory usage progression over test cases
- Peak RSS is the maximum memory used across all runs
- P95 RSS is the 95th percentile, useful for identifying outliers
- Compare across methods with --trace-compare (requires multiple solutions)
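For reference, this is how the sparklines package renders a numeric series; an illustrative sketch with made-up values, not the runner's exact code:

```python
from sparklines import sparklines

rss_mb = [4.1, 4.3, 4.8, 4.6, 5.2, 4.9]  # per-case peak RSS in MB (made up)
for line in sparklines(rss_mb):
    print("memory:", line)  # e.g. memory: ▁▂▅▄█▆
```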
Example 5: Complexity Estimation (O(n) vs O(nΒ²))¶
This is the most impressive demonstration, showing the dramatic difference between O(n) and O(n²) algorithms.
Command:
python runner/test_runner.py 0011 --method two_pointers --estimate
python runner/test_runner.py 0011 --method bruteforce --estimate
Output (O(n) Two Pointers):
🔍 Estimating: two_pointers
n= 500: 0.34ms
n= 1000: 0.51ms
n= 2000: 1.24ms
n= 5000: 2.78ms
✅ Estimated: O(n)
Confidence: 1.00
Output (O(n²) Brute Force):
🔍 Estimating: bruteforce
n= 500: 554ms
n= 1000: 2,544ms
n= 2000: 10,697ms
n= 5000: 68,291ms  ← 68 seconds!
✅ Estimated: O(n²)
Confidence: 1.00
The Dramatic Difference:
| n | O(n) Two Pointers | O(n²) Brute Force | Ratio |
|---|---|---|---|
| 500 | 0.34ms | 554ms | ~1,629x |
| 1000 | 0.51ms | 2,544ms | ~4,988x |
| 5000 | 2.78ms | 68,291ms | ~24,565x |
How to Interpret:
- O(n): time doubles when n doubles (linear growth)
- O(n²): time quadruples when n doubles (quadratic growth)
- At n=5000, the O(n²) algorithm is roughly 24,565x slower
- This is why algorithm complexity matters for large inputs!
Estimation Tips:
- Works best when algorithm time dominates constant overhead
- For fast algorithms, use larger n values (5000+) for accurate estimation
- If estimated ≠ declared, the algorithm may have optimizations or the test sizes are too small
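Under the hood, --estimate relies on the big-O package. Below is a minimal sketch of the same idea; solve() and the data generator are stand-ins, since the runner wires in generate_for_complexity(n) output instead:

```python
import big_o

def solve(heights):
    return max(heights)  # an O(n) stand-in for the solution under test

data = lambda n: big_o.datagen.integers(n, 1, 10_000)
best, _ = big_o.big_o(solve, data, min_n=500, max_n=5_000, n_measures=10)
print(best)  # e.g. "Linear: time = 0.059 + 0.00054*n (sec)"
```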
Output Format¶
Test Results¶
case_1: ✅ PASS [exact]
case_2: ✅ PASS (12.34ms) [judge]
case_3: ❌ FAIL [exact]
  Expected: [0, 1]...
  Actual:   [1, 0]...
case_4: ⚠️ SKIP (missing .out, no JUDGE_FUNC)
Multi-Solution Comparison¶
When running --all --benchmark, the test runner displays a visual bar chart followed by a detailed comparison table:
Visual Bar Chart with Approach Legend:
╔═══════════════════════════════════════════════════╗
║ 0131_palindrome_partitioning - Performance        ║
╠═══════════════════════════════════════════════════╣
║ default: ████████████████████ 158ms               ║
║ naive:   ███████████████████  152ms               ║
╚═══════════════════════════════════════════════════╝
default → Backtracking with DP-Precomputed Palindrome Table
naive   → Backtracking with On-the-Fly Checking
The bar length is proportional to execution time (longest time = full bar). The approach descriptions are shown in a legend below the chart, parsed from class header comments.
Enhanced Method Header:
──────────────────────────────────────────────────
📌 Shorthand: default
   Approach: Backtracking with DP-Precomputed Palindrome Table
   Complexity: O(n × 2^n) time, O(n²) space
──────────────────────────────────────────────────
Note: On terminals that don't support Unicode, ASCII fallback characters are used.
Detailed Table:
======================================================================
Performance Comparison (Details)
======================================================================
Method Avg Time Pass Rate Complexity
----------- ---------- ---------- --------------------
default      158.17ms   2/2        O(n × 2^n) time, O(n²) space
naive        152.00ms   2/2        O(n × 2^n × n) time, O(n) space
default → Backtracking with DP-Precomputed Palindrome Table
naive   → Backtracking with On-the-Fly Checking
======================================================================
The approach descriptions are shown in a legend below the table, matching the format used in the visual bar chart.
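A solution file that produces this legend might carry a header like the following; the "Approach:" and "Complexity:" fields mirror the runner's output, but the exact comment format it parses is an assumption here (see the Solution Contract):

```python
# Hypothetical header in solutions/0131_palindrome_partitioning.py.
class Solution:
    """
    Approach: Backtracking with DP-Precomputed Palindrome Table
    Complexity: O(n × 2^n) time, O(n²) space
    """

    def partition(self, s: str) -> list[list[str]]:
        ...  # solution body omitted
```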
Multi-Solution Benchmark with Visual Charts¶
Use --all --benchmark to compare all solutions with visual performance charts:
python runner/test_runner.py 0215 --all --benchmark
This displays:
- Visual bar chart with execution times
- Approach legend (method → approach name)
- Detailed table showing pass rate and complexity
Example Output:
╔══════════════════════════════════════════════════════╗
║ 0215_kth_largest_element_in_an_array - Performance   ║
╠══════════════════════════════════════════════════════╣
║ default:     ████████████████████ 114ms              ║
║ quickselect: █████████████████    96ms               ║
║ heap:        ███████████████████  107ms              ║
╚══════════════════════════════════════════════════════╝
default     → Quickselect Algorithm
quickselect → Quickselect Algorithm
heap        → Heap-Based Solution
======================================================================
Performance Comparison (Details)
======================================================================
Method Avg Time Pass Rate Complexity
----------- ---------- ---------- --------------------
default 113.51ms 3/3 O(n) average time, O(1) space
quickselect 96.06ms 3/3 O(n) average time, O(1) space
heap 107.34ms 3/3 O(n log k) time, O(k) space
default     → Quickselect Algorithm
quickselect → Quickselect Algorithm
heap        → Heap-Based Solution
======================================================================
Requirements for Complexity Estimation:
- Generator must provide a generate_for_complexity(n) function
- Install the big-O package: pip install big-O
Standalone Complexity Estimation¶
For complexity estimation without benchmark comparison:
python runner/test_runner.py 0011 --estimate
🔍 Estimating: default
🔍 Running complexity estimation...
Mode: Direct call (Mock stdin, no subprocess overhead)
Sizes: [10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
Runs per size: 3
n= 500: 0.32ms (avg of 3 runs)
n= 1000: 0.69ms (avg of 3 runs)
n= 2000: 1.12ms (avg of 3 runs)
n= 5000: 2.78ms (avg of 3 runs)
✅ Estimated: O(n)
Confidence: 1.00
Details: Linear: time = 0.059 + 0.00054*n (sec)
More Complexity Examples¶
| Problem | Algorithm | Declared | Estimated | Confidence |
|---|---|---|---|---|
| 0239_sliding_window | Monotonic Deque | O(n) | O(n) | 1.00 |
| 0011_container (two_pointers) | Two Pointers | O(n) | O(n) | 1.00 |
| 0011_container (bruteforce) | Brute Force | O(n²) | O(n²) | 1.00 |
| 0042_trapping (twopointer) | Two Pointers | O(n) | O(n) | 1.00 |
Note: The estimator uses sizes up to n=5000, which provides accurate results for distinguishing O(n) from O(n²). At n=5000, an O(n²) algorithm takes ~24,000x longer than O(n)!
Validation Modes¶
| Mode | When Used |
|---|---|
| [judge] | JUDGE_FUNC + .out exists |
| [judge-only] | JUDGE_FUNC, no .out (generated tests) |
| [exact] | Default string comparison |
| [sorted] | COMPARE_MODE="sorted" |
| [set] | COMPARE_MODE="set" |
| [skip] | No .out, no JUDGE_FUNC |
📌 See Solution Contract § Validation for JUDGE_FUNC and COMPARE_MODE details.
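For illustration, a solution file might opt into order-insensitive comparison like this; the field name comes from the Solution Contract, and the semantics are assumed here:

```python
# Hypothetical snippet from a solution file selecting a validation mode.
# With "sorted", the runner compares output lines ignoring their order,
# which suits problems like 0015_3sum where triplet order is unspecified.
COMPARE_MODE = "sorted"
```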
Troubleshooting¶
| Error | Fix |
|---|---|
| No test input files found | Add tests/{problem}_*.in or use --generate |
| Solution method 'X' not found | Check SOLUTIONS dict in solution file |
| Generator requires JUDGE_FUNC | Add JUDGE_FUNC to solution |
| No generator found | Create generators/{problem}.py |
| big-O package not installed | pip install big-O |
Complete Reference¶
All Options¶
| Option | Short | Description |
|---|---|---|
| --method NAME | -m | Run specific solution |
| --all | -a | Run all solutions in SOLUTIONS |
| --benchmark | -b | Show execution time per case |
| --tests-dir DIR | -t | Custom tests directory (default: tests) |
| --generate N | -g | Static tests + N generated cases |
| --generate-only N | (none) | Skip static, generate N cases only |
| --seed N | -s | Reproducible generation |
| --save-failed | (none) | Save failed cases to tests/ |
| --estimate | -e | Estimate time complexity |
Advanced Combinations¶
# Full comparison: all methods, benchmarked, with generated tests
python runner/test_runner.py 0023 -a -b -g 50 -s 12345
# Stress test only (skip static tests)
python runner/test_runner.py 0023 --generate-only 100 --all
# Estimate complexity for all solutions
python runner/test_runner.py 0023 --all --estimate
# Full benchmark with complexity estimation (visual charts)
python runner/test_runner.py 0215 --all --benchmark --estimate
# Debug failed case with saved input
python runner/test_runner.py 0023 --generate 100 --save-failed
Output Details¶
Failed Generated Case Box:
gen_3: ❌ FAIL [generated]
┌─ Input ─────────────────────────────────
│ [1,3,5,7]
│ [2,4,6,8]
├─ Actual ────────────────────────────────
│ 4.5
└──────────────────────────────────────────
💾 Saved to: tests/0004_failed_1.in
Reproduction Hint (when using --seed):
Summary Breakdown (static + generated):
Result: 8 / 8 cases passed.
├─ Static: 3/3
└─ Generated: 5/5
Internal Behaviors¶
| Behavior | Description |
|---|---|
| Failed file exclusion | Files matching *_failed_*.in are excluded from normal test runs |
| Legacy mode | When no SOLUTIONS dict exists, runs single default solution |
| Exit codes | Exits with code 1 on missing tests, invalid method, or missing generator |
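The failed-file exclusion can be pictured as a glob filter; an illustrative sketch, not the runner's actual discovery code:

```python
# Inputs saved by --save-failed (e.g. tests/0004_failed_1.in) are
# skipped during normal test discovery.
import fnmatch
from pathlib import Path

def discover_inputs(tests_dir: str, problem: str) -> list[Path]:
    files = sorted(Path(tests_dir).glob(f"{problem}_*.in"))
    return [f for f in files if not fnmatch.fnmatch(f.name, "*_failed_*.in")]
```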
Case Runner¶
case_runner.py runs a single test case without comparison, which makes it ideal for debugging.
Example:
This runs solutions/0001_two_sum.py with input from tests/0001_two_sum_1.in and displays output directly (no pass/fail comparison).
VSCode Integration¶
Pre-configured tasks and debug configurations are provided in .vscode/.
- Ctrl+Shift+B: Run all tests for current problem (default build task)
- F5: Debug with breakpoints
π See VSCode Setup Guide for complete task/debug configuration reference.
Architecture¶
test_runner.py (CLI)
├── module_loader.py         # Load solution/generator modules
├── executor.py              # Execute test cases
├── reporter.py              # Format results
├── compare.py               # Output validation
└── complexity_estimator.py  # Big-O estimation
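module_loader.py plausibly loads solution modules dynamically with importlib; a sketch under that assumption, not the actual implementation:

```python
import importlib.util
from pathlib import Path

def load_solution(problem: str):
    path = Path("solutions") / f"{problem}.py"
    spec = importlib.util.spec_from_file_location(problem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the solution file's top level
    return module  # exposes SOLUTIONS, JUDGE_FUNC, COMPARE_MODE, ...
```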
Execution Methods¶
The test runner supports two execution methods:
Method 1: Virtual Environment (Recommended)¶
Use the project's virtual environment for isolated dependencies:
# Windows (PowerShell/CMD)
leetcode\Scripts\python.exe runner/test_runner.py 0023 --all --benchmark
# Linux/macOS
./leetcode/bin/python runner/test_runner.py 0023 --all --benchmark
Method 2: System Python¶
Use system Python directly (requires dependencies installed globally):
python runner/test_runner.py 0023 --all --benchmark
Dependencies¶
Required¶
- Python 3.11 (matching LeetCode official environment)
- Solution files in solutions/
- Test files in tests/ (or use generators)
Optional Packages¶
| Package | Feature | Install |
|---|---|---|
| big-O | Complexity estimation (--estimate) | pip install big-O |
| psutil | RSS memory profiling (--memory-trace, --trace-compare, --memory-per-case) | pip install psutil |
| sparklines | Memory trace visualization (sparkline charts) | pip install sparklines |
| tabulate | CLI table formatting | pip install tabulate |
Install all optional packages:
pip install big-O psutil sparklines tabulate
Memory Measurement Types¶
| Type | Source | Method | Description |
|---|---|---|---|
| RSS | Static/Generated tests | psutil (subprocess) | Full process memory including interpreter |
| Alloc | --estimate runs | tracemalloc (in-process) | Python allocations only |
Note: RSS and Alloc metrics are displayed separately in --memory-per-case output because they measure different things and are not directly comparable.
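The Alloc figures come from tracemalloc, which counts only Python-level allocations in the current process; a minimal illustration:

```python
import tracemalloc

tracemalloc.start()
data = [i * i for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()  # bytes of Python allocations
tracemalloc.stop()
print(f"peak alloc: {peak / 1e6:.1f} MB")  # excludes interpreter baseline (unlike RSS)
```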
Graceful Degradation¶
| Missing Package | Behavior |
|---|---|
big-O | --estimate ignored, complexity shown as "Unknown" |
psutil | RSS memory columns show "Unavailable", warning displayed |
sparklines | Falls back to simple ASCII visualization |
tabulate | Falls back to manual column formatting |
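This degradation follows the standard optional-import pattern; a sketch of the idea, not necessarily the runner's internals:

```python
try:
    import psutil
except ImportError:
    psutil = None  # feature degrades instead of crashing

def format_peak_rss(pid: int) -> str:
    if psutil is None:
        return "Unavailable"  # matches the documented fallback
    rss = psutil.Process(pid).memory_info().rss
    return f"{rss / 1e6:.1f}MB"
```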
Related Documentation¶
| Document | Content |
|---|---|
| Test File Format | Canonical .in/.out format specification |
| Solution Contract | SOLUTIONS, JUDGE_FUNC, COMPARE_MODE, file structure |
| Generator Contract | generate(), generate_for_complexity(), edge cases |
| Runner README | Quick reference (in-module) |
| VSCode Setup Guide | Tasks, debug configurations, workflow examples |
Documentation Maintenance¶
When modifying test_runner.py:
- Update this spec (docs/runner/README.md)
- Update the quick reference (runner/README.md)
- Update the docstring (runner/test_runner.py)
Maintainer: See Contributors