ecmult-multi-algorithm-selection
secp256k1_ecmult_multi computes multi-scalar multiplication: . It automatically selects the best algorithm based on the memory budget (mem_limit) and the number of points (n_points). This note documents how that selection works.
Based on fjahr's ecmult refactor (PR #1789).
The ABCD Cost Model
Each algorithm is characterized by four constants:
| Parameter | Meaning |
|---|---|
| A | Per-point working memory (bytes) |
| B | Fixed memory overhead (bytes) |
| C | Per-point time cost (abstract units) |
| D | Fixed time overhead (abstract units) |
For a batch of x points:
- Memory usage:
- Time cost:
- Optime (per-point cost):
As , optime converges to . This is the algorithm's asymptotic floor.
The 14 Algorithms
| Index | Algorithm | C | D | Notes |
|---|---|---|---|---|
| 0 | TRIVIAL | 1000 | 0 | No memory needed, very slow |
| 1 | STRAUSS | 109 | 120 | Efficient for small batches |
| 2 | Pippenger w1 | 197 | 403 | |
| 3 | Pippenger w2 | 148 | 590 | |
| 4 | Pippenger w3 | 117 | 877 | |
| 5 | Pippenger w4 | 100 | 1,340 | |
| 6 | Pippenger w5 | 86 | 2,187 | |
| 7 | Pippenger w6 | 75 | 3,703 | |
| 8 | Pippenger w7 | 66 | 6,324 | |
| 9 | Pippenger w8 | 61 | 10,681 | |
| 10 | Pippenger w9 | 56 | 19,223 | |
| 11 | Pippenger w10 | 53 | 36,521 | |
| 12 | Pippenger w11 | 49 | 71,369 | |
| 13 | Pippenger w12 | 46 | 130,576 | Highest window, best asymptotic |
Key pattern: Higher Pippenger windows have lower C (better asymptotic) but exponentially higher D (more fixed overhead to amortize) and B (more bucket memory).
Memory Size Formulas
Struct sizes on 64-bit non-VERIFY builds:
secp256k1_fe= 40 bytes (5 xuint64_t)secp256k1_ge= 88 bytes (2 fe +int infinity+ padding)secp256k1_gej= 128 bytes (3 fe +int infinity+ padding)secp256k1_scalar= 32 bytes (4 xuint64_t)
Strauss
- A =
(sizeof(ge) + sizeof(fe)) * TABLE_SIZE(WINDOW_A) + sizeof(strauss_point_state) + sizeof(gej) + sizeof(scalar)= 1452 bytes - B = 0 (no fixed overhead)
Pippenger
- Entry size(w) =
sizeof(ge) + sizeof(scalar) + sizeof(pippenger_point_state) + WNAF_SIZE(w+1) * sizeof(int) - A (point size) =
2 * entry_size(w)(factor of 2 from GLV endomorphism) - B (fixed size) =
sizeof(gej) * 2^w + 2 * entry_size(w)- The
sizeof(gej) * 2^wterm is the bucket array -- this dominates at high windows
- The
| Window | A (bytes) | B (bytes) | Buckets |
|---|---|---|---|
| w1 | 784 | 1,040 | 2 |
| w5 | 448 | 4,544 | 32 |
| w7 | 400 | 16,784 | 128 |
| w9 | 376 | 65,912 | 512 |
| w11 | 360 | 262,504 | 2,048 |
| w12 | 352 | 524,640 | 4,096 |
The Two Selection Functions
ecmult_multi_batch_size(mem_limit) -- Capacity Planning
Used at batch creation time to determine how many points can fit.
for each algorithm (skipping TRIVIAL):
batch_size = (mem_limit - B) / A
optime = C + D / batch_size
pick the algorithm with minimum optime
return that algorithm's batch_size (not the algorithm itself)
ecmult_multi_select(mem_limit, n_points) -- Runtime Selection
Used at verify time to pick the best algorithm for the actual number of points.
for each algorithm (including TRIVIAL):
if A * n_points + B > mem_limit: skip
optime = C + D / n_points
pick the algorithm with minimum optime
return that algorithm
Key difference: batch_size optimizes for the best possible batch size within memory. select optimizes for the actual n_points being processed, which may be much smaller than capacity.
ecmult_multi(...) -- The Wrapper
Simply calls select(mem_limit, n_points) then internal(algo, points, scalars, ...). Does not split points into sub-batches internally. The caller is responsible for batching.
Capacity at Different Memory Budgets
| mem_limit | capacity | algorithm | optime | C (floor) |
|---|---|---|---|---|
| 256 KB | 641 | Pippenger w7 | 75 | 66 |
| 512 KB | 1,252 | Pippenger w8 | 69 | 61 |
| 1 MB | 2,613 | Pippenger w9 | 63 | 56 |
| 4 MB | 11,477 | Pippenger w11 | 55 | 49 |
| 16 MB | 48,468 | Pippenger w12 | 48 | 46 |
| 64 MB | ~198K | Pippenger w12 | 46 | 46 |
At 64 MB, optime hits the C=46 floor. More memory beyond this gives no further improvement.
Algorithm Selection Trace (16 MB)
What actually gets selected as n_points varies:
| n_points | Algorithm | Optime |
|---|---|---|
| 2 | Strauss | 169 |
| 10 | Strauss | 121 |
| 20 | Strauss | 115 |
| 80 | Strauss | 110 |
| 90 | Pip w5 | 110 |
| 100 | Pip w5 | 107 |
| 150 | Pip w6 | 99 |
| 300 | Pip w6 | 87 |
| 500 | Pip w7 | 78 |
| 1,000 | Pip w8 | 71 |
| 2,000 | Pip w9 | 65 |
| 5,000 | Pip w9 | 59 |
| 10,000 | Pip w10 | 56 |
| 20,000 | Pip w11 | 52 |
| 48,468 | Pip w12 | 48 |
The algorithm "cascades" through progressively higher Pippenger windows as n_points grows.
The Strauss-Pippenger Transition
Pippenger w_k beats Strauss when:
With :
- Pip w5 beats Strauss at n > 89 (first Pippenger window to win)
- Pip w4 at n > 135, but it's dominated by w5 at that point
This creates a gap in the algorithm space: Strauss reaches its C=109 floor by n20, but no Pippenger window can beat it until n90. In this range, per-point cost barely improves.
The Three Regions of a Batch Speedup Curve
Region 1: Strauss Ramp (n = 2-20)
Optime drops from 169 to 115. D=120 is still being amortized. Speedup climbing.
Region 2: Strauss Plateau (n = 20-89)
Optime only drops from 115 to 110. D/n is already negligible -- Strauss is near its C=109 floor. No Pippenger window is competitive yet. This produces the visible "early shelf" in speedup graphs.
Cannot be improved without an algorithm that fills the gap between Strauss (C=109, D=120) and Pippenger w5 (C=86, D=2187) -- something with C95-100 and D400-800. No such algorithm currently exists.
Region 3: Pippenger Cascade (n = 90+)
Progressive window upgrades: w5 -> w6 -> w7 -> ... -> w12. Optime drops from 110 to 48 (at 16 MB capacity). Speedup climbing steeply.
Eventually hits the capacity ceiling from transparent verify (see below), producing a final flatline.
Transparent Verify and the Capacity Ceiling
In the batch verification module, batch_add_* functions check remaining capacity before adding points:
/* schnorrsig: 2 points per signature */
if (batch->capacity - batch->len < BATCH_SCHNORRSIG_SCRATCH_OBJS) {
tverify = secp256k1_batch_verify(ctx, batch); /* verify + reset */
if (tverify == 0) return;
}
/* tweak check: 1 point per check */
if (batch->capacity - batch->len < BATCH_TWEAK_CHECK_SCRATCH_OBJS) {
tverify = secp256k1_batch_verify(ctx, batch);
if (tverify == 0) return;
}
When the batch is full, it automatically verifies (calling ecmult_multi with capacity points) and resets len = 0. This means each ecmult_multi call processes at most capacity points, and per-point cost is fixed at the optime for that capacity. This produces the final flatline in speedup graphs.
Flatline points:
- Schnorrsig: flatlines at
capacity / 2signatures (2 points per sig) - Tweak check: flatlines at
capacitychecks (1 point per check)
The Three Ceilings on Batch Speedup
| Ceiling | Cause | Can be raised? |
|---|---|---|
| Capacity ceiling | mem_limit determines max points per ecmult_multi call | Yes -- increase mem_limit |
| Algorithm ceiling | Pippenger w12 is the highest window; C=46 is the floor | Would need w13+ (exponential memory cost) |
| Overhead ceiling | Non-ecmult per-item costs (hashing, point decompression, randomizer generation) don't benefit from batching | No -- fundamental to the verification math |
Observed Speedups (Batch Verification PoC)
Using the ecmult refactor integration on the new-ecmult-batch branch:
| mem_limit | Schnorrsig speedup | Tweak check speedup |
|---|---|---|
| 256 KB | ~1.55x (flatlines after ~320 sigs) | ~1.55x (flatlines after ~640 checks) |
| 4 MB | ~1.78x (flatlines after ~5,700 sigs) | ~1.68x (flatlines after ~11,400 checks) |
| 16 MB | ~1.95x (flatlines after ~24,000 sigs) | ~1.73x (flatlines after ~48,000 checks) |
For comparison, the original batch PR (Strauss only) achieves ~1.20x for schnorrsig and ~1.50x for tweak checks.
Side Note: Exploring D=274 for Strauss
The current code on the new-ecmult-batch branch has Strauss D=274 (higher than D=120). This makes the early flatline (Region 2) less flat because Strauss takes longer to converge to its C=109 floor:
- D=120: optime at n=20 is 115, at n=80 is 110 (delta = 5)
- D=274: optime at n=20 is 122, at n=80 is 112 (delta = 10)
The Pip w5 transition also shifts earlier (n > 83 instead of n > 89). Whether D=274 leads to measurably better real-world speedup in the transition region is still being explored.