ecmult-multi-algorithm-selection

Sat Mar 28 2026

secp256k1_ecmult_multi computes multi-scalar multiplication: R=sGG+isiPiR = s_G \cdot G + \sum_{i} s_i \cdot P_i. It automatically selects the best algorithm based on the memory budget (mem_limit) and the number of points (n_points). This note documents how that selection works.

Based on fjahr's ecmult refactor (PR #1789).

The ABCD Cost Model

Each algorithm is characterized by four constants:

ParameterMeaning
APer-point working memory (bytes)
BFixed memory overhead (bytes)
CPer-point time cost (abstract units)
DFixed time overhead (abstract units)

For a batch of x points:

  • Memory usage: m(x)=Ax+Bm(x) = A \cdot x + B
  • Time cost: c(x)=Cx+Dc(x) = C \cdot x + D
  • Optime (per-point cost): optime=C+D/x\text{optime} = C + D / x

As xx \to \infty, optime converges to CC. This is the algorithm's asymptotic floor.

The 14 Algorithms

IndexAlgorithmCDNotes
0TRIVIAL10000No memory needed, very slow
1STRAUSS109120Efficient for small batches
2Pippenger w1197403
3Pippenger w2148590
4Pippenger w3117877
5Pippenger w41001,340
6Pippenger w5862,187
7Pippenger w6753,703
8Pippenger w7666,324
9Pippenger w86110,681
10Pippenger w95619,223
11Pippenger w105336,521
12Pippenger w114971,369
13Pippenger w1246130,576Highest window, best asymptotic

Key pattern: Higher Pippenger windows have lower C (better asymptotic) but exponentially higher D (more fixed overhead to amortize) and B (more bucket memory).

Memory Size Formulas

Struct sizes on 64-bit non-VERIFY builds:

  • secp256k1_fe = 40 bytes (5 x uint64_t)
  • secp256k1_ge = 88 bytes (2 fe + int infinity + padding)
  • secp256k1_gej = 128 bytes (3 fe + int infinity + padding)
  • secp256k1_scalar = 32 bytes (4 x uint64_t)

Strauss

  • A = (sizeof(ge) + sizeof(fe)) * TABLE_SIZE(WINDOW_A) + sizeof(strauss_point_state) + sizeof(gej) + sizeof(scalar) = 1452 bytes
  • B = 0 (no fixed overhead)

Pippenger

  • Entry size(w) = sizeof(ge) + sizeof(scalar) + sizeof(pippenger_point_state) + WNAF_SIZE(w+1) * sizeof(int)
  • A (point size) = 2 * entry_size(w) (factor of 2 from GLV endomorphism)
  • B (fixed size) = sizeof(gej) * 2^w + 2 * entry_size(w)
    • The sizeof(gej) * 2^w term is the bucket array -- this dominates at high windows
WindowA (bytes)B (bytes)Buckets
w17841,0402
w54484,54432
w740016,784128
w937665,912512
w11360262,5042,048
w12352524,6404,096

The Two Selection Functions

ecmult_multi_batch_size(mem_limit) -- Capacity Planning

Used at batch creation time to determine how many points can fit.

for each algorithm (skipping TRIVIAL):
    batch_size = (mem_limit - B) / A
    optime = C + D / batch_size
    pick the algorithm with minimum optime
return that algorithm's batch_size   (not the algorithm itself)

ecmult_multi_select(mem_limit, n_points) -- Runtime Selection

Used at verify time to pick the best algorithm for the actual number of points.

for each algorithm (including TRIVIAL):
    if A * n_points + B > mem_limit: skip
    optime = C + D / n_points
    pick the algorithm with minimum optime
return that algorithm

Key difference: batch_size optimizes for the best possible batch size within memory. select optimizes for the actual n_points being processed, which may be much smaller than capacity.

ecmult_multi(...) -- The Wrapper

Simply calls select(mem_limit, n_points) then internal(algo, points, scalars, ...). Does not split points into sub-batches internally. The caller is responsible for batching.

Capacity at Different Memory Budgets

mem_limitcapacityalgorithmoptimeC (floor)
256 KB641Pippenger w77566
512 KB1,252Pippenger w86961
1 MB2,613Pippenger w96356
4 MB11,477Pippenger w115549
16 MB48,468Pippenger w124846
64 MB~198KPippenger w124646

At 64 MB, optime hits the C=46 floor. More memory beyond this gives no further improvement.

Algorithm Selection Trace (16 MB)

What actually gets selected as n_points varies:

n_pointsAlgorithmOptime
2Strauss169
10Strauss121
20Strauss115
80Strauss110
90Pip w5110
100Pip w5107
150Pip w699
300Pip w687
500Pip w778
1,000Pip w871
2,000Pip w965
5,000Pip w959
10,000Pip w1056
20,000Pip w1152
48,468Pip w1248

The algorithm "cascades" through progressively higher Pippenger windows as n_points grows.

The Strauss-Pippenger Transition

Pippenger w_k beats Strauss when:

Ck+Dk/n<Cstrauss+Dstrauss/nC_k + D_k / n < C_\text{strauss} + D_\text{strauss} / n

n>DkDstraussCstraussCkn > \frac{D_k - D_\text{strauss}}{C_\text{strauss} - C_k}

With Dstrauss=120D_\text{strauss} = 120:

  • Pip w5 beats Strauss at n > 89 (first Pippenger window to win)
  • Pip w4 at n > 135, but it's dominated by w5 at that point

This creates a gap in the algorithm space: Strauss reaches its C=109 floor by n20, but no Pippenger window can beat it until n90. In this range, per-point cost barely improves.

The Three Regions of a Batch Speedup Curve

Region 1: Strauss Ramp (n = 2-20)

Optime drops from 169 to 115. D=120 is still being amortized. Speedup climbing.

Region 2: Strauss Plateau (n = 20-89)

Optime only drops from 115 to 110. D/n is already negligible -- Strauss is near its C=109 floor. No Pippenger window is competitive yet. This produces the visible "early shelf" in speedup graphs.

Cannot be improved without an algorithm that fills the gap between Strauss (C=109, D=120) and Pippenger w5 (C=86, D=2187) -- something with C95-100 and D400-800. No such algorithm currently exists.

Region 3: Pippenger Cascade (n = 90+)

Progressive window upgrades: w5 -> w6 -> w7 -> ... -> w12. Optime drops from 110 to 48 (at 16 MB capacity). Speedup climbing steeply.

Eventually hits the capacity ceiling from transparent verify (see below), producing a final flatline.

Transparent Verify and the Capacity Ceiling

In the batch verification module, batch_add_* functions check remaining capacity before adding points:

/* schnorrsig: 2 points per signature */
if (batch->capacity - batch->len < BATCH_SCHNORRSIG_SCRATCH_OBJS) {
    tverify = secp256k1_batch_verify(ctx, batch);  /* verify + reset */
    if (tverify == 0) return;
}

/* tweak check: 1 point per check */
if (batch->capacity - batch->len < BATCH_TWEAK_CHECK_SCRATCH_OBJS) {
    tverify = secp256k1_batch_verify(ctx, batch);
    if (tverify == 0) return;
}

When the batch is full, it automatically verifies (calling ecmult_multi with capacity points) and resets len = 0. This means each ecmult_multi call processes at most capacity points, and per-point cost is fixed at the optime for that capacity. This produces the final flatline in speedup graphs.

Flatline points:

  • Schnorrsig: flatlines at capacity / 2 signatures (2 points per sig)
  • Tweak check: flatlines at capacity checks (1 point per check)

The Three Ceilings on Batch Speedup

CeilingCauseCan be raised?
Capacity ceilingmem_limit determines max points per ecmult_multi callYes -- increase mem_limit
Algorithm ceilingPippenger w12 is the highest window; C=46 is the floorWould need w13+ (exponential memory cost)
Overhead ceilingNon-ecmult per-item costs (hashing, point decompression, randomizer generation) don't benefit from batchingNo -- fundamental to the verification math

Observed Speedups (Batch Verification PoC)

Using the ecmult refactor integration on the new-ecmult-batch branch:

mem_limitSchnorrsig speedupTweak check speedup
256 KB~1.55x (flatlines after ~320 sigs)~1.55x (flatlines after ~640 checks)
4 MB~1.78x (flatlines after ~5,700 sigs)~1.68x (flatlines after ~11,400 checks)
16 MB~1.95x (flatlines after ~24,000 sigs)~1.73x (flatlines after ~48,000 checks)

For comparison, the original batch PR (Strauss only) achieves ~1.20x for schnorrsig and ~1.50x for tweak checks.

Side Note: Exploring D=274 for Strauss

The current code on the new-ecmult-batch branch has Strauss D=274 (higher than D=120). This makes the early flatline (Region 2) less flat because Strauss takes longer to converge to its C=109 floor:

  • D=120: optime at n=20 is 115, at n=80 is 110 (delta = 5)
  • D=274: optime at n=20 is 122, at n=80 is 112 (delta = 10)

The Pip w5 transition also shifts earlier (n > 83 instead of n > 89). Whether D=274 leads to measurably better real-world speedup in the transition region is still being explored.