ecmult-multi-algorithm-selection

Sat Mar 28 2026

secp256k1_ecmult_multi computes multi-scalar multiplication: $R = s_G \cdot G + \sum_{i} s_i \cdot P_i$ . It automatically selects the best algorithm based on the memory budget (mem_limit) and the number of points (n_points). This note documents how that selection works.

Based on fjahr's ecmult refactor (PR #1789).

The ABCD Cost Model

Each algorithm is characterized by four constants:

Parameter	Meaning
A	Per-point working memory (bytes)
B	Fixed memory overhead (bytes)
C	Per-point time cost (abstract units)
D	Fixed time overhead (abstract units)

For a batch of x points:

Memory usage: $m(x) = A \cdot x + B$
Time cost: $c(x) = C \cdot x + D$
Optime (per-point cost): $\text{optime} = C + D / x$

As $x \to \infty$ , optime converges to $C$ . This is the algorithm's asymptotic floor.

The 14 Algorithms

Index	Algorithm	C	D	Notes
0	TRIVIAL	1000	0	No memory needed, very slow
1	STRAUSS	109	120	Efficient for small batches
2	Pippenger w1	197	403
3	Pippenger w2	148	590
4	Pippenger w3	117	877
5	Pippenger w4	100	1,340
6	Pippenger w5	86	2,187
7	Pippenger w6	75	3,703
8	Pippenger w7	66	6,324
9	Pippenger w8	61	10,681
10	Pippenger w9	56	19,223
11	Pippenger w10	53	36,521
12	Pippenger w11	49	71,369
13	Pippenger w12	46	130,576	Highest window, best asymptotic

Key pattern: Higher Pippenger windows have lower C (better asymptotic) but exponentially higher D (more fixed overhead to amortize) and B (more bucket memory).

Memory Size Formulas

Struct sizes on 64-bit non-VERIFY builds:

secp256k1_fe = 40 bytes (5 x uint64_t)
secp256k1_ge = 88 bytes (2 fe + int infinity + padding)
secp256k1_gej = 128 bytes (3 fe + int infinity + padding)
secp256k1_scalar = 32 bytes (4 x uint64_t)

Strauss

A = (sizeof(ge) + sizeof(fe)) * TABLE_SIZE(WINDOW_A) + sizeof(strauss_point_state) + sizeof(gej) + sizeof(scalar) = 1452 bytes
B = 0 (no fixed overhead)

Pippenger

Entry size(w) = sizeof(ge) + sizeof(scalar) + sizeof(pippenger_point_state) + WNAF_SIZE(w+1) * sizeof(int)
A (point size) = 2 * entry_size(w) (factor of 2 from GLV endomorphism)
B (fixed size) = sizeof(gej) * 2^w + 2 * entry_size(w)
- The sizeof(gej) * 2^w term is the bucket array -- this dominates at high windows

Window	A (bytes)	B (bytes)	Buckets
w1	784	1,040	2
w5	448	4,544	32
w7	400	16,784	128
w9	376	65,912	512
w11	360	262,504	2,048
w12	352	524,640	4,096

The Two Selection Functions

`ecmult_multi_batch_size(mem_limit)` -- Capacity Planning

Used at batch creation time to determine how many points can fit.

for each algorithm (skipping TRIVIAL):
    batch_size = (mem_limit - B) / A
    optime = C + D / batch_size
    pick the algorithm with minimum optime
return that algorithm's batch_size   (not the algorithm itself)

`ecmult_multi_select(mem_limit, n_points)` -- Runtime Selection

Used at verify time to pick the best algorithm for the actual number of points.

for each algorithm (including TRIVIAL):
    if A * n_points + B > mem_limit: skip
    optime = C + D / n_points
    pick the algorithm with minimum optime
return that algorithm

Key difference: batch_size optimizes for the best possible batch size within memory. select optimizes for the actual n_points being processed, which may be much smaller than capacity.

`ecmult_multi(...)` -- The Wrapper

Simply calls select(mem_limit, n_points) then internal(algo, points, scalars, ...). Does not split points into sub-batches internally. The caller is responsible for batching.

Capacity at Different Memory Budgets

mem_limit	capacity	algorithm	optime	C (floor)
256 KB	641	Pippenger w7	75	66
512 KB	1,252	Pippenger w8	69	61
1 MB	2,613	Pippenger w9	63	56
4 MB	11,477	Pippenger w11	55	49
16 MB	48,468	Pippenger w12	48	46
64 MB	~198K	Pippenger w12	46	46

At 64 MB, optime hits the C=46 floor. More memory beyond this gives no further improvement.

Algorithm Selection Trace (16 MB)

What actually gets selected as n_points varies:

n_points	Algorithm	Optime
2	Strauss	169
10	Strauss	121
20	Strauss	115
80	Strauss	110
90	Pip w5	110
100	Pip w5	107
150	Pip w6	99
300	Pip w6	87
500	Pip w7	78
1,000	Pip w8	71
2,000	Pip w9	65
5,000	Pip w9	59
10,000	Pip w10	56
20,000	Pip w11	52
48,468	Pip w12	48

The algorithm "cascades" through progressively higher Pippenger windows as n_points grows.

The Strauss-Pippenger Transition

Pippenger w_k beats Strauss when:

$C_k + D_k / n < C_\text{strauss} + D_\text{strauss} / n$

$n > \frac{D_k - D_\text{strauss}}{C_\text{strauss} - C_k}$

With $D_\text{strauss} = 120$ :

Pip w5 beats Strauss at n > 89 (first Pippenger window to win)
Pip w4 at n > 135, but it's dominated by w5 at that point

This creates a gap in the algorithm space: Strauss reaches its C=109 floor by n~~20, but no Pippenger window can beat it until n~~90. In this range, per-point cost barely improves.

The Three Regions of a Batch Speedup Curve

Region 1: Strauss Ramp (n = 2-20)

Optime drops from 169 to 115. D=120 is still being amortized. Speedup climbing.

Region 2: Strauss Plateau (n = 20-89)

Optime only drops from 115 to 110. D/n is already negligible -- Strauss is near its C=109 floor. No Pippenger window is competitive yet. This produces the visible "early shelf" in speedup graphs.

Cannot be improved without an algorithm that fills the gap between Strauss (C=109, D=120) and Pippenger w5 (C=86, D=2187) -- something with C~~95-100 and D~~400-800. No such algorithm currently exists.

Region 3: Pippenger Cascade (n = 90+)

Progressive window upgrades: w5 -> w6 -> w7 -> ... -> w12. Optime drops from 110 to 48 (at 16 MB capacity). Speedup climbing steeply.

Eventually hits the capacity ceiling from transparent verify (see below), producing a final flatline.

Transparent Verify and the Capacity Ceiling

In the batch verification module, batch_add_* functions check remaining capacity before adding points:

/* schnorrsig: 2 points per signature */
if (batch->capacity - batch->len < BATCH_SCHNORRSIG_SCRATCH_OBJS) {
    tverify = secp256k1_batch_verify(ctx, batch);  /* verify + reset */
    if (tverify == 0) return;
}

/* tweak check: 1 point per check */
if (batch->capacity - batch->len < BATCH_TWEAK_CHECK_SCRATCH_OBJS) {
    tverify = secp256k1_batch_verify(ctx, batch);
    if (tverify == 0) return;
}

When the batch is full, it automatically verifies (calling ecmult_multi with capacity points) and resets len = 0. This means each ecmult_multi call processes at most capacity points, and per-point cost is fixed at the optime for that capacity. This produces the final flatline in speedup graphs.

Flatline points:

Schnorrsig: flatlines at capacity / 2 signatures (2 points per sig)
Tweak check: flatlines at capacity checks (1 point per check)

The Three Ceilings on Batch Speedup

Ceiling	Cause	Can be raised?
Capacity ceiling	`mem_limit` determines max points per `ecmult_multi` call	Yes -- increase `mem_limit`
Algorithm ceiling	Pippenger w12 is the highest window; C=46 is the floor	Would need w13+ (exponential memory cost)
Overhead ceiling	Non-ecmult per-item costs (hashing, point decompression, randomizer generation) don't benefit from batching	No -- fundamental to the verification math

Observed Speedups (Batch Verification PoC)

Using the ecmult refactor integration on the new-ecmult-batch branch:

mem_limit	Schnorrsig speedup	Tweak check speedup
256 KB	~1.55x (flatlines after ~320 sigs)	~1.55x (flatlines after ~640 checks)
4 MB	~1.78x (flatlines after ~5,700 sigs)	~1.68x (flatlines after ~11,400 checks)
16 MB	~1.95x (flatlines after ~24,000 sigs)	~1.73x (flatlines after ~48,000 checks)

For comparison, the original batch PR (Strauss only) achieves ~1.20x for schnorrsig and ~1.50x for tweak checks.

Side Note: Exploring D=274 for Strauss

The current code on the new-ecmult-batch branch has Strauss D=274 (higher than D=120). This makes the early flatline (Region 2) less flat because Strauss takes longer to converge to its C=109 floor:

D=120: optime at n=20 is 115, at n=80 is 110 (delta = 5)
D=274: optime at n=20 is 122, at n=80 is 112 (delta = 10)

The Pip w5 transition also shifts earlier (n > 83 instead of n > 89). Whether D=274 leads to measurably better real-world speedup in the transition region is still being explored.