Implement BN_MONT_CTX_new_consttime with Montgomery reduction

Setting up Montgomery reduction requires computing RR, a larger power of
2 mod N. When N is secret (RSA primes), we currently start at
2^(n_bits-1), then iteratively double and reduce.

Instead, once we reach 2R = 2^(r_bits+1) or higher, we can switch to a
Montgomery square-and-multiply. (Montgomery reduction only needs n0. RR
is just for conversion.) This takes some tuning because, at low powers
of 2 (in Montgomery form), it is still more efficient to square by
doubling. I ran benchmarks for 32-bit and 64-bit, x86 and Arm, on the
machines I had available and picked a threshold that works decently
well.

(On the hardware I tested, it's the right threshold on all but the Pixel
5A. The 5A would ideally want a slightly higher threshold---it seems to
be worse at multiplying or better at addition, but the gap isn't that
large, and this operation isn't perf-sensitive anyway.)

The result is dramatically faster than the old shift-based approach.
That said, see I06f4a065fdecc1aec3160fe32a41e200538d1ee3 for discussion
on this operation. These speedups are not expected to translate to
increased RSA throughput. They just clear up some initialization work.
This speedup is not quite enough to match the division-based
variable-time one (perf-sensitive for RSA verification), so we'll keep
both codepaths around.

M1 Max
Before:
Did 712000 256-bit mont (constime) operations in 2000454us (355919.2 ops/sec)
Did 440000 384-bit mont (constime) operations in 2001121us (219876.8 ops/sec)
Did 259000 512-bit mont (constime) operations in 2003709us (129260.3 ops/sec)
Did 212000 521-bit mont (constime) operations in 2007033us (105628.6 ops/sec)
Did 107000 1024-bit mont (constime) operations in 2018551us (53008.3 ops/sec)
Did 57000 1536-bit mont (constime) operations in 2001027us (28485.4 ops/sec)
Did 37000 2048-bit mont (constime) operations in 2039631us (18140.5 ops/sec)
Did 20000 3072-bit mont (constime) operations in 2041163us (9798.3 ops/sec)
Did 11760 4096-bit mont (constime) operations in 2007195us (5858.9 ops/sec)
After:
Did 3996000 256-bit mont (constime) operations in 2000366us (1997634.4 ops/sec) [+461.3%]
Did 2687000 384-bit mont (constime) operations in 2000464us (1343188.4 ops/sec) [+510.9%]
Did 2615000 512-bit mont (constime) operations in 2000146us (1307404.6 ops/sec) [+911.5%]
Did 1029000 521-bit mont (constime) operations in 2000944us (514257.3 ops/sec) [+386.9%]
Did 1246000 1024-bit mont (constime) operations in 2000899us (622720.1 ops/sec) [+1074.8%]
Did 688000 1536-bit mont (constime) operations in 2000579us (343900.4 ops/sec) [+1107.3%]
Did 425000 2048-bit mont (constime) operations in 2003622us (212115.9 ops/sec) [+1069.3%]
Did 212000 3072-bit mont (constime) operations in 2004430us (105765.7 ops/sec) [+979.4%]
Did 125000 4096-bit mont (constime) operations in 2009677us (62199.0 ops/sec) [+961.6%]

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
Before:
Did 781000 256-bit mont (constime) operations in 2000740us (390355.6 ops/sec)
Did 414000 384-bit mont (constime) operations in 2000180us (206981.4 ops/sec)
Did 258000 512-bit mont (constime) operations in 2001729us (128888.6 ops/sec)
Did 194000 521-bit mont (constime) operations in 2008814us (96574.4 ops/sec)
Did 79000 1024-bit mont (constime) operations in 2009309us (39317.0 ops/sec)
Did 36000 1536-bit mont (constime) operations in 2003945us (17964.6 ops/sec)
Did 21000 2048-bit mont (constime) operations in 2074987us (10120.5 ops/sec)
Did 9040 3072-bit mont (constime) operations in 2003869us (4511.3 ops/sec)
Did 5250 4096-bit mont (constime) operations in 2067796us (2538.9 ops/sec)
After:
Did 3496000 256-bit mont (constime) operations in 2000542us (1747526.4 ops/sec) [+347.7%]
Did 2466000 384-bit mont (constime) operations in 2000327us (1232798.4 ops/sec) [+495.6%]
Did 2392000 512-bit mont (constime) operations in 2000732us (1195562.4 ops/sec) [+827.6%]
Did 908000 521-bit mont (constime) operations in 2001181us (453732.1 ops/sec) [+369.8%]
Did 1054000 1024-bit mont (constime) operations in 2001429us (526623.7 ops/sec) [+1239.4%]
Did 548000 1536-bit mont (constime) operations in 2002417us (273669.3 ops/sec) [+1423.4%]
Did 339000 2048-bit mont (constime) operations in 2004127us (169151.0 ops/sec) [+1571.4%]
Did 162000 3072-bit mont (constime) operations in 2008221us (80668.4 ops/sec) [+1688.2%]
Did 94000 4096-bit mont (constime) operations in 2013848us (46676.8 ops/sec) [+1738.4%]

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 32-bit mode
Before:
Did 335000 256-bit mont (constime) operations in 2000006us (167499.5 ops/sec)
Did 170000 384-bit mont (constime) operations in 2010398us (84560.4 ops/sec)
Did 102000 512-bit mont (constime) operations in 2013510us (50657.8 ops/sec)
Did 88000 521-bit mont (constime) operations in 2022909us (43501.7 ops/sec)
Did 27000 1024-bit mont (constime) operations in 2063490us (13084.6 ops/sec)
Did 11760 1536-bit mont (constime) operations in 2000600us (5878.2 ops/sec)
Did 6825 2048-bit mont (constime) operations in 2069343us (3298.1 ops/sec)
Did 2982 3072-bit mont (constime) operations in 2090651us (1426.3 ops/sec)
Did 1680 4096-bit mont (constime) operations in 2074824us (809.7 ops/sec)
After:
Did 1559000 256-bit mont (constime) operations in 2000884us (779155.6 ops/sec) [+365.2%]
Did 940000 384-bit mont (constime) operations in 2001511us (469645.2 ops/sec) [+455.4%]
Did 608000 512-bit mont (constime) operations in 2000380us (303942.3 ops/sec) [+500.0%]
Did 439000 521-bit mont (constime) operations in 2004282us (219031.1 ops/sec) [+403.5%]
Did 180000 1024-bit mont (constime) operations in 2005427us (89756.4 ops/sec) [+586.0%]
Did 85000 1536-bit mont (constime) operations in 2017009us (42141.6 ops/sec) [+616.9%]
Did 49000 2048-bit mont (constime) operations in 2035401us (24073.9 ops/sec) [+629.9%]
Did 22000 3072-bit mont (constime) operations in 2047404us (10745.3 ops/sec) [+653.3%]
Did 12642 4096-bit mont (constime) operations in 2094210us (6036.6 ops/sec) [+645.5%]

Pixel 5A:
Before:
Did 483000 256-bit mont (constime) operations in 2001460us (241323.8 ops/sec)
Did 279000 384-bit mont (constime) operations in 2004682us (139174.2 ops/sec)
Did 198000 512-bit mont (constime) operations in 2003995us (98802.6 ops/sec)
Did 141000 521-bit mont (constime) operations in 2006305us (70278.4 ops/sec)
Did 62000 1024-bit mont (constime) operations in 2022138us (30660.6 ops/sec)
Did 29000 1536-bit mont (constime) operations in 2007150us (14448.3 ops/sec)
Did 17376 2048-bit mont (constime) operations in 2044894us (8497.3 ops/sec)
Did 7686 3072-bit mont (constime) operations in 2011537us (3821.0 ops/sec)
Did 4620 4096-bit mont (constime) operations in 2048780us (2255.0 ops/sec)
After:
Did 1187000 256-bit mont (constime) operations in 2000099us (593470.6 ops/sec) [+145.9%]
Did 794000 384-bit mont (constime) operations in 2002162us (396571.3 ops/sec) [+184.9%]
Did 658000 512-bit mont (constime) operations in 2002808us (328538.7 ops/sec) [+232.5%]
Did 373000 521-bit mont (constime) operations in 2005135us (186022.4 ops/sec) [+164.7%]
Did 231000 1024-bit mont (constime) operations in 2008117us (115033.1 ops/sec) [+275.2%]
Did 112000 1536-bit mont (constime) operations in 2003151us (55911.9 ops/sec) [+287.0%]
Did 66000 2048-bit mont (constime) operations in 2022295us (32636.2 ops/sec) [+284.1%]
Did 30000 3072-bit mont (constime) operations in 2006199us (14953.7 ops/sec) [+291.4%]
Did 17182 4096-bit mont (constime) operations in 2017938us (8514.6 ops/sec) [+277.6%]

Pixel 5A, 32-bit mode:
Before:
Did 124000 256-bit mont (constime) operations in 2013082us (61597.1 ops/sec)
Did 66000 384-bit mont (constime) operations in 2024604us (32599.0 ops/sec)
Did 40000 512-bit mont (constime) operations in 2018560us (19816.1 ops/sec)
Did 38000 521-bit mont (constime) operations in 2043776us (18593.0 ops/sec)
Did 11466 1024-bit mont (constime) operations in 2010767us (5702.3 ops/sec)
Did 5481 1536-bit mont (constime) operations in 2061892us (2658.2 ops/sec)
Did 3171 2048-bit mont (constime) operations in 2075359us (1527.9 ops/sec)
Did 1407 3072-bit mont (constime) operations in 2032032us (692.4 ops/sec)
Did 819 4096-bit mont (constime) operations in 2070367us (395.6 ops/sec)
After:
Did 718000 256-bit mont (constime) operations in 2000496us (358911.0 ops/sec) [+482.7%]
Did 424000 384-bit mont (constime) operations in 2000523us (211944.6 ops/sec) [+550.2%]
Did 401000 512-bit mont (constime) operations in 2000933us (200406.5 ops/sec) [+911.3%]
Did 205000 521-bit mont (constime) operations in 2004212us (102284.6 ops/sec) [+450.1%]
Did 153000 1024-bit mont (constime) operations in 2004644us (76322.8 ops/sec) [+1238.5%]
Did 78000 1536-bit mont (constime) operations in 2007510us (38854.1 ops/sec) [+1361.6%]
Did 47000 2048-bit mont (constime) operations in 2018015us (23290.2 ops/sec) [+1424.3%]
Did 22848 3072-bit mont (constime) operations in 2079082us (10989.5 ops/sec) [+1487.1%]
Did 13156 4096-bit mont (constime) operations in 2067424us (6363.5 ops/sec) [+1508.6%]

Bug: 316
Change-Id: I402df85170cae780442225eaa879884e707ffa86
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60686
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
This commit is contained in:
David Benjamin 2022-01-06 19:18:45 -05:00 committed by Boringssl LUCI CQ
parent 98e1227cb7
commit 02d2715bcc
3 changed files with 60 additions and 32 deletions

View File

@ -431,12 +431,11 @@ void bn_power5(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *table,
uint64_t bn_mont_n0(const BIGNUM *n);
// bn_mod_exp_base_2_consttime calculates r = 2**p (mod n). |p| must be larger
// than log_2(n); i.e. 2**p must be larger than |n|. |n| must be positive and
// odd. |p| and the bit width of |n| are assumed public, but |n| is otherwise
// treated as secret.
int bn_mod_exp_base_2_consttime(BIGNUM *r, unsigned p, const BIGNUM *n,
BN_CTX *ctx);
// bn_mont_ctx_set_RR_consttime initializes |mont->RR|. It returns one on
// success and zero on error. |mont->N| and |mont->n0| must have been
// initialized already. The bit width of |mont->N| is assumed public, but
// |mont->N| is otherwise treated as secret.
int bn_mont_ctx_set_RR_consttime(BN_MONT_CTX *mont, BN_CTX *ctx);
#if defined(_MSC_VER)
#if defined(OPENSSL_X86_64)

View File

@ -248,19 +248,12 @@ BN_MONT_CTX *BN_MONT_CTX_new_for_modulus(const BIGNUM *mod, BN_CTX *ctx) {
BN_MONT_CTX *BN_MONT_CTX_new_consttime(const BIGNUM *mod, BN_CTX *ctx) {
BN_MONT_CTX *mont = BN_MONT_CTX_new();
if (mont == NULL ||
!bn_mont_ctx_set_N_and_n0(mont, mod)) {
goto err;
}
unsigned lgBigR = mont->N.width * BN_BITS2;
if (!bn_mod_exp_base_2_consttime(&mont->RR, lgBigR * 2, &mont->N, ctx) ||
!bn_resize_words(&mont->RR, mont->N.width)) {
goto err;
!bn_mont_ctx_set_N_and_n0(mont, mod) ||
!bn_mont_ctx_set_RR_consttime(mont, ctx)) {
BN_MONT_CTX_free(mont);
return NULL;
}
return mont;
err:
BN_MONT_CTX_free(mont);
return NULL;
}
int BN_MONT_CTX_set_locked(BN_MONT_CTX **pmont, CRYPTO_MUTEX *lock,

View File

@ -159,27 +159,63 @@ static uint64_t bn_neg_inv_mod_r_u64(uint64_t n) {
return v;
}
int bn_mod_exp_base_2_consttime(BIGNUM *r, unsigned p, const BIGNUM *n,
BN_CTX *ctx) {
assert(!BN_is_zero(n));
assert(!BN_is_negative(n));
assert(BN_is_odd(n));
int bn_mont_ctx_set_RR_consttime(BN_MONT_CTX *mont, BN_CTX *ctx) {
assert(!BN_is_zero(&mont->N));
assert(!BN_is_negative(&mont->N));
assert(BN_is_odd(&mont->N));
assert(bn_minimal_width(&mont->N) == mont->N.width);
BN_zero(r);
unsigned n_bits = BN_num_bits(n);
unsigned n_bits = BN_num_bits(&mont->N);
assert(n_bits != 0);
assert(p > n_bits);
if (n_bits == 1) {
return 1;
BN_zero(&mont->RR);
return bn_resize_words(&mont->RR, mont->N.width);
}
// Set |r| to the larger power of two smaller than |n|, then shift with
// reductions the rest of the way.
if (!BN_set_bit(r, n_bits - 1) ||
!bn_mod_lshift_consttime(r, r, p - (n_bits - 1), n, ctx)) {
unsigned lgBigR = mont->N.width * BN_BITS2;
assert(lgBigR >= n_bits);
// RR is R, or 2^lgBigR, in the Montgomery domain. We can compute 2 in the
// Montgomery domain, 2R or 2^(lgBigR+1), and then use Montgomery
// square-and-multiply to exponentiate.
//
// The multiply steps take 2^n R to 2^(n+1) R. It is faster to double
// the value instead. The square steps take 2^n R to 2^(2n) R. This is
// equivalent to doubling n times. When n is below some threshold, doubling is
// faster. When above, squaring is faster.
//
// We double to this threshold, then switch to Montgomery squaring. From
// benchmarking various 32-bit and 64-bit architectures, the word count seems
// to work well as a threshold. (Doubling scales linearly and Montgomery
// reduction scales quadratically, so the threshold should scale roughly
// linearly.)
unsigned threshold = mont->N.width;
unsigned iters;
for (iters = 0; iters < sizeof(lgBigR) * 8; iters++) {
if ((lgBigR >> iters) <= threshold) {
break;
}
}
// Compute 2^(lgBigR >> iters) R, or 2^((lgBigR >> iters) + lgBigR), by
// doubling. The first n_bits - 1 doubles can be skipped because we don't need
// to reduce.
if (!BN_set_bit(&mont->RR, n_bits - 1) ||
!bn_mod_lshift_consttime(&mont->RR, &mont->RR,
(lgBigR >> iters) + lgBigR - (n_bits - 1),
&mont->N, ctx)) {
return 0;
}
return 1;
for (unsigned i = iters - 1; i < iters; i--) {
if (!BN_mod_mul_montgomery(&mont->RR, &mont->RR, &mont->RR, mont, ctx)) {
return 0;
}
if ((lgBigR & (1u << i)) != 0 &&
!bn_mod_lshift1_consttime(&mont->RR, &mont->RR, &mont->N, ctx)) {
return 0;
}
}
return bn_resize_words(&mont->RR, mont->N.width);
}