Originally I was trying to be pedantic and avoid any use of `_t`-
suffixed names. However, this hasn't really accomplished anything
except annoying me, so just do what BoringSSL does.
The *ring* counterpart to `copy_from_prebuf` is `LIMBS_select_512_32`
which is already written very (too?) conservatively w.r.t. compiler-
introduced side channels. I inspected the generated code before/after
adding additional `value_barrier_w` and it made no difference.
BoringSSL split up there bn_tests.txt into multiple files, which we had
done previously. Prepare to merge that BoringSSL change by putting the
test input files in the same places.
This partially fixes a bug where, on x86_64, BN_mod_exp_mont_consttime
would sometimes return m, the modulus, when it should have returned
zero. Thanks to Guido Vranken for reporting it. It is only a partial fix
because the same bug also exists in the "rsaz" codepath. That will be
fixed in the subsequent CL. (See the commented out test.)
The bug only affects zero outputs (with non-zero inputs), so we believe
it has no security impact on our cryptographic functions. BoringSSL
calls BN_mod_exp_mont_consttime in the following cases:
- RSA private key operations
- Primality testing, raising the witness to the odd part of p-1
- DSA keygen and key import, pub = g^priv (mod p)
- DSA signing, r = g^k (mod p)
- DH keygen, pub = g^priv (mod p)
- Diffie-Hellman, secret = peer^priv (mod p)
It is not possible in the RSA private key operation, provided p and q
are primes. If using CRT, we are working modulo a prime, so zero output
with non-zero input is impossible. If not using CRT, we work mod n.
While there are nilpotent values mod n, none of them hit zero by
exponentiating. (Both p and q would need to divide the input, which
means n divides the input.)
In primality testing, this can only be hit when the input was composite.
But as the rest of the loop cannot then hit 1, we'll correctly report it
as composite anyway.
DSA and DH work modulo a prime, where this case cannot happen.
Analysis:
This bug is the result of sloppiness with the looser bounds from "almost
Montgomery multiplication", described in
https://eprint.iacr.org/2011/239. Prior to upstream's
ec9cc70f72454b8d4a84247c86159613cee83b81, I believe x86_64-mont5.pl
implemented standard Montgomery reduction (the left half of figure 3 in
the paper).
Though it did not document this, ec9cc70f7245 changed it to implement
the "almost" variant (the right half of the figure.) The difference is
that, rather than subtracting if T >= m, it subtracts if T >= R. In
code, it is the difference between something like our bn_reduce_once,
vs. subtracting based only on T's carry bit. (Interestingly, the
.Lmul_enter branch of bn_mul_mont_gather5 seems to still implement
normal reduction, but the .Lmul4x_enter branch is an almost reduction.)
That means none of the intermediate values here are bounded by m. They
are only bounded by R. Accordingly, Figure 2 in the paper ends with
step 10: REDUCE h modulo m. BN_mod_exp_mont_consttime is missing this
step. The bn_from_montgomery call only implements step 9, AMM(h, 1).
(x86_64-mont5.pl's bn_from_montgomery only implements an almost
reduction.)
The impact depends on how unreduced AMM(h, 1) can be. Remark 1 of the
paper discusses this, but is ambiguous about the scope of its 2^(n-1) <
m < 2^n precondition. The m+1 bound appears to be unconditional:
Montgomery reduction ultimately adds some 0 <= Y < m*R to T, to get a
multiple of R, and then divides by R. The output, pre-subtraction, is
thus less than m + T/R. MM works because T < mR => T' < m + mR/R = 2m.
A single subtraction of m if T' >= m gives T'' < m. AMM works because
T < R^2 => T' < m + R^2/R = m + R. A single subtraction of m if T' >= R
gives T'' < R. See also Lemma 1, Section 3 and Section 4 of the paper,
though their formulation is more complicated to capture the word-by-word
algorithm. It's ultimately the same adjustment to T.
But in AMM(h, 1), T = h*1 = h < R, so AMM(h, 1) < m + R/R = m + 1. That
is, AMM(h, 1) <= m. So the only case when AMM(h, 1) isn't fully reduced
is if it outputs m. Thus, our limited impact. Indeed, Remark 1 mentions
step 10 isn't necessary because m is a prime and the inputs are
non-zero. But that doesn't apply here because BN_mod_exp_mont_consttime
may be called elsewhere.
Fix:
To fix this, we could add the missing step 10, but a full division would
not be constant-time. The analysis above says it could be a single
subtraction, bn_reduce_once, but then we could integrate it into
the subtraction already in plain Montgomery reduction, implemented by
uppercase BN_from_montgomery. h*1 = h < R <= m*R, so we are within
bounds.
Thus, we delete lowercase bn_from_montgomery altogether, and have the
mont5 path use the same BN_from_montgomery ending as the non-mont5 path.
This only impacts the final step of the whole exponentiation and has no
measurable perf impact.
In doing so, add comments describing these looser bounds. This includes
one subtlety that BN_mod_exp_mont_consttime actually mixes bn_mul_mont
(MM) with bn_mul_mont_gather5/bn_power5 (AMM). But this is fine because
MM is AMM-compatible; when passed AMM's looser inputs, it will still
produce a correct looser output.
Ideally we'd drop the "almost" reduction and stick to the more
straightforward bounds. As this only impacts the final subtraction in
each reduction, I would be surprised if it actually had a real
performance impact. But this would involve deeper change to
x86_64-mont5.pl, so I haven't tried this yet.
I believe this is basically the same bug as
https://github.com/golang/go/issues/13907 from Go.
Change-Id: I06f879777bb2ef181e9da7632ec858582e2afa38
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52825
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Trying to migrate Chromium to the "link all the asm files together"
strategy broke the aarch64 Android build because some of the ifdef'd out
assembly files were missing the .note.gnu.property section for BTI. If
we add support for IBT, that'll be another one.
To fix this, introduce <openssl/asm_base.h>, which must be included at
the start of every assembly file (before the target ifdefs). This does a
couple things:
- It emits BTI and noexecstack markers into every assembly file, even
those that ifdef themselves out.
- It resolves the MSan -> OPENSSL_NO_ASM logic, so we only need to do it
once.
- It defines the same OPENSSL_X86_64, etc., defines we set elsewhere, so
we can ensure they're consistent.
This required carving files up a bit. <openssl/base.h> has a lot of
things, such that trying to guard everything in it on __ASSEMBLER__
would be tedious. Instead, I moved the target defines to a new
<openssl/target.h>. Then <openssl/asm_base.h> is the new header that
pulls in all those things.
Bug: 542
Change-Id: I1682b4d929adea72908655fa1bb15765a6b3473b
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60765
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
The Windows system RNG[1] lives in bcryptprimitives.dll which exports
the function ProcessPrng[2] to supply random bytes from its internal
generators. These are seeded and reseeded from the operating
system using a device connection to \\Device\CNG which is opened
when bcryptprimitives.dll is first loaded.
After this CL boringssl calls ProcessPrng() directly.
Before this CL boringssl got its system randomness (on non-UWP
desktop Windows) from calls to RtlGenRandom[3].
This function is undocumented and unsupported, but has always been
available by linking to SystemFunction036 in advadpi32.dll. In
Windows 10 and later, this export simply forwards to
cryptbase.dll!SystemFunction036 which calls ProcessPrng()
directly.
cryptbase!SystemFunction036 decompiled:
```
BOOLEAN SystemFunction036(PVOID RandomBuffer,ULONG RandomBufferLength)
{
BOOL retval;
retval = ProcessPrng(RandomBuffer,RandomBufferLength);
return retval != 0;
}
```
Loading cryptbase.dll has the side effect of opening a device handle
to \\Device\KsecDD which is not used by boringssl's random number
wrappers. Calling ProcessPrng() directly allows sandboxed programs
such as Chromium to avoid having this handle if they do not need it.
ProcessPrng() also takes a size_t length rather than a u32 length,
allowing some simplification of the calling code.
After this CL we require bcryptprimitives to be loaded before the
first call to CRYPTO_srand(). Applications using the library should
either load the module themselves or call CRYPTO_pre_sandbox_init().
Before this CL boringssl required that advapi32, cryptbase and
bcryptprimitives were all loaded so this should not represent a
breaking change.
[1] https://learn.microsoft.com/en-us/windows/win32/seccng/processprng
[2] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf
[3] https://docs.google.com/document/d/13n1t5ak0yofzcadQCF7Ew5TewSUkNfQ3n-IYodjeRYc/edit
Bug: chromium:74242
Change-Id: Ifb1d6ef1a4539ff6e9a2c36cc119b7700ca2be8f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60825
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: David Benjamin <davidben@google.com>
We could just use the string literal as-is.
Change-Id: I2efe01fd9b020db1bb086001407bcf7fa8487551
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/61045
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
nanolibc is an embedded platform with no threads. To start unforking
that build, generalize some of the OPENSSL_TRUSTY defines. OpenSSL has
OPENSSL_NO_SOCK if you don't have sockets and OPENSSL_NO_POSIX_IO if you
don't have file descriptors. Those names are fine enough, so I've
borrowed them here too.
There's more to be done here, but this will clear out some of it.
Change-Id: Iaba1fafdebb46ebb8f68b7956535dd0ccaaa832f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60890
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
We really should remove these (we only support them in private keys)
but, in the meantime, add some tests for all the curves, not just P-256.
Change-Id: I9c4c0660f082fa1701afe11f51bb157b06befd3c
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60925
Reviewed-by: Adam Langley <agl@google.com>
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
RSA CRT is tiny bit messier when p < q.
https://boringssl-review.googlesource.com/25263 solved this by
normalizing to p > q. The cost was we sometimes had to compute a new
iqmp.
Modular inversion is expensive. We did it only once per key, but it's
still a performance cliff in per-key costs. When later work moves
freeze_private_key into RSA private key parsing, it will be a
performance cliff in the private key parser.
Instead, just handle p < q in the CRT function. The only difference is
needing one extra reduction before the modular subtraction. Even using
the fully general mod_montgomery function (as opposed to checking p < q,
or using bn_reduce_once when num_bits(p) == num_bits(q)) was not
measurable.
In doing so, I noticed we didn't actually have tests that exercise the
reduction step. I added one to evp_tests.txt, but it is only meaningful
when blinding is disabled. (Another cost of blinding.) When blinding is
enabled, the answers mod p and q are randomized and we hit this case
with about 1.8% probability. See comment in evp_test.txt.
I kept the optimization where we store iqmp in Montgomery form, not
because the optimization matters, but because we need to store a
corrected, fixed-width version of the value anyway, so we may as well
store it in a more convenient form.
M1 Max
Before:
Did 9048 RSA 2048 signing operations in 5033403us (1797.6 ops/sec)
Did 1500 RSA 4096 signing operations in 5009288us (299.4 ops/sec)
After:
Did 9116 RSA 2048 signing operations in 5053802us (1803.8 ops/sec) [+0.3%]
Did 1500 RSA 4096 signing operations in 5008283us (299.5 ops/sec) [+0.0%]
Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
Before:
Did 9282 RSA 2048 signing operations in 5019395us (1849.2 ops/sec)
Did 1302 RSA 4096 signing operations in 5055011us (257.6 ops/sec)
After:
Did 9240 RSA 2048 signing operations in 5024845us (1838.9 ops/sec) [-0.6%]
Did 1302 RSA 4096 signing operations in 5046157us (258.0 ops/sec) [+0.2%]
Bug: 316
Change-Id: Icb90c7d5f5188f9b69a6d7bcc63db13d92ec26d5
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60705
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Setting up Montgomery reduction requires computing RR, a larger power of
2 mod N. When N is secret (RSA primes), we currently start at
2^(n_bits-1), then iteratively double and reduce.
Instead, once we reach 2R = 2^(r_bits+1) or higher, we can switch to a
Montgomery square-and-multiply. (Montgomery reduction only needs n0. RR
is just for conversion.) This takes some tuning because, at low powers
of 2 (in Montgomery form), it is still more efficient to square by
doubling. I ran benchmarks for 32-bit and 64-bit, x86 and Arm, on the
machines I had available and picked a threshold that works decently
well.
(On the hardware I tested, it's the right threshold on all but the Pixel
5A. The 5A would ideally want a slightly higher threshold---it seems to
be worse at multiplying or better at addition, but the gap isn't that
large, and this operation isn't perf-sensitive anyway.)
The result is dramatically faster than the old shift-based approach.
That said, see I06f4a065fdecc1aec3160fe32a41e200538d1ee3 for discussion
on this operation. These speedups are not expected to translate to
increased RSA throughput. They just clear up some initialization work.
This speedup is not quite enough to match the division-based
variable-time one (perf-sensitive for RSA verification), so we'll keep
both codepaths around.
M1 Max
Before:
Did 712000 256-bit mont (constime) operations in 2000454us (355919.2 ops/sec)
Did 440000 384-bit mont (constime) operations in 2001121us (219876.8 ops/sec)
Did 259000 512-bit mont (constime) operations in 2003709us (129260.3 ops/sec)
Did 212000 521-bit mont (constime) operations in 2007033us (105628.6 ops/sec)
Did 107000 1024-bit mont (constime) operations in 2018551us (53008.3 ops/sec)
Did 57000 1536-bit mont (constime) operations in 2001027us (28485.4 ops/sec)
Did 37000 2048-bit mont (constime) operations in 2039631us (18140.5 ops/sec)
Did 20000 3072-bit mont (constime) operations in 2041163us (9798.3 ops/sec)
Did 11760 4096-bit mont (constime) operations in 2007195us (5858.9 ops/sec)
After:
Did 3996000 256-bit mont (constime) operations in 2000366us (1997634.4 ops/sec) [+461.3%]
Did 2687000 384-bit mont (constime) operations in 2000464us (1343188.4 ops/sec) [+510.9%]
Did 2615000 512-bit mont (constime) operations in 2000146us (1307404.6 ops/sec) [+911.5%]
Did 1029000 521-bit mont (constime) operations in 2000944us (514257.3 ops/sec) [+386.9%]
Did 1246000 1024-bit mont (constime) operations in 2000899us (622720.1 ops/sec) [+1074.8%]
Did 688000 1536-bit mont (constime) operations in 2000579us (343900.4 ops/sec) [+1107.3%]
Did 425000 2048-bit mont (constime) operations in 2003622us (212115.9 ops/sec) [+1069.3%]
Did 212000 3072-bit mont (constime) operations in 2004430us (105765.7 ops/sec) [+979.4%]
Did 125000 4096-bit mont (constime) operations in 2009677us (62199.0 ops/sec) [+961.6%]
Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
Before:
Did 781000 256-bit mont (constime) operations in 2000740us (390355.6 ops/sec)
Did 414000 384-bit mont (constime) operations in 2000180us (206981.4 ops/sec)
Did 258000 512-bit mont (constime) operations in 2001729us (128888.6 ops/sec)
Did 194000 521-bit mont (constime) operations in 2008814us (96574.4 ops/sec)
Did 79000 1024-bit mont (constime) operations in 2009309us (39317.0 ops/sec)
Did 36000 1536-bit mont (constime) operations in 2003945us (17964.6 ops/sec)
Did 21000 2048-bit mont (constime) operations in 2074987us (10120.5 ops/sec)
Did 9040 3072-bit mont (constime) operations in 2003869us (4511.3 ops/sec)
Did 5250 4096-bit mont (constime) operations in 2067796us (2538.9 ops/sec)
After:
Did 3496000 256-bit mont (constime) operations in 2000542us (1747526.4 ops/sec) [+347.7%]
Did 2466000 384-bit mont (constime) operations in 2000327us (1232798.4 ops/sec) [+495.6%]
Did 2392000 512-bit mont (constime) operations in 2000732us (1195562.4 ops/sec) [+827.6%]
Did 908000 521-bit mont (constime) operations in 2001181us (453732.1 ops/sec) [+369.8%]
Did 1054000 1024-bit mont (constime) operations in 2001429us (526623.7 ops/sec) [+1239.4%]
Did 548000 1536-bit mont (constime) operations in 2002417us (273669.3 ops/sec) [+1423.4%]
Did 339000 2048-bit mont (constime) operations in 2004127us (169151.0 ops/sec) [+1571.4%]
Did 162000 3072-bit mont (constime) operations in 2008221us (80668.4 ops/sec) [+1688.2%]
Did 94000 4096-bit mont (constime) operations in 2013848us (46676.8 ops/sec) [+1738.4%]
Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 32-bit mode
Before:
Did 335000 256-bit mont (constime) operations in 2000006us (167499.5 ops/sec)
Did 170000 384-bit mont (constime) operations in 2010398us (84560.4 ops/sec)
Did 102000 512-bit mont (constime) operations in 2013510us (50657.8 ops/sec)
Did 88000 521-bit mont (constime) operations in 2022909us (43501.7 ops/sec)
Did 27000 1024-bit mont (constime) operations in 2063490us (13084.6 ops/sec)
Did 11760 1536-bit mont (constime) operations in 2000600us (5878.2 ops/sec)
Did 6825 2048-bit mont (constime) operations in 2069343us (3298.1 ops/sec)
Did 2982 3072-bit mont (constime) operations in 2090651us (1426.3 ops/sec)
Did 1680 4096-bit mont (constime) operations in 2074824us (809.7 ops/sec)
After:
Did 1559000 256-bit mont (constime) operations in 2000884us (779155.6 ops/sec) [+365.2%]
Did 940000 384-bit mont (constime) operations in 2001511us (469645.2 ops/sec) [+455.4%]
Did 608000 512-bit mont (constime) operations in 2000380us (303942.3 ops/sec) [+500.0%]
Did 439000 521-bit mont (constime) operations in 2004282us (219031.1 ops/sec) [+403.5%]
Did 180000 1024-bit mont (constime) operations in 2005427us (89756.4 ops/sec) [+586.0%]
Did 85000 1536-bit mont (constime) operations in 2017009us (42141.6 ops/sec) [+616.9%]
Did 49000 2048-bit mont (constime) operations in 2035401us (24073.9 ops/sec) [+629.9%]
Did 22000 3072-bit mont (constime) operations in 2047404us (10745.3 ops/sec) [+653.3%]
Did 12642 4096-bit mont (constime) operations in 2094210us (6036.6 ops/sec) [+645.5%]
Pixel 5A:
Before:
Did 483000 256-bit mont (constime) operations in 2001460us (241323.8 ops/sec)
Did 279000 384-bit mont (constime) operations in 2004682us (139174.2 ops/sec)
Did 198000 512-bit mont (constime) operations in 2003995us (98802.6 ops/sec)
Did 141000 521-bit mont (constime) operations in 2006305us (70278.4 ops/sec)
Did 62000 1024-bit mont (constime) operations in 2022138us (30660.6 ops/sec)
Did 29000 1536-bit mont (constime) operations in 2007150us (14448.3 ops/sec)
Did 17376 2048-bit mont (constime) operations in 2044894us (8497.3 ops/sec)
Did 7686 3072-bit mont (constime) operations in 2011537us (3821.0 ops/sec)
Did 4620 4096-bit mont (constime) operations in 2048780us (2255.0 ops/sec)
After:
Did 1187000 256-bit mont (constime) operations in 2000099us (593470.6 ops/sec) [+145.9%]
Did 794000 384-bit mont (constime) operations in 2002162us (396571.3 ops/sec) [+184.9%]
Did 658000 512-bit mont (constime) operations in 2002808us (328538.7 ops/sec) [+232.5%]
Did 373000 521-bit mont (constime) operations in 2005135us (186022.4 ops/sec) [+164.7%]
Did 231000 1024-bit mont (constime) operations in 2008117us (115033.1 ops/sec) [+275.2%]
Did 112000 1536-bit mont (constime) operations in 2003151us (55911.9 ops/sec) [+287.0%]
Did 66000 2048-bit mont (constime) operations in 2022295us (32636.2 ops/sec) [+284.1%]
Did 30000 3072-bit mont (constime) operations in 2006199us (14953.7 ops/sec) [+291.4%]
Did 17182 4096-bit mont (constime) operations in 2017938us (8514.6 ops/sec) [+277.6%]
Pixel 5A, 32-bit mode:
Before:
Did 124000 256-bit mont (constime) operations in 2013082us (61597.1 ops/sec)
Did 66000 384-bit mont (constime) operations in 2024604us (32599.0 ops/sec)
Did 40000 512-bit mont (constime) operations in 2018560us (19816.1 ops/sec)
Did 38000 521-bit mont (constime) operations in 2043776us (18593.0 ops/sec)
Did 11466 1024-bit mont (constime) operations in 2010767us (5702.3 ops/sec)
Did 5481 1536-bit mont (constime) operations in 2061892us (2658.2 ops/sec)
Did 3171 2048-bit mont (constime) operations in 2075359us (1527.9 ops/sec)
Did 1407 3072-bit mont (constime) operations in 2032032us (692.4 ops/sec)
Did 819 4096-bit mont (constime) operations in 2070367us (395.6 ops/sec)
After:
Did 718000 256-bit mont (constime) operations in 2000496us (358911.0 ops/sec) [+482.7%]
Did 424000 384-bit mont (constime) operations in 2000523us (211944.6 ops/sec) [+550.2%]
Did 401000 512-bit mont (constime) operations in 2000933us (200406.5 ops/sec) [+911.3%]
Did 205000 521-bit mont (constime) operations in 2004212us (102284.6 ops/sec) [+450.1%]
Did 153000 1024-bit mont (constime) operations in 2004644us (76322.8 ops/sec) [+1238.5%]
Did 78000 1536-bit mont (constime) operations in 2007510us (38854.1 ops/sec) [+1361.6%]
Did 47000 2048-bit mont (constime) operations in 2018015us (23290.2 ops/sec) [+1424.3%]
Did 22848 3072-bit mont (constime) operations in 2079082us (10989.5 ops/sec) [+1487.1%]
Did 13156 4096-bit mont (constime) operations in 2067424us (6363.5 ops/sec) [+1508.6%]
Bug: 316
Change-Id: I402df85170cae780442225eaa879884e707ffa86
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60686
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
bn_mod_lshift_consttime currently calls bn_mod_lshift1_consttime in a
loop, but between needing a temporary value and having to guard against
some complications in our fixed-width BIGNUM convention, it's actually
picking up a lot of overhead.
This function is currently called to setup Montgomery contexts with
secret moduli (RSA primes). The setup operation is not
performance-sensitive in our benchmarks, because it is amortized away in
RSA private key signing. However, as part of reducing thread contention
with the RSA object, I'm planning to make RSA creation, which we do
benchmark, eagerly fill in the Montgomery context.
We do benchmark RSA parsing, so adding a slow Montgomery setup would
show up in benchmarks. This distinction is mostly artificial. Work done
on creation and work done on first use is still work done once per RSA
key. However, work done on key creation may slow server startup, while
work deferred to first use is amortized but less predictable.
Either way, from this CL, and especially the one to follow it, we have
plenty of low-hanging fruit in this function. As a bonus, this should
help single-use RSA private keys, but that's not something we currently
benchmark.
Modulus sizes below chosen based on:
- Common curve sizes (moot because we use a variable-time setup anyway)
- Common RSA modulus sizes (also variable-time setup)
- Half of common RSA modulus sizes (the secret primes involved)
Of these, only the third category matters. The others can use the
division-based path where it's faster anyway. However, by the end of
this patch series, they'll get a bit closer, so I benchmarked them all
to compare. (Though division still wins in the end.)
Benchmarks on an M1 Max:
Before:
Did 528000 256-bit mont (constime) operations in 2000993us (263869.0 ops/sec)
Did 312000 384-bit mont (constime) operations in 2001281us (155900.1 ops/sec)
Did 246000 512-bit mont (constime) operations in 2001521us (122906.5 ops/sec)
Did 191000 521-bit mont (constime) operations in 2006336us (95198.4 ops/sec)
Did 98000 1024-bit mont (constime) operations in 2001438us (48964.8 ops/sec)
Did 55000 1536-bit mont (constime) operations in 2025306us (27156.4 ops/sec)
Did 35000 2048-bit mont (constime) operations in 2022714us (17303.5 ops/sec)
Did 17640 3072-bit mont (constime) operations in 2028352us (8696.7 ops/sec)
Did 10290 4096-bit mont (constime) operations in 2065529us (4981.8 ops/sec)
After:
Did 712000 256-bit mont (constime) operations in 2000454us (355919.2 ops/sec) [+34.9%]
Did 440000 384-bit mont (constime) operations in 2001121us (219876.8 ops/sec) [+41.0%]
Did 259000 512-bit mont (constime) operations in 2003709us (129260.3 ops/sec) [+5.2%]
Did 212000 521-bit mont (constime) operations in 2007033us (105628.6 ops/sec) [+11.0%]
Did 107000 1024-bit mont (constime) operations in 2018551us (53008.3 ops/sec) [+8.3%]
Did 57000 1536-bit mont (constime) operations in 2001027us (28485.4 ops/sec) [+4.9%]
Did 37000 2048-bit mont (constime) operations in 2039631us (18140.5 ops/sec) [+4.8%]
Did 20000 3072-bit mont (constime) operations in 2041163us (9798.3 ops/sec) [+12.7%]
Did 11760 4096-bit mont (constime) operations in 2007195us (5858.9 ops/sec) [+17.6%]
Bug: 316
Change-Id: I06f4a065fdecc1aec3160fe32a41e200538d1ee3
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60685
Auto-Submit: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
I forgot a CPU capability check in X25519Test.NeonABI.
Change-Id: Ie2fa4a7b04a7eb152aa3b720687ec529e5dd5b0f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60745
Reviewed-by: Adam Langley <agl@google.com>
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>