yggdrasil/ring - ring - alnyan's gitea

Author	SHA1	Message	Date
Brian Smith	00da1cb1f7	Merge BoringSSL 'a905bbb': Consistently include BTI markers in every assembly file	2023-09-29 14:52:41 -07:00
Brian Smith	b78f7deffb	Merge BoringSSL '3f680b0': Remove a layer of indirection from fiat curve25519 assembly	2023-09-29 12:13:26 -07:00
Brian Smith	0a12e31e02	Partial merge of BoringSSL '9d4f833': Use ADX asm for Curve25519 base-point multiplication. Add the code but don't plumb it in.	2023-09-29 12:10:32 -07:00
Brian Smith	e0948076a5	Partial merge of BoringSSL '43f8891': Add saturated X25519 for x86_64+ADX running Linux. Add the new code but don't plumb it in yet.	2023-09-29 12:04:04 -07:00
Brian Smith	c274480f40	NFC: Remove more unused constant-time utilities.	2023-09-29 11:54:54 -07:00
Brian Smith	2e6d759e56	NFC: Remove dead code from syncing with BoringSSL. ring doesn't use the BoringSSL code that uses these constant-time utilities.	2023-09-29 10:30:01 -07:00
Brian Smith	7b59320e3e	Merge BoringSSL 'd605df5': Use packed representation for large Curve25519 table	2023-09-28 19:58:53 -07:00
Brian Smith	2d8fbe09e9	Import currently-unused utilities in crypto/internal.h Bring these in as they were in 4a0393fcf37d7dbd090a5bb2293601a9ec7605da. The next merge will modify these.	2023-09-28 18:14:14 -07:00
Brian Smith	2270dc6943	Rename crypto_word back to crypto_word_t. Originally I was trying to be pedantic and avoid any use of `_t`-suffixed names. However, this hasn't really accomplished anything except annoying me, so just do what BoringSSL does.	2023-09-28 18:11:05 -07:00
Brian Smith	03de1fa014	Merge BoringSSL '55b069d': Add a value barrier when checking for point doubling.	2023-09-28 17:43:49 -07:00
Brian Smith	30171c0829	Partial merge of BoringSSL 'da757e6': Add constant-time validation for curve25519. Don't add the constant-time validation tests since we need to develop the framework for it first. Do add the public-from-private test.	2023-09-28 17:30:25 -07:00
Brian Smith	e17b48df3c	Take BoringSSL '5fcd47d': Add prefetch to aes_hw_ctr32_encrypt_blocks.	2023-09-28 17:09:39 -07:00
Brian Smith	14142649d3	Merge BoringSSL '62f9751': Don't make assumptions about GCM128_CONTEXT layout in aesni-gcm-x86_64.pl.	2023-09-28 16:57:34 -07:00
Brian Smith	1d14b3de74	Partial merge of BoringSSL 'a7f83c4': Don't make assumptions about GCM128_CONTEXT layout in aesv8-gcm-armv8.pl. This is modifying not-yet-used code.	2023-09-28 16:18:31 -07:00
Brian Smith	183332021f	Merge BoringSSL 'ece1f86': Re-add go:build ignore lines	2023-09-28 14:33:29 -07:00
Brian Smith	6e85944940	Merge BoringSSL 'aa31748': Generate 64-bit Curve25519 and P256 code for MSVC	2023-09-28 14:28:33 -07:00
Brian Smith	88331f0737	Take BoringSSL 'abb9af8': Work around a NASM bug.	2023-09-28 12:42:21 -07:00
Brian Smith	c833ff64f9	Merge BoringSSL 'ebd43ef': Move data from .text to .rodata on x86_64	2023-09-28 12:38:20 -07:00
Brian Smith	7dcdf3cf13	Merge BoringSSL 'e18ba27': Move constants from .text to .rodata on aarch64.	2023-09-28 12:35:08 -07:00
Brian Smith	0671a90267	Partial merge of BoringSSL 'd1b4516': Add bn_add_words and bn_sub_words assembly for aarch64. Bring in the new code as we'll likely use it soon, but not now. Merged as-is except with the "arm_arch.h" include changed to what we need.	2023-09-28 12:05:52 -07:00
Brian Smith	8166b6855f	Merge BoringSSL '53b876a'. The ring counterpart to `copy_from_prebuf` is `LIMBS_select_512_32` which is already written very (too?) conservatively w.r.t. compiler-introduced side channels. I inspected the generated code before/after adding additional `value_barrier_w` and it made no difference.	2023-09-28 11:47:45 -07:00
Brian Smith	a02e49b0b0	Use ring-core/arm_arch.h in aesv8-gcm-armv8.pl. The code isn't used yet but we should avoid the openssl/ include before we forget it is there.	2023-09-28 10:44:07 -07:00
Brian Smith	78b0af8531	Take BoringSSL 'a43c76d': Work around nasm bug with empty assembly files	2023-09-27 22:48:05 -07:00
Brian Smith	f1668276c8	Merge BoringSSL '0d5b608': Maintain a frame pointer in aesni-gcm-x86_64.pl and add SEH unwind codes	2023-09-27 22:46:28 -07:00
Brian Smith	2653466c80	Take BoringSSL 'ae1546b': Convert ghash-x86_64.pl to new directives.	2023-09-27 22:43:45 -07:00
Brian Smith	2eccbdf001	Merge BoringSSL 'c556ee9': Add initial support for SEH directives in x86_64 perlasm.	2023-09-27 22:43:26 -07:00
Brian Smith	29ae0f1806	Merge BoringSSL 'aa18fe2': Indent DB lines in x86_64 NASM output.	2023-09-27 22:41:37 -07:00
Brian Smith	b0afb00eb8	Partial merge of BoringSSL 'c6e3780': Add optimised Aarch64 GCM. Bring in the new assembly language code but do not start using it yet. The changes to enable it will be done later.	2023-09-27 22:40:18 -07:00
Brian Smith	a6ff12be89	Take BoringSSL '90e3b6e': Add prefetch to aesni_ctr32_ghash_6x.	2023-09-27 21:17:23 -07:00
Brian Smith	c82566dea0	Merge BoringSSL 'cdccbe1': Fully condition all assembly files.	2023-09-27 21:15:24 -07:00
Brian Smith	8020c1b634	Tests: Move `bigint` tests to where BoringSSL puts them. BoringSSL split up their bn_tests.txt into multiple files, which we had done previously. Prepare to merge that BoringSSL change by putting the test input files in the same places.	2023-09-26 19:39:52 -07:00
Brian Smith	9e93637357	Merge BoringSSL 'e0bb21b': Update x86_64-mont5.pl and RSAZ comments a bit.	2023-09-24 15:49:52 -07:00
Brian Smith	6678808009	Merge BoringSSL '7ac94aa': More -Wshorten-64-to-32 fixes.	2023-09-24 15:43:35 -07:00
Brian Smith	20b1810a3b	Merge BoringSSL '0faffc7': Fix the comment in ecp_nistz256_ord_sqr_mont to match code and prototype.	2023-09-24 15:40:07 -07:00
Brian Smith	97a526c010	Merge BoringSSL '1b2b7b2': Various -Wshorten-64-to-32 fixes.	2023-09-24 15:31:41 -07:00
Brian Smith	75d34bc1a8	Merge BoringSSL 7b2795a: Replace even more ad-hoc bytes/integer conversions.	2023-09-24 15:26:51 -07:00
Brian Smith	5233928eb9	Take BoringSSL '0378578': Dedup a few more load/store implementations.	2023-09-23 15:48:18 -07:00
Brian Smith	6ccdf7bd12	Merge BoringSSL '6c2af68': Remove a few more unions.	2023-09-23 15:12:24 -07:00
David Benjamin	584f1e1016	Cherry-pick BoringSSL ca45987: Move load/store helpers to crypto/internal.h. These are needed for the next merge from BoringSSL.	2023-09-23 15:03:59 -07:00
Brian Smith	f812f37aba	Merge commit '0f2c55cb748651833af247bbed43e' into b/merge-boringssl-9. Take the changes from BoringSSL, except use `limbs_copy` and `limbs_zero`.	2023-09-18 17:53:44 -07:00
David Benjamin	76e98c4351	Always end BN_mod_exp_mont_consttime with normal Montgomery reduction. This partially fixes a bug where, on x86_64, BN_mod_exp_mont_consttime would sometimes return m, the modulus, when it should have returned zero. Thanks to Guido Vranken for reporting it. It is only a partial fix because the same bug also exists in the "rsaz" codepath. That will be fixed in the subsequent CL. (See the commented out test.) The bug only affects zero outputs (with non-zero inputs), so we believe it has no security impact on our cryptographic functions. BoringSSL calls BN_mod_exp_mont_consttime in the following cases: - RSA private key operations - Primality testing, raising the witness to the odd part of p-1 - DSA keygen and key import, pub = g^priv (mod p) - DSA signing, r = g^k (mod p) - DH keygen, pub = g^priv (mod p) - Diffie-Hellman, secret = peer^priv (mod p) It is not possible in the RSA private key operation, provided p and q are primes. If using CRT, we are working modulo a prime, so zero output with non-zero input is impossible. If not using CRT, we work mod n. While there are nilpotent values mod n, none of them hit zero by exponentiating. (Both p and q would need to divide the input, which means n divides the input.) In primality testing, this can only be hit when the input was composite. But as the rest of the loop cannot then hit 1, we'll correctly report it as composite anyway. DSA and DH work modulo a prime, where this case cannot happen. Analysis: This bug is the result of sloppiness with the looser bounds from "almost Montgomery multiplication", described in https://eprint.iacr.org/2011/239. Prior to upstream's ec9cc70f72454b8d4a84247c86159613cee83b81, I believe x86_64-mont5.pl implemented standard Montgomery reduction (the left half of figure 3 in the paper). Though it did not document this, ec9cc70f7245 changed it to implement the "almost" variant (the right half of the figure.) 
The difference is that, rather than subtracting if T >= m, it subtracts if T >= R. In code, it is the difference between something like our bn_reduce_once, vs. subtracting based only on T's carry bit. (Interestingly, the .Lmul_enter branch of bn_mul_mont_gather5 seems to still implement normal reduction, but the .Lmul4x_enter branch is an almost reduction.) That means none of the intermediate values here are bounded by m. They are only bounded by R. Accordingly, Figure 2 in the paper ends with step 10: REDUCE h modulo m. BN_mod_exp_mont_consttime is missing this step. The bn_from_montgomery call only implements step 9, AMM(h, 1). (x86_64-mont5.pl's bn_from_montgomery only implements an almost reduction.) The impact depends on how unreduced AMM(h, 1) can be. Remark 1 of the paper discusses this, but is ambiguous about the scope of its 2^(n-1) < m < 2^n precondition. The m+1 bound appears to be unconditional: Montgomery reduction ultimately adds some 0 <= Y < mR to T, to get a multiple of R, and then divides by R. The output, pre-subtraction, is thus less than m + T/R. MM works because T < mR => T' < m + mR/R = 2m. A single subtraction of m if T' >= m gives T'' < m. AMM works because T < R^2 => T' < m + R^2/R = m + R. A single subtraction of m if T' >= R gives T'' < R. See also Lemma 1, Section 3 and Section 4 of the paper, though their formulation is more complicated to capture the word-by-word algorithm. It's ultimately the same adjustment to T. But in AMM(h, 1), T = h1 = h < R, so AMM(h, 1) < m + R/R = m + 1. That is, AMM(h, 1) <= m. So the only case when AMM(h, 1) isn't fully reduced is if it outputs m. Thus, our limited impact. Indeed, Remark 1 mentions step 10 isn't necessary because m is a prime and the inputs are non-zero. But that doesn't apply here because BN_mod_exp_mont_consttime may be called elsewhere. Fix: To fix this, we could add the missing step 10, but a full division would not be constant-time. 
The analysis above says it could be a single subtraction, bn_reduce_once, but then we could integrate it into the subtraction already in plain Montgomery reduction, implemented by uppercase BN_from_montgomery. h1 = h < R <= mR, so we are within bounds. Thus, we delete lowercase bn_from_montgomery altogether, and have the mont5 path use the same BN_from_montgomery ending as the non-mont5 path. This only impacts the final step of the whole exponentiation and has no measurable perf impact. In doing so, add comments describing these looser bounds. This includes one subtlety that BN_mod_exp_mont_consttime actually mixes bn_mul_mont (MM) with bn_mul_mont_gather5/bn_power5 (AMM). But this is fine because MM is AMM-compatible; when passed AMM's looser inputs, it will still produce a correct looser output. Ideally we'd drop the "almost" reduction and stick to the more straightforward bounds. As this only impacts the final subtraction in each reduction, I would be surprised if it actually had a real performance impact. But this would involve deeper change to x86_64-mont5.pl, so I haven't tried this yet. I believe this is basically the same bug as https://github.com/golang/go/issues/13907 from Go. Change-Id: I06f879777bb2ef181e9da7632ec858582e2afa38 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52825 Commit-Queue: David Benjamin <davidben@google.com> Reviewed-by: Adam Langley <agl@google.com>	2023-09-15 17:01:39 -07:00
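The bound argued in the commit message above, AMM(h, 1) <= m, can be checked numerically with a small sketch. This is not the BoringSSL code: the toy modulus `M = 97`, the radix `R = 2^8`, and the helper names `redc_unsubtracted`, `mm`, and `amm` are all assumptions chosen for illustration, and nothing here is constant-time.

```python
# Toy parameters (assumed for illustration): odd modulus M, radix R = 2^8 > M.
M = 97
R_BITS = 8
R = 1 << R_BITS
N0 = (-pow(M, -1, R)) % R  # -M^-1 mod R (needs Python 3.8+)


def redc_unsubtracted(T, m, R, n0):
    """Add a multiple Y of m (0 <= Y < m*R) so T + Y is divisible by R, then divide.

    The result is less than m + T/R, before any final subtraction.
    """
    y = ((T % R) * n0) % R
    return (T + y * m) // R


def mm(T, m, R, n0):
    """Standard Montgomery reduction: subtract m if the result is >= m."""
    t = redc_unsubtracted(T, m, R, n0)
    return t - m if t >= m else t


def amm(T, m, R, n0):
    """'Almost' Montgomery reduction: subtract m only if the result is >= R."""
    t = redc_unsubtracted(T, m, R, n0)
    return t - m if t >= R else t


if __name__ == "__main__":
    # Inputs h < R for which AMM(h, 1) is not fully reduced.
    unreduced = [h for h in range(R) if amm(h, M, R, N0) == M]
    print("h with AMM(h, 1) == m:", unreduced)
```

For every `h < R` the almost-reduced output stays at or below `M`, and it equals `M` exactly when `h` is a non-zero multiple of `M`, which is the "returns m instead of zero" case the commit describes. Ending with `mm` instead returns 0 for those inputs.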
David Benjamin	a905bbb52a	Consistently include BTI markers in every assembly file Trying to migrate Chromium to the "link all the asm files together" strategy broke the aarch64 Android build because some of the ifdef'd out assembly files were missing the .note.gnu.property section for BTI. If we add support for IBT, that'll be another one. To fix this, introduce <openssl/asm_base.h>, which must be included at the start of every assembly file (before the target ifdefs). This does a couple things: - It emits BTI and noexecstack markers into every assembly file, even those that ifdef themselves out. - It resolves the MSan -> OPENSSL_NO_ASM logic, so we only need to do it once. - It defines the same OPENSSL_X86_64, etc., defines we set elsewhere, so we can ensure they're consistent. This required carving files up a bit. <openssl/base.h> has a lot of things, such that trying to guard everything in it on __ASSEMBLER__ would be tedious. Instead, I moved the target defines to a new <openssl/target.h>. Then <openssl/asm_base.h> is the new header that pulls in all those things. Bug: 542 Change-Id: I1682b4d929adea72908655fa1bb15765a6b3473b Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60765 Reviewed-by: Bob Beck <bbe@google.com> Commit-Queue: David Benjamin <davidben@google.com>	2023-06-22 23:36:55 +00:00
Alex Gough	e79649ba4d	Use ProcessPrng instead of RtlGenRandom on Windows The Windows system RNG[1] lives in bcryptprimitives.dll which exports the function ProcessPrng[2] to supply random bytes from its internal generators. These are seeded and reseeded from the operating system using a device connection to \\Device\CNG which is opened when bcryptprimitives.dll is first loaded. After this CL boringssl calls ProcessPrng() directly. Before this CL boringssl got its system randomness (on non-UWP desktop Windows) from calls to RtlGenRandom[3]. This function is undocumented and unsupported, but has always been available by linking to SystemFunction036 in advapi32.dll. In Windows 10 and later, this export simply forwards to cryptbase.dll!SystemFunction036 which calls ProcessPrng() directly. cryptbase!SystemFunction036 decompiled: ``` BOOLEAN SystemFunction036(PVOID RandomBuffer,ULONG RandomBufferLength) { BOOL retval; retval = ProcessPrng(RandomBuffer,RandomBufferLength); return retval != 0; } ``` Loading cryptbase.dll has the side effect of opening a device handle to \\Device\KsecDD which is not used by boringssl's random number wrappers. Calling ProcessPrng() directly allows sandboxed programs such as Chromium to avoid having this handle if they do not need it. ProcessPrng() also takes a size_t length rather than a u32 length, allowing some simplification of the calling code. After this CL we require bcryptprimitives to be loaded before the first call to CRYPTO_srand(). Applications using the library should either load the module themselves or call CRYPTO_pre_sandbox_init(). Before this CL boringssl required that advapi32, cryptbase and bcryptprimitives were all loaded so this should not represent a breaking change. 
[1] https://learn.microsoft.com/en-us/windows/win32/seccng/processprng [2] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf [3] https://docs.google.com/document/d/13n1t5ak0yofzcadQCF7Ew5TewSUkNfQ3n-IYodjeRYc/edit Bug: chromium:74242 Change-Id: Ifb1d6ef1a4539ff6e9a2c36cc119b7700ca2be8f Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60825 Commit-Queue: David Benjamin <davidben@google.com> Reviewed-by: David Benjamin <davidben@google.com>	2023-06-22 19:51:36 +00:00
David Benjamin	ee194c75a6	Slightly tidy BIO_C_SET_FILENAME logic We could just use the string literal as-is. Change-Id: I2efe01fd9b020db1bb086001407bcf7fa8487551 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/61045 Auto-Submit: David Benjamin <davidben@google.com> Commit-Queue: David Benjamin <davidben@google.com> Reviewed-by: Adam Langley <agl@google.com> Commit-Queue: Adam Langley <agl@google.com>	2023-06-22 17:52:19 +00:00
David Benjamin	9fcaec6435	Start recognizing the OPENSSL_NANOLIBC define nanolibc is an embedded platform with no threads. To start unforking that build, generalize some of the OPENSSL_TRUSTY defines. OpenSSL has OPENSSL_NO_SOCK if you don't have sockets and OPENSSL_NO_POSIX_IO if you don't have file descriptors. Those names are fine enough, so I've borrowed them here too. There's more to be done here, but this will clear out some of it. Change-Id: Iaba1fafdebb46ebb8f68b7956535dd0ccaaa832f Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60890 Auto-Submit: David Benjamin <davidben@google.com> Commit-Queue: Bob Beck <bbe@google.com> Reviewed-by: Bob Beck <bbe@google.com>	2023-06-20 18:41:34 +00:00
David Benjamin	8ead3f5314	Add more tests for recognizing explicit forms of built-in curves We really should remove these (we only support them in private keys) but, in the meantime, add some tests for all the curves, not just P-256. Change-Id: I9c4c0660f082fa1701afe11f51bb157b06befd3c Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60925 Reviewed-by: Adam Langley <agl@google.com> Auto-Submit: David Benjamin <davidben@google.com> Commit-Queue: Adam Langley <agl@google.com>	2023-06-19 16:20:55 +00:00
David Benjamin	6a7d8b5472	Remove p > q normalization in RSA keys RSA CRT is a tiny bit messier when p < q. https://boringssl-review.googlesource.com/25263 solved this by normalizing to p > q. The cost was we sometimes had to compute a new iqmp. Modular inversion is expensive. We did it only once per key, but it's still a performance cliff in per-key costs. When later work moves freeze_private_key into RSA private key parsing, it will be a performance cliff in the private key parser. Instead, just handle p < q in the CRT function. The only difference is needing one extra reduction before the modular subtraction. Even using the fully general mod_montgomery function (as opposed to checking p < q, or using bn_reduce_once when num_bits(p) == num_bits(q)) was not measurable. In doing so, I noticed we didn't actually have tests that exercise the reduction step. I added one to evp_tests.txt, but it is only meaningful when blinding is disabled. (Another cost of blinding.) When blinding is enabled, the answers mod p and q are randomized and we hit this case with about 1.8% probability. See the comment in evp_tests.txt. I kept the optimization where we store iqmp in Montgomery form, not because the optimization matters, but because we need to store a corrected, fixed-width version of the value anyway, so we may as well store it in a more convenient form. 
M1 Max Before: Did 9048 RSA 2048 signing operations in 5033403us (1797.6 ops/sec) Did 1500 RSA 4096 signing operations in 5009288us (299.4 ops/sec) After: Did 9116 RSA 2048 signing operations in 5053802us (1803.8 ops/sec) [+0.3%] Did 1500 RSA 4096 signing operations in 5008283us (299.5 ops/sec) [+0.0%] Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz Before: Did 9282 RSA 2048 signing operations in 5019395us (1849.2 ops/sec) Did 1302 RSA 4096 signing operations in 5055011us (257.6 ops/sec) After: Did 9240 RSA 2048 signing operations in 5024845us (1838.9 ops/sec) [-0.6%] Did 1302 RSA 4096 signing operations in 5046157us (258.0 ops/sec) [+0.2%] Bug: 316 Change-Id: Icb90c7d5f5188f9b69a6d7bcc63db13d92ec26d5 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60705 Commit-Queue: David Benjamin <davidben@google.com> Reviewed-by: Adam Langley <agl@google.com>	2023-06-13 22:47:13 +00:00
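The "one extra reduction before the modular subtraction" mentioned in the commit above can be illustrated with a plain-integer sketch of RSA CRT. The helper names `mod_sub` and `rsa_crt` and the textbook toy key are assumptions for illustration. The real code works on fixed-width values with iqmp in Montgomery form, which this sketch does not model.

```python
def mod_sub(a, b, p):
    """Modular subtraction in the fixed-width style: both inputs must already be in [0, p)."""
    assert 0 <= a < p and 0 <= b < p, "inputs must be fully reduced mod p"
    d = a - b
    return d + p if d < 0 else d


def rsa_crt(c, p, q, d):
    """Compute c^d mod p*q with the CRT (Garner's formula), for either ordering of p and q."""
    dp, dq = d % (p - 1), d % (q - 1)
    iqmp = pow(q, -1, p)      # q^-1 mod p (needs Python 3.8+)
    m1 = pow(c, dp, p)        # result mod p
    m2 = pow(c, dq, q)        # result mod q
    # When p > q, m2 < q < p is already reduced mod p. When p < q, m2 may be >= p,
    # so one extra reduction is needed before the modular subtraction.
    m2_mod_p = m2 % p
    h = (iqmp * mod_sub(m1, m2_mod_p, p)) % p
    return m2 + h * q


if __name__ == "__main__":
    # Textbook toy key (not secure): n = 61 * 53 = 3233, e = 17, d = 2753.
    print(rsa_crt(65, 53, 61, 2753) == pow(65, 2753, 3233))
```

Dropping the `m2 % p` line makes the assertion in `mod_sub` fire for some inputs when `p < q`, which is the case the old p > q normalization was avoiding.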
David Benjamin	02d2715bcc	Implement BN_MONT_CTX_new_consttime with Montgomery reduction Setting up Montgomery reduction requires computing RR, a larger power of 2 mod N. When N is secret (RSA primes), we currently start at 2^(n_bits-1), then iteratively double and reduce. Instead, once we reach 2R = 2^(r_bits+1) or higher, we can switch to a Montgomery square-and-multiply. (Montgomery reduction only needs n0. RR is just for conversion.) This takes some tuning because, at low powers of 2 (in Montgomery form), it is still more efficient to square by doubling. I ran benchmarks for 32-bit and 64-bit, x86 and Arm, on the machines I had available and picked a threshold that works decently well. (On the hardware I tested, it's the right threshold on all but the Pixel 5A. The 5A would ideally want a slightly higher threshold---it seems to be worse at multiplying or better at addition, but the gap isn't that large, and this operation isn't perf-sensitive anyway.) The result is dramatically faster than the old shift-based approach. That said, see I06f4a065fdecc1aec3160fe32a41e200538d1ee3 for discussion on this operation. These speedups are not expected to translate to increased RSA throughput. They just clear up some initialization work. This speedup is not quite enough to match the division-based variable-time one (perf-sensitive for RSA verification), so we'll keep both codepaths around. 
M1 Max Before: Did 712000 256-bit mont (constime) operations in 2000454us (355919.2 ops/sec) Did 440000 384-bit mont (constime) operations in 2001121us (219876.8 ops/sec) Did 259000 512-bit mont (constime) operations in 2003709us (129260.3 ops/sec) Did 212000 521-bit mont (constime) operations in 2007033us (105628.6 ops/sec) Did 107000 1024-bit mont (constime) operations in 2018551us (53008.3 ops/sec) Did 57000 1536-bit mont (constime) operations in 2001027us (28485.4 ops/sec) Did 37000 2048-bit mont (constime) operations in 2039631us (18140.5 ops/sec) Did 20000 3072-bit mont (constime) operations in 2041163us (9798.3 ops/sec) Did 11760 4096-bit mont (constime) operations in 2007195us (5858.9 ops/sec) After: Did 3996000 256-bit mont (constime) operations in 2000366us (1997634.4 ops/sec) [+461.3%] Did 2687000 384-bit mont (constime) operations in 2000464us (1343188.4 ops/sec) [+510.9%] Did 2615000 512-bit mont (constime) operations in 2000146us (1307404.6 ops/sec) [+911.5%] Did 1029000 521-bit mont (constime) operations in 2000944us (514257.3 ops/sec) [+386.9%] Did 1246000 1024-bit mont (constime) operations in 2000899us (622720.1 ops/sec) [+1074.8%] Did 688000 1536-bit mont (constime) operations in 2000579us (343900.4 ops/sec) [+1107.3%] Did 425000 2048-bit mont (constime) operations in 2003622us (212115.9 ops/sec) [+1069.3%] Did 212000 3072-bit mont (constime) operations in 2004430us (105765.7 ops/sec) [+979.4%] Did 125000 4096-bit mont (constime) operations in 2009677us (62199.0 ops/sec) [+961.6%] Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz Before: Did 781000 256-bit mont (constime) operations in 2000740us (390355.6 ops/sec) Did 414000 384-bit mont (constime) operations in 2000180us (206981.4 ops/sec) Did 258000 512-bit mont (constime) operations in 2001729us (128888.6 ops/sec) Did 194000 521-bit mont (constime) operations in 2008814us (96574.4 ops/sec) Did 79000 1024-bit mont (constime) operations in 2009309us (39317.0 ops/sec) Did 36000 1536-bit mont (constime) 
operations in 2003945us (17964.6 ops/sec) Did 21000 2048-bit mont (constime) operations in 2074987us (10120.5 ops/sec) Did 9040 3072-bit mont (constime) operations in 2003869us (4511.3 ops/sec) Did 5250 4096-bit mont (constime) operations in 2067796us (2538.9 ops/sec) After: Did 3496000 256-bit mont (constime) operations in 2000542us (1747526.4 ops/sec) [+347.7%] Did 2466000 384-bit mont (constime) operations in 2000327us (1232798.4 ops/sec) [+495.6%] Did 2392000 512-bit mont (constime) operations in 2000732us (1195562.4 ops/sec) [+827.6%] Did 908000 521-bit mont (constime) operations in 2001181us (453732.1 ops/sec) [+369.8%] Did 1054000 1024-bit mont (constime) operations in 2001429us (526623.7 ops/sec) [+1239.4%] Did 548000 1536-bit mont (constime) operations in 2002417us (273669.3 ops/sec) [+1423.4%] Did 339000 2048-bit mont (constime) operations in 2004127us (169151.0 ops/sec) [+1571.4%] Did 162000 3072-bit mont (constime) operations in 2008221us (80668.4 ops/sec) [+1688.2%] Did 94000 4096-bit mont (constime) operations in 2013848us (46676.8 ops/sec) [+1738.4%] Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 32-bit mode Before: Did 335000 256-bit mont (constime) operations in 2000006us (167499.5 ops/sec) Did 170000 384-bit mont (constime) operations in 2010398us (84560.4 ops/sec) Did 102000 512-bit mont (constime) operations in 2013510us (50657.8 ops/sec) Did 88000 521-bit mont (constime) operations in 2022909us (43501.7 ops/sec) Did 27000 1024-bit mont (constime) operations in 2063490us (13084.6 ops/sec) Did 11760 1536-bit mont (constime) operations in 2000600us (5878.2 ops/sec) Did 6825 2048-bit mont (constime) operations in 2069343us (3298.1 ops/sec) Did 2982 3072-bit mont (constime) operations in 2090651us (1426.3 ops/sec) Did 1680 4096-bit mont (constime) operations in 2074824us (809.7 ops/sec) After: Did 1559000 256-bit mont (constime) operations in 2000884us (779155.6 ops/sec) [+365.2%] Did 940000 384-bit mont (constime) operations in 2001511us (469645.2 
ops/sec) [+455.4%] Did 608000 512-bit mont (constime) operations in 2000380us (303942.3 ops/sec) [+500.0%] Did 439000 521-bit mont (constime) operations in 2004282us (219031.1 ops/sec) [+403.5%] Did 180000 1024-bit mont (constime) operations in 2005427us (89756.4 ops/sec) [+586.0%] Did 85000 1536-bit mont (constime) operations in 2017009us (42141.6 ops/sec) [+616.9%] Did 49000 2048-bit mont (constime) operations in 2035401us (24073.9 ops/sec) [+629.9%] Did 22000 3072-bit mont (constime) operations in 2047404us (10745.3 ops/sec) [+653.3%] Did 12642 4096-bit mont (constime) operations in 2094210us (6036.6 ops/sec) [+645.5%] Pixel 5A: Before: Did 483000 256-bit mont (constime) operations in 2001460us (241323.8 ops/sec) Did 279000 384-bit mont (constime) operations in 2004682us (139174.2 ops/sec) Did 198000 512-bit mont (constime) operations in 2003995us (98802.6 ops/sec) Did 141000 521-bit mont (constime) operations in 2006305us (70278.4 ops/sec) Did 62000 1024-bit mont (constime) operations in 2022138us (30660.6 ops/sec) Did 29000 1536-bit mont (constime) operations in 2007150us (14448.3 ops/sec) Did 17376 2048-bit mont (constime) operations in 2044894us (8497.3 ops/sec) Did 7686 3072-bit mont (constime) operations in 2011537us (3821.0 ops/sec) Did 4620 4096-bit mont (constime) operations in 2048780us (2255.0 ops/sec) After: Did 1187000 256-bit mont (constime) operations in 2000099us (593470.6 ops/sec) [+145.9%] Did 794000 384-bit mont (constime) operations in 2002162us (396571.3 ops/sec) [+184.9%] Did 658000 512-bit mont (constime) operations in 2002808us (328538.7 ops/sec) [+232.5%] Did 373000 521-bit mont (constime) operations in 2005135us (186022.4 ops/sec) [+164.7%] Did 231000 1024-bit mont (constime) operations in 2008117us (115033.1 ops/sec) [+275.2%] Did 112000 1536-bit mont (constime) operations in 2003151us (55911.9 ops/sec) [+287.0%] Did 66000 2048-bit mont (constime) operations in 2022295us (32636.2 ops/sec) [+284.1%] Did 30000 3072-bit mont (constime) 
operations in 2006199us (14953.7 ops/sec) [+291.4%] Did 17182 4096-bit mont (constime) operations in 2017938us (8514.6 ops/sec) [+277.6%] Pixel 5A, 32-bit mode: Before: Did 124000 256-bit mont (constime) operations in 2013082us (61597.1 ops/sec) Did 66000 384-bit mont (constime) operations in 2024604us (32599.0 ops/sec) Did 40000 512-bit mont (constime) operations in 2018560us (19816.1 ops/sec) Did 38000 521-bit mont (constime) operations in 2043776us (18593.0 ops/sec) Did 11466 1024-bit mont (constime) operations in 2010767us (5702.3 ops/sec) Did 5481 1536-bit mont (constime) operations in 2061892us (2658.2 ops/sec) Did 3171 2048-bit mont (constime) operations in 2075359us (1527.9 ops/sec) Did 1407 3072-bit mont (constime) operations in 2032032us (692.4 ops/sec) Did 819 4096-bit mont (constime) operations in 2070367us (395.6 ops/sec) After: Did 718000 256-bit mont (constime) operations in 2000496us (358911.0 ops/sec) [+482.7%] Did 424000 384-bit mont (constime) operations in 2000523us (211944.6 ops/sec) [+550.2%] Did 401000 512-bit mont (constime) operations in 2000933us (200406.5 ops/sec) [+911.3%] Did 205000 521-bit mont (constime) operations in 2004212us (102284.6 ops/sec) [+450.1%] Did 153000 1024-bit mont (constime) operations in 2004644us (76322.8 ops/sec) [+1238.5%] Did 78000 1536-bit mont (constime) operations in 2007510us (38854.1 ops/sec) [+1361.6%] Did 47000 2048-bit mont (constime) operations in 2018015us (23290.2 ops/sec) [+1424.3%] Did 22848 3072-bit mont (constime) operations in 2079082us (10989.5 ops/sec) [+1487.1%] Did 13156 4096-bit mont (constime) operations in 2067424us (6363.5 ops/sec) [+1508.6%] Bug: 316 Change-Id: I402df85170cae780442225eaa879884e707ffa86 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60686 Reviewed-by: Adam Langley <agl@google.com> Commit-Queue: David Benjamin <davidben@google.com>	2023-06-13 22:22:53 +00:00
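The approach described in the commit above is to double and reduce up to a small power of 2 in Montgomery form, then switch to Montgomery squaring. It can be sketched with Python integers as follows. The names `compute_rr`, `mont_mul`, and `double_mod` and the default `threshold=4` are assumptions for illustration. The actual threshold was tuned by benchmarking, and the real implementation is constant-time C, which this sketch is not.

```python
def double_mod(x, n):
    """Return 2*x mod n for 0 <= x < n, using a single conditional subtraction."""
    x <<= 1
    return x - n if x >= n else x


def mont_mul(a, b, n, r_bits, n0):
    """Montgomery multiplication: a * b * R^-1 mod n, for a, b < n and R = 2^r_bits."""
    mask = (1 << r_bits) - 1
    t = a * b
    y = ((t & mask) * n0) & mask
    t = (t + y * n) >> r_bits
    return t - n if t >= n else t


def compute_rr(n, r_bits, threshold=4):
    """Compute RR = R^2 mod n for odd n with 1 < n < R = 2^r_bits.

    RR is the Montgomery form of 2^r_bits, so it is built by square-and-multiply
    on the exponent r_bits, where "multiply by 2" is a modular doubling.
    """
    n0 = (-pow(n, -1, 1 << r_bits)) % (1 << r_bits)  # only n0 is needed to reduce
    # Split r_bits into leading bits (start <= threshold) and `shift` trailing bits.
    shift = 0
    while (r_bits >> shift) > threshold:
        shift += 1
    start = r_bits >> shift
    # Phase 1: doubling. Begin at 2^(n_bits-1) < n and reach 2^(r_bits+start) mod n,
    # which is 2^start in Montgomery form.
    n_bits = n.bit_length()
    x = 1 << (n_bits - 1)
    for _ in range(r_bits + start - (n_bits - 1)):
        x = double_mod(x, n)
    # Phase 2: Montgomery square-and-double over the remaining exponent bits.
    for i in reversed(range(shift)):
        x = mont_mul(x, x, n, r_bits, n0)   # exponent doubles
        if (r_bits >> i) & 1:
            x = double_mod(x, n)            # exponent += 1
    return x


if __name__ == "__main__":
    n = (1 << 61) - 1
    print(compute_rr(n, 64) == pow(2, 128, n))
```

A larger `threshold` spends more iterations doubling before the first squaring, and a smaller one switches to squaring sooner. The result is the same either way, so the choice affects only speed.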
David Benjamin	98e1227cb7	Make bn_mod_lshift_consttime faster bn_mod_lshift_consttime currently calls bn_mod_lshift1_consttime in a loop, but between needing a temporary value and having to guard against some complications in our fixed-width BIGNUM convention, it's actually picking up a lot of overhead. This function is currently called to set up Montgomery contexts with secret moduli (RSA primes). The setup operation is not performance-sensitive in our benchmarks, because it is amortized away in RSA private key signing. However, as part of reducing thread contention with the RSA object, I'm planning to make RSA creation, which we do benchmark, eagerly fill in the Montgomery context. We do benchmark RSA parsing, so adding a slow Montgomery setup would show up in benchmarks. This distinction is mostly artificial. Work done on creation and work done on first use is still work done once per RSA key. However, work done on key creation may slow server startup, while work deferred to first use is amortized but less predictable. Either way, from this CL, and especially the one to follow it, we have plenty of low-hanging fruit in this function. As a bonus, this should help single-use RSA private keys, but that's not something we currently benchmark. Modulus sizes below were chosen based on: - Common curve sizes (moot because we use a variable-time setup anyway) - Common RSA modulus sizes (also variable-time setup) - Half of common RSA modulus sizes (the secret primes involved) Of these, only the third category matters. The others can use the division-based path where it's faster anyway. However, by the end of this patch series, they'll get a bit closer, so I benchmarked them all to compare. (Though division still wins in the end.) 
Benchmarks on an M1 Max: Before: Did 528000 256-bit mont (constime) operations in 2000993us (263869.0 ops/sec) Did 312000 384-bit mont (constime) operations in 2001281us (155900.1 ops/sec) Did 246000 512-bit mont (constime) operations in 2001521us (122906.5 ops/sec) Did 191000 521-bit mont (constime) operations in 2006336us (95198.4 ops/sec) Did 98000 1024-bit mont (constime) operations in 2001438us (48964.8 ops/sec) Did 55000 1536-bit mont (constime) operations in 2025306us (27156.4 ops/sec) Did 35000 2048-bit mont (constime) operations in 2022714us (17303.5 ops/sec) Did 17640 3072-bit mont (constime) operations in 2028352us (8696.7 ops/sec) Did 10290 4096-bit mont (constime) operations in 2065529us (4981.8 ops/sec) After: Did 712000 256-bit mont (constime) operations in 2000454us (355919.2 ops/sec) [+34.9%] Did 440000 384-bit mont (constime) operations in 2001121us (219876.8 ops/sec) [+41.0%] Did 259000 512-bit mont (constime) operations in 2003709us (129260.3 ops/sec) [+5.2%] Did 212000 521-bit mont (constime) operations in 2007033us (105628.6 ops/sec) [+11.0%] Did 107000 1024-bit mont (constime) operations in 2018551us (53008.3 ops/sec) [+8.3%] Did 57000 1536-bit mont (constime) operations in 2001027us (28485.4 ops/sec) [+4.9%] Did 37000 2048-bit mont (constime) operations in 2039631us (18140.5 ops/sec) [+4.8%] Did 20000 3072-bit mont (constime) operations in 2041163us (9798.3 ops/sec) [+12.7%] Did 11760 4096-bit mont (constime) operations in 2007195us (5858.9 ops/sec) [+17.6%] Bug: 316 Change-Id: I06f4a065fdecc1aec3160fe32a41e200538d1ee3 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60685 Auto-Submit: David Benjamin <davidben@google.com> Reviewed-by: Adam Langley <agl@google.com> Commit-Queue: Adam Langley <agl@google.com>	2023-06-13 21:42:15 +00:00
David Benjamin	acfb1062f4	Fix tests on Arm when NEON is unavailable I forgot a CPU capability check in X25519Test.NeonABI. Change-Id: Ie2fa4a7b04a7eb152aa3b720687ec529e5dd5b0f Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60745 Reviewed-by: Adam Langley <agl@google.com> Auto-Submit: David Benjamin <davidben@google.com> Commit-Queue: Adam Langley <agl@google.com>	2023-06-13 20:52:40 +00:00


5408 Commits