Have the inner AEAD API take `cpu::features()` for all operations.
Then we will be able to write CPU-capability-based tests using (a
variation of) the inner API, which will (when implemented) eliminate
the need to use SDE and the other various hacks we use for testing all
the implementations.
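As a rough illustration of the idea (not the actual inner API), the sketch below passes a feature set explicitly into a toy "seal" function so that a test can loop over every capability combination. `Features`, `Implementation`, `select_implementation`, `seal_inner`, and `all_feature_sets` are all hypothetical names invented for this example; the toy cipher is a plain XOR.

```rust
/// Hypothetical stand-in for the value returned by `cpu::features()`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Features {
    pub aes_hw: bool,
    pub clmul_hw: bool,
}

/// Which (hypothetical) code path was selected.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Implementation {
    Hardware,
    Fallback,
}

/// The selection depends only on the `cpu` argument, never on global state.
pub fn select_implementation(cpu: Features) -> Implementation {
    if cpu.aes_hw && cpu.clmul_hw {
        Implementation::Hardware
    } else {
        Implementation::Fallback
    }
}

/// Toy "seal": XOR every byte with `key`. Both paths must give the same
/// output; only the selection logic differs in this sketch.
pub fn seal_inner(key: u8, in_out: &mut [u8], cpu: Features) -> Implementation {
    let which = select_implementation(cpu);
    match which {
        Implementation::Hardware | Implementation::Fallback => {
            for b in in_out.iter_mut() {
                *b ^= key;
            }
        }
    }
    which
}

/// Every capability combination, so a test can exercise each code path
/// on a single machine.
pub fn all_feature_sets() -> Vec<Features> {
    let mut sets = Vec::new();
    for aes_hw in [false, true] {
        for clmul_hw in [false, true] {
            sets.push(Features { aes_hw, clmul_hw });
        }
    }
    sets
}
```

A capability-based test would then call `seal_inner` once per entry of `all_feature_sets()` and check that every path produces identical output.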
Commit f932b941bd1f59782cb3db8f7cd7b8b2c9842ee9 was incomplete and
wrong. On targets where we do any static or dynamic feature detection
and where we have these global variables, we need to unconditionally
write the detected features to the global variable so that assembly
can see them. Since we do static feature detection regardless of
operating system, the initialization of the global must be done
without any conditions on the operating system.
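A minimal sketch of the intended shape, assuming a single 32-bit capability word: the store to the global happens on every path, and only the dynamic probing is conditional. `CPU_CAPS`, the feature constants, and the `detect_*` helpers are hypothetical stand-ins, not the real symbols.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the global variable that assembly code reads.
pub static CPU_CAPS: AtomicU32 = AtomicU32::new(0);

pub const STATIC_FEATURE: u32 = 1 << 0;
pub const DYNAMIC_FEATURE: u32 = 1 << 1;

/// Static detection: known at compile time, independent of the OS.
/// (Real code would consult `cfg!(target_feature = "...")`.)
fn detect_static() -> u32 {
    STATIC_FEATURE
}

/// Dynamic detection: may be unavailable on some operating systems.
fn detect_dynamic(os_supports_probing: bool) -> u32 {
    if os_supports_probing {
        DYNAMIC_FEATURE
    } else {
        0
    }
}

pub fn init_features(os_supports_probing: bool) -> u32 {
    let detected = detect_static() | detect_dynamic(os_supports_probing);
    // The store is NOT guarded by any OS condition: even when only static
    // detection ran, the global must reflect the detected features.
    CPU_CAPS.store(detected, Ordering::Relaxed);
    detected
}
```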
GitHub Actions runners already have rustup with the stable toolchain
installed, apparently. actions-rs is going away and we don't want to
keep maintaining a fork with an unsupported upstream, so start the
process of dropping it.
We want all of our internal symbols to be internal so that none of
these internal symbols leak from a static/dynamic library that is
built with *ring* inside.
Check `d` by processing it as an `OwnedModulus` like we do for the
other moduli. This should make the checking more consistent.
As a nice side effect, this eliminates the last non-test usage of
`Nonnegative` and eliminates more now-dead `Nonnegative` code.
Remove `SmallerModulus` and instead do the check dynamically. This
eliminates the last `unsafe impl` regarding the modulus
relationships. The uses of `elem_widen` won't ever fail but since
they are in an already-fallible function they won't hurt.
The dynamic checks should never fail but since they are added in
already-fallible functions they won't cause any trouble. This
facilitates future changes where the dynamic checks are required.
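To illustrate the difference between a compile-time `unsafe impl` marker and a dynamic check, here is a toy single-word model. The `Modulus`, `Elem`, and `Unspecified` types and this `elem_widen` signature are simplified assumptions for the example and do not match the real multi-limb types.

```rust
/// Hypothetical opaque error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct Unspecified;

/// Toy single-word modulus; real moduli are multi-limb.
#[derive(Debug, PartialEq, Eq)]
pub struct Modulus {
    pub value: u64,
}

/// Toy element, expected to be reduced modulo some `Modulus`.
#[derive(Debug, PartialEq, Eq)]
pub struct Elem {
    pub value: u64,
}

/// Reinterpret an element of the smaller modulus as an element of the
/// larger one. The size relationship is checked at run time here instead
/// of being asserted by an `unsafe impl` of a marker trait.
pub fn elem_widen(a: Elem, smaller: &Modulus, larger: &Modulus) -> Result<Elem, Unspecified> {
    if smaller.value >= larger.value {
        // Not expected in practice, but cheap to reject in a function
        // that is already fallible.
        return Err(Unspecified);
    }
    if a.value >= smaller.value {
        return Err(Unspecified);
    }
    Ok(Elem { value: a.value })
}
```

Because the caller already returns a `Result`, the extra error path propagates with `?` and does not change any signatures.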
This saves two private-modulus-length multiplications per RSA
private key operation at the cost of two private-modulus-length
squarings per `RsaKeyPair` construction.
Split the checking of the private modulus from the checking of the
private exponent so that we can do things in the order recommended
in the NIST spec.
This also facilitates storing R**3 instead of R**2 in the
`RsaKeyPair`. (We need R**2 during `RsaKeyPair` construction, but
R**3 afterwards.)
This was necessary at some point in the past, but no longer is. It is
better to avoid depending on any of the `core::fmt` machinery in these
lower layers if we can avoid it.
`PublicModulus` and `PrivatePrime` are basically duplicates of
`OwnedModulusWithOne`. In the future we would like to create an
`OwnedModulus` that doesn't need 1RR to be calculated. Also in the
future we'd like to be able to "take" 1RR from a public modulus.
This change is a step towards those ends.
Use the pattern we typically use where one argument is passed by value.
This lets us use `limbs_add_assign_mod`, eliminating the `unsafe`
direct use of `LIMBS_add_mod`. This will make future refactoring easier.
This also eliminates the need to construct and zeroize a new scalar `r`
for the result.
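The by-value pattern can be sketched with a toy single-word scalar. `Scalar`, `add_assign_mod`, and `scalar_sum` below are hypothetical, and unlike the real limb routines this version is not constant-time.

```rust
/// Toy single-word scalar; real scalars are multi-limb.
#[derive(Debug, PartialEq, Eq)]
pub struct Scalar(pub u64);

/// In-place modular addition: `*a = (*a + b) mod m`.
/// Assumes `*a < m` and `b < m`. NOT constant-time; illustration only.
pub fn add_assign_mod(a: &mut u64, b: u64, m: u64) {
    let sum = u128::from(*a) + u128::from(b);
    let m = u128::from(m);
    let reduced = if sum >= m { sum - m } else { sum };
    *a = reduced as u64;
}

/// Takes `a` by value and reuses its storage for the result, so no fresh
/// scalar has to be constructed (or zeroized) for the sum.
pub fn scalar_sum(mut a: Scalar, b: &Scalar, m: u64) -> Scalar {
    add_assign_mod(&mut a.0, b.0, m);
    a
}
```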
Note: I originally tried an alternative implementation using `flat_map` that
ended up being materially slower. To fix that performance regression I had to
make the following change:
```
     let mut output = Output([0; MAX_OUTPUT_LEN]);
     output
         .0
-        .iter_mut()
-        .zip(input.iter().copied().flat_map(|Wrapping(w)| f(w)))
+        .chunks_mut(N)
+        .zip(input.iter().copied().map(|Wrapping(w)| f(w)))
         .for_each(|(o, i)| {
-            *o = i;
+            o.copy_from_slice(&i);
         });
     output
 }
```
I verified that this generates the same assembly code as the original code
on x86-64 using Rust 1.74.0, except that there are two additional 128-bit
moves in `sha256_format_output` to zero out the latter half of `Output`,
which was intended.
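For reference, a self-contained sketch of the `chunks_mut` form is below. The function name `format_output`, the eight-word input, and `MAX_OUTPUT_LEN = 64` are assumptions chosen to make the example compile on its own; `N` must be nonzero because `chunks_mut(0)` panics.

```rust
use std::num::Wrapping;

/// Assumed value for this sketch.
pub const MAX_OUTPUT_LEN: usize = 64;

pub struct Output(pub [u8; MAX_OUTPUT_LEN]);

/// Serializes each state word with `f` into consecutive `N`-byte chunks of
/// the output buffer. Bytes beyond the last written chunk stay zero.
pub fn format_output<const N: usize>(
    input: [Wrapping<u32>; 8],
    f: impl Fn(u32) -> [u8; N],
) -> Output {
    let mut output = Output([0; MAX_OUTPUT_LEN]);
    output
        .0
        .chunks_mut(N)
        .zip(input.iter().copied().map(|Wrapping(w)| f(w)))
        .for_each(|(o, i)| {
            // Each chunk is exactly N bytes long here because
            // MAX_OUTPUT_LEN is a multiple of N, so the lengths match.
            o.copy_from_slice(&i);
        });
    output
}
```

Calling `format_output(state, u32::to_be_bytes)` infers `N = 4` and fills the first 32 bytes, leaving the latter half of the buffer zeroed.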
Clarify how the math works, and use a slightly better trade-off of
doubling vs squaring. On 64-bit targets RSA verification is now
less than 10% faster. On 32-bit targets it's over 20% faster. I
expect that we can improve the performance further by optimizing
the doubling implementation.
Also the new implementation avoids allocating/cloning any temporary
`Elem`s, unlike the previous implementation.
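The doubling-versus-squaring trade-off can be illustrated with plain (non-Montgomery) modular arithmetic on a single word: `k` doublings followed by `s` squarings of 1 yield 2**(k * 2**s) mod m, so the same power of two can be reached with many doublings and few squarings or the reverse. The helper names below are hypothetical and the code is not constant-time.

```rust
/// (2 * a) mod m, using a 128-bit intermediate to avoid overflow.
pub fn double_mod(a: u64, m: u64) -> u64 {
    ((u128::from(a) * 2) % u128::from(m)) as u64
}

/// (a * a) mod m, using a 128-bit intermediate to avoid overflow.
pub fn square_mod(a: u64, m: u64) -> u64 {
    ((u128::from(a) * u128::from(a)) % u128::from(m)) as u64
}

/// Computes 2**(doublings * 2**squarings) mod m.
/// Each doubling adds 1 to the exponent of two; each squaring doubles it.
pub fn pow2_mod(doublings: u32, squarings: u32, m: u64) -> u64 {
    let mut acc = 1 % m;
    for _ in 0..doublings {
        acc = double_mod(acc, m);
    }
    for _ in 0..squarings {
        acc = square_mod(acc, m);
    }
    acc
}
```

For example, 2**64 mod m can be computed as `pow2_mod(64, 0, m)`, `pow2_mod(16, 2, m)`, or `pow2_mod(1, 6, m)`; which split is fastest depends on the relative cost of a doubling and a squaring on the target.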
Save two private-modulus Montgomery multiplications per RSA exponentiation
at the cost of approximately two modulus-wide XORs.
The new `oneR()` is extracted from the Montgomery RR setup.
Remove the use of `One<RR>` in `elem_exp_consttime`.
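One way a value like R mod m can be obtained with only bitwise operations is sketched below for a single 64-bit limb, assuming R = 2**64 and an odd m whose top bit is set. This is an illustration of the general trick, not necessarily what `oneR()` does; `one_r` is a hypothetical name.

```rust
/// Returns R mod m for R = 2**64, or `None` if `m` is even or its top bit
/// is clear (cases this single-limb sketch does not handle).
pub fn one_r(m: u64) -> Option<u64> {
    if m & 1 == 0 || m >> 63 == 0 {
        return None;
    }
    // 2**64 - m == !m + 1. Because m is odd, !m is even, so adding one
    // never carries and is the same as flipping the lowest bit.
    // Because m > 2**63, the result is already less than m, so no further
    // reduction is needed.
    Some(!m ^ 1)
}
```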
Eliminate one modular doubling in Montgomery RR setup. This saves one
public modulus modular doubling per RSA signature verification, at the
cost of approximately one public-modulus-wide XOR. RsaKeyPair also sees
similar savings per Modulus.