Add a constant-time fallback GHASH implementation.

We have several variable-time table-based GHASH implementations, called
"4bit" in the code: a fallback one in C and assembly implementations for
x86, x86_64, and armv4. These are used if assembly is disabled or if the
hardware lacks NEON or SSSE3.

Note the benchmarks below were all taken on hardware several generations
beyond what would actually run this code, so they are a bit artificial.

Implement a constant-time GHASH based on the notes in
https://bearssl.org/constanttime.html#ghash-for-gcm, as well as the
reduction algorithm described in
https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.
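
For reference, the core trick from the BearSSL notes is shown below as an
abridged copy of gcm_mul32_nohw from the new gcm_nohw.c in this change
(only the #include is added): split each operand of a carry-less multiply
into four masks so that the carries of ordinary integer multiplies cannot
cross between coefficients, then mask the unwanted bits back off.

  #include <stdint.h>

  static uint64_t gcm_mul32_nohw(uint32_t a, uint32_t b) {
    // One term every four bits keeps each coefficient at most 32/4 = 8, so
    // the integer products never carry into a neighboring coefficient.
    uint32_t a0 = a & 0x11111111, a1 = a & 0x22222222;
    uint32_t a2 = a & 0x44444444, a3 = a & 0x88888888;
    uint32_t b0 = b & 0x11111111, b1 = b & 0x22222222;
    uint32_t b2 = b & 0x44444444, b3 = b & 0x88888888;
    uint64_t c0 = (a0 * (uint64_t)b0) ^ (a1 * (uint64_t)b3) ^
                  (a2 * (uint64_t)b2) ^ (a3 * (uint64_t)b1);
    uint64_t c1 = (a0 * (uint64_t)b1) ^ (a1 * (uint64_t)b0) ^
                  (a2 * (uint64_t)b3) ^ (a3 * (uint64_t)b2);
    uint64_t c2 = (a0 * (uint64_t)b2) ^ (a1 * (uint64_t)b1) ^
                  (a2 * (uint64_t)b0) ^ (a3 * (uint64_t)b3);
    uint64_t c3 = (a0 * (uint64_t)b3) ^ (a1 * (uint64_t)b2) ^
                  (a2 * (uint64_t)b1) ^ (a3 * (uint64_t)b0);
    // Keep only the bits belonging to each residue class mod 4.
    return (c0 & UINT64_C(0x1111111111111111)) |
           (c1 & UINT64_C(0x2222222222222222)) |
           (c2 & UINT64_C(0x4444444444444444)) |
           (c3 & UINT64_C(0x8888888888888888));
  }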

The new implementation is noticeably faster than the fallback C code for
32-bit and comparable for 64-bit. It is slower than the assembly
implementations, particularly for 32-bit. I've left 32-bit x86 alone but
replaced the x86_64 and armv4 ones. The perf hit on x86_64 is smaller and
affects only a small percentage of 64-bit Chrome on Windows users. ARM chips
without NEON are rare (Chrome for Android requires it), so that one is
replaced too.

The answer for 32-bit x86 is unclear. A larger share of 32-bit Chrome on
Windows users lack SSSE3, and the perf hit there is dramatic.
gcm_gmult_4bit_mmx uses SSE2, so perhaps we can close the gap with an SSE2
version of this strategy, or perhaps we decide the perf hit is an acceptable
cost of fixing the timing leaks.

32-bit x86 with OPENSSL_NO_ASM
Before: (4bit C)
Did 1136000 AES-128-GCM (16 bytes) seal operations in 1000762us (1135135.0 ops/sec): 18.2 MB/s
Did 190000 AES-128-GCM (256 bytes) seal operations in 1003533us (189331.1 ops/sec): 48.5 MB/s
Did 40000 AES-128-GCM (1350 bytes) seal operations in 1022114us (39134.6 ops/sec): 52.8 MB/s
Did 7282 AES-128-GCM (8192 bytes) seal operations in 1117575us (6515.9 ops/sec): 53.4 MB/s
Did 3663 AES-128-GCM (16384 bytes) seal operations in 1098538us (3334.4 ops/sec): 54.6 MB/s
After:
Did 1503000 AES-128-GCM (16 bytes) seal operations in 1000054us (1502918.8 ops/sec): 24.0 MB/s
Did 252000 AES-128-GCM (256 bytes) seal operations in 1001173us (251704.8 ops/sec): 64.4 MB/s
Did 53000 AES-128-GCM (1350 bytes) seal operations in 1016983us (52114.9 ops/sec): 70.4 MB/s
Did 9317 AES-128-GCM (8192 bytes) seal operations in 1056367us (8819.9 ops/sec): 72.3 MB/s
Did 4356 AES-128-GCM (16384 bytes) seal operations in 1000445us (4354.1 ops/sec): 71.3 MB/s

64-bit x86 with OPENSSL_NO_ASM
Before: (4bit C)
Did 2976000 AES-128-GCM (16 bytes) seal operations in 1000258us (2975232.4 ops/sec): 47.6 MB/s
Did 510000 AES-128-GCM (256 bytes) seal operations in 1000295us (509849.6 ops/sec): 130.5 MB/s
Did 106000 AES-128-GCM (1350 bytes) seal operations in 1001573us (105833.5 ops/sec): 142.9 MB/s
Did 18000 AES-128-GCM (8192 bytes) seal operations in 1003895us (17930.2 ops/sec): 146.9 MB/s
Did 9000 AES-128-GCM (16384 bytes) seal operations in 1003352us (8969.9 ops/sec): 147.0 MB/s
After:
Did 2972000 AES-128-GCM (16 bytes) seal operations in 1000178us (2971471.1 ops/sec): 47.5 MB/s
Did 515000 AES-128-GCM (256 bytes) seal operations in 1001850us (514049.0 ops/sec): 131.6 MB/s
Did 108000 AES-128-GCM (1350 bytes) seal operations in 1004941us (107469.0 ops/sec): 145.1 MB/s
Did 19000 AES-128-GCM (8192 bytes) seal operations in 1034966us (18358.1 ops/sec): 150.4 MB/s
Did 9250 AES-128-GCM (16384 bytes) seal operations in 1005269us (9201.5 ops/sec): 150.8 MB/s

32-bit ARM without NEON
Before: (4bit armv4 asm)
Did 952000 AES-128-GCM (16 bytes) seal operations in 1001009us (951040.4 ops/sec): 15.2 MB/s
Did 152000 AES-128-GCM (256 bytes) seal operations in 1005576us (151157.1 ops/sec): 38.7 MB/s
Did 32000 AES-128-GCM (1350 bytes) seal operations in 1024522us (31234.1 ops/sec): 42.2 MB/s
Did 5290 AES-128-GCM (8192 bytes) seal operations in 1005335us (5261.9 ops/sec): 43.1 MB/s
Did 2650 AES-128-GCM (16384 bytes) seal operations in 1004396us (2638.4 ops/sec): 43.2 MB/s
After:
Did 540000 AES-128-GCM (16 bytes) seal operations in 1000009us (539995.1 ops/sec): 8.6 MB/s
Did 90000 AES-128-GCM (256 bytes) seal operations in 1000028us (89997.5 ops/sec): 23.0 MB/s
Did 19000 AES-128-GCM (1350 bytes) seal operations in 1022041us (18590.3 ops/sec): 25.1 MB/s
Did 3150 AES-128-GCM (8192 bytes) seal operations in 1003199us (3140.0 ops/sec): 25.7 MB/s
Did 1694 AES-128-GCM (16384 bytes) seal operations in 1076156us (1574.1 ops/sec): 25.8 MB/s
(Note fallback AES is dampening the perf hit.)

64-bit x86 with OPENSSL_ia32cap=0
Before: (4bit x86_64 asm)
Did 2615000 AES-128-GCM (16 bytes) seal operations in 1000220us (2614424.8 ops/sec): 41.8 MB/s
Did 431000 AES-128-GCM (256 bytes) seal operations in 1001250us (430461.9 ops/sec): 110.2 MB/s
Did 89000 AES-128-GCM (1350 bytes) seal operations in 1002209us (88803.8 ops/sec): 119.9 MB/s
Did 16000 AES-128-GCM (8192 bytes) seal operations in 1064535us (15030.0 ops/sec): 123.1 MB/s
Did 8261 AES-128-GCM (16384 bytes) seal operations in 1096787us (7532.0 ops/sec): 123.4 MB/s
After:
Did 2355000 AES-128-GCM (16 bytes) seal operations in 1000096us (2354773.9 ops/sec): 37.7 MB/s
Did 373000 AES-128-GCM (256 bytes) seal operations in 1000981us (372634.4 ops/sec): 95.4 MB/s
Did 77000 AES-128-GCM (1350 bytes) seal operations in 1003557us (76727.1 ops/sec): 103.6 MB/s
Did 13000 AES-128-GCM (8192 bytes) seal operations in 1003058us (12960.4 ops/sec): 106.2 MB/s
Did 7139 AES-128-GCM (16384 bytes) seal operations in 1099576us (6492.5 ops/sec): 106.4 MB/s
(Note fallback AES is dampening the perf hit. Pairing with AESNI to roughly
isolate GHASH shows a 40% hit.)

For comparison, this is what removing gcm_gmult_4bit_mmx would do.
32-bit x86 with OPENSSL_ia32cap=0
Before:
Did 2014000 AES-128-GCM (16 bytes) seal operations in 1000026us (2013947.6 ops/sec): 32.2 MB/s
Did 367000 AES-128-GCM (256 bytes) seal operations in 1000097us (366964.4 ops/sec): 93.9 MB/s
Did 77000 AES-128-GCM (1350 bytes) seal operations in 1002135us (76836.0 ops/sec): 103.7 MB/s
Did 13000 AES-128-GCM (8192 bytes) seal operations in 1011394us (12853.5 ops/sec): 105.3 MB/s
Did 7227 AES-128-GCM (16384 bytes) seal operations in 1099409us (6573.5 ops/sec): 107.7 MB/s
If gcm_gmult_4bit_mmx were replaced:
Did 1350000 AES-128-GCM (16 bytes) seal operations in 1000128us (1349827.2 ops/sec): 21.6 MB/s
Did 219000 AES-128-GCM (256 bytes) seal operations in 1000090us (218980.3 ops/sec): 56.1 MB/s
Did 46000 AES-128-GCM (1350 bytes) seal operations in 1017365us (45214.8 ops/sec): 61.0 MB/s
Did 8393 AES-128-GCM (8192 bytes) seal operations in 1115579us (7523.4 ops/sec): 61.6 MB/s
Did 3840 AES-128-GCM (16384 bytes) seal operations in 1001928us (3832.6 ops/sec): 62.8 MB/s
(Note fallback AES is dampening the perf hit. Pairing with AESNI to roughly
isolate GHASH shows a 73% hit. gcm_gmult_4bit_mmx is almost 4x as fast.)

Change-Id: Ib28c981e92e200b17fb9ddc89aef695ac6733a43
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/38724
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
David Benjamin 2019-11-07 23:42:58 -05:00 committed by CQ bot account: commit-bot@chromium.org
parent 98f969491c
commit 9855c1c59a
7 changed files with 272 additions and 870 deletions

View File

@@ -83,6 +83,7 @@
#include "modes/cfb.c"
#include "modes/ctr.c"
#include "modes/gcm.c"
#include "modes/gcm_nohw.c"
#include "modes/ofb.c"
#include "modes/polyval.c"
#include "rand/ctrdrbg.c"

View File

@@ -78,6 +78,9 @@
# *native* byte order on current platform. See gcm128.c for working
# example...
# This file was patched in BoringSSL to remove the variable-time 4-bit
# implementation.
$flavour = shift;
if ($flavour=~/\w[\w\-]*\.\w+$/) { $output=$flavour; undef $flavour; }
else { while (($output=shift) && ($output!~/\w[\w\-]*\.\w+$/)) {} }
@@ -100,47 +103,6 @@ $Htbl="r1";
$inp="r2";
$len="r3";
$Zll="r4"; # variables
$Zlh="r5";
$Zhl="r6";
$Zhh="r7";
$Tll="r8";
$Tlh="r9";
$Thl="r10";
$Thh="r11";
$nlo="r12";
################# r13 is stack pointer
$nhi="r14";
################# r15 is program counter
$rem_4bit=$inp; # used in gcm_gmult_4bit
$cnt=$len;
sub Zsmash() {
my $i=12;
my @args=@_;
for ($Zll,$Zlh,$Zhl,$Zhh) {
$code.=<<___;
#if __ARM_ARCH__>=7 && defined(__ARMEL__)
rev $_,$_
str $_,[$Xi,#$i]
#elif defined(__ARMEB__)
str $_,[$Xi,#$i]
#else
mov $Tlh,$_,lsr#8
strb $_,[$Xi,#$i+3]
mov $Thl,$_,lsr#16
strb $Tlh,[$Xi,#$i+2]
mov $Thh,$_,lsr#24
strb $Thl,[$Xi,#$i+1]
strb $Thh,[$Xi,#$i]
#endif
___
$code.="\t".shift(@args)."\n";
$i-=4;
}
}
$code=<<___;
#include <openssl/arm_arch.h>
@@ -160,226 +122,6 @@ $code=<<___;
#else
.code 32
#endif
.type rem_4bit,%object
.align 5
rem_4bit:
.short 0x0000,0x1C20,0x3840,0x2460
.short 0x7080,0x6CA0,0x48C0,0x54E0
.short 0xE100,0xFD20,0xD940,0xC560
.short 0x9180,0x8DA0,0xA9C0,0xB5E0
.size rem_4bit,.-rem_4bit
.type rem_4bit_get,%function
rem_4bit_get:
#if defined(__thumb2__)
adr $rem_4bit,rem_4bit
#else
sub $rem_4bit,pc,#8+32 @ &rem_4bit
#endif
b .Lrem_4bit_got
nop
nop
.size rem_4bit_get,.-rem_4bit_get
.global gcm_ghash_4bit
.type gcm_ghash_4bit,%function
.align 4
gcm_ghash_4bit:
#if defined(__thumb2__)
adr r12,rem_4bit
#else
sub r12,pc,#8+48 @ &rem_4bit
#endif
add $len,$inp,$len @ $len to point at the end
stmdb sp!,{r3-r11,lr} @ save $len/end too
ldmia r12,{r4-r11} @ copy rem_4bit ...
stmdb sp!,{r4-r11} @ ... to stack
ldrb $nlo,[$inp,#15]
ldrb $nhi,[$Xi,#15]
.Louter:
eor $nlo,$nlo,$nhi
and $nhi,$nlo,#0xf0
and $nlo,$nlo,#0x0f
mov $cnt,#14
add $Zhh,$Htbl,$nlo,lsl#4
ldmia $Zhh,{$Zll-$Zhh} @ load Htbl[nlo]
add $Thh,$Htbl,$nhi
ldrb $nlo,[$inp,#14]
and $nhi,$Zll,#0xf @ rem
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nhi]
add $nhi,$nhi,$nhi
eor $Zll,$Tll,$Zll,lsr#4
ldrh $Tll,[sp,$nhi] @ rem_4bit[rem]
eor $Zll,$Zll,$Zlh,lsl#28
ldrb $nhi,[$Xi,#14]
eor $Zlh,$Tlh,$Zlh,lsr#4
eor $Zlh,$Zlh,$Zhl,lsl#28
eor $Zhl,$Thl,$Zhl,lsr#4
eor $Zhl,$Zhl,$Zhh,lsl#28
eor $Zhh,$Thh,$Zhh,lsr#4
eor $nlo,$nlo,$nhi
and $nhi,$nlo,#0xf0
and $nlo,$nlo,#0x0f
eor $Zhh,$Zhh,$Tll,lsl#16
.Linner:
add $Thh,$Htbl,$nlo,lsl#4
and $nlo,$Zll,#0xf @ rem
subs $cnt,$cnt,#1
add $nlo,$nlo,$nlo
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nlo]
eor $Zll,$Tll,$Zll,lsr#4
eor $Zll,$Zll,$Zlh,lsl#28
eor $Zlh,$Tlh,$Zlh,lsr#4
eor $Zlh,$Zlh,$Zhl,lsl#28
ldrh $Tll,[sp,$nlo] @ rem_4bit[rem]
eor $Zhl,$Thl,$Zhl,lsr#4
#ifdef __thumb2__
it pl
#endif
ldrplb $nlo,[$inp,$cnt]
eor $Zhl,$Zhl,$Zhh,lsl#28
eor $Zhh,$Thh,$Zhh,lsr#4
add $Thh,$Htbl,$nhi
and $nhi,$Zll,#0xf @ rem
eor $Zhh,$Zhh,$Tll,lsl#16 @ ^= rem_4bit[rem]
add $nhi,$nhi,$nhi
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nhi]
eor $Zll,$Tll,$Zll,lsr#4
#ifdef __thumb2__
it pl
#endif
ldrplb $Tll,[$Xi,$cnt]
eor $Zll,$Zll,$Zlh,lsl#28
eor $Zlh,$Tlh,$Zlh,lsr#4
ldrh $Tlh,[sp,$nhi]
eor $Zlh,$Zlh,$Zhl,lsl#28
eor $Zhl,$Thl,$Zhl,lsr#4
eor $Zhl,$Zhl,$Zhh,lsl#28
#ifdef __thumb2__
it pl
#endif
eorpl $nlo,$nlo,$Tll
eor $Zhh,$Thh,$Zhh,lsr#4
#ifdef __thumb2__
itt pl
#endif
andpl $nhi,$nlo,#0xf0
andpl $nlo,$nlo,#0x0f
eor $Zhh,$Zhh,$Tlh,lsl#16 @ ^= rem_4bit[rem]
bpl .Linner
ldr $len,[sp,#32] @ re-load $len/end
add $inp,$inp,#16
mov $nhi,$Zll
___
&Zsmash("cmp\t$inp,$len","\n".
"#ifdef __thumb2__\n".
" it ne\n".
"#endif\n".
" ldrneb $nlo,[$inp,#15]");
$code.=<<___;
bne .Louter
add sp,sp,#36
#if __ARM_ARCH__>=5
ldmia sp!,{r4-r11,pc}
#else
ldmia sp!,{r4-r11,lr}
tst lr,#1
moveq pc,lr @ be binary compatible with V4, yet
bx lr @ interoperable with Thumb ISA:-)
#endif
.size gcm_ghash_4bit,.-gcm_ghash_4bit
.global gcm_gmult_4bit
.type gcm_gmult_4bit,%function
gcm_gmult_4bit:
stmdb sp!,{r4-r11,lr}
ldrb $nlo,[$Xi,#15]
b rem_4bit_get
.Lrem_4bit_got:
and $nhi,$nlo,#0xf0
and $nlo,$nlo,#0x0f
mov $cnt,#14
add $Zhh,$Htbl,$nlo,lsl#4
ldmia $Zhh,{$Zll-$Zhh} @ load Htbl[nlo]
ldrb $nlo,[$Xi,#14]
add $Thh,$Htbl,$nhi
and $nhi,$Zll,#0xf @ rem
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nhi]
add $nhi,$nhi,$nhi
eor $Zll,$Tll,$Zll,lsr#4
ldrh $Tll,[$rem_4bit,$nhi] @ rem_4bit[rem]
eor $Zll,$Zll,$Zlh,lsl#28
eor $Zlh,$Tlh,$Zlh,lsr#4
eor $Zlh,$Zlh,$Zhl,lsl#28
eor $Zhl,$Thl,$Zhl,lsr#4
eor $Zhl,$Zhl,$Zhh,lsl#28
eor $Zhh,$Thh,$Zhh,lsr#4
and $nhi,$nlo,#0xf0
eor $Zhh,$Zhh,$Tll,lsl#16
and $nlo,$nlo,#0x0f
.Loop:
add $Thh,$Htbl,$nlo,lsl#4
and $nlo,$Zll,#0xf @ rem
subs $cnt,$cnt,#1
add $nlo,$nlo,$nlo
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nlo]
eor $Zll,$Tll,$Zll,lsr#4
eor $Zll,$Zll,$Zlh,lsl#28
eor $Zlh,$Tlh,$Zlh,lsr#4
eor $Zlh,$Zlh,$Zhl,lsl#28
ldrh $Tll,[$rem_4bit,$nlo] @ rem_4bit[rem]
eor $Zhl,$Thl,$Zhl,lsr#4
#ifdef __thumb2__
it pl
#endif
ldrplb $nlo,[$Xi,$cnt]
eor $Zhl,$Zhl,$Zhh,lsl#28
eor $Zhh,$Thh,$Zhh,lsr#4
add $Thh,$Htbl,$nhi
and $nhi,$Zll,#0xf @ rem
eor $Zhh,$Zhh,$Tll,lsl#16 @ ^= rem_4bit[rem]
add $nhi,$nhi,$nhi
ldmia $Thh,{$Tll-$Thh} @ load Htbl[nhi]
eor $Zll,$Tll,$Zll,lsr#4
eor $Zll,$Zll,$Zlh,lsl#28
eor $Zlh,$Tlh,$Zlh,lsr#4
ldrh $Tll,[$rem_4bit,$nhi] @ rem_4bit[rem]
eor $Zlh,$Zlh,$Zhl,lsl#28
eor $Zhl,$Thl,$Zhl,lsr#4
eor $Zhl,$Zhl,$Zhh,lsl#28
eor $Zhh,$Thh,$Zhh,lsr#4
#ifdef __thumb2__
itt pl
#endif
andpl $nhi,$nlo,#0xf0
andpl $nlo,$nlo,#0x0f
eor $Zhh,$Zhh,$Tll,lsl#16 @ ^= rem_4bit[rem]
bpl .Loop
___
&Zsmash();
$code.=<<___;
#if __ARM_ARCH__>=5
ldmia sp!,{r4-r11,pc}
#else
ldmia sp!,{r4-r11,lr}
tst lr,#1
moveq pc,lr @ be binary compatible with V4, yet
bx lr @ interoperable with Thumb ISA:-)
#endif
.size gcm_gmult_4bit,.-gcm_gmult_4bit
___
{
my ($Xl,$Xm,$Xh,$IN)=map("q$_",(0..3));

View File

@@ -90,6 +90,9 @@
#
# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest
# This file was patched in BoringSSL to remove the variable-time 4-bit
# implementation.
$flavour = shift;
$output = shift;
if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
@@ -125,10 +128,6 @@ $rem_4bit = "%r11";
$Xi="%rdi";
$Htbl="%rsi";
# per-function register layout
$cnt="%rcx";
$rem="%rdx";
sub LB() { my $r=shift; $r =~ s/%[er]([a-d])x/%\1l/ or
$r =~ s/%[er]([sd]i)/%\1l/ or
$r =~ s/%[er](bp)/%\1l/ or
@@ -141,302 +140,15 @@ sub AUTOLOAD() # thunk [simplified] 32-bit style perlasm
$code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n";
}
{ my $N;
sub loop() {
my $inp = shift;
$N++;
$code.=<<___;
xor $nlo,$nlo
xor $nhi,$nhi
mov `&LB("$Zlo")`,`&LB("$nlo")`
mov `&LB("$Zlo")`,`&LB("$nhi")`
shl \$4,`&LB("$nlo")`
mov \$14,$cnt
mov 8($Htbl,$nlo),$Zlo
mov ($Htbl,$nlo),$Zhi
and \$0xf0,`&LB("$nhi")`
mov $Zlo,$rem
jmp .Loop$N
.align 16
.Loop$N:
shr \$4,$Zlo
and \$0xf,$rem
mov $Zhi,$tmp
mov ($inp,$cnt),`&LB("$nlo")`
shr \$4,$Zhi
xor 8($Htbl,$nhi),$Zlo
shl \$60,$tmp
xor ($Htbl,$nhi),$Zhi
mov `&LB("$nlo")`,`&LB("$nhi")`
xor ($rem_4bit,$rem,8),$Zhi
mov $Zlo,$rem
shl \$4,`&LB("$nlo")`
xor $tmp,$Zlo
dec $cnt
js .Lbreak$N
shr \$4,$Zlo
and \$0xf,$rem
mov $Zhi,$tmp
shr \$4,$Zhi
xor 8($Htbl,$nlo),$Zlo
shl \$60,$tmp
xor ($Htbl,$nlo),$Zhi
and \$0xf0,`&LB("$nhi")`
xor ($rem_4bit,$rem,8),$Zhi
mov $Zlo,$rem
xor $tmp,$Zlo
jmp .Loop$N
.align 16
.Lbreak$N:
shr \$4,$Zlo
and \$0xf,$rem
mov $Zhi,$tmp
shr \$4,$Zhi
xor 8($Htbl,$nlo),$Zlo
shl \$60,$tmp
xor ($Htbl,$nlo),$Zhi
and \$0xf0,`&LB("$nhi")`
xor ($rem_4bit,$rem,8),$Zhi
mov $Zlo,$rem
xor $tmp,$Zlo
shr \$4,$Zlo
and \$0xf,$rem
mov $Zhi,$tmp
shr \$4,$Zhi
xor 8($Htbl,$nhi),$Zlo
shl \$60,$tmp
xor ($Htbl,$nhi),$Zhi
xor $tmp,$Zlo
xor ($rem_4bit,$rem,8),$Zhi
bswap $Zlo
bswap $Zhi
___
}}
$code=<<___;
.text
.extern OPENSSL_ia32cap_P
.globl gcm_gmult_4bit
.type gcm_gmult_4bit,\@function,2
.align 16
gcm_gmult_4bit:
.cfi_startproc
push %rbx
.cfi_push %rbx
push %rbp # %rbp and others are pushed exclusively in
.cfi_push %rbp
push %r12 # order to reuse Win64 exception handler...
.cfi_push %r12
push %r13
.cfi_push %r13
push %r14
.cfi_push %r14
push %r15
.cfi_push %r15
sub \$280,%rsp
.cfi_adjust_cfa_offset 280
.Lgmult_prologue:
movzb 15($Xi),$Zlo
lea .Lrem_4bit(%rip),$rem_4bit
___
&loop ($Xi);
$code.=<<___;
mov $Zlo,8($Xi)
mov $Zhi,($Xi)
lea 280+48(%rsp),%rsi
.cfi_def_cfa %rsi,8
mov -8(%rsi),%rbx
.cfi_restore %rbx
lea (%rsi),%rsp
.cfi_def_cfa_register %rsp
.Lgmult_epilogue:
ret
.cfi_endproc
.size gcm_gmult_4bit,.-gcm_gmult_4bit
___
# per-function register layout
$inp="%rdx";
$len="%rcx";
$rem_8bit=$rem_4bit;
$code.=<<___;
.globl gcm_ghash_4bit
.type gcm_ghash_4bit,\@function,4
.align 16
gcm_ghash_4bit:
.cfi_startproc
push %rbx
.cfi_push %rbx
push %rbp
.cfi_push %rbp
push %r12
.cfi_push %r12
push %r13
.cfi_push %r13
push %r14
.cfi_push %r14
push %r15
.cfi_push %r15
sub \$280,%rsp
.cfi_adjust_cfa_offset 280
.Lghash_prologue:
mov $inp,%r14 # reassign couple of args
mov $len,%r15
___
{ my $inp="%r14";
my $dat="%edx";
my $len="%r15";
my @nhi=("%ebx","%ecx");
my @rem=("%r12","%r13");
my $Hshr4="%rbp";
&sub ($Htbl,-128); # size optimization
&lea ($Hshr4,"16+128(%rsp)");
{ my @lo =($nlo,$nhi);
my @hi =($Zlo,$Zhi);
&xor ($dat,$dat);
for ($i=0,$j=-2;$i<18;$i++,$j++) {
&mov ("$j(%rsp)",&LB($dat)) if ($i>1);
&or ($lo[0],$tmp) if ($i>1);
&mov (&LB($dat),&LB($lo[1])) if ($i>0 && $i<17);
&shr ($lo[1],4) if ($i>0 && $i<17);
&mov ($tmp,$hi[1]) if ($i>0 && $i<17);
&shr ($hi[1],4) if ($i>0 && $i<17);
&mov ("8*$j($Hshr4)",$hi[0]) if ($i>1);
&mov ($hi[0],"16*$i+0-128($Htbl)") if ($i<16);
&shl (&LB($dat),4) if ($i>0 && $i<17);
&mov ("8*$j-128($Hshr4)",$lo[0]) if ($i>1);
&mov ($lo[0],"16*$i+8-128($Htbl)") if ($i<16);
&shl ($tmp,60) if ($i>0 && $i<17);
push (@lo,shift(@lo));
push (@hi,shift(@hi));
}
}
&add ($Htbl,-128);
&mov ($Zlo,"8($Xi)");
&mov ($Zhi,"0($Xi)");
&add ($len,$inp); # pointer to the end of data
&lea ($rem_8bit,".Lrem_8bit(%rip)");
&jmp (".Louter_loop");
$code.=".align 16\n.Louter_loop:\n";
&xor ($Zhi,"($inp)");
&mov ("%rdx","8($inp)");
&lea ($inp,"16($inp)");
&xor ("%rdx",$Zlo);
&mov ("($Xi)",$Zhi);
&mov ("8($Xi)","%rdx");
&shr ("%rdx",32);
&xor ($nlo,$nlo);
&rol ($dat,8);
&mov (&LB($nlo),&LB($dat));
&movz ($nhi[0],&LB($dat));
&shl (&LB($nlo),4);
&shr ($nhi[0],4);
for ($j=11,$i=0;$i<15;$i++) {
&rol ($dat,8);
&xor ($Zlo,"8($Htbl,$nlo)") if ($i>0);
&xor ($Zhi,"($Htbl,$nlo)") if ($i>0);
&mov ($Zlo,"8($Htbl,$nlo)") if ($i==0);
&mov ($Zhi,"($Htbl,$nlo)") if ($i==0);
&mov (&LB($nlo),&LB($dat));
&xor ($Zlo,$tmp) if ($i>0);
&movzw ($rem[1],"($rem_8bit,$rem[1],2)") if ($i>0);
&movz ($nhi[1],&LB($dat));
&shl (&LB($nlo),4);
&movzb ($rem[0],"(%rsp,$nhi[0])");
&shr ($nhi[1],4) if ($i<14);
&and ($nhi[1],0xf0) if ($i==14);
&shl ($rem[1],48) if ($i>0);
&xor ($rem[0],$Zlo);
&mov ($tmp,$Zhi);
&xor ($Zhi,$rem[1]) if ($i>0);
&shr ($Zlo,8);
&movz ($rem[0],&LB($rem[0]));
&mov ($dat,"$j($Xi)") if (--$j%4==0);
&shr ($Zhi,8);
&xor ($Zlo,"-128($Hshr4,$nhi[0],8)");
&shl ($tmp,56);
&xor ($Zhi,"($Hshr4,$nhi[0],8)");
unshift (@nhi,pop(@nhi)); # "rotate" registers
unshift (@rem,pop(@rem));
}
&movzw ($rem[1],"($rem_8bit,$rem[1],2)");
&xor ($Zlo,"8($Htbl,$nlo)");
&xor ($Zhi,"($Htbl,$nlo)");
&shl ($rem[1],48);
&xor ($Zlo,$tmp);
&xor ($Zhi,$rem[1]);
&movz ($rem[0],&LB($Zlo));
&shr ($Zlo,4);
&mov ($tmp,$Zhi);
&shl (&LB($rem[0]),4);
&shr ($Zhi,4);
&xor ($Zlo,"8($Htbl,$nhi[0])");
&movzw ($rem[0],"($rem_8bit,$rem[0],2)");
&shl ($tmp,60);
&xor ($Zhi,"($Htbl,$nhi[0])");
&xor ($Zlo,$tmp);
&shl ($rem[0],48);
&bswap ($Zlo);
&xor ($Zhi,$rem[0]);
&bswap ($Zhi);
&cmp ($inp,$len);
&jb (".Louter_loop");
}
$code.=<<___;
mov $Zlo,8($Xi)
mov $Zhi,($Xi)
lea 280+48(%rsp),%rsi
.cfi_def_cfa %rsi,8
mov -48(%rsi),%r15
.cfi_restore %r15
mov -40(%rsi),%r14
.cfi_restore %r14
mov -32(%rsi),%r13
.cfi_restore %r13
mov -24(%rsi),%r12
.cfi_restore %r12
mov -16(%rsi),%rbp
.cfi_restore %rbp
mov -8(%rsi),%rbx
.cfi_restore %rbx
lea 0(%rsi),%rsp
.cfi_def_cfa_register %rsp
.Lghash_epilogue:
ret
.cfi_endproc
.size gcm_ghash_4bit,.-gcm_ghash_4bit
___
######################################################################
# PCLMULQDQ version.
@@ -1599,158 +1311,15 @@ $code.=<<___;
.L7_mask_poly:
.long 7,0,`0xE1<<1`,0
.align 64
.type .Lrem_4bit,\@object
.Lrem_4bit:
.long 0,`0x0000<<16`,0,`0x1C20<<16`,0,`0x3840<<16`,0,`0x2460<<16`
.long 0,`0x7080<<16`,0,`0x6CA0<<16`,0,`0x48C0<<16`,0,`0x54E0<<16`
.long 0,`0xE100<<16`,0,`0xFD20<<16`,0,`0xD940<<16`,0,`0xC560<<16`
.long 0,`0x9180<<16`,0,`0x8DA0<<16`,0,`0xA9C0<<16`,0,`0xB5E0<<16`
.type .Lrem_8bit,\@object
.Lrem_8bit:
.value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E
.value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E
.value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E
.value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E
.value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E
.value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E
.value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E
.value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E
.value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE
.value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE
.value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE
.value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE
.value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E
.value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E
.value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE
.value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE
.value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E
.value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E
.value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E
.value 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E
.value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E
.value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E
.value 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E
.value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E
.value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE
.value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE
.value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE
.value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE
.value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E
.value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E
.value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE
.value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE
.asciz "GHASH for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
.align 64
___
# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
# CONTEXT *context,DISPATCHER_CONTEXT *disp)
if ($win64) {
$rec="%rcx";
$frame="%rdx";
$context="%r8";
$disp="%r9";
$code.=<<___;
.extern __imp_RtlVirtualUnwind
.type se_handler,\@abi-omnipotent
.align 16
se_handler:
push %rsi
push %rdi
push %rbx
push %rbp
push %r12
push %r13
push %r14
push %r15
pushfq
sub \$64,%rsp
mov 120($context),%rax # pull context->Rax
mov 248($context),%rbx # pull context->Rip
mov 8($disp),%rsi # disp->ImageBase
mov 56($disp),%r11 # disp->HandlerData
mov 0(%r11),%r10d # HandlerData[0]
lea (%rsi,%r10),%r10 # prologue label
cmp %r10,%rbx # context->Rip<prologue label
jb .Lin_prologue
mov 152($context),%rax # pull context->Rsp
mov 4(%r11),%r10d # HandlerData[1]
lea (%rsi,%r10),%r10 # epilogue label
cmp %r10,%rbx # context->Rip>=epilogue label
jae .Lin_prologue
lea 48+280(%rax),%rax # adjust "rsp"
mov -8(%rax),%rbx
mov -16(%rax),%rbp
mov -24(%rax),%r12
mov -32(%rax),%r13
mov -40(%rax),%r14
mov -48(%rax),%r15
mov %rbx,144($context) # restore context->Rbx
mov %rbp,160($context) # restore context->Rbp
mov %r12,216($context) # restore context->R12
mov %r13,224($context) # restore context->R13
mov %r14,232($context) # restore context->R14
mov %r15,240($context) # restore context->R15
.Lin_prologue:
mov 8(%rax),%rdi
mov 16(%rax),%rsi
mov %rax,152($context) # restore context->Rsp
mov %rsi,168($context) # restore context->Rsi
mov %rdi,176($context) # restore context->Rdi
mov 40($disp),%rdi # disp->ContextRecord
mov $context,%rsi # context
mov \$`1232/8`,%ecx # sizeof(CONTEXT)
.long 0xa548f3fc # cld; rep movsq
mov $disp,%rsi
xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER
mov 8(%rsi),%rdx # arg2, disp->ImageBase
mov 0(%rsi),%r8 # arg3, disp->ControlPc
mov 16(%rsi),%r9 # arg4, disp->FunctionEntry
mov 40(%rsi),%r10 # disp->ContextRecord
lea 56(%rsi),%r11 # &disp->HandlerData
lea 24(%rsi),%r12 # &disp->EstablisherFrame
mov %r10,32(%rsp) # arg5
mov %r11,40(%rsp) # arg6
mov %r12,48(%rsp) # arg7
mov %rcx,56(%rsp) # arg8, (NULL)
call *__imp_RtlVirtualUnwind(%rip)
mov \$1,%eax # ExceptionContinueSearch
add \$64,%rsp
popfq
pop %r15
pop %r14
pop %r13
pop %r12
pop %rbp
pop %rbx
pop %rdi
pop %rsi
ret
.size se_handler,.-se_handler
.section .pdata
.align 4
.rva .LSEH_begin_gcm_gmult_4bit
.rva .LSEH_end_gcm_gmult_4bit
.rva .LSEH_info_gcm_gmult_4bit
.rva .LSEH_begin_gcm_ghash_4bit
.rva .LSEH_end_gcm_ghash_4bit
.rva .LSEH_info_gcm_ghash_4bit
.rva .LSEH_begin_gcm_init_clmul
.rva .LSEH_end_gcm_init_clmul
.rva .LSEH_info_gcm_init_clmul
@@ -1771,14 +1340,6 @@ ___
$code.=<<___;
.section .xdata
.align 8
.LSEH_info_gcm_gmult_4bit:
.byte 9,0,0,0
.rva se_handler
.rva .Lgmult_prologue,.Lgmult_epilogue # HandlerData
.LSEH_info_gcm_ghash_4bit:
.byte 9,0,0,0
.rva se_handler
.rva .Lghash_prologue,.Lghash_epilogue # HandlerData
.LSEH_info_gcm_init_clmul:
.byte 0x01,0x08,0x03,0x00
.byte 0x08,0x68,0x00,0x00 #movaps 0x00(rsp),xmm6

View File

@@ -58,7 +58,6 @@
#include "../../internal.h"
#define PACK(s) ((size_t)(s) << (sizeof(size_t) * 8 - 16))
#define REDUCE1BIT(V) \
do { \
if (sizeof(size_t) == 8) { \
@@ -104,138 +103,11 @@ void gcm_init_4bit(u128 Htable[16], const uint64_t H[2]) {
Htable[13].hi = V.hi ^ Htable[5].hi, Htable[13].lo = V.lo ^ Htable[5].lo;
Htable[14].hi = V.hi ^ Htable[6].hi, Htable[14].lo = V.lo ^ Htable[6].lo;
Htable[15].hi = V.hi ^ Htable[7].hi, Htable[15].lo = V.lo ^ Htable[7].lo;
#if defined(GHASH_ASM) && defined(OPENSSL_ARM)
for (int j = 0; j < 16; ++j) {
V = Htable[j];
Htable[j].hi = V.lo;
Htable[j].lo = V.hi;
}
#endif
}
#if !defined(GHASH_ASM) || defined(OPENSSL_AARCH64) || defined(OPENSSL_PPC64LE)
static const size_t rem_4bit[16] = {
PACK(0x0000), PACK(0x1C20), PACK(0x3840), PACK(0x2460),
PACK(0x7080), PACK(0x6CA0), PACK(0x48C0), PACK(0x54E0),
PACK(0xE100), PACK(0xFD20), PACK(0xD940), PACK(0xC560),
PACK(0x9180), PACK(0x8DA0), PACK(0xA9C0), PACK(0xB5E0)};
void gcm_gmult_4bit(uint64_t Xi[2], const u128 Htable[16]) {
u128 Z;
int cnt = 15;
size_t rem, nlo, nhi;
nlo = ((const uint8_t *)Xi)[15];
nhi = nlo >> 4;
nlo &= 0xf;
Z.hi = Htable[nlo].hi;
Z.lo = Htable[nlo].lo;
while (1) {
rem = (size_t)Z.lo & 0xf;
Z.lo = (Z.hi << 60) | (Z.lo >> 4);
Z.hi = (Z.hi >> 4);
if (sizeof(size_t) == 8) {
Z.hi ^= rem_4bit[rem];
} else {
Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
}
Z.hi ^= Htable[nhi].hi;
Z.lo ^= Htable[nhi].lo;
if (--cnt < 0) {
break;
}
nlo = ((const uint8_t *)Xi)[cnt];
nhi = nlo >> 4;
nlo &= 0xf;
rem = (size_t)Z.lo & 0xf;
Z.lo = (Z.hi << 60) | (Z.lo >> 4);
Z.hi = (Z.hi >> 4);
if (sizeof(size_t) == 8) {
Z.hi ^= rem_4bit[rem];
} else {
Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
}
Z.hi ^= Htable[nlo].hi;
Z.lo ^= Htable[nlo].lo;
}
Xi[0] = CRYPTO_bswap8(Z.hi);
Xi[1] = CRYPTO_bswap8(Z.lo);
}
// Streamed gcm_mult_4bit, see CRYPTO_gcm128_[en|de]crypt for
// details... Compiler-generated code doesn't seem to give any
// performance improvement, at least not on x86[_64]. It's here
// mostly as reference and a placeholder for possible future
// non-trivial optimization[s]...
void gcm_ghash_4bit(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
size_t len) {
u128 Z;
int cnt;
size_t rem, nlo, nhi;
do {
cnt = 15;
nlo = ((const uint8_t *)Xi)[15];
nlo ^= inp[15];
nhi = nlo >> 4;
nlo &= 0xf;
Z.hi = Htable[nlo].hi;
Z.lo = Htable[nlo].lo;
while (1) {
rem = (size_t)Z.lo & 0xf;
Z.lo = (Z.hi << 60) | (Z.lo >> 4);
Z.hi = (Z.hi >> 4);
if (sizeof(size_t) == 8) {
Z.hi ^= rem_4bit[rem];
} else {
Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
}
Z.hi ^= Htable[nhi].hi;
Z.lo ^= Htable[nhi].lo;
if (--cnt < 0) {
break;
}
nlo = ((const uint8_t *)Xi)[cnt];
nlo ^= inp[cnt];
nhi = nlo >> 4;
nlo &= 0xf;
rem = (size_t)Z.lo & 0xf;
Z.lo = (Z.hi << 60) | (Z.lo >> 4);
Z.hi = (Z.hi >> 4);
if (sizeof(size_t) == 8) {
Z.hi ^= rem_4bit[rem];
} else {
Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
}
Z.hi ^= Htable[nlo].hi;
Z.lo ^= Htable[nlo].lo;
}
Xi[0] = CRYPTO_bswap8(Z.hi);
Xi[1] = CRYPTO_bswap8(Z.lo);
} while (inp += 16, len -= 16);
}
#endif // !GHASH_ASM || AARCH64 || PPC64LE
#define GCM_MUL(ctx, Xi) gcm_gmult_4bit((ctx)->Xi.u, (ctx)->gcm_key.Htable)
#define GCM_MUL(ctx, Xi) gcm_gmult_nohw((ctx)->Xi.u, (ctx)->gcm_key.Htable)
#define GHASH(ctx, in, len) \
gcm_ghash_4bit((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
gcm_ghash_nohw((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
// GHASH_CHUNK is "stride parameter" missioned to mitigate cache
// trashing effect. In other words idea is to hash data while it's
// still in L1 cache after encryption pass...
@@ -268,13 +140,13 @@ void gcm_init_ssse3(u128 Htable[16], const uint64_t Xi[2]) {
}
#endif // GHASH_ASM_X86_64 || GHASH_ASM_X86
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
#undef GCM_MUL
#define GCM_MUL(ctx, Xi) (*gcm_gmult_p)((ctx)->Xi.u, (ctx)->gcm_key.Htable)
#undef GHASH
#define GHASH(ctx, in, len) \
(*gcm_ghash_p)((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
#endif // GCM_FUNCREF_4BIT
#endif // GCM_FUNCREF
void CRYPTO_ghash_init(gmult_func *out_mult, ghash_func *out_hash,
u128 *out_key, u128 out_table[16], int *out_is_avx,
@@ -350,13 +222,18 @@ void CRYPTO_ghash_init(gmult_func *out_mult, ghash_func *out_hash,
}
#endif
gcm_init_4bit(out_table, H.u);
#if defined(GHASH_ASM_X86)
// TODO(davidben): This implementation is not constant-time, but it is
// dramatically faster than |gcm_gmult_nohw|. See if we can get a
// constant-time SSE2 implementation to close this gap, or decide we don't
// care.
gcm_init_4bit(out_table, H.u);
*out_mult = gcm_gmult_4bit_mmx;
*out_hash = gcm_ghash_4bit_mmx;
#else
*out_mult = gcm_gmult_4bit;
*out_hash = gcm_ghash_4bit;
gcm_init_nohw(out_table, H.u);
*out_mult = gcm_gmult_nohw;
*out_hash = gcm_ghash_nohw;
#endif
}
@@ -378,7 +255,7 @@ void CRYPTO_gcm128_init_key(GCM128_KEY *gcm_key, const AES_KEY *aes_key,
void CRYPTO_gcm128_setiv(GCM128_CONTEXT *ctx, const AES_KEY *key,
const uint8_t *iv, size_t len) {
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
#endif
@@ -427,13 +304,11 @@ void CRYPTO_gcm128_setiv(GCM128_CONTEXT *ctx, const AES_KEY *key,
}
int CRYPTO_gcm128_aad(GCM128_CONTEXT *ctx, const uint8_t *aad, size_t len) {
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
#ifdef GHASH
void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
size_t len) = ctx->gcm_key.ghash;
#endif
#endif
if (ctx->len.u[1]) {
@@ -484,7 +359,7 @@ int CRYPTO_gcm128_aad(GCM128_CONTEXT *ctx, const uint8_t *aad, size_t len) {
int CRYPTO_gcm128_encrypt(GCM128_CONTEXT *ctx, const AES_KEY *key,
const uint8_t *in, uint8_t *out, size_t len) {
block128_f block = ctx->gcm_key.block;
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -572,7 +447,7 @@ int CRYPTO_gcm128_decrypt(GCM128_CONTEXT *ctx, const AES_KEY *key,
const unsigned char *in, unsigned char *out,
size_t len) {
block128_f block = ctx->gcm_key.block;
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -663,7 +538,7 @@ int CRYPTO_gcm128_decrypt(GCM128_CONTEXT *ctx, const AES_KEY *key,
int CRYPTO_gcm128_encrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
const uint8_t *in, uint8_t *out, size_t len,
ctr128_f stream) {
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -749,7 +624,7 @@ int CRYPTO_gcm128_encrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
int CRYPTO_gcm128_decrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
const uint8_t *in, uint8_t *out, size_t len,
ctr128_f stream) {
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -837,7 +712,7 @@ int CRYPTO_gcm128_decrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
}
int CRYPTO_gcm128_finish(GCM128_CONTEXT *ctx, const uint8_t *tag, size_t len) {
#ifdef GCM_FUNCREF_4BIT
#ifdef GCM_FUNCREF
void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
ctx->gcm_key.gmult;
#endif
@@ -868,7 +743,7 @@ void CRYPTO_gcm128_tag(GCM128_CONTEXT *ctx, unsigned char *tag, size_t len) {
#if defined(OPENSSL_X86) || defined(OPENSSL_X86_64)
int crypto_gcm_clmul_enabled(void) {
#ifdef GHASH_ASM
#if defined(GHASH_ASM_X86) || defined(GHASH_ASM_X86_64)
const uint32_t *ia32cap = OPENSSL_ia32cap_get();
return (ia32cap[0] & (1 << 24)) && // check FXSR bit
(ia32cap[1] & (1 << 1)); // check PCLMULQDQ bit

View File

@@ -0,0 +1,233 @@
/* Copyright (c) 2019, Google Inc.
*
* Permission to use, copy, modify, and/or distribute this software for any
* purpose with or without fee is hereby granted, provided that the above
* copyright notice and this permission notice appear in all copies.
*
* THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
* SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
* OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
* CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */
#include <openssl/base.h>
#include "../../internal.h"
#include "internal.h"
// This file contains a constant-time implementation of GHASH based on the notes
// in https://bearssl.org/constanttime.html#ghash-for-gcm and the reduction
// algorithm described in
// https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.
//
// Unlike the BearSSL notes, we use uint128_t in the 64-bit implementation. Our
// primary compilers (clang, clang-cl, and gcc) all support it. MSVC will run
// the 32-bit implementation, but we can use its intrinsics if necessary.
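// (For example, on 64-bit MSVC the 64x64 -> 128-bit products could come from
// the _umul128 intrinsic.)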
#if defined(BORINGSSL_HAS_UINT128)
static void gcm_mul64_nohw(uint64_t *out_lo, uint64_t *out_hi, uint64_t a,
uint64_t b) {
// One term every four bits means the largest term is 64/4 = 16, which barely
// overflows into the next term. Using one term every five bits would cost 25
// multiplications instead of 16. It is faster to mask off the bottom four
// bits of |a|, giving a largest term of 60/4 = 15, and apply the bottom bits
// separately.
uint64_t a0 = a & UINT64_C(0x1111111111111110);
uint64_t a1 = a & UINT64_C(0x2222222222222220);
uint64_t a2 = a & UINT64_C(0x4444444444444440);
uint64_t a3 = a & UINT64_C(0x8888888888888880);
uint64_t b0 = b & UINT64_C(0x1111111111111111);
uint64_t b1 = b & UINT64_C(0x2222222222222222);
uint64_t b2 = b & UINT64_C(0x4444444444444444);
uint64_t b3 = b & UINT64_C(0x8888888888888888);
uint128_t c0 = (a0 * (uint128_t)b0) ^ (a1 * (uint128_t)b3) ^
(a2 * (uint128_t)b2) ^ (a3 * (uint128_t)b1);
uint128_t c1 = (a0 * (uint128_t)b1) ^ (a1 * (uint128_t)b0) ^
(a2 * (uint128_t)b3) ^ (a3 * (uint128_t)b2);
uint128_t c2 = (a0 * (uint128_t)b2) ^ (a1 * (uint128_t)b1) ^
(a2 * (uint128_t)b0) ^ (a3 * (uint128_t)b3);
uint128_t c3 = (a0 * (uint128_t)b3) ^ (a1 * (uint128_t)b2) ^
(a2 * (uint128_t)b1) ^ (a3 * (uint128_t)b0);
// Multiply the bottom four bits of |a| with |b|.
uint64_t a0_mask = UINT64_C(0) - (a & 1);
uint64_t a1_mask = UINT64_C(0) - ((a >> 1) & 1);
uint64_t a2_mask = UINT64_C(0) - ((a >> 2) & 1);
uint64_t a3_mask = UINT64_C(0) - ((a >> 3) & 1);
uint128_t extra = (a0_mask & b) ^ ((uint128_t)(a1_mask & b) << 1) ^
((uint128_t)(a2_mask & b) << 2) ^
((uint128_t)(a3_mask & b) << 3);
*out_lo = (((uint64_t)c0) & UINT64_C(0x1111111111111111)) ^
(((uint64_t)c1) & UINT64_C(0x2222222222222222)) ^
(((uint64_t)c2) & UINT64_C(0x4444444444444444)) ^
(((uint64_t)c3) & UINT64_C(0x8888888888888888)) ^ ((uint64_t)extra);
*out_hi = (((uint64_t)(c0 >> 64)) & UINT64_C(0x1111111111111111)) ^
(((uint64_t)(c1 >> 64)) & UINT64_C(0x2222222222222222)) ^
(((uint64_t)(c2 >> 64)) & UINT64_C(0x4444444444444444)) ^
(((uint64_t)(c3 >> 64)) & UINT64_C(0x8888888888888888)) ^
((uint64_t)(extra >> 64));
}
#else // !BORINGSSL_HAS_UINT128
static uint64_t gcm_mul32_nohw(uint32_t a, uint32_t b) {
// One term every four bits means the largest term is 32/4 = 8, which does not
// overflow into the next term.
uint32_t a0 = a & 0x11111111;
uint32_t a1 = a & 0x22222222;
uint32_t a2 = a & 0x44444444;
uint32_t a3 = a & 0x88888888;
uint32_t b0 = b & 0x11111111;
uint32_t b1 = b & 0x22222222;
uint32_t b2 = b & 0x44444444;
uint32_t b3 = b & 0x88888888;
uint64_t c0 = (a0 * (uint64_t)b0) ^ (a1 * (uint64_t)b3) ^
(a2 * (uint64_t)b2) ^ (a3 * (uint64_t)b1);
uint64_t c1 = (a0 * (uint64_t)b1) ^ (a1 * (uint64_t)b0) ^
(a2 * (uint64_t)b3) ^ (a3 * (uint64_t)b2);
uint64_t c2 = (a0 * (uint64_t)b2) ^ (a1 * (uint64_t)b1) ^
(a2 * (uint64_t)b0) ^ (a3 * (uint64_t)b3);
uint64_t c3 = (a0 * (uint64_t)b3) ^ (a1 * (uint64_t)b2) ^
(a2 * (uint64_t)b1) ^ (a3 * (uint64_t)b0);
return (c0 & UINT64_C(0x1111111111111111)) |
(c1 & UINT64_C(0x2222222222222222)) |
(c2 & UINT64_C(0x4444444444444444)) |
(c3 & UINT64_C(0x8888888888888888));
}
static void gcm_mul64_nohw(uint64_t *out_lo, uint64_t *out_hi, uint64_t a,
uint64_t b) {
uint32_t a0 = a & 0xffffffff;
uint32_t a1 = a >> 32;
uint32_t b0 = b & 0xffffffff;
uint32_t b1 = b >> 32;
// Karatsuba multiplication.
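// Over GF(2), with z = x^32,
//   (a1*z + a0) * (b1*z + b0) = hi*z^2 + mid*z + lo,
// where mid = (a0 ^ a1)*(b0 ^ b1) ^ lo ^ hi, so three 32-bit carry-less
// multiplies suffice instead of four.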
uint64_t lo = gcm_mul32_nohw(a0, b0);
uint64_t hi = gcm_mul32_nohw(a1, b1);
uint64_t mid = gcm_mul32_nohw(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
*out_lo = lo ^ (mid << 32);
*out_hi = hi ^ (mid >> 32);
}
#endif // BORINGSSL_HAS_UINT128
void gcm_init_nohw(u128 Htable[16], const uint64_t Xi[2]) {
// We implement GHASH in terms of POLYVAL, as described in RFC8452. This
// avoids a shift by 1 in the multiplication, needed to account for bit
// reversal losing a bit after multiplication, that is,
// rev128(X) * rev128(Y) = rev255(X*Y).
//
// Per Appendix A, we run mulX_POLYVAL. Note this is the same transformation
// applied by |gcm_init_clmul|, etc. Note |Xi| has already been byteswapped.
//
// See also slide 16 of
// https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf
Htable[0].lo = Xi[1];
Htable[0].hi = Xi[0];
uint64_t carry = Htable[0].hi >> 63;
carry = 0u - carry;
Htable[0].hi <<= 1;
Htable[0].hi |= Htable[0].lo >> 63;
Htable[0].lo <<= 1;
// The irreducible polynomial is 1 + x^121 + x^126 + x^127 + x^128, so we
// conditionally add 0xc200...0001.
Htable[0].lo ^= carry & 1;
Htable[0].hi ^= carry & UINT64_C(0xc200000000000000);
// This implementation does not use the rest of |Htable|.
}
static void gcm_polyval_nohw(uint64_t Xi[2], const u128 *H) {
// Karatsuba multiplication. The product of |Xi| and |H| is stored in |r0|
// through |r3|. Note there is no byte or bit reversal because we are
// evaluating POLYVAL.
uint64_t r0, r1;
gcm_mul64_nohw(&r0, &r1, Xi[0], H->lo);
uint64_t r2, r3;
gcm_mul64_nohw(&r2, &r3, Xi[1], H->hi);
uint64_t mid0, mid1;
gcm_mul64_nohw(&mid0, &mid1, Xi[0] ^ Xi[1], H->hi ^ H->lo);
mid0 ^= r0 ^ r2;
mid1 ^= r1 ^ r3;
r2 ^= mid1;
r1 ^= mid0;
// Now we multiply our 256-bit result by x^-128 and reduce. |r2| and
// |r3| shift into position and we must multiply |r0| and |r1| by x^-128. We
// have:
//
// 1 = x^121 + x^126 + x^127 + x^128
// x^-128 = x^-7 + x^-2 + x^-1 + 1
//
// This is the GHASH reduction step, but with bits flowing in reverse.
// The x^-7, x^-2, and x^-1 terms shift bits past x^0, which would require
// another reduction step. Instead, we gather the excess bits, incorporate
// them into |r0| and |r1| and reduce once. See slides 17-19
// of https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.
r1 ^= (r0 << 63) ^ (r0 << 62) ^ (r0 << 57);
// 1
r2 ^= r0;
r3 ^= r1;
// x^-1
r2 ^= r0 >> 1;
r2 ^= r1 << 63;
r3 ^= r1 >> 1;
// x^-2
r2 ^= r0 >> 2;
r2 ^= r1 << 62;
r3 ^= r1 >> 2;
// x^-7
r2 ^= r0 >> 7;
r2 ^= r1 << 57;
r3 ^= r1 >> 7;
Xi[0] = r2;
Xi[1] = r3;
}
void gcm_gmult_nohw(uint64_t Xi[2], const u128 Htable[16]) {
uint64_t swapped[2];
swapped[0] = CRYPTO_bswap8(Xi[1]);
swapped[1] = CRYPTO_bswap8(Xi[0]);
gcm_polyval_nohw(swapped, &Htable[0]);
Xi[0] = CRYPTO_bswap8(swapped[1]);
Xi[1] = CRYPTO_bswap8(swapped[0]);
}
void gcm_ghash_nohw(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
size_t len) {
uint64_t swapped[2];
swapped[0] = CRYPTO_bswap8(Xi[1]);
swapped[1] = CRYPTO_bswap8(Xi[0]);
while (len >= 16) {
uint64_t block[2];
OPENSSL_memcpy(block, inp, 16);
swapped[0] ^= CRYPTO_bswap8(block[1]);
swapped[1] ^= CRYPTO_bswap8(block[0]);
gcm_polyval_nohw(swapped, &Htable[0]);
inp += 16;
len -= 16;
}
Xi[0] = CRYPTO_bswap8(swapped[1]);
Xi[1] = CRYPTO_bswap8(swapped[0]);
}

View File

@@ -119,7 +119,7 @@ TEST(GCMTest, ByteSwap) {
CRYPTO_bswap8(UINT64_C(0x0102030405060708)));
}
#if defined(SUPPORTS_ABI_TEST) && defined(GHASH_ASM)
#if defined(SUPPORTS_ABI_TEST) && !defined(OPENSSL_NO_ASM)
TEST(GCMTest, ABI) {
static const uint64_t kH[2] = {
UINT64_C(0x66e94bd4ef8a2c3b),
@@ -135,17 +135,12 @@ TEST(GCMTest, ABI) {
};
alignas(16) u128 Htable[16];
CHECK_ABI(gcm_init_4bit, Htable, kH);
#if defined(GHASH_ASM_X86)
CHECK_ABI(gcm_init_4bit, Htable, kH);
CHECK_ABI(gcm_gmult_4bit_mmx, X, Htable);
for (size_t blocks : kBlockCounts) {
CHECK_ABI(gcm_ghash_4bit_mmx, X, Htable, buf, 16 * blocks);
}
#else
CHECK_ABI(gcm_gmult_4bit, X, Htable);
for (size_t blocks : kBlockCounts) {
CHECK_ABI(gcm_ghash_4bit, X, Htable, buf, 16 * blocks);
}
#endif // GHASH_ASM_X86
#if defined(GHASH_ASM_X86) || defined(GHASH_ASM_X86_64)
@@ -222,4 +217,4 @@ TEST(GCMTest, ABI) {
}
#endif // GHASH_ASM_ARM
}
#endif // SUPPORTS_ABI_TEST && GHASH_ASM
#endif // SUPPORTS_ABI_TEST && !OPENSSL_NO_ASM

View File

@@ -261,22 +261,17 @@ OPENSSL_EXPORT void CRYPTO_gcm128_tag(GCM128_CONTEXT *ctx, uint8_t *tag,
// GCM assembly.
#if !defined(OPENSSL_NO_ASM) && \
(defined(OPENSSL_X86) || defined(OPENSSL_X86_64) || \
defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64) || \
defined(OPENSSL_PPC64LE))
#define GHASH_ASM
#endif
void gcm_init_4bit(u128 Htable[16], const uint64_t H[2]);
void gcm_gmult_4bit(uint64_t Xi[2], const u128 Htable[16]);
void gcm_ghash_4bit(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
void gcm_init_nohw(u128 Htable[16], const uint64_t H[2]);
void gcm_gmult_nohw(uint64_t Xi[2], const u128 Htable[16]);
void gcm_ghash_nohw(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
size_t len);
#if defined(GHASH_ASM)
#if !defined(OPENSSL_NO_ASM)
#if defined(OPENSSL_X86) || defined(OPENSSL_X86_64)
#define GCM_FUNCREF_4BIT
#define GCM_FUNCREF
void gcm_init_clmul(u128 Htable[16], const uint64_t Xi[2]);
void gcm_gmult_clmul(uint64_t Xi[2], const u128 Htable[16]);
void gcm_ghash_clmul(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -316,7 +311,7 @@ void gcm_ghash_4bit_mmx(uint64_t Xi[2], const u128 Htable[16], const uint8_t *in
#elif defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64)
#define GHASH_ASM_ARM
#define GCM_FUNCREF_4BIT
#define GCM_FUNCREF
OPENSSL_INLINE int gcm_pmull_capable(void) {
return CRYPTO_is_ARMv8_PMULL_capable();
@@ -336,13 +331,13 @@ void gcm_ghash_neon(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
#elif defined(OPENSSL_PPC64LE)
#define GHASH_ASM_PPC64LE
#define GCM_FUNCREF_4BIT
#define GCM_FUNCREF
void gcm_init_p8(u128 Htable[16], const uint64_t Xi[2]);
void gcm_gmult_p8(uint64_t Xi[2], const u128 Htable[16]);
void gcm_ghash_p8(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
size_t len);
#endif
#endif // GHASH_ASM
#endif // OPENSSL_NO_ASM
// CBC.