On Performance

BearSSL’s primary optimisation goal is to reduce compiled code size. This does not mean that raw execution speed is unimportant; only that, when faced with a size/speed trade-off, BearSSL tends to put more emphasis on the “size” measure than most cryptographic libraries do. For instance, the RSA implementation uses generic code that can handle all key sizes, with no specific code for common key sizes (like 2048 bits).

Yet execution speed still matters, though situations vary a lot. For instance, some constrained environments need fast asymmetric crypto primitives, because the human user is waiting while the SSL handshake occurs; conversely, symmetric encryption speed might not matter as much if it is used to support I/O over a slow network (it does not take much CPU to support encryption of a 115200-baud serial link).

BearSSL’s APIs are meant to be flexible: for all algorithms, different implementations may be used, and possibly externally provided, with no modification of the library source code (this uses function pointers, not conditional compilation with preprocessor directives). BearSSL already offers several choices for most algorithms, embodying various trade-offs. The figures below illustrate the consequences of these choices.
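As a minimal sketch of what selection through function pointers looks like in C, the block below uses a toy engine with two interchangeable CRC-32 routines, one compact and one table-based. All names here (checksum_fn, engine, engine_set_checksum, engine_run) are hypothetical and are not part of BearSSL's API; CRC-32 merely stands in for a real cryptographic primitive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface: every implementation of the primitive
   has the same function-pointer type. */
typedef uint32_t (*checksum_fn)(const unsigned char *data, size_t len);

/* "Small" variant: bit-by-bit CRC-32, no table. */
static uint32_t crc32_small(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* "Big" variant: same function, driven by a 256-entry table
   (more data, fewer operations per byte). */
static uint32_t crc32_table[256];
static int crc32_table_ready;

static uint32_t crc32_big(const unsigned char *data, size_t len)
{
    if (!crc32_table_ready) {
        for (uint32_t n = 0; n < 256; n++) {
            uint32_t c = n;
            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            crc32_table[n] = c;
        }
        crc32_table_ready = 1;
    }
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

/* Hypothetical engine: the caller chooses the implementation at
   run time; nothing is selected with #ifdef. */
typedef struct {
    checksum_fn checksum;
} engine;

static void engine_set_checksum(engine *e, checksum_fn impl)
{
    e->checksum = impl;
}

static uint32_t engine_run(const engine *e,
                           const unsigned char *data, size_t len)
{
    assert(e->checksum != NULL);
    return e->checksum(data, len);
}
```

A caller would invoke engine_set_checksum(&e, crc32_small) or engine_set_checksum(&e, crc32_big) and then use engine_run() unchanged. An externally provided routine with the same signature could be plugged in the same way.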

The following points must be remembered:

  • Most BearSSL implementations are written in generic, portable C code. In its current state, BearSSL has few architecture-specific routines.

  • Compilation is done with the default compiler on the platform (GCC), with the “-Os” optimisation flag, which again favours size over speed. Of note, for RSA with the “i62” code, using Clang instead of GCC would offer a substantial speed-up (472 private-key operations per second instead of 355, on the “amd64” platform).

In future versions, additional implementations, notably architecture-specific routines with assembly, will be added. One notable planned import is the very efficient Curve25519 code from the µNaCl library.

Measuring Speed

Four platforms are presented here, using four different architectures:

  • amd64: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 64-bit mode. This is a rather powerful, yet already aging processor (launched in 2012). Compiler is GCC 5.4.0, with flags “-Os -fPIC”.

  • i386: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 32-bit mode. Compiler is GCC 5.4.0, with flags “-Os -fPIC”. This is the same platform as “amd64”, but used in 32-bit mode.

  • pwr8: a POWER8E CPU, at 3.425 GHz, in 64-bit little-endian mode (“ppc64le”). Compiler is GCC 6.2.0, with flags “-Os -fPIC”.

  • m0+: an ARM Cortex M0+, at 48 MHz (Atmel SAM D20 microcontroller). Compiler is GCC 4.9.3, with flags “-Os -mthumb -mcpu=cortex-m0plus”.

Measures are done by repeatedly processing a given data buffer (8 kilobytes) or running the relevant process (e.g. elliptic curve point multiplication) so that the total CPU time is at least two seconds (one second on the M0+, where the clock has millisecond precision). On Unix-like systems, the clock() function measures CPU time allocated to the computation. Due to the unavoidable variance on multitasking operating systems, values should be considered accurate to within ±3%.

On the M0+, there is no operating system, but a 1 kHz interrupt is set, which allows elapsed time to be counted with millisecond precision (measures show that the IRQ handler takes less than a microsecond, thus the timer overhead is negligible).
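For Unix-like systems, the clock()-based measurement can be sketched as follows. This is an illustration only: the names measure_throughput, process_fn and xor_all are hypothetical, and doubling the iteration count until the time threshold is reached is an assumption about how such a loop may be organised, not a description of the actual benchmark code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef void (*process_fn)(const unsigned char *buf, size_t len);

/* Dummy workload (an assumption, for illustration): XOR of all bytes.
   The volatile sink keeps the result observable. */
static volatile unsigned char sink;

static void xor_all(const unsigned char *buf, size_t len)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < len; i++)
        acc ^= buf[i];
    sink = acc;
}

/* Process an 8-kilobyte buffer repeatedly until at least min_seconds
   of CPU time have been consumed in one run; return bytes per second
   of CPU time. */
static double measure_throughput(process_fn fn, double min_seconds)
{
    static unsigned char buf[8192];
    unsigned long num = 1;

    assert(min_seconds > 0.0);
    memset(buf, 'T', sizeof buf);
    fn(buf, sizeof buf);                 /* warm-up run, not timed */

    for (;;) {
        clock_t begin = clock();
        for (unsigned long k = 0; k < num; k++)
            fn(buf, sizeof buf);
        clock_t end = clock();
        double tt = (double)(end - begin) / CLOCKS_PER_SEC;
        if (tt >= min_seconds)
            return (double)sizeof buf * (double)num / tt;
        num <<= 1;                       /* too short: double and retry */
    }
}
```

Calling measure_throughput(xor_all, 2.0) would mirror the two-second threshold used here; dividing the result by 1000000 gives megabytes per second.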

All measures have used BearSSL-0.4. Later versions might contain changes inducing different (hopefully better) performance.

Measures

Hash Functions

Processing speed for a long message (i.e. without padding overhead) is given in megabytes per second, except for the M0+ where values are expressed in kilobytes per second.

Implementation   amd64 (MB/s)   i386 (MB/s)   pwr8 (MB/s)   m0+ (kB/s)
md5              516.80         357.41        248.29        1046.48
sha1             331.05         257.90        176.39        669.59
sha256           211.60         166.65        128.65        320.08
sha512           331.56         84.42         206.44        156.13

Symmetric Encryption

For AES, we again measure asymptotic speed for CBC encryption, CBC decryption and CTR mode; key schedule time is not measured. For 3DES, we also measure CBC encryption and CBC decryption, but there is no CTR mode. ChaCha20 performance is also provided.

Implementations are:

  • AES “big”: a classic implementation with lookup tables (not constant-time).

  • AES “small”: a compact implementation with small tables (not constant-time).

  • AES “ct”: a constant-time implementation with bitslicing over 32-bit registers; two blocks are processed in parallel when the encryption mode allows it (CTR and CBC decryption, but not CBC encryption).

  • AES “ct64”: a constant-time implementation similar to “ct” but using 64-bit variables.

  • AES “x86ni”: an implementation using the AES-NI opcodes provided by recent (since late 2011) x86 CPUs.

  • AES “pwr8”: an implementation using the cryptographic opcodes of POWER8 processors.

  • 3DES “tab”: a classic, table-based 3DES implementation (not constant-time).

  • 3DES “ct”: a constant-time implementation that uses bitslicing internally.

  • ChaCha20 “ct”: a straightforward constant-time implementation of ChaCha20.
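To illustrate why the table-based implementations above are flagged as not constant-time, the sketch below contrasts a direct table lookup, whose memory access pattern depends on a possibly secret index, with a lookup that reads every entry and selects the right one with a mask. The function names are hypothetical. This is not how the “ct” implementations work (they rely on bitslicing, which avoids tables altogether); it only shows the distinction, and a compiler is in principle free to undo such precautions.

```c
#include <stddef.h>
#include <stdint.h>

/* Direct indexing: the address read depends on idx, so cache
   behaviour may leak information about idx. */
static uint8_t lookup_direct(const uint8_t table[256], uint8_t idx)
{
    return table[idx];
}

/* Constant-time variant: all 256 entries are read in the same order
   whatever idx is; a mask keeps only the matching entry. */
static uint8_t lookup_ct(const uint8_t table[256], uint8_t idx)
{
    uint8_t result = 0;
    for (unsigned i = 0; i < 256; i++) {
        /* diff is 0 only when i == idx; otherwise it is in 1..255. */
        uint32_t diff = (uint32_t)i ^ idx;
        /* diff - 1 wraps to 0xFFFFFFFF when diff == 0, so the mask
           is 0xFF for the matching entry and 0x00 for all others. */
        uint32_t mask = ((diff - 1) >> 8) & 0xFF;
        result |= (uint8_t)(table[i] & mask);
    }
    return result;
}
```

Both functions return the same value for every index; the second performs 256 reads per lookup, which hints at the speed cost paid for constant-time behaviour.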

Note that in SSL, a MAC is also applied on the bulk of the data. CBC cipher suites use HMAC, which processes data at roughly the same speed as the underlying hash function. For GCM, the MAC part is GHASH; for ChaCha20, Poly1305 is used.

Here again, speed is provided in megabytes per second, except for the M0+, where performance is given in kilobytes per second.

Function              Implementation   amd64 (MB/s)   i386 (MB/s)   pwr8 (MB/s)   m0+ (kB/s)
AES-128 CBC encrypt   big              162.88         135.58        111.86        198.14
AES-128 CBC encrypt   small            39.81          32.89         23.84         75.76
AES-128 CBC encrypt   ct               28.47          21.94         21.44         58.10
AES-128 CBC encrypt   ct64             27.17          10.27         20.68         28.95
AES-128 CBC encrypt   x86ni            679.76         679.74        -             -
AES-128 CBC encrypt   pwr8             -              -             887.53        -
AES-128 CBC decrypt   big              177.73         132.49        118.99        193.46
AES-128 CBC decrypt   small            22.58          17.45         16.13         45.99
AES-128 CBC decrypt   ct               43.71          31.17         32.24         82.85
AES-128 CBC decrypt   ct64             78.45          28.36         56.32         77.28
AES-128 CBC decrypt   x86ni            2366.62        2423.44       -             -
AES-128 CBC decrypt   pwr8             -              -             4085.88       -
AES-128 CTR           big              169.80         127.21        116.56        194.76
AES-128 CTR           small            40.06          32.31         24.01         74.94
AES-128 CTR           ct               55.01          42.82         41.17         112.70
AES-128 CTR           ct64             92.38          37.14         70.26         105.28
AES-128 CTR           x86ni            2426.27        2393.89       -             -
AES-128 CTR           pwr8             -              -             4193.67       -

Function              Implementation   amd64 (MB/s)   i386 (MB/s)   pwr8 (MB/s)   m0+ (kB/s)
AES-256 CBC encrypt   big              125.27         104.72        86.43         146.20
AES-256 CBC encrypt   small            29.16          24.11         17.14         54.84
AES-256 CBC encrypt   ct               20.61          15.94         15.56         42.28
AES-256 CBC encrypt   ct64             20.08          7.48          15.13         21.09
AES-256 CBC encrypt   x86ni            488.85         488.12        -             -
AES-256 CBC encrypt   pwr8             -              -             659.78        -
AES-256 CBC decrypt   big              135.29         101.86        91.37         143.72
AES-256 CBC decrypt   small            16.14          12.42         11.50         32.74
AES-256 CBC decrypt   ct               31.60          22.52         23.30         60.07
AES-256 CBC decrypt   ct64             58.12          20.80         42.31         56.45
AES-256 CBC decrypt   x86ni            1782.17        1791.37       -             -
AES-256 CBC decrypt   pwr8             -              -             2941.71       -
AES-256 CTR           big              129.40         99.73         90.35         144.43
AES-256 CTR           small            29.30          23.83         17.25         54.52
AES-256 CTR           ct               40.07          31.58         30.16         82.70
AES-256 CTR           ct64             70.57          27.75         53.41         78.63
AES-256 CTR           x86ni            1830.57        1783.39       -             -
AES-256 CTR           pwr8             -              -             2991.74       -

Function              Implementation   amd64 (MB/s)   i386 (MB/s)   pwr8 (MB/s)   m0+ (kB/s)
3DES CBC encrypt      tab              20.25          18.51         14.99         39.48
3DES CBC encrypt      ct               6.54           6.32          4.20          10.05
3DES CBC decrypt      tab              21.02          18.69         15.82         39.13
3DES CBC decrypt      ct               6.58           6.34          4.24          10.03
ChaCha20              ct               322.54         270.72        259.51        550.72

MAC

These are the MAC algorithms used in AEAD cipher suites: GHASH is combined with AES-CTR in AES/GCM, while Poly1305 is used with ChaCha20. Provided implementations are:

  • GHASH “ctmul”: uses 32→64 multiplications.

  • GHASH “ctmul32”: uses 32→32 multiplications.

  • GHASH “ctmul64”: uses 64→64 multiplications.

  • GHASH “pclmul”: an implementation that leverages the pclmulqdq opcode of recent x86 CPUs (this opcode was added along with the AES-NI instructions).

  • GHASH “pwr8”: an implementation that leverages the cryptographic opcodes of POWER8 processors.

  • Poly1305 “ctmul”: uses 32→64 multiplications.

  • Poly1305 “ctmul32”: uses 32→32 multiplications.

  • Poly1305 “ctmulq”: uses 64→128 multiplications (available only on some 64-bit architectures).

  • Poly1305 “i15”: an implementation that relies on the generic “i15” big integer code (also used by RSA and elliptic curves).
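The “32→64” and “32→32” labels refer to the width of the operands and of the result of the multiplication opcode that the code relies on. As a hedged illustration (the function names are hypothetical and this is not code from the library), the sketch below computes a full 64-bit product first with one widening multiplication, then with only 32→32 multiplications on 16-bit halves, which is the kind of extra work a “32” variant must do. On a core such as the Cortex M0+, whose multiplier only yields the low 32 bits of the result, the second style maps directly to the hardware.

```c
#include <stdint.h>

/* 32→64: a single widening multiplication. */
static uint64_t mul_32x32_64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Same result using only 32→32 multiplications: each operand is
   split into 16-bit halves, so every partial product is at most
   0xFFFF * 0xFFFF and fits in 32 bits without truncation. */
static uint64_t mul_32x32_64_narrow(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 0xFFFF, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFF, b1 = b >> 16;

    uint32_t lo = a0 * b0;     /* weight 2^0  */
    uint32_t m1 = a0 * b1;     /* weight 2^16 */
    uint32_t m2 = a1 * b0;     /* weight 2^16 */
    uint32_t hi = a1 * b1;     /* weight 2^32 */

    /* Recombination uses 64-bit shifts and additions only. */
    return (uint64_t)lo
        + ((uint64_t)m1 << 16)
        + ((uint64_t)m2 << 16)
        + ((uint64_t)hi << 32);
}
```

Four narrow multiplications replace one wide multiplication, so the narrow style is expected to be slower wherever a fast widening multiplier is available.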

Processing speed is asymptotic, i.e. only bulk data bandwidth, not per-record overhead (which is slight for records of a few kilobytes in length). Bandwidth is expressed in megabytes per second, except for the M0+, where kilobytes per second are used.

Function   Implementation   amd64 (MB/s)   i386 (MB/s)   pwr8 (MB/s)   m0+ (kB/s)
GHASH      ctmul            193.48         94.64         150.48        50.49
GHASH      ctmul32          92.84          83.43         62.23         152.68
GHASH      ctmul64          247.03         74.17         148.67        68.62
GHASH      pclmul           1740.96        1601.75       -             -
GHASH      pwr8             -              -             4659.59       -
Poly1305   ctmul            1144.49        593.67        995.54        288.55
Poly1305   ctmul32          268.89         200.09        206.68        522.72
Poly1305   ctmulq           1730.92        -             1297.02       -
Poly1305   i15              49.32          34.60         30.98         84.95
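Since each byte of bulk data goes through both the cipher and the MAC, the figures from the two tables can be combined into a rough estimate. The sketch below assumes that both passes run one after the other on a single core, so that the per-byte times add up, and it ignores per-record overhead; the function names are hypothetical.

```c
#include <assert.h>

/* Combined rate when every byte is processed by the cipher and then
   by the MAC: per-byte times add up, so rates combine harmonically.
   Both rates must use the same unit (e.g. kB/s). */
static double combined_rate(double cipher_rate, double mac_rate)
{
    assert(cipher_rate > 0.0 && mac_rate > 0.0);
    return 1.0 / (1.0 / cipher_rate + 1.0 / mac_rate);
}

/* Fraction of one CPU needed to sustain a link carrying link_rate
   bytes per second with crypto running at crypto_rate. */
static double cpu_fraction(double link_rate, double crypto_rate)
{
    assert(crypto_rate > 0.0);
    return link_rate / crypto_rate;
}
```

With the M0+ figures above, combined_rate(550.72, 522.72) for ChaCha20 “ct” and Poly1305 “ctmul32” gives about 268 kB/s. Assuming 10 bits on the wire per byte (8N1 framing, an assumption), a 115200-baud serial link carries 11.52 kB/s, so cpu_fraction(11.52, 268.0) comes to roughly 4% of the CPU.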

Elliptic Curves

Elliptic curve implementations compute point multiplications: a given curve point is multiplied by a scalar whose length is that of the curve subgroup order. Four curves are supported: the three main NIST curves (P-256, P-384 and P-521), and Curve25519. Each implementation supports only some of these curves:

  • “prime_i15” uses the “i15” big-integer code (32→32 multiplications) and supports the three NIST curves.

  • “prime_i31” uses the “i31” big-integer code (32→64 multiplications) and supports the three NIST curves.

  • “p256_m15” uses 32→32 multiplications and supports only P-256.

  • “p256_m31” uses 32→64 multiplications and supports only P-256.

  • “c25519_i15” and “c25519_i31” use, respectively, the “i15” and “i31” big-integer code, to implement Curve25519.

  • “c25519_m15” and “c25519_m31” use, respectively, 32→32 and 32→64 multiplications, to implement Curve25519.

BearSSL also provides aggregate wrappers: the “all_m15” implementation uses “p256_m15” (for P-256), “c25519_m15” (for Curve25519), and “prime_i15” (for P-384 and P-521). Similarly, “all_m31” wraps around “p256_m31”, “c25519_m31” and “prime_i31”.

The “p256_m15” and “p256_m31” implementations also feature an optimised code path when the point that is to be multiplied is the conventional generator for the curve. This is called “fixed point” (FP) below.

Numbers below are given in number of point multiplications per second. These figures translate to actual performance along the following lines:

  • Static ECDH (not ECDHE) requires one point multiplication (both on the client and the server).

  • ECDHE requires two point multiplications on the server (one is amenable to FP), and two point multiplications on the client (again, one may use FP optimisation).

  • ECDSA signature generation (on the server when doing ECDHE_ECDSA, on the client when using client certificates with EC keys) requires one point multiplication (with FP optimisation), and a few extra operations that add about 10 to 20% overhead.

  • ECDSA signature verification (on the client when doing ECDHE_ECDSA, on the server when validating a client signature with an EC key, and also for all ECDSA signatures on certificates) requires two point multiplications, one of which is amenable to FP optimisation. There is also a bit of overhead similar to that of signature generation.

Curve        Implementation   amd64 (mul/s)   i386 (mul/s)   pwr8 (mul/s)   m0+ (mul/s)
P-256        prime_i15        368.45          268.14         233.58         0.664
P-256        prime_i31        840.63          467.10         576.14         0.437
P-256        p256_m15         719.02          705.69         378.16         1.723
P-256        p256_m15 (FP)    1089.04         1065.25        575.84         2.546
P-256        p256_m31         1857.00         965.19         991.54         1.099
P-256        p256_m31 (FP)    2791.22         1409.64        1457.20        1.623
P-384        prime_i15        137.98          102.18         85.48          0.253
P-384        prime_i31        360.31          182.73         253.43         0.149
P-521        prime_i15        59.34           44.74          38.62          0.112
P-521        prime_i31        181.06          86.21          120.29         0.065
Curve25519   c25519_i15       704.80          505.11         438.00         1.271
Curve25519   c25519_i31       1601.39         837.13         1134.70        0.725
Curve25519   c25519_m15       2052.25         2020.48        1060.21        4.420
Curve25519   c25519_m31       5708.51         3047.40        3750.23        2.226
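The accounting rules listed before the table can be turned into a rough estimate of operations per second. The helper below is a hypothetical sketch: it adds the time of the generic and fixed-point multiplications and applies an optional relative overhead (for instance 0.10 to 0.20 for ECDSA), ignoring everything else in the handshake.

```c
#include <assert.h>

/* Estimated operations per second for an operation that needs n_gen
   generic point multiplications and n_fp fixed-point ones.
   gen_rate and fp_rate are in multiplications per second, taken from
   the table; overhead is a relative extra cost (0.15 means +15%).
   For implementations without an FP code path, pass gen_rate as
   fp_rate. */
static double ec_ops_per_second(int n_gen, int n_fp,
                                double gen_rate, double fp_rate,
                                double overhead)
{
    assert(gen_rate > 0.0 && fp_rate > 0.0);
    double seconds = (double)n_gen / gen_rate + (double)n_fp / fp_rate;
    return 1.0 / (seconds * (1.0 + overhead));
}
```

For the ECDHE part on the server with “p256_m31” on amd64 (one generic and one FP multiplication), ec_ops_per_second(1, 1, 1857.00, 2791.22, 0.0) gives roughly 1115 key exchanges per second, before the signature is counted.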

RSA

For RSA, we measure the number of public-key and private-key operations per second, for a 2048-bit key (public exponent is the classic 65537). RSA key exchange involves a private-key operation on the server, a public-key operation on the client. RSA signature generation is a private-key operation, while verification is a public-key operation.

There are four implementations, which correspond to the underlying generic big-integer code:

  • “i15”: uses 32→32 multiplications only; internally, integers are represented as arrays of 16-bit variables, each containing 15 bits of value.

  • “i31”: uses 32→64 multiplications only; internally, integers are represented as arrays of 32-bit variables, each containing 31 bits of value.

  • “i32”: a predecessor to “i31”, that stores 32 bits of value in each 32-bit variable. In general, “i32” is slower than “i31” on all platforms.

  • “i62”: uses the same internal representation as “i31”, except when computing multiplications, in which case 64-bit variables and 64→128 multiplications are used. It is available only on some 64-bit architectures. When present, it is much faster than “i31”.
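One way to see the benefit of keeping a spare bit in each word, as “i15” and “i31” do, is carry handling in portable C: with 31 bits of value per 32-bit word, the sum of two words plus a carry never overflows the word. The sketch below is only an illustration under that assumption; the function name, the little-endian word order and the absence of any length header are choices made here, not a description of the actual big-integer code.

```c
#include <stddef.h>
#include <stdint.h>

/* a += b over n words, least significant word first, each word
   holding 31 bits of value. Returns the final carry (0 or 1). */
static uint32_t add31(uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* At most (2^31 - 1) + (2^31 - 1) + 1 = 2^32 - 1, so the
           sum fits in 32 bits and no wider type is needed. */
        uint32_t s = a[i] + b[i] + carry;
        a[i] = s & 0x7FFFFFFF;   /* keep the low 31 bits */
        carry = s >> 31;         /* the top bit is the carry */
    }
    return carry;
}
```

With full 32-bit words, as in “i32”, the same addition would need either a 64-bit intermediate or explicit overflow detection.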

Operation   Implementation   amd64 (ops/s)   i386 (ops/s)   pwr8 (ops/s)   m0+ (ops/s)
private     i15              60.74           47.10          39.91          0.128
private     i31              210.49          86.05          135.50         0.059
private     i32              99.40           44.93          70.97          0.040
private     i62              355.03          -              291.01         -
public      i15              977.77          800.45         681.88         2.216
public      i31              3533.44         1517.11        2489.88        1.068
public      i32              2077.95         957.80         1553.33        0.833
public      i62              4513.53         -              3875.80        -