On Performance
BearSSL’s primary optimisation goal is to reduce compiled code size. This does not mean that raw execution speed is unimportant; only that, when faced with a size/speed trade-off, BearSSL tends to put more emphasis on the “size” measure than most cryptographic libraries do. For instance, the RSA implementation uses generic code that can handle all key sizes, with no specific code for common key sizes (like 2048 bits).
Yet execution speed still matters, though situations vary a lot. For instance, some constrained environments need fast asymmetric crypto primitives, because the human user is waiting while the SSL handshake occurs; conversely, symmetric encryption speed might not matter as much if it is used to support I/O over a slow network (it does not take much CPU to support encryption of a 115200 baud serial link).
BearSSL’s APIs are meant to be flexible: for all algorithms, different implementations may be used, and possibly externally provided, with no modification of the library source code (this uses function pointers, not conditional compilation with preprocessor directives). BearSSL already offers several choices for most algorithms, which embody various trade-offs. The figures below illustrate the consequences of these choices.
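As a rough illustration of the function-pointer approach, the following sketch defines a small interface with two interchangeable implementations. All names here (checksum_impl, sum_small, sum_unrolled, checksum) are hypothetical and are not part of the BearSSL API; the point is only that the caller selects an implementation at run time, with no preprocessor switch involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface: one algorithm, several implementations. */
typedef struct {
    const char *name;
    uint32_t (*run)(const unsigned char *data, size_t len);
} checksum_impl;

/* Size-oriented variant: one byte per iteration. */
static uint32_t sum_small(const unsigned char *data, size_t len)
{
    uint32_t acc = 0;
    while (len-- > 0) {
        acc += *data++;
    }
    return acc;
}

/* Speed-oriented variant: main loop unrolled four times. */
static uint32_t sum_unrolled(const unsigned char *data, size_t len)
{
    uint32_t acc = 0;
    while (len >= 4) {
        acc += (uint32_t)data[0] + data[1] + data[2] + data[3];
        data += 4;
        len -= 4;
    }
    while (len-- > 0) {
        acc += *data++;
    }
    return acc;
}

static const checksum_impl checksum_small = { "small", sum_small };
static const checksum_impl checksum_unrolled = { "unrolled", sum_unrolled };

/* The caller depends only on the interface, not on a specific variant. */
static uint32_t checksum(const checksum_impl *impl,
    const unsigned char *data, size_t len)
{
    assert(impl != NULL && impl->run != NULL);
    return impl->run(data, len);
}
```

An externally provided implementation would simply be another checksum_impl value defined by the application and passed to checksum(); the library code itself would not change.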
The following points must be remembered:
Most BearSSL implementations are written in generic, portable C code. In its current state, BearSSL has few architecture-specific code routines.
Compilation is done with the default compiler on the platform (GCC), with the “-Os” optimisation flag, which again favours size over speed. Of note, for RSA with the “i62” code, using Clang instead of GCC would offer a substantial speed-up (472 private-key operations per second instead of 355, on the “amd64” platform).
In future versions, additional implementations, notably architecture-specific routines with assembly, will be added. One notable planned import is the very efficient Curve25519 code from the µNaCl library.
Measuring Speed
Four platforms are presented here, using four different architectures:
amd64: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 64-bit mode. This is a rather powerful, yet already aging processor (launched in 2012). Compiler is GCC 5.4.0, with flags “-Os -fPIC”.
i386: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 32-bit mode. Compiler is GCC 5.4.0, with flags “-Os -fPIC”. This is the same platform as “amd64”, but used in 32-bit mode.
pwr8: a POWER8E CPU, at 3.425 GHz, in 64-bit little-endian mode (“ppc64le”). Compiler is GCC 6.2.0, with flags “-Os -fPIC”.
m0+: an ARM Cortex M0+, at 48 MHz (Atmel SAM D20 microcontroller). Compiler is GCC 4.9.3, with flags “-Os -mthumb -mcpu=cortex-m0plus”.
Measures are done by repeatedly processing a given data buffer (8 kilobytes) or running the relevant process (e.g. elliptic curve point multiplication) so that the total CPU time is at least two seconds (one second on the M0+, where the clock has millisecond precision). On Unix-like systems, the clock() function measures the CPU time allocated to the computation. Due to the unavoidable variance on multitasking operating systems, values should be considered accurate to within ±3%.
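The measurement procedure can be sketched as follows. This is a minimal sketch, not the actual benchmark code: toy_process() is a hypothetical stand-in for the primitive under test, and doubling the iteration count until the time threshold is reached is an assumption; only the 8-kilobyte buffer, the use of clock() and the minimum CPU time come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the primitive being measured. */
static uint32_t toy_process(const unsigned char *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        acc = ((acc << 1) | (acc >> 31)) ^ buf[i];
    }
    return acc;
}

/*
 * Process an 8 kB buffer repeatedly, doubling the number of iterations
 * until one run uses at least min_seconds of CPU time. Returns the
 * throughput in bytes per second of CPU time.
 */
static double measure_throughput(double min_seconds)
{
    static unsigned char buf[8192];
    volatile uint32_t sink = 0;
    unsigned long num = 1;

    assert(min_seconds > 0.0);
    memset(buf, 0x55, sizeof buf);
    for (;;) {
        clock_t begin = clock();
        for (unsigned long i = 0; i < num; i++) {
            uint32_t r = toy_process(buf, sizeof buf);
            /* Feed the result back so calls cannot be merged or hoisted. */
            buf[0] ^= (unsigned char)r;
            sink ^= r;
        }
        clock_t end = clock();
        double elapsed = (double)(end - begin) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds) {
            return (double)num * (double)sizeof buf / elapsed;
        }
        num <<= 1;
    }
}
```

Calling measure_throughput(2.0) would mirror the two-second threshold; dividing the result by 1000000 gives a figure in megabytes per second, assuming decimal megabytes.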
On the M0+, there is no operating system, but a 1 kHz interrupt is set, which allows elapsed time to be counted with millisecond precision (measures show that the IRQ handler takes less than a microsecond, thus the timer overhead is negligible).
All measures have used BearSSL-0.4. Later versions might contain changes inducing different (hopefully better) performance.
Measures
Hash Functions
Processing speed for a long message (i.e. without padding overhead) is given in megabytes per second, except for the M0+ where values are expressed in kilobytes per second.
Implementation | amd64 (MB/s) | i386 (MB/s) | pwr8 (MB/s) | m0+ (kB/s) |
---|---|---|---|---|
md5 | 516.80 | 357.41 | 248.29 | 1046.48 |
sha1 | 331.05 | 257.90 | 176.39 | 669.59 |
sha256 | 211.60 | 166.65 | 128.65 | 320.08 |
sha512 | 331.56 | 84.42 | 206.44 | 156.13 |
Symmetric Encryption
For AES, we again measure asymptotic speed for CBC encryption, CBC decryption and CTR mode; key schedule time is not measured. For 3DES, we also measure CBC encryption and CBC decryption, but there is no CTR mode. ChaCha20 performance is also provided.
Implementations are:
AES “big”: a classic implementation with lookup tables (not constant-time).
AES “small”: a compact implementation with small tables (not constant-time).
AES “ct”: a constant-time implementation with bitslicing over 32-bit registers; two blocks are processed in parallel when the encryption mode allows it (CTR and CBC decryption, but not CBC encryption).
AES “ct64”: a constant-time implementation similar to “ct” but using 64-bit variables.
AES “x86ni”: an implementation using the AES-NI opcodes provided by recent (since late 2011) x86 CPUs.
AES “pwr8”: an implementation using the cryptographic opcodes of POWER8 processors.
3DES “tab”: a classic, table-based 3DES implementation (not constant-time).
3DES “ct”: a constant-time implementation that uses internal bitslicing.
ChaCha20 “ct”: a straightforward constant-time implementation of ChaCha20.
Note that in SSL, a MAC is also applied on the bulk of the data. CBC cipher suites use HMAC, which processes data at roughly the same speed as the underlying hash function. For GCM, the MAC part is GHASH; for ChaCha20, Poly1305 is used.
Here again, speed is provided in megabytes per second, except for the M0+, where performance is given in kilobytes per second.
Function | Implementation | amd64 (MB/s) | i386 (MB/s) | pwr8 (MB/s) | m0+ (kB/s) |
---|---|---|---|---|---|
AES-128 CBC encrypt | big | 162.88 | 135.58 | 111.86 | 198.14 |
AES-128 CBC encrypt | small | 39.81 | 32.89 | 23.84 | 75.76 |
AES-128 CBC encrypt | ct | 28.47 | 21.94 | 21.44 | 58.10 |
AES-128 CBC encrypt | ct64 | 27.17 | 10.27 | 20.68 | 28.95 |
AES-128 CBC encrypt | x86ni | 679.76 | 679.74 | - | - |
AES-128 CBC encrypt | pwr8 | - | - | 887.53 | - |
AES-128 CBC decrypt | big | 177.73 | 132.49 | 118.99 | 193.46 |
AES-128 CBC decrypt | small | 22.58 | 17.45 | 16.13 | 45.99 |
AES-128 CBC decrypt | ct | 43.71 | 31.17 | 32.24 | 82.85 |
AES-128 CBC decrypt | ct64 | 78.45 | 28.36 | 56.32 | 77.28 |
AES-128 CBC decrypt | x86ni | 2366.62 | 2423.44 | - | - |
AES-128 CBC decrypt | pwr8 | - | - | 4085.88 | - |
AES-128 CTR | big | 169.80 | 127.21 | 116.56 | 194.76 |
AES-128 CTR | small | 40.06 | 32.31 | 24.01 | 74.94 |
AES-128 CTR | ct | 55.01 | 42.82 | 41.17 | 112.70 |
AES-128 CTR | ct64 | 92.38 | 37.14 | 70.26 | 105.28 |
AES-128 CTR | x86ni | 2426.27 | 2393.89 | - | - |
AES-128 CTR | pwr8 | - | - | 4193.67 | - |

Function | Implementation | amd64 (MB/s) | i386 (MB/s) | pwr8 (MB/s) | m0+ (kB/s) |
---|---|---|---|---|---|
AES-256 CBC encrypt | big | 125.27 | 104.72 | 86.43 | 146.20 |
AES-256 CBC encrypt | small | 29.16 | 24.11 | 17.14 | 54.84 |
AES-256 CBC encrypt | ct | 20.61 | 15.94 | 15.56 | 42.28 |
AES-256 CBC encrypt | ct64 | 20.08 | 7.48 | 15.13 | 21.09 |
AES-256 CBC encrypt | x86ni | 488.85 | 488.12 | - | - |
AES-256 CBC encrypt | pwr8 | - | - | 659.78 | - |
AES-256 CBC decrypt | big | 135.29 | 101.86 | 91.37 | 143.72 |
AES-256 CBC decrypt | small | 16.14 | 12.42 | 11.50 | 32.74 |
AES-256 CBC decrypt | ct | 31.60 | 22.52 | 23.30 | 60.07 |
AES-256 CBC decrypt | ct64 | 58.12 | 20.80 | 42.31 | 56.45 |
AES-256 CBC decrypt | x86ni | 1782.17 | 1791.37 | - | - |
AES-256 CBC decrypt | pwr8 | - | - | 2941.71 | - |
AES-256 CTR | big | 129.40 | 99.73 | 90.35 | 144.43 |
AES-256 CTR | small | 29.30 | 23.83 | 17.25 | 54.52 |
AES-256 CTR | ct | 40.07 | 31.58 | 30.16 | 82.70 |
AES-256 CTR | ct64 | 70.57 | 27.75 | 53.41 | 78.63 |
AES-256 CTR | x86ni | 1830.57 | 1783.39 | - | - |
AES-256 CTR | pwr8 | - | - | 2991.74 | - |

Function | Implementation | amd64 (MB/s) | i386 (MB/s) | pwr8 (MB/s) | m0+ (kB/s) |
---|---|---|---|---|---|
3DES CBC encrypt | tab | 20.25 | 18.51 | 14.99 | 39.48 |
3DES CBC encrypt | ct | 6.54 | 6.32 | 4.20 | 10.05 |
3DES CBC decrypt | tab | 21.02 | 18.69 | 15.82 | 39.13 |
3DES CBC decrypt | ct | 6.58 | 6.34 | 4.24 | 10.03 |
ChaCha20 | ct | 322.54 | 270.72 | 259.51 | 550.72 |
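To put these figures in perspective, the remark in the introduction about a 115200 baud serial link can be checked with a short computation. The helper below is hypothetical, and the numbers rest on two assumptions that are not stated in this document: that one kilobyte is 1000 bytes, and that every transmitted bit carries payload (framing bits would only lower the data rate).

```c
#include <assert.h>

/*
 * Fraction of CPU time needed for a cipher to keep up with a link.
 * Both arguments are in bytes per second; the cipher throughput is
 * per second of CPU time, as in the tables above.
 */
static double cpu_fraction(double link_bytes_per_s, double cipher_bytes_per_s)
{
    assert(cipher_bytes_per_s > 0.0);
    return link_bytes_per_s / cipher_bytes_per_s;
}
```

With ChaCha20 on the M0+ (550.72 kB/s), cpu_fraction(115200.0 / 8.0, 550.72e3) is below 0.03: under these assumptions, encrypting the full link bandwidth would use less than 3% of the CPU.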
MAC
These are the MAC algorithms used in AEAD cipher suites: GHASH is combined with AES-CTR in AES/GCM, while Poly1305 is used with ChaCha20. Provided implementations are:
GHASH “ctmul”: uses 32→64 multiplications.
GHASH “ctmul32”: uses 32→32 multiplications.
GHASH “ctmul64”: uses 64→64 multiplications.
GHASH “pclmul”: an implementation that leverages the pclmulqdq opcode of recent x86 CPUs (this opcode was added along with the AES-NI instructions).
GHASH “pwr8”: an implementation that leverages the cryptographic opcodes of POWER8 processors.
Poly1305 “ctmul”: uses 32→64 multiplications.
Poly1305 “ctmul32”: uses 32→32 multiplications.
Poly1305 “ctmulq”: uses 64→128 multiplications (available only on some 64-bit architectures).
Poly1305 “i15”: an implementation that relies on the generic “i15” big integer code (also used by RSA and elliptic curves).
Processing speed is asymptotic, i.e. only bulk data bandwidth, not per-record overhead (which is slight for records of a few kilobytes in length). Bandwidth is expressed in megabytes per second, except for the M0+, where kilobytes per second are used.
Function | Implementation | amd64 (MB/s) | i386 (MB/s) | pwr8 (MB/s) | m0+ (kB/s) |
---|---|---|---|---|---|
GHASH | ctmul | 193.48 | 94.64 | 150.48 | 50.49 |
GHASH | ctmul32 | 92.84 | 83.43 | 62.23 | 152.68 |
GHASH | ctmul64 | 247.03 | 74.17 | 148.67 | 68.62 |
GHASH | pclmul | 1740.96 | 1601.75 | - | - |
GHASH | pwr8 | - | - | 4659.59 | - |
Poly1305 | ctmul | 1144.49 | 593.67 | 995.54 | 288.55 |
Poly1305 | ctmul32 | 268.89 | 200.09 | 206.68 | 522.72 |
Poly1305 | ctmulq | 1730.92 | - | 1297.02 | - |
Poly1305 | i15 | 49.32 | 34.60 | 30.98 | 84.95 |
Elliptic Curves
Elliptic curve implementations compute point multiplications: a given curve point is multiplied by a scalar whose length is that of the curve subgroup order. Four curves are supported: the three main NIST curves (P-256, P-384 and P-521), and Curve25519. Each implementation supports only some of these curves:
“prime_i15” uses the “i15” big-integer code (32→32 multiplications) and supports the three NIST curves.
“prime_i31” uses the “i31” big-integer code (32→64 multiplications) and supports the three NIST curves.
“p256_m15” uses 32→32 multiplications and supports only P-256.
“p256_m31” uses 32→64 multiplications and supports only P-256.
“c25519_i15” and “c25519_i31” use, respectively, the “i15” and “i31” big-integer code, to implement Curve25519.
“c25519_m15” and “c25519_m31” use, respectively, 32→32 and 32→64 multiplications, to implement Curve25519.
BearSSL also provides aggregate wrappers: the “all_m15” implementation uses “p256_m15” (for P-256), “c25519_m15” (for Curve25519), and “prime_i15” (for P-384 and P-521). Similarly, “all_m31” wraps around “p256_m31”, “c25519_m31” and “prime_i31”.
The “p256_m15” and “p256_m31” implementations also feature an optimised code path for the case where the point to be multiplied is the conventional generator of the curve. This is called “fixed point” (FP) below.
Numbers below are given in point multiplications per second. These figures translate to actual performance along the following lines:
Static ECDH (not ECDHE) requires one point multiplication (both on the client and the server).
ECDHE requires two point multiplications on the server (one is amenable to FP), and two point multiplications on the client (again, one may use FP optimisation).
ECDSA signature generation (on the server when doing ECDHE_ECDSA, on the client when using client certificates with EC keys) requires one point multiplication (with FP optimisation), and a few extra operations that add about 10 to 20% overhead.
ECDSA signature verification (on the client when doing ECDHE_ECDSA, on the server when validating a client signature with an EC key, and also for all ECDSA signatures on certificates) requires two point multiplications, one of which is amenable to FP optimisation. There is also a bit of overhead similar to that of signature generation.
Curve | Implementation | amd64 (mul/s) | i386 (mul/s) | pwr8 (mul/s) | m0+ (mul/s) |
---|---|---|---|---|---|
P-256 | prime_i15 | 368.45 | 268.14 | 233.58 | 0.664 |
P-256 | prime_i31 | 840.63 | 467.10 | 576.14 | 0.437 |
P-256 | p256_m15 | 719.02 | 705.69 | 378.16 | 1.723 |
P-256 | p256_m15 (FP) | 1089.04 | 1065.25 | 575.84 | 2.546 |
P-256 | p256_m31 | 1857.00 | 965.19 | 991.54 | 1.099 |
P-256 | p256_m31 (FP) | 2791.22 | 1409.64 | 1457.20 | 1.623 |
P-384 | prime_i15 | 137.98 | 102.18 | 85.48 | 0.253 |
P-384 | prime_i31 | 360.31 | 182.73 | 253.43 | 0.149 |
P-521 | prime_i15 | 59.34 | 44.74 | 38.62 | 0.112 |
P-521 | prime_i31 | 181.06 | 86.21 | 120.29 | 0.065 |
Curve25519 | c25519_i15 | 704.80 | 505.11 | 438.00 | 1.271 |
Curve25519 | c25519_i31 | 1601.39 | 837.13 | 1134.70 | 0.725 |
Curve25519 | c25519_m15 | 2052.25 | 2020.48 | 1060.21 | 4.420 |
Curve25519 | c25519_m31 | 5708.51 | 3047.40 | 3750.23 | 2.226 |
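Since the table lists rates, the cost of a protocol step that chains several point multiplications is obtained by adding the per-operation times (the inverses of the rates). The helper below, whose name is hypothetical, performs that computation; it ignores every cost other than the point multiplications themselves.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the rates (operations per second) of n operations that must be
 * run one after the other, return the rate of the whole sequence.
 */
static double combined_rate(const double *rates, size_t n)
{
    double seconds = 0.0;

    for (size_t i = 0; i < n; i++) {
        assert(rates[i] > 0.0);
        seconds += 1.0 / rates[i];   /* time taken by one operation */
    }
    return (n > 0) ? 1.0 / seconds : 0.0;
}
```

For instance, with the amd64 figures for “p256_m31”, one side of an ECDHE exchange (one FP multiplication at 2791.22 mul/s, one generic multiplication at 1857.00 mul/s) comes out at about 1115 key exchanges per second.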
RSA
For RSA, we measure the number of public-key and private-key operations per second, for a 2048-bit key (public exponent is the classic 65537). RSA key exchange involves a private-key operation on the server and a public-key operation on the client. RSA signature generation is a private-key operation, while verification is a public-key operation.
There are four implementations, which correspond to the underlying generic big-integer code:
“i15”: uses 32→32 multiplications only; internally, integers are represented as arrays of 16-bit variables, each containing 15 bits of value.
“i31”: uses 32→64 multiplications only; internally, integers are represented as arrays of 32-bit variables, each containing 31 bits of value.
“i32”: a predecessor to “i31”, that stores 32 bits of value in each 32-bit variable. In general, “i32” is slower than “i31” on all platforms.
“i62”: uses the same internal representation as “i31”, except when computing multiplications, in which case 64-bit variables and 64→128 multiplications are used. It is available only on some 64-bit architectures. When present, it is much faster than “i31”.
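The “15 bits of value in a 16-bit variable” layout can be illustrated with a short sketch. The functions below are hypothetical and are not taken from the BearSSL source; they only show two properties of such a representation: the unused top bit leaves room for carry handling with plain integer operations, and the product of two 15-bit limbs is below 2^30, so a 32→32 multiplication yields it exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 15
#define LIMB_MASK 0x7FFFu

/*
 * a += b over len limbs (least significant limb first), each limb
 * holding 15 bits of value in a 16-bit variable. Returns the carry
 * out of the top limb (0 or 1).
 */
static uint32_t limbs15_add(uint16_t *a, const uint16_t *b, size_t len)
{
    uint32_t carry = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t w = (uint32_t)a[i] + (uint32_t)b[i] + carry;
        a[i] = (uint16_t)(w & LIMB_MASK);   /* keep the low 15 bits */
        carry = w >> LIMB_BITS;             /* bit 15 is the carry */
    }
    return carry;
}

/* Product of two 15-bit limbs: always fits in a 32-bit result. */
static uint32_t limb15_mul(uint16_t x, uint16_t y)
{
    assert(x <= LIMB_MASK && y <= LIMB_MASK);
    return (uint32_t)x * (uint32_t)y;
}
```

The same reasoning, scaled up, would apply to 31-bit limbs and 32→64 multiplications; that parallel is an inference from the descriptions above rather than a statement about the actual code.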
Operation | Implementation | amd64 (ops/s) | i386 (ops/s) | pwr8 (ops/s) | m0+ (ops/s) |
---|---|---|---|---|---|
private | i15 | 60.74 | 47.10 | 39.91 | 0.128 |
private | i31 | 210.49 | 86.05 | 135.50 | 0.059 |
private | i32 | 99.40 | 44.93 | 70.97 | 0.040 |
private | i62 | 355.03 | - | 291.01 | - |
public | i15 | 977.77 | 800.45 | 681.88 | 2.216 |
public | i31 | 3533.44 | 1517.11 | 2489.88 | 1.068 |
public | i32 | 2077.95 | 957.80 | 1553.33 | 0.833 |
public | i62 | 4513.53 | - | 3875.80 | - |