On Performance

BearSSL’s primary optimisation goal is to reduce compiled code size. This does not mean that raw execution speed is unimportant; only that, when faced with a size/speed trade-off, BearSSL tends to put more emphasis on the “size” measure than most cryptographic libraries do. For instance, the RSA implementation will use generic code that can handle all key sizes, with no specific code for common key sizes (like 2048 bits).

Yet execution speed still matters, though situations vary a lot. For instance, some constrained environments need fast asymmetric crypto primitives, because when the SSL handshake occurs, the human user is waiting; but, conversely, symmetric encryption speed might not matter as much if it is used to support I/O over a slow network (it does not take much CPU to support encryption of a 115200-baud serial link).
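The serial-link remark can be checked with a rough calculation. The sketch below assumes 8N1 framing (10 bits on the wire per byte) and that “kB/s” means 1000 bytes per second; both are assumptions, not statements taken from the measurements. It uses the AES-128 CTR “ct” figure reported for the M0+ further down this page (112.70 kB/s) and ignores the MAC.

```python
# Rough CPU-load estimate for encrypting a saturated serial link.
# Assumptions (not from the measurements): 8N1 framing, 1 kB = 1000 bytes.

BAUD = 115200                 # line rate, in bits per second on the wire
WIRE_BITS_PER_BYTE = 10       # 1 start + 8 data + 1 stop bit (8N1, assumed)
AES_CTR_CT_M0_KBPS = 112.70   # AES-128 CTR "ct" on the M0+, from this page


def cpu_fraction(link_bytes_per_s: float, cipher_bytes_per_s: float) -> float:
    """Fraction of CPU time needed for the cipher to keep up with the link."""
    return link_bytes_per_s / cipher_bytes_per_s


link_rate = BAUD / WIRE_BITS_PER_BYTE      # payload bytes per second
cipher_rate = AES_CTR_CT_M0_KBPS * 1000.0  # bytes per second
load = cpu_fraction(link_rate, cipher_rate)
print(f"link: {link_rate:.0f} B/s, CPU share for encryption: {load:.1%}")
```

Under these assumptions, encryption alone would keep the 48 MHz core busy for roughly a tenth of the time, even with a constant-time implementation.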

BearSSL’s APIs are meant to be flexible: for all algorithms, different implementations may be used, and possibly externally provided, with no modification of the library source code (this uses function pointers, not conditional compilation with preprocessor directives). BearSSL already offers several choices for most algorithms, which embody various trade-offs. The figures below illustrate the consequences of these choices.

The following points must be remembered:

Most BearSSL implementations are written in generic, portable C code. In its current state, BearSSL has few architecture-specific code routines.
Compilation is done with the default compiler on the platform (GCC), with the “-Os” optimisation flag, which again favours size over speed. Of note, for RSA with the “i62” code, using Clang instead of GCC would offer a substantial speed-up (472 private-key operations per second instead of 355, on the “amd64” platform).

In future versions, additional implementations, notably architecture-specific routines with assembly, will be added. One notable planned import is the very efficient Curve25519 code from the µNaCl library.

Measuring Speed

Four platforms are presented here, using four different architectures:

amd64: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 64-bit mode. This is a rather powerful, yet already aging processor (launched in 2012). Compiler is GCC 5.4.0, with flags “-Os -fPIC”.
i386: an Intel Xeon CPU (E3-1220 V2), at 3.10 GHz, in 32-bit mode. Compiler is GCC 5.4.0, with flags “-Os -fPIC”. This is the same platform as “amd64”, but used in 32-bit mode.
pwr8: a POWER8E CPU, at 3.425 GHz, in 64-bit little-endian mode (“ppc64le”). Compiler is GCC 6.2.0, with flags “-Os -fPIC”.
m0+: an ARM Cortex M0+, at 48 MHz (Atmel SAM D20 microcontroller). Compiler is GCC 4.9.3, with flags “-Os -mthumb -mcpu=cortex-m0plus”.

Measures are done by repeatedly processing a given data buffer (8 kilobytes) or running the relevant process (e.g. elliptic curve point multiplication) so that the total CPU time is at least two seconds (one second on the M0+, where the clock has millisecond precision). On Unix-like systems, the clock() function measures CPU time allocated to the computation. Due to the unavoidable variance on multitasking operating systems, values should be considered accurate to within ±3%.
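The measurement procedure can be sketched as follows. This is an illustration in Python, not the benchmark code used for the figures on this page: it uses time.process_time() as a stand-in for the C clock() function, doubles the iteration count between attempts (an assumption about how the threshold is reached), and times Python's hashlib rather than BearSSL, so its output is not comparable to the tables below.

```python
import hashlib
import time


def measure_throughput(process, buf: bytes, min_cpu_seconds: float = 2.0) -> float:
    """Return the throughput of process(buf) in MB/s (10**6 bytes per second).

    The iteration count is doubled until a single timed run consumes at
    least min_cpu_seconds of CPU time; only that final run is reported.
    """
    if min_cpu_seconds <= 0:
        raise ValueError("min_cpu_seconds must be positive")
    num = 1
    while True:
        start = time.process_time()   # CPU time, comparable to C's clock()
        for _ in range(num):
            process(buf)
        elapsed = time.process_time() - start
        if elapsed >= min_cpu_seconds:
            return len(buf) * num / elapsed / 1e6
        num *= 2


if __name__ == "__main__":
    data = bytes(8192)  # 8-kilobyte buffer, as in the text
    # Threshold shortened to 0.2 s for a quick demo; the text uses 2 s.
    rate = measure_throughput(lambda b: hashlib.sha256(b).digest(), data, 0.2)
    print(f"sha256 (hashlib): {rate:.2f} MB/s")
```

Requiring a minimum CPU time per run keeps the clock resolution small relative to the measured interval, which is why the M0+, with its millisecond clock, can use a shorter threshold.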

On the M0+, there is no operating system, but a 1 kHz interrupt is set, allowing counting elapsed time with millisecond precision (measures show that the IRQ handler takes less than a microsecond, thus the timer overhead is negligible).

All measures have used BearSSL-0.4. Later versions might contain changes inducing different (hopefully better) performance.

Measures

Hash Functions

Processing speed for a long message (i.e. without padding overhead) is given in megabytes per second, except for the M0+ where values are expressed in kilobytes per second.

Implementation	amd64 (MB/s)	i386 (MB/s)	pwr8 (MB/s)	m0+ (kB/s)
md5	516.80	357.41	248.29	1046.48
sha1	331.05	257.90	176.39	669.59
sha256	211.60	166.65	128.65	320.08
sha512	331.56	84.42	206.44	156.13

Symmetric Encryption

For AES, we again measure asymptotic speed for CBC encryption, CBC decryption and CTR mode; key schedule time is not measured. For 3DES, we also measure CBC encryption and CBC decryption, but there is no CTR mode. ChaCha20 performance is also provided.

Implementations are:

AES “big”: a classic implementation with lookup tables (not constant-time).
AES “small”: a compact implementation with small tables (not constant-time).
AES “ct”: a constant-time implementation with bitslicing over 32-bit registers; two blocks are processed in parallel when the encryption mode allows it (CTR and CBC decryption, but not CBC encryption).
AES “ct64”: a constant-time implementation similar to “ct” but using 64-bit variables.
AES “x86ni”: an implementation using the AES-NI opcodes provided by recent (since late 2011) x86 CPUs.
AES “pwr8”: an implementation using the cryptographic opcodes of POWER8 processors.
3DES “tab”: a classic, table-based 3DES implementation (not constant-time).
3DES “ct”: a constant-time implementation that uses bitslicing internally.
ChaCha20 “ct”: a straightforward constant-time implementation of ChaCha20.

Note that in SSL, a MAC is also applied on the bulk of the data. CBC cipher suites use HMAC, which processes data at roughly the same speed as the underlying hash function. For GCM, the MAC part is GHASH; for ChaCha20, Poly1305 is used.

There again, speed is provided in megabytes per second, except for the M0+, where performance is given in kilobytes per second.

Function	Implementation	amd64 (MB/s)	i386 (MB/s)	pwr8 (MB/s)	m0+ (kB/s)
AES-128 CBC encrypt	big	162.88	135.58	111.86	198.14
AES-128 CBC encrypt	small	39.81	32.89	23.84	75.76
AES-128 CBC encrypt	ct	28.47	21.94	21.44	58.10
AES-128 CBC encrypt	ct64	27.17	10.27	20.68	28.95
AES-128 CBC encrypt	x86ni	679.76	679.74	-	-
AES-128 CBC encrypt	pwr8	-	-	887.53	-
AES-128 CBC decrypt	big	177.73	132.49	118.99	193.46
AES-128 CBC decrypt	small	22.58	17.45	16.13	45.99
AES-128 CBC decrypt	ct	43.71	31.17	32.24	82.85
AES-128 CBC decrypt	ct64	78.45	28.36	56.32	77.28
AES-128 CBC decrypt	x86ni	2366.62	2423.44	-	-
AES-128 CBC decrypt	pwr8	-	-	4085.88	-
AES-128 CTR	big	169.80	127.21	116.56	194.76
AES-128 CTR	small	40.06	32.31	24.01	74.94
AES-128 CTR	ct	55.01	42.82	41.17	112.70
AES-128 CTR	ct64	92.38	37.14	70.26	105.28
AES-128 CTR	x86ni	2426.27	2393.89	-	-
AES-128 CTR	pwr8	-	-	4193.67	-

Function	Implementation	amd64 (MB/s)	i386 (MB/s)	pwr8 (MB/s)	m0+ (kB/s)
AES-256 CBC encrypt	big	125.27	104.72	86.43	146.20
AES-256 CBC encrypt	small	29.16	24.11	17.14	54.84
AES-256 CBC encrypt	ct	20.61	15.94	15.56	42.28
AES-256 CBC encrypt	ct64	20.08	7.48	15.13	21.09
AES-256 CBC encrypt	x86ni	488.85	488.12	-	-
AES-256 CBC encrypt	pwr8	-	-	659.78	-
AES-256 CBC decrypt	big	135.29	101.86	91.37	143.72
AES-256 CBC decrypt	small	16.14	12.42	11.50	32.74
AES-256 CBC decrypt	ct	31.60	22.52	23.30	60.07
AES-256 CBC decrypt	ct64	58.12	20.80	42.31	56.45
AES-256 CBC decrypt	x86ni	1782.17	1791.37	-	-
AES-256 CBC decrypt	pwr8	-	-	2941.71	-
AES-256 CTR	big	129.40	99.73	90.35	144.43
AES-256 CTR	small	29.30	23.83	17.25	54.52
AES-256 CTR	ct	40.07	31.58	30.16	82.70
AES-256 CTR	ct64	70.57	27.75	53.41	78.63
AES-256 CTR	x86ni	1830.57	1783.39	-	-
AES-256 CTR	pwr8	-	-	2991.74	-

Function	Implementation	amd64 (MB/s)	i386 (MB/s)	pwr8 (MB/s)	m0+ (kB/s)
3DES CBC encrypt	tab	20.25	18.51	14.99	39.48
3DES CBC encrypt	ct	6.54	6.32	4.20	10.05
3DES CBC decrypt	tab	21.02	18.69	15.82	39.13
3DES CBC decrypt	ct	6.58	6.34	4.24	10.03
ChaCha20	ct	322.54	270.72	259.51	550.72

MAC

These are the MAC algorithms used in AEAD cipher suites: GHASH is combined with AES-CTR in AES/GCM, while Poly1305 is used with ChaCha20. Provided implementations are:

GHASH “ctmul”: uses 32→64 multiplications.
GHASH “ctmul32”: uses 32→32 multiplications.
GHASH “ctmul64”: uses 64→64 multiplications.
GHASH “pclmul”: an implementation that leverages the pclmulqdq opcode of recent x86 CPUs (this opcode was added along with the AES-NI instructions).
GHASH “pwr8”: an implementation that leverages the cryptographic opcodes of POWER8 processors.
Poly1305 “ctmul”: uses 32→64 multiplications.
Poly1305 “ctmul32”: uses 32→32 multiplications.
Poly1305 “ctmulq”: uses 64→128 multiplications (available only on some 64-bit architectures).
Poly1305 “i15”: an implementation that relies on the generic “i15” big integer code (also used by RSA and elliptic curves).
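The “N→M” notation names the multiplication primitive an implementation relies on: two N-bit operands producing an M-bit result. Which primitive is cheap depends on the CPU, which is why several variants coexist. The Python sketch below only illustrates the notation by emulating fixed-width unsigned arithmetic with masks; the function name is hypothetical and none of this is BearSSL code.

```python
def mul(a: int, b: int, in_bits: int, out_bits: int) -> int:
    """Emulate an unsigned in_bits x in_bits -> out_bits multiplication."""
    in_max = (1 << in_bits) - 1
    out_mask = (1 << out_bits) - 1
    if not (0 <= a <= in_max and 0 <= b <= in_max):
        raise ValueError("operand does not fit in in_bits")
    # Python integers are unbounded, so truncate to the result width.
    return (a * b) & out_mask


x = 0xFFFFFFFF
print(hex(mul(x, x, 32, 32)))  # 32->32: only the low 32 bits survive
print(hex(mul(x, x, 32, 64)))  # 32->64: the full product is kept
```

With a 32→32 primitive the upper half of the product is lost, so an implementation built on it has to work with smaller operands to avoid overflow.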

Processing speed is asymptotic, i.e. only bulk data bandwidth, not per-record overhead (which is slight for records of a few kilobytes in length). Bandwidth is expressed in megabytes per second, except for the M0+, where kilobytes per second are used.

Function	Implementation	amd64 (MB/s)	i386 (MB/s)	pwr8 (MB/s)	m0+ (kB/s)
GHASH	ctmul	193.48	94.64	150.48	50.49
GHASH	ctmul32	92.84	83.43	62.23	152.68
GHASH	ctmul64	247.03	74.17	148.67	68.62
GHASH	pclmul	1740.96	1601.75	-	-
GHASH	pwr8	-	-	4659.59	-
Poly1305	ctmul	1144.49	593.67	995.54	288.55
Poly1305	ctmul32	268.89	200.09	206.68	522.72
Poly1305	ctmulq	1730.92	-	1297.02	-
Poly1305	i15	49.32	34.60	30.98	84.95
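Since every byte of a record goes through both the cipher and the MAC, the two figures can be combined into a rough bulk bandwidth. The sketch below assumes that the two passes run one after the other with no overlap and no per-record overhead, so that their per-byte costs simply add; this is a simplification, not a measurement. The inputs are amd64 figures taken from the tables on this page.

```python
def combined_rate(*rates_mb_per_s: float) -> float:
    """Bandwidth of sequential passes over the same data (harmonic combination)."""
    return 1.0 / sum(1.0 / r for r in rates_mb_per_s)


# amd64 figures from this page, in MB/s.
AES128_CTR_CT64 = 92.38
GHASH_CTMUL64 = 247.03
CHACHA20_CT = 322.54
POLY1305_CTMULQ = 1730.92

gcm = combined_rate(AES128_CTR_CT64, GHASH_CTMUL64)
chachapoly = combined_rate(CHACHA20_CT, POLY1305_CTMULQ)
print(f"AES-128/GCM (ct64 + ctmul64):        {gcm:.1f} MB/s")
print(f"ChaCha20+Poly1305 (ct + ctmulq):     {chachapoly:.1f} MB/s")
```

The slower of the two passes dominates the result, so a fast MAC barely changes the estimate when paired with a much slower cipher.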

Elliptic Curves

Elliptic curve implementations compute point multiplications: a given curve point is multiplied by a scalar whose length is that of the curve subgroup order. Four curves are supported: the three main NIST curves (P-256, P-384 and P-521), and Curve25519. Each implementation supports only some of these curves:

“prime_i15” uses the “i15” big-integer code (32→32 multiplications) and supports the three NIST curves.
“prime_i31” uses the “i31” big-integer code (32→64 multiplications) and supports the three NIST curves.
“p256_m15” uses 32→32 multiplications and supports only P-256.
“p256_m31” uses 32→64 multiplications and supports only P-256.
“c25519_i15” and “c25519_i31” use, respectively, the “i15” and “i31” big-integer code, to implement Curve25519.
“c25519_m15” and “c25519_m31” use, respectively, 32→32 and 32→64 multiplications, to implement Curve25519.

BearSSL also provides aggregate wrappers: the “all_m15” implementation uses “p256_m15” (for P-256), “c25519_m15” (for Curve25519), and “prime_i15” (for P-384 and P-521). Similarly, “all_m31” wraps around “p256_m31”, “c25519_m31” and “prime_i31”.

The “p256_m15” and “p256_m31” implementations also feature an optimised code path when the point that is to be multiplied is the conventional generator for the curve. This is called “fixed point” (FP) below.

Numbers below are given in number of point multiplications per second. These figures translate to actual performance along the following lines:

Static ECDH (not ECDHE) requires one point multiplication (both on the client and the server).
ECDHE requires two point multiplications on the server (one is amenable to FP), and two point multiplications on the client (again, one may use FP optimisation).
ECDSA signature generation (on the server when doing ECDHE_ECDSA, on the client when using client certificates with EC keys) requires one point multiplication (with FP optimisation), and a few extra operations that add about 10 to 20% overhead.
ECDSA signature verification (on the client when doing ECDHE_ECDSA, on the server when validating a client signature with an EC key, and also for all ECDSA signatures on certificates) requires two point multiplications, one of which is amenable to FP optimisation. There is also a bit of overhead similar to that of signature generation.

Curve	Implementation	amd64 (mul/s)	i386 (mul/s)	pwr8 (mul/s)	m0+ (mul/s)
P-256	prime_i15	368.45	268.14	233.58	0.664
P-256	prime_i31	840.63	467.10	576.14	0.437
P-256	p256_m15	719.02	705.69	378.16	1.723
P-256	p256_m15 (FP)	1089.04	1065.25	575.84	2.546
P-256	p256_m31	1857.00	965.19	991.54	1.099
P-256	p256_m31 (FP)	2791.22	1409.64	1457.20	1.623
P-384	prime_i15	137.98	102.18	85.48	0.253
P-384	prime_i31	360.31	182.73	253.43	0.149
P-521	prime_i15	59.34	44.74	38.62	0.112
P-521	prime_i31	181.06	86.21	120.29	0.065
Curve25519	c25519_i15	704.80	505.11	438.00	1.271
Curve25519	c25519_i31	1601.39	837.13	1134.70	0.725
Curve25519	c25519_m15	2052.25	2020.48	1060.21	4.420
Curve25519	c25519_m31	5708.51	3047.40	3750.23	2.226
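As a worked example of how these figures translate to handshake cost, the sketch below estimates the elliptic-curve work done by a server for an ECDHE_ECDSA handshake on P-256, using the amd64 “p256_m31” figures from the table above. It counts point multiplications only, applies the 10 to 20% ECDSA overhead quoted earlier, and ignores everything else in the handshake, so the results are upper bounds on the rate, not predictions.

```python
def seconds_for(ops) -> float:
    """Total time for a list of (multiplications_per_second, overhead_factor)."""
    return sum(factor / rate for rate, factor in ops)


# amd64 figures from the table above, in point multiplications per second.
P256_M31 = 1857.00
P256_M31_FP = 2791.22

# ECDHE: one fixed-point multiplication (own public key) and one generic
# multiplication (shared secret).
ecdhe = [(P256_M31_FP, 1.0), (P256_M31, 1.0)]

# ECDSA signature generation: one fixed-point multiplication plus 10-20%.
sign_low = [(P256_M31_FP, 1.10)]
sign_high = [(P256_M31_FP, 1.20)]

ecdhe_rate = 1.0 / seconds_for(ecdhe)
best_rate = 1.0 / seconds_for(ecdhe + sign_low)
worst_rate = 1.0 / seconds_for(ecdhe + sign_high)
print(f"ECDHE only:        {ecdhe_rate:.0f} per second")
print(f"ECDHE + ECDSA:     {worst_rate:.0f} to {best_rate:.0f} per second")
```

The same function can be fed other rows of the table, for instance the “m15” figures on the M0+, to compare implementations for a given platform.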

RSA

For RSA, we measure the number of public-key and private-key operations per second, for a 2048-bit key (public exponent is the classic 65537). RSA key exchange involves a private-key operation on the server, a public-key operation on the client. RSA signature generation is a private-key operation, while verification is a public-key operation.

There are four implementations, which correspond to the underlying generic big-integer code:

“i15”: uses 32→32 multiplications only; internally, integers are represented as arrays of 16-bit variables, each containing 15 bits of value.
“i31”: uses 32→64 multiplications only; internally, integers are represented as arrays of 32-bit variables, each containing 31 bits of value.
“i32”: a predecessor to “i31”, that stores 32 bits of value in each 32-bit variable. In general, “i32” is slower than “i31” on all platforms.
“i62”: uses the same internal representation as “i31”, except when computing multiplications, in which case 64-bit variables and 64→128 multiplications are used. It is available only on some 64-bit architectures. When present, it is much faster than “i31”.
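One way to read the “15 bits in a 16-bit variable” and “31 bits in a 32-bit variable” layouts is in terms of the multiplication they require: the product of two 15-bit words fits in 30 bits, so a 32→32 multiplication loses nothing, whereas the product of two 31-bit words needs a 32→64 multiplication. The sketch below splits an integer into such words. The helper names and the little-endian word order are chosen for illustration only; this does not reproduce BearSSL's actual in-memory encoding.

```python
def to_limbs(x: int, bits_per_limb: int) -> list:
    """Split a non-negative integer into little-endian limbs."""
    if x < 0:
        raise ValueError("x must be non-negative")
    mask = (1 << bits_per_limb) - 1
    limbs = []
    while True:
        limbs.append(x & mask)
        x >>= bits_per_limb
        if x == 0:
            return limbs


def from_limbs(limbs, bits_per_limb: int) -> int:
    """Inverse of to_limbs."""
    return sum(limb << (i * bits_per_limb) for i, limb in enumerate(limbs))


n = (1 << 2048) - 1          # a 2048-bit value, the RSA size measured here
limbs15 = to_limbs(n, 15)    # "i15"-style words
limbs31 = to_limbs(n, 31)    # "i31"-style words
print(len(limbs15), len(limbs31))

# Largest possible product of two limbs, for each layout.
max15 = ((1 << 15) - 1) ** 2
max31 = ((1 << 31) - 1) ** 2
print(max15 < 1 << 32, max31 < 1 << 32, max31 < 1 << 64)
```

The smaller words mean more of them for the same key size, hence more multiplications; this is consistent with “i15” being slower than “i31” on the platforms where 32→64 multiplications are cheap.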

Operation	Implementation	amd64 (ops/s)	i386 (ops/s)	pwr8 (ops/s)	m0+ (ops/s)
private	i15	60.74	47.10	39.91	0.128
private	i31	210.49	86.05	135.50	0.059
private	i32	99.40	44.93	70.97	0.040
private	i62	355.03	-	291.01	-
public	i15	977.77	800.45	681.88	2.216
public	i31	3533.44	1517.11	2489.88	1.068
public	i32	2077.95	957.80	1553.33	0.833
public	i62	4513.53	-	3875.80	-