- API Documentation
- Change Log
- Project Goals
- On Naming Things
- Supported Crypto
- Roadmap and Status
- OOP in C
- API Overview
- X.509 Certificates
- Constant-Time Crypto
- Constant-Time Mul
As noted in the section on constant-time crypto, integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information.
In current BearSSL (version 0.3, 2017-01-29), potentially affected implementations are:
- GHASH (for GCM): ctmul, ctmul32, ctmul64
- Poly1305: ctmul, ctmul32
- RSA: i15, i31, i32
- Elliptic curves: prime_i15, prime_i31, p256_m15, p256_m31, c25519_i15, c25519_i31, c25519_m15, c25519_m31
This page aims at aggregating information on which CPU provide constant-time multiplications, and what can be done when working on CPU that do not.
When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the
umull opcode takes 3 to 5 cycles to complete, the “short” counts (3 or 4) being taken only if both operands are numerically less than 65536 (so says the manual, but see below). Thus, such a multiplication should be made constant-time by forcing the high bits to 1. In BearSSL, the “i31” big-integer code (used for RSA and elliptic curves) internally uses a macro called
MUL31(), that takes as inputs two 31-bit words (represented in
uint32_t values) and outputs the 62-bit result (in an
uint64_t). There are two versions for this macro; the normal one is a simple multiplication, while the alternate tries to be constant-time:
#define MUL31(x, y) ((uint64_t)((x) | (uint32_t)0x80000000) \ * (uint64_t)((y) | (uint32_t)0x80000000) \ - ((uint64_t)(x) << 31) - ((uint64_t)(y) << 31) \ - ((uint64_t)1 << 62))
The alternate macro is used if the
BR_CT_MUL31 compile-time option is set (either with a compiler invocation switch, or by setting the option in
src/config.h). The resulting code will be both slightly bigger, and noticeably slower, but at least it should make RSA and elliptic curves constant-time on a Cortex-M3.
Or not. Tanja Lange pointed me to this study by Wouter de Groot, who performed actual measures on some Cortex-M3 devices, from which it appeared that a short cycle count could occur not only in the documented case (operands fit in 16 bits) but also when one or both of the operands is zero or a power of 2. In the case of
MUL31(), this means that the underlying
umull will complete in 5 cycles, except if one or both of the two operands are zero, in which case the
umull will complete in 4 cycles. Therefore, on the tested devices (and, presumably, on all devices that use a Cortex-M3 core),
MUL31() does not guarantee a fully constant-time implementation. On “normal” data the short cycle count will be quite rare (about one in a billion invocations for full-width operands) but it may happen much more often on the top bits (e.g. when using 1024-bit integers with 31-bit limbs, the top limb contains only 1 bit of data, since 31×33 = 1023). How to leverage it in an actual attack is still an open question at this point; normal caution dictates that the M3 should be avoided for now.
Even within documented behaviour (ignoring cases like the Cortex-M3), the
MUL31() is not sufficient for all architectures. On systems where the multiplication opcode yields only the low 32 bits (as is the case for ARM platforms in Thumb mode, that do not support Thumb-2), obtaining a 64-bit result necessarily involves a software subroutine which will invoke the multiplication opcode several times, with operands truncated in some way. The truncation may yield smaller values, hence a short cycle count in some cases. The subroutine may also have conditional branches. Finally, that 32-bit-only multiplication opcode may perform early exits when the top bits are all equal to one (this is the case of the ARM9T cores in Thumb mode). A similar situation arises on some architectures such as PowerPC, where the low and high words are obtained with two separate opcodes, and the one that yields the low word may again use signed early exit.
A more general strategy is to implement big integers with 15-bit words, not 31-bit; thus, a multiplication of two 15-bit words can be implemented as:
#define MUL15(x, y) ((uint32_t)((x) | (uint32_t)0x80000000) \ * (uint32_t)((y) | (uint32_t)0x80000000) \ & (uint32_t)0x3FFFFFFF)
BearSSL now implements that macro, both in the “slow” version shown above, and the “fast” version which is a simple multiplication. As in the case of
MUL31, the slow version must be activated with a specific compile-time macro called
Since the top two bits of each operand would be forced to
10, the multiplication should be constant-time on most if not all existing CPU (at least all CPU with registers of 32 bits or more). Moreover, using only 32-bit results is beneficial to platforms that do not have an opcode that yields the top 32 bits of a multiplication (in particular the Cortex-M0 and M0+, and the older ARM in Thumb mode, before Thumb-2 was introduced).
It shall be noted that, at least theoretically, a given CPU could implement a non constant-time multiplication even in the case of the slow version of
MUL15(), by returning early when one of the operands is a power of two (that is, in the context of
MUL15(), when one of the macro parameters is zero). I am not aware of any CPU that currently implements such a shortcut for a 32→32 multiplication, but the Cortex M3 does it for the 32→64 multiplication, so such a possibility cannot be absolutely excluded.
Compilers can be an additional source of trouble. When the operands of a multiplication are larger than what the underlying platform offers, a software routine is necessarily involved, either inlined or as an external function call. This is the case when, for instance, multiplying two 64-bit operands (
uint64_t) on a 32-bit architecture. A typical 32-bit system will have a 32→64 opcode (or possibly a pair of opcodes, one returning the low word, another returning the high word), but for 64-bit operands, a software subroutine may be used. Depending on the compiler, that routine may or may not be constant-time.
In general, compilers will use the direct 32→64 opcode when they can “see” that the two operands are unsigned 32-bit integers (i.e. they are both
uint32_t values, freshly cast to
uint64_t for the purpose of multiplication). In any case, if you ensure that your operands always fit on the low 32 bits, then any software shortcut when the high words are zero will always be taken, thereby avoiding any leak.
Assembly gives more control on what exactly is executed by the CPU, and may open extra options not available in portable C, e.g. vector instructions. When a vector instruction performs two or more multiplications in parallel, it can be assumed that they are constant-time. Assembly also gives control on operand order, which may matter for some architectures.
The generic drawback of assembly is its inherent lack of portability, and increased need for maintenance. And while using assembly avoids issues with C compilers, it does not solve variance within the CPU itself.
Yet another workaround is acceptance. It has been noted by Käsper and Schwabe that in GCM, the GHASH part is computed over either encrypted or non-confidential data, so it cannot leak information on secret data; moreover, its secret key is derived from the encryption key through what amounts to a one-way function, so even if that key was unveiled, encryption would still be safe. In the ChaCha20+Poly1305 cipher suite, the MAC key is even more transient: a new MAC key is pseudorandomly produced for each record. Timing attacks being statistical in nature, requiring accumulation of measures to yield substantial amounts of data, it is plausible that a non-constant-time Poly1305 implementation would not seriously endanger the integrity of records.
Therefore, a non-constant-time Poly1305, because of non-constant-time multiplication, should not be much of a worry. A non-constant-time GHASH is a bit more problematic, in the context of long-lived sessions with many records being processed with the same GHASH key; but this is still less serious than a non-constant-time AES or RSA implementation. It is still better, if possible at a low cost, to go full constant-time.
Per CPU Information
This section is supposed to be enriched over time with information on whether multiplications on some specific CPU are constant-time or not. All the data below has been gleaned from various sources, mostly technical manuals. Since instruction timings can vary at the whim of the CPU vendor, possibly between any two sub-revisions, and are not necessarily well documented, what I wrote below is incomplete and there probably are a few mistakes.
Contributions are most welcome! Please send up any information you may have on the timing behaviour of multiply instructions.
In the tables below, cores are grouped by architecture, and the timing behaviour of multiplications is provided in the following cases:
32→32: multiplication of two 32-bit unsigned words, only the low 32 bits of the result are kept. If the CPU is constant-time for this kind of multiplication, then
br_poly1305_ctmul32(), and all the “i15” and “m15” code (for RSA and EC) are constant-time.
32→64: multiplication of two 32-bit unsigned words, result in a 64-bit unsigned integer. If the CPU is constant-time for this kind of multiplication, then
br_poly1305_ctmul(), RSA “i31” and “i32” implementations, and EC “i31” implementation, are constant-time.
MUL31: multiplication of two 31-bit unsigned words, using the alternate
MUL31()macro described above. If the CPU is constant-time for this multiplication, then RSA “i31” and EC “i31” implementations are constant-time, provided that the
BR_CT_MUL31option is set.
64→64: multiplication of two 64-bit unsigned words, keeping the low 64 bits of the result. If the CPU is constant-time for this multiplication, then
For each multiplication category, the status will be one of the following:
N: not constant-time.
S+: operation not directly provided by the hardware; a software implementation is included by the compiler, and it is probably constant-time.
S: operation not directly provided by the hardware; a software implementation is included by the compiler, and it is possibly constant-time (depending on compiler + library).
S-: operation not directly provided by the hardware; a software implementation is included by the compiler, and it is probably not constant-time.
?: status unknown (more information needed!).
A common assumption is that if the platform offers a 32→64 opcode for unsigned multiplications, then the C compiler will be smart enough to notice that 32-bit unsigned integers, freshly cast to
uint64_t, are still really 32-bit values, and thus their product may use the platform opcode directly instead of using a generic 64→64 routine. In the case of GCC, this assumption holds, unless compilation is done without any optimisations at all (
On the x86 line of CPU, in 32-bit mode, the 32→32 multiplication is typically done with an
imul opcode (nominally signed, but this does not matter in that case), while the 32→64 multiplication is done with
mul, that writes the unsigned result in the
edx (high word) and
For the 64→64 multiplication, GCC inlines the relevant routine, and it combines one
mul (for the two low words), two
imul, and two additions; that routine is constant-time. However, in the same situation, Microsoft Visual C relies on a library routine which is not constant-time, in that it returns early if the top words are all-zeros or all-ones. This has been exploited to extract the private key from a server running Curve25519. Thus, on x86 in 32-bit mode, 64→64 multiplications are deemed “S”, not “S+”: they will be constant-time, or not, depending on the used compiler (a not too comfortable situation).
In 64-bit mode, the
imul opcode is always used, operands being extended with zeros if necessary. The
mul opcode will be used only when trying to obtain a 128-bit output (with a non-standard type like
__m128); on recent cores (Intel Haswell or later), a variant called
mulx allows choosing the two target registers for the 128-bit output, which reduces contention and allows for better pipelining.
In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano.
|Intel P5 (Pentium, Pentium MMX)||Y||Y||Y||S|
|Intel P6 / Pentium M (Pentium Pro to Core)||Y||Y||Y||S|
|Intel NetBurst (Pentium IV)||Y||Y||Y||S|
|Intel Core 2||Y||Y||Y||S|
|Intel Nehalem (early Core i3/i5/i7)||Y||Y||Y||S|
|Intel Sandy Bridge / Ivy Bridge||Y||Y||Y||S|
|Intel Bonnell (Atom)||Y||Y||Y||S|
|Intel Silvermont (Atom)||Y||Y||Y||S|
|Intel Goldmont (Atom)||Y||Y||Y||S|
|AMD K7 (Athlon)||Y||Y||Y||S|
|AMD K8 (Hammer)||Y||Y||Y||S|
|VIA Nano 2000 Series||N||N||Y||S-|
|VIA Nano 3000 Series||Y||Y||Y||S|
|Intel Core 2||Y||Y||Y||Y|
|Intel Nehalem (early Core i3/i5/i7)||Y||Y||Y||Y|
|Intel Sandy Bridge / Ivy Bridge||Y||Y||Y||Y|
|Intel Bonnell (Atom)||Y||Y||Y||Y|
|Intel Silvermont (Atom)||Y||Y||Y||Y|
|Intel Goldmont (Atom)||Y||Y||Y||Y|
|AMD K8 (Hammer)||Y||Y||Y||Y|
|VIA Nano 2000 Series||N||N||Y||N|
|VIA Nano 3000 Series||Y||Y||Y||Y|
The ARM line of CPU has four instruction sets:
The original set, called “A32”, with 32-bit opcodes.
The “Thumb” set, with 16-bit opcodes.
The “Thumb-2” set, that extends Thumb set with some 32-bit opcodes.
The “A64” set, with 32-bit opcodes and 64-bit registers.
In recent ARM terminology, the Thumb and Thumb-2 sets are called “T32”.
The A32 and Thumb-2 sets offer a native unsigned 32→64 opcode, called
umull. However, the Thumb set has only a 32→32 opcode (called
mul). ARM cores before the ARM11 know only Thumb, not Thumb-2; however, they also support the original A32 set, and switching from one set to the other is efficient (this can be done as part of a function call, so that A32 and Thumb sets may be used together), so software on these older cores may conceivably use an A32 routine to handle a 32→64 or 64→64 multiplication.
In A32 and Thumb-2 modes, GCC uses
mul for 32→32,
umull for 32→64, and an inline subroutine with one
umull and two
mul for 64→64. In Thumb mode, GCC uses an external library function called
__aeabi_lmul() for both 32→64 and 64→64; the implementation for that function is not constant-time, even in the 32→64 case (it uses a conditional branch for carry propagation, for what corresponds to what
umull would have provided).
The low power Cortex M0, M0+ and M1, know Thumb only — however, these offer a constant-time 32→32 opcode. The Cortex M3 has Thumb-2, so it has both the 32→32 opcode (which is constant-time) and the 32→64 opcode (which is not constant-time).
As was described in a previous section, the actual M3 device uses some undocumented “early exit” strategy in some cases, that may prevent the
MUL31() macro from guaranteeing constant-time processing in some cases, which may or may not matter in practice. Of course, any other processor with variable-time multiplication opcode may exhibit a similar behaviour (that’s the problem with undocumented behaviour).
Important caveat: in the table below, it is assumed that the code has been compiled to benefit from the abilities of the CPU; e.g. that code compiled for an ARM9E core will use the whole Thumb-2 instruction set. Code running on an ARM9E but compiled for the ARM9T will use a software subroutine for a 32→64 multiplication, since the ARM9T does not support the relevant Thumb-2 opcode (
Other vendors have licensed the ARM architectures, and may have used constant-time or non-constant-time implementations.
Processors following the PowerPC architecture have been produced by many vendors, with varying behaviours with regards to multiplications.
In 32-bit mode, a 64-bit result is obtained by using two opcodes,
mullhwu to get the high word,
mullw for the low word. Both may return “early”; in particular,
mullw may treat its operands as signed and may return early if the top bits are all-zeros or all-ones. In particular, this prevents
MUL31() from guaranteeing constant-time multiplications.
Recent, “large” CPU offer a 64-bit mode; however, even on systems where the 64-bit mode is supported, many applications are compiled in 32-bit mode, since that reduces memory usage (contrary to the x86 situation, 32-bit PowerPC code has access to as many registers as 64-bit PowerPC, making the use of 64-bit mode comparatively less attractive).
|PowerPC 603/603e (RAD6000)||N||N||N||S-|
|PowerPC 7xx (G3, RAD750)||N||N||N||S-|
|PowerPC 74xx (G4)||N||N||N||S-|
|PowerPC 970 (G5) (32-bit)||?||?||?||?|
|PowerPC 970 (G5) (64-bit)||?||?||?||?|
|PPE (Cell, Xenon) (32-bit)||Y||Y||Y||S+|
|PPE (Cell, Xenon) (64-bit)||Y||Y||Y||Y|
Please contribute any information on CPU types and architectures not listed above!