174 A. Rudra et al.
in a 256-entry table can be performed on AltiVec in 20 instructions. Thus, a set
of 128 multiplications would require 488 instructions. In comparison, our 137-
gate multiplication circuit translates into a bit-sliced implementation that can
perform 128 multiplications in 137 instructions!
The above computation ignores the cost of ordering the input in bit-sliced
fashion and doing the reverse for the output. To evaluate the trade-off correctly,
this cost has to be taken into account as well. In general, this cost will depend
on the target instruction set.
However, it is possible to think of scenarios in which a particular operation
may be efficiently supported in an architecture. For example, if the AltiVec ar-
chitecture were to provide an instruction for 16 parallel GF (2
8
) multiplications
which use the underlying field polynomial of interest to us (a hypothetical but
nonetheless technically feasible scenario since the critical path of the multipli-
cation circuit is only six gates deep), then a direct computation would require
only eight instructions, compared to the 137 required by the bit-sliced version.
Now consider GF (2
16
) multiplications on this hypothetical version of the
AltiVec architecture. It is easy to see that the most efficient computation is
neither a direct one, nor a bit-sliced version, but a byte-sliced computation, in
which each GF (2
16
) multiplication is mapped to a small number of GF (2
8
)
operations, which are efficiently supported by the architecture in question. In
general, the right “slice” to use would depend on the target architecture.
3.1 Encrypting without Chaining
Our Rijndael implementation processes 128 blocks of data in parallel. Tradi-
tionally, such a scheme would be regarded as more useful for decryption than
for encryption, since encryption is usually performed in inherently sequential
modes such as Cipher Block Chaining or CBC [17,18,19]. The well-known CBC
[17,18,19] is used as a defense against replay attacks [12]. In the CBC mode of
encryption, parallel blocks would not be available for encryption except where
data from many streams is encrypted in parallel.
However, a new parallelizable variant of CBC [7] removes this limitation and
makes it possible to use CBC encryption without the usual sequentiality. This
makes it possible to utilize the high throughput rates of our implementation in
conjunction with the popular CBC mode.
4 Rijndael in a Composite Field
Rijndael involves arithmetic on GF(2
8
) elements. In a straightforward implemen-
tation, inverse, multiplication and substitution are likely to be the operations
that determine the overall complexity of the implementation. The most common
approach is to use table lookups for these operations. By mapping the operations
into a composite field, we are able to obtain both a small circuit in case of a
hardware implementation as well as smaller instruction counts and table sizes
in case of software implementations.