Efficient Rijndael Encryption Implementation with Composite Field

Eﬃcient Rijndael Encryption Implementation

with Composite Field Arithmetic

Atri Rudra

, Pradeep K. Dubey

, Charanjit S. Jutla

, Vijay Kumar

,1

Josyula R. Rao

, and Pankaj Rohatgi

IBM India Research Lab, Block I, Indian Institue of Technology,

Hauz Khas, New Delhi, 110016, India

{ratri,pkdubey,vijayk}@in.ibm.com

IBM Thomas J. Watson Research Center,

P.O.Box 704, Yorktown Heights, NY 10598, U.S.A.

{csjutla,jrrao,rohatgi}@watson.ibm.com

Abstract. We explore the use of subﬁeld arithmetic for eﬃcient imple-

mentations of Galois Field arithmetic especially in the context of the

Rijndael block cipher. Our technique involves mapping ﬁeld elements to

a composite ﬁeld representation. We describe how to select a represen-

tation which minimizes the computation cost of the relevant arithmetic,

taking into account the cost of the mapping as well. Our method results

in a very compact and fast gate circuit for Rijndael encryption.

In conjunction with bit-slicing techniques applied to newly proposed par-

allelizable modes of operation, our circuit leads to a high-performance

software implementation for Rijndael encryption which oﬀers signiﬁcant

speedup compared to previously reported implementations.

1 Introduction

In October 2000, the US National Institute of Standards and Technology (NIST)

announced that it had selected the Rijndael Block Cipher [3] as the new Ad-

vanced Encryption Standard (AES). In addition to being the new standard,

Rijndael is a cipher that oﬀers a good “combination of security, performance,

eﬃciency, implementability and ﬂexibility” [20]. It has already attained consid-

erable popularity and acceptance. Rijndael is a block cipher with a block size of

16 bytes, each of which represents an element in the Galois Field GF (2

). All

operations in Rijndael are deﬁned in terms of arithmetic in this ﬁeld.

Apart from Rijndael, there are several other instances of the use of Galois

Field arithmetic in cryptography and coding theory [10]. The eﬃciency and

performance of such applications is dependent upon the representation of ﬁeld

elements and the implementation of ﬁeld arithmetic. It is common practice to

obtain eﬃciency by careful selection of the ﬁeld representation [9,10,11]. In par-

ticular, it is well-known that the computational cost of certain Galois Field



As of April 2001, the author can be reached at Amazon.com, 605 5

Ave South,

Seattle, WA 98104, U.S.A.

C¸ .K. Ko¸c, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 171–184, 2001.

 Springer-Verlag Berlin Heidelberg 2001

172 A. Rudra et al.

operations is lower when ﬁeld elements are mapped to an isomorphic composite

ﬁeld, in which these operations are implemented using lower-cost subﬁeld arith-

metic operations as primitives [11]. Depending upon the computation involved

and the choice of representation, there are costs associated with the mapping and

conversion, and a trade-oﬀ has to be made between such costs and the savings

obtained. The design task is to carefully evaluate these trade-oﬀs to minimize

the computational cost.

In addition to an eﬃcient hardware implementation, a good circuit design is

also useful in obtaining fast software implementations. Using the technique of

bit-slicing [2] a circuit with a small number of gates can be simulated using a

wide-word processor. Multiple instances of the underlying computation are thus

performed in parallel to exploit the parallelism implicit in a wide-word computer.

This technique has been used in [2] to obtain a fast DES implementation.

In this paper, we study the use of composite ﬁeld techniques for Galois Field

arithmetic in the context of the Rijndael cipher. We show that substantial gains

in performance can be obtained through such an approach. We obtain a compact

gate circuit for Rijndael and use its design to illustrate the trade-oﬀs associated

with design choices such as ﬁeld polynomials and representations. We use our

circuit design to obtain a simple and fast software implementation of Rijndael

for wide-word architectures. The performances of both hardware as well as soft-

ware implementations show large gains in comparison with previously reported

performance ﬁgures.

The rest of this paper is organized as follows. In Section 2, we detail the map-

ping of Galois Field operations to composite ﬁeld arithmetic. Section 3 outlines

the data-slicing technique for realizing a highly parallel software implementation

from a circuit design. In Section 4, we describe the mapping of Rijndael opera-

tions to a particular class of composite ﬁelds. The selection of ﬁeld polynomials

and representations and the associated optimizations are discussed in Sections

5 and 6 respectively. Finally, in Section 7 we present our results and a compari-

son with previously reported performance ﬁgures for Rijndael. Drawings of our

Rijndael encryption circuit are included in the Appendix.

2 GF Arithmetic and Composite Fields

Composite ﬁelds are frequently used in implementations of Galois Field arith-

metic [9,10,11]. In cases where arithmetic operations rely on table lookups, sub-

ﬁeld arithmetic is used to reduce lookup-related costs. This technique has been

used to obtain relatively eﬃcient implementations for speciﬁc operations such as

multiplication, inversion and exponentiation. Much of this work has been aimed

at implementation of channel codes. The object has usually been to obtain better

software implementations by using smaller tables through subﬁeld arithmetic.

Applications to hardware design (such as [10]) have been relatively infrequent.

Our techniques are directed at both hardware and software implementations.

We take advantage of the eﬃciency obtained by the use of subﬁeld arithmetic,

not merely in the matter of smaller tables but the overall low-level (gate count)

An Eﬃcient Rijndael Encryption Implementation 173

complexity of various arithmetic operations. The computation and comparison

of such gains and cost is dependent upon several parameters – the overhead of

mapping between the original and the composite ﬁeld representations, the na-

ture of the underlying computation and its composition in terms of the relative

frequency of various arithmetic operations, and in case of software implemen-

tations, the constraints imposed by the target architecture and its instruction

set. Based on these parameters we select the appropriate ﬁeld and representa-

tion to optimize a hardware circuit design. As we shall see, there can be several

objectives for this optimization, such as critical path lengths and gate counts,

depending upon the overall design goals. The circuit design obtained can then

be used to obtain parallelism in a software implementation by means of slicing

techniques.

As described in [11], the two pairs {GF (2

),Q(y)} and {GF ((2

)

),P(x)}

constitute a composite ﬁeld if GF (2

) is constructed from GF (2) by Q(y) and

GF ((2

)

) is constructed from GF (2

)byP (x), where Q(y) and P (x) are

polynomials of degree n and m respectively. The ﬁelds GF ((2

)

) and GF (2

k = nm, are isomorphic to each other. Since the complexity of various arith-

metic operations diﬀers from one ﬁeldto another, we can take advantage of the

isomorphism to map a computation from one to the other in search of eﬃciency.

For a given underlying ﬁeld GF (2

), our gains depend on the choice of n and m

as well as of the polynomials Q(y) and P (x).

While we restrict our description to composite ﬁelds of the type GF ((2

)

it is easy to see that the underlying techniques are fully general and can be used

for any composite ﬁeld.

3 Slicing Techniques

Bit-slicing is a popular technique [2] that makes use of the inbuilt parallel-

processing capability of a wide-word processor. Bit-slicing regards a W -bit pro-

cessor as a SIMD parallel computer capable of performing W parallel 1-bit op-

erations simultaneously. In this mode, an operand word contains W bits from

W diﬀerent instances of the computation. Initially, W diﬀerent inputs are taken

and arranged so that the ﬁrst word of the re-arranged input contains the ﬁrst

bit from each of the W inputs, the second word contains the second bit from

each input, and so on. The resulting bit-sliced computation can be regarded as

simulating W instances of the hardware circuit for the original computation. In-

deed, a bit-sliced computation is designed by ﬁrst designing a hardware circuit

and then simulating it using W -bit registers on the rearranged input described

above.

A bit-sliced implementation corresponding to an N -gate circuit requires N

instructions to carry out W instances of the underlying computation, or N/W

instructions per instance. This can be particularly eﬃcient for computations

which are not eﬃciently supported by the target architecture. Consider for in-

stance GF (2

) multiplication on AltiVec [4]. The straightforward implementa-

tion uses three table-lookups and one addition modulo 255. 16-parallel lookups

174 A. Rudra et al.

in a 256-entry table can be performed on AltiVec in 20 instructions. Thus, a set

of 128 multiplications would require 488 instructions. In comparison, our 137-

gate multiplication circuit translates into a bit-sliced implementation that can

perform 128 multiplications in 137 instructions!

The above computation ignores the cost of ordering the input in bit-sliced

fashion and doing the reverse for the output. To evaluate the trade-oﬀ correctly,

this cost has to be taken into account as well. In general, this cost will depend

on the target instruction set.

However, it is possible to think of scenarios in which a particular operation

may be eﬃciently supported in an architecture. For example, if the AltiVec ar-

chitecture were to provide an instruction for 16 parallel GF (2

) multiplications

which use the underlying ﬁeld polynomial of interest to us (a hypothetical but

nonetheless technically feasible scenario since the critical path of the multipli-

cation circuit is only six gates deep), then a direct computation would require

only eight instructions, compared to the 137 required by the bit-sliced version.

Now consider GF (2

) multiplications on this hypothetical version of the

AltiVec architecture. It is easy to see that the most eﬃcient computation is

neither a direct one, nor a bit-sliced version, but a byte-sliced computation, in

which each GF (2

) multiplication is mapped to a small number of GF (2

)

operations, which are eﬃciently supported by the architecture in question. In

general, the right “slice” to use would depend on the target architecture.

3.1 Encrypting without Chaining

Our Rijndael implementation processes 128 blocks of data in parallel. Tradi-

tionally, such a scheme would be regarded as more useful for decryption than

for encryption, since encryption is usually performed in inherently sequential

modes such as Cipher Block Chaining or CBC [17,18,19]. The well-known CBC

[17,18,19] is used as a defense against replay attacks [12]. In the CBC mode of

encryption, parallel blocks would not be available for encryption except where

data from many streams is encrypted in parallel.

However, a new parallelizable variant of CBC [7] removes this limitation and

makes it possible to use CBC encryption without the usual sequentiality. This

makes it possible to utilize the high throughput rates of our implementation in

conjunction with the popular CBC mode.

4 Rijndael in a Composite Field

Rijndael involves arithmetic on GF(2

) elements. In a straightforward implemen-

tation, inverse, multiplication and substitution are likely to be the operations

that determine the overall complexity of the implementation. The most common

approach is to use table lookups for these operations. By mapping the operations

into a composite ﬁeld, we are able to obtain both a small circuit in case of a

hardware implementation as well as smaller instruction counts and table sizes

in case of software implementations.

An Eﬃcient Rijndael Encryption Implementation 175

For our Rijndael implementation, we work in the composite ﬁeld GF ((2

)

We selected the ﬁeld polynomial Q(y)=y

+ y + 1 for GF(2

). For P (x), we

consider all primitive polynomials of the form P (x)=x

+ x + λ where λ is an

element of GF (2

). There are four such polynomials, for each of which there are

seven diﬀerent transformation matrices to consider, one corresponding to each

possible choice of basis. The criterion used by us to compare various choices is

the gate count of the resulting Rijndael circuit implementation.

Rijndael operations translate to the composite ﬁeld representation as follows.

H denotes the mapping from GF (2

)toGF ((2

)

), and T the corresponding

transformation matrix — that is, H(x)=Tx. S isa4× 4 matrix (the state)on

which all operations are performed.

– ByteSub transformation : This has essentially two sub-steps:

1. P

=(S

)

−1

. In the composite ﬁeld, H(P

)=(H(S

))

−1

The calculation of an inverse is as follows. Every A ∈ GF ((2

)

) can be

represented as A = a

+ βa

where β

+ β + λ = 0, and a

∈ GF (2

The inverse is B = A

−1

= b

+ βb

, b

∈ GF (2

) , such that b

+ a

)∆

−1

and b

= a

∆

−1

, where ∆ = a

+ a

)+λa

2. Q

= AP

+ c where A is a ﬁxed 8 × 8 matrix and c ∈ GF (2

In the composite ﬁeld, H(Q

)=H(AP

)+H(c)=TAP

+ H(c)

= TAT

−1

H(P

)+H(c).

– ShiftRow transformation : This step is independent of representation.

– MixColumn transformation : This involves essentially the computation P

+ a

, where (a

) is a permutation of

(01, 01, 02, 03). In the composite ﬁeld,

H(P

)=H(a

)H(S

)+H(a

)H(S

)+H(a

)H(S

)+H(a

)H(S

The following observations are useful in the implementation -

• If x ∈ GF ((2

)

) then H(01) × x = x as the identity element is mapped

to the identity element in a homomorphism.

• H(03) = H(02) + H(01).

– Round Key addition : The operation is P = S +K where K is the round key.

In the composite ﬁeld, H(P )=H(S)+H(K). Addition is simply an EXOR

in either representation.

The mapping of the arithmetic to the composite ﬁeld together with judicious

choice of the ﬁeld polynomial gives us a substantially smaller circuit, as we shall

see in Section 7.

5 Optimizations

All operations in the Rijndael block cipher are in GF (2

). As outlined in sec-

tion 4, some of these GF (2

) operations have relatively ineﬃcient gate circuit

implementations and can be implemented more eﬃciently in some isomorphic

composite ﬁeld. One overhead in using subﬁeld arithmetic is the cost of the

conversion from the original to the composite ﬁeld and vice-versa. To illustrate,

176 A. Rudra et al.

consider our Rijndael implementation, which uses subﬁeld arithmetic. The cost

of the transformations is dependent on the choice of the composite ﬁeld. We

describe below a method by which an eﬃcient

transformation matrix from the

set T = {T

, T

, ....} of valid transformation matrices can be chosen.

Let C(θ) denote the cost of the operation θ, which in the present case is taken

to be the gate count of the circuit implementation of θ. Depending upon design

objectives and application, there can be alternative cost measures, such as the

depth of the critical path, for instance. Let W(x) denote the hamming weight

of x, i.e., the number of 1s in the polynomial representation of x.

The aim is to ﬁnd T

∗

, the most eﬃcient transformation, and the correspond-

ing choice of composite ﬁeld. This is the composite ﬁeld which minimizes the

gate count of the Rijndael circuit implementation. Note that while comparing

the cost for diﬀerent transformations, we need to consider only those Rijndael

operations whose costs are dependent upon the choice of composite ﬁeld. The

relevant operations are those which involve λ or the conversion matrices (T and

−1

The costs of diﬀerent operations are:

– Transform : This step involves computing H(S).

Thus, C(Transform) = 16 ×C(T.x).

– ByteSub transformation : As noted earlier, this step consists of an inverse

calculation and an aﬃne transform –

1. P

=(S

)

−1

. The only operation whose cost depends on the choice of

ﬁeld is the calculation of λa

.SoC(inverse)=16×C(λ.x).

2. Q

= AP

+ c, or, in the composite ﬁeld, H(Q

)=H(AP

)+H(c)

= TAP

+ H(c)

= TAT

−1

H(P

)+H(c).

Thus C(affine)=16× (C(B.x)+W(H(c)) ), where B = TAT

−1

– ShiftRow transformation : This step does not require any computation.

– MixColumn transformation :

As note earlier, this step is the computation

H(P

)=H(a

)H(S

)+H(a

)H(S

)+H(a

)H(S

)+H(a

)H(S

Since H(01).x = x, C(mixClm)=16× (C(H(02).x)+C(H(03).x)).

– Round Key addition : The computation is H(P )=H(S)+H(K), so C(addKey)

=16×C(T.x).

– Inverse Transform : C(invT ransform)=16×C(T

−1

.x).

∗

depends upon whether a pipelined (unrolled loop) or iterative (loop not

unrolled) Rijndael circuit is to be obtained. The former oﬀers superior perfor-

mance compared to the latter at the cost of a larger gate count.

In terms of the Rijndael gate-circuit implementation.

Note that W(H(c)) is the number of not gates required to implement H(c)+x ,

where x ∈ GF ((2

)

An Eﬃcient Rijndael Encryption Implementation 177

The criterion for the best transformation can be represented as follows:

∗

= arg min

∈T

( C(transform)+n ×C(inverse)+n ×C(affine)

+m ×C(mixClm)+(n +1)×C(addKey)+C(invT ransform)).

i.e.

∗

= arg min

∈T

((n +2)×C(T.x)+n ×C(λ.x)+n × (C(B.x)+W(H(c)) )

+m × ( C(H(02).x)+C(H(03).x))+C(T

−1

.x))

where m and n are both 1 for an iterative circuit, and R and R−1 respectively

for a pipelined circuit, where R is the number of rounds as speciﬁed in the

Rijndael cipher.

Based on these considerations, we selected the polynomial P (x)=x

x + ω

. That is, we chose λ to be ω

where ω is the primitive element of

GF (2

). The following transformation matrix maps an element from GF (2

)to

the corresponding element in the chosen composite ﬁeld:







10100000

10101100

11010010

01110000

11000110

01010010

00001010

11011101







6 Finding a Transform

A method for generating a transformation matrix to map elements of GF (2

)to

GF ((2

)

) can be found in literature [11] for the case where all the ﬁeld poly-

nomials involved are primitive polynomials. However, in the case of Rijndael the

ﬁeld polynomial is R(z)=z

+ z

+ z + 1 is an irreducible polynomial

but is not primitive. Since the ﬁelds involved are small, we use an exhaustive

search method that can ﬁnd the transformation in question in case R(z) is irre-

ducible but not primitive. The basic idea is to map α, the primitive element of

GF ((2

)

)toγ, a primitive element of GF (2

), such that ﬁeld homomorphism

holds.

The algorithm is composed of the following three steps —

1. Get a primitive element γ of GF (2

) and map α

to γ

for i ∈ [0..(2

− 1)].

Note that this step preserves the multiplicative group homomorphism — for

any i, j ∈ [0..(2

− 1)], α

× α

= α

i+j

maps to γ

× γ

= γ

i+j

2. Perform the following check — ∀i ∈ [0..(2

−1)], if α

= α

+1 then γ

= γ

1. If so then we have the required mapping; else repeat this step for the next

primitive element.

178 A. Rudra et al.

This is to verify additive group homomorphism, which requires that

∀i, j ∈ [0..(2

− 1)] α

= α

+ α

⇒ γ

= γ

+ γ

. That is,

= α

× (1 + α

j−i

) ⇒ γ

= γ

× (1 + γ

j−i

Multiplicative group homomorphism implies that it is suﬃcient to verify

whether

∀i, j ∈ [0..(2

− 1)], α

t−i

=1+α

j−i

⇒ γ

t−i

=1+γ

j−i

3. The matrix,T

−1

is obtained by placing in the i

column the element H(2

)

in the standard basis representation

for all i.

7 Performance

Our performance ﬁgures reported below are for Rijndael encryption circuit and

software, which assume key size of 128 bits.

Our core circuit for Rijndael encryption contains less than four thousand

gates. For the purpose of comparison, we report numbers based upon a circuit

with 520 I/O pins that uses multiple cores in parallel.

Table 1. Circuit Performance Figures

Transistor/Gate count Cycles/block Throughput

Ichikawa

et al.[6] 518K gates ? 1.95 Gbps

Weeks et al.[13] 642K transistors ? 606 Mbps

Elbirt et al.[5] ? 6 300Mbps@14MHz

(256-pin I/O) ? 2.1 1.938 Gbps@32 MHz

Our hardware circuit 256K gates 0.5 7.5 Gbps@32 MHz

using 32 parallel cores (iterated) of 4k gates each and

252 gate levels

Table 2 lists cycle counts and target architectures for various reported imple-

mentations. In our case, the numbers apply to any architecture that can support

bitwise AND and EXOR in addition to LOAD and STORE operations. The three

numbers we report correspond to architectures with eﬀective datapath widths

(number in parenthesis) of 256 bits, 384 bits and 512 bits respectively (this is

perhaps the interesting range of architectures today). The cycle count goes down

with increasing datapath width.

It may be mentioned that no minimization or synthesis tools were used for

our circuit — the only minimization used is in the sense of section 5. The only

gates in our circuit are XOR, AND and NOT gates.

Here 2

denotes the element whose bit representation contains all 0s except a 1 in

the ith place. For example for n =4,m =2,2

is the element 00010000, i.e., α.

This circuit performs encryption as well as decryption.

An Eﬃcient Rijndael Encryption Implementation 179

Table 2. Cycle counts per block for software implementations

Worley et al[16] 284 (Pentium) 176 (PA-RISC) 124 (IA-64)

Requires an 8KB table

Weiss et al.[14] 210 (Alpha 21264)

Wollinger et al.[15] 228 (TMS320C6x)

Aoki et al.[1] 237 (Pentium II)

Our bit-sliced software

170 (256b) 119 (384b) 100 (512b)

Requires only EXOR, AND, L/S, and 2KB table

Acknowledgments

The authors would like to thank Christof Paar and Gaurav Aggarwal for helpful

discussions.

References

1. Kazumaro Aoki and Helger Lipmaa, ”Fast Implementations of AES can-

didates”. In Proc. Third AES Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

2. Eli Biham, “A Fast New DES Implementation in Software”. In Proc. Fast Software

Encryption 4,1997. http://www.cs.technion.ac.il/˜biham/publications.html

3. Joan Daemen and Vincent Rijmen, “AES Proposal: Rijndael”.

http://www.esat.kuleuven.ac.be/˜rijmen/rijndael.

4. Keith Diefendorﬀ, Pradeep K. Dubey, Ron Hochsprung and Hunter Scales, “Al-

tiVec Extension to PowerPC Accelerates Media Processing”. In IEEE Micro,

March-April 2000, pp85-95.

5. AJ Elbirt, W Yip, B Chetwynd and C Paar, “An FPGA Implementa-

tion and Performance Evaluation of the AES Block Cipher Candidate Algo-

rithm Finalists”. In Proc. Third AES Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

6. Tetsuya Ichikawa, Tomomi Kasuya and Mitsuru Matsui, “Hardware Evaluation of

the AES Finalists”. In Proc. Third AES Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

7. Charanjit S. Jutla, “Encryption Modes with Almost Free Message Integrity”.

Manuscript.

8. Rudolf Lidl and Harald Niederreiter, Introduction to ﬁnite ﬁelds and their appli-

cations. Cambridge University Press, Cambridge, Ma., 1986.

9. Edoardo D. Mastrovito, VLSI Architectures for Computations in Galois Fields.

PhD Thesis, Dept. of EE, Link¨oping University, Link¨oping, Sweden 1991.

10. Christof Paar and Pedro Soria-Rodriguez, “Fast Arithmetic Architectures for

Public-Key Algorithms over Galois Fields GF ((2

)

)”. In Proc. EUROCRYPT

’97.

The performance numbers include the cost of ordering the input in bit-sliced fashion

and the reverse for the output.

180 A. Rudra et al.

11. Chirstof Paar, Eﬃcient VLSI Architectures for Bit-Parallel Computations in Ga-

lois Fields. PhD Thesis, Institute for Experimental Mathematics, University of

Essen, Germany, 1994.

http://www.ece.wpi.edu/Research/crypt/theses/paar

thesispage.html.

12. Bruce Schneier, Applied Cryptography, John Wiley and Sons,1996.

13. Bryan Weeks, Mark Bean, Tom Rozylowicz and Chris Ficke, “Hardware Perfor-

mance Simulations of Round 2 Advanced Encryption Standard Algorithm”. In

Proc. Third AES Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

14. Richard Weiss and Nathan Binkert “A comparison of AES candidates on the

Alpha 21264”. In Proc. Third AES Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

15. Thomas J. Wollinger, Min Wang, Jorge Guajardo and Christof Paar, “How Well

Are High-End DSPs suited for AES Algorithms?” In Proc. Third AES Candidate

Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

16. John Worley, Bill Worley, Tom Christian and Christopher Worley, “AES Finalists

on PA-RISC and IA-64: Implementations & Performance”. In Proc. Third AES

Candidate Conference, April 13-14, 2000.

http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html

17. “American National Standard for Information Systems — Data Encryption Al-

gorithm — Modes of Operation”. ANSI X3.106, American National Standards

Institute, 1983.

18. “Information processing — Modes of operation for a 64-bit block cipher algorithm”.

ISO 8372, International Organisation for Standardisation, Geneva, Switzerland,

1987.

19. “DES modes of operation”. NBS FIPS PUB 81, National Bureau of Standards,

U.S. Department of Commerce, 1980.

20. http://www.nist.gov/public

affairs/releases/g00-176.htm, US Commerce

Department Press Release.

Appendix: Our Rijndael Ciruit

Presented below are drawings of our gate circuit for Rijndael encryption.The

ﬁgures appear in the order of the level of detail in them – Figure 1 showing the

high level view of our circuit.

An Eﬃcient Rijndael Encryption Implementation 181

128

128*(n+1)

Cipher Text

Transform

Rijndael-impl

Inverse-transform

Plain text block Round Keys

Fig. 1. This ﬁgure contains the high level view of the Rijndael encryption circuit. The

transform function consists of 16 parallel ciruits for T

∗

.x, where x ∈ GF (2

) and T

∗

the matrix to convert elements from GF (2

) to elements of composite ﬁeld as decided

by section 5. Similarly, Inverse-transform consists of 16 parallel circuits for (T

∗

)

−1

.x.

Circuits for the multiplication of a constant matrix with a vector are obtained from

the method given in [11]

128

Add_Round_Key

Transform

128 128

Transform

128 128

128

Transform

Round_1

128

128 128

128

Transform

128 128

Round_2

Round_n

Key_2

Key_0

Key_1

Key_n

Fig. 2. This ﬁgure describes the rijndael-impl block in Figure 1

182 A. Rudra et al.

128

Key_i

Byte_sub

Shift_row

Mix_coloumn

Add_Round_Key

Shift_row

Byte_sub

Key_n

Round_i (0<i<n) Round_n

Fig. 3. This ﬁgure shows the composition of each round. Note that in our implementa-

tion, n=10. Shift

row does not require any gate. Add Round Key is simply the EXOR

of the corresponding bits of the two inputs

i_1 i_8

o_8

Inverse8

Affine

Inverse8

Affine

i_113 i_128

o_128

Inverse8

Affine

i_9 i_16

o_16o_1 o_9 o_113

Fig. 4. This ﬁgure shows the implementation of the Byte sub operation. Aﬃne has 16

parallel circuits for calculating T

∗

A(T

∗

)

−1

.x + H(c)

An Eﬃcient Rijndael Encryption Implementation 183

Linear_comb Linear_comb

o_97 o_128o_32

i_1 i_32

o_1

i_97 i_128

Fig. 5. mix

column

H(02).x H(03).x

Add8

H(02).x H(03).x

Add8

H(02).x H(03).x

Add8

H(02).x H(03).x

Add8

i_1 i_8

o_1 o_8

i_9 i_17 i_24

o_9 o_16 o_24 o_32

i_16

i_25

i_32

o_17 o_25

All datapaths are 8-bit wide

Fig. 6. This ﬁgure describes the linear comb operation from Figure 5. Add8 is simply

the EXOR of the corresponding bits of the two inputs

184 A. Rudra et al.

Add4 Cnst_mult11 Cnst_mult11

Cnst_mult1

Add4 Add4

Add4Add4

H(02).x

H(03).x

i_8

o_5 o_8

o_1 o_4

o_5 o_8

o_4o_1

i_5

i_1

i_4

All datapaths are 4-bit wide

Fig. 7. This ﬁgure shows the circuits for calculating H(02).x and H(03).x; Add4 is

simply the EXOR of the corresponding bits of the two inputs. Const

multi evaluates

the constant multiplication ω

.x, where ω is the primitive element of GF (2

) and

x ∈ GF (2

). These circuits have been obtained from [11]

Inverse4

Square4 Add4

Mult4Cnst_mult14

Add4

Mult4 Mult4

i_5 i_8

o_1 o_4 o_8o_5

i_1 i_4

All datapaths are 4-bit wide

Fig. 8. This ﬁgure 8 shows the Inverse8 operation in Figure 4. Square4 is from [11];

Mult4 and Inverse4 are from [9]