# Martin Kochanski’s web site / Encryption / The FAP4 chip

## Montgomery multiplication: a surreal technique

P.L. Montgomery, "Modular multiplication without trial division", Math. Computation, (1985), 44:519-521.

The FAP4 algorithm works by computing the wrong answer in the hope that it will all turn out right in the end.  A modular multiplication algorithm invented by Montgomery is even stranger: it computes the answer to a completely different sum to the one we asked for.

Let's get back to our familiar example, 789098× 123456, but this time we'll compute 789098 × 123456 ÷ 1000000 instead.  To do this we'll use the shift-and-add technique, but instead of starting from the left-hand end of 789098 and multiplying by 10 each time, we'll start from the right-hand end and divide by 10:

 8 × 123456 = 9 8 7 6 4 8 ÷ 10 = 9 8 7 6 4 . 8 9 × 123456 = + 1 1 1 1 1 0 4 9.8 × 123456 = 1 2 0 9 8 6 8 . 8 ÷ 10 = 1 2 0 9 8 6 . 8 8 0 × 123456 = + 0 0.98 × 123456 = 1 2 0 9 8 6 . 8 8 ÷ 10 = 1 2 0 9 8 . 6 8 8 9 × 123456 = + 1 1 1 1 1 0 4 9.098 × 123456 = 1 1 2 3 2 0 2 . 6 8 8 ÷ 10 = 1 1 2 3 2 0 . 2 6 8 8 8 × 123456 = + 9 8 7 6 4 8 8.9098 × 123456 = 1 0 9 9 9 6 8 . 2 6 8 8 ÷ 10 = 1 0 9 9 9 6 . 8 2 6 8 8 7 × 123456 = + 8 6 4 1 9 2 7.89098 × 123456 = 9 7 4 1 8 8 . 8 2 6 8 8 0.789098 × 123456 = 9 7 4 1 8 . 8 8 2 6 8 8

Really, this is just doing the old multiplication sum backwards, but there is one difference.  When we do addition, the carries always propagate from right to left, and in the original method this meant that we couldn't be certain of the value of any digit of the result until we'd finished the whole calculation.  But now we are generating the result from the right-hand end while the carries still flow to the left, and this means that no digit in the result changes once it's gone past the decimal point.

### Converting to a modular form

Now suppose that we want to compute 789098 × 123456 ÷ 1000000 modulo 876543.  I haven't mentioned modular division before, but its definition is simple: X ÷ Y (mod R) is simply the number Z that satisfies Z × Y = X (mod R).

Let's go through the multiplication again.

 8 × 123456 = 9 8 7 6 4 8 Now we want to divide by 10, but we can't do it because this number isn't divisible by 10.  So we add a multiple of 876543, which we're allowed to do because we're working modulo 876543. + 3 5 0 6 1 7 2 8 × 123456 = = 4 4 9 3 8 2 0 ÷ 10 = 0.8 × 123456 = 4 4 9 3 8 2 9 × 123456 = + 1 1 1 1 1 0 4 9.8 × 123456 = = 1 5 6 0 4 8 6 Another multiple of 876543 + 7 0 1 2 3 4 4 = 8 5 7 2 8 3 0 ÷ 10 = 0.98 × 123456 = 8 5 7 2 8 3 0 × 123456 = + 0 0.98 × 123456 = = 8 5 7 2 8 3 Another multiple of 876543 + 7 8 8 8 8 8 7 = 8 7 4 6 1 7 0 ÷ 10 = 0.098 × 123456 = 8 7 4 6 1 7 9 × 123456 = + 1 1 1 1 1 0 4 9.098 × 123456 = = 1 9 8 5 7 2 1 Another multiple of 876543 + 2 6 2 9 6 2 9 = 4 6 1 5 3 5 0 ÷ 10 = 0.9098 × 123456 = 4 6 1 5 3 5 8 × 123456 = + 9 8 7 6 4 8 8.9098 × 123456 = = 1 4 4 9 1 8 3 Another multiple of 876543 + 7 8 8 8 8 8 7 = 9 3 3 8 0 7 0 ÷ 10 = 0.89098 × 123456 = 9 3 3 8 0 7 7 × 123456 = + 8 6 4 1 9 2 7.89098 × 123456 = = 1 7 9 7 9 9 9 Another multiple of 876543 + 6 1 3 5 8 0 1 = 7 9 3 3 8 0 0 ÷ 10 = 0.789098 × 123456 = 7 9 3 3 8 0

You can check this result if you like, by multiplying it by 1000000 and checking that it equals 770211 modulo 876543.

The good thing about this technique is that the decisions about what multiple of the modulus to add depend only on the last digit of the result, which is not affected by any carries.  So a carry-save architecture works very well indeed here.

### Getting the answer we really wanted

You may object that when you asked for A × B you weren't asking for A × B ÷ 1000000.  Suspend disbelief for a moment and look at what happens if we multiply both A and B by a million:

1000000×A × 1000000×B ÷ 1000000 = 1000000 × A × B

In other words, if you multiply both the numbers by a million, Montgomery multiplication gives you a million times their multiple.  Thus to multiply two numbers:

1. Multiply each of them by 1000000 (modulo R).
2. Perform a Montgomery multiplication.
3. Divide the result by 1000000 (modulo R).

This sounds needlessly elaborate, and indeed if you only think of Montgomery multiplication as a form of modular multiplication then it isn't much use.  Where it comes into its own is when you need to do a great many consecutive modular multiplications, as you would when doing a modular exponentiation.  For example, here is how to raise y to the 25th power.  I'll use "*" to mean "Montgomery-multiply the numbers modulo R".

Y = 1000000 × y (modulo R)

Y2 = Y * Y

Y4 = Y2 * Y2

Y6 = Y4 * Y2

Y12 = Y6 * Y6

Y24 = Y12 * Y12

Y25 = Y24 * Y

y25 = Y25 ÷ 1000000 (modulo R)

All the middle stages are just the same as the ones that would we have to go through using ordinary modular multiplication, so the first and the last stage are the only tricky ones.

For the last stage, we can implement the division by performing a single Montgomery multiplication:

y25 = Y25 * 1

For the first stage, we could use another Montgomery multiplication:

Y = y * 1000000000000

which would automatically divide by a million and thus give us 1000000 × y, but there is a snag because Montgomery multiplication can't handle numbers larger than the modulus.  So we precompute (using some other means, perhaps even in software)

TRILLION = 1000000000000 (modulo R)

and then the first stage can be written as

Y = y * TRILLION

So altogether the extra cost of exponentiating using Montgomery multiplication is two extra multiplications, one at the beginning and one at the end, plus the precomputation of TRILLION, which only has to be done once for any particular value of R.

### Easier in binary

It's easier to explain in decimal, but Montgomery multiplication is easier in binary.  The place of 1000000 is taken by some suitable power of 2, but the key simplification is that the adding of the multiple of the modulus becomes much easier.  The rule is this: if the number you are looking at is odd, add R before you halve it; if it's even, just halve it.

### But is it useful?

Montgomery multiplication is certainly beautiful and unexpected and I envy Montgomery for discovering it.  If most of the work to be done by an encryption chip is to be modular exponentiation, and either the modulus changes rarely (as in Diffie-Hellman key exchange) or you can store the precomputed TRILLION, then it is quite efficient; but all in all it is less versatile than my algorithm or Brickell's, neither of which needs any precomputation or pre- and post-multiplication.