
Montgomery multiplication: a surreal techniqueP.L. Montgomery, "Modular multiplication without trial division", Math. Computation, (1985), 44:519521. The FAP4 algorithm works by computing the wrong answer in the hope that it will all turn out right in the end. A modular multiplication algorithm invented by Montgomery is even stranger: it computes the answer to a completely different sum to the one we asked for. Let's get back to our familiar example, 789098× 123456, but this time we'll compute 789098 × 123456 ÷ 1000000 instead. To do this we'll use the shiftandadd technique, but instead of starting from the lefthand end of 789098 and multiplying by 10 each time, we'll start from the righthand end and divide by 10:
Really, this is just doing the old multiplication sum backwards, but there is one difference. When we do addition, the carries always propagate from right to left, and in the original method this meant that we couldn't be certain of the value of any digit of the result until we'd finished the whole calculation. But now we are generating the result from the righthand end while the carries still flow to the left, and this means that no digit in the result changes once it's gone past the decimal point. Converting to a modular formNow suppose that we want to compute 789098 × 123456 ÷ 1000000 modulo 876543. I haven't mentioned modular division before, but its definition is simple: X ÷ Y (mod R) is simply the number Z that satisfies Z × Y = X (mod R). Let's go through the multiplication again.
You can check this result if you like, by multiplying it by 1000000 and checking that it equals 770211 modulo 876543. The good thing about this technique is that the decisions about what multiple of the modulus to add depend only on the last digit of the result, which is not affected by any carries. So a carrysave architecture works very well indeed here. Getting the answer we really wantedYou may object that when you asked for A × B you weren't asking for A × B ÷ 1000000. Suspend disbelief for a moment and look at what happens if we multiply both A and B by a million: 1000000×A × 1000000×B ÷ 1000000 = 1000000 × A × B In other words, if you multiply both the numbers by a million, Montgomery multiplication gives you a million times their multiple. Thus to multiply two numbers:
This sounds needlessly elaborate, and indeed if you only think of Montgomery multiplication as a form of modular multiplication then it isn't much use. Where it comes into its own is when you need to do a great many consecutive modular multiplications, as you would when doing a modular exponentiation. For example, here is how to raise y to the 25th power. I'll use "*" to mean "Montgomerymultiply the numbers modulo R".
All the middle stages are just the same as the ones that would we have to go through using ordinary modular multiplication, so the first and the last stage are the only tricky ones. For the last stage, we can implement the division by performing a single Montgomery multiplication:
For the first stage, we could use another Montgomery multiplication:
which would automatically divide by a million and thus give us 1000000 × y, but there is a snag because Montgomery multiplication can't handle numbers larger than the modulus. So we precompute (using some other means, perhaps even in software)
and then the first stage can be written as
So altogether the extra cost of exponentiating using Montgomery multiplication is two extra multiplications, one at the beginning and one at the end, plus the precomputation of TRILLION, which only has to be done once for any particular value of R. Easier in binaryIt's easier to explain in decimal, but Montgomery multiplication is easier in binary. The place of 1000000 is taken by some suitable power of 2, but the key simplification is that the adding of the multiple of the modulus becomes much easier. The rule is this: if the number you are looking at is odd, add R before you halve it; if it's even, just halve it. But is it useful?Montgomery multiplication is certainly beautiful and unexpected and I envy Montgomery for discovering it. If most of the work to be done by an encryption chip is to be modular exponentiation, and either the modulus changes rarely (as in DiffieHellman key exchange) or you can store the precomputed TRILLION, then it is quite efficient; but all in all it is less versatile than my algorithm or Brickell's, neither of which needs any precomputation or pre and postmultiplication.

