
Lookahead carry and other improvements

Remember that the big problem in addition comes when you are doing a sum like this:
where I've used the red background to show an adder that's emitting a carry. That carry digit has to propagate all the way from the rightmost adder to the leftmost one before we can get the true result, and having to allow for this is what slows everything down.

One option: knowing how long to wait

Most of the time, carries don't propagate very far. The sum that I've shown you above is exceptional and it occurs, on average, only once every 1,000 million million times that you do an addition. Because most chips have a clock that ticks at a fixed rate (you start a calculation, wait long enough for it to be completed, then the clock ticks and you store the result and start the next calculation), we have to make the clock slow enough for even the rarest case, and in practice most of the time between clock ticks is spent waiting for something that probably won't happen.

It is possible to redesign the chip so that it reports when a calculation is complete. In essence, we arrange things so that each carry signal is no longer "Yes" or "No", but "Yes", "No", or "Don't know", with an adder whose digits sum to 9 (the awkward case) reporting "Don't know" until it has received "Yes" or "No" from its right-hand neighbour. If you then have a circuit that checks whether any of the adders is saying "Don't know", you have a simple strategy: wait until there is no "Don't know", and then let the clock tick. In the worst case (the 1 in 1,000,000,000,000,000 chance) this won't be any slower than the old method, but almost always it will be much faster.

This works. The main trouble is that this revision makes all the circuitry much more complex, and the irregular timing makes control more difficult. As far as I was concerned, the increased complexity ruled out further investigation: chip space was short enough as it was.

Lookahead carry

Lookahead carry depends on the fact that a carry can only propagate through a sequence of adders if the sum of the digits in every one of those adders is 9.
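The three-way carry signal and the rule about sums of 9 can be sketched in a few lines of Python. This is only an illustration under assumed names (`to_digits`, `carry_signals` and `longest_carry_chain` are hypothetical helpers, not part of any chip design): it labels each decimal adder "Yes", "No" or "Don't know" from its two input digits alone, then measures how many adders the longest carry chain touches.

```python
def to_digits(n, width):
    """Decimal digits of n, most significant first, zero-padded to width."""
    return [int(c) for c in str(n).zfill(width)]


def carry_signals(a_digits, b_digits):
    """Classify every adder from its own two digits only."""
    signals = []
    for a, b in zip(a_digits, b_digits):
        s = a + b
        if s >= 10:
            signals.append("Yes")          # emits a carry whatever arrives
        elif s == 9:
            signals.append("Don't know")   # passes on a carry only if one arrives
        else:
            signals.append("No")           # can never emit a carry
    return signals


def longest_carry_chain(a_digits, b_digits):
    """Adders in the longest chain: one "Yes" plus the run of 9s to its left."""
    longest = current = 0
    # Walk from the rightmost (least significant) adder to the leftmost.
    for signal in reversed(carry_signals(a_digits, b_digits)):
        if signal == "Yes":
            current = 1
        elif signal == "Don't know" and current > 0:
            current += 1
        else:
            current = 0
        longest = max(longest, current)
    return longest


print(carry_signals(to_digits(1234, 4), to_digits(5678, 4)))
print(longest_carry_chain(to_digits(9999999999999999, 16), to_digits(1, 16)))
```

The second call uses an assumed worst-case sum of sixteen 9s plus 1, in which the carry has to visit all sixteen adders; for a typical pair of numbers the chain is only one or two adders long.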
First, we divide the adders into blocks of four. This is a purely conceptual division and nothing really changes. For 300 digits, this means 75 blocks.

Next, we look at a typical block somewhere in the middle and consider the circumstances under which it will pass a carry to the block on its left. There are two. Either the carry has originated within the block itself, or the carry came in from the right and has been passed through all the adders in the block because every one of them had a sum of 9.

To each block, we add a supervisor circuit. Its job is to check whether every adder in the block has a sum of 9. If so, the supervisor watches out for a carry coming in from the right. If a carry does arrive, the supervisor instantly passes it out to the block on the left.

What does this achieve? Its effect is that if carries have to be propagated over a long distance, they will go from supervisor to supervisor rather than trickling through every single adder. Since each supervisor is responsible for four adders, the result is that the signal travels four times as fast.

Let's look again at the example. I'll give the adders names from A to P so we can follow them more easily.
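For readers who prefer code to diagrams, the supervisor rule can be written as a small Python sketch. It is a software model only, with hypothetical names (`BLOCK`, `ripple_block`, `add_with_supervisors`), and it says nothing about timing: it simply checks that deciding each block's outgoing carry by the two cases above (all sums are 9, so pass the incoming carry along; otherwise use whatever carry the block produces on its own) gives the same answer as ordinary addition.

```python
BLOCK = 4  # adders per supervisor, as in the text


def to_digits(n, width):
    """Decimal digits of n, most significant first, zero-padded to width."""
    return [int(c) for c in str(n).zfill(width)]


def ripple_block(sums, carry_in):
    """Ripple a carry through one block; sums are least significant first."""
    digits = []
    carry = carry_in
    for s in sums:
        total = s + carry
        digits.append(total % 10)
        carry = total // 10
    return digits, carry


def add_with_supervisors(a_digits, b_digits):
    """Add two equal-length digit lists (most significant digit first).

    Returns (result_digits, carry_out).
    """
    # Per-adder digit sums, reordered so the rightmost adder comes first.
    sums = [a + b for a, b in zip(a_digits, b_digits)][::-1]
    blocks = [sums[i:i + BLOCK] for i in range(0, len(sums), BLOCK)]

    result = []
    carry = 0  # carry entering the current block from the right
    for block in blocks:
        digits, _ = ripple_block(block, carry)
        result.extend(digits)
        if all(s == 9 for s in block):
            # Supervisor case: the incoming carry is passed straight on.
            pass
        else:
            # Some adder is not a 9, so an incoming carry cannot get through;
            # the outgoing carry is whatever the block generates by itself.
            _, carry = ripple_block(block, 0)
    return result[::-1], carry


digits, carry_out = add_with_supervisors(to_digits(123456, 6), to_digits(987654, 6))
print(carry_out, digits)
```

The point of the `else` branch is that a block containing any adder whose sum is not 9 produces the same outgoing carry whether or not a carry arrives from the right, which is why the supervisor never needs to look inside the block once it knows whether all the sums are 9.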
If we define the time taken for an adder to pass on a carry to be one unit, then the lookahead carry has meant that the worst-case addition can be completed in 10 time units instead of 16.

The benefit is more pronounced with larger numbers. If we look at our 300-adder problem, and incorporate a team of 75 supervisor circuits each of which watches a block of four adders, the worst-case addition is improved from 300 time units to 81.

We can go further, because we now have a structure of 75 supervisors that have to pass carries between them. Let's add a new level of supervision, 19 super-supervisors watching (mostly) four supervisors each. That reduces the 81 to 31. And those super-supervisors can be watched in their turn, by five higher-level ones, which cuts the amount of delay to 23 time units: not bad when you compare it to the original 300.

There are many choices that can be made about how many adders each supervisor should handle and how many layers of supervision there should be, and the best answer depends strongly on how many adders are involved and the technicalities of exactly how many nanoseconds it takes a signal to pass through a particular circuit. As a rough guide, it should be possible to reduce the delay in a 1024-bit adder (remember, chips think in bits) from 1024 time units to 30 or so. This is good, but it still makes addition a lot slower than it would be if we didn't need to handle carries at all.
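The delay figures above can be reproduced with a simple counting model, sketched here in Python. The model is an assumption rather than something the text states: in the worst case the carry ripples through the four adders of the rightmost block, then through up to three units at each higher level as it climbs, then hops across all but the two end units at the top level, and finally descends the same way on the left. Every hop costs one time unit, and `worst_case_delay` is a hypothetical helper name.

```python
import math


def worst_case_delay(n_adders, levels, group=4):
    """Estimated worst-case carry delay, in units of one adder's carry time.

    levels is the number of layers of supervision (0 means plain ripple);
    group is how many units each supervisor watches.
    """
    if levels == 0:
        return n_adders
    # Number of units at the top level of supervision.
    top = n_adders
    for _ in range(levels):
        top = math.ceil(top / group)
    if top < 2:
        raise ValueError("model assumes at least two units at the top level")
    # Cost of climbing from a single adder up to the top level (and, by
    # symmetry, of descending again at the far end).
    climb = group + (levels - 1) * (group - 1)
    return 2 * climb + (top - 2)


for levels in range(4):
    print(levels, worst_case_delay(300, levels))
```

Under these assumptions the model gives 10 for 16 adders with one level, and 300, 81, 31 and 23 for 300 adders with zero to three levels, matching the figures in the text. Applied to 1024 adders in groups of four, it gives 34 with three levels and 28 with four, which is in line with the "30 or so" quoted above; a real design would depend on actual circuit delays rather than on this unit-cost estimate.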

