fatslob-:O said: These 2 CUs are disabled because of printing mistakes therefore it will never be enabled due to the fact that those units are non functional.
Slight confusion here. When you process wafers into chips, there are two factors to account for:
Every wafer a fab processes has intrinsic defects. Although it is supposed to be a single silicon crystal with a maximum specified impurity level, the reality is more complex. During manufacturing, you have additional problems like laser fluctuations, too much or not enough etching, cosmic rays, and chemicals not behaving like they should. All in all, you have a certain chance that parts of your wafer are bad. Usually this is all summed up in a single number: the defect rate. This number tells you how many defects per 100mm^2 you will encounter on average (of course we disregard bad/faulty design here right from the start).
Let's assume that number is 0.2 for the PS4's APU (the number is totally secret and nobody is ever going to tell you its actual value for any factory, but 0.2 is reasonable without going into details why). The PS4's APU die size is roughly 320mm^2, so per die you expect 0.2 * 320/100 = 0.64 defects on average, which as a rough estimate means a 64% chance that there is a fault in your die. This means that roughly two thirds of your dies will produce nothing and only a third of the chips will work (at best). That is of course unacceptable: if you pay $5000 to process a wafer and you only get 10-15 working chips, you can figure out the problem yourself.
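To make the arithmetic above concrete, here is a minimal sketch of it in Python. The function names are mine, and the Poisson yield formula is an assumption I am adding for comparison: it is a common textbook model for randomly scattered defects, not something taken from any fab's data. Note that it gives a somewhat friendlier number than the rough linear estimate, because a die can collect more than one defect.

```python
import math

def expected_defects(defects_per_100mm2, die_area_mm2):
    """Average number of defects landing on one die."""
    return defects_per_100mm2 * die_area_mm2 / 100.0

def poisson_yield(defects_per_100mm2, die_area_mm2):
    """Fraction of dies with zero defects, ASSUMING defects are
    independent and Poisson distributed (a simplification)."""
    return math.exp(-expected_defects(defects_per_100mm2, die_area_mm2))

if __name__ == "__main__":
    lam = expected_defects(0.2, 320)   # 0.64 defects per die on average
    good = poisson_yield(0.2, 320)     # roughly 0.53 under the Poisson assumption
    print(f"expected defects per die: {lam:.2f}")
    print(f"fault-free dies (Poisson model): {good:.1%}")
```

Either way you look at it, somewhere between half and two thirds of the dies have at least one defect, which is the problem the spares are meant to solve.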
The solution found by engineers is simple (and can be very complex to implement at the same time). Engineers put more stuff into the die than what is really necessary. Then if one thing turns out to be bad, you simply replace it with a surplus thing. You don't put spares of everything into your die, though. In principle you could put more cores, more cache, more CUs, more drivers into your die and replace anything that is bad with a corresponding surplus. In the end you would get 100% of your chips working. However, if you add too much reserve stuff into your die, it gets way too big and way too complex to manage, so while you have 100% yield, you only get half the chips per wafer.
Let's look at the PS4's die and where we could add redundancy. The obvious choice is the 18 CUs in the GPU. This is the largest block in the die, taking up roughly 33% of the entire die, so it is the most likely place a random fault will be located. It is also very easy to add two spare CUs because it is mostly a cut-and-paste operation. With this simple increase, we just saved at least 33% of all bad chips. The next obvious place to add spares is those "chessboard areas" at the Jaguar cores. These are second level caches, and memory is rather easy to add as spare parts. Unfortunately, at this point we are already running out of reasonable places to add spare parts. Adding spare Jaguar cores is not a realistic option, and the memory controllers are rather large, so there is no room for a spare (I have no idea at all how redundant GDDR5 controllers can be designed). There may be individual cache areas in various parts that have "spares" built in. All in all, probably 60% of the die area is "saved by spares".
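The effect of the two spare CUs can be sketched with a toy model. Everything here is an assumption for illustration: defects are Poisson distributed, the CU block is 33% of the die split evenly across 20 physical CUs, a defect inside a CU kills only that CU, and any defect outside the CU block kills the chip (so the cache spares mentioned above are ignored). The function name and parameters are hypothetical.

```python
import math

def yield_with_spare_cus(defects_per_die, cu_fraction, total_cus, required_cus):
    """Toy yield model: the chip works if the non-CU area has no defect
    and at least `required_cus` of the `total_cus` CUs are defect free.
    ASSUMES independent Poisson defects and equally sized CUs."""
    # Probability that a single CU has no defect.
    p_cu_good = math.exp(-defects_per_die * cu_fraction / total_cus)
    # Probability that the rest of the die (no spares there) is clean.
    p_rest_good = math.exp(-defects_per_die * (1.0 - cu_fraction))
    max_bad = total_cus - required_cus
    # Binomial sum over 0..max_bad defective CUs.
    p_enough_cus = sum(
        math.comb(total_cus, k)
        * (1.0 - p_cu_good) ** k
        * p_cu_good ** (total_cus - k)
        for k in range(max_bad + 1)
    )
    return p_rest_good * p_enough_cus

if __name__ == "__main__":
    lam = 0.64  # average defects per die, from the 0.2 per 100mm^2 guess
    no_spares = yield_with_spare_cus(lam, 0.33, 20, 20)  # all 20 must work
    two_spares = yield_with_spare_cus(lam, 0.33, 20, 18)  # 18 of 20 suffice
    print(f"all 20 CUs required: {no_spares:.1%}")
    print(f"18 of 20 CUs required: {two_spares:.1%}")
```

In this model, allowing two bad CUs makes the CU block almost irrelevant for yield, so nearly all remaining losses come from the parts of the die that have no spares. That is why it pays to put the redundancy into the largest, most repetitive block first.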
One point should be made clear: if you have bad parts or unused reserve parts in your chip, you must make sure that everything that is bad or unused is electronically disabled. Any transistor in a chip that is "free to do whatever it wants" will kill the chip sooner rather than later. Hence if the PS4 APU promises 18 CUs, the surplus 2 CUs (whether they are working or are the defective units that 1 or 2 spares stood in for) MUST be disabled at the end of the manufacturing line. How that is done (permanently or unlockable) is up to the designer.