nin10do said:
Alright, let's do this.
We'll start with the Megahertz Myth:
The Megahertz Myth, explained by someone who has taken a class on processor architecture and design.
- Clock rate
Inside a CPU, there are multiple units that work independently. The unit that takes the longest to systematically alter the input bits and provide some form of output is what determines the clock speed. A unit can also be broken down so that it performs part of its function, then waits for the next clock tick to finish the rest, which shortens these critical lengths. The number of times per second that slowest circuit can complete its work is our maximum clock speed. It is limited by the speed of light and thermodynamics. Barring thermodynamics, if the CPU is clocked too fast, these functions do not complete properly and produce incorrect results at the hardware level.
Put simply, a single instruction, such as "ADD registers A and B, result in A," takes multiple clock cycles to complete.
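Here's a rough back-of-the-envelope sketch of that idea in Python (the unit delays are made up purely for illustration): the slowest unit sets the longest path a signal has to finish within one tick, and that path caps the clock.

```python
# Toy numbers: the slowest unit (the critical path) caps the clock.
stage_delays_ns = {
    "register_read":  150,
    "alu_add":        400,   # slowest unit, so this is the critical path
    "register_write": 120,
}

critical_path_ns = max(stage_delays_ns.values())
max_clock_hz = 1e9 / critical_path_ns   # a cycle must be at least as long as the slowest unit

print(f"Critical path: {critical_path_ns} ns")
print(f"Maximum clock: {max_clock_hz / 1e6:.1f} MHz")
# Clock it faster than this and the ALU's result isn't stable when it gets latched,
# which is exactly the "functions do not complete properly" failure described above.
```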
- Stages
The CPU is designed to process instruction code in four stages: Fetch, Decode, Execute, Store. Using the prior example, the instruction code is "ADD registers A and B store in A".
The Fetch stage loads this instruction into the decode stage, and then increments the "next instruction" register.
The Decode stage takes the instruction and lines up registers A and B to the arithmetic unit, with control lines set to tell it to ADD and store the result in A.
The Execute stage pulls the data from registers A and B and performs the add, leaving the result available for the next stage.
The Store stage takes the result and sends it to register A.
Once each stage completes, the clock has to cycle before processing the next stage. Therefore it takes multiple clock cycles to complete an instruction.
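To make the four stages concrete, here is a minimal toy model in Python. The instruction format, register names, and values are invented for illustration; a real CPU does this in hardware, one stage per clock cycle.

```python
# Minimal single-instruction walk-through of Fetch/Decode/Execute/Store.
registers = {"A": 5, "B": 7, "PC": 0}
program = [("ADD", "A", "B", "A")]           # ADD A, B -> A

def fetch():
    insn = program[registers["PC"]]          # load the instruction
    registers["PC"] += 1                     # bump the "next instruction" register
    return insn

def decode(insn):
    op, src1, src2, dest = insn              # line up operands and destination
    return op, src1, src2, dest

def execute(op, src1, src2):
    if op == "ADD":
        return registers[src1] + registers[src2]
    raise NotImplementedError(op)

def store(dest, result):
    registers[dest] = result                 # write the result back

# One instruction = one trip through all four stages = four clock cycles here.
op, s1, s2, d = decode(fetch())
store(d, execute(op, s1, s2))
print(registers)                             # {'A': 12, 'B': 7, 'PC': 1}
```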
- Pipeline architecture
Before pipeline architecture, one instruction took anywhere from 3 cycles on up to complete. I say 3 because a few CPUs are designed with Fetch and Decode combined. For a 4-stage CPU running at 1 MHz, that means a maximum of 250,000 instructions completed per second, with the exception of the NOP code, which takes only 1 clock cycle and does nothing. NOP is intended for time-sensitive synchronization.
With pipeline architecture, there is a buffer between each stage. This allows each stage to work on an instruction independently of the other instructions. So in a 4-stage pipeline, when instruction 001 is in stage 4, 002 is in stage 3, 003 is in stage 2, and 004 has just been loaded into stage 1. On the next clock cycle, 001 is done, 002 is in stage 4, 003 is in stage 3, 004 is in stage 2, and 005 is loaded into stage 1. The result is one instruction completed per cycle.
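A quick sketch of the cycle math, using the same idealized 4-stage machine (no stalls, no branches, one instruction entering the pipe per cycle):

```python
STAGES = 4

def cycles_unpipelined(n_instructions):
    return n_instructions * STAGES           # each instruction occupies the whole CPU

def cycles_pipelined(n_instructions):
    return STAGES + (n_instructions - 1)     # fill the pipe once, then one completes per cycle

n = 1_000_000
print(cycles_unpipelined(n))   # 4,000,000 cycles -> 250,000 instructions/s at 1 MHz
print(cycles_pipelined(n))     # 1,000,003 cycles -> ~1,000,000 instructions/s at 1 MHz
```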
- Advanced Pipelines
Remember what I said about clock speed being limited by the speed of light? These circuits take a certain amount of time for an electrical signal to travel through. To increase clock speed, you can either shrink the transistors, reducing the length, or you can shorten the circuits. One method of shortening circuits that pipelining provides is splitting the work into more stages. So instead of a 4-stage pipeline, I can break it down into 8 stages and increase the clock speed by 100%. Now we have a 4-stage 1 MHz CPU example and an 8-stage 2 MHz CPU example. Without any bumps, the 2 MHz CPU has a maximum of 2 million instructions completed per second.
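The same math as a tiny script, with made-up stage delays (1 microsecond per stage for the 4-stage design, half that when each stage is split in two):

```python
stage_delay_s_4deep = 1e-6                   # illustrative, not real silicon figures
stage_delay_s_8deep = stage_delay_s_4deep / 2

clock_4 = 1 / stage_delay_s_4deep            # 1 MHz
clock_8 = 1 / stage_delay_s_8deep            # 2 MHz

print(f"4-stage: {clock_4/1e6:.0f} MHz -> up to {clock_4:,.0f} instructions/s when pipelined")
print(f"8-stage: {clock_8/1e6:.0f} MHz -> up to {clock_8:,.0f} instructions/s when pipelined")
```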
- Time for the Bad
This sounds great, which explains the massive boost in MHz during the age of the Pentium 4. The Pentium 4 had roughly a 31-stage pipeline at its deepest. Now, what can go wrong?
- In-Order Execution
Before pipelining, one instruction completed at a time, and some instructions required additional cycles for complex functions. With pipelining, suppose one instruction is using the ALU to calculate 5 raised to the 23rd power. That instruction will take several cycles to complete in the Execute stage, and every other instruction in the pipeline is stuck waiting on the ALU.
The answer? Out-of-order execution. When entering the Execute stage, an instruction doing integer math uses the integer ALU. A later floating-point instruction that needs the integer result as one of its inputs has to wait, but another floating-point instruction that doesn't can see the floating-point ALU is free and use it in the meantime. Out-of-order execution allows more instructions to be in flight at a time by using more units simultaneously. This addresses what is mostly a minor problem, and that problem is practically solved.
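Here's a very simplified sketch of that idea: two execution units, and any waiting instruction whose inputs are ready and whose unit is free may issue, regardless of program order. The latencies and the three-instruction program are invented for illustration; real out-of-order hardware is far more involved.

```python
program = [
    # (name, unit, latency, reads, writes)
    ("I1 int multiply", "int", 4, [],     ["r1"]),
    ("I2 fp add",       "fp",  2, ["r1"], ["f1"]),   # depends on I1's result
    ("I3 fp multiply",  "fp",  2, [],     ["f2"]),   # independent of everything
]

# Registers written by the program are "not ready" until their producer finishes.
ready_at = {r: float("inf") for _, _, _, _, writes in program for r in writes}
unit_free_at = {"int": 0, "fp": 0}
pending = list(program)
cycle = 0

while pending:
    issued = []
    for insn in pending:
        name, unit, latency, reads, writes = insn
        inputs_ready = all(ready_at.get(r, 0) <= cycle for r in reads)
        if inputs_ready and unit_free_at[unit] <= cycle:
            unit_free_at[unit] = cycle + latency
            for r in writes:
                ready_at[r] = cycle + latency
            print(f"cycle {cycle}: issue {name} (done at cycle {cycle + latency})")
            issued.append(insn)
    for insn in issued:
        pending.remove(insn)
    cycle += 1

# Output shows I3 issuing at cycle 0 alongside I1 while I2 waits for r1;
# the machine keeps both units busy instead of stalling behind the dependency.
```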
- Conditional Branching
There are these familiar constructs we use in programming: "IF/THEN" statements, "FOR/NEXT" loops, "DO/WHILE" loops, "WHILE/WEND" loops, "SWITCH CASE" statements, and so on. These constructs evaluate a condition and select one of two possible paths to take. If you break down the loops and switch-case statements, everything reduces to an "If... Then..." plus a Go To statement.
This translates to conditional branch instructions in the CPU. But first, what is a branch instruction? The CPU has a "GO TO" instruction that sets the "next instruction" register to the next line of code to execute. Looking at the 8-stage 2 MHz CPU example, a Go To takes 8 cycles to work its way through the pipeline, with the instruction following the Go To in the stage behind it, and so on. Those are the wrong instructions to run once the Go To completes, so they get thrown away. The result: 7 clock cycles are lost, versus 3 clock cycles for the 4-stage 1 MHz CPU. The 1 MHz CPU lost 3 microseconds, while the 2 MHz CPU lost 3.5 microseconds.
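The penalty math above, reproduced as a small script:

```python
def flush_penalty(stages, clock_hz):
    wasted_cycles = stages - 1            # wrong-path instructions fetched behind the branch
    return wasted_cycles, wasted_cycles / clock_hz

print(flush_penalty(4, 1e6))    # (3, 3e-06): 3 cycles, 3.0 microseconds
print(flush_penalty(8, 2e6))    # (7, 3.5e-06): 7 cycles, 3.5 microseconds
```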
Here is the MHz myth. GO TO statements are the most important instructions in the CPU; they let us direct control flow and program things like AI. This one was easily solved, though: the Go To is detected and resolved in the Fetch stage. No loss.
Now, if that is solved, why is there a MHz Myth? Only unconditional branching can be handled in the Fetch stage; conditional branching can't. We do have branch prediction, though: when the conditional branch enters the Fetch stage, the CPU chooses which path to take, somehow. The conditional branch then has to go through all the stages, and if the prediction was wrong, the pipeline is flushed and the correct next instruction is loaded into the Fetch stage on the following cycle.
Here comes the CPU comparison: the 4-stage Wii U CPU running at 1.2 GHz versus the 14-stage PS4/XB1 CPU at 1.6 GHz (somewhere around there). One for one, the 1.6 GHz chip should perform 400 million more instructions per second. Due to the Wii U's RISC nature, the condition will have been calculated by a prior instruction, leaving a true/false result ready in a condition register for the conditional branch. If the conditional branch comes right after the compare, the instructions after it are predicted. If the prediction is bad, that is 3 clock cycles wasted (2.5 nanoseconds). Same thing for the PS4/XB1 CPU: 13 cycles wasted (8.125 nanoseconds). Based on the nanoseconds lost, the Wii U recovers very quickly from bad branches.
Here is another trick the Wii U has up its sleeve. If the "If/Then" statement has a few instructions before it that don't affect the condition (meaning the compare result is already sitting in the condition register while a few unrelated instructions run first), then the conditional branch can be resolved in the Fetch stage like a regular branch, resulting in no wasted clock cycles. This means it is easy to write code for the Wii U that doesn't cost clock cycles.
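A toy model of that scheduling trick (the 3-cycles-to-resolve figure is this post's working assumption, not a published spec):

```python
def wasted_cycles(independent_insns_between, cycles_to_resolve=3):
    # If the condition register was written early enough, the branch costs nothing.
    return max(0, cycles_to_resolve - independent_insns_between)

for gap in range(5):
    print(f"{gap} independent instruction(s) between compare and branch "
          f"-> {wasted_cycles(gap)} wasted cycle(s)")
```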
A conditional branch may only lose a few clock cycles, so why does it matter? If/Then statements happen constantly. For pure number crunching, like on a GPU, they occur so infrequently it wouldn't matter. For a CPU, where they are the bread and butter of running the system, imagine the percentage of instructions run per second that are conditional branches.
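To put a rough number on it, here's a sketch that folds a branch misprediction penalty into average instructions per second. The 20% branch fraction and 10% misprediction rate are illustrative assumptions, not measurements; the 3-cycle vs 13-cycle penalties are the figures used above.

```python
def effective_ips(clock_hz, mispredict_penalty_cycles,
                  branch_fraction=0.20, mispredict_rate=0.10):
    # Average cycles per instruction: 1, plus the amortized cost of flushed pipelines.
    cpi = 1 + branch_fraction * mispredict_rate * mispredict_penalty_cycles
    return clock_hz / cpi

short_pipe = effective_ips(1.2e9, mispredict_penalty_cycles=3)
long_pipe = effective_ips(1.6e9, mispredict_penalty_cycles=13)

print(f"Short pipeline @ 1.2 GHz: {short_pipe/1e9:.2f} billion instructions/s effective")
print(f"14-stage pipe  @ 1.6 GHz: {long_pipe/1e9:.2f} billion instructions/s effective")
# With these assumed rates the raw 400 MHz clock advantage shrinks noticeably;
# change branch_fraction or mispredict_rate to see how sensitive the gap is.
```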
This post will be focused mainly on the CPU.
The primary weakness in the PS4 and XB1 is the use of x86 instruction code. AMD had the option to make a version of these same chips with a hardware-optimal instruction set, which would run cooler, cost less, and be far more efficient.
Half of the stages in the Jaguar CPU's 14-stage pipeline are fetch and decode. If the instruction set went to a fixed width, which is easier to decode, this would result in a 9-stage pipeline: more efficient, higher clock speed, less electrical power, and so on.
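A miniature illustration of why fixed-width encodings decode more easily: with fixed 4-byte instructions you can slice the byte stream without inspecting it, while variable-length encodings (like x86) force you to decode each instruction's length before you know where the next one begins. The byte values and the fake length rule below are invented.

```python
stream = bytes(range(24))

# Fixed width: instruction boundaries are known up front and could be found in parallel.
fixed = [stream[i:i + 4] for i in range(0, len(stream), 4)]

# Variable length: each instruction's length depends on its own bytes, so the walk is sequential.
def fake_length(first_byte):
    return 1 + (first_byte % 4)        # stand-in for real length-decoding logic

variable, i = [], 0
while i < len(stream):
    n = fake_length(stream[i])
    variable.append(stream[i:i + n])
    i += n

print(len(fixed), "fixed-width slices;", len(variable), "variable-length slices")
```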
If people stopped fixating on "8 cores at 1.5 GHz," they would notice these chips are not that powerful. They were built to target netbook and tablet PCs. Think of an 8-core CPU in your cellphone.
The PowerPC design can make a CPU powerful while using less power, producing less heat, and spending fewer transistors.
In general, PC CPUs need to evolve. They have hit the limits of this world and they need to evolve, but no one wants to invest in the costs.
Cores Cores Cores. PS4 uses 4 for gaming: http://www.mcvuk.com/news/read/ps4-and-xbox-one-s-amd-jaguar-cpu-examined/0116297
Meanwhile, it's reported that the Wii U has a second, 2-core ARM CPU dedicated to the OS. This could explain some older issues like the OS being slow to load. That said, interrupt handling may be on the ARM CPU, which would mean the Espresso CPU is dedicated solely to gaming.
Now we can't just ignore the GPU, but I'm skimming through this bit for a reason:
The reason I stay out of the GPU side of the performance specs is the old saying, "There is more than one way to skin a cat." The 1.8 trillion floating-point operations per second figure for the PS4 GPU is how many floating-point operations could be performed per second if every floating-point unit were doing one every clock cycle. That does not mean it reliably does 1.8 TFLOPS.
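For reference, here's how that kind of peak number is usually derived: shader count times two floating-point operations per cycle (a fused multiply-add counts as two) times clock speed. The 1152 shaders at 800 MHz used below are the commonly reported PS4 GPU figures, treated here as illustrative inputs rather than measurements.

```python
def peak_tflops(shader_count, clock_ghz, flops_per_shader_per_cycle=2):
    # Assumes every shader retires a fused multiply-add every single cycle.
    return shader_count * flops_per_shader_per_cycle * clock_ghz / 1000

print(f"{peak_tflops(1152, 0.8):.2f} TFLOPS")   # ~1.84: every FP unit busy every cycle
# Real workloads never keep every unit busy every cycle, which is the point above.
```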
Also, there is this thing called a patent. Many of the techniques and formulas used in GPUs are under patent, so competitors have to find their own formulas. The GPU takes in complex scene data and draws objects onto a 2D surface, which is largely matrix math. The methods between the two GPUs can be different.
The next thing: Nintendo is known to release achievable specs to the media instead of system max specs. Max specs are under NDA. That 350 GFLOPS may not be the real performance capability but just the GPGPU capability, for all we know.
GPUs are special purpose chips, unlike CPUs.
In the RAM department, I'm sure you've read what Shin'en had to say on the eDRAM:
“The Wii U eDRAM has a similar function as the eDRAM in the XBOX360. You put your GPU buffers there for fast access. On Wii U it is just much more available than on XBOX360, which means you can render faster because all of your buffers can reside in this very fast RAM. On Wii U the eDRAM is available to the GPU and CPU. So you can also use it very efficiently to speed up your application.
The 1GB application RAM is used for all the games resources. Audio, textures, geometry, etc.
Theoretical RAM bandwidth in a system doesn’t tell you too much because GPU caching will hide a lot of this latency. Bandwidth is mostly an issue for the GPU if you make scattered reads around the memory. This is never a good idea for good performance.”
The Wii U uses less RAM and loses fewer cycles than the XB1 and PS4. Smaller is better, as clock rate is governed by the speed of light.
x86 is very inefficient today. It was efficient back when its lineage started on cheap 4-bit and 8-bit systems. There's a reason NASA equipment is PowerPC and not x86.
...
So that's all nice and fancy, but what does that mean? I'll lay it down as simply as I can:
The Wii U's CPU is more efficient and can do more with its 3 cores for gaming than the PS4 does with the 4 it uses for games. There is practically no resource management on the PS4, while the Wii U makes good use of the resources available. CPU plus eDRAM will result in much better performance than the PS4, something that is already being seen in the performance of Nintendo's upcoming games.
Now, the biggest issue here is the GPU and the lack of information on it. As stated above, with it being under heavy NDA, for all we know the numbers we have could be minimum specs. Another issue is that Sony's numbers don't match up with the observed performance, which, judging by their history of exaggerating their tech, shouldn't be a surprise to anyone. We already saw this with Deep Down earlier this year. Here are some real-world results: http://www.eurogamer.net/articles/digitalfoundry-hands-on-with-playstation-4 They don't match the proposed specs, and we know there isn't much leeway for the PS4/XBO; they won't change much over the years, being built from practically off-the-shelf parts. The PS4 is locked in; the only thing that can happen is firmware/BIOS updates that increase the clock rate, which can cause several issues. We know the bare minimum about the Wii U's specs, yet we already have numbers beyond what the PS4 is known to do.
Forza is hitting the walls, as are Knack and Killer Instinct. We are a long way from even discovering the Wii U's limits, much less hitting them. It's easy to discount this because of the bigger numbers, but bigger numbers don't always mean better. These boxes are designed like PCs; you may have noticed that it takes a minimum of 2-4 GB of RAM and a much better CPU and GPU than the 512 MB consoles have to even play those games on PC, and the difference is already negligible in most cases. The PS4/XBO should be seen in the same light: inefficient, with most of the resources split away from gaming. The Wii U is more capable than it appears at first glance, and what we've seen is the bare minimum. Since Nintendo is remaining quiet, and the hardware is custom and under NDA, it's potentially more capable than we can anticipate, but for now let's just use what we know.
I'll quickly knock out a retort that is sure to come: "The Wii U is performing better because it's using PS2 graphics." A ridiculous notion, of course, but one that should be mentioned. To that I will use an old saying: the proof is in the pudding. Knack, Killer Instinct, and Bayonetta. They are proof that the Wii U is capable. Bayonetta runs at 1080p 60 FPS and looks incredibly good, while Knack and Killer Instinct can't hit that point and are far from what people consider "graphically impressive." If I recall correctly, every exclusive so far has run at 60 FPS, and any third-party games that weren't gimped or quick-ported (Batman, Mass Effect, Splinter Cell, Assassin's Creed) all run at a stable 30 FPS at a minimum of 720p native. Not bad for a console's first year.
Now it's just a matter of time. Bayonetta hits in 2014, I predict Q1, and the power argument should end then. Not that horsepower makes much of a difference today, although I will say Nintendo's art style in HD makes games look stellar. SM3DW looks beautiful in HD with the amazing lighting and particle effects, and it's actually surprising to me that it runs so well. The Cherry power-up reportedly allows an unlimited number of characters at once (IGN had 6 Luigis on screen at one time), and multiplayer has 2-3 characters each, potentially more. It may not be obvious because the game isn't realistic, but that takes quite a toll on the hardware: imagine 10-12 characters at once, with those particle effects, all throwing fireballs at the same time, killing enemies and collecting coins that bring out even more effects. I suppose that makes me a bit of a hypocrite :P
In all seriousness, this gen isn't about the power anymore; it's about the experiences, at the right value. That's the biggest problem with x86 and the PS4. It may be easier for developers, not having to program for various architectures, and porting may be a breeze, but in the end the consumer is the one who pays for it, through more expensive hardware with less 'power' than it should have at that price. The Wii U is potentially much more capable than the PS4, with hardware costing approximately 200 dollars or so, while the PS4 is being sold at a loss for 400 dollars. The way I see it, the PS4 is a shortcut into the future: getting 2014-15 graphics now, but with worse performance and being stuck there for the next 10 years.
All that's left now is to wait and see how the next 12 months go.