
Forums - PC Discussion - So what comes after parallel processing?

MisterBlonde said:
@alephnull

The function parameters are accessed via offsets from ebp. There aren't enough registers to store all the parameters for a particular call stack, so they are pushed onto and popped off the stack. All compilers do is compile C code into assembly instructions that are executed by the CPU. At run time the compiler is nonexistent (unless you are talking about a JIT compiler); the assembly code does the pushing and popping.

EIP points to the next instruction to be executed. When a function makes a call, it pushes its return address onto the stack. There is no way during a synchronous call for the calling function to be running until the called function returns. That is what CrashMan was saying and that is what I was agreeing with. When the called function returns, the return address is popped back into EIP.

Google "function prologue and epilogue" for more details.

 

If you don't think the compiler is involved, what do you think happens when you add the keyword inline before a function definition? Would you agree with me that the compiler does not put an inline function's parameters on the call stack?

Try doing a backtrace on a piece of code compiled with -O3 in gcc. I think you will find that it inlines quite a bit. Particularly with the CBE and Xenon you have over a hundred registers per core, and if you don't think copious inlining is going on then... well, I dunno.

 




@MisterBlonde
http://en.wikipedia.org/wiki/Calling_convention

I suggest you read the section on PowerPC, which is probably the most applicable to the consoles.



alephnull said:
jlauro said:
MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time the thread was interrupted), and then load the context of the thread that is getting scheduled into the registers.

Certainly all that could be optimized further in the future, but it will always be a lot more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex it becomes and the more often it happens.

That is only an issue if the scheduling is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that than to add two 64-bit numbers together in hardware, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register on how many threads are allowed to be created automatically before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open thread slots are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.

 

I don't see how your hardware scheduler would help reduce the overhead of context switches. And I think you are overestimating the power of hardware schedulers and underestimating the problem.

Calculating the optimal schedule on an SMP system is equivalent to bin packing, which is NP-hard.

Once upon a time my area of research in CS was hybrid multiprocessor scheduling (EDF, RM, and fixed-priority) for realtime schedulers. While the scheduling algorithms for normal systems are somewhat different, I do know something about it.

 

The problem isn't NP-hard if the number of threads supported by the hardware exceeds the number of active threads. If you have 1000 cores on the chip, just fall back to software when you run out. When that happens, the OS can swap out 20% of the threads and start worrying about scheduling. The idea is that 99% of the time you will have an empty thread slot available.

It's also easy to do round-robin scheduling if you have thread support in the processor. Sure, fine-grained prioritizing in hardware may not be as simple, but a simple priority queue isn't NP-hard. It's the same as how QoS works in a layer 2 or layer 3 switch.

The main point is, if you have enough cores, who cares about optimal scheduling? Just run everything, and apply a QoS level if you need to. The OS threads will always have a high enough QoS to keep running, so they can reschedule things if the rest goes crazy.

 

The reason there is no overhead for a context switch is that you keep the memory for each thread on the chip. That is why you need a fair amount of memory per thread on the chip, instead of putting cache memory there. This is also why you need a crossbar between the cores and the thread states. There is no cost for context switching because the context stays on the other side of the crossbar. The only time there is a cost is when a thread has to be swapped out to external RAM and a new thread swapped in. All the registers are essentially part of the memory area for that thread.

 

This idea will not work with only a small number of CPUs, and it also will not work well with multiple physical CPUs, because it assumes uniform memory access (not easily done with a large number of cores without a crossbar switch); you were probably working with a non-uniform memory architecture, or else a small number of processors. Unless you put everything on one chip, including enough memory (4 KB minimum per thread, 64+ KB better) to handle a large number of threads, and enough cores, you will just run out of memory bandwidth...

Anyway, it will probably be another 10-20 years before this design is really feasible on a single chip.

 



@alephnull

This is the PC discussion forum so I was thinking more like x86. Perhaps this is why we are disagreeing?

From your link:
"Due to the small number of architectural registers, the x86 calling conventions mostly pass arguments on the stack"

As far as inline functions go, I always thought those were compiler hints, so there is no guarantee that the compiler will actually inline them. I also think of an inlined function as becoming part of the function it was inlined within; no different from any other lines of code in the calling function. Since no call instruction is executed, there would be no reason to push params onto the stack.

Let me know if any of my assumptions are wrong for x86.

Whether or not a function is inlined, it still doesn't change the fact that the called function's code (inlined or not) has to execute before the calling function continues execution, assuming this is a synchronous function call.




jlauro said:


@NJ5: Inlining them would help, but you would still need multiple cores for them to run in parallel with the called function. However, you could probably get by with fewer cores than having the hardware do it... That said, inlining would not give you the pipelining speedups of streaming data through multiple functions. A lot of the time you are just moving data and making minor alterations to it as it goes from one function to another, often feeding the output of one function into the next. If you let the next function work on the outgoing data of the previous function as it is being generated, you can process through both functions in just a little more time than it would take the first function alone.
Consider:
char *c;
makeupper( c );
removewhitespace( c );
tokenize( c );

and let's say c points at a 2k stream of data.
Normally, each of those functions has to run through all 2k of data. If nothing else, it's going to take time to read through that.

Now, obviously we have language issues in c, but...
pipeline *c
tokenize( removewhitespace( makeupper( c ) ) )

So, in the standard example it's going to take at least 12k cycles, being nice and saying it's only 1 cycle to read and 1 cycle to process each character.

However, with it pipelined, it will not take much more than 4k cycles to process through all the functions, and the primary core could actually start doing other stuff after only a relatively few cycles. Now, it does use 4 cores, but no amount of inlining will give you that speedup. Being generous, you could rewrite it as one function, token-n-removewhite-n-makeupper, and give it 1 cycle for the read and a cycle for each operation, or 8k cycles to process the full 2k. Less modular code, and it still takes twice as long, but it is a compromise.

I was assuming the parameters were just numbers. If they're bigger data structures which can be processed in a pipelined fashion we can run the functions in parallel, but I assume that's not what CrashMan had in mind with his example.

"you would still need multiple cores for them to run in parallel with the called function"

Not necessarily, not if the CPU allows for out-of-order execution in parallel. In the end it depends on what the function is doing. Having multiple cores may help a lot if there are lots of parallelizable statements in the two functions, but then current architectures can perform fine by using normal multi-threading together with either a refactoring of the function or high-level optimizations in the compiler.

I am not saying your idea for an architecture is worthless; it just seems to me that the improvements you describe can already be realized today with smart programming, compilers, and/or existing CPU features. In your example, software pipelining can be used, and some CPUs would parallelize the operations to potentially 4 operations per cycle. Perhaps compilers don't do this today, but it may be simpler to implement in the compiler than to build a whole new architecture for it.

 




alephnull said:
MisterBlonde said:
@alephnull

The function parameters are accessed via offsets from ebp. There aren't enough registers to store all the parameters for a particular call stack, so they are pushed onto and popped off the stack. All compilers do is compile C code into assembly instructions that are executed by the CPU. At run time the compiler is nonexistent (unless you are talking about a JIT compiler); the assembly code does the pushing and popping.

EIP points to the next instruction to be executed. When a function makes a call, it pushes its return address onto the stack. There is no way during a synchronous call for the calling function to be running until the called function returns. That is what CrashMan was saying and that is what I was agreeing with. When the called function returns, the return address is popped back into EIP.

Google "function prologue and epilogue" for more details.

 

If you don't think the compiler is involved, what do you think happens when you add the keyword inline before a function definition? Would you agree with me that the compiler does not put an inline function's parameters on the call stack?

Try doing a backtrace on a piece of code compiled with -O3 in gcc. I think you will find that it inlines quite a bit. Particularly with the CBE and Xenon you have over a hundred registers per core, and if you don't think copious inlining is going on then... well, I dunno.

 

inline functions replace the function call with the actual code of the function, so you are right there: there is no pushing and popping off the stack. There may, however, be memory reads/writes, since you may have more params than available registers.

However, inline functions have to be used carefully. They can increase code size dramatically. You could eliminate all stack pushes/pops by making your code all one function, but that is horrible design, and that is basically what inlining all your functions does.

inline eliminates one of the benefits of functions by not reusing code. If you have:

inline int functiona(int a, int b, int c)
{
//Code Here
}

and this function is called from 100 different call sites, the "Code Here" section is compiled into assembly 100 times. Ouch.



CrashMan said:
alephnull said:
MisterBlonde said:
@alephnull

The function parameters are accessed via offsets from ebp. There aren't enough registers to store all the parameters for a particular call stack, so they are pushed onto and popped off the stack. All compilers do is compile C code into assembly instructions that are executed by the CPU. At run time the compiler is nonexistent (unless you are talking about a JIT compiler); the assembly code does the pushing and popping.

EIP points to the next instruction to be executed. When a function makes a call, it pushes its return address onto the stack. There is no way during a synchronous call for the calling function to be running until the called function returns. That is what CrashMan was saying and that is what I was agreeing with. When the called function returns, the return address is popped back into EIP.

Google "function prologue and epilogue" for more details.

 

If you don't think the compiler is involved, what do you think happens when you add the keyword inline before a function definition? Would you agree with me that the compiler does not put an inline function's parameters on the call stack?

Try doing a backtrace on a piece of code compiled with -O3 in gcc. I think you will find that it inlines quite a bit. Particularly with the CBE and Xenon you have over a hundred registers per core, and if you don't think copious inlining is going on then... well, I dunno.

 

inline functions replace the function call with the actual code of the function, so you are right there: there is no pushing and popping off the stack. There may, however, be memory reads/writes, since you may have more params than available registers.

However, inline functions have to be used carefully. They can increase code size dramatically. You could eliminate all stack pushes/pops by making your code all one function, but that is horrible design, and that is basically what inlining all your functions does.

inline eliminates one of the benefits of functions by not reusing code. If you have:

inline int functiona(int a, int b, int c)
{
//Code Here
}

and this function is called from 100 different call sites, the "Code Here" section is compiled into assembly 100 times. Ouch.

100 times is fine for simple functions like

int min(int a, int b)
{
    return a < b ? a : b;
}

 

where the function itself is as small, or at least nearly as small, as the code for calling an external function...

 

 

 



MisterBlonde said:
@alephnull

This is the PC discussion forum so I was thinking more like x86. Perhaps this is why we are disagreeing?

From your link:
"Due to the small number of architectural registers, the x86 calling conventions mostly pass arguments on the stack"

As far as inline functions go, I always thought those were compiler hints, so there is no guarantee that the compiler will actually inline them. I also think of an inlined function as becoming part of the function it was inlined within; no different from any other lines of code in the calling function. Since no call instruction is executed, there would be no reason to push params onto the stack.

Let me know if any of my assumptions are wrong for x86.

Whether or not a function is inlined, it still doesn't change the fact that the called function's code (inlined or not) has to execute before the calling function continues execution, assuming this is a synchronous function call.


 

The inline keyword is indeed a compiler hint which the compiler is free to ignore. When a function is "inlined," it is equivalent to putting everything inside the callee in {} braces at the point where that function was called. If you do that yourself, it is no longer just a "hint."

Even if you don't use the inline keyword, you will find that gcc on x86 will "inline" functions you didn't ask it to if you compile your program with -O3. Furthermore, I suspect you are assuming that the generated x86 ISA is the final word on what the processor actually does. The problem with this is that the x86 instructions you are looking at actually get translated into microarchitectural operations. For example, the Pentium 4 actually has more physical registers than you can explicitly reference through 32-bit instructions and uses register renaming to compensate for the limitations of the x86 ISA. Just because you see push and pop instructions in your out.s doesn't mean it actually happens like that. It's complicated and irritating; this is why many realtime people stick with 386s, because there is all this stuff that can happen behind your back.

Note that gcc doesn't use nearly as many fancy optimizations as icc or suncc (which, amusingly enough, seems to beat icc at matrix multiplies for me).

Also note that I am not one of those people who believe the compiler is better at optimizing than a human who knows what they are doing. Only compiler researchers and Java/.NET programmers believe this :P

I guess it depends on what you define "PC" as. I consider my dual-processor POWER machine here to be a PC, but you don't need that. Is an Itanium workstation a PC? What about my old Transmeta laptop? Or amd64 running 64-bit Windows or Linux? All of these have large numbers of registers.



"Also note that I am not one of those people who believe the compiler is better at optimizing than a human who knows what they are doing is. Only compiler researchers and Java/.Net programmers believe this :P"

Indeed. Sometimes even a simple manual translation of C code into assembly, with little effort spent optimizing, yields better results.




There are only so many concurrent threads you can have running, particularly in a video game


Simply not true. Video games are perfectly suited for almost unlimited numbers of concurrent threads. This is why graphics cards exist, which are not much more than specialized vector processors with dozens of pixel/vertex shader units, texture units, etc.

Now if you remove graphics it gets a bit harder, but sound processing, physics, and AI can all be executed in parallel. It is not done most of the time because the processing power needed for these areas is peanuts compared to the power required for realistic graphics.

Take AI: Starcraft et al. had very good AI for hundreds of onscreen units a decade ago. It's a bit more complicated in 3D environments, but still. The thing is, better AI etc. is seldom limited by processing power; it's limited by programmers.