
PC Discussion - So what comes after parallel processing?

MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time each thread was interrupted), and then load the context of the thread being scheduled into the registers.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex the job becomes and the more often it happens.

That is only an issue if the scheduling is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that in hardware than to add two 64-bit numbers together, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register as to how many threads are allowed to be created automatically before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open threads are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.
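
To make the "look at all threads once per clock" claim concrete, here is a hedged software model of what the hardware would do. In real hardware this reduction would be a combinational tournament tree: with N thread slots the tree is only log2(N) comparators deep, which is in the same ballpark as a 64-bit adder's carry logic. The slot layout, names, and priority encoding below are invented for illustration, not taken from any actual design.

/* Software model of a single-cycle hardware thread picker. */
#include <stdio.h>

#define NTHREADS 1024

struct slot {
    int runnable;   /* 1 if this hardware thread slot wants a core */
    int priority;   /* bigger = more urgent */
};

/* Reduce slots [lo, lo+n) pairwise, mirroring the log-depth tree.
 * Returns the index of the best runnable slot, or -1 if none. */
static int pick(const struct slot *s, int lo, int n)
{
    if (n == 1)
        return s[lo].runnable ? lo : -1;
    int a = pick(s, lo, n / 2);
    int b = pick(s, lo + n / 2, n - n / 2);
    if (a < 0) return b;
    if (b < 0) return a;
    return s[a].priority >= s[b].priority ? a : b;
}

int main(void)
{
    struct slot s[NTHREADS] = {{0, 0}};
    s[37]  = (struct slot){1, 5};
    s[900] = (struct slot){1, 9};
    printf("schedule thread %d\n", pick(s, 0, NTHREADS)); /* prints 900 */
    return 0;
}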




@jlauro

When you explain it that way it sounds plausible. Is this hypothetical idea something that is being discussed somewhere, or is it your own creation?

The reason I ask is that I would like to read more about it (if there is more info out there).



@jlauro: From what I understand, your approach amounts to the same behavior as inlining all the functions and performing high-level optimizations on the resulting code.

For example, by inlining+optimizing, any starting statements in func using "c" but not "x" could be done in parallel with funb's statements (which an optimizing compiler / out-of-order CPU can easily do nowadays I believe).

The advantage would be that inlining of those statements would not be required. The disadvantage would be that the optimization would happen at runtime in hardware instead of at compile time, which would make hardware more expensive (perhaps unnecessarily?).
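
A minimal sketch of that inlining point, reusing the funa/funb/func names from the example earlier in the thread (the bodies here are invented): after inlining, a statement of func that depends only on c no longer has to wait for funb's result.

/* Hypothetical bodies, invented for illustration. */
static int funb(int c)        { return c * 2; }
static int func(int x, int c) { int t = c + 1;  /* uses only c */
                                return x + t; } /* uses x      */

int funa(int c)
{
    int x = funb(c);    /* caller must wait for funb...           */
    return func(x, c);  /* ...before func's c-only work can start */
}

/* After inlining and reordering, the c-only statement is free to
 * execute in parallel with (or before) funb's body: */
int funa_inlined(int c)
{
    int t = c + 1;      /* independent of funb's result */
    int x = c * 2;      /* funb's body                  */
    return x + t;
}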

 



My Mario Kart Wii friend code: 2707-1866-0957

Replacing silicon with manufactured diamond



Never argue with idiots
They bring you down to their level and then beat you with experience

@MisterBlonde: It's partly my own creation, but some of it is based on reading a bunch of stuff. I can't take credit for the idea of implementing a hardware scheduler, but unfortunately I don't recall where I read the paper on it. I am trying to play with some ideas on an FPGA kit in my spare time, but I am not very good with any HDLs yet (teaching myself), and cannot fit many cores into a little $150 FPGA kit... It would probably cost a few thousand to get an FPGA large enough to fit enough soft cores into a single CPU to do any real experiments...

@NJ5: Inlining them would help, but you would still need multiple cores for them to run in parallel with the called function. However, you could probably get by with fewer cores than having the hardware do it... That said, inlining would not give you the pipelining speedups of streaming data through multiple functions. A lot of the time you are just moving data and making minor alterations to it as it goes from one function to another, often feeding the output of one function to the next. If you let the next function work on the outgoing data of the previous function as it is being generated, the data can pass through both functions in just a little more time than it would take the first function alone to process it.
Consider:
char *c;
makeupper( c );
removewhitespace( c );
tokenize( c );

and let's say c points at a 2k stream of data.
Normally, each of those functions has to run through all 2k of data. If nothing else, it's going to take time just to read through it.

Now, obviously we have language issues in C, but...
pipeline *c
tokenize( removewhitespace( makeupper( c ) ) )

So the standard example is going to take at least 12k cycles as a minimum (3 functions x 2k characters), being generous and saying it's only 1 cycle to read and 1 cycle to process each character.

However, with it pipelined, it will not take much more than 4k cycles to pass through all the functions, and the primary core could actually start doing other work after only a relatively few cycles. Now, it does use 4 cores, but no amount of inlining will give you that speedup. Being generous, you could rewrite it as one function, token-n-removewhite-n-makeupper, and give it 1 cycle for the read and a cycle for each operation, or 8k cycles to process the full 2k. Less modular code, and it still takes twice as long, but it is a compromise.
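
To make the pipelining idea concrete in today's C (no special hardware), here is a minimal two-stage sketch using POSIX pipes and processes: each stage starts working on a character as soon as the previous stage emits it. The stage functions are simplified stand-ins for the makeupper/removewhitespace/tokenize calls above, and the names are invented for this example.

#include <ctype.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* One pipeline stage: read bytes from `in`, transform with `f`,
 * write survivors to `out`.  f returns -1 to drop a character. */
static void stage(int in, int out, int (*f)(int))
{
    char ch;
    while (read(in, &ch, 1) == 1) {
        int c = f((unsigned char)ch);
        if (c >= 0) {
            char o = (char)c;
            (void)write(out, &o, 1);
        }
    }
    _exit(0);
}

static int upper(int c)   { return toupper(c); }
static int nospace(int c) { return c == ' ' ? -1 : c; }

int main(void)
{
    const char *text = "a 2k stream would go here\n";
    int p1[2], p2[2];
    if (pipe(p1) < 0 || pipe(p2) < 0) return 1;

    if (fork() == 0) {                      /* stage 1: makeupper        */
        close(p1[1]); close(p2[0]);
        stage(p1[0], p2[1], upper);
    }
    if (fork() == 0) {                      /* stage 2: removewhitespace */
        close(p1[0]); close(p1[1]); close(p2[1]);
        stage(p2[0], STDOUT_FILENO, nospace);
    }
    close(p1[0]); close(p2[0]); close(p2[1]);

    /* The "primary core": it streams the data in and is free again as
     * soon as the last byte is handed off, not when processing ends. */
    (void)write(p1[1], text, strlen(text));
    close(p1[1]);
    while (wait(NULL) > 0) {}
    return 0;
}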



MisterBlonde said:
@crashman

For the most part you're right. When a caller calls another function, the called function's stack frame goes on top of the stack (each thread has its own call stack). eip and esp are changed to reflect this. Therefore the caller HAS to wait until the called function returns, even if it doesn't need the return value. The called function's frame needs to be popped off the stack before eip can be updated with the caller's next instruction. Each thread has a single stack, and each processor has only one eip, ebp, esp, etc.

The only way a calling function can continue execution prior to the called function completing is if it makes an asynchronous call to code running on another thread.

For the code you mentioned, the only way multiple cores can speed it up is if somehow funb, func, or fund spawns its own threads. If they don't, then funa will run on a single thread, which will be affinitized to the same processor.

No, he's not. In practice, most C compilers will keep the function parameters in registers and will not actually push them onto the call stack. On the Intel machine you are talking about, if you were to write a piece of inline assembly, the function parameters would be stored in EAX, EBX, etc.

 



the 2nd coming of christ!



Playing Assassin's Creed and Resident evil 5 <3

I dont want to be fanboy anymore...Why? it takes to much work but i will call on ppl on there B.S!!!:)

jlauro said:
MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time each thread was interrupted), and then load the context of the thread being scheduled into the registers.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex the job becomes and the more often it happens.

That is only an issue if the scheduling is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that in hardware than to add two 64-bit numbers together, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register as to how many threads are allowed to be created automatically before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open threads are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.

 

I don't see how your hardware scheduler would help reduce the overhead of context switches. And I think you are overestimating the power of hardware schedulers and underestimating the problem.

Calculating the optimal schedule on an SMP system is equivalent to bin packing, which is NP-hard.

Once upon a time my area of research in CS was hybrid multiprocessor scheduling (EDF, RM, and fixed-priority scheduling) for realtime schedulers. While the scheduling algorithms are somewhat different for normal systems, I do know something about it.
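
As a concrete illustration of that equivalence (not from the post): splitting jobs with run times {3, 3, 2, 2, 2} across two identical cores so both finish together is exactly the PARTITION problem. The greedy LPT heuristic sketched below, a common practical compromise, already misses the optimum on this tiny input.

#include <stdio.h>
#include <stdlib.h>

static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

int main(void)
{
    int jobs[] = {3, 3, 2, 2, 2};          /* job run times       */
    int n = sizeof jobs / sizeof jobs[0];
    int load[2] = {0, 0};                  /* two identical cores */

    /* LPT: longest job first, always onto the lighter core. */
    qsort(jobs, n, sizeof jobs[0], desc);
    for (int i = 0; i < n; i++)
        load[load[0] <= load[1] ? 0 : 1] += jobs[i];

    /* Prints 7 and 5; the optimal split {3,3} vs {2,2,2} gives 6 and 6,
     * and finding that split in general is the NP-hard part. */
    printf("core loads: %d %d\n", load[0], load[1]);
    return 0;
}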

 



@alephnull

The function parameters are accessed via offsets from ebp. There aren't enough registers to store all the parameters for a whole call stack, so they are pushed onto and popped off the stack. All compilers do is translate C code into assembly instructions that are executed by the CPU; at run time the compiler is nonexistent (unless you are talking about a JIT compiler). The assembly code does the pushing and popping.

eip points to the next instruction to be executed. When a function makes a call, it pushes its return address onto the stack. There is no way during a synchronous call for the calling function to keep running until the called function returns. That is what crashman was saying, and that is what I was agreeing with. When the called function returns, the caller's return address is popped back into eip.

Google "function prologue and epilogue" for more details.
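
For anyone following along, here is roughly what that looks like for a trivial function under the classic 32-bit cdecl convention. The assembly is shown as comments and is illustrative only: exact output varies by compiler and flags, and the x86-64 System V ABI does pass the first few arguments in registers, as alephnull says.

/* What a classic 32-bit x86 compiler might emit for this function. */
int add(int a, int b)
{
    /* prologue:
     *   push ebp        ; save the caller's frame pointer
     *   mov  ebp, esp   ; establish this function's frame
     * now [ebp+4] holds the return address pushed by `call`,
     * [ebp+8] holds a, and [ebp+12] holds b */
    return a + b;
    /* body + epilogue:
     *   mov  eax, [ebp+8]
     *   add  eax, [ebp+12]  ; result is returned in eax
     *   pop  ebp            ; restore the caller's frame pointer
     *   ret                 ; pops the return address into eip */
}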



jlauro said:
smallflyingtaco said:
jlauro said:
With enough cores, you could design the CPU such that, instead of working at the normal thread level, functions could run inside cores of their own. As one function calls another, the callee can literally be processing the values as they come in. With a single core (or only a few), you are pushing all the values onto the stack, switching to the function, having the function pull them back off to operate on, pausing the calling function, and then returning once the processing is done. With tons of cores, the called function can begin processing the data as it arrives while the calling function works on calculating the next values to process, both cores running concurrently. Some of that could be done by compilers even for problems that don't naturally lend themselves to massive numbers of cores directly. As functions go hundreds of levels deep, think of the speedups that are possible.

 

The speedup you're talking about assumes that you're not getting any cache misses and that your functions can be called without reliance on data from another function. Whenever those things happen, you're just going to leave one of your cores idling, and with that happening hundreds or thousands of times you're really just going to have hundreds of billions of wasted cycles. You also can only add so many cores before the speed of light limits how well they can communicate, even with a 3D chip layout. In theory you can reduce the fab size to help with this, but then you eventually run into Heisenberg problems. This means you're going to hit a maximum number of cores per chip, at which point adding more cores slows the whole thing down.

 

 

When it relies on data from another function, that is actually a plus, as those functions can also be started on other cores. This is where the huge number of cores ties in, as those other functions will run on other cores. Local memory (at least 4k, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As to the worry about all of the idle cores... actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used up, you can swap them out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.

 

 

You don't need a crossbar between cores; Cell doesn't have one (not for what you are talking about, at least).