
PC Discussion - So what comes after parallel processing?

MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time each thread was interrupted), and then load the context of the thread being scheduled into the registers.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex the job becomes and the more often it happens.

That is only an issue if the scheduling is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that in hardware than to add two 64-bit numbers together, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register as to how many threads are allowed to be created automatically before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open threads are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.
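
To make the "look at all threads once per clock" claim concrete, here is a hedged software model of what the hardware would do. In real hardware this reduction would be a combinational tournament tree: with N thread slots the tree is only log2(N) comparators deep, which is in the same ballpark as a 64-bit adder's carry logic. The slot layout, names, and priority encoding below are invented for illustration, not taken from any actual design.

/* Software model of a single-cycle hardware thread picker. */
#include <stdio.h>

#define NTHREADS 1024

struct slot {
    int runnable;   /* 1 if this hardware thread slot wants a core */
    int priority;   /* bigger = more urgent */
};

/* Reduce slots [lo, lo+n) pairwise, mirroring the log-depth tree.
 * Returns the index of the best runnable slot, or -1 if none. */
static int pick(const struct slot *s, int lo, int n)
{
    if (n == 1)
        return s[lo].runnable ? lo : -1;
    int a = pick(s, lo, n / 2);
    int b = pick(s, lo + n / 2, n - n / 2);
    if (a < 0) return b;
    if (b < 0) return a;
    return s[a].priority >= s[b].priority ? a : b;
}

int main(void)
{
    struct slot s[NTHREADS] = {{0, 0}};
    s[37]  = (struct slot){1, 5};
    s[900] = (struct slot){1, 9};
    printf("schedule thread %d\n", pick(s, 0, NTHREADS)); /* prints 900 */
    return 0;
}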




@jlauro

When you explain it that way it sounds plausible. Is this hypothetical idea something that is being discussed somewhere, or is it your own creation?

The reason I ask is that I would like to read more about it (if there is more info out there).



@jlauro: From what I understand, your approach amounts to the same behavior as inlining all the functions and performing high-level optimizations on the resulting code.

For example, by inlining+optimizing, any starting statements in func using "c" but not "x" could be done in parallel with funb's statements (which an optimizing compiler / out-of-order CPU can easily do nowadays I believe).

The advantage would be that inlining of those statements would not be required. The disadvantage would be that the optimization would happen at runtime in hardware instead of at compile time, which would make hardware more expensive (perhaps unnecessarily?).
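
A minimal sketch of that inlining point, reusing the funa/funb/func names from the example earlier in the thread (the bodies here are invented): after inlining, a statement of func that depends only on c no longer has to wait for funb's result.

/* Hypothetical bodies, invented for illustration. */
static int funb(int c)        { return c * 2; }
static int func(int x, int c) { int t = c + 1;  /* uses only c */
                                return x + t; } /* uses x      */

int funa(int c)
{
    int x = funb(c);    /* caller must wait for funb...           */
    return func(x, c);  /* ...before func's c-only work can start */
}

/* After inlining and reordering, the c-only statement is free to
 * execute in parallel with (or before) funb's body: */
int funa_inlined(int c)
{
    int t = c + 1;      /* independent of funb's result */
    int x = c * 2;      /* funb's body                  */
    return x + t;
}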

 



My Mario Kart Wii friend code: 2707-1866-0957

Replacing silicon with manufactured diamond



Never argue with idiots
They bring you down to their level and then beat you with experience

@MisterBlonde: It's partly my own creation, but some of it is based on reading a bunch of stuff. I can't take credit for the idea of implementing a hardware scheduler, but unfortunately I don't recall where I read the paper on it. I am trying to play with some ideas on an FPGA kit in my spare time, but I am not very good with any HDLs yet (teaching myself), and cannot fit many cores into a little $150 FPGA kit... It would probably cost a few thousand to get an FPGA large enough to fit enough soft cores into a single CPU to do any real experiments...

@NJ5: Inlining them would help, but you would still need multiple cores for them to run in parallel with the called function. However, you could probably get by with fewer cores than having the hardware do it... That said, inlining would not give you the pipelining speedups of streaming data through multiple functions. A lot of the time you are just moving data and making minor alterations to it as it goes from one function to another, often feeding the output of one function to the next. If you let the next function work on the outgoing data of the previous function as it is being generated, the data can pass through both functions in just a little more time than it would take the first function alone to process it.
Consider:
char *c;
makeupper( c );
removewhitespace( c );
tokenize( c );

and let's say c points at a 2k stream of data.
Normally, each of those functions has to run through all 2k of data. If nothing else, it's going to take time just to read through it.

Now, obviously we have language issues in C, but...
pipeline *c
tokenize( removewhitespace( makeupper( c ) ) )

So the standard example is going to take at least 12k cycles as a minimum (3 functions x 2k characters), being generous and saying it's only 1 cycle to read and 1 cycle to process each character.

However, with it pipelined, it will not take much more than 4k cycles to pass through all the functions, and the primary core could actually start doing other work after only a relatively few cycles. Now, it does use 4 cores, but no amount of inlining will give you that speedup. Being generous, you could rewrite it as one function, token-n-removewhite-n-makeupper, and give it 1 cycle for the read and a cycle for each operation, or 8k cycles to process the full 2k. Less modular code, and it still takes twice as long, but it is a compromise.
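
To make the pipelining idea concrete in today's C (no special hardware), here is a minimal two-stage sketch using POSIX pipes and processes: each stage starts working on a character as soon as the previous stage emits it. The stage functions are simplified stand-ins for the makeupper/removewhitespace/tokenize calls above, and the names are invented for this example.

#include <ctype.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* One pipeline stage: read bytes from `in`, transform with `f`,
 * write survivors to `out`.  f returns -1 to drop a character. */
static void stage(int in, int out, int (*f)(int))
{
    char ch;
    while (read(in, &ch, 1) == 1) {
        int c = f((unsigned char)ch);
        if (c >= 0) {
            char o = (char)c;
            (void)write(out, &o, 1);
        }
    }
    _exit(0);
}

static int upper(int c)   { return toupper(c); }
static int nospace(int c) { return c == ' ' ? -1 : c; }

int main(void)
{
    const char *text = "a 2k stream would go here\n";
    int p1[2], p2[2];
    if (pipe(p1) < 0 || pipe(p2) < 0) return 1;

    if (fork() == 0) {                      /* stage 1: makeupper        */
        close(p1[1]); close(p2[0]);
        stage(p1[0], p2[1], upper);
    }
    if (fork() == 0) {                      /* stage 2: removewhitespace */
        close(p1[0]); close(p1[1]); close(p2[1]);
        stage(p2[0], STDOUT_FILENO, nospace);
    }
    close(p1[0]); close(p2[0]); close(p2[1]);

    /* The "primary core": it streams the data in and is free again as
     * soon as the last byte is handed off, not when processing ends. */
    (void)write(p1[1], text, strlen(text));
    close(p1[1]);
    while (wait(NULL) > 0) {}
    return 0;
}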



MisterBlonde said:
@crashman

For the most part you're right. When a caller calls another function, the called function's stack frame goes on top of the stack (each thread has its own call stack). eip and esp are changed to reflect this. Therefore the caller HAS to wait until the called function returns, even if it doesn't need the return value. The called function's frame needs to be popped off the stack before eip can be updated with the caller's next instruction. Each thread has a single stack, and each processor has only one eip, ebp, esp, etc.

The only way a calling function can continue execution prior to the called function completing is if it makes an asynchronous call to code running on another thread.

For the code you mentioned, the only way multiple cores can speed it up is if somehow funb, func, or fund spawns its own threads. If they don't, then funa will run on a single thread, which will be affinitized to the same processor.

No, he's not. In practice, most C compilers will keep the function parameters in registers and will not actually push them onto the call stack. On the Intel machine you are talking about, if you were to write a piece of inline assembly, the function parameters would be stored in EAX, EBX, etc.

 



the 2nd coming of christ!



Playing Assassin's Creed and Resident evil 5 <3

I dont want to be fanboy anymore...Why? it takes to much work but i will call on ppl on there B.S!!!:)

jlauro said:
MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time each thread was interrupted), and then load the context of the thread being scheduled into the registers.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex the job becomes and the more often it happens.

That is only an issue if the scheduling is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that in hardware than to add two 64-bit numbers together, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register as to how many threads are allowed to be created automatically before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open threads are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.

 

I don't see how your hardware scheduler would help reduce the overhead of context switches. And I think you are overestimating the power of hardware schedulers and underestimating the problem.

Calculating the optimal schedule on an SMP system is equivalent to bin packing, which is NP-hard.

Once upon a time my area of research in CS was hybrid multiprocessor scheduling (EDF, RM, and fixed-priority scheduling) for realtime schedulers. While the scheduling algorithms are somewhat different for normal systems, I do know something about it.
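
As a concrete illustration of that equivalence (not from the post): splitting jobs with run times {3, 3, 2, 2, 2} across two identical cores so both finish together is exactly the PARTITION problem. The greedy LPT heuristic sketched below, a common practical compromise, already misses the optimum on this tiny input.

#include <stdio.h>
#include <stdlib.h>

static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

int main(void)
{
    int jobs[] = {3, 3, 2, 2, 2};          /* job run times       */
    int n = sizeof jobs / sizeof jobs[0];
    int load[2] = {0, 0};                  /* two identical cores */

    /* LPT: longest job first, always onto the lighter core. */
    qsort(jobs, n, sizeof jobs[0], desc);
    for (int i = 0; i < n; i++)
        load[load[0] <= load[1] ? 0 : 1] += jobs[i];

    /* Prints 7 and 5; the optimal split {3,3} vs {2,2,2} gives 6 and 6,
     * and finding that split in general is the NP-hard part. */
    printf("core loads: %d %d\n", load[0], load[1]);
    return 0;
}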

 



@alephnull

The function parameters are accessed via offsets from ebp. There aren't enough registers to store all the parameters for a whole call stack, so they are pushed onto and popped off the stack. All compilers do is translate C code into assembly instructions that are executed by the CPU; at run time the compiler is nonexistent (unless you are talking about a JIT compiler). The assembly code does the pushing and popping.

eip points to the next instruction to be executed. When a function makes a call, it pushes its return address onto the stack. There is no way during a synchronous call for the calling function to keep running until the called function returns. That is what crashman was saying, and that is what I was agreeing with. When the called function returns, the caller's return address is popped back into eip.

Google "function prologue and epilogue" for more details.
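
For anyone following along, here is roughly what that looks like for a trivial function under the classic 32-bit cdecl convention. The assembly is shown as comments and is illustrative only: exact output varies by compiler and flags, and the x86-64 System V ABI does pass the first few arguments in registers, as alephnull says.

/* What a classic 32-bit x86 compiler might emit for this function. */
int add(int a, int b)
{
    /* prologue:
     *   push ebp        ; save the caller's frame pointer
     *   mov  ebp, esp   ; establish this function's frame
     * now [ebp+4] holds the return address pushed by `call`,
     * [ebp+8] holds a, and [ebp+12] holds b */
    return a + b;
    /* body + epilogue:
     *   mov  eax, [ebp+8]
     *   add  eax, [ebp+12]  ; result is returned in eax
     *   pop  ebp            ; restore the caller's frame pointer
     *   ret                 ; pops the return address into eip */
}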



jlauro said:
smallflyingtaco said:
jlauro said:
With enough cores, you could design the CPU such that, instead of working at the normal thread level, functions could run inside cores of their own. As one function calls another, the callee can literally be processing the values as they come in. With a single core (or only a few), you are pushing all the values onto the stack, switching to the function, having the function pull them back off to operate on, pausing the calling function, and then returning once the processing is done. With tons of cores, the called function can begin processing the data as it arrives while the calling function works on calculating the next values to process, both cores running concurrently. Some of that could be done by compilers even for problems that don't naturally lend themselves to massive numbers of cores directly. As functions go hundreds of levels deep, think of the speedups that are possible.

 

The speedup you're talking about assumes that you're not getting any cache misses and that your functions can be called without reliance on data from another function. Whenever those things happen, you're just going to leave one of your cores idling, and with that happening hundreds or thousands of times you're really just going to have hundreds of billions of wasted cycles. You also can only add so many cores before the speed of light limits how well they can communicate, even with a 3D chip layout. In theory you can reduce the fab size to help with this, but then you eventually run into Heisenberg problems. This means you're going to hit a maximum number of cores per chip, at which point adding more cores slows the whole thing down.

 

 

When it relies on data from another function, that is actually a plus, as those functions can also be started on other cores. This is where the huge number of cores ties in, as those other functions will run on other cores. Local memory (at least 4k, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As to the worry about all of the idle cores... actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used up, you can swap them out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.

 

 

You don't need a crossbar between cores; Cell doesn't have one (not for what you are talking about, at least).