
So what comes after parallel processing?

smallflyingtaco said:
jlauro said:
With enough cores, you could design the CPU so that, instead of parallelizing at the normal thread level, individual functions run inside their own cores. As one function calls another, the callee can literally be processing values as they come in. With a single core (or only a few), you push all the values onto the stack, switch to the function, have the function pull them back off to operate on, pause the calling function, and then return once the processing is done. With tons of cores, the called function can begin processing the data as it arrives while the calling function works on calculating the remaining values. Both cores run concurrently. Compilers could do some of this even for problems that don't naturally lend themselves to massive numbers of cores. With call chains hundreds of levels deep, think of the speedups that are possible.

 

The speedup you're talking about assumes that you're not getting any cache misses and that your functions can be called without relying on data from other functions. Whenever either of those happens, you're just going to leave one of your cores idling; with that happening hundreds or thousands of times, you're really just going to have hundreds of billions of wasted cycles. You can also only add so many cores before the speed of light limits how well they can communicate, even with a 3D chip layout. In theory you can shrink the fab process to compensate, but then you eventually run into Heisenberg problems. This means you're going to hit a maximum number of cores per chip, at which point adding more cores slows the whole thing down.

 

 

When it relies on data from another function, that is a plus, as those functions can also be started on other cores. This is where a huge number of cores ties in: those other functions will run on other cores. Local memory (at least 4 KB, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As for worrying about all of the idle cores: actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used, you can swap threads out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.
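For the curious, here is a minimal software sketch of that streaming-call idea, with two POSIX threads and a one-slot mailbox standing in for the per-core hardware queues described above; send_val, recv_val, and called_function are invented names for illustration:

#include <pthread.h>
#include <stdio.h>

static int slot;                             /* one-value "queue"         */
static int slot_full = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void send_val(int v) {                /* caller pushes an argument */
    pthread_mutex_lock(&m);
    while (slot_full) pthread_cond_wait(&cv, &m);
    slot = v; slot_full = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

static int recv_val(void) {                  /* callee pulls an argument  */
    pthread_mutex_lock(&m);
    while (!slot_full) pthread_cond_wait(&cv, &m);
    int v = slot; slot_full = 0;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return v;
}

static void *called_function(void *arg) {
    (void)arg;
    long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += recv_val();                   /* works on early arguments  */
    printf("sum = %ld\n", sum);              /* while the caller computes */
    return NULL;                             /* the later ones            */
}

int main(void) {
    pthread_t callee;
    pthread_create(&callee, NULL, called_function, NULL);
    for (int i = 0; i < 4; i++)
        send_val(i * i);                     /* "calculating the values"  */
    pthread_join(callee, NULL);
    return 0;
}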

 

 



jake_the_fake1 said:
After parallel processing comes quantum processing, and then we'll get phase engines and go quantum speed...

Quantum computing isn't really faster than traditional computing, just different. It's much faster for certain problems, that's for sure, but as far as current knowledge goes those problems aren't very broad in application.

 




The only way you can continue processing a function after calling another function is if the calling function does not rely on a return value from the called function. In cases like that, you are almost universally talking about a different subsystem of the game from the one the calling function resides in, which would be on another thread anyway for optimization.

The number of long, intensive functions with no return value and no variables passed by reference is extraordinarily low. They are usually short functions that should probably be inlined anyway.

The actual function call on the remote core won't be any faster than a standard function call anyway. You will need to push the variables to some data structure in memory (probably the stack for that core) and push the return value back to the calling core (the stack again).

You will have to set the function pointer on the called core to the proper function as well.

The only thing that wouldn't need to be done is pushing and popping the calling function's return address on the stack. You would still need some sort of memory write/read to tell when the return variable is ready, so there is no speedup in calling a function on a separate core versus a standard call.

I don't see how multiple cores could possibly speed up the following:

int funa(int a, int b, int c)
{
    int x = funb(a, b);   /* func below needs x...          */
    int y = func(c, x);   /* ...and fund needs y, so each   */
    int z = fund(y, a);   /* call depends on the one before */

    return z;
}




@crashman

For the most part you're right. When a caller calls another function, the called function's frame is pushed onto the top of the stack (each thread has its own call stack), and EIP and ESP are updated to reflect this. Therefore the caller HAS to wait until the called function returns, even if it doesn't need the return value. The called function's frame needs to be popped off the stack before EIP can be restored to the caller's next instruction. Each thread has a single stack, and each processor has only one EIP, EBP, ESP, etc.

The only way a calling function can continue execution before the called function completes is if it makes the call asynchronously, to code running on another thread.

For the code you mentioned, the only way multiple cores can speed it up is if funb, func, or fund somehow spawn their own threads. If they don't, then funa will run on a single thread, which will be affinitized to a single processor.
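A small sketch of that one escape hatch, assuming POSIX threads: the "called function" runs on its own thread, with its own stack and instruction pointer, so the caller keeps executing and only blocks when it finally needs the result. expensive_callee is an invented name.

#include <pthread.h>
#include <stdio.h>

static void *expensive_callee(void *arg) {
    long n = (long)arg, acc = 0;
    for (long i = 0; i < n; i++) acc += i;  /* stand-in for real work */
    return (void *)acc;
}

int main(void) {
    pthread_t t;
    void *ret;

    /* Fire off the "called function" asynchronously: its own thread,
       own stack, own instruction pointer, so the caller's EIP/ESP are
       untouched and it runs straight past the call site.              */
    pthread_create(&t, NULL, expensive_callee, (void *)1000000L);

    printf("caller still running\n");       /* executes immediately    */

    pthread_join(t, &ret);                  /* block only when the     */
    printf("result = %ld\n", (long)ret);    /* return value is needed  */
    return 0;
}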



Barring great and unexpected advances in computer science, even multi-core is limited in application. Sure, it's nice to have one core to run all the spyware and another to run the useful software, or even to run two useful programs at the same time, but many kinds of programs will never take advantage of even a few cores, let alone many-core architectures.

Sure, it's possible to take most badly designed programs and make them much faster using multithreading, but it's much harder to do so for programs which are already efficient in the first place.

I'm having fun programming with CUDA right now... on number theory problems which are embarrassingly parallel. Other than that, I'm still hoping for faster cores.
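For a concrete picture of "embarrassingly parallel", here is a small sketch in plain C with pthreads rather than CUDA, so it stays self-contained: each thread counts primes in its own disjoint range and shares nothing, which is why such problems scale almost linearly with cores. All names are invented.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define LIMIT    400000

static int is_prime(long n) {
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

typedef struct { long lo, hi, count; } range_t;

static void *count_primes(void *arg) {
    range_t *r = arg;
    for (long n = r->lo; n < r->hi; n++)
        r->count += is_prime(n);        /* no shared state, no locks */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    range_t r[NTHREADS];
    long total = 0, step = LIMIT / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        r[i] = (range_t){ i * step, (i + 1) * step, 0 };
        pthread_create(&tid[i], NULL, count_primes, &r[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += r[i].count;            /* combine at the very end   */
    }
    printf("primes below %d: %ld\n", LIMIT, total);
    return 0;
}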

 





@NJ5

Right. For the most part, the only way to get linear benefit is when you can find independent tasks that do not need to operate on shared data. Even then, multiple threads still have to contend for shared resources like disk drives, network interfaces, RAM, etc.

Also, the code responsible for scheduling threads onto idle cores can get bogged down like an overworked traffic cop. The more cores, the more work for the thread scheduler; the further behind it gets, the more cycles get consumed.



jlauro said:

[nested quotes snipped]

When it relies on data from another function, that is a plus, as those functions can also be started on other cores. This is where a huge number of cores ties in: those other functions will run on other cores. Local memory (at least 4 KB, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As for worrying about all of the idle cores: actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used, you can swap threads out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.

 

 

You're talking about something like Intel's Teraflops research project, which is a little different from, but similar to, what you're describing. I'm not actually certain whether it works well, but I had assumed it wasn't for general-purpose use, since I didn't think it was going to be released in any actual form.

 




CrashMan said:

[snip; see the full post above] I don't see how multiple cores could possibly speed up the following:

int funa(int a, int b, int c)
{
    int x = funb(a, b);
    int y = func(c, x);
    int z = fund(y, a);

    return z;
}

 

I am talking about a whole design change, where essentially each core would have its own incoming and outgoing queues, so it could begin processing immediately.

core1: call funb on core2; read a, pass a to funb; read b, pass b to funb; wait for the return from funb and assign it to x

core1: call func; read parameter c, pass c to func; pass x to func; wait for the return from func and assign it to y

core1: call fund; pass y to fund, pass a to fund; wait for the return from fund and assign it to z; pass z back to the calling function

 

Now here's the cool thing, though it gets into messy side-effects (not really in your example, which is mostly safe because of the int types and the non-overlapping variables, but global variables modified in both functions would make it unsafe): the second call could actually work through the steps on the second line, up to "pass x to func", before it gets a return result. So func actually gets a head start on processing until it needs the x parameter. Similarly, fund can get a head start doing some initialization until it requires the value of y.

With dedicated queues between cores, you save a lot of stack traffic (although register passing would work for this simple case). This is a whole new architecture I am suggesting, once there are enough cores on a single chip.

 

Where the benefits would really be seen is in data manipulation, where one set of data goes through multiple stream functions.
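Sketched in ordinary software, that streaming design might look something like this: threads standing in for cores, and small blocking queues standing in for the dedicated hardware queues, so each stage starts work as soon as its first input arrives. The queue type and all names (q_put, q_get, stage_b, stage_c) are invented for illustration.

#include <pthread.h>
#include <stdio.h>

#define QCAP 8

typedef struct {                       /* a stand-in for a hardware queue */
    int buf[QCAP], head, tail, len;
    pthread_mutex_t m;
    pthread_cond_t cv;
} queue_t;

static void q_init(queue_t *q) {
    q->head = q->tail = q->len = 0;
    pthread_mutex_init(&q->m, NULL);
    pthread_cond_init(&q->cv, NULL);
}

static void q_put(queue_t *q, int v) { /* block while the queue is full   */
    pthread_mutex_lock(&q->m);
    while (q->len == QCAP) pthread_cond_wait(&q->cv, &q->m);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QCAP; q->len++;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->m);
}

static int q_get(queue_t *q) {         /* block while the queue is empty  */
    pthread_mutex_lock(&q->m);
    while (q->len == 0) pthread_cond_wait(&q->cv, &q->m);
    int v = q->buf[q->head]; q->head = (q->head + 1) % QCAP; q->len--;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->m);
    return v;
}

static queue_t q_ab, q_bc, q_out;      /* the "crossbar" links            */

static void *stage_b(void *arg) {      /* like funb: starts on each value */
    (void)arg;                         /* as soon as it arrives           */
    for (int i = 0; i < 16; i++) q_put(&q_bc, q_get(&q_ab) * 2);
    return NULL;
}

static void *stage_c(void *arg) {      /* like func, fed by funb's output */
    (void)arg;
    for (int i = 0; i < 16; i++) q_put(&q_out, q_get(&q_bc) + 1);
    return NULL;
}

int main(void) {
    pthread_t b, c;
    q_init(&q_ab); q_init(&q_bc); q_init(&q_out);
    pthread_create(&b, NULL, stage_b, NULL);
    pthread_create(&c, NULL, stage_c, NULL);
    for (int i = 0; i < 16; i++) q_put(&q_ab, i);    /* stream values in  */
    for (int i = 0; i < 16; i++) printf("%d ", q_get(&q_out));
    printf("\n");
    pthread_join(b, NULL); pthread_join(c, NULL);
    return 0;
}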

 



MisterBlonde said:

Also, the code responsible for scheduling threads onto idle cores can get bogged down like an overworked traffic cop. The more cores, the more work for the thread scheduler; the further behind it gets, the more cycles get consumed.

 

As it is now, a large number of cores would be an issue... however, if you put the thread scheduler in hardware, then it can run once every clock cycle.

Basically, allocating a thread also has to happen in hardware, and take no more time than calling a function otherwise would.



@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt the currently running thread, save its context (the state of the CPU registers at the moment it was interrupted), and then load the registers with the context of the thread being scheduled.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex it becomes and the more often it runs.
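To put rough numbers on that asymmetry, here is a toy comparison; the register set below is a simplified, made-up x86-style context, not any OS's actual structure:

#include <stdio.h>
#include <stdint.h>

struct cpu_context {               /* what a preemption must save, then  */
    uint32_t eax, ebx, ecx, edx;   /* reload for the incoming thread     */
    uint32_t esi, edi, ebp, esp;
    uint32_t eip, eflags;
    uint32_t segment_regs[6];
    uint8_t fpu_state[108];        /* x87 FSAVE area                     */
};

int main(void) {
    printf("a context switch moves at least %zu bytes, twice over\n",
           sizeof(struct cpu_context));
    printf("a call instruction pushes %zu bytes: the return address\n",
           sizeof(void *));
    return 0;
}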