
So what comes after parallel processing?

smallflyingtaco said:
jlauro said:
With enough cores, you could design the CPU so that, instead of parallelizing at the normal thread level, individual functions run inside their own cores. As one function calls another, the callee can literally be processing values as they come in. With a single core (or only a few), you push all the values onto the stack, switch to the function, have the function pull them back off to operate on, pause the calling function, and then return once the processing is done. With tons of cores, the called function can begin processing the data as it arrives while the calling function works on calculating the remaining values. Both cores run concurrently. Compilers could do some of this even for problems that don't naturally lend themselves to massive numbers of cores. With call chains hundreds of levels deep, think of the speedups that are possible.

 

The speedup you're talking about assumes that you're not getting any cache misses and that your functions can be called without relying on data from other functions. Whenever either of those happens, you're just going to leave one of your cores idling; with that happening hundreds or thousands of times, you're really just going to have hundreds of billions of wasted cycles. You can also only add so many cores before the speed of light limits how well they can communicate, even with a 3D chip layout. In theory you can shrink the fab process to compensate, but then you eventually run into Heisenberg problems. This means you're going to hit a maximum number of cores per chip, at which point adding more cores slows the whole thing down.

 

 

When it relies on data from another function, that is a plus, as those functions can also be started on other cores. This is where a huge number of cores ties in: those other functions will run on other cores. Local memory (at least 4 KB, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As for worrying about all of the idle cores: actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used, you can swap threads out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.
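For the curious, here is a minimal software sketch of that streaming-call idea, with two POSIX threads and a one-slot mailbox standing in for the per-core hardware queues described above; send_val, recv_val, and called_function are invented names for illustration:

#include <pthread.h>
#include <stdio.h>

static int slot;                             /* one-value "queue"         */
static int slot_full = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void send_val(int v) {                /* caller pushes an argument */
    pthread_mutex_lock(&m);
    while (slot_full) pthread_cond_wait(&cv, &m);
    slot = v; slot_full = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

static int recv_val(void) {                  /* callee pulls an argument  */
    pthread_mutex_lock(&m);
    while (!slot_full) pthread_cond_wait(&cv, &m);
    int v = slot; slot_full = 0;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return v;
}

static void *called_function(void *arg) {
    (void)arg;
    long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += recv_val();                   /* works on early arguments  */
    printf("sum = %ld\n", sum);              /* while the caller computes */
    return NULL;                             /* the later ones            */
}

int main(void) {
    pthread_t callee;
    pthread_create(&callee, NULL, called_function, NULL);
    for (int i = 0; i < 4; i++)
        send_val(i * i);                     /* "calculating the values"  */
    pthread_join(callee, NULL);
    return 0;
}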

 

 



jake_the_fake1 said:
After parallel processing comes quantum processing, and then we'll get phase engines and go quantum speed...

Quantum computing isn't really faster than traditional computing, just different. It's much faster for certain problems, that's for sure, but as far as current knowledge goes those problems aren't very broad in application.

 




The only way you can continue processing a function after calling another function is if the calling function does not rely on a return value from the called function. In cases like that, you are almost universally talking about a different subsystem of the game from the one the calling function resides in, which would be on another thread anyway for optimization.

The number of long, intensive functions with no return value and no variables passed by reference is extraordinarily low. They are usually short functions that should probably be inlined anyway.

The actual function call on the remote core won't be any faster than a standard function call anyway. You will need to push the variables to some data structure in memory (probably the stack for that core) and push the return value back to the calling core (the stack again).

You will have to set the function pointer on the called core to the proper function as well.

The only thing that wouldn't need to be done is pushing and popping the calling function's return address on the stack. You would still need some sort of memory write/read to tell when the return variable is ready, so there is no speedup in calling a function on a separate core versus a standard call.

I don't see how multiple cores could possibly speed up the following:

int funa(int a, int b, int c)
{
    int x = funb(a, b);   /* func below needs x...          */
    int y = func(c, x);   /* ...and fund needs y, so each   */
    int z = fund(y, a);   /* call depends on the one before */

    return z;
}




@crashman

For the most part you're right. When a caller calls another function, the called function's frame is pushed onto the top of the stack (each thread has its own call stack), and EIP and ESP are updated to reflect this. Therefore the caller HAS to wait until the called function returns, even if it doesn't need the return value. The called function's frame needs to be popped off the stack before EIP can be restored to the caller's next instruction. Each thread has a single stack, and each processor has only one EIP, EBP, ESP, etc.

The only way a calling function can continue execution before the called function completes is if it makes the call asynchronously, to code running on another thread.

For the code you mentioned, the only way multiple cores can speed it up is if funb, func, or fund somehow spawn their own threads. If they don't, then funa will run on a single thread, which will be affinitized to a single processor.
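A small sketch of that one escape hatch, assuming POSIX threads: the "called function" runs on its own thread, with its own stack and instruction pointer, so the caller keeps executing and only blocks when it finally needs the result. expensive_callee is an invented name.

#include <pthread.h>
#include <stdio.h>

static void *expensive_callee(void *arg) {
    long n = (long)arg, acc = 0;
    for (long i = 0; i < n; i++) acc += i;  /* stand-in for real work */
    return (void *)acc;
}

int main(void) {
    pthread_t t;
    void *ret;

    /* Fire off the "called function" asynchronously: its own thread,
       own stack, own instruction pointer, so the caller's EIP/ESP are
       untouched and it runs straight past the call site.              */
    pthread_create(&t, NULL, expensive_callee, (void *)1000000L);

    printf("caller still running\n");       /* executes immediately    */

    pthread_join(t, &ret);                  /* block only when the     */
    printf("result = %ld\n", (long)ret);    /* return value is needed  */
    return 0;
}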



Barring great and unexpected advances in computer science, even multi-core is limited in application. Sure, it's nice to have one core to run all the spyware and another to run the useful software, or even to run two useful programs at the same time, but many kinds of programs will never take advantage of even a few cores, let alone many-core architectures.

Sure, it's possible to take most badly designed programs and make them much faster using multithreading, but it's much harder to do so for programs which are already efficient in the first place.

I'm having fun programming with CUDA right now... on number theory problems which are embarrassingly parallel. Other than that, I'm still hoping for faster cores.
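For a concrete picture of "embarrassingly parallel", here is a small sketch in plain C with pthreads rather than CUDA, so it stays self-contained: each thread counts primes in its own disjoint range and shares nothing, which is why such problems scale almost linearly with cores. All names are invented.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define LIMIT    400000

static int is_prime(long n) {
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

typedef struct { long lo, hi, count; } range_t;

static void *count_primes(void *arg) {
    range_t *r = arg;
    for (long n = r->lo; n < r->hi; n++)
        r->count += is_prime(n);        /* no shared state, no locks */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    range_t r[NTHREADS];
    long total = 0, step = LIMIT / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        r[i] = (range_t){ i * step, (i + 1) * step, 0 };
        pthread_create(&tid[i], NULL, count_primes, &r[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += r[i].count;            /* combine at the very end   */
    }
    printf("primes below %d: %ld\n", LIMIT, total);
    return 0;
}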

 





@NJ5

Right. For the most part, the only way to get linear benefit is when you can find independent tasks that do not need to operate on shared data. Even then, multiple threads still have to contend for shared resources like disk drives, network interfaces, RAM, etc.

Also, the code responsible for scheduling threads onto idle cores can get bogged down like an overworked traffic cop. The more cores, the more work for the thread scheduler; the further behind it gets, the more cycles get consumed.



jlauro said:

[nested quotes snipped]

When it relies on data from another function, that is a plus, as those functions can also be started on other cores. This is where a huge number of cores ties in: those other functions will run on other cores. Local memory (at least 4 KB, the more the better) for each core is essential, as is a crossbar switch between all of the cores. As for worrying about all of the idle cores: actually make the crossbar go between the threads, and have a separate crossbar interconnecting threads and cores. When the available on-chip memory for threads is all used, you can swap threads out to external RAM.

Of course, with hundreds of threads on the chip, the crossbar switches will take as much space as the cores.

 

 

You're talking about something like Intel's Teraflops research project, which is a little different from, but similar to, what you're describing. I'm not actually certain whether it works well, but I had assumed it wasn't for general-purpose use, since I didn't think it was going to be released in any actual form.

 




CrashMan said:

[snip; see the full post above] I don't see how multiple cores could possibly speed up the following:

int funa(int a, int b, int c)
{
    int x = funb(a, b);
    int y = func(c, x);
    int z = fund(y, a);

    return z;
}

 

I am talking about a whole design change, where essentially each core would have its own incoming and outgoing queues, so it could begin processing immediately.

core1: call funb on core2; read a, pass a to funb; read b, pass b to funb; wait for the return from funb and assign it to x

core1: call func; read parameter c, pass c to func; pass x to func; wait for the return from func and assign it to y

core1: call fund; pass y to fund, pass a to fund; wait for the return from fund and assign it to z; pass z back to the calling function

 

Now here's the cool thing, though it gets into messy side-effects (not really in your example, which is mostly safe because of the int types and the non-overlapping variables, but global variables modified in both functions would make it unsafe): the second call could actually work through the steps on the second line, up to "pass x to func", before it gets a return result. So func actually gets a head start on processing until it needs the x parameter. Similarly, fund can get a head start doing some initialization until it requires the value of y.

With dedicated queues between cores, you save a lot of stack traffic (although register passing would work for this simple case). This is a whole new architecture I am suggesting, once there are enough cores on a single chip.

 

Where the benefits would really be seen is in data manipulation, where one set of data goes through multiple stream functions.
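Sketched in ordinary software, that streaming design might look something like this: threads standing in for cores, and small blocking queues standing in for the dedicated hardware queues, so each stage starts work as soon as its first input arrives. The queue type and all names (q_put, q_get, stage_b, stage_c) are invented for illustration.

#include <pthread.h>
#include <stdio.h>

#define QCAP 8

typedef struct {                       /* a stand-in for a hardware queue */
    int buf[QCAP], head, tail, len;
    pthread_mutex_t m;
    pthread_cond_t cv;
} queue_t;

static void q_init(queue_t *q) {
    q->head = q->tail = q->len = 0;
    pthread_mutex_init(&q->m, NULL);
    pthread_cond_init(&q->cv, NULL);
}

static void q_put(queue_t *q, int v) { /* block while the queue is full   */
    pthread_mutex_lock(&q->m);
    while (q->len == QCAP) pthread_cond_wait(&q->cv, &q->m);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QCAP; q->len++;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->m);
}

static int q_get(queue_t *q) {         /* block while the queue is empty  */
    pthread_mutex_lock(&q->m);
    while (q->len == 0) pthread_cond_wait(&q->cv, &q->m);
    int v = q->buf[q->head]; q->head = (q->head + 1) % QCAP; q->len--;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->m);
    return v;
}

static queue_t q_ab, q_bc, q_out;      /* the "crossbar" links            */

static void *stage_b(void *arg) {      /* like funb: starts on each value */
    (void)arg;                         /* as soon as it arrives           */
    for (int i = 0; i < 16; i++) q_put(&q_bc, q_get(&q_ab) * 2);
    return NULL;
}

static void *stage_c(void *arg) {      /* like func, fed by funb's output */
    (void)arg;
    for (int i = 0; i < 16; i++) q_put(&q_out, q_get(&q_bc) + 1);
    return NULL;
}

int main(void) {
    pthread_t b, c;
    q_init(&q_ab); q_init(&q_bc); q_init(&q_out);
    pthread_create(&b, NULL, stage_b, NULL);
    pthread_create(&c, NULL, stage_c, NULL);
    for (int i = 0; i < 16; i++) q_put(&q_ab, i);    /* stream values in  */
    for (int i = 0; i < 16; i++) printf("%d ", q_get(&q_out));
    printf("\n");
    pthread_join(b, NULL); pthread_join(c, NULL);
    return 0;
}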

 



MisterBlonde said:

Also, the code responsible for scheduling threads onto idle cores can get bogged down like an overworked traffic cop. The more cores, the more work for the thread scheduler; the further behind it gets, the more cycles get consumed.

 

As it is now, a large number of cores would be an issue... however, if you put the thread scheduler in hardware, then it can run once every clock cycle.

Basically, allocating a thread also has to happen in hardware, and take no more time than calling a function otherwise would.



@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool to give time to (threads have different priorities and different states). Additionally, the scheduler has to preempt the currently running thread, save its context (the state of the CPU registers at the moment it was interrupted), and then load the registers with the context of the thread being scheduled.

Certainly all of that could be optimized further in the future, but it will always be far more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex it becomes and the more often it runs.
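To put rough numbers on that asymmetry, here is a toy comparison; the register set below is a simplified, made-up x86-style context, not any OS's actual structure:

#include <stdio.h>
#include <stdint.h>

struct cpu_context {               /* what a preemption must save, then  */
    uint32_t eax, ebx, ecx, edx;   /* reload for the incoming thread     */
    uint32_t esi, edi, ebp, esp;
    uint32_t eip, eflags;
    uint32_t segment_regs[6];
    uint8_t fpu_state[108];        /* x87 FSAVE area                     */
};

int main(void) {
    printf("a context switch moves at least %zu bytes, twice over\n",
           sizeof(struct cpu_context));
    printf("a call instruction pushes %zu bytes: the return address\n",
           sizeof(void *));
    return 0;
}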