alephnull said:
jlauro said:
MisterBlonde said:
@jlauro

A function call is a single assembly instruction.

Scheduling a thread for execution requires determining which thread from the pool should get CPU time (threads have different priorities and different states). Additionally, the scheduler has to preempt currently running threads, save their context (the state of the CPU registers at the time the thread was interrupted), and then load the context of the thread being scheduled into the registers.

Certainly all that could be optimized further in the future, but it will always be a lot more expensive than a single function call. The more threads and cores the scheduler has to deal with, the more complex it becomes and the more often it happens.

That is only an issue if the scheduler is done in software. A hardware scheduler could easily look at all the threads once per clock cycle, even if there were thousands of cores. It's faster to do that than adding two 64-bit numbers together in hardware, which typically takes only one cycle. That is what is so cool about hardware. The OS could impose resource limits in a register as to how many threads are allowed to be automatically created before a request has to be approved or rebalanced by the OS, or only fall back to the OS if no open thread slots are available. Remember, in this hypothetical environment there are lots of cores, and you should never run out except on the busiest servers at peak time.

 

I don't see how your hardware scheduler would help reduce the overhead of context switches. And I think you are overestimating the power of hardware schedulers and underestimating the problem.

Calculating the optimal schedule on an SMP system is equivalent to bin packing, which is NP-hard.

Once upon a time my area of research in CS was multiprocessor hybrid scheduling (EDF, RM, and fixed-priority) for realtime schedulers. While the scheduling algorithms are somewhat different for normal systems, I do know something about it.

 

The problem isn't NP-hard if the number of threads supported by the hardware > the number of active threads. If you have 1000 cores on the chip, then just fall back to software if you run out. When you run out, the OS can swap out 20% of the threads and start to worry about scheduling. The idea is that, 99% of the time, you will have an empty thread slot available.

It's also easy to do round-robin scheduling if you have thread support in the processor. Sure, granular prioritizing in hardware may not be as simple, but simple queue-based priority isn't NP-hard. It's the same as how QoS works in a layer 2 or layer 3 switch.

The main point is, if you have enough cores, who cares about optimal scheduling? Just run everything, and apply a QoS level if you need to. The OS threads will always have a QoS level that keeps them running, so they can reschedule things if the rest goes crazy.

 

The reason there is no overhead for a context switch is that you put the memory for each thread on the chip. That is why you need a fair amount of memory per thread on each chip, instead of putting cache memory on the chip. This is also why you need a crossbar between the cores and the thread states. There is no cost for context switching because the context stays on the other side of the crossbar. The only time there is a cost for a context switch is when a thread has to be swapped out to external RAM and a new thread switched in. All the registers are essentially part of the memory area for that thread.

 

This idea will not work with only a low number of CPUs, and it also will not work well with multiple physical CPUs, as it assumes uniform memory access (which is not easily done with a large number of cores without a crossbar switch), and you were probably working with a non-uniform memory architecture, or else a small number of processors. Unless you put everything on one chip, including enough memory (4 KB minimum per thread, 64+ KB better) to handle a large number of threads, and enough cores, you will just run out of memory bandwidth...

Anyways, it will probably be another 10-20 years before this design would really be feasible in a single chip.