alephnull said:
I don't see how your hardware scheduler would help reduce the overhead of context switches, and I think you are overestimating the power of hardware schedulers and underestimating the problem. Calculating the optimal schedule on an SMP system is equivalent to bin packing, which is NP-hard. Once upon a time my area of research in CS was hybrid multiprocessor scheduling (EDF, RM, and fixed-priority) for real-time systems. While the scheduling algorithms for normal systems are somewhat different, I do know something about it.
The problem isn't NP-hard if the number of threads supported by the hardware is greater than the number of active threads. If you have 1000 cores on the chip, just fall back to software if you run out. When you do run out, the OS can swap out 20% of the threads and start worrying about scheduling. The idea is that, 99% of the time, you will always have an empty thread slot available.
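The allocation policy above can be sketched in a few lines of C. This is purely hypothetical (the slot table, `HW_SLOTS`, and the overflow counter are all made up for illustration, not any real hardware interface): in the common case a spawn is just claiming a free slot; only when every slot is taken does software scheduling kick in.

```c
/* Hypothetical slot allocator: more hardware thread slots than active
 * threads means spawning is just claiming a free slot. Only when the
 * slots run out does the OS fall back to software scheduling (modeled
 * here as a simple overflow counter). */
#include <stdbool.h>

#define HW_SLOTS 1000

static bool slot_used[HW_SLOTS];
static int  sw_queued;              /* threads the OS must schedule itself */

/* Returns the claimed slot index, or -1 when software must take over. */
static int spawn(void)
{
    for (int i = 0; i < HW_SLOTS; i++)
        if (!slot_used[i]) {        /* common case: a free slot exists */
            slot_used[i] = true;
            return i;
        }
    sw_queued++;                    /* rare case: hardware is full */
    return -1;
}
```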
It's also easy to do round-robin scheduling if you have thread support in the processor. Sure, fine-grained prioritizing in hardware may not be as simple, but a simple queue priority isn't NP-hard. It's the same way QoS works in a layer 2 or layer 3 switch.
The main point is, if you have enough cores, who cares about optimal scheduling. Just run everything, and do a qos level if you need to. The OS threads will always have the QOS to always be running, and so they can reschedule things if the rest goes crazy.
The reason there is no overhead for a context switch is that you keep the memory for each thread on the chip. That is why you need a fair amount of memory per thread on the chip, instead of spending that area on cache, and it is also why you need a crossbar between the cores and the thread states. There is no cost for context switching because the context stays on the other side of the crossbar. The only time a context switch costs anything is when a thread has to be swapped out to external RAM and a new thread switched in. All the registers are essentially part of the memory area for that thread.
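One way to picture "the registers are part of the thread's memory area" is as a struct layout. Everything here is a hypothetical sketch (the field names, the 4 KB slot size, and the 32-register count are assumptions for illustration): the architectural state lives inside the thread's own on-chip slot, so a context switch is just re-pointing a core at a different slot across the crossbar, with nothing copied.

```c
/* Hypothetical layout of one on-chip thread slot: architectural
 * registers live inside the thread's own 4 KB memory block. */
#include <stdint.h>

#define NREGS   32
#define STACKSZ (4096 - NREGS * sizeof(uint64_t) - 3 * sizeof(uint64_t))

typedef struct {
    uint64_t regs[NREGS];       /* general-purpose registers */
    uint64_t pc;                /* program counter */
    uint64_t flags;
    uint64_t qos;               /* priority level for the hardware scheduler */
    uint8_t  stack[STACKSZ];    /* rest of the slot is thread-local memory */
} thread_slot;                  /* one slot per hardware thread, on chip */

typedef struct {
    thread_slot *current;       /* which slot this core is executing */
} core;

/* A "context switch" is just changing the pointer: no register state
 * is saved or restored, because it never left the slot. */
static inline void switch_to(core *c, thread_slot *t)
{
    c->current = t;
}
```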
This idea will not work with only a small number of CPUs, and it will not work well with multiple physical CPUs either, because it assumes uniform memory access (which is not easily done with a large number of cores without a crossbar switch); you were probably working with a non-uniform memory architecture, or else a small number of processors. Unless you put everything on one chip, including enough cores and enough memory (4 KB minimum per thread, 64+ KB better) to handle a large number of threads, you will simply run out of memory bandwidth...
Anyway, it will probably be another 10-20 years before this design is really feasible in a single chip.