Here is a test I ran under Windows XP 64-bit on two machines: one with two dual-core Opterons @ 2.6 GHz, the other with two Xeon E5345s (quad-core @ 2.33 GHz).
Two threads, each sitting in an alertable WaitForMultipleObjectsEx with an infinite timeout.
We call QueueUserAPC(funcA, threadA) to start the ping-pong:
funcA() {
    QueueUserAPC(funcB, threadB);
}
funcB() {
    QueueUserAPC(funcA, threadA);
}
This basically measures context-switching speed. When the whole ping-pong runs in a single thread on the AMD machine (so no switch is needed), each function gets invoked about 550K times/sec.
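For reference, here is a minimal, self-contained sketch of what such a harness could look like. This is a reconstruction, not the original code: the thread/function names, the 5-second measuring window, and the per-function counters are illustrative assumptions.

#include <windows.h>
#include <stdio.h>

static HANDLE g_threadA, g_threadB;      /* worker thread handles        */
static volatile LONG g_countA, g_countB; /* APC invocations per function */

static VOID CALLBACK funcB(ULONG_PTR unused);

/* Each APC just re-queues the other APC onto the other thread. */
static VOID CALLBACK funcA(ULONG_PTR unused)
{
    (void)unused;
    InterlockedIncrement(&g_countA);
    QueueUserAPC(funcB, g_threadB, 0);
}

static VOID CALLBACK funcB(ULONG_PTR unused)
{
    (void)unused;
    InterlockedIncrement(&g_countB);
    QueueUserAPC(funcA, g_threadA, 0);
}

/* The only alertable wait in each thread: APCs are dispatched here. */
static DWORD WINAPI workerThread(LPVOID param)
{
    HANDLE neverSignaled = (HANDLE)param;
    for (;;)
        WaitForMultipleObjectsEx(1, &neverSignaled, FALSE, INFINITE, TRUE);
    return 0;
}

int main(void)
{
    HANDLE ev = CreateEvent(NULL, TRUE, FALSE, NULL); /* never signaled */
    g_threadA = CreateThread(NULL, 0, workerThread, ev, 0, NULL);
    g_threadB = CreateThread(NULL, 0, workerThread, ev, 0, NULL);

    QueueUserAPC(funcA, g_threadA, 0);   /* kick off the ping-pong */
    Sleep(5000);                         /* measure for 5 seconds  */

    printf("funcA: %ld calls/sec, funcB: %ld calls/sec\n",
           g_countA / 5, g_countB / 5);
    return 0;
}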
With two threads, the AMD machine does 140K calls/sec for each function.
The Intel machine can only pull in 62K calls/sec.
Now, this is a very realistic kind of load for a multi-threaded app. A real-life app doesn't spend all of its time doing integer math; it does a lot of context switching, if only to and from the OS. When every 62 context switches eat up a whole millisecond (roughly 16 microseconds per switch), it's very hard to build something that is both multi-threaded and fast.
By the way, in case you wonder, this IS the fastest way to force a context switch that I know of. I also tried passing a 1-byte token between the two threads over a pair of anonymous pipes (each thread does a blocking read of 1 byte, then writes the byte back for the other thread to receive). The results are not inspiring -- about 80K tokens/sec per thread on the AMD machine, even fewer on the Xeon.
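For comparison, here is a sketch of that pipe-based ping-pong. Again a reconstruction, not the original test: the pipe pairing, the 5-second window, and all names are illustrative assumptions.

#include <windows.h>
#include <stdio.h>

static HANDLE g_readAB, g_writeAB;  /* pipe: A -> B          */
static HANDLE g_readBA, g_writeBA;  /* pipe: B -> A          */
static volatile LONG g_tokens;      /* round trips seen by A */

static DWORD WINAPI threadA(LPVOID unused)
{
    BYTE token = 1; DWORD n;
    (void)unused;
    WriteFile(g_writeAB, &token, 1, &n, NULL);   /* start the game */
    for (;;) {
        ReadFile(g_readBA, &token, 1, &n, NULL); /* blocking read  */
        InterlockedIncrement(&g_tokens);
        WriteFile(g_writeAB, &token, 1, &n, NULL);
    }
    return 0;
}

static DWORD WINAPI threadB(LPVOID unused)
{
    BYTE token; DWORD n;
    (void)unused;
    for (;;) {
        ReadFile(g_readAB, &token, 1, &n, NULL);
        WriteFile(g_writeBA, &token, 1, &n, NULL);
    }
    return 0;
}

int main(void)
{
    CreatePipe(&g_readAB, &g_writeAB, NULL, 0);
    CreatePipe(&g_readBA, &g_writeBA, NULL, 0);
    CreateThread(NULL, 0, threadA, NULL, 0, NULL);
    CreateThread(NULL, 0, threadB, NULL, 0, NULL);
    Sleep(5000);
    printf("%ld tokens/sec per thread\n", g_tokens / 5);
    return 0;
}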
More numbers: Dual AMD Opteron 285 (2x dual-core, 2.6 GHz), Windows 32-bit: 105K calls/sec.
Dual Xeon X5365 (2x quad-core, 3.00 GHz): 101K calls/sec.
The conclusion: something about Intel processors really makes context switching slow.
Another conclusion: whenever possible, use x64 code. It's not the "large memory" benefit; it's the extra 8 general-purpose registers that will make your code fly.
How did you get around "If you perform an alertable wait inside an APC, it will recursively dispatch the APCs. This can cause a stack overflow"?
The answer is that you don't perform an alertable wait inside an APC. Each thread enters an alertable wait in only one place; call that the main loop.
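In code, the pattern looks roughly like this (a sketch; the stop event and the names are my own illustration): the top-level loop is the only place the thread waits alertably, and the APC handlers never call SleepEx or any WaitFor*Ex with the alertable flag set, so dispatch can never recurse.

#include <windows.h>

static VOID CALLBACK myApc(ULONG_PTR param)
{
    /* Do the work here; queue further APCs if you like, but never
       perform an alertable wait from inside an APC handler.       */
    (void)param;
}

static DWORD WINAPI apcMainLoop(LPVOID param)
{
    HANDLE stopEvent = (HANDLE)param;   /* signaled at shutdown */
    for (;;) {
        /* The one and only alertable wait in this thread. */
        DWORD r = WaitForMultipleObjectsEx(1, &stopEvent, FALSE,
                                           INFINITE, TRUE);
        if (r == WAIT_OBJECT_0)
            break;                      /* stop requested */
        /* r == WAIT_IO_COMPLETION: one or more APCs just ran; keep waiting. */
    }
    return 0;
}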