Archive for the ‘SSE2’ tag
The priorities of LxEngine have always been to produce simple, general, and correct code rather than to pursue optimal performance. This is arguably a justifiable set of priorities given that LxEngine is a long-term hobby project with a single developer, aimed at experimenting with a variety of different technologies. In any case, I did spend a bit of time looking into the ray tracer’s performance and was able to improve it significantly via a few relatively simple code changes.
I also squeezed out another 18% or so of improvement via compiler flags alone. I’m going to confine this post to the effects of a few of those Visual Studio 2010 compiler optimization flags.
Let’s start with what’s most interesting: the results. These are averaged times from 6 runs of a sample scene with a few lights and a couple of reflective objects at 1024×1024 resolution.
2,464 ms /arch:SSE2 /fp:fast
2,619 ms /fp:fast
2,956 ms /arch:SSE2
3,038 ms /fp:precise (default)
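For reference, these are the four configurations as they would appear on the cl.exe command line. The source file name here is hypothetical, and /O2 is just a plausible baseline optimization level, not necessarily the project’s actual settings:

```shell
:: default: x87 FPU code, /fp:precise
cl /O2 /c raytracer.cpp

:: SSE2 code generation only
cl /O2 /arch:SSE2 /c raytracer.cpp

:: fast floating-point model only
cl /O2 /fp:fast /c raytracer.cpp

:: the fastest combination in the results above
cl /O2 /arch:SSE2 /fp:fast /c raytracer.cpp
```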
The slowest results came from the default compiler settings, which leave the /arch flag unset and use the precise floating-point model (/fp:precise). In this case, the x87 floating-point stack is used, with most calculations taking place in 80-bit precision on the FPU before the results are copied back to memory.
The first change I attempted was enabling /arch:SSE2, which tells the Visual Studio compiler (cl.exe) that it may use Streaming SIMD Extensions 2. With a ray tracer there are surely plenty of optimization opportunities for single instruction, multiple data (SIMD) instructions, and I wondered how much the compiler alone, without any code modifications, could take advantage of this. The answer was, surprisingly, not much at all: the times for this single benchmark were only about 2.7% faster. I didn’t expect the compiler to rework the ray tracer functions into fully parallel computations, but I figured that the presence of eight extra XMM registers alone would have a more significant impact. Lesson: make benchmarks, not assumptions, when optimizing.
I then looked at the generated assembly code and found the CVTPS2PD instruction used quite frequently. Huh? Convert single precision to double precision? Why is it doing that? All my data is in single-precision, 32-bit floating-point form. Why are conversions happening? The reason was the /fp:precise flag. Even if the final results are single-precision, the intermediate calculations were being done in double precision to retain as much accuracy as possible across multiple floating-point operations.
When I turned on the /fp:fast flag, the generated assembly became much more straightforward, as the CVTPS2PD instructions all disappeared. The benchmark also yielded noticeably faster results (18.9% faster), which is quite significant given that no coding effort was required to get this speed improvement. Now, of course, it’s important to note that both the /arch:SSE2 and /fp:fast flags do change the behavior of the code. The XMM registers used by the /arch:SSE2 flag – even in /fp:precise mode – still operate with at most 64 bits of precision, and with /fp:fast enabled, that is reduced to 32 bits. In the initial code, the 80-bit FPU representation was used. A change from 80 bits to 32 bits is non-trivial. I haven’t done any analysis of the actual effect of the precision change, but it does need to be considered.
The final result was, unfortunately, the mysterious one. I could imagine how enabling SSE2 didn’t have much of an effect if the code was constantly converting from single to double precision; that explained my surprise at the minimal effect of /arch:SSE2 without /fp:fast. But to cover all the possibilities, I enabled /fp:fast without making any SIMD instructions or registers available. The result was nearly as fast as using the SSE2 registers and instructions at 32 bits of precision. Huh? I haven’t dug through the assembly comparing /fp:precise and /fp:fast without SSE2 enabled, so at this point I’m simply very surprised. Rather than spout out an untested, unresearched theory on what the compiler must be doing in this case to save so much time, I’ll just leave it at that: I’m surprised.