{"id":23977,"date":"2019-03-19T07:00:25","date_gmt":"2019-03-19T07:00:25","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=23977"},"modified":"2024-09-10T07:57:26","modified_gmt":"2024-09-10T07:57:26","slug":"game-performance-and-compilation-time-improvements-in-visual-studio-2019","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/game-performance-and-compilation-time-improvements-in-visual-studio-2019\/","title":{"rendered":"Game performance and compilation time improvements in Visual Studio 2019"},"content":{"rendered":"

The C++ compiler in Visual Studio 2019 includes several new optimizations and improvements geared towards increasing the performance of games and making game developers more productive by reducing the compilation time of large projects. Although the focus of this blog post is on the game industry, these improvements apply to most C++ applications and C++ developers.<\/p>\n

Compilation time improvements<\/h5>\n
One of the focus points of the C++ toolset team in the VS 2019 release is improving linking time, which in turn allows faster iteration builds and quicker debugging. Two significant changes to the linker help speed up the generation of debug information (PDB files):<\/p>\n
\n
Type pruning in the backend removes type information that is not referenced by any variables and reduces the amount of work the linker must do during type merging.<\/li>\n
Type merging is faster, using a fast hash function to identify identical types.<\/li>\n<\/ul>\n
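The second point can be illustrated with a minimal sketch of hash-based deduplication. This is not the linker's actual implementation: `TypeRecord`, `fnv1a` and `merge_types` are hypothetical names chosen for illustration, and FNV-1a is only a stand-in for whichever fast hash the linker really uses. Records are bucketed by hash, so full comparisons are needed only among records that share a hash value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a debug type record; real PDB type records
// are binary blobs with a richer structure.
struct TypeRecord {
    std::string bytes;  // serialized contents of the record
};

// FNV-1a, used here only as an example of a cheap hash function.
// The hash the linker actually uses is an unknown in this sketch.
inline std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 14695981039346656037ull;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Merges the input records, keeping one copy of each distinct record.
// On return, remap[i] is the index in the merged table of input record i.
inline std::vector<TypeRecord> merge_types(const std::vector<TypeRecord>& input,
                                           std::vector<std::size_t>& remap) {
    std::vector<TypeRecord> merged;
    // hash -> indices into `merged` that share this hash (handles collisions)
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;
    remap.clear();
    remap.reserve(input.size());
    for (const TypeRecord& rec : input) {
        std::vector<std::size_t>& bucket = buckets[fnv1a(rec.bytes)];
        std::size_t found = merged.size();  // sentinel: "not seen yet"
        for (std::size_t idx : bucket) {
            if (merged[idx].bytes == rec.bytes) { found = idx; break; }
        }
        if (found == merged.size()) {
            merged.push_back(rec);      // first occurrence: keep it
            bucket.push_back(found);
        }
        remap.push_back(found);
    }
    assert(remap.size() == input.size());
    return merged;
}
```

Type pruning complements this step: any record removed before merging never has to be hashed or compared at all.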
The table below shows the speedup measured in linking a large, popular AAA game:<\/p>\n\n\n\n\n\n
\n
Debug build\n<\/strong>configuration<\/strong><\/p>\n<\/td>\n
Linking time (sec)\n<\/strong>VS 2017 (15.9)<\/strong><\/td>\n Linking time (sec)\n<\/strong>VS 2019 (16.0)<\/strong><\/td>\n Linking time speedup<\/strong><\/td>\n<\/tr>\n
\/DEBUG:full<\/td>\n \n
392.1<\/p>\n<\/td>\n
\n
163.3<\/p>\n<\/td>\n
\n
2.40x<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n
\/DEBUG:fastlink<\/td>\n 72.3<\/td>\n 31.2<\/td>\n \n
2.32x<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/p>\n
More details and additional benchmarks can be found in this blog post<\/a>.<\/p>\n
Vector (SIMD) expression optimizations<\/h5>\n
One of the most significant improvements in the code optimizer is handling of vector (SIMD) intrinsics, both from source code and as a result of automated vectorization. In VS 2017 and prior, most vector operations would go through the main optimizer without any special handling, similar to function calls, although they are represented as intrinsics – special functions known to the compiler. Starting with VS 2019, most expressions involving vector intrinsics are optimized just like regular integer\/float code using the SSA optimizer<\/a>.<\/p>\n
Both float (e.g. _mm_add_ps<\/em>) and integer (e.g. _mm_add_epi32<\/em>) versions of the intrinsics are supported, targeting the SSE\/SSE2 and AVX\/AVX2 instruction sets. The optimizations performed include, among many others:<\/p>\n
\n
constant folding<\/li>\n
arithmetic simplifications, including reassociation<\/li>\n
handling of cmp, min\/max, abs, extract operations<\/li>\n
converting vector to scalar operations if profitable<\/li>\n
patterns for shuffle and pack operations<\/li>\n<\/ul>\n
Other optimizations, such as common sub-expression elimination, can now take advantage of a better understanding of load\/store vector operations, which are handled like regular loads\/stores. Several ways of initializing a vector register are recognized and the values are used during the expression simplifications (e.g. _mm_set_ps, _mm_set_ps1, _mm_setr_ps, _mm_setzero_ps<\/em> for float values).<\/p>\n
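As a rough illustration of the kind of source these optimizations target, the sketch below builds vectors from constants and combines them with intrinsics. The function names are hypothetical, the code needs an x86 or x64 target, and whether a given compiler version actually folds each expression is an assumption to verify in the generated assembly.

```cpp
#include <cassert>
#include <immintrin.h>  // SSE intrinsics; requires an x86 or x64 target

// Adds two vectors whose lanes are all compile-time constants.
// An optimizer that understands the intrinsics can fold the whole body
// into one constant vector {4, 6, 8, 10} (lanes listed low to high).
inline __m128 folded_constant() {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // lanes given low to high
    __m128 b = _mm_set_ps(6.0f, 5.0f, 4.0f, 3.0f);   // lanes given high to low
    return _mm_add_ps(a, b);
}

// Computes (v + 1) + 2. When reassociation is allowed (for example under
// /fp:fast), the two constants can be combined so a single add of 3 remains.
inline __m128 add_three(__m128 v) {
    __m128 one = _mm_set1_ps(1.0f);
    __m128 two = _mm_set_ps1(2.0f);  // same effect as _mm_set1_ps
    return _mm_add_ps(_mm_add_ps(v, one), two);
}

// Reads one lane by storing the vector to memory, which avoids
// compiler-specific ways of indexing an __m128.
inline float lane(__m128 v, int i) {
    assert(i >= 0 && i < 4);
    alignas(16) float out[4];
    _mm_store_ps(out, v);
    return out[i];
}
```

Comparing the assembly for `folded_constant` and `add_three` across compiler versions and floating-point modes is a quick way to see which simplifications were applied.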
Another important addition is the generation of fused multiply-add (FMA) for vector intrinsics when the \/arch:AVX2 compiler flag is used \u2013 previously it was done only for scalar float code. This allows the CPU to compute the expression a * b + c<\/em> in fewer cycles, which can be a significant speedup in math-heavy code, as one of the examples below shows.<\/p>\n
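Besides saving cycles, a fused multiply-add rounds only once, after the addition, whereas a separate multiply and add round twice. The scalar sketch below uses the standard `std::fma` to make that difference observable; the helper names are hypothetical, and it illustrates FMA semantics in general, not the code the compiler emits for vector intrinsics.

```cpp
#include <cassert>
#include <cmath>

// Product rounded to float first, then the sum rounded again (two roundings).
inline float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;  // volatile forces the intermediate rounding
    return product + c;
}

// std::fma computes a * b + c with a single rounding at the end.
inline float fused(float a, float b, float c) {
    return std::fma(a, b, c);
}

inline void demo() {
    const float a = 1.0f + std::ldexp(1.0f, -12);     // 1 + 2^-12
    const float c = -(1.0f + std::ldexp(1.0f, -11));  // -(1 + 2^-11)
    // The exact product a * a is 1 + 2^-11 + 2^-24. Rounding it to float
    // drops the 2^-24 term, so the separate version cancels to zero.
    assert(mul_then_add(a, a, c) == 0.0f);
    // The fused version keeps that term until the single final rounding.
    assert(fused(a, a, c) == std::ldexp(1.0f, -24));
}
```

This is also why contracting a multiply and an add into an FMA can change results slightly, even though the fused result is the more accurate of the two.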
The following code exemplifies both the generation of FMA with \/arch:AVX2 and the expression optimizations when \/fp:fast is used:<\/p>\n
__m128 test(float a, float b) { <\/span><\/code><\/span>\n\u00a0 \u00a0 __m128 va = _mm_set1_ps(a); <\/span><\/code><\/span>\n\u00a0 \u00a0 __m128 vb = _mm_set1_ps(b); <\/span><\/code><\/span>\n\u00a0 \u00a0 __m128 vd = _mm_set1_ps(-b);<\/span><\/code><\/span><\/p>\n
\u00a0 \u00a0 \/\/ Computes (va * vb) + (va * -vb) <\/span><\/code><\/span>\n\u00a0 \u00a0 return _mm_add_ps(_mm_mul_ps(va, vb),<\/span><\/span><\/code>_mm_mul_ps(va, vd)); <\/span><\/code><\/span><\/span>\n}<\/span><\/span><\/code><\/span><\/p>\n\n\n\n\n\n\nNo simplifications are done; FMA not generated.<\/p>\n<\/td>\n VS 2017 \/arch:AVX2 \/fp:fast\n<\/strong> vmovaps xmm3, xmm0<\/span>\n<\/code>vbroadcastss xmm3, xmm0<\/span><\/code><\/span>\nvxorps xmm0, xmm1, DWORD PTR xmm@80000000800000008000000080000000<\/span><\/code><\/span>\nvbroadcastss xmm0, xmm0<\/span><\/code><\/span>\nvmulps<\/span><\/strong> xmm2, xmm0, xmm3<\/span><\/code><\/span>\nvbroadcastss xmm1, xmm1<\/span><\/code><\/span>\nvmulps<\/span><\/strong> xmm0, xmm1, xmm3<\/span><\/code><\/span>\nvaddps<\/span><\/strong> xmm0, xmm2, xmm0<\/span><\/code><\/span>\nret 0<\/span><\/code><\/span><\/td>\n<\/tr>\n No simplifications done \u2013 not legal under \/fp:precise; FMA generated.<\/td>\n VS 2019 \/arch:AVX2\n<\/strong> vmovaps xmm2, xmm0<\/span><\/code>\nvbroadcastss xmm2, xmm0<\/span><\/code>\nvmovaps xmm0, xmm1<\/span><\/code>\nvbroadcastss xmm0, xmm1<\/span><\/code>\nvxorps xmm1, xmm1, DWORD PTR xmm@80000000800000008000000080000000<\/span><\/code>\nvbroadcastss xmm1, xmm1<\/span><\/code>\nvmulps<\/span><\/strong> xmm0, xmm0, xmm2<\/span><\/code>\nvfmadd231ps<\/strong><\/span> xmm0, xmm1, xmm2<\/span><\/code>\nret 0<\/span><\/code><\/td>\n<\/tr>\n Entire expression simplified to \u201creturn 0\u201d since \/fp:fast allows applying the usual arithmetic rules.<\/td>\n VS 2019 \/arch:AVX2 \/fp:fast<\/p>\nvxorps<\/strong><\/span> xmm0, xmm0, xmm0<\/span><\/code>\nret 0<\/span><\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n <\/p>\n More examples can be found in this older blog post<\/a>, which discusses the SIMD generation of several compilers \u2013 VS 2019 now handles all the cases as expected, and a lot more!<\/p>\n Benchmarking the vector optimizations<\/h5>\nFor measuring the benefit of the vector 
optimizations, Xbox ATG (Advanced Technology Group) provided a benchmark based on code from Unreal Engine 4 for commonly used mathematical operations, such as SIMD expressions, vector\/matrix transformations and sin\/cos\/sqrt functions. The tests are a combination of cases where the values are constants and cases where the values are unknown at compile time. This tests the common scenario where the values are not known at compile-time, but also the situation that arises usually after inlining when some values turn out to be constants.<\/p>\n The table below shows the speedup of the tests grouped into four categories, the execution time (milliseconds) being the sum of all tests in the category. The next table shows the improvements for a few individual tests when using unknown, random values – the versions that use constants are folded now as expected.<\/p>\n\n\n\n\n\n\n\n\nCategory<\/strong><\/p>\n<\/td>\n VS 2017 (ms)<\/strong><\/td>\n VS 2019 (ms)<\/strong><\/td>\n \nSpeedup<\/strong><\/p>\n<\/td>\n<\/tr>\n Math<\/td>\n 482<\/td>\n 366<\/td>\n 27.36%<\/strong><\/span><\/td>\n<\/tr>\n Vector<\/td>\n 337<\/td>\n 238<\/td>\n 34.43%<\/strong><\/span><\/td>\n<\/tr>\n Matrix<\/td>\n 3168<\/td>\n 3158<\/td>\n 0.32%<\/strong><\/span><\/td>\n<\/tr>\n Trigonometry<\/td>\n 3268<\/td>\n 1882<\/td>\n 53.83%<\/strong><\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n <\/p>\n\n\n\n\n\n\n\n\n\nTest<\/strong><\/p>\n<\/td>\n VS 2017 (ms)<\/strong><\/td>\n VS 2019 (ms)<\/strong><\/td>\n \nSpeedup<\/strong><\/p>\n<\/td>\n<\/tr>\n \nVectorDot3<\/p>\n<\/td>\n \n42<\/p>\n<\/td>\n \n39<\/p>\n<\/td>\n \n7.4%<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n \nMatrixMultiply<\/p>\n<\/td>\n \n204<\/p>\n<\/td>\n \n194<\/p>\n<\/td>\n \n5%<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n \nVectorCRTSin<\/p>\n<\/td>\n \n421<\/p>\n<\/td>\n \n402<\/p>\n<\/td>\n \n4.6%<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n \nNormalizeSqrt<\/p>\n<\/td>\n \n82<\/p>\n<\/td>\n \n77<\/p>\n<\/td>\n 
\n7.4%<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n NormalizeInvSqrt<\/td>\n \n106<\/p>\n<\/td>\n \n97<\/p>\n<\/td>\n \n8.8%<\/strong><\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n <\/p>\n Improvements in Unreal Engine 4 – Infiltrator Demo<\/h5>\nTo ensure that our efforts benefit actual games and not just micro-benchmarks, we used the Infiltrator Demo<\/a> as a representative for an AAA game based on Unreal Engine 4.21. Being mostly a cinematic sequence rendered in real-time, with complex graphics, animations and physics, the execution profile is similar to an actual game; at the same time it is a great target for getting the stable, reproducible results needed to investigate performance and measure the impact of compiler improvements.<\/p>\n The main way of measuring a game\u2019s performance is using the frame time. Frame times can be viewed as the inverse of FPS (frames per second), representing the time it takes to prepare one frame to be displayed, lower values being better. The two main threads in Unreal Engine are the gaming thread and rendering thread \u2013 this work focuses mostly on the gaming thread performance.<\/p>\n There are four builds being tested, all based on the default Unreal Engine settings, which use unity (jumbo) builds<\/a> and have \/fp:fast \/favor:AMD64 enabled. Note that the AVX2 instruction set is being used, except for one build that keeps the default AVX:<\/p>\n \nVS 2017 (15.9) with \/arch:AVX2<\/li>\n VS 2019 (16.0) with \/arch:AVX2<\/li>\nVS 2019 (16.0) with \/arch:AVX2 and \/LTCG, to showcase the benefit\nof using link time code generation<\/a><\/li>\n VS 2019 (16.0) with \/arch:AVX, to showcase the benefit of using AVX2 over AVX<\/li>\n<\/ul>\nTesting details:<\/strong><\/p>\n\nTo capture frame times, a custom ETW<\/a> provider was integrated into the game to report the values to Xperf<\/a> running in the background. Each build of the game has one warm-up run, then 10 runs of the entire game with ETW tracing enabled. 
The final frame time is computed, for each 0.5 second interval, as the average of these 10 runs. The process is automated by a script that starts the game once and, after each iteration, restarts the level from the beginning. Of the 210-second (3:30) demo, the first 170 seconds are captured.<\/li>\n Test PC configuration:\n\nAMD Ryzen 2700x CPU (8 cores\/16 threads) fixed at 3.4 GHz to eliminate potential noise in the measurements from dynamic frequency scaling<\/li>\n AMD Radeon RX 470 GPU<\/li>\n 32 GB DDR4-2400 RAM<\/li>\n Windows 10 1809<\/li>\n<\/ul>\n<\/li>\nThe game runs at a resolution of 640×480 to reduce the impact of GPU rendering<\/li>\n<\/ul>\nResults:<\/strong><\/p>\nThe chart below shows the measured frame times up to second 170 for the four tested builds of the game. Frame times range from 4 ms up to 15 ms in the more graphics-intensive part around seconds 155-165<\/a>. To make the difference between builds more obvious, the \u201cfastest\u201d and \u201cslowest\u201d sections are zoomed in. As mentioned before, a lower frame time value is better.<\/p>\n