Getting the most from Parallelized and Vectorized Code
Does the choice of C++ compiler make a difference when writing parallelized and vectorized code? Slashdot Media Contributing Editor Rick Leinecker discusses the hows and whys of choosing your compiler.
For years I have been an advocate of the Microsoft development tools. This goes way back to the command-line compilers that preceded Visual Studio 1.0. There were lots of heated arguments when my programming staff compared Turbo C and Microsoft C—but of course I always came down on the side of Microsoft C (and the Director of Technology always wins that argument).
In recent years, though, I have gained enormous respect for the Intel C++ compiler. I started using it when I installed Parallel Studio, and it integrated easily with Visual Studio. A single submenu selection tells Visual Studio to use the Intel compiler instead of the Microsoft compiler. My benchmarks show that programs compiled with the Intel compiler are almost always faster than those compiled with the Microsoft compiler, and many times they are smaller as well. This is a win-win for software developers.
Vectorization Example
The first example I created in order to compare both compilers relies on auto-vectorization. Both compilers have this feature, which uses the processor's SIMD (single instruction, multiple data) instructions to gain a performance boost. The following code is what I used.
// Data is a ROW x COL integer array; sum should be a 64-bit
// integer, since 400,000,000 additions will overflow a 32-bit int.

// Initialize
for (int j = 0; j < ROW; j++)
{
    for (int i = 0; i < COL; i++)
    {
        Data[j][i] = i * j;
    }
}

// Perform operations that should be vectorized.
for (int j = 0; j < ROW; j++)
{
    for (int i = 0; i < COL; i++)
    {
        sum += Data[i][j] + Data[j][i];
    }
}
The ROW and COL defines each have a value of 20000, so the nested loops add to the sum variable 400,000,000 times. This is enough to put both compilers to the test and see how well they auto-vectorize. The Intel compiler showed a modest advantage over the Microsoft compiler. After running this program I realized that I could probably improve its performance with a well-placed pragma directive that gives the compiler a hint, as shown in the following code. The table that follows the code shows the execution time in milliseconds.
// Perform operations that should be vectorized.
for (int j = 0; j < ROW; j++)
{
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < COL; i++)
    {
        sum += Data[i][j] + Data[j][i];
    }
}
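One practical note: OpenMP pragmas are ignored unless OpenMP support is enabled at compile time. A hedged sketch of the command lines (bench.cpp is a hypothetical file name; check your compiler version's documentation for the exact flags):

```shell
# Microsoft compiler: /openmp enables OpenMP pragmas.
# Newer versions may need /openmp:experimental for #pragma omp simd.
cl /O2 /openmp bench.cpp

# Intel compiler (Windows command line): /Qopenmp enables OpenMP.
icl /O2 /Qopenmp bench.cpp
```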
                     Without Helper Pragma    With Helper Pragma
Microsoft Compiler   6120 ms                  5772 ms
Intel Compiler       5902 ms                  5382 ms
Parallelized Loops
The next comparison I made was of parallelized loops. To do this, I used OpenMP, which is a standardized way of expressing parallelism with pragma directives. The following code is a parallelized loop from the previous code example. You will see the directive #pragma omp parallel for reduction(+:sum), which is all that must be added to parallelize the loop.
The reduction clause prevents a data race on the sum variable. Essentially, behind the scenes there is a private copy of the variable for each thread. At the end of the parallel for loop, the per-thread copies are combined into sum.
// Perform operations that should be parallelized.
for (int j = 0; j < ROW; j++)
{
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < COL; i++)
    {
        sum += Data[i][j] + Data[j][i];
    }
}
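To make the reduction mechanics concrete, here is a sketch of what the runtime effectively does, written with std::thread rather than OpenMP. The function parallel_sum and its chunking scheme are illustrative assumptions, not the actual OpenMP implementation: each thread accumulates into its own private slot, and the partials are combined only after all threads finish.

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Sums a vector in parallel without a data race: each thread writes
// only to its own entry in 'partial', mimicking reduction(+:sum).
long long parallel_sum(const std::vector<long long>& v, int num_threads)
{
    std::vector<long long> partial(num_threads, 0);  // one private sum per thread
    std::vector<std::thread> workers;
    size_t chunk = (v.size() + num_threads - 1) / num_threads;

    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&, t] {
            size_t begin = (size_t)t * chunk;
            size_t end = std::min(v.size(), begin + chunk);
            for (size_t i = begin; i < end; i++)
                partial[t] += v[i];                  // no race: private slot
        });
    }
    for (auto& w : workers)
        w.join();

    // The "reduction" step: combine the per-thread partials.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```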
Here again, the Intel code was faster. The following chart shows the comparison.
Microsoft Compiler   5877 ms
Intel Compiler       5438 ms
Conclusion
Both the Microsoft and Intel compilers are fine pieces of development software. When I need to squeeze out the absolute most performance, I use the Intel compiler. Otherwise, if all things are equal, I don't take the time to change the compiler type and leave it at the default Microsoft compiler.