Intel Xeon Phi Coprocessor High Performance Programming 1



The Intel Xeon Phi coprocessor takes advantage of familiar programming languages, parallelism models, techniques and developer tools available for the Intel architecture. This helps ensure that software companies and IT departments can make greater use of parallel code without retraining developers on the proprietary, hardware-specific programming models associated with accelerators. Intel is providing the software tools to help scientists and engineers optimize their code to take full advantage of Intel Xeon Phi coprocessors, including Intel Parallel Studio XE and Intel Cluster Studio XE. Available today, these tools enable code optimization and, by using the same programming languages and models shared by Intel Xeon Phi coprocessors and the Intel Xeon processor E5 product family, help applications benefit both from tens of Intel Xeon Phi coprocessor cores and from more efficient use of Intel Xeon processor threads.


The Intel Xeon Phi coprocessor 3100 family will provide great value for those seeking to run compute-bound workloads such as life science applications and financial simulations. The Intel Xeon Phi 3100 family will offer more than 1000 Gigaflops (1 TFlops) double-precision performance, support for up to 6GB memory at 240GB/sec bandwidth, and a series of reliability features including memory error correction codes (ECC). The family will operate within a 300W thermal design point (TDP) envelope.







The Intel Xeon Phi coprocessor 5110P provides additional performance at a lower power envelope. It reaches 1,011 Gigaflops (1.01 TFlops) double-precision performance, and supports 8GB of GDDR5 memory at a higher 320 GB/sec memory bandwidth. With 225 watts TDP, the passively cooled Intel Xeon Phi coprocessor 5110P delivers power efficiency that is ideal for dense computing environments, and is aimed at capacity-bound workloads such as digital content creation and energy research. This coprocessor has been delivered to early customers and featured in the 40th edition of the TOP500 list.


To provide early access to new Intel Xeon Phi coprocessor technology for customers such as the Texas Advanced Computing Center (TACC), Intel has additionally offered customized products: the Intel Xeon Phi coprocessor SE10X and the Intel Xeon Phi coprocessor SE10P. These offer 1073 GFlops double-precision performance at a 300W TDP, with the rest of the specification similar to the Intel Xeon Phi coprocessor 5110P.


The Intel Xeon Phi coprocessor 5110P is shipping today, with general availability on Jan. 28 and a recommended customer price of $2,649. The Intel Xeon Phi coprocessor 3100 product family will be available during the first half of 2013 with a recommended customer price below $2,000. Additional information on availability and ordering of the Intel Xeon Phi coprocessor 5110P can be found at www.intel.com/xeonphi.


Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences coupled with insights from many expert customers, Intel Field Engineers, Application Engineers and Technical Consulting Engineers, to create this authoritative first book on the essentials of programming for this new architecture and these new products.


This book is useful even before you ever touch a system with an Intel Xeon Phi coprocessor. To ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high performance microprocessors. Applying these techniques will generally increase your program performance on any system, and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.


"Read this book. Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences coupled with insights from many expert customers, to create this authoritative first book on the essentials of programming for this new architecture and these new products." —Slashdot.org, May 5, 2013


"This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come." —Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University, from the Foreword


Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as less-than-full-vector loop vectorization, Intel MIC specific alignment optimization, and small matrix transpose/multiplication 2D vectorization implemented in the Intel C/C++ and Fortran production compilers for Intel Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct the performance study of our SIMD vectorization techniques. The performance results show that we achieved up to 12.5x performance gain on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x performance speedup from the seamless integration of SIMD vectorization and parallelization.


The Intel Xeon Phi coprocessor is much more sensitive to data alignment than the Intel Xeon E5 processor, so developing an Intel MIC oriented alignment strategy and optimization schemes is one of the key aspects for achieving optimal performance.
(i) Similar to Intel SSE4.2, the SIMD load+op instructions require vector size alignment, which is 64-byte alignment for the Intel MIC architecture. However, simple load/store instructions require the alignment information to be known at compile time on the Intel Xeon Phi coprocessor.
(ii) Different from prior Intel SIMD extensions, all SIMD load/store instructions including gather/scatter require at least element size alignment. Misaligned elements will cause a fault. This necessitates the Intel MIC architecture ABI [8] to require that all memory accesses be elementwise aligned.
(iii) There are no special unaligned load/store instructions in the Intel Initial Many Core Instruction (Intel IMCI) set. This is overcome by using unpacking loads and packing stores that are capable of dealing with unaligned (element-aligned) memory locations. Due to their unpacking and packing nature, these instructions cannot be directly used for masked loads/stores, except under special circumstances.
(iv) The faulting nature of masked memory access instructions in Intel IMCI adds extra complexity to those instructions addressing data outside paged memory, and they may fail even if the actual data access is masked out. The exceptions are gather/scatter instructions.
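The two alignment levels described above (64-byte vector alignment and element-size alignment) can be illustrated with a small portable C sketch. It is not taken from the source: the helper names `is_aligned` and `alloc_floats_aligned64` are hypothetical, and the allocation uses the standard C11 `aligned_alloc` rather than any Intel-specific facility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VECTOR_BYTES 64  /* one 512-bit vector is 64 bytes */

/* Returns nonzero if p is aligned to `alignment` bytes (a power of two). */
int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocates room for n floats on a 64-byte boundary.
 * Returns NULL on failure; release the buffer with free(). */
float *alloc_floats_aligned64(size_t n) {
    size_t bytes = n * sizeof(float);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
     * so round the request up to a whole number of vectors. */
    size_t padded = (bytes + VECTOR_BYTES - 1) / VECTOR_BYTES * VECTOR_BYTES;
    if (padded == 0) {
        padded = VECTOR_BYTES;
    }
    float *p = aligned_alloc(VECTOR_BYTES, padded);
    assert(p == NULL || is_aligned(p, VECTOR_BYTES));
    return p;
}
```

With a buffer from `alloc_floats_aligned64`, the first element satisfies vector-size alignment, while the address of the second element is only element-aligned: `is_aligned(p + 1, sizeof(float))` holds but `is_aligned(p + 1, 64)` does not. The second case is the kind of access that items (ii) and (iii) say must go through element-aligned mechanisms such as unpacking loads.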


We have selected a set of workloads to demonstrate the performance benefits and importance of SIMD vectorization on the Intel MIC architecture. These workloads exhibit a wide range of application behavior that can be found in areas such as high performance computing, financial services, databases, image processing, searching, and other domains. These workloads include the following.


All benchmarks were compiled as native executables using the Intel 13.0 product compilers and run on the Intel Xeon Phi coprocessor system specified in Table 14. To demonstrate the performance gains obtained through SIMD vectorization, two versions of the binaries were generated for each workload. The baseline version was compiled with OpenMP parallelization only (-mmic -openmp -novec); the vectorized version was compiled with vectorization (default ON) and OpenMP parallelization (-mmic -openmp).


The performance scaling is derived by comparing the OpenMP-only execution with the OpenMP plus 512-bit SIMD vector execution on the Intel Xeon Phi coprocessor system described at the beginning of this section. When the workload contains 32-bit single-precision computations, 16-way vectorization may be achieved. When the workload contains 64-bit double-precision computations, 8-way vectorization may be achieved.
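The 16-way and 8-way figures follow directly from dividing the vector width by the element width. A minimal sketch of that arithmetic, with `simd_lanes` as a hypothetical helper name that does not appear in the source:

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in one vector register
 * of vector_bits bits. */
unsigned simd_lanes(unsigned vector_bits, size_t elem_bytes) {
    assert(elem_bytes > 0);
    return (unsigned)(vector_bits / (8 * elem_bytes));
}
```

For a 512-bit register, `simd_lanes(512, sizeof(float))` gives 16 and `simd_lanes(512, sizeof(double))` gives 8, assuming the usual 4-byte `float` and 8-byte `double`.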


Small matrix operations such as addition and multiplication are important parts of many HPC applications. A number of classic compiler optimizations such as complete loop unrolling, partial redundancy elimination (PRE), scalar replacement, and partial summation have been developed to achieve optimal vector execution performance. The conventional inner or outer loop vectorization for 3-level loop nests of 4 × 4 matrix operations does not perform well on the Intel Xeon Phi coprocessor, for two reasons:
(i) less effective use of the 512-bit SIMD unit: for example, for the 32-bit float data type, when either the inner loop or the outer loop is vectorized, 4-way vectorization is used instead of 16-way vectorization;
(ii) side effects on classic optimizations, for example partial redundancy elimination, partial summation, and operator strength reduction, when the loop is vectorized.
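To see why a 2D view helps, note that a 4 × 4 float matrix has exactly 16 elements, which matches one 512-bit vector. The scalar sketch below only illustrates that idea and is not the compiler's actual transformation: it collapses the two outer loops of a 4 × 4 multiply into a single 16-iteration loop whose iterations are independent, so each could map to one vector lane. The function name `matmul4x4_flat` and the row-major layout are assumptions made for the example.

```c
#include <assert.h>

#define N 4  /* matrix dimension; N * N = 16 floats = one 512-bit vector */

/* Computes c = a * b for row-major 4 x 4 matrices.
 * The (i, j) loops are collapsed into one loop over all 16 output
 * elements; each iteration is independent of the others. */
void matmul4x4_flat(const float *a, const float *b, float *c) {
    assert(a != c && b != c);  /* output must not alias the inputs */
    for (int ij = 0; ij < N * N; ij++) {
        int i = ij / N;  /* output row */
        int j = ij % N;  /* output column */
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += a[i * N + k] * b[k * N + j];
        }
        c[ij] = sum;
    }
}
```

Vectorizing only the original inner or outer loop would expose 4 iterations at a time, which is the 4-way case described in item (i); the collapsed loop exposes all 16.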


Effectively exploiting the power of a coprocessor like Xeon Phi requires that both thread- and vector-level parallelism are exploited. While the parallelization topic is beyond the scope of this paper, we would still like to highlight that the SIMD vector extensions can be seamlessly integrated with threading models such as OpenMP 4.0 supported by the Intel compilers. Consider the Mandelbrot example, which computes a graphical image representing a subset of the Mandelbrot set (a well-known 2D fractal shape) out of a range of complex numbers and outputs the number of points inside and outside the set.
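A rough sketch of how threading and SIMD can be combined for such a Mandelbrot count, using standard OpenMP 4.0 directives, might look as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names, the sampled region [-2, 1] x [-1, 1], and the guided schedule are all choices made for the example. If OpenMP is not enabled at compile time, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Iterations before z = z*z + c escapes |z| > 2, capped at max_iter.
 * "declare simd" asks the compiler for a vector variant of the function. */
#pragma omp declare simd uniform(max_iter)
int mandel_iters(float cr, float ci, int max_iter) {
    float zr = 0.0f, zi = 0.0f;
    int it = 0;
    while (zr * zr + zi * zi <= 4.0f && it < max_iter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        it++;
    }
    return it;
}

/* Counts grid points in [-2, 1] x [-1, 1] that do not escape within
 * max_iter iterations. Rows are shared among threads; the points of
 * each row are candidates for SIMD execution. */
long count_inside(int width, int height, int max_iter) {
    assert(width > 0 && height > 0);
    long inside = 0;
    #pragma omp parallel for reduction(+:inside) schedule(guided)
    for (int j = 0; j < height; j++) {
        float ci = -1.0f + 2.0f * (float)j / (float)height;
        long row_inside = 0;
        #pragma omp simd reduction(+:row_inside)
        for (int i = 0; i < width; i++) {
            float cr = -2.0f + 3.0f * (float)i / (float)width;
            row_inside += (mandel_iters(cr, ci, max_iter) == max_iter);
        }
        inside += row_inside;
    }
    return inside;
}
```

The number of points outside the set is then `(long)width * height - count_inside(width, height, max_iter)`. Whether the inner loop is actually vectorized depends on the compiler and target; the directives only permit it.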

