Computer Science

Single Instruction Multiple Data (SIMD)

SIMD is a type of parallel computing architecture that allows multiple processing elements to simultaneously execute the same instruction on different data. It is commonly used in applications that require high-performance computing, such as video and audio processing, scientific simulations, and machine learning. SIMD can significantly improve processing speed and efficiency by reducing the number of instructions needed to perform a task.
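To make the idea concrete, here is a minimal sketch (not drawn from any of the excerpts below): a plain C loop that applies the same multiplication to every array element. Compilers such as GCC and Clang can usually auto-vectorize a loop like this at higher optimization levels (e.g., -O3), emitting SIMD instructions so that one instruction processes several elements at a time; the function name scale is purely illustrative.

    /* The same multiply is applied to every element, so a vectorizing
       compiler can emit SIMD instructions that handle several floats
       per instruction instead of one at a time. */
    void scale(float *x, float factor, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] *= factor;
    }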

Written by Perlego with AI-assistance

5 Key excerpts on "Single Instruction Multiple Data (SIMD)"

  • Computer Architecture

    A Quantitative Approach

    • John L. Hennessy, David A. Patterson(Authors)
    • 2011(Publication Date)
    • Morgan Kaufmann
      (Publisher)
    A question for SIMD architectures, which Chapter 1 introduced, has always been just how wide a set of applications has significant data-level parallelism (DLP). Fifty years later, the answer is not only the matrix-oriented computations of scientific computing, but also the media-oriented image and sound processing. Moreover, since a single instruction can launch many data operations, SIMD is potentially more energy efficient than multiple instruction multiple data (MIMD), which needs to fetch and execute one instruction per data operation. These two answers make SIMD attractive for Personal Mobile Devices. Finally, perhaps the biggest advantage of SIMD versus MIMD is that the programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.
    This chapter covers three variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).
    The first variation, which predates the other two by more than 30 years, means essentially pipelined execution of many data operations. These vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently. Part of that expense was in transistors and part was in the cost of sufficient DRAM bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.
    The second SIMD variation borrows the SIMD name to mean basically simultaneous parallel data operations and is found in most instruction set architectures today that support multimedia applications. For x86 architectures, the SIMD instruction extensions started with the MMX (Multimedia Extensions) in 1996, which were followed by several SSE (Streaming SIMD Extensions) versions in the next decade, and they continue to this day with AVX (Advanced Vector Extensions). To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.
    The third variation on SIMD comes from the GPU community, offering higher potential performance than is found in traditional multicore computers today. While GPUs share features with vector architectures, they have their own distinguishing characteristics, in part due to the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as heterogeneous
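    A hedged illustration of the second variation described above, the x86 multimedia SIMD extensions: the sketch below uses the AVX intrinsics that C compilers expose in immintrin.h, so a single 256-bit add instruction handles eight single-precision floats at once. It is not taken from the book; the function name add_avx and the array sizes are made up, and it assumes an AVX-capable CPU and a compiler flag such as -mavx.

    /* Adding two float arrays with AVX intrinsics (assumes AVX support). */
    #include <immintrin.h>
    #include <stdio.h>

    static void add_avx(const float *a, const float *b, float *c, int n)
    {
        int i = 0;
        /* Each iteration issues one 256-bit add covering 8 floats. */
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)               /* scalar cleanup for the tail */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        float a[10], b[10], c[10];
        for (int i = 0; i < 10; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
        add_avx(a, b, c, 10);
        printf("c[9] = %.1f\n", c[9]);   /* prints 27.0 */
        return 0;
    }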
  • Computer Architecture

    Software Aspects, Coding, and Hardware

    • John Y. Hsu(Author)
    • 2017(Publication Date)
    • CRC Press
      (Publisher)
    CHAPTER 8 Vector and Multiple-Processor Machines

    8.1 VECTOR PROCESSORS

    A SIMD machine provides a general-purpose set of instructions to operate on arrays, namely, vectors. As an example, one add vector instruction can add two arrays and store the result in a third array. That is, each corresponding word in the first and second arrays is added and stored in the corresponding word of the third array. This also means that after a single instruction is fetched and decoded, its EU (execution unit) provides control signals to fetch many operands and execute them in a loop. As a consequence, the overhead of instruction retrievals and decodes is reduced. Because a vector means an array in programming, the terms vector processor, array processor, and SIMD machine are all synonymous. A vector processor provides general-purpose instructions, such as integer arithmetic, floating-point arithmetic, logical, and shift operations, on vectors. Each instruction contains an opcode, the size of the vector, and the addresses of the vectors. A SIMD or vector machine may have its data stream transmitted in serial or in parallel. A parallel data machine uses more hardware logic than a serial data machine.
    8.1.1 Serial Data Transfer
    The execution unit is called the processing element (PE), where the operations are performed. If one PE is connected to one processing element memory (PEM), we have a SIMD machine with serial data transfer as shown in Figure 8.1a. That is, after decoding a vector instruction in the CU (control unit), an operand stream is fetched and executed serially in a hardware loop. In other words, serial data are transferred on the data bus between the PE and PEM on a continuous basis until the execution is completed. In a serial data SIMD machine, there is one PE and one PEM. However, one instruction retrieval is followed by many operand fetches.
    8.1.2 Parallel Data Transfer
    If multiple PEs are tied to the CU and each PE is connected to a PEM, we have a parallel data machine, as shown in Figure 8.1b
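    To connect the add-vector example above to code (this sketch is not from the book): the C function below does in software what the excerpt describes a single vector instruction doing in hardware. The point is that on a SIMD machine the whole loop corresponds to one instruction fetch and decode, after which the execution unit streams through the operands; the name vadd and the argument order are assumptions made for the sketch.

    #include <stddef.h>

    /* Software equivalent of an "add vector" instruction: one opcode,
       a vector length n, and the addresses of the three vectors. */
    void vadd(size_t n, const int *a, const int *b, int *c)
    {
        /* On a vector/SIMD machine this loop is a single instruction: the EU
           fetches the many operands and executes them under hardware control. */
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }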
  • Advanced Computer Architectures
    • Sajjan G. Shiva(Author)
    • 2018(Publication Date)
    • CRC Press
      (Publisher)
    5.7 provide further details on these language extensions. There are also compilers that translate serial programs into data-parallel object codes.
    An algorithm that is efficient for SISD implementation may not be efficient for an SIMD, as illustrated by the matrix multiplication algorithm of Section 5.3. Thus, the major challenge in programming SIMDs is in devising an efficient algorithm and corresponding data partitioning such that all the PEs in the system are kept busy throughout the execution of the application. This also requires minimizing conditional branch operations in the algorithm.
    The data exchange characteristics of the algorithm dictate the type of IN needed. If the desired type of IN is not available, routing strategies that minimize the number of hops needed to transmit data between non-neighboring PEs will have to be devised.
    5.6 Example Systems
    The Intel and MIPS processors described in Chapter 1 have SIMD features. The supercomputer systems described in Chapter 4 also operate in an SIMD mode. This section provides brief descriptions of the hardware, software, and application characteristics of two SIMD systems. The ILLIAC-IV has been the most famous experimental SIMD architecture and is selected for its historical interest. Thinking Machines Corporation's Connection Machine series, although no longer in production, was originally envisioned for data-parallel symbolic computations and later accommodated numeric applications.
    5.6.1 ILLIAC-IV
    The ILLIAC-IV project was started in 1966 at the University of Illinois. The objective was to build a parallel machine capable of executing 10⁹ instructions per second. To achieve this speed, a system with 256 processors controlled by a control processor was envisioned. The set of processors was divided into 4 quadrants of 64 processors each, each quadrant to be controlled by one control unit. Only one quadrant was built and it achieved a speed of 2 × 10⁸ instructions per second.
    Figure 5.22
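    The excerpt above notes that efficient SIMD algorithms minimize conditional branch operations so that all PEs stay busy. A common way to express this in code is to compute both outcomes for every lane and pick results with a mask instead of branching. The sketch below shows the idea with x86 SSE2 intrinsics; it is not from the book, the function name select4 is invented, and it computes, lane by lane, out[i] = a[i] > 0 ? 2*a[i] : a[i] + 1.

    #include <emmintrin.h>

    /* Branch-free, per-lane selection over four floats at once. */
    void select4(const float *a, float *out)
    {
        __m128 va    = _mm_loadu_ps(a);
        __m128 twice = _mm_add_ps(va, va);                 /* 2*a[i]     */
        __m128 plus1 = _mm_add_ps(va, _mm_set1_ps(1.0f));  /* a[i] + 1   */
        __m128 mask  = _mm_cmpgt_ps(va, _mm_setzero_ps()); /* a[i] > 0 ? */
        /* Keep 'twice' where the mask is set, 'plus1' where it is not. */
        __m128 res   = _mm_or_ps(_mm_and_ps(mask, twice),
                                 _mm_andnot_ps(mask, plus1));
        _mm_storeu_ps(out, res);
    }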
  • A Survey of Computational Physics

    Introductory Computational Science

    This [B] = [A][B] multiplication is an example of data dependency, in which the data elements used in the computation depend on the order in which they are used. In contrast, the matrix multiplication [C] = [A][B] is a data-parallel operation in which the data can be used in any order. So already we see the importance of communication, synchronization, and understanding of the mathematics behind an algorithm for parallel computation. The processors in a parallel computer are placed at the nodes of a communication network. Each node may contain one CPU or a small number of CPUs, and the communication network may be internal to or external to the computer. One way of categorizing parallel computers is by the approach they employ in handling instructions and data. From this viewpoint there are three types of machines:
    • Single-instruction, single-data (SISD): These are the classic (von Neumann) serial computers executing a single instruction on a single data stream before the next instruction and next data stream are encountered.
    • Single-instruction, multiple-data (SIMD): Here instructions are processed from a single stream, but the instructions act concurrently on multiple data elements. Generally the nodes are simple and relatively slow but are large in number.
    • Multiple-instruction, multiple-data (MIMD): In this category each processor runs independently of the others with independent instructions and data. These are the types of machines that employ message-passing packages, such as MPI, to communicate among processors. They may be a collection of workstations linked via a network, or more integrated machines with thousands of processors on internal boards, such as the Blue Gene computer described in §14.13. These computers, which do not have a shared memory space, are also called multicomputers.
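    As a small sketch of the distinction drawn at the start of this excerpt (not taken from the book): the first function below computes [C] = [A][B], where every result element depends only on the unchanged inputs, so the iterations are data parallel and could run in any order or all at once. The second, naively writing the product back into B, is data dependent, because later iterations may read elements of B that have already been overwritten. The names, the fixed size N, and the row-major layout are assumptions of the sketch.

    #include <stddef.h>

    #define N 4

    /* Data parallel: every c[i][j] reads only the original a and b. */
    void matmul(const double a[N][N], const double b[N][N], double c[N][N])
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }

    /* Data dependent: overwriting b while it is still an input makes the
       result depend on iteration order, so this version is wrong as well as
       unparallelizable; a correct in-place product must buffer rows or
       columns of b before overwriting them. */
    void matmul_inplace_naive(const double a[N][N], double b[N][N])
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];  /* may read overwritten data */
                b[i][j] = sum;
            }
    }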
  • Professional Parallel Programming with C#

    Master Parallel Extensions with .NET 4

    • Gastón C. Hillar(Author)
    • 2010(Publication Date)
    • Wrox
      (Publisher)
    Chapter 11 Vectorization, SIMD Instructions, and Additional Parallel Libraries
    What's in this Chapter?
    • Understanding SIMD and vectorization
    • Understanding extended instruction sets
    • Working with Intel Math Kernel Library
    • Working with multicore-ready, highly optimized software functions
    • Mixing task-based programming with external optimized libraries
    • Generating pseudo-random numbers in parallel
    • Working with the ThreadLocal<T> class
    • Using Intel Integrated Performance Primitives
    In the previous 10 chapters, you learned to create and coordinate code that runs many tasks in parallel to improve performance. If you want to improve throughput even further, you can take advantage of other possibilities offered by modern hardware related to parallelism. This chapter is about the usage of additional performance libraries and includes examples of their integration with .NET Framework 4 and the new task-based programming model. In addition, the chapter provides examples of the usage of the new thread-local storage classes and the lazy-initialization capabilities provided by these classes.

    Understanding SIMD and Vectorization

    The “Parallel Programming and Multicore Programming” section of Chapter 1, “Task-Based Programming,” introduced the different kinds of parallel architectures. That section also explained that most modern microprocessors can execute Single Instruction, Multiple Data (SIMD) instructions. Because the execution units for SIMD instructions usually belong to a physical core, it is possible to run as many SIMD instructions in parallel as there are physical cores. Using these vector-processing capabilities in parallel can provide significant speedups in certain algorithms.
    Here's a simple example that will help you understand the power of SIMD instructions. Figure 11-1 shows a diagram that represents the PABSD
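    The excerpt is cut off at this point, but PABSD is the x86 SSSE3 instruction that computes the absolute value of four packed signed 32-bit integers in a single operation, which is presumably what Figure 11-1 illustrates. As a hedged sketch (in C rather than the book's C#), the corresponding compiler intrinsic is _mm_abs_epi32; the values used here are arbitrary.

    /* PABSD: absolute value of four packed signed 32-bit integers
       (SSSE3; e.g., compile with -mssse3). */
    #include <tmmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m128i v      = _mm_setr_epi32(-1, 2, -30000, 40000);
        __m128i result = _mm_abs_epi32(v);           /* emits PABSD */

        int out[4];
        _mm_storeu_si128((__m128i *)out, result);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 1 2 30000 40000 */
        return 0;
    }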