joryanick.com



memcpy() function wizardry - MMX / SSE / SIMD

Fast x86 memcpy() techniques

Last Updated: January 14, 2016

These are the remaining bones from an adventure that began on and around September 24th, 2001. Today in 2016 this page is just as relevant as it was over a decade ago, so pick and choose the things you find useful and do something amazing. For me, it was video compression and streaming, and the ideas on this page were nestled deep in my old (circa 2001) Emulive Interactive Broadcast Platform. SGI's original code follows first, and I've added more useful references in this last update. Onward & godspeed, Captain.

Jory

Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540

Contents

  • Overview
  • Visual Workstation 320 and 540 Memory System
  • Memory Types
  • Optimizing a Basic Memory Copy Routine
    • Using Write Combining
    • Prefetching Source Data
    • Minimizing Memory Access Times
  • Compute-Bound vs. Memory-Bound Functions
  • A Note on MP Systems
  • Sample Code
  • Related Documents

Overview

CPU to memory read and write performance is affected by a number of factors and can vary dramatically between the best and worst case scenarios. Performance gains of more than a factor of two have been observed by optimizing standard code to make the best use of the SGI Visual Workstation 320 and 540 memory system. This paper walks through successive optimizations to a memory copy routine to illustrate these techniques.

The performance numbers were measured on a Visual Workstation 540 with 512 MB of memory and four Pentium III Xeon processors. While the system has four processors, the code is single-threaded and no other applications were run while the memory performance was measured. Consequently, similar performance should be seen on uniprocessor or dual-processor systems. Visual Workstation 320 systems with Pentium III processors may see slightly lower performance numbers due to the half-speed L2 cache.

While this white paper is targeted towards the SGI Visual Workstation 320 and 540 systems, it should be generally applicable to all Pentium-III class systems.


Visual Workstation 320 and 540 Memory System

The Cobalt chipset's memory controller provides access to the 320 and 540's 3.2 GB/s high-performance memory system. It services the Pentium processors as well as other subsystems such as Cobalt Graphics and Cobalt I/O.

Processors gain access to the memory controller via the 800 MB/s processor front-side bus. The actual achievable rate on the front-side bus may be significantly less than 800 MB/s depending on the rate of processor read/write requests and on memory latencies. These factors roughly correspond to the two links that processor traffic must traverse to reach memory: the processor to Cobalt Memory Controller link, and the Cobalt Memory Controller to memory link. Factors affecting efficiency on both links will be discussed.


Memory Types

Pentium II and III processors support three memory types: cached, uncached, and write-combined uncached memory. This paper is concerned with cached memory accesses, which is the type of memory most commonly used by applications.

Uncached accesses are considerably slower than cached memory accesses and will in general not benefit from the following discussion. Uncached memory is typically used by device drivers or other components that interact directly with hardware.

Write-combined uncached memory is an optimization to uncached memory and is generally applied to frame buffer memory. In standard uncached memory accesses, each register read or write results in a separate front-side bus transaction. Write combining allows the processor to gather all writes within a single cache line into a single write-to-memory transaction, significantly improving the performance of writes to uncached memory. Since processor reads remain uncached, the following discussion applies if the target of the memory copy is write-combined uncached memory, but not if the source is.


Optimizing a Basic Memory Copy Routine

Memory copies can be implemented in several ways on Pentium II and Pentium III systems. The rep movsd string instruction can be used for block copies, or data can be moved through the CPU registers (mov / movq) if it will be manipulated while in transit. Because the application has control over the register reads and writes, the performance of register copies can vary greatly, while rep movsd performance is essentially fixed.

This paper discusses memory copies using the MMX registers. While it focuses on pure data movement in and out of registers, the discussion is also applicable when the data is manipulated, such as in color space conversion, color correction, and compositing.

On the Visual Workstations 320 and 540, the rep movsd instruction will outperform a naive register-based copy by approximately 15% (112 MB/s vs. 95 MB/s), although, as will be shown, optimized code can in turn outperform rep movsd by more than a factor of two. The MMX code fragment shown in Example 1, a simple copy routine, will be used as a starting point for discussion.

Example 1: MMX memcpy (95 MB/s)

_asm { 
        mov esi, src 
        mov edi, dest 
        mov ecx, nbytes 
        shr ecx, 6 // 64 bytes per iteration 

loop1: 
        movq mm1,  0[ESI] // Read in source data 
        movq mm2,  8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movq  0[EDI], mm1 // Write to destination 
        movq  8[EDI], mm2 
        movq 16[EDI], mm3 
        movq 24[EDI], mm4 
        movq 32[EDI], mm5 
        movq 40[EDI], mm6 
        movq 48[EDI], mm7 
        movq 56[EDI], mm0 

        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loop1 

        emms 
} 

Using Write Combining

The Pentium II and III CPU caches operate on 32-byte cache lines. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-to-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider an 8-byte MMX register store (movq). Since this is only one quarter of a cache line, it is treated as a read-modify-write operation from the cache's perspective: the target cache line is fetched into cache, then the 8-byte write occurs. In the case of a memory copy this fetched data is unnecessary, because subsequent stores will overwrite the remainder of the cache line.

The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line and then perform a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when a non-temporal store instruction is used. Memory is generally marked write combining only when it is used for frame buffers; memory allocated by VirtualAlloc is either uncached or cached, but not write combining.

The movntps and movntq non-temporal store instructions, introduced with the Pentium III's SSE extensions, instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, they also enable write combining if the target memory is cached.

Non-temporal stores are weakly ordered, so fence instructions (sfence) may be necessary to ensure memory operation ordering with respect to other devices. In addition, to gain the most from write combining, applications must take some care to localize their memory accesses: if a write-combining buffer is in use, a stray memory access can cause the buffer to be flushed prematurely. The Intel Architecture Software Developer's Manual describes these issues in more detail.

Example 2: MMX memcpy with non-temporal stores (135 MB/s)

_asm { 
        mov esi, src 
        mov edi, dest 
        mov ecx, nbytes 
        shr ecx, 6 // 64 bytes per iteration 

loop1: 
        movq mm1,  0[ESI] // Read in source data 
        movq mm2,  8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movntq  0[EDI], mm1 // Non-temporal stores 
        movntq  8[EDI], mm2 
        movntq 16[EDI], mm3 
        movntq 24[EDI], mm4 
        movntq 32[EDI], mm5 
        movntq 40[EDI], mm6 
        movntq 48[EDI], mm7 
        movntq 56[EDI], mm0 

        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loop1 

        emms 
} 

Prefetching Source Data

Memory reads incur a delay as the system's memory controller fetches the data requested by the CPU. Beyond the delay imposed by the memory system itself, the actual delay may be increased by other system components contending for access to memory, or by other CPUs contending for the processor bus. The Pentium II and Pentium III processors can absorb some amount of latency through out-of-order and speculative execution; at some point, however, a memory-bound application will stall.

An application can reduce stalls by fetching data before it is actually needed. The Pentium III adds prefetch instructions for this purpose. Data may be fetched into the cache hierarchy (if it will be frequently accessed) or into a non-temporal cache structure that minimizes cache pollution (if it will not). Choosing how to prefetch is largely a question of what else the application needs to keep in the caches, since a cache collision with image data may force other data, such as lookup tables, out of the cache. Actual performance for the memcpy example remains at 160-165 MB/s whether the prefetches target the non-temporal cache structure (prefetchnta), all cache levels (prefetcht0), or the L2 cache (prefetcht1, prefetcht2). (Note that Pentium III processors whose L2 cache runs at half the speed of the processor core and L1 cache may incur an additional penalty on prefetches that fill L2.)

Example 3: MMX memcpy with prefetch and non-temporal stores (165 MB/s)

_asm { 
        mov esi, src 
        mov edi, dest 
        mov ecx, nbytes 
        shr ecx, 6 // 64 bytes per iteration 

loop1: 
        prefetchnta 64[ESI] // Prefetch next loop, non-temporal 
        prefetchnta 96[ESI] 

        movq mm1,  0[ESI] // Read in source data 
        movq mm2,  8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movntq  0[EDI], mm1 // Non-temporal stores 
        movntq  8[EDI], mm2 
        movntq 16[EDI], mm3 
        movntq 24[EDI], mm4 
        movntq 32[EDI], mm5 
        movntq 40[EDI], mm6 
        movntq 48[EDI], mm7 
        movntq 56[EDI], mm0 

        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loop1 

        emms 
} 

Minimizing Memory Access Times

The optimizations discussed up to this point deal with the CPU's interaction with the caches and the memory controller, and tend to improve performance on any platform with a Pentium II or Pentium III processor. We now turn to practices that allow the memory controller itself to service requests more efficiently. Since the memory controller is part of the Cobalt chipset, these optimizations are particularly advantageous on the Visual Workstations 320 and 540 and may or may not be effective with other chipsets. They are, however, unlikely to degrade performance on other platforms.

Applications can gain additional memory access performance by maximizing the locality of their memory accesses. That is, accesses to addresses within the same page (4 KB) will generally be faster than accesses to addresses scattered randomly throughout physical memory. On the surface, most image processing and streaming applications would seem to provide good access locality, since they step sequentially through buffers. Note, however, that Example 3 has rather poor locality overall: its reads are sequential and its writes are sequential, but it bounces between the read buffer and the write buffer, and the interleaving of the two streams spoils the overall memory access locality.

In addition to maximizing locality, additional performance can be obtained by reducing the number of turnarounds between reads and writes. A small penalty is paid whenever reads and writes are interleaved; two cache-line reads followed by two cache-line writes do not take the same amount of time as four cache-line reads or four cache-line writes. That "small" penalty adds up when the number of turnarounds is significant. Note that Example 3 switches between reading and writing every 64 bytes (two cache lines).

In the context of the memcpy example, increasing memory access locality and reducing the number of read-write turnarounds poses a single problem: the application can only stream as many reads as it has places to hold the read data. As it turns out, accesses to the Pentium II and Pentium III L1 cache are quite fast, since the cache runs at the same rate as the CPU core. Consequently, Example 4 uses 2 KB of L1 cache as a temporary store. Data is streamed from the source system memory through the CPU registers into the L1-resident buffer; it is then streamed from that buffer through the CPU registers to the target buffer in system memory. The result is roughly a 45% improvement in the overall memcpy rate compared to Example 3 (240 MB/s vs. 165 MB/s), and about two and a half times the rate of Example 1.

Streaming data into and out of system memory provides good locality and reduces the number of read-write turnarounds significantly (a factor of 64 in this case). It may not always be possible to accomplish both objectives. For example, a compositing application requires two (or more) source buffers and one target buffer. In these cases, it is generally advantageous to minimize the read-write turnaround while prefetching reads to minimize the impact of poor locality (resulting in increased memory read access time).

Example 4: MMX memcpy with prefetch, non-temporal stores, and streaming reads/writes (240 MB/s)

_asm { 
        mov esi, src 
        mov ecx, nbytes 
        mov ebx, ecx 
        shr ebx, 11 // 2048 bytes at a time 
        mov edi, dest 

loop2k: // Copy 2k into temporary buffer 
        push edi 
        mov edi, tbuf 
        mov ecx, 2048 
        shr ecx, 6 

loopMemToL1: 
        prefetchnta 64[ESI] // Prefetch next loop, non-temporal 
        prefetchnta 96[ESI] 

        movq mm1,  0[ESI] // Read in source data 
        movq mm2,  8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movq  0[EDI], mm1 // Store into L1 
        movq  8[EDI], mm2 
        movq 16[EDI], mm3 
        movq 24[EDI], mm4 
        movq 32[EDI], mm5 
        movq 40[EDI], mm6 
        movq 48[EDI], mm7 
        movq 56[EDI], mm0 
        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loopMemToL1 

        pop edi // Now copy from L1 to system memory 
        push esi 
        mov esi, tbuf 
        mov ecx, 2048 
        shr ecx, 6 

loopL1ToMem: 
        movq mm1, 0[ESI] // Read in source data from L1 
        movq mm2, 8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movntq 0[EDI], mm1 // Non-temporal stores 
        movntq 8[EDI], mm2 
        movntq 16[EDI], mm3 
        movntq 24[EDI], mm4 
        movntq 32[EDI], mm5 
        movntq 40[EDI], mm6 
        movntq 48[EDI], mm7 
        movntq 56[EDI], mm0 

        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loopL1ToMem 

        pop esi // Do next 2k block 
        dec ebx 
        jnz loop2k 
} 

Compute-Bound vs. Memory-Bound Functions

Most of the optimizations described in this document apply to both compute-bound and memory-bound routines. In the case of compute-bound routines, they minimize the amount of time "wasted" in moving data into and out of the CPU. For example, a color space conversion routine originally spent 50% of its time in memory I/O and 50% of its time doing the actual conversion. After optimizing memory accesses, the amount of time required for memory I/O was reduced to 10%, keeping the CPU occupied 90% of the time doing computation.

On the other hand, the color space conversion code above does not benefit from read-write streaming, whose purpose is to reduce memory I/O latency: if a compute-bound routine prefetches its data, the compute cycles themselves already hide much of the latency introduced by the memory controller.


A Note on MP Systems

The discussion above regarding locality and read-write streaming applies "per device." The Cobalt chipset services several devices; however, it treats the entire processor front-side bus as a single device. Consequently, locality and read-write streaming should be viewed from the perspective of all CPUs together, not just a single CPU.

In most cases, interference between processors is minimal. This is particularly true if only one processor performs memory-intensive tasks while the other processors perform compute-intensive tasks. However, if memory-intensive code is multi-threaded (e.g. if different threads act in tandem on different parts of an image), their performance may degrade. For example, running three copies of the Example 4 copy code on three processors simultaneously lowers the overall copy rate to approximately 195 MB/s.


Sample Code

The following samples are available and should accompany this document:

  • Basic bcopy (or memcpy)
  • Picture vertical flip
  • Color space conversion with interpolated chrominance

Related Documents

For more information on developing for the Intel Pentium II and Pentium III processors discussed in this paper, see the following documentation:

  • Intel Architecture Software Developer's Manual, Intel Corporation, order number 243192
  • Intel Architecture MMX Technology Programmer's Reference Manual, Intel Corporation, order number 243007
  • Intel Architecture Optimization Reference Manual, Intel Corporation, order number 245127
(c) Silicon Graphics

This page is cited by:
"Large Memory Fast Copy Technology Based on SSE Instructions"
QIAN Chang-song, LIU Zhi-gang and LIU Dai-zhi
Second Artillery Institute of Engineering, Xian Shanxi, China

2004 article by f2 on Codeproject: Using SSE/SSE2 for optimization
