Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540
Overview
CPU to memory read and write performance is affected by a number of factors and can vary dramatically between best- and worst-case scenarios. Performance gains of more than a factor of two have been observed by optimizing standard code to make the best use of the SGI Visual Workstation 320 and 540 memory system. This paper walks through successive optimizations to a memory copy routine to illustrate the techniques.
The performance numbers were measured on a Visual Workstation 540 with 512 MB of memory and four Pentium III Xeon processors. While the system has four processors, the code is single-threaded and no other applications were run while the memory performance was measured. Consequently, similar performance should be seen on uniprocessor or dual-processor systems. Visual Workstation 320 systems with Pentium III processors may see slightly lower performance numbers due to the half-speed L2 cache.
While this white paper is targeted towards the SGI Visual Workstation 320 and 540 systems, it should be generally applicable to all Pentium-III class systems.
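Throughput figures like those quoted in this paper can be reproduced with a simple timing harness along the following lines. This is a minimal sketch, not the harness actually used for these measurements; measure() and the copy_test callback are illustrative names:

#include <windows.h>
#include <stdio.h>

// Time a copy routine and report its sustained rate in MB/s.
// copy_test() stands in for any of the copy variants in this paper.
void measure(void (*copy_test)(void *dest, const void *src, int nbytes),
             void *dest, const void *src, int nbytes, int iterations)
{
    LARGE_INTEGER freq, start, stop;
    double seconds, mbytes;
    int i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (i = 0; i < iterations; i++)
        copy_test(dest, src, nbytes);
    QueryPerformanceCounter(&stop);

    seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    mbytes = ((double)nbytes * iterations) / (1024.0 * 1024.0);
    printf("%.1f MB/s\n", mbytes / seconds);
}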
Visual Workstation 320 and 540 Memory System
The Cobalt chipset's memory controller provides access to the 320 and 540's 3.2 GB/s high-performance memory system. It services the Pentium processors as well as other subsystems such as Cobalt Graphics and Cobalt I/O.
Processors gain access to the memory controller via the 800 MB/s processor front-side bus. The actual achievable rate on the front-side bus may be significantly less than 800 MB/s, depending on the rate of processor read/write requests and on memory latencies. These two factors roughly correspond to the two links a request must traverse to reach memory: the processor to Cobalt memory controller link, and the Cobalt memory controller to memory link. Factors affecting efficiency on both links are discussed below.
Memory Types
Pentium II and III processors support three memory types: cached, uncached, and write-combined uncached memory. This paper is concerned with cached memory accesses, which is the type of memory most commonly used by applications.
Uncached accesses are considerably slower than cached memory accesses and will in general not benefit from the following discussion. Uncached memory is typically used by device drivers or other components that interact directly with hardware.
Write-combined uncached memory is an optimization of uncached memory and is generally applied to frame buffer memory. With standard uncached memory, each register read or write results in a separate front-side bus transaction. Write combining allows the processor to gather all writes within a single cache line into a single write-to-memory transaction, significantly improving the performance of writes to uncached memory. Since processor reads remain uncached, the following discussion applies if the target of the memory copy is write-combined uncached memory, but not if the source is.
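As a sketch of the difference, consider the fragment below, which writes one cache line's worth of pixels through a frame-buffer pointer assumed to have been mapped write-combined by the display driver (the function and its parameters are hypothetical):

/* Hypothetical illustration: 'fb' is assumed to point at frame-buffer
 * memory that the display driver has mapped write-combined. */
void write_one_line(volatile unsigned long *fb, const unsigned long *pixels)
{
    int i;

    /* Eight dword stores spanning one 32-byte cache line: with a
     * write-combined mapping the processor gathers them into a single
     * bus transaction; with plain uncached memory, each store would be
     * a separate front-side bus transaction. */
    for (i = 0; i < 8; i++)
        fb[i] = pixels[i];
}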
Optimizing a Basic Memory Copy Routine
Memory copies can be implemented in several ways on Pentium II and Pentium III systems. The macro instruction rep movsd can be used for block copies, or data can be moved through the CPU registers (mov / movq) if it will be manipulated while in transit. Because the application controls the individual register reads and writes, the performance of register-based copies can vary greatly, while rep movsd performance is essentially fixed.
This paper discusses memory copies using the MMX registers. While it focuses on pure data movement in and out of registers, the discussion is also applicable when the data is manipulated, such as in color space conversion, color correction, and compositing.
On the Visual Workstations 320 and 540, the rep movsd instruction will outperform a naive register-based copy by roughly 18% (112 MB/s vs. 95 MB/s), although, as will be shown, optimized register-based code can outperform rep movsd by more than a factor of two. The MMX code fragment shown in Example 1, a simple copy routine, will be used as the starting point for the discussion.
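For reference, the rep movsd copy measured above can be written as follows, using the same src, dest, and nbytes variables assumed by the examples below (nbytes is assumed to be a multiple of 4, and the direction flag is assumed clear, as it is under Windows):

_asm {
    mov esi, src
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 2      // rep movsd moves one 4-byte dword per iteration
    rep movsd       // Block copy of ecx dwords from [esi] to [edi]
}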
Example 1: MMX memcpy (95 MB/s)

_asm {
    mov esi, src
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 6              // 64 bytes per iteration
loop1:
    movq mm1, 0[ESI]        // Read in source data
    movq mm2, 8[ESI]
    movq mm3, 16[ESI]
    movq mm4, 24[ESI]
    movq mm5, 32[ESI]
    movq mm6, 40[ESI]
    movq mm7, 48[ESI]
    movq mm0, 56[ESI]
    movq 0[EDI], mm1        // Write to destination
    movq 8[EDI], mm2
    movq 16[EDI], mm3
    movq 24[EDI], mm4
    movq 32[EDI], mm5
    movq 40[EDI], mm6
    movq 48[EDI], mm7
    movq 56[EDI], mm0
    add esi, 64
    add edi, 64
    dec ecx
    jnz loop1
    emms
}

Using Write Combining
The Pentium II and III CPU caches operate on 32-byte cache-line-sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-to-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store (movq). Since this is only one quarter of a cache line, it is treated as a read-modify-write operation from the cache's perspective: the target cache line is fetched into the cache, then the 8-byte write occurs. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line.

The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line and then perform a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when a non-temporal store instruction is used. Memory is generally marked write combining only when it is used for frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The movntq (MMX register) and movntps (XMM register) non-temporal store instructions tell the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, a non-temporal store also enables write combining when the target memory is cached.
Non-temporal stores are weakly ordered, so fence instructions may be necessary to guarantee memory operation ordering among different devices. In addition, to gain the most from write combining, applications must take some care to localize their memory accesses: while a write-combining buffer is in use, a stray memory access can cause the buffer to be flushed prematurely. The Intel Architecture Software Developer's Manual describes these issues in more detail.
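For instance, if the copied buffer is handed off immediately to another device or processor, placing a store fence after the last non-temporal store guarantees that the weakly ordered stores have become globally visible. This is a minimal sketch; whether the fence is actually required depends on the consumer of the data:

_asm {
    // ... movntq stores, as in Example 2 below ...
    sfence          // Drain the write-combining buffers: all stores
                    // above complete before any store after the fence
}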
Example 2: MMX memcpy with non-temporal stores (135 MB/s)

_asm {
    mov esi, src
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 6              // 64 bytes per iteration
loop1:
    movq mm1, 0[ESI]        // Read in source data
    movq mm2, 8[ESI]
    movq mm3, 16[ESI]
    movq mm4, 24[ESI]
    movq mm5, 32[ESI]
    movq mm6, 40[ESI]
    movq mm7, 48[ESI]
    movq mm0, 56[ESI]
    movntq 0[EDI], mm1      // Non-temporal stores
    movntq 8[EDI], mm2
    movntq 16[EDI], mm3
    movntq 24[EDI], mm4
    movntq 32[EDI], mm5
    movntq 40[EDI], mm6
    movntq 48[EDI], mm7
    movntq 56[EDI], mm0
    add esi, 64
    add edi, 64
    dec ecx
    jnz loop1
    emms
}
Prefetching
Memory reads incur a delay as the system's memory controller fetches the data requested by the CPU. Beyond the delay imposed by the memory system itself, the actual delay may be increased by other system components contending for access to memory, or by other CPUs contending for the processor bus. The Pentium II and Pentium III processors can absorb some amount of latency through out-of-order and speculative execution; at some point, however, a memory-bound application will stall.
An application can reduce stalls by fetching data before it is actually needed. The Pentium III adds prefetch instructions for this purpose. Data may be fetched into the L0, L1, and L2 caches (if it will be frequently accessed), or it can be fetched into the non-temporal cache structure if it will not be frequently used. Choosing how to prefetch is largely a question of what data the application needs to keep in the caches, since a cache collision with image data may force other data, such as lookup tables, out of the cache. Actual performance for the memcpy example remains at 160-165 MB/s whether prefetches are done to the non-temporal cache structure (prefetchnta), to L0, L1, and L2 (prefetcht0), to L1 and L2 (prefetcht1), or to L2 only (prefetcht2). (Note that Pentium III processors may incur a penalty since the L2 cache runs at half the speed of the processor core and L1 cache.)
Example 3: MMX memcpy with prefetch and non-temporal stores (160-165 MB/s)

_asm {
    mov esi, src
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 6              // 64 bytes per iteration
loop1:
    prefetchnta 64[ESI]     // Prefetch next loop, non-temporal
    prefetchnta 96[ESI]
    movq mm1, 0[ESI]        // Read in source data
    movq mm2, 8[ESI]
    movq mm3, 16[ESI]
    movq mm4, 24[ESI]
    movq mm5, 32[ESI]
    movq mm6, 40[ESI]
    movq mm7, 48[ESI]
    movq mm0, 56[ESI]
    movntq 0[EDI], mm1      // Non-temporal stores
    movntq 8[EDI], mm2
    movntq 16[EDI], mm3
    movntq 24[EDI], mm4
    movntq 32[EDI], mm5
    movntq 40[EDI], mm6
    movntq 48[EDI], mm7
    movntq 56[EDI], mm0
    add esi, 64
    add edi, 64
    dec ecx
    jnz loop1
    emms
}

Minimizing Memory Access Times
The optimizations discussed to this point deal with the CPU's interaction with the caches and the memory controller, and they tend to improve performance on any platform with a Pentium II or Pentium III processor. We now turn to practices that allow the memory controller itself to service requests more efficiently. Since the memory controller is part of the Cobalt chipset, these optimizations are particularly advantageous on the Visual Workstations 320 and 540 and may or may not be effective with other chipsets. They are, however, unlikely to degrade performance on other platforms.
Applications can gain additional memory access performance by maximizing the locality of their memory accesses: accesses to addresses within the same page (4 KB) will generally be faster than accesses to addresses scattered randomly throughout physical memory. On the surface, it would seem that most image processing and streaming applications already provide good access locality, since they step sequentially through their buffers. Note, however, that Example 3 has rather poor locality overall: it provides good read locality and good write locality individually, but it bounces between the read buffer and the write buffer, and the interleaving of the two results in poor overall memory access locality.
In addition to maximizing locality, additional performance can be obtained by reducing the number of turnarounds between reads and writes. There is a small penalty paid when reads and writes are interleaved; two cache line reads followed by two cache line writes do not take the same amount of time as four cache line reads or writes. The "small" penalty can add up if the number of turnarounds is significant. Note that Example 3 switches between reading and writing every 64 bytes (two cache lines).
In the context of the memcpy example, increasing memory access locality and reducing the number of read-write turnarounds poses a single problem: the application can only stream as many reads as it has places to hold the read data. As it turns out, accesses to the Pentium II and Pentium III L1 cache are quite fast, since the cache runs at the same rate as the CPU core. Consequently, Example 4 uses 2 KB of L1 cache as a temporary store. Data is streamed from the source buffer in system memory through the CPU registers into the L1 cache, then streamed from the L1 cache through the CPU registers to the target buffer in system memory. The result is an improvement of over 50% in the overall memcpy rate compared to Example 3 (240 MB/s vs. 160 MB/s), and a factor of roughly 2.5 compared to Example 1 (95 MB/s).
Streaming data into and out of system memory provides good locality and significantly reduces the number of read-write turnarounds (by a factor of 64 in this case). It may not always be possible to accomplish both objectives. For example, a compositing application requires two (or more) source buffers and one target buffer. In such cases, it is generally advantageous to minimize the read-write turnarounds while prefetching reads to minimize the impact of the poor locality (which shows up as increased memory read access time).
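As a sketch of that pattern, the hypothetical compositing loop below reads one cache line from each of two source buffers per iteration, prefetches ahead on both to hide read latency, and writes the composited result with non-temporal stores. The saturating add (paddusb) stands in for whatever compositing arithmetic an application actually performs; src1, src2, dest, and nbytes are assumed local variables:

_asm {
    mov esi, src1
    mov ebx, src2
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 5              // 32 bytes (one cache line) per iteration
loopComp:
    prefetchnta 32[ESI]     // Prefetch the next line of each source
    prefetchnta 32[EBX]
    movq mm0, 0[ESI]        // Read one cache line from source 1
    movq mm1, 8[ESI]
    movq mm2, 16[ESI]
    movq mm3, 24[ESI]
    paddusb mm0, 0[EBX]     // Composite with source 2 (saturating add)
    paddusb mm1, 8[EBX]
    paddusb mm2, 16[EBX]
    paddusb mm3, 24[EBX]
    movntq 0[EDI], mm0      // Non-temporal stores to the destination
    movntq 8[EDI], mm1
    movntq 16[EDI], mm2
    movntq 24[EDI], mm3
    add esi, 32
    add ebx, 32
    add edi, 32
    dec ecx
    jnz loopComp
    emms
}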
Example 4: MMX memcpy with prefetch, non-temporal stores, and streaming reads/writes (240 MB/s)

_asm {
    mov esi, src
    mov ecx, nbytes
    mov ebx, ecx
    shr ebx, 11             // 2048 bytes at a time
    mov edi, dest
loop2k:                     // Copy 2k into temporary buffer
    push edi
    mov edi, tbuf
    mov ecx, 2048
    shr ecx, 6
loopMemToL1:
    prefetchnta 64[ESI]     // Prefetch next loop, non-temporal
    prefetchnta 96[ESI]
    movq mm1, 0[ESI]        // Read in source data
    movq mm2, 8[ESI]
    movq mm3, 16[ESI]
    movq mm4, 24[ESI]
    movq mm5, 32[ESI]
    movq mm6, 40[ESI]
    movq mm7, 48[ESI]
    movq mm0, 56[ESI]
    movq 0[EDI], mm1        // Store into L1
    movq 8[EDI], mm2
    movq 16[EDI], mm3
    movq 24[EDI], mm4
    movq 32[EDI], mm5
    movq 40[EDI], mm6
    movq 48[EDI], mm7
    movq 56[EDI], mm0
    add esi, 64
    add edi, 64
    dec ecx
    jnz loopMemToL1
    pop edi                 // Now copy from L1 to system memory
    push esi
    mov esi, tbuf
    mov ecx, 2048
    shr ecx, 6
loopL1ToMem:
    movq mm1, 0[ESI]        // Read in source data from L1
    movq mm2, 8[ESI]
    movq mm3, 16[ESI]
    movq mm4, 24[ESI]
    movq mm5, 32[ESI]
    movq mm6, 40[ESI]
    movq mm7, 48[ESI]
    movq mm0, 56[ESI]
    movntq 0[EDI], mm1      // Non-temporal stores
    movntq 8[EDI], mm2
    movntq 16[EDI], mm3
    movntq 24[EDI], mm4
    movntq 32[EDI], mm5
    movntq 40[EDI], mm6
    movntq 48[EDI], mm7
    movntq 56[EDI], mm0
    add esi, 64
    add edi, 64
    dec ecx
    jnz loopL1ToMem
    pop esi                 // Do next 2k block
    dec ebx
    jnz loop2k
    emms                    // Empty MMX state, as in Examples 1-3
}

Compute-Bound vs. Memory-Bound Functions
Most of the optimizations described in this document apply to both compute-bound and memory-bound routines. In the case of compute-bound routines, they minimize the amount of time "wasted" in moving data into and out of the CPU. For example, a color space conversion routine originally spent 50% of its time in memory I/O and 50% of its time doing the actual conversion. After optimizing memory accesses, the amount of time required for memory I/O was reduced to 10%, keeping the CPU occupied 90% of the time doing computation.
On the other hand, the color space conversion code above does not benefit from read-write streaming, since streaming primarily reduces the latency of memory I/O: once a compute-bound routine prefetches its data, the compute cycles themselves hide much of the latency introduced by the memory controller.
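A compute-bound loop that follows this advice might look like the sketch below, where the conversion arithmetic itself is elided. The prefetch issued at the top of each iteration overlaps with the computation on the current cache line, so the memory latency is largely hidden:

_asm {
    mov esi, src
    mov edi, dest
    mov ecx, nbytes
    shr ecx, 5              // One 32-byte cache line per iteration
loopConv:
    prefetchnta 64[ESI]     // Fetch two lines ahead; the computation
                            // below hides the memory latency
    movq mm0, 0[ESI]        // Read in source pixels
    movq mm1, 8[ESI]
    movq mm2, 16[ESI]
    movq mm3, 24[ESI]
    // ... color space conversion arithmetic on mm0-mm3 goes here ...
    movntq 0[EDI], mm0      // Non-temporal stores of converted pixels
    movntq 8[EDI], mm1
    movntq 16[EDI], mm2
    movntq 24[EDI], mm3
    add esi, 32
    add edi, 32
    dec ecx
    jnz loopConv
    emms
}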
A Note on MP Systems
The discussion above regarding locality and read-write streaming applies per device. The Cobalt chipset services several devices, but it treats the entire processor front-side bus as a single device. Consequently, locality and read-write streaming should be viewed from the perspective of all CPUs together, not of a single CPU.
In most cases, interference between processors is minimal. This is particularly true if only one processor performs memory-intensive tasks while the other processors perform compute-intensive tasks. However, if memory-intensive code is multi-threaded (e.g. if different threads act in tandem on different parts of an image), their performance may degrade. For example, running three copies of the Example 4 copy code on three processors simultaneously lowers the overall copy rate to approximately 195 MB/s.
Related Documents
For more information on developing for the Intel Pentium II and Pentium III processors discussed in this paper, see the following documentation:
Intel Architecture Software Developer's Manual, Intel Corporation, order number 243192