The i875 can, in theory, provide an amazing 6.4GB/s of peak bandwidth, and it is exactly this term which inspired my delving into this subject matter. Next to the processor's physical cache, the North Bridge, and more specifically its MCH (Memory Controller Hub), plays a critical role in system performance. The architecture of the MCH may be as important as any component in your system, and certainly as much as the processor itself. Where the chipset is concerned, many theoretical figures are thrown at us by marketing departments, yet I wonder how many of us truly realize the distinction between peak and effective bandwidth?

I've come to realize that, to the overclocker, the word "default," albeit seven letters, surely macerates a four-letter connotation into his or her best carpet. I for one just can't live with my P4 2.4C running at 2400MHz; after 3500MHz it's simply offensive. Therefore I need components which will keep pace at these overclocked speeds. And while memory can be overclocked, it has nowhere near the flexibility, or the performance ceiling, of most processors. I need DDR500 to see 3GHz without resorting to a divider. It is, however, somewhat disheartening not to reach the "labelled" bandwidth on my modules, or the bandwidth claimed by the chipset maker.

For these reasons, we will investigate whether the industry has aggrandized these values, or whether the majority of systems are simply performing below their specifications. All too often we (I) fail to meet the ostentatious bandwidth claims associated with today's memory and chipsets. Are these figures theoretical, and therefore rarely, if ever, attained? Or are the values associated with today's memory "real-world" numbers, and empirically verifiable? First, the formula:
Bus Width (bytes) x Effective Clock Speed (MHz) = Bandwidth (MB/s)
DDR = 64-bit bus = 8 bytes per transfer
8 x 533 = 4264 MB/s, marketed as PC4200 (you get 64 MB/s extra for your money!)
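For those who like to see the arithmetic spelled out, here is a minimal sketch of the same formula in Python (the function name is mine, purely for illustration; nothing here is measured):

    # Peak bandwidth is nothing more than bus width times effective clock.
    def peak_bandwidth_mb_s(bus_width_bytes, effective_mhz):
        """Bus width (bytes) x effective clock (MHz) = peak bandwidth in MB/s."""
        return bus_width_bytes * effective_mhz

    # DDR533: a 64-bit (8-byte) bus at an effective 533MHz.
    print(peak_bandwidth_mb_s(8, 533))  # 4264 MB/s -- marketed as PC4200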
I've been running OCZ PC4200 at a constant 285FSB for six days now. Since we're using DDR (Double Data Rate) memory, our effective data rate is 570MHz. Ergo 8 x 570 = PC4560. Since I paid for PC4200, and have been running the memory completely stable at PC4560 speeds, I'd say things are good. Yet when we look at the screenshot below, one becomes befuddled. First, I must qualify the figurative results of this screenshot: at this speed, running with PAT enabled was not possible given the voltage available to the DIMMs. And it is around this attribute of the 875 chipset that our discussion will revolve.
First off, let's look at the bandwidth. Since we're operating on a dual-channel DDR platform, the figure needs to be divided in half: 5914 MB/s divided by 2 = 2957 MB/s per channel. Subtract that from the theoretical single-channel bandwidth of 4560 MB/s and we're 1603 MB/s short. What happened to that 1603 MB/s? Or the 3206 MB/s across both channels? I want it back. Even Sandra tells me my "estimated" bandwidth could be 9120 MB/s (peak: 4560 x 2 = 9120 MB/s). Do I call OCZ, Intel, Asus, who? Who is going to give me back the 1603 MB/s per channel of which I (we) have been deprived? The fact is we never really see this kind of "peak" bandwidth. And why is that? Some may call it false advertising, sandbagging, perhaps even duplicitous. I simply call it "wishful bandwidth thinking." The clearest explanation I've found is in Peter Rundberg's article, Ins and Outs of Memory Bandwidth:
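To make the shortfall explicit, here is the same arithmetic applied to my overclocked setup, again in Python (nothing measured here; the 5914 MB/s figure is simply Sandra's reading carried over from the screenshot):

    fsb = 285                                   # front side bus, MHz
    effective_rate = fsb * 2                    # DDR: 570 MT/s
    per_channel_peak = 8 * effective_rate       # 4560 MB/s -- "PC4560"
    dual_channel_peak = per_channel_peak * 2    # 9120 MB/s, Sandra's "estimated" peak

    measured = 5914                             # MB/s reported by Sandra
    shortfall = dual_channel_peak - measured    # 3206 MB/s, i.e. 1603 per channel
    efficiency = measured / dual_channel_peak   # roughly 0.65

    print(shortfall, round(efficiency, 2))      # 3206 0.65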
...there is a difference between peak bus bandwidth and effective memory bandwidth. Where the peak bus bandwidth is just the product of the bus width and the bus frequency, the effective memory bandwidth includes addressing and other things that are needed to perform a memory read or write. The bold figures of DDR-SDRAM and DRDRAM do not indicate how these new memory technologies perform in real life. In this article we will look into the cause of the failed promises of these technologies by focusing on the most important part of memory performance: latency. What neither of these two new memory technologies gives us is reduced memory latency, which is the time it takes to look something up in memory. This is because they are both based on DRAM. The latency is not so much an issue of the memory interface as it is the memory cell itself, and since both these new memories use DRAM, the latency is not improved. As we will see in the following sections, latency is more important than peak bus bandwidth when it comes to providing effective memory bandwidth... (Ins and Outs of Memory Bandwidth, Peter Rundberg.)
The quoted article details the many processes which conspire to slow the potential bandwidth label your memory carries. What may shock you is that the primary culprit responsible for slowing your RAM's bandwidth ironically happens to be the same device responsible for speeding it up: your CPU's cache. A cache operates on the principle of locality, "temporal" and "spatial." Temporal locality assumes that if a given program uses a piece of data, it will use that same data again soon. Spatial locality states that if a program uses a specific piece of data, it will soon use data in close proximity to it. When data is requested which is not in the processor's cache, a "miss" occurs, and in reality the only time main memory (RAM) is accessed is during a cache miss. When the data is retrieved from RAM, the spatial locality effect also determines what is retrieved. The problem with accessing main memory is that data cannot simply be retrieved in one block. DRAM is stored in a matrix of rows and columns so that data can be "easily" found. Below is a basic example of how data is organized, and therefore retrieved, from SDRAM (a small worked example of the resulting timing follows the quoted steps):
RAS - Row Access Strobe. A signal indicating that the row address is being transferred.
CAS - Column Access Strobe. A signal indicating that the column address is being transferred.
tRCD - Time between RAS and CAS.
tRP - The RAS Precharge delay. Time to switch memory row.
tCAC - Time to access a column.
The CPU addresses the memory with row and bank during the time RAS is held (tRP).
After a certain time, tRCD, the CPU addresses the memory with the column of interest during the time CAS is held (tCAC).
The addressed data is now available for transfer over the 64 bit memory bus.
The immediate following 64 bits are transferred the next cycle and so on for the whole cache block.
For SDRAM these times are usually presented as 3-2-2 or 2-2-2, where these numbers indicate tCAC, tRP, and tRCD. Thus for 2-2-2 memory the first 64 bit chunk is transferred after 6 cycles. (Ins and Outs of Memory Bandwidth, Peter Rundberg.)
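Putting rough numbers to the quoted 2-2-2 example: the sketch below uses my own back-of-the-envelope assumptions, not Rundberg's figures. It assumes a 64-byte cache line, i.e. eight 64-bit chunks, with the remaining chunks streaming out one per cycle as the quote describes.

    def cache_line_fill_cycles(tcac, trp, trcd, line_bytes=64, bus_bytes=8):
        # Cycles to move one whole cache block across a 64-bit bus.
        chunks = line_bytes // bus_bytes        # eight 64-bit chunks per 64-byte line
        first_chunk = tcac + trp + trcd         # 2+2+2 = 6 cycles before the first chunk
        return first_chunk + (chunks - 1)       # one further cycle per remaining chunk

    cycles = cache_line_fill_cycles(2, 2, 2)    # 13 cycles for 2-2-2 memory
    delivered = 64 / cycles                     # ~4.9 bytes actually delivered per cycle
    peak = 8                                    # 8 bytes per cycle on the label
    print(cycles, round(delivered / peak, 2))   # 13 0.62 -- well short of peak already

On these assumptions, the addressing overhead alone eats more than a third of the headline figure on every cache-line fill, before the write-back/write-allocate traffic discussed next even enters the picture.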
It's very simple: the numbers associated with peak bandwidth fail to account for the above processes, as well as the write-back/write-allocate steps. In order to better understand the difference between peak and effective bandwidth, one must understand how the FSB (cache) and NB-MCH (North Bridge-Memory Controller Hub) interact, as well as the relationship between the MCH and the memory itself. nVidia revolutionized memory throughput with the introduction of its TwinBank memory architecture in the nForce chipset. TwinBank combines two distinct 64-bit DDR channels via an arbiter to form a 128-bit path to the FSB. This effectively doubles the bandwidth, and it is paired with another revolutionary North Bridge feature known as DASP, which nVidia in fact patented. DASP, or Dynamic Adaptive Speculative Pre-processor, was designed to reduce latency by acting as an additional prefetcher along the memory bus. DASP sought to reduce latency by predicting what the next data block(s) would be through locality of reference, then fetching and storing that data concurrently for the processor; in fact it's been labelled an "L3 cache." The concept of cutting the MCH out of the latency equation is not so unique: the Athlon FX-51 moves the memory controller onto the processor die, which basically renders the MCH moot. Of course registered memory is required, although I'm sure given time that will not be so. DASP is still the strongest performance element of the nForce chipset, and while TwinBank certainly improves performance dramatically, there's a reason nVidia decided to release two versions of its nForce2 400: the nForce2 400 being single channel, and the nForce2 Ultra 400 incorporating TwinBank. Of course DASP is common to both.
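And for completeness, the dual-channel arithmetic that produces the headline figure this article opened with, in the same Python style as above (a peak number, as always, not an effective one):

    channels = 2           # TwinBank / dual-channel DDR: two 64-bit channels in tandem
    bus_bytes = 8          # each channel is 64 bits wide
    effective_mhz = 400    # DDR400

    print(channels * bus_bytes * effective_mhz)   # 6400 MB/s -- the i875's "6.4GB/s"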