What Every Programmer Should Know About Memory, Part 2-0 (Translation)
Published: 2019-05-25


What Every Programmer Should Know About Memory
Ulrich Drepper
Red Hat, Inc.
drepper@redhat.com
November 21, 2007

2 Commodity Hardware Today

Understanding commodity hardware is important because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the “standard building blocks” for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded. {Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.} This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

Large differences exist in the structure of commodity computers. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years the personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

Figure 2.1: Structure with Northbridge and Southbridge

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

  • All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
  • All communication with RAM must pass through the Northbridge.
  • The RAM has only a single port. {We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.}
  • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see Figure 2.8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.
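To make the interleaving idea concrete, the following sketch models a dual-channel controller that spreads consecutive cache-line-sized blocks across the channels in round-robin fashion. The 64-byte block size, the channel count, and the simple modulo mapping are assumptions chosen for illustration only; a real Northbridge is free to interleave at other granularities or to hash addresses.

    #include <inttypes.h>
    #include <stdio.h>

    /* Toy model of dual-channel interleaving: consecutive 64-byte
     * blocks alternate between the two channels, so a sequential
     * request stream keeps both channels busy.  Block size, channel
     * count, and the modulo mapping are assumptions of this sketch. */
    #define CACHE_LINE 64
    #define CHANNELS    2

    static unsigned channel_of(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr / CACHE_LINE) % CHANNELS);
    }

    int main(void)
    {
        for (uint64_t addr = 0; addr < 8 * CACHE_LINE; addr += CACHE_LINE)
            printf("block at 0x%03" PRIx64 " -> channel %u\n",
                   addr, channel_of(addr));
        return 0;
    }

The only point of the model is that a stream of sequential requests naturally keeps both channels busy, which is where the doubled bandwidth comes from.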

With limited bandwidth available, it is important to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. Refer to Section 2.2 for more details of RAM access patterns.
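As a taste of what is quantified later, the following user-space sketch touches the same buffer twice, once sequentially and once with a large stride, and reports the elapsed time of each walk. The buffer size, the stride, and the use of clock_gettime are arbitrary choices for this sketch; the measured ratio depends on the cache hierarchy and memory technology described in the rest of this section.

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk the same buffer sequentially and with a large stride and
     * compare the elapsed times.  Both walks read exactly SIZE bytes;
     * only the order of the accesses differs.  Sizes and the stride
     * are arbitrary choices for this sketch. */
    #define SIZE   (64 * 1024 * 1024)
    #define STRIDE 4096

    static double walk(volatile char *buf, size_t step)
    {
        struct timespec t0, t1;
        unsigned long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t start = 0; start < step; start++)
            for (size_t i = start; i < SIZE; i += step)
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        (void)sum;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char *buf = malloc(SIZE);
        if (buf == NULL)
            return 1;
        for (size_t i = 0; i < SIZE; i++)   /* fault the pages in first */
            buf[i] = (char)i;

        printf("sequential: %.3f s\n", walk(buf, 1));
        printf("strided:    %.3f s\n", walk(buf, STRIDE));
        free(buf);
        return 0;
    }

On typical hardware the strided walk is noticeably slower even though it reads exactly the same number of bytes as the sequential one.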

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

Figure 2.2: Northbridge with External Controllers

The advantage of this architecture is that more than one memory bus exists and therefore total bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel). { For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as “memory RAID” which is useful in combination with hotplug memory.}

Figure 2.3: Integrated Memory Controller

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture is made popular by SMP systems based on AMD’s Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

 

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA – Non-Uniform Memory Architecture – for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed.

Each such communication has an associated cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM’s x445 and SGI’s Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.
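On Linux the NUMA factor can be provoked deliberately with libnuma, assuming the library is installed and the machine actually has more than one node. The node numbers below are placeholders for whatever the real topology provides, and the program has to be linked with -lnuma.

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    /* Run on node 0 and touch one buffer allocated on node 0 and one
     * allocated on the highest-numbered node, so the second memset has
     * to cross the processor interconnect.  Node numbers are
     * placeholders; timing the two memsets is left to the reader. */
    #define BUF_SIZE (256UL * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int remote_node = numa_max_node();   /* highest-numbered node     */
        numa_run_on_node(0);                 /* pin this thread to node 0 */

        void *local_buf  = numa_alloc_onnode(BUF_SIZE, 0);
        void *remote_buf = numa_alloc_onnode(BUF_SIZE, remote_node);
        if (local_buf == NULL || remote_buf == NULL)
            return 1;

        memset(local_buf, 0, BUF_SIZE);      /* local accesses                 */
        memset(remote_buf, 0, BUF_SIZE);     /* remote accesses, NUMA factor   */

        numa_free(local_buf, BUF_SIZE);
        numa_free(remote_buf, BUF_SIZE);
        return 0;
    }

The same placement can usually be obtained without code changes by starting the program under the numactl command line tool.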

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In Section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.
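As a minimal sketch of such a check, a program can first ask libnuma whether the kernel exposes a NUMA topology at all before spending any effort on node-aware placement. The sketch assumes nodes are numbered contiguously from 0, which is the common case, and again needs -lnuma.

    #include <numa.h>
    #include <stdio.h>

    /* Report whether the kernel exposes a NUMA topology and how many
     * nodes it has.  With a single node the NUMA factor is moot and
     * node-aware placement can be skipped. */
    int main(void)
    {
        if (numa_available() < 0) {
            printf("NUMA API not available; treating memory as uniform\n");
            return 0;
        }
        printf("NUMA nodes: %d\n", numa_max_node() + 1);
        return 0;
    }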

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in Section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to Section 2.2.5.


商用硬件的现状

理解商用硬件是十分重要的因为专业硬件越来越少.今日人们更多的是水平扩展而不是垂直扩展,意味着今日使用许多更小的,互联的商用计算机而不是一些更大的,异常快(但是昂贵的)系统.这是因为快的,廉价的网络硬件正在开始普及.那些大型的专用系统仍然占据一席之位并且这些系统仍然具有商业机会,但是最总的整体市场将会被商用硬件代替.2007年,Redhat认为未来的数据中心将会是一个拥有4个插槽的计算机,每个插槽可以插入一个4核的CPU,对于Intel的CPU,将会是超线程的.(超线程能使每一个单独的处理器核心通过使用一点额外的硬件实现处理两个以上的并发任务).这意味着在标准的数据中心中将有多大64个虚拟处理器.更大的机器也会被支持,但是4插槽,4核Cpu通常被认为是最佳的配置并且大多数的优化都是针对这样的机器.

极大的差异存在于不同的商用计算机架构.即便如此,我们依然可以覆盖90%以上的硬件通过专注于最重要的差异上.注意技术细节改变的很快,因此读者应该将写作时间考虑在内.

多年以来,私人的计算机和小型的服务器被标准化到一个芯片集上,芯片由南北桥组成.图2-1展示了这种结构.

图2-1

所有CPU被连接通过一个通用总线(FBS 前端总线)连接到北桥.北桥除了包含一些其他的东西之外,还有内存控制器,他的实现决定了RAM芯片的类型.不同的RAM芯片,类似DRAM,Rambus,SDRAM,要求不同的内存控制器.

为了连接系统不同的设备,北桥必须和南桥相连.南桥经常被称为I/O桥,通过各种各样的总线连接设备.目前PCI,PCI Express,SATA和USB总线是十分重要的,但是PATA,IEEE 1394,串行和并行端口也是被南桥支持.更老的系统拥有AGP插槽去连接北桥.这是因为南北桥之间的速度并不是特变快,性能不太好.然而,今日PCI-E插槽全部连接到南桥.

这样的一个系统架构造成了一系列应该注意的后果:

1. 所有从一个CPU到另一个CPU的数据都必须经过相同的总线,该总线用来与北桥交流.

2. 所有与RAM的的交流都必须通过北桥.
3. RAM只有一个端口(这里不讨论多端口RAM,因为这部分RAM不会用于商用RAM.多端口RAM可以被发现在指定的硬件类似路由器).
4.CPU与南桥硬件之间的沟通必须通过北桥.

在这种设计中,一系列的瓶颈立刻出现.其中的一个瓶颈则是设备对RAM的访问.在PC的早期,无论对南北桥其中哪一个的设备的交流都必须通过CPU,对系统整体行性能造成了极大的影响.为了解决这个问题,许多设备通过DMA来实现了优化.DMA允许设备在南桥的帮助下,在RAM里存储和接收数据,同时没有CPU的介入.今日所有的与任何总线相连的高性能设备够利用了DMA.虽然这极大的降低了CPU的负载,但是同时也创造了北桥带宽的竞争.因为DMA和CPU竞争内存.因此这个问题也必须考虑在内.

第二个瓶颈出现在北桥到RAM之间.总线的准确细节依赖于RAM的类型.较老的系统只有一个总线对所有的RAM芯片,因此平行的访问是不可能的.最近RAM的类型有求两个单独的总线(在DDR2中被称为通道),这使得带宽翻倍.北桥将内存访问交错的分配给多通道.更多的内存技术(类似FB-DRAM)增加了更多的通道.

由于可用的带宽有限,这是重要的以调度内存访问的方式去最小化延迟,优化性能.正如我们看到的,除去CPU缓冲的使用,处理器仍旧是十分的快并且需要去等待内存.如果是多线程,多核,多处理器同时访问RAM,等待时间甚至更长.这是相同的对DMA来说.

除了并发的访问内存外,访问模式也极大的影响了内存子系统的性能,尤其是多内存通道.参考2.2节获取等多的RAM访问模式.

图2-2

在一些更加昂贵的系统上,北桥实际上并不包含内存控制器.相反一系列额外的内存控制器与北桥相连. 如图2-2

图2-3

使用多个外部的内存控制器并不是唯一的方法通通去提高内存带宽.另一种受欢迎的方法是去集成内存控制器到CPU内部并且使得内存与CPU直连.基于SMP(对称多处理器)的AMD Opteron处理器使得这种架构开始流行起来.图2.3展示了这样的架构.Intel将从Nehalem处理器开始支持通用的系统接口CSI;这是基本的方法:一个集成的内存控制器处理每个CPU的本地内存.

在这种架构上,我们可以拥有和处理器数量一致的内存块.在一个4CPU机器上,我们在没有一个巨大带宽的复杂北桥下,就可以获得4倍的内存带宽.同时,将内存控制器集成到CPU内部有一些额外的有点,但是这里我们将不会深入挖掘.

这种架构也有一些缺点,首先,因为这种架构仍然必须使得所有的内存能被处理器访问,内存不再是对称的(NUMA 非一致内存访问 ),处理器可以正常的速度访问本地内存,但是当处理器访问不属于本地的内存时,则必须使用CPU之间的互联通道.CPU1访问CPU2的内存,则必须经过一个互联通道,当CPU1访问CPU4的内存时,则必须通过两个互联通道.

不同的通信将会有不同的消耗.我们将额外的时间称之为NUMA因素当我们访问远端的内存.在图2.3中,每个CPU都有两个层级.紧邻的CPU和两个互联通道之外的CPU.越复杂的机器,它的层级数也会显著的增多.有一些机器,比如IBM的X445和SGI的Altix系列,都有超过一种的连接.CPU被划入节点,一个节点内访问内存应该是一致的或者只有很小的NUMA因素.在节点之间的连接是十分昂贵的,NUMA因素也是很高的.

目前商用NUMA机器以及出现并且可能在未来扮演一个及其重要的角色.这是可预料的,在2008年年底,每一个SMP机器都将使用NUMA架构.这是重要的对于每一个NUMA机器上运行的程序认识到NUMA带来的代价.在第5节我们将讨论更多的架构以及一些Linux为这些程序提供的技术.

除了本节中提到的技术细节,还有一些额外因素影响RAM的性能.他们是无法被软件所左右的,因此没有放在本节.感兴趣的读者可以从2.1节中了解.介绍这些技术仅仅是为了让我们对RAM有一个更加全面的了解,同时让我们在购买计算机时做出更好的选择.

接下来的两节介绍了一些入门级别的硬件细节和内存控制器与DRAM之间的访问协议.程序员可能会从RAM访问原理的细节中获得一些启发.这部分知识是可选的,心急的读者为了获取核心的部分可以直接跳到2.2.5节.
