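A Brief Look at Memory Management in the Linux Kernel

[Ten Months Past (十月往昔)]: A Brief Look at Memory Management in the Linux Kernel

Why "Ten Months Past"? The name commemorates my old blog.

For some reason I suddenly wanted a fresh start, and that blog had been alive for exactly ten months, with ten months of notes in it.

Some of what was written in those ten months still feels worth keeping, so I am moving it here.

This article was written on 2009-04-20 and 2009-04-21.

Jason Lee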

————————————–cut-line

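1. Basic framework (mainly paged memory management)

4 GB is a somewhat sensitive figure: not long ago it was the upper limit on memory for most machines (or rather, most operating systems). Why?

I say "not long ago" because 64-bit machines are now common. On a 32-bit machine, paging splits a 32-bit address into three fields:

Bits 0-11: offset within the page (offset)

Bits 12-21: index into the page table (pt)

Bits 22-31: index into the page directory (pgd)

Address translation proceeds as follows:

1) The memory management unit reads the base address of the current page directory from register cr3;

2) page directory base + page directory index -> page directory entry, which gives the base address of a page table;

3) page table base + page table index -> page table entry, which gives the base address of a physical page;

4) page base + page offset -> the physical memory location.

A 12-bit offset addresses 4 KB, so a page is 4 KB, and the total addressable range is 2^10 * 2^10 * 2^12 bytes = 4 GB. That is why memory on a 32-bit machine is normally limited to 4 GB.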
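The field split described above can be sketched in C. This is a minimal user-space illustration, not kernel code; the struct split_addr type and the split_linear_address() helper are hypothetical names introduced only for this example.

```c
#include <stdint.h>

/* Field widths of the classic two-level i386 layout described above. */
#define OFFSET_BITS 12u   /* bits 0-11: offset within the page */
#define TABLE_BITS  10u   /* bits 12-21: index into the page table */

/* Hypothetical helper type for this example, not a kernel structure. */
struct split_addr {
    uint32_t dir_index;    /* bits 22-31 */
    uint32_t table_index;  /* bits 12-21 */
    uint32_t offset;       /* bits 0-11 */
};

/* Split a 32-bit address into its three paging fields. */
static struct split_addr split_linear_address(uint32_t addr)
{
    struct split_addr s;

    s.offset      = addr & ((1u << OFFSET_BITS) - 1u);
    s.table_index = (addr >> OFFSET_BITS) & ((1u << TABLE_BITS) - 1u);
    s.dir_index   = addr >> (OFFSET_BITS + TABLE_BITS);
    return s;
}
```

For the address 0xC0101234, for instance, the directory index is 768, the table index is 257 and the offset is 0x234.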

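An operating system has to support different platforms, 32-bit and 64-bit among them, so the Linux source discussed here uniformly uses a three-level paging model: pgd, pmd, pt, offset.

PAE stands for Physical Address Extension. When the memory manager is configured for PAE mode, all three levels are genuinely needed.

How does a three-level design implement two-level paging? Linux quietly "cheats" a little. It is as if the boss told Linux to give three-level mapping an important post, and on 32-bit turf Linux complied in name only, handing it a title with no real power. So how is this nominal post implemented?

First, a machine with PAE enabled really does need three levels, so there the middle level does real work rather than holding a nominal post. A 32-bit machine without PAE does not need three levels and is happier with two. So the first step is deciding when the middle level is only nominal: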

109/*
110 * The Linux x86 paging architecture is ‘compile-time dual-mode’, it
111 * implements both the traditional 2-level x86 page tables and the
112 * newer 3-level PAE-mode page tables.
113 */
114#ifndef __ASSEMBLY__
115#if CONFIG_X86_PAE
116# include <asm/pgtable-3level.h>
117
118/*
119 * Need to initialise the x86 PAE caches
120 */
121extern void pgtable_cache_init(void);
122
123#else
124# include <asm/pgtable-2level.h>

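The first comment tells us that the Linux x86 paging code chooses at compile time between the traditional two-level mapping and the newer three-level mapping of PAE mode. The code that follows shows how: if CONFIG_X86_PAE is set, that is, PAE mode is enabled, pgtable-3level.h is included and pgtable_cache_init() is declared to initialise the x86 PAE caches; otherwise pgtable-2level.h is included and two-level mapping is used.

The two-level mapping as implemented in pgtable-2level.h: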

4/*
5 * traditional i386 two-level paging structure:
6 */
8#define PGDIR_SHIFT 22
9#define PTRS_PER_PGD 1024
11/*
12 * the i386 is two-level, so we don’t really have any
13 * PMD directory physically.
14 */
15#define PMD_SHIFT 22
16#define PTRS_PER_PMD 1

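The comment on lines 11 to 14 says that the pmd does not physically exist here.

PGDIR_SHIFT is the shift of the pgd field, meaning the bit position within the 32-bit address at which the field starts. It is 22, so the field starts at bit 22 (the 23rd bit counting from one). PTRS_PER_PGD means "pointers per pgd", the number of entries a page directory holds. It is 1024, which takes 10 bits, so the pgd field runs from bit 22 to bit 31 (the 23rd to the 32nd bit).

It is then clear that the pmd is a placeholder here, holding a nominal post: PTRS_PER_PMD is 1, so the pmd index takes 0 bits, since 2^0 = 1. PMD_SHIFT is also 22, the same as PGDIR_SHIFT, so the pmd "field" starts where the pgd field starts and has zero width.

At this point we know on whose turf three-level mapping gets a nominal post and how that post is set up. When three-level mapping does real work, it is essentially the same as two-level mapping, just with a few more bits.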
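To make the "nominal post" concrete, here is a small sketch of three-level index extraction using the two-level constants quoted above. The three *_index_of() helpers are hypothetical names for this example, and PTRS_PER_PTE = 1024 is an assumption that follows from the 10-bit page table field rather than something shown in the excerpt.

```c
#include <stdint.h>

#define PAGE_SHIFT   12
#define PMD_SHIFT    22    /* same as PGDIR_SHIFT: the pmd level has zero width */
#define PGDIR_SHIFT  22
#define PTRS_PER_PTE 1024u /* assumed: 10-bit page table index */
#define PTRS_PER_PMD 1u
#define PTRS_PER_PGD 1024u

/* Generic three-level index extraction; works for any power-of-two sizes. */
static uint32_t pgd_index_of(uint32_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1u);
}

static uint32_t pmd_index_of(uint32_t addr)
{
    /* With PTRS_PER_PMD == 1 the mask is 0, so the index is always 0. */
    return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1u);
}

static uint32_t pte_index_of(uint32_t addr)
{
    return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1u);
}
```

The code path has three steps, but the middle step always selects entry 0, so the walk behaves like a two-level one.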

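2. Data structures and functions

As is well known, Linux has many data types that are not part of ANSI C, such as pid_t. These types are defined through one or more layers of typedef. A main reason is portability, and a side effect is that the type name tells you at a glance what a value is for: pid_t is obviously the type of a process id. Separately, the kernel source relies on GCC extensions, so building the kernel requires a suitable version of gcc.

So what are the pgd, pmd and pt mentioned in part 1? include/asm-i386/page.h contains the following code: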

36/*
37 * These are used to make use of C type-checking..
38 */
39#if CONFIG_X86_PAE
40typedef struct { unsigned long pte_low, pte_high; } pte_t;
41typedef struct { unsigned long long pmd; } pmd_t;
42typedef struct { unsigned long long pgd; } pgd_t;
43#define pte_val(x) ((x).pte_low | ((unsigned long long)(x).pte_high << 32))

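With PAE enabled, pgd_t and pmd_t each wrap a single unsigned long long (a 64-bit value on i386), while pte_t is split into two 32-bit halves, pte_low and pte_high.

pte stands for page table entry: one entry of a page table, pointing at one physical page. Take the 32-bit (non-PAE) entry as the example. It does not need all 32 bits to locate a page: every page is 4 KB, so starting from address 0 there is a page boundary every 4 KB, and the low 12 bits of every page base address are 0. Only the high 20 bits are needed to identify the page, and the low 12 bits are used for page status and permission flags. The pte_val() macro reads the value held in a pte_t; in the PAE case it combines the two halves into one 64-bit value.

Without PAE the definitions are: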

44#else

45typedef struct { unsigned long pte_low; } pte_t;

46typedef struct { unsigned long pmd; } pmd_t;

47typedef struct { unsigned long pgd; } pgd_t;

48#define pte_val(x) ((x).pte_low)

49#endif

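With pmd_t and the other structures there is somewhere to store the address information. How is it read back? Through the following macros:

54#define pmd_val(x) ((x).pmd)
55#define pgd_val(x) ((x).pgd)
56#define pgprot_val(x) ((x).pgprot)

58#define __pte(x) ((pte_t) { (x) } )
59#define __pmd(x) ((pmd_t) { (x) } )
60#define __pgd(x) ((pgd_t) { (x) } )
61#define __pgprot(x) ((pgprot_t) { (x) } )

Lines 54 to 56 are macros that read the member, and lines 58 to 61 convert a plain value into the corresponding type. A new name appears here: pgprot, short for page protection. It corresponds to the page status and permission bits mentioned above, and is what implements page protection:

52typedef struct { unsigned long pgprot; } pgprot_t;

The individual bits carried in a pgprot_t are defined in /include/asm-i386/pgtable.h: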

187#define _PAGE_PRESENT 0x001
188#define _PAGE_RW 0x002
189#define _PAGE_USER 0x004
190#define _PAGE_PWT 0x008
191#define _PAGE_PCD 0x010
192#define _PAGE_ACCESSED 0x020
193#define _PAGE_DIRTY 0x040
194#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
195#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
196
197#define _PAGE_PROTNONE 0x080 /* If not present */

The pgprot_t bits all lie in the low 12 bits, and the pointer part of a pte is the high 20 bits; together they make up 32 bits. How are the two combined into a 32-bit page table entry? The natural guess is to shift the 20-bit page number left by 12 and OR it with the low 12 bits of the pgprot_t. In pgtable.h this is done by the mk_pte macro:

309#define mk_pte(page, pgprot) __mk_pte((page) - mem_map, (pgprot))

That leads to __mk_pte. In /include/asm-i386/pgtable-2level.h it is a macro:

63#define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))

(The above is line 63 alone.) PAGE_SHIFT is defined in /include/asm-i386/page.h:

5#define PAGE_SHIFT 12

So the page number is shifted left by 12 bits and ORed with the protection field pgprot to give the page table entry. __pte(), defined on line 58 as ((pte_t) { (x) } ), performs the type conversion. pgprot_val(pgprot), defined on line 56 as ((x).pgprot), reads the pgprot member of a pgprot_t variable, matching the typedef on line 52.
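The shift-and-OR construction can be tried out in user space. The sketch below assumes the non-PAE 32-bit entry layout; make_pte(), pte_frame() and pte_flags() are hypothetical helpers written for this example, and only the flag values are taken from the listing above.

```c
#include <stdint.h>

#define PAGE_SHIFT     12
#define _PAGE_PRESENT  0x001u
#define _PAGE_RW       0x002u
#define _PAGE_USER     0x004u

/* Build a 32-bit entry: page number in the high 20 bits, flags in the low 12. */
static uint32_t make_pte(uint32_t page_nr, uint32_t prot)
{
    return (page_nr << PAGE_SHIFT) | prot;
}

/* Recover the physical page number from an entry. */
static uint32_t pte_frame(uint32_t pte)
{
    return pte >> PAGE_SHIFT;
}

/* Recover the status/permission bits from an entry. */
static uint32_t pte_flags(uint32_t pte)
{
    return pte & ((1u << PAGE_SHIFT) - 1u);
}
```

For page number 0x12345 with the present, read/write and user bits set, make_pte() yields 0x12345007, and the two accessors split it back into 0x12345 and 0x007.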

That leaves mem_map. First let us look at the page structure in /include/linux/mm.h.

It is preceded by this comment:

139/*

140 * each physical page in the system has a struct page associated with

141 * it to keep track of whatever it is we are using the page for at the

142 * moment. note that we have no way to track which tasks are using

143 * a page.

144 *

145 * try to keep the most commonly accessed fields in single cache lines

146 * here (16 bytes or greater). this ordering should be particularly

147 * beneficial on 32-bit processors.

148 *

149 * the first line is data used in page cache lookup, the second line

150 * is used for linear searches (eg. clock algorithm scans).

151 *

152 * todo: make this structure smaller, it could be as small as 32 bytes.

153 */

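In short: each physical page has a page structure associated with it, which tracks what the page is currently used for. In addition, the most frequently accessed fields are kept together so that they fall within a single cache line (16 bytes or more), which makes access faster. Now the definition of the page structure: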

154typedef struct page {

155 struct list_head list; /* ->mapping has some page lists. */

156 struct address_space *mapping; /* the inode (or …) we belong to. */

157 unsigned long index; /* our offset within mapping. */

158 struct page *next_hash; /* next page sharing our hash bucket in

159 the pagecache hash table. */

160 atomic_t count; /* usage count, see below. */

161 unsigned long flags; /* atomic flags, some possibly

162 updated asynchronously */

163 struct list_head lru; /* pageout list, eg. active_list;

164 protected by pagemap_lru_lock !! */

165 struct page **pprev_hash; /* complement to *next_hash. */

166 struct buffer_head * buffers; /* buffer maps us to a disk block. */

167

168 /*

169 * on machines where all ram is mapped into kernel address space,

170 * we can simply calculate the virtual address. on machines with

171 * highmem some memory is mapped into kernel virtual memory

172 * dynamically, so we need a place to store that address.

173 * note that this field could be 16 bits on x86 … ;)

174 *

175 * architectures with slow multiplication can define

176 * WANT_PAGE_VIRTUAL in asm/page.h
177 */
178#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
179 void *virtual; /* kernel virtual address (NULL if
180 not kmapped, ie. highmem) */
181#endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */

182} mem_map_t;

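The last line (line 182) brings a moment of recognition: mem_map_t. It suggests that mem_map is a variable related to this type.

In fact mem_map is a global variable (so far, at least), and it is a pointer to an array of page structures. The system creates the array at initialisation, sized according to the amount of physical memory, and each element corresponds to one physical page. From the software side, the high 20 bits of a page table entry are the physical page number, which is the index into the mem_map array, and that index leads to the page structure for the physical page. From the hardware side, the high 20 bits of the entry followed by 12 zero bits form the 32-bit base address of the physical page.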
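The relationship between an array element, its index and the page base address can be sketched as follows. This is an assumption-laden toy: struct page_stub and mem_map_stub stand in for struct page and mem_map, and the array has only eight elements.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Stand-in for struct page; the real structure has many more fields. */
struct page_stub {
    unsigned long flags;
};

/* Stand-in for mem_map: one element per (pretend) physical page. */
static struct page_stub mem_map_stub[8];

/* Pointer subtraction gives the array index, i.e. the physical page number. */
static size_t page_to_nr(const struct page_stub *page)
{
    return (size_t)(page - mem_map_stub);
}

/* Page number followed by 12 zero bits is the page's base address. */
static uint32_t page_to_phys(const struct page_stub *page)
{
    return (uint32_t)page_to_nr(page) << PAGE_SHIFT;
}
```

This is the same subtraction that mk_pte performs with (page) - mem_map before shifting the result into the entry.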

mem_map covers all physical pages, and the pages it describes are divided into zones such as ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. ZONE_DMA is for DMA use; ZONE_HIGHMEM handles physical memory above roughly 1 GB.

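The three zones are laid out as follows: 0 to 16 MB goes to ZONE_DMA, 16 to 896 MB to ZONE_NORMAL, and everything above 896 MB to ZONE_HIGHMEM. Why this layout? Some hardware can only perform DMA to the range 0 to 16 MB. On some configurations not all physical pages can stay permanently mapped into the kernel's address space, and those pages go to ZONE_HIGHMEM and are mapped dynamically. The rest can be mapped normally.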
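As a quick illustration of these boundaries, the following sketch classifies a physical address by zone. The enum and the zone_of() function are hypothetical names for this example; the kernel itself assigns pages to zones at initialisation rather than classifying addresses on demand.

```c
/* Hypothetical zone labels for this example only. */
enum zone_kind {
    ZONE_KIND_DMA,
    ZONE_KIND_NORMAL,
    ZONE_KIND_HIGHMEM
};

#define MB(x) ((unsigned long long)(x) << 20)

/* Classify a physical address using the 16 MB and 896 MB boundaries. */
static enum zone_kind zone_of(unsigned long long phys)
{
    if (phys < MB(16))
        return ZONE_KIND_DMA;
    if (phys < MB(896))
        return ZONE_KIND_NORMAL;
    return ZONE_KIND_HIGHMEM;
}
```

An address at 8 MB falls in the DMA zone, one at 512 MB in the normal zone, and one at 1 GB in the high-memory zone.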

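Why 896 MB here rather than the 1 GB mentioned above? Because within its 1 GB of address space the kernel also has to reserve virtual address space for highmem mappings, fixmap and vmalloc.

So what is a virtual address in the kernel? Here a virtual address is the same thing as a logical address, as opposed to a physical address.

Let us look at how physical addresses and kernel virtual addresses are related in kernel space: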

128#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
…
132#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
133#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

pa stands for physical address and va for virtual address. Next we have to look at __PAGE_OFFSET:

68/*
69 * This handles the memory map.. We could make this a config
70 * option, but too many people screw it up, and too few need
71 * it.
72 *
73 * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
74 * a virtual address space of one gigabyte, which limits the
75 * amount of physical memory you can use to about 950MB.
76 *
77 * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
78 * and CONFIG_HIGHMEM64G options in the kernel configuration.
79 */
81#define __PAGE_OFFSET (0xC0000000)

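There is a long comment and a single line of definition. On a 32-bit machine, Linux paging provides 4 GB of logical (virtual) address space. Of those 4 GB, the top 1 GB, from 0xC0000000 to 0xFFFFFFFF, is used by the kernel itself and is called "kernel space"; the lower 3 GB are user space. Note that these are virtual, logical addresses.

So __PAGE_OFFSET is the boundary between user space and kernel space in the virtual address space. Physical addresses, however, always start from 0x00000000, so for the directly mapped part of kernel space pa and va differ by exactly PAGE_OFFSET. PAGE_OFFSET also marks the upper limit of user space.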
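The constant-offset relation behind __pa and __va can be checked with plain integer arithmetic. The sketch below works on 32-bit integers instead of pointers so that it runs anywhere; virt_to_phys_linear() and phys_to_virt_linear() are hypothetical names for this example, and the relation only holds for the directly mapped region.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u

/* Directly mapped kernel virtual address -> physical address. */
static uint32_t virt_to_phys_linear(uint32_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}

/* Physical address -> directly mapped kernel virtual address. */
static uint32_t phys_to_virt_linear(uint32_t paddr)
{
    return paddr + PAGE_OFFSET;
}
```

Kernel virtual address 0xC0100000 corresponds to physical address 0x00100000, and converting back returns the original value.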

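So kernel space can "linearly map" at most 1 GB of physical addresses, and in practice less, as the 896 MB figure shows. Without ZONE_HIGHMEM to manage physical memory above that limit, that memory would be wasted. For this reason the system reserves 128 MB of virtual address space at initialisation for mappings that may be set up later. This applies to the x86 architecture; on architectures where all physical memory can be mapped directly, ZONE_HIGHMEM is empty.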

Now back to the zones. /include/linux/mmzone.h has the following data structure for managing a zone:

(The code is rather long, so we go through it in pieces.)

39/*
40 * On machines where it is needed (eg PCs) we divide physical memory
41 * into multiple physical zones. On a PC we have 3 zones:
42 *
43 * ZONE_DMA < 16 MB ISA DMA capable memory
44 * ZONE_NORMAL 16-896 MB direct mapped by the kernel
45 * ZONE_HIGHMEM > 896 MB only page cache and user processes
46 */

This comment describes how the three zones are laid out.

47typedef struct zone_struct {
48 /*
49 * commonly accessed fields:
50 */
51 spinlock_t lock;
52 unsigned long free_pages;

These are the frequently accessed fields. The spinlock_t type that appears here is defined in /include/asm-i386/spinlock.h:

22/*

23 * your basic smp spinlocks, allowing only a single cpu anywhere

24 */

26typedef struct {

27 volatile unsigned int lock;

28#if SPINLOCK_DEBUG

29 unsigned magic;

30#endif

31} spinlock_t;

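The comment says this is the basic SMP spinlock: on a multiprocessor system only one CPU at a time may hold the lock and be inside the region it protects.

free_pages is the number of free pages the zone currently has.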

53 /*

54 * we don’t know if the memory that we’re going to allocate will be freeable

55 * or/and it will be released eventually, so to avoid totally wasting several

56 * gb of ram we must reserve some of the lower zone memory (otherwise we risk

57 * to run oom on the lower zones despite there’s tons of freeable ram

58 * on the higher zones).

59 */

60 zone_watermarks_t watermarks[MAX_NR_ZONES];

The comment explains that this exists to hold back some memory in the lower zones. It also brings in another new data type:

34typedef struct zone_watermarks_s {

35 unsigned long min, low, high;

36} zone_watermarks_t;

62 /*

63 * the below fields are protected by different locks (or by

64 * no lock at all like need_balance), so they’re longs to

65 * provide an atomic granularity against each other on

66 * all architectures.

67 */

68 unsigned long need_balance;

69 /* protected by the pagemap_lru_lock */

70 unsigned long nr_active_pages, nr_inactive_pages;

71 /* protected by the pagecache_lock */

72 unsigned long nr_cache_pages;

75 /*

76 * free areas of different sizes

77 */

78 free_area_t free_area[MAX_ORDER];

This brings in free_area_t:

27typedef struct free_area_struct {

28 struct list_head free_list;

29 unsigned long *map;

30} free_area_t;

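free_area[MAX_ORDER] is a set of queues, one per block size. Queue k links the free blocks made of 2^k physically contiguous pages, so the set covers single pages as well as contiguous runs of 2, 4, 8 and more pages. Each queue is implemented through the struct list_head free_list member of free_area_t; see list.h.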
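The relation between a queue index (the "order") and block size can be sketched in a few lines. pages_in_order() and order_for_pages() are hypothetical helpers for this example and are not kernel functions; they only show which queue a request of a given size would be served from.

```c
/* Number of contiguous pages in a block of the given order: 2^order. */
static unsigned long pages_in_order(unsigned int order)
{
    return 1UL << order;
}

/* Smallest order whose block holds at least nr_pages pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    while (pages_in_order(order) < nr_pages)
        order++;
    return order;
}
```

A request for 3 pages, for example, maps to order 2, whose blocks are 4 pages long, so one page of the block is more than was asked for.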

80 /*

81 * wait_table — the array holding the hash table

82 * wait_table_size — the size of the hash table array

83 * wait_table_shift — wait_table_size

84 * == bits_per_long (1 << wait_table_bits)

85 *

86 * the purpose of all these is to keep track of the people

87 * waiting for a page to become available and make them

88 * runnable again when possible. the trouble is that this

89 * consumes a lot of space, especially when so few things

90 * wait on pages at a given time. so instead of using

91 * per-page waitqueues, we use a waitqueue hash table.

92 *

93 * the bucket discipline is to sleep on the same queue when

94 * colliding and wake all in that wait queue when removing.

95 * when something wakes, it must check to be sure its page is

96 * truly available, a la thundering herd. the cost of a

97 * collision is great, but given the expected load of the

98 * table, they should be so rare as to be outweighed by the

99 * benefits from the saved space.

100 *

101 * __wait_on_page() and unlock_page() in mm/filemap.c, are the

102 * primary users of these fields, and in mm/page_alloc.c

103 * free_area_init_core() performs the initialization of them.

104 */

105 wait_queue_head_t * wait_table;

106 unsigned long wait_table_size;

107 unsigned long wait_table_shift;

Some further zone information follows:

109 /*

110 * discontig memory support fields.

111 */

112 struct pglist_data *zone_pgdat;

113 struct page *zone_mem_map;

114 unsigned long zone_start_paddr;

115 unsigned long zone_start_mapnr;

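Line 112 is the memory node this zone belongs to; line 113 is the zone's part of the memory map (its page array); line 114 is the physical start address of the zone, and line 115 is its starting index in mem_map. All of this can be read directly off the variable names.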

117 /*

118 * rarely used fields:

119 */

120 char *name;

121 unsigned long size;

122 unsigned long realsize;

123} zone_t;

Line 120 is the zone's name, line 121 its size, and line 122 its actually usable size.

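With multiple CPUs came NUMA (Non-Uniform Memory Access). Under NUMA each CPU has its own local memory module, and there may also be a shared memory module. The cost of an access depends on where the memory sits: a CPU reaches its local module fastest and memory attached to another CPU more slowly. When a request cannot be satisfied from a CPU's own module, it is better served from the shared module than by reaching far into a module managed by another CPU. The physical page management scheme was revised accordingly.

Under NUMA, a range of physical pages that a CPU reaches at the same cost is called a node. mem_map is then no longer a single global variable but belongs to a specific node. Zones are no longer at the top level either: they are owned by nodes, and each node has at least two zones. So above zone_struct there is the pglist_data data structure, defined in /include/linux/mmzone.h: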

142/*
143 * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
144 * (mostly NUMA machines?) to denote a higher-level memory zone than the
145 * zone_struct denotes.
146 *
147 * On NUMA machines, each NUMA node would have a pg_data_t to describe
148 * it’s memory layout.
149 *
150 * XXX: we need to move the global memory statistics (active_list, …)
151 * into the pg_data_t to properly support NUMA.
152 */
153struct bootmem_data;
154typedef struct pglist_data {
155 zone_t node_zones[MAX_NR_ZONES];
156 zonelist_t node_zonelists[GFP_ZONEMASK+1];

157 int nr_zones;

158 struct page *node_mem_map;

159 unsigned long *valid_addr_bitmap;

160 struct bootmem_data *bdata;

161 unsigned long node_start_paddr;

162 unsigned long node_start_mapnr;

163 unsigned long node_size;

164 int node_id;

165 struct pglist_data *node_next;

166} pg_data_t;

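Start with line 158, struct page *node_mem_map. Each node has its own range of pages, and node_mem_map is the page structure array that describes them. Then the first member, on line 155: zone_t node_zones[MAX_NR_ZONES] holds the zones the node owns. Correspondingly, zone_struct has the member struct pglist_data *zone_pgdat (line 112), which points back to the pglist_data of the node the zone belongs to.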

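--------------------------------------cut-line -- the data structures above are used for physical page management

-- evening of 2009-04-20

Data structures and functions (continued)

From here on we deal with the data structures and functions used for virtual memory management.

The virtual address space a process uses is usually a set of separate intervals. The data structure for one interval is defined in /include/linux/mm.h: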

38/*

39 * this struct defines a memory vmm memory area. there is one of these

40 * per vm-area/task. a vm area is any part of the process virtual memory

41 * space that has a special rule for the page-fault handlers (ie a shared

42 * library, the executable area etc).

43 */

44struct vm_area_struct {

45 struct mm_struct * vm_mm; /* the address space we belong to. */

46 unsigned long vm_start; /* our start address within vm_mm. */

47 unsigned long vm_end; /* the first byte after our end address

48 within vm_mm. */

50 /* linked list of vm areas per task, sorted by address */

51 struct vm_area_struct *vm_next;

53 pgprot_t vm_page_prot; /* access permissions of this vma. */

54 unsigned long vm_flags; /* flags, listed below. */

56 rb_node_t vm_rb;

58 /*

59 * for areas with an address space and backing store,

60 * one of the address_space->i_mmap{,shared} lists,

61 * for shm areas, the list of attaches, otherwise unused.

62 */

63 struct vm_area_struct *vm_next_share;

64 struct vm_area_struct **vm_pprev_share;

66 /* function pointers to deal with this struct. */

67 struct vm_operations_struct * vm_ops;

69 /* information about our backing store: */

70 unsigned long vm_pgoff; /* offset (within vm_file) in PAGE_SIZE
71 units, *not* PAGE_CACHE_SIZE */
72 struct file * vm_file; /* file we map to (can be NULL). */

73 unsigned long vm_raend; /* xxx: put full readahead info here. */

74 void * vm_private_data; /* was vm_pte (shared mem) */

75};

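Line 45 declares a pointer to an mm_struct, a structure we come to later. vm_start and vm_end are the start and end of this vm_area; note that vm_end is the first address after the vm_area and does not belong to it.

Line 51 declares vm_next, a pointer to a vm_area_struct. Because the intervals a process uses are separate, they are linked into a list to keep them together, and vm_next points to the next vm_area. The list is sorted by address.

The pgprot_t vm_page_prot on line 53 holds the protection information of this vm_area; pgprot_t was discussed earlier.

The vm_flags on line 54 holds the flags of this vm_area, which are: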

77/*
78 * vm_flags..
79 */
80#define VM_READ 0x00000001 /* currently active flags */
81#define VM_WRITE 0x00000002
82#define VM_EXEC 0x00000004
83#define VM_SHARED 0x00000008
85#define VM_MAYREAD 0x00000010 /* limits for mprotect() etc */
86#define VM_MAYWRITE 0x00000020
87#define VM_MAYEXEC 0x00000040
88#define VM_MAYSHARE 0x00000080
90#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
91#define VM_GROWSUP 0x00000200
92#define VM_SHM 0x00000400 /* shared memory area, don’t swap out */
93#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
95#define VM_EXECUTABLE 0x00001000
96#define VM_LOCKED 0x00002000
97#define VM_IO 0x00004000 /* memory mapped I/O or similar */
99 /* used by sys_madvise() */
100#define VM_SEQ_READ 0x00008000 /* app will access data sequentially */
101#define VM_RAND_READ 0x00010000 /* app will not benefit from clustered reads */
102
103#define VM_DONTCOPY 0x00020000 /* do not copy this vma on fork */
104#define VM_DONTEXPAND 0x00040000 /* cannot expand with mremap() */
105#define VM_RESERVED 0x00080000 /* don’t unmap it from swap_out */
106
107#ifndef VM_STACK_FLAGS
108#define VM_STACK_FLAGS 0x00000177
109#endif

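Lines 80 to 83 say whether pages in the area can be read, written, executed and shared.

Lines 85 to 88 say whether the corresponding flags on lines 80 to 83 may be set later, for example by mprotect().

Line 95 says the area contains executable code.

Line 96 says the area is locked.

The other flags are explained by their comments.

A common question here: a vm_area may contain many pages, so why is there only one vm_page_prot and one vm_flags? Because all pages in one vm_area must share the same protection information and status flags.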
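Because one vm_flags word covers the whole area, a permission check is a single bitmask test. The sketch below uses the flag values from the listing; vma_permits() is a hypothetical helper for this example and is simpler than the checks the kernel actually performs.

```c
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL

/* Return 1 if every access bit in 'wanted' is set in 'vm_flags', else 0. */
static int vma_permits(unsigned long vm_flags, unsigned long wanted)
{
    return (vm_flags & wanted) == wanted;
}
```

An area flagged VM_READ | VM_EXEC, such as a typical code segment, permits reads but not writes.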

Back to vm_area_struct: line 56 is rb_node_t vm_rb, where rb_node_t is the node type of a red-black tree. The node structure is as follows:

100typedef struct rb_node_s

101{

102 struct rb_node_s * rb_parent;

103 int rb_color;

104#define RB_RED 0
105#define RB_BLACK 1

106 struct rb_node_s * rb_right;

107 struct rb_node_s * rb_left;

108}

109rb_node_t;

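The red-black tree is used because a search through the linked list always has to start from the head, which hurts efficiency.

Lines 63 and 64 link the area with the next and previous areas on a sharing list.

Line 67 declares vm_ops, which points to a vm_operations_struct. That structure is defined in /include/linux/mm.h: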

128/*

129 * these are the virtual mm functions - opening of an area, closing and

130 * unmapping it (needed to keep files on disk up-to-date etc), pointer

131 * to the functions called when a no-page or a wp-page exception occurs.

132 */

133struct vm_operations_struct {

134 void (*open)(struct vm_area_struct * area);

135 void (*close)(struct vm_area_struct * area);

136 struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);

137};

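vm_ops is a pointer through which operation functions can be applied to this vm_area. open and close are called when the virtual memory area is opened and closed, and nopage is called when a requested page is not in memory.

The remaining members of vm_area_struct are explained by their comments.

At the start of the vm_area_struct discussion we mentioned mm_struct.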

206struct mm_struct {

207 struct vm_area_struct * mmap; /* list of vmas */

208 rb_root_t mm_rb;

209 struct vm_area_struct * mmap_cache; /* last find_vma result */

210 pgd_t * pgd;

211 atomic_t mm_users; /* how many users with user space? */

212 atomic_t mm_count; /* how many references to “struct mm_struct” (users count as 1) */

213 int map_count; /* number of vmas */

214 struct rw_semaphore mmap_sem;

215 spinlock_t page_table_lock; /* protects task page tables and mm->rss */

216

217 struct list_head mmlist; /* list of all active mm’s. these are globally strung

218 * together off init_mm.mmlist, and are protected

219 * by mmlist_lock

220 */

221

222 unsigned long start_code, end_code, start_data, end_data;

223 unsigned long start_brk, brk, start_stack;

224 unsigned long arg_start, arg_end, env_start, env_end;

225 unsigned long rss, total_vm, locked_vm;

226 unsigned long def_flags;

227 unsigned long cpu_vm_mask;

228 unsigned long swap_address;

229

230 unsigned dumpable:1;

231

232 /* architecture-specific mm context */

233 mm_context_t context;

234};

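mmap on line 207 points to the list of virtual memory areas.

Line 208 is the root of the red-black tree.

mmap_cache on line 209 points to the most recently found virtual memory area. An area spans several pages, so the next page requested is likely to be in the same area.

pgd on line 210 is the process's page directory. When the kernel schedules a process to run, it converts this pointer to a physical address and writes it into control register cr3.

mm_users on line 211 counts the users of the user address space, and mm_count on line 212 counts the references to this mm_struct.

map_count on line 213 is the number of vm_areas.

Lines 214 and 215 are for synchronisation: mmap_sem is a read-write semaphore, and page_table_lock is a spinlock protecting the page tables.

Line 217 links the structure into the list of mm_structs.

The purpose of the remaining members is fairly obvious.

From mm_users and mm_count we can see that one mm_struct may be referenced by several processes, but a process uses only one mm_struct.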

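To sum up so far:

1. Virtual memory is handled by vm_area_struct and mm_struct. A 32-bit machine has a 4 GB virtual address space, of which the range from 3 GB to 4 GB is kernel space and the rest is user space. mm_struct is the abstraction of the user address space and sits at the top of virtual memory management; vm_area_struct is subordinate to mm_struct. A process may have many vmas, and they are organised both as a linked list and as a red-black tree: the list suits walking the areas in address order, and the tree speeds up lookups when there are many areas. mmap in mm_struct points to the vma list, and map_count gives the number of vmas. When a process starts running, the pgd (page directory) of its mm_struct is written into control register cr3, so cr3, the starting point of the paging mechanism, has its content.

2. Once cr3 is set, paging can proceed. The memory management unit, which is responsible for translating virtual addresses to physical addresses, reads cr3 and then uses the pgd and the lower-level tables to complete the mapping.

In addition, to find the area a given virtual address of a process belongs to, and the corresponding vma structure, find_vma can be used: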

666/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
667struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
668{
669 struct vm_area_struct *vma = NULL;

670

671 if (mm) {

672 /* check the cache first. */

673 /* (cache hit rate is typically around 35%.) */

674 vma = mm->mmap_cache;

675 if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {

676 rb_node_t * rb_node;

677

678 rb_node = mm->mm_rb.rb_node;

679 vma = NULL;

680

681 while (rb_node) {

682 struct vm_area_struct * vma_tmp;

683

684 vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

685

686 if (vma_tmp->vm_end > addr) {

687 vma = vma_tmp;

688 if (vma_tmp->vm_start <= addr)

689 break;

690 rb_node = rb_node->rb_left;

691 } else

692 rb_node = rb_node->rb_right;

693 }

694 if (vma)

695 mm->mmap_cache = vma;

696 }

697 }

698 return vma;

699}

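find_vma first checks mmap_cache; on a miss it searches the red-black tree. It returns the first vma whose vm_end is greater than addr, so the caller still has to check vm_start to know whether the address really lies inside the area. If it returns NULL, no such vma exists, and a new virtual memory area may have to be set up.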
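The lookup rule "first area with addr < vm_end" is easy to misread, so here is a simplified sketch of it. It walks an address-sorted singly linked list instead of the red-black tree and has no cache; struct vma_stub and find_vma_list() are hypothetical names for this example.

```c
#include <stddef.h>

/* Stand-in for vm_area_struct, reduced to what the lookup needs. */
struct vma_stub {
    unsigned long vm_start;      /* first address of the area */
    unsigned long vm_end;        /* first address after the area */
    struct vma_stub *vm_next;    /* next area, sorted by address */
};

/* Return the first area with addr < vm_end, or NULL if there is none. */
static struct vma_stub *find_vma_list(struct vma_stub *head, unsigned long addr)
{
    struct vma_stub *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next) {
        if (addr < vma->vm_end)
            return vma;
    }
    return NULL;
}
```

With areas [0x1000, 0x2000) and [0x5000, 0x6000), looking up 0x3000 returns the second area even though 0x3000 is not inside it. That is why the caller compares vm_start with the address afterwards.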

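3. Out-of-bounds access

Paging translates virtual addresses into physical addresses, but not every translation succeeds. It can fail in these cases:

1) an entry such as the pgd or pte is empty along the way, so the mapping has not been established;

2) the physical page is not in memory;

3) the access does not match the permissions.

These cases are handled by do_page_fault() in /arch/i386/mm/fault.c: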

130/*

131 * this routine handles page faults. it determines the address,

132 * and the problem, and then passes it off to one of the appropriate

133 * routines.

134 *

135 * error_code:

136 * bit 0 == 0 means no page found, 1 means protection fault

137 * bit 1 == 0 means read, 1 means write

138 * bit 2 == 0 means kernel, 1 means user-mode

139 */

140asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)

141{

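According to the comment, bit 0 of the error code is 0 when the page was not present and 1 for a protection fault; bit 1 is 0 when the fault was caused by a read and 1 when it was caused by a write; bit 2 is 0 when the fault happened in kernel mode and 1 in user mode.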
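The three bits can be decoded with simple masks. The sketch below follows the bit meanings given in the comment; struct fault_bits and decode_error_code() are hypothetical names for this example.

```c
/* Decoded view of the low three bits of the page-fault error code. */
struct fault_bits {
    int protection;  /* bit 0: 1 = protection fault, 0 = page not present */
    int write;       /* bit 1: 1 = write access, 0 = read access */
    int user;        /* bit 2: 1 = user mode, 0 = kernel mode */
};

static struct fault_bits decode_error_code(unsigned long error_code)
{
    struct fault_bits f;

    f.protection = (error_code & 1UL) != 0;
    f.write      = (error_code & 2UL) != 0;
    f.user       = (error_code & 4UL) != 0;
    return f;
}
```

An error code of 6, for instance, describes a user-mode write to a page that is not present.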

The page fault handler takes two parameters: regs points to the saved register state from before the fault, and error_code is as described above.

151 /* get the address */

152 __asm__("movl %%cr2,%0":"=r" (address));

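These two lines fetch the linear address that caused the failed translation. The CPU stores it in cr2, and it is read with inline assembly.

The first case handled is a fault in kernel space that is not a protection fault: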

160 /*

161 * we fault-in kernel-space virtual memory on-demand. the

162 * ‘reference’ page table is init_mm.pgd.

163 *

164 * note! we must not take any locks for this case. we may

165 * be in an interrupt or a critical region, and should

166 * only copy the information from the master page table,

167 * nothing more.

168 *

169 * this verifies that the fault happens in kernel space

170 * (error_code & 4) == 0, and that the fault was not a

171 * protection error (error_code & 1) == 0.

172 */

173 if (address >= TASK_SIZE && !(error_code & 5))

174 goto vmalloc_fault;

175

176 mm = tsk->mm;

177 info.si_code = SEGV_MAPERR;

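As the comment says, the if condition ensures that the fault address is in kernel space, that the fault happened in kernel mode, and that it is not a protection fault. Such faults go to vmalloc_fault, a label further down in the same function.

Next comes the case of a fault inside an interrupt, or in a context with no user address space: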

179 /*

180 * if we’re in an interrupt or have no user

181 * context, we must not take the fault..

182 */

183 if (in_interrupt() || !mm)

184 goto no_context;

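Below this is the handling of stack overruns. When a process has used up its stack and performs another push, the stack grows downwards, so the data normally lands at (%esp-4); an instruction that pushes 32 bytes at once, such as pusha, can reach down to (%esp-32).

188 vma = find_vma(mm, address);

This looks up the virtual memory area.

If none is found:

189 if (!vma)
190 goto bad_area;

control goes to bad_area.

If one is found and the address is not below the start of the vma (the address lies inside the area, so this is not the stack-growth case), control goes to good_area:

191 if (vma->vm_start <= address)
192 goto good_area;

Otherwise the address lies in a hole just below the vma. If that vma is the stack, its VM_GROWSDOWN flag is set. For a user-mode access further down than %esp-32, control goes to bad_area; otherwise the stack is extended by calling expand_stack():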

193 if (!(vma->vm_flags & VM_GROWSDOWN))
194 goto bad_area;
195 if (error_code & 4) {
196 /*
197 * accessing the stack below %esp is always a bug.
198 * The "+ 32" is there due to some instructions (like

199 * pusha) doing post-decrement on the stack and that

200 * doesn’t show up until later..

201 */

202 if (address + 32 < regs->esp)

203 goto bad_area;

204 }

205 if (expand_stack(vma, address))

206 goto bad_area;

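The stack cannot grow without limit, though: each process has a limit, and exceeding it leads to bad_area. If the extension is allowed, control continues at good_area, where the mapping of the new page to physical memory is completed.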
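The "+ 32" test from line 202 can be isolated as a tiny predicate. This is a sketch of that one comparison only; too_far_below_stack() is a hypothetical name for this example, and the sample stack pointer value used in the note below is arbitrary.

```c
/*
 * Return 1 if a user-mode access at 'address' is more than 32 bytes
 * below the stack pointer 'esp', which the handler treats as a bug;
 * return 0 if it is close enough to be plausible stack growth.
 */
static int too_far_below_stack(unsigned long address, unsigned long esp)
{
    return address + 32 < esp;
}
```

With esp at 0xBFFFF000, accesses at esp-4 and esp-32 are accepted as stack growth, while an access at esp-64 is rejected.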

For the full handling, see /arch/i386/mm/fault.c.
