A Case Study of Virtual Address Exhaustion in the Linux Kernel

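Reposted from: http://www.cnblogs.com/xmphoenix/p/3627007.html

Introduction

The Android smartphone market is red-hot and hardware upgrades are arriving fast; an ARM Cortex-A9 with 1GB of DDR already seems to lag behind mainstream configurations. Hardware may be king, but we cannot help wondering whether such powerful hardware is actually being used to the full. From now on, therefore, my kernel analysis will focus on the ARM platform.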

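Main Text

Two resources matter most in Linux memory management: virtual addresses and physical addresses. That may sound like stating the obvious, but memory management really does revolve around these two concepts. If you do not yet have a picture of how the Linux kernel manages virtual and physical addresses, I suggest browsing reference [2], an excellent and concise book. Reference [1] covers more of the implementation details.

The main goal of this article is to give an overall picture of how the kernel's 1GB virtual address space is mapped, including:

1. What exactly is the kernel's 1GB virtual address space used for?

2. How it maps to actual physical addresses.

3. Some board-specific macro definitions. For ease of future reference I have also collected these macros; with them you can easily draw the kernel virtual address map of your own platform.

Let me state up front that the mapping plan in this example is not necessarily optimal, but it is a real one. In fact I personally think many parts of it are open to debate.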

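As the figure below shows, the pink region, 0xbf800000 ~ 0xc0000000, is reserved for modules and pkmap. From the board-level macro definitions below we can see that modules are placed here because they must lie within 32MB of the kernel code segment. I am not yet sure why pkmap is placed in this range; it is used when mapping highmem.

The orange region, 0xc0000000 ~ 0xe0000000, maps lowmem (low memory, i.e. zone[Normal]). This is a one-to-one linear mapping: once the kernel has set it up during initialization, these page table entries never change. That avoids repeatedly modifying page tables and flushing the TLB. (The TLB can be thought of as a hardware cache of the page tables: if the translation for a virtual address is in this cache, the CPU does not have to walk the page tables in DDR, which improves access efficiency.) This address range is clearly precious because the mapping is so efficient. The figure shows that, of the 512MB mapped here, 128MB is reserved for PMEM (Android's mechanism for managing physically contiguous memory) and 16MB for the CP (the region the modem runs in). Only about 360MB of lowmem is actually usable.

The blue region, 0xe0000000 ~ 0xf0000000, maps highmem (high memory, i.e. zone[HighMem]). Because the example board has 1GB of DDR, part of the physical address space has to be mapped as high memory. This 256MB range is also the VMALLOC area examined in the analysis later on.

The green region, 0xf0000000 ~ 0xffc00000, is the IO mapping area. In kernel space, for example when writing a driver, we need to access the chip's registers (IO space). Some IO mappings are created dynamically by ioremap in the VMALLOC area, and others are mapped statically by iotable_init during system initialization. The figure shows about 200MB of the static IO mapping area unused. Isn't that rather wasteful?

There is nothing special about the purple region; it simply follows the ARM default definitions.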
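Since the original figure is not reproduced here, the region boundaries quoted above can be tabulated to check their sizes. The following is only a small sketch: the struct, the array, and the function names are made up for illustration, and the addresses are the ones given in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One named range of kernel virtual addresses: [start, end). */
struct vregion {
    const char *name;
    uint32_t start;
    uint32_t end;
};

/* Boundaries quoted in the text; names and layout are illustrative only. */
static const struct vregion kernel_map[] = {
    { "modules + pkmap", 0xbf800000u, 0xc0000000u },
    { "lowmem",          0xc0000000u, 0xe0000000u },
    { "vmalloc/highmem", 0xe0000000u, 0xf0000000u },
    { "IO mappings",     0xf0000000u, 0xffc00000u },
};

/* Size of one region in MB. */
uint32_t region_mb(const struct vregion *r)
{
    assert(r->end >= r->start);
    return (r->end - r->start) >> 20;
}

/* Print every region with its boundaries and size, one per line. */
void print_kernel_map(void)
{
    size_t i;

    for (i = 0; i < sizeof(kernel_map) / sizeof(kernel_map[0]); i++)
        printf("%-16s 0x%08x - 0x%08x %4u MB\n",
               kernel_map[i].name,
               (unsigned)kernel_map[i].start,
               (unsigned)kernel_map[i].end,
               (unsigned)region_mb(&kernel_map[i]));
}
```

Calling print_kernel_map() lists the four regions; the sizes work out to 8MB, 512MB, 256MB and 252MB, which matches the 512MB lowmem and 256MB VMALLOC figures used in the rest of the article.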

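The figure below shows how the kernel virtual address space maps to actual physical addresses.

Now for the exciting part: let's see what is wrong with this mapping.

I actually hit a bug on this platform: during stress testing with monkey test, vmalloc would start failing after the system had been running for a long time. A vmalloc call failing while plenty of physical memory is still free: strange, isn't it?

[Error log] The system's graphics module failed when requesting a block on the order of 1MB with vmalloc.

[Analysis]

1. First, look at the basic memory information at that moment. /proc/meminfo shows 156MB of physical memory still available, so memory was not exhausted. The VMALLOC virtual address space used by vmalloc still had 22MB free, which should also be enough. By design, vmalloc calls alloc_page() to take individual pages from the buddy system (that is, pages from the order-0, 2^0, free list). There were plenty of pages, so why did the allocation fail? vmalloc requires the virtual addresses to be contiguous; could it be that VMALLOC no longer had a contiguous virtual range of the requested size?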
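The numbers in this step come from fields of /proc/meminfo such as MemFree, VmallocTotal, VmallocUsed and VmallocChunk (the largest free contiguous block of the vmalloc area). A minimal userspace sketch for pulling one field out is shown below; the function names are hypothetical, and on newer kernels the Vmalloc fields may no longer carry meaningful values.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value (in kB) of a /proc/meminfo field such as "VmallocUsed",
 * or -1 if the field is absent. `text` holds the whole file contents.
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* A field line looks like "VmallocUsed:   239616 kB". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;

            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Convenience wrapper: read /proc/meminfo and look up one field. */
long read_meminfo_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_kb(buf, key);
}
```

For example, read_meminfo_kb("VmallocTotal") minus read_meminfo_kb("VmallocUsed") gives the free VMALLOC space in kB, which is how a figure like the 22MB above would be obtained.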

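2. With this question in mind, we go on to analyze /proc/vmallocinfo.

According to /proc/vmallocinfo, VMALLOC was already in use up to 0xefeff000, so the largest contiguous free range was 0xf0000000 - 0xefeff000 = 0x101000. Remember the size we were trying to allocate? That's right, 0x1a0000. This was the first time I had seen kernel virtual addresses run out. Then why does meminfo still report 22MB of free VMALLOC virtual addresses? Evidently this virtual address range had also become heavily fragmented.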
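The check performed by hand above can be written as a scan over the allocated ranges. The sketch below assumes the ranges have already been extracted from /proc/vmallocinfo (each line of that file starts with a start-end address pair) and are sorted and non-overlapping; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One allocated range [start, end), as listed per line in /proc/vmallocinfo. */
struct vrange {
    uint32_t start;
    uint32_t end;
};

/*
 * Largest free hole inside [area_start, area_end), given the allocated
 * ranges sorted by start address and not overlapping each other.
 */
uint32_t largest_hole(const struct vrange *used, size_t n,
                      uint32_t area_start, uint32_t area_end)
{
    uint32_t cursor = area_start;   /* end of the previous allocation */
    uint32_t best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(used[i].start >= cursor);
        assert(used[i].end >= used[i].start && used[i].end <= area_end);
        if (used[i].start - cursor > best)
            best = used[i].start - cursor;
        cursor = used[i].end;
    }
    /* The gap between the last allocation and the end of the area. */
    if (area_end - cursor > best)
        best = area_end - cursor;
    return best;
}
```

With the area fully used up to 0xefeff000, the function returns 0x101000, which is smaller than the 0x1a0000 request, so the allocation cannot succeed however much total space is reported free.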

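So the virtual address resource is exhausted and we seem to be at a dead end. Still, in the spirit of investigation, we have to ask why so much of the VMALLOC range is in use; after all, we set aside 256MB for it. With so much physical memory left, why not just call kmalloc or get_free_pages directly?

3. Next, look at how physical memory was distributed at that moment.

/proc/buddyinfo shows the overall allocation state of the buddy system, and /proc/pagetypeinfo adds more detail related to fragmentation management.

A quick look at pagetypeinfo: the kernel divides physical memory into zones, and my platform has zone[Normal] and zone[HighMem]. Migrate types were designed to avoid memory fragmentation; see reference [1] if they are unfamiliar. /proc/pagetypeinfo shows that the largest contiguous block we could get was 2^7 pages, i.e. 512KB. That could not satisfy the graphics module's request, which goes some way towards explaining why graphics was using vmalloc so heavily.

/proc/buddyinfo output.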
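To read the largest available block off a /proc/buddyinfo line, find the highest order whose free count is non-zero and convert it to a size. The sketch below assumes 4KB pages and the usual "Node N, zone NAME count0 count1 ..." line layout; the function names and the PAGE_SIZE_KB constant are illustrative, not taken from the original post.

```c
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE_KB 4UL   /* assumption: 4KB pages */

/*
 * Highest order with at least one free block on a /proc/buddyinfo line,
 * or -1 if the line cannot be parsed or every count is zero.
 */
int max_free_order(const char *line)
{
    const char *p = strstr(line, "zone");
    char zone[32];
    unsigned long count;
    int consumed;
    int order = 0;
    int best = -1;

    /* Skip the "zone <name>" prefix; %n records how far we got. */
    if (!p || sscanf(p, "zone %31s%n", zone, &consumed) != 1)
        return -1;
    p += consumed;

    /* Each following number is the free-block count for the next order. */
    while (sscanf(p, "%lu%n", &count, &consumed) == 1) {
        if (count > 0)
            best = order;
        order++;
        p += consumed;
    }
    return best;
}

/* Size in KB of one block of the given order. */
unsigned long order_to_kb(int order)
{
    return (1UL << order) * PAGE_SIZE_KB;
}
```

A line whose last non-zero count sits at order 7 gives order_to_kb(7), i.e. 512KB, the figure quoted above.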

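4. Conclusion

From the analysis above: graphics requests contiguous memory from the kernel's buddy system via get_free_pages(). After running for a while, the buddy system becomes heavily fragmented and graphics can no longer obtain physically contiguous memory, so it falls back to vmalloc to get non-contiguous pages from the buddy system. Unfortunately the VMALLOC virtual address space is exhausted, so vmalloc fails even though plenty of physical memory remains.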

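5. Revisiting the memory map

One problem here is that too little space was planned for lowmem. By default vmalloc allocates from zone[HighMem], which easily fragments highmem. Remember the kernel virtual address map at the beginning? Didn't we have 200MB of virtual address space sitting unused? How nice it would be to map it to lowmem.

I therefore modified the mapping. The biggest change is that lowmem grows from 512MB to 720MB, putting the 200MB of unused virtual address space to good use.

After the change, look at the buddy information again: the largest contiguous allocation available is 2^15 pages = 128MB. This layout also improves memory utilization.

The table below lists some board-specific macro definitions that determine how the kernel virtual address space is laid out. These days few of us get the chance to bring up a new chip from scratch, so these definitions tend to be overlooked. They are nevertheless very important when studying memory layout, and I have collected them here for future reference. You can also try filling in these macros for your own board, and the whole kernel address space map will emerge.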

Board-specific macro definitions

Refer to [Documentation/arm/Porting]. Each entry gives the macro name (with the file that defines it, where known), a description, and an example from this platform.

Decompressor Symbols

Macro name: ZTEXTADDR [arch/arm/boot/compressed/Makefile]
Description: Start address of decompressor.  There's no point in talking about virtual or physical addresses here, since the MMU will be off at the time when you call the decompressor code.  You normally call the kernel at this address to start it booting.  This doesn't have to be located in RAM, it can be in flash or other read-only or read-write addressable medium.
Example: 0x0
    ZTEXTADDR := $(CONFIG_ZBOOT_ROM_TEXT)
    CONFIG_ZBOOT_ROM_TEXT=0x0

Macro name: ZBSSADDR [arch/arm/boot/compressed/Makefile]
Description: Start address of zero-initialised work area for the decompressor. This must be pointing at RAM.  The decompressor will zero initialize this for you.  Again, the MMU will be off.
Example: 0x0
    ZBSSADDR := $(CONFIG_ZBOOT_ROM_BSS)
    CONFIG_ZBOOT_ROM_BSS=0x0

Macro name: ZRELADDR [arch/arm/boot/Makefile]
Description: This is the address where the decompressed kernel will be written, and eventually executed.  The following constraint must be valid:
    __virt_to_phys(TEXTADDR) == ZRELADDR
The initial part of the kernel is carefully coded to be position independent.
Note: the following conditions must always be true:
    ZRELADDR == virt_to_phys(PAGE_OFFSET + TEXT_OFFSET)
Example: 0x81088000
    ZRELADDR := $(zreladdr-y)
    zreladdr-y := $(__ZRELADDR)
    __ZRELADDR = TEXT_OFFSET + 0x80000000 [arch/arm/mach-pxa/Makefile.boot]

Macro name: INITRD_PHYS
Description: Physical address to place the initial RAM disk.  Only relevant if you are using the bootpImage stuff (which only works on the old struct param_struct).
INITRD_PHYS must be in RAM
Example: Not defined

Macro name: INITRD_VIRT
Description: Virtual address of the initial RAM disk.  The following constraint must be valid:
    __virt_to_phys(INITRD_VIRT) == INITRD_PHYS
Example: Not defined

Macro name: PARAMS_PHYS
Description: Physical address of the struct param_struct or tag list, giving the kernel various parameters about its execution environment.
PARAMS_PHYS must be within 4MB of ZRELADDR
Example: Not defined

Kernel Symbols

Macro name: PHYS_OFFSET [arch/arm/include/asm/memory.h]
Description: Physical start address of the first bank of RAM.
Example:
    #define PHYS_OFFSET PLAT_PHYS_OFFSET
    #define PLAT_PHYS_OFFSET UL(0x80000000) [arch/arm/mach-pxa/include/mach/memory.h]

Macro name: PAGE_OFFSET [arch/arm/include/asm/memory.h]
Description: Virtual start address of the first bank of RAM.  During the kernel boot phase, virtual address PAGE_OFFSET will be mapped to physical address PHYS_OFFSET, along with any other mappings you supply. This should be the same value as TASK_SIZE.
Example: CONFIG_PAGE_OFFSET = 0xC0000000

Macro name: TASK_SIZE [arch/arm/include/asm/memory.h]
Description: The maximum size of a user process in bytes.  Since user space always starts at zero, this is the maximum address that a user process can access+1.  The user space stack grows down from this address.
Any virtual address below TASK_SIZE is deemed to be user process area, and therefore managed dynamically on a process by process basis by the kernel.  I'll call this the user segment.
Anything above TASK_SIZE is common to all processes.  I'll call this the kernel segment.
(In other words, you can't put IO mappings below TASK_SIZE, and hence PAGE_OFFSET).
Example: CONFIG_PAGE_OFFSET - 0x01000000 = 0xBF000000

Macro name: TASK_UNMAPPED_BASE [arch/arm/include/asm/memory.h]
Description: The lower boundary of the mmap VM area.
Example: CONFIG_PAGE_OFFSET/3 = 0x40000000

Macro name: MODULES_VADDR [arch/arm/include/asm/memory.h]
Description: The module space lives between the addresses given by TASK_SIZE and PAGE_OFFSET - it must be within 32MB of the kernel text.
TEXT_OFFSET does not allow to use 16MB modules area as ARM32 branches to kernel may go out of range taking into account the kernel .text size
Example: PAGE_OFFSET - 8*1024*1024 = 0xBF800000

Macro name: MODULES_END [arch/arm/include/asm/memory.h]
Description: The highmem pkmap virtual space shares the end of the module area.
Example: 0xBFE00000
    #ifdef CONFIG_HIGHMEM
    #define MODULES_END (PAGE_OFFSET - PMD_SIZE)
    #else
    #define MODULES_END (PAGE_OFFSET)
    #endif

Macro name: TEXTADDR
Description: Virtual start address of kernel, normally PAGE_OFFSET + 0x8000.
This is where the kernel image ends up.  With the latest kernels, it must be located at 32768 bytes into a 128MB region.  Previous kernels placed a restriction of 256MB here.

Macro name: DATAADDR
Description: Virtual address for the kernel data segment.  Must not be defined when using the decompressor.

Macro name: VMALLOC_START, VMALLOC_END [arch/arm/mach-pxa/include/mach/vmalloc.h]
Description: Virtual addresses bounding the vmalloc() area.  There must not be any static mappings in this area; vmalloc will overwrite them. The addresses must also be in the kernel segment (see above). Normally, the vmalloc() area starts VMALLOC_OFFSET bytes above the last virtual RAM address (found using variable high_memory).
Example:
    #define VMALLOC_END (0xf0000000UL)
The default vmalloc size is 128MB.
    vmalloc_min = (VMALLOC_END - SZ_128M); [defined in arch/arm/mm/mmu.c]
If a vmalloc= size is passed on the kernel command line by the OSL, vmalloc_min is redefined.
    early_param("vmalloc", early_vmalloc); [defined in arch/arm/mm/mmu.c]

Macro name: VMALLOC_OFFSET [arch/arm/include/asm/pgtable.h]
Description: Offset normally incorporated into VMALLOC_START to provide a hole between virtual RAM and the vmalloc area.  We do this to allow out of bounds memory accesses (eg, something writing off the end of the mapped memory map) to be caught.  Normally set to 8MB.
Example:
    #define VMALLOC_OFFSET (8*1024*1024)

Macro name: CONSISTENT_DMA_SIZE, CONSISTENT_BASE, CONSISTENT_END [arch/arm/include/asm/memory.h]
Description: Size of DMA-consistent memory region.  Must be multiple of 2M, between 2MB and 14MB inclusive.
Example:
    CONSISTENT_DMA_SIZE = 2MB
    CONSISTENT_BASE = 0xFFC00000
    CONSISTENT_END = 0xFFE00000

Macro name: FIXADDR_START, FIXADDR_TOP, FIXADDR_SIZE [arch/arm/include/asm/fixmap.h]
Description: Fixed virtual addresses.
Example:
    #define FIXADDR_START 0xfff00000UL
    #define FIXADDR_TOP 0xfffe0000UL
    #define FIXADDR_SIZE (FIXADDR_TOP - FIXADDR_START)

Macro name: PKMAP_BASE [arch/arm/include/asm/highmem.h]
Example: 0xBFE00000
    #define PKMAP_BASE (PAGE_OFFSET - PMD_SIZE)
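The example values in the table can be cross-checked against one another with a few lines of arithmetic. The sketch below is userspace C, not kernel code: the function names are made up, PMD_SIZE = 2MB is inferred from PAGE_OFFSET - PMD_SIZE = 0xBFE00000, and TEXT_OFFSET is derived from ZRELADDR rather than stated anywhere in the original.

```c
#include <stdint.h>

/* Example values taken from the table for this platform. */
#define PAGE_OFFSET  0xC0000000u
#define PHYS_OFFSET  0x80000000u
#define ZRELADDR     0x81088000u
#define VMALLOC_END  0xF0000000u
#define PMD_SIZE     0x00200000u   /* assumption: 2MB, inferred from PKMAP_BASE */

/* Linear lowmem translation between kernel virtual and physical addresses. */
uint32_t virt_to_phys(uint32_t va) { return va - PAGE_OFFSET + PHYS_OFFSET; }
uint32_t phys_to_virt(uint32_t pa) { return pa - PHYS_OFFSET + PAGE_OFFSET; }

/* __ZRELADDR = TEXT_OFFSET + 0x80000000, so TEXT_OFFSET follows from ZRELADDR. */
uint32_t text_offset(void) { return ZRELADDR - PHYS_OFFSET; }

/* Virtual start of the kernel image: PAGE_OFFSET + TEXT_OFFSET. */
uint32_t textaddr(void) { return PAGE_OFFSET + text_offset(); }

/* Derived boundaries, following the example column of the table. */
uint32_t task_size(void)          { return PAGE_OFFSET - 0x01000000u; }
uint32_t task_unmapped_base(void) { return PAGE_OFFSET / 3; }
uint32_t modules_vaddr(void)      { return PAGE_OFFSET - 8u * 1024u * 1024u; }
uint32_t modules_end(void)        { return PAGE_OFFSET - PMD_SIZE; } /* CONFIG_HIGHMEM case */
uint32_t pkmap_base(void)         { return PAGE_OFFSET - PMD_SIZE; }

/* Lowest address the vmalloc area may start at for a given vmalloc size. */
uint32_t vmalloc_min(uint32_t vmalloc_size) { return VMALLOC_END - vmalloc_size; }
```

Under these assumptions virt_to_phys(textaddr()) equals ZRELADDR, as the ZRELADDR constraint requires, and vmalloc_min() with a 256MB size gives 0xE0000000, the start of the blue region in the map described earlier.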

References:

[1] Professional Linux Kernel Architecture, Wolfgang Mauerer

[2] Linux Kernel Development, Robert Love

[3] kernel/Documentation/arm/Porting