GoLang-内存管理

一、tcmalloc介绍< 参考资源 >

go的内存管理和tcmalloc(thread-caching malloc)很像,先看一下tcmalloc的实现。

1.1 简介

tcmalloc是google推出的一种内存分配器,常见的内存分配器还有glibc的ptmalloc和google的jemalloc。相比于ptmalloc,tcmalloc性能更好,特别适用于高并发场景。

1.2 tcmalloc算法策略

tcmalloc分配的内存主要来自两个地方:全局缓存堆和进程的私有缓存。对于一些小容量的内存申请使用进程的私有缓存,私有缓存不足的时候可以再从全局缓存申请一部分作为私有缓存。对于大容量的内存申请则需要从全局缓存中进行申请。而大小容量的边界就是32k。缓存的组织方式是一个单链表数组,数组的每个元素是一个单链表,链表中的每个元素具有相同的大小。

1.2.1 Small Object Allocation

小对象内存分配默认会分配86个不同大小的块,而这些块的大小并没有明确说明,需要查一下源码。每种大小的块的数组的长度都采用使用了才初始化,有点类似于lazy-initialize。

Small Object Allocation

1.2.2 Big Object Allocation

对于大于32k的内存申请,使用全局内存来分配。全局内存的组织也是单链表数组,数组长度为256,分别对用1 page大小, 2 page大小(1 page=4k)。

Big Object Allocation

1.2.3 Span

tcmalloc使用span来管理内存分页,一个span可以包含几个连续分页。span的状态只有未分配、作为大对象分配、作为小对象分配。

Span

1.3 golang对比

golang的内存分配并不是和tcmalloc一模一样。

  • 局部缓存并不是分配给进程或者线程,而是分配给P(这个还需要说一下go的goroutine实现)
  • go的GC是stop the world,并不是每个进程单独进行GC。
  • span的管理更有效率

二、基本知识

2.1 golang运行调度

在 Golang 里面有三个基本的概念:G, M, P。

  • G(Goroutine):我们所说的协程,为用户级的轻量级线程,每个Goroutine对象中的sched保存着其上下文信息
  • M(Machine):对内核级线程的封装,数量对应真实的CPU数(真正干活的对象)
  • P(Processor):即为G和M的调度对象,用来调度G和M之间的关联关系,其数量可通过GOMAXPROCS()来设置,默认为核心数。
    一个 Goroutine 的运行需要 G + P + M 三部分结合起来。

2.2 逃逸分析(escape analysis)

逃逸分析请戳 GoLang-逃逸分析

三、关键数据结构

几个关键的地方:

  • mcache: per-P cache,可以认为是 local cache。
  • mcentral: 全局 cache,mcache 不够用的时候向 mcentral 申请。
  • mheap: 当 mcentral 也不够用的时候,通过 mheap 向操作系统申请。
    可以将其看成多级内存分配器。

3.1 mcache

我们知道每个 Gorontine 的运行都是绑定到一个 P 上面,mcache 是每个 P 的 cache。这么做的好处是分配内存时不需要加锁。mcache 结构如下。

//go version 1.10.8
//file runtime/mcache.go

// Per-thread (in Go, per-P) cache for small objects.
// No locking needed because it is per-thread (per-P).
//
// mcaches are allocated from non-GC'd memory, so any heap pointers
// must be specially handled.
//
//go:notinheap
type mcache struct {
    // The following members are accessed on every malloc,
    // so they are grouped here for better caching.
    next_sample int32   // trigger heap sample after allocating this many bytes
    local_scan  uintptr // bytes of scannable heap allocated

    // Allocator cache for tiny objects w/o pointers.
    // See "Tiny allocator" comment in malloc.go.

    // tiny points to the beginning of the current tiny block, or
    // nil if there is no current tiny block.
    //
    // tiny is a heap pointer. Since mcache is in non-GC'd memory,
    // we handle it by clearing it in releaseAll during mark
    // termination.
    tiny             uintptr
    tinyoffset       uintptr
    local_tinyallocs uintptr // number of tiny allocs not counted in other stats

    // The rest is not accessed on every malloc.

    alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass

    stackcache [_NumStackOrders]stackfreelist

    // Local allocator stats, flushed during GC.
    local_nlookup    uintptr                  // number of pointer lookups
    local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
    local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
    local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
}

我们可以暂时只关注 alloc [_NumSizeClasses]*mspan,这是一个大小为 67 的指针(指针指向 mspan )数组(_NumSizeClasses = 67),每个数组元素用来包含特定大小的块。当要分配内存大小时,为 object 在 alloc 数组中选择合适的元素来分配。67 种块大小为 0,8 byte, 16 byte, … ,这个和 tcmalloc 稍有区别。

//go version 1.10.8
//file runtime/sizeclasses.go

var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}

上面的 alloc 类似内存池的 freelist 数组或者链表,正常实现每个数组元素是一个链表,链表由特定大小的块串起来。但是这里统一使用了 mspan 结构,那么只有一种可能,就是 mspan 中记录了需要分配的块大小。我们来看一下 mspan 的结构。

3.2 mspan

span 在 tcmalloc 中作为一种管理内存的基本单位而存在。Golang 的 mspan 的结构如下。

//go version 1.10.8
//file runtime/mheap.go

//go:notinheap
type mspan struct {
    next *mspan     // next span in list, or nil if none
    prev *mspan     // previous span in list, or nil if none
    list *mSpanList // For debugging. TODO: Remove.

    startAddr uintptr // address of first byte of span aka s.base()
    npages    uintptr // number of pages in span

    manualFreeList gclinkptr // list of free objects in _MSpanManual spans

    // freeindex is the slot index between 0 and nelems at which to begin scanning
    // for the next free object in this span.
    // Each allocation scans allocBits starting at freeindex until it encounters a 0
    // indicating a free object. freeindex is then adjusted so that subsequent scans begin
    // just past the newly discovered free object.
    //
    // If freeindex == nelem, this span has no free objects.
    //
    // allocBits is a bitmap of objects in this span.
    // If n >= freeindex and allocBits[n/8] & (1<<(n%8)) is 0
    // then object n is free;
    // otherwise, object n is allocated. Bits starting at nelem are
    // undefined and should never be referenced.
    //
    // Object n starts at address n*elemsize + (start << pageShift).
    freeindex uintptr
    // TODO: Look up nelems from sizeclass and remove this field if it
    // helps performance.
    nelems uintptr // number of object in the span.

    // Cache of the allocBits at freeindex. allocCache is shifted
    // such that the lowest bit corresponds to the bit freeindex.
    // allocCache holds the complement of allocBits, thus allowing
    // ctz (count trailing zero) to use it directly.
    // allocCache may contain bits beyond s.nelems; the caller must ignore
    // these.
    allocCache uint64

    // allocBits and gcmarkBits hold pointers to a span's mark and
    // allocation bits. The pointers are 8 byte aligned.
    // There are three arenas where this data is held.
    // free: Dirty arenas that are no longer accessed
    //       and can be reused.
    // next: Holds information to be used in the next GC cycle.
    // current: Information being used during this GC cycle.
    // previous: Information being used during the last GC cycle.
    // A new GC cycle starts with the call to finishsweep_m.
    // finishsweep_m moves the previous arena to the free arena,
    // the current arena to the previous arena, and
    // the next arena to the current arena.
    // The next arena is populated as the spans request
    // memory to hold gcmarkBits for the next GC cycle as well
    // as allocBits for newly allocated spans.
    //
    // The pointer arithmetic is done "by hand" instead of using
    // arrays to avoid bounds checks along critical performance
    // paths.
    // The sweep will free the old allocBits and set allocBits to the
    // gcmarkBits. The gcmarkBits are replaced with a fresh zeroed
    // out memory.
    allocBits  *gcBits
    gcmarkBits *gcBits

    // sweep generation:
    // if sweepgen == h->sweepgen - 2, the span needs sweeping
    // if sweepgen == h->sweepgen - 1, the span is currently being swept
    // if sweepgen == h->sweepgen, the span is swept and ready to use
    // h->sweepgen is incremented by 2 after every GC

    sweepgen    uint32
    divMul      uint16     // for divide by elemsize - divMagic.mul
    baseMask    uint16     // if non-0, elemsize is a power of 2, & this will get object allocation base
    allocCount  uint16     // number of allocated objects
    spanclass   spanClass  // size class and noscan (uint8)
    incache     bool       // being used by an mcache
    state       mSpanState // mspaninuse etc
    needzero    uint8      // needs to be zeroed before allocation
    divShift    uint8      // for divide by elemsize - divMagic.shift
    divShift2   uint8      // for divide by elemsize - divMagic.shift2
    elemsize    uintptr    // computed from sizeclass or from npages
    unusedsince int64      // first time spotted by gc in mspanfree state
    npreleased  uintptr    // number of pages released to the os
    limit       uintptr    // end of data in span
    speciallock mutex      // guards specials list
    specials    *special   // linked list of special records sorted by offset.
}

从上面的结构可以看出:

  • next, prev: 指针域,因为 mspan 一般都是以链表形式使用。
  • npages: mspan 的大小为 page 大小的整数倍。
  • spanclass(size class and noscan (uint8)): 0 ~ _NumSizeClasses 之间的一个值,这个解释了我们的疑问。比如,spanclass = 3,那么这个 mspan 被分割成 32 byte 的块。
  • elemsize: 通过 sizeclass 或者 npages 可以计算出来。比如 sizeclass = 3, elemsize = 32 byte。对于大于 32Kb 的内存分配,都是分配整数页,elemsize = page_size * npages。
  • nelems: span 中包块的总数目。
  • freeindex: 0 ~ nelemes-1,表示分配到第几个块。

3.3 mcentral

上面说到当 mcache 不够用的时候,会从 mcentral 申请。那我们下面就来介绍一下 mcentral。

//go version 1.10.8
//file runtime/mcentral.go

// Central list of free objects of a given size.
//
//go:notinheap
type mcentral struct {
    lock      mutex
    spanclass spanClass
    nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
    empty     mSpanList // list of spans with no free objects (or cached in an mcache)

    // nmalloc is the cumulative count of objects allocated from
    // this mcentral, assuming all spans in mcaches are
    // fully-allocated. Written atomically, read under STW.
    nmalloc uint64
}

// mSpanList heads a linked list of spans.
//
//go:notinheap
type mSpanList struct {
    first *mspan // first span in list, or nil if none
    last  *mspan // last span in list, or nil if none
}

mcentral 分析:

  • spanclass: 也有成员 spanclass,那么 mcentral 是不是也有 67 个呢?是的。
  • lock: 因为会有多个 P 过来竞争。
  • nonempty: mspan 的双向链表,当前 mcentral 中可用的 mspan list。
  • empty: 已经被使用的,可以认为是一种对所有 mspan 的 track。

问题来了,mcentral 存在于什么地方?虽然在上面我们将 mcentral 和 mheap 作为两个部分来讲,但是作为全局的结构,这两部分是可以定义在一起的。实际上也是这样,mcentral 包含在 mheap 中。

3.4 mheap

Golang 中的 mheap 结构定义如下。

//go version 1.10.8
//file runtime/mheap.go

// Main malloc heap.
// The heap itself is the "free[]" and "large" arrays,
// but all the other global data is here too.
//
// mheap must not be heap-allocated because it contains mSpanLists,
// which must not be heap-allocated.
//
//go:notinheap
type mheap struct {
    lock      mutex
    free      [_MaxMHeapList]mSpanList // free lists of given length up to _MaxMHeapList
    freelarge mTreap                   // free treap of length >= _MaxMHeapList
    busy      [_MaxMHeapList]mSpanList // busy lists of large spans of given length
    busylarge mSpanList                // busy lists of large spans length >= _MaxMHeapList
    sweepgen  uint32                   // sweep generation, see comment in mspan
    sweepdone uint32                   // all spans are swept
    sweepers  uint32                   // number of active sweepone calls

    // allspans is a slice of all mspans ever created. Each mspan
    // appears exactly once.
    //
    // The memory for allspans is manually managed and can be
    // reallocated and move as the heap grows.
    //
    // In general, allspans is protected by mheap_.lock, which
    // prevents concurrent access as well as freeing the backing
    // store. Accesses during STW might not hold the lock, but
    // must ensure that allocation cannot happen around the
    // access (since that may free the backing store).
    allspans []*mspan // all spans out there

    // spans is a lookup table to map virtual address page IDs to *mspan.
    // For allocated spans, their pages map to the span itself.
    // For free spans, only the lowest and highest pages map to the span itself.
    // Internal pages map to an arbitrary span.
    // For pages that have never been allocated, spans entries are nil.
    //
    // Modifications are protected by mheap.lock. Reads can be
    // performed without locking, but ONLY from indexes that are
    // known to contain in-use or stack spans. This means there
    // must not be a safe-point between establishing that an
    // address is live and looking it up in the spans array.
    //
    // This is backed by a reserved region of the address space so
    // it can grow without moving. The memory up to len(spans) is
    // mapped. cap(spans) indicates the total reserved memory.
    spans []*mspan

    // sweepSpans contains two mspan stacks: one of swept in-use
    // spans, and one of unswept in-use spans. These two trade
    // roles on each GC cycle. Since the sweepgen increases by 2
    // on each cycle, this means the swept spans are in
    // sweepSpans[sweepgen/2%2] and the unswept spans are in
    // sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
    // unswept stack and pushes spans that are still in-use on the
    // swept stack. Likewise, allocating an in-use span pushes it
    // on the swept stack.
    sweepSpans [2]gcSweepBuf

    _ uint32 // align uint64 fields on 32-bit for atomics

    // Proportional sweep
    //
    // These parameters represent a linear function from heap_live
    // to page sweep count. The proportional sweep system works to
    // stay in the black by keeping the current page sweep count
    // above this line at the current heap_live.
    //
    // The line has slope sweepPagesPerByte and passes through a
    // basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
    // any given time, the system is at (memstats.heap_live,
    // pagesSwept) in this space.
    //
    // It's important that the line pass through a point we
    // control rather than simply starting at a (0,0) origin
    // because that lets us adjust sweep pacing at any time while
    // accounting for current progress. If we could only adjust
    // the slope, it would create a discontinuity in debt if any
    // progress has already been made.
    pagesInUse         uint64  // pages of spans in stats _MSpanInUse; R/W with mheap.lock
    pagesSwept         uint64  // pages swept this cycle; updated atomically
    pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
    sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
    sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
    // TODO(austin): pagesInUse should be a uintptr, but the 386
    // compiler can't 8-byte align fields.

    // Malloc stats.
    largealloc  uint64                  // bytes allocated for large objects
    nlargealloc uint64                  // number of large object allocations
    largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
    nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
    nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)

    // range of addresses we might see in the heap
    bitmap        uintptr // Points to one byte past the end of the bitmap
    bitmap_mapped uintptr

    // The arena_* fields indicate the addresses of the Go heap.
    //
    // The maximum range of the Go heap is
    // [arena_start, arena_start+_MaxMem+1).
    //
    // The range of the current Go heap is
    // [arena_start, arena_used). Parts of this range may not be
    // mapped, but the metadata structures are always mapped for
    // the full range.
    arena_start uintptr
    arena_used  uintptr // Set with setArenaUsed.

    // The heap is grown using a linear allocator that allocates
    // from the block [arena_alloc, arena_end). arena_alloc is
    // often, but *not always* equal to arena_used.
    arena_alloc uintptr
    arena_end   uintptr

    // arena_reserved indicates that the memory [arena_alloc,
    // arena_end) is reserved (e.g., mapped PROT_NONE). If this is
    // false, we have to be careful not to clobber existing
    // mappings here. If this is true, then we own the mapping
    // here and *must* clobber it to use it.
    arena_reserved bool

    _ uint32 // ensure 64-bit alignment

    // central free lists for small size classes.
    // the padding makes sure that the MCentrals are
    // spaced CacheLineSize bytes apart, so that each MCentral.lock
    // gets its own cache line.
    // central is indexed by spanClass.
    central [numSpanClasses]struct {
        mcentral mcentral
        pad      [sys.CacheLineSize - unsafe.Sizeof(mcentral{})%sys.CacheLineSize]byte
    }

    spanalloc             fixalloc // allocator for span*
    cachealloc            fixalloc // allocator for mcache*
    treapalloc            fixalloc // allocator for treapNodes* used by large objects
    specialfinalizeralloc fixalloc // allocator for specialfinalizer*
    specialprofilealloc   fixalloc // allocator for specialprofile*
    speciallock           mutex    // lock for special record allocators.

    unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
}

var mheap_ mheap

mheap_ 是一个全局变量,会在系统初始化的时候初始化(在函数 mallocinit() 中)。我们先看一下 mheap 具体结构。

  • allspans []*mspan: 所有的 spans 都是通过 mheap_ 申请,所有申请过的 mspan 都会记录在 allspans。结构体中的 lock 就是用来保证并发安全的。注释中有关于 STW 的说明,与Golang 的 GC 有关,这里不做详细说明。
  • central [_NumSizeClasses]…: 这个就是之前介绍的 mcentral ,每种大小的块对应一个 mcentral。mcentral 上面介绍过了。pad 可以认为是一个字节填充,为了避免伪共享(false sharing)问题的。False Sharing 可以参考 False Sharing - wikipedia ,这里就不细说了。
  • sweepgen, sweepdone: GC 相关。(Golang 的 GC 策略是 Mark & Sweep, 这里是用来表示 sweep 的,这里就不再深入了。)
  • free [_MaxMHeapList]mSpanList: 这是一个 SpanList 数组,每个 SpanList 里面的 mspan 由 1 ~ 127 (_MaxMHeapList - 1) 个 page 组成。比如 free[3] 是由包含 3 个 page 的 mspan 组成的链表。free 表示的是 free list,也就是未分配的。对应的还有 busy list。
  • freelarge: mspan 组成的链表,每个元素(也就是 mspan)的 page 个数大于 127。对应的还有 busylarge。
  • spans []*mspan: 记录 arena 区域页号(page number)和 mspan 的映射关系。
  • arena_start, arena_end, arena_used: 要解释这几个变量之前要解释一下 arena。arena 是 Golang 中用于分配内存的连续虚拟地址区域。由 mheap 管理,堆上申请的所有内存都来自 arena。那么如何标志内存可用呢?操作系统的常见做法用两种:一种是用链表将所有的可用内存都串起来;另一种是使用位图来标志内存块是否可用。结合上面一条 spans,内存的布局是下面这样的。

    golang内存布局

  • spanalloc, cachealloc fixalloc: fixalloc 是 free-list,用来分配特定大小的块。
  • 剩下的是一些统计信息和 GC 相关的信息,这里暂且按住不表。

四、初始化

在系统初始化阶段,上面介绍的几个结构会被进行初始化,我们直接看一下初始化代码:mallocinit()。

//go version 1.10.8
//file runtime/malloc.go

func mallocinit() {
    if class_to_size[_TinySizeClass] != _TinySize {
        throw("bad TinySizeClass")
    }

    testdefersizes()

    // Copy class sizes out for statistics table.
    for i := range class_to_size {
        memstats.by_size[i].size = uint32(class_to_size[i])
    }

    // Check physPageSize.
    if physPageSize == 0 {
        // The OS init code failed to fetch the physical page size.
        throw("failed to get system page size")
    }
    if physPageSize < minPhysPageSize {
        print("system page size (", physPageSize, ") is smaller than minimum page size (", minPhysPageSize, ")\n")
        throw("bad system page size")
    }
    if physPageSize&(physPageSize-1) != 0 {
        print("system page size (", physPageSize, ") must be a power of 2\n")
        throw("bad system page size")
    }

    // The auxiliary regions start at p and are laid out in the
    // following order: spans, bitmap, arena.
    var p, pSize uintptr
    var reserved bool

    // The spans array holds one *mspan per _PageSize of arena.
    var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
    spansSize = round(spansSize, _PageSize)
    // The bitmap holds 2 bits per word of arena.
    var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
    bitmapSize = round(bitmapSize, _PageSize)

    // Set up the allocation arena, a contiguous area of memory where
    // allocated data will be found.
    if sys.PtrSize == 8 {
        // On a 64-bit machine, allocate from a single contiguous reservation.
        // 512 GB (MaxMem) should be big enough for now.
        //
        // The code will work with the reservation at any address, but ask
        // SysReserve to use 0x0000XXc000000000 if possible (XX=00...7f).
        // Allocating a 512 GB region takes away 39 bits, and the amd64
        // doesn't let us choose the top 17 bits, so that leaves the 9 bits
        // in the middle of 0x00c0 for us to choose. Choosing 0x00c0 means
        // that the valid memory addresses will begin 0x00c0, 0x00c1, ..., 0x00df.
        // In little-endian, that's c0 00, c1 00, ..., df 00. None of those are valid
        // UTF-8 sequences, and they are otherwise as far away from
        // ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
        // addresses. An earlier attempt to use 0x11f8 caused out of memory errors
        // on OS X during thread allocations.  0x00c0 causes conflicts with
        // AddressSanitizer which reserves all memory up to 0x0100.
        // These choices are both for debuggability and to reduce the
        // odds of a conservative garbage collector (as is still used in gccgo)
        // not collecting memory because some non-pointer block of memory
        // had a bit pattern that matched a memory address.
        //
        // Actually we reserve 544 GB (because the bitmap ends up being 32 GB)
        // but it hardly matters: e0 00 is not valid UTF-8 either.
        //
        // If this fails we fall back to the 32 bit memory mechanism
        //
        // However, on arm64, we ignore all this advice above and slam the
        // allocation at 0x40 << 32 because when using 4k pages with 3-level
        // translation buffers, the user address space is limited to 39 bits
        // On darwin/arm64, the address space is even smaller.
        arenaSize := round(_MaxMem, _PageSize)
        pSize = bitmapSize + spansSize + arenaSize + _PageSize
        for i := 0; i <= 0x7f; i++ {
            switch {
            case GOARCH == "arm64" && GOOS == "darwin":
                p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
            case GOARCH == "arm64":
                p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
            default:
                p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
            }
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
    }

    if p == 0 {
        // On a 32-bit machine, we can't typically get away
        // with a giant virtual address space reservation.
        // Instead we map the memory information bitmap
        // immediately after the data segment, large enough
        // to handle the entire 4GB address space (256 MB),
        // along with a reservation for an initial arena.
        // When that gets used up, we'll start asking the kernel
        // for any memory anywhere.

        // We want to start the arena low, but if we're linked
        // against C code, it's possible global constructors
        // have called malloc and adjusted the process' brk.
        // Query the brk so we can avoid trying to map the
        // arena over it (which will cause the kernel to put
        // the arena somewhere else, likely at a high
        // address).
        procBrk := sbrk0()

        // If we fail to allocate, try again with a smaller arena.
        // This is necessary on Android L where we share a process
        // with ART, which reserves virtual memory aggressively.
        // In the worst case, fall back to a 0-sized initial arena,
        // in the hope that subsequent reservations will succeed.
        arenaSizes := []uintptr{
            512 << 20,
            256 << 20,
            128 << 20,
            0,
        }

        for _, arenaSize := range arenaSizes {
            // SysReserve treats the address we ask for, end, as a hint,
            // not as an absolute requirement. If we ask for the end
            // of the data segment but the operating system requires
            // a little more space before we can start allocating, it will
            // give out a slightly higher pointer. Except QEMU, which
            // is buggy, as usual: it won't adjust the pointer upward.
            // So adjust it upward a little bit ourselves: 1/4 MB to get
            // away from the running binary image and then round up
            // to a MB boundary.
            p = round(firstmoduledata.end+(1<<18), 1<<20)
            pSize = bitmapSize + spansSize + arenaSize + _PageSize
            if p <= procBrk && procBrk < p+pSize {
                // Move the start above the brk,
                // leaving some room for future brk
                // expansion.
                p = round(procBrk+(1<<20), 1<<20)
            }
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
        if p == 0 {
            throw("runtime: cannot reserve arena virtual address space")
        }
    }

    // PageSize can be larger than OS definition of page size,
    // so SysReserve can give us a PageSize-unaligned pointer.
    // To overcome this we ask for PageSize more and round up the pointer.
    p1 := round(p, _PageSize)
    pSize -= p1 - p

    spansStart := p1
    p1 += spansSize
    mheap_.bitmap = p1 + bitmapSize
    p1 += bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1
    mheap_.arena_alloc = p1
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()
}

4.1 arena 相关

// The spans array holds one *mspan per _PageSize of arena.
var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
spansSize = round(spansSize, _PageSize)
// The bitmap holds 2 bits per word of arena.
var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
bitmapSize = round(bitmapSize, _PageSize)
arenaSize := round(_MaxMem, _PageSize)

// _MaxMem is the maximum heap arena size minus 1.
//
// On 32-bit, this is also the maximum heap pointer value,
// since the arena starts at address 0.
_MaxMem = 1<<_MHeapMap_TotalBits - 1

首先解释一下变量 _MaxMem ,里面还有一个变量就不再列出来了。简单来说 _MaxMem 就是系统为 arena 区分配的大小:64 位系统分配 512 G;对于 Windows 64 位系统,arena 区分配 32 G。round 是一个对齐操作,向上取 _PageSize 的倍数。实现也很有意思,代码如下。

// round n up to a multiple of a.  a must be a power of 2.
func round(n, a uintptr) uintptr {
    return (n + a - 1) &^ (a - 1)
}

bitmap 用两个 bit 表示一个字的可用状态,那么算下来 bitmap 的大小为 16 G。spans 记录的 arena 区的块页号和对应的 mspan 指针的对应关系。比如 arena 区内存地址 x,对应的页号就是 page_num = (x - arena_start) / page_size,那么 spans 就会记录 spans[page_num] = x。如果 arena 为 512 G的话,spans 区的大小为 512 G / 8K * 8 = 512 M。这里值得注意的是 Golang 的内存管理虚拟地址页大小为 8k。

_PageSize = 1 << _PageShift
_PageMask = _PageSize - 1

所以这一段连续的的虚拟内存布局(64 位)如下:

golang虚拟内存布局

4.2 虚拟地址申请

主要是下面这段代码。

pSize = bitmapSize + spansSize + arenaSize + _PageSize
for i := 0; i <= 0x7f; i++ {
    switch {
    case GOARCH == "arm64" && GOOS == "darwin":
        p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
    case GOARCH == "arm64":
        p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
    default:
        p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
    }
    p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
    if p != 0 {
        break
    }
}

初始化的时候,Golang 向操作系统申请一段连续的地址空间,就是上面的 spans + bitmap + arena。p 就是这段连续地址空间的开始地址,不同平台的 p 取值不一样。向 OS 申请的时候视不同的 OS 版本,调用不同的系统调用,比如 Unix 系统调用 mmap (mmap 向操作系统内核申请新的虚拟地址区间,可指定起始地址和长度),Windows 系统调用 VirtualAlloc (类似 mmap)。

//来源于互联网

//bsd
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    if sys.PtrSize == 8 && uint64(n) > 1<<32 || sys.GoosNacl != 0 {
        *reserved = false
        return v
    }

    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}

//darwin
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    return p
}

//linux
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    ...
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}
//windows
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    // v is just a hint.
    // First try at v.
    v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
    if v != nil {
        return v
    }

    // Next let the kernel choose the address.
    return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
}

4.3 mheap 初始化

我们上面介绍 mheap 结构的时候知道 spans, bitmap, arena 都是存在于 mheap 中的,从操作系统申请完地址之后就是初始化 mheap 了。

// PageSize can be larger than OS definition of page size,
    // so SysReserve can give us a PageSize-unaligned pointer.
    // To overcome this we ask for PageSize more and round up the pointer.
    p1 := round(p, _PageSize)
    pSize -= p1 - p

    spansStart := p1
    p1 += spansSize
    mheap_.bitmap = p1 + bitmapSize
    p1 += bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1
    mheap_.arena_alloc = p1
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()

p 是从连续虚拟地址的起始地址,先进行对齐,然后初始化 arena,bitmap,spans 地址。mheap_.init()会初始化 fixalloc 等相关的成员,还有 mcentral 的初始化。

// Initialize the heap.
func (h *mheap) init(spansStart, spansBytes uintptr) {
    h.treapalloc.init(unsafe.Sizeof(treapNode{}), nil, nil, &memstats.other_sys)
    h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
    h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
    h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
    h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)

    // Don't zero mspan allocations. Background sweeping can
    // inspect a span concurrently with allocating it, so it's
    // important that the span's sweepgen survive across freeing
    // and re-allocating a span to prevent background sweeping
    // from improperly cas'ing it from 0.
    //
    // This is safe because mspan contains no heap pointers.
    h.spanalloc.zero = false

    // h->mapcache needs no init
    for i := range h.free {
        h.free[i].init()
        h.busy[i].init()
    }

    h.busylarge.init()
    for i := range h.central {
        h.central[i].mcentral.init(spanClass(i))
    }

    sp := (*slice)(unsafe.Pointer(&h.spans))
    sp.array = unsafe.Pointer(spansStart)
    sp.len = 0
    sp.cap = int(spansBytes / sys.PtrSize)

    // Map metadata structures. But don't map race detector memory
    // since we're not actually growing the arena here (and TSAN
    // gets mad if you map 0 bytes).
    h.setArenaUsed(h.arena_used, false)
}

mheap 初始化之后,对当前的线程也就是 M 进行初始化。

//获取当前 G
_g_ := getg()
// 获取 G 上绑定的 M 的 mcache
_g_.m.mcache = allocmcache()

4.4 per-P mcache 初始化

上面好像并没有说到针对 P 的 mcache 初始化,因为这个时候还没有初始化 P。我们看一下 bootstrap 的代码。

func schedinit() {
    ...
    mallocinit()
    ...
    
    if procs > _MaxGomaxprocs {
        procs = _MaxGomaxprocs
    }
    if procresize(procs) != nil {
        ...
    }
}

其中 mallocinit() 上面说过了。对 P 的初始化在函数 procresize() 中执行,我们下面只看内存相关的部分。

//go version 1.10.8
//file runtime/proc.go

func procresize(nprocs int32) *p {
    ...
    // initialize new P's
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
            pp.id = i
            pp.status = _Pgcstop
            pp.sudogcache = pp.sudogbuf[:0]
            for i := range pp.deferpool {
                pp.deferpool[i] = pp.deferpoolbuf[i][:0]
            }
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }
        // P mcache 初始化
        if pp.mcache == nil {
            if old == 0 && i == 0 {
                if getg().m.mcache == nil {
                    throw("missing mcache?")
                }
                // P[0] 分配给主 Goroutine
                pp.mcache = getg().m.mcache // bootstrap
            } else {
                // P[0] 之外的 P 申请 mcache
                pp.mcache = allocmcache()
            }
        }
        ...
    }
    ...
}

所有的 P 都存放在一个全局数组 allp 中,procresize() 的目的就是将 allp 中用到的 P 进行初始化,同时对多余 P 的资源剥离。

五、内存分配

先说一下给对象 object 分配内存的主要流程:

  1. object size > 32K,则使用 mheap 直接分配。
  2. object size < 16 byte,使用 mcache 的小对象分配器 tiny 直接分配。 (其实 tiny 就是一个指针,暂且这么说吧。)
  3. object size > 16 byte && size <=32K byte 时,先使用 mcache 中对应的 size class 分配。
  4. 如果 mcache 对应的 size class 的 span 已经没有可用的块,则向 mcentral 请求。
  5. 如果 mcentral 也没有可用的块,则向 mheap 申请,并切分。
  6. 如果 mheap 也没有合适的 span,则向操作系统申请。
    我们看一下在堆上,也就是 arena 区分配内存的相关函数。
package main

func foo() *int {
    x := 1
    return &x
}

func main() {
    x := foo()
    println(*x)
}

根据之前介绍的逃逸分析,foo() 中的 x 会被分配到堆上。把上面代码保存为 main.go 看一下汇编代码。

$ go build -gcflags '-l' -o main main.go
$ go tool objdump -s "main\.foo" main
TEXT main.foo(SB) /Users/xxx/src/heap/main.go
  main.go:3             0x104aad0               65488b0c25a0080000      MOVQ GS:0x8a0, CX                       
  main.go:3             0x104aad9               483b6110                CMPQ 0x10(CX), SP                       
  main.go:3             0x104aadd               7639                    JBE 0x104ab18                           
  main.go:3             0x104aadf               4883ec18                SUBQ $0x18, SP                          
  main.go:3             0x104aae3               48896c2410              MOVQ BP, 0x10(SP)                       
  main.go:3             0x104aae8               488d6c2410              LEAQ 0x10(SP), BP                       
  main.go:4             0x104aaed               488d052c990000          LEAQ type.*+38976(SB), AX               
  main.go:4             0x104aaf4               48890424                MOVQ AX, 0(SP)                          
  main.go:4             0x104aaf8               e82304fcff              CALL runtime.newobject(SB)              
  main.go:4             0x104aafd               488b442408              MOVQ 0x8(SP), AX                        
  main.go:4             0x104ab02               48c70001000000          MOVQ $0x1, 0(AX)                        
  main.go:5             0x104ab09               4889442420              MOVQ AX, 0x20(SP)                       
  main.go:5             0x104ab0e               488b6c2410              MOVQ 0x10(SP), BP                       
  main.go:5             0x104ab13               4883c418                ADDQ $0x18, SP                          
  main.go:5             0x104ab17               c3                      RET                                     
  main.go:3             0x104ab18               e80389ffff              CALL runtime.morestack_noctxt(SB)       
  main.go:3             0x104ab1d               ebb1                    JMP main.foo(SB)                        
  :-1                   0x104ab1f               cc                      INT $0x3

堆上内存分配调用了 runtime 包的 newobject 函数。

//go version 1.10.8
//file runtime/malloc.go

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }

    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }

    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }

    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)

        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }

    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1

    shouldhelpgc := false
    dataSize := size
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    if size <= maxSmallSize {
        // object size <= 32K
        if noscan && size < maxTinySize {
        // 小于 16 byte 的小对象分配
            // Tiny allocator.
            //
            // Tiny allocator combines several tiny allocation requests
            // into a single memory block. The resulting memory block
            // is freed when all subobjects are unreachable. The subobjects
            // must be noscan (don't have pointers), this ensures that
            // the amount of potentially wasted memory is bounded.
            //
            // Size of the memory block used for combining (maxTinySize) is tunable.
            // Current setting is 16 bytes, which relates to 2x worst case memory
            // wastage (when all but one subobjects are unreachable).
            // 8 bytes would result in no wastage at all, but provides less
            // opportunities for combining.
            // 32 bytes provides more opportunities for combining,
            // but can lead to 4x worst case wastage.
            // The best case winning is 8x regardless of block size.
            //
            // Objects obtained from tiny allocator must not be freed explicitly.
            // So when an object will be freed explicitly, we ensure that
            // its size >= maxTinySize.
            //
            // SetFinalizer has a special case for objects potentially coming
            // from tiny allocator, it such case it allows to set finalizers
            // for an inner byte of a memory block.
            //
            // The main targets of tiny allocator are small strings and
            // standalone escaping variables. On a json benchmark
            // the allocator reduces number of allocations by ~12% and
            // reduces heap size by ~20%.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        //object size > 32K byte
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }

    var scanSize uintptr
    if !noscan {
        // If allocating a defer+arg block, now that we've picked a malloc size
        // large enough to hold everything, cut the "asked for" size down to
        // just the defer header, so that the GC bitmap will record the arg block
        // as containing nothing at all (as if it were unused space at the end of
        // a malloc block caused by size rounding).
        // The defer arg areas are scanned as part of scanstack.
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                scanSize = dataSize - typ.size + typ.ptrdata
            }
        } else {
            scanSize = typ.ptrdata
        }
        c.local_scan += scanSize
    }

    // Ensure that the stores above that initialize x to
    // type-safe memory and set the heap bits occur before
    // the caller can make x observable to the garbage
    // collector. Otherwise, on weakly ordered machines,
    // the garbage collector could follow a pointer to x,
    // but see uninitialized memory or stale heap bits.
    publicationBarrier()

    // Allocate black during GC.
    // All slots hold nil so no scanning is needed.
    // This may be racing with GC so do it atomically if there can be
    // a race marking the bit.
    if gcphase != _GCoff {
        gcmarknewobject(uintptr(x), size, scanSize)
    }

    if raceenabled {
        racemalloc(x, size)
    }

    if msanenabled {
        msanmalloc(x, size)
    }

    mp.mallocing = 0
    releasem(mp)

    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }

    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }

    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }

    if shouldhelpgc {
        if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
            gcStart(gcBackgroundMode, t)
        }
    }

    return x
}

整个分配过程可以根据 object size 拆解成三部分:size < 16 byte, 16 byte <= size <= 32 K byte, size > 32 K byte。

5.1 size 小于 16 byte

对于小于 16 byte 的内存块,mcache 有个专门的内存区域 tiny 用来分配,tiny 是指针,指向开始地址。

func mallocgc(...) {
    ...
            off := c.tinyoffset
            // 地址对齐
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            // 分配
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // tiny 不够了,为其重新分配一个 16 byte 内存块
            span := c.alloc[tinySizeClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySizeClass)
            }
            x = unsafe.Pointer(v)
            //将申请的内存块全置为 0
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            // 如果申请的内存块用不完,则将剩下的给 tiny,用 tinyoffset 记录分配了多少。
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
}

如上所示,tinyoffset 表示 tiny 当前分配到什么地址了,之后的分配根据 tinyoffset 寻址。先根据要分配的对象大小进行地址对齐,比如 size 是 8 的倍数,tinyoffset 和 8 对齐。然后就是进行分配。如果 tiny 剩余的空间不够用,则重新申请一个 16 byte 的内存块,并分配给 object。如果有结余,则记录在 tiny 上。

5.2 size 大于 32 K byte

对于大于 32 Kb 的内存分配,直接跳过 mcache 和 mcentral,通过 mheap 分配。

func mallocgc(...) {
    } else {
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
    ...
}

func largeAlloc(size uintptr, needzero bool) *mspan {
    ...
    npages := size >> _PageShift
    if size&_PageMask != 0 {
        npages++
    }
    ...
    s := mheap_.alloc(npages, 0, true, needzero)
    if s == nil {
        throw("out of memory")
    }
    s.limit = s.base() + size
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

对于大于 32 K 的内存分配都是分配整数页,先右移然后低位与计算需要的页数。

5.3 size 介于 16 和 32K

对于 size 介于 16 ~ 32K byte 的内存分配先计算应该分配的 sizeclass,然后去 mcache 里面 alloc[sizeclass] 申请,如果 mcache.alloc[sizeclass] 不足以申请,则 mcache 向 mcentral 申请,然后再分配。mcentral 给 mcache 分配完之后会判断自己需不需要扩充,如果需要则想 mheap 申请。

func mallocgc(...) {
        ...
        } else {
            var sizeclass uint8
            //计算 sizeclass
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            //从对应的 span 里面分配一个 object 
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
          }
}

我们首先看一下如何计算 sizeclass 的,预先定义了两个数组:size_to_class8 和 size_to_class128。 数组 size_to_class8,其第 i 个值表示地址区间 ( (i-1) 8, i 8 ] (smallSizeDiv = 8) 对应的 sizeclass,size_to_class128 类似。小于 1024 - 8 = 1016 (smallSizeMax=1024),使用 size_to_class8,否则使用数组 size_to_class128。看一下数组具体的值:0, 1, 2, 3, 3, 4, 4…。举个例子,比如要分配 17 byte 的内存 (16 byte 以下的使用 mcache.tiny 分配),sizeclass = size_to_calss8[(17+7)/8] = size_to_class8[3] = 3。不得不说这种用空间换时间的策略确实提高了运行效率。

计算出 sizeclass,那么就可以去 mcache.alloc[sizeclass] 分配了,注意这是一个 mspan 指针,真正的分配函数是 nextFreeFast() 函数,如下。

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        if result < s.nelems {
            freeidx := result + 1
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            s.allocCache >>= uint(theBit + 1)
            s.freeindex = freeidx
            s.allocCount++
            return gclinkptr(result*s.elemsize + s.base())
        }
    }
    return 0
}

allocCache 这里是用位图表示内存是否可用,1 表示可用。然后通过 span 里面的 freeindex 和 elemsize 来计算地址即可。

如果 mcache.alloc[sizeclass] 已经不够用了,则从 mcentral 申请内存到 mcache。

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(sizeclass uint8) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    s = c.alloc[sizeclass]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        systemstack(func() {
            // 这个地方 mcache 向 mcentral 申请
            c.refill(int32(sizeclass))
        })
        shouldhelpgc = true
        s = c.alloc[sizeclass]
        // mcache 向 mcentral 申请完之后,再次从 mcache 申请
        freeIndex = s.nextFreeIndex()
    }

    ...
}

// nextFreeIndex returns the index of the next free object in s at
// or after s.freeindex.
// There are hardware instructions that can be used to make this
// faster if profiling warrants it.
// 这个函数和 nextFreeFast 有点冗余了
func (s *mspan) nextFreeIndex() uintptr {
    ...
}

mcache 向 mcentral,如果 mcentral 不够,则向 mheap 申请。

//go version 1.10.8

func (c *mcache) refill(sizeclass int32) *mspan {
    ...
    // 向 mcentral 申请
    s = mheap_.central[sizeclass].mcentral.cacheSpan()
    ...
    return s
}

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    ...
    // Replenish central list if empty.
    s = c.grow()
}

func (c *mcentral) grow() *mspan {
    npages := uintptr(class_to_allocnpages[c.sizeclass])
    size := uintptr(class_to_size[c.sizeclass])
    n := (npages << _PageShift) / size
    
    //这里向 mheap 申请
    s := mheap_.alloc(npages, c.sizeclass, false, true)
    ...
    return s
}

如果 mheap 不足,则向 OS 申请。接上面的代码 mheap_.alloc()

func (h *mheap) alloc(npage uintptr, sizeclass int32, large bool, needzero bool) *mspan {
    ...
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, sizeclass, large)
    })
    ...
}

func (h *mheap) alloc_m(npage uintptr, sizeclass int32, large bool) *mspan {
    ... 
    s := h.allocSpanLocked(npage)
    ...
}

func (h *mheap) allocSpanLocked(npage uintptr) *mspan {
    ...
    s = h.allocLarge(npage)
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }
    ...
}

func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }

    v := h.sysAlloc(ask)
    ...
}

整个函数调用链如上所示,最后 sysAlloc 会调用系统调用(mmap 或者 VirtualAlloc,和初始化那部分有点类似)去向操作系统申请。

六、内存回收

这里只会介绍一些简单的内存回收。

6.1 mcache 回收

mcache 回收可以分两部分:第一部分是将 alloc 中未用完的内存归还给对应的 mcentral。

func freemcache(c *mcache) {
    systemstack(func() {
        c.releaseAll()
        ...

        lock(&mheap_.lock)
        purgecachedstats(c)
        mheap_.cachealloc.free(unsafe.Pointer(c))
        unlock(&mheap_.lock)
    })
}

func (c *mcache) releaseAll() {
    for i := 0; i < _NumSizeClasses; i++ {
        s := c.alloc[i]
        if s != &emptymspan {
            mheap_.central[i].mcentral.uncacheSpan(s)
            c.alloc[i] = &emptymspan
        }
    }
    // Clear tinyalloc pool.
    c.tiny = 0
    c.tinyoffset = 0
}

6.2 mcentral 回收

当 mspan 没有 free object 的时候,将 mspan 归还给 mheap。

func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
    ...
    lock(&c.lock)
    ...
    if s.allocCount != 0 {
        unlock(&c.lock)
        return false
    }

    c.nonempty.remove(s)
    unlock(&c.lock)
    mheap_.freeSpan(s, 0)
    return true
}

6.3 mheap

mheap 并不会定时向操作系统归还,但是会对 span 做一些操作,比如合并相邻的 span。

七、总结

目前GoLang内存管理还处于学习阶段,上文多是整理学习了一些目前互联网的信息,参考资源都已列出,有些地方还没有理解,如有问题,请多交流。

我来评几句
登录后评论

已发表评论数()

相关站点

热门文章