22 Aug, 2012

5 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs ---------memory---------- ---swap-- ----io---- ---system---- -----cpu------
     r  b swpd   free buff    cache   si   so bi      bo     in     cs us sy id wa st
    31 15    0 287216  576 38606628    0    0  2    1158      2     14  1  3 95  0  0
    27 15    0 225288  576 38583384    0    0 18 2222016 203357 134876 11 56 17 15  0
    28 17    0 219256  576 38544736    0    0 11 2305932 203141 146296 11 49 23 17  0
     6 18    0 215596  576 38552872    0    0  7 2363207 215264 166502 12 45 22 20  0
    22 18    0 226984  576 38596404    0    0  3 2445741 223114 179527 12 43 23 22  0

    and then it goes to pot

    procs  ---------memory---------- ---swap-- -----io----- ---system---- -----cpu------
      r  b swpd   free buff    cache   si   so   bi      bo     in     cs us sy id wa st
    163  8    0 464308  576 36791368    0    0   11   22210    866    536  3 13 79  4  0
    207 14    0 917752  576 36181928    0    0  712 1345376 134598  47367  7 90  1  2  0
    123 12    0 685516  576 36296148    0    0  429 1386615 158494  60077  8 84  5  3  0
    123 12    0 598572  576 36333728    0    0 1107 1233281 147542  62351  7 84  5  4  0
    622  7    0 660768  576 36118264    0    0  557 1345548 151394  59353  7 85  4  3  0
    223 11    0 283960  576 36463868    0    0   46 1107160 121846  33006  6 93  1  1  0

    Note that system CPU usage is very high and the number of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
           |
           |--97.30%-- isolate_freepages
           |           compaction_alloc
           |           unmap_and_move
           |           migrate_pages
           |           compact_zone
           |           compact_zone_order
           |           try_to_compact_pages
           |           __alloc_pages_direct_compact
           |           __alloc_pages_slowpath
           |           __alloc_pages_nodemask
           |           alloc_pages_vma
           |           do_huge_pmd_anonymous_page
           |           handle_mm_fault
           |           do_page_fault
           |           page_fault
           |           |
           |           |--87.39%-- skb_copy_datagram_iovec
           |           |           tcp_recvmsg
           |           |           inet_recvmsg
           |           |           sock_recvmsg
           |           |           sys_recvfrom
           |           |           system_call
           |           |           __recv
           |           |           |
           |           |            --100.00%-- (nil)
           |           |
           |            --12.61%-- memcpy
            --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.
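
    The abort-on-contention idea can be illustrated outside the kernel. The
    following is only a minimal userspace sketch, not the code from this
    patch: it assumes a C11 atomic_flag standing in for zone->lock, it does
    not model the "needs to be scheduled" check, and the names scan_ctl,
    scan_trylock and scan_unlock are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-scan state; only loosely modelled on the description. */
struct scan_ctl {
    bool sync;       /* synchronous callers are allowed to wait for the lock */
    bool contended;  /* set when an async scan gave up because of contention */
};

/*
 * Take the lock if it is free. An async caller aborts on contention and
 * records why; a sync caller spins until the lock becomes available.
 * Returns true when the lock is held on return.
 */
static bool scan_trylock(atomic_flag *lock, struct scan_ctl *ctl)
{
    assert(lock && ctl);

    /* test_and_set returns the previous value: false means we got the lock. */
    if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        return true;

    if (!ctl->sync) {
        ctl->contended = true;  /* inform the caller that we bailed out */
        return false;
    }

    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* sync caller: busy-wait for the holder to release */
    return true;
}

static void scan_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

    A caller in this sketch would check ctl->contended after a failed
    scan_trylock() and fail the allocation instead of retrying, which
    mirrors how the changelog describes THP allocations being failed.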

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount of work the free
    page scanner does in compaction. However, it has a problem. Consider
    two processes simultaneously scanning free pages

    C
    Process A M S F
    |---------------------------------------|
    Process B M FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

    if (cc->order > 0 && (!cc->wrapped ||
                          zone->compact_cached_free_pfn >
                          cc->start_free_pfn))
            pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong updates to compact_cached_free_pfn, leaving its
    value effectively random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolating pages it
    checks if compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest
    pageblock that pages were isolated from. This value will not
    be updated again until a free page scanner wraps and resets
    compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.
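
    The two update rules above can be modelled in a few lines. This is a
    sketch under stated assumptions rather than the patch itself: zone_model,
    scanner_wrapped and scanner_isolated are hypothetical names, and
    cached_free_pfn stands in for zone->compact_cached_free_pfn.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of struct zone that matter here. */
struct zone_model {
    unsigned long end_pfn;          /* models "the end of the zone" */
    unsigned long cached_free_pfn;  /* models zone->compact_cached_free_pfn */
};

/* A free scanner wrapped around: reset the cache to the end of the zone. */
static void scanner_wrapped(struct zone_model *z)
{
    z->cached_free_pfn = z->end_pfn;
}

/*
 * A scanner finished isolating pages; high_pfn is the highest pageblock it
 * isolated from. A wrapped scanner always updates the cache. A scanner that
 * has not wrapped updates it only if it still points at the end of the zone,
 * so the value then stays put until the next wrap.
 */
static void scanner_isolated(struct zone_model *z, bool wrapped,
                             unsigned long high_pfn)
{
    assert(high_pfn <= z->end_pfn);
    if (wrapped || z->cached_free_pfn == z->end_pfn)
        z->cached_free_pfn = high_pfn;
}
```

    In this model a second non-wrapped scanner cannot overwrite the cached
    value, which is the property that stops the ping-pong updates described
    earlier.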

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
    ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
    setting, but it missed some places the pfmemalloc should be set.

    As a result, in __slab_alloc() the mismatch between pfmemalloc and
    ALLOC_NO_WATERMARKS causes incorrect deactivate_slab() calls on our core2
    server:

    64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
           |
           --- _raw_spin_lock
               |
               |---0.34%-- deactivate_slab
               |           __slab_alloc
               |           kmem_cache_alloc
               |           |

    That causes our fio sync write performance to have a 40% regression.

    Move the check into get_page_from_freelist(), which resolves this issue.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Cc: David Miller
    Tested-by: Eric Dumazet
    Tested-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed a bad deferral policy but made a mistake when checking
    compact_order_failed in __compact_pgdat(), so compact_order_failed can't
    be updated with the new order. This ends up preventing the deferral
    policy from operating correctly. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
    that not all the page tables allocated could be found and freed when
    exit_mmap() tore down the user address space.

    There's usually nothing we can say about it, beyond that it's probably a
    sign of some bad memory or memory corruption; though it might still
    indicate a bug in vma or page table management (and did recently reveal a
    race in THP, fixed a few months ago).

    But one overdue change we can make is from BUG_ON to WARN_ON.

    It's fairly likely that the system will crash shortly afterwards in some
    other way (for example, the BUG_ON(page_mapped(page)) in
    __delete_from_page_cache(), once an inode mapped into the lost page tables
    gets evicted); but might tell us more before that.

    Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less
    eager, since that one has several times led to fixes.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to manage its own superblock and
    its dirty state.

    The nice thing about killing this thread is that it improves power management.
    Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
    every 5 seconds no matter what - even if there were no dirty superblocks and
    even if there were no file-systems using this service (e.g., btrfs and
    journalled ext4 do not need it). So it was wasting power most of the time. And
    because the thread was in the core of the kernel, all systems had to have it.
    So I am quite happy to make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel thread which
    was a self-forking kernel thread responsible for all the write-back in old
    Linux kernels. It was turned into per-block device BDI threads, and
    'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

03 Aug, 2012

1 commit

  • Borislav Petkov reports that the new warning added in commit
    88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
    triggers for him, and it is the node_start_pfn field that has already
    been initialized once.

    The call trace looks like this:

    x86_64_start_kernel ->
    x86_64_start_reservations ->
    start_kernel ->
    setup_arch ->
    paging_init ->
    zone_sizes_init ->
    free_area_init_nodes ->
    free_area_init_node

    and (with the warning replaced by debug output), Borislav sees

    On node 0 totalpages: 4193848
    DMA zone: 64 pages used for memmap
    DMA zone: 6 pages reserved
    DMA zone: 3890 pages, LIFO batch:0
    DMA32 zone: 16320 pages used for memmap
    DMA32 zone: 798464 pages, LIFO batch:31
    Normal zone: 52736 pages used for memmap
    Normal zone: 3322368 pages, LIFO batch:31
    free_area_init_node: pgdat->node_start_pfn: 4423680 node_start_pfn: 8617984 node_start_pfn: 12812288
    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Aug, 2012

2 commits

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     
  • Pull core block IO bits from Jens Axboe:
    "The most complicated part of this is the request allocation rework by
    Tejun, which has been queued up for a long time and has been in
    for-next ditto as well.

    There are a few commits from yesterday and today, mostly trivial and
    obvious fixes. So I'm pretty confident that it is sound. It's also
    smaller than usual."

    * 'for-3.6/core' of git://git.kernel.dk/linux-block:
    block: remove dead func declaration
    block: add partition resize function to blkpg ioctl
    block: uninitialized ioc->nr_tasks triggers WARN_ON
    block: do not artificially constrain max_sectors for stacking drivers
    blkcg: implement per-blkg request allocation
    block: prepare for multiple request_lists
    block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
    blkcg: inline bio_blkcg() and friends
    block: allocate io_context upfront
    block: refactor get_request[_wait]()
    block: drop custom queue draining used by scsi_transport_{iscsi|fc}
    mempool: add @gfp_mask to mempool_create_node()
    blkcg: make root blkcg allocation use %GFP_KERNEL
    blkcg: __blkg_lookup_create() doesn't need radix preload

    Linus Torvalds
     

01 Aug, 2012

31 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: gix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • If a process creates a large hugetlbfs mapping that is eligible for page
    table sharing and forks heavily with children some of whom fault and
    others which destroy the mapping then it is possible for page tables to
    get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
    output a message to the kernel log. The final teardown will trigger a
    BUG_ON in mm/filemap.c.

    This was reproduced in 3.4 but is known to have existed for a long time
    and goes back at least as far as 2.6.37. It was probably introduced
    in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
    look like this:

    [ ..........] Lots of bad pmd messages followed by this
    [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
    [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
    [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
    [ 127.186778] ------------[ cut here ]------------
    [ 127.186781] kernel BUG at mm/filemap.c:134!
    [ 127.186782] invalid opcode: 0000 [#1] SMP
    [ 127.186783] CPU 7
    [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
    [ 127.186801]
    [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
    [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
    [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
    [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
    [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
    [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
    [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
    [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
    [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
    [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
    [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
    [ 127.186821] Stack:
    [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
    [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
    [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
    [ 127.186827] Call Trace:
    [ 127.186829] [] delete_from_page_cache+0x3b/0x80
    [ 127.186832] [] truncate_hugepages+0x115/0x220
    [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
    [ 127.186837] [] evict+0xa7/0x1b0
    [ 127.186839] [] iput_final+0xd3/0x1f0
    [ 127.186840] [] iput+0x39/0x50
    [ 127.186842] [] d_kill+0xf8/0x130
    [ 127.186843] [] dput+0xd2/0x1a0
    [ 127.186845] [] __fput+0x170/0x230
    [ 127.186848] [] ? rb_erase+0xce/0x150
    [ 127.186849] [] fput+0x1d/0x30
    [ 127.186851] [] remove_vma+0x37/0x80
    [ 127.186853] [] do_munmap+0x2d2/0x360
    [ 127.186855] [] sys_shmdt+0xc9/0x170
    [ 127.186857] [] system_call_fastpath+0x16/0x1b
    [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
    [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
    [ 127.186870] RSP
    [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

    The bug is a race and not always easy to reproduce. To reproduce it I was
    doing the following on a single socket I7-based machine with 16G of RAM.

    $ hugeadm --pool-pages-max DEFAULT:13G
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
    $ for i in `seq 1 9000`; do ./hugetlbfs-test; done

    On my particular machine, it usually triggers within 10 minutes but
    enabling debug options can change the timing such that it never hits.
    Once the bug is triggered, the machine is in trouble and needs to be
    rebooted. The machine will respond but processes accessing proc like "ps
    aux" will hang due to the BUG_ON. shutdown will also hang and needs a
    hard reset or a sysrq-b.

    The basic problem is a race between page table sharing and teardown. For
    the most part page table sharing depends on i_mmap_mutex. In some cases,
    it is also taking the mm->page_table_lock for the PTE updates but with
    shared page tables, it is the i_mmap_mutex that is more important.

    Unfortunately it appears to be also insufficient. Consider the following
    situation

    Process A                         Process B
    ---------                         ---------
    hugetlb_fault                     shmdt
                                      LockWrite(mmap_sem)
                                      do_munmap
                                      unmap_region
                                      unmap_vmas
                                      unmap_single_vma
                                      unmap_hugepage_range
                                      Lock(i_mmap_mutex)
                                      Lock(mm->page_table_lock)
                                      huge_pmd_unshare/unmap tables
                                      Unlock(mm->page_table_lock)
                                      Unlock(i_mmap_mutex)
    huge_pte_alloc                    ...
    Lock(i_mmap_mutex)                ...
    vma_prio_walk, find svma, spte    ...
    Lock(mm->page_table_lock)         ...
    share spte                        ...
    Unlock(mm->page_table_lock)       ...
    Unlock(i_mmap_mutex)              ...
    hugetlb_no_page < end; a += 4096)
    *a = 0;
    }

    int main(int argc, char **argv)
    {
            key_t key = IPC_PRIVATE;
            size_t sizeA = nr_huge_page_A * huge_page_size;
            size_t sizeB = nr_huge_page_B * huge_page_size;
            int shmidA, shmidB;
            void *addrA = NULL, *addrB = NULL;
            int nr_children = 300, n = 0;

            if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                    perror("shmget:");
                    return 1;
            }

            if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
                    perror("shmat");
                    return 1;
            }
            if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                    perror("shmget:");
                    return 1;
            }

            if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
                    perror("shmat");
                    return 1;
            }

    fork_child:
            switch(fork()) {
            case 0:
                    switch (n%3) {
                    case 0:
                            play(addrA, sizeA);
                            break;
                    case 1:
                            play(addrB, sizeB);
                            break;
                    case 2:
                            break;
                    }
                    break;
            case -1:
                    perror("fork:");
                    break;
            default:
                    if (++n < nr_children)
                            goto fork_child;
                    play(addrA, sizeA);
                    break;
            }
            shmdt(addrA);
            shmdt(addrB);
            do {
                    wait(NULL);
            } while (--n > 0);
            shmctl(shmidA, IPC_RMID, NULL);
            shmctl(shmidB, IPC_RMID, NULL);
            return 0;
    }

    [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When tmpfs has the interleave memory policy, it always starts allocating
    for each file from node 0 at offset 0. When there are many small files,
    the lower nodes fill up disproportionately.

    This patch spreads out node usage by starting files at nodes other than 0,
    by using the inode number to bias the starting node for interleave.
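
    The effect of the bias can be shown with a small model. This is a sketch
    that assumes plain round-robin interleaving over a contiguous set of
    nodes; interleave_node is a hypothetical helper, not the function the
    patch modifies.

```c
#include <assert.h>

/*
 * Pick the node for page `index` of a file whose inode number is `ino`,
 * interleaving over `nr_nodes` nodes. Without the bias every file would
 * start at node 0 (index % nr_nodes); adding the inode number shifts the
 * starting node per file while keeping the round-robin order.
 */
static unsigned int interleave_node(unsigned long ino, unsigned long index,
                                    unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    return (unsigned int)((ino + index) % nr_nodes);
}
```

    With four nodes, single-page files with inode numbers 5, 6 and 7 land on
    nodes 1, 2 and 3 in this model instead of all landing on node 0.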

    Signed-off-by: Nathan Zimmer
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • pg_data_t is zeroed before reaching free_area_init_core(), so remove the
    now unnecessary initializations.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
    when it is allocated. Arch code and memory hotplug already initialize
    pg_data_t, so this warning should never happen. I selected fields randomly
    near the beginning, middle and end of pg_data_t for checking.

    This patch isn't for performance but for removing initialization code
    that would otherwise have to be added whenever we add a new field to
    pg_data_t or zone.

    Originally, Andrew suggested clearing pg_data_t in the MM core but
    Tejun doesn't like that because, in the future, some archs may initialize
    some fields in arch code and pass them into the general MM part, so blindly
    clearing it out in the mm core would be very annoying.
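
    The spot check described above can be sketched as follows. The structure
    and its field names are hypothetical placeholders chosen for the
    illustration, not the real pg_data_t layout, and the warning is a plain
    message on stderr rather than a kernel warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a large descriptor such as pg_data_t. */
struct pgdat_model {
    unsigned long first_field;   /* near the beginning */
    unsigned long pad_a[16];
    int middle_field;            /* somewhere in the middle */
    unsigned long pad_b[16];
    int last_field;              /* near the end */
};

/*
 * Sample a few fields spread across the structure. If any of them is
 * non-zero the allocator did not hand back zeroed memory, so warn and
 * carry on instead of treating it as fatal.
 */
static bool pgdat_looks_zeroed(const struct pgdat_model *p)
{
    bool ok = !(p->first_field || p->middle_field || p->last_field);

    if (!ok)
        fprintf(stderr, "warning: descriptor was not zero-initialized\n");
    return ok;
}
```

    Sampling three fields is cheap and catches an allocator that returns
    stale memory, but it cannot prove the whole structure is zero.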

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • I noticed in a multi-process parallel file reading benchmark I ran on an 8
    socket machine, throughput slowed down by a factor of 8 when I ran the
    benchmark within a cgroup container. I traced the problem to the
    following code path (see below) when we are trying to reclaim memory from
    file cache. The res_counter_uncharge function is called on every page
    that's reclaimed and creates heavy lock contention. The patch below
    allows the reclaimed pages to be uncharged from the resource counter in
    batch and fixes the regression.

    Tim

    40.67% usemem [kernel.kallsyms] [k] _raw_spin_lock
           |
           --- _raw_spin_lock
               |
               |--92.61%-- res_counter_uncharge
               |           |
               |           |--100.00%-- __mem_cgroup_uncharge_common
               |           |            |
               |           |            |--100.00%-- mem_cgroup_uncharge_cache_page
               |           |            |            __remove_mapping
               |           |            |            shrink_page_list
               |           |            |            shrink_inactive_list
               |           |            |            shrink_mem_cgroup_zone
               |           |            |            shrink_zone
               |           |            |            do_try_to_free_pages
               |           |            |            try_to_free_pages
               |           |            |            __alloc_pages_nodemask
               |           |            |            alloc_pages_current
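
    Why batching helps can be seen in a toy model that counts how often the
    counter's lock would be taken. This is a sketch only: counter_model and
    the reclaim_* helpers are hypothetical, and the lock is represented by a
    plain acquisition count rather than a real spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a lock-protected resource counter. */
struct counter_model {
    unsigned long usage;              /* pages currently charged */
    unsigned long lock_acquisitions;  /* how often the lock was taken */
};

/* Uncharge nr_pages under a single (modelled) lock acquisition. */
static void uncharge(struct counter_model *c, unsigned long nr_pages)
{
    assert(nr_pages <= c->usage);
    c->lock_acquisitions++;  /* a real counter would lock and unlock here */
    c->usage -= nr_pages;
}

/* Old behaviour: one uncharge, and so one lock round trip, per page. */
static void reclaim_per_page(struct counter_model *c, size_t nr_reclaimed)
{
    for (size_t i = 0; i < nr_reclaimed; i++)
        uncharge(c, 1);
}

/* Batched behaviour: collect the reclaimed pages and uncharge them once. */
static void reclaim_batched(struct counter_model *c, size_t nr_reclaimed)
{
    if (nr_reclaimed)
        uncharge(c, nr_reclaimed);
}
```

    Reclaiming 32 pages takes the modelled lock 32 times in the per-page
    variant and once in the batched variant, with the same final usage.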

    Signed-off-by: Tim Chen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • sparse_index_init() uses the index_init_lock spinlock to protect root
    mem_section assignment. The lock is not necessary anymore because the
    function is called only during boot (during paging init which is executed
    only from a single CPU) and from the hotplug code (by add_memory() via
    arch_add_memory()) which uses mem_hotplug_mutex.

    The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
    preparation") and sparse_index_init() was used only during boot at that
    time.

    Later when the hotplug code (and add_memory()) was introduced there was no
    synchronization so it was possible to online more sections from the same
    root probably (though I am not 100% sure about that). The first
    synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
    hibernation in the same kernel") which was later replaced by the
    mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
    {un}lock_memory_hotplug()").

    Let's remove the lock as it is not needed and it makes the code more
    confusing.

    [mhocko@suse.cz: changelog]
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • __section_nr() was implemented to retrieve the corresponding memory
    section number according to its descriptor. It's possible that the
    specified memory section descriptor doesn't exist in the global array. So
    add more checking on that and report an error for a wrong case.
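
    The kind of lookup being hardened can be sketched like this. It is an
    illustration under assumptions, not the kernel function: the sizes,
    section_model and section_nr are hypothetical, and a missing descriptor
    is reported with a -1 return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ROOTS          4  /* hypothetical number of root entries */
#define SECTIONS_PER_ROOT 8  /* hypothetical descriptors per root */

struct section_model { unsigned long map; };

/*
 * Return the section number of descriptor `ms`, or -1 if it does not
 * belong to any allocated root. Addresses are compared as integers so
 * that pointers into unrelated objects can be tested safely.
 */
static long section_nr(struct section_model *const roots[],
                       const struct section_model *ms)
{
    uintptr_t p = (uintptr_t)ms;

    assert(roots != NULL);
    for (size_t r = 0; r < NR_ROOTS; r++) {
        const struct section_model *root = roots[r];
        uintptr_t lo, hi;

        if (!root)
            continue;  /* this root was never allocated */
        lo = (uintptr_t)root;
        hi = (uintptr_t)(root + SECTIONS_PER_ROOT);
        if (p >= lo && p < hi)
            return (long)(r * SECTIONS_PER_ROOT + (p - lo) / sizeof(*root));
    }
    return -1;  /* descriptor is not in the global array */
}
```

    Callers in this sketch must check for -1 before using the result as an
    index.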

    Signed-off-by: Gavin Shan
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
    descriptors are allocated from slab or bootmem. When allocating from
    slab, let slab/bootmem allocator clear the memory chunk. We needn't clear
    it explicitly.

    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Add a mem_cgroup_from_css() helper to replace open-coded invocations of
    container_of(). This clarifies the code and adds a little more type safety.
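
    The pattern looks roughly like the sketch below. It is a userspace
    illustration with hypothetical types (css_model, memcg_model) and a
    simplified container_of macro; the kernel macro performs additional type
    checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of: step back from a member to its parent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the embedded state and the containing object. */
struct css_model {
    int refcnt;
};

struct memcg_model {
    long usage;
    struct css_model css;  /* embedded, and not the first member */
};

/*
 * Typed wrapper: callers pass a struct css_model * and get a
 * struct memcg_model * back, so the compiler rejects a wrong pointer
 * type at the call site instead of it being hidden inside a macro
 * expansion repeated at every use.
 */
static inline struct memcg_model *memcg_from_css(struct css_model *css)
{
    assert(css != NULL);
    return container_of(css, struct memcg_model, css);
}
```

    Centralising the conversion in one helper also means a later change to
    the embedding only has to be made in one place.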

    [akpm@linux-foundation.org: fix extensive breakage]
    Signed-off-by: Wanpeng Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Gavin Shan
    Cc: Wanpeng Li
    Cc: Gavin Shan
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The may_enter_fs test turns out to be too restrictive: though I saw no
    problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
    on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just
    slightly changed the way I started off the testing: dd if=/dev/zero
    of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
    memory.limit_in_bytes cgroup to ext4 on USB stick.

    ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
    AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
    the transaction needs to be started even before allocating pagecache
    memory. But it may not be worth worrying about these days: if direct
    reclaim avoids FS writeback, does __GFP_FS now mean anything?

    Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
    device; but since that also masks off __GFP_IO, we can test for __GFP_IO
    directly, ignoring may_enter_fs and __GFP_FS.

    But even so, the test still OOMs sometimes: when originally testing on
    3.5-rc6, it OOMed about one time in five or ten; when testing just now on
    3.5-rc6-mm1, it OOMed on the first iteration.

    This residual problem comes from an accumulation of pages under ordinary
    writeback, not marked PageReclaim, so rightly not causing the memcg check
    to wait on their writeback: these too can prevent shrink_page_list() from
    freeing any pages, so many times that memcg reclaim fails and OOMs.

    Deal with these in the same way as direct reclaim now deals with dirty FS
    pages: mark them PageReclaim. It is appropriate to rotate these to tail
    of list when writepage completes, but more importantly, the PageReclaim
    flag makes memcg reclaim wait on them if encountered again. Increment
    NR_VMSCAN_IMMEDIATE? That's arguable: I chose not.

    Setting PageReclaim here may occasionally race with end_page_writeback()
    clearing it: lru_deactivate_fn() already faced the same race, and
    correctly concluded that the window is small and the issue non-critical.

    With these changes, the test runs indefinitely without OOMing on ext4,
    ext3 and ext2: I'll move on to test with other filesystems later.

    Trivia: invert conditions for a clearer block without an else, and goto
    keep_locked to do the unlock_page.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Fengguang Wu
    Acked-by: Michal Hocko
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The current implementation of dirty pages throttling is not memcg aware
    which makes it easy to have memcg LRUs full of dirty pages. Without
    throttling, these LRUs can be scanned faster than the rate of writeback,
    leading to memcg OOM conditions when the hard limit is small.

    This patch fixes the problem by throttling the allocating process
    (possibly a writer) during the hard limit reclaim by waiting on
    PageReclaim pages. We are waiting only for PageReclaim pages because
    those are the pages that made one full round over LRU and that means that
    the writeback is much slower than scanning.

    This solution is far from ideal (the long-term solution is memcg-aware
    dirty throttling), but it is meant as a band-aid until we have a real
    fix. We are seeing this happen during nightly backups, which are placed
    into containers to prevent eviction of the real working set.

    The change affects only memcg reclaim, and only when we encounter
    PageReclaim pages, which is a signal that reclaim is not keeping up with
    the writers, so somebody should be throttled. This could potentially be
    unfair, because somebody else from the group could get throttled on
    behalf of the writer, but as writers need to allocate as well, and
    allocate at a higher rate, the probability that only innocent processes
    would be penalized is not that high.

    I have tested this change by a simple dd copying /dev/zero to tmpfs or
    ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
    containers) and dd got killed by OOM killer every time. With the patch I
    could run the dd with the same size under 5M controller without any OOM.
    The issue is more visible with slower devices for output.

    * With the patch
    ================
    * tmpfs size=2G
    ---------------
    $ vim cgroup_cache_oom_test.sh
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

    * Without the patch
    ===================
    * tmpfs size=2G
    ---------------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

    [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
    [hughd@google.com: fix deadlock with loop driver]
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Reviewed-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Fengguang Wu
    Signed-off-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mmu_notifier_release() is called when the process is exiting. It will
    delete all the mmu notifiers. But at this time the page belonging to the
    process is still present in page tables and is present on the LRU list, so
    this race will happen:

    CPU 0                               CPU 1
    mmu_notifier_release:               try_to_unmap:
      hlist_del_init_rcu(&mn->hlist);
                                          ptep_clear_flush_notify:
                                            mmu notifier not found
                                        free page !!!!!!
                                        /*
                                         * At this point, the page has been
                                         * freed, but it is still mapped in
                                         * the secondary MMU.
                                         */

      mn->ops->release(mn, mm);

    Then the box is not stable and sometimes we can get this bug:

    [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec
    [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076
    [ 738.075936] page flags: 0x20000000000014(referenced|dirty)

    The same issue is present in mmu_notifier_unregister().

    We can call ->release before deleting the notifier to ensure the page has
    been unmapped from the secondary MMU before it is freed.
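
    The effect of the ordering can be replayed with a tiny sequential model
    of the interleaving shown above. This is a sketch under simplifying
    assumptions: struct sim_state and the sim_*() helpers are hypothetical,
    a single page and a single notifier are modelled, and the "race" is a
    fixed interleaving in which CPU 1 runs between the two CPU 0 steps.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical model of one page shared with a secondary MMU. */
    struct sim_state {
        bool notifier_registered;  /* notifier is still on the mm's list */
        bool secondary_mapped;     /* secondary MMU still maps the page  */
        bool page_freed;
    };

    /* CPU 1: unmap the page, notifying the secondary MMU if it is found. */
    static void sim_try_to_unmap_and_free(struct sim_state *s)
    {
        assert(!s->page_freed);
        if (s->notifier_registered)
            s->secondary_mapped = false;  /* invalidate callback ran */
        s->page_freed = true;
    }

    /* ->release(): the secondary MMU drops every mapping it holds. */
    static void sim_release(struct sim_state *s)
    {
        s->secondary_mapped = false;
    }

    /*
     * Returns true if the page was freed while the secondary MMU still
     * mapped it, which is the bug described in the changelog.
     */
    bool sim_freed_while_mapped(bool release_before_unlink)
    {
        struct sim_state s = { true, true, false };

        if (release_before_unlink)
            sim_release(&s);                 /* fixed order: step 1 */
        else
            s.notifier_registered = false;   /* old order: step 1   */

        sim_try_to_unmap_and_free(&s);       /* CPU 1 runs between the steps */
        bool bug = s.secondary_mapped;       /* the page is freed by now     */

        if (release_before_unlink)
            s.notifier_registered = false;   /* fixed order: step 2 */
        else
            sim_release(&s);                 /* old order: step 2, too late */

        return bug;
    }
    ```

    In this model the old order (unlink, then release) reports the bug and
    the new order (release, then unlink) does not.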

    Signed-off-by: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Cc: Paul Gortmaker
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • shmem knows for sure that the page is in swap cache when attempting to
    charge a page, because the cache charge entry function has a check for it.
    Only anon pages may be removed from swap cache already when trying to
    charge their swapin.

    Adjust the comment, though: '4969c11 mm: fix swapin race condition' added
    a stable PageSwapCache check under the page lock in the do_swap_page()
    before calling the memory controller, so it's unuse_pte()'s pte_same()
    that may fail.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon and shmem pages in the swap cache may see repeated charge
    attempts, from every swap pte fault or from shmem_unuse(). No other
    pages require checking PageCgroupUsed().

    Charging pages in the swap cache is also serialized by the page lock, and
    since both the try_charge and commit_charge are called under the same page
    lock section, the PageCgroupUsed() check might as well happen before the
    counter charging, let alone reclaim.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When shmem is charged upon swapin, it does not need to check twice whether
    the memory controller is enabled.

    Also, shmem pages do not have to be checked for everything that regular
    anon pages have to be checked for, so let shmem use the internal version
    directly and allow future patches to move around checks that are only
    required when swapping in anon pages.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
    or init_mm, it will charge the root memcg in either case.

    Also fix up the comment in __mem_cgroup_try_charge() that claimed the
    init_mm would be charged when no mm was passed. It's not really
    incorrect, but confusing. Clarify that the root memcg is charged in this
    case.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem page charges have not needed a separate charge type to tell them
    from regular file pages since 08e552c ("memcg: synchronized LRU").

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Charging cache pages may require swapin in the shmem case. Save the
    forward declaration and just move the swapin functions above the cache
    charging functions.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon pages that are uncharged at the time of the last page table
    mapping vanishing may be in swapcache.

    When shmem pages, file pages, swap-freed anon pages, or just migrated
    pages are uncharged, they are known for sure to be not in swapcache.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Not all uncharge paths need to check whether the page is in swapcache;
    some of them know for sure.

    Push down the check into all callsites of uncharge_common() so that the
    patch that removes some of them is more obvious.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
    the function would continue to reestablish the page even after
    mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt
    handling at swapoff", the condition is always true when this code is
    reached.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Compaction (and page migration in general) can currently be hindered
    through pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.
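
    The difference between the two sequences can be shown with a memcg
    reduced to a usage counter and a hard limit. This is an illustrative
    sketch only: struct memcg_model and both functions are hypothetical,
    reclaim is assumed to be impossible, and one page equals one unit of
    charge.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical memcg reduced to a page counter and a hard limit. */
    struct memcg_model {
        unsigned long usage;
        unsigned long limit;
    };

    /* Old sequence: the replacement page is charged before migrating. */
    bool migrate_charging_both(struct memcg_model *cg)
    {
        assert(cg->usage <= cg->limit);
        if (cg->usage + 1 > cg->limit)
            return false;   /* at the limit, nothing reclaimable: fail    */
        cg->usage += 1;     /* old and new page are both charged briefly  */
        cg->usage -= 1;     /* the page that is dropped gets uncharged    */
        return true;
    }

    /* New sequence: whichever page survives keeps the existing charge. */
    bool migrate_transferring_charge(struct memcg_model *cg)
    {
        assert(cg->usage <= cg->limit);
        return true;        /* usage never changes, so the limit cannot block */
    }
    ```

    With usage equal to the limit, migrate_charging_both() fails while
    migrate_transferring_charge() succeeds and leaves the counter untouched.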

    The replacement page will still show up temporarily in the rss/cache
    statistics; this can be fixed in a later patch as it is less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit b3a27d ("swap: Add swap slot free callback to
    block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
    dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the
    swap_info_struct before dereferencing.
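
    The shape of the guard can be sketched in userspace as follows. All
    names here (MODEL_SWP_BLKDEV, struct model_swap_info, model_slot_free()
    and so on) are stand-ins invented for the example, and the flag value is
    arbitrary; only the idea of testing the flag before touching the block
    device pointer comes from the changelog.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative flag value; the kernel's SWP_BLKDEV bit may differ. */
    #define MODEL_SWP_BLKDEV 0x1u

    struct model_disk {
        void (*slot_free_notify)(unsigned long offset);  /* may be NULL */
    };

    struct model_bdev {
        struct model_disk *disk;
    };

    struct model_swap_info {
        unsigned int flags;
        struct model_bdev *bdev;  /* NULL when swap is not on a block device */
    };

    /* Records the last offset passed to the example callback. */
    unsigned long last_notified = (unsigned long)-1;

    void record_notify(unsigned long offset)
    {
        last_notified = offset;
    }

    /* Returns 1 if the callback ran, 0 if it was skipped. */
    int model_slot_free(const struct model_swap_info *p, unsigned long offset)
    {
        assert(p != NULL);
        if (!(p->flags & MODEL_SWP_BLKDEV))
            return 0;   /* e.g. swap-over-NFS: never dereference p->bdev */
        if (p->bdev->disk->slot_free_notify == NULL)
            return 0;
        p->bdev->disk->slot_free_notify(offset);
        return 1;
    }
    ```

    A swap area without the flag and with a NULL bdev is skipped safely; a
    block-device-backed area with a callback installed gets notified.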

    With reference to this callback, Christoph Hellwig stated "Please just
    remove the callback entirely. It has no user outside the staging tree and
    was added clearly against the rules for that staging tree". This would
    also be my preference but there was not an obvious way of keeping zram in
    staging/ happy.

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The patch "mm: add support for a filesystem to activate swap files and use
    direct_IO for writing swap pages" added support for using direct_IO to
    write swap pages but it is insufficient for highmem pages.

    To support highmem pages, this patch kmap()s the page before calling the
    direct_IO() handler. As direct_IO deals with virtual addresses, an
    additional helper is necessary for get_kernel_pages() to look up the
    struct page for a kmap virtual address.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The version of swap_activate introduced is sufficient for swap-over-NFS
    but would not provide enough information to implement a generic handler.
    This patch shuffles things slightly to ensure the same information is
    available for aops->swap_activate() as is available to the core.

    No functionality change.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate the
    blocks, it is critical that information such as block groups, block
    bitmaps and the block descriptor table that map the swap file be
    resident in memory. This patch adds address_space_operations that the VM
    can call when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful swapfile ->swap_activate, the swapfile is marked
    SWP_FILE and swapper_space.a_ops will proxy to
    sis->swap_file->f_mapping->a_ops using ->direct_IO to write swapcache
    pages and ->readpage to read.

    It is perfectly possible to use direct_IO to read the swap pages as well,
    but it is an unnecessary complication. Similarly, ->writepage could be
    used instead of direct_IO to write the pages, but filesystem developers
    have stated that calling writepage from the VM is undesirable for a
    variety of reasons, and using direct_IO opens up the possibility of
    writing back batches of swap pages in the future.
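
    The activate/deactivate contract described above can be sketched with a
    small table of optional callbacks. This is a userspace model, not the
    kernel interface: struct model_aops, struct model_swapfile,
    model_swapon() and model_swapoff() are hypothetical, and the fallback
    for a missing ->swap_activate() is assumed to be the existing ->bmap
    path.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct model_file;   /* opaque stand-in for struct file */

    /* Hypothetical subset of address_space_operations. */
    struct model_aops {
        int (*swap_activate)(struct model_file *);
        int (*swap_deactivate)(struct model_file *);
    };

    struct model_swapfile {
        const struct model_aops *aops;
        struct model_file *file;
        bool swp_file;   /* models the SWP_FILE marking */
    };

    /* swapon side: returns 0 on success, nonzero on failure. */
    int model_swapon(struct model_swapfile *sf)
    {
        assert(sf != NULL && sf->aops != NULL);
        sf->swp_file = false;
        if (sf->aops->swap_activate == NULL)
            return 0;                   /* assumed fallback: ->bmap path */
        int err = sf->aops->swap_activate(sf->file);
        if (err)
            return err;
        sf->swp_file = true;            /* swap I/O now goes through the fs */
        return 0;
    }

    /* swapoff side: only undo an activation that actually succeeded. */
    void model_swapoff(struct model_swapfile *sf)
    {
        assert(sf != NULL && sf->aops != NULL);
        if (sf->swp_file && sf->aops->swap_deactivate != NULL)
            sf->aops->swap_deactivate(sf->file);
        sf->swp_file = false;
    }

    /* Example callbacks used to exercise the model. */
    int activations, deactivations;
    int ok_activate(struct model_file *f)  { (void)f; activations++; return 0; }
    int bad_activate(struct model_file *f) { (void)f; return -1; }
    int count_deactivate(struct model_file *f)
    {
        (void)f;
        deactivations++;
        return 0;
    }
    ```

    A swapfile whose activation failed is never passed to
    ->swap_deactivate(), which mirrors the rule that the deactivate method
    is only called if ->swap_activate() returned success.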

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
    may be used to pin a vector of kernel addresses for IO. The initial user
    is expected to be NFS for allowing pages to be written to swap using
    aops->direct_IO(). Strictly speaking, swap-over-NFS only needs to pin one
    page for IO but it makes sense to express the API in terms of a vector and
    add a helper for pinning single pages.
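
    The relationship between the vector call and the single-page helper can
    be illustrated in userspace. Everything below is a hypothetical model
    (struct model_kvec, struct pinned_page, model_get_pages(),
    model_get_page() and the 4096-byte page size are assumptions made for
    the sketch); it only shows the helper being a one-element wrapper
    around the vector form.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MODEL_PAGE_SIZE 4096u   /* assumed page size for the sketch */

    /* Hypothetical segment descriptor: one page-sized buffer. */
    struct model_kvec {
        void  *base;
        size_t len;
    };

    /* Stand-in for struct page: remembers which address was pinned. */
    struct pinned_page {
        uintptr_t addr;
        int pin_count;
    };

    /*
     * Vector form: "pin" one page per segment. Returns the number pinned,
     * stopping at the first segment that is not exactly one page long.
     */
    int model_get_pages(const struct model_kvec *vec, int nr,
                        struct pinned_page *pages)
    {
        int i;

        assert(vec != NULL && pages != NULL);
        for (i = 0; i < nr; i++) {
            if (vec[i].len != MODEL_PAGE_SIZE)
                break;
            pages[i].addr = (uintptr_t)vec[i].base;
            pages[i].pin_count = 1;
        }
        return i;
    }

    /* Single-page helper expressed in terms of the vector call. */
    int model_get_page(void *addr, struct pinned_page *page)
    {
        const struct model_kvec one = { addr, MODEL_PAGE_SIZE };

        return model_get_pages(&one, 1, page);
    }
    ```

    Keeping the vector form as the primitive means a caller that later
    wants to pin several pages at once needs no new interface.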

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. It matches page->index for pages in a regular
    mapping, and also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.
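
    A simplified userspace model of the three helpers follows. The struct
    layout is invented for the sketch (struct fpage_model keeps the swap
    slot and the swap file's mapping in explicit fields, which is not how
    struct page stores them), and a 4 KiB page size is assumed.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MODEL_PAGE_SHIFT 12   /* assumes 4 KiB pages */

    struct model_mapping { int id; };

    /* Hypothetical, much-reduced struct page. */
    struct fpage_model {
        bool swapcache;                           /* models PG_swapcache     */
        unsigned long index;                      /* offset in page->mapping */
        unsigned long swap_offset;                /* slot in the swap file   */
        struct model_mapping *mapping;            /* regular mapping         */
        struct model_mapping *swap_file_mapping;  /* swap_file->f_mapping    */
    };

    /* Index of the page within its backing file, in page-sized blocks. */
    unsigned long model_page_file_index(const struct fpage_model *page)
    {
        assert(page != NULL);
        return page->swapcache ? page->swap_offset : page->index;
    }

    /* Byte offset within the backing file, built on the index helper. */
    long long model_page_file_offset(const struct fpage_model *page)
    {
        return (long long)model_page_file_index(page) << MODEL_PAGE_SHIFT;
    }

    /* Mapping that actually backs the page. */
    struct model_mapping *model_page_file_mapping(const struct fpage_model *page)
    {
        assert(page != NULL);
        return page->swapcache ? page->swap_file_mapping : page->mapping;
    }
    ```

    A filesystem written against these helpers does not need to
    special-case swap cache pages at each call site.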

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Under significant pressure when writing back to network-backed storage,
    direct reclaimers may get throttled. This is expected to be a short-lived
    event and the processes get woken up again but processes do get stalled.
    This patch counts how many times such stalling occurs. It's up to the
    administrator whether to reduce these stalls by increasing
    min_free_kbytes.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If swap is backed by network storage such as NBD, there is a risk that a
    large number of reclaimers can hang the system by consuming all
    PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
    min_free_kbytes in advance which is a bit fragile.

    This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
    are in use. If the system is routinely getting throttled the system
    administrator can increase min_free_kbytes so degradation is smoother but
    the system will keep running.
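
    The throttling condition can be sketched as a comparison between free
    memory and half of a reserve. This is a model under stated assumptions:
    struct zone_model and should_throttle_direct_reclaim() are hypothetical,
    and the reserve is taken to be the sum of per-zone minimum watermarks,
    which is one plausible reading of "PF_MEMALLOC reserves" rather than a
    statement of what the patch computes.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-zone numbers, in pages. */
    struct zone_model {
        unsigned long free_pages;
        unsigned long min_watermark;   /* assumed to scale with min_free_kbytes */
    };

    /*
     * Returns true when a direct reclaimer should be throttled: free memory
     * has dropped to half of the summed reserve or below.
     */
    bool should_throttle_direct_reclaim(const struct zone_model *zones, size_t nr)
    {
        unsigned long reserve = 0, free_pages = 0;
        size_t i;

        assert(zones != NULL || nr == 0);
        for (i = 0; i < nr; i++) {
            reserve += zones[i].min_watermark;
            free_pages += zones[i].free_pages;
        }
        return free_pages <= reserve / 2;
    }
    ```

    With a summed reserve of 400 pages, 1400 free pages do not trigger
    throttling in this model, while 140 free pages do.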

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman