20 Nov, 2015

3 commits

  • Pass the correct argument to subsys_cgroup_allow_attach(), which
    expects a 'struct cgroup_subsys_state *' argument; we were passing
    a 'struct cgroup *' instead, which is wrong.

    This fixes the following 'incompatible pointer type' compiler warning:
    ----------
    CC mm/memcontrol.o
    mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
    mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
    In file included from include/linux/memcontrol.h:22:0,
    from mm/memcontrol.c:29:
    include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
    ----------
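
    A minimal sketch of the corrected caller, assuming the Android
    common-kernel prototypes (mem_cgroup_allow_attach() receives the css
    directly and just forwards it):

    static int mem_cgroup_allow_attach(struct cgroup_subsys_state *css,
                                       struct cgroup_taskset *tset)
    {
            /* forward the css itself; do not pass css->cgroup */
            return subsys_cgroup_allow_attach(css, tset);
    }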

    Signed-off-by: Amit Pundir

    Amit Pundir
     
  • Use the 'allow_attach' handler for the 'mem' cgroup to allow
    non-root processes to add arbitrary processes to a 'mem' cgroup
    if they have the CAP_SYS_NICE capability set.
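
    A hedged sketch of what such a helper might look like (the exact
    Android implementation may differ; the iteration macro and the
    credential checks here are assumptions):

    int subsys_cgroup_allow_attach(struct cgroup_subsys_state *css,
                                   struct cgroup_taskset *tset)
    {
            const struct cred *cred = current_cred(), *tcred;
            struct task_struct *task;

            /* CAP_SYS_NICE lets a non-root process attach arbitrary tasks */
            if (capable(CAP_SYS_NICE))
                    return 0;

            cgroup_taskset_for_each(task, tset) {
                    tcred = __task_cred(task);
                    if (current != task &&
                        !uid_eq(cred->euid, tcred->uid) &&
                        !uid_eq(cred->euid, tcred->suid))
                            return -EACCES;
            }
            return 0;
    }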

    Bug: 18260435
    Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
    Signed-off-by: Rom Lemarchand

    Rom Lemarchand
     
  • NOT FOR STAGING
    This patch re-adds the original shmem_set_file to mm/shmem.c
    and converts ashmem.c back to using it.
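
    For reference, a sketch of the helper being re-added (assuming the
    shape it had in earlier Android trees):

    void shmem_set_file(struct vm_area_struct *vma, struct file *file)
    {
            if (vma->vm_file)
                    fput(vma->vm_file);
            vma->vm_file = file;
            vma->vm_ops = &shmem_vm_ops;
    }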

    CC: Brian Swetland
    CC: Colin Cross
    CC: Arve Hjønnevåg
    CC: Dima Zavin
    CC: Robert Love
    CC: Greg KH
    Signed-off-by: John Stultz

    John Stultz
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned inode in mmap(), but if so, I cannot locate them now.
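
    A minimal sketch of the resulting shmem_zero_setup(), assuming the
    __shmem_file_setup() signature from v3.13's c7277090927a:

    int shmem_zero_setup(struct vm_area_struct *vma)
    {
            struct file *file;
            loff_t size = vma->vm_end - vma->vm_start;

            /*
             * The cloned inode is a kernel-internal detail: S_PRIVATE
             * skips LSM checks and so avoids the lockdep conflict.
             */
            file = __shmem_file_setup("dev/zero", size, vma->vm_flags,
                                      S_PRIVATE);
            if (IS_ERR(file))
                    return PTR_ERR(file);

            if (vma->vm_file)
                    fput(vma->vm_file);
            vma->vm_file = file;
            vma->vm_ops = &shmem_vm_ops;
            return 0;
    }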

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Jun, 2015

4 commits

  • If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
    pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
    will dereference the NULL pool->handle_cachep.

    Modify destroy_handle_cache() to avoid this.
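
    The fix is a one-line guard; a sketch matching the description:

    static void destroy_handle_cache(struct zs_pool *pool)
    {
            /* handle_cachep may still be NULL on the error path */
            if (pool->handle_cachep)
                    kmem_cache_destroy(pool->handle_cachep);
    }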

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
    swapout path because the spin_lock_irq(&mapping->tree_lock) in the
    caller doesn't actually disable hardware interrupts - which is fine,
    because on -rt the top halves run in process context, so we are still
    safe from preemption while updating the statistics.

    Remove the VM_BUG_ON() but keep the comment of what we rely on.
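
    What remains in mem_cgroup_swapout() is roughly this (a sketch; the
    retained comment is paraphrased and the statistics call approximated):

    /*
     * Interrupts should be disabled here because the caller holds
     * mapping->tree_lock, which is taken with interrupts off. On -rt the
     * lock does not disable hard interrupts, but its holder still cannot
     * be preempted, so the per-CPU statistics update remains safe.
     */
    mem_cgroup_charge_statistics(memcg, page, -1);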

    Signed-off-by: Johannes Weiner
    Reported-by: Clark Williams
    Cc: Fernando Lopez-Lezcano
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When trimming memcg consumption excess (see memory.high), we call
    try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
    in the current context, which can result in a deadlock. Fix this.
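
    A plausible shape of the fix, assuming the __GFP_WAIT flag of that
    era as the "may sleep" test:

    /* only trim memory.high excess when the context may sleep */
    if (gfp_mask & __GFP_WAIT) {
            mem_cgroup_events(memcg, MEMCG_HIGH, 1);
            try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
    }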

    Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[] [] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
    unlock_page+0x6d/0x70
    generic_write_end+0x53/0xb0
    xfs_vm_write_end+0x29/0x80 [xfs]
    generic_perform_write+0x10a/0x1e0
    xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
    xfs_file_write_iter+0x79/0x120 [xfs]
    __vfs_write+0xd4/0x110
    vfs_write+0xac/0x1c0
    SyS_write+0x58/0xd0
    system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP [] __wake_up_bit+0x20/0x70
    RSP
    CR2: ffffc90008963690

    Reproduction method (re-add a node):
    Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

    This looks like a use-after-free problem, and the root cause is that
    zone->wait_table was not set to NULL after being freed in
    try_offline_node().

    When hot re-adding a node, we reuse its pgdat (and hence its zone
    structs), and when adding pages to the target zone, the zone is
    initialized first (including the wait_table) if it has not been
    initialized yet. The zone-initialized check is based on
    zone->wait_table:

    static inline bool zone_is_initialized(struct zone *zone)
    {
            return !!zone->wait_table;
    }

    So if we do not set zone->wait_table to NULL after freeing it, the
    memory hotplug routine will skip initializing the zone when the node
    is hot re-added, the wait_table will still point to the freed memory,
    and we then access an invalid address when trying to wake up waiters
    after page I/O completes, as in the oops above.
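
    The fix is to clear the pointer where the table is freed in
    try_offline_node(); a sketch:

    vfree(zone->wait_table);
    /* force wait_table (re)initialization on the next hot-add */
    zone->wait_table = NULL;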

    Signed-off-by: Gu Zheng
    Reported-by: Taku Izumi
    Reviewed-by: Yasuaki Ishimatsu
    Cc: KAMEZAWA Hiroyuki
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a WARN_ON if bdi->dev is NULL. This warning is of no
    real consequence, as bdi->dev isn't needed by anything else in the
    function, and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister(), which has happened since
    commit 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this warning isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved
    to bdi_destroy(), as can the
    writeback_bdi_unregister
    tracepoint, which is already unused.
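
    A sketch of the resulting ordering in bdi_destroy() (helper names as
    in the 4.0-era backing-dev code; details elided):

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_wb_shutdown(bdi);        /* writes through the bdi flushed */
            bdi_set_min_ratio(bdi, 0);   /* release the global min_ratio */
            /* device unregistration and freeing follow */
    }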

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

15 May, 2015

3 commits

  • NUMA balancing is meant to be disabled by default on UMA machines, but
    the check uses nr_node_ids (the highest node ID) instead of
    num_online_nodes (the number of online nodes).

    The consequence is that a UMA machine whose single node has an ID of
    1 or higher will enable NUMA balancing. This incurs useless overhead
    from minor faults, with the impact depending on the workload. This is
    the impact on the stats when running a kernel build on a single-node
    machine whose node ID happened to be 1:

                                    vanilla  patched
    NUMA base PTE updates           5113158        0
    NUMA huge PMD updates               643        0
    NUMA page range updates         5442374        0
    NUMA hint faults                2109622        0
    NUMA hint local faults          2109622        0
    NUMA hint local percent             100      100
    NUMA pages migrated                   0        0
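
    The fix is essentially a predicate swap; a sketch (the surrounding
    init code is assumed):

    /* before: enables balancing on UMA if the only node's ID is > 0 */
    if (nr_node_ids > 1)
            set_numabalancing_state(true);

    /* after: enables balancing only with more than one online node */
    if (num_online_nodes() > 1)
            set_numabalancing_state(true);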

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • I had an issue:

    Unable to handle kernel NULL pointer dereference at virtual address 0000082a
    pgd = cc970000
    [0000082a] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    PC is at get_pageblock_flags_group+0x5c/0xb0
    LR is at unset_migratetype_isolate+0x148/0x1b0
    pc : [] lr : [] psr: 80000093
    sp : c7029d00 ip : 00000105 fp : c7029d1c
    r10: 00000001 r9 : 0000000a r8 : 00000004
    r7 : 60000013 r6 : 000000a4 r5 : c0a357e4 r4 : 00000000
    r3 : 00000826 r2 : 00000002 r1 : 00000000 r0 : 0000003f
    Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
    Control: 10c5387d Table: 2cb7006a DAC: 00000015
    Backtrace:
    get_pageblock_flags_group+0x0/0xb0
    unset_migratetype_isolate+0x0/0x1b0
    undo_isolate_page_range+0x0/0xdc
    __alloc_contig_range+0x0/0x34c
    alloc_contig_range+0x0/0x18

    This issue occurs because, when unset_migratetype_isolate() is called
    to unset a part of CMA memory, it tries to access the buddy page to
    get its status:

    if (order >= pageblock_order) {
            page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
            buddy_idx = __find_buddy_index(page_idx, order);
            buddy = page + (buddy_idx - page_idx);

            if (!is_migrate_isolate_page(buddy)) {

    But the start address of this part of CMA memory is very close to
    memory that was reserved at boot time (and is not in the buddy
    system), so the computed buddy page may not be valid. Add a check
    before accessing it.
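
    A sketch of the guarded access (pfn_valid_within() is the usual
    helper for exactly this case):

    if (pfn_valid_within(page_to_pfn(buddy)) &&
        !is_migrate_isolate_page(buddy)) {
            __isolate_free_page(page, order);
            kernel_map_pages(page, (1 << order), 1);
            set_page_refcounted(page);
            isolated_page = page;
    }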

    [akpm@linux-foundation.org: use conventional code layout]
    Signed-off-by: Hui Zhu
    Suggested-by: Laura Abbott
    Suggested-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Not all kmem allocations should be accounted to memcg. The following
    patch gives an example where accounting a certain type of allocation to
    memcg can effectively result in a memory leak. This patch adds the
    __GFP_NOACCOUNT flag which, if passed to kmalloc and friends, will force
    the allocation to go through the root cgroup. It will be used by the
    next patch.

    Note: since, with kmemleak enabled, each kmalloc implies yet another
    allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
    gfp_kmemleak_mask.

    Alternatively, we could introduce a per kmem cache flag disabling
    accounting for all allocations of a particular kind, but (a) we would not
    be able to bypass accounting for kmalloc then and (b) a kmem cache with
    this flag set could not be merged with a kmem cache without this flag,
    which would increase the number of global caches and therefore
    fragmentation even if the memory cgroup controller is not used.

    Despite its generic name, currently __GFP_NOACCOUNT disables accounting
    only for kmem allocations, while user page allocations are always
    charged. To catch abuse of this flag, a warning is issued on any
    attempt to pass it to mem_cgroup_try_charge.
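
    Usage is just an extra GFP flag at the allocation site; a minimal
    sketch:

    void *obj;

    /* bypass memcg kmem accounting: charge to the root cgroup */
    obj = kmalloc(size, GFP_KERNEL | __GFP_NOACCOUNT);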

    Signed-off-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 May, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes since the merge window;

    - fix for a double elevator module release, from Chao Yu. Ancient bug.

    - the splice() MORE flag fix from Christophe Leroy.

    - a fix for NVMe, fixing a patch that went in during the merge window.
    From Keith.

    - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

    - bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

    - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
    bad merge issue with FUA writes.

    - division-by-zero fix for writeback from Tejun.

    - a block bounce page accounting fix, making sure we inc/dec after
    bouncing so that pre/post IO pages match up. From Wang YanQing"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    splice: sendfile() at once fails for big files
    blk-mq: don't lose requests if a stopped queue restarts
    blk-mq: fix FUA request hang
    block: destroy bdi before blockdev is unregistered.
    block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
    elevator: fix double release of elevator module
    writeback: use |1 instead of +1 to protect against div by zero
    blk-mq: fix CPU hotplug handling
    blk-mq: fix race between timeout and CPU hotplug
    NVMe: Fix VPD B0 max sectors translation

    Linus Torvalds
     

06 May, 2015

4 commits

  • The hwpoison injector checks PageLRU of the raw target page to find out
    whether the page is an appropriate target, but the current code now
    filters out thp tail pages, which prevents us from testing such cases
    via this interface. So let's check hpage instead of p.
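
    The change, in miniature (surrounding logic in mm/hwpoison-inject.c
    assumed):

    /* was: if (!PageLRU(p) && !PageHuge(p)) */
    if (!PageLRU(hpage) && !PageHuge(p))
            return 0;       /* not an appropriate target */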

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Hwpoison injection via debugfs:hwpoison/corrupt-pfn takes a refcount on
    the target page, but the current code doesn't release it if the target
    page is not supposed to be injected, which results in a memory leak.
    This patch simply adds the refcount-releasing code.
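
    Conceptually, the filtered path must now drop the reference it took
    (labels illustrative):

    if (!PageLRU(hpage) && !PageHuge(p))
            goto put_out;   /* filtered target: release the page */

    return memory_failure(pfn, 18, MF_COUNT_INCREASED);

    put_out:
    put_page(p);
    return 0;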

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If multiple soft offline events hit one free page/hugepage concurrently,
    soft_offline_page() can handle the free page/hugepage multiple times,
    which makes the num_poisoned_pages counter increase more than once. This
    patch fixes the wrong counting by checking TestSetPageHWPoison for normal
    pages and by checking the return value of dequeue_hwpoisoned_huge_page()
    for hugepages.
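
    A sketch of the idea (both primitives report whether someone else got
    there first, so the counter is only bumped once):

    if (PageHuge(page)) {
            if (!dequeue_hwpoisoned_huge_page(hpage))
                    atomic_long_add(1 << compound_order(hpage),
                                    &num_poisoned_pages);
    } else {
            if (!TestSetPageHWPoison(page))
                    atomic_long_inc(&num_poisoned_pages);
    }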

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is a 4kB LRU page or a thp head
    page. But we should do this for a thp tail page too.

    Consider that a memory error hits a thp tail page whose head page is on
    a pcplist when memory_failure() runs. Then the current kernel skips
    the shake_page() part, so hwpoison_user_mappings() returns without
    calling split_huge_page() or try_to_unmap(), because PageLRU of the thp
    head is still cleared as a result of skipping shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. Another is a failure to isolate the
    memory error, so a later access to the error address causes another
    MCE, which kills the processes that used the thp.

    This patch fixes the problem by calling shake_page() in the thp tail
    case too.
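
    One plausible shape of the change in memory_failure() (the exact
    condition there is assumed):

    /*
     * Sweep the head page out of the pcplists for thp tails too, so
     * PageLRU is set again before hwpoison_user_mappings() runs.
     */
    if (!PageHuge(p) && !PageLRU(hpage))
            shake_page(hpage, 0);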

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

24 Apr, 2015

1 commit

  • mm/page-writeback.c has several places where 1 is added to the divisor
    to prevent division by zero exceptions; however, if the original
    divisor is equivalent to -1, adding 1 leads to division by zero.

    There are three places where +1 is used for this purpose - one in
    pos_ratio_polynom() and two in bdi_position_ratio(). The second one
    in bdi_position_ratio() actually triggered a div-by-zero oops on a
    machine running a 3.10 kernel. The divisor is

    x_intercept - bdi_setpoint + 1 == span + 1

    span is confirmed to be (u32)-1. It isn't clear how it ended up that
    way, but it could be from the write bandwidth calculation underflow
    fixed by c72efb658f7c ("writeback: fix possible underflow in write
    bandwidth calculation").

    At any rate, +1 isn't a proper protection against div-by-zero. This
    patch converts all +1 protections to |1. Note that
    bdi_update_dirty_ratelimit() was already using |1 before this patch.
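
    Why |1 works where +1 does not, in miniature:

    u64 span = x_intercept - bdi_setpoint;  /* confirmed (u32)-1 here */
    u32 bad  = span + 1;  /* 0xffffffff + 1 truncates to 0: div-by-zero */
    u32 good = span | 1;  /* always odd, hence never 0; like +1, it
                             perturbs the divisor by at most 1 */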

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Apr, 2015

1 commit

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christophs switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     

16 Apr, 2015

20 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Do not perform cond_resched() before the busy compaction loop in
    __zs_compact(), because this loop does it when needed.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no point in overriding the size class below. It causes fatal
    corruption on the next chunk in the 3264-byte size class, which is the
    last size class that is not huge.

    For example, if the requested size was exactly 3264 bytes, current
    zsmalloc allocates and returns a chunk from the 3264-byte size class,
    not the 4096-byte one. User access to this chunk may then overwrite
    the head of the next adjacent chunk.

    Here is the panic log captured when the freelist was corrupted by this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [] lr : [] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

    Reported-by: Sooyong Suk
    Signed-off-by: Heesub Shin
    Tested-by: Sooyong Suk
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • In putback_zspage(), we don't need to re-insert a zspage into the
    size_class's zspage list just to fix its fullness group. We can fix it
    directly, without the reinsertion, and save some instructions.

    Reported-by: Heesub Shin
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Ganesh Mahendran
    Cc: Luigi Semenzato
    Cc: Gunho Lee
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A micro-optimization. Avoid additional branching and reduce (a bit)
    register pressure (e.g. s_off += size; d_off += size; may be calculated
    twice: first for the >= PAGE_SIZE check and later for the offset update
    in the "else" clause).

    scripts/bloat-o-meter shows some improvement

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
    function                        old     new   delta
    zs_object_copy                  550     540     -10

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not synchronize RCU in zs_compact(): neither zsmalloc nor zram
    uses RCU.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Yinghao Xie
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghao Xie
     
  • Create zsmalloc documentation explaining the design concept and stat
    information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating compaction, per-class fullness information is
    helpful for seeing how well compaction works. With it, we can see more
    clearly how compaction behaves on each size class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We store the handle in the header of each allocated object, which
    increases the size of each object by sizeof(unsigned long).

    If zram stores 4096 bytes to zsmalloc (ie, bad compression), zsmalloc
    needs a 4104B class to make room for the handle.

    However, the 4104B class has pages_per_zspage == 1, so the size wasted
    to internal fragmentation is 8192 - 4104 bytes, which is terrible.

    So this patch records the handle in page->private for such huge objects
    (ie, pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of in the
    header of each object, so we can use the 4096B class rather than the
    4104B class, as sketched below.
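
    A sketch of the allocation-side branch (struct-page field reuse as in
    zsmalloc of this era; names are approximations):

    if (!class->huge)
            /* record the handle at the head of the allocated chunk */
            link->handle = handle;
    else
            /* huge object: keep the handle on the zspage's first page */
            set_page_private(first_page, handle);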

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage
    has under 1/4 of its objects used (ie, fullness_threshold_frac). This
    can result in loose packing, since zsmalloc migrates only
    ZS_ALMOST_EMPTY zspages out.

    This patch changes the rule so that zsmalloc marks a zspage with more
    than 3/4 of its objects used as ZS_ALMOST_FULL, giving tighter packing.
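
    A sketch of the classifier after the change (the struct-page field
    reuse follows zsmalloc of this era and is an assumption):

    static enum fullness_group get_fullness_group(struct page *first_page)
    {
            int inuse = first_page->inuse;
            int objs = first_page->objects;
            enum fullness_group fg;

            if (inuse == 0)
                    fg = ZS_EMPTY;
            else if (inuse == objs)
                    fg = ZS_FULL;
            else if (inuse <= 3 * objs / fullness_threshold_frac)
                    fg = ZS_ALMOST_EMPTY;   /* at most 3/4 used */
            else
                    fg = ZS_ALMOST_FULL;    /* above 3/4 used */

            return fg;
    }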

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch provides the core functions for migration in zsmalloc. The
    migration policy is simple, as follows:

    for each size class {
            while {
                    src_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!src_page)
                            break;
                    dst_page = get zs_page from ZS_ALMOST_FULL
                    if (!dst_page)
                            dst_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!dst_page)
                            break;
                    migrate(from src_page, to dst_page);
            }
    }

    For migration, we need to identify which objects in a zspage are
    allocated so we can migrate them out. We could learn this by iterating
    over the free objects in a zspage, because a zspage's first_page keeps
    the free objects in a singly-linked list, but that is not efficient.
    Instead, this patch adds a tag (ie, OBJ_ALLOCATED_TAG) in the header of
    each object (ie, the handle) so we can easily check whether an object
    is allocated; see the sketch below.

    This patch also adds another status bit in the handle to synchronize
    user access through zs_map_object with migration. During migration, we
    cannot move objects a user is accessing, because of data coherency
    between the old object and the new one.
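
    A sketch of the allocated-object test during migration (helper names
    as in zsmalloc of this era; an approximation):

    head = obj_to_head(class, page, addr);  /* handle word or free link */
    if (head & OBJ_ALLOCATED_TAG) {
            handle = head & ~OBJ_ALLOCATED_TAG;
            /* allocated object: pin the handle, then migrate it */
    }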

    [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A later patch's migration support needs parts of zs_malloc and zs_free,
    so this patch factors them out.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, we started to use zram heavily, and some issues popped up.

    1) external fragmentation

    I got a report from Juneho Choi that fork failed although there were
    plenty of free pages in the system. His investigation revealed that
    zram is one of the culprits behind heavy fragmentation, so there was no
    contiguous 16K page left for the pgd needed by fork on ARM.

    2) non-movable pages

    Another problem with zram is that users inherently want to use it as
    swap on small-memory systems, so they combine zram with CMA to use
    memory efficiently. Unfortunately, this doesn't work well, because zram
    cannot use CMA's movable pages as long as it doesn't support
    compaction. I got several reports of OOM happening with zram although
    there was lots of swap space and free space in the CMA area.

    3) internal fragmentation

    zram has started supporting a memory limitation feature to limit memory
    usage, and I sent a patchset (https://lkml.org/lkml/2014/9/21/148) for
    the VM to be harmonized with zram-swap: stop anonymous page reclaim if
    zram has consumed memory up to the limit although there is free space
    on the swap device. One problem with that direction is that zram has no
    way to know about holes in the memory space zsmalloc allocated, created
    by internal fragmentation, so zram would regard the swap as full
    although there is free space in zsmalloc. To solve the issue, zram
    wants to trigger compaction of zsmalloc before deciding whether it is
    full.

    This patchset is the first step toward addressing the issues above. For
    that, it adds an indirection layer between the handle and the object
    location, and supports manual compaction, solving the third problem
    first of all.

    After this patchset is merged, the next step is to make the VM aware of
    zsmalloc compaction so that generic compaction will move
    zsmalloc-allocated pages automatically at runtime.

    In my synthetic experiment (ie, highly compressible data with heavy
    swap in/out on an 8G zram swap), the data is as follows:

    Before compaction:
    zram allocated object : 60212066 bytes
    zram total used       : 140103680 bytes
    ratio                 : 42.98 percent
    MemFree               : 840192 kB

    After compaction:
    zram allocated object : 60212066 bytes
    zram total used       : 76185600 bytes
    ratio                 : 79.03 percent
    MemFree               : 901932 kB

    Juneho reported the numbers below from his real platform with light
    aging, so I think the benefit would be bigger on a system aged over a
    long time.

    - frag_ratio increased 3% (ie, higher is better)
    - memfree increased about 6MB
    - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

    frag ratio after swap fragment
    used : 156677 kbytes
    total: 166092 kbytes
    frag_ratio : 94
    meminfo before compaction
    MemFree: 83724 kB
    Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
    Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0

    num_migrated : 23630
    compaction done

    frag ratio after compaction
    used : 156673 kbytes
    total: 160564 kbytes
    frag_ratio : 97
    meminfo after compaction
    MemFree: 89060 kB
    Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
    Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0

    This patchset adds more logic (about 480 lines) to zsmalloc, but when I
    tested a heavy swap-in/out program, the regression in swap-in/out speed
    was marginal, because most of the overhead comes from
    compress/decompress and other MM reclaim work.

    This patch (of 7):

    Currently, a zsmalloc handle encodes the object's location directly,
    which makes supporting migration hard.

    This patch decouples handle and object by adding an indirection layer.
    For that, it allocates the handle dynamically and returns it to the
    user. The handle is an address allocated by the slab allocator, so it
    is unique, and we can keep the object's location in the memory
    allocated for the handle.

    With that, we can change an object's position without changing the
    handle itself.
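
    In miniature (helper names follow zsmalloc of this era and are
    assumptions):

    handle = alloc_handle(pool);   /* slab object: a stable address */
    obj = obj_malloc(first_page, class, handle);  /* <page,index> encoded */
    record_obj(handle, obj);       /* *(unsigned long *)handle = obj */
    return handle;                 /* obj may move later; handle won't */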

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This will allow a FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
    to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page-in a read-only PFN, and then mmap-write to the same page.

    We need this functionality to fix a DAX bug, where in the scenario above
    we fail to set ctime/mtime though we modified the file. An xfstest is
    attached to this patchset that shows the failure and the fix. (A DAX
    patch will follow)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because:
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious
    audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
    users, to make sure they do not now crash. For example, current
    DAX code (which this is for) would crash.
    If we wanted to reuse page_mkwrite, we would need to first patch
    all users so they do not crash on no-page, and only then enable
    this patch. But even if I did that, I would not sleep so well at
    night. Adding a new vector is the safest thing to do, and it is
    not that expensive: an extra pointer in a static function vector
    per driver. The new vector is also better for performance, because
    otherwise we would call all current kernel vectors just to
    check-we-have-no-page, do nothing, and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
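
    A sketch of how a VM_PFNMAP/VM_MIXEDMAP driver might hook the new
    vector (the my_* names and the handler body are illustrative):

    static int my_pfn_mkwrite(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
    {
            /* first write to a read-only pfn: account the modification */
            file_update_time(vma->vm_file);
            return 0;
    }

    static const struct vm_operations_struct my_vm_ops = {
            .fault       = my_fault,
            .pfn_mkwrite = my_pfn_mkwrite,
    };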

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This patch additionally prints the file name, vm_ops->fault, f_op->mmap
    and a_ops->readpage (which is almost always implemented and
    filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
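
    The print itself might look roughly like this (a sketch; the
    surrounding print_bad_pte() plumbing is omitted):

    if (vma->vm_file) {
            struct address_space *mapping = vma->vm_file->f_mapping;

            pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
                     vma->vm_file,
                     vma->vm_ops ? vma->vm_ops->fault : NULL,
                     vma->vm_file->f_op->mmap,
                     mapping ? mapping->a_ops->readpage : NULL);
    }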

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mempools keep allocated objects in reserve for situations when an
    ordinary allocation may not be possible to satisfy. These objects
    shouldn't be accessed before they leave the pool.

    This patch poisons elements when they enter the pool and unpoisons them
    when they leave it. This lets KASan detect use-after-free of a
    mempool's elements.
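
    Conceptually (the kasan_*_element helper names are assumptions, not
    necessarily the exact API introduced here):

    static void add_element(mempool_t *pool, void *element)
    {
            BUG_ON(pool->curr_nr >= pool->min_nr);
            kasan_poison_element(pool, element);    /* now off limits */
            pool->elements[pool->curr_nr++] = element;
    }

    static void *remove_element(mempool_t *pool)
    {
            void *element = pool->elements[--pool->curr_nr];

            BUG_ON(pool->curr_nr < 0);
            kasan_unpoison_element(pool, element);  /* usable again */
            return element;
    }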

    Signed-off-by: Andrey Ryabinin
    Tested-by: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
    to the immediately preceding function.

    Cc: Dmitry Safonov
    Cc: Michal Nazarewicz
    Cc: Stefan Strogin
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Here are two functions that provide an interface to compute the used
    size and the size of the biggest free chunk in a CMA region; one is
    sketched below. Add that information to debugfs.
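
    A sketch of the used-size side (assuming the cma->lock/bitmap layout
    of mm/cma.c at the time):

    static int cma_used_get(void *data, u64 *val)
    {
            struct cma *cma = data;
            unsigned long used;

            mutex_lock(&cma->lock);
            /* bits set in the bitmap are allocated chunks */
            used = bitmap_weight(cma->bitmap, (int)cma->count);
            mutex_unlock(&cma->lock);
            *val = (u64)used << cma->order_per_bit;

            return 0;
    }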

    [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
    [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Stefan Strogin
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov