23 Nov, 2015

6 commits

  • Merge slub bulk allocator updates from Andrew Morton:
    "This missed the merge window because I was waiting for some repairs to
    come in. Nothing actually uses the bulk allocator yet and the changes
    to other code paths are pretty small. And the net guys are waiting
    for this so they can start merging the client code"

    More comments from Jesper Dangaard Brouer:
    "The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
    previous kernel. The present version contains a bug. Vladimir
    Davydov noticed it contained a bug, when kernel is compiled with
    CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
    bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
    kmem_cache_free_bulk() were missing (see commit 033745189b1b "slub:
    add missing kmem cgroup support to kmem_cache_free_bulk").

    I don't consider the fix stable-material because there are no in-tree
    users of the API.

    But with known bugs (for memcg) I cannot start using the API in the
    net-tree"

    * emailed patches from Andrew Morton :
    slab/slub: adjust kmem_cache_alloc_bulk API
    slub: add missing kmem cgroup support to kmem_cache_free_bulk
    slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
    slub: optimize bulk slowpath free by detached freelist
    slub: support for bulk free with SLUB freelists

    Linus Torvalds
     
  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to run
    bulk alloc without local IRQs disabled, using a cmpxchg that "steals"
    the entire c->freelist or page->freelist. To avoid overshooting we would
    stop processing at a slab-page boundary; otherwise we always end up
    returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API that link against an
    older kernel while using the new flag, we need to return the number of
    allocated objects with this API change.
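
    A minimal usage sketch of the adjusted API (an illustrative caller, not
    code from the tree; error handling is simplified):

      #include <linux/errno.h>
      #include <linux/slab.h>

      #define NR_OBJS 16

      static int example_bulk(struct kmem_cache *cache)
      {
              void *objs[NR_OBJS];
              int allocated;

              /* returns the number of objects allocated, 0 on failure */
              allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL,
                                                NR_OBJS, objs);
              if (!allocated)
                      return -ENOMEM;

              /* ... use objs[0] .. objs[allocated - 1] ... */

              kmem_cache_free_bulk(cache, allocated, objs);
              return 0;
      }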

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The initial implementation missed kmem cgroup support in the
    kmem_cache_free_bulk() call; add it.

    If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
    to not add any asm code.

    Incoming bulk free objects can belong to different kmem cgroups, and the
    object free call can happen at a later point outside the memcg context.
    Thus, we need to keep the original kmem_cache in order to correctly verify
    whether a memcg object matches its "root_cache" (s->memcg_params.root_cache).

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The slab_pre_alloc_hook() call interacts with kmemcg and must not be
    called several times inside the bulk alloc for-loop, due to the call to
    memcg_kmem_get_cache().

    This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

    As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
    to handle an array of objects.

    A subtle detail: the loop iterator "i" in slab_post_alloc_hook() must have
    the same type (size_t) as the size argument. This makes it easier for the
    compiler to see that it can remove the loop when all debug statements
    inside the loop evaluate to nothing. Note, this is only an issue because
    the kernel is compiled with the GCC option -fno-strict-overflow.

    In slab_alloc_node() the compiler inlines and optimizes the invocation of
    slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and
    accessing the object directly.
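
    A simplified sketch of the reworked hook (paraphrased; the in-tree version
    runs the kmemcheck/kmemleak/kasan hooks for each object where the comment
    sits below):

      static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
                                              size_t size, void **p)
      {
              size_t i;  /* same type as 'size'; lets GCC remove the loop
                          * when the per-object hooks compile to nothing */

              flags &= gfp_allowed_mask;
              for (i = 0; i < size; i++) {
                      void *object = p[i];

                      /* per-object debug/annotation hooks run here */
                      (void)object;
              }
              memcg_kmem_put_cache(s);  /* balances the single pre-alloc
                                         * memcg_kmem_get_cache() call */
      }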

    Signed-off-by: Jesper Dangaard Brouer
    Reported-by: Vladimir Davydov
    Suggested-by: Vladimir Davydov
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change focuses on improving the speed of object freeing in the
    "slowpath" of kmem_cache_free_bulk.

    The calls slab_free (fastpath) and __slab_free (slowpath) have been
    extended with support for bulk free, which amortizes the overhead of
    the (locked) cmpxchg_double.

    To use the new bulking feature, we build what I call a detached
    freelist. The detached freelist takes advantage of three properties:

    1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

    2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

    3) it is the visibility of the head pointer that needs synchronization.

    Given these properties, the brilliant part is that the detached freelist
    can be constructed without any need for synchronization: the freelist is
    built directly in the page objects, while the detached freelist itself is
    allocated on the stack of the kmem_cache_free_bulk call. Thus, the
    freelist head pointer is not visible to other CPUs.

    All objects in a SLUB freelist must belong to the same slab-page. Thus,
    constructing the detached freelist is about matching objects that belong
    to the same slab-page. The bulk free array is scanned in a progressive
    manner with a limited look-ahead facility.
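
    A sketch of the idea (field names here are illustrative, paraphrased from
    the description above): the chained objects live inside one slab page,
    while this small descriptor lives on the caller's stack.

      struct detached_freelist {
              struct page *page;   /* slab-page all objects belong to  */
              void *tail;          /* last object, end of the chain    */
              void *freelist;      /* first object, head of the chain  */
              int cnt;             /* number of objects in the chain   */
      };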

    Kmem debug support is handled in the call to slab_free().

    Notice that kmem_cache_free_bulk no longer needs to disable IRQs. This
    only slowed down bulk free of a single object by approximately 3 cycles.

    Performance data:
    Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

    SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

    To get stable and comparable numbers, the kernel has been booted with
    "slab_nomerge" (this also improves performance for larger bulk sizes).

    Performance data, compared against fallback bulking:

    bulk - fallback bulk - improvement with this patch
    1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns - improved 21.0%
    2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
    3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
    4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
    8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
    16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
    30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
    32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
    34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
    48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
    64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
    128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
    158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
    250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

    Performance data, compared against the current in-kernel bulking:

    bulk - curr in-kernel - improvement with this patch
    1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
    2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
    3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
    4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
    8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
    16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
    30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
    48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
    64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
    128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
    158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
    250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

    Performance with normal SLUB merging is significantly slower for larger
    bulking. This is believed to be (primarily) an effect of not having to
    share the per-CPU data structures, as tuning the per-CPU size can achieve
    similar performance.

    bulk - slab_nomerge - normal SLUB merge
    1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
    2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
    3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
    4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
    8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
    16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
    30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
    32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
    34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
    48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
    64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
    128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
    158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
    250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

    Joint work with Alexander Duyck.

    [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

    [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Make it possible to free a freelist with several objects by adjusting the
    API of slab_free() and __slab_free() to take a head, a tail and an object
    counter (cnt).

    A NULL tail indicates a single-object free of the head object. This allows
    compiler inline constant propagation in slab_free() and
    slab_free_freelist_hook() to avoid adding any overhead in the
    single-object free case.

    This allows a freelist with several objects (all within the same
    slab-page) to be freed using a single locked cmpxchg_double in
    __slab_free() and an unlocked cmpxchg_double in slab_free().
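
    A sketch of the adjusted internal signature and how a single-object free
    maps onto it (paraphrased from the description; slab_free_one() is a
    hypothetical wrapper used only for illustration):

      static __always_inline void slab_free(struct kmem_cache *s,
                                            struct page *page,
                                            void *head, void *tail,
                                            int cnt, unsigned long addr);

      /* hypothetical helper: a single-object free in terms of the new API */
      static __always_inline void slab_free_one(struct kmem_cache *s,
                                                struct page *page, void *object,
                                                unsigned long addr)
      {
              /* tail == NULL, cnt == 1: constant propagation strips the
               * multi-object handling, keeping the fastpath unchanged */
              slab_free(s, page, object, NULL, 1, addr);
      }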

    Object debugging on the free path is also extended to handle these
    freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
    objects don't belong to the same slab-page.

    These changes are needed for the next patch to bulk free the detached
    freelists it introduces and constructs.

    Micro benchmarking showed no performance reduction due to this change
    when debugging is turned off (though compiled with CONFIG_SLUB_DEBUG).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

22 Nov, 2015

1 commit

  • Merge misc fixes from Andrew Morton:
    "A bunch of fixes"

    * emailed patches from Andrew Morton :
    slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
    slub: avoid irqoff/on in bulk allocation
    slub: create new ___slab_alloc function that can be called with irqs disabled
    mm: fix up sparse warning in gfpflags_allow_blocking
    ocfs2: fix umask ignored issue
    PM/OPP: add entry in MAINTAINERS
    kernel/panic.c: turn off locks debug before releasing console lock
    kernel/signal.c: unexport sigsuspend()
    kasan: fix kmemleak false-positive in kasan_module_alloc()
    fat: fix fake_offset handling on error path
    mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes
    mm/page-writeback.c: initialize m_dirty to avoid compile warning
    various: fix pci_set_dma_mask return value checking
    mm: loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
    mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()
    tools/vm/page-types.c: support KPF_IDLE
    ncpfs: don't allow negative timeouts
    configfs: allow dynamic group creation
    MAINTAINERS: add Moritz as reviewer for FPGA Manager Framework
    slab.h: sprinkle __assume_aligned attributes

    Linus Torvalds
     

21 Nov, 2015

7 commits

  • The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated
    #else. For readability mark it with a comment.
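
    For example, the marked #else then reads roughly as:

      #ifdef CONFIG_SLUB_DEBUG
      /* ... a long stretch of debug-only code ... */
      #else /* !CONFIG_SLUB_DEBUG */
      /* ... stubs used when debugging is compiled out ... */
      #endif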

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Use the new function that can do allocation while interrupts are disabled.
    Avoids irq on/off sequences.

    Signed-off-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Bulk alloc needs a function like this (___slab_alloc, callable with irqs
    disabled) because it currently enables interrupts before calling
    __slab_alloc, which promptly disables them again using the expensive
    local_irq_save().
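
    A sketch of the split, paraphrased from the description (the in-tree
    wrapper may differ in detail, e.g. re-reading the per-CPU pointer under
    CONFIG_PREEMPT): ___slab_alloc() does the work and must be called with
    IRQs disabled, while __slab_alloc() keeps the old behaviour for regular
    callers.

      static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                                unsigned long addr, struct kmem_cache_cpu *c)
      {
              void *p;
              unsigned long flags;

              local_irq_save(flags);
              p = ___slab_alloc(s, gfpflags, node, addr, c);
              local_irq_restore(flags);
              return p;
      }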

    Signed-off-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Kmemleak reports the following leak:

    unreferenced object 0xfffffbfff41ea000 (size 20480):
    comm "modprobe", pid 65199, jiffies 4298875551 (age 542.568s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xc0
    [] __vmalloc_node_range+0x4b8/0x740
    [] kasan_module_alloc+0x72/0xc0
    [] module_alloc+0x78/0xb0
    [] module_alloc_update_bounds+0x14/0x70
    [] layout_and_allocate+0x16f4/0x3c90
    [] load_module+0x2ff/0x6690
    [] SyS_finit_module+0x136/0x170
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    kasan_module_alloc() allocates shadow memory for a module and frees it on
    module unloading. It doesn't store the pointer to the allocated shadow
    memory because it can be calculated from the shadowed address, i.e.
    kasan_mem_to_shadow(addr).

    Since kmemleak cannot find a pointer to the allocated shadow, it thinks
    that the memory leaked.

    Use kmemleak_ignore() to tell kmemleak that this is not a leak and that
    the shadow memory doesn't contain any pointers.
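
    An illustrative sketch only (not the in-tree kasan code): an allocation
    whose address is recomputed elsewhere instead of being stored would be
    flagged by kmemleak, so mark it as intentionally untracked. The helper
    name below is made up for the example.

      #include <linux/kmemleak.h>
      #include <linux/vmalloc.h>

      static void *alloc_untracked_shadow(size_t shadow_size)
      {
              void *shadow = vmalloc(shadow_size);

              if (shadow)
                      kmemleak_ignore(shadow); /* not a leak, holds no pointers */
              return shadow;
      }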

    Signed-off-by: Andrey Ryabinin
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • When building the kernel with gcc 5.2, the warning below is raised:

    mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
    mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
    unsigned long m_dirty, m_thresh, m_bg_thresh;

    The m_dirty, m_thresh and m_bg_thresh variables are initialized in the
    "if (mdtc)" block, so if mdtc is NULL they won't be initialized before
    being used. Initialize m_dirty to zero; also initialize m_thresh and
    m_bg_thresh to keep things consistent.

    They are used later in the if condition: !mdtc || m_dirty
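
    The fix then amounts to (as I read the description; the exact comment
    wording in the tree may differ):

      unsigned long m_dirty = 0;      /* silence the bogus warning */
      unsigned long m_thresh = 0;
      unsigned long m_bg_thresh = 0;
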
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
    hugepage but hugepage_madvise() takes the error path when we ask to turn
    on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
    new postcopy migration feature to fail on s390 because its first action is
    to madvise the guest address space as NOHUGEPAGE. This patch modifies the
    code so that the operation succeeds without error now.

    For consistency reasons do the same for MADV_HUGEPAGE.

    Signed-off-by: Jason J. Herne
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christian Borntraeger
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason J. Herne
     
  • Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
    allocation") missed a spot. Currently remove_vm_area() decreases vm->size
    to "remove" the guard hole page, even when it isn't present. All but one
    user just free the vm_struct right away and never access vm->size anyway.

    Don't touch the size in remove_vm_area() and have __vunmap() use the
    proper get_vm_area_size() helper.
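
    For reference, the existing helper accounts for the guard page only when
    one is actually present (a sketch; the in-tree version may differ slightly):

      static inline size_t get_vm_area_size(const struct vm_struct *area)
      {
              if (!(area->flags & VM_NO_GUARD))
                      /* return actual size without guard page */
                      return area->size - PAGE_SIZE;
              return area->size;
      }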

    Signed-off-by: Jerome Marchand
    Acked-by: Andrey Ryabinin
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

19 Nov, 2015

1 commit

  • DAX handling of COW faults has the wrong locking sequence:
      dax_fault does i_mmap_lock_read
      do_cow_fault does i_mmap_unlock_write

    Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
    commit[3].

    Original COW locking logic was introduced by Matthew here[4].

    This should be applied to v4.3 as well.

    [1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
    [2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
    [3] 843172978bb9 dax: fix race between simultaneous faults
    [4] 2e4cdab0584f mm: allow page fault handlers to perform the COW

    Cc:
    Cc: Boaz Harrosh
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Acked-by: Ross Zwisler
    Signed-off-by: Yigal Korman
    Signed-off-by: Dan Williams

    Yigal Korman
     

11 Nov, 2015

3 commits

  • Merge final patch-bomb from Andrew Morton:
    "Various leftovers, mainly Christoph's pci_dma_supported() removals"

    * emailed patches from Andrew Morton :
    pci: remove pci_dma_supported
    usbnet: remove ifdefed out call to dma_supported
    kaweth: remove ifdefed out call to dma_supported
    sfc: don't call dma_supported
    nouveau: don't call pci_dma_supported
    netup_unidvb: use pci_set_dma_mask insted of pci_dma_supported
    cx23885: use pci_set_dma_mask insted of pci_dma_supported
    cx25821: use pci_set_dma_mask insted of pci_dma_supported
    cx88: use pci_set_dma_mask insted of pci_dma_supported
    saa7134: use pci_set_dma_mask insted of pci_dma_supported
    saa7164: use pci_set_dma_mask insted of pci_dma_supported
    tw68-core: use pci_set_dma_mask insted of pci_dma_supported
    pcnet32: use pci_set_dma_mask insted of pci_dma_supported
    lib/string.c: add ULL suffix to the constant definition
    hugetlb: trivial comment fix
    selftests/mlock2: add ULL suffix to 64-bit constants
    selftests/mlock2: add missing #define _GNU_SOURCE

    Linus Torvalds
     
  • Recently alloc_buddy_huge_page() was renamed to __alloc_buddy_huge_page(),
    so let's sync comments.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • In commit a1c34a3bf00a ("mm: Don't offset memmap for flatmem") Laura
    fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM
    IFC6410 board.

    One small wrinkle on ia64 is that it allocates the node_mem_map earlier
    in arch code, so it skips the block of code where "offset" is
    initialized.

    Move initialization of start and offset before the check for the
    node_mem_map so that they will always be available in the latter part of
    the function.

    Tested-by: Laura Abbott
    Fixes: a1c34a3bf00a (mm: Don't offset memmap for flatmem)
    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Tony Luck
     

08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "Trivial stuff from trivial tree that can be trivially summed up as:

    - treewide drop of spurious unlikely() before IS_ERR() from Viresh
    Kumar

    - cosmetic fixes (that don't really affect basic functionality of the
    driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

    - various comment / printk fixes and updates all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    bcache: Really show state of work pending bit
    hwmon: applesmc: fix comment typos
    Kconfig: remove comment about scsi_wait_scan module
    class_find_device: fix reference to argument "match"
    debugfs: document that debugfs_remove*() accepts NULL and error values
    net: Drop unlikely before IS_ERR(_OR_NULL)
    mm: Drop unlikely before IS_ERR(_OR_NULL)
    fs: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
    UBI: Update comments to reflect UBI_METAONLY flag
    pktcdvd: drop null test before destroy functions

    Linus Torvalds
     

07 Nov, 2015

20 commits

  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that the compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head in the third double-word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail(),
    and if bit 0 is set the remaining bits are a pointer to the head page.
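
    With this encoding, compound_head() becomes a single load plus a bit test,
    roughly:

      static inline struct page *compound_head(struct page *page)
      {
              unsigned long head = READ_ONCE(page->compound_head);

              if (head & 1)
                      return (struct page *)(head - 1);
              return page;
      }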

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's out of
    scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() is not allowed, as it makes use of that bit and we could
    get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and compound_order
    in struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for destructor
    lookup and store an index into it in struct page instead of a direct
    pointer to the destructor. It shouldn't be much trouble to maintain the
    table: we only have two destructors plus NULL at present.
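
    Roughly how the hardcoded table works (an illustrative sketch; the exact
    in-tree identifiers may differ): struct page stores a small enum index,
    and the destructor is looked up in a fixed array.

      enum compound_dtor_id {
              NULL_COMPOUND_DTOR,
              COMPOUND_PAGE_DTOR,
              HUGETLB_PAGE_DTOR,
              NR_COMPOUND_DTORS,
      };

      typedef void compound_page_dtor(struct page *);

      static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
              [NULL_COMPOUND_DTOR]  = NULL,
              [COMPOUND_PAGE_DTOR]  = free_compound_page,
              [HUGETLB_PAGE_DTOR]   = free_huge_page,
      };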

    This patch frees up one word in tail pages for reuse. This is preparation
    for the next patch.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to rework how compound_head() works. It will no longer use
    page->first_page as it does now.

    The only other user of page->first_page beyond compound pages is
    zsmalloc.

    Let's use page->private instead of page->first_page here. It occupies
    the same storage space.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Reviewed-by: Sergey Senozhatsky
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have a properly typed page->rcu_head; there is no need to cast page->lru.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Each `struct size_class' contains `struct zs_size_stat': an array of
    NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no
    CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned
    long)' per-class.

    The patch removes unneeded `struct zs_size_stat' members by redefining
    NR_ZS_STAT_TYPE (max stat idx in array).

    Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants,
    GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type
    larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at
    the moment.

    ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
    function old new delta
    fix_fullness_group 97 94 -3
    insert_zspage 100 86 -14
    remove_zspage 141 119 -22

    To summarize:
    a) each class now uses less memory
    b) we avoid a number of dec/inc stats (a minor optimization,
    but still).

    The gain will increase once we introduce additional stats.

    A simple IO test.

    iozone -t 4 -R -r 32K -s 60M -I +Z
    patched base
    " Initial write " 4145599.06 4127509.75
    " Rewrite " 4146225.94 4223618.50
    " Read " 17157606.00 17211329.50
    " Re-read " 17380428.00 17267650.50
    " Reverse Read " 16742768.00 16162732.75
    " Stride read " 16586245.75 16073934.25
    " Random read " 16349587.50 15799401.75
    " Mixed workload " 10344230.62 9775551.50
    " Random write " 4277700.62 4260019.69
    " Pwrite " 4302049.12 4313703.88
    " Pread " 6164463.16 6126536.72
    " Fwrite " 7131195.00 6952586.00
    " Fread " 12682602.25 12619207.50

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Hui Zhu
    Reviewed-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • We don't let the user disable the shrinker in zsmalloc (once it's been
    enabled), so there is no need to check ->shrinker_enabled in
    zs_shrinker_count(), at the moment at least.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • A cosmetic change.

    Commit c60369f01125 ("staging: zsmalloc: prevent mappping in interrupt
    context") added an in_interrupt() check to zs_map_object() and a
    'hardirq.h' include; but the in_interrupt() macro is defined in
    'preempt.h', not in 'hardirq.h', so include that instead.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • In obj_malloc():

      if (!class->huge)
              /* record handle in the header of allocated chunk */
              link->handle = handle;
      else
              /* record handle in first_page->private */
              set_page_private(first_page, handle);

    So for a huge-class page we save the handle to first_page->private
    directly.

    But in obj_to_head():

      if (class->huge) {
              VM_BUG_ON(!is_first_page(page));
              return *(unsigned long *)page_private(page);
      } else
              return *(unsigned long *)obj;

    the stored value is dereferenced as if it were a pointer.

    The reason there has been no problem until now is that a huge-class page
    is born with ZS_FULL, so it can't be migrated. However, we need this patch
    for future work, "VM-aware zsmalloced page migration", to reduce external
    fragmentation.

    Signed-off-by: Hui Zhu
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • [akpm@linux-foundation.org: fix grammar]
    Signed-off-by: Hui Zhu
    Reviewed-by: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Constify `struct zs_pool' ->name.

    [akpm@linux-foundation.org: constify zpool_create_pool()'s `type' arg also]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey SENOZHATSKY
     
  • Make the return type of zpool_get_type const; the string belongs to the
    zpool driver and should not be modified. Remove the redundant type field
    in the struct zpool; it is private to zpool.c and isn't needed since
    ->driver->type can be used directly. Add comments indicating strings must
    be null-terminated.
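
    With the redundant copy gone, zpool_get_type() can simply hand back the
    driver's string (a sketch, assuming the zpool/driver layout described):

      const char *zpool_get_type(struct zpool *zpool)
      {
              return zpool->driver->type;
      }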

    Signed-off-by: Dan Streetman
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Instead of using a fixed-length string for the zswap params, use charp.
    This simplifies the code and uses less memory, as most zswap param strings
    will be less than the current maximum length.
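
    A sketch of the charp form (the parameter names, defaults and permissions
    shown here are illustrative, not necessarily the tree's values):

      static char *zswap_compressor = "lzo";
      module_param_named(compressor, zswap_compressor, charp, 0444);

      static char *zswap_zpool_type = "zbud";
      module_param_named(zpool, zswap_zpool_type, charp, 0444);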

    Signed-off-by: Dan Streetman
    Cc: Rusty Russell
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • On the next line the entry variable will be re-initialized, so there is no
    need to initialize it with NULL.

    Signed-off-by: Alexey Klimov
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • gcc version 5.2.1 20151010 (Debian 5.2.1-22)
    $ size mm/memcontrol.o mm/memcontrol.o.before
    text data bss dec hex filename
    35535 7908 64 43507 a9f3 mm/memcontrol.o
    35762 7908 64 43734 aad6 mm/memcontrol.o.before

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The "vma" parameter to khugepaged_alloc_page() is unused. It has to
    remain unused or the drop read lock 'map_sem' optimisation introduce by
    commit 8b1645685acf ("mm, THP: don't hold mmap_sem in khugepaged when
    allocating THP") wouldn't be safe. So let's remove it.

    Signed-off-by: Aaron Tomlin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask which would be used for allocations that are not
    directly related to the page cache but are performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
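
    A minimal sketch of such a helper (I believe the in-tree name is
    mapping_gfp_constraint(), but treat the exact name and placement as an
    assumption here):

      static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                                 gfp_t gfp_mask)
      {
              return mapping_gfp_mask(mapping) & gfp_mask;
      }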

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Andrew stated the following

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged, but the clarity might
    motivate some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The primary purpose of watermarks is to ensure that reclaim can always
    make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
    These assume that order-0 allocations are all that is necessary for
    forward progress.

    High-order watermarks serve a different purpose. Kswapd had no high-order
    awareness before they were introduced
    (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
    particularly important when there were high-order atomic requests. The
    watermarks both gave kswapd awareness and made a reserve for those atomic
    requests.

    There are two important side-effects of this. The most important is that
    a non-atomic high-order request can fail even though free pages are
    available and the order-0 watermarks are ok. The second is that
    high-order watermark checks are expensive, as the free list counts for
    all orders up to the requested order must be examined.

    With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
    have high-order watermarks. Kswapd and compaction still need high-order
    awareness which is handled by checking that at least one suitable
    high-order page is free.
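
    A conceptual sketch of that check (hypothetical helper name; the real code
    also walks the per-migratetype free lists): once the order-0 watermark is
    met, it is enough that some free page of a suitable order exists.

      static bool zone_has_free_highorder_page(struct zone *z, unsigned int order)
      {
              unsigned int o;

              for (o = order; o < MAX_ORDER; o++)
                      if (z->free_area[o].nr_free)
                              return true;
              return false;
      }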

    With the patch applied, there was little difference in the allocation
    failure rates as the atomic reserves are small relative to the number of
    allocation attempts. The expected impact is that there will never be an
    allocation failure report that shows suitable pages on the free lists.

    The one potential side-effect of this is that in a vanilla kernel, the
    watermark checks may have kept a free page for an atomic allocation. Now,
    we are 100% relying on the HighAtomic reserves and an early allocation to
    have allocated them. If the first high-order atomic allocation is after
    the system is already heavily fragmented then it'll fail.

    [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman