28 May, 2016

10 commits

  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid() (which is __meminit).
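
    For reference, the fix is just the section annotation on the function,
    roughly like this (a sketch; the definition lives in mm/memory_hotplug.c):

        void __init register_page_bootmem_info_node(struct pglist_data *pgdat);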

    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • When we have !NO_BOOTMEM, the deferred page struct initialization
    doesn't work well because the pages reserved in bootmem are released to
    the page allocator unconditionally. It causes memory corruption and
    system crash eventually.

    As Mel suggested, bootmem is slowly being retired. We fix the issue by
    simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

    Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Move the comments for get_mctgt_type() to be before get_mctgt_type()
    implementation.

    Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • mem_cgroup_margin() might return (memory.limit - memory_count) when the
    memsw.limit is in excess. This doesn't happen usually because we do not
    allow excess on hard limits and (memory.limit <= memsw.limit), but
    __GFP_NOFAIL charges can force the charge and cause the excess when no
    memory is really swappable (swap is full or no anonymous memory is left).
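
    A minimal sketch of the shape of the fix, assuming the existing locals of
    mem_cgroup_margin() in mm/memcontrol.c (count, limit, margin); not the
    verbatim patch:

        if (do_memsw_account()) {
                count = page_counter_read(&memcg->memsw);
                limit = READ_ONCE(memcg->memsw.limit);
                if (count <= limit)
                        margin = min(margin, limit - count);
                else
                        margin = 0;     /* memsw already in excess */
        }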
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • pageblock_order can be (at least) an unsigned int or an unsigned long
    depending on the kernel config and architecture, so use max_t(unsigned
    long, ...) when comparing it.

    This fixes these warnings:

    In file included from include/asm-generic/bug.h:13:0,
    from arch/powerpc/include/asm/bug.h:127,
    from include/linux/bug.h:4,
    from include/linux/mmdebug.h:4,
    from include/linux/mm.h:8,
    from include/linux/memblock.h:18,
    from mm/cma.c:28:
    mm/cma.c: In function 'cma_init_reserved_mem':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
      (void) (&_max1 == &_max2);
                     ^
    mm/cma.c:186:27: note: in expansion of macro 'max'
      alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
                               ^
    mm/cma.c: In function 'cma_declare_contiguous':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
      (void) (&_max1 == &_max2);
                     ^
    include/linux/kernel.h:747:9: note: in definition of macro 'max'
      typeof(y) _max2 = (y);
              ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
      (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                ^
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
      (void) (&_max1 == &_max2);
                     ^
    include/linux/kernel.h:747:21: note: in definition of macro 'max'
      typeof(y) _max2 = (y);
                        ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
      (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                ^
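
    The change amounts to using the type-explicit helper at the two call
    sites in mm/cma.c, roughly like this (a sketch, not the verbatim hunk):

        alignment = PAGE_SIZE << max_t(unsigned long, MAX_ORDER - 1,
                                       pageblock_order);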

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • If page_move_anon_rmap() is refiling a pmd-split THP mapped in a tail
    page from a pte, the "address" must be THP aligned in order for the
    page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds.

    Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Mika Westerberg
    Tested-by: Mika Westerberg
    Reviewed-by: Andrea Arcangeli
    Cc: stable [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tetsuo has reported:
    Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
    Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
    sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
    sh cpuset=/ mems_allowed=0
    CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    dump_stack+0x85/0xc8
    dump_header+0x5b/0x394
    oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    In other words:

    __oom_reap_task                 exit_mm
      atomic_inc_not_zero
                                      tsk->mm = NULL
                                      mmput
                                        atomic_dec_and_test # > 0
                                      exit_oom_victim # New victim will be
                                                      # selected
                                    <OOM killer invoked>
                                      # no TIF_MEMDIE task so we can select a new one
      unmap_page_range # to release the memory

    The race exists even without the oom_reaper because anybody who pins the
    address space and gets preempted might race with exit_mm but oom_reaper
    made this race more probable.

    We can address the oom_reaper part by using oom_lock for __oom_reap_task
    because this would guarantee that a new oom victim will not be selected
    if the oom reaper might race with the exit path. This doesn't solve the
    original issue, though, because somebody else still might be pinning
    mm_users and so __mmput won't be called to release the memory but that
    is not really reliably solvable because the task will get away from the
    oom sight as soon as it is unhashed from the task_list and so we cannot
    guarantee a new victim won't be selected.
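
    A minimal sketch of the oom_reaper side, assuming the existing oom_lock
    mutex in mm/oom_kill.c (illustrative, not the verbatim patch):

        static bool __oom_reap_task(struct task_struct *tsk)
        {
                bool ret = true;

                /*
                 * Hold oom_lock so that out_of_memory() cannot select a new
                 * victim while the reaper races with the exit path.
                 */
                mutex_lock(&oom_lock);
                /* ... pin mm_users and unmap_page_range() the victim ... */
                mutex_unlock(&oom_lock);

                return ret;
        }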

    [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • register_page_bootmem_info_node() is invoked in mem_init(), so it will
    be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
    enabled. But, pfn_to_nid() depends on memmap which won't be fully setup
    until page_alloc_init_late() is done, so replace pfn_to_nid() by
    early_pfn_to_nid().
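
    The change is essentially of this shape in register_page_bootmem_info_node()
    (a sketch; the exact context is in mm/memory_hotplug.c):

        if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == nid))
                register_page_bootmem_info_section(pfn);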

    Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • page_ext_init() checks suitable pages with pfn_to_nid(), but
    pfn_to_nid() depends on memmap which will not be setup fully until
    page_alloc_init_late() is done. Use early_pfn_to_nid() instead of
    pfn_to_nid() so that page extension can still be used early even when
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, and can still catch early
    page allocation call sites.

    Suggested by Joonsoo Kim [1], this fix basically undoes the change
    introduced by commit b8f1a75d61d840 ("mm: call page_ext_init() after all
    struct pages are initialized") and fixes the same problem with a better
    approach.

    [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

    Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • If the current process is exiting, we don't invoke oom killer, instead
    we give it access to memory reserves and try to reap its mm in case
    nobody is going to use it. There's a mistake in the code performing
    this check - we just ignore any process of the same thread group no
    matter if it is exiting or not - see try_oom_reaper. Fix it.

    Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

27 May, 2016

6 commits

  • Merge fixes from Andrew Morton:
    "10 fixes"

    * emailed patches from Andrew Morton :
    drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
    update "mm/zsmalloc: don't fail if can't create debugfs info"
    dma-debug: avoid spinlock recursion when disabling dma-debug
    mm: oom_reaper: remove some bloat
    memcg: fix mem_cgroup_out_of_memory() return value.
    ocfs2: fix improper handling of return errno
    mm: slub: remove unused virt_to_obj()
    mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
    mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
    seqlock: fix raw_read_seqcount_latch()

    Linus Torvalds
     
  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Some updates to commit d34f615720d1 ("mm/zsmalloc: don't fail if can't
    create debugfs info"):

    - add pr_warn to all stat failure cases
    - do not prevent module loading on stat failure

    Link: http://lkml.kernel.org/r/1463671123-5479-1-git-send-email-ddstreet@ieee.org
    Signed-off-by: Dan Streetman
    Reviewed-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • mem_cgroup_out_of_memory() returns "true" if it finds a TIF_MEMDIE task
    after an eligible task has been found, and "false" if it finds a
    TIF_MEMDIE task before an eligible task is found.

    This difference confuses memory_max_write() which checks the return
    value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to
    continue looping, mem_cgroup_out_of_memory() should return "true" in
    this case.

    This patch sets a dummy pointer in order to return "true".

    Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable
    stackdepot for SLAB") added 'reserved' field, but never used it.

    Link: http://lkml.kernel.org/r/1464021054-2307-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
    requires some ordering wrt other initialization operations, e.g.
    page_ext_init has to happen after the whole memmap is initialized
    properly.

    For SPARSEMEM this requires waiting for page_alloc_init_late(). Other
    memory models (e.g. flatmem) might have different initialization
    layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT
    depends on MEMORY_HOTPLUG, which in turn
        depends on SPARSEMEM || X86_64_ACPI_NUMA
        depends on ARCH_ENABLE_MEMORY_HOTPLUG

    and X86_64_ACPI_NUMA depends on NUMA, which in turn disables the FLATMEM
    memory model:

    config ARCH_FLATMEM_ENABLE
            def_bool y
            depends on X86_32 && !NUMA

    so FLATMEM is ruled out via the dependency maze. Be explicit and disable
    FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce
    subtle initialization bugs.

    [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz

    Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

24 May, 2016

8 commits

  • Merge yet more updates from Andrew Morton:

    - Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a
    kernel-based workaround for existing userspace issues.

    - A few hotfixes

    - befs cleanups

    - nilfs2 updates

    - sys_wait() changes

    - kexec updates

    - kdump

    - scripts/gdb updates

    - the last of the MM queue

    - a few other misc things

    * emailed patches from Andrew Morton : (84 commits)
    kgdb: depends on VT
    drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
    drm/radeon: make radeon_mn_get wait for mmap_sem killable
    drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
    uprobes: wait for mmap_sem for write killable
    prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
    exec: make exec path waiting for mmap_sem killable
    aio: make aio_setup_ring killable
    coredump: make coredump_wait wait for mmap_sem for write killable
    vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
    ipc, shm: make shmem attach/detach wait for mmap_sem killable
    mm, fork: make dup_mmap wait for mmap_sem for write killable
    mm, proc: make clear_refs killable
    mm: make vm_brk killable
    mm, elf: handle vm_brk error
    mm, aout: handle vm_brk failures
    mm: make vm_munmap killable
    mm: make vm_mmap killable
    mm: make mmap_sem for write waits killable for mm syscalls
    MAINTAINERS: add co-maintainer for scripts/gdb
    ...

    Linus Torvalds
     
  • Now that all the callers handle vm_brk failure we can change it to wait
    for mmap_sem in a killable fashion, to help oom_reaper not get blocked
    just because vm_brk gets blocked behind mmap_sem readers.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Almost all current users of vm_munmap are ignoring the return value and
    so they do not handle potential error. This means that some VMAs might
    stay behind. This patch doesn't try to solve those potential problems.
    Quite the contrary, it adds a new failure mode by using down_write_killable
    in vm_munmap. This should be safer than other failure modes, though,
    because the process is guaranteed to die as soon as it leaves the kernel
    and exit_mmap will clean the whole address space.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.
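
    A sketch of what the killable variant ends up looking like (close to, but
    not necessarily, the exact upstream code):

        int vm_munmap(unsigned long start, size_t len)
        {
                int ret;
                struct mm_struct *mm = current->mm;

                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;

                ret = do_munmap(mm, start, len);
                up_write(&mm->mmap_sem);
                return ret;
        }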

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Alexander Viro
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • All the callers of vm_mmap seem to check for failure already and bail
    out in one way or another on error, which means that we can change it
    to use the killable version of vm_mmap_pgoff and return -EINTR if the
    current task gets killed while waiting for mmap_sem. This also means
    that vm_mmap_pgoff can be killable by default and the additional
    parameter can be dropped.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Please note that load_elf_binary is ignoring vm_mmap error for
    current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
    problem because the address is not used anywhere and we never return to
    the userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on mmap_sem for read, we would really appreciate it if a holder
    for write didn't stand in the way. This patchset changes many of
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as the
    task has a fatal signal pending). Others seem easy as well because the
    callers are already handling fatal errors and bail out and return to
    userspace, which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and depends on down_write_killable
    for rw_semaphores, which was merged into the tip locking/rwsem branch
    and is included in the current linux-next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(&mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.
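
    The trivial conversions in the series follow this pattern right at syscall
    entry (an illustrative sketch, not any specific hunk):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;
        /* ... modify the address space ... */
        up_write(&mm->mmap_sem);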

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_oom may be invoked multiple times while a process is handling
    a page fault, in which case current->memcg_in_oom will be overwritten
    leaking the previously taken css reference.

    Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull drm updates from Dave Airlie:
    "Here's the main drm pull request for 4.7, it's been a busy one, and
    I've been a bit more distracted in real life this merge window. Lots
    more ARM drivers, not sure if it'll ever end. I think I've at least
    one more coming the next merge window.

    But changes are all over the place, support for AMD Polaris GPUs is in
    here, some missing GM108 support for nouveau (found in some Lenovos),
    a bunch of MST and skylake fixes.

    I've also noticed a few fixes from Arnd in my inbox, that I'll try and
    get in asap, but I didn't think they should hold this up.

    New drivers:
    - Hisilicon kirin display driver
    - Mediatek MT8173 display driver
    - ARC PGU - bitstreamer on Synopsys ARC SDP boards
    - Allwinner A13 initial RGB output driver
    - Analogix driver for DisplayPort IP found in exynos and rockchip

    DRM Core:
    - UAPI headers fixes and C++ safety
    - DRM connector reference counting
    - DisplayID mode parsing for Dell 5K monitors
    - Removal of struct_mutex from drivers
    - Connector registration cleanups
    - MST robustness fixes
    - MAINTAINERS updates
    - Lockless GEM object freeing
    - Generic fbdev deferred IO support

    panel:
    - Support for a bunch of new panels

    i915:
    - VBT refactoring
    - PLL computation cleanups
    - DSI support for BXT
    - Color manager support
    - More atomic patches
    - GEM improvements
    - GuC fw loading fixes
    - DP detection fixes
    - SKL GPU hang fixes
    - Lots of BXT fixes

    radeon/amdgpu:
    - Initial Polaris support
    - GPUVM/Scheduler/Clock/Power improvements
    - ASYNC pageflip support
    - New mesa feature support

    nouveau:
    - GM108 support
    - Power sensor support improvements
    - GR init + ucode fixes.
    - Use GPU provided topology information

    vmwgfx:
    - Add host messaging support

    gma500:
    - Some cleanups and fixes

    atmel:
    - Bridge support
    - Async atomic commit support

    fsl-dcu:
    - Timing controller for LCD support
    - Pixel clock polarity support

    rcar-du:
    - Misc fixes

    exynos:
    - Pipeline clock support
    - Exynos 5433 SoC support
    - HW trigger mode support
    - export HDMI_PHY clock
    - DECON5433 fixes
    - Use generic prime functions
    - use DMA mapping APIs

    rockchip:
    - Lots of little fixes

    vc4:
    - Render node support
    - Gamma ramp support
    - DPI output support

    msm:
    - Mostly cleanups and fixes
    - Conversion to generic struct fence

    etnaviv:
    - Fix for prime buffer handling
    - Allow hangcheck to be coalesced with other wakeups

    tegra:
    - Gamma table size fix"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits)
    drm/edid: add displayid detailed 1 timings to the modelist. (v1.1)
    drm/edid: move displayid validation to it's own function.
    drm/displayid: Iterate over all DisplayID blocks
    drm/edid: move displayid tiled block parsing into separate function.
    drm: Nuke ->vblank_disable_allowed
    drm/vmwgfx: Report vmwgfx version to vmware.log
    drm/vmwgfx: Add VMWare host messaging capability
    drm/vmwgfx: Kill some lockdep warnings
    drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode
    drm/nouveau/core: recognise GM108 chipsets
    drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup
    drm/nouveau/gr/gk104-: share implementation of ppc exception init
    drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx
    drm/nouveau/bios/pll: check BIT table version before trying to parse it
    drm/nouveau/bios/pll: prevent oops when limits table can't be parsed
    drm/nouveau/volt/gk104: round up in gk104_volt_set
    drm/nouveau/fb/gm200: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gf100-: allocate mmu debug buffers
    drm/nouveau/fb: allow chipset-specific actions for oneinit()
    ...

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this update was stabilized before the merge window and
    appeared in -next. The "device dax" implementation was revised this
    week in response to review feedback, and to address failures detected
    by the recently expanded ndctl unit test suite.

    Not included in this pull request are two dax topic branches (dax
    error handling, and dax radix-tree locking). These topics were
    deferred to get a few more days of -next integration testing, and to
    coordinate a branch baseline with Ted and the ext4 tree. Vishal and
    Ross will send the error handling and locking topics respectively in
    the next few days.

    This branch has received a positive build result from the kbuild robot
    across 226 configs.

    Summary:

    - Device DAX for persistent memory: Device DAX is the device-centric
    analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
    ranges to be allocated and mapped without need of an intervening
    file system. Device DAX is strict, precise and predictable.
    Specifically this interface:

    a) Guarantees fault granularity with respect to a given page size
    (pte, pmd, or pud) set at configuration time.

    b) Enforces deterministic behavior by being strict about what
    fault scenarios are supported.

    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature
    differentiated memory ranges.

    - Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.

    - Further ACPI 6.1 compliance with support for the common dimm
    identifier format.

    - Various fixes and cleanups across the subsystem"

    * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
    libnvdimm, dax: fix deletion
    libnvdimm, dax: fix alignment validation
    libnvdimm, dax: autodetect support
    libnvdimm: release ida resources
    Revert "block: enable dax for raw block devices"
    /dev/dax, core: file operations and dax-mmap
    /dev/dax, pmem: direct access to persistent memory
    libnvdimm: stop requiring a driver ->remove() method
    libnvdimm, dax: record the specified alignment of a dax-device instance
    libnvdimm, dax: reserve space to store labels for device-dax
    libnvdimm, dax: introduce device-dax infrastructure
    nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
    tools/testing/nvdimm: ND_CMD_CALL support
    nfit: disable vendor specific commands
    nfit: export subsystem ids as attributes
    nfit: fix format interface code byte order per ACPI6.1
    nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
    nfit, libnvdimm: clarify "commands" vs "_DSMs"
    libnvdimm: increase max envelope size for ioctl
    acpi/nfit: Add sysfs "id" for NVDIMM ID
    ...

    Linus Torvalds
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly the special cases for the constant size, and it's actually almost
    never relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2016

15 commits

  • The "Device DAX" core enables dax mappings of performance / feature
    differentiated memory. An open mapping or file handle keeps the backing
    struct device live, but new mappings are only possible while the device
    is enabled. Faults are handled under rcu_read_lock to synchronize
    with the enabled state of the device.

    Similar to the filesystem-dax case the backing memory may optionally
    have struct page entries. However, unlike fs-dax there is no support
    for private mappings, or mappings that are not backed by media (see
    use of zero-page in fs-dax).

    Mappings are always guaranteed to match the alignment of the dax_region.
    If the dax_region is configured to have a 2MB alignment, all mappings
    are guaranteed to be backed by a pmd entry. Contrast this determinism
    with the fs-dax case where pmd mappings are opportunistic. If userspace
    attempts to force a misaligned mapping, the driver will fail the mmap
    attempt. See dax_dev_check_vma() for other scenarios that are rejected,
    like MAP_PRIVATE mappings.

    Cc: Hannes Reinecke
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Ross Zwisler
    Acked-by: "Paul E. McKenney"
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In addition to replacing the entry, we also clear all associated tags.
    This is really a one-off special for page_cache_tree_delete() which had
    far too much detailed knowledge about how the radix tree works.

    For efficiency, factor node_tag_clear() out of radix_tree_tag_clear().
    It can be used by radix_tree_delete_item() as well as
    radix_tree_replace_clear_tags().

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • I've been receiving increasingly concerned notes from 0day about how
    much my recent changes have been bloating the radix tree. Make it
    happier by only including multiorder support if
    CONFIG_TRANSPARENT_HUGEPAGE is set.

    This is an independent Kconfig option, so other radix tree users can
    also set it if they have a need.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Change the return type of zs_pool_stat_create() to void, and remove the
    logic to abort pool creation if the stat debugfs dir/file could not be
    created.

    The debugfs stat file is for debugging/information only, and doesn't
    affect operation of zsmalloc; there is no reason to abort creating the
    pool if the stat file can't be created. This was seen with zswap, which
    used the same name for all pool creations, which caused zsmalloc to fail
    to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled.
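
    The result is a warn-and-continue pattern, roughly (a sketch; argument
    order and field names are illustrative, not necessarily the upstream ones):

        static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
        {
                struct dentry *entry;

                if (!zs_stat_root)
                        return;

                entry = debugfs_create_dir(name, zs_stat_root);
                if (!entry) {
                        pr_warn("debugfs dir <%s> creation failed\n", name);
                        return;         /* stats are optional, keep the pool */
                }
                pool->stat_dentry = entry;
        }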

    Signed-off-by: Dan Streetman
    Reviewed-by: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to
    use the workqueue instead of using call_rcu().

    When zswap destroys a pool no longer in use, it uses call_rcu() to
    perform the destruction/freeing. Since that executes in softirq
    context, it must not sleep. However, actually destroying the pool
    involves freeing the per-cpu compressors (which requires locking the
    cpu_add_remove_lock mutex) and freeing the zpool, for which the
    implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy, which
    locks the slab_mutex). So if either mutex is currently taken, or any
    other part of the compressor or zpool implementation sleeps, it will
    result in a BUG().

    It's not easy to reproduce this when changing zswap's params normally.
    In testing with a loaded system, this does not fail:

    $ cd /sys/module/zswap/parameters
    $ echo lz4 > compressor ; echo zsmalloc > zpool

    nor does this:

    $ while true ; do
    > echo lzo > compressor ; echo zbud > zpool
    > sleep 1
    > echo lz4 > compressor ; echo zsmalloc > zpool
    > sleep 1
    > done

    although it's still possible either of those might fail, depending on
    whether anything else besides zswap has locked the mutexes.

    However, changing a parameter with no delay immediately causes the
    "scheduling while atomic" BUG:

    $ while true ; do
    > echo lzo > compressor ; echo lz4 > compressor
    > done

    This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
    but moved to zswap, to cover compressor and zpool freeing.
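
    A condensed sketch of the mechanism (names follow mm/zswap.c, but this is
    not the verbatim patch):

        static void __zswap_pool_release(struct work_struct *work)
        {
                struct zswap_pool *pool = container_of(work, typeof(*pool), work);

                /* process context now, so sleeping in the destroy path is fine */
                synchronize_rcu();
                zswap_pool_destroy(pool);
        }

        static void __zswap_pool_empty(struct kref *kref)
        {
                struct zswap_pool *pool = container_of(kref, typeof(*pool), kref);

                spin_lock(&zswap_pools_lock);
                list_del_rcu(&pool->list);
                INIT_WORK(&pool->work, __zswap_pool_release);
                schedule_work(&pool->work);     /* was call_rcu() */
                spin_unlock(&zswap_pools_lock);
        }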

    Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
    Signed-off-by: Dan Streetman
    Reported-by: Yu Zhao
    Reviewed-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
    zs_create_pool(), so we can be more flexible, but, more importantly, we
    need this to switch zram to per-cpu compression streams -- zram will try
    to allocate handle with preemption disabled in a fast path and switch to
    a slow path (using different gfp mask) if the fast one has failed.

    Apart from that, this also aligns the zs_malloc() interface with zpool/zbud.
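
    With the mask passed per call, a caller such as zram can try a
    non-sleeping fast path first (hypothetical caller; flags chosen for
    illustration only):

        handle = zs_malloc(pool, size, GFP_NOWAIT | __GFP_HIGHMEM);
        if (!handle) {
                /* slow path: allow the allocation to sleep and retry */
                handle = zs_malloc(pool, size, GFP_NOIO | __GFP_HIGHMEM);
        }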

    [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
    Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Let's remove the unused pool param in obj_free().

    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Clean up function parameter ordering so that the higher-level data
    structure comes first.

    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are many BUG_ONs in zsmalloc.c, which is not recommended, so
    replace them with alternatives.

    The normal rules are as follows:

    1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE

    2. use VM_BUG_ON_PAGE if we need to see struct page's fields

    3. use those assertion in primitive functions so higher functions can
    rely on the assertion in the primitive function.

    4. Don't use assertion if following instruction can trigger Oops

    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Clean up function parameter "struct page". Many functions of zsmalloc
    expect that page paramter is "first_page" so use "first_page" rather
    than "page" for code readability.

    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Memory accesses coded in assembly won't be seen by KASAN, as the
    compiler can instrument only C code. Add a kasan_check_[read,write]()
    API which is going to be used to check a certain memory range.
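
    The API amounts to a pair of declarations along these lines (sketch):

        void kasan_check_read(const void *p, unsigned int size);
        void kasan_check_write(const void *p, unsigned int size);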

    Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • When a bogus memory access happens in mem[set,cpy,move]() it's usually
    the caller's fault. So don't blame mem[set,cpy,move]() in the bug
    report; blame the caller instead.

    Before:
    BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address>

    After:
    BUG: KASAN: out-of-bounds access in <caller> at <address>

    Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Instead of calling kasan_krealloc(), which replaces the memory
    allocation stack ID (if stack depot is used), just unpoison the whole
    memory chunk.

    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and old
    allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger ASan warning.

    Without the quarantine, it's not guaranteed that the objects aren't
    reused immediately; that's why the probability of catching a
    use-after-free is lower than with quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled in SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.
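
    A conceptual sketch of the flow described above (the helper names below
    are hypothetical shorthand, not the upstream functions):

        /* free path: poison the object and park it instead of freeing it */
        void quarantine_object(struct kmem_cache *cache, void *object)
        {
                set_state(object, KASAN_STATE_QUARANTINE);   /* hypothetical */
                poison_shadow(object, cache->object_size);   /* hypothetical */
                enqueue(&cpu_quarantine, object);            /* hypothetical */
        }

        /* allocation path: trim the queue back under the watermark */
        void quarantine_reduce(void)
        {
                while (quarantine_bytes > 3 * quarantine_max_bytes / 4) {
                        void *object = dequeue_oldest(&global_quarantine);

                        set_state(object, KASAN_STATE_FREE);
                        real_free(object);   /* finally return it to the allocator */
                }
        }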

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • If page migration fails due to -ENOMEM, nr_failed should still be
    incremented for proper statistics.

    This was encountered recently when all page migration vmstats showed 0,
    and inferred that migrate_pages() was never called, although in reality
    the first page migration failed because compaction_alloc() failed to
    find a migration target.

    This patch increments nr_failed so the vmstat is properly accounted on
    ENOMEM.
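
    The change is essentially one line in the result handling of
    migrate_pages(), roughly:

        case -ENOMEM:
                nr_failed++;    /* previously we bailed out without counting */
                goto out;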

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes