04 Oct, 2017

40 commits

  • The allmodconfig build of m32r is failing with the error:

    lib/mpi/mpih-div.o: In function 'mpihelp_divrem':
    mpih-div.c:(.text+0x40): undefined reference to 'abort'
    mpih-div.c:(.text+0x40): relocation truncated to fit:
    R_M32R_26_PCREL_RELA against undefined symbol 'abort'

    The function 'abort' was never defined for the m32r architecture.

    Create 'abort' as is done in other arches such as 'arm' and 'unicore32'.
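
    A minimal sketch of such an abort(), modeled on the arm version (the
    exact m32r body may differ):

      void abort(void)
      {
              BUG();

              /* if that doesn't kill us, halt */
              panic("Oops failed to kill thread");
      }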

    Link: http://lkml.kernel.org/r/1506727220-6108-1-git-send-email-sudip.mukherjee@codethink.co.uk
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     
  • printk_ratelimit() invokes ___ratelimit() which may invoke a normal
    printk() (pr_warn() in this particular case) to warn about suppressed
    output. Given that printk_ratelimit() may be called from anywhere, that
    pr_warn() is dangerous - it may end up deadlocking the system. Fix
    ___ratelimit() by using deferred printk().
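
    A sketch of the idea: where ___ratelimit() warned about suppressed
    callbacks via pr_warn(), use the deferred variant so the console
    drivers (and their locks) are not entered from the caller's context.
    The surrounding code is illustrative:

      /* lib/ratelimit.c, sketch */
      if (rs->missed) {
              printk_deferred(KERN_WARNING
                              "%s: %d callbacks suppressed\n",
                              func, rs->missed);
              rs->missed = 0;
      }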

    Sasha reported the following lockdep error:

    : Unregister pv shared memory for cpu 8
    : select_fallback_rq: 3 callbacks suppressed
    : process 8583 (trinity-c78) no longer affine to cpu8
    :
    : ======================================================
    : WARNING: possible circular locking dependency detected
    : 4.14.0-rc2-next-20170927+ #252 Not tainted
    : ------------------------------------------------------
    : migration/8/62 is trying to acquire lock:
    : (&port_lock_key){-.-.}, at: serial8250_console_write()
    :
    : but task is already holding lock:
    : (&rq->lock){-.-.}, at: sched_cpu_dying()
    :
    : which lock already depends on the new lock.
    :
    :
    : the existing dependency chain (in reverse order) is:
    :
    : -> #3 (&rq->lock){-.-.}:
    : __lock_acquire()
    : lock_acquire()
    : _raw_spin_lock()
    : task_fork_fair()
    : sched_fork()
    : copy_process.part.31()
    : _do_fork()
    : kernel_thread()
    : rest_init()
    : start_kernel()
    : x86_64_start_reservations()
    : x86_64_start_kernel()
    : verify_cpu()
    :
    : -> #2 (&p->pi_lock){-.-.}:
    : __lock_acquire()
    : lock_acquire()
    : _raw_spin_lock_irqsave()
    : try_to_wake_up()
    : default_wake_function()
    : woken_wake_function()
    : __wake_up_common()
    : __wake_up_common_lock()
    : __wake_up()
    : tty_wakeup()
    : tty_port_default_wakeup()
    : tty_port_tty_wakeup()
    : uart_write_wakeup()
    : serial8250_tx_chars()
    : serial8250_handle_irq.part.25()
    : serial8250_default_handle_irq()
    : serial8250_interrupt()
    : __handle_irq_event_percpu()
    : handle_irq_event_percpu()
    : handle_irq_event()
    : handle_level_irq()
    : handle_irq()
    : do_IRQ()
    : ret_from_intr()
    : native_safe_halt()
    : default_idle()
    : arch_cpu_idle()
    : default_idle_call()
    : do_idle()
    : cpu_startup_entry()
    : rest_init()
    : start_kernel()
    : x86_64_start_reservations()
    : x86_64_start_kernel()
    : verify_cpu()
    :
    : -> #1 (&tty->write_wait){-.-.}:
    : __lock_acquire()
    : lock_acquire()
    : _raw_spin_lock_irqsave()
    : __wake_up_common_lock()
    : __wake_up()
    : tty_wakeup()
    : tty_port_default_wakeup()
    : tty_port_tty_wakeup()
    : uart_write_wakeup()
    : serial8250_tx_chars()
    : serial8250_handle_irq.part.25()
    : serial8250_default_handle_irq()
    : serial8250_interrupt()
    : __handle_irq_event_percpu()
    : handle_irq_event_percpu()
    : handle_irq_event()
    : handle_level_irq()
    : handle_irq()
    : do_IRQ()
    : ret_from_intr()
    : native_safe_halt()
    : default_idle()
    : arch_cpu_idle()
    : default_idle_call()
    : do_idle()
    : cpu_startup_entry()
    : rest_init()
    : start_kernel()
    : x86_64_start_reservations()
    : x86_64_start_kernel()
    : verify_cpu()
    :
    : -> #0 (&port_lock_key){-.-.}:
    : check_prev_add()
    : __lock_acquire()
    : lock_acquire()
    : _raw_spin_lock_irqsave()
    : serial8250_console_write()
    : univ8250_console_write()
    : console_unlock()
    : vprintk_emit()
    : vprintk_default()
    : vprintk_func()
    : printk()
    : ___ratelimit()
    : __printk_ratelimit()
    : select_fallback_rq()
    : sched_cpu_dying()
    : cpuhp_invoke_callback()
    : take_cpu_down()
    : multi_cpu_stop()
    : cpu_stopper_thread()
    : smpboot_thread_fn()
    : kthread()
    : ret_from_fork()
    :
    : other info that might help us debug this:
    :
    : Chain exists of:
    : &port_lock_key --> &p->pi_lock --> &rq->lock
    :
    : Possible unsafe locking scenario:
    :
    : CPU0                            CPU1
    : ----                            ----
    : lock(&rq->lock);
    :                                 lock(&p->pi_lock);
    :                                 lock(&rq->lock);
    : lock(&port_lock_key);
    :
    : *** DEADLOCK ***
    :
    : 4 locks held by migration/8/62:
    : #0: (&p->pi_lock){-.-.}, at: sched_cpu_dying()
    : #1: (&rq->lock){-.-.}, at: sched_cpu_dying()
    : #2: (printk_ratelimit_state.lock){....}, at: ___ratelimit()
    : #3: (console_lock){+.+.}, at: vprintk_emit()
    :
    : stack backtrace:
    : CPU: 8 PID: 62 Comm: migration/8 Not tainted 4.14.0-rc2-next-20170927+ #252
    : Call Trace:
    : dump_stack()
    : print_circular_bug()
    : check_prev_add()
    : ? add_lock_to_list.isra.26()
    : ? check_usage()
    : ? kvm_clock_read()
    : ? kvm_sched_clock_read()
    : ? sched_clock()
    : ? check_preemption_disabled()
    : __lock_acquire()
    : ? __lock_acquire()
    : ? add_lock_to_list.isra.26()
    : ? debug_check_no_locks_freed()
    : ? memcpy()
    : lock_acquire()
    : ? serial8250_console_write()
    : _raw_spin_lock_irqsave()
    : ? serial8250_console_write()
    : serial8250_console_write()
    : ? serial8250_start_tx()
    : ? lock_acquire()
    : ? memcpy()
    : univ8250_console_write()
    : console_unlock()
    : ? __down_trylock_console_sem()
    : vprintk_emit()
    : vprintk_default()
    : vprintk_func()
    : printk()
    : ? show_regs_print_info()
    : ? lock_acquire()
    : ___ratelimit()
    : __printk_ratelimit()
    : select_fallback_rq()
    : sched_cpu_dying()
    : ? sched_cpu_starting()
    : ? rcutree_dying_cpu()
    : ? sched_cpu_starting()
    : cpuhp_invoke_callback()
    : ? cpu_disable_common()
    : take_cpu_down()
    : ? trace_hardirqs_off_caller()
    : ? cpuhp_invoke_callback()
    : multi_cpu_stop()
    : ? __this_cpu_preempt_check()
    : ? cpu_stop_queue_work()
    : cpu_stopper_thread()
    : ? cpu_stop_create()
    : smpboot_thread_fn()
    : ? sort_range()
    : ? schedule()
    : ? __kthread_parkme()
    : kthread()
    : ? sort_range()
    : ? kthread_create_on_node()
    : ret_from_fork()
    : process 9121 (trinity-c78) no longer affine to cpu8
    : smpboot: CPU 8 is now offline

    Link: http://lkml.kernel.org/r/20170928120405.18273-1-sergey.senozhatsky@gmail.com
    Fixes: 6b1d174b0c27b ("ratelimit: extend to print suppressed messages on release")
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Sasha Levin
    Reviewed-by: Petr Mladek
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Steven Rostedt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Align the parameters passed to STANDARD_PARAM_DEF for clarity.

    Link: http://lkml.kernel.org/r/20170928162728.756143cc@endymion
    Signed-off-by: Jean Delvare
    Suggested-by: Ingo Molnar
    Acked-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • Function param_attr_show could overflow the buffer it is operating on.

    The buffer size is PAGE_SIZE, and the string returned by
    attribute->param->ops->get is generated by scnprintf(buffer, PAGE_SIZE,
    ...) so it could be PAGE_SIZE - 1 long, with the terminating '\0' at the
    very end of the buffer. Calling strcat(..., "\n") on this isn't safe, as
    the '\0' will be replaced by '\n' (OK) and then another '\0' will be added
    past the end of the buffer (not OK).

    Simply add the trailing '\n' when writing the attribute contents to the
    buffer originally. This is safe, and also faster.
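
    Illustrative sketch of the difference (buffer and value names are
    hypothetical):

      /* unsafe: buf may already hold PAGE_SIZE - 1 characters, so the
       * strcat() puts its terminating '\0' one byte past the buffer */
      count = scnprintf(buf, PAGE_SIZE, "%s", value);
      strcat(buf, "\n");

      /* safe, and faster: emit the '\n' within the same bounded write */
      count = scnprintf(buf, PAGE_SIZE, "%s\n", value);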

    Credits to Teradata for discovering this issue.

    Link: http://lkml.kernel.org/r/20170928162602.60c379c7@endymion
    Signed-off-by: Jean Delvare
    Acked-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • The length parameter of strlcpy() is supposed to reflect the size of the
    target buffer, not of the source string. Harmless in this case as the
    buffer is PAGE_SIZE long and the source string is always much shorter than
    this, but conceptually wrong, so let's fix it.
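
    In other words (names hypothetical):

      /* conceptually wrong: bounds the copy by the source length */
      strlcpy(buffer, src, strlen(src) + 1);

      /* correct: bound it by the destination's capacity */
      strlcpy(buffer, src, PAGE_SIZE);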

    Link: http://lkml.kernel.org/r/20170928162515.24846b4f@endymion
    Signed-off-by: Jean Delvare
    Acked-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
    find_{smallest|biggest}_section_pfn() find the smallest/biggest section
    and return the pfn of that section. But the functions are defined as
    returning int, so they can only return values in the range 0x00000000 -
    0xffffffff. This means that if the memory address is over 16TB, the
    functions do not work correctly.

    To handle 64 bit value, the patch defines
    find_{smallest|biggest}_section_pfn() as unsigned long.
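
    To see why 16TB is the boundary on a 64-bit kernel with 4KB pages
    (PAGE_SHIFT == 12), an illustration:

      unsigned long pfn = (16ULL << 40) >> 12; /* 0x100000000, pfn of 16TB */
      int truncated = pfn;                     /* 0x00000000 - truncated   */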

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
    pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
    pfn_to_section_nr() has no issue even as a macro. But section_nr_to_pfn()
    has an overflow issue if its argument sec is an int.

    section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is
    defined as unsigned long, section_nr_to_pfn() returns the pfn as a 64 bit
    value. But if sec is defined as int, section_nr_to_pfn() returns the pfn
    as a 32 bit value.

    __remove_section() calculates start_pfn using section_nr_to_pfn() and
    scn_nr defined as int. So if the hot-removed memory address is over 16TB,
    the overflow occurs and section_nr_to_pfn() does not calculate the
    correct pfn.

    To make callers pass properly typed arguments, the patch changes the
    macros to inline functions.
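
    A sketch of the change (the real definitions live in
    include/linux/mmzone.h):

      /* before: the result width follows whatever type the caller passes */
      #define section_nr_to_pfn(sec) ((sec) << PFN_SECTION_SHIFT)

      /* after: the prototype forces a 64-bit-safe argument type */
      static inline unsigned long section_nr_to_pfn(unsigned long sec)
      {
              return sec << PFN_SECTION_SHIFT;
      }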

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
  • The else branch has been left over and escaped the source code refresh.
    Not a problem, but better to clean it up.

    Fixes: 0791e3644e5e ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files")
    Link: http://lkml.kernel.org/r/20170917165838.GA1887@uranus.lan
    Reported-by: Eugene Syromiatnikov
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Andrei Vagin
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • devm_memremap_pages is initializing struct pages in for_each_device_pfn
    and that can take quite some time. We have even seen a soft lockup
    triggering on a non-preemptible kernel:

    NMI watchdog: BUG: soft lockup - CPU#61 stuck for 22s! [kworker/u641:11:1808]
    [...]
    RIP: 0010:[] [] devm_memremap_pages+0x327/0x430
    [...]
    Call Trace:
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70

    Fix this by adding a cond_resched() every 1024 pages.
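
    A sketch of the fix inside the initialization loop (per-page work
    elided, loop helper as in kernel/memremap.c of that era):

      for_each_device_pfn(pfn, page_map) {
              /* ... initialize the struct page for this pfn ... */
              if (!(pfn % 1024))
                      cond_resched();
      }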

    Link: http://lkml.kernel.org/r/20170918121410.24466-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • memmap_init_zone gets a pfn range to initialize and it can be really
    large, resulting in a soft lockup on non-preemptible kernels:

    NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
    [...]
    task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
    RIP: move_pfn_range_to_zone+0x185/0x1d0
    [...]
    Call Trace:
    devm_memremap_pages+0x2c7/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70

    Fix this by adding a scheduling point once per pageblock.
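
    A sketch of the pattern in memmap_init_zone's pfn loop:

      for (pfn = start_pfn; pfn < end_pfn; pfn++) {
              /* ... initialize the struct page for this pfn ... */
              if (!(pfn & (pageblock_nr_pages - 1)))
                      cond_resched();
      }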

    Link: http://lkml.kernel.org/r/20170918121410.24466-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm, memory_hotplug: fix few soft lockups in memory
    hotadd".

    Johannes has noticed a few soft lockups when adding a large nvdimm
    device. All of them were caused by a long loop without any explicit
    cond_resched, which is a problem for !PREEMPT kernels.

    The fix is quite straightforward. Just make sure that cond_resched gets
    called from time to time.

    This patch (of 3):

    __add_pages gets a pfn range to add and there is no upper bound for a
    single call. This is usually a memory block aligned size for the
    regular memory hotplug - smaller sizes are usual for memory ballooning
    drivers, or the whole NUMA node for physical memory online. There is no
    explicit scheduling point in that code path though.

    This can lead to long latencies while __add_pages is executed and we
    have even seen a soft lockup report during nvdimm initialization with a
    !PREEMPT kernel:

    NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
    [...]
    Workqueue: events_unbound async_run_entry_fn
    task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
    RIP: _raw_spin_unlock_irqrestore+0x11/0x20
    RSP: 0018:ffff881809277b10 EFLAGS: 00000286
    [...]
    Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
    DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

    Fix this by adding a cond_resched() once per memory section in the given
    pfn range. Each section is a constant amount of work which itself is not
    too expensive, but many of them just add up.
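
    A sketch, assuming the section-by-section loop in __add_pages():

      for (i = start_sec; i <= end_sec; i++) {
              err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
              if (err && err != -EEXIST)
                      break;

              /* one section is a bounded amount of work; yield in between */
              cond_resched();
      }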

    Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • idr_replace() returns the old value on success, not 0.
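
    So a caller should treat the return value like this (sketch):

      void *old = idr_replace(&idr, new_ptr, id);

      if (IS_ERR(old))        /* errors come back as ERR_PTR(...) */
              return PTR_ERR(old);
      /* success: 'old' is the previous pointer, not 0 */
      kfree(old);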

    Link: http://lkml.kernel.org/r/20170918162642.37511-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • For quick per-memcg indexing, slab caches and list_lru structures
    maintain linear arrays of descriptors. As the number of concurrent
    memory cgroups in the system goes up, this requires large contiguous
    allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
    existing slab cache and list_lru, which can easily fail on loaded
    systems. E.g.:

    mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
    CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
    ? __alloc_pages_direct_compact+0x4c/0x110
    __alloc_pages_nodemask+0xf50/0x1430
    alloc_pages_current+0x60/0xc0
    kmalloc_order_trace+0x29/0x1b0
    __kmalloc+0x1f4/0x320
    memcg_update_all_list_lrus+0xca/0x2e0
    mem_cgroup_css_alloc+0x612/0x670
    cgroup_apply_control_enable+0x19e/0x360
    cgroup_mkdir+0x322/0x490
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xd0/0x120
    SyS_mkdirat+0x6c/0xe0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad
    Mem-Info:
    active_anon:2965 inactive_anon:19 isolated_anon:0
    active_file:100270 inactive_file:98846 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:7328 slab_unreclaimable:16402
    mapped:771 shmem:52 pagetables:278 bounce:0
    free:13718 free_pcp:0 free_cma:0

    This output is from an artificial reproducer, but we have repeatedly
    observed order-7 failures in production in the Facebook fleet. These
    systems become useless as they cannot run more jobs, even though there
    is plenty of memory to allocate 128 individual pages.

    Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
    prove too large for allocating them physically contiguous.
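
    The pattern, sketched (the size calculation is illustrative):

      /* tries kmalloc() first, falls back to vmalloc() for large sizes */
      table = kvzalloc(nr_memcgs * sizeof(*table), GFP_KERNEL);
      if (!table)
              return -ENOMEM;
      /* ... */
      kvfree(table);          /* correct for either allocation path */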

    Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Josef Bacik
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • do_proc_douintvec_conv() has two UINT_MAX checks; we can remove one.
    This has no functional change other than fixing a compiler warning:

    [kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false
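
    The redundant pattern, roughly (sketch):

      if (write) {
              if (*lvalp > UINT_MAX)
                      return -EINVAL;
              if (*lvalp > UINT_MAX)  /* always false here; removed */
                      return -EINVAL;
              *valp = *lvalp;
      }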

    Fixes: 4f2fec00afa60 ("sysctl: simplify unsigned int support")
    Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Reported-by: David Binderman
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • I do not see anything that restricts this macro to 32 bit width.

    Link: http://lkml.kernel.org/r/1505921975-23379-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Jakub Kicinski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Don't populate the read-only arrays dec32table and dec64table on the
    stack, instead make them both static const. Makes the object code
    smaller by over 10K bytes:

    Before:
       text    data    bss      dec     hex  filename
      31500       0      0    31500    7b0c  lib/lz4/lz4_decompress.o

    After:
       text    data    bss      dec     hex  filename
      20237     176      0    20413    4fbd  lib/lz4/lz4_decompress.o

    (gcc version 7.2.0 x86_64)
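
    The change itself, in outline (element values as in upstream LZ4):

      /* before: the arrays are built on the stack on every call */
      const int dec32table[] = { 0, 3, 2, 3, 0, 0, 0, 0 };

      /* after: one shared read-only copy in the object file */
      static const int dec32table[] = { 0, 3, 2, 3, 0, 0, 0, 0 };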

    Link: http://lkml.kernel.org/r/20170921221939.20820-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Cc: Christophe JAILLET
    Cc: Sven Schmidt
    Cc: Arnd Bergmann
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • After the previous change "fmt" can't go away; we can kill
    iname/iname_addr and use fmt->interpreter.

    Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • load_misc_binary() makes a local copy of fmt->interpreter under
    entries_lock to avoid the race with kill_node() but this is not enough;
    the whole Node can be freed after we drop entries_lock, not only the
    ->interpreter string.

    Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
    this Node.
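
    A simplified sketch of the fix's shape (helpers and locking as in
    fs/binfmt_misc.c):

      read_lock(&entries_lock);
      fmt = check_file(bprm);
      if (fmt)
              dget(fmt->dentry);      /* pin the Node via its dentry */
      read_unlock(&entries_lock);
      if (!fmt)
              return -ENOEXEC;

      /* ... fmt->interpreter is now safe to use ... */

      dput(fmt->dentry);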

    Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc: Travis Gummels
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the MISC_FMT_OPEN_FILE flag is set, e->interp_file must be valid, or
    we have a bug which should not be silently ignored.

    Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This ensures that load_misc_binary() can't use the partially destroyed
    Node; see also the next patch.

    The current logic looks wrong in any case, once we close interp_file it
    doesn't make any sense to delay kfree(inode->i_private), this Node is no
    longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
    not racy (they are), load_misc_binary() should not try to reopen
    ->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

    And I can't understand why we use filp_close(), not fput().

    Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • kill_node() nullifies/checks Node->dentry to avoid a double free. This
    complicates the next changes, and it is very confusing:

    - we do not need to check dentry != NULL under entries_lock,
    kill_node() is always called under inode_lock(d_inode(root)) and we
    rely on this inode_lock() anyway, without this lock the
    MISC_FMT_OPEN_FILE cleanup could race with itself.

    - if kill_node() was already called and ->dentry == NULL we should not
    even try to close e->interp_file.

    We can change bm_entry_write() to simply check !list_empty(list) before
    kill_node. Again, we rely on inode_lock(), in particular it saves us
    from the race with bm_status_write(), another caller of kill_node().
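
    A sketch of the simplified removal path in bm_entry_write():

      inode_lock(d_inode(root));
      if (!list_empty(&e->list))      /* not already killed */
              kill_node(e);
      inode_unlock(d_inode(root));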

    Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Patch series "exec: binfmt_misc: fix use-after-free, kill
    iname[BINPRM_BUF_SIZE]".

    It looks like this code was always wrong, then commit 948b701a607f
    ("binfmt_misc: add persistent opened binary handler for containers")
    added more problems.

    This patch (of 6):

    load_script() can simply use i_name instead; it points into bprm->buf[]
    and nobody can change this memory until we call prepare_binprm().

    The only complication is that we need to also change the signature of
    bprm_change_interp() but this change looks good too.

    While at it, do whitespace/style cleanups.

    NOTE: the real motivation for this change is that people want to
    increase BINPRM_BUF_SIZE; we need to change load_misc_binary() too, but
    this looks more complicated because afaics it is very buggy.

    Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Travis Gummels
    Cc: Ben Woodard
    Cc: Jim Foraker
    Cc:
    Cc: Al Viro
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When reading the event from the uffd, we put it on a temporary
    fork_event list to detect if we can still access it after releasing and
    retaking the event_wqh.lock.

    If fork aborts and removes the event from fork_event, all is fine as
    long as we're still in the userfault read context and the fork_event
    head is still alive.

    We have to put the event allocated in the fork kernel stack back from
    the fork_event list-head to the event_wqh head before returning from
    userfaultfd_ctx_read, because the fork_event head's lifetime is limited
    to the userfaultfd_ctx_read stack lifetime.

    Forgetting to move the event back to its event_wqh place then results in
    __remove_wait_queue(&ctx->event_wqh, &ewq->wq) in
    userfaultfd_event_wait_completion removing it from a head that has
    already been freed from the reader stack.

    This could only happen if resolve_userfault_fork failed (for example if
    there are no file descriptors available to allocate the fork uffd). If
    it succeeded it was put back correctly.

    Furthermore, after find_userfault_evt receives a fork event, the forked
    userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
    be released by the fork thread as soon as the event_wqh.lock is
    released. Taking a reference on the fork_nctx before dropping the lock
    prevents a use-after-free in resolve_userfault_fork().

    If the fork side aborted and already released everything, we still try
    to let resolve_userfault_fork() succeed, if possible.

    Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
    Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Cc: Pavel Emelyanov
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With device public pages at the end of my memory space, I'm getting
    output from _vm_normal_page():

    BUG: Bad page map in process migrate_pages pte:c0800001ffff0d06 pmd:f95d3000
    addr:00007fff89330000 vm_flags:00100073 anon_vma:c0000000fa899320 mapping: (null) index:7fff8933
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 0 PID: 13963 Comm: migrate_pages Tainted: P B OE 4.14.0-rc1-wip #155
    Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    print_bad_pte+0x28c/0x340
    _vm_normal_page+0xc0/0x140
    zap_pte_range+0x664/0xc10
    unmap_page_range+0x318/0x670
    unmap_vmas+0x74/0xe0
    exit_mmap+0xe8/0x1f0
    mmput+0xac/0x1f0
    do_exit+0x348/0xcd0
    do_group_exit+0x5c/0xf0
    SyS_exit_group+0x1c/0x20
    system_call+0x58/0x6c

    The pfn causing this is the very last one. Correct the bounds check
    accordingly.

    Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
    Link: http://lkml.kernel.org/r/1506092178-20351-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • MADV_FREE clears the pte dirty bit and then marks the page lazyfree
    (clears SwapBacked). There is no lock to prevent page reclaim from
    adding the page to the swap cache between these two steps. If page
    reclaim finds such a page, it will simply add the page to the swap cache
    without paging it out to swap, because the page is marked as clean.
    Next time, a page fault will read data from the swap slot which doesn't
    hold the original data, so we have data corruption. To fix the issue, we
    mark the page dirty and page it out.

    However, we shouldn't dirty all pages which are clean and in the swap
    cache: a swapin page is in the swap cache and clean too. So we only
    dirty pages which are added to the swap cache in page reclaim, which
    cannot be swapin pages. As Minchan suggested, simply dirtying the page
    in add_to_swap can do the job.
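
    A sketch of the idea inside add_to_swap() (error handling elided):

      err = add_to_swap_cache(page, entry,
                              __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
      if (!err) {
              /*
               * MADV_FREE cleared the pte dirty bit, so set the page
               * dirty here to force pageout rather than a clean discard.
               */
              set_page_dirty(page);
      }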

    Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
    Link: http://lkml.kernel.org/r/08c84256b007bf3f63c91d94383bd9eb6fee2daa.1506446061.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Reported-by: Artem Savkov
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • MADV_FREE clears the pte dirty bit and then marks the page lazyfree
    (clears SwapBacked). There is no lock to prevent page reclaim from
    adding the page to the swap cache between these two steps. Page reclaim
    could add the page to the swap cache and unmap the page. After page
    reclaim, the page is added back to the LRU. At that time, we may start
    draining the per-cpu pagevec and mark the page lazyfree. So the page
    could end up in a state with SwapBacked cleared and PG_swapcache set.
    On the next refault at that virtual address, do_swap_page can find the
    page in the swap cache, but the page has PageSwapCache false because
    SwapBacked isn't set, so do_swap_page will bail out and do nothing. The
    task will keep running into the fault handler.
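
    A simplified sketch of the tightened check in the pagevec drain path
    (mm/swap.c):

      /* never mark a page lazyfree once it is in the swap cache */
      if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
          !PageSwapCache(page) && !PageUnevictable(page)) {
              /* ... clear SwapBacked and move it to the lazyfree list ... */
      }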

    Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
    Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Eryu noticed that he could sometimes get a leftover error reported on
    fsync when it shouldn't be, with ext2 and non-journalled ext4.

    The problem is that writeback_single_inode still uses filemap_fdatawait.
    That picks up a previously set AS_EIO flag, which would ordinarily have
    been cleared before.

    Since we're mostly using this function as a replacement for
    filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
    and AS_ENOSPC when reporting an error. That should allow the new
    function to better emulate the behavior of the old with respect to these
    flags.
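
    A sketch of the added clearing (flag names from
    include/linux/pagemap.h):

      /* when reporting an error, also drop the legacy bits so a later
       * filemap_fdatawait() doesn't pick up stale AS_EIO/AS_ENOSPC */
      clear_bit(AS_EIO, &mapping->flags);
      clear_bit(AS_ENOSPC, &mapping->flags);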

    Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
    Signed-off-by: Jeff Layton
    Reported-by: Eryu Guan
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • The build of m32r allmodconfig is giving lots of build warnings about:

    include/linux/byteorder/big_endian.h:7:2:
    warning: #warning inconsistent configuration,
    needs CONFIG_CPU_BIG_ENDIAN [-Wcpp]
    #warning inconsistent configuration, needs CONFIG_CPU_BIG_ENDIAN

    Define CPU_BIG_ENDIAN in the same way CPU_LITTLE_ENDIAN is defined.

    Link: http://lkml.kernel.org/r/1505678083-10320-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     
  • In testing I found that the handle passed to zs_map_object in
    __zram_bvec_read is NULL, so the kernel oopses in pin_object().

    The reason is that there is no check for the slot having been freed
    after taking the slot's lock. This patch fixes it.

    [minchan@kernel.org: v2]
    Link: http://lkml.kernel.org/r/1505887347-10881-1-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1505788488-26723-1-git-send-email-minchan@kernel.org
    Fixes: 1f7319c74275 ("zram: partial IO refactoring")
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • On powerpc, RODATA_TEST fails with the following messages:

    Freeing unused kernel memory: 528K
    rodata_test: test data was not read only

    This is because GCC allocates it to .data section:

    c0695034 g O .data 00000004 rodata_test_data

    Since commit 056b9d8a7692 ("mm: remove rodata_test_data export, add
    pr_fmt"), rodata_test_data is used only inside rodata_test.c. By
    declaring it static, it gets properly allocated into the .rodata section
    instead of .data:

    c04df710 l O .rodata 00000004 rodata_test_data
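
    The change itself is one keyword (the 0xC3 initializer is assumed from
    mm/rodata_test.c):

      /* before: externally visible; GCC on powerpc placed it in .data */
      const int rodata_test_data = 0xC3;

      /* after: static const lands in .rodata as intended */
      static const int rodata_test_data = 0xC3;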

    Fixes: 056b9d8a7692 ("mm: remove rodata_test_data export, add pr_fmt")
    Link: http://lkml.kernel.org/r/20170921093729.1080368AC1@po15668-vm-win7.idsi0.si.c-s.fr
    Signed-off-by: Christophe Leroy
    Cc: Kees Cook
    Cc: Jinbum Park
    Cc: Segher Boessenkool
    Cc: David Laight
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     
  • Locking of config and doorbell operations should be done only if the
    underlying hardware requires it.

    This patch removes the global spinlocks from the rapidio subsystem and
    moves them to the mport drivers (fsl_rio and tsi721), only to the
    necessary places. For example, local config space read and write
    operations (lcread/lcwrite) are atomic in all existing drivers, so there
    should be no need for locking, while the cread/cwrite operations which
    generate maintenance transactions need to be synchronized with a lock.

    Later, each driver could choose to use a per-port lock instead of a
    global one, or even more granular locking.

    Link: http://lkml.kernel.org/r/20170824113023.GD50104@nokia.com
    Signed-off-by: Ioan Nicu
    Signed-off-by: Frank Kunz
    Acked-by: Alexandre Bounine
    Cc: Matt Porter
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ioan Nicu
     
  • The function is called from __meminit context and calls other __meminit
    functions but isn't itself marked as such today:

    WARNING: vmlinux.o(.text.unlikely+0x4516): Section mismatch in reference from the function init_reserved_page() to the function .meminit.text:early_pfn_to_nid()
    The function init_reserved_page() references the function __meminit early_pfn_to_nid().
    This is often because init_reserved_page lacks a __meminit annotation or the annotation of early_pfn_to_nid is wrong.

    On most compilers, we don't notice this because the function gets
    inlined all the time. Adding __meminit here fixes the harmless warning
    for the old versions and is generally the correct annotation.
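
    The fix is just the annotation (sketch):

      /* match the section of the __meminit functions this calls */
      static void __meminit init_reserved_page(unsigned long pfn)
      {
              /* ... */
      }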

    Link: http://lkml.kernel.org/r/20170915193149.901180-1-arnd@arndb.de
    Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
    Signed-off-by: Arnd Bergmann
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Fix the situation when clear_bit() is called for page->private before
    the page pointer is actually assigned. While at it, remove the
    work_busy() check because it is costly and does not give a 100%
    guarantee anyway.

    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Andrea brought to my attention that the L->{L,S} guarantees are
    completely bogus for this case. I was looking at the diagram from the
    offending commit: when that _is_ the race, we had the load reordered
    already.

    What we need is at least S->L semantics, thus simply use
    wq_has_sleeper() to serialize the call for good.
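
    A sketch of the pattern (the wait queue name is illustrative):

      /* wq_has_sleeper() issues a full barrier before checking
       * waitqueue_active(), giving the needed S->L ordering */
      if (wq_has_sleeper(&pgdat->kcompactd_wait))
              wake_up_interruptible(&pgdat->kcompactd_wait);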

    Link: http://lkml.kernel.org/r/20170914175313.GB811@linux-80c1.suse
    Fixes: 46acef048a6 ("mm,compaction: serialize waitqueue_active() checks")
    Signed-off-by: Davidlohr Bueso
    Reported-by: Andrea Parri
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Drop the global lru lock in the isolate callback before calling
    zap_page_range(), which calls cond_resched(), and re-acquire the global
    lru lock before returning. Also change the return code to
    LRU_REMOVED_RETRY.

    Use mmput_async when we fail to acquire the mmap sem in an atomic
    context.

    Fix "BUG: sleeping function called from invalid context"
    errors when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
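
    A sketch of the isolate-callback change (the lock pointer is the one
    list_lru passes to isolate callbacks; details simplified):

      /* zap_page_range() may call cond_resched(), so drop the global
       * lru lock around it and tell list_lru that we dropped it */
      spin_unlock(lock);
      zap_page_range(vma, page_addr, PAGE_SIZE);
      spin_lock(lock);
      return LRU_REMOVED_RETRY;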

    Also restore mmput_async, which was initially introduced in commit
    ec8d7c14ea14 ("mm, oom_reaper: do not mmput synchronously from the oom
    reaper context"), and was removed in commit 212925802454 ("mm: oom: let
    oom_reap_task and exit_mmap run concurrently").

    Link: http://lkml.kernel.org/r/20170914182231.90908-1-sherryy@android.com
    Fixes: f2517eb76f1f2 ("android: binder: Add global lru shrinker to binder")
    Signed-off-by: Sherry Yang
    Signed-off-by: Greg Kroah-Hartman
    Reported-by: Kyle Yan
    Acked-by: Arve Hjønnevåg
    Acked-by: Michal Hocko
    Cc: Martijn Coenen
    Cc: Todd Kjos
    Cc: Riley Andrews
    Cc: Ingo Molnar
    Cc: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: Hoeun Ryu
    Cc: Christopher Lameter
    Cc: Vegard Nossum
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sherry Yang
     
  • Fix for 4.14: zone device pages always have an elevated refcount of one,
    and thus the page count sanity check in uncharge_page() is inappropriate
    for them.
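
    A simplified sketch of the adjusted assertion, per the description:

      /* zone device pages keep one elevated reference for their whole
       * lifetime, so exempt them from the refcount sanity check */
      VM_BUG_ON_PAGE(!is_zone_device_page(page) && page_count(page), page);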

    [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
    Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Michal Hocko
    Reported-by: Evgeny Baskakov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The following lockdep splat has been noticed during LTP testing

    ======================================================
    WARNING: possible circular locking dependency detected
    4.13.0-rc3-next-20170807 #12 Not tainted
    ------------------------------------------------------
    a.out/4771 is trying to acquire lock:
    (cpu_hotplug_lock.rw_sem){++++++}, at: [] drain_all_stock.part.35+0x18/0x140

    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc9/0x230
    __might_fault+0x70/0xa0
    _copy_to_user+0x23/0x70
    filldir+0xa7/0x110
    xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
    xfs_readdir+0x1fa/0x2c0 [xfs]
    xfs_file_readdir+0x30/0x40 [xfs]
    iterate_dir+0x17a/0x1a0
    SyS_getdents+0xb0/0x160
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    -> #2 (&type->i_mutex_dir_key#3){++++++}:
    lock_acquire+0xc9/0x230
    down_read+0x51/0xb0
    lookup_slow+0xde/0x210
    walk_component+0x160/0x250
    link_path_walk+0x1a6/0x610
    path_openat+0xe4/0xd50
    do_filp_open+0x91/0x100
    file_open_name+0xf5/0x130
    filp_open+0x33/0x50
    kernel_read_file_from_path+0x39/0x80
    _request_firmware+0x39f/0x880
    request_firmware_direct+0x37/0x50
    request_microcode_fw+0x64/0xe0
    reload_store+0xf7/0x180
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x44/0x60
    kernfs_fop_write+0x113/0x1a0
    __vfs_write+0x37/0x170
    vfs_write+0xc7/0x1c0
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x1f0
    return_from_SYSCALL_64+0x0/0x7a

    -> #1 (microcode_mutex){+.+.+.}:
    lock_acquire+0xc9/0x230
    __mutex_lock+0x88/0x960
    mutex_lock_nested+0x1b/0x20
    microcode_init+0xbb/0x208
    do_one_initcall+0x51/0x1a9
    kernel_init_freeable+0x208/0x2a7
    kernel_init+0xe/0x104
    ret_from_fork+0x2a/0x40

    -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
    __lock_acquire+0x153c/0x1550
    lock_acquire+0xc9/0x230
    cpus_read_lock+0x4b/0x90
    drain_all_stock.part.35+0x18/0x140
    try_charge+0x3ab/0x6e0
    mem_cgroup_try_charge+0x7f/0x2c0
    shmem_getpage_gfp+0x25f/0x1050
    shmem_fault+0x96/0x200
    __do_fault+0x1e/0xa0
    __handle_mm_fault+0x9c3/0xe00
    handle_mm_fault+0x16e/0x380
    __do_page_fault+0x24a/0x530
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    other info that might help us debug this:

    Chain exists of:
    cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem

    Possible unsafe locking scenario:

    CPU0                            CPU1
    ----                            ----
    lock(&mm->mmap_sem);
                                    lock(&type->i_mutex_dir_key#3);
                                    lock(&mm->mmap_sem);
    lock(cpu_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    2 locks held by a.out/4771:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530
    #1: (percpu_charge_mutex){+.+...}, at: [] try_charge+0x397/0x6e0

    The problem is very similar to the one fixed by commit a459eeb7b852
    ("mm, page_alloc: do not depend on cpu hotplug locks inside the
    allocator"). We are taking hotplug locks while we can be sitting on top
    of basically arbitrary locks. This just calls for problems.

    We can get rid of {get,put}_online_cpus, fortunately. We do not have to
    worry about races with memory hotplug because drain_local_stock,
    which is called from both the WQ draining and the memory hotplug
    contexts, is always operating on the local cpu stock with IRQs disabled.

    The only thing to be careful about is that the target memcg doesn't
    vanish while we are still in drain_all_stock so take a reference on it.
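
    A sketch of the reworked drain_all_stock() entry, following the
    description above (details simplified):

      /* no {get,put}_online_cpus(): pin ourselves to a stable cpu and
       * make sure the target memcg cannot vanish while we walk the cpus */
      css_get(&root_memcg->css);
      curcpu = get_cpu();
      for_each_online_cpu(cpu) {
              /* ... schedule the per-cpu drain work ... */
      }
      put_cpu();
      css_put(&root_memcg->css);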

    Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Cc: Johannes Weiner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Andrea has noticed that the oom_reaper doesn't invalidate the range via
    mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
    corrupt the memory of the kvm guest for example.

    tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
    sufficient as per Andrea:

    "mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end. For KVM
    mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
    notifier implementation has to implement either ->invalidate_range
    method or the invalidate_range_start/end methods, not both. And if you
    implement invalidate_range_start/end like KVM is forced to do, calling
    mmu_notifier_invalidate_range in common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD iommuv2)
    can get away only implementing ->invalidate_range"

    As the callback is allowed to sleep and the implementation is out of the
    hands of the MM, it is safer to simply bail out if there is an mmu
    notifier registered. In order not to fail too early, make the
    mm_has_notifiers check under the oom_lock and have a little nap before
    failing, to give the current oom victim some more time to exit.
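
    A sketch of the bail-out, per the description above (locking context
    simplified):

      /*
       * Notifier callbacks may sleep and their behaviour is out of the
       * MM's hands, so don't reap such an mm; give the victim more time.
       */
      if (mm_has_notifiers(mm)) {
              up_read(&mm->mmap_sem);
              schedule_timeout_idle(HZ);
              return false;
      }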

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • It is possible that on a (partially) unsuccessful page reclaim, the
    kref_put() called in z3fold_reclaim_page() does not release the page,
    but the page is released shortly afterwards by another thread. Then
    z3fold_reclaim_page() would try to list_add() that (released) page
    again, which is obviously a bug.

    To avoid that, the spin_lock() has to be taken earlier, before the
    kref_put() call mentioned above.
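
    A sketch of the reordering (names as in mm/z3fold.c, simplified):

      /* take the pool lock first, so a racing release cannot free the
       * page between the kref_put() and the list_add() below */
      spin_lock(&pool->lock);
      if (kref_put(&zhdr->refcount, release_z3fold_page)) {
              spin_unlock(&pool->lock);
              return 0;       /* page is gone; don't touch it */
      }
      list_add(&page->lru, &pool->lru);
      spin_unlock(&pool->lock);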

    Link: http://lkml.kernel.org/r/20170913162937.bfff21c7d12b12a5f47639fd@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • pinmux_pins[] is initialized through PINMUX_GPIO(), using designated
    array initializers, where the GPIO_* enum values serve as indices.
    values are defined, but never used, pinmux_pins[] contains (zero-filled)
    holes. Such entries are treated as pin zero, which was registered
    before, thus leading to pinctrl registration failures, as seen on
    sh7722:

    sh-pfc pfc-sh7722: pin 0 already registered
    sh-pfc pfc-sh7722: error during pin registration
    sh-pfc pfc-sh7722: could not register: -22
    sh-pfc: probe of pfc-sh7722 failed with error -22

    Remove GPIO_PH[0-7] from the enum to fix this.
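
    A hypothetical illustration of how the holes arise (types and macro
    arguments are not the real sh-pfc ones):

      /* designated initializers indexed by the GPIO_* enum; any value
       * that never appears leaves a zero-filled entry behind, which the
       * pinctrl core then sees as "pin 0, registered again" */
      static struct pinmux_gpio pinmux_pins[] = {
              PINMUX_GPIO(GPIO_PA0, PA0_DATA),
              /* GPIO_PH0..GPIO_PH7: no initializer => holes */
      };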

    Link: http://lkml.kernel.org/r/1505205657-18012-5-git-send-email-geert+renesas@glider.be
    Fixes: ef0fa5331a73e479 ("sh: Add pinmux for sh7269")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Laurent Pinchart
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Magnus Damm
    Cc: Yoshihiro Shimoda
    Cc: Jacopo Mondi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven