08 Jun, 2018

40 commits

  • Fix missing MODULE_LICENSE() warning in lib/ucs2_string.c:

    WARNING: modpost: missing MODULE_LICENSE() in lib/ucs2_string.o
    see include/linux/module.h for more information
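
    The fix is a one-line annotation at the bottom of the file; a minimal
    sketch, assuming the "GPL v2" license string (the exact string is
    whatever the commit chose):

    /* lib/ucs2_string.c: <linux/module.h> provides MODULE_LICENSE() */
    #include <linux/module.h>

    MODULE_LICENSE("GPL v2");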

    Link: http://lkml.kernel.org/r/b2505bb4-dcf5-fc46-443d-e47db1cb2f59@infradead.org
    Signed-off-by: Randy Dunlap
    Cc: Greg Kroah-Hartman
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • MPI headers contain definitions for a huge number of non-existent
    functions.

    Most of these functions were removed in 2012 by Dmitry Kasatkin
    - 7cf4206a99d1 ("Remove unused code from MPI library")
    - 9e235dcaf4f6 ("Revert "crypto: GnuPG based MPI lib - additional ...")
    - bc95eeadf5c6 ("lib/mpi: removed unused functions")
    however the headers were not updated properly.

    I also deleted some unused macros.

    Link: http://lkml.kernel.org/r/fb2fc1ef-1185-f0a3-d8d0-173d2f97bbaf@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Andrew Morton
    Cc: Dmitry Kasatkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • percpu_ida() decouples disabling interrupts from the locking operations.
    This breaks some assumptions if the locking operations are replaced like
    they are under -RT.

    The same locking can be achieved by avoiding local_irq_save() and using
    spin_lock_irqsave() instead. percpu_ida_alloc() gains one more preemption
    point because after unlocking the fastpath and before the pool lock is
    acquired, the interrupts are briefly enabled.
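
    A minimal sketch of the conversion pattern (names abridged from
    lib/percpu_ida.c, not the full patch):

    /* before: interrupt disabling decoupled from the lock */
    local_irq_save(flags);
    spin_lock(&pool->lock);
    ...
    spin_unlock(&pool->lock);
    local_irq_restore(flags);

    /* after: one combined operation, which -RT can substitute safely */
    spin_lock_irqsave(&pool->lock, flags);
    ...
    spin_unlock_irqrestore(&pool->lock, flags);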

    Link: http://lkml.kernel.org/r/20180504153218.7301-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Nicholas Bellinger
    Cc: Shaohua Li
    Cc: Kent Overstreet
    Cc: Matthew Wilcox
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Improve the scalability of the IDA by using the per-IDA xa_lock rather
    than the global simple_ida_lock. IDAs are not typically used in
    performance-sensitive locations, but since we have this lock anyway, we
    can use it. It is also a step towards converting the IDA from the radix
    tree to the XArray.
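
    The locking change is mechanical; a hedged sketch, assuming the
    xa_lock that the XArray preparation patches embedded in the IDA's
    radix_tree_root:

    /* before: one global lock serializes every IDA in the system */
    spin_lock_irqsave(&simple_ida_lock, flags);
    ...
    spin_unlock_irqrestore(&simple_ida_lock, flags);

    /* after: the per-IDA lock that already exists in the root */
    xa_lock_irqsave(&ida->ida_rt, flags);
    ...
    xa_unlock_irqrestore(&ida->ida_rt, flags);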

    [akpm@linux-foundation.org: idr.c needs xarray.h]
    Link: http://lkml.kernel.org/r/20180331125332.GF13332@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Rasmus Villemoes
    Cc: Daniel Vetter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the BITS_TO_LONGS() macro to avoid calculating the remainder
    (bits % BITS_PER_LONG). On ARM64 it saves 5 instructions per
    function: 16 before and 11 after.
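
    BITS_TO_LONGS() is a rounded-up division, so no separate remainder
    test is needed. An illustrative comparison (not the patched function
    itself):

    /* open-coded: division plus remainder check */
    len = bits / BITS_PER_LONG;
    if (bits % BITS_PER_LONG)
            len++;

    /* with the macro: effectively DIV_ROUND_UP(bits, BITS_PER_LONG) */
    len = BITS_TO_LONGS(bits);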

    Link: http://lkml.kernel.org/r/20180411145914.6011-1-ynorov@caviumnetworks.com
    Signed-off-by: Yury Norov
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yury Norov
     
  • There are mode-change and rename-only patches that are not recognized
    by the get_maintainer.pl script.

    Recognize them.

    Link: http://lkml.kernel.org/r/bf63101a908d0ff51948164aa60e672368066186.1526949367.git.joe@perches.com
    Signed-off-by: Joe Perches
    Reported-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When we get a hung task it can often be valuable to see _all_ the hung
    tasks on the system before calling panic().

    Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952
    ----------------------------------------
    INFO: task syz-executor0:6540 blocked for more than 120 seconds.
    Not tainted 4.16.0+ #13
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor0 D23560 6540 4521 0x80000004
    Call Trace:
    context_switch kernel/sched/core.c:2848 [inline]
    __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490
    schedule+0xf5/0x430 kernel/sched/core.c:3549
    schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607
    __mutex_lock_common kernel/locking/mutex.c:833 [inline]
    __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893
    mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
    lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601
    ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060
    isofs_get_last_session fs/isofs/inode.c:567 [inline]
    isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660
    mount_bdev+0x2b7/0x370 fs/super.c:1119
    isofs_mount+0x34/0x40 fs/isofs/inode.c:1560
    mount_fs+0x66/0x2d0 fs/super.c:1222
    vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
    vfs_kern_mount fs/namespace.c:2514 [inline]
    do_new_mount fs/namespace.c:2517 [inline]
    do_mount+0xea4/0x2b90 fs/namespace.c:2847
    ksys_mount+0xab/0x120 fs/namespace.c:3063
    SYSC_mount fs/namespace.c:3077 [inline]
    SyS_mount+0x39/0x50 fs/namespace.c:3074
    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    (...snipped...)
    Showing all locks held in the system:
    (...snipped...)
    2 locks held by syz-executor0/6540:
    #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline]
    #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */
    #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
    (...snipped...)
    3 locks held by syz-executor7/6541:
    #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
    #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192
    #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */
    ----------------------------------------

    When reporting an AB-BA deadlock like the one shown above, it would be
    nice if the trace of PID=6541 were printed as well as the trace of
    PID=6540 before calling panic().

    Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay
    calling panic() but normally there should not be so many hung tasks.

    Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Paul E. McKenney
    Acked-by: Dmitry Vyukov
    Cc: Vegard Nossum
    Cc: Mandeep Singh Baines
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • This header file is not exported. It is safe to reference types without
    double-underscore prefix.

    Link: http://lkml.kernel.org/r/1526350925-14922-3-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The corresponding UAPI header has the same typedefs, except that it
    prefixes them with a double underscore for user space. Use them for
    the kernel-space typedefs.
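
    The pattern, sketched with the aligned types as an example (the
    exact typedefs touched are in the patch):

    /* the uapi header already defines the __-prefixed versions */
    typedef __aligned_u64 aligned_u64;
    typedef __aligned_be64 aligned_be64;
    typedef __aligned_le64 aligned_le64;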

    Link: http://lkml.kernel.org/r/1526350925-14922-2-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Andrew Morton
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The corresponding UAPI header has the same typedefs, except that it
    prefixes them with a double underscore for user space. Use them for
    the kernel-space typedefs.

    Link: http://lkml.kernel.org/r/1526350925-14922-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Andrew Morton
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • * Test lookup in /proc/self/fd.
    "map_files" lookup story showed that lookup is not that simple.

    * Test that all those symlinks open the same file.
    Check with (st_dev, st_ino).

    * Test that kernel threads do not have anything in their /proc/*/fd/
    directory.

    Now this is where things get interesting.

    First, kernel threads aren't pinned by /proc/self or equivalent,
    thus some "atomicity" is required.

    Second, ->comm can contain whitespace and ')'.
    No, they are not escaped.

    Third, the only reliable way to check whether a process is a kernel
    thread appears to be field #9 in /proc/*/stat.

    This field is struct task_struct::flags in decimal!
    The check is done by testing the PF_KTHREAD flag, like we do in the
    kernel.

    The PF_KTHREAD value is a part of the userspace ABI !!!

    Other methods for determining kernel threadness are not reliable:
    * RSS can be 0 if everything is swapped out, even while reading
    from /proc/self.

    * ->total_vm CAN BE ZERO if the process is finishing

    munmap(NULL, whole address space);

    * /proc/*/maps and similar files can be empty because unmapping
    everything works. A read returning 0 can't distinguish between a
    kernel thread and such a suicidal process.
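
    A stand-alone sketch of the field #9 check (illustrative, not the
    selftest itself; PF_KTHREAD copied from include/linux/sched.h):

    #include <stdio.h>
    #include <string.h>

    #define PF_KTHREAD 0x00200000   /* userspace ABI, per the above */

    /* Returns 1 for a kernel thread, 0 for a normal process, -1 on
     * error.  Parsing starts after the *last* ')' because ->comm may
     * itself contain spaces and ')'. */
    static int is_kernel_thread(int pid)
    {
            char path[64], buf[4096], *p;
            unsigned long flags;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/stat", pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (!fgets(buf, sizeof(buf), f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);

            p = strrchr(buf, ')');
            if (!p)
                    return -1;
            /* after ')': state ppid pgrp session tty_nr tpgid flags */
            if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %lu", &flags) != 1)
                    return -1;
            return !!(flags & PF_KTHREAD);
    }

    int main(void)
    {
            printf("pid 2: %d\n", is_kernel_thread(2));   /* kthreadd */
            return 0;
    }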

    Link: http://lkml.kernel.org/r/20180505000414.GA15090@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct stack_trace::nr_entries is defined as "unsigned int" (YAY!) so
    the iterator should be unsigned as well.

    It saves 1 byte of code or something like that.

    Link: http://lkml.kernel.org/r/20180423215248.GG9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • It's defined as atomic_t and really long signal queues are unheard of.

    Link: http://lkml.kernel.org/r/20180423215119.GF9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • All those lengths are unsigned as they should be.

    Link: http://lkml.kernel.org/r/20180423213751.GC9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kstat is thread local.

    Link: http://lkml.kernel.org/r/20180423213626.GB9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Code can be consolidated if a dummy region of length 0 is used in the
    normal case of a \0-separated command line:

    1) [arg_start, arg_end) + [dummy len=0]
    2) [arg_start, arg_end) + [env_start, env_end)
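
    A sketch of the idea (variable names hypothetical, not the patch
    itself): one loop walks two ranges, and case 1 simply makes the
    second range empty:

    struct { unsigned long start, end; } range[2] = {
            { arg_start, arg_end },
            { arg_end,   arg_end },         /* dummy, len = 0 */
    };
    /* case 2 uses { env_start, env_end } as the second range */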

    Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The "rv" variable is used both as a counter of bytes transferred and
    as an error-value holder, but it can be reduced to holding only error
    values if the original start of the userspace buffer is stashed and
    used at the very end.

    [akpm@linux-foundation.org: simplify cleanup code]
    Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The "final" variable is OK, but we can get away with fewer lines.

    Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • access_remote_vm() doesn't return negative errors; it returns the
    number of bytes read/written (0 if an error occurs). This allows us
    to delete some comparisons which never trigger.

    Reuse the "nr_read" variable while at it.
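
    The resulting calling pattern (sketch):

    nr_read = access_remote_vm(mm, addr, buf, count, 0);
    if (nr_read == 0)       /* nothing copied: error or EOF, never < 0 */
            break;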

    Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When commit bd33ef368135 ("mm: enable page poisoning early at boot")
    got rid of PAGE_EXT_DEBUG_POISON, page_is_poisoned() was left behind
    in the header. This patch cleans up that leftover.

    Link: http://lkml.kernel.org/r/1528101069-21637-1-git-send-email-kpark3469@gmail.com
    Signed-off-by: Sahara
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahara
     
  • The LKP robot found a 27% will-it-scale/page_fault3 performance
    regression caused by commit e27be240df53 ("mm: memcg: make sure
    memory.events is uptodate when waking pollers").

    What the test does is:
    1 mkstemp() a 128M file on a tmpfs;
    2 start $nr_cpu processes, each to loop the following:
    2.1 mmap() this file in shared write mode;
    2.2 write 0 to this file in PAGE_SIZE steps till the end of the file;
    2.3 munmap() this file and repeat this process.
    3 After 5 minutes, check how many loops they managed to complete, the
    higher the better.

    The commit itself looks innocent enough as it merely changed some
    event counting mechanism and this test didn't trigger those events at
    all. Perf shows increased cycles spent on accessing
    root_mem_cgroup->stat_cpu in count_memcg_event_mm() (called by
    handle_mm_fault()) and in __mod_memcg_state() (called by
    page_add_file_rmap()). So it's likely due to the changed layout of
    'struct mem_cgroup': either stat_cpu now falls into a constantly
    modified cacheline, or some hot fields stopped sharing a cacheline.

    I verified this by moving memory_events[] back to where it was:

    : --- a/include/linux/memcontrol.h
    : +++ b/include/linux/memcontrol.h
    : @@ -205,7 +205,6 @@ struct mem_cgroup {
    : int oom_kill_disable;
    :
    : /* memory.events */
    : - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
    : struct cgroup_file events_file;
    :
    : /* protect arrays of thresholds */
    : @@ -238,6 +237,7 @@ struct mem_cgroup {
    : struct mem_cgroup_stat_cpu __percpu *stat_cpu;
    : atomic_long_t stat[MEMCG_NR_STAT];
    : atomic_long_t events[NR_VM_EVENT_ITEMS];
    : + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
    :
    : unsigned long socket_pressure;

    And performance restored.

    Later investigation found that as long as the three fields
    moving_account, move_lock_task and stat_cpu are in the same cacheline,
    performance is good. To avoid future performance surprises from other
    commits changing the layout of 'struct mem_cgroup', this patch makes
    sure the three fields stay in the same cacheline.

    One concern with this approach: moving_account and move_lock_task can
    be modified when a process changes memory cgroup, while stat_cpu is an
    always-read field, so placing them in the same cacheline might hurt.
    I assume it is rare for a process to change memory cgroup, so this
    should be OK.
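
    As a user-space toy (hypothetical stand-in struct, not the kernel
    one), the pahole-style arithmetic for "do these fields share a
    64-byte cacheline" looks like:

    #include <stdio.h>
    #include <stddef.h>

    struct demo {
            char pad[64];           /* stands in for the preceding layout */
            _Bool moving_account;
            void *move_lock_task;
            void *stat_cpu;
    };

    int main(void)
    {
            size_t first = offsetof(struct demo, moving_account);
            size_t last = offsetof(struct demo, stat_cpu)
                            + sizeof(void *) - 1;

            printf("same cacheline: %s\n",
                   first / 64 == last / 64 ? "yes" : "no");
            return 0;
    }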

    Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
    Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
    Signed-off-by: Aaron Lu
    Reported-by: kernel test robot
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • kvmalloc warned about an incompatible gfp_mask to catch abusers
    (mostly GFP_NOFS) with the intention that this would motivate authors
    of the code to fix those. Linus argues that this just motivates
    people to do even more hacks like

    if (gfp == GFP_KERNEL)
            kvmalloc
    else
            kmalloc

    I haven't seen this happening much (Linus pointed to bucket_lock,
    which special-cases an atomic allocation, but my git foo hasn't found
    much more), but it is true that such cases can grow in the future.
    Therefore Linus suggested simply not falling back to vmalloc for
    incompatible gfp flags and rather sticking with the kmalloc path.
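
    A minimal sketch of the resulting check in kvmalloc_node() (the
    exact placement is in the patch):

    /* vmalloc is only usable with GFP_KERNEL-compatible flags; for
     * anything else behave exactly like kmalloc and let it fail */
    if ((flags & GFP_KERNEL) != GFP_KERNEL)
            return kmalloc_node(size, flags, node);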

    Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Linus Torvalds
    Cc: Tom Herbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When bit is equal to 0x4, it means OPT_ZONE_DMA32 should be obtained
    from GFP_ZONE_TABLE. OPT_ZONE_DMA32 is equal to either ZONE_DMA32 or
    ZONE_NORMAL, depending on CONFIG_ZONE_DMA32.

    Similarly, when bit is equal to 0xc, OPT_ZONE_DMA32 should be
    obtained for an allocation with the GFP_MOVABLE policy, so ZONE_DMA32
    or ZONE_NORMAL are again the possible result values.
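
    For context, the lookup being described is gfp_zone() in
    include/linux/gfp.h (quoted as of this kernel; GFP_ZONES_SHIFT bits
    per table entry, indexed by the low zone-selector bits):

    static inline enum zone_type gfp_zone(gfp_t flags)
    {
            enum zone_type z;
            int bit = (__force int) (flags & GFP_ZONEMASK);

            z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
                                     ((1 << GFP_ZONES_SHIFT) - 1);
            VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
            return z;
    }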

    Link: http://lkml.kernel.org/r/20180601163403.1032-1-yehs2007@zoho.com
    Signed-off-by: Huaisheng Ye
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Kate Stewart
    Cc: "Levin, Alexander (Sasha Levin)"
    Cc: Greg Kroah-Hartman
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huaisheng Ye
     
  • shmem/tmpfs uses pseudo vma to allocate page with correct NUMA policy.

    The pseudo vma doesn't have vm_page_prot set. We are going to encode
    the encryption KeyID in vm_page_prot. Having garbage there causes
    problems.

    Zero out all unused fields in the pseudo vma.
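
    A minimal sketch of the shape of the fix in shmem's pseudo-vma
    setup (details may differ from the patch):

    /* Create a pseudo vma that just contains the policy; zero
     * everything else instead of leaving stack garbage behind */
    memset(vma, 0, sizeof(*vma));
    /* Bias interleave by inode number to distribute better across nodes */
    vma->vm_pgoff = index + info->vfs_inode.i_ino;
    vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);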

    Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on the wrong node's list, then further list manipulations may
    corrupt the list because page_to_nid() is used to determine which
    node's list_lock should be locked, and thus we may take the wrong
    lock and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.
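
    The shape of the fix in __alloc_pages_slowpath(), sketched with
    abbreviated context:

    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
            /* the zonelist reset that broke __GFP_THISNODE is gone:
             *   ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
             */
            ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                            ac->high_zoneidx, ac->nodemask);
    }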

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • If a process monitored with userfaultfd changes its memory mappings
    or fork()s at the same time as the uffd monitor fills the process
    memory with UFFDIO_COPY, the actual creation of page table entries
    and copying of the data in mcopy_atomic may happen either before or
    after the memory mapping modifications, and there is no way for the
    uffd monitor to maintain a consistent view of the process memory
    layout.

    For instance, let's consider fork() running in parallel with
    userfaultfd_copy():

    process                          | uffd monitor
    ---------------------------------+------------------------------
    fork()                           | userfaultfd_copy()
    ...                              | ...
    dup_mmap()                       | down_read(mmap_sem)
    down_write(mmap_sem)             | /* create PTEs, copy data */
    dup_uffd()                       | up_read(mmap_sem)
    copy_page_range()                |
    up_write(mmap_sem)               |
    dup_uffd_complete()              |
    /* notify monitor */             |

    If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
    be present by the time copy_page_range() is called and they will appear
    in the child's memory mappings. However, if the fork() is the first to
    take the mmap_sem, the new pages won't be mapped in the child's address
    space.

    If the pages are not present and the child tries to access them, the
    monitor will get a page fault notification and everything is fine.
    However, if the pages *are present*, the child can access them
    without uffd noticing. And if we copy them into the child, it'll see
    the wrong data. Since we are talking about a background copy, we'd
    need to decide whether the pages should be copied or not regardless
    of #PF notifications.

    Since the userfaultfd monitor has no way to determine what the order
    was, let's disallow userfaultfd_copy in parallel with the
    non-cooperative events. In such a case we return -EAGAIN and the uffd
    monitor can understand that userfaultfd_copy() clashed with a
    non-cooperative event and take an appropriate action.
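
    A hedged sketch of the mechanism (flag name per the patch
    discussion; details may differ): non-cooperative event handlers mark
    the context, and the ioctl side checks the mark:

    /* in the fork/mremap/madvise event paths */
    WRITE_ONCE(ctx->mmap_changing, true);

    /* in userfaultfd_copy() and friends */
    ret = -EAGAIN;
    if (READ_ONCE(ctx->mmap_changing))
            goto out;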

    Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Currently an attempt to set swap.max to a value lower than the actual
    swap usage fails, which causes configuration problems as there's no way
    of lowering the configuration below the current usage short of turning
    off swap entirely. This makes swap.max difficult to use and allows
    delegatees to lock the delegator out of reducing swap allocation.

    This patch updates swap_max_write() so that the limit can be lowered
    below the current usage. It doesn't implement active reclaiming of swap
    entries for the following reasons.

    * mem_cgroup_swap_full() already tells the swap machinery to
    aggressively reclaim swap entries if the usage is above 50% of the
    limit, so simply lowering the limit automatically triggers gradual
    reclaim.

    * Forcing back swapped-out pages is likely to heavily impact the
    workload and mess up the working set. Given that swap usually is a
    lot less valuable and less scarce, letting the existing usage
    dissipate over time through the above gradual reclaim and as pages
    are faulted back in is likely the better behavior.
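
    A minimal sketch of the resulting swap_max_write() logic (field
    naming follows the concurrent page_counter rename; older trees call
    it .limit):

    /* accept any new max, even below the current usage; the excess
     * dissipates via the gradual reclaim described above */
    xchg(&memcg->swap.max, max);
    return nbytes;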

    Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    vmf_error() is the newly introduced inline function in 4.17-rc6.
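
    The conversion pattern, sketched on a hypothetical fault handler
    (do_something() is a made-up helper):

    static vm_fault_t example_fault(struct vm_fault *vmf)   /* was: int */
    {
            int err = do_something(vmf);

            if (err)
                    return vmf_error(err);  /* maps errno to VM_FAULT_* */
            return VM_FAULT_NOPAGE;
    }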

    Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Christoph doubts anyone was using the 'reserved' file in sysfs, so remove
    it.

    Link: http://lkml.kernel.org/r/20180518194519.3820-17-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The reserved field was only used for embedding an rcu_head in the data
    structure. With the previous commit, we no longer need it. That lets us
    remove the 'reserved' argument to a lot of functions.

    Link: http://lkml.kernel.org/r/20180518194519.3820-16-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • rcu_head may now grow larger than list_head without affecting slab or
    slub.

    Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Make hmm_data an explicit member of the struct page union.

    Link: http://lkml.kernel.org/r/20180518194519.3820-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • For pgd page table pages, x86 overloads the page->index field to store a
    pointer to the mm_struct. Rename this to pt_mm so it's visible to other
    users.
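
    The x86 accessors then read naturally; a sketch against the helpers
    in arch/x86/mm/pgtable.c:

    static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
    {
            virt_to_page(pgd)->pt_mm = mm;  /* was: ->index = (pgoff_t)mm */
    }

    static struct mm_struct *pgd_page_get_mm(struct page *page)
    {
            return page->pt_mm;     /* was: (struct mm_struct *)page->index */
    }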

    Link: http://lkml.kernel.org/r/20180518194519.3820-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Rewrite the documentation to describe what you can use in struct page
    rather than what you can't.

    Link: http://lkml.kernel.org/r/20180518194519.3820-12-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Randy Dunlap
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This gives us five words of space in a single union in struct page. The
    compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
    systems, but that does not seem likely to cause any trouble.

    Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Since the LRU is two words, this does not affect the double-word alignment
    of SLUB's freelist.

    Link: http://lkml.kernel.org/r/20180518194519.3820-10-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Now that we can represent the location of 'deferred_list' in C instead of
    comments, make use of that ability.

    Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • By combining these three one-word unions into one three-word union, we
    make it easier for users to add their own multi-word fields to struct
    page, as well as making it obvious that SLUB needs to keep its double-word
    alignment for its freelist & counters.

    No field moves position; verified with pahole.

    Link: http://lkml.kernel.org/r/20180518194519.3820-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Keeping the refcount in the union only encourages people to put something
    else in the union which will overlap with _refcount and eventually explode
    messily. pahole reports no fields change location.

    Link: http://lkml.kernel.org/r/20180518194519.3820-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • By moving page->private to the fourth word of struct page, we can put the
    SLUB counters in the same word as SLAB's s_mem and still do the
    cmpxchg_double trick. Now the SLUB counters no longer overlap with the
    mapcount or refcount so we can drop the call to page_mapcount_reset() and
    simplify set_page_slub_counters() to a single line.
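
    With the overlap gone, the SLUB helper collapses to a plain store; a
    sketch of the simplification in mm/slub.c:

    /* before: frozen/inuse/objects were copied field by field through
     * a temporary page so the overlapping _refcount stayed intact */

    /* after: counters no longer shares a word with _refcount */
    static inline void set_page_slub_counters(struct page *page,
                                              unsigned long counters_new)
    {
            page->counters = counters_new;
    }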

    Link: http://lkml.kernel.org/r/20180518194519.3820-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox