15 Sep, 2017

1 commit

  • Pull more set_fs removal from Al Viro:
    "Christoph's 'use kernel_read and friends rather than open-coding
    set_fs()' series"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: unexport vfs_readv and vfs_writev
    fs: unexport vfs_read and vfs_write
    fs: unexport __vfs_read/__vfs_write
    lustre: switch to kernel_write
    gadget/f_mass_storage: stop messing with the address limit
    mconsole: switch to kernel_read
    btrfs: switch write_buf to kernel_write
    net/9p: switch p9_fd_read to kernel_write
    mm/nommu: switch do_mmap_private to kernel_read
    serial2002: switch serial2002_tty_write to kernel_{read/write}
    fs: make the buf argument to __kernel_write a void pointer
    fs: fix kernel_write prototype
    fs: fix kernel_read prototype
    fs: move kernel_read to fs/read_write.c
    fs: move kernel_write to fs/read_write.c
    autofs4: switch autofs4_write to __kernel_write
    ashmem: switch to ->read_iter
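
    A hedged before/after sketch of the pattern this series removes
    (kernel_read()'s four-argument form is the one the series introduces;
    the surrounding variable names are illustrative):

    /* Old pattern: widen the address limit so vfs_read() accepts a
     * kernel pointer, then restore it. */
    mm_segment_t old_fs = get_fs();
    set_fs(KERNEL_DS);
    ret = vfs_read(file, (char __user *)buf, count, &pos);
    set_fs(old_fs);

    /* New pattern: no address-limit games. */
    ret = kernel_read(file, buf, count, &pos);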

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • global_page_state is error prone, as a recent bug report pointed out
    [1]. It only returns proper values for zone-based counters, as the enum
    it takes suggests. We already have global_node_page_state, so let's
    rename global_page_state to global_zone_page_state to be more explicit
    here. All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
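
    As a hedged kernel-context sketch, the rename keeps the two counter
    families visibly distinct (NR_FILE_DIRTY as the node-side example is
    illustrative, not from the grep output above):

    #include <linux/vmstat.h>

    /* Zone-based counters go through the renamed helper... */
    unsigned long free_pages = global_zone_page_state(NR_FREE_PAGES);

    /* ...while node-based counters keep using global_node_page_state(). */
    unsigned long dirty_pages = global_node_page_state(NR_FILE_DIRTY);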

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 May, 2017

2 commits

  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular:

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to give __GFP_HIGHMEM along with other flags, because there is really no
    reason to consume precious low memory on CONFIG_HIGHMEM systems for pages
    which are mapped to the kernel vmalloc space. About half of the users
    don't use this flag, though. This suggests that the API is unnecessarily
    complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.
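
    A hedged before/after sketch, using the three-argument __vmalloc() of
    that era:

    /* Before: callers had to remember the flag themselves. */
    buf = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);

    /* After: __GFP_HIGHMEM is implied for the backing pages. */
    buf = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);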

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "kvmalloc", v5.

    There are many open-coded kmalloc with vmalloc fallback instances in the
    tree. Most of them are not careful enough or simply do not care about
    the underlying semantics of the kmalloc/page allocator, which means that
    a) some vmalloc fallbacks are basically unreachable because the kmalloc
    part will keep retrying until it succeeds, and b) the page allocator can
    invoke really disruptive steps like the OOM killer to move forward,
    which doesn't sound appropriate when we consider that the vmalloc
    fallback is available.

    As can be seen, implementing kvmalloc requires quite an intimate
    knowledge of the page allocator and the memory reclaim internals, which
    strongly suggests that a helper should be implemented in the memory
    subsystem proper.

    Most callers I could find have been converted to use the helper instead.
    This is patch 6. There are some more relying on __GFP_REPEAT in the
    networking stack which I have converted as well, and Eric Dumazet was
    not opposed [2] to converting them.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with the vmalloc fallback for larger allocations is a
    common pattern in the kernel code. Yet we do not have any common helper
    for that, so users have invented their own helpers. Some of them are
    really creative when doing so. Let's just add kv[mz]alloc and make sure
    it is implemented properly. This implementation makes sure not to create
    large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
    not to warn about allocation failures. This also rules out the OOM
    killer, as the vmalloc is a more appropriate fallback than a disruptive
    user-visible action.

    This patch also changes some existing users and removes helpers which
    are specific to them. In some cases this is not possible (e.g.
    ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
    require GFP_NO{FS,IO} context, which is not vmalloc compatible in
    general (note that the page table allocation is GFP_KERNEL). Those need
    to be fixed separately.

    While we are at it, document the unsupported gfp masks for
    __vmalloc{_node}, because there seems to be a lot of confusion out
    there. kvmalloc_node will warn about flags incompatible with GFP_KERNEL
    (i.e. not a superset of it) to catch new abusers. Existing ones would
    have to die slowly.
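
    A minimal sketch of the open-coded pattern being consolidated and of the
    new helper (the exact gfp flags of the old pattern varied per caller):

    /* Open-coded fallback this series replaces: */
    void *buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
    if (!buf)
            buf = vmalloc(size);
    kvfree(buf);            /* kvfree() picks the right free path */

    /* Equivalent with the new helper: */
    buf = kvmalloc(size, GFP_KERNEL);
    kvfree(buf);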

    [sfr@canb.auug.org.au: f2fs fixup]
    Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Reviewed-by: Andreas Dilger [ext4 part]
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Mar, 2017

1 commit

  • Pull sched.h split-up from Ingo Molnar:
    "The point of these changes is to significantly reduce the
    <linux/sched.h> header footprint, to speed up the kernel build and to
    have a cleaner header structure.

    After these changes the new <linux/sched.h>'s typical preprocessed
    size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
    lines), which is around 40% faster to build on typical configs.

    Not much changed from the last version (-v2) posted three weeks ago: I
    eliminated quirks, backmerged fixes plus I rebased it to an upstream
    SHA1 from yesterday that includes most changes queued up in -next plus
    all sched.h changes that were pending from Andrew.

    I've re-tested the series both on x86 and on cross-arch defconfigs,
    and did a bisectability test at a number of random points.

    I tried to test as many build configurations as possible, but some
    build breakage is probably still left - but it should be mostly
    limited to architectures that have no cross-compiler binaries
    available on kernel.org, and non-default configurations"

    * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
    sched/headers: Clean up
    sched/headers: Remove #ifdefs from
    sched/headers: Remove the include from
    sched/headers, hrtimer: Remove the include from
    sched/headers, x86/apic: Remove the header inclusion from
    sched/headers, timers: Remove the include from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/core: Remove unused prefetch_stack()
    sched/headers: Remove from
    sched/headers: Remove the 'init_pid_ns' prototype from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the runqueue_is_locked() prototype
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the include from
    sched/headers: Remove from
    ...

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Pull vfs pile two from Al Viro:

    - orangefs fix

    - series of fs/namei.c cleanups from me

    - VFS stuff coming from overlayfs tree

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    orangefs: Use RCU for destroy_inode
    vfs: use helper for calling f_op->fsync()
    mm: use helper for calling f_op->mmap()
    vfs: use helpers for calling f_op->{read,write}_iter()
    vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
    vfs: extract common parts of {compat_,}do_readv_writev()
    vfs: wrap write f_ops with file_{start,end}_write()
    vfs: deny copy_file_range() for non regular files
    vfs: deny fallocate() on directory
    vfs: create vfs helper vfs_tmpfile()
    namei.c: split unlazy_walk()
    namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
    lookup_fast(): clean up the logics around the fallback to non-rcu mode
    namei: fold unlazy_link() into its sole caller
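
    A hedged kernel-context sketch of the helper shape behind the "use
    helper for calling f_op->..." entries above (assumed form of the fsync
    variant):

    static inline int call_fsync(struct file *file, loff_t start, loff_t end,
                                 int datasync)
    {
            /* One choke point for every ->fsync() invocation. */
            return file->f_op->fsync(file, start, end, datasync);
    }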

    Linus Torvalds
     

02 Mar, 2017

3 commits

  • Overlayfs-related series from Miklos and Amir

    Al Viro
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h>, to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.
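
    A sketch of the placeholder's assumed initial form:

    /* include/linux/sched/mm.h -- placeholder stage */
    #ifndef _LINUX_SCHED_MM_H
    #define _LINUX_SCHED_MM_H

    /* Everything still lives in <linux/sched.h> for now. */
    #include <linux/sched.h>

    #endif /* _LINUX_SCHED_MM_H */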

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched.h> to <linux/mm_types>

    The <linux/sched.h> header includes various vmacache related defines,
    which are arguably misplaced.

    Move them to mm_types.h and minimize the sched.h impact by putting
    all task vmacache state into a new 'struct vmacache' structure.

    No change in functionality.
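
    A hedged sketch of the new structure and its task_struct embedding
    (field names assumed):

    struct vmacache {
            u64 seqnum;                                 /* invalidation counter */
            struct vm_area_struct *vmas[VMACACHE_SIZE]; /* per-task VMA cache */
    };

    /* ...and in struct task_struct: */
    struct vmacache vmacache;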

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

25 Feb, 2017

3 commits

  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped. The
    addition of UFFD_EVENT_UNMAP allows the uffd monitor to track changes
    in the virtual memory layout precisely.

    Since there might be different uffd contexts for the affected VMAs, we
    should first create a temporary representation of the unmap event for
    each uffd context and then deliver them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
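
    A hedged userspace sketch of a monitor consuming the new event
    (forget_range() is a hypothetical bookkeeping helper; UNMAP shares the
    start/end fields with the remove event in the uffd_msg layout):

    #include <linux/userfaultfd.h>
    #include <unistd.h>

    extern void forget_range(__u64 start, __u64 end);  /* hypothetical */

    static void drain_uffd_events(int uffd)
    {
            struct uffd_msg msg;

            while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
                    if (msg.event == UFFD_EVENT_UNMAP)
                            /* The range is gone; stop copying into it. */
                            forget_range(msg.arg.remove.start,
                                         msg.arg.remove.end);
            }
    }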

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • mmap_init() is no longer associated with the VMA slab. So fix it.

    Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
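
    A before/after sketch of the callback signatures (int returns per the
    mm of that time):

    /* Before: */
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* After: the VMA is reachable as vmf->vma. */
    int (*fault)(struct vm_fault *vmf);
    int (*page_mkwrite)(struct vm_fault *vmf);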

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

1 commit

  • show_mem() allows filtering out node-specific data which is irrelevant
    to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is
    done in skip_free_areas_node, which skips all nodes that are not in the
    mems_allowed of the current process. This works as expected most of the
    time because the nodemask shouldn't be outside of the allocating task's,
    but there are some exceptions, e.g. memory hotplug might want to request
    allocations from outside of the allowed nodes (see new_node_page).

    Get rid of this hardcoded behavior, push the allocation mask down the
    show_mem path and use it instead of cpuset_current_mems_allowed. A NULL
    nodemask is interpreted as cpuset_current_mems_allowed.
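
    A hedged sketch of the resulting interface (signature assumed from the
    description above):

    void show_mem(unsigned int filter, nodemask_t *nodemask);

    /* NULL keeps the old behaviour: */
    show_mem(SHOW_MEM_FILTER_NODES, NULL); /* == cpuset_current_mems_allowed */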

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Dec, 2016

4 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Currently we have two different structures for passing fault
    information around - struct vm_fault and struct fault_env. DAX will
    need more information in struct vm_fault to handle its faults, so the
    content of that structure would become even closer to fault_env.
    Furthermore it would need to generate struct fault_env to be able to
    call some of the generic functions. So at this point I don't think
    there's much use in keeping these two structures separate. Just embed
    into struct vm_fault all that is needed to use it for both purposes.
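
    A hedged sketch of the merged structure (a representative subset of
    fields, not the full definition):

    struct vm_fault {
            struct vm_area_struct *vma;   /* target VMA, from fault_env */
            unsigned int flags;           /* FAULT_FLAG_* */
            pgoff_t pgoff;                /* logical page offset */
            unsigned long address;        /* faulting virtual address */
            pmd_t *pmd;                   /* PMD entry, from fault_env */
            pte_t *pte;                   /* mapped PTE, if any */
            struct page *page;            /* ->fault handler's result */
    };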

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Unexport the low-level __get_user_pages_unlocked() function and replace
    invocations with calls to more appropriate higher-level functions.

    In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
    with get_user_pages_unlocked() since we can now pass gup_flags.

    In async_pf_execute() and process_vm_rw_single_vec() we need to pass
    different tsk, mm arguments, so get_user_pages_remote() is the sane
    replacement in these cases (having added manual acquisition and release
    of mmap_sem).

    Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
    flag. However, this flag was originally silently dropped by commit
    1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
    appears to have been unintentional and reintroducing it is therefore not
    an issue.
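
    A hedged sketch of the tsk/mm replacement pattern (seven-argument form;
    a 'locked' parameter was added to get_user_pages_remote() around the
    same time):

    down_read(&mm->mmap_sem);
    /* FOLL_TOUCH is applied internally by the remote variant. */
    pinned = get_user_pages_remote(tsk, mm, addr, 1, gup_flags, &page, NULL);
    up_read(&mm->mmap_sem);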

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reached a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs, which the last commit in this
    series reflects. While technically it is a revert, the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix, which
    started a feedback cycle of finding additional issues during review
    and fixing them, culminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     

23 Nov, 2016

1 commit

  • It is the reasonable expectation that if an executable file is not
    readable, there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach, but if ptrace is
    already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm, spin
    off a variant called ptrace_access_vm that will fail if the target
    process is not being ptraced by the current process, or if the current
    process did not have sufficient privileges when ptracing began to read
    the target process's mm.

    In the ptrace implementations, replace access_process_vm with
    ptrace_access_vm. There remain several ptrace sites that still use
    access_process_vm, as they are reading the target executable's
    instructions (for kernel consumption) or register stacks. As such it
    does not appear necessary to add a permission check to those calls.

    This bug has always existed in Linux.
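
    A hedged sketch of the guard the new variant adds before performing the
    access (assumed shape; the dumpability check mirrors the description in
    the series):

    int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
                         void *buf, int len, unsigned int gup_flags)
    {
            struct mm_struct *mm = get_task_mm(tsk);
            int ret;

            if (!mm)
                    return 0;
            /* Only the active ptracer, judged by the creds it had when
             * ptracing began, may read the target's memory this way. */
            if (!tsk->ptrace || tsk->parent != current ||
                ((get_dumpable(mm) != SUID_DUMP_USER) &&
                 !ptracer_capable(tsk, mm->user_ns))) {
                    mmput(mm);
                    return 0;
            }
            ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
            mmput(mm);
            return ret;
    }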

    Fixes: v1.0
    Cc: stable@vger.kernel.org
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Nov, 2016

1 commit

  • The flush_icache_range() API is meant to be used on kernel addresses
    only as it may not have the infrastructure (exception entries) to handle
    user memory faults.

    The lkdtm execute_user_location() function tests the kernel execution
    of user space addresses by mmap'ing an anonymous page, copying some
    code into it together with cache maintenance, and attempting to run it.
    However, the cache maintenance step may fail because of the incorrect
    API usage described above. The patch changes lkdtm to use
    access_process_vm() for copying the code into user space, which takes
    care of the necessary cache maintenance.
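
    A hedged sketch of the resulting copy step (buffer names illustrative):

    /* The write goes through the kernel mapping of the page, so the
     * necessary cache maintenance happens on kernel addresses. */
    copied = access_process_vm(current, user_addr, code_buf, code_len,
                               FOLL_WRITE);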

    Signed-off-by: Catalin Marinas
    [kees: export access_process_vm() for module use]
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     

25 Oct, 2016

1 commit

  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to be
    passed through get_user_pages(), which eliminates the need for access
    to this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
      get_user_pages(start, 1, flags, page, NULL)
      __get_user_pages_locked(current, current->mm, start, 1, page,
                              NULL, NULL, false, flags | FOLL_TOUCH)
      __get_user_pages(current, current->mm, start, 1,
                       flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
      get_user_pages(addr, 1, flags, NULL, NULL)
      __get_user_pages_locked(current, current->mm, addr, 1, NULL,
                              NULL, NULL, false, flags | FOLL_TOUCH)
      __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH,
                       NULL, NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

19 Oct, 2016

7 commits

  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
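
    A hedged before/after sketch of a call site:

    /* Before: 'write' as an int, FOLL_FORCE implied. */
    access_process_vm(tsk, addr, buf, len, 1);

    /* After: both the write and the force are spelled out. */
    access_process_vm(tsk, addr, buf, len, FOLL_WRITE | FOLL_FORCE);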

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.
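
    A hedged before/after sketch (six arguments before, five after):

    /* Before: write = 1, force = 0. */
    ret = get_user_pages(start, nr_pages, 1, 0, pages, NULL);

    /* After: flags are explicit. */
    ret = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);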

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

27 Jul, 2016

1 commit

  • The idea is borrowed from Peter's patch from the patchset on
    speculative page faults [1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context
    without endless function signature changes.

    The changes are mostly mechanical, with the exception of the
    faultaround code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
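
    A hedged sketch of the single context structure this introduces
    (a representative subset of fields):

    struct fault_env {
            struct vm_area_struct *vma;  /* faulting VMA */
            unsigned long address;       /* faulting address */
            unsigned int flags;          /* FAULT_FLAG_* */
            pmd_t *pmd;                  /* PMD the fault walks */
            pte_t *pte;                  /* mapped PTE, if any */
            spinlock_t *ptl;             /* page table lock, if held */
            pgtable_t prealloc_pte;      /* preallocated for faultaround */
    };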

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

28 May, 2016

1 commit

  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
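
    A hedged before/after sketch of a caller:

    /* Before: address-or-error, decoded with IS_ERR_VALUE(). */
    unsigned long map = vm_brk(start, len);
    if (IS_ERR_VALUE(map))
            return (int)map;

    /* After: plain zero-or-negative-errno. */
    int ret = vm_brk(start, len);
    if (ret)
            return ret;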

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Apr, 2016

1 commit

  • The pkeys changes brought about a truly hideous set of macros in:

    cde70140fed8 ("mm/gup: Overload get_user_pages() functions")

    ... which macros are (ab-)using the fact that __VA_ARGS__ can be used
    to shift parameter positions in macro arguments without breaking the
    build and so can be used to call separate C functions depending on
    the number of arguments of the macro.

    This allowed easy migration of these 3 GUP APIs, as both these variants
    worked at the C level:

    old:
    ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

    new:
    ret = get_user_pages(address, 1, 1, 0, &page, NULL);

    ... while we also generated a (functionally harmless but noticeable) build
    time warning if the old API was used. As there are over 300 uses of these
    APIs, this trick eased the migration of the API and avoided excessive
    migration pain in linux-next.

    Now, with its work done, get rid of all of that complication and ugliness:

    3 files changed, 16 insertions(+), 140 deletions(-)

    ... where the linecount of the migration hack was further inflated by the
    fact that there are NOMMU variants of these GUP APIs as well.

    Much of the conversion was done in linux-next over the past couple of
    months, and Linus recently removed all remaining old API uses from the
    upstream tree in the following upstream commit:

    cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")

    There was one more old-API usage in mm/gup.c, in the
    CONFIG_HAVE_GENERIC_RCU_GUP code path that ARM, ARM64 and PowerPC use.

    After this commit any old API usage will break the build.

    [ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

19 Feb, 2016

1 commit

  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

1 commit

  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.
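
    A generic, self-contained sketch of the __VA_ARGS__ arity-dispatch
    trick described above (names hypothetical, not the kernel's actual
    macros):

    #include <stdio.h>

    static int gup2(int a, int b)        { return a + b; }
    static int gup3(int a, int b, int c) { return a + b + c; }

    /* PICK() expands to its 4th argument, so the number of arguments
     * passed to gup() selects which function name lands there. */
    #define PICK(_1, _2, _3, NAME, ...) NAME
    #define gup(...) PICK(__VA_ARGS__, gup3, gup2, unused)(__VA_ARGS__)

    int main(void)
    {
            printf("%d %d\n", gup(1, 2), gup(1, 2, 3)); /* prints "3 6" */
            return 0;
    }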

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen