06 Jan, 2017

1 commit

  • commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 upstream.

    It is a reasonable expectation that if an executable file is not
    readable there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach, but if ptrace
    is already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm,
    introduce a variant called ptrace_access_vm that will fail if the
    target process is not being ptraced by the current process, or if
    the current process did not have sufficient privileges to read the
    target process's mm when ptracing began.

    In the ptrace implementations, replace access_process_vm with
    ptrace_access_vm. Several ptrace call sites still use
    access_process_vm, as they read the target executable's
    instructions (for kernel consumption) or register stacks; as such it
    does not appear necessary to add a permission check to those calls.
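
    As a rough sketch of the rule being added, the check can be modeled in a
    few lines of userspace C; the struct, its fields, and the helper name
    below are hypothetical stand-ins, not the kernel's actual data
    structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins -- not the kernel's real task or mm
 * structures -- just to illustrate the rule ptrace_access_vm enforces:
 * refuse access unless the target is ptraced by the caller AND the
 * caller had sufficient privileges when tracing began. */
struct task {
        struct task *ptracer;   /* who ptraced us; NULL if untraced */
        int tracer_may_read;    /* privilege snapshot taken at attach time */
};

static int ptrace_may_access_vm(const struct task *current_task,
                                const struct task *target)
{
        return target->ptracer == current_task && target->tracer_may_read;
}
```

    The point is that the privilege decision is captured at attach time, so
    a later exec of an unreadable binary cannot be read through a tracer
    that never had the right to read it.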

    This bug has always existed in Linux.

    Fixes: v1.0
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

25 Oct, 2016

1 commit

  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to be
    passed through get_user_pages(), which eliminates the need for access to
    this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
      get_user_pages(start, 1, flags, page, NULL)
        __get_user_pages_locked(current, current->mm, start, 1, page, NULL,
                                NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
                           flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
      get_user_pages(addr, 1, flags, NULL, NULL)
        __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL,
                                NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH,
                           NULL, NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

19 Oct, 2016

7 commits

  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
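
    The shape of the interface change can be sketched as follows; the flag
    values and helper names are illustrative stand-ins, not the kernel's
    actual FOLL_* definitions:

```c
#include <assert.h>

/* Illustrative flag values only -- not the kernel's actual FOLL_* bits.
 * Before: a boolean 'write' argument, with FOLL_FORCE added silently
 * inside the function. After: the caller passes an explicit gup_flags
 * mask, so forcing is visible (and omittable) at the call site. */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u

/* old-style interface: FOLL_FORCE implied behind the caller's back */
static unsigned int effective_flags_old(int write)
{
        return (write ? FOLL_WRITE : 0u) | FOLL_FORCE;
}

/* new-style interface: flags are exactly what the caller asked for */
static unsigned int effective_flags_new(unsigned int gup_flags)
{
        return gup_flags;
}
```

    A caller that does not actually need forcing can now simply leave
    FOLL_FORCE out, which was impossible with the boolean interface.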

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

27 Jul, 2016

1 commit

    The idea is borrowed from Peter's patch from the patchset on speculative
    page faults [1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical with exception of faultaround code:
    filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

28 May, 2016

1 commit

  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
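
    The old and new conventions can be contrasted with a small sketch;
    these are hypothetical stand-ins that only model the return values,
    not the real vm_brk():

```c
#include <assert.h>
#include <errno.h>

/* Old style: start address on success, an errno encoded into an
 * unsigned long on failure, distinguished via IS_ERR_VALUE().
 * New style: plain 0 on success, negative errno on failure. */
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-4095)

static unsigned long vm_brk_old(unsigned long addr, int fail)
{
        return fail ? (unsigned long)-ENOMEM : addr;
}

static int vm_brk_new(int fail)
{
        return fail ? -ENOMEM : 0;
}
```

    With the new convention a caller just tests the sign of the result,
    and no address-vs-error disambiguation is needed.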

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     


07 Apr, 2016

1 commit

  • The pkeys changes brought about a truly hideous set of macros in:

    cde70140fed8 ("mm/gup: Overload get_user_pages() functions")

    ... which macros are (ab-)using the fact that __VA_ARGS__ can be used
    to shift parameter positions in macro arguments without breaking the
    build and so can be used to call separate C functions depending on
    the number of arguments of the macro.

    This allowed easy migration of these 3 GUP APIs, as both these variants
    worked at the C level:

    old:
    ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

    new:
    ret = get_user_pages(address, 1, 1, 0, &page, NULL);

    ... while we also generated a (functionally harmless but noticeable) build
    time warning if the old API was used. As there are over 300 uses of these
    APIs, this trick eased the migration of the API and avoided excessive
    migration pain in linux-next.

    Now, with its work done, get rid of all of that complication and ugliness:

    3 files changed, 16 insertions(+), 140 deletions(-)

    ... where the linecount of the migration hack was further inflated by the
    fact that there are NOMMU variants of these GUP APIs as well.

    Much of the conversion was done in linux-next over the past couple of
    months, and Linus recently removed all remaining old API uses from the
    upstream tree in the following upstream commit:

    cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")

    There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
    code path that ARM, ARM64 and PowerPC use.

    After this commit any old API usage will break the build.

    [ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Apr, 2016

1 commit

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     


19 Feb, 2016

1 commit

  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

1 commit

  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.
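
    The trick can be modeled in miniature as follows; the helper names and
    argument counts are illustrative, far smaller than the real GUP
    signatures:

```c
#include <assert.h>

/* Minimal userspace model of the __VA_ARGS__ dispatch trick: the macro
 * counts its arguments by letting extra arguments shift which name
 * lands in the selector slot, so old long-form and new short-form call
 * sites both keep compiling against the same macro. */
static int gup_short(int addr, int nr) { return addr + nr; }

static int gup_long(int tsk, int mm, int addr, int nr)
{
        (void)tsk; (void)mm;        /* deprecated tsk/mm arguments */
        return addr + nr;
}

/* Selects the 5th argument; each caller argument shifts the pick. */
#define GUP_PICK(_1, _2, _3, _4, name, ...) name
#define get_user_pages_demo(...) \
        GUP_PICK(__VA_ARGS__, gup_long, gup_long, gup_short, gup_short) \
                (__VA_ARGS__)
```

    Two-argument calls resolve to gup_short() and four-argument calls to
    gup_long(), which is the whole mechanism behind the migration macros.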

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

06 Nov, 2015

2 commits

  • (1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
    care less and the change is ok.

    (2) ppc and mips, which have HAVE_ARCH_BUG_ON, do not rely on branch
    predictions, as it seems to be pointless [1], and thus callers should not
    be trying to push an optimization in the first place.

    (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an
    unlikely compiler flag already.

    Hence, we can drop unlikely behind BUG_ON().

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html
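
    A userspace model of the redundancy being removed (simplified
    definitions, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* BUG_ON() already wraps its condition in unlikely(), so spelling
 * BUG_ON(unlikely(cond)) at a call site adds nothing: the branch hint
 * is applied either way. */
#define unlikely(x) __builtin_expect(!!(x), 0)
#define BUG_ON(cond) do { if (unlikely(cond)) abort(); } while (0)

static int checked_value(int v)
{
        BUG_ON(v < 0);              /* after the patch */
        BUG_ON(unlikely(v < 0));    /* before: redundant but harmless */
        return v;
}
```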

    Signed-off-by: Geliang Tang
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
    linux/mm.h provides the offset_in_page() macro. Let's use this
    predefined macro instead of open-coding (addr & ~PAGE_MASK).
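
    For reference, the macro amounts to the following, shown here under an
    assumed 4 KiB page size:

```c
#include <assert.h>

/* offset_in_page() yields the byte offset of an address within its
 * page, i.e. the low bits below the page boundary. PAGE_SIZE is fixed
 * at 4096 here purely for illustration. */
#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
```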

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     

11 Sep, 2015

1 commit

  • Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
    rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
    wrapper on top of do_mmap(). Perhaps we should update the callers of
    do_mmap_pgoff() and kill it later.

    This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not
    play with vm internals.

    After this change mmap_region() has a single user outside of mmap.c,
    arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
    change arch/tile/ and unexport mmap_region().
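
    The wrapper pattern can be sketched with toy stand-ins (the real
    do_mmap() takes many more arguments and actually maps memory; the
    flag value here is illustrative):

```c
#include <assert.h>

#define VM_MPX_DEMO 0x8ul            /* illustrative vm_flags bit */

/* do_mmap() gains the vm_flags parameter... */
static unsigned long do_mmap_demo(unsigned long addr, unsigned long vm_flags)
{
        return addr | vm_flags;      /* placeholder for the real work */
}

/* ...and do_mmap_pgoff() survives as a trivial wrapper that passes no
 * extra vm_flags, so existing callers are unaffected. */
static unsigned long do_mmap_pgoff_demo(unsigned long addr)
{
        return do_mmap_demo(addr, 0);
}
```

    A caller like mpx_mmap() can then request VM_MPX directly through
    do_mmap_demo() instead of poking at vm internals afterwards.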

    [kirill@shutemov.name: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Dave Hansen
    Tested-by: Dave Hansen
    Signed-off-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Sep, 2015

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
    fixes, Documentation and MAINTAINERS updates)"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
    MAINTAINERS: update my e-mail address
    mod_devicetable: add space before */
    scsi: a100u2w: trivial typo in printk
    i2c: Fix typo in i2c-bfin-twi.c
    treewide: fix typos in comment blocks
    Doc: fix trivial typo in SubmittingPatches
    proportions: Spelling s/consitent/consistent/
    dm: Spelling s/consitent/consistent/
    aic7xxx: Fix typo in error message
    pcmcia: Fix typo in locking documentation
    scsi/arcmsr: Fix typos in error log
    drm/nouveau/gr: Fix typo in nv10.c
    [SCSI] Fix printk typos in drivers/scsi
    staging: comedi: Grammar s/Enable support a/Enable support for a/
    Btrfs: Spelling s/consitent/consistent/
    README: GTK+ is a acronym
    ASoC: omap: Fix typo in config option description
    mm: tlb.c: Fix error message
    ntfs: super.c: Fix error log
    fix typo in Documentation/SubmittingPatches
    ...

    Linus Torvalds
     


10 Jul, 2015

1 commit

    Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no executable files on proc or sysfs.
    Having any executable files show up on proc or sysfs would cause
    a user-space visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only user
    visible effect will be that exectuables will be treated as if the
    execute bit is cleared.

    The filesystems proc and sysfs do not currently incorporate any
    executable files, so this does not result in any user visible effects.

    This makes it unnecessary to vet changes to proc and sysfs tightly for
    added executable files, or for changes to chattr that would modify
    existing files, as no matter what the individual files say they will
    not be treated as executable files by the vfs.

    Not having to vet changes so closely is important, as without this we
    are only one proc_create call (or another goof-up in the
    implementation of notify_change) away from having problematic
    executables on proc. Those mistakes are all too easy to make, and
    would create a situation where there are security issues or where the
    assumptions of some program have to be broken (causing userspace
    regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Jul, 2015

1 commit

  • …scm/linux/kernel/git/paulg/linux

    Pull module_init replacement part two from Paul Gortmaker:
    "Replace module_init with appropriate alternate initcall in non
    modules.

    This series converts non-modular code that is using the module_init()
    call to hook itself into the system to instead use one of our
    alternate priority initcalls.

    Unlike the previous series that used device_initcall and hence was a
    runtime no-op, these commits change to one of the alternate initcalls,
    because (a) we have them and (b) it seems like the right thing to do.

    For example, it would seem logical to use arch_initcall for arch
    specific setup code and fs_initcall for filesystem setup code.

    This does mean however, that changes in the init ordering will be
    taking place, and so there is a small risk that some kind of implicit
    init ordering issue may lie uncovered. But I think it is still better
    to give these ones sensible priorities than to just assign them all to
    device_initcall in order to exactly preserve the old ordering.

    That said, we have already made similar changes in core kernel code in
    commit c96d6660dc65 ("kernel: audit/fix non-modular users of
    module_init in core code") without any regressions reported, so this
    type of change isn't without precedent. It has also got the same
    local testing and linux-next coverage as all the other pull requests
    that I'm sending for this merge window have got.

    Once again, there is an unused module_exit function removal that shows
    up as an outlier upon casual inspection of the diffstat"

    * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
    x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
    mm/page_owner.c: use late_initcall to hook in enabling
    lib/list_sort: use late_initcall to hook in self tests
    arm: use subsys_initcall in non-modular pl320 IPC code
    powerpc: don't use module_init for non-modular core hugetlb code
    powerpc: use subsys_initcall for Freescale Local Bus
    x86: don't use module_init for non-modular core bootflag code
    netfilter: don't use module_init/exit in core IPV4 code
    fs/notify: don't use module_init for non-modular inotify_user code
    mm: replace module_init usages with subsys_initcall in nommu.c

    Linus Torvalds
     

25 Jun, 2015

1 commit

    kenter/kleave/kdebug are wrapper macros used to print function flow and
    debug information. This set was written before pr_devel() was
    introduced, so it was controlled by an "#if 0" construct. It is
    questionable whether anyone is using them [1] now.

    This patch removes these macros, converts numerous printk(KERN_WARNING,
    ...) to use general pr_warn(...) and removes debug print line from
    validate_mmap_request() function.

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Romanovsky
     

17 Jun, 2015

1 commit

  • Compiling some arm/m68k configs with "# CONFIG_MMU is not set" reveals
    two more instances of module_init being used for code that can't
    possibly be modular, as CONFIG_MMU is either on or off.

    We replace them with subsys_initcall as per what was done in other
    mmu-enabled code.

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at the actual init
    functions themselves, we see they are just sysctl setup stuff, and
    hence the choice of subsys_initcall (l4) seems reasonable. At the same
    time it minimizes the risk of changing the priority too drastically all
    at once. We can adjust further in the future.

    Also, a couple of instances of missing ";" at EOL are fixed.
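
    As a userspace analogy, GCC constructor priorities can stand in for
    initcall levels to show the ordering effect described above (the
    function names are ours):

```c
#include <assert.h>

/* Lower constructor priority runs earlier, just as subsys_initcall
 * (level 4) runs before device_initcall (level 6) in the kernel's
 * level-ordered initcall sections. Priorities up to 100 are reserved,
 * so arbitrary values above that are used here. */
static int order[2];
static int n;

__attribute__((constructor(204)))   /* plays the role of subsys_initcall */
static void demo_subsys_init(void) { order[n++] = 4; }

__attribute__((constructor(206)))   /* plays the role of device_initcall */
static void demo_device_init(void) { order[n++] = 6; }
```

    Moving a registration from module_init (which maps to device_initcall)
    to subsys_initcall is thus just a move to an earlier slot in this
    ordering.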

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     


13 Mar, 2015

1 commit

    Several modules may need max_mapnr, so export it. The related errors
    with allmodconfig under c6x:

    MODPOST 3327 modules
    ERROR: "max_mapnr" [fs/pstore/ramoops.ko] undefined!
    ERROR: "max_mapnr" [drivers/media/v4l2-core/videobuf2-dma-contig.ko] undefined!

    Signed-off-by: Chen Gang
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

01 Mar, 2015

1 commit

  • Maxime reported the following memory leak regression due to commit
    dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
    implementation").

    On v3.19, I am facing a memory leak. Each time I run a command one page
    is lost. Here an example with busybox's free command:

    / # free
    total used free shared buffers cached
    Mem: 7928 1972 5956 0 0 492
    -/+ buffers/cache: 1480 6448
    / # free
    total used free shared buffers cached
    Mem: 7928 1976 5952 0 0 492
    -/+ buffers/cache: 1484 6444
    / # free
    total used free shared buffers cached
    Mem: 7928 1980 5948 0 0 492
    -/+ buffers/cache: 1488 6440
    / # free
    total used free shared buffers cached
    Mem: 7928 1984 5944 0 0 492
    -/+ buffers/cache: 1492 6436
    / # free
    total used free shared buffers cached
    Mem: 7928 1988 5940 0 0 492
    -/+ buffers/cache: 1496 6432

    At some point, the system fails to satisfy 256KB allocations:

    free: page allocation failure: order:6, mode:0xd0
    CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
    Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
    Mem-info:
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    2048 pages of RAM
    1538 free pages
    66 reserved pages
    109 slab pages
    -46 pages shared
    0 pages swap cached
    nommu: Allocation of length 221184 from process 67 (free) failed
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    Unable to allocate RAM for process text/data, errno 12 SEGV

    This problem happens because in some cases do_mmap_private() allocates a
    higher-order page through __get_free_pages(), but free_page_series() then
    tries to free the constituent pages individually rather than the ordered
    page as a whole. Pages whose refcount is not 0 are not freed back to the
    page allocator by this path, so memory leaks.

    To fix the problem, this patch switches from __get_free_pages() to
    alloc_pages_exact(): it also returns physically contiguous pages, but
    each page is individually refcounted and can therefore be freed one at a
    time.
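
    The order-6 request in the failure log lines up with this arithmetic. As
    a rough illustration (plain userspace C, not kernel code; pages_exact()
    and order_for() are made-up helpers mimicking how alloc_pages_exact()
    and __get_free_pages() size a request), the failing 221184-byte flat
    binary needs exactly 54 pages, but a buddy order allocation must round
    up to 64:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* pages needed for a request of `size` bytes, rounded up */
    static unsigned long pages_exact(unsigned long size)
    {
            return (size + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    /* smallest order with 2^order pages >= npages, as __get_free_pages() needs */
    static unsigned int order_for(unsigned long npages)
    {
            unsigned int order = 0;

            while ((1UL << order) < npages)
                    order++;
            return order;
    }

    int main(void)
    {
            unsigned long len = 221184;              /* the failing request above */
            unsigned long need = pages_exact(len);   /* 54 pages */
            unsigned int order = order_for(need);    /* order 6 -> 64 pages */

            printf("%lu bytes: %lu exact pages, order %u (%lu pages, %lu wasted)\n",
                   len, need, order, 1UL << order, (1UL << order) - need);
            assert(need == 54 && order == 6);
            return 0;
    }
    ```

    Freeing the unused tail of such an order-6 block page by page is exactly
    where the per-page refcounts go wrong, which alloc_pages_exact() avoids.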

    Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
    Reported-by: Maxime Coquelin
    Tested-by: Maxime Coquelin
    Signed-off-by: Joonsoo Kim
    Cc: [3.19]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

12 Feb, 2015

3 commits

  • I noticed that "allowed" can easily underflow, wrapping below 0, because
    (total_vm / 32) can be larger than "allowed". The problem occurs in
    OVERCOMMIT_NONE mode.

    In this case, a huge allocation can succeed and overcommit the system
    (despite OVERCOMMIT_NONE mode). All subsequent allocations will then fail
    (system-wide), so the system becomes unusable.

    The problem was masked out by commit c9b1d0981fcc
    ("mm: limit growth of 3% hardcoded other user reserve"),
    but it's easy to reproduce it on older kernels:
    1) set overcommit_memory sysctl to 2
    2) mmap() large file multiple times (with VM_SHARED flag)
    3) try to malloc() large amount of memory

    It can also be reproduced on newer kernels, but a misconfigured
    sysctl_user_reserve_kbytes is required.

    Fix this issue by switching to signed arithmetic here.
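
    The wraparound is easy to see in plain C (a sketch of the arithmetic
    only; the numbers are made up and stand in for "allowed" and
    (total_vm / 32), not the kernel's real accounting):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long allowed = 1000;   /* pages the commit limit permits */
            unsigned long reserve = 5000;   /* stand-in for total_vm / 32 */

            /* Unsigned arithmetic wraps below zero to a huge value, so a
             * later "committed < allowed" check passes and the allocation
             * succeeds despite OVERCOMMIT_NONE. */
            unsigned long u = allowed - reserve;

            /* Signed arithmetic keeps the sign, so the check can fail as
             * intended. */
            long s = (long)allowed - (long)reserve;

            printf("unsigned: %lu, signed: %ld\n", u, s);
            assert(u > allowed);    /* wrapped far past the limit */
            assert(s == -4000);     /* correctly negative */
            return 0;
    }
    ```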

    Signed-off-by: Roman Gushchin
    Cc: Andrew Shewmaker
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON to
    get a proper -EHWPOISON retval instead of -EFAULT, so they can take a
    more appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
    reading to reduce the mmap_sem contention (for writing), for example while
    waiting for I/O completion. The problem is that right now practically no
    get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
    that nifty feature.

    Andres fixed it for the KVM page fault. However, get_user_pages_fast
    remains uncovered, and 99% of other get_user_pages callers aren't using
    it either (the only exception being FOLL_NOWAIT in KVM, which is really
    nonblocking and in fact doesn't even release the mmap_sem).

    So this patchset extends the optimization Andres did in the KVM page
    fault to the whole kernel. It makes the most important places (including
    gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold times
    during I/O.

    The few places that remain uncovered are drivers like v4l and other
    exceptions that tend to work on their own memory rather than on random
    user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
    fully covered by this patch).

    A follow-up patch should probably also add a printk_once warning to
    get_user_pages, which should go obsolete and be phased out eventually.
    The "vmas" parameter of get_user_pages makes it fundamentally
    incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes
    meaningless the moment the mmap_sem is released).

    While this is just an optimization, this becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

    Userfaultfd allows blocking the page fault, and in order to do so I need
    to drop the mmap_sem first. So this patch also ensures that for all
    memory where userfaultfd could be registered by KVM, the very first fault
    (no matter if it is a regular page fault or a get_user_pages) always has
    FAULT_FLAG_ALLOW_RETRY set. The userfaultfd then blocks and is woken only
    when the pagetable is already mapped. The second fault attempt after the
    wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's ok to retry without
    it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

        down_read(&mm->mmap_sem);
        do_something();
        get_user_pages(tsk, mm, ..., pages, NULL);
        up_read(&mm->mmap_sem);

    to:

        int locked = 1;
        down_read(&mm->mmap_sem);
        do_something();
        get_user_pages_locked(tsk, mm, ..., pages, &locked);
        if (locked)
                up_read(&mm->mmap_sem);

    The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

    Where tsk, mm, the intermediate "..." parameters and "pages" can be any
    value as before. Just the last parameter of get_user_pages (vmas) must be
    NULL for get_user_pages_locked|unlocked to be usable (the latter original
    form wouldn't have been safe anyway if vmas wasn't NULL; for the former
    we just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.
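
    The "&locked" contract can be mimicked in plain userspace C. This is only
    an analogy: the toy lock flag and fake_gup_locked() are made up here to
    illustrate why the caller must test "locked" before calling up_read(),
    and are not the kernel implementation:

    ```c
    #include <assert.h>
    #include <stdio.h>

    static int mmap_sem_held;       /* toy stand-in for mmap_sem (read side) */

    static void down_read(void) { mmap_sem_held = 1; }
    static void up_read(void)   { mmap_sem_held = 0; }

    /* Stand-in for get_user_pages_locked(): on a blocking fault it drops
     * the lock and reports that back through *locked. */
    static long fake_gup_locked(long npages, int *locked, int must_block)
    {
            if (must_block) {
                    up_read();      /* released while waiting for "I/O" */
                    *locked = 0;    /* caller must not up_read() again */
            }
            return npages;          /* pretend all pages were pinned */
    }

    int main(void)
    {
            int locked = 1;
            long got;

            down_read();
            got = fake_gup_locked(1, &locked, 1);
            if (locked)
                    up_read();      /* skipped: callee already dropped it */

            assert(got == 1 && locked == 0 && mmap_sem_held == 0);
            printf("pinned %ld page(s); lock released by callee\n", got);
            return 0;
    }
    ```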

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

11 Feb, 2015

1 commit

  • remap_file_pages(2) was invented to be able to efficiently map parts of a
    huge file into a limited 32-bit virtual address space, such as in
    database workloads.

    Nonlinear mappings are a pain to support, and it seems there are no
    legitimate use-cases nowadays since 64-bit systems are widely available.

    Let's drop it and get rid of all this special-cased code.

    The patch replaces the syscall with emulation which creates a new VMA on
    each remap_file_pages(), unless it can be merged with an adjacent one.

    I didn't find *any* real code that uses remap_file_pages(2) to test
    emulation impact on. I've checked Debian code search and source of all
    packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind and this kind of stuff.

    There are a few basic tests in LTP for the syscall. They work just fine
    with emulation.

    To test performance impact, I've written a small test case which
    demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
    write the page's pgoff at the beginning of every page, remap the pages in
    reverse order, then read every page.

    The test creates 1 million VMAs if emulation is in use, so I had to set
    vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x
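
    A quick sanity check on the quoted figure (nothing more than dividing the
    two timings):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
            double before = 23.3, after = 43.9;  /* seconds, from the runs above */
            double slowdown = after / before;

            printf("slowdown: %.2fx\n", slowdown);
            assert(slowdown > 1.87 && slowdown < 1.89);
            return 0;
    }
    ```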

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i, pass;

            for (pass = 0; pass < 10; pass++) {
                    p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                            perror("mmap");
                            return -1;
                    }

                    for (i = 0; i < SIZE / 4096; i++)
                            p[i * 4096 / sizeof(*p)] = i;

                    for (i = 0; i < SIZE / 4096; i++) {
                            if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                 0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                    perror("remap_file_pages");
                                    return -1;
                            }
                    }

                    for (i = SIZE / 4096 - 1; i >= 0; i--)
                            assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

                    munmap(p, SIZE);
            }

            return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Feb, 2015

1 commit

  • The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but
    only one of them is exported, which leads to module build errors in
    drivers that work fine built-in:

    ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined!
    ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined!
    ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined!
    ERROR: "high_memory" [crypto/tcrypt.ko] undefined!
    ERROR: "high_memory" [crypto/cts.ko] undefined!

    This exports the symbol to get these to work on NOMMU as well.

    Signed-off-by: Arnd Bergmann
    Cc: Kirill A. Shutemov
    Acked-by: Greg Ungerer
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Jan, 2015

1 commit

  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to its original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows us to remove many backing_dev_info instances that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

1 commit

  • do_mmap_private() in nommu.c tries to allocate physically contiguous
    pages of arbitrary size in some cases, and we now have a good abstraction
    that does exactly that: alloc_pages_exact(). So, change the code to use
    it.

    There is no functional change. This is a preparation step for accurately
    supporting the page owner feature.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim