17 Jul, 2018

1 commit

  • commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.

    syzbot has noticed that a specially crafted library can easily hit the
    VM_BUG_ON in __mm_populate:

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to bug on that, though.
    do_brk_flags already aligns the length properly so the mapping is
    expanded as it should. All we need is to tell mm_populate about it.
    Besides that, there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated, and that is far from putting the system into an
    inconsistent state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry, and it makes sure to provide the proper length, so there
    is no need for sanitization and we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.
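
    To make the arithmetic concrete, here is a small standalone sketch
    (PAGE_SIZE and PAGE_ALIGN are redefined locally for illustration; they
    mirror, but are not, the kernel macros):

        #include <stdio.h>

        #define PAGE_SIZE     4096UL
        #define PAGE_MASK     (~(PAGE_SIZE - 1))
        #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

        int main(void)
        {
                unsigned long request = 0x1234;          /* unaligned length from uselib() */
                unsigned long len = PAGE_ALIGN(request); /* what do_brk_flags() maps */

                /* Populating "request" bytes of a "len"-byte mapping is what
                 * tripped the VM_BUG_ON; sanitizing once in vm_brk_flags()
                 * keeps the mapped and populated lengths in agreement. */
                printf("request=%#lx mapped=%#lx\n", request, len);
                return 0;
        }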

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

12 Jun, 2018

2 commits

  • commit 423913ad4ae5b3e8fb8983f70969fb522261ba26 upstream.

    Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
    introduced to catch problems in various ad-hoc character device drivers
    doing mmap and getting the size limits wrong. In the process, it used
    "known good" limits for the normal cases of mapping regular files and
    block device drivers.

    It turns out that the "s_maxbytes" limit was less "known good" than I
    thought. In particular, /proc doesn't set it, but exposes one regular
    file to mmap: /proc/vmcore. As a result, that file got limited to the
    default MAX_INT s_maxbytes value.

    This went unnoticed for a while, because apparently the only thing that
    needs it is the s390 kernel zfcpdump, but there might be other tools
    that use this too.

    Vasily suggested just changing s_maxbytes for all of /proc, which isn't
    wrong, but makes me nervous at this stage. So instead, just make the
    new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
    affect anything else. It wasn't the regular file case I was worried
    about.

    I'd really prefer for maxsize to have been per-inode, but that is not
    how things are today.

    Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
    Reported-by: Vasily Gorbik
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit be83bbf806822b1b89e0a0f23cd87cddc409e429 upstream.

    The internal VM "mmap()" interfaces are based on the mmap target doing
    everything using page indexes rather than byte offsets, because
    traditionally (ie 32-bit) we had the situation that the byte offset
    didn't fit in a register. So while the mmap virtual address was limited
    by the word size of the architecture, the backing store was not.

    So we're basically passing "pgoff" around as a page index, in order to
    be able to describe backing store locations that are much bigger than
    the word size (think files larger than 4GB etc).

    But while this all makes a ton of sense conceptually, we've been dogged
    by various drivers that don't really understand this, and internally
    work with byte offsets, and then try to work with the page index by
    turning it into a byte offset with "pgoff << PAGE_SHIFT".

    Which obviously can overflow.

    Adding the size of the mapping to it to get the byte offset of the end
    of the backing store just exacerbates the problem, and if you then use
    this overflow-prone value to check various limits of your device driver
    mmap capability, you're just setting yourself up for problems.

    The correct thing for drivers to do is to do their limit math in page
    indices, the way the interface is designed. Because the generic mmap
    code _does_ test that the index doesn't overflow, since that's what the
    mmap code really cares about.
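
    A standalone illustration of the overflow, using a 32-bit unsigned type
    to stand in for a 32-bit kernel's unsigned long:

        #include <stdio.h>

        #define PAGE_SHIFT 12

        int main(void)
        {
                unsigned int pgoff = 0x100001;            /* just past 4GB, in pages */
                unsigned int bytes = pgoff << PAGE_SHIFT; /* wraps to 0x1000 */

                printf("pgoff %#x -> byte offset %#x (overflowed)\n", pgoff, bytes);

                /* doing the limit math in page units avoids the shift entirely,
                 * e.g. checking pgoff > maxsize >> PAGE_SHIFT */
                return 0;
        }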

    HOWEVER.

    Finding and fixing various random drivers is a sisyphean task, so let's
    just see if we can just make the core mmap() code do the limiting for
    us. Realistically, the only "big" backing stores we need to care about
    are regular files and block devices, both of which are known to do this
    properly, and which have nice well-defined limits for how much data they
    can access.

    So let's special-case just those two known cases, and then limit other
    random mmap users to a backing store that still fits in "unsigned long".
    Realistically, that's not much of a limit at all on 64-bit, and on
    32-bit architectures the only worry might be the GPU drivers, which can
    have big physical address spaces.

    To make it possible for drivers like that to say that they are 64-bit
    clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
    file flags to allow drivers to mark their file descriptors as safe in
    the full 64-bit mmap address space.
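
    A sketch of the shape of the resulting core-mmap check, under assumptions
    about the exact upstream helpers (the follow-up commit above later bumped
    the regular-file limit to MAX_LFS_FILESIZE):

        static u64 file_mmap_size_max(struct file *file, struct inode *inode)
        {
                if (S_ISREG(inode->i_mode))
                        return inode->i_sb->s_maxbytes; /* later: MAX_LFS_FILESIZE */
                if (S_ISBLK(inode->i_mode))
                        return MAX_LFS_FILESIZE;
                /* drivers that declared themselves 64-bit clean: no cap */
                if (file->f_mode & FMODE_UNSIGNED_OFFSET)
                        return 0;
                /* everyone else: backing store must fit in an unsigned long */
                return ULONG_MAX;
        }

        static bool file_mmap_ok(struct file *file, struct inode *inode,
                                 unsigned long pgoff, unsigned long len)
        {
                u64 maxsize = file_mmap_size_max(file, inode);

                if (maxsize && len > maxsize)
                        return false;
                maxsize -= len;
                if (pgoff > maxsize >> PAGE_SHIFT)
                        return false;
                return true;
        }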

    [ The timing for doing this is less than optimal, and this should really
    go in a merge window. But realistically, this needs wide testing more
    than it needs anything else, and being main-line is the only way to do
    that.

    So the earlier the better, even if it's outside the proper development
    cycle - Linus ]

    Cc: Kees Cook
    Cc: Dan Carpenter
    Cc: Al Viro
    Cc: Willy Tarreau
    Cc: Dave Airlie
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

16 May, 2018

1 commit

  • commit 27ae357fa82be5ab73b2ef8d39dcb8ca2563483a upstream.

    Since exit_mmap() is done without the protection of mm->mmap_sem, it is
    possible for the oom reaper to concurrently operate on an mm until
    MMF_OOM_SKIP is set.

    This allows munlock_vma_pages_all() to concurrently run while the oom
    reaper is operating on a vma. Since munlock_vma_pages_range() depends
    on clearing VM_LOCKED from vm_flags before actually doing the munlock to
    determine if any other vmas are locking the same memory, the check for
    VM_LOCKED in the oom reaper is racy.

    This is especially noticeable on architectures such as powerpc where
    clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
    is zapped by the oom reaper during follow_page_mask() after the check
    for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
    causing a kernel oops.

    Fix this by manually freeing all possible memory from the mm before
    doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not
    run on the mm anymore so the munlock is safe to do in exit_mmap(). It
    also matches the logic that the oom reaper currently uses for
    determining when to set MMF_OOM_SKIP itself, so there's no new risk of
    excessive oom killing.

    This fixes CVE-2018-1000200.
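
    A sketch of the resulting ordering in exit_mmap(), assuming the helper
    names from the description above:

        if (unlikely(mm_is_oom_victim(mm))) {
                /* reap everything reapable first ... */
                __oom_reap_task_mm(mm);

                /* ... then tell the oom reaper to stay away, and synchronize
                 * with a possibly in-flight reaper before tearing down mm
                 * structures */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }

        /* munlock_vma_pages_all() can no longer race with the oom reaper */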

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: David Rientjes
    Suggested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

20 Dec, 2017

1 commit

  • commit 4837fe37adff1d159904f0c013471b1ecbcb455e upstream.

    David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victim's address space:

    BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
    addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 2 PID: 1001 Comm: oom_reaper
    Call Trace:
    unmap_page_range+0x1068/0x1130
    __oom_reap_task_mm+0xd5/0x16b
    oom_reaper+0xff/0x14c
    kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient. We only synchronize with the oom reaper if
    tsk_is_oom_victim, which is not true if the final __mmput is called from
    a different context than the oom victim exit path. This can trivially
    happen from the context of any task which has grabbed an mm reference
    (e.g. to read a /proc/<pid> file which requires the mm, etc.).

    The race would look like this:

        oom_reaper                 oom_victim task
                                   mmget_not_zero
                                   do_exit
                                     mmput
        __oom_reap_task_mm         mmput
                                   __mmput
                                     exit_mmap
                                       remove_vma
        unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task. Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.
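
    A sketch of the new helper, assuming an MMF_OOM_VICTIM mm flag set in
    mark_oom_victim():

        static inline bool mm_is_oom_victim(struct mm_struct *mm)
        {
                /* set in mark_oom_victim() and never cleared, hence stable */
                return test_bit(MMF_OOM_VICTIM, &mm->flags);
        }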

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Argangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

05 Dec, 2017

1 commit

  • commit 31383c6865a578834dd953d9dbc88e6b19fe3997 upstream.

    Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has constraints similar to hugetlbfs in that it
    requires the munmap path to unmap in huge-page-aligned units. Rather
    than add more custom vma handling code in __split_vma(), introduce a new
    vm operation to perform this vma-specific check.
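
    A sketch of the two halves, under assumptions about the exact upstream
    field layout:

        /* mm/mmap.c: __split_vma() asks the backing driver first */
        if (vma->vm_ops && vma->vm_ops->split) {
                err = vma->vm_ops->split(vma, addr);
                if (err)
                        return err;
        }

        /* device-dax: refuse splits not aligned to the dax region */
        static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr)
        {
                struct dev_dax *dev_dax = vma->vm_file->private_data;

                if (!IS_ALIGNED(addr, dev_dax->region->align))
                        return -EINVAL;
                return 0;
        }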

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

09 Sep, 2017

1 commit

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
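
    A minimal usage sketch of the cached-root interfaces (fragment):

        struct rb_root_cached root = RB_ROOT_CACHED;

        /* insertion passes down whether the new node became the leftmost */
        rb_insert_color_cached(&node->rb, &root, leftmost);
        rb_erase_cached(&node->rb, &root);

        /* the leftmost node is then available in O(1) */
        struct rb_node *first = rb_first_cached(&root);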

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

07 Sep, 2017

3 commits

  • This is purely required because exit_aio() may block and exit_mmap() may
    never start, if the oom_reap_task cannot start running on a mm with
    mm_users == 0.

    At the same time if the OOM reaper doesn't wait at all for the memory of
    the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
    generate a spurious OOM kill.

    If it wasn't because of the exit_aio or similar blocking functions in
    the last mmput, it would be enough to change the oom_reap_task() in the
    case it finds mm_users == 0, to wait for a timeout or to wait for
    __mmput to set MMF_OOM_SKIP itself, but exit_mmap is not the only
    problem here, so the concurrency of exit_mmap and oom_reap_task is
    apparently warranted.

    It's a non-standard runtime: exit_mmap() runs without mmap_sem, and
    oom_reap_task runs with the mmap_sem for reading, as usual (kind of
    MADV_DONTNEED).

    The race between the two is solved with a combination of
    tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
    (serialized by a dummy down_write/up_write cycle on the same lines of
    the ksm_exit method).

    If the oom_reap_task() may be running concurrently during exit_mmap,
    exit_mmap will wait for it to finish in down_write (before taking down mm
    structures that would make the oom_reap_task fail with use after free).

    If exit_mmap comes first, oom_reap_task() will skip the mm if
    MMF_OOM_SKIP is already set; in turn, all memory has already been freed,
    and furthermore the mm data structures may already have been taken down
    by free_pgtables.
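
    A simplified sketch of the reaper's side of this handshake (not the
    exact upstream code; __oom_reap_vmas() is a stand-in name):

        static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
        {
                if (!down_read_trylock(&mm->mmap_sem))
                        return false;           /* exit_mmap may be running: retry */

                if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
                        /* __mmput already freed everything, nothing to do */
                        up_read(&mm->mmap_sem);
                        return true;
                }

                __oom_reap_vmas(mm);            /* MADV_DONTNEED-style reaping */
                up_read(&mm->mmap_sem);
                return true;
        }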

    [aarcange@redhat.com: incremental one liner]
    Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
    [rientjes@google.com: remove unused mmput_async]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
    [aarcange@redhat.com: microoptimization]
    Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
    Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
    Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Reported-by: David Rientjes
    Tested-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • A __split_vma is not a worthy event to report, and it's definitely not
    an unmap, so it would be incorrect to report unmap for the whole region
    to the userfaultfd manager if a __split_vma fails.

    So only call userfaultfd_unmap_prep after the __vma_splitting is over
    and do_munmap cannot fail anymore.

    Also add unlikely because it's better to optimize for the vast majority
    of apps that aren't using userfaultfd in a non-cooperative way. Ideally
    we should also find a way to eliminate the branch entirely if
    CONFIG_USERFAULTFD=n, but it would complicate things so stick to
    unlikely for now.
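
    A simplified sketch of the resulting ordering in do_munmap():

        /* split first: splitting may fail, and a failed munmap must not be
         * reported to the uffd monitor as an unmap */
        if (start > vma->vm_start) {
                error = __split_vma(mm, vma, start, 0);
                if (error)
                        return error;
        }
        /* (likewise for the tail split at "end") */

        /* only now is the unmap guaranteed to proceed */
        error = userfaultfd_unmap_prep(vma, start, end, uf);
        if (unlikely(error))
                return error;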

    Link: http://lkml.kernel.org/r/20170802165145.22628-5-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Alexey Perevalov
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • global_page_state is error prone, as a recent bug report pointed out [1].
    It only returns proper values for zone-based counters, as the enum it
    gets suggests. We already have global_node_page_state so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
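
    The change at call sites is mechanical, e.g.:

        /* before */
        unsigned long free = global_page_state(NR_FREE_PAGES);

        /* after: the name now states the zone-based expectation */
        unsigned long free = global_zone_page_state(NR_FREE_PAGES);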

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Jul, 2017

1 commit

  • Jörn Engel noticed that the expand_upwards() function might not return
    -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
    the architecture didn't define TASK_SIZE as a multiple of PAGE_SIZE.

    Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
    which all define TASK_SIZE as 0xffffffff, but since none of those have
    an upwards-growing stack we currently have no actual issue.

    Nevertheless let's fix this just in case any of the architectures with
    an upward-growing stack (currently parisc, metag and partly ia64) define
    TASK_SIZE similarly.
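
    A standalone illustration of the wraparound (PAGE_SIZE redefined locally):

        #include <stdio.h>

        #define PAGE_SIZE 4096UL

        int main(void)
        {
                unsigned long address = (unsigned long)-PAGE_SIZE;
                unsigned long next = address + PAGE_SIZE;   /* wraps to 0 */

                /* a bound such as "next > TASK_SIZE" can never catch the wrap
                 * when TASK_SIZE is not a multiple of PAGE_SIZE */
                printf("%#lx + %#lx = %#lx\n", address, PAGE_SIZE, next);
                return 0;
        }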

    Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
    Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
    Signed-off-by: Helge Deller
    Reported-by: Jörn Engel
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     

11 Jul, 2017

3 commits

  • Use rlimit() helper instead of manually writing whole chain from current
    task to rlim_cur.
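
    The transformation is mechanical (fragment; RLIMIT_MEMLOCK is used here
    only as an example limit):

        /* before: open-coded walk from current to the soft limit */
        limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;

        /* after: the rlimit() helper */
        limit = rlimit(RLIMIT_MEMLOCK);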

    Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com
    Signed-off-by: Krzysztof Opasiak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Opasiak
     
  • expand_stack(vma) fails if address < stack_guard_gap even if there is no
    vma->vm_prev. I don't think this makes sense, and we didn't do this
    before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
    between vmas").

    We do not need a gap in this case, any address is fine as long as
    security_mmap_addr() doesn't object.

    This also simplifies the code, we know that address >= prev->vm_end and
    thus underflow is not possible.

    Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
    introduced a regression in some Rust and Java environments which are
    trying to implement their own stack guard page. They are punching a new
    MAP_FIXED mapping inside the existing stack vma.

    This will confuse expand_{downwards,upwards} into thinking that the
    stack expansion would in fact get us too close to an existing non-stack
    vma which is a correct behavior wrt safety. It is a real regression on
    the other hand.

    Let's work around the problem by considering a PROT_NONE mapping as a
    part of the stack. This is a gross hack, but overflowing into such a
    mapping would trap anyway, and we can only hope that userspace knows
    what it is doing and handles it properly.
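
    A sketch of the resulting gap check in expand_downwards(), assuming the
    final upstream shape (covering both this fix and the previous one): the
    gap is enforced only when there is a previous vma and it is accessible,
    i.e. not a PROT_NONE part of the stack:

        prev = vma->vm_prev;
        if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
                        (prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) {
                if (address - prev->vm_end < stack_guard_gap)
                        return -ENOMEM;
        }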

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jul, 2017

1 commit

  • Pull ARM updates from Russell King:

    - add support for ftrace-with-registers, which is needed for kgraft and
    other ftrace tools

    - support for mremap() for the sigpage/vDSO so that checkpoint/restore
    can work

    - add timestamps to each line of the register dump output

    - remove the unused KTHREAD_SIZE from nommu

    - align the ARM bitops APIs with the generic API (using unsigned long
    pointers rather than void pointers)

    - make the configuration of userspace Thumb support an expert option so
    that we can default it on, and avoid some hard to debug userspace
    crashes

    * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
    ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition
    ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO
    ARM: 8679/1: bitops: Align prototypes to generic API
    ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS
    ARM: make configuration of userspace Thumb support an expert option
    ARM: 8673/1: Fix __show_regs output timestamps

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • The protection map is only modified by per-arch init code so it can be
    protected from writes after the init code runs.

    This change was extracted from PaX where it's part of KERNEXEC.
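
    The change amounts to marking the table read-only after boot (sketch):

        static pgprot_t protection_map[16] __ro_after_init = {
                __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
                __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
        };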

    Link: http://lkml.kernel.org/r/20170510174441.26163-1-danielmicay@gmail.com
    Signed-off-by: Daniel Micay
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Micay
     

22 Jun, 2017

2 commits

  • Fix expand_upwards() on architectures with an upward-growing stack (parisc,
    metag and partly IA-64) to allow the stack to reliably grow exactly up to
    the address space limit given by TASK_SIZE.

    Signed-off-by: Helge Deller
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
    mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the
    end of unmapped_area_topdown(). Linus points out how MAP_FIXED
    (which does not have to respect our stack guard gap intentions)
    could result in gap_end below gap_start there. Fix that, and
    the similar case in its alternative, unmapped_area().

    Cc: stable@vger.kernel.org
    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Reported-by: Dave Jones
    Debugged-by: Linus Torvalds
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jun, 2017

1 commit

  • CRIU restores application mappings at the same place where they
    were before checkpointing. That means that we need to move the vDSO
    and sigpage during restore to exactly the same place where
    they were before C/R.

    Make the mremap() code update the mm->context.{sigpage,vdso} pointers
    during a VMA move. The sigpage is used for landing after handling
    a signal - if the pointer is not updated during the move, the
    application might crash on any signal after mremap().

    The vDSO pointer on ARM32 is used only for setting auxv at this moment;
    update it during mremap() in case of future usage.

    Without those updates, the current work of CRIU on ARM32 is not reliable.
    Historically, we fail Checkpointing if we find a vDSO page on ARM32
    and suggest the user disable CONFIG_VDSO.
    But that's not correct - that logic comes from x86, where signal
    processing ends in the vDSO blob. For ARM32 it's the sigpage, which is
    not disabled with `CONFIG_VDSO=n'.

    Looks like C/R was working by luck - because userspace on ARM32 at
    this moment always sets SA_RESTORER.

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Christopher Covington
    Signed-off-by: Russell King

    Dmitry Safonov
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
    which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunatelly.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
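
    A sketch of the vm_start_gap() helper this describes, assuming the
    upstream form:

        extern unsigned long stack_guard_gap;   /* stack_guard_gap= cmdline */

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow wrapped */
                                vm_start = 0;
                }
                return vm_start;
        }

    Since the option is in page units, booting with stack_guard_gap=1
    restores the old single-page behaviour.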

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 May, 2017

1 commit

  • Commit 091d0d55b286 ("shm: fix null pointer deref when userspace
    specifies invalid hugepage size") had replaced MAP_HUGE_MASK with
    SHM_HUGE_MASK. Though both of them contain the same numeric value of
    0x3f, MAP_HUGE_MASK flag sounds more appropriate than the other one in
    the context. Hence change it back.

    Link: http://lkml.kernel.org/r/20170404045635.616-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Matthew Wilcox
    Acked-by: Balbir Singh
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Mar, 2017

1 commit

  • Pull vfs pile two from Al Viro:

    - orangefs fix

    - series of fs/namei.c cleanups from me

    - VFS stuff coming from overlayfs tree

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    orangefs: Use RCU for destroy_inode
    vfs: use helper for calling f_op->fsync()
    mm: use helper for calling f_op->mmap()
    vfs: use helpers for calling f_op->{read,write}_iter()
    vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
    vfs: extract common parts of {compat_,}do_readv_writev()
    vfs: wrap write f_ops with file_{start,end}_write()
    vfs: deny copy_file_range() for non regular files
    vfs: deny fallocate() on directory
    vfs: create vfs helper vfs_tmpfile()
    namei.c: split unlazy_walk()
    namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
    lookup_fast(): clean up the logics around the fallback to non-rcu mode
    namei: fold unlazy_link() into its sole caller

    Linus Torvalds
     

25 Feb, 2017

5 commits

  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
    or mempolicies) cannot be allocated.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
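
    The caller-visible contract after this change (illustrative userspace
    fragment; give_back() is a hypothetical wrapper):

        #include <errno.h>
        #include <sys/mman.h>

        int give_back(void *addr, size_t len)
        {
                if (madvise(addr, len, MADV_DONTNEED) == 0)
                        return 0;
                if (errno == EAGAIN)
                        return 1;   /* transient slab pressure: retry shortly */
                /* ENOMEM: max_map_count would be exceeded (or a bad range);
                 * retrying will not help */
                return -1;
        }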

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped.
    Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
    changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    first should create a temporary representation for the unmap event for
    each uffd context and then notify them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
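
    How a non-cooperative monitor might consume the new event (sketch;
    handle_unmap() is a hypothetical bookkeeping callback):

        #include <linux/userfaultfd.h>
        #include <unistd.h>

        extern void handle_unmap(__u64 start, __u64 end);   /* hypothetical */

        static void drain_events(int uffd)
        {
                struct uffd_msg msg;

                while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
                        if (msg.event == UFFD_EVENT_UNMAP)
                                /* stop filling [start, end): it is gone */
                                handle_unmap(msg.arg.remove.start,
                                             msg.arg.remove.end);
                }
        }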

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: non-cooperative: better tracking for mapping
    changes", v2.

    These patches try to address issues I've encountered during integration
    of userfaultfd with CRIU.

    Previously added userfaultfd events for fork(), madvise() and mremap()
    unfortunately do not cover all possible changes to a process virtual
    memory layout required for uffd monitor.

    When one or more VMAs is removed from the process mm, the external uffd
    monitor has no way to detect those changes and will attempt to fill the
    removed regions with userfaultfd_copy.

    Another problematic event is the exit() of the process. Here again, the
    external uffd monitor will try to use userfaultfd_copy, although mm
    owning the memory has already gone.

    The first patch in the series is a minor cleanup and it's not strictly
    related to the rest of the series.

    The patches 2 and 3 below add UFFD_EVENT_UNMAP and UFFD_EVENT_EXIT to
    allow the uffd monitor track changes in the memory layout of a process.

    The patches 4 and 5 amend error codes returned by userfaultfd_copy to
    make the uffd monitor able to cope with races that might occur between
    delivery of unmap and exit events and outstanding userfaultfd_copy's.

    This patch (of 5):

    Commit dc0ef0df7b6a ("mm: make mmap_sem for write waits killable for mm
    syscalls") replaced call to vm_munmap in munmap syscall with open coded
    version to allow different waits on mmap_sem in munmap syscall and
    vm_munmap.

    Now both functions use down_write_killable, so we can restore the call
    to vm_munmap from the munmap system call.
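
    The syscall body then becomes (sketch):

        SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
        {
                profile_munmap(addr);
                return vm_munmap(addr, len);
        }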

    Link: http://lkml.kernel.org/r/1485542673-24387-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • mmap_init() is no longer associated with the VMA slab, so fix the stale
    comment.

    Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
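
    The signature change, with a handler reaching the vma through the fault
    structure (my_fault() is an illustrative handler name):

        /* before */
        int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
        /* after */
        int (*fault)(struct vm_fault *vmf);

        static int my_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma; /* vma lives in vmf now */

                /* ... handle the fault ... */
                return VM_FAULT_SIGBUS;
        }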

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

1 commit

  • On 32-bit powerpc the ELF PLT sections of binaries (built with
    --bss-plt, or with a toolchain which defaults to it) look like this:

    [17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
    [18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
    [19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4

    Which results in an ELF load header:

    Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
    LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000

    This is all correct, the load region containing the PLT is marked as
    executable. Note that the PLT starts at 0002b00c but the file mapping
    ends at 0002aff8, so the PLT falls in the 0 fill section described by
    the load header, and after a page boundary.

    Unfortunately the generic ELF loader ignores the X bit in the load
    headers when it creates the 0 filled non-file backed mappings. It
    assumes all of these mappings are RW BSS sections, which is not the case
    for PPC.

    gcc/ld has an option (--secure-plt) to not do this; it is said to
    incur a small performance penalty.

    Currently, to support 32-bit binaries with the PLT in BSS, the kernel
    maps the *entire brk area* with executable rights for all binaries,
    even --secure-plt ones.

    Stop doing that.

    Teach the ELF loader to check the X bit in the relevant load header and
    create 0 filled anonymous mappings that are executable if the load
    header requests that.

    Test program showing the difference in /proc/$PID/maps:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[16*1024];
            char *p = malloc(123); /* make "[heap]" mapping appear */
            int fd = open("/proc/self/maps", O_RDONLY);
            int len = read(fd, buf, sizeof(buf));

            write(1, buf, len);
            printf("%p\n", p);
            return 0;
    }

    Compiled using: gcc -mbss-plt -m32 -Os test.c -otest

    Unpatched ppc64 kernel:
    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10690000-106c0000 rwxp 00000000 00:00 0 [heap]
    f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
    0x10690008

    Patched ppc64 kernel:
    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10180000-101b0000 rw-p 00000000 00:00 0 [heap]
    ^^^^ this has changed
    f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
    0x10180008

    The patch was originally posted in 2012 by Jason Gunthorpe
    and apparently ignored:

    https://lkml.org/lkml/2012/9/30/138

    Lightly run-tested.

    Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Denys Vlasenko
    Acked-by: Kees Cook
    Acked-by: Michael Ellerman
    Tested-by: Jason Gunthorpe
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Aneesh Kumar K.V"
    Cc: Oleg Nesterov
    Cc: Florian Weimer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     

08 Oct, 2016

6 commits

  • The old code was always doing:

        vma->vm_end = next->vm_end
        vma_rb_erase(next)              // in __vma_unlink
        vma->vm_next = next->vm_next    // in __vma_unlink
        next = vma->vm_next
        vma_gap_update(next)

    The new code still does the above for remove_next == 1 and 2, but for
    remove_next == 3 it has been changed and it does:

        next->vm_start = vma->vm_start
        vma_rb_erase(vma)               // in __vma_unlink
        vma_gap_update(next)

    In the latter case, while unlinking "vma", validate_mm_rb() is told to
    ignore "vma" that is being removed, but next->vm_start was reduced
    instead. So for the new case, to avoid the false positive from
    validate_mm_rb, it should be "next" that is ignored when "vma" is
    being unlinked.

    "vma" and "next" in the above comment, considered pre-swap().

    Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Tested-by: Shaun Tancheff
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The cases are three, not two.

    Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If next were NULL, we couldn't reach such a code path.

    Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations). So it's not safe to wait for the caller to update
    the vm_page_prot/vm_flags after vma_merge has returned, potentially
    removing the "next" vma and extending the "current" vma over the
    next->vm_start,vm_end range, but still with the "current" vma's
    vm_page_prot, after releasing the rmap locks.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to the
    current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during migrate,
    as remove_migration_ptes, when run on an address of the "next" vma that
    got removed, used the vm_page_prot of the current vma.

    migrate                              mprotect
    -------                              --------
    migrating in "next" vma
                                         vma_merge()  # removes "next" vma and
                                                      # extends "current" vma
                                                      # current vma is not with
                                                      # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
                                         vm_set_page_prot(vma)  # too late!
                                         change_protection in the old vma range
                                         only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge,
    which confirms that case 8 is the only buggy one where the race can
    trigger; in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and converts it from:

            AAAA
        PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

            AAAA
        PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefits
    that the callers could stop updating vma properties when vma_merge
    succeeds however the callers are not updated by this patch (there are
    bits like VM_SOFTDIRTY that still need special care for the whole range,
    as the vma merging ignores them, but as long as they're not processed by
    rmap walks and instead they're accessed with the mmap_sem at least for
    reading, they are fine not to be updated within vma_adjust before
    releasing the rmap_locks).

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm->highest_vm_end doesn't need any update.

    After finally removing the oddness from vma_merge case 8 that was
    causing:

    1) constant risk of trouble whenever anybody would check vma fields
    from rmap_walks, like it happened when page migration was
    introduced and it read the vma->vm_page_prot from a rmap_walk

    2) the callers of vma_merge to re-initialize any value different from
    the current vma, instead of vma_merge() more reliably returning a
    vma that already matches all fields passed as parameter

    .. it is also worth taking the opportunity to clean up superfluous
    code in vma_adjust() which, if not removed, adds to the poor
    readability of the function.

    Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • vma->vm_page_prot is read locklessly from the rmap_walk; it may be
    updated concurrently, and this prevents the risk of reading
    intermediate values.
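
    The pattern this introduces (sketch):

        /* writer: publish the new protection in a single store */
        WRITE_ONCE(vma->vm_page_prot, vm_get_page_prot(vm_flags));

        /* lockless reader, e.g. on the rmap side */
        pgprot_t prot = READ_ONCE(vma->vm_page_prot);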

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Oct, 2016

1 commit

  • Pull x86 vdso updates from Ingo Molnar:
    "The main changes in this cycle centered around adding support for
    32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
    Safonov"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
    x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
    x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
    x86/signal: Add SA_{X32,IA32}_ABI sa_flags
    x86/ptrace: Down with test_thread_flag(TIF_IA32)
    x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
    x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
    x86/vdso: Replace calculate_addr in map_vdso() with addr
    x86/vdso: Unmap vdso blob on vvar mapping failure

    Linus Torvalds
     

15 Sep, 2016

1 commit

  • Add an API to change the vDSO blob type with arch_prctl.
    As this is useful only for the needs of CRIU, expose
    this interface under CONFIG_CHECKPOINT_RESTORE.
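
    Intended usage from a restorer, roughly (raw syscall; ARCH_MAP_VDSO_32
    comes from asm/prctl.h):

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <asm/prctl.h>

        /* ask the kernel to map the 32-bit vDSO blob at addr (CRIU-style) */
        static long map_vdso_32(unsigned long addr)
        {
                return syscall(SYS_arch_prctl, ARCH_MAP_VDSO_32, addr);
        }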

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: oleg@redhat.com
    Cc: linux-mm@kvack.org
    Cc: gorcunov@openvz.org
    Cc: xemul@virtuozzo.com
    Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
    Signed-off-by: Thomas Gleixner

    Dmitry Safonov
     

26 Aug, 2016

1 commit

  • The ARMv8 architecture allows execute-only user permissions by clearing
    the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU
    implementation without User Access Override (ARMv8.2 onwards) can still
    access such a page, so execute-only page permission does not protect
    against read(2)/write(2) etc. accesses. Systems requiring such
    protection must enable features like SECCOMP.

    This patch changes the arm64 __P100 and __S100 protection_map[] macros
    to the new __PAGE_EXECONLY attributes. A side effect is that
    pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't
    set. To work around this, the check is done on the PTE_NG bit via the
    pte_ng() macro. VM_READ is also checked now for page faults.
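
    A userspace request that now takes effect as true execute-only
    (illustrative):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                void *p = mmap(NULL, 4096, PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                /* direct loads from p fault in userspace; per the text above,
                 * the kernel itself may still read such a page on pre-ARMv8.2
                 * (no UAO) implementations */
                printf("exec-only mapping at %p\n", p);
                return 0;
        }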

    Reviewed-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas