30 Dec, 2020

1 commit

  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

     CPU 0                             CPU 1
    get_user_pages_fast()
     internal_get_user_pages_fast()
                                       copy_page_range()
                                         pte_alloc_map_lock()
                                           copy_present_page()
                                             atomic_read(has_pinned) == 0
                                             page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                             pte = pte_wrprotect(pte)
                                             set_pte_at();
                                         pte_unmap_unlock()
      // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()")

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
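
The seqcount scheme described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel code: the `sketch_` names are hypothetical, C11 atomics stand in for the kernel's seqcount primitives, and the "walk" is a callback so that a colliding fork can be simulated in a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sequence counter; not the kernel's seqcount_t. */
struct sketch_seqcount {
    atomic_uint sequence; /* odd while a writer is inside its section */
};

/* Writer side: the fork path would bracket its page-table copy with these. */
static void sketch_write_begin(struct sketch_seqcount *s)
{
    atomic_fetch_add(&s->sequence, 1); /* count becomes odd */
}

static void sketch_write_end(struct sketch_seqcount *s)
{
    atomic_fetch_add(&s->sequence, 1); /* count becomes even again */
}

/* Stands in for a fork that runs while the fast path is walking. */
static void sketch_concurrent_fork(void *arg)
{
    struct sketch_seqcount *s = arg;

    sketch_write_begin(s);
    /* ... page tables would be copied and write protected here ... */
    sketch_write_end(s);
}

/*
 * Reader side: sample the counter, do the lockless walk, then re-check.
 * Returns true if the fast path may keep its result, false if the caller
 * has to drop it and fall back to the slow, lock-protected path.
 */
static bool sketch_fast_path(struct sketch_seqcount *s,
                             void (*during_walk)(void *), void *arg)
{
    unsigned int start = atomic_load(&s->sequence);

    if (start & 1)
        return false; /* a writer is active right now */

    if (during_walk)
        during_walk(arg); /* the lockless page-table walk would happen here */

    return atomic_load(&s->sequence) == start; /* unchanged: no collision */
}
```

Passing `sketch_concurrent_fork` as the callback models the collision: the counter changes during the walk, so the fast path reports failure and the caller would retry through the slow path.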
     

10 Jun, 2020

3 commits

  • Define a new initializer for the mmap locking API. Initially this just
    evaluates to __RWSEM_INITIALIZER, as the API is defined as wrappers around
    rwsem.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-9-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The replacement of <asm/pgtable.h> with <linux/pgtable.h> made the include
    of the latter in the middle of asm includes. Fix this up with the aid of
    the below script and manual adjustments here and there.

    import sys
    import re

    if len(sys.argv) is not 3:
        print "USAGE: %s <file> <header>" % (sys.argv[0])
        sys.exit(1)

    hdr_to_move="#include <linux/%s>" % sys.argv[2]
    moved = False
    in_hdrs = False

    with open(sys.argv[1], "r") as f:
        lines = f.readlines()
        for _line in lines:
            line = _line.rstrip('\n')
            if line == hdr_to_move:
                continue
            if line.startswith("#include <linux/"):
                in_hdrs = True
            elif not moved and in_hdrs:
                moved = True
                print hdr_to_move
            print line

    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

19 Oct, 2019

1 commit

  • mm_init.c needs to include <linux/mman.h> for the definition of
    vm_committed_as_batch. Fixes the following sparse warning:

    mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks (Codethink)
     

25 Sep, 2019

1 commit

  • Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with
    neat CPU_BITS_NONE macro.

    And, since init_mm.cpu_bitmask is statically set to zero, there is no
    need to clear it again in start_kernel().

    Link: http://lkml.kernel.org/r/1565703815-8584-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

17 Jul, 2018

1 commit

  • The mm_struct always contains a cpumask bitmap, regardless of
    CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
    simplify things, and simply have one bitmask at the end of the
    mm_struct for the mm_cpumask.

    This does necessitate moving everything else in mm_struct into
    an anonymous sub-structure, which can be randomized when struct
    randomization is enabled.

    The second step is to determine the correct size for the
    mm_struct slab object from the size of the mm_struct
    (excluding the CPU bitmap) and the size of the cpumask.

    For init_mm we can simply allocate the maximum size this
    kernel is compiled for, since we only have one init_mm
    in the system, anyway.

    Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
    getting confused by the dynamically sized array.

    Tested-by: Song Liu
    Signed-off-by: Rik van Riel
    Signed-off-by: Mike Galbraith
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: luto@kernel.org
    Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
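
The layout and sizing described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's mm_struct: the struct, its two placeholder fields and all `sketch_` helpers are hypothetical, and `calloc()` stands in for the slab allocator.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down structure; the real mm_struct has many fields. */
struct sketch_mm_bitmap {
    struct {                    /* everything else sits in an anonymous struct */
        int users;
        unsigned long flags;
    };
    unsigned long cpu_bitmap[]; /* flexible array member, must come last */
};

/* Bytes needed for a bitmap with one bit per possible CPU. */
static size_t sketch_cpumask_size(unsigned int nr_cpu_ids)
{
    size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
    size_t longs = (nr_cpu_ids + bits_per_long - 1) / bits_per_long;

    return longs * sizeof(unsigned long);
}

/* Object size: the fixed part plus the dynamically sized bitmap. */
static size_t sketch_mm_size(unsigned int nr_cpu_ids)
{
    return sizeof(struct sketch_mm_bitmap) + sketch_cpumask_size(nr_cpu_ids);
}

/* Allocate one zeroed object that is large enough for nr_cpu_ids CPUs. */
static struct sketch_mm_bitmap *sketch_mm_alloc(unsigned int nr_cpu_ids)
{
    return calloc(1, sketch_mm_size(nr_cpu_ids));
}
```

Because `sizeof` does not count the flexible array member, the bitmap size has to be added explicitly, which mirrors the "size of the mm_struct (excluding the CPU bitmap) and the size of the cpumask" computation in the commit text.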
     

08 Jun, 2018

1 commit

  • mmap_sem is on the hot path of the kernel and is heavily contended, but
    it is also abused. It is used to protect arg_start|end and env_start|end
    when reading /proc/$PID/cmdline and /proc/$PID/environ, which makes
    little sense: those proc files just expect to read 4 values atomically,
    the values are not related to the VM, and they can be set to arbitrary
    values by C/R.

    Moreover, mmap_sem contention may cause unexpected issues like the one
    below:

    INFO: task ps:14018 blocked for more than 120 seconds.
    Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    ps D 0 14018 1 0x00000004
    Call Trace:
    schedule+0x36/0x80
    rwsem_down_read_failed+0xf0/0x150
    call_rwsem_down_read_failed+0x18/0x30
    down_read+0x20/0x40
    proc_pid_cmdline_read+0xd9/0x4e0
    __vfs_read+0x37/0x150
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xc5

    Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
    for them to mitigate the abuse of mmap_sem.

    So, introduce a new spinlock in mm_struct to protect concurrent access
    to arg_start|end, env_start|end and others. In addition, take mmap_sem
    for read instead of for write to protect against the race between prctl
    and sys_brk, which might break check_data_rlimit(); this makes prctl
    friendlier to other VM operations.

    This patch just eliminates the abuse of mmap_sem, but it can't resolve
    the above hung task warning completely, since the later
    access_remote_vm() call still needs to acquire mmap_sem. The mmap_sem
    scalability issue will be solved in the future.

    [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
    Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
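
The idea of a small dedicated lock for the argument and environment boundaries, as described in the commit above, can be sketched like this. It is a userspace illustration only: the struct and the `sketch_` functions are hypothetical, and a C11 `atomic_flag` busy-wait lock stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical, trimmed-down mm_struct holding only the fields discussed. */
struct sketch_mm_args {
    atomic_flag arg_lock; /* dedicated lock; stands in for a kernel spinlock */
    unsigned long arg_start, arg_end, env_start, env_end;
};

struct sketch_arg_snapshot {
    unsigned long arg_start, arg_end, env_start, env_end;
};

static void sketch_spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ; /* busy-wait until the holder clears the flag */
}

static void sketch_spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Writer (for example a prctl-like update): all four values change together. */
static void sketch_set_args(struct sketch_mm_args *mm,
                            const struct sketch_arg_snapshot *new_values)
{
    sketch_spin_lock(&mm->arg_lock);
    mm->arg_start = new_values->arg_start;
    mm->arg_end = new_values->arg_end;
    mm->env_start = new_values->env_start;
    mm->env_end = new_values->env_end;
    sketch_spin_unlock(&mm->arg_lock);
}

/* Reader (for example a /proc read): copy the four values under the lock. */
static struct sketch_arg_snapshot sketch_get_args(struct sketch_mm_args *mm)
{
    struct sketch_arg_snapshot snap;

    sketch_spin_lock(&mm->arg_lock);
    snap.arg_start = mm->arg_start;
    snap.arg_end = mm->arg_end;
    snap.env_start = mm->env_start;
    snap.env_end = mm->env_end;
    sketch_spin_unlock(&mm->arg_lock);
    return snap;
}
```

The reader holds the lock only long enough to copy four words, so it never has to wait behind long-running address-space operations for this step.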
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

23 Nov, 2016

1 commit

  • During exec dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only necessary userspace breakage by adding
    a user namespace owner to mm_struct, captured at the time of exec, so
    it is clear in which user namespace CAP_SYS_PTRACE must be present in
    to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_PTRACE in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes its cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag to note that whatever permission changes
    the task might go through the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same as, or
    a descendant of, mm->user_ns, which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns is the worst case for the task's
    credentials.

    To prevent regressions, mm->dumpable and mm->user_ns are not considered
    when a task has no mm, as simply failing ptrace_may_attach causes
    regressions in privileged applications attempting to read things
    such as /proc/<pid>/stat.

    Cc: stable@vger.kernel.org
    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
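
Why checking against the namespace captured at exec time is stricter can be shown with a small model. This is a heavily simplified sketch, not the kernel's capability logic: all types and `sketch_` functions are hypothetical, and it assumes that a tracer holding the capability in a namespace holds it over that namespace and everything below it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified user namespace: only the parent link. */
struct sketch_user_ns {
    const struct sketch_user_ns *parent; /* NULL for the initial namespace */
};

struct sketch_task {
    const struct sketch_user_ns *cred_ns; /* can move to a child after exec */
    const struct sketch_user_ns *mm_ns;   /* captured at exec time */
};

/* True if ns is the same as, or a descendant of, ancestor. */
static bool sketch_ns_within(const struct sketch_user_ns *ns,
                             const struct sketch_user_ns *ancestor)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == ancestor)
            return true;
    return false;
}

/* Older style of check: compare against the task's current credentials. */
static bool sketch_check_cred_ns(const struct sketch_user_ns *tracer_ns,
                                 const struct sketch_task *task)
{
    return sketch_ns_within(task->cred_ns, tracer_ns);
}

/* Check described above: compare against the namespace recorded in the mm. */
static bool sketch_check_mm_ns(const struct sketch_user_ns *tracer_ns,
                               const struct sketch_task *task)
{
    return sketch_ns_within(task->mm_ns, tracer_ns);
}
```

For a task that executed in a parent namespace and then moved its credentials into a child, a tracer privileged only in the child passes the first check but fails the second, which is the situation the commit closes off.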
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
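
For readers unfamiliar with the helper named in the commit above, the semantics of an "increment unless zero" operation can be sketched with C11 atomics. This is a userspace illustration with a hypothetical name, not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. Returns true if the increment happened.
 * Typical use: take a reference only while the object is still alive.
 */
static bool sketch_atomic_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false; /* the counter was zero, so it is left untouched */
}
```

Since the logic is a plain compare-and-exchange loop with nothing architecture-specific in it, it is a natural candidate for a single shared definition.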
     

25 May, 2011

1 commit

  • cpumask_t is a very big struct and cpu_vm_mask is placed in the wrong
    position, which might reduce the cache hit ratio.

    This patch makes two changes:
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of the cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. This may help to reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var, which
    may help to detect out-of-tree cpu_vm_mask users.

    This patch makes no functional change.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

10 Aug, 2010

1 commit

  • Provide an INIT_MM_CONTEXT initializer macro which can be used to
    statically initialize mm_struct:mm_context of init_mm. This way we can
    get rid of code which will do the initialization at run time (on s390).

    In addition the current code can be found at a place where it is not
    expected. So let's have a common initializer which architectures
    can use if needed.

    This is based on a patch from Suzuki Poulose.

    Signed-off-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Suzuki Poulose
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
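
The pattern of an optional, architecture-supplied static initializer described in the commit above can be sketched as follows. The struct layout, field names and the value 42 are hypothetical; only the shape of the macro (an initializer fragment that defaults to nothing) is the point.

```c
/* Hypothetical architecture-specific context embedded in the mm. */
struct sketch_mm_context {
    int asce_limit;
    int noexec;
};

/* Hypothetical, trimmed-down mm_struct. */
struct sketch_mm_ctx {
    int mm_users;
    struct sketch_mm_context context;
};

/*
 * An architecture that needs static setup would provide this fragment.
 * It expands to one designated initializer followed by a comma.
 */
#define SKETCH_INIT_MM_CONTEXT(name) .context = { .asce_limit = 42 },

/* Generic fallback: architectures that define nothing get an empty fragment. */
#ifndef SKETCH_INIT_MM_CONTEXT
#define SKETCH_INIT_MM_CONTEXT(name)
#endif

/* The context is filled in at compile time; no run-time init code is needed. */
static struct sketch_mm_ctx sketch_init_mm = {
    .mm_users = 2,
    SKETCH_INIT_MM_CONTEXT(sketch_init_mm)
};
```

Fields that the fragment does not mention, such as `noexec` here, are zero-initialized by the usual C rules for partially initialized aggregates.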
     

17 Jun, 2009

1 commit

  • * create mm/init-mm.c, move init_mm there
    * remove INIT_MM, initialize init_mm with C99 initializer
    * unexport init_mm on all arches:

    init_mm is already unexported on x86.

    One strange place is some OMAP driver (drivers/video/omap/) which
    won't build as a module, but it already wants the get_vm_area() export.
    Somebody should look there.

    [akpm@linux-foundation.org: add missing #includes]
    Signed-off-by: Alexey Dobriyan
    Cc: Mike Frysinger
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan