26 Sep, 2019

1 commit

  • This patch is a part of a series that extends kernel ABI to allow to pass
    tagged user pointers (with the top byte set to something else other than
    0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixed data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

3 commits

  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as macro that does not use the 'mm'
    parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.comLink: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represent what is happening to the CPU page table. See the
    patch which introduced the events to see the rational behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table update can happens for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of mmu notifier API track changes to the CPU page table and take
    specific action for them. While current API only provide range of virtual
    address affected by the change, not why the changes is happening.

    This patchset do the initial mechanical convertion of all the places that
    calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
    event as well as the vma if it is know (most invalidation happens against
    a given vma). Passing down the vma allows the users of mmu notifier to
    inspect the new vma page protection.

    The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
    should assume that every for the range is going away when that event
    happens. A latter patch do convert mm call path to use a more appropriate
    events for each call.

    This is done as 2 patches so that no call site is forgotten especialy
    as it uses this following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

2 commits

  • Architectures like ppc64 require to do a conditional tlb flush based on
    the old and new value of pte. Enable that by passing old pte value as
    the arg.

    Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "NestMMU pte upgrade workaround for mprotect", v5.

    We can upgrade pte access (R -> RW transition) via mprotect. We need to
    make sure we follow the recommended pte update sequence as outlined in
    commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
    handle nest MMU hang") for such updates. This patch series does that.

    This patch (of 5):

    Some architectures may want to call flush_tlb_range from these helpers.

    Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Nicholas Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites everytime we want to add a
    parameter use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end cakks. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

21 Jun, 2018

1 commit

  • For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
    table entry. This sets the high bits in the CPU's address space, thus
    making sure to point to not point an unmapped entry to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such an high mapping was mapped to unprivileged users
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straight forward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is a uncommon case use a separate early page talk
    walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non MMIO and non PROT_NONE there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen

    Andi Kleen
     

12 Apr, 2018

1 commit

  • change_pte_range is called from task work context to mark PTEs for
    receiving NUMA faulting hints. If the marked pages are dirty then
    migration may fail. Some filesystems cannot migrate dirty pages without
    blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
    Even when they can, it can be a waste of cycles when the pages are
    shared forcing higher scan rates. This patch avoids marking shared
    dirty pages for hinting faults but also will skip a migration if the
    page was dirtied after the scanner updated a clean page.

    This is most noticeable running the NASA Parallel Benchmark when backed
    by btrfs, the default root filesystem for some distributions, but also
    noticeable when using XFS.

    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1
    Time cg.D 459.07 ( 0.00%) 444.21 ( 3.24%)
    Time ep.D 76.96 ( 0.00%) 77.69 ( -0.95%)
    Time is.D 25.55 ( 0.00%) 27.85 ( -9.00%)
    Time lu.D 601.58 ( 0.00%) 596.87 ( 0.78%)
    Time mg.D 107.73 ( 0.00%) 108.22 ( -0.45%)

    is.D regresses slightly in terms of absolute time but note that that
    particular load varies quite a bit from run to run. The more relevant
    observation is the total system CPU usage.

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1
    User 71471.91 70627.04
    System 11078.96 8256.13
    Elapsed 661.66 632.74

    That is a substantial drop in system CPU usage and overall the workload
    completes faster. The NUMA balancing statistics are also interesting

    NUMA base PTE updates 111407972 139848884
    NUMA huge PMD updates 206506 264869
    NUMA page range updates 217139044 275461812
    NUMA hint faults 4300924 3719784
    NUMA hint local faults 3012539 3416618
    NUMA hint local percent 70 91
    NUMA pages migrated 1517487 1358420

    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.

    The following is a comparison when using XFS as that is a more realistic
    filesystem choice for a data partition

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1r47
    Time cg.D 485.28 ( 0.00%) 442.62 ( 8.79%)
    Time ep.D 77.68 ( 0.00%) 77.54 ( 0.18%)
    Time is.D 26.44 ( 0.00%) 24.79 ( 6.24%)
    Time lu.D 597.46 ( 0.00%) 597.11 ( 0.06%)
    Time mg.D 142.65 ( 0.00%) 105.83 ( 25.81%)

    That is a reasonable gain on two relatively long-lived workloads. While
    not presented, there is also a substantial drop in system CPu usage and
    the NUMA balancing stats show similar improvements in locality as btrfs
    did.

    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Mar, 2018

2 commits

  • When protection bits are changed on a VMA, some of the architecture
    specific flags should be cleared as well. An examples of this are the
    PKEY flags on x86. This patch expands the current code that clears
    PKEY flags for x86, to support similar functionality for other
    architectures as well.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • A protection flag may not be valid across entire address space and
    hence arch_validate_prot() might need the address a protection bit is
    being set on to ensure it is a valid protection flag. For example, sparc
    processors support memory corruption detection (as part of ADI feature)
    flag on memory addresses mapped on to physical RAM but not on PFN mapped
    pages or addresses mapped on to devices. This patch adds address to the
    parameters being passed to arch_validate_prot() so protection bits can
    be validated in the relevant context.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

01 Feb, 2018

1 commit

  • Workloads consisting of a large number of processes running the same
    program with a very large shared data segment may experience performance
    problems when numa balancing attempts to migrate the shared cow pages.
    This manifests itself with many processes or tasks in
    TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.

    The program listed below simulates the conditions with these results
    when run with 288 processes on a 144 core/8 socket machine.

    Average throughput Average throughput Average throughput
    with numa_balancing=0 with numa_balancing=1 with numa_balancing=1
    without the patch with the patch
    --------------------- --------------------- ---------------------
    2118782 2021534 2107979

    Complex production environments show less variability and fewer poorly
    performing outliers accompanied with a smaller number of processes
    waiting on NUMA page migration with this patch applied. In some cases,
    %iowait drops from 16%-26% to 0.

    // SPDX-License-Identifier: GPL-2.0
    /*
    * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
    */
    #include
    #include
    #include
    #include

    int a[1000000] = {13};

    int main(int argc, const char **argv)
    {
    int n = 0;
    int i;
    pid_t pid;
    int stat;
    int *count_array;
    int cpu_count = 288;
    long total = 0;

    struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

    if (argc > 2)
    cpu_count = atoi(argv[2]);

    count_array = mmap(NULL, cpu_count * sizeof(int),
    (PROT_READ|PROT_WRITE),
    (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

    if (count_array == MAP_FAILED) {
    perror("mmap:");
    return 0;
    }

    for (i = 0; i < cpu_count; ++i) {
    pid = fork();
    if (pid < 0)
    break;
    }

    for (i = 0; i < cpu_count; ++i)
    total += count_array[i];

    printf("Total %ld\n", total);
    munmap(count_array, cpu_count * sizeof(int));
    return 0;
    }

    gettimeofday(&t1, 0);
    timeradd(&t1, &t2, &t1);
    while (timercmp(&t2, &t1, < 1000000; j++)
    b += a[j];
    gettimeofday(&t2, 0);
    n++;
    }
    count_array[i] = n;
    return 0;
    }

    This patch changes change_pte_range() to skip shared copy-on-write pages
    when called from change_prot_numa().

    NOTE: change_prot_numa() is nominally called from task_numa_work() and
    queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing
    path, and queue_pages_test_walk() is part of explicit NUMA policy
    management. However, queue_pages_test_walk() only calls
    change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
    not allowed, so change_prot_numa() is only called from auto NUMA
    balancing.

    In the case of explicit NUMA policy management, shared pages are not
    migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
    depends on CAP_SYS_NICE. Currently, there is no way to pass information
    about MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed
    if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
    migration mode.

    task_numa_work() skips the read-only VMAs of programs and shared
    libraries.

    Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
    Signed-off-by: Henry Willard
    Reviewed-by: Håkon Bugge
    Reviewed-by: Steve Sistare
    Acked-by: Mel Gorman
    Cc: Kate Stewart
    Cc: Zi Yan
    Cc: Philippe Ombredanne
    Cc: Andrea Arcangeli
    Cc: Greg Kroah-Hartman
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Henry Willard
     

05 Jan, 2018

1 commit

  • While testing on a large CPU system, detected the following RCU stall
    many times over the span of the workload. This problem is solved by
    adding a cond_resched() in the change_pmd_range() function.

    INFO: rcu_sched detected stalls on CPUs/tasks:
    154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
    (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
    Sending NMI from CPU 955 to CPUs 154:
    NMI backtrace for cpu 154
    CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
    NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
    REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+)
    MSR: 8000000000009033 CR: 22422082 XER: 00000000
    CFAR: 00000000006cf8f0 SOFTE: 1
    GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
    GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
    GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
    GPR12: ffffffffffffffff c00000000fb25100
    NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
    LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
    Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
    Instruction dump:
    60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
    e9410060 e9610068 e9810070 44000022 e9810028 f88c0000 f8ac0008

    Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Nicholas Piggin
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

  • HMM (heterogeneous memory management) need struct page to support
    migration from system main memory to device memory. Reasons for HMM and
    migration to device memory is explained with HMM core patch.

    This patch deals with device memory that is un-addressable memory (ie CPU
    can not access it). Hence we do not want those struct page to be manage
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is define for existing user of ZONE_DEVICE and a
    new device un-addressable type is added for the un-addressable memory
    type. There is a clear separation between what is expected from each
    memory type and existing user of ZONE_DEVICE are un-affected by new
    requirement and new use of the un-addressable type. All specific code
    path are protect with test against the memory type.

    Because memory is un-addressable we use a new special swap type for when a
    page is migrated to device memory (this reduces the number of maximum swap
    file).

    The main two additions beside memory type to ZONE_DEVICE is two callbacks.
    First one, page_free() is call whenever page refcount reach 1 (which
    means the page is free as ZONE_DEVICE page never reach a refcount of 0).
    This allow device driver to manage its memory and associated struct page.

    The second callback page_fault() happens when there is a CPU access to an
    address that is back by a device page (which are un-addressable by the
    CPU). This callback is responsible to migrate the page back to system
    main memory. Device driver can not block migration back to system memory,
    HMM make sure that such page can not be pin into device memory.

    If device is in some error condition and can not migrate memory back then
    a CPU page fault to device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

11 Aug, 2017

1 commit

  • Patch series "fixes of TLB batching races", v6.

    It turns out that Linux TLB batching mechanism suffers from various
    races. Races that are caused due to batching during reclamation were
    recently handled by Mel and this patch-set deals with others. The more
    fundamental issue is that concurrent updates of the page-tables allow
    for TLB flushes to be batched on one core, while another core changes
    the page-tables. This other core may assume a PTE change does not
    require a flush based on the updated PTE value, while it is unaware that
    TLB flushes are still pending.

    This behavior affects KSM (which may result in memory corruption) and
    MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
    proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
    Memory corruption in KSM is harder to produce in practice, but was
    observed by hacking the kernel and adding a delay before flushing and
    replacing the KSM page.

    Finally, there is also one memory barrier missing, which may affect
    architectures with weak memory model.

    This patch (of 7):

    Setting and clearing mm->tlb_flush_pending can be performed by multiple
    threads, since mmap_sem may only be acquired for read in
    task_numa_work(). If this happens, tlb_flush_pending might be cleared
    while one of the threads still changes PTEs and batches TLB flushes.

    This can lead to the same race between migration and
    change_protection_range() that led to the introduction of
    tlb_flush_pending. The result of this race was data corruption, which
    means that this patch also addresses a theoretically possible data
    corruption.

    An actual data corruption was not observed, yet the race was was
    confirmed by adding assertion to check tlb_flush_pending is not set by
    two threads, adding artificial latency in change_protection_range() and
    using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

    Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
    Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
    change_protection_range")
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoritical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0 CPU1
    ---- ----
    user accesses memory using RW PTE
    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
    mprotect(addr, PROT_READ)
    ==> change_pte_range()
    ==> [ PTE non-present - no flush ]

    user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoritically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup as a race with
    reclaim looks just at PTEs. huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jul, 2017

1 commit

  • pte_offset_map_lock() finds and takes ptl, and returns pte. But some
    callers return without unlocking the ptl when pte == NULL, which seems
    weird.

    Git history said that !pte check in change_pte_range() was introduced in
    commit 1ad9f620c3a2 ("mm: numa: recheck for transhuge pages under lock
    during protection changes") and still remains after commit 175ad4f1e7a2
    ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock")
    which partially reverts 1ad9f620c3a2. So I think that it's just dead
    code.

    Many other caller of pte_offset_map_lock() never check NULL return, so
    let's do likewise.

    Link: http://lkml.kernel.org/r/1495089737-1292-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

10 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • Patch series "Numabalancing preserve write fix", v2.

    This patch series address an issue w.r.t THP migration and autonuma
    preserve write feature. migrate_misplaced_transhuge_page() cannot deal
    with concurrent modification of the page. It does a page copy without
    following the migration pte sequence. IIUC, this was done to keep the
    migration simpler and at the time of implemenation we didn't had THP
    page cache which would have required a more elaborate migration scheme.
    That means thp autonuma migration expect the protnone with saved write
    to be done such that both kernel and user cannot update the page
    content. This patch series enables archs like ppc64 to do that. We are
    good with the hash translation mode with the current code, because we
    never create a hardware page table entry for a protnone pte.

    This patch (of 2):

    Autonuma preserves the write permission across numa fault to avoid
    taking a writefault after a numa fault (Commit: b191f9b106ea " mm: numa:
    preserve PTE write permissions across a NUMA hinting fault").
    Architecture can implement protnone in different ways and some may
    choose to implement that by clearing Read/ Write/Exec bit of pte.
    Setting the write bit on such pte can result in wrong behaviour. Fix
    this up by allowing arch to override how to save the write bit on a
    protnone pte.

    [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable]
    Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

23 Feb, 2017

1 commit

  • pmd_trans_unstable does an atomic read on the pmd so it doesn't require
    the pmd_lock for the same check.

    This also removes the special assumption that the mmap_sem is hold for
    writing if prot_numa is not set. userfaultfd will hold the mmap_sem
    only for reading in change_pte_range like prot_numa, but it will not set
    prot_numa.

    This is always a valid micro-optimization regardless of userfaultfd.

    [kirill@shutemov.name: drop unneeded pmd_trans_unstable(pmd) check after __split_huge_pmd()]
    Link: http://lkml.kernel.org/r/20170208120421.GE5578@node.shutemov.name
    Link: http://lkml.kernel.org/r/20161216144821.5183-43-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Dec, 2016

1 commit


13 Dec, 2016

3 commits

  • Having code for the pkey_mprotect, pkey_alloc and pkey_free system calls
    makes only sense if ARCH_HAS_PKEYS is selected. If not selected these
    system calls will always return -ENOSPC or -EINVAL.

    To simplify things and have less code generate the pkey system call code
    only if ARCH_HAS_PKEYS is selected.

    For architectures which have already wired up the system calls, but do
    not select ARCH_HAS_PKEYS this will result in less generated code and a
    different return code: the three system calls will now always return
    -ENOSYS, using the cond_syscall mechanism.

    For architectures which have not wired up the system calls less
    unreachable code will be generated.

    Link: http://lkml.kernel.org/r/20161114111251.70084-1-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Acked-by: Dave Hansen
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While doing MADV_DONTNEED on a large area of thp memory, I noticed we
    encountered many unlikely() branches in profiles for each backing
    hugepage. This is because zap_pmd_range() would call split_huge_pmd(),
    which rechecked the conditions that were already validated, but as part
    of an unlikely() branch.

    Avoid the unlikely() branch when in a context where pmd is known to be
    good for __split_huge_pmd() directly.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We had some problems with pages getting unmapped in single threaded
    affinitized processes. It was tracked down to NUMA scanning.

    In this case it doesn't make any sense to unmap pages if the process is
    single threaded and the page is already on the node the process is
    running on.

    Add a check for this case into the numa protection code, and skip
    unmapping if true.

    In theory the process could be migrated later, but we will eventually
    rescan and unmap and migrate then.

    In theory this could be made more fancy: remembering this state per
    process or even whole mm. However that would need extra tracking and be
    more complicated, and the simple check seems to work fine so far.

    [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
    Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
    Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Andi Kleen
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

19 Oct, 2016

1 commit


11 Oct, 2016

1 commit

  • Pull protection keys syscall interface from Thomas Gleixner:
    "This is the final step of Protection Keys support which adds the
    syscalls so user space can actually allocate keys and protect memory
    areas with them. Details and usage examples can be found in the
    documentation.

    The mm side of this has been acked by Mel"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/pkeys: Update documentation
    x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
    x86/pkeys: Fix pkeys build breakage for some non-x86 arches
    x86/pkeys: Add self-tests
    x86/pkeys: Allow configuration of init_pkru
    x86/pkeys: Default to a restrictive init PKRU
    pkeys: Add details of system call use to Documentation/
    generic syscalls: Wire up memory protection keys syscalls
    x86: Wire up protection keys system calls
    x86/pkeys: Allocation/free syscalls
    x86/pkeys: Make mprotect_key() mask off additional vm_flags
    mm: Implement new pkey_mprotect() system call
    x86/pkeys: Add fault handling for PF_PK page fault bit

    Linus Torvalds
     

08 Oct, 2016

2 commits

  • The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations). So it's not safe to wait the caller to update
    the vm_page_prot/vm_flags after vma_merge returned potentially removing
    the "next" vma and extending the "current" vma over the
    next->vm_start,vm_end range, but still with the "current" vma
    vm_page_prot, after releasing the rmap locks.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to the
    current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during migrate
    as remove_migration_ptes when run on a address of the "next" vma that
    got removed, used the vm_page_prot of the current vma.

    migrate mprotect
    ------------ -------------
    migrating in "next" vma
    vma_merge() # removes "next" vma and
    # extends "current" vma
    # current vma is not with
    # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
    vm_set_page_prot(vma) # too late!
    change_protection in the old vma range
    only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
    which confirms the case 8 is only buggy one where the race can trigger,
    in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and it converts it from:

    AAAA
    PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

    AAAA
    PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefits
    that the callers could stop updating vma properties when vma_merge
    succeeds however the callers are not updated by this patch (there are
    bits like VM_SOFTDIRTY that still need special care for the whole range,
    as the vma merging ignores them, but as long as they're not processed by
    rmap walks and instead they're accessed with the mmap_sem at least for
    reading, they are fine not to be updated within vma_adjust before
    releasing the rmap_locks).

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
    concurrently and this prevents the risk of reading intermediate values.

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Sep, 2016

3 commits

  • This patch adds two new system calls:

    int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
    int pkey_free(int pkey);

    These implement an "allocator" for the protection keys
    themselves, which can be thought of as analogous to the allocator
    that the kernel has for file descriptors. The kernel tracks
    which numbers are in use, and only allows operations on keys that
    are valid. A key which was not obtained by pkey_alloc() may not,
    for instance, be passed to pkey_mprotect().

    These system calls are also very important given the kernel's use
    of pkeys to implement execute-only support. These help ensure
    that userspace can never assume that it has control of a key
    unless it first asks the kernel. The kernel does not promise to
    preserve PKRU (right register) contents except for allocated
    pkeys.

    The 'init_access_rights' argument to pkey_alloc() specifies the
    rights that will be established for the returned pkey. For
    instance:

    pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

    will allocate 'pkey', but also sets the bits in PKRU[1] such that
    writing to 'pkey' is already denied.

    The kernel does not prevent pkey_free() from successfully freeing
    in-use pkeys (those still assigned to a memory range by
    pkey_mprotect()). It would be expensive to implement the checks
    for this, so we instead say, "Just don't do it" since sane
    software will never do it anyway.

    Any piece of userspace calling pkey_alloc() needs to be prepared
    for it to fail. Why? pkey_alloc() returns the same error code
    (ENOSPC) when there are no pkeys and when pkeys are unsupported.
    They can be unsupported for a whole host of reasons, so apps must
    be prepared for this. Also, libraries or LD_PRELOADs might steal
    keys before an application gets access to them.

    This allocation mechanism could be implemented in userspace.
    Even if we did it in userspace, we would still need additional
    user/kernel interfaces to tell userspace which keys are being
    used by the kernel internally (such as for execute-only
    mappings). Having the kernel provide this facility completely
    removes the need for these additional interfaces, or having an
    implementation of this in userspace at all.

    Note that we have to make changes to all of the architectures
    that do not use mman-common.h because we use the new
    PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

    1. PKRU is the Protection Key Rights User register. It is a
    usermode-accessible register that controls whether writes
    and/or access to each individual pkey is allowed or denied.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
    Three of those bits: READ/WRITE/EXEC get translated directly in to
    vma->vm_flags by calc_vm_prot_bits(). If a bit is unset in
    mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
    during the mprotect() call.

    We do this clearing today by first calculating the VMA flags we
    want set, then clearing the ones we do not want to inherit from
    the original VMA:

    vm_flags = calc_vm_prot_bits(prot, key);
    ...
    newflags = vm_flags;
    newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

    However, we *also* want to mask off the original VMA's vm_flags in
    which we store the protection key.

    To do that, this patch adds a new macro:

    ARCH_VM_PKEY_FLAGS

    which allows the architecture to specify additional bits that it would
    like cleared. We use that to ensure that the VM_PKEY_BIT* bits get
    cleared.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • pkey_mprotect() is just like mprotect, except it also takes a
    protection key as an argument. On systems that do not support
    protection keys, it still works, but requires that key=0.
    Otherwise it does exactly what mprotect does.

    I expect it to get used like this, if you want to guarantee that
    any mapping you create can *never* be accessed without the right
    protection keys set up.

    int real_prot = PROT_READ|PROT_WRITE;
    pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
    ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

    This way, there is *no* window where the mapping is accessible
    since it was always either PROT_NONE or had a protection key set
    that denied all access.

    We settled on 'unsigned long' for the type of the key here. We
    only need 4 bits on x86 today, but I figured that other
    architectures might need some more space.

    Semantically, we have a bit of a problem if we combine this
    syscall with our previously-introduced execute-only support:
    What do we do when we mix execute-only pkey use with
    pkey_mprotect() use? For instance:

    pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
    mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
    mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

    To solve that, we make the plain-mprotect()-initiated execute-only
    support only apply to VMAs that have the default protection key (0)
    set on them.

    Proposed semantics:
    1. protection key 0 is special and represents the default,
    "unassigned" protection key. It is always allocated.
    2. mprotect() never affects a mapping's pkey_mprotect()-assigned
    protection key. A protection key of 0 (even if set explicitly)
    represents an unassigned protection key.
    2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
    key may or may not result in a mapping with execute-only
    properties. pkey_mprotect() plus pkey_set() on all threads
    should be used to _guarantee_ execute-only semantics if this
    is not a strong enough semantic.
    3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
    kernel will internally attempt to allocate and dedicate a
    protection key for the purpose of execute-only mappings. This
    may not be possible in cases where there are no free protection
    keys available. It can also happen, of course, in situations
    where there is no hardware support for protection keys.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

27 Jul, 2016

1 commit

  • split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 May, 2016

1 commit

  • This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read we would really appreciate if a holder for
    write didn't stood in the way. This patchset is changing many of
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is help from a
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to the userspace as
    the task has fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the mmap_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2016

1 commit

  • The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC
    binary on a memory mapped file located on non-exec fs. The mprotect
    does not check whether fs is _executable_ or not. The PROT_EXEC flag is
    set automatically even if a memory mapped file is located on non-exec
    fs. Fix it by checking whether a memory mapped file is located on a
    non-exec fs. If so the PROT_EXEC is not implied by the PROT_READ. The
    implementation uses the VM_MAYEXEC flag set properly in mmap. Now it is
    consistent with mmap.

    I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs). I
    also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
    seems to work.

    Signed-off-by: Piotr Kwapulinski
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Konstantin Khlebnikov
    Cc: Dan Williams
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     

19 Feb, 2016

2 commits

  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

    pkru = rdpkru()
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest avoiding ever calling
    mmap() or mprotect() when the PKRU value is expected to be
    unstable.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen