29 Dec, 2018

1 commit

  • commit ea5751ccd665a2fd1b24f9af81f6167f0718c5f6 upstream.

    proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
    corresponding sysctl table is being unregistered. In our case we see
    this upon opening /proc/sys/net/*/conf files while network interfaces
    are being deleted, which confuses our configuration daemon.

    The problem was successfully reproduced and this fix tested on v4.9.122
    and v4.20-rc6.

    v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
    of mixing them with NULL. Thanks Al Viro for the feedback.

    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ivan Delalande
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Ivan Delalande
     

14 Nov, 2018

1 commit

  • commit fa76da461bb0be13c8339d984dcf179151167c8f upstream.

    Leonardo reports an apparent regression in 4.19-rc7:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631
    Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018
    RIP: 0010:smaps_pte_range+0x32d/0x540
    Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8
    RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001
    RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578
    R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728
    R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48
    FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __walk_page_range+0x3c2/0x6f0
    walk_page_vma+0x42/0x60
    smap_gather_stats+0x79/0xe0
    ? gather_pte_stats+0x320/0x320
    ? gather_hugetlb_stats+0x70/0x70
    show_smaps_rollup+0xcd/0x1c0
    seq_read+0x157/0x400
    __vfs_read+0x3a/0x180
    ? security_file_permission+0x93/0xc0
    ? security_file_permission+0x93/0xc0
    vfs_read+0x8f/0x140
    ksys_read+0x55/0xc0
    __x64_sys_read+0x1a/0x20
    do_syscall_64+0x5a/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Decoded code matched to local compilation+disassembly points to
    smaps_pte_entry():

    } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
                        && pte_none(*pte))) {
            page = find_get_entry(vma->vm_file->f_mapping,
                                  linear_page_index(vma, addr));

    Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that
    case; however, for smaps_rollup, smap_gather_stats() can set the flag true
    for one vma and leave it true for subsequent vmas where it should be
    false.

    To fix, reset the check_shmem_swap flag to false. There's also a related
    bug which sets mss->swap to shmem_swapped, which in the context of
    smaps_rollup overwrites any value accumulated from previous vmas. Fix
    that as well.

    Note that the report suggests a regression between 4.17.19 and 4.19-rc7,
    which makes the 4.19 series ending with commit 258f669e7e88 ("mm:
    /proc/pid/smaps_rollup: convert to single value seq_file") suspicious.
    But the mss was reused for rollup since 493b0e9d945f ("mm: add
    /proc/pid/smaps_rollup") so let's play it safe with the stable backport.

    Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Vlastimil Babka
    Reported-by: Leonardo Soares Müller
    Tested-by: Leonardo Soares Müller
    Cc: Greg Kroah-Hartman
    Cc: Daniel Colascione
    Cc: Alexey Dobriyan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

10 Oct, 2018

1 commit

  • commit f8a00cef17206ecd1b30d3d9f99e10d9fa707aa7 upstream.

    Currently, you can use /proc/self/task/*/stack to cause a stack walk on
    a task you control while it is running on another CPU. That means that
    the stack can change under the stack walker. The stack walker does
    have guards against going completely off the rails and into random
    kernel memory, but it can interpret random data from your kernel stack
    as instruction pointers and stack pointers. This can cause exposure of
    kernel stack contents to userspace.

    Restrict the ability to inspect kernel stacks of arbitrary tasks to root
    in order to prevent a local attacker from exploiting racy stack unwinding
    to leak kernel task stack contents. See the added comment for a longer
    rationale.

    There don't seem to be any users of this userspace API that can't
    gracefully bail out if reading from the file fails. Therefore, I believe
    that this change is unlikely to break things. In the case that this patch
    does end up needing a revert, the next-best solution might be to fake a
    single-entry stack based on wchan.

    Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
    Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Ken Chen
    Cc: Will Deacon
    Cc: Laura Abbott
    Cc: Andy Lutomirski
    Cc: Catalin Marinas
    Cc: Josh Poimboeuf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H . Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

15 Sep, 2018

1 commit

  • [ Upstream commit df865e8337c397471b95f51017fea559bc8abb4a ]

    elf_kcore_store_hdr() uses __pa() to find the physical address of
    KCORE_RAM or KCORE_TEXT entries exported as program headers.

    This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
    not in the linear map.

    Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
    entries.

    Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
    Signed-off-by: James Morse
    Cc: Alexey Dobriyan
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     

03 Aug, 2018

1 commit

  • [ Upstream commit ab6ecf247a9321e3180e021a6a60164dee53ab2e ]

    In commit ab676b7d6fbf ("pagemap: do not leak physical addresses to
    non-privileged userspace"), /proc/PID/pagemap was restricted to be
    readable only by CAP_SYS_ADMIN to address a security issue.

    In commit 1c90308e7a77 ("pagemap: hide physical addresses from
    non-privileged users"), the restriction was relaxed to make
    /proc/PID/pagemap readable again, but with the physical addresses
    hidden from non-privileged users.

    But the swap entries are still readable by non-privileged users, which
    has security implications of its own. For example, for a page under
    migration, the swap entry carries physical address information. So, in
    this patch, the swap entries are hidden from non-privileged users too.

    Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
    Fixes: 1c90308e7a77 ("pagemap: hide physical addresses from non-privileged users")
    Signed-off-by: "Huang, Ying"
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrei Vagin
    Cc: Jerome Glisse
    Cc: Daniel Colascione
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

17 Jul, 2018

1 commit

  • commit e70cc2bd579e8a9d6d153762f0fe294d0e652ff0 upstream.

    Thomas reports:
    "While looking around in /proc on my v4.14.52 system I noticed that all
    processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
    memory than a regular user can usually lock with mlock().

    Commit 493b0e9d945f (in v4.14-rc1) seems to have changed the behavior
    of "Locked".

    Before that commit the code was like this. Notice the VM_LOCKED check.

    (vma->vm_flags & VM_LOCKED) ?
    (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

    After that commit Locked is now the same as Pss:

    (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));

    This looks like a mistake."

    Indeed, the commit has added mss->pss_locked with the correct value that
    depends on VM_LOCKED, but forgot to actually use it. Fix it.

    Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Vlastimil Babka
    Reported-by: Thomas Lindroth
    Cc: Alexey Dobriyan
    Cc: Daniel Colascione
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

21 Jun, 2018

3 commits

  • [ Upstream commit 3955333df9a50e8783d115613a397ae55d905080 ]

    The existing kcore code checks for bad addresses against __va(0) with
    the assumption that this is the lowest address on the system. This may
    not hold true on some systems (e.g. arm64) and produce overflows and
    crashes. Switch to using other functions to validate the address range.

    It's currently only seen on arm64 and it's not clear if anyone wants to
    use that particular combination on a stable release. So this is not
    urgent for stable.

    Link: http://lkml.kernel.org/r/20180501201143.15121-1-labbott@redhat.com
    Signed-off-by: Laura Abbott
    Tested-by: Dave Anderson
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Ingo Molnar
    Cc: Andi Kleen
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Laura Abbott
     
  • [ Upstream commit 2e0ad552f5f8cd0fda02bc45fcd2b89821c62fd1 ]

    task_dump_owner() has the following code:

        mm = task->mm;
        if (mm) {
                if (get_dumpable(mm) != SUID_DUMP_USER) {
                        uid = ...
                }
        }

    The check for ->mm is buggy -- a kernel thread might be borrowing the
    mm, and the inode will then go to some random uid:gid pair.

    Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     
  • [ Upstream commit 88c28f2469151b031f8cea9b28ed5be1b74a4172 ]

    The swap offset reported by /proc/<pid>/pagemap may be incorrect for
    PMD migration entries. If the addr passed into pagemap_pmd_range()
    isn't aligned with the PMD start address, the reported swap offset
    doesn't reflect this. And in the loop that reports information for each
    sub-page, the swap offset isn't increased accordingly, as it is for the
    PFN.

    This may happen after opening /proc/<pid>/pagemap and seeking to a page
    whose address doesn't align with a PMD start address. I have verified
    this with a simple test program.

    BTW: migration swap entries have PFN information, do we need to restrict
    whether to show them?

    [akpm@linux-foundation.org: fix typo, per Huang, Ying]
    Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrei Vagin
    Cc: Dan Williams
    Cc: "Jerome Glisse"
    Cc: Daniel Colascione
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

30 May, 2018

1 commit

  • [ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]

    proc_sys_link_fill_cache() does not take currently-unregistering sysctl
    tables into account, which might result in a page fault in
    sysctl_follow_link() - add a check to fix it.

    This bug has been present since v3.4.

    Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: Danilo Krummrich
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: "Luis R . Rodriguez"
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Danilo Krummrich
     

23 May, 2018

3 commits

  • commit e96f46ee8587607a828f783daa6eb5b44d25004d upstream

    The style for the 'status' file is CamelCase or this. _.

    Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Konrad Rzeszutek Wilk
     
  • commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee upstream

    For certain use cases it is desired to enforce mitigations so they cannot
    be undone afterwards. That's important for loader stubs which want to
    prevent a child from disabling the mitigation again. It will also be used
    for seccomp(). The extra preservation of the prctl state for SSB is a
    preparatory step for eBPF dynamic speculation control.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit fae1fa0fc6cca8beee3ab8ed71d54f9a78fa3f64 upstream

    As done with seccomp and no_new_privs, also show speculation flaw
    mitigation state in /proc/$pid/status.

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

19 May, 2018

1 commit

  • commit 7f7ccc2ccc2e70c6054685f5e3522efa81556830 upstream.

    proc_pid_cmdline_read() and environ_read() directly access the target
    process' VM to retrieve the command line and environment. If this
    process remaps these areas onto a file via mmap(), the requesting
    process may experience various issues such as extra delays if the
    underlying device is slow to respond.

    Let's simply refuse to access file-backed areas in these functions.
    For this we add a new FOLL_ANON gup flag that is passed to all calls
    to access_remote_vm(). The code already takes care of such failures
    (including unmapped areas). Accesses via /proc/pid/mem were not
    changed though.

    This was assigned CVE-2018-1120.

    Note for stable backports: the patch may apply to kernels prior to 4.11
    but silently miss one location; it must be checked that no call to
    access_remote_vm() keeps zero as the last argument.

    Reported-by: Qualys Security Advisory
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Willy Tarreau
     

26 Apr, 2018

2 commits

  • [ Upstream commit 595dd46ebfc10be041a365d0a3fa99df50b6ba73 ]

    Commit:

    df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")

    ... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
    However, accessing the vsyscall user page will cause an SMAP fault.

    Replacing memcpy() with copy_from_user() fixes this bug, but adding a
    common way to handle this sort of user page may be useful in the future.

    Currently, only the vsyscall page requires KCORE_USER.

    Signed-off-by: Jia Zhang
    Reviewed-by: Jiri Olsa
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jia Zhang
     
  • [ Upstream commit ac7f1061c2c11bb8936b1b6a94cdb48de732f7a4 ]

    Current code does:

    if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)

    However sscanf() is broken garbage.

    It silently accepts whitespace between format specifiers
    (did you know that?).

    It silently accepts valid strings which result in integer overflow.

    Do not use sscanf() for any even remotely reliable parsing code.

    OK
    # readlink '/proc/1/map_files/55a23af39000-55a23b05b000'
    /lib/systemd/systemd

    broken
    # readlink '/proc/1/map_files/ 55a23af39000-55a23b05b000'
    /lib/systemd/systemd

    broken
    # readlink '/proc/1/map_files/55a23af39000-55a23b05b000 '
    /lib/systemd/systemd

    very broken
    # readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000'
    /lib/systemd/systemd

    Andrei said:

    : This patch breaks criu. It was a bug in criu. And this bug is on a minor
    : path, which works when memfd_create() isn't available. It is a reason why
    : I ask to not backport this patch to stable kernels.
    :
    : In CRIU this bug can be triggered, only if this patch will be backported
    : to a kernel which version is lower than v3.16.

    Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Pavel Emelyanov
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     

17 Feb, 2018

1 commit

  • commit d0290bc20d4739b7a900ae37eb5d4cc3be2b393f upstream.

    Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext
    data") added a bounce buffer to avoid hardened usercopy checks. Copying
    to the bounce buffer was implemented with a simple memcpy() assuming
    that it is always valid to read from kernel memory iff the
    kern_addr_valid() check passed.

    A simple, but pointless, test case like "dd if=/proc/kcore of=/dev/null"
    can now easily crash the kernel, since the former exception handling on
    invalid kernel addresses no longer works.

    Also adding a kern_addr_valid() implementation wouldn't help here. Most
    architectures simply return 1 here, while a couple implemented a page
    table walk to figure out if something is mapped at the address in
    question.

    With DEBUG_PAGEALLOC active mappings are established and removed all the
    time, so that relying on the result of kern_addr_valid() before
    executing the memcpy() also doesn't work.

    Therefore simply use probe_kernel_read() to copy to the bounce buffer.
    This also allows read_kcore() to be simplified.

    At least on s390 this fixes the observed crashes and doesn't introduce
    warnings that were removed with df04abfd181a ("fs/proc/kcore.c: Add
    bounce buffer for ktext data"), even though the generic
    probe_kernel_read() implementation uses uaccess functions.

    While looking into this I'm also wondering if kern_addr_valid() could be
    completely removed...(?)

    Link: http://lkml.kernel.org/r/20171202132739.99971-1-heiko.carstens@de.ibm.com
    Fixes: df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Heiko Carstens
    Acked-by: Kees Cook
    Cc: Jiri Olsa
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

24 Jan, 2018

1 commit

  • commit 8bb2ee192e482c5d500df9f2b1b26a560bd3026f upstream.

    do_task_stat() accesses IP and SP of a task without bumping reference
    count of a stack (which became an entity with independent lifetime at
    some point).

    Steps to reproduce:

    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        setrlimit(RLIMIT_CORE, &(struct rlimit){});

        while (1) {
            char buf[64];
            char buf2[4096];
            pid_t pid;
            int fd;

            pid = fork();
            if (pid == 0) {
                *(volatile int *)0 = 0;
            }

            snprintf(buf, sizeof(buf), "/proc/%u/stat", pid);
            fd = open(buf, O_RDONLY);
            read(fd, buf2, sizeof(buf2));
            close(fd);

            waitpid(pid, NULL, 0);
        }
        return 0;
    }

    BUG: unable to handle kernel paging request at 0000000000003fd8
    IP: do_task_stat+0x8b4/0xaf0
    PGD 800000003d73e067 P4D 800000003d73e067 PUD 3d558067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1417 Comm: a.out Not tainted 4.15.0-rc8-dirty #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
    RIP: 0010:do_task_stat+0x8b4/0xaf0
    Call Trace:
    proc_single_show+0x43/0x70
    seq_read+0xe6/0x3b0
    __vfs_read+0x1e/0x120
    vfs_read+0x84/0x110
    SyS_read+0x3d/0xa0
    entry_SYSCALL_64_fastpath+0x13/0x6c
    RIP: 0033:0x7f4d7928cba0
    RSP: 002b:00007ffddb245158 EFLAGS: 00000246
    Code: 03 b7 a0 01 00 00 4c 8b 4c 24 70 4c 8b 44 24 78 4c 89 74 24 18 e9 91 f9 ff ff f6 45 4d 02 0f 84 fd f7 ff ff 48 8b 45 40 48 89 ef 8b 80 d8 3f 00 00 48 89 44 24 20 e8 9b 97 eb ff 48 89 44 24
    RIP: do_task_stat+0x8b4/0xaf0 RSP: ffffc90000607cc8
    CR2: 0000000000003fd8

    John Ogness said: for my tests I added an else case to verify that the
    race is hit and correctly mitigated.

    Link: http://lkml.kernel.org/r/20180116175054.GA11513@avx2
    Signed-off-by: Alexey Dobriyan
    Reported-by: "Kohli, Gaurav"
    Tested-by: John Ogness
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     

10 Jan, 2018

1 commit

  • commit 7d5905dc14a87805a59f3c5bf70173aac2bb18f8 upstream.

    After commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get()
    for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
    on x86 can be either the nominal CPU frequency (which is constant)
    or the frequency most recently requested by a scaling governor in
    cpufreq, depending on the cpufreq configuration. That is somewhat
    inconsistent and is different from what it was before 4.13, so in
    order to restore the previous behavior, make it report the current
    CPU frequency like the scaling_cur_freq sysfs file in cpufreq.

    To that end, modify the /proc/cpuinfo implementation on x86 to use
    aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
    registers, if available, and use their values to compute the CPU
    frequency to be reported as "cpu MHz".

    However, do that carefully enough to avoid accumulating delays that
    lead to unacceptable access times for /proc/cpuinfo on systems with
    many CPUs. Run aperfmperf_snapshot_khz() once on all CPUs
    asynchronously at the /proc/cpuinfo open time, add a single delay
    upfront (if necessary) at that point and simply compute the current
    frequency while running show_cpuinfo() for each individual CPU.

    Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
    the default delay between consecutive APERF and MPERF reads to 10 ms,
    which should be sufficient to get large enough numbers for the
    frequency computation in all cases.

    Fixes: 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Thomas Gleixner
    Tested-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

20 Dec, 2017

1 commit

  • [ Upstream commit c79dde629d2027ca80329c62854a7635e623d527 ]

    After rmmod 8250.ko, tty_kref_put() starts kwork (release_one_tty) to
    release the proc interface, and the kernel oopses when accessing
    driver->driver_name in proc_tty_unregister_driver().

    Using jprobe, I found that driver->driver_name points into 8250.ko:

        static struct uart_driver serial8250_reg = {
                .driver_name = "serial",
                ...

    Use the name in proc_dir_entry instead of driver->driver_name to fix
    the oops.

    Tested on linux 4.1.12:

    BUG: unable to handle kernel paging request at ffffffffa01979de
    IP: [] strchr+0x0/0x30
    PGD 1a0d067 PUD 1a0e063 PMD 851c1f067 PTE 0
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: ... ... [last unloaded: 8250]
    CPU: 7 PID: 116 Comm: kworker/7:1 Tainted: G O 4.1.12 #1
    Hardware name: Insyde RiverForest/Type2 - Board Product Name1, BIOS NE5KV904 12/21/2015
    Workqueue: events release_one_tty
    task: ffff88085b684960 ti: ffff880852884000 task.ti: ffff880852884000
    RIP: 0010:[] [] strchr+0x0/0x30
    RSP: 0018:ffff880852887c90 EFLAGS: 00010282
    RAX: ffffffff81a5eca0 RBX: ffffffffa01979de RCX: 0000000000000004
    RDX: ffff880852887d10 RSI: 000000000000002f RDI: ffffffffa01979de
    RBP: ffff880852887cd8 R08: 0000000000000000 R09: ffff88085f5d94d0
    R10: 0000000000000195 R11: 0000000000000000 R12: ffffffffa01979de
    R13: ffff880852887d00 R14: ffffffffa01979de R15: ffff88085f02e840
    FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffa01979de CR3: 0000000001a0c000 CR4: 00000000001406e0
    Stack:
    ffffffff812349b1 ffff880852887cb8 ffff880852887d10 ffff88085f5cd6c2
    ffff880852800a80 ffffffffa01979de ffff880852800a84 0000000000000010
    ffff88085bb28bd8 ffff880852887d38 ffffffff812354f0 ffff880852887d08
    Call Trace:
    [] ? __xlate_proc_name+0x71/0xd0
    [] remove_proc_entry+0x40/0x180
    [] ? _raw_spin_lock_irqsave+0x41/0x60
    [] ? destruct_tty_driver+0x60/0xe0
    [] proc_tty_unregister_driver+0x28/0x40
    [] destruct_tty_driver+0x88/0xe0
    [] tty_driver_kref_put+0x1d/0x20
    [] release_one_tty+0x5a/0xd0
    [] process_one_work+0x139/0x420
    [] worker_thread+0x121/0x450
    [] ? process_scheduled_works+0x40/0x40
    [] kthread+0xec/0x110
    [] ? tg_rt_schedulable+0x210/0x220
    [] ? kthread_freezable_should_stop+0x80/0x80
    [] ret_from_fork+0x42/0x70
    [] ? kthread_freezable_should_stop+0x80/0x80

    Signed-off-by: nixiaoming
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    nixiaoming
     

03 Nov, 2017

1 commit

    When the page table is walked in the implementation of /proc/<pid>/pagemap,
    pmd_soft_dirty() is used for both the PMD huge page map and the PMD
    migration entries. That is wrong; pmd_swp_soft_dirty() should be used
    for the PMD migration entries instead, because a different page table
    entry flag is used.

    As a result, /proc/pid/pagemap may report incorrect soft-dirty
    information for PMD migration entries.

    Link: http://lkml.kernel.org/r/20171017081818.31795-1-ying.huang@intel.com
    Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: "Jérôme Glisse"
    Cc: Daniel Colascione
    Cc: Zi Yan
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <none>)
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

29 Sep, 2017

3 commits

  • Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
    its own print state because it will not in fact get woken by regular
    wakeups and is a long-term state.

    This requires moving TASK_PARKED into the TASK_REPORT mask, and since
    that latter needs to be a contiguous bitmask, we need to shuffle the
    bits around a bit.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Markus reported that kthreads that idle using TASK_IDLE instead of
    TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE, and things
    like htop mark those red.

    This is undesirable, so add an explicit state for TASK_IDLE.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Currently get_task_state() and task_state_to_char() report different
    states; create a number of common helpers and unify the reported state
    space.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Sep, 2017

1 commit

  • Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in
    /proc/PID/stat") stopped reporting eip/esp because it is
    racy and dangerous for executing tasks. The comment adds:

    As far as I know, there are no user programs that make any
    material use of these fields, so just get rid of them.

    However, existing userspace core-dump-handler applications (for
    example, minicoredumper) are using these fields since they
    provide an excellent cross-platform interface to these valuable
    pointers. So that commit introduced a user space visible
    regression.

    Partially revert the change and make the readout possible for
    tasks with the proper permissions and only if the target task
    has the PF_DUMPCORE flag set.

    Fixes: 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
    Reported-by: Marco Felsch
    Signed-off-by: John Ogness
    Reviewed-by: Andy Lutomirski
    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Peter Zijlstra
    Cc: Brian Gerst
    Cc: stable@vger.kernel.org
    Cc: Tetsuo Handa
    Cc: Borislav Petkov
    Cc: Al Viro
    Cc: Linux API
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.de
    Signed-off-by: Thomas Gleixner

    John Ogness
     

14 Sep, 2017

2 commits

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to let users mark an allocation as short-lived,
    so the allocator could try to place such allocations close together
    and prevent long-term fragmentation. As much as this sounds like a
    reasonable semantic, it becomes much less clear when to use the
    high-level GFP_TEMPORARY allocation flag. How long is temporary? Can
    the context holding that memory sleep? Can it take locks? There seems
    to be no good answer to those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE, which in itself is tricky because basically none of
    the existing callers provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefit.

    I have checked some random users and none of them added the flag
    with a specific justification. I suspect most of them just copied it
    from other existing users, and others just thought it might be a good
    idea to use without any measurement. This suggests that GFP_TEMPORARY
    just invites cargo-cult usage without any reasoning.

    I believe that our gfp flags are quite complex already, and especially
    those with high-level semantics should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replacing all existing users with plain GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get the __GFP_RECLAIMABLE
    heuristic and so will be placed properly for memory-fragmentation
    prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations, but I propose starting from a clear semantic definition
    and only then adding users with proper justification.

    This was brought up before LSF this year by Matthew [1], and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic.
    It seems to be a heuristic without any measured advantage for most
    (if not all) of its current users. The follow-up discussion revealed
    that opinions on what counts as a temporary allocation differ a lot
    between developers. So rather than trying to tweak existing users
    into a semantic they haven't expected, I propose to simply remove the
    flag and start from scratch if we really need a semantic for
    short-term allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In NOMMU configurations, we get a warning about a variable that has become
    unused:

    fs/proc/task_nommu.c: In function 'nommu_vma_show':
    fs/proc/task_nommu.c:148:28: error: unused variable 'priv' [-Werror=unused-variable]

    Link: http://lkml.kernel.org/r/20170911200231.3171415-1-arnd@arndb.de
    Fixes: 1240ea0dc3bb ("fs, proc: remove priv argument from is_stack")
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

09 Sep, 2017

8 commits

  • ... such that we can avoid the tree walks to get the node with the
    smallest key. Semantically the same as the previously used rb_first(),
    but O(1). The main overhead is the extra footprint for the cached
    rb_node pointer, which should not matter for procfs.

    Link: http://lkml.kernel.org/r/20170719014603.19029-14-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • If there are large numbers of hugepages to iterate while reading
    /proc/pid/smaps, the page walk never does cond_resched(). On archs
    without split pmd locks, there can be significant and observable
    contention on mm->page_table_lock, which causes lengthy delays
    without rescheduling.

    Always reschedule in smaps_pte_range() if necessary since the pagewalk
    iteration can be expensive.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708211405520.131071@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Save some code by dropping the last argument, which all ~320
    invocations cleared.

    add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
    function old new delta
    proc_create - 17 +17
    __ksymtab_proc_create - 16 +16
    __kstrtab_proc_create - 12 +12
    yam_init_driver 301 298 -3

    ...

    cifs_proc_init 249 228 -21
    via_fb_pci_probe 2304 2280 -24

    Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks")
    removed the only user of the priv parameter in is_stack(), so the
    argument is redundant. Drop it.

    [arnd@arndb.de: remove unused variable]
    Link: http://lkml.kernel.org/r/20170801120150.1520051-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170728075833.7241-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Platforms with an advanced system bus (like CAPI or CCIX) allow
    device memory to be accessed by the CPU in a cache-coherent fashion.
    Add a new ZONE_DEVICE type to represent such memory. The use cases
    are the same as for un-addressable device memory, but without all the
    corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for
    HMM and for migration to device memory are explained in the HMM core
    patch.

    This patch deals with device memory that is un-addressable (i.e. the
    CPU cannot access it). Hence we do not want those struct pages to be
    managed like regular memory. That is why we extend ZONE_DEVICE to
    support different types of memory.

    A persistent memory type is defined for existing users of
    ZONE_DEVICE, and a new device un-addressable type is added for the
    un-addressable memory. There is a clear separation between what is
    expected from each memory type; existing users of ZONE_DEVICE are
    unaffected by the new requirement and the new use of the
    un-addressable type. All type-specific code paths are protected by
    tests against the memory type.

    Because the memory is un-addressable, we use a new special swap type
    for when a page is migrated to device memory (this reduces the
    maximum number of swap files).

    The two main additions to ZONE_DEVICE, besides the memory type, are
    two callbacks. The first, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE
    page never reaches a refcount of 0). This allows the device driver
    to manage its memory and the associated struct pages.

    The second callback, page_fault(), runs when there is a CPU access
    to an address that is backed by a device page (which is
    un-addressable by the CPU). This callback is responsible for
    migrating the page back to system main memory. A device driver
    cannot block migration back to system memory; HMM makes sure that
    such pages cannot be pinned in device memory.

    If the device is in some error condition and cannot migrate memory
    back, then a CPU page fault on device memory should end with
    SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The soft dirty bit is designed to be preserved across page migration.
    This patch makes it work in the same manner for THP migration too.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

3 commits

  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized
    PRNG in every child process are pretty obvious. However, because
    libraries have all kinds of internal state, and programs get compiled
    with many different versions of each library, it is unreasonable to
    expect calling programs to re-initialize everything manually after
    fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • /proc/pid/smaps_rollup is a new proc file that improves the performance
    of user programs that determine aggregate memory statistics (e.g., total
    PSS) of a process.

    Android regularly "samples" the memory usage of various processes in
    order to balance its memory pool sizes. This sampling process involves
    opening /proc/pid/smaps and summing certain fields. For very large
    processes, sampling memory use this way can take several hundred
    milliseconds, due mostly to the overhead of the seq_printf calls in
    task_mmu.c.

    smaps_rollup improves the situation. It contains most of the fields of
    /proc/pid/smaps, but instead of a set of fields for each VMA,
    smaps_rollup instead contains one synthetic smaps-format entry
    representing the whole process. In the single smaps_rollup synthetic
    entry, each field is the summation of the corresponding field in all of
    the real-smaps VMAs. Using a common format for smaps_rollup and smaps
    allows userspace parsers to repurpose parsers meant for use with
    non-rollup smaps for smaps_rollup, and it allows userspace to switch
    between smaps_rollup and smaps at runtime (say, based on the
    availability of smaps_rollup in a given kernel) with minimal fuss.

    By using smaps_rollup instead of smaps, a caller can avoid the
    significant overhead of formatting, reading, and parsing each of a large
    process's potentially very numerous memory mappings. For sampling
    system_server's PSS in Android, we measured a 12x speedup, representing
    a savings of several hundred milliseconds.

    One alternative to a new per-process proc file would have been including
    PSS information in /proc/pid/status. We considered this option but
    thought that PSS would be too expensive (by a few orders of magnitude)
    to collect relative to what's already emitted as part of
    /proc/pid/status, and slowing every user of /proc/pid/status for the
    sake of readers that happen to want PSS feels wrong.

    The code itself works by reusing the existing VMA-walking framework we
    use for regular smaps generation and keeping the mem_size_stats
    structure around between VMA walks instead of using a fresh one for each
    VMA. In this way, summation happens automatically. We let seq_file
    walk over the VMAs just as it does for regular smaps and just emit
    nothing to the seq_file until we hit the last VMA.

    Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

    We're using the PSS samples we collect asynchronously for
    system-management tasks like fine-tuning oom_adj_score, memory use
    tracking for debugging, application-level memory-use attribution, and
    deciding whether we want to kill large processes during system idle
    maintenance windows. Android has been using PSS for these purposes for
    a long time; as the average process VMA count has increased and
    devices become more efficiency-conscious, PSS-collection inefficiency
    has started to matter more. IMHO, it'd be a lot safer to optimize the
    existing PSS-collection model, which has been fine-tuned over the years,
    instead of changing the memory tracking approach entirely to work around
    smaps-generation inefficiency.

    Tim said:

    : There are two main reasons why Android gathers PSS information:
    :
    : 1. Android devices can show the user the amount of memory used per
    : application via the settings app. This is a less important use case.
    :
    : 2. We log PSS to help identify leaks in applications. We have found
    : an enormous number of bugs (in the Android platform, in Google's own
    : apps, and in third-party applications) using this data.
    :
    : To do this, system_server (the main process in Android userspace) will
    : sample the PSS of a process three seconds after it changes state (for
    : example, app is launched and becomes the foreground application) and about
    : every ten minutes after that. The net result is that PSS collection is
    : regularly running on at least one process in the system (usually a few
    : times a minute while the screen is on, less when screen is off due to
    : suspend). PSS of a process is an incredibly useful stat to track, and we
    : aren't going to get rid of it. We've looked at some very hacky approaches
    : using RSS ("take the RSS of the target process, subtract the RSS of the
    : zygote process that is the parent of all Android apps") to reduce the
    : accounting time, but it regularly overestimated the memory used by 20+
    : percent. Accordingly, I don't think that there's a good alternative to
    : using PSS.
    :
    : We started looking into PSS collection performance after we noticed random
    : frequency spikes while a phone's screen was off; occasionally, one of the
    : CPU clusters would ramp to a high frequency because there was 200-300ms of
    : constant CPU work from a single thread in the main Android userspace
    : process. The work causing the spike (which is reasonable governor
    : behavior given the amount of CPU time needed) was always PSS collection.
    : As a result, Android is burning more power than we should be on PSS
    : collection.
    :
    : The other issue (and why I'm less sure about improving smaps as a
    : long-term solution) is that the number of VMAs per process has increased
    : significantly from release to release. After trying to figure out why we
    : were seeing these 200-300ms PSS collection times on Android O but had not
    : noticed it in previous versions, we found that the number of VMAs in the
    : main system process increased by 50% from Android N to Android O (from
    : ~1800 to ~2700) and varying increases in every userspace process. Android
    : M to N also had an increase in the number of VMAs, although not as much.
    : I'm not sure why this is increasing so much over time, but thinking about
    : ASLR and ways to make ASLR better, I expect that this will continue to
    : increase going forward. I would not be surprised if we hit 5000 VMAs on
    : the main Android process (system_server) by 2020.
    :
    : If we assume that the number of VMAs is going to increase over time, then
    : doing anything we can do to reduce the overhead of each VMA during PSS
    : collection seems like the right way to go, and that means outputting an
    : aggregate statistic (to avoid whatever overhead there is per line in
    : writing smaps and in reading each line from userspace).

    Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
    Signed-off-by: Daniel Colascione
    Cc: Tim Murray
    Cc: Joel Fernandes
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sonny Rao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Colascione
     
  • global_page_state is error-prone, as a recent bug report pointed out
    [1]. It only returns proper values for zone-based counters, as the
    enum it takes suggests. We already have global_node_page_state, so
    let's rename global_page_state to global_zone_page_state to be more
    explicit here. All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Sep, 2017

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big char/misc driver update for 4.14-rc1.

    Lots of different stuff in here, it's been an active development cycle
    for some reason. Highlights are:

    - updated binder driver, this brings binder up to date with what
    shipped in the Android O release, plus some more changes that
    happened since then that are in the Android development trees.

    - coresight updates and fixes

    - mux driver file renames to be a bit "nicer"

    - intel_th driver updates

    - normal set of hyper-v updates and changes

    - small fpga subsystem and driver updates

    - lots of const code changes all over the driver trees

    - extcon driver updates

    - fmc driver subsystem updates

    - w1 subsystem minor reworks and new features and drivers added

    - spmi driver updates

    Plus a smattering of other minor driver updates and fixes.

    All of these have been in linux-next with no reported issues for a
    while"

    * tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
    ANDROID: binder: don't queue async transactions to thread.
    ANDROID: binder: don't enqueue death notifications to thread todo.
    ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
    ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
    ANDROID: binder: push new transactions to waiting threads.
    ANDROID: binder: remove proc waitqueue
    android: binder: Add page usage in binder stats
    android: binder: fixup crash introduced by moving buffer hdr
    drivers: w1: add hwmon temp support for w1_therm
    drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
    drivers: w1: add hwmon support structures
    eeprom: idt_89hpesx: Support both ACPI and OF probing
    mcb: Fix an error handling path in 'chameleon_parse_cells()'
    MCB: add support for SC31 to mcb-lpc
    mux: make device_type const
    char: virtio: constify attribute_group structures.
    Documentation/ABI: document the nvmem sysfs files
    lkdtm: fix spelling mistake: "incremeted" -> "incremented"
    perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
    nvmem: include linux/err.h from header
    ...

    Linus Torvalds