29 Apr, 2016

1 commit

  • In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
    because the layouts may differ depending on the architecture. On s390
    this will lead to inaccurate numa_maps accounting in /proc because of
    misguided pte_present() and pte_dirty() checks on the fake pte.

    On other architectures pte_present() and pte_dirty() may work by chance,
    but there may be an issue with direct-access (dax) mappings w/o
    underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
    available. In vm_normal_page() the fake pte will be checked with
    pte_special() and because there is no "special" bit in a pmd, this will
    always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
    skipped. On dax mappings w/o struct pages, an invalid struct page
    pointer would then be returned that can crash the kernel.

    This patch fixes the numa_maps THP handling by introducing new "_pmd"
    variants of the can_gather_numa_stats() and vm_normal_page() functions.
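
    A rough sketch of the pmd-aware helper pattern described above
    (simplified; vm_normal_page_pmd() is assumed here to be the pmd
    counterpart that performs the VM_PFNMAP/VM_MIXEDMAP checks on the pmd
    directly):

    static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
                                                  struct vm_area_struct *vma,
                                                  unsigned long addr)
    {
            struct page *page;
            int nid;

            /* use pmd_*() helpers, never pte_*() on a casted pmd */
            if (!pmd_present(pmd))
                    return NULL;

            page = vm_normal_page_pmd(vma, addr, pmd);
            if (!page || PageReserved(page))
                    return NULL;

            nid = page_to_nid(page);
            if (!node_isset(nid, node_states[N_MEMORY]))
                    return NULL;

            return page;
    }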

    Signed-off-by: Gerald Schaefer
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Dan Williams
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michael Holzheu
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether PAGE_CACHE_*
    or PAGE_* constants should be used in a particular case, especially on
    the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

7 commits

  • On an i686 PAE-enabled machine the contiguous physical area can be
    large, and this can cause the variables in the calculation below, in
    read_vmcore() and mmap_vmcore(), to be truncated:

    tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);

    That is, the types involved on i686 are:
    m->offset: unsigned long long int
    m->size: unsigned long long int
    *fpos: loff_t (long long int)
    buflen: size_t (unsigned int)

    So casting (m->offset + m->size - *fpos) to size_t means truncating the
    value modulo 4GB.

    Suppose (m->offset + m->size - *fpos) is truncated to 0 while buflen > 0;
    then we get tsz = 0, which is of course not the expected result.
    Similarly we could get other truncated values less than buflen, and then
    the real size passed down is no longer correct.

    If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or
    mmap_vmcore uses the min_t result with the truncated value being
    compared to buflen. Then fpos proceeds with the wrong value and we hit
    the bugs below:

    1) read_vmcore will refuse to continue so makedumpfile fails.
    2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().

    Use unsigned long long in min_t instead so that the variables are not
    truncated.
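
    A stand-alone illustration of the truncation (not the vmcore code; it
    just mimics what min_t(size_t, ...) does when size_t is 32-bit, as on
    i686):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            unsigned long long off_sz = 0x100000000ULL; /* m->offset + m->size */
            unsigned long long fpos = 0;                /* *fpos */
            uint32_t buflen = 4096;                     /* size_t on i686 */

            /* min_t(size_t, ...) casts the 64-bit value to 32 bits: 4GB -> 0 */
            uint32_t lhs32 = (uint32_t)(off_sz - fpos);
            uint32_t tsz_bad = lhs32 < buflen ? lhs32 : buflen;

            /* comparing as unsigned long long keeps the value intact */
            unsigned long long lhs64 = off_sz - fpos;
            unsigned long long tsz_ok = lhs64 < buflen ? lhs64 : buflen;

            printf("broken tsz = %u, fixed tsz = %llu\n", tsz_bad, tsz_ok);
            return 0;
    }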

    Signed-off-by: Baoquan He
    Signed-off-by: Dave Young
    Cc: HATAYAMA Daisuke
    Cc: Vivek Goyal
    Cc: Jianyu Zhan
    Cc: Minfei Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • It is not elegant that the shell prompt does not start on a new line
    after executing "cat /proc/$pid/wchan". Make the shell prompt start on
    a new line.

    Signed-off-by: Minfei Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • `proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
    enabled.

    Signed-off-by: Eric Engestrom
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Engestrom
     
  • This patch provides a proc/PID/timerslack_ns interface which exposes a
    task's timerslack value in nanoseconds and allows it to be changed.

    This allows power/performance management software to set timer slack
    for other threads according to its policy for the thread (such as when
    the thread is designated foreground vs. background activity).

    If the value written is non-zero, the slack is set to that value.
    Otherwise it is reset to the thread's default.

    This interface checks that the calling task has permission to use
    PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
    arbitrary apps do not change the timer slack for other apps.
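
    A hypothetical user-space sketch of using the new file (the path follows
    the description above; the caller needs PTRACE_MODE_ATTACH_FSCREDS-level
    permission on the target):

    #include <stdio.h>

    int main(int argc, char **argv)
    {
            char path[64];
            FILE *f;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <pid> <slack_ns>\n", argv[0]);
                    return 1;
            }
            snprintf(path, sizeof(path), "/proc/%s/timerslack_ns", argv[1]);
            f = fopen(path, "w");
            if (!f) {
                    perror(path);
                    return 1;
            }
            fprintf(f, "%s\n", argv[2]);    /* writing 0 restores the default */
            return fclose(f) ? 1 : 0;
    }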

    Signed-off-by: John Stultz
    Acked-by: Kees Cook
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
    statistics protocol, corresponding to 'Available' in /proc/meminfo.

    It indicates to the hypervisor how big the balloon can be inflated
    without pushing the guest system to swap. This metric would be very
    useful in VM orchestration software to improve memory management of
    different VMs under overcommit.

    This patch (of 2):

    Factor out calculation of the available memory counter into a separate
    exportable function, in order to be able to use it in other parts of the
    kernel.

    In particular, it appears to be a relevant metric to report to the
    hypervisor via the virtio-balloon statistics interface (in a follow-up
    patch).
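
    A hedged sketch of how the factored-out helper could be consumed by the
    follow-up balloon change (the helper name si_mem_available() and the
    byte conversion are assumptions for illustration):

    /* illustrative only: report the available-memory estimate in bytes,
     * assuming the exported helper returns a count of pages */
    static inline u64 balloon_available_bytes(void)
    {
            return (u64)si_mem_available() << PAGE_SHIFT;
    }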

    Signed-off-by: Igor Redko
    Signed-off-by: Denis V. Lunev
    Reviewed-by: Roman Kagan
    Cc: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Redko
     
  • Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
    pages, which is inconvenient when trying to grasp how slab pages are
    distributed (userspace always needs to check what kind of tail page it
    is by itself). This patch sets KPF_SLAB for such pages.

    With this patch:

    $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
    Slab: 64880 kB
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000080 16220 63 _______S__________________________________ slab
    total 16220 63

    16220 pages equal 64880 kB, so the returned result is consistent with
    the global counter.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
    is inconvenient when trying to grasp how free pages are distributed.
    This patch sets KPF_BUDDY for such pages.

    With this patch:

    $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
    MemFree: 3134992 kB
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000400 779272 3044 __________B_______________________________ buddy
    0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
    total 783657 3061

    783657 pages is 3134628 kB (roughly consistent with the global counter),
    so it's OK.

    [akpm@linux-foundation.org: update comment, per Naoya]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Feb, 2016

1 commit

  • The protection key can now be just as important as read/write
    permissions on a VMA. We need some debug mechanism to help
    figure out if it is in play. smaps seems like a logical
    place to expose it.

    arch/x86/kernel/setup.c is a bit of a weirdo place to put
    this code, but it already had seq_file.h and there was not
    a much better existing place to put it.

    We also use no #ifdef. If protection keys are .config'd out, we
    effectively end up with the same function as the weak generic one.
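
    A sketch of the weak-default pattern just described (the function name
    and exact output format are assumptions, not necessarily the patch's):

    /* generic fallback (fs/proc/task_mmu.c): prints nothing */
    void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
    {
    }

    /* x86 override (arch/x86/kernel/setup.c) */
    void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
    {
            if (!boot_cpu_has(X86_FEATURE_OSPKE))
                    return;
            seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
    }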

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Dave Young
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jerome Marchand
    Cc: Jiri Kosina
    Cc: Joerg Roedel
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Mark Salter
    Cc: Mark Williamson
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

17 Feb, 2016

1 commit

  • Introduce the ability to create a new cgroup namespace. The newly
    created cgroup namespace remembers the cgroup of the process at the
    point of creation of the cgroup namespace (referred to as the
    cgroupns-root).
    The main purpose of cgroup namespace is to virtualize the contents
    of /proc/self/cgroup file. Processes inside a cgroup namespace
    are only able to see paths relative to their namespace root
    (unless they are moved outside of their cgroupns-root, at which point
    they will see a relative path from their cgroupns-root).
    For a correctly set up container, this enables container tools
    (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
    containers without leaking the system-level cgroup hierarchy to the task.
    This patch only implements the 'unshare' part of the cgroupns.
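
    A minimal user-space illustration of the unshare step (requires a kernel
    and libc that define CLONE_NEWCGROUP):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            if (unshare(CLONE_NEWCGROUP) != 0) {
                    perror("unshare(CLONE_NEWCGROUP)");
                    return EXIT_FAILURE;
            }
            /* paths are now shown relative to the cgroupns-root, typically "/" */
            return system("cat /proc/self/cgroup");
    }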

    Signed-off-by: Aditya Kali
    Signed-off-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aditya Kali
     

04 Feb, 2016

2 commits

  • Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") added the [stack:TID] annotation to /proc/<pid>/maps.

    Finding the task of a stack VMA requires walking the entire thread list,
    turning this into quadratic behavior: a thousand threads means a
    thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
    million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
    /proc/<pid>/numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
    identifying the stack VMA there is an O(1) operation.

    Siddhesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When working with hugetlbfs ptes (which are actually pmds) it is not
    valid to directly use pte functions like pte_present() because the
    hardware bit
    layout of pmds and ptes can be different. This is the case on s390.
    Therefore we have to convert the hugetlbfs ptes first into a valid pte
    encoding with huge_ptep_get().

    Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
    huge_ptep_get(). On s390 this leads to the following two problems:

    1) The pte_present() function returns false (instead of true) for
    PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
    completely in the "numa_maps" output.

    2) The pte_dirty() function always returns false for all hugetlb ptes.
    Therefore these pages are reported as "mapped=xxx" instead of
    "dirty=xxx".

    Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
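
    A simplified sketch of the conversion pattern (illustrative, not the
    literal hunk):

    static bool hugetlb_pte_is_dirty(pte_t *ptep)
    {
            /* convert to a valid pte encoding first; the raw s390 entry is a pmd */
            pte_t huge_pte = huge_ptep_get(ptep);

            return pte_present(huge_pte) && pte_dirty(huge_pte);
    }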

    Signed-off-by: Michael Holzheu
    Reviewed-by: Gerald Schaefer
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     

23 Jan, 2016

1 commit

  • Add wrappers parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    with inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
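
    The wrappers are along these lines (sketch):

    static inline void inode_lock(struct inode *inode)
    {
            mutex_lock(&inode->i_mutex);
    }

    static inline void inode_unlock(struct inode *inode)
    {
            mutex_unlock(&inode->i_mutex);
    }

    static inline int inode_trylock(struct inode *inode)
    {
            return mutex_trylock(&inode->i_mutex);
    }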

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

1 commit

  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.
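
    The resulting caller pattern looks roughly like this (sketch):

    static int handle_pmd_range(pmd_t *pmd, struct vm_area_struct *vma)
    {
            spinlock_t *ptl = pmd_trans_huge_lock(pmd, vma);

            if (ptl) {
                    /* *pmd is a stable huge entry while ptl is held */
                    spin_unlock(ptl);
                    return 0;
            }
            /* NULL: not a huge pmd, fall back to the pte-level path */
            return 1;
    }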

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

3 commits

  • Only functions doing more than one read are modified. Consumers
    happened to deal with possibly changing data, but it does not seem like
    a good thing to rely on.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     
  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG().
    That's fine since this codepath is eliminated by modern compilers.

    But older compilers do not have such efficient dead code elimination.
    It causes a problem at least with gcc 4.1.2 on m68k:

    fs/built-in.o: In function `smaps_account':
    task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471'

    Let's replace HPAGE_PMD_NR with 1 << compound_order(page).

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

3 commits

  • Let's define page_mapped() to be true for compound pages if any
    sub-page of the compound page is mapped (with PMD or PTE).

    On the other hand, page_mapcount() returns the mapcount for this
    particular small page.

    This will make cases like page_get_anon_vma() behave correctly once we
    allow huge pages to be mapped with PTE.

    Most users outside core-mm should use page_mapcount() instead of
    page_mapped().
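
    Roughly, the new semantics look like this (simplified sketch; it omits
    the hugetlb special case):

    static inline bool page_mapped_sketch(struct page *page)
    {
            int i;

            if (likely(!PageCompound(page)))
                    return atomic_read(&page->_mapcount) >= 0;

            page = compound_head(page);
            if (atomic_read(compound_mapcount_ptr(page)) >= 0)
                    return true;            /* mapped with a PMD */

            for (i = 0; i < hpage_nr_pages(page); i++) {
                    if (atomic_read(&page[i]._mapcount) >= 0)
                            return true;    /* some subpage mapped with a PTE */
            }
            return false;
    }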

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The goal of this patchset is to make refcounting on THP pages cheaper
    with simpler semantics and allow the same THP compound page to be mapped
    with PMD and PTEs. This is required to get reasonable THP-pagecache
    implementation.

    With the new refcounting design it's much easier to protect against
    split_huge_page(): a simple reference on a page is enough to prevent it
    from being split. It makes the gup_fast() implementation simpler and
    doesn't require a special case in futex code to handle tail THP pages.

    It should improve THP utilization across the system, since splitting a
    THP in one process doesn't necessarily lead to splitting the page in
    all other processes that have the page mapped.

    The patchset drastically lowers the complexity of the
    get_page()/put_page() codepaths. I encourage people to look at this
    code before and after to justify the time budget on reviewing this
    patchset.

    This patch (of 37):

    With the new refcounting, all subpages of a compound page do not
    necessarily have the same mapcount. We need to take into account the
    mapcount of every sub-page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

9 commits

  • When inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data, @end_data in mm_struct), it
    became apparent that RLIMIT_DATA, in the form it's implemented now,
    doesn't do anything useful, because most user-space libraries use the
    mmap() syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be strange beast like readonly-private or VM_IO
    area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
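
    An illustrative rendering of the vm_flags classification above as
    helpers (a sketch; not necessarily the exact helpers from the patch):

    static inline bool is_exec_mapping(vm_flags_t flags)
    {
            return (flags & (VM_EXEC | VM_WRITE)) == VM_EXEC;
    }

    static inline bool is_stack_mapping(vm_flags_t flags)
    {
            return flags & (VM_GROWSUP | VM_GROWSDOWN);
    }

    static inline bool is_data_mapping(vm_flags_t flags)
    {
            return (flags & (VM_WRITE | VM_SHARED |
                             VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE;
    }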

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY);
    VM_SOFTDIRTY was already cleared before walk_page_range().

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The MemAvailable item in /proc/meminfo is to give users a hint of how
    much memory is allocatable without causing swapping, so it excludes the
    zones' low watermarks as unavailable to userspace.

    However, for a userspace allocation, kswapd will actually reclaim until
    the free pages hit a combination of the high watermark and the page
    allocator's lowmem protection that keeps a certain amount of DMA and
    DMA32 memory from userspace as well.

    Subtract the full amount we know to be unavailable to userspace from the
    number of free pages when calculating MemAvailable.

    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are several shortcomings with the accounting of shared memory
    (SysV shm, shared anonymous mapping, mapping of a tmpfs file). The
    values in /proc/<pid>/status and /proc/<pid>/statm don't allow one to
    distinguish between shmem memory and a shared mapping to a regular
    file, even though their implications on memory usage are quite
    different: during reclaim, a file mapping can be dropped or written
    back to disk, while shmem needs a place in swap.

    Also, to distinguish the memory occupied by anonymous and file mappings,
    one has to read the /proc/<pid>/statm file, which has a field for the
    file mappings (again, including shmem) and the total memory occupied by
    these mappings (i.e. equivalent to VmRSS in the /proc/<pid>/status
    file). Getting the value for anonymous mappings only is thus not exactly
    user-friendly (the statm file is intended to be rather efficiently
    machine-readable).

    To address both of these shortcomings, this patch adds a breakdown of
    VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
    RssShmem, making use of the previous preparatory patch. These fields
    tell the user the memory occupied by private anonymous pages, mapped
    regular files and shmem, respectively. Other existing fields in the
    /status and /statm files are left unchanged. The /statm file can be
    extended in the future, if there's a need for that.

    Example (part of) /proc/pid/status output including the new Rss* fields:

    VmPeak: 2001008 kB
    VmSize: 2001004 kB
    VmLck: 0 kB
    VmPin: 0 kB
    VmHWM: 5108 kB
    VmRSS: 5108 kB
    RssAnon: 92 kB
    RssFile: 1324 kB
    RssShmem: 3692 kB
    VmData: 192 kB
    VmStk: 136 kB
    VmExe: 4 kB
    VmLib: 1784 kB
    VmPTE: 3928 kB
    VmPMD: 20 kB
    VmSwap: 0 kB
    HugetlbPages: 0 kB

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Currently, looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implications on actual
    memory use are quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As a preparation to extend the userspace interfaces,
    this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
    shmem pages separately from MM_FILEPAGES. The next patch will expose it
    to userspace - this patch doesn't change the exported values yet, by
    adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
    used before. The only user-visible change after this patch is the OOM
    killer message that separates the reported "shmem-rss" from "file-rss".

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Following the previous patch, further reduction of /proc/pid/smaps cost
    is possible for private writable shmem mappings with unpopulated areas
    where the page walk invokes the .pte_hole function. We can use radix
    tree iterator for each such area instead of calling find_get_entry() in
    a loop. This is possible at the extra maintenance cost of introducing
    another shmem function shmem_partial_swap_usage().

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I time how long it takes to
    cat /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.
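
    A simplified sketch of the iterator-based counting idea (the real helper
    additionally handles RCU retries and restarts):

    /* count swapped-out entries in [start, end) of a shmem mapping */
    static unsigned long shmem_swap_usage_sketch(struct address_space *mapping,
                                                 pgoff_t start, pgoff_t end)
    {
            struct radix_tree_iter iter;
            unsigned long swapped = 0;
            void **slot;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                    if (iter.index >= end)
                            break;
                    /* swapped-out shmem pages are stored as exceptional entries */
                    if (radix_tree_exceptional_entry(radix_tree_deref_slot(slot)))
                            swapped++;
            }
            rcu_read_unlock();

            return swapped << PAGE_SHIFT;   /* bytes */
    }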

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch has improved swap accounting for shmem mapping, which
    however made /proc/pid/smaps more expensive for shmem mappings, as we
    consult the radix tree for each pte_none entry, so the overall complexity
    is O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics that the shmem object
    itself tracks (if the mapping contains the whole object, or the swap
    usage of the whole object is zero), or use the radix tree iterator,
    which is much more effective than repeated find_get_entry() calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the find_get_entry()
    approach.

    Hopefully such mappings are relatively uncommon.

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB, and
    timed how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
    (which needs to employ the radix tree iterator).

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
    shmem-backed mappings, even if the mapped portion does contain pages
    that were swapped out. This is because unlike private anonymous
    mappings, shmem does not change the pte to a swap entry, but leaves it
    pte_none when swapping the page out. In the smaps page walk, such a page
    thus looks
    like it was never faulted in.

    This patch changes smaps_pte_entry() to determine the swap status for
    such pte_none entries for shmem mappings, similarly to how
    mincore_page() does it. Swapped out shmem pages are thus accounted for.
    For private mappings of tmpfs files that COWed some of the pages, the
    swapped-out status of the original shmem pages is naturally ignored. If
    some of the private copies were also swapped out, they are accounted
    via their
    page table swap entries, so the resulting reported swap usage is then a
    sum of both swapped out private copies, and swapped out shmem pages that
    were not COWed. No double accounting can thus happen.

    The accounting is arguably still not as precise as for private anonymous
    mappings, since now we will also count pages that the process in
    question never accessed, but that another process populated and then
    let become swapped out. I believe it is still less confusing and
    subtle than not showing any swap usage by shmem mappings at all.
    The swapped-out counter might be of interest to users who would like to
    prevent future swapins during a performance critical operation and
    pre-fault the pages at their convenience. Especially for larger swapped
    out regions
    the cost of swapin is much higher than a fresh page allocation. So a
    differentiation between pte_none vs. swapped out is important for those
    usecases.

    One downside of this patch is that it makes /proc/pid/smaps more
    expensive for shmem mappings, as we consult the radix tree for each
    pte_none entry, so the overall complexity is O(n*log(n)). I have
    measured this on a process that creates a 2GB mapping and dirties single
    pages with a stride of 2MB, and timed how long it takes to cat
    /proc/pid/smaps of this process 100 times.

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    Mapping of a /dev/shm/file:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    The difference is rather substantial, so the next patch will reduce the
    cost for shared or read-only mappings.

    In a less controlled experiment, I've gathered pids of processes on my
    desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
    included the Chrome browser and some KDE processes. Again, I've run cat
    /proc/pid/smaps on each 100 times.

    Before this patch:

    real 0m9.050s
    user 0m0.518s
    sys 0m8.066s

    After this patch:

    real 0m9.221s
    user 0m0.541s
    sys 0m8.187s

    This suggests low impact on average systems.

    Note that this patch doesn't attempt to adjust the SwapPss field for
    shmem mappings, which would need extra work to determine who else could
    have the pages mapped. Thus the value stays zero except for COWed
    swapped out pages in a shmem mapping, which are accounted as usual.
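
    The core idea in smaps_pte_entry(), roughly (a simplified sketch; the
    actual code also gates this behind CONFIG_SHMEM and a per-walk flag, and
    only runs it for pte_none entries of shmem-backed vmas):

    static void smaps_account_shmem_hole(struct vm_area_struct *vma,
                                         unsigned long addr,
                                         struct mem_size_stats *mss)
    {
            struct page *page = find_get_entry(vma->vm_file->f_mapping,
                                               linear_page_index(vma, addr));

            if (radix_tree_exceptional_entry(page))
                    mss->swap += PAGE_SIZE;         /* swapped-out shmem page */
            else if (page)
                    put_page(page);                 /* resident page: drop the ref */
    }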

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Jerome Marchand
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
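
    For illustration, the two annotations look like this at an allocation
    site (hypothetical call sites, not hunks from the patch):

    struct example { int payload; };

    static struct kmem_cache *example_cachep;

    static int __init example_init(void)
    {
            void *buf;

            /* slab caches for user-triggerable objects get SLAB_ACCOUNT */
            example_cachep = kmem_cache_create("example_cache",
                                               sizeof(struct example), 0,
                                               SLAB_ACCOUNT, NULL);
            if (!example_cachep)
                    return -ENOMEM;

            /* one-off allocations that userspace can trigger use __GFP_ACCOUNT */
            buf = kmalloc(64, GFP_KERNEL | __GFP_ACCOUNT);
            kfree(buf);

            return 0;
    }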

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of stuff. That probably should've been 5 or 6 separate
    branches, but by the time I'd realized how large and mixed that bag
    had become it had been too close to -final to play with rebasing.

    Some fs/namei.c cleanups there, memdup_user_nul() introduction and
    switching open-coded instances, burying long-dead code, whack-a-mole
    of various kinds, several new helpers for ->llseek(), assorted
    cleanups and fixes from various people, etc.

    One piece probably deserves special mention - Neil's
    lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
    called without ->i_mutex and tries to avoid ever taking it. That, of
    course, means that it's not useful for any directory modifications,
    but things like getting inode attributes in nfsd readdirplus are fine
    with that. I really should've asked for moratorium on lookup-related
    changes this cycle, but since I hadn't done that early enough... I
    *am* asking for that for the coming cycle, though - I'm going to try
    and get conversion of i_mutex to rwsem with ->lookup() done under lock
    taken shared.

    There will be a patch closer to the end of the window, along the lines
    of the one Linus had posted last May - mechanical conversion of
    ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
    inode_is_locked()/inode_lock_nested(). To quote Linus back then:

    -----
    | This is an automated patch using
    |
    | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    | with a very few manual fixups
    -----

    I'm going to send that once the ->i_mutex-affecting stuff in -next
    gets mostly merged (or when Linus says he's about to stop taking
    merges)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    nfsd: don't hold i_mutex over userspace upcalls
    fs:affs:Replace time_t with time64_t
    fs/9p: use fscache mutex rather than spinlock
    proc: add a reschedule point in proc_readfd_common()
    logfs: constify logfs_block_ops structures
    fcntl: allow to set O_DIRECT flag on pipe
    fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
    fs: xattr: Use kvfree()
    [s390] page_to_phys() always returns a multiple of PAGE_SIZE
    nbd: use ->compat_ioctl()
    fs: use block_device name vsprintf helper
    lib/vsprintf: add %*pg format specifier
    fs: use gendisk->disk_name where possible
    poll: plug an unused argument to do_poll
    amdkfd: don't open-code memdup_user()
    cdrom: don't open-code memdup_user()
    rsxx: don't open-code memdup_user()
    mtip32xx: don't open-code memdup_user()
    [um] mconsole: don't open-code memdup_user_nul()
    [um] hostaudio: don't open-code memdup_user()
    ...

    Linus Torvalds
     

12 Jan, 2016

1 commit

  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

09 Jan, 2016

1 commit

  • User can pass an arbitrary large buffer to getdents().

    It is typically a 32KB buffer used by libc scandir() implementation.

    When scanning /proc/{pid}/fd, we can hold the CPU for way too long,
    so add a cond_resched() to be kind to other tasks.

    We've seen latencies of more than 50ms on real workloads.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    Eric Dumazet
     

04 Jan, 2016

1 commit


31 Dec, 2015

1 commit


19 Dec, 2015

1 commit

  • Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
    774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
    the setting of ret after the get_proc_task call and incorrectly left it as
    -ESRCH. Instead, return 0 when successful.

    Example breakage:

    echo 0 > /proc/self/coredump_filter
    bash: echo: write error: No such process

    Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
    Signed-off-by: Colin Ian King
    Acked-by: Kees Cook
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

09 Dec, 2015

2 commits