23 Jan, 2016

1 commit

  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, add
    inode_foo(inode) wrappers, each being mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.
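
    A sketch of what such wrappers look like, following the description above
    (the exact set in the tree may differ slightly):

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

        static inline int inode_is_locked(struct inode *inode)
        {
                return mutex_is_locked(&inode->i_mutex);
        }

        static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
        {
                mutex_lock_nested(&inode->i_mutex, subclass);
        }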

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

1 commit

  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.
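
    A hedged sketch of the resulting caller pattern (the ptl returned on
    success is the spinlock to drop):

        spinlock_t *ptl = pmd_trans_huge_lock(pmd, vma);

        if (ptl) {
                /* success: *pmd is a stable huge pmd while ptl is held;
                 * do the pmd-level work here */
                spin_unlock(ptl);
                return 0;
        }
        /* failure (NULL): fall back to the pte-level path */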

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

3 commits

  • Only functions doing more than one read are modified. Consumers
    happened to cope with possibly changing data, but that does not seem like
    a good thing to rely on.
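
    As an illustration of the pattern (a hedged sketch, assuming the values in
    question are the mm_struct argument/environment boundaries): take one
    snapshot under mmap_sem rather than re-reading the live fields.

        unsigned long arg_start, arg_end, env_start, env_end;

        down_read(&mm->mmap_sem);
        arg_start = mm->arg_start;
        arg_end   = mm->arg_end;
        env_start = mm->env_start;
        env_end   = mm->env_end;
        up_read(&mm->mmap_sem);

        /* work only with the local copies from here on */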

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     
  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).
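
    A hypothetical user-space illustration of that scenario (file name and
    target path are made up): a root-owned mode-6755 helper keeps a real UID
    of 0 while switching its effective UID to the invoking user, then dumps a
    caller-supplied /proc path; before this change, the ptrace-style check
    consulted the real UID / permitted capabilities and let the access through.

        /* dump_file.c -- hypothetical mode-6755, root-owned helper */
        #include <stdio.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                char buf[4096];
                size_t n;
                FILE *f;

                if (argc < 2)
                        return 1;

                /* ruid becomes 0, euid becomes the invoking user */
                if (setreuid(0, getuid()) < 0)
                        return 1;

                f = fopen(argv[1], "r");        /* e.g. /proc/1/maps */
                if (!f)
                        return 1;
                while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                        fwrite(buf, 1, n, stdout);
                fclose(f);
                return 0;
        }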

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG().
    That's fine since this codepath is eliminated by modern compilers.

    But older compilers do not have such efficient dead code elimination,
    which causes a problem at least with gcc 4.1.2 on m68k:

    fs/built-in.o: In function `smaps_account':
    task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471'

    Let's replace HPAGE_PMD_NR with 1 << compound_order(page).
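
    A hedged sketch of the change in smaps_account(), assuming the per-page
    count is computed roughly like this:

        int i, nr;

        /* before: nr = compound ? HPAGE_PMD_NR : 1;
         * which expands to BUILD_BUG() with THP=n */
        nr = compound ? 1 << compound_order(page) : 1;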

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

3 commits

  • Let's define page_mapped() to be true for compound pages if any
    sub-page of the compound page is mapped (with PMD or PTE).

    On the other hand, page_mapcount() returns the mapcount for that
    particular small page.

    This will make cases like page_get_anon_vma() behave correctly once we
    allow huge pages to be mapped with PTE.

    Most users outside core-mm should use page_mapcount() instead of
    page_mapped().
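
    A simplified sketch of what the compound-aware check can look like
    (details such as hugetlb handling are omitted):

        static inline bool page_mapped(struct page *page)
        {
                int i;

                if (likely(!PageCompound(page)))
                        return atomic_read(&page->_mapcount) >= 0;

                page = compound_head(page);
                if (atomic_read(compound_mapcount_ptr(page)) >= 0)
                        return true;                    /* mapped with PMD */

                for (i = 0; i < hpage_nr_pages(page); i++) {
                        if (atomic_read(&page[i]._mapcount) >= 0)
                                return true;            /* sub-page mapped with PTE */
                }
                return false;
        }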

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The goal of this patchset is to make refcounting on THP pages cheaper
    with simpler semantics and allow the same THP compound page to be mapped
    with PMD and PTEs. This is required to get a reasonable THP-pagecache
    implementation.

    With the new refcounting design it's much easier to protect against
    split_huge_page(): a simple reference on a page is enough to keep it from
    being split under you. It makes the gup_fast() implementation simpler and
    doesn't require a special case in the futex code to handle tail THP pages.

    It should improve THP utilization over the system, since splitting a THP
    in one process doesn't necessarily lead to splitting the page in all other
    processes that have the page mapped.

    The patchset drastically lowers the complexity of the get_page()/put_page()
    codepaths. I encourage people to look at this code before and after to
    justify the time budget for reviewing this patchset.

    This patch (of 37):

    With the new refcounting, not all subpages of a compound page necessarily
    have the same mapcount. We need to take the mapcount of every sub-page
    into account.
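
    A hedged sketch of per-sub-page accounting in the smaps path (each small
    page contributes PSS according to its own mapcount):

        for (i = 0; i < nr; i++, page++) {
                int mapcount = page_mapcount(page);

                if (mapcount >= 2) {
                        if (dirty || PageDirty(page))
                                mss->shared_dirty += PAGE_SIZE;
                        else
                                mss->shared_clean += PAGE_SIZE;
                        mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
                } else {
                        if (dirty || PageDirty(page))
                                mss->private_dirty += PAGE_SIZE;
                        else
                                mss->private_clean += PAGE_SIZE;
                        mss->pss += PAGE_SIZE << PSS_SHIFT;
                }
        }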

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

9 commits

  • While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out whether we're allowed to
    assign new @start_brk, @brk, @start_data and @end_data in mm_struct), it
    became clear that RLIMIT_DATA, in the form it's implemented now, doesn't
    do anything useful, because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something suitable
    for anonymous memory accounting. But this patch goes further, and the
    changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way the code looks cleaner: code/stack/data classification now
    depends only on vm_flags state (a sketch follows the list of limits
    below):

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be strange beast like readonly-private or VM_IO
    area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
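
    A hedged sketch of the vm_flags-based classification described above
    (helper structure approximate):

        void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
        {
                mm->total_vm += npages;

                if ((flags & (VM_EXEC | VM_WRITE)) == VM_EXEC)
                        mm->exec_vm += npages;                  /* code */
                else if (flags & (VM_GROWSUP | VM_GROWSDOWN))
                        mm->stack_vm += npages;                 /* stack */
                else if ((flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                        mm->data_vm += npages;                  /* data */
        }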

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY);
    VM_SOFTDIRTY was already cleared before walk_page_range().

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The MemAvailable item in /proc/meminfo is to give users a hint of how
    much memory is allocatable without causing swapping, so it excludes the
    zones' low watermarks as unavailable to userspace.

    However, for a userspace allocation, kswapd will actually reclaim until
    the free pages hit a combination of the high watermark and the page
    allocator's lowmem protection that keeps a certain amount of DMA and
    DMA32 memory from userspace as well.

    Subtract the full amount we know to be unavailable to userspace from the
    number of free pages when calculating MemAvailable.
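
    A hedged sketch of the adjustment (variable names approximate):

        /*
         * before: available = i.freeram - wmark_low;
         * after:  subtract the full reserve, i.e. the high watermarks
         *         plus the lowmem reserves
         */
        available = i.freeram - totalreserve_pages;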

    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are several shortcomings with the accounting of shared memory
    (SysV shm, shared anonymous mapping, mapping of a tmpfs file). The
    values in /proc/<pid>/status and <pid>/statm don't allow one to distinguish
    between shmem memory and a shared mapping to a regular file, even though
    their implications on memory usage are quite different: during reclaim,
    a file mapping can be dropped or written back to disk, while shmem needs a
    place in swap.

    Also, to distinguish the memory occupied by anonymous and file mappings,
    one has to read the /proc/pid/statm file, which has a field for the file
    mappings (again, including shmem) and total memory occupied by these
    mappings (i.e. equivalent to VmRSS in the <pid>/status file). Getting
    the value for anonymous mappings only is thus not exactly user-friendly
    (the statm file is intended to be rather efficiently machine-readable).

    To address both of these shortcomings, this patch adds a breakdown of
    VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
    RssShmem, making use of the previous preparatory patch. These fields
    tell the user the memory occupied by private anonymous pages, mapped
    regular files and shmem, respectively. Other existing fields in the
    status and statm files are left unchanged. The statm file can be
    extended in the future, if there's a need for that.

    Example (part of) /proc/pid/status output including the new Rss* fields:

    VmPeak: 2001008 kB
    VmSize: 2001004 kB
    VmLck: 0 kB
    VmPin: 0 kB
    VmHWM: 5108 kB
    VmRSS: 5108 kB
    RssAnon: 92 kB
    RssFile: 1324 kB
    RssShmem: 3692 kB
    VmData: 192 kB
    VmStk: 136 kB
    VmExe: 4 kB
    VmLib: 1784 kB
    VmPTE: 3928 kB
    VmPMD: 20 kB
    VmSwap: 0 kB
    HugetlbPages: 0 kB

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Currently, looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implications for actual memory
    use are quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As a preparation to extend the userspace interfaces,
    this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
    shmem pages separately from MM_FILEPAGES. The next patch will expose it
    to userspace - this patch doesn't change the exported values yet, by
    adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
    used before. The only user-visible change after this patch is the OOM
    killer message that separates the reported "shmem-rss" from "file-rss".
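
    A hedged sketch of a helper that picks the right counter for a file page
    (shmem pages are swap-backed):

        static inline int mm_counter_file(struct page *page)
        {
                if (PageSwapBacked(page))
                        return MM_SHMEMPAGES;
                return MM_FILEPAGES;
        }

        /* callers then do, e.g.: */
        inc_mm_counter(mm, mm_counter_file(page));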

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Following the previous patch, further reduction of /proc/pid/smaps cost
    is possible for private writable shmem mappings with unpopulated areas
    where the page walk invokes the .pte_hole function. We can use a radix
    tree iterator for each such area instead of calling find_get_entry() in
    a loop. This is possible at the extra maintenance cost of introducing
    another shmem function, shmem_partial_swap_usage().
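
    A simplified sketch of such an iteration (the real function also has to
    handle slot-deref retries and rescheduling):

        unsigned long shmem_partial_swap_usage(struct address_space *mapping,
                                               pgoff_t start, pgoff_t end)
        {
                struct radix_tree_iter iter;
                unsigned long swapped = 0;
                void **slot;

                rcu_read_lock();
                radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                        if (iter.index >= end)
                                break;
                        /* swapped-out pages are stored as exceptional entries */
                        if (radix_tree_exceptional_entry(radix_tree_deref_slot(slot)))
                                swapped++;
                }
                rcu_read_unlock();

                return swapped << PAGE_SHIFT;
        }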

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I timed how long it takes to
    cat /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch has improved swap accounting for shmem mappings, which
    however made /proc/pid/smaps more expensive for shmem mappings, as we
    consult the radix tree for each pte_none entry, so the overall complexity
    is O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics that the shmem object
    itself tracks (if the mapping contains the whole object, or the swap
    usage of the whole object is zero), or use the radix tree iterator,
    which is much more effective than repeated find_get_entry() calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the
    find_get_entry() approach.

    Hopefully such mappings are relatively uncommon.

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB, and
    timed how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
    (which needs to employ the radix tree iterator).

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
    shmem-backed mappings, even if the mapped portion does contain pages
    that were swapped out. This is because unlike private anonymous
    mappings, shmem does not change pte to swap entry, but pte_none when
    swapping the page out. In the smaps page walk, such page thus looks
    like it was never faulted in.

    This patch changes smaps_pte_entry() to determine the swap status for
    such pte_none entries for shmem mappings, similarly to how
    mincore_page() does it. Swapped out shmem pages are thus accounted for.
    For private mappings of tmpfs files that COWed some of the pages, the
    swapped-out status of the original shmem pages is naturally ignored. If
    some of the private copies were also swapped out, they are accounted via
    their page table swap entries, so the resulting reported swap usage is
    then a sum of both swapped out private copies and swapped out shmem pages
    that were not COWed. No double accounting can thus happen.
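
    A hedged sketch of the new pte_none handling, mirroring mincore_page()
    (an exceptional page-cache entry means the page is swapped out):

        if (pte_none(*pte) && vma->vm_file &&
            shmem_mapping(vma->vm_file->f_mapping)) {
                struct page *page;

                page = find_get_entry(vma->vm_file->f_mapping,
                                      linear_page_index(vma, addr));
                if (radix_tree_exceptional_entry(page))
                        mss->swap += PAGE_SIZE;
                else if (page)
                        put_page(page);
        }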

    The accounting is arguably still not as precise as for private anonymous
    mappings, since now we will count also pages that the process in
    question never accessed, but another process populated them and then let
    them become swapped out. I believe it is still less confusing and
    subtle than not showing any swap usage by shmem mappings at all.
    The swapped-out counter might be of interest to users who would like to
    prevent future swapins during a performance-critical operation and
    pre-fault the pages at their convenience. Especially for larger swapped
    out regions the cost of swapin is much higher than a fresh page
    allocation. So a differentiation between pte_none vs. swapped out is
    important for those usecases.

    One downside of this patch is that it makes /proc/pid/smaps more
    expensive for shmem mappings, as we consult the radix tree for each
    pte_none entry, so the overall complexity is O(n*log(n)). I have
    measured this on a process that creates a 2GB mapping and dirties single
    pages with a stride of 2MB, and timed how long it takes to cat
    /proc/pid/smaps of this process 100 times.

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    Mapping of a /dev/shm/file:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    The difference is rather substantial, so the next patch will reduce the
    cost for shared or read-only mappings.

    In a less controlled experiment, I've gathered pids of processes on my
    desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
    included the Chrome browser and some KDE processes. Again, I've run cat
    /proc/pid/smaps on each 100 times.

    Before this patch:

    real 0m9.050s
    user 0m0.518s
    sys 0m8.066s

    After this patch:

    real 0m9.221s
    user 0m0.541s
    sys 0m8.187s

    This suggests low impact on average systems.

    Note that this patch doesn't attempt to adjust the SwapPss field for
    shmem mappings, which would need extra work to determine who else could
    have the pages mapped. Thus the value stays zero except for COWed
    swapped out pages in a shmem mapping, which are accounted as usual.

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Jerome Marchand
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
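
    As an illustration, marking a cache or a one-off allocation for memcg
    accounting looks roughly like this (the cred cache is one of the listed
    objects; the kmalloc target is illustrative):

        cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
                        SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

        /* one-off allocations pass the gfp flag instead */
        fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL | __GFP_ACCOUNT);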

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of stuff. That probably should've been 5 or 6 separate
    branches, but by the time I'd realized how large and mixed that bag
    had become it had been too close to -final to play with rebasing.

    Some fs/namei.c cleanups there, memdup_user_nul() introduction and
    switching open-coded instances, burying long-dead code, whack-a-mole
    of various kinds, several new helpers for ->llseek(), assorted
    cleanups and fixes from various people, etc.

    One piece probably deserves special mention - Neil's
    lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
    called without ->i_mutex and tries to avoid ever taking it. That, of
    course, means that it's not useful for any directory modifications,
    but things like getting inode attributes in nfsd readdirplus are fine
    with that. I really should've asked for moratorium on lookup-related
    changes this cycle, but since I hadn't done that early enough... I
    *am* asking for that for the coming cycle, though - I'm going to try
    and get conversion of i_mutex to rwsem with ->lookup() done under lock
    taken shared.

    There will be a patch closer to the end of the window, along the lines
    of the one Linus had posted last May - mechanical conversion of
    ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
    inode_is_locked()/inode_lock_nested(). To quote Linus back then:

    -----
    | This is an automated patch using
    |
    | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    | with a very few manual fixups
    -----

    I'm going to send that once the ->i_mutex-affecting stuff in -next
    gets mostly merged (or when Linus says he's about to stop taking
    merges)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    nfsd: don't hold i_mutex over userspace upcalls
    fs:affs:Replace time_t with time64_t
    fs/9p: use fscache mutex rather than spinlock
    proc: add a reschedule point in proc_readfd_common()
    logfs: constify logfs_block_ops structures
    fcntl: allow to set O_DIRECT flag on pipe
    fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
    fs: xattr: Use kvfree()
    [s390] page_to_phys() always returns a multiple of PAGE_SIZE
    nbd: use ->compat_ioctl()
    fs: use block_device name vsprintf helper
    lib/vsprintf: add %*pg format specifier
    fs: use gendisk->disk_name where possible
    poll: plug an unused argument to do_poll
    amdkfd: don't open-code memdup_user()
    cdrom: don't open-code memdup_user()
    rsxx: don't open-code memdup_user()
    mtip32xx: don't open-code memdup_user()
    [um] mconsole: don't open-code memdup_user_nul()
    [um] hostaudio: don't open-code memdup_user()
    ...

    Linus Torvalds
     

12 Jan, 2016

1 commit

  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"
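
    For reference, the reworked method looks along these lines (a sketch; the
    dentry may be NULL in RCU mode, and the cleanup that ->put_link() used to
    do is now scheduled through the delayed_call):

        const char *(*get_link)(struct dentry *dentry, struct inode *inode,
                                struct delayed_call *done);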

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

09 Jan, 2016

1 commit

  • A user can pass an arbitrarily large buffer to getdents().

    It is typically a 32KB buffer used by the libc scandir() implementation.

    When scanning /proc/{pid}/fd, we can hold the CPU for way too long,
    so add a cond_resched() to be kind to other tasks.

    We've seen latencies of more than 50ms on real workloads.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    Eric Dumazet
     


19 Dec, 2015

1 commit

  • Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
    774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
    the setting of ret after the get_proc_task call and incorrectly left it as
    -ESRCH. Instead, return 0 when successful.

    Example breakage:

    echo 0 > /proc/self/coredump_filter
    bash: echo: write error: No such process
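
    A hedged sketch of the fix in the write handler (reset ret once the
    lookups succeed so the success path returns the byte count):

        ret = -ESRCH;
        task = get_proc_task(file_inode(file));
        if (!task)
                goto out_no_task;

        mm = get_task_mm(task);
        if (mm) {
                ret = 0;        /* the missing reset: success must not
                                   keep the earlier -ESRCH */
                /* update the coredump bits in mm->flags here */
                mmput(mm);
        }
        put_task_struct(task);
 out_no_task:
        return ret < 0 ? ret : count;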

    Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
    Signed-off-by: Colin Ian King
    Acked-by: Kees Cook
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     


08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "Trivial stuff from trivial tree that can be trivially summed up as:

    - treewide drop of spurious unlikely() before IS_ERR() from Viresh
    Kumar

    - cosmetic fixes (that don't really affect basic functionality of the
    driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

    - various comment / printk fixes and updates all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    bcache: Really show state of work pending bit
    hwmon: applesmc: fix comment typos
    Kconfig: remove comment about scsi_wait_scan module
    class_find_device: fix reference to argument "match"
    debugfs: document that debugfs_remove*() accepts NULL and error values
    net: Drop unlikely before IS_ERR(_OR_NULL)
    mm: Drop unlikely before IS_ERR(_OR_NULL)
    fs: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
    UBI: Update comments to reflect UBI_METAONLY flag
    pktcdvd: drop null test before destroy functions

    Linus Torvalds
     

07 Nov, 2015

2 commits

  • Commit 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly")
    fixed the access to /proc/self/fd from sub-threads, but introduced another
    problem: a sub-thread can't access /proc/<pid>/fd/ or /proc/thread-self/fd
    if generic_permission() fails.

    Change proc_fd_permission() to check same_thread_group(pid_task(), current).
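
    A hedged sketch of the updated check (roughly: fall back to allowing any
    task in the same thread group as the target):

        static int proc_fd_permission(struct inode *inode, int mask)
        {
                struct task_struct *p;
                int rv;

                rv = generic_permission(inode, mask);
                if (rv == 0)
                        return rv;

                rcu_read_lock();
                p = pid_task(proc_pid(inode), PIDTYPE_PID);
                if (p && same_thread_group(p, current))
                        rv = 0;
                rcu_read_unlock();

                return rv;
        }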

    Fixes: 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly")
    Reported-by: "Jin, Yihua"
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • For now in task_name() we ignore the return code of the string_escape_str()
    call. This is not good if the buffer suddenly becomes too small. Do
    proper error handling there.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

06 Nov, 2015

5 commits

  • /proc/pid/oom_adj exists solely to avoid breaking existing userspace
    binaries that write to the tunable.

    Add a comment in the only possible location within the kernel tree to
    describe the situation and motivation for keeping it around.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Don't build clear_soft_dirty_pmd() if transparent huge pages are not
    enabled.

    Signed-off-by: Laurent Dufour
    Reviewed-by: Aneesh Kumar K.V
    Cc: Pavel Emelyanov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • As mentioned in commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa()
    for updating _PAGE_NUMA bit"), architectures like ppc64 don't do a TLB
    flush in the set_pte/pmd functions.

    So when dealing with an existing pte in clear_soft_dirty(), the pte must
    be cleared before being modified.
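
    One way to express that ordering (a hedged sketch; the in-tree code may
    use the ptep_modify_prot_start/commit helpers instead):

        if (pte_present(ptent)) {
                /* clear first, so architectures that skip the TLB flush in
                 * set_pte_at() never see an intermediate value */
                ptent = ptep_get_and_clear(vma->vm_mm, addr, pte);
                ptent = pte_wrprotect(ptent);
                ptent = pte_clear_soft_dirty(ptent);
                set_pte_at(vma->vm_mm, addr, pte, ptent);
        }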

    Signed-off-by: Laurent Dufour
    Reviewed-by: Aneesh Kumar K.V
    Cc: Pavel Emelyanov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Currently there's no easy way to get per-process usage of hugetlb pages,
    which is inconvenient because userspace applications which use hugetlb
    typically want to control their processes on the basis of how much memory
    (including hugetlb) they use. So this patch simply provides easy access
    to the info via /proc/PID/status.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Joern Engel
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB),
    which is inconvenient when we want to know per-task or per-vma base
    hugetlb usage. To solve this, this patch adds new fields for hugetlb
    usage like below:

    Size: 20480 kB
    Rss: 0 kB
    Pss: 0 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 0 kB
    Referenced: 0 kB
    Anonymous: 0 kB
    AnonHugePages: 0 kB
    Shared_Hugetlb: 18432 kB
    Private_Hugetlb: 2048 kB
    Swap: 0 kB
    KernelPageSize: 2048 kB
    MMUPageSize: 2048 kB
    Locked: 0 kB
    VmFlags: rd wr mr mw me de ht

    [hughd@google.com: fix Private_Hugetlb alignment ]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Joern Engel
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Nov, 2015

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There is only one new feature in this pull for the 4.4 merge window,
    most of it is small enhancements, cleanup and bug fixes:

    - Add the s390 backend for the software dirty bit tracking. This
    adds two new pgtable functions pte_clear_soft_dirty and
    pmd_clear_soft_dirty which is why there is a hit to
    arch/x86/include/asm/pgtable.h in this pull request.

    - A series of cleanup patches for the AP bus, this includes the
    removal of the support for two outdated crypto cards (PCICC and
    PCICA).

    - The irq handling / signaling on buffer full in the runtime
    instrumentation code is dropped.

    - Some micro optimizations: remove unnecessary memory barriers for a
    couple of functions: [smp_]rmb, [smp_]wmb, atomics, bitops, and for
    spin_unlock. Use the builtin bswap if available and make
    test_and_set_bit_lock more cache friendly.

    - Statistics and a tracepoint for the diagnose calls to the
    hypervisor.

    - The CPU measurement facility support to sample KVM guests is
    improved.

    - The vector instructions are now always enabled for user space
    processes if the hardware has the vector facility. This simplifies
    the FPU handling code. The fpu-internal.h header is split into fpu
    internals, api and types just like x86.

    - Cleanup and improvements for the common I/O layer.

    - Rework udelay to solve a problem with kprobe. udelay has busy loop
    semantics but still uses an idle processor state for the wait"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits)
    s390: remove runtime instrumentation interrupts
    s390/cio: de-duplicate subchannel validation
    s390/css: unneeded initialization in for_each_subchannel
    s390/Kconfig: use builtin bswap
    s390/dasd: fix disconnected device with valid path mask
    s390/dasd: fix invalid PAV assignment after suspend/resume
    s390/dasd: fix double free in dasd_eckd_read_conf
    s390/kernel: fix ptrace peek/poke for floating point registers
    s390/cio: move ccw_device_stlck functions
    s390/cio: move ccw_device_call_handler
    s390/topology: reduce per_cpu() invocations
    s390/nmi: reduce size of percpu variable
    s390/nmi: fix terminology
    s390/nmi: remove casts
    s390/nmi: remove pointless error strings
    s390: don't store registers on disabled wait anymore
    s390: get rid of __set_psw_mask()
    s390/fpu: split fpu-internal.h into fpu internals, api, and type headers
    s390/dasd: fix list_del corruption after lcu changes
    s390/spinlock: remove unneeded serializations at unlock
    ...

    Linus Torvalds
     

04 Nov, 2015

1 commit

  • Pull wchan kernel address hiding from Ingo Molnar:
    "This fixes a wchan related information leak in /proc/PID/stat.

    There's a bit of an ABI twist to it: instead of setting the wchan
    field to 0 (which is our usual technique) we set it conditionally to a
    0/1 flag to keep ABI compatibility with older procps versions that
    only fetches /proc/PID/wchan (symbolic names) if the absolute wchan
    address is nonzero"

    * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    fs/proc, core/debug: Don't expose absolute kernel addresses via wchan

    Linus Torvalds
     

02 Nov, 2015

1 commit

  • It turns out that at least some versions of glibc end up reading
    /proc/meminfo at every single startup, because glibc wants to know the
    amount of memory the machine has. And while that's arguably insane,
    it's just how things are.

    And it turns out that it's not all that expensive most of the time, but
    the vmalloc information statistics (amount of virtual memory used in the
    vmalloc space, and the biggest remaining chunk) can be rather expensive
    to compute.

    The 'get_vmalloc_info()' function actually showed up on my profiles as
    4% of the CPU usage of "make test" in the git source repository, because
    the git tests are lots of very short-lived shell-scripts etc.

    It turns out that apparently this same silly vmalloc info gathering
    shows up on the facebook servers too, according to Dave Jones. So it's
    not just "make test" for git.

    We had two patches to just cache the information (one by me, one by
    Ingo) to mitigate this issue, but the whole vmalloc information is of
    rather dubious value to begin with, and people who *actually* want to
    know what the situation is wrt the vmalloc area should just look at the
    much more complete /proc/vmallocinfo instead.

    In fact, according to my testing - and perhaps more importantly,
    according to that big search engine in the sky: Google - there is
    nothing out there that actually cares about those two expensive fields:
    VmallocUsed and VmallocChunk.

    So let's try to just remove them entirely. Actually, this just removes
    the computation and reports the numbers as zero for now, just to try to
    be minimally intrusive.

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top. But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Oct, 2015

1 commit

  • There are primitives to create and query the software dirty bits
    in a pte or pmd. But the clearing of the software dirty bits is done
    in common code with x86 specific page table functions.

    Add the missing architecture primitives to clear the software dirty
    bits to allow the feature to be used on non-x86 systems, e.g. the
    s390 architecture.
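
    On x86 the new primitives amount to flag clears (sketch):

        static inline pte_t pte_clear_soft_dirty(pte_t pte)
        {
                return pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
        }

        static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd)
        {
                return pmd_clear_flags(pmd, _PAGE_SOFT_DIRTY);
        }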

    Acked-by: Cyrill Gorcunov
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

01 Oct, 2015

1 commit

  • So the /proc/PID/stat 'wchan' field (the 35th field, which contains
    the absolute kernel address of the kernel function a task is blocked in)
    leaks absolute kernel addresses to unprivileged user-space:

    seq_put_decimal_ull(m, ' ', wchan);

    The absolute address might also leak via /proc/PID/wchan as well, if
    KALLSYMS is turned off or if the symbol lookup fails for some reason:

    static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
                              struct pid *pid, struct task_struct *task)
    {
            unsigned long wchan;
            char symname[KSYM_NAME_LEN];

            wchan = get_wchan(task);

            if (lookup_symbol_name(wchan, symname) < 0) {
                    if (!ptrace_may_access(task, PTRACE_MODE_READ))
                            return 0;
                    seq_printf(m, "%lu", wchan);
            } else {
                    seq_printf(m, "%s", symname);
            }

            return 0;
    }

    This isn't ideal, because for example it trivially leaks the KASLR offset
    to any local attacker:

    fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
    ffffffff8123b380

    Most real-life uses of wchan are symbolic:

    ps -eo pid:10,tid:10,wchan:30,comm

    and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

    triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
    open("/proc/30833/wchan", O_RDONLY) = 6

    There's one compatibility quirk here: procps relies on whether the
    absolute value is non-zero - and we can provide that functionality
    by outputting "0" or "1" depending on whether the task is blocked
    (whether there's a wchan address).

    These days there appears to be very little legitimate reason
    user-space would be interested in the absolute address. The
    absolute address is mostly historic: from the days when we
    didn't have kallsyms and user-space procps had to do the
    decoding itself via the System.map.

    So this patch sets all numeric output to "0" or "1" and keeps only
    symbolic output, in /proc/PID/wchan.

    ( The absolute sleep address can generally still be profiled via
    perf, by tasks with sufficient privileges. )
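
    A hedged sketch of what the numeric /proc/PID/stat output reduces to (the
    exact code may differ):

        /* emit only whether the task is blocked, not where */
        if (wchan)
                seq_puts(m, " 1");
        else
                seq_puts(m, " 0");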

    Reviewed-by: Thomas Gleixner
    Acked-by: Kees Cook
    Acked-by: Linus Torvalds
    Cc:
    Cc: Al Viro
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: Kostya Serebryany
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: kasan-dev
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Sep, 2015

1 commit

  • IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint and there
    is no need to add another one in its callers. Drop it.
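
    The change is mechanical, e.g. (variable name illustrative):

        /* before */
        if (unlikely(IS_ERR(req)))
                return PTR_ERR(req);

        /* after */
        if (IS_ERR(req))
                return PTR_ERR(req);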

    Signed-off-by: Viresh Kumar
    Reviewed-by: Jeff Layton
    Reviewed-by: David Howells
    Reviewed-by: Steve French
    Signed-off-by: Jiri Kosina

    Viresh Kumar