25 Jan, 2016

1 commit

  • If we detect that there is nothing to do, just set the flag and do not
    check whether it was already set before. Races really do not matter: if
    the flag is set by any code, the shepherd will start dealing with the
    situation and re-enable the vmstat workers when necessary.

    Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
    and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
    a particular cpu to be handled by vmstat_shepherd. This might trigger a
    VM_BUG_ON in vmstat_update because the work item might have been
    sleeping during the idle period and see the cpu_stat_off updated after
    the wake up. The VM_BUG_ON is therefore misleading and no longer
    appropriate. Moreover, it doesn't really provide any protection against
    real bugs, because vmstat_shepherd will simply reschedule the vmstat_work
    any time it sees a particular cpu set, or vmstat_update would do the same
    from the worker context directly. Even if the two raced, the result
    wouldn't be incorrect, as the counters update is fully idempotent.
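
    Schematically, the change in vmstat_update() looks like the sketch below;
    requeue_vmstat_work() is a placeholder for the existing requeue logic, the
    other names follow the surrounding vmstat code:

        if (refresh_cpu_vm_stats(true)) {
                /* counters were updated: keep the deferred worker running */
                requeue_vmstat_work();  /* placeholder for the existing requeue */
        } else {
                /*
                 * Nothing to do: just set the flag, no test-and-set and no
                 * VM_BUG_ON.  The shepherd re-enables the worker when the
                 * counters start changing again, and racing with it is safe
                 * because the counters update is idempotent.
                 */
                cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
        }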

    Reported-by: Sasha Levin
    Signed-off-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

6 commits

  • There are many locations that do

        if (memory_was_allocated_by_vmalloc)
                vfree(ptr);
        else
                kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you found
    problems.
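
    For illustration, the conversion collapses the branch above into a single
    call:

        /* before: the caller must remember how the buffer was allocated */
        if (is_vmalloc_addr(ptr))
                vfree(ptr);
        else
                kfree(ptr);

        /* after: kvfree() performs the is_vmalloc_addr() check itself */
        kvfree(ptr);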

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • To properly handle fsync/msync in an efficient way DAX needs to track
    dirty pages so it is able to flush them durably to media on demand.

    The tracking of dirty pages is done via the radix tree in struct
    address_space. This radix tree is already used by the page writeback
    infrastructure for tracking dirty pages associated with an open file,
    and it already has support for exceptional (non struct page*) entries.
    We build upon these features to add exceptional entries to the radix
    tree for DAX dirty PMD or PTE pages at fault time.
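
    Schematically, the fault-time bookkeeping this enables looks something
    like the sketch below; dax_dirty_entry() is a made-up helper name, error
    handling is omitted, and the real code also has to encode the entry type
    and size and cope with already-present entries:

        static void dax_dirty_entry(struct address_space *mapping, pgoff_t index,
                                    void *exceptional_entry)
        {
                spin_lock_irq(&mapping->tree_lock);
                /* remember the dirty DAX range as an exceptional entry */
                radix_tree_insert(&mapping->page_tree, index, exceptional_entry);
                /* tag it so fsync/msync can find it later */
                radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
                spin_unlock_irq(&mapping->tree_lock);
        }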

    [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add find_get_entries_tag() to the family of functions that include
    find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
    needed for DAX dirty page handling because we need a list of both page
    offsets and radix tree entries ('indices' and 'entries' in this
    function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
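
    A sketch of the expected fsync-time usage, assuming a calling convention
    analogous to find_get_entries(); flush_one_dax_entry() is a made-up
    placeholder:

        #define NR_BATCH 16

        pgoff_t indices[NR_BATCH];
        struct page *entries[NR_BATCH]; /* may hold exceptional entries */
        pgoff_t index = start;
        unsigned nr, i;

        /* walk everything tagged for writeback in the mapping's radix tree */
        while ((nr = find_get_entries_tag(mapping, index, PAGECACHE_TAG_TOWRITE,
                                          NR_BATCH, entries, indices))) {
                for (i = 0; i < nr; i++)
                        flush_one_dax_entry(mapping, indices[i], entries[i]);
                index = indices[nr - 1] + 1;
        }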

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Al Viro

    Al Viro
     
  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}:
    inode_foo(inode) is mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.
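
    The wrappers themselves are thin; roughly (lock/unlock/trylock shown, the
    other variants follow the same pattern):

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

    Callers then write inode_lock(inode) instead of
    mutex_lock(&inode->i_mutex), so they keep working unchanged once
    ->i_mutex is switched to an rwsem.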

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

3 commits

  • This crash is caused by a NULL pointer dereference in the page_to_pfn()
    macro, when page == NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    Internal error: Oops: 94000006 [#1] SMP
    Modules linked in:
    CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    PC is at khugepaged+0x378/0x1af8
    LR is at khugepaged+0x418/0x1af8
    Process khugepaged (pid: 26, stack limit = 0xffffffc079638020)
    Call trace:
    khugepaged+0x378/0x1af8
    kthread+0xdc/0xf4
    ret_from_fork+0xc/0x40
    Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0)
    ---[ end trace 637503d8e28ae69e ]---
    Kernel panic - not syncing: Fatal exception
    CPU2: stopping
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    Hardware name: linux,dummy-virt (DT)

    [akpm@linux-foundation.org: fix fat-fingered merge resolution]
    Signed-off-by: yalin wang
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
            void *addr;

            system("grep Mlocked /proc/meminfo");
            addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                        -1, 0);
            if (addr == MAP_FAILED)
                    printf("mmap() failed\n"), exit(1);
            munmap(addr, SIZE);
            system("grep Mlocked /proc/meminfo");
            return 0;
    }

    It happens on munlock_vma_page() due to unfortunate choice of nr_pages
    data type:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    With nr_pages being an unsigned int, implicitly converted to long in
    __mod_zone_page_state(), the value becomes something around UINT_MAX.

    munlock_vma_page() is usually called for THP, as small pages go through
    the pagevec.

    Let's make nr_pages a signed int.

    A similar fix, 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()"), used the `long' type, but `int' is OK here for
    a count of the number of sub-pages in a huge page.
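
    A userspace toy program shows the effect of the unsigned negation
    (illustration only, not kernel code):

        #include <stdio.h>

        int main(void)
        {
                unsigned int nr_pages = 512;    /* sub-pages of a 2MB THP */
                long delta = -nr_pages;         /* intended: -512 */

                /* the negation happens in unsigned arithmetic first */
                printf("%ld\n", delta);         /* prints 4294966784 on 64-bit */
                return 0;
        }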

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.
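
    Callers then look roughly like this (a sketch, with the actual work
    elided):

        spinlock_t *ptl;

        ptl = pmd_trans_huge_lock(pmd, vma);
        if (!ptl)
                return 0;       /* not a stable huge pmd, nothing to do */

        /* ... operate on the huge pmd with ptl held ... */

        spin_unlock(ptl);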

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

27 commits

  • Provide statistics on how much of a cgroup's memory footprint is made up
    of socket buffers from network connections owned by the group.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Provide a cgroup2 memory.stat that provides statistics on LRU memory
    and fault event counters. More consumers and breakdowns will follow.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing page->mem_cgroup of a live page is tricky and fragile. In
    particular, the memcg writeback code relies on that mapping being stable
    and users of mem_cgroup_replace_page() not overlapping with dirtyable
    inodes.

    Page cache replacement doesn't have to do that, though. Instead of being
    clever and transferring the charge from the old page to the new,
    force-charge the new page and leave the old page alone. A temporary
    overcharge won't matter in practice, and the old page is going to be freed
    shortly after this anyway. And this is not performance critical.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swap cache pages are freed aggressively if swap is nearly full (>50%
    currently), because otherwise we are likely to stop scanning anonymous
    memory when we near the swap limit even if there are plenty of freeable
    swap cache pages. We should follow the same trend in the case of the
    memory cgroup, which has its own swap limit.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We don't scan anonymous memory if we have run out of swap, and neither
    should we do it when the memcg swap limit is hit, because swap-out is
    impossible anyway.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mem_cgroup_lruvec_online() takes lruvec, but it only needs memcg. Since
    get_scan_count(), which is the only user of this function, now possesses
    pointer to memcg, let's pass memcg directly to mem_cgroup_online() instead
    of picking it out of lruvec and rename the function accordingly.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • memcg will come in handy in get_scan_count(). It can already be used for
    getting swappiness immediately in get_scan_count() instead of passing it
    around. The following patches will add more memcg-related values, which
    will be used there.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
    swap usage less than memory usage. Taking into account the fact that
    the primary limiting mechanism in the unified hierarchy is
    memory.high while memory.limit is either left unset or set to a very
    large value, moving memsw.limit knob to the unified hierarchy would
    effectively make it impossible to limit swap usage according to the
    user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
    both swap entry and a swap cache page is charged only once to memsw
    counter. As a result, it is possible to effectively eat up to
    memory.limit of memory pages *and* memsw.limit of swap entries, which
    looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using new memory.swap.current file and
    limited using memory.swap.max.

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The creation and teardown of struct mem_cgroup is fairly messy and
    that has attracted mistakes and subtle bugs before.

    The main cause for this is that there is no clear model about what
    needs to happen when, and that attracts more chaos. So create one:

    1. mem_cgroup_alloc() should allocate struct mem_cgroup and its
    auxiliary members and initialize work items, locks etc. so that the
    object it returns is fully initialized and in a neutral state.

    2. mem_cgroup_css_alloc() will use mem_cgroup_alloc() to obtain a new
    memcg object and configure it and the system according to the role
    of the new memory-controlled cgroup in the hierarchy.

    3. mem_cgroup_css_online() is no longer needed to synchronize with
    iterators, but it verifies css->id which isn't available earlier.

    4. mem_cgroup_css_offline() implements stuff that needs to happen upon
    the user-visible destruction of a cgroup, which includes stopping
    all user interfacing as well as releasing certain structures when
    continued memory consumption would be unexpected at that point.

    5. mem_cgroup_css_free() prepares the system and the memcg object for
    the object's disappearance, neutralizes its state, and then gives
    it back to mem_cgroup_free().

    6. mem_cgroup_free() releases struct mem_cgroup and auxiliary memory.

    [arnd@arndb.de: fix SLOB build regression]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are no more external users of struct cg_proto, flatten the
    structure into struct mem_cgroup.

    Since using those struct members doesn't stand out as much anymore,
    add cgroup2 static branches to make it clearer which code is legacy.

    Suggested-by: Vladimir Davydov
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant; having these conditionals is not
    worth the complication and fragility that come with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
    and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
    network subsys. Let's move it to memcontrol.c. This also allows us to
    reuse generic code for handling legacy memcg files.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: "David S. Miller"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Kmem accounting might incur overhead that some users can't put up with.
    Besides, the implementation is still considered unstable. So let's
    provide a way to disable it for those users who aren't happy with it.

    To disable kmem accounting for cgroup2, pass cgroup.memory=nokmem at
    boot time.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The original cgroup memory controller has an extension to account slab
    memory (and other "kernel memory" consumers) in a separate "kmem"
    counter, once the user sets an explicit limit on that "kmem" pool.

    However, this includes various consumers whose sizes are directly linked
    to userspace activity. Accounting them as an optional "kmem" extension
    is problematic for several reasons:

    1. It leaves the main memory interface with incomplete semantics. A
    user who puts their workload into a cgroup and configures a memory
    limit does not expect us to leave holes in the containment as big
    as the dentry and inode cache, or the kernel stack pages.

    2. If the limit set on this random historical subgroup of consumers is
    reached, subsequent allocations will fail even when the main memory
    pool available to the cgroup is not yet exhausted and/or has
    reclaimable memory in it.

    3. Calling it 'kernel memory' is misleading. The dentry and inode
    caches are no more 'kernel' (or no less 'user') memory than the
    page cache itself. Treating these consumers as different classes is
    a historical implementation detail that should not leak to users.

    So, in addition to page cache, anonymous memory, and network socket
    memory, account the following memory consumers per default in the
    cgroup2 memory controller:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems.

    This should give us reasonable memory isolation for most common
    workloads out of the box.
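
    Mechanically, slab-backed consumers in this list are opted in by creating
    their caches with SLAB_ACCOUNT, so the objects are charged to the
    allocating task's memcg. A sketch along those lines (flags abbreviated,
    not the exact hunk from the series):

        /* files_struct objects become memcg-accounted via SLAB_ACCOUNT */
        files_cachep = kmem_cache_create("files_cache",
                        sizeof(struct files_struct), 0,
                        SLAB_PANIC | SLAB_ACCOUNT, NULL);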

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup2 memory controller will account important in-kernel memory
    consumers per default. Move all necessary components to CONFIG_MEMCG.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup2 memory controller will include important in-kernel memory
    consumers per default, including socket memory, but it will no longer
    carry the historic tcp control interface.

    Separate the kmem state init from the tcp control interface init in
    preparation for that.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Put all the code related to setting up and tearing down the kmem
    accounting state into the same location. No functional change intended.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • On any given memcg, the kmem accounting feature has three separate
    states: not initialized, structures allocated, and actively accounting
    slab memory. These are represented through a combination of the
    kmem_acct_activated and kmem_acct_active flags, which is confusing.

    Convert to a kmem_state enum with the states NONE, ALLOCATED, and
    ONLINE. Then rename the functions to modify the state accordingly.
    This follows the nomenclature of css object states more closely.
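
    A sketch of the resulting state machine (the exact identifiers in the
    tree may differ):

        enum memcg_kmem_state {
                KMEM_NONE,      /* not initialized */
                KMEM_ALLOCATED, /* structures allocated */
                KMEM_ONLINE,    /* actively accounting slab memory */
        };

    A single kmem_state field in struct mem_cgroup then replaces the
    kmem_acct_activated/kmem_acct_active flag pair.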

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The kmem page_counter's limit is initialized to PAGE_COUNTER_MAX inside
    mem_cgroup_css_online(). There is no need to repeat this from
    memcg_propagate_kmem().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This series adds accounting of the historical "kmem" memory consumers to
    the cgroup2 memory controller.

    These consumers include the dentry cache, the inode cache, kernel stack
    pages, and a few others that are pointed out in patch 7/8. The
    footprint of these consumers is directly tied to userspace activity in
    common workloads, and so they have to be part of the minimally viable
    configuration in order to present a complete feature to our users.

    The cgroup2 interface of the memory controller is far from complete, but
    this series, along with the socket memory accounting series, provides
    the final semantic changes for the existing memory knobs in the cgroup2
    interface, which is scheduled for initial release in the next merge
    window.

    This patch (of 8):

    Remove the unused css argument from memcg_init_kmem().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only functions doing more than one read are modified. Consumers
    happened to deal with possibly changing data, but it does not seem like
    a good thing to rely on.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     
  • UBSAN uses compile-time instrumentation to catch undefined behavior
    (UB). The compiler inserts code that performs certain kinds of checks
    before operations that could cause UB. If a check fails (i.e. UB is
    detected), a __ubsan_handle_* function is called to print an error
    message.

    So most of the work is done by the compiler. This patch just implements
    the ubsan handlers that print the errors.

    GCC has had this capability since 4.9.x [1] (see the
    -fsanitize=undefined option and its suboptions).
    However, GCC 5.x has more checkers implemented [2].
    Article [3] has a bit more detail about UBSAN in GCC.

    [1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
    [2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
    [3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/
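
    As a tiny illustration of the class of bugs involved (compare the
    rol32(0) entry below), compiling something like this with
    -fsanitize=undefined makes GCC insert a runtime check that calls a
    __ubsan_handle_* routine when shift is 0:

        static inline unsigned int rol32_bad(unsigned int word, unsigned int shift)
        {
                /* undefined for shift == 0: "word >> 32" shifts by the type width */
                return (word << shift) | (word >> (32 - shift));
        }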

    Issues which UBSAN has found thus far are:

    Found bugs:

    * out-of-bounds access - 97840cb67ff5 ("netfilter: nfnetlink: fix
    insufficient validation in nfnetlink_bind")

    undefined shifts:

    * d48458d4a768 ("jbd2: use a better hash function for the revoke
    table")

    * 10632008b9e1 ("clockevents: Prevent shift out of bounds")

    * 'x << -1' shift in ext4 -
    http://lkml.kernel.org/r/

    * undefined rol32(0) -
    http://lkml.kernel.org/r/

    * undefined dirty_ratelimit calculation -
    http://lkml.kernel.org/r/

    * undefined rounddown_pow_of_two(0) -
    http://lkml.kernel.org/r/

    * [WONTFIX] undefined shift in __bpf_prog_run -
    http://lkml.kernel.org/r/

    WONTFIX here because it should be fixed in bpf program, not in kernel.

    signed overflows:

    * 32a8df4e0b33f ("sched: Fix odd values in effective_load()
    calculations")

    * mul overflow in ntp -
    http://lkml.kernel.org/r/

    * incorrect conversion into rtc_time in rtc_time64_to_tm() -
    http://lkml.kernel.org/r/

    * unvalidated timespec in io_getevents() -
    http://lkml.kernel.org/r/

    * [NOTABUG] signed overflow in ktime_add_safe() -
    http://lkml.kernel.org/r/

    [akpm@linux-foundation.org: fix unused local warning]
    [akpm@linux-foundation.org: fix __int128 build woes]
    Signed-off-by: Andrey Ryabinin
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Randy Dunlap
    Cc: Rasmus Villemoes
    Cc: Jonathan Corbet
    Cc: Michal Marek
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Yury Gribov
    Cc: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Kostya Serebryany
    Cc: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).
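
    A sketch of how callers spell their intent with the two new flag
    variants; the *_FSCREDS form is for filesystem-like access checks such
    as procfs, the *_REALCREDS form for an actual ptrace attach:

        /* procfs-style read access check, e.g. for /proc/$pid/maps */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
                return -EACCES;

        /* a real ptrace attach keeps judging the tracer by its real creds */
        if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
                return -EPERM;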

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • record_obj() in migrate_zspage() does not preserve the handle's
    HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
    (accidentally) un-pins the handle, while migrate_zspage() still performs
    an explicit unpin_tag() on that handle. This additional explicit
    unpin_tag() introduces a race condition with zs_free(), which can have
    pinned that handle by this time, so the handle ends up un-pinned.
    Schematically, it goes like this:

    CPU0                                        CPU1
    migrate_zspage
      find_alloced_obj
        trypin_tag
          set HANDLE_PIN_BIT                    zs_free()
                                                  pin_tag()
      obj_malloc() -- new object, no tag
      record_obj() -- remove HANDLE_PIN_BIT       set HANDLE_PIN_BIT
      unpin_tag() -- remove zs_free's HANDLE_PIN_BIT

    The race condition may result in a NULL pointer dereference:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
    PC is at get_zspage_mapping+0x0/0x24
    LR is at obj_free.isra.22+0x64/0x128
    Call trace:
    get_zspage_mapping+0x0/0x24
    zs_free+0x88/0x114
    zram_free_page+0x64/0xcc
    zram_slot_free_notify+0x90/0x108
    swap_entry_free+0x278/0x294
    free_swap_and_cache+0x38/0x11c
    unmap_single_vma+0x480/0x5c8
    unmap_vmas+0x44/0x60
    exit_mmap+0x50/0x110
    mmput+0x58/0xe0
    do_exit+0x320/0x8dc
    do_group_exit+0x44/0xa8
    get_signal+0x538/0x580
    do_signal+0x98/0x4b8
    do_notify_resume+0x14/0x5c

    This patch keeps the lock bit set in the migration path and updates the
    value atomically.
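
    A sketch of the fixed migration path, simplified from the actual
    migrate_zspage() loop (helper names follow mm/zsmalloc.c):

        free_obj = obj_malloc(d_page, class, handle);
        zs_object_copy(free_obj, used_obj, class);
        /*
         * record_obj() overwrites the handle and would clear the
         * HANDLE_PIN_BIT that find_alloced_obj()/trypin_tag() set,
         * letting zs_free() race in.  Keep the lock bit set in the new
         * value and store it in one go.
         */
        free_obj |= BIT(HANDLE_PIN_BIT);
        record_obj(handle, free_obj);
        unpin_tag(handle);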

    Signed-off-by: Junil Lee
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junil Lee
     
  • split_queue_lock can be taken from interrupt context in some cases, but
    I forgot to convert locking in split_huge_page() to interrupt-safe
    primitives.

    Let's fix this.

    lockdep output:

    ======================================================
    [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    4.4.0+ #259 Tainted: G W
    ------------------------------------------------------
    syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
    (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436

    and this task is already holding:
    (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
    (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
    which would create a new lock dependency:
    (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (slock-AF_INET){+.-...}
    ... which became SOFTIRQ-irq-safe at:
    mark_irqflags kernel/locking/lockdep.c:2799
    __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
    flush_stack+0x50/0x330 net/ipv6/udp.c:799
    __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
    __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
    udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
    ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:498
    ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
    __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
    __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
    netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
    napi_skb_finish net/core/dev.c:4542
    napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
    e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
    e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
    napi_poll net/core/dev.c:5074
    net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
    __do_softirq+0x26a/0x920 kernel/softirq.c:273
    invoke_softirq kernel/softirq.c:350
    irq_exit+0x18f/0x1d0 kernel/softirq.c:391
    exiting_irq ./arch/x86/include/asm/apic.h:659
    do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
    ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
    arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
    default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
    default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
    cpuidle_idle_call kernel/sched/idle.c:156
    cpu_idle_loop kernel/sched/idle.c:252
    cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
    rest_init+0x192/0x1a0 init/main.c:412
    start_kernel+0x678/0x69e init/main.c:683
    x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
    x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184

    to a SOFTIRQ-irq-unsafe lock:
    (split_queue_lock){+.+...}
    which became SOFTIRQ-irq-unsafe at:
    mark_irqflags kernel/locking/lockdep.c:2817
    __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
    split_huge_page include/linux/huge_mm.h:99
    queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
    walk_pmd_range mm/pagewalk.c:50
    walk_pud_range mm/pagewalk.c:90
    walk_pgd_range mm/pagewalk.c:116
    __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
    walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
    queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
    migrate_to_node mm/mempolicy.c:1004
    do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
    SYSC_migrate_pages mm/mempolicy.c:1453
    SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
    entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

        CPU0                          CPU1
        ----                          ----
    lock(split_queue_lock);
                                      local_irq_disable();
                                      lock(slock-AF_INET);
                                      lock(split_queue_lock);
    <Interrupt>
      lock(slock-AF_INET);
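
    The fix is the usual conversion to the interrupt-safe lock primitives;
    roughly:

        unsigned long flags;

        /*
         * split_queue_lock can also be taken from free_transhuge_page() in
         * softirq context, so disable interrupts while holding it.
         */
        spin_lock_irqsave(&split_queue_lock, flags);
        /* ... add/remove the page on the deferred split queue ... */
        spin_unlock_irqrestore(&split_queue_lock, flags);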

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Acked-by: David Rientjes
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A newly added tracepoint in the hugepage code uses a variable in the
    error handling that is not initialized at that point:

    include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The result is relatively harmless, as the trace data will in rare
    cases contain incorrect data.

    This works around the problem by adding an explicit initialization.
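
    The workaround is simply an explicit initialization; roughly:

        int isolated = 0;       /* read by the tracepoint on the error path */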

    Signed-off-by: Arnd Bergmann
    Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
    Reviewed-by: Ebru Akagunduz
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

19 Jan, 2016

1 commit

  • Pull virtio barrier rework+fixes from Michael Tsirkin:
    "This adds a new kind of barrier, and reworks virtio and xen to use it.

    Plus some fixes here and there"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
    checkpatch: add virt barriers
    checkpatch: check for __smp outside barrier.h
    checkpatch.pl: add missing memory barriers
    virtio: make find_vqs() checkpatch.pl-friendly
    virtio_balloon: fix race between migration and ballooning
    virtio_balloon: fix race by fill and leak
    s390: more efficient smp barriers
    s390: use generic memory barriers
    xen/events: use virt_xxx barriers
    xen/io: use virt_xxx barriers
    xenbus: use virt_xxx barriers
    virtio_ring: use virt_store_mb
    sh: move xchg_cmpxchg to a header by itself
    sh: support 1 and 2 byte xchg
    virtio_ring: update weak barriers to use virt_xxx
    Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
    asm-generic: implement virt_xxx memory barriers
    x86: define __smp_xxx
    xtensa: define __smp_xxx
    tile: define __smp_xxx
    ...

    Linus Torvalds
     

18 Jan, 2016

1 commit

  • Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
    MADV_FREE syscall is called") introduced this new function, but got the
    error handling for when pmd_trans_huge_lock() fails wrong. In the
    failure case, the lock has not been taken, and we should not unlock on
    the way out.
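
    A sketch of the corrected error handling (shown with the ptl-returning
    pmd_trans_huge_lock() from the 22 Jan change above):

        ptl = pmd_trans_huge_lock(pmd, vma);
        if (!ptl)
                return 0;       /* lock was never taken: nothing to unlock */

        /* ... MADV_FREE handling of the huge pmd ... */

        spin_unlock(ptl);
        return 0;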

    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds