11 Feb, 2014

3 commits

  • mce-test detected a test failure when injecting an error into a thp
    tail page. This is because we take the page refcount on the tail page in
    madvise_hwpoison(), while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take the refcount on the head page.

    When a real memory error happens, we take the refcount on the head page,
    since memory_failure() is called without MF_COUNT_INCREASED set, so it
    seems to me that testing a memory error on a thp tail page using madvise
    makes little sense.

    This patch skips the refcount transfer in the !MF_COUNT_INCREASED case
    so that the test remains valid.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Cc: [3.9+: a3e0f9e47d5e]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Vladimir reported the following issue:

    Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires
    remove_partial() to be called with n->list_lock held, but free_partial()
    called from kmem_cache_close() on cache destruction does not follow this
    rule, leading to a warning:

    WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0()
    Modules linked in:
    CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1
    Hardware name:
    0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600
    0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000
    ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0
    Call Trace:
    __kmem_cache_shutdown+0x1b2/0x1f0
    kmem_cache_destroy+0x43/0xf0
    xfs_destroy_zones+0x103/0x110 [xfs]
    exit_xfs_fs+0x38/0x4e4 [xfs]
    SyS_delete_module+0x19a/0x1f0
    system_call_fastpath+0x16/0x1b

    His solution was to add a spinlock in order to quiet lockdep. Although
    there would be no contention on that lock, taking it also requires
    disabling interrupts, which has a larger impact on the system.

    Instead of adding a spinlock where it is not needed for lockdep, add a
    __remove_partial() function that does not check whether the list_lock is
    held, as no one can hold it while the cache is being destroyed.

    Also add a __add_partial() function that skips the lock validation as
    well, since it is not needed during cache creation.

    Signed-off-by: Steven Rostedt
    Reported-by: Vladimir Davydov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Commit c65c1877bd68 ("slub: use lockdep_assert_held") incorrectly
    required that add_full() and remove_full() hold n->list_lock. The lock
    is only taken when kmem_cache_debug(s), since that's the only time it
    actually does anything.

    Require that the lock only be taken under such a condition.

    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Tested-by: Paul E. McKenney
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Feb, 2014

2 commits

  • Pull vfs fixes from Al Viro:
    "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly
    old..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix a kmap leak in virtio_console
    fix O_SYNC|O_APPEND syncing the wrong range on write()

    Linus Torvalds
     
  • It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support),
    when sync_page_range() was introduced; generic_file_write{,v}() correctly
    synced
    pos_after_write - written .. pos_after_write - 1
    but generic_file_aio_write() synced
    pos_before_write .. pos_before_write + written - 1
    instead. Which is not the same thing with O_APPEND, obviously.
    A couple of years later the correct variant was killed off when
    everything switched to generic_file_aio_write().

    All users of generic_file_aio_write() are affected, and the same bug
    has been copied into other instances of ->aio_write().

    The fix is trivial; the only subtle point is that generic_write_sync()
    ought to be inlined to avoid calculations that are useless for the
    majority of calls.

    Signed-off-by: Al Viro

    Al Viro
     

09 Feb, 2014

1 commit

  • Pull x86 fixes from Peter Anvin:
    "Quite a varied little collection of fixes. Most of them are
    relatively small or isolated; the biggest one is Mel Gorman's fixes
    for TLB range flushing.

    A couple of AMD-related fixes (including not crashing when given an
    invalid microcode image) and fix a crash when compiled with gcov"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, microcode, AMD: Unify valid container checks
    x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
    x86/efi: Allow mapping BGRT on x86-32
    x86: Fix the initialization of physnode_map
    x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
    x86/intel/mid: Fix X86_INTEL_MID dependencies
    arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
    mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
    x86: mm: change tlb_flushall_shift for IvyBridge
    x86/mm: Eliminate redundant page table walk during TLB range flushing
    x86/mm: Clean up inconsistencies when flushing TLB ranges
    mm, x86: Account for TLB flushes only when debugging
    x86/AMD/NB: Fix amd_set_subcaches() parameter type
    x86/quirks: Add workaround for AMD F16h Erratum792
    x86, doc, kconfig: Fix dud URL for Microcode data

    Linus Torvalds
     

08 Feb, 2014

1 commit


07 Feb, 2014

3 commits

  • During an aio stress test, we observed the following lockdep warning.
    This means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage disables interrupts, but
    __set_page_dirty_nobuffers unintentionally enables them again.

    Generally, all helper functions should use spin_lock_irqsave() instead of
    spin_lock_irq(), because they cannot know their caller's context.

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);

    lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • swapoff clears swap_info's SWP_USED flag prematurely and frees its
    resources after that. A concurrent swapon will reuse this swap_info
    while its previous resources are not completely cleared.

    These late freed resources are:
    - p->percpu_cluster
    - swap_cgroup_ctrl[type]
    - block_device setting
    - inode->i_flags &= ~S_SWAPFILE

    This patch clears the SWP_USED flag after all its resources are freed,
    so that swapon can reuse this swap_info by alloc_swap_info() safely.

    [akpm@linux-foundation.org: tidy up code comment]
    Signed-off-by: Weijie Yang
    Acked-by: Hugh Dickins
    Cc: Krzysztof Kozlowski
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on a harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in causes
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      73921    76210    75611    76904    78191   121542
    Seq Shmem     73601    73176    73855    72947    74543   118322
    Rand Anon    895392   831243   871569   845197   846496   841680
    Rand Shmem  1058375  1053486   827935   764955   764376   756489

    SSD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      24634    24198    24673    25107    21614    70018
    Seq Shmem     24959    24932    25052    25703    22030    69678
    Rand Anon     43014    26146    28075    25989    26935    25901
    Rand Shmem    45349    45215    28249    24268    24138    24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Tests show Vanilla is slightly better than Hugh's patch in a sequential
    workload. I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in a sequential
    workload if there is no hit, and in such a case continuing readahead is
    actually beneficial.

    I don't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee that sequentially accessed pages are
    swapped out sequentially. So I slightly changed Hugh's heuristic - don't
    shrink the readahead size too fast.

    Here is my test result (unit: seconds, average of 3 runs):

            Vanilla  Hugh   New
    Seq         356   370   360
    Random     4525  2447  2444

    The attached graph shows the swapin/swapout throughput I collected with
    'vmstat 2'. The first part is running a random workload (till around 1200
    on the x-axis) and the second part is running a sequential workload.
    Swapin and swapout throughput are almost identical in steady state in
    both workloads. This is the expected behavior, while in Vanilla swapin
    is much bigger than swapout, especially in the random workload (because
    of wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

03 Feb, 2014

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Random bug fixes that have accumulated in my inbox over the past few
    months"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: Fix warning on make htmldocs caused by slab.c
    mm: slub: work around unneeded lockdep warning
    mm: sl[uo]b: fix misleading comments
    slub: Fix possible format string bug.
    slub: use lockdep_assert_held
    slub: Fix calculation of cpu slabs
    slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings

    Linus Torvalds
     

31 Jan, 2014

9 commits

  • This patch fixes the following errors seen while running make htmldocs:
    Warning(/mm/slab.c:1956): No description found for parameter 'page'
    Warning(/mm/slab.c:1956): Excess function parameter 'slabp' description in 'slab_destroy'

    The incorrect function parameter "slabp" was documented instead of
    "page".

    Acked-by: Christoph Lameter
    Signed-off-by: Masanari Iida
    Signed-off-by: Pekka Enberg

    Masanari Iida
     
  • The slub code does some setup during early boot in
    early_kmem_cache_node_alloc() with some local data. There is no
    possible way that another CPU can see this data, so the slub code
    does not bother locking it. However, some new lockdep asserts
    check to make sure that add_partial() _always_ has the list_lock
    held.

    Just add the locking, even though it is technically unnecessary.

    Cc: Peter Zijlstra
    Cc: Russell King
    Acked-by: David Rientjes
    Signed-off-by: Dave Hansen
    Signed-off-by: Pekka Enberg

    Dave Hansen
     
  • Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a
    mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
    holds the memcg name. It failed to unlock the mutex if this buffer
    could not be allocated.

    This patch fixes the issue by appropriately unlocking the mutex if the
    allocation fails.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A 3% of system memory bonus is sometimes too excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus.

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to 179, and the 10G fork bomb's 262144 points down to 183500 points
    and make the right choice, instead of discounting both to 0 and killing
    agetty because it's first in the task list.

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit abca7c496584 ("mm: fix slab->page _count corruption when using
    slub") notes that we can not _set_ a page->counters directly, except
    when using a real double-cmpxchg. Doing so can lose updates to
    ->_count.

    That is an absolute rule:

    You may not *set* page->counters except via a cmpxchg.

    Commit abca7c496584 fixed this for the folks who have the slub
    cmpxchg_double code turned off at compile time, but it left the bad case
    alone. It can still be reached, and the same bug can be triggered in
    two cases:

    1. Turning on slub debugging at runtime, which is available on
       the distro kernels that I looked at.
    2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
       CPUs, evidently)

    There are at least 3 ways we could fix this:

    1. Take all of the existing calls to cmpxchg_double_slab() and
       __cmpxchg_double_slab() and convert them to take an old, new
       and target 'struct page'.
    2. Do (1), but with the newly-introduced 'slub_data'.
    3. Do some magic inside the two cmpxchg...slab() functions to
       pull the counters out of new_counters and only set those
       fields in page->{inuse,frozen,objects}.

    I've done (2) as well, but it's a bunch more code. This patch is an
    attempt at (3). This was the most straightforward and foolproof way
    that I could think of to do this.

    This would also technically allow us to get rid of the ugly

    #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

    in 'struct page', but leaving it alone has the added benefit that
    'counters' stays 'unsigned' instead of 'unsigned long', so all the
    copies that the slub code does stay a bit smaller.

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Pravin B Shelar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference
    policy"), /proc/<pid>/numa_maps prints the mempolicy for any <vma> as
    "prefer:N" for the local node, N, of the process reading the file.

    This should only be printed when the mempolicy of <vma> is
    MPOL_PREFERRED for node N.

    If the process is actually only using the default mempolicy for local
    node allocation, make sure "default" is printed as expected.

    Signed-off-by: David Rientjes
    Reported-by: Robert Lippert
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add my copyright to the zsmalloc source code which I maintain.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch moves zsmalloc under the mm directory.

    Before that, a description will explain why we have needed a custom
    allocator.

    Zsmalloc is a new slab-based memory allocator for storing compressed
    pages. It is designed for low fragmentation and a high allocation
    success rate on large objects.
    Acked-by: Nitin Gupta
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Bob Liu
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_vec series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

30 Jan, 2014

8 commits

  • In the original bootmem wrapper for memblock, we had limit checking.

    Add it to memblock_virt_alloc to address ARM and x86 boot crashes.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Reported-by: Kevin Hilman
    Tested-by: Kevin Hilman
    Reported-by: Olof Johansson
    Tested-by: Olof Johansson
    Reported-by: Konrad Rzeszutek Wilk
    Tested-by: Konrad Rzeszutek Wilk
    Cc: Dave Hansen
    Cc: Santosh Shilimkar
    Cc: "Strashko, Grygorii"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for
    ->readpage") unintentionally made do_readahead return 0 for all valid
    files regardless of whether readahead was supported, rather than the
    expected -EINVAL. This gets forwarded on to userspace, and results in
    sys_readahead appearing to succeed in cases that don't make sense (e.g.
    when called on pipes or sockets). This issue is detected by the LTP
    readahead01 testcase.

    As the exact return value of force_page_cache_readahead is currently
    never used, we can simplify it to return only 0 or -EINVAL (when
    readpage or readpages is missing). With that in place we can simply
    forward on the return value of force_page_cache_readahead in
    do_readahead.

    This patch performs said change, restoring the expected semantics.

    Signed-off-by: Mark Rutland
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     
  • Commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using
    VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls.

    But, most of the ones in the slub code are for _temporary_ 'struct
    page's which are declared on the stack and likely have lots of gunk in
    them. Dumping their contents out will just confuse folks looking at
    bad_page() output. Plus, if we try to page_to_pfn() on them or
    something, we'll probably oops anyway.

    Turn them back into VM_BUG_ON()s.

    Signed-off-by: Dave Hansen
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
    new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or pointing to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if we are failing creating a cache for a memcg, we will
    skip the check and return 's' that can contain crap. Obviously, commit
    f717eb3abb5e intended not to return crap on error allocating a cache for
    a memcg, but only to remove the error reporting in this case, so the
    check should look like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg)
                            return NULL;
                    /* report error */
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • A few printk(KERN_*'s have snuck in there.

    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The command line parsing takes place before jump labels are initialised
    which generates a warning if numa_balancing= is specified and
    CONFIG_JUMP_LABEL is set.

    On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage
    before jump_label_init was called") the kernel would have crashed. This
    patch enables automatic numa balancing later in the initialisation
    process if numa_balancing= is specified.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: stable
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tejun reported stuttering and latency spikes on a system where random
    tasks would enter direct reclaim and get stuck on dirty pages. Around
    50% of memory was occupied by tmpfs backed by an SSD, and another disk
    (rotating) was reading and writing at max speed to shrink a partition.

    : The problem was pretty ridiculous. It's a 8gig machine w/ one ssd and 10k
    : rpm harddrive and I could reliably reproduce constant stuttering every
    : several seconds for as long as buffered IO was going on on the hard drive
    : either with tmpfs occupying somewhere above 4gig or a test program which
    : allocates about the same amount of anon memory. Although swap usage was
    : zero, turning off swap also made the problem go away too.
    :
    : The trigger conditions seem quite plausible - high anon memory usage w/
    : heavy buffered IO and swap configured - and it's highly likely that this
    : is happening in the wild too. (this can happen with copying large files
    : to usb sticks too, right?)

    This patch (of 2):

    The dirty_balance_reserve is an approximation of the fraction of free
    pages that the page allocator does not make available for page cache
    allocations. As a result, it has to be taken into account when
    calculating the amount of "dirtyable memory", the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    However, currently the reserve is subtracted from the sum of free and
    reclaimable pages, which is nonsensical and leads to erroneous results
    when the system is dominated by unreclaimable pages and the
    dirty_balance_reserve is bigger than free+reclaimable. In that case, at
    least the already allocated cache should be considered dirtyable.

    Fix the calculation by subtracting the reserve from the amount of free
    pages, then adding the reclaimable pages on top.

    [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Jan, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
    assorted cleanups and fixes all over the place...

    There will be another pile later this week"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
    __dentry_path() fixes
    vfs: Remove second variable named error in __dentry_path
    vfs: Is mounted should be testing mnt_ns for NULL or error.
    Fix race when checking i_size on direct i/o read
    hfsplus: remove can_set_xattr
    nfsd: use get_acl and ->set_acl
    fs: remove generic_acl
    nfs: use generic posix ACL infrastructure for v3 Posix ACLs
    gfs2: use generic posix ACL infrastructure
    jfs: use generic posix ACL infrastructure
    xfs: use generic posix ACL infrastructure
    reiserfs: use generic posix ACL infrastructure
    ocfs2: use generic posix ACL infrastructure
    jffs2: use generic posix ACL infrastructure
    hfsplus: use generic posix ACL infrastructure
    f2fs: use generic posix ACL infrastructure
    ext2/3/4: use generic posix ACL infrastructure
    btrfs: use generic posix ACL infrastructure
    fs: make posix_acl_create more useful
    fs: make posix_acl_chmod more useful
    ...

    Linus Torvalds
     

28 Jan, 2014

7 commits

  • Merge misc updates from Andrew Morton:

    - a few hotfixes

    - dynamic-debug updates

    - ipc updates

    - various other sweepings off the factory floor

    * akpm: (31 commits)
    firmware/google: drop 'select EFI' to avoid recursive dependency
    compat: fix sys_fanotify_mark
    checkpatch.pl: check for function declarations without arguments
    mm/migrate.c: fix setting of cpupid on page migration twice against normal page
    softirq: use const char * const for softirq_to_name, whitespace neatening
    softirq: convert printks to pr_
    softirq: use ffs() in __do_softirq()
    kernel/kexec.c: use vscnprintf() instead of vsnprintf() in vmcoreinfo_append_str()
    splice: fix unexpected size truncation
    ipc: fix compat msgrcv with negative msgtyp
    ipc,msg: document barriers
    ipc: delete seq_max field in struct ipc_ids
    ipc: simplify sysvipc_proc_open() return
    ipc: remove useless return statement
    ipc: remove braces for single statements
    ipc: standardize code comments
    ipc: whitespace cleanup
    ipc: change kern_ipc_perm.deleted type to bool
    ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races
    ipc/sem.c: avoid overflow of semop undo (semadj) value
    ...

    Linus Torvalds
     
  • Pull powerpc updates from Ben Herrenschmidt:
    "So here's my next branch for powerpc. A bit late as I was on vacation
    last week. It's mostly the same stuff that was in next already, I
    just added two patches today which are the wiring up of lockref for
    powerpc, which for some reason fell through the cracks last time and
    is trivial.

    The highlights are, in addition to a bunch of bug fixes:

    - Reworked Machine Check handling on kernels running without a
    hypervisor (or acting as a hypervisor). Provides hooks to handle
    some errors in real mode such as TLB errors, handle SLB errors,
    etc...

    - Support for retrieving memory error information from the service
    processor on IBM servers running without a hypervisor and routing
    them to the memory poison infrastructure.

    - _PAGE_NUMA support on server processors

    - 32-bit BookE relocatable kernel support

    - FSL e6500 hardware tablewalk support

    - A bunch of new/revived board support

    - FSL e6500 deeper idle states and altivec powerdown support

    You'll notice a generic mm change here, it has been acked by the
    relevant authorities and is a pre-req for our _PAGE_NUMA support"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
    powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
    powerpc: Add support for the optimised lockref implementation
    powerpc/powernv: Call OPAL sync before kexec'ing
    powerpc/eeh: Escalate error on non-existing PE
    powerpc/eeh: Handle multiple EEH errors
    powerpc: Fix transactional FP/VMX/VSX unavailable handlers
    powerpc: Don't corrupt transactional state when using FP/VMX in kernel
    powerpc: Reclaim two unused thread_info flag bits
    powerpc: Fix races with irq_work
    Move precessing of MCE queued event out from syscall exit path.
    pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
    powerpc: Make add_system_ram_resources() __init
    powerpc: add SATA_MV to ppc64_defconfig
    powerpc/powernv: Increase candidate fw image size
    powerpc: Add debug checks to catch invalid cpu-to-node mappings
    powerpc: Fix the setup of CPU-to-Node mappings during CPU online
    powerpc/iommu: Don't detach device without IOMMU group
    powerpc/eeh: Hotplug improvement
    powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
    powerpc/eeh: Add restore_config operation
    ...

    Linus Torvalds
     
  • Pull powerpc mremap fix from Ben Herrenschmidt:
    "This is the patch that I had sent after -rc8 and which we decided to
    wait before merging. It's based on a different tree than my -next
    branch (it needs some pre-reqs that were in -rc4 or so while my -next
    is based on -rc1) so I left it as a separate branch for you to pull.
    It's identical to the request I did 2 or 3 weeks back.

    This fixes crashes in mremap with THP on powerpc.

    The fix however requires a small change in the generic code. It moves
    a condition into a helper we can override from the arch which is
    harmless, but it *also* slightly changes the order of the set_pmd and
    the withdraw & deposit, which should be fine according to Kirill (who
    wrote that code) but I agree -rc8 is a bit late...

    It was acked by Kirill and Andrew told me to just merge it via powerpc"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/thp: Fix crash on mremap

    Linus Torvalds
     
  • Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
    over the cpupid at page migration time. It is unnecessary to set it
    again in alloc_misplaced_dst_page().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Commit da29bd36224b ("mm/mm_init.c: make creation of the mm_kobj happen
    earlier than device_initcall") changed to pure_initcall(mm_sysfs_init).

    That's too early: mm_sysfs_init() depends on core_initcall(ksysfs_init)
    to have made the kernel_kobj directory "kernel" in which to create "mm".

    Make it postcore_initcall(mm_sysfs_init). We could use core_initcall(),
    and depend upon Makefile link order kernel/ mm/ fs/ ipc/ security/ ...
    as core_initcall(debugfs_init) and core_initcall(securityfs_init) do;
    but better not.

    Signed-off-by: Hugh Dickins
    Acked-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
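    For context on why postcore_initcall() is the right level: initcall
    levels execute in a fixed order at boot, and ksysfs_init(), which
    creates kernel_kobj ("/sys/kernel"), runs at core_initcall time. A
    sketch of the relevant ordering (levels per include/linux/init.h; not
    the full diff):

    ```c
    /*
     * Boot-time initcall levels, in execution order:
     *   pure_initcall (0) -> core_initcall (1) -> postcore_initcall (2)
     *   -> arch_initcall (3) -> subsys_initcall (4) -> ...
     *
     * ksysfs_init() registers kernel_kobj at core_initcall time, so
     * mm_sysfs_init() must run no earlier than postcore_initcall for
     * the "mm" kobject to have a parent directory:
     */
    postcore_initcall(mm_sysfs_init);
    ```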
     
  • Revert commit ece86e222db4, which was intended as a small performance
    improvement.

    Despite the claim that the patch doesn't introduce any functional
    changes, in fact it does.

    The "no page" path behaves differently now. Originally, vmalloc_to_page
    might return NULL under some conditions; with the new implementation it
    returns pfn_to_page(0), which is not the same as NULL.

    Simple test shows the difference.

    test.c

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    int __init myi(void)
    {
        struct page *p;
        void *v;

        v = vmalloc(PAGE_SIZE);
        /* trigger the "no page" path in vmalloc_to_page */
        vfree(v);

        p = vmalloc_to_page(v);

        pr_err("expected val = NULL, returned val = %p", p);

        return -EBUSY;
    }

    void __exit mye(void)
    {
    }
    module_init(myi)
    module_exit(mye)

    Before interchange:
    expected val = NULL, returned val = (null)

    After interchange:
    expected val = NULL, returned val = c7ebe000

    Signed-off-by: Vladimir Murzin
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    malc
     
    In the original __alloc_memory_core_early() bootmem wrapper, we did not
    align the size silently.

    We should not do that here either, as a later free with the old size
    will leave part of the range unfreed.

    The offending code was evidently copied from memblock_alloc_base_nid(),
    and that code is wrong for the same reason.

    Also remove the silent alignment in memblock_alloc_base().

    Signed-off-by: Yinghai Lu
    Acked-by: Santosh Shilimkar
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Konrad Rzeszutek Wilk
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
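    The failure mode described in the entry above can be sketched in
    userspace (a hypothetical allocator model, not kernel code): if the
    allocator silently rounds the reservation up to the alignment but the
    caller later frees with the original size, the rounded-up tail is never
    returned:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Round x up to the next multiple of align (align must be a power of 2),
     * mirroring what a silently-aligning allocator would reserve. */
    static uint64_t round_up(uint64_t x, uint64_t align)
    {
        return (x + align - 1) & ~(align - 1);
    }

    int main(void)
    {
        uint64_t requested = 100;   /* bytes the caller asked for */
        uint64_t align = 64;

        /* allocator silently reserves a rounded-up region... */
        uint64_t reserved = round_up(requested, align);
        /* ...but the caller frees using the size it originally passed */
        uint64_t freed = requested;

        printf("reserved=%llu freed=%llu leaked=%llu\n",
               (unsigned long long)reserved,
               (unsigned long long)freed,
               (unsigned long long)(reserved - freed));
        return 0;
    }
    ```

    With a 100-byte request and 64-byte alignment, 128 bytes are reserved
    but only 100 freed, leaving a 28-byte tail permanently allocated.
    
    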
     

26 Jan, 2014

2 commits

  • So far I've had one ACK for this, and no other comments. So I think it
    is probably time to send this via some suitable tree. I'm guessing that
    the vfs tree would be the most appropriate route, but not sure that
    there is one at the moment (don't see anything recent at kernel.org)
    so in that case I think -mm is the "back up plan". Al, please let me
    know if you will take this.

    Steve.

    ---------------------

    Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
    reads with append dio writes" thread on linux-fsdevel, this patch is my
    current version of the fix proposed as option (b) in that thread.

    Removing the i_size test from the direct i/o read path at vfs level
    means that filesystems now have to deal with requests which are beyond
    i_size themselves. These I've divided into three sets:

    a) Those with "no op" ->direct_IO (9p, cifs, ceph)
    These are obviously not going to be an issue

    b) Those with "home brew" ->direct_IO (nfs, fuse)
    I've been told that NFS should not have any problem with the larger
    i_size, however I've added an extra test to FUSE to duplicate the
    original behaviour just to be on the safe side.

    c) Those using __blockdev_direct_IO()
    These call through to ->get_block() which should deal with the EOF
    condition correctly. I've verified that with GFS2 and I believe that
    Zheng has verified it for ext4. I've also run the test on XFS and it
    passes both before and after this change.

    The part of the patch in filemap.c looks a lot larger than it really is
    - there are only two lines of real change. The rest is just indentation
    of the contained code.

    There remains a test of i_size though, which was added for btrfs. It
    doesn't cause the other filesystems a problem as the test is performed
    after ->direct_IO has been called. It is possible that there is a race
    that does matter to btrfs, however this patch doesn't change that, so
    it's still an overall improvement.

    Signed-off-by: Steven Whitehouse
    Reported-by: Zheng Liu
    Cc: Jan Kara
    Cc: Dave Chinner
    Acked-by: Miklos Szeredi
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    Steven Whitehouse
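    The invariant the filesystems must now uphold themselves can be
    illustrated in userspace. This demo uses buffered I/O rather than
    O_DIRECT (a portable O_DIRECT test would need filesystem support and
    aligned buffers), but shows the same contract: a read entirely beyond
    i_size must report EOF (0), not stale data:

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char path[] = "/tmp/eof_demo_XXXXXX";
        int fd = mkstemp(path);
        if (fd < 0) { perror("mkstemp"); return 1; }

        /* i_size becomes 5 bytes */
        if (write(fd, "hello", 5) != 5) { perror("write"); return 1; }

        /* a read wholly beyond i_size must return 0 (EOF) */
        char buf[64];
        ssize_t n = pread(fd, buf, sizeof(buf), 4096);
        printf("bytes_read=%zd\n", n);

        close(fd);
        unlink(path);
        return n == 0 ? 0 : 1;
    }
    ```
    
    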
     
  • And instead convert tmpfs to use the new generic ACL code, with two stub
    methods provided for in-memory filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

25 Jan, 2014

2 commits

  • Merge in the x86 changes to apply a fix.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") to cause overhead problems.

    The counters are undeniably useful, but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere, so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar

    Mel Gorman
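    The resulting pattern makes the accounting compile away unless a debug
    Kconfig option is set (on x86 the symbol used is CONFIG_DEBUG_TLBFLUSH).
    A sketch of the idea, not the exact merged diff:

    ```c
    /* TLB flush accounting costs nothing unless debugging is enabled */
    #ifdef CONFIG_DEBUG_TLBFLUSH
    #define count_vm_tlb_event(x) count_vm_event(x)
    #else
    #define count_vm_tlb_event(x) do {} while (0)
    #endif
    ```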