06 Nov, 2015

40 commits

  • A few lines below, the object is reinitialized by lookup_object(), so
    we don't need to initialize it to NULL at the beginning of
    find_and_get_object().

    Signed-off-by: Alexey Klimov
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc
    configurations defining ARCH_DMA_MINALIGN to 128), the first
    kmalloc_caches[] entry to be initialised after slab_early_init = 0 is
    "kmalloc-128" with index 7. Depending on the debug kernel configuration,
    sizeof(struct kmem_cache) can be larger than 128 resulting in an
    INDEX_NODE of 8.

    Commit 8fc9cf420b36 ("slab: make more slab management structure off the
    slab") enables off-slab management objects for sizes starting with
    PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation
    of the "kmalloc-128" cache would try to place the management objects
    off-slab. However, since KMALLOC_MIN_SIZE is already 128 and
    freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size)
    returns NULL (kmalloc_caches[7] not populated yet). This triggers the
    following bug on arm64:

    kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
    Hardware name: Juno (DT)
    PC is at __kmem_cache_create+0x21c/0x280
    LR is at __kmem_cache_create+0x210/0x280
    [...]
    Call trace:
    __kmem_cache_create+0x21c/0x280
    create_boot_cache+0x48/0x80
    create_kmalloc_cache+0x50/0x88
    create_kmalloc_caches+0x4c/0xf4
    kmem_cache_init+0x100/0x118
    start_kernel+0x214/0x33c

    This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab
    management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.
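
    A minimal sketch of the fix described above (the exact definition and
    surrounding condition in mm/slab.c may differ slightly):

    /*
     * Keep freelist management on-slab for caches no larger than the
     * smallest kmalloc cache, which may not be populated yet at this
     * point during early boot.
     */
    #define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1))

            /* in __kmem_cache_create(), replacing the "size >= PAGE_SIZE >> 5" test */
            if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init)
                    flags |= CFLGS_OFF_SLAB;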

    Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
    Signed-off-by: Catalin Marinas
    Reported-by: Geert Uytterhoeven
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • In slub_order(), the order starts from max(min_order,
    get_order(min_objects * size)). When (min_objects * size) has a
    different order from (min_objects * size + reserved), a check in the
    loop is needed to skip that order.

    This patch optimizes this a little by calculating the start order with
    `reserved' taken into consideration, and removes the check from the
    loop.
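
    A simplified sketch of the resulting search (not the exact mm/slub.c
    code):

    static int slab_order(int size, int min_objects, int max_order,
                          int fract_leftover, int reserved)
    {
            int order;

            /* the start order already accounts for `reserved', so no
             * in-loop check for too-small orders is needed */
            for (order = max(slub_min_order,
                             (int)get_order(min_objects * size + reserved));
                 order <= max_order; order++) {
                    unsigned long slab_size = PAGE_SIZE << order;
                    unsigned long rem = (slab_size - reserved) % size;

                    if (rem <= slab_size / fract_leftover)
                            break;
            }
            return order;
    }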

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • get_order() is easier to understand.

    This patch simply switches to it.

    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In calculate_order(), it tries to calculate the best order by adjusting
    the fraction and min_objects. On each iteration over min_objects, the
    fraction iterates over 16, 8, 4, which means the acceptable waste
    increases from 1/16 to 1/8 to 1/4.

    This patch corrects the comment to match the code.
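
    A simplified sketch of the loop the comment describes:

    while (min_objects > 0) {
            int fraction = 16;

            /* the acceptable waste grows: 1/16, then 1/8, then 1/4 */
            while (fraction >= 4) {
                    order = slab_order(size, min_objects, slub_max_order,
                                       fraction, reserved);
                    if (order <= slub_max_order)
                            return order;
                    fraction /= 2;
            }
            min_objects--;
    }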

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The assignment to NULL within the error condition was written in a 2014
    patch to suppress a compiler warning. However, it is cleaner to just
    initialize the kmem_cache pointer to NULL and return it as-is in the
    error case.
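
    A minimal sketch of the pattern (the helper name is invented for
    illustration):

    struct kmem_cache *s = NULL;    /* initialized once, up front */
    int err;

    err = setup_cache(&s);          /* hypothetical helper */
    if (err)
            goto out_unlock;        /* no "s = NULL;" needed here anymore */
    out_unlock:
            return s;               /* on error, s is still NULL */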

    Signed-off-by: Alexandru Moise
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • Add documentation on how to use slabinfo-gnuplot.sh script.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • GNUplot `slabinfo -X' stats collected, for example, using the
    following command:
    while [ 1 ]; do slabinfo -X >> stats; sleep 1; done

    `slabinfo-gnuplot.sh stats' pre-processes the collected records
    and generates graphs (totals, slabs sorted by size, slabs
    sorted by loss).

    Graphs can be [individually] regenerated with a different sample
    range and graph width/height (-r %d,%d and -s %d,%d options).

    To visually compare N `totals' graphs:
    slabinfo-gnuplot.sh -t FILE1-totals FILE2-totals ... FILEN-totals

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • checkpatch.pl complains about globals being explicitly zeroed
    out: "ERROR: do not initialise globals to 0 or NULL".

    New globals, introduced in this patch set, have no explicit 0
    initialization; clean up the old ones to make it less hairy.

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce "-B|--Bytes" opt to disable store_size() dynamic
    size scaling and report size in bytes instead.

    This `expands' the interface a bit, it's impossible to use
    printf("%6s") anymore to output sizes.

    Example:

    slabinfo -X -N 2

    Slabcache Totals
    ----------------
    Slabcaches :  91        Aliases  : 119->69    Active:  63
    Memory used: 199798784  # Loss   : 10689376   MRatio:  5%
    # Objects  : 324301     # PartObj: 18151      ORatio:  5%

    Per Cache      Average        Min        Max      Total
    --------------------------------------------------------
    #Objects          5147          1      89068     324301
    #Slabs             199          1       3886      12537
    #PartSlab           12          0        240        778
    %PartSlab          32%         0%       100%         6%
    PartObjs             5          0       4569      18151
    % PartObj          26%         0%       100%         5%
    Memory         3171409       8192  127336448  199798784
    Used           3001736        160  121429728  189109408
    Loss            169672          0    5906720   10689376

    Per Object     Average        Min        Max
    ---------------------------------------------
    Memory             585          8       8192
    User               583          8       8192
    Loss                 2          0         64

    Slabs sorted by size
    --------------------
    Name              Objects  Objsize      Space  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache    69948     1736  127336448       3871/0/15   18  3    0   95  a
    dentry              89068      288   26058752       3164/0/17   28  1    0   98  a

    Slabs sorted by loss
    --------------------
    Name              Objects  Objsize       Loss  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache    69948     1736    5906720       3871/0/15   18  3    0   95  a
    inode_cache         11628      864     537472         642/0/4   18  2    0   94  a

    Besides, store_size() does not use powers of two for G/M/K:

    if (value > 1000000000UL) {
            divisor = 100000000UL;
            trailer = 'G';
    } else if (value > 1000000UL) {
            divisor = 100000UL;
            trailer = 'M';
    } else if (value > 1000UL) {
            divisor = 100;
            trailer = 'K';
    }

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Add "-X|--Xtotals" opt to output extended totals summary,
    which includes:
    -- totals summary
    -- slabs sorted by size
    -- slabs sorted by loss (waste)

    Example:
    =======

    slabinfo -X -N 1

    Slabcache Totals
    ----------------
    Slabcaches :  91      Aliases  : 120->69   Active:  65
    Memory used: 568.3M   # Loss   : 30.4M     MRatio:  5%
    # Objects  : 920.1K   # PartObj: 161.2K    ORatio: 17%

    Per Cache      Average        Min        Max      Total
    --------------------------------------------------------
    #Objects         14.1K          1     227.8K     920.1K
    #Slabs             533          1      11.7K      34.7K
    #PartSlab           86          0       4.3K       5.6K
    %PartSlab          24%         0%       100%        16%
    PartObjs            17          0     129.3K     161.2K
    % PartObj          17%         0%       100%        17%
    Memory            8.7M       8.1K     384.7M     568.3M
    Used              8.2M        160     366.5M     537.9M
    Loss            468.8K          0      18.2M      30.4M

    Per Object     Average        Min        Max
    ---------------------------------------------
    Memory             587          8       8.1K
    User               584          8       8.1K
    Loss                 2          0         64

    Slabs sorted by size
    --------------------
    Name              Objects  Objsize   Space  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache   211142     1736  384.7M     11732/40/10   18  3    0   95  a

    Slabs sorted by loss
    --------------------
    Name              Objects  Objsize    Loss  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache   211142     1736   18.2M     11732/40/10   18  3    0   95  a

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Fix mismatches between usage() output and real opts[] options. Add
    missing alternative opt names, e.g., '-S' had no '--Size' opts[] entry,
    etc.

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce opt "-L|--sort-loss" to sort and output slabs by
    loss (waste) in slabcache().

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce opt "-N|--lines=K" to limit the number of slabs
    being reported in output_slabs().

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patchset adds an 'extended' slabinfo mode that provides additional
    information:

    -- totals summary
    -- slabs sorted by size
    -- slabs sorted by loss (waste)

    The patches also introduce several new slabinfo options to limit the
    number of slabs reported and to sort slabs by loss (waste), along with
    some fixes.

    Extended output example (slabinfo -X -N 2):

    Slabcache Totals
    ----------------
    Slabcaches :  91        Aliases  : 119->69    Active:  63
    Memory used: 199798784  # Loss   : 10689376   MRatio:  5%
    # Objects  : 324301     # PartObj: 18151      ORatio:  5%

    Per Cache      Average        Min        Max      Total
    --------------------------------------------------------
    #Objects          5147          1      89068     324301
    #Slabs             199          1       3886      12537
    #PartSlab           12          0        240        778
    %PartSlab          32%         0%       100%         6%
    PartObjs             5          0       4569      18151
    % PartObj          26%         0%       100%         5%
    Memory         3171409       8192  127336448  199798784
    Used           3001736        160  121429728  189109408
    Loss            169672          0    5906720   10689376

    Per Object     Average        Min        Max
    ---------------------------------------------
    Memory             585          8       8192
    User               583          8       8192
    Loss                 2          0         64

    Slabs sorted by size
    --------------------
    Name              Objects  Objsize      Space  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache    69948     1736  127336448       3871/0/15   18  3    0   95  a
    dentry              89068      288   26058752       3164/0/17   28  1    0   98  a

    Slabs sorted by loss
    --------------------
    Name              Objects  Objsize       Loss  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
    ext4_inode_cache    69948     1736    5906720       3871/0/15   18  3    0   95  a
    inode_cache         11628      864     537472         642/0/4   18  2    0   94  a

    The last patch in the series addresses Linus' comment from
    http://marc.info/?l=linux-mm&m=144148518703321&w=2

    (well, it's been some time. sorry.)

    The gnuplot script takes a slabinfo records file, where every record is
    a `slabinfo -X' output. So the basic workflow is, for example, as
    follows:

    while [ 1 ]; do slabinfo -X -N 2 >> stats; sleep 1; done
    ^C
    slabinfo-gnuplot.sh stats

    The last command will produce 3 png files (and 3 stats files):
    -- graph of slabinfo totals
    -- graph of slabs by size
    -- graph of slabs by loss

    It's also possible to select a range of records for plotting (a range
    of collected slabinfo outputs) via `-r 10,100' (for example), and to
    compare totals from several measurements (to visually compare slab
    behaviour over, say, the 10,50 range) using pre-parsed totals files:
    slabinfo-gnuplot.sh -r 10,50 -t stats-totals1 .. stats-totals2

    This also, technically, supports ktest: upload the new slabinfo to the
    target, collect the stats, and give the resulting stats file to
    slabinfo-gnuplot.

    This patch (of 8):

    Use getopt constants in `struct option' ->has_arg instead of numerical
    representations.
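
    For illustration, the kind of change involved (option names taken from
    the patches above):

    static struct option opts[] = {
            { "Bytes", no_argument,       NULL, 'B' }, /* ->has_arg was 0 */
            { "lines", required_argument, NULL, 'N' }, /* ->has_arg was 1 */
            { NULL,    0,                 NULL, 0 }
    };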

    Signed-off-by: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Currently, when kmem_cache_destroy() is called for a global cache, we
    print a warning for each per memcg cache attached to it that has active
    objects (see shutdown_cache). This is redundant, because it gives no new
    information and only clutters the log. If a cache being destroyed has
    active objects, there must be a memory leak in the module that created the
    cache, and it does not matter if the cache was used by users in memory
    cgroups or not.

    This patch moves the warning from shutdown_cache(), which is called for
    shutting down both global and per memcg caches, to kmem_cache_destroy(),
    so that the warning is only printed once if there are objects left in the
    cache being destroyed.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).
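
    An illustrative fragment of the intent (array access simplified; not
    the exact code):

    /* when a per memcg cache is released, unlink it from the root cache's
     * array so a stale pointer cannot survive into a later lookup */
    root->memcg_params.memcg_caches->entries[idx] = NULL;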

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • do_kmem_cache_create(), do_kmem_cache_shutdown(), and
    do_kmem_cache_release() sound awkward for static helper functions that are
    not supposed to be used outside slab_common.c. Rename them to
    create_cache(), shutdown_cache(), and release_caches(), respectively.
    This patch is a pure cleanup and does not introduce any functional
    changes.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The patch "slab.h: sprinkle __assume_aligned attributes" causes *tons* of
    whinges if you do 'make C=2' with sparse 0.5.0:

    CHECK drivers/media/usb/pwc/pwc-if.c
    include/linux/slab.h:307:43: error: attribute '__assume_aligned__': unknown attribute
    include/linux/slab.h:308:58: error: attribute '__assume_aligned__': unknown attribute
    include/linux/slab.h:337:73: error: attribute '__assume_aligned__': unknown attribute
    include/linux/slab.h:375:74: error: attribute '__assume_aligned__': unknown attribute
    include/linux/slab.h:378:80: error: attribute '__assume_aligned__': unknown attribute

    sparse apparently pretends to be gcc >= 4.9, yet isn't prepared to handle
    all the function attributes supported by those gccs and complains loudly.
    So hide the definition of __assume_aligned from it (so that the generic
    one in compiler.h gets used).
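
    A sketch of the fix (guard placement in compiler-gcc.h assumed):

    /* sparse defines __CHECKER__; hide the gcc-specific definition from it
     * so that the empty generic fallback in compiler.h is used instead */
    #if GCC_VERSION >= 40900 && !defined(__CHECKER__)
    #define __assume_aligned(a, ...) \
            __attribute__((__assume_aligned__(a, ## __VA_ARGS__)))
    #endif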

    Signed-off-by: Rasmus Villemoes
    Reported-by: Valdis Kletnieks
    Tested-By: Valdis Kletnieks
    Cc: Christopher Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • gcc 4.9 added the function attribute assume_aligned, indicating to the
    caller that the returned pointer may be assumed to have a certain minimal
    alignment. This is useful if, for example, the return value is passed to
    memset(). Add a shorthand macro for that.
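
    A hypothetical usage example of the underlying attribute:

    #include <string.h>

    /* promises gcc that pointers returned by this (made-up) allocator are
     * at least 64-byte aligned, so memset() at call sites can be compiled
     * with aligned stores */
    void *alloc_block64(size_t n) __attribute__((__assume_aligned__(64)));

    void zero_block(size_t n)
    {
            void *p = alloc_block64(n);
            memset(p, 0, n);
    }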

    Signed-off-by: Rasmus Villemoes
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • slab_is_available() is a good candidate to return a boolean result.

    Signed-off-by: Denis Kirjanov
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     
  • Theoretically it is possible that the watchdog timer expires right at the
    time when a user sets 'watchdog_thresh' to zero (note: this disables the
    lockup detectors). In this scenario, the is_softlockup() function - which
    is called by the timer - could produce a false positive.

    Fix this by checking the current value of 'watchdog_thresh'.

    Signed-off-by: Ulrich Obergfell
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • watchdog_{park|unpark}_threads() are now called in code paths that protect
    themselves against CPU hotplug, so {get|put}_online_cpus() calls are
    redundant and can be removed.

    Signed-off-by: Ulrich Obergfell
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The handler functions for watchdog parameters in /proc/sys/kernel do not
    protect themselves against races with CPU hotplug. Hence, theoretically
    it is possible that a new watchdog thread is started on a hotplugged CPU
    while a parameter is being modified, and the thread could thus use a
    parameter value that is 'in transition'.

    For example, if 'watchdog_thresh' is being set to zero (note: this
    disables the lockup detectors) the thread would erroneously use the value
    zero as the sample period.

    To avoid such races and to keep the /proc handler code consistent,
    call
    {get|put}_online_cpus() in proc_watchdog_common()
    {get|put}_online_cpus() in proc_watchdog_thresh()
    {get|put}_online_cpus() in proc_watchdog_cpumask()
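
    A minimal sketch of the resulting pattern (handler body simplified; the
    helper name is invented):

    static int proc_watchdog_common(int which, struct ctl_table *table,
                                    int write, void __user *buffer,
                                    size_t *lenp, loff_t *ppos)
    {
            int err;

            get_online_cpus();              /* block CPU hotplug */
            mutex_lock(&watchdog_proc_mutex);
            err = handle_watchdog_param(which, table, write, buffer,
                                        lenp, ppos);
            mutex_unlock(&watchdog_proc_mutex);
            put_online_cpus();              /* allow CPU hotplug again */
            return err;
    }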

    Signed-off-by: Ulrich Obergfell
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The lockup detector suspend/resume interface that was introduced by
    commit 8c073d27d7ad ("watchdog: introduce watchdog_suspend() and
    watchdog_resume()") does not protect itself against races with CPU
    hotplug. Hence, theoretically it is possible that a new watchdog thread
    is started on a hotplugged CPU while the lockup detector is suspended,
    and the thread could thus interfere unexpectedly with the code that
    requested to suspend the lockup detector.

    Avoid the race by calling

    get_online_cpus() in lockup_detector_suspend()
    put_online_cpus() in lockup_detector_resume()

    Signed-off-by: Ulrich Obergfell
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The only way to enable a hardlockup to panic the machine is to set
    'nmi_watchdog=panic' on the kernel command line.

    This makes it awkward for end users and folks who want to run automated
    tests (like myself).

    Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
    knob.
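
    A sketch of the corresponding sysctl table entry (modeled on the
    softlockup_panic knob; exact placement assumed):

    {
            .procname       = "hardlockup_panic",
            .data           = &hardlockup_panic,
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = proc_dointvec_minmax,
            .extra1         = &zero,
            .extra2         = &one,
    },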

    Signed-off-by: Don Zickus
    Cc: Ulrich Obergfell
    Acked-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Zickus
     
  • In many cases of hardlockup reports, it's actually not possible to know
    why it triggered, because the CPU that got stuck is usually waiting on
    a resource (with IRQs disabled) that some other CPU is holding.

    IOW, we are often looking at the stacktrace of the victim and not the
    actual offender.

    Introduce a sysctl / cmdline parameter that makes it possible to have
    the hardlockup detector perform an all-CPU backtrace.
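
    A sketch of the detector-side behaviour (names modeled on the existing
    softlockup counterpart; details assumed):

    /* on hardlockup, dump all other CPUs' stacks too, since the stuck CPU
     * is often the victim rather than the offender */
    if (sysctl_hardlockup_all_cpu_backtrace &&
        !test_and_set_bit(0, &hardlockup_allcpu_dumped))
            trigger_allbutself_cpu_backtrace();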

    Signed-off-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • If kthread_park() returns an error, watchdog_park_threads() should not
    blindly 'roll back' the already parked threads to the unparked state.
    Instead leave it up to the callers to handle such errors appropriately in
    their context. For example, it is redundant to unpark the threads if the
    lockup detectors will soon be disabled by the callers anyway.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • lockup_detector_suspend() now handles errors from watchdog_park_threads().

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
    up to functions in the call chain. This allows watchdog_enable_all_cpus()
    and proc_watchdog_update() to handle such errors too.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Move watchdog_disable_all_cpus() outside of the ifdef so that it is
    available if CONFIG_SYSCTL is not defined. This is preparation for
    "watchdog: implement error handling in update_watchdog_all_cpus() and
    callers".

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The original watchdog_park_threads() function that was introduced by
    commit 81a4beef91ba ("watchdog: introduce watchdog_park_threads() and
    watchdog_unpark_threads()") takes a very simple approach to handle
    errors returned by kthread_park(): It attempts to roll back all watchdog
    threads to the unparked state. However, this may be undesired behaviour
    from the perspective of the caller which may want to handle errors as
    appropriate in its specific context. Currently, there are two possible
    call chains:

    - watchdog suspend/resume interface

        lockup_detector_suspend
          watchdog_park_threads

    - write to parameters in /proc/sys/kernel

        proc_watchdog_update
          watchdog_enable_all_cpus
            update_watchdog_all_cpus
              watchdog_park_threads

    Instead of 'blindly' attempting to unpark the watchdog threads if a
    kthread_park() call fails, the new approach is to disable the lockup
    detectors in the above call chains. Failure becomes visible to the user
    as follows:

    - error messages from lockup_detector_suspend()
    or watchdog_enable_all_cpus()

    - the state that can be read from /proc/sys/kernel/watchdog_enabled

    - the 'write' system call in the latter call chain returns an error

    I did not experience kthread_park() failures in practice; I used some
    instrumentation to fake error returns from kthread_park() in order to
    test the patches.

    This patch (of 5):

    Restore the previous value of watchdog_thresh _and_ sample_period if
    proc_watchdog_update() returns an error. The variables must be consistent
    to avoid false positives of the lockup detectors.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Make is_hardlockup() return bool to improve readability, since this
    particular function only returns either one or zero.
    No functional change.

    Signed-off-by: Yaowei Bai
    Reviewed-by: Aaron Tomlin
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • If the remote locking fails, we run a local vfs unlock that should
    work, and return success to userland even though we didn't actually
    take the lock. We need to tell the application that tried to lock that
    it didn't get it, not that all went well.

    Signed-off-by: Dominique Martinet
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominique Martinet
     
  • Make struct callback_head aligned to the size of a pointer. On most
    architectures this happens naturally due to ABI requirements, but some
    architectures (like CRIS) have a weird ABI and we need to ask for it
    explicitly.

    The alignment is required to guarantee that bits 0 and 1 of @next will
    be clear under normal conditions -- as long as we use call_rcu(),
    call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue the callback.

    This guarantee is important for a few reasons:
    - the future call_rcu_lazy() will make use of the lower bits in the
      pointer;
    - the structure shares storage space in struct page with
      @compound_head, which encodes PageTail() in bit 0. The guarantee is
      needed to avoid a false-positive PageTail().

    A false-positive PageTail() caused a crash on crisv32 [1]. It happened
    due to a misaligned task_struct->rcu, which was byte-aligned.

    [1] http://lkml.kernel.org/r/55FAEA67.9000102@roeck-us.net
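
    The change boils down to forcing pointer-size alignment on the
    structure itself, as described above:

    /* include/linux/types.h */
    struct callback_head {
            struct callback_head *next;
            void (*func)(struct callback_head *head);
    } __attribute__((aligned(sizeof(void *))));
    #define rcu_head callback_head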

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Acked-by: Paul E. McKenney
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not
    used, so clean it up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • A node can mount multiple ocfs2 volumes, and if the thread names are
    the same for each volume/domain, it is inconvenient to analyze problems
    because we have to work out which volume/domain the messages belong to.

    Since the thread name is printed in messages, adding the volume uuid or
    dlm name to the thread name benefits problem analysis.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Gang He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In ocfs2_mknod_locked, if __ocfs2_mknod_locked returns an error, we
    should reclaim the inode successfully claimed above; otherwise, the
    inode can never be reused. The case is described below:

    ocfs2_mknod
      ocfs2_mknod_locked
        ocfs2_claim_new_inode
          Successfully claim the inode
        __ocfs2_mknod_locked
          ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons: the inode
            lockres has not been initialized yet.

    iput(inode)
      ocfs2_evict_inode
        ocfs2_delete_inode
          ocfs2_inode_lock
            ocfs2_inode_lock_full_nested
              __ocfs2_cluster_lock
                Returns -EINVAL because the inode
                lockres has not been initialized.

    So the following operations are not performed:
      ocfs2_wipe_inode
        ocfs2_remove_inode
          ocfs2_free_dinode
            ocfs2_free_suballoc_bits

    Signed-off-by: Alex Chen
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     
  • There is a race case between mount and node/cluster deletion which can
    lead the o2hb_thread into a malfunctioning dead loop:

    o2hb_thread
    {
            o2nm_depend_this_node();
            <<<<<< race window: the node may have already been deleted, and
                   then we enter the loop; the o2hb thread will malfunction
                   because no configured nodes are found.
            while (!kthread_should_stop() &&
                   !reg->hr_unclean_stop && !reg->hr_aborted_start) {
            }
    }

    So checking the return value of o2nm_depend_this_node() is needed. If
    the node has been deleted, do not enter the loop and let the mount
    fail.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • There is no need to take the inode mutex, rw lock and inode lock when
    recovering orphans if the entry is not a dio entry. Optimize this by
    adding a flag OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to
    reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi