29 Mar, 2012

3 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton: (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache being
    destroyed dynamically ends up sending an IPI to each CPU in the system,
    regardless of whether the cache has ever been used there.

    For example, if you close the InfiniBand ipath driver char device file, the
    close file ops calls kmem_cache_destroy(). So running some InfiniBand
    config tool on a single CPU dedicated to system tasks might interrupt
    the other 127 CPUs dedicated to some CPU-intensive or latency-sensitive
    task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch rectifies the issue by sending the IPI that flushes the
    per-cpu objects back to the free lists only to those CPUs that appear
    to have such objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking a
    CPU without per-cpu objects to flush does no damage and, as far as I can
    tell, flush_all() by itself is racy against allocs on remote CPUs anyway;
    so if you required flush_all() to be deterministic, you had to arrange for
    locking regardless.
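
    As a rough illustration of "only to CPUs that seem to have such objects",
    the conditional IPI looks something like the sketch below (helper names and
    the on_each_cpu_cond() signature follow this patch series, but treat the
    details as illustrative):

        /* Racy check, but a spurious flush on an idle CPU is harmless. */
        static bool has_cpu_slab(int cpu, void *info)
        {
                struct kmem_cache *s = info;
                struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

                return c->page || c->partial;
        }

        static void flush_all(struct kmem_cache *s)
        {
                on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
        }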

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The memory-allocation-failure code path for the CPUMASK_OFFSTACK=y
    config was tested using the fault injection framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Pull SLAB changes from Pekka Enberg:
    "There's the new kmalloc_array() API, minor fixes and performance
    improvements, but quite honestly, nothing terribly exciting."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: SLAB Out-of-memory diagnostics
    slab: introduce kmalloc_array()
    slub: per cpu partial statistics change
    slub: include include for prefetch
    slub: Do not hold slub_lock when calling sysfs_slab_add()
    slub: prefetch next freelist pointer in slab_alloc()
    slab, cleanup: remove unneeded return
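
    For reference, kmalloc_array() behaves like kmalloc(n * size, flags) but
    returns NULL if n * size would overflow instead of silently wrapping. A
    minimal usage sketch (the struct and count are hypothetical):

        struct item { int value; };
        struct item *arr;

        arr = kmalloc_array(count, sizeof(*arr), GFP_KERNEL);
        if (!arr)
                return -ENOMEM;
        /* ... use arr[0] .. arr[count - 1] ... */
        kfree(arr);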

    Linus Torvalds
     

22 Mar, 2012

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
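
    The read-side pattern being described looks roughly like the sketch below
    (the seqcount field follows the patch's mems_allowed_seq naming;
    try_alloc_from_allowed_nodes() is a hypothetical stand-in for the allocator
    fast path):

        unsigned int seq;
        struct page *page;

        do {
                /* A cheap read barrier instead of a full mfence on the fast path. */
                seq = read_seqcount_begin(&current->mems_allowed_seq);
                page = try_alloc_from_allowed_nodes(order, &current->mems_allowed);
                /* Retry only if we failed while the nodemask was being updated. */
        } while (!page && read_seqcount_retry(&current->mems_allowed_seq, seq));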

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                                3.3.0-rc3            3.3.0-rc3
                                rc3-vanilla          nobarrier-v2r1
    Clients 1 UserTime            0.07 (  0.00%)       0.08 (-14.19%)
    Clients 2 UserTime            0.07 (  0.00%)       0.07 (  2.72%)
    Clients 4 UserTime            0.08 (  0.00%)       0.07 (  3.29%)
    Clients 1 SysTime             0.70 (  0.00%)       0.65 (  6.65%)
    Clients 2 SysTime             0.85 (  0.00%)       0.82 (  3.65%)
    Clients 4 SysTime             1.41 (  0.00%)       1.41 (  0.32%)
    Clients 1 WallTime            0.77 (  0.00%)       0.74 (  4.19%)
    Clients 2 WallTime            0.47 (  0.00%)       0.45 (  3.73%)
    Clients 4 WallTime            0.38 (  0.00%)       0.37 (  1.58%)
    Clients 1 Flt/sec/cpu    497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu    414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu    257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec        495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec        820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec       1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)          135.68    132.17
    User+Sys Time Running Test (seconds)     164.2     160.13
    Total Elapsed Time (seconds)             123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Feb, 2012

1 commit

  • This patch splits cpu_partial_free into two parts: cpu_partial_node, which
    counts PCP refills from the node partial list, and cpu_partial_free (same
    name as before), which counts PCP refills in the slab_free slow path. A new
    statistic, 'cpu_partial_drain', is added to count PCP drains to the node
    partial list. This information is useful when tuning the PCP.

    The slabinfo.c code is unchanged, since cpu_partial_node is not on the slow
    path.
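
    In terms of the SLUB statistics enum, the split described above amounts to
    two new items alongside the existing ones (a sketch; existing items are
    elided):

        enum stat_item {
                /* ... existing items ... */
                CPU_PARTIAL_ALLOC,      /* used per-cpu partial on alloc */
                CPU_PARTIAL_FREE,       /* per-cpu partial refilled in slab_free slow path */
                CPU_PARTIAL_NODE,       /* per-cpu partial refilled from node partial */
                CPU_PARTIAL_DRAIN,      /* per-cpu partial drained to node partial */
                NR_SLUB_STAT_ITEMS
        };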

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     

10 Feb, 2012

1 commit

  • Otherwise m68k breaks:

    On Mon, 30 Jan 2012, Geert Uytterhoeven wrote:
    > m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/
    >
    > mm/slub.c:274: error: implicit declaration of function 'prefetch'
    >
    > Sorry, didn't notice it earlier due to other build breakage in -next.
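
    The fix is simply to include the header that declares prefetch() in
    mm/slub.c instead of relying on it being pulled in indirectly, i.e.
    something along the lines of:

        #include <linux/prefetch.h>     /* for prefetch() used in slab_alloc() */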

    Reported-by: Geert Uytterhoeven
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

06 Feb, 2012

1 commit

  • sysfs_slab_add() calls various sysfs functions that actually may
    end up in userspace doing all sorts of things.

    Release the slub_lock after adding the kmem_cache structure to the list.
    At that point the address of the kmem_cache is not yet known outside the
    allocator, so we are guaranteed exclusive access for the subsequent
    modifications to the kmem_cache structure.

    If sysfs_slab_add() fails, reacquire the slub_lock to remove the
    kmem_cache structure from the list.
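
    The resulting ordering looks roughly like the sketch below (simplified;
    error handling and the exact lock primitives are illustrative):

        down_write(&slub_lock);
        list_add(&s->list, &slab_caches);       /* publish the new cache */
        up_write(&slub_lock);                   /* drop the lock before sysfs work */

        if (sysfs_slab_add(s)) {                /* may call back into userspace */
                down_write(&slub_lock);
                list_del(&s->list);             /* undo the publication on failure */
                up_write(&slub_lock);
                /* free the kmem_cache and report the error */
        }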

    Cc: # 3.3+
    Reported-by: Sasha Levin
    Acked-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

25 Jan, 2012

1 commit

  • Recycling a page is a problem: the freelist link chain is hot on the
    CPU(s) that freed the objects, and possibly very cold on the CPU currently
    owning the slab.

    Adding a prefetch of the cache line containing the pointer to the next
    object in slab_alloc() helps a lot in many workloads, in particular
    asymmetric ones (allocations done on one CPU, frees on other CPUs). The
    added cost is only three machine instructions.
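
    The added hint amounts to prefetching the cache line that holds the next
    object's free pointer, roughly (helper name as in the patch; its call site
    in slab_alloc() is omitted here):

        static inline void prefetch_freepointer(const struct kmem_cache *s, void *object)
        {
                if (object)
                        prefetch(object + s->offset);   /* next-free pointer lives at s->offset */
        }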

    Examples on my dual-socket quad-core HT machine (Intel CPU E5540
    @2.53GHz) (16 logical CPUs, 2 memory nodes), 64-bit kernel.

    Before patch :

    # perf stat -r 32 hackbench 50 process 4000 >/dev/null

    Performance counter stats for 'hackbench 50 process 4000' (32 runs):

    327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
    28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
    1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
    127 151 page-faults # 0,000 M/sec ( +- 0,16% )
    829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
    580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
    197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
    503 548 648 975 instructions # 0,61 insns per cycle
    # 1,15 stalled cycles per insn ( +- 0,46% )
    95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
    1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )

    20,705679994 seconds time elapsed ( +- 0,64% )

    After patch :

    # perf stat -r 32 hackbench 50 process 4000 >/dev/null

    Performance counter stats for 'hackbench 50 process 4000' (32 runs):

    286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
    19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
    1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
    126 776 page-faults # 0,000 M/sec ( +- 0,12% )
    724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
    499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
    156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
    463 897 792 661 instructions # 0,64 insns per cycle
    # 1,08 stalled cycles per insn ( +- 0,94% )
    87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
    941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )

    18,132070670 seconds time elapsed ( +- 1,30% )

    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Lameter
    CC: Matt Mackall
    CC: David Rientjes
    CC: "Alex,Shi"
    CC: Shaohua Li
    Signed-off-by: Pekka Enberg

    Eric Dumazet
     

13 Jan, 2012

2 commits

  • Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However, setting that option will increase the size of struct page by
    eight bytes on 64-bit, which we certainly do not want. Also, it doesn't
    make sense that the mere presence of a cpu feature should increase the
    size of struct page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in larger struct pages when the SLUB allocator is used. Instead,
    introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE", which can be
    selected if a double-word-aligned struct page is required. Also update the
    x86 Kconfig so that it works as before.
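
    The effect can be sketched as follows (a simplified, self-contained
    illustration of the idea, not the exact mm_types.h code):

        #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
        /* Double-word alignment so cmpxchg_double() can operate on the struct. */
        #define STRUCT_PAGE_ALIGNMENT __attribute__((aligned(2 * sizeof(unsigned long))))
        #else
        /* No extra alignment, hence no growth of struct page. */
        #define STRUCT_PAGE_ALIGNMENT
        #endif

        struct page_like {
                unsigned long flags;
                void *freelist;
                unsigned long counters;
        } STRUCT_PAGE_ALIGNMENT;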

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

12 Jan, 2012

2 commits


11 Jan, 2012

2 commits

  • Disable slub debug facilities and allocate slabs at minimal order when
    debug_guardpage_minorder > 0 to increase probability to catch random
    memory corruption by cpu exception.

    Signed-off-by: Stanislaw Gruszka
    Cc: "Rafael J. Wysocki"
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Stanislaw Gruszka
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • For caches with debugging enabled, "slub: Switch per cpu partial page
    support off for debugging" changes cpu_partial to 0. It shouldn't be
    tunable from userspace for such caches, otherwise the same accounting
    issues arise during validation.

    This patch disallows tuning /sys/kernel/slab/cache/cpu_partial to be non-
    zero for caches with debugging enabled.

    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     

10 Jan, 2012

1 commit


04 Jan, 2012

1 commit

  • Just like the per-CPU ones they had several
    problems/shortcomings:

    Only the first memory operand was mentioned in the asm()
    operands, and the 2x64-bit version didn't have a memory clobber
    while the 2x32-bit one did. The former allowed the compiler to
    not recognize the need to re-load the data in case it had it
    cached in some register, while the latter was overly
    destructive.

    The types of the local copies of the old and new values were
    incorrect (the types of the pointed-to variables should be used
    here, to make sure the respective old/new variable types are
    compatible).

    The __dummy/__junk variables were pointless, given that local
    copies of the inputs already existed (and can hence be used for
    discarded outputs).

    The 32-bit variant of cmpxchg_double_local() referenced
    cmpxchg16b_local().

    At once also:

    - change the return value type to what it really is: 'bool'
    - unify 32- and 64-bit variants
    - abstract out the common part of the 'normal' and 'local' variants
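
    After the rework a caller sees a single bool-returning interface for both
    widths, along the lines of (field and variable names are illustrative):

        if (!cmpxchg_double(&page->freelist, &page->counters,
                            old_freelist, old_counters,
                            new_freelist, new_counters)) {
                /* someone else updated the pair; retry the operation */
        }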

    Signed-off-by: Jan Beulich
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

23 Dec, 2011

1 commit

  • We simply say that regular this_cpu use must be safe regardless of
    preemption and interrupt state. That has no material change for x86
    and s390 implementations of this_cpu operations. However, arches that
    do not provide their own implementation for this_cpu operations will
    now get code generated that disables interrupts instead of preemption.
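
    Concretely, the generic (non-arch) fallback moves from a
    preempt_disable()/preempt_enable() pair to masking interrupts, roughly (a
    schematic version of one such macro, not the exact percpu.h text):

        #define this_cpu_generic_add(pcp, val)                  \
        do {                                                    \
                unsigned long __flags;                          \
                local_irq_save(__flags);                        \
                __this_cpu_add(pcp, val);                       \
                local_irq_restore(__flags);                     \
        } while (0)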

    -tj: This is part of on-going percpu API cleanup. For detailed
    discussion of the subject, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1222078

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo
    LKML-Reference:

    Christoph Lameter
     

14 Dec, 2011

4 commits

  • With the per-cpu partial list, a slab is added to the partial list first and
    then moved to the node list. The __slab_free() code path for
    add/remove_partial is almost deprecated (except for slub debug). But we
    forgot to account add/remove_partial when moving per-cpu partial pages to
    the node list, so the statistics for such events are always 0. Add the
    corresponding accounting.

    This is against the patch "slub: use correct parameter to add a page to
    partial list tail"

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • get_freelist retrieves free objects from the page freelist (put there by remote
    frees) or deactivates a slab page if no more objects are available.

    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Eric saw an issue with the accounting of slabs during validation. It's not
    possible to accurately determine how many per-cpu partial slabs exist at
    any time, so this switches off per-cpu partial pages during debug.

    Acked-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Zhihua Che reported a possible memleak in the slub allocator on
    CONFIG_PREEMPT=y builds.

    It is possible that the current thread migrates right before disabling irqs
    in __slab_alloc(). We must check c->freelist again and perform a normal
    allocation instead of scratching c->freelist.

    Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39

    V2: It's also possible that an IRQ freed one (or several) object(s) and
    populated c->freelist, so it's not a CONFIG_PREEMPT-only problem.
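
    The shape of the fix, inside __slab_alloc() after interrupts have been
    disabled, is roughly (a fragment sketch, not the full function):

        local_irq_save(flags);
        c = this_cpu_ptr(s->cpu_slab);

        /*
         * We may have migrated to another CPU before irqs were disabled, or an
         * IRQ may have freed objects into c->freelist in the meantime. Check it
         * again and consume it rather than overwriting (and leaking) it.
         */
        object = c->freelist;
        if (object)
                goto load_freelist;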

    Cc: [2.6.39+]
    Reported-by: Zhihua Che
    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Eric Dumazet
     

28 Nov, 2011

2 commits

  • With the per-cpu partial list, a slab is added to the partial list first and
    then moved to the node list. The __slab_free() code path for
    add/remove_partial is almost deprecated (except for slub debug). But we
    forgot to account add/remove_partial when moving per-cpu partial pages to
    the node list, so the statistics for such events are always 0. Add the
    corresponding accounting.

    This is against the patch "slub: use correct parameter to add a page to
    partial list tail"

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • Pekka Enberg
     

24 Nov, 2011

2 commits

  • show_slab_objects() can trigger NULL dereferences or memory corruption.

    Another cpu can change its c->page to NULL or c->node to NUMA_NO_NODE
    while we use them.

    Use ACCESS_ONCE(c->page) and ACCESS_ONCE(c->node) to make sure this
    cannot happen.
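
    The pattern is to take one stable snapshot of each field and only ever use
    the local copy (the surrounding bookkeeping is illustrative):

        struct page *page = ACCESS_ONCE(c->page);
        int node = ACCESS_ONCE(c->node);

        if (page)
                x += page->objects;     /* safe: 'page' cannot change under us */
        if (node != NUMA_NO_NODE)
                nodes[node]++;          /* index computed from our snapshot, not a reload */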

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pekka Enberg

    Eric Dumazet
     
  • The cmpxchg must be irq-safe. The fallback for this_cpu_cmpxchg only
    disables preemption, which results in the per-cpu partial page operation
    potentially failing on non-x86 platforms.

    This patch fixes the following problem reported by Christian Kujau:

    I seem to hit it when heavy disk & cpu IO is in progress on this
    PowerBook G4. Full dmesg & .config:
    http://nerdbynature.de/bits/3.2.0-rc1/oops/

    I've enabled some debug options and now it really points to slub.c:2166

    http://nerdbynature.de/bits/3.2.0-rc1/oops/oops4m.jpg

    With debug options enabled I'm currently in the xmon debugger, not sure
    what to make of it yet, I'll try to get something useful out of it :)

    Reported-by: Christian Kujau
    Tested-by: Christian Kujau
    Acked-by: Eric Dumazet
    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

17 Nov, 2011

1 commit


16 Nov, 2011

2 commits

  • Lockdep reports there is a potential deadlock for the slub node list_lock.
    discard_slab() is called with the lock held in unfreeze_partials(), which
    could trigger a slab allocation, which could take the lock again.

    discard_slab() doesn't actually need to hold the lock if the slab has
    already been removed from the partial list.

    Acked-by: Christoph Lameter
    Reported-and-tested-by: Yong Zhang
    Reported-and-tested-by: Julie Sullivan
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • unfreeze_partials() needs to add the page to the partial list tail, since
    such a page doesn't have many free objects. We now explicitly use
    DEACTIVATE_TO_TAIL for this, while DEACTIVATE_TO_TAIL != 1. Without the
    accompanying fix, this would cause a performance regression (e.g. more lock
    contention on node->list_lock).

    Signed-off-by: Shaohua Li
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Shaohua Li
     

01 Nov, 2011

1 commit

  • memchr_inv() is mainly used to check whether the whole buffer is filled
    with just a specified byte.

    The function name and prototype are stolen from logfs and the
    implementation is from SLUB.
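
    The semantics: memchr_inv(start, c, bytes) returns a pointer to the first
    byte that differs from c, or NULL if the whole buffer is filled with c. A
    typical check looks like (buf/size/POISON_FREE stand for whatever region
    and fill byte are being validated):

        u8 *fault = memchr_inv(buf, POISON_FREE, size);

        if (fault)
                pr_err("poison overwritten at offset %td\n", fault - buf);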

    Signed-off-by: Akinobu Mita
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Joern Engel
    Cc: Marcin Slusarz
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

26 Oct, 2011

1 commit


28 Sep, 2011

3 commits

  • Discarding a slab should be done when node partial > min_partial.
    Otherwise, node partial slabs may eat up all memory.

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Correct comment errors that mistake the cpu partial objects number for a
    pages number, which may make the reader misunderstand.

    Signed-off-by: Alex Shi
    Reviewed-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Historically, /proc/slabinfo and the files under /sys/kernel/slab/* have
    had world-read permissions and so are accessible to everyone. slabinfo
    contains rather private information related both to the kernel and
    userspace tasks. Depending on the situation, it might reveal either
    private information per se or information useful to make another
    targeted attack. Some examples of what can be learned by
    reading/watching for /proc/slabinfo entries:

    1) dentry (and different *inode*) numbers might reveal other processes' fs
    activity. The number of dentry "active objects" doesn't strictly show the
    file count opened/touched by a process; however, there is a good
    correlation between them. The patch "proc: force dcache drop on
    unauthorized access" relies on the privacy of the dentry count.

    2) different inode entries might reveal the same information as (1), but
    these are finer-grained counters. If a filesystem is mounted at a private
    mount point (or even in a private namespace) and the fs type differs from
    other mounted fs types, fs activity in this mount point/namespace is
    revealed. If there is a single ecryptfs mount point, the whole fs
    activity of a single user is revealed. The number of files in an ecryptfs
    mount point is private information per se.

    3) fuse_* reveals number of files / fs activity of a user in a user
    private mount point. It is approx. the same severity as ecryptfs
    infoleak in (2).

    4) sysfs_dir_cache, similar to (2), reveals device addition/removal, which
    can otherwise be hidden by "chmod 0700 /sys/". With 0444 slabinfo the
    precise number of sysfs files is known to the world.

    5) buffer_head might reveal some kernel activity. With other
    information leaks an attacker might identify what specific kernel
    routines generate buffer_head activity.

    6) *kmalloc* infoleaks are very situational. An attacker should watch the
    specific kmalloc size entry and filter out the noise from unrelated kernel
    activity. If an attacker has a relatively silent victim system, he might
    get rather precise counters.

    Additional information sources might significantly increase the slabinfo
    infoleak benefits. E.g. if an attacker knows that process activity on the
    system is very low (only core daemons like syslog and cron), he may run
    setxid binaries / trigger local daemon activity / trigger network services
    activity / await sporadic cron job activity / etc. and get rather precise
    counters for the fs and network activity of these privileged tasks, which
    is unknown otherwise.

    Also, hiding slabinfo and /sys/kernel/slab/* is one step towards
    complicating the exploitation of kernel heap overflows (and possibly other
    bugs). The related discussion:

    http://thread.gmane.org/gmane.linux.kernel/1108378

    To keep compatibility with the old permission model, where a non-root
    monitoring daemon could watch for kernel memleaks through slabinfo, one
    should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

    And add the following commands to init scripts (to mountall.conf in
    Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

    Signed-off-by: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Reviewed-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    CC: Valdis.Kletnieks@vt.edu
    CC: Linus Torvalds
    CC: Alan Cox
    Signed-off-by: Pekka Enberg

    Vasiliy Kulikov
     

19 Sep, 2011

1 commit


14 Sep, 2011

1 commit


27 Aug, 2011

2 commits

  • Adding a slab to the partial list head vs. tail is performance sensitive,
    so explicitly use DEACTIVATE_TO_TAIL/DEACTIVATE_TO_HEAD to document it and
    avoid getting it wrong.

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • When a slab has just one free object, adding it to the partial list head
    doesn't make sense, and it can cause lock contention. For example:
    1. CPU takes the slab from the partial list
    2. fetches an object
    3. switches to another slab
    4. frees an object; the slab is then added to the partial list again
    In this way n->list_lock will be heavily contended.
    In fact, Alex saw a hackbench regression: 3.1-rc1 performance dropped about
    70% against 3.0. This patch fixes it.

    Acked-by: Christoph Lameter
    Reported-by: Alex Shi
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     

20 Aug, 2011

3 commits

  • Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
    partial pages. The partial page list is used in slab_free() to avoid taking
    the per-node lock.

    In __slab_alloc() we can then take multiple partial pages off the per
    node partial list in one go reducing node lock pressure.

    We can also use the per cpu partial list in slab_alloc() to avoid scanning
    partial lists for pages with free objects.

    The main effect of a per cpu partial list is that the per node list_lock
    is taken for batches of partial pages instead of individual ones.
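
    Structurally this amounts to one extra pointer in the per-cpu structure,
    roughly (field set simplified to the essentials):

        struct kmem_cache_cpu {
                void **freelist;        /* per-cpu list of free objects */
                unsigned long tid;      /* transaction id for the cmpxchg fast path */
                struct page *page;      /* slab page we are allocating from */
                struct page *partial;   /* per-cpu partial pages, linked via page->next */
        };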

    Potential future enhancements:

    1. The pickup from the partial list could perhaps be done without disabling
    interrupts with some work. The free path already puts the page into the
    per cpu partial list without disabling interrupts.

    2. __slab_free() may have some code paths that could use optimization.

    Performance:

    Before After
    ./hackbench 100 process 200000
    Time: 1953.047 1564.614
    ./hackbench 100 process 20000
    Time: 207.176 156.940
    ./hackbench 100 process 20000
    Time: 204.468 156.940
    ./hackbench 100 process 20000
    Time: 204.879 158.772
    ./hackbench 10 process 20000
    Time: 20.153 15.853
    ./hackbench 10 process 20000
    Time: 20.153 15.986
    ./hackbench 10 process 20000
    Time: 19.363 16.111
    ./hackbench 1 process 20000
    Time: 2.518 2.307
    ./hackbench 1 process 20000
    Time: 2.258 2.339
    ./hackbench 1 process 20000
    Time: 2.864 2.163

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • There is no need anymore to return the pointer to a slab page from
    get_partial() since the page reference can be stored in the kmem_cache_cpu
    structure's "page" field.

    Return an object pointer instead.

    That in turn allows a simplification of the spaghetti code in __slab_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Pass the kmem_cache_cpu pointer to get_partial(). That way
    we can avoid the this_cpu_write() statements.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter