16 Jun, 2011

34 commits

  • Store the AFS vnode uniquifier in the i_generation field, not the i_version
    field of the inode struct. i_version can then be given the AFS data version
    number.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Set s_id in the superblock to the name of the AFS volume that this superblock
    corresponds to.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • I've got a report of a file corruption from fsxlinux on ext3. The important
    operations to the page were:
    mapwrite to a hole
    partial write to the page
    read - found the page zeroed from the end of the normal write

    The culprit seems to be that if get_block() fails in __block_write_begin()
    (e.g. transient ENOSPC in ext3), the function does ClearPageUptodate(page).
    Thus when we retry the write, the logic in __block_write_begin() thinks zeroing
    of the page is needed and overwrites old data. In fact, I don't see why we
    should ever need to zero the uptodate bit here - either the page was uptodate
    when we entered __block_write_begin() and it should stay so when we leave it,
    or it was not uptodate and no one had the right to set it uptodate during
    __block_write_begin(), so it remains !uptodate when we leave as well. So just
    remove clearing of the bit.
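
    A minimal userspace sketch of that operation sequence (our names and
    file layout, not fsxlinux's; actually provoking the corruption also
    needs a transient get_block() failure such as ENOSPC on ext3 between
    the two writes):

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[256];
            int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

            ftruncate(fd, 65536);                 /* file is one big hole */

            /* 1: mapwrite into the hole */
            char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            memset(map, 'A', 4096);
            munmap(map, 4096);

            /* 2: partial write() to the same page */
            memset(buf, 'B', sizeof(buf));
            pwrite(fd, buf, sizeof(buf), 0);

            /* 3: read back past the write; the buggy path returned zeroes */
            pread(fd, buf, 1, 512);
            printf("byte at 512: %c\n", buf[0]);  /* expect 'A' */

            close(fd);
            return 0;
        }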

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • afs_fill_page should read the page that is about to be written, but
    the current implementation has a number of issues. If we aren't
    extending the file we always read PAGE_CACHE_SIZE at offset 0. If we
    are extending the file we try to read the entire file.

    Change afs_fill_page to read PAGE_CACHE_SIZE at the right offset,
    clamped to i_size.

    While here, avoid calling afs_fill_page when we are doing a
    PAGE_CACHE_SIZE write.
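
    A standalone sketch of that clamping (illustrative names, not the
    fs/afs/write.c code):

        #include <stdint.h>
        #include <stdio.h>

        /* How much of a page at byte offset pos should be read from a
         * file that is i_size bytes long (the remainder is zeroed). */
        static uint64_t fill_len(uint64_t pos, uint64_t page_size,
                                 uint64_t i_size)
        {
            if (pos >= i_size)
                return 0;             /* page lies entirely beyond EOF */
            if (pos + page_size > i_size)
                return i_size - pos;  /* clamp the read to EOF */
            return page_size;         /* page is fully inside the file */
        }

        int main(void)
        {
            /* 4096-byte pages, 10000-byte file: the page at offset 8192
             * needs only 1808 bytes read. */
            printf("%llu\n", (unsigned long long)fill_len(8192, 4096, 10000));
            return 0;
        }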

    Signed-off-by: Anton Blanchard
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Anton Blanchard
     
  • Fix build by moving enum list outside of
    #ifdef CONFIG_IIO_RING_BUFFER.

    drivers/staging/iio/accel/adis16201_core.c:413: error: 'ADIS16201_SCAN_SUPPLY' undeclared here (not in a function)
    drivers/staging/iio/accel/adis16201_core.c:417: error: 'ADIS16201_SCAN_TEMP' undeclared here (not in a function)
    ..

    drivers/staging/iio/accel/adis16203_core.c:374: error: 'ADIS16203_SCAN_SUPPLY' undeclared here (not in a function)
    drivers/staging/iio/accel/adis16203_core.c:378: error: 'ADIS16203_SCAN_AUX_ADC' undeclared here (not in a function)
    ..

    Signed-off-by: Randy Dunlap
    Acked-by: Jonathan Cameron
    Cc: linux-iio@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • [Kudos to dhowells for tracking that crap down]

    If two processes attempt to cause automounting on the same mountpoint at the
    same time, the vfsmount holding the mountpoint will be left with one too few
    references on it, causing a BUG when the kernel tries to clean up.

    The problem is that lock_mount() drops the caller's reference to the
    mountpoint's vfsmount in the case where it finds something already mounted on
    the mountpoint as it transits to the mounted filesystem and replaces path->mnt
    with the new mountpoint vfsmount.

    During a pathwalk, however, we don't take a reference on the vfsmount if it is
    the same as the one in the nameidata struct, but do_add_mount() doesn't know
    this.

    The fix is to make sure we have a ref on the vfsmount of the mountpoint before
    calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
    left with an extra ref on the mountpoint vfsmount which needs releasing.
    We can handle that in follow_managed() by not making assumptions about what
    we can and what we cannot get from lookup_mnt() as the current code does.

    The callers of follow_managed() expect that reference to path->mnt will be
    grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
    keep track of whether such reference has been grabbed and assume that it'll
    happen in those and only those cases that'll have us return with changed
    path->mnt. That assumption is almost correct - it breaks in the case of
    racing automounts and in an even harder to hit race between following a
    mountpoint and a couple of mount --move operations. The thing is, we don't
    need to make that assumption at all - after the end of the loop in
    follow_managed() we can check if path->mnt has ended up unchanged and do
    mntput() if needed.

    The BUG can be reproduced with the following test program:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int pid, ws;
        struct stat buf;

        pid = fork();
        stat(argv[1], &buf);    /* both parent and child stat the file */
        if (pid > 0)
            wait(&ws);
        return 0;
    }

    and the following procedure:

    (1) Mount an NFS volume that on the server has something else mounted on a
    subdirectory. For instance, I can mount / from my server:

    mount warthog:/ /mnt -t nfs4 -r

    On the server /data has another filesystem mounted on it, so NFS will see
    a change in FSID as it walks down the path, and will mark /mnt/data as
    being a mountpoint. This will cause the automount code to be triggered.

    !!! Do not look inside the mounted fs at this point !!!

    (2) Run the above program on a file within the submount to generate two
    simultaneous automount requests:

    /tmp/forkstat /mnt/data/testfile

    (3) Unmount the automounted submount:

    umount /mnt/data

    (4) Unmount the original mount:

    umount /mnt

    At this point the kernel should throw a BUG with something like the
    following:

    BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

    Note that the bug appears on the root dentry of the original mount, not the
    mountpoint and not the submount because sys_umount() hasn't got to its final
    mntput_no_expire() yet, but this isn't so obvious from the call trace:

    [] shrink_dcache_for_umount+0x69/0x82
    [] generic_shutdown_super+0x37/0x15b
    [] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
    [] kill_anon_super+0x1d/0x7e
    [] nfs4_kill_super+0x60/0xb6 [nfs]
    [] deactivate_locked_super+0x34/0x83
    [] deactivate_super+0x6f/0x7b
    [] mntput_no_expire+0x18d/0x199
    [] mntput+0x3b/0x44
    [] release_mounts+0xa2/0xbf
    [] sys_umount+0x47a/0x4ba
    [] ? trace_hardirqs_on_caller+0x1fd/0x22f
    [] system_call_fastpath+0x16/0x1b

    as do_umount() is inlined. However, you can see release_mounts() in there.

    Note also that it may be necessary to have multiple CPU cores to be able to
    trigger this bug.

    Tested-by: Jeff Layton
    Tested-by: Ian Kent
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Al Viro
     
  • Git bisection shows that commit e6bc45d65df8599fdbae73be9cec4ceed274db53 causes
    BUG_ONs under high I/O load:

    kernel BUG at fs/inode.c:1368!
    [ 2862.501007] Call Trace:
    [ 2862.501007] [] d_kill+0xf8/0x140
    [ 2862.501007] [] dput+0xc9/0x190
    [ 2862.501007] [] fput+0x15f/0x210
    [ 2862.501007] [] filp_close+0x61/0x90
    [ 2862.501007] [] sys_close+0xb1/0x110
    [ 2862.501007] [] system_call_fastpath+0x16/0x1b

    A reliable way to reproduce this bug is:
    log in to KDE, run 'rsnapshot sync', then 'apt-get install openjdk-6-jdk'
    followed by 'apt-get remove openjdk-6-jdk'.

    The buggy part of the patch is this:

        struct inode *inode = NULL;
        .....
    -   if (nd.last.name[nd.last.len])
    -           goto slashes;
        inode = dentry->d_inode;
    -   if (inode)
    -           ihold(inode);
    +   if (nd.last.name[nd.last.len] || !inode)
    +           goto slashes;
    +   ihold(inode);
        ...
        if (inode)
                iput(inode);    /* truncate the inode here */

    If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
    and dentry->d_inode is non-NULL, then this code now does an additional iput on
    the inode, which is wrong.

    Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.

    Reference: https://lkml.org/lkml/2011/6/15/50
    Reported-by: Norbert Preining
    Reported-by: Török Edwin
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Signed-off-by: Török Edwin
    Signed-off-by: Al Viro

    Török Edwin
     
  • We have some users of this function that date back to before the vma
    list was doubly linked, and are just silly. These days, you can find
    the previous vma by just following the vma->vm_prev pointer.

    In some cases you don't need any find_vma() lookup at all, and in other
    cases you're better off with the regular "find_vma()" that uses the vma
    cache front-end lookup.

    Some "find_vma_prev()" users are still valid, though. For example, in
    the case of a stack that grows up, it can be the case that we don't find
    any 'vma' at all (because we're looking up an address that is past the
    last vma), and that the stack that we want to grow is the 'prev' vma.

    But that kind of special case aside, we generally should prefer to use
    'find_vma()'.

    Noticed due to a totally unrelated POWER memory corruption bug that just
    happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
    using that function here?".
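
    A toy model of the point (the struct is a stand-in, not the kernel's
    struct vm_area_struct):

        #include <stddef.h>

        struct vma {
            unsigned long vm_start, vm_end;
            struct vma *vm_next, *vm_prev;
        };

        /* find_vma() semantics: first vma with vm_end > addr, or NULL. */
        static struct vma *find_vma(struct vma *list, unsigned long addr)
        {
            for (struct vma *v = list; v; v = v->vm_next)
                if (v->vm_end > addr)
                    return v;
            return NULL;
        }

        /* With the doubly linked list, the predecessor comes for free: */
        static struct vma *find_prev(struct vma *list, unsigned long addr)
        {
            struct vma *v = find_vma(list, addr);
            /* Note the grows-up stack case above: when v is NULL, the
             * prev we wanted is the last vma, and this shortcut loses
             * it - that is where find_vma_prev() remains valid. */
            return v ? v->vm_prev : NULL;
        }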

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Signed-off-by: Kuninori Morimoto
    Signed-off-by: Paul Mundt

    Kuninori Morimoto
     
  • Signed-off-by: Kuninori Morimoto
    Signed-off-by: Paul Mundt

    Kuninori Morimoto
     
  • * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
    ARM: footbridge: fix clock event support
    ARM: footbridge: fix debug macros
    ARM: initrd: disable initrds outside of memory
    ARM: extend Code: line by one 16-bit quantity for Thumb instructions
    ARM: 6955/1: cmpxchg syscall should data abort if page not write
    ARM: 6954/1: zImage: fix Thumb2 breakage
    ARM: 6953/1: DT: don't try to access physical address zero
    ARM: 6949/2: mach-u300: fix compilaton warning in IO accessors
    Revert "ARM: 6944/1: mm: allow ASID 0 to be allocated to tasks"
    Revert "ARM: 6943/1: mm: use TTBR1 instead of reserved context ID"
    davinci: make PCM platform devices static
    arm: davinci: Fix fallout from generic irq chip conversion
    ARM: 6894/1: mmci: trigger card detect IRQs on falling and rising edges
    ARM: 6952/1: fix lockdep warning of "unannotated irqs-off"
    ARM: 6951/1: include .bss in memory layout information
    ARM: 6948/1: Fix .size directives for __arm{7,9}tdmi_proc_info
    ARM: 6947/2: mach-u300: fix compilation error in timer
    ARM: 6946/1: vexpress: move v2m clock init to init_early
    ARM: mx51/sdma: Check the chip revision in run-time
    arm: mxs: include asm/processor.h for cpu_relax()

    Linus Torvalds
     
  • This reverts commit 7f81c8890c15a10f5220bebae3b6dfae4961962a.

    It turns out that it's not actually a build-time constant on x86-64 UML,
    which does some seriously crazy stuff with VM_STACK_FLAGS.

    The VM_STACK_FLAGS define depends on the arch-supplied
    VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have

    arch/um/sys-x86_64/shared/sysdep/vm-flags.h:

        #define VM_STACK_DEFAULT_FLAGS \
                (test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)

        #define VM_STACK_DEFAULT_FLAGS vm_stack_flags

    (yes, seriously: two different #define's for that thing, with the first
    one being inside an "#ifdef TIF_IA32")

    It's possible that it is UML that should just be fixed in this area, but
    for now let's just undo the (very small) optimization.

    Reported-by: Randy Dunlap
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Fix format and spelling.

    Signed-off-by: Jörg Sommer
    Acked-by: Paul Menage
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Jörg Sommer
     
  • According to commit 676db4af0430 ("cgroupfs: create /sys/fs/cgroup to
    mount cgroupfs on") the canonical mountpoint for the cgroup filesystem
    is /sys/fs/cgroup. Hence, this should be used in the documentation.

    Signed-off-by: Jörg Sommer
    Acked-by: Paul Menage
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Jörg Sommer
     
  • Instead of listing the architectures that are supported by
    kmemleak in Documentation/kmemleak.txt, just refer people to
    the list of supported architectures in lib/Kconfig.debug so
    that Documentation/kmemleak.txt does not need more updates
    for this.

    Signed-off-by: Maxin B. John
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Maxin B. John
     
  • This patch updates the incomplete documentation concerning the printk
    extended format specifiers.

    Signed-off-by: Andrew Murray
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Andrew Murray
     
  • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Check if lowest_mask is initialized in find_lowest_rq()
    sched: Fix need_resched() when checking peempt

    Linus Torvalds
     
  • Fix several security issues in Alpha-specific syscalls. Untested, but
    mostly trivial.

    1. Signedness issue in osf_getdomainname allows copying out-of-bounds
    kernel memory to userland.

    2. Signedness issue in osf_sysinfo allows copying large amounts of
    kernel memory to userland.

    3. Typo (?) in osf_getsysinfo bounds minimum instead of maximum copy
    size, allowing copying large amounts of kernel memory to userland.

    4. Usage of user pointer in osf_wait4 while under KERNEL_DS allows
    privilege escalation via writing return value of sys_wait4 to kernel
    memory.
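
    A toy illustration of the signedness pattern behind items 1 and 2
    (names are ours, not the actual Alpha syscall code):

        #include <string.h>

        /* A signed length check passes for negative values, which then
         * convert to a huge size_t in the copy. */
        static int copy_bounded(char *dst, size_t dstlen,
                                const char *src, int len)
        {
            if (len > (int)dstlen)
                return -1;       /* len == -1 sails through this check */
            memcpy(dst, src, (size_t)len);  /* (size_t)-1: massive copy */
            return 0;
        }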

    Signed-off-by: Dan Rosenberg
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Rosenberg
     
  • Fixes this warning:

    drivers/misc/apds990x.c: At top level:
    drivers/misc/apds990x.c:613: warning: `apds990x_chip_on' defined but not used

    Signed-off-by: Geert Uytterhoeven
    Cc: Samu Onkalo
    Cc: Jonathan Cameron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Andrea Righi reported a case where an exiting task can race against
    ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
    triggering a NULL pointer dereference in ksmd.

    ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

    CPU 1 (__ksm_exit)          CPU 2 (scan_get_next_rmap_item)
                                list_empty() is false
    lock                        slot == &ksm_mm_head
    list_del(slot->mm_list)
    (list now empty)
    unlock
                                lock
                                slot = list_entry(slot->mm_list.next)
                                (list is empty, so slot is still ksm_mm_head)
                                unlock
                                slot->mm == NULL ... Oops

    Close this race by revalidating that the new slot is not simply the list
    head again.

    Andrea's test case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define BUFSIZE getpagesize()

    int main(int argc, char **argv)
    {
        void *ptr;

        if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
            perror("posix_memalign");
            exit(1);
        }
        if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
            perror("madvise");
            exit(1);
        }
        *(char *)NULL = 0;    /* force the task to exit with a fault */

        return 0;
    }

    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • RTC_CLASS is changed to bool, so 'm' is invalid.

    Signed-off-by: Wanlong Gao
    Acked-by: Mike Frysinger
    Acked-by: Wolfram Sang
    Acked-by: Hans-Christian Egtvedt
    Acked-by: Benjamin Herrenschmidt
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • If dmi_get_system_info() returns NULL, pch_uart_init_port() will
    dereference a NULL pointer.

    This oops was observed on an Atom-based board which has no BIOS, but
    a bootloader which doesn't provide DMI data.

    Signed-off-by: Alexander Stein
    Cc: Greg KH
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Stein
     
  • When interrupts are delayed due to interrupt masking or due to other
    interrupts being serviced, the HPET periodic emulation would fail. This
    happened because given an interval t and a time for the current interrupt
    m we would compute the next time as t + m. This works until we are
    delayed for > t, in which case we would be writing a new value which is in
    fact in the past.

    This can be solved by computing the next time instead as (k * t) + m where
    k is large enough to be in the future. The exact computation of k is
    described in a comment to the code.

    More detail:

    Assuming an interval of 5 between each expected interrupt we have a normal
    case of

    t0: interrupt, read t0 from comparator, set next interrupt t0 + 5
    t5: interrupt, read t5 from comparator, set next interrupt t5 + 5
    t10: interrupt, read t10 from comparator, set next interrupt t10 + 5
    ...

    So, what happens when the interrupt is serviced too late?

    t0: interrupt, read t0 from comparator, set next interrupt t0 + 5
    t11: delayed interrupt serviced, read t5 from comparator, set next
    interrupt t5 + 5, which is in the past!
    ... counter loops ...
    t10: Much much later, get the next interrupt.

    This can happen either because we have interrupts masked for too long
    (some stupid driver goes on a printk rampage) or just because we are
    pushing the limits of the interval (too small a period), or both most
    probably.

    My solution is to read the main counter as well and set the next interrupt
    to occur at the right interval, for example:

    t0: interrupt, read t0 from comparator, set next interrupt t0 + 5
    t11: delayed interrupt serviced, read t5 from comparator, set next
    interrupt t15 as t10 has been missed.
    t15: back on track.
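
    A standalone sketch of that computation (illustrative names; the real
    comment and code live in the HPET driver):

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t next_comparator(uint64_t last, uint64_t period,
                                        uint64_t now)
        {
            if (now < last + period)
                return last + period;      /* on time: the normal case */
            /* Late: k whole periods put the next event in the future. */
            uint64_t k = (now - last) / period + 1;
            return last + k * period;
        }

        int main(void)
        {
            /* t0 = 0, interval 5, serviced at t11: next interrupt is t15 */
            printf("%llu\n", (unsigned long long)next_comparator(0, 5, 11));
            return 0;
        }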

    Signed-off-by: Nils Carlson
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nils Carlson
     
  • Commit a77aea92010acf ("cgroup: remove the ns_cgroup") removed the
    ns_cgroup but it forgot to remove the related doc in
    feature-removal-schedule.txt.

    Signed-off-by: WANG Cong
    Cc: Daniel Lezcano
    Cc: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Asynchronous compaction is used when promoting to huge pages. This is all
    very nice, but if a number of processes are compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result, with one user reporting that
    firefox can stall for tens of seconds. This patch aborts asynchronous
    compaction if too many pages are isolated as it's better to fail a
    hugepage promotion than stall a process.

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is unsafe to run page_count during the physical pfn scan because
    compound_head could trip on a dangling pointer when reading
    page->first_page if the compound page is being freed by another CPU.

    [mgorman@suse.de: split out patch]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Compaction works with two scanners, a migration and a free scanner. When
    the scanners crossover, migration within the zone is complete. The
    location of the scanner is recorded on each cycle to avoid excessive
    scanning.

    When a zone is small and mostly reserved, it's very easy for the migration
    scanner to be close to the end of the zone. Then the following situation
    can occur:

    o migration scanner isolates some pages near the end of the zone
    o free scanner starts at the end of the zone but finds that the
    migration scanner is already there
    o free scanner gets reinitialised for the next cycle as
    cc->migrate_pfn + pageblock_nr_pages
    moving the free scanner into the next zone
    o migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of the
    accounting happens against the wrong zone. One zone's counter remains
    positive while the other goes negative, even though the overall global
    count is accurate. This was reported on X86-32 with !SMP because !SMP
    allows the negative counters to be visible. The fact that it is difficult
    to reproduce on X86-64 is probably just a coincidence, as the bug should
    theoretically be possible there.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • fragmentation_index() returns -1000 when the allocation might succeed.
    This doesn't match the comment and code in compaction_suitable(). I
    think compaction_suitable() should return COMPACT_PARTIAL in the -1000
    case, because in that case the allocation could succeed depending on
    watermarks.

    The impact of this is that compaction starts and compact_finished() is
    called, which rechecks the watermarks and the free lists. The end result
    is the same as not starting compaction at all, only more expensive.
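
    A toy model of the decision described above (names and the threshold
    source are illustrative; the real code is in mm/compaction.c):

        #include <stdio.h>

        enum compact_result { COMPACT_SKIPPED, COMPACT_PARTIAL,
                              COMPACT_CONTINUE };

        static enum compact_result suitable(int fragindex, int threshold)
        {
            if (fragindex == -1000)
                return COMPACT_PARTIAL;  /* allocation may already succeed */
            if (fragindex >= 0 && fragindex <= threshold)
                return COMPACT_SKIPPED;  /* failure is from lack of memory */
            return COMPACT_CONTINUE;     /* fragmentation: compact */
        }

        int main(void)
        {
            printf("%d\n", suitable(-1000, 500));  /* 1 == COMPACT_PARTIAL */
            return 0;
        }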

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Pages isolated for migration are accounted with the vmstat counters
    NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to
    increment these counters when pages are isolated from the LRU. Once the
    pages have been migrated, they are put back on the LRU or freed and the
    isolated count is decremented.

    Memory failure is not properly accounting for pages it isolates, causing
    the NR_ISOLATED counters to go negative. On SMP builds, this goes
    unnoticed as negative counters are treated as 0 due to expected per-cpu
    drift. On UP builds, the counter is treated by too_many_isolated() as a
    large value, causing processes to enter D state during page reclaim or
    compaction. This patch accounts for pages isolated by memory failure
    correctly.

    [mel@csn.ul.ie: rewrote changelog]
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • CONFIG_CONSTRUCTORS controls support for running constructor functions at
    kernel init time. According to commit b99b87f70c7785ab ("kernel:
    constructor support"), gcov (CONFIG_GCOV_KERNEL) needs this. However,
    CONFIG_CONSTRUCTORS currently defaults to y, with no option to disable it,
    and CONFIG_GCOV_KERNEL depends on it. Instead, default it to n and have
    CONFIG_GCOV_KERNEL select it, so that the normal case of
    CONFIG_GCOV_KERNEL=n will result in CONFIG_CONSTRUCTORS=n.

    Observed in the short list of =y values in a minimal kernel configuration.

    Signed-off-by: Josh Triplett
    Acked-by: WANG Cong
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • I shall maintain the legacy eeprom driver, until we finally get rid of it.

    Signed-off-by: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • Based on Michal Hocko's comment.

    We are not draining per cpu cached charges during soft limit reclaim
    because background reclaim doesn't care about charges. It tries to free
    some memory and charges will not give any.

    Cached charges might influence only selection of the biggest soft limit
    offender but as the call is done only after the selection has been already
    done it makes no change.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For performance, the memory cgroup caches some "charge" from res_counter
    into a per cpu cache. This works well, but because it is a cache, it
    needs to be flushed in some cases. Typical cases are:

    1. when someone hits a limit.

    2. when rmdir() is called and the charges need to be 0.

    But "1" has a problem.

    Recently, with large SMP machines, we see many kworker runs because of
    flushing memcg's cache. The bad thing in the implementation is that the
    drain code is called even if a cpu's cache is for a memcg unrelated to
    the one which hit its limit.

    This patch does the following (see the sketch below):
    A) check whether the percpu cache contains useful data.
    B) check that no other asynchronous percpu drain is running.
    C) don't call the local cpu callback.

    (*) This patch avoids changing the calling condition with the hard limit.
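
    A toy sketch of checks (A) and (B) (illustrative names; the real
    per-cpu stock and its atomic flushing flag live in mm/memcontrol.c):

        #include <stdbool.h>

        struct stock {
            const void *cached; /* memcg this CPU's cached charge is for */
            unsigned nr_pages;  /* number of cached pages */
            bool flushing;      /* an async drain is already in flight */
        };

        static bool should_drain(const struct stock *s, const void *memcg)
        {
            return s->nr_pages > 0     /* (A) cache holds useful data...  */
                && s->cached == memcg  /*     ...for the memcg at limit   */
                && !s->flushing;       /* (B) no drain already running    */
        }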

    When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

    [Before]
    13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
    58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
    60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
    4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
    57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
    61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
    62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
    63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

    [After]
    2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
    2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
    1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

    [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Tested-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Hierarchical reclaim doesn't swap out if the memsw and resource limits
    are the same (memsw_is_minimum == true), because we would hit the
    mem+swap limit anyway (during hard limit reclaim).

    If it comes to the soft limit, we shouldn't consider memsw_is_minimum at
    all, because it doesn't make much sense. Either the soft limit is below
    the hard limit, and then we cannot hit the mem+swap limit, or the direct
    reclaim takes precedence.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki