29 Oct, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
    leases: fix write-open/read-lease race
    nfs: drop unnecessary locking in llseek
    ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
    vfs: add generic_file_llseek_size
    vfs: do (nearly) lockless generic_file_llseek
    direct-io: merge direct_io_walker into __blockdev_direct_IO
    direct-io: inline the complete submission path
    direct-io: separate map_bh from dio
    direct-io: use a slab cache for struct dio
    direct-io: rearrange fields in dio/dio_submit to avoid holes
    direct-io: fix a wrong comment
    direct-io: separate fields only used in the submission path from struct dio
    vfs: fix spinning prevention in prune_icache_sb
    vfs: add a comment to inode_permission()
    vfs: pass all mask flags check_acl and posix_acl_permission
    vfs: add hex format for MAY_* flag values
    vfs: indicate that the permission functions take all the MAY_* flags
    compat: sync compat_stats with statfs.
    vfs: add "device" tag to /proc/self/mountstats
    cleanup: vfs: small comment fix for block_invalidatepage
    ...

    Fix up trivial conflict in fs/gfs2/file.c (llseek changes)

    Linus Torvalds
     

28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, the pointer to the iovec
    array can be incremented, but the nr_segs value in the iov_iter struct
    is not decremented. The result is an iov_iter struct with a nr_segs
    value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
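
    As an illustration of the change described above, here is a minimal
    userspace model of the advance operation (the field names iov, nr_segs and
    iov_offset mirror the kernel's iov_iter, but this is a sketch, not the
    kernel code): whenever a whole iovec is consumed, the iov pointer and
    nr_segs are updated together, so nr_segs never reaches past the array.

    #include <stddef.h>
    #include <sys/uio.h>

    struct iov_iter_model {
            const struct iovec *iov;  /* current segment */
            unsigned long nr_segs;    /* segments left, including the current one */
            size_t iov_offset;        /* offset into the current segment */
    };

    static void iter_advance(struct iov_iter_model *i, size_t bytes)
    {
            while (bytes && i->nr_segs) {
                    size_t rest = i->iov->iov_len - i->iov_offset;
                    size_t step = bytes < rest ? bytes : rest;

                    i->iov_offset += step;
                    bytes -= step;
                    if (i->iov_offset == i->iov->iov_len) {
                            i->iov++;
                            i->nr_segs--;   /* keep nr_segs in sync with iov */
                            i->iov_offset = 0;
                    }
            }
    }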
     

26 Oct, 2011

1 commit


25 Oct, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     
  • * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
    TOMOYO: Fix incomplete read after seek.
    Smack: allow to access /smack/access as normal user
    TOMOYO: Fix unused kernel config option.
    Smack: fix: invalid length set for the result of /smack/access
    Smack: compilation fix
    Smack: fix for /smack/access output, use string instead of byte
    Smack: domain transition protections (v3)
    Smack: Provide information for UDS getsockopt(SO_PEERCRED)
    Smack: Clean up comments
    Smack: Repair processing of fcntl
    Smack: Rule list lookup performance
    Smack: check permissions from user space (v2)
    TOMOYO: Fix quota and garbage collector.
    TOMOYO: Remove redundant tasklist_lock.
    TOMOYO: Fix domain transition failure warning.
    TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
    TOMOYO: Simplify garbage collector.
    TOMOYO: Fix make namespacecheck warnings.
    target: check hex2bin result
    encrypted-keys: check hex2bin result
    ...

    Linus Torvalds
     

20 Oct, 2011

1 commit

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
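
    A kernel-style sketch of the second option above (illustrative only, not
    the actual patch; the helper name is hypothetical): the pte is inspected
    for a migration entry only after the page table lock is taken, so a racing
    move_ptes() cannot relocate the entry between the check and the update.

    static void handle_possible_migration_entry(struct mm_struct *mm,
                                                pmd_t *pmd, unsigned long addr)
    {
            spinlock_t *ptl;
            pte_t *ptep, pte;

            /* Take the pte lock first; only then look at the entry. */
            ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
            pte = *ptep;
            if (is_swap_pte(pte) && is_migration_entry(pte_to_swp_entry(pte))) {
                    /* ... safe to act on the migration entry here ... */
            }
            pte_unmap_unlock(ptep, ptl);
    }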
     

28 Sep, 2011

3 commits

  • Discarding a slab should be done when node partial > min_partial. Otherwise,
    node partial slabs may eat up all memory.

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
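
    A hedged sketch of the decision described above (n->nr_partial and
    s->min_partial are real SLUB fields; the surrounding code is simplified):
    once the node already caches at least min_partial slabs, an empty slab is
    discarded instead of being parked on the partial list.

    /* Illustrative only, not the exact SLUB hunk. */
    if (n->nr_partial >= s->min_partial)
            discard_slab(s, page);          /* hand the page back to the allocator */
    else
            add_partial(n, page, DEACTIVATE_TO_TAIL);   /* keep it cached on the node */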
     
  • Correct comment errors that mistake the cpu partial objects number for a
    page count, which may make readers misunderstand.

    Signed-off-by: Alex Shi
    Reviewed-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Historically /proc/slabinfo and files under /sys/kernel/slab/* have
    world read permissions and are accessible to the world. slabinfo
    contains rather private information related both to the kernel and
    userspace tasks. Depending on the situation, it might reveal either
    private information per se or information useful to make another
    targeted attack. Some examples of what can be learned by
    reading/watching for /proc/slabinfo entries:

    1) dentry (and different *inode*) numbers might reveal other processes'
    fs activity. The number of dentry "active objects" doesn't strictly show
    the number of files opened/touched by a process; however, there is a good
    correlation between them. The patch "proc: force dcache drop on
    unauthorized access" relies on the privacy of the dentry count.

    2) different inode entries might reveal the same information as (1), but
    these are more fine-grained counters. If a filesystem is mounted in a
    private mount point (or even a private namespace) and its fs type differs
    from other mounted fs types, fs activity in this mount point/namespace is
    revealed. If there is a single ecryptfs mount point, the whole fs
    activity of a single user is revealed. The number of files in an ecryptfs
    mount point is private information per se.

    3) fuse_* reveals the number of files / fs activity of a user in a
    user-private mount point. It is of approximately the same severity as the
    ecryptfs infoleak in (2).

    4) sysfs_dir_cache, similar to (2), reveals device addition/removal,
    which can otherwise be hidden by "chmod 0700 /sys/". With 0444 slabinfo
    the precise number of sysfs files is known to the world.

    5) buffer_head might reveal some kernel activity. With other
    information leaks an attacker might identify what specific kernel
    routines generate buffer_head activity.

    6) *kmalloc* infoleaks are very situational. An attacker should watch for
    the specific kmalloc size entry and filter out the noise from unrelated
    kernel activity. If an attacker has a relatively silent victim system, he
    might get rather precise counters.

    Additional information sources might significantly increase the slabinfo
    infoleak benefits. E.g. if an attacker knows that process
    activity on the system is very low (only core daemons like syslog and
    cron), he may run setxid binaries / trigger local daemon activity /
    trigger network services activity / await sporadic cron job activity /
    etc. and get rather precise counters for the fs and network activity of
    these privileged tasks, which is otherwise unknown.

    Also, hiding slabinfo and /sys/kernel/slab/* is one step toward complicating
    exploitation of kernel heap overflows (and possibly other bugs). The
    related discussion:

    http://thread.gmane.org/gmane.linux.kernel/1108378

    To keep compatibility with the old permission model, where a non-root
    monitoring daemon could watch for kernel memory leaks through slabinfo,
    one should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

    And add the following commands to init scripts (to mountall.conf in
    Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

    Signed-off-by: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Reviewed-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    CC: Valdis.Kletnieks@vt.edu
    CC: Linus Torvalds
    CC: Alan Cox
    Signed-off-by: Pekka Enberg

    Vasiliy Kulikov
     

22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

19 Sep, 2011

2 commits


15 Sep, 2011

9 commits

  • Fast-forward merge with Linus to be able to merge patches
    based on a more recent version of the tree.

    Jiri Kosina
     
  • Signed-off-by: Joe Perches
    Acked-by: Paul Menage
    Signed-off-by: Jiri Kosina

    Joe Perches
     
  • The entries found by find_get_pages() could all be swap entries. In
    this case we skip the entries, but make sure the skipped entries are
    accounted for, so we don't keep looping.

    Use nr_found > nr_skip to simplify the code, as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
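
    A simplified sketch of the termination logic (nr_found and nr_skip follow
    the commit text; the lookup helper and predicate are stand-ins for the
    radix tree walk, not real kernel functions): skipped swap entries are
    counted so the restart test only fires when something else was found.

    /* Illustrative fragment, not the kernel's find_get_pages(). */
    unsigned int nr_found, nr_skip, ret, i;
    repeat:
            nr_found = lookup_batch(mapping, index, pages, nr);  /* hypothetical */
            nr_skip = 0;
            ret = 0;
            for (i = 0; i < nr_found; i++) {
                    if (is_swap_entry(pages[i])) {   /* exceptional entry: skip */
                            nr_skip++;
                            continue;
                    }
                    pages[ret++] = pages[i];
            }
            /* Restart only if something other than skipped entries showed up;
             * if nr_found == nr_skip, looping again would never terminate. */
            if (!ret && nr_found > nr_skip)
                    goto repeat;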
     
  • Xen backend drivers (e.g., blkback and netback) would sometimes fail to
    map grant pages into the vmalloc address space allocated with
    alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
    not find the page (in the L2 table) containing the PTEs it needed to
    update.

    (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

    netback and blkback were making the hypercall from a kernel thread where
    task->active_mm != &init_mm and alloc_vm_area() was only updating the page
    tables for init_mm. The usual method of deferring the update to the page
    tables of other processes (i.e., after taking a fault) doesn't work as a
    fault cannot occur during the hypercall.

    This would work on some systems depending on what else was using vmalloc.

    Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
    from alloc_vm_area()") and add a comment to explain why it's needed.

    Signed-off-by: David Vrabel
    Cc: Jeremy Fitzhardinge
    Cc: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Keir Fraser
    Cc: [3.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Vrabel
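
    A hedged sketch of what the revert restores, in the spirit of
    alloc_vm_area() (get_vm_area(), apply_to_page_range() and
    vmalloc_sync_all() are real APIs; 'f' stands for the pte callback):
    after the area's PTEs are set up in init_mm, they are synced into all
    page tables, because a hypercall issued later cannot rely on the lazy
    fault-time sync.

    struct vm_struct *area;

    area = get_vm_area(size, VM_IOREMAP);
    if (!area)
            return NULL;

    if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
                            size, f, NULL)) {      /* populate the PTEs */
            free_vm_area(area);
            return NULL;
    }

    /* Make the mapping visible in every mm; kernel threads doing
     * GNTTABOP_map_grant_ref cannot take a fault to sync lazily. */
    vmalloc_sync_all();

    return area;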
     
  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Without swap, anonymous pages are not scanned. As such, they should not
    count when considering force-scanning a small target if there is no swap.

    Otherwise, targets are not force-scanned even when their effective scan
    number is zero and the other conditions--kswapd/memcg--apply.

    This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
    targets").

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
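
    A small sketch of the accounting idea (names here are illustrative, not
    the actual get_scan_count() code): anonymous LRU pages only count toward
    the force-scan decision when swap is available.

    /* Illustrative only. */
    unsigned long scannable = file_lru_pages;

    if (total_swap_pages)           /* anon is only reclaimable with swap */
            scannable += anon_lru_pages;

    /* force-scan small targets only when their effective scan count is zero
     * and the kswapd/memcg conditions apply */
    force_scan = (in_kswapd || in_memcg_reclaim) &&
                 (scannable >> priority) == 0;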
     
  • The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
    yet it is referenced for per-node vmstat with CONFIG_NUMA:

    drivers/built-in.o: In function `node_read_vmstat':
    node.c:(.text+0x1106df): undefined reference to `vmstat_text'

    Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
    vmstats").

    Define the array for CONFIG_NUMA as well.

    [akpm@linux-foundation.org: remove unneeded ifdefs]
    Signed-off-by: David Rientjes
    Reported-by: Cong Wang
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
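
    A hedged sketch of the shape of the fix (the exact guards in mm/vmstat.c
    may differ): make vmstat_text available whenever CONFIG_NUMA needs it,
    not only for procfs/sysfs builds.

    /* mm/vmstat.c, illustrative: */
    #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) || defined(CONFIG_NUMA)
    const char * const vmstat_text[] = {
            "nr_free_pages",
            "nr_inactive_anon",
            /* ... */
    };
    #endif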
     
  • When compiling mm/mempolicy.c with struct user copy checks the following
    warning is shown:

    In file included from arch/x86/include/asm/uaccess.h:572,
    from include/linux/uaccess.h:5,
    from include/linux/highmem.h:7,
    from include/linux/pagemap.h:10,
    from include/linux/mempolicy.h:70,
    from mm/mempolicy.c:68:
    In function `copy_from_user',
    inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
    arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
    LD mm/built-in.o

    Fix this by passing correct buffer size value.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
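
    A generic illustration of the warning and its fix (the buffer names are
    hypothetical, not the mempolicy code): the length passed to
    copy_from_user() must be provably bounded by the destination buffer.

    unsigned long kbuf[BITS_TO_LONGS(MAX_NUMNODES)];

    if (len > sizeof(kbuf))                /* bound the size explicitly ... */
            return -EINVAL;
    if (copy_from_user(kbuf, ubuf, len))   /* ... so this size is provably correct */
            return -EFAULT;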
     
  • commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really
    fix the mbind vma merge problem, due to a wrong pgoff value being passed
    to vma_merge(), which made vma_merge() always return NULL.

    Before the patch is applied, we get a result like:

    addr = 0x7fa58f00c000
    [snip]
    7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
    7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
    7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0

    here, for 7fa58f00c000->7fa58f00f000, we get 3 VMAs which are expected
    to be merged, as described in commit 9d8cebd.

    Re-testing the patched kernel with the reproducer provided in commit
    9d8cebd, we get the correct result:

    addr = 0x7ffa5aaa2000
    [snip]
    7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
    7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack]

    Signed-off-by: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Caspar Zhang
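
    A hedged sketch of the pgoff computation involved (illustrative; the real
    code lives in mbind_range()): the pgoff handed to vma_merge() has to
    correspond to the start address being merged, not to the vma's own start.

    /* Page offset of 'start' within the backing object of 'vma'. */
    pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);

    /* vma_merge() compares this against the neighbours; with a wrong pgoff
     * the compatibility checks fail and vma_merge() returns NULL every time. */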
     

14 Sep, 2011

1 commit


03 Sep, 2011

2 commits


27 Aug, 2011

2 commits

  • Adding a slab to the partial list head or tail is performance sensitive,
    so explicitly use DEACTIVATE_TO_TAIL/DEACTIVATE_TO_HEAD to document it
    and avoid getting it wrong.

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • If a slab has just one free object, adding it to the partial list head
    doesn't make sense, and it can cause lock contention. For example,
    1. a CPU takes the slab from the partial list
    2. fetches an object
    3. switches to another slab
    4. frees an object; the slab is then added to the partial list again
    In this way n->list_lock will be heavily contended.
    In fact, Alex saw a hackbench regression: 3.1-rc1 performance drops about 70%
    against 3.0. This patch fixes it.

    Acked-by: Christoph Lameter
    Reported-by: Alex Shi
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
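
    A short sketch of the placement decision (add_partial() and the
    DEACTIVATE_TO_* markers are the real SLUB names; the surrounding code is
    simplified): a nearly full slab goes to the tail of the partial list so
    allocations don't immediately pull it off the head again.

    /* Illustrative only. */
    if (page->inuse == page->objects - 1)            /* just one free object */
            add_partial(n, page, DEACTIVATE_TO_TAIL);
    else
            add_partial(n, page, DEACTIVATE_TO_HEAD);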
     

26 Aug, 2011

5 commits

  • Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter") tried to oom lock the hierarchy and roll back upon
    encountering an already locked memcg.

    The code is confused when it comes to detecting a locked memcg, though,
    so it would fail and roll back after locking one memcg and encountering
    an unlocked second one.

    The result is that oom-locking the hierarchy fails unconditionally and
    that every oom killer invocation simply goes to sleep on the oom
    waitqueue forever. The tasks practically hang forever without anyone
    intervening, possibly holding locks that trip up unrelated tasks, too.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
    task. It's possible that ZONE_CONGESTED isn't cleared in some cases:

    1. The zone is already balanced when entering balance_pgdat() for
    order-0 because concurrent tasks freed memory. In this case, the later
    check will skip the zone as it's balanced, so the flag isn't cleared.

    2. A high-order balance falls back to order-0. Quoting Mel: at the
    end of balance_pgdat(), kswapd uses the following logic:

    If reclaiming at high order {
        for each zone {
            if all_unreclaimable
                skip
            if watermark is not met
                order = 0
                loop again

            /* watermark is met */
            clear congested
        }
    }

    i.e. it clears ZONE_CONGESTED if the zone is balanced; if not,
    it restarts balancing at order-0. However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
    that only happens after a zone is shrunk. This can mean that
    wait_iff_congested() stalls unnecessarily.

    This patch makes kswapd clear ZONE_CONGESTED during its initial
    highmem->dma scan for zones that are already balanced.

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
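
    A hedged sketch of the added behaviour (zone_clear_flag() and
    ZONE_CONGESTED are real; the balance test here is simplified): during
    kswapd's initial highmem->dma scan, a zone that already meets its
    watermark gets its congested flag cleared.

    /* Illustrative fragment in the spirit of balance_pgdat()'s first scan. */
    for (i = pgdat->nr_zones - 1; i >= 0; i--) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                    continue;
            if (zone_watermark_ok_safe(zone, order,
                                       high_wmark_pages(zone), 0, 0))
                    /* already balanced: don't leave ZONE_CONGESTED behind */
                    zone_clear_flag(zone, ZONE_CONGESTED);
    }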
     
  • I get the below warning:

    BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
    caller is native_sched_clock+0x37/0x6e
    Pid: 746, comm: bash Tainted: G W 3.0.0+ #254
    Call Trace:
    [] debug_smp_processor_id+0xc2/0xdc
    [] native_sched_clock+0x37/0x6e
    [] try_to_free_mem_cgroup_pages+0x7d/0x270
    [] mem_cgroup_force_empty+0x24b/0x27a
    [] ? sys_close+0x38/0x138
    [] ? sys_close+0x38/0x138
    [] mem_cgroup_force_empty_write+0x17/0x19
    [] cgroup_file_write+0xa8/0xba
    [] vfs_write+0xb3/0x138
    [] sys_write+0x4a/0x71
    [] ? sys_close+0xf0/0x138
    [] system_call_fastpath+0x16/0x1b

    sched_clock() can't be used with preempt enabled. And we don't need a
    fast approach to get the clock here, so let's use the ktime API.

    Signed-off-by: Shaohua Li
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
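
    A minimal sketch of the substitution (ktime_get(), ktime_sub() and
    ktime_to_ns() are the real ktime APIs): elapsed time can be measured this
    way without requiring preemption to be disabled.

    #include <linux/ktime.h>

    ktime_t start = ktime_get();    /* safe in preemptible context */
    /* ... the work that used to be timed with sched_clock() ... */
    s64 elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), start));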
     
  • Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without
    pages") added a drain_local_stock() call to a preemptible section.

    The draining task looks up the cpu-local stock twice: first to set the
    draining flag, then to drain the stock and clear the flag again. If the
    task is migrated to a different CPU in between, no one will clear the
    flag on the first stock and it will be forever undrainable. Its charge
    can not be recovered and the cgroup can not be deleted anymore.

    Properly pin the task to the executing CPU while draining stocks.

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
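
    A hedged sketch of the pinning idea (get_cpu_var()/put_cpu_var() are the
    real per-cpu helpers; the stock structure and drain call are simplified):
    preemption stays disabled from the per-cpu lookup until the flag is
    cleared, so the task cannot migrate in between.

    /* Illustrative only. */
    struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock); /* disables preemption */

    if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
            drain_stock(stock);
            clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
    }
    put_cpu_var(memcg_stock);                                  /* re-enables preemption */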
     
  • * 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback:
    squeeze max-pause area and drop pass-good area

    Linus Torvalds
     

24 Aug, 2011

1 commit


20 Aug, 2011

6 commits

  • Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
    partial pages. The partial page list is used in slab_free() to avoid
    taking the per-node lock.

    In __slab_alloc() we can then take multiple partial pages off the per
    node partial list in one go reducing node lock pressure.

    We can also use the per cpu partial list in slab_alloc() to avoid scanning
    partial lists for pages with free objects.

    The main effect of a per cpu partial list is that the per node list_lock
    is taken for batches of partial pages instead of individual ones.

    Potential future enhancements:

    1. The pickup from the partial list could perhaps be done without disabling
    interrupts with some work. The free path already puts the page into the
    per cpu partial list without disabling interrupts.

    2. __slab_free() may have some code paths that could use optimization.

    Performance:

    Command                           Time before   Time after
    ./hackbench 100 process 200000       1953.047     1564.614
    ./hackbench 100 process 20000         207.176      156.940
    ./hackbench 100 process 20000         204.468      156.940
    ./hackbench 100 process 20000         204.879      158.772
    ./hackbench 10 process 20000           20.153       15.853
    ./hackbench 10 process 20000           20.153       15.986
    ./hackbench 10 process 20000           19.363       16.111
    ./hackbench 1 process 20000             2.518        2.307
    ./hackbench 1 process 20000             2.258        2.339
    ./hackbench 1 process 20000             2.864        2.163

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
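
    A hedged sketch of the data-structure change (the field names mirror the
    commit's description of kmem_cache_cpu; details are simplified): the
    spare space in the per-cpu structure holds a chain of partial pages that
    is refilled or flushed in batches, so n->list_lock is taken far less
    often.

    /* Illustrative, not the exact upstream definition. */
    struct kmem_cache_cpu {
            void **freelist;        /* pointer to the next available object */
            unsigned long tid;      /* globally unique transaction id */
            struct page *page;      /* slab we are currently allocating from */
            struct page *partial;   /* per-cpu chain of partially used slabs,
                                       taken off the node list in one batch */
    };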
     
  • There is no need anymore to return the pointer to a slab page from get_partial()
    since the page reference can be stored in the kmem_cache_cpu structure's "page" field.

    Return an object pointer instead.

    That in turn allows a simplification of the spaghetti code in __slab_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Pass the kmem_cache_cpu pointer to get_partial(). That way
    we can avoid the this_cpu_write() statements.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • inuse will always be set to page->objects. There is no point in
    initializing the field to zero in new_slab() and then overwriting
    the value in __slab_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Two statements in __slab_alloc() do not have any effect.

    1. c->page is already set to NULL by deactivate_slab() called right before.

    2. gfpflags are masked in new_slab() before being passed to the page
    allocator. There is no need to mask gfpflags in __slab_alloc() in
    particular since the most frequent processing in __slab_alloc() does not
    require the use of a gfpmask.

    Cc: torvalds@linux-foundation.org
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • There are two situations in which slub holds a lock while releasing
    pages:

    A. During kmem_cache_shrink()
    B. During kmem_cache_close()

    For A, build a list while holding the lock and then release the pages
    later. In case of B, we are the last remaining user of the slab, so
    there is no need to take the list_lock.

    After this patch all calls to the page allocator to free pages are
    done without holding any spinlocks. kmem_cache_destroy() will still
    hold the slub_lock semaphore.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
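
    A generic sketch of the pattern used for case A (the names are
    illustrative, not the exact SLUB code): empty slabs are detached onto a
    private list while the lock is held and handed back to the page allocator
    only after the lock is dropped.

    LIST_HEAD(discard);
    unsigned long flags;

    spin_lock_irqsave(&n->list_lock, flags);
    list_for_each_entry_safe(page, tmp, &n->partial, lru) {
            if (!page->inuse) {
                    list_move(&page->lru, &discard);  /* unlink under the lock */
                    n->nr_partial--;
            }
    }
    spin_unlock_irqrestore(&n->list_lock, flags);

    /* No spinlock held while calling into the page allocator. */
    list_for_each_entry_safe(page, tmp, &discard, lru)
            discard_slab(s, page);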
     

19 Aug, 2011

1 commit

  • Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
    introduce max-pause and pass-good dirty limits") and make the max-pause
    area smaller and safe.

    This fixes ~30% performance regression in the ext3 data=writeback
    fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
    12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

    Using the deadline scheduler also shows a regression, but not as big as
    with CFQ, so this suggests we have some write starvation.

    The test logs show that

    - the disks are sometimes under utilized

    - global dirty pages sometimes rush high into the pass-good area for
    several hundred seconds, while in the meantime some bdi dirty pages
    drop to a very low value (bdi_dirty << bdi_thresh). Then suddenly the
    global dirty pages drop under the global dirty threshold and bdi_dirty
    rushes very high (for example, 2 times higher than bdi_thresh), during
    which time balance_dirty_pages() is not called at all.

    So the problems are

    1) The random writes progress so slowly that they break the assumption of
    the max-pause logic that "8 pages per 200ms is typically more than
    enough to curb heavy dirtiers".

    2) The max-pause logic ignores task_bdi_thresh and thus opens the possibility
    for some bdis to over-dirty pages, leading to (bdi_dirty >> bdi_thresh)
    and then (bdi_thresh >> bdi_dirty) for others.

    3) The higher max-pause/pass-good thresholds somehow lead to a bad
    swing of dirty pages.

    The fix is to allow the task to dirty slightly over task_bdi_thresh, but
    never to exceed bdi_dirty and/or the global dirty_thresh.

    Tests show that it fixed the JBOD regression completely (both behavior
    and performance), while still being able to cut down large pause times
    in balance_dirty_pages() for single-disk cases.

    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
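
    A hedged sketch of the fixed condition (variable names follow the commit
    text; MAX_PAUSE_SLACK is a hypothetical placeholder and the real
    balance_dirty_pages() logic is more involved): a task may skip the pause
    only while it stays close to task_bdi_thresh and below both the bdi and
    the global limits.

    /* Illustrative condition only. */
    if (nr_dirty < dirty_thresh &&
        bdi_dirty < bdi_thresh &&
        bdi_dirty < task_bdi_thresh + MAX_PAUSE_SLACK)
            break;   /* slightly over task_bdi_thresh: no pause needed */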
     

18 Aug, 2011

1 commit