06 Oct, 2012

1 commit


03 Oct, 2012

5 commits

  • Pull frontswap update from Konrad Rzeszutek Wilk:
    "Features:
    - Support exclusive get if backend is capable.
    Bug-fixes:
    - Fix compile warnings
    - Add comments/cleanup doc
    - Fix wrong if condition"

    * tag 'stable/for-linus-3.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: support exclusive gets if tmem backend is capable
    mm: frontswap: fix a wrong if condition in frontswap_shrink
    mm/frontswap: fix uninit'ed variable warning
    mm/frontswap: cleanup doc and comment error
    mm: frontswap: remove unneeded headers

    Linus Torvalds
     
  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c split-off - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
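
    The fdget()/fdput() helpers listed above replace the common
    fget_light()/fput_light() pattern. A minimal sketch of the conversion
    (the syscall-like wrapper functions are invented for illustration):

    #include <linux/file.h>
    #include <linux/fs.h>

    /* Old pattern: fget_light() hands back a struct file plus a flag
     * telling the caller whether fput() is needed. */
    static long example_sync_old(unsigned int fd)
    {
            int fput_needed;
            struct file *file = fget_light(fd, &fput_needed);
            long ret = -EBADF;

            if (file) {
                    ret = vfs_fsync(file, 0);
                    fput_light(file, fput_needed);
            }
            return ret;
    }

    /* New pattern: struct fd bundles the file pointer and the flag. */
    static long example_sync_new(unsigned int fd)
    {
            struct fd f = fdget(fd);
            long ret = -EBADF;

            if (f.file) {
                    ret = vfs_fsync(f.file, 0);
                    fdput(f);
            }
            return ret;
    }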
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netprioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer that executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
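
    A minimal sketch of the mod_delayed_work() change described above (the
    work item and callback are invented for illustration; the old
    cancel+queue pattern is shown only for contrast):

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    static void poll_fn(struct work_struct *work)
    {
            /* ... periodic work ... */
    }
    static DECLARE_DELAYED_WORK(poll_work, poll_fn);

    /* Before: with no mod_timer() counterpart, callers cancelled and
     * requeued, or open-coded a timer plus a work item. */
    static void reschedule_poll_old(unsigned long delay_ms)
    {
            cancel_delayed_work(&poll_work);
            queue_delayed_work(system_wq, &poll_work,
                               msecs_to_jiffies(delay_ms));
    }

    /* After: mod_delayed_work() adjusts a pending item's timeout in one
     * call, and since all workqueues are now non-reentrant,
     * flush_delayed_work() is as strong as the old _sync() variant. */
    static void reschedule_poll_new(unsigned long delay_ms)
    {
            mod_delayed_work(system_wq, &poll_work,
                             msecs_to_jiffies(delay_ms));
    }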
     

02 Oct, 2012

2 commits

  • Pull RCU changes from Ingo Molnar:

    0. 'idle RCU':

    Adds RCU APIs that allow non-idle tasks to enter RCU idle mode and
    provides x86 code to make use of them, allowing RCU to treat
    user-mode execution as an extended quiescent state when the new
    RCU_USER_QS kernel configuration parameter is specified. (Work is
    in progress to port this to a few other architectures, but is not
    part of this series.)

    1. A fix for a latent bug that has been in RCU ever since the addition
    of CPU stall warnings. This bug results in false-positive stall
    warnings, but thus far only on embedded systems with severely
    cut-down userspace configurations.

    2. Further reductions in latency spikes for huge systems, along with
    additional boot-time adaptation to the actual hardware.

    This is a large change, as it moves RCU grace-period initialization
    and cleanup, along with quiescent-state forcing, from softirq to a
    kthread. However, it appears to be in quite good shape (famous
    last words).

    3. Updates to documentation and rcutorture, the latter category
    including keeping statistics on CPU-hotplug latencies and fixing
    some initialization-time races.

    4. CPU-hotplug fixes and improvements.

    5. Idle-loop fixes that were omitted on an earlier submission.

    6. Miscellaneous fixes and improvements

    In certain RCU configurations new kernel threads will show up (rcu_bh,
    rcu_sched), showing RCU processing overhead.

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
    rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling
    rcu: Userspace RCU extended QS selftest
    x86: Exit RCU extended QS on notify resume
    x86: Use the new schedule_user API on userspace preemption
    rcu: Exit RCU extended QS on user preemption
    rcu: Exit RCU extended QS on kernel preemption after irq/exception
    x86: Exception hooks for userspace RCU extended QS
    x86: Unspaghettize do_general_protection()
    x86: Syscall hooks for userspace RCU extended QS
    rcu: Switch task's syscall hooks on context switch
    rcu: Ignore userspace extended quiescent state by default
    rcu: Allow rcu_user_enter()/exit() to nest
    rcu: Settle config for userspace extended quiescent state
    rcu: Make RCU_FAST_NO_HZ handle adaptive ticks
    rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs
    rcu: New rcu_user_enter() and rcu_user_exit() APIs
    ia64: Add missing RCU idle APIs on idle loop
    xtensa: Add missing RCU idle APIs on idle loop
    score: Add missing RCU idle APIs on idle loop
    parisc: Add missing RCU idle APIs on idle loop
    ...

    Linus Torvalds
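
    For the "Add missing RCU idle APIs on idle loop" commits above, the
    per-architecture change follows roughly this shape (a simplified
    sketch, not any one architecture's actual cpu_idle() loop):

    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    void cpu_idle_sketch(void)
    {
            while (1) {
                    /* Tell RCU this CPU is entering an extended quiescent
                     * state so grace periods need not wait on it. */
                    rcu_idle_enter();
                    while (!need_resched())
                            cpu_relax();    /* or the arch's low-power wait */
                    rcu_idle_exit();

                    schedule_preempt_disabled();
            }
    }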
     
  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup writeback_sb_inodes kerneldoc
    btrfs: fix the comment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • Speculative cache pagecache lookups can elevate the refcount from
    under us, so avoid the false positive. If the refcount is < 2 we'll be
    notified by a VM_BUG_ON in put_page_testzero as there are two
    put_page(src_page) in a row before returning from this function.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Sep, 2012

4 commits


26 Sep, 2012

1 commit


23 Sep, 2012

1 commit


21 Sep, 2012

2 commits

  • Tmem, as originally specified, assumes that "get" operations
    performed on persistent pools never flush the page of data out
    of tmem on a successful get, waiting instead for a flush
    operation. This is intended to mimic the model of a swap
    disk, where a disk read is non-destructive. Unlike a
    disk, however, freeing up the RAM can be valuable. Over
    the years that frontswap was in the review process, several
    reviewers (and notably Hugh Dickins in 2010) pointed out that
    this would result, at least temporarily, in two copies of the
    data in RAM: one (compressed for zcache) copy in tmem,
    and one copy in the swap cache. We wondered if this could
    be done differently, at least optionally.

    This patch allows tmem backends to instruct the frontswap
    code that this backend performs exclusive gets. Zcache2
    already contains hooks to support this feature. Other
    backends are completely unaffected unless/until they are
    updated to support this feature.

    While it is not clear that exclusive gets are a performance
    win on all workloads at all times, this small patch allows for
    experimentation by backends.

    P.S. Let's not quibble about the naming of "get" vs "read" vs
    "load" etc. The naming is currently horribly inconsistent between
    cleancache and frontswap and existing tmem backends, so will need
    to be straightened out as a separate patch. "Get" is used
    by the tmem architecture spec, existing backends, and
    all documentation and presentation material so I am
    using it in this patch.

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
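
    A rough sketch of how a backend would opt in, based on this
    description (the init function name is invented, and the exact
    frontswap call may differ in detail from what finally merged):

    #include <linux/frontswap.h>

    static int __init my_tmem_backend_init(void)
    {
            /*
             * Declare that a successful frontswap get also removes the
             * page from this backend, so frontswap must treat the copy
             * it gets back as the only one (e.g. re-dirty the page)
             * instead of assuming the backend still holds a duplicate.
             */
            frontswap_tmem_exclusive_gets(true);

            /* ... register the backend's struct frontswap_ops as usual ... */
            return 0;
    }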
     
    pages_to_unuse is set to 0 to unuse all frontswap pages,
    but that doesn't happen since a wrong condition in frontswap_shrink
    cancels it.

    -v2: Add comment to explain return value of __frontswap_shrink,
    as suggested by Dan Carpenter, thanks

    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Konrad Rzeszutek Wilk

    Zhenzhong Duan
     

18 Sep, 2012

6 commits

  • There may be a bug when registering section info. For example, on my
    Itanium platform, the pfn range of node0 includes the other nodes, so
    other nodes' section info will be registered twice, and the memmap's
    page count will equal 3.

    node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
    node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
    node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
    node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000

    free_all_bootmem_node()
      register_page_bootmem_info_node()
        register_page_bootmem_info_section()

    When hot-removing memory, we can't free the memmap's page because
    page_count() is 2 after put_page_bootmem().

    sparse_remove_one_section()
      free_section_usemap()
        free_map_bootmem()
          put_page_bootmem()

    [akpm@linux-foundation.org: add code comment]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Yasuaki Ishimatsu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    qiuxishi
     
  • The heuristic method for buddy has been introduced since commit
    43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
    of adjacent buddy lists"). But the page address of higher page's buddy
    was wrongly calculated, which causes page_is_buddy() to fail forever.
    IOW, the heuristic method is effectively disabled by the wrong address
    of the higher page's buddy.

    The page address of the higher page's buddy should be calculated from
    higher_page, using the offset between the higher page's index and the
    index of its buddy.

    Signed-off-by: Haifeng Li
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: KyongHo Cho
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
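
    A sketch of the arithmetic being corrected, paraphrasing the relevant
    lines of __free_one_page() (the wrapper function exists only for
    illustration):

    static void higher_buddy_sketch(struct page *page, unsigned long page_idx,
                                    unsigned long buddy_idx, unsigned int order)
    {
            unsigned long combined_idx = buddy_idx & page_idx;
            struct page *higher_page = page + (combined_idx - page_idx);
            struct page *higher_buddy;

            buddy_idx = __find_buddy_index(combined_idx, order + 1);

            /* Broken: offsets from 'page', whose index is page_idx, not
             * combined_idx, so page_is_buddy() can never match. */
            higher_buddy = page + (buddy_idx - combined_idx);

            /* Fixed: the offset is relative to higher_page, whose index
             * is combined_idx. */
            higher_buddy = higher_page + (buddy_idx - combined_idx);
            (void)higher_buddy;
    }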
     
  • get_partial() is currently not checking pfmemalloc_match() meaning that
    it is possible for pfmemalloc pages to leak to non-pfmemalloc users.
    This is a problem in the following situation. Assume that there is a
    request from normal allocation and there are no objects in the per-cpu
    cache and no node-partial slab.

    In this case, slab_alloc enters the slow path and new_slab_objects() is
    called which may return a PFMEMALLOC page. As the current user is not
    allowed to access PFMEMALLOC page, deactivate_slab() is called
    ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
    checks]) and returns an object from PFMEMALLOC page.

    Next time, when we get another request from normal allocation,
    slab_alloc() enters the slow-path and calls new_slab_objects(). In
    new_slab_objects(), we call get_partial() and get a partial slab which
    was just deactivated but is a pfmemalloc page. We extract one object
    from it and re-deactivate.

    "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly.

    As a result, access to the PFMEMALLOC page is not properly restricted
    and the frequent deactivation can cause a performance degradation.

    This patch changes get_partial_node() to take pfmemalloc_match() into
    account and prevents the "deactivate -> re-get in get_partial()"
    scenario. Instead, new_slab() is called.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
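
    The shape of the change described above is to skip unusable slabs
    while walking the node's partial list; a paraphrased fragment, not the
    verbatim kernel code:

    /* In get_partial_node(), now also given the allocation's gfp flags: */
    list_for_each_entry_safe(page, page2, &n->partial, lru) {
            /* Never hand a pfmemalloc slab to a normal allocation;
             * skip it so new_slab() is used instead. */
            if (!pfmemalloc_match(page, flags))
                    continue;

            /* ... acquire_slab() and proceed as before ... */
    }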
     
    In the array cache there is an object at index 0 as well, so check it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Right now, we call ClearSlabPfmemalloc() for first page of slab when we
    clear SlabPfmemalloc flag. This is fine for most swap-over-network use
    cases as it is expected that order-0 pages are in use. Unfortunately it
    is possible that __ac_put_obj() checks SlabPfmemalloc on a tail
    page and while this is harmless, it is sloppy. This patch ensures that
    the head page is always used.

    This problem was originally identified by Joonsoo Kim.

    [js1304@gmail.com: Original implementation and problem identification]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
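
    The essence of the fix, as described, is to resolve an object to its
    slab's head page before testing the flag; a paraphrased fragment,
    with helper names following the existing pfmemalloc code in mm/slab.c:

    /* Was virt_to_page(objp): on a high-order slab that can return a
     * tail page, while the SlabPfmemalloc state lives on the head page. */
    struct page *page = virt_to_head_page(objp);

    if (PageSlabPfmemalloc(page))
            set_obj_pfmemalloc(&objp);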
     
    If kthread_run() fails, pgdat->kswapd contains an errno value. When we
    stop this thread, we only check whether pgdat->kswapd is NULL before
    accessing it; if it holds an errno value, that access causes a page
    fault. Resetting pgdat->kswapd to NULL when kernel thread creation
    fails avoids this problem.

    Signed-off-by: Wen Congyang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
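
    A sketch of the fixed kswapd_run() error path (paraphrased; the
    pr_err text is illustrative):

    int kswapd_run(int nid)
    {
            pg_data_t *pgdat = NODE_DATA(nid);
            int ret = 0;

            if (pgdat->kswapd)
                    return 0;

            pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
            if (IS_ERR(pgdat->kswapd)) {
                    /* failure at boot is fatal */
                    BUG_ON(system_state == SYSTEM_BOOTING);
                    pr_err("Failed to start kswapd on node %d\n", nid);
                    ret = PTR_ERR(pgdat->kswapd);
                    /* Don't leave an ERR_PTR behind: kswapd_stop() only
                     * checks for NULL before calling kthread_stop(). */
                    pgdat->kswapd = NULL;
            }
            return ret;
    }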
     

15 Sep, 2012

2 commits

  • Pull a core sparse warning fix from Ingo Molnar

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm/memblock: Use NULL instead of 0 for pointers

    Linus Torvalds
     
  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Having users rely on separate hierarchies
    and expect completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
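
    From a controller's point of view the opt-out is a single flag; a
    minimal sketch (the subsystem name is invented and the other ops are
    elided):

    #include <linux/cgroup.h>

    struct cgroup_subsys example_subsys = {
            .name             = "example",
            /* .create, .destroy and the rest of the ops as usual ... */

            /*
             * Nesting is not handled properly yet; with this set the
             * cgroup core prints a one-time warning when a child cgroup
             * is created for this controller.
             */
            .broken_hierarchy = true,
    };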
     

07 Sep, 2012

1 commit

  • Trivially triggerable, found by trinity:

    kernel BUG at mm/mempolicy.c:2546!
    Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
    Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
    RIP: mpol_to_str+0x156/0x360

    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     

05 Sep, 2012

1 commit


30 Aug, 2012

1 commit

    cache_grow() can re-enable irqs, so the cpu (and node) can change; ensure
    that we take list_lock on the correct nodelist.

    This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
    knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
    node was taken after growing the cache.

    Reported-and-tested-by: Haggai Eran
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Aug, 2012

1 commit


26 Aug, 2012

1 commit

  • Pull block-related fixes from Jens Axboe:

    - Improvements to the buffered and direct write IO plugging from
    Fengguang.

    - Abstract out the mapping of a bio in a request, and use that to
    provide a blk_bio_map_sg() helper. Useful for mapping just a bio
    instead of a full request.

    - Regression fix from Hugh, fixing up a patch that went into the
    previous release cycle (and marked stable, too) attempting to prevent
    a loop in __getblk_slow().

    - Updates to discard requests, fixing up the sizing and how we align
    them. Also a change to disallow merging of discard requests, since
    that doesn't really work properly yet.

    - A few drbd fixes.

    - Documentation updates.

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: replace __getblk_slow misfix by grow_dev_page fix
    drbd: Write all pages of the bitmap after an online resize
    drbd: Finish requests that completed while IO was frozen
    drbd: fix drbd wire compatibility for empty flushes
    Documentation: update tunable options in block/cfq-iosched.txt
    Documentation: update tunable options in block/cfq-iosched.txt
    Documentation: update missing index files in block/00-INDEX
    block: move down direct IO plugging
    block: remove plugging at buffered write time
    block: disable discard request merge temporarily
    bio: Fix potential memory leak in bio_find_or_create_slab()
    block: Don't use static to define "void *p" in show_partition_start()
    block: Add blk_bio_map_sg() helper
    block: Introduce __blk_segment_map_sg() helper
    fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
    block: split discard into aligned requests
    block: reorganize rounding of max_discard_sectors

    Linus Torvalds
     

25 Aug, 2012

1 commit

  • Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.

    $ size vmlinux.o
    text data bss dec hex filename
    4658782 880729 5195032 10734543 a3cbcf vmlinux.o
    $ size vmlinux.o
    text data bss dec hex filename
    4658957 880729 5195032 10734718 a3cc7e vmlinux.o

    v7:
    - checkpatch warnings fixed
    - Implement the changes requested by Hugh Dickins:
    - make simple_xattrs_init and simple_xattrs_free inline
    - get rid of locking and list reinitialization in simple_xattrs_free,
    they're not needed
    v6:
    - no changes
    v5:
    - no changes
    v4:
    - move simple_xattrs_free() to fs/xattr.c
    v3:
    - in kmem_xattrs_free(), reinitialize the list
    - use simple_xattr_* prefix
    - introduce simple_xattr_add() to prevent direct list usage

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Acked-by: Hugh Dickins
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
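
    A rough sketch of the extracted API from a consumer's point of view
    (exact signatures may differ slightly; the attribute name and value
    are purely illustrative):

    #include <linux/xattr.h>

    static void simple_xattr_demo(void)
    {
            struct simple_xattrs xattrs;
            char buf[16];

            simple_xattrs_init(&xattrs);

            /* flags follow setxattr(2): XATTR_CREATE / XATTR_REPLACE / 0 */
            simple_xattr_set(&xattrs, "trusted.example", "42", 2, XATTR_CREATE);
            simple_xattr_get(&xattrs, "trusted.example", buf, sizeof(buf));

            simple_xattrs_free(&xattrs);
    }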
     

24 Aug, 2012

1 commit


22 Aug, 2012

6 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high while the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
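
    The helper's shape, as described above, is roughly the following (a
    paraphrase, not the verbatim mm/compaction.c code):

    /* Returns true with the lock held, or false if async compaction
     * should abort because the lock is contended or we need to resched. */
    static bool compact_checklock_irqsave(spinlock_t *lock,
                                          unsigned long *flags,
                                          bool locked,
                                          struct compact_control *cc)
    {
            if (need_resched() || spin_is_contended(lock)) {
                    if (locked) {
                            spin_unlock_irqrestore(lock, *flags);
                            locked = false;
                    }

                    /* Async compaction aborts rather than waits; the
                     * caller sees cc->contended and fails the THP
                     * allocation instead of thrashing on the lock. */
                    if (!cc->sync) {
                            cc->contended = true;
                            return false;
                    }

                    cond_resched();
            }

            if (!locked)
                    spin_lock_irqsave(lock, *flags);
            return true;
    }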
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount work the free
    page scanner does in compaction. However, it has a problem. Consider
    two processes simultaneously scanning free pages

    C
    Process A M S F
    |---------------------------------------|
    Process B M FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

    if (cc->order > 0 && (!cc->wrapped ||
                          zone->compact_cached_free_pfn >
                          cc->start_free_pfn))
            pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong update to compact_cached_free_pfn which is just
    random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolated pages it
    checks if compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest
    pageblock that pages were isolated from. This value will not
    be updated again until a free page scanner wraps and resets
    compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
    ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
    setting, but it missed some places the pfmemalloc should be set.

    So, in __slab_alloc, the mismatch between page->pfmemalloc and
    ALLOC_NO_WATERMARKS causes an incorrect deactivate_slab() on our core2
    server:

    64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |---0.34%-- deactivate_slab
    | __slab_alloc
    | kmem_cache_alloc
    | |

    That causes our fio sync write performance to have a 40% regression.

    Move the check into get_page_from_freelist(), which resolves this issue.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Cc: David Miller
    Tested-by: Eric Dumazet
    Tested-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
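
    The essence of the move, paraphrased: set the flag once, at the point
    where get_page_from_freelist() hands the page back, from the
    alloc_flags actually used for that allocation:

    /* In get_page_from_freelist(), after a successful allocation: */
    page = buffered_rmqueue(preferred_zone, zone, order,
                            gfp_mask, migratetype);
    if (page)
            /*
             * Set page->pfmemalloc only when ALLOC_NO_WATERMARKS was
             * really used, so the slab allocators no longer see a stale
             * or mismatched flag.
             */
            page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);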
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed bad deferring policy but made mistake about checking
    compact_order_failed in __compact_pgdat(). So it can't update
    compact_order_failed with the new order. This ends up preventing
    correct operation of policy deferral. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
    that not all the page tables allocated could be found and freed when
    exit_mmap() tore down the user address space.

    There's usually nothing we can say about it, beyond that it's probably a
    sign of some bad memory or memory corruption; though it might still
    indicate a bug in vma or page table management (and did recently reveal a
    race in THP, fixed a few months ago).

    But one overdue change we can make is from BUG_ON to WARN_ON.

    It's fairly likely that the system will crash shortly afterwards in some
    other way (for example, the BUG_ON(page_mapped(page)) in
    __delete_from_page_cache(), once an inode mapped into the lost page tables
    gets evicted); but might tell us more before that.

    Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less
    eager, since that one has several times led to fixes.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Initializers for deferrable delayed_work are confusingly named.

    * __DEFERRED_WORK_INITIALIZER()
    * DECLARE_DEFERRED_WORK()
    * INIT_DELAYED_WORK_DEFERRABLE()

    Rename them to

    * __DEFERRABLE_WORK_INITIALIZER()
    * DECLARE_DEFERRABLE_WORK()
    * INIT_DEFERRABLE_WORK()

    This patch doesn't cause any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
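
    After the rename, deferrable work items are declared and initialized
    like this (a minimal sketch; the reaper names are invented):

    #include <linux/workqueue.h>

    static void reap_fn(struct work_struct *work)
    {
            /* periodic, deferrable housekeeping */
    }

    /* Static definition, new name (was DECLARE_DEFERRED_WORK): */
    static DECLARE_DEFERRABLE_WORK(reap_work, reap_fn);

    /* Runtime initialization, new name (was INIT_DELAYED_WORK_DEFERRABLE): */
    static struct delayed_work late_reap_work;

    static void reaper_setup(void)
    {
            INIT_DEFERRABLE_WORK(&late_reap_work, reap_fn);
            queue_delayed_work(system_wq, &reap_work, HZ);
    }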
     

21 Aug, 2012

1 commit

  • This patch fixes:

    https://bugzilla.redhat.com/show_bug.cgi?id=843640

    If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
    does unmap_region() but does not remove the soon-to-be-freed vma
    from rb tree. Actually there are more problems but this is how
    William noticed this bug.

    Perhaps we could do do_munmap() + return in this case, but in
    fact it is simply wrong to abort if uprobe_mmap() fails. Until
    at least we move the !UPROBE_COPY_INSN code from
    install_breakpoint() to uprobe_register().

    For example, uprobe_mmap()->install_breakpoint() can fail if the
    probed insn is not supported (remember, uprobe_register()
    succeeds if nobody mmaps inode/offset), mmap() should not fail
    in this case.

    dup_mmap()->uprobe_mmap() is wrong too by the same reason,
    fork() can race with uprobe_register() and fail for no reason if
    it wins the race and does install_breakpoint() first.

    And, if nothing else, both mmap_region() and dup_mmap() return
    success if uprobe_mmap() fails. Change them to ignore the error
    code from uprobe_mmap().

    Reported-and-tested-by: William Cohen
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Cc: # v3.5
    Cc: Anton Arapov
    Cc: William Cohen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
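
    The mmap_region() side of the change, paraphrased (dup_mmap() gets
    the same treatment):

    /* Before: a failed breakpoint insertion aborted the whole mmap(). */
    if (file && uprobe_mmap(vma))
            goto unmap_and_free_vma;

    /* After: probe-insertion failures are not fatal to mmap(). */
    if (file)
            uprobe_mmap(vma);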
     

14 Aug, 2012

1 commit