18 Nov, 2011

2 commits

  • Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-gntalloc: signedness bug in add_grefs()
    xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
    xen-gntdev: integer overflow in gntdev_alloc_map()
    xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.
    xen/balloon: Avoid OOM when requesting highmem
    xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI
    xen: map foreign pages for shared rings by updating the PTEs directly

    Linus Torvalds
     
  • Merge branch 'for-linus' of git://git.kernel.dk/linux-block

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add missed trace_block_plug
    paride: fix potential information leak in pg_read()
    bio: change some signed vars to unsigned
    block: avoid unnecessary plug list flush
    cciss: auto engage SCSI mid layer at driver load time
    loop: cleanup set_status interface
    include/linux/bio.h: use a static inline function for bio_integrity_clone()
    loop: prevent information leak after failed read
    block: Always check length of all iov entries in blk_rq_map_user_iov()
    The Windows driver .inf disables ASPM on all cciss devices. Do the same.
    backing-dev: ensure wakeup_timer is deleted
    block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"

    Linus Torvalds
     

17 Nov, 2011

3 commits

  • They are not used any more.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
    The sleep-based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
    for each dirtied 4KB page, which means it cannot throttle a task below
    4KB/200ms = 20KB/s. So when more than 512 dd tasks are writing to a
    10MB/s USB stick, its bdi dirty pages can grow out of control.

    Even if we could increase MAX_PAUSE, the minimal rate
    (task_ratelimit = 1) still means a floor of 4KB/s.

    They can eventually be safeguarded by the global limit check
    (nr_dirty < dirty_thresh). However, if someone is also writing to an
    HDD at the same time, they will see poor HDD write performance.

    We at least want to maintain good write performance for the other
    devices when one device is attacked by some "massively parallel"
    workload, suffers from slow write bandwidth, or somehow gets stalled
    by some error condition (e.g. an NFS server not responding).

    For a stalled device, we need to completely block its dirtiers, too,
    before its bdi dirty pages grow all the way up to the global limit and
    leave no space for the other functional devices.

    So change the loop exit condition to

        /*
         * Always enforce global dirty limit; also enforce bdi dirty limit
         * if the normal max_pause sleeps cannot keep things under control.
         */
        if (nr_dirty < dirty_thresh &&
            (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
                break;

    which can be further simplified to

        if (task_ratelimit)
                break;

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

16 Nov, 2011

3 commits

    There is no reason why a task in balance_dirty_pages() shouldn't be
    killable, and it helps in recovering from some error conditions (like
    when a filesystem goes into an error state and cannot accept writeback
    anymore, but we still want to be able to kill the processes using it
    so that it can be unmounted). A sketch of the check follows below.

    There will be follow-up patches to further abort generic_perform_write()
    and other filesystem write loops, to avoid a large write + SIGKILL
    combination exceeding the dirty limit and possibly causing a strange OOM.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
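
    A minimal sketch of the change described above (placement within the
    throttle loop is an assumption, not the literal diff):

        /* in balance_dirty_pages(), around the throttle sleep */
        __set_current_state(TASK_KILLABLE);
        io_schedule_timeout(pause);

        /* let a SIGKILLed task exit instead of keeping it throttled */
        if (fatal_signal_pending(current))
                break;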
     
  • If we fail to prepare an anon_vma, the {new, old}_page should be released,
    or they will leak.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    Commit c9f01245 ("oom: remove oom_disable_count") removed the
    oom_disable_count counter, which had been used to break out of
    oom_badness() early so that a task with oom_score_adj set to
    OOM_SCORE_ADJ_MIN (OOM disabled) could never be selected.

    Now that the counter is gone, we always go through the heuristic
    calculation and always return a positive value. This means we can end
    up killing a task with OOM disabled, because it is indistinguishable
    from a regular task using 1% of memory (or a CAP_SYS_ADMIN task using
    3%), or from a task with oom_score_adj set but OOM enabled.

    Let's break out early if the task should have OOM disabled, as
    sketched below.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
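
    A sketch of the early break-out (assuming the 3.2-era oom_badness()
    shape, where returning 0 means "never select this task"):

        /* in oom_badness(), before the memory-usage heuristics */
        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                return 0;       /* OOM disabled: never pick this task */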
     

11 Nov, 2011

1 commit

    bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the following dump captured after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destroy(); a sketch
    follows at the end of this entry.

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---

    Cc: stable@kernel.org
    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe

    Rabin Vincent
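
    A sketch of the fix (the surrounding bdi_destroy() body is elided;
    the wakeup_timer lives in the embedded bdi_writeback struct):

        void bdi_destroy(struct backing_dev_info *bdi)
        {
                /*
                 * __mark_inode_dirty() may have re-armed the timer after
                 * bdi_unregister()'s del_timer_sync(), so kill it again
                 * now that the bdi is going away for good.
                 */
                del_timer_sync(&bdi->wb.wakeup_timer);
                /* ... rest of bdi_destroy() ... */
        }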
     

07 Nov, 2011

3 commits

    In balance_dirty_pages(), task_ratelimit may be used uninitialized by
    the tracing hook, because its initialization is skipped by the
    goto pause.

    Fix it by moving the task_ratelimit assignment before the goto pause,
    as sketched below.

    Reported-by: Witold Baryluk
    Signed-off-by: Wu Fengguang

    Wu Fengguang
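
    Roughly, the fix hoists the assignment above the early exit (a
    sketch; the real function computes more intermediate values):

        /* compute task_ratelimit before any path that jumps to pause:,
         * so the tracepoint there never sees an uninitialized value */
        task_ratelimit = (u64)dirty_ratelimit *
                                pos_ratio >> RATELIMIT_CALC_SHIFT;
        pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
        if (unlikely(pause <= 0)) {
                pause = 1;      /* avoid a zero-length sleep */
                goto pause;
        }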
     
  • Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

    * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     
  • Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

    * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

05 Nov, 2011

1 commit

  • Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block

    * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

03 Nov, 2011

10 commits

  • Says Andrew:

    "60 patches. That's good enough for -rc1 I guess. I have quite a lot
    of detritus to be rechecked, work through maintainers, etc.

    - most of the remains of MM
    - rtc
    - various misc
    - cgroups
    - memcg
    - cpusets
    - procfs
    - ipc
    - rapidio
    - sysctl
    - pps
    - w1
    - drivers/misc
    - aio"

    * akpm: (60 commits)
    memcg: replace ss->id_lock with a rwlock
    aio: allocate kiocbs in batches
    drivers/misc/vmw_balloon.c: fix typo in code comment
    drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
    w1: disable irqs in critical section
    drivers/w1/w1_int.c: multiple masters used same init_name
    drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
    drivers/power/ds2780_battery.c: add a nolock function to w1 interface
    drivers/power/ds2780_battery.c: create central point for calling w1 interface
    w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
    pps gpio client: add missing dependency
    pps: new client driver using GPIO
    pps: default echo function
    include/linux/dma-mapping.h: add dma_zalloc_coherent()
    sysctl: make CONFIG_SYSCTL_SYSCALL default to n
    sysctl: add support for poll()
    RapidIO: documentation update
    drivers/net/rionet.c: fix ethernet address macros for LE platforms
    RapidIO: fix potential null deref in rio_setup_device()
    RapidIO: add mport driver for Tsi721 bridge
    ...

    Linus Torvalds
     
  • warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
    Various code in memcontrol.c calls this_cpu_read() on calculations
    combining two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the writes go to
    the correct places.

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

        charge:                          putback:
        SetPageCgroupUsed                SetPageLRU
        PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial;
    otherwise the charge may observe !PageLRU while the putback observes
    !PageCgroupUsed, and the page is not linked to the memcg LRU at all
    (an illustrative ordering sketch follows below).

    Global memory pressure may fix this by trying to isolate and put back
    the page for reclaim, where that putback would link it to the memcg LRU
    again. Without that, the memory cgroup is undeletable due to a charge
    whose physical page cannot be found and moved out.

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
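
    An illustrative sketch of the crucial ordering (the helper
    add_page_to_memcg_lru() is hypothetical; the pattern is "set my
    flag, then test the other side's flag", with a barrier so at least
    one side must observe the other's store):

        /* charge side */
        SetPageCgroupUsed(pc);
        smp_mb();
        if (PageLRU(page))
                add_page_to_memcg_lru(page);

        /* putback side */
        SetPageLRU(page);
        smp_mb();
        if (PageCgroupUsed(pc))
                add_page_to_memcg_lru(page);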
     
    Reclaim decides to skip scanning an active list when the corresponding
    inactive list is above a certain relative size, in order to leave the
    assumed working set alone while there are still enough reclaim
    candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is [un|partly]initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
    Before calling schedule_timeout(), the task state should be changed
    (the idiom is sketched below).

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
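
    The idiom in question, sketched:

        /* buggy: still TASK_RUNNING, so this does not actually sleep */
        schedule_timeout(HZ);

        /* fixed: declare how we intend to sleep first */
        __set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(HZ);

        /* or, equivalently: */
        schedule_timeout_interruptible(HZ);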
     
  • The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
    "struct mem_cgroup *memcg". Rename all mem variables to memcg in source
    file.

    Signed-off-by: Raghavendra K T
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     
    When the cgroup base was allocated with kmalloc(), it was necessary to
    annotate the variable with kmemleak_not_leak(). But it has recently
    been changed to be allocated with alloc_page(), which kmemleak does not
    track, and the stale annotation now causes a warning on boot up.

    I was triggering this output:

    allocated 8388608 bytes of page_cgroup
    please try 'cgroup_disable=memory' option if you don't want memory cgroups
    kmemleak: Trying to color unknown object at 0xf5840000 as Grey
    Pid: 0, comm: swapper Not tainted 3.0.0-test #12
    Call Trace:
    [] ? printk+0x1d/0x1f
    [] paint_ptr+0x4f/0x78
    [] kmemleak_not_leak+0x58/0x7d
    [] ? __rcu_read_unlock+0x9/0x7d
    [] kmemleak_init+0x19d/0x1e9
    [] start_kernel+0x346/0x3ec
    [] ? loglevel+0x18/0x18
    [] i386_start_kernel+0xaa/0xb0

    After a bit of debugging I tracked the object 0xf5840000 (and others)
    down to the cgroup code. With the change from kmalloc() to
    alloc_page(), the base allocation no longer goes through
    kmemleak_alloc(), which would have added the pointer to the
    object_tree_root; kmemleak_not_leak(), however, still adds it to the
    crt_early_log[] table. On kmemleak_init(), the entry is found in the
    early_log[] but not in the object_tree_root, and this error message is
    displayed.

    If alloc_page() fails, the code defaults back to vmalloc(), which still
    goes through kmemleak_alloc(), so the kmemleak_not_leak() call is still
    needed there. The solution is to call kmemleak_alloc() directly when
    alloc_page() succeeds, as sketched below.

    Reviewed-by: Michal Hocko
    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Jonathan Nieder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
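
    A sketch of the resulting allocation path (modeled on the
    mm/page_cgroup.c allocator of that era; names and flags here are
    assumptions):

        static void *alloc_page_cgroup(size_t size, int nid)
        {
                void *addr;

                addr = alloc_pages_exact_nid(nid, size,
                                             GFP_KERNEL | __GFP_NOWARN);
                if (addr) {
                        /* page allocations are invisible to kmemleak;
                         * register the object by hand */
                        kmemleak_alloc(addr, size, 1, GFP_KERNEL);
                        return addr;
                }
                /* the vmalloc fallback is already tracked by kmemleak */
                return vzalloc_node(size, nid);
        }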
     
    Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize
    with page_cache_get_speculative() during the speculative radix tree
    lookups that use get_page_unless_zero() in SMP, if the radix tree page
    is freed and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call
    get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by
    get_user_pages(). That wasn't entirely safe, because the two
    atomic_incs in get_page() weren't atomic as a pair. By contrast, other
    get_user_pages() users, like the secondary-MMU page faults that
    establish shadow pagetables, never call any superfluous get_page()
    after get_user_pages() returns. It's safer to make get_page()
    universally safe for tail pages and to use get_page_foll() within
    follow_page() (inside get_user_pages()). get_page_foll() can safely do
    the refcounting for tail pages without taking any locks, because it
    runs within PT-lock-protected critical sections (the PT lock for ptes
    and page_table_lock for pmd_trans_huge); a simplified sketch follows
    below.

    The standard get_page() as invoked by direct-io will instead take the
    compound_lock, but still only for tail pages. The direct-io paths are
    usually I/O bound, and the compound_lock is per-THP and therefore very
    fine-grained, so there's no risk of scalability issues with it. A
    simple direct-io benchmark with all the lockdep prove-locking and
    spinlock debugging infrastructure enabled shows identical performance
    and no overhead. So it's worth it. Ideally direct-io should stop
    calling get_page() on pages returned by get_user_pages(). The spinlock
    in get_page() is already optimized away for no-THP builds, and doing
    get_page() on tail pages returned by GUP is generally a rare operation,
    usually run only in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
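
    A simplified sketch of the scheme (the real get_page_foll() adds
    VM_BUG_ONs and shares code with the compound_lock path):

        static inline void get_page_foll(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        /* tail refs are accounted in ->_mapcount and
                         * folded into ->_count when the THP is split */
                        atomic_inc(&page->_mapcount);
                        /* the head pin keeps the whole compound alive */
                        atomic_inc(&page->first_page->_count);
                } else {
                        atomic_inc(&page->_count);
                }
        }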
     

01 Nov, 2011

15 commits

    Avoid false sharing of the vm_stat array (the alignment change is
    sketched below).

    This was found to adversely affect tmpfs I/O performance.

    Tests run on a 640 cpu UV system.

    With 120 threads doing parallel writes, each to different tmpfs mounts:
    No patch: ~300 MB/sec
    With vm_stat alignment: ~430 MB/sec

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
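
    The change amounts to cacheline-aligning the global array (a sketch
    of the declaration in mm/vmstat.c):

        /* give vm_stat its own cacheline(s) so hot counter updates from
         * many CPUs don't false-share with neighboring data */
        atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]
                        ____cacheline_aligned_in_smp;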
     
  • A process spent 30 minutes exiting, just munlocking the pages of a large
    anonymous area that had been alternately mprotected into page-sized vmas:
    for every single page there's an anon_vma walk through all the other
    little vmas to find the right one.

    A general fix for that would be a lot more complicated (use a prio_tree
    on anon_vma?), but there's one very simple thing we can do to speed up
    the common case: if a page to be munlocked is mapped only once, then it
    is our vma that it is mapped into, and there's no need whatever to walk
    through all the others (see the sketch below).

    Okay, there is a very remote race in munlock_vma_pages_range(): if,
    between its follow_page() and lock_page(), another process were to
    munlock the same page, page reclaim were to remove it from our vma, and
    another process were to mlock it again, we would find it with
    page_mapcount 1 yet still mlocked in another process. But never mind,
    that's much less likely than the down_read_trylock() failure which
    munlocking already tolerates (in try_to_unmap_one()): in due course
    page reclaim will discover the page and move it to unevictable instead.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Hugh Dickins
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
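
    A sketch of the fast path (condition simplified; the real check sits
    in the munlock path next to the try_to_munlock() call):

        /*
         * Mapped more than once: another vma may still hold the page
         * mlocked, so do the expensive rmap walk to find out. Mapped
         * exactly once: that mapping is ours, nothing else to check.
         */
        if (page_mapcount(page) > 1)
                try_to_munlock(page);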
     
    There are three calls to update_mmu_cache() in the file, and the one in
    collapse_huge_page() has a typo in its last parameter, which is
    corrected here based on the other two.

    Given how update_mmu_cache() is defined on x86, currently the only arch
    that implements THP, the change has no practical effect there, but it
    could save a minute or two of effort for archs that are likely to
    support THP in the future.

    Signed-off-by: Hillf Danton
    Cc: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    The THP copy-on-write handler falls back to regular-sized pages for a
    huge page replacement upon allocation failure, or if THP has been
    individually disabled in the target VMA. The loop responsible for
    copying page-sized chunks accidentally uses multiples of PAGE_SHIFT
    instead of PAGE_SIZE as the virtual address argument for
    copy_user_highpage(); the corrected loop is sketched below.

    Signed-off-by: Hillf Danton
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
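
    The corrected loop, sketched from the description (loop shape assumed
    from the usual THP fallback code):

        /* copy the huge page in PAGE_SIZE chunks for the fallback */
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                copy_user_highpage(pages[i], page + i,
                                   haddr + PAGE_SIZE * i, /* not PAGE_SHIFT */
                                   vma);
                cond_resched();
        }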
     
    MCL_FUTURE does not move pages between LRU lists, and draining the
    per-cpu LRU pagevecs is a costly activity. Avoid doing it
    unnecessarily; see the sketch below.

    Signed-off-by: Christoph Lameter
    Cc: David Rientjes
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
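
    A sketch of the change (placement in sys_mlockall() assumed):

        /* only MCL_CURRENT moves pages onto the LRU lists right away;
         * MCL_FUTURE alone has nothing to drain */
        if (flags & MCL_CURRENT)
                lru_add_drain_all();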
     
    If compaction can proceed, shrink_zones() stops doing any work, but its
    callers still call shrink_slab(), which raises the priority and
    potentially sleeps. This is unnecessary and wasteful, so this patch
    aborts direct reclaim/compaction entirely if compaction can proceed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Josh Boyer
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When suffering from memory fragmentation due to unfreeable pages, THP page
    faults will repeatedly try to compact memory. Due to the unfreeable
    pages, compaction fails.

    Needless to say, at that point page reclaim also fails to create free
    contiguous 2MB areas. However, that doesn't stop the current code from
    trying, over and over again, and freeing a minimum of 4MB (2UL <<
    sc->order pages) at every single invocation.

    This resulted in my 12GB system having 2-3GB free memory, a corresponding
    amount of used swap and very sluggish response times.

    This can be avoided by having the direct reclaim code not reclaim from
    zones that already have plenty of free memory available for compaction;
    see the sketch below.

    If compaction still fails due to unmovable memory, doing additional
    reclaim will only hurt the system, not help.

    [jweiner@redhat.com: change comment to explain the order check]
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
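
    A sketch of the zone filter (the helper name compaction_ready() is
    an assumption; the test lives in shrink_zones()):

        /* for higher-order allocations, leave a zone alone once it
         * already has enough free memory for compaction to succeed */
        if (COMPACTION_BUILD && sc->order &&
            compaction_ready(zone, sc->order))
                continue;       /* more reclaim here would only hurt */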
     
    When a race between putback_lru_page() and shmem_lock() with lock=0
    happens, program execution order is as follows, but the clear_bit on
    processor #1 could be reordered to right before processor #1's
    spin_unlock. The page would then be stranded on the unevictable list.

        Processor #1                        Processor #2
        (shmem_lock(lock=0))                (putback_lru_page())

                                            spin_lock
                                            SetPageLRU
                                            spin_unlock
        clear_bit(AS_UNEVICTABLE)
        spin_lock
        if PageLRU()
          if !test_bit(AS_UNEVICTABLE)
            move evictable list
                                            smp_mb
                                            if !test_bit(AS_UNEVICTABLE)
                                              move evictable list
        spin_unlock

    But pagevec_lookup() in scan_mapping_unevictable_pages() takes
    rcu_read_[un]lock(), which happens to prevent the reordering before
    test_bit(AS_UNEVICTABLE) is reached on processor #1, so the problem
    never actually occurs. That is an unexpected side effect, though, and
    we should solve the problem properly.

    This patch adds a barrier after mapping_clear_unevictable, as sketched
    below.

    I didn't hit this problem; I just found it during review.

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
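
    The fix, sketched (caller shape assumed; the barrier pairs with the
    smp_mb in putback_lru_page() shown in the diagram above):

        mapping_clear_unevictable(mapping);
        /*
         * Make the cleared bit visible before page state is tested while
         * rescuing pages, so the clear cannot drift down past our own
         * test_bit and leave the page stranded as unevictable.
         */
        smp_mb();
        scan_mapping_unevictable_pages(mapping);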
     
  • Quiet the sparse noise:

    warning: symbol 'khugepaged_scan' was not declared. Should it be static?
    warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

    Signed-off-by: H Hartley Sweeten
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
    Quiet the sparse noise:

    warning: symbol 'default_policy' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise:

    warning: symbol 'swap_token_memcg' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise in this file:

    warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Tomi Valkeinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • At one point, anonymous pages were supposed to go on the unevictable list
    when no swap space was configured, and the idea was to manually rescue
    those pages after adding swap and making them evictable again. But
    nowadays, swap-backed pages on the anon LRU list are not scanned without
    available swap space anyway, so there is no point in moving them to a
    separate list anymore.

    The manual rescue could also be used in case pages were stranded on the
    unevictable list due to race conditions. But the code has been around for
    a while now and newly discovered bugs should be properly reported and
    dealt with instead of relying on such a manual fixup.

    In addition to the lack of a usecase, the sysfs interface to rescue pages
    from a specific NUMA node has been broken since its introduction, so it's
    unlikely that anybody ever relied on that.

    This patch removes the functionality behind the sysctl and the
    node-interface and emits a one-time warning when somebody tries to access
    either of them.

    Signed-off-by: Johannes Weiner
    Reported-by: Kautuk Consul
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    write_scan_unevictable_node() checks the value req returned by
    strict_strtoul() and returns 1 if req is 0.

    However, when strict_strtoul() returns 0, it means a successful
    conversion of buf to an unsigned long.

    Due to this, the function never proceeded to scan the zones for
    unevictable pages, even when a valid value was written to the
    scan_unevictable_pages sysfs file.

    Change this check slightly: treat an invalid value in buf, as well as
    a 0 stored in res after successful conversion, as reasons not to scan
    this node's zones. The corrected check is sketched below.

    Signed-off-by: Kautuk Consul
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
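
    The corrected check, sketched (variable names from the message; the
    surrounding sysfs handler is elided):

        unsigned long res;
        int err;

        err = strict_strtoul(buf, 10, &res);
        /*
         * strict_strtoul() returns 0 on success, not the converted
         * value. Skip the scan on a parse error or an explicit 0.
         */
        if (err || res == 0)
                return 1;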
     
  • Signed-off-by: Li Haifeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng