27 Jul, 2011

17 commits

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.
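
    For illustration, a minimal (hypothetical) user of the consolidated
    header; the only assumption is that atomic_inc_not_zero() behaves as
    documented:

    #include <linux/atomic.h>	/* instead of <asm/atomic.h> */
    #include <linux/types.h>

    static atomic_t example_refcount = ATOMIC_INIT(1);

    /* Take a reference only if the object is still alive (count != 0). */
    static bool example_tryget(void)
    {
            return atomic_inc_not_zero(&example_refcount) != 0;
    }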

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • Now cleanup_fault_attr_dentries() recursively removes a directory, so we
    can simplify the error handling in the initialization code, and there is
    no need to hold dentry structs for each debugfs file.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Now cleanup_fault_attr_dentries() recursively removes a directory, so we
    can simplify the error handling in the initialization code, and there is
    no need to hold dentry structs for each debugfs file.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Use debugfs_remove_recursive() to simplify initialization and
    deinitialization of fault injection debugfs files.
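
    For illustration, a minimal sketch of the pattern this enables
    (hypothetical module, not the code touched by this commit):

    #include <linux/debugfs.h>
    #include <linux/module.h>

    static struct dentry *fail_dir;
    static u32 fail_probability;

    static int __init fail_example_init(void)
    {
            fail_dir = debugfs_create_dir("fail_example", NULL);
            if (!fail_dir)
                    return -ENOMEM;
            /* No need to keep the individual file dentries around. */
            debugfs_create_u32("probability", 0600, fail_dir,
                               &fail_probability);
            return 0;
    }

    static void __exit fail_example_exit(void)
    {
            /* One call removes the directory and everything under it. */
            debugfs_remove_recursive(fail_dir);
    }

    module_init(fail_example_init);
    module_exit(fail_example_exit);
    MODULE_LICENSE("GPL");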

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • [ This patch has already been accepted as commit 0ac0c0d0f837 but later
    reverted (commit 35926ff5fba8) because it introduced an arch-specific
    __node_random which was defined only for x86 code, so it broke other
    archs. This is a follow-up without any arch-specific code. Other than
    that there are no functional changes. ]

    Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is
    that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
    at node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    of the cpuset.
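
    For illustration only, one way such a rotor could be seeded; the helper
    below is a sketch and not necessarily the function this patch adds:

    #include <linux/nodemask.h>
    #include <linux/random.h>

    /* Return a randomly chosen node from @maskp, or -1 if it is empty. */
    static int example_random_node(const nodemask_t *maskp)
    {
            int w, target, node;

            w = nodes_weight(*maskp);
            if (!w)
                    return -1;

            target = get_random_int() % w;
            for_each_node_mask(node, *maskp)
                    if (target-- == 0)
                            return node;
            return -1;
    }

    /* The spread rotor of a new task would then be initialized from a
     * random allowed node instead of always starting at node 0. */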

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    [mhocko@suse.cz: Make it arch independent]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • percpu_charge_mutex protects against multiple simultaneous drainings of
    the per-cpu charge caches, because we might otherwise end up with too
    many work items. At least this was the case until commit 26fe61684449
    ("memcg: fix percpu cached charge draining frequency"), when we
    introduced more targeted draining for async mode.

    Now that sync draining is targeted as well, we can safely remove the
    mutex, because we will not send more work than the current number of
    CPUs. FLUSHING_CACHED_CHARGE protects against sending the same work
    multiple times, and stock->nr_pages == 0 protects against pointlessly
    sending work when there is obviously nothing to be done. This is of
    course racy, but we can live with it because the race window is really
    small (we would have to see FLUSHING_CACHED_CHARGE cleared while
    nr_pages was still non-zero).

    The only remaining place where we can race is synchronous mode, where we
    rely on the FLUSHING_CACHED_CHARGE test, which might have been set by
    another drainer on the same group; but we should wait in that case as
    well.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We check whether two given groups are the same or at least in the same
    subtree of a hierarchy in several places. Let's add a helper for it to
    make the code easier to read.
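
    For illustration, one possible shape for such a helper; this is a sketch
    with an assumed parent accessor, not the exact memcontrol.c code:

    /*
     * True if @memcg is @root itself or lives somewhere below @root in the
     * hierarchy.  parent_of() stands in for whatever parent accessor the
     * real code uses (an assumption of this sketch).
     */
    static bool memcg_same_or_subtree(struct mem_cgroup *root,
                                      struct mem_cgroup *memcg)
    {
            while (memcg) {
                    if (memcg == root)
                            return true;
                    memcg = parent_of(memcg);
            }
            return false;
    }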

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we have two ways to drain the per-CPU caches for charges.
    drain_all_stock_sync will synchronously drain all caches, while
    drain_all_stock_async will asynchronously drain only those that refer to
    a given memory cgroup or its subtree in the hierarchy. Targeted async
    draining was introduced by 26fe6168 ("memcg: fix percpu cached charge
    draining frequency") to reduce the number of CPU workers.

    Sync draining is currently triggered only from mem_cgroup_force_empty,
    which is triggered only by userspace (mem_cgroup_force_empty_write) or
    when a cgroup is removed (mem_cgroup_pre_destroy). Although these are
    not usually frequent operations, it still makes some sense to do
    targeted draining as well, especially if the box has many CPUs.

    This patch unifies both methods to use a single implementation
    (drain_all_stock) which relies on the original async implementation and
    just adds flush_work to wait on all caches that are still under work for
    the sync mode. We use the FLUSHING_CACHED_CHARGE bit check to avoid
    waiting on work that we haven't triggered. Please note that both sync
    and async functions are currently protected by percpu_charge_mutex, so
    we cannot race with other drainers.
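
    A self-contained sketch of the scheme described above (a generic
    workqueue pattern with made-up names, not the memcontrol.c code); the
    work handler, not shown, would drain the cache and clear bit 0 when
    done:

    #include <linux/workqueue.h>
    #include <linux/percpu.h>
    #include <linux/cpu.h>

    struct pcpu_cache {
            struct work_struct work;	/* INIT_WORK()ed at init time */
            unsigned long flags;		/* bit 0 plays the FLUSHING role */
    };
    static DEFINE_PER_CPU(struct pcpu_cache, caches);

    static void drain_all(bool sync)
    {
            int cpu;

            get_online_cpus();
            /* Queue work only for caches nobody is flushing already. */
            for_each_online_cpu(cpu) {
                    struct pcpu_cache *c = &per_cpu(caches, cpu);

                    if (!test_and_set_bit(0, &c->flags))
                            schedule_work_on(cpu, &c->work);
            }
            if (sync) {
                    /* Wait only on work that was actually triggered. */
                    for_each_online_cpu(cpu) {
                            struct pcpu_cache *c = &per_cpu(caches, cpu);

                            if (test_bit(0, &c->flags))
                                    flush_work(&c->work);
                    }
            }
            put_online_cpus();
    }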

    Signed-off-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • drain_all_stock_async tries to optimize the work to be done on the work
    queue by excluding any work for the current CPU, because it assumes that
    the context we are called from has already tried to charge from that
    cache and failed, so it must be empty already.

    While the assumption is correct, we can optimize it even more by
    checking the current number of pages in the cache. This will also
    reduce the work on other CPUs with an empty stock.

    For the current CPU we can simply call drain_local_stock rather than
    deferring it to the work queue.

    [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
    Signed-off-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to the memory.stat file. But it
    doesn't, because we considered we needed to reach a consensus for such
    new APIs.

    This patch is a trial to add memory.scan_stat. This shows
    - the number of scanned pages (total, anon, file)
    - the number of rotated pages (total, anon, file)
    - the number of freed pages (total, anon, file)
    - the elapsed time (including sleep/pause time)

    for both direct and soft reclaim.

    The biggest difference from Ying's original version is that this file
    can be reset by a write, as in

    # echo 0 ...../memory.scan_stat

    Example output is below. This is a result after a make -j 6 kernel
    build under a 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    total_xxxx is for hierarchy management.

    This will be useful for further memcg development and needs to be
    developed before we do some complicated rework on LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord to the scan_control struct.
    sc->nr_scanned et al. are not designed for exporting information. For
    example, nr_scanned is reset frequently and incremented by 2 when
    scanning mapped pages.

    To avoid complexity, I added a new parameter to scan_control which is
    used for exporting the scanning statistics.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
    memsw.limit") introduced the "memsw_is_minimum" flag, which becomes true
    when mem_limit == memsw_limit. The flag is checked at the beginning of
    reclaim, and "noswap" is set if the flag is true, because using swap is
    meaningless in this case.

    This works well in most cases, but when we try to shrink mem_limit,
    which is the same as memsw_limit now, we might fail to shrink mem_limit
    because swap isn't used.

    This patch fixes this behavior by:
    - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
    - if it is set, not setting the "noswap" flag even if memsw_is_minimum
      is true.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
    fixes the memcg/kswapd behavior against small targets and prevents the
    vmscan priority from getting too high.

    But the implementation is too naive and adds another problem for small
    memcgs. It always forces a scan of 32 pages of file/anon and doesn't
    handle swappiness or other rotate info. It makes vmscan scan the anon
    LRU regardless of swappiness and makes reclaim worse. This patch fixes
    it by adjusting the scan count with regard to swappiness et al.

    In a test, "cat a 1G file under a 300M limit" (swappiness=20):
    before patch
    scanned_pages_by_limit 360919
    scanned_anon_pages_by_limit 180469
    scanned_file_pages_by_limit 180450
    rotated_pages_by_limit 31
    rotated_anon_pages_by_limit 25
    rotated_file_pages_by_limit 6
    freed_pages_by_limit 180458
    freed_anon_pages_by_limit 19
    freed_file_pages_by_limit 180439
    elapsed_ns_by_limit 429758872
    after patch
    scanned_pages_by_limit 180674
    scanned_anon_pages_by_limit 24
    scanned_file_pages_by_limit 180650
    rotated_pages_by_limit 35
    rotated_anon_pages_by_limit 24
    rotated_file_pages_by_limit 11
    freed_pages_by_limit 180634
    freed_anon_pages_by_limit 0
    freed_file_pages_by_limit 180634
    elapsed_ns_by_limit 367119089
    scanned_pages_by_system 0

    the number of anon pages scanned is decreased (as expected), and the
    elapsed time is reduced. With this patch, small memcgs will work better.
    (*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate-scan counter makes scanning favor files.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
    interface for oom_control. None of the critical sections which it
    protects sleep (eventfd_signal works from atomic context, and the rest
    are simple linked-list manipulations and oom_lock atomic operations).

    A mutex is also too heavyweight for those code paths because it triggers
    a lot of scheduling. It also makes convoying effects more visible when
    we have a large number of OOM kills, because we take the lock multiple
    times during mem_cgroup_handle_oom, so we have multiple places where
    many processes can sleep.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
    counter which is incremented by mem_cgroup_oom_lock when we are about to
    handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a
    sleep if oom_lock > 1, to prevent multiple OOM kills at the same time.
    The counter is then decremented by mem_cgroup_oom_unlock, called from
    the same function.

    This works correctly but it can lead to serious starvations when we have
    many processes triggering OOM and many CPUs available for them (I have
    tested with 16 CPUs).

    Consider a process (call it A) which gets the oom_lock (the first one
    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
    processes that are blocked on the mutex. While A releases the mutex and
    calls mem_cgroup_out_of_memory others will wake up (one after another)
    and increase the counter and fall into sleep (memcg_oom_waitq).

    Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
    decreases oom_lock and wakes other tasks (if releasing memory by
    somebody else - e.g. killed process - hasn't done it yet).

    A testcase would look like:
    Assume malloc XXX is a program allocating XXX Megabytes of memory
    which touches all allocated pages in a tight loop
    # swapoff SWAP_DEVICE
    # cgcreate -g memory:A
    # cgset -r memory.oom_control=0 A
    # cgset -r memory.limit_in_bytes= 200M
    # for i in `seq 100`
    # do
    # cgexec -g memory:A malloc 10 &
    # done

    The main problem here is that all processes still race for the mutex and
    there is no guarantee that we will get the counter back to 0 for those
    that got back to mem_cgroup_handle_oom. In the end the whole convoy
    increases and decreases the counter, but it never drops to 1, which
    would enable killing, so nothing useful can be done. The time is
    basically unbounded because it highly depends on scheduling and on the
    ordering on the mutex (I have seen this take hours...).

    This patch replaces the counter with a simple {un}lock semantic. As
    mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
    make sure that nobody else races with us, which is guaranteed by the
    memcg_oom_mutex.

    We have to be careful while locking subtrees because we can encounter a
    subtree which is already locked. Consider the hierarchy:

          A
         / \
        B   \
       / \   \
      C   D   E

    The B - C - D subtree might already be locked. While we want to allow
    locking the E subtree, because the OOM situations cannot influence each
    other, we definitely do not want to allow locking A.

    Therefore we have to refuse the lock if any subtree is already locked,
    and clear the lock for all nodes that have been set up to the failure
    point.
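
    For illustration, a sketch of this "lock the whole subtree or roll back"
    idea; the iterator name and the oom_lock field are assumptions mirroring
    the description above, not verified kernel API, and the iterator is
    assumed to visit nodes in the same order on both passes:

    static bool oom_lock_subtree(struct mem_cgroup *root)
    {
            struct mem_cgroup *iter, *failed = NULL;

            /* First pass: try to take every node in the subtree. */
            for_each_memcg_in_subtree(iter, root) {
                    if (iter->oom_lock) {
                            failed = iter;	/* someone already holds it */
                            break;
                    }
                    iter->oom_lock = true;
            }
            if (!failed)
                    return true;

            /* Second pass: undo the locks taken before the failure point. */
            for_each_memcg_in_subtree(iter, root) {
                    if (iter == failed)
                            break;
                    iter->oom_lock = false;
            }
            return false;
    }

    The caller is expected to hold memcg_oom_mutex around this, as noted
    above.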

    On the other hand we have to make sure that the rest of the world will
    recognize that a group is under OOM even though it doesn't have a lock.
    Therefore we have to introduce an under_oom variable which is
    incremented and decremented for the whole subtree when we enter and
    leave mem_cgroup_handle_oom, respectively. under_oom, unlike oom_lock,
    doesn't need to be updated under memcg_oom_mutex because its users only
    check a single group and they use atomic operations for that.

    This can be checked easily by the following test case:

    # cgcreate -g memory:A
    # cgset -r memory.use_hierarchy=1 A
    # cgset -r memory.oom_control=1 A
    # cgset -r memory.limit_in_bytes= 100M
    # cgset -r memory.memsw.limit_in_bytes= 100M
    # cgcreate -g memory:A/B
    # cgset -r memory.oom_control=1 A/B
    # cgset -r memory.limit_in_bytes=20M
    # cgset -r memory.memsw.limit_in_bytes=20M
    # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
    # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

    While B gets the oom_lock, A will not get it. Both of them go to sleep
    and wait for an external action. We can make the limit higher for A to
    force waking it up:

    # cgset -r memory.memsw.limit_in_bytes=300M A
    # cgset -r memory.limit_in_bytes=300M A

    malloc in A has to wake up even though it doesn't have oom_lock.

    Finally, the unlock path is very easy because we always unlock only the
    subtree we have locked previously while we always decrement under_oom.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In mm/memcontrol.c, there are many lru stat functions as..

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #if MAX_NUMNODES > 1 and others are not.
    This seems bad. This patch consolidates all of these functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macros, such as ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA placement of the counters in memory, this
    patch seems to be better.

    Now, when we gather all LRU information, we scan in the following order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we'll touch cache lines on different nodes in turn.

    After this patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

    Then we'll gather the information in the same cache lines at once.
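
    The new scan order, as a rough sketch (the counters array and its layout
    are illustrative assumptions, not the real memcg counter structure):

    static unsigned long nr_lru_pages_sketch(unsigned int lru_mask)
    {
            unsigned long total = 0;
            int nid, zid, lru;

            /* node -> zone -> lru: stay within one node's cache lines. */
            for_each_node_state(nid, N_HIGH_MEMORY)
                    for (zid = 0; zid < MAX_NR_ZONES; zid++)
                            for_each_lru(lru)
                                    if (BIT(lru) & lru_mask)
                                            total += counters[nid][zid][lru];
            return total;
    }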

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed by
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages(),
    and swappiness is passed as an argument. It's propagated via
    scan_control.

    get_swappiness() is a static function, but some planned updates will
    need to get swappiness from files other than memcontrol.c. This patch
    exports get_swappiness() as mem_cgroup_swappiness(). With this, we can
    remove the swappiness argument from try_to_free... and drop swappiness
    from scan_control; only memcg uses it.
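
    The exported accessor could look roughly like this sketch (the field
    name and the fallback behaviour are assumptions, not quoted code):

    unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
    {
            /* Fall back to the global vm_swappiness without a memcg. */
            if (!memcg)
                    return vm_swappiness;
            return memcg->swappiness;
    }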

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

23 commits

  • * Merge akpm patch series: (122 commits)
    drivers/connector/cn_proc.c: remove unused local
    Documentation/SubmitChecklist: add RCU debug config options
    reiserfs: use hweight_long()
    reiserfs: use proper little-endian bitops
    pnpacpi: register disabled resources
    drivers/rtc/rtc-tegra.c: properly initialize spinlock
    drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
    drivers/rtc: add support for Qualcomm PMIC8xxx RTC
    drivers/rtc/rtc-s3c.c: support clock gating
    drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
    init: skip calibration delay if previously done
    misc/eeprom: add eeprom access driver for digsy_mtc board
    misc/eeprom: add driver for microwire 93xx46 EEPROMs
    checkpatch.pl: update $logFunctions
    checkpatch: make utf-8 test --strict
    checkpatch.pl: add ability to ignore various messages
    checkpatch: add a "prefer __aligned" check
    checkpatch: validate signature styles and To: and Cc: lines
    checkpatch: add __rcu as a sparse modifier
    checkpatch: suggest using min_t or max_t
    ...

    Did this as a merge because of (trivial) conflicts in
    - Documentation/feature-removal-schedule.txt
    - arch/xtensa/include/asm/uaccess.h
    that were just easier to fix up in the merge than in the patch series.

    Linus Torvalds
     
  • devres uses the pointer value as key after it's freed, which is safe but
    triggers spurious use-after-free warnings on some static analysis tools.
    Rearrange code to avoid such warnings.

    Signed-off-by: Maxin B. John
    Reviewed-by: Rolf Eike Beer
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxin B John
     
  • NR_WRITTEN is now accounted at block IO enqueue time, which is not very
    accurate as to common understanding. This moves NR_WRITTEN accounting to
    the IO completion time and makes it more consistent with BDI_WRITTEN,
    which is used for bandwidth estimation.

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • shmem_unuse_inode() and shmem_writepage() contain a little code to cope
    with pages inserted independently into the filecache, probably by a
    filesystem stacked on top of tmpfs, then fed to its ->readpage() or
    ->writepage().

    Unionfs was indeed experimenting with working in that way three years ago,
    but I find no current examples: nowadays the stacking filesystems use vfs
    interfaces to the lower filesystem.

    It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
    filepage passed in via shmem_readpage(), then swappage found, which must
    then be copied over to it.

    Although at first it's tempting to replace the **pagep arg by returning
    struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
    callers, so leave as is.

    Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
    complication came from uninitialized pages inserted into filecache prior
    to readpage; but now we're in control, and only release pagelock on
    filecache once it's uptodate (if an error occurs in reading back from
    swap, the page remains in swapcache, never moved to filecache).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
    complicated: first simplify that before going on to filepage/swappage.

    That's right, don't report ENOMEM when the preallocation fails: we may or
    may not need the page. But simply report ENOMEM once we find we do need
    it, instead of dropping lock, repeating allocation, unwinding on failure
    etc. And leave the out label on the fast path, don't goto.

    Fix something that looks like a bug but turns out not to be: set
    PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
    the removed case was doing. That's important before adding to LRU
    (determines which LRU the page goes on), and does affect which path it
    takes through memcontrol.c, but in the end MEM_CGROUP_CHARGE_TYPE_SHMEM
    is handled no differently from CACHE.

    Signed-off-by: Hugh Dickins
    Acked-by: Shaohua Li
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove that pernicious shmem_readpage() at last: the things we needed it
    for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
    shmem_file_splice_read() and shmem_read_mapping_page_gfp().

    This removal clears the way for a simpler shmem_getpage_gfp(), since page
    is never passed in; but leave most of that cleanup until after.

    sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
    instead of unexpectedly trying to read ahead on tmpfs: if that proves to
    be an issue for someone, then we can either arrange for them to return
    success instead, or try to implement async readahead on tmpfs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
    shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().
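
    The wrapper then reduces to something like the following sketch (the
    argument list is assumed for illustration, not quoted from the patch):

    static int shmem_getpage(struct inode *inode, pgoff_t index,
                             struct page **pagep, enum sgp_type sgp,
                             int *fault_type)
    {
            /* Pick up the mapping's gfp mask and hand everything down. */
            return shmem_getpage_gfp(inode, index, pagep, sgp,
                                     mapping_gfp_mask(inode->i_mapping),
                                     fault_type);
    }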

    Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
    CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().

    Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
    though we might later want to let that case route to read_cache_page_gfp().

    It annoys me to have these two almost-redundant args, gfp and fault_type:
    I can't find a better way; but initialize fault_type only in shmem_fault().

    Note that before, read_cache_page_gfp() was allocating i915_gem's pages
    with __GFP_NORETRY as intended; but the corresponding swap vector pages
    got allocated without it, leaving a small possibility of OOM.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up shmem_file_splice_read():

    Remove readahead: okay, we could implement shmem readahead on swap,
    but have never done so before, swap being the slow exceptional path.

    Use shmem_getpage() instead of find_or_create_page() plus ->readpage().

    Remove several comments: sorry, I found them more distracting than
    helpful, and this will not be the reference version of splice_read().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Copy __generic_file_splice_read() and generic_file_splice_read() from
    fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
    page_cache_pipe_buf_ops and spd_release_page() accessible to it.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go into a loop of trying the atomic access, failing, trying gup
    to "fix it up", getting success from gup, going back to the atomic
    access, failing again because dirty wasn't fixed, etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend not to swap much (if at all), so we probably rarely
    hit the case where dirty is missing or young is missing, but I think
    Shan has a piece of SW that can reliably reproduce it using a shared
    writable mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only, and a page without young is inaccessible
    via the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
    "fix it up" by causing get_user_pages() which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
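
    For illustration, the intended calling pattern might look like this
    sketch (the exact fixup_user_fault() argument list is assumed from the
    changelog rather than quoted from the patch):

    static int faultin_for_write(unsigned long uaddr)
    {
            struct mm_struct *mm = current->mm;
            int ret;

            down_read(&mm->mmap_sem);
            /* Simulate a real write fault at uaddr, fixing up dirty/young
             * and doing any lazy TLB flush the architecture needs. */
            ret = fixup_user_fault(current, mm, uaddr, FAULT_FLAG_WRITE);
            up_read(&mm->mmap_sem);

            return ret;
    }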

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • radix_tree_tagged() is lockless - it reads from a member of the
    radix-tree root node. It does not require any protection.
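
    For example (illustrative snippet):

    #include <linux/fs.h>
    #include <linux/radix-tree.h>

    static bool mapping_has_dirty_pages(struct address_space *mapping)
    {
            /* Reads only the root node's tag bits; no tree_lock needed. */
            return radix_tree_tagged(&mapping->page_tree,
                                     PAGECACHE_TAG_DIRTY);
    }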

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fallback
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirs zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly such as the first eligible zone having mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it should not have enough
    unmapped pages for zone_reclaim but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we keep the faulted page locked throughout the whole __do_fault
    call (except for the page_mkwrite code path) after calling the
    filesystem's fault code. If we do early COW, we allocate a new page
    which has to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for unbounded amount of time if memcg
    oom killer is disabled or fork-bomb is running because the only way out of
    the OOM situation is either an external event or OOM-situation fix.

    In the end we are keeping the faulted page locked and blocking other
    processes from faulting it in, which is not good at all because we are
    basically punishing a potentially unrelated process for an OOM condition
    in a different group (I have seen a stuck system because of ld-2.11.1.so
    being locked).

    We can test this easily:

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then the whole system will be live-locked until you kill 'make -j' by
    hand (or push reboot...). This is because some important pages in a
    shared library are locked.

    Considering it again, the new page does not need to be allocated with
    lock_page() held. And the usual page allocation may dive into a long
    memory-reclaim loop while holding lock_page(), which can cause very long
    latency.

    There are 3 ways.
    1. do allocation/charge before lock_page()
    Pros. - simple and can handle page allocation in the same manner.
    This will reduce holding time of lock_page() in general.
    Cons. - we do page allocation even if ->fault() returns error.

    2. do charge after unlock_page(). Even if charge fails, it's just OOM.
    Pros. - no impact to non-memcg path.
    Cons. - implementation requires special care of LRU and we need to modify
    page_add_new_anon_rmap()...

    3. do unlock->charge->lock again method.
    Pros. - no impact to non-memcg path.
    Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
    lock and get it again...

    This patch moves the "charge" and the memory allocation for the COW page
    before lock_page(). Then we can avoid scanning the LRU while holding a
    lock on a page, and the latency under lock_page() will be reduced.

    Then, the above livelock disappears.

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 2.6.36's commit 7e496299d4d2 ("tmpfs: make tmpfs scalable with
    percpu_counter for used blocks") used inode->i_lock in place of
    sbinfo->stat_lock around i_blocks updates; but that was adverse to
    scalability, and unnecessary, since info->lock is already held there in
    the fast paths.

    Remove those uses of i_lock, and add info->lock in the three error paths
    where it's then needed across shmem_free_blocks(). It's not actually
    needed across shmem_unacct_blocks(), but they're so often paired that it
    looks wrong to split them apart.

    Signed-off-by: Hugh Dickins
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • truncate_inode_pages_range()'s final loop has a nice pincer property,
    bringing start and end together, squeezing out the last pages. But the
    range handling missed out on that, just sliding up the range, perhaps
    letting pages come in behind it. Add one more test to give it the same
    pincer effect.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use consistent variable names in truncate_pagecache(), truncate_setsize(),
    vmtruncate() and vmtruncate_range().

    unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
    don't change either, but make the vmtruncates more precise about what they
    expect unmap_mapping_range() to do.

    vmtruncate_range() is currently called only with page-aligned start and
    end+1: can handle unaligned start, but unaligned end+1 would hit BUG_ON in
    truncate_inode_pages_range() (lacks partial clearing of the end page).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().
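
    For example, a sketch of the calling convention (illustrative only):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* "data" is simply the first argument handed to the filler -- here a
     * struct file *, exactly as it would be for ->readpage(). */
    static int example_filler(void *data, struct page *page)
    {
            struct file *file = data;

            return file->f_mapping->a_ops->readpage(file, page);
    }

    static struct page *example_read(struct file *file, pgoff_t index)
    {
            return read_cache_page(file->f_mapping, index,
                                   example_filler, file);
    }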

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • - shmem pages are not immediately available, but they are not
    potentially available either: even if we swap them out, they will just
    relocate from memory into swap, and the total amount of immediately and
    potentially available memory is not going to be affected, so we
    shouldn't count them as potentially free in the first place.

    - nr_free_pages() is not an expensive operation anymore, so there is no
    need to split the decision-making into two halves and repeat code.

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
     
  • RED_INACTIVE is a slab thing, and reusing it for memblock was
    inappropriate, because memblock is dealing with phys_addr_t's which have a
    Kconfigurable sizeof().

    Create a new poison type for this application. Fixes the sparse warning

    warning: cast truncates bits from constant value (9f911029d74e35b becomes 9d74e35b)

    Reported-by: H Hartley Sweeten
    Tested-by: H Hartley Sweeten
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The badness() function in the oom killer was renamed to oom_badness() in
    a63d83f427fb ("oom: badness heuristic rewrite") for clarity, since it is
    a globally exported function.

    The prototype for the old function still existed in linux/oom.h, so remove
    it. There are no existing users.

    Also fixes documentation and comment references to badness() and adjusts
    them accordingly.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes