27 Jul, 2011

40 commits

  • NUMA_NO_NODE and numa_node_id() have different meanings: the former means
    "no node preference", while the latter names the node the calling CPU is
    currently running on. NUMA_NO_NODE is obviously the recommended fallback.
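
    A minimal illustration of the distinction (hypothetical helper, not from
    the patch; kmalloc_node() and dev_to_node() are the stock kernel APIs):

      /* hypothetical helper for illustration only */
      static void *alloc_for_device(struct device *dev, size_t size)
      {
              int nid = dev_to_node(dev);  /* may be NUMA_NO_NODE (-1) */

              /* NUMA_NO_NODE means "no preference", so it is a safe
               * fallback; numa_node_id() is merely the node we happen
               * to run on now and would wrongly pin the allocation. */
              return kmalloc_node(size, GFP_KERNEL, nid);
      }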

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Adapt to the new API style.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • acct_arg_size() takes ->page_table_lock around add_mm_counter() if
    !SPLIT_RSS_COUNTING. This is not needed after commit 172703b08cd0 ("mm:
    delete non-atomic mm counter implementation").

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Matt Fleming
    Cc: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If CONFIG_MODULES=n, it makes no sense to retry the list of binary format
    handlers, because the list cannot be modified by request_module().

    Signed-off-by: Tetsuo Handa
    Cc: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Currently, search_binary_handler() tries to load a binary loader module
    using request_module() if a loader for the requested program is not yet
    loaded. But a second attempt at request_module() does not affect the
    result of search_binary_handler().

    If request_module() triggered recursion, calling request_module() twice
    causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. That is
    not an infinite loop, but it is long enough for users to consider it a
    hang-up.

    Therefore, this patch changes the code not to call request_module()
    twice, leaving 1 to the power of MAX_KMOD_CONCURRENT (i.e. a single)
    repetition in case of recursion.
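
    A sketch of the resulting shape of search_binary_handler() (simplified;
    try_registered_formats() stands in for the inline handler-list walk in
    fs/exec.c):

      for (try = 0; try < 2; try++) {
              retval = try_registered_formats(bprm, regs);
              if (retval != -ENOEXEC)
                      break;
              if (try)
                      break;  /* request_module() was already tried once */
              request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2));
      }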

    Signed-off-by: Tetsuo Handa
    Reported-by: Richard Weinberger
    Tested-by: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between
    shift_arg_pages() and rmap_walk() during migration by not migrating
    temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
    and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile
    time one, so BUILD_BUG_ON is more appropriate.
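
    The change is essentially one line (sketch):

      -       BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
      +       BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);

    BUILD_BUG_ON rejects the overlap when the kernel is compiled instead of
    crashing at run time.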

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Signed-off-by: Daniel Rebelo de Oliveira
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Rebelo de Oliveira
     
  • If an inode's mode permits opening /proc/PID/io and the resulting file
    descriptor is kept across execve() of a setuid or similar binary, the
    ptrace_may_access() check tries to prevent using this fd against the
    task with escalated privileges.

    Unfortunately, there is a race in the check against execve(). If
    execve() is processed after the ptrace check but before the actual io
    information gathering, io statistics will be gathered from the
    privileged process. At least in theory this might lead to gathering
    sensitive information (like ssh/ftp password length) that wouldn't be
    available otherwise.

    Holding task->signal->cred_guard_mutex while gathering the io
    information should protect against the race.

    The order of locking is similar to the one inside of ptrace_attach():
    first goes cred_guard_mutex, then lock_task_sighand().
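
    A sketch of the resulting ordering in do_io_accounting() (error paths
    trimmed):

      result = mutex_lock_killable(&task->signal->cred_guard_mutex);
      if (result)
              return result;
      if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
              result = -EACCES;
              goto out_unlock;
      }
      /* ... take lock_task_sighand() and gather the io statistics ... */
      out_unlock:
      mutex_unlock(&task->signal->cred_guard_mutex);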

    Signed-off-by: Vasiliy Kulikov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Change the return value to ENOENT. This return value is then returned
    when opening a proc entry that has been removed, for example
    open("/proc/bus/pci/XX/YY") when the corresponding device is being
    hot-removed.

    Signed-off-by: Daisuke Ogino
    Cc: Jesse Barnes
    Acked-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Ogino
     
  • Harmonise these return values with other architectures. In some cases
    this affects all compilers and in other cases non-gcc compilers only.

    Cc: Yoshinori Sato
    Cc: Geert Uytterhoeven
    Cc: Chris Zankel
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • do_coredump() assumes that if format_corename() fails it should return
    -ENOMEM. This is not true; for example, cn_print_exe_file() can propagate
    the error from d_path(). Even if it were true, this is too fragile.
    Change the code to check "ispipe < 0" instead.
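
    A sketch of the check (the label name is illustrative):

      ispipe = format_corename(&cn, signr);
      if (ispipe < 0) {              /* was: if (ispipe == -ENOMEM) */
              retval = ispipe;       /* propagate the real error */
              goto fail_corename;
      }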

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jiri Slaby
    Reviewed-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change every occurrence of / in comm and hostname to !. If the process
    changes its name to contain /, the core is not dumped (if the
    corresponding directory tree doesn't exist). The same happens with a
    hostname such as myhost/3. Fix this behaviour by using the escape loop
    used for %E, extracted into a separate function.

    Now, with both comm == myprocess/1 and hostname == myhost/1, the core is
    dumped like (kernel.core_pattern='core.%p.%e.%h'):
    core.2349.myprocess!1.myhost!1
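
    The extracted helper is roughly:

      static void cn_escape(char *str)
      {
              for (; *str; str++)
                      if (*str == '/')
                              *str = '!';
      }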

    Signed-off-by: Jiri Slaby
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • If we don't know the file corresponding to the binary (i.e. exe_file is
    unknown), use "task->comm (path unknown)" instead of the plain
    "(unknown)", as suggested by ak.

    The fallback is the same as %e except that it appends "(path unknown)".
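
    A sketch of the fallback in cn_print_exe_file():

      exe_file = get_mm_exe_file(current->mm);
      if (!exe_file)
              return cn_printf(cn, "%s (path unknown)", current->comm);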

    Signed-off-by: Jiri Slaby
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
    [ oleg@redhat.com: no need to declare show_regs() in ptrace.h, sched.h does this ]
    Signed-off-by: Mike Frysinger
    Cc: Tejun Heo
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • [ This patch has already been accepted as commit 0ac0c0d0f837 but was
    later reverted (commit 35926ff5fba8) because it introduced an
    arch-specific __node_random which was defined only for x86, so it broke
    other arches. This is a follow-up without any arch-specific code. Other
    than that there are no functional changes. ]

    Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is
    that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
    at node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    of the cpuset.
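
    A sketch of the lazy initialization (node_random() is the
    arch-independent helper this series introduces):

      if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
              current->cpuset_mem_spread_rotor =
                      node_random(&current->mems_allowed);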

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    [mhocko@suse.cz: Make it arch independent]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • percpu_charge_mutex protects against multiple simultaneous drains of the
    per-cpu charge caches, because we might otherwise end up with too many
    work items. At least this was the case until commit 26fe61684449
    ("memcg: fix percpu cached charge draining frequency"), when we
    introduced more targeted draining for the async mode.

    Now that sync draining is targeted as well, we can safely remove the
    mutex, because we will not send more work than the current number of
    CPUs. FLUSHING_CACHED_CHARGE protects against sending the same work
    multiple times, and stock->nr_pages == 0 protects against pointlessly
    sending work when there is obviously nothing to be done. This is of
    course racy, but we can live with it as the race window is really small
    (we would have to see FLUSHING_CACHED_CHARGE cleared while nr_pages was
    still non-zero).

    The only remaining place where we can race is synchronous mode, where we
    rely on the FLUSHING_CACHED_CHARGE test, which might have been set by
    another drainer on the same group; we should wait in that case as well.
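
    A sketch of the two guards that replace the mutex (per-cgroup filtering
    omitted):

      for_each_online_cpu(cpu) {
              struct memcg_stock_pcp *stock = &per_cpu(stock, cpu);

              /* nothing cached: no point in scheduling work */
              if (!stock->nr_pages)
                      continue;
              /* somebody is already draining this CPU's cache */
              if (test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                      continue;
              schedule_work_on(cpu, &stock->work);
      }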

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We check whether two given groups are the same, or at least in the same
    subtree of a hierarchy, at several places. Let's make a helper for it to
    make the code easier to read.
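
    The helper might look like this (sketch; css_is_ancestor() is the
    existing cgroup primitive):

      static bool mem_cgroup_same_or_subtree(struct mem_cgroup *root,
                                             struct mem_cgroup *mem)
      {
              if (root == mem)
                      return true;
              if (!root->use_hierarchy)
                      return false;
              return css_is_ancestor(&mem->css, &root->css);
      }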

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we have two ways to drain the per-CPU caches for charges.
    drain_all_stock_sync will synchronously drain all caches, while
    drain_all_stock_async will asynchronously drain only those that refer to
    a given memory cgroup or its subtree in the hierarchy. Targeted async
    draining was introduced by commit 26fe6168 ("memcg: fix percpu cached
    charge draining frequency") to reduce the number of CPU workers.

    Sync draining is currently triggered only from mem_cgroup_force_empty,
    which is in turn triggered only by userspace
    (mem_cgroup_force_empty_write) or when a cgroup is removed
    (mem_cgroup_pre_destroy). Although these are not usually frequent
    operations, it still makes sense to do targeted draining here as well,
    especially if the box has many CPUs.

    This patch unifies both methods into a single function (drain_all_stock)
    which relies on the original async implementation and just adds
    flush_work to wait on all caches that are still under work for the sync
    mode. We use the FLUSHING_CACHED_CHARGE bit check to avoid waiting on
    work that we haven't triggered. Please note that both sync and async
    functions are currently protected by percpu_charge_mutex, so we cannot
    race with other drainers.
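
    In sync mode the unified function then waits, roughly:

      if (!sync)
              return;
      for_each_online_cpu(cpu) {
              struct memcg_stock_pcp *stock = &per_cpu(stock, cpu);

              /* only wait on work that some drainer actually queued */
              if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                      flush_work(&stock->work);
      }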

    Signed-off-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • drain_all_stock_async tries to optimize the work to be done on the
    workqueue by excluding any work for the current CPU, because it assumes
    that the context we are called from has already tried to charge from
    that cache and failed, so it must be empty already.

    While the assumption is correct, we can optimize it even more by
    checking the current number of pages in the cache. This also reduces
    work on other CPUs whose stock is empty.

    For the current CPU we can simply call drain_local_stock rather than
    deferring it to the work queue.
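
    A sketch of both optimizations inside the drain loop (other checks
    omitted):

      if (!stock->nr_pages)
              continue;                 /* empty stock: skip this CPU */
      if (cpu == curcpu)
              drain_local_stock(&stock->work);   /* no deferral needed */
      else
              schedule_work_on(cpu, &stock->work);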

    [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
    Signed-off-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to the memory.stat file. But it
    doesn't, because we considered we needed to reach a consensus on such
    new APIs first.

    This patch is a trial to add memory.vmscan_stat. It shows
    - the number of scanned pages (total, anon, file)
    - the number of rotated pages (total, anon, file)
    - the number of freed pages (total, anon, file)
    - the elapsed time (including sleep/pause time)

    for both direct and soft reclaim.

    The biggest difference from Ying's original is that this file can be
    reset by a write:

    # echo 0 > .../memory.vmscan_stat

    Example output is below. This is a result after make -j 6 kernel under
    a 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    The xxxx_under_hierarchy entries are for hierarchy management.

    This will be useful for further memcg development and needs to be
    developed before we do some complicated rework on LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord to the scan_control
    struct. sc->nr_scanned et al. are not designed for exporting
    information; for example, nr_scanned is reset frequently and
    incremented by +2 when scanning mapped pages.

    To avoid complexity, I added a new param to scan_control which is
    dedicated to exporting the scanning score.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
    memsw.limit") introduced the "memsw_is_minimum" flag, which becomes true
    when mem_limit == memsw_limit. The flag is checked at the beginning of
    reclaim, and "noswap" is set if the flag is true, because using swap is
    meaningless in this case.

    This works well in most cases, but when we try to shrink mem_limit,
    which is the same as memsw_limit now, we might fail to shrink mem_limit
    because swap isn't used.

    This patch fixes this behavior by:
    - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
    - if it is set, not setting the "noswap" flag even if memsw_is_minimum
      is true
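
    A sketch of the adjusted check (variable names as in
    mem_cgroup_hierarchical_reclaim()):

      shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
      if (!shrink && root_mem->memsw_is_minimum)
              noswap = true;    /* when shrinking, keep swap usable */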

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
    fixes the memcg/kswapd behavior against small targets and prevents the
    vmscan priority from climbing too high.

    But the implementation is too naive and adds another problem for small
    memcgs. It always forces a scan of 32 pages of file/anon and doesn't
    handle swappiness or other rotation info. It makes vmscan scan the anon
    LRU regardless of swappiness and makes reclaim behave badly. This patch
    fixes it by adjusting the scan count with regard to swappiness et al.

    At a test "cat 1G file under 300M limit." (swappiness=20)
    before patch
    scanned_pages_by_limit 360919
    scanned_anon_pages_by_limit 180469
    scanned_file_pages_by_limit 180450
    rotated_pages_by_limit 31
    rotated_anon_pages_by_limit 25
    rotated_file_pages_by_limit 6
    freed_pages_by_limit 180458
    freed_anon_pages_by_limit 19
    freed_file_pages_by_limit 180439
    elapsed_ns_by_limit 429758872
    after patch:
    scanned_pages_by_limit 180674
    scanned_anon_pages_by_limit 24
    scanned_file_pages_by_limit 180650
    rotated_pages_by_limit 35
    rotated_anon_pages_by_limit 24
    rotated_file_pages_by_limit 11
    freed_pages_by_limit 180634
    freed_anon_pages_by_limit 0
    freed_file_pages_by_limit 180634
    elapsed_ns_by_limit 367119089
    scanned_pages_by_system 0

    the number of scanned anon pages decreased (as expected) and the elapsed
    time was reduced. With this patch, small memcgs will work better.
    (*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate-scan counters make scanning favor file pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
    interface for oom_control. None of the critical sections which it
    protects sleep (eventfd_signal works from atomic context and the rest
    are simple linked list resp. oom_lock atomic operations).

    A mutex is also too heavyweight for those code paths because it triggers
    a lot of scheduling. It also makes convoying effects more visible when
    we have a large number of OOM kills, because we take the lock multiple
    times during mem_cgroup_handle_oom, so there are multiple places where
    many processes can sleep.
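
    Since none of the critical sections sleep, a spinlock fits and the
    conversion is mechanical (sketch):

      -static DEFINE_MUTEX(memcg_oom_mutex);
      +static DEFINE_SPINLOCK(memcg_oom_lock);
       ...
      -       mutex_lock(&memcg_oom_mutex);
      +       spin_lock(&memcg_oom_lock);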

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
    counter which is incremented by mem_cgroup_oom_lock when we are about to
    handle a memcg OOM situation. mem_cgroup_handle_oom falls back to sleep
    if oom_lock > 1, to prevent multiple OOM kills from running at the same
    time. The counter is then decremented by mem_cgroup_oom_unlock, called
    from the same function.

    This works correctly but it can lead to serious starvations when we have
    many processes triggering OOM and many CPUs available for them (I have
    tested with 16 CPUs).

    Consider a process (call it A) which gets the oom_lock (the first one
    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
    processes that are blocked on the mutex. While A releases the mutex and
    calls mem_cgroup_out_of_memory, the others wake up (one after another),
    increase the counter, and fall asleep on memcg_oom_waitq.

    Once A finishes mem_cgroup_out_of_memory, it takes the mutex again,
    decreases oom_lock, and wakes other tasks (if releasing memory by
    somebody else, e.g. the killed process, hasn't done it yet).

    A testcase would look like:
    Assume malloc XXX is a program allocating XXX Megabytes of memory
    which touches all allocated pages in a tight loop
    # swapoff SWAP_DEVICE
    # cgcreate -g memory:A
    # cgset -r memory.oom_control=0 A
    # cgset -r memory.limit_in_bytes=200M
    # for i in `seq 100`
    # do
    # cgexec -g memory:A malloc 10 &
    # done

    The main problem here is that all processes still race for the mutex,
    and there is no guarantee that the counter drops back to 0 for those
    that got back to mem_cgroup_handle_oom. In the end the whole convoy
    increases and decreases the counter without ever reaching 1, which would
    enable killing, so nothing useful can be done. The time spent is
    basically unbounded because it highly depends on scheduling and on the
    ordering on the mutex (I have seen this take hours...).

    This patch replaces the counter by simple {un}lock semantics. As
    mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
    make sure that nobody else races with us, which is guaranteed by
    memcg_oom_mutex.

    We have to be careful while locking subtrees because we can encounter a
    subtree which is already locked. Consider the hierarchy:

            A
           / \
          B   \
         / \   \
        C   D   E

    The B - C - D subtree might already be locked. While we want to allow
    locking the E subtree, because OOM situations there cannot influence
    each other, we definitely do not want to allow locking A.

    Therefore we have to refuse the lock if any subtree is already locked,
    and clear the lock on all nodes that had been set, up to the failure
    point.

    On the other hand we have to make sure that the rest of the world will
    recognize that a group is under OOM even though it doesn't hold the
    lock. Therefore we introduce an under_oom variable which is incremented
    and decremented for the whole subtree when we enter resp. leave
    mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need to be
    updated under memcg_oom_mutex, because its users only check a single
    group and they use atomic operations for that.

    This can be checked easily by the following test case:

    # cgcreate -g memory:A
    # cgset -r memory.use_hierarchy=1 A
    # cgset -r memory.oom_control=1 A
    # cgset -r memory.limit_in_bytes=100M
    # cgset -r memory.memsw.limit_in_bytes=100M
    # cgcreate -g memory:A/B
    # cgset -r memory.oom_control=1 A/B
    # cgset -r memory.limit_in_bytes=20M
    # cgset -r memory.memsw.limit_in_bytes=20M
    # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
    # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

    While B gets the oom_lock, A will not get it. Both of them go to sleep
    and wait for an external action. We can make the limit higher for A to
    force waking it up:

    # cgset -r memory.memsw.limit_in_bytes=300M A
    # cgset -r memory.limit_in_bytes=300M A

    malloc in A has to wake up even though it doesn't have oom_lock.

    Finally, the unlock path is very easy because we always unlock only the
    subtree we previously locked, while we always decrement under_oom.
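
    A sketch of the lock side (rollback details elided; the iterator is the
    hierarchy walker memcg already uses):

      static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
      {
              struct mem_cgroup *iter, *failed = NULL;
              bool cond = true;

              for_each_mem_cgroup_tree_cond(iter, mem, cond) {
                      if (iter->oom_lock) {
                              /* subtree already locked: refuse, roll back */
                              failed = iter;
                              cond = false;
                      } else
                              iter->oom_lock = true;
              }
              if (!failed)
                      return true;
              /* ... clear oom_lock on the nodes set before 'failed' ... */
              return false;
      }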

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In mm/memcontrol.c, there are many lru stat functions as..

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #if MAX_NUMNODES > 1 and others are not. This
    seems bad. This patch consolidates all of these functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macros, such as ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA placement of the counters, this patch also
    seems better.

    Currently, when we gather all LRU information, we scan in the following
    order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we touch cache lines on different nodes in turn.

    After the patch, we scan:
    for_each_node -> for_each_zone -> for_each_lru(mask)

    so we gather the information in the same cacheline at once.

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed with
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages(),
    where swappiness is passed by argument and propagated via scan_control.

    get_swappiness() is a static function, but some planned updates will
    need to get the swappiness from files other than memcontrol.c. This
    patch exports get_swappiness() as mem_cgroup_swappiness(). With this, we
    can remove the swappiness argument from try_to_free... and drop
    swappiness from scan_control; only memcg uses it.
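
    The exported helper is essentially the old one (sketch):

      unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
      {
              struct cgroup *cgrp = memcg->css.cgroup;

              if (cgrp->parent == NULL)     /* root uses the global knob */
                      return vm_swappiness;
              return memcg->swappiness;
      }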

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Ben reported a lockup related to rtc. The lockup happens due to:

    CPU0                                       CPU1

    rtc_irq_set_state()                        __run_hrtimer()
      spin_lock_irqsave(&rtc->irq_task_lock)     rtc_handle_legacy_irq();
                                                   spin_lock(&rtc->irq_task_lock);
      hrtimer_cancel()
        while (callback_running);

    So the running callback never finishes as it's blocked on
    rtc->irq_task_lock.

    Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
    waiting for the callback. Fix this for both rtc_irq_set_state() and
    rtc_irq_set_freq().
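
    A sketch of the resulting pattern (rtc_update_hrtimer() is the helper
    wrapped around hrtimer_try_to_cancel()):

      retry:
      spin_lock_irqsave(&rtc->irq_task_lock, flags);
      /* ... ownership checks ... */
      if (rtc_update_hrtimer(rtc, enabled) < 0) {
              /* callback is running: drop the lock so it can finish */
              spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
              cpu_relax();
              goto retry;
      }
      spin_unlock_irqrestore(&rtc->irq_task_lock, flags);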

    Signed-off-by: Thomas Gleixner
    Reported-by: Ben Greear
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Due to the hrtimer's self-rearming mode, a user can DoS the machine
    simply because it is starved by hrtimer events.

    The RTC hrtimer is self-rearming. We really need to limit the frequency
    to something sensible.
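
    A sketch of the bound, assuming the RTC_MAX_FREQ constant (8192 Hz) the
    patch adds:

      if (freq <= 0 || freq > RTC_MAX_FREQ)
              return -EINVAL;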

    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Ben Greear
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • The code checks the correctness of the parameters but unconditionally
    arms/disarms the hrtimer.

    The result is that a random task might arm/disarm the rtc timer and
    surprise the real owner by either generating events or by stopping
    them.
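
    A sketch of the ownership check done under rtc->irq_task_lock before
    touching the hrtimer:

      if (rtc->irq_task != NULL && task == NULL)
              err = -EBUSY;
      if (rtc->irq_task != task)
              err = -EACCES;
      if (err)
              goto out_unlock;   /* don't arm/disarm for a non-owner */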

    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Ben Greear
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Koichi Yasutake
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The initial value of dev_opp should be ERR_PTR(), since IS_ERR() is used
    to check for errors.
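
    A sketch of the fix (the exact error value is illustrative):

      -       struct device_opp *dev_opp = NULL;
      +       struct device_opp *dev_opp = ERR_PTR(-ENODEV);

    IS_ERR(NULL) is false, so the old initializer could never take the
    error path.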

    Signed-off-by: Jonghwan Choi
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonghwan Choi
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Also removed the dead code in flush_thread().

    Signed-off-by: Mathias Krause
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
    MIPS: Close races in TLB modify handlers.
    MIPS: Add uasm UASM_i_SRL_SAFE macro.
    MIPS: RB532: Use hex_to_bin()
    MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
    MIPS: PowerTV: Provide cpu-feature-overrides.h
    MIPS: Remove pointless return statement from empty void functions.
    MIPS: Limit fixrange_init() to the FIXMAP region
    MIPS: Install handlers for software IRQs
    MIPS: Move FIXADDR_TOP into spaces.h
    MIPS: Add SYNC after cacheflush
    MIPS: pfn_valid() is broken on low memory HIGHMEM systems
    MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
    MIPS: topdown mmap support
    MIPS: Remove redundant addr_limit assignment on exec.
    MIPS: AR7: Replace __attribute__((__packed__)) with __packed
    MIPS: AR7: Remove 'space before tabs' in platform.c
    MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
    MIPS: AR7: Fix trailing semicolon bug in clock.c
    MAINTAINERS: Update MIPS entry.
    MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    ceph: document unlocked d_parent accesses
    ceph: explicitly reference rename old_dentry parent dir in request
    ceph: document locking for ceph_set_dentry_offset
    ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
    ceph: protect d_parent access in ceph_d_revalidate
    ceph: protect access to d_parent
    ceph: handle racing calls to ceph_init_dentry
    ceph: set dir complete frag after adding capability
    rbd: set blk_queue request sizes to object size
    ceph: set up readahead size when rsize is not passed
    rbd: cancel watch request when releasing the device
    ceph: ignore lease mask
    ceph: fix ceph_lookup_open intent usage
    ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
    ceph: fix bad parent_inode calc in ceph_lookup_open
    ceph: avoid carrying Fw cap during write into page cache
    libceph: don't time out osd requests that haven't been received
    ceph: report f_bfree based on kb_avail rather than diffing.
    ceph: only queue capsnap if caps are dirty
    ceph: fix snap writeback when racing with writes
    ...

    Linus Torvalds
     
  • The delay is too large for udelay(), so replace it with mdelay(20).

    Fixes build error:

    ERROR: "__bad_udelay" [drivers/staging/gma500/psb_gfx.ko] undefined!

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Fix build issue caused by undefined struct scatterlist in
    drivers/usb/renesas_usbhs/fifo.c.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Fix build issue caused by undefined struct scatterlist in
    drivers/mmc/host/tmio_mmc.c.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
    jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
    ext3.txt: update the links in the section "useful links" to the latest ones
    ext3: Fix data corruption in inodes with journalled data
    ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
    ext3: Fix compilation with -DDX_DEBUG
    quota: Remove unused declaration
    jbd: Use WRITE_SYNC in journal checkpoint.
    jbd: Fix oops in journal_remove_journal_head()
    ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
    ext3/ioctl.c: silence sparse warnings about different address spaces
    ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
    ext3: Improve truncate error handling
    ext3: use proper little-endian bitops
    ext2: include fs.h into ext2_fs.h
    ext3: Fix oops in ext3_try_to_allocate_with_rsv()
    jbd: fix a bug of leaking jh->b_jcount
    jbd: remove dependency on __GFP_NOFAIL
    ext3: Convert ext3 to new truncate calling convention
    jbd: Add fixed tracepoints
    ext3: Add fixed tracepoints

    Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
    new fixed tracepoints.

    Linus Torvalds
     
  • For the most part we don't care about racing with rename when directing
    MDS requests; either the old or new parent is fine. Document that, and
    do some minor cleanup.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • We carry a pin on the parent directory for the rename source and dest
    dentries. For the source it's r_locked_dir; we need to explicitly
    reference the old_dentry parent as well, since the dentry's d_parent may
    change between when the request was created and pinned and when it is
    freed.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil