29 Jun, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, delay: tsc based udelay should have rdtsc_barrier
    x86, setup: correct include file in <asm/boot.h>
    x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
    x86, mce: percpu mcheck_timer should be pinned
    x86: Add sysctl to allow panic on IOCK NMI error
    x86: Fix uv bau sending buffer initialization
    x86, mce: Fix mce resume on 32bit
    x86: Move init_gbpages() to setup_arch()
    x86: ensure percpu lpage doesn't consume too much vmalloc space
    x86: implement percpu_alloc kernel parameter
    x86: fix pageattr handling for lpage percpu allocator and re-enable it
    x86: reorganize cpa_process_alias()
    x86: prepare setup_pcpu_lpage() for pageattr fix
    x86: rename remap percpu first chunk allocator to lpage
    x86: fix duplicate free in setup_pcpu_remap() failure path
    percpu: fix too lazy vunmap cache flushing
    x86: Set cpu_llc_id on AMD CPUs

    Linus Torvalds
     

24 Jun, 2009

7 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • After downing/upping a cpu, an attempt to set
    /proc/sys/vm/percpu_pagelist_fraction results in an oops in
    percpu_pagelist_fraction_sysctl_handler().

    If a processor is downed then we need to set the pageset pointer back to
    the boot pageset.

    Updates of the high water marks should not access pagesets of unpopulated
    zones (those pointers point to the boot pagesets, which would no longer be
    functional if their size were increased beyond zero).

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
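
    A minimal sketch of the shape of that fix, assuming the handler loops
    over zones resizing per-cpu pagesets as mm/page_alloc.c did in this
    era (setup_pagelist_highmark() and zone_pcp() per that code; the
    exact patch may differ):

        /* percpu_pagelist_fraction_sysctl_handler() -- sketch */
        struct zone *zone;
        int cpu;

        /*
         * Walk populated zones only: unpopulated zones still point
         * at the boot pagesets, whose high water marks must never
         * be raised above zero.
         */
        for_each_populated_zone(zone) {
                for_each_online_cpu(cpu) {
                        unsigned long high;

                        high = zone->present_pages /
                                        percpu_pagelist_fraction;
                        setup_pagelist_highmark(zone_pcp(zone, cpu), high);
                }
        }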
     
  • If a kthread happens to use get_user_pages() on an mm (as KSM does),
    there's a chance that it will end up trying to read in a swap page, then
    oops in grab_swap_token() because the kthread has no mm: GUP passes down
    the right mm, so grab_swap_token() ought to be using it.

    We have not identified a stronger case than KSM's daemon (not yet in
    mainline), but the issue must have come up before, since RHEL has included
    a fix for this for years (though a different fix: they simply back out of
    grab_swap_token() if current->mm is unset, which is what we first proposed;
    but using the right mm here seems more correct).

    Reported-by: Izik Eidus
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
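
    The direction described above, sketched (the function lives in
    mm/thrash.c in this era; the exact signature is assumed):

        /* was: void grab_swap_token(void), dereferencing current->mm */
        void grab_swap_token(struct mm_struct *mm)
        {
                /* mm comes from the fault path, so it is valid even
                 * when a kthread is doing get_user_pages() */
                /* ... token bookkeeping otherwise unchanged ... */
        }

        /* caller in do_swap_page(), mm/memory.c -- sketch */
        grab_swap_token(mm);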
     
  • * 'kmemleak' of git://linux-arm.org/linux-2.6:
    kmemleak: Do not force the slab debugging Kconfig options
    kmemleak: use pr_fmt

    Linus Torvalds
     
  • Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
    and it's tempting to devise a set of Grand Unified Paging flags;
    but not today. So until then, let's rely upon the compiler to spot
    the coincidence, "rather than have that subtle dependency and a
    comment for it" - as you remarked in another context yesterday.

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
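
    One way to let the compiler spot the coincidence is a build-time
    assertion plus direct reuse of the bit -- a sketch, not necessarily
    the exact form the commit took (gup_flags is an assumed name):

        /* in some convenient function: fail the build if the two
         * flag values ever diverge */
        BUILD_BUG_ON(FOLL_WRITE != FAULT_FLAG_WRITE);

        /* ...which lets GUP hand the bit straight through: */
        unsigned int fault_flags = gup_flags & FOLL_WRITE;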
     
  • handle_mm_fault() is now passing fault flags rather than write_access
    down to hugetlb_fault(), so better recognize that in hugetlb_fault(),
    and in hugetlb_no_page().

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
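
    The shape of the interface change, sketched (argument names assumed):

        /* was: int hugetlb_fault(..., int write_access); */
        int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                          unsigned long address, unsigned int flags);

        /* inside hugetlb_fault()/hugetlb_no_page(), tests become: */
        if (flags & FAULT_FLAG_WRITE)
                /* ... write-fault handling ... */;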
     
  • The isolated page is "cursor_page" not "page".

    This could cause LRU list corruption under memory pressure, caught by
    CONFIG_DEBUG_LIST.

    Reported-by: Ingo Molnar
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Tested-by: Daisuke Nishimura
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
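
    Sketched against the lumpy-reclaim loop in isolate_lru_pages()
    (mm/vmscan.c); the assumed one-liner is that the memcg LRU hook must
    be told about the page that was actually isolated:

        list_move(&cursor_page->lru, dst);
        mem_cgroup_del_lru(cursor_page); /* was: mem_cgroup_del_lru(page) */
        nr_taken++;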
     

22 Jun, 2009

4 commits

  • According to Andi, it isn't clear whether the lpage allocator is worth
    the trouble as there are many processors where PMD TLB is far scarcer
    than PTE TLB. The advantage or disadvantage probably depends on the
    actual size of the percpu area and the specific processor. As performance
    degradation due to TLB pressure tends to be highly workload specific
    and subtle, it is difficult to decide which way to go without more
    data.

    This patch implements percpu_alloc kernel parameter to allow selecting
    which first chunk allocator to use to ease debugging and testing.

    While at it, make sure all the failure paths report why something
    failed, to help determine why a certain allocator isn't working. Also,
    kill the "Great future plan" comment which had already been realized
    quite some time ago.

    [ Impact: allow explicit percpu first chunk allocator selection ]

    Signed-off-by: Tejun Heo
    Reported-by: Jan Beulich
    Cc: Andi Kleen
    Cc: Ingo Molnar

    Tejun Heo
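
    A sketch of the parameter plumbing (identifier and option names are
    assumptions; the era's first-chunk allocators were embed, 4k and
    lpage):

        /* arch/x86/kernel/setup_percpu.c -- sketch */
        enum { PCPU_FC_AUTO, PCPU_FC_EMBED, PCPU_FC_4K, PCPU_FC_LPAGE };
        static int pcpu_chosen_fc __initdata = PCPU_FC_AUTO;

        static int __init percpu_alloc_setup(char *str)
        {
                if (!strcmp(str, "embed"))
                        pcpu_chosen_fc = PCPU_FC_EMBED;
                else if (!strcmp(str, "4k"))
                        pcpu_chosen_fc = PCPU_FC_4K;
                else if (!strcmp(str, "lpage"))
                        pcpu_chosen_fc = PCPU_FC_LPAGE;
                else
                        pr_warning("PERCPU: unknown allocator %s\n", str);
                return 0;
        }
        early_param("percpu_alloc", percpu_alloc_setup);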
     
  • In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as
    the page is going to be returned to the page allocator. Only TLB
    flushing can be put off such that vmalloc code can handle it lazily.
    Fix it.

    [ Impact: fix subtle virtual cache flush bug ]

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin
    Cc: Ingo Molnar

    Tejun Heo
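
    The ordering rule, sketched (the exact code in mm/percpu.c differs):

        /* pcpu_unmap() -- sketch */
        unmap_kernel_range_noflush(addr, size);

        /* must happen now: these pages go straight back to the page
         * allocator, so any virtually-indexed cache lines for this
         * mapping have to be written back/invalidated first */
        flush_cache_vunmap(addr, addr + size);

        /* the TLB flush alone may be left to vmalloc's lazy flushing */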
     
  • This allows the callers to now pass down the full set of FAULT_FLAG_xyz
    flags to handle_mm_fault(). All callers have been (mechanically)
    converted to the new calling convention; there's almost certainly room
    for architectures to clean up their code and then add FAULT_FLAG_RETRY
    when that support is added.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
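
    The mechanical conversion at each call site, sketched from the
    description above (individual architectures vary):

        /* typical arch page-fault handler -- sketch */
        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);
        /* was: fault = handle_mm_fault(mm, vma, address, write); */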
     
  • The fault handling routines really want more fine-grained flags than a
    single "was it a write fault" boolean - the callers will want to set
    flags like "you can return a retry error" etc.

    And that's actually how the VM works internally, but right now the
    top-level fault handling functions in mm/memory.c all pass just the
    'write_access' boolean around.

    This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
    variable instead. The 'write_access' calling convention still exists
    for the exported 'handle_mm_fault()' function, but that is next.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
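
    The flag space being passed around, as it stood in include/linux/mm.h
    of this era (retry-style flags come later):

        #define FAULT_FLAG_WRITE     0x01  /* fault was a write fault */
        #define FAULT_FLAG_NONLINEAR 0x02  /* fault via nonlinear mapping */

        /* and internally, e.g. in mm/memory.c -- sketch: */
        if (flags & FAULT_FLAG_WRITE)      /* was: if (write_access) */
                /* ... */;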
     

21 Jun, 2009

1 commit

  • da456f1 "page allocator: do not disable interrupts in free_page_mlock()"
    moved the PG_mlocked clearing after the flag sanity checking, which makes
    mlocked pages always trigger 'bad page'. Fix this by clearing the bit up
    front.

    Reported-and-debugged-by: Peter Chubb
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Tested-by: Maxim Levitsky
    Signed-off-by: Linus Torvalds

    Johannes Weiner
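
    A sketch of the reordering in the free path (TestClearPageMlocked()
    as the up-front clear is an assumed detail):

        /* __free_pages_ok()/free_hot_cold_page() -- sketch */
        int wasMlocked = TestClearPageMlocked(page); /* clear up front */

        /* the bad_page()/free_pages_check() sanity checks now see a
         * clean flag set */

        if (unlikely(wasMlocked))
                free_page_mlock(page);  /* counter adjustment only */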
     

19 Jun, 2009

7 commits

  • The page allocator also needs the masking of gfp flags during boot,
    so this moves it out of slab/slub and uses it with the page allocator
    as well.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
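
    The shared-mask idea, sketched from the description above (exact
    placement may differ):

        /* one boot-time mask consulted by slab and the page allocator;
         * GFP_BOOT_MASK strips e.g. __GFP_WAIT/__GFP_IO/__GFP_FS */
        gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, ...)
        {
                gfp_mask &= gfp_allowed_mask; /* safe while irqs are
                                               * still off during boot */
                /* ... normal allocation path ... */
        }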
     
  • Try to fix memcg's LRU rotation sanity: make memcg use the same logic as
    the global LRU does.

    Now, when __isolate_lru_page() returns -EBUSY, the page is rotated to the
    tail of the LRU in the global LRU's isolate-LRU-pages path. But in memcg
    it's not handled. This makes memcg behave the same as the global LRU and
    rotate the LRU when the page is busy.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
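
    A sketch of the -EBUSY handling in mem_cgroup_isolate_pages()
    (mm/memcontrol.c), mirroring the global LRU's isolator; structure
    assumed:

        switch (__isolate_lru_page(page, mode, file)) {
        case 0:
                list_move(&page->lru, dst);
                nr_taken++;
                break;
        case -EBUSY:
                /* rotate the busy page, as the global LRU does */
                list_move(&page->lru, src);
                break;
        default:
                break;
        }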
     
  • A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the
    user just wants to limit the total size of applications; in other words,
    when the user is not very interested in memory usage itself. In this case,
    swap-out will be done only by the global LRU.

    But under the current implementation, memory.limit_in_bytes is checked
    first, and try_to_free_page() may swap out pages. That swap-out is useless
    for memsw.limit_in_bytes, and the thread may hit the limit again.

    This patch fixes the current behavior in the memory.limit == memsw.limit
    case. Documentation is updated to explain the behavior of this special
    case.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
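
    A sketch of the special case (memsw_is_minimum is an assumed field
    name): when the two limits are equal, swapping pages out cannot
    reduce mem+swap usage, so reclaim is told not to bother with swap:

        /* at limit-setting time: remember the relationship */
        memcg->memsw_is_minimum =
                (res_counter_read_u64(&memcg->res, RES_LIMIT) >=
                 res_counter_read_u64(&memcg->memsw, RES_LIMIT));

        /* at reclaim time: */
        if (memcg->memsw_is_minimum)
                noswap = true;  /* swap-out would be useless here */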
     
  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    swap is completely freed. But there are several cases where swap cannot be
    freed cleanly. To handle them, this patch changes memcg to uncharge the
    swap account when the swap entry has no references other than the cache.

    By this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
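
    A heavily hedged sketch of the new uncharge point (the swap-map
    encoding shown is an assumption): the memcg swap charge is dropped as
    soon as the last non-cache reference to the entry goes away, rather
    than waiting for the entry itself to be freed:

        /* mm/swapfile.c, swap_entry_free() -- sketch */
        count = p->swap_map[offset];
        has_cache = count & SWAP_HAS_CACHE;
        count &= ~SWAP_HAS_CACHE;
        /* ... drop one usage ... */
        if (!count)
                mem_cgroup_uncharge_swap(ent); /* only the cache remains */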
     
  • We don't need to check do_swap_account in functions that will never get
    called if do_swap_account == 0.

    Signed-off-by: Li Zefan
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Add file RSS tracking per memory cgroup

    We currently don't track file RSS; the RSS we report is actually anon RSS.
    All file-mapped pages come in through the page cache and get accounted
    there. This patch adds support for accounting file RSS pages.
    It should

    1. Help improve the metrics reported by the memory resource controller
    2. Will form the basis for a future shared memory accounting heuristic
    that has been proposed by Kamezawa.

    Unfortunately, we cannot rename the existing "rss" keyword used in
    memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope
    to educate the end user through documentation.

    [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
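
    A sketch of the accounting hook named in the bracketed fixup above,
    called with +1/-1 from the file rmap add/remove paths (internals
    assumed):

        void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
        {
                struct page_cgroup *pc = lookup_page_cgroup(page);
                struct mem_cgroup_stat_cpu *stat; /* via pc->mem_cgroup,
                                                   * lookup elided */

                __mem_cgroup_stat_add_safe(stat,
                                MEM_CGROUP_STAT_MAPPED_FILE, val);
        }

        /* page_add_file_rmap():  mem_cgroup_update_mapped_file_stat(page, 1);
         * page_remove_rmap():    mem_cgroup_update_mapped_file_stat(page, -1); */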
     
  • Fix some cgroup messages to read better.
    Update MAINTAINERS to include mm/*cgroup* files.

    Signed-off-by: Randy Dunlap
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Jun, 2009

9 commits

  • Conflicts:
    mm/slub.c

    Pekka Enberg
     
  • Pekka Enberg
     
  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • During lumpy reclaim, a page that __isolate_lru_page() failed to take can
    be pushed back to the "src" list by list_move(). But the page may not be
    from the "src" list at all, so this pushes it back to the wrong LRU. The
    list_move() is also unnecessary because the page is not at the top of the
    LRU. So leave the page as it is if __isolate_lru_page() fails.

    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
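
    The assumed shape of the fix in the lumpy-reclaim loop of
    isolate_lru_pages() (mm/vmscan.c):

        switch (__isolate_lru_page(cursor_page, mode, file)) {
        case 0:
                list_move(&cursor_page->lru, dst);
                nr_taken++;
                break;
        default:
                /* was: list_move(&cursor_page->lru, src);
                 * cursor_page may not be from "src" at all, and it is
                 * not at the top of the LRU -- just leave it alone */
                break;
        }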
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation, so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
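
    The counter, sketched (after akpm's naming cleanup the /proc/vmstat
    line is plausibly "zone_reclaim_failed"; the event name is an
    assumption):

        /* include/linux/vmstat.h: new vm_event_item, sketch */
        PGSCAN_ZONE_RECLAIM_FAILED,

        /* wherever zone_reclaim() fails to reclaim: */
        count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);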
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met. The problem is that zone_reclaim() failing at all means the
    zone gets marked full.

    This can cause situations where a zone is usable, but is being skipped
    because it has been considered full. Take a situation where a large tmpfs
    mount is occupying a large percentage of memory overall. The pages do not
    get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
    and the zonelist cache considers them not worth trying in the future.

    This patch makes zone_reclaim() return more fine-grained information about
    what occurred when zone_reclaim() failed. The zone only gets marked full
    if it really is unreclaimable. If the scan did not occur, or if not enough
    pages were reclaimed with the limited reclaim_mode, then the zone is
    simply skipped.

    There is a side-effect to this patch. Currently, if zone_reclaim()
    successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
    ahead. With this patch applied, zone watermarks are rechecked after
    zone_reclaim() does some work.

    This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
    ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
    zonelist_cache was introduced. It was not intended that zone_reclaim()
    aggressively consider the zone to be full when it failed as full direct
    reclaim can still be an option. Due to the age of the bug, it should be
    considered a -stable candidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
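
    A sketch of the fine-grained outcomes and how the zonelist walk
    reacts (constant and label names assumed):

        #define ZONE_RECLAIM_NOSCAN  -2  /* heuristics skipped the scan */
        #define ZONE_RECLAIM_FULL    -1  /* scan ran, nothing reclaimable */
        #define ZONE_RECLAIM_SOME     0  /* reclaimed some, not enough */
        #define ZONE_RECLAIM_SUCCESS  1  /* reclaimed enough pages */

        /* get_page_from_freelist() -- sketch */
        switch (zone_reclaim(zone, gfp_mask, order)) {
        case ZONE_RECLAIM_NOSCAN:
                goto try_next_zone;      /* skip, but don't mark full */
        case ZONE_RECLAIM_FULL:
                goto this_zone_full;     /* genuinely unreclaimable */
        default:
                /* even on success, recheck the watermark first */
                if (!zone_watermark_ok(zone, order, mark,
                                       classzone_idx, alloc_flags))
                        goto this_zone_full;
        }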
     
  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    were uselessly scanned frequently, making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporary resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but the
    problem is that the heuristic is not being properly applied and basically
    assumes zone_reclaim_mode is 1 if it is enabled. The lack of proper
    detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled, it depended on NR_FILE_PAGES, which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed such as swapcache and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode, it will consider NR_FILE_PAGES as potential candidates;
    otherwise it uses NR_{IN}ACTIVE_PAGES - NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY pages are not candidates. If RECLAIM_SWAP is not set,
    then NR_FILE_MAPPED pages are not candidates either.

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
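
    A sketch of the estimate (helper names assumed, but the arithmetic
    follows the description and the two bracketed fixups above):

        static long zone_pagecache_reclaimable(struct zone *zone)
        {
                long nr_pagecache_reclaimable;
                long delta = 0;

                /* RECLAIM_SWAP: swapcache is fair game, count all file
                 * pages; otherwise discount non-file-backed pages */
                if (zone_reclaim_mode & RECLAIM_SWAP)
                        nr_pagecache_reclaimable =
                                zone_page_state(zone, NR_FILE_PAGES);
                else
                        nr_pagecache_reclaimable =
                                zone_unmapped_file_pages(zone);

                /* can't write back: dirty pages aren't candidates */
                if (!(zone_reclaim_mode & RECLAIM_WRITE))
                        delta += zone_page_state(zone, NR_FILE_DIRTY);

                /* guard against estimate-induced underflow */
                if (unlikely(delta > nr_pagecache_reclaimable))
                        delta = nr_pagecache_reclaimable;

                return nr_pagecache_reclaimable - delta;
        }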
     
  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it's possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
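
    The assumed shape of the change in mm/oom_kill.c:

        static void __oom_kill_task(struct task_struct *p, int verbose)
        {
                if (is_global_init(p)) {
                        /* ... refuse to kill init ... */
                        return;
                }
                if (!p->mm)
                        return; /* already detached its mm: nothing to
                                 * boost, and no longer worth a warning */
                /* ... elevate timeslice, grant reserves, kill ... */
        }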
     
  • Commit 2e2e425989080cc534fc0fca154cae515f971cf5 ("vmscan,memcg:
    reintroduce sc->may_swap") added the may_swap flag and handles it in
    get_scan_ratio().

    But the result of get_scan_ratio() is ignored when priority == 0, so the
    anon LRU is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO,
    this is not expected behavior.

    As for memcg especially, because of this behavior many, many pages are
    swapped out in vain when the OOM killer is invoked by the mem+swap limit.

    This patch handles the may_swap flag more strictly.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
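
    A sketch of the stricter handling in shrink_zone() (mm/vmscan.c),
    following the description above; local names are assumptions:

        enum lru_list l;
        unsigned long percent[2];       /* anon @ 0, file @ 1 */
        int noswap = 0;

        /* no swap allowed/available: don't scan anon at all */
        if (!sc->may_swap || (nr_swap_pages <= 0)) {
                noswap = 1;
                percent[0] = 0;
                percent[1] = 100;
        } else
                get_scan_ratio(zone, sc, percent);

        for_each_evictable_lru(l) {
                unsigned long scan = zone_nr_pages(zone, sc, l);

                /* apply the ratio even at priority == 0 when noswap
                 * is set, instead of force-scanning everything */
                if (priority || noswap) {
                        scan >>= priority;
                        scan = (scan * percent[is_file_lru(l)]) / 100;
                }
                /* ... */
        }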