18 Jun, 2009

4 commits


17 Jun, 2009

36 commits

  • Conflicts:
    mm/slub.c

    Pekka Enberg
     
  • Pekka Enberg
     
  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compilation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
    At lumpy reclaim, a page that __isolate_lru_page() fails to take can be
    pushed back to the "src" list by list_move(). But the page may not be from
    the "src" list, so this pushes the page back to the wrong LRU. And
    list_move() itself is unnecessary because the page is not at the top of the
    LRU. So leave it as it is if __isolate_lru_page() fails.

    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.
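
A minimal user-space sketch of how such a counter could be watched, assuming the usual "name value" line layout of /proc/vmstat. The helper names `vmstat_lookup` and `counter_rising_fast` and the threshold are hypothetical, and the counter name is passed in by the caller rather than fixed here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in a buffer laid out like /proc/vmstat
 * (one "name value" pair per line). Returns 0 and stores the value
 * on success, -1 if the counter is absent or malformed.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long *value)
{
    size_t name_len;
    const char *line = text;

    assert(text && name && value);
    name_len = strlen(name);

    while (line && *line) {
        /* Require the full name followed by the separating space. */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ') {
            if (sscanf(line + name_len + 1, "%lu", value) == 1)
                return 0;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next entry */
    }
    return -1;
}

/*
 * Hypothetical check: did the counter grow by more than `threshold`
 * between two samples taken some interval apart?
 */
int counter_rising_fast(unsigned long before, unsigned long after,
                        unsigned long threshold)
{
    return after > before && after - before > threshold;
}
```

Sampling the counter twice during a period of high CPU utilisation and comparing the two values is enough to tell whether it is increasing rapidly; what counts as "rapidly" is left to the administrator.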

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met. The problem is that zone_reclaim() failing at all means the
    zone gets marked full.

    This can cause situations where a zone is usable, but is being skipped
    because it has been considered full. Take a situation where a large tmpfs
    mount is occupying a large percentage of memory overall. The pages do not
    get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
    and the zonelist cache considers them not worth trying in the future.

    This patch makes zone_reclaim() return more fine-grained information about
    what occurred when zone_reclaim() failed. The zone only gets marked full
    if it really is unreclaimable. If it's a case that the scan did not occur
    or if enough pages were not reclaimed with the limited reclaim_mode, then
    the zone is simply skipped.
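
The decision described above can be sketched as follows. This models the text of this changelog, not the patch itself: the result-code names and values, the `zone_action` enum and `after_zone_reclaim()` are all assumptions made for illustration.

```c
#include <assert.h>

/* Illustrative result codes; names and values are assumptions of this sketch. */
enum zone_reclaim_result {
    ZONE_RECLAIM_NOSCAN  = -2, /* the scan did not occur */
    ZONE_RECLAIM_FULL    = -1, /* scanned, but the zone is unreclaimable */
    ZONE_RECLAIM_SOME    =  0, /* reclaimed some pages, but not enough */
    ZONE_RECLAIM_SUCCESS =  1  /* reclaimed the requested number of pages */
};

/* What the allocator does with the zone afterwards (hypothetical enum). */
enum zone_action {
    ZONE_TRY_ALLOC, /* go ahead with the allocation attempt */
    ZONE_SKIP,      /* move on, but do not remember the zone as full */
    ZONE_MARK_FULL  /* record the zone as full in the zonelist cache */
};

enum zone_action after_zone_reclaim(enum zone_reclaim_result ret,
                                    int watermark_ok)
{
    switch (ret) {
    case ZONE_RECLAIM_NOSCAN:
        return ZONE_SKIP;
    case ZONE_RECLAIM_FULL:
        return ZONE_MARK_FULL;
    default:
        assert(ret == ZONE_RECLAIM_SOME || ret == ZONE_RECLAIM_SUCCESS);
        /* Some work was done: recheck the watermark before allocating. */
        return watermark_ok ? ZONE_TRY_ALLOC : ZONE_SKIP;
    }
}
```

Only the "really unreclaimable" case marks the zone full; every other failure leaves the zone eligible for later attempts.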

    There is a side-effect to this patch. Currently, if zone_reclaim()
    successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
    ahead. With this patch applied, zone watermarks are rechecked after
    zone_reclaim() does some work.

    This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
    ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
    zonelist_cache was introduced. It was not intended that zone_reclaim()
    aggressively consider the zone to be full when it failed as full direct
    reclaim can still be an option. Due to the age of the bug, it should be
    considered a -stable candidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    are uselessly scanned frequently, making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporary resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but the
    problem is that the heuristic is not being properly applied and is
    basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of
    proper detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled it was depending on NR_FILE_PAGES which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed such as swapcache and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim_mode==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode it will either consider NR_FILE_PAGES as potential
    candidates or else use NR_{IN,}ACTIVE_PAGES-NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
    not set, then NR_FILE_MAPPED are not.
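
The estimate described above can be sketched in user-space C as follows. The struct, the function names and the numeric values of the RECLAIM_* bits are assumptions of this sketch; the clamping to zero reflects the underflow fix credited below.

```c
#include <assert.h>

/* Mode bits; the numeric values are an assumption of this sketch. */
#define RECLAIM_ZONE  (1 << 0)
#define RECLAIM_WRITE (1 << 1)
#define RECLAIM_SWAP  (1 << 2)

/* Per-zone counters used by the estimate (hypothetical struct). */
struct zone_counts {
    unsigned long file_pages;    /* NR_FILE_PAGES, includes swapcache */
    unsigned long inactive_file; /* pages on the inactive file LRU */
    unsigned long active_file;   /* pages on the active file LRU */
    unsigned long file_mapped;   /* NR_FILE_MAPPED */
    unsigned long file_dirty;    /* NR_FILE_DIRTY */
};

/* File LRU pages minus mapped pages, clamped at zero to avoid underflow. */
static unsigned long unmapped_file_pages(const struct zone_counts *z)
{
    unsigned long file_lru = z->inactive_file + z->active_file;

    return file_lru > z->file_mapped ? file_lru - z->file_mapped : 0;
}

/* Pages zone_reclaim() might be able to reclaim under the given mode. */
unsigned long pagecache_reclaimable(const struct zone_counts *z, int mode)
{
    unsigned long nr, delta = 0;

    assert(z);

    /* With RECLAIM_SWAP, mapped and swapcache pages are candidates too. */
    if (mode & RECLAIM_SWAP)
        nr = z->file_pages;
    else
        nr = unmapped_file_pages(z);

    /* Without RECLAIM_WRITE, dirty pages cannot be cleaned, so discount them. */
    if (!(mode & RECLAIM_WRITE))
        delta += z->file_dirty;

    if (delta > nr)
        delta = nr;
    return nr - delta;
}
```

For a zone with 300 inactive and 200 active file pages, 100 of them mapped and 50 dirty, mode 1 yields 350 candidates, while also setting RECLAIM_WRITE raises that to 400.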

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it's possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Commit 2e2e425989080cc534fc0fca154cae515f971cf5 ("vmscan,memcg:
    reintroduce sc->may_swap") added the may_swap flag and handles it in
    get_scan_ratio().

    But the result of get_scan_ratio() is ignored when priority == 0, so anon
    lru is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO, this is
    not an expected behavior.

    As for memcg especially, because of this behavior a great many pages are
    swapped out in vain when oom is invoked by the mem+swap limit.

    This patch is for handling may_swap flag more strictly.
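
A simplified sketch of the stricter handling, assuming the scan count for one LRU list is the list size shifted by the priority and then scaled by a percentage. `pages_to_scan()`, `enum lru_kind` and the `ratio` argument (standing in for the result of get_scan_ratio()) are hypothetical.

```c
#include <assert.h>

enum lru_kind { LRU_ANON = 0, LRU_FILE = 1 };

/*
 * How many pages of one LRU list to scan in this pass.
 * `ratio` holds the anon/file percentages that would normally be used.
 */
unsigned long pages_to_scan(unsigned long lru_size, int priority,
                            enum lru_kind kind, int may_swap,
                            long nr_swap_pages, const unsigned int ratio[2])
{
    unsigned int percent[2];
    unsigned long scan = lru_size;
    int noswap = !may_swap || nr_swap_pages <= 0;

    assert(priority >= 0);

    if (noswap) {
        /* Swapping is impossible or forbidden: never scan the anon LRU. */
        percent[LRU_ANON] = 0;
        percent[LRU_FILE] = 100;
    } else {
        percent[LRU_ANON] = ratio[LRU_ANON];
        percent[LRU_FILE] = ratio[LRU_FILE];
    }

    /*
     * Previously the percentages were applied only when priority != 0,
     * so at priority 0 the anon LRU was scanned regardless. Applying
     * them whenever noswap is set closes that hole.
     */
    if (priority || noswap) {
        scan >>= priority;
        scan = scan * percent[kind] / 100;
    }
    return scan;
}
```

At priority 0 with swap available the whole list is still scanned, as before; only the no-swap case changes.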

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The "move pages to active list" and "move pages to inactive list" code
    blocks are mostly identical and can be served by a function.

    Thanks to Andrew Morton for pointing this out.

    Note that buffer_heads_over_limit check will also be carried out for
    re-activated pages, which is slightly different from pre-2.6.28 kernels.
    Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
    scans of active file list when memory pressure is low. So the net effect
    could be, the number of buffer heads is now more likely to grow large.

    However that's fine according to Johannes' comments:

    I don't think that this could be harmful. We just preserve the buffer
    mappings of what we consider the working set and with low memory
    pressure, as you say, this set is not big.

    As to stripping of reactivated pages: the only pages we re-activate
    for now are those VM_EXEC mapped ones. Since we don't expect IO from
    or to these pages, removing the buffer mappings in case they grow too
    large should be okay, I guess.

    Cc: Pekka Enberg
    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Protect referenced PROT_EXEC mapped pages from being deactivated.

    PROT_EXEC (or its internal representation VM_EXEC) pages normally belong to
    some currently running executables and their linked libraries. They should
    be cached aggressively to provide a good user experience.

    Thanks to Johannes Weiner for the advice to reuse the VMA walk in
    page_referenced() to get the PROT_EXEC bit.

    [more details]

    ( The consequences of this patch will have to be discussed together with
    Rik van Riel's recent patch "vmscan: evict use-once pages first". )

    ( Some of the good points and insights are taken into this changelog.
    Thanks to all the involved people for the great LKML discussions. )

    the problem
    ===========

    For a typical desktop, the most precious working set is composed of
    *actively accessed*
    (1) memory mapped executables
    (2) and their anonymous pages
    (3) and other files
    (4) and the dcache/icache/.. slabs
    while the least important data are
    (5) infrequently used or use-once files

    For a typical desktop, one major problem is a bursty and large amount of (5)
    use-once files flushing out the working set.

    Inside the working set, (4) dcache/icache have already been too sticky ;-)
    So we only have to care about (2) anonymous and (1)(3) file pages.

    anonymous pages
    ===============

    Anonymous pages are effectively immune to the streaming IO attack, because we
    now have separate file/anon LRU lists. When the use-once files crowd into the
    file LRU, the list's "quality" is significantly lowered. Therefore the scan
    balance policy in get_scan_ratio() will choose to scan the (low quality) file
    LRU much more frequently than the anon LRU.

    file pages
    ==========

    Rik proposed to *not* scan the active file LRU when the inactive list grows
    larger than the active list. This guarantees that when there is use-once
    streaming IO, and the working set is not too large (so that active_size <
    inactive_size), the active file LRU will *not* be scanned at all. So the
    not-too-large working set can be well protected.

    But there are also situations where the file working set is a bit large so that
    (active_size >= inactive_size), or the streaming IOs are not purely use-once.
    In these cases, the active list will be scanned slowly. Because the current
    shrink_active_list() policy is to deactivate active pages regardless of their
    referenced bits, the deactivated pages become susceptible to the streaming IO
    attack: the inactive list could be scanned fast (500MB / 50MBps = 10s), so
    the deactivated pages don't have enough time to get re-referenced, because
    users tend to switch between windows in intervals from seconds to minutes.

    This patch holds mapped executable pages in the active list as long as they
    are referenced during each full scan of the active list. Because the active
    list is normally scanned much slower, they get longer grace time (eg. 100s)
    for further references, which better matches the pace of user operations.

    Therefore this patch greatly prolongs the in-cache time of executable code,
    when there are moderate memory pressures.

    before patch: guaranteed to be cached if reference intervals < I
    after patch: guaranteed to be cached if reference intervals < I+A
    (except when randomly reclaimed by the lumpy reclaim)
    where
    A = time to fully scan the active file LRU
    I = time to fully scan the inactive file LRU

    Note that normally A >> I.
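
The rule and the resulting guarantee can be sketched like this. The struct and function names are hypothetical, the VM_EXEC bit value is illustrative, and restricting the protection to file-backed (non-anonymous) pages is an assumption of this sketch.

```c
#include <assert.h>

#define VM_EXEC 0x00000004UL /* bit value used for illustration only */

/* What the active-list scan knows about one page (hypothetical struct). */
struct page_state {
    int referenced;         /* referenced since the last full scan? */
    unsigned long vm_flags; /* flags of the VMAs that referenced it */
    int anon;               /* anonymous rather than file-backed? */
};

/*
 * Decide whether a page found on the active list stays there.
 * Only referenced, executable-mapped file pages are protected;
 * everything else is deactivated as before.
 */
int keep_on_active_list(const struct page_state *p)
{
    assert(p);
    return p->referenced && (p->vm_flags & VM_EXEC) && !p->anon;
}

/*
 * Longest reference interval (in seconds) for which a page is still
 * guaranteed to stay cached: I without the protection, I + A with it.
 */
unsigned long guaranteed_interval(unsigned long inactive_scan_secs,
                                  unsigned long active_scan_secs,
                                  int exec_protected)
{
    return inactive_scan_secs + (exec_protected ? active_scan_secs : 0);
}
```

With the example figures used above (I = 10s, A = 100s), a protected page survives reference intervals of up to 110s instead of 10s, lumpy reclaim aside.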

    side effects
    ============

    This patch is safe in general: it restores the pre-2.6.28 mmap() behavior
    but in a much smaller and well targeted scope.

    One may worry about someone abusing the PROT_EXEC heuristic. But as
    Andrew Morton stated, there are other tricks for getting that sort of boost.

    Another concern is the PROT_EXEC mapped pages growing large in rare cases,
    and therefore hurting reclaim efficiency. But a sane application targeted at a
    large audience will never use PROT_EXEC for data mappings. If some home-made
    application tries to abuse that bit, it should be aware of the consequences.
    If it is abused to the scale of 2/3 of total memory, it gains nothing but
    overhead.

    benchmarks
    ==========

    1) memory tight desktop

    1.1) brief summary

    - clock time and major faults are reduced by 50%;
    - pswpin numbers are reduced to ~1/3.

    That means X desktop responsiveness is doubled under high memory/swap pressure.

    1.2) test scenario

    - nfsroot gnome desktop with 512M physical memory
    - run some programs, and switch between the existing windows
    after starting each new program.

    1.3) progress timing (seconds)

    before after programs
    0.02 0.02 N xeyes
    0.75 0.76 N firefox
    2.02 1.88 N nautilus
    3.36 3.17 N nautilus --browser
    5.26 4.89 N gthumb
    7.12 6.47 N gedit
    9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
    13.58 12.55 N xterm
    15.87 14.57 N mlterm
    18.63 17.06 N gnome-terminal
    21.16 18.90 N urxvt
    26.24 23.48 N gnome-system-monitor
    28.72 26.52 N gnome-help
    32.15 29.65 N gnome-dictionary
    39.66 36.12 N /usr/games/sol
    43.16 39.27 N /usr/games/gnometris
    48.65 42.56 N /usr/games/gnect
    53.31 47.03 N /usr/games/gtali
    58.60 52.05 N /usr/games/iagno
    65.77 55.42 N /usr/games/gnotravex
    70.76 61.47 N /usr/games/mahjongg
    76.15 67.11 N /usr/games/gnome-sudoku
    86.32 75.15 N /usr/games/glines
    92.21 79.70 N /usr/games/glchess
    103.79 88.48 N /usr/games/gnomine
    113.84 96.51 N /usr/games/gnotski
    124.40 102.19 N /usr/games/gnibbles
    137.41 114.93 N /usr/games/gnobots2
    155.53 125.02 N /usr/games/blackjack
    179.85 135.11 N /usr/games/same-gnome
    224.49 154.50 N /usr/bin/gnome-window-properties
    248.44 162.09 N /usr/bin/gnome-default-applications-properties
    282.62 173.29 N /usr/bin/gnome-at-properties
    323.72 188.21 N /usr/bin/gnome-typing-monitor
    363.99 199.93 N /usr/bin/gnome-at-visual
    394.21 206.95 N /usr/bin/gnome-sound-properties
    435.14 224.49 N /usr/bin/gnome-at-mobility
    463.05 234.11 N /usr/bin/gnome-keybinding-properties
    503.75 248.59 N /usr/bin/gnome-about-me
    554.00 276.27 N /usr/bin/gnome-display-properties
    615.48 304.39 N /usr/bin/gnome-network-preferences
    693.03 342.01 N /usr/bin/gnome-mouse-properties
    759.90 388.58 N /usr/bin/gnome-appearance-properties
    937.90 508.47 N /usr/bin/gnome-control-center
    1109.75 587.57 N /usr/bin/gnome-keyboard-properties
    1399.05 758.16 N : oocalc
    1524.64 830.03 N : oodraw
    1684.31 900.03 N : ooimpress
    1874.04 993.91 N : oomath
    2115.12 1081.89 N : ooweb
    2369.02 1161.99 N : oowriter

    Note that the last ": oo*" commands are actually commented out.

    1.4) vmstat numbers (some relevant ones are marked with *)

    before after
    nr_free_pages 1293 3898
    nr_inactive_anon 59956 53460
    nr_active_anon 26815 30026
    nr_inactive_file 2657 3218
    nr_active_file 2019 2806
    nr_unevictable 4 4
    nr_mlock 4 4
    nr_anon_pages 26706 27859
    *nr_mapped 3542 4469
    nr_file_pages 72232 67681
    nr_dirty 1 0
    nr_writeback 123 19
    nr_slab_reclaimable 3375 3534
    nr_slab_unreclaimable 11405 10665
    nr_page_table_pages 8106 7864
    nr_unstable 0 0
    nr_bounce 0 0
    *nr_vmscan_write 394776 230839
    nr_writeback_temp 0 0
    numa_hit 6843353 3318676
    numa_miss 0 0
    numa_foreign 0 0
    numa_interleave 1719 1719
    numa_local 6843353 3318676
    numa_other 0 0
    *pgpgin 5954683 2057175
    *pgpgout 1578276 922744
    *pswpin 1486615 512238
    *pswpout 394568 230685
    pgalloc_dma 277432 56602
    pgalloc_dma32 6769477 3310348
    pgalloc_normal 0 0
    pgalloc_movable 0 0
    pgfree 7048396 3371118
    pgactivate 2036343 1471492
    pgdeactivate 2189691 1612829
    pgfault 3702176 3100702
    *pgmajfault 452116 201343
    pgrefill_dma 12185 7127
    pgrefill_dma32 334384 653703
    pgrefill_normal 0 0
    pgrefill_movable 0 0
    pgsteal_dma 74214 22179
    pgsteal_dma32 3334164 1638029
    pgsteal_normal 0 0
    pgsteal_movable 0 0
    pgscan_kswapd_dma 1081421 1216199
    pgscan_kswapd_dma32 58979118 46002810
    pgscan_kswapd_normal 0 0
    pgscan_kswapd_movable 0 0
    pgscan_direct_dma 2015438 1086109
    pgscan_direct_dma32 55787823 36101597
    pgscan_direct_normal 0 0
    pgscan_direct_movable 0 0
    pginodesteal 3461 7281
    slabs_scanned 564864 527616
    kswapd_steal 2889797 1448082
    kswapd_inodesteal 14827 14835
    pageoutrun 43459 21562
    allocstall 9653 4032
    pgrotated 384216 228631

    1.5) free numbers at the end of the tests

    before patch:
                 total       used       free     shared    buffers     cached
    Mem:           474        467          7          0          0        236
    -/+ buffers/cache:        230        243
    Swap:         1023        418        605

    after patch:
                 total       used       free     shared    buffers     cached
    Mem:           474        457         16          0          0        236
    -/+ buffers/cache:        221        253
    Swap:         1023        404        619

    2) memory flushing in a file server

    2.1) brief summary

    The number of major faults drops from 50 to 3 during 10% cache hot reads.

    That means this patch successfully stops major faults when the active file
    list is slowly scanned while there is partially cache hot streaming IO.

    2.2) test scenario

    Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
    pages will be activated:

    for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10
    iotrace.rb --load pattern-hot-10 --play /b/sparse
    vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree

    and monitor /proc/vmstat during the time. The test box has 2G memory.

    I carried out tests on fresh booted console as well as X desktop, and
    fetched the vmstat numbers on

    (1) begin: shortly after the big read IO starts;
    (2) end: just before the big read IO stops;
    (3) restore: the big read IO stops and the zsh working set restored
    (4) restore X: after IO, switch back and forth between the urxvt and firefox
    windows to restore their working set.

    2.3) console mode results

               nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate   pgfree

    2.6.29 VM_EXEC protection ON:
    begin:          2481           2237             8694        630            0   574299
    end:             275         231976           233914        633       776271 20933042
    restore:         370         232154           234524        691       777183 20958453

    2.6.29 VM_EXEC protection ON (second run):
    begin:          2434           2237             8493        629            0   574195
    end:             284         231970           233536        632       771918 20896129
    restore:         399         232218           234789        690       774526 20957909

    2.6.30-rc4-mm VM_EXEC protection OFF:
    begin:          2479           2344             9659        210            0   579643
    end:             284         232010           234142        260       772776 20917184
    restore:         379         232159           234371        301       774888 20967849

    The above console numbers show that

    - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
    I'd attribute that improvement to the mmap readahead improvements :-)

    - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
    That's a huge improvement - which means with the VM_EXEC protection logic,
    active mmap pages are pretty safe even under partially cache hot streaming IO.

    - When the active:inactive file lru size reaches 1:1, their scan rates are
    1:20.8 under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
    That roughly means the active mmap pages get 20.8 times more chances to get
    re-referenced to stay in memory.

    - The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
    dropped pages are mostly inactive ones. The patch has almost no impact in
    this aspect, that means it won't unnecessarily increase memory pressure.
    (In contrast, your 20% mmap protection ratio will keep them all, and
    therefore eliminate the extra 41 major faults to restore working set
    of zsh etc.)

    The iotrace.rb read throughput is
    151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
    which means the inactive list is rotated at the speed of 250MB/s,
    so a full scan of which takes about 3.5 seconds, while a full scan
    of active file list takes about 77 seconds.

    2.4) X mode results

    We can reach roughly the same conclusions for X desktop:

               nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate   pgfree

    2.6.30-rc4-mm VM_EXEC protection ON:
    begin:          9740           8920            64075        561            0   678360
    end:             768         218254           220029        565       798953 21057006
    restore:         857         218543           220987        606       799462 21075710
    restore X:      2414         218560           225344        797       799462 21080795

    2.6.30-rc4-mm VM_EXEC protection OFF:
    begin:          9368           5035            26389        554            0   633391
    end:             770         218449           221230        661       646472 17832500
    restore:        1113         218466           220978        710       649881 17905235
    restore X:      2687         218650           225484        947       802700 21083584

    - the absolute nr_mapped drops considerably (to 1/13 of the original size)
    during the streaming IO.
    - the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
    during the whole process.

    Cc: Elladan
    Cc: Nick Piggin
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Collect vma->vm_flags of the VMAs that actually referenced the page.

    This is preparing for more informed reclaim heuristics, eg. to protect
    executable file pages more aggressively. For now only the VM_EXEC bit
    will be used by the caller.
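
A minimal sketch of the collection step, assuming each VMA that maps the page can be asked whether it referenced the page. `struct vma_sample` and `page_referenced_sketch()` are hypothetical stand-ins, and the VM_EXEC bit value is illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define VM_EXEC 0x00000004UL /* bit value used for illustration only */

/* One mapping of the page (hypothetical stand-in for a VMA). */
struct vma_sample {
    unsigned long vm_flags; /* flags of this mapping */
    int referenced;         /* did this mapping reference the page? */
};

/*
 * Count the mappings that referenced the page and OR together the
 * vm_flags of exactly those mappings. Mappings that did not touch
 * the page contribute nothing to *vm_flags.
 */
int page_referenced_sketch(const struct vma_sample *vmas, size_t n,
                           unsigned long *vm_flags)
{
    int referenced = 0;
    size_t i;

    assert(vm_flags);
    *vm_flags = 0;

    for (i = 0; i < n; i++) {
        if (vmas[i].referenced) {
            referenced++;
            *vm_flags |= vmas[i].vm_flags;
        }
    }
    return referenced;
}
```

A caller interested only in executable mappings would then test `vm_flags & VM_EXEC` on the collected value.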

    Thanks to Johannes, Peter and Minchan for all the good tips.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • As function shmem_file_setup does not modify/allocate/free/pass given
    filename - mark it as const.

    Signed-off-by: Sergei Trofimovich
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
     
    The file argument originated from address_space's readpage a long time ago.

    We don't use it any more. Let's remove the unnecessary argument.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Hugh removed add_to_swap's gfp_mask argument. (mm: remove gfp_mask from
    add_to_swap) So we have to remove annotation of gfp_mask of the function.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    SRAT tables may contain nodes of very small size. The arch code may
    decide to not activate such a node. However, currently the early boot
    code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
    active although these nodes have no present pages.

    For 64-bit, N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64-bit too.

    Signed-off-by: Yinghai Lu
    Tested-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Remove __invalidate_mapping_pages atomic variant now that its sole caller
    can sleep (fixed in eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb ("vfs: fix
    lock inversion in drop_pagecache_sb()")).

    This fixes softlockups that can occur while in the drop_caches path.

    Signed-off-by: Mike Waychison
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Nick Piggin
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Waychison
     
  • The oom killer must be invoked regardless of the order if the allocation
    is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free
    some memory.
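
A sketch of the decision, assuming the oom killer is otherwise skipped for allocations above some order threshold. The threshold `COSTLY_ORDER`, the flag value `GFP_NOFAIL_FLAG` and the function name are stand-ins chosen for this example.

```c
#include <assert.h>

#define GFP_NOFAIL_FLAG 0x800u /* stand-in bit for __GFP_NOFAIL */
#define COSTLY_ORDER    3u     /* assumed threshold for "large" orders */

/*
 * Should a failed allocation invoke the oom killer?
 * A no-fail allocation always does, whatever its order, since the
 * caller will otherwise retry forever without anything being freed.
 */
int should_invoke_oom_killer(unsigned int order, unsigned int gfp_mask)
{
    if (gfp_mask & GFP_NOFAIL_FLAG)
        return 1;
    return order <= COSTLY_ORDER;
}
```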

    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This moves the check for OOM_DISABLE to the badness heuristic so it is
    only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the
    score is 0, which is also correctly exported via /proc/pid/oom_score.
    This requires that tasks with badness scores of 0 are prohibited from
    being oom killed, which makes sense since they would not allow for future
    memory freeing anyway.

    Since the oom_adj value is a characteristic of an mm and not a task, it is
    no longer necessary to check the oom_adj value for threads sharing the
    same memory (except when simply issuing SIGKILLs for threads in other
    thread groups).
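
A much simplified sketch of the idea: the score is 0 for an OOM_DISABLE mm, and a score of 0 is never selected. The structs, the use of total_vm as the whole score, and the numeric value of OOM_DISABLE are assumptions of this example; the real heuristic weighs more factors.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_DISABLE (-17) /* value assumed for this sketch */

struct mm_sketch {
    int oom_adj;            /* shared by every thread using this mm */
    unsigned long total_vm; /* stand-in for the real badness inputs */
};

struct task_sketch {
    struct mm_sketch *mm;   /* NULL once the task has detached its mm */
};

/* Score a task; 0 means "must not be killed". */
unsigned long badness(const struct task_sketch *t)
{
    if (!t->mm || t->mm->oom_adj == OOM_DISABLE)
        return 0;
    return t->mm->total_vm;
}

/* Pick the task with the highest non-zero score, or NULL if none. */
const struct task_sketch *select_bad_process(const struct task_sketch *tasks,
                                             size_t n)
{
    const struct task_sketch *chosen = NULL;
    unsigned long best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long points = badness(&tasks[i]);

        /* Strictly greater than 0: zero-score tasks are never chosen. */
        if (points > best) {
            best = points;
            chosen = &tasks[i];
        }
    }
    return chosen;
}
```

Because the check lives in the scoring function, the selection loop needs no separate OOM_DISABLE test.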

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Presently we can know a swap entry is just used as SwapCache via swap_map,
    without looking up swap cache.

    Then, we have a chance to reuse swap-cache-only swap entries in
    get_swap_pages().

    This patch tries to free swap-cache-only swap entries if swap is not
    enough.

    Note: We hit the following path when the swap_cluster code cannot find a free
    cluster. So vm_swap_full() is not the only condition that allows the kernel
    to reclaim unused swap.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Tested-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    This is a part of the patches for fixing memcg's swap accounting leak.
    But, IMHO, it is not a bad patch even without memcg.

    There are 2 kinds of references to swap.
    - reference from swap entry
    - reference from swap cache

    Then,

    - If there is swap cache && swap's refcnt is 1, there is only swap cache.
    (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

    This counting logic has worked well for a long time. But considering
    that we cannot tell from swap_map[] whether there is a _real_ reference
    or not, the current usage of the counter is not very good.

    This patch adds a flag SWAP_HAS_CACHE and records information on whether a swap
    entry has a cache or not. This will remove -1 magic used in swapfile.c
    and be a help to avoid unnecessary find_get_page().
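
    A minimal user-space sketch of the idea, assuming a 16-bit per-entry
    counter whose top bit is the cache flag. The mask values and the helper
    names below are illustrative assumptions, not necessarily the definitions
    used in swapfile.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: top bit means "has swap cache", low bits count real users. */
#define SWAP_HAS_CACHE  0x8000u
#define SWAP_COUNT_MASK 0x7fffu

/* Number of real (non-cache) references recorded for the entry. */
static inline unsigned int swap_count(uint16_t ent)
{
	return ent & SWAP_COUNT_MASK;
}

/* Whether the swap cache holds a page for this entry. */
static inline bool swap_has_cache(uint16_t ent)
{
	return (ent & SWAP_HAS_CACHE) != 0;
}

/* True when the swap cache is the only remaining user of the entry. */
static inline bool swap_cache_only(uint16_t ent)
{
	return swap_has_cache(ent) && swap_count(ent) == 0;
}
```

    With such a flag, "only the swap cache uses this entry" becomes a test on
    the map value itself, with no find_get_page() lookup.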

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In a following patch, the usage of swap cache is recorded into swap_map.
    This patch is for necessary interface changes to do that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for allocating/freeing the swap cache's refcnt on existing swap
    entries. The implementation itself is not changed by this patch. While
    adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(). This is better than using scattered hooks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Solve two problems.

    Whenever memory hotplug successfully happens, zone->present_pages
    has to be changed.

    1) Now memory hotplug calls setup_per_zone_wmark_min only when
    online_pages is called, not offline_pages.

    It breaks balance.

    2) If zone->present_pages is changed, we also have to change
    zone->inactive_ratio. That's because inactive_ratio depends on
    zone->present_pages.

    Signed-off-by: Minchan Kim
    Acked-by: Yasunori Goto
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Factor the per-zone arithmetic inside setup_per_zone_inactive_ratio()'s
    loop into a separate function, calculate_zone_inactive_ratio(). This
    function will be used in a later patch.
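
    A rough user-space sketch of such a per-zone helper. It assumes the ratio
    is the integer square root of ten times the zone size in gigabytes, with a
    floor of 1 for zones under 1GB, and 4KB pages. The function names and
    PAGE_SHIFT value here are assumptions made for illustration.

```c
#define PAGE_SHIFT 12 /* assumed 4KB pages */

/* Integer square root, computed one result bit at a time. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;
	unsigned long bit = 1UL << (sizeof(unsigned long) * 8 - 2);

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Assumed formula: sqrt(10 * size in GB), or 1 for zones under 1GB. */
static unsigned int inactive_ratio_sketch(unsigned long present_pages)
{
	unsigned long gb = present_pages >> (30 - PAGE_SHIFT);

	return gb ? (unsigned int)isqrt(10 * gb) : 1;
}
```

    Because the result depends only on the number of present pages, any path
    that changes that number can call the same helper again.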

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Change the names of two functions. It doesn't affect behavior.

    Presently, setup_per_zone_pages_min() changes the low and high watermarks
    of a zone as well as min. So a better name is setup_per_zone_wmarks().
    That's because Mel changed zone->pages_[high/low/min] to the
    zone->watermark array in "page allocator: replace the watermark-related
    union in struct zone with a watermark[] array".

    * setup_per_zone_pages_min => setup_per_zone_wmarks

    Of course, we have to change init_per_zone_pages_min, too. There is no
    pages_min any more.

    * init_per_zone_pages_min => init_per_zone_wmark_min

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • shrink_zone() can deactivate active anon pages even if we don't have a
    swap device. Many embedded products don't have a swap device. So the
    deactivation of anon pages is unnecessary.

    This patch prevents unnecessary deactivation of anon lru pages. But it
    doesn't prevent aging of anon pages to swap out.
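
    The decision can be sketched as below. The struct and its fields are
    hypothetical stand-ins for the reclaim state, and the exact condition used
    in shrink_zone() is assumed here rather than quoted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state shrink_zone() would consult. */
struct reclaim_state_sketch {
	long nr_swap_pages;     /* free swap slots; 0 when there is no swap */
	bool inactive_anon_low; /* inactive anon list is short vs. active */
};

/* Deactivate active anon pages only if they could actually be swapped out. */
static bool should_deactivate_anon(const struct reclaim_state_sketch *s)
{
	return s->inactive_anon_low && s->nr_swap_pages > 0;
}
```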

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
    Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
    throughput by breaking it into chunks, but it also made migrate_prep() be
    called once per chunk (every 128 pages or so) instead of once per
    move_pages().

    This patch reverts to calling migrate_prep() only once per move_pages()
    call, as we did before 2.6.29. It is also a followup to commit
    0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
    under mmap_sem").

    This improves migration throughput on the above machine from 600MB/s to
    750MB/s.
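
    The difference between the two call patterns can be sketched in user
    space as follows. expensive_prep() and migrate_chunk() are hypothetical
    stand-ins, and the chunk size of 128 pages is an assumption taken from the
    "128 pages or so" figure above.

```c
#include <stddef.h>

#define CHUNK_PAGES 128 /* assumed chunk size */

static unsigned int prep_calls; /* number of expensive preparation steps */

/* Hypothetical stand-in for the expensive preparation step. */
static void expensive_prep(void)
{
	prep_calls++;
}

/* Hypothetical stand-in for migrating pages [first, first + count). */
static void migrate_chunk(size_t first, size_t count)
{
	(void)first;
	(void)count;
}

/* Length of the chunk starting at 'start'; the last one may be short. */
static size_t chunk_len(size_t start, size_t nr_pages)
{
	size_t left = nr_pages - start;

	return left < CHUNK_PAGES ? left : CHUNK_PAGES;
}

/* Per-chunk pattern: pays the preparation cost for every chunk. */
static void move_pages_prep_per_chunk(size_t nr_pages)
{
	for (size_t start = 0; start < nr_pages; start += CHUNK_PAGES) {
		expensive_prep();
		migrate_chunk(start, chunk_len(start, nr_pages));
	}
}

/* Per-call pattern: prepares once, then walks the chunks. */
static void move_pages_prep_once(size_t nr_pages)
{
	expensive_prep();
	for (size_t start = 0; start < nr_pages; start += CHUNK_PAGES)
		migrate_chunk(start, chunk_len(start, nr_pages));
}
```

    For a request of 1000 pages, the per-chunk pattern runs the preparation
    step 8 times and the per-call pattern runs it once.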

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Currently, the following scenario appears to be possible in theory:

    * Tasks are frozen for hibernation or suspend.
    * Free pages are almost exhausted.
    * A certain piece of code in the suspend code path attempts to allocate
    some memory using GFP_KERNEL and allocation order less than or
    equal to PAGE_ALLOC_COSTLY_ORDER.
    * __alloc_pages_internal() cannot find a free page so it invokes the
    OOM killer.
    * The OOM killer attempts to kill a task, but the task is frozen, so
    it doesn't die immediately.
    * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
    to find a free page and invokes the OOM killer.
    * No progress can be made.

    Although it is now hard to trigger during hibernation due to the memory
    shrinking carried out by the hibernation code, it is theoretically
    possible to trigger during suspend after the memory shrinking has been
    removed from that code path. Moreover, since memory allocations are
    going to be used for the hibernation memory shrinking, it will be even
    more likely to happen during hibernation.

    To prevent it from happening, introduce the oom_killer_disabled switch
    that will cause __alloc_pages_internal() to fail in the situations in
    which the OOM killer would have been called and make the freezer set
    this switch after tasks have been successfully frozen.
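
    A simplified user-space model of the switch. try_get_free_page() and
    invoke_oom_killer() are hypothetical stand-ins, and the retry bound exists
    only so that the sketch terminates; the scenario above has no such bound.

```c
#include <stdbool.h>
#include <stddef.h>

static bool oom_killer_disabled; /* set once tasks have been frozen */
static unsigned int oom_kills;   /* times the OOM killer was invoked */

/* Hypothetical stand-in: pretend that no page is ever free. */
static void *try_get_free_page(void)
{
	return NULL;
}

/* Hypothetical stand-in: a frozen victim never releases its memory. */
static void invoke_oom_killer(void)
{
	oom_kills++;
}

/*
 * Allocation loop in the spirit of the scenario above. With the switch
 * set, the allocation fails at once instead of calling the OOM killer
 * and retrying.
 */
static void *alloc_page_sketch(unsigned int max_attempts)
{
	for (unsigned int i = 0; i < max_attempts; i++) {
		void *page = try_get_free_page();

		if (page)
			return page;
		if (oom_killer_disabled)
			return NULL;
		invoke_oom_killer();
	}
	return NULL;
}
```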

    [akpm@linux-foundation.org: be nicer to the namespace]
    Signed-off-by: Rafael J. Wysocki
    Cc: Fengguang Wu
    Cc: David Rientjes
    Acked-by: Pavel Machek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The posix_madvise() function succeeds (and does nothing) when called with
    parameters (NULL, 0, -1); according to LSB tests, it should fail with
    EINVAL because -1 is not a valid flag.

    When called with a valid address and size, it correctly fails.

    So perform an initial check for valid flags first.
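
    The ordering of the checks can be sketched as below. The advice values
    and function names are hypothetical, and the sketch returns a negative
    errno in the style of a system call.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical advice values; the real constants live in <sys/mman.h>. */
enum advice_sketch {
	ADV_NORMAL,
	ADV_RANDOM,
	ADV_SEQUENTIAL,
	ADV_WILLNEED,
	ADV_DONTNEED
};

static int advice_valid(int advice)
{
	switch (advice) {
	case ADV_NORMAL:
	case ADV_RANDOM:
	case ADV_SEQUENTIAL:
	case ADV_WILLNEED:
	case ADV_DONTNEED:
		return 1;
	default:
		return 0;
	}
}

static int madvise_sketch(void *start, size_t len, int advice)
{
	(void)start;

	/* Validate the advice before any early-success shortcut. */
	if (!advice_valid(advice))
		return -EINVAL;

	/* A zero-length range is a successful no-op. */
	if (len == 0)
		return 0;

	return 0; /* the per-range work is elided in this sketch */
}
```

    With the validity check first, the (NULL, 0, -1) call is rejected even
    though the range is empty.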

    Reported-by: Jiri Dluhos
    Signed-off-by: Nick Piggin
    Reviewed-and-Tested-by: WANG Cong
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should
    detect and suitably handle this (and not by lamely moving the infinite
    loop up to the caller level either).

    Attempting to use __GFP_NOFAIL for a higher-order allocation is even
    worse, so add a once-off runtime check for this to slap people around for
    even thinking about trying it.
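
    A once-off check of this kind might look like the sketch below. The flag
    bit, the order threshold (here anything above order 1) and the function
    name are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define GFP_NOFAIL_SKETCH 0x1u /* hypothetical flag bit */

static unsigned int nofail_warnings; /* how many warnings were printed */

/* Warn the first time a no-fail allocation asks for a high order. */
static void check_nofail_order(unsigned int gfp_mask, unsigned int order)
{
	static bool warned;

	if ((gfp_mask & GFP_NOFAIL_SKETCH) && order > 1 && !warned) {
		warned = true;
		nofail_warnings++;
		fprintf(stderr, "warning: no-fail allocation of order %u\n",
			order);
	}
}
```

    The static flag keeps the log from being flooded: only the first offender
    is reported.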

    Cc: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Peter Zijlstra
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Analogous to follow_phys(), add a helper that looks up the PFN at a
    user virtual address in an IO mapping or a raw PFN mapping.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A generic readonly page table lookup helper to map an address space and an
    address from it to a pte.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The caller of setup_per_zone_inactive_ratio() is an __init function, so
    there is no need to keep the callee after init has completed either.
    Also fix a comment.

    Acked-by: David Rientjes
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov