04 Jun, 2012

1 commit

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compaction list corruption
    issues that Dave Jones reported. The locking (or rather, the absence
    thereof) is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now; we can revisit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2012

8 commits

  • This is the first stage of struct mem_cgroup_zone removal. Further
    patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

    If CONFIG_CGROUP_MEM_RES_CTLR=n, lruvec_zone() is just container_of().
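
    A minimal sketch of the !memcg case (the embedded field name is
    illustrative, not necessarily the exact one used by the series):

    /* CONFIG_CGROUP_MEM_RES_CTLR=n: the lruvec is embedded in struct zone,
     * so getting back to the zone is pure pointer arithmetic. */
    static inline struct zone *lruvec_zone(struct lruvec *lruvec)
    {
            return container_of(lruvec, struct zone, lruvec);
    }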

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With mem_cgroup_disabled() now explicit, it becomes clear that the
    zone_reclaim_stat structure actually belongs in lruvec, per-zone when
    memcg is disabled but per-memcg per-zone when it's enabled.

    We can delete mem_cgroup_get_reclaim_stat(), and change
    update_page_reclaim_stat() to update just the one set of stats, the one
    which get_scan_count() will actually use.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • - make pageflag_names[] const

    - remove null termination of pageflag_names[]

    Cc: Johannes Weiner
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • String tables with names of enum items are always prone to go out of
    sync with the enums themselves. Ensure during compile time that the
    name table of page flags has the same size as the page flags enum.
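
    A compile-time check of that kind might look like the following sketch
    (whether the table keeps a terminating entry decides the exact count
    being compared):

    /* Fail the build if a flag is added without a matching name entry.
     * Adjust by one if the table keeps a NULL terminator. */
    BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);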

    Signed-off-by: Johannes Weiner
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The pageflag_names[] array converts page flags into their corresponding
    names so that a meaningful representation of each page flag can be
    printed. This mechanism is used while dumping page frames. However,
    the array was missing PG_compound_lock, so that page flag would be
    printed as a raw number instead of a meaningful string.

    The patch fixes that and prints "compound_lock" for the PG_compound_lock
    page flag.

    Signed-off-by: Gavin Shan
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    When MIGRATE_UNMOVABLE pages are freed from a MIGRATE_UNMOVABLE
    pageblock (and some MIGRATE_MOVABLE pages are left in it), waiting until
    an allocation takes ownership of the block may take too long. The type
    of the pageblock remains unchanged, so the pageblock cannot be used as a
    migration target during compaction.

    Fix it by:

    * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE] and
    COMPACT_SYNC) and then converting the sync field in struct
    compact_control to use it.

    * Adding an nr_pageblocks_skipped field to struct compact_control and
    tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
    If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
    try_to_compact_pages() (COMPACT_COMPLETE), it implies that there is no
    suitable page for the allocation. In that case, check whether enough
    MIGRATE_UNMOVABLE pageblocks were skipped to warrant a second pass in
    COMPACT_ASYNC_UNMOVABLE mode.

    * Scanning the MIGRATE_UNMOVABLE pageblocks (during the COMPACT_SYNC and
    COMPACT_ASYNC_UNMOVABLE compaction modes) and counting pages that are
    PageBuddy, have page_count(page) == 0, or are PageLRU. If all pages
    within a MIGRATE_UNMOVABLE pageblock fall into one of those three sets,
    change the whole pageblock's type to MIGRATE_MOVABLE (see the sketch
    after this list).
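
    A hedged sketch of that last step (the helper name and loop structure
    are illustrative, not the exact hunks from the patch):

    /* If every page in an unmovable pageblock is free (PageBuddy), unused
     * (page_count == 0) or migratable (PageLRU), the block can safely be
     * turned into a compaction target. */
    static bool pageblock_effectively_movable(struct page *block)
    {
            unsigned long i;

            for (i = 0; i < pageblock_nr_pages; i++) {
                    struct page *page = block + i;

                    if (PageBuddy(page) || page_count(page) == 0 ||
                        PageLRU(page))
                            continue;
                    return false;  /* a pinned page keeps the block unmovable */
            }
            return true;
    }

    if (pageblock_effectively_movable(block))
            set_pageblock_migratetype(block, MIGRATE_MOVABLE);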

    My particular test case (on an ARM EXYNOS4 device with 512 MiB, which
    means 131072 standard 4 KiB pages in the 'Normal' zone) is to:

    - allocate 120000 pages for kernel's usage
    - free every second page (60000 pages) of memory just allocated
    - allocate and use 60000 pages from user space
    - free remaining 60000 pages of kernel memory
    (now we have fragmented memory occupied mostly by user space pages)
    - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

    The results:
    - with compaction disabled I get 11 successful allocations
    - with compaction enabled - 14 successful allocations
    - with this patch I'm able to get all 100 successful allocations

    NOTE: If we can make kswapd aware of order-0 requests during compaction,
    we can enhance kswapd to switch to COMPACT_ASYNC_FULL mode
    (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
    following thread:
    following thread:

    http://marc.info/?l=linux-mm&m=133552069417068&w=2

    [minchan@kernel.org: minor cleanups]
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Marek Szyprowski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • This has always been broken: one version takes an unsigned int and the
    other version takes no arguments. This bug was hidden because one
    version of set_pageblock_order() was a macro which doesn't evaluate its
    argument.

    Simplify it all and remove pageblock_default_order() altogether.

    Reported-by: rajman mekaco
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Print physical address info in a style consistent with the %pR style used
    elsewhere in the kernel. For example:

    -Zone PFN ranges:
    +Zone ranges:
    - DMA32 0x00000010 -> 0x00100000
    + DMA32 [mem 0x00010000-0xffffffff]
    - Normal 0x00100000 -> 0x01080000
    + Normal [mem 0x100000000-0x107fffffff]
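
    Producing the new style boils down to a format string like the sketch
    below (zone_names[] is the existing table in mm/page_alloc.c; the exact
    field widths and local variable names in the patch may differ):

    printk(KERN_CONT "  %-8s [mem %#010llx-%#010llx]\n",
           zone_names[i],
           (u64)start_pfn << PAGE_SHIFT,
           ((u64)end_pfn << PAGE_SHIFT) - 1);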

    Signed-off-by: Bjorn Helgaas
    Cc: Yinghai Lu
    Cc: Konrad Rzeszutek Wilk
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Helgaas
     

26 May, 2012

1 commit

  • Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
    "These patches contain two major updates for DMA mapping subsystem
    (mainly for ARM architecture). First one is Contiguous Memory
    Allocator (CMA) which makes it possible for device drivers to allocate
    big contiguous chunks of memory after the system has booted.

    The main difference from similar frameworks is that CMA allows the
    memory region reserved for big chunk allocations to be transparently
    reused as system memory, so no memory is wasted when no big chunk is
    allocated. Once an allocation request is issued, the framework migrates
    system pages to create space for the required big chunk of physically
    contiguous memory.

    For more information one can refer to nice LWN articles:

    - 'A reworked contiguous memory allocator':
    http://lwn.net/Articles/447405/

    - 'CMA and ARM':
    http://lwn.net/Articles/450286/

    - 'A deep dive into CMA':
    http://lwn.net/Articles/486301/

    - and the following thread with the patches and links to all previous
    versions:
    https://lkml.org/lkml/2012/4/3/204

    The main client for this new framework is ARM DMA-mapping subsystem.

    The second part provides a complete redesign of the ARM DMA-mapping
    subsystem. The core implementation has been changed to use common
    struct dma_map_ops based infrastructure with the recent updates for
    new dma attributes merged in v3.4-rc2. This allows more than one
    implementation of the dma-mapping calls to be used and selected on a
    per-struct-device basis. The first client of this new infrastructure is
    the dmabounce implementation, which has been completely cut out of the
    core, common code.

    The last patch of this redesign update introduces a new, experimental
    implementation of dma-mapping calls on top of the generic IOMMU
    framework. This lets an ARM sub-platform transparently use an IOMMU for
    DMA-mapping calls if the required IOMMU hardware is provided.

    For more information please refer to the following thread:
    http://www.spinics.net/lists/arm-kernel/msg175729.html

    The last patch merges changes from both updates and provides a
    resolution for the conflicts which cannot be avoided when patches have
    been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

    Acked by Andrew Morton:
    "Yup, this one please. It's had much work, plenty of review and I
    think even Russell is happy with it."

    * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
    ARM: dma-mapping: use PMD size for section unmap
    cma: fix migration mode
    ARM: integrate CMA with DMA-mapping subsystem
    X86: integrate CMA with DMA-mapping subsystem
    drivers: add Contiguous Memory Allocator
    mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
    mm: extract reclaim code from __alloc_pages_direct_reclaim()
    mm: Serialize access to min_free_kbytes
    mm: page_isolation: MIGRATE_CMA isolation functions added
    mm: mmzone: MIGRATE_CMA migration type added
    mm: page_alloc: change fallbacks array handling
    mm: page_alloc: introduce alloc_contig_range()
    mm: compaction: export some of the functions
    mm: compaction: introduce isolate_freepages_range()
    mm: compaction: introduce map_pages()
    mm: compaction: introduce isolate_migratepages_range()
    mm: page_alloc: remove trailing whitespace
    ARM: dma-mapping: add support for IOMMU mapper
    ARM: dma-mapping: use alloc, mmap, free from dma_ops
    ARM: dma-mapping: remove redundant code and do the cleanup
    ...

    Conflicts:
    arch/x86/include/asm/dma-mapping.h

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull more networking updates from David Miller:
    "Ok, everything from here on out will be bug fixes."

    1) One final sync of wireless and bluetooth stuff from John Linville.
    These changes have all been in his tree for more than a week, and
    therefore have had the necessary -next exposure. John was just away
    on a trip and didn't have a chance to send the pull request until a
    day or two ago.

    2) Put back some defines in user exposed header file areas that were
    removed during the tokenring purge. From Stephen Hemminger and Paul
    Gortmaker.

    3) A bug fix for UDP hash table allocation got lost in the pile due to
    one of those "you got it.. no I've got it.." situations. :-)

    From Tim Bird.

    4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
    try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

    5) RCU routing table lookups can race with free_fib_info(), causing
    crashes when we deref the device pointers in the route. Fix by
    releasing the net device in the RCU callback. From Yanmin Zhang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
    tcp: take care of overlaps in tcp_try_coalesce()
    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
    mm: add a low limit to alloc_large_system_hash
    ipx: restore token ring define to include/linux/ipx.h
    if: restore token ring ARP type to header
    xen: do not disable netfront in dom0
    phy/micrel: Fix ID of KSZ9021
    mISDN: Add X-Tensions USB ISDN TA XC-525
    gianfar:don't add FCB length to hard_header_len
    Bluetooth: Report proper error number in disconnection
    Bluetooth: Create flags for bt_sk()
    Bluetooth: report the right security level in getsockopt
    Bluetooth: Lock the L2CAP channel when sending
    Bluetooth: Restore locking semantics when looking up L2CAP channels
    Bluetooth: Fix a redundant and problematic incoming MTU check
    Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
    Bluetooth: Fix EIR data generation for mgmt_device_found
    Bluetooth: Fix Inquiry with RSSI event mask
    Bluetooth: improve readability of l2cap_seq_list code
    Bluetooth: Fix skb length calculation
    ...

    Linus Torvalds
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    In some low-memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocate a bigger memory area.

    As we cannot easily free the old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.
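
    A sketch of the essential change (not the literal hunk; the low_limit
    parameter name is taken from the description above, and the UDP/UDPLite
    callers then pass UDP_HTABLE_SIZE_MIN for it):

    /* Inside alloc_large_system_hash(): never size the table below the
     * caller-supplied floor, so udp_table_init() can trust the result. */
    if (numentries < low_limit)
            numentries = low_limit;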

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

23 May, 2012

1 commit

  • Pull driver core updates from Greg Kroah-Hartman:
    "Here's the driver core, and other driver subsystems, pull request for
    the 3.5-rc1 merge window.

    Outside of a few minor driver core changes, we ended up with the
    following different subsystem and core changes as well, due to
    interdependencies with the driver core:
    - hyperv driver updates
    - drivers/memory being created and some drivers moved into it
    - extcon driver subsystem created out of the old Android staging
    switch driver code
    - dynamic debug updates
    - printk rework, and /dev/kmsg changes

    All of this has been tested in the linux-next releases for a few weeks
    with no reported problems.

    Signed-off-by: Greg Kroah-Hartman "

    Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
    that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
    be applied to this one.

    * tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
    uio_pdrv_genirq: get irq through platform resource if not set otherwise
    memory: tegra{20,30}-mc: Remove empty *_remove()
    printk() - isolate KERN_CONT users from ordinary complete lines
    sysfs: get rid of some lockdep false positives
    Drivers: hv: util: Properly handle version negotiations.
    Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
    memory: tegra{20,30}-mc: Use dev_err_ratelimited()
    driver core: Add dev_*_ratelimited() family
    Driver Core: don't oops with unregistered driver in driver_find_device()
    printk() - restore prefix/timestamp printing for multi-newline strings
    printk: add stub for prepend_timestamp()
    ARM: tegra30: Make MC optional in Kconfig
    ARM: tegra20: Make MC optional in Kconfig
    ARM: tegra30: MC: Remove unnecessary BUG*()
    ARM: tegra20: MC: Remove unnecessary BUG*()
    printk: correctly align __log_buf
    ARM: tegra30: Add Tegra Memory Controller(MC) driver
    ARM: tegra20: Add Tegra Memory Controller(MC) driver
    printk() - restore timestamp printing at console output
    printk() - do not merge continuation lines of different threads
    ...

    Linus Torvalds
     

21 May, 2012

9 commits

    __alloc_contig_migrate_range() calls migrate_pages() with the wrong
    argument for migrate_mode. Fix it.

    Cc: Marek Szyprowski
    Signed-off-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski

    Minchan Kim
     
    alloc_contig_range() performs memory allocation, so it should also
    maintain the correct memory watermark levels. This commit adds a call to
    *_slowpath-style reclaim to grab enough pages to make sure that the
    final collection of contiguous pages from the freelists will not starve
    the system.

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    CC: Michal Nazarewicz
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
    This patch extracts the common reclaim code from
    __alloc_pages_direct_reclaim() into a separate function,
    __perform_reclaim(), which can later be used by alloc_contig_range().

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Cc: Michal Nazarewicz
    Acked-by: Mel Gorman
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
  • There is a race between the min_free_kbytes sysctl, memory hotplug
    and transparent hugepage support enablement. Memory hotplug uses a
    zonelists_mutex to avoid a race when building zonelists. Reuse it to
    serialise watermark updates.
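
    The serialisation amounts to taking the existing mutex around the
    watermark recalculation; a sketch (the inner helper name is
    illustrative):

    void setup_per_zone_wmarks(void)
    {
            /* zonelists_mutex is the mutex memory hotplug already uses. */
            mutex_lock(&zonelists_mutex);
            __setup_per_zone_wmarks();      /* the actual recalculation */
            mutex_unlock(&zonelists_mutex);
    }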

    [a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
    Signed-off-by: Mel Gorman
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Barry Song

    Mel Gorman
     
    This commit changes various functions that switch the migrate type of
    pages and pageblocks between MIGRATE_ISOLATE and MIGRATE_MOVABLE so
    that they can also work with the MIGRATE_CMA migrate type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks, and (ii) the page allocator will never change the
    migration type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's
    no memory left in the system).

    It is designed to be used for allocating big chunks (e.g. 10 MiB)
    of physically contiguous memory. Once a driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, MIGRATE_CMA is the last
    migration type tried when the page allocator falls back from the
    requested migration type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    This commit adds a row for the MIGRATE_ISOLATE type to the fallbacks
    array, which was missing from it. It also changes the array traversal
    logic a little, making MIGRATE_RESERVE an end marker. The latter change
    removes the implicit MIGRATE_UNMOVABLE from the end of each row, which
    was read by the __rmqueue_fallback() function.
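
    After the change the array looks roughly like the sketch below (a later
    patch in this series additionally slots MIGRATE_CMA into the
    MIGRATE_MOVABLE row when CONFIG_CMA is enabled); __rmqueue_fallback()
    stops walking a row when it hits MIGRATE_RESERVE:

    static int fallbacks[MIGRATE_TYPES][4] = {
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
            [MIGRATE_RESERVE]     = { MIGRATE_RESERVE }, /* never used */
            [MIGRATE_ISOLATE]     = { MIGRATE_RESERVE }, /* never used */
    };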

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    This commit adds the alloc_contig_range() function, which tries to
    allocate a given range of pages. It tries to migrate all already
    allocated pages that fall in the range, thus freeing them. Once all
    pages in the range are freed, they are removed from the buddy system
    and are thus allocated for the caller to use.
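
    A typical caller (CMA being the intended one) would then use it roughly
    like this hedged sketch, with start_pfn/end_pfn standing in for the
    reserved region:

    /* Grab a physically contiguous [start_pfn, end_pfn) range, then
     * release it again once the driver is done with the buffer. */
    ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_CMA);
    if (ret == 0) {
            /* use pfn_to_page(start_pfn) ... */
            free_contig_range(start_pfn, end_pfn - start_pfn);
    }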

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman

    Michal Nazarewicz
     

12 May, 2012

1 commit

  • Why is there less MemFree than there used to be? It perturbed a test,
    so I've just been bisecting linux-next, and now find the offender went
    upstream yesterday.

    Commit 93278814d359 "mm: fix division by 0 in percpu_pagelist_fraction()"
    mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
    which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
    us expect it to be left unset at 0 (and it's not then used as a divisor).

    MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB
    Repetitive test with percpu_pagelist_fraction 8:
    MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB
    Same test with percpu_pagelist_fraction back to 0:
    MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB

    Signed-off-by: Hugh Dickins
    [ We really should fix the crazy sysctl interface too, but that's a
    separate thing - Linus ]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 May, 2012

1 commit

  • percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
    a possible error from proc_dointvec_minmax().

    If any other error is returned, it would proceed to divide by zero since
    percpu_pagelist_fraction wasn't getting initialized at any point. For
    example, writing 0 bytes into the proc file would trigger the issue.
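
    The problematic pattern being described looks roughly like this sketch
    of the handler (not the literal source):

    ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
    if (!write || ret == -EINVAL)   /* only -EINVAL is handled ... */
            return ret;             /* ... -EFAULT etc. fall through */

    for_each_populated_zone(zone) {
            /* percpu_pagelist_fraction may still be 0 here, so this
             * divides by zero when a failed write reaches this point. */
            unsigned long high = zone->present_pages /
                                 percpu_pagelist_fraction;
            /* high is then applied to each CPU's pageset for this zone */
    }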

    Signed-off-by: Sasha Levin
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

08 May, 2012

1 commit


29 Mar, 2012

2 commits

  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs sent to drain per-cpu pages in case of
    severe memory pressure that leads to OOM. In these cases multiple,
    possibly concurrent, allocation requests end up in the direct reclaim
    code path, so the per-cpu pages are reclaimed on the first allocation
    failure; for most of the subsequent allocation attempts, until the
    memory pressure is off (possibly via the OOM killer), most CPUs (and
    there can easily be hundreds of them) have no per-cpu pages at all.

    This also has the side effect of shortening the average latency of
    direct reclaim by one or more orders of magnitude, since waiting for all
    the CPUs to ACK the IPI takes a long time.
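
    The mechanism amounts to building a cpumask first and only interrupting
    CPUs that have something to drain; a sketch of the drain_all_pages()
    logic (the mask variable is illustrative):

    static struct cpumask cpus_with_pcps;

    for_each_online_cpu(cpu) {
            bool has_pcps = false;
            struct zone *zone;

            for_each_populated_zone(zone) {
                    struct per_cpu_pageset *pset =
                            per_cpu_ptr(zone->pageset, cpu);
                    if (pset->pcp.count) {
                            has_pcps = true;
                            break;
                    }
            }
            if (has_pcps)
                    cpumask_set_cpu(cpu, &cpus_with_pcps);
            else
                    cpumask_clear_cpu(cpu, &cpus_with_pcps);
    }
    /* Send the drain IPI only to CPUs that actually hold pages. */
    on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);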

    Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than half of the online CPUs had
    any per-cpu pages in them, using the vmstat counters introduced in the
    next patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global
    IPIs after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path, as a patch from Mel Gorman currently suggests, this should be
    replaced with an on_each_cpu_cond() invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
    The size of coredump files is limited by RLIMIT_CORE; however,
    allocating large amounts of memory results in three negative
    consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Mar, 2012

6 commits

  • find_zone_movable_pfns_for_nodes() does not use its argument.

    Signed-off-by: Kautuk Consul
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • add_from_early_node_map() is unused.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
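
    The fast path ends up as a retry loop around the allocation; a sketch of
    the cookie pattern this patch introduces (the surrounding variable names
    are illustrative):

    unsigned int cpuset_mems_cookie;
    struct page *page;

    do {
            cpuset_mems_cookie = get_mems_allowed();  /* seqcount begin */
            page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
            /* Retry only if the nodemask changed underneath us and the
             * allocation failed, which may have been a false failure. */
    } while (!put_mems_allowed(cpuset_mems_cookie) && !page);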

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
    ("mm: fix Committed_AS underflow on large NR_CPUS environment")

    The visible effect of this accident: some pages are borrowed in per-cpu
    page-vectors. Truncate can deal with it, but these pages cannot be
    reused while this cpu is offline. So this is like a temporary memory
    leak.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The oom killer chooses not to kill a thread if:

    - an eligible thread has already been oom killed and has yet to exit,
    and

    - an eligible thread is exiting but has yet to free all of its memory
    and is not the thread currently attempting to allocate memory.

    SysRq+F manually invokes the global oom killer to kill a memory-hogging
    task. This is normally done as a last resort to free memory when no
    progress is being made or to test the oom killer itself.

    For both uses, we always want to kill a thread and never defer. This
    patch causes SysRq+F to always kill an eligible thread and can be used to
    force a kill even if another oom killed thread has failed to exit.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Acked-by: Pekka Enberg
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Currently a failed order-9 (transparent hugepage) compaction can lead to
    memory compaction being temporarily disabled for a memory zone, even if
    we only need compaction for an order-2 allocation, e.g. for networking
    jumbo frames.

    The fix is relatively straightforward: keep track of the highest order at
    which compaction is succeeding, and only defer compaction for orders at
    which compaction is failing.
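
    Conceptually the bookkeeping looks like this simplified sketch (the real
    patch keeps the existing defer counters as well and stores the order in
    struct zone):

    /* Remember the lowest order that failed ... */
    static void defer_compaction(struct zone *zone, int order)
    {
            if (order < zone->compact_order_failed)
                    zone->compact_order_failed = order;
    }

    /* ... and only treat compaction as deferred at or above that order. */
    static bool compaction_deferred(struct zone *zone, int order)
    {
            if (order < zone->compact_order_failed)
                    return false;           /* lower orders keep compacting */
            return true;                    /* higher orders back off */
    }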

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.
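
    The fix amounts to using an unsigned loop counter (and an unsigned
    shift) so that a 2^31-entry table can still be walked; a sketch of the
    dcache_init() loop:

    unsigned int loop;

    /* With a signed int, "loop < (1 << d_hash_shift)" misbehaves once the
     * table reaches 2^31 entries, so the buckets were never initialised. */
    for (loop = 0; loop < (1U << d_hash_shift); loop++)
            INIT_HLIST_BL_HEAD(dentry_hashtable + loop);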

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

24 Jan, 2012

2 commits

    page_zone() requires an online node; otherwise we are accessing a NULL
    NODE_DATA(). This is not an issue at the moment because node_zones is
    located at the beginning of the structure, but this might change in the
    future, so we had better be careful about that.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK aligned
    blocks, so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those
    reserved pages, and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls into its
    boundaries and mark the section not removable. This might cause some
    false positives, probably, but we do not have any sane way to find out
    whether the page is reserved by the platform or it is just not used for
    whatever other reasons.

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 Jan, 2012

4 commits

  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If compaction is deferred, direct reclaim is used to try to free enough
    pages for the allocation to succeed. For small high-orders, this has a
    reasonable chance of success. However, if the caller has specified
    __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
    to fail the allocation rather than stall the caller in direct reclaim.
    This patch skips direct reclaim if compaction is deferred and the caller
    specifies __GFP_NO_KSWAPD.

    Async compaction only considers a subset of pages so it is possible for
    compaction to be deferred prematurely and not enter direct reclaim even in
    cases where it should. To compensate for this, this patch also defers
    compaction only if sync compaction failed.
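
    In the slow path this is essentially an early bail-out; a sketch
    (variable names are illustrative):

    /* __alloc_pages_slowpath(): if compaction was deferred and the caller
     * asked not to disturb the system, fail instead of direct reclaim. */
    if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
            goto nopage;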

    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    If a zone below ZONE_NORMAL has present_pages, we can set the node state
    to N_NORMAL_MEMORY; there is no need to loop to the end.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
    Having a unified structure with an LRU list set for both global zones
    and per-memcg zones allows the code that deals with LRU lists, and does
    not care about the container itself, to stay simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner