04 Jun, 2012

1 commit

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compaction list corruption
    issues that Dave Jones reported. The locking (or rather, the absence
    thereof) is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now; we can revisit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2012

8 commits

  • This is the first stage of struct mem_cgroup_zone removal. Further
    patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

    If CONFIG_CGROUP_MEM_RES_CTLR=n, lruvec_zone() is just container_of().
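
    A minimal sketch of the !memcg case (the embedded field name is
    illustrative, not necessarily the exact one used by the series):

    /* CONFIG_CGROUP_MEM_RES_CTLR=n: the lruvec is embedded in struct zone,
     * so getting back to the zone is pure pointer arithmetic. */
    static inline struct zone *lruvec_zone(struct lruvec *lruvec)
    {
            return container_of(lruvec, struct zone, lruvec);
    }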

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With mem_cgroup_disabled() now explicit, it becomes clear that the
    zone_reclaim_stat structure actually belongs in lruvec, per-zone when
    memcg is disabled but per-memcg per-zone when it's enabled.

    We can delete mem_cgroup_get_reclaim_stat(), and change
    update_page_reclaim_stat() to update just the one set of stats, the one
    which get_scan_count() will actually use.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • - make pageflag_names[] const

    - remove null termination of pageflag_names[]

    Cc: Johannes Weiner
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • String tables with names of enum items are always prone to go out of
    sync with the enums themselves. Ensure during compile time that the
    name table of page flags has the same size as the page flags enum.
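
    A compile-time check of that kind might look like the following sketch
    (whether the table keeps a terminating entry decides the exact count
    being compared):

    /* Fail the build if a flag is added without a matching name entry.
     * Adjust by one if the table keeps a NULL terminator. */
    BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);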

    Signed-off-by: Johannes Weiner
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The pageflag_names[] array converts page flags into their corresponding
    names so that a meaningful representation of each page flag can be
    printed. This mechanism is used while dumping page frames. However,
    the array was missing PG_compound_lock, so that page flag would be
    printed as a raw number instead of a meaningful string.

    The patch fixes that and prints "compound_lock" for the PG_compound_lock
    page flag.

    Signed-off-by: Gavin Shan
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    When MIGRATE_UNMOVABLE pages are freed from a MIGRATE_UNMOVABLE
    pageblock (and some MIGRATE_MOVABLE pages are left in it), waiting until
    an allocation takes ownership of the block may take too long. The type
    of the pageblock remains unchanged, so the pageblock cannot be used as a
    migration target during compaction.

    Fix it by:

    * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE] and
    COMPACT_SYNC) and then converting the sync field in struct
    compact_control to use it.

    * Adding an nr_pageblocks_skipped field to struct compact_control and
    tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
    If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
    try_to_compact_pages() (COMPACT_COMPLETE), it implies that there is no
    suitable page for the allocation. In that case, check whether enough
    MIGRATE_UNMOVABLE pageblocks were skipped to warrant a second pass in
    COMPACT_ASYNC_UNMOVABLE mode.

    * Scanning the MIGRATE_UNMOVABLE pageblocks (during the COMPACT_SYNC and
    COMPACT_ASYNC_UNMOVABLE compaction modes) and counting pages that are
    PageBuddy, have page_count(page) == 0, or are PageLRU. If all pages
    within a MIGRATE_UNMOVABLE pageblock fall into one of those three sets,
    change the whole pageblock's type to MIGRATE_MOVABLE (see the sketch
    after this list).
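
    A hedged sketch of that last step (the helper name and loop structure
    are illustrative, not the exact hunks from the patch):

    /* If every page in an unmovable pageblock is free (PageBuddy), unused
     * (page_count == 0) or migratable (PageLRU), the block can safely be
     * turned into a compaction target. */
    static bool pageblock_effectively_movable(struct page *block)
    {
            unsigned long i;

            for (i = 0; i < pageblock_nr_pages; i++) {
                    struct page *page = block + i;

                    if (PageBuddy(page) || page_count(page) == 0 ||
                        PageLRU(page))
                            continue;
                    return false;  /* a pinned page keeps the block unmovable */
            }
            return true;
    }

    if (pageblock_effectively_movable(block))
            set_pageblock_migratetype(block, MIGRATE_MOVABLE);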

    My particular test case (on an ARM EXYNOS4 device with 512 MiB, which
    means 131072 standard 4 KiB pages in the 'Normal' zone) is to:

    - allocate 120000 pages for kernel's usage
    - free every second page (60000 pages) of memory just allocated
    - allocate and use 60000 pages from user space
    - free remaining 60000 pages of kernel memory
    (now we have fragmented memory occupied mostly by user space pages)
    - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

    The results:
    - with compaction disabled I get 11 successful allocations
    - with compaction enabled - 14 successful allocations
    - with this patch I'm able to get all 100 successful allocations

    NOTE: If we can make kswapd aware of order-0 requests during compaction,
    we can enhance kswapd to switch to COMPACT_ASYNC_FULL mode
    (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
    following thread:
    following thread:

    http://marc.info/?l=linux-mm&m=133552069417068&w=2

    [minchan@kernel.org: minor cleanups]
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Marek Szyprowski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • This has always been broken: one version takes an unsigned int and the
    other version takes no arguments. This bug was hidden because one
    version of set_pageblock_order() was a macro which doesn't evaluate its
    argument.

    Simplify it all and remove pageblock_default_order() altogether.

    Reported-by: rajman mekaco
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Print physical address info in a style consistent with the %pR style used
    elsewhere in the kernel. For example:

    -Zone PFN ranges:
    +Zone ranges:
    - DMA32 0x00000010 -> 0x00100000
    + DMA32 [mem 0x00010000-0xffffffff]
    - Normal 0x00100000 -> 0x01080000
    + Normal [mem 0x100000000-0x107fffffff]
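
    Producing the new style boils down to a format string like the sketch
    below (zone_names[] is the existing table in mm/page_alloc.c; the exact
    field widths and local variable names in the patch may differ):

    printk(KERN_CONT "  %-8s [mem %#010llx-%#010llx]\n",
           zone_names[i],
           (u64)start_pfn << PAGE_SHIFT,
           ((u64)end_pfn << PAGE_SHIFT) - 1);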

    Signed-off-by: Bjorn Helgaas
    Cc: Yinghai Lu
    Cc: Konrad Rzeszutek Wilk
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Helgaas
     

26 May, 2012

1 commit

  • Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
    "These patches contain two major updates for DMA mapping subsystem
    (mainly for ARM architecture). First one is Contiguous Memory
    Allocator (CMA) which makes it possible for device drivers to allocate
    big contiguous chunks of memory after the system has booted.

    The main difference from similar frameworks is that CMA allows the
    memory region reserved for big chunk allocations to be transparently
    reused as system memory, so no memory is wasted when no big chunk is
    allocated. Once an allocation request is issued, the framework migrates
    system pages to create space for the required big chunk of physically
    contiguous memory.

    For more information one can refer to nice LWN articles:

    - 'A reworked contiguous memory allocator':
    http://lwn.net/Articles/447405/

    - 'CMA and ARM':
    http://lwn.net/Articles/450286/

    - 'A deep dive into CMA':
    http://lwn.net/Articles/486301/

    - and the following thread with the patches and links to all previous
    versions:
    https://lkml.org/lkml/2012/4/3/204

    The main client for this new framework is ARM DMA-mapping subsystem.

    The second part provides a complete redesign of the ARM DMA-mapping
    subsystem. The core implementation has been changed to use common
    struct dma_map_ops based infrastructure with the recent updates for
    new dma attributes merged in v3.4-rc2. This allows more than one
    implementation of the dma-mapping calls to be used and selected on a
    per-struct-device basis. The first client of this new infrastructure is
    the dmabounce implementation, which has been completely cut out of the
    core, common code.

    The last patch of this redesign update introduces a new, experimental
    implementation of dma-mapping calls on top of the generic IOMMU
    framework. This lets an ARM sub-platform transparently use an IOMMU for
    DMA-mapping calls if the required IOMMU hardware is provided.

    For more information please refer to the following thread:
    http://www.spinics.net/lists/arm-kernel/msg175729.html

    The last patch merges changes from both updates and provides a
    resolution for the conflicts which cannot be avoided when patches have
    been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

    Acked by Andrew Morton:
    "Yup, this one please. It's had much work, plenty of review and I
    think even Russell is happy with it."

    * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
    ARM: dma-mapping: use PMD size for section unmap
    cma: fix migration mode
    ARM: integrate CMA with DMA-mapping subsystem
    X86: integrate CMA with DMA-mapping subsystem
    drivers: add Contiguous Memory Allocator
    mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
    mm: extract reclaim code from __alloc_pages_direct_reclaim()
    mm: Serialize access to min_free_kbytes
    mm: page_isolation: MIGRATE_CMA isolation functions added
    mm: mmzone: MIGRATE_CMA migration type added
    mm: page_alloc: change fallbacks array handling
    mm: page_alloc: introduce alloc_contig_range()
    mm: compaction: export some of the functions
    mm: compaction: introduce isolate_freepages_range()
    mm: compaction: introduce map_pages()
    mm: compaction: introduce isolate_migratepages_range()
    mm: page_alloc: remove trailing whitespace
    ARM: dma-mapping: add support for IOMMU mapper
    ARM: dma-mapping: use alloc, mmap, free from dma_ops
    ARM: dma-mapping: remove redundant code and do the cleanup
    ...

    Conflicts:
    arch/x86/include/asm/dma-mapping.h

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull more networking updates from David Miller:
    "Ok, everything from here on out will be bug fixes."

    1) One final sync of wireless and bluetooth stuff from John Linville.
    These changes have all been in his tree for more than a week, and
    therefore have had the necessary -next exposure. John was just away
    on a trip and didn't have a chance to send the pull request until a
    day or two ago.

    2) Put back some defines in user exposed header file areas that were
    removed during the tokenring purge. From Stephen Hemminger and Paul
    Gortmaker.

    3) A bug fix for UDP hash table allocation got lost in the pile due to
    one of those "you got it.. no I've got it.." situations. :-)

    From Tim Bird.

    4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
    try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

    5) RCU routing table lookups can race with free_fib_info(), causing
    crashes when we deref the device pointers in the route. Fix by
    releasing the net device in the RCU callback. From Yanmin Zhang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
    tcp: take care of overlaps in tcp_try_coalesce()
    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
    mm: add a low limit to alloc_large_system_hash
    ipx: restore token ring define to include/linux/ipx.h
    if: restore token ring ARP type to header
    xen: do not disable netfront in dom0
    phy/micrel: Fix ID of KSZ9021
    mISDN: Add X-Tensions USB ISDN TA XC-525
    gianfar:don't add FCB length to hard_header_len
    Bluetooth: Report proper error number in disconnection
    Bluetooth: Create flags for bt_sk()
    Bluetooth: report the right security level in getsockopt
    Bluetooth: Lock the L2CAP channel when sending
    Bluetooth: Restore locking semantics when looking up L2CAP channels
    Bluetooth: Fix a redundant and problematic incoming MTU check
    Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
    Bluetooth: Fix EIR data generation for mgmt_device_found
    Bluetooth: Fix Inquiry with RSSI event mask
    Bluetooth: improve readability of l2cap_seq_list code
    Bluetooth: Fix skb length calculation
    ...

    Linus Torvalds
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    In some low-memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocate a bigger memory area.

    As we cannot easily free the old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.
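
    A sketch of the essential change (not the literal hunk; the low_limit
    parameter name is taken from the description above, and the UDP/UDPLite
    callers then pass UDP_HTABLE_SIZE_MIN for it):

    /* Inside alloc_large_system_hash(): never size the table below the
     * caller-supplied floor, so udp_table_init() can trust the result. */
    if (numentries < low_limit)
            numentries = low_limit;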

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

23 May, 2012

1 commit

  • Pull driver core updates from Greg Kroah-Hartman:
    "Here's the driver core, and other driver subsystems, pull request for
    the 3.5-rc1 merge window.

    Outside of a few minor driver core changes, we ended up with the
    following different subsystem and core changes as well, due to
    interdependencies with the driver core:
    - hyperv driver updates
    - drivers/memory being created and some drivers moved into it
    - extcon driver subsystem created out of the old Android staging
    switch driver code
    - dynamic debug updates
    - printk rework, and /dev/kmsg changes

    All of this has been tested in the linux-next releases for a few weeks
    with no reported problems.

    Signed-off-by: Greg Kroah-Hartman "

    Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
    that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
    be applied to this one.

    * tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
    uio_pdrv_genirq: get irq through platform resource if not set otherwise
    memory: tegra{20,30}-mc: Remove empty *_remove()
    printk() - isolate KERN_CONT users from ordinary complete lines
    sysfs: get rid of some lockdep false positives
    Drivers: hv: util: Properly handle version negotiations.
    Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
    memory: tegra{20,30}-mc: Use dev_err_ratelimited()
    driver core: Add dev_*_ratelimited() family
    Driver Core: don't oops with unregistered driver in driver_find_device()
    printk() - restore prefix/timestamp printing for multi-newline strings
    printk: add stub for prepend_timestamp()
    ARM: tegra30: Make MC optional in Kconfig
    ARM: tegra20: Make MC optional in Kconfig
    ARM: tegra30: MC: Remove unnecessary BUG*()
    ARM: tegra20: MC: Remove unnecessary BUG*()
    printk: correctly align __log_buf
    ARM: tegra30: Add Tegra Memory Controller(MC) driver
    ARM: tegra20: Add Tegra Memory Controller(MC) driver
    printk() - restore timestamp printing at console output
    printk() - do not merge continuation lines of different threads
    ...

    Linus Torvalds
     

21 May, 2012

9 commits

    __alloc_contig_migrate_range() calls migrate_pages() with the wrong
    argument for migrate_mode. Fix it.

    Cc: Marek Szyprowski
    Signed-off-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski

    Minchan Kim
     
    alloc_contig_range() performs memory allocation, so it should also
    maintain the correct memory watermark levels. This commit adds a call to
    *_slowpath-style reclaim to grab enough pages to make sure that the
    final collection of contiguous pages from the freelists will not starve
    the system.

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    CC: Michal Nazarewicz
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
    This patch extracts the common reclaim code from
    __alloc_pages_direct_reclaim() into a separate function,
    __perform_reclaim(), which can later be used by alloc_contig_range().

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Cc: Michal Nazarewicz
    Acked-by: Mel Gorman
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
  • There is a race between the min_free_kbytes sysctl, memory hotplug
    and transparent hugepage support enablement. Memory hotplug uses a
    zonelists_mutex to avoid a race when building zonelists. Reuse it to
    serialise watermark updates.
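
    The serialisation amounts to taking the existing mutex around the
    watermark recalculation; a sketch (the inner helper name is
    illustrative):

    void setup_per_zone_wmarks(void)
    {
            /* zonelists_mutex is the mutex memory hotplug already uses. */
            mutex_lock(&zonelists_mutex);
            __setup_per_zone_wmarks();      /* the actual recalculation */
            mutex_unlock(&zonelists_mutex);
    }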

    [a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
    Signed-off-by: Mel Gorman
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Barry Song

    Mel Gorman
     
    This commit changes various functions that switch the migrate type of
    pages and pageblocks between MIGRATE_ISOLATE and MIGRATE_MOVABLE so
    that they can also work with the MIGRATE_CMA migrate type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks, and (ii) the page allocator will never change the
    migration type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's
    no memory left in the system).

    It is designed to be used for allocating big chunks (e.g. 10 MiB)
    of physically contiguous memory. Once a driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, MIGRATE_CMA is the last
    migration type tried when the page allocator falls back from the
    requested migration type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    This commit adds a row for the MIGRATE_ISOLATE type to the fallbacks
    array, which was missing from it. It also changes the array traversal
    logic a little, making MIGRATE_RESERVE an end marker. The latter change
    removes the implicit MIGRATE_UNMOVABLE from the end of each row, which
    was read by the __rmqueue_fallback() function.
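
    After the change the array looks roughly like the sketch below (a later
    patch in this series additionally slots MIGRATE_CMA into the
    MIGRATE_MOVABLE row when CONFIG_CMA is enabled); __rmqueue_fallback()
    stops walking a row when it hits MIGRATE_RESERVE:

    static int fallbacks[MIGRATE_TYPES][4] = {
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
            [MIGRATE_RESERVE]     = { MIGRATE_RESERVE }, /* never used */
            [MIGRATE_ISOLATE]     = { MIGRATE_RESERVE }, /* never used */
    };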

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
    This commit adds the alloc_contig_range() function, which tries to
    allocate a given range of pages. It tries to migrate all already
    allocated pages that fall in the range, thus freeing them. Once all
    pages in the range are freed, they are removed from the buddy system
    and are thus allocated for the caller to use.
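
    A typical caller (CMA being the intended one) would then use it roughly
    like this hedged sketch, with start_pfn/end_pfn standing in for the
    reserved region:

    /* Grab a physically contiguous [start_pfn, end_pfn) range, then
     * release it again once the driver is done with the buffer. */
    ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_CMA);
    if (ret == 0) {
            /* use pfn_to_page(start_pfn) ... */
            free_contig_range(start_pfn, end_pfn - start_pfn);
    }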

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman

    Michal Nazarewicz
     

12 May, 2012

1 commit

  • Why is there less MemFree than there used to be? It perturbed a test,
    so I've just been bisecting linux-next, and now find the offender went
    upstream yesterday.

    Commit 93278814d359 "mm: fix division by 0 in percpu_pagelist_fraction()"
    mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
    which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
    us expect it to be left unset at 0 (and it's not then used as a divisor).

    MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB
    Repetitive test with percpu_pagelist_fraction 8:
    MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB
    Same test with percpu_pagelist_fraction back to 0:
    MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB

    Signed-off-by: Hugh Dickins
    [ We really should fix the crazy sysctl interface too, but that's a
    separate thing - Linus ]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 May, 2012

1 commit

  • percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
    a possible error from proc_dointvec_minmax().

    If any other error is returned, it would proceed to divide by zero since
    percpu_pagelist_fraction wasn't getting initialized at any point. For
    example, writing 0 bytes into the proc file would trigger the issue.
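
    The problematic pattern being described looks roughly like this sketch
    of the handler (not the literal source):

    ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
    if (!write || ret == -EINVAL)   /* only -EINVAL is handled ... */
            return ret;             /* ... -EFAULT etc. fall through */

    for_each_populated_zone(zone) {
            /* percpu_pagelist_fraction may still be 0 here, so this
             * divides by zero when a failed write reaches this point. */
            unsigned long high = zone->present_pages /
                                 percpu_pagelist_fraction;
            /* high is then applied to each CPU's pageset for this zone */
    }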

    Signed-off-by: Sasha Levin
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

08 May, 2012

1 commit


29 Mar, 2012

2 commits

  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs sent to drain per-cpu pages in case of
    severe memory pressure that leads to OOM. In these cases multiple,
    possibly concurrent, allocation requests end up in the direct reclaim
    code path, so the per-cpu pages are reclaimed on the first allocation
    failure; for most of the subsequent allocation attempts, until the
    memory pressure is off (possibly via the OOM killer), most CPUs (and
    there can easily be hundreds of them) have no per-cpu pages at all.

    This also has the side effect of shortening the average latency of
    direct reclaim by one or more orders of magnitude, since waiting for all
    the CPUs to ACK the IPI takes a long time.
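
    The mechanism amounts to building a cpumask first and only interrupting
    CPUs that have something to drain; a sketch of the drain_all_pages()
    logic (the mask variable is illustrative):

    static struct cpumask cpus_with_pcps;

    for_each_online_cpu(cpu) {
            bool has_pcps = false;
            struct zone *zone;

            for_each_populated_zone(zone) {
                    struct per_cpu_pageset *pset =
                            per_cpu_ptr(zone->pageset, cpu);
                    if (pset->pcp.count) {
                            has_pcps = true;
                            break;
                    }
            }
            if (has_pcps)
                    cpumask_set_cpu(cpu, &cpus_with_pcps);
            else
                    cpumask_clear_cpu(cpu, &cpus_with_pcps);
    }
    /* Send the drain IPI only to CPUs that actually hold pages. */
    on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);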

    Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than half of the online CPUs had
    any per-cpu pages in them, using the vmstat counters introduced in the
    next patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global
    IPIs after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path, as a patch from Mel Gorman currently suggests, this should be
    replaced with an on_each_cpu_cond() invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
    The size of coredump files is limited by RLIMIT_CORE; however,
    allocating large amounts of memory results in three negative
    consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Mar, 2012

6 commits

  • find_zone_movable_pfns_for_nodes() does not use its argument.

    Signed-off-by: Kautuk Consul
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • add_from_early_node_map() is unused.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
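
    The fast path ends up as a retry loop around the allocation; a sketch of
    the cookie pattern this patch introduces (the surrounding variable names
    are illustrative):

    unsigned int cpuset_mems_cookie;
    struct page *page;

    do {
            cpuset_mems_cookie = get_mems_allowed();  /* seqcount begin */
            page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
            /* Retry only if the nodemask changed underneath us and the
             * allocation failed, which may have been a false failure. */
    } while (!put_mems_allowed(cpuset_mems_cookie) && !page);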

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
    ("mm: fix Committed_AS underflow on large NR_CPUS environment")

    The visible effect of this accident: some pages are borrowed in per-cpu
    page-vectors. Truncate can deal with it, but these pages cannot be
    reused while this cpu is offline. So this is like a temporary memory
    leak.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The oom killer chooses not to kill a thread if:

    - an eligible thread has already been oom killed and has yet to exit,
    and

    - an eligible thread is exiting but has yet to free all of its memory
    and is not the thread currently attempting to allocate memory.

    SysRq+F manually invokes the global oom killer to kill a memory-hogging
    task. This is normally done as a last resort to free memory when no
    progress is being made or to test the oom killer itself.

    For both uses, we always want to kill a thread and never defer. This
    patch causes SysRq+F to always kill an eligible thread and can be used to
    force a kill even if another oom killed thread has failed to exit.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Acked-by: Pekka Enberg
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Currently a failed order-9 (transparent hugepage) compaction can lead to
    memory compaction being temporarily disabled for a memory zone, even if
    we only need compaction for an order-2 allocation, e.g. for networking
    jumbo frames.

    The fix is relatively straightforward: keep track of the highest order at
    which compaction is succeeding, and only defer compaction for orders at
    which compaction is failing.
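
    Conceptually the bookkeeping looks like this simplified sketch (the real
    patch keeps the existing defer counters as well and stores the order in
    struct zone):

    /* Remember the lowest order that failed ... */
    static void defer_compaction(struct zone *zone, int order)
    {
            if (order < zone->compact_order_failed)
                    zone->compact_order_failed = order;
    }

    /* ... and only treat compaction as deferred at or above that order. */
    static bool compaction_deferred(struct zone *zone, int order)
    {
            if (order < zone->compact_order_failed)
                    return false;           /* lower orders keep compacting */
            return true;                    /* higher orders back off */
    }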

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.
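
    The fix amounts to using an unsigned loop counter (and an unsigned
    shift) so that a 2^31-entry table can still be walked; a sketch of the
    dcache_init() loop:

    unsigned int loop;

    /* With a signed int, "loop < (1 << d_hash_shift)" misbehaves once the
     * table reaches 2^31 entries, so the buckets were never initialised. */
    for (loop = 0; loop < (1U << d_hash_shift); loop++)
            INIT_HLIST_BL_HEAD(dentry_hashtable + loop);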

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

24 Jan, 2012

2 commits

    page_zone() requires an online node; otherwise we are accessing a NULL
    NODE_DATA(). This is not an issue at the moment because node_zones is
    located at the beginning of the structure, but this might change in the
    future, so we had better be careful about that.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK aligned
    blocks, so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those
    reserved pages, and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls into its
    boundaries and mark the section not removable. This might cause some
    false positives, probably, but we do not have any sane way to find out
    whether the page is reserved by the platform or it is just not used for
    whatever other reasons.

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 Jan, 2012

4 commits

  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If compaction is deferred, direct reclaim is used to try to free enough
    pages for the allocation to succeed. For small high-orders, this has a
    reasonable chance of success. However, if the caller has specified
    __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
    to fail the allocation rather than stall the caller in direct reclaim.
    This patch skips direct reclaim if compaction is deferred and the caller
    specifies __GFP_NO_KSWAPD.

    Async compaction only considers a subset of pages so it is possible for
    compaction to be deferred prematurely and not enter direct reclaim even in
    cases where it should. To compensate for this, this patch also defers
    compaction only if sync compaction failed.
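
    In the slow path this is essentially an early bail-out; a sketch
    (variable names are illustrative):

    /* __alloc_pages_slowpath(): if compaction was deferred and the caller
     * asked not to disturb the system, fail instead of direct reclaim. */
    if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
            goto nopage;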

    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    If a zone below ZONE_NORMAL has present_pages, we can set the node state
    to N_NORMAL_MEMORY; there is no need to loop to the end.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
    Having a unified structure with an LRU list set for both global zones
    and per-memcg zones allows the code that deals with LRU lists, and does
    not care about the container itself, to stay simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner