13 Dec, 2012

9 commits

  • There are two types of file extents - inline extents and regular extents.
    When we log file extents, we didn't take inline extents into account; fix
    that.

    Signed-off-by: Miao Xie
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Consider the following case:

        Task1                                Task2
        start_transaction
                                             commit_transaction
                                               check pending snapshots list
                                               and the list is empty.
        add pending snapshot into list
                                             skip the delalloc flush
        end_transaction
        ...

    As a result, the snapshot ends up different from the source subvolume.

    This patch fixes the above problem by flushing all the pending work once
    all the other tasks have ended the transaction.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • If we flush inodes with pending delalloc in a transaction, we may join
    the same transaction handle more than twice.

    The reason is:

        Task                                     use_count of trans handle
        commit_transaction                       1
        |-> btrfs_start_delalloc_inodes          1
            |-> run_delalloc_nocow               1
                |-> join_transaction             2
            |-> cow_file_range                   2
                |-> join_transaction             3

    In fact, cow_file_range needn't join the transaction again because the
    caller has already joined it, so we fix the problem by not joining again
    there.

    Reported-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Variable 'found' is no longer used.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • btrfs_wait_ordered_range() expects a 'len' argument, not an 'end' offset.
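
    A sketch of the kind of call-site fix this implies; the inode/start/end
    names here are hypothetical, not quoted from the patch:

        /* pass a length, not an end offset */
        btrfs_wait_ordered_range(inode, start, end - start + 1);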

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • When we log new names, we need to log just enough to recreate the inode
    during log replay, and there is no need to log extents along with it.

    This actually fixes a bug revealed by xfstests 241, which shows that we
    were logging extents whose metadata had not been updated, so we did not
    get proper EXTENT_DATA items copied to the log tree.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • The current behavior is to allow mounting or remounting a filesystem
    writeable in degraded mode if at least one writeable device is
    present.
    The next failed write access to a missing device which is above
    the tolerance of the configured level of redundancy results in a
    read-only enforcement. Even without this, the next time
    barrier_all_devices() is called and more devices are missing than
    tolerable, the switch to read-only mode takes place.

    In order to behave predictably and to provide proper feedback to
    the user at mount time, this patch compares the number of missing
    devices with the number of devices that are tolerated to be missing
    according to the configured RAID level. If more devices are missing
    than tolerated, e.g. if two devices are missing in case of RAID1,
    only a read-only mount and remount is allowed.
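
    A minimal userspace sketch of the mount-time decision described above;
    the tolerance table is illustrative, not the kernel's actual code:

        #include <stdbool.h>
        #include <stdio.h>

        enum raid_profile { SINGLE, RAID0, RAID1, RAID10 };

        /* how many missing devices each profile tolerates (illustrative) */
        static int tolerated_missing(enum raid_profile p)
        {
                switch (p) {
                case RAID1:
                case RAID10:
                        return 1;
                default:
                        return 0;       /* SINGLE/RAID0 tolerate none */
                }
        }

        static bool allow_rw_mount(enum raid_profile p, int missing)
        {
                return missing <= tolerated_missing(p);
        }

        int main(void)
        {
                /* two devices missing in RAID1 -> read-only mount only */
                printf("rw allowed: %d\n", allow_rw_mount(RAID1, 2));
                return 0;
        }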

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • Correct spelling typo in btrfs.

    Signed-off-by: Masanari Iida
    Signed-off-by: Chris Mason

    Masanari Iida
     
  • Remove an invalid size check from btrfs_shrink_dev().

    The new size cannot be larger than device->total_bytes, as that was
    already verified before we get here (i.e. new_size < old_size).

    Signed-off-by: Jie Liu
    Signed-off-by: Chris Mason

    jeff.liu
     

12 Dec, 2012

13 commits

  • Though processing the ordered extents differs a bit from the delalloc
    inode flush, we can see it as a subset of that flush, so we also handle
    the ordered extents with the flush workers.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Processing the ordered operations is similar to the delalloc inode
    flush, so we handle them with the flush workers as well.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • This patch introduces a new worker pool named "flush_workers". When we
    want to force all the inodes with pending delalloc to disk, we queue
    those inodes on the pool's work queue; this way, the inodes are flushed
    by multiple tasks, as in the sketch below.
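
    A userspace analogue of the idea (pthreads standing in for the kernel's
    btrfs worker API; all names below are hypothetical): queue one work item
    per dirty inode so flushing runs concurrently instead of in one loop.

        #include <pthread.h>
        #include <stdio.h>

        #define NWORKERS 4
        #define NINODES  16

        static int inodes[NINODES];
        static int next_item;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        /* stand-in for writing back one inode's delalloc pages */
        static void flush_inode(int ino)
        {
                printf("flushed inode %d\n", ino);
        }

        static void *worker(void *arg)
        {
                (void)arg;
                for (;;) {
                        int idx;

                        pthread_mutex_lock(&lock);
                        idx = next_item < NINODES ? next_item++ : -1;
                        pthread_mutex_unlock(&lock);
                        if (idx < 0)
                                return NULL;
                        flush_inode(inodes[idx]);
                }
        }

        int main(void)
        {
                pthread_t tids[NWORKERS];
                int i;

                for (i = 0; i < NINODES; i++)
                        inodes[i] = 1000 + i;
                for (i = 0; i < NWORKERS; i++)
                        pthread_create(&tids[i], NULL, worker, NULL);
                for (i = 0; i < NWORKERS; i++)
                        pthread_join(tids[i], NULL);
                return 0;
        }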

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Dave gave me an image of a very full file system that would abort the
    transaction because it ran out of space while committing the transaction.
    This is because we would think there was plenty of room to create a snapshot
    even though the global reserve was not full. This happens because we
    calculate the global reserve size before we unpin any space, so after we
    unpin the space we allow reservations to occur even though we haven't
    reserved all of the space for our global reserve. Fix this by adding to the
    global reserve while unpinning in order to make sure we always have enough
    space to do our work. With this patch we no longer end up with an aborted
    transaction, we return ENOSPC properly to the person trying to create the
    snapshot. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • 'disk_key' is not used at all.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • The argument 'tree_mod_log' is not necessary, since all of its callers
    enable it.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
    during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
    an extent buffer node, and the node's number of items remains unchanged
    (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).

    So we don't need to increase the node's number of items during rewinding;
    otherwise we may get a node larger than leafsize and cause general
    protection errors later.

    Here are the details:
    - If we do a memory move to insert a single pointer, we need to
    increase the node's nritems by one, and we honor MOD_LOG_KEY_ADD for
    adding.

    - If we do a memory move to delete a single pointer, we need to
    decrease the node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
    deleting.

    - If we do a memory move for balance left/right, we need to decrease
    the node's nritems, and we honor MOD_LOG_KEY_REMOVE for balancing.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • When we find a bitmap free space entry, we may check whether the
    previous extent entry covers the offset. But if that previous entry is
    also a bitmap entry, we keep checking the entry before the current one
    in a while loop. That is unnecessary: an extent entry in front of a
    bitmap entry can never cover the offset of an entry that comes after
    the bitmap entry.

    Signed-off-by: Miao Xie
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Alex reported a problem where we were writing between chunks on a rbd
    device. The thing is we do bio_add_page using logical offsets, but the
    physical offset may be different. So when we map the bio, we now check
    whether the bio is still ok with the physical offset; if it is not, we
    split the bio up and redo the bio_add_page with the physical sector.
    This fixes the problem for Alex and doesn't affect performance in the
    normal case. Thanks,

    Reported-and-tested-by: Alex Elder
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • In some places (such as when evicting an inode), we cannot flush the
    reserved delalloc space, though flushing the delayed directory index
    and the delayed inode is fine; yet we don't try to flush those things
    and just give up when there is not enough space to reserve. This patch
    fixes that problem.

    We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and
    FLUSH_ALL. If we are inside a transaction, we must not flush anything
    or a deadlock would happen, so we use NO_FLUSH. If flushing the
    reserved delalloc space would cause a deadlock, we use FLUSH_LIMIT. In
    the other cases FLUSH_ALL is used, and we flush everything.
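
    The three levels map naturally onto an enum; a sketch following the
    BTRFS_RESERVE_* naming this patch introduces (exact kernel definition
    assumed, not quoted):

        enum btrfs_reserve_flush_enum {
                BTRFS_RESERVE_NO_FLUSH,    /* in a transaction: flush nothing */
                BTRFS_RESERVE_FLUSH_LIMIT, /* delalloc would deadlock: flush
                                              only delayed items and inodes */
                BTRFS_RESERVE_FLUSH_ALL,   /* safe to flush everything */
        };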

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The comment does not match the code. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • div_factor{_fine} has been implemented twice; clean that up. Also move
    both helpers into an independent file named math.h, since they are
    common math functions.
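
    A userspace sketch of the two helpers; the kernel versions use do_div(),
    for which plain 64-bit division stands in here:

        #include <stdint.h>
        #include <stdio.h>

        /* scale num by factor/10, e.g. factor = 8 -> 80% of num */
        static uint64_t div_factor(uint64_t num, int factor)
        {
                if (factor == 10)
                        return num;
                return num * factor / 10;
        }

        /* finer-grained variant: scale num by factor/100 */
        static uint64_t div_factor_fine(uint64_t num, int factor)
        {
                if (factor == 100)
                        return num;
                return num * factor / 100;
        }

        int main(void)
        {
                printf("%llu\n", (unsigned long long)div_factor(1000, 8));
                printf("%llu\n", (unsigned long long)div_factor_fine(1000, 95));
                return 0;
        }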

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

11 Dec, 2012

6 commits

  • Linus Torvalds
     
  • The matrix-keymap module is currently lacking a proper module license,
    add one so we don't have this module tainting the entire kernel. This
    issue has been present since commit 1932811f426f ("Input: matrix-keymap
    - uninline and prepare for device tree support")

    Signed-off-by: Florian Fainelli
    CC: stable@vger.kernel.org # v3.5+
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Linus Torvalds

    Florian Fainelli
     
  • Pull networking fixes from David Miller:

    1) Netlink socket dumping had several missing verifications and checks.

    In particular, address comparisons in the request byte code
    interpreter could access past the end of the address in the
    inet_request_sock.

    Also, address family and address prefix lengths were not validated
    properly at all.

    This means arbitrary applications can read past the end of certain
    kernel data structures.

    Fixes from Neal Cardwell.

    2) ip_check_defrag() operates in contexts where we're in the process
    of, or about to, input the packet into the real protocols
    (specifically macvlan and AF_PACKET snooping).

    Unfortunately, it does a pskb_may_pull() which can modify the
    backing packet data which is not legal if the SKB is shared. It
    very much can be shared in this context.

    Deal with the possibility that the SKB is segmented by using
    skb_copy_bits().

    Fix from Johannes Berg based upon a report by Eric Leblond.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    ipv4: ip_check_defrag must not modify skb before unsharing
    inet_diag: validate port comparison byte code to prevent unsafe reads
    inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
    inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
    inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state

    Linus Torvalds
     
  • This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and
    d7c3b937bdf45f0b844400b7bf6fd3ed50bac604.

    This is a revert of a revert of a revert. In addition, it reverts the
    even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
    original commits in linux-next.

    It turns out that the original patch really was bogus, and that the
    original revert was the correct thing to do after all. We thought we
    had fixed the problem, and then reverted the revert, but the problem
    really is fundamental: waking up kswapd simply isn't the right thing to
    do, and direct reclaim sometimes simply _is_ the right thing to do.

    When certain allocations fail, we simply should try some direct reclaim,
    and if that fails, fail the allocation. That's the right thing to do
    for THP allocations, which can easily fail, and the GPU allocations want
    to do that too.

    So starting kswapd is sometimes simply wrong, and removing the flag that
    said "don't start kswapd" was a mistake. Let's hope we never revisit
    this mistake again - and certainly not this many times ;)

    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • ip_check_defrag() might be called from af_packet within the
    RX path where shared SKBs are used, so it must not modify
    the input SKB before it has unshared it for defragmentation.
    Use skb_copy_bits() to get the IP header and only pull in
    everything later.

    The same is true for the other caller in macvlan as it is
    called from dev->rx_handler which can also get a shared SKB.

    Reported-by: Eric Leblond
    Cc: stable@vger.kernel.org
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0.

    We are going to reinstate the __GFP_NO_KSWAPD flag that has been
    removed, the removal reverted, and then removed again, making this
    commit a pointless fixup for a problem that was caused by the removal
    of the __GFP_NO_KSWAPD flag.

    The thing is, we really don't want to wake up kswapd for THP allocations
    (because they fail quite commonly under any kind of memory pressure,
    including when there is tons of memory free), and these patches were
    just trying to fix up the underlying bug: the original removal of
    __GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD")
    was simply bogus.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Dec, 2012

4 commits

  • Add logic to verify that a port comparison byte code operation
    actually has the second inet_diag_bc_op from which we read the port
    for such operations.

    Previously the code blindly referenced op[1] without first checking
    whether a second inet_diag_bc_op struct could fit there. So a
    malicious user could make the kernel read 4 bytes beyond the end of
    the bytecode array by claiming to have a whole port comparison byte
    code (2 inet_diag_bc_op structs) when in fact the bytecode was not
    long enough to hold both.
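
    A self-contained sketch of that bounds check: before dereferencing
    op[1] for a port comparison, verify that the remaining bytecode really
    holds a second op. The struct below merely mirrors the 4-byte
    inet_diag_bc_op layout; it is not the kernel code.

        #include <stdbool.h>
        #include <stdio.h>

        struct bc_op {                  /* stand-in for inet_diag_bc_op */
                unsigned char  code;
                unsigned char  yes;
                unsigned short no;
        };

        static bool valid_port_comparison(const struct bc_op *op,
                                          unsigned int remaining_len)
        {
                (void)op;  /* only the remaining length matters here */
                /* a port comparison is two ops: the opcode and the port */
                return remaining_len >= 2 * sizeof(struct bc_op);
        }

        int main(void)
        {
                struct bc_op ops[2] = { { 7, 8, 12 }, { 0, 0, 80 } };

                /* only 4 bytes left -> op[1] would be out of bounds */
                printf("%d\n", valid_port_comparison(ops, sizeof(ops[0])));
                printf("%d\n", valid_port_comparison(ops, sizeof(ops)));
                return 0;
        }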

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Add logic to check the address family of the user-supplied conditional
    and the address family of the connection entry. We now do not do
    prefix matching of addresses from different address families (AF_INET
    vs AF_INET6), except for the previously existing support for having an
    IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
    maintains as-is).

    This change is needed for two reasons:

    (1) The addresses are different lengths, so comparing a 128-bit IPv6
    prefix match condition to a 32-bit IPv4 connection address can cause
    us to unwittingly walk off the end of the IPv4 address and read
    garbage or oops.

    (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
    simple bit-wise comparison of the prefixes is not meaningful, and
    would lead to bogus results (except for the IPv4-mapped IPv6 case,
    which this commit maintains).
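
    A simplified sketch of the family-aware matching described above; this
    is illustrative, not the kernel's inet_diag code:

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>

        static bool bits_match(const uint8_t *a, const uint8_t *b, int plen)
        {
                int bytes = plen / 8, bits = plen % 8;

                if (memcmp(a, b, bytes) != 0)
                        return false;
                if (bits && ((a[bytes] ^ b[bytes]) >> (8 - bits)))
                        return false;
                return true;
        }

        static bool prefix_matches(int cond_family, const uint8_t *cond_addr,
                                   int plen, int conn_family,
                                   const uint8_t *conn_addr)
        {
                static const uint8_t mapped[12] =
                        { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff };

                if (cond_family == conn_family)
                        return bits_match(cond_addr, conn_addr, plen);

                /* the one allowed cross-family case: an IPv4 prefix
                 * against an IPv4-mapped IPv6 connection address */
                if (cond_family == AF_INET && conn_family == AF_INET6 &&
                    memcmp(conn_addr, mapped, sizeof(mapped)) == 0)
                        return bits_match(cond_addr, conn_addr + 12, plen);

                return false;   /* never bit-compare across families */
        }

        int main(void)
        {
                uint8_t v4[4]  = { 192, 168, 0, 1 };
                uint8_t v6[16] = { 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0xff, 0xff, 192, 168, 0, 2 };

                /* 192.168.0.0/24 condition vs mapped IPv6 conn: match */
                return !prefix_matches(AF_INET, v4, 24, AF_INET6, v6);
        }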

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
    operations.

    Previously we did not validate the inet_diag_hostcond, address family,
    address length, and prefix length. So a malicious user could make the
    kernel read beyond the end of the bytecode array by claiming to have a
    whole inet_diag_hostcond when the bytecode was not long enough to
    contain a whole inet_diag_hostcond of the given address family. Or
    they could make the kernel read up to about 27 bytes beyond the end of
    a connection address by passing a prefix length that exceeded the
    length of addresses of the given family.
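
    The prefix-length part of that validation, as a small sketch (again
    illustrative, not the kernel code): reject any condition whose prefix
    length exceeds the address size of its stated family.

        #include <stdbool.h>
        #include <sys/socket.h>

        static bool valid_prefix_len(int family, unsigned int prefix_len)
        {
                unsigned int addr_bits;

                if (family == AF_INET)
                        addr_bits = 32;
                else if (family == AF_INET6)
                        addr_bits = 128;
                else
                        return false;   /* unknown family: reject */

                return prefix_len <= addr_bits;
        }

        int main(void)
        {
                /* a 128-bit prefix on an AF_INET condition is rejected */
                return valid_prefix_len(AF_INET, 128) ? 1 : 0;
        }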

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
    instantiated for IPv4 traffic and in the SYN-RECV state were actually
    created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
    means that for such connections inet6_rsk(req) returns a pointer to a
    random spot in memory up to roughly 64KB beyond the end of the
    request_sock.

    With this bug, for a server using AF_INET6 TCP sockets and serving
    IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
    inet_diag_fill_req() causing an oops or the export to user space of 16
    bytes of kernel memory as a garbage IPv6 address, depending on where
    the garbage inet6_rsk(req) pointed.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

09 Dec, 2012

2 commits

  • commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
    to individual uncompactable zones") removed zone watermark checks from
    the compaction code in kswapd but left in the zone congestion clearing,
    which now happens unconditionally on higher order reclaim.

    This messes up the reclaim throttling logic for zones with
    dirty/writeback pages, where zones should only lose their congestion
    status when their watermarks have been restored.

    Remove the clearing from the zone compaction section entirely. The
    preliminary zone check and the reclaim loop in kswapd will clear it if
    the zone is considered balanced.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The direct-IO write path already had the i_size checks in mm/filemap.c,
    but it turns out the read path did not, and removing the block size
    checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make
    private to fs/buffer.c") removed the magic "shrink IO to past the end of
    the device" code there.

    Fix it by truncating the IO to the size of the block device, like the
    write path already does.
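
    A minimal sketch of the clamp applied on the read side: truncate the
    request so it never runs past the device size. Names and types here are
    simplified stand-ins, not the fs/block_dev.c code.

        #include <stddef.h>
        #include <stdio.h>

        static size_t clamp_to_size(long long pos, size_t count,
                                    long long size)
        {
                if (pos >= size)
                        return 0;               /* at/past the end: nothing */
                if ((long long)count > size - pos)
                        count = size - pos;     /* shrink IO to device end */
                return count;
        }

        int main(void)
        {
                /* 4096-byte device: a 512-byte read at 4000 shrinks to 96 */
                printf("%zu\n", clamp_to_size(4000, 512, 4096));
                return 0;
        }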

    NOTE! I suspect the write path would be *much* better off doing it this
    way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The
    mm/filemap.c code is extremely hard to follow, and has various
    conditionals on the target being a block device (ie the flag passed in
    to 'generic_write_checks()', along with a conditional update of the
    inode timestamp etc).

    It is also quite possible that we should treat this whole block device
    size as a "s_maxbytes" issue, and try to make the logic even more
    generic. However, in the meantime this is the fairly minimal targeted
    fix.

    Noted by Milan Broz thanks to a regression test for the cryptsetup
    reencrypt tool.

    Reported-and-tested-by: Milan Broz
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2012

4 commits

  • Pull networking fixes from David Miller:
    "Two stragglers:

    1) The new code that adds new flushing semantics to GRO can cause SKB
    pointer list corruption, manage the lists differently to avoid the
    OOPS. Fix from Eric Dumazet.

    2) When TCP fast open does a retransmit of data in a SYN-ACK or
    similar, we update retransmit state that we shouldn't triggering a
    WARN_ON later. Fix from Yuchung Cheng."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net: gro: fix possible panic in skb_gro_receive()
    tcp: bug fix Fast Open client retransmission

    Linus Torvalds
     
  • commit 2e71a6f8084e (net: gro: selective flush of packets) added
    a bug for skbs using frag_list. This part of the GRO stack is rarely
    used, as it needs skbs that do not use a page fragment for their
    skb->head.

    Most drivers do use a page fragment, but some of them use GFP_KERNEL
    allocations for the initial fill of their RX ring buffer.

    napi_gro_flush() overwrote skb->prev, which these skbs used to point
    to the last skb in the frag_list.

    Fix this using a separate field in struct napi_gro_cb to point to the
    last fragment.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If SYN-ACK partially acks SYN-data, the client retransmits the
    remaining data by tcp_retransmit_skb(). This increments lost recovery
    state variables like tp->retrans_out in Open state. If loss recovery
    happens before the retransmission is acked, it triggers the WARN_ON
    check in tcp_fastretrans_alert(). For example: the client sends
    SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
    another 4 data packets and get 3 dupacks.

    Since the retransmission is not caused by network drop it should not
    update the recovery state variables. Further, the server may return a
    smaller MSS than the cached MSS used for SYN-data, so the retransmission
    needs a loop; otherwise some data will not be retransmitted until timeout
    or other loss recovery events.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Pull MMC fixes from Chris Ball:
    "Two small regression fixes:

    - sdhci-s3c: Fix runtime PM regression against 3.7-rc1
    - sh-mmcif: Fix oops against 3.6"

    * tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
    mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
    Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"
    mmc: sdhci-s3c: fix missing clock for gpio card-detect

    Linus Torvalds
     

07 Dec, 2012

2 commits

  • This fixes a regression in 3.7-rc, which has since gone into stable.

    Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
    imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
    refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
    on expecting alloc_page_vma() to drop the refcount it had acquired.
    This deserves a rework: but for now fix the leak in shmem_alloc_page().

    Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
    the same refcounting there as in shmem_alloc_page(), delete its onstack
    mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
    those were invented to let swapin_readahead() make an unknown number of
    calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
    alloc_pages_vma() has kept refcount in balance, so now no problem.

    Reported-and-tested-by: Tommi Rantala
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a zone meets its high watermark and is compactable in case of
    higher order allocations, it contributes to the percentage of the node's
    memory that is considered balanced.

    This requirement, that a node be only partially balanced, came about
    when kswapd was desperately trying to balance tiny zones when all bigger
    zones in the node had plenty of free memory. Arguably, the same should
    apply to compaction: if a significant part of the node is balanced
    enough to run compaction, do not get hung up on that tiny zone that
    might never get in shape.

    When the compaction logic in kswapd is reached, we know that at least
    25% of the node's memory is balanced properly for compaction (see
    zone_balanced and pgdat_balanced). Remove the individual zone checks
    that restart the kswapd cycle.

    Otherwise, we may observe more endless looping in kswapd where the
    compaction code loops back to reclaim because of a single zone and
    reclaim does nothing because the node is considered balanced overall.

    See for example

    https://bugzilla.redhat.com/show_bug.cgi?id=866988

    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Thorsten Leemhuis
    Reported-by: Jiri Slaby
    Tested-by: John Ellson
    Tested-by: Zdenek Kabelac
    Tested-by: Bruno Wolff III
    Signed-off-by: Linus Torvalds

    Johannes Weiner