26 Jul, 2011

40 commits

  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.
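
    For illustration only (the exact code is arch-specific and the register
    names below are made up), the cleanup amounts to dropping the explicit
    set_fs() from the arch's start_thread()-style exec setup, since
    flush_old_exec() in fs/exec.c has already done set_fs(USER_DS) for the
    new image:

        /* Hypothetical arch start_thread() after the cleanup: no set_fs()
         * here, flush_old_exec() has already set the limit to USER_DS. */
        static inline void start_thread(struct pt_regs *regs,
                                        unsigned long pc, unsigned long usp)
        {
                regs->pc = pc;          /* entry point of the new program */
                regs->sp = usp;         /* initial user stack pointer */
        }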

    Signed-off-by: Mathias Krause
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • Fix some harmless warnings such as

    arch/cris/arch-v32/mach-a3/pinmux.c:273: warning: ISO C90 forbids mixed declarations and code
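
    For reference, this class of warning comes from declaring a variable after
    a statement inside a block, which C90 (and the kernel's
    -Wdeclaration-after-statement) forbids. A generic illustration, not the
    pinmux.c code, and its fix:

        /* warns: declaration after a statement */
        int example(int flag)
        {
                if (!flag)
                        return 0;
                int value = flag * 2;
                return value;
        }

        /* C90-clean: declare at the top of the block, assign later */
        int example_fixed(int flag)
        {
                int value;

                if (!flag)
                        return 0;
                value = flag * 2;
                return value;
        }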

    Signed-off-by: WANG Cong
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Signed-off-by: Mathias Krause
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Signed-off-by: Mathias Krause
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • NR_WRITTEN is now accounted at block IO enqueue time, which is not very
    accurate as to common understanding. This moves NR_WRITTEN accounting to
    the IO completion time and makes it more consistent with BDI_WRITTEN,
    which is used for bandwidth estimation.

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • shmem_unuse_inode() and shmem_writepage() contain a little code to cope
    with pages inserted independently into the filecache, probably by a
    filesystem stacked on top of tmpfs, then fed to its ->readpage() or
    ->writepage().

    Unionfs was indeed experimenting with working in that way three years ago,
    but I find no current examples: nowadays the stacking filesystems use vfs
    interfaces to the lower filesystem.

    It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
    filepage passed in via shmem_readpage(), then swappage found, which must
    then be copied over to it.

    Although at first it's tempting to replace the **pagep arg by returning
    struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
    callers, so leave as is.

    Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
    complication came from uninitialized pages inserted into filecache prior
    to readpage; but now we're in control, and only release pagelock on
    filecache once it's uptodate (if an error occurs in reading back from
    swap, the page remains in swapcache, never moved to filecache).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
    complicated: first simplify that before going on to filepage/swappage.

    That's right, don't report ENOMEM when the preallocation fails: we may or
    may not need the page. But simply report ENOMEM once we find we do need
    it, instead of dropping lock, repeating allocation, unwinding on failure
    etc. And leave the out label on the fast path, don't goto.

    Fix something that looks like a bug but turns out not to be: set
    PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
    the removed case was doing. That's important before adding to LRU
    (determines which LRU the page goes on), and does affect which path it
    takes through memcontrol.c, but in the end MEM_CGROUP_CHARGE_TYPE_SHMEM
    is handled no differently from CACHE.

    Signed-off-by: Hugh Dickins
    Acked-by: Shaohua Li
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove that pernicious shmem_readpage() at last: the things we needed it
    for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
    shmem_file_splice_read() and shmem_read_mapping_page_gfp().

    This removal clears the way for a simpler shmem_getpage_gfp(), since page
    is never passed in; but leave most of that cleanup until after.

    sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
    instead of unexpectedly trying to read ahead on tmpfs: if that proves to
    be an issue for someone, then we can either arrange for them to return
    success instead, or try to implement async readahead on tmpfs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
    shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().
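
    A minimal sketch of the resulting wrapper shape (the exact parameter list
    in mm/shmem.c may differ slightly):

        static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                                        struct page **pagep, enum sgp_type sgp,
                                        int *fault_type)
        {
                /* gfp now comes from the mapping instead of being hard-coded,
                 * and is also what shmem_swp_alloc() will see. */
                return shmem_getpage_gfp(inode, index, pagep, sgp,
                                         mapping_gfp_mask(inode->i_mapping),
                                         fault_type);
        }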

    Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
    CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().

    Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
    though we might later want to let that case route to read_cache_page_gfp().

    It annoys me to have these two almost-redundant args, gfp and fault_type:
    I can't find a better way; but initialize fault_type only in shmem_fault().

    Note that before, read_cache_page_gfp() was allocating i915_gem's pages
    with __GFP_NORETRY as intended; but the corresponding swap vector pages
    got allocated without it, leaving a small possibility of OOM.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up shmem_file_splice_read():

    Remove readahead: okay, we could implement shmem readahead on swap,
    but have never done so before, swap being the slow exceptional path.

    Use shmem_getpage() instead of find_or_create_page() plus ->readpage().

    Remove several comments: sorry, I found them more distracting than
    helpful, and this will not be the reference version of splice_read().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Copy __generic_file_splice_read() and generic_file_splice_read() from
    fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
    page_cache_pipe_buf_ops and spd_release_page() accessible to it.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend not to swap much (if at all), so we probably rarely hit
    the case where dirty is missing or young is missing, but I think Shan has
    a piece of SW that can reliably reproduce it using a shared writable
    mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only, and a page without young is inaccessible in
    the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
    "fix it up" by calling get_user_pages(), which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
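
    A sketch of the idea only, not the exact patch (the real fixup_user_fault()
    may differ in signature and detail): look up the VMA and call
    handle_mm_fault() just as a real fault would, instead of going through
    get_user_pages().

        int fixup_user_fault_sketch(struct mm_struct *mm, unsigned long address,
                                    unsigned int fault_flags)
        {
                struct vm_area_struct *vma;
                int ret;

                down_read(&mm->mmap_sem);
                vma = find_vma(mm, address);
                if (!vma || address < vma->vm_start) {
                        up_read(&mm->mmap_sem);
                        return -EFAULT;
                }
                /* handle_mm_fault() fixes up PTE dirty/young and lets the
                 * arch do any "spurious fault" TLB flushing, unlike gup. */
                ret = handle_mm_fault(mm, vma, address, fault_flags);
                up_read(&mm->mmap_sem);

                return (ret & VM_FAULT_ERROR) ? -EFAULT : 0;
        }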

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • radix_tree_tagged() is lockless - it reads from a member of the radix-tree
    root node. It does not require any protection.
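
    For illustration (hypothetical helper), a caller can therefore test a tag
    without taking the tree lock or rcu_read_lock():

        static int mapping_has_tag(struct address_space *mapping, int tag)
        {
                /* only reads the tag bitmap in the radix-tree root node */
                return radix_tree_tagged(&mapping->page_tree, tag);
        }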

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fallback
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirs zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead, improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly, such as when the first eligible zone has mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it should not have enough
    unmapped pages for zone_reclaim but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we are keeping the faulted page locked throughout the whole
    __do_fault call (except for the page_mkwrite code path) after calling the
    file system's fault code. If we do early COW, we allocate a new page which
    has to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for an unbounded amount of time if the
    memcg oom killer is disabled or a fork-bomb is running, because the only
    way out of the OOM situation is either an external event or an
    OOM-situation fix.

    In the end we are keeping the faulted page locked and blocking other
    processes from faulting it in, which is not good at all because we are
    basically punishing a potentially unrelated process for an OOM condition
    in a different group (I have seen a stuck system because of ld-2.11.1.so
    being locked).

    We can test this easily.

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then, the whole system will be live-locked until you kill 'make -j'
    by hand (or hit reboot...). This is because some important pages in a
    shared library are locked.

    Considering this again, the new page does not need to be allocated with
    lock_page() held. And the usual page allocation may dive into a long
    memory reclaim loop while holding lock_page(), which can cause very long
    latency.

    There are 3 ways.
    1. do allocation/charge before lock_page()
    Pros. - simple and can handle page allocation in the same manner.
    This will reduce holding time of lock_page() in general.
    Cons. - we do page allocation even if ->fault() returns error.

    2. do charge after unlock_page(). Even if charge fails, it's just OOM.
    Pros. - no impact to non-memcg path.
    Cons. - the implementation requires special care of the LRU and we need to
    modify page_add_new_anon_rmap()...

    3. do unlock->charge->lock again method.
    Pros. - no impact to non-memcg path.
    Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
    lock and get it again...

    This patch moves "charge" and memory allocation for COW page
    before lock_page(). Then, we can avoid scanning LRU with holding
    a lock on a page and latency under lock_page() will be reduced.

    Then, above livelock disappears.

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 2.6.36's 7e496299d4d2 ("tmpfs: make tmpfs scalable with percpu_counter for
    used blocks"), to make tmpfs scalable with percpu_counter, used
    inode->i_lock in place of sbinfo->stat_lock around i_blocks updates; but
    that was adverse to scalability, and unnecessary, since info->lock is
    already held there in the fast paths.

    Remove those uses of i_lock, and add info->lock in the three error paths
    where it's then needed across shmem_free_blocks(). It's not actually
    needed across shmem_unacct_blocks(), but they're so often paired that it
    looks wrong to split them apart.

    Signed-off-by: Hugh Dickins
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • truncate_inode_pages_range()'s final loop has a nice pincer property,
    bringing start and end together, squeezing out the last pages. But the
    range handling missed out on that, just sliding up the range, perhaps
    letting pages come in behind it. Add one more test to give it the same
    pincer effect.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use consistent variable names in truncate_pagecache(), truncate_setsize(),
    vmtruncate() and vmtruncate_range().

    unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
    don't change either, but make the vmtruncates more precise about what they
    expect unmap_mapping_range() to do.

    vmtruncate_range() is currently called only with page-aligned start and
    end+1: can handle unaligned start, but unaligned end+1 would hit BUG_ON in
    truncate_inode_pages_range() (lacks partial clearing of the end page).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Correct comment on truncate_inode_pages*() in linux/mm.h; and remove
    declaration of page_unuse(), it didn't exist even in 2.2.26 or 2.4.0!

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().
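
    The interface shape (approximate) makes that clearer: "data" is simply
    forwarded as the filler's first argument. A hypothetical filler that hands
    a struct file * on to ->readpage():

        struct page *read_cache_page(struct address_space *mapping,
                                     pgoff_t index,
                                     int (*filler)(void *, struct page *),
                                     void *data);

        static int file_readpage_filler(void *data, struct page *page)
        {
                struct file *file = data;       /* not a destination buffer */

                return file->f_mapping->a_ops->readpage(file, page);
        }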

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • Luckily there are still a few software PTE bits remaining and they even
    match up in both the sun4u and sun4v pte layouts.

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • Make use of the generic RCU page table freeing on Sparc64, doing so allows
    for race-free software page-table walkers like gup_fast().

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • With the recent mmu_gather changes that included generic RCU freeing of
    page-tables, it is now quite straightforward to implement gup_fast() on
    sparc64.

    This patch:

    Remove the page table quicklists. They are pointless and make it harder
    to use RCU page table freeing and share code with other architectures.

    BTW, this is the second time this has happened, see commit 3c936465249f
    ("[SPARC64]: Kill pgtable quicklists and use SLAB.")

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • - shmem pages are not immediately available, but they are not
    potentially available either: even if we swap them out, they will just
    relocate from memory into swap; the total amount of immediately and
    potentially available memory is not going to be affected, so we
    shouldn't count them as potentially free in the first place.

    - nr_free_pages() is not an expensive operation anymore; there is no
    need to split the decision making into two halves and repeat code.

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
     
  • RED_INACTIVE is a slab thing, and reusing it for memblock was
    inappropriate, because memblock is dealing with phys_addr_t's which have a
    Kconfigurable sizeof().

    Create a new poison type for this application. Fixes the sparse warning

    warning: cast truncates bits from constant value (9f911029d74e35b becomes 9d74e35b)

    Reported-by: H Hartley Sweeten
    Tested-by: H Hartley Sweeten
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
    according to Documentation/feature-removal-schedule.txt.

    This patch makes the warning more verbose by making it appear as a more
    serious problem (the presence of a stack trace and being multiline should
    attract more attention) so that applications still using the old interface
    can get fixed.
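
    As a hypothetical user-space sketch of that fix (the file paths are real,
    the helper name is made up): write the replacement tunable, whose range is
    -1000..1000 rather than oom_adj's -17..15.

        #include <stdio.h>

        /* Adjust this process's OOM score via the non-deprecated interface. */
        static int set_oom_score_adj(int value)
        {
                FILE *f = fopen("/proc/self/oom_score_adj", "w");

                if (!f)
                        return -1;
                fprintf(f, "%d\n", value);      /* -1000 (never kill) .. 1000 */
                return fclose(f);
        }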

    Very popular users of the old interface have been converted since the oom
    killer rewrite has been introduced. udevd switched to the
    /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
    opensshd switched in 5.7p1.

    At the start of 2012, this should be changed into a WARN() to emit all
    such incidents and then finally remove the tunable in August 2012 as
    scheduled.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The badness() function in the oom killer was renamed to oom_badness() in
    a63d83f427fb ("oom: badness heuristic rewrite") for clarity, since it is a
    globally exported function.

    The prototype for the old function still existed in linux/oom.h, so remove
    it. There are no existing users.

    Also fixes documentation and comment references to badness() and adjusts
    them accordingly.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
    Remove i_mmap_lock lockbreak"). So zap it.

    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fix coding style issues flagged by checkpatch.pl

    Signed-off-by: Chris Forbes
    Acked-by: Eric B Munson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Forbes
     
  • The lock is released first thing in all three branches. Simplify this by
    unconditionally releasing the lock and removing the else clause, which was
    only there to be sure the lock was released.

    Signed-off-by: Chris Wright
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • Commit a539f3533b78e3 ("mm: add SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macro") introduced the SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macros. Use those macros to increase code
    readability.
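
    For reference, the helpers are roughly (as defined in
    include/linux/mmzone.h):

        #define SECTION_ALIGN_UP(pfn)   (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
        #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)

    so open-coded rounding like "(pfn + PAGES_PER_SECTION - 1) &
    PAGE_SECTION_MASK" becomes SECTION_ALIGN_UP(pfn).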

    Signed-off-by: Daniel Kiper
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • In commit a2c8990aed5ab ("memsw: remove noswapaccount kernel parameter"),
    Michal forgot to remove some leftover pieces of noswapaccount in the tree;
    this patch removes them all.

    Signed-off-by: WANG Cong
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed the locking
    behavior of walk_page_range(), so this patch updates the comment too.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Originally, walk_hugetlb_range() didn't require the caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma() call in
    walk_hugetlb_range().

    Any commit that changes a locking rule should document it too.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro