Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

04 Apr, 2014

40 commits

c558784fa include/linux/mm.h: remove ifdef condition ... Browse Code »

The ifdef conditions in include/linux/mm.h presents three cases:

- !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

There is no actual definition of function but include/linux/mm.h has a
static inline stub defined.

- defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

linux/mm.h does not define a prototype, but mm/page_alloc.c defines
the function.

Hence, compiler reports the following warning:

mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes]

- defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)

The architecture defines the function, and linux/mm.h has a
prototype.

Thus, join the conditions of Case 2 and 3 ie eliminate the ifdef
condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID to eliminate the missing
prototype warning from file mm/page_alloc.c.

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Reviewed-by: Rik van Riel
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:03 +0800
de4985072 mm/nobootmem.c: mark function as static ... Browse Code »

Mark function as static in nobootmem.c because it is not used outside
this file.

This eliminates the following warning in mm/nobootmem.c:

mm/nobootmem.c:324:15: warning: no previous prototype for `___alloc_bootmem_node' [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
d20199e1f mm/page_cgroup.c: mark functions as static ... Browse Code »

Mark functions as static in page_cgroup.c because they are not used
outside this file.

This eliminates the following warning in mm/page_cgroup.c:

mm/page_cgroup.c:177:6: warning: no previous prototype for `__free_page_cgroup' [-Wmissing-prototypes]
mm/page_cgroup.c:190:15: warning: no previous prototype for `online_page_cgroup' [-Wmissing-prototypes]
mm/page_cgroup.c:225:15: warning: no previous prototype for `offline_page_cgroup' [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
2eb2e141c mm/process_vm_access.c: mark function as static ... Browse Code »

Mark function as static in process_vm_access.c because it is not used
outside this file.

This eliminates the following warning in mm/process_vm_access.c:

mm/process_vm_access.c:416:1: warning: no previous prototype for `compat_process_vm_rw' [-Wmissing-prototypes]

[akpm@linux-foundation.org: remove unneeded asmlinkage - compat_process_vm_rw isn't referenced from asm]
Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
eafd4dc4d mm/mmap.c: mark function as static ... Browse Code »

Mark function as static in mmap.c because they are not used outside this
file.

This eliminates the following warning in mm/mmap.c:

mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
b19a99392 mm/memory.c: mark functions as static ... Browse Code »

mark functions as static in memory.c because they are not used outside
this file.

This eliminates the following warnings in mm/memory.c:

mm/memory.c:3530:5: warning: no previous prototype for `numa_migrate_prep' [-Wmissing-prototypes]
mm/memory.c:3545:5: warning: no previous prototype for `do_numa_page' [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
74e77fb9a mm/compaction.c: mark function as static ... Browse Code »

Mark function as static in compaction.c because it is not used outside
this file.

This eliminates the following warning from mm/compaction.c:

mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes

Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-04 07:21:02 +0800
119d6d59d mm, compaction: avoid isolating pinned pages ... Browse Code »
7

Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages(). In this case, it is unnecessary to take
zone->lru_lock or isolating the page and passing it to page migration
which will ultimately fail.

This is a racy check, the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.

This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.

On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):

compact_pages_moved 8450
compact_pagemigrate_failed 15614415

0.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds. After the patch:

compact_pages_moved 9197
compact_pagemigrate_failed 7

99.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.

Signed-off-by: David Rientjes
Acked-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: Joonsoo Kim
Cc: Rik van Riel
Cc: Greg Thelen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-04-04 07:21:01 +0800
f412c97ab mm, hugetlb: mark some bootstrap functions as __init ... Browse Code »

Both prep_compound_huge_page() and prep_compound_gigantic_page() are
only called at bootstrap and can be marked as __init.

The __SetPageTail(page) in prep_compound_gigantic_page() happening
before page->first_page is initialized is not concerning since this is
bootstrap.

Signed-off-by: David Rientjes
Reviewed-by: Michal Hocko
Cc: Joonsoo Kim
Reviewed-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-04-04 07:21:01 +0800
449dd6984 mm: keep page cache radix tree nodes in check ... Browse Code »
17

Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.

[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
139e56166 lib: radix_tree: tree node interface ... Browse Code »
2

Make struct radix_tree_node part of the public interface and provide API
functions to create, look up, and delete whole nodes. Refactor the
existing insert, look up, delete functions on top of these new node
primitives.

This will allow the VM to track and garbage collect page cache radix
tree nodes.

[sasha.levin@oracle.com: return correct error code on insertion failure]
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
a528910e1 mm: thrash detection-based file cache sizing ... Browse Code »
2

The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past. We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size. By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot. On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list. And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Reviewed-by: Bob Liu
Cc: Andrea Arcangeli
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
91b0abe36 mm + fs: store shadow entries in page cache ... Browse Code »

Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
0cd6144aa mm + fs: prepare for non-page entries in page cache radix trees ... Browse Code »
77

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.

The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before. But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
e7b563bb2 mm: filemap: move radix tree hole searching here ... Browse Code »
7

The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot". But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.

Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
6dbaf22ce mm: shmem: save one radix tree lookup when truncating swapped pages ... Browse Code »
7

Page cache radix tree slots are usually stabilized by the page lock, but
shmem's swap cookies have no such thing. Because the overall truncation
loop is lockless, the swap entry is currently confirmed by a tree lookup
and then deleted by another tree lookup under the same tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup. This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner
Reviewed-by: Minchan Kim
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
53c59f262 lib: radix-tree: add radix_tree_delete_item() ... Browse Code »
7

Provide a function that does not just delete an entry at a given index,
but also allows passing in an expected item. Delete only if that item
is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well because they don't have to do an second, locked lookup to verify
the slot has not changed under them before deleting the entry.

Signed-off-by: Johannes Weiner
Reviewed-by: Minchan Kim
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
55881bc76 fs: cachefiles: use add_to_page_cache_lru() ... Browse Code »

This code used to have its own lru cache pagevec up until a0b8cab3 ("mm:
remove lru parameter from __pagevec_lru_add and remove parts of pagevec
API"). Now it's just add_to_page_cache() followed by lru_cache_add(),
might as well use add_to_page_cache_lru() directly.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
6a3ed2123 mm: vmstat: fix UP zone state accounting ... Browse Code »

Summary:

The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past. We call the recently used list
"inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages. This means
that working set changes bigger than half of cache memory go undetected
and thrash indefinitely, whereas working sets bigger than half of cache
memory are unprotected against used-once streams that don't even need
caching.

This happens on file servers and media streaming servers, where the
popular files and file sections change over time. Even though the
individual files might be smaller than half of memory, concurrent access
to many of them may still result in their inter-reference distance being
greater than half of memory. It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory. In these cases the VM never recognizes the
new working set and will for the remainder of the workload thrash disk
data which could easily live in memory.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved them
to the head of the inactive list. This model gave established working
sets more gracetime in the face of temporary use-once streams, but
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This series solves the problem by maintaining a history of pages evicted
from the inactive list, enabling the VM to detect frequently used pages
regardless of inactive list size and facilitate working set transitions.

Tests:

The reported database workload is easily demonstrated on a 8G machine
with two filesets a 6G. This fio workload operates on one set first,
then switches to the other. The VM should obviously always cache the
set that the workload is currently using.

This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

real 27m15.541s
user 0m19.059s
sys 0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

real 6m8.630s
user 0m14.714s
sys 0m31.233s

As can be seen, the unpatched kernel simply never adapts to the
workingset change and db2 is stuck indefinitely with secondary storage
speed. The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed. Given the unbounded
negative affect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.

Another test resembles a fileserver or streaming server workload, where
data in excess of memory size is accessed at different frequencies.
There is very hot data accessed at a high frequency. Machines should be
fitted so that the hot set of such a workload can be fully cached or all
bets are off. Then there is a very big (compared to available memory)
set of data that is used-once or at a very low frequency; this is what
drives the inactive list and does not really benefit from caching.
Lastly, there is a big set of warm data in between that is accessed at
medium frequencies and benefits from caching the pages between the first
and last streamer of each burst.

unpatched:
hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%

patched:
hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%

In both kernels, the hot set is propagated to the active list and then
served from cache.

In both kernels, the beginning of the warm set is propagated to the
active list as well, but in the unpatched case the active list
eventually takes up half of memory and no new pages from the warm set
get activated, despite repeated access, and despite most of the active
list soon being stale. The patched kernel on the other hand detects the
thrashing and manages to keep this cache window rolling through the data
set. This frees up enough IO bandwidth that the cold set is served at
full speed as well and disk utilization even drops by 20%.

For reference, this same test was performed with the traditional
demotion mechanism, where deactivation is coupled to inactive list
reclaim. However, this had the same outcome as the unpatched kernel:
while the warm set does indeed get activated continuously, it is forced
out of the active list by inactive list pressure, which is dictated
primarily by the unrelated cold set. The warm set is evicted before
subsequent streamers can benefit from it, even though there would be
enough space available to cache the pages of interest.

Costs:

Page reclaim used to shrink the radix trees but now the tree nodes are
reused for shadow entries, where the cost depends heavily on the page
cache access patterns. However, with workloads that maintain spatial or
temporal locality, the shadow entries are either refaulted quickly or
reclaimed along with the inode object itself. Workloads that will
experience a memory cost increase are those that don't really benefit
from caching in the first place.

A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur relatively higher memory cost for
well-behaved workloads at the benefit of cornercases. It would also
make the shadow entry lookup more costly compared to storing them
directly in the cache structure.

Future:

To simplify the merging process, this patch set is implementing thrash
detection on a global per-zone level only for now, but the design is
such that it can be extended to memory cgroups as well. All we need to
do is store the unique cgroup ID along the node and zone identifier
inside the eviction cookie to identify the lruvec.

Right now we have a fixed ratio (50:50) between inactive and active list
but we already have complaints about working sets exceeding half of
memory being pushed out of the cache by simple streaming in the
background. Ultimately, we want to adjust this ratio and allow for a
much smaller inactive list. These patches are an essential step in this
direction because they decouple the VMs ability to detect working set
changes from the inactive list size. This would allow us to base the
inactive list size on the combined readahead window size for example and
potentially protect a much bigger working set.

It's also a big step towards activating pages with a reuse distance
larger than memory, as long as they are the most frequently used pages
in the workload. This will require knowing more about the access
frequency of active pages than what we measure right now, so it's also
deferred in this series.

Another possibility of having thrashing information would be to revisit
the idea of local reclaim in the form of zero-config memory control
groups. Instead of having allocating tasks go straight to global
reclaim, they could try to reclaim the pages in the memcg they are part
of first as long as the group is not thrashing. This would allow a user
to drop e.g. a back-up job in an otherwise unconfigured memcg and it
would only inflate (and possibly do global reclaim) until it has enough
memory to do proper readahead. But once it reaches that point and stops
thrashing it would just recycle its own used-once pages without kicking
out the cache of any other tasks in the system more than necessary.

This patch (of 10):

Fengguang Wu's build testing spotted problems with inc_zone_state() and
dec_zone_state() on UP configurations in out-of-tree patches.

inc_zone_state() is declared but not defined, dec_zone_state() is
missing entirely.

Just like with *_zone_page_state(), they can be defined like their
preemption-unsafe counterparts on UP.

[akpm@linux-foundation.org: make it build]
Signed-off-by: Johannes Weiner
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Minchan Kim
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:00 +0800
d5bc5fd3f mm: vmscan: shrink_slab: rename max_pass -> freeable ... Browse Code »
7

The name `max_pass' is misleading, because this variable actually keeps
the estimate number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that. Rename it to
reflect its actual meaning.

Signed-off-by: Vladimir Davydov
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-04-04 07:21:00 +0800
8382d914e mm, hugetlb: improve page-fault scalability ... Browse Code »

The kernel can currently only handle a single hugetlb page fault at a
time. This is due to a single mutex that serializes the entire path.
This lock protects from spurious OOM errors under conditions of low
availability of free hugepages. This problem is specific to hugepages,
because it is normal to want to use every single hugepage in the system
- with normal pages we simply assume there will always be a few spare
pages which can be used temporarily until the race is resolved.

Address this problem by using a table of mutexes, allowing a better
chance of parallelization, where each hugepage is individually
serialized. The hash key is selected depending on the mapping type.
For shared ones it consists of the address space and file offset being
faulted; while for private ones the mm and virtual address are used.
The size of the table is selected based on a compromise of collisions
and memory footprint of a series of database workloads.

Large database workloads that make heavy use of hugepages can be
particularly exposed to this issue, causing start-up times to be
painfully slow. This patch reduces the startup time of a 10 Gb Oracle
DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads
will naturally benefit even more.

NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a
small race is possible in private mappings: A child process (with its
own mm, after cow) can instantiate a page that is already being handled
by the parent in a cow fault. When low on pages, can trigger spurious
OOMs. I have not been able to think of a efficient way of handling
this... but do we really care about such a tiny window? We already
maintain another theoretical race with normal pages. If not, one
possible way to is to maintain the single hash for private mappings --
any workloads that *really* suffer from this scaling problem should
already use shared mappings.

[akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
Signed-off-by: Davidlohr Bueso
Cc: Aneesh Kumar K.V
Cc: David Gibson
Cc: Joonsoo Kim
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2014-04-04 07:21:00 +0800
4e35f4838 mm, hugetlb: use vma_resv_map() map types ... Browse Code »

Util now, we get a resv_map by two ways according to each mapping type.
This makes code dirty and unreadable. Unify it.

[davidlohr@hp.com: code cleanups]
Signed-off-by: Joonsoo Kim
Signed-off-by: Davidlohr Bueso
Reviewed-by: Aneesh Kumar K.V
Reviewed-by: Naoya Horiguchi
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joonsoo Kim
2014-04-04 07:20:59 +0800
f031dd274 mm, hugetlb: remove resv_map_put ... Browse Code »

This is a preparation patch to unify the use of vma_resv_map()
regardless of the map type. This patch prepares it by removing
resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
for all resv_maps.

[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim
Signed-off-by: Davidlohr Bueso
Reviewed-by: Aneesh Kumar K.V
Reviewed-by: Naoya Horiguchi
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joonsoo Kim
2014-04-04 07:20:59 +0800
7b24d8616 mm, hugetlb: fix race in region tracking ... Browse Code »

There is a race condition if we map a same file on different processes.
Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
mmap_sem (exclusively). This doesn't prevent other tasks from modifying
the region structure, so it can be modified by two processes
concurrently.

To solve this, introduce a spinlock to resv_map and make region
manipulation function grab it before they do actual work.

[davidlohr@hp.com: updated changelog]
Signed-off-by: Davidlohr Bueso
Signed-off-by: Joonsoo Kim
Suggested-by: Joonsoo Kim
Acked-by: David Gibson
Cc: David Gibson
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2014-04-04 07:20:59 +0800
1406ec9ba mm, hugetlb: improve, cleanup resv_map parameters ... Browse Code »

To change a protection method for region tracking to find grained one,
we pass the resv_map, instead of list_head, to region manipulation
functions.

This doesn't introduce any functional change, and it is just for
preparing a next step.

[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim
Signed-off-by: Davidlohr Bueso
Reviewed-by: Aneesh Kumar K.V
Reviewed-by: Naoya Horiguchi
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joonsoo Kim
2014-04-04 07:20:59 +0800
9119a41e9 mm, hugetlb: unify region structure handling ... Browse Code »

Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping. For MAP_SHARED, we use
address_mapping's private_list and, while for MAP_PRIVATE, we use a
resv_map.

Now, we are preparing to change a coarse grained lock which protect a
region structure to fine grained lock, and this difference hinder it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.

Signed-off-by: Joonsoo Kim
Signed-off-by: Davidlohr Bueso
Reviewed-by: Aneesh Kumar K.V
Reviewed-by: Naoya Horiguchi
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joonsoo Kim
2014-04-04 07:20:59 +0800
d26914d11 mm: optimize put_mems_allowed() usage ... Browse Code »
7

Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.

Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.

Signed-off-by: Peter Zijlstra
Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2014-04-04 07:20:58 +0800
91ca91864 mm, compaction: ignore pageblock skip when manually invoking compaction ... Browse Code »
7

The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive, there's no need
to skip pageblocks based on heuristics (mainly for debugging).

Signed-off-by: David Rientjes
Acked-by: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-04-04 07:20:58 +0800
3115cd914 mm: vmscan: remove shrink_control arg from do_try_to_free_pages() ... Browse Code »

There is no need passing on a shrink_control struct from
try_to_free_pages() and friends to do_try_to_free_pages() and then to
shrink_zones(), because it is only used in shrink_zones() and the only
field initialized on the top level is gfp_mask, which is always equal to
scan_control.gfp_mask. So let's move shrink_control initialization to
shrink_zones().

Signed-off-by: Vladimir Davydov
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Dave Chinner
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-04-04 07:20:58 +0800
65ec02cb9 mm: vmscan: move call to shrink_slab() to shrink_zones() ... Browse Code »

This reduces the indentation level of do_try_to_free_pages() and removes
extra loop over all eligible zones counting the number of on-LRU pages.

Signed-off-by: Vladimir Davydov
Reviewed-by: Glauber Costa
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-04-04 07:20:58 +0800
99120b772 mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim ... Browse Code »
7

When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware. That said, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good. Fix this.

Signed-off-by: Vladimir Davydov
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Dave Chinner
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-04-04 07:20:58 +0800
62572e29b kernel/watchdog.c: touch_nmi_watchdog should only touch local cpu not every one ... Browse Code »

I ran into a scenario where while one cpu was stuck and should have
panic'd because of the NMI watchdog, it didn't. The reason was another
cpu was spewing stack dumps on to the console. Upon investigation, I
noticed that when writing to the console and also when dumping the
stack, the watchdog is touched.

This causes all the cpus to reset their NMI watchdog flags and the
'stuck' cpu just spins forever.

This change causes the semantics of touch_nmi_watchdog to be changed
slightly. Previously, I accidentally changed the semantics and we
noticed there was a codepath in which touch_nmi_watchdog could be
touched from a preemtible area. That caused a BUG() to happen when
CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code.

My attempt here re-introduces the change to have the
touch_nmi_watchdog() code only touch the local cpu instead of all of the
cpus. But instead of using __get_cpu_var(), I use the
__raw_get_cpu_var() version.

This avoids the preemption problem. However my reasoning wasn't because
I was trying to be lazy. Instead I rationalized it as, well if
preemption is enabled then interrupts should be enabled to and the NMI
watchdog will have no reason to trigger. So it won't matter if the
wrong cpu is touched because the percpu interrupt counters the NMI
watchdog uses should still be incrementing.

Don said:

: I'm ok with this patch, though it does alter the behaviour of how
: touch_nmi_watchdog works. For the most part I don't think most callers
: need to touch all of the watchdogs (on each cpu). Perhaps a corner case
: will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ).
:
: But this does address an issue where if a system is locked up and one cpu
: is spewing out useful debug messages (or error messages), the hard lockup
: will fail to go off. We have seen this on RHEL also.

Signed-off-by: Don Zickus
Signed-off-by: Ben Zhang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Zhang
2014-04-04 07:20:58 +0800
45d4f8550 fs/direct-io.c: remove some left over checks ... Browse Code »

We know that "ret > 0" is true here. These tests were left over from
commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.

Signed-off-by: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Carpenter
2014-04-04 07:20:57 +0800
2b665e276 fs/direct-io.c: remove redundant comparison ... Browse Code »

The return value of bio_get_nr_vecs() cannot be bigger than
BIO_MAX_PAGES, so we can remove redundant the comparison between
nr_pages and BIO_MAX_PAGES.

Signed-off-by: Gu Zheng
Cc: Al Viro
Reviewed-by: Jeff Moyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gu Zheng
2014-04-04 07:20:57 +0800
9c339255c ocfs2: pass "new" parameter to ocfs2_init_xattr_bucket ... Browse Code »

This patch fixes the following crash:

kernel BUG at fs/ocfs2/uptodate.c:530!
Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F)
ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
CPU 5
Pid: 21303, comm: xattr-test Tainted: GF W 3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
Call Trace:
ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
__ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
ocfs2_xattr_set+0x691/0x880 [ocfs2]
ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
generic_setxattr+0x96/0xa0
__vfs_setxattr_noperm+0x7b/0x170
vfs_setxattr+0xbc/0xc0
setxattr+0xde/0x230
sys_fsetxattr+0xc6/0xf0
system_call_fastpath+0x16/0x1b
Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
RIP ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]

It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():

void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
struct buffer_head *bh)
{
/* This should definitely *not* exist in our cache */
if (ocfs2_buffer_cached(ci, bh))
printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
BUG_ON(ocfs2_buffer_cached(ci, bh));

set_buffer_uptodate(bh);

ocfs2_metadata_cache_io_lock(ci);
ocfs2_set_buffer_uptodate(ci, bh);
ocfs2_metadata_cache_io_unlock(ci);
}

The problem here is:

We cached a block, but the buffer_head got reused. When we are to pick
up this block again, a new buffer_head created with UPTODATE flag
cleared. ocfs2_buffer_uptodate() returned false since no UPTODATE is
set on the buffer_head. so we set this block to cache as a NEW block,
then it failed at asserting block is not in cache.

The fix is to add a new parameter indicating the bucket is a new
allocated or not to ocfs2_init_xattr_bucket().
ocfs2_init_xattr_bucket() assert block not cached accordingly.

Signed-off-by: Wengang Wang
Cc: Joel Becker
Reviewed-by: Mark Fasheh
Cc: Joe Jin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wengang Wang
2014-04-04 07:20:57 +0800
43b10a203 ocfs2: avoid system inode ref confusion by adding mutex lock ... Browse Code »

The following case may lead to the same system inode ref in confusion.

A thread B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
because of *arr == NULL,
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
at the moment, B thread also gets
two refs, so lead to one more
inode ref.

So add mutex lock to avoid multi thread set two inode ref once at the
same time.

Signed-off-by: jiangyiwen
Reviewed-by: Joseph Qi
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

jiangyiwen
2014-04-04 07:20:57 +0800
7dc3e8390 ocfs2: iput inode alloc when failed locally ... Browse Code »

In ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit() func, after
calls ocfs2_get_system_file_inode() to get inode ref, if calls
ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() failed, we should
iput inode alloc to avoid leaking the inode.

Signed-off-by: jiangyiwen
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

jiangyiwen
2014-04-04 07:20:57 +0800
da8ded405 ocfs2/o2net: o2net_listen_data_ready should do nothing if socket state is not TCP_LISTEN ... Browse Code »

Orabug: 17330860

When accepting an incomming connection o2net_accept_one clones a child
data socket from the parent listening socket. It then proceeds to setup
the child with callback o2net_data_ready() and sk_user_data to NULL. If
data arrives in this window, o2net_listen_data_ready will be called with
some non-deterministic value in sk_user_data (not inherited). We panic
when we page fault on sk_user_data -- in parent it is
sock_def_readable().

The fix is to recognize that this is a data socket being set up by
looking at the socket state and do nothing.

Signed-off-by: Tariq Saseed
Signed-off-by: Srinivas Eeda
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tariq Saeed
2014-04-04 07:20:56 +0800
db66c7157 ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed ... Browse Code »

After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(),
if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that
some space may be lost.

So, roll back alloc_dinode counts when ocfs2_block_group_set_bits()
failed.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Younger Liu
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Younger Liu
2014-04-04 07:20:56 +0800
e228f6439 ocfs2: flock: drop cross-node lock when failed locally ... Browse Code »

ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node clock and
then call flock_lock_file_wait() to compete with local processes. In
case flock_lock_file_wait() failed, say -ENOMEM, clean up work is not
done. This patch adds the cleanup --drop the cross-node lock which was
just granted.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Wengang Wang
Cc: Joel Becker
Reviewed-by: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wengang Wang
2014-04-04 07:20:56 +0800