05 Jun, 2014
40 commits
-
We have only a few places where we actually want to charge kmem, so
instead of intruding into the general page allocation path with
__GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
charges will be easier to follow that way.

This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
from memcg caches' allocflags. Instead it makes the slab allocation path
call memcg_charge_kmem directly, getting the memcg to charge from the
cache's memcg params.

This also eliminates any possibility of misaccounting an allocation
going from one memcg's cache to another memcg, because now we always
charge slabs against the memcg the cache belongs to. That's why this
patch removes the big comment to memcg_kmem_get_cache.
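A minimal sketch of the idea (the helper name memcg_charge_slab() and the
exact fields are illustrative, assuming the mm/slab.h layout of that era
with is_root_cache() and memcg_params->memcg available):

    static __always_inline int memcg_charge_slab(struct kmem_cache *s,
                                                 gfp_t gfp, int order)
    {
            /* Root caches are not accounted; only per-memcg child caches are. */
            if (!memcg_kmem_enabled() || is_root_cache(s))
                    return 0;
            /* Charge the memcg that owns this cache, not the current task's. */
            return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                     PAGE_SIZE << order);
    }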
Signed-off-by: Vladimir Davydov
Acked-by: Greg Thelen
Cc: Johannes Weiner
Acked-by: Michal Hocko
Cc: Glauber Costa
Cc: Christoph Lameter
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path. Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc(), which is pretty bogus. Here's the sequence:

1. Enter __slab_alloc(), fall through all the way to the
stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
(goto #2)
4. Fall through in the same path we did before all the way to
stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return

Doing this is obviously bogus. It keeps us from being able to
accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
that the total number of allocs always exceeds the total number of
frees.

This patch moves stat(s, ALLOC_SLOWPATH) so that it is bumped at the
same place that __slab_alloc() is called from. This makes it much less
likely that ALLOC_SLOWPATH will get botched again in the spaghetti-code
inside __slab_alloc().
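A sketch of the resulting call site (a simplified excerpt in the style of
slab_alloc_node() in mm/slub.c, not the literal hunk):

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
            /* The slow path is taken exactly once per __slab_alloc() call,
             * so the counter is bumped here rather than inside it. */
            object = __slab_alloc(s, gfpflags, node, addr, c);
            stat(s, ALLOC_SLOWPATH);
    }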
Signed-off-by: Dave Hansen
Acked-by: Christoph Lameter
Acked-by: David Rientjes
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as the current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.

Suppress this out-of-memory warning if the allocator is configured
without debug support. The page allocation failure warning will
indicate that it is a failed slab allocation, the order, and the gfp
mask, so this is only useful to diagnose allocator issues.

Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
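A hedged sketch of the shape of the change (the guard and the message
wording are illustrative, not the exact hunk):

    static void slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
    {
    #ifdef CONFIG_SLUB_DEBUG
            /* Detailed slab diagnostics are only printed when debug
             * support is compiled in. */
            pr_warn("SLUB: Unable to allocate memory on node %d (gfp=0x%x)\n",
                    nid, gfpflags);
            /* ... dump cache name, object size, per-node counts ... */
    #endif
    }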
Signed-off-by: David Rientjes
Cc: Pekka Enberg
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Inspired by Joe Perches' suggestion in the ntfs logging clean-up.
Signed-off-by: Fabian Frederick
Acked-by: Christoph Lameter
Cc: Joe Perches
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
All printk(KERN_foo ...) calls converted to pr_foo()
Default printk converted to pr_warn()
Coalesce format fragments
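An illustrative before/after of the conversion (not a specific hunk from
the patch):

    Before: printk(KERN_ERR "SLUB %s: no data\n", s->name);
    After:  pr_err("SLUB %s: no data\n", s->name);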
Signed-off-by: Fabian Frederick
Acked-by: Christoph Lameter
Cc: Joe Perches
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On systems with 2TiB of RAM, current x86_64 has 128M as the section size,
and one memory_block includes only one section, so there will be 16400
entries under /sys/devices/system/memory/.

The current code tries to use the block id to find the block pointer in
/sys for any section, and reuses that block pointer. That lookup takes
some time even after commit 7c243c7168dc ("mm: speedup in
__early_pfn_to_nid"), which skips the search in that case during boot.

One solution would be to increase the block size, as the SGI UV system
does (hard coded to 2g).

This patch instead probes the block size to make it match the mmio remap
size. For example, Intel Nehalem and later systems will have the memory
ranges [0, TOML) and [4g, TOMH]. If the memory hole is 2g and total
memory is 128g, TOM will be 2g and TOM2 will be 130g.

We could then use 2g as the block size instead of the default 128M,
which reduces the number of entries in /sys/devices/system/memory/.

On a 6TiB system this reduces boot time by 35 seconds.
Signed-off-by: Yinghai Lu
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86. Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults. This decision was x86-specific and conceptually
difficult, requiring special casing to distinguish between PROTNONE and
NUMA ptes based on context.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults, as if the PTE is not present then a fault will
be trapped.

Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.

Signed-off-by: Mel Gorman
Cc: David Vrabel
Cc: Ingo Molnar
Cc: Peter Anvin
Cc: Fengguang Wu
Cc: Linus Torvalds
Cc: Steven Noonan
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Dave Hansen
Cc: Srikar Dronamraju
Cc: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
32-bit support for NUMA is an oddity on its own but with automatic NUMA
balancing on top there is a reasonable risk that the CPUPID information
cannot be stored in the page flags. This patch removes support for
automatic NUMA balancing on 32-bit x86.

Signed-off-by: Mel Gorman
Cc: David Vrabel
Cc: Ingo Molnar
Cc: Peter Anvin
Cc: Fengguang Wu
Cc: Linus Torvalds
Cc: Steven Noonan
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Dave Hansen
Cc: Srikar Dronamraju
Cc: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.

This patch makes generic_file_fsync() issue proper cache flush to fix the
problem. Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."

[akpm@linux-foundation.org: nuke ifdef]
[akpm@linux-foundation.org: fix warning]
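A hedged sketch of the resulting helper (assuming the data/metadata
writeback is split out into a __generic_file_fsync() helper, as the
description implies; blkdev_issue_flush() is the existing block-layer
primitive):

    int generic_file_fsync(struct file *file, loff_t start, loff_t end,
                           int datasync)
    {
            struct inode *inode = file->f_mapping->host;
            int err;

            /* Write back data and the inode as before. */
            err = __generic_file_fsync(file, start, end, datasync);
            if (err)
                    return err;
            /* Then flush the device's volatile write cache so the data
             * actually survives a power failure. */
            return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
    }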
Signed-off-by: Fabian Frederick
Suggested-by: Jan Kara
Suggested-by: Christoph Hellwig
Cc: Jan Kara
Cc: Christoph Hellwig
Cc: Alexander Viro
Cc: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix function parameter comments.
Signed-off-by: Fabian Frederick
Cc: Eric Van Hensbergen
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
v9fs_sysfs_init is only called by __init init_v9fs
Signed-off-by: Fabian Frederick
Cc: Eric Van Hensbergen
Cc: Ron Minnich
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dlm_recovery_ctxt.received is unused.

ocfs2_should_refresh_lock_res() can only return 0 or 1, so the error
handling code in ocfs2_super_lock() is unneeded.

Signed-off-by: joyce.xue
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Ocfs2 cluster size may be 1MB, which has 20 bits. When resizing, the
input new clusters is mostly the number of clusters in a group
descriptor (32256).

Since the input clusters is defined as type int, it will overflow when
shifted left by 20 bits and then lead to an incorrect global bitmap
i_size.
Signed-off-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Parameters new_clusters and first_new_cluster are not used in
ocfs2_update_last_group_and_inode, so remove them.

Signed-off-by: Joseph Qi
Reviewed-by: joyce.xue
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We found a race situation when dlm recovery and node joining occur
simultaneously if the network state is bad.

N1: start joining dlm and send query join to all live nodes
N4: set joining node to N1, return OK
N1: send query join to other live nodes, which may take a while
N1: call dlm_send_join_assert() to send assert join messages; N2 is
    down, so keep trying to send the message to N2 until N2 is found
    to be down
N1: send assert join message to N3, but the connection with N3 is
    down, so it may take a while
N4: become the recovery master for N2 and send begin reco message to
    the other nodes in the domain map, but not to N1
N1: the connection with N3 is rebuilt, then send assert join to N4
N4: call dlm_assert_joined_handler(), add N1 to the domain map
N4: dlm recovery done, send finalize message to the nodes in the
    domain map, including N1
N1: on receiving the finalize message, trigger the BUG() because of
    the recovery master mismatch.

Signed-off-by: joyce.xue
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Revert commit 75f82eaa502c ("ocfs2: fix NULL pointer dereference when
dismount and ocfs2rec simultaneously") because it may cause a umount
hang while shutting down the truncate log. The situation is as follows:

ocfs2_dismount_volume
-> ocfs2_recovery_exit
-> free osb->recovery_map
-> ocfs2_truncate_shutdown
-> lock global bitmap inode
-> ocfs2_wait_for_recovery
-> check whether osb->recovery_map->rm_used is zero

Because osb->recovery_map is already freed, rm_used can be any other
value, so it may yield a umount hang.

To prevent the NULL pointer dereference while getting sys_root_inode, we
use an osb_tl_disable flag to disable scheduling osb_truncate_log_wq
after truncate log shutdown.

Signed-off-by: joyce.xue
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The ocfs2_info_foo() and ocfs2_get_request_ptr() functions are only used in ioctl.c.
Signed-off-by: Fabian Frederick
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We found a conversion deadlock when the owner of a lockres happened to
crash before sending DLM_PROXY_AST_MSG for a downconverting lock. The
situation is as follows, with Node2 being the owner of lockresA:

Node1: lock_1 granted at EX mode; call ocfs2_cluster_unlock to decrease
       ex_holders
Node3: converting lock_3 from NL to EX
Node2: send DLM_PROXY_AST_MSG to Node1, asking Node1 to downconvert
Node1: receiving DLM_PROXY_AST_MSG, thread ocfs2dc sends
       DLM_CONVERT_LOCK_MSG to Node2 to downconvert lock_1 (EX->NL)
Node2: lock_1 can be granted; put it into the pending_asts list and
       return DLM_NORMAL; then something happened and Node2 crashed
Node1: received DLM_NORMAL, waiting for DLM_PROXY_AST_MSG
Node3: selected as the recovery master; receiving the migrated lock
       from Node1, queue lock_1 to the tail of the converting list

After dlm recovery, the converting list on the master of lockresA
(Node3) will be: converting list head -> lock_3 (NL->EX) -> lock_1
(EX->NL). The requested mode of lock_3 is not compatible with the
granted mode of lock_1, so it can not be granted, and lock_1 can not
downconvert because the converting queue is strictly FIFO. So a
deadlock is created.

We think function dlm_process_recovery_data() should queue_ast for
lock_1 or alter the order of lock_1 and lock_3, so dlm_thread can
process lock_1 first. And if there are multiple downconverting locks,
they must convert from PR to NL, so there is no need to sort them.

Signed-off-by: joyce.xue
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
ocfs2_commit_thread. Then it will get into a loop with mass logs. This
will meaninglessly consume a large amount of resources and may lead to
the system hanging. So limit printk in this case.

[akpm@linux-foundation.org: document the msleep]
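A hedged sketch of the idea in the commit-thread loop (the ratelimiting
and back-off are shown with generic helpers; the exact form in the patch
may differ):

    status = ocfs2_commit_cache(osb);
    if (status < 0) {
            /* The journal is aborted: every further commit attempt will
             * fail, so throttle the error message and back off instead
             * of spinning and flooding the log. */
            if (printk_ratelimit())
                    mlog(ML_ERROR, "commit cache failed: %d\n", status);
            msleep(1000);
    }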
Signed-off-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are two standard techniques for dereferencing structures pointed
to by void *: cast to the right type each time they're used, or assign
to local variables of the right type.

But there's no need to do *both*.
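An illustrative fragment (hypothetical names, not code from the patch):

    struct foo { int count; };
    void *data = get_foo();         /* hypothetical source of a void * */

    /* Technique 1: cast at each use. */
    ((struct foo *)data)->count++;

    /* Technique 2: assign once to a typed local. */
    struct foo *f = data;
    f->count++;

    /* Redundant: a typed local *and* a cast at every use. */
    ((struct foo *)f)->count++;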
Signed-off-by: George Spelvin
Cc: Mark Fasheh
Acked-by: Joel Becker
Reviewed-by: Jie Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Replace the hard-coded strncpy size (63) with a defined value.
Signed-off-by: Fabian Frederick
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Direct conversion of one KERN_DEBUG message without a DEBUG definition
(suggested by Josh Triplett).

That message will now be disabled by default (see
Documentation/CodingStyle Chapter 13).

Signed-off-by: Fabian Frederick
Reviewed-by: Josh Triplett
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add ODEBUG: prefix to pr_fmt
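The usual idiom for such a prefix, placed at the top of the source file
before the headers that pull in printk (illustrative, not the exact
hunk):

    #define pr_fmt(fmt) "ODEBUG: " fmt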
Signed-off-by: Fabian Frederick
Reviewed-by: Josh Triplett
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert all printk to pr_foo() except KERN_DEBUG (see
Documentation/CodingStyle Chapter 13).

Signed-off-by: Fabian Frederick
Reviewed-by: Josh Triplett
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add pr_fmt based on module name.
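The standard module-name-based idiom (illustrative):

    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt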
Signed-off-by: Fabian Frederick
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Fabian Frederick
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix function parameter documentation.
Move EXPORT_SYMBOL()s after the corresponding functions.
Small coding style and checkpatch warning fixes.
Signed-off-by: Fabian Frederick
Acked-by: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__uc32_ioremap_pfn_caller() should return NULL when the pfn is found to
be invalid.

From a recommendation by Guan Xuetao.
Cc: Guan Xuetao
Cc: Fabian Frederick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Coalesce formats.
[akpm@linux-foundation.org: undo crazy long line]
Signed-off-by: Fabian Frederick
Cc: Guan Xuetao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Update the last pr_warning callsite in the fs branch.
Signed-off-by: Fabian Frederick
Cc: Phillip Lougher
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__get_cpu_var() is used for multiple purposes in the kernel source. One
of them is address calculation via the form &__get_cpu_var(x). This
calculates the address for the instance of the percpu variable of the
current processor based on an offset.

Other use cases are for storing and retrieving data from the current
processor's percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as:

    #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and fewer
registers are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these
operations are used throughout then specialized macros can be defined in
non-x86 arches as well in order to optimize per cpu access by f.e. using
a global register that may be set to the per cpu base.

Transformations done to __get_cpu_var():

1. Determine the address of the percpu instance of the current processor.

    DEFINE_PER_CPU(int, y);
    int *x = &__get_cpu_var(y);

   Converts to

    int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

    DEFINE_PER_CPU(int, y[20]);
    int *x = __get_cpu_var(y);

   Converts to

    int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processor's instance of a per cpu
variable.

    DEFINE_PER_CPU(int, y);
    int x = __get_cpu_var(y);

   Converts to

    int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct.

    DEFINE_PER_CPU(struct mystruct, y);
    struct mystruct x = __get_cpu_var(y);

   Converts to

    memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable.

    DEFINE_PER_CPU(int, y);
    __get_cpu_var(y) = x;

   Converts to

    __this_cpu_write(y, x);

6. Increment/decrement etc. of a per cpu variable.

    DEFINE_PER_CPU(int, y);
    __get_cpu_var(y)++;

   Converts to

    __this_cpu_inc(y);
Signed-off-by: Christoph Lameter
Tested-by: Geert Uytterhoeven [compilation only]
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick
Acked-by: Anton Altaparmakov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Without this patch fanotify_init does not validate the value passed in
event_f_flags.

When a fanotify event is read from the fanotify file descriptor, a new
file descriptor is created where file.f_flags = event_f_flags.

Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.

Jan Kara and Eric Paris both agreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539

This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10

With the patch the value of event_f_flags is checked. When specifying an
invalid value, error EINVAL is returned.

Internal flags are disallowed.

File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.

Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.

This leaves us with the following allowed values:

O_RDONLY, O_WRONLY, O_RDWR are basic functionality. They are stored in
the bits given by O_ACCMODE.

O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.

O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.

O_NONBLOCK may be useful when monitoring slow devices like tapes.

O_NDELAY is equal to O_NONBLOCK except on platform parisc. To avoid code
breaking on parisc either both flags should be allowed or none. The
patch allows both.

__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.

O_NOATIME may be useful to reduce disk activity.

O_CLOEXEC may be useful, if separate processes shall be used to scan files.

Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
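A hedged sketch of the check at the top of the fanotify_init() syscall
(the mask name FANOTIFY_ALLOWED_EVENT_F_FLAGS is illustrative, not
necessarily the identifier used by the patch):

    #define FANOTIFY_ALLOWED_EVENT_F_FLAGS \
            (O_ACCMODE | O_APPEND | O_NONBLOCK | O_NDELAY | __O_SYNC | \
             O_DSYNC | O_CLOEXEC | O_LARGEFILE | O_NOATIME)

    if (event_f_flags & ~FANOTIFY_ALLOWED_EVENT_F_FLAGS)
            return -EINVAL;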
Signed-off-by: Heinrich Schuchardt
Reviewed-by: Jan Kara
Cc: Michael Kerrisk
Cc: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If fanotify_mark is called with an illegal value of the arguments flags
and mask, it usually returns EINVAL.

When fanotify_mark is called with FAN_MARK_FLUSH, the argument flags is
not checked for irrelevant flags like FAN_MARK_IGNORED_MASK.

The patch removes this inconsistency. If an irrelevant flag is set,
error EINVAL is returned.
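A hedged sketch of the shape of the check (illustrative; the actual
handling of the mark type in the patch may be structured differently):

    if (flags & FAN_MARK_FLUSH) {
            /* A flush request must not carry unrelated mark flags. */
            if (flags & ~(FAN_MARK_FLUSH | FAN_MARK_MOUNT))
                    return -EINVAL;
    }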
Signed-off-by: Heinrich Schuchardt
Acked-by: Michael Kerrisk
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Do not initialize private_destroy_list twice. list_replace_init()
already takes care of initializing private_destroy_list. We don't need
to initialize it with LIST_HEAD() beforehand.
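A sketch of why the prior LIST_HEAD() initialization is redundant (names
follow fs/notify/mark.c of that era; treat them as illustrative):

    struct list_head private_destroy_list;  /* no LIST_HEAD() needed */

    spin_lock(&destroy_lock);
    /* list_replace_init() copies destroy_list's links into
     * private_destroy_list, overwriting it completely, and then
     * re-initializes destroy_list itself. */
    list_replace_init(&destroy_list, &private_destroy_list);
    spin_unlock(&destroy_lock);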
Signed-off-by: David Cohen
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Before the patch, read creates FAN_ACCESS_PERM and FAN_ACCESS events,
readdir creates only FAN_ACCESS_PERM events.

This is inconsistent.
After the patch, readdir creates FAN_ACCESS_PERM and FAN_ACCESS events.
Signed-off-by: Heinrich Schuchardt
Reviewed-by: Jan Kara
Cc: Eric Paris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Originally from Tvrtko Ursulin (https://lkml.org/lkml/2011/1/12/112)
Avoid having to provide a fake/invalid fd and path when flushing marks
Currently, for a group to flush the marks it has set, it needs to provide
a fake or invalid (but resolvable) file descriptor and path when calling
fanotify_mark. This patch pulls the flush handling a bit up so the file
descriptor and path are completely ignored when flushing.

I reworked the patch to be applicable again (the signature of
fanotify_mark has changed since Tvrtko's work).

Signed-off-by: Heinrich Schuchardt
Cc: Tvrtko Ursulin
Reviewed-by: Jan Kara
Acked-by: Eric Paris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Replace seq_printf where possible + coalesce formats from 2 existing
seq_puts.
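An illustrative before/after for the case where the format string has no
conversion specifiers (not a specific hunk from the patch):

    Before: seq_printf(m, "some fixed header\n");
    After:  seq_puts(m, "some fixed header\n");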
Signed-off-by: Fabian Frederick
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
All printk converted to pr_foo(), except the printk(KERN_DEBUG ...) in internal.h.
Coalesce formats.
Add pr_fmt
Signed-off-by: Fabian Frederick
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds