Eric Lee / smarc-fsl-linux-kernel

17 Jan, 2011

2 commits

f03c65993 sanitize vfsmount refcounting changes

Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
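
The fast and slow paths described above can be modelled in user space. This is only a toy, single-threaded sketch: `toy_mount` and `toy_mntput()` are hypothetical names, a plain `int` stands in for the summed per-cpu mnt_count, and the vfsmount lock appears only in comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a vfsmount; not the kernel structure. */
struct toy_mount {
    int count;     /* models the summed per-cpu mnt_count */
    int longterm;  /* models mnt_longterm */
    bool freed;    /* set once the final reference is dropped */
};

/* Drops one reference. Returns true if the slow path was needed. */
static bool toy_mntput(struct toy_mount *m)
{
    assert(m->count > 0);

    /* kernel: vfsmount lock would be taken shared here */
    if (m->longterm > 0) {
        /* A longterm reference still exists, so this cannot be the
         * final put: decrement blindly and return. */
        m->count--;
        return false;
    }

    /* kernel: vfsmount lock would be taken exclusive here */
    if (--m->count == 0)
        m->freed = true;
    return true;
}
```

In this model a longterm holder first clears its longterm contribution (the job of mnt_make_shortterm() in the text) and only then drops its reference, which is why the final put always lands on the slow path.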

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro

Al Viro
2011-01-17 02:47:07 +0800
7b8a53fd8 fix old umount_tree() breakage

Expiry-related code calls umount_tree() several times with
the same list to collect vfsmounts to. Which is fine, except
that umount_tree() implicitly assumed that the list would
be empty on each call - it moves the victims over there and
then iterates through the list kicking them out. It's *almost*
idempotent, so everything nearly worked. However, mnt->ghosts
handling (and thus expirability checks) had been broken - that
part was not idempotent...

The fix is trivial - use a local temporary list and splice it to
the collector list when we are through.
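
A toy illustration of the local-list-then-splice pattern follows. It is a sketch, not the kernel code: the `toy_*` names and the singly linked list are hypothetical, and the `kicked` counter stands in for the non-idempotent part of the processing.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mnt {
    int kicked;            /* how many times this entry was processed */
    struct toy_mnt *next;
};

struct toy_list {
    struct toy_mnt *head;
};

static void toy_push(struct toy_list *l, struct toy_mnt *m)
{
    m->next = l->head;
    l->head = m;
}

/* Moves every entry of src to the front of dst, leaving src empty. */
static void toy_splice(struct toy_list *src, struct toy_list *dst)
{
    struct toy_mnt *tail = src->head;

    if (!tail)
        return;
    while (tail->next)
        tail = tail->next;
    tail->next = dst->head;
    dst->head = src->head;
    src->head = NULL;
}

/* Collects the victims on a local list, processes only that list, and
 * splices it onto the caller's collector at the end. Entries already on
 * the collector from earlier calls are never walked again. */
static void toy_umount_tree(struct toy_mnt **victims, size_t n,
                            struct toy_list *collector)
{
    struct toy_list tmp = { NULL };
    struct toy_mnt *m;
    size_t i;

    assert(collector != NULL);
    for (i = 0; i < n; i++)
        toy_push(&tmp, victims[i]);
    for (m = tmp.head; m; m = m->next)
        m->kicked++;       /* the non-idempotent step */
    toy_splice(&tmp, collector);
}
```

Calling `toy_umount_tree()` repeatedly with the same collector processes each victim exactly once, which is the property the fix restores.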

Signed-off-by: Al Viro

Al Viro
2011-01-17 02:47:01 +0800

16 Jan, 2011

20 commits

b650c858c autofs4: Merge the remaining dentry ops tables

Merge the remaining autofs4 dentry ops tables. It doesn't matter if
d_automount and d_manage are present on something that's not mountable or
holdable as these ops are only used if the appropriate flags are set in
dentry->d_flags.

[AV] switch to ->s_d_op, since now _everything_ on autofs4 is using the
same dentry_operations.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:49 +0800
ea5b778a8 Unexport do_add_mount() and add in follow_automount(), not ->d_automount()

Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:48 +0800
ab90911ff Allow d_manage() to be used in RCU-walk mode

Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).

autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:47 +0800
87556ef19 Remove a further kludge from __do_follow_link()

Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.

This reverts the non-helper-function parts of
051d381259eb57d6074d02a6ba6e90e744f1a29f, which breaks union mounts.

Reported-by: vaurora@redhat.com
Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:46 +0800
dd89f90d2 autofs4: Add v4 pseudo direct mount support

Version 4 of autofs provides a pseudo direct mount implementation
that relies on directories at the leaves of a directory tree under
an indirect mount to trigger mounts.

This patch adds support for that functionality.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:44 +0800
9e3fea16b autofs4: Fix wait validation

It is possible for the check in wait.c:validate_request() to return
an incorrect result if the dentry that was mounted upon has changed
during the callback.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:43 +0800
665114937 autofs4: Clean up autofs4_free_ino()

When this function is called the local reference count doesn't need to
be updated since the dentry is going away and dput definitely must
not be called here.

Also the autofs info struct field inode isn't used so remove it.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:42 +0800
71e469db2 autofs4: Clean up dentry operations

There are now two distinct dentry operations uses. One for dentrys
that trigger mounts and one for dentrys that do not.

Rationalize the use of these dentry operations and rename them to
reflect their function.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:41 +0800
e61da20a5 autofs4: Clean up inode operations

Since the use of ->follow_link() has been eliminated there is no
need to separate the indirect and direct inode operations.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:40 +0800
8c13a676d autofs4: Remove unused code

Remove code that is not used due to the use of ->d_automount()
and ->d_manage().

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:39 +0800
b5b801779 autofs4: Add d_manage() dentry operation

This patch required a previous patch to add the ->d_automount()
dentry operation.

Add a function to use the newly defined ->d_manage() dentry operation
for blocking during mount and expire.

Whether the VFS calls the dentry operations d_automount() and d_manage()
is controlled by the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags. autofs
uses the d_automount() operation to callback to user space to request
mount operations and the d_manage() operation to block walks into mounts
that are under construction or destruction.

In order to prevent these functions from being called unnecessarily the
DMANAGED_* flags are cleared for cases which would cause this. In the
common case the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags are both
set for dentrys waiting to be mounted. The DMANAGED_TRANSIT flag is
cleared upon successful mount request completion and set during expire
runs, both during the dentry expire check, and if selected for expire,
is left set until a subsequent successful mount request completes.

The exception to this is the so-called rootless multi-mount which has
no actual mount at its base. In this case the DMANAGED_AUTOMOUNT flag
is cleared upon successful mount request completion as well and set
again after a successful expire.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:38 +0800
10584211e autofs4: Add d_automount() dentry operation

Add a function to use the newly defined ->d_automount() dentry operation
for triggering mounts instead of doing the user space callback in ->lookup()
and ->d_revalidate().

Note, to be useful the subsequent patch to add the ->d_manage() dentry
operation is also needed so the discussion of functionality is deferred to
that patch.

Signed-off-by: Ian Kent
Signed-off-by: David Howells
Signed-off-by: Al Viro

Ian Kent
2011-01-16 09:07:37 +0800
db3729153 Remove the automount through follow_link() kludge code from pathwalk

Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().

Signed-off-by: David Howells
Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:36 +0800
01c64feac CIFS: Use d_automount() rather than abusing follow_link()

Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

[NOTE: THIS IS UNTESTED!]

Signed-off-by: David Howells
Cc: Steve French
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:35 +0800
36d43a437 NFS: Use d_automount() rather than abusing follow_link()

Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells
Acked-by: Trond Myklebust
Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:34 +0800
d18610b0c AFS: Use d_automount() rather than abusing follow_link()

Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:33 +0800
6f45b6567 Add an AT_NO_AUTOMOUNT flag to suppress terminal automount

Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories. This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
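
From user space the flag is passed in the flags argument of fstatat(). A minimal sketch, assuming Linux with glibc; the wrapper name `stat_no_automount()` is hypothetical, and the fallback define is only for older headers that lack the constant.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800  /* value used by the Linux UAPI headers */
#endif

/* Stats `path` without triggering an automount on the final component.
 * Returns 0 and fills *st on success, or -1 with errno set. */
static int stat_no_automount(const char *path, struct stat *st)
{
    assert(path != NULL && st != NULL);
    return fstatat(AT_FDCWD, path, st, AT_NO_AUTOMOUNT);
}
```

A directory lister could call this on each entry to collect attributes without mounting everything it touches.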

Signed-off-by: David Howells
Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:33 +0800
cc53ce53c Add a dentry op to allow processes to be held during pathwalk transit

Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).

The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.

The ->d_manage() dentry operation:

int (*d_manage)(struct path *path, bool mounting_here);

takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
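
One way to picture how a caller might act on these three classes of return value is the sketch below. It is not the code in fs/namei.c; the enum and `toy_after_d_manage()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_transit {
    TOY_TRANSIT_ON,  /* 0: the process continues on its way */
    TOY_STAY_HERE,   /* -EISDIR: use this directory, no skipping/automount */
    TOY_ABORT        /* any other error: hand it back to the user */
};

/* Classifies a d_manage()-style return value; *err receives the error
 * to report for TOY_ABORT and 0 otherwise. */
static enum toy_transit toy_after_d_manage(int ret, int *err)
{
    assert(err != NULL);

    *err = 0;
    if (ret == 0)
        return TOY_TRANSIT_ON;
    if (ret == -EISDIR)
        return TOY_STAY_HERE;
    *err = ret;
    return TOY_ABORT;
}
```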

->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.

Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.

follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).

A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.

__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.

Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.

==========================
WHAT THIS MEANS FOR AUTOFS
==========================

Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.

autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.

The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:

mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[] :autofs4:autofs4_wait+0x674/0x897
[] avc_has_perm+0x46/0x58
[] autoremove_wake_function+0x0/0x2e
[] :autofs4:autofs4_expire_wait+0x41/0x6b
[] :autofs4:autofs4_revalidate+0x91/0x149
[] __lookup_hash+0xa0/0x12f
[] lookup_create+0x46/0x80
[] sys_mkdirat+0x56/0xe4

versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:

automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[] __mutex_lock_slowpath+0x60/0x9b
[] do_path_lookup+0x2ca/0x2f1
[] .text.lock.mutex+0xf/0x14
[] do_rmdir+0x77/0xde
[] tracesys+0x71/0xe0
[] tracesys+0xd5/0xe0

which means that the system is deadlocked.

This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().

Signed-off-by: David Howells
Was-Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:31 +0800
9875cf806 Add a dentry op to handle automounting rather than abusing follow_link()

Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).

This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

The ->d_automount() dentry operation:

struct vfsmount *(*d_automount)(struct path *mountpoint);

takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.

The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.

Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.

__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).

__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.

follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".

I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
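
The mount/don't-mount rule in the paragraph above can be written as a small predicate. This is a sketch of the rule as stated, not the kernel's code: the `TOY_*` flag names and values are hypothetical and do not correspond to the kernel's LOOKUP_* constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical intent flags describing what the caller is doing. */
enum {
    TOY_FOLLOW    = 1 << 0,  /* stat()-like call without O_NOFOLLOW */
    TOY_CONTINUE  = 1 << 1,  /* walk traverses the directory */
    TOY_DIRECTORY = 1 << 2,  /* trailing '/', chdir, chroot and the like */
    TOY_OPEN      = 1 << 3,  /* open() */
    TOY_CREATE    = 1 << 4,  /* creat() */
    TOY_ALL       = (1 << 5) - 1
};

/* Decides whether reaching an automount point should trigger the mount. */
static bool toy_should_automount(unsigned int flags)
{
    assert((flags & ~(unsigned int)TOY_ALL) == 0);

    /* These uses always need whatever is mounted there. */
    if (flags & (TOY_CONTINUE | TOY_DIRECTORY | TOY_OPEN | TOY_CREATE))
        return true;

    /* A bare stat() only mounts when it is following the terminal name. */
    return (flags & TOY_FOLLOW) != 0;
}
```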

I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.

[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]

Signed-off-by: David Howells
Was-Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:05:03 +0800
1a8edf40e do_lookup() fix

do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly. If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.

Signed-off-by: Al Viro

Al Viro
2011-01-16 09:03:39 +0800

15 Jan, 2011

3 commits

0ad53eeef afs: add afs_wq and use it instead of the system workqueue

flush_scheduled_work() is going away. afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be an arbitrary number of pending works. Add afs_wq and use it as the
flush domain instead of the system workqueue.

Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().

Signed-off-by: Tejun Heo
Signed-off-by: David Howells
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds

Tejun Heo
2011-01-15 01:25:11 +0800
ba28b93a5 FS-Cache: Fix operation handling

fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending. Fix the check for pending ops as n_ops
must be greater than 0 at the point it is checked as it is incremented
immediately before under lock.
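
The off-by-one is easy to see in a toy model. The sketch below assumes the corrected test compares the counter against 1 rather than 0, which is a reading of the description above and not a quote of the patch; `toy_object` and `toy_submit_exclusive_op()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_object {
    int n_ops;      /* operations accounted to the object */
    int n_pending;  /* operations parked on the pending list */
};

/* Returns true if the new operation had to be queued as pending. */
static bool toy_submit_exclusive_op(struct toy_object *obj)
{
    assert(obj != NULL);

    /* kernel: the object lock would be held across this section */
    obj->n_ops++;

    /* n_ops already counts the op being submitted, so "> 0" would always
     * be true here; other ops are in flight only when it exceeds 1. */
    if (obj->n_ops > 1) {
        obj->n_pending++;
        return true;
    }
    return false;
}
```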

Signed-off-by: Akshat Aranya
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

Akshat Aranya
2011-01-15 01:23:36 +0800
acda4721a Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin

* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open

Linus Torvalds
2011-01-15 01:08:29 +0800

14 Jan, 2011

15 commits

7b9337aaf fs: namei fix ->put_link on wrong inode in do_filp_open

J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-14 16:42:43 +0800
db9effe99 Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin

* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes

Fixed up trivial conflicts in Documentation/filesystems/porting

Linus Torvalds
2011-01-14 12:14:13 +0800
f20877d94 fs: fix do_last error case when need_reval_dot

When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized to EISDIR, but it
is changed by d_revalidate(), which returns any positive value to represent
'the target dir is valid.'

We should keep and return the initialized 'error' in this case.

Signed-off-by: Nick Piggin

J. R. Okajima
2011-01-14 11:56:04 +0800
657e94b67 nfs: add missing rcu-walk check

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-14 10:48:39 +0800
90dbb77ba fs: fix dropping of rcu-walk from force_reval_path

As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-14 10:36:19 +0800
bb20c18db fs: force_reval_path drop rcu-walk before d_invalidate

d_revalidate can return in rcu-walk mode even when it returns 0. We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even though d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.

(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-14 10:35:53 +0800
cb9ef8d5e fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc

The sync_inodes_sb() function does not have a return value. Remove the
outdated documentation comment.

Signed-off-by: Stefan Hajnoczi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stefan Hajnoczi
2011-01-14 09:32:48 +0800
5f24ce5fd thp: remove PG_buddy

PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.
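
The idea of encoding the buddy state in the map count instead of a page flag can be sketched as follows. The `toy_*` names are hypothetical; the only facts taken from the text are that the marker value is -2, and the sketch additionally assumes that -1 is the resting value of an unmapped page's count.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAPCOUNT_IDLE  (-1)  /* assumed resting value, no mappings */
#define TOY_MAPCOUNT_BUDDY (-2)  /* marker value named in the text */

/* Toy stand-in for struct page: only the map count matters here. */
struct toy_page {
    int mapcount;
};

static bool toy_page_buddy(const struct toy_page *p)
{
    return p->mapcount == TOY_MAPCOUNT_BUDDY;
}

/* Marks a free page as owned by the buddy allocator. */
static void toy_set_page_buddy(struct toy_page *p)
{
    assert(p->mapcount == TOY_MAPCOUNT_IDLE);
    p->mapcount = TOY_MAPCOUNT_BUDDY;
}

/* Clears the marker when the page leaves the buddy allocator. */
static void toy_clear_page_buddy(struct toy_page *p)
{
    assert(toy_page_buddy(p));
    p->mapcount = TOY_MAPCOUNT_IDLE;
}
```

Because a page sitting in the allocator has no mappings, its count is otherwise unused, which is what frees up a bit in the flags word.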

Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:43 +0800
79134171d thp: transparent hugepage vmstat

Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:43 +0800
dabb16f63 oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.
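
The rule can be modelled by tracking a per-task floor that only a privileged writer moves. This is a user-space sketch under that assumption; `toy_task`, its `adj_min` field and `toy_set_oom_score_adj()` are hypothetical names, and the choice of -EACCES as the error is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    int adj;      /* current oom_score_adj */
    int adj_min;  /* lowest value an unprivileged writer may set */
};

/* Applies a new oom_score_adj. `privileged` models CAP_SYS_RESOURCE. */
static int toy_set_oom_score_adj(struct toy_task *t, int new_adj,
                                 bool privileged)
{
    assert(t != NULL);

    /* Unprivileged writers may go down, but never below the floor. */
    if (new_adj < t->adj_min && !privileged)
        return -EACCES;

    t->adj = new_adj;
    /* A privileged write establishes the new floor. */
    if (privileged)
        t->adj_min = new_adj;
    return 0;
}
```

A child created by fork would copy both fields, so its floor starts at the inherited value.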

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't want all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.

This patch updated from original patch based on feedback from David
Rientjes.

Signed-off-by: Mandeep Singh Baines
Acked-by: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2011-01-14 09:32:35 +0800
2d90508f6 mm: smaps: export mlock information

Currently there is no way to find out whether a process has locked its pages
in memory, or which of its memory regions are locked.

Add a new field "Locked" to export this information via the smaps file.
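
A consumer of the new field might pick it out of an smaps line as sketched below. The helper name `parse_locked_kb()` is hypothetical, and the sketch assumes the line follows the same "Name: value kB" layout as the other smaps fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parses one smaps line. Returns 1 and stores the size in kB if the
 * line is a "Locked:" entry, 0 otherwise. */
static int parse_locked_kb(const char *line, long *kb)
{
    assert(line != NULL && kb != NULL);
    return sscanf(line, "Locked: %ld kB", kb) == 1;
}
```

Feeding every line of /proc/PID/smaps through such a helper and summing the non-zero results gives the locked total, while the preceding mapping header line tells which region each value belongs to.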

Signed-off-by: Nikanth Karthikesan
Acked-by: Balbir Singh
Acked-by: Wu Fengguang
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikanth Karthikesan
2011-01-14 09:32:33 +0800
c32b0d4b3 fs/mpage.c: consolidate code

Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
eliminate code duplication.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hai Shan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hai Shan
2011-01-14 09:32:32 +0800
c691b9d98 sync_inode_metadata: fix comment

Use correct function name, remove incorrect apostrophe

Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2011-01-14 09:32:32 +0800
b9543dac5 writeback: avoid livelocking WB_SYNC_ALL writeback

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns, if
the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

When nr_to_write is
Signed-off-by: Wu Fengguang
Cc: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Jan Engelhardt
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2011-01-14 09:32:32 +0800
aa373cf55 writeback: stop background/kupdate works from livelocking other works

Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file). This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.

But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed. In particular, since
e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.

Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does. Moreover by working on the specific work, we also reduce the amount of
dirty pages which is exactly the target of background writeout. So it
makes sense to give specific work a priority over a generic page cleaning.

Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.
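
The policy amounts to one extra check in the background loop. The following is a toy model, not wb_writeback(): `toy_wb`, its fields and `toy_background_writeback()` are hypothetical, and page counts stand in for real I/O.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wb {
    long dirty;       /* dirty pages outstanding */
    long threshold;   /* background writeback runs while dirty > threshold */
    int queued_work;  /* other work items waiting for the flusher */
};

/* Writes back in chunks while over the threshold, but yields as soon as
 * any other work is queued. Returns the number of pages written. */
static long toy_background_writeback(struct toy_wb *wb, long chunk)
{
    long written = 0;

    assert(wb != NULL && chunk > 0);
    while (wb->dirty > wb->threshold) {
        /* More specific work takes priority over generic cleaning. */
        if (wb->queued_work > 0)
            break;

        long n = wb->dirty < chunk ? wb->dirty : chunk;
        wb->dirty -= n;
        written += n;
    }
    return written;
}
```

The caller would process the queued work and then re-enter this loop, which matches the "return to the background writeback after completing all the queued work" behaviour described above.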

This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.

[fengguang.wu@intel.com: update comment]
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang
Cc: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Jan Engelhardt
Cc: Jens Axboe

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2011-01-14 09:32:32 +0800