Eric Lee / smarc-fsl-linux-kernel

07 Jan, 2012

2 commits

4ed5e82fe vfs: protect remounting superblock read-only ... Browse Code »

Currently remouting superblock read-only is racy in a major way.

With the per mount read-only infrastructure it is now possible to
prevent most races, which this patch attempts.

Before starting the remount read-only, iterate through all mounts
belonging to the superblock and if none of them have any pending
writes, set sb->s_readonly_remount. This indicates that remount is in
progress and no further write requests are allowed. If the remount
succeeds set MS_RDONLY and reset s_readonly_remount.

If the remounting is unsuccessful just reset s_readonly_remount.
This can result in transient EROFS errors, despite the fact the
remount failed. Unfortunately hodling off writes is difficult as
remount itself may touch the filesystem (e.g. through load_nls())
which would deadlock.

A later patch deals with delayed writes due to nlink going to zero.

Signed-off-by: Miklos Szeredi
Tested-by: Toshiyuki Okajima
Signed-off-by: Al Viro

Miklos Szeredi
2012-01-07 12:20:12 +0800
ece2ccb66 Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z Browse Code »

Al Viro
2012-01-07 12:15:54 +0800

04 Jan, 2012

4 commits

c71053659 vfs: spread struct mount - __lookup_mnt() result ... Browse Code »

switch __lookup_mnt() to returning struct mount *; callers adjusted.

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:56:58 +0800
a218d0fdc switch open and mkdir syscalls to umode_t ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:55:19 +0800
cf31e70d6 vfs: new helper - vfs_ustat() ... Browse Code »

... and bury user_get_super()/statfs_by_dentry() - they are
purely internal now.

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:53:07 +0800
f47ec3f28 trim fs/internal.h ... Browse Code »

some stuff in there can actually become static; some belongs to pnode.h
as it's a private interface between namespace.c and pnode.c...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:35 +0800

20 Jul, 2011

2 commits

12ad3ab66 superblock: move pin_sb_for_writeback() to fs/super.c ... Browse Code »

The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-20 13:44:38 +0800
a4464dbc0 Make ->d_sb assign-once and always non-NULL ... Browse Code »

New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for parent != NULl case; uses __d_alloc(),
inserts result into the list of parent's children.

Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.

Signed-off-by: Al Viro

Al Viro
2011-07-20 13:44:17 +0800

25 Mar, 2011

2 commits

a66979aba fs: move i_wb_list out from under inode_lock ... Browse Code »

Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-03-25 09:17:51 +0800
55fa6091d fs: move i_sb_list out from under inode_lock ... Browse Code »

Protect the per-sb inode list with a new global lock
inode_sb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-03-25 09:16:32 +0800

22 Mar, 2011

1 commit

0f60f240d FS: lookup_mnt() is only used in the core fs routines now ... Browse Code »

lookup_mnt() is only used in the core fs routines now, so it doesn't need to
be globally declared anymore. It isn't exported to modules at the moment, so
nothing that can be modularised seems to be using it.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-03-22 00:13:10 +0800

18 Mar, 2011

1 commit

9d412a43c vfs: split off vfsmount-related parts of vfs_kern_mount() ... Browse Code »

new function: mount_fs(). Does all work done by vfs_kern_mount()
except the allocation and filling of vfsmount; returns root dentry
or ERR_PTR().

vfs_kern_mount() switched to using it and taken to fs/namespace.c,
along with its wrappers.

alloc_vfsmnt()/free_vfsmnt() made static.

functions in namespace.c slightly reordered.

Signed-off-by: Al Viro

Al Viro
2011-03-18 10:10:41 +0800

15 Mar, 2011

1 commit

becfd1f37 vfs: Add open by file handle support ... Browse Code »

[AV: duplicate of open() guts removed; file_open_root() used instead]

Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Al Viro

Aneesh Kumar K.V
2011-03-15 14:21:44 +0800

14 Mar, 2011

2 commits

73d049a40 open-style analog of vfs_path_lookup() ... Browse Code »

new function: file_open_root(dentry, mnt, name, flags) opens the file
vfs_path_lookup would arrive to.

Note that name can be empty; in that case the usual requirement that
dentry should be a directory is lifted.

open-coded equivalents switched to it, may_open() got down exactly
one caller and became static.

Signed-off-by: Al Viro

Al Viro
2011-03-14 21:15:28 +0800
47c805dc2 switch do_filp_open() to struct open_flags ... Browse Code »

take calculation of open_flags by open(2) arguments into new helper
in fs/open.c, move filp_open() over there, have it and do_sys_open()
use that helper, switch exec.c callers of do_filp_open() to explicit
(and constant) struct open_flags.

Signed-off-by: Al Viro

Al Viro
2011-03-14 21:15:25 +0800

24 Feb, 2011

1 commit

93b270f76 Fix over-zealous flush_disk when changing device size. ... Browse Code »

There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data will hold becomes irrelevant.
In the oter, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.

In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.

In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.

flush_disk calls __invalidate_devices.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.

invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.

invalidate_bev *does*not* discard dirty pages, but I don't really care
about that at present.

So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.

flusk_disk then passes true from check_disk_change and false from
check_disk_size_change.

dm avoids tripping over this problem by calling i_size_write directly
rathher than using check_disk_size_change.

md does use check_disk_size_change and so is affected.

This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.

Cc: stable@kernel.org
Acked-by: Jeff Moyer
Cc: Andrew Patterson
Cc: Jens Axboe
Signed-off-by: NeilBrown

NeilBrown
2011-02-24 14:25:47 +0800

17 Jan, 2011

3 commits

b1e75df45 tidy up around finish_automount() ... Browse Code »

do_add_mount() and mnt_clear_expiry() are not needed outside of
namespace.c anymore, now that namei has finish_automount() to
use.

Signed-off-by: Al Viro

Al Viro
2011-01-17 14:47:59 +0800
19a167af7 Take the completion of automount into new helper ... Browse Code »

... and shift it from namei.c to namespace.c

Signed-off-by: Al Viro

Al Viro
2011-01-17 14:35:23 +0800
f03c65993 sanitize vfsmount refcounting changes ... Browse Code »

Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro

Al Viro
2011-01-17 02:47:07 +0800

16 Jan, 2011

1 commit

ea5b778a8 Unexport do_add_mount() and add in follow_automount(), not ->d_automount() ... Browse Code »

Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:48 +0800

07 Jan, 2011

1 commit

b3e19d924 fs: scale mntget/mntput ... Browse Code »

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:33 +0800

29 Oct, 2010

1 commit

a4cdbd8bf braino in internal.h ... Browse Code »

wrong return type...

Signed-off-by: Al Viro

Al Viro
2010-10-29 17:49:13 +0800

26 Oct, 2010

3 commits

63997e98a split invalidate_inodes() ... Browse Code »

Pull removal of fsnotify marks into generic_shutdown_super().
Split umount-time work into a new function - evict_inodes().
Make sure that invalidate_inodes() will be able to cope with
I_FREEING once we change locking in iput().

Signed-off-by: Al Viro

Al Viro
2010-10-26 09:27:18 +0800
cffbc8aa3 fs: Convert nr_inodes and nr_unused to per-cpu counters ... Browse Code »

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

[AV: cleaned up a bit, fixed build breakage on weird configs

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Al Viro

Dave Chinner
2010-10-26 09:26:09 +0800
a8dade34e unexport invalidate_inodes ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2010-10-26 09:23:32 +0800

18 Aug, 2010

2 commits

99b7db7b8 fs: brlock vfsmount_lock ... Browse Code »

fs: brlock vfsmount_lock

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.

The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability is not not be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).

The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
took 6.8s, afterwards took 7.1s, about 5% slower.

Cc: Al Viro
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:48 +0800
d996b62a8 tty: fix fu_list abuse ... Browse Code »

tty: fix fu_list abuse

tty code abuses fu_list, which causes a bug in remount,ro handling.

If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".

Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.

The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.

[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig
Cc: Alan Cox
Cc: Greg Kroah-Hartman
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:47 +0800

22 May, 2010

1 commit

35cf7ba0b Bury __put_super_and_need_restart() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800

04 Mar, 2010

1 commit

47cd813f2 Take vfsmount_lock to fs/internal.h ... Browse Code »

no more users left outside of fs/*.c (and very few outside of
fs/namespace.c, actually)

Signed-off-by: Al Viro

Al Viro
2010-03-04 03:07:59 +0800

23 Dec, 2009

1 commit

482928d59 Fix f_flags/f_mode in case of lookup_instantiate_filp() from open(pathname, 3) ... Browse Code »

Just set f_flags when shoving struct file into nameidata; don't
postpone that until __dentry_open(). do_filp_open() has correct
value; lookup_instantiate_filp() doesn't - we lose the difference
between O_RDWR and 3 by that point.

We still set .intent.open.flags, so no fs code needs to be changed.

Signed-off-by: Al Viro

Al Viro
2009-12-23 01:27:34 +0800

17 Dec, 2009

1 commit

e81e3f4dc fs: move get_empty_filp() deffinition to internal.h ... Browse Code »

All users outside of fs/ of get_empty_filp() have been removed. This patch
moves the definition from the include/ directory to internal.h so no new
users crop up and removes the EXPORT_SYMBOL. I'd love to see open intents
stop using it too, but that's a problem for another day and a smarter
developer!

Signed-off-by: Eric Paris
Acked-by: Miklos Szeredi
Signed-off-by: Al Viro

Eric Paris
2009-12-17 01:16:45 +0800

24 Sep, 2009

1 commit

eca6f534e fs: fix overflow in sys_mount() for in-kernel calls ... Browse Code »

sys_mount() reads/copies a whole page for its "type" parameter. When
do_mount_root() passes a kernel address that points to an object which is
smaller than a whole page, copy_mount_options() will happily go past this
memory object, possibly dereferencing "wild" pointers that could be in any
state (hence the kmemcheck warning, which shows that parts of the next
page are not even allocated).

(The likelihood of something going wrong here is pretty low -- first of
all this only applies to kernel calls to sys_mount(), which are mostly
found in the boot code. Secondly, I guess if the page was not mapped,
exact_copy_from_user() _would_ in fact handle it correctly because of its
access_ok(), etc. checks.)

But it is much nicer to avoid the dubious reads altogether, by stopping as
soon as we find a NUL byte. Is there a good reason why we can't do
something like this, using the already existing strndup_from_user()?

[akpm@linux-foundation.org: make copy_mount_string() static]
[AV: fix compat mount breakage, which involves undoing akpm's change above]

Reported-by: Ingo Molnar
Signed-off-by: Vegard Nossum
Cc: Al Viro
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: al

Vegard Nossum
2009-09-24 20:40:15 +0800

12 Jun, 2009

4 commits

62c6943b4 Trim a bit of crap from fs.h ... Browse Code »

do_remount_sb() is fs/internal.h fodder, fsync_no_super() is long gone.

Signed-off-by: Al Viro

Al Viro
2009-06-12 09:36:07 +0800
5cee5815d vfs: Make sys_sync() use fsync_super() (version 4) ... Browse Code »

It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

Nice bonus is that we get a complete livelock avoidance and write_supers()
is now only used for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]

Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2009-06-12 09:36:03 +0800
5a3e5cb8e vfs: Fix sys_sync() and fsync_super() reliability (version 4) ... Browse Code »

So far, do_sync() called:
sync_inodes(0);
sync_supers();
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);

This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem hits also other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).

Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.

The same issue is fixed in __fsync_super() function used on umount /
remount read-only.

[AV: build fixes]

Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2009-06-12 09:36:03 +0800
864d7c4c0 fs: move mark_files_ro into file_table.c ... Browse Code »

This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]

Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

npiggin@suse.de
2009-06-12 09:36:02 +0800

01 Apr, 2009

2 commits

498052bba New locking/refcounting for fs_struct ... Browse Code »

* all changes of current->fs are done under task_lock and write_lock of
old fs->lock
* refcount is not atomic anymore (same protection)
* its decrements are done when removing reference from current; at the
same time we decide whether to free it.
* put_fs_struct() is gone
* new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do
execve() and only subthreads share fs_struct. Cleared when finishing exec
(success and failure alike). Makes CLONE_FS fail with -EAGAIN if set.
* check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
is in progress.

Signed-off-by: Al Viro

Al Viro
2009-04-01 11:00:26 +0800
3e93cd671 Take fs_struct handling to new file (fs/fs_struct.c) ... Browse Code »

Pure code move; two new helper functions for nfsd and daemonize
(unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
the same code as used to be in callers). unshare_fs_struct()
exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
copy_fs_struct() and exit_fs() don't need exports anymore.

Signed-off-by: Al Viro

Al Viro
2009-04-01 11:00:26 +0800

29 Mar, 2009

1 commit

e426b64c4 fix setuid sometimes doesn't ... Browse Code »

Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.

Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc//fd and /proc//fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.

There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.

[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]

Reported-by: Joe Malicki
Narrowed-down-by: Michael Itz
Tested-by: Joe Malicki
Signed-off-by: Hugh Dickins
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-03-29 08:30:00 +0800

07 Feb, 2009

1 commit

0bf2f3aec CRED: Fix SUID exec regression ... Browse Code »

The patch:

commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials

moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.

However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

We do have to be careful because CLONE_THREAD does not imply FS or FILES.

We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.

This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:

[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID

and if unsuccessful, something like:

[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!

The non-root user ID you see will depend on the user you run as.

[test1.c]
#include
#include
#include
#include

static void *thread_func(void *arg)
{
while (1) {}
}

int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;

printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}

printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}

[test2.c]
#include
#include
#include

int main(int argc, char **argv)
{
uid_t uid, euid, suid;

getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}

[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2

test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread

test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2

Reported-by: David Smith
Signed-off-by: David Howells
Acked-by: David Smith
Signed-off-by: James Morris

David Howells
2009-02-07 05:46:18 +0800