Doug / smarc-fsl-linux-kernel | Embedian Git Server

01 May, 2013

19 commits

d69f3bad4 ipc: sysv shared memory limited to 8TiB

While running an application that tried to place data into half of
memory using shmget(), we found that a shmall value below 8EiB-8TiB
prevented us from using more than 8TiB. Setting kernel.shmall greater
than 8EiB-8TiB made the job work.

In the newseg() function, ns->shm_tot, which counts pages, reaches INT_MAX
at about 8TiB (with 4KiB pages), and numpages is an int:

ipc/shm.c:
static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
...
	int numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
...
	if (ns->shm_tot + numpages > ns->shm_ctlall)
		return -ENOSPC;
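
The arithmetic behind the 8TiB figure can be checked with a small userspace sketch. It assumes 4KiB pages and a 32-bit int; pages_needed(), fits_in_int() and over_limit() are hypothetical names, not kernel code, and the wraparound guard is only one possible way to write such a check.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round a byte count up to whole pages, keeping the result in 64 bits. */
static uint64_t pages_needed(uint64_t size)
{
    /* The rounding addition itself must not wrap. */
    assert(size <= UINT64_MAX - (SKETCH_PAGE_SIZE - 1));
    return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Returns 1 if the page count can still be stored in a signed int. */
static int fits_in_int(uint64_t pages)
{
    return pages <= (uint64_t)INT_MAX;
}

/*
 * Limit check done in unsigned 64-bit arithmetic.
 * Returns 1 if the request must be refused, including when tot + pages wraps.
 */
static int over_limit(uint64_t tot, uint64_t pages, uint64_t ctlall)
{
    uint64_t sum = tot + pages;

    return sum < tot || sum > ctlall;
}
```

With these assumptions, an 8TiB request needs 2^31 pages, one more than INT_MAX, which is why an int page counter cannot represent it.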

[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
Signed-off-by: Robin Holt
Reported-by: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2013-05-01 23:12:58 +0800
41239fe82 ipc/msg.c: use list_for_each_entry_[safe] for list traversing

The ipc/msg.c code does its list operations by hand and open-codes the
accesses, instead of using list_for_each_entry_[safe].
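
The difference between an open-coded walk and the iterator macro can be sketched in userspace. Everything below is a simplified stand-in written for illustration: the list type and macros mimic the kernel's list API but are re-declared locally, struct sketch_msg is hypothetical, and __typeof__ is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for a circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                        \
    for (pos = container_of((head)->next, __typeof__(*pos), member);  \
         &pos->member != (head);                                      \
         pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    assert(item != NULL && head != NULL);
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Hypothetical message type; only the embedded list node matters here. */
struct sketch_msg {
    long m_type;
    struct list_head m_list;
};

/* Open-coded traversal: manual cursor plus container_of on every step. */
static int count_open_coded(struct list_head *head, long type)
{
    int n = 0;
    struct list_head *tmp = head->next;

    while (tmp != head) {
        struct sketch_msg *msg = container_of(tmp, struct sketch_msg, m_list);

        if (msg->m_type == type)
            n++;
        tmp = tmp->next;
    }
    return n;
}

/* Same traversal expressed with the iterator macro. */
static int count_with_macro(struct list_head *head, long type)
{
    int n = 0;
    struct sketch_msg *msg;

    list_for_each_entry(msg, head, m_list)
        if (msg->m_type == type)
            n++;
    return n;
}
```

Both functions visit the same entries; the macro version simply hides the cursor bookkeeping.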

Signed-off-by: Nikola Pajkovsky
Cc: Stanislav Kinsbursky
Cc: "Eric W. Biederman"
Cc: Peter Hurley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nikola Pajkovsky
2013-05-01 23:12:58 +0800
6062a8dc0 ipc,sem: fine grained locking for semtimedop

Introduce finer grained locking for semtimedop, to handle the common case
of a program wanting to manipulate one semaphore from an array with
multiple semaphores.

If the call is a semop manipulating just one semaphore in an array with
multiple semaphores, only take the lock for that semaphore itself.

If the call needs to manipulate multiple semaphores, or another caller is
in a transaction that manipulates multiple semaphores, the sem_array lock
is taken, as well as all the locks for the individual semaphores.
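
The two locking paths described above can be sketched in userspace with C11 atomics. This is a deliberately simplified illustration, not the kernel implementation: the spinlock, the struct names and SKETCH_NSEMS are hypothetical, and the sketch omits the check for another caller already being in a multi-semaphore transaction.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NSEMS 4

/* Tiny test-and-set spinlock, for illustration only. */
typedef struct { atomic_flag held; } sketch_spinlock;

static void sketch_spin_init(sketch_spinlock *l)
{
    atomic_flag_clear(&l->held);
}

static void sketch_spin_lock(sketch_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* busy-wait until the holder releases the lock */
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int sketch_spin_trylock(sketch_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void sketch_spin_unlock(sketch_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct sketch_sem { sketch_spinlock lock; int semval; };

struct sketch_sem_array {
    sketch_spinlock lock;                  /* array-wide lock */
    struct sketch_sem sems[SKETCH_NSEMS];  /* one lock per semaphore */
};

static void sketch_array_init(struct sketch_sem_array *sma)
{
    sketch_spin_init(&sma->lock);
    for (int i = 0; i < SKETCH_NSEMS; i++) {
        sketch_spin_init(&sma->sems[i].lock);
        sma->sems[i].semval = 0;
    }
}

/* Returns the index of the single semaphore locked, or -1 if all are locked. */
static int sketch_sem_lock(struct sketch_sem_array *sma,
                           const int *sem_nums, int nsops)
{
    assert(nsops >= 1);
    if (nsops == 1) {
        assert(sem_nums[0] >= 0 && sem_nums[0] < SKETCH_NSEMS);
        sketch_spin_lock(&sma->sems[sem_nums[0]].lock);
        return sem_nums[0];
    }
    /* Multi-semaphore operation: array lock first, then every semaphore. */
    sketch_spin_lock(&sma->lock);
    for (int i = 0; i < SKETCH_NSEMS; i++)
        sketch_spin_lock(&sma->sems[i].lock);
    return -1;
}

static void sketch_sem_unlock(struct sketch_sem_array *sma, int locknum)
{
    if (locknum >= 0) {
        sketch_spin_unlock(&sma->sems[locknum].lock);
        return;
    }
    for (int i = 0; i < SKETCH_NSEMS; i++)
        sketch_spin_unlock(&sma->sems[i].lock);
    sketch_spin_unlock(&sma->lock);
}
```

Because the multi-semaphore path takes every per-semaphore lock, it excludes all single-semaphore callers, while two single-semaphore callers on different semaphores never contend with each other.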

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

threads  vanilla  Davidlohr's patches  Davidlohr's + rwlock patches  Davidlohr's + v3 patches
10       610652   726325               1783589                       2142206
20       341570   365699               1520453                       1977878
30       288102   307037               1498167                       2037995
40       290714   305955               1612665                       2256484
50       288620   312890               1733453                       2650292
60       289987   306043               1649360                       2388008
70       291298   306347               1723167                       2717486
80       290948   305662               1729545                       2763582
90       290996   306680               1736021                       2757524
100      292243   306700               1773700                       3059159

[davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
[davidlohr.bueso@hp.com: make refcounter atomic]
Signed-off-by: Rik van Riel
Suggested-by: Linus Torvalds
Acked-by: Davidlohr Bueso
Cc: Chegu Vinod
Cc: Jason Low
Reviewed-by: Michel Lespinasse
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Emmanuel Benisty
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2013-05-01 23:12:58 +0800
9f1bc2c90 ipc,sem: have only one list in struct sem_queue

Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows us to
introduce finer grained locking for semtimedop.

Signed-off-by: Rik van Riel
Acked-by: Davidlohr Bueso
Cc: Chegu Vinod
Cc: Emmanuel Benisty
Cc: Jason Low
Cc: Michel Lespinasse
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2013-05-01 23:12:58 +0800
c460b662d ipc,sem: open code and rename sem_lock

Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
later that only locks the sem_array and does nothing else.

Open code the locking from ipc_lock() in sem_obtain_lock() so we can
introduce finer grained locking for the sem_array in the next patch.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
Signed-off-by: Rik van Riel
Acked-by: Davidlohr Bueso
Cc: Chegu Vinod
Cc: Emmanuel Benisty
Cc: Jason Low
Cc: Michel Lespinasse
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2013-05-01 23:12:58 +0800
16df3674e ipc,sem: do not hold ipc lock more than necessary

Instead of holding the ipc lock for permissions and security checks, among
others, only acquire it when necessary.

Some numbers....

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+ 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.86% a.out [kernel.kallsyms] [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+ 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check

2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%

[riel@redhat.com: fix two locking bugs]
[sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Signed-off-by: Rik van Riel
Reviewed-by: Chegu Vinod
Acked-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Jason Low
Cc: Emmanuel Benisty
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-05-01 23:12:58 +0800
444d0f621 ipc: introduce lockless pre_down ipcctl

Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
check permissions, mostly for IPC_RMID and IPC_SET commands.

Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is retained, yet modified to call the nolock version
without affecting its semantics, thus transparent to all ipc callers.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Rik van Riel
Suggested-by: Linus Torvalds
Cc: Chegu Vinod
Cc: Emmanuel Benisty
Cc: Jason Low
Cc: Michel Lespinasse
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-05-01 23:12:58 +0800
4d2bff5eb ipc: introduce obtaining a lockless ipc object

Through ipc_lock() and therefore ipc_lock_check() we currently return the
locked ipc object. This is not necessary for all situations and can,
therefore, cause unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only lookup and return the ipc object.

Both these functions must be called within the RCU read critical section.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
Signed-off-by: Davidlohr Bueso
Signed-off-by: Rik van Riel
Reviewed-by: Chegu Vinod
Acked-by: Michel Lespinasse
Cc: Emmanuel Benisty
Cc: Jason Low
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-05-01 23:12:57 +0800
7bb4deff6 ipc: remove bogus lock comment for ipc_checkid

This series makes the sysv semaphore code more scalable, by reducing the
time the semaphore lock is held, and making the locking more scalable for
semaphore arrays with multiple semaphores.

The first four patches were written by Davidlohr Bueso, and reduce the
hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking to be more
fine grained, providing a performance boost when multiple semaphores in a
semaphore array are being manipulated simultaneously.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

threads  vanilla  Davidlohr's patches  Davidlohr's + rwlock patches  Davidlohr's + v3 patches
10       610652   726325               1783589                       2142206
20       341570   365699               1520453                       1977878
30       288102   307037               1498167                       2037995
40       290714   305955               1612665                       2256484
50       288620   312890               1733453                       2650292
60       289987   306043               1649360                       2388008
70       291298   306347               1723167                       2717486
80       290948   305662               1729545                       2763582
90       290996   306680               1736021                       2757524
100      292243   306700               1773700                       3059159

This patch:

There is no reason to be holding the ipc lock while reading ipcp->seq,
hence remove misleading comment.

Also simplify the return value for the function.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Rik van Riel
Cc: Chegu Vinod
Cc: Emmanuel Benisty
Cc: Jason Low
Cc: Michel Lespinasse
Cc: Peter Hurley
Cc: Stanislav Kinsbursky
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-05-01 23:12:57 +0800
1e3c941c5 ipc/msgutil.c: use linux/uaccess.h

Signed-off-by: HoSung Jung
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

HoSung Jung
2013-05-01 23:12:57 +0800
daaf74cf0 ipc: refactor msg list search into separate function

[fengguang.wu@intel.com: find_msg can be static]
Signed-off-by: Peter Hurley
Cc: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
d076ac911 ipc: simplify msg list search

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
8ac6ed585 ipc: implement MSG_COPY as a new receive mode

Teach the helper routines about MSG_COPY so that msgtyp is preserved as
the message number to copy.

The security functions affected by this change were audited and no
additional changes are necessary.

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
852028af8 ipc: remove msg handling from queue scan

In preparation for refactoring the queue scan into a separate
function, relocate msg copying.

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
2b3097a29 ipc: set EFAULT as default error in load_msg()

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
da085d459 ipc: tighten msg copy loops

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
be5f4b335 ipc: separate msg allocation from userspace copy

Separating msg allocation enables single-block vmalloc
allocation instead.

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
3d8fa456d ipc: clamp with min()

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-05-01 23:12:57 +0800
08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800

30 Apr, 2013

1 commit

8f68fa2d1 ipc/util.c: use register_hotmemory_notifier()

Squishes a statement-with-no-effect warning, removes some ifdefs and
shrinks .text by one byte!

Note that this code fails to check for blocking_notifier_chain_register()
failures.

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-04-30 06:54:36 +0800

03 Apr, 2013

1 commit

2dc958fa2 ipc: set msg back to -EAGAIN if copy wasn't performed

Make sure that the msg pointer is set back to an error value when the
MSG_COPY flag is set and the desired message to copy wasn't found. This
guarantees that msg is either an error pointer or a copy address.

Otherwise the last message in queue will be freed without unlinking from
the queue (which leads to memory corruption) and the dummy allocated
copy won't be released.
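
The failure mode can be illustrated with a userspace sketch of a scan loop whose cursor doubles as the return value. The helpers below imitate the kernel's error-pointer convention but are re-declared locally, and sketch_find_copy() together with its array-based queue is a hypothetical simplification.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer value lies in the range reserved for errno codes. */
static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

static long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Hypothetical message type. */
struct sketch_msg { int payload; };

/*
 * Look for the message at position 'want'. The cursor 'msg' points at live
 * queue entries while scanning, so it must be reset to an error value when
 * nothing matched; otherwise the caller would treat the last entry as found.
 */
static struct sketch_msg *sketch_find_copy(struct sketch_msg *queue,
                                           size_t len, size_t want)
{
    struct sketch_msg *msg = sketch_err_ptr(-EAGAIN);

    assert(queue != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        msg = &queue[i];
        if (i == want)
            return msg;
    }
    /* No match: set msg back to the error value before returning it. */
    msg = sketch_err_ptr(-EAGAIN);
    return msg;
}
```

After the reset, the caller only ever sees either an error pointer or the address of the message it asked for.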

Signed-off-by: Stanislav Kinsbursky
Signed-off-by: Linus Torvalds

Stanislav Kinsbursky
2013-04-03 01:09:01 +0800

29 Mar, 2013

1 commit

2c3de1c2d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.

The rest of the changes are just plain bug fixes.

Many thanks to Andy Lutomirski for pointing out many of these issues."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.

Linus Torvalds
2013-03-29 04:43:46 +0800

27 Mar, 2013

1 commit

a636b702e ipc: Restrict mounting the mqueue filesystem

Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
rights over the ipc namespace. The principle here is if you create
or have capabilities over it you can mount it, otherwise you get to live
with what other people have mounted.

This information is not particularly sensitive and mqueue essentially
only reports which POSIX message queues exist. Still, when creating a
restricted environment for an application to live in, any extra
information may be of use to someone with sufficient creativity. The
historical, if imperfect, way this information has been restricted has
been not to allow mounts, and restricting this to ipc namespace
creators maintains the spirit of the historical restriction.

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-27 22:50:06 +0800

23 Mar, 2013

1 commit

38d78e587 mqueue: sys_mq_open: do not call mnt_drop_write() if read-only

mnt_drop_write() must be called only if mnt_want_write() succeeded,
otherwise the mnt_writers counter will diverge.

mnt_writers counters are used to check if remounting FS as read-only is
OK, so after an extra mnt_drop_write() call, it would be impossible to
remount mqueue FS as read-only. Besides, on umount a warning would be
printed like this one:

=====================================
[ BUG: bad unlock balance detected! ]
3.9.0-rc3 #5 Not tainted
-------------------------------------
a.out/12486 is trying to release lock (sb_writers) at:
mnt_drop_write+0x1f/0x30
but there are no more locks to release!
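
The pairing rule behind this warning can be sketched with a plain counter. The struct and functions below are hypothetical userspace stand-ins; they only model the rule that a drop is allowed when, and only when, the matching want succeeded.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mount with a writer count and a read-only flag. */
struct sketch_mount {
    int writers;
    int read_only;
};

/* Returns 0 and counts a writer, or -EROFS without touching the counter. */
static int sketch_want_write(struct sketch_mount *m)
{
    if (m->read_only)
        return -EROFS;
    m->writers++;
    return 0;
}

static void sketch_drop_write(struct sketch_mount *m)
{
    /* An unbalanced drop is exactly the bug being avoided. */
    assert(m->writers > 0);
    m->writers--;
}

/* Open path: remember whether want_write succeeded and drop only then. */
static int sketch_open(struct sketch_mount *m, int create)
{
    int ro = sketch_want_write(m);  /* 0 on success, negative errno otherwise */
    int error = 0;

    if (create && ro)
        error = ro;                 /* creating needs write access */

    /* ... the actual lookup or create work would happen here ... */

    if (!ro)
        sketch_drop_write(m);
    return error;
}
```

On a read-only mount the counter is never incremented, so skipping the drop keeps it at zero.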

Signed-off-by: Vladimir Davydov
Cc: Doug Ledford
Cc: KOSAKI Motohiro
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2013-03-23 07:41:21 +0800

09 Mar, 2013

2 commits

88b9e456b ipc: don't allocate a copy larger than max

When MSG_COPY is set, a duplicate message must be allocated for the copy
before locking the queue. However, the copy cannot be larger than what was
sent, which is limited to msg_ctlmax.
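
A minimal sketch of the bound, assuming the allocation size is simply the smaller of the receive buffer size and the queue's maximum message size. sketch_copy_alloc_size() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Size to allocate for the placeholder copy: never more than msg_ctlmax. */
static size_t sketch_copy_alloc_size(size_t bufsz, size_t msg_ctlmax)
{
    assert(msg_ctlmax > 0);
    return bufsz < msg_ctlmax ? bufsz : msg_ctlmax;
}
```

A caller passing a very large receive buffer would then cause an allocation of at most msg_ctlmax bytes.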

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-03-09 07:05:33 +0800
e1082f45f ipc: fix potential oops when src msg > 4k w/ MSG_COPY

If the src msg is > 4k, then dest->next points to the
next allocated segment; resetting it just prior to dereferencing
is bad.
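
A userspace sketch of copying into a destination whose segments are already allocated and linked. The segment type and SKETCH_SEG_LEN are hypothetical; the point is only that the destination's next pointer is followed, never overwritten, while walking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SEG_LEN 8

/* Hypothetical message segment: fixed-size payload plus a link. */
struct sketch_seg {
    struct sketch_seg *next;
    char data[SKETCH_SEG_LEN];
};

/*
 * Copy every segment of src into the preallocated chain dst.
 * Returns 0 on success, -1 if dst has fewer segments than src.
 */
static int sketch_copy_segs(struct sketch_seg *dst, const struct sketch_seg *src)
{
    assert(dst != NULL);
    while (src != NULL) {
        if (dst == NULL)
            return -1;
        memcpy(dst->data, src->data, SKETCH_SEG_LEN);
        /*
         * dst->next already links to the next allocated segment.
         * Clearing it here would lose that segment before it is used.
         */
        dst = dst->next;
        src = src->next;
    }
    return 0;
}
```

With a two-segment source, the second iteration relies on the destination link still being intact.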

Signed-off-by: Peter Hurley
Acked-by: Stanislav Kinsbursky
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Hurley
2013-03-09 07:05:33 +0800

06 Mar, 2013

1 commit

e1fd1f490 get rid of union semop in sys_semctl(2) arguments

just have the bugger take unsigned long and deal with SETVAL
case (when we use an int member in the union) explicitly.
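
Why SETVAL needs explicit handling can be sketched as follows, assuming the union's int member occupies the first four bytes of a register-sized argument, so that on a big-endian 64-bit ABI it ends up in the upper half. sketch_setval_from_arg() and its big_endian_64bit flag are hypothetical and only model that assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Recover the int value for SETVAL from a raw 64-bit argument.
 * big_endian_64bit selects which half of the word holds the int.
 * The sketch only handles non-negative values.
 */
static int sketch_setval_from_arg(uint64_t arg, int big_endian_64bit)
{
    uint32_t bits = big_endian_64bit ? (uint32_t)(arg >> 32) : (uint32_t)arg;

    assert(bits <= (uint32_t)INT32_MAX);
    return (int)bits;
}
```

For every other command the argument is a pointer that fills the whole word, so no such extraction is needed.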

Signed-off-by: Al Viro

Al Viro
2013-03-06 04:14:16 +0800

04 Mar, 2013

3 commits

0e65a81b1 get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC

Signed-off-by: Al Viro

Al Viro
2013-03-04 12:00:27 +0800
56e41d3c5 merge compat sys_ipc instances

Signed-off-by: Al Viro

Al Viro
2013-03-04 12:00:27 +0800
22d1a35da make HAVE_SYSCALL_WRAPPERS unconditional

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:30 +0800

28 Feb, 2013

1 commit

54924ea33 ipc: convert to idr_alloc()

Convert to the much saner new idr interface.

The new interface doesn't directly translate to the way idr_pre_get()
was used around ipc_addid() as preloading disables preemption. From
my cursory reading, it seems like we should be able to do all
allocation from ipc_addid(), so I moved it there. Can you please
check whether this would be okay? If this is wrong and ipc_addid()
should be allowed to be called from non-sleepable context, I'd suggest
allocating id itself in the outer functions and later install the
pointer using idr_replace().

Signed-off-by: Tejun Heo
Reported-by: Sedat Dilek
Tested-by: Sedat Dilek
Cc: Stanislav Kinsbursky
Cc: "Eric W. Biederman"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tejun Heo
2013-02-28 11:10:19 +0800

27 Feb, 2013

1 commit

d895cb1af Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.

The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.

Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.

PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...

Linus Torvalds
2013-02-27 12:16:07 +0800

26 Feb, 2013

1 commit

94f2f1423 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhancements to the user
namespace: reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.

I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.

There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.

The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.

XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...

Linus Torvalds
2013-02-26 08:00:49 +0800

24 Feb, 2013

2 commits

41badc15c mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introducing the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().
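
The design choice can be sketched in userspace: the mapping routine rounds the length once and hands the rounded size back, so the populate step needs no rounding of its own. The names below are hypothetical stand-ins and a 4KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round a length up to the next multiple of the page size. */
static unsigned long sketch_page_align(unsigned long len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Stand-in for the mapping routine. Instead of a yes/no flag, *populate
 * receives the number of bytes to populate (0 means nothing to do).
 */
static unsigned long sketch_do_mmap(unsigned long addr, unsigned long len,
                                    int want_populate, unsigned long *populate)
{
    assert(populate != NULL);
    len = sketch_page_align(len);
    *populate = want_populate ? len : 0;
    return addr;
}
```

A caller would then populate the range starting at the returned address for exactly the reported number of bytes.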

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
bebeb3d68 mm: introduce mm_populate() for populating new vmas

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was used during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse
Reported-by: Andy Lutomirski
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:10 +0800

23 Feb, 2013

2 commits

39b652527 fs: Preserve error code in get_empty_filp(), part 2

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit

Currently the function returns NULL in all cases and we lose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.
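
Returning the error through the pointer can be sketched in userspace as follows. The helpers imitate the kernel's error-pointer convention but are re-declared locally; the allocator, its parameters and the specific errno chosen for each failure are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

static long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Hypothetical file structure. */
struct sketch_file { int flags; };

/* Each failure reason gets its own error code instead of a bare NULL. */
static struct sketch_file *sketch_get_empty_filp(long nr_open, long max_files,
                                                 int allowed)
{
    struct sketch_file *f;

    assert(max_files >= 0);
    if (nr_open >= max_files)
        return sketch_err_ptr(-ENFILE);   /* over the limit */
    if (!allowed)
        return sketch_err_ptr(-EPERM);    /* operation not allowed */
    f = calloc(1, sizeof(*f));
    if (f == NULL)
        return sketch_err_ptr(-ENOMEM);   /* not enough memory */
    return f;
}

/* Caller passes the exact error on instead of assuming ENFILE. */
static long sketch_open(long nr_open, long max_files, int allowed)
{
    struct sketch_file *f = sketch_get_empty_filp(nr_open, max_files, allowed);

    if (sketch_is_err(f))
        return sketch_ptr_err(f);
    free(f);
    return 0;
}
```

The caller no longer has to guess the reason for a failure; it forwards whatever code the allocator encoded.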

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Al Viro

Anatol Pomozov
2013-02-23 12:31:32 +0800
496ad9aa8 new helper: file_inode(file)

Signed-off-by: Al Viro

Al Viro
2013-02-23 12:31:31 +0800

28 Jan, 2013

1 commit

bc1b69ed2 userns: Allow the unprivileged users to mount mqueue fs

This patch allows an unprivileged user to mount mqueuefs in a
user namespace.

If two user namespaces share the same ipcns, the files in the mqueue fs
should be visible in both of them.

If a userns has its own ipcns, it has its own mqueue fs too.
ipcns already handles this case well.

Signed-off-by: Gao feng
Signed-off-by: Eric W. Biederman

Gao feng
2013-01-28 11:25:50 +0800

05 Jan, 2013

2 commits

3fcfe7865 ipc: add more comments to message copying related code

Signed-off-by: Stanislav Kinsbursky
Cc: "Eric W. Biederman"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stanislav Kinsbursky
2013-01-05 08:11:46 +0800
51eeacaa0 ipc: simplify message copying

Remove the redundant and confusing fill_copy(). Also check copy_msg()
for errors. In that case the function has to return instead of breaking
out of the loop, because the code that follows interprets any error as EAGAIN.

Also define copy_msg() for the case when CONFIG_CHECKPOINT_RESTORE is
disabled.

Signed-off-by: Stanislav Kinsbursky
Cc: "Eric W. Biederman"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stanislav Kinsbursky
2013-01-05 08:11:46 +0800