Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

18 Sep, 2014

40 commits

af92ba8fd Linux 3.14.19 Browse Code »

Greg Kroah-Hartman
2014-09-18 00:21:23 +0800
1143261f6 KEYS: Fix termination condition in assoc array garbage collection ... Browse Code »

commit 95389b08d93d5c06ec63ab49bd732b0069b7c35e upstream.

This fixes CVE-2014-3631.

It is possible for an associative array to end up with a shortcut node at the
root of the tree if there are more than fan-out leaves in the tree, but they
all crowd into the same slot in the lowest level (ie. they all have the same
first nibble of their index keys).

When assoc_array_gc() returns back up the tree after scanning some leaves, it
can fall off of the root and crash because it assumes that the back pointer
from a shortcut (after label ascend_old_tree) must point to a normal node -
which isn't true of a shortcut node at the root.

Should we find we're ascending rootwards over a shortcut, we should check to
see if the backpointer is zero - and if it is, we have completed the scan.

This particular bug cannot occur if the root node is not a shortcut - ie. if
you have fewer than 17 keys in a keyring or if you have at least two keys that
sit into separate slots (eg. a keyring and a non keyring).

This can be reproduced by:

ring=`keyctl newring bar @s`
for ((i=1; i/proc/sys/kernel/keys/gc_delay

first will speed things up.

If we do fall off of the top of the tree, we get the following oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [] assoc_array_gc+0x2f7/0x540
PGD dae15067 PUD cfc24067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: xt_nat xt_mark nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_ni
CPU: 0 PID: 26011 Comm: kworker/0:1 Not tainted 3.14.9-200.fc20.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: events key_garbage_collector
task: ffff8800918bd580 ti: ffff8800aac14000 task.ti: ffff8800aac14000
RIP: 0010:[] [] assoc_array_gc+0x2f7/0x540
RSP: 0018:ffff8800aac15d40 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800aaecacc0
RDX: ffff8800daecf440 RSI: 0000000000000001 RDI: ffff8800aadc2bc0
RBP: ffff8800aac15da8 R08: 0000000000000001 R09: 0000000000000003
R10: ffffffff8136ccc7 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000070 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 00000000db10d000 CR4: 00000000000006f0
Stack:
ffff8800aac15d50 0000000000000011 ffff8800aac15db8 ffffffff812e2a70
ffff880091a00600 0000000000000000 ffff8800aadc2bc3 00000000cd42c987
ffff88003702df20 ffff88003702dfa0 0000000053b65c09 ffff8800aac15fd8
Call Trace:
[] ? keyring_detect_cycle_iterator+0x30/0x30
[] keyring_gc+0x75/0x80
[] key_garbage_collector+0x154/0x3c0
[] process_one_work+0x176/0x430
[] worker_thread+0x11b/0x3a0
[] ? rescuer_thread+0x3b0/0x3b0
[] kthread+0xd8/0xf0
[] ? insert_kthread_work+0x40/0x40
[] ret_from_fork+0x7c/0xb0
[] ? insert_kthread_work+0x40/0x40
Code: 08 4c 8b 22 0f 84 bf 00 00 00 41 83 c7 01 49 83 e4 fc 41 83 ff 0f 4c 89 65 c0 0f 8f 5a fe ff ff 48 8b 45 c0 4d 63 cf 49 83 c1 02 8b 34 c8 4d 85 f6 0f 84 be 00 00 00 41 f6 c6 01 0f 84 92
RIP [] assoc_array_gc+0x2f7/0x540
RSP
CR2: 0000000000000018
---[ end trace 1129028a088c0cbd ]---

Signed-off-by: David Howells
Acked-by: Don Zickus
Signed-off-by: James Morris
Signed-off-by: Greg Kroah-Hartman

David Howells
2014-09-18 00:19:29 +0800
ed35863a7 KEYS: Fix use-after-free in assoc_array_gc() ... Browse Code »

commit 27419604f51a97d497853f14142c1059d46eb597 upstream.

An edit script should be considered inaccessible by a function once it has
called assoc_array_apply_edit() or assoc_array_cancel_edit().

However, assoc_array_gc() is accessing the edit script just after the
gc_complete: label.

Reported-by: Andreea-Cristina Bernat
Signed-off-by: David Howells
Reviewed-by: Andreea-Cristina Bernat
cc: shemming@brocade.com
cc: paulmck@linux.vnet.ibm.com
Signed-off-by: James Morris
Signed-off-by: Greg Kroah-Hartman

David Howells
2014-09-18 00:19:29 +0800
d6e22ca59 libceph: gracefully handle large reply messages from the mon ... Browse Code »

commit 73c3d4812b4c755efeca0140f606f83772a39ce4 upstream.

We preallocate a few of the message types we get back from the mon. If we
get a larger message than we are expecting, fall back to trying to allocate
a new one instead of blindly using the one we have.

Signed-off-by: Sage Weil
Reviewed-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Sage Weil
2014-09-18 00:19:29 +0800
a8be8af18 vfs: fix bad hashing of dentries ... Browse Code »

commit 99d263d4c5b2f541dfacb5391e22e8c91ea982a6 upstream.

Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:

"The test case is essentially

for (i = 0; i < 1000000; i++)
mkdir("a$i");

On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
dir/sec with 3.10. This is because we spend waaaaay more time in
__d_lookup on 3.10 than in 3.2.

The new hashing function for strings is suboptimal for <
sizeof(unsigned long) string names (and hell even > sizeof(unsigned
long) string names that I've tested). I broke out the old hashing
function and the new one into a userspace helper to get real numbers
and this is what I'm getting:

Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

My test does the hash, and then does the d_hash into a integer pointer
array the same size as the dentry hash table on my system, and then
just increments the value at the address we got to see how many
entries we overlap with.

As you can see the old hash function ended up with all 1 million
entries in their own bucket, whereas the new one they are only
distributed among ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup".

The reason for this hash regression is two-fold:

- On 64-bit architectures the down-mixing of the original 64-bit
word-at-a-time hash into the final 32-bit hash value is very
simplistic and suboptimal, and just adds the two 32-bit parts
together.

In particular, because there is no bit shuffling and the mixing
boundary is also a byte boundary, similar character patterns in the
low and high word easily end up just canceling each other out.

- the old byte-at-a-time hash mixed each byte into the final hash as it
hashed the path component name, resulting in the low bits of the hash
generally being a good source of hash data. That is not true for the
word-at-a-time case, and the hash data is distributed among all the
bits.

The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible. We already have the
"hash_32|64()" functions to do that.

Reported-by: Josef Bacik
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Chris Mason
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2014-09-18 00:19:29 +0800
d7fbe53db drm/nouveau: Bump version from 1.1.1 to 1.1.2 ... Browse Code »

commit 7820e5eef0faa4a5e10834296680827f7ce78a89 upstream.

Linux 3.16 fixed multiple bugs in kms pageflip completion events
and timestamping, which were originally introduced in Linux 3.13.

These fixes have been backported to all stable kernels since 3.13.

However, the userspace nouveau-ddx needs to be aware if it is
running on a kernel on which these bugs are fixed, or not.

Bump the patchlevel of the drm driver version to signal this,
so backporting this patch to stable 3.13+ kernels will give the
ddx the required info.

Signed-off-by: Mario Kleiner
Signed-off-by: Ben Skeggs
Signed-off-by: Greg Kroah-Hartman

Mario Kleiner
2014-09-18 00:19:28 +0800
45a317f08 IB/srp: Fix deadlock between host removal and multipathd ... Browse Code »

commit bcc05910359183b431da92713e98eed478edf83a upstream.

If scsi_remove_host() is invoked after a SCSI device has been blocked,
if the fast_io_fail_tmo or dev_loss_tmo work gets scheduled on the
workqueue executing srp_remove_work() and if an I/O request is
scheduled after the SCSI device had been blocked by e.g. multipathd
then the following deadlock can occur:

kworker/6:1 D ffff880831f3c460 0 195 2 0x00000000
Call Trace:
[] schedule+0x29/0x70
[] schedule_timeout+0x10f/0x2a0
[] msleep+0x2f/0x40
[] __blk_drain_queue+0x4e/0x180
[] blk_cleanup_queue+0x225/0x230
[] __scsi_remove_device+0x62/0xe0 [scsi_mod]
[] scsi_forget_host+0x6f/0x80 [scsi_mod]
[] scsi_remove_host+0x7a/0x130 [scsi_mod]
[] srp_remove_work+0x95/0x180 [ib_srp]
[] process_one_work+0x1ea/0x6c0
[] worker_thread+0x11b/0x3a0
[] kthread+0xed/0x110
[] ret_from_fork+0x7c/0xb0
multipathd D ffff880096acc460 0 5340 1 0x00000000
Call Trace:
[] schedule+0x29/0x70
[] schedule_timeout+0x10f/0x2a0
[] io_schedule_timeout+0x9b/0xf0
[] wait_for_completion_io_timeout+0xdc/0x110
[] blk_execute_rq+0x9b/0x100
[] sg_io+0x1a5/0x450
[] scsi_cmd_ioctl+0x2a1/0x430
[] scsi_cmd_blk_ioctl+0x42/0x50
[] sd_ioctl+0xbe/0x140 [sd_mod]
[] blkdev_ioctl+0x234/0x840
[] block_ioctl+0x41/0x50
[] do_vfs_ioctl+0x300/0x520
[] SyS_ioctl+0x41/0x80
[] tracesys+0xd0/0xd5

Fix this by scheduling removal work on another workqueue than the
transport layer timers.

Signed-off-by: Bart Van Assche
Reviewed-by: Sagi Grimberg
Reviewed-by: David Dillow
Cc: Sebastian Parschauer
Signed-off-by: Roland Dreier
Signed-off-by: Greg Kroah-Hartman

Bart Van Assche
2014-09-18 00:19:28 +0800
c994b9528 mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc() ... Browse Code »

commit 40ddbf5069bd4e11447c0088fc75318e0aac53f0 upstream.

commit 65b97cf6b8de introduced in v3.7 caused a regression
by using a reversed CS_MASK thus causing omap_calculate_ecc to
always fail. As the NAND base driver never checks for .calculate()'s
return value, the zeroed ECC values are used as is without showing
any error to the user. However, this won't work and the NAND device
won't be guarded by any error code.

Fix the issue by using the correct mask.

Code was tested on omap3beagle using the following procedure
- flash the primary bootloader (MLO) from the kernel to the first
NAND partition using nandwrite.
- boot the board from NAND. This utilizes OMAP ROM loader that
relies on 1-bit Hamming code ECC.

Fixes: 65b97cf6b8de (mtd: nand: omap2: handle nand on gpmc)

Signed-off-by: Roger Quadros
Signed-off-by: Tony Lindgren
Signed-off-by: Greg Kroah-Hartman

Roger Quadros
2014-09-18 00:19:28 +0800
cf39d69ea mtd/ftl: fix the double free of the buffers allocated in build_maps() ... Browse Code »

commit a152056c912db82860a8b4c23d0bd3a5aa89e363 upstream.

I got the following panic on my fsl p5020ds board.

Unable to handle kernel paging request for data at address 0x7375627379737465
Faulting instruction address: 0xc000000000100778
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=24 CoreNet Generic
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-next-20140613 #145
task: c0000000fe080000 ti: c0000000fe088000 task.ti: c0000000fe088000
NIP: c000000000100778 LR: c00000000010073c CTR: 0000000000000000
REGS: c0000000fe08aa00 TRAP: 0300 Not tainted (3.15.0-next-20140613)
MSR: 0000000080029000 CR: 24ad2e24 XER: 00000000
DEAR: 7375627379737465 ESR: 0000000000000000 SOFTE: 1
GPR00: c0000000000c99b0 c0000000fe08ac80 c0000000009598e0 c0000000fe001d80
GPR04: 00000000000000d0 0000000000000913 c000000007902b20 0000000000000000
GPR08: c0000000feaae888 0000000000000000 0000000007091000 0000000000200200
GPR12: 0000000028ad2e28 c00000000fff4000 c0000000007abe08 0000000000000000
GPR16: c0000000007ab160 c0000000007aaf98 c00000000060ba68 c0000000007abda8
GPR20: c0000000007abde8 c0000000feaea6f8 c0000000feaea708 c0000000007abd10
GPR24: c000000000989370 c0000000008c6228 00000000000041ed c0000000fe00a400
GPR28: c00000000017c1cc 00000000000000d0 7375627379737465 c0000000fe001d80
NIP [c000000000100778] .__kmalloc_track_caller+0x70/0x168
LR [c00000000010073c] .__kmalloc_track_caller+0x34/0x168
Call Trace:
[c0000000fe08ac80] [c00000000087e6b8] uevent_sock_list+0x0/0x10 (unreliable)
[c0000000fe08ad20] [c0000000000c99b0] .kstrdup+0x44/0x90
[c0000000fe08adc0] [c00000000017c1cc] .__kernfs_new_node+0x4c/0x130
[c0000000fe08ae70] [c00000000017d7e4] .kernfs_new_node+0x2c/0x64
[c0000000fe08aef0] [c00000000017db00] .kernfs_create_dir_ns+0x34/0xc8
[c0000000fe08af80] [c00000000018067c] .sysfs_create_dir_ns+0x58/0xcc
[c0000000fe08b010] [c0000000002c711c] .kobject_add_internal+0xc8/0x384
[c0000000fe08b0b0] [c0000000002c7644] .kobject_add+0x64/0xc8
[c0000000fe08b140] [c000000000355ebc] .device_add+0x11c/0x654
[c0000000fe08b200] [c0000000002b5988] .add_disk+0x20c/0x4b4
[c0000000fe08b2c0] [c0000000003a21d4] .add_mtd_blktrans_dev+0x340/0x514
[c0000000fe08b350] [c0000000003a3410] .mtdblock_add_mtd+0x74/0xb4
[c0000000fe08b3e0] [c0000000003a32cc] .blktrans_notify_add+0x64/0x94
[c0000000fe08b470] [c00000000039b5b4] .add_mtd_device+0x1d4/0x368
[c0000000fe08b520] [c00000000039b830] .mtd_device_parse_register+0xe8/0x104
[c0000000fe08b5c0] [c0000000003b8408] .of_flash_probe+0x72c/0x734
[c0000000fe08b750] [c00000000035ba40] .platform_drv_probe+0x38/0x84
[c0000000fe08b7d0] [c0000000003599a4] .really_probe+0xa4/0x29c
[c0000000fe08b870] [c000000000359d3c] .__driver_attach+0x100/0x104
[c0000000fe08b900] [c00000000035746c] .bus_for_each_dev+0x84/0xe4
[c0000000fe08b9a0] [c0000000003593c0] .driver_attach+0x24/0x38
[c0000000fe08ba10] [c000000000358f24] .bus_add_driver+0x1c8/0x2ac
[c0000000fe08bab0] [c00000000035a3a4] .driver_register+0x8c/0x158
[c0000000fe08bb30] [c00000000035b9f4] .__platform_driver_register+0x6c/0x80
[c0000000fe08bba0] [c00000000084e080] .of_flash_driver_init+0x1c/0x30
[c0000000fe08bc10] [c000000000001864] .do_one_initcall+0xbc/0x238
[c0000000fe08bd00] [c00000000082cdc0] .kernel_init_freeable+0x188/0x268
[c0000000fe08bdb0] [c0000000000020a0] .kernel_init+0x1c/0xf7c
[c0000000fe08be30] [c000000000000884] .ret_from_kernel_thread+0x58/0xd4
Instruction dump:
41bd0010 480000c8 4bf04eb5 60000000 e94d0028 e93f0000 7cc95214 e8a60008
7fc9502a 2fbe0000 419e00c8 e93f0022 39200000 88ed06b2 992d06b2
---[ end trace b4c9a94804a42d40 ]---

It seems that the corrupted partition header on my mtd device triggers
a bug in the ftl. In function build_maps() it will allocate the buffers
needed by the mtd partition, but if something goes wrong such as kmalloc
failure, mtd read error or invalid partition header parameter, it will
free all allocated buffers and then return non-zero. In my case, it
seems that partition header parameter 'NumTransferUnits' is invalid.

And the ftl_freepart() is a function which free all the partition
buffers allocated by build_maps(). Given the build_maps() is a self
cleaning function, so there is no need to invoke this function even
if build_maps() return with error. Otherwise it will causes the
buffers to be freed twice and then weird things would happen.

Signed-off-by: Kevin Hao
Signed-off-by: Brian Norris
Signed-off-by: Greg Kroah-Hartman

Kevin Hao
2014-09-18 00:19:28 +0800
dd6f423d5 CIFS: Fix wrong restart readdir for SMB1 ... Browse Code »

commit f736906a7669a77cf8cabdcbcf1dc8cb694e12ef upstream.

The existing code calls server->ops->close() that is not
right. This causes XFS test generic/310 to fail. Fix this
by using server->ops->closedir() function.

Signed-off-by: Dan Carpenter
Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:28 +0800
24dd220a9 CIFS: Fix wrong filename length for SMB2 ... Browse Code »

commit 1bbe4997b13de903c421c1cc78440e544b5f9064 upstream.

The existing code uses the old MAX_NAME constant. This causes
XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant
definition.

Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:27 +0800
da432a625 CIFS: Fix directory rename error ... Browse Code »

commit a07d322059db66b84c9eb4f98959df468e88b34b upstream.

CIFS servers process nlink counts differently for files and directories.
In cifs_rename() if we the request fails on the existing target, we
try to remove it through cifs_unlink() but this is not what we want
to do for directories. As the result the following sequence of commands

mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar

and XFS test generic/023 fail with -ENOENT error. That's why the second
mkdir reuses the existing inode (target inode of the mv -T command) with
S_DEAD flag.

Fix this by checking whether the target is directory or not and
calling cifs_rmdir() rather than cifs_unlink() for directories.

Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:27 +0800
db8bf54b1 vfs: add d_is_dir() ... Browse Code »

commit 44b1d53043c482225196e8a9cd9f35163a1b3336 upstream.

Add d_is_dir(dentry) helper which is analogous to S_ISDIR().

To avoid confusion, rename d_is_directory() to d_can_lookup().

Signed-off-by: Miklos Szeredi
Reviewed-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Miklos Szeredi
2014-09-18 00:19:27 +0800
9a7e59763 CIFS: Fix wrong directory attributes after rename ... Browse Code »

commit b46799a8f28c43c5264ac8d8ffa28b311b557e03 upstream.

When we requests rename we also need to update attributes
of both source and target parent directories. Not doing it
causes generic/309 xfstest to fail on SMB2 mounts. Fix this
by marking these directories for force revalidating.

Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:27 +0800
e6fc2f0c1 CIFS: Possible null ptr deref in SMB2_tcon ... Browse Code »

commit 18f39e7be0121317550d03e267e3ebd4dbfbb3ce upstream.

As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
and there is one path in which tcon can be null.

Signed-off-by: Steve French
Reported-by: Raphael Geissert
Signed-off-by: Greg Kroah-Hartman

Steve French
2014-09-18 00:19:26 +0800
d97cad531 CIFS: Fix async reading on reconnects ... Browse Code »

commit 038bc961c31b070269ecd07349a7ee2e839d4fec upstream.

If we get into read_into_pages() from cifs_readv_receive() and then
loose a network, we issue cifs_reconnect that moves all mids to
a private list and issue their callbacks. The callback of the async
read request sets a mid to retry, frees it and wakes up a process
that waits on the rdata completion.

After the connection is established we return from read_into_pages()
with a short read, use the mid that was freed before and try to read
the remaining data from the a newly created socket. Both actions are
not what we want to do. In reconnect cases (-EAGAIN) we should not
mask off the error with a short read but should return the error
code instead.

Acked-by: Jeff Layton
Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:26 +0800
2b0111626 CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2 ... Browse Code »

commit 21496687a79424572f46a84c690d331055f4866f upstream.

The existing mapping causes unlink() call to return error after delete
operation. Changing the mapping to -EACCES makes the client process
the call like CIFS protocol does - reset dos attributes with ATTR_READONLY
flag masked off and retry the operation.

Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2014-09-18 00:19:26 +0800
9956752af libceph: do not hard code max auth ticket len ... Browse Code »

commit c27a3e4d667fdcad3db7b104f75659478e0c68d8 upstream.

We hard code cephx auth ticket buffer size to 256 bytes. This isn't
enough for any moderate setups and, in case tickets themselves are not
encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but
ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the
buffer is allocated dynamically anyway, allocated it a bit later, at
the point where we know how much is going to be needed.

Fixes: http://tracker.ceph.com/issues/8979

Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
Signed-off-by: Greg Kroah-Hartman

Ilya Dryomov
2014-09-18 00:19:26 +0800
5083cbcde libceph: add process_one_ticket() helper ... Browse Code »

commit 597cda357716a3cf8d994cb11927af917c8d71fa upstream.

Add a helper for processing individual cephx auth tickets. Needed for
the next commit, which deals with allocating ticket buffers. (Most of
the diff here is whitespace - view with git diff -b).

Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
Signed-off-by: Greg Kroah-Hartman

Ilya Dryomov
2014-09-18 00:19:26 +0800
b15481d2e libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly ... Browse Code »

commit 5f740d7e1531099b888410e6bab13f68da9b1a4d upstream.

Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is /dev/null
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
# rbd_resize calls librbd rbd_resize(), size is in bytes
./rbd_resize bar $(((4 << 20) + 512))
rbd resize --size 10 bar
BAR_DEV=$(rbd map bar)
# trigger a 512-byte copyup -- 512-byte page array data item
dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing. The size_t cast is
unnecessary.

Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
Reviewed-by: Alex Elder
Signed-off-by: Greg Kroah-Hartman

Ilya Dryomov
2014-09-18 00:19:25 +0800
b7700e43d xfs: don't zero partial page cache pages during O_DIRECT writes ... Browse Code »

commit 85e584da3212140ee80fd047f9058bbee0bc00d5 upstream.

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

buffered reads will find an up to date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

Signed-off-by: Chris Mason
Reviewed-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Signed-off-by: Greg Kroah-Hartman

Chris Mason
2014-09-18 00:19:25 +0800
4d8d95c66 xfs: don't zero partial page cache pages during O_DIRECT writes ... Browse Code »

commit 834ffca6f7e345a79f6f2e2d131b0dfba8a4b67a upstream.

Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Signed-off-by: Greg Kroah-Hartman

Dave Chinner
2014-09-18 00:19:25 +0800
d02c3a1bc xfs: don't dirty buffers beyond EOF ... Browse Code »

commit 22e757a49cf010703fcb9c9b4ef793248c39b0c2 upstream.

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Signed-off-by: Greg Kroah-Hartman

Dave Chinner
2014-09-18 00:19:25 +0800
84cc5247a xfs: quotacheck leaves dquot buffers without verifiers ... Browse Code »

commit 5fd364fee81a7888af806e42ed8a91c845894f2d upstream.

When running xfs/305, I noticed that quotacheck was flushing dquot
buffers that did not have the xfs_dquot_buf_ops verifiers attached:

XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00 DQ....e.........
ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
Call Trace:
[] dump_stack+0x45/0x56
[] _xfs_buf_ioapply+0x3ca/0x3d0
[] ? wake_up_state+0x20/0x20
[] ? xfs_bdstrat_cb+0x55/0xb0
[] xfs_buf_iorequest+0x6b/0xd0
[] xfs_bdstrat_cb+0x55/0xb0
[] __xfs_buf_delwri_submit+0x15b/0x220
[] ? xfs_buf_delwri_submit+0x30/0x90
[] xfs_buf_delwri_submit+0x30/0x90
[] xfs_qm_quotacheck+0x17d/0x3c0
[] xfs_qm_mount_quotas+0x151/0x1e0
[] xfs_mountfs+0x56c/0x7d0
[] xfs_fs_fill_super+0x2c2/0x340
[] mount_bdev+0x194/0x1d0
[] ? xfs_finish_flags+0x170/0x170
[] xfs_fs_mount+0x15/0x20
[] mount_fs+0x39/0x1b0
[] vfs_kern_mount+0x67/0x120
[] do_mount+0x23e/0xad0
[] ? __get_free_pages+0xe/0x50
[] ? copy_mount_options+0x36/0x150
[] SyS_mount+0x83/0xc0
[] tracesys+0xdd/0xe2

This was caused by dquot buffer readahead not attaching a verifier
structure to the buffer when readahead was issued, resulting in the
followup read of the buffer finding a valid buffer and so not
attaching new verifiers to the buffer as part of the read.

Also, when a verifier failure occurs, we then read the buffer
without verifiers. Attach the verifiers manually after this read so
that if the buffer is then written it will be verified that the
corruption has been repaired.

Further, when flushing a dquot we don't ask for a verifier when
reading in the dquot buffer the dquot belongs to. Most of the time
this isn't an issue because the buffer is still cached, but when it
is not cached it will result in writing the dquot buffer without
having the verfier attached.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Signed-off-by: Greg Kroah-Hartman

Dave Chinner
2014-09-18 00:19:25 +0800
169493da2 xfs: ensure verifiers are attached to recovered buffers ... Browse Code »

commit 67dc288c21064b31a98a53dc64f6b9714b819fd6 upstream.

Crash testing of CRC enabled filesystems has resulted in a number of
reports of bad CRCs being detected after the filesystem was mounted.
Errors such as the following were being seen:

XFS (sdb3): Mounting V5 Filesystem
XFS (sdb3): Starting recovery (logdev: internal)
XFS (sdb3): Metadata CRC error detected at xfs_agf_read_verify+0x5a/0x100 [xfs], block 0x1
XFS (sdb3): Unmount and run xfs_repair
XFS (sdb3): First 64 bytes of corrupted metadata buffer:
ffff880136ffd600: 58 41 47 46 00 00 00 01 00 00 00 00 00 0f aa 40 XAGF...........@
ffff880136ffd610: 00 02 6d 53 00 02 77 f8 00 00 00 00 00 00 00 01 ..mS..w.........
ffff880136ffd620: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 03 ................
ffff880136ffd630: 00 00 00 04 00 08 81 d0 00 08 81 a7 00 00 00 00 ................
XFS (sdb3): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 74 numblks 1

The errors were typically being seen in AGF, AGI and their related
btree block buffers some time after log recovery had run. Often it
wasn't until later subsequent mounts that the problem was
discovered. The common symptom was a buffer with the correct
contents, but a CRC and an LSN that matched an older version of the
contents.

Some debug added to _xfs_buf_ioapply() indicated that buffers were
being written without verifiers attached to them from log recovery,
and Jan Kara isolated the cause to log recovery readahead an dit's
interactions with buffers that had a more recent LSN on disk than
the transaction being recovered. In this case, the buffer did not
get a verifier attached, and os when the second phase of log
recovery ran and recovered EFIs and unlinked inodes, the buffers
were modified and written without the verifier running. Hence they
had up to date contents, but stale LSNs and CRCs.

Fix it by attaching verifiers to buffers we skip due to future LSN
values so they don't escape into the buffer cache without the
correct verifier attached.

This patch is based on analysis and a patch from Jan Kara.

Reported-by: Jan Kara
Reported-by: Fanael Linithien
Reported-by: Grozdan
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Signed-off-by: Greg Kroah-Hartman

Dave Chinner
2014-09-18 00:19:24 +0800
274a25326 RDMA/uapi: Include socket.h in rdma_user_cm.h ... Browse Code »

commit db1044d458a287c18c4d413adc4ad12e92e253b5 upstream.

added struct sockaddr_storage to rdma_user_cm.h without also adding an
include for linux/socket.h to make sure it is defined. Systemtap
needs the header files to build standalone and cannot rely on other
files to pre-include other headers, so add linux/socket.h to the list
of includes in this file.

Fixes: ee7aed4528f ("RDMA/ucma: Support querying for AF_IB addresses")
Signed-off-by: Doug Ledford
Signed-off-by: Roland Dreier
Signed-off-by: Greg Kroah-Hartman

Doug Ledford
2014-09-18 00:19:24 +0800
f4764072d RDMA/iwcm: Use a default listen backlog if needed ... Browse Code »

commit 2f0304d21867476394cd51a54e97f7273d112261 upstream.

If the user creates a listening cm_id with backlog of 0 the IWCM ends
up not allowing any connection requests at all. The correct behavior
is for the IWCM to pick a default value if the user backlog parameter
is zero.

Lustre from version 1.8.8 onward uses a backlog of 0, which breaks
iwarp support without this fix.

Signed-off-by: Steve Wise
Signed-off-by: Roland Dreier
Signed-off-by: Greg Kroah-Hartman

Steve Wise
2014-09-18 00:19:24 +0800
fdb396bd3 md/raid10: Fix memory leak when raid10 reshape completes. ... Browse Code »

commit b39685526f46976bcd13aa08c82480092befa46c upstream.

When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed. But not
when the reshape completes.
This can result in a small memory leak.

There is a subtle side-effect of this bug. When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space. This "resync" will use the buffer
space which was allocated for "reshape". This can cause problems
including a "BUG" in the SCSI layer. So this is suitable for -stable.

Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6
Signed-off-by: NeilBrown
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-09-18 00:19:24 +0800
bbbd1b12a md/raid10: fix memory leak when reshaping a RAID10. ... Browse Code »

commit ce0b0a46955d1bb389684a2605dbcaa990ba0154 upstream.

raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't. This results in a
memory leak.

So change to used the approved method of clearing unwanted bits.

As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.

Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-09-18 00:19:23 +0800
d115e4529 md/raid6: avoid data corruption during recovery of double-degraded RAID6 ... Browse Code »

commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream.

During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
Cc: Yuri Tikhonov
Cc: Dan Williams
Reported-by: "Manibalan P"
Tested-by: "Manibalan P"
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown
Acked-by: Dan Williams
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-09-18 00:19:23 +0800
cd1d0eb54 md/raid1,raid10: always abort recover on write error. ... Browse Code »

commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream.

Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).

This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices). In this case
the bitmap bit will be cleared, but it really shouldn't.

The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.

If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.

As the bug can result in data corruption the patch is suitable for
-stable. For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.

Original-from: jiao hui
Reported-and-tested-by: jiao hui
Signed-off-by: NeilBrown
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-09-18 00:19:23 +0800
73fc917ea fix copy_tree() regression ... Browse Code »

commit 12a5b5294cb1896e9a3c9fca8ff5a7e3def4e8c6 upstream.

Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
by C), copy_tree() of A would make a copy of D' shadow the the copy of
C, not the other way around.

It's easy to fix, fortunately - just make sure that mount follows
the one that shadows it in mnt_child as well as in mnt_hash, and when
copy_tree() decides to attach a new mount, check if the last child
it has added to the same parent should be shadowing the new one.
And if it should, just use the same logics commit_tree() has - put the
new mount into the hash and children lists right after the one that
should shadow it.

Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2014-09-18 00:19:23 +0800
37fada67e Bluetooth: Avoid use of session socket after the session gets freed ... Browse Code »

commit 32333edb82fb2009980eefc5518100068147ab82 upstream.

The commits 08c30aca9e698faddebd34f81e1196295f9dc063 "Bluetooth: Remove
RFCOMM session refcnt" and 8ff52f7d04d9cc31f1e81dcf9a2ba6335ed34905
"Bluetooth: Return RFCOMM session ptrs to avoid freed session"
allow rfcomm_recv_ua and rfcomm_session_close to delete the session
(and free the corresponding socket) and propagate NULL session pointer
to the upper callers.

Additional fix is required to terminate the loop in rfcomm_process_rx
function to avoid use of freed 'sk' memory.

The issue is only reproducible with kernel option CONFIG_PAGE_POISONING
enabled making freed memory being changed and filled up with fixed char
value used to unmask use-after-free issues.

Signed-off-by: Vignesh Raman
Signed-off-by: Vitaly Kuzmichev
Acked-by: Dean Jenkins
Signed-off-by: Marcel Holtmann
Signed-off-by: Greg Kroah-Hartman

Vignesh Raman
2014-09-18 00:19:23 +0800
333193695 Bluetooth: never linger on process exit ... Browse Code »

commit 093facf3634da1b0c2cc7ed106f1983da901bbab upstream.

If the current process is exiting, lingering on socket close will make
it unkillable, so we should avoid it.

Reproducer:

#include
#include

#define BTPROTO_L2CAP 0
#define BTPROTO_SCO 2
#define BTPROTO_RFCOMM 3

int main()
{
int fd;
struct linger ling;

fd = socket(PF_BLUETOOTH, SOCK_STREAM, BTPROTO_RFCOMM);
//or: fd = socket(PF_BLUETOOTH, SOCK_DGRAM, BTPROTO_L2CAP);
//or: fd = socket(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_SCO);

ling.l_onoff = 1;
ling.l_linger = 1000000000;
setsockopt(fd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling));

return 0;
}

Signed-off-by: Vladimir Davydov
Signed-off-by: Marcel Holtmann
Signed-off-by: Greg Kroah-Hartman

Vladimir Davydov
2014-09-18 00:19:22 +0800
247640a32 Bluetooth: btmrvl: wait for HOST_SLEEP_ENABLE event in suspend ... Browse Code »

commit 396e04f4bb9afefb0744715dc76d9abe18ee5fb0 upstream.

After BT_CMD_HOST_SLEEP_ENABLE command finishes, driver should
wait until getting BT_EVENT_HOST_SLEEP_ENABLE event to complete
suspend procedure.
Without this patch the suspend handler would return success
earlier. By the time when the BT_EVENT_HOST_SLEEP_ENABLE event
comes in the controller driver could have already turned off the
bus clock. This causes kernel crash or system reboot eventually.

Signed-off-by: Chin-Ran Lo
Signed-off-by: Jeff CF Chen
Signed-off-by: Amitkumar Karwar
Signed-off-by: Bing Zhao
Signed-off-by: Marcel Holtmann
Signed-off-by: Greg Kroah-Hartman

Chin-Ran Lo
2014-09-18 00:19:22 +0800
30f6a1fb4 fix EBUSY on umount() from MNT_SHRINKABLE ... Browse Code »

commit 81b6b06197606b4bef4e427a197aeb808e8d89e1 upstream.

We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().

Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2014-09-18 00:19:22 +0800
c532b9ce3 get rid of propagate_umount() mistakenly treating slaves as busy. ... Browse Code »

commit 88b368f27a094277143d8ecd5a056116f6a41520 upstream.

The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.

Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2014-09-18 00:19:22 +0800
93927a247 mnt: Add tests for unprivileged remount cases that have found to be faulty ... Browse Code »

commit db181ce011e3c033328608299cd6fac06ea50130 upstream.

Kenton Varda discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.

Upon review of the code in remount it was discovered that the code allowed
nosuid, noexec, and nodev to be cleared. It was also discovered that
the code was allowing the per mount atime flags to be changed.

The first naive patch to fix these issues contained the flaw that using
default atime settings when remounting a filesystem could be disallowed.

To avoid this problems in the future add tests to ensure unprivileged
remounts are succeeding and failing at the appropriate times.

Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2014-09-18 00:19:22 +0800
74006d6e9 mnt: Change the default remount atime from relatime to the existing value ... Browse Code »

commit ffbc6f0ead47fa5a1dc9642b0331cb75c20a640e upstream.

Since March 2009 the kernel has treated the state that if no
MS_..ATIME flags are passed then the kernel defaults to relatime.

Defaulting to relatime instead of the existing atime state during a
remount is silly, and causes problems in practice for people who don't
specify any MS_...ATIME flags and to get the default filesystem atime
setting. Those users may encounter a permission error because the
default atime setting does not work.

A default that does not work and causes permission problems is
ridiculous, so preserve the existing value to have a default
atime setting that is always guaranteed to work.

Using the default atime setting in this way is particularly
interesting for applications built to run in restricted userspace
environments without /proc mounted, as the existing atime mount
options of a filesystem can not be read from /proc/mounts.

In practice this fixes user space that uses the default atime
setting on remount that are broken by the permission checks
keeping less privileged users from changing more privileged users
atime settings.

Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2014-09-18 00:19:21 +0800
92ecaf878 mnt: Correct permission checks in do_remount ... Browse Code »

commit 9566d6742852c527bf5af38af5cbb878dad75705 upstream.

While invesgiating the issue where in "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.

In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked. These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changable if set by a more privileged user.

The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev may not be clearable by a less privielged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.

The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled. Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.

The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.

Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.

Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2014-09-18 00:19:21 +0800