Eric Lee / smarc-fsl-linux-kernel

20 Jul, 2012

4 commits

4234dc62e fix build error ...

Signed-off-by: Jason Liu

Jason Liu
2012-07-20 13:18:44 +0800
d19bf1528 ENGR00162198 [MX6q_ARM2]filesystem: Kernel dump if unplug SD card during bonnie ...

Add a pointer check before accessing, to fix the following problem:

EXT3-fs (mmcblk1p2): error: remounting filesystem read-only
Unable to handle kernel NULL pointer dereference at virtual address 00000010
pgd = df334000
[00000010] *pgd=71e85831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] PREEMPT SMP
last sysfs file: /sys/devices/platform/sdhci-esdhc-imx.2/mmc_host/mmc1/mmc1:b368/serial
Modules linked in: ahci_platform ov3640_camera libahci libata
CPU: 1 Not tainted (2.6.38-daily-00808-g43b3e87 #1)
PC is at __mark_inode_dirty+0xc8/0x1b4
LR is at __mark_inode_dirty+0xb8/0x1b4
pc : [] lr : [] psr: 20000013
sp : df14dde0 ip : 00000062 fp : 00000000
r10: 003d2000 r9 : df14df38 r8 : 00000000
r7 : 4ec22acb r6 : 00000003 r5 : 00000000 r4 : e028c720
r3 : 00000001 r2 : 00000065 r1 : 804fe50c r0 : 00000001

Signed-off-by: Tony Lin

Tony Lin
2012-07-20 13:17:58 +0800
a4f52314a ENGR00069937 Community patch for Fix mount error in case of MLC flash ...

Even though we don't use the OOB for MLC nand flash,
we should use the bad block information to skip the bad block.
Patch url:
http://patchwork.ozlabs.org/linux-mtd/patch?q=mlc&filter=none&id=15477
Author: Kyungmin Park

Signed-off-by: Jason Liu

Jason Liu
2012-07-20 13:09:06 +0800
29dc869eb ENGR00068619 JFFS2 community fix with not use OOB ...

JFFS2 community fix to not use the OOB area on MLC NAND. This patch
comes from the MTD community.

Signed-off-by: Jason Liu

Jason Liu
2012-07-20 13:09:05 +0800

18 Jun, 2012

1 commit

6140710c5 fuse: fix stat call on 32 bit platforms ...

commit 45c72cd73c788dd18c8113d4a404d6b4a01decf1 upstream.

Now we store attr->ino at inode->i_ino, return attr->ino at the
first time and then return inode->i_ino if the attribute timeout
isn't expired. That's wrong on 32 bit platforms because attr->ino
is 64 bit and inode->i_ino is 32 bit in this case.

Fix this by saving 64 bit ino in fuse_inode structure and returning
it every time we call getattr. Also squash attr->ino into inode->i_ino
explicitly.
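
A minimal user-space sketch of the squashing idea, not the fuse code itself. It assumes the 64-bit number is folded into 32 bits by XOR-ing its two halves; squash_ino(), struct ino_record and record_ino() are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/*
 * Fold a 64-bit inode number into 32 bits by XOR-ing the high half
 * into the low half, so both halves influence the result.
 * squash_ino() is a hypothetical name for this sketch.
 */
static uint32_t squash_ino(uint64_t ino64)
{
    return (uint32_t)ino64 ^ (uint32_t)(ino64 >> 32);
}

/* Hypothetical record: the full number is kept beside the folded one. */
struct ino_record {
    uint64_t orig_ino;  /* full value, reported by every getattr */
    uint32_t i_ino;     /* narrow value, used where only 32 bits fit */
};

static struct ino_record record_ino(uint64_t ino64)
{
    struct ino_record r = { .orig_ino = ino64, .i_ino = squash_ino(ino64) };
    return r;
}
```

Keeping orig_ino separate means the value handed back to user space no longer depends on whether the cached attributes have expired.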

Signed-off-by: Pavel Shilovsky
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

Pavel Shilovsky
2012-06-18 02:23:12 +0800

09 Jun, 2012

9 commits

749c8151f ext4: don't set i_flags in EXT4_IOC_SETFLAGS ...

commit b22b1f178f6799278d3178d894f37facb2085765 upstream.

Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to
change the i_flags automatically but fails to remove the error setting
of i_flags. So we still have the problem of trashing state flags.
Fix this by removing the assignment.

Signed-off-by: Tao Ma
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Tao Ma
2012-06-09 23:33:05 +0800
32e090b1f ext4: remove mb_groups before tearing down the buddy_cache ...

commit 95599968d19db175829fb580baa6b68939b320fb upstream.

We can't have references held on pages in the s_buddy_cache while we are
trying to truncate its pages and put the inode. All the pages must be
gone before we reach clear_inode. This can only be guaranteed if we
can prevent new users from grabbing references to s_buddy_cache's pages.

The original bug can be reproduced and the bug fix can be verified by:

while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \
umount /export/hda3/ram0; done &

while true; do cat /proc/fs/ext4/ram0/mb_groups; done

Signed-off-by: Salman Qazi
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Salman Qazi
2012-06-09 23:33:04 +0800
97434cf53 ext4: add ext4_mb_unload_buddy in the error path ...

commit 02b7831019ea4e7994968c84b5826fa8b248ffc8 upstream.

ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching
ext4_mb_unload_buddy when it fails a memory allocation.

Signed-off-by: Salman Qazi
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Salman Qazi
2012-06-09 23:33:04 +0800
eeb7cb57c ext4: don't trash state flags in EXT4_IOC_SETFLAGS ...

commit 79906964a187c405db72a3abc60eb9b50d804fbc upstream.

In commit 353eb83c we removed i_state_flags with 64-bit longs, but
when handling the EXT4_IOC_SETFLAGS ioctl, we replace i_flags
directly, which trashes the state flags which are stored in the high
32-bits of i_flags on 64-bit platforms. So use the
ext4_{set,clear}_inode_flags() functions which use atomic bit
manipulation functions instead.

Reported-by: Tao Ma
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Theodore Ts'o
2012-06-09 23:33:04 +0800
801bdd926 ext4: add missing save_error_info() to ext4_error() ...

commit f3fc0210c0fc91900766c995f089c39170e68305 upstream.

The ext4_error() function is missing a call to save_error_info().
Since this is the function which marks the file system as containing
an error, this oversight (which was introduced in 2.6.36) is quite
significant, and should be backported to older stable kernels with
high urgency.

Reported-by: Ken Sumrall
Signed-off-by: "Theodore Ts'o"
Cc: ksumrall@google.com
Signed-off-by: Greg Kroah-Hartman

Theodore Ts'o
2012-06-09 23:33:04 +0800
e36db7f81 ext4: force ro mount if ext4_setup_super() fails ...

commit 7e84b6216467b84cd332c8e567bf5aa113fd2f38 upstream.

If ext4_setup_super() fails, e.g. due to a too-high revision,
the error is logged in dmesg but the fs is not mounted RO as
indicated.

Tested by:

# mkfs.ext4 -r 4 /dev/sdb6
# mount /dev/sdb6 /mnt/test
# dmesg | grep "too high"
[164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode
# grep sdb6 /proc/mounts
/dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0

Reviewed-by: Andreas Dilger
Signed-off-by: Eric Sandeen
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2012-06-09 23:33:04 +0800
6baeff72b vfs: umount_tree() might be called on subtree that had never made it ...

commit 63d37a84ab6004c235314ffd7a76c5eb28c2fae0 upstream.

__mnt_make_shortterm() in there undoes the effect of __mnt_make_longterm()
we'd done back when we set ->mnt_ns non-NULL; it should not be done to
vfsmounts that had never gone through commit_tree() and friends. Kudos to
lczerner for catching that one...

Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-06-09 23:33:01 +0800
6ab590251 NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO ...

commit fb13bfa7e1bcfdcfdece47c24b62f1a1cad957e9 upstream.

If a file OPEN is denied due to a share lock, the resulting
NFS4ERR_SHARE_DENIED is currently mapped to the default EIO.
This patch adds a more appropriate mapping, and brings Linux
into line with what Solaris 10 does.

See https://bugzilla.kernel.org/show_bug.cgi?id=43286

Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-06-09 23:32:59 +0800
93b715235 cifs: fix oops while traversing open file list (try #4) ...

commit 2c0c2a08bed7a3b791f88d09d16ace56acb3dd98 upstream.

While traversing the linked list of open file handles, if the identified
file handle is invalid, a reopen is attempted and if it fails, we
resume traversing where we stopped and cifs can oops while accessing
an invalid next element, because the list might have changed.

So mark the invalid file handle and attempt reopen if no
valid file handle is found in rest of the list.
If reopen fails, move the invalid file handle to the end of the list
and start traversing the list again from the beginning.
Repeat this four times before giving up and returning an error if
file reopen keeps failing.

Signed-off-by: Shirish Pargaonkar
Reviewed-by: Jeff Layton
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Shirish Pargaonkar
2012-06-09 23:32:58 +0800

01 Jun, 2012

2 commits

2ec196c97 vfs: make AIO use the proper rw_verify_area() area helpers ...

commit a70b52ec1aaeaf60f4739edb1b422827cb6f3893 upstream.

We had for some reason overlooked the AIO interface, and it didn't use
the proper rw_verify_area() helper function that checks (for example)
mandatory locking on the file, and that the size of the access doesn't
cause us to overflow the provided offset limits etc.

Instead, AIO did just the security_file_permission() thing (that
rw_verify_area() also does) directly.

This fixes it to do all the proper helper functions, which not only
means that now mandatory file locking works with AIO too, we can
actually remove lines of code.

Reported-by: Manish Honap
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2012-06-01 15:12:53 +0800
37a845777 block: don't mark buffers beyond end of disk as mapped ...

commit 080399aaaf3531f5b8761ec0ac30ff98891e8686 upstream.

Hi,

We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk. It can
easily be reproduced by doing the following:

[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s

In dmesg, you'll find the following:

squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 43.106012] attempt to access beyond end of device
[ 43.106029] loop0: rw=0, want=277410, limit=277408
[ 43.106039] Buffer I/O error on device loop0, logical block 138704
[ 43.106053] attempt to access beyond end of device
[ 43.106057] loop0: rw=0, want=277412, limit=277408
[ 43.106061] Buffer I/O error on device loop0, logical block 138705
[ 43.106066] attempt to access beyond end of device
[ 43.106070] loop0: rw=0, want=277414, limit=277408
[ 43.106073] Buffer I/O error on device loop0, logical block 138706
[ 43.106078] attempt to access beyond end of device
[ 43.106081] loop0: rw=0, want=277416, limit=277408
[ 43.106085] Buffer I/O error on device loop0, logical block 138707
[ 43.106089] attempt to access beyond end of device
[ 43.106093] loop0: rw=0, want=277418, limit=277408
[ 43.106096] Buffer I/O error on device loop0, logical block 138708
[ 43.106101] attempt to access beyond end of device
[ 43.106104] loop0: rw=0, want=277420, limit=277408
[ 43.106108] Buffer I/O error on device loop0, logical block 138709
[ 43.106112] attempt to access beyond end of device
[ 43.106116] loop0: rw=0, want=277422, limit=277408
[ 43.106120] Buffer I/O error on device loop0, logical block 138710
[ 43.106124] attempt to access beyond end of device
[ 43.106128] loop0: rw=0, want=277424, limit=277408
[ 43.106131] Buffer I/O error on device loop0, logical block 138711
[ 43.106135] attempt to access beyond end of device
[ 43.106139] loop0: rw=0, want=277426, limit=277408
[ 43.106143] Buffer I/O error on device loop0, logical block 138712
[ 43.106147] attempt to access beyond end of device
[ 43.106151] loop0: rw=0, want=277428, limit=277408
[ 43.106154] Buffer I/O error on device loop0, logical block 138713
[ 43.106158] attempt to access beyond end of device
[ 43.106162] loop0: rw=0, want=277430, limit=277408
[ 43.106166] attempt to access beyond end of device
[ 43.106169] loop0: rw=0, want=277432, limit=277408
...
[ 43.106307] attempt to access beyond end of device
[ 43.106311] loop0: rw=0, want=277470, limit=2774

Squashfs manages to read in the end block(s) of the disk during the
mount operation. Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped. Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above. I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.
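
The bounds check described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel patch: block_inside_device() is a hypothetical helper, and the 1 KiB block size used in the note after it is inferred from the log (138704 * 1024 = 142032896, the size of install.img).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when `block` lies inside a device of `size_bytes` bytes,
 * given a block size of (1 << blkbits) bytes. Only such blocks should
 * be treated as mapped; anything at or past end_block is beyond the
 * end of the device.
 */
static bool block_inside_device(uint64_t block, uint64_t size_bytes,
                                unsigned int blkbits)
{
    uint64_t end_block = size_bytes >> blkbits; /* first block past the end */

    return block < end_block;
}
```

With size_bytes = 142032896 and blkbits = 10, block 138703 is the last one inside the device, and 138704 is the first one outside it, which matches the first "Buffer I/O error" line in the dmesg output.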

Cheers,
Jeff

Signed-off-by: Jeff Moyer
Acked-by: Nick Piggin

--

Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Jeff Moyer
2012-06-01 15:12:52 +0800

22 May, 2012

6 commits

fcb2c2e95 wake up s_wait_unfrozen when ->freeze_fs fails ...

commit e1616300a20c80396109c1cf013ba9a36055a3da upstream.

dd slept infinitely when fsfreeze failed because of EIO.
To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
the tasks waiting for the filesystem to become unfrozen.

When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.

However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
freeze_super() returns an error number. In this case, FITHAW ioctl returns
EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.

Signed-off-by: Kazuya Mio
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Kazuya Mio
2012-05-22 00:40:05 +0800
1a28fbbeb ext4: fix error handling on inode bitmap corruption ...

commit acd6ad83517639e8f09a8c5525b1dccd81cd2a10 upstream.

When insert_inode_locked() fails in ext4_new_inode() it most likely means inode
bitmap got corrupted and we allocated again inode which is already in use. Also
doing unlock_new_inode() during error recovery is wrong since the inode does
not have I_NEW set. Fix the problem by jumping to fail: (instead of fail_drop:)
which declares filesystem error and does not call unlock_new_inode().

Signed-off-by: Jan Kara
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2012-05-22 00:40:04 +0800
8e8a21270 ext3: Fix error handling on inode bitmap corruption ...

commit 1415dd8705394399d59a3df1ab48d149e1e41e77 upstream.

When insert_inode_locked() fails in ext3_new_inode() it most likely
means inode bitmap got corrupted and we allocated again inode which
is already in use. Also doing unlock_new_inode() during error recovery
is wrong since inode does not have I_NEW set. Fix the problem by jumping
to fail: (instead of fail_drop:) which declares filesystem error and
does not call unlock_new_inode().

Reviewed-by: Eric Sandeen
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2012-05-22 00:40:04 +0800
19165bdbb NFSv4: Revalidate uid/gid after open ...

This is a shorter (and more appropriate for stable kernels) analog to
the following upstream commit:

commit 6926afd1925a54a13684ebe05987868890665e2b
Author: Trond Myklebust
Date: Sat Jan 7 13:22:46 2012 -0500

NFSv4: Save the owner/group name string when doing open

...so that we can do the uid/gid mapping outside the asynchronous RPC
context.
This fixes a bug in the current NFSv4 atomic open code where the client
isn't able to determine what the true uid/gid fields of the file are,
(because the asynchronous nature of the OPEN call denies it the ability
to do an upcall) and so fills them with default values, marking the
inode as needing revalidation.
Unfortunately, in some cases, the VFS will do some additional sanity
checks on the file, and may override the server's decision to allow
the open because it sees the wrong owner/group fields.

Signed-off-by: Trond Myklebust

Without this patch, logging into two different machines with home
directories mounted over NFS4 and then running "vim" and typing ":q"
in each reliably produces the following error on the second machine:

E137: Viminfo file is not writable: /users/system/rtheys/.viminfo

This regression was introduced by 80e52aced138 ("NFSv4: Don't do
idmapper upcalls for asynchronous RPC calls", merged during the 2.6.32
cycle) --- after the OPEN call, .viminfo has the default values for
st_uid and st_gid (0xfffffffe) cached because we do not want to let
rpciod wait for an idmapper upcall to fill them in.

The fix used in mainline is to save the owner and group as strings and
perform the upcall in _nfs4_proc_open outside the rpciod context,
which takes about 600 lines. For stable, we can do something similar
with a one-liner: make open check for the stale fields and make a
(synchronous) GETATTR call to fill them when needed.

Trond dictated the patch, I typed it in, and Rik tested it.

Addresses http://bugs.debian.org/659111 and
https://bugzilla.redhat.com/789298

Reported-by: Rik Theys
Explained-by: David Flyn
Signed-off-by: Jonathan Nieder
Tested-by: Rik Theys
Signed-off-by: Greg Kroah-Hartman

Jonathan Nieder
2012-05-22 00:40:04 +0800
797c09ed3 ext4: avoid deadlock on sync-mounted FS w/o journal ...

commit c1bb05a657fb3d8c6179a4ef7980261fae4521d7 upstream.

Processes hang forever on a sync-mounted ext2 file system that
is mounted with the ext4 module (default in Fedora 16).

I can reproduce this reliably by mounting an ext2 partition with
"-o sync" and opening a new file on that partition with vim. vim
will hang in "D" state forever. The same happens on ext4 without
a journal.

I am attaching a small patch here that solves this issue for me.
In the sync mounted case without a journal,
ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which
can't be called with buffer lock held.

Also move mb_cache_entry_release inside lock to avoid race
fixed previously by 8a2bfdcb ext[34]: EA block reference count racing fix
Note too that ext2 fixed this same problem in 2006 with
b2f49033 [PATCH] fix deadlock in ext2

Signed-off-by: Martin.Wilck@ts.fujitsu.com
[sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg]
Signed-off-by: Eric Sandeen
Signed-off-by: "Theodore Ts'o"
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2012-05-22 00:40:04 +0800
1541d27bd jffs2: Fix lock acquisition order bug in gc path ...

commit 226bb7df3d22bcf4a1c0fe8206c80cc427498eae upstream.

The locking policy is such that the erase_complete_block spinlock is
nested within the alloc_sem mutex. This fixes a case in which the
acquisition order was erroneously reversed. This issue was caught by
the following lockdep splat:

=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.5 #1
-------------------------------------------------------
jffs2_gcd_mtd6/299 is trying to acquire lock:
(&c->alloc_sem){+.+.+.}, at: [] jffs2_garbage_collect_pass+0x314/0x890

but task is already holding lock:
(&(&c->erase_completion_lock)->rlock){+.+...}, at: [] jffs2_garbage_collect_pass+0x308/0x890

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&c->erase_completion_lock)->rlock){+.+...}:
[] validate_chain+0xe6c/0x10bc
[] __lock_acquire+0x54c/0xba4
[] lock_acquire+0xa4/0x114
[] _raw_spin_lock+0x3c/0x4c
[] jffs2_garbage_collect_pass+0x4c/0x890
[] jffs2_garbage_collect_thread+0x1b4/0x1cc
[] kthread+0x98/0xa0
[] kernel_thread_exit+0x0/0x8

-> #0 (&c->alloc_sem){+.+.+.}:
[] print_circular_bug+0x70/0x2c4
[] validate_chain+0x1034/0x10bc
[] __lock_acquire+0x54c/0xba4
[] lock_acquire+0xa4/0x114
[] mutex_lock_nested+0x74/0x33c
[] jffs2_garbage_collect_pass+0x314/0x890
[] jffs2_garbage_collect_thread+0x1b4/0x1cc
[] kthread+0x98/0xa0
[] kernel_thread_exit+0x0/0x8

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&c->erase_completion_lock)->rlock);
                               lock(&c->alloc_sem);
                               lock(&(&c->erase_completion_lock)->rlock);
  lock(&c->alloc_sem);

*** DEADLOCK ***

1 lock held by jffs2_gcd_mtd6/299:
#0: (&(&c->erase_completion_lock)->rlock){+.+...}, at: [] jffs2_garbage_collect_pass+0x308/0x890

stack backtrace:
[] (unwind_backtrace+0x0/0x100) from [] (dump_stack+0x20/0x24)
[] (dump_stack+0x20/0x24) from [] (print_circular_bug+0x1c8/0x2c4)
[] (print_circular_bug+0x1c8/0x2c4) from [] (validate_chain+0x1034/0x10bc)
[] (validate_chain+0x1034/0x10bc) from [] (__lock_acquire+0x54c/0xba4)
[] (__lock_acquire+0x54c/0xba4) from [] (lock_acquire+0xa4/0x114)
[] (lock_acquire+0xa4/0x114) from [] (mutex_lock_nested+0x74/0x33c)
[] (mutex_lock_nested+0x74/0x33c) from [] (jffs2_garbage_collect_pass+0x314/0x890)
[] (jffs2_garbage_collect_pass+0x314/0x890) from [] (jffs2_garbage_collect_thread+0x1b4/0x1cc)
[] (jffs2_garbage_collect_thread+0x1b4/0x1cc) from [] (kthread+0x98/0xa0)
[] (kthread+0x98/0xa0) from [] (kernel_thread_exit+0x0/0x8)

This was introduced in '81cfc9f jffs2: Fix serious write stall due to erase'.

Signed-off-by: Josh Cartwright
Signed-off-by: Artem Bityutskiy
Signed-off-by: David Woodhouse
Signed-off-by: Greg Kroah-Hartman

Josh Cartwright
2012-05-22 00:40:03 +0800

07 May, 2012

9 commits

879295392 hfsplus: Fix potential buffer overflows ...

commit 6f24f892871acc47b40dd594c63606a17c714f77 upstream.

Commit ec81aecb2966 ("hfs: fix a potential buffer overflow") fixed a few
potential buffer overflows in the hfs filesystem. But as Timo Warns
pointed out, these changes also need to be made on the hfsplus
filesystem as well.

Reported-by: Timo Warns
Acked-by: WANG Cong
Cc: Alexey Khoroshilov
Cc: Miklos Szeredi
Cc: Sage Weil
Cc: Eugene Teo
Cc: Roman Zippel
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Alexey Dobriyan
Cc: Dave Anderson
Cc: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Linus Torvalds

Greg Kroah-Hartman
2012-05-07 23:56:50 +0800
70403b35a autofs: make the autofsv5 packet file descriptor use a packetized pipe ...

commit 64f371bc3107e69efce563a3d0f0e6880de0d537 upstream.

The autofs packet size has had a very unfortunate size problem on x86:
because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
because the packet data was not 8-byte aligned, the size of the autofsv5
packet structure differed between 32-bit and 64-bit modes despite
looking otherwise identical (300 vs 304 bytes respectively).
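
The size difference can be reproduced in user space. The sketch below is illustrative only: the struct and typedef names are hypothetical, the field list is modelled on the autofs v5 packet rather than copied from a header, and NAME_LEN = 256 is an assumption. It relies on the GCC/Clang aligned attribute on a typedef to mimic the 4-byte alignment of a 64-bit integer on 32-bit x86.

```c
#include <stdint.h>

#define NAME_LEN 256  /* assumed: NAME_MAX + 1 */

/* A 64-bit integer that only asks for 4-byte alignment, as on 32-bit x86. */
typedef uint64_t __attribute__((aligned(4))) u64_align4;

/* Layout as a 32-bit x86 build would see it: 44 bytes of header + 256. */
struct packet_align4 {
    int32_t    proto_version;
    int32_t    type;
    uint32_t   wait_queue_token;
    uint32_t   dev;
    u64_align4 ino;
    uint32_t   uid, gid, pid, tgid;
    uint32_t   len;
    char       name[NAME_LEN];
};

/* Same fields with the platform's natural alignment for uint64_t. */
struct packet_natural {
    int32_t  proto_version;
    int32_t  type;
    uint32_t wait_queue_token;
    uint32_t dev;
    uint64_t ino;
    uint32_t uid, gid, pid, tgid;
    uint32_t len;
    char     name[NAME_LEN];
};
```

sizeof(struct packet_align4) is 300. Where uint64_t is 8-byte aligned, as on x86-64, the whole of struct packet_natural must be a multiple of 8 bytes, so four bytes of tail padding bring it to 304.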

We first fixed that up by making the 64-bit compat mode know about this
problem in commit a32744d4abae ("autofs: work around unhappy compat
problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
64-bit kernel because everything then worked the same way as on a 32-bit
kernel.

But it turned out that 'automount' had actually known and worked around
this problem in user space, so fixing the kernel to do the proper 32-bit
compatibility handling actually *broke* 32-bit automount on a 64-bit
kernel, because it knew that the packet sizes were wrong and expected
those incorrect sizes.

As a result, we ended up reverting that compatibility mode fix, and
thus breaking systemd again, in commit fcbf94b9dedd.

With both automount and systemd doing a single read() system call, and
verifying that they get *exactly* the size they expect but using
different sizes, it seemed that fixing one of them would inevitably
break the other. At one point, a patch I seriously considered applying
from Michael Tokarev did a "strcmp()" to see if it was automount that
was doing the operation. Ugly, ugly.

However, a prettier solution exists now thanks to the packetized pipe
mode. By marking the communication pipe as being packetized (by simply
setting the O_DIRECT flag), we can always just write the bigger packet
size, and if user-space does a smaller read, it will just get that
partial end result and the extra alignment padding will simply be thrown
away.

This makes both automount and systemd happy, since they now get the size
they asked for, and the kernel side of autofs simply no longer needs to
care - it could pad out the packet arbitrarily.

Of course, if there is some *other* user of autofs (please, please,
please tell me it ain't so - and we haven't heard of any) that tries to
read the packets with multiple reads, that other user will now be
broken - the whole point of the packetized mode is that one system call
gets exactly one packet, and you cannot read a packet in pieces.

Tested-by: Michael Tokarev
Cc: Alan Cox
Cc: David Miller
Cc: Ian Kent
Cc: Thomas Meyer
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2012-05-07 23:56:37 +0800
beed6c2e0 pipes: add a "packetized pipe" mode for writing ...

commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.

The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
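
A small user-space sketch of the interface described above, assuming a Linux system whose C library exposes pipe2() and O_DIRECT under _GNU_SOURCE. first_read_size() is a hypothetical helper written for this illustration.

```c
#define _GNU_SOURCE           /* for pipe2() and O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

/*
 * Write two packets into an O_DIRECT pipe, then read once with a
 * buffer large enough for both. Returns the byte count of that first
 * read, or -1 on error (for example when pipe2() rejects O_DIRECT on
 * a kernel without packetized pipes).
 */
static ssize_t first_read_size(void)
{
    int fd[2];
    char buf[64];
    ssize_t n = -1;

    if (pipe2(fd, O_DIRECT) != 0)
        return -1;

    /* Each write() becomes a packet of its own: 5 bytes, then 6 bytes. */
    if (write(fd[1], "hello", 5) == 5 && write(fd[1], "world!", 6) == 6)
        n = read(fd[0], buf, sizeof(buf));

    close(fd[0]);
    close(fd[1]);
    return n;
}
```

On a kernel with packetized pipes the read is expected to stop at the first packet boundary and return 5. An ordinary pipe would merge both writes and hand back all 11 bytes in one read.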

Tested-by: Michael Tokarev
Cc: Alan Cox
Cc: David Miller
Cc: Ian Kent
Cc: Thomas Meyer
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2012-05-07 23:56:36 +0800
034199be7 nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails ...

commit 04da6e9d63427b2d0fd04766712200c250b3278f upstream.

nfsd_open() already returns an NFS error value; only vfs_test_lock()
result needs to be fed through nfserrno(). Broken by commit 55ef12
(nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT)
three years ago...

Signed-off-by: Al Viro
Signed-off-by: Jonathan Nieder
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-05-07 23:56:35 +0800
d2fd339e9 nfsd: fix b0rken error value for setattr on read-only mount ...

commit 96f6f98501196d46ce52c2697dd758d9300c63f5 upstream.

..._want_write() returns -EROFS on failure, _not_ an NFS error value.

Signed-off-by: Al Viro
Signed-off-by: Jonathan Nieder
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-05-07 23:56:35 +0800
d25895e8f Revert "autofs: work around unhappy compat problem on x86-64" ...

commit fcbf94b9dedd2ce08e798a99aafc94fec8668161 upstream.

This reverts commit a32744d4abae24572eff7269bc17895c41bd0085.

While that commit was technically the right thing to do, and made the
x86-64 compat mode work identically to native 32-bit mode (and thus
fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
turns out that the automount binaries had workarounds for this compat
problem.

Now, the workarounds are disgusting: doing an "uname()" to find out the
architecture of the kernel, and then comparing it for the 64-bit cases
and fixing up the size of the read() in automount for those. And they
were confused: it's not actually a generic 64-bit issue at all, it's
very much tied to just x86-64, which has different alignment for an
'u64' in 64-bit mode than in 32-bit mode.

But the end result is that fixing the compat layer actually breaks the
case of a 32-bit automount on a x86-64 kernel.

There are various approaches to fix this (including just doing a
"strcmp()" on current->comm and comparing it to "automount"), but I
think that I will do the one that teaches pipes about a special "packet
mode", which will allow user space to not have to care too deeply about
the padding at the end of the autofs packet.

That change will make the compat workaround unnecessary, so let's revert
it first, and get automount working again in compat mode. The
packetized pipes will then fix autofs for systemd.

Reported-and-requested-by: Michael Tokarev
Cc: Ian Kent
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2012-05-07 23:56:32 +0800
95cb2c603 NFSv4: Ensure that we check lock exclusive/shared type against open modes ...

commit 55725513b5ef9d462aa3e18527658a0362aaae83 upstream.

Since we may be simulating flock() locks using NFS byte range locks,
we can't rely on the VFS having checked the file open mode for us.

Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-05-07 23:56:31 +0800
03a9f1949 NFSv4: Ensure that the LOCK code sets exception->inode ...

commit 05ffe24f5290dc095f98fbaf84afe51ef404ccc5 upstream.

All callers of nfs4_handle_exception() that need to handle
NFS4ERR_OPENMODE correctly should set exception->inode

Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2012-05-07 23:56:31 +0800
cb2fee322 nfs: Enclose hostname in brackets when needed in nfs_do_root_mount ...

commit 98a2139f4f4d7b5fcc3a54c7fddbe88612abed20 upstream.

When hostname contains colon (e.g. when it is an IPv6 address) it needs
to be enclosed in brackets to make parsing of NFS device string possible.
Fix nfs_do_root_mount() to enclose hostname properly when needed. NFS code
actually does not need this as it does not parse the string passed by
nfs_do_root_mount() but the device string is exposed to userspace in
/proc/mounts.
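
The bracketing rule can be sketched in user space as below. format_devname() is a hypothetical helper, not the kernel function, and the "host:path" layout is an assumption made for the illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build a "host:path" device string in buf. A host containing a colon
 * (such as an IPv6 address) is wrapped in brackets so the separator
 * stays unambiguous. Returns the snprintf() result, i.e. the length
 * the full string needs.
 */
static int format_devname(char *buf, size_t len,
                          const char *host, const char *path)
{
    if (strchr(host, ':') != NULL)
        return snprintf(buf, len, "[%s]:%s", host, path);

    return snprintf(buf, len, "%s:%s", host, path);
}
```

For host "fe80::1" and path "/export" this produces "[fe80::1]:/export", while a plain name such as "server" is left unbracketed.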

CC: Josh Boyer
CC: Trond Myklebust
Signed-off-by: Jan Kara
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2012-05-07 23:56:31 +0800

28 Apr, 2012

9 commits

8d2228dd9 tcp: allow splice() to build full TSO packets ...

[ This combines upstream commit
2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]

vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() tries
to split these skbs into MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals on every page
except the last one, or on all pages if the user specified the
SPLICE_F_MORE splice() flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert
Cc: Nandita Dukkipati
Cc: Neal Cardwell
Cc: Tom Herbert
Cc: Yuchung Cheng
Cc: H.K. Jerry Chu
Cc: Maciej Żenczykowski
Cc: Mahesh Bandewar
Cc: Ilpo Järvinen
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2012-04-28 00:51:18 +0800
9740f1d82 lockd: fix the endianness bug ...

commit e847469bf77a1d339274074ed068d461f0c872bc upstream.

comparing be32 values for < is not doing the right thing...
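
A user-space illustration of the general problem, not the lockd code: values stored in big-endian order have to be converted to host order before an ordering comparison. be32_less() is a hypothetical helper; htonl() and ntohl() come from <arpa/inet.h>.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() */
#include <stdbool.h>
#include <stdint.h>

/* Compare two values that are held in network (big-endian) byte order. */
static bool be32_less(uint32_t a_be, uint32_t b_be)
{
    /* Convert to host order first, then compare numerically. */
    return ntohl(a_be) < ntohl(b_be);
}
```

On a little-endian host, htonl(255) is 0xFF000000 and htonl(256) is 0x00010000, so comparing the raw values would claim that 255 is the larger number. Equality tests are unaffected, which is why only the < comparison misbehaves.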

Signed-off-by: Al Viro
Cc: "J. Bruce Fields"
Cc: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:18 +0800
d434e3ec4 ocfs2: ->e_leaf_clusters endianness breakage ...

commit 72094e43e3af5020510f920321d71f1798fa896d upstream.

le16, not le32...

Signed-off-by: Al Viro
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:18 +0800
ea6c7f23a ocfs2: ->rl_count endianness breakage ...

commit 28748b325dc2d730ccc312830a91c4ae0c0d9379 upstream.

le16, not le32...

Signed-off-by: Al Viro
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:18 +0800
bdd5904ce ocfs: ->rl_used breakage on big-endian ...

commit e1bf4cc620fd143766ddfcee3b004a1d1bb34fd0 upstream.

it's le16, not le32 or le64...

Signed-off-by: Al Viro
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:17 +0800
ee88fc68d ocfs2: ->l_next_free_req breakage on big-endian ...

commit 3a251f04fe97c3d335b745c98e4b377e3c3899f2 upstream.

It's le16, not le32...

Signed-off-by: Al Viro
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:17 +0800
025a55c8a btrfs: btrfs_root_readonly() broken on big-endian ...

commit 6ed3cf2cdfce4c9f1d73171bd3f27d9cb77b734e upstream.

->root_flags is __le64 and all accesses to it go through the helpers
that do proper conversions. Except for btrfs_root_readonly(), which
checks bit 0 as in host-endian...
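
A portable sketch of the correct pattern, assuming the flags field is available as its eight on-disk bytes. le64_decode(), ROOT_FLAG_RDONLY and root_readonly() are hypothetical names for this illustration and are not the btrfs helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a little-endian 64-bit field from its on-disk bytes. */
static uint64_t le64_decode(const unsigned char b[8])
{
    uint64_t v = 0;

    /* Start from the most significant byte, which is stored last. */
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | b[i];
    return v;
}

#define ROOT_FLAG_RDONLY (1ULL << 0)  /* hypothetical name for bit 0 */

/* Test the flag only after converting to host order. */
static bool root_readonly(const unsigned char flags_le[8])
{
    return (le64_decode(flags_le) & ROOT_FLAG_RDONLY) != 0;
}
```

Bit 0 of a little-endian value lives in the first byte on disk. A big-endian host that tests bit 0 of the raw 64-bit word looks at the last byte instead, which is the kind of breakage described above.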

Signed-off-by: Al Viro
Cc: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:17 +0800
5479e7878 nfsd: fix compose_entry_fh() failure exits ...

commit efe39651f08813180f37dc508d950fc7d92b29a8 upstream.

Restore the original logic ("fail on mountpoints, negatives and in
case of fh_compose() failures"). Since commit 8177e (nfsd: clean up
readdirplus encoding) that got broken -
    rv = fh_compose(fhp, exp, dchild, &cd->fh);
    if (rv)
        goto out;
    if (!dchild->d_inode)
        goto out;
    rv = 0;
out:
is equivalent to
    rv = fh_compose(fhp, exp, dchild, &cd->fh);
out:
and the second check has no effect whatsoever...

Signed-off-by: Al Viro
Cc: "J. Bruce Fields"
Signed-off-by: Greg Kroah-Hartman

Al Viro
2012-04-28 00:51:17 +0800
cf11afd6e Don't limit non-nested epoll paths ...

commit 93dc6107a76daed81c07f50215fa6ae77691634f upstream.

Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
number of possible wakeup paths in epoll is causing a few applications
to no longer work (dovecot for one).

The original patch is really about limiting the amount of epoll nesting
(since epoll fds can be attached to other fds). Thus, we probably can
allow an unlimited number of paths of depth 1. My current patch limits
it at 1000, and enforces the limits on paths that have a greater depth.

This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578

Signed-off-by: Jason Baron
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Jason Baron
2012-04-28 00:51:09 +0800