Eric Lee / smarc-fsl-linux-kernel

13 Feb, 2013

4 commits

9f309c86c nfs: Convert idmap to use kuids and kgids ... Browse Code »

Convert nfs_map_name_to_uid to return a kuid_t value.
Convert nfs_map_name_to_gid to return a kgid_t value.
Convert nfs_map_uid_to_name to take a kuid_t paramater.
Convert nfs_map_gid_to_name to take a kgid_t paramater.

Tweak nfs_fattr_map_owner_to_name to use a kuid_t intermediate value.
Tweak nfs_fattr_map_group_to_name to use a kgid_t intermediate value.

Which makes these functions properly handle kuids and kgids, including
erroring of the generated kuid or kgid is invalid.

Cc: "J. Bruce Fields"
Cc: Trond Myklebust
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-02-13 22:15:29 +0800
54f834cd5 nfs: Convert struct nfs_fattr to Use kuid_t and kgid_t ... Browse Code »

Cc: "J. Bruce Fields"
Cc: Trond Myklebust
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-02-13 22:15:28 +0800
7eaf040b7 sunrpc: Use kuid_t and kgid_t where appropriate ... Browse Code »

Convert variables that store uids and gids to be of type
kuid_t and kgid_t instead of type uid_t and gid_t.

Cc: "J. Bruce Fields"
Cc: Trond Myklebust
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-02-13 22:15:16 +0800
d83f5901b coda: Restrict coda messages to the initial user namespace ... Browse Code »

Remove the slight chance that uids and gids in coda messages will be
interpreted in the wrong user namespace.

- Only allow processes in the initial user namespace to open the coda
character device to communicate with coda filesystems.
- Explicitly convert the uids in the coda header into the initial user
namespace.
- In coda_vattr_to_attr make kuids and kgids from the initial user
namespace uids and gids in struct coda_vattr that just came from
userspace.
- In coda_iattr_to_vattr convert kuids and kgids into uids and gids
in the intial user namespace and store them in struct coda_vattr for
sending to coda userspace programs.

Nothing needs to be changed with mounts as coda does not support
being mounted in anything other than the initial user namespace.

Cc: Jan Harkes
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-02-13 22:00:53 +0800

12 Feb, 2013

3 commits

b46425569 9p: Modify struct 9p_fid to use a kuid_t not a uid_t ... Browse Code »

Change struct 9p_fid and it's associated functions to
use kuid_t's instead of uid_t.

Cc: Eric Van Hensbergen
Cc: Ron Minnich
Cc: Latchesar Ionkov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-02-12 19:19:32 +0800
447c50943 9p: Modify the stat structures to use kuid_t and kgid_t ... Browse Code »

9p has thre strucrtures that can encode inode stat information. Modify
all of those structures to contain kuid_t and kgid_t values. Modify
he wire encoders and decoders of those structures to use 'u' and 'g' instead of
'd' in the format string where uids and gids are present.

This results in all kuid and kgid conversion to and from on the wire values
being performed by the same code in protocol.c where the client is known
at the time of the conversion.

Cc: Eric Van Hensbergen
Cc: Ron Minnich
Cc: Latchesar Ionkov
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2013-02-12 19:19:31 +0800
f791f7c5e 9p: Transmit kuid and kgid values ... Browse Code »

Modify the p9_client_rpc format specifiers of every function that
directly transmits a uid or a gid from 'd' to 'u' or 'g' as
appropriate.

Modify those same functions to take kuid_t and kgid_t parameters
instead of uid_t and gid_t parameters.

Cc: Eric Van Hensbergen
Cc: Ron Minnich
Cc: Latchesar Ionkov
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2013-02-12 19:19:30 +0800

27 Jan, 2013

1 commit

c61a2810a userns: Avoid recursion in put_user_ns ... Browse Code »

When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.

Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.

I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.

Acked-by: Serge Hallyn
Pointed-out-by: Vasily Kulikov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-01-27 14:11:41 +0800

26 Dec, 2012

1 commit

c876ad768 pidns: Stop pid allocation when init dies ... Browse Code »

Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by: Oleg Nesterov
Reviewed-by: Oleg Nesterov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-12-26 08:10:05 +0800

22 Dec, 2012

9 commits

4fe19a136 Merge git://www.linux-watchdog.org/linux-watchdog ... Browse Code »

Pull watchdog updates from Wim Van Sebroeck:
"This includes some fixes and code improvements (like
clk_prepare_enable and clk_disable_unprepare), conversion from the
omap_wdt and twl4030_wdt drivers to the watchdog framework, addition
of the SB8x0 chipset support and the DA9055 Watchdog driver and some
OF support for the davinci_wdt driver."

* git://www.linux-watchdog.org/linux-watchdog: (22 commits)
watchdog: mei: avoid oops in watchdog unregister code path
watchdog: Orion: Fix possible null-deference in orion_wdt_probe
watchdog: sp5100_tco: Add SB8x0 chipset support
watchdog: davinci_wdt: add OF support
watchdog: da9052: Fix invalid free of devm_ allocated data
watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER
watchdog: remove depends on CONFIG_EXPERIMENTAL
watchdog: Convert dev_printk(KERN_ to dev_(
watchdog: DA9055 Watchdog driver
watchdog: omap_wdt: eliminate goto
watchdog: omap_wdt: delete redundant platform_set_drvdata() calls
watchdog: omap_wdt: convert to devm_ functions
watchdog: omap_wdt: convert to new watchdog core
watchdog: WatchDog Timer Driver Core: fix comment
watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare
watchdog: imx2_wdt: Select the driver via ARCH_MXC
watchdog: cpu5wdt.c: add missing del_timer call
watchdog: hpwdt.c: Increase version string
watchdog: Convert twl4030_wdt to watchdog core
davinci_wdt: preparation for switch to common clock framework
...

Linus Torvalds
2012-12-22 09:10:29 +0800
b49249d10 Merge tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm ... Browse Code »

Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.

Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."

* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...

Linus Torvalds
2012-12-22 09:08:06 +0800
184e25166 Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband ... Browse Code »

Pull more infiniband changes from Roland Dreier:
"Second batch of InfiniBand/RDMA changes for 3.8:
- cxgb4 changes to fix lookup engine hash collisions
- mlx4 changes to make flow steering usable
- fix to IPoIB to avoid pinning dst reference for too long"

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cxgb4: Fix bug for active and passive LE hash collision path
RDMA/cxgb4: Fix LE hash collision bug for passive open connection
RDMA/cxgb4: Fix LE hash collision bug for active open connection
mlx4_core: Allow choosing flow steering mode
mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV
mlx4_core: Fix error flow in the flow steering wrapper
mlx4_core: Add QPN enforcement for flow steering rules set by VFs
cxgb4: Add LE hash collision bug fix path in LLD driver
cxgb4: Add T4 filter support
IPoIB: Call skb_dst_drop() once skb is enqueued for sending

Linus Torvalds
2012-12-22 08:40:26 +0800
0264405b8 Merge tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic ... Browse Code »

Pull asm-generic cleanup from Arnd Bergmann:
"These are a few cleanups for asm-generic:

- a set of patches from Lars-Peter Clausen to generalize asm/mmu.h
and use it in the architectures that don't need any special
handling.
- A patch from Will Deacon to remove the {read,write}s{b,w,l} as
discussed during the arm64 review
- A patch from James Hogan that helps with the meta architecture
series."

* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
xtensa: Use generic asm/mmu.h for nommu
h8300: Use generic asm/mmu.h
c6x: Use generic asm/mmu.h
asm-generic/mmu.h: Add support for FDPIC
asm-generic/mmu.h: Remove unused vmlist field from mm_context_t
asm-generic: io: remove {read,write} string functions
asm-generic/io.h: remove asm/cacheflush.h include

Linus Torvalds
2012-12-22 08:39:08 +0800
7de3ee57d dm: remove map_info ... Browse Code »

This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.

Signed-off-by: Mikulas Patocka
Signed-off-by: Alasdair G Kergon

Mikulas Patocka
2012-12-22 04:23:41 +0800
ddbd658f6 dm: move target request nr to dm_target_io ... Browse Code »

This patch moves target_request_nr from map_info to dm_target_io and
makes it accessible with dm_bio_get_target_request_nr.

This patch is a preparation for the next patch that removes map_info.

Signed-off-by: Mikulas Patocka
Signed-off-by: Alasdair G Kergon

Mikulas Patocka
2012-12-22 04:23:39 +0800
c0820cf5a dm: introduce per_bio_data ... Browse Code »

Introduce a field per_bio_data_size in struct dm_target.

Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.

Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.

If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).

Signed-off-by: Mikulas Patocka
Signed-off-by: Alasdair G Kergon

Mikulas Patocka
2012-12-22 04:23:38 +0800
d54eaa5a0 dm: prepare to support WRITE SAME ... Browse Code »

Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.

A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.

Signed-off-by: Mike Snitzer
Signed-off-by: Alasdair G Kergon

Mike Snitzer
2012-12-22 04:23:36 +0800
5023e5cf5 dm ioctl: remove PF_MEMALLOC ... Browse Code »

When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.

Signed-off-by: Mikulas Patocka
Signed-off-by: Alasdair G Kergon

Mikulas Patocka
2012-12-22 04:23:36 +0800

21 Dec, 2012

22 commits

96680d2b9 Merge branch 'for-next' of git://git.infradead.org/users/eparis/notify ... Browse Code »

Pull filesystem notification updates from Eric Paris:
"This pull mostly is about locking changes in the fsnotify system. By
switching the group lock from a spin_lock() to a mutex() we can now
hold the lock across things like iput(). This fixes a problem
involving unmounting a fs and having inodes be busy, first pointed out
by FAT, but reproducible with tmpfs.

This also restores signal driven I/O for inotify, which has been
broken since about 2.6.32."

Ugh. I *hate* the timing of this. It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes. That's just crap. But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.

Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.

* 'for-next' of git://git.infradead.org/users/eparis/notify:
inotify: automatically restart syscalls
inotify: dont skip removal of watch descriptor if creation of ignored event failed
fanotify: dont merge permission events
fsnotify: make fasync generic for both inotify and fanotify
fsnotify: change locking order
fsnotify: dont put marks on temporary list when clearing marks by group
fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
fsnotify: pass group to fsnotify_destroy_mark()
fsnotify: use a mutex instead of a spinlock to protect a groups mark list
fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
fsnotify: take groups mark_lock before mark lock
fsnotify: use reference counting for groups
fsnotify: introduce fsnotify_get_group()
inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()

Linus Torvalds
2012-12-21 12:11:52 +0800
4c9a44aeb Merge branch 'akpm' (Andrew's patch-bomb) ... Browse Code »

Merge the rest of Andrew's patches for -rc1:
"A bunch of fixes and misc missed-out-on things.

That'll do for -rc1. I still have a batch of IPC patches which still
have a possible bug report which I'm chasing down."

* emailed patches from Andrew Morton : (25 commits)
keys: use keyring_alloc() to create module signing keyring
keys: fix unreachable code
sendfile: allows bypassing of notifier events
SGI-XP: handle non-fatal traps
fat: fix incorrect function comment
Documentation: ABI: remove testing/sysfs-devices-node
proc: fix inconsistent lock state
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
memcg: don't register hotcpu notifier from ->css_alloc()
checkpatch: warn on uapi #includes that #include
mm: cma: WARN if freed memory is still in use
exec: do not leave bprm->interp on stack
...

Linus Torvalds
2012-12-21 12:00:43 +0800
1f0377ff0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull VFS update from Al Viro:
"fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
misc stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
vfs: make lremovexattr retry once on ESTALE error
vfs: make removexattr retry once on ESTALE
vfs: make llistxattr retry once on ESTALE error
vfs: make listxattr retry once on ESTALE error
vfs: make lgetxattr retry once on ESTALE
vfs: make getxattr retry once on an ESTALE error
vfs: allow lsetxattr() to retry once on ESTALE errors
vfs: allow setxattr to retry once on ESTALE errors
vfs: allow utimensat() calls to retry once on an ESTALE error
vfs: fix user_statfs to retry once on ESTALE errors
vfs: make fchownat retry once on ESTALE errors
vfs: make fchmodat retry once on ESTALE errors
vfs: have chroot retry once on ESTALE error
vfs: have chdir retry lookup and call once on ESTALE error
vfs: have faccessat retry once on an ESTALE error
vfs: have do_sys_truncate retry once on an ESTALE error
vfs: fix renameat to retry on ESTALE errors
vfs: make do_unlinkat retry once on ESTALE errors
vfs: make do_rmdir retry once on ESTALE errors
vfs: add a flags argument to user_path_parent
...

Linus Torvalds
2012-12-21 10:14:31 +0800
54d46ea99 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.

Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure

Linus Torvalds
2012-12-21 10:05:28 +0800
c4e18497d linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors ... Browse Code »

Commit 263a523d18bc ("linux/kernel.h: Fix warning seen with W=1 due to
change in DIV_ROUND_CLOSEST") fixes a warning seen with W=1 due to
change in DIV_ROUND_CLOSEST.

Unfortunately, the C compiler converts divide operations with unsigned
divisors to unsigned, even if the dividend is signed and negative (for
example, -10 / 5U = 858993457). The C standard says "If one operand has
unsigned int type, the other operand is converted to unsigned int", so
the compiler is not to blame. As a result, DIV_ROUND_CLOSEST(0, 2U) and
similar operations now return bad values, since the automatic conversion
of expressions such as "0 - 2U/2" to unsigned was not taken into
account.

Fix by checking for the divisor variable type when deciding which
operation to perform. This fixes DIV_ROUND_CLOSEST(0, 2U), but still
returns bad values for negative dividends divided by unsigned divisors.
Mark the latter case as unsupported.

One observed effect of this problem is that the s2c_hwmon driver reports
a value of 4198403 instead of 0 if the ADC reads 0.

Other impact is unpredictable. Problem is seen if the divisor is an
unsigned variable or constant and the dividend is less than (divisor/2).

Signed-off-by: Guenter Roeck
Reported-by: Juergen Beisert
Tested-by: Juergen Beisert
Cc: Jean Delvare
Cc: [3.7.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Guenter Roeck
2012-12-21 09:40:20 +0800
b66c59840 exec: do not leave bprm->interp on stack ... Browse Code »

If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.

Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.

After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.

This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.

For a proof of concept, see DoTest.sh from:

http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

Signed-off-by: Kees Cook
Cc: halfdog
Cc: P J P
Cc: Alexander Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2012-12-21 09:40:19 +0800
1ac12b4b6 vfs: turn is_dir argument to kern_path_create into a lookup_flags arg ... Browse Code »

Where we can pass in LOOKUP_DIRECTORY or LOOKUP_REVAL. Any other flags
passed in here are currently ignored.

Signed-off-by: Jeff Layton
Signed-off-by: Al Viro

Jeff Layton
2012-12-21 07:50:02 +0800
b9d6ba94b vfs: add a retry_estale helper function to handle retries on ESTALE ... Browse Code »

This function is expected to be called from path-based syscalls to help
them decide whether to try the lookup and call again in the event that
they got an -ESTALE return back on an earier try.

Currently, we only retry the call once on an ESTALE error, but in the
event that we decide that that's not enough in the future, we should be
able to change the logic in this helper without too much effort.

Signed-off-by: Jeff Layton
Signed-off-by: Al Viro

Jeff Layton
2012-12-21 07:50:00 +0800
21e89c0c4 Merge branch 'fscache' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells… ... Browse Code »

…/linux-fs into for-linus

Al Viro
2012-12-21 07:49:14 +0800
471667391 vfs: Remove useless function prototypes ... Browse Code »

Commit 8e22cc88d68ca1a46d7d582938f979eb640ed30f removes the (un)lock_super
function definitions but forgets to remove their prototypes.

Signed-off-by: Alessio Igor Bogani
Signed-off-by: Al Viro

Alessio Igor Bogani
2012-12-21 07:47:08 +0800
7898575fc mm: drop vmtruncate ... Browse Code »

Removed vmtruncate

Signed-off-by: Marco Stornelli
Signed-off-by: Al Viro

Marco Stornelli
2012-12-21 07:46:29 +0800
d30357f2f vfs: drop vmtruncate ... Browse Code »

Removed vmtruncate

Signed-off-by: Marco Stornelli
Signed-off-by: Al Viro

Marco Stornelli
2012-12-21 07:46:29 +0800
1f372dff1 FS-Cache: Mark cancellation of in-progress operation ... Browse Code »

Mark as cancelled an operation that is in progress rather than pending at the
time it is cancelled, and call fscache_complete_op() to cancel an operation so
that blocked ops can be started.

Signed-off-by: David Howells

David Howells
2012-12-21 06:34:00 +0800
36a02de5d FS-Cache: Convert the object event ID #defines into an enum ... Browse Code »

Convert the fscache_object event IDs from #defines into an enum. Also add an
extra label to the enum to carry the event count and redefine the event mask
in terms of that.

Signed-off-by: David Howells

David Howells
2012-12-21 06:08:05 +0800
a02de9608 VFS: Make more complete truncate operation available to CacheFiles ... Browse Code »

Make a more complete truncate operation available to CacheFiles (including
security checks and suchlike) so that it can use this to clear invalidated
cache files.

Signed-off-by: David Howells
Acked-by: Al Viro

David Howells
2012-12-21 06:05:41 +0800
982197277 Merge branch 'for-3.8' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd update from Bruce Fields:
"Included this time:

- more nfsd containerization work from Stanislav Kinsbursky: we're
not quite there yet, but should be by 3.9.

- NFSv4.1 progress: implementation of basic backchannel security
negotiation and the mandatory BACKCHANNEL_CTL operation. See

http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues

for remaining TODO's

- Fixes for some bugs that could be triggered by unusual compounds.
Our xdr code wasn't designed with v4 compounds in mind, and it
shows. A more thorough rewrite is still a todo.

- If you've ever seen "RPC: multiple fragments per record not
supported" logged while using some sort of odd userland NFS client,
that should now be fixed.

- Further work from Jeff Layton on our mechanism for storing
information about NFSv4 clients across reboots.

- Further work from Bryan Schumaker on his fault-injection mechanism
(which allows us to discard selective NFSv4 state, to excercise
rarely-taken recovery code paths in the client.)

- The usual mix of miscellaneous bugs and cleanup.

Thanks to everyone who tested or contributed this cycle."

* 'for-3.8' of git://linux-nfs.org/~bfields/linux: (111 commits)
nfsd4: don't leave freed stateid hashed
nfsd4: free_stateid can use the current stateid
nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
nfsd: warn on odd reply state in nfsd_vfs_read
nfsd4: fix oops on unusual readlike compound
nfsd4: disable zero-copy on non-final read ops
svcrpc: fix some printks
NFSD: Correct the size calculation in fault_inject_write
NFSD: Pass correct buffer size to rpc_ntop
nfsd: pass proper net to nfsd_destroy() from NFSd kthreads
nfsd: simplify service shutdown
nfsd: replace boolean nfsd_up flag by users counter
nfsd: simplify NFSv4 state init and shutdown
nfsd: introduce helpers for generic resources init and shutdown
nfsd: make NFSd service structure allocated per net
nfsd: make NFSd service boot time per-net
nfsd: per-net NFSd up flag introduced
nfsd: move per-net startup code to separated function
nfsd: pass net to __write_ports() and down
nfsd: pass net to nfsd_set_nrthreads()
...

Linus Torvalds
2012-12-21 06:04:11 +0800
ef778e7ae FS-Cache: Provide proper invalidation ... Browse Code »

Provide a proper invalidation method rather than relying on the netfs retiring
the cookie it has and getting a new one. The problem with this is that isn't
easy for the netfs to make sure that it has completed/cancelled all its
outstanding storage and retrieval operations on the cookie it is retiring.

Instead, have the cache provide an invalidation method that will cancel or wait
for all currently outstanding operations before invalidating the cache, and
will cause new operations to queue up behind that. Whilst invalidation is in
progress, some requests will be rejected until the cache can stack a barrier on
the operation queue to cause new operations to be deferred behind it.

Signed-off-by: David Howells

David Howells
2012-12-21 06:04:07 +0800
40889e8d9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

Pull Ceph update from Sage Weil:
"There are a few different groups of commits here. The largest is
Alex's ongoing work to enable the coming RBD features (cloning,
striping). There is some cleanup in libceph that goes along with it.

Cyril and David have fixed some problems with NFS reexport (leaking
dentries and page locks), and there is a batch of patches from Yan
fixing problems with the fs client when running against a clustered
MDS. There are a few bug fixes mixed in for good measure, many of
which will be going to the stable trees once they're upstream.

My apologies for the late pull. There is still a gremlin in the rbd
map/unmap code and I was hoping to include the fix for that as well,
but we haven't been able to confirm the fix is correct yet; I'll send
that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
rbd: get rid of rbd_{get,put}_dev()
libceph: register request before unregister linger
libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
libceph: init event->node in ceph_osdc_create_event()
libceph: init osd->o_node in create_osd()
libceph: report connection fault with warning
libceph: socket can close in any connection state
rbd: don't use ENOTSUPP
rbd: remove linger unconditionally
rbd: get rid of RBD_MAX_SEG_NAME_LEN
libceph: avoid using freed osd in __kick_osd_requests()
ceph: don't reference req after put
rbd: do not allow remove of mounted-on image
libceph: Unlock unprocessed pages in start_read() error path
ceph: call handle_cap_grant() for cap import message
ceph: Fix __ceph_do_pending_vmtruncate
ceph: Don't add dirty inode to dirty list if caps is in migration
ceph: Fix infinite loop in __wake_requests
ceph: Don't update i_max_size when handling non-auth cap
bdi_register: add __printf verification, fix arg mismatch
...

Linus Torvalds
2012-12-21 06:00:13 +0800
9f10523f8 FS-Cache: Fix operation state management and accounting ... Browse Code »

Fix the state management of internal fscache operations and the accounting of
what operations are in what states.

This is done by:

(1) Give struct fscache_operation a enum variable that directly represents the
state it's currently in, rather than spreading this knowledge over a bunch
of flags, who's processing the operation at the moment and whether it is
queued or not.

This makes it easier to write assertions to check the state at various
points and to prevent invalid state transitions.

(2) Add an 'operation complete' state and supply a function to indicate the
completion of an operation (fscache_op_complete()) and make things call
it. The final call to fscache_put_operation() can then check that an op
in the appropriate state (complete or cancelled).

(3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
govern the state of an object:

(a) The ->n_ops is now the number of extant operations on the object
and is now decremented by fscache_put_operation() only.

(b) The ->n_in_progress is simply the number of objects that have been
taken off of the object's pending queue for the purposes of being
run. This is decremented by fscache_op_complete() only.

(c) The ->n_exclusive is the number of exclusive ops that have been
submitted and queued or are in progress. It is decremented by
fscache_op_complete() and by fscache_cancel_op().

fscache_put_operation() and fscache_operation_gc() now no longer try to
clean up ->n_exclusive and ->n_in_progress. That was leading to double
decrements against fscache_cancel_op().

fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
double decrements against fscache_put_operation().

fscache_submit_exclusive_op() now decides whether it has to queue an op
based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
will persist in being true even after all preceding operations have been
cancelled or completed. Furthermore, if an object is active and there are
runnable ops against it, there must be at least one op running.

(4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
provide a function to record completion of the pages as they complete.

When n_pages reaches 0, the operation is deemed to be complete and
fscache_op_complete() is called.

Add calls to fscache_retrieval_complete() anywhere we've finished with a
page we've been given to read or allocate for. This includes places where
we just return pages to the netfs for reading from the server and where
accessing the cache fails and we discard the proposed netfs page.

The bugs in the unfixed state management manifest themselves as oopses like the
following where the operation completion gets out of sync with return of the
cookie by the netfs. This is possible because the cache unlocks and returns
all the netfs pages before recording its completion - which means that there's
nothing to stop the netfs discarding them and returning the cookie.

FS-Cache: Cookie 'NFS.fh' still has outstanding reads
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
CPU 1
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
RIP: 0010:[] [] __fscache_relinquish_cookie+0x170/0x343 [fscache]
RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
Stack:
ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
Call Trace:
[] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
[] nfs_clear_inode+0x3c/0x41 [nfs]
[] nfs4_evict_inode+0x2f/0x33 [nfs]
[] evict+0xa1/0x15c
[] dispose_list+0x2c/0x38
[] prune_icache_sb+0x28c/0x29b
[] prune_super+0xd5/0x140
[] shrink_slab+0x102/0x1ab
[] balance_pgdat+0x2f2/0x595
[] ? process_timeout+0xb/0xb
[] kswapd+0x270/0x289
[] ? __init_waitqueue_head+0x46/0x46
[] ? balance_pgdat+0x595/0x595
[] kthread+0x7f/0x87
[] kernel_thread_helper+0x4/0x10
[] ? finish_task_switch+0x45/0xc0
[] ? retint_restore_args+0xe/0xe
[] ? __init_kthread_worker+0x53/0x53
[] ? gs_change+0xb/0xb

Signed-off-by: David Howells

David Howells
2012-12-21 05:58:26 +0800
ef46ed888 FS-Cache: Make cookie relinquishment wait for outstanding reads ... Browse Code »

Make fscache_relinquish_cookie() log a warning and wait if there are any
outstanding reads left on the cookie it was given.

Signed-off-by: David Howells

David Howells
2012-12-21 05:58:25 +0800
a13eea6bd Merge tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs ... Browse Code »

Pull new F2FS filesystem from Jaegeuk Kim:
"Introduce a new file system, Flash-Friendly File System (F2FS), to
Linux 3.8.

Highlights:
- Add initial f2fs source codes
- Fix an endian conversion bug
- Fix build failures on random configs
- Fix the power-off-recovery routine
- Minor cleanup, coding style, and typos patches"

From the Kconfig help text:

F2FS is based on Log-structured File System (LFS), which supports
versatile "flash-friendly" features. The design has been focused on
addressing the fundamental issues in LFS, which are snowball effect
of wandering tree and high cleaning overhead.

Since flash-based storages show different characteristics according to
the internal geometry or flash memory management schemes aka FTL, F2FS
and tools support various parameters not only for configuring on-disk
layout, but also for selecting allocation and cleaning algorithms.

and there's an article by Neil Brown about it on lwn.net:

http://lwn.net/Articles/518988/

* tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
f2fs: fix tracking parent inode number
f2fs: cleanup the f2fs_bio_alloc routine
f2fs: introduce accessor to retrieve number of dentry slots
f2fs: remove redundant call to f2fs_put_page in delete entry
f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask
f2fs: rewrite f2fs_bio_alloc to make it simpler
f2fs: fix a typo in f2fs documentation
f2fs: remove unused variable
f2fs: move error condition for mkdir at proper place
f2fs: remove unneeded initialization
f2fs: check read only condition before beginning write out
f2fs: remove unneeded memset from init_once
f2fs: show error in case of invalid mount arguments
f2fs: fix the compiler warning for uninitialized use of variable
f2fs: resolve build failures
f2fs: adjust kernel coding style
f2fs: fix endian conversion bugs reported by sparse
f2fs: remove unneeded version.h header file from f2fs.h
f2fs: update the f2fs document
f2fs: update Kconfig and Makefile
...

Linus Torvalds
2012-12-21 05:54:52 +0800
c4d6d8dbf CacheFiles: Fix the marking of cached pages ... Browse Code »

Under some circumstances CacheFiles defers the marking of pages with PG_fscache
so that it can take advantage of pagevecs to reduce the number of calls to
fscache_mark_pages_cached() and the netfs's hook to keep track of this.

There are, however, two problems with this:

(1) It can lead to the PG_fscache mark being applied _after_ the page is set
PG_uptodate and unlocked (by the call to fscache_end_io()).

(2) CacheFiles's ref on the page is dropped immediately following
fscache_end_io() - and so may not still be held when the mark is applied.
This can lead to the page being passed back to the allocator before the
mark is applied.

Fix this by, where appropriate, marking the page before calling
fscache_end_io() and releasing the page. This means that we can't take
advantage of pagevecs and have to make a separate call for each page to the
marking routines.

The symptoms of this are Bad Page state errors cropping up under memory
pressure, for example:

BUG: Bad page state in process tar pfn:002da
page:ffffea0000009fb0 count:0 mapcount:0 mapping: (null) index:0x1447
page flags: 0x1000(private_2)
Pid: 4574, comm: tar Tainted: G W 3.1.0-rc4-fsdevel+ #1064
Call Trace:
[] ? dump_page+0xb9/0xbe
[] bad_page+0xd5/0xea
[] get_page_from_freelist+0x35b/0x46a
[] __alloc_pages_nodemask+0x362/0x662
[] __do_page_cache_readahead+0x13a/0x267
[] ? __do_page_cache_readahead+0xa2/0x267
[] ra_submit+0x1c/0x20
[] ondemand_readahead+0x28b/0x29a
[] ? ondemand_readahead+0x163/0x29a
[] page_cache_sync_readahead+0x38/0x3a
[] generic_file_aio_read+0x2ab/0x67e
[] nfs_file_read+0xa4/0xc9 [nfs]
[] do_sync_read+0xba/0xfa
[] ? security_file_permission+0x7b/0x84
[] ? rw_verify_area+0xab/0xc8
[] vfs_read+0xaa/0x13a
[] sys_read+0x45/0x6c
[] system_call_fastpath+0x16/0x1b

As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.

Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
set appropriately showed that sometimes it wasn't. This led to the discovery
that sometimes the page has apparently been reclaimed by the time the marker
got to see it.

Reported-by: M. Stevens
Signed-off-by: David Howells
Reviewed-by: Jeff Layton

David Howells
2012-12-21 05:54:30 +0800