13 Feb, 2013

4 commits

  • Convert nfs_map_name_to_uid to return a kuid_t value.
    Convert nfs_map_name_to_gid to return a kgid_t value.
    Convert nfs_map_uid_to_name to take a kuid_t parameter.
    Convert nfs_map_gid_to_name to take a kgid_t parameter.

    Tweak nfs_fattr_map_owner_to_name to use a kuid_t intermediate value.
    Tweak nfs_fattr_map_group_to_name to use a kgid_t intermediate value.

    This makes these functions properly handle kuids and kgids, including
    returning an error if the generated kuid or kgid is invalid.

    Cc: "J. Bruce Fields"
    Cc: Trond Myklebust
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
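
A minimal userspace sketch of the pattern this commit describes: wrap the raw id in a distinct type, map it through a namespace table, and fail when the result is invalid. Every name prefixed with sketch_ is hypothetical, the single-extent map is a simplification, and the error value is a placeholder since the log does not say which errno is returned.

```c
#include <stdbool.h>

/* Distinct wrapper type so a raw id cannot be passed where a mapped id is expected. */
typedef struct { unsigned int val; } sketch_kuid_t;

/* Hypothetical single-extent map: ids [first, first + count) map to lower_first onwards. */
struct sketch_uid_map {
    unsigned int first;
    unsigned int lower_first;
    unsigned int count;
};

/* Map a raw id; ids outside the extent yield the invalid marker (all bits set). */
sketch_kuid_t sketch_make_kuid(const struct sketch_uid_map *map, unsigned int uid)
{
    if (uid >= map->first && uid - map->first < map->count)
        return (sketch_kuid_t){ map->lower_first + (uid - map->first) };
    return (sketch_kuid_t){ (unsigned int)-1 };
}

bool sketch_uid_valid(sketch_kuid_t kuid)
{
    return kuid.val != (unsigned int)-1;
}

/* Returns 0 and stores the mapped id, or -1 if the mapping produced an invalid id. */
int sketch_map_to_kuid(const struct sketch_uid_map *map, unsigned int raw,
                       sketch_kuid_t *out)
{
    sketch_kuid_t kuid = sketch_make_kuid(map, raw);

    if (!sketch_uid_valid(kuid))
        return -1;      /* placeholder error value */
    *out = kuid;
    return 0;
}
```

Because the wrapper is a struct, the compiler rejects accidental mixing of mapped and unmapped ids, which is the main benefit of moving from uid_t to a typed value.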
     
  • Cc: "J. Bruce Fields"
    Cc: Trond Myklebust
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Convert variables that store uids and gids to be of type
    kuid_t and kgid_t instead of type uid_t and gid_t.

    Cc: "J. Bruce Fields"
    Cc: Trond Myklebust
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Remove the slight chance that uids and gids in coda messages will be
    interpreted in the wrong user namespace.

    - Only allow processes in the initial user namespace to open the coda
    character device to communicate with coda filesystems.
    - Explicitly convert the uids in the coda header into the initial user
    namespace.
    - In coda_vattr_to_attr make kuids and kgids from the initial user
    namespace uids and gids in struct coda_vattr that just came from
    userspace.
    - In coda_iattr_to_vattr convert kuids and kgids into uids and gids
    in the initial user namespace and store them in struct coda_vattr for
    sending to coda userspace programs.

    Nothing needs to be changed with mounts as coda does not support
    being mounted in anything other than the initial user namespace.

    Cc: Jan Harkes
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Feb, 2013

3 commits

  • Change struct 9p_fid and its associated functions to
    use kuid_t's instead of uid_t.

    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • 9p has three structures that can encode inode stat information. Modify
    all of those structures to contain kuid_t and kgid_t values. Modify
    the wire encoders and decoders of those structures to use 'u' and 'g' instead of
    'd' in the format string where uids and gids are present.

    This results in all kuid and kgid conversions to and from on-the-wire values
    being performed by the same code in protocol.c where the client is known
    at the time of the conversion.

    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Modify the p9_client_rpc format specifiers of every function that
    directly transmits a uid or a gid from 'd' to 'u' or 'g' as
    appropriate.

    Modify those same functions to take kuid_t and kgid_t parameters
    instead of uid_t and gid_t parameters.

    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

27 Jan, 2013

1 commit

  • When freeing a deeply nested user namespace free_user_ns calls
    put_user_ns on its parent which may in turn call free_user_ns again.
    When -fno-optimize-sibling-calls is passed to gcc one stack frame per
    user namespace is left on the stack, potentially overflowing the
    kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
    so we can't count on gcc to optimize this code.

    Remove struct kref and use a plain atomic_t, making the code more
    flexible and easier to comprehend. Make the loop in free_user_ns
    explicit to guarantee that the stack does not overflow with
    CONFIG_FRAME_POINTER enabled.

    I have tested this fix with a simple program that uses unshare to
    create a deeply nested user namespace structure and then calls exit.
    With 1000 nested user namespaces before this change running my test
    program causes the kernel to die a horrible death. With 10,000,000
    nested user namespaces after this change my test program runs to
    completion and causes no harm.

    Acked-by: Serge Hallyn
    Pointed-out-by: Vasily Kulikov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
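
The explicit loop described in this commit can be sketched in userspace as follows. This is a simplified model, not the kernel code: struct sketch_ns, the plain int counter (the commit uses atomic_t) and the helper names are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a user namespace: a parent link and a reference count. */
struct sketch_ns {
    struct sketch_ns *parent;
    int count;                  /* plain int here; not safe for concurrent use */
};

static int sketch_freed;        /* how many namespaces have been released */

/* Create a namespace that holds one reference on its parent, if it has one. */
struct sketch_ns *sketch_ns_create(struct sketch_ns *parent)
{
    struct sketch_ns *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    ns->parent = parent;
    ns->count = 1;
    if (parent)
        parent->count++;
    return ns;
}

/* Drop a reference. Freeing walks up the parent chain in a loop, so the
 * stack depth stays constant no matter how deeply namespaces are nested. */
void sketch_ns_put(struct sketch_ns *ns)
{
    while (ns && --ns->count == 0) {
        struct sketch_ns *parent = ns->parent;

        free(ns);
        sketch_freed++;
        ns = parent;            /* now drop the reference the child held */
    }
}

/* Build a chain of `depth` nested namespaces, release it, and report how many were freed. */
int sketch_build_and_release(int depth)
{
    struct sketch_ns *cur = NULL;

    sketch_freed = 0;
    for (int i = 0; i < depth; i++) {
        struct sketch_ns *child = sketch_ns_create(cur);

        if (!child)
            break;
        if (cur)
            sketch_ns_put(cur); /* the caller's own reference moves to the child */
        cur = child;
    }
    sketch_ns_put(cur);
    return sketch_freed;
}
```

A recursive version would call the put function from inside the free function and depend on the compiler turning that into a tail call; the loop removes that dependency.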
     

26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 Dec, 2012

9 commits

  • Pull watchdog updates from Wim Van Sebroeck:
    "This includes some fixes and code improvements (like
    clk_prepare_enable and clk_disable_unprepare), conversion from the
    omap_wdt and twl4030_wdt drivers to the watchdog framework, addition
    of the SB8x0 chipset support and the DA9055 Watchdog driver and some
    OF support for the davinci_wdt driver."

    * git://www.linux-watchdog.org/linux-watchdog: (22 commits)
    watchdog: mei: avoid oops in watchdog unregister code path
    watchdog: Orion: Fix possible null-deference in orion_wdt_probe
    watchdog: sp5100_tco: Add SB8x0 chipset support
    watchdog: davinci_wdt: add OF support
    watchdog: da9052: Fix invalid free of devm_ allocated data
    watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER
    watchdog: remove depends on CONFIG_EXPERIMENTAL
    watchdog: Convert dev_printk(KERN_ to dev_(
    watchdog: DA9055 Watchdog driver
    watchdog: omap_wdt: eliminate goto
    watchdog: omap_wdt: delete redundant platform_set_drvdata() calls
    watchdog: omap_wdt: convert to devm_ functions
    watchdog: omap_wdt: convert to new watchdog core
    watchdog: WatchDog Timer Driver Core: fix comment
    watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare
    watchdog: imx2_wdt: Select the driver via ARCH_MXC
    watchdog: cpu5wdt.c: add missing del_timer call
    watchdog: hpwdt.c: Increase version string
    watchdog: Convert twl4030_wdt to watchdog core
    davinci_wdt: preparation for switch to common clock framework
    ...

    Linus Torvalds
     
  • Pull dm update from Alasdair G Kergon:
    "Miscellaneous device-mapper fixes, cleanups and performance
    improvements.

    Of particular note:
    - Disable broken WRITE SAME support in all targets except linear and
    striped. Use it when kcopyd is zeroing blocks.
    - Remove several mempools from targets by moving the data into the
    bio's new front_pad area (which dm calls 'per_bio_data').
    - Fix a race in thin provisioning if discards are misused.
    - Prevent userspace from interfering with the ioctl parameters and
    use kmalloc for the data buffer if it's small instead of vmalloc.
    - Throttle some annoying error messages when I/O fails."

    * tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
    dm stripe: add WRITE SAME support
    dm: remove map_info
    dm snapshot: do not use map_context
    dm thin: dont use map_context
    dm raid1: dont use map_context
    dm flakey: dont use map_context
    dm raid1: rename read_record to bio_record
    dm: move target request nr to dm_target_io
    dm snapshot: use per_bio_data
    dm verity: use per_bio_data
    dm raid1: use per_bio_data
    dm: introduce per_bio_data
    dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
    dm linear: add WRITE SAME support
    dm: add WRITE SAME support
    dm: prepare to support WRITE SAME
    dm ioctl: use kmalloc if possible
    dm ioctl: remove PF_MEMALLOC
    dm persistent data: improve improve space map block alloc failure message
    dm thin: use DMERR_LIMIT for errors
    ...

    Linus Torvalds
     
  • Pull more infiniband changes from Roland Dreier:
    "Second batch of InfiniBand/RDMA changes for 3.8:
    - cxgb4 changes to fix lookup engine hash collisions
    - mlx4 changes to make flow steering usable
    - fix to IPoIB to avoid pinning dst reference for too long"

    * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    RDMA/cxgb4: Fix bug for active and passive LE hash collision path
    RDMA/cxgb4: Fix LE hash collision bug for passive open connection
    RDMA/cxgb4: Fix LE hash collision bug for active open connection
    mlx4_core: Allow choosing flow steering mode
    mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV
    mlx4_core: Fix error flow in the flow steering wrapper
    mlx4_core: Add QPN enforcement for flow steering rules set by VFs
    cxgb4: Add LE hash collision bug fix path in LLD driver
    cxgb4: Add T4 filter support
    IPoIB: Call skb_dst_drop() once skb is enqueued for sending

    Linus Torvalds
     
  • Pull asm-generic cleanup from Arnd Bergmann:
    "These are a few cleanups for asm-generic:

    - a set of patches from Lars-Peter Clausen to generalize asm/mmu.h
    and use it in the architectures that don't need any special
    handling.
    - A patch from Will Deacon to remove the {read,write}s{b,w,l} as
    discussed during the arm64 review
    - A patch from James Hogan that helps with the meta architecture
    series."

    * tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
    xtensa: Use generic asm/mmu.h for nommu
    h8300: Use generic asm/mmu.h
    c6x: Use generic asm/mmu.h
    asm-generic/mmu.h: Add support for FDPIC
    asm-generic/mmu.h: Remove unused vmlist field from mm_context_t
    asm-generic: io: remove {read,write} string functions
    asm-generic/io.h: remove asm/cacheflush.h include

    Linus Torvalds
     
  • This patch removes map_info from bio-based device mapper targets.
    map_info is still used for request-based targets.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch moves target_request_nr from map_info to dm_target_io and
    makes it accessible with dm_bio_get_target_request_nr.

    This patch is a preparation for the next patch that removes map_info.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a field per_bio_data_size in struct dm_target.

    Targets can set this field in the constructor. If a target sets this
    field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
    are allocated for each bio submitted to the target. These data can be
    used for any purpose by the target and help us improve performance by
    removing some per-target mempools.

    Per-bio data is accessed with dm_per_bio_data. The
    argument data_size must be the same as the value per_bio_data_size in
    dm_target.

    If the target has a pointer to per_bio_data, it can get a pointer to
    the bio with dm_bio_from_per_bio_data() function (data_size must be the
    same as the value passed to dm_per_bio_data).

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
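
One way to picture the relationship between a bio and its per-bio data is a single allocation with the auxiliary data placed in front of the bio, so each pointer can be derived from the other using only data_size. The sketch below is a userspace model under that assumption; the sketch_ names, the simplified struct, and the alignment padding are illustrative and do not reproduce the device-mapper layout.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified bio. */
struct sketch_bio {
    unsigned long sector;
};

/* Round the data size up so the bio that follows stays suitably aligned. */
static size_t sketch_pad(size_t data_size)
{
    size_t align = _Alignof(max_align_t);

    return (data_size + align - 1) / align * align;
}

/* Allocate the per-bio data and the bio together; the data sits in front. */
struct sketch_bio *sketch_bio_alloc(size_t per_bio_data_size)
{
    size_t front = sketch_pad(per_bio_data_size);
    char *base = calloc(1, front + sizeof(struct sketch_bio));

    return base ? (struct sketch_bio *)(base + front) : NULL;
}

/* Step back from the bio to its data. data_size must match the allocation. */
void *sketch_per_bio_data(struct sketch_bio *bio, size_t data_size)
{
    return (char *)bio - sketch_pad(data_size);
}

/* Step forward from the data to its bio, again using the same data_size. */
struct sketch_bio *sketch_bio_from_per_bio_data(void *data, size_t data_size)
{
    return (struct sketch_bio *)((char *)data + sketch_pad(data_size));
}

void sketch_bio_free(struct sketch_bio *bio, size_t data_size)
{
    free((char *)bio - sketch_pad(data_size));
}
```

The model shows why the commit insists that data_size be identical everywhere: the size is the only thing that locates one pointer relative to the other, so a mismatched value would compute an address outside the intended region.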
     
  • Allow targets to opt in to WRITE SAME support by setting
    'num_write_same_requests' in the dm_target structure.

    A dm device will only advertise WRITE SAME support if all its
    targets and all its underlying devices support it.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When allocating memory for the userspace ioctl data, set some
    appropriate GFP flags directly instead of using PF_MEMALLOC.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

21 Dec, 2012

22 commits

  • Pull filesystem notification updates from Eric Paris:
    "This pull mostly is about locking changes in the fsnotify system. By
    switching the group lock from a spin_lock() to a mutex() we can now
    hold the lock across things like iput(). This fixes a problem
    involving unmounting a fs and having inodes be busy, first pointed out
    by FAT, but reproducible with tmpfs.

    This also restores signal driven I/O for inotify, which has been
    broken since about 2.6.32."

    Ugh. I *hate* the timing of this. It was rebased after the merge
    window opened, and then left to sit with the pull request coming the day
    before the merge window closes. That's just crap. But apparently the
    patches themselves have been around for over a year, just gathering
    dust, so now it's suddenly critical.

    Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
    Rothwell's fixes from -next.

    * 'for-next' of git://git.infradead.org/users/eparis/notify:
    inotify: automatically restart syscalls
    inotify: dont skip removal of watch descriptor if creation of ignored event failed
    fanotify: dont merge permission events
    fsnotify: make fasync generic for both inotify and fanotify
    fsnotify: change locking order
    fsnotify: dont put marks on temporary list when clearing marks by group
    fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
    fsnotify: pass group to fsnotify_destroy_mark()
    fsnotify: use a mutex instead of a spinlock to protect a groups mark list
    fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
    fsnotify: take groups mark_lock before mark lock
    fsnotify: use reference counting for groups
    fsnotify: introduce fsnotify_get_group()
    inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()

    Linus Torvalds
     
  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton: (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Pull VFS update from Al Viro:
    "fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
    misc stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
    vfs: make lremovexattr retry once on ESTALE error
    vfs: make removexattr retry once on ESTALE
    vfs: make llistxattr retry once on ESTALE error
    vfs: make listxattr retry once on ESTALE error
    vfs: make lgetxattr retry once on ESTALE
    vfs: make getxattr retry once on an ESTALE error
    vfs: allow lsetxattr() to retry once on ESTALE errors
    vfs: allow setxattr to retry once on ESTALE errors
    vfs: allow utimensat() calls to retry once on an ESTALE error
    vfs: fix user_statfs to retry once on ESTALE errors
    vfs: make fchownat retry once on ESTALE errors
    vfs: make fchmodat retry once on ESTALE errors
    vfs: have chroot retry once on ESTALE error
    vfs: have chdir retry lookup and call once on ESTALE error
    vfs: have faccessat retry once on an ESTALE error
    vfs: have do_sys_truncate retry once on an ESTALE error
    vfs: fix renameat to retry on ESTALE errors
    vfs: make do_unlinkat retry once on ESTALE errors
    vfs: make do_rmdir retry once on ESTALE errors
    vfs: add a flags argument to user_path_parent
    ...

    Linus Torvalds
     
  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     
  • Commit 263a523d18bc ("linux/kernel.h: Fix warning seen with W=1 due to
    change in DIV_ROUND_CLOSEST") fixes a warning seen with W=1 due to
    change in DIV_ROUND_CLOSEST.

    Unfortunately, the C compiler converts divide operations with unsigned
    divisors to unsigned, even if the dividend is signed and negative (for
    example, -10 / 5U = 858993457). The C standard says "If one operand has
    unsigned int type, the other operand is converted to unsigned int", so
    the compiler is not to blame. As a result, DIV_ROUND_CLOSEST(0, 2U) and
    similar operations now return bad values, since the automatic conversion
    of expressions such as "0 - 2U/2" to unsigned was not taken into
    account.

    Fix by checking for the divisor variable type when deciding which
    operation to perform. This fixes DIV_ROUND_CLOSEST(0, 2U), but still
    returns bad values for negative dividends divided by unsigned divisors.
    Mark the latter case as unsupported.

    One observed effect of this problem is that the s2c_hwmon driver reports
    a value of 4198403 instead of 0 if the ADC reads 0.

    Other impact is unpredictable. Problem is seen if the divisor is an
    unsigned variable or constant and the dividend is less than (divisor/2).

    Signed-off-by: Guenter Roeck
    Reported-by: Juergen Beisert
    Tested-by: Juergen Beisert
    Cc: Jean Delvare
    Cc: [3.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guenter Roeck
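
The conversion described in this commit is easy to reproduce outside the kernel. The helpers below are hypothetical and only isolate the two expressions mentioned in the message; they are not the DIV_ROUND_CLOSEST macro itself. The value 858993457 quoted above corresponds to a 32-bit unsigned int.

```c
#include <limits.h>

/* Signed dividend, unsigned divisor: the dividend is converted to
 * unsigned int before the division, so a negative value wraps around. */
unsigned int sketch_div_mixed(int dividend, unsigned int divisor)
{
    return dividend / divisor;
}

/* The rounding expression used for non-positive dividends, (x - d/2) / d.
 * With an unsigned divisor, "x - d/2" wraps instead of going negative. */
unsigned int sketch_round_closest_naive(int x, unsigned int d)
{
    return (x - d / 2) / d;
}
```

Here sketch_div_mixed(-10, 5U) evaluates to (UINT_MAX - 9) / 5 and sketch_round_closest_naive(0, 2U) to UINT_MAX / 2, rather than -2 and 0.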
     
  • If a series of scripts are executed, each triggering module loading via
    unprintable bytes in the script header, kernel stack contents can leak
    into the command line.

    Normally execution of binfmt_script and binfmt_misc happens recursively.
    However, when modules are enabled, and unprintable bytes exist in the
    bprm->buf, execution will restart after attempting to load matching
    binfmt modules. Unfortunately, the logic in binfmt_script and
    binfmt_misc does not expect to get restarted. They leave bprm->interp
    pointing to their local stack. This means on restart bprm->interp is
    left pointing into unused stack memory which can then be copied into the
    userspace argv areas.

    After additional study, it seems that both recursion and restart remain
    the desirable way to handle exec with scripts, misc, and modules. As
    such, we need to protect the changes to interp.

    This changes the logic to require allocation for any changes to the
    bprm->interp. To avoid adding a new kmalloc to every exec, the default
    value is left as-is. Only when passing through binfmt_script or
    binfmt_misc does an allocation take place.

    For a proof of concept, see DoTest.sh from:

    http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Where we can pass in LOOKUP_DIRECTORY or LOOKUP_REVAL. Any other flags
    passed in here are currently ignored.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • This function is expected to be called from path-based syscalls to help
    them decide whether to try the lookup and call again in the event that
    they got an -ESTALE return back on an earlier try.

    Currently, we only retry the call once on an ESTALE error, but in the
    event that we decide that that's not enough in the future, we should be
    able to change the logic in this helper without too much effort.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
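
A rough userspace sketch of how such a helper might be used by a path-based call: try once, and if the result is -ESTALE, repeat the lookup and the call one more time with a flag recording that the retry has happened. The flag, the fake operation and all function names are assumptions for illustration; the log does not give the helper's name or signature.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_RETRIED 0x1u     /* hypothetical flag: this attempt is already a retry */

/* Decide whether to retry: only on -ESTALE, and only once. Changing the
 * retry policy later would mean changing just this helper. */
bool sketch_should_retry(long error, unsigned int flags)
{
    return error == -ESTALE && !(flags & SKETCH_RETRIED);
}

/* Hypothetical operation that fails with -ESTALE on its first stale_count calls. */
struct sketch_op {
    int stale_count;
    int calls;
};

long sketch_lookup_and_call(struct sketch_op *op, unsigned int flags)
{
    (void)flags;                /* a real lookup would force revalidation on retry */
    op->calls++;
    if (op->stale_count > 0) {
        op->stale_count--;
        return -ESTALE;
    }
    return 0;
}

/* Shape of a path-based call that uses the helper. */
long sketch_path_call(struct sketch_op *op)
{
    unsigned int flags = 0;
    long error;

retry:
    error = sketch_lookup_and_call(op, flags);
    if (sketch_should_retry(error, flags)) {
        flags |= SKETCH_RETRIED;
        goto retry;
    }
    return error;
}
```

An operation that is stale once succeeds on the second attempt; one that stays stale returns -ESTALE after exactly two attempts.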
     
  • …/linux-fs into for-linus

    Al Viro
     
  • Commit 8e22cc88d68ca1a46d7d582938f979eb640ed30f removes the (un)lock_super
    function definitions but forgets to remove their prototypes.

    Signed-off-by: Alessio Igor Bogani
    Signed-off-by: Al Viro

    Alessio Igor Bogani
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     
  • Mark as cancelled an operation that is in progress rather than pending at the
    time it is cancelled, and call fscache_complete_op() to cancel an operation so
    that blocked ops can be started.

    Signed-off-by: David Howells

    David Howells
     
  • Convert the fscache_object event IDs from #defines into an enum. Also add an
    extra label to the enum to carry the event count and redefine the event mask
    in terms of that.

    Signed-off-by: David Howells

    David Howells
     
  • Make a more complete truncate operation available to CacheFiles (including
    security checks and suchlike) so that it can use this to clear invalidated
    cache files.

    Signed-off-by: David Howells
    Acked-by: Al Viro

    David Howells
     
  • Pull nfsd update from Bruce Fields:
    "Included this time:

    - more nfsd containerization work from Stanislav Kinsbursky: we're
    not quite there yet, but should be by 3.9.

    - NFSv4.1 progress: implementation of basic backchannel security
    negotiation and the mandatory BACKCHANNEL_CTL operation. See

    http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues

    for remaining TODO's

    - Fixes for some bugs that could be triggered by unusual compounds.
    Our xdr code wasn't designed with v4 compounds in mind, and it
    shows. A more thorough rewrite is still a todo.

    - If you've ever seen "RPC: multiple fragments per record not
    supported" logged while using some sort of odd userland NFS client,
    that should now be fixed.

    - Further work from Jeff Layton on our mechanism for storing
    information about NFSv4 clients across reboots.

    - Further work from Bryan Schumaker on his fault-injection mechanism
    (which allows us to discard selective NFSv4 state, to exercise
    rarely-taken recovery code paths in the client.)

    - The usual mix of miscellaneous bugs and cleanup.

    Thanks to everyone who tested or contributed this cycle."

    * 'for-3.8' of git://linux-nfs.org/~bfields/linux: (111 commits)
    nfsd4: don't leave freed stateid hashed
    nfsd4: free_stateid can use the current stateid
    nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
    nfsd: warn on odd reply state in nfsd_vfs_read
    nfsd4: fix oops on unusual readlike compound
    nfsd4: disable zero-copy on non-final read ops
    svcrpc: fix some printks
    NFSD: Correct the size calculation in fault_inject_write
    NFSD: Pass correct buffer size to rpc_ntop
    nfsd: pass proper net to nfsd_destroy() from NFSd kthreads
    nfsd: simplify service shutdown
    nfsd: replace boolean nfsd_up flag by users counter
    nfsd: simplify NFSv4 state init and shutdown
    nfsd: introduce helpers for generic resources init and shutdown
    nfsd: make NFSd service structure allocated per net
    nfsd: make NFSd service boot time per-net
    nfsd: per-net NFSd up flag introduced
    nfsd: move per-net startup code to separated function
    nfsd: pass net to __write_ports() and down
    nfsd: pass net to nfsd_set_nrthreads()
    ...

    Linus Torvalds
     
  • Provide a proper invalidation method rather than relying on the netfs retiring
    the cookie it has and getting a new one. The problem with this is that it isn't
    easy for the netfs to make sure that it has completed/cancelled all its
    outstanding storage and retrieval operations on the cookie it is retiring.

    Instead, have the cache provide an invalidation method that will cancel or wait
    for all currently outstanding operations before invalidating the cache, and
    will cause new operations to queue up behind that. Whilst invalidation is in
    progress, some requests will be rejected until the cache can stack a barrier on
    the operation queue to cause new operations to be deferred behind it.

    Signed-off-by: David Howells

    David Howells
     
  • Pull Ceph update from Sage Weil:
    "There are a few different groups of commits here. The largest is
    Alex's ongoing work to enable the coming RBD features (cloning,
    striping). There is some cleanup in libceph that goes along with it.

    Cyril and David have fixed some problems with NFS reexport (leaking
    dentries and page locks), and there is a batch of patches from Yan
    fixing problems with the fs client when running against a clustered
    MDS. There are a few bug fixes mixed in for good measure, many of
    which will be going to the stable trees once they're upstream.

    My apologies for the late pull. There is still a gremlin in the rbd
    map/unmap code and I was hoping to include the fix for that as well,
    but we haven't been able to confirm the fix is correct yet; I'll send
    that in a separate pull once it's nailed down."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
    rbd: get rid of rbd_{get,put}_dev()
    libceph: register request before unregister linger
    libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
    libceph: init event->node in ceph_osdc_create_event()
    libceph: init osd->o_node in create_osd()
    libceph: report connection fault with warning
    libceph: socket can close in any connection state
    rbd: don't use ENOTSUPP
    rbd: remove linger unconditionally
    rbd: get rid of RBD_MAX_SEG_NAME_LEN
    libceph: avoid using freed osd in __kick_osd_requests()
    ceph: don't reference req after put
    rbd: do not allow remove of mounted-on image
    libceph: Unlock unprocessed pages in start_read() error path
    ceph: call handle_cap_grant() for cap import message
    ceph: Fix __ceph_do_pending_vmtruncate
    ceph: Don't add dirty inode to dirty list if caps is in migration
    ceph: Fix infinite loop in __wake_requests
    ceph: Don't update i_max_size when handling non-auth cap
    bdi_register: add __printf verification, fix arg mismatch
    ...

    Linus Torvalds
     
  • Fix the state management of internal fscache operations and the accounting of
    what operations are in what states.

    This is done by:

    (1) Give struct fscache_operation an enum variable that directly represents the
    state it's currently in, rather than spreading this knowledge over a bunch
    of flags, who's processing the operation at the moment and whether it is
    queued or not.

    This makes it easier to write assertions to check the state at various
    points and to prevent invalid state transitions.

    (2) Add an 'operation complete' state and supply a function to indicate the
    completion of an operation (fscache_op_complete()) and make things call
    it. The final call to fscache_put_operation() can then check that an op
    is in the appropriate state (complete or cancelled).

    (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
    govern the state of an object:

    (a) The ->n_ops is now the number of extant operations on the object
    and is now decremented by fscache_put_operation() only.

    (b) The ->n_in_progress is simply the number of operations that have been
    taken off of the object's pending queue for the purposes of being
    run. This is decremented by fscache_op_complete() only.

    (c) The ->n_exclusive is the number of exclusive ops that have been
    submitted and queued or are in progress. It is decremented by
    fscache_op_complete() and by fscache_cancel_op().

    fscache_put_operation() and fscache_operation_gc() now no longer try to
    clean up ->n_exclusive and ->n_in_progress. That was leading to double
    decrements against fscache_cancel_op().

    fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
    double decrements against fscache_put_operation().

    fscache_submit_exclusive_op() now decides whether it has to queue an op
    based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
    will persist in being true even after all preceding operations have been
    cancelled or completed. Furthermore, if an object is active and there are
    runnable ops against it, there must be at least one op running.

    (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
    provide a function to record completion of the pages as they complete.

    When n_pages reaches 0, the operation is deemed to be complete and
    fscache_op_complete() is called.

    Add calls to fscache_retrieval_complete() anywhere we've finished with a
    page we've been given to read or allocate for. This includes places where
    we just return pages to the netfs for reading from the server and where
    accessing the cache fails and we discard the proposed netfs page.

    The bugs in the unfixed state management manifest themselves as oopses like the
    following where the operation completion gets out of sync with return of the
    cookie by the netfs. This is possible because the cache unlocks and returns
    all the netfs pages before recording its completion - which means that there's
    nothing to stop the netfs discarding them and returning the cookie.

    FS-Cache: Cookie 'NFS.fh' still has outstanding reads
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/cookie.c:519!
    invalid opcode: 0000 [#1] SMP
    CPU 1
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
    RIP: 0010:[] [] __fscache_relinquish_cookie+0x170/0x343 [fscache]
    RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
    RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
    RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
    RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
    R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
    R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
    FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
    Stack:
    ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
    ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
    ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
    Call Trace:
    [] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
    [] nfs_clear_inode+0x3c/0x41 [nfs]
    [] nfs4_evict_inode+0x2f/0x33 [nfs]
    [] evict+0xa1/0x15c
    [] dispose_list+0x2c/0x38
    [] prune_icache_sb+0x28c/0x29b
    [] prune_super+0xd5/0x140
    [] shrink_slab+0x102/0x1ab
    [] balance_pgdat+0x2f2/0x595
    [] ? process_timeout+0xb/0xb
    [] kswapd+0x270/0x289
    [] ? __init_waitqueue_head+0x46/0x46
    [] ? balance_pgdat+0x595/0x595
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     
  • Make fscache_relinquish_cookie() log a warning and wait if there are any
    outstanding reads left on the cookie it was given.

    Signed-off-by: David Howells

    David Howells
     
  • Pull new F2FS filesystem from Jaegeuk Kim:
    "Introduce a new file system, Flash-Friendly File System (F2FS), to
    Linux 3.8.

    Highlights:
    - Add initial f2fs source codes
    - Fix an endian conversion bug
    - Fix build failures on random configs
    - Fix the power-off-recovery routine
    - Minor cleanup, coding style, and typos patches"

    From the Kconfig help text:

    F2FS is based on Log-structured File System (LFS), which supports
    versatile "flash-friendly" features. The design has been focused on
    addressing the fundamental issues in LFS, which are snowball effect
    of wandering tree and high cleaning overhead.

    Since flash-based storages show different characteristics according to
    the internal geometry or flash memory management schemes aka FTL, F2FS
    and tools support various parameters not only for configuring on-disk
    layout, but also for selecting allocation and cleaning algorithms.

    and there's an article by Neil Brown about it on lwn.net:

    http://lwn.net/Articles/518988/

    * tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
    f2fs: fix tracking parent inode number
    f2fs: cleanup the f2fs_bio_alloc routine
    f2fs: introduce accessor to retrieve number of dentry slots
    f2fs: remove redundant call to f2fs_put_page in delete entry
    f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask
    f2fs: rewrite f2fs_bio_alloc to make it simpler
    f2fs: fix a typo in f2fs documentation
    f2fs: remove unused variable
    f2fs: move error condition for mkdir at proper place
    f2fs: remove unneeded initialization
    f2fs: check read only condition before beginning write out
    f2fs: remove unneeded memset from init_once
    f2fs: show error in case of invalid mount arguments
    f2fs: fix the compiler warning for uninitialized use of variable
    f2fs: resolve build failures
    f2fs: adjust kernel coding style
    f2fs: fix endian conversion bugs reported by sparse
    f2fs: remove unneeded version.h header file from f2fs.h
    f2fs: update the f2fs document
    f2fs: update Kconfig and Makefile
    ...

    Linus Torvalds
     
  • Under some circumstances CacheFiles defers the marking of pages with PG_fscache
    so that it can take advantage of pagevecs to reduce the number of calls to
    fscache_mark_pages_cached() and the netfs's hook to keep track of this.

    There are, however, two problems with this:

    (1) It can lead to the PG_fscache mark being applied _after_ the page is set
    PG_uptodate and unlocked (by the call to fscache_end_io()).

    (2) CacheFiles's ref on the page is dropped immediately following
    fscache_end_io() - and so may not still be held when the mark is applied.
    This can lead to the page being passed back to the allocator before the
    mark is applied.

    Fix this by, where appropriate, marking the page before calling
    fscache_end_io() and releasing the page. This means that we can't take
    advantage of pagevecs and have to make a separate call for each page to the
    marking routines.
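
    The ordering problem and its fix can be illustrated with a small
    userspace model. Everything here is a simplification made for the
    example: the toy_* names and struct fields are hypothetical, and the page
    is assumed to start with the cache's reference as the only one left, so
    that dropping it frees the page straight away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page; not the kernel's struct page. */
struct toy_page {
    int  refcount;           /* references still held on the page */
    bool locked;
    bool uptodate;
    bool fscache_mark;       /* models PG_fscache */
    bool freed;              /* page has gone back to the allocator */
    bool marked_after_free;  /* models the "Bad page state" condition */
};

static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        p->freed = true;
}

/* Apply the mark; record it if the page had already been freed. */
static void toy_mark_cached(struct toy_page *p)
{
    if (p->freed)
        p->marked_after_free = true;
    p->fscache_mark = true;
}

/* Models the end of I/O: page becomes uptodate and is unlocked. */
static void toy_end_io(struct toy_page *p)
{
    p->uptodate = true;
    p->locked = false;
}

/* Old ordering: the mark is deferred until after end-io and release. */
static void toy_finish_deferred(struct toy_page *p)
{
    toy_end_io(p);
    toy_put_page(p);
    toy_mark_cached(p);
}

/* Fixed ordering: mark first, then end-io, then drop the reference. */
static void toy_finish_fixed(struct toy_page *p)
{
    toy_mark_cached(p);
    toy_end_io(p);
    toy_put_page(p);
}
```

    With a page whose refcount starts at 1, toy_finish_deferred() leaves
    marked_after_free set, while toy_finish_fixed() applies the mark while
    the reference is still held. The cost, as noted above, is one marking
    call per page rather than one per batch.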

    The symptoms of the bug are "Bad page state" errors cropping up under memory
    pressure, for example:

    BUG: Bad page state in process tar pfn:002da
    page:ffffea0000009fb0 count:0 mapcount:0 mapping: (null) index:0x1447
    page flags: 0x1000(private_2)
    Pid: 4574, comm: tar Tainted: G W 3.1.0-rc4-fsdevel+ #1064
    Call Trace:
    [] ? dump_page+0xb9/0xbe
    [] bad_page+0xd5/0xea
    [] get_page_from_freelist+0x35b/0x46a
    [] __alloc_pages_nodemask+0x362/0x662
    [] __do_page_cache_readahead+0x13a/0x267
    [] ? __do_page_cache_readahead+0xa2/0x267
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x28b/0x29a
    [] ? ondemand_readahead+0x163/0x29a
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x2ab/0x67e
    [] nfs_file_read+0xa4/0xc9 [nfs]
    [] do_sync_read+0xba/0xfa
    [] ? security_file_permission+0x7b/0x84
    [] ? rw_verify_area+0xab/0xc8
    [] vfs_read+0xaa/0x13a
    [] sys_read+0x45/0x6c
    [] system_call_fastpath+0x16/0x1b

    As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.

    Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
    set appropriately showed that sometimes it wasn't. This led to the discovery
    that sometimes the page has apparently been reclaimed by the time the marker
    got to see it.

    Reported-by: M. Stevens
    Signed-off-by: David Howells
    Reviewed-by: Jeff Layton

    David Howells