10 Oct, 2012

1 commit

  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
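
    The missing check described above can be sketched in userspace C. This is
    only an illustration, not the patch itself: struct toy_fid and
    toy_fh_to_inum() are hypothetical names, and the sketch assumes that fh_len
    counts 32-bit words and that the inode number is assembled from raw[2] and
    raw[1].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a file handle: an array of 32-bit words
 * whose valid length is passed separately by the caller. */
struct toy_fid {
    uint32_t raw[3];
};

/* Returns 0 and stores the inode number on success, -1 if the handle
 * is too short. fh_len is assumed to count 32-bit words. */
static int toy_fh_to_inum(const struct toy_fid *fid, int fh_len, uint64_t *inum)
{
    if (fh_len < 3)
        return -1;      /* raw[2] is not covered by the handle: do not touch it */

    /* Only after the length check is it safe to read all three words. */
    *inum = ((uint64_t)fid->raw[2] << 32) | fid->raw[1];
    return 0;
}
```

    The point of the ordering is that the length test runs before any word of
    the handle is read, so a short buffer at the end of a page is never
    dereferenced past its end.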
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support;
    a filesystem that uses filemap_fault() can use generic_file_remap_pages().

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2012

5 commits

  • Pull xfs update from Ben Myers:
    "Several enhancements and cleanups:

    - make inode32 and inode64 remountable options
    - SEEK_HOLE/SEEK_DATA enhancements
    - cleanup struct declarations in xfs_mount.h"

    * tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs:
    xfs: Make inode32 a remountable option
    xfs: add inode64->inode32 transition into xfs_set_inode32()
    xfs: Fix mp->m_maxagi update during inode64 remount
    xfs: reduce code duplication handling inode32/64 options
    xfs: make inode64 as the default allocation mode
    xfs: Fix m_agirotor reset during AG selection
    Make inode64 a remountable option
    xfs: stop the sync worker before xfs_unmountfs
    xfs: xfs_seek_hole() refinement with hole searching from page cache for unwritten extents
    xfs: xfs_seek_data() refinement with unwritten extents check up from page cache
    xfs: Introduce a helper routine to probe data or hole offset from page cache
    xfs: Remove type argument from xfs_seek_data()/xfs_seek_hole()
    xfs: fix race while discarding buffers [V4]
    xfs: check for possible overflow in xfs_ioc_trim
    xfs: unlock the AGI buffer when looping in xfs_dialloc
    xfs: kill struct declarations in xfs_mount.h
    xfs: fix uninitialised variable in xfs_rtbuf_get()

    Linus Torvalds
     
  • Pull vfs update from Al Viro:

    "- big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid where appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
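
    The idea of letting the compiler find unconverted code can be illustrated
    with a small userspace C sketch. Every name here (toy_uid_t, toy_kuid_t,
    struct toy_user_ns, toy_make_kuid(), toy_from_kuid(), toy_uid_eq()) is
    hypothetical, and the single-range mapping is an assumption made for
    brevity. It is a model of the approach, not the kernel's implementation.

```c
#include <stdbool.h>

typedef unsigned int toy_uid_t;                 /* raw id as seen at the edges */
typedef struct { toy_uid_t val; } toy_kuid_t;   /* internal id, a distinct type */

#define TOY_INVALID_UID ((toy_uid_t)-1)

/* Hypothetical namespace: raw ids [first, first + count) map onto
 * internal ids starting at kfirst. */
struct toy_user_ns {
    toy_uid_t first, kfirst, count;
};

/* Edge of the system: raw id -> internal id. */
static toy_kuid_t toy_make_kuid(const struct toy_user_ns *ns, toy_uid_t uid)
{
    toy_kuid_t k = { TOY_INVALID_UID };

    if (uid >= ns->first && uid - ns->first < ns->count)
        k.val = ns->kfirst + (uid - ns->first);
    return k;
}

/* Edge of the system: internal id -> raw id. */
static toy_uid_t toy_from_kuid(const struct toy_user_ns *ns, toy_kuid_t k)
{
    if (k.val >= ns->kfirst && k.val - ns->kfirst < ns->count)
        return ns->first + (k.val - ns->kfirst);
    return TOY_INVALID_UID;
}

/* Internal ids are compared through a helper, never with ==. */
static bool toy_uid_eq(toy_kuid_t a, toy_kuid_t b)
{
    return a.val == b.val;
}
```

    Because toy_kuid_t is a struct, an assignment such as "toy_kuid_t k = 5;"
    or a comparison such as "k == uid" does not compile, so any code that
    still mixes raw and internal values is flagged at build time.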
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

27 Sep, 2012

10 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • As inode64 is the default option now, and was also made remountable
    previously, inode32 can also be remounted on-the-fly when it is needed.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     
  • To make inode32 a remountable option, xfs_set_inode32() should be able
    to make a transition from inode64 option, disabling inode allocation on
    higher AGs.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     
  • With the changes made on xfs_set_inode64(), to make it behave as
    xfs_set_inode32() (now leaving to the caller the responsibility to update
    mp->m_maxagi), we use the return value of xfs_set_inode64() to update
    mp->m_maxagi during remount.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     
  • Add xfs_set_inode32() to be used to enable inode32 allocation mode. This
    will reduce the amount of duplicated code needed to mount/remount a
    filesystem with inode32 option. This patch also changes
    xfs_set_inode64() to return the maximum AG number that inodes can be
    allocated in instead of setting mp->m_maxagi by itself, so that the behaviour
    is the same as xfs_set_inode32(). This simplifies code that calls these
    functions and needs to know the maximum AG that inodes can be allocated
    in.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     
  • Since 64-bit inodes can be accessed while using inode32, and these can
    also be used on 32-bit kernels, there is no reason to still keep inode32
    as the default mount option. If the filesystem cannot handle 64-bit
    inode numbers (i.e. CONFIG_LBDAF is not enabled and BITS_PER_LONG == 32),
    XFS_MOUNT_SMALL_INUMS will still be set by default, so inode64 is not an
    unconditional default value.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     
  • xfs_ialloc_next_ag() currently resets m_agirotor when it is equal to
    m_maxagi:

    if (++mp->m_agirotor == mp->m_maxagi)
            mp->m_agirotor = 0;

    But, if for some reason mp->m_maxagi changes to a lower value than
    current m_agirotor, this condition will never be true, causing
    m_agirotor to exceed the maximum allowed value (m_maxagi).

    This mainly affects lookups for xfs_perag structs in their radix
    tree, since the agno value used for the lookup is based on m_agirotor.
    An out-of-range m_agirotor may cause a lookup failure, which in turn will
    return NULL.

    As an example, the value of m_maxagi is decreased during the
    inode64->inode32 remount process, which is where I found this problem.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
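
    The failure mode can be reproduced in a few lines of userspace C. struct
    toy_mount and both helpers are hypothetical miniatures. The first helper
    uses the wrap quoted in the message. The second uses ">=", which is an
    assumption about one way to make the wrap robust, not a copy of the patch.

```c
/* Hypothetical miniature of the two mount fields named in the message. */
struct toy_mount {
    unsigned int m_agirotor;    /* next AG to hand out */
    unsigned int m_maxagi;      /* one past the highest usable AG */
};

/* Wrap as quoted in the message: resets only on exact equality. */
static unsigned int toy_next_ag_eq(struct toy_mount *mp)
{
    unsigned int agno = mp->m_agirotor;

    if (++mp->m_agirotor == mp->m_maxagi)
        mp->m_agirotor = 0;
    return agno;
}

/* Assumed robust variant: also resets when the rotor is already past
 * a limit that has been lowered behind its back. */
static unsigned int toy_next_ag_ge(struct toy_mount *mp)
{
    unsigned int agno = mp->m_agirotor;

    if (++mp->m_agirotor >= mp->m_maxagi)
        mp->m_agirotor = 0;
    return agno;
}
```

    Starting from m_agirotor = 6 after m_maxagi has dropped to 4, the "=="
    variant keeps counting upwards (7, 8, ...), while the ">=" variant brings
    the rotor back to 0 on the next call. Note that the value returned by
    that first call is still the stale one in both variants.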
     
  • There is no reason why a user must unmount and mount an XFS
    filesystem to enable the 'inode64' option, so this patch makes it a
    remountable option.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Carlos Maiolino
     

19 Sep, 2012

1 commit

  • Cancel work of the xfs_sync_worker before teardown of the log in
    xfs_unmountfs. This prevents occasional crashes on unmount like so:

    PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3"
    #0 [c5377d28] crash_kexec at c0292c94
    #1 [c5377d80] oops_end at c07090c2
    #2 [c5377d98] no_context at c06f614e
    #3 [c5377dbc] __bad_area_nosemaphore at c06f6281
    #4 [c5377df4] bad_area_nosemaphore at c06f629b
    #5 [c5377e00] do_page_fault at c070b0cb
    #6 [c5377e7c] error_code (via page_fault) at c070892c
    EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0
    DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20
    CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246
    #7 [c5377eb0] atomic64_read_cx8 at c0481ad0
    #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs]
    #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs]
    #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs]
    #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs]
    #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs]
    #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs]
    #14 [c5377f58] process_one_work at c024ee4c
    #15 [c5377f98] worker_thread at c024f43d
    #16 [c5377fbc] kthread at c025326b
    #17 [c5377fe8] kernel_thread_helper at c070e834

    PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount"
    #0 [cde0fda0] __schedule at c0706595
    #1 [cde0fe28] schedule at c0706b89
    #2 [cde0fe30] schedule_timeout at c0705600
    #3 [cde0fe94] __down_common at c0706098
    #4 [cde0fec8] __down at c0706122
    #5 [cde0fed0] down at c025936f
    #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs]
    #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs]
    #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs]
    #9 [cde0ff1c] generic_shutdown_super at c0333d7a
    #10 [cde0ff38] kill_block_super at c0333e0f
    #11 [cde0ff48] deactivate_locked_super at c0334218
    #12 [cde0ff58] deactivate_super at c033495d
    #13 [cde0ff68] mntput_no_expire at c034bc13
    #14 [cde0ff7c] sys_umount at c034cc69
    #15 [cde0ffa0] sys_oldumount at c034ccd4
    #16 [cde0ffb0] system_call at c0707e66

    commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up
    at a later date.

    Signed-off-by: Ben Myers
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely

    Ben Myers
     

18 Sep, 2012

4 commits

  • Modify quota_send_warning to take struct kqid instead of a type and
    identifier pair.

    When sending netlink broadcasts always convert uids and quota
    identifiers into the initial user namespace. There is as yet no way to
    send a netlink broadcast message with different contents to receivers
    in different namespaces, so for the time being just map all of the
    identifiers into the initial user namespace which preserves the
    current behavior.

    Change the callers of quota_send_warning in gfs2, xfs and dquot
    to generate a struct kqid to pass to quota_send_warning. When
    all of the user namespace conversions are complete, struct kqid
    values will be available without need for conversion, but a conversion
    is needed now to avoid needing to convert everything at once.

    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Steven Whitehouse
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Update the quotactl user space interface to successfully compile with
    user namespaces support enabled and to hand off quota identifiers to
    lower layers of the kernel in struct kqid instead of type and qid
    pairs.

    The quota on function is not converted because, while it takes a quota
    type and an id, the id is the on-disk quota format to use, which
    is something completely different.

    The signatures of two struct quotactl_ops methods, get_dqblk and
    set_dqblk, were changed to take struct kqid arguments.

    The dquot, xfs, and ocfs2 implementations of get_dqblk and set_dqblk
    are minimally changed so that the code continues to work with
    the change in parameter type.

    This is the first in a series of changes to always store quota
    identifiers in the kernel in struct kqid and only use raw type and qid
    values when interacting with on disk structures or userspace. Always
    using struct kqid internally makes it hard to miss places that need
    conversion to or from the kernel internal values.

    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
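
    A rough userspace C sketch of what "struct kqid instead of type and qid
    pairs" buys. The names (enum toy_quota_type, struct toy_kqid,
    toy_make_kqid(), toy_qid_eq()) are hypothetical, and the layout is an
    assumption made for illustration rather than the kernel's definition.

```c
#include <stdbool.h>

enum toy_quota_type { TOY_USRQUOTA, TOY_GRPQUOTA, TOY_PRJQUOTA };

/* Hypothetical tagged identifier: the id never travels without its type. */
struct toy_kqid {
    enum toy_quota_type type;
    unsigned int id;
};

/* Built once, where the raw pair enters from userspace or from disk. */
static struct toy_kqid toy_make_kqid(enum toy_quota_type type, unsigned int raw_id)
{
    struct toy_kqid q = { type, raw_id };
    return q;
}

/* Two identifiers match only if both the type and the id match. */
static bool toy_qid_eq(struct toy_kqid a, struct toy_kqid b)
{
    return a.type == b.type && a.id == b.id;
}
```

    With a single argument of this kind, a get_dqblk-style method cannot be
    handed a user id together with the group quota type by accident, and
    uid 1000 and gid 1000 no longer compare equal.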
     
  • - Pass the user namespace in which the xattr's uid and gid values are
    stored to posix_acl_from_xattr.

    - Pass the user namespace into which kuid and kgid values should be
    converted when storing uid and gid values in an xattr to posix_acl_to_xattr.

    - Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to
    pass in &init_user_ns.

    In the short term this change is not strictly needed but it makes the
    code clearer. In the longer term this change is necessary to be able to
    mount filesystems outside of the initial user namespace that natively
    store posix acls in the linux xattr format.

    Cc: Theodore Tso
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Al Viro
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Cancel work of the xfs_sync_worker before teardown of the log in
    xfs_unmountfs. This prevents occasional crashes on unmount like so:

    PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3"
    #0 [c5377d28] crash_kexec at c0292c94
    #1 [c5377d80] oops_end at c07090c2
    #2 [c5377d98] no_context at c06f614e
    #3 [c5377dbc] __bad_area_nosemaphore at c06f6281
    #4 [c5377df4] bad_area_nosemaphore at c06f629b
    #5 [c5377e00] do_page_fault at c070b0cb
    #6 [c5377e7c] error_code (via page_fault) at c070892c
    EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0
    DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20
    CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246
    #7 [c5377eb0] atomic64_read_cx8 at c0481ad0
    #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs]
    #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs]
    #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs]
    #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs]
    #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs]
    #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs]
    #14 [c5377f58] process_one_work at c024ee4c
    #15 [c5377f98] worker_thread at c024f43d
    #16 [c5377fbc] kthread at c025326b
    #17 [c5377fe8] kernel_thread_helper at c070e834

    PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount"
    #0 [cde0fda0] __schedule at c0706595
    #1 [cde0fe28] schedule at c0706b89
    #2 [cde0fe30] schedule_timeout at c0705600
    #3 [cde0fe94] __down_common at c0706098
    #4 [cde0fec8] __down at c0706122
    #5 [cde0fed0] down at c025936f
    #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs]
    #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs]
    #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs]
    #9 [cde0ff1c] generic_shutdown_super at c0333d7a
    #10 [cde0ff38] kill_block_super at c0333e0f
    #11 [cde0ff48] deactivate_locked_super at c0334218
    #12 [cde0ff58] deactivate_super at c033495d
    #13 [cde0ff68] mntput_no_expire at c034bc13
    #14 [cde0ff7c] sys_umount at c034cc69
    #15 [cde0ffa0] sys_oldumount at c034ccd4
    #16 [cde0ffb0] system_call at c0707e66

    commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up
    at a later date.

    Signed-off-by: Ben Myers
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely

    Ben Myers
     

30 Aug, 2012

1 commit

  • While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with
    buffers from lru list), there is a possibility to have xfs_buf_stale() racing
    with it, and removing buffers from dispose list before xfs_buftarg_shrink() does
    it.

    This happens because xfs_buftarg_shrink() handles the dispose list without
    locking and the test condition in xfs_buf_stale() checks for the buffer being in
    *any* list:

    if (!list_empty(&bp->b_lru))

    If the buffer happens to be on dispose list, this causes the buffer counter of
    lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink()
    and another in xfs_buf_stale()), corrupting the accounting of the lru list.

    This may cause xfs_buftarg_shrink() to return a wrong value to the memory
    shrinker shrink_slab(), and such an accounting error may also cause an underflowed
    value to be returned; since the counter is lower than the current number of
    items in the lru list, a decrement may happen when the counter is 0, causing
    an underflow on the counter.

    The fix uses a new flag field (and a new buffer flag) to serialize buffer
    handling during the shrink process. The new flag field has been designed to use
    btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism.

    dchinner, sandeen, aquini and aris also deserve credits for this.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Ben Myers
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Carlos Maiolino
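
    The double decrement can be modelled in single-threaded userspace C. The
    structures and helpers below are hypothetical, locking is left out
    entirely, and the "disposing" flag is only an assumed stand-in for the new
    buffer flag mentioned in the message.

```c
#include <stdbool.h>

struct toy_buftarg {
    unsigned int lru_nr;    /* number of buffers accounted to the LRU */
};

struct toy_buf {
    bool on_list;           /* linked on some list: LRU or dispose list */
    bool disposing;         /* assumed flag: already moved to the dispose list */
};

/* Shrinker side: move a buffer from the LRU to the private dispose list. */
static void toy_shrink_isolate(struct toy_buftarg *btp, struct toy_buf *bp)
{
    bp->disposing = true;   /* still on a list, but no longer counted */
    btp->lru_nr--;
}

/* Stale side, unguarded: "on any list" is treated as "on the LRU". */
static void toy_stale_unguarded(struct toy_buftarg *btp, struct toy_buf *bp)
{
    if (bp->on_list) {
        bp->on_list = false;
        btp->lru_nr--;      /* second decrement for the same buffer */
    }
}

/* Stale side, guarded: buffers owned by the shrinker are left alone. */
static void toy_stale_guarded(struct toy_buftarg *btp, struct toy_buf *bp)
{
    if (bp->on_list && !bp->disposing) {
        bp->on_list = false;
        btp->lru_nr--;
    }
}
```

    With one buffer on the LRU, isolating it and then calling the unguarded
    variant wraps the unsigned counter below zero, while the guarded variant
    leaves it at 0.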
     

25 Aug, 2012

5 commits

  • xfs_seek_hole() refinement with hole searching from page cache for unwritten extent.

    Signed-off-by: Jie Liu
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Jeff Liu
     
  • xfs_seek_data() refinement with unwritten extents check up from page cache.

    Signed-off-by: Jie Liu
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Jeff Liu
     
  • Introduce helpers to probe data or hole offset from page cache.

    Signed-off-by: Jie Liu
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Jeff Liu
     
  • The type is already indicated by the function naming explicitly, so this argument
    can be omitted from those calls.

    Signed-off-by: Jie Liu
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Jeff Liu
     
  • While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with
    buffers from lru list), there is a possibility to have xfs_buf_stale() racing
    with it, and removing buffers from dispose list before xfs_buftarg_shrink() does
    it.

    This happens because xfs_buftarg_shrink() handles the dispose list without
    locking and the test condition in xfs_buf_stale() checks for the buffer being in
    *any* list:

    if (!list_empty(&bp->b_lru))

    If the buffer happens to be on dispose list, this causes the buffer counter of
    lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink()
    and another in xfs_buf_stale()), corrupting the accounting of the lru list.

    This may cause xfs_buftarg_shrink() to return a wrong value to the memory
    shrinker shrink_slab(), and such an accounting error may also cause an underflowed
    value to be returned; since the counter is lower than the current number of
    items in the lru list, a decrement may happen when the counter is 0, causing
    an underflow on the counter.

    The fix uses a new flag field (and a new buffer flag) to serialize buffer
    handling during the shrink process. The new flag field has been designed to use
    btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism.

    dchinner, sandeen, aquini and aris also deserve credits for this.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Ben Myers
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Carlos Maiolino
     

24 Aug, 2012

3 commits

  • If range.start or range.minlen is bigger than filesystem size, return
    invalid value error. This fixes possible overflow in BTOBB macro when
    passed value was nearly ULLONG_MAX.

    Signed-off-by: Tomas Racek
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Tomas Racek
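
    The overflow is easy to see with a round-up conversion in userspace C.
    TOY_BTOBB below is an assumed shape for a bytes-to-512-byte-blocks macro,
    and toy_check_trim_range() is a hypothetical helper that mirrors the check
    described in the message. Neither is taken from the patch.

```c
#include <stdint.h>

#define TOY_BBSHIFT 9
#define TOY_BBSIZE  (1u << TOY_BBSHIFT)

/* Bytes -> 512-byte basic blocks, rounding up. The addition wraps
 * around when bytes is within 511 of UINT64_MAX. */
#define TOY_BTOBB(bytes) \
    (((uint64_t)(bytes) + TOY_BBSIZE - 1) >> TOY_BBSHIFT)

/* Reject out-of-range requests before any conversion is attempted.
 * Returns 0 if the range is acceptable, -1 as a stand-in for the
 * invalid value error. */
static int toy_check_trim_range(uint64_t start, uint64_t minlen,
                                uint64_t fs_bytes)
{
    if (start > fs_bytes || minlen > fs_bytes)
        return -1;
    return 0;
}
```

    Without the check, a value close to the 64-bit maximum converts to a tiny
    block count: TOY_BTOBB(UINT64_MAX) evaluates to 0. With the check in
    place, such a request is refused before the conversion runs.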
     
  • Also update some comments in the area to make the code easier to read.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Results in this assert failure in generic/090:

    XFS: Assertion failed: *nmap >= 1, file: fs/xfs/xfs_bmap.c, line: 4363
    .....
    Call Trace:
    [] xfs_bmapi_read+0x6b/0x370
    [] xfs_rtbuf_get+0x42/0x130
    [] xfs_rtget_summary+0x89/0x120
    [] xfs_rtallocate_extent_size+0xce/0x340
    [] xfs_rtallocate_extent+0x240/0x290
    [] xfs_bmap_rtalloc+0x1ba/0x340
    [] xfs_bmap_alloc+0x35/0x40
    [] xfs_bmapi_allocate+0xf1/0x350
    [] xfs_bmapi_write+0x66e/0xa60
    [] xfs_iomap_write_direct+0x22a/0x3f0
    [] __xfs_get_blocks+0x38b/0x5d0
    [] xfs_get_blocks_direct+0x14/0x20
    [] do_blockdev_direct_IO+0xf71/0x1eb0
    [] __blockdev_direct_IO+0x55/0x60
    [] xfs_vm_direct_IO+0x11a/0x1e0
    [] generic_file_direct_write+0xd7/0x1b0
    [] xfs_file_dio_aio_write+0x13c/0x320
    [] xfs_file_aio_write+0x1c2/0x1d0
    [] do_sync_write+0xa7/0xe0
    [] vfs_write+0xa8/0x160
    [] sys_pwrite64+0x92/0xb0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Dave Chinner
    Signed-off-by: Ben Myers

    Dave Chinner
     

21 Aug, 2012

1 commit

  • flush[_delayed]_work_sync() are now spurious. Mark them deprecated
    and convert all users to flush[_delayed]_work().

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant and the regular flushes guarantee that the work item is
    not pending or running on any CPU on return, so there's no reason to
    use the sync flushes at all and they're going away.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Ian Campbell
    Cc: Jens Axboe
    Cc: Mattia Dongili
    Cc: Kent Yoder
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: Karsten Keil
    Cc: Bryan Wu
    Cc: Benjamin Herrenschmidt
    Cc: Alasdair Kergon
    Cc: Mauro Carvalho Chehab
    Cc: Florian Tobias Schandinat
    Cc: David Woodhouse
    Cc: "David S. Miller"
    Cc: linux-wireless@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: Sangbeom Kim
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Eric Van Hensbergen
    Cc: Takashi Iwai
    Cc: Steven Whitehouse
    Cc: Petr Vandrovec
    Cc: Mark Fasheh
    Cc: Christoph Hellwig
    Cc: Avi Kivity

    Tejun Heo
     

17 Aug, 2012

4 commits

  • If range.start or range.minlen is bigger than filesystem size, return
    invalid value error. This fixes possible overflow in BTOBB macro when
    passed value was nearly ULLONG_MAX.

    Signed-off-by: Tomas Racek
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Tomas Racek
     
  • Also update some comments in the area to make the code easier to read.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • I noticed that "struct xfs_mount_args" was still declared in
    "fs/xfs/xfs_mount.h". That struct doesn't even exist any more (and
    is obviously not referenced elsewhere in that header file). While
    in there, delete four other unneeded struct declarations in that
    file.

    Doing so highlights that "fs/xfs/xfs_trace.h" was relying indirectly
    on "xfs_mount.h" to be #included in order to declare "struct
    xfs_bmbt_irec", so add that declaration to resolve that issue.

    Signed-off-by: Alex Elder
    Signed-off-by: Ben Myers

    Alex Elder
     
  • Results in this assert failure in generic/090:

    XFS: Assertion failed: *nmap >= 1, file: fs/xfs/xfs_bmap.c, line: 4363
    .....
    Call Trace:
    [] xfs_bmapi_read+0x6b/0x370
    [] xfs_rtbuf_get+0x42/0x130
    [] xfs_rtget_summary+0x89/0x120
    [] xfs_rtallocate_extent_size+0xce/0x340
    [] xfs_rtallocate_extent+0x240/0x290
    [] xfs_bmap_rtalloc+0x1ba/0x340
    [] xfs_bmap_alloc+0x35/0x40
    [] xfs_bmapi_allocate+0xf1/0x350
    [] xfs_bmapi_write+0x66e/0xa60
    [] xfs_iomap_write_direct+0x22a/0x3f0
    [] __xfs_get_blocks+0x38b/0x5d0
    [] xfs_get_blocks_direct+0x14/0x20
    [] do_blockdev_direct_IO+0xf71/0x1eb0
    [] __blockdev_direct_IO+0x55/0x60
    [] xfs_vm_direct_IO+0x11a/0x1e0
    [] generic_file_direct_write+0xd7/0x1b0
    [] xfs_file_dio_aio_write+0x13c/0x320
    [] xfs_file_aio_write+0x1c2/0x1d0
    [] do_sync_write+0xa7/0xe0
    [] vfs_write+0xa8/0x160
    [] sys_pwrite64+0x92/0xb0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Dave Chinner
    Signed-off-by: Ben Myers

    Dave Chinner
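
The header dependency described in the xfs_mount.h cleanup above (a header that names a struct in its prototypes needs its own declaration of that struct rather than relying on another header) can be shown in a few lines of plain C. The names extent_rec and extent_len are hypothetical stand-ins, not XFS identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: tells the compiler the tag exists at file scope.
 * Without this line, the tag in the prototype below would be scoped to
 * the prototype alone and would not match the later definition. */
struct extent_rec;

/* A prototype may take a pointer to an incomplete type. */
static long extent_len(const struct extent_rec *rec);

/* The full definition can come later, or from another header. */
struct extent_rec {
    long startoff;
    long blockcount;
};

static long extent_len(const struct extent_rec *rec)
{
    assert(rec != NULL);
    return rec->blockcount;
}
```

A header that carries its own forward declarations keeps compiling when an unrelated header stops providing them, which is the situation the commit resolves for "struct xfs_bmbt_irec".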
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

2 commits

  • Generic code now blocks all writers from standard write paths. So we add
    blocking of all writers coming from ioctl (as a bonus, this protects ioctls
    against a racing read-only remount) and convert xfs_file_aio_write() to a
    non-racy freeze protection. We also keep freeze protection on transaction
    start to block internal filesystem writes such as removal of preallocated
    blocks.

    CC: Ben Myers
    CC: Alex Elder
    CC: xfs@oss.sgi.com
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Pull xfs update from Ben Myers:
    "Numerous cleanups and several bug fixes. Here are some highlights:

    - Discontiguous directory buffer support
    - Inode allocator refactoring
    - Removal of the IO lock in inode reclaim
    - Implementation of .update_time
    - Fix for handling of EOF in xfs_vm_writepage
    - Fix for races in xfsaild, and idle mode is re-enabled
    - Fix for a crash in xfs_buf completion handlers on unmount."

    Fix up trivial conflicts in fs/xfs/{xfs_buf.c,xfs_log.c,xfs_log_priv.h}
    due to duplicate patches that had already been merged for 3.5.

    * tag 'for-linus-v3.6-rc1' of git://oss.sgi.com/xfs/xfs: (44 commits)
    xfs: wait for the write the superblock on unmount
    xfs: re-enable xfsaild idle mode and fix associated races
    xfs: remove iolock lock classes
    xfs: avoid the iolock in xfs_free_eofblocks for evicted inodes
    xfs: do not take the iolock in xfs_inactive
    xfs: remove xfs_inactive_attrs
    xfs: clean up xfs_inactive
    xfs: do not read the AGI buffer in xfs_dialloc until nessecary
    xfs: refactor xfs_ialloc_ag_select
    xfs: add a short cut to xfs_dialloc for the non-NULL agbp case
    xfs: remove the alloc_done argument to xfs_dialloc
    xfs: split xfs_dialloc
    xfs: remove xfs_ialloc_find_free
    Prefix IO_XX flags with XFS_IO_XX to avoid namespace colision.
    xfs: remove xfs_inotobp
    xfs: merge xfs_itobp into xfs_imap_to_bp
    xfs: handle EOF correctly in xfs_vm_writepage
    xfs: implement ->update_time
    xfs: fix comment typo of struct xfs_da_blkinfo.
    xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks
    ...

    Linus Torvalds
     

30 Jul, 2012

1 commit

  • v2: Add the xfs_buf_lock to xfs_quiesce_attr().
    Add explanation of why xfs_buf_lock() is used to wait for the write.

    xfs_wait_buftarg() does not wait for the completion of the write of the
    uncached superblock. This write can race with the shutdown of the log
    and cause a panic if the write does not win the race.

    During the log write, xfsaild_push() will lock the buffer and set the
    XBF_ASYNC flag. Because the XBF_ASYNC flag is set, complete() is not
    performed on the buffer's iowait entry, so we cannot call xfs_buf_iowait()
    to wait for the write to complete. The buffer's lock is held until the
    write is complete, so we can block on an xfs_buf_lock() request to be
    notified that the write is complete.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Mark Tinguely
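
The waiting technique in this commit (block on the buffer lock because the lock is only released once the asynchronous write finishes) can be modelled in userspace. The sketch below is an analogy, not kernel code: it assumes a POSIX system with unnamed semaphores, uses a binary semaphore in place of the buffer lock, and fake_buf, buf_lock(), async_write() and write_and_wait() are all hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

struct fake_buf {
    sem_t lock;      /* binary semaphore standing in for the buffer lock */
    int   written;   /* set by the I/O thread before it drops the lock */
};

/* Take the lock, retrying if a signal interrupts the wait. */
static void buf_lock(struct fake_buf *bp)
{
    while (sem_wait(&bp->lock) == -1 && errno == EINTR)
        ;
}

/* Plays the role of I/O completion: finish the write, then unlock. */
static void *async_write(void *arg)
{
    struct fake_buf *bp = arg;

    bp->written = 1;
    sem_post(&bp->lock);
    return NULL;
}

/* Returns 1 if the waiter saw the completed write after getting the lock. */
static int write_and_wait(void)
{
    struct fake_buf buf = { .written = 0 };
    pthread_t io;
    int rc, seen;

    rc = sem_init(&buf.lock, 0, 1);
    assert(rc == 0);

    buf_lock(&buf);                 /* submitter locks the buffer */
    rc = pthread_create(&io, NULL, async_write, &buf);
    assert(rc == 0);
    (void)rc;

    buf_lock(&buf);                 /* blocks until the write has finished */
    seen = buf.written;
    sem_post(&buf.lock);

    pthread_join(io, NULL);
    sem_destroy(&buf.lock);
    return seen;
}
```

A semaphore is used rather than a pthread mutex because the lock is released by a different thread than the one that took it, which mirrors how the commit describes the buffer lock being held until the write completes.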