07 Mar, 2012

1 commit


06 Mar, 2012

5 commits

  • Merge the emailed series of 19 patches from Andrew Morton

    * akpm:
    rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
    memcg: fix mapcount check in move charge code for anonymous page
    mm: thp: fix BUG on mm->nr_ptes
    alpha: fix 32/64-bit bug in futex support
    memcg: fix GPF when cgroup removal races with last exit
    debugobjects: Fix selftest for static warnings
    floppy/scsi: fix setting of BIO flags
    memcg: fix deadlock by inverting lrucare nesting
    drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
    c2port: class_create() returns an ERR_PTR
    pps: class_create() returns an ERR_PTR, not NULL
    hung_task: fix the broken rcu_lock_break() logic
    vfork: kill PF_STARTING
    coredump_wait: don't call complete_vfork_done()
    vfork: make it killable
    vfork: introduce complete_vfork_done()
    aio: wake up waiters when freeing unused kiocbs
    kprobes: return proper error code from register_kprobe()
    kmsg_dump: don't run on non-error paths by default

    Linus Torvalds
     
  • Now that CLONE_VFORK is killable, coredump_wait() no longer needs
    complete_vfork_done(). zap_threads() should find and kill all tasks with
    the same ->mm, this includes our parent if ->vfork_done is set.

    mm_release() becomes the only caller, unexport complete_vfork_done().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    Move the clear-and-complete-vfork_done code into the new trivial helper,
    complete_vfork_done().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
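
The clear-and-complete pattern described in this entry can be sketched in user space as follows. This is only an illustration: `task_sketch`, `completion_sketch` and `complete_vfork_done_sketch()` are hypothetical stand-ins, not the kernel's types or its actual helper, and setting a flag stands in for complete().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_sketch {
    bool done;
};

/* Hypothetical stand-in for the relevant part of a task. */
struct task_sketch {
    struct completion_sketch *vfork_done; /* NULL when nobody is waiting */
};

/* Clear the pointer first, then signal, so the completion fires at most once. */
static void complete_vfork_done_sketch(struct task_sketch *tsk)
{
    struct completion_sketch *done = tsk->vfork_done;

    assert(done != NULL);   /* this sketch assumes callers check the pointer */
    tsk->vfork_done = NULL;
    done->done = true;      /* stands in for complete() */
}
```

Keeping both steps in one helper means every call site clears the pointer before signalling, which is what makes it a no-functional-change cleanup.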
     
  • Bart Van Assche reported a hung fio process when either hot-removing
    storage or when interrupting the fio process itself. The (pruned) call
    trace for the latter looks like so:

    fio D 0000000000000001 0 6849 6848 0x00000004
    ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
    ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
    ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
    Call Trace:
    schedule+0x3f/0x60
    io_schedule+0x8f/0xd0
    wait_for_all_aios+0xc0/0x100
    exit_aio+0x55/0xc0
    mmput+0x2d/0x110
    exit_mm+0x10d/0x130
    do_exit+0x671/0x860
    do_group_exit+0x44/0xb0
    get_signal_to_deliver+0x218/0x5a0
    do_signal+0x65/0x700
    do_notify_resume+0x65/0x80
    int_signal+0x12/0x17

    The problem lies with the allocation batching code. It will
    opportunistically allocate kiocbs, and then trim back the list of iocbs
    when there is not enough room in the completion ring to hold all of the
    events.

    In the case above, what happens is that the pruning back of events ends
    up freeing up the last active request and the context is marked as dead,
    so it is thus responsible for waking up waiters. Unfortunately, the
    code does not check for this condition, so we end up with a hung task.

    Signed-off-by: Jeff Moyer
    Reported-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Cc: [3.2.x only]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
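
The missing check described in this entry can be modelled with a few counters. The sketch below is a simplified user-space illustration, not the aio code: `ioctx_sketch` and `batch_free_sketch()` are hypothetical names, and incrementing `wakeups` stands in for waking the tasks that wait for the context to drain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of an aio context. */
struct ioctx_sketch {
    unsigned int reqs_active; /* requests still accounted to the context */
    bool dead;                /* set once the context is being torn down */
    unsigned int wakeups;     /* counts the stand-in for waking waiters */
};

/* Give back `unused` batched requests that were trimmed before submission. */
static void batch_free_sketch(struct ioctx_sketch *ctx, unsigned int unused)
{
    assert(unused <= ctx->reqs_active);
    ctx->reqs_active -= unused;

    /* The trim may have released the last active request of a dead context. */
    if (ctx->dead && ctx->reqs_active == 0)
        ctx->wakeups++;
}
```

Without the final check, a dead context whose last request is released by the trim never wakes the exiting task, which matches the hang in the trace.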
     
  • Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

05 Mar, 2012

1 commit

  • It's only used inside fs/dcache.c, and we're going to play games with it
    for the word-at-a-time patches. This time we really don't even want to
    export it, because it really is an internal function to fs/dcache.c, and
    has been since it was introduced.

    Having it in that extremely hot header file (it's included in pretty
    much everything, thanks to ) is a disaster for testing
    different versions, and is utterly pointless.

    We really should have some kind of header file diet thing, where we
    figure out which parts of header files are really better off private and
    only result in more expensive compiles.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2012

5 commits

  • Commit 5707c87f "vfs: uninline full_name_hash()" broke the modular
    build, because it needs exporting now that it isn't inlined any more.

    Reported-by: Tetsuo Handa
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The code in link_path_walk() that finds out the length and the hash of
    the next path component is some of the hottest code in the kernel. And
    I have a version of it that does things at the full width of the CPU
    wordsize at a time, but that means that we *really* want to split it up
    into a separate helper function.

    So this re-organizes the code a bit and splits the hashing part into a
    helper function called "hash_name()". It returns the length of the
    pathname component, while at the same time computing and writing the
    hash to the appropriate location.

    The code generation is slightly changed by this patch, but generally for
    the better - and the added abstraction actually makes the code easier to
    read too. And the new interface is well suited for replacing just the
    "hash_name()" function with alternative implementations.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
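
The interface described in this entry (return the component length, write the hash through a pointer) might look roughly like the user-space sketch below. The names carry a `_sketch` suffix because they are illustrative, and the mixing step is a placeholder assumption rather than a statement of what the kernel computes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mixing step; the kernel's real hash function may differ. */
static unsigned long mix_sketch(unsigned long c, unsigned long prev)
{
    return (prev + (c << 4) + (c >> 4)) * 11;
}

/*
 * Scan one path component: stop at '/' or NUL, store the hash through
 * hashp, and return the component length.
 */
static size_t hash_name_sketch(const char *name, unsigned int *hashp)
{
    unsigned long hash = 0;
    size_t len = 0;

    assert(name != NULL && hashp != NULL);
    while (name[len] != '\0' && name[len] != '/') {
        hash = mix_sketch((unsigned char)name[len], hash);
        len++;
    }
    *hashp = (unsigned int)hash;
    return len;
}
```

Because callers only see the length and the stored hash, the byte-at-a-time loop could be swapped for a word-at-a-time version without touching them.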
     
  • .. and also use it in lookup_one_len() rather than open-coding it.

    There aren't any performance-critical users, so inlining it is silly.
    But it wouldn't matter if it wasn't for the fact that the word-at-a-time
    dentry name patches want to conditionally replace the function, and
    uninlining it sets the stage for that.

    So again, this is a preparatory patch that doesn't change any semantics,
    and only prepares for a much cleaner and testable word-at-a-time dentry
    name accessor patch.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • These don't change any semantics, but they clean up the code a bit and
    mark some arguments appropriately 'const'.

    They came up as I was doing the word-at-a-time dcache name accessor
    code, and cleaning this up now allows me to send out a smaller relevant
    interesting patch for the experimental stuff.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The regset common infrastructure assumed that regsets would always
    have .get and .set methods, but not necessarily .active methods.
    Unfortunately people have since written regsets without .set methods.

    Rather than putting in stub functions everywhere, handle regsets with
    null .get or .set methods explicitly.

    Signed-off-by: H. Peter Anvin
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc:
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
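
Handling a null method explicitly, as this entry describes, can be sketched as below. The struct and function names are hypothetical, the method signatures are reduced to a single int, and returning -EOPNOTSUPP is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down regset: either method may be left NULL. */
struct regset_sketch {
    int (*get)(int *value);
    int (*set)(int value);
};

/* Reject a missing method up front instead of calling through NULL. */
static int regset_read_sketch(const struct regset_sketch *rs, int *value)
{
    assert(rs != NULL);
    if (rs->get == NULL)
        return -EOPNOTSUPP;
    return rs->get(value);
}

static int regset_write_sketch(const struct regset_sketch *rs, int value)
{
    assert(rs != NULL);
    if (rs->set == NULL)
        return -EOPNOTSUPP;
    return rs->set(value);
}

/* Example read-only regset method: supplies a value, there is no setter. */
static int get_fixed_sketch(int *value)
{
    *value = 42;
    return 0;
}
```

Checking in the common wrappers keeps individual regsets free of stub functions.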
     

29 Feb, 2012

1 commit

  • Fix printk format warning (from Linus's suggestion):

    on i386:
    fs/ecryptfs/miscdev.c:433:38: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int'

    and on x86_64:
    fs/ecryptfs/miscdev.c:433:38: warning: format '%u' expects type 'unsigned int', but argument 4 has type 'long unsigned int'

    Signed-off-by: Randy Dunlap
    Cc: Geert Uytterhoeven
    Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Cc: ecryptfs@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
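
The two warnings show one argument that is 'unsigned int' on i386 and 'long unsigned int' on x86_64, which is how a size_t-typed value usually behaves. Assuming that is the case here, one portable way to print such a value is the 'z' length modifier. The sketch below uses snprintf() in user space and a hypothetical helper name; it does not reproduce the changed line in fs/ecryptfs/miscdev.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a size_t with %zu, which matches the type on 32- and 64-bit builds. */
static int format_size_sketch(char *buf, size_t buflen, size_t value)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%zu", value);
}
```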
     

28 Feb, 2012

4 commits

  • This makes mount take slightly longer, but at the same time, the first
    write to the filesystem will be faster too. It also means that if there
    is a problem in the resource index, then we can refuse to mount rather
    than having to try and report that when the first write occurs.

    In addition, to avoid recursive locking, we hvae to take account of
    instances when the rindex glock may already be held when we are
    trying to update the rbtree of resource groups.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch fixes a problem whereby gfs2_grow was failing and causing GFS2
    to assert. The problem was that when GFS2's fallocate operation tried to
    acquire an "allocation" it made sure the rindex was up to date, and if not,
    it called gfs2_rindex_update. However, if the file being fallocated was
    the rindex itself, it was already locked at that point. By calling
    gfs2_rindex_update at an earlier point in time, we bring rindex up to date
    and thereby avoid trying to lock it when the "allocation" is acquired.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch fixes a problem whereby you were unable to delete
    files until other file system operations were done (such as
    statfs, touch, writes, etc.) that caused the rindex to be
    read in.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch fixes a narrow race window between the glock ref count
    hitting zero and glocks being removed from the lru_list.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

27 Feb, 2012

3 commits

  • Anton Altaparmakov
     
  • The cifs code will attempt to open files on lookup under certain
    circumstances. What happens though if we find that the file we opened
    was actually a FIFO or other special file?

    Currently, the open filehandle just ends up being leaked leading to
    a dentry refcount mismatch and oops on umount. Fix this by having the
    code close the filehandle on the server if it turns out not to be a
    regular file. While we're at it, change this spaghetti if statement
    into a switch too.

    Cc: stable@vger.kernel.org
    Reported-by: CAI Qian
    Tested-by: CAI Qian
    Reviewed-by: Shirish Pargaonkar
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • Currently we do inc/drop_nlink for a parent directory on every
    mkdir/rmdir call. That's wrong when Unix extensions are disabled
    because in this case a server doesn't follow the same semantics and
    returns the old value on the next QueryInfo request. As a result,
    we update our value with the server's one and then decrement it on
    every rmdir call, ending up with negative nlink values.

    Fix this by removing inc/drop_nlink for the parent directory from
    mkdir/rmdir, setting it for a revalidation and ignoring NumberOfLinks
    for directories when Unix extensions are disabled.

    Signed-off-by: Pavel Shilovsky
    Reviewed-by: Jeff Layton
    Signed-off-by: Steve French

    Pavel Shilovsky
     

26 Feb, 2012

1 commit

  • When the autofs protocol version 5 packet type was added in commit
    5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it
    obviously tried quite hard to be word-size agnostic, and uses explicitly
    sized fields that are all correctly aligned.

    However, with the final "char name[NAME_MAX+1]" array at the end, the
    actual size of the structure ends up being not very well defined:
    because the struct isn't marked 'packed', doing a "sizeof()" on it will
    align the size of the struct up to the biggest alignment of the members
    it has.

    And despite all the members being the same, the alignment of them is
    different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
    alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
    number (256), the name[] array starts out 4-byte aligned.

    End result: the "packed" size of the structure is 300 bytes: 4-byte, but
    not 8-byte aligned.

    As a result, despite all the fields being in the same place on all
    architectures, sizeof() will round up that size to 304 bytes on
    architectures that have 8-byte alignment for u64.

    Note that this is *not* a problem for 32-bit compat mode on POWER, since
    there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
    and 64-bit alignment is different for 64-bit entities, and as a result
    the structure that has exactly the same layout has different sizes.

    So on x86-64, but no other architecture, we will just subtract 4 from
    the size of the structure when running in a compat task. That way we
    will write the properly sized packet that user mode expects.

    Not pretty. Sadly, this very subtle, and unnecessary, size difference
    has been encoded in user space that wants to read packets of *exactly*
    the right size, and will refuse to touch anything else.

    Reported-and-tested-by: Thomas Meyer
    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
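
The size arithmetic in this entry can be checked with a small user-space sketch. The struct below is an illustration only: the field names are assumptions, not copied from the autofs headers, and are chosen so that the fields add up to the 300 bytes the text gives. Rounding 300 up to 8-byte alignment gives the 304 bytes mentioned above, while 4-byte alignment leaves it at 300.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN_SKETCH 256 /* NAME_MAX + 1, per the text above */

/* Illustrative layout; field names are assumptions, sizes follow the text. */
struct v5_packet_sketch {
    int32_t  proto_version;
    int32_t  type;
    uint32_t wait_queue_token;
    uint32_t dev;
    uint64_t ino;
    uint32_t uid;
    uint32_t gid;
    uint32_t pid;
    uint32_t tgid;
    uint32_t len;
    char     name[NAME_LEN_SKETCH];
};

/* Bytes actually occupied by fields: the end of the last member. */
static size_t packed_size_sketch(void)
{
    return offsetof(struct v5_packet_sketch, name) + NAME_LEN_SKETCH;
}

/* Round n up to a multiple of align (align must be a power of two). */
static size_t round_up_sketch(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

On a given ABI, sizeof(struct v5_packet_sketch) should equal round_up_sketch(packed_size_sketch(), _Alignof(struct v5_packet_sketch)), so the same field layout yields different sizes wherever the alignment of a 64-bit integer differs.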
     

25 Feb, 2012

3 commits

  • signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
    this is not enough. eppoll_entry->whead still points to the memory
    we are going to free, ep_unregister_pollwait()->remove_wait_queue()
    is obviously unsafe.

    Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
    change ep_unregister_pollwait() to check pwq->whead != NULL under
    rcu_read_lock() before remove_wait_queue(). We add the new helper,
    ep_remove_wait_queue(), for this.

    This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
    ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
    ep_unregister_pollwait()->remove_wait_queue() can play with already
    freed and potentially reused ->sighand, but this is fine. This memory
    must have the valid ->signalfd_wqh until rcu_read_unlock().

    Reported-by: Maxime Bizon
    Cc:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
    helper.

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
    This makes this poll entry inconsistent, but we don't care. If you
    share epoll fd which contains our sigfd with another process you
    should blame yourself. signalfd is "really special". I simply do
    not know how we can define the "right" semantics if it is used with
    epoll.

    The main problem is, epoll calls signalfd_poll() once to establish
    the connection with the wait queue, after that signalfd_poll(NULL)
    returns the different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks, this seems to be true.

    Note:

    - we do not have wake_up_all_poll() but wake_up_poll()
    is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE,
    we need a couple of simple changes in eventpoll.c to
    make sure it can't be "lost".

    Reported-by: Maxime Bizon
    Cc:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Quoth Chris:
    "This is later than I wanted because I got backed up running through
    btrfs bugs from the Oracle QA teams. But they are all bug fixes that
    we've queued and tested since rc1.

    Nothing in particular stands out, this just reflects bug fixing and QA
    done in parallel by all the btrfs developers. The most user visible
    of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

    Because that helps deal with out of date drives (say an iscsi disk
    that has gone away and come back). The old code wasn't always
    properly retrying the other mirror for this type of failure."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
    Btrfs: fix compiler warnings on 32 bit systems
    Btrfs: increase the global block reserve estimates
    Btrfs: clear the extent uptodate bits during parent transid failures
    Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
    Btrfs: make sure we update latest_bdev
    Btrfs: improve error handling for btrfs_insert_dir_item callers
    Btrfs: be less strict on finding next node in clear_extent_bit
    Btrfs: fix a bug on overcommit stuff
    Btrfs: kick out redundant stuff in convert_extent_bit
    Btrfs: skip states when they does not contain bits to clear
    Btrfs: check return value of lookup_extent_mapping() correctly
    Btrfs: fix deadlock on page lock when doing auto-defragment
    Btrfs: fix return value check of extent_io_ops
    btrfs: honor umask when creating subvol root
    btrfs: silence warning in raid array setup
    btrfs: fix structs where bitfields and spinlock/atomic share 8B word
    btrfs: delalloc for page dirtied out-of-band in fixup worker
    Btrfs: fix memory leak in load_free_space_cache()
    btrfs: don't check DUP chunks twice
    Btrfs: fix trim 0 bytes after a device delete
    ...

    Linus Torvalds
     

24 Feb, 2012

5 commits

  • The enospc tracing code added some interesting uses of
    u64 pointer casts.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Anton Altaparmakov
     
  • From: Masanari Iida

    Signed-off-by: Anton Altaparmakov

    Anton Altaparmakov
     
  • With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
    calls (namely inode_dio_wait() and inode_dio_done()) which are
    EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
    further inode_dio_wait() was pushed from notify_change() into the file
    system ->setattr() method but no non-GPL file system can make this call.

    That means non-GPL file systems cannot exist any more unless they do not
    use any VFS functionality related to reading/writing as far as I can
    tell or at least as long as they want to implement direct i/o.

    Both Linus and Al (and others) have said on LKML that this breakage of
    the VFS API should not have happened and that the change was simply
    missed as it was not documented in the change logs of the patches that
    did those changes.

    This patch changes the two function exports in question to be
    EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
    for all modules.

    Christoph, who introduced the two functions and exported them GPL-only
    is CC-ed on this patch to give him the opportunity to object to the
    symbols being changed in this manner if he did indeed intend them to be
    GPL-only and does not want them to become available to all modules.

    Signed-off-by: Anton Altaparmakov
    CC: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     
  • A fix from Jesper Juhl removes an assignment in an ASSERT when a compare
    is intended. Two fixes from Mitsuo Hayasaka address off-by-ones in XFS
    quota enforcement.

    * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: make inode quota check more general
    xfs: change available ranges of softlimit and hardlimit in quota check
    XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended

    Linus Torvalds
     

23 Feb, 2012

6 commits

  • When doing IO with large amounts of data fragmentation, the global block
    reserve calculations are too low. This increases them to avoid
    ENOSPC crashes.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • If btrfs reads a block and finds a parent transid mismatch, it clears
    the uptodate flags on the extent buffer, and the pages inside it. But
    we only clear the uptodate bits in the state tree if the block straddles
    more than one page.

    This is from an old optimization to reduce contention on the extent
    state tree. But it is buggy because the code that retries a read from
    a different copy of the block is going to find the uptodate state bits
    set and skip the IO.

    The end result of the bug is that we'll never actually read the good
    copy (if there is one).

    The fix here is to always clear the uptodate state bits, which is safe
    because this code is only called when the parent transid fails.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • When we are setting up the mount, we close all the
    devices that were not actually part of the metadata we found.

    But, we don't make sure that one of those devices wasn't
    fs_devices->latest_bdev, which means we can do a use after free
    on the one we closed.

    This updates latest_bdev as it goes.

    Signed-off-by: Chris Mason

    Chris Mason
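
The idea of updating the "latest" pointer while closing devices can be sketched as a single pass over the device list. Everything below is hypothetical: `device_sketch`, its fields, and the rule that "latest" means the highest generation among surviving devices are assumptions made for illustration, not the btrfs data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device record; `closed` stands in for freeing the handle. */
struct device_sketch {
    bool in_metadata;    /* device is referenced by the filesystem metadata */
    bool closed;
    uint64_t generation; /* newest transaction seen on this device */
};

/*
 * Close devices that are not part of the filesystem and, in the same pass,
 * pick the newest surviving device so `latest` never names a closed one.
 */
static struct device_sketch *close_extra_sketch(struct device_sketch *devs, size_t n)
{
    struct device_sketch *latest = NULL;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].in_metadata) {
            devs[i].closed = true;
            continue;
        }
        if (latest == NULL || devs[i].generation > latest->generation)
            latest = &devs[i];
    }
    return latest;
}
```

Recomputing the pointer from the survivors, rather than trusting a value chosen before the close pass, is what rules out the use after free.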
     
  • This allows us to gracefully continue if we aren't able to insert
    directory items, both for normal files/dirs and snapshots.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Bugfixes for the NFS client.

    Fix a nasty Oops in the NFSv4 getacl code, another source of infinite
    loops in the NFSv4 state recovery code, and a regression in NFSv4.1
    session initialisation.

    Also deal with an NFSv4.1 memory leak.

    * tag 'nfs-for-3.3-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: fix server_scope memory leak
    NFSv4.1: Fix a NFSv4.1 session initialisation regression
    NFSv4: Ensure we throw out bad delegation stateids on NFS4ERR_BAD_STATEID
    NFSv4: Fix an Oops in the NFSv4 getacl code

    Linus Torvalds
     

22 Feb, 2012

5 commits

  • Found by Coverity software (http://scan.coverity.com).

    Signed-off-by: Anton Altaparmakov

    Anton Altaparmakov
     
  • Found by Coverity software (http://scan.coverity.com).

    Signed-off-by: Anton Altaparmakov

    Anton Altaparmakov
     
  • The 'poll()' system call timeout parameter is supposed to be 'int', not
    'long'.

    Now, the reason this matters is that right now 32-bit compat mode is
    broken on at least x86-64, because the 32-bit code just calls
    'sys_poll()' directly on x86-64, and the 32-bit argument will have been
    zero-extended, turning a signed 'int' into a large unsigned 'long'
    value.

    We could just introduce a 'compat_sys_poll()' function for this, and
    that may eventually be what we have to do, but since the actual standard
    poll() semantics is *supposed* to be 'int', and since at least on x86-64
    glibc sign-extends the argument before invoking the system call (so
    nobody can actually use a 64-bit timeout value in user space _anyway_,
    even in 64-bit binaries), the simpler solution would seem to be to just
    fix the definition of the system call to match what it should have been
    from the very start.

    If it turns out that somebody somehow circumvents the user-level libc
    64-bit sign extension and actually uses a large unsigned 64-bit timeout
    despite that not being how poll() is supposed to work, we will need to
    do the compat_sys_poll() approach.

    Reported-by: Thomas Meyer
    Acked-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Linus Torvalds
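
The difference between zero-extending and sign-extending a 32-bit timeout, which is the core of this entry, can be shown with two casts. The helper names are hypothetical and the sketch only models the integer conversion, not the system call path.

```c
#include <assert.h>
#include <stdint.h>

/* What a 64-bit callee sees if the 32-bit argument is zero-extended. */
static int64_t zero_extend_sketch(int32_t timeout)
{
    return (int64_t)(uint32_t)timeout;
}

/* What it sees if the argument is sign-extended, i.e. treated as 'int'. */
static int64_t sign_extend_sketch(int32_t timeout)
{
    return (int64_t)timeout;
}
```

A timeout of -1 (wait forever) stays -1 when sign-extended, but becomes 4294967295 when zero-extended into a 'long', which is a very long finite timeout instead.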
     
  • XFS checks quota when reserving disk blocks and inodes. In the block
    reservation, it checks whether the total number of blocks, including
    current usage and the new reservation, exceeds the quota. In the inode
    reservation, it checks using the total number of inodes including only
    current usage, without the new reservation. However, this inode quota
    check works well, since the caller of xfs_trans_dquot() always sets the
    number of new inode reservations to 1 or 0 and inodes are reserved one
    by one in current xfs.

    To make it more general, this patch changes it to the same way as the
    block quota check.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Mitsuo Hayasaka
     
  • In general, quota allows us to use disk blocks and inodes up to each
    limit; that is, they are available as long as usage does not exceed the
    limit. Current xfs sets the available ranges lower than the limits,
    except in the disk inode quota check. So this patch changes the ranges
    so that usage may reach, but not go beyond, the limits.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Mitsuo Hayasaka
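
Taken together, the two quota entries above describe a check of the form "current usage plus new reservation must not go beyond the limit". A minimal sketch under that reading follows; `over_quota_sketch()` is a hypothetical helper, and treating a limit of 0 as "no limit" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Would reserving `nres` more units push usage past `limit`?
 * Usage may reach the limit exactly; only going beyond it fails.
 * A limit of 0 is treated as "no limit" (an assumption of this sketch).
 */
static bool over_quota_sketch(uint64_t used, uint64_t nres, uint64_t limit)
{
    if (limit == 0)
        return false;
    assert(used <= UINT64_MAX - nres); /* keep the sum from wrapping */
    return used + nres > limit;
}
```

With a limit of 10 and 9 units in use, reserving one more is allowed; with 10 in use, reserving one more is refused. Using the same expression for blocks and inodes is what makes the inode check "more general".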