21 Jun, 2011

2 commits

  • * 'for-2.6.40' of git://linux-nfs.org/~bfields/linux:
    nfsd4: fix break_lease flags on nfsd open
    nfsd: link returns nfserr_delay when breaking lease
    nfsd: v4 support requires CRYPTO
    nfsd: fix dependency of nfsd on auth_rpcgss

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    devcgroup_inode_permission: take "is it a device node" checks to inlined wrapper
    fix comment in generic_permission()
    kill obsolete comment for follow_down()
    proc_sys_permission() is OK in RCU mode
    reiserfs_permission() doesn't need to bail out in RCU mode
    proc_fd_permission() doesn't need to bail out in RCU mode
    nilfs2_permission() doesn't need to bail out in RCU mode
    logfs doesn't need ->permission() at all
    coda_ioctl_permission() is safe in RCU mode
    cifs_permission() doesn't need to bail out in RCU mode
    bad_inode_permission() is safe from RCU mode
    ubifs: dereferencing an ERR_PTR in ubifs_mount()

    Linus Torvalds
     

20 Jun, 2011

14 commits


18 Jun, 2011

10 commits

  • In isofs_fill_super(), when an iso_primary_descriptor is found, it is
    kept in pri_bh. The error cases don't properly release it. Fix it.
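
The leak described above can be illustrated with a small userspace sketch. This is not the isofs code: `struct buf`, `buf_get()`, `buf_put()`, `live_bufs` and `fill_super_sketch()` are hypothetical stand-ins for a buffer head, its acquire/release helpers and the fill_super routine. It only shows the usual shape of such a fix, where every exit taken after the buffer was acquired passes through one release label.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a buffer head; not a kernel type. */
struct buf { int data; };

/* Counts buffers that were acquired and not yet released. */
int live_bufs;

struct buf *buf_get(void)
{
    struct buf *b = calloc(1, sizeof(*b));

    if (b)
        live_bufs++;
    return b;
}

void buf_put(struct buf *b)
{
    if (b) {
        live_bufs--;
        free(b);
    }
}

/*
 * Keeps a "primary descriptor" buffer across later setup steps.
 * fail_late simulates an error that happens after the descriptor
 * was found; that path must still release the buffer.
 */
int fill_super_sketch(int fail_late)
{
    struct buf *pri_bh = buf_get();
    int ret = 0;

    if (!pri_bh)
        return -1;

    if (fail_late) {
        ret = -1;
        goto out_release;       /* error path drops pri_bh too */
    }

    /* ... normal setup that reads from pri_bh would go here ... */

out_release:
    buf_put(pri_bh);
    return ret;
}
```

After either outcome of `fill_super_sketch()`, `live_bufs` is back at zero, which is the property the error paths were missing.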

    Reported-and-tested-by: 김원석
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Snapshot creation has two phases. One is the initial snapshot setup,
    and the second is done during commit, while nobody is allowed to modify
    the root we are snapshotting.

    The delayed metadata insertion code can break that rule: it does a
    delayed inode update on the inode of the parent of the snapshot,
    and delayed directory item insertion.

    This makes sure to run the pending delayed operations before we
    record the snapshot root, which avoids corruptions.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • When allocation fails in btrfs_read_fs_root_no_name, ret is not set
    although it is returned, holding a garbage value.
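
A minimal sketch of this class of bug and its fix, in userspace C. The names `struct root_sketch` and `read_root_sketch()` and the `simulate_oom` parameter are assumptions made for illustration; the real function has a different signature. The point is only that the allocation-failure branch has to assign the error code before jumping to the common return.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical type; not struct btrfs_root. */
struct root_sketch { int id; };

/*
 * Returns 0 and stores the new object in *out, or a negative errno.
 * simulate_oom forces the allocation-failure branch.
 */
int read_root_sketch(int simulate_oom, struct root_sketch **out)
{
    struct root_sketch *root;
    int ret;                    /* deliberately not initialised here */

    root = simulate_oom ? NULL : malloc(sizeof(*root));
    if (!root) {
        ret = -ENOMEM;          /* the assignment the failing path needs */
        goto out;
    }

    root->id = 1;
    *out = root;
    ret = 0;
out:
    return ret;
}
```

Without the `ret = -ENOMEM` line the function would return whatever happened to be in `ret`, which is the garbage value the message refers to.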

    Signed-off-by: David Sterba
    Reviewed-by: Li Zefan
    Signed-off-by: Chris Mason

    David Sterba
     
  • We have migrated the space for the delayed inode items from
    trans_block_rsv to global_block_rsv, but we forgot to set trans->block_rsv to
    global_block_rsv when we are doing delayed inode operations, and the following Oops
    happened:

    [ 9792.654889] ------------[ cut here ]------------
    [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
    btrfs_alloc_free_block+0xca/0x27c [btrfs]()
    [ 9792.654899] Hardware name: To Be Filled By O.E.M.
    [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
    ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
    arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
    snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
    pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
    i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
    mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
    ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
    drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
    [ 9792.654919] Pid: 2762, comm: rm Tainted: G W 2.6.39+ #1
    [ 9792.654920] Call Trace:
    [ 9792.654922] [] warn_slowpath_common+0x83/0x9b
    [ 9792.654925] [] warn_slowpath_null+0x1a/0x1c
    [ 9792.654933] [] btrfs_alloc_free_block+0xca/0x27c [btrfs]
    [ 9792.654945] [] ? map_extent_buffer+0x6e/0xa8 [btrfs]
    [ 9792.654953] [] __btrfs_cow_block+0xfc/0x30c [btrfs]
    [ 9792.654963] [] ? btrfs_buffer_uptodate+0x47/0x58 [btrfs]
    [ 9792.654970] [] ? read_block_for_search+0x94/0x368 [btrfs]
    [ 9792.654978] [] btrfs_cow_block+0xfe/0x146 [btrfs]
    [ 9792.654986] [] btrfs_search_slot+0x14d/0x4b6 [btrfs]
    [ 9792.654997] [] ? map_extent_buffer+0x6e/0xa8 [btrfs]
    [ 9792.655022] [] btrfs_lookup_inode+0x2f/0x8f [btrfs]
    [ 9792.655025] [] ? _cond_resched+0xe/0x22
    [ 9792.655027] [] ? mutex_lock+0x29/0x50
    [ 9792.655039] [] btrfs_update_delayed_inode+0x72/0x137 [btrfs]
    [ 9792.655051] [] btrfs_run_delayed_items+0x90/0xdb [btrfs]
    [ 9792.655062] [] btrfs_commit_transaction+0x228/0x654 [btrfs]
    [ 9792.655064] [] ? remove_wait_queue+0x3a/0x3a
    [ 9792.655075] [] btrfs_evict_inode+0x14d/0x202 [btrfs]
    [ 9792.655077] [] evict+0x71/0x111
    [ 9792.655079] [] iput+0x12a/0x132
    [ 9792.655081] [] do_unlinkat+0x106/0x155
    [ 9792.655083] [] ? path_put+0x1f/0x23
    [ 9792.655085] [] ? audit_syscall_entry+0x145/0x171
    [ 9792.655087] [] ? putname+0x34/0x36
    [ 9792.655090] [] sys_unlinkat+0x29/0x2b
    [ 9792.655092] [] system_call_fastpath+0x16/0x1b
    [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---

    This patch fixes it by setting the reservation of the transaction handle to the
    correct one.

    Reported-by: Josef Bacik
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Removes code no longer used. The sysfs file itself is kept, because the
    btrfs developers expressed interest in putting new entries to sysfs.

    Signed-off-by: Maarten Lankhorst
    Signed-off-by: Chris Mason

    Maarten Lankhorst
     
  • smatch reports:

    btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing
    possible ERR_PTR()
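
For readers unfamiliar with the idiom smatch is checking, here is a userspace imitation of the error-pointer convention. The helpers `err_ptr()`, `is_err()` and `ptr_err()` are lowercase re-creations written for this sketch, and `lookup_root_sketch()` and `replay_sketch()` are hypothetical; none of this is the btrfs code. It shows why a returned pointer has to be tested before it is dereferenced.

```c
#include <errno.h>
#include <stdint.h>

/* Errors are encoded in the top 4095 values of the address space. */
#define MAX_ERRNO_SKETCH 4095

static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Hypothetical object returned by the lookup below. */
struct log_root { int gen; };

static struct log_root the_root = { 7 };

/* Returns a valid pointer, or an encoded -ENOENT when nothing is found. */
struct log_root *lookup_root_sketch(int exists)
{
    return exists ? &the_root : err_ptr(-ENOENT);
}

/* Checks the result before touching it, as the fix has to. */
int replay_sketch(int exists)
{
    struct log_root *dest = lookup_root_sketch(exists);

    if (is_err(dest))
        return (int)ptr_err(dest);  /* propagate the error code */
    return dest->gen;               /* safe: dest is a real pointer */
}
```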

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • …btrfs-work into for-linus

    Conflicts:
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason <chris.mason@oracle.com>

    Chris Mason
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: make log devices with write back caches work
    xfs: fix ->mknod() return value on xfs_get_acl() failure

    Linus Torvalds
     
  • The recent commit to get rid of our trans_mutex introduced
    some races with block group relocation. The problem is that relocation
    needs to do some record keeping about each root, and it was relying
    on the transaction mutex to coordinate things in subtle ways.

    This fix adds a mutex just for the relocation code and makes sure
    it doesn't have a big impact on normal operations. The race is
    really fixed in btrfs_record_root_in_trans, which is where we
    step back and wait for the relocation code to finish accounting
    setup.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • ____call_usermodehelper() now erases any credentials set by the
    subprocess_info::init() function. The problem is that commit
    17f60a7da150 ("capabilites: allow the application of capability limits
    to usermode helpers") creates and commits new credentials with
    prepare_kernel_cred() after the call to the init() function. This wipes
    all keyrings after umh_keys_init() is called.

    The best way to deal with this is to put the init() call just prior to
    the commit_creds() call, and pass the cred pointer to init(). That
    means that umh_keys_init() and suchlike can modify the credentials
    _before_ they are published and potentially in use by the rest of the
    system.

    This prevents request_key() from working as it is prevented from passing
    the session keyring it set up with the authorisation token to
    /sbin/request-key, and so the latter can't assume the authority to
    instantiate the key. This causes the in-kernel DNS resolver to fail
    with ENOKEY unconditionally.

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Tested-by: Jeff Layton
    Signed-off-by: Linus Torvalds

    David Howells
     

17 Jun, 2011

2 commits


16 Jun, 2011

12 commits

  • There's no reason not to support cache flushing on external log devices.
    The only thing this really requires is flushing the data device first
    both in fsync and log commits. A side effect is that we also have to
    remove the barrier write test during mount, which has been superfluous
    since the new FLUSH+FUA code anyway. Also use the chance to flush the
    RT subvolume write cache before the fsync commit, which is required
    for correct semantics.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Store the AFS vnode uniquifier in the i_generation field, not the i_version
    field of the inode struct. i_version can then be given the AFS data version
    number.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Set s_id in the superblock to the name of the AFS volume that this superblock
    corresponds to.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • I've got a report of a file corruption from fsxlinux on ext3. The important
    operations to the page were:
    mapwrite to a hole
    partial write to the page
    read - found the page zeroed from the end of the normal write

    The culprit seems to be that if get_block() fails in __block_write_begin()
    (e.g. transient ENOSPC in ext3), the function does ClearPageUptodate(page).
    Thus when we retry the write, the logic in __block_write_begin() thinks zeroing
    of the page is needed and overwrites old data. In fact, I don't see why we
    should ever need to zero the uptodate bit here - either the page was uptodate
    when we entered __block_write_begin() and it should stay so when we leave it,
    or it was not uptodate and no one had the right to set it uptodate during
    __block_write_begin() so it remains !uptodate when we leave as well. So just
    remove clearing of the bit.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • afs_fill_page should read the page that is about to be written but
    the current implementation has a number of issues. If we aren't
    extending the file we always read PAGE_CACHE_SIZE at offset 0. If we
    are extending the file we try to read the entire file.

    Change afs_fill_page to read PAGE_CACHE_SIZE at the right offset,
    clamped to i_size.

    While here, avoid calling afs_fill_page when we are doing a
    PAGE_CACHE_SIZE write.
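
The arithmetic behind the change can be sketched as follows. `PAGE_SIZE_SKETCH`, `fill_len_sketch()` and `needs_fill_sketch()` are names invented for this example, and the 4096-byte page size is an assumption; the actual afs_fill_page code is not reproduced here.

```c
/* Assumed page size for the example. */
#define PAGE_SIZE_SKETCH 4096L

/*
 * Number of bytes to read for the page that starts at file offset pos:
 * one page, clamped so the read never goes past i_size.
 */
long fill_len_sketch(long long pos, long long i_size)
{
    long long remaining = i_size - pos;

    if (remaining <= 0)
        return 0;               /* page lies entirely beyond EOF */
    if (remaining < PAGE_SIZE_SKETCH)
        return (long)remaining; /* last, partial page of the file */
    return PAGE_SIZE_SKETCH;
}

/* A write covering the whole page needs no read beforehand. */
int needs_fill_sketch(long from, long to)
{
    return !(from == 0 && to == PAGE_SIZE_SKETCH);
}
```

For a 10000-byte file, the page at offset 8192 would be filled with 1808 bytes, and a page starting at 12288 would not be read at all.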

    Signed-off-by: Anton Blanchard
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Anton Blanchard
     
  • [Kudos to dhowells for tracking that crap down]

    If two processes attempt to cause automounting on the same mountpoint at the
    same time, the vfsmount holding the mountpoint will be left with one too few
    references on it, causing a BUG when the kernel tries to clean up.

    The problem is that lock_mount() drops the caller's reference to the
    mountpoint's vfsmount in the case where it finds something already mounted on
    the mountpoint as it transits to the mounted filesystem and replaces path->mnt
    with the new mountpoint vfsmount.

    During a pathwalk, however, we don't take a reference on the vfsmount if it is
    the same as the one in the nameidata struct, but do_add_mount() doesn't know
    this.

    The fix is to make sure we have a ref on the vfsmount of the mountpoint before
    calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
    left with an extra ref on the mountpoint vfsmount which needs releasing.
    We can handle that in follow_managed() by not making assumptions about what
    we can and what we cannot get from lookup_mnt() as the current code does.

    The callers of follow_managed() expect that reference to path->mnt will be
    grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
    keep track of whether such reference has been grabbed and assume that it'll
    happen in those and only those cases that'll have us return with changed
    path->mnt. That assumption is almost correct - it breaks in case of
    racing automounts and in even harder to hit race between following a mountpoint
    and a couple of mount --move. The thing is, we don't need to make that
    assumption at all - after the end of loop in follow_managed() we can check
    if path->mnt has ended up unchanged and do mntput() if needed.
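
The reference accounting described above can be modelled in a few lines of userspace C. This is a deliberately simplified model, not the VFS code: `struct mnt_sketch`, `struct path_sketch`, `mntget_sketch()`, `mntput_sketch()` and `follow_sketch()` are hypothetical, and the `transit` flag stands in for "something was found mounted and we moved onto it".

```c
/* Hypothetical mount object with a plain reference count. */
struct mnt_sketch { int refs; };

/* Hypothetical path: only the mount part matters here. */
struct path_sketch { struct mnt_sketch *mnt; };

void mntget_sketch(struct mnt_sketch *m) { m->refs++; }
void mntput_sketch(struct mnt_sketch *m) { m->refs--; }

/*
 * One step of the walk. The caller does not own a reference on
 * path->mnt, so take one before the step that may drop it, and give
 * it back afterwards if path->mnt ended up unchanged.
 */
void follow_sketch(struct path_sketch *path, struct mnt_sketch *mounted,
                   int transit)
{
    struct mnt_sketch *orig = path->mnt;

    mntget_sketch(orig);            /* hold a ref across the step */

    if (transit) {
        /* Moving onto the mounted fs drops the ref on the old mount. */
        mntput_sketch(path->mnt);
        mntget_sketch(mounted);
        path->mnt = mounted;
    }

    if (path->mnt == orig)
        mntput_sketch(orig);        /* unchanged: release the extra ref */
}
```

In both cases the original mount keeps the count it started with; without the initial `mntget_sketch()` the transit case would leave it one short.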

    The BUG can be reproduced with the following test program:

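    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>
    int main(int argc, char **argv)
    {
            int pid, ws;
            struct stat buf;
            pid = fork();
            /* parent and child both stat() the path at the same time */
            stat(argv[1], &buf);
            if (pid > 0)
                    wait(&ws);
            return 0;
    }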

    and the following procedure:

    (1) Mount an NFS volume that on the server has something else mounted on a
    subdirectory. For instance, I can mount / from my server:

    mount warthog:/ /mnt -t nfs4 -r

    On the server /data has another filesystem mounted on it, so NFS will see
    a change in FSID as it walks down the path, and will mark /mnt/data as
    being a mountpoint. This will cause the automount code to be triggered.

    !!! Do not look inside the mounted fs at this point !!!

    (2) Run the above program on a file within the submount to generate two
    simultaneous automount requests:

    /tmp/forkstat /mnt/data/testfile

    (3) Unmount the automounted submount:

    umount /mnt/data

    (4) Unmount the original mount:

    umount /mnt

    At this point the kernel should throw a BUG with something like the
    following:

    BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

    Note that the bug appears on the root dentry of the original mount, not the
    mountpoint and not the submount because sys_umount() hasn't got to its final
    mntput_no_expire() yet, but this isn't so obvious from the call trace:

    [] shrink_dcache_for_umount+0x69/0x82
    [] generic_shutdown_super+0x37/0x15b
    [] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
    [] kill_anon_super+0x1d/0x7e
    [] nfs4_kill_super+0x60/0xb6 [nfs]
    [] deactivate_locked_super+0x34/0x83
    [] deactivate_super+0x6f/0x7b
    [] mntput_no_expire+0x18d/0x199
    [] mntput+0x3b/0x44
    [] release_mounts+0xa2/0xbf
    [] sys_umount+0x47a/0x4ba
    [] ? trace_hardirqs_on_caller+0x1fd/0x22f
    [] system_call_fastpath+0x16/0x1b

    as do_umount() is inlined. However, you can see release_mounts() in there.

    Note also that it may be necessary to have multiple CPU cores to be able to
    trigger this bug.

    Tested-by: Jeff Layton
    Tested-by: Ian Kent
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Al Viro
     
  • Git bisection shows that commit e6bc45d65df8599fdbae73be9cec4ceed274db53 causes
    BUG_ONs under high I/O load:

    kernel BUG at fs/inode.c:1368!
    [ 2862.501007] Call Trace:
    [ 2862.501007] [] d_kill+0xf8/0x140
    [ 2862.501007] [] dput+0xc9/0x190
    [ 2862.501007] [] fput+0x15f/0x210
    [ 2862.501007] [] filp_close+0x61/0x90
    [ 2862.501007] [] sys_close+0xb1/0x110
    [ 2862.501007] [] system_call_fastpath+0x16/0x1b

    A reliable way to reproduce this bug is:
    Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk,
    and apt-get remove openjdk-6-jdk.

    The buggy part of the patch is this:
        struct inode *inode = NULL;
        .....
    -   if (nd.last.name[nd.last.len])
    -           goto slashes;
        inode = dentry->d_inode;
    -   if (inode)
    -           ihold(inode);
    +   if (nd.last.name[nd.last.len] || !inode)
    +           goto slashes;
    +   ihold(inode);
        ...
        if (inode)
                iput(inode); /* truncate the inode here */

    If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
    and dentry->d_inode is non-NULL, then this code now does an additional iput on
    the inode, which is wrong.

    Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.
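
A userspace sketch of the corrected control flow, under stated assumptions: `struct inode_sketch`, `ihold_sketch()`, `iput_sketch()` and `unlink_sketch()` are hypothetical, the `trailing_slashes` flag replaces the `nd.last.name[nd.last.len]` test, and the error value is simplified to -1. It shows that the local `inode` variable is only set on the path that also takes the reference, so the final put is always balanced.

```c
#include <stddef.h>

/* Hypothetical inode with a plain reference count. */
struct inode_sketch { int count; };

void ihold_sketch(struct inode_sketch *i) { i->count++; }
void iput_sketch(struct inode_sketch *i)  { i->count--; }

int unlink_sketch(struct inode_sketch *d_inode, int trailing_slashes)
{
    struct inode_sketch *inode = NULL;
    int err = 0;

    if (trailing_slashes)
        goto slashes;           /* inode stays NULL: no ref was taken */

    inode = d_inode;
    if (!inode)
        goto slashes;
    ihold_sketch(inode);

    /* ... the unlink itself would happen here ... */
    goto out;

slashes:
    err = -1;                   /* simplified error code */
out:
    if (inode)
        iput_sketch(inode);     /* pairs with ihold_sketch() above */
    return err;
}
```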

    Reference: https://lkml.org/lkml/2011/6/15/50
    Reported-by: Norbert Preining
    Reported-by: Török Edwin
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Signed-off-by: Török Edwin
    Signed-off-by: Al Viro

    Török Edwin
     
  • This reverts commit 7f81c8890c15a10f5220bebae3b6dfae4961962a.

    It turns out that it's not actually a build-time check on x86-64 UML,
    which does some seriously crazy stuff with VM_STACK_FLAGS.

    The VM_STACK_FLAGS define depends on the arch-supplied
    VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have

    arch/um/sys-x86_64/shared/sysdep/vm-flags.h:

    #define VM_STACK_DEFAULT_FLAGS \
    (test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)

    #define VM_STACK_DEFAULT_FLAGS vm_stack_flags

    (yes, seriously: two different #define's for that thing, with the first
    one being inside an "#ifdef TIF_IA32")

    It's possible that it is UML that should just be fixed in this area, but
    for now let's just undo the (very small) optimization.

    Reported-by: Randy Dunlap
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between shift_arg_pages()
    and rmap_walk() during migration by not migrating temporary stacks")
    introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
    VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time
    one, so BUILD_BUG_ON is more appropriate.
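
The difference between the two kinds of check can be sketched with C11 `_Static_assert`, used here as a stand-in for BUILD_BUG_ON. The two flag values are made up for the example and are not the kernel's definitions; `flags_overlap()` is a hypothetical helper showing the run-time form of the same test. A compile-time check like this only works when both operands are integer constant expressions.

```c
/* Made-up flag values, for illustration only. */
#define STACK_FLAGS_SKETCH            0x00000133u
#define STACK_INCOMPLETE_SETUP_SKETCH 0x01000000u

/* Compile-time form: the build fails if the masks share a bit. */
_Static_assert((STACK_FLAGS_SKETCH & STACK_INCOMPLETE_SETUP_SKETCH) == 0,
               "stack flag masks must not overlap");

/* Run-time form of the same test, evaluated on every call. */
int flags_overlap(unsigned int a, unsigned int b)
{
    return (a & b) != 0;
}
```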

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Don't call iput with the inode half set up to be a namespace file descriptor.
    Instead rearrange the code so that we don't initialize ei->ns_ops until
    after ns_ops->get succeeds, preventing us from invoking ns_ops->put
    when ns_ops->get failed.

    Reported-by: Ingo Saitz
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
    We can lock up if we try to allow new writers to join the transaction and we have
    flushoncommit set or have a pending snapshot. This is because we set
    no_trans_join and then loop around and try to wait for ordered extents again.
    The problem is the ordered endio stuff needs to join the transaction, which it
    can't do because no_trans_join is set. So instead wait until after this loop to
    set no_trans_join and then make sure to wait for num_writers == 1 in case
    anybody got started in between us exiting the loop and setting no_trans_join.
    This could easily be reproduced by mounting -o flushoncommit and running xfstest
    13. It cannot be reproduced with this patch. Thanks,

    Reported-by: Jim Schutt
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Currently there is nothing protecting the pending_snapshots list on the
    transaction. We only hold the directory mutex that we are snapshotting and a
    read lock on the subvol_sem, so we could race with somebody else creating a
    snapshot in a different directory and end up with list corruption. So protect
    this list with the trans_lock. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik