06 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: no .snap inside of snapped namespace
    libceph: fix msgr standby handling
    libceph: fix msgr keepalive flag
    libceph: fix msgr backoff
    libceph: retry after authorization failure
    libceph: fix handling of short returns from get_user_pages
    ceph: do not clear I_COMPLETE from d_release
    ceph: do not set I_COMPLETE
    Revert "ceph: keep reference to parent inode on ceph_dentry"

    Linus Torvalds
     

05 Mar, 2011

2 commits

  • The "bad_page()" page allocator sanity check was reported recently (call
    chain as follows):

    bad_page+0x69/0x91
    free_hot_cold_page+0x81/0x144
    skb_release_data+0x5f/0x98
    __kfree_skb+0x11/0x1a
    tcp_ack+0x6a3/0x1868
    tcp_rcv_established+0x7a6/0x8b9
    tcp_v4_do_rcv+0x2a/0x2fa
    tcp_v4_rcv+0x9a2/0x9f6
    do_timer+0x2df/0x52c
    ip_local_deliver+0x19d/0x263
    ip_rcv+0x539/0x57c
    netif_receive_skb+0x470/0x49f
    :virtio_net:virtnet_poll+0x46b/0x5c5
    net_rx_action+0xac/0x1b3
    __do_softirq+0x89/0x133
    call_softirq+0x1c/0x28
    do_softirq+0x2c/0x7d
    do_IRQ+0xec/0xf5
    default_idle+0x0/0x50
    ret_from_intr+0x0/0xa
    default_idle+0x29/0x50
    cpu_idle+0x95/0xb8
    start_kernel+0x220/0x225
    _sinittext+0x22f/0x236

    It occurs because an skb with a fraglist was freed from the tcp
    retransmit queue when it was acked, but a page on that fraglist had
    PG_Slab set (indicating it was allocated from the Slab allocator (which
    means the free path above can't safely free it via put_page.

    We tracked this back to an nfsv4 setacl operation, in which the nfs code
    attempted to fill convert the passed in buffer to an array of pages in
    __nfs4_proc_set_acl, which gets used by the skb->frags list in
    xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer
    to a page struct via virt_to_page, but the vfs allocates the buffer via
    kmalloc, meaning the PG_slab bit is set. We can't create a buffer with
    kmalloc and free it later in the tcp ack path with put_page, so we need
    to either:

    1) ensure that when we create the list of pages, no page struct has
    PG_Slab set

    or

    2) not use a page list to send this data

    Given that these buffers can be multiple pages and arbitrarily sized, I
    think (1) is the right way to go. I've written the below patch to
    allocate a page from the buddy allocator directly and copy the data over
    to it. This ensures that we have a put_page free-able page for every
    entry that winds up on an skb frag list, so it can be safely freed when
    the frame is acked. We do a put page on each entry after the
    rpc_call_sync call so as to drop our own reference count to the page,
    leaving only the ref count taken by tcp_sendpages. This way the data
    will be properly freed when the ack comes in

    Successfully tested by myself to solve the above oops.

    Note, as this is the result of a setacl operation that exceeded a page
    of data, I think this amounts to a local DOS triggerable by an
    uprivlidged user, so I'm CCing security on this as well.

    Signed-off-by: Neil Horman
    CC: Trond Myklebust
    CC: security@kernel.org
    CC: Jeff Layton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Otherwise you can do things like

    # mkdir .snap/foo
    # cd .snap/foo/.snap
    # ls

    Signed-off-by: Sage Weil

    Sage Weil
     

04 Mar, 2011

6 commits


03 Mar, 2011

10 commits


02 Mar, 2011

3 commits

  • vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
    i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
    it as reported and analyzed by Josh.

    In fact, there is no good reason to mess with i_nlink of the moved file.
    We did it presumably to simulate linking into the new directory and unlinking
    from an old one. But the practical effect of this is disputable because fsck
    can possibly treat file as being properly linked into both directories without
    writing any error which is confusing. So we just stop increment-decrement
    games with i_nlink which also fixes the corruption.

    CC: stable@kernel.org
    CC: Al Viro
    Signed-off-by: Josh Hunt
    Signed-off-by: Jan Kara

    Josh Hunt
     
  • Commit 493f3358cb289ccf716c5a14fa5bb52ab75943e5 added this call to
    xfs_fs_geometry() in order to avoid passing kernel stack data back
    to user space:

    + memset(geo, 0, sizeof(*geo));

    Unfortunately, one of the callers of that function passes the
    address of a smaller data type, cast to fit the type that
    xfs_fs_geometry() requires. As a result, this can happen:

    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
    in: f87aca93

    Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
    Call Trace:

    [] ? panic+0x50/0x150
    [] ? __stack_chk_fail+0x10/0x18
    [] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]

    Fix this by fixing that one caller to pass the right type and then
    copy out the subset it is interested in.

    Note: This patch is an alternative to one originally proposed by
    Eric Sandeen.

    Reported-by: Jeffrey Hundstad
    Signed-off-by: Alex Elder
    Reviewed-by: Eric Sandeen
    Tested-by: Jeffrey Hundstad

    Alex Elder
     
  • According to the report from Jiro SEKIBA titled "regression in
    2.6.37?" (Message-Id: ), on 2.6.37 and
    later kernels, lscp command no longer displays "i" flag on checkpoints
    that snapshot operations or garbage collection created.

    This is a regression of nilfs2 checkpointing function, and it's
    critical since it broke behavior of a part of nilfs2 applications.
    For instance, snapshot manager of TimeBrowse gets to create
    meaningless snapshots continuously; snapshot creation triggers another
    checkpoint, but applications cannot distinguish whether the new
    checkpoint contains meaningful changes or not without the i-flag.

    This patch fixes the regression and brings that application behavior
    back to normal.

    Reported-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Tested-by: Jiro SEKIBA
    Cc: stable [2.6.37]

    Ryusuke Konishi
     

01 Mar, 2011

3 commits


26 Feb, 2011

7 commits

  • A race can occur when io_submit() races with io_destroy():

    CPU1 CPU2
    io_submit()
    do_io_submit()
    ...
    ctx = lookup_ioctx(ctx_id);
    io_destroy()
    Now do_io_submit() holds the last reference to ctx.
    ...
    queue new AIO
    put_ioctx(ctx) - frees ctx with active AIOs

    We solve this issue by checking whether ctx is being destroyed in AIO
    submission path after adding new AIO to ctx. Then we are guaranteed that
    either io_destroy() waits for new AIO or we see that ctx is being
    destroyed and bail out.

    Cc: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

    lookup_ioctx doesn't implement the rcu lookup pattern properly.
    rcu_read_lock does not prevent refcount going to zero, so we might take
    a refcount on a zero count ioctx.

    Fix the bug by atomically testing for zero refcount before incrementing.

    [jack@suse.cz: added comment into the code]
    Reviewed-by: Jeff Moyer
    Signed-off-by: Nick Piggin
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The kernel automatically evaluates partition tables of storage devices.
    The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
    a bug that causes a kernel oops on certain corrupted LDM partitions. A
    kernel subsystem seems to crash, because, after the oops, the kernel no
    longer recognizes newly connected storage devices.

    The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.

    Signed-off-by: Timo Warns
    Cc: Eugene Teo
    Acked-by: Richard Russon
    Cc: Harvey Harrison
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timo Warns
     
  • In several places, an epoll fd can call another file's ->f_op->poll()
    method with ep->mtx held. This is in general unsafe, because that other
    file could itself be an epoll fd that contains the original epoll fd.

    The code defends against this possibility in its own ->poll() method using
    ep_call_nested, but there are several other unsafe calls to ->poll
    elsewhere that can be made to deadlock. For example, the following simple
    program causes the call in ep_insert recursively call the original fd's
    ->poll, leading to deadlock:

    #include
    #include

    int main(void) {
    int e1, e2, p[2];
    struct epoll_event evt = {
    .events = EPOLLIN
    };

    e1 = epoll_create(1);
    e2 = epoll_create(2);
    pipe(p);

    epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
    epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
    write(p[1], p, sizeof p);
    epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);

    return 0;
    }

    On insertion, check whether the inserted file is itself a struct epoll,
    and if so, do a recursive walk to detect whether inserting this file would
    create a loop of epoll structures, which could lead to deadlock.

    [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Nelson Elhage
    Reported-by: Nelson Elhage
    Tested-by: Nelson Elhage
    Cc: [2.6.34+, possibly earlier]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix fiemap bugs with delalloc
    Btrfs: set FMODE_EXCL in btrfs_device->mode
    Btrfs: make btrfs_rm_device() fail gracefully
    Btrfs: Avoid accessing unmapped kernel address
    Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
    Btrfs: allow balance to explicitly allocate chunks as it relocates
    Btrfs: put ENOSPC debugging under a mount option

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     
  • I'm seeing the following oops when testing afs:

    Unable to handle kernel paging request for data at address 0x00000008
    ...
    NIP [c0000000003393b0] .afs_unlink_writeback+0x38/0xc0
    LR [c00000000033987c] .afs_put_writeback+0x98/0xec
    Call Trace:
    [c00000000345f600] [c00000000033987c] .afs_put_writeback+0x98/0xec
    [c00000000345f690] [c00000000033ae80] .afs_write_begin+0x6a4/0x75c
    [c00000000345f790] [c00000000012b77c] .generic_file_buffered_write+0x148/0x320
    [c00000000345f8d0] [c00000000012e1b8] .__generic_file_aio_write+0x37c/0x3e4
    [c00000000345f9d0] [c00000000012e2a8] .generic_file_aio_write+0x88/0xfc
    [c00000000345fa90] [c0000000003390a8] .afs_file_write+0x10c/0x178
    [c00000000345fb40] [c000000000188788] .do_sync_write+0xc4/0x128
    [c00000000345fcc0] [c000000000189658] .vfs_write+0xe8/0x1d8
    [c00000000345fd70] [c000000000189884] .SyS_write+0x68/0xb0
    [c00000000345fe30] [c000000000008564] syscall_exit+0x0/0x40

    afs_write_begin hits an error and calls afs_unlink_writeback. In there
    we do list_del_init on an uninitialised list.

    The patch below initialises ->link when creating the afs_writeback struct.

    Signed-off-by: Anton Blanchard
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

25 Feb, 2011

3 commits

  • Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC
    opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse
    assumed that when called from open, a truncate() will be done, not an
    ftruncate().

    Fix by restoring the old behavior, based on the ATTR_OPEN flag.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
    triggered a DESTROY request via path_put().

    Fix this by

    a) making RELEASE requests synchronous, whenever possible, on fuseblk
    filesystems

    b) if not possible (triggered by an asynchronous read/write) then do
    the path_put() in a separate thread with schedule_work().

    Reported-by: Oliver Neukum
    Cc: stable@kernel.org
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The new implementation of bd_link_disk_holder() added by 49731baa41d
    (block: restore multiple bd_link_disk_holder() support) didn't get an
    extra reference for the holder_dir kobject of the slave bdev; however,
    bdev kills holder_dir on removal, not release, so if the slave bdev is
    removed while there are holder links, the holder_dir will be destroyed
    while there still are holder links, which leads to oops later when
    bd_unlink_disk_order() tries to remove those links.

    Make bd_link_disk_holder() grab an extra reference for the slave's
    holder_dir and put it in bd_unlink_disk_holder().

    Signed-off-by: Tejun Heo
    Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"
    Tested-by: "Hawrylewicz Czarnowski, Przemyslaw"
    Cc: Neil Brown
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

24 Feb, 2011

4 commits

  • By the commit
    b3e19d9 2011-01-07 fs: scale mntget/mntput
    vfsmount_lock was introduced around testing mnt_count.
    Fix the mis-typed 'unlock'

    Signed-off-by: J. R. Okajima
    Acked-by: Al Viro
    Signed-off-by: Al Viro

    J. R. Okajima
     
  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data will hold becomes irrelevant.
    In the oter, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_devices.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flusk_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rathher than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • The Btrfs fiemap code wasn't properly returning delalloc extents,
    so applications that trust fiemap to decide if there are holes in the
    file see holes instead of delalloc.

    This reworks the btrfs fiemap code, adding a get_extent helper that
    searches for delalloc ranges and also adding a helper for extent_fiemap
    that skips past holes in the file.

    Signed-off-by: Chris Mason

    Chris Mason
     

23 Feb, 2011

1 commit

  • Right now we, are relying on the fact that when we attempt to
    actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
    and the user is informed that the device does not support discard.

    However, in the case where the we do not hit any suitable free
    extent to trim in FITRIM code, it will finish without any error.
    This is very confusing, because it seems that FITRIM was successful
    even though the device does not actually supports discard.

    Solution: Check for the discard support before attempt to search for
    free extents.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Alex Elder

    Lukas Czerner