21 Dec, 2010

2 commits


18 Dec, 2010

2 commits


17 Dec, 2010

1 commit

  • * 'for-linus' of git://git.infradead.org/users/eparis/notify:
    fanotify: fill in the metadata_len field on struct fanotify_event_metadata
    fanotify: split version into version and metadata_len
    fanotify: Dont try to open a file descriptor for the overflow event
    fanotify: Introduce FAN_NOFD
    fanotify: do not leak user reference on allocation failure
    inotify: stop kernel memory leak on file creation failure
    fanotify: on group destroy allow all waiters to bypass permission check
    fanotify: Dont allow a mask of 0 if setting or removing a mark
    fanotify: correct broken ref counting in case adding a mark failed
    fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
    fanotify: remove packed from access response message
    fanotify: deny permissions when no event was sent

    Linus Torvalds
     

16 Dec, 2010

5 commits

  • In 2.6.37-rc1, the garbage collection ioctl of nilfs was broken by
    commit 263d90cefc7d82a0 ("nilfs2: remove own inode hash used for GC"),
    leading to filesystem corruption.

    That patch doesn't queue gc-inodes for the log writer if they are reused
    through the vfs inode cache. Here, a gc-inode is an inode which
    buffers blocks to be relocated on GC. That patch queues gc-inodes in
    the nilfs_init_gcinode() function, but this function is not called when
    they don't have the I_NEW flag. Thus, some live blocks are wrongly
    overwritten without being moved to new logs.

    This resolves the problem by moving the gc-inode queueing to an outer
    function so that it is always done.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The user buffer may be 512-byte aligned, not page-aligned. We were
    assuming the buffer was page-aligned and only accounting for
    non-page-aligned io offsets.
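
    A minimal sketch of the arithmetic involved, not the ceph code itself:
    the number of pages a user buffer touches depends on where the buffer
    starts inside its first page, not only on its length. The names
    SKETCH_PAGE_SIZE and pages_spanned() are hypothetical, and a 4096-byte
    page is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; real systems may differ. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of pages touched by the byte range [addr, addr + len). */
size_t pages_spanned(uintptr_t addr, size_t len)
{
    /* Offset of the buffer start within its first page. */
    size_t off = (size_t)(addr & (SKETCH_PAGE_SIZE - 1));

    if (len == 0)
        return 0;
    /* Round the end of the range up to a page boundary. */
    return (off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

    With these assumptions a 4096-byte buffer that starts 512 bytes into a
    page spans two pages, while the same length starting on a page
    boundary spans one.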

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix typo which broke '..' detection in ext4_find_entry()
    ext4: Turn off multiple page-io submission by default

    Linus Torvalds
     
  • The install_special_mapping routine (used, for example, to set up the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.
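
    The kind of check that was being skipped can be pictured roughly as
    follows. This is a userspace sketch under stated assumptions, not the
    kernel's security hook: sketch_check_min_addr() and its "privileged"
    parameter are hypothetical, and returning -EPERM is an assumption made
    for illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Refuse a mapping that would start below the configured minimum
 * address unless the caller is privileged. Returns 0 when the mapping
 * is allowed, or a negative errno value when it is not.
 */
int sketch_check_min_addr(uintptr_t addr, uintptr_t min_addr, int privileged)
{
    if (addr < min_addr && !privileged)
        return -EPERM;  /* error code chosen for this sketch only */
    return 0;
}
```

    With mmap_min_addr at 65536, the vdso address 0xf000 from the
    transcript above would be rejected by such a check, while 0x10000
    would pass.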

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     
  • The fanotify_event_metadata now has a field which is supposed to
    indicate the length of the metadata portion of the event. Fill in that
    field as well.
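
    As a rough illustration of what filling in that field means, the
    sketch below uses a local stand-in structure. The struct name, its
    exact layout, and sketch_fill_event() are assumptions made for this
    example; they are not taken from the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an event header with a metadata length field. */
struct sketch_event_metadata {
    uint32_t event_len;     /* total length of this event */
    uint8_t  vers;          /* format version */
    uint8_t  reserved;
    uint16_t metadata_len;  /* length of the metadata portion */
    uint64_t mask;
    int32_t  fd;
    int32_t  pid;
};

/* Fill every field, including metadata_len, before handing out the event. */
void sketch_fill_event(struct sketch_event_metadata *m, uint8_t vers,
                       uint64_t mask, int32_t fd, int32_t pid)
{
    memset(m, 0, sizeof(*m));
    m->event_len = (uint32_t)sizeof(*m);
    m->vers = vers;
    m->metadata_len = (uint16_t)sizeof(*m);
    m->mask = mask;
    m->fd = fd;
    m->pid = pid;
}
```

    A reader can then step over the metadata by metadata_len bytes instead
    of hard-coding the structure size.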

    Based-in-part-on-patch-by: Alexey Zaytsev
    Signed-off-by: Eric Paris

    Eric Paris
     

15 Dec, 2010

9 commits

  • There should be a check for the NUL character instead of '0'.

    Fortunately the only thing that cares about this is NFS serving, which
    is why we didn't notice this in the merge window testing.
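
    The difference between the two characters is easy to miss. The sketch
    below uses hypothetical helpers on a NUL-terminated name of at least
    one character; it is not the ext4_find_entry() code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Buggy variant: compares against the digit '0' (0x30). */
bool is_dot_or_dotdot_buggy(const char *name, size_t namelen)
{
    return namelen <= 2 && name[0] == '.' &&
           (name[1] == '.' || name[1] == '0');
}

/* Fixed variant: compares against the NUL terminator '\0' (0x00). */
bool is_dot_or_dotdot(const char *name, size_t namelen)
{
    return namelen <= 2 && name[0] == '.' &&
           (name[1] == '.' || name[1] == '\0');
}
```

    In this sketch the buggy variant rejects "." and accepts the ordinary
    name ".0", while the fixed variant does the opposite.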

    Reported-by: Phil Carmody
    Signed-off-by: Aaro Koskinen
    Signed-off-by: "Theodore Ts'o"

    Aaro Koskinen
     
  • Jon Nelson has found a test case which causes postgresql to fail with
    the error:

    psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

    Under memory pressure, it looks like part of a file can end up getting
    replaced by zeros. Until we can figure out the cause, we'll roll
    back the change and use block_write_full_page() instead of
    ext4_bio_write_page(). The new, more efficient writing function can
    be used via the mount option mblk_io_submit, so we can test and fix
    the new page I/O code.

    To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
    memory that the system is just on the edge of triggering writeback
    before running the following sql script:

    begin;
    create temporary table foo as select x as a, ARRAY[x] as b FROM
    generate_series(1, 10000000 ) AS x;
    create index foo_a_idx on foo (a);
    create index foo_b_idx on foo USING GIN (b);
    rollback;

    If the temporary table is created on a hard drive partition which is
    encrypted using dm_crypt, then under memory pressure, approximately
    30-40% of the time, pgsql will issue the above failure.

    This patch should fix this problem, and the problem will come back if
    the file system is mounted with the mblk_io_submit mount option.

    Reported-by: Jon Nelson
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • * 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
    nfsd: Fix possible BUG_ON firing in set_change_info
    sunrpc: prevent use-after-free on clearing XPT_BUSY

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: prevent RAID level downgrades when space is low
    Btrfs: account for missing devices in RAID allocation profiles
    Btrfs: EIO when we fail to read tree roots
    Btrfs: fix compiler warnings
    Btrfs: Make async snapshot ioctl more generic
    Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
    Btrfs: Fix a crash when mounting a subvolume
    Btrfs: fix sync subvol/snapshot creation
    Btrfs: Fix page leak in compressed writeback path
    Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
    Btrfs: fixup return code for btrfs_del_orphan_item
    Btrfs: do not do fast caching if we are allocating blocks for tree_root
    Btrfs: deal with space cache errors better
    Btrfs: fix use after free in O_DIRECT

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: verify ioctl retries
    fuse: fix ioctl when server is 32bit

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: log timestamp changes to the source inode in rename

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix ioctl magic
    ceph: Behave better when handling file lock replies.
    ceph: pass lock information by struct file_lock instead of as individual params.
    ceph: Handle file locks in replies from the MDS.
    ceph: avoid possible null deref in readdir after dir llseek

    Linus Torvalds
     
  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: Fix panic after nfs_umount()
    nfs: remove extraneous and problematic calls to nfs_clear_request
    nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
    NFS: Fix fcntl F_GETLK not reporting some conflicts
    nfs: Discard ACL cache on mode update
    NFS: Readdir cleanups
    NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
    NFS: Fix a memory leak in nfs_readdir
    Call the filesystem back whenever a page is removed from the page cache
    NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
    cifs: remove bogus remapping of error in cifs_filldir()
    cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
    cifs: fix check of error return from is_path_accessable
    cifs: remove Local_System_Name
    cifs: fix use of CONFIG_CIFS_ACL
    cifs: add attribute cache timeout (actimeo) tunable

    Linus Torvalds
     

14 Dec, 2010

3 commits

  • The extent allocator has code that allows us to fill
    allocations from any available block group, even if it doesn't
    match the raid level we've requested.

    This was put in because adding a new drive to a filesystem
    made with the default mkfs options actually upgrades the metadata from
    single spindle dup to full RAID1.

    But, the code also allows us to allocate from a raid0 chunk when we
    really want a raid1 or raid10 chunk. This can cause big trouble because
    mkfs creates a small (4MB) raid0 chunk for data and metadata which then
    goes unused for raid1/raid10 installs.

    The allocator will happily wander in and allocate from that chunk when
    things get tight, which is not correct.

    The fix here is to make sure that we provide duplication when the
    caller has asked for it. It does allow the dups to be any raid level,
    which preserves the dup->raid1 upgrade abilities.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • When we mount in RAID degraded mode without adding a new device to
    replace the failed one, we can end up using the wrong RAID flags for
    allocations.

    This results in strange combinations of block groups (raid1 in a raid10
    filesystem) and corruptions when we try to allocate blocks from single
    spindle chunks on drives that are actually missing.

    The first device has two small 4MB chunks in it that mkfs creates and
    these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
    the allocator will fall back to these because the mask of desired raid groups
    isn't correct.

    The fix here is to count the missing devices as we build up the list
    of devices in the system. This count is used when picking the
    raid level to make sure we continue using the same levels that were
    in place before we lost a drive.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • If we just get a plain IO error when we read tree roots, the code
    wasn't properly sending that error up the chain. This allowed mounts to
    continue when they should have failed, and allowed operations
    on partially set up root structs. The end result was usually oopsen
    on spinlocks that hadn't been spun up correctly.

    Signed-off-by: Chris Mason

    Chris Mason
     

11 Dec, 2010

8 commits

  • ... regarding an unused function when !MIGRATION, and regarding a
    printk() format string vs argument mismatch.

    Signed-off-by: Jan Beulich
    Signed-off-by: Chris Mason

    Jan Beulich
     
  • If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
    wouldn't have to create a new structure for async snapshot creation.

    Here we convert async snapshot ioctl to use a more generic ABI, as
    we'll add more ioctls for snapshots/subvolumes in the future, readonly
    snapshots for example.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • This problem is found in meego testing:
    http://bugs.meego.com/show_bug.cgi?id=6672
    A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to
    write to the same page of the same file. In btrfs_file_aio_write(), the
    pages are locked by prepare_pages(). So when btrfs_copy_from_user() is
    called, a page fault happens and the same page needs to be locked again
    in filemap_fault(). The fix is to move iov_iter_fault_in_readable()
    before prepare_pages() so that the page fault happens before the pages
    are locked, and also to disable page faults in the critical region in
    btrfs_copy_from_user().

    Reviewed-by: Yan, Zheng
    Signed-off-by: Zhong, Xin
    Signed-off-by: Chris Mason

    Xin Zhong
     
  • We should drop dentry before deactivating the superblock, otherwise
    we can hit this bug:

    BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
    ...

    Steps to reproduce the bug:

    # mount /dev/loop1 /mnt
    # mkdir save
    # btrfs subvolume snapshot /mnt save/snap1
    # umount /mnt
    # mount -o subvol=save/snap1 /dev/loop1 /mnt
    (crash)

    Reported-by: Michael Niederle
    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • We were incorrectly taking the async path even for the sync ioctls by
    passing in &transid unconditionally.

    There's ample room for further cleanup here, but this keeps the fix simple.

    Signed-off-by: Sage Weil
    Reviewed-by: Li Zefan
    Signed-off-by: Chris Mason

    Sage Weil
     
  • "start + num_bytes >= actual_end" can happen when compressed page writeback races
    with file truncation. In that case we need to unlock and release pages past the end
    of file.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Not being able to delete an orphan item isn't a horrible thing. The worst that
    happens is the next time around we try and do the orphan cleanup and we can't
    find the referenced object and just delete the item and move on.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • After a few unsuccessful NFS mount attempts in which the client and
    server cannot agree on an authentication flavor both support, the
    client panics. nfs_umount() is invoked in the kernel in this case.

    Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to
    write off the end of the rpc_clnt's iostat array. This is because the
    mount client's nrprocs field is initialized with the count of defined
    procedures (two: MNT and UMNT), rather than the size of the client's
    proc array (four).

    The fix is to use the same initialization technique used by most other
    upper layer clients in the kernel.
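
    The mismatch between "number of defined procedures" and "size of the
    procedure array" can be reproduced in a small userspace sketch. The
    table contents, the slot numbers 1 and 3, and all function names below
    are assumptions made for illustration; this is not the kernel's mount
    client code.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct sketch_proc {
    const char *name;   /* NULL for slots that are not defined */
};

/* Sparse table indexed by procedure number: two entries, four slots. */
static const struct sketch_proc sketch_procedures[] = {
    [1] = { "MNT" },
    [3] = { "UMNT" },
};

/* Count of defined procedures: the value that was used by mistake. */
size_t sketch_count_defined(void)
{
    size_t i, n = 0;

    for (i = 0; i < ARRAY_SIZE(sketch_procedures); i++)
        if (sketch_procedures[i].name != NULL)
            n++;
    return n;
}

/* Size of the array: the value a per-procedure stats array needs. */
size_t sketch_table_size(void)
{
    return ARRAY_SIZE(sketch_procedures);
}

/* Would per-procedure statistics for 'proc' fit in an array of nrprocs? */
int sketch_stat_in_bounds(size_t proc, size_t nrprocs)
{
    return proc < nrprocs;
}
```

    Sizing a statistics array by the count (two) leaves procedure number 3
    out of bounds, whereas sizing it by the array length (four) does not.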

    Introduced by commit 0b524123, which failed to update nrprocs when
    support was added for UMNT in the kernel.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302
    BugLink: http://bugs.launchpad.net/bugs/683938

    Reported-by: Stefan Bader
    Tested-by: Stefan Bader
    Cc: stable@kernel.org # >= 2.6.32
    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

10 Dec, 2010

5 commits

  • Now that we don't mark VFS inodes dirty anymore for internal
    timestamp changes, but rely on the transaction subsystem to push
    them out, we need to explicitly log the source inode in rename after
    updating its timestamps to make sure the changes actually get
    forced out by sync/fsync or an AIL push.

    We already account for the fourth inode in the log reservation, as a
    rename of directories needs to update the nlink field, so just
    adding the xfs_trans_log_inode call is enough.

    This fixes the xfsqa 065 regression introduced by:

    "xfs: don't use vfs writeback for pure metadata modifications"

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • If the orphan item doesn't exist, we return 1, which doesn't make any sense to
    the callers. Instead return -ENOENT if we didn't find the item. Thanks,
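
    A minimal sketch of that return-code convention, using hypothetical
    helpers rather than the btrfs functions: a search routine reports
    "not found" as a positive value, and the caller-facing wrapper turns
    that into -ENOENT.

```c
#include <errno.h>

/* Stand-in search: returns 0 if key is present, 1 if it is not. */
int sketch_search(int key, const int *items, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (items[i] == key)
            return 0;
    return 1;
}

/* Wrapper: never leaks the positive "not found" value to callers. */
int sketch_del_item(int key, const int *items, int n)
{
    int ret = sketch_search(key, items, n);

    if (ret > 0)
        return -ENOENT;
    return ret;
}
```

    Callers then only ever see 0 or a negative errno value.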

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Since the fast caching uses normal tree locking, we can possibly deadlock if we
    get to the caching via a btrfs_search_slot() on the tree_root. So just check to
    see if the root we are on is the tree root, and just don't do the fast caching.

    Reported-by: Sage Weil
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Currently if the space cache inode generation number doesn't match the
    generation number in the space cache header we will just fail to load the space
    cache, but we won't mark the space cache as an error, so we'll keep getting that
    error each time somebody tries to cache that block group until we actually clear
    the thing. Fix this by marking the space cache as having an error so we only
    get the message once. This patch also makes it so that we don't try and setup
    space cache for a block group that isn't cached, since we won't be able to write
    it out anyway. None of these problems are actual problems, they are just
    annoying and sub-optimal. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This fixes a bug where we use dip after we have freed it. Instead just use the
    file_offset that was passed to the function. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

09 Dec, 2010

2 commits

  • As the FIXME correctly points out, filldir() itself now returns -EOVERFLOW
    if it is not possible to represent the inode number supplied by the
    filesystem in the field provided by userspace.

    Signed-off-by: Suresh Jayaraman
    Reviewed-by: Jeff Layton
    Signed-off-by: Steve French

    Suresh Jayaraman
     
  • If vfs_getattr in fill_post_wcc returns an error, we don't
    set fh_post_change.
    For NFSv4, this can result in set_change_info triggering a BUG_ON,
    i.e. fh_post_saved being zero isn't really a bug.

    So:
    - instead of BUGging when fh_post_saved is zero, just clear ->atomic.
    - if vfs_getattr fails in fill_post_wcc, take a copy of i_ctime anyway.
      This will be used in set_change_info, but not overly trusted.
    - While we are there, remove the pointless 'if' statements in set_change_info.
      There is no harm setting all the values.

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org
    Signed-off-by: J. Bruce Fields

    Neil Brown
     

08 Dec, 2010

3 commits

  • When a nfs_page is freed, nfs_free_request is called which also calls
    nfs_clear_request to clean out the lock and open contexts and free the
    pagecache page.

    However, a couple of places in the nfs code call nfs_clear_request
    themselves. What happens here if the refcount on the request is still high?
    We'll be releasing contexts and freeing pointers while the request is
    possibly still in use.

    Remove those bare calls to nfs_clear_request. That should only be done when
    the request is being freed.

    Note that when doing this, we need to watch out for tests of req->wb_page.
    Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
    would check the value of req->wb_page to figure out if the page is mapped
    into the nfsi->nfs_page_tree. We now indicate the page is mapped using
    the new bit PG_MAPPED in req->wb_flags .

    Reported-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • When the NFS client (kernel) doesn't support NFSv4, for example because
    the kernel was built without NFSv4, there is a problem.

    Using the command "mount SERVER-IP:/nfsv3 /mnt/" to mount an NFSv3
    filesystem, the mount should succeed, but it fails with the error:

    "mount.nfs: an incorrect mount option was specified"

    The mount system call is issued with type "nfs" (not "nfs4") and "vers=4".
    If CONFIG_NFS_V4 is not defined, "vers=4" is parsed
    as an invalid argument and the kernel returns EINVAL to nfs-utils.

    In that case, we really want to get EPROTONOSUPPORT rather than
    EINVAL. This patch makes sure the kernel parses the argument successfully
    and returns EPROTONOSUPPORT from nfs_validate_mount_data().
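
    The split between parsing and validation described above can be
    sketched as two steps. Both functions and the accepted version range
    are hypothetical and only illustrate the error-code choice; they are
    not the NFS mount option parser.

```c
#include <errno.h>

/* Parsing step: accept any syntactically valid version (assumed 2..4). */
int sketch_parse_vers(int vers, int *out)
{
    if (vers < 2 || vers > 4)
        return -EINVAL;         /* genuinely malformed option */
    *out = vers;
    return 0;
}

/* Validation step: report an unsupported protocol, not a bad argument. */
int sketch_validate_vers(int vers, int v4_enabled)
{
    if (vers == 4 && !v4_enabled)
        return -EPROTONOSUPPORT;
    return 0;
}
```

    With v4 support disabled, "vers=4" passes the parsing step and is
    rejected by the validation step with -EPROTONOSUPPORT, which lets
    userspace distinguish it from a malformed option.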

    Signed-off-by: Mi Jinlong
    Signed-off-by: Trond Myklebust

    Mi Jinlong
     
  • The commit 129a84de2347002f09721cda3155ccfd19fade40 (locks: fix F_GETLK
    regression (failure to find conflicts)) fixed the posix_test_lock()
    function by itself, however, its usage in NFS changed by the commit
    9d6a8c5c213e34c475e72b245a8eb709258e968c (locks: give posix_test_lock
    same interface as ->lock) remained broken - subsequent NFS-specific
    locking code received F_UNLCK instead of the user-specified lock type.
    To fix the problem, fl->fl_type needs to be saved before the
    posix_test_lock() call and restored if no local conflicts were reported.
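
    The save-and-restore pattern can be sketched in userspace as follows.
    The structure and both functions are stand-ins invented for this
    example; only the F_RDLCK, F_WRLCK and F_UNLCK constants come from
    <fcntl.h>.

```c
#include <fcntl.h>

struct sketch_lock {
    short type;     /* F_RDLCK, F_WRLCK or F_UNLCK */
};

/*
 * Stand-in for a local conflict test that reuses its argument for the
 * answer: F_UNLCK when nothing conflicts, else the conflicting type.
 */
void sketch_test_lock(struct sketch_lock *fl, int has_conflict)
{
    fl->type = has_conflict ? F_WRLCK : F_UNLCK;
}

/* Returns the lock type that any later, non-local check would see. */
short sketch_getlk(struct sketch_lock *fl, int has_local_conflict)
{
    short saved = fl->type;     /* remember what the user asked for */

    sketch_test_lock(fl, has_local_conflict);
    if (fl->type != F_UNLCK)
        return fl->type;        /* local conflict found: report it */

    fl->type = saved;           /* no local conflict: restore the request */
    return fl->type;
}
```

    Without the restore, the later check would be handed F_UNLCK instead
    of the type the user asked about.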

    Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
    Tested-by: Alexander Morozov
    Signed-off-by: Sergey Vlasov
    Cc:
    Signed-off-by: Trond Myklebust

    Sergey Vlasov