14 Oct, 2011

1 commit


13 Oct, 2011

1 commit


12 Oct, 2011

3 commits

  • Currently we have a few issues with the way the workqueue code is used to
    implement AIL pushing:

    - it accidentally uses the same workqueue as the syncer action, and thus
    can be prevented from running if there are enough sync actions active
    in the system.
    - it doesn't use the HIGHPRI flag to queue at the head of the queue of
    work items

    At this point I'm not confident enough in getting all the workqueue flags and
    tweaks right to provide a perfectly reliable execution context for AIL
    pushing, which is the most important piece in XFS to make forward progress
    when the log fills.

    Revert back to use a kthread per filesystem which fixes all the above issues
    at the cost of having a task struct and stack around for each mounted
    filesystem. In addition this also gives us much better ways to diagnose
    any issues involving hung AIL pushing and removes a small amount of code.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We need to check for pinned buffers even in .iop_pushbuf given that inode
    items flush into the same buffers that may be pinned directly due to operations
    on the unlinked inode list operating directly on buffers. To do this add a
    return value to .iop_pushbuf that tells the AIL push about this and use
    the existing log force mechanisms to unpin it.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • If an item was locked we should not update xa_last_pushed_lsn and thus skip
    it when restarting the AIL scan as we need to be able to lock and write it
    out as soon as possible. Otherwise heavy lock contention might starve AIL
    pushing too easily, especially given the larger backoff once we moved
    xa_last_pushed_lsn all the way to the target lsn.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

11 Oct, 2011

2 commits

  • The btrfs file defrag code will loop through the extents and
    force COW on them. But if there is a concurrent truncate in the middle of
    the defrag, it might end up defragging the same range over and over
    again.

    The problem is that writepage won't go through and do anything on pages
    past i_size, so the cow won't happen, so the file will appear to still
    be fragmented. defrag will end up hitting the same extents again and
    again.

    In the worst case, the truncate can actually live lock with the defrag
    because the defrag keeps creating new ordered extents which the truncate
    code keeps waiting on.

    The fix here is to make defrag check for i_size inside the main loop,
    instead of just once before the looping starts.

    Signed-off-by: Chris Mason

    Chris Mason
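
    The fix described above, re-checking the file size on every pass of
    the defrag loop rather than once up front, can be sketched roughly as
    follows. This is only an illustration, not the btrfs code: the
    file_sketch type, the hook parameter and truncate_to_two_pages() are
    hypothetical names used to simulate a truncate racing with the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode: only the size, in pages. */
struct file_sketch {
    unsigned long size_pages;
};

/* Hypothetical hook that simulates a concurrent truncate. */
typedef void (*between_hook)(struct file_sketch *f);

/* Demo hook: shrink the file to two pages. */
void truncate_to_two_pages(struct file_sketch *f)
{
    f->size_pages = 2;
}

/*
 * Walk up to nr_pages pages and return how many were processed.
 * The size is re-read on every iteration, so pages that a truncate
 * removed in the meantime are never visited.
 */
unsigned long defrag_sketch(struct file_sketch *f, unsigned long nr_pages,
                            between_hook hook)
{
    unsigned long i, done = 0;

    assert(f != NULL);
    for (i = 0; i < nr_pages; i++) {
        if (i >= f->size_pages)   /* check inside the loop */
            break;
        done++;                   /* stands in for forcing COW on the page */
        if (hook)
            hook(f);              /* the file may shrink between pages */
    }
    return done;
}
```

    With the check hoisted out of the loop, the sketch would keep walking
    pages past the new end of file, which is the situation the commit
    message describes.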
     
  • Follow these steps:

    # mount -o autodefrag /dev/sda7 /mnt
    # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
    # sync
    # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

    and then it'll go into a loop: writeback -> defrag -> writeback ...

    It's because writeback writes [8K, 200K] and then writes [0, 8K].

    I tried to make writeback know if the pages are dirtied by defrag,
    but the patch was a bit intrusive. Here I simply set writeback_index
    when we defrag a file.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     

10 Oct, 2011

1 commit


08 Oct, 2011

1 commit


04 Oct, 2011

1 commit


01 Oct, 2011

1 commit

  • A user reported a problem where ceph was getting into 100% cpu usage while doing
    some writing. It turns out it's because we were doing a short write on a not
    uptodate page, which means we'd fall back to one page at a time and fault the
    page in. The problem is our position is on the page boundary, so our fault in
    logic wasn't actually reading the page, so we'd just spin forever or until the
    page got read in by somebody else. This will force a readpage if we end up
    doing a short copy. Alexandre could reproduce this easily with ceph and reports
    it fixes his problem. I also wrote a reproducer that no longer hangs my box
    with this patch. Thanks,

    Reported-and-tested-by: Alexandre Oliva
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

27 Sep, 2011

3 commits

  • That flag no longer makes sense, since we don't look up automount points
    as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
    handling was buggy to begin with: it would avoid automounting even for
    cases where we really *needed* to do the automount handling, and could
    return ENOENT for autofs entries that hadn't been instantiated yet.

    With our new non-eager automount semantics, one discussion has been
    about adding an AT_AUTOMOUNT flag to vfs_fstatat (and thus the
    newfstatat() and fstatat64() system calls), but it's probably not worth
    it: you can always force at least directory automounting by simply
    adding the final '/' to the filename, which works for *all* of the stat
    family system calls, old and new.

    So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
    result of our bad default behavior.

    Acked-by: Ian Kent
    Acked-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The consensus seems to be that system calls such as stat() etc should
    not trigger an automount. Neither should the l* versions.

    This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
    that _should_ trigger an automount on the last path element.

    Signed-off-by: Trond Myklebust
    [ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
    LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
    force automounting for their own reasons - Linus ]
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • Since we've now turned around and made LOOKUP_FOLLOW *not* force an
    automount, we want to add the ability to force an automount event on
    lookup even if we don't happen to have one of the other flags that force
    it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)

    Most cases will never want to use this, since you'd normally want to
    delay automounting as long as possible, which usually implies
    LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
    the automount any more).

    But Trond argued sufficiently forcefully that at a minimum bind mounting
    a file and quotactl will want to force the automount lookup. Some other
    cases (like nfs_follow_remote_path()) could use it too, although
    LOOKUP_DIRECTORY would work there as well.

    This commit just adds the flag and logic, no users yet, though. It also
    doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
    was made irrelevant by the same change that made us not follow on
    LOOKUP_FOLLOW.

    Cc: Trond Myklebust
    Cc: Ian Kent
    Cc: Jeff Layton
    Cc: Miklos Szeredi
    Cc: David Howells
    Cc: Al Viro
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Sep, 2011

4 commits

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     
  • This is modeled after the smaps code.

    It detects transparent hugepages and then does a single gather_stats()
    for the page as a whole. This has two benefits:
    1. It is more efficient since it does many pages in a single shot.
    2. It does not have to break down the huge page.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • gather_pte_stats() does a number of checks on a target page
    to see whether it should even be considered for statistics.
    This breaks that code out in to a separate function so that
    we can use it in the transparent hugepage case in the next
    patch.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We need to teach the numa_maps code about transparent huge pages. The
    first step is to teach gather_stats() that the pte it is dealing with
    might represent more than one page.

    Note that we will use this in a moment for transparent huge pages since
    they use a single pmd_t which _acts_ as a "surrogate" for a bunch
    of smaller pte_t's.

    I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
    for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
    figure out how many _bytes_ "dirty=1" means, you must first know the
    hugetlbfs page size. That's easier said than done especially if you
    don't have visibility in to the mount.

    But, that's probably a discussion for another day especially since it
    would change behavior to fix it. But, just in case anyone wonders why
    this patch only passes a '1' in the hugetlb case...

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
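
    The idea of letting one page-table entry account for several pages
    can be sketched as below. This is a simplified illustration under
    assumed names (numa_stats_sketch, gather_stats_sketch), not the
    kernel's gather_stats(); the 512-page figure in the usage note is
    likewise just an example of a huge page made of 4K pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mapping counters. */
struct numa_stats_sketch {
    unsigned long pages;
    unsigned long dirty;
};

/*
 * Account for one entry that covers nr_pages pages. A normal pte
 * passes 1; an entry standing in for many small pages passes the
 * number of pages it represents.
 */
void gather_stats_sketch(struct numa_stats_sketch *md, int entry_dirty,
                         unsigned long nr_pages)
{
    assert(md != NULL);
    md->pages += nr_pages;
    if (entry_dirty)
        md->dirty += nr_pages;
}
```

    Calling it once with nr_pages set to 512 gives the same totals as 512
    calls with nr_pages set to 1, without walking each small page.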
     

21 Sep, 2011

3 commits


20 Sep, 2011

5 commits

  • Fix sec=ntlmv2/i authentication option during mount of Samba shares.

    cifs client was coding ntlmv2 response incorrectly.
    All that is needed in temp as specified in MS-NLMP section 3.3.2

    "Define ComputeResponse(NegFlg, ResponseKeyNT, ResponseKeyLM,
    CHALLENGE_MESSAGE.ServerChallenge, ClientChallenge, Time, ServerName)

    as
    Set temp to ConcatenationOf(Responserversion, HiResponserversion,
    Z(6), Time, ClientChallenge, Z(4), ServerName, Z(4)"

    is MsvAvNbDomainName.

    For sec=ntlmsspi, build_av_pair is not used, a blob is plucked from
    type 2 response sent by the server to use in authentication.

    I tested sec=ntlmv2/i and sec=ntlmssp/i mount options against
    Samba (3.6) and Windows - XP, 2003 Server and 7.
    They all worked.

    Signed-off-by: Shirish Pargaonkar
    Signed-off-by: Steve French

    Shirish Pargaonkar
     
  • Both of these options start with "rw" - that's why the first one
    isn't switched on even if it is specified. Fix this by adding a length
    check to the "rw" option check.

    Cc:
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French

    Steve French
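
    The kind of prefix collision described above can be sketched as
    follows. This is not the cifs option parser; the function names are
    hypothetical, and "rwlonger" in the usage note is a made-up option
    name standing in for any option that begins with "rw".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix-only check: any option beginning with "rw" matches. */
bool is_rw_prefix_only(const char *opt)
{
    assert(opt != NULL);
    return strncmp(opt, "rw", 2) == 0;
}

/* Check with a length test: only the exact two-character option matches. */
bool is_rw_exact(const char *opt)
{
    assert(opt != NULL);
    return strlen(opt) == 2 && strncmp(opt, "rw", 2) == 0;
}
```

    With the prefix-only check, "rwlonger" is consumed as plain "rw" and
    the longer option is never seen; with the length test it falls
    through to whatever check handles it.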
     
  • move it to the beginning of the loop.

    Signed-off-by: Pavel Shilovsky
    Reviewed-by: Jeff Layton
    Signed-off-by: Steve French

    Pavel Shilovsky
     
  • The name_len variable in CIFSFindNext is a signed int that gets set to
    the resume_name_len in the cifs_search_info. The resume_name_len however
    is unsigned and for some infolevels is populated directly from a 32 bit
    value sent by the server.

    If the server sends a very large value for this, then that value could
    look negative when converted to a signed int. That would make that
    value pass the PATH_MAX check later in CIFSFindNext. The name_len would
    then be used as a length value for a memcpy. It would then be treated
    as unsigned again, and the memcpy scribbles over a ton of memory.

    Fix this by making the name_len an unsigned value in CIFSFindNext.

    Cc:
    Reported-by: Darren Lavender
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
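
    The signedness problem described above can be reproduced in a few
    lines. This is a sketch, not the CIFSFindNext code: SKETCH_PATH_MAX
    and both function names are assumptions, and the conversion of an
    out-of-range value to int is implementation-defined in C (on common
    two's-complement compilers it wraps to a negative number).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PATH_MAX 4096   /* stand-in for the real limit */

/* Length kept in a signed int: a huge wire value turns negative
 * and slips under the limit. */
bool len_ok_signed(uint32_t wire_len)
{
    int name_len = (int)wire_len;
    return name_len < SKETCH_PATH_MAX;
}

/* Length kept unsigned: the same value is correctly rejected. */
bool len_ok_unsigned(uint32_t wire_len)
{
    unsigned int name_len = wire_len;
    return name_len < SKETCH_PATH_MAX;
}
```

    A length that passes the signed check and is then handed to memcpy()
    is converted back to a very large unsigned size, which is the
    overwrite the commit message describes.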
     
  • * 'for-linus' of git://github.com/chrismason/linux:
    Btrfs: only clear the need lookup flag after the dentry is setup
    BTRFS: Fix lseek return value for error
    Btrfs: don't change inode flag of the dest clone file
    Btrfs: don't make a file partly checksummed through file clone
    Btrfs: fix pages truncation in btrfs_ioctl_clone()
    btrfs: fix d_off in the first dirent

    Linus Torvalds
     

18 Sep, 2011

7 commits

  • We can race with readdir and the RCU path walking stuff. This is because we
    clear the need lookup flag before actually instantiating the inode. This will
    lead the RCU path walk stuff to find a dentry it thinks is valid without a
    d_inode attached. So instead unhash the dentry when we first start the lookup,
    and then clear the flag after we've instantiated the dentry so we're guaranteed
    to either try the slow lookup, or have the d_inode set properly.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The recent reworking of btrfs' lseek led to incorrect
    values being returned. This adds checks for seeking
    beyond EOF in SEEK_HOLE and makes sure the error
    values come back correct.

    Andi Kleen also sent in similar patches.

    Signed-off-by: Jie Liu
    Reported-by: Andi Kleen
    Signed-off-by: Chris Mason

    Jeff Liu
     
  • Chris Mason
     
  • The dst file will have the same inode flags as the src file after
    file clone, and I think it's unexpected.

    For example, the dst file will suddenly become immutable after
    getting some share of data with src file, if the src is immutable.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • To reproduce the bug:

    # mount /dev/sda7 /mnt
    # dd if=/dev/zero of=/mnt/src bs=4K count=1
    # umount /mnt

    # mount -o nodatasum /dev/sda7 /mnt
    # dd if=/dev/zero of=/mnt/dst bs=4K count=1
    # clone_range -s 4K -l 4K /mnt/src /mnt/dst

    # echo 3 > /proc/sys/vm/drop_caches
    # cat /mnt/dst
    # dmesg
    ...
    btrfs no csum found for inode 258 start 0
    btrfs csum failed ino 258 off 0 csum 2566472073 private 0

    It's because part of the file is checksummed and the other part is not,
    and then btrfs will complain checksum is not found when we read the file.

    Disallow file clone if src and dst file have different checksum flag,
    so we ensure a file is completely checksummed or unchecksummed.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480
    (Btrfs: truncate pages from clone ioctl target range)

    We should pass the dest range to the truncate function, but not the
    src range.

    Also move the function before locking extent state.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • Since the d_off in the first dirent for "." (that originates from
    the 4th argument "offset" of filldir() for the 2nd dirent for "..")
    is wrongly assigned in btrfs_real_readdir(), telldir returns same
    offset for different locations.

    | # mkfs.btrfs /dev/sdb1
    | # mount /dev/sdb1 fs0
    | # cd fs0
    | # touch file0 file1
    | # ../test
    | telldir: 0
    | readdir: d_off = 2, d_name = "."
    | telldir: 2
    | readdir: d_off = 2, d_name = ".."
    | telldir: 2
    | readdir: d_off = 3, d_name = "file0"
    | telldir: 3
    | readdir: d_off = 2147483647, d_name = "file1"
    | telldir: 2147483647

    To fix this problem, pass filp->f_pos (which is loff_t) instead.

    | # ../test
    | telldir: 0
    | readdir: d_off = 1, d_name = "."
    | telldir: 1
    | readdir: d_off = 2, d_name = ".."
    | telldir: 2
    | readdir: d_off = 3, d_name = "file0"
    :

    At the moment the "offset" for "." is unused because there is no
    preceding dirent, however it is better to pass filp->f_pos to follow
    grammatical usage.

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Chris Mason

    Hidetoshi Seto
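
    The source of the ../test program used above is not included in the
    commit message. A program producing output of that shape might look
    like the sketch below; dump_dir_offsets() is a hypothetical name, and
    the code assumes Linux with glibc, where struct dirent has a d_off
    member.

```c
#define _DEFAULT_SOURCE   /* for telldir() and d_off under strict -std modes */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Print telldir() before the first entry and after every readdir(),
 * together with each entry's d_off and name. Returns the number of
 * entries read, or -1 if the directory cannot be opened.
 */
int dump_dir_offsets(const char *path)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (!dir)
        return -1;

    printf("telldir: %ld\n", telldir(dir));
    while ((ent = readdir(dir)) != NULL) {
        printf("readdir: d_off = %lld, d_name = \"%s\"\n",
               (long long)ent->d_off, ent->d_name);
        printf("telldir: %ld\n", telldir(dir));
        count++;
    }
    closedir(dir);
    return count;
}
```

    Running it on a directory and looking for two different entries that
    report the same telldir value is one way to spot the problem shown in
    the first listing.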
     

16 Sep, 2011

3 commits

  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs: Do not allow multiple mounts on same mountpoint when using -o noac
    NFS: Fix a typo in nfs_flush_multi
    NFSv4: renewd needs to be able to handle the NFS4ERR_CB_PATH_DOWN error
    NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation
    NFSv4: nfs4_proc_renew should be declared static
    NFSv4: nfs4_proc_async_renew should use a GFP_NOFS allocation

    Linus Torvalds
     
  • generic_check_addressable can't deal with hfsplus's larger than page
    size allocation blocks, so simply opencode the checks that we actually
    need in hfsplus_fill_super.

    Signed-off-by: Christoph Hellwig
    Reported-by: Pavel Ivanov
    Tested-by: Pavel Ivanov
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Commit 6596528e391a ("hfsplus: ensure bio requests are not smaller than
    the hardware sectors") changed the pointers used for volume header
    allocations but failed to free the correct pointers in the error
    path of hfsplus_fill_super() and hfsplus_read_wrapper.

    The second hunk came from a separate patch by Pavel Ivanov.

    Reported-by: Pavel Ivanov
    Signed-off-by: Seth Forshee
    Signed-off-by: Christoph Hellwig
    Cc:
    Signed-off-by: Linus Torvalds

    Seth Forshee
     

15 Sep, 2011

2 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix a use after free in xfs_end_io_direct_write

    Linus Torvalds
     
  • We used to get the victim pinned by dentry_unhash() prior to commit
    64252c75a219 ("vfs: remove dget() from dentry_unhash()") and ->rmdir()
    and ->rename() instances relied on that; most of them don't care, but
    ones that used d_delete() themselves do. As the result, we are getting
    rmdir() oopses on NFS now.

    Just grab the reference before locking the victim and drop it explicitly
    after unlocking, same as vfs_rename_other() does.

    Signed-off-by: Al Viro
    Tested-by: Simon Kirby
    Cc: stable@kernel.org (3.0.x)
    Signed-off-by: Linus Torvalds

    Al Viro
     

14 Sep, 2011

2 commits

  • There is a window in which the ioend that we call inode_dio_wake on
    in xfs_end_io_direct_write is already free. Fix this by storing
    the inode pointer in a local variable.

    This is a fix for the regression introduced in 3.1-rc by
    "fs: move inode_dio_done to the end_io handler".

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
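
    The pattern behind this fix, copying a pointer out of a structure
    before the structure can be freed, can be sketched as follows. The
    types and functions here are hypothetical stand-ins, not the XFS
    code; the counter increment merely takes the place of the wake-up
    call on the inode.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the inode and the I/O completion record. */
struct inode_sketch {
    int dio_wakeups;
};

struct ioend_sketch {
    struct inode_sketch *inode;
};

/* Completing the ioend releases it; it must not be touched afterwards. */
void finish_ioend(struct ioend_sketch *ioend)
{
    free(ioend);
}

void end_io_direct_write_sketch(struct ioend_sketch *ioend)
{
    struct inode_sketch *inode;

    assert(ioend != NULL);
    inode = ioend->inode;   /* save the pointer while ioend is still valid */
    finish_ioend(ioend);    /* ioend may be gone from here on */
    inode->dio_wakeups++;   /* use the saved copy, never ioend->inode */
}
```

    Reading ioend->inode after finish_ioend() instead of using the local
    copy would be the use after free.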
     
  • Do not allow multiple mounts on same mountpoint when using -o noac

    When you normally attempt to mount a share twice on the same mountpoint,
    a check in do_add_mount causes it to return an error:

    # mount localhost:/nfsv3 /mnt
    # mount localhost:/nfsv3 /mnt
    mount.nfs: /mnt is already mounted or busy

    However when using the option 'noac', the user is able to mount the same
    share on the same mountpoint multiple times. This happens because a
    share mounted with the noac option is automatically assigned the 'sync'
    flag MS_SYNCHRONOUS in nfs_initialise_sb(). This flag is set after the
    check for already existing superblocks is done in sget(). The check for
    the mount flags in nfs_compare_mount_options() does not take into
    account the 'sync' flag applied later on in the code path. This means
    that when using 'noac', a new superblock structure is assigned for every
    new mount of the same share and multiple shares on the same mountpoint
    are allowed.

    ie.
    # mount -onoac localhost:/nfsv3 /mnt
    can be run multiple times.

    The patch checks for noac and assigns the sync flag before sget() is
    called to obtain an already existing superblock structure.

    Signed-off-by: Sachin Prabhu
    Reviewed-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Sachin Prabhu
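
    Why the ordering matters can be sketched with a simplified flag
    comparison. This is not the NFS code: SK_MS_SYNCHRONOUS, the
    mount_req_sketch type and both functions are assumptions made for
    illustration, and the real comparison looks at more than one flag
    word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MS_SYNCHRONOUS 0x10u   /* sketch stand-in for the sync flag */

/* Hypothetical mount request. */
struct mount_req_sketch {
    unsigned int flags;   /* mount flags as passed in */
    bool noac;            /* true when -o noac was given */
};

/*
 * Flags used when looking for an existing superblock. With fixed set,
 * the sync flag implied by noac is folded in before the comparison;
 * without it, the flag is only added later and the lookup never sees it.
 */
unsigned int flags_for_compare(const struct mount_req_sketch *req, bool fixed)
{
    unsigned int flags;

    assert(req != NULL);
    flags = req->flags;
    if (fixed && req->noac)
        flags |= SK_MS_SYNCHRONOUS;
    return flags;
}

/* True when the request matches a superblock that already exists. */
bool reuses_superblock(unsigned int existing_flags,
                       const struct mount_req_sketch *req, bool fixed)
{
    return flags_for_compare(req, fixed) == existing_flags;
}
```

    In the sketch, a second noac mount only matches the first mount's
    superblock (which already carries the sync flag) when the flag is
    applied before the comparison.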