30 Apr, 2012

2 commits

  • The autofs packet size has had a very unfortunate size problem on x86:
    because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
    because the packet data was not 8-byte aligned, the size of the autofsv5
    packet structure differed between 32-bit and 64-bit modes despite
    looking otherwise identical (300 vs 304 bytes respectively).

    We first fixed that up by making the 64-bit compat mode know about this
    problem in commit a32744d4abae ("autofs: work around unhappy compat
    problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
    64-bit kernel because everything then worked the same way as on a 32-bit
    kernel.

    But it turned out that 'automount' had actually known and worked around
    this problem in user space, so fixing the kernel to do the proper 32-bit
    compatibility handling actually *broke* 32-bit automount on a 64-bit
    kernel, because it knew that the packet sizes were wrong and expected
    those incorrect sizes.

    As a result, we ended up reverting that compatibility mode fix, and
    thus breaking systemd again, in commit fcbf94b9dedd.

    With both automount and systemd doing a single read() system call, and
    verifying that they get *exactly* the size they expect but using
    different sizes, it seemed that fixing one of them would inevitably
    break the other. At one point, a patch I seriously considered applying
    from Michael Tokarev did a "strcmp()" to see if it was automount that
    was doing the operation. Ugly, ugly.

    However, a prettier solution exists now thanks to the packetized pipe
    mode. By marking the communication pipe as being packetized (by simply
    setting the O_DIRECT flag), we can always just write the bigger packet
    size, and if user-space does a smaller read, it will just get that
    partial end result and the extra alignment padding will simply be thrown
    away.

    This makes both automount and systemd happy, since they now get the size
    they asked for, and the kernel side of autofs simply no longer needs to
    care - it could pad out the packet arbitrarily.

    Of course, if there is some *other* user of autofs (please, please,
    please tell me it ain't so - and we haven't heard of any) that tries to
    read the packets with multiple reads, that other user will now be
    broken - the whole point of the packetized mode is that one system call
    gets exactly one packet, and you cannot read a packet in pieces.

    Tested-by: Michael Tokarev
    Cc: Alan Cox
    Cc: David Miller
    Cc: Ian Kent
    Cc: Thomas Meyer
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
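
    The 300 vs 304 byte difference described above can be reproduced in user
    space. The sketch below is only an illustration: the two structs are
    hypothetical stand-ins modeled on the autofs v5 packet layout (the field
    list and NAME_LEN are assumptions, not copied from this commit), and the
    aligned() typedefs are a GCC/Clang extension used to emulate the 32-bit and
    64-bit x86 alignment of a 64-bit integer on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 256 /* assumption: NAME_MAX + 1 */

/* 64-bit integer with the 4-byte alignment that 32-bit x86 uses for it. */
typedef uint64_t u64_align4 __attribute__((aligned(4)));
/* 64-bit integer with the 8-byte alignment that x86-64 uses for it. */
typedef uint64_t u64_align8 __attribute__((aligned(8)));

struct pkt_hdr {
    int32_t proto_version;
    int32_t type;
};

/* Layout as a 32-bit x86 process would see it. */
struct pkt_32 {
    struct pkt_hdr hdr;
    uint32_t wait_queue_token;
    uint32_t dev;
    u64_align4 ino;
    uint32_t uid, gid, pid, tgid;
    uint32_t len;
    char name[NAME_LEN];
};

/* Same fields, but the 64-bit member forces 8-byte struct alignment. */
struct pkt_64 {
    struct pkt_hdr hdr;
    uint32_t wait_queue_token;
    uint32_t dev;
    u64_align8 ino;
    uint32_t uid, gid, pid, tgid;
    uint32_t len;
    char name[NAME_LEN];
};

/* The payload is 300 bytes in both cases; only the tail padding differs. */
size_t pkt_size_32bit(void) { return sizeof(struct pkt_32); }
size_t pkt_size_64bit(void) { return sizeof(struct pkt_64); }
```

    The fields add up to 300 bytes, which is a multiple of 4 but not of 8, so
    the 8-byte-aligned variant gets 4 bytes of trailing padding. That padding
    is what a packetized pipe lets the reader silently drop.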
     
  • The actual internal pipe implementation is already really about
    individual packets (called "pipe buffers"), and this simply exposes that
    as a special packetized mode.

    When we are in the packetized mode (marked by O_DIRECT as suggested by
    Alan Cox), a write() on a pipe will not merge the new data with previous
    writes, so each write will get a pipe buffer of its own. The pipe
    buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
    will tell the reader side to break the read at that boundary (and throw
    away any partial packet contents that do not fit in the read buffer).

    End result: as long as you do writes less than PIPE_BUF in size (so that
    the pipe doesn't have to split them up), you can now treat the pipe as a
    packet interface, where each read() system call will read one packet at
    a time. You can just use a sufficiently big read buffer (PIPE_BUF is
    sufficient, since bigger than that doesn't guarantee atomicity anyway),
    and the return value of the read() will naturally give you the size of
    the packet.

    NOTE! We do not support zero-sized packets, and zero-sized reads and
    writes to a pipe continue to be no-ops. Also note that big packets will
    currently be split at write time, but that the size at which that
    happens is not really specified (except that it's bigger than PIPE_BUF).
    Currently that limit is the system page size, but we might want to
    explicitly support bigger packets some day.

    The main user for this is going to be the autofs packet interface,
    allowing us to stop having to care so deeply about exact packet sizes
    (which have had bugs with 32/64-bit compatibility modes). But user
    space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
    fail with an EINVAL on kernels that do not support this interface.

    Tested-by: Michael Tokarev
    Cc: Alan Cox
    Cc: David Miller
    Cc: Ian Kent
    Cc: Thomas Meyer
    Cc: stable@kernel.org # needed for systemd/autofs interaction fix
    Signed-off-by: Linus Torvalds

    Linus Torvalds
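
    A minimal user-space sketch of the behavior described above, assuming a
    kernel that supports packetized pipes. The function name and the payloads
    are made up for illustration. It writes two packets, reads the first one
    into a buffer that is too small, and then reads the second one.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for pipe2() and O_DIRECT */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demo helper.
 * Returns 0 and fills sizes[] and head[] on success, or -1 if the
 * packetized pipe could not be created or written (for example EINVAL
 * from pipe2() on a kernel without O_DIRECT pipe support).
 */
int packet_pipe_demo(ssize_t sizes[2], char head[4])
{
    int fds[2];
    char small[4] = { 0 };
    char big[64];
    int rc = -1;

    if (pipe2(fds, O_DIRECT) < 0)
        return -1;

    /* Two separate writes: each one becomes its own packet. */
    if (write(fds[1], "0123456789", 10) == 10 &&
        write(fds[1], "abc", 3) == 3) {
        /* Short read: gets the first 4 bytes, the rest of packet 1 is dropped. */
        sizes[0] = read(fds[0], small, sizeof(small));
        /* Next read starts at packet 2, even though the buffer has room for more. */
        sizes[1] = read(fds[0], big, sizeof(big));
        memcpy(head, small, sizeof(small));
        rc = 0;
    }

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

    On a supporting kernel the first read should return 4 and the second 3:
    the remaining six bytes of the first packet are discarded rather than
    being handed to the next read().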
     

29 Apr, 2012

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This has our collection of bug fixes. I missed the last rc because I
    thought our patches were making NFS crash during my xfs test runs.
    Turns out it was an NFS client bug fixed by someone else while I tried
    to bisect it.

    All of these fixes are small, but some are fairly high impact. The
    biggest are fixes for our mount -o remount handling, a deadlock due to
    GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.

    This was tested against both 3.3 and Linus' master as of this morning."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
    Btrfs: reduce lock contention during extent insertion
    Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
    Btrfs: Fix space checking during fs resize
    Btrfs: fix block_rsv and space_info lock ordering
    Btrfs: Prevent root_list corruption
    Btrfs: fix repair code for RAID10
    Btrfs: do not start delalloc inodes during sync
    Btrfs: fix that check_int_data mount option was ignored
    Btrfs: don't count CRC or header errors twice while scrubbing
    Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
    btrfs: don't return EINTR
    Btrfs: double unlock bug in error handling
    Btrfs: always store the mirror we read the eb from
    fs/btrfs/volumes.c: add missing free_fs_devices
    btrfs: fix early abort in 'remount'
    Btrfs: fix max chunk size check in chunk allocator
    Btrfs: add missing read locks in backref.c
    Btrfs: don't call free_extent_buffer twice in iterate_irefs
    Btrfs: Make free_ipath() deal gracefully with NULL pointers
    Btrfs: avoid possible use-after-free in clear_extent_bit()
    ...

    Linus Torvalds
     

28 Apr, 2012

9 commits

  • This reverts commit a32744d4abae24572eff7269bc17895c41bd0085.

    While that commit was technically the right thing to do, and made the
    x86-64 compat mode work identically to native 32-bit mode (and thus
    fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
    turns out that the automount binaries had workarounds for this compat
    problem.

    Now, the workarounds are disgusting: doing a "uname()" to find out the
    architecture of the kernel, and then comparing it for the 64-bit cases
    and fixing up the size of the read() in automount for those. And they
    were confused: it's not actually a generic 64-bit issue at all, it's
    very much tied to just x86-64, which has different alignment for a
    'u64' in 64-bit mode than in 32-bit mode.

    But the end result is that fixing the compat layer actually breaks the
    case of a 32-bit automount on an x86-64 kernel.

    There are various approaches to fix this (including just doing a
    "strcmp()" on current->comm and comparing it to "automount"), but I
    think that I will do the one that teaches pipes about a special "packet
    mode", which will allow user space to not have to care too deeply about
    the padding at the end of the autofs packet.

    That change will make the compat workaround unnecessary, so let's revert
    it first, and get automount working again in compat mode. The
    packetized pipes will then fix autofs for systemd.

    Reported-and-requested-by: Michael Tokarev
    Cc: Ian Kent
    Cc: stable@kernel.org # for 3.3
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull CIFS fixes from Steve French.

    * git://git.samba.org/sfrench/cifs-2.6:
    Use correct conversion specifiers in cifs_show_options
    CIFS: Show backupuid/gid in /proc/mounts
    cifs: fix offset handling in cifs_iovec_write

    Linus Torvalds
     
  • We're spending huge amounts of time on lock contention during
    end_io processing because we unconditionally assume we are overwriting
    an existing extent in the file for each IO.

    This checks to see if we are outside i_size, and if so, it uses a
    less expensive readonly search of the btree to look for existing
    extents.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Btrfs has an optimization where it will preallocate dentries during
    readdir to fill in enough information to open the inode without an extra
    lookup.

    But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
    that leads to deadlocks because our readdir code has tree locks held.

    For now, disable this optimization. We'll fix the gfp mask in the next
    merge window.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Fix out-of-space checking, addressing a warning and potential resource
    leak when resizing the filesystem down while allocating blocks.

    Signed-off-by: Daniel J Blueman
    Reviewed-by: Josef Bacik
    Signed-off-by: Chris Mason

    Daniel J Blueman
     
  • may_commit_transaction() calls
        spin_lock(&space_info->lock);
        spin_lock(&delayed_rsv->lock);
    and update_global_block_rsv() calls
        spin_lock(&block_rsv->lock);
        spin_lock(&sinfo->lock);

    Lockdep complains about this at run time.
    Everywhere except in update_global_block_rsv(), the space_info lock is
    the outer lock, therefore the locking order in update_global_block_rsv()
    is changed.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
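
    The ordering rule from this commit can be sketched in user space with
    pthread mutexes. This is not the btrfs code: the structs, field names and
    helpers below are hypothetical stand-ins, and the only point is that every
    path takes the space_info lock first and the block_rsv lock second.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-ins for the structures named above. */
struct space_info {
    pthread_mutex_t lock;
    long bytes_reserved;
};

struct block_rsv {
    pthread_mutex_t lock;
    long reserved;
    struct space_info *sinfo;
};

/* Single place that encodes the order: space_info outer, block_rsv inner. */
static void lock_pair(struct block_rsv *rsv)
{
    pthread_mutex_lock(&rsv->sinfo->lock);
    pthread_mutex_lock(&rsv->lock);
}

/* Release in the reverse order of acquisition. */
static void unlock_pair(struct block_rsv *rsv)
{
    pthread_mutex_unlock(&rsv->lock);
    pthread_mutex_unlock(&rsv->sinfo->lock);
}

/* Both updaters go through lock_pair(), so no ABBA inversion is possible. */
void rsv_add(struct block_rsv *rsv, long bytes)
{
    lock_pair(rsv);
    rsv->reserved += bytes;
    rsv->sinfo->bytes_reserved += bytes;
    unlock_pair(rsv);
}

void rsv_release(struct block_rsv *rsv, long bytes)
{
    lock_pair(rsv);
    rsv->reserved -= bytes;
    rsv->sinfo->bytes_reserved -= bytes;
    unlock_pair(rsv);
}
```

    Funnelling both paths through one helper is a design choice of the sketch,
    not something the commit does; the commit only swaps the two spin_lock()
    calls in update_global_block_rsv().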
     
  • I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add
    correct locking to address this.

    Signed-off-by: Daniel J Blueman
    Signed-off-by: Chris Mason

    Daniel J Blueman
     
  • btrfs_map_block sets mirror_num, so that the repair code knows eventually
    which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
    Before this fix mirror_num was incorrectly related to our stripe index.

    Signed-off-by: Jan Schmidt
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and
    start writing them out, but it doesn't splice the list or anything, so as
    long as somebody is doing work on the box you could end up in this section
    _forever_. So just remove it; it's not needed, since sync will start
    writeback on all inodes anyway. All we need to do is wait for ordered
    extents and then we can commit the transaction. In my horrible torture test
    sync goes from taking 4 minutes to about 1.5 minutes. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

27 Apr, 2012

1 commit

  • Merge fixes from Andrew Morton:
    "13 fixes. The acerhdf patches aren't (really) fixes. But they've
    been stuck in my tree for up to two years, sent to Matthew multiple
    times and the developers are unhappy."

    * emailed from Andrew Morton: (13 patches)
    mm: fix NULL ptr dereference in move_pages
    mm: fix NULL ptr dereference in migrate_pages
    revert "proc: clear_refs: do not clear reserved pages"
    drivers/rtc/rtc-ds1307.c: fix BUG shown with lock debugging enabled
    arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file
    hugetlbfs: lockdep annotate root inode properly
    acerhdf: lowered default temp fanon/fanoff values
    acerhdf: add support for new hardware
    acerhdf: add support for Aspire 1410 BIOS v1.3314
    fs/buffer.c: remove BUG() in possible but rare condition
    mm: fix up the vmscan stat in vmstat
    epoll: clear the tfile_check_list on -ELOOP
    mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

    Linus Torvalds
     

26 Apr, 2012

5 commits

  • Pull NFS client bugfixes from Trond Myklebust:
    - Fix NFSv4 infinite loops on open(O_TRUNC)
    - Fix an Oops and an infinite loop in the NFSv4 flock code
    - Don't register the PipeFS filesystem until it has been set up
    - Fix an Oops in nfs_try_to_update_request
    - Don't reuse NFSv4 open owners: fixes a bad sequence id storm.

    * tag 'nfs-for-3.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Keep dropped state owners on the LRU list for a while
    NFSv4: Ensure that we don't drop a state owner more than once
    NFSv4: Ensure we do not reuse open owner names
    nfs: Enclose hostname in brackets when needed in nfs_do_root_mount
    NFS: put open context on error in nfs_flush_multi
    NFS: put open context on error in nfs_pagein_multi
    NFSv4: Fix open(O_TRUNC) and ftruncate() error handling
    NFSv4: Ensure that we check lock exclusive/shared type against open modes
    NFSv4: Ensure that the LOCK code sets exception->inode
    NFS: check for req==NULL in nfs_try_to_update_request cleanup
    SUNRPC: register PipeFS file system after pernet sybsystem

    Linus Torvalds
     
  • Revert commit 85e72aa5384 ("proc: clear_refs: do not clear reserved
    pages"), which was a quick fix suitable for -stable until ARM had been
    moved over to the gate_vma mechanism:

    https://lkml.org/lkml/2012/1/14/55

    With commit f9d4861f ("ARM: 7294/1: vectors: use gate_vma for vectors user
    mapping"), ARM does now use the gate_vma, so the PageReserved check can be
    removed from the proc code.

    Signed-off-by: Will Deacon
    Cc: Nicolas Pitre
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • This fixes the below reported false lockdep warning. e096d0c7e2e4
    ("lockdep: Add helper function for dir vs file i_mutex annotation") added
    a similar annotation for every other inode in hugetlbfs but missed the
    root inode because it was allocated by a separate function.

    For HugeTLB fs we allow taking i_mutex in mmap. HugeTLB fs doesn't
    support file write and its file read callback is modified in a05b0855fd
    ("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
    i_mutex. Hence for HugeTLB fs with regular files we really don't take
    i_mutex with mmap_sem held.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.4.0-rc1+ #322 Not tainted
    -------------------------------------------------------
    bash/1572 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x40/0x90

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [] vfs_readdir+0x56/0xa8

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
    [] lock_acquire+0xd5/0xfa
    [] __mutex_lock_common+0x48/0x350
    [] mutex_lock_nested+0x2a/0x31
    [] hugetlbfs_file_mmap+0x7d/0x104
    [] mmap_region+0x272/0x47d
    [] do_mmap_pgoff+0x294/0x2ee
    [] sys_mmap_pgoff+0xd2/0x10e
    [] sys_mmap+0x1d/0x1f
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa81/0xd75
    [] lock_acquire+0xd5/0xfa
    [] might_fault+0x6d/0x90
    [] filldir+0x6a/0xc2
    [] dcache_readdir+0x5c/0x222
    [] vfs_readdir+0x76/0xa8
    [] sys_getdents+0x79/0xc9
    [] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&sb->s_type->i_mutex_key#12);
                                   lock(&mm->mmap_sem);
                                   lock(&sb->s_type->i_mutex_key#12);
      lock(&mm->mmap_sem);

    *** DEADLOCK ***

    1 lock held by bash/1572:
    #0: (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [] vfs_readdir+0x56/0xa8

    stack backtrace:
    Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
    Call Trace:
    [] print_circular_bug+0x1f8/0x209
    [] __lock_acquire+0xa81/0xd75
    [] ? handle_pte_fault+0x5ff/0x614
    [] ? mark_lock+0x2d/0x258
    [] ? might_fault+0x40/0x90
    [] lock_acquire+0xd5/0xfa
    [] ? might_fault+0x40/0x90
    [] ? __mutex_lock_common+0x333/0x350
    [] might_fault+0x6d/0x90
    [] ? might_fault+0x40/0x90
    [] filldir+0x6a/0xc2
    [] dcache_readdir+0x5c/0x222
    [] ? sys_ioctl+0x74/0x74
    [] ? sys_ioctl+0x74/0x74
    [] ? sys_ioctl+0x74/0x74
    [] vfs_readdir+0x76/0xa8
    [] sys_getdents+0x79/0xc9
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Aneesh Kumar K.V
    Cc: Dave Jones
    Cc: Al Viro
    Cc: Josh Boyer
    Cc: Peter Zijlstra
    Cc: Mimi Zohar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • While stressing the kernel with failing allocations today, I hit the
    following chain of events:

    alloc_page_buffers():

        bh = alloc_buffer_head(GFP_NOFS);
        if (!bh)
            goto no_grow;
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • An epoll_ctl(,EPOLL_CTL_ADD,,) operation can return '-ELOOP' to prevent
    circular epoll dependencies from being created. However, in that case we
    do not properly clear the 'tfile_check_list'. Thus, add a call to
    clear_tfile_check_list() for the -ELOOP case.

    Signed-off-by: Jason Baron
    Reported-by: Yurij M. Plotnikov
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Tested-by: Alexandra N. Kossovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
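
    For reference, the -ELOOP case mentioned above can be triggered from user
    space by making two epoll instances watch each other. The helper below is
    a hypothetical sketch (its name and return convention are made up here);
    it only shows the error path that the fix cleans up after, not the kernel
    change itself.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: tries to build a circular epoll dependency.
 * Returns the errno of the epoll_ctl() call that would close the loop,
 * 0 if that call unexpectedly succeeded, or -1 if the setup failed.
 */
int epoll_loop_errno(void)
{
    int a = epoll_create1(0);
    int b = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };
    int result = -1;

    if (a >= 0 && b >= 0) {
        /* First edge: a watches b. This is legal. */
        ev.data.fd = b;
        if (epoll_ctl(a, EPOLL_CTL_ADD, b, &ev) == 0) {
            /* Second edge: b watches a. This would create a cycle. */
            ev.data.fd = a;
            if (epoll_ctl(b, EPOLL_CTL_ADD, a, &ev) == 0)
                result = 0;
            else
                result = errno;
        }
    }

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```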
     

25 Apr, 2012

2 commits

  • cifs_show_options uses the wrong conversion specifier for uid, gid,
    rsize & wsize. Correct this to %u to match it to the variable type
    'unsigned integer'.

    Signed-off-by: Sachin Prabhu
    Reviewed-by: Jeff Layton
    Signed-off-by: Steve French

    Sachin Prabhu
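
    A small user-space illustration of why the specifier matters, assuming a
    32-bit unsigned int. The helper name and the "uid=" prefix are made up for
    the example and are not taken from cifs_show_options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: formats an unsigned id with the matching %u
 * specifier. Returns what snprintf() returns.
 */
int format_uid(char *buf, size_t len, unsigned int uid)
{
    /* %u matches unsigned int; %d would show large ids as negative numbers. */
    return snprintf(buf, len, "uid=%u", uid);
}
```

    With %u, a large id such as 4294967294 is printed as-is; a signed
    specifier would typically render the same bits as -2.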
     
  • Show backupuid/backupgid in /proc/mounts for cifs shares mounted with
    the backupuid/backupgid feature.

    Also consolidate the two separate checks for
    pvolume_info->backupuid_specified into a single if condition in
    cifs_setup_cifs_sb().

    Signed-off-by: Sachin Prabhu
    Reviewed-by: Jeff Layton
    Signed-off-by: Steve French

    Sachin Prabhu
     

24 Apr, 2012

6 commits

  • This patch instructs DLM to prevent an "in place" conversion, where the
    lock just stays on the granted queue, and instead forces the conversion to
    the back of the convert queue. This is done on upward conversions only.

    This is useful in cases where, for example, a lock is frequently needed in
    PR on one node, but another node needs it temporarily in EX to update it.
    This may happen, for example, when the rindex is being updated by gfs2_grow.
    The gfs2_grow needs to have the lock in EX, but the other nodes need to
    re-read it to retrieve the updates. The glock is already granted in PR on
    the non-growing nodes, so this prevents them from continually re-granting
    the lock in PR, and forces the EX from gfs2_grow to go through.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • Pull ext4 bug fixes from Ted Ts'o:
    "These are two low-risk bug fixes for ext4, fixing a compile warning
    and a potential deadlock."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    super.c: unused variable warning without CONFIG_QUOTA
    jbd2: use GFP_NOFS for blkdev_issue_flush

    Linus Torvalds
     
  • sb info is only checked with quota support.

    fs/ext4/super.c: In function ‘parse_options’:
    fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable]

    Signed-off-by: Eldad Zack
    Signed-off-by: Theodore Ts'o

    Eldad Zack
     
  • The flush request is issued in the transaction commit code path, so using
    GFP_KERNEL to allocate memory for the flush request bio falls into the
    classic deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and
    md are using GFP_KERNEL.

    Signed-off-by: Shaohua Li
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Shaohua Li
     
  • Pull dlm fixes from David Teigland:
    "This includes one short patch fixing the behavior of the QUECVT flag,
    which the gfs2 folks are waiting on."

    * tag 'dlm-fixes-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: fix QUECVT when convert queue is empty

    Linus Torvalds
     
  • The QUECVT flag should not prevent conversions from
    being granted immediately when the convert queue is
    empty.

    Signed-off-by: David Teigland

    David Teigland
     

22 Apr, 2012

2 commits


21 Apr, 2012

9 commits


20 Apr, 2012

3 commits

  • In the recent update of the cifs_iovec_write code to use async writes,
    the handling of the file position was broken. That patch added a local
    "offset" variable to handle the offset, and then only updated the
    original "*poffset" before exiting.

    Unfortunately, it copied off the original offset from the beginning,
    instead of doing so after generic_write_checks had been called. Fix
    this by moving the initialization of "offset" after that in the
    function.

    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • Pull nfsd bugfixes from J. Bruce Fields:
    "One bugfix, and one minor header fix from Jeff Layton while we're
    here"

    * 'for-3.4' of git://linux-nfs.org/~bfields/linux:
    nfsd: include cld.h in the headers_install target
    nfsd: don't fail unchecked creates of non-special files

    Linus Torvalds
     
  • If the file wasn't opened for writing, then truncate and ftruncate
    need to report the appropriate errors.

    Reported-by: Miklos Szeredi
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust