03 Dec, 2013

1 commit

  • The pipe code was trying (and failing) to be very careful about freeing
    the pipe info only after the last access, with a pattern like:

    spin_lock(&inode->i_lock);
    if (!--pipe->files) {
            inode->i_pipe = NULL;
            kill = 1;
    }
    spin_unlock(&inode->i_lock);
    __pipe_unlock(pipe);
    if (kill)
            free_pipe_info(pipe);

    where the final freeing is done last.

    HOWEVER. The above is actually broken, because while the freeing is
    done at the end, if we have two racing processes releasing the pipe
    inode info, the one that *doesn't* free it will decrement the ->files
    count, and unlock the inode i_lock, but then still use the
    "pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".

    This is *very* hard to trigger in practice, since the race window is
    very small, and adding debug options seems to just hide it by slowing
    things down.

    Simon originally reported this way back in July as an Oops in
    kmem_cache_allocate due to a single bit corruption (caused by the final
    "spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
    allocation that had re-used the free'd pipe-info). It's taken this long
    to figure out.

    Since the 'pipe->files' accesses aren't even protected by the pipe lock
    (we very much use the inode lock for that), the simple solution is to
    just drop the pipe lock early. And since there were two users of this
    pattern, create a helper function for it.

    Introduced by commit ba5bb147330a ("pipe: take allocation and freeing of
    pipe_inode_info out of ->i_mutex").

    Reported-by: Simon Kirby
    Reported-by: Ian Applegate
    Acked-by: Al Viro
    Cc: stable@kernel.org # v3.10+
    Signed-off-by: Linus Torvalds

    Linus Torvalds
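    A minimal userspace sketch of the ordering described in the pipe commit
    above, assuming pthread mutexes as stand-ins for the pipe lock and
    inode->i_lock. The names pipe_info, inode_model, put_pipe_info_model and
    release_model are hypothetical and are not the kernel's code; the point
    is only that the pipe lock is dropped before the reference count is
    touched, so a caller that does not free the object never uses its lock
    after the final decrement.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct pipe_info {
    pthread_mutex_t mutex;   /* models the pipe lock */
    int files;               /* protected by inode_model.lock, not by mutex */
};

struct inode_model {
    pthread_mutex_t lock;    /* models inode->i_lock */
    struct pipe_info *pipe;  /* models inode->i_pipe */
};

/*
 * Drop one reference. The caller must already have released pipe->mutex.
 * Returns 1 if this call freed the pipe, 0 otherwise.
 */
static int put_pipe_info_model(struct inode_model *inode, struct pipe_info *pipe)
{
    int kill = 0;

    pthread_mutex_lock(&inode->lock);
    if (!--pipe->files) {
        inode->pipe = NULL;
        kill = 1;
    }
    pthread_mutex_unlock(&inode->lock);

    /* Only the caller that saw the count reach zero touches *pipe again. */
    if (kill) {
        pthread_mutex_destroy(&pipe->mutex);
        free(pipe);
    }
    return kill;
}

/* Release path: unlock the pipe first, then drop the reference. */
static int release_model(struct inode_model *inode, struct pipe_info *pipe)
{
    pthread_mutex_lock(&pipe->mutex);
    /* ... per-release bookkeeping would go here ... */
    pthread_mutex_unlock(&pipe->mutex);   /* last use of the pipe lock */
    return put_pipe_info_model(inode, pipe);
}
```

    With two references, the first release_model() call returns 0 and leaves
    the object alone; the second returns 1 and frees it.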
     

30 Nov, 2013

1 commit

  • Pull vfs dentry reference count fix from Al Viro.

    This fixes a possible inode_permission NULL pointer dereference (and
    other problems) that were due to the root dentry count being decremented
    too much. In commit 48a066e72d97 ("RCU'd vfsmounts") the placement of
    clearing the LOOKUP_RCU bit changed, and we then returned failure of
    incrementing the lockref on the parent dentry with LOOKUP_RCU cleared.

    But that meant we needed to go through the same cleanup routines that
    the later failures did wrt LOOKUP_ROOT and nd->root.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix bogus path_put() of nd->root after some unlazy_walk() failures

    Linus Torvalds
     

29 Nov, 2013

2 commits

  • Failure to grab reference to parent dentry should go through the
    same cleanup as nd->seq mismatch. As it is, we might end up with
    caller thinking it needs to path_put() nd->root, with obvious
    nasty results once we'd hit that bug enough times to drive the
    refcount of root dentry all the way to zero...

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull cifs fixes from Steve French:
    "SMB3 "validate negotiate" is needed to prevent certain types of
    downgrade attacks.

    Also changes SMB2/SMB3 copy offload from using the BTRFS copy ioctl
    (BTRFS_IOC_CLONE) to a cifs specific ioctl (CIFS_IOC_COPYCHUNK_FILE)
    to address Christoph's comment that there are semantic differences
    between requesting copy offload in which copy-on-write is mandatory
    (as in the BTRFS ioctl) and optional in the SMB2/SMB3 case. Also
    fixes SMB2/SMB3 copychunk for large files"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    [CIFS] Do not use btrfs refcopy ioctl for SMB2 copy offload
    Check SMB3 dialects against downgrade attacks
    Removed duplicated (and unneeded) goto
    CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files

    Linus Torvalds
     

28 Nov, 2013

3 commits

  • Pull driver core fixes from Greg KH:
    "Here are 3 patches for sysfs issues that have been reported. Well, 1
    patch really, the first one is reverted as it's not really needed (the
    correct fix is coming in through the different driver subsystems
    instead)

    But that 1 sysfs fix is needed, so this is still a good thing to pull
    in now"

    Signed-off-by: Greg Kroah-Hartman

    * tag 'driver-core-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    Revert "sysfs: handle duplicate removal attempts in sysfs_remove_group()"
    sysfs: use a separate locking class for open files depending on mmap
    sysfs: handle duplicate removal attempts in sysfs_remove_group()

    Linus Torvalds
     
  • This tool hasn't been maintained in over a decade, and is pretty much
    useless these days. Let's pretend it never happened.

    Also remove a long-dead email address.

    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • This reverts commit 54d71145a4548330313ca664a4a009772fe8b7dd.

    The root cause of these "inverted" sysfs removals has now been found,
    so there is no need for this patch. Keep this functionality around so
    that this type of error doesn't show up in driver code again.

    Cc: Mika Westerberg
    Cc: Rafael J. Wysocki
    Cc: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Nov, 2013

1 commit

  • Pull ceph bug-fixes from Sage Weil:
    "These include a couple fixes to the new fscache code that went in
    during the last cycle (which will need to go stable@ shortly as well),
    a couple client-side directory fragmentation fixes, a fix for a race
    in the cap release queuing path, and a couple race fixes in the
    request abort and resend code.

    Obviously some of this could have gone into 3.12 final, but I
    preferred to overtest rather than send things in for a late -rc, and
    then my travel schedule intervened"

    * 'for-linus-bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: allocate non-zero page to fscache in readpage()
    ceph: wake up 'safe' waiters when unregistering request
    ceph: cleanup aborted requests when re-sending requests.
    ceph: handle race between cap reconnect and cap release
    ceph: set caps count after composing cap reconnect message
    ceph: queue cap release in __ceph_remove_cap()
    ceph: handle frag mismatch between readdir request and reply
    ceph: remove outdated frag information
    ceph: hung on ceph fscache invalidate in some cases

    Linus Torvalds
     

25 Nov, 2013

1 commit

  • Change cifs.ko to use CIFS_IOCTL_COPYCHUNK instead
    of BTRFS_IOC_CLONE to avoid confusion about whether
    copy-on-write is required or optional for this operation.

    SMB2/SMB3 copyoffload had used the BTRFS_IOC_CLONE ioctl since
    they both speed up copy by offloading the copy rather than
    passing many read and write requests back and forth and both have
    identical syntax (passing file handles), but for SMB2/SMB3
    CopyChunk the server is not required to use copy-on-write
    to make a copy of the file (although some do), and Christoph
    has commented that since CopyChunk does not require
    copy-on-write we should not reuse BTRFS_IOC_CLONE.

    This patch renames the ioctl to use a cifs specific IOCTL
    CIFS_IOCTL_COPYCHUNK. This ioctl is particularly important
    for SMB2/SMB3 since large file copy over the network can otherwise
    be very slow, and with this it is often more than 100 times
    faster, putting less load on server and client.

    Note that if a copy syscall is ever introduced, depending on
    its requirements/format it could end up using one of the other
    three methods that CIFS/SMB2/SMB3 can do for copy offload,
    but this method is particularly useful for file copy
    and broadly supported (not just by Samba server).

    Signed-off-by: Steve French
    Reviewed-by: Jeff Layton
    Reviewed-by: David Disseldorp

    Steve French
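    A hedged sketch of how userspace might request the copy offload
    described in the cifs commit above. The ioctl name follows the pull
    request text (CIFS_IOC_COPYCHUNK_FILE); the magic number 0xCF, the
    command number 3, and the convention of passing the source file
    descriptor as the argument on the destination descriptor are
    assumptions here and should be checked against the cifs ioctl header of
    the kernel in use. cifs_copychunk_file is a hypothetical wrapper name.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>

/* Assumed values; verify against the kernel's cifs ioctl header. */
#ifndef CIFS_IOCTL_MAGIC
#define CIFS_IOCTL_MAGIC 0xCF
#endif
#ifndef CIFS_IOC_COPYCHUNK_FILE
#define CIFS_IOC_COPYCHUNK_FILE _IOW(CIFS_IOCTL_MAGIC, 3, int)
#endif

/*
 * Ask the server to copy the file behind src_fd into the file behind
 * dst_fd. Both descriptors are expected to refer to files on a cifs mount.
 * Returns 0 on success, or -1 with errno set by ioctl().
 */
static int cifs_copychunk_file(int dst_fd, int src_fd)
{
    return ioctl(dst_fd, CIFS_IOC_COPYCHUNK_FILE, src_fd);
}
```

    On descriptors that are not on a cifs mount the call is expected to
    fail, so a caller would fall back to an ordinary read/write copy.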
     

24 Nov, 2013

8 commits

  • ceph_osdc_readpages() returns the number of bytes read. Currently,
    the code only allocates full-zero pages into fscache; this patch
    fixes that.

    Signed-off-by: Li Wang
    Reviewed-by: Milosz Tanski
    Reviewed-by: Sage Weil

    Li Wang
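    A small model of the return-value handling that the readpage commit
    above is about, assuming a read call that returns a negative error or
    the number of bytes placed at the start of the page. MODEL_PAGE_SIZE
    and finish_readpage_model are hypothetical names; this is a sketch of
    the idea (cache every successful read, not only the zero-byte case),
    not the ceph code.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * nread models the read result: negative on error, otherwise the number
 * of valid bytes at the start of page. The tail of a short read is
 * zero-filled. Returns 1 if the page should be handed to the cache.
 */
static int finish_readpage_model(unsigned char *page, long nread)
{
    if (nread < 0)
        return 0;   /* error: nothing to cache */

    if (nread < MODEL_PAGE_SIZE)
        memset(page + nread, 0, MODEL_PAGE_SIZE - (size_t)nread);

    /* Testing "nread == 0" here would cache only all-zero pages. */
    return 1;
}
```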
     
  • We also need to wake up 'safe' waiters if an error occurs or the request
    is aborted. Otherwise sync(2)/fsync(2) may hang forever.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Sage Weil

    Yan, Zheng
     
  • Aborted requests usually get cleared when the reply is received.
    If the MDS crashes, no reply will be received. So we need to clean up
    aborted requests when re-sending requests.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Greg Farnum
    Signed-off-by: Sage Weil

    Yan, Zheng
     
  • When a cap gets released while composing the cap reconnect message,
    we should skip queuing the release message if the cap hasn't been
    added to the cap reconnect message.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     
  • It's possible that some caps get released while composing the cap
    reconnect message.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     
  • Call __queue_cap_release() in __ceph_remove_cap(); this avoids
    acquiring s_cap_lock twice.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     
  • The following two commits implemented mmap support in the regular file
    path and merged bin file support into the regular path.

    73d9714627ad ("sysfs: copy bin mmap support from fs/sysfs/bin.c to fs/sysfs/file.c")
    3124eb1679b2 ("sysfs: merge regular and bin file handling")

    After the merge, the following commands trigger a spurious lockdep
    warning. "test-mmap-read" simply mmaps the file and dumps the
    content.

    $ cat /sys/block/sda/trace/act_mask
    $ test-mmap-read /sys/devices/pci0000\:00/0000\:00\:03.0/resource0 4096

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.12.0-work+ #378 Not tainted
    -------------------------------------------------------
    test-mmap-read/567 is trying to acquire lock:
    (&of->mutex){+.+.+.}, at: [] sysfs_bin_mmap+0x4f/0x120

    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x49/0xa0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++++}:
    ...
    -> #2 (sr_mutex){+.+.+.}:
    ...
    -> #1 (&bdev->bd_mutex){+.+.+.}:
    ...
    -> #0 (&of->mutex){+.+.+.}:
    ...

    other info that might help us debug this:

    Chain exists of:
    &of->mutex --> sr_mutex --> &mm->mmap_sem

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&mm->mmap_sem);
                                   lock(sr_mutex);
                                   lock(&mm->mmap_sem);
      lock(&of->mutex);

    *** DEADLOCK ***

    1 lock held by test-mmap-read/567:
    #0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x49/0xa0

    stack backtrace:
    CPU: 3 PID: 567 Comm: test-mmap-read Not tainted 3.12.0-work+ #378
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffffffff81ed41a0 ffff880009441bc8 ffffffff81611ad2 ffffffff81eccb80
    ffff880009441c08 ffffffff8160f215 ffff880009441c60 ffff880009c75208
    0000000000000000 ffff880009c751e0 ffff880009c75208 ffff880009c74ac0
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] print_circular_bug+0x2b0/0x2bf
    [] __lock_acquire+0x1a3a/0x1e60
    [] lock_acquire+0x9a/0x1d0
    [] mutex_lock_nested+0x67/0x3f0
    [] sysfs_bin_mmap+0x4f/0x120
    [] mmap_region+0x3b3/0x5b0
    [] do_mmap_pgoff+0x34e/0x3d0
    [] vm_mmap_pgoff+0x6a/0xa0
    [] SyS_mmap_pgoff+0xbe/0x250
    [] SyS_mmap+0x22/0x30
    [] system_call_fastpath+0x16/0x1b

    This happens because one file nests sr_mutex, which nests mm->mmap_sem
    under it, under of->mutex while mmap implementation naturally nests
    of->mutex under mm->mmap_sem. The warning is a false positive as
    of->mutex is per open-file and the two paths belong to two different
    files. This warning didn't trigger before regular and bin file
    supports were merged because only bin file supported mmap and the
    other side of locking happened only on regular files which used
    equivalent but separate locking.

    It'd be best if we give separate locking classes per file but we can't
    easily do that. Let's differentiate on ->mmap() for now. Later we'll
    add explicit file operations struct and can add per-ops lockdep key
    there.

    Signed-off-by: Tejun Heo
    Reported-by: Dave Jones
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Commit bcdde7e221a8 (sysfs: make __sysfs_remove_dir() recursive) changed
    the behavior so that directory removals will be done recursively. This
    means that the sysfs group might already be removed if its parent directory
    has been removed.

    The current code outputs warnings similar to the following log snippet when it
    detects that there is no group for the given kobject:

    WARNING: CPU: 0 PID: 4 at fs/sysfs/group.c:214 sysfs_remove_group+0xc6/0xd0()
    sysfs group ffffffff81c6f1e0 not found for kobject 'host7'
    Modules linked in:
    CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.12.0+ #13
    Hardware name: /D33217CK, BIOS GKPPT10H.86A.0042.2013.0422.1439 04/22/2013
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    0000000000000009 ffff8801002459b0 ffffffff817daab1 ffff8801002459f8
    ffff8801002459e8 ffffffff810436b8 0000000000000000 ffffffff81c6f1e0
    ffff88006d440358 ffff88006d440188 ffff88006e8b4c28 ffff880100245a48
    Call Trace:
    [] dump_stack+0x45/0x56
    [] warn_slowpath_common+0x78/0xa0
    [] warn_slowpath_fmt+0x47/0x50
    [] ? sysfs_get_dirent_ns+0x49/0x70
    [] sysfs_remove_group+0xc6/0xd0
    [] dpm_sysfs_remove+0x3e/0x50
    [] device_del+0x40/0x1b0
    [] device_unregister+0xd/0x20
    [] scsi_remove_host+0xba/0x110
    [] ata_host_detach+0xc6/0x100
    [] ata_pci_remove_one+0x18/0x20
    [] pci_device_remove+0x28/0x60
    [] __device_release_driver+0x64/0xd0
    [] device_release_driver+0x1e/0x30
    [] bus_remove_device+0xf7/0x140
    [] device_del+0x121/0x1b0
    [] pci_stop_bus_device+0x94/0xa0
    [] pci_stop_bus_device+0x3b/0xa0
    [] pci_stop_bus_device+0x3b/0xa0
    [] pci_stop_and_remove_bus_device+0xd/0x20
    [] trim_stale_devices+0x73/0xe0
    [] trim_stale_devices+0xbb/0xe0
    [] trim_stale_devices+0xbb/0xe0
    [] acpiphp_check_bridge+0x7e/0xd0
    [] hotplug_event+0xcd/0x160
    [] hotplug_event_work+0x25/0x60
    [] acpi_hotplug_work_fn+0x17/0x22
    [] process_one_work+0x17a/0x430
    [] worker_thread+0x119/0x390
    [] ? manage_workers.isra.25+0x2a0/0x2a0
    [] kthread+0xcd/0xf0
    [] ? kthread_create_on_node+0x180/0x180
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_node+0x180/0x180

    On this particular machine I see ~16 of these messages during Thunderbolt
    hot-unplug.

    Fix this in a similar way to what was done for sysfs_remove_one(), by checking
    if the parent directory has already been removed and bailing out early.

    Signed-off-by: Mika Westerberg
    Acked-by: Rafael J. Wysocki
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Mika Westerberg
     

23 Nov, 2013

6 commits

  • …ux/kernel/git/tyhicks/ecryptfs

    Pull minor eCryptfs fix from Tyler Hicks:
    "Quiet static checkers by removing unneeded conditionals"

    * tag 'ecryptfs-3.13-rc1-quiet-checkers' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: file->private_data is always valid

    Linus Torvalds
     
  • Pull aio fixes from Benjamin LaHaise.

    * git://git.kvack.org/~bcrl/aio-next:
    aio: nullify aio->ring_pages after freeing it
    aio: prevent double free in ioctx_alloc
    aio: Fix a trinity splat

    Linus Torvalds
     
  • Pull nfsd bugfixes from Bruce Fields:
    "A couple nfsd bugfixes"

    * 'for-3.13' of git://linux-nfs.org/~bfields/linux:
    nfsd4: fix xdr decoding of large non-write compounds
    nfsd: make sure to balance get/put_write_access
    nfsd: split up nfsd_setattr

    Linus Torvalds
     
  • Pull GFS2 fixes from Steven Whitehouse:
    "A couple of small, but important bug fixes for GFS2. The first one
    fixes a possible NULL pointer dereference, and the second one resolves
    a reference counting issue in one of the lesser used paths through
    atomic_open"

    * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
    GFS2: Fix ref count bug relating to atomic_open
    GFS2: fix potential NULL pointer dereference

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "Almost all of these are bug fixes. Dave Sterba's documentation update
    is the big exception because he removed our promises to set any
    machine running Btrfs on fire"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Documentation: filesystems: update btrfs tools section
    Documentation: filesystems: add new btrfs mount options
    btrfs: update kconfig help text
    btrfs: fix bio_size_ok() for max_sectors > 0xffff
    btrfs: Use trace condition for get_extent tracepoint
    btrfs: fix typo in the log message
    Btrfs: fix list delete warning when removing ordered root from the list
    Btrfs: print bytenr instead of page pointer in check-int
    Btrfs: remove dead codes from ctree.h
    Btrfs: don't wait for ordered data outside desired range
    Btrfs: fix lockdep error in async commit
    Btrfs: avoid heavy operations in btrfs_commit_super
    Btrfs: fix __btrfs_start_workers retval
    Btrfs: disable online raid-repair on ro mounts
    Btrfs: do not inc uncorrectable_errors counter on ro scrubs
    Btrfs: only drop modified extents if we logged the whole inode
    Btrfs: make sure to copy everything if we rename
    Btrfs: don't BUG_ON() if we get an error walking backrefs

    Linus Torvalds
     
  • Pull second xfs update from Ben Myers:
    "There are a couple of patches that I wasn't quite sure about in time
    for our initial 3.13 pull request, a bugfix, and an update to add Dave
    to MAINTAINERS:

    Here we have a performance fix for inode iversion, increased inode
    cluster size for v5 superblock filesystems, a fix for error handling
    in xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave"

    * tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs:
    xfs: open code inc_inode_iversion when logging an inode
    xfs: increase inode cluster size for v5 filesystems
    xfs: fix unlock in xfs_bmap_add_attrfork
    xfs: update maintainers

    Linus Torvalds
     

22 Nov, 2013

4 commits

  • Merge patches from Andrew Morton:
    "13 fixes"

    * emailed patches from Andrew Morton:
    mm: place page->pmd_huge_pte to right union
    MAINTAINERS: add keyboard driver to Hyper-V file list
    x86, mm: do not leak page->ptl for pmd page tables
    ipc,shm: correct error return value in shmctl (SHM_UNLOCK)
    mm, mempolicy: silence gcc warning
    block/partitions/efi.c: fix bound check
    ARM: drivers/rtc/rtc-at91rm9200.c: disable interrupts at shutdown
    mm: hugetlbfs: fix hugetlbfs optimization
    kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
    ipc,shm: fix shm_file deletion races
    mm: thp: give transparent hugepage code a separate copy_page
    checkpatch: fix "Use of uninitialized value" warnings
    configfs: fix race between dentry put and lookup

    Linus Torvalds
     
  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     
  • There is a race window in configfs: it starts when a dentry is UNHASHED
    and ends before configfs_d_iput is called. If a lookup happens in this
    window, a new dentry will be allocated because the original dentry was
    UNHASHED, and then in configfs_attach_attr(), sd->s_dentry will be
    updated to the new dentry. Then in configfs_d_iput(),
    BUG_ON(sd->s_dentry != dentry) will be triggered and the system panics.

    sys_open:                     sys_close:
     ...                           fput
                                    dput
                                     dentry_kill
                                      __d_drop <--- sd->s_dentry still points
                                                    to this dentry.

     lookup_real
      configfs_lookup
       configfs_attach_attr ---> update sd->s_dentry
                                 to new allocated dentry here.

                                      d_kill
                                       configfs_d_iput <--- BUG_ON(sd->s_dentry != dentry)
                                                            triggered here.

    To fix it, change configfs_d_iput to not update sd->s_dentry if
    sd->s_count > 2, which means another dentry is using the sd
    besides the one that is going to be put. Use configfs_dirent_lock in
    configfs_attach_attr to sync with configfs_d_iput.

    With the following steps, you can reproduce the bug.

    1. Enable ocfs2; this will mount configfs at /sys/kernel/config and
    populate it with configuration entries.

    2. Run the following script.
    while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
    while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &

    Signed-off-by: Junxiao Bi
    Cc: Joel Becker
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • In the case that atomic_open calls finish_no_open() with
    the dentry that was supplied to gfs2_atomic_open() an
    extra reference count is required. This patch fixes that
    issue preventing a bug trap triggering at umount time.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

21 Nov, 2013

13 commits

  • Commit [e66cf1610: GFS2: Use lockref for glocks] replaced the call:
    atomic_read(&gi->gl->gl_ref) == 0
    with:
    __lockref_is_dead(&gl->gl_lockref)
    therefore changing how gl is accessed, from gi->gl to plain gl.
    However, gl can be a NULL pointer, and so gi->gl needs to be
    used instead (which is guaranteed not to be NULL because of
    the while loop checking that condition).

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Steven Whitehouse

    Michal Nazarewicz
     
  • Reflect the current status. Portions of the text taken from the
    wiki pages.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • The data type of max_sectors in queue settings is unsigned int. But
    this value is stored to the local variable whose type is unsigned short
    in bio_size_ok(). This can cause unexpected result when max_sectors >
    0xffff.

    Cc: Chris Mason
    Cc: linux-btrfs@vger.kernel.org
    Signed-off-by: Akinobu Mita
    Signed-off-by: Chris Mason

    Akinobu Mita
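    The truncation described in the max_sectors commit above can be
    illustrated with a short standalone example. size_ok_truncated and
    size_ok_fixed are hypothetical helpers, not the btrfs function, and the
    example assumes a 16-bit unsigned short, as on common platforms.

```c
#include <assert.h>

/* Buggy shape: the limit is narrowed to unsigned short before the compare. */
static int size_ok_truncated(unsigned int max_sectors, unsigned int sectors)
{
    /* With a 16-bit short only the low 16 bits survive: 0x10000 becomes 0. */
    unsigned short limit = (unsigned short)max_sectors;

    return sectors <= limit;
}

/* Fixed shape: keep the limit in the same type as the queue setting. */
static int size_ok_fixed(unsigned int max_sectors, unsigned int sectors)
{
    unsigned int limit = max_sectors;

    return sectors <= limit;
}
```

    With max_sectors = 0x10000, the truncated variant rejects even an
    8-sector request, while the fixed variant accepts it.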
     
  • Doing an if statement to test some condition to know if we should
    trigger a tracepoint is pointless when tracing is disabled. This just
    adds overhead and wastes a branch prediction. This is why the
    TRACE_EVENT_CONDITION() was created. It places the check inside the jump
    label so that the branch does not happen unless tracing is enabled.

    That is, instead of doing:

    if (em)
            trace_btrfs_get_extent(root, em);

    Which is basically this:

    if (em)
            if (static_key(trace_btrfs_get_extent)) {

    Using a TRACE_EVENT_CONDITION() we can just do:

    trace_btrfs_get_extent(root, em);

    And the condition trace event will do:

    if (static_key(trace_btrfs_get_extent)) {
            if (em) {
                    ...

    The static key is an unconditional jump (or nop) that is faster than
    having to check if em is NULL or not.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Steven Rostedt
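    A userspace model of the check ordering discussed in the tracepoint
    commit above. A plain bool stands in for the static key here, which is
    an assumption for illustration only: a real static key is a patched
    jump or nop, not a memory load. All names in the block are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool tracing_enabled;   /* stands in for the static key */
static int condition_checks;   /* how often "em != NULL" was evaluated */
static int events_emitted;

static bool em_is_set(const void *em)
{
    condition_checks++;
    return em != NULL;
}

/* Caller-side test: the condition is evaluated even with tracing off. */
static void trace_caller_checks(const void *em)
{
    if (em_is_set(em))
        if (tracing_enabled)
            events_emitted++;
}

/* Conditional-event shape: the condition sits behind the enable check. */
static void trace_conditional(const void *em)
{
    if (tracing_enabled)
        if (em_is_set(em))
            events_emitted++;
}
```

    With tracing disabled, trace_conditional() never evaluates the
    condition, which is the saving the commit message describes.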
     
  • Signed-off-by: Anand Jain
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Anand Jain
     
  • Commit b02441999efcc6152b87cd58e7970bb7843f76cf "Btrfs: don't wait for
    the completion of all the ordered extents" introduced a bug that broke
    the ordered root list:
    WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()

    It is because we forgot to return the roots in the splice list to the
    ordered list of the fs. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The page pointer information was useless. The bytenr is what you
    want when you search for submitted write bios.

    Additionally, a new bit in the print mask is added that allows
    to selectively enable the check-int submit_bio verbose mode. Before,
    the global verbose mode had to be enabled leading to many million
    useless lines in the kernel log.

    And a comment is added that explains that LOG_BUF_SHIFT needs to
    be set to a really high value.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • These two functions are only declared but never defined.

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • In btrfs_wait_ordered_range(), if we found an extent to the left
    of the start of our desired wait range and the last byte of that
    extent is 1 less than the desired range's start, we would
    wait for the IO completion of that extent unnecessarily.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
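    The boundary case in the wait-range commit above comes down to an
    off-by-one in a range comparison. The sketch below assumes an extent
    covering bytes [offset, offset + len) and a wait range beginning at
    start; both helper names are hypothetical, and the exact comparison
    used in btrfs is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Too strict: an extent ending exactly at start is not treated as "before". */
static int ends_before_strict(uint64_t offset, uint64_t len, uint64_t start)
{
    return offset + len < start;
}

/* Half-open ranges: an extent whose end equals start lies entirely before it. */
static int ends_before(uint64_t offset, uint64_t len, uint64_t start)
{
    return offset + len <= start;
}
```

    For an extent [0, 4096) and start = 4096, the last byte of the extent is
    4095, one less than start, so only ends_before() reports that there is
    nothing to wait for.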
     
  • Lockdep complains about btrfs's async commit:

    [ 2372.462171] [ BUG: bad unlock balance detected! ]
    [ 2372.462191] 3.12.0+ #32 Tainted: G W
    [ 2372.462209] -------------------------------------
    [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
    [ 2372.462275] [] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
    [ 2372.462305] but there are no more locks to release!
    [ 2372.462324]
    [ 2372.462324] other info that might help us debug this:
    [ 2372.462349] no locks held by ceph-osd/14048.
    [ 2372.462367]
    [ 2372.462367] stack backtrace:
    [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32
    [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
    [ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
    [ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
    [ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
    [ 2372.462562] Call Trace:
    [ 2372.462584] [] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
    [ 2372.462619] [] dump_stack+0x45/0x56
    [ 2372.462642] [] print_unlock_imbalance_bug+0xec/0x100
    [ 2372.462677] [] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
    [ 2372.462710] [] lock_release+0x18e/0x210
    [ 2372.462742] [] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
    [ 2372.462783] [] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
    [ 2372.462822] [] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
    [ 2372.462849] [] ? avc_has_perm+0x121/0x1b0
    [ 2372.462873] [] ? avc_has_perm+0x24/0x1b0
    [ 2372.462897] [] ? sched_clock_cpu+0xa8/0x100
    [ 2372.462922] [] do_vfs_ioctl+0x2e5/0x4e0
    [ 2372.462946] [] ? file_has_perm+0x86/0xa0
    [ 2372.462969] [] SyS_ioctl+0x81/0xa0
    [ 2372.462991] [] tracesys+0xdd/0xe2

    ====================================================

    This is because we don't do the right thing when checking whether it's OK to
    tell lockdep that we're trying to release the rwsem.

    If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
    as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
    rwsem, which makes lockdep complain a lot.

    Reported-by: Ma Jianpeng
    Signed-off-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
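    A sketch of why an ordering test misclassifies TRANS_ATTACH in the
    async commit described above. The numeric values and the T_FREEZABLE
    flag are hypothetical, chosen only so that the attach type sorts below
    the join-nolock type as the commit message states; testing an explicit
    flag is one possible way to express the intent, not necessarily the
    change that was merged.

```c
#include <assert.h>

/* Hypothetical encoding for illustration; not the kernel's values. */
#define T_FREEZABLE   (1u << 0)                  /* type takes the freeze rwsem */
#define T_START       ((1u << 9) | T_FREEZABLE)
#define T_ATTACH      (1u << 10)                 /* does not take the rwsem */
#define T_JOIN        ((1u << 11) | T_FREEZABLE)
#define T_JOIN_NOLOCK (1u << 12)                 /* does not take the rwsem */

/* Ordering test: anything sorting below T_JOIN_NOLOCK is released. */
static int releases_by_ordering(unsigned int type)
{
    return type < T_JOIN_NOLOCK;
}

/* Flag test: release only what was actually acquired. */
static int releases_by_flag(unsigned int type)
{
    return (type & T_FREEZABLE) != 0;
}
```

    Under this encoding the ordering test reports a release for T_ATTACH
    even though that type never acquired the rwsem, which is the imbalance
    lockdep complains about.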
     
  • The 'git blame' history shows that the old transaction commit code had to
    commit twice to ensure roots were updated, and we had to flush metadata and
    the super block manually. However, all of this is now handled inside
    the transaction commit code without extra effort.

    The error handling remains the same as in the current code: 'return to
    caller once we get error'.

    This saves us a transaction commit and a flush of super block, which are both
    heavy operations according to ftrace output analysis.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • __btrfs_start_workers returns 0 in case it raced with
    btrfs_stop_workers and lost the race. This is wrong because the worker in
    this case is not allowed to start and is in fact destroyed. Return
    -EINVAL instead.

    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Ilya Dryomov
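    A minimal model of the return-value change in the start-workers commit
    above, assuming a "stopping" flag that the stop path sets. The struct
    and function names are hypothetical; the sketch only shows that losing
    the race is reported as -EINVAL rather than as success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct workers_model {
    bool stopping;     /* set by the stop path */
    int num_workers;   /* workers successfully started */
};

/* Returns 0 if a worker was started, -EINVAL if the pool is being stopped. */
static int start_worker_model(struct workers_model *w)
{
    if (w->stopping)
        return -EINVAL;   /* lost the race with stop: report failure, not 0 */

    w->num_workers++;
    return 0;
}
```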
     
  • This disables the "if needed, write the good copy back before the read
    is completed" part of the read sequence for read-only mounts.

    Cc: Jan Schmidt
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Ilya Dryomov