03 Aug, 2016

1 commit


01 Aug, 2016

11 commits

  • With commit 56f23fdbb600 ("Btrfs: fix file/data loss caused by fsync after
    rename and new inode") we got simple fix for a functional issue when the
    following sequence of actions is done:

    at transaction N
    create file A at directory D
    at transaction N + M (where M >= 1)
    move/rename existing file A from directory D to directory E
    create a new file named A at directory D
    fsync the new file
    power fail

    The solution was to simply detect such scenario and fallback to a full
    transaction commit when we detect it. However this turned out to had a
    significant impact on throughput (and a bit on latency too) for benchmarks
    using the dbench tool, which simulates real workloads from smbd (Samba)
    servers. For example on a test vm (with a debug kernel):

    Unpatched:
    Throughput 19.1572 MB/sec 32 clients 32 procs max_latency=1005.229 ms

    Patched:
    Throughput 23.7015 MB/sec 32 clients 32 procs max_latency=809.206 ms

    The patched results (this patch is applied) are similar to the results of
    a kernel with the commit 56f23fdbb600 ("Btrfs: fix file/data loss caused
    by fsync after rename and new inode") reverted.

    This change avoids the fallback to a transaction commit and instead makes
    sure all the names of the conflicting inode (the one that had a name in a
    past transaction that matches the name of the new file in the same parent
    directory) are logged so that at log replay time we don't lose neither the
    new file nor the old file, and the old file gets the name it was renamed
    to.

    This also ends up avoiding a full transaction commit for a similar case
    that involves an unlink instead of a rename of the old file:

    at transaction N
    create file A at directory D
    at transaction N + M (where M >= 1)
    remove file A
    create a new file named A at directory D
    fsync the new file
    power fail

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • When we attempt to read an inode from disk, we end up always returning an
    -ESTALE error to the caller regardless of the actual failure reason, which
    can be an out of memory problem (when allocating a path), some error found
    when reading from the fs/subvolume btree (like a genuine IO error) or the
    inode does not exists. So lets start returning the real error code to the
    callers so that they don't treat all -ESTALE errors as meaning that the
    inode does not exists (such as during orphan cleanup). This will also be
    needed for a subsequent patch in the same series dealing with a special
    fsync case.

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • When doing an incremental send, if we find a new/modified/deleted extent,
    reference or xattr without having previously processed the corresponding
    inode item we end up exexuting a BUG_ON(). This is because whenever an
    extent, xattr or reference is added, modified or deleted, we always expect
    to have the corresponding inode item updated. However there are situations
    where this will not happen due to transient -ENOMEM or -ENOSPC errors when
    doing delayed inode updates.

    For example, when punching holes we can succeed in deleting and modifying
    (shrinking) extents but later fail to do the delayed inode update. So after
    such failure we close our transaction handle and right after a snapshot of
    the fs/subvol tree can be made and used later for a send operation. The
    same thing can happen during truncate, link, unlink, and xattr related
    operations.

    So instead of executing a BUG_ON, make send return an -EIO error and print
    an informative error message do dmesg/syslog.

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • The caller of send_utimes() is supposed to be sure that the inode number
    it passes to this function does actually exists in the send snapshot.
    However due to logic/algorithm bugs (such as the one fixed by the patch
    titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes
    operations"), this might not be the case and when that happens it makes
    send_utimes() access use an unrelated leaf item as the target inode item
    or access beyond a leaf's boundaries (when the leaf is full and
    path->slots[0] matches the number of items in the leaf).

    So if the call to btrfs_search_slot() done by send_utimes() does not find
    the inode item, just make sure send_utimes() returns -ENOENT and does not
    silently accesses unrelated leaf items or does invalid leaf accesses, also
    allowing us to easialy and deterministically catch such algorithmic/logic
    bugs.

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • During an incremental send, if we have delayed rename operations for inodes
    that were children of directories which were removed in the send snapshot,
    we can end up accessing incorrect items in a leaf or accessing beyond the
    last item of the leaf due to issuing utimes operations for the removed
    inodes. Consider the following example:

    Parent snapshot:
    . (ino 256)
    |--- a/ (ino 257)
    | |--- c/ (ino 262)
    |
    |--- b/ (ino 258)
    | |--- d/ (ino 263)
    |
    |--- del/ (ino 261)
    |--- x/ (ino 259)
    |--- y/ (ino 260)

    Send snapshot:

    . (ino 256)
    |--- a/ (ino 257)
    |
    |--- b/ (ino 258)
    |
    |--- c/ (ino 262)
    | |--- y/ (ino 260)
    |
    |--- d/ (ino 263)
    |--- x/ (ino 259)

    1) When processing inodes 259 and 260, we end up delaying their rename
    operations because their parents, inodes 263 and 262 respectively, were
    not yet processed and therefore not yet renamed;

    2) When processing inode 262, its rename operation is issued and right
    after the rename operation for inode 260 is issued. However right after
    issuing the rename operation for inode 260, at send.c:apply_dir_move(),
    we issue utimes operations for all current and past parents of inode
    260. This means we try to send a utimes operation for its old parent,
    inode 261 (deleted in the send snapshot), which does not cause any
    immediate and deterministic failure, because when the target inode is
    not found in the send snapshot, the send.c:send_utimes() function
    ignores it and uses the leaf region pointed to by path->slots[0],
    which can be any unrelated item (belonging to other inode) or it can
    be a region outside the leaf boundaries, if the leaf is full and
    path->slots[0] matches the number of items in the leaf. So we end
    up either successfully sending a utimes operation, which is fine
    and irrelevant because the old parent (inode 261) will end up being
    deleted later, or we end up doing an invalid memory access tha
    crashes the kernel.

    So fix this by making apply_dir_move() issue utimes operations only for
    parents that still exist in the send snapshot. In a separate patch we
    will make send_utimes() return an error (-ENOENT) if the given inode
    does not exists in the send snapshot.

    Signed-off-by: Robbie Ko
    Signed-off-by: Filipe Manana
    [Rewrote change log to be more detailed and better organized]

    Signed-off-by: Filipe Manana

    Robbie Ko
     
  • Under certain situations, when doing an incremental send, we can end up
    not freeing orphan_dir_info structures as soon as they are no longer
    needed. Instead we end up freeing them only after finishing the send
    stream, which causes a warning to be emitted:

    [282735.229200] ------------[ cut here ]------------
    [282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
    [282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
    [282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
    [282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [282735.240160] 0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000
    [282735.240160] 0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8
    [282735.240160] ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000
    [282735.240160] Call Trace:
    [282735.240160] [] dump_stack+0x67/0x90
    [282735.240160] [] __warn+0xc2/0xdd
    [282735.240160] [] warn_slowpath_null+0x1d/0x1f
    [282735.240160] [] btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
    [282735.240160] [] btrfs_ioctl+0x14f/0x1f81 [btrfs]
    [282735.240160] [] ? arch_local_irq_save+0x9/0xc
    [282735.240160] [] vfs_ioctl+0x18/0x34
    [282735.240160] [] do_vfs_ioctl+0x550/0x5be
    [282735.240160] [] ? __fget+0x6b/0x77
    [282735.240160] [] ? __fget_light+0x62/0x71
    [282735.240160] [] SyS_ioctl+0x57/0x79
    [282735.240160] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [282735.240160] [] ? time_hardirqs_off+0x9/0x14
    [282735.240160] [] ? trace_hardirqs_off_caller+0x1f/0xaa
    [282735.256343] ---[ end trace a4539270c8056f93 ]---

    Consider the following example:

    Parent snapshot:

    . (ino 256)
    |--- a/ (ino 257)
    | |--- c/ (ino 260)
    |
    |--- del/ (ino 259)
    |--- tmp/ (ino 258)
    |--- x/ (ino 261)
    |--- y/ (ino 262)

    Send snapshot:

    . (ino 256)
    |--- a/ (ino 257)
    | |--- x/ (ino 261)
    | |--- y/ (ino 262)
    |
    |--- c/ (ino 260)
    |--- tmp/ (ino 258)

    1) When processing inode 258, we end up delaying its rename operation
    because it has an ancestor (in the send snapshot) that has a higher
    inode number (inode 260) which was also renamed in the send snapshot,
    therefore we delay the rename of inode 258 so that it happens after
    inode 260 is renamed;

    2) When processing inode 259, we end up delaying its deletion (rmdir
    operation) because it has a child inode (258) that has its rename
    operation delayed. At this point we allocate an orphan_dir_info
    structure and tag inode 258 so that we later attempt to see if we
    can delete (rmdir) inode 259 once inode 258 is renamed;

    3) When we process inode 260, after renaming it we finally do the rename
    operation for inode 258. Once we issue the rename operation for inode
    258 we notice that this inode was tagged so that we attempt to see
    if at this point we can delete (rmdir) inode 259. But at this point
    we can not still delete inode 259 because it has 2 children, inodes
    261 and 262, that were not yet processed and therefore not yet
    moved (renamed) away from inode 259. We end up not freeing the
    orphan_dir_info structure allocated in step 2;

    4) We process inodes 261 and 262, and once we move/rename inode 262
    we issue the rmdir operation for inode 260;

    5) We finish the send stream and notice that red black tree that
    contains orphan_dir_info structures is not empty, so we emit
    a warning and then free any orphan_dir_structures left.

    So fix this by freeing an orphan_dir_info structure once we try to
    apply a pending rename operation if we can not delete yet the tagged
    directory.

    A test case for fstests follows soon.

    Signed-off-by: Robbie Ko
    Signed-off-by: Filipe Manana
    [Modified changelog to be more detailed and easier to understand]

    Robbie Ko
     
  • Under certain situations, an incremental send operation can contain
    a rmdir operation that will make the receiving end fail when attempting
    to execute it, because the target directory is not yet empty.

    Consider the following example:

    Parent snapshot:

    . (ino 256)
    |--- a/ (ino 257)
    | |--- c/ (ino 260)
    |
    |--- del/ (ino 259)
    |--- tmp/ (ino 258)
    |--- x/ (ino 261)

    Send snapshot:

    . (ino 256)
    |--- a/ (ino 257)
    | |--- x/ (ino 261)
    |
    |--- c/ (ino 260)
    |--- tmp/ (ino 258)

    1) When processing inode 258, we delay its rename operation because inode
    260 is its new parent in the send snapshot and it was not yet renamed
    (since 260 > 258, that is, beyond the current progress);

    2) When processing inode 259, we realize we can not yet send an rmdir
    operation (against inode 259) because inode 258 was still not yet
    renamed/moved away from inode 259. Therefore we update data structures
    so that after inode 258 is renamed, we try again to see if we can
    finally send an rmdir operation for inode 259;

    3) When we process inode 260, we send a rename operation for it followed
    by a rename operation for inode 258. Once we send the rename operation
    for inode 258 we then check if we can finally issue an rmdir for its
    previous parent, inode 259, by calling the can_rmdir() function with
    a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress"
    argument. This makes can_rmdir() return true (value 1) because even
    though there's still a child inode of inode 259 that was not yet
    renamed/moved, which is inode 261, the given value of progress (261)
    is not lower then 261 (that is, not lower than the inode number of
    some child of inode 259). So we end up sending a rmdir operation for
    inode 259 before its child inode 261 is processed and renamed.

    So fix this by passing the correct progress value to the call to
    can_rmdir() from within apply_dir_move() (where we issue delayed rename
    operations), which should match stcx->cur_ino (the number of the inode
    currently being processed) and not sctx->cur_ino + 1.

    A test case for fstests follows soon.

    Signed-off-by: Robbie Ko
    Signed-off-by: Filipe Manana
    [Rewrote change log to be more detailed, clear and well formatted]

    Signed-off-by: Filipe Manana

    Robbie Ko
     
  • Example scenario:

    Parent snapshot:

    . (ino 277)
    |---- tmp/ (ino 278)
    |---- pre/ (ino 280)
    | |---- wait_dir/ (ino 281)
    |
    |---- desc/ (ino 282)
    |---- ance/ (ino 283)
    | |---- below_ance/ (ino 279)
    |
    |---- other_dir/ (ino 284)

    Send snapshot:

    . (ino 277)
    |---- tmp/ (ino 278)
    |---- other_dir/ (ino 284)
    |---- below_ance/ (ino 279)
    | |---- pre/ (ino 280)
    |
    |---- wait_dir/ (ino 281)
    |---- desc/ (ino 282)
    |---- ance/ (ino 283)

    While computing the send stream the following steps happen:

    1) While processing inode 279 we end up delaying its rename operation
    because its new parent in the send snapshot, inode 284, was not
    yet processed and therefore not yet renamed;

    2) Later when processing inode 280 we end up renaming it immediately to
    "ance/below_once/pre" and not delay its rename operation because its
    new parent (inode 279 in the send snapshot) has its rename operation
    delayed and inode 280 is not an encestor of inode 279 (its parent in
    the send snapshot) in the parent snapshot;

    3) When processing inode 281 we end up delaying its rename operation
    because its new parent in the send snapshot, inode 284, was not yet
    processed and therefore not yet renamed;

    4) When processing inode 282 we do not delay its rename operation because
    its parent in the send snapshot, inode 281, already has its own rename
    operation delayed and our current inode (282) is not an ancestor of
    inode 281 in the parent snapshot. Therefore inode 282 is renamed to
    "ance/below_ance/pre/wait_dir";

    5) When processing inode 283 we realize that we can rename it because one
    of its ancestors in the send snapshot, inode 281, has its rename
    operation delayed and inode 283 is not an ancestor of inode 281 in the
    parent snapshot. So a rename operation to rename inode 283 to
    "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
    invalid due to a missing path building loop that was undetected by
    the incremental send implementation, as inode 283 ends up getting
    included twice in the path (once with its path in the parent snapshot).
    Therefore its rename operation must wait before the ancestor inode 284
    is renamed.

    Fix this by not terminating the rename dependency checks when we find an
    ancestor, in the send snapshot, that has its rename operation delayed. So
    that we continue doing the same checks if the current inode is not an
    ancestor, in the parent snapshot, of an ancestor in the send snapshot we
    are processing in the loop.

    The problem and reproducer were reported by Robbie Ko, as part of a patch
    titled "Btrfs: incremental send, avoid ancestor rename to descendant".
    However the fix was unnecessarily complicated and can be addressed with
    much less code and effort.

    Reported-by: Robbie Ko
    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • The function path_loop() can return a negative integer, signaling an
    error, 0 if there's no path loop and 1 if there's a path loop. We were
    treating any non zero values as meaning that a path loop exists. Fix
    this by explicitly checking for errors and gracefully return them to
    user space.

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • When doing an incremental send we can end up not moving directories that
    have the same name. This happens when the same parent directory has
    different child directories with the same name in the parent and send
    snapshots.

    For example, consider the following scenario:

    Parent snapshot:

    . (ino 256)
    |---- d/ (ino 257)
    | |--- p1/ (ino 258)
    |
    |---- p1/ (ino 259)

    Send snapshot:

    . (ino 256)
    |--- d/ (ino 257)
    |--- p1/ (ino 259)
    |--- p1/ (ino 258)

    The directory named "d" (inode 257) has in both snapshots an entry with
    the name "p1" but it refers to different inodes in both snapshots (inode
    258 in the parent snapshot and inode 259 in the send snapshot). When
    attempting to move inode 258, the operation is delayed because its new
    parent, inode 259, was not yet moved/renamed (as the stream is currently
    processing inode 258). Then when processing inode 259, we also end up
    delaying its move/rename operation so that it happens after inode 258 is
    moved/renamed. This decision to delay the move/rename rename operation
    of inode 259 is due to the fact that the new parent inode (257) still
    has inode 258 as its child, which has the same name has inode 259. So
    we end up with inode 258 move/rename operation waiting for inode's 259
    move/rename operation, which in turn it waiting for inode's 258
    move/rename. This results in ending the send stream without issuing
    move/rename operations for inodes 258 and 259 and generating the
    following warnings in syslog/dmesg:

    [148402.979747] ------------[ cut here ]------------
    [148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
    [148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
    [148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
    [148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
    [148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
    [148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
    [148402.988136] Call Trace:
    [148402.988136] [] dump_stack+0x67/0x90
    [148402.988136] [] __warn+0xc2/0xdd
    [148402.988136] [] warn_slowpath_null+0x1d/0x1f
    [148402.988136] [] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
    [148402.988136] [] btrfs_ioctl+0x14f/0x1f81 [btrfs]
    [148402.988136] [] ? arch_local_irq_save+0x9/0xc
    [148402.988136] [] ? __lock_is_held+0x3c/0x57
    [148402.988136] [] vfs_ioctl+0x18/0x34
    [148402.988136] [] do_vfs_ioctl+0x550/0x5be
    [148402.988136] [] ? __fget+0x6b/0x77
    [148402.988136] [] ? __fget_light+0x62/0x71
    [148402.988136] [] SyS_ioctl+0x57/0x79
    [148402.988136] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [148402.988136] [] ? trace_hardirqs_off_caller+0x3f/0xaa
    [148403.011373] ---[ end trace a4539270c8056f8b ]---
    [148403.012296] ------------[ cut here ]------------
    [148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
    [148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
    [148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
    [148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
    [148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
    [148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
    [148403.020104] Call Trace:
    [148403.020104] [] dump_stack+0x67/0x90
    [148403.020104] [] __warn+0xc2/0xdd
    [148403.020104] [] warn_slowpath_null+0x1d/0x1f
    [148403.020104] [] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
    [148403.020104] [] btrfs_ioctl+0x14f/0x1f81 [btrfs]
    [148403.020104] [] ? arch_local_irq_save+0x9/0xc
    [148403.020104] [] ? __lock_is_held+0x3c/0x57
    [148403.020104] [] vfs_ioctl+0x18/0x34
    [148403.020104] [] do_vfs_ioctl+0x550/0x5be
    [148403.020104] [] ? __fget+0x6b/0x77
    [148403.020104] [] ? __fget_light+0x62/0x71
    [148403.020104] [] SyS_ioctl+0x57/0x79
    [148403.020104] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [148403.020104] [] ? trace_hardirqs_off_caller+0x3f/0xaa
    [148403.038981] ---[ end trace a4539270c8056f8c ]---

    There's another issue caused by similar (but more complex) changes in the
    directory hierarchy that makes move/rename operations fail, described with
    the following example:

    Parent snapshot:

    .
    |---- a/ (ino 262)
    | |---- c/ (ino 268)
    |
    |---- d/ (ino 263)
    |---- ance/ (ino 267)
    |---- e/ (ino 264)
    |---- f/ (ino 265)
    |---- ance/ (ino 266)

    Send snapshot:

    .
    |---- a/ (ino 262)
    |---- c/ (ino 268)
    | |---- ance/ (ino 267)
    |
    |---- d/ (ino 263)
    | |---- ance/ (ino 266)
    |
    |---- f/ (ino 265)
    |---- e/ (ino 264)

    When the inode 265 is processed, the path for inode 267 is computed, which
    at that time corresponds to "d/ance", and it's stored in the names cache.
    Later on when processing inode 266, we end up orphanizing (renaming to a
    name matching the pattern o--) inode 267 because it has
    the same name as inode 266 and it's currently a child of the new parent
    directory (inode 263) for inode 266. After the orphanization and while we
    are still processing inode 266, a rename operation for inode 266 is
    generated. However the source path for that rename operation is incorrect
    because it ends up using the old, pre-orphanization, name of inode 267.
    The no longer valid name for inode 267 was previously cached when
    processing inode 265 and it remains usable and considered valid until
    the inode currently being processed has a number greater than 267.
    This resulted in the receiving side failing with the following error:

    ERROR: rename d/ance/ance -> d/ance failed: No such file or directory

    So fix these issues by detecting such circular dependencies for rename
    operations and by clearing the cached name of an inode once the inode
    is orphanized.

    A test case for fstests will follow soon.

    Signed-off-by: Robbie Ko
    Signed-off-by: Filipe Manana
    [Rewrote change log to be more detailed and organized, and improved
    comments]

    Signed-off-by: Filipe Manana

    Robbie Ko
     
  • When we start an fsync we start ordered extents for all delalloc ranges.
    However before attempting to log the inode, we only wait for those ordered
    extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
    is set in the inode's flags). This means that if an ordered extent
    completes with an IO error before we check if we can skip logging the
    inode, we will not catch and report the IO error to user space. This is
    because on an IO error, when the ordered extent completes we do not
    update the inode, so if the inode was not previously updated by the
    current transaction we end up not logging it through calls to fsync and
    therefore not check its mapping flags for the presence of IO errors.

    Fix this by checking for errors in the flags of the inode's mapping when
    we notice we can skip logging the inode.

    This caused sporadic failures in the test generic/331 (which explicitly
    tests for IO errors during an fsync call).

    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo

    Filipe Manana
     

21 Jul, 2016

3 commits


08 Jul, 2016

18 commits

  • We used to allow you to set FLUSH_ALL and then just wouldn't do things like
    commit transactions or wait on ordered extents if we noticed you were in a
    transaction. However now that all the flushing for FLUSH_ALL is asynchronous
    we've lost the ability to tell, and we could end up deadlocking. So instead use
    FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
    we error out to preserve the previous behavior. I've also added an ASSERT() to
    catch anybody else who tries to do this. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Since we set the reloc control before we've reserved our space for relocation we
    could race with a root being dirtied and not actually have space to do our init
    reloc root. So once we've allocated it and set it up go ahead and make our
    reservation before setting the relocate control, that way anybody who tries to
    do the reloc root init has space to use. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This is the case all the time anyway except for relocation which could be doing
    a reloc root for a non ref counted root, in which case we'd end up with some
    random block rsv rather than the one we have our reservation in. If there isn't
    enough space in the block rsv we are trying to steal from we'll BUG() because we
    expect there to be space for the orphan to make its reservation. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Traditionally we've calculated the global block rsv by guessing how much of the
    metadata used amount was the extent tree, and then taking the data size and
    figuring out how large the csum tree would have to be to hold that much data.

    This is imprecise and falls down on MIXED file systems as we can't trust the
    data used amount. This resulted in failures for xfstests generic/333 because it
    creates lots of clones, which explodes out the extent tree. Our global reserve
    calculations were woefully inaccurate in this case which meant we got into a
    situation where we did not have enough reserved to do our work.

    We know we only use the global block rsv for the extent, csum, and root trees,
    so just get the bytes used for these trees and use that as the basis of our
    global reserve. Since these are not reference counted trees the bytes_used
    value will be accurate. This fixed the transaction aborts seen with
    generic/333. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Instead of doing fs_info->fs_root in need_async_flush, which may not be set
    during recovery when mounting, just pass the root itself in, which makes more
    sense as thats what btrfs_calc_reclaim_metadata_size takes.

    Signed-off-by: Josef Bacik
    Reported-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We do this check when we start the async reclaimer thread, might as well check
    before we kick it off to save us some cycles. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
    isn't quite right because we will go through and free that extent later when we
    unpin, so it messes up apps that are accounting for the reservation space. We
    were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
    only actually free the reservation instead of pinning the extent. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • When tracing enospc problems on a box with multiple file systems mounted I need
    to be able to differentiate between the two file systems. Most of the important
    trace points I'm looking at already have an fsid, but the reserved extent trace
    points do not, so add that to make it possible to figure out which trace point
    belongs to which file system. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We want to track when we're triggering flushing from our reservation code and
    what flushing is being done when we start flushing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We can sometimes drop the reservation we had for our inode, so we need to remove
    that amount from to_reserve so that our tracepoint reports a valid amount of
    space.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Pinned extents are an important metric to keep track of for enospc.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Our enospc flushing sucks. It is born from a time where we were early
    enospc'ing constantly because multiple threads would race in for the same
    reservation and randomly starve other ones out. So I came up with this solution
    to block any other reservations from happening while one guy tried to flush
    stuff to satisfy his reservation. This gives us pretty good correctness, but
    completely crap latency.

    The solution I've come up with is ticketed reservations. Basically we try to
    make our reservation, and if we can't we put a ticket on a list in order and
    kick off an async flusher thread. This async flusher thread does the same old
    flushing we always did, just asynchronously. As space is freed and added back
    to the space_info it checks and sees if we have any tickets that need
    satisfying, and adds space to the tickets and wakes up anything we've satisfied.

    Once the flusher thread stops making progress it wakes up all the current
    tickets and tells them to take a hike.

    There is a priority list for things that can't flush, since the async flusher
    could do anything we need to avoid deadlocks. These guys get priority for
    having their reservation made, and will still do manual flushing themselves in
    case the async flusher isn't running.

    This patch gives us significantly better latencies. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • I'm writing a tool to visualize the enospc system inside btrfs, I need this
    tracepoint in order to keep track of the block groups in the system. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • These were hidden behind enospc_debug, which isn't helpful as they indicate
    actual bugs, unlike the rest of the enospc_debug stuff which is really debug
    information. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We reserve space for the inode update when we first reserve space for writing to
    a file. However there are lots of ways that we can use this reservation and not
    have it for subsequent ordered extents. Previously we'd fall through and try to
    reserve metadata bytes for this, then we'd just steal the full reservation from
    the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
    reservation from the global reserve. The problem with this is we can easily
    just return ENOSPC and fallback to updating the inode item directly. In the
    worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
    fall all the way through, however if we just fallback and update the inode
    directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

    We would have also just added the extent item for the inode so we likely will
    have already cow'ed down most of the way to the leaf containing the inode item,
    so we are more often than not only need one or two nodesize's worth of
    reservations. Given the reservation for the extent itself is also a worst case
    we will likely already have space to cover the inode update.

    This change will make us behave better in the theoretical worst case, and much
    better in the case that we don't have our reservation and cannot reserve more
    metadata. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • There are a few races in the metadata reservation stuff. First we add the bytes
    to the block_rsv well after we've set the bit on the inode saying that we have
    space for it and after we've reserved the bytes. So use the normal
    btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc
    extents when we try to reserve space for our write, which means that we could
    have used up the space for the inode and we wouldn't know because we only check
    before the reservation. So instead make sure we are always reserving space for
    the inode update, and then if we don't need it release those bytes afterward.
    Thanks,

    Signed-off-by: Josef Bacik
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Josef Bacik
     
  • So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
    Not only this but it unconditionally changes the size of the block_rsv. This
    isn't a bug strictly speaking, but it makes truncate block rsv's look funny
    because every time we migrate bytes over its size grows, even though we only
    want it to be a specific size. So collapse this into one function that takes an
    update_size argument and make truncate and evict not update the size for
    consistency sake. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • For some reason we're adding bytes_readonly to the space info after we update
    the space info with the block group info. This creates a tiny race where we
    could over-reserve space because we haven't yet taken out the bytes_readonly
    bit. Since we already know this information at the time we call
    update_space_info, just pass it along so it can be updated all at once. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

04 Jul, 2016

3 commits


03 Jul, 2016

4 commits

  • overlay needs underlying fs to support d_type. Recently I put in a
    patch in to detect this condition and started failing mount if
    underlying fs did not support d_type.

    But this breaks existing configurations over kernel upgrade. Those who
    are running docker (partially broken configuration) with xfs not
    supporting d_type, are surprised that after kernel upgrade docker does
    not run anymore.

    https://github.com/docker/docker/issues/22937#issuecomment-229881315

    So instead of erroring out, detect broken configuration and warn
    about it. This should allow existing docker setups to continue
    working after kernel upgrade.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi
    Fixes: 45aebeaf4f67 ("ovl: Ensure upper filesystem supports d_type")
    Cc: 4.6

    Vivek Goyal
     
  • Pull MIPS fix from Ralf Baechle:
    "Only a single fix for 4.7 pending at this point. It fixes an issue
    that may lead to corruption of the cache mode bits in the page table"

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
    MIPS: Fix possible corruption of cache mode by mprotect.

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:

    - tm: Always reclaim in start_thread() for exec() class syscalls from
    Cyril Bur

    - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael
    Neuling

    - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan

    - Initialise pci_io_base as early as possible from Darren Stevens

    * tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc: Initialise pci_io_base as early as possible
    powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()
    powerpc/tm: Always reclaim in start_thread() for exec() class syscalls

    Linus Torvalds
     
  • Pull drm fixes frlm Dave Airlie:
    "Just some AMD and Intel fixes, the AMD ones are further production
    Polaris fixes, and the Intel ones fix some early timeouts, some PCI ID
    changes and a couple of other fixes.

    Still a bit Internet challenged here, hopefully end of next week will
    solve it"

    * tag 'drm-fixes-for-v4.7-rc6' of git://people.freedesktop.org/~airlied/linux:
    drm/i915: Fix missing unlock on error in i915_ppgtt_info()
    drm/amd/powerplay: workaround for UVD clock issue
    drm/amdgpu: add ACLK_CNTL setting for polaris10
    drm/amd/powerplay: fix issue uvd dpm can't enabled on Polaris11.
    drm/amd/powerplay: Workaround for Memory EDC Error on Polaris10.
    drm/i915: Removing PCI IDs that are no longer listed as Kabylake.
    drm/i915: Add more Kabylake PCI IDs.
    drm/i915: Avoid early timeout during AUX transfers
    drm/i915/hsw: Avoid early timeout during LCPLL disable/restore
    drm/i915/lpt: Avoid early timeout during FDI PHY reset
    drm/i915/bxt: Avoid early timeout during PLL enable
    drm/i915: Refresh cached DP port register value on resume
    drm/amd/powerplay: Update CKS on/ CKS off voltage offset calculation
    drm/amd/powerplay: disable FFC.
    drm/amd/powerplay: add some definition for FFC feature on polaris.

    Linus Torvalds