30 Sep, 2015

25 commits

  • commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

    Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes logic in
    jbd2_log_do_checkpoint() bail out, which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    progress in cleaning the journal. Without that progress,
    jbd2_journal_destroy() just loops forever.

    Fix jbd2_journal_destroy() to clean up the journal checkpoint lists
    itself when jbd2_log_do_checkpoint() fails with an error.
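
    The shape of the fix is roughly as follows (a simplified sketch, not
    the upstream diff; discard_checkpoint_lists() is a hypothetical helper
    standing in for the checkpoint-list cleanup the real patch performs):

    /* jbd2_journal_destroy(): checkpoint loop.  If checkpointing cannot
     * make progress (e.g. the journal is aborted), drop the checkpoint
     * lists by hand so the loop terminates instead of spinning forever. */
    spin_lock(&journal->j_list_lock);
    while (journal->j_checkpoint_transactions != NULL) {
            int err;

            spin_unlock(&journal->j_list_lock);
            mutex_lock(&journal->j_checkpoint_mutex);
            err = jbd2_log_do_checkpoint(journal);
            mutex_unlock(&journal->j_checkpoint_mutex);
            if (err)
                    discard_checkpoint_lists(journal);  /* hypothetical helper */
            spin_lock(&journal->j_list_lock);
    }
    spin_unlock(&journal->j_list_lock);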

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream.

    Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would get the hfsplus driver all confused and complaining about
    B-tree inconsistencies and generate a "BUG: Bad page state". Most
    recently, I can generate this problem on up-to-date Fedora 22 with shipped
    kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
    mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
    lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
    /dev/mapper/fedora-root    3276800  551665    2725135   17% /
    /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
    /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was an REF_PAGES conditional (and "//" commented out code)
    to switch between B-node-centric paging and inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.
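
    In code terms the corrected page lifetime looks roughly like this (a
    simplified sketch of the idea, not the exact driver source; error
    handling trimmed):

    /* __hfs_bnode_create(): keep the reference taken by
     * read_mapping_page() for the life of the bnode... */
    for (i = 0; i < tree->pages_per_bnode; i++) {
            page = read_mapping_page(mapping, block++, NULL);
            if (IS_ERR(page))
                    goto fail;
            /* no page_cache_release(page) here any more */
            node->page[i] = page;
    }

    /* ...and only drop it in hfs_bnode_free(): */
    void hfs_bnode_free(struct hfs_bnode *node)
    {
            int i;

            for (i = 0; i < node->tree->pages_per_bnode; i++)
                    if (node->page[i])
                            page_cache_release(node->page[i]);
            kfree(node);
    }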

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream.

    Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This is the same change, applied to the corresponding hfs B-tree code,
    as Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at
    position 0", to keep similar code paths in the hfs and hfsplus drivers
    in sync where appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit 5556e7e6d30e8e9b5ee51b0e5edd526ee80e5e36 upstream.

    Consider eCryptfs dcache entries to be stale when the corresponding
    lower inode's i_nlink count is zero. This solves a problem caused by the
    lower inode being directly modified, without going through the eCryptfs
    mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's
    i_nlink count not being cleared.
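
    A minimal sketch of the revalidation check (simplified; the real
    ecryptfs_d_revalidate() does more than this):

    static int ecryptfs_d_revalidate(struct dentry *dentry, unsigned int flags)
    {
            int rc = 1;

            /* existing lower-dentry revalidation elided */
            if (d_really_is_positive(dentry)) {
                    struct inode *lower = ecryptfs_inode_to_lower(d_inode(dentry));

                    /* lower file was unlinked behind our back: stale */
                    if (lower->i_nlink == 0)
                            rc = 0;
            }
            return rc;
    }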

    Signed-off-by: Tyler Hicks
    Reported-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 40f705a736eac10e7dca7ab5dd5ed675a6df031d upstream.

    On a filesystem like vfat, all files are created with the same owner
    and mode, independent of who created the file. When a vfat filesystem
    is mounted with root as the owner of all files and read access for
    everyone, root's processes leave world-readable coredumps on it (other
    users' processes only leave empty corefiles when given write access,
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.
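
    The check amounts to something like the following (a hedged sketch of
    the idea; the exact mask and helpers used upstream may differ):

    /* After opening the corefile, verify the filesystem actually gave us
     * an owner-only file; on vfat the requested 0600 mode is ignored. */
    struct inode *inode = file_inode(cprm.file);

    if (!uid_eq(inode->i_uid, current_fsuid()))
            goto close_fail;                  /* owned by someone else */
    if (inode->i_mode & (S_IRWXG | S_IRWXO))
            goto close_fail;                  /* group/other can access it */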

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit fbb1816942c04429e85dbf4c1a080accc534299e upstream.

    It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
      RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
      attacker-writable directory D, either because the core_pattern is
      relative and the victim's cwd is attacker-writable or because an
      absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
      A: on a system with protected_hardlinks=0:
         execute access to a folder containing a victim-owned,
         attacker-readable file on the same partition as D, and the
         victim-owned file will be deleted before the main part of the
         attack takes place. (In practice, there are lots of files that
         fulfill this condition, e.g. entries in Debian's
         /var/lib/dpkg/info/.) This does not apply to most Linux systems
         because most distros set protected_hardlinks=1.
      B: on a system with protected_hardlinks=1:
         execute access to a folder containing a victim-owned,
         attacker-readable and attacker-writable file on the same partition
         as D, and the victim-owned file will be deleted before the main
         part of the attack takes place. (This seems to be uncommon.)
      C: on any system, independent of protected_hardlinks:
         write access to a non-sticky folder containing a victim-owned,
         attacker-readable file on the same partition as D.
         (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)
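
    The core of the fix is to refuse pre-existing files when opening the
    corefile, along these lines (simplified sketch; identifiers
    approximate those in do_coredump()):

    /* O_CREAT | O_EXCL makes the open fail if the path already exists, so
     * a file planted via rename()/link() can never receive the dump. */
    int open_flags = O_CREAT | O_EXCL | O_RDWR | O_NOFOLLOW | O_LARGEFILE;

    cprm.file = filp_open(cn.corename, open_flags, 0600);
    if (IS_ERR(cprm.file))
            goto fail_unlock;   /* refuse to dump rather than overwrite */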

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit 36319608e28701c07cad80ae3be8b0fdfb1ab40f upstream.

    This reverts commit 4e379d36c050b0117b5d10048be63a44f5036115.

    This commit opens up a race between the recovery code and the open code.

    Reported-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4a1e2feb9d246775dee0f78ed5b18826bae2b1c5 upstream.

    According to RFC5661 Section 18.2.4, CLOSE is supposed to return
    the zero stateid. This means that nfs_clear_open_stateid_locked()
    cannot assume that the result stateid will always match the 'other'
    field of the existing open stateid when trying to determine a race
    with a parallel OPEN.

    Instead, we look at the argument, and check for matches.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit d13549074cf066d6d5bb29903d044beffea342d3 upstream.

    According to the flexfiles protocol, the layoutreturn should specify an
    array of errors in the following format:

    struct ff_ioerr4 {
            offset4        ffie_offset;
            length4        ffie_length;
            stateid4       ffie_stateid;
            device_error4  ffie_errors<>;
    };

    This patch fixes up the code to ensure that our ffie_errors is indeed
    encoded as an array (albeit with only a single entry).
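
    On the wire that means emitting an array length before the single
    error entry, along these lines (illustrative sketch;
    encode_device_error() is a stand-in name, not the driver's function):

    __be32 *p;

    p = xdr_reserve_space(xdr, 4);
    *p = cpu_to_be32(1);             /* ffie_errors<> array length: 1 */
    encode_device_error(xdr, err);   /* the single device_error4 entry */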

    Reported-by: Tom Haynes
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 5420401079e152ff68a8024f6a375804b1c21505 upstream.

    We do not want to update inode attributes with DS values.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit aaae3f00d3f67f681a1f3cb7af999e976e8a24ce upstream.

    If the ctime, mtime or change attribute has changed because
    of an operation we initiated, we should make sure that we force
    an attribute update. However we do not want to mark the page cache
    for revalidation.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 69f230d907e8c1ca3f9bd528993eeb98f712b0dd upstream.

    Otherwise we break fstest case tests/read_write/mctime.t

    Does files layout need the same fix as well?

    Cc: Anna Schumaker
    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit e9ae58aeee8842a50f7e199d602a5ccb2e41a95f upstream.

    We should ensure that we always set the pgio_header's error field
    if a READ or WRITE RPC call returns an error. The current code depends
    on 'hdr->good_bytes' always being initialised to a large value, which
    is not always done correctly by callers.
    When this happens, applications may end up missing important errors.
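
    The idea is to record the failure on the header directly instead of
    relying on good_bytes, roughly (a sketch, not the exact layout-driver
    diff):

    if (task->tk_status < 0)
            /* mark the request as failed from this offset onwards */
            nfs_set_pgio_error(hdr, task->tk_status, hdr->args.offset);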

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 18e3b739fdc826481c6a1335ce0c5b19b3d415da upstream.

    --- Steps to Reproduce ---

    # cat /etc/exports
    /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt)
    /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)

    # mount -t nfs nfs-server:/nfs/ /mnt/
    # ll /mnt/*/

    # cat /etc/exports
    /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt,refer=/nfs/old/@nfs-server)
    /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)
    # service nfs restart

    # ll /mnt/*/ --->>>>> oops here

    [ 5123.102925] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 5123.103363] IP: [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.103752] PGD 587b9067 PUD 3cbf5067 PMD 0
    [ 5123.104131] Oops: 0000 [#1]
    [ 5123.104529] Modules linked in: nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon parport_pc parport i2c_piix4 shpchp auth_rpcgss nfs_acl vmw_vmci lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi serio_raw scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
    [ 5123.105887] CPU: 0 PID: 15853 Comm: ::1-manager Tainted: G OE 4.2.0-rc6+ #214
    [ 5123.106358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 5123.106860] task: ffff88007620f300 ti: ffff88005877c000 task.ti: ffff88005877c000
    [ 5123.107363] RIP: 0010:[] [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.107909] RSP: 0018:ffff88005877fdb8 EFLAGS: 00010246
    [ 5123.108435] RAX: ffff880053f3bc00 RBX: ffff88006ce6c908 RCX: ffff880053a0d240
    [ 5123.108968] RDX: ffffea0000e6d940 RSI: ffff8800399a0000 RDI: ffff88006ce6c908
    [ 5123.109503] RBP: ffff88005877fe28 R08: ffffffff81c708a0 R09: 0000000000000000
    [ 5123.110045] R10: 00000000000001a2 R11: ffff88003ba7f5c8 R12: ffff880054c55800
    [ 5123.110618] R13: 0000000000000000 R14: ffff880053a0d240 R15: ffff880053a0d240
    [ 5123.111169] FS: 0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
    [ 5123.111726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 5123.112286] CR2: 0000000000000000 CR3: 0000000054cac000 CR4: 00000000001406f0
    [ 5123.112888] Stack:
    [ 5123.113458] ffffea0000e6d940 ffff8800399a0000 00000000000167d0 0000000000000000
    [ 5123.114049] 0000000000000000 0000000000000000 0000000000000000 00000000a7ec82c6
    [ 5123.114662] ffff88005877fe18 ffffea0000e6d940 ffff8800399a0000 ffff880054c55800
    [ 5123.115264] Call Trace:
    [ 5123.115868] [] nfs4_try_migration+0xbb/0x220 [nfsv4]
    [ 5123.116487] [] nfs4_run_state_manager+0x4ab/0x7b0 [nfsv4]
    [ 5123.117104] [] ? nfs4_do_reclaim+0x510/0x510 [nfsv4]
    [ 5123.117813] [] kthread+0xd7/0xf0
    [ 5123.118456] [] ? kthread_worker_fn+0x160/0x160
    [ 5123.119108] [] ret_from_fork+0x3f/0x70
    [ 5123.119723] [] ? kthread_worker_fn+0x160/0x160
    [ 5123.120329] Code: 4c 8b 6a 58 74 17 eb 52 48 8d 55 a8 89 c6 4c 89 e7 e8 4a b5 ff ff 8b 45 b0 85 c0 74 1c 4c 89 f9 48 8b 55 90 48 8b 75 98 48 89 df ff 55 00 3d e8 d8 ff ff 41 89 c6 74 cf 48 8b 4d c8 65 48 33
    [ 5123.121643] RIP [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.122308] RSP
    [ 5123.122942] CR2: 0000000000000000

    Fixes: ec011fe847 ("NFS: Introduce a vector of migration recovery ops")
    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 6f536936b79bd4b5cea8fb0e5b8b0bce8cd1ea4a upstream.

    - Switch back to using list_for_each_entry(). Fixes an incorrect test
    for list NULL termination.
    - Do not assume that lists are sorted.
    - Finally, consider an existing entry to match if it consists of a subset
    of the addresses in the new entry.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 7c2dad99d60c86ec686b3bfdcb787c450a7ea89f upstream.

    Chuck reports seeing cases where a GETATTR that happens to race
    with an asynchronous WRITE is overriding the file size, despite
    the attribute barrier being set by the writeback code.

    The culprit turns out to be the check in nfs_ctime_need_update(),
    which sees that the ctime is newer than the cached ctime, and
    assumes that it is safe to override the attribute barrier.
    This patch removes that override, and ensures that attribute
    barriers are always respected.

    Reported-by: Chuck Lever
    Fixes: a08a8cd375db9 ("NFS: Add attribute update barriers to NFS writebacks")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit efcbc04e16dfa95fef76309f89710dd1d99a5453 upstream.

    It is unusual to combine the open flags O_RDONLY and O_EXCL, but
    it appears that LibreOffice does just that.

    [pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
    [pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL

    NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
    probably to reset the timestamps.

    When it was an O_RDONLY open, the SETATTR command does not
    identify any actual attributes to change.
    If no delegation was provided to the open, the SETATTR uses the
    all-zeros stateid and the request is accepted (at least by the
    Linux NFS server - no harm, no foul).

    If a read-delegation was provided, this is used in the SETATTR
    request, and a Netapp filer will justifiably claim
    NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
    to retry - indefinitely.

    So only treat O_EXCL specially if O_CREAT was also given.
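
    The guard boils down to a combined flag test, roughly (sketch only;
    the helper name is approximate):

    /* Only an exclusive *create* should trigger the timestamp-resetting
     * SETATTR; a plain O_RDONLY|O_EXCL open must not. */
    if ((flags & O_CREAT) && (flags & O_EXCL))
            nfs4_exclusive_attrset(opendata, sattr);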

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream.

    It's possible that a DELEGRETURN could race with (e.g.) client expiry,
    in which case we could end up putting the delegation hash reference more
    than once.

    Have unhash_delegation_locked return a bool that indicates whether it
    was already unhashed. In the case of destroy_delegation we only
    conditionally put the hash reference if that returns true.

    The other callers of unhash_delegation_locked call it while walking
    list_heads that shouldn't yet be detached. If we find that it doesn't
    return true in those cases, then throw a WARN_ON as that indicates that
    we have a partially hashed delegation, and that something is likely very
    wrong.

    Tested-by: Andrew W Elble
    Tested-by: Anna Schumaker
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit e85687393f3ee0a77ccca016f903d1558bb69258 upstream.

    When an open or lock stateid is hashed, we take an extra reference to
    it. When we unhash it, we drop that reference. The code however does
    not properly account for the case where we have two callers concurrently
    trying to unhash the stateid. This can lead to list corruption and the
    hash reference being put more than once.

    Fix this by having unhash_ol_stateid use list_del_init on the st_perfile
    list_head, and then testing to see if that list_head is empty before
    releasing the hash reference. This means that some of the unhashing
    wrappers now become bool return functions so we can test to see whether
    the stateid was unhashed before we put the reference.
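
    The pattern is to make unhashing idempotent and let only the caller
    that actually removed the entry drop the hash reference, roughly (a
    simplified sketch of the idea; locking omitted):

    static bool unhash_ol_stateid(struct nfs4_ol_stateid *stp)
    {
            bool unhashed = !list_empty(&stp->st_perfile);

            /* list_del_init() is safe to repeat on an already-empty head */
            list_del_init(&stp->st_perfile);
            list_del_init(&stp->st_perstateowner);
            return unhashed;
    }

    /* caller: */
    if (unhash_ol_stateid(stp))
            nfs4_put_stid(&stp->st_stid);   /* hash reference put exactly once */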

    Reported-by: Andrew W Elble
    Tested-by: Andrew W Elble
    Reported-by: Anna Schumaker
    Tested-by: Anna Schumaker
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 6896f15aabde505b35888039af93d1d182a0108a upstream.

    Currently we'll respond correctly to a request for either
    FS_LAYOUT_TYPES or LAYOUT_TYPES, but not to a request for both
    attributes simultaneously.

    Signed-off-by: Kinglong Mee
    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 2b83d3de4c18af49800e0b26ae013db4fcf43a4a upstream.

    pNFS writes don't return attributes, however that doesn't mean that we
    should ignore the fact that they may be extending the file. This patch
    ensures that if a write is seen to extend the file, then we always set
    an attribute barrier, and update the cached file size.

    Signed-off-by: Trond Myklebust
    Cc: Peng Tao
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream.

    While we are committing a transaction, it's possible the previous one is
    still finishing its commit and therefore we wait for it to finish first.
    However we were not checking if that previous transaction ended up getting
    aborted after we waited for it to commit, so we ended up committing the
    current transaction which can lead to fs corruption because the new
    superblock can point to trees that have had one or more nodes/leaves that
    were never durably persisted.
    The following sequence diagram exemplifies how this is possible:

    CPU 0                                     CPU 1

    transaction N starts

    (...)

    btrfs_commit_transaction(N)

      cur_trans->state = TRANS_STATE_COMMIT_START;
      (...)
      cur_trans->state = TRANS_STATE_COMMIT_DOING;
      (...)

      cur_trans->state = TRANS_STATE_UNBLOCKED;
      root->fs_info->running_transaction = NULL;

                                              btrfs_start_transaction()
                                                --> starts transaction N + 1

      btrfs_write_and_wait_transaction(trans, root);
        --> starts writing all new or COWed ebs created
            at transaction N

                                              creates some new ebs, COWs some
                                              existing ebs but doesn't COW or
                                              deletes eb X

                                              btrfs_commit_transaction(N + 1)
                                                (...)
                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                (...)
                                                wait_for_commit(root, prev_trans);
                                                  --> prev_trans == transaction N

      btrfs_write_and_wait_transaction() continues
      writing ebs
        --> fails writing eb X, we abort transaction N
            and set bit BTRFS_FS_STATE_ERROR on
            fs_info->fs_state, so no new transactions
            can start after setting that bit

      cleanup_transaction()
        btrfs_cleanup_one_transaction()
          wakes up task at CPU 1

                                              continues, doesn't abort because
                                              cur_trans->aborted (transaction N + 1)
                                              is zero, and no checks for bit
                                              BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                              are made

                                              btrfs_write_and_wait_transaction(trans, root);
                                                --> succeeds, no errors during writeback

                                              write_ctree_super(trans, root, 0);
                                                --> succeeds
                                                --> we have now a superblock that points us
                                                    to some root that uses eb X, which was
                                                    never written to disk

    In this scenario future attempts to read eb X from disk results in an
    error message like "parent transid verify failed on X wanted Y found Z".

    So fix this by aborting the current transaction if after waiting for the
    previous transaction we verify that it was aborted.
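
    In code, the fix is essentially one extra check after the wait (a
    simplified excerpt-style sketch of btrfs_commit_transaction()):

    if (prev_trans->state != TRANS_STATE_COMPLETED) {
            atomic_inc(&prev_trans->use_count);
            spin_unlock(&root->fs_info->trans_lock);

            wait_for_commit(root, prev_trans);
            /* new: did the transaction we waited for get aborted? */
            ret = prev_trans->aborted;

            btrfs_put_transaction(prev_trans);
            if (ret)
                    goto cleanup_transaction;
    }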

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 4c17a6d56bb0cad3066a714e94f7185a24b40f49 upstream.

    This might lead to local privilege escalation (code execution as
    kernel) for systems where the following conditions are met:

    - CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled
    - a cifs filesystem is mounted where:
      - the mount option "vers" was used and set to a value >= 2.0
      - the attacker has write access to at least one file on the filesystem

    To attack this, an attacker would have to guess the target_tcon
    pointer (but guessing wrong doesn't cause a crash, it just returns an
    error code) and win a narrow race.

    Signed-off-by: Jann Horn
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7 upstream.

    This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0.

    Unfortunately we still need to test for bdi->dev to avoid a crash when a
    USB stick is yanked out while a file system is mounted:

    usb 2-2: USB disconnect, device number 2
    Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
    JBD2: Error -5 detected when updating journal superblock for sdb1-8.
    BUG: unable to handle kernel paging request at 34beb000
    IP: [] __percpu_counter_add+0x18/0xc0
    *pdpt = 0000000023db9001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435
    Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
    task: ebf06b50 ti: ebebc000 task.ti: ebebc000
    EIP: 0060:[] EFLAGS: 00010082 CPU: 0
    EIP is at __percpu_counter_add+0x18/0xc0
    EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
    ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
    Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
    Call Trace:
    [] account_page_dirtied+0x74/0x110
    [] __set_page_dirty+0x3f/0xb0
    [] mark_buffer_dirty+0x53/0xc0
    [] ext4_commit_super+0x17b/0x250
    [] ext4_put_super+0xc1/0x320
    [] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [] ? evict_inodes+0xca/0xe0
    [] generic_shutdown_super+0x6a/0xe0
    [] ? prepare_to_wait_event+0xd0/0xd0
    [] ? unregister_shrinker+0x40/0x50
    [] kill_block_super+0x26/0x70
    [] deactivate_locked_super+0x45/0x80
    [] deactivate_super+0x47/0x60
    [] cleanup_mnt+0x39/0x80
    [] __cleanup_mnt+0x10/0x20
    [] task_work_run+0x91/0xd0
    [] do_notify_resume+0x7c/0x90
    [] work_notify
    Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
    EIP: [] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
    CR2: 0000000034beb000
    ---[ end trace dd564a7bea834ecd ]---

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit c642dc9e1aaed953597e7092d7df329e6234096e upstream.

    At some point along this sequence of changes:

    f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
    bb04457 ext4: support freezing ext2 (nojournal) file systems
    9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

    ext4 started setting needs_recovery on filesystems without journals
    when they are unfrozen. This makes no sense, and in fact confuses
    blkid to the point where it doesn't recognize the filesystem at all.

    (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
    see needs_recovery set on fs w/ no journal).

    To fix this, don't manipulate the INCOMPAT_RECOVER feature on
    filesystems without journals.
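
    The change boils down to guarding the feature-flag manipulation on the
    presence of a journal, roughly like this (a sketch, assuming the
    change lands in ext4_unfreeze()):

    static int ext4_unfreeze(struct super_block *sb)
    {
            if (sb->s_flags & MS_RDONLY)
                    return 0;

            if (EXT4_SB(sb)->s_journal) {
                    /* Reset the needs_recovery flag before the fs is unlocked. */
                    EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
            }

            ext4_commit_super(sb, 1);
            return 0;
    }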

    Reported-by: Stu Mark
    Reviewed-by: Jan Kara
    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     

22 Sep, 2015

6 commits

  • commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

    Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
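
    Usage in a ->show_options handler looks roughly like this (filesystem
    and field names are illustrative; the helper escapes the option value
    before it reaches /proc/mounts):

    static int example_show_options(struct seq_file *m, struct dentry *root)
    {
            struct example_fs_info *fsi = root->d_sb->s_fs_info;  /* illustrative */

            /* values containing spaces/newlines are escaped */
            seq_show_option(m, "lowerdir", fsi->lowerdir);
            seq_show_option(m, "upperdir", fsi->upperdir);
            return 0;
    }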

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit f49a26e7718dd30b49e3541e3e25aecf5e7294e2 upstream.

    Update ctime and mtime when a directory is modified. (though OS/2 doesn't
    update them anyway)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 4b75de8615050c1b0dd8d7794838c42f74ed36ba upstream.

    Before the make_empty_dir_inode calls were introduced into proc, sysfs,
    and sysctl, those directories reported an i_size of 0 when stat()ed.
    make_empty_dir_inode started reporting an i_size of 2. At least one
    userspace application depended on stat returning an i_size of 0, so
    modify make_empty_dir_inode to cause an i_size of 0 to be reported for
    these directories.

    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 037542345a82aaaa228ec280fe6ddff1568d169f upstream.

    Users have occasionally reported that the file type of some directory
    entries is wrong. This mostly happened after updating some libraries.
    After some debugging the problem was traced down to
    xfs_dir2_node_replace(). The function uses args->filetype as the file
    type to store in the replaced directory entry, however it also calls
    xfs_da3_node_lookup_int(), which stores the file type of the current
    directory entry in args->filetype. Thus we fail to change the file type
    of the directory entry to the proper type.

    Fix the problem by storing new file type in a local variable before
    calling xfs_da3_node_lookup_int().
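
    The fix pattern is simply to latch the requested type before the
    lookup can clobber it (schematic sketch; local names illustrative):

    int ftype = args->filetype;   /* the type we were asked to store */

    /* the lookup overwrites args->filetype with the current on-disk type */
    error = xfs_da3_node_lookup_int(state, &rval);
    if (error)
            goto done;

    /* ... locate the data entry to replace ... */
    /* when rewriting the entry, use the saved 'ftype', not args->filetype */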

    Reported-by: Giacomo Comes
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit ffeecc5213024ae663377b442eedcfbacf6d0c5d upstream.

    struct xfs_attr_leafblock contains an 'entries' array which is declared
    with size 1 although it can in fact contain many more entries. Since
    this array is followed by further struct members, gcc (at least in
    version 4.8.3) thinks that the array has a fixed size of 1 element and
    thus may optimize away all accesses beyond the end of the array,
    resulting in non-working code. This problem was only observed with
    userspace code in xfsprogs, however it's better to be safe in the
    kernel as well and have matching kernel and xfsprogs definitions.
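
    A generic illustration of the hazard (plain C, not the actual xfs
    definitions):

    struct entry   { unsigned int hashval; unsigned short nameidx; };
    struct trailer { unsigned int pad; };

    struct leaf_bad {
            struct entry   entries[1];  /* really variable length... */
            struct trailer t;           /* ...but a member follows, so the
                                         * compiler may treat entries[i > 0]
                                         * as undefined and optimize those
                                         * accesses away */
    };

    /* Safer: keep nothing after the variable-sized region and compute the
     * entry pointer from the start of the on-disk block instead. */
    static inline struct entry *leaf_entries(void *block, size_t hdr_size)
    {
            return (struct entry *)((char *)block + hdr_size);
    }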

    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 2f123bce18943fff819bc10f8868ffb9149fc622 upstream.

    In the dir3 data block readahead function, use the regular read
    verifier to check the block's CRC and spot-check the block contents
    instead of directly calling only the spot-checking routine. This
    prevents corrupted directory data blocks from being read into the
    kernel, which can lead to garbage ls output and directory loops (if
    say one of the entries contains slashes and other junk).

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

17 Aug, 2015

7 commits

  • commit 8fcd461db7c09337b6d2e22d25eb411123f379e3 upstream.

    Currently, preprocess_stateid_op calls nfs4_check_olstateid which
    verifies that the open stateid corresponds to the current filehandle in the
    call by calling nfs4_check_fh.

    If the stateid is a NFS4_DELEG_STID however, then no such check is done.
    This could cause incorrect enforcement of permissions, because the
    nfsd_permission() call in nfs4_check_file uses the current
    filehandle, but any subsequent IO operation will use the file descriptor
    in the stateid.

    Move the call to nfs4_check_fh into nfs4_check_file instead so that it
    can be done for all stateid types.

    Signed-off-by: Jeff Layton
    [bfields: moved fh check to avoid NULL deref in special stateid case]
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit a0649b2d3fffb1cde8745568c767f3a55a3462bc upstream.

    Split out two self contained helpers to make the function more readable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Cc: Jeff Layton
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 3ead7c52bdb0ab44f4bb1feed505a8323cc12ba7 upstream.

    This function may copy the si_addr_lsb field to user mode when it hasn't
    been initialized, which can leak kernel stack data to user mode.

    Just checking the value of si_code is insufficient because the same
    si_code value is shared between multiple signals. This is solved by
    checking the value of si_signo in addition to si_code.
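
    The check ends up looking roughly like this (sketch based on the
    description above; variable names are illustrative, and the
    memory-failure si_code values shown are the ones that define
    si_addr_lsb):

    /*
     * Other signals can share these si_code values, so key off si_signo
     * too before copying the (possibly uninitialised) si_addr_lsb field.
     */
    if (from->si_signo == SIGBUS &&
        (from->si_code == BUS_MCEERR_AR || from->si_code == BUS_MCEERR_AO))
            err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);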

    Signed-off-by: Amanieu d'Antras
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     
  • commit c2227a39a078473115910512aa0f8d53bd915e60 upstream.

    On an absent filesystem (one served by another server), we need to be
    able to handle requests for certain attributes (like fs_locations, so
    the client can find out which server does have the filesystem), but
    others we can't.

    We forgot to take that into account when adding another attribute
    bitmask word for the SECURITY_LABEL attribute.

    Then an export entry with the "refer" option can result in:

    [ 88.414272] kernel BUG at fs/nfsd/nfs4xdr.c:2249!
    [ 88.414828] invalid opcode: 0000 [#1] SMP
    [ 88.415368] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iosf_mbi ppdev btrfs coretemp crct10dif_pclmul crc32_pclmul crc32c_intel xor ghash_clmulni_intel raid6_pq vmw_balloon parport_pc parport i2c_piix4 shpchp vmw_vmci acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi mptscsih serio_raw mptbase e1000 scsi_transport_spi ata_generic pata_acpi [last unloaded: nfsd]
    [ 88.417827] CPU: 0 PID: 2116 Comm: nfsd Not tainted 4.0.7-300.fc22.x86_64 #1
    [ 88.418448] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 88.419093] task: ffff880079146d50 ti: ffff8800785d8000 task.ti: ffff8800785d8000
    [ 88.419729] RIP: 0010:[] [] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
    [ 88.420376] RSP: 0000:ffff8800785db998 EFLAGS: 00010206
    [ 88.421027] RAX: 0000000000000001 RBX: 000000000018091a RCX: ffff88006668b980
    [ 88.421676] RDX: 00000000fffef7fc RSI: 0000000000000000 RDI: ffff880078d05000
    [ 88.422315] RBP: ffff8800785dbb58 R08: ffff880078d043f8 R09: ffff880078d4a000
    [ 88.422968] R10: 0000000000010000 R11: 0000000000000002 R12: 0000000000b0a23a
    [ 88.423612] R13: ffff880078d05000 R14: ffff880078683100 R15: ffff88006668b980
    [ 88.424295] FS: 0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
    [ 88.424944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 88.425597] CR2: 00007f40bc370f90 CR3: 0000000035af5000 CR4: 00000000001407f0
    [ 88.426285] Stack:
    [ 88.426921] ffff8800785dbaa8 ffffffffa049e4af ffff8800785dba08 ffffffff813298f0
    [ 88.427585] ffff880078683300 ffff8800769b0de8 0000089d00000001 0000000087f805e0
    [ 88.428228] ffff880000000000 ffff880079434a00 0000000000000000 ffff88006668b980
    [ 88.428877] Call Trace:
    [ 88.429527] [] ? exp_get_by_name+0x7f/0xb0 [nfsd]
    [ 88.430168] [] ? inode_doinit_with_dentry+0x210/0x6a0
    [ 88.430807] [] ? d_lookup+0x2e/0x60
    [ 88.431449] [] ? dput+0x33/0x230
    [ 88.432097] [] ? mntput+0x24/0x40
    [ 88.432719] [] ? path_put+0x22/0x30
    [ 88.433340] [] ? nfsd_cross_mnt+0xb7/0x1c0 [nfsd]
    [ 88.433954] [] nfsd4_encode_dirent+0x1b0/0x3d0 [nfsd]
    [ 88.434601] [] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
    [ 88.435172] [] nfsd_readdir+0x1c1/0x2a0 [nfsd]
    [ 88.435710] [] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
    [ 88.436447] [] nfsd4_encode_readdir+0x120/0x220 [nfsd]
    [ 88.437011] [] nfsd4_encode_operation+0x7d/0x190 [nfsd]
    [ 88.437566] [] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
    [ 88.438157] [] nfsd_dispatch+0xc3/0x220 [nfsd]
    [ 88.438680] [] svc_process_common+0x43b/0x690 [sunrpc]
    [ 88.439192] [] svc_process+0x103/0x1b0 [sunrpc]
    [ 88.439694] [] nfsd+0x117/0x190 [nfsd]
    [ 88.440194] [] ? nfsd_destroy+0x90/0x90 [nfsd]
    [ 88.440697] [] kthread+0xd8/0xf0
    [ 88.441260] [] ? kthread_worker_fn+0x180/0x180
    [ 88.441762] [] ret_from_fork+0x58/0x90
    [ 88.442322] [] ? kthread_worker_fn+0x180/0x180
    [ 88.442879] Code: 0f 84 93 05 00 00 83 f8 ea c7 85 a0 fe ff ff 00 00 27 30 0f 84 ba fe ff ff 85 c0 0f 85 a5 fe ff ff e9 e3 f9 ff ff 0f 1f 44 00 00 0b 66 0f 1f 44 00 00 be 04 00 00 00 4c 89 ef 4c 89 8d 68 fe
    [ 88.444052] RIP [] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
    [ 88.444658] RSP
    [ 88.445232] ---[ end trace 6cb9d0487d94a29f ]---

    Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 32e5a2a2be6b085febaac36efff495ad65a55e6c upstream.

    When using a large volume, for example 9T volume with 2T already used,
    frequent creation of small files with O_DIRECT when the IO is not
    cluster aligned may clear sectors in the wrong place. This will cause
    filesystem corruption.

    This is because p_cpos is a u32. When calculating the corresponding
    sector it should be converted to u64 first, otherwise it may overflow.
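
    The arithmetic problem in isolation (field name from the description
    above, shift width illustrative):

    u32 p_cpos = 0x00200000;             /* cluster number on a huge volume */
    int clusters_to_sectors_shift = 11;  /* illustrative shift */

    /* 32-bit shift: high bits are silently lost */
    u64 wrong = p_cpos << clusters_to_sectors_shift;
    /* widen to u64 first, then shift */
    u64 right = (u64)p_cpos << clusters_to_sectors_shift;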

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joseph Qi
     
  • commit 209f7512d007980fd111a74a064d70a3656079cf upstream.

    The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
    ocfs2_downconvert_thread_do_work can be triggered in the following case:

    ocfs2dc first saves osb->blocked_lock_count to the local variable
    'processed', and then processes the dentry lockres. During the dentry
    put, it calls iput and then deletes rw, inode and open lockres from
    the blocked list in ocfs2_mark_lockres_freeing. This causes the
    variable 'processed' to no longer reflect the number of blocked
    lockres to be processed, which triggers the BUG.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joseph Qi
     
  • commit 8f2f3eb59dff4ec538de55f2e0592fec85966aab upstream.

    fsnotify_clear_marks_by_group_flags() can race with
    fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
    drops mark_mutex, a mark from the list iterated by
    fsnotify_clear_marks_by_group_flags() can be freed and thus the next
    entry pointer we have cached may become stale and we dereference free
    memory.

    Fix the problem by first moving marks to free to a special private list
    and then always free the first entry in the special list. This method
    is safe even when entries from the list can disappear once we drop the
    lock.
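
    The resulting pattern looks roughly like this (simplified sketch of
    fsnotify_clear_marks_by_group_flags(); details abbreviated):

    LIST_HEAD(to_free);
    struct fsnotify_mark *mark, *lmark;

    /* Pass 1: move the marks we intend to destroy onto a private list. */
    mutex_lock(&group->mark_mutex);
    list_for_each_entry_safe(mark, lmark, &group->marks_list, g_list)
            if (mark->flags & flags)
                    list_move(&mark->g_list, &to_free);
    mutex_unlock(&group->mark_mutex);

    /* Pass 2: always take the first entry of the private list; entries
     * may vanish from it while the mutex is dropped, but that is safe. */
    while (1) {
            mutex_lock(&group->mark_mutex);
            if (list_empty(&to_free)) {
                    mutex_unlock(&group->mark_mutex);
                    break;
            }
            mark = list_first_entry(&to_free, struct fsnotify_mark, g_list);
            fsnotify_get_mark(mark);
            fsnotify_destroy_mark_locked(mark, group);
            mutex_unlock(&group->mark_mutex);
            fsnotify_put_mark(mark);
    }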

    Signed-off-by: Jan Kara
    Reported-by: Ashish Sangwan
    Reviewed-by: Ashish Sangwan
    Cc: Lino Sanfilippo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

11 Aug, 2015

2 commits

  • commit df150ed102baa0e78c06e08e975dfb47147dd677 upstream.

    We don't log remote attribute contents, and instead write them
    synchronously before we commit the block allocation and attribute
    tree update transaction. As a result we are writing to the allocated
    space before the allocation has been made permanent.

    As a result, we cannot consider this allocation to be a metadata
    allocation. Metadata allocation can take blocks from the free list
    and so reuse them before the transaction that freed the block is
    committed to disk. This behaviour is perfectly fine for journalled
    metadata changes as log recovery will ensure the free operation is
    replayed before the overwrite, but for remote attribute writes this
    is not the case.

    Hence we have to consider the remote attribute blocks to contain
    data and allocate accordingly. We do this by dropping the
    XFS_BMAPI_METADATA flag from the block allocation. This means the
    allocation will not use blocks that are on the busy list without
    first ensuring that the freeing transaction has been committed to
    disk and the blocks removed from the busy list. This ensures we will
    never overwrite a freed block without first ensuring that it is
    really free.
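
    Concretely, the allocation stops passing XFS_BMAPI_METADATA, along
    these lines (abbreviated sketch of the xfs_bmapi_write() call in the
    remote attribute set path):

    /* before: XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA */
    error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno, blkcnt,
                            XFS_BMAPI_ATTRFORK,  /* data allocation rules */
                            args->firstblock, args->total, &map, &nmap,
                            args->flist);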

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit e3c32ee9e3e747fec01eb38e6610a9157d44c3ea upstream.

    In recent testing, a system that crashed failed log recovery on
    restart with a bad symlink buffer magic number:

    XFS (vda): Starting recovery (logdev: internal)
    XFS (vda): Bad symlink block magic!
    XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

    On examination of the log via xfs_logprint, none of the symlink
    buffers in the log had a bad magic number, nor were any other types
    of buffer log format headers mis-identified as symlink buffers.
    Tracing was used to find the buffer the kernel was tripping over,
    and xfs_db identified its contents as:

    000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
    020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
    040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
    060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
    .....

    This is a remote attribute buffer, which are notable in that they
    are not logged but are instead written synchronously by the remote
    attribute code so that they exist on disk before the attribute
    transactions are committed to the journal.

    The above remote attribute block has an invalid LSN in it - cycle
    0xd0020000, block 0 - which means when log recovery comes along to
    determine if the transaction that writes to the underlying block
    should be replayed, it sees a block that has a future LSN and so
    does not replay the buffer data in the transaction. Instead, it
    validates the buffer magic number and attaches the buffer verifier
    to it. It is this buffer magic number check that is failing in the
    above assert, indicating that we skipped replay due to the LSN of
    the underlying buffer.

    The problem here is that the remote attribute buffers cannot have a
    valid LSN placed into them, because the transaction that contains
    the attribute tree pointer changes and the block allocation that the
    attribute data is being written to hasn't yet been committed. Hence
    the LSN field in the attribute block is completely unwritten,
    thereby leaving the underlying contents of the block in the LSN
    field. It could have any value, and hence a future overwrite of the
    block by log recovery may or may not work correctly.

    Fix this by always writing an invalid LSN to the remote attribute
    block, so that any buffer in log recovery that needs to write over the
    remote attribute block always does so. We are protected from having
    old data
    written over the attribute by the fact that freeing the block before
    the remote attribute is written will result in the buffer being
    marked stale in the log and so all changes prior to the buffer stale
    transaction will be cancelled by log recovery.

    Hence it is safe to ignore the LSN in the case of synchronously
    written, unlogged metadata such as remote attribute blocks, and to
    ensure we do that correctly, we need to write an invalid LSN to all
    remote attribute blocks to trigger immediate recovery of metadata
    that is written over the top.

    As a further protection for filesystems that may already have remote
    attribute blocks with bad LSNs on disk, change the log recovery code
    to always trigger immediate recovery of metadata over remote
    attribute blocks.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner