15 Sep, 2018

1 commit

  • [ Upstream commit dc2572791d3a41bab94400af2b6bca9d71ccd303 ]

    hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
    hfs_brec_insert() may instead set it to an error-valued pointer. Fix
    this to prevent a crash.

    Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández
    Cc: Anatoly Trosinenko
    Cc: Viacheslav Dubeyko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ernesto A. Fernández
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

    sed -i -e 's/\/SB_RDONLY/g' \
    -e 's/\/SB_NOSUID/g' \
    -e 's/\/SB_NODEV/g' \
    -e 's/\/SB_NOEXEC/g' \
    -e 's/\/SB_SYNCHRONOUS/g' \
    -e 's/\/SB_MANDLOCK/g' \
    -e 's/\/SB_DIRSYNC/g' \
    -e 's/\/SB_NOATIME/g' \
    -e 's/\/SB_NODIRATIME/g' \
    -e 's/\/SB_SILENT/g' \
    -e 's/\/SB_POSIXACL/g' \
    -e 's/\/SB_KERNMOUNT/g' \
    -e 's/\/SB_I_VERSION/g' \
    -e 's/\/SB_LAZYTIME/g' \
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save a
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

01 Aug, 2017

1 commit

  • This patch converts most of the in-kernel filesystems that do writeback
    out of the pagecache to report errors using the errseq_t-based
    infrastructure that was recently added. This allows them to report
    errors once for each open file description.

    Most filesystems have a fairly straightforward fsync operation. They
    call filemap_write_and_wait_range to write back all of the data and
    wait on it, and then (sometimes) sync out the metadata.

    For those filesystems this is a straightforward conversion from calling
    filemap_write_and_wait_range in their fsync operation to calling
    file_write_and_wait_range.

    Acked-by: Jan Kara
    Acked-by: Dave Kleikamp
    Signed-off-by: Jeff Layton

    Jeff Layton
     

17 Jul, 2017

1 commit

  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
     

09 May, 2017

1 commit

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove this flag. This patch has no functionality changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Mar, 2017

1 commit


02 Mar, 2017

1 commit


28 Feb, 2017

1 commit


19 Feb, 2017

1 commit

  • I was looking through static analysis warnings and there is a bug here
    that goes all the way back to the start of git. Basically we're copying
    the pointer and nearby garbage instead of the data the fd.key pointer is
    pointing to.

    Signed-off-by: Dan Carpenter
    Reviewed-by: Vyacheslav Dubeyko
    Signed-off-by: Al Viro

    Dan Carpenter
     

25 Dec, 2016

1 commit


11 Oct, 2016

3 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     

08 Oct, 2016

1 commit


07 Oct, 2016

1 commit


28 Sep, 2016

1 commit

  • CURRENT_TIME_SEC is not y2038 safe. current_time() will
    be transitioned to use 64 bit time along with vfs in a
    separate patch.
    There is no plan to transistion CURRENT_TIME_SEC to use
    y2038 safe time interfaces.

    current_time() will also be extended to use superblock
    range checking parameters when range checking is introduced.

    This works because alloc_super() fills in the the s_time_gran
    in super block to NSEC_PER_SEC.

    Signed-off-by: Deepa Dinamani
    Acked-by: Jan Kara
    Signed-off-by: Al Viro

    Deepa Dinamani
     

27 Sep, 2016

2 commits

  • Generated patch:

    sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
    sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This is trivial to do:

    - add flags argument to foo_rename()
    - check if flags doesn't have any other than RENAME_NOREPLACE
    - assign foo_rename() to .rename2 instead of .rename

    Filesystems converted:

    affs, bfs, exofs, ext2, hfs, hfsplus, jffs2, jfs, logfs, minix, msdos,
    nilfs2, omfs, reiserfs, sysvfs, ubifs, udf, ufs, vfat.

    Signed-off-by: Miklos Szeredi
    Acked-by: Boaz Harrosh
    Acked-by: Richard Weinberger
    Acked-by: Bob Copeland
    Acked-by: Jan Kara
    Cc: Theodore Ts'o
    Cc: Jaegeuk Kim
    Cc: OGAWA Hirofumi
    Cc: Mikulas Patocka
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Ryusuke Konishi
    Cc: Christoph Hellwig

    Miklos Szeredi
     

22 Sep, 2016

1 commit

  • inode_change_ok() will be resposible for clearing capabilities and IMA
    extended attributes and as such will need dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better relect that it does also some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Aug, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

01 Aug, 2016

1 commit


29 Jul, 2016

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

21 Jul, 2016

1 commit


11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2016

1 commit


28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     

18 May, 2016

1 commit

  • Pull vfs cleanups from Al Viro:
    "More cleanups from Christoph"

    * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nfsd: use RWF_SYNC
    fs: add RWF_DSYNC aand RWF_SYNC
    ceph: use generic_write_sync
    fs: simplify the generic_write_sync prototype
    fs: add IOCB_SYNC and IOCB_DSYNC
    direct-io: remove the offset argument to dio_complete
    direct-io: eliminate the offset argument to ->direct_IO
    xfs: eliminate the pos variable in xfs_file_dio_aio_write
    filemap: remove the pos argument to generic_file_direct_write
    filemap: remove pos variables in generic_file_read_iter

    Linus Torvalds
     

13 May, 2016

1 commit


03 May, 2016

1 commit


02 May, 2016

1 commit


11 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Jan, 2016

1 commit


11 Sep, 2015

2 commits

  • Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This is an identical change to the corresponding hfs b-tree code to Sergei
    Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
    to keep similar code paths in the hfs and hfsplus drivers in sync, where
    appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung
     
  • Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would get the hfsplus driver all confused and complaining about
    B-tree inconsistencies, and generates a "BUG: Bad page state". Most
    recently, I can generate this problem on up-to-date Fedora 22 with shipped
    kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
    mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
    lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/mapper/fedora-root 3276800 551665 2725135 17% /
    /dev/mapper/fedora-home 52879360 716221 52163139 2% /home
    /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was an REF_PAGES conditional (and "//" commented out code)
    to switch between B-node centric paging to inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung