10 Nov, 2015

5 commits

  • commit 0f89abf56abbd0e1c6e3cef9813e6d9f05383c1e upstream.

    Commit 8eb934591f8b ("btrfs: check unsupported filters in balance
    arguments") adds a jump to exit label out_bargs in case the argument
    check fails. At this point in addition to the bargs memory, the
    memory for struct btrfs_balance_control has already been allocated.
    Ownership of bctl is passed to btrfs_balance() only in the good case,
    so when the argument check fails and we jump straight to out_bargs,
    bctl is leaked. Make sure the memory is freed on that error path as
    well. Detected by Coverity CID 1328378.
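
    A minimal sketch of the kind of fix this implies (label names and the
    bctl = NULL step are assumptions, not the verbatim upstream diff):

        /* argument-check failure path */
        ret = -EINVAL;
        goto out_bctl;          /* was "goto out_bargs", which leaked bctl */
        ...
        ret = btrfs_balance(bctl, bargs);
        bctl = NULL;            /* ownership has passed to btrfs_balance() */
        ...
    out_bctl:
        kfree(bctl);            /* no-op on the good path, frees it otherwise */
    out_bargs:
        kfree(bargs);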

    Signed-off-by: Christian Engelmayer
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Christian Engelmayer
     
  • commit ab79efab0a0ba01a74df782eb7fa44b044dae8b5 upstream.

    In ovl_copy_up_locked(), newdentry is leaked if the function exits through
    out_cleanup, as this just jumps to out after calling ovl_cleanup() - which
    doesn't actually release the ref on newdentry.

    The out_cleanup segment should instead exit through out2 as certainly
    newdentry leaks - and possibly upper does also, though this isn't caught
    given the catch of newdentry.
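
    Purely as an illustration of that leak pattern (labels simplified, not
    the verbatim ovl_copy_up_locked() source):

        ...
    out2:
        dput(newdentry);                /* drops the ref taken at lookup time */
    out:
        return err;

    out_cleanup:
        ovl_cleanup(wdir, newdentry);   /* removes the temp copy-up file ... */
        goto out;                       /* ... but skips dput(); fix: goto out2 */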

    Without this fix, something like the following is seen:

    BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs]
    BUG: Dentry ffff880023ece640{i=0,n=bigfile} still in use (1) [unmount of tmpfs tmpfs]

    when unmounting the upper layer after an error occurred in copyup.

    An error can be induced by creating a big file in a lower layer with
    something like:

    dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

    to create a large file (4.1G). Overlay an upper layer that is too small
    (on tmpfs might do) and then induce a copy up by opening it writably.

    Reported-by: Ulrich Obergfell
    Signed-off-by: David Howells
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 0480334fa60488d12ae101a02d7d9e1a3d03d7dd upstream.

    Open the lower file with O_LARGEFILE in ovl_copy_up().

    Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
    catching 32-bit userspace dealing with a file large enough that it'll be
    mishandled if the application isn't aware that there might be an integer
    overflow. Inside the kernel, there shouldn't be any problems.
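
    A minimal sketch of the intent (variable names assumed, ovl_path_open()
    is the existing overlayfs helper; not the verbatim diff):

        /* open both ends of the copy-up with O_LARGEFILE so a >2GiB lower
         * file can always be copied up */
        old_file = ovl_path_open(lowerpath, O_LARGEFILE | O_RDONLY);
        ...
        new_file = ovl_path_open(upperpath, O_LARGEFILE | O_WRONLY);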

    Reported-by: Ulrich Obergfell
    Signed-off-by: David Howells
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 5ffdbe8bf1e485026e1c7e4714d2841553cf0b40 upstream.

    This fixes memory leak after umount.

    Kmemleak report:

    unreferenced object 0xffff8800ba791010 (size 8):
    comm "mount", pid 2394, jiffies 4294996294 (age 53.920s)
    hex dump (first 8 bytes):
    20 1c 13 02 00 88 ff ff .......
    backtrace:
    [] create_object+0x124/0x2c0
    [] kmemleak_alloc+0x7b/0xc0
    [] __kmalloc+0x106/0x340
    [] ovl_fill_super+0x55c/0x9b0 [overlay]
    [] mount_nodev+0x54/0xa0
    [] ovl_mount+0x18/0x20 [overlay]
    [] mount_fs+0x43/0x170
    [] vfs_kern_mount+0x74/0x170
    [] do_mount+0x22d/0xdf0
    [] SyS_mount+0x7b/0xc0
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    [] 0xffffffffffffffff

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Miklos Szeredi
    Fixes: dd662667e6d3 ("ovl: add mutli-layer infrastructure")
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit 0f95502ad84874b3c05fc7cdd9d4d9d5cddf7859 upstream.

    This fixes small memory leak after mount.

    Kmemleak report:

    unreferenced object 0xffff88003683fe00 (size 16):
    comm "mount", pid 2029, jiffies 4294909563 (age 33.380s)
    hex dump (first 16 bytes):
    20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff '......@K.6....
    backtrace:
    [] create_object+0x124/0x2c0
    [] kmemleak_alloc+0x7b/0xc0
    [] __kmalloc+0x106/0x340
    [] ovl_fill_super+0x389/0x9a0 [overlay]
    [] mount_nodev+0x54/0xa0
    [] ovl_mount+0x18/0x20 [overlay]
    [] mount_fs+0x43/0x170
    [] vfs_kern_mount+0x74/0x170
    [] do_mount+0x22d/0xdf0
    [] SyS_mount+0x7b/0xc0
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    [] 0xffffffffffffffff

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Miklos Szeredi
    Fixes: a78d9f0d5d5c ("ovl: support multiple lower layers")
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

27 Oct, 2015

7 commits

  • commit 83bfff23e9ed19f37c4ef0bba84e75bd88e5cf21 upstream.

    Now that we have file locking helpers that can deal with an inode
    instead of a filp, we can change the NFSv4 locking code to use that
    instead.

    This should fix the case where we have a filp that is closed while flock
    or OFD locks are set on it, and the task is signaled so that it doesn't
    wait for the LOCKU reply to come in before the filp is freed. At that
    point we can end up with a use-after-free with the current code, which
    relies on dereferencing the fl_file in the lock request.

    Signed-off-by: Jeff Layton
    Reviewed-by: "J. Bruce Fields"
    Tested-by: "J. Bruce Fields"
    Cc: William Dauchy
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit ee296d7c5709440f8abd36b5b65c6b3e388538d9 upstream.

    They just call file_inode and then the corresponding *_inode_file_wait
    function. Just make them static inlines instead.
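
    A sketch of what the resulting wrappers look like, assuming the
    *_inode_wait() helpers introduced earlier in this series:

    static inline int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
    {
        return posix_lock_inode_wait(file_inode(filp), fl);
    }

    static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
    {
        return flock_lock_inode_wait(file_inode(filp), fl);
    }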

    Signed-off-by: Jeff Layton
    Cc: William Dauchy
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 29d01b22eaa18d8b46091d3c98c6001c49f78e4a upstream.

    Allow callers to pass in an inode instead of a filp.

    Signed-off-by: Jeff Layton
    Reviewed-by: "J. Bruce Fields"
    Tested-by: "J. Bruce Fields"
    Cc: William Dauchy
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit bcd7f78d078ff6197715c1ed070c92aca57ec12c upstream.

    ...and rename it to better describe how it works.

    In order to fix a use-after-free in NFS, we need to be able to remove
    locks from an inode after the filp associated with them may already have
    been freed. flock_lock_file only dereferences the filp to get to the
    inode, so just push that dereference out to the callers.

    All of the callers already pass in a lock request that has fl_file set
    properly, so we don't need to pass the filp in separately.
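
    In sketch form (signature inferred from the description), the helper now
    takes an inode and callers that only have a filp dereference it themselves:

    int flock_lock_inode(struct inode *inode, struct file_lock *fl);

        /* caller side: the lock request already carries fl->fl_file */
        error = flock_lock_inode(file_inode(filp), fl);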

    Signed-off-by: Jeff Layton
    Reviewed-by: "J. Bruce Fields"
    Tested-by: "J. Bruce Fields"
    Cc: William Dauchy
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 8c3ad9cb7343dc5f61b8cf3cdbe1016c5e7c2c8b upstream.

    Recent Linux clients have started to send GETLAYOUT requests with
    minlength less than blocksize.

    Servers aren't really allowed to impose this kind of restriction on
    layouts; see RFC 5661 section 18.43.3 for details.

    This has been observed to cause indefinite hangs on fsx runs on some
    clients.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit dc6c5fb3b514221f2e9d21ee626a9d95d3418dff upstream.

    The code for btrfs inode-resolve has never worked properly for
    files with enough hard links to trigger extrefs. It was trying to
    get the leaf out of a path after freeing the path:

    btrfs_release_path(path);
    leaf = path->nodes[0];
    item_size = btrfs_item_size_nr(leaf, slot);

    The fix here is to use the extent buffer we cloned just a little higher
    up to avoid deadlocks caused by using the leaf in the path.
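
    A hedged sketch of the corrected pattern described above (not the exact
    upstream hunk): clone the leaf before releasing the path and read the
    item size from the clone.

        eb = btrfs_clone_extent_buffer(path->nodes[0]); /* taken before release */
        btrfs_release_path(path);

        item_size = btrfs_item_size_nr(eb, slot);       /* not path->nodes[0] */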

    Signed-off-by: Chris Mason
    cc: Mark Fasheh
    Reviewed-by: Filipe Manana
    Reviewed-by: Mark Fasheh
    Signed-off-by: Greg Kroah-Hartman

    Chris Mason
     
  • commit 8eb934591f8bf584969454a658f629cd06e59f3a upstream.

    We don't verify that all the balance filter arguments supplemented by
    the flags are actually known to the kernel. Thus we let it silently pass
    and do nothing.

    At the moment this means only the 'limit' filter, but we're going to add
    a few more soon, so it's better to have this fixed. Also fix it in older
    stable kernels so that they work with newer userspace tools.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     

23 Oct, 2015

20 commits

  • commit daf3761c9fcde0f4ca64321cbed6c1c86d304193 upstream.

    Leandro Awa writes:
    "After switching to version 4.1.6, our parallelized and distributed
    workflows now fail consistently with errors of the form:

    T34: ./regex.c:39:22: error: config.h: No such file or directory

    From our 'git bisect' testing, the following commit appears to be the
    possible cause of the behavior we've been seeing: commit 766c4cbfacd8"

    Al Viro says:
    "What happens is that 766c4cbfacd8 got the things subtly wrong.

    We used to treat d_is_negative() after lookup_fast() as "fall with
    ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
    protection is unreliable and failing with hard error on what should've
    fallen back to non-RCU pathname resolution is a bug.

    Unfortunately, we'd pulled the test too far up and ran afoul of
    another kind of staleness. The dentry might have been absolutely
    stable from the RCU point of view (and we might be on UP, etc), but
    stale from the remote fs point of view. If ->d_revalidate() returns
    "it's actually stale", dentry gets thrown away and the original code
    wouldn't even have looked at its ->d_flags.

    What we need is to check ->d_flags where 766c4cbfacd8 does (prior to
    ->d_seq validation) but only use the result in cases where we do not
    discard this dentry outright"

    Reported-by: Leandro Awa
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
    Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...")
    Tested-by: Leandro Awa
    Signed-off-by: Trond Myklebust
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 3ec0c97959abff33a42db9081c22132bcff5b4f2 upstream.

    If filelayout_decode_layout() fails, _filelayout_free_lseg() will cause
    a double free of fh_array.

    [ 1179.279800] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 1179.280198] IP: [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.281010] PGD 0
    [ 1179.281443] Oops: 0000 [#1]
    [ 1179.281831] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) xfs libcrc32c coretemp nfsd crct10dif_pclmul ppdev crc32_pclmul crc32c_intel auth_rpcgss ghash_clmulni_intel nfs_acl lockd vmw_balloon grace sunrpc parport_pc vmw_vmci parport shpchp i2c_piix4 vmwgfx drm_kms_helper ttm drm serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: fscache]
    [ 1179.283891] CPU: 0 PID: 13336 Comm: cat Tainted: G OE 4.3.0-rc1-pnfs+ #244
    [ 1179.284323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 1179.285206] task: ffff8800501d48c0 ti: ffff88003e3c4000 task.ti: ffff88003e3c4000
    [ 1179.285668] RIP: 0010:[] [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.286612] RSP: 0018:ffff88003e3c77f8 EFLAGS: 00010202
    [ 1179.287092] RAX: 0000000000000000 RBX: ffff88001fe78900 RCX: 0000000000000000
    [ 1179.287731] RDX: ffffea0000f40760 RSI: ffff88001fe789c8 RDI: ffff88001fe789c0
    [ 1179.288383] RBP: ffff88003e3c7810 R08: ffffea0000f40760 R09: 0000000000000000
    [ 1179.289170] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88001fe789c8
    [ 1179.289959] R13: ffff88001fe789c0 R14: ffff88004ec05a80 R15: ffff88004f935b88
    [ 1179.290791] FS: 00007f4e66bb5700(0000) GS:ffffffff81c29000(0000) knlGS:0000000000000000
    [ 1179.291580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1179.292209] CR2: 0000000000000000 CR3: 00000000203f8000 CR4: 00000000001406f0
    [ 1179.292731] Stack:
    [ 1179.293195] ffff88001fe78900 00000000000000d0 ffff88001fe78178 ffff88003e3c7868
    [ 1179.293676] ffffffffa0272737 0000000000000001 0000000000000001 ffff88001fe78800
    [ 1179.294151] 00000000614fffce ffffffff81727671 ffff88001fe78100 ffff88001fe78100
    [ 1179.294623] Call Trace:
    [ 1179.295092] [] filelayout_alloc_lseg+0xa7/0x2d0 [nfs_layout_nfsv41_files]
    [ 1179.295625] [] ? out_of_line_wait_on_bit+0x81/0xb0
    [ 1179.296133] [] pnfs_layout_process+0xae/0x320 [nfsv4]
    [ 1179.296632] [] nfs4_proc_layoutget+0x2b1/0x360 [nfsv4]
    [ 1179.297134] [] pnfs_update_layout+0x853/0xb30 [nfsv4]
    [ 1179.297632] [] ? nfs_get_lock_context+0x74/0x170 [nfs]
    [ 1179.298158] [] filelayout_pg_init_read+0x37/0x50 [nfs_layout_nfsv41_files]
    [ 1179.298834] [] __nfs_pageio_add_request+0x119/0x460 [nfs]
    [ 1179.299385] [] ? nfs_create_request.part.9+0x37/0x2e0 [nfs]
    [ 1179.299872] [] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
    [ 1179.300362] [] readpage_async_filler+0x85/0x260 [nfs]
    [ 1179.300907] [] read_cache_pages+0x91/0xd0
    [ 1179.301391] [] ? nfs_read_completion+0x220/0x220 [nfs]
    [ 1179.301867] [] nfs_readpages+0x128/0x200 [nfs]
    [ 1179.302330] [] __do_page_cache_readahead+0x203/0x280
    [ 1179.302784] [] ? __do_page_cache_readahead+0xd8/0x280
    [ 1179.303413] [] ondemand_readahead+0x1a6/0x2f0
    [ 1179.303855] [] page_cache_sync_readahead+0x31/0x50
    [ 1179.304286] [] generic_file_read_iter+0x4a6/0x5c0
    [ 1179.304711] [] ? __nfs_revalidate_mapping+0x1f6/0x240 [nfs]
    [ 1179.305132] [] nfs_file_read+0x52/0xa0 [nfs]
    [ 1179.305540] [] __vfs_read+0xcc/0x100
    [ 1179.305936] [] vfs_read+0x85/0x130
    [ 1179.306326] [] SyS_read+0x58/0xd0
    [ 1179.306708] [] entry_SYSCALL_64_fastpath+0x12/0x76
    [ 1179.307094] Code: c4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 8b 07 49 89 f4 85 c0 74 47 48 8b 06 49 89 fd 8b 38 48 85 ff 74 22 31 db eb 0c 48 63 d3 48 8b 3c d0 48 85
    [ 1179.308357] RIP [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.309177] RSP
    [ 1179.309582] CR2: 0000000000000000

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust
    Cc: William Dauchy
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 9391dd00d13c853ab4f2a85435288ae2202e0e43 upstream.

    When opening a directory we want the overlayfs inode, not one from
    the topmost layer.

    Reported-By: Andrey Jr. Melnikov
    Tested-By: Andrey Jr. Melnikov
    Signed-off-by: Al Viro
    Cc: "Kamata, Munehisa"
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 4bacc9c9234c7c8eec44f5ed4e960d9f96fa0f01 upstream.

    Make file->f_path always point to the overlay dentry so that the path in
    /proc/pid/fd is correct and to ensure that label-based LSMs have access to the
    overlay as well as the underlay (path-based LSMs probably don't need it).

    Using my union testsuite to set things up, before the patch I see:

    [root@andromeda union-testsuite]# bash 5</a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...

    After the patch:

    [root@andromeda union-testsuite]# bash 5</mnt/a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...

    Note the change in where /proc/$$/fd/5 points to in the ls command. It was
    pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
    (which is correct).

    The inode accessed, however, is the lower layer. The union layer is on device
    25h/37d and the upper layer on 24h/36d.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Cc: "Kamata, Munehisa"
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit f25801ee4680ef1db21e15c112e6e5fe3ffe8da5 upstream.

    Call ovl_drop_write() earlier in ovl_dentry_open() before we call vfs_open()
    as we've done the copy up for which we needed the freeze-write lock by that
    point.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Cc: "Kamata, Munehisa"
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 397d425dc26da728396e66d392d5dcb8dac30c37 upstream.

    In rare cases a directory can be renamed out from under a bind mount.
    In those cases without special handling it becomes possible to walk up
    the directory tree to the root dentry of the filesystem and down
    from the root dentry to every other file or directory on the filesystem.

    Like division by zero .. from an unconnected path can not be given
    a useful semantic as there is no predicting at which path component
    the code will realize it is unconnected. We certainly can not match
    the current behavior as the current behavior is a security hole.

    Therefore, when encountering .. while following an unconnected path,
    return -ENOENT.

    - Add a function path_connected to verify path->dentry is reachable
    from path->mnt.mnt_root. AKA to validate that rename did not do
    something nasty to the bind mount.

    To avoid races, path_connected must be called after following a path
    component to its next path component.
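
    A sketch of what such a check can look like (close to, but not guaranteed
    to be, the upstream helper):

    static bool path_connected(const struct path *path)
    {
        struct vfsmount *mnt = path->mnt;

        /* Only bind mounts can have disconnected paths */
        if (mnt->mnt_root == mnt->mnt_sb->s_root)
            return true;

        return is_subdir(path->dentry, mnt->mnt_root);
    }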

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit cde93be45a8a90d8c264c776fab63487b5038a65 upstream.

    A rename can result in a dentry that by walking up d_parent
    will never reach its mnt_root. For lack of a better term
    I call this an escaped path.

    prepend_path is called by four different functions __d_path,
    d_absolute_path, d_path, and getcwd.

    __d_path only wants to see paths that are connected to the root it passes
    in. So __d_path needs prepend_path to return an error.

    d_absolute_path similarly wants to see paths that are connected to
    some root. Escaped paths are not connected to any mnt_root so
    d_absolute_path needs prepend_path to return an error greater
    than 1. So escaped paths will be treated like paths on lazily
    unmounted mounts.

    getcwd needs to prepend "(unreachable)" so getcwd also needs
    prepend_path to return an error.

    d_path is the interesting hold out. d_path just wants to print
    something, and does not care about the weird cases. Which raises
    the question what should be printed?

    Given that <escaped_path>/<anything> should result in -ENOENT, I
    believe it is desirable for escaped paths to be printed as empty
    paths, as there are not really any meaningful path components when
    considered from the perspective of a mount tree.

    So tweak prepend_path to return an empty path with a new error
    code of 3 when it encounters an escaped path.
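
    A hedged sketch of the kind of check this adds inside prepend_path()'s
    walk up d_parent (variable names assumed):

        if (IS_ROOT(dentry) && dentry != vfsmnt->mnt_root) {
            /* escaped from its bind mount: report an empty path */
            bptr = *buffer;
            blen = *buflen;
            error = 3;
            break;
        }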

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit cf6f54e3f133229f02a90c04fe0ff9dd9d3264b4 upstream.

    Fixes the following lockdep splat:
    [ 1.244527] =============================================
    [ 1.245193] [ INFO: possible recursive locking detected ]
    [ 1.245193] 4.2.0-rc1+ #37 Not tainted
    [ 1.245193] ---------------------------------------------
    [ 1.245193] cp/742 is trying to acquire lock:
    [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] ubifs_init_security+0x29/0xb0
    [ 1.245193]
    [ 1.245193] but task is already holding lock:
    [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
    [ 1.245193]
    [ 1.245193] other info that might help us debug this:
    [ 1.245193] Possible unsafe locking scenario:
    [ 1.245193]
    [ 1.245193] CPU0
    [ 1.245193] ----
    [ 1.245193] lock(&sb->s_type->i_mutex_key#9);
    [ 1.245193] lock(&sb->s_type->i_mutex_key#9);
    [ 1.245193]
    [ 1.245193] *** DEADLOCK ***
    [ 1.245193]
    [ 1.245193] May be due to missing lock nesting notation
    [ 1.245193]
    [ 1.245193] 2 locks held by cp/742:
    [ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
    [ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
    [ 1.245193]
    [ 1.245193] stack backtrace:
    [ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
    [ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
    [ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5
    [ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68
    [ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500
    [ 1.245193] Call Trace:
    [ 1.245193] [] dump_stack+0x4c/0x65
    [ 1.245193] [] ? console_unlock+0x1c5/0x510
    [ 1.245193] [] __lock_acquire+0x1a6d/0x1ea0
    [ 1.245193] [] ? __lock_is_held+0x58/0x80
    [ 1.245193] [] lock_acquire+0xd3/0x270
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] mutex_lock_nested+0x6b/0x3a0
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ubifs_create+0xa6/0x1f0
    [ 1.245193] [] ? path_openat+0x3af/0x1280
    [ 1.245193] [] vfs_create+0x95/0xc0
    [ 1.245193] [] path_openat+0x7cc/0x1280
    [ 1.245193] [] ? __lock_acquire+0x543/0x1ea0
    [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
    [ 1.245193] [] ? calc_global_load_tick+0x60/0x90
    [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
    [ 1.245193] [] ? __alloc_fd+0xaf/0x180
    [ 1.245193] [] do_filp_open+0x75/0xd0
    [ 1.245193] [] ? _raw_spin_unlock+0x26/0x40
    [ 1.245193] [] ? __alloc_fd+0xaf/0x180
    [ 1.245193] [] do_sys_open+0x129/0x200
    [ 1.245193] [] SyS_open+0x19/0x20
    [ 1.245193] [] entry_SYSCALL_64_fastpath+0x12/0x6f

    While the lockdep splat is a false positive, because path_openat holds i_mutex
    of the parent directory and ubifs_init_security() tries to acquire i_mutex
    of a new inode, it reveals that taking i_mutex in ubifs_init_security() is
    in vain because it is only being called in the inode allocation path
    and therefore nobody else can see the inode yet.

    Reported-and-tested-by: Boris Brezillon
    Reviewed-and-tested-by: Dongsheng Yang
    Signed-off-by: Richard Weinberger
    Signed-off-by: dedekind1@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Richard Weinberger
     
  • commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream.

    Linux cifs mount with ntlmssp against a Mac OS X (Yosemite
    10.10.5) share fails if the clocks differ by more than +/-2h:

    digest-service: digest-request: od failed with 2 proto=ntlmv2
    digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2

    Fix this by (re-)using the given server timestamp for the
    ntlmv2 authentication (as Windows 7 does).

    A related problem was also reported earlier by Namjae Jeon (see below):

    Windows machine has extended security feature which refuse to allow
    authentication when there is time difference between server time and
    client time when ntlmv2 negotiation is used. This problem is prevalent
    in embedded environments where the system time is set to the default of 1970.

    Modern servers send the server timestamp in the TargetInfo AV_PAIR
    structure in the challenge message [see MS-NLMP 2.2.2.1]. In
    [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
    use the server-provided timestamp if present, OR the current time if
    it is not.
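
    A hedged sketch of the idea when building the NTLMv2 blob (the
    find_timestamp() helper name is an assumption; cifs_UnixTimeToNT() is the
    existing conversion helper):

        __le64 rsp_timestamp = find_timestamp(ses); /* from TargetInfo AV_PAIR */

        /* prefer the server's clock; fall back to ours only if none was sent */
        ntlmv2->time = rsp_timestamp ? rsp_timestamp :
                cpu_to_le64(cifs_UnixTimeToNT(CURRENT_TIME));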

    Reported-by: Namjae Jeon
    Signed-off-by: Peter Seiderer
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Peter Seiderer
     
  • commit 646200a041203f440fb6fcf9cacd9efeda9de74c upstream.

    The error paths in set_file_size for cifs and smb3 are incorrect.

    In the unlikely event that a server did not support set file info
    of the file size, the code incorrectly falls back to trying SMBWriteX
    (note that only the original core SMB Write, used for example by DOS,
    can set the file size this way - this actually does not work for the more
    recent SMBWriteX). The idea was that, since the old DOS SMB Write can set
    the file size if you write zero bytes at that offset, it could be used as a
    fallback when the server rejects the normal set file info call.

    Fortunately the SMBWriteX will never be sent on the wire (except when
    file size is zero) since the length and offset fields were reversed
    in the two places in this function that call SMBWriteX causing
    the fall back path to return an error. It is also important to never call
    an SMB request from an SMB2/SMB3 session (which theoretically would
    be possible, and can cause a brief session drop, although the client
    recovers) so this should be fixed. In practice this path does not happen
    with modern servers but the error fall back to SMBWriteX is clearly wrong.

    Remove the calls to SMBWriteX in the error paths in cifs_set_file_size().

    Pointed out by PaX/grsecurity team

    Signed-off-by: Steve French
    Reported-by: PaX Team
    CC: Emese Revfy
    CC: Brad Spengler
    Signed-off-by: Greg Kroah-Hartman

    Steve French
     
  • commit e0ddde9d44e37fbc21ce893553094ecf1a633ab5 upstream.

    Leases (oplocks) were always requested for SMB2/SMB3 even when oplocks
    were disabled in the cifs.ko module.

    Signed-off-by: Steve French
    Reviewed-by: Chandrika Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Steve French
     
  • commit ceb1b0b9b4d1089e9f2731a314689ae17784c861 upstream.

    Kerberos, which is very important for security, was only enabled for
    CIFS, not SMB2/SMB3, mounts (e.g. vers=3.0).

    Patch based on the information detailed in
    http://thread.gmane.org/gmane.linux.kernel.cifs/10081/focus=10307
    to enable Kerberized SMB2/SMB3

    a) SMB2_negotiate: enable/use decode_negTokenInit in SMB2_negotiate
    b) SMB2_sess_setup: handle Kerberos sectype and replicate Kerberos
    SMB1 processing done in sess_auth_kerberos

    Signed-off-by: Noel Power
    Signed-off-by: Jim McDonough
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Steve French
     
  • commit 8fa4592a14ebb3c22a21d846d1e4f65dab7d1a7c upstream.

    If all other conditions in nfs_can_extend_write() are met, and there
    are no locks, then we should be able to assume close-to-open semantics
    and the ability to extend our write to cover the whole page.

    With this patch, the xfstests generic/074 test completes in 242s instead
    of >1400s on my test rig.
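
    A sketch of the relaxed check in nfs_can_extend_write(), assuming the
    file_lock_context fields introduced by bd61e0a9c852:

        struct file_lock_context *flctx = file_inode(file)->i_flctx;

        /* no locks at all: close-to-open semantics let us extend the write */
        if (!flctx || (list_empty_careful(&flctx->flc_flock) &&
                       list_empty_careful(&flctx->flc_posix)))
            return 1;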

    Fixes: bd61e0a9c852 ("locks: convert posix locks to file_lock_context")
    Cc: Jeff Layton
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 048883e0b934d9a5103d40e209cb14b7f33d2933 upstream.

    We really want sizeof(struct page *) instead of sizeof(struct page).
    Otherwise we limit the maximum IO size to 64 pages rather than 512 pages
    on a 64-bit system (a 4096-byte page divided by the 64-byte struct page
    instead of by an 8-byte pointer).

    Fixes 2e11f829(nfs: cap request size to fit a kmalloced page array).

    Cc: Christoph Hellwig
    Signed-off-by: Peng Tao
    Fixes: 2e11f8296d22 ("nfs: cap request size to fit a kmalloced page array")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit 6f29b9bba7b08c6b1d6f2cc4cf750b342fc1946c upstream.

    There is a reference leak of layout segment after resetting
    pageio read/write to mds.

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 808f80b46790f27e145c72112189d6a3be2bc884 upstream.

    My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of
    compressed and shared extents") was effective only if the compressed
    extents cover a file range with a length that is not a multiple of 16
    pages. That's because the detection of when we reached a different range
    of the file that shares the same compressed extent as the previously
    processed range was done at extent_io.c:__do_contiguous_readpages(),
    which covers subranges with a length up to 16 pages, because
    extent_readpages() groups the pages in clusters no larger than 16 pages.
    So fix this by tracking the start of the previously processed file
    range's extent map at extent_readpages().

    The following test case for fstests reproduces the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_cloner

    rm -f $seqres.full

    test_clone_and_read_compressed_extent()
    {
    local mount_opts=$1

    _scratch_mkfs >>$seqres.full 2>&1
    _scratch_mount $mount_opts

    # Create our test file with a single extent of 64Kb that is going to
    # be compressed no matter which compression algo is used (zlib/lzo).
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
    $SCRATCH_MNT/foo | _filter_xfs_io

    # Now clone the compressed extent into an adjacent file offset.
    $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
    $SCRATCH_MNT/foo $SCRATCH_MNT/foo

    echo "File digest before unmount:"
    md5sum $SCRATCH_MNT/foo | _filter_scratch

    # Remount the fs or clear the page cache to trigger the bug in
    # btrfs. Because the extent has an uncompressed length that is a
    # multiple of 16 pages, all the pages belonging to the second range
    # of the file (64K to 128K), which points to the same extent as the
    # first range (0K to 64K), had their contents full of zeroes instead
    # of the byte 0xaa. This was a bug exclusively in the read path of
    # compressed extents, the correct data was stored on disk, btrfs
    # just failed to fill in the pages correctly.
    _scratch_remount

    echo "File digest after remount:"
    # Must match the digest we got before.
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    }

    echo -e "\nTesting with zlib compression..."
    test_clone_and_read_compressed_extent "-o compress=zlib"

    _scratch_unmount

    echo -e "\nTesting with lzo compression..."
    test_clone_and_read_compressed_extent "-o compress=lzo"

    status=0
    exit

    Signed-off-by: Filipe Manana
    Tested-by: Timofey Titovets
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 005efedf2c7d0a270ffbe28d8997b03844f3e3e7 upstream.

    If a file has a range pointing to a compressed extent, followed by
    another range that points to the same compressed extent and a read
    operation attempts to read both ranges (either completely or part of
    them), the pages that correspond to the second range are incorrectly
    filled with zeroes.

    Consider the following example:

    File layout
    [0 - 8K]                       [8K - 24K]
        |                              |
        |                              |
     points to extent X,            points to extent X,
     offset 4K, length of 8K        offset 0, length 16K

    [extent X, compressed length = 4K, uncompressed length = 16K]

    If a readpages() call spans the 2 ranges, a single bio to read the extent
    is submitted - extent_io.c:submit_extent_page() would only create a new
    bio to cover the second range pointing to the extent if the extent it
    points to had a different logical address than the extent associated with
    the first range. The consequence is that the compressed read end io
    handler (compression.c:end_compressed_bio_read()) finishes once the extent
    is decompressed into the pages covering the first range, leaving the
    remaining pages (belonging to the second range) filled with zeroes (done
    by compression.c:btrfs_clear_biovec_end()).

    So fix this by submitting the current bio whenever we find a range
    pointing to a compressed extent that was preceded by a range with a
    different extent map. This is the simplest solution for this corner
    case. Making the end io callback populate both ranges (or more, if we
    have multiple pointing to the same extent) is a much more complex
    solution since each bio is tightly coupled with a single extent map and
    the extent maps associated to the ranges pointing to the shared extent
    can have different offsets and lengths.
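
    A hedged sketch of the submit decision in the readpage path (names such
    as prev_em_start and force_bio_submit assumed from the description):

        if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
            prev_em_start && *prev_em_start != (u64)-1 &&
            *prev_em_start != em->orig_start) {
                /* a new file range, but the same compressed extent */
                force_bio_submit = true;
        }

        if (prev_em_start)
                *prev_em_start = em->orig_start;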

    The following test case for fstests triggers the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_cloner

    rm -f $seqres.full

    test_clone_and_read_compressed_extent()
    {
    local mount_opts=$1

    _scratch_mkfs >>$seqres.full 2>&1
    _scratch_mount $mount_opts

    # Create a test file with a single extent that is compressed (the
    # data we write into it is highly compressible no matter which
    # compression algorithm is used, zlib or lzo).
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
    -c "pwrite -S 0xbb 4K 8K" \
    -c "pwrite -S 0xcc 12K 4K" \
    $SCRATCH_MNT/foo | _filter_xfs_io

    # Now clone our extent into an adjacent offset.
    $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
    $SCRATCH_MNT/foo $SCRATCH_MNT/foo

    # Same as before but for this file we clone the extent into a lower
    # file offset.
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
    -c "pwrite -S 0xbb 12K 8K" \
    -c "pwrite -S 0xcc 20K 4K" \
    $SCRATCH_MNT/bar | _filter_xfs_io

    $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
    $SCRATCH_MNT/bar $SCRATCH_MNT/bar

    echo "File digests before unmounting filesystem:"
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    md5sum $SCRATCH_MNT/bar | _filter_scratch

    # Evicting the inode or clearing the page cache before reading
    # again the file would also trigger the bug - reads were returning
    # all bytes in the range corresponding to the second reference to
    # the extent with a value of 0, but the correct data was persisted
    # (it was a bug exclusively in the read path). The issue happened
    # only if the same readpages() call targeted pages belonging to the
    # first and second ranges that point to the same compressed extent.
    _scratch_remount

    echo "File digests after mounting filesystem again:"
    # Must match the same digests we got before.
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    md5sum $SCRATCH_MNT/bar | _filter_scratch
    }

    echo -e "\nTesting with zlib compression..."
    test_clone_and_read_compressed_extent "-o compress=zlib"

    _scratch_unmount

    echo -e "\nTesting with lzo compression..."
    test_clone_and_read_compressed_extent "-o compress=lzo"

    status=0
    exit

    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream.

    In btrfs_evict_inode, we properly truncate the page cache for evicted
    inodes but then we call btrfs_wait_ordered_range for every inode as well.
    It's the right thing to do for regular files but results in incorrect
    behavior for device inodes for block devices.

    filemap_fdatawrite_range gets called with inode->i_mapping which gets
    resolved to the block device inode before getting passed to
    wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
    next depends on whether there's an open file handle associated with the
    inode. If there is, we write to the block device, which is unexpected
    behavior. If there isn't, we fall through normally and inode->i_data is used.
    We can also end up racing against open/close which can result in crashes
    when i_mapping points to a block device inode that has been closed.

    Since there can't be any page cache associated with special file inodes,
    it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
    the problem.
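
    A sketch of the likely shape of the fix in btrfs_evict_inode() (hedged;
    special_file() is true for block/char devices, fifos and sockets):

        if (!special_file(inode->i_mode))
            btrfs_wait_ordered_range(inode, 0, (u64)-1);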

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
    Tested-by: Christoph Biedl
    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • commit 012572d4fc2e4ddd5c8ec8614d51414ec6cae02a upstream.

    The order of the following three spinlocks should be:
    dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock

    But dlm_dispatch_assert_master() is called while holding
    dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
    dlm_grab() which will take dlm_domain_lock.

    Once another thread (for example, dlm_query_join_handler) has already
    taken dlm_domain_lock and then tries to take dlm_ctxt->spinlock, a
    deadlock happens.

    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: "Junxiao Bi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joseph Qi
     
  • commit f0b2e563bc419df7c1b3d2f494574c25125f6aed upstream.

    The dax code doesn't currently support misaligned partitions,
    so disable O_DIRECT via dax until such time as that support
    materializes.

    Suggested-by: Boaz Harrosh
    Signed-off-by: Jeff Moyer
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

30 Sep, 2015

8 commits

  • commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

    Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes the logic in
    jbd2_log_do_checkpoint() bail out, which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    progress in cleaning the journal. Without that, jbd2_journal_destroy()
    just spins in an infinite loop.

    Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
    jbd2_log_do_checkpoint() fails with an error.
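
    A hedged sketch of the destroy loop with such a bail-out
    (jbd2_journal_destroy_checkpoint() is the existing helper that throws
    away the checkpoint lists):

        while (journal->j_checkpoint_transactions != NULL) {
            spin_unlock(&journal->j_list_lock);
            mutex_lock(&journal->j_checkpoint_mutex);
            err = jbd2_log_do_checkpoint(journal);
            mutex_unlock(&journal->j_checkpoint_mutex);
            /* checkpointing failed on an aborted journal: free the
             * buffers instead of looping forever */
            if (err) {
                jbd2_journal_destroy_checkpoint(journal);
                spin_lock(&journal->j_list_lock);
                break;
            }
            spin_lock(&journal->j_list_lock);
        }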

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream.

    Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would get the hfsplus driver all confused and complaining about
    B-tree inconsistencies, and generate a "BUG: Bad page state". Most
    recently, I can generate this problem on up-to-date Fedora 22 with shipped
    kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
    mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
    lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
    /dev/mapper/fedora-root    3276800  551665    2725135   17% /
    /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
    /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was an REF_PAGES conditional (and "//" commented out code)
    to switch between B-node centric paging to inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream.

    Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This makes the same change to the corresponding hfs B-tree code as Sergei
    Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
    to keep similar code paths in the hfs and hfsplus drivers in sync, where
    appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit 5556e7e6d30e8e9b5ee51b0e5edd526ee80e5e36 upstream.

    Consider eCryptfs dcache entries to be stale when the corresponding
    lower inode's i_nlink count is zero. This solves a problem caused by the
    lower inode being directly modified, without going through the eCryptfs
    mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's
    i_nlink count not being cleared.
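
    A hedged sketch of the extra check in ecryptfs_d_revalidate() (helpers are
    the existing eCryptfs/VFS ones, exact placement assumed):

        if (d_really_is_positive(dentry)) {
            struct inode *lower_inode =
                ecryptfs_inode_to_lower(d_inode(dentry));

            /* lower file was unlinked behind our back: dentry is stale */
            if (!lower_inode->i_nlink)
                rc = 0;
        }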

    Signed-off-by: Tyler Hicks
    Reported-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 40f705a736eac10e7dca7ab5dd5ed675a6df031d upstream.

    On a filesystem like vfat, all files are created with the same owner
    and mode independent of who created the file. When a vfat filesystem
    is mounted with root as owner of all files and read access for everyone,
    root's processes left world-readable coredumps on it (but other
    users' processes only left empty corefiles when given write access
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit fbb1816942c04429e85dbf4c1a080accc534299e upstream.

    It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
    RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
    attacker-writable directory D, either because the core_pattern is
    relative and the victim's cwd is attacker-writable or because an
    absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
    A: on a system with protected_hardlinks=0:
    execute access to a folder containing a victim-owned,
    attacker-readable file on the same partition as D, and the
    victim-owned file will be deleted before the main part of the attack
    takes place. (In practice, there are lots of files that fulfill
    this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
    This does not apply to most Linux systems because most distros set
    protected_hardlinks=1.
    B: on a system with protected_hardlinks=1:
    execute access to a folder containing a victim-owned,
    attacker-readable and attacker-writable file on the same partition
    as D, and the victim-owned file will be deleted before the main part
    of the attack takes place.
    (This seems to be uncommon.)
    C: on any system, independent of protected_hardlinks:
    write access to a non-sticky folder containing a victim-owned,
    attacker-readable file on the same partition as D
    (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)
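
    A hedged sketch of the corresponding change in do_coredump() (the exact
    flag set is assumed; the key point is O_EXCL so an existing file is never
    reused):

        int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
                         O_LARGEFILE | O_EXCL;

        cprm.file = filp_open(cn.corename, open_flags, 0600);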

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit 36319608e28701c07cad80ae3be8b0fdfb1ab40f upstream.

    This reverts commit 4e379d36c050b0117b5d10048be63a44f5036115.

    This commit opens up a race between the recovery code and the open code.

    Reported-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4a1e2feb9d246775dee0f78ed5b18826bae2b1c5 upstream.

    According to RFC5661 Section 18.2.4, CLOSE is supposed to return
    the zero stateid. This means that nfs_clear_open_stateid_locked()
    cannot assume that the result stateid will always match the 'other'
    field of the existing open stateid when trying to determine a race
    with a parallel OPEN.

    Instead, we look at the argument, and check for matches.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust