22 Sep, 2016

3 commits

  • Avoid spurious preemption.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: paulmck@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: tj@kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • As Oleg suggested, replace file_lock_list with a structure containing
    the hlist head and a spinlock.

    This completely removes the lglock from fs/locks.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: paulmck@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: tj@kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Replace the global part of the lglock with a percpu-rwsem.

    Since fcl_lock is a spinlock and itself nests under i_lock, which is
    also a spinlock, we cannot acquire sleeping locks in
    locks_{insert,remove}_global_locks().

    We can however wrap all fcl_lock acquisitions with percpu_down_read
    such that all invocations of locks_{insert,remove}_global_locks() have
    that read lock held.

    This allows us to replace the lg_global part of the lglock with the
    write side of the rwsem.

    In the absence of writers, percpu_{down,up}_read() are free of atomic
    instructions. This further avoids the very long preempt-disable
    regions caused by lglock on larger machines.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: paulmck@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: tj@kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Sep, 2016

2 commits

  • Reading the kcore file trips the hardened usercopy check for kernel
    text access:

    usercopy: kernel memory exposure attempt detected from ffffffff8179a01f () (4065 bytes)
    kernel BUG at mm/usercopy.c:75!

    Bypass this check for kcore by adding a bounce buffer for ktext data.

    Reported-by: Steve Best
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Suggested-by: Kees Cook
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
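
    The bounce-buffer idea can be illustrated outside the kernel. The sketch
    below is a plain userspace analogy, not the kcore code: copy_via_bounce
    and BOUNCE_SIZE are hypothetical names, and memcpy() stands in for the
    kernel's copy routines. The point is only that the destination is filled
    from an intermediate buffer and never directly from the protected source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 64  /* hypothetical; a kernel would more likely use a page */

/*
 * Copy len bytes from src to dst through a fixed-size intermediate buffer.
 * Returns the number of bytes copied.
 */
size_t copy_via_bounce(void *dst, const void *src, size_t len)
{
    unsigned char bounce[BOUNCE_SIZE];
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t done = 0;

    assert(dst != NULL && src != NULL);
    while (done < len) {
        size_t chunk = len - done;

        if (chunk > BOUNCE_SIZE)
            chunk = BOUNCE_SIZE;
        memcpy(bounce, in + done, chunk);   /* source -> bounce buffer */
        memcpy(out + done, bounce, chunk);  /* bounce buffer -> destination */
        done += chunk;
    }
    return done;
}
```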
     
  • The next patch adds a bounce buffer for the ktext area, so it's
    convenient to have a single bounce buffer for both the
    vmalloc/module and ktext cases.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
     

20 Sep, 2016

10 commits

  • This reverts commit 38b52efd218b ("ocfs2: bump up o2cb network protocol
    version").

    This commit made rolling upgrades fail. When one node is upgraded to a
    version with this commit, the remaining nodes fail to establish
    connections to it, so applications such as VMs on the remaining nodes
    can't be live migrated to the upgraded one. This causes an outage.
    Since the negotiate hb timeout behavior does not change without this
    commit, revert it.

    Fixes: 38b52efd218bf ("ocfs2: bump up o2cb network protocol version")
    Link: http://lkml.kernel.org/r/1471396924-10375-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • If we punch a hole on a reflink such that the following conditions are met:

    1. start offset is on a cluster boundary
    2. end offset is not on a cluster boundary
    3. (end offset is somewhere in another extent) or
    (hole range > MAX_CONTIG_BYTES(1MB)),

    we don't COW the first cluster starting at the start offset. But in this
    case, we were wrongly passing this cluster to
    ocfs2_zero_range_for_truncate() to zero out. This will modify the
    cluster in place and zero it in the source too.

    Fix this by skipping this cluster in such a scenario.

    To reproduce:

    1. Create a random file of say 10 MB
    xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
    2. Reflink it
    reflink -f 10MBfile reflnktest
    3. Punch a hole starting at a cluster boundary with a range greater than
    1MB. You can also use a range that will put the end offset in another
    extent.
    fallocate -p -o 0 -l 1048615 reflnktest
    4. sync
    5. Check the first cluster in the source file. (It will be zeroed out).
    dd if=10MBfile iflag=direct bs= count=1 | hexdump -C

    Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reported-by: Saar Maoz
    Reviewed-by: Srinivas Eeda
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Eric Ren
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashish Samant
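
    The offset conditions listed in this commit can be checked with simple
    arithmetic. The helper below is a sketch under assumptions: the name
    hits_punch_case is hypothetical, the cluster size is passed in by the
    caller, and only the size half of condition 3 is modelled (deciding
    whether the end offset lies in another extent would need the extent map).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CONTIG_BYTES (1024u * 1024u)  /* 1MB, as stated in the commit text */

/*
 * True when punching [start, start + len) meets condition 1 (aligned start),
 * condition 2 (unaligned end) and the "hole range > MAX_CONTIG_BYTES" half
 * of condition 3.
 */
bool hits_punch_case(uint64_t start, uint64_t len, uint32_t cluster_size)
{
    assert(cluster_size != 0);

    uint64_t end = start + len;
    bool start_aligned = (start % cluster_size) == 0;
    bool end_unaligned = (end % cluster_size) != 0;
    bool large_range = len > MAX_CONTIG_BYTES;

    return start_aligned && end_unaligned && large_range;
}
```

    With the reproducer's values (offset 0, length 1048615) the end offset
    is 39 bytes past the 1MB mark, so the case is hit for any power-of-two
    cluster size between 64 bytes and 1MB.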
     
  • If ocfs2_reserve_cluster_bitmap_bits() fails with ENOSPC, it will try to
    free truncate log and then retry. Since ocfs2_try_to_free_truncate_log
    will lock/unlock global bitmap inode, we have to unlock it before
    calling this function. But when the retried reservation fails with no
    global bitmap inode lock taken, the error handling branch unlocks it
    again and hits a BUG.

    This issue also exists when no retry is needed and ocfs2_inode_lock
    then fails. So fix it.

    Fixes: 2070ad1aebff ("ocfs2: retry on ENOSPC if sufficient space in truncate log")
    Link: http://lkml.kernel.org/r/57D91939.6030809@huawei.com
    Signed-off-by: Joseph Qi
    Signed-off-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • fanotify_get_response() calls fsnotify_remove_event() when it finds that
    group is being released from fanotify_release() (bypass_perm is set).

    However the event it removes need not be only in the group's notification
    queue but it can have already moved to access_list (userspace read the
    event before closing the fanotify instance fd) which is protected by a
    different lock. Thus when fsnotify_remove_event() races with
    fanotify_release() operating on access_list, the list can get corrupted.

    Fix the problem by moving all the logic removing permission events from
    the lists to one place - fanotify_release().

    Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
    Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reported-by: Miklos Szeredi
    Tested-by: Miklos Szeredi
    Reviewed-by: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a function that can be called when a group is being shut down
    to stop queueing new events to the group. Fanotify will use this.

    Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
    Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The root cause of this issue is the same as the one fixed by the last
    patch, but this time credits for the allocator inode and group
    descriptor may not be consumed before the transaction is extended.

    The following error was caught:

    WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
    Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2
    Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
    Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2]
    ocfs2_run_deallocs+0x70/0x270 [ocfs2]
    ocfs2_commit_truncate+0x474/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlinkat+0x22/0x40
    system_call_fastpath+0x12/0x71
    ---[ end trace a62437cb060baa71 ]---
    JBD2: rm wants too many credits (149 > 128)

    Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Every call to ocfs2_extend_trans() included a credit for the truncate
    log inode, but because that inode was already managed by the running
    jbd2 transaction after the first call, the credit is not consumed until
    jbd2_journal_restart().

    Since the total credits to extend always included the unconsumed ones,
    the unconsumed credits keep growing until jbd2_journal_restart()
    finally fails because the credit count exceeds half of the maximum
    transaction credits.

    The following error was caught when unlinking a large file with many
    extents:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
    Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 #2
    Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
    Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_replay_truncate_records+0x93/0x360 [ocfs2]
    __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2]
    ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2]
    ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlink+0x16/0x20
    system_call_fastpath+0x12/0x71
    ---[ end trace 28aa7410e69369cf ]---
    JBD2: unlink wants too many credits (251 > 128)

    Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
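
    The credit growth described above can be shown with a toy model. This is
    not jbd2 code: rounds_before_overflow and its parameters are
    hypothetical, and the only behavior modelled is that leftover credits
    are carried into each new request while each round consumes fewer
    credits than it asked for.

```c
#include <assert.h>

/*
 * Returns how many extend rounds succeed before the requested total
 * exceeds limit. Each round asks for `requested` new credits on top of
 * what is still held, then consumes only `consumed` of them.
 */
int rounds_before_overflow(int requested, int consumed, int limit)
{
    int held = 0;    /* credits currently attached to the handle */
    int rounds = 0;

    /* requested > consumed guarantees the loop terminates */
    assert(requested > consumed && consumed >= 0 && limit > 0);
    for (;;) {
        int want = held + requested;  /* leftovers carried into the request */

        if (want > limit)
            return rounds;
        held = want - consumed;       /* the unconsumed part stays behind */
        rounds++;
    }
}
```

    With one unconsumed credit per round the request grows by one each time,
    so a long enough truncate eventually crosses any fixed limit, which
    matches the "wants too many credits" message in the trace.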
     
  • Commit c01d5b300774 ("shmem: get_unmapped_area align huge page") makes
    use of shm_get_unmapped_area() in shm_file_operations() unconditional to
    CONFIG_MMU.

    As Tony Battersby pointed out, this can lead to a NULL-pointer
    dereference on machines with CONFIG_MMU=y and CONFIG_SHMEM=n. In this
    case ipc/shm is backed by ramfs, which doesn't provide
    f_op->get_unmapped_area for configurations with MMU.

    The solution is to provide a dummy f_op->get_unmapped_area for ramfs when
    CONFIG_MMU=y, which just calls current->mm->get_unmapped_area().

    Fixes: c01d5b300774 ("shmem: get_unmapped_area align huge page")
    Link: http://lkml.kernel.org/r/20160912102704.140442-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tony Battersby
    Tested-by: Tony Battersby
    Cc: Hugh Dickins
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Somewhere along the way the autofs expire operation has changed to hold
    a spin lock over expired dentry selection. The autofs indirect mount
    expired dentry selection is complicated and quite lengthy so it isn't
    appropriate to hold a spin lock over the operation.

    Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
    might_sleep() to dput() causing a WARN_ONCE() about this usage to be
    issued.

    But the spin lock doesn't need to be held over this check; the autofs
    dentry info flags are enough to block walks into dentries during the
    expire.

    I've left the direct mount expire as it is (for now) because it is much
    simpler and quicker than the indirect mount expire and adding spin lock
    release and re-acquires would do nothing more than add overhead.

    Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()")
    Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Reported-by: Takashi Iwai
    Tested-by: Takashi Iwai
    Cc: Takashi Iwai
    Cc: NeilBrown
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
    checks if lockres master has changed to identify whether new master has
    finished recovery or not. This introduces a race: right after the
    old master does umount (which means the master will change), a new
    convert request comes in.

    In this case, it will reset lockres state to DLM_RECOVERING and then
    retry convert, and then fail with lockres->l_action being set to
    OCFS2_AST_INVALID, which will cause inconsistent lock level between
    ocfs2 and dlm, and then finally BUG.

    Since dlm recovery will clear lock->convert_pending in
    dlm_move_lockres_to_recovery_list, we can use it to correctly identify
    the race case between convert and recovery. So fix it.

    Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
    Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
    Signed-off-by: Joseph Qi
    Signed-off-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

17 Sep, 2016

1 commit


16 Sep, 2016

3 commits

  • This ensures that do_mmap() won't implicitly make AIO memory mappings
    executable if the READ_IMPLIES_EXEC personality flag is set. Such
    behavior is problematic because the security_mmap_file LSM hook doesn't
    catch this case, potentially permitting an attacker to bypass a W^X
    policy enforced by SELinux.

    I have tested the patch on my machine.

    To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
            personality(READ_IMPLIES_EXEC);
            aio_context_t ctx = 0;
            if (syscall(__NR_io_setup, 1, &ctx))
                    err(1, "io_setup");

            char cmd[1000];
            sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
                    (int)getpid());
            system(cmd);
            return 0;
    }

    In the output, "rw-s" is good, "rwxs" is bad.

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Kirill A Shutemov reports that the kernel doesn't try to cap dest_count
    in any way, and uses the number to allocate kernel memory. This causes
    high order allocation warnings in the kernel log if someone passes in a
    big enough value. We should clamp the allocation at PAGE_SIZE to avoid
    stressing the VM.

    The two existing users of the dedupe ioctl never send more than 120
    requests, so we can safely clamp dest_range at PAGE_SIZE, because with
    4k pages we can handle up to 127 dedupe candidates. Given the max
    extent length of 16MB, we can end up doing 2GB of IO which is plenty.

    [ Note: the "offsetof()" can't overflow, because 'count' is just a
    16-bit integer. That's not obvious in the limited context of the
    patch, so I'm noting it here because it made me go look. - Linus ]

    Reported-by: "Kirill A. Shutemov"
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
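
    The figure of 127 candidates follows from the structure sizes. The
    sketch below reproduces the arithmetic with local mirror structs; their
    field layout is an assumption modelled on the dedupe structures in
    <linux/fs.h>, and the names dedupe_range, dedupe_range_info and
    max_dedupe_candidates are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the ioctl structures; layout assumed, not included. */
struct dedupe_range_info {
    int64_t  dest_fd;
    uint64_t dest_offset;
    uint64_t bytes_deduped;
    int32_t  status;
    uint32_t reserved;
};

struct dedupe_range {
    uint64_t src_offset;
    uint64_t src_length;
    uint16_t dest_count;
    uint16_t reserved1;
    uint32_t reserved2;
    struct dedupe_range_info info[];
};

/* Number of info entries that fit in one page after the fixed header. */
size_t max_dedupe_candidates(size_t page_size)
{
    size_t header = offsetof(struct dedupe_range, info);

    assert(page_size >= header);
    return (page_size - header) / sizeof(struct dedupe_range_info);
}
```

    Under these assumptions the header is 24 bytes and each entry 32 bytes,
    so a 4096-byte page holds (4096 - 24) / 32 = 127 entries, and 127
    extents of 16MB come to 2032MB, roughly the 2GB quoted above.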
     
  • All the VFS functions in the dedupe ioctl path return int status, so
    the ioctl handler ought to as well.

    Found by Coverity, CID 1350952.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

13 Sep, 2016

1 commit

  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
    state accounting
    - Fix the CREATE_SESSION slot number

    Bugfixes:
    - sunrpc: fix a UDP memory accounting regression
    - NFS: Fix an error reporting regression in nfs_file_write()
    - pNFS: Fix further layout stateid issues
    - RPC/rdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer
    exhaustion...")
    - RPC/rdma: Fix receive buffer accounting"

    * tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4.1: Fix the CREATE_SESSION slot number accounting
    xprtrdma: Fix receive buffer accounting
    xprtrdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")
    pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
    pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
    pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
    pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
    NFS: Fix error reporting in nfs_file_write()
    sunrpc: fix UDP memory accounting

    Linus Torvalds
     

12 Sep, 2016

1 commit

  • Ensure that we conform to the algorithm described in RFC5661, section
    18.36.4 for when to bump the sequence id. In essence we do it for all
    cases except when the RPC call timed out, or in case of the server returning
    NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
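
    The rule stated in this commit can be written as a small predicate. This
    is a sketch, not the NFS client code: should_bump_seqid and
    RPC_CALL_TIMED_OUT are hypothetical names, and a negative marker is
    assumed for the timeout case since it is not a protocol status. The two
    NFS4ERR_* values are the numeric codes from the NFSv4 protocol
    definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Status codes from the NFSv4 protocol definition. */
#define NFS4ERR_DELAY           10008
#define NFS4ERR_STALE_CLIENTID  10022

/* Hypothetical marker for "the RPC call timed out"; not a protocol value. */
#define RPC_CALL_TIMED_OUT      (-1)

static_assert(RPC_CALL_TIMED_OUT < 0,
              "timeout marker must not collide with protocol status codes");

/* Bump the sequence id in every case except the three named in the commit. */
bool should_bump_seqid(int status)
{
    switch (status) {
    case RPC_CALL_TIMED_OUT:
    case NFS4ERR_DELAY:
    case NFS4ERR_STALE_CLIENTID:
        return false;
    default:
        return true;
    }
}
```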
     

11 Sep, 2016

2 commits

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     
  • Pull fscrypto fixes from Ted Ts'o:
    "Fix some brown-paper-bag bugs for fscrypto, including one which
    allows a malicious user to set an encryption policy on an empty
    directory which they do not own"

    * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    fscrypto: require write access to mount to set encryption policy
    fscrypto: only allow setting encryption policy on directories
    fscrypto: add authorization check for setting encryption policy

    Linus Torvalds
     

10 Sep, 2016

10 commits

  • Since setting an encryption policy requires writing metadata to the
    filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
    Otherwise, a user could cause a write to a frozen or readonly
    filesystem. This was handled correctly by f2fs but not by ext4. Make
    fscrypt_process_policy() handle it rather than relying on the filesystem
    to get it right.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o
    Acked-by: Jaegeuk Kim

    Eric Biggers
     
  • Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • The patch
    fs/cifs: make share unaccessible at root level mountable
    makes use of prepaths when any component of the underlying path is
    inaccessible.

    When mounting two separate shares that have different prepaths but are
    otherwise similar, we end up sharing superblocks when we shouldn't be
    doing so.

    Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • Fix memory leaks introduced by the patch
    fs/cifs: make share unaccessible at root level mountable

    Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().

    Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption
    policy on nondirectory files. This was unintentional, and in the case
    of nonempty regular files did not behave as expected because existing
    data was not actually encrypted by the ioctl.

    In the case of ext4, the user could also trigger filesystem errors in
    ->empty_dir(), e.g. due to mismatched "directory" checksums when the
    kernel incorrectly tried to interpret a regular file as a directory.

    This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with
    kernels v4.6 and later. It appears that older kernels only permitted
    directories and that the check was accidentally lost during the
    refactoring to share the file encryption code between ext4 and f2fs.

    This patch restores the !S_ISDIR() check that was present in older
    kernels.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     
  • On an ext4 or f2fs filesystem with file encryption supported, a user
    could set an encryption policy on any empty directory(*) to which they
    had readonly access. This is obviously problematic, since such a
    directory might be owned by another user and the new encryption policy
    would prevent that other user from creating files in their own directory
    (for example).

    Fix this by requiring inode_owner_or_capable() permission to set an
    encryption policy. This means that either the caller must own the file,
    or the caller must have the capability CAP_FOWNER.

    (*) Or also on any regular file, for f2fs v4.6 and later and ext4
    v4.8-rc1 and later; a separate bug fix is coming for that.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     
  • Attempting to dump /proc//smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
    [] smaps_pte_range+0xa0/0x4b0
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
    Call Trace:
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pull fuse fix from Miklos Szeredi:
    "This fixes a deadlock when fuse, direct I/O and loop device are
    combined"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: direct-io: don't dirty ITER_BVEC pages

    Linus Torvalds
     
  • Pull overlayfs fix from Miklos Szeredi:
    "This fixes a regression caused by the last pull request"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix workdir creation

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "I'm not proud of how long it took me to track down that one liner in
    btrfs_sync_log(), but the good news is the patches I was trying to
    blame for these problems were actually fine (sorry Filipe)"

    * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
    btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
    btrfs: do not decrease bytes_may_use when replaying extents

    Linus Torvalds
     

08 Sep, 2016

1 commit


06 Sep, 2016

2 commits

  • In btrfs_async_reclaim_metadata_space(), we use ticket's address to
    determine whether asynchronous metadata reclaim work is making progress.

    ticket = list_first_entry(&space_info->tickets,
                              struct reserve_ticket, list);
    if (last_ticket == ticket) {
            flush_state++;
    } else {
            last_ticket = ticket;
            flush_state = FLUSH_DELAYED_ITEMS_NR;
            if (commit_cycles)
                    commit_cycles--;
    }

    But this is wrong: we should not rely on a local variable's address for
    this check, because addresses may repeat. In my test environment, I dd
    one 168MB file into a 256MB fs and found that, for this file, every time
    wait_reserve_ticket() is called, the local variable ticket has the same
    address.

    For the code above, assume a previous ticket's address is addrA, so
    last_ticket is addrA. btrfs_async_reclaim_metadata_space() finishes this
    ticket and wakes it up, then another ticket is added with the same
    address addrA. Now last_ticket equals the current ticket, so the current
    ticket's flush work starts from the current flush_state instead of the
    initial FLUSH_DELAYED_ITEMS_NR, which may result in some enospc issues
    (I have seen this on my test machine).

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
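
    The address-reuse problem can be reproduced in a few lines. The sketch
    below is a userspace illustration, not btrfs code: the counter name
    tickets_id comes from the title of this patch, while the structs, the
    helpers and the choice to bump the counter when a ticket completes are
    assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ticket {
    int bytes;
};

/* Hypothetical stand-in for the space info: one counter per ticket list. */
struct space_info {
    uint64_t tickets_id;
};

/* Address-based check: a new ticket at a recycled address looks unchanged. */
bool no_progress_by_address(const struct ticket *last, const struct ticket *cur)
{
    return last == cur;
}

/* Counter-based check: progress stays visible even if addresses repeat. */
bool no_progress_by_id(uint64_t last_id, const struct space_info *si)
{
    assert(si != NULL);
    return last_id == si->tickets_id;
}

/* Assumed here: completing a ticket advances the counter. */
void ticket_done(struct space_info *si)
{
    assert(si != NULL);
    si->tickets_id++;
}
```

    When one ticket completes and the next one happens to occupy the same
    storage, the pointer comparison reports "no progress" while the counter
    comparison correctly reports a change.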
     
  • We use a btrfs_log_ctx structure to pass information into the
    tree log commit, and get error values out. It gets added to a per
    log-transaction list which we walk when things go bad.

    Commit d1433debe added an optimization to skip waiting for the log
    commit, but didn't take root_log_ctx out of the list. This
    patch makes sure we remove things before exiting.

    Signed-off-by: Chris Mason
    Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4
    cc: stable@vger.kernel.org # 3.15+

    Chris Mason
     

05 Sep, 2016

4 commits

  • When replaying extents, there is no need to update bytes_may_use
    in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
    WARN_ON about bytes_may_use.

    Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
     
  • Commit f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
    fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
    comparison operator with an assignment one.

    This looks like a typo which is reported by clang when building the
    kernel with some warning flags:

    fs/ceph/dir.c:600:22: error: using the result of an assignment as a
    condition without parentheses [-Werror,-Wparentheses]
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
    fs/ceph/dir.c:600:22: note: place parentheses around the assignment
    to silence this warning
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ^
    ( )
    fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
    assignment into an inequality comparison
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ^~
    !=

    Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    Signed-off-by: Nicolas Iooss
    Signed-off-by: Ilya Dryomov

    Nicolas Iooss
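
    The difference between the two operators is easy to demonstrate in
    isolation. The functions below are a standalone sketch with hypothetical
    names; they take a plain unsigned value instead of the ceph file info
    and fpos_frag() result used in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Buggy form: OR-assigns into *frag and then tests the combined value. */
bool frag_changed_buggy(unsigned *frag, unsigned new_frag)
{
    assert(frag != NULL);
    return (*frag |= new_frag) != 0;
}

/* Intended form: a pure comparison that leaves *frag untouched. */
bool frag_changed_fixed(const unsigned *frag, unsigned new_frag)
{
    assert(frag != NULL);
    return *frag != new_frag;
}
```

    The buggy form reports a change whenever either value is non-zero, even
    if they are equal, and it also modifies the stored frag as a side
    effect.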
     
  • Workdir creation fails in latest kernel.

    Fix by allowing EOPNOTSUPP as a valid return value from
    vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
    ACL and still be perfectly able to support overlayfs.

    Reported-by: Martin Ziegler
    Signed-off-by: Miklos Szeredi
    Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir")
    Cc:

    Miklos Szeredi
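
    The shape of this fix is a filter on the return value. The helper below
    is a userspace sketch with a hypothetical name; it assumes the
    kernel-style convention of returning 0 or a negative errno and does not
    call vfs_removexattr() itself.

```c
#include <assert.h>
#include <errno.h>

/*
 * Treat "ACLs not supported by the upper filesystem" as success and pass
 * every other result through unchanged.
 */
int ignore_acl_unsupported(int err)
{
    assert(err <= 0);  /* 0 or a negative errno */
    if (err == -EOPNOTSUPP)
        return 0;
    return err;
}
```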
     
  • If there are outstanding LAYOUTGET rpc calls, then we want to ensure that
    we keep the layout stateid around so that we don't inadvertently pick up
    an old/misordered sequence id.
    The race is as follows:

    Client                          Server
    ======                          ======
    LAYOUTGET(seqid)
    LAYOUTGET(seqid)
                                    return LAYOUTGET(seqid+1)
                                    return LAYOUTGET(seqid+2)
    process LAYOUTGET(seqid+2)
    forget layout
    process LAYOUTGET(seqid+1)

    If it forgets the layout stateid before processing seqid+1, then
    the client will not check the layout->plh_barrier, and so will set
    the stateid with seqid+1.

    Signed-off-by: Trond Myklebust

    Trond Myklebust