23 Sep, 2016

1 commit


21 Sep, 2016

2 commits

  • Reading the kcore file trips the hardened usercopy check for kernel
    text access:

    usercopy: kernel memory exposure attempt detected from ffffffff8179a01f () (4065 bytes)
    kernel BUG at mm/usercopy.c:75!

    Bypass this check for kcore by adding a bounce buffer for ktext data.
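
A minimal userspace sketch of the bounce-buffer idea, not the kernel's kcore code. The names `copy_via_bounce` and `BOUNCE_SIZE` are hypothetical. In the kernel the second copy would be a copy_to_user() from the bounce buffer, so the usercopy check only ever sees the bounce buffer's address and never a kernel-text address.

```c
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096 /* hypothetical chunk size */

/*
 * Copy len bytes from src to dst through an intermediate buffer,
 * one chunk at a time. Returns the number of bytes copied.
 */
size_t copy_via_bounce(void *dst, const void *src, size_t len)
{
	static unsigned char bounce[BOUNCE_SIZE];
	const unsigned char *in = src;
	unsigned char *out = dst;
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done;

		if (chunk > BOUNCE_SIZE)
			chunk = BOUNCE_SIZE;
		memcpy(bounce, in + done, chunk);  /* source -> bounce */
		memcpy(out + done, bounce, chunk); /* bounce -> destination */
		done += chunk;
	}
	return done;
}
```

The extra copy costs some bandwidth, which is presumably acceptable for a debugging interface such as kcore.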

    Reported-by: Steve Best
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Suggested-by: Kees Cook
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
     
  • The next patch adds a bounce buffer for the ktext area, so it's
    convenient to have a single bounce buffer for both the
    vmalloc/module and ktext cases.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
     

20 Sep, 2016

10 commits

  • This reverts commit 38b52efd218b ("ocfs2: bump up o2cb network protocol
    version").

    This commit made rolling upgrades fail. When one node is upgraded to a
    new version with this commit, the remaining nodes fail to establish
    connections to it, so applications such as VMs on the remaining nodes
    can't be live migrated to the upgraded one. This causes an outage.
    Since the negotiate hb timeout behavior didn't change without this
    commit, revert it.

    Fixes: 38b52efd218bf ("ocfs2: bump up o2cb network protocol version")
    Link: http://lkml.kernel.org/r/1471396924-10375-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • If we punch a hole on a reflink such that following conditions are met:

    1. start offset is on a cluster boundary
    2. end offset is not on a cluster boundary
    3. (end offset is somewhere in another extent) or
    (hole range > MAX_CONTIG_BYTES(1MB)),

    we don't COW the first cluster starting at the start offset. But in this
    case, we were wrongly passing this cluster to
    ocfs2_zero_range_for_truncate() to zero out. This will modify the
    cluster in place and zero it in the source too.

    Fix this by skipping this cluster in such a scenario.

    To reproduce:

    1. Create a random file of say 10 MB
    xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
    2. Reflink it
    reflink -f 10MBfile reflnktest
    3. Punch a hole starting at a cluster boundary with a range greater than
    1MB. You can also use a range that will put the end offset in another
    extent.
    fallocate -p -o 0 -l 1048615 reflnktest
    4. sync
    5. Check the first cluster in the source file. (It will be zeroed out).
    dd if=10MBfile iflag=direct bs= count=1 | hexdump -C
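
The trigger conditions listed above can be checked arithmetically. The following is a rough sketch with hypothetical helper names; it covers conditions 1 and 2 and only the size half of condition 3, because the "end offset in another extent" half would need the file's extent map. The cluster size is a parameter since it depends on how the volume was formatted.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CONTIG_BYTES (1024ULL * 1024ULL) /* 1MB, as stated above */

/* True when offset lies exactly on a cluster boundary. */
bool on_cluster_boundary(uint64_t offset, uint64_t cluster_size)
{
	return offset % cluster_size == 0;
}

/*
 * True when the hole [start, start + len) starts on a cluster boundary,
 * ends off a cluster boundary, and is larger than MAX_CONTIG_BYTES.
 */
bool hole_matches_bug_conditions(uint64_t start, uint64_t len,
				 uint64_t cluster_size)
{
	uint64_t end = start + len;

	return on_cluster_boundary(start, cluster_size) &&
	       !on_cluster_boundary(end, cluster_size) &&
	       len > MAX_CONTIG_BYTES;
}
```

With the fallocate arguments from step 3 (offset 0, length 1048615) the check holds for any power-of-two cluster size up to 1MB, because 1048615 is 39 bytes past the 1MB mark.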

    Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reported-by: Saar Maoz
    Reviewed-by: Srinivas Eeda
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Eric Ren
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashish Samant
     
  • If ocfs2_reserve_cluster_bitmap_bits() fails with ENOSPC, it will try to
    free truncate log and then retry. Since ocfs2_try_to_free_truncate_log
    will lock/unlock global bitmap inode, we have to unlock it before
    calling this function. But when the retried reservation fails with no
    global bitmap inode lock taken, the error handling branch unlocks it
    again and hits a BUG.

    This issue also exists if no retry is needed and ocfs2_inode_lock then
    fails. So fix it.

    Fixes: 2070ad1aebff ("ocfs2: retry on ENOSPC if sufficient space in truncate log")
    Link: http://lkml.kernel.org/r/57D91939.6030809@huawei.com
    Signed-off-by: Joseph Qi
    Signed-off-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • fanotify_get_response() calls fsnotify_remove_event() when it finds that
    group is being released from fanotify_release() (bypass_perm is set).

    However, the event it removes need not be in the group's notification
    queue; it may already have moved to access_list (userspace read the
    event before closing the fanotify instance fd), which is protected by a
    different lock. Thus when fsnotify_remove_event() races with
    fanotify_release() operating on access_list, the list can get corrupted.

    Fix the problem by moving all the logic removing permission events from
    the lists to one place - fanotify_release().

    Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
    Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reported-by: Miklos Szeredi
    Tested-by: Miklos Szeredi
    Reviewed-by: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a function that can be called when a group is being shut down
    to stop queueing new events to the group. Fanotify will use this.

    Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
    Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The root cause of this issue is the same as the one fixed by the previous
    patch, but this time the credits for the allocator inode and group
    descriptor may not be consumed before the transaction is extended.

    The following error was caught:

    WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
    Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2
    Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
    Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2]
    ocfs2_run_deallocs+0x70/0x270 [ocfs2]
    ocfs2_commit_truncate+0x474/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlinkat+0x22/0x40
    system_call_fastpath+0x12/0x71
    ---[ end trace a62437cb060baa71 ]---
    JBD2: rm wants too many credits (149 > 128)

    Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Every call to ocfs2_extend_trans() included a credit for the truncate log
    inode, but since that inode was already managed by the running jbd2
    transaction the first time, that credit is not consumed until
    jbd2_journal_restart().

    Since the total credits to extend always included the unconsumed ones,
    more and more unconsumed credits accumulate, and eventually
    jbd2_journal_restart() fails because the credit count exceeds half of
    the max transaction credits.
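
The accumulation can be illustrated with a toy model. This is only a sketch under stated assumptions: every extend is assumed to fall through to a restart that requests the leftover credits plus the new ones, and the function name and parameters are hypothetical. It is not jbd2 code.

```c
/*
 * Toy model: each round requests (leftover + nblocks) credits and then
 * consumes only `consumed` of the newly added ones. Returns the round in
 * which the request first exceeds `limit`, or -1 if nothing leaks and
 * the request therefore never grows.
 */
int rounds_until_restart_fails(int nblocks, int consumed, int limit)
{
	int leftover = 0; /* credits still unconsumed on the handle */
	int round = 0;

	if (consumed >= nblocks)
		return -1;

	for (;;) {
		int request = leftover + nblocks; /* old credits + new credits */

		round++;
		if (request > limit)
			return round;
		leftover = request - consumed;
	}
}
```

In this model, leaking a single credit per round is enough to cross any fixed limit after a number of rounds proportional to that limit, which matches the observation that the failure shows up on files with many extents.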

    The following error was caught when unlinking a large file with many
    extents:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
    Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 #2
    Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
    Call Trace:
    dump_stack+0x48/0x5c
    warn_slowpath_common+0x95/0xe0
    warn_slowpath_null+0x1a/0x20
    start_this_handle+0x4c3/0x510 [jbd2]
    jbd2__journal_restart+0x161/0x1b0 [jbd2]
    jbd2_journal_restart+0x13/0x20 [jbd2]
    ocfs2_extend_trans+0x74/0x220 [ocfs2]
    ocfs2_replay_truncate_records+0x93/0x360 [ocfs2]
    __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2]
    ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2]
    ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2]
    ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
    ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
    ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
    ocfs2_evict_inode+0x28/0x60 [ocfs2]
    evict+0xab/0x1a0
    iput_final+0xf6/0x190
    iput+0xc8/0xe0
    do_unlinkat+0x1b7/0x310
    SyS_unlink+0x16/0x20
    system_call_fastpath+0x12/0x71
    ---[ end trace 28aa7410e69369cf ]---
    JBD2: unlink wants too many credits (251 > 128)

    Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Commit c01d5b300774 ("shmem: get_unmapped_area align huge page") makes
    use of shm_get_unmapped_area() in shm_file_operations() unconditional to
    CONFIG_MMU.

    As Tony Battersby pointed out, this can lead to a NULL-pointer dereference
    on machines with CONFIG_MMU=y and CONFIG_SHMEM=n. In this case ipc/shm is
    backed by ramfs, which doesn't provide f_op->get_unmapped_area for
    configurations with MMU.

    The solution is to provide a dummy f_op->get_unmapped_area for ramfs when
    CONFIG_MMU=y, which just calls current->mm->get_unmapped_area().
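
A userspace model of the forwarding hook, assuming simplified stand-in types. All names ending in `_model`, the `current_mm` global, and the three-argument hook signature are hypothetical; the real kernel hook takes more arguments and uses current->mm.

```c
struct file_model;

/* Simplified hook signature; the kernel's version takes more arguments. */
typedef unsigned long (*get_area_fn)(struct file_model *file,
				     unsigned long addr, unsigned long len);

struct file_ops_model { get_area_fn get_unmapped_area; };
struct mm_model { get_area_fn get_unmapped_area; };
struct file_model { const struct file_ops_model *f_op; };

/* Stand-in for current->mm. */
struct mm_model *current_mm;

/* Toy default placement policy: honour the hint, else a fixed address. */
unsigned long mm_default_get_area(struct file_model *file,
				  unsigned long addr, unsigned long len)
{
	(void)file;
	(void)len;
	return addr ? addr : 0x10000UL;
}

/* Dummy hook for the backing file system: forward to the mm default. */
unsigned long backing_get_unmapped_area(struct file_model *file,
					unsigned long addr, unsigned long len)
{
	return current_mm->get_unmapped_area(file, addr, len);
}

/* Caller that uses the hook unconditionally; a NULL hook would crash here. */
unsigned long shm_get_area_model(struct file_model *backing,
				 unsigned long addr, unsigned long len)
{
	return backing->f_op->get_unmapped_area(backing, addr, len);
}
```

Providing the hook on the backing file system keeps the caller free of NULL checks, at the cost of one extra indirect call.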

    Fixes: c01d5b300774 ("shmem: get_unmapped_area align huge page")
    Link: http://lkml.kernel.org/r/20160912102704.140442-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tony Battersby
    Tested-by: Tony Battersby
    Cc: Hugh Dickins
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Somewhere along the way the autofs expire operation has changed to hold
    a spin lock over expired dentry selection. The autofs indirect mount
    expired dentry selection is complicated and quite lengthy so it isn't
    appropriate to hold a spin lock over the operation.

    Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
    might_sleep() to dput() causing a WARN_ONCE() about this usage to be
    issued.

    But the spin lock doesn't need to be held over this check; the autofs
    dentry info flags are enough to block walks into dentries during the
    expire.

    I've left the direct mount expire as it is (for now) because it is much
    simpler and quicker than the indirect mount expire and adding spin lock
    release and re-acquires would do nothing more than add overhead.

    Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()")
    Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Reported-by: Takashi Iwai
    Tested-by: Takashi Iwai
    Cc: Takashi Iwai
    Cc: NeilBrown
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
    checks if lockres master has changed to identify whether new master has
    finished recovery or not. This introduces a race when a new convert
    request arrives right after the old master does umount (which means the
    master will change).

    In this case, it will reset lockres state to DLM_RECOVERING and then
    retry convert, and then fail with lockres->l_action being set to
    OCFS2_AST_INVALID, which will cause inconsistent lock level between
    ocfs2 and dlm, and then finally BUG.

    Since dlm recovery will clear lock->convert_pending in
    dlm_move_lockres_to_recovery_list, we can use it to correctly identify
    the race case between convert and recovery. So fix it.

    Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
    Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
    Signed-off-by: Joseph Qi
    Signed-off-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

17 Sep, 2016

1 commit


16 Sep, 2016

3 commits

  • This ensures that do_mmap() won't implicitly make AIO memory mappings
    executable if the READ_IMPLIES_EXEC personality flag is set. Such
    behavior is problematic because the security_mmap_file LSM hook doesn't
    catch this case, potentially permitting an attacker to bypass a W^X
    policy enforced by SELinux.

    I have tested the patch on my machine.

    To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
            (int)getpid());
        system(cmd);
        return 0;
    }

    In the output, "rw-s" is good, "rwxs" is bad.

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Kirill A Shutemov reports that the kernel doesn't try to cap dest_count
    in any way, and uses the number to allocate kernel memory. This causes
    high order allocation warnings in the kernel log if someone passes in a
    big enough value. We should clamp the allocation at PAGE_SIZE to avoid
    stressing the VM.

    The two existing users of the dedupe ioctl never send more than 120
    requests, so we can safely clamp dest_range at PAGE_SIZE, because with
    4k pages we can handle up to 127 dedupe candidates. Given the max
    extent length of 16MB, we can end up doing 2GB of IO which is plenty.

    [ Note: the "offsetof()" can't overflow, because 'count' is just a
    16-bit integer. That's not obvious in the limited context of the
    patch, so I'm noting it here because it made me go look. - Linus ]
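
The figure of 127 candidates per 4k page follows from the request layout. The sketch below uses local mirrors of the UAPI structures so that it is self-contained. The `_model` type names and both helper functions are hypothetical, and the field layout is an assumption copied from the `struct file_dedupe_range` and `struct file_dedupe_range_info` definitions in <linux/fs.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct file_dedupe_range_info (assumed layout). */
struct dedupe_info_model {
	int64_t  dest_fd;
	uint64_t dest_offset;
	uint64_t bytes_deduped;
	int32_t  status;
	uint32_t reserved;
};

/* Local mirror of struct file_dedupe_range (assumed layout). */
struct dedupe_range_model {
	uint64_t src_offset;
	uint64_t src_length;
	uint16_t dest_count;
	uint16_t reserved1;
	uint32_t reserved2;
	struct dedupe_info_model info[];
};

/* Bytes needed for a request that carries `count` candidates. */
size_t dedupe_request_size(uint16_t count)
{
	return offsetof(struct dedupe_range_model, info) +
	       (size_t)count * sizeof(struct dedupe_info_model);
}

/* Largest candidate count whose request still fits in page_size bytes. */
unsigned int max_dedupe_candidates(size_t page_size)
{
	size_t header = offsetof(struct dedupe_range_model, info);

	if (page_size < header)
		return 0;
	return (unsigned int)((page_size - header) /
			      sizeof(struct dedupe_info_model));
}
```

With a 24-byte header and 32-byte entries, a 4096-byte page holds (4096 - 24) / 32 = 127 entries, and 127 candidates at 16MB each come to 2032MB, which is the "2GB of IO" mentioned above.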

    Reported-by: "Kirill A. Shutemov"
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • All the VFS functions in the dedupe ioctl path return int status, so
    the ioctl handler ought to as well.

    Found by Coverity, CID 1350952.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

13 Sep, 2016

2 commits

  • Conflicts:
    drivers/net/ethernet/mediatek/mtk_eth_soc.c
    drivers/net/ethernet/qlogic/qed/qed_dcbx.c
    drivers/net/phy/Kconfig

    All conflicts were cases of overlapping commits.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
    state accounting
    - Fix the CREATE_SESSION slot number

    Bugfixes:
    - sunrpc: fix a UDP memory accounting regression
    - NFS: Fix an error reporting regression in nfs_file_write()
    - pNFS: Fix further layout stateid issues
    - RPC/rdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer
    exhaustion...")
    - RPC/rdma: Fix receive buffer accounting"

    * tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4.1: Fix the CREATE_SESSION slot number accounting
    xprtrdma: Fix receive buffer accounting
    xprtrdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")
    pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
    pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
    pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
    pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
    NFS: Fix error reporting in nfs_file_write()
    sunrpc: fix UDP memory accounting

    Linus Torvalds
     

12 Sep, 2016

1 commit

  • Ensure that we conform to the algorithm described in RFC5661, section
    18.36.4 for when to bump the sequence id. In essence we do it for all
    cases except when the RPC call timed out, or in case of the server returning
    NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID.
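
The rule can be written as a small predicate. This is a sketch, not the client's code: the function name is hypothetical, the negative-status convention and the use of -ETIMEDOUT for a timed-out RPC are assumptions, and the two NFS status values are the ones assigned in RFC 5661.

```c
#include <errno.h>
#include <stdbool.h>

/* NFSv4.1 status codes as assigned in RFC 5661. */
#define NFS4ERR_DELAY          10008
#define NFS4ERR_STALE_CLIENTID 10022

/*
 * Decide whether the CREATE_SESSION sequence id should be bumped for a
 * given call status (0 on success, negative on error).
 */
bool should_bump_seqid(int status)
{
	switch (status) {
	case -ETIMEDOUT:              /* assumed encoding of an RPC timeout */
	case -NFS4ERR_DELAY:
	case -NFS4ERR_STALE_CLIENTID:
		return false;
	default:
		return true;          /* success and every other error */
	}
}
```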

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     

11 Sep, 2016

2 commits

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     
  • Pull fscrypto fixes from Ted Ts'o:
    "Fix some brown-paper-bag bugs for fscrypto, including one which
    allows a malicious user to set an encryption policy on an empty
    directory which they do not own"

    * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    fscrypto: require write access to mount to set encryption policy
    fscrypto: only allow setting encryption policy on directories
    fscrypto: add authorization check for setting encryption policy

    Linus Torvalds
     

10 Sep, 2016

10 commits

  • Since setting an encryption policy requires writing metadata to the
    filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
    Otherwise, a user could cause a write to a frozen or readonly
    filesystem. This was handled correctly by f2fs but not by ext4. Make
    fscrypt_process_policy() handle it rather than relying on the filesystem
    to get it right.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o
    Acked-by: Jaegeuk Kim

    Eric Biggers
     
  • Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • The patch
    fs/cifs: make share unaccessible at root level mountable
    makes use of prepaths when any component of the underlying path is
    inaccessible.

    When mounting two separate shares that have different prepaths but are
    otherwise similar, we end up sharing superblocks when we shouldn't.

    Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • Fix memory leaks introduced by the patch
    fs/cifs: make share unaccessible at root level mountable

    Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().

    Signed-off-by: Sachin Prabhu
    Tested-by: Aurelien Aptel
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption
    policy on nondirectory files. This was unintentional, and in the case
    of nonempty regular files did not behave as expected because existing
    data was not actually encrypted by the ioctl.

    In the case of ext4, the user could also trigger filesystem errors in
    ->empty_dir(), e.g. due to mismatched "directory" checksums when the
    kernel incorrectly tried to interpret a regular file as a directory.

    This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with
    kernels v4.6 and later. It appears that older kernels only permitted
    directories and that the check was accidentally lost during the
    refactoring to share the file encryption code between ext4 and f2fs.

    This patch restores the !S_ISDIR() check that was present in older
    kernels.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     
  • On an ext4 or f2fs filesystem with file encryption supported, a user
    could set an encryption policy on any empty directory(*) to which they
    had readonly access. This is obviously problematic, since such a
    directory might be owned by another user and the new encryption policy
    would prevent that other user from creating files in their own directory
    (for example).

    Fix this by requiring inode_owner_or_capable() permission to set an
    encryption policy. This means that either the caller must own the file,
    or the caller must have the capability CAP_FOWNER.

    (*) Or also on any regular file, for f2fs v4.6 and later and ext4
    v4.8-rc1 and later; a separate bug fix is coming for that.
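
The permission rule described above amounts to the following check. This is a userspace sketch with a hypothetical function name and plain integer user IDs; the -EACCES return value is an assumption about how the denial is reported.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Allow setting a policy only if the caller owns the file or holds
 * CAP_FOWNER. Returns 0 when allowed, -EACCES otherwise.
 */
int may_set_encryption_policy(unsigned int caller_uid,
			      unsigned int owner_uid,
			      bool has_cap_fowner)
{
	if (caller_uid == owner_uid || has_cap_fowner)
		return 0;
	return -EACCES;
}
```

Read-only access to the directory plays no part in the decision, which is the point of the fix.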

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     
  • Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
    [] smaps_pte_range+0xa0/0x4b0
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
    Call Trace:
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pull fuse fix from Miklos Szeredi:
    "This fixes a deadlock when fuse, direct I/O and loop device are
    combined"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: direct-io: don't dirty ITER_BVEC pages

    Linus Torvalds
     
  • Pull overlayfs fix from Miklos Szeredi:
    "This fixes a regression caused by the last pull request"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix workdir creation

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "I'm not proud of how long it took me to track down that one liner in
    btrfs_sync_log(), but the good news is the patches I was trying to
    blame for these problems were actually fine (sorry Filipe)"

    * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
    btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
    btrfs: do not decrease bytes_may_use when replaying extents

    Linus Torvalds
     

08 Sep, 2016

3 commits

  • Rewrite the data and ack handling code such that:

    (1) Parsing of received ACK and ABORT packets and the distribution and the
    filing of DATA packets happens entirely within the data_ready context
    called from the UDP socket. This allows us to process and discard ACK
    and ABORT packets much more quickly (they're no longer stashed on a
    queue for a background thread to process).

    (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
    keep track of the offset and length of the content of each packet in
    the sk_buff metadata. This means we don't do any allocation in the
    receive path.

    (3) Jumbo DATA packet parsing is now done in data_ready context. Rather
    than cloning the packet once for each subpacket and pulling/trimming
    it, we file the packet multiple times with an annotation for each
    indicating which subpacket is there. From that we can directly
    calculate the offset and length.

    (4) A call's receive queue can be accessed without taking locks (memory
    barriers do have to be used, though).

    (5) Incoming calls are set up from preallocated resources and immediately
    made live. They can then have packets queued upon them and ACKs
    generated. If insufficient resources exist, DATA packet #1 is given a
    BUSY reply and other DATA packets are discarded.

    (6) sk_buffs no longer take a ref on their parent call.

    To make this work, the following changes are made:

    (1) Each call's receive buffer is now a circular buffer of sk_buff
    pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
    between the call and the socket. This permits each sk_buff to be in
    the buffer multiple times. The receive buffer is reused for the
    transmit buffer.

    (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
    to the data buffer. Transmission phase annotations indicate whether a
    buffered packet has been ACK'd or not and whether it needs
    retransmission.

    Receive phase annotations indicate whether a slot holds a whole packet
    or a jumbo subpacket and, if the latter, which subpacket. They also
    note whether the packet has been decrypted in place.

    (3) DATA packet window tracking is much simplified. Each phase has just
    two numbers representing the window (rx_hard_ack/rx_top and
    tx_hard_ack/tx_top).

    The hard_ack number is the sequence number before base of the window,
    representing the last packet the other side says it has consumed.
    hard_ack starts from 0 and the first packet is sequence number 1.

    The top number is the sequence number of the highest-numbered packet
    residing in the buffer. Packets between hard_ack+1 and top are
    soft-ACK'd to indicate they've been received, but not yet consumed.

    Four macros, before(), before_eq(), after() and after_eq() are added
    to compare sequence numbers within the window. This allows for the
    top of the window to wrap when the hard-ack sequence number gets close
    to the limit.

    Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
    to indicate when rx_top and tx_top point at the packets with the
    LAST_PACKET bit set, indicating the end of the phase.

    (4) Calls are queued on the socket 'receive queue' rather than packets.
    This means that we don't have to invent dummy packets to queue to
    indicate abnormal/terminal states and we don't have to keep metadata
    packets (such as ABORTs) around.

    (5) The offset and length of a (sub)packet's content are now passed to
    the verify_packet security op. This is currently expected to decrypt
    the packet in place and validate it.

    However, there's now nowhere to store the revised offset and length of
    the actual data within the decrypted blob (there may be a header and
    padding to skip) because an sk_buff may represent multiple packets, so
    a locate_data security op is added to retrieve these details from the
    sk_buff content when needed.

    (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
    individually secured and needs to be individually decrypted. The code
    to do this is broken out into rxrpc_recvmsg_data() and shared with the
    kernel API. It now iterates over the call's receive buffer rather
    than walking the socket receive queue.
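
The wrap-safe sequence comparisons mentioned in item (3) of the list above can be sketched as follows. The source describes them as macros; plain functions are used here for readability. The technique is the usual serial-number comparison: take the unsigned difference and interpret it as a signed 32-bit value, which gives the right answer as long as the two numbers are less than 2^31 apart.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * The unsigned subtraction wraps modulo 2^32. Converting the result to
 * int32_t relies on two's complement conversion, which mainstream
 * compilers provide.
 */
bool before(seq_t a, seq_t b)    { return (int32_t)(a - b) < 0; }
bool before_eq(seq_t a, seq_t b) { return (int32_t)(a - b) <= 0; }
bool after(seq_t a, seq_t b)     { return (int32_t)(a - b) > 0; }
bool after_eq(seq_t a, seq_t b)  { return (int32_t)(a - b) >= 0; }
```

For example, a hard-ack value of 0xFFFFFFF0 compares as before a top value of 5, even though it is numerically larger, because the window has wrapped.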

    Additional changes:

    (1) The timers are condensed to a single timer that is set for the soonest
    of three timeouts (delayed ACK generation, DATA retransmission and
    call lifespan).

    (2) Transmission of ACK and ABORT packets is effected immediately from
    process-context socket ops/kernel API calls that cause them instead of
    them being punted off to a background work item. The data_ready
    handler still has to defer to the background, though.

    (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
    filesystem can shut down the socket and flush its own work items
    before closing the socket to deal with any in-progress service calls.

    Future additional changes that will need to be considered:

    (1) Make sure that a call doesn't hog the front of the queue by receiving
    data from the network as fast as userspace is consuming it to the
    exclusion of other calls.

    (2) Transmit delayed ACKs from within recvmsg() when we've consumed
    sufficiently more packets to avoid the background work item needing to
    run.

    Signed-off-by: David Howells

    David Howells
     
  • Make it possible for the data_ready handler called from the UDP transport
    socket to completely instantiate an rxrpc_call structure and make it
    immediately live by preallocating all the memory it might need. The idea
    is to cut out the background thread usage as much as possible.

    [Note that the preallocated structs are not actually used in this patch -
    that will be done in a future patch.]

    If insufficient resources are available in the preallocation buffers, it
    will be possible to discard the DATA packet in the data_ready handler or
    schedule a BUSY packet without the need to schedule an attempt at
    allocation in a background thread.

    To this end:

    (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs, each
    up to a maximum of the listen backlog size. The backlog size is
    limited to a maximum of 32. Only this many of each can be in the
    preallocation buffer.

    (2) For userspace sockets, the preallocation is charged initially by
    listen() and will be recharged by accepting or rejecting pending
    new incoming calls.

    (3) For kernel services {,re,dis}charging of the preallocation buffers is
    handled manually. Two notifier callbacks have to be provided before
    kernel_listen() is invoked:

    (a) An indication that a new call has been instantiated. This can be
    used to trigger background recharging.

    (b) An indication that a call is being discarded. This is used when
    the socket is being released.

    A function, rxrpc_kernel_charge_accept(), is called by the kernel
    service to preallocate a single call. It should be passed the user ID
    to be used for that call and a callback to associate the rxrpc call
    with the kernel service's side of the ID.

    (4) Discard the preallocation when the socket is closed.

    (5) Temporarily bump the refcount on the call allocated in
    rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
    preallocation ref on service calls unconditionally. This will no
    longer be necessary once the preallocation is used.
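
    The charge/consume pattern described in the list above can be sketched
    as a small fixed-size ring. This is only an illustration under stated
    assumptions: struct backlog, backlog_charge() and backlog_take() are
    hypothetical names, not the rxrpc implementation, and the sketch omits
    the locking or memory barriers a real producer/consumer pair would need.

```c
#include <stddef.h>

#define BACKLOG_MAX 32  /* the commit limits the backlog to 32 */

struct prealloc_call {
    unsigned long user_id;  /* hypothetical user call ID */
};

struct backlog {
    struct prealloc_call *slot[BACKLOG_MAX];
    unsigned int head;   /* count of calls charged so far */
    unsigned int tail;   /* count of calls taken so far */
    unsigned int limit;  /* listen backlog size, at most BACKLOG_MAX */
};

/* Process context: add one preallocated call.
 * Returns 0 on success, -1 if the buffer already holds 'limit' calls. */
static int backlog_charge(struct backlog *b, struct prealloc_call *c)
{
    if (b->head - b->tail >= b->limit)
        return -1;
    b->slot[b->head % BACKLOG_MAX] = c;
    b->head++;
    return 0;
}

/* Handler context: take a call without allocating anything.
 * NULL means nothing is available, so the caller would discard the
 * packet or schedule a BUSY reply. */
static struct prealloc_call *backlog_take(struct backlog *b)
{
    if (b->head == b->tail)
        return NULL;
    return b->slot[b->tail++ % BACKLOG_MAX];
}
```

    Because the taker never allocates, it can run where sleeping is not
    allowed; recharging happens later from a context that can allocate.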

    Note that this does not yet control the number of active service calls on a
    client - that will come in a later patch.

    A future development would be to provide a setsockopt() call that allows a
    userspace server to manually charge the preallocation buffer. This would
    allow user call IDs to be provided in advance and the awkward manual accept
    stage to be bypassed.

    Signed-off-by: David Howells

    David Howells
     
  • …linux into for-linus-4.8

    Chris Mason
     

07 Sep, 2016

1 commit

  • Add a tracepoint for working out where local aborts happen. Each
    tracepoint call is labelled with a 3-letter code so that they can be
    distinguished - and the DATA sequence number is added too where available.

    rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
    indicate the circumstances when it aborts a call.

    Signed-off-by: David Howells

    David Howells
     

06 Sep, 2016

2 commits

  • In btrfs_async_reclaim_metadata_space(), we use the ticket's address to
    determine whether asynchronous metadata reclaim work is making progress:

        ticket = list_first_entry(&space_info->tickets,
                                  struct reserve_ticket, list);
        if (last_ticket == ticket) {
                flush_state++;
        } else {
                last_ticket = ticket;
                flush_state = FLUSH_DELAYED_ITEMS_NR;
                if (commit_cycles)
                        commit_cycles--;
        }

    But this is wrong: we should not rely on a local variable's address for
    this check, because addresses may be the same. In my test environment, I
    dd one 168MB file into a 256MB fs and found that, for this file, the local
    variable ticket has the same address every time wait_reserve_ticket() is
    called.

    For the above code, assume a previous ticket's address is addrA, so
    last_ticket is addrA. btrfs_async_reclaim_metadata_space() finishes this
    ticket and wakes it up, then another ticket is added with the same address
    addrA. Now last_ticket equals the current ticket, so the current ticket's
    flush work starts from the current flush_state rather than the initial
    FLUSH_DELAYED_ITEMS_NR, which may result in ENOSPC issues (I have seen this
    on my test machine).
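
    The pitfall can be reproduced in isolation. The sketch below is not
    btrfs code and does not claim to be the fix applied by this commit: the
    seq field and both helper functions are hypothetical, and reusing one
    struct stands in for a new ticket landing at the same address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ticket {
    uint64_t seq;    /* hypothetical unique id, assigned at creation */
    uint64_t bytes;
};

/* Address-based check, as in the quoted code: "same ticket as last time". */
static bool same_by_address(const struct ticket *last,
                            const struct ticket *cur)
{
    return last == cur;
}

/* Identity-based check using the hypothetical sequence number. */
static bool same_by_seq(uint64_t last_seq, const struct ticket *cur)
{
    return last_seq == cur->seq;
}
```

    If a ticket with seq 1 completes and a ticket with seq 2 is then created
    in the same storage, same_by_address() still reports "same ticket" while
    same_by_seq() does not, which is the false "no progress" signal
    described above.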

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
     
  • We use a btrfs_log_ctx structure to pass information into the
    tree log commit, and get error values out. It gets added to a per
    log-transaction list which we walk when things go bad.

    Commit d1433debe added an optimization to skip waiting for the log
    commit, but didn't take root_log_ctx out of the list. This
    patch makes sure we remove things before exiting.
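
    The general shape of this kind of bug is an early return that skips the
    unlink step. The sketch below illustrates it with a minimal list in
    place of the kernel's list_head; struct node, log_commit() and the
    skip_wait flag are hypothetical and are not taken from the btrfs source.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, standing in for list_head. */
struct node {
    struct node *prev, *next;
};

static void node_init(struct node *n)
{
    n->prev = n->next = n;
}

static void node_add(struct node *head, struct node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

static int list_is_empty(const struct node *head)
{
    return head->next == head;
}

/* Hypothetical commit path: ctx is queued on entry and must be unlinked
 * on every return path, including the early exit that skips the wait. */
static int log_commit(struct node *ctx_list, struct node *ctx, int skip_wait)
{
    node_add(ctx_list, ctx);
    if (skip_wait) {
        node_del(ctx);  /* without this, ctx_list keeps a stale entry */
        return 0;
    }
    /* ... waiting for the log commit would happen here ... */
    node_del(ctx);
    return 0;
}
```

    A stale entry is especially harmful if the context lives in the caller's
    stack frame (an assumption here), because a later list walk would then
    touch memory that is no longer valid.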

    Signed-off-by: Chris Mason
    Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4
    cc: stable@vger.kernel.org # 3.15+

    Chris Mason
     

05 Sep, 2016

2 commits

  • When replaying extents, there is no need to update bytes_may_use
    in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
    WARN_ON about bytes_may_use.

    Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
     
  • Commit f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
    fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
    comparison operator with an assignment one.

    This looks like a typo which is reported by clang when building the
    kernel with some warning flags:

    fs/ceph/dir.c:600:22: error: using the result of an assignment as a
    condition without parentheses [-Werror,-Wparentheses]
            } else if (fi->frag |= fpos_frag(new_pos)) {
                       ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
    fs/ceph/dir.c:600:22: note: place parentheses around the assignment
    to silence this warning
            } else if (fi->frag |= fpos_frag(new_pos)) {
                                ^
                       (                             )
    fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
    assignment into an inequality comparison
            } else if (fi->frag |= fpos_frag(new_pos)) {
                                ^~
                                !=
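
    To make the behavioural difference concrete, the two conditions can be
    compared side by side. This is a standalone sketch, not the ceph code:
    frag_differs() and frag_or_assign() are hypothetical helpers that mirror
    the intended '!=' test and the typo'd '|=' test respectively.

```c
#include <stdbool.h>

/* Intended check: report whether the fragment changed; *frag is untouched. */
static bool frag_differs(const unsigned int *frag, unsigned int new_frag)
{
    return *frag != new_frag;
}

/* Typo'd check: ORs new_frag into *frag, then tests the result for
 * non-zero. It modifies *frag and is true whenever any bit is set. */
static bool frag_or_assign(unsigned int *frag, unsigned int new_frag)
{
    return (*frag |= new_frag) != 0;
}
```

    With both values equal to 0x4, frag_differs() is false but
    frag_or_assign() is true; with 0x4 and 0x1, the latter also leaves the
    stored value changed to 0x5.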

    Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    Signed-off-by: Nicolas Iooss
    Signed-off-by: Ilya Dryomov

    Nicolas Iooss