11 Jul, 2017

10 commits

  • When building the argv/envp pointers, the envp is needlessly
    pre-incremented instead of just continuing after the argv pointers are
    finished. In some (likely impossible) race where the strings could be
    changed from userspace between copy_strings() and here, it might be
    possible to confuse the envp position. Instead, just use sp like
    everything else.
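
    As an illustration only (not the actual fs/exec.c code), walking a
    single cursor over the already-copied strings so that envp[] simply
    continues where argv[] stops might look like this:

    #include <stddef.h>
    #include <string.h>

    /* sketch: build both pointer arrays from one cursor over the strings */
    static void build_arg_pointers(char *sp, size_t argc, size_t envc,
                                   char **argv, char **envp)
    {
            size_t i;

            for (i = 0; i < argc; i++) {
                    argv[i] = sp;
                    sp += strlen(sp) + 1;   /* step past the string's NUL */
            }
            argv[argc] = NULL;

            for (i = 0; i < envc; i++) {
                    envp[i] = sp;           /* continue from the same cursor */
                    sp += strlen(sp) + 1;
            }
            envp[envc] = NULL;
    }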

    Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
    Signed-off-by: Kees Cook
    Cc: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The ELF_ET_DYN_BASE position was originally intended to keep loaders
    away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
    /bin/cat" might cause the subsequent load of /bin/cat into where the
    loader had been loaded.)

    With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
    ELF_ET_DYN_BASE continued to be used since the kernel was only looking
    at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the
    top 1/3rd of the TASK_SIZE, a substantial portion of the address space
    is unused.

    For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
    loaded above the mmap region. This means they can be made to collide
    (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
    pathological stack regions.

    Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
    region in all cases, and will now additionally avoid programs falling
    back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
    if it would have collided with the stack, now it will fail to load
    instead of falling back to the mmap region).

    To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
    are loaded into the mmap region, leaving space available for either an
    ET_EXEC binary with a fixed location or PIE being loaded into mmap by
    the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
    which means architectures can now safely lower their values without risk
    of loaders colliding with their subsequently loaded programs.

    For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
    the entire 32-bit address space for 32-bit pointers.
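
    For example, on a 64-bit architecture the define could end up as simple
    as the following (illustrative only; the exact value is chosen per
    architecture in its asm/elf.h):

    /* place PIE binaries just above the 32-bit boundary */
    #define ELF_ET_DYN_BASE         0x100000000UL   /* 4 GB */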

    Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
    suggestions on how to implement this solution.

    Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
    Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Pratyush Anand
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • We've encountered zombies, waiting for a thread to exit, that loop in
    ep_poll() almost endlessly even though there is a pending SIGKILL as
    the result of a group exit.

    This happens because ep_events_available() keeps finding events, so we
    keep fetching more and never reach the signal_pending() check that
    would break out of the loop and return -EINTR.

    Special-case fatal signals and break immediately, rather than looping
    to fetch more events and delaying a timely exit.
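
    A minimal sketch of that short-circuit (the loop structure and the
    send/sleep helpers below are simplified placeholders, not the real
    fs/eventpoll.c control flow; fatal_signal_pending() and
    signal_pending() are the actual kernel helpers):

    /* sketch helpers (placeholders, not real fs/eventpoll.c functions) */
    static bool send_events(struct eventpoll *ep);
    static void sleep_until_woken(struct eventpoll *ep);

    static int ep_poll_sketch(struct eventpoll *ep)
    {
            bool done = false;
            int res = 0;

            while (!done) {
                    if (fatal_signal_pending(current)) {
                            res = -EINTR;   /* SIGKILL/group exit: leave promptly */
                            break;
                    }
                    if (ep_events_available(ep)) {
                            /* this path used to starve the signal check */
                            done = send_events(ep);
                            continue;
                    }
                    if (signal_pending(current)) {  /* non-fatal signals, as before */
                            res = -EINTR;
                            break;
                    }
                    sleep_until_woken(ep);
            }
            return res;
    }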

    It would also be possible to simply move the check for signal_pending()
    higher than checking for ep_events_available(), but there have been no
    reports of delayed signal handling other than SIGKILL preventing zombies
    from exiting that would be fixed by this.

    It fixes an issue for us where we have witnessed zombies sticking around
    for at least O(minutes), but considering the code has been like this
    forever and nobody else has complained that I have found, I would simply
    queue it up for 4.12.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Alexander Viro
    Cc: Jan Kara
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The code can be much simplified by switching to ida_simple_get/remove.
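
    For reference, the ida_simple_* pattern looks roughly like this (the
    ida name and wrapper functions here are illustrative):

    #include <linux/idr.h>

    static DEFINE_IDA(example_ida);

    static int example_get_id(void)
    {
            /* lowest free ID >= 0; locking is handled inside the helper */
            return ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
    }

    static void example_put_id(int id)
    {
            ida_simple_remove(&example_ida, id);
    }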

    Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com
    Signed-off-by: Heiner Kallweit
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiner Kallweit
     
  • __list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for a
    long time when there are many items on the lru list. As the code
    stands, it can hold the lock while walking up to UINT_MAX entries in
    one go. So with enough items on the lru list, a "BUG: spinlock lockup
    suspected" is observed in the path below:

    spin_bug+0x90
    do_raw_spin_lock+0xfc
    _raw_spin_lock+0x28
    list_lru_add+0x28
    dput+0x1c8
    path_put+0x20
    terminate_walk+0x3c
    path_lookupat+0x100
    filename_lookup+0x6c
    user_path_at_empty+0x54
    SyS_faccessat+0xd0
    el0_svc_naked+0x24

    This nlru->lock is acquired by another CPU in this path -

    d_lru_shrink_move+0x34
    dentry_lru_isolate_shrink+0x48
    __list_lru_walk_one.isra.10+0x94
    list_lru_walk_node+0x40
    shrink_dcache_sb+0x60
    do_remount_sb+0xbc
    do_emergency_remount+0xb0
    process_one_work+0x228
    worker_thread+0x2e0
    kthread+0xf4
    ret_from_fork+0x10

    Fix this lockup by limiting the number of entries shrunk from the lru
    list to 1024 per pass, and add a cond_resched() before processing the
    list again.
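
    In shrink_dcache_sb() terms the walk now looks roughly like this (a
    simplified sketch based on the description above, not the literal
    diff):

    unsigned long freed;

    do {
            LIST_HEAD(dispose);

            /* nlru->lock is held for at most 1024 entries per pass */
            freed = list_lru_walk(&sb->s_dentry_lru,
                                  dentry_lru_isolate_shrink, &dispose, 1024);

            shrink_dentry_list(&dispose);
            cond_resched();         /* let the other CPU take nlru->lock */
    } while (list_lru_count(&sb->s_dentry_lru) > 0);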

    Link: http://marc.info/?t=149722864900001&r=1&w=2
    Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Suggested-by: Jan Kara
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Alexander Polakov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahitya Tummala
     
  • After commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    we no longer hide the stack guard page in /proc/<pid>/maps.

    Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com
    Signed-off-by: Vasily Averin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough, because the hugepage still has a refcount and unpoison doesn't
    work on it (the PageHWPoison flags are cleared but the pages are still
    leaked). The healthy subpages are wasted as well. This patch reworks
    me_huge_page() to solve these issues.

    For hugetlb files, we recently gained truncating code, so use it in a
    hugetlbfs-specific ->error_remove_page().

    For anonymous hugepages, it's helpful to dissolve the error page after
    freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail, but we haven't handled that
    yet. It's not critical (and at least no worse than now) because in
    that case the error hugepage just stays on the free hugepage list
    without being dissolved. By virtue of PageHWPoison in the head page,
    it is never allocated to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • To install a buffer_head into the cpu's LRU queue, bh_lru_install()
    would construct a new copy of the queue and then memcpy it over the real
    queue. But it's easily possible to do the update in-place, which is
    faster and simpler. Some work can also be skipped if the buffer_head
    was already in the queue.
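
    The in-place update amounts to a small most-recently-used array
    shuffle; a self-contained sketch of the idea (types and size here are
    illustrative, not the fs/buffer.c code):

    #define LRU_SIZE 16

    static void lru_install(void *lru[LRU_SIZE], void *new)
    {
            void *swap = new;
            int i;

            if (lru[0] == new)
                    return;                 /* already the most recent entry */

            for (i = 0; i < LRU_SIZE; i++) {
                    void *tmp = lru[i];

                    lru[i] = swap;
                    swap = tmp;
                    if (swap == new)        /* found its old slot: no eviction */
                            return;
            }
            /* here `swap` holds the evicted entry (drop its reference, etc.) */
    }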

    As a microbenchmark I timed how long it takes to run sb_getblk()
    10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
    Effectively, this benchmarks looking up buffer_heads that are in the
    page cache but not in the LRU:

    Before this patch: 1.758s
    After this patch: 1.653s

    This patch also removes about 350 bytes of compiled code (on x86_64),
    partly due to removal of the memcpy() which was being inlined+unrolled.

    Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     
  • Pull btrfs fix from David Sterba:
    "This fixes a user-visible bug introduced by the nowait-aio patches
    merged in this cycle"

    * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: nowait aio: Correct assignment of pos

    Linus Torvalds
     

10 Jul, 2017

5 commits

  • Assigning pos early for later use breaks append mode, where pos is
    re-assigned in generic_write_checks(). Assign pos later so it reflects
    the correct write position taken from iocb->ki_pos.

    Since check_can_nocow() also uses the value of pos, move
    generic_write_checks() ahead of check_can_nocow(). The IOCB_DIRECT
    checks are already performed in generic_write_checks(), so checking
    for IOCB_NOWAIT is enough.

    Also, keep the locking sequence in the fast path.

    This fixes a user visible bug, as reported:

    "apparently breaks several shell related features on my system.
    In zsh history stopped working, because no new entries are added
    anymore.
    I fist noticed the issue when I tried to build mplayer. It uses a shell
    script to generate a help_mp.h file:
    [...]

    Here is a simple testcase:

    % echo "foo" >> test
    % echo "foo" >> test
    % cat test
    foo
    %
    "

    Fixes: edf064e7c6fe ("btrfs: nowait aio support")
    CC: Jens Axboe
    Reported-by: Markus Trippelsdorf
    Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
    available. The following xattrs are now available:

    - "afs.cell"

    The name of the cell in which the vnode's volume resides.

    - "afs.fid"

    The volume ID, vnode ID and vnode uniquifier of the file as three hex
    numbers separated by colons.

    - "afs.volume"

    The name of the volume in which the vnode resides.

    For example:

    # getfattr -d -m ".*" /mnt/scratch
    getfattr: Removing leading '/' from absolute path names
    # file: mnt/scratch
    afs.cell="mycell.myorg.org"
    afs.fid="10000b:1:1"
    afs.volume="scratch"

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not
    be used to make access decisions for the directory itself. They
    are meant to control access for the objects contained in that
    directory.

    Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set.
    This would cause an incorrect access denied error for a directory
    with AFS_ACE_LOOKUP but not AFS_ACE_READ.

    The AFS_ACE_WRITE bit does not allow operations that modify the
    directory. For a directory with AFS_ACE_WRITE but neither
    AFS_ACE_INSERT nor AFS_ACE_DELETE, this would result in trying
    operations that would ultimately be denied by the server.

    Signed-off-by: Marc Dionne
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Marc Dionne
     
  • Pull ext4 updates from Ted Ts'o:
    "The first major feature for ext4 this merge window is the largedir
    feature, which allows ext4 directories to support over 2 billion
    directory entries (assuming ~64 byte file names; in practice, users
    will run into practical performance limits first.) This feature was
    originally written by the Lustre team, and credit goes to Artem
    Blagodarenko from Seagate for getting this feature upstream.

    The second major feature allows ext4 to support extended
    attribute values up to 64k. This feature was also originally from
    Lustre, and has been enhanced by Tahsin Erdogan from Google with a
    deduplication feature so that if multiple files have the same xattr
    value (for example, Windows ACLs stored by Samba), only one copy will
    be stored on disk for encoding and caching efficiency.

    We also have the usual set of bug fixes, cleanups, and optimizations"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
    ext4: fix spelling mistake: "prellocated" -> "preallocated"
    ext4: fix __ext4_new_inode() journal credits calculation
    ext4: skip ext4_init_security() and encryption on ea_inodes
    fs: generic_block_bmap(): initialize all of the fields in the temp bh
    ext4: change fast symlink test to not rely on i_blocks
    ext4: require key for truncate(2) of encrypted file
    ext4: don't bother checking for encryption key in ->mmap()
    ext4: check return value of kstrtoull correctly in reserved_clusters_store
    ext4: fix off-by-one fsmap error on 1k block filesystems
    ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
    ext4: return EIO on read error in ext4_find_entry
    ext4: forbid encrypting root directory
    ext4: send parallel discards on commit completions
    ext4: avoid unnecessary stalls in ext4_evict_inode()
    ext4: add nombcache mount option
    ext4: strong binding of xattr inode references
    ext4: eliminate xattr entry e_hash recalculation for removes
    ext4: reserve space for xattr entries/names
    quota: add get_inode_usage callback to transfer multi-inode charges
    ext4: xattr inode deduplication
    ...

    Linus Torvalds
     
  • Pull fscrypt updates from Ted Ts'o:
    "Add support for 128-bit AES and some cleanups to fscrypt"

    * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
    fscrypt: make ->dummy_context() return bool
    fscrypt: add support for AES-128-CBC
    fscrypt: inline fscrypt_free_filename()

    Linus Torvalds
     

09 Jul, 2017

2 commits


08 Jul, 2017

11 commits

  • Pull read/write fix from Al Viro:
    "file_start_write()/file_end_write() got mixed into vfs_iter_write() by
    accident; that's a deadlock for all existing callers - they already do
    that, some - quite a bit outside.

    Easily fixed, fortunately"

    * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    move file_{start,end}_write() out of do_iter_write()

    Linus Torvalds
     
  • To avoid pathological stack usage or the need to special-case setuid
    execs, just limit all arg stack usage to at most 75% of _STK_LIM (i.e.
    6MB).
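
    A hedged sketch of the resulting check (the real enforcement lives in
    fs/exec.c's argument-copying path; the variable names here are
    illustrative):

    unsigned long limit = _STK_LIM / 4 * 3; /* 6MB with the default 8MB _STK_LIM */

    if (arg_env_size > limit)
            return -E2BIG;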

    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may not even be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, I think most other filesystems will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
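
    To make the mechanism concrete, here is a hedged sketch of the errseq_t
    pattern described above (the demo_* structures are illustrative; only
    errseq_set() and errseq_check_and_advance() are the new library calls):

    #include <linux/errseq.h>

    struct demo_mapping {
            errseq_t wb_err;                /* writeback error sequence */
    };

    struct demo_file {
            struct demo_mapping *mapping;
            errseq_t f_wb_err;              /* value sampled when the file was opened */
    };

    static void demo_writeback_failed(struct demo_mapping *m, int err)
    {
            errseq_set(&m->wb_err, err);    /* e.g. -EIO or -ENOSPC */
    }

    static int demo_fsync(struct demo_file *f)
    {
            /* reports the error once per file description, then advances */
            return errseq_check_and_advance(&f->mapping->wb_err, &f->f_wb_err);
    }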

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     
  • In quite a few places we call xfs_da_read_buf with a mappedbno that we
    don't control, then assume that the function passes back either an error
    code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno
    maps to a hole, we get a return code of zero and a NULL buffer, which
    means that we crash if we actually try to use that buffer pointer. This
    happens immediately when we set the buffer type for transaction context.

    Therefore, check that we have no error code and a non-NULL bp before
    trying to use bp. This patch is a follow-up to an incomplete fix in
    96a3aefb8ffde231 ("xfs: don't crash if reading a directory results in an
    unexpected hole").

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • Pull Writeback error handling fixes from Jeff Layton:
    "The main rationale for all of these changes is to tighten up writeback
    error reporting to userland. There are many ways now that writeback
    errors can be lost, such that fsync/fdatasync/msync return 0 when
    writeback actually failed.

    This pile contains a small set of cleanups and writeback error
    handling fixes that I was able to break off from the main pile (#2).

    Two of the patches in this pile are trivial. The exceptions are the
    patch to fix up error handling in write_one_page, and the patch to
    make JFS pay attention to write_one_page errors"

    * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    fs: remove call_fsync helper function
    mm: clean up error handling in write_one_page
    JFS: do not ignore return code from write_one_page()
    mm: drop "wait" parameter from write_one_page()

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "First set of CIFS/SMB3 fixes for the merge window. Also improves POSIX
    character mapping for SMB3"

    * tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
    CIFS: fix circular locking dependency
    cifs: set oparms.create_options rather than or'ing in CREATE_OPEN_BACKUP_INTENT
    cifs: Do not modify mid entry after submitting I/O in cifs_call_async
    CIFS: add SFM mapping for 0x01-0x1F
    cifs: hide unused functions
    cifs: Use smb 2 - 3 and cifsacl mount options getacl functions
    cifs: prototype declaration and definition for smb 2 - 3 and cifsacl mount options
    CIFS: add CONFIG_CIFS_DEBUG_KEYS to dump encryption keys
    cifs: set mapping error when page writeback fails in writepage or launder_pages
    SMB3: Enable encryption for SMB3.1.1

    Linus Torvalds
     
  • Pull GFS2 fix from Bob Peterson:
    "Sorry for the additional merge request, but Andreas discovered this
    problem soon after you processed our last gfs2 merge.

    This fixes a regression introduced by a patch we did in mid-2015
    (commit 88ffbf3e037e: "GFS2: Use resizable hash table for glocks"), so
    best to get it fixed. Some code was reverted that should not have
    been.

    The patch from Andreas Gruenbacher just re-adds code that had been
    there originally"

    * tag 'gfs2-4.13.fixes.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    gfs2: Fix glock rhashtable rcu bug

    Linus Torvalds
     
  • take_dentry_name_snapshot() takes a safe snapshot of dentry name;
    if the name is a short one, it gets copied into caller-supplied
    structure, otherwise an extra reference to external name is grabbed
    (those are never modified). In either case the pointer to stable
    string is stored into the same structure.

    dentry must be held by the caller of take_dentry_name_snapshot(),
    but may be freely dropped afterwards - the snapshot will stay
    until destroyed by release_dentry_name_snapshot().

    Intended use:
    struct name_snapshot s;

    take_dentry_name_snapshot(&s, dentry);
    ...
    access s.name
    ...
    release_dentry_name_snapshot(&s);

    Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
    to pass down with event.

    Signed-off-by: Al Viro

    Al Viro
     
  • Michael Ellerman reported that commit 8c6657cb50cb ("Switch flock
    copyin/copyout primitives to copy_{from,to}_user()") broke his
    networking on a bunch of PPC machines (64-bit kernel, 32-bit userspace).

    The reason is a brown-paper-bag bug in that commit, which had the
    arguments to "copy_flock_fields()" in the wrong order, breaking the compat
    handling for file locking. Apparently very few people run 32-bit user
    space on x86 any more, so the PPC people got the honor of noticing this
    "feature".

    Michael also sent a minimal diff that just changed the order of the
    arguments in that macro.

    This is not that minimal diff.

    This not only changes the order of the arguments in the macro, it also
    changes them to be pointers (to be consistent with all the other uses of
    those pointers), and makes the functions that do all of this also have
    the proper "const" attribution on the source pointers in order to make
    issues like that (using the source as a destination) be really obvious.

    Reported-by: Michael Ellerman
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Before commit 88ffbf3e03 ("GFS2: Use resizable hash table for glocks"),
    glocks were freed via call_rcu to allow reading the glock hashtable
    locklessly using rcu. This was then changed to free glocks immediately,
    which made reading the glock hashtable unsafe. Bring back the original
    code for freeing glocks via call_rcu.
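
    The pattern being restored is the usual call_rcu() deferred free; a
    generic sketch (the example_glock type is illustrative, not the gfs2
    structure):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct example_glock {
            /* ... object fields ... */
            struct rcu_head rcu;    /* allows the free to be deferred */
    };

    static void example_glock_free_rcu(struct rcu_head *rcu)
    {
            kfree(container_of(rcu, struct example_glock, rcu));
    }

    static void example_glock_free(struct example_glock *gl)
    {
            /*
             * Defer the kfree() until all pre-existing RCU readers
             * (lockless hashtable lookups) have finished with the object.
             */
            call_rcu(&gl->rcu, example_glock_free_rcu);
    }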

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Cc: stable@vger.kernel.org # 4.3+

    Andreas Gruenbacher
     
  • Pull libnvdimm updates from Dan Williams:
    "libnvdimm updates for the latest ACPI and UEFI specifications. This
    pull request also includes the new 'struct dax_operations', enabling us
    to undo the abuse of copy_user_nocache() for copy operations to pmem.

    The dax work originally missed 4.12 to address concerns raised by Al.

    Summary:

    - Introduce the _flushcache() family of memory copy helpers and use
    them for persistent memory write operations on x86. The
    _flushcache() semantic indicates that the cache is either bypassed
    for the copy operation (movnt) or any lines dirtied by the copy
    operation are written back (clwb, clflushopt, or clflush).

    - Extend dax_operations with ->copy_from_iter() and ->flush()
    operations. These operations and other infrastructure updates allow
    all persistent memory specific dax functionality to be pushed into
    libnvdimm and the pmem driver directly. It also allows dax-specific
    sysfs attributes to be linked to a host device, for example:
    /sys/block/pmem0/dax/write_cache

    - Add support for the new NVDIMM platform/firmware mechanisms
    introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
    namespace label format, extensions to the address-range-scrub
    command set, new error injection commands, and a new BTT
    (block-translation-table) layout. These updates support inter-OS
    and pre-OS compatibility.

    - Fix a longstanding memory corruption bug in nfit_test.

    - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
    capable.

    - Miscellaneous fixes and small updates across libnvdimm and the nfit
    driver.

    Acknowledgements that came after the branch was pushed: commit
    6aa734a2f38e ("libnvdimm, region, pmem: fix 'badblocks'
    sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
    "

    * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
    libnvdimm, namespace: record 'lbasize' for pmem namespaces
    acpi/nfit: Issue Start ARS to retrieve existing records
    libnvdimm: New ACPI 6.2 DSM functions
    acpi, nfit: Show bus_dsm_mask in sysfs
    libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
    acpi, nfit: Enable DSM pass thru for root functions.
    libnvdimm: passthru functions clear to send
    libnvdimm, btt: convert some info messages to warn/err
    libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
    libnvdimm: fix the clear-error check in nsio_rw_bytes
    libnvdimm, btt: fix btt_rw_page not returning errors
    acpi, nfit: quiet invalid block-aperture-region warnings
    libnvdimm, btt: BTT updates for UEFI 2.7 format
    acpi, nfit: constify *_attribute_group
    libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
    libnvdimm, pmem, dax: export a cache control attribute
    dax: convert to bitmask for flags
    dax: remove default copy_from_iter fallback
    libnvdimm, nfit: enable support for volatile ranges
    libnvdimm, pmem: fix persistence warning
    ...

    Linus Torvalds
     

07 Jul, 2017

12 commits

  • XFS has a maximum symlink target length of 1024 bytes; this is a
    holdover from the Irix days. Unfortunately, the constant establishing
    this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN,
    which is 4096.

    The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but
    xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means
    that xfs_repair doesn't complain about oversized symlinks.

    Since this is an on-disk format constraint, put the define in the XFS
    namespace and move everything over to use the new name.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Merge misc updates from Andrew Morton:

    - a few hotfixes

    - various misc updates

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (108 commits)
    mm, memory_hotplug: move movable_node to the hotplug proper
    mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
    mm, memory_hotplug: drop artificial restriction on online/offline
    mm: memcontrol: account slab stats per lruvec
    mm: memcontrol: per-lruvec stats infrastructure
    mm: memcontrol: use generic mod_memcg_page_state for kmem pages
    mm: memcontrol: use the node-native slab memory counters
    mm: vmstat: move slab statistics from zone to node counters
    mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
    mm/zswap.c: improve a size determination in zswap_frontswap_init()
    mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
    mm/swapfile.c: sort swap entries before free
    mm/oom_kill: count global and memory cgroup oom kills
    mm: per-cgroup memory reclaim stats
    mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
    mm: kmemleak: factor object reference updating out of scan_block()
    mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
    mm, mempolicy: don't check cpuset seqlock where it doesn't matter
    mm, cpuset: always use seqlock when changing task's nodemask
    mm, mempolicy: simplify rebinding mempolicies when updating cpusets
    ...

    Linus Torvalds
     
  • Pull misc compat stuff updates from Al Viro:
    "This part is basically untangling various compat stuff. Compat
    syscalls moved to their native counterparts, getting rid of quite a
    bit of double-copying and/or set_fs() uses. A lot of field-by-field
    copyin/copyout killed off.

    - kernel/compat.c is much closer to containing just the
    copyin/copyout of compat structs. Not all compat syscalls are gone
    from it yet, but it's getting there.

    - ipc/compat_mq.c killed off completely.

    - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
    drivers/block/floppy.c where they belong. Yes, there are several
    drivers that implement some of the same ioctls. Some are m68k and
    one is 32bit-only pmac. drivers/block/floppy.c is the only one in
    that bunch that can be built on biarch"

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: move compat syscalls to native ones
    usbdevfs: get rid of field-by-field copyin
    compat_hdio_ioctl: get rid of set_fs()
    take floppy compat ioctls to sodding floppy.c
    ipmi: get rid of field-by-field __get_user()
    ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
    rt_sigtimedwait(): move compat to native
    select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
    put_compat_rusage(): switch to copy_to_user()
    sigpending(): move compat to native
    getrlimit()/setrlimit(): move compat to native
    times(2): move compat to native
    compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
    fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
    do_sigaltstack(): lift copying to/from userland into callers
    take compat_sys_old_getrlimit() to native syscall
    trim __ARCH_WANT_SYS_OLD_GETRLIMIT

    Linus Torvalds
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.
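
    The reworked prototype roughly becomes (sketch; the exact declaration
    lives in the generic and per-arch hugetlb headers):

    /*
     * The caller now passes the expected huge page size so architectures
     * with contiguous-PTE hugepages can resolve the ambiguity; previously
     * the helper took only (mm, addr).
     */
    pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
                           unsigned long sz);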

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • Update the dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove the explicit zero-initialization after allocation.
    In places where HASH_EARLY was used, such as __pv_init_lock_hash(), the
    zeroed hash table was already assumed, because memblock zeroes the
    memory.

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s
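
    For illustration, a caller now asks for a pre-zeroed table instead of
    clearing it afterwards; a sketch modeled on the dcache setup (exact
    arguments vary per table):

    /*
     * HASH_ZERO: the buckets come back zeroed, so the INIT_HLIST_HEAD()
     * loop that used to follow the allocation can simply be dropped.
     */
    dentry_hashtable =
            alloc_large_system_hash("Dentry cache",
                                    sizeof(struct hlist_bl_head),
                                    dhash_entries,
                                    13,
                                    HASH_ZERO,
                                    &d_hash_shift,
                                    &d_hash_mask,
                                    0,
                                    0);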

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • The calculations of start and end in the __wake_userfault() function
    are not used and can be removed.

    Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • There is no real reason to duplicate kvmalloc* helpers so drop
    alloc_fdmem and replace it with the appropriate library function.
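
    In other words, something along these lines (a hedged sketch of the
    replacement, not the literal fs/file.c diff; the wrapper names are
    illustrative):

    static void *alloc_fd_array(unsigned int nr)
    {
            /*
             * kvmalloc_array() tries kmalloc() first and falls back to
             * vmalloc() for large allocations, which is the same policy
             * the open-coded alloc_fdmem() helper implemented by hand.
             */
            return kvmalloc_array(nr, sizeof(struct file *), GFP_KERNEL_ACCOUNT);
    }

    static void free_fd_array(void *p)
    {
            kvfree(p);      /* handles both kmalloc and vmalloc memory */
    }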

    Link: http://lkml.kernel.org/r/20170531155145.17111-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups provided by <linux/sysfs.h> work with
    const attribute_group. So mark the non-const structs as const.

    File size before:
    text data bss dec hex filename
    4402 1088 38 5528 1598 fs/ocfs2/stackglue.o

    File size After adding 'const':
    text data bss dec hex filename
    4442 1024 38 5504 1580 fs/ocfs2/stackglue.o

    Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     
  • 'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end
    of sc_fop_release().

    Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     
  • Filesystems generally use SUPER_MAGIC values from magic.h instead of a
    local definition.

    Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Fix a static code checker warning:

    fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR'

    Fixes: d56a8f32e4c6 ("ocfs2: check/fix inode block for online file check")
    Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He