30 Jul, 2016

7 commits

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment, that the resolution of those concerns
    would not be fuse specific, and that sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner, filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounter's credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee for keeping this code alive, for putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
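
The s_user_ns idea described in the pull message above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: every type and function name below (userns_model, fs_uid_to_kuid, may_modify_inode, and so on) is hypothetical, the namespace is reduced to a single mapping range, and only uids are handled. It shows an on-disk uid being translated relative to the superblock's owning namespace, falling back to an invalid id when unmapped, and writes being refused for inodes with such an id.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: these types mimic, but are not, the kernel's. */
typedef struct { uint32_t val; } kuid_model_t;

#define INVALID_UID_MODEL ((uint32_t)-1)

/* One contiguous mapping range, like a single line of /proc/<pid>/uid_map. */
struct userns_model {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first id it maps to outside the namespace */
	uint32_t count;       /* number of ids in the range */
};

struct super_block_model {
	const struct userns_model *s_user_ns; /* owner of the filesystem */
};

struct inode_model {
	const struct super_block_model *sb;
	kuid_model_t i_uid;
};

/* Translate an on-disk uid relative to the superblock's owning namespace. */
static kuid_model_t fs_uid_to_kuid(const struct super_block_model *sb,
				   uint32_t fs_uid)
{
	const struct userns_model *ns = sb->s_user_ns;
	kuid_model_t k = { INVALID_UID_MODEL };

	if (fs_uid >= ns->first && fs_uid - ns->first < ns->count)
		k.val = ns->lower_first + (fs_uid - ns->first);
	return k;
}

static bool uid_valid_model(kuid_model_t k)
{
	return k.val != INVALID_UID_MODEL;
}

/* Anything that could dirty the inode is refused when its owner is unknown. */
static bool may_modify_inode(const struct inode_model *inode)
{
	return uid_valid_model(inode->i_uid);
}
```

With a range of 65536 ids starting at 100000, on-disk uid 1000 becomes 101000 in this model, while on-disk uid 70000 falls outside the range and leaves the inode effectively read-only.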
     
  • Pull fuse updates from Miklos Szeredi:
    "This fixes error propagation from writeback to fsync/close for
    writeback cache mode as well as adding a missing capability flag to
    the INIT message. The rest are cleanups.

    (The commits are recent but all the code has actually sat in -next for a
    while now. The recommits are due to conflict avoidance and the
    addition of Cc: stable@...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use filemap_check_errors()
    mm: export filemap_check_errors() to modules
    fuse: fix wrong assignment of ->flags in fuse_send_init()
    fuse: fuse_flush must check mapping->flags for errors
    fuse: fsync() did not return IO errors
    fuse: don't mess with blocking signals
    new helper: wait_event_killable_exclusive()
    fuse: improve aio directIO write performance for size extending writes

    Linus Torvalds
     
  • This reverts commit 3c9fe8cdff1b889a059a30d22f130372f2b3885f.

    As Miklos points out in commit c1b2cc1a765a, the "lookup_hash()" helper
    is now unused, and in fact, with the hash salting changes, since the
    hash of a dentry name now depends on the directory dentry it is in, the
    helper function isn't even really likely to be useful.

    So rather than keep it around in case somebody else might end up finding
    a use for it, let's just remove the helper and not trick people into
    thinking it might be a useful thing.

    For example, I had obviously completely missed how the helper didn't
    follow the normal dentry hashing patterns, and how the hash salting
    patch broke overlayfs. Things would quietly build and look sane, but
    not work.

    Suggested-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull overlayfs update from Miklos Szeredi:
    "First of all, this fixes a regression in overlayfs introduced by the
    dentry hash salting. I've moved the patch fixing this to the front of
    the queue, so if (god forbid) something needs to be bisected in
    overlayfs this regression won't interfere with that.

    The biggest part is preparation for selinux support, done by Vivek
    Goyal. Essentially this makes all operations on underlying
    filesystems be done with credentials of mounter. This makes
    everything nicely consistent.

    There are also fixes for a number of known and recently discovered
    non-standard behaviors (thanks to Eryu Guan for testing and improving
    the test suites)"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (23 commits)
    ovl: simplify empty checking
    qstr: constify instances in overlayfs
    ovl: clear nlink on rmdir
    ovl: disallow overlayfs as upperdir
    ovl: fix warning
    ovl: remove duplicated include from super.c
    ovl: append MAY_READ when diluting write checks
    ovl: dilute permission checks on lower only if not special file
    ovl: fix POSIX ACL setting
    ovl: share inode for hard link
    ovl: store real inode pointer in ->i_private
    ovl: permission: return ECHILD instead of ENOENT
    ovl: update atime on upper
    ovl: fix sgid on directory
    ovl: simplify permission checking
    ovl: do not require mounter to have MAY_WRITE on lower
    ovl: do operations on underlying file system in mounter's context
    ovl: modify ovl_permission() to do checks on two inodes
    ovl: define ->get_acl() for overlay inodes
    ovl: move some common code in a function
    ...

    Linus Torvalds
     
  • Pull freevxfs updates from Christoph Hellwig:
    "Support for foreign endianness and HP-UX superblocks from
    Krzysztof Błaszkowski"

    * tag 'freevxfs-for-4.8' of git://git.infradead.org/users/hch/freevxfs:
    freevxfs: update Kconfig information
    freevxfs: refactor readdir and lookup code
    freevxfs: fix lack of inode initialization
    freevxfs: fix memory leak in vxfs_read_fshead()
    freevxfs: update documentation and cresdits for HP-UX support
    freevxfs: implement ->alloc_inode and ->destroy_inode
    freevxfs: avoid the need for forward declaring the super operations
    freevxfs: move VFS inode allocation into vxfs_blkiget and vxfs_stiget
    freevxfs: remove vxfs_put_fake_inode
    freevxfs: handle big endian HP-UX file systems

    Linus Torvalds
     
  • Pull configfs update from Christoph Hellwig:
    "A simple error handling fix from Tal Shorer"

    * tag 'configfs-for-4.8' of git://git.infradead.org/users/hch/configfs:
    configfs: don't set buffer_needs_fill to zero if show() returns error

    Linus Torvalds
     
  • Pull CIFS/SMB3 fixes from Steve French:
    "Various CIFS/SMB3 fixes, most for stable"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    CIFS: Fix a possible invalid memory access in smb2_query_symlink()
    fs/cifs: make share unaccessible at root level mountable
    cifs: fix crash due to race in hmac(md5) handling
    cifs: unbreak TCP session reuse
    cifs: Check for existing directory when opening file with O_CREAT
    Add MF-Symlinks support for SMB 2.0

    Linus Torvalds
     

29 Jul, 2016

33 commits

  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo.

    Signed-off-by: Wei Fang
    Signed-off-by: Miklos Szeredi
    Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags")
    Cc:

    Wei Fang
     
  • fuse_flush() calls write_inode_now() that triggers writeback, but actual
    writeback will happen later, on fuse_sync_writes(). If an error happens,
    fuse_writepage_end() will set error bit in mapping->flags. So, we have to
    check mapping->flags after fuse_sync_writes().

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+

    Maxim Patlasov
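
The ordering problem described in the entry above (errors are recorded in mapping->flags when writeback completes, so they must be checked after waiting for it) can be sketched in userspace as follows. This is a simplified model, not the fuse code: the struct, the bit values and all function names are hypothetical, and the bit operations are not atomic as they would need to be in the kernel.

```c
#include <errno.h>

/* Hypothetical bit values for this model; the kernel defines its own. */
#define MODEL_AS_EIO    (1UL << 0)
#define MODEL_AS_ENOSPC (1UL << 1)

struct mapping_model {
	unsigned long flags;
};

/* Called when an asynchronous write completes; records a failure. */
static void writepage_end_model(struct mapping_model *m, int error)
{
	if (error == -ENOSPC)
		m->flags |= MODEL_AS_ENOSPC;
	else if (error)
		m->flags |= MODEL_AS_EIO;
}

/* Report and clear any recorded error, so it is returned exactly once. */
static int check_errors_model(struct mapping_model *m)
{
	int ret = 0;

	if (m->flags & MODEL_AS_ENOSPC) {
		m->flags &= ~MODEL_AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & MODEL_AS_EIO) {
		m->flags &= ~MODEL_AS_EIO;
		ret = -EIO;
	}
	return ret;
}

/* Flush: let writeback finish first, and only then look at the error bits. */
static int flush_model(struct mapping_model *m, int writeback_result)
{
	/* Stands in for triggering writeback and waiting for it to complete. */
	writepage_end_model(m, writeback_result);
	return check_errors_model(m);
}
```

Checking the flags before the wait would miss an error that is only recorded on completion, which is the failure mode the commit message describes.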
     
  • Due to implementation of fuse writeback filemap_write_and_wait_range() does
    not catch errors. We have to do this directly after fuse_sync_writes()

    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+

    Alexey Kuznetsov
     
  • The empty checking logic is duplicated in ovl_check_empty_and_clear() and
    ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
    different:

    ovl_check_empty_and_clear() checked for being upper

    ovl_remove_and_whiteout() checked for merge OR lower

    Move the intersection of those checks (upper AND merge) into
    ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Signed-off-by: Al Viro
    Signed-off-by: Miklos Szeredi

    Al Viro
     
  • To make delete notification work on fanotify/inotify.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This does not work and does not make sense. So instead of fixing it
    (probably not hard) just disallow.

    Reported-by: Andrei Vagin
    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
     
  • There's a superfluous newline in the warning message in ovl_d_real().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Remove duplicated include.

    Signed-off-by: Wei Yongjun
    Signed-off-by: Miklos Szeredi

    Wei Yongjun
     
  • Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
    lower/. This is done as files on lower will never be written and will be
    copied up. But to copy up a file, mounter should have MAY_READ permission
    otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
    reset.

    Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
    True (context mounts) but when he tried to actually write to file, it
    failed as mounter did not have permission on lower file.

    [SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
    won't trigger a copy-up.

    Reported-by: Dan Walsh
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
    mask as lower/ will never be written and file will be copied up. But this
    is not true for special files. These files are not copied up and are opened
    in place. So don't dilute the checks for these types of files.

    Reported-by: Dan Walsh
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
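
The mask handling described in the two entries above can be summarized in a few lines. The sketch below is written from the commit descriptions rather than copied from ovl_permission(); the function name and its boolean parameters are assumptions for illustration. The MAY_* values match the kernel's definitions.

```c
#include <stdbool.h>

/* Same values as the kernel's MAY_* access bits. */
#define MAY_EXEC   0x1
#define MAY_WRITE  0x2
#define MAY_READ   0x4
#define MAY_APPEND 0x8

/*
 * Compute the mask used for the check against the real inode.
 * on_lower:   the real file lives on the lower layer
 * is_special: device node, fifo or socket (opened in place, never copied up)
 */
static int lower_mask_model(int mask, bool on_lower, bool is_special)
{
	if (!on_lower || is_special)
		return mask; /* nothing to dilute */

	/* A write will trigger copy-up, which needs read access on lower. */
	if (mask & MAY_WRITE)
		mask |= MAY_READ;

	/* The lower file itself is never written. */
	return mask & ~(MAY_WRITE | MAY_APPEND);
}
```

For example, a MAY_WRITE request on a regular lower file turns into a MAY_READ check, a lone MAY_APPEND is dropped without adding MAY_READ, and a special file keeps its original mask.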
     
  • Setting POSIX ACL needs special handling:

    1) Some permission checks are done by ->setxattr() which now uses mounter's
    creds ("ovl: do operations on underlying file system in mounter's
    context"). These permission checks need to be done with current cred as
    well.

    2) Setting ACL can fail for various reasons. We do not need to copy up in
    these cases.

    In the mean time switch to using generic_setxattr.

    [Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
    doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
    disabled, and I could not come up with an obvious way to do it.

    This instead avoids the link error by defining two sets of ACL operations
    and letting the compiler drop one of the two at compile time depending
    on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
    also leading to smaller code.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
    mtime, ctime) so generic code using these fields works correctly. If a hard
    link is created in overlayfs separate inodes are allocated for each link.
    If chmod/chown/etc. is performed on one of the links then the inode
    belonging to the other ones won't be updated.

    This patch attempts to fix this by sharing inodes for hard links.

    Use inode hash (with real inode pointer as a key) to make sure overlay
    inodes are shared for hard links on upper. Hard links on lower are still
    split (which is not user observable until the copy-up happens, see
    Documentation/filesystems/overlayfs.txt under "Non-standard behavior").

    The inode is only inserted in the hash if it is non-directory and upper.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • To get from overlay inode to real inode we currently use 'struct
    ovl_entry', which has lifetime connected to overlay dentry. This is okay,
    since each overlay dentry had a new overlay inode allocated.

    The following patch will break that assumption, so we need to leave out ovl_entry.
    This patch stores the real inode directly in i_private, with the lowest bit
    used to indicate whether the inode is upper or lower.

    Lifetime rules remain, using ovl_inode_real() must only be done while
    caller holds ref on overlay dentry (and hence on real dentry), or within
    RCU protected regions.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
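
Storing a pointer together with a one-bit flag in a single word, as the entry above describes for i_private, is a common pointer-tagging technique. A minimal userspace sketch follows; the struct and helper names are hypothetical and are not the overlayfs helpers. It assumes the pointed-to object is at least 2-byte aligned so that bit 0 of its address is always zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the real (underlying) inode; hypothetical type. */
struct real_inode_model {
	int ino;
};

#define OVL_MODEL_UPPER_BIT ((uintptr_t)1)

/* Pack a real-inode pointer and an "is upper" flag into one word. */
static void *pack_real_inode(struct real_inode_model *real, bool is_upper)
{
	/* Bit 0 is free because the object is at least 2-byte aligned. */
	return (void *)((uintptr_t)real | (is_upper ? OVL_MODEL_UPPER_BIT : 0));
}

/* Recover the pointer by masking the flag bit off again. */
static struct real_inode_model *unpack_real_inode(void *i_private)
{
	return (struct real_inode_model *)
		((uintptr_t)i_private & ~OVL_MODEL_UPPER_BIT);
}

static bool real_inode_is_upper(void *i_private)
{
	return ((uintptr_t)i_private & OVL_MODEL_UPPER_BIT) != 0;
}
```

The packed value must never be dereferenced directly; every access goes through the unpack helper, and the lifetime rules stated in the commit message still apply to the pointer it returns.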
     
  • The error is due to RCU and is temporary.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Fix atime update logic in overlayfs.

    This patch adds an i_op->update_time() handler to overlayfs inodes. This
    forwards atime updates to the upper layer only. No atime updates are done
    on lower layers.

    Remove implicit atime updates to underlying files and directories with
    O_NOATIME. Remove explicit atime update in ovl_readlink().

    Clear atime related mnt flags from cloned upper mount. This means atime
    updates are controlled purely by overlayfs mount options.

    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • When creating directory in workdir, the group/sgid inheritance from the
    parent dir was omitted completely. Fix this by calling inode_init_owner()
    on overlay inode and using the resulting uid/gid/mode to create the file.

    Unfortunately the sgid bit can be stripped off due to umask, so need to
    reset the mode in this case in workdir before moving the directory in
    place.

    Reported-by: Eryu Guan
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
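
The group/sgid inheritance that the entry above restores follows the usual rule for setgid directories: a new object takes the parent's gid, and a new directory additionally inherits the sgid bit. The sketch below models that rule in userspace. The MODEL_* constants mirror the conventional octal mode values, and the struct and function names are hypothetical rather than kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional octal mode values, prefixed to avoid clashing with headers. */
#define MODEL_S_IFMT  0170000u
#define MODEL_S_IFDIR 0040000u
#define MODEL_S_IFREG 0100000u
#define MODEL_S_ISGID 0002000u

struct owner_model {
	uint32_t uid, gid, mode;
};

/* Decide owner, group and mode of a new object created inside 'dir'. */
static struct owner_model init_owner_model(const struct owner_model *dir,
					   uint32_t fsuid, uint32_t fsgid,
					   uint32_t mode)
{
	struct owner_model n = { fsuid, fsgid, mode };

	if (dir->mode & MODEL_S_ISGID) {
		n.gid = dir->gid; /* inherit the parent's group */
		if ((mode & MODEL_S_IFMT) == MODEL_S_IFDIR)
			n.mode |= MODEL_S_ISGID; /* subdirectories stay sgid */
	}
	return n;
}

/* True if the mode that ended up on disk differs from the one wanted. */
static bool needs_mode_reset(uint32_t wanted, uint32_t created)
{
	return wanted != created;
}
```

The second helper corresponds to the last paragraph of the commit message: if the directory created in workdir lost a bit along the way, the mode is set again before the directory is moved into place.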
     
  • The fact that we always do permission checking on the overlay inode and
    clear MAY_WRITE for checking access to the lower inode allows cruft to be
    removed from ovl_permission().

    1) "default_permissions" option effectively did generic_permission() on the
    overlay inode with i_mode, i_uid and i_gid updated from underlying
    filesystem. This is what we do by default now. It did the update using
    vfs_getattr() but that's only needed if the underlying filesystem can
    change (which is not allowed). We may later introduce a "paranoia_mode"
    that verifies that mode/uid/gid are not changed.

    2) splitting out the IS_RDONLY() check from inode_permission() also becomes
    unnecessary once we remove the MAY_WRITE from the lower inode check.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Now we have two levels of checks in ovl_permission(). overlay inode
    is checked with the creds of task while underlying inode is checked
    with the creds of mounter.

    Looks like mounter does not have to have WRITE access to files on lower/.
    So remove the MAY_WRITE from access mask for checks on underlying
    lower inode.

    This means task should still have the MAY_WRITE permission on lower
    inode and mounter is not required to have MAY_WRITE.

    It also solves the problem of read only NFS mounts being used as lower.
    If __inode_permission(lower_inode, MAY_WRITE) is called on read only
    NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
    read only NFS should work with overlay without having to specify any
    special mount options (default permission).

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Given we are now doing checks both on overlay inode as well as underlying
    inode, we should be able to do checks and operations on underlying file
    system using mounter's context.

    So modify all operations to do checks/operations on underlying dentry/inode
    in the context of mounter.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Right now ovl_permission() calls __inode_permission(realinode), to do
    permission checks on real inode and no checks are done on overlay inode.

    Modify it to do checks both on overlay inode as well as underlying inode.
    Checks on overlay inode will be done with the creds of calling task while
    checks on underlying inode will be done with the creds of mounter.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
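
The two-level check described in the entry above can be modeled as two calls to the same permission routine with different credentials. This is a deliberately simplified userspace sketch: the DAC check only looks at owner and "other" bits, there is no root override, no groups and no ACLs, and all type and function names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Same values as the kernel's MAY_* access bits. */
#define MAY_EXEC  0x1
#define MAY_WRITE 0x2
#define MAY_READ  0x4

struct cred_model {
	uint32_t uid;
};

struct inode_perm_model {
	uint32_t uid;
	uint32_t mode; /* permission bits only, e.g. 0644 */
};

/* Simplified DAC check: owner bits if the uid matches, else "other" bits. */
static int dac_check_model(const struct cred_model *c,
			   const struct inode_perm_model *i, int mask)
{
	uint32_t bits = (c->uid == i->uid) ? (i->mode >> 6) & 7 : i->mode & 7;

	return ((uint32_t)mask & ~bits & 7) ? -EACCES : 0;
}

/* Overlay inode is checked as the task, underlying inode as the mounter. */
static int ovl_permission_model(const struct cred_model *task,
				const struct cred_model *mounter,
				const struct inode_perm_model *overlay,
				const struct inode_perm_model *real, int mask)
{
	int err = dac_check_model(task, overlay, mask);

	if (err)
		return err;
	return dac_check_model(mounter, real, mask);
}
```

Access is granted only when both checks pass, so a task cannot gain access through the mounter's credentials alone, and the mounter's lack of access to the underlying file is still reported to the task.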
     
  • Now we are planning to do DAC permission checks on overlay inode
    itself. And to make it work, we will need to make sure we can get acls from
    underlying inode. So define ->get_acl() for overlay inodes and this in turn
    calls into underlying filesystem to get acls, if any.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
    common code which can be moved into a separate function. No functionality
    change.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Previously this was only done for directory inodes. Doing so for all
    inodes makes for a nice cleanup in ovl_permission at zero cost.

    Inodes are not shared for hard links on the overlay, so this works fine.

    Signed-off-by: Miklos Szeredi

    Andreas Gruenbacher
     
  • No point in keeping overlay inodes around since they will never be reused.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The hash salting changes meant that we can no longer reuse the hash in the
    overlay dentry to look up the underlying dentry.

    Instead of lookup_hash(), use lookup_one_len_unlocked() and switch to
    mounter's creds (like we do for all other operations later in the series).

    Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff1b ("vfs: add
    lookup_hash() helper") is unused and can possibly be removed; its
    usefulness negated by the hash salting and the idea that mounter's creds
    should be used on operations on underlying filesystems.

    Signed-off-by: Miklos Szeredi
    Fixes: 8387ff2577eb ("vfs: make the string hashes salt the hash")

    Miklos Szeredi
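
Why a salted hash cannot be reused across directories can be shown with a toy example. The function below is not the kernel's dentry hash; it is FNV-1a seeded with the parent's address, chosen only because it is short. The name salted_name_hash and its parameters are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy salted name hash: FNV-1a with the parent pointer mixed into the
 * initial state. The same name hashes differently under different parents.
 */
static uint64_t salted_name_hash(const void *parent, const char *name,
				 size_t len)
{
	uint64_t h = 14695981039346656037ULL ^ (uint64_t)(uintptr_t)parent;

	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 1099511628211ULL;
	}
	return h;
}
```

A hash computed for a name under the overlay directory is therefore of no use for looking the same name up under the underlying directory, which is why the lookup has to start again from the name itself.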
     
  • Pull tracing updates from Steven Rostedt:
    "This is mostly clean ups and small fixes. Some of the more visible
    changes are:

    - The function pid code uses the event pid filtering logic
    - [ku]probe events have access to current->comm
    - trace_printk now has sample code
    - PCI devices now trace physical addresses
    - stack tracing has fewer unnecessary functions traced"

    * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk, tracing: Avoiding unneeded blank lines
    tracing: Use __get_str() when manipulating strings
    tracing, RAS: Cleanup on __get_str() usage
    tracing: Use outer () on __get_str() definition
    ftrace: Reduce size of function graph entries
    tracing: Have HIST_TRIGGERS select TRACING
    tracing: Using for_each_set_bit() to simplify trace_pid_write()
    ftrace: Move toplevel init out of ftrace_init_tracefs()
    tracing/function_graph: Fix filters for function_graph threshold
    tracing: Skip more functions when doing stack tracing of events
    tracing: Expose CPU physical addresses (resource values) for PCI devices
    tracing: Show the preempt count of when the event was called
    tracing: Add trace_printk sample code
    tracing: Choose static tp_printk buffer by explicit nesting count
    tracing: expose current->comm to [ku]probe events
    ftrace: Have set_ftrace_pid use the bitmap like events do
    tracing: Move pid_list write processing into its own function
    tracing: Move the pid_list seq_file functions to be global
    tracing: Move filtered_pid helper functions into trace.c
    tracing: Make the pid filtering helper functions global

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that when written and fenced assures
    that all previous posted writes targeting a given dimm have been
    flushed to media.

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
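
The choice of KiB as the accounting unit in the entry above comes down to integer arithmetic: a counter in pages cannot represent a stack smaller than a page. The helpers below are a hypothetical userspace illustration with made-up sizes, not the kernel's accounting code.

```c
#include <stddef.h>

/* Change to a KiB-based counter when nr_stacks stacks are added or removed. */
static long kernel_stack_kb_delta(size_t thread_size, int nr_stacks)
{
	return (long)(thread_size / 1024) * nr_stacks;
}

/* The same accounting in whole pages, shown for comparison. */
static long kernel_stack_pages_delta(size_t thread_size, size_t page_size,
				     int nr_stacks)
{
	return (long)(thread_size / page_size) * nr_stacks;
}
```

With an illustrative 2 KiB stack and 4 KiB pages, the page-based delta rounds down to zero while the KiB-based delta is 2 per stack, so only the KiB counter stays accurate when THREAD_SIZE is smaller than PAGE_SIZE.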
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing so this patch moves the relevant file-based accounting. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_PAGES is the number of mapped anon pages.

    This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
    NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we
    have

    NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_MAPPED is the number of mapped anon pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman