07 Dec, 2019

1 commit

  • Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
    "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
    Basically, the problem is that negative pinned dentries require
    careful treatment - unless ->d_lock is locked or parent is held at
    least shared, another thread can make them positive right under us.

    Most of the uses turned out to be safe - the main surprises as far as
    filesystems are concerned were

    - race in dget_parent() fastpath, that might end up with the caller
    observing the returned dentry _negative_, due to insufficient
    barriers. It is positive in memory, but we could end up seeing the
    wrong value of ->d_inode in CPU cache. Fixed.

    - manual checks that result of lookup_one_len_unlocked() is positive
    (and rejection of negatives). Again, insufficient barriers (we
    might end up with inconsistent observed values of ->d_inode and
    ->d_flags). Fixed by switching to a new primitive that does the
    checks itself and returns ERR_PTR(-ENOENT) instead of a negative
    dentry. That way we get rid of boilerplate converting negatives
    into ERR_PTR(-ENOENT) in the callers and have a single place to
    deal with the barrier-related mess - inside fs/namei.c rather than
    in every caller out there.

    The guts of pathname resolution *do* need to be careful - the race
    found by Ritesh is real, as well as several similar races.
    Fortunately, it turns out that we can take care of that with fairly
    local changes in there.

    The tree-wide audit had not been fun, and I hate the idea of repeating
    it. I think the right approach would be to annotate the places where
    we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
    catch regressions. But I'm still not sure what would be the least
    invasive way of doing that and it's clearly the next cycle fodder"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/namei.c: fix missing barriers when checking positivity
    fix dget_parent() fastpath race
    new helper: lookup_positive_unlocked()
    fs/namei.c: pull positivity check into follow_managed()

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     

16 Nov, 2019

1 commit

  • Most of the callers of lookup_one_len_unlocked() treat negatives are
    ERR_PTR(-ENOENT). Provide a helper that would do just that. Note
    that a pinned positive dentry remains positive - it's ->d_inode is
    stable, etc.; a pinned _negative_ dentry can become positive at any
    point as long as you are not holding its parent at least shared.
    So using lookup_one_len_unlocked() needs to be careful;
    lookup_positive_unlocked() is safer and that's what the callers
    end up open-coding anyway.

    Signed-off-by: Al Viro

    Al Viro
     

13 Nov, 2019

7 commits

  • Each kernfs_node is identified with a 64bit ID. The low 32bit is
    exposed as ino and the high gen. While this already allows using inos
    as keys by looking up with wildcard generation number of 0, it's
    adding unnecessary complications for 64bit ino archs which can
    directly use kernfs_node IDs as inos to uniquely identify each cgroup
    instance.

    This patch exposes IDs directly as inos on 64bit ino archs. The
    conversion is mostly straight-forward.

    * 32bit ino archs behave the same as before. 64bit ino archs now use
    the whole 64bit ID as ino and the generation number is fixed at 1.

    * 64bit inos still use the same idr allocator which gurantees that the
    lower 32bits identify the current live instance uniquely and the
    high 32bits are incremented whenever the low bits wrap. As the
    upper 32bits are no longer used as gen and we don't wanna start ino
    allocation with 33rd bit set, the initial value for highbits
    allocation is changed to 0 on 64bit ino archs.

    * blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
    the issuing cgroup. Userland builds FILEID_INO32_GEN fids from
    these numbers to look up the cgroups. To remain compatible with the
    behavior, always output (LOW32,HIGH32) which will be constructed
    back to the original 64bit ID by __kernfs_fh_to_dentry().

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • The current kernfs exportfs implementation uses the generic_fh_*()
    helpers and FILEID_INO32_GEN[_PARENT] which limits ino to 32bits.
    Let's implement custom exportfs operations and fid type to remove the
    restriction.

    * FILEID_KERNFS is a single u64 value whose content is
    kernfs_node->id. This is the only native fid type.

    * For backward compatibility with blk_log_action() path which exposes
    (ino,gen) pairs which userland assembles into FILEID_INO32_GEN keys,
    combine the generic keys into 64bit IDs in the same order.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
    specified ino. On top of that, kernfs_get_node_by_id() and
    kernfs_fh_get_inode() implement full ID matching by testing the rest
    of ID.

    On surface, confusingly, the two are slightly different in that the
    latter uses 0 gen as wildcard while the former doesn't - does it mean
    that the latter can't uniquely identify inodes w/ 0 gen? In practice,
    this is a distinction without a difference because generation number
    starts at 1. There are no actual IDs with 0 gen, so it can always
    safely used as wildcard.

    Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
    to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
    and removing now unnecessary kernfs_get_node_by_id().

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_node->id is currently a union kernfs_node_id which represents
    either a 32bit (ino, gen) pair or u64 value. I can't see much value
    in the usage of the union - all that's needed is a 64bit ID which the
    current code is already limited to. Using a union makes the code
    unnecessarily complicated and prevents using 64bit ino without adding
    practical benefits.

    This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
    ino is stored in the lower 32bits and gen upper. Accessors -
    kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
    ino and gen. This simplifies ID handling less cumbersome and will
    allow using 64bit inos on supported archs.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim
    Cc: Jens Axboe
    Cc: Alexei Starovoitov

    Tejun Heo
     
  • kernfs node can be created in two separate steps - allocation and
    activation. This is used to make kernfs nodes visible only after the
    internal states attached to the node are fully initialized.
    kernfs_find_and_get_node_by_id() currently allows lookups of nodes
    which aren't activated yet and thus can expose nodes are which are
    still being prepped by kernfs users.

    Fix it by disallowing lookups of nodes which aren't activated yet.

    kernfs_find_and_get_node_by_ino()

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_find_and_get_node_by_ino() uses RCU protection. It's currently
    a bit buggy because it can look up a node which hasn't been activated
    yet and thus may end up exposing a node that the kernfs user is still
    prepping.

    While it can be fixed by pushing it further in the current direction,
    it's already complicated and isn't clear whether the complexity is
    justified. The main use of kernfs_find_and_get_node_by_ino() is for
    exportfs operations. They aren't super hot and all the follow-up
    operations (e.g. mapping to path) use normal locking anyway.

    Let's switch to a dumber locking scheme and protect the lookup with
    kernfs_idr_lock.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • When the 32bit ino wraps around, kernfs increments the generation
    number to distinguish reused ino instances. The wrap-around detection
    tests whether the allocated ino is lower than what the cursor but the
    cursor is pointing to the next ino to allocate so the condition never
    triggers.

    Fix it by remembering the last ino and comparing against that.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Fixes: 4a3ef68acacf ("kernfs: implement i_generation")
    Cc: Namhyung Kim
    Cc: stable@vger.kernel.org # v4.14+

    Tejun Heo
     

09 Oct, 2019

1 commit

  • Since the following commit:

    b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

    @nested is no longer used in lock_release(), so remove it from all
    lock_release() calls and friends.

    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Acked-by: Daniel Vetter
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: alexander.levin@microsoft.com
    Cc: daniel@iogearbox.net
    Cc: davem@davemloft.net
    Cc: dri-devel@lists.freedesktop.org
    Cc: duyuyang@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: hannes@cmpxchg.org
    Cc: intel-gfx@lists.freedesktop.org
    Cc: jack@suse.com
    Cc: jlbec@evilplan.or
    Cc: joonas.lahtinen@linux.intel.com
    Cc: joseph.qi@linux.alibaba.com
    Cc: jslaby@suse.com
    Cc: juri.lelli@redhat.com
    Cc: maarten.lankhorst@linux.intel.com
    Cc: mark@fasheh.com
    Cc: mhocko@kernel.org
    Cc: mripard@kernel.org
    Cc: ocfs2-devel@oss.oracle.com
    Cc: rodrigo.vivi@intel.com
    Cc: sean@poorly.run
    Cc: st@kernel.org
    Cc: tj@kernel.org
    Cc: tytso@mit.edu
    Cc: vdavydov.dev@gmail.com
    Cc: vincent.guittot@linaro.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar

    Qian Cai
     

20 Sep, 2019

1 commit

  • Pull y2038 vfs updates from Arnd Bergmann:
    "Add inode timestamp clamping.

    This series from Deepa Dinamani adds a per-superblock minimum/maximum
    timestamp limit for a file system, and clamps timestamps as they are
    written, to avoid random behavior from integer overflow as well as
    having different time stamps on disk vs in memory.

    At mount time, a warning is now printed for any file system that can
    represent current timestamps but not future timestamps more than 30
    years into the future, similar to the arbitrary 30 year limit that was
    added to settimeofday().

    This was picked as a compromise to warn users to migrate to other file
    systems (e.g. ext4 instead of ext3) when they need the file system to
    survive beyond 2038 (or similar limits in other file systems), but not
    get in the way of normal usage"

    * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    ext4: Reduce ext4 timestamp warnings
    isofs: Initialize filesystem timestamp ranges
    pstore: fs superblock limits
    fs: omfs: Initialize filesystem timestamp ranges
    fs: hpfs: Initialize filesystem timestamp ranges
    fs: ceph: Initialize filesystem timestamp ranges
    fs: sysv: Initialize filesystem timestamp ranges
    fs: affs: Initialize filesystem timestamp ranges
    fs: fat: Initialize filesystem timestamp ranges
    fs: cifs: Initialize filesystem timestamp ranges
    fs: nfs: Initialize filesystem timestamp ranges
    ext4: Initialize timestamps limits
    9p: Fill min and max timestamps in sb
    fs: Fill in max and min timestamps in superblock
    utimes: Clamp the timestamps before update
    mount: Add mount warning for impending timestamp expiry
    timestamp_truncate: Replace users of timespec64_trunc
    vfs: Add timestamp_truncate() api
    vfs: Add file timestamp range support

    Linus Torvalds
     

30 Aug, 2019

1 commit

  • Update the inode timestamp updates to use timestamp_truncate()
    instead of timespec64_trunc().

    The change was mostly generated by the following coccinelle
    script.

    virtual context
    virtual patch

    @r1 depends on patch forall@
    struct inode *inode;
    identifier i_xtime =~ "^i_[acm]time$";
    expression e;
    @@

    inode->i_xtime =
    - timespec64_trunc(
    + timestamp_truncate(
    ...,
    - e);
    + inode);

    Signed-off-by: Deepa Dinamani
    Acked-by: Greg Kroah-Hartman
    Acked-by: Jeff Layton
    Cc: adrian.hunter@intel.com
    Cc: dedekind1@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: hch@lst.de
    Cc: jaegeuk@kernel.org
    Cc: jlbec@evilplan.org
    Cc: richard@nod.at
    Cc: tj@kernel.org
    Cc: yuchao0@huawei.com
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: linux-mtd@lists.infradead.org

    Deepa Dinamani
     

25 Jul, 2019

2 commits

  • In kernfs_path_from_node_locked(), there is an if statement on line 147
    to check whether buf is NULL:
    if (buf)

    When buf is NULL, it is used on line 151:
    len += strlcpy(buf + len, parent_str, ...)
    and line 158:
    len += strlcpy(buf + len, "/", ...)
    and line 160:
    len += strlcpy(buf + len, kn->name, ...)

    Thus, possible null-pointer dereferences may occur.

    To fix these possible bugs, buf is checked before being used.
    If it is NULL, -EINVAL is returned.

    These bugs are found by a static analysis tool STCheck written by us.

    Signed-off-by: Jia-Ju Bai
    Link: https://lore.kernel.org/r/20190724022242.27505-1-baijiaju1990@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Jia-Ju Bai
     
  • Get root safely after kn is ensureed to be not null.

    Signed-off-by: Peng Wang
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20190708151611.13242-1-rocking@whu.edu.cn
    Signed-off-by: Greg Kroah-Hartman

    Peng Wang
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

1 commit


08 May, 2019

2 commits

  • Pull misc dcache updates from Al Viro:
    "Most of this pile is putting name length into struct name_snapshot and
    making use of it.

    The beginning of this series ("ovl_lookup_real_one(): don't bother
    with strlen()") ought to have been split in two (separate switch of
    name_snapshot to struct qstr from overlayfs reaping the trivial
    benefits of that), but I wanted to avoid a rebase - by the time I'd
    spotted that it was (a) in -next and (b) close to 5.1-final ;-/"

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    audit_compare_dname_path(): switch to const struct qstr *
    audit_update_watch(): switch to const struct qstr *
    inotify_handle_event(): don't bother with strlen()
    fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
    fsnotify(): switch to passing const struct qstr * for file_name
    switch fsnotify_move() to passing const struct qstr * for old_name
    ovl_lookup_real_one(): don't bother with strlen()
    sysv: bury the broken "quietly truncate the long filenames" logics
    nsfs: unobfuscate
    unexport d_alloc_pseudo()

    Linus Torvalds
     
  • Pull selinux updates from Paul Moore:
    "We've got a few SELinux patches for the v5.2 merge window, the
    highlights are below:

    - Add LSM hooks, and the SELinux implementation, for proper labeling
    of kernfs. While we are only including the SELinux implementation
    here, the rest of the LSM folks have given the hooks a thumbs-up.

    - Update the SELinux mdp (Make Dummy Policy) script to actually work
    on a modern system.

    - Disallow userspace to change the LSM credentials via
    /proc/self/attr when the task's credentials are already overridden.

    The change was made in procfs because all the LSM folks agreed this
    was the Right Thing To Do and duplicating it across each LSM was
    going to be annoying"

    * tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    proc: prevent changes to overridden credentials
    selinux: Check address length before reading address family
    kernfs: fix xattr name handling in LSM helpers
    MAINTAINERS: update SELinux file patterns
    selinux: avoid uninitialized variable warning
    selinux: remove useless assignments
    LSM: lsm_hooks.h - fix missing colon in docstring
    selinux: Make selinux_kernfs_init_security static
    kernfs: initialize security of newly created nodes
    selinux: implement the kernfs_init_security hook
    LSM: add new hook for kernfs node initialization
    kernfs: use simple_xattrs for security attributes
    selinux: try security xattr after genfs for kernfs filesystems
    kernfs: do not alloc iattrs in kernfs_xattr_get
    kernfs: clean up struct kernfs_iattrs
    scripts/selinux: fix build
    selinux: use kernel linux/socket.h for genheaders and mdp
    scripts/selinux: modernize mdp

    Linus Torvalds
     

27 Apr, 2019

1 commit

  • Note that in fnsotify_move() and fsnotify_link() we are guaranteed
    that dentry->d_name won't change during the fsnotify() evaluation
    (by having the parent directory locked exclusive), so we don't
    need to fetch dentry->d_name.name in the callers. In fsnotify_dirent()
    the same stability of dentry->d_name is also true, but it's a bit
    more convoluted - there is one callchain (devpts_pty_new() ->
    fsnotify_create() -> fsnotify_dirent()) where the parent is _not_
    locked, but on devpts ->d_name of everything is unchanging; it
    has neither explicit nor implicit renames.

    Signed-off-by: Al Viro

    Al Viro
     

26 Apr, 2019

1 commit

  • smp_mb__before_atomic() can not be applied to atomic_set(). Remove the
    barrier and rely on RELEASE synchronization.

    Fixes: ba16b2846a8c6 ("kernfs: add an API to get kernfs node from inode number")
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrea Parri
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     

04 Apr, 2019

1 commit

  • The implementation of kernfs_security_xattr_*() helpers reuses the
    kernfs_node_xattr_*() functions, which take the suffix of the xattr name
    and extract full xattr name from it using xattr_full_name(). However,
    this function relies on the fact that the suffix passed to xattr
    handlers from VFS is always constructed from the full name by just
    incerementing the pointer. This doesn't necessarily hold for the callers
    of kernfs_security_xattr_*(), so their usage will easily lead to
    out-of-bounds access.

    Fix this by moving the xattr name reconstruction to the VFS xattr
    handlers and replacing the kernfs_security_xattr_*() helpers with more
    general kernfs_xattr_*() helpers that take full xattr name and allow
    accessing all kernfs node's xattrs.

    Reported-by: kernel test robot
    Fixes: b230d5aba2d1 ("LSM: add new hook for kernfs node initialization")
    Fixes: ec882da5cda9 ("selinux: implement the kernfs_init_security hook")
    Signed-off-by: Ondrej Mosnacek
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

21 Mar, 2019

5 commits

  • Use the new security_kernfs_init_security() hook to allow LSMs to
    possibly assign a non-default security context to a newly created kernfs
    node based on the attributes of the new node and also its parent node.

    This fixes an issue with cgroupfs under SELinux, where newly created
    cgroup subdirectories/files would not inherit its parent's context if
    it had been set explicitly to a non-default value (other than the genfs
    context specified by the policy). This can be reproduced as follows (on
    Fedora/RHEL):

    # mkdir /sys/fs/cgroup/unified/test
    # # Need permissive to change the label under Fedora policy:
    # setenforce 0
    # chcon -t container_file_t /sys/fs/cgroup/unified/test
    # ls -lZ /sys/fs/cgroup/unified
    total 0
    -r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.controllers
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.depth
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.descendants
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.procs
    -r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.stat
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.subtree_control
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.threads
    drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 init.scope
    drwxr-xr-x. 26 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:21 system.slice
    drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0 0 Jan 29 03:15 test
    drwxr-xr-x. 3 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 user.slice
    # mkdir /sys/fs/cgroup/unified/test/subdir

    Actual result:

    # ls -ldZ /sys/fs/cgroup/unified/test/subdir
    drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

    Expected result:

    # ls -ldZ /sys/fs/cgroup/unified/test/subdir
    drwxr-xr-x. 2 root root unconfined_u:object_r:container_file_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

    Link: https://github.com/SELinuxProject/selinux-kernel/issues/39

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • This patch introduces a new security hook that is intended for
    initializing the security data for newly created kernfs nodes, which
    provide a way of storing a non-default security context, but need to
    operate independently from mounts (and therefore may not have an
    associated inode at the moment of creation).

    The main motivation is to allow kernfs nodes to inherit the context of
    the parent under SELinux, similar to the behavior of
    security_inode_init_security(). Other LSMs may implement their own logic
    for handling the creation of new nodes.

    This patch also adds helper functions to for
    getting/setting security xattrs of a kernfs node so that LSMs hooks are
    able to do their job. Other important attributes should be accessible
    direcly in the kernfs_node fields (in case there is need for more, then
    new helpers should be added to kernfs.h along with the patch that needs
    them).

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: more manual merge fixes]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • Replace the special handling of security xattrs with simple_xattrs, as
    is already done for the trusted xattrs. This simplifies the code and
    allows LSMs to use more than just a single xattr to do their business.

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: manual merge fixes]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • This is a read-only operation, so we can simply return -ENODATA if
    kn->iattr is NULL.

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: minor merge fixes]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • Right now, kernfs_iattrs embeds the whole struct iattr, even though it
    doesn't really use half of its fields... This both leads to wasting
    space and makes the code look awkward. Let's just list the few fields
    we need directly in struct kernfs_iattrs.

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: merged a number of chunks manually due to fuzz]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Pull driver core updates from Greg KH:
    "Here is the big driver core patchset for 5.1-rc1

    More patches than "normal" here this merge window, due to some work in
    the driver core by Alexander Duyck to rework the async probe
    functionality to work better for a number of devices, and independant
    work from Rafael for the device link functionality to make it work
    "correctly".

    Also in here is:

    - lots of BUS_ATTR() removals, the macro is about to go away

    - firmware test fixups

    - ihex fixups and simplification

    - component additions (also includes i915 patches)

    - lots of minor coding style fixups and cleanups.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits)
    driver core: platform: remove misleading err_alloc label
    platform: set of_node in platform_device_register_full()
    firmware: hardcode the debug message for -ENOENT
    driver core: Add missing description of new struct device_link field
    driver core: Fix PM-runtime for links added during consumer probe
    drivers/component: kerneldoc polish
    async: Add cmdline option to specify drivers to be async probed
    driver core: Fix possible supplier PM-usage counter imbalance
    PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
    driver: platform: Support parsing GpioInt 0 in platform_get_irq()
    selftests: firmware: fix verify_reqs() return value
    Revert "selftests: firmware: remove use of non-standard diff -Z option"
    Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config"
    device: Fix comment for driver_data in struct device
    kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
    sysfs: remove unused include of kernfs-internal.h
    driver core: Postpone DMA tear-down until after devres release
    driver core: Document limitation related to DL_FLAG_RPM_ACTIVE
    PM-runtime: Take suppliers into account in __pm_runtime_set_status()
    device.h: Add __cold to dev_ logging functions
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Patch series "psi: pressure stall monitors", v3.

    Android is adopting psi to detect and remedy memory pressure that
    results in stuttering and decreased responsiveness on mobile devices.

    Psi gives us the stall information, but because we're dealing with
    latencies in the millisecond range, periodically reading the pressure
    files to detect stalls in a timely fashion is not feasible. Psi also
    doesn't aggregate its averages at a high enough frequency right now.

    This patch series extends the psi interface such that users can
    configure sensitive latency thresholds and use poll() and friends to be
    notified when these are breached.

    As high-frequency aggregation is costly, it implements an aggregation
    method that is optimized for fast, short-interval averaging, and makes
    the aggregation frequency adaptive, such that high-frequency updates
    only happen while monitored stall events are actively occurring.

    With these patches applied, Android can monitor for, and ward off,
    mounting memory shortages before they cause problems for the user. For
    example, using memory stall monitors in userspace low memory killer
    daemon (lmkd) we can detect mounting pressure and kill less important
    processes before device becomes visibly sluggish.

    In our memory stress testing psi memory monitors produce roughly 10x
    less false positives compared to vmpressure signals. Having ability to
    specify multiple triggers for the same psi metric allows other parts of
    Android framework to monitor memory state of the device and act
    accordingly.

    The new interface is straightforward. The user opens one of the
    pressure files for writing and writes a trigger description into the
    file descriptor that defines the stall state - some or full, and the
    maximum stall time over a given window of time. E.g.:

    /* Signal when stall time exceeds 100ms of a 1s window */
    char trigger[] = "full 100000 1000000";
    fd = open("/proc/pressure/memory");
    write(fd, trigger, sizeof(trigger));
    while (poll() >= 0) {
    ...
    }
    close(fd);

    When the monitored stall state is entered, psi adapts its aggregation
    frequency according to what the configured time window requires in order
    to emit event signals in a timely fashion. Once the stalling subsides,
    aggregation reverts back to normal.

    The trigger is associated with the open file descriptor. To stop
    monitoring, the user only needs to close the file descriptor and the
    trigger is discarded.

    Patches 1-4 prepare the psi code for polling support. Patch 5
    implements the adaptive polling logic, the pressure growth detection
    optimized for short intervals, and hooks up write() and poll() on the
    pressure files.

    The patches were developed in collaboration with Johannes Weiner.

    This patch (of 5):

    Kernfs has a standardized poll/notification mechanism for waking all
    pollers on all fds when a filesystem node changes. To allow polling for
    custom events, add a .poll callback that can override the default.

    This is in preparation for pollable cgroup pressure files which have
    per-fd trigger configurations.

    Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
    Signed-off-by: Johannes Weiner
    Signed-off-by: Suren Baghdasaryan
    Cc: Dennis Zhou
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

28 Feb, 2019

1 commit

  • Make kernfs support superblock creation/mount/remount with fs_context.

    This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
    be made to support fs_context also.

    Notes:

    (1) A kernfs_fs_context struct is created to wrap fs_context and the
    kernfs mount parameters are moved in here (or are in fs_context).

    (2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
    namespace tag parameter is passed in the context if desired

    (3) kernfs_free_fs_context() is provided as a destructor for the
    kernfs_fs_context struct, but for the moment it does nothing except
    get called in the right places.

    (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
    pass, but possibly this should be done anyway in case someone wants to
    add a parameter in future.

    (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
    the cgroup v1 and v2 mount parameters are all moved there.

    (6) cgroup1 parameter parsing error messages are now handled by invalf(),
    which allows userspace to collect them directly.

    (7) cgroup1 parameter cleanup is now done in the context destructor rather
    than in the mount/get_tree and remount functions.

    Weirdies:

    (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
    but then uses the resulting pointer after dropping the locks. I'm
    told this is okay and needs commenting.

    (*) The cgroup refcount web. This really needs documenting.

    (*) cgroup2 only has one root?

    Add a suggestion from Thomas Gleixner in which the RDT enablement code is
    placed into its own function.

    [folded a leak fix from Andrey Vagin]

    Signed-off-by: David Howells
    cc: Greg Kroah-Hartman
    cc: Tejun Heo
    cc: Li Zefan
    cc: Johannes Weiner
    cc: cgroups@vger.kernel.org
    cc: fenghua.yu@intel.com
    Signed-off-by: Al Viro

    David Howells
     

08 Feb, 2019

1 commit

  • Creating a new cache for kernfs_iattrs.
    Currently, memory is allocated with kzalloc() which
    always gives aligned memory. On ARM, this is 64 byte aligned.
    To avoid the wastage of memory in aligning the size requested,
    a new cache for kernfs_iattrs is created.

    Size of struct kernfs_iattrs is 80 Bytes.
    On ARM, it will come in kmalloc-128 slab.
    and it will come in kmalloc-192 slab if debug info is enabled.
    Extra bytes taken 48 bytes.

    Total number of objects created : 4096
    Total saving = 48*4096 = 192 KB

    After creating new slab(When debug info is enabled) :
    sh-3.2# cat /proc/slabinfo
    ...
    kernfs_iattrs_cache 4069 4096 128 32 1 : tunables 0 0 0 : slabdata 128 128 0
    ...

    All testing has been done on ARM target.

    Signed-off-by: Ayush Mittal
    Signed-off-by: Vaneet Narang
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Ayush Mittal
     

18 Jan, 2019

2 commits

  • unused now and impossible to use safely anyway.

    Signed-off-by: Al Viro

    Al Viro
     
  • same story as with last May fixes in sysfs (7b745a4e4051
    "unfuck sysfs_mount()"); new_sb is left uninitialized
    in case of early errors in kernfs_mount_ns() and papering
    over it by treating any error from kernfs_mount_ns() as
    equivalent to !new_ns ends up conflating the cases when
    objects had never been transferred to a superblock with
    ones when that has happened and resulting new superblock
    had been dropped. Easily fixed (same way as in sysfs
    case). Additionally, there's a superblock leak on
    kernfs_node_dentry() failure *and* a dentry leak inside
    kernfs_node_dentry() itself - the latter on probably
    impossible errors, but the former not impossible to trigger
    (as the matter of fact, injecting allocation failures
    at that point *does* trigger it).

    Cc: stable@kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

27 Nov, 2018

1 commit

  • kernfs_notify() does two notifications: poll and fsnotify. Originally,
    both notifications were done from scheduled work context and all that
    kernfs_notify() did was schedule the work.

    This patch simply moves the poll notification from the scheduled work
    handler to kernfs_notify(). The fsnotify notification still needs to be
    done from scheduled work context because it can sleep (it needs to lock
    a mutex).

    If the poll notification is time critical (the notified thread needs to
    wake as quickly as possible), it's better to do it from kernfs_notify()
    directly. One example is calling sysfs_notify_dirent() from a hardware
    interrupt handler to wake up a thread and handle the interrupt in user
    space.

    Signed-off-by: Radu Rendec
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Radu Rendec
     

27 Oct, 2018

1 commit

  • The page cache and most shrinkable slab caches hold data that has been
    read from disk, but there are some caches that only cache CPU work, such
    as the dentry and inode caches of procfs and sysfs, as well as the subset
    of radix tree nodes that track non-resident page cache.

    Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
    the shrinker's seeks setting tells the reclaim algorithm that for every
    two page cache pages scanned it should scan one slab object.

    This is a bogus setting. A virtual inode that required no IO to create is
    not twice as valuable as a page cache page; shadow cache entries with
    eviction distances beyond the size of memory aren't either.

    In most cases, the behavior in practice is still fine. Such virtual
    caches don't tend to grow and assert themselves aggressively, and usually
    get picked up before they cause problems. But there are scenarios where
    that's not true.

    Our database workloads suffer from two of those. For one, their file
    workingset is several times bigger than available memory, which has the
    kernel aggressively create shadow page cache entries for the non-resident
    parts of it. The workingset code does tell the VM that most of these are
    expendable, but the VM ends up balancing them 2:1 to cache pages as per
    the seeks setting. This is a huge waste of memory.

    These workloads also deal with tens of thousands of open files and use
    /proc for introspection, which ends up growing the proc_inode_cache to
    absurdly large sizes - again at the cost of valuable cache space, which
    isn't a reasonable trade-off, given that proc inodes can be re-created
    without involving the disk.

    This patch implements a "zero-seek" setting for shrinkers that results in
    a target ratio of 0:1 between their objects and IO-backed caches. This
    allows such virtual caches to grow when memory is available (they do
    cache/avoid CPU work after all), but effectively disables them as soon as
    IO-backed objects are under pressure.

    It then switches the shrinkers for procfs and sysfs metadata, as well as
    excess page cache shadow nodes, to the new zero-seek setting.

    Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Domas Mituzas
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 Sep, 2018

1 commit

  • The terminating NUL byte is only there because the buffer is
    allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the
    range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the
    returned string may not be zero-terminated if it is exactly
    PATH_MAX characters long. Furthermore also the initial loop
    may theoretically exceed PATH_MAX and cause a fault.

    Signed-off-by: Bernd Edlinger
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Bernd Edlinger
     

19 Aug, 2018

1 commit

  • Pull driver core updates from Greg KH:
    "Here are all of the driver core and related patches for 4.19-rc1.

    Nothing huge here, just a number of small cleanups and the ability to
    now stop the deferred probing after init happens.

    All of these have been in linux-next for a while with only a merge
    issue reported"

    * tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
    base: core: Remove WARN_ON from link dependencies check
    drivers/base: stop new probing during shutdown
    drivers: core: Remove glue dirs from sysfs earlier
    driver core: remove unnecessary function extern declare
    sysfs.h: fix non-kernel-doc comment
    PM / Domains: Stop deferring probe at the end of initcall
    iommu: Remove IOMMU_OF_DECLARE
    iommu: Stop deferring probe at end of initcalls
    pinctrl: Support stopping deferred probe after initcalls
    dt-bindings: pinctrl: add a 'pinctrl-use-default' property
    driver core: allow stopping deferred probe after init
    driver core: add a debugfs entry to show deferred devices
    sysfs: Fix internal_create_group() for named group updates
    base: fix order of OF initialization
    linux/device.h: fix kernel-doc notation warning
    Documentation: update firmware loader fallback reference
    kobject: Replace strncpy with memcpy
    drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
    kernfs: Replace strncpy with memcpy
    device: Add #define dev_fmt similar to #define pr_fmt
    ...

    Linus Torvalds
     

21 Jul, 2018

1 commit

  • This change allows creating kernfs files and directories with arbitrary
    uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
    kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
    The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
    and always create objects belonging to the global root.

    When creating symlinks ownership (uid/gid) is taken from the target kernfs
    object.

    Co-Developed-by: Tyler Hicks
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     

07 Jul, 2018

1 commit

  • gcc 8.1.0 complains:

    fs/kernfs/symlink.c:91:3: warning:
    'strncpy' output truncated before terminating nul copying
    as many bytes from a string as its length
    fs/kernfs/symlink.c: In function 'kernfs_iop_get_link':
    fs/kernfs/symlink.c:88:14: note: length computed here

    Using strncpy() is indeed less than perfect since the length of data to
    be copied has already been determined with strlen(). Replace strncpy()
    with memcpy() to address the warning and optimize the code a little.

    Signed-off-by: Guenter Roeck
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Guenter Roeck