17 Jun, 2020

1 commit

  • [ Upstream commit 2f02fd3fa13e51713b630164f8a8e5b42de8283b ]

    The comments in fanotify_group_event_mask() say:

    "If the event is on dir/child and this mark doesn't care about
    events on dir/child, don't send it!"

    Specifically, mount and filesystem marks do not care about events
    on child, but they can still specify an ignore mask for those events.
    For example, a group that has:
    - A mount mark with mask 0 and ignore_mask FAN_OPEN
    - An inode mark on a directory with mask FAN_OPEN | FAN_OPEN_EXEC
    with flag FAN_EVENT_ON_CHILD

    A child file open for exec would be reported to group with the FAN_OPEN
    event despite the fact that FAN_OPEN is in ignore mask of mount mark,
    because the mark iteration loop skips over non-inode marks for events
    on child when calculating the ignore mask.

    Move ignore mask calculation to the top of the iteration loop block
    before excluding marks for events on dir/child.
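
    The ordering change can be sketched in userspace C. This is a minimal
    model, not the kernel's fanotify_group_event_mask(): the struct, the
    function name, the ex_/EX_ identifiers, and the bit values are all
    hypothetical, and the ignore_first parameter exists only to compare the
    two orderings side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values for this sketch only; not taken from kernel headers. */
#define EX_OPEN           0x01u
#define EX_OPEN_EXEC      0x02u
#define EX_EVENT_ON_CHILD 0x80u /* flag bit, not an event */

enum ex_mark_type { EX_MARK_INODE, EX_MARK_MOUNT, EX_MARK_SB };

struct ex_mark {
    enum ex_mark_type type;
    uint32_t mask;         /* events the mark asks for, plus flag bits */
    uint32_t ignored_mask; /* events the mark asks to suppress */
};

/*
 * Return the events to report to a group. When ignore_first is nonzero,
 * each mark's ignore mask is accumulated before the mark can be skipped
 * for an event on a child (the ordering the commit moves to). When it is
 * zero, skipped marks never contribute their ignore mask (the old ordering).
 */
uint32_t ex_group_event_mask(const struct ex_mark *marks, size_t n,
                             uint32_t event_mask, int on_child,
                             int ignore_first)
{
    uint32_t wanted = 0, ignored = 0;

    assert(marks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct ex_mark *m = &marks[i];
        /* Only inode marks with EX_EVENT_ON_CHILD care about child events. */
        int skip = on_child && (m->type != EX_MARK_INODE ||
                                !(m->mask & EX_EVENT_ON_CHILD));

        if (ignore_first)
            ignored |= m->ignored_mask;
        if (skip)
            continue;
        if (!ignore_first)
            ignored |= m->ignored_mask;
        wanted |= m->mask;
    }
    return event_mask & wanted & ~ignored;
}
```

    With the two marks from the example above and an open-for-exec event on
    a child, the old ordering still reports EX_OPEN, while the new ordering
    leaves only EX_OPEN_EXEC.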

    Link: https://lore.kernel.org/r/20200524072441.18258-1-amir73il@gmail.com
    Reported-by: Jan Kara
    Link: https://lore.kernel.org/linux-fsdevel/20200521162443.GA26052@quack2.suse.cz/
    Fixes: 55bf882c7f13 "fanotify: fix merging marks masks with FAN_ONDIR"
    Fixes: b469e7e47c8a "fanotify: fix handling of events on child..."
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Amir Goldstein
     

20 May, 2020

1 commit

  • commit 55bf882c7f13dda8bbe624040c6d5b4fbb812d16 upstream.

    Change the logic of FAN_ONDIR in two ways that are similar to the logic
    of FAN_EVENT_ON_CHILD, that was fixed in commit 54a307ba8d3c ("fanotify:
    fix logic of events on child"):

    1. The flag is meaningless in ignore mask
    2. The flag refers only to events in the mask of the mark where it is set

    This is what the fanotify_mark.2 man page says about FAN_ONDIR:
    "Without this flag, only events for files are created." It doesn't
    say anything about setting this flag in ignore mask to stop getting
    events on directories nor can I think of any setup where this capability
    would be useful.

    Currently, when marks masks are merged, the FAN_ONDIR flag set in one
    mark affects the events that are set in another mark's mask and this
    behavior causes unexpected results. For example, a user adds a mark on a
    directory with mask FAN_ATTRIB | FAN_ONDIR and a mount mark with mask
    FAN_OPEN (without FAN_ONDIR). An opendir() of that directory (which is
    inside that mount) generates a FAN_OPEN event even though neither of the
    marks requested to get open events on directories.
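
    The difference between the merged and the per-mark interpretation of
    the flag can be illustrated with a small userspace model. Everything
    here (struct, function names, bit values) is hypothetical and only
    mirrors the example in the paragraph above; it is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values for this sketch only. */
#define EX_ATTRIB 0x01u
#define EX_OPEN   0x02u
#define EX_ONDIR  0x80u /* flag bit, not an event */

struct ex_ondir_mark {
    uint32_t mask; /* requested events plus the optional EX_ONDIR flag */
};

/* Old behaviour: masks are OR-ed first, so EX_ONDIR on one mark leaks to all. */
uint32_t ex_merged_ondir(const struct ex_ondir_mark *marks, size_t n,
                         uint32_t event, int is_dir)
{
    uint32_t all = 0;

    assert(marks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        all |= marks[i].mask;
    if (is_dir && !(all & EX_ONDIR))
        return 0;
    return event & all & ~EX_ONDIR;
}

/* New behaviour: a mark contributes to directory events only if it set EX_ONDIR. */
uint32_t ex_per_mark_ondir(const struct ex_ondir_mark *marks, size_t n,
                           uint32_t event, int is_dir)
{
    uint32_t wanted = 0;

    assert(marks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_dir && !(marks[i].mask & EX_ONDIR))
            continue;
        wanted |= marks[i].mask & ~EX_ONDIR;
    }
    return event & wanted;
}
```

    For a directory mark with EX_ATTRIB | EX_ONDIR and a mount mark with
    EX_OPEN, the merged variant reports an open of the directory, while the
    per-mark variant does not.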

    Link: https://lore.kernel.org/r/20200319151022.31456-10-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Cc: Rachel Sibley
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     

14 May, 2020

2 commits

  • [ Upstream commit f367a62a7cad2447d835a9f14fc63997a9137246 ]

    With inotify, when a watch is set on a directory and on its child, an
    event on the child is reported twice, once with wd of the parent watch
    and once with wd of the child watch without the filename.

    With fanotify, when a watch is set on a directory and on its child, an
    event on the child is reported twice, but it has the exact same
    information - either an open file descriptor of the child or an encoded
    fid of the child.

    The reason that the two identical events are not merged is because the
    object id used for merging events in the queue is the child inode in one
    event and parent inode in the other.

    For events with path or dentry data, use the victim inode instead of the
    watched inode as the object id for event merging, so that the event
    reported on parent will be merged with the event reported on the child.
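
    A rough userspace sketch of why the choice of object id decides whether
    the two events merge. The queue, its fixed size, and all ex_ names are
    assumptions made for illustration; the real merge logic lives in the
    fanotify event queue and compares more fields than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_QUEUE_MAX 8

struct ex_event {
    uintptr_t objectid; /* opaque id: only compared, never dereferenced */
    uint32_t mask;
};

struct ex_queue {
    struct ex_event ev[EX_QUEUE_MAX];
    size_t len;
};

/* Pick the id used for merging: the watched object (old) or the victim (new). */
uintptr_t ex_object_id(uintptr_t watched, uintptr_t victim, int use_victim)
{
    return use_victim ? victim : watched;
}

/* Returns 1 if merged with a queued event, 0 if appended, -1 if the queue is full. */
int ex_queue_event(struct ex_queue *q, uintptr_t objectid, uint32_t mask)
{
    assert(q != NULL);
    for (size_t i = 0; i < q->len; i++) {
        if (q->ev[i].objectid == objectid && q->ev[i].mask == mask)
            return 1; /* identical event already queued */
    }
    if (q->len == EX_QUEUE_MAX)
        return -1;
    q->ev[q->len].objectid = objectid;
    q->ev[q->len].mask = mask;
    q->len++;
    return 0;
}
```

    Queuing the same child event once through the parent watch and once
    through the child watch leaves two entries when the watched object is
    the id, and a single entry when the victim is the id.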

    Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Amir Goldstein
     
  • [ Upstream commit dfc2d2594e4a79204a3967585245f00644b8f838 ]

    The event inode field is used only for comparison in queue merges and
    cannot be dereferenced after handle_event(), because it does not hold a
    refcount on the inode.

    Replace it with an abstract id to do the same thing.

    Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Amir Goldstein
     

12 Jan, 2020

2 commits

  • [ Upstream commit 1edc8eb2e93130e36ac74ac9c80913815a57d413 ]

    When a filesystem is unmounted, we currently call fsnotify_sb_delete()
    before evict_inodes(), which means that fsnotify_unmount_inodes()
    must iterate over all inodes on the superblock looking for any inodes
    with watches. This is inefficient and can lead to livelocks as it
    iterates over many unwatched inodes.

    At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
    the inode out immediately, so anything processed by
    fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.

    After that, the call to evict_inodes will evict everything else with a
    zero refcount.

    This should speed things up overall, and avoid livelocks in
    fsnotify_unmount_inodes().

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Eric Sandeen
     
  • [ Upstream commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 ]

    Anything that walks all inodes on sb->s_inodes list without rescheduling
    risks softlockups.

    Previous efforts were made in 2 functions, see:

    c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
    ac05fbb inode: don't softlockup when evicting inodes

    but there hasn't been an audit of all walkers, so do that now. This
    also consistently moves the cond_resched() calls to the bottom of each
    loop in cases where it already exists.

    One loop remains: remove_dquot_ref(), because I'm not quite sure how
    to deal with that one w/o taking the i_lock.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Eric Sandeen
     

28 Sep, 2019

1 commit

  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - Add a new knfsd file cache, so that we don't have to open and close
    on each (NFSv2/v3) READ or WRITE. This can speed up read and write
    in some cases. It also replaces our readahead cache.

    - Prevent silent data loss on write errors, by treating write errors
    like server reboots for the purposes of write caching, thus forcing
    clients to resend their writes.

    - Tweak the code that allocates sessions to be more forgiving, so
    that NFSv4.1 mounts are less likely to hang when a server already
    has a lot of clients.

    - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
    limited only by the backend filesystem and the maximum RPC size.

    - Allow the server to enforce use of the correct kerberos credentials
    when a client reclaims state after a reboot.

    And some miscellaneous smaller bugfixes and cleanup"

    * tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
    sunrpc: clean up indentation issue
    nfsd: fix nfs read eof detection
    nfsd: Make nfsd_reset_boot_verifier_locked static
    nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
    nfsd: handle drc over-allocation gracefully.
    nfsd: add support for upcall version 2
    nfsd: add a "GetVersion" upcall for nfsdcld
    nfsd: Reset the boot verifier on all write I/O errors
    nfsd: Don't garbage collect files that might contain write errors
    nfsd: Support the server resetting the boot verifier
    nfsd: nfsd_file cache entries should be per net namespace
    nfsd: eliminate an unnecessary acl size limit
    Deprecate nfsd fault injection
    nfsd: remove duplicated include from filecache.c
    nfsd: Fix the documentation for svcxdr_tmpalloc()
    nfsd: Fix up some unused variable warnings
    nfsd: close cached files prior to a REMOVE or RENAME that would replace target
    nfsd: rip out the raparms cache
    nfsd: have nfsd_test_lock use the nfsd_file cache
    nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
    ...

    Linus Torvalds
     

24 Sep, 2019

1 commit

  • Pull selinux updates from Paul Moore:

    - Add LSM hooks, and SELinux access control hooks, for dnotify,
    fanotify, and inotify watches. This has been discussed with both the
    LSM and fs/notify folks and everybody is good with these new hooks.

    - The LSM stacking changes missed a few calls to current_security() in
    the SELinux code; we fix those and remove current_security() for
    good.

    - Improve our network object labeling cache so that we always return
    the object's label, even when under memory pressure. Previously we
    would return an error if we couldn't allocate a new cache entry, now
    we always return the label even if we can't create a new cache entry
    for it.

    - Convert the sidtab atomic_t counter to a normal u32 with
    READ/WRITE_ONCE() and memory barrier protection.

    - A few patches to policydb.c to clean things up (remove forward
    declarations, long lines, bad variable names, etc)

    * tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    lsm: remove current_security()
    selinux: fix residual uses of current_security() for the SELinux blob
    selinux: avoid atomic_t usage in sidtab
    fanotify, inotify, dnotify, security: add security hook for fs notifications
    selinux: always return a secid from the network caches if we find one
    selinux: policydb - rename type_val_to_struct_array
    selinux: policydb - fix some checkpatch.pl warnings
    selinux: shuffle around policydb.c to get rid of forward declarations

    Linus Torvalds
     

19 Aug, 2019

1 commit


13 Aug, 2019

1 commit

  • As of now, setting watches on filesystem objects has, at most, applied a
    check for read access to the inode, and in the case of fanotify, requires
    CAP_SYS_ADMIN. No specific security hook or permission check has been
    provided to control the setting of watches. Using any of inotify, dnotify,
    or fanotify, it is possible to observe, not only write-like operations, but
    even read access to a file. Modeling the watch as being merely a read from
    the file is insufficient for the needs of SELinux. This is due to the fact
    that read access should not necessarily imply access to information about
    when another process reads from a file. Furthermore, fanotify watches grant
    more power to an application in the form of permission events. While
    notification events are solely unidirectional (i.e. they only pass
    information to the receiving application), permission events are blocking.
    Permission events make a request to the receiving application which will
    then reply with a decision as to whether or not that action may be
    completed. This causes the issue of the watching application having the
    ability to exercise control over the triggering process. Without drawing a
    distinction within the permission check, the ability to read would imply
    the greater ability to control an application. Additionally, mount and
    superblock watches apply to all files within the same mount or superblock.
    Read access to one file should not necessarily imply the ability to watch
    all files accessed within a given mount or superblock.

    In order to solve these issues, a new LSM hook is implemented and has been
    placed within the system calls for marking filesystem objects with inotify,
    fanotify, and dnotify watches. These calls to the hook are placed at the
    point at which the target path has been resolved and are provided with the
    path struct, the mask of requested notification events, and the type of
    object on which the mark is being set (inode, superblock, or mount). The
    mask and obj_type have already been translated into common FS_* values
    shared by the entirety of the fs notification infrastructure. The path
    struct is passed rather than just the inode so that the mount is available,
    particularly for mount watches. This also allows for use of the hook by
    pathname-based security modules. However, since the hook is intended for
    use even by inode based security modules, it is not placed under the
    CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
    modules would need to enable all of the path hooks, even though they do not
    use any of them.

    This only provides a hook at the point of setting a watch, and presumes
    that permission to set a particular watch implies the ability to receive
    all notifications about that object which match the mask. This is all that
    is required for SELinux. If other security modules require additional hooks
    or infrastructure to control delivery of notification, these can be added
    by them. It does not make sense for us to propose hooks for which we have
    no implementation. The understanding that all notifications received by the
    requesting application are all strictly of a type for which the application
    has been granted permission shows that this implementation is sufficient in
    its coverage.

    Security modules wishing to provide complete control over fanotify must
    also implement a security_file_open hook that validates that the access
    requested by the watching application is authorized. Fanotify has the issue
    that it returns a file descriptor with the file mode specified during
    fanotify_init() to the watching process on event. This is already covered
    by the LSM security_file_open hook if the security module implements
    checking of the requested file mode there. Otherwise, a watching process
    can obtain escalated access to a file for which it has not been authorized.

    The selinux_path_notify hook implementation works by adding five new file
    permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
    (descriptions about which will follow), and one new filesystem permission:
    watch (which is applied to superblock checks). The hook then decides which
    subset of these permissions must be held by the requesting application
    based on the contents of the provided mask and the obj_type. The
    selinux_file_open hook already checks the requested file mode and therefore
    ensures that a watching process cannot escalate its access through
    fanotify.

    The watch, watch_mount, and watch_sb permissions are the baseline
    permissions for setting a watch on an object and each are a requirement for
    any watch to be set on a file, mount, or superblock respectively. It should
    be noted that having either of the other two permissions (watch_reads and
    watch_with_perm) does not imply the watch, watch_mount, or watch_sb
    permission. Superblock watches further require the filesystem watch
    permission to the superblock. As there is no labeled object in view for
    mounts, there is no specific check for mount watches beyond watch_mount to
    the inode. Such a check could be added in the future, if a suitable labeled
    object existed representing the mount.

    The watch_reads permission is required to receive notifications from
    read-exclusive events on filesystem objects. These events include accessing
    a file for the purpose of reading and closing a file which has been opened
    read-only. This distinction has been drawn in order to provide a direct
    indication in the policy for this otherwise not obvious capability. Read
    access to a file should not necessarily imply the ability to observe read
    events on a file.

    Finally, watch_with_perm only applies to fanotify masks since it is the
    only way to set a mask which allows for the blocking, permission event.
    This permission is needed for any watch which is of this type. Though
    fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
    trust to root, which we do not do, and does not support least privilege.
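
    The permission selection described in the last four paragraphs can be
    sketched as a pure function. This is an illustration under stated
    assumptions, not the selinux_path_notify implementation: the EX_ bit
    values and names are made up, the separate filesystem watch check for
    superblocks is folded into the same return value, and treating the
    access permission event as read-like is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum ex_obj_type { EX_OBJ_INODE, EX_OBJ_MOUNT, EX_OBJ_SB };

/* Illustrative event bits for this sketch only. */
#define EX_FS_ACCESS        0x01u
#define EX_FS_MODIFY        0x02u
#define EX_FS_CLOSE_NOWRITE 0x04u
#define EX_FS_OPEN_PERM     0x10u
#define EX_FS_ACCESS_PERM   0x20u
#define EX_FS_PERM_EVENTS   (EX_FS_OPEN_PERM | EX_FS_ACCESS_PERM)
/* Read-like events; including EX_FS_ACCESS_PERM is an assumption here. */
#define EX_FS_READ_EVENTS   (EX_FS_ACCESS | EX_FS_ACCESS_PERM | EX_FS_CLOSE_NOWRITE)

/* Illustrative permission bits named after the permissions in the text. */
#define EX_FILE_WATCH           0x01u
#define EX_FILE_WATCH_MOUNT     0x02u
#define EX_FILE_WATCH_SB        0x04u
#define EX_FILE_WATCH_READS     0x08u
#define EX_FILE_WATCH_WITH_PERM 0x10u
#define EX_FILESYSTEM_WATCH     0x20u

/* Return the set of permissions a watcher would need for this mask and type. */
uint32_t ex_required_perms(uint32_t mask, enum ex_obj_type type)
{
    uint32_t perm = 0;

    switch (type) {
    case EX_OBJ_MOUNT:
        perm = EX_FILE_WATCH_MOUNT;
        break;
    case EX_OBJ_SB:
        /* Superblock watches also need the filesystem-level watch permission. */
        perm = EX_FILE_WATCH_SB | EX_FILESYSTEM_WATCH;
        break;
    case EX_OBJ_INODE:
        perm = EX_FILE_WATCH;
        break;
    }
    if (mask & EX_FS_PERM_EVENTS)  /* blocking permission events */
        perm |= EX_FILE_WATCH_WITH_PERM;
    if (mask & EX_FS_READ_EVENTS)  /* read-exclusive events */
        perm |= EX_FILE_WATCH_READS;
    return perm;
}
```

    Note that the baseline permission is always included: a mask that needs
    watch_reads or watch_with_perm never replaces watch, watch_mount, or
    watch_sb.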

    Signed-off-by: Aaron Goidel
    Acked-by: Casey Schaufler
    Acked-by: Jan Kara
    Signed-off-by: Paul Moore

    Aaron Goidel
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate the user supplied value between an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    minimum and maximum allowed value.

    In every source file that declares sysctl handlers there are some
    read-only variables containing just an integer whose address is
    assigned to the extra1 and extra2 members, so the sysctl range is
    enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
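
    The idea of one shared constant array plus accessor macros can be
    sketched in userspace as follows. The ex_/EX_ names, the table struct,
    and the range-checking helper are hypothetical stand-ins for
    illustration; they are not the kernel's struct ctl_table or
    proc_dointvec_minmax().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One shared table instead of per-file zero/one/int_max variables. */
const int ex_sysctl_vals[] = { 0, 1, INT_MAX };

/* Macros that name the array members, so callers do not index by hand. */
#define EX_SYSCTL_ZERO    ((void *)&ex_sysctl_vals[0])
#define EX_SYSCTL_ONE     ((void *)&ex_sysctl_vals[1])
#define EX_SYSCTL_INT_MAX ((void *)&ex_sysctl_vals[2])

struct ex_ctl_table {
    int *data;    /* the tunable being set */
    void *extra1; /* pointer to the minimum allowed value, or NULL */
    void *extra2; /* pointer to the maximum allowed value, or NULL */
};

/* Store val if it lies within [*extra1, *extra2]; return 0 on success, -1 otherwise. */
int ex_store_minmax(const struct ex_ctl_table *t, int val)
{
    assert(t != NULL && t->data != NULL);
    if (t->extra1 && val < *(const int *)t->extra1)
        return -1;
    if (t->extra2 && val > *(const int *)t->extra2)
        return -1;
    *t->data = val;
    return 0;
}
```

    A boolean-style tunable would then point extra1 and extra2 at
    EX_SYSCTL_ZERO and EX_SYSCTL_ONE rather than at file-local variables.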

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data old new delta
    sysctl_vals - 12 +12
    __kstrtab_sysctl_vals - 12 +12
    max 14 10 -4
    int_max 16 - -16
    one 68 - -68
    zero 128 28 -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

13 Jul, 2019

1 commit

  • Commit d46eb14b735b ("fs: fsnotify: account fsnotify metadata to
    kmemcg") added remote memcg charging for fanotify and inotify event
    objects. The aim was to charge the memory to the listener who is
    interested in the events but without triggering the OOM killer.
    Otherwise there would be security concerns for the listener.

    At the time, oom-kill trigger was not in the charging path. A parallel
    work added the oom-kill back to charging path i.e. commit 29ef680ae7c2
    ("memcg, oom: move out_of_memory back to the charge path"). So to not
    trigger oom-killer in the remote memcg, explicitly add
    __GFP_RETRY_MAYFAIL to the fanotify and inotify event allocations.

    Link: http://lkml.kernel.org/r/20190514212259.156585-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Jan Kara
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Amir Goldstein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

11 Jul, 2019

1 commit

  • Pull fsnotify updates from Jan Kara:
    "This contains cleanups of the fsnotify name removal hook and also a
    patch to disable fanotify permission events for 'proc' filesystem"

    * tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: get rid of fsnotify_nameremove()
    fsnotify: move fsnotify_nameremove() hook out of d_delete()
    configfs: call fsnotify_rmdir() hook
    debugfs: call fsnotify_{unlink,rmdir}() hooks
    debugfs: simplify __debugfs_remove_file()
    devpts: call fsnotify_unlink() hook
    tracefs: call fsnotify_{unlink,rmdir}() hooks
    rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks
    btrfs: call fsnotify_rmdir() hook
    fsnotify: add empty fsnotify_{unlink,rmdir}() hooks
    fanotify: Disallow permission events for proc filesystem

    Linus Torvalds
     

20 Jun, 2019

1 commit

  • For all callers of fsnotify_{unlink,rmdir}(), we made sure that d_parent
    and d_name are stable. Therefore, fsnotify_{unlink,rmdir}() do not need
    the safety measures in fsnotify_nameremove() to stabilize parent and name.
    We can now simplify those hooks and get rid of fsnotify_nameremove().

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

19 Jun, 2019

1 commit

  • When implementing connector fsid cache, we only initialized the cache
    when the first mark added to object was added by FAN_REPORT_FID group.
    We forgot to update conn->fsid when the second mark is added by
    FAN_REPORT_FID group to an already attached connector without fsid
    cache.

    Reported-and-tested-by: syzbot+c277e8e2f46414645508@syzkaller.appspotmail.com
    Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

29 May, 2019

1 commit

  • Proc filesystem has special locking rules for various files. Thus
    fanotify which opens files on event delivery can easily deadlock
    against another process that waits for fanotify permission event to be
    handled. Since permission events on /proc have doubtful value anyway,
    just disallow them.

    Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 or at your option any
    later version this program is distributed in the hope that it will
    be useful but without any warranty without even the implied warranty
    of merchantability or fitness for a particular purpose see the gnu
    general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 44 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190523091651.032047323@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

2 commits

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 or at your option any
    later version this program is distributed in the hope that it will
    be useful but without any warranty without even the implied warranty
    of merchantability or fitness for a particular purpose see the gnu
    general public license for more details you should have received a
    copy of the gnu general public license along with this program see
    the file copying if not write to the free software foundation 675
    mass ave cambridge ma 02139 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 52 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Jilayne Lovejoy
    Reviewed-by: Steve Winslow
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190519154042.342335923@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Add SPDX license identifiers to all Make/Kconfig files which:

    - Have no license information of any form

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

14 May, 2019

1 commit


09 May, 2019

1 commit

  • __fsnotify_parent() has an optimization in place to avoid unneeded
    take_dentry_name_snapshot(). When fsnotify_nameremove() was changed
    not to call __fsnotify_parent(), we left out the optimization.
    Kernel test robot reported a 5% performance regression in concurrent
    unlink() workload.

    Reported-by: kernel test robot
    Link: https://lore.kernel.org/lkml/20190505062153.GG29809@shao2-debian/
    Link: https://lore.kernel.org/linux-fsdevel/20190104090357.GD22409@quack2.suse.cz/
    Fixes: 5f02a8776384 ("fsnotify: annotate directory entry modification events")
    CC: stable@vger.kernel.org
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

08 May, 2019

2 commits

  • Pull misc dcache updates from Al Viro:
    "Most of this pile is putting name length into struct name_snapshot and
    making use of it.

    The beginning of this series ("ovl_lookup_real_one(): don't bother
    with strlen()") ought to have been split in two (separate switch of
    name_snapshot to struct qstr from overlayfs reaping the trivial
    benefits of that), but I wanted to avoid a rebase - by the time I'd
    spotted that it was (a) in -next and (b) close to 5.1-final ;-/"

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    audit_compare_dname_path(): switch to const struct qstr *
    audit_update_watch(): switch to const struct qstr *
    inotify_handle_event(): don't bother with strlen()
    fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
    fsnotify(): switch to passing const struct qstr * for file_name
    switch fsnotify_move() to passing const struct qstr * for old_name
    ovl_lookup_real_one(): don't bother with strlen()
    sysv: bury the broken "quietly truncate the long filenames" logics
    nsfs: unobfuscate
    unexport d_alloc_pseudo()

    Linus Torvalds
     
  • Pull pidfd updates from Christian Brauner:
    "This patchset makes it possible to retrieve pidfds at process creation
    time by introducing the new flag CLONE_PIDFD to the clone() system
    call. Linus originally suggested to implement this as a new flag to
    clone() instead of making it a separate system call.

    After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
    parent_tidptr argument. This means we can give back the associated pid
    and the pidfd at the same time. Access to process metadata information
    thus becomes rather trivial.

    As has been agreed, CLONE_PIDFD creates file descriptors based on
    anonymous inodes similar to the new mount api. They are made
    unconditional by this patchset as they are now needed by core kernel
    code (vfs, pidfd) even more than they already were before (timerfd,
    signalfd, io_uring, epoll etc.). The core patchset is rather small.
    The bulky looking changelist is caused by David's very simple changes
    to Kconfig to make anon inodes unconditional.

    A pidfd comes with additional information in fdinfo if the kernel
    supports procfs. The fdinfo file contains the pid of the process in
    the callers pid namespace in the same format as the procfs status
    file, i.e. "Pid:\t%d".

    To remove worries about missing metadata access this patchset comes
    with a sample/test program that illustrates how a combination of
    CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
    access to process metadata through /proc/.

    Further work based on this patchset has been done by Joel. His work
    makes pidfds pollable. It finished too late for this merge window. I
    would prefer to have it sitting in linux-next for a while and send it
    for inclusion during the 5.3 merge window"

    * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    samples: show race-free pidfd metadata access
    signal: support CLONE_PIDFD with pidfd_send_signal
    clone: add CLONE_PIDFD
    Make anon_inodes unconditional

    Linus Torvalds
     

02 May, 2019

1 commit


29 Apr, 2019

1 commit

  • fanotify_get_fsid() is reading mark->connector->fsid under srcu. It can
    happen that it sees mark not fully initialized or mark that is already
    detached from the object list. In these cases mark->connector
    can be NULL leading to NULL ptr dereference. Fix the problem by
    being careful when reading mark->connector and check it for being NULL.
    Also use WRITE_ONCE when writing the mark just to prevent compiler from
    doing something stupid.
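
    The read-once-then-check pattern can be approximated in userspace with
    C11 atomics. This is only an analogy: the kernel code runs under SRCU
    and uses its own once-accessors, neither of which is modeled here, and
    every ex_ name below is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ex_fsid { int val[2]; };

struct ex_connector { struct ex_fsid fsid; };

struct ex_mark {
    /* NULL while the mark is not yet attached or already detached. */
    _Atomic(struct ex_connector *) connector;
};

/* Publish the connector pointer with a single store. */
void ex_attach(struct ex_mark *mark, struct ex_connector *conn)
{
    atomic_store_explicit(&mark->connector, conn, memory_order_release);
}

/*
 * Read the pointer exactly once into a local, check it for NULL, and only
 * then dereference the local copy. Returns 0 on success, -1 if the mark
 * has no connector.
 */
int ex_get_fsid(struct ex_mark *mark, struct ex_fsid *out)
{
    struct ex_connector *conn =
        atomic_load_explicit(&mark->connector, memory_order_acquire);

    assert(out != NULL);
    if (!conn)
        return -1;
    *out = conn->fsid;
    return 0;
}
```

    Reading into a local first matters because a second read of
    mark->connector after the NULL check could observe a different value.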

    Reported-by: syzbot+15927486a4f1bfcbaf91@syzkaller.appspotmail.com
    Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector")
    Signed-off-by: Jan Kara

    Jan Kara
     

27 Apr, 2019

4 commits


19 Apr, 2019

1 commit

  • Make the anon_inodes facility unconditional so that it can be used by core
    VFS code and pidfd code.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    [christian@brauner.io: adapt commit message to mention pidfds]
    Signed-off-by: Christian Brauner

    David Howells
     

19 Mar, 2019

1 commit

  • When file handle is embedded inside fanotify_event and usercopy checks
    are enabled, we get a warning like:

    Bad or missing usercopy whitelist? Kernel memory exposure attempt detected
    from SLAB object 'fanotify_event' (offset 40, size 8)!
    WARNING: CPU: 1 PID: 7649 at mm/usercopy.c:78 usercopy_warn+0xeb/0x110
    mm/usercopy.c:78

    Annotate handling in fanotify_event properly to mark that copying it to
    userspace is fine.

    Reported-by: syzbot+2c49971e251e36216d1f@syzkaller.appspotmail.com
    Fixes: a8b13aa20afb ("fanotify: enable FAN_REPORT_FID init flag")
    Signed-off-by: Kees Cook
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

11 Mar, 2019

1 commit


21 Feb, 2019

1 commit

  • Making waits for response to fanotify permission events interruptible
    can result in EINTR returns from open(2) or other syscalls when there's
    e.g. AV software that's monitoring the file. Orion reports that e.g.
    bash is complaining like:

    bash: /etc/bash_completion.d/itweb-settings.bash: Interrupted system call

    So for now convert the wait from an interruptible to a killable-only one.
    That is mostly invisible to userspace. Sadly this breaks hibernation
    with fanotify permission events pending again but we have to put more
    thought into how to fix this without regressing userspace visible
    behavior.

    Reported-by: Orion Poplawski
    Signed-off-by: Jan Kara

    Jan Kara
     

18 Feb, 2019

6 commits

  • When waiting for response to fanotify permission events, we currently
    use uninterruptible waits. That makes code simple however it can cause
    lots of processes to end up in uninterruptible sleep with hard reboot
    being the only alternative in case fanotify listener process stops
    responding (e.g. due to a bug in its implementation). Uninterruptible
    sleep also makes system hibernation fail if the listener gets frozen
    before the process generating fanotify permission event.

    Fix these problems by using interruptible sleep for waiting for response
    to fanotify event. This is slightly tricky though - we have to
    detect when the event got already reported to userspace as in that
    case we must not free the event. Instead we push the responsibility for
    freeing the event to the process that will write response to the
    event.

    Reported-by: Orion Poplawski
    Reported-by: Konstantin Khlebnikov
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Track whether permission event got already reported to userspace and
    whether userspace already answered to the permission event. Protect
    stores to this field together with updates to ->response field by
    group->notification_lock. This will allow aborting wait for reply to
    permission event from userspace.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Simplify iteration cleaning access_list in fanotify_release(). That will
    make following changes more obvious.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Create function to remove event from the notification list. Later it will
    be used from more places.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • get_one_event() has a single caller and that just locks
    notification_lock around the call. Move locking inside get_one_event()
    as that will make using ->response field for permission event state
    easier.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Fold dequeue_event() into process_access_response(). This will make
    changes to use of ->response field easier.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

15 Feb, 2019

1 commit

  • Fanotify now uses exportfs_encode_inode_fh() so it needs to select
    EXPORTFS.

    Fixes: e9e0c8903009 "fanotify: encode file identifier for FAN_REPORT_FID"
    Reported-by: Randy Dunlap
    Signed-off-by: Jan Kara

    Jan Kara