09 Jun, 2022

2 commits

  • [ Upstream commit 623af4f538b5df9b416e1b82f720af7371b4c771 ]

    Commit 6960b0d909cd ("fsnotify: change locking order") changed some
    of the mark_mutex locks in direct reclaim path to use:
    mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);

    This change is explained:
    "...It uses nested locking to avoid deadlock in case we do the final
    iput() on an inode which still holds marks and thus would take the
    mutex again when calling fsnotify_inode_delete() in destroy_inode()."

    The problem is that the mutex_lock_nested() is not a nested lock at
    all. In fact, it has the opposite effect of preventing lockdep from
    warning about a very possible deadlock.

    Due to these wrong annotations, a deadlock that was introduced with
    nfsd filecache in kernel v5.4 went unnoticed in v5.4.y for over two
    years until it was reported recently by Khazhismel Kumykov, only to
    find out that the deadlock was already fixed in kernel v5.5.

    Fix the wrong lockdep annotations.

    Cc: Khazhismel Kumykov
    Fixes: 6960b0d909cd ("fsnotify: change locking order")
    Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
    Link: https://lore.kernel.org/r/20220422120327.3459282-4-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Amir Goldstein
     
  • [ Upstream commit a32e697cda27679a0327ae2cafdad8c7170f548f ]

    The inotify mask flags IN_ONESHOT and IN_EXCL_UNLINK are not "internal
    to kernel" and should be exposed in procfs fdinfo so CRIU can restore
    them.

    Fixes: 6933599697c9 ("inotify: hide internal kernel bits from fdinfo")
    Link: https://lore.kernel.org/r/20220422120327.3459282-2-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Amir Goldstein
     

05 Feb, 2022

1 commit

  • commit ee12595147ac1fbfb5bcb23837e26dd58d94b15d upstream.

    This code calls fd_install() which gives the userspace access to the fd.
    Then if copy_info_records_to_user() fails it calls put_unused_fd(fd) but
    that will not release it and leads to a stale entry in the file
    descriptor table.

    Generally you can't trust the fd after a call to fd_install(). The fix
    is to delay the fd_install() until everything else has succeeded.

    Fortunately it requires CAP_SYS_ADMIN to reach this code so the security
    impact is less.

    Fixes: f644bc449b37 ("fanotify: fix copy_event_to_user() fid error clean up")
    Link: https://lore.kernel.org/r/20220128195656.GA26981@kili
    Signed-off-by: Dan Carpenter
    Reviewed-by: Mathias Krause
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

11 Sep, 2021

1 commit

  • Fix a leak in s_fsnotify_connectors counter in case of a race between
    concurrent add of new fsnotify mark to an object.

    The task that lost the race fails to drop the counter before freeing
    the unused connector.

    Following umount() hangs in fsnotify_sb_delete()/wait_var_event(),
    because s_fsnotify_connectors never drops to zero.

    Fixes: ec44610fe2b8 ("fsnotify: count all objects with attached connectors")
    Reported-by: Murphy Zhou
    Link: https://lore.kernel.org/linux-fsdevel/20210907063338.ycaw6wvhzrfsfdlp@xzhoux.usersys.redhat.com/
    Signed-off-by: Amir Goldstein
    Signed-off-by: Linus Torvalds

    Amir Goldstein
     

31 Aug, 2021

1 commit

  • Pull fsnotify updates from Jan Kara:
    "fsnotify speedups when notification actually isn't used and support
    for identifying processes which caused fanotify events through pidfd
    instead of normal pid"

    * tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: optimize the case of no marks of any type
    fsnotify: count all objects with attached connectors
    fsnotify: count s_fsnotify_inode_refs for attached connectors
    fsnotify: replace igrab() with ihold() on attach connector
    fanotify: add pidfd support to the fanotify API
    fanotify: introduce a generic info record copying helper
    fanotify: minor cosmetic adjustments to fid labels
    kernel/pid.c: implement additional checks upon pidfd_create() parameters
    kernel/pid.c: remove static qualifier from pidfd_create()

    Linus Torvalds
     

11 Aug, 2021

3 commits

  • Rename s_fsnotify_inode_refs to s_fsnotify_connectors and count all
    objects with attached connectors, not only inodes with attached
    connectors.

    This will be used to optimize fsnotify() calls on sb without any
    type of marks.

    Link: https://lore.kernel.org/r/20210810151220.285179-4-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Reviewed-by: Matthew Bobrowski
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Instead of incrementing s_fsnotify_inode_refs when detaching connector
    from inode, increment it earlier when attaching connector to inode.
    Next patch is going to use s_fsnotify_inode_refs to count all objects
    with attached connectors.

    Link: https://lore.kernel.org/r/20210810151220.285179-3-amir73il@gmail.com
    Reviewed-by: Matthew Bobrowski
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • We must have a reference on inode, so ihold is cheaper.

    Link: https://lore.kernel.org/r/20210810151220.285179-2-amir73il@gmail.com
    Reviewed-by: Matthew Bobrowski
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

10 Aug, 2021

4 commits

  • Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
    allows userspace applications to control whether a pidfd information
    record containing a pidfd is to be returned alongside the generic
    event metadata for each event.

    If FAN_REPORT_PIDFD is enabled for a notification group, an additional
    struct fanotify_event_info_pidfd object type will be supplied
    alongside the generic struct fanotify_event_metadata for a single
    event. This functionality is analogous to that of FAN_REPORT_FID in
    terms of how the event structure is supplied to a userspace
    application. Usage of FAN_REPORT_PIDFD with
    FAN_REPORT_FID/FAN_REPORT_DFID_NAME is permitted, and in this case a
    struct fanotify_event_info_pidfd object will likely follow any struct
    fanotify_event_info_fid object.

    Currently, the usage of the FAN_REPORT_TID flag is not permitted along
    with FAN_REPORT_PIDFD as the pidfd API currently only supports the
    creation of pidfds for thread-group leaders. Additionally, usage of
    the FAN_REPORT_PIDFD flag is limited to privileged processes only
    i.e. event listeners that are running with the CAP_SYS_ADMIN
    capability. Attempting to supply the FAN_REPORT_TID initialization
    flags with FAN_REPORT_PIDFD or creating a notification group without
    CAP_SYS_ADMIN will result with -EINVAL being returned to the caller.

    In the event of a pidfd creation error, there are two types of error
    values that can be reported back to the listener. There is
    FAN_NOPIDFD, which will be reported in cases where the process
    responsible for generating the event has terminated prior to the event
    listener being able to read the event. Then there is FAN_EPIDFD, which
    will be reported when a more generic pidfd creation error has occurred
    when fanotify calls pidfd_create().

    Link: https://lore.kernel.org/r/5f9e09cff7ed62bfaa51c1369e0f7ea5f16a91aa.1628398044.git.repnop@google.com
    Signed-off-by: Matthew Bobrowski
    Signed-off-by: Jan Kara

    Matthew Bobrowski
     
  • The copy_info_records_to_user() helper allows for the separation of
    info record copying routines/conditionals from copy_event_to_user(),
    which reduces the overall clutter within this function. This becomes
    especially true as we start introducing additional info records in the
    future i.e. struct fanotify_event_info_pidfd. On success, this helper
    returns the total amount of bytes that have been copied into the user
    supplied buffer and on error, a negative value is returned to the
    caller.

    The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
    record types that have been enabled for a specific notification
    group. This macro becomes useful in the subsequent patch when the
    FAN_REPORT_PIDFD initialization flag is introduced.

    Link: https://lore.kernel.org/r/8872947dfe12ce8ae6e9a7f2d49ea29bc8006af0.1628398044.git.repnop@google.com
    Signed-off-by: Matthew Bobrowski
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Matthew Bobrowski
     
  • With the idea to support additional info record types in the future
    i.e. fanotify_event_info_pidfd, it's a good idea to rename some of the
    labels assigned to some of the existing fid related functions,
    parameters, etc which more accurately represent the intent behind
    their usage.

    For example, copy_info_to_user() was defined with a generic function
    label, which arguably reads as being supportive of different info
    record types, however the parameter list for this function is
    explicitly tailored towards the creation and copying of the
    fanotify_event_info_fid records. This same point applies to the macro
    defined as FANOTIFY_INFO_HDR_LEN.

    With fanotify_event_info_len(), we change the parameter label so that
    the function implies that it can be extended to calculate the length
    for additional info record types.

    Link: https://lore.kernel.org/r/7c3ec33f3c718dac40764305d4d494d858f59c51.1628398044.git.repnop@google.com
    Signed-off-by: Matthew Bobrowski
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Matthew Bobrowski
     
  • commit f9c82a4ea89c3 ("Increase size of ucounts to atomic_long_t")
    changed the data type of ucounts/ucounts_max to long, but missed to
    adjust a few other places. This is noticeable on big endian platforms
    from user space because the /proc/sys/user/max_*_names files all
    contain 0.

    v4 - Made the min and max constants long so the sysctl values
    are actually settable on little endian machines.
    -- EWB

    Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t")
    Signed-off-by: Sven Schnelle
    Tested-by: Nathan Chancellor
    Tested-by: Linux Kernel Functional Testing
    Acked-by: Alexey Gladkov
    v1: https://lkml.kernel.org/r/20210721115800.910778-1-svens@linux.ibm.com
    v2: https://lkml.kernel.org/r/20210721125233.1041429-1-svens@linux.ibm.com
    v3: https://lkml.kernel.org/r/20210730062854.3601635-1-svens@linux.ibm.com
    Link: https://lkml.kernel.org/r/8735rijqlv.fsf_-_@disp2133
    Signed-off-by: Eric W. Biederman

    Sven Schnelle
     

14 Jun, 2021

1 commit

  • Ensure that clean up is performed on the allocated file descriptor and
    struct file object in the event that an error is encountered while copying
    fid info objects. Currently, we return directly to the caller when an error
    is experienced in the fid info copying helper, which isn't ideal given that
    the listener process could be left with a dangling file descriptor in their
    fdtable.

    Fixes: 5e469c830fdb ("fanotify: copy event fid info to user")
    Fixes: 44d705b0370b ("fanotify: report name info for FAN_DIR_MODIFY event")
    Link: https://lore.kernel.org/linux-fsdevel/YMKv1U7tNPK955ho@google.com/T/#m15361cd6399dad4396aad650de25dbf6b312288e
    Link: https://lore.kernel.org/r/1ef8ae9100101eb1a91763c516c2e9a3a3b112bd.1623376346.git.repnop@google.com
    Signed-off-by: Matthew Bobrowski
    Signed-off-by: Jan Kara

    Matthew Bobrowski
     

25 May, 2021

1 commit

  • Reporting event->pid should depend on the privileges of the user that
    initialized the group, not the privileges of the user reading the
    events.

    Use an internal group flag FANOTIFY_UNPRIV to record the fact that the
    group was initialized by an unprivileged user.

    To be on the safe side, the premissions to setup filesystem and mount
    marks now require that both the user that initialized the group and
    the user setting up the mark have CAP_SYS_ADMIN.

    Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiA77_P5vtv7e83g0+9d7B5W9ZTE4GfQEYbWmfT1rA=VA@mail.gmail.com/
    Fixes: 7cea2a3c505e ("fanotify: support limited functionality for unprivileged users")
    Cc: # v5.12+
    Link: https://lore.kernel.org/r/20210524135321.2190062-1-amir73il@gmail.com
    Reviewed-by: Matthew Bobrowski
    Acked-by: Christian Brauner
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

25 Mar, 2021

1 commit

  • I don't see an obvious reason why the upper 32 bit check needs to be
    open-coded this way. Switch to upper_32_bits() which is more idiomatic and
    should conceptually be the same check.

    Cc: Amir Goldstein
    Cc: Jan Kara
    Link: https://lore.kernel.org/r/20210325083742.2334933-1-brauner@kernel.org
    Signed-off-by: Christian Brauner
    Signed-off-by: Jan Kara

    Christian Brauner
     

16 Mar, 2021

7 commits

  • Add limited support for unprivileged fanotify groups.
    An unprivileged users is not allowed to get an open file descriptor in
    the event nor the process pid of another process. An unprivileged user
    cannot request permission events, cannot set mount/filesystem marks and
    cannot request unlimited queue/marks.

    This enables the limited functionality similar to inotify when watching a
    set of files and directories for OPEN/ACCESS/MODIFY/CLOSE events, without
    requiring SYS_CAP_ADMIN privileges.

    The FAN_REPORT_DFID_NAME init flag, provide a method for an unprivileged
    listener watching a set of directories (with FAN_EVENT_ON_CHILD) to monitor
    all changes inside those directories.

    This typically requires that the listener keeps a map of watched directory
    fid to dirfd (O_PATH), where fid is obtained with name_to_handle_at()
    before starting to watch for changes.

    When getting an event, the reported fid of the parent should be resolved
    to dirfd and fstatsat(2) with dirfd and name should be used to query the
    state of the filesystem entry.

    Link: https://lore.kernel.org/r/20210304112921.3996419-3-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • fanotify has some hardcoded limits. The only APIs to escape those limits
    are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.

    Allow finer grained tuning of the system limits via sysfs tunables under
    /proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify,
    with some minor differences.

    - max_queued_events - global system tunable for group queue size limit.
    Like the inotify tunable with the same name, it defaults to 16384 and
    applies on initialization of a new group.

    - max_user_marks - user ns tunable for marks limit per user.
    Like the inotify tunable named max_user_watches, on a machine with
    sufficient RAM and it defaults to 1048576 in init userns and can be
    further limited per containing user ns.

    - max_user_groups - user ns tunable for number of groups per user.
    Like the inotify tunable named max_user_instances, it defaults to 128
    in init userns and can be further limited per containing user ns.

    The slightly different tunable names used for fanotify are derived from
    the "group" and "mark" terminology used in the fanotify man pages and
    throughout the code.

    Considering the fact that the default value for max_user_instances was
    increased in kernel v5.10 from 8192 to 1048576, leaving the legacy
    fanotify limit of 8192 marks per group in addition to the max_user_marks
    limit makes little sense, so the per group marks limit has been removed.

    Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own
    marks are not accounted in the per user marks account, so in effect the
    limit of max_user_marks is only for the collection of groups that are
    not initialized with FAN_UNLIMITED_MARKS.

    Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com
    Suggested-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Event merges are expensive when event queue size is large, so limit the
    linear search to 128 merge tests.

    In combination with 128 size hash table, there is a potential to merge
    with up to 16K events in the hashed queue.

    Link: https://lore.kernel.org/r/20210304104826.3993892-6-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • In order to improve event merge performance, hash events in a 128 size
    hash table by the event merge key.

    The fanotify_event size grows by two pointers, but we just reduced its
    size by removing the objectid member, so overall its size is increased
    by one pointer.

    Permission events and overflow event are not merged so they are also
    not hashed.

    Link: https://lore.kernel.org/r/20210304104826.3993892-5-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Improve the merge key hash by mixing more values relevant for merge.

    For example, all FAN_CREATE name events in the same dir used to have the
    same merge key based on the dir inode. With this change the created
    file name is mixed into the merge key.

    The object id that was used as merge key is redundant to the event info
    so it is no longer mixed into the hash.

    Permission events are not hashed, so no need to hash their info.

    Link: https://lore.kernel.org/r/20210304104826.3993892-4-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • objectid is only used by fanotify backend and it is just an optimization
    for event merge before comparing all fields in event.

    Move the objectid member from common struct fsnotify_event into struct
    fanotify_event and reduce it to 29-bit hash to cram it together with the
    3-bit event type.

    Events of different types are never merged, so the combination of event
    type and hash form a 32-bit key for fast compare of events.

    This reduces the size of events by one pointer and paves the way for
    adding hashed queue support for fanotify.

    Link: https://lore.kernel.org/r/20210304104826.3993892-3-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Current code has an assumtion that fsnotify_notify_queue_is_empty() is
    called to verify that queue is not empty before trying to peek or remove
    an event from queue.

    Remove this assumption by moving the fsnotify_notify_queue_is_empty()
    into the functions, allow them to return NULL value and check return
    value by all callers.

    This is a prep patch for multi event queues.

    Link: https://lore.kernel.org/r/20210304104826.3993892-2-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

24 Feb, 2021

1 commit

  • Pull idmapped mounts from Christian Brauner:
    "This introduces idmapped mounts which has been in the making for some
    time. Simply put, different mounts can expose the same file or
    directory with different ownership. This initial implementation comes
    with ports for fat, ext4 and with Christoph's port for xfs with more
    filesystems being actively worked on by independent people and
    maintainers.

    Idmapping mounts handle a wide range of long standing use-cases. Here
    are just a few:

    - Idmapped mounts make it possible to easily share files between
    multiple users or multiple machines especially in complex
    scenarios. For example, idmapped mounts will be used in the
    implementation of portable home directories in
    systemd-homed.service(8) where they allow users to move their home
    directory to an external storage device and use it on multiple
    computers where they are assigned different uids and gids. This
    effectively makes it possible to assign random uids and gids at
    login time.

    - It is possible to share files from the host with unprivileged
    containers without having to change ownership permanently through
    chown(2).

    - It is possible to idmap a container's rootfs and without having to
    mangle every file. For example, Chromebooks use it to share the
    user's Download folder with their unprivileged containers in their
    Linux subsystem.

    - It is possible to share files between containers with
    non-overlapping idmappings.

    - Filesystem that lack a proper concept of ownership such as fat can
    use idmapped mounts to implement discretionary access (DAC)
    permission checking.

    - They allow users to efficiently changing ownership on a per-mount
    basis without having to (recursively) chown(2) all files. In
    contrast to chown (2) changing ownership of large sets of files is
    instantenous with idmapped mounts. This is especially useful when
    ownership of a whole root filesystem of a virtual machine or
    container is changed. With idmapped mounts a single syscall
    mount_setattr syscall will be sufficient to change the ownership of
    all files.

    - Idmapped mounts always take the current ownership into account as
    idmappings specify what a given uid or gid is supposed to be mapped
    to. This contrasts with the chown(2) syscall which cannot by itself
    take the current ownership of the files it changes into account. It
    simply changes the ownership to the specified uid and gid. This is
    especially problematic when recursively chown(2)ing a large set of
    files which is commong with the aforementioned portable home
    directory and container and vm scenario.

    - Idmapped mounts allow to change ownership locally, restricting it
    to specific mounts, and temporarily as the ownership changes only
    apply as long as the mount exists.

    Several userspace projects have either already put up patches and
    pull-requests for this feature or will do so should you decide to pull
    this:

    - systemd: In a wide variety of scenarios but especially right away
    in their implementation of portable home directories.

    https://systemd.io/HOME_DIRECTORY/

    - container runtimes: containerd, runC, LXD:To share data between
    host and unprivileged containers, unprivileged and privileged
    containers, etc. The pull request for idmapped mounts support in
    containerd, the default Kubernetes runtime is already up for quite
    a while now: https://github.com/containerd/containerd/pull/4734

    - The virtio-fs developers and several users have expressed interest
    in using this feature with virtual machines once virtio-fs is
    ported.

    - ChromeOS: Sharing host-directories with unprivileged containers.

    I've tightly synced with all those projects and all of those listed
    here have also expressed their need/desire for this feature on the
    mailing list. For more info on how people use this there's a bunch of
    talks about this too. Here's just two recent ones:

    https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
    https://fosdem.org/2021/schedule/event/containers_idmap/

    This comes with an extensive xfstests suite covering both ext4 and
    xfs:

    https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

    It covers truncation, creation, opening, xattrs, vfscaps, setid
    execution, setgid inheritance and more both with idmapped and
    non-idmapped mounts. It already helped to discover an unrelated xfs
    setgid inheritance bug which has since been fixed in mainline. It will
    be sent for inclusion with the xfstests project should you decide to
    merge this.

    In order to support per-mount idmappings vfsmounts are marked with
    user namespaces. The idmapping of the user namespace will be used to
    map the ids of vfs objects when they are accessed through that mount.
    By default all vfsmounts are marked with the initial user namespace.
    The initial user namespace is used to indicate that a mount is not
    idmapped. All operations behave as before and this is verified in the
    testsuite.

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users
    to setup idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.

    The user namespace the mount will be marked with can be specified by
    passing a file descriptor refering to the user namespace as an
    argument to the new mount_setattr() syscall together with the new
    MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
    of extensibility.

    The following conditions must be met in order to create an idmapped
    mount:

    - The caller must currently have the CAP_SYS_ADMIN capability in the
    user namespace the underlying filesystem has been mounted in.

    - The underlying filesystem must support idmapped mounts.

    - The mount must not already be idmapped. This also implies that the
    idmapping of a mount cannot be altered once it has been idmapped.

    - The mount must be a detached/anonymous mount, i.e. it must have
    been created by calling open_tree() with the OPEN_TREE_CLONE flag
    and it must not already have been visible in the filesystem.

    The last two points guarantee easier semantics for userspace and the
    kernel and make the implementation significantly simpler.

    By default vfsmounts are marked with the initial user namespace and no
    behavioral or performance changes are observed.

    The manpage with a detailed description can be found here:

    https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

    In order to support idmapped mounts, filesystems need to be changed
    and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
    patches to convert individual filesystem are not very large or
    complicated overall as can be seen from the included fat, ext4, and
    xfs ports. Patches for other filesystems are actively worked on and
    will be sent out separately. The xfstestsuite can be used to verify
    that port has been done correctly.

    The mount_setattr() syscall is motivated independent of the idmapped
    mounts patches and it's been around since July 2019. One of the most
    valuable features of the new mount api is the ability to perform
    mounts based on file descriptors only.

    Together with the lookup restrictions available in the openat2()
    RESOLVE_* flag namespace which we added in v5.6 this is the first time
    we are close to hardened and race-free (e.g. symlinks) mounting and
    path resolution.

    While userspace has started porting to the new mount api to mount
    proper filesystems and create new bind-mounts it is currently not
    possible to change mount options of an already existing bind mount in
    the new mount api since the mount_setattr() syscall is missing.

    With the addition of the mount_setattr() syscall we remove this last
    restriction and userspace can now fully port to the new mount api,
    covering every use-case the old mount api could. We also add the
    crucial ability to recursively change mount options for a whole mount
    tree, both removing and adding mount options at the same time. This
    syscall has been requested multiple times by various people and
    projects.

    There is a simple tool available at

    https://github.com/brauner/mount-idmapped

    that allows to create idmapped mounts so people can play with this
    patch series. I'll add support for the regular mount binary should you
    decide to pull this in the following weeks:

    Here's an example to a simple idmapped mount of another user's home
    directory:

    u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

    u1001@f2-vm:/$ ls -al /home/ubuntu/
    total 28
    drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
    drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
    -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
    -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ ls -al /mnt/
    total 28
    drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
    drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
    -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
    -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ touch /mnt/my-file

    u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

    u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

    u1001@f2-vm:/$ ls -al /mnt/my-file
    -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

    u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
    -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

    u1001@f2-vm:/$ getfacl /mnt/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: mnt/my-file
    # owner: u1001
    # group: u1001
    user::rw-
    user:u1001:rwx
    group::rw-
    mask::rwx
    other::r--

    u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: home/ubuntu/my-file
    # owner: ubuntu
    # group: ubuntu
    user::rw-
    user:ubuntu:rwx
    group::rw-
    mask::rwx
    other::r--"

    * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
    xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
    xfs: support idmapped mounts
    ext4: support idmapped mounts
    fat: handle idmapped mounts
    tests: add mount_setattr() selftests
    fs: introduce MOUNT_ATTR_IDMAP
    fs: add mount_setattr()
    fs: add attr_flags_to_mnt_flags helper
    fs: split out functions to hold writers
    namespace: only take read lock in do_reconfigure_mnt()
    mount: make {lock,unlock}_mount_hash() static
    namespace: take lock_mount_hash() directly when changing flags
    nfs: do not export idmapped mounts
    overlayfs: do not mount on top of idmapped mounts
    ecryptfs: do not mount on top of idmapped mounts
    ima: handle idmapped mounts
    apparmor: handle idmapped mounts
    fs: make helpers idmap mount aware
    exec: handle idmapped mounts
    would_dump: handle idmapped mounts
    ...

    Linus Torvalds
     

23 Feb, 2021

1 commit


24 Jan, 2021

1 commit

  • Add two simple helpers to check permissions on a file and path
    respectively and convert over some callers. It simplifies quite a few
    codepaths and also reduces the churn in later patches quite a bit.
    Christoph also correctly points out that this makes codepaths (e.g.
    ioctls) way easier to follow that would otherwise have to do more
    complex argument passing than necessary.

    Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     

05 Jan, 2021

1 commit

  • Currently the fs sysctl inotify/max_user_instances is used to limit the
    number of inotify instances on the system. For systems running multiple
    workloads, the per-user namespace sysctl max_inotify_instances can be
    used to further partition inotify instances. However there is no easy
    way to set a sensible system level max limit on inotify instances and
    further partition it between the workloads. It is much easier to charge
    the underlying resource (i.e. memory) behind the inotify instances to
    the memcg of the workload and let their memory limits limit the number
    of inotify instances they can create.

    With inotify instances charged to memcg, the admin can simply set
    max_user_instances to INT_MAX and let the memcg limits of the jobs limit
    their inotify instances.

    Link: https://lore.kernel.org/r/20201220044608.1258123-1-shakeelb@google.com
    Reviewed-by: Amir Goldstein
    Signed-off-by: Shakeel Butt
    Signed-off-by: Jan Kara

    Shakeel Butt
     

28 Dec, 2020

1 commit

  • Commit

    121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

    converted native x86-32 which take 64-bit arguments to use the
    compat handlers to allow conversion to passing args via pt_regs.
    sys_fanotify_mark() was however missed, as it has a general compat
    handler. Add a config option that will use the syscall wrapper that
    takes the split args for native 32-bit.

    [ bp: Fix typo in Kconfig help text. ]

    Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
    Reported-by: Paweł Jasiak
    Signed-off-by: Brian Gerst
    Signed-off-by: Borislav Petkov
    Acked-by: Jan Kara
    Acked-by: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com

    Brian Gerst
     

18 Dec, 2020

1 commit

  • Pull fsnotify updates from Jan Kara:
    "A few fsnotify fixes from Amir fixing fallout from big fsnotify
    overhaul a few months back and an improvement of defaults limiting
    maximum number of inotify watches from Waiman"

    * tag 'fsnotify_for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: fix events reported to watching parent and child
    inotify: convert to handle_inode_event() interface
    fsnotify: generalize handle_inode_event()
    inotify: Increase default inotify.max_user_watches limit to 1048576

    Linus Torvalds
     

16 Dec, 2020

1 commit

  • …biederm/user-namespace

    Pull execve updates from Eric Biederman:
    "This set of changes ultimately fixes the interaction of posix file
    lock and exec. Fundamentally most of the change is just moving where
    unshare_files is called during exec, and tweaking the users of
    files_struct so that the count of files_struct is not unnecessarily
    played with.

    Along the way fcheck and related helpers were renamed to more
    accurately reflect what they do.

    There were also many other small changes that fell out, as this is the
    first time in a long time much of this code has been touched.

    Benchmarks haven't turned up any practical issues but Al Viro has
    observed a possibility for a lot of pounding on task_lock. So I have
    some changes in progress to convert put_files_struct to always rcu
    free files_struct. That wasn't ready for the merge window so that will
    have to wait until next time"

    * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    exec: Move io_uring_task_cancel after the point of no return
    coredump: Document coredump code exclusively used by cell spufs
    file: Remove get_files_struct
    file: Rename __close_fd_get_file close_fd_get_file
    file: Replace ksys_close with close_fd
    file: Rename __close_fd to close_fd and remove the files parameter
    file: Merge __alloc_fd into alloc_fd
    file: In f_dupfd read RLIMIT_NOFILE once.
    file: Merge __fd_install into fd_install
    proc/fd: In fdinfo seq_show don't use get_files_struct
    bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
    proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
    file: Implement task_lookup_next_fd_rcu
    kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
    proc/fd: In tid_fd_mode use task_lookup_fd_rcu
    file: Implement task_lookup_fd_rcu
    file: Rename fcheck lookup_fd_rcu
    file: Replace fcheck_files with files_lookup_fd_rcu
    file: Factor files_lookup_fd_locked out of fcheck_files
    file: Rename __fcheck_files to files_lookup_fd_raw
    ...

    Linus Torvalds
     

11 Dec, 2020

2 commits

  • fsnotify_parent() used to send two separate events to backends when a
    parent inode is watching children and the child inode is also watching.
    In an attempt to avoid duplicate events in fanotify, we unified the two
    backend callbacks to a single callback and handled the reporting of the
    two separate events for the relevant backends (inotify and dnotify).
    However the handling is buggy and can result in inotify and dnotify
    listeners receiving events of the type they never asked for or spurious
    events.

    The problem is the unified event callback with two inode marks (parent and
    child) is called when any of the parent and child inodes are watched and
    interested in the event, but the parent inode's mark that is interested
    in the event on the child is not necessarily the one we are currently
    reporting to (it could belong to a different group).

    So before reporting the parent or child event flavor to backend we need
    to check that the mark is really interested in that event flavor.

    The semantics of INODE and CHILD marks were hard to follow and made the
    logic more complicated than it should have been. Replace it with INODE
    and PARENT marks semantics to hopefully make the logic more clear.

    Thanks to Hugh Dickins for spotting a bug in the earlier version of this
    patch.

    Fixes: 497b0c5a7c06 ("fsnotify: send event to parent and child with single callback")
    CC: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20201202120713.702387-4-amir73il@gmail.com
    Reported-by: Hugh Dickins
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Also remove the confusing comment about checking if a fd exists. I
    could not find one instance in the entire kernel that still matches
    the description or the reason for the name fcheck.

    The need for better names became apparent in the last round of
    discussion of this set of changes[1].

    [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-10-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

03 Dec, 2020

2 commits

  • Convert inotify to use the simple handle_inode_event() interface to
    get rid of the code duplication between the generic helper
    fsnotify_handle_event() and the inotify_handle_event() callback, which
    also happen to be buggy code.

    The bug will be fixed in the generic helper.

    Link: https://lore.kernel.org/r/20201202120713.702387-3-amir73il@gmail.com
    CC: stable@vger.kernel.org
    Fixes: b9a1b9772509 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • The handle_inode_event() interface was added as (quoting comment):
    "a simple variant of handle_event() for groups that only have inode
    marks and don't have ignore mask".

    In other words, all backends except fanotify. The inotify backend
    also falls under this category, but because it required extra arguments
    it was left out of the initial pass of backends conversion to the
    simple interface.

    This results in code duplication between the generic helper
    fsnotify_handle_event() and the inotify_handle_event() callback
    which also happen to be buggy code.

    Generalize the handle_inode_event() arguments and add the check for
    FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
    be converted to use the simple interface.

    Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
    CC: stable@vger.kernel.org
    Fixes: b9a1b9772509 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

09 Nov, 2020

2 commits

  • The default value of inotify.max_user_watches sysctl parameter was set
    to 8192 since the introduction of the inotify feature in 2005 by
    commit 0eeca28300df ("[PATCH] inotify"). Today this value is just too
    small for many modern usage. As a result, users have to explicitly set
    it to a larger value to make it work.

    After some searching around the web, these are the
    inotify.max_user_watches values used by some projects:
    - vscode: 524288
    - dropbox support: 100000
    - users on stackexchange: 12228
    - lsyncd user: 2000000
    - code42 support: 1048576
    - monodevelop: 16384
    - tectonic: 524288
    - openshift origin: 65536

    Each watch point adds an inotify_inode_mark structure to an inode to
    be watched. It also pins the watched inode.

    Modeled after the epoll.max_user_watches behavior to adjust the default
    value according to the amount of addressable memory available, make
    inotify.max_user_watches behave in a similar way to make it use no more
    than 1% of addressable memory within the range [8192, 1048576].

    We estimate the amount of memory used by inotify mark to size of
    inotify_inode_mark plus two times the size of struct inode (we double
    the inode size to cover the additional filesystem private inode part).
    That means that a 64-bit system with 128GB or more memory will likely
    have the maximum value of 1048576 for inotify.max_user_watches. This
    default should be big enough for most use cases.

    Link: https://lore.kernel.org/r/20201109035931.4740-1-longman@redhat.com
    Reviewed-by: Amir Goldstein
    Signed-off-by: Waiman Long
    Signed-off-by: Jan Kara

    Waiman Long
     
  • The victim inode's parent and name info is required when an event
    needs to be delivered to a group interested in filename info OR
    when the inode's parent is interested in an event on its children.

    Let us call the first condition 'parent_needed' and the second
    condition 'parent_interested'.

    In fsnotify_parent(), the condition where the inode's parent is
    interested in some events on its children, but not necessarily
    interested the specific event is called 'parent_watched'.

    fsnotify_parent() tests the condition (!parent_watched && !parent_needed)
    for sending the event without parent and name info, which is correct.

    It then wrongly assumes that parent_watched implies !parent_needed
    and tests the condition (parent_watched && !parent_interested)
    for sending the event without parent and name info, which is wrong,
    because parent may still be needed by some group.

    For example, after initializing a group with FAN_REPORT_DFID_NAME and
    adding a FAN_MARK_MOUNT with FAN_OPEN mask, open events on non-directory
    children of "testdir" are delivered with file name info.

    After adding another mark to the same group on the parent "testdir"
    with FAN_CLOSE|FAN_EVENT_ON_CHILD mask, open events on non-directory
    children of "testdir" are no longer delivered with file name info.

    Fix the logic and use auxiliary variables to clarify the conditions.

    Fixes: 9b93f33105f5 ("fsnotify: send event with parent/name info to sb/mount/non-dir marks")
    Cc: stable@vger.kernel.org#v5.9
    Link: https://lore.kernel.org/r/20201108105906.8493-1-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

19 Oct, 2020

1 commit

  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value, which overwrites the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context,
    however an attempt to call it from an interrupt context or just nest two
    remote charging blocks will lead to an incorrect accounting. On exit from
    the inner block the active memcg will be cleared instead of being
    restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocation here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

28 Jul, 2020

3 commits

  • When merging name events, fsids of the two involved events have to
    match. Otherwise we could merge events from two different filesystems
    and thus effectively loose the second event.

    Backporting note: Although the commit cacfb956d46e introducing this bug
    was merged for 5.7, the relevant code didn't get used in the end until
    7e8283af6ede ("fanotify: report parent fid + name + child fid") which
    will be merged with this patch. So there's no need for backporting this.

    Fixes: cacfb956d46e ("fanotify: record name info for FAN_DIR_MODIFY event")
    Reported-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • The method handle_event() grew a lot of complexity due to the design of
    fanotify and merging of ignore masks.

    Most backends do not care about this complex functionality, so we can hide
    this complexity from them.

    Introduce a method handle_inode_event() that serves those backends and
    passes a single inode mark and less arguments.

    This change converts all backends except fanotify and inotify to use the
    simplified handle_inode_event() method. In pricipal, inotify could have
    also used the new method, but that would require passing more arguments
    on the simple helper (data, data_type, cookie), so we leave it with the
    handle_event() method.

    Link: https://lore.kernel.org/r/20200722125849.17418-9-amir73il@gmail.com
    Suggested-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Add support for FAN_REPORT_FID | FAN_REPORT_DIR_FID.
    Internally, it is implemented as a private case of reporting both
    parent and child fids and name, the parent and child fids are recorded
    in a variable length fanotify_name_event, but there is no name.

    It should be noted that directory modification events are recorded
    in fixed size fanotify_fid_event when not reporting name, just like
    with group flags FAN_REPORT_FID.

    Link: https://lore.kernel.org/r/20200716084230.30611-23-amir73il@gmail.com
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein