24 Sep, 2019

1 commit

  • Pull selinux updates from Paul Moore:

    - Add LSM hooks, and SELinux access control hooks, for dnotify,
    fanotify, and inotify watches. This has been discussed with both the
    LSM and fs/notify folks and everybody is good with these new hooks.

    - The LSM stacking changes missed a few calls to current_security() in
    the SELinux code; we fix those and remove current_security() for
    good.

    - Improve our network object labeling cache so that we always return
    the object's label, even when under memory pressure. Previously we
    would return an error if we couldn't allocate a new cache entry, now
    we always return the label even if we can't create a new cache entry
    for it.

    - Convert the sidtab atomic_t counter to a normal u32 with
    READ/WRITE_ONCE() and memory barrier protection.

    - A few patches to policydb.c to clean things up (remove forward
    declarations, long lines, bad variable names, etc)

    * tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    lsm: remove current_security()
    selinux: fix residual uses of current_security() for the SELinux blob
    selinux: avoid atomic_t usage in sidtab
    fanotify, inotify, dnotify, security: add security hook for fs notifications
    selinux: always return a secid from the network caches if we find one
    selinux: policydb - rename type_val_to_struct_array
    selinux: policydb - fix some checkpatch.pl warnings
    selinux: shuffle around policydb.c to get rid of forward declarations

    Linus Torvalds
     

13 Aug, 2019

1 commit

  • As of now, setting watches on filesystem objects has, at most, applied a
    check for read access to the inode, and in the case of fanotify, requires
    CAP_SYS_ADMIN. No specific security hook or permission check has been
    provided to control the setting of watches. Using any of inotify, dnotify,
    or fanotify, it is possible to observe, not only write-like operations, but
    even read access to a file. Modeling the watch as being merely a read from
    the file is insufficient for the needs of SELinux. This is due to the fact
    that read access should not necessarily imply access to information about
    when another process reads from a file. Furthermore, fanotify watches grant
    more power to an application in the form of permission events. While
    notification events are solely unidirectional (i.e. they only pass
    information to the receiving application), permission events are blocking.
    Permission events make a request to the receiving application which will
    then reply with a decision as to whether or not that action may be
    completed. This causes the issue of the watching application having the
    ability to exercise control over the triggering process. Without drawing a
    distinction within the permission check, the ability to read would imply
    the greater ability to control an application. Additionally, mount and
    superblock watches apply to all files within the same mount or superblock.
    Read access to one file should not necessarily imply the ability to watch
    all files accessed within a given mount or superblock.

    In order to solve these issues, a new LSM hook is implemented and has been
    placed within the system calls for marking filesystem objects with inotify,
    fanotify, and dnotify watches. These calls to the hook are placed at the
    point at which the target path has been resolved and are provided with the
    path struct, the mask of requested notification events, and the type of
    object on which the mark is being set (inode, superblock, or mount). The
    mask and obj_type have already been translated into common FS_* values
    shared by the entirety of the fs notification infrastructure. The path
    struct is passed rather than just the inode so that the mount is available,
    particularly for mount watches. This also allows for use of the hook by
    pathname-based security modules. However, since the hook is intended for
    use even by inode based security modules, it is not placed under the
    CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
    modules would need to enable all of the path hooks, even though they do not
    use any of them.

    This only provides a hook at the point of setting a watch, and presumes
    that permission to set a particular watch implies the ability to receive
    all notifications about that object which match the mask. This is all that
    is required for SELinux. If other security modules require additional hooks
    or infrastructure to control delivery of notification, these can be added
    by them. It does not make sense for us to propose hooks for which we have
    no implementation. The understanding that all notifications received by the
    requesting application are all strictly of a type for which the application
    has been granted permission shows that this implementation is sufficient in
    its coverage.

    Security modules wishing to provide complete control over fanotify must
    also implement a security_file_open hook that validates that the access
    requested by the watching application is authorized. Fanotify has the issue
    that it returns a file descriptor with the file mode specified during
    fanotify_init() to the watching process on event. This is already covered
    by the LSM security_file_open hook if the security module implements
    checking of the requested file mode there. Otherwise, a watching process
    can obtain escalated access to a file for which it has not been authorized.

    The selinux_path_notify hook implementation works by adding five new file
    permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
    (descriptions about which will follow), and one new filesystem permission:
    watch (which is applied to superblock checks). The hook then decides which
    subset of these permissions must be held by the requesting application
    based on the contents of the provided mask and the obj_type. The
    selinux_file_open hook already checks the requested file mode and therefore
    ensures that a watching process cannot escalate its access through
    fanotify.

    The watch, watch_mount, and watch_sb permissions are the baseline
    permissions for setting a watch on an object and each are a requirement for
    any watch to be set on a file, mount, or superblock respectively. It should
    be noted that having either of the other two permissions (watch_reads and
    watch_with_perm) does not imply the watch, watch_mount, or watch_sb
    permission. Superblock watches further require the filesystem watch
    permission to the superblock. As there is no labeled object in view for
    mounts, there is no specific check for mount watches beyond watch_mount to
    the inode. Such a check could be added in the future, if a suitable labeled
    object existed representing the mount.

    The watch_reads permission is required to receive notifications from
    read-exclusive events on filesystem objects. These events include accessing
    a file for the purpose of reading and closing a file which has been opened
    read-only. This distinction has been drawn in order to provide a direct
    indication in the policy for this otherwise not obvious capability. Read
    access to a file should not necessarily imply the ability to observe read
    events on a file.

    Finally, watch_with_perm only applies to fanotify masks, since fanotify
    is the only interface that can set a mask allowing blocking permission
    events. This permission is needed for any watch which is of this type. Though
    fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
    trust to root, which we do not do, and does not support least privilege.
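
    As a rough illustration of the permission selection described above, the
    following is a minimal userspace sketch, not the actual
    selinux_path_notify() code. The enum values, the EV_* and PERM_* bit
    values and the function name required_watch_perms() are hypothetical
    stand-ins for the kernel's obj_type, FS_* and SELinux permission
    definitions. Counting the read-side permission event as a read event is
    also an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical object types; the kernel uses its own obj_type values. */
enum watch_obj_type { OBJ_INODE, OBJ_MOUNT, OBJ_SB };

/* Hypothetical event bits standing in for the common FS_* mask values. */
#define EV_ACCESS        0x01u  /* file was read */
#define EV_MODIFY        0x02u  /* file was written */
#define EV_CLOSE_NOWRITE 0x04u  /* file opened read-only was closed */
#define EV_OPEN_PERM     0x08u  /* blocking permission event on open */
#define EV_ACCESS_PERM   0x10u  /* blocking permission event on read */

/* Hypothetical bits named after the permissions listed in the text. */
#define PERM_WATCH           0x01u
#define PERM_WATCH_MOUNT     0x02u
#define PERM_WATCH_SB        0x04u
#define PERM_WATCH_READS     0x08u
#define PERM_WATCH_WITH_PERM 0x10u
#define PERM_FS_WATCH        0x20u  /* filesystem "watch" on the superblock */

/* Return the set of permissions a caller would need for this watch. */
uint32_t required_watch_perms(uint32_t mask, enum watch_obj_type type)
{
    uint32_t perms = 0;

    /* Baseline permission depends only on the kind of object watched. */
    switch (type) {
    case OBJ_INODE:
        perms = PERM_WATCH;
        break;
    case OBJ_MOUNT:
        perms = PERM_WATCH_MOUNT;
        break;
    case OBJ_SB:
        /* Superblock watches also need the filesystem watch permission. */
        perms = PERM_WATCH_SB | PERM_FS_WATCH;
        break;
    }

    /* Blocking permission events need watch_with_perm on top. */
    if (mask & (EV_OPEN_PERM | EV_ACCESS_PERM))
        perms |= PERM_WATCH_WITH_PERM;

    /* Events that reveal reads by other processes need watch_reads. */
    if (mask & (EV_ACCESS | EV_ACCESS_PERM | EV_CLOSE_NOWRITE))
        perms |= PERM_WATCH_READS;

    return perms;
}
```

    The extra permissions are only ever OR-ed onto the baseline, which
    matches the statement that watch_reads and watch_with_perm never imply
    watch, watch_mount, or watch_sb.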

    Signed-off-by: Aaron Goidel
    Acked-by: Casey Schaufler
    Acked-by: Jan Kara
    Signed-off-by: Paul Moore

    Aaron Goidel
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    the minimum and maximum allowed values.

    On sysctl handler declaration, in every source file there are some
    readonly variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
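
    A minimal userspace sketch of the idea follows. The array name
    sysctl_vals appears in the bloat-o-meter output below; the SYSCTL_*
    macro names, struct mini_ctl_table and mini_set_minmax() are assumptions
    made for illustration and are not the kernel's struct ctl_table or
    proc_dointvec_minmax().

```c
#include <limits.h>
#include <stddef.h>

/* One shared table instead of per-file zero/one/int_max variables. */
const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Assumed macro names: each one points at a slot of the shared table. */
#define SYSCTL_ZERO    ((void *)&sysctl_vals[0])
#define SYSCTL_ONE     ((void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX ((void *)&sysctl_vals[2])

/* Hypothetical, cut-down stand-in for struct ctl_table. */
struct mini_ctl_table {
    const char *procname;
    int *data;     /* where the accepted value is stored */
    void *extra1;  /* pointer to the minimum, or NULL for no lower bound */
    void *extra2;  /* pointer to the maximum, or NULL for no upper bound */
};

/* Store val if it lies within [*extra1, *extra2]; return -1 otherwise. */
int mini_set_minmax(const struct mini_ctl_table *t, int val)
{
    const int *min = t->extra1;
    const int *max = t->extra2;

    if ((min && val < *min) || (max && val > *max))
        return -1;
    *t->data = val;
    return 0;
}
```

    A table entry limited to the values 0 and 1 would then set extra1 to
    SYSCTL_ZERO and extra2 to SYSCTL_ONE without defining any local
    variables.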

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                         old     new   delta
    sysctl_vals                    -      12     +12
    __kstrtab_sysctl_vals          -      12     +12
    max                           14      10      -4
    int_max                       16       -     -16
    one                           68       -     -68
    zero                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

13 Jul, 2019

1 commit

  • Commit d46eb14b735b ("fs: fsnotify: account fsnotify metadata to
    kmemcg") added remote memcg charging for fanotify and inotify event
    objects. The aim was to charge the memory to the listener who is
    interested in the events but without triggering the OOM killer.
    Otherwise there would be security concerns for the listener.

    At the time, oom-kill trigger was not in the charging path. A parallel
    work added the oom-kill back to charging path i.e. commit 29ef680ae7c2
    ("memcg, oom: move out_of_memory back to the charge path"). So to not
    trigger oom-killer in the remote memcg, explicitly add
    __GFP_RETRY_MAYFAIL to the fanotify and inotify event allocations.

    Link: http://lkml.kernel.org/r/20190514212259.156585-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Jan Kara
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Amir Goldstein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 or at your option any
    later version this program is distributed in the hope that it will
    be useful but without any warranty without even the implied warranty
    of merchantability or fitness for a particular purpose see the gnu
    general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 44 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190523091651.032047323@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

1 commit


08 May, 2019

1 commit

  • Pull misc dcache updates from Al Viro:
    "Most of this pile is putting name length into struct name_snapshot and
    making use of it.

    The beginning of this series ("ovl_lookup_real_one(): don't bother
    with strlen()") ought to have been split in two (separate switch of
    name_snapshot to struct qstr from overlayfs reaping the trivial
    benefits of that), but I wanted to avoid a rebase - by the time I'd
    spotted that it was (a) in -next and (b) close to 5.1-final ;-/"

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    audit_compare_dname_path(): switch to const struct qstr *
    audit_update_watch(): switch to const struct qstr *
    inotify_handle_event(): don't bother with strlen()
    fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
    fsnotify(): switch to passing const struct qstr * for file_name
    switch fsnotify_move() to passing const struct qstr * for old_name
    ovl_lookup_real_one(): don't bother with strlen()
    sysv: bury the broken "quietly truncate the long filenames" logics
    nsfs: unobfuscate
    unexport d_alloc_pseudo()

    Linus Torvalds
     

27 Apr, 2019

2 commits


19 Apr, 2019

1 commit

  • Make the anon_inodes facility unconditional so that it can be used by core
    VFS code and pidfd code.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    [christian@brauner.io: adapt commit message to mention pidfds]
    Signed-off-by: Christian Brauner

    David Howells
     

11 Mar, 2019

1 commit


07 Feb, 2019

1 commit

  • We need to report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
    for fanotify, because fanotify API requires the user to explicitly
    request events on directories by FAN_ONDIR flag.

    inotify never reported IN_ISDIR with those events. It looks like an
    oversight, but to avoid the risk of breaking existing inotify programs,
    mask the FS_ISDIR flag out when reporting those events to inotify backend.

    We also add the FS_ISDIR flag with FS_ATTRIB event in the case of rename
    over an empty target directory. inotify did not report IN_ISDIR in this
    case, but it normally does report IN_ISDIR along with IN_ATTRIB event,
    so in this case, we do not mask out the FS_ISDIR flag.
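
    The masking rule for the inotify backend can be sketched in userspace
    with the public IN_* constants. The helper name inotify_reported_mask()
    is hypothetical, and this is only an approximation of the behaviour
    described above, not the kernel code itself.

```c
#include <stdint.h>
#include <sys/inotify.h>

/*
 * Hypothetical helper: given the event mask produced by the common
 * notification code, return what an inotify listener would see.
 */
uint32_t inotify_reported_mask(uint32_t mask)
{
    /* inotify never reported IN_ISDIR with the *_SELF move/delete events. */
    if (mask & (IN_MOVE_SELF | IN_DELETE_SELF))
        mask &= ~(uint32_t)IN_ISDIR;

    /* All other events, IN_ATTRIB included, keep the flag. */
    return mask;
}
```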

    [JK: Simplify the checks in fsnotify_move()]

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

06 Feb, 2019

1 commit

  • Common fsnotify_event helpers have no need for the mask field.
    It is only used by backend code, so move the field out of the
    abstract fsnotify_event struct and into the concrete backend
    event structs.

    This change packs struct inotify_event_info better on 64bit
    machine and will allow us to cram some more fields into
    struct fanotify_event_info.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

03 Jan, 2019

1 commit

  • Commit 4d97f7d53da7dc83 ("inotify: Add flag IN_MASK_CREATE for
    inotify_add_watch()") forgot to call fdput() before bailing out.

    Fixes: 4d97f7d53da7dc83 ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")
    CC: stable@vger.kernel.org
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Tetsuo Handa
     

04 Oct, 2018

1 commit


18 Aug, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - a few Y2038 fixes

    - ntfs fixes

    - arch/sh tweaks

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (111 commits)
    mm/hmm.c: remove unused variables align_start and align_end
    fs/userfaultfd.c: remove redundant pointer uwq
    mm, vmacache: hash addresses based on pmd
    mm/list_lru: introduce list_lru_shrink_walk_irq()
    mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
    mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
    mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
    mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
    mm/sparse: delete old sparse_init and enable new one
    mm/sparse: add new sparse_init_nid() and sparse_init()
    mm/sparse: move buffer init/fini to the common place
    mm/sparse: use the new sparse buffer functions in non-vmemmap
    mm/sparse: abstract sparse buffer allocations
    mm/hugetlb.c: don't zero 1GiB bootmem pages
    mm, page_alloc: double zone's batchsize
    mm/oom_kill.c: document oom_lock
    mm/hugetlb: remove gigantic page support for HIGHMEM
    mm, oom: remove sleep from under oom_lock
    kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
    mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
    ...

    Linus Torvalds
     
  • Patch series "Directed kmem charging", v8.

    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs. All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.

    The kernel memory can only be charged to the memcg of the process in
    whose context kernel memory was allocated. However there are cases
    where the allocated kernel memory should be charged to a memcg
    different from the current process's memcg. This patch series
    contains two such concrete use-cases i.e. fsnotify and buffer_head.

    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no or slow listener. The events
    are allocated in the context of the event producer. However they should
    be charged to the event consumer. Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.

    To solve this issue, this patch series introduces a mechanism to charge
    kernel memory to a given memcg. In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged. For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.

    This patch (of 2):

    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no or slow listener. This can cause
    system level memory pressure or OOMs. So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.

    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer. This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.

    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happen in the context of a syscall from the
    listener. So, SLAB_ACCOUNT is enough for these caches.

    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.

    The allocations from the event caches happen in the context of the event
    producer. For such caches we will need to remote charge the allocations
    to the listener's memcg. Thus we save the memcg reference in the
    fsnotify_group structure of the listener.

    This patch has also moved the members of fsnotify_group to keep the size
    same, at least for 64 bit build, even with additional member by filling
    the holes.

    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
    Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

28 Jun, 2018

1 commit

  • The flag IN_MASK_CREATE is introduced as a flag for inotify_add_watch()
    which prevents inotify from modifying any existing watches when invoked.
    If the pathname specified in the call has a watched inode associated
    with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST.

    Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and
    will return EINVAL.
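
    A minimal userspace sketch of how a caller might use the flag is shown
    here. The wrapper name add_watch_exclusive() is hypothetical, and the
    fallback definition of IN_MASK_CREATE is an assumption for libc headers
    that predate the flag.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>

#ifndef IN_MASK_CREATE
#define IN_MASK_CREATE 0x10000000  /* assumed fallback for older headers */
#endif

/*
 * Hypothetical wrapper: add a watch only if the inode behind path is not
 * already watched on this inotify instance.
 * Returns the new watch descriptor, or a negative errno value.
 */
int add_watch_exclusive(int fd, const char *path, uint32_t mask)
{
    int wd = inotify_add_watch(fd, path, mask | IN_MASK_CREATE);

    /* -EEXIST here means an existing watch was left untouched. */
    return wd < 0 ? -errno : wd;
}
```

    A caller that gets -EEXIST knows that no existing mask was modified,
    which is the guarantee the rationale below asks for.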

    RATIONALE

    In the current implementation, there is no way to prevent
    inotify_add_watch() from modifying existing watch descriptors. Even if
    the caller keeps a record of all watch descriptors collected, this is
    only sufficient to detect that an existing watch descriptor may have
    been modified.

    The assumption that a particular path will map to the same inode over
    multiple calls to inotify_add_watch() cannot be made as files can be
    renamed or deleted. It is also not possible to assume that two distinct
    paths do not map to the same inode, due to hard-links or a dereferenced
    symbolic link. Further uses of inotify_add_watch() to revert the change
    may cause other watch descriptors to be modified or created, merely
    compounding the problem. There is currently no system call such as
    inotify_modify_watch() to explicitly modify a watch descriptor, which
    would be able to revert unwanted changes. Thus the caller cannot
    guarantee to be able to revert any changes to existing watch descriptors.

    Additionally the caller cannot assume that the events that are
    associated with a watch descriptor are within the set requested, as any
    future calls to inotify_add_watch() may unintentionally modify a watch
    descriptor's mask. Thus it cannot currently be guaranteed that a watch
    descriptor will only generate events which have been requested. The
    program must filter events which come through its watch descriptor to
    within its expected range.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Henry Wilson
    Signed-off-by: Jan Kara

    Henry Wilson
     

18 May, 2018

3 commits

  • Before changing the arguments of the functions fsnotify_add_mark()
    and fsnotify_add_mark_locked(), convert most callers to use a wrapper.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Make some code that handles marks of object types inode and vfsmount
    generic, so it can handle other object types.

    Introduce fsnotify_foreach_obj_type macro to iterate marks by object type
    and fsnotify_iter_{should|set}_report_type macros to set/test report_mask.
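
    The shape of these helpers can be sketched as follows. This is a
    simplified userspace illustration only: the type and function names
    (obj_type, iter_info, foreach_obj_type, iter_should_report_type,
    iter_set_report_type, count_reported_types) are hypothetical and merely
    modelled on the names mentioned above.

```c
#include <stdbool.h>

/* Hypothetical object types; more can be added before OBJ_TYPE_COUNT. */
enum obj_type { OBJ_TYPE_INODE, OBJ_TYPE_VFSMOUNT, OBJ_TYPE_COUNT };

struct iter_info {
    unsigned int report_mask;  /* bit N set: mark of type N is reported */
};

/* Iterate over every known object type. */
#define foreach_obj_type(type) \
    for ((type) = 0; (type) < OBJ_TYPE_COUNT; (type)++)

/* Test whether the mark of the given type should be handled. */
static inline bool iter_should_report_type(const struct iter_info *info,
                                           int type)
{
    return (info->report_mask & (1U << type)) != 0;
}

/* Flag the mark of the given type as one that should be handled. */
static inline void iter_set_report_type(struct iter_info *info, int type)
{
    info->report_mask |= (1U << type);
}

/* Example consumer: count how many object types are flagged. */
int count_reported_types(const struct iter_info *info)
{
    int type, n = 0;

    foreach_obj_type(type) {
        if (iter_should_report_type(info, type))
            n++;
    }
    return n;
}
```

    With this layout, supporting another object type only means adding an
    enum value; the loop and the bit helpers stay unchanged.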

    This is going to be used for adding mark of another object type
    (super block mark).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • inode_mark and vfsmount_mark arguments are passed to handle_event()
    operation as function arguments as well as on iter_info struct.
    The difference is that iter_info struct may contain marks that should
    not be handled and are represented as NULL arguments to inode_mark or
    vfsmount_mark.

    Instead of passing the inode_mark and vfsmount_mark arguments, add
    a report_mask member to iter_info struct to indicate which marks should
    be handled, versus marks that should only be kept alive during user
    wait.

    This change is going to be used for passing more mark types
    with handle_event() (i.e. super block marks).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

06 Apr, 2018

1 commit

  • Pull misc filesystem updates from Jan Kara:
    "udf, ext2, quota, fsnotify fixes & cleanups:

    - udf fixes for handling of media without uid/gid

    - udf fixes for some corner cases in parsing of volume recognition
    sequence

    - improvements of fsnotify handling of ENOMEM

    - new ioctl to allow setting of watch descriptor id for inotify (for
    checkpoint - restart)

    - small ext2, reiserfs, quota cleanups"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: Kill an unused extern entry form quota.h
    reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
    udf: fix potential refcnt problem of nls module
    ext2: change return code to -ENOMEM when failing memory allocation
    udf: Do not mark possibly inconsistent filesystems as closed
    fsnotify: Let userspace know about lost events due to ENOMEM
    fanotify: Avoid lost events due to ENOMEM for unlimited queues
    udf: Remove never implemented mount options
    udf: Update mount option documentation
    udf: Provide saner default for invalid uid / gid
    udf: Clean up handling of invalid uid/gid
    udf: Apply uid/gid mount options also to new inodes & chown
    udf: Ignore [ug]id=ignore mount options
    udf: Fix handling of Partition Descriptors
    udf: Unify common handling of descriptors
    udf: Convert descriptor index definitions to enum
    udf: Allow volume descriptor sequence to be terminated by unrecorded block
    udf: Simplify handling of Volume Descriptor Pointers
    udf: Fix off-by-one in volume descriptor sequence length
    inotify: Extend ioctl to allow to request id of new watch descriptor

    Linus Torvalds
     

03 Apr, 2018

1 commit

  • Using the inotify-internal do_inotify_init() helper allows us to get rid
    of the in-kernel call to sys_inotify_init1() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Acked-by: Jan Kara
    Cc: Amir Goldstein
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

27 Feb, 2018

1 commit

  • Currently if a notification event is lost due to event allocation failing
    with ENOMEM, we just silently continue (except for fanotify permission
    events where we deny the access). This is undesirable as userspace has
    no way of knowing whether the notifications it got are complete or not.
    Treat lost events due to ENOMEM the same way as lost events due to queue
    overflow so that userspace knows something bad happened and it likely
    needs to rescan the filesystem.
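
    On the userspace side, an inotify listener sees a queue overflow as an
    IN_Q_OVERFLOW event. A minimal sketch of how a listener might detect it
    in a buffer returned by read() follows; the helper name
    inotify_needs_rescan() is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Hypothetical helper: scan len bytes read from an inotify descriptor.
 * Returns 1 if any event says notifications were lost, 0 otherwise.
 */
int inotify_needs_rescan(const void *buf, size_t len)
{
    const char *p = buf;
    size_t off = 0;

    while (len - off >= sizeof(struct inotify_event)) {
        struct inotify_event ev;

        /* Copy the fixed-size header to avoid unaligned access. */
        memcpy(&ev, p + off, sizeof(ev));
        if (ev.mask & IN_Q_OVERFLOW)
            return 1;

        /* Stop on a truncated record instead of walking past the end. */
        if (ev.len > len - off - sizeof(ev))
            break;
        off += sizeof(ev) + ev.len;
    }
    return 0;
}
```

    When the helper returns 1, the listener cannot trust its view of the
    watched tree and would typically rescan it.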

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

14 Feb, 2018

1 commit

  • A watch descriptor is the id of a watch created by inotify_add_watch().
    It is allocated in inotify_add_to_idr(), and takes numbers
    starting from 1. Every new inotify watch obtains the next available
    number (usually, old + 1), as served by idr_alloc_cyclic().

    The CRIU (Checkpoint/Restore In Userspace) project supports inotify
    files, and restores watch descriptors with the same numbers
    they had before the dump. Since there was no kernel support, we
    had to use a loop to add a watch with a specific descriptor id:

    while (1) {
            int wd;

            wd = inotify_add_watch(inotify_fd, path, mask);
            if (wd < 0) {
                    break;
            } else if (wd == desired_wd_id) {
                    ret = 0;
                    break;
            }

            inotify_rm_watch(inotify_fd, wd);
    }

    (You may find the actual code at the below link:
    https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

    The loop is suboptimal and very expensive, but since there was no better
    kernel support, it was the only way to restore that. Happily, we had
    mostly met descriptors with small ids, and this approach had worked somehow.

    But recently containers using inotify with big watch descriptor ids have
    begun to appear, and this approach stopped working at all. When the
    descriptor id is something like 0x34d71d6, the restoring process spins in
    a busy loop for a long time, the restore hangs, and the delay of migration
    from node to node is easily noticeable.

    This patch aims to solve this problem. It introduces a new ioctl,
    INOTIFY_IOC_SETNEXTWD, which allows userspace to request the number of
    the next created watch descriptor. It simply calls the idr_set_cursor()
    primitive
    to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
    will return this id, if it is not occupied. This is the way which is
    used to restore some other resources from userspace. For example,
    /proc/sys/kernel/ns_last_pid works the same for task pids.
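
    A rough userspace sketch of how the ioctl could replace the loop above
    is given here. The wrapper name add_watch_with_id() is hypothetical, and
    both the fallback definition of INOTIFY_IOC_SETNEXTWD and the passing of
    the id by value are assumptions based on this description. On kernels
    built without CONFIG_CHECKPOINT_RESTORE the ioctl is expected to fail.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <sys/ioctl.h>

#ifndef INOTIFY_IOC_SETNEXTWD
/* Assumed fallback definition for headers that do not provide it. */
#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, int32_t)
#endif

/*
 * Hypothetical wrapper: try to create a watch whose descriptor id is
 * exactly want_wd. Returns want_wd on success or a negative errno value.
 */
int add_watch_with_id(int fd, const char *path, uint32_t mask, int want_wd)
{
    int wd;

    /* Ask the kernel to hand out want_wd next, if it is still free. */
    if (ioctl(fd, INOTIFY_IOC_SETNEXTWD, want_wd) < 0)
        return -errno;

    wd = inotify_add_watch(fd, path, mask);
    if (wd < 0)
        return -errno;

    /* The id was occupied: undo the watch rather than keep a wrong id. */
    if (wd != want_wd) {
        inotify_rm_watch(fd, wd);
        return -EBUSY;
    }
    return wd;
}
```

    Compared with the loop, this makes one attempt per descriptor instead
    of creating and removing watches until the counter reaches the wanted
    id.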

    The new code is under the CONFIG_CHECKPOINT_RESTORE #define, so small
    systems may exclude it.

    v2: Use INT_MAX instead of custom definition of max id,
    as IDR subsystem guarantees id is between 0 and INT_MAX.

    CC: Jan Kara
    CC: Matthew Wilcox
    CC: Andrew Morton
    CC: Amir Goldstein
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Signed-off-by: Jan Kara

    Kirill Tkhai
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit


15 Nov, 2017

1 commit

  • Pull fsnotify updates from Jan Kara:

    - fixes of use-after-free issues when handling fanotify permission
    events from Miklos

    - refcount_t conversions from Elena

    - fixes of ENOMEM handling in dnotify and fsnotify from me

    * 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
    fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
    fsnotify: clean up fsnotify()
    fanotify: fix fsnotify_prepare_user_wait() failure
    fsnotify: fix pinning group in fsnotify_prepare_user_wait()
    fsnotify: pin both inode and vfsmount mark
    fsnotify: clean up fsnotify_prepare/finish_user_wait()
    fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
    fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
    dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Nov, 2017

1 commit

  • atomic_t variables are currently used to implement reference
    counters with the following properties:
    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situations and be exploitable.

    The variable fsnotify_mark.refcnt is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.
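
    The counter properties listed above can be sketched in userspace C11.
    This is a simplified illustration, not the kernel's refcount_t
    implementation: the demo_* names are hypothetical, and saturating at
    UINT_MAX is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcount_t-like counter. */
struct demo_refcount {
	atomic_uint refs;
};

static void demo_refcount_set(struct demo_refcount *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Take a reference unless the count already reached zero. */
static bool demo_refcount_inc_not_zero(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* object is already on its way out */
		if (old == UINT_MAX)
			return true;	/* saturated: stay pinned, never wrap */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));

	return true;
}

/* Drop a reference; true means the caller dropped the last one. */
static bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* underflow attempt: refuse to wrap */
		if (old == UINT_MAX)
			return false;	/* saturated: leak rather than free */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));

	return old == 1;
}

/* Walk one object through its lifetime; returns 0 on success. */
static int demo_refcount_lifecycle(void)
{
	struct demo_refcount r;

	demo_refcount_set(&r, 1);			/* initialized to 1 */
	assert(demo_refcount_inc_not_zero(&r));		/* second user */
	assert(!demo_refcount_dec_and_test(&r));	/* one user left */
	assert(demo_refcount_dec_and_test(&r));		/* last put: free now */
	assert(!demo_refcount_inc_not_zero(&r));	/* no revival from zero */
	return 0;
}
```

    The point of the sketch is the two refusals: an increment from zero
    fails, and a decrement at zero does not wrap around, which a plain
    atomic counter would silently allow.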

    Suggested-by: Kees Cook
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Signed-off-by: Elena Reshetova
    Signed-off-by: Jan Kara

    Elena Reshetova
     

10 Apr, 2017

8 commits

  • Pointer to ->free_mark callback unnecessarily occupies one long in each
    fsnotify_mark although they are the same for all marks from one
    notification group. Move the callback pointer to fsnotify_ops.
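
    A minimal sketch of the layout change described above, assuming
    simplified structures: the demo_* types and fields are hypothetical
    and far smaller than the real fsnotify_mark, fsnotify_group and
    fsnotify_ops.

```c
#include <assert.h>
#include <stddef.h>

/* Before: every mark carries its own copy of the callback pointer. */
struct demo_mark_before {
	unsigned int mask;
	const void *group;
	void (*free_mark)(struct demo_mark_before *mark);
};

struct demo_mark;

/* After: the callback is stored once, in the group's ops table. */
struct demo_ops {
	void (*free_mark)(struct demo_mark *mark);
};

struct demo_group {
	const struct demo_ops *ops;
};

struct demo_mark {
	unsigned int mask;
	struct demo_group *group;
};

static int demo_freed_count;

static void demo_free_mark(struct demo_mark *mark)
{
	(void)mark;
	demo_freed_count++;	/* stand-in for actually freeing the mark */
}

static const struct demo_ops demo_group_ops = {
	.free_mark = demo_free_mark,
};

/* Dropping a mark reaches the callback through mark->group->ops. */
static void demo_put_mark(struct demo_mark *mark)
{
	mark->group->ops->free_mark(mark);
}

/* Returns the number of marks freed after one put; expected to be 1. */
static int demo_put_one_mark(void)
{
	struct demo_group group = { .ops = &demo_group_ops };
	struct demo_mark mark = { .mask = 0, .group = &group };

	demo_freed_count = 0;
	demo_put_mark(&mark);
	return demo_freed_count;
}
```

    Each mark now pays only for the group pointer it already needed, at
    the cost of two extra dereferences on the release path. It also means
    the mark's group pointer has to be valid whenever the mark is put.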

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently we initialize mark->group only in fsnotify_add_mark_locked().
    However we will need to access fsnotify_ops of corresponding group from
    fsnotify_put_mark() so we need mark->group initialized earlier. Do that
    in fsnotify_init_mark() which has a consequence that once
    fsnotify_init_mark() is called on a mark, the mark has to be destroyed
    by fsnotify_put_mark().

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • These are very thin wrappers, just remove them. Drop
    fs/notify/vfsmount_mark.c as it is empty now.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • These helpers are just very thin wrappers now. Remove them.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • These helpers are now only a simple assignment and just obfuscate
    what is going on. Remove them.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Pass fsnotify_iter_info into ->handle_event() handler so that it can
    release and reacquire SRCU lock via fsnotify_prepare_user_wait() and
    fsnotify_finish_user_wait() functions. These functions also make sure
    current marks are appropriately pinned so that iteration protected by
    srcu in fsnotify() stays safe.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently we queue mark into a list of marks for destruction in
    __fsnotify_free_mark() and keep the last mark reference dangling. After the
    worker waits for SRCU period, it drops the last reference to the mark
    which frees it. This scheme has the disadvantage that if we hold
    reference to a mark and drop and reacquire SRCU lock, the mark can get
    freed immediately which is slightly inconvenient and we will need to
    avoid this in the future.

    Move to a scheme where queueing of mark into a list of marks for
    destruction happens when the last reference to the mark is dropped. Also
    drop reference to the mark held by group list already when mark is
    removed from that list instead of dropping it only from the destruction
    worker.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Dropping mark reference can result in mark being freed. Although it
    should not happen in inotify_remove_from_idr() since caller should hold
    another reference, just don't risk lock up just after WARN_ON
    unnecessarily. Also fold do_inotify_remove_from_idr() into the single
    callsite as that function really is just two lines of real code.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

03 Apr, 2017

1 commit

  • Printing inode pointers in warnings has dubious value and with future
    changes we won't be able to easily get them without either locking or
    risking an oops along the way. So just remove inode pointers from the
    warning messages.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

02 Mar, 2017

1 commit