31 Jan, 2019

1 commit

  • commit 125892edfe69915a227d8d125ff0e1cd713178f4 upstream.

    Commit 4d97f7d53da7dc83 ("inotify: Add flag IN_MASK_CREATE for
    inotify_add_watch()") forgot to call fdput() before bailing out.

    Fixes: 4d97f7d53da7dc83 ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")
    CC: stable@vger.kernel.org
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

01 Dec, 2018

2 commits

  • commit b469e7e47c8a075cc08bcd1e85d4365134bdcdd5 upstream.

    When an event is reported on a sub-directory and the parent inode has
    a mark mask with FS_EVENT_ON_CHILD|FS_ISDIR, the event will be sent to
    fsnotify() even if the event type is not in the parent mark mask
    (e.g. FS_OPEN).

    Furthermore, if that event happened on a mount or a filesystem with
    a mount/sb mark that does have that event type in their mask, the "on
    child" event will be reported on the mount/sb mark. That is not
    desired, because the user will get a duplicate event for the same action.

    Note that the event reported on the victim inode is never merged with
    the event reported on the parent inode, because of the check in
    should_merge(): old_fsn->inode == new_fsn->inode.

    Fix this by looking for a match of an actual event type (i.e. not just
    FS_ISDIR) in parent's inode mark mask and by not reporting an "on child"
    event to group if event type is only found on mount/sb marks.

    [backport hint: The bug seems to have always been in fanotify, but this
    patch will only apply cleanly to v4.19.y]

    Cc: # v4.19
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    [amir: backport to v4.19]
    Signed-off-by: Amir Goldstein
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 007d1e8395eaa59b0e7ad9eb2b53a40859446a88 upstream.

    FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
    not a flag specifying an event type, but rather an extra flag that may
    be reported along with another event and controls the handling of the
    event by the backend.

    FS_ISDIR is also an "extra flag" and not an "event type" and therefore
    deserves the same treatment. With inotify/dnotify backends it was never
    possible to set FS_ISDIR in mark masks, so it did not matter.
    With the fanotify backend, mark adding code jumps through hoops to avoid
    setting FS_ISDIR in the cumulative object mask.

    Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
    ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
    event types.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
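
The distinction between "event types" and "extra flags" described in the two commits above can be sketched in userspace C. This is a minimal model, not the kernel code: the bit values, the SKETCH_* masks and both helper functions are assumptions made for illustration. It shows why a parent mark holding FS_EVENT_ON_CHILD|FS_ISDIR must not match an FS_OPEN event it never asked for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions of this sketch. */
#define FS_MODIFY          0x00000002u
#define FS_OPEN            0x00000020u
#define FS_EVENT_ON_CHILD  0x08000000u
#define FS_ISDIR           0x40000000u

/* "Extra flags" qualify an event but are not event types themselves. */
#define SKETCH_FLAGS   (FS_EVENT_ON_CHILD | FS_ISDIR)
/* The event types this sketch knows about. */
#define SKETCH_EVENTS  (FS_MODIFY | FS_OPEN)

/* Naive test: any shared bit counts, so a shared FS_ISDIR is a match. */
static bool naive_match(uint32_t mark_mask, uint32_t event)
{
    return (event & mark_mask) != 0;
}

/* Stricter test: only an actual event type present in both masks counts. */
static bool mark_wants_event(uint32_t mark_mask, uint32_t event)
{
    return (event & mark_mask & SKETCH_EVENTS) != 0;
}
```

With a parent mark of FS_MODIFY|FS_EVENT_ON_CHILD|FS_ISDIR, an FS_OPEN|FS_ISDIR event passes the naive test only because of the shared FS_ISDIR bit; the stricter test rejects it.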
     

14 Nov, 2018

1 commit

  • commit 721fb6fbfd2132164c2e8777cc837f9b2c1794dc upstream.

    Detaching of mark connector from fsnotify_put_mark() can race with
    unmounting of the filesystem like:

    CPU1                                    CPU2
    fsnotify_put_mark()
      spin_lock(&conn->lock);
      ...
      inode = fsnotify_detach_connector_from_object(conn)
      spin_unlock(&conn->lock);
                                            generic_shutdown_super()
                                              fsnotify_unmount_inodes()
                                                sees connector detached for inode
                                                  -> nothing to do
                                              evict_inode()
                                                barfs on pending inode reference
      iput(inode);

    Resulting in "Busy inodes after unmount" message and possible kernel
    oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
    references from detached connectors.

    Note that the accounting of outstanding inode references in the
    superblock can cause some cacheline contention on the counter. OTOH it
    happens only during deletion of the last notification mark from an inode
    (or during unlinking of watched inode) and that is not too bad. I have
    measured time to create & delete inotify watch 100000 times from 64
    processes in parallel (each process having its own inotify group and its
    own file on a shared superblock) on a 64 CPU machine. Average and
    standard deviation of 15 runs look like:

                Avg         Stddev
    Vanilla     9.817400    0.276165
    Fixed       9.710467    0.228294

    So there's no statistically significant difference.

    Fixes: 6b3f05d24d35 ("fsnotify: Detach mark from object list when last reference is dropped")
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

03 Sep, 2018

1 commit

  • Commit 92183a42898d ("fsnotify: fix ignore mask logic in
    send_to_group()") acknowledges the use case of ignoring an event on
    an inode mark, because of an ignore mask on a mount mark of the same
    group (i.e. I want to get all events on this file, except for the events
    that came from that mount).

    This change depends on correctly merging the inode marks and mount marks
    group lists, so that the mount mark ignore mask would be tested in
    send_to_group(). Alas, the merging of the lists did not take into
    account the case where the event in question is not in the mask of any of
    the mount marks.

    To fix this, completely remove the tests for inode and mount event masks
    from the lists merging code.

    Fixes: 92183a42898d ("fsnotify: fix ignore mask logic in send_to_group")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

30 Aug, 2018

1 commit

  • Pull misc fs fixes from Jan Kara:

    - make UDF properly mount media created by Win7

    - make isofs properly refuse devices with large physical block size

    - fix a Spectre gadget in quotactl(2)

    - fix a warning in fsnotify code hit by syzkaller

    * tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: Fix mounting of Win7 created UDF filesystems
    udf: Remove dead code from udf_find_fileset()
    fs/quota: Fix spectre gadget in do_quotactl
    fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
    isofs: reject hardware sector size > 2048 bytes
    fsnotify: fix false positive warning on inode delete

    Linus Torvalds
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal where signals are
    actually delivered we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

20 Aug, 2018

1 commit

  • When inode is getting deleted and someone else holds reference to a mark
    attached to the inode, we just detach the connector from the inode. In
    that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
    decide to recalculate mask for the inode and __fsnotify_recalc_mask()
    will WARN about invalid connector type:

    WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
    __fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139

    Actually there's no reason to warn about detached connector in
    __fsnotify_recalc_mask() so just silently skip updating the mask in such
    case.

    Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
    Fixes: 3ac70bfcde81 ("fsnotify: add helper to get mask from connector")
    Signed-off-by: Jan Kara

    Jan Kara
     

18 Aug, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - a few Y2038 fixes

    - ntfs fixes

    - arch/sh tweaks

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (111 commits)
    mm/hmm.c: remove unused variables align_start and align_end
    fs/userfaultfd.c: remove redundant pointer uwq
    mm, vmacache: hash addresses based on pmd
    mm/list_lru: introduce list_lru_shrink_walk_irq()
    mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
    mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
    mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
    mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
    mm/sparse: delete old sparse_init and enable new one
    mm/sparse: add new sparse_init_nid() and sparse_init()
    mm/sparse: move buffer init/fini to the common place
    mm/sparse: use the new sparse buffer functions in non-vmemmap
    mm/sparse: abstract sparse buffer allocations
    mm/hugetlb.c: don't zero 1GiB bootmem pages
    mm, page_alloc: double zone's batchsize
    mm/oom_kill.c: document oom_lock
    mm/hugetlb: remove gigantic page support for HIGHMEM
    mm, oom: remove sleep from under oom_lock
    kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
    mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
    ...

    Linus Torvalds
     
  • Patch series "Directed kmem charging", v8.

    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs. All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.

    The kernel memory can only be charged to the memcg of the process in
    whose context kernel memory was allocated. However there are cases
    where the allocated kernel memory should be charged to the memcg
    different from the current process's memcg. This patch series
    contains two such concrete use-cases i.e. fsnotify and buffer_head.

    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no or slow listener. The events
    are allocated in the context of the event producer. However they should
    be charged to the event consumer. Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.

    To solve this issue, this patch series introduces mechanism to charge
    kernel memory to a given memcg. In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged. For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.

    This patch (of 2):

    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no or slow listener. This can cause
    system level memory pressure or OOMs. So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.

    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer. This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.

    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happens in the context of syscall from the
    listener. So, SLAB_ACCOUNT is enough for these caches.

    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.

    The allocations from the event caches happen in the context of the event
    producer. For such caches we will need to remote charge the allocations
    to the listener's memcg. Thus we save the memcg reference in the
    fsnotify_group structure of the listener.

    This patch has also moved the members of fsnotify_group to keep the size
    same, at least for 64 bit build, even with additional member by filling
    the holes.

    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
    Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
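
The scoped "directed charging" idea in the commit above can be modeled with a toy in plain C. This is only a sketch of the pattern: the kernel API named in the message is memalloc_[un]use_memcg(), while every toy_* name below is hypothetical, and a single global stands in for what would be per-task state in the kernel.

```c
#include <stddef.h>

/* Toy stand-in for a memory cgroup: just a byte counter. */
struct toy_memcg {
    long charged;
};

/* Target selected by the current scope; NULL means "no override". */
static struct toy_memcg *toy_active_memcg;

/* Open a scope: subsequent charges go to `target`. */
static void toy_use_memcg(struct toy_memcg *target)
{
    toy_active_memcg = target;
}

/* Close the scope: charges go back to the allocating context. */
static void toy_unuse_memcg(void)
{
    toy_active_memcg = NULL;
}

/* Charge `bytes` to the scoped target if one is set, else to `current_cg`. */
static void toy_charge(struct toy_memcg *current_cg, long bytes)
{
    struct toy_memcg *cg = toy_active_memcg ? toy_active_memcg : current_cg;

    cg->charged += bytes;
}
```

In this model an event producer would wrap the event allocation in toy_use_memcg(listener) and toy_unuse_memcg(), so the listener's counter grows while the producer's stays unchanged.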
     

21 Jul, 2018

2 commits

  • When f_setown is called a pid and a pid type are stored. Replace the use
    of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
    group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
    is only for a thread.

    Update the users of __f_setown to use PIDTYPE_TGID instead of
    PIDTYPE_PID.

    For now the code continues to capture task_pid (when task_tgid would
    really be appropriate), and iterate on PIDTYPE_PID (even when type ==
    PIDTYPE_TGID) out of an abundance of caution to preserve existing
    behavior.

    Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
    for tgid lookup also be used to avoid taking the tasklist lock.

    Suggested-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The cost is the same and this removes the need
    to worry about complications that come from de_thread
    and group_leader changing.

    __task_pid_nr_ns has been updated to take advantage of this change.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Jun, 2018

1 commit

  • The flag IN_MASK_CREATE is introduced as a flag for inotify_add_watch()
    which prevents inotify from modifying any existing watches when invoked.
    If the pathname specified in the call has a watched inode associated
    with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST.

    Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and
    will return EINVAL.

    RATIONALE

    In the current implementation, there is no way to prevent
    inotify_add_watch() from modifying existing watch descriptors. Even if
    the caller keeps a record of all watch descriptors collected, this is
    only sufficient to detect that an existing watch descriptor may have
    been modified.

    The assumption that a particular path will map to the same inode over
    multiple calls to inotify_add_watch() cannot be made as files can be
    renamed or deleted. It is also not possible to assume that two distinct
    paths do not map to the same inode, due to hard-links or a dereferenced
    symbolic link. Further uses of inotify_add_watch() to revert the change
    may cause other watch descriptors to be modified or created, merely
    compounding the problem. There is currently no system call such as
    inotify_modify_watch() to explicitly modify a watch descriptor, which
    would be able to revert unwanted changes. Thus the caller cannot
    guarantee to be able to revert any changes to existing watch descriptors.

    Additionally the caller cannot assume that the events that are
    associated with a watch descriptor are within the set requested, as any
    future calls to inotify_add_watch() may unintentionally modify a watch
    descriptor's mask. Thus it cannot currently be guaranteed that a watch
    descriptor will only generate events which have been requested. The
    program must filter events which come through its watch descriptor to
    within its expected range.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Henry Wilson
    Signed-off-by: Jan Kara

    Henry Wilson
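
A userspace caller could use the flag roughly as follows. This is a hedged sketch: add_watch_exclusive() is a hypothetical helper, the fallback #define is only for libc headers that predate the flag, and the EEXIST behaviour relies on a kernel that includes this commit.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>

#ifndef IN_MASK_CREATE
#define IN_MASK_CREATE 0x10000000 /* fallback for older libc headers */
#endif

/*
 * Add a watch only if `path` is not already watched on `fd`.
 * Returns the new watch descriptor, or a negative errno value;
 * -EEXIST means an existing watch was found and left untouched.
 */
static int add_watch_exclusive(int fd, const char *path, uint32_t mask)
{
    int wd = inotify_add_watch(fd, path, mask | IN_MASK_CREATE);

    return wd >= 0 ? wd : -errno;
}
```

Calling the helper twice for the same inode on the same inotify instance should yield a watch descriptor the first time and -EEXIST the second time, instead of silently replacing the first mask.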
     

27 Jun, 2018

5 commits


18 May, 2018

7 commits

  • Before changing the arguments of the functions fsnotify_add_mark()
    and fsnotify_add_mark_locked(), convert most callers to use a wrapper.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Use fsnotify_foreach_obj_type macros to generalize the code that filters
    events by marks mask and ignored_mask.

    This is going to be used for adding mark of super block object type.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Use fsnotify_foreach_obj_type macros to generalize the code that filters
    events by marks mask and ignored_mask.

    This is going to be used for adding mark of super block object type.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Make some code that handles marks of object types inode and vfsmount
    generic, so it can handle other object types.

    Introduce fsnotify_foreach_obj_type macro to iterate marks by object type
    and fsnotify_iter_{should|set}_report_type macros to set/test report_mask.

    This is going to be used for adding mark of another object type
    (super block mark).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • Introduce helpers fsnotify_iter_select_report_types() and
    fsnotify_iter_next() to abstract the inode/vfsmount marks merged
    list iteration.

    This is a preparation patch before generalizing mark list
    iteration to more mark object types (i.e. super block marks).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • inode_mark and vfsmount_mark arguments are passed to handle_event()
    operation as function arguments as well as on iter_info struct.
    The difference is that iter_info struct may contain marks that should
    not be handled and are represented as NULL arguments to inode_mark or
    vfsmount_mark.

    Instead of passing the inode_mark and vfsmount_mark arguments, add
    a report_mask member to iter_info struct to indicate which marks should
    be handled, versus marks that should only be kept alive during user
    wait.

    This change is going to be used for passing more mark types
    with handle_event() (i.e. super block marks).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     
  • An fsnotify_mark_connector is referencing a single type of object
    (either inode or vfsmount). Instead of storing a type mask in
    connector->flags, store a single type id in connector->type to
    identify the type of object.

    When a connector object is detached from the object, its type is set
    to FSNOTIFY_OBJ_TYPE_DETACHED and this object is not going to be
    reused.

    The function fsnotify_clear_marks_by_group() is the only place where
    the type mask was used, so pass type flags instead of a type id to this
    function.

    This change is going to be more convenient when adding a new object
    type (super block).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
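
The series above moves from a type mask to a per-object type id, plus a report_mask that says which marks should be handled. A compact model of that arrangement is sketched below. All identifiers are simplified stand-ins modeled on the commit text, not the kernel definitions.

```c
#include <stdbool.h>

/* One id per object type; DETACHED marks a connector that is not reused. */
enum obj_type {
    OBJ_TYPE_INODE,
    OBJ_TYPE_VFSMOUNT,
    OBJ_TYPE_COUNT,
    OBJ_TYPE_DETACHED = OBJ_TYPE_COUNT
};

/* A type id converts to a single-bit flag for use in masks. */
#define OBJ_TYPE_FL(type) (1u << (type))

struct iter_info {
    const void *marks[OBJ_TYPE_COUNT]; /* marks kept alive while iterating */
    unsigned int report_mask;          /* which of them should be handled */
};

static void iter_set_report_type(struct iter_info *it, enum obj_type t)
{
    it->report_mask |= OBJ_TYPE_FL(t);
}

static bool iter_should_report_type(const struct iter_info *it, enum obj_type t)
{
    return (it->report_mask & OBJ_TYPE_FL(t)) != 0;
}

/* Count the marks a backend would actually handle. */
static int count_reported(const struct iter_info *it)
{
    int n = 0;

    for (int t = 0; t < OBJ_TYPE_COUNT; t++)
        if (it->marks[t] && iter_should_report_type(it, (enum obj_type)t))
            n++;
    return n;
}
```

Adding a further object type in this model means adding one enum value before OBJ_TYPE_COUNT; the loop and the mask helpers need no change, which is the convenience the commits aim for.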
     

13 Apr, 2018

1 commit

  • The ignore mask logic in send_to_group() does not match the logic
    in fanotify_should_send_event(). In the latter, a vfsmount mark ignore
    mask precedes an inode mark mask and in the former, it does not.

    That difference may cause events to be sent to fanotify backend for no
    reason. Fix the logic in send_to_group() to match that of
    fanotify_should_send_event().

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
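
The intended rule, that an ignore mask on either mark of a group suppresses the event for the whole group, can be sketched as below. The struct, the EV_* values and group_wants_event() are assumptions made for illustration and do not reproduce the kernel's send_to_group().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EV_MODIFY 0x02u /* illustrative event bits */
#define EV_OPEN   0x20u

struct toy_mark {
    uint32_t mask;         /* events the listener asked for */
    uint32_t ignored_mask; /* events the listener asked to suppress */
};

/*
 * Decide whether a group holding an optional inode mark and an optional
 * mount mark should receive `event`. Either pointer may be NULL.
 */
static bool group_wants_event(const struct toy_mark *inode_mark,
                              const struct toy_mark *mount_mark,
                              uint32_t event)
{
    uint32_t wanted = 0, ignored = 0;

    if (inode_mark) {
        wanted |= inode_mark->mask;
        ignored |= inode_mark->ignored_mask;
    }
    if (mount_mark) {
        wanted |= mount_mark->mask;
        ignored |= mount_mark->ignored_mask;
    }
    /* The combined ignore mask is applied before anything is delivered. */
    return (event & wanted & ~ignored) != 0;
}
```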
     

09 Apr, 2018

1 commit

  • When events on child inodes are sent to the parent inode mark and
    the parent inode mark was not marked with FAN_EVENT_ON_CHILD, the event
    will not be delivered to the listener process. However, if the same
    process also has a mount mark, the event to the parent inode will be
    delivered regardless of the mount mark mask.

    This behavior is incorrect in the case where the mount mark mask does
    not contain the specific event type. For example, the process adds
    a mark on a directory with mask FAN_MODIFY (without FAN_EVENT_ON_CHILD)
    and a mount mark with mask FAN_CLOSE_NOWRITE (without FAN_ONDIR).

    A modify event on a file inside that directory (and inside that mount)
    should not create a FAN_MODIFY event, because neither of the marks
    requested to get that event on the file.

    Fixes: 1968f5eed54c ("fanotify: use both marks when possible")
    Cc: stable
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

06 Apr, 2018

1 commit

  • Pull misc filesystem updates from Jan Kara:
    "udf, ext2, quota, fsnotify fixes & cleanups:

    - udf fixes for handling of media without uid/gid

    - udf fixes for some corner cases in parsing of volume recognition
    sequence

    - improvements of fsnotify handling of ENOMEM

    - new ioctl to allow setting of watch descriptor id for inotify (for
    checkpoint - restart)

    - small ext2, reiserfs, quota cleanups"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: Kill an unused extern entry form quota.h
    reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
    udf: fix potential refcnt problem of nls module
    ext2: change return code to -ENOMEM when failing memory allocation
    udf: Do not mark possibly inconsistent filesystems as closed
    fsnotify: Let userspace know about lost events due to ENOMEM
    fanotify: Avoid lost events due to ENOMEM for unlimited queues
    udf: Remove never implemented mount options
    udf: Update mount option documentation
    udf: Provide saner default for invalid uid / gid
    udf: Clean up handling of invalid uid/gid
    udf: Apply uid/gid mount options also to new inodes & chown
    udf: Ignore [ug]id=ignore mount options
    udf: Fix handling of Partition Descriptors
    udf: Unify common handling of descriptors
    udf: Convert descriptor index definitions to enum
    udf: Allow volume descriptor sequence to be terminated by unrecorded block
    udf: Simplify handling of Volume Descriptor Pointers
    udf: Fix off-by-one in volume descriptor sequence length
    inotify: Extend ioctl to allow to request id of new watch descriptor

    Linus Torvalds
     

03 Apr, 2018

2 commits

  • Using the fs-internal do_fanotify_mark() helper allows us to get rid of
    the fs-internal call to the sys_fanotify_mark() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Acked-by: Jan Kara
    Cc: Amir Goldstein
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the inotify-internal do_inotify_init() helper allows us to get rid
    of the in-kernel call to sys_inotify_init1() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Acked-by: Jan Kara
    Cc: Amir Goldstein
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

27 Feb, 2018

2 commits

  • Currently, if a notification event is lost because event allocation
    fails with ENOMEM, we just silently continue (except for fanotify permission
    events where we deny the access). This is undesirable as userspace has
    no way of knowing whether the notifications it got are complete or not.
    Treat lost events due to ENOMEM the same way as lost events due to queue
    overflow so that userspace knows something bad happened and it likely
    needs to rescan the filesystem.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Fanotify queues of unlimited length do not expect that events can be lost.
    Since these queues are used for system auditing and other security
    related tasks, losing events can even have security implications.
    Currently, since the allocation is small (32 bytes), it cannot fail;
    however, when we start accounting events in memcgs, allocation can start
    failing. So avoid losing events due to failure to allocate memory by
    making event allocation use __GFP_NOFAIL.

    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
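
On the userspace side, treating allocation failures like queue overflow means an inotify listener can keep a single check for lost events. A minimal sketch of that check follows; events_were_lost() is a hypothetical helper, and the buffer is assumed to have been filled by read(2) on an inotify file descriptor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Scan a buffer of inotify records and report whether it contains an
 * overflow record, meaning some events were dropped and the watched
 * tree likely needs a rescan.
 */
static bool events_were_lost(const char *buf, size_t len)
{
    size_t off = 0;

    while (off + sizeof(struct inotify_event) <= len) {
        struct inotify_event ev;

        /* Copy the fixed-size header out to avoid unaligned access. */
        memcpy(&ev, buf + off, sizeof(ev));
        if (ev.mask & IN_Q_OVERFLOW)
            return true;
        /* Skip the header plus the optional, padded name field. */
        off += sizeof(ev) + ev.len;
    }
    return false;
}
```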
     

14 Feb, 2018

1 commit

  • A watch descriptor is the id of a watch created by inotify_add_watch().
    It is allocated in inotify_add_to_idr(), with numbering starting
    from 1. Every new inotify watch obtains the next available
    number (usually, old + 1), as served by idr_alloc_cyclic().

    The CRIU (Checkpoint/Restore In Userspace) project supports inotify
    files, and restores watch descriptors with the same numbers
    they had before the dump. Since there was no kernel support, we
    had to use a loop to add a watch with a specific descriptor id:

    while (1) {
            int wd;

            wd = inotify_add_watch(inotify_fd, path, mask);
            if (wd < 0) {
                    break;
            } else if (wd == desired_wd_id) {
                    ret = 0;
                    break;
            }

            inotify_rm_watch(inotify_fd, wd);
    }

    (You may find the actual code at the below link:
    https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

    The loop is suboptimal and very expensive, but since there was no better
    kernel support, it was the only way to restore the descriptors. Happily,
    we had mostly met descriptors with small ids, and this approach worked.

    But recently, containers whose inotify instances have large watch
    descriptors have begun to appear, and this approach stopped working
    altogether. When the descriptor id is around 0x34d71d6, the restoring
    process spins in a busy loop for a long time, the restore hangs, and
    the delay in migration from node to node is easily noticeable.

    This patch aims to solve this problem. It introduces a new ioctl,
    INOTIFY_IOC_SETNEXTWD, which allows userspace to request the number of
    the next created watch descriptor. It simply calls the idr_set_cursor()
    primitive to populate idr::idr_next, so that the next idr_alloc_cyclic()
    allocation will return this id, if it is not occupied. This is the same
    way some other resources are restored from userspace. For example,
    /proc/sys/kernel/ns_last_pid works the same way for task pids.

    The new code is under the CONFIG_CHECKPOINT_RESTORE #define, so small
    systems may exclude it.

    v2: Use INT_MAX instead of custom definition of max id,
    as IDR subsystem guarantees id is between 0 and INT_MAX.

    CC: Jan Kara
    CC: Matthew Wilcox
    CC: Andrew Morton
    CC: Amir Goldstein
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Signed-off-by: Jan Kara

    Kirill Tkhai
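
Why setting a cursor removes the need for the add/remove loop can be seen in a toy cyclic allocator. This is a simplified model under stated assumptions: the toy_* names and the tiny id space are invented for the sketch, and it only imitates the "start searching at the cursor" behaviour the message attributes to idr_alloc_cyclic() and idr_set_cursor().

```c
#include <stdbool.h>

#define TOY_MAX_ID 16 /* tiny id space for the sketch; ids run 1..TOY_MAX_ID */

struct toy_idr {
    bool used[TOY_MAX_ID + 1];
    int next; /* cursor: where the next search starts */
};

/* Analogue of setting the cursor: choose where allocation resumes. */
static void toy_set_cursor(struct toy_idr *idr, int id)
{
    idr->next = id;
}

/*
 * Analogue of a cyclic allocation: take the first free id at or after
 * the cursor, wrapping around to 1. Returns -1 if every id is taken.
 */
static int toy_alloc_cyclic(struct toy_idr *idr)
{
    int start = (idr->next >= 1 && idr->next <= TOY_MAX_ID) ? idr->next : 1;

    for (int i = 0; i < TOY_MAX_ID; i++) {
        int id = (start - 1 + i) % TOY_MAX_ID + 1;

        if (!idr->used[id]) {
            idr->used[id] = true;
            idr->next = id + 1;
            return id;
        }
    }
    return -1;
}
```

Without the cursor, reaching id 9 in this model takes nine allocations and eight releases; with toy_set_cursor(&idr, 9) it takes one allocation, which mirrors the saving the ioctl is meant to give CRIU.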
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2018

1 commit

  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     

28 Nov, 2017

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This is a pure automated search-and-replace of the internal kernel
    superblock flags.

    The s_flags are now called SB_*, with the names and the values for the
    moment mirroring the MS_* flags that they're equivalent to.

    Note how the MS_xyz flags are the ones passed to the mount system call,
    while the SB_xyz flags are what we then use in sb->s_flags.

    The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
    include/linux/fs.h include/uapi/linux/bfs_fs.h \
    security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
    DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
    POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
    I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
    ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done
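
    As a rough illustration of how SED_PROG is assembled and applied, the
    sketch below builds the same kind of expression list from a shortened
    symbol list and runs it on a sample line instead of on the tree. The
    rename_flags function name, the three-symbol list and the sample input
    are assumptions made for illustration; they are not part of the
    original script.

```shell
#!/bin/sh
# Hypothetical, shortened symbol list; the real script uses the full SYMS.
SYMS="RDONLY NOSUID NODEV"

# Build one -e expression per symbol, the same way the original loop does.
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# rename_flags (illustrative name): apply the substitutions to stdin.
# $SED_PROG is deliberately left unquoted so that it splits into
# separate "-e" and expression arguments.
rename_flags() {
    sed $SED_PROG
}

echo 'if (sb->s_flags & MS_RDONLY) flags |= MS_NOSUID;' | rename_flags
# prints: if (sb->s_flags & SB_RDONLY) flags |= SB_NOSUID;
```

    Note that the substitutions carry no word-boundary anchors: the -w
    passed to git grep only decides which files are visited, so inside a
    selected file any occurrence of a listed MS_ name is rewritten.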

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2017

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual rocket-science from trivial tree for 4.15"

    * 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    MAINTAINERS: relinquish kconfig
    MAINTAINERS: Update my email address
    treewide: Fix typos in Kconfig
    kfifo: Fix comments
    init/Kconfig: Fix module signing document location
    misc: ibmasm: Return error on error path
    HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback"
    MAINTAINERS: Correct path to uDraw PS3 driver
    tracing: Fix doc mistakes in trace sample
    tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER
    MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig
    mm/huge_memory.c: fixup grammar in comment
    lib/xz: Add fall-through comments to a switch statement

    Linus Torvalds
     

15 Nov, 2017

2 commits

  • Pull quota, ext2, isofs and udf fixes from Jan Kara:

    - two small quota error handling fixes

    - two isofs fixes for architectures with signed char

    - several udf block number overflow and signedness fixes

    - ext2 rework of mount option handling to avoid GFP_KERNEL allocation
    with spinlock held

    - ... it also contains a patch to implement auditing of responses to
    fanotify permission events. That should have been in the fanotify
    pull request, but I mistakenly merged that patch into the wrong branch
    and noticed only now, at which point I don't think it's worth rebasing
    and redoing.

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: be aware of error from dquot_initialize
    quota: fix potential infinite loop
    isofs: use unsigned char types consistently
    isofs: fix timestamps beyond 2027
    udf: Fix some sign-conversion warnings
    udf: Fix signed/unsigned format specifiers
    udf: Fix 64-bit sign extension issues affecting blocks > 0x7FFFFFFF
    udf: Remove some outdate references from documentation
    udf: Avoid overflow when session starts at large offset
    ext2: Fix possible sleep in atomic during mount option parsing
    ext2: Parse mount options into a dedicated structure
    audit: Record fanotify access control decisions

    Linus Torvalds
     
  • Pull fsnotify updates from Jan Kara:

    - fixes of use-after-free issues when handling fanotify permission
    events from Miklos

    - refcount_t conversions from Elena

    - fixes of ENOMEM handling in dnotify and fsnotify from me

    * 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
    fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
    fsnotify: clean up fsnotify()
    fanotify: fix fsnotify_prepare_user_wait() failure
    fsnotify: fix pinning group in fsnotify_prepare_user_wait()
    fsnotify: pin both inode and vfsmount mark
    fsnotify: clean up fsnotify_prepare/finish_user_wait()
    fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
    fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
    dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()

    Linus Torvalds