13 Jan, 2019

1 commit

  • commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

    This changes the fork(2) syscall to record the process start_time after
    initializing the basic task structure but still before making the new
    process visible to user-space.

    Technically, we could record the start_time anytime during fork(2). But
    this might lead to scenarios where a start_time is recorded long before
    a process becomes visible to user-space. For instance, with
    userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
    for an indefinite amount of time (and will, if this causes network
    access, or similar).

    By recording the start_time late, it much closer reflects the point in
    time where the process becomes live and can be observed by other
    processes.

    Lastly, this makes it much harder for user-space to predict and control
    the start_time they get assigned. Previously, user-space could fork a
    process and stall it in copy_thread_tls() before its pid is allocated,
    but after its start_time is recorded. This can be misused to later-on
    cycle through PIDs and resume the stalled fork(2) yielding a process
    that has the same pid and start_time as a process that existed before.
    This can be used to circumvent security systems that identify processes
    by their pid+start_time combination.

    Even though user-space was always aware that start_time recording is
    flaky (but several projects are known to still rely on start_time-based
    identification), changing the start_time to be recorded late will help
    mitigate existing attacks and make it much harder for user-space to
    control the start_time a process gets assigned.
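
    In code terms, the change boils down to roughly the following ordering in
    copy_process() (a sketch, not the exact diff; it assumes the ktime helpers
    fork used at the time):

        /* record timestamps only after the expensive, user-controllable
         * parts of fork are done, right before the task becomes visible */
        p->start_time = ktime_get_ns();
        p->real_start_time = ktime_get_boot_ns();

        /* ... make it visible to the rest of the world: allocate/attach
         * the pid and link the task into the tasklist ... */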

    Reported-by: Jann Horn
    Signed-off-by: Tom Gundersen
    Signed-off-by: David Herrmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Herrmann
     

05 Sep, 2018

1 commit

  • Commit d70f2a14b72a ("include/linux/sched/mm.h: uninline mmdrop_async(),
    etc") ignored the return value of arch_dup_mmap(). As a result, on x86,
    a failure to duplicate the LDT (e.g. due to memory allocation error)
    would leave the duplicated memory mapping in an inconsistent state.

    Fix by using the return value, as it was before the change.
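
    A minimal sketch of the fix in dup_mmap() (variable names assumed from the
    surrounding error-handling code):

        /* before: the return value was dropped and retval forced to 0 */
        arch_dup_mmap(oldmm, mm);
        retval = 0;

        /* after: propagate a failure (e.g. LDT duplication) to the caller */
        retval = arch_dup_mmap(oldmm, mm);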

    Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com
    Fixes: d70f2a14b72a4 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
    Signed-off-by: Nadav Amit
    Acked-by: Michal Hocko
    Cc:

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

23 Aug, 2018

4 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton : (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds
     
  • Before this change, if a multithreaded process forks while one of its
    threads is changing a signal handler using sigaction(), the memcpy() in
    copy_sighand() can race with the struct assignment in do_sigaction(). It
    isn't clear whether this can cause corruption of the userspace signal
    handler pointer, but it definitely can cause inconsistency between
    different fields of struct sigaction.

    Take the appropriate spinlock to avoid this.
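
    A sketch of what the fix amounts to in copy_sighand() (simplified, not the
    verbatim diff):

        spin_lock_irq(&current->sighand->siglock);
        memcpy(sig->action, current->sighand->action, sizeof(sig->action));
        spin_unlock_irq(&current->sighand->siglock);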

    I have tested that this patch prevents inconsistency between sa_sigaction
    and sa_flags, which is possible before this patch.

    Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com
    Signed-off-by: Jann Horn
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Rik van Riel
    Cc: "Peter Zijlstra (Intel)"
    Cc: Kees Cook
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
    Currently the task hung checking interval is equal to the timeout, so a
    hang is detected anywhere between timeout and 2*timeout. This is fine for
    most interactive environments, but it hurts automated testing setups
    (syzbot). In an automated setup we need to strictly order CPU lockup <
    RCU stall < workqueue lockup < task hung < silent loss, so that an RCU
    stall is not detected as a task hung and a task hung is not detected as
    silent machine loss. The large variance in task hung detection timeout
    requires setting the silent machine loss timeout to a very large value
    (e.g. if the task hung timeout is 3 mins, then silent loss needs to be
    set to ~7 mins). The additional 3 minutes significantly reduce testing
    efficiency because usually we crash the kernel within a minute, and this
    can add hours to the bug localization process as it needs to do dozens
    of tests.

    Allow setting the checking interval separately from the timeout. This
    allows setting the timeout to, say, 3 minutes, but the checking interval
    to 10 secs.

    The interval is controlled via a new hung_task_check_interval_secs sysctl,
    similar to the existing hung_task_timeout_secs sysctl. The default value
    of 0 results in the current behavior: checking interval is equal to
    timeout.
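
    A sketch of how the watchdog is expected to pick its sleep interval under
    the new sysctl (variable names assumed; the real loop is more involved):

        /* check every hung_task_check_interval_secs, but never less often
         * than the timeout itself; 0 keeps the old behaviour */
        unsigned long interval = sysctl_hung_task_check_interval_secs;
        unsigned long timeout  = sysctl_hung_task_timeout_secs;

        if (interval == 0)
                interval = timeout;
        interval = min(interval, timeout);
        schedule_timeout_interruptible(interval * HZ);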

    [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
    Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Paul E. McKenney
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
    Rather than in vm_area_alloc(), to ensure that the various oddball
    stack-based vmas are in a good state; some of the callers were zeroing
    them out, others were not.

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal, where signals are
    actually delivered, we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

18 Aug, 2018

1 commit

  • Patch series "Directed kmem charging", v8.

    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs. All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.

    The kernel memory can only be charged to the memcg of the process in
    whose context kernel memory was allocated. However there are cases
    where the allocated kernel memory should be charged to a memcg
    different from the current process's memcg. This patch series
    contains two such concrete use-cases i.e. fsnotify and buffer_head.

    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no listener or a slow listener. The events
    are allocated in the context of the event producer. However they should
    be charged to the event consumer. Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.

    To solve this issue, this patch series introduces a mechanism to charge
    kernel memory to a given memcg. In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged. For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.
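
    A sketch of the scope API described above (the event cache name and the
    group->memcg field are as introduced by this series; the call site is
    illustrative):

        /* charge all __GFP_ACCOUNT allocations in this scope to the
         * listener's memcg rather than to the producer's */
        memalloc_use_memcg(group->memcg);
        event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
        memalloc_unuse_memcg();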

    This patch (of 2):

    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no listener or a slow listener. This can cause
    system level memory pressure or OOMs. So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.

    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer. This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.

    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happens in the context of syscall from the
    listener. So, SLAB_ACCOUNT is enough for these caches.

    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.

    The allocations from the event caches happen in the context of the event
    producer. For such caches we will need to remote charge the allocations
    to the listener's memcg. Thus we save the memcg reference in the
    fsnotify_group structure of the listener.

    This patch also rearranges the members of fsnotify_group so that its
    size stays the same, at least for 64-bit builds, even with the
    additional member, by filling the holes.

    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
    Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull x86 mm updates from Thomas Gleixner:

    - Make lazy TLB mode even lazier to avoid pointless switch_mm()
    operations, which reduces CPU load by 1-2% for memcache workloads

    - Small cleanups and improvements all over the place

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Remove redundant check for kmem_cache_create()
    arm/asm/tlb.h: Fix build error implicit func declaration
    x86/mm/tlb: Make clear_asid_other() static
    x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
    x86/mm/tlb: Always use lazy TLB mode
    x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    x86/mm/tlb: Make lazy TLB mode lazier
    x86/mm/tlb: Restructure switch_mm_irqs_off()
    x86/mm/tlb: Leave lazy TLB mode at page table free time
    mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
    x86/mm: Add TLB purge to free pmd/pte page interfaces
    ioremap: Update pgtable free interfaces with addr
    x86/mm: Disable ioremap free page handling on x86-PAE

    Linus Torvalds
     

10 Aug, 2018

1 commit

  • Wen Yang and majiang
    report that a periodic signal received during fork can cause fork to
    continually restart preventing an application from making progress.

    The code was being overly pessimistic. Fork needs to guarantee that a
    signal sent to multiple processes is logically delivered before the
    fork and just to the forking process, or logically delivered after the
    fork to both the forking process and its newly spawned child. For
    signals like periodic timers that are always delivered to a single
    process, fork can safely complete and let them appear to be logically
    delivered after the fork().

    While examining this issue I also discovered that fork today will miss
    signals delivered to multiple processes during the fork and handled by
    another thread. Similarly the current code will also miss blocked
    signals that are delivered to multiple processes, as those signals will
    not appear pending during fork.

    Add a list of each thread that is currently forking, and keep on that
    list a signal set that records all of the signals sent to multiple
    processes. When fork completes, initialize the new process's
    shared_pending signal set with it. The calculate_sigpending function
    will see those signals and set TIF_SIGPENDING, causing the new task to
    take the slow path to userspace to handle those signals, making it
    appear as if those signals were received immediately after the fork.
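
    A condensed sketch of the mechanism in copy_process() (simplified; the
    structure and field names are as introduced by this change, locking and
    error paths omitted):

        struct multiprocess_signals delayed;

        /* register on the parent's list so signals sent to multiple
         * processes while we fork are recorded in delayed.signal */
        sigemptyset(&delayed.signal);
        hlist_add_head(&delayed.node, &current->signal->multiprocess);

        /* ... expensive part of fork ... */

        /* before the child becomes visible: make those signals pending */
        p->signal->shared_pending.signal = delayed.signal;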

    It is not possible to send real time signals to multiple processes and
    exceptions don't go to multiple processes, which means that there are
    no signals sent to multiple processes that require siginfo. This
    means it is safe to not bother collecting siginfo on signals sent
    during fork.

    The sigaction of a child of fork is initially the same as the
    sigaction of the parent process. So a signal the parent ignores, the
    child will also initially ignore. Therefore it is safe to ignore
    signals sent to multiple processes and ignored by the forking process.

    Signals sent to only a single process or only a single thread and delivered
    during fork are treated as if they are received after the fork, and generally
    not dealt with. They won't cause any problems.

    V2: Added removal from the multiprocess list on failure.
    V3: Use -ERESTARTNOINTR directly
    V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
    is visible to signals. This prevents signals from coming
    in before shared_pending.signal is set to delayed.signal
    and being lost.
    V5: - rework list add and delete to account for idle threads
    v6: - Use sigdelsetmask when removing stop signals

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
    Reported-by: Wen Yang and
    Reported-by: majiang
    Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 Aug, 2018

1 commit


04 Aug, 2018

1 commit

  • There are only two signals that are delivered to every member of a
    signal group: SIGSTOP and SIGKILL. Signal delivery requires that every
    signal appear to be delivered either before or after a clone syscall.
    SIGKILL terminates the clone, so it does not need to be considered, which
    leaves only SIGSTOP that needs to be considered when creating new
    threads.

    Today in the event of a group stop TIF_SIGPENDING will get set and the
    fork will restart ensuring the fork syscall participates in the group
    stop.

    A fork (especially of a process with a lot of memory) is one of the
    most expensive system calls, so we really only want to restart a fork when
    necessary.

    It is easy to check whether a SIGSTOP is ongoing and have the new
    thread join it immediately after the clone completes, making it appear
    as if the clone completed just before the SIGSTOP.

    The calculate_sigpending function will see the bits set in jobctl and
    set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
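
    A sketch of the idea behind task_join_group_stop() (simplified from the
    actual helper; the jobctl flag names are assumptions based on the existing
    group-stop code):

        void task_join_group_stop(struct task_struct *task)
        {
                unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
                struct signal_struct *sig = current->signal;

                if (sig->group_stop_count) {
                        /* an on-going group stop: have the new thread take part */
                        sig->group_stop_count++;
                        task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING |
                                                      JOBCTL_STOP_CONSUME);
                }
        }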

    V2: The call to task_join_group_stop was moved before the new task is
    added to the thread group list. This should not matter as
    sighand->siglock is held over both the addition of the threads,
    the call to task_join_group_stop and do_signal_stop. But the change
    is trivial and it is one less thing to worry about when reading
    the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Aug, 2018

1 commit

  • We were hitting a panic in production where we put too many times on the
    request queue. This is because we'd get the throttle_queue of the
    parent if we fork()'ed while we needed to be throttled, but we didn't
    have a reference on it. Instead just clear these flags on fork so the
    child doesn't pay for the sins of its father.
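
    The fix boils down to clearing the inherited throttle state in
    copy_process(); a sketch, assuming the blk-cgroup fields involved:

        #ifdef CONFIG_BLK_CGROUP
        /* the child must not inherit a throttle_queue it holds no
         * reference on, nor the parent's pending memdelay throttling */
        tsk->throttle_queue = NULL;
        tsk->use_memdelay = 0;
        #endif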

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

27 Jul, 2018

1 commit

    Not all VMAs are allocated with vm_area_alloc(). Some of them are
    allocated on the stack or in the data segment.

    The new helper can be used to initialize a VMA properly regardless of where it
    was allocated.
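
    A sketch of what such a helper looks like (the dummy vm_ops detail is an
    assumption tied to the rest of this series):

        static inline void vma_init(struct vm_area_struct *vma,
                                    struct mm_struct *mm)
        {
                static const struct vm_operations_struct dummy_vm_ops = {};

                vma->vm_mm = mm;
                vma->vm_ops = &dummy_vm_ops;
                INIT_LIST_HEAD(&vma->anon_vma_chain);
        }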

    Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jul, 2018

2 commits

    In practice this does not change anything, as testing for fatal_signal_pending
    and exiting with an error code duplicates the work of the next clause,
    which recalculates pending signals and then exits fork if any are pending.
    In both cases the pending signal will trigger the slow path when exiting
    to userspace, and the fatal signal will cause do_exit to be called.

    The advantage of making this a separate test is that it makes it clear
    processing the fatal signal will terminate the fork, and it allows the
    rest of the signal logic to be updated without fear that this important
    case will be lost.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
    Normally this would be handled by the code that handles signals sent to
    a group of processes, but in this case the forking process is not a
    member of the group being signaled. Thus special code is needed to
    prevent a race with pid namespaces exiting, and fork adding new
    processes within them.

    Move this test up before the signal restart just in case signals are
    also pending. Fatal conditions should take precedence over restarts.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

22 Jul, 2018

3 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
    .. and re-initialize the anon_vma_chain head.

    This removes some boiler-plate from the users, and also makes it clear
    why it didn't need to use the 'zalloc()' version.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).
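
    A sketch of what the three wrappers amount to at this stage (purely the
    kmem_cache calls, as mapped above; the old vma argument is deliberately
    unused for now):

        struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
        {
                return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
        }

        struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
        {
                return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                kmem_cache_free(vm_area_cachep, vma);
        }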

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

2 commits

    Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes, so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
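
    The resulting enumeration, roughly:

        enum pid_type
        {
                PIDTYPE_PID,
                PIDTYPE_TGID,   /* now a first class pid type */
                PIDTYPE_PGID,
                PIDTYPE_SID,
                PIDTYPE_MAX,
        };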

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
    To access these fields the code always has to go to the group leader, so
    going to signal_struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

17 Jul, 2018

1 commit

  • The mm_struct always contains a cpumask bitmap, regardless of
    CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
    simplify things, and simply have one bitmask at the end of the
    mm_struct for the mm_cpumask.

    This does necessitate moving everything else in mm_struct into
    an anonymous sub-structure, which can be randomized when struct
    randomization is enabled.

    The second step is to determine the correct size for the
    mm_struct slab object from the size of the mm_struct
    (excluding the CPU bitmap) and the size of the cpumask.

    For init_mm we can simply allocate the maximum size this
    kernel is compiled for, since we only have one init_mm
    in the system, anyway.

    Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
    getting confused by the dynamically sized array.
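
    A heavily abbreviated sketch of the resulting layout (the cpu_bitmap field
    name is assumed from this change):

        struct mm_struct {
                struct {
                        /* ... all existing fields, now inside an anonymous
                         * struct so they can be randomized as a group ... */
                } __randomize_layout;

                /*
                 * The mm_cpumask must stay at the end: its real size is
                 * determined by nr_cpu_ids at mm_struct allocation time.
                 */
                unsigned long cpu_bitmap[];
        };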

    Tested-by: Song Liu
    Signed-off-by: Rik van Riel
    Signed-off-by: Mike Galbraith
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: luto@kernel.org
    Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

15 Jun, 2018

1 commit

  • As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas
    can loop while potentially allocating memory, with mm->mmap_sem held for
    write by the current thread. This is bad if the current thread was selected as
    an OOM victim, for the current thread will continue allocations using memory
    reserves while the OOM reaper is unable to reclaim memory.

    As an actually observable problem, it is not difficult to make the OOM
    reaper unable to reclaim memory if the OOM victim is blocked at
    i_mmap_lock_write() in this loop. Unfortunately, since nobody can
    explain whether it is safe to use a killable wait there, let's check for
    SIGKILL before trying to allocate memory. Even without an OOM event,
    there is no point in continuing the loop from the beginning if the current
    thread is killed.
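
    A sketch of the check added at the top of the per-vma loop in dup_mmap()
    (error path simplified):

        for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
                if (fatal_signal_pending(current)) {
                        retval = -EINTR;
                        goto out;   /* the child's exit_mmap() cleans up the partial copy */
                }
                /* ... existing per-vma duplication ... */
        }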

    I tested with debug printk(). This patch should be safe because we
    already fail if security_vm_enough_memory_mm() or
    kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.

    ***** Aborting dup_mmap() due to SIGKILL *****
    ***** Aborting dup_mmap() due to SIGKILL *****
    ***** Aborting dup_mmap() due to SIGKILL *****
    ***** Aborting dup_mmap() due to SIGKILL *****
    ***** Aborting exit_mmap() due to NULL mmap *****

    [akpm@linux-foundation.org: add comment]
    Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Cc: Alexander Viro
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

14 Jun, 2018

1 commit

  • The changes to automatically test for working stack protector compiler
    support in the Kconfig files removed the special STACKPROTECTOR_AUTO
    option that picked the strongest stack protector that the compiler
    supported.

    That was all a nice cleanup - it makes no sense to have the AUTO case
    now that the Kconfig phase can just determine the compiler support
    directly.

    HOWEVER.

    It also meant that doing "make oldconfig" would now _disable_ the strong
    stackprotector if you had AUTO enabled, because in a legacy config file,
    the sane stack protector configuration would look like

    CONFIG_HAVE_CC_STACKPROTECTOR=y
    # CONFIG_CC_STACKPROTECTOR_NONE is not set
    # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
    # CONFIG_CC_STACKPROTECTOR_STRONG is not set
    CONFIG_CC_STACKPROTECTOR_AUTO=y

    and when you ran this through "make oldconfig" with the Kbuild changes,
    it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
    been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
    CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
    used to be disabled (because it was really enabled by AUTO), and would
    disable it in the new config, resulting in:

    CONFIG_HAVE_CC_STACKPROTECTOR=y
    CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
    CONFIG_CC_STACKPROTECTOR=y
    # CONFIG_CC_STACKPROTECTOR_STRONG is not set
    CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

    That's dangerously subtle - people could suddenly find themselves with
    the weaker stack protector setup without even realizing.

    The solution here is to rename not just the old REGULAR stack
    protector option, but also the strong one. This does that by just
    removing the CC_ prefix entirely for the user choices, because it really
    is not about the compiler support (the compiler support now instead
    automatically impacts _visibility_ of the options to users).

    This results in "make oldconfig" actually asking the user for their
    choice, so that we don't have any silent subtle security model changes.
    The end result would generally look like this:

    CONFIG_HAVE_CC_STACKPROTECTOR=y
    CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
    CONFIG_STACKPROTECTOR=y
    CONFIG_STACKPROTECTOR_STRONG=y
    CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

    where the "CC_" versions really are about internal compiler
    infrastructure, not the user selections.

    Acked-by: Masahiro Yamada
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non disputed core implementation along with
    support for arm, powerpc and x86 and a full set of selftests

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

08 Jun, 2018

1 commit

    mmap_sem is on the hot path of the kernel, and it is very contended, but
    it is abused too. It is used to protect arg_start|end and env_start|end when
    reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
    sense since those proc files just expect to read 4 values atomically; they
    are not related to VM and could be set to arbitrary values by C/R.

    And, the mmap_sem contention may cause unexpected issue like below:

    INFO: task ps:14018 blocked for more than 120 seconds.
    Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    ps D 0 14018 1 0x00000004
    Call Trace:
    schedule+0x36/0x80
    rwsem_down_read_failed+0xf0/0x150
    call_rwsem_down_read_failed+0x18/0x30
    down_read+0x20/0x40
    proc_pid_cmdline_read+0xd9/0x4e0
    __vfs_read+0x37/0x150
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xc5

    Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
    for them to mitigate the abuse of mmap_sem.

    So, introduce a new spinlock in mm_struct to protect the concurrent
    access to arg_start|end, env_start|end and others, as well as replace
    the write mmap_sem with a read one to protect against the race condition
    between prctl and sys_brk which might break check_data_rlimit(), and
    make prctl more friendly to other VM operations.

    This patch just eliminates the abuse of mmap_sem, but it can't resolve
    the above hung task warning completely since the later
    access_remote_vm() call needs to acquire mmap_sem. The mmap_sem
    scalability issue will be solved in the future.
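
    A sketch of how a reader like proc_pid_cmdline_read() uses the new lock
    instead of mmap_sem (simplified; arg_lock is the spinlock mentioned in the
    note below):

        spin_lock(&mm->arg_lock);
        arg_start = mm->arg_start;
        arg_end   = mm->arg_end;
        env_start = mm->env_start;
        env_end   = mm->env_end;
        spin_unlock(&mm->arg_lock);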

    [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
    Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

07 Jun, 2018

1 commit

  • Pull audit updates from Paul Moore:
    "Another reasonable chunk of audit changes for v4.18, thirteen patches
    in total.

    The thirteen patches can mostly be broken down into one of four
    categories: general bug fixes, accessor functions for audit state
    stored in the task_struct, negative filter matches on executable
    names, and extending the (relatively) new seccomp logging knobs to the
    audit subsystem.

    The main driver for the accessor functions from Richard are the
    changes we're working on to associate audit events with containers,
    but I think they have some standalone value too so I figured it would
    be good to get them in now.

    The seccomp/audit patches from Tyler apply the seccomp logging
    improvements from a few releases ago to audit's seccomp logging;
    starting with this patchset the changes in
    /proc/sys/kernel/seccomp/actions_logged should apply to both the
    standard kernel logging and audit.

    As usual, everything passes the audit-testsuite and it happens to
    merge cleanly with your tree"

    [ Heh, except it had trivial merge conflicts with the SELinux tree that
    also came in from Paul - Linus ]

    * tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: Fix wrong task in comparison of session ID
    audit: use existing session info function
    audit: normalize loginuid read access
    audit: use new audit_context access funciton for seccomp_actions_logged
    audit: use inline function to set audit context
    audit: use inline function to get audit context
    audit: convert sessionid unset to a macro
    seccomp: Don't special case audited processes when logging
    seccomp: Audit attempts to modify the actions_logged sysctl
    seccomp: Configurable separator for the actions_logged string
    seccomp: Separate read and write code for actions_logged sysctl
    audit: allow not equal op for audit by executable
    audit: add syscall information to FEATURE_CHANGE records

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 344.0 31.4 11.0
    x86-64: 15.3 2.0 7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 2502.0 2250.0 1.1
    x86-64: 117.4 98.0 1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 751.0 128.5 5.8
    x86-64: 53.4 28.6 1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.
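
    A minimal user-space sketch of registering the per-thread area and reading
    cpu_id from it (the uapi header, the signature value and the exact flags
    are assumptions; this is not the selftest library):

        #include <linux/rseq.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static __thread struct rseq rseq_area __attribute__((aligned(32)));

        /* register this thread's area; the last argument is a caller-chosen
         * signature checked on abort handlers */
        syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, 0x53053053);

        /* afterwards the kernel keeps cpu_id up to date on every return
         * to user-space, so it can be read directly from memory */
        int cpu = __atomic_load_n(&rseq_area.cpu_id, __ATOMIC_RELAXED);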

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

15 May, 2018

1 commit


21 Apr, 2018

1 commit

  • One of the classes of kernel stack content leaks[1] is exposing the
    contents of prior heap or stack contents when a new process stack is
    allocated. Normally, those stacks are not zeroed, and the old contents
    remain in place. In the face of stack content exposure flaws, those
    contents can leak to userspace.

    Fixing this will make the kernel no longer vulnerable to these flaws, as
    the stack will be wiped each time a stack is assigned to a new process.
    There's not a meaningful change in runtime performance; it almost looks
    like it provides a benefit.

    Performing back-to-back kernel builds before:
    Run times: 157.86 157.09 158.90 160.94 160.80
    Mean: 159.12
    Std Dev: 1.54

    and after:
    Run times: 159.31 157.34 156.71 158.15 160.81
    Mean: 158.46
    Std Dev: 1.46

    Instead of making this a build or runtime config, Andy Lutomirski
    recommended this just be enabled by default.
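
    A sketch of what the change amounts to in the thread-stack allocation
    paths (simplified; the THREADINFO_GFP name is assumed from the existing
    allocation code):

        /* page/slab-backed stacks: ask for zeroed memory up front */
        #define THREADINFO_GFP  (GFP_KERNEL_ACCOUNT | __GFP_ZERO)

        /* cached VMAP stacks are reused, so wipe them explicitly */
        memset(stack, 0, THREAD_SIZE);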

    [1] A noisy search for many kinds of stack content leaks can be seen here:
    https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak

    I did some more with perf and cycle counts on running 100,000 execs of
    /bin/true.

    before:
    Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
    Mean: 221015379122.60
    Std Dev: 4662486552.47

    after:
    Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
    Mean: 217745009865.40
    Std Dev: 5935559279.99

    It continues to look like it's faster, though the deviation is rather
    wide, but I'm not sure what I could do that would be less noisy. I'm
    open to ideas!

    Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
    Signed-off-by: Kees Cook
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Laura Abbott
    Cc: Rasmus Villemoes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

06 Apr, 2018

1 commit

  • KASAN splats indicate that in some cases we free a live mm, then
    continue to access it, with potentially disastrous results. This is
    likely due to a mismatched mmdrop() somewhere in the kernel, but so far
    the culprit remains elusive.

    Let's have __mmdrop() verify that the mm isn't live for the current
    task, similar to the existing check for init_mm. This way, we can catch
    this class of issue earlier, and without requiring KASAN.
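
    The new check is essentially a pair of assertions at the top of
    __mmdrop() (a sketch; the WARN_ON_ONCE form is an assumption):

        static void __mmdrop(struct mm_struct *mm)
        {
                BUG_ON(mm == &init_mm);            /* pre-existing check */
                WARN_ON_ONCE(mm == current->mm);
                WARN_ON_ONCE(mm == current->active_mm);
                /* ... actual teardown ... */
        }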

    Currently, idle_task_exit() leaves active_mm stale after it switches to
    init_mm. This isn't harmful, but will trigger the new assertions, so we
    must adjust idle_task_exit() to update active_mm.

    Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
    Signed-off-by: Mark Rutland
    Reviewed-by: Andrew Morton
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

03 Apr, 2018

2 commits

  • Using this helper allows us to avoid the in-kernel calls to the
    sys_unshare() syscall. The ksys_ prefix denotes that this function is meant
    as a drop-in replacement for the syscall. In particular, it uses the same
    calling convention as sys_unshare().
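
    A sketch of the resulting split, with the syscall reduced to a thin
    wrapper:

        int ksys_unshare(unsigned long unshare_flags)
        {
                /* ... previous sys_unshare() body lives here ... */
        }

        SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
        {
                return ksys_unshare(unshare_flags);
        }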

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • sys_futex() is a wrapper to do_futex() which does not modify any
    values here:

    - uaddr, val and val3 are kept the same

    - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE.
    Therefore, val2 is always 0.

    - as utime is set to NULL, *timeout is NULL
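
    With those constants fixed, the in-kernel call in mm_release() becomes
    roughly:

        /* was: sys_futex(tsk->clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0); */
        do_futex(tsk->clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0, 0);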

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Andrew Morton
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

22 Feb, 2018

1 commit

    As Peter points out, doing a CALL+RET for just the decrement is a bit silly.
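
    The follow-up keeps the fast path inline and only calls out of line when
    the count actually drops to zero (a sketch; the mm_count field name is
    assumed):

        static inline void mmdrop(struct mm_struct *mm)
        {
                /* the common case is a simple decrement; only the final
                 * reference pays for the out-of-line __mmdrop() */
                if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                        __mmdrop(mm);
        }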

    Fixes: d70f2a14b72a4bc ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

07 Feb, 2018

4 commits

  • Merge misc updates from Andrew Morton:

    - kasan updates

    - procfs

    - lib/bitmap updates

    - other lib/ updates

    - checkpatch tweaks

    - rapidio

    - ubsan

    - pipe fixes and cleanups

    - lots of other misc bits

    * emailed patches from Andrew Morton : (114 commits)
    Documentation/sysctl/user.txt: fix typo
    MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
    MAINTAINERS: update various PALM patterns
    MAINTAINERS: update "ARM/OXNAS platform support" patterns
    MAINTAINERS: update Cortina/Gemini patterns
    MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
    MAINTAINERS: remove ANDROID ION pattern
    mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
    mm: docs: fix parameter names mismatch
    mm: docs: fixup punctuation
    pipe: read buffer limits atomically
    pipe: simplify round_pipe_size()
    pipe: reject F_SETPIPE_SZ with size over UINT_MAX
    pipe: fix off-by-one error when checking buffer limits
    pipe: actually allow root to exceed the pipe buffer limits
    pipe, sysctl: remove pipe_proc_fn()
    pipe, sysctl: drop 'min' parameter from pipe-max-size converter
    kasan: rework Kconfig settings
    crash_dump: is_kdump_kernel can be boolean
    kernel/mutex: mutex_is_locked can be boolean
    ...

    Linus Torvalds
     
    All other places that deal with namespaces have an explanation of why
    the restriction is there.

    The description added in this commit was based on commit e66eded8309e
    ("userns: Don't allow CLONE_NEWUSER | CLONE_FS").

    Link: http://lkml.kernel.org/r/20171112151637.13258-1-marcos.souza.org@gmail.com
    Signed-off-by: Marcos Paulo de Souza
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcos Paulo de Souza
     
  • Thus reducing one indentation level while maintaining the same rationale.

    Link: http://lkml.kernel.org/r/20171117002929.5155-1-marcos.souza.org@gmail.com
    Signed-off-by: Marcos Paulo de Souza
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcos Paulo de Souza
     
  • Conflicts:
    arch/arm64/kernel/entry.S
    arch/x86/Kconfig
    include/linux/sched/mm.h
    kernel/fork.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Feb, 2018

1 commit

  • Pull hardened usercopy whitelisting from Kees Cook:
    "Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs.

    To further restrict what memory is available for copying, this creates
    a way to whitelist specific areas of a given slab cache object for
    copying to/from userspace, allowing much finer granularity of access
    control.

    Slab caches that are never exposed to userspace can declare no
    whitelist for their objects, thereby keeping them unavailable to
    userspace via dynamic copy operations. (Note, an implicit form of
    whitelisting is the use of constant sizes in usercopy operations and
    get_user()/put_user(); these bypass all hardened usercopy checks since
    these sizes cannot change at runtime.)

    This new check is WARN-by-default, so any mistakes can be found over
    the next several releases without breaking anyone's system.

    The series has roughly the following sections:
    - remove %p and improve reporting with offset
    - prepare infrastructure and whitelist kmalloc
    - update VFS subsystem with whitelists
    - update SCSI subsystem with whitelists
    - update network subsystem with whitelists
    - update process memory with whitelists
    - update per-architecture thread_struct with whitelists
    - update KVM with whitelists and fix ioctl bug
    - mark all other allocations as not whitelisted
    - update lkdtm for more sensible test overage"

    * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
    lkdtm: Update usercopy tests for whitelisting
    usercopy: Restrict non-usercopy caches to size 0
    kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
    kvm: whitelist struct kvm_vcpu_arch
    arm: Implement thread_struct whitelist for hardened usercopy
    arm64: Implement thread_struct whitelist for hardened usercopy
    x86: Implement thread_struct whitelist for hardened usercopy
    fork: Provide usercopy whitelisting for task_struct
    fork: Define usercopy region in thread_stack slab caches
    fork: Define usercopy region in mm_struct slab caches
    net: Restrict unwhitelisted proto caches to size 0
    sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
    sctp: Define usercopy region in SCTP proto slab cache
    caif: Define usercopy region in caif proto slab cache
    ip: Define usercopy region in IP proto slab cache
    net: Define usercopy region in struct proto slab cache
    scsi: Define usercopy region in scsi_sense_cache slab cache
    cifs: Define usercopy region in cifs_request slab cache
    vxfs: Define usercopy region in vxfs_inode slab cache
    ufs: Define usercopy region in ufs_inode_cache slab cache
    ...

    Linus Torvalds