15 Jan, 2021

1 commit

  • With CFI, a callback function passed to __kthread_queue_delayed_work
    from a module can point to a jump table entry defined in the module
    instead of the one used in the core kernel, which breaks this test:

    WARN_ON_ONCE(timer->function != kthread_delayed_work_timer_fn);

    To work around the problem, disable the warning when CFI and modules
    are both enabled.
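
    As a rough illustration of the idea (a userspace sketch, not the exact
    kernel patch; the config macros are real kernel symbols but the helper
    function is invented for the demo):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: when CFI and modules are both enabled, a callback handed in
 * from a module may resolve to the module's own jump-table entry, so
 * pointer inequality is not necessarily a bug and the warning must be
 * suppressed rather than fire spuriously. */
#define CONFIG_CFI_CLANG 1
#define CONFIG_MODULES 1

typedef void (*timer_fn_t)(void);

static void kthread_delayed_work_timer_fn(void) {}
static void module_jump_table_entry(void) {}  /* stands in for the CFI alias */

static bool timer_fn_check_warns(timer_fn_t fn)
{
#if defined(CONFIG_CFI_CLANG) && defined(CONFIG_MODULES)
    (void)fn;
    return false;  /* inequality may be a benign CFI alias: stay quiet */
#else
    return fn != kthread_delayed_work_timer_fn;
#endif
}
```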

    Bug: 145210207
    Change-Id: I5b0a60bb69ce8e2bc0d8e4bf6736457b6425b6cf
    Signed-off-by: Sami Tolvanen

    Sami Tolvanen
     

04 Dec, 2020

1 commit


03 Nov, 2020

1 commit

  • There is a small race window when a delayed work is being canceled and
    the work still might be queued from the timer_fn:

    CPU0                                    CPU1

    kthread_cancel_delayed_work_sync()
      __kthread_cancel_work_sync()
        __kthread_cancel_work()
          work->canceling++;
                                            kthread_delayed_work_timer_fn()
                                              kthread_insert_work();

    BUG: kthread_insert_work() should not get called when work->canceling is
    set.
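
    A compressed model of the rule the BUG line states, assuming the check
    happens under the worker lock (names mirror the kernel's, but this is a
    userspace sketch):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the fix: the timer callback must bail out instead
 * of calling kthread_insert_work() while a cancel is in flight, i.e.
 * while work->canceling is non-zero. */
struct kthread_work {
    int canceling;   /* incremented by __kthread_cancel_work() */
};

/* returns true only if the timer callback may queue the work */
static bool timer_fn_may_queue(const struct kthread_work *work)
{
    return work->canceling == 0;
}
```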

    Signed-off-by: Zqiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
    Signed-off-by: Linus Torvalds

    Zqiang
     

17 Oct, 2020

1 commit

  • Fix multiple occurrences of duplicated words in kernel/.

    Fix one typo/spello on the same line as a duplicate word. Change one
    instance of "the the" to "that the". Otherwise just drop one of the
    repeated words.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/98202fa6-8919-ef63-9efe-c0fad5ca7af1@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

13 Aug, 2020

1 commit

  • Add helpers to wrap the get_fs/set_fs magic for undoing any damage
    done by set_fs(KERNEL_DS). There is no real functional benefit, but
    this documents the intent of these calls better, and will allow
    stubbing the functions out easily for kernel builds that do not allow
    address space overrides in the future.

    [hch@lst.de: drop two incorrect hunks, fix a commit log typo]
    Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Acked-by: Mark Rutland
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Cc: Nick Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

08 Aug, 2020

2 commits

  • Originally kthread_create_on_cpu() parked and woke up the new thread.
    However, since commit a65d40961dc7 ("kthread/smpboot: do not park in
    kthread_create_on_cpu()") this is no longer the case. This patch removes
    the comment that has been left behind and is now incorrect / stale.

    Fixes: a65d40961dc7 ("kthread/smpboot: do not park in kthread_create_on_cpu()")
    Signed-off-by: Ilias Stamatis
    Signed-off-by: Andrew Morton
    Reviewed-by: Petr Mladek
    Link: http://lkml.kernel.org/r/20200611135920.240551-1-stamatis.iliass@gmail.com
    Signed-off-by: Linus Torvalds

    Ilias Stamatis
     
  • For SMP systems using IPI based TLB invalidation, looking at
    current->active_mm is entirely reasonable. This then presents the
    following race condition:

    CPU0                          CPU1

    flush_tlb_mm(mm)              use_mm(mm)
                                    tsk->active_mm = mm;
                                    <IPI arrives>
                                      if (tsk->active_mm == mm)
                                        // flush TLBs
                                    switch_mm(old_mm, mm, tsk);

    Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
    because the IPI lands before we actually switched.

    Avoid this by disabling IRQs across changing ->active_mm and
    switch_mm().

    Of the (SMP) architectures that have IPI based TLB invalidate:

    Alpha - checks active_mm
    ARC - ASID specific
    IA64 - checks active_mm
    MIPS - ASID specific flush
    OpenRISC - shoots down world
    PARISC - shoots down world
    SH - ASID specific
    SPARC - ASID specific
    x86 - N/A
    xtensa - checks active_mm

    So at the very least Alpha, IA64 and Xtensa are suspect.

    On top of this, for scheduler consistency we need at least preemption
    disabled across changing tsk->mm and doing switch_mm(), which is
    currently provided by task_lock(), but that's not sufficient for
    PREEMPT_RT.
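
    The ordering being enforced can be sketched like this (a userspace
    model with stubbed primitives; the real kthread_use_mm() operates on
    `current` and the signature here is simplified):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the fix's ordering: the ->active_mm update and switch_mm()
 * form one unit with IRQs off, so a TLB-flush IPI can only observe the
 * task fully before or fully after the switch, never in between. */
struct mm { int id; };
struct task { struct mm *mm, *active_mm; };

static bool irqs_enabled = true;
static void local_irq_disable(void) { irqs_enabled = false; }
static void local_irq_enable(void)  { irqs_enabled = true; }

static void switch_mm(struct mm *prev, struct mm *next, struct task *tsk)
{
    (void)prev; (void)next; (void)tsk;
    assert(!irqs_enabled);   /* must run inside the IRQ-off section */
}

static void kthread_use_mm(struct task *tsk, struct mm *mm)
{
    struct mm *active_mm = tsk->active_mm;

    local_irq_disable();
    tsk->active_mm = mm;
    tsk->mm = mm;
    switch_mm(active_mm, mm, tsk);
    local_irq_enable();
}
```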

    [akpm@linux-foundation.org: add comment]

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Cc: Nicholas Piggin
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Jann Horn
    Cc: Will Deacon
    Cc: Christoph Hellwig
    Cc: Mathieu Desnoyers
    Cc:
    Link: http://lkml.kernel.org/r/20200721154106.GE10769@hirez.programming.kicks-ass.net
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

08 Jul, 2020

1 commit


18 Jun, 2020

1 commit


15 Jun, 2020

2 commits

  • This is a kernel enhancement that configures the CPU affinity of
    kernel threads via the kernel boot option nohz_full=.

    When this option is specified, the cpumask is immediately applied upon
    kthread launch. It does not affect kernel threads that specify a
    specific CPU and node.

    This allows CPU isolation (that is, not allowing certain threads to
    execute on certain CPUs) without using the isolcpus=domain parameter,
    making it possible to enable load balancing on such CPUs during
    runtime (see kernel-parameters.txt).

    Note-1: this is based on Wind River's patch at
    https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

    Difference being that this patch is limited to modifying kernel thread
    cpumask. Behaviour of other threads can be controlled via cgroups or
    sched_setaffinity.

    Note-2: Wind River's patch was based off Christoph Lameter's patch at
    https://lwn.net/Articles/565932/ with the only difference being
    the kernel parameter changed from kthread to kthread_cpus.
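
    The policy can be modelled in a few lines (a toy sketch: plain integer
    bitmasks stand in for struct cpumask, and the helper name is invented):

```c
#include <assert.h>

/* Toy model of the policy: a freshly launched kthread inherits the
 * housekeeping cpumask derived from nohz_full= unless it was created
 * for a specific CPU or NUMA node. */
typedef unsigned long cpumask_t;

static cpumask_t kthread_default_mask(cpumask_t requested,
                                      cpumask_t housekeeping,
                                      cpumask_t possible)
{
    if (requested)               /* bound to an explicit cpu/node: keep it */
        return requested;
    /* otherwise fall back to housekeeping CPUs, or all possible CPUs */
    return housekeeping ? housekeeping : possible;
}
```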

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org

    Marcelo Tosatti
     
  • The next patch will switch the unbound kernel threads mask to
    housekeeping_cpumask(), a subset of cpu_possible_mask. So in order to
    ease bisection, let's first switch kthreads' default affinity from
    cpu_all_mask to cpu_possible_mask.

    It looks safe to do so, as cpu_possible_mask seems to be initialized
    at setup_arch() time, well before kthreadd is created.

    Suggested-by: Frederic Weisbecker
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200527142909.23372-2-frederic@kernel.org

    Marcelo Tosatti
     

12 Jun, 2020

1 commit

  • Merge some more updates from Andrew Morton:

    - various hotfixes and minor things

    - hch's use_mm/unuse_mm cleanups

    Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
    lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.

    * emailed patches from Andrew Morton :
    kernel: set USER_DS in kthread_use_mm
    kernel: better document the use_mm/unuse_mm API contract
    kernel: move use_mm/unuse_mm to kthread.c
    stacktrace: cleanup inconsistent variable type
    lib: test get_count_order/long in test_bitops.c
    mm: add comments on pglist_data zones
    ocfs2: fix spelling mistake and grammar
    mm/debug_vm_pgtable: fix kernel crash by checking for THP support
    lib: fix bitmap_parse() on 64-bit big endian archs
    checkpatch: correct check for kernel parameters doc
    nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
    lib/lz4/lz4_decompress.c: document deliberate use of `&'
    kcov: check kcov_softirq in kcov_remote_stop()
    scripts/spelling: add a few more typos
    khugepaged: selftests: fix timeout condition in wait_for_scan()

    Linus Torvalds
     

11 Jun, 2020

3 commits

  • Some architectures like arm64 and s390 require USER_DS to be set for
    kernel threads to access user address space, which is the whole purpose
    of kthread_use_mm, but others like x86 don't. That has led to a huge
    mess where some callers are fixed up once they are tested on said
    architectures, while others linger around and yet others like io_uring
    try to do "clever" optimizations for what usually is just a trivial
    assignment to a member in the thread_struct for most architectures.

    Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
    previous value instead.
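
    The save/restore shape of the change might look roughly like this (a
    userspace sketch; the kernel stores the saved limit in the task and the
    constants here are illustrative):

```c
#include <assert.h>

/* Sketch of the addressing-limit handling this patch centralizes:
 * kthread_use_mm() forces USER_DS and kthread_unuse_mm() restores
 * whatever limit was active before, instead of assuming KERNEL_DS. */
enum fs { USER_DS, KERNEL_DS };

static enum fs current_fs = KERNEL_DS;   /* kthreads start in KERNEL_DS */
static enum fs saved_fs;

static enum fs get_fs(void) { return current_fs; }
static void set_fs(enum fs fs) { current_fs = fs; }

static void kthread_use_mm(void)
{
    saved_fs = get_fs();   /* remember the previous limit */
    set_fs(USER_DS);       /* user-space accesses now behave normally */
}

static void kthread_unuse_mm(void)
{
    set_fs(saved_fs);      /* restore, rather than assuming KERNEL_DS */
}
```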

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Michael S. Tsirkin
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Felix Kuehling
    Cc: Jason Wang
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Switch the function documentation to kerneldoc comments, and add
    WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
    not have ->mm set (or has ->mm set in the case of unuse_mm).

    Also give the functions a kthread_ prefix to better document the use case.
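
    The preconditions the new WARN_ON_ONCE asserts encode can be modelled
    like this (PF_KTHREAD matches the kernel's value; the task struct is a
    simplified stand-in):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the sanity checks: kthread_use_mm() is only legal for a
 * kernel thread (PF_KTHREAD) with no mm set yet; kthread_unuse_mm()
 * requires an mm to already be set. */
#define PF_KTHREAD 0x00200000

struct task { unsigned int flags; void *mm; };

static bool use_mm_precondition(const struct task *tsk)
{
    return (tsk->flags & PF_KTHREAD) && tsk->mm == NULL;
}

static bool unuse_mm_precondition(const struct task *tsk)
{
    return (tsk->flags & PF_KTHREAD) && tsk->mm != NULL;
}
```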

    [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
    Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
    [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
    Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Acked-by: Greg Kroah-Hartman [usb]
    Acked-by: Haren Myneni
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "improve use_mm / unuse_mm", v2.

    This series improves the use_mm / unuse_mm interface by better
    documenting the assumptions, and by moving the set_fs manipulations
    spread over the callers into the core API.

    This patch (of 3):

    Use the proper API instead.

    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

    These helpers are only for use with kernel threads, and I will tie them
    more into the kthread infrastructure going forward. Also move the
    prototypes to kthread.h - mmu_context.h was a little weird to start with
    as it otherwise contains very low-level MM bits.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 May, 2020

1 commit

  • It's handy to keep the kthread_fn just as a unique cookie to identify
    classes of kthreads. E.g. if you can verify that a given task is
    running your thread_fn, then you know what type of data kthread_data()
    points to.

    We'll use this in nfsd to pass some information into the vfs. Note it
    will need kthread_data() exported too.
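
    The cookie idea in miniature (the lookup structure and the nfsd helper
    are invented for the demo; only the pattern of comparing the stored
    threadfn before trusting kthread_data() comes from the commit):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: a saved threadfn pointer identifies the class of a kthread,
 * so the per-thread data is only interpreted when the function matches. */
typedef int (*threadfn_t)(void *);

struct kthread { threadfn_t threadfn; void *data; };

static int nfsd_thread_fn(void *data)  { (void)data; return 0; }
static int other_thread_fn(void *data) { (void)data; return 0; }

/* return the per-thread data only if this really is one of ours */
static void *nfsd_data_of(const struct kthread *k)
{
    return k->threadfn == nfsd_thread_fn ? k->data : NULL;
}
```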

    Original-patch-by: Tejun Heo
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

20 Mar, 2020

1 commit

  • When we create a kthread with kthread_create_on_cpu(), the child
    thread entry is kthread.c:kthread(), which can be preempted by the
    parent right after it calls complete(done) but before schedule() is
    called. The parent then calls wait_task_inactive(child), but the child
    is still on the runqueue, so the parent ends up in schedule_hrtimeout()
    for 1 jiffy, which wastes a lot of time, especially on startup.

    parent                              child
    kthread_create_on_cpu()
      wait_for_completion(&done) ----> kthread.c:kthread()
                                   |---- complete(done); -- wakeup and
                                   |     preempted by parent
      kthread_bind()                |  schedule(); -- dequeue here
      wait_task_inactive(child)     |
      schedule_hrtimeout(1 jiffy) --|

    So we want the child to wake the parent without being preempted by it;
    since the child is about to call schedule() anyway, the parent then no
    longer needs schedule_hrtimeout(1 jiffy), as the child has already
    been dequeued.

    The same issue affects kthread_park() and kthread_parkme().
    This patch saves 120ms on rk312x startup with CONFIG_HZ=300.
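
    One way to picture the desired ordering is a preemption-disabled
    window that lasts until the child actually blocks (a single-threaded
    sketch with stub primitives; the exact mechanism in the patch may
    differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: the child signals the parent only while preemption is
 * disabled, and the window extends until the child dequeues itself,
 * so by the time the parent runs, wait_task_inactive() succeeds
 * immediately instead of retrying for a jiffy. */
static int preempt_count;
static bool parent_woken;
static bool child_on_rq = true;

static void preempt_disable(void) { preempt_count++; }
static void complete(void)        { parent_woken = true; }

static void schedule_preempt_disabled(void)
{
    child_on_rq = false;   /* dequeue first... */
    preempt_count--;       /* ...then allow preemption again */
}

static void kthread_child_body(void)
{
    preempt_disable();
    complete();                   /* parent cannot preempt us here */
    schedule_preempt_disabled();  /* off the runqueue before parent runs */
}
```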

    Signed-off-by: Liang Chen
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Link: https://lkml.kernel.org/r/20200306070133.18335-2-cl@rock-chips.com

    Liang Chen
     

17 Oct, 2019

1 commit

  • The __kthread_queue_delayed_work() function is not exported, so make
    it static to avoid the following sparse warning:

    kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static?

    Signed-off-by: Ben Dooks
    Signed-off-by: Linus Torvalds

    Ben Dooks
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • kthread.h can't be included in psi_types.h because that creates a
    circular inclusion, with kthread.h eventually including psi_types.h
    and complaining about kthread structures not being defined because
    they are defined later in kthread.h. Resolve this by removing the
    psi_types.h inclusion from the headers included from kthread.h.

    Link: http://lkml.kernel.org/r/20190319235619.260832-7-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Acked-by: Johannes Weiner
    Cc: Dennis Zhou
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     

07 Mar, 2019

2 commits

  • Merge misc updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (159 commits)
    tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
    proc: more robust bulk read test
    proc: test /proc/*/maps, smaps, smaps_rollup, statm
    proc: use seq_puts() everywhere
    proc: read kernel cpu stat pointer once
    proc: remove unused argument in proc_pid_lookup()
    fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
    fs/proc/self.c: code cleanup for proc_setup_self()
    proc: return exit code 4 for skipped tests
    mm,mremap: bail out earlier in mremap_to under map pressure
    mm/sparse: fix a bad comparison
    mm/memory.c: do_fault: avoid usage of stale vm_area_struct
    writeback: fix inode cgroup switching comment
    mm/huge_memory.c: fix "orig_pud" set but not used
    mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
    mm/memcontrol.c: fix bad line in comment
    mm/cma.c: cma_declare_contiguous: correct err handling
    mm/page_ext.c: fix an imbalance with kmemleak
    mm/compaction: pass pgdat to too_many_isolated() instead of zone
    mm: remove zone_lru_lock() function, access ->lru_lock directly
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where invalid node number is
    encoded as -1. Even though implicitly understood it is always better to
    have macros in there. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.
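
    The replacement in a nutshell (NUMA_NO_NODE's value matches the
    kernel's; the helper is just for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* The same sentinel, spelled with the shared macro instead of an
 * open-coded -1. */
#define NUMA_NO_NODE (-1)

static bool node_is_valid(int nid)
{
    return nid != NUMA_NO_NODE;   /* was: nid != -1 */
}
```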

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

28 Feb, 2019

2 commits

  • The TIMER_IRQSAFE usage was introduced in commit 22597dc3d97b1 ("kthread:
    initial support for delayed kthread work") which modelled the delayed
    kthread code after workqueue's code. The workqueue code requires the flag
    TIMER_IRQSAFE for synchronisation purpose. This is not true for kthread's
    delay timer since all operations occur under a lock.

    Remove TIMER_IRQSAFE from the timer initialisation and use
    timer_setup(), the official initialisation function.
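
    The shape of the change, with stubbed types (TIMER_IRQSAFE's value is
    the kernel's, shown for illustration; the init helper is invented):

```c
#include <assert.h>

/* Sketch: initialize the worker's delay timer with timer_setup() and
 * no TIMER_IRQSAFE flag, since the kthread worker's own lock already
 * serializes all operations on the timer. */
#define TIMER_IRQSAFE 0x00200000u   /* kernel value, for illustration */

struct timer_list {
    void (*function)(struct timer_list *);
    unsigned int flags;
};

static void kthread_delayed_work_timer_fn(struct timer_list *t) { (void)t; }

static void timer_setup(struct timer_list *t,
                        void (*fn)(struct timer_list *), unsigned int flags)
{
    t->function = fn;
    t->flags = flags;
}

static void init_delayed_work_timer(struct timer_list *t)
{
    timer_setup(t, kthread_delayed_work_timer_fn, 0 /* no TIMER_IRQSAFE */);
}
```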

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Petr Mladek
    Link: https://lkml.kernel.org/r/20190212162554.19779-2-bigeasy@linutronix.de

    Sebastian Andrzej Siewior
     
  • In order to enable the queuing of kthread work items from hardirq context
    even when PREEMPT_RT_FULL is enabled, convert the worker spin_lock to a
    raw_spin_lock.

    This is only acceptable to do because the work performed under the lock is
    well-bounded and minimal.

    Reported-by: Steffen Trumtrar
    Reported-by: Tim Sander
    Signed-off-by: Julia Cartwright
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Tested-by: Steffen Trumtrar
    Reviewed-by: Petr Mladek
    Cc: Guenter Roeck
    Link: https://lkml.kernel.org/r/20190212162554.19779-1-bigeasy@linutronix.de

    Julia Cartwright
     

11 Feb, 2019

1 commit

  • kthread_should_park() is used to check if the calling kthread ('current')
    should park, but there is no function to check whether an arbitrary kthread
    should be parked. The latter is required to plug a CPU hotplug race vs. a
    parking ksoftirqd thread.

    The new __kthread_should_park() receives a task_struct as parameter to
    check if the corresponding kernel thread should be parked.

    Call __kthread_should_park() from kthread_should_park() to avoid code
    duplication.
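
    The refactor in sketch form (structures and the flag encoding are
    simplified stand-ins; the kernel uses test_bit() on a bit number):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: the bit test moves into a helper that takes an explicit
 * task, and the old current-based API becomes a thin wrapper. */
#define KTHREAD_SHOULD_PARK (1UL << 1)

struct kthread { unsigned long flags; };
struct task_struct { struct kthread *kthread; };

static struct task_struct *current_task;   /* stands in for `current` */

static bool __kthread_should_park(struct task_struct *k)
{
    return k->kthread->flags & KTHREAD_SHOULD_PARK;
}

static bool kthread_should_park(void)
{
    return __kthread_should_park(current_task);  /* no duplication */
}
```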

    Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E . McKenney"
    Cc: Sebastian Andrzej Siewior
    Cc: Douglas Anderson
    Cc: Stephen Boyd
    Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org

    Matthias Kaehlcke
     

14 Aug, 2018

1 commit

  • Pull scheduler updates from Thomas Gleixner:

    - Cleanup and improvement of NUMA balancing

    - Refactoring and improvements to the PELT (Per Entity Load Tracking)
    code

    - Watchdog simplification and related cleanups

    - The usual pile of small incremental fixes and improvements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    watchdog: Reduce message verbosity
    stop_machine: Reflow cpu_stop_queue_two_works()
    sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
    sched/numa: Use group_weights to identify if migration degrades locality
    sched/numa: Update the scan period without holding the numa_group lock
    sched/numa: Remove numa_has_capacity()
    sched/numa: Modify migrate_swap() to accept additional parameters
    sched/numa: Remove unused task_capacity from 'struct numa_stats'
    sched/numa: Skip nodes that are at 'hoplimit'
    sched/debug: Reverse the order of printing faults
    sched/numa: Use task faults only if numa_group is not yet set up
    sched/numa: Set preferred_node based on best_cpu
    sched/numa: Simplify load_too_imbalanced()
    sched/numa: Evaluate move once per node
    sched/numa: Remove redundant field
    sched/debug: Show the sum wait time of a task group
    sched/fair: Remove #ifdefs from scale_rt_capacity()
    sched/core: Remove get_cpu() from sched_fork()
    sched/cpufreq: Clarify sugov_get_util()
    sched/sysctl: Remove unused sched_time_avg_ms sysctl
    ...

    Linus Torvalds
     

26 Jul, 2018

1 commit

  • There is a window for racing when printing directly to task->comm,
    allowing other threads to see a non-terminated string. The vsnprintf
    function fills the buffer, counts the truncated chars, then finally
    writes the \0 at the end.

    creator                              other
    vsnprintf:
      fill (not terminated)
      count the rest                     trace_sched_waking(p):
      ...                                  memcpy(comm, p->comm, TASK_COMM_LEN)
      write \0

    The consequences depend on how 'other' uses the string. In our case,
    it was copied into the tracing system's saved cmdlines, a buffer of
    adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):

    crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
    0xffffffd5b3818640: "irq/497-pwr_evenkworker/u16:12"

    ...and a strcpy out of there would cause stack corruption:

    [224761.522292] Kernel panic - not syncing: stack-protector:
    Kernel stack is corrupted in: ffffff9bf9783c78

    crash-arm64> kbt | grep 'comm\|trace_print_context'
    #6 0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
    comm (char [16]) = "irq/497-pwr_even"

    crash-arm64> rd 0xffffffd4d0e17d14 8
    ffffffd4d0e17d14: 2f71726900000000 5f7277702d373934 ....irq/497-pwr_
    ffffffd4d0e17d24: 726f776b6e657665 3a3631752f72656b evenkworker/u16:
    ffffffd4d0e17d34: f9780248ff003231 cede60e0ffffff9b 12..H.x......`..
    ffffffd4d0e17d44: cede60c8ffffffd4 00000fffffffffd4 .....`..........

    The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was
    likely needed because of this same bug.

    Solved by vsnprintf:ing to a local buffer, then using set_task_comm().
    This way, there won't be a window where comm is not terminated.
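
    A userspace sketch of the fix (TASK_COMM_LEN matches the kernel;
    the globals and helper names are simplified stand-ins, and the real
    set_task_comm() also takes the task lock):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Sketch: format into a stack buffer first, then publish with a
 * set_task_comm()-style copy, so readers never observe comm without
 * its terminating NUL. */
#define TASK_COMM_LEN 16

static char task_comm[TASK_COMM_LEN] = "";

static void set_task_comm(const char *buf)
{
    /* single publish point, always leaving a NUL terminator */
    strncpy(task_comm, buf, TASK_COMM_LEN - 1);
    task_comm[TASK_COMM_LEN - 1] = '\0';
}

static void kthread_set_comm(const char *fmt, ...)
{
    char name[TASK_COMM_LEN];
    va_list args;

    va_start(args, fmt);
    vsnprintf(name, sizeof(name), fmt, args);  /* truncate locally */
    va_end(args);
    set_task_comm(name);                       /* publish terminated string */
}
```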

    Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com

    Cc: stable@vger.kernel.org
    Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure")
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Snild Dolkow
    Signed-off-by: Steven Rostedt (VMware)

    Snild Dolkow
     

03 Jul, 2018

2 commits

  • Oleg explains the reason we could hit park+park is that
    smpboot_update_cpumask_percpu_thread()'s

    for_each_cpu_and(cpu, &tmp, cpu_online_mask)
    smpboot_park_kthread();

    turns into:

    for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)and)
    smpboot_park_kthread();

    on UP, ignoring the mask. But since we just completely removed that
    function, this is no longer relevant.

    So revert commit:

    b1f5b378e126 ("kthread: Allow kthread_park() on a parked kthread")

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Gaurav reports that commit:

    85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")

    isn't working for him. Because of the following race:

    > controller Thread                     CPUHP Thread
    > takedown_cpu
    > kthread_park
    > kthread_parkme
    > Set KTHREAD_SHOULD_PARK
    >                                       smpboot_thread_fn
    >                                       set Task interruptible
    >
    > wake_up_process
    > if (!(p->state & state))
    >         goto out;
    >
    >                                       Kthread_parkme
    >                                       SET TASK_PARKED
    >                                       schedule
    >                                       raw_spin_lock(&rq->lock)
    > ttwu_remote
    > waiting for __task_rq_lock
    >                                       context_switch
    >
    >                                       finish_lock_switch
    >
    >                                       Case TASK_PARKED
    >                                       kthread_park_complete
    >
    > SET Running

    Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
    handling is buggered: unlike the TASK_DEAD case, which is done with
    preemption disabled, the current code can still complete early on
    preemption :/

    So basically revert that earlier fix and go with a variant of the
    alternative mentioned in the commit. Promote TASK_PARKED to special
    state to avoid the store-store issue on task->state leading to the
    WARN in kthread_unpark() -> __kthread_bind().

    But in addition, add wait_task_inactive() to kthread_park() to ensure
    the task really is PARKED when we return from kthread_park(). This
    avoids the whole kthread still gets migrated nonsense -- although it
    would be really good to get this done differently.

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 May, 2018

1 commit

  • The following commit:

    85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")

    added a WARN() in the case where we call kthread_park() on an already
    parked thread, because the old code wasn't doing the right thing there
    and it wasn't at all clear that would happen.

    It turns out, this does in fact happen, so we have to deal with it.

    Instead of potentially returning early, also wait for the completion.
    This does however mean we have to use complete_all() and re-initialize
    the completion on re-use.
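
    A toy model of why both halves of the change are needed (this is not
    the kernel's completion implementation; completion_done() here just
    reports whether a waiter would return immediately):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy completion: complete_all() leaves the event set, so a second
 * kthread_park() on an already-parked thread also sees it, and
 * kthread_unpark() must re-arm the completion before re-use. */
struct completion { bool done; };

static void complete_all(struct completion *c)      { c->done = true; }
static void reinit_completion(struct completion *c) { c->done = false; }

/* would kthread_park()'s wait return immediately? */
static bool completion_done(const struct completion *c) { return c->done; }
```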

    Reported-by: LKP
    Tested-by: Meelis Roos
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: kernel test robot
    Cc: wfg@linux.intel.com
    Cc: Thomas Gleixner
    Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
    Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 May, 2018

2 commits

  • Even with the wait-loop fixed, there is a further issue with
    kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
    smpboot_park_threads() can return before all those threads are in fact
    blocked, due to the placement of the complete() in __kthread_parkme().

    When that happens, sched_cpu_dying() -> migrate_tasks() can end up
    migrating such a still runnable task onto another CPU.

    Normally the task will have hit schedule() and gone to sleep by the
    time we do kthread_unpark(), which will then do __kthread_bind() to
    re-bind the task to the correct CPU.

    However, when we lose the initial TASK_PARKED store to the concurrent
    wakeup issue described previously, do the complete(), and get migrated,
    it is possible to either:

    - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
    the park and set TASK_RUNNING, or

    - __kthread_bind()'s wait_task_inactive() to observe the competing
    TASK_RUNNING store.

    Either way the WARN() in __kthread_bind() will trigger and fail to
    correctly set the CPU affinity.

    Fix this by only issuing the complete() when the kthread has scheduled
    out. This does away with all the icky 'still running' nonsense.

    The alternative is to promote TASK_PARKED to a special state; this
    guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
    and we'll end up doing the right thing, but it preserves the whole
    icky business of potentially migrating the still runnable thing.

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Gaurav reported a problem with __kthread_parkme() where a concurrent
    try_to_wake_up() could result in competing stores to ->state which,
    when the TASK_PARKED store got lost, meant bad things would happen.

    The comment near set_current_state() actually mentions this competing
    store, but only mentions the case against TASK_RUNNING. This same
    store, with different timing, can happen against a subsequent !RUNNING
    store.

    This normally is not a problem, because as per that same comment, the
    !RUNNING state store is inside a condition based wait-loop:

    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (!need_sleep)
            break;
        schedule();
    }
    __set_current_state(TASK_RUNNING);

    If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
    (concurrent) wakeup, the schedule() will NO-OP and we'll go around the
    loop once more.

    The problem here is that the TASK_PARKED store is not inside the
    KTHREAD_SHOULD_PARK condition wait-loop.

    There is a genuine issue with sleeps that do not have a condition;
    this is addressed in a subsequent patch.

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Nov, 2017

1 commit

  • With all callbacks converted, and the timer callback prototype
    switched over, the TIMER_FUNC_TYPE cast is no longer needed,
    so remove it. Conversion was done with the following scripts:

    perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
        $(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)

    perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
        $(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)

    The now unused macros are also dropped from include/linux/timer.h.

    Signed-off-by: Kees Cook

    Kees Cook
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

1 commit

  • kthread() could bail out early before we initialize blkcg_css (if the
    kthread is killed very early; see the xchg() statement in kthread()),
    which confuses free_kthread_struct(). Instead of moving the blkcg_css
    initialization earlier, we simply zero the whole 'self' data structure,
    which shouldn't add much overhead.

    Reported-by: syzbot
    Fixes: 05e3db95ebfc ("kthread: add a mechanism to store cgroup info")
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Dmitry Vyukov
    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

05 Oct, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer
    to all timer callbacks, switch kthread to use from_timer() and pass the
    timer pointer explicitly.

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Cc: linux-mips@linux-mips.org
    Cc: Len Brown
    Cc: Benjamin Herrenschmidt
    Cc: Lai Jiangshan
    Cc: Sebastian Reichel
    Cc: Kalle Valo
    Cc: Paul Mackerras
    Cc: Pavel Machek
    Cc: linux1394-devel@lists.sourceforge.net
    Cc: Chris Metcalf
    Cc: linux-s390@vger.kernel.org
    Cc: linux-wireless@vger.kernel.org
    Cc: "James E.J. Bottomley"
    Cc: Wim Van Sebroeck
    Cc: Michael Ellerman
    Cc: Ursula Braun
    Cc: Geert Uytterhoeven
    Cc: Viresh Kumar
    Cc: Harish Patil
    Cc: Stephen Boyd
    Cc: Guenter Roeck
    Cc: Manish Chopra
    Cc: Petr Mladek
    Cc: Arnd Bergmann
    Cc: linux-pm@vger.kernel.org
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Julian Wiedmann
    Cc: John Stultz
    Cc: Mark Gross
    Cc: linux-watchdog@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Oleg Nesterov
    Cc: Ralf Baechle
    Cc: Stefan Richter
    Cc: Michael Reed
    Cc: netdev@vger.kernel.org
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Sudip Mukherjee
    Link: https://lkml.kernel.org/r/1507159627-127660-13-git-send-email-keescook@chromium.org

    Kees Cook
     

27 Sep, 2017

1 commit

  • The code is only for blkcg, not for all cgroups.

    Fixes: d4478e92d618 ("block/loop: make loop cgroup aware")
    Reported-by: kbuild test robot
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

26 Sep, 2017

1 commit

  • kthread usually runs jobs on behalf of other threads. The jobs should be
    charged to the cgroup of the original threads, but because the jobs run
    in a kthread, we lose the cgroup context of the original threads. The
    patch adds a mechanism to record the cgroup info of the original threads
    in kthread context. Later we can retrieve the cgroup info and attach it
    to jobs.

    Since this mechanism is only required by kthread, we store the cgroup
    info in kthread data instead of generic task_struct.

    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

01 Sep, 2017

1 commit

  • If the worker thread keeps getting work, it will hog the CPU and
    trigger RCU stall complaints. Make it a good citizen. This is triggered
    by a loop block device test.

    Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Cc: Petr Mladek
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li