26 Nov, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:
    "There are several notable changes here:

    - A single thread migrating itself has been optimized so that it
    no longer needs the threadgroup rwsem.

    - Freezer optimization to avoid unnecessary frozen state changes.

    - cgroup ID unification so that cgroup fs ino is the only unique ID
    used for the cgroup and can be used to directly look up live
    cgroups through filehandle interface on 64bit ino archs. On 32bit
    archs, cgroup fs ino is still the only ID in use but it is only
    unique when combined with gen.

    - selftest and other changes"

    * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
    writeback: fix -Wformat compilation warnings
    docs: cgroup: mm: Fix spelling of "list"
    cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
    cgroup: use cgrp->kn->id as the cgroup ID
    kernfs: use 64bit inos if ino_t is 64bit
    kernfs: implement custom exportfs ops and fid type
    kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
    kernfs: convert kernfs_node->id from union kernfs_node_id to u64
    kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
    kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
    netprio: use css ID instead of cgroup ID
    writeback: use ino_t for inodes in tracepoints
    kernfs: fix ino wrap-around detection
    kselftests: cgroup: Avoid the reuse of fd after it is deallocated
    cgroup: freezer: don't change task and cgroups status unnecessarily
    cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
    cgroup: remove cgroup_enable_task_cg_lists() optimization
    cgroup: pids: use atomic64_t for pids->limit
    selftests: cgroup: Run test_core under interfering stress
    selftests: cgroup: Add task migration tests
    ...

    Linus Torvalds
     

16 Nov, 2019

1 commit

  • Pull misc vfs fixes from Al Viro:
    "Assorted fixes all over the place; some of that is -stable fodder,
    some regressions from the last window"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
    ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
    ecryptfs: fix unlink and rmdir in face of underlying fs modifications
    audit_get_nd(): don't unlock parent too early
    exportfs_decode_fh(): negative pinned may become positive without the parent locked
    cgroup: don't put ERR_PTR() into fc->root
    autofs: fix a leak in autofs_expire_indirect()
    aio: Fix io_pgetevents() struct __compat_aio_sigset layout
    fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()

    Linus Torvalds
     

15 Nov, 2019

1 commit

  • 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID") added WARN
    which triggers if cgroup_id(root_cgrp) is not 1. This is fine on
    64bit ino archs but on 32bit archs cgroup ID is ((gen << 32) | ino)
    and gen starts at 1, so the root id is 0x1_0000_0001 instead of 1
    always triggering the WARN.

    What we want to ensure is that the ino part is 1. Fix the check
    accordingly.

    Reported-by: Naresh Kamboju
    Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
    Signed-off-by: Tejun Heo

    Tejun Heo
     

13 Nov, 2019

3 commits

  • cgroup ID is currently allocated using a dedicated per-hierarchy idr
    and used internally and exposed through tracepoints and bpf. This is
    confusing because there are tracepoints and other interfaces which use
    the cgroupfs ino as IDs.

    The preceding changes exposed kn->id as a 64bit ino on supported
    archs, or as ino+gen (low 32bits as ino, high 32bits as gen). There's no
    reason for cgroup to use different IDs. The kernfs IDs are unique and
    userland can easily discover them and map them back to paths using
    standard file operations.

    This patch replaces cgroup IDs with kernfs IDs.

    * cgroup_id() is added and all cgroup ID users are converted to use it.

    * kernfs_node creation is moved to earlier during cgroup init so that
    cgroup_id() is available during init.

    * While at it, s/cgroup/cgrp/ in psi helpers for consistency.

    * Fallback ID value is changed to 1 to be consistent with root cgroup
    ID.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_find_and_get_node_by_ino() looks up the kernfs_node matching
    the specified ino. On top of that, kernfs_get_node_by_id() and
    kernfs_fh_get_inode() implement full ID matching by testing the rest
    of the ID.

    On the surface, confusingly, the two are slightly different in that
    the latter uses 0 gen as a wildcard while the former doesn't - does
    it mean that the latter can't uniquely identify inodes w/ 0 gen? In
    practice, this is a distinction without a difference because
    generation numbers start at 1. There are no actual IDs with 0 gen,
    so it can always safely be used as a wildcard.

    Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
    to kernfs_find_and_get_node_by_id(), moving all lookup logic into it,
    and removing the now-unnecessary kernfs_get_node_by_id().

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_node->id is currently a union kernfs_node_id which represents
    either a 32bit (ino, gen) pair or a u64 value. I can't see much value
    in the usage of the union - all that's needed is a 64bit ID which the
    current code is already limited to. Using a union makes the code
    unnecessarily complicated and prevents using 64bit inos while adding
    no practical benefit.

    This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
    ino is stored in the lower 32bits and gen upper. Accessors -
    kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
    ino and gen. This makes ID handling less cumbersome and will allow
    using 64bit inos on supported archs.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim
    Cc: Jens Axboe
    Cc: Alexei Starovoitov

    Tejun Heo
     


07 Nov, 2019

2 commits

  • It's not necessary to adjust the task state and revisit the state
    of the source and destination cgroups if the cgroups are not in a
    frozen state and the task itself is not frozen. Worse, in this
    scenario the adjustment wakes up a task that isn't supposed to be
    ready to run.

    Skipping the unnecessary task state adjustment avoids waking up the
    task for no reason.

    Signed-off-by: Honglei Wang
    Acked-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Honglei Wang
     
  • cgroup->bstat_pending is used to determine the base stat delta to
    propagate to the parent. While correct, this is different from how
    percpu delta is determined for no good reason and the inconsistency
    makes the code more difficult to understand.

    This patch makes parent propagation delta calculation use the same
    method as percpu to global propagation.

    * cgroup_base_stat_accumulate() is renamed to cgroup_base_stat_add()
    and cgroup_base_stat_sub() is added.

    * percpu propagation calculation is updated to use the above helpers.

    * cgroup->bstat_pending is replaced with cgroup->last_bstat and
    updated to use the same calculation as percpu propagation.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

29 Oct, 2019

1 commit

  • Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
    cpuset code feeding empty cpumasks to the sched domain rebuild machinery.

    This leads to the following splat:

    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
    Hardware name: ARM Juno development board (r0) (DT)
    Workqueue: events cpuset_hotplug_workfn
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    lr : build_sched_domains (kernel/sched/topology.c:1966)
    Call trace:
    build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    partition_sched_domains_locked (kernel/sched/topology.c:2250)
    rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
    rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
    cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
    process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
    worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:255)
    ret_from_fork (arch/arm64/kernel/entry.S:1167)
    Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)

    The faulty line in question is:

    cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));

    and we're not checking the return value against nr_cpu_ids (we shouldn't
    have to!), which leads to the above.

    Prevent generate_sched_domains() from returning empty cpumasks, and
    add an assertion in build_sched_domains() to scream bloody murder if
    it happens again.

    The above splat was obtained on my Juno r0 with the following reproducer:

    $ cgcreate -g cpuset:asym
    $ cgset -r cpuset.cpus=0-3 asym
    $ cgset -r cpuset.mems=0 asym
    $ cgset -r cpuset.cpu_exclusive=1 asym

    $ cgcreate -g cpuset:smp
    $ cgset -r cpuset.cpus=4-5 smp
    $ cgset -r cpuset.mems=0 smp
    $ cgset -r cpuset.cpu_exclusive=1 smp

    $ cgset -r cpuset.sched_load_balance=0 .

    $ echo 0 > /sys/devices/system/cpu/cpu4/online
    $ echo 0 > /sys/devices/system/cpu/cpu5/online

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hannes@cmpxchg.org
    Cc: lizefan@huawei.com
    Cc: morten.rasmussen@arm.com
    Cc: qperret@google.com
    Cc: tj@kernel.org
    Cc: vincent.guittot@linaro.org
    Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
    Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
     

25 Oct, 2019

2 commits

  • cgroup_enable_task_cg_lists() is used to lazily initialize task
    cgroup associations on the first use to reduce fork / exit overheads
    on systems which don't use cgroup. Unfortunately, locking around it
    has never been actually correct and its value is dubious given how the
    vast majority of systems use cgroup right away from boot.

    This patch removes the optimization. For now, replace the cg_list
    based branches with WARN_ON_ONCE()'s to be on the safe side. We can
    simplify the logic further in the future.

    Signed-off-by: Tejun Heo
    Reported-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Because pids->limit can be changed concurrently (but we don't want
    to take a lock because it would be needlessly expensive), use an
    atomic64_t instead.

    Fixes: commit 49b786ea146f ("cgroup: implement the PIDs subsystem")
    Cc: stable@vger.kernel.org # v4.3+
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

07 Oct, 2019

3 commits

  • There are reports of users who use thread migrations between cgroups and
    they report performance drop after d59cfc09c32a ("sched, cgroup: replace
    signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
    pronounced on machines with more CPUs.

    The migration is affected by forking noise happening in the background,
    after the mentioned commit a migrating thread must wait for all
    (forking) processes on the system, not only of its threadgroup.

    There are several places that need to synchronize with migration:
    a) do_exit,
    b) de_thread,
    c) copy_process,
    d) cgroup_update_dfl_csses,
    e) parallel migration (cgroup_{proc,thread}s_write).

    In the case of self-migrating thread, we relax the synchronization on
    cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
    excluded with cgroup_mutex, c) does not matter in case of single thread
    migration and the executing thread cannot exec(2) or exit(2) while it is
    writing into cgroup.threads. In the case of do_exit due to signal
    delivery, we either exit before the migration or finish the migration
    (of a not-yet-PF_EXITING thread) and die afterwards.

    This patch handles only the case of self-migration by writing "0" into
    cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
    with numeric PIDs.

    This change improves migration dependent workload performance similar
    to per-signal_struct state.

    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     
  • We no longer take cgroup_mutex in cgroup_exit and the exiting tasks are
    not moved to init_css_set, reflect that in several comments to prevent
    confusion.

    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo

    Michal Koutný
     
  • Like commit 13d82fb77abb ("cgroup: short-circuit cset_cgroup_from_root() on
    the default hierarchy"), short-circuit current_cgns_cgroup_from_root() on
    the default hierarchy.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Tejun Heo

    Miaohe Lin
     


17 Sep, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler are getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     

13 Sep, 2019

1 commit

  • If a new child cgroup is created in the frozen cgroup hierarchy
    (one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
    flag should be set. Otherwise, if a process is attached to the child
    cgroup, it won't become frozen.

    The problem can be reproduced with the test_cgfreezer_mkdir test.

    This is the output before this patch:
    ~/test_freezer
    ok 1 test_cgfreezer_simple
    ok 2 test_cgfreezer_tree
    ok 3 test_cgfreezer_forkbomb
    Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
    not ok 4 test_cgfreezer_mkdir
    ok 5 test_cgfreezer_rmdir
    ok 6 test_cgfreezer_migrate
    ok 7 test_cgfreezer_ptrace
    ok 8 test_cgfreezer_stopped
    ok 9 test_cgfreezer_ptraced
    ok 10 test_cgfreezer_vfork

    And with this patch:
    ~/test_freezer
    ok 1 test_cgfreezer_simple
    ok 2 test_cgfreezer_tree
    ok 3 test_cgfreezer_forkbomb
    ok 4 test_cgfreezer_mkdir
    ok 5 test_cgfreezer_rmdir
    ok 6 test_cgfreezer_migrate
    ok 7 test_cgfreezer_ptrace
    ok 8 test_cgfreezer_stopped
    ok 9 test_cgfreezer_ptraced
    ok 10 test_cgfreezer_vfork

    Reported-by: Mark Crossen
    Signed-off-by: Roman Gushchin
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Cc: Tejun Heo
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Tejun Heo

    Roman Gushchin
     


25 Jul, 2019

4 commits

  • No synchronisation mechanism exists between the cpuset subsystem and
    calls to function __sched_setscheduler(). As such, it is possible that
    new root domains are created on the cpuset side while a deadline
    acceptance test is carried out in __sched_setscheduler(), leading to a
    potential oversell of CPU bandwidth.

    Grab cpuset_rwsem read lock from core scheduler, so to prevent
    situations such as the one described above from happening.

    The only exception is normalize_rt_tasks() which needs to work under
    tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with
    this, as this function is only called by sysrq and, if that gets
    triggered, DEADLINE guarantees are already gone out of the window
    anyway.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • cpuset_rwsem is going to be acquired from sched_setscheduler() with a
    following patch. There are however paths (e.g., spawn_ksoftirqd) in
    which sched_setscheduler() is eventually called while holding the
    hotplug lock; this creates a dependency between the hotplug lock (to
    be always acquired first) and cpuset_rwsem (to be always acquired
    after the hotplug lock).

    Fix paths which currently take the two locks in the wrong order (after
    a following patch is applied).

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Holding cpuset_mutex means that cpusets are stable (only the holder can
    make changes) and this is required for fixing a synchronization issue
    between cpusets and scheduler core. However, grabbing cpuset_mutex from
    setscheduler() hotpath (as implemented in a later patch) is a no-go, as
    it would create a bottleneck for tasks concurrently calling
    setscheduler().

    Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
    setscheduler() will then be able to read lock it and avoid concurrency
    issues.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • When the topology of root domains is modified by CPUset or CPU
    hotplug operations, information about the current deadline bandwidth
    held in the root domain is lost.

    This patch addresses the issue by recalculating the lost deadline
    bandwidth information by circling through the deadline tasks held in
    CPUsets and adding their current load to the root domain they are
    associated with.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Mathieu Poirier
    Signed-off-by: Juri Lelli
    [ Various additional modifications. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     


12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
    - improved PCIe suspend support (Keith Busch)
    - error injection support for the admin queue (Akinobu Mita)
    - Fibre Channel discovery improvements (James Smart)
    - tracing improvements including nvmet tracing support (Minwoo Im)
    - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
    Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • Pull cgroup updates from Tejun Heo:
    "Documentation updates and the addition of cgroup_parse_float() which
    will be used by new controllers including blk-iocost"

    * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: convert docs to ReST and rename to *.rst
    cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
    cgroup: add cgroup_parse_float()
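    cgroup_parse_float() lets controller interfaces accept decimal strings
    like "7.25" without using floating point in the kernel: the value is
    scaled by 10^dec_shift into a plain integer. A hedged userspace sketch
    of the idea (non-negative input only; extra fractional digits are
    truncated here, where the kernel rounds; the name parse_float_fixed is
    illustrative):

```c
#include <ctype.h>

/* Parse "7.25" with dec_shift=2 into 725, "3" into 300, etc. */
static int parse_float_fixed(const char *s, unsigned dec_shift, long long *v)
{
    long long whole = 0, frac = 0;
    unsigned frac_digits = 0;

    while (isdigit((unsigned char)*s))
        whole = whole * 10 + (*s++ - '0');

    if (*s == '.') {
        s++;
        while (isdigit((unsigned char)*s)) {
            if (frac_digits < dec_shift) {
                frac = frac * 10 + (*s - '0');
                frac_digits++;
            }
            s++;              /* extra fractional digits are dropped */
        }
    }
    if (*s != '\0')
        return -1;            /* trailing garbage */

    while (frac_digits++ < dec_shift)
        frac *= 10;           /* pad: "7.2" @ shift 2 -> frac 20 */

    for (unsigned i = 0; i < dec_shift; i++)
        whole *= 10;

    *v = whole + frac;
    return 0;
}
```

    blk-iocost, mentioned above as a planned user, exposes such fractional
    weights through its cgroup files.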

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily push the cpufreq
    governor to higher frequencies and less energy efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of &task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.
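    The boosting and capping described in the utilization clamping item
    above amounts to clamping the per-task utilization that the cpufreq
    governor sees into a [min, max] range. A minimal sketch (the function
    name is illustrative; the kernel works on a 0..1024 capacity scale):

```c
#include <assert.h>

/* Effective utilization = raw PELT utilization clamped into
 * [uclamp_min, uclamp_max]. */
static unsigned int uclamp_effective(unsigned int util,
                                     unsigned int uclamp_min,
                                     unsigned int uclamp_max)
{
    if (util < uclamp_min)
        return uclamp_min;    /* boost: interactive task looks busier */
    if (util > uclamp_max)
        return uclamp_max;    /* cap: background task looks calmer */
    return util;
}
```

    So a mostly idle GUI thread with utilization 100/1024 and a uclamp
    minimum of 512 is treated as if it had utilization 512, steering
    cpufreq toward a higher operating point as soon as it runs.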

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
     

01 Jul, 2019

1 commit

  • Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
    fix. Otherwise we end up having conflicts with future patches between
    for-5.3/block and master that touch this area. In particular, it makes
    the bio_full() fix hard to backport to stable.

    * tag 'v5.2-rc6': (482 commits)
    Linux 5.2-rc6
    Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
    Bluetooth: Fix regression with minimum encryption key size alignment
    tcp: refine memory limit test in tcp_fragment()
    x86/vdso: Prevent segfaults due to hoisted vclock reads
    SUNRPC: Fix a credential refcount leak
    Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
    net :sunrpc :clnt :Fix xps refcount imbalance on the error path
    NFS4: Only set creation opendata if O_CREAT
    ARM: 8867/1: vdso: pass --be8 to linker if necessary
    KVM: nVMX: reorganize initial steps of vmx_set_nested_state
    KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
    habanalabs: use u64_to_user_ptr() for reading user pointers
    nfsd: replace Jeff by Chuck as nfsd co-maintainer
    inet: clear num_timeout reqsk_alloc()
    PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
    net: mvpp2: debugfs: Add pmap to fs dump
    ipv6: Default fib6_type to RTN_UNICAST when not set
    net: hns3: Fix inconsistent indenting
    net/af_iucv: always register net_device notifier
    ...

    Jens Axboe
     

29 Jun, 2019

1 commit

  • …k/linux-rcu into core/rcu

    Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

    - RCU flavor consolidation cleanups and optimizations
    - Documentation updates
    - Miscellaneous fixes
    - SRCU updates
    - RCU-sync flavor consolidation
    - Torture-test updates
    - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     


21 Jun, 2019

1 commit

  • The bfq scheduler now uses css_next_descendant_pre directly, after
    the stats functionality depending on it has been moved from the core
    blk-cgroup code to bfq. Export the symbol so that bfq can still be
    built modular.

    Fixes: d6258980daf2 ("bfq-iosched: move bfq_stat_recursive_sum into the only caller")
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is subject to the terms and conditions of version 2 of the
    gnu general public license see the file copying in the main
    directory of the linux distribution for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 5 file(s).
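    Concretely, the multi-line boilerplate quoted above is replaced in
    each file by a single machine-readable tag on the first line, e.g. for
    a C source file:

```c
// SPDX-License-Identifier: GPL-2.0-only
```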

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081200.872755311@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     


15 Jun, 2019

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "This has an unusually high density of tricky fixes:

    - task_get_css() could deadlock when it races against a dying cgroup.

    - cgroup.procs didn't list thread group leaders with live threads.

    This could mislead readers to think that a cgroup is empty when
    it's not. Fixed by making PROCS iterator include dead tasks. I made
    a couple mistakes making this change and this pull request contains
    a couple follow-up patches.

    - When cpusets run out of online cpus, it updates cpumasks of member
    tasks in bizarre ways. Joel improved the behavior significantly"
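    The task_get_css() deadlock above came from retrying with
    css_tryget_online(), which fails forever once a cgroup starts dying,
    whereas plain css_tryget() only needs a nonzero refcount and so always
    succeeds for a css the task still references. A toy model of the two
    predicates (deliberately simplified; the real css refcount is a
    percpu_ref and the loop takes RCU into account):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a css goes offline when its cgroup starts dying, but its
 * refcount stays nonzero while some task still points at it. */
struct css { int refcnt; bool online; };

/* Succeeds whenever the css is still referenced, even if dying. */
static bool css_tryget(struct css *c)
{
    if (c->refcnt <= 0)
        return false;
    c->refcnt++;
    return true;
}

/* Additionally requires the css to be online. A retry loop built on
 * this predicate spins forever against a dying-but-referenced css. */
static bool css_tryget_online(struct css *c)
{
    return c->online && css_tryget(c);
}
```

    For a css with refcnt 1 and online false, css_tryget_online() can
    never succeed, while css_tryget() succeeds on the first try.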

    * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: restore sanity to cpuset_cpus_allowed_fallback()
    cgroup: Fix css_task_iter_advance_css_set() cset skip condition
    cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
    cgroup: Include dying leaders with live threads in PROCS iterations
    cgroup: Implement css_task_iter_skip()
    cgroup: Call cgroup_release() before __exit_signal()
    docs cgroups: add another example size for hugetlb
    cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()

    Linus Torvalds