06 Feb, 2020

1 commit

  • commit 3bc0bb36fa30e95ca829e9cf480e1ef7f7638333 upstream.

    The test_cgcore_no_internal_process_constraint_on_threads selftest when
    running with subsystem controlling noise triggers two warnings:

    > [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
    > [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160

    Both stem from a call to cgroup_type_write. The first warning was also
    triggered by syzkaller.

    When we're switching cgroup to threaded mode shortly after a subsystem
    was disabled on it, we can see the respective subsystem css dying there.

    The warning in cgroup_apply_control_enable is harmless in this case
    since we're not adding new subsys anyway.
    The warning in cgroup_apply_control_disable indicates an attempt to kill
    css of recently disabled subsystem repeatedly.

    The commit prevents these situations by making cgroup_type_write wait
    for all dying csses to go away before re-applying subtree controls.
    While at it, the locations of WARN_ON_ONCE calls are moved so that
    warning is triggered only when we are about to misuse the dying css.

    Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
    Reported-by: Christian Brauner
    Signed-off-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Michal Koutný
     

31 Dec, 2019

1 commit

  • [ Upstream commit 742e8cd3e1ba6f19cad6d912f8d469df5557d0fd ]

    It's not necessary to adjust the task state and revisit the state
    of source and destination cgroups if the cgroups are not in freeze
    state and the task itself is not frozen.

    And in this scenario, it wakes up a task that is not supposed to be
    ready to run.

    Skipping the unnecessary task state adjustment avoids waking up the
    task without a reason.

    Signed-off-by: Honglei Wang
    Acked-by: Roman Gushchin
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Honglei Wang
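
The early-exit condition described in the entry above can be sketched as a small user-space model. The struct and function names here (`mig_cgroup`, `mig_task`, `migration_needs_freezer_update`) are hypothetical stand-ins for illustration, not the kernel's own types or API.

```c
#include <stdbool.h>

/* Hypothetical model: only the state the check needs. */
struct mig_cgroup {
    bool freeze;   /* the cgroup is in the freeze state */
};

struct mig_task {
    bool frozen;   /* the task itself is currently frozen */
};

/*
 * Returns false when neither the source nor the destination cgroup is
 * freezing and the task is not frozen. In that case there is nothing to
 * adjust, so the task should not be touched (and not woken up).
 */
bool migration_needs_freezer_update(const struct mig_task *task,
                                    const struct mig_cgroup *src,
                                    const struct mig_cgroup *dst)
{
    if (!src->freeze && !dst->freeze && !task->frozen)
        return false;
    return true;
}
```

Any one of the three conditions being true is enough to require the state adjustment; only the all-clear case is skipped.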
     

18 Dec, 2019

1 commit

  • commit a713af394cf382a30dd28a1015cbe572f1b9ca75 upstream.

    Because pids->limit can be changed concurrently (but we don't want to
    take a lock because it would be needlessly expensive), use an
    atomic64_t instead.

    Fixes: commit 49b786ea146f ("cgroup: implement the PIDs subsystem")
    Cc: stable@vger.kernel.org # v4.3+
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Aleksa Sarai
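
To illustrate the idea in the entry above, here is a minimal user-space sketch using C11 atomics in place of the kernel's atomic64_t. The type `pids_model` and both functions are hypothetical names chosen for this example; the point is only that the limit can be rewritten concurrently while the charge path reads a single consistent snapshot without a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a pids-style controller; not the kernel struct. */
struct pids_model {
    _Atomic int64_t counter;  /* tasks currently charged */
    _Atomic int64_t limit;    /* may be rewritten by another thread */
};

/* Writer side: publish a new limit without taking a lock. */
void pids_model_set_limit(struct pids_model *p, int64_t new_limit)
{
    atomic_store(&p->limit, new_limit);
}

/*
 * Charge side: add num to the counter, then compare the result against
 * one atomic snapshot of the limit. On failure the charge is rolled back.
 */
bool pids_model_try_charge(struct pids_model *p, int64_t num)
{
    int64_t new_count = atomic_fetch_add(&p->counter, num) + num;
    int64_t limit = atomic_load(&p->limit);

    if (new_count > limit) {
        atomic_fetch_sub(&p->counter, num);
        return false;
    }
    return true;
}
```

A plain 64-bit field could be read half-updated on some 32-bit targets; an atomic load avoids that without the cost of a lock on every fork.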
     

16 Nov, 2019

1 commit

  • Pull misc vfs fixes from Al Viro:
    "Assorted fixes all over the place; some of that is -stable fodder,
    some regressions from the last window"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
    ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
    ecryptfs: fix unlink and rmdir in face of underlying fs modifications
    audit_get_nd(): don't unlock parent too early
    exportfs_decode_fh(): negative pinned may become positive without the parent locked
    cgroup: don't put ERR_PTR() into fc->root
    autofs: fix a leak in autofs_expire_indirect()
    aio: Fix io_pgetevents() struct __compat_aio_sigset layout
    fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()

    Linus Torvalds
     

11 Nov, 2019

1 commit


29 Oct, 2019

1 commit

  • Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
    cpuset code feeding empty cpumasks to the sched domain rebuild machinery.

    This leads to the following splat:

    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
    Hardware name: ARM Juno development board (r0) (DT)
    Workqueue: events cpuset_hotplug_workfn
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    lr : build_sched_domains (kernel/sched/topology.c:1966)
    Call trace:
    build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    partition_sched_domains_locked (kernel/sched/topology.c:2250)
    rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
    rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
    cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
    process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
    worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:255)
    ret_from_fork (arch/arm64/kernel/entry.S:1167)
    Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)

    The faulty line in question is:

    cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));

    and we're not checking the return value against nr_cpu_ids (we shouldn't
    have to!), which leads to the above.

    Prevent generate_sched_domains() from returning empty cpumasks, and add
    some assertion in build_sched_domains() to scream bloody murder if it
    happens again.

    The above splat was obtained on my Juno r0 with the following reproducer:

    $ cgcreate -g cpuset:asym
    $ cgset -r cpuset.cpus=0-3 asym
    $ cgset -r cpuset.mems=0 asym
    $ cgset -r cpuset.cpu_exclusive=1 asym

    $ cgcreate -g cpuset:smp
    $ cgset -r cpuset.cpus=4-5 smp
    $ cgset -r cpuset.mems=0 smp
    $ cgset -r cpuset.cpu_exclusive=1 smp

    $ cgset -r cpuset.sched_load_balance=0 .

    $ echo 0 > /sys/devices/system/cpu/cpu4/online
    $ echo 0 > /sys/devices/system/cpu/cpu5/online

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hannes@cmpxchg.org
    Cc: lizefan@huawei.com
    Cc: morten.rasmussen@arm.com
    Cc: qperret@google.com
    Cc: tj@kernel.org
    Cc: vincent.guittot@linaro.org
    Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
    Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
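
The failure mode in the entry above (a "first CPU" lookup on an empty mask returning an out-of-range index that is then used without a check) can be modelled in user space. This is a sketch under stated assumptions: masks are plain 32-bit integers, `MODEL_NR_CPUS` stands in for nr_cpu_ids, and every `model_*` name is hypothetical rather than a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8u  /* assumption: stands in for nr_cpu_ids */

/* Lowest set CPU in mask, or MODEL_NR_CPUS when the mask is empty. */
unsigned int model_cpumask_first(uint32_t mask)
{
    for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (mask & (UINT32_C(1) << cpu))
            return cpu;
    return MODEL_NR_CPUS;
}

/*
 * Producer-side fix: intersect each candidate domain with the online
 * CPUs and keep only the non-empty results. Returns how many were kept.
 */
unsigned int model_filter_domains(const uint32_t *in, unsigned int n,
                                  uint32_t online, uint32_t *out)
{
    unsigned int kept = 0;

    for (unsigned int i = 0; i < n; i++) {
        uint32_t m = in[i] & online;

        if (m == 0)
            continue;  /* never hand an empty mask downstream */
        out[kept++] = m;
    }
    return kept;
}

/* Consumer-side guard: refuse to index the table for an empty mask. */
bool model_first_cpu_capacity(uint32_t mask, const unsigned long *capacity,
                              unsigned long *out)
{
    unsigned int cpu = model_cpumask_first(mask);

    if (cpu >= MODEL_NR_CPUS)
        return false;
    *out = capacity[cpu];
    return true;
}
```

With the reproducer's layout (one set on CPUs 0-3, one on CPUs 4-5) and CPUs 4 and 5 offline, the second mask becomes empty after the intersection and is dropped instead of reaching the capacity lookup.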
     

18 Sep, 2019

1 commit


17 Sep, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler are getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     

13 Sep, 2019

1 commit

  • If a new child cgroup is created in the frozen cgroup hierarchy
    (one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
    flag should be set. Otherwise, if a process is attached to the
    child cgroup, it won't become frozen.

    The problem can be reproduced with the test_cgfreezer_mkdir test.

    This is the output before this patch:
    ~/test_freezer
    ok 1 test_cgfreezer_simple
    ok 2 test_cgfreezer_tree
    ok 3 test_cgfreezer_forkbomb
    Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
    not ok 4 test_cgfreezer_mkdir
    ok 5 test_cgfreezer_rmdir
    ok 6 test_cgfreezer_migrate
    ok 7 test_cgfreezer_ptrace
    ok 8 test_cgfreezer_stopped
    ok 9 test_cgfreezer_ptraced
    ok 10 test_cgfreezer_vfork

    And with this patch:
    ~/test_freezer
    ok 1 test_cgfreezer_simple
    ok 2 test_cgfreezer_tree
    ok 3 test_cgfreezer_forkbomb
    ok 4 test_cgfreezer_mkdir
    ok 5 test_cgfreezer_rmdir
    ok 6 test_cgfreezer_migrate
    ok 7 test_cgfreezer_ptrace
    ok 8 test_cgfreezer_stopped
    ok 9 test_cgfreezer_ptraced
    ok 10 test_cgfreezer_vfork

    Reported-by: Mark Crossen
    Signed-off-by: Roman Gushchin
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Cc: Tejun Heo
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Tejun Heo

    Roman Gushchin
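
The rule in the entry above (a child created under a frozen ancestor must start out with the freeze flag set) can be sketched with a small model. The struct layout and names (`frz_cgroup`, `frz_init_child`, `frz_set_self`) are assumptions made for this example and are not the kernel's data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the freezer state kept per cgroup. */
struct frz_cgroup {
    struct frz_cgroup *parent;
    int e_freeze;      /* frozen cgroups among this one and its ancestors */
    bool freeze_flag;  /* stands in for the CGRP_FREEZE flag */
};

/* Mark a cgroup itself as frozen (children are not walked in this model). */
void frz_set_self(struct frz_cgroup *cg)
{
    cg->e_freeze += 1;
    cg->freeze_flag = true;
}

/*
 * Initialise a newly created child. It inherits the effective freeze
 * count from its parent, and the flag is set whenever that count is
 * non-zero, so tasks attached later will be frozen.
 */
void frz_init_child(struct frz_cgroup *child, struct frz_cgroup *parent)
{
    child->parent = parent;
    child->e_freeze = parent ? parent->e_freeze : 0;
    child->freeze_flag = child->e_freeze > 0;
}
```

Without the last assignment the child would carry the inherited count but not the flag, which is the state the failing test_cgfreezer_mkdir output reports.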
     

08 Aug, 2019

1 commit


25 Jul, 2019

4 commits

  • No synchronisation mechanism exists between the cpuset subsystem and
    calls to function __sched_setscheduler(). As such, it is possible that
    new root domains are created on the cpuset side while a deadline
    acceptance test is carried out in __sched_setscheduler(), leading to a
    potential oversell of CPU bandwidth.

    Grab cpuset_rwsem read lock from core scheduler, so as to prevent
    situations such as the one described above from happening.

    The only exception is normalize_rt_tasks() which needs to work under
    tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with
    this, as this function is only called by sysrq and, if that gets
    triggered, DEADLINE guarantees are already gone out of the window
    anyway.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • cpuset_rwsem is going to be acquired from sched_setscheduler() with a
    following patch. There are however paths (e.g., spawn_ksoftirqd) in
    which sched_setscheduler() is eventually called while holding hotplug lock;
    this creates a dependency between hotplug lock (to be always acquired
    first) and cpuset_rwsem (to be always acquired after hotplug lock).

    Fix paths which currently take the two locks in the wrong order (after
    a following patch is applied).

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
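
The ordering rule stated in the entry above (hotplug lock always first, cpuset_rwsem always after it) can be expressed as a rank check. The following is only an illustrative model with hypothetical names; it does not take real locks and is not how the kernel enforces ordering.

```c
#include <stdbool.h>

/* Ranks: a lock may only be taken while no equal or higher rank is held. */
enum model_lock {
    MODEL_HOTPLUG_LOCK = 1,  /* must be acquired first */
    MODEL_CPUSET_RWSEM = 2   /* must be acquired after the hotplug lock */
};

struct lock_ctx {
    unsigned int held;  /* bitmask, bit N set when rank N is held */
};

/* Returns false when the acquisition would violate the ordering. */
bool model_acquire(struct lock_ctx *ctx, enum model_lock lock)
{
    unsigned int bit = 1u << lock;

    if (ctx->held >= bit)  /* an equal or higher rank is already held */
        return false;
    ctx->held |= bit;
    return true;
}

void model_release(struct lock_ctx *ctx, enum model_lock lock)
{
    ctx->held &= ~(1u << lock);
}
```

A path that already holds the cpuset lock and then asks for the hotplug lock is rejected by this model, which is the inverted order the commit sets out to remove.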
     
  • Holding cpuset_mutex means that cpusets are stable (only the holder can
    make changes) and this is required for fixing a synchronization issue
    between cpusets and scheduler core. However, grabbing cpuset_mutex from
    setscheduler() hotpath (as implemented in a later patch) is a no-go, as
    it would create a bottleneck for tasks concurrently calling
    setscheduler().

    Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
    setscheduler() will then be able to read lock it and avoid concurrency
    issues.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: mathieu.poirier@linaro.org
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • When the topology of root domains is modified by CPUset or CPUhotplug
    operations, information about the current deadline bandwidth held in the
    root domain is lost.

    This patch addresses the issue by recalculating the lost deadline
    bandwidth information by circling through the deadline tasks held in
    CPUsets and adding their current load to the root domain they are
    associated with.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Mathieu Poirier
    Signed-off-by: Juri Lelli
    [ Various additional modifications. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
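
The recalculation described in the entry above amounts to zeroing each root domain's total and re-adding the bandwidth of every deadline task. Below is a minimal sketch, assuming bandwidth is stored as a fixed-point runtime/period ratio with a 20-bit shift; the shift value, the flat task array, and all `model_*` names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_BW_SHIFT 20  /* assumption: fixed-point fraction bits */

/* Hypothetical view of one deadline task. */
struct dl_task_model {
    uint64_t runtime_ns;
    uint64_t period_ns;
    size_t root_domain;  /* index of the root domain the task runs in */
};

/* Bandwidth as a fixed-point ratio runtime/period. */
uint64_t model_to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
    if (period_ns == 0)
        return 0;
    return (runtime_ns << MODEL_BW_SHIFT) / period_ns;
}

/*
 * Rebuild the per-root-domain totals from scratch: discard whatever was
 * accounted before the topology change, then walk all deadline tasks
 * and add each one's bandwidth to its current root domain.
 */
void model_rebuild_dl_bw(const struct dl_task_model *tasks, size_t ntasks,
                         uint64_t *total_bw, size_t ndomains)
{
    for (size_t d = 0; d < ndomains; d++)
        total_bw[d] = 0;

    for (size_t i = 0; i < ntasks; i++) {
        size_t rd = tasks[i].root_domain;

        if (rd >= ndomains)
            continue;  /* ignore tasks pointing at an unknown domain */
        total_bw[rd] += model_to_ratio(tasks[i].period_ns,
                                       tasks[i].runtime_ns);
    }
}
```

Starting from zero rather than patching the old totals is what makes the result independent of how the root domains were split or merged.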
     

24 Jul, 2019

2 commits


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

15 Jul, 2019

1 commit


12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
    - improved PCIe suspend support (Keith Busch)
    - error injection support for the admin queue (Akinobu Mita)
    - Fibre Channel discovery improvements (James Smart)
    - tracing improvements including nvmetc tracing support (Minwoo Im)
    - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
    Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • Pull cgroup updates from Tejun Heo:
    "Documentation updates and the addition of cgroup_parse_float() which
    will be used by new controllers including blk-iocost"

    * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: convert docs to ReST and rename to *.rst
    cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
    cgroup: add cgroup_parse_float()

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily move the cpufreq
    governor to higher frequencies and less energy efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of &task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
     

01 Jul, 2019

1 commit

  • Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
    fix. Otherwise we end up having conflicts with future patches between
    for-5.3/block and master that touch this area. In particular, it makes
    the bio_full() fix hard to backport to stable.

    * tag 'v5.2-rc6': (482 commits)
    Linux 5.2-rc6
    Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
    Bluetooth: Fix regression with minimum encryption key size alignment
    tcp: refine memory limit test in tcp_fragment()
    x86/vdso: Prevent segfaults due to hoisted vclock reads
    SUNRPC: Fix a credential refcount leak
    Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
    net :sunrpc :clnt :Fix xps refcount imbalance on the error path
    NFS4: Only set creation opendata if O_CREAT
    ARM: 8867/1: vdso: pass --be8 to linker if necessary
    KVM: nVMX: reorganize initial steps of vmx_set_nested_state
    KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
    habanalabs: use u64_to_user_ptr() for reading user pointers
    nfsd: replace Jeff by Chuck as nfsd co-maintainer
    inet: clear num_timeout reqsk_alloc()
    PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
    net: mvpp2: debugfs: Add pmap to fs dump
    ipv6: Default fib6_type to RTN_UNICAST when not set
    net: hns3: Fix inconsistent indenting
    net/af_iucv: always register net_device notifier
    ...

    Jens Axboe
     

29 Jun, 2019

1 commit

  • …k/linux-rcu into core/rcu

    Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

    - RCU flavor consolidation cleanups and optimizations
    - Documentation updates
    - Miscellaneous fixes
    - SRCU updates
    - RCU-sync flavor consolidation
    - Torture-test updates
    - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

25 Jun, 2019

1 commit


22 Jun, 2019

1 commit


21 Jun, 2019

1 commit

  • The bfq scheduler now uses css_next_descendant_pre directly after
    the stats functionality depending on it has been moved from the core
    blk-cgroup code to bfq. Export the symbol so that bfq can still
    be built modular.

    Fixes: d6258980daf2 ("bfq-iosched: move bfq_stat_recursive_sum into the only caller")
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is subject to the terms and conditions of version 2 of the
    gnu general public license see the file copying in the main
    directory of the linux distribution for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 5 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081200.872755311@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

18 Jun, 2019

1 commit


17 Jun, 2019

1 commit


15 Jun, 2019

3 commits

  • Pull cgroup fixes from Tejun Heo:
    "This has an unusually high density of tricky fixes:

    - task_get_css() could deadlock when it races against a dying cgroup.

    - cgroup.procs didn't list thread group leaders with live threads.

    This could mislead readers to think that a cgroup is empty when
    it's not. Fixed by making PROCS iterator include dead tasks. I made
    a couple mistakes making this change and this pull request contains
    a couple follow-up patches.

    - When cpusets run out of online cpus, it updates cpumasks of member
    tasks in bizarre ways. Joel improved the behavior significantly"

    * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: restore sanity to cpuset_cpus_allowed_fallback()
    cgroup: Fix css_task_iter_advance_css_set() cset skip condition
    cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
    cgroup: Include dying leaders with live threads in PROCS iterations
    cgroup: Implement css_task_iter_skip()
    cgroup: Call cgroup_release() before __exit_signal()
    docs cgroups: add another example size for hugetlb
    cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()

    Linus Torvalds
     
  • Convert the cgroup-v1 files to ReST format, in order to
    allow a later addition to the admin-guide.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix tables markups;
    - add some lists markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Mauro Carvalho Chehab
     
  • a5e112e6424a ("cgroup: add cgroup_parse_float()") accidentally added
    cgroup_parse_float() inside CONFIG_SYSFS block. Move it outside so
    that it doesn't cause failures on !CONFIG_SYSFS builds.

    Signed-off-by: Tejun Heo
    Fixes: a5e112e6424a ("cgroup: add cgroup_parse_float()")

    Tejun Heo
     

13 Jun, 2019

1 commit

  • In the case that a process is constrained by taskset(1) (i.e.
    sched_setaffinity(2)) to a subset of available cpus, and all of those are
    subsequently offlined, the scheduler will set tsk->cpus_allowed to
    the current value of task_cs(tsk)->effective_cpus.

    This is done via a call to do_set_cpus_allowed() in the context of
    cpuset_cpus_allowed_fallback() made by the scheduler when this case is
    detected. This is the only call made to cpuset_cpus_allowed_fallback()
    in the latest mainline kernel.

    However, this is not sane behavior.

    I will demonstrate this on a system running the latest upstream kernel
    with the following initial configuration:

    # grep -i cpu /proc/$$/status
    Cpus_allowed: ffffffff,ffffffff
    Cpus_allowed_list: 0-63

    (Where cpus 32-63 are provided via smt.)

    If we limit our current shell process to cpu2 only and then offline it
    and reonline it:

    # taskset -p 4 $$
    pid 2272's current affinity mask: ffffffffffffffff
    pid 2272's new affinity mask: 4

    # echo off > /sys/devices/system/cpu/cpu2/online
    # dmesg | tail -3
    [ 2195.866089] process 2272 (bash) no longer affine to cpu2
    [ 2195.872700] IRQ 114: no longer affine to CPU2
    [ 2195.879128] smpboot: CPU 2 is now offline

    # echo on > /sys/devices/system/cpu/cpu2/online
    # dmesg | tail -1
    [ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

    We see that our current process now has an affinity mask containing
    every cpu available on the system _except_ the one we originally
    constrained it to:

    # grep -i cpu /proc/$$/status
    Cpus_allowed: ffffffff,fffffffb
    Cpus_allowed_list: 0-1,3-63

    This is not sane behavior, as the scheduler can now not only place the
    process on previously forbidden cpus, it can't even schedule it on
    the cpu it was originally constrained to!
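
    The mask arithmetic above can be checked with a small sketch. It
    assumes the Cpus_allowed value is a string of comma-separated 32-bit
    hex words with the most significant word first, as in the output
    shown here. The helper names parse_cpus_allowed() and
    format_cpu_list() are hypothetical and are not kernel or util-linux
    APIs.

```python
def parse_cpus_allowed(mask):
    """Decode a Cpus_allowed string (hex words, MSB first) into a set of CPU ids."""
    # Dropping the commas yields one large hexadecimal number whose
    # bit N corresponds to CPU N.
    value = int(mask.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


def format_cpu_list(cpus):
    """Render CPU ids as comma-separated ranges, in the style of Cpus_allowed_list."""
    ordered = sorted(cpus)
    ranges = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


if __name__ == "__main__":
    # Mask observed after cpu2 was offlined and brought back online.
    print(format_cpu_list(parse_cpus_allowed("ffffffff,fffffffb")))
```

    Decoding ffffffff,fffffffb this way gives every cpu from 0 to 63
    except cpu2, which matches the Cpus_allowed_list line above.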

    Other cases result in even more exotic affinity masks. Take for instance
    a process with an affinity mask containing only cpus provided by smt at
    the moment that smt is toggled, in a configuration such as the following:

    # taskset -p f000000000 $$
    # grep -i cpu /proc/$$/status
    Cpus_allowed: 000000f0,00000000
    Cpus_allowed_list: 36-39

    A double toggle of smt results in the following behavior:

    # echo off > /sys/devices/system/cpu/smt/control
    # echo on > /sys/devices/system/cpu/smt/control
    # grep -i cpus /proc/$$/status
    Cpus_allowed: ffffff00,ffffffff
    Cpus_allowed_list: 0-31,40-63

    This is even less sane than the previous case, as the new affinity mask
    excludes all smt-provided cpus with ids less than those that were
    previously in the affinity mask, as well as those that were actually in
    the mask.
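
    As a rough check of the smt case, the sketch below builds the
    taskset mask for cpus 36-39 and works out which cpus are missing
    after the double toggle. It assumes a 64-cpu system as in the
    example. The helper names cpu_list_to_mask() and proc_format() are
    hypothetical and only illustrate the bit layout.

```python
NR_CPUS = 64  # assumption: matches the 64-cpu system in the example


def cpu_list_to_mask(cpu_list):
    """Turn a list such as "0-1,3-63" into an integer bitmask."""
    mask = 0
    for part in cpu_list.split(","):
        first, _, last = part.partition("-")
        # A single id such as "2" has no upper bound, so reuse the lower one.
        for cpu in range(int(first), int(last or first) + 1):
            mask |= 1 << cpu
    return mask


def proc_format(mask, nr_cpus=NR_CPUS):
    """Format a bitmask as comma-separated 32-bit hex words, MSB first."""
    words = (nr_cpus + 31) // 32
    return ",".join(
        f"{(mask >> (32 * w)) & 0xFFFFFFFF:08x}" for w in reversed(range(words))
    )


original = cpu_list_to_mask("36-39")        # mask passed to taskset
after = cpu_list_to_mask("0-31,40-63")      # mask seen after the double toggle
all_cpus = (1 << NR_CPUS) - 1
excluded = all_cpus & ~after                # cpus the task can no longer use

if __name__ == "__main__":
    print(f"{original:x}")          # value for taskset -p
    print(proc_format(original))    # Cpus_allowed form of the same mask
    print(proc_format(excluded))    # cpus missing after the toggle
```

    The excluded set covers cpus 32-39: the four cpus the task was
    pinned to plus the four smt cpus with lower ids.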

    With this patch applied, both of these cases end in the following state:

    # grep -i cpu /proc/$$/status
    Cpus_allowed: ffffffff,ffffffff
    Cpus_allowed_list: 0-63

    The original policy is discarded. Though not ideal, it is the simplest way
    to restore sanity to this fallback case without reinventing the cpuset
    wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
    for the previous affinity mask to be restored in this fallback case can use
    that mechanism instead.

    This patch modifies scheduler behavior by instead resetting the mask to
    task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
    mode. I tested the cases above on both modes.
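
    The policy change can be modeled with plain sets. This is a toy
    model of the description above, not the kernel implementation: the
    function names and the legacy_mode flag are hypothetical, and the
    cpuset is assumed to be the root cpuset covering cpus 0-63.

```python
def fallback_before(effective_cpus):
    """Old behavior: reset the task mask to the cpuset's effective cpus."""
    return set(effective_cpus)


def fallback_after(cpuset_cpus_allowed, cpu_possible, legacy_mode=False):
    """New behavior: cpuset cpus_allowed by default, cpu_possible in legacy mode."""
    return set(cpu_possible) if legacy_mode else set(cpuset_cpus_allowed)


# Assumed state at the moment cpu2 goes offline on a 64-cpu system.
cpu_possible = set(range(64))
cpus_allowed = set(range(64))        # configured cpus of the cpuset
effective_cpus = cpus_allowed - {2}  # cpu2 is offline, so not effective

old_mask = fallback_before(effective_cpus)
new_mask = fallback_after(cpus_allowed, cpu_possible)
legacy_mask = fallback_after(cpus_allowed, cpu_possible, legacy_mode=True)

if __name__ == "__main__":
    print(2 in old_mask)   # whether cpu2 is usable once it comes back online
    print(2 in new_mask)
```

    In this model the old mask permanently loses cpu2 because it was
    copied while the cpu was offline, while the new mask still contains
    it once the cpu returns.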

    Note that the scheduler uses this fallback mechanism if and only if
    _every_ other valid avenue has been traveled, and it is the last resort
    before calling BUG().

    Suggested-by: Waiman Long
    Suggested-by: Phil Auld
    Signed-off-by: Joel Savitz
    Acked-by: Phil Auld
    Acked-by: Waiman Long
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Joel Savitz
     

11 Jun, 2019

2 commits


10 Jun, 2019

1 commit

  • There's some discussion on how best to do this, and Tejun prefers
    that BFQ just create the file itself instead of having cgroups support
    a symlink feature.

    Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
    can be done properly for 5.3.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Jun, 2019

1 commit


07 Jun, 2019

1 commit


06 Jun, 2019

1 commit

  • b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") introduced
    css_task_iter_skip() which is used to fix task iterations skipping
    dying threadgroup leaders with live threads. Skipping is implemented
    as a subportion of full advancing, but css_task_iter_next() forgot to
    fully advance a skipped iterator before determining the next task to
    visit, causing it to return invalid task pointers.

    Fix it by making css_task_iter_next() fully advance the iterator if it
    has been skipped since the previous iteration.

    Signed-off-by: Tejun Heo
    Reported-by: syzbot
    Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
    Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")

    Tejun Heo