22 Mar, 2020

1 commit

  • Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above mentioned commit.
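
    The effect of the split can be sketched in plain C. This is only a
    userspace model under stated assumptions: the `is_x86_64` flag, the
    `sync_page_tables()` helper and the `sync_work_done` counter are
    hypothetical stand-ins and not the kernel implementation.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical counter: how often the expensive sync has run. */
    static int sync_work_done;

    /* Stand-in for the costly walk over the page tables. */
    static void sync_page_tables(void)
    {
        sync_work_done++;
    }

    /* New mappings are synchronized on every architecture in this model. */
    static void vmalloc_sync_mappings(bool is_x86_64)
    {
        (void)is_x86_64;
        sync_page_tables();
    }

    /* Unmappings are only synchronized where the architecture needs it. */
    static void vmalloc_sync_unmappings(bool is_x86_64)
    {
        if (!is_x86_64)
            sync_page_tables();
    }
    ```

    In this model the vunmap() path would call only
    vmalloc_sync_unmappings(), which does no work when `is_x86_64` is true.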

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Joerg Roedel
     

16 Mar, 2020

2 commits


13 Mar, 2020

1 commit

  • Pull networking fixes from David Miller:
    "It looks like a decent sized set of fixes, but a lot of these are one
    liner off-by-one and similar type changes:

    1) Fix netlink header pointer to calculate bad attribute offset
    reported to user. From Pablo Neira Ayuso.

    2) Don't double clear PHY interrupts when ->did_interrupt is set,
    from Heiner Kallweit.

    3) Add missing validation of various (devlink, nl802154, fib, etc.)
    attributes, from Jakub Kicinski.

    4) Missing *pos increments in various netfilter seq_next ops, from
    Vasily Averin.

    5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

    6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

    7) Work around FMAN erratum A050385, from Madalin Bucur.

    8) Make sure ARP header is pulled early enough in bonding driver,
    from Eric Dumazet.

    9) Do a cond_resched() during multicast processing of ipvlan and
    macvlan, from Mahesh Bandewar.

    10) Don't attach cgroups to unrelated sockets when in interrupt
    context, from Shakeel Butt.

    11) Fix tpacket ring state management when encountering unknown GSO
    types. From Willem de Bruijn.

    12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
    only in the suspend context. From Heiner Kallweit"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
    net: systemport: fix index check to avoid an array out of bounds access
    tc-testing: add ETS scheduler to tdc build configuration
    net: phy: fix MDIO bus PM PHY resuming
    net: hns3: clear port base VLAN when unload PF
    net: hns3: fix RMW issue for VLAN filter switch
    net: hns3: fix VF VLAN table entries inconsistent issue
    net: hns3: fix "tc qdisc del" failed issue
    taprio: Fix sending packets without dequeueing them
    net: mvmdio: avoid error message for optional IRQ
    net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
    net: memcg: fix lockdep splat in inet_csk_accept()
    s390/qeth: implement smarter resizing of the RX buffer pool
    s390/qeth: refactor buffer pool code
    s390/qeth: use page pointers to manage RX buffer pool
    seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
    net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
    net/packet: tpacket_rcv: do not increment ring index on drop
    sxgbe: Fix off by one in samsung driver strncpy size arg
    net: caif: Add lockdep expression to RCU traversal primitive
    MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
    ...

    Linus Torvalds
     

12 Mar, 2020

2 commits

  • Pull thread fix from Christian Brauner:
    "This contains a single fix for a regression which was introduced when
    we introduced the ability to select a specific pid at process creation
    time.

    When this feature is requested, the error value will be set to -EPERM
    after exiting the pid allocation loop. This caused EPERM to be
    returned when e.g. the init process/child subreaper of the pid
    namespace has already died where we used to return ENOMEM before.

    The first patch here simply fixes the regression by unconditionally
    setting the return value back to ENOMEM again once we've successfully
    allocated the requested pid number. This should be easy to backport to
    v5.5.

    The second patch adds a comment explaining that we must keep returning
    ENOMEM since we've been doing it for a long time and have explicitly
    documented this behavior for userspace. This seemed worthwhile because
    we now have at least two separate examples where people tried to change
    the return value to something other than ENOMEM (The first version of
    the regression fix did that too and the commit message links to an
    earlier patch that tried to do the same.).

    I have a simple regression test to make sure we catch this regression
    in the future but since that introduces a whole new selftest subdir
    and test files I'll keep this for v5.7"

    * tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    pid: make ENOMEM return value more obvious
    pid: Fix error return value in some cases

    Linus Torvalds
     
  • Pull ftrace fix from Steven Rostedt:
    "Have ftrace lookup_rec() return a consistent record otherwise it can
    break live patching"

    * tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Return the first found result in lookup_rec()

    Linus Torvalds
     

11 Mar, 2020

4 commits

  • It appears that ip ranges can overlap. In that case lookup_rec()
    returns whatever result it got last, even if it found nothing in the
    last searched page.

    This breaks an obscure livepatch late module patching usecase:
    - load livepatch
    - load the patched module
    - unload livepatch
    - try to load livepatch again

    To fix this, return from lookup_rec() as soon as it finds the record
    containing the searched-for ip. This is how the code behaved prior to
    the introduction of lookup_rec().
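
    A minimal sketch of the difference, assuming simplified stand-ins for
    the record pages (`struct rec`, `struct page`, `lookup()` and the
    `stop_at_first` switch are hypothetical and only illustrate the control
    flow, not the real ftrace structures):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct rec { unsigned long ip; };
    struct page { const struct rec *recs; size_t count; };

    /* Linear search of one page; returns NULL when ip is not recorded. */
    static const struct rec *search_page(const struct page *pg, unsigned long ip)
    {
        for (size_t i = 0; i < pg->count; i++)
            if (pg->recs[i].ip == ip)
                return &pg->recs[i];
        return NULL;
    }

    /* A page is a candidate if ip lies between its first and last record. */
    static bool page_covers(const struct page *pg, unsigned long ip)
    {
        return pg->count > 0 && ip >= pg->recs[0].ip &&
               ip <= pg->recs[pg->count - 1].ip;
    }

    /*
     * With stop_at_first == false a later candidate page that misses
     * overwrites an earlier hit; with true the first hit is returned.
     */
    static const struct rec *lookup(const struct page *pages, size_t n,
                                    unsigned long ip, bool stop_at_first)
    {
        const struct rec *rec = NULL;

        for (size_t i = 0; i < n; i++) {
            if (!page_covers(&pages[i], ip))
                continue;
            rec = search_page(&pages[i], ip);
            if (rec && stop_at_first)
                break;
        }
        return rec;
    }
    ```

    With two pages whose ip ranges overlap, the non-stopping variant can
    lose a record that the first page actually contains.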

    Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com

    Cc: stable@vger.kernel.org
    Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
    Signed-off-by: Artem Savkov
    Signed-off-by: Steven Rostedt (VMware)

    Artem Savkov
     
  • We are testing network memory accounting in our setup and noticed
    inconsistent network memory usage and often unrelated cgroups network
    usage correlates with testing workload. On further inspection, it
    seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
    irq context, especially for cgroup v1.

    mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
    and more or less assume that this can only happen from sk_clone_lock()
    and that the source sock object already has an associated cgroup.
    However, in cgroup v1, where network memory accounting is opt-in, the
    source sock may not be associated with any cgroup, and the newly cloned
    sock can get associated with the unrelated cgroup of the interrupted task.

    Cgroup v2 can also suffer if the source sock object was created by
    process in the root cgroup or if sk_alloc() is called in irq context.
    The fix is to just do nothing in interrupt.

    WARNING: Please note that about half of the TCP sockets are allocated
    from the IRQ context, so, memory used by such sockets will not be
    accounted by the memcg.

    The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:

    dump_stack+0x57/0x75
    mem_cgroup_sk_alloc+0xe9/0xf0
    sk_clone_lock+0x2a7/0x420
    inet_csk_clone_lock+0x1b/0x110
    tcp_create_openreq_child+0x23/0x3b0
    tcp_v6_syn_recv_sock+0x88/0x730
    tcp_check_req+0x429/0x560
    tcp_v6_rcv+0x72d/0xa40
    ip6_protocol_deliver_rcu+0xc9/0x400
    ip6_input+0x44/0xd0
    ? ip6_protocol_deliver_rcu+0x400/0x400
    ip6_rcv_finish+0x71/0x80
    ipv6_rcv+0x5b/0xe0
    ? ip6_sublist_rcv+0x2e0/0x2e0
    process_backlog+0x108/0x1e0
    net_rx_action+0x26b/0x460
    __do_softirq+0x104/0x2a6
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.19+0x40/0x50
    __local_bh_enable_ip+0x51/0x60
    ip6_finish_output2+0x23d/0x520
    ? ip6table_mangle_hook+0x55/0x160
    __ip6_finish_output+0xa1/0x100
    ip6_finish_output+0x30/0xd0
    ip6_output+0x73/0x120
    ? __ip6_finish_output+0x100/0x100
    ip6_xmit+0x2e3/0x600
    ? ipv6_anycast_cleanup+0x50/0x50
    ? inet6_csk_route_socket+0x136/0x1e0
    ? skb_free_head+0x1e/0x30
    inet6_csk_xmit+0x95/0xf0
    __tcp_transmit_skb+0x5b4/0xb20
    __tcp_send_ack.part.60+0xa3/0x110
    tcp_send_ack+0x1d/0x20
    tcp_rcv_state_process+0xe64/0xe80
    ? tcp_v6_connect+0x5d1/0x5f0
    tcp_v6_do_rcv+0x1b1/0x3f0
    ? tcp_v6_do_rcv+0x1b1/0x3f0
    __release_sock+0x7f/0xd0
    release_sock+0x30/0xa0
    __inet_stream_connect+0x1c3/0x3b0
    ? prepare_to_wait+0xb0/0xb0
    inet_stream_connect+0x3b/0x60
    __sys_connect+0x101/0x120
    ? __sys_getsockopt+0x11b/0x140
    __x64_sys_connect+0x1a/0x20
    do_syscall_64+0x51/0x200
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Signed-off-by: David S. Miller

    Shakeel Butt
     
  • Pull cgroup fixes from Tejun Heo:

    - cgroup.procs listing related fixes.

    It didn't interlock properly with exiting tasks leaving a short
    window where a cgroup has empty cgroup.procs but still can't be
    removed and misbehaved on short reads.

    - psi_show() crash fix on 32bit ino archs

    - Empty release_agent handling fix

    * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup1: don't call release_agent when it is ""
    cgroup: fix psi_show() crash on 32bit ino archs
    cgroup: Iterate tasks that did not finish do_exit()
    cgroup: cgroup_procs_next should increase position index
    cgroup-v1: cgroup_pidlist_next should update position index

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "Workqueue has been incorrectly round-robining per-cpu work items.
    Hillf's patch fixes that.

    The other patch documents memory-ordering properties of workqueue
    operations"

    * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: don't use wq_select_unbound_cpu() for bound works
    workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()

    Linus Torvalds
     

10 Mar, 2020

3 commits

  • wq_select_unbound_cpu() is designed for unbound workqueues only, but
    it's wrongly called when using a bound workqueue too.

    Fixing this ensures work queued to a bound workqueue with
    cpu=WORK_CPU_UNBOUND always runs on the local CPU.

    Before, that would happen only if wq_unbound_cpumask happened to include
    it (likely almost always the case), or was empty, or we got lucky with
    forced round-robin placement. So restricting
    /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
    CPUs would cause some bound work items to run unexpectedly there.
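
    The intended CPU selection can be sketched as follows. This is a
    simplified model, not the kernel code: `CPU_UNBOUND`, `pick_cpu()`, the
    32-bit `mask` and the "first allowed CPU" fallback are assumptions made
    for illustration.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define CPU_UNBOUND (-1) /* hypothetical stand-in for WORK_CPU_UNBOUND */

    /*
     * Simplified unbound selection: keep the local CPU if the mask is empty
     * or allows it, otherwise fall back to the first allowed CPU.
     */
    static int select_unbound_cpu(int local_cpu, unsigned int mask)
    {
        assert(local_cpu >= 0 && local_cpu < 32);
        if (mask == 0 || (mask & (1u << local_cpu)))
            return local_cpu;
        for (int cpu = 0; cpu < 32; cpu++)
            if (mask & (1u << cpu))
                return cpu;
        return local_cpu;
    }

    /* Only unbound workqueues consult the unbound cpumask. */
    static int pick_cpu(bool wq_unbound, int req_cpu, int local_cpu,
                        unsigned int unbound_mask)
    {
        if (req_cpu != CPU_UNBOUND)
            return req_cpu;
        if (wq_unbound)
            return select_unbound_cpu(local_cpu, unbound_mask);
        return local_cpu; /* bound workqueue: always the local CPU */
    }
    ```

    With a restricted mask such as 0x3, a bound work item queued from CPU 5
    stays on CPU 5 in this model, while an unbound one is moved to CPU 0.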

    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Cc: stable@vger.kernel.org # v4.5+
    Signed-off-by: Hillf Danton
    [dj: massage changelog]
    Signed-off-by: Daniel Jordan
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Tejun Heo

    Hillf Danton
     
  • The alloc_pid() codepath used to be simpler. With the introduction of the
    ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
    support setting a PID") it got more complex. It hasn't been super obvious
    that ENOMEM is returned when the init process/child subreaper of the pid
    namespace has died, as can be seen from multiple attempts to improve
    this, see e.g. [1] and most recently [2].
    We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
    comment on top explaining that this is historic and documented behavior and
    cannot easily be changed.

    [1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
    [2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
    [3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The recent futex inode life time fix changed the ordering of the futex key
    union struct members, but forgot to adjust the hash function accordingly.

    As a result the hashing omits the leading 64bit and even hashes beyond the
    futex key causing a bad hash distribution which led to a ~100% performance
    regression.

    Hand in the futex key pointer instead of a random struct member and make
    the size calculation based on the struct offset.
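
    The idea of hashing from the start of the key up to a member offset can
    be sketched like this. The `struct key` layout and its field names are
    hypothetical, and FNV-1a is used here purely as a stand-in for the
    kernel's hash function.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical key layout; only the position of 'offset' matters. */
    struct key {
        uint64_t i_seq;
        uint64_t pgoff;
        uint32_t offset;
    };

    /* FNV-1a over 32-bit words, seeded with an extra value. */
    static uint32_t hash_words(const uint32_t *w, size_t n, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ seed;

        for (size_t i = 0; i < n; i++) {
            h ^= w[i];
            h *= 16777619u;
        }
        return h;
    }

    /*
     * Hash every word that precedes 'offset', starting at the key itself,
     * so reordering the leading members cannot silently shift the range.
     */
    static uint32_t hash_key(const struct key *k)
    {
        uint32_t words[offsetof(struct key, offset) / sizeof(uint32_t)];

        memcpy(words, k, sizeof(words));
        return hash_words(words, sizeof(words) / sizeof(words[0]), k->offset);
    }
    ```

    Because the length is derived with offsetof() and the start is the key
    pointer, the leading 64 bits are always part of the hashed range.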

    Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue")
    Reported-by: Rong Chen
    Decoded-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Tested-by: Rong Chen
    Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de

    Thomas Gleixner
     

08 Mar, 2020

2 commits

  • Recent changes to alloc_pid() allow the pid number to be specified at
    process creation time. If set_tid_size is set, then the code scanning the
    levels will hard-set retval to -EPERM, overriding its previous -ENOMEM
    value.

    After the code scanning the levels, there are error returns that do not
    set retval, assuming it is still set to -ENOMEM.

    So set retval back to -ENOMEM after scanning the levels.
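
    The error flow can be modelled roughly as below. This is a sketch, not
    the alloc_pid() source: `alloc_pid_model()` and its parameters are
    hypothetical, and `reset_retval` toggles the fix so both behaviours can
    be compared.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>

    /*
     * Model: scan 'levels' namespace levels, then hit a later error path
     * that returns retval without assigning it.
     */
    static int alloc_pid_model(int levels, bool set_tid_requested,
                               bool later_step_fails, bool reset_retval)
    {
        int retval = -ENOMEM;

        for (int i = 0; i < levels; i++) {
            if (set_tid_requested)
                retval = -EPERM; /* default error for the permission check */
            /* ... the checks and the pid allocation succeed here ... */
        }

        if (reset_retval)
            retval = -ENOMEM; /* the fix: restore the documented error */

        if (later_step_fails)
            return retval; /* e.g. the pid namespace init has died */

        return 0;
    }
    ```

    Without the reset, a request that sets a pid leaks -EPERM into the later
    error path; with it, the caller sees -ENOMEM again.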

    Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Corey Minyard
    Acked-by: Christian Brauner
    Cc: Andrei Vagin
    Cc: Dmitry Safonov
    Cc: Oleg Nesterov
    Cc: Adrian Reber
    Cc: # 5.5
    Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
    [christian.brauner@ubuntu.com: fixup commit message]
    Signed-off-by: Christian Brauner

    Corey Minyard
     
  • Pull block fixes from Jens Axboe:
    "Here are a few fixes that should go into this release. This contains:

    - Revert of a bad bcache patch from this merge window

    - Removed unused function (Daniel)

    - Fixup for the blktrace fix from Jan from this release (Cengiz)

    - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

    * tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
    block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
    blktrace: fix dereference after null check
    Revert "bcache: ignore pending signals when creating gc and allocator thread"
    block: Remove used kblockd_schedule_work_on()

    Linus Torvalds
     

07 Mar, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "Here are a few hopefully uncontroversial fixes:

    - Use RCU_INIT_POINTER() when initializing rcu protected members in
    task_struct to fix sparse warnings.

    - Add pidfd_fdinfo_test binary to .gitignore file"

    * tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
    selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
    exit: Fix Sparse errors and warnings
    fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()

    Linus Torvalds
     

06 Mar, 2020

2 commits

  • As reported by Jann, ihold() does not in fact guarantee inode
    persistence. And instead of making it so, replace the usage of inode
    pointers with a per boot, machine wide, unique inode identifier.

    This sequence number is global, but shared (file backed) futexes are
    rare enough that this should not become a performance issue.

    Reported-by: Jann Horn
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)

    Peter Zijlstra
     
  • There was a recent change in blktrace.c that added RCU protection to
    `q->blk_trace` in order to fix a use-after-free issue during access.

    However, the change missed an edge case that can lead to dereferencing
    the `bt` pointer even when it's NULL. The Coverity static analyzer
    marked this as a FORWARD_NULL issue with CID 1460458:

    ```
    /kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
    1898         ret = 0;
    1899         if (bt == NULL)
    1900                 ret = blk_trace_setup_queue(q, bdev);
    1901
    1902         if (ret == 0) {
    1903                 if (attr == &dev_attr_act_mask)
    >>>     CID 1460458: Null pointer dereferences (FORWARD_NULL)
    >>>     Dereferencing null pointer "bt".
    1904                         bt->act_mask = value;
    1905                 else if (attr == &dev_attr_pid)
    1906                         bt->pid = value;
    1907                 else if (attr == &dev_attr_start_lba)
    1908                         bt->start_lba = value;
    1909                 else if (attr == &dev_attr_end_lba)
    ```

    Added a reassignment with RCU annotation to fix the issue.
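
    The shape of the fix can be illustrated with a small userspace model.
    The types and helpers below (`struct trace`, `struct queue`,
    `setup_queue()`, `store_act_mask()`) are hypothetical, and the RCU
    annotation used in the kernel is deliberately left out.

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct trace { int act_mask; };
    struct queue { struct trace *blk_trace; };

    /* Static storage standing in for a freshly allocated trace object. */
    static struct trace the_trace;

    /* Installs a trace object on the queue; always succeeds in this model. */
    static int setup_queue(struct queue *q)
    {
        q->blk_trace = &the_trace;
        return 0;
    }

    static int store_act_mask(struct queue *q, int value)
    {
        struct trace *bt = q->blk_trace;
        int ret = 0;

        if (bt == NULL) {
            ret = setup_queue(q);
            bt = q->blk_trace; /* re-read: the local copy was still NULL */
        }

        if (ret == 0)
            bt->act_mask = value;
        return ret;
    }
    ```

    Without the re-read after setup, `bt` would still hold the NULL value
    sampled before the trace object was installed.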

    Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Cengiz Can
    Signed-off-by: Jens Axboe

    Cengiz Can
     

05 Mar, 2020

2 commits

  • Older (and maybe current) versions of systemd set release_agent to "" when
    shutting down, but do not set notify_on_release to 0.

    Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate
    call_usermodehelper()"), we filter out such calls when the user mode helper
    path is "". However, when used in conjunction with an actual (i.e. non "")
    STATIC_USERMODEHELPER, the path is never "", so the real usermode helper
    will be called with argv[0] == "".

    Let's avoid this by not invoking the release_agent when it is "".
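
    As a rough illustration, the check amounts to something like the
    following; `should_run_release_agent()` is a hypothetical helper name
    and not a function from the cgroup code.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Decide whether there is a release agent worth executing. */
    static bool should_run_release_agent(const char *agent_path)
    {
        /* An unset or empty path means there is nothing to execute. */
        return agent_path != NULL && agent_path[0] != '\0';
    }
    ```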

    Signed-off-by: Tycho Andersen
    Signed-off-by: Tejun Heo

    Tycho Andersen
     
  • Similar to the commit d7495343228f ("cgroup: fix incorrect
    WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) is not
    equal to 1 on 32bit ino archs, which triggers all sorts of issues with
    psi_show() on s390x. For example,

    BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
    Read of size 4 at addr 000000001e0ce000 by task read_all/3667
    collect_percpu_times+0x2d0/0x798
    psi_show+0x7c/0x2a8
    seq_read+0x2ac/0x830
    vfs_read+0x92/0x150
    ksys_read+0xe2/0x188
    system_call+0xd8/0x2b4

    Fix it by using cgroup_ino().

    Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
    Signed-off-by: Qian Cai
    Acked-by: Johannes Weiner
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v5.5

    Qian Cai
     

04 Mar, 2020

1 commit

  • The sysinfo() syscall includes uptime in seconds but has no correction for
    time namespaces which makes it inconsistent with the /proc/uptime inside of
    a time namespace.

    Add the missing time namespace adjustment call.

    Signed-off-by: Cyril Hrubis
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Dmitry Safonov
    Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz

    Cyril Hrubis
     

02 Mar, 2020

1 commit


29 Feb, 2020

2 commits

  • Pull block fixes from Jens Axboe:

    - Passthrough insertion fix (Ming)

    - Kill off some unused arguments (John)

    - blktrace RCU fix (Jan)

    - Dead fields removal for null_blk (Dongli)

    - NVMe polled IO fix (Bijan)

    * tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
    nvme-pci: Hold cq_poll_lock while completing CQEs
    blk-mq: Remove some unused function arguments
    null_blk: remove unused fields in 'nullb_cmd'
    blktrace: Protect q->blk_trace with RCU
    blk-mq: insert passthrough request into hctx->dispatch directly

    Linus Torvalds
     
  • Pull power management fixes from Rafael Wysocki:
    "Fix a recent cpufreq initialization regression (Rafael Wysocki),
    revert a devfreq commit that made incompatible changes and broke user
    land on some systems (Orson Zhai), drop a stale reference to a
    document that has gone away recently (Jonathan Neuschäfer), and fix a
    typo in a hibernation code comment (Alexandre Belloni)"

    * tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: Fix policy initialization for internal governor drivers
    Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
    PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
    Documentation: power: Drop reference to interface.rst

    Linus Torvalds
     

28 Feb, 2020

4 commits

  • This patch fixes the following sparse error:
    kernel/exit.c:627:25: error: incompatible types in comparison expression

    And the following warning:
    kernel/exit.c:626:40: warning: incorrect type in assignment

    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    [christian.brauner@ubuntu.com: edit commit message]
    Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Christian Brauner

    Madhuparna Bhowmik
     
  • Use RCU_INIT_POINTER() instead of rcu_access_pointer() in
    copy_sighand().

    Suggested-by: Oleg Nesterov
    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    [christian.brauner@ubuntu.com: edit commit message]
    Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Christian Brauner

    Madhuparna Bhowmik
     
  • * pm-sleep:
    PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
    Documentation: power: Drop reference to interface.rst

    * pm-devfreq:
    Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"

    Rafael J. Wysocki
     
  • Pull audit fixes from Paul Moore:
    "Two fixes for problems found by syzbot:

    - Moving audit filter structure fields into a union caused some
    problems in the code which populates that filter structure.

    We keep the union (that idea is a good one), but we are fixing the
    code so that it doesn't needlessly set fields in the union and mess
    up the error handling.

    - The audit_receive_msg() function wasn't validating user input as
    well as it should in all cases, we add the necessary checks"

    * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: always check the netlink payload length in audit_receive_msg()
    audit: fix error handling in audit_data_to_entry()

    Linus Torvalds
     

27 Feb, 2020

3 commits

  • sgs->group_weight is not set while gathering statistics in
    update_sg_wakeup_stats(). This means that a group can be classified as
    fully busy with 0 running tasks if utilization is high enough.

    This path is mainly used for fork and exec.

    Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
    Signed-off-by: Vincent Guittot
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Link: https://lore.kernel.org/r/20200218144534.4564-1-vincent.guittot@linaro.org

    Vincent Guittot
     
  • Pull tracing and bootconfig updates:
    "Fixes and changes to bootconfig before it goes live in a release.

    Change in API of bootconfig (before it comes live in a release):
    - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
    exists
    - Set CONFIG_BOOT_CONFIG to 'n' by default
    - Show error if "bootconfig" on cmdline but not compiled in
    - Prevent redefining the same value
    - Have a way to append values
    - Added a SELECT BLK_DEV_INITRD to fix a build failure

    Synthetic event fixes:
    - Switch to raw_smp_processor_id() for recording CPU value in preempt
    section. (No care for what the value actually is)
    - Fix samples always recording u64 values
    - Fix endianness
    - Check number of values matches number of fields
    - Fix a printing bug

    Fix of trace_printk() breaking postponed start up tests

    Make a function static that is only used in a single file"

    * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
    bootconfig: Add append value operator support
    bootconfig: Prohibit re-defining value on same key
    bootconfig: Print array as multiple commands for legacy command line
    bootconfig: Reject subkey and value on same parent key
    tools/bootconfig: Remove unneeded error message silencer
    bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
    bootconfig: Set CONFIG_BOOT_CONFIG=n by default
    tracing: Clear trace_state when starting trace
    bootconfig: Mark boot_config_checksum() static
    tracing: Disable trace_printk() on post poned tests
    tracing: Have synthetic event test use raw_smp_processor_id()
    tracing: Fix number printing bug in print_synth_event()
    tracing: Check that number of vals matches number of synth event fields
    tracing: Make synth_event trace functions endian-correct
    tracing: Make sure synth_event_trace() example always uses u64

    Linus Torvalds
     
  • When queueing a signal, we increment both the users count of pending
    signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
    of the user struct itself (because we keep a reference to the user in
    the signal structure in order to correctly account for it when freeing).

    That turns out to be fairly expensive, because both of them are atomic
    updates, and particularly under extreme signal handling pressure on big
    machines, you can get a lot of cache contention on the user struct.
    That can then cause horrid cacheline ping-pong when you do these
    multiple accesses.

    So change the reference counting to only pin the user for the _first_
    pending signal, and to unpin it when the last pending signal is
    dequeued. That means that when a user sees a lot of concurrent signal
    queuing - which is the only situation when this matters - the only
    atomic access needed is generally the 'sigpending' count update.
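
    A minimal sketch of this counting scheme using C11 atomics. The
    `struct user` layout and the `queue_signal()` / `dequeue_signal()`
    helpers are hypothetical and only model the two counters described
    above.

    ```c
    #include <assert.h>
    #include <stdatomic.h>

    struct user {
        atomic_int refcount;   /* lifetime of the user struct */
        atomic_int sigpending; /* number of queued signals */
    };

    static void queue_signal(struct user *u)
    {
        /* fetch_add returns the old value: 0 means first pending signal. */
        if (atomic_fetch_add(&u->sigpending, 1) == 0)
            atomic_fetch_add(&u->refcount, 1); /* pin the user once */
    }

    static void dequeue_signal(struct user *u)
    {
        /* fetch_sub returns the old value: 1 means last pending signal. */
        if (atomic_fetch_sub(&u->sigpending, 1) == 1)
            atomic_fetch_sub(&u->refcount, 1); /* unpin the user */
    }
    ```

    While several signals are pending, queueing and dequeueing touch only
    `sigpending`; `refcount` changes on the 0 to 1 and 1 to 0 transitions.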

    This was noticed because of a particularly odd timing artifact on a
    dual-socket 96C/192T Cascade Lake platform: when you get into bad
    contention, it for some reason seems to be much worse on that machine
    when the contention happens in the upper 32-byte half of the cacheline.

    As a result, the kernel test robot will-it-scale 'signal1' benchmark had
    an odd performance regression simply due to random alignment of the
    'struct user_struct' (and pointed to a completely unrelated and
    apparently nonsensical commit for the regression).

    Avoiding the double increments (and decrements on the dequeueing side,
    of course) makes for much less contention and hugely improved
    performance on that will-it-scale microbenchmark.

    Quoting Feng Tang:

    "It makes a big difference, that the performance score is tripled! bump
    from original 17000 to 54000. Also the gap between 5.0-rc6 and
    5.0-rc6+Jiri's patch is reduced to around 2%"

    [ The "2% gap" is the odd cacheline placement difference on that
    platform: under the extreme contention case, the effect of which half
    of the cacheline was hot was 5%, so with the reduced contention the
    odd timing artifact is reduced too ]

    It does help in the non-contended case too, but is not nearly as
    noticeable.

    Reported-and-tested-by: Feng Tang
    Cc: Eric W. Biederman
    Cc: Huang, Ying
    Cc: Philip Li
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Feb, 2020

1 commit

  • Commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by
    default") also changed CONFIG_BOOTTIME_TRACING to select
    CONFIG_BOOT_CONFIG in order to show boot-time tracing on the menu.
    This introduced a wrong dependency on BLK_DEV_INITRD, as shown below.

    WARNING: unmet direct dependencies detected for BOOT_CONFIG
    Depends on [n]: BLK_DEV_INITRD [=n]
    Selected by [y]:
    - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

    Make CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this
    error, and make CONFIG_BOOTTIME_TRACING=n by default, so that both
    boot-time tracing and boot configuration are off by default but
    still appear on the menu list.

    Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2

    Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
    Reported-by: Randy Dunlap
    Compiled-tested-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

25 Feb, 2020

2 commits

  • KASAN is reporting that __blk_add_trace() has a use-after-free issue
    when accessing q->blk_trace. Indeed the switching of block tracing (and
    thus eventual freeing of q->blk_trace) is completely unsynchronized with
    the currently running tracing and thus it can happen that the blk_trace
    structure is being freed just while __blk_add_trace() works on it.
    Protect accesses to q->blk_trace by RCU during tracing and make sure we
    wait for the end of the RCU grace period when shutting down tracing.
    Luckily that is a rare enough event that we can afford that. Note that
    postponing the freeing of blk_trace to an RCU callback is better avoided,
    as it could have unexpected user visible side-effects: debugfs files
    would still exist for a short while after block tracing has been shut
    down.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
    CC: stable@vger.kernel.org
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Ming Lei
    Tested-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reported-by: Tristan Madani
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • This patch ensures that we always check the netlink payload length
    in audit_receive_msg() before we take any action on the payload
    itself.
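
    The general pattern of validating a length before touching the payload
    can be sketched as follows. The `struct status_payload` type and the
    `read_status()` helper are hypothetical and do not mirror the actual
    audit message layout.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical fixed-size payload that a handler expects. */
    struct status_payload {
        uint32_t mask;
        uint32_t enabled;
    };

    /* Copy the payload out only after checking that enough bytes arrived. */
    static int read_status(const void *data, size_t data_len,
                           struct status_payload *out)
    {
        if (data == NULL || data_len < sizeof(*out))
            return -EINVAL;
        memcpy(out, data, sizeof(*out));
        return 0;
    }
    ```

    A short message is rejected before any field is read, so the handler
    never acts on bytes the sender did not provide.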

    Cc: stable@vger.kernel.org
    Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
    Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore

    Paul Moore
     

23 Feb, 2020

3 commits

  • Commit 219ca39427bf ("audit: use union for audit_field values since
    they are mutually exclusive") combined a number of separate fields in
    the audit_field struct into a single union. This generally worked
    fine because the fields are mutually exclusive.
    Unfortunately, in audit_data_to_entry() the overlap can be a problem
    when a specific error case is triggered that causes the error path
    code to attempt to clean up an audit_field struct, and the cleanup
    involves attempting to free a stored LSM string (the lsm_str field).
    Currently the code always has a non-NULL value in the
    audit_field.lsm_str field, as the top of the for-loop transfers a
    value into audit_field.val (both .lsm_str and .val are part of the
    same union); if audit_data_to_entry() fails and the audit_field
    struct is specified to contain an LSM string, but
    audit_field.lsm_str has not yet been properly set, the error handling
    code will attempt to free the bogus audit_field.lsm_str value that
    was set via audit_field.val at the top of the for-loop.

    This patch corrects this by ensuring that the audit_field.val is only
    set when needed (it is cleared when the audit_field struct is
    allocated with kcalloc()). It also corrects a few other issues to
    ensure that in case of error the proper error code is returned.

    Cc: stable@vger.kernel.org
    Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
    Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore
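
The union overlap described above can be illustrated with a small userspace sketch. It is a simplification under stated assumptions: `struct field`, `fill_buggy()`, `fill_fixed()` and `cleanup_target()` are hypothetical stand-ins, not the kernel's audit_field or audit_data_to_entry(), and the sketch assumes a typical platform where an all-zero pointer is NULL and a nonzero integer aliases to a non-NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a field whose value and string share storage. */
struct field {
    int is_lsm;              /* nonzero if the union is meant to hold lsm_str */
    union {
        uintptr_t val;       /* numeric value */
        char *lsm_str;       /* heap string that the error path would free */
    };
};

/* Buggy pattern: val is always written, so lsm_str never reads as NULL. */
void fill_buggy(struct field *f, uintptr_t raw, int is_lsm)
{
    assert(f != NULL);
    f->is_lsm = is_lsm;
    f->val = raw;
}

/* Fixed pattern: val is written only for fields that are not strings. */
void fill_fixed(struct field *f, uintptr_t raw, int is_lsm)
{
    assert(f != NULL);
    f->is_lsm = is_lsm;
    if (!is_lsm)
        f->val = raw;
}

/* The pointer an error path would hand to free(), or NULL if none. */
const char *cleanup_target(const struct field *f)
{
    assert(f != NULL);
    return f->is_lsm ? f->lsm_str : NULL;
}
```

Starting from a zeroed struct, `fill_buggy(&f, 1234, 1)` leaves `cleanup_target(&f)` pointing at the bogus address 1234, while `fill_fixed()` leaves it NULL until a real string is stored.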

    Paul Moore
     
  • Pull irq fixes from Thomas Gleixner:
    "Two fixes for the irq core code which are follow ups to the recent MSI
    fixes:

    - The WARN_ON which was put into the MSI setaffinity callback for
    paranoia reasons actually triggered via a callchain which escaped
    when all the possible ways to reach that code were analyzed.

    The /proc/irq/$N/*affinity interfaces have a quirk which came in
    when ALPHA moved to the generic interface: if the written
    affinity mask does not contain any online CPU, it calls into ALPHA's
    magic auto affinity setting code.

    A few years later this mechanism was also made available to x86 for
    no good reason and in a way which circumvents all sanity checks
    for interrupts which cannot have their affinity set from process
    context on x86 due to the way x86 interrupt delivery works.

    It would be possible to make this work properly, but there is no
    point in doing so. If the interrupt is not yet started then the
    affinity setting has no effect and if it is started already then it
    is already assigned to an online CPU so there is no point to
    randomly move it to some other CPU. Just return EINVAL, as the code
    always did before that change.

    - The new MSI quirk bit in the irq domain flags turned out to be
    already occupied, which escaped the author and the reviewers
    because the already in use bits were 0,6,2,3,4,5 listed in that
    order.

    That bit 6 was simply overlooked because the ordering was otherwise
    straightforwardly linear. So the new bit ended up being a
    duplicate.

    Fix it up by switching the oddball 6 to the obvious 1"

    * tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/irqdomain: Make sure all irq domain flags are distinct
    genirq/proc: Reject invalid affinity masks (again)
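
The duplicate flag bit described above is the kind of mistake a small distinctness check can catch. The sketch below is hypothetical: `flags_are_distinct()` is not a kernel helper, and the flag values are plain bit positions taken from the description (0,6,2,3,4,5 plus a new bit 6), not the real irq domain flag names.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every entry is exactly one bit and no bit is used twice. */
int flags_are_distinct(const unsigned int *flags, size_t n)
{
    unsigned int seen = 0;

    assert(flags != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        unsigned int f = flags[i];

        if (f == 0 || (f & (f - 1)) != 0)   /* zero or more than one bit */
            return 0;
        if (seen & f)                       /* bit already taken */
            return 0;
        seen |= f;
    }
    return 1;
}
```

With the bits in the order from the description and the new flag also on bit 6, the check fails; after moving the oddball 6 to 1 it passes.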

    Linus Torvalds
     
  • Pull s390 fixes from Vasily Gorbik:

    - Remove ieee_emulation_warnings sysctl, which is dead code.

    - Avoid triggering rebuild of the kernel during make install.

    - Enable protected virtualization guest support in default configs.

    - Fix cio_ignore seq_file .next function to increase position index.
    Also use kobj_to_dev instead of container_of in cio code.

    - Fix storage block address lists to contain absolute addresses in qdio
    code.

    - A few clang warning and spelling fixes.

    * tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/qdio: fill SBALEs with absolute addresses
    s390/qdio: fill SL with absolute addresses
    s390: remove obsolete ieee_emulation_warnings
    s390: make 'install' not depend on vmlinux
    s390/kaslr: Fix casts in get_random
    s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
    s390/pkey/zcrypt: spelling s/crytp/crypt/
    s390/cio: use kobj_to_dev() API
    s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
    s390/cio: cio_ignore_proc_seq_next should increase position index
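
The ".next function should increase position index" item above can be sketched with a userspace iterator. This is an analogy only: `table`, `table_start()` and `table_next()` are hypothetical names, not the seq_file API, and the sketch assumes the contract is that the next step advances the index even when it runs off the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data set standing in for what the iterator walks. */
static const int table[] = { 10, 20, 30 };
#define TABLE_LEN (sizeof(table) / sizeof(table[0]))

/* Returns the element at *pos, or NULL if *pos is past the end. */
const int *table_start(size_t *pos)
{
    assert(pos != NULL);
    return *pos < TABLE_LEN ? &table[*pos] : NULL;
}

/*
 * Advances *pos unconditionally, then returns the new element or NULL.
 * Skipping the increment on the last step would leave *pos pointing at
 * an element that has already been reported.
 */
const int *table_next(size_t *pos)
{
    assert(pos != NULL);
    ++*pos;
    return *pos < TABLE_LEN ? &table[*pos] : NULL;
}
```

After a full walk, the position equals the table length, so a later `table_start()` with the same position returns NULL instead of repeating the last element.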

    Linus Torvalds
     

22 Feb, 2020

2 commits

  • Pull networking fixes from David Miller:

    1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

    2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

    3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

    4) NLA_STRING attributes are not necessarily NULL terminated, deal with
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

    5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

    6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

    7) Fix several lockdep warnings in bonding, from Taehee Yoo.

    8) Fix cls_flower port blocking, from Jason Baron.

    9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

    10) Fix RDMA race in qede driver, from Michal Kalderon.

    11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

    12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

    13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

    14) Hey, variables declared in a switch statement before any case
    statements are not initialized. I learn something every day. Get
    rid of this stuff in several parts of the networking code, from Kees
    Cook.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
    bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
    bnxt_en: Improve device shutdown method.
    net: netlink: cap max groups which will be considered in netlink_bind()
    net: thunderx: workaround BGX TX Underflow issue
    ionic: fix fw_status read
    net: disable BRIDGE_NETFILTER by default
    net: macb: Properly handle phylink on at91rm9200
    s390/qeth: fix off-by-one in RX copybreak check
    s390/qeth: don't warn for napi with 0 budget
    s390/qeth: vnicc Fix EOPNOTSUPP precedence
    openvswitch: Distribute switch variables for initialization
    net: ip6_gre: Distribute switch variables for initialization
    net: core: Distribute switch variables for initialization
    udp: rehash on disconnect
    net/tls: Fix to avoid gettig invalid tls record
    bpf: Fix a potential deadlock with bpf_map_do_batch
    bpf: Do not grab the bucket spinlock by default on htab batch ops
    ice: Wait for VF to be reset/ready before configuration
    ice: Don't tell the OS that link is going down
    ice: Don't reject odd values of usecs set by user
    ...
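
Item 14 above refers to a C rule that is easy to miss: a declaration placed before the first case label is in scope for the whole switch body, but its initializer is jumped over and never runs. The sketch below shows one way to restructure such code, with the problematic form kept in a comment. `scale()` is a hypothetical function, not taken from the networking code.

```c
#include <assert.h>   /* for the assert() checks in the usage note */

int scale(int kind, int x)
{
    /*
     * Problematic pattern (not compiled here):
     *
     *     switch (kind) {
     *         int factor = 2;      // in scope, but the initializer never runs
     *     case 1:
     *         return x * factor;   // reads an indeterminate value
     *     }
     *
     * Below, each declaration lives in its own per-case block, so every
     * initializer executes before the variable is read.
     */
    switch (kind) {
    case 1: {
        int factor = 2;
        return x * factor;
    }
    case 2: {
        int factor = 3;
        return x * factor;
    }
    default:
        return x;   /* unknown kinds leave x unchanged */
    }
}
```

For example, `assert(scale(1, 5) == 10);` and `assert(scale(9, 5) == 5);` both hold. Hoisting the declaration above the switch statement is an equally valid fix when the variable is shared between cases.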

    Linus Torvalds
     
  • No users remain, so kill these off before we grow new ones.

    Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Deepa Dinamani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Feb, 2020

1 commit

    Set CONFIG_BOOT_CONFIG=n by default. This also warns the
    user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given
    on the kernel command line.

    Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2

    Suggested-by: Steven Rostedt
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu