09 Jan, 2020

1 commit

  • [ Upstream commit 0b8d616fb5a8ffa307b1d3af37f55c15dae14f28 ]

    When assigning and testing taskstats in taskstats_exit() there's a race
    when setting up and reading sig->stats when a thread-group with more
    than one thread exits:

    write to 0xffff8881157bbe10 of 8 bytes by task 7951 on cpu 0:
    taskstats_tgid_alloc kernel/taskstats.c:567 [inline]
    taskstats_exit+0x6b7/0x717 kernel/taskstats.c:596
    do_exit+0x2c2/0x18e0 kernel/exit.c:864
    do_group_exit+0xb4/0x1c0 kernel/exit.c:983
    get_signal+0x2a2/0x1320 kernel/signal.c:2734
    do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
    exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
    prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
    syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
    do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8881157bbe10 of 8 bytes by task 7949 on cpu 1:
    taskstats_tgid_alloc kernel/taskstats.c:559 [inline]
    taskstats_exit+0xb2/0x717 kernel/taskstats.c:596
    do_exit+0x2c2/0x18e0 kernel/exit.c:864
    do_group_exit+0xb4/0x1c0 kernel/exit.c:983
    __do_sys_exit_group kernel/exit.c:994 [inline]
    __se_sys_exit_group kernel/exit.c:992 [inline]
    __x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
    do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by using smp_load_acquire() and smp_store_release().

    Reported-by: syzbot+c5d03165a1bd1dead0c1@syzkaller.appspotmail.com
    Fixes: 34ec12349c8a ("taskstats: cleanup ->signal->stats allocation")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christian Brauner
    Acked-by: Marco Elver
    Reviewed-by: Will Deacon
    Reviewed-by: Andrea Parri
    Reviewed-by: Dmitry Vyukov
    Link: https://lore.kernel.org/r/20191009114809.8643-1-christian.brauner@ubuntu.com
    Signed-off-by: Sasha Levin

    Christian Brauner
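
A minimal userspace sketch of the publish pattern this fix relies on, assuming C11 atomics as a stand-in for smp_load_acquire()/smp_store_release() and a spin flag as a stand-in for the siglock. The struct and function names are hypothetical and this is not the kernel code itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct stats { long counter; };

struct group {
    atomic_flag lock;                /* stands in for the siglock */
    _Atomic(struct stats *) stats;   /* shared pointer, published once */
};

static void group_lock(struct group *g)
{
    while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
        ;   /* spin until the flag is released */
}

static void group_unlock(struct group *g)
{
    atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/* Return the group's stats, allocating and publishing them on first use. */
struct stats *group_stats_get(struct group *g)
{
    assert(g != NULL);

    /* Acquire load: if we see the pointer, we also see the zeroed object. */
    struct stats *s = atomic_load_explicit(&g->stats, memory_order_acquire);
    if (s)
        return s;

    struct stats *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    group_lock(g);
    s = atomic_load_explicit(&g->stats, memory_order_relaxed);
    if (!s) {
        /* Release store: publish only after the object is initialized. */
        atomic_store_explicit(&g->stats, fresh, memory_order_release);
        s = fresh;
        fresh = NULL;
    }
    group_unlock(g);

    free(fresh);   /* lost the race: drop the unused copy (free(NULL) is a no-op) */
    return s;
}
```

The lockless fast path is the reason for the acquire/release pairing: a reader that skips the lock still needs ordering against the writer's initialization.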
     

31 May, 2019

1 commit

  • Based on 3 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [kishon] [vijay] [abraham]
    [i] [kishon]@[ti] [com] this program is distributed in the hope that
    it will be useful but without any warranty without even the implied
    warranty of merchantability or fitness for a particular purpose see
    the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [graeme] [gregory]
    [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
    [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
    [hk] [hemahk]@[ti] [com] this program is distributed in the hope
    that it will be useful but without any warranty without even the
    implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1105 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

3 commits

  • Add options to strictly validate messages and dump messages.
    Sometimes validating dump messages non-strictly may be
    required, so add an option for that as well.

    Since none of this can really be applied to existing commands,
    set the options everywhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
    {
    .cmd = X,
    + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
    ...
    },
    ...
    };

    For new commands one should just not copy the .validate 'opt-out'
    flags and thus get strict validation.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
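
A small sketch of the opt-out idea, assuming a simplified ops structure. The flag names come from the commit message; the numeric values, `struct demo_op` and the helper functions are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; only the names are taken from the commit message. */
#define GENL_DONT_VALIDATE_STRICT (1u << 0)
#define GENL_DONT_VALIDATE_DUMP   (1u << 1)

struct demo_op {
    int cmd;
    unsigned int validate;   /* opt-out flags; zero means "validate everything" */
};

/* Existing command: carries both opt-out flags, as the spatch adds them. */
const struct demo_op legacy_op = {
    .cmd = 1,
    .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
};

/* New command: .validate is simply left out, so strict validation applies. */
const struct demo_op new_op = { .cmd = 2 };

bool op_validates_strictly(const struct demo_op *op)
{
    return !(op->validate & GENL_DONT_VALIDATE_STRICT);
}

bool op_validates_dumps(const struct demo_op *op)
{
    return !(op->validate & GENL_DONT_VALIDATE_DUMP);
}
```

Making the flags opt-out means a zero-initialized field gives the strict behaviour, which is why new commands only need to avoid copying the flags.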
     
  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
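
To illustrate how separate strictness flags can gate individual checks, here is a hedged userspace sketch of an attribute walker. It models only the TRAILING and MAXTYPE checks; the header layout, flag values and function name are assumptions and not the kernel's parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check flags, modelled on the TRAILING and MAXTYPE options. */
enum { CHECK_TRAILING = 1u << 0, CHECK_MAXTYPE = 1u << 1 };

/* Simplified attribute header: total length (header included) and type. */
struct attr_hdr { uint16_t len; uint16_t type; };

#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)

/* Walk the attributes in buf; return 0 if accepted, -1 if rejected. */
int attrs_validate(const unsigned char *buf, size_t len,
                   unsigned int maxtype, unsigned int checks)
{
    size_t off = 0;

    while (len - off >= sizeof(struct attr_hdr)) {
        struct attr_hdr h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.len < sizeof(h) || h.len > len - off)
            break;                       /* malformed: counts as trailing data */
        if ((checks & CHECK_MAXTYPE) && h.type > maxtype)
            return -1;                   /* unknown attribute type */

        size_t step = ATTR_ALIGN(h.len);
        off += step > len - off ? len - off : step;
    }

    if ((checks & CHECK_TRAILING) && off < len)
        return -1;                       /* garbage after the last attribute */
    return 0;
}
```

With no flags set, the walker accepts both unknown types and trailing bytes, which corresponds to the liberal behaviour described above.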
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
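
The compatibility concern can be sketched as follows. The constants mirror the values in the uapi netlink header but are defined locally under a DEMO_ prefix so the example builds anywhere; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi flag bits, prefixed to mark them as illustrative. */
#define DEMO_NLA_F_NESTED        (1u << 15)
#define DEMO_NLA_F_NET_BYTEORDER (1u << 14)
#define DEMO_NLA_TYPE_MASK \
    (~(DEMO_NLA_F_NESTED | DEMO_NLA_F_NET_BYTEORDER) & 0xffffu)

/* Old behaviour: the type is written without the nested flag. */
uint16_t nest_type_noflag(uint16_t attrtype)
{
    return attrtype;
}

/* New behaviour: a wrapper that adds the nested flag. */
uint16_t nest_type(uint16_t attrtype)
{
    return (uint16_t)(attrtype | DEMO_NLA_F_NESTED);
}

/* What a well-behaved reader does: mask the flags before comparing. */
uint16_t attr_type_masked(uint16_t nla_type)
{
    return (uint16_t)(nla_type & DEMO_NLA_TYPE_MASK);
}
```

A reader that compares the raw type field against 3 would stop matching once the flag is set, while one that masks first keeps working. That is the reason existing callers were moved to the _noflag variant.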
     

22 Mar, 2019

1 commit

  • Since maxattr is common, the policy can't really differ sanely,
    so make it common as well.

    The only user that did in fact manage to make a non-common policy
    is taskstats, which has to be really careful about it (since it's
    still using a common maxattr!). This is no longer supported, but
    we can fake it using pre_doit.

    This reduces the size of e.g. nl80211.o (which has lots of commands):

      text    data     bss     dec     hex filename
    398745   14323    2240  415308   6564c net/wireless/nl80211.o (before)
    397913   14331    2240  414484   65314 net/wireless/nl80211.o (after)
    --------------------------------
      -832      +8       0    -824

    Which is obviously just 8 bytes for each command, and an added 8
    bytes for the new policy pointer. I'm not sure why the ops list is
    counted as .text though.

    Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
    {
    - .policy = POLICY,
    },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
    .ops = OPS,
    .maxattr = M,
    + .policy = POLICY,
    ...
    };

    This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
    the cb->data as ops, which we want to change in a later genl patch.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 May, 2017

1 commit

  • The elapsed time, user CPU time and system CPU time for the thread group
    status request are presently left at zero. Fill these in.

    [akpm@linux-foundation.org: run ktime_get_ns() a single time]
    [akpm@linux-foundation.org: include linux/sched/cputime.h for task_cputime()]
    Link: http://lkml.kernel.org/r/1488508424-12322-1-git-send-email-xiao.zhang@windriver.com
    Signed-off-by: Zhang Xiao
    Cc: Balbir Singh
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Xiao
     

15 Nov, 2016

1 commit


04 Nov, 2016

1 commit

  • cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
    taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
    but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
    CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
    so we could end up accessing out of bounds.

    Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
    this is safe because the rest are initialized to 0's.

    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
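
A hedged sketch of why sizing the smaller policy array by the family-wide maximum is safe. The two maxima are illustrative values (only their ordering is taken from the commit), and the struct and lookup helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative maxima; the commit only states that cgroup < task. */
enum { DEMO_CGROUP_ATTR_MAX = 1, DEMO_TASK_ATTR_MAX = 4 };

struct demo_policy { int type; };   /* 0 means "no policy for this attribute" */

/*
 * Sized by the family-wide maximum, so every index the parser may use
 * exists. Entries without an initializer are zero.
 */
const struct demo_policy cgroup_policy[DEMO_TASK_ATTR_MAX + 1] = {
    [DEMO_CGROUP_ATTR_MAX] = { .type = 7 },
};

/* The parser indexes the policy up to the family's maxattr, inclusive. */
const struct demo_policy *policy_lookup(const struct demo_policy *policy,
                                        size_t maxattr, size_t attrtype)
{
    return attrtype <= maxattr ? &policy[attrtype] : NULL;
}
```

Had the array been declared with DEMO_CGROUP_ATTR_MAX + 1 entries, the lookup for index 4 would read past its end.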
     

28 Oct, 2016

3 commits

  • Now genl_register_family() is the only thing (other than the
    users themselves, perhaps, but I didn't find any doing that)
    writing to the family struct.

    In all families that I found, genl_register_family() is only
    called from __init functions (some indirectly, in which case
    I've added __init annotations to clarify things), so all can
    actually be marked __ro_after_init.

    This protects the data structure from accidental corruption.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Instead of providing macros/inline functions to initialize
    the families, make all users initialize them statically and
    get rid of the macros.

    This reduces the kernel code size by about 1.6k on x86-64
    (with allyesconfig).

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Static family IDs have never really been used, the only
    use case was the workaround I introduced for those users
    that assumed their family ID was also their multicast
    group ID.

    Additionally, because static family IDs would never be
    reserved by the generic netlink code, using a relatively
    low ID would only work for built-in families that can be
    registered immediately after generic netlink is started,
    which is basically only the control family (apart from
    the workaround code, which I also had to add code for so
    it would reserve those IDs)

    Thus, anything other than GENL_ID_GENERATE is flawed and
    luckily not used except in the cases I mentioned. Move
    those workarounds into a few lines of code, and then get
    rid of GENL_ID_GENERATE entirely, making it more robust.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

24 Apr, 2016

1 commit

  • The goal of this patch is to use the new libnl API to align netlink
    attributes when needed.
    The layout of the netlink message will be a bit different after the patch,
    because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
    attribute instead of before it.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
    check for
    Signed-off-by: David S. Miller

    Johannes Berg
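
The pitfall described in this commit can be shown with a small model. The message struct and all function names are hypothetical; the "end" step is reduced to reporting the length.

```c
#include <assert.h>
#include <stddef.h>

struct demo_msg { size_t len; };

/* Old style: finishing the message returns its (always positive) length. */
size_t msg_end_len(const struct demo_msg *m)
{
    return m->len;
}

/* New style: finishing the message returns nothing. */
void msg_end(struct demo_msg *m)
{
    (void)m;   /* header fix-up omitted in this sketch */
}

/* Old fill function: success is reported as a positive length. */
int fill_old(struct demo_msg *m)
{
    return (int)msg_end_len(m);
}

/* New fill function: success is reported as 0, as callers expect. */
int fill_new(struct demo_msg *m)
{
    msg_end(m);
    return 0;
}
```

A caller that writes `if (fill_old(&m))` takes the error branch on success, whereas the same test on fill_new() behaves as intended.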
     

20 Nov, 2014

1 commit


27 Aug, 2014

1 commit


20 Nov, 2013

1 commit

  • As suggested by David Miller, make genl_register_family_with_ops()
    a macro and pass only the array, evaluating ARRAY_SIZE() in the
    macro, this is a little safer.

    The openvswitch has some indirection, assign ops/n_ops directly in
    that code. This might ultimately just assign the pointers in the
    family initializations, saving the struct genl_family_and_ops and
    code (once mcast groups are handled differently.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
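
A sketch of the macro approach, assuming simplified family and ops types. All names carry a demo_ prefix because they are illustrative and not the genetlink API.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct demo_op { int cmd; };

struct demo_family {
    const struct demo_op *ops;
    size_t n_ops;
};

/* Underlying function: takes the element count explicitly. */
int demo_register_with_ops_n(struct demo_family *f,
                             const struct demo_op *ops, size_t n_ops)
{
    if (!f || !ops || n_ops == 0)
        return -1;
    f->ops = ops;
    f->n_ops = n_ops;
    return 0;
}

/*
 * Macro wrapper: the count is computed from the array itself, so caller
 * and count cannot drift apart. It must be given a real array, since
 * ARRAY_SIZE() on a pointer would give a wrong count.
 */
#define demo_register_with_ops(f, ops) \
    demo_register_with_ops_n((f), (ops), ARRAY_SIZE(ops))
```

The limitation noted in the comment is also why code that only holds a pointer to its ops needs a different path.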
     

15 Nov, 2013

2 commits


13 Nov, 2013

2 commits


06 Oct, 2012

1 commit

  • If prepare_reply() succeeds we have allocated memory for 'rep_skb'. If
    nla_reserve() then subsequently fails and returns NULL we fail to release
    the memory we allocated, thus causing a leak.

    Signed-off-by: Jesper Juhl
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
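
The shape of this leak fix, sketched in userspace under the assumption of a simple buffer type. `demo_buf`, `build_reply()` and the allocation counter are hypothetical stand-ins for the reply skb and its helpers.

```c
#include <assert.h>
#include <stdlib.h>

int live_buffers;   /* allocations not yet released, for demonstration */

struct demo_buf {
    size_t cap, used;
    unsigned char *data;
};

static struct demo_buf *buf_alloc(size_t cap)
{
    struct demo_buf *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->data = malloc(cap ? cap : 1);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->cap = cap;
    b->used = 0;
    live_buffers++;
    return b;
}

static void buf_free(struct demo_buf *b)
{
    free(b->data);
    free(b);
    live_buffers--;
}

/* Reserve n bytes in the buffer; NULL when there is not enough room. */
static void *buf_reserve(struct demo_buf *b, size_t n)
{
    if (n > b->cap - b->used)
        return NULL;
    b->used += n;
    return b->data + b->used - n;
}

/* Build a reply; on any failure after allocation, release the buffer. */
int build_reply(size_t cap, size_t need, struct demo_buf **out)
{
    struct demo_buf *rep = buf_alloc(cap);

    if (!rep)
        return -1;
    if (!buf_reserve(rep, need))
        goto err;
    *out = rep;
    return 0;
err:
    buf_free(rep);   /* the step that was missing before the fix */
    return -1;
}
```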
     

03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree).

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     

27 Sep, 2012

1 commit


18 Sep, 2012

1 commit

  • - Explicitly limit exit task stat broadcast to the initial user and
    pid namespaces, as it is already limited to the initial network
    namespace.

    - For broadcast task stats explicitly generate all of the identifiers
    in terms of the initial user namespace and the initial pid
    namespace.

    - For request stats report them in terms of the current user namespace
    and the current pid namespace. Netlink messages are delivered
    synchronously to the kernel allowing us to get the user namespace
    and the pid namespace from the current task.

    - Pass the namespaces for representing pids and uids and gids
    into bacct_add_task.

    Cc: Balbir Singh
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

31 Jul, 2012

1 commit


20 Sep, 2011

1 commit

  • Ok, this isn't optimal, since it means that 'iotop' needs admin
    capabilities, and we may have to work on this some more. But at the
    same time it is very much not acceptable to let anybody just read
    anybody else's IO statistics quite at this level.

    Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative
    to checking the capabilities by hand.

    Reported-by: Vasiliy Kulikov
    Cc: Johannes Berg
    Acked-by: Balbir Singh
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Aug, 2011

2 commits

  • When send_cpu_listeners() finds the orphaned listener it marks it as
    !valid and drops listeners->sem. Before it takes this sem for writing,
    s->pid can be reused and add_del_listener() can wrongly try to re-use
    this entry.

    Change add_del_listener() to check that ->valid is true.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Vasiliy Kulikov
    Acked-by: Balbir Singh
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
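
A simplified model of the duplicate check with the validity test, assuming a fixed-size table in place of the per-cpu list and its semaphore. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LISTENERS 8

struct demo_listener {
    int pid;
    bool valid;    /* cleared when the listener is found to be orphaned */
    bool in_use;   /* slot still occupies a place in the table */
};

struct demo_listeners {
    struct demo_listener slot[MAX_LISTENERS];
};

/* Register pid; an existing entry only counts if it is still valid. */
int listener_add(struct demo_listeners *l, int pid)
{
    struct demo_listener *free_slot = NULL;

    for (size_t i = 0; i < MAX_LISTENERS; i++) {
        struct demo_listener *s = &l->slot[i];

        if (!s->in_use) {
            if (!free_slot)
                free_slot = s;
            continue;
        }
        if (s->pid == pid && s->valid)
            return 0;          /* already registered: nothing to add */
    }
    if (!free_slot)
        return -1;             /* table full */
    *free_slot = (struct demo_listener){ .pid = pid, .valid = true, .in_use = true };
    return 0;
}

/* Mark every entry for pid as orphaned without removing it yet. */
void listener_orphan(struct demo_listeners *l, int pid)
{
    for (size_t i = 0; i < MAX_LISTENERS; i++)
        if (l->slot[i].in_use && l->slot[i].pid == pid)
            l->slot[i].valid = false;
}

/* Count the valid entries for pid. */
size_t listener_count(const struct demo_listeners *l, int pid)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_LISTENERS; i++)
        if (l->slot[i].in_use && l->slot[i].valid && l->slot[i].pid == pid)
            n++;
    return n;
}
```

Without the `s->valid` test, a new process that reuses the pid would be treated as already registered and would silently end up with no working listener.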
     
  • 1. Commit 26c4caea9d69 "don't allow duplicate entries in listener mode"
    changed add_del_listener(REGISTER) so that "next_cpu:" can reuse the
    listener allocated for the previous cpu, this doesn't look exactly
    right even if minor.

    Change the code to kfree() in the already-registered case, this case
    is unlikely anyway so the extra kmalloc_node() shouldn't hurt but
    looks more correct and clean.

    2. use the plain list_for_each_entry() instead of _safe() to scan
    listeners->list.

    3. Remove the unneeded INIT_LIST_HEAD(&s->list), we are going to
    list_add(&s->list).

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Vasiliy Kulikov
    Cc: Balbir Singh
    Reviewed-by: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

28 Jun, 2011

1 commit

  • Currently a single process may register exit handlers unlimited times.
    It may lead to a bloated listeners chain and very slow process
    terminations.

    Eg after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 Mb of
    kernel memory is stolen for the handlers chain and "time id" shows 2-7
    seconds instead of normal 0.003. It makes it possible to exhaust all
    kernel memory and to eat much of CPU time by triggering numerous exits
    on a single CPU.

    The patch limits the number of times a single process may register
    itself on a single CPU to one.

    One little issue is kept unfixed - as taskstats_exit() is called before
    exit_files() in do_exit(), the orphaned listener entry (if it was not
    explicitly deregistered) is kept until the next someone's exit() and
    implicit deregistration in send_cpu_listeners(). So, if a process
    registered itself as a listener exits and the next spawned process gets
    the same pid, it would inherit taskstats attributes.

    Signed-off-by: Vasiliy Kulikov
    Cc: Balbir Singh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

24 Mar, 2011

1 commit

  • printk()s without a priority level default to KERN_WARNING. To reduce
    noise at KERN_WARNING, this patch sets the priority level appropriately
    for unleveled printk()s. This should be useful to folks that look at
    dmesg warnings closely.

    Signed-off-by: Mandeep Singh Baines
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     

14 Jan, 2011

1 commit

  • Commit 4be2c95d ("taskstats: pad taskstats netlink response for aligment
    issues on ia64") added a null field to align the taskstats structure but
    the discussion centered around ia64. The issue exists on other platforms
    with inefficient unaligned access and adding them piecemeal would be an
    unmaintainable mess.

    This patch uses Dave Miller's suggestion of using a combination of
    CONFIG_64BIT && !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
    whether alignment is needed.

    Note that this will cause breakage on those platforms with applications
    like iotop which had hard-coded offsets into the packet to access the
    taskstats structure.

    The message seen on systems without the alignment fixes looks like: kernel
    unaligned access to 0xe000023879dca9bc, ip=0xa000000100133d10

    The addresses may vary but resolve to locations inside __delayacct_add_tsk.

    iotop makes what I'd call unreasonable assumptions about the contents of a
    netlink genetlink packet containing generic attributes. They're typed and
    have headers that specify value lengths, so the client can (should)
    identify and skip the ones the client doesn't understand.

    The kernel, as of version 2.6.36, presented a packet like so:
    +--------------------------------+
    | genlmsghdr - 4 bytes           |
    +--------------------------------+
    | NLA header - 4 bytes           | /* Aggregate header */
    +-+------------------------------+
    | | NLA header - 4 bytes         | /* PID header */
    | +------------------------------+
    | | pid/tgid   - 4 bytes         |
    | +------------------------------+
    | | NLA header - 4 bytes         | /* stats header */
    | +------------------------------+
    Reported-by: David S. Miller
    Acked-by: David S. Miller
    Cc: Dan Carpenter
    Cc: Balbir Singh
    Cc: Florian Mickler
    Cc: Guillaume Chazarain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

08 Jan, 2011

1 commit

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: User per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
    as per Tejun.

    Linus Torvalds
     

23 Dec, 2010

1 commit

  • The taskstats structure is internally aligned on 8 byte boundaries but the
    layout of the aggregate reply, with two NLA headers and the pid (each 4
    bytes), actually forces the entire structure to be unaligned. This causes
    the kernel to issue unaligned access warnings on some architectures like
    ia64. Unfortunately, some software out there doesn't properly unroll the
    NLA packet and assumes that the start of the taskstats structure will
    always be 20 bytes from the start of the netlink payload. Aligning the
    start of the taskstats structure breaks this software, which we don't
    want. So, for now the alignment only happens on architectures that
    require it and those users will have to update to fixed versions of those
    packages. Space is reserved in the packet only when needed. This ifdef
    should be removed in several years e.g. 2012 once we can be confident
    that fixed versions are installed on most systems. We add the padding
    before the aggregate since the aggregate is already a defined type.

    Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems")
    previously addressed the alignment issues by padding out the pid field.
    This was supposed to be a compatible change but the circumstances
    described above mean that it wasn't. This patch backs out that change,
    since it was a hack, and introduces a new NULL attribute type to provide
    the padding. Padding the response with 4 bytes avoids allocating an
    aligned taskstats structure and copying it back. Since the structure
    weighs in at 328 bytes, it's too big to do it on the stack.

    Signed-off-by: Jeff Mahoney
    Reported-by: Brian Rogers
    Cc: Jeff Mahoney
    Cc: Guillaume Chazarain
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
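
The offset arithmetic behind this change can be checked with a short sketch. The 4-byte sizes and the 20-byte offset come from the commit text; the function name is hypothetical, and the netlink payload is assumed to start on an 8-byte boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sizes in bytes, as stated in the commit message. */
enum { GENL_HDR = 4, NLA_HDR = 4, PID_PAYLOAD = 4 };

/* Offset of the taskstats structure from the start of the netlink payload. */
size_t stats_offset(bool pad)
{
    size_t off = GENL_HDR;

    if (pad)
        off += NLA_HDR;    /* empty padding attribute: header only */
    off += NLA_HDR;        /* aggregate header */
    off += NLA_HDR;        /* pid header */
    off += PID_PAYLOAD;    /* pid/tgid value */
    off += NLA_HDR;        /* stats header */
    return off;
}
```

Without padding the structure lands at offset 20, which is 4 bytes past an 8-byte boundary. One extra 4-byte attribute header moves it to 24, which is also why clients with a hard-coded offset of 20 break.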
     

17 Dec, 2010

1 commit

  • Use this_cpu_inc_return in one place and avoid ugly __raw_get_cpu in
    another.

    V3->V4:
    - Fix off by one.

    V4->V4f:
    - Use &listener_array

    Cc: Michael Holzheu
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

28 Oct, 2010

1 commit

  • Separate the finding of a task_struct by pid or tgid from filling the
    taskstats data. This makes the code more readable.

    Signed-off-by: Michael Holzheu
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu