24 Dec, 2011

1 commit

  • We can't do this without propagating the const to nlk_sk()
    too, otherwise:

    net/netlink/af_netlink.c: In function ‘netlink_is_kernel’:
    net/netlink/af_netlink.c:103:2: warning: passing argument 1 of ‘nlk_sk’ discards ‘const’ qualifier from pointer target type [enabled by default]
    net/netlink/af_netlink.c:96:36: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Dec, 2011

2 commits


29 Sep, 2011

1 commit

  • Since commit 7361c36c5224 (af_unix: Allow credentials to work across
    user and pid namespaces) af_unix performance dropped a lot.

    This is because we now take a reference on pid and cred in each write(),
    and release them in read(), usually done from another process,
    eventually from another cpu. This triggers false sharing.

    # Events: 154K cycles
    #
    # Overhead Command Shared Object Symbol
    # ........ ....... .................. .........................
    #
    10.40% hackbench [kernel.kallsyms] [k] put_pid
    8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
    7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
    6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock
    4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb
    4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns
    4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred
    2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm
    2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count
    1.75% hackbench [kernel.kallsyms] [k] fget_light
    1.51% hackbench [kernel.kallsyms] [k]
    __mutex_lock_interruptible_slowpath
    1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb

    This patch includes SCM_CREDENTIALS information in a af_unix message/skb
    only if requested by the sender, [man 7 unix for details how to include
    ancillary data using sendmsg() system call]

    Note: This might break buggy applications that expected SCM_CREDENTIAL
    from an unaware write() system call, and receiver not using SO_PASSCRED
    socket option.

    If SOCK_PASSCRED is set on source or destination socket, we still
    include credentials for mere write() syscalls.

    Performance boost in hackbench : more than 50% gain on a 16 thread
    machine (2 quad-core cpus, 2 threads per core)

    hackbench 20 thread 2000

    4.228 sec instead of 9.102 sec

    Signed-off-by: Eric Dumazet
    Acked-by: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Aug, 2011

1 commit


25 Jun, 2011

1 commit


23 Jun, 2011

1 commit

  • Consider the following situation:
    * a dump that would show 8 entries, four in the first
    round, and four in the second
    * between the first and second rounds, 6 entries are
    removed
    * now the second round will not show any entry, and
    even if there is a sequence/generation counter the
    application will not know

    To solve this problem, add a new flag NLM_F_DUMP_INTR
    to the netlink header that indicates the dump wasn't
    consistent, this flag can also be set on the MSG_DONE
    message that terminates the dump, and as such above
    situation can be detected.

    To achieve this, add a sequence counter to the netlink
    callback struct. Of course, netlink code still needs
    to use this new functionality. The correct way to do
    that is to always set cb->seq when a dumpit callback
    is invoked and call nl_dump_check_consistent() for
    each new message. The core code will also call this
    function for the final MSG_DONE message.

    To make it usable with generic netlink, a new function
    genlmsg_nlhdr() is needed to obtain the netlink header
    from the genetlink user header.

    Signed-off-by: Johannes Berg
    Acked-by: David S. Miller
    Signed-off-by: John W. Linville

    Johannes Berg
     

17 Jun, 2011

1 commit


10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to
    a single page. This is not enough for additional interface info
    available with devices that support SR-IOV and caused a bug in
    which VF info would not be displayed if more than approximately
    40 VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher

    Greg Rose
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

08 May, 2011

1 commit


04 Mar, 2011

3 commits


01 Mar, 2011

1 commit

  • netlink_dump() may failed, but nobody handle its error.
    It generates output data, when a previous portion has been returned to
    user space. This mechanism works when all data isn't go in skb. If we
    enter in netlink_recvmsg() and skb is absent in the recv queue, the
    netlink_dump() will not been executed. So if netlink_dump() is failed
    one time, the new data never appear and the reader will sleep forever.

    netlink_dump() is called from two places:

    1. from netlink_sendmsg->...->netlink_dump_start().
    In this place we can report error directly and it will be returned
    by sendmsg().

    2. from netlink_recvmsg
    There we can't report error directly, because we have a portion of
    valid output data and call netlink_dump() for prepare the next portion.
    If netlink_dump() is failed, the socket will be mark as error and the
    next recvmsg will be failed.

    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

25 Oct, 2010

1 commit

  • commit 6c04bb18ddd633 (netlink: use call_rcu for netlink_change_ngroups)
    used a somewhat convoluted and racy way to perform call_rcu().

    The old block of memory is freed after a grace period, but the rcu_head
    used to track it is located in new block.

    This can clash if we call two times or more netlink_change_ngroups(),
    and a block is freed before another. call_rcu() called on different cpus
    makes no guarantee in order of callbacks.

    Fix this using a more standard way of handling this : Each block of
    memory contains its own rcu_head, so that no 'use after free' can
    happens.

    Signed-off-by: Eric Dumazet
    CC: Johannes Berg
    CC: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Sep, 2010

1 commit


19 Aug, 2010

1 commit

  • Since
    commit 1dacc76d0014a034b8aca14237c127d7c19d7726
    Author: Johannes Berg
    Date: Wed Jul 1 11:26:02 2009 +0000

    net/compat/wext: send different messages to compat tasks

    we had a race condition when setting and then
    restoring frag_list. Eric attempted to fix it,
    but the fix created even worse problems.

    However, the original motivation I had when I
    added the code that turned out to be racy is
    no longer clear to me, since we only copy up
    to skb->len to userspace, which doesn't include
    the frag_list length. As a result, not doing
    any frag_list clearing and restoring avoids
    the race condition, while not introducing any
    other problems.

    Additionally, while preparing this patch I found
    that since none of the remaining netlink code is
    really aware of the frag_list, we need to use the
    original skb's information for packet information
    and credentials. This fixes, for example, the
    group information received by compat tasks.

    Cc: Eric Dumazet
    Cc: stable@kernel.org [2.6.31+, for 2.6.35 revert 1235f504aa]
    Signed-off-by: Johannes Berg
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Johannes Berg
     

16 Aug, 2010

1 commit


27 Jul, 2010

1 commit

  • commit 1dacc76d0014
    (net/compat/wext: send different messages to compat tasks)
    introduced a race condition on netlink, in case MSG_PEEK is used.

    An skb given by skb_recv_datagram() might be shared, we must copy it
    before any modification, or risk fatal corruption.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2010

1 commit

  • Convert a few calls from kfree_skb to consume_skb

    Noticed while I was working on dropwatch that I was detecting lots of internal
    skb drops in several places. While some are legitimate, several were not,
    freeing skbs that were at the end of their life, rather than being discarded due
    to an error. This patch converts those calls sites from using kfree_skb to
    consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
    drops. Tested successfully by myself

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

17 Jun, 2010

1 commit


22 May, 2010

1 commit

  • When netlink sockets are used to convey data that is in a namespace
    we need a way to select a subset of the listening sockets to deliver
    the packet to. For the network namespace we have been doing this
    by only transmitting packets in the correct network namespace.

    For data belonging to other namespaces netlink_bradcast_filtered
    provides a mechanism that allows us to examine the destination
    socket and to decide if we should transmit the specified packet
    to it.

    Signed-off-by: Eric W. Biederman
    Acked-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

07 Apr, 2010

1 commit


02 Apr, 2010

1 commit

  • check the length of the socket address passed to connect(2).

    Check the length of the socket address passed to connect(2). If the
    length is invalid, -EINVAL will be returned.

    Signed-off-by: Changli Gao
    ----
    net/bluetooth/l2cap.c | 3 ++-
    net/bluetooth/rfcomm/sock.c | 3 ++-
    net/bluetooth/sco.c | 3 ++-
    net/can/bcm.c | 3 +++
    net/ieee802154/af_ieee802154.c | 3 +++
    net/ipv4/af_inet.c | 5 +++++
    net/netlink/af_netlink.c | 3 +++
    7 files changed, 20 insertions(+), 3 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
     

27 Mar, 2010

1 commit

  • This was included in OpenVZ kernels but wasn't integrated upstream.
    >From git://git.openvz.org/pub/linux-2.6.24-openvz:

    commit 5c69402f18adf7276352e051ece2cf31feefab02
    Author: Alexey Dobriyan
    Date: Mon Dec 24 14:37:45 2007 +0300

    netlink: fixup ->tgid to work in multiple PID namespaces

    Signed-off-by: Tom Goff
    Acked-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Tom Goff
     

21 Mar, 2010

1 commit

  • Currently, ENOBUFS errors are reported to the socket via
    netlink_set_err() even if NETLINK_RECV_NO_ENOBUFS is set. However,
    that should not happen. This fixes this problem and it changes the
    prototype of netlink_set_err() to return the number of sockets that
    have set the NETLINK_RECV_NO_ENOBUFS socket option. This return
    value is used in the next patch in these bugfix series.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

28 Feb, 2010

1 commit

  • The Inode field in /proc/net/{tcp,udp,packet,raw,...} is useful to know the types of
    file descriptors associated to a process. Actually lsof utility uses the field.
    Unfortunately, unlike /proc/net/{tcp,udp,packet,raw,...}, /proc/net/netlink doesn't have the field.
    This patch adds the field to /proc/net/netlink.

    Signed-off-by: Masatake YAMATO
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Masatake YAMATO
     

04 Feb, 2010

1 commit

  • Netlink code does module autoload if protocol userspace is asking for is
    not ready. However, module can dissapear right after it was autoloaded.
    Example: modprobe/rmmod stress-testing and xfrm_user.ko providing NETLINK_XFRM.

    netlink_create() in such situation _will_ create userspace socket and
    _will_not_ pin module. Now if module was removed and we're going to call
    ->netlink_rcv into nothing:

    BUG: unable to handle kernel paging request at ffffffffa02f842a
    ^^^^^^^^^^^^^^^^
    modules are loaded near these addresses here

    IP: [] 0xffffffffa02f842a
    PGD 161f067 PUD 1623063 PMD baa12067 PTE 0
    Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
    CPU 1
    Pid: 11515, comm: ip Not tainted 2.6.33-rc5-netns-00594-gaaa5728-dirty #6 P5E/P5E
    RIP: 0010:[] [] 0xffffffffa02f842a
    RSP: 0018:ffff8800baa3db48 EFLAGS: 00010292
    RAX: ffff8800baa3dfd8 RBX: ffff8800be353640 RCX: 0000000000000000
    RDX: ffffffff81959380 RSI: ffff8800bab7f130 RDI: 0000000000000001
    RBP: ffff8800baa3db58 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000011
    R13: ffff8800be353640 R14: ffff8800bcdec240 R15: ffff8800bd488010
    FS: 00007f93749656f0(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffffa02f842a CR3: 00000000ba82b000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process ip (pid: 11515, threadinfo ffff8800baa3c000, task ffff8800bab7eb30)
    Stack:
    ffffffff813637c0 ffff8800bd488000 ffff8800baa3dba8 ffffffff8136397d
    0000000000000000 ffffffff81344adc 7fffffffffffffff 0000000000000000
    ffff8800baa3ded8 ffff8800be353640 ffff8800bcdec240 0000000000000000
    Call Trace:
    [] ? netlink_unicast+0x100/0x2d0
    [] netlink_unicast+0x2bd/0x2d0

    netlink_unicast_kernel:
    nlk->netlink_rcv(skb);

    [] ? memcpy_fromiovec+0x6c/0x90
    [] netlink_sendmsg+0x1d3/0x2d0
    [] sock_sendmsg+0xbb/0xf0
    [] ? __lock_acquire+0x27b/0xa60
    [] ? might_fault+0x73/0xd0
    [] ? might_fault+0x73/0xd0
    [] ? __lock_release+0x82/0x170
    [] ? might_fault+0xbe/0xd0
    [] ? might_fault+0x73/0xd0
    [] ? verify_iovec+0x47/0xd0
    [] sys_sendmsg+0x1a9/0x360
    [] ? _raw_spin_unlock_irqrestore+0x65/0x70
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irqrestore+0x42/0x70
    [] ? __up_read+0x84/0xb0
    [] ? trace_hardirqs_on_caller+0x145/0x190
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b
    Code: Bad RIP value.
    RIP [] 0xffffffffa02f842a
    RSP
    CR2: ffffffffa02f842a

    If module was quickly removed after autoloading, return -E.

    Return -EPROTONOSUPPORT if module was quickly removed after autoloading.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

26 Nov, 2009

1 commit

  • Generated with the following semantic patch

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 == n2
    + net_eq(n1, n2)

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 != n2
    + !net_eq(n1, n2)

    applied over {include,net,drivers/net}.

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

17 Nov, 2009

1 commit

  • The netlink URELEASE notifier doesn't notify for
    sockets that have been used to receive multicast
    but it should be called for such sockets as well
    since they might _also_ be used for sending and
    not solely for receiving multicast. We will need
    that for nl80211 (generic netlink sockets) in the
    future.

    Signed-off-by: Johannes Berg
    Cc: Patrick McHardy
    Signed-off-by: David S. Miller

    Johannes Berg
     

11 Nov, 2009

1 commit


06 Nov, 2009

1 commit

  • The generic __sock_create function has a kern argument which allows the
    security system to make decisions based on if a socket is being created by
    the kernel or by userspace. This patch passes that flag to the
    net_proto_family specific create function, so it can do the same thing.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     

07 Oct, 2009

1 commit


01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Sep, 2009

1 commit


25 Sep, 2009

1 commit

  • Similar to commit d136f1bd366fdb7e747ca7e0218171e7a00a98a5,
    there's a bug when unregistering a generic netlink family,
    which is caught by the might_sleep() added in that commit:

    BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
    in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
    2 locks held by rmmod/1510:
    #0: (genl_mutex){+.+.+.}, at: [] genl_unregister_family+0x2b/0x130
    #1: (rcu_read_lock){.+.+..}, at: [] __genl_unregister_mc_group+0x1c/0x120
    Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
    Call Trace:
    [] __might_sleep+0x119/0x150
    [] netlink_table_grab+0x21/0x100
    [] netlink_clear_multicast_users+0x23/0x60
    [] __genl_unregister_mc_group+0x71/0x120
    [] genl_unregister_family+0x56/0x130
    [] nl80211_exit+0x15/0x20 [cfg80211]
    [] cfg80211_exit+0x1a/0x40 [cfg80211]

    Fix in the same way by grabbing the netlink table lock
    before doing rcu_read_lock().

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

22 Sep, 2009

1 commit

  • Sizing of memory allocations shouldn't depend on the number of physical
    pages found in a system, as that generally includes (perhaps a huge amount
    of) non-RAM pages. The amount of what actually is usable as storage
    should instead be used as a basis here.

    Some of the calculations (i.e. those not intending to use high memory)
    should likely even use (totalram_pages - totalhigh_pages).

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Dave Airlie
    Cc: Kyle McMartin
    Cc: Jeremy Fitzhardinge
    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

15 Sep, 2009

1 commit

  • Since my commits introducing netns awareness into
    genetlink we can get this problem:

    BUG: scheduling while atomic: modprobe/1178/0x00000002
    2 locks held by modprobe/1178:
    #0: (genl_mutex){+.+.+.}, at: [] genl_register_mc_grou
    #1: (rcu_read_lock){.+.+..}, at: [] genl_register_mc_g
    Pid: 1178, comm: modprobe Not tainted 2.6.31-rc8-wl-34789-g95cb731-dirty #
    Call Trace:
    [] __schedule_bug+0x85/0x90
    [] schedule+0x108/0x588
    [] netlink_table_grab+0xa1/0xf0
    [] netlink_change_ngroups+0x47/0x100
    [] genl_register_mc_group+0x12f/0x290

    because I overlooked that netlink_table_grab() will
    schedule, thinking it was just the rwlock. However,
    in the contention case, that isn't actually true.

    Fix this by letting the code grab the netlink table
    lock first and then the RCU for netns protection.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

25 Aug, 2009

1 commit