23 Nov, 2011

1 commit


14 Nov, 2011

1 commit

  • commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.

    Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
    and furthermore they can have different values. As such, one cannot sanely
    expose both through a single attribute.

    Signed-off-by: Maciej Żenczykowski
    CC: Murali Raja
    CC: Stephen Hemminger
    CC: Eric Dumazet
    CC: David S. Miller
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
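
    A small userspace sketch of the point above, assuming Linux's dual-stack
    behaviour where IP_TOS and IPV6_TCLASS are tracked independently on the
    same AF_INET6 socket (illustrative, not taken from the patch itself):

        /* tos_vs_tclass.c - one dual-stack v6 socket, two distinct values */
        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        int main(void)
        {
            int fd = socket(AF_INET6, SOCK_STREAM, 0);
            int tos = 0x10, tclass = 0x20, val;
            socklen_t len;

            if (fd < 0) {
                perror("socket");
                return 1;
            }

            /* Same socket, two independent per-socket fields. */
            setsockopt(fd, IPPROTO_IP,   IP_TOS,      &tos,    sizeof(tos));
            setsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS, &tclass, sizeof(tclass));

            len = sizeof(val);
            if (getsockopt(fd, IPPROTO_IP, IP_TOS, &val, &len) == 0)
                printf("IP_TOS      = 0x%02x\n", val);
            len = sizeof(val);
            if (getsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS, &val, &len) == 0)
                printf("IPV6_TCLASS = 0x%02x\n", val);

            /* A single netlink attribute cannot faithfully report both. */
            return 0;
        }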
     

13 Oct, 2011

1 commit

  • This patch exposes the TOS value for TCP sockets when the TOS flag
    is requested in the ext_flags of an inet_diag request. The intent is to
    expose TOS values for both TCP and UDP sockets; currently only TCP is
    supported, and UDP will follow once it gains netlink support. For IPv4
    the tos value is exposed, and for IPv6 the tclass value is exposed.

    Signed-off-by: Murali Raja
    Acked-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Murali Raja
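
    A minimal sketch of the request side of this mechanism, using the UAPI
    definitions from <linux/inet_diag.h>: extension N is requested by setting
    bit (1 << (N - 1)) in idiag_ext, and the reply carries a netlink attribute
    of type N (here INET_DIAG_TOS). Only the bitmap handling is shown; the
    netlink socket plumbing is omitted.

        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/inet_diag.h>

        int main(void)
        {
            struct inet_diag_req req = { 0 };

            req.idiag_family = AF_INET;
            /* Ask the kernel to append the INET_DIAG_TOS attribute. */
            req.idiag_ext |= 1 << (INET_DIAG_TOS - 1);

            printf("idiag_ext with INET_DIAG_TOS requested: 0x%02x\n",
                   req.idiag_ext);
            return 0;
        }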
     

21 Jun, 2011

1 commit


18 Jun, 2011

1 commit

  • A malicious user or buggy application can inject code and trigger an
    infinite loop in inet_diag_bc_audit().

    Also make sure each instruction is aligned on a 4-byte boundary, to avoid
    unaligned accesses.

    Reported-by: Dan Rosenberg
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
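
    A simplified userspace illustration of what such an audit has to enforce
    (a sketch modeled loosely on struct inet_diag_bc_op, not the kernel code):
    each op must fit in the buffer, be 4-byte aligned, and its jump offsets
    must move strictly forward, otherwise a crafted length lets the
    interpreter spin forever.

        #include <stdio.h>
        #include <string.h>

        /* Modeled on struct inet_diag_bc_op; purely illustrative. */
        struct bc_op {
            unsigned char  code;
            unsigned char  yes;  /* bytes to skip when the condition matches */
            unsigned short no;   /* bytes to skip when it does not           */
        };

        static int bc_audit(const unsigned char *bc, int len)
        {
            int off = 0;

            while (off < len) {
                struct bc_op op;

                if (len - off < (int)sizeof(op) || (off & 3))
                    return -1;                /* truncated or misaligned op */
                memcpy(&op, bc + off, sizeof(op));

                /* Both jump offsets must be aligned, stay in bounds and
                 * make progress; yes = 0 would loop on the same op forever.
                 */
                if (op.yes < sizeof(op) || (op.yes & 3) || op.yes > len - off)
                    return -1;
                if (op.no < sizeof(op) || (op.no & 3) || op.no > len - off)
                    return -1;

                off += op.yes;
            }
            return 0;
        }

        int main(void)
        {
            unsigned char evil[4] = { 1, 0, 0, 0 };   /* yes = 0: never advances */

            printf("audit: %s\n",
                   bc_audit(evil, sizeof(evil)) ? "rejected" : "accepted");
            return 0;
        }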
     

10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to
    a single page. This is not enough for additional interface info
    available with devices that support SR-IOV and caused a bug in
    which VF info would not be displayed if more than approximately
    40 VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher

    Greg Rose
     

23 Apr, 2011

1 commit


20 Jan, 2011

1 commit


10 Jan, 2011

1 commit

  • Because NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
    a test like "if (x & NLM_F_DUMP)" fires when _either_ of the bits is
    set. Since NLM_F_MATCH's value overlaps with NLM_F_EXCL, non-dump
    requests with NLM_F_EXCL set are mistaken for dump requests.

    Substitute a condition that tests for _all_ bits being set.

    Signed-off-by: Jan Engelhardt
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Jan Engelhardt
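
    A small userspace demonstration of the overlap, using the flag constants
    from <linux/netlink.h> (a sketch of the failure mode, not kernel code):

        #include <stdio.h>
        #include <linux/netlink.h>

        int main(void)
        {
            /* A non-dump NEW request that happens to carry NLM_F_EXCL. */
            unsigned short flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;

            /* NLM_F_DUMP is NLM_F_ROOT | NLM_F_MATCH, and NLM_F_MATCH shares
             * its value with NLM_F_EXCL, so the single-& test fires here.
             */
            printf("flags & NLM_F_DUMP           -> %s\n",
                   (flags & NLM_F_DUMP) ? "treated as dump (wrong)"
                                        : "not a dump");
            printf("(flags & NLM_F_DUMP) == NLM_F_DUMP -> %s\n",
                   (flags & NLM_F_DUMP) == NLM_F_DUMP ? "treated as dump"
                                                      : "not a dump (right)");
            return 0;
        }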
     

05 Nov, 2010

1 commit

  • We were using nlmsg_find_attr() to look up the bytecode by attribute when
    auditing, but then just using the first attribute when actually running
    bytecode. So, if we received a message with two attribute elements, where only
    the second had type INET_DIAG_REQ_BYTECODE, we would validate and run different
    bytecode strings.

    Fix this by consistently using nlmsg_find_attr everywhere.

    Signed-off-by: Nelson Elhage
    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Nelson Elhage
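
    A stand-in sketch of the mismatch (simplified structures, not the kernel's
    nlattr API): if validation looks the blob up by attribute type but
    execution blindly takes the first attribute, a message with two attributes
    gets one blob audited and a different one run.

        #include <stdio.h>
        #include <stdint.h>

        /* Minimal TLV attribute, loosely modeled on struct nlattr. */
        struct attr {
            uint16_t len;    /* header + payload, kept 4-byte aligned here */
            uint16_t type;
        };

        #define ATTR_BYTECODE 1      /* stand-in for INET_DIAG_REQ_BYTECODE */

        static const struct attr *first_attr(const void *buf)
        {
            return buf;              /* what the run path effectively did */
        }

        static const struct attr *find_attr(const void *buf, int len,
                                            uint16_t type)
        {
            const char *p = buf;

            while (len >= (int)sizeof(struct attr)) {
                const struct attr *a = (const struct attr *)p;

                if (a->len < sizeof(struct attr))
                    break;           /* malformed, stop walking */
                if (a->type == type)
                    return a;        /* what the audit path looked at */
                p   += a->len;
                len -= a->len;
            }
            return NULL;
        }

        int main(void)
        {
            /* Two attributes: an unrelated one first, the bytecode second. */
            struct attr msg[4] = {
                { 8, 99 },               /* unrelated attribute, 8 bytes     */
                { 0, 0 },                /* its opaque 4-byte payload        */
                { 8, ATTR_BYTECODE },    /* the bytecode attribute           */
                { 0, 0 },                /* its payload                      */
            };
            const struct attr *audited =
                find_attr(msg, sizeof(msg), ATTR_BYTECODE);

            printf("audited type:  %u\n", audited ? audited->type : 0);
            printf("executed type: %u\n", first_attr(msg)->type);  /* differs */
            return 0;
        }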
     

24 Sep, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some files didn't need the inclusion,
    some needed manual addition, and for others it was more appropriate to
    add it to an implementation .h or embedding .c file. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that they could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

20 Jan, 2010

1 commit


19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and to move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
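
    A stand-in illustration of the macro aliasing the rename makes possible
    (simplified structures and names, not the real struct sock layout): once a
    field name like inet_daddr is unique, a later patch can turn it into a
    macro that redirects to a slot in the shared read-mostly common part,
    without clashing with any existing identifier.

        #include <stdio.h>

        /* Simplified stand-ins for struct sock_common / struct inet_sock. */
        struct sock_common_demo {
            unsigned int skc_daddr;        /* lookup key, read-mostly line */
        };

        struct inet_sock_demo {
            struct sock_common_demo sk_common;
            unsigned short          inet_dport;   /* renamed, unique prefix */
        };

        /* The future step: alias the renamed field onto the common part.
         * The unique inet_ prefix is what avoids name clashes here.
         */
        #define inet_daddr sk_common.skc_daddr

        int main(void)
        {
            struct inet_sock_demo sk = { { 0 }, 0 };

            sk.inet_daddr = 0x7f000001u;   /* expands to sk.sk_common.skc_daddr */
            sk.inet_dport = 80;

            printf("daddr=0x%08x dport=%u\n", sk.inet_daddr,
                   (unsigned int)sk.inet_dport);
            return 0;
        }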
     

13 Oct, 2009

1 commit


18 Jun, 2009

1 commit

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed the initial sk_wmem_alloc value.

    We need to take this offset into account when reporting
    sk_wmem_alloc to user space, in PROC_FS files or various
    ioctls (SIOCOUTQ/TIOCOUTQ).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
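
    A stand-in sketch of the reporting rule (illustrative, not the kernel
    helper): with sk_wmem_alloc now starting at one, the extra bookkeeping
    reference must be subtracted before the value reaches user space.

        #include <stdio.h>

        /* Illustrative per-socket write-queue accounting. */
        struct demo_sock {
            int sk_wmem_alloc;   /* starts at 1: the socket's own reference */
        };

        /* What PROC_FS / SIOCOUTQ-style reporting should hand to user space. */
        static int wmem_alloc_get(const struct demo_sock *sk)
        {
            return sk->sk_wmem_alloc - 1;
        }

        int main(void)
        {
            struct demo_sock sk = { .sk_wmem_alloc = 1 };   /* empty queue */

            sk.sk_wmem_alloc += 512;        /* one 512-byte skb in flight  */

            printf("raw counter: %d, reported: %d\n",
                   sk.sk_wmem_alloc, wmem_alloc_get(&sk));
            return 0;
        }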
     

28 Apr, 2009

1 commit


24 Nov, 2008

1 commit

  • This is the last step to be able to perform full RCU lookups
    in __inet_lookup(): after the established/timewait tables, we
    add RCU lookups to the listening hash table.

    The only trick here is that a socket of a given type (TCP ipv4,
    TCP ipv6, ...) can now be in flight between two different tables
    (established and listening) during an RCU grace period, so we
    must use different 'nulls' end-of-chain values for the two tables.

    We define a large value:

    #define LISTENING_NULLS_BASE (1U << 29)

    so that slots in the listening table are guaranteed to have different
    end-of-chain values than slots in the established table. A reader can
    still detect whether it finished its lookup in the right chain.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
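
    A stand-in sketch of the 'nulls' trick (simplified after list_nulls, not
    the kernel code): the end-of-chain marker encodes a value instead of being
    a plain NULL, so a lockless reader that reached the end of a chain can
    check that it ended in the slot it started from and restart otherwise;
    offsetting listening slots by LISTENING_NULLS_BASE keeps their markers
    distinct from established slots.

        #include <stdio.h>

        #define LISTENING_NULLS_BASE (1U << 29)

        /* A 'nulls' marker is an odd value carrying (value << 1) | 1, so it
         * can never be mistaken for a real (aligned) node pointer.
         */
        static void *make_nulls(unsigned long value)
        {
            return (void *)((value << 1) | 1UL);
        }

        static int is_a_nulls(const void *ptr)
        {
            return ((unsigned long)ptr & 1UL) != 0;
        }

        static unsigned long get_nulls_value(const void *ptr)
        {
            return (unsigned long)ptr >> 1;
        }

        int main(void)
        {
            unsigned int slot = 5;
            void *ehash_end = make_nulls(slot);                        /* established */
            void *lhash_end = make_nulls(slot + LISTENING_NULLS_BASE); /* listening   */

            /* A mismatch between the expected and found end value means the
             * chain changed under the reader, so the lookup must restart.
             */
            printf("established slot %u ends with %lu\n",
                   slot, get_nulls_value(ehash_end));
            printf("listening   slot %u ends with %lu\n",
                   slot, get_nulls_value(lhash_end));
            printf("both markers detected as nulls: %s\n",
                   is_a_nulls(ehash_end) && is_a_nulls(lhash_end) ? "yes" : "no");
            return 0;
        }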
     

22 Nov, 2008

1 commit


20 Nov, 2008

1 commit

  • This patch prepares the RCU migration of the listening_hash table for
    the TCP/DCCP protocols.

    The listening_hash table being small (32 slots per protocol), we add
    a spinlock for each slot instead of a single rwlock for the whole table.

    This should reduce reader hold times and improve writer concurrency.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2008

1 commit

  • RCU was added to UDP lookups, using a fast infrastructure:
    - the sockets kmem_cache uses SLAB_DESTROY_BY_RCU and doesn't pay the
    price of call_rcu() at freeing time.
    - hlist_nulls permits the use of few memory barriers.

    This patch uses the same infrastructure for TCP/DCCP established
    and timewait sockets.

    Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
    using short-lived TCP connections. A follow-up patch, converting
    rwlocks to spinlocks, will speed up this case even further.

    __inet_lookup_established() is pretty fast now that we don't have to
    dirty a contended cache line (read_lock/read_unlock).

    Only the established and timewait hash tables are converted to RCU
    (the bind and listen tables still use traditional locking).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
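
    A stand-in sketch of the lookup discipline SLAB_DESTROY_BY_RCU imposes
    (single-threaded illustration with made-up types, not the kernel code):
    since a freed socket's memory can be reused immediately for another
    socket, a lockless reader must take a reference only while the refcount
    is non-zero and then re-check the keys before trusting the match.

        #include <stdio.h>
        #include <stdbool.h>

        struct demo_sock {
            unsigned int refcnt;   /* 0 means being freed / slot reusable */
            unsigned int dport;    /* part of the lookup key              */
        };

        static bool get_ref_if_live(struct demo_sock *sk)
        {
            if (sk->refcnt == 0)       /* kernel: atomic_inc_not_zero()    */
                return false;
            sk->refcnt++;
            return true;
        }

        /* Candidate found by the (lockless) hash walk, then validated. */
        static struct demo_sock *lookup(struct demo_sock *candidate,
                                        unsigned int dport)
        {
            if (!get_ref_if_live(candidate))
                return NULL;           /* kernel: restart the chain walk   */

            /* The slab slot may have been recycled for a different socket
             * between the walk and the refcount bump: re-check the key.
             */
            if (candidate->dport != dport) {
                candidate->refcnt--;   /* kernel: sock_put(), then restart */
                return NULL;
            }
            return candidate;
        }

        int main(void)
        {
            struct demo_sock reused = { .refcnt = 1, .dport = 443 };

            /* We were chasing a dport-80 socket; the slot now holds 443. */
            printf("lookup(80)  -> %s\n", lookup(&reused, 80)  ? "hit" : "restart");
            printf("lookup(443) -> %s\n", lookup(&reused, 443) ? "hit" : "restart");
            return 0;
        }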
     

17 Oct, 2008

1 commit


28 Aug, 2008

1 commit


12 Jun, 2008

1 commit


01 Feb, 2008

3 commits

  • Add a net argument to inet6_lookup and propagate it further.
    This is the tcp-v6 counterpart of what was done for
    tcp-v4 sockets in a previous patch.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Add a net argument to inet_lookup and propagate it further
    into the lookup calls, and tune __inet_check_established.

    dccp and inet_diag, which use these lookup functions,
    pass init_net into them.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=9825

    The inet_diag_lock_handler function uses ERR_PTR to encode errors but
    its callers were testing against NULL.

    This only happens when the only inet_diag modular user, DCCP, is not
    built into the kernel or available as a module.

    There was also a problem with the mutex not being dropped when a handler
    was not found; that is also fixed in this patch.

    This caused an OOPS and ss would then hang on subsequent calls, as
    &inet_diag_table_mutex was being left locked.

    Thanks to spike at ml.yaroslavl.ru for reporting it after trying 'ss -d'
    on a kernel that doesn't have DCCP available.

    This bug was introduced in cset
    d523a328fb0271e1a763e985a21f2488fd816e7e ("Fix inet_diag dead-lock
    regression"), after 2.6.24-rc3, so just 2.6.24 seems to be affected.

    Signed-off-by: Arnaldo Carvalho de Melo
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
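
    A userspace re-creation of this bug class, with ERR_PTR()/IS_ERR()
    re-implemented locally for illustration: an errno encoded in a pointer is
    non-NULL, so a NULL check waves it through and the caller then
    dereferences garbage.

        #include <stdio.h>
        #include <errno.h>

        /* Simplified userspace copies of the kernel helpers. */
        #define MAX_ERRNO 4095

        static void *ERR_PTR(long error)
        {
            return (void *)error;
        }

        static int IS_ERR(const void *ptr)
        {
            return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
        }

        struct handler { const char *name; };

        /* Failure is reported via ERR_PTR, never via NULL. */
        static struct handler *lookup_handler(int have_dccp)
        {
            static struct handler dccp = { "dccp_diag" };

            return have_dccp ? &dccp : ERR_PTR(-ENOENT);
        }

        int main(void)
        {
            struct handler *h = lookup_handler(0);   /* DCCP not available */

            if (h == NULL)          /* the buggy check: never true here */
                printf("NULL check caught the error\n");

            if (IS_ERR(h))          /* the correct check */
                printf("IS_ERR caught the error (errno %ld)\n", -(long)h);
            else
                printf("using handler %s\n", h->name);
            return 0;
        }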
     

29 Jan, 2008

1 commit


03 Dec, 2007

1 commit

  • The inet_diag register fix broke inet_diag module loading because the
    loaded module had to take the same mutex that's already held by the
    loader in order to register the new handler.

    This patch fixes it by introducing a separate mutex to protect the
    handler table.

    Signed-off-by: Herbert Xu

    Herbert Xu
     

29 Nov, 2007

1 commit

  • The following race is possible when one CPU unregisters the handler
    while another one is trying to receive a message and call it:

    CPU1:                                   CPU2:
    inet_diag_rcv()                         inet_diag_unregister()
      mutex_lock(&inet_diag_mutex);
      netlink_rcv_skb(skb, &inet_diag_rcv_msg);
        if (inet_diag_table[nlh->nlmsg_type] == NULL)
            /* false: handler is still registered */
        ...
        netlink_dump_start(idiagnl, skb, nlh,
                           inet_diag_dump, NULL);
          cb = kzalloc(sizeof(*cb), GFP_KERNEL);
          /* sleeps here freeing memory,
           * or is preempted,
           * or sleeps later on nlk->cb_mutex
           */
                                            spin_lock(&inet_diag_register_lock);
                                            inet_diag_table[type] = NULL;
        ...                                 spin_unlock(&inet_diag_register_lock);
                                            synchronize_rcu();
                                            /* CPU1 is sleeping - the RCU
                                             * quiescent state is passed
                                             */
                                            return;
        /* inet_diag_dump is finally called: */
        inet_diag_dump()
          handler = inet_diag_table[cb->nlh->nlmsg_type];
          BUG_ON(handler == NULL);
          /* OOPS! While we slept the unregister
           * set handler to NULL :(
           */

    Grep showed that the register/unregister functions are called
    from the init/fini module callbacks of tcp_/dccp_diag, so it's OK
    to use the inet_diag_mutex to synchronize manipulations of the
    inet_diag_table and the access to it.

    Besides, as Herbert pointed out, asynchronous dumps should hold
    this mutex as well, and thus we provide this mutex as the cb_mutex.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Herbert Xu

    Pavel Emelyanov
     

07 Nov, 2007

1 commit

  • As done two years ago on the IP route cache table (commit
    22c047ccbc68fa8f3fa57f0e8f906479a062c426), we can avoid using one
    lock per hash bucket for the huge TCP/DCCP hash tables.

    On a typical x86_64 platform, this saves about 2MB or 4MB of RAM, for
    little performance difference (we hit a different cache line for the
    rwlock, but then the bucket cache line has a better sharing factor
    among CPUs, since we dirty it less often). For netstat or ss commands
    that want a full scan of the hash table, we perform fewer memory accesses.

    Using a 'small' table of hashed rwlocks should be more than enough to
    provide correct SMP concurrency between different buckets, without
    using too much memory. Sizing of this table depends on
    num_possible_cpus() and various CONFIG settings.

    This patch provides some locking abstraction that may ease a future
    work using a different model for TCP/DCCP table.

    Signed-off-by: Eric Dumazet
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
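
    A stand-in sketch of the idea (userspace pthreads, hypothetical sizes and
    names): keep a small power-of-two array of locks instead of one rwlock per
    bucket, and map a bucket to a lock by masking its hash, trading a little
    lock sharing for megabytes of memory.

        #include <stdio.h>
        #include <pthread.h>

        /* In the kernel the lock count is derived from num_possible_cpus()
         * and CONFIG options; fixed here purely for illustration.
         */
        #define EHASH_BUCKETS (1U << 20)
        #define EHASH_LOCKS   256U            /* must be a power of two */

        static pthread_rwlock_t ehash_locks[EHASH_LOCKS];

        /* hash -> one of the shared locks */
        static pthread_rwlock_t *ehash_lockp(unsigned int hash)
        {
            return &ehash_locks[hash & (EHASH_LOCKS - 1)];
        }

        int main(void)
        {
            unsigned int i, hash = 0x9e3779b9u;   /* any bucket hash */

            for (i = 0; i < EHASH_LOCKS; i++)
                pthread_rwlock_init(&ehash_locks[i], NULL);

            pthread_rwlock_rdlock(ehash_lockp(hash));   /* scan one bucket */
            /* ... walk the chain of that bucket ... */
            pthread_rwlock_unlock(ehash_lockp(hash));

            printf("%u buckets share %u locks (%u buckets per lock)\n",
                   EHASH_BUCKETS, EHASH_LOCKS, EHASH_BUCKETS / EHASH_LOCKS);
            return 0;
        }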
     

22 Oct, 2007

1 commit


11 Oct, 2007

4 commits

  • This patch makes processing of netlink user -> kernel messages synchronous.
    The change was inspired by a talk with Alexey Kuznetsov about current
    netlink message processing. He says that he was badly wrong when he
    introduced asynchronous user -> kernel communication.

    The call netlink_unicast is the only path to send a message to the kernel
    netlink socket. But, unfortunately, it is also used to send data to the
    user.

    Before this change the user message was attached to the socket queue
    and sk->sk_data_ready was called. The process was blocked until all
    pending messages were processed. The bad thing is that this processing
    could occur in an arbitrary process context.

    This patch changes the nlk->data_ready callback to take one skb and
    forces packet processing right in netlink_unicast.

    The kernel -> user path in netlink_unicast remains untouched.

    EINTR processing in netlink_run_queue was changed. It forces an rtnl_lock
    drop, but the process remains in the loop until the message is fully
    processed. So there is no need for these kludges now.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • I was looking at Patrick's fix to inet_diag and it occurred
    to me that we're using a pointer argument to return values
    unnecessarily in netlink_run_queue. Changing it to return
    the value will allow the compiler to generate better code
    since the value won't have to be memory-backed.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Each netlink socket will live in exactly one network namespace;
    this includes the controlling kernel sockets.

    This patch updates all of the existing netlink protocols
    to support only the initial network namespace. Requests
    by clients in other namespaces will get -ECONNREFUSED,
    as they would if the kernel did not have support for
    that netlink protocol compiled in.

    As each netlink protocol is updated to be multiple network
    namespace safe it can register multiple kernel sockets
    to acquire a presence in the rest of the network namespaces.

    The implementation in af_netlink is a simple filter implementation
    at hash table insertion and hash table look up time.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Hopefully captured all single statement cases under net/. I'm
    not too sure if there is some policy about #includes that are
    "guaranteed" (ie., in the current tree) to be available through
    some other #included header, so I just added linux/kernel.h to
    each changed file that didn't #include it previously.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

11 Sep, 2007

1 commit

  • netlink_run_queue() doesn't handle multiple processes processing the
    queue concurrently. Serialize queue processing in inet_diag to fix
    an oops in netlink_rcv_skb caused by netlink_run_queue passing
    NULL for the skb.

    BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054
    [349587.500454] printing eip:
    [349587.500457] c03318ae
    [349587.500459] *pde = 00000000
    [349587.500464] Oops: 0000 [#1]
    [349587.500466] PREEMPT SMP
    [349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
    [349587.500483] CPU: 0
    [349587.500485] EIP: 0060:[] Not tainted VLI
    [349587.500487] EFLAGS: 00010246 (2.6.22.3 #1)
    [349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
    [349587.500506] eax: 00000000 ebx: 00000000 ecx: c148d2a0 edx: c0398819
    [349587.500510] esi: 00000000 edi: c0398819 ebp: c7a21c8c esp: c7a21c80
    [349587.500517] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
    [349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000)
    [349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c
    [349587.500542] f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c
    [349587.500558] f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff
    [349587.500578] Call Trace:
    [349587.500581] [] show_trace_log_lvl+0x1c/0x33
    [349587.500591] [] show_stack_log_lvl+0x8d/0xaa
    [349587.500595] [] show_registers+0x1cb/0x321
    [349587.500604] [] die+0x112/0x1e1
    [349587.500607] [] do_page_fault+0x229/0x565
    [349587.500618] [] error_code+0x72/0x78
    [349587.500625] [] netlink_run_queue+0x40/0x76
    [349587.500632] [] inet_diag_rcv+0x1f/0x2c
    [349587.500639] [] netlink_data_ready+0x57/0x59
    [349587.500643] [] netlink_sendskb+0x24/0x45
    [349587.500651] [] netlink_unicast+0x100/0x116
    [349587.500656] [] netlink_sendmsg+0x1c2/0x280
    [349587.500664] [] sock_sendmsg+0xba/0xd5
    [349587.500671] [] sys_sendmsg+0x17b/0x1e8
    [349587.500676] [] sys_socketcall+0x230/0x24d
    [349587.500684] [] syscall_call+0x7/0xb
    [349587.500691] =======================
    [349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f

    Reported by Athanasius

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
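
    A stand-in sketch of the shape of the fix (userspace pthreads, hypothetical
    names, not the actual patch): all queue processing goes through one mutex,
    so a second receiver can no longer race the first one through the same
    receive queue and end up handling a NULL skb.

        #include <stdio.h>
        #include <pthread.h>

        static pthread_mutex_t diag_mutex = PTHREAD_MUTEX_INITIALIZER;

        static void run_queue(const char *who)
        {
            /* In the kernel this drains the netlink receive queue and calls
             * the per-message handler; concurrent drains caused the oops.
             */
            printf("%s draining the queue\n", who);
        }

        static void diag_rcv(const char *who)
        {
            pthread_mutex_lock(&diag_mutex);   /* serialize queue processing */
            run_queue(who);
            pthread_mutex_unlock(&diag_mutex);
        }

        static void *receiver(void *arg)
        {
            diag_rcv((const char *)arg);
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;

            /* Two senders used to be able to process the queue in parallel. */
            pthread_create(&t1, NULL, receiver, (void *)"receiver A");
            pthread_create(&t2, NULL, receiver, (void *)"receiver B");
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;
        }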
     

26 Apr, 2007

3 commits