15 Sep, 2018

1 commit

  • [ Upstream commit 98c8f125fd8a6240ea343c1aa50a1be9047791b8 ]

    Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink
    policy, so max length isn't enforced, only minimum. This means nkeys
    (from userspace) was being trusted without checking the actual size of
    nla_len(), which could lead to a memory over-read, and ultimately an
    exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within
    a namespace.

    Reported-by: Al Viro
    Cc: Jamal Hadi Salim
    Cc: Cong Wang
    Cc: Jiri Pirko
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

09 Mar, 2018

2 commits

  • [ Upstream commit d7cdee5ea8d28ae1b6922deb0c1badaa3aa0ef8c ]

    Li Shuang reported an Oops with cls_u32 due to an use-after-free
    in u32_destroy_key(). The use-after-free can be triggered with:

    dev=lo
    tc qdisc add dev $dev root handle 1: htb default 10
    tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
    tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
    10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1:
    tc qdisc del dev $dev root

    Which causes the following kasan splat:

    ==================================================================
    BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
    Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571

    CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
    Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
    Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
    Call Trace:
    dump_stack+0xd6/0x182
    ? dma_virt_map_sg+0x22e/0x22e
    print_address_description+0x73/0x290
    kasan_report+0x277/0x360
    ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
    u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
    u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
    process_one_work+0xae0/0x1c80
    ? sched_clock+0x5/0x10
    ? pwq_dec_nr_in_flight+0x3c0/0x3c0
    ? _raw_spin_unlock_irq+0x29/0x40
    ? trace_hardirqs_on_caller+0x381/0x570
    ? _raw_spin_unlock_irq+0x29/0x40
    ? finish_task_switch+0x1e5/0x760
    ? finish_task_switch+0x208/0x760
    ? preempt_notifier_dec+0x20/0x20
    ? __schedule+0x839/0x1ee0
    ? check_noncircular+0x20/0x20
    ? firmware_map_remove+0x73/0x73
    ? find_held_lock+0x39/0x1c0
    ? worker_thread+0x434/0x1820
    ? lock_contended+0xee0/0xee0
    ? lock_release+0x1100/0x1100
    ? init_rescuer.part.16+0x150/0x150
    ? retint_kernel+0x10/0x10
    worker_thread+0x216/0x1820
    ? process_one_work+0x1c80/0x1c80
    ? lock_acquire+0x1a5/0x540
    ? lock_downgrade+0x6b0/0x6b0
    ? sched_clock+0x5/0x10
    ? lock_release+0x1100/0x1100
    ? compat_start_thread+0x80/0x80
    ? do_raw_spin_trylock+0x190/0x190
    ? _raw_spin_unlock_irq+0x29/0x40
    ? trace_hardirqs_on_caller+0x381/0x570
    ? _raw_spin_unlock_irq+0x29/0x40
    ? finish_task_switch+0x1e5/0x760
    ? finish_task_switch+0x208/0x760
    ? preempt_notifier_dec+0x20/0x20
    ? __schedule+0x839/0x1ee0
    ? kmem_cache_alloc_trace+0x143/0x320
    ? firmware_map_remove+0x73/0x73
    ? sched_clock+0x5/0x10
    ? sched_clock_cpu+0x18/0x170
    ? find_held_lock+0x39/0x1c0
    ? schedule+0xf3/0x3b0
    ? lock_downgrade+0x6b0/0x6b0
    ? __schedule+0x1ee0/0x1ee0
    ? do_wait_intr_irq+0x340/0x340
    ? do_raw_spin_trylock+0x190/0x190
    ? _raw_spin_unlock_irqrestore+0x32/0x60
    ? process_one_work+0x1c80/0x1c80
    ? process_one_work+0x1c80/0x1c80
    kthread+0x312/0x3d0
    ? kthread_create_worker_on_cpu+0xc0/0xc0
    ret_from_fork+0x3a/0x50

    Allocated by task 1688:
    kasan_kmalloc+0xa0/0xd0
    __kmalloc+0x162/0x380
    u32_change+0x1220/0x3c9e [cls_u32]
    tc_ctl_tfilter+0x1ba6/0x2f80
    rtnetlink_rcv_msg+0x4f0/0x9d0
    netlink_rcv_skb+0x124/0x320
    netlink_unicast+0x430/0x600
    netlink_sendmsg+0x8fa/0xd60
    sock_sendmsg+0xb1/0xe0
    ___sys_sendmsg+0x678/0x980
    __sys_sendmsg+0xc4/0x210
    do_syscall_64+0x232/0x7f0
    return_from_SYSCALL_64+0x0/0x75

    Freed by task 112:
    kasan_slab_free+0x71/0xc0
    kfree+0x114/0x320
    rcu_process_callbacks+0xc3f/0x1600
    __do_softirq+0x2bf/0xc06

    The buggy address belongs to the object at ffff881b83dae600
    which belongs to the cache kmalloc-4096 of size 4096
    The buggy address is located 24 bytes inside of
    4096-byte region [ffff881b83dae600, ffff881b83daf600)
    The buggy address belongs to the page:
    page:ffffea006e0f6a00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
    flags: 0x17ffffc0008100(slab|head)
    raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007
    raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    The problem is that the htnode is freed before the linked knodes and the
    latter will try to access the first at u32_destroy_key() time.
    This change addresses the issue using the htnode refcnt to guarantee
    the correct free order. While at it also add a RCU annotation,
    to keep sparse happy.

    v1 -> v2: use rtnl_derefence() instead of RCU read locks
    v2 -> v3:
    - don't check refcnt in u32_destroy_hnode()
    - cleaned-up u32_destroy() implementation
    - cleaned-up code comment
    v3 -> v4:
    - dropped unneeded comment

    Reported-by: Li Shuang
    Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
    Signed-off-by: Paolo Abeni
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit eb53f7af6f15285e2f6ada97285395343ce9f433 ]

    The following sequence is currently broken:

    # tc qdisc add dev foo ingress
    # tc filter replace dev foo protocol all ingress \
    u32 match u8 0 0 action mirred egress mirror dev bar1
    # tc filter replace dev foo protocol all ingress \
    handle 800::800 pref 49152 \
    u32 match u8 0 0 action mirred egress mirror dev bar2
    Error: cls_u32: Key node flags do not match passed flags.
    We have an error talking to the kernel, -1

    The error comes from u32_change() when comparing new and
    existing flags. The existing ones always contains one of
    TCA_CLS_FLAGS_{,NOT}_IN_HW flag depending on offloading state.
    These flags cannot be passed from userspace so the condition
    (n->flags != flags) in u32_change() always fails.

    Fix the condition so the flags TCA_CLS_FLAGS_NOT_IN_HW and
    TCA_CLS_FLAGS_IN_HW are not taken into account.

    Fixes: 24d3dc6d27ea ("net/sched: cls_u32: Reflect HW offload status")
    Signed-off-by: Ivan Vecera
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ivan Vecera
     

09 Nov, 2017

1 commit

  • Hold netns refcnt before call_rcu() and release it after
    the tcf_exts_destroy() is done.

    Note, on ->destroy() path we have to respect the return value
    of tcf_exts_get_net(), on other paths it should always return
    true, so we don't need to care.

    Cc: Lucas Bates
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

29 Oct, 2017

1 commit

  • Defer the tcf_exts_destroy() in RCU callback to
    tc filter workqueue and get RTNL lock.

    Reported-by: Chris Mi
    Cc: Daniel Borkmann
    Cc: Jiri Pirko
    Cc: John Fastabend
    Cc: Jamal Hadi Salim
    Cc: "Paul E. McKenney"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

01 Sep, 2017

1 commit

  • TC filters when used as classifiers are bound to TC classes.
    However, there is a hidden difference when adding them in different
    orders:

    1. If we add tc classes before its filters, everything is fine.
    Logically, the classes exist before we specify their ID's in
    filters, it is easy to bind them together, just as in the current
    code base.

    2. If we add tc filters before the tc classes they bind, we have to
    do dynamic lookup in fast path. What's worse, this happens all
    the time not just once, because on fast path tcf_result is passed
    on stack, there is no way to propagate back to the one in tc filters.

    This hidden difference hurts performance silently if we have many tc
    classes in hierarchy.

    This patch intends to close this gap by doing the reverse binding when
    we create a new class, in this case we can actually search all the
    filters in its parent, match and fixup by classid. And because
    tcf_result is specific to each type of tc filter, we have to introduce
    a new ops for each filter to tell how to bind the class.

    Note, we still can NOT totally get rid of those class lookup in
    ->enqueue() because cgroup and flow filters have no way to determine
    the classid at setup time, they still have to go through dynamic lookup.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

26 Aug, 2017

1 commit

  • It is ugly to hide a u32-filter-specific pointer inside Qdisc,
    this breaks the TC layers:

    1. Qdisc is a generic representation, should not have any specific
    data of any type

    2. Qdisc layer is above filter layer, should only save filters in
    the list of struct tcf_proto.

    This pointer is used as the head of the chain of u32 hash tables,
    that is struct tc_u_hnode, because u32 filter is very special,
    it allows to create multiple hash tables within one qdisc and
    across multiple u32 filters.

    Instead of using this ugly pointer, we can just save it in a global
    hash table key'ed by (dev ifindex, qdisc handle), therefore we can
    still treat it as a per qdisc basis data structure conceptually.

    Of course, because of network namespaces, this key is not unique
    at all, but it is fine as we already have a pointer to Qdisc in
    struct tc_u_common, we can just compare the pointers when collision.

    And this only affects slow paths, has no impact to fast path,
    thanks to the pointer ->tp_c.

    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

12 Aug, 2017

1 commit


08 Aug, 2017

4 commits


05 Aug, 2017

1 commit


08 Jun, 2017

1 commit

  • We need to push the chain index down to the drivers, so they have the
    information to which chain the rule belongs. For now, no driver supports
    multichain offload, so only chain 0 is supported. This is needed to
    prevent chain squashes during offload for now. Later this will be used
    to implement multichain offload.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

22 Apr, 2017

1 commit

  • We could have a race condition where in ->classify() path we
    dereference tp->root and meanwhile a parallel ->destroy() makes it
    a NULL. Daniel cured this bug in commit d936377414fa
    ("net, sched: respect rcu grace period on cls destruction").

    This happens when ->destroy() is called for deleting a filter to
    check if we are the last one in tp, this tp is still linked and
    visible at that time. The root cause of this problem is the semantic
    of ->destroy(), it does two things (for non-force case):

    1) check if tp is empty
    2) if tp is empty we could really destroy it

    and its caller, if cares, needs to check its return value to see if it
    is really destroyed. Therefore we can't unlink tp unless we know it is
    empty.

    As suggested by Daniel, we could actually move the test logic to ->delete()
    so that we can safely unlink tp after ->delete() tells us the last one is
    just deleted and before ->destroy().

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Cc: Roi Dayan
    Cc: Daniel Borkmann
    Cc: John Fastabend
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    WANG Cong
     

14 Apr, 2017

1 commit


18 Feb, 2017

1 commit


10 Jan, 2017

1 commit


20 Sep, 2016

1 commit


23 Aug, 2016

1 commit

  • After commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
    we do dynamic allocation in tcf_exts_init(), therefore we need
    to handle the ENOMEM case properly.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

09 Jun, 2016

2 commits


08 Jun, 2016

3 commits

  • When offloading classifiers such as u32 or flower to hardware, and the
    qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes,
    since not all of them handle ingress, therefore we must leave those in
    software path. Add a .tcf_cl_offload() callback, so we can generically
    handle them, tested on ixgbe.

    Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
    Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support")
    Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Return an error if user requested skip-sw and the underlaying
    hardware cannot handle tc offloads (or offloads are disabled).

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • 'err' variable is not set in this test, we would return whatever
    previous test set 'err' to.

    Signed-off-by: Jakub Kicinski
    Acked-by: Sridhar Samudrala
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

17 May, 2016

1 commit

  • On devices that support TC U32 offloads, this flag enables a filter to be
    added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
    default without any flags, the filter is added to both HW and SW, but no
    error checks are done in case of failure to add to HW. With skip-sw,
    failure to add to HW is treated as an error.

    Here is a sample script that adds 2 filters, one with skip-sw and the other
    with skip-hw flag.

    # add ingress qdisc
    tc qdisc add dev p4p1 ingress

    # enable hw tc offload.
    ethtool -K p4p1 hw-tc-offload on

    # add u32 filter with skip-sw flag.
    tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
    handle 800:0:1 u32 ht 800: flowid 800:1 \
    skip-sw \
    match ip src 192.168.1.0/24 \
    action drop

    # add u32 filter with skip-hw flag.
    tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
    handle 800:0:2 u32 ht 800: flowid 800:2 \
    skip-hw \
    match ip src 192.168.2.0/24 \
    action drop

    Signed-off-by: Sridhar Samudrala
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Samudrala, Sridhar
     

27 Apr, 2016

1 commit


02 Mar, 2016

2 commits

  • In the initial implementation the only way to stop a rule from being
    inserted into the hardware table was via the device feature flag.
    However this doesn't work well when working on an end host system
    where packets are expect to hit both the hardware and software
    datapaths.

    For example we can imagine a rule that will match an IP address and
    increment a field. If we install this rule in both hardware and
    software we may increment the field twice. To date we have only
    added support for the drop action so we have been able to ignore
    these cases. But as we extend the action support we will hit this
    example plus more such cases. Arguably these are not even corner
    cases in many working systems these cases will be common.

    To avoid forcing the driver to always abort (i.e. the above example)
    this patch adds a flag to add a rule in software only. A careful
    user can use this flag to build software and hardware datapaths
    that work together. One example we have found particularly useful
    is to use hardware resources to set the skb->mark on the skb when
    the match may be expensive to run in software but a mark lookup
    in a hash table is cheap. The idea here is hardware can do in one
    lookup what the u32 classifier may need to traverse multiple lists
    and hash tables to compute. The flag is only passed down on inserts.
    On deletion to avoid stale references in hardware we always try
    to remove a rule if it exists.

    The flags field is part of the classifier specific options. Although
    it is tempting to lift this into the generic structure doing this
    proves difficult do to how the tc netlink attributes are implemented
    along with how the dump/change routines are called. There is also
    precedence for putting seemingly generic pieces in the specific
    classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So
    although not ideal I've left FLAGS in the u32 options as well as it
    simplifies the code greatly and user space has already learned how
    to manage these bits ala 'tc' tool.

    Another thing if trying to update a rule we require the flags to
    be unchanged. This is to force user space, software u32 and
    the hardware u32 to keep in sync. Thanks to Simon Horman for
    catching this case.

    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    John Fastabend
     
  • The offload decision was originally very basic and tied to if the dev
    implemented the appropriate ndo op hook. The next step is to allow
    the user to more flexibly define if any paticular rule should be
    offloaded or not. In order to have this logic in one function lift
    the current check into a helper routine tc_should_offload().

    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    John Fastabend
     

17 Feb, 2016

1 commit


26 Aug, 2015

1 commit

  • In commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    I added a check in u32_destroy() to see if all real filters are gone
    for each tp, however, that is only done for root_ht, same is needed
    for others.

    This can be reproduced by the following tc commands:

    tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
    tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
    ht 15:2: match ip src 10.0.0.2 flowid 1:10
    tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
    ht 15:2: match ip src 10.0.0.3 flowid 1:10

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Reported-by: Akshat Kakkar
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

21 Mar, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. It
    split functions in half and moved them around, when the real actual
    conflict only existed solely inside of one function, that being
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easily to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Mar, 2015

2 commits

  • We dynamically allocate divisor+1 entries for ->ht[] in tc_u_hnode:

    ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);

    So ->ht is supposed to be the last field of this struct, however
    this is broken, since an rcu head is appended after it.

    Fixes: 1ce87720d456 ("net: sched: make cls_u32 lockless")
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Kernel automatically creates a tp for each
    (kind, protocol, priority) tuple, which has handle 0,
    when we add a new filter, but it still is left there
    after we remove our own, unless we don't specify the
    handle (literally means all the filters under
    the tuple). For example this one is left:

    # tc filter show dev eth0
    filter parent 8001: protocol arp pref 49152 basic

    The user-space is hard to clean up these for kernel
    because filters like u32 are organized in a complex way.
    So kernel is responsible to remove it after all filters
    are gone. Each type of filter has its own way to
    store the filters, so each type has to provide its
    way to check if all filters are gone.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Cong Wang
     

10 Dec, 2014

1 commit


02 Oct, 2014

1 commit

  • This fixes the following crash:

    [ 63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
    [ 63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
    [ 63.980094] RIP: 0010:[] [] u32_destroy_key+0x27/0x6d
    [ 63.980094] RSP: 0018:ffff880117dffcc0 EFLAGS: 00010202
    [ 63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
    [ 63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
    [ 63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
    [ 63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
    [ 63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
    [ 63.980094] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 63.980094] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
    [ 63.980094] Stack:
    [ 63.980094] ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
    [ 63.980094] ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
    [ 63.980094] ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
    [ 63.980094] Call Trace:
    [ 63.980094] [] u32_delete_key_freepf_rcu+0x1b/0x1d
    [ 63.980094] [] rcu_process_callbacks+0x3bb/0x691
    [ 63.980094] [] ? rcu_process_callbacks+0x2c1/0x691
    [ 63.980094] [] ? u32_destroy_key+0x6d/0x6d
    [ 63.980094] [] __do_softirq+0x142/0x323
    [ 63.980094] [] run_ksoftirqd+0x23/0x53
    [ 63.980094] [] smpboot_thread_fn+0x203/0x221
    [ 63.980094] [] ? smpboot_unpark_thread+0x33/0x33
    [ 63.980094] [] kthread+0xc9/0xd1
    [ 63.980094] [] ? do_wait_for_common+0xf8/0x125
    [ 63.980094] [] ? __kthread_parkme+0x61/0x61
    [ 63.980094] [] ret_from_fork+0x7c/0xb0
    [ 63.980094] [] ? __kthread_parkme+0x61/0x61

    tp could be freed in call_rcu callback too, the order is not guaranteed.

    John Fastabend says:

    ====================
    Its worth noting why this is safe. Any running schedulers will either
    read the valid class field or it will be zeroed.

    All schedulers today when the class is 0 do a lookup using the
    same call used by the tcf_exts_bind(). So even if we have a running
    classifier hit the null class pointer it will do a lookup and get
    to the same result. This is particularly fragile at the moment because
    the only way to verify this is to audit the schedulers call sites.
    ====================

    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    WANG Cong
     

29 Sep, 2014

1 commit


23 Sep, 2014

3 commits

  • $ grep CONFIG_CLS_U32_MARK .config
    # CONFIG_CLS_U32_MARK is not set

    net/sched/cls_u32.c: In function 'u32_change':
    net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used
    [-Wunused-label]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Changes to the cls_u32 classifier must appear atomic to the
    readers. Before this patch if a change is requested for both
    the exts and ifindex, first the ifindex is updated then the
    exts with tcf_exts_change(). This opens a small window where
    a reader can have a exts chain with an incorrect ifindex. This
    violates the the RCU semantics.

    Here we resolve this by always passing u32_set_parms() a copy
    of the tc_u_knode to work on and then inserting it into the hash
    table after the updates have been successfully applied.

    Tested with the following short script:

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
    u32 divisor 256

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
    u32 link 1: hashkey mask ffffff00 at 12 \
    match ip src 192.168.8.0/2

    #tc filter add dev p3p2 parent 8001:0 protocol ip prio 102 \
    handle 1::10 u32 classid 1:2 ht 1: \
    match ip src 192.168.8.0/8 match ip tos 0x0a 1e

    #tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
    handle 1::10 u32 classid 1:2 ht 1: \
    match ip src 1.1.0.0/8 match ip tos 0x0b 1e

    CC: Eric Dumazet
    CC: Jamal Hadi Salim
    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This fixes a missed free_percpu in the unwind code path and when
    keys are destroyed.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend