06 Oct, 2020

1 commit


25 Sep, 2020

1 commit

  • All TC actions call tcf_idr_insert() for a new action at the end
    of their ->init(), so we can move the call to a central place
    in tcf_action_init_1().

    Once the action is inserted into the global IDR, another parallel
    process could free it immediately, as its refcnt is still 1. We
    therefore cannot fail after insertion, and need to move the insertion
    after the goto action validation to avoid handling a failure case
    once the action is already visible.

    This was found during code review and is not directly triggered by
    syzbot. It also prepares for the next patch.

    Cc: Vlad Buslov
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
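
The ordering constraint in the commit above (run every check that can fail before the object becomes globally visible) can be sketched in userspace C. This is only an illustration: `struct action`, `registry_insert()` and `action_init()` are hypothetical stand-ins, not the kernel's tcf_idr_insert() or tcf_action_init_1(), and the sketch has no concurrency or reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a TC action and the global IDR; not kernel API. */
struct action {
    int index;
    int goto_chain;   /* a negative value models an invalid goto target */
};

#define REGISTRY_SLOTS 8
static struct action *registry[REGISTRY_SLOTS];

/* Publishing step: after this, other contexts could find (and free) the action. */
static void registry_insert(struct action *a)
{
    assert(a->index >= 0 && a->index < REGISTRY_SLOTS);
    registry[a->index] = a;
}

/* Every check that can fail runs before the action becomes visible. */
static int action_init(struct action *a)
{
    if (a->index < 0 || a->index >= REGISTRY_SLOTS)
        return -1;
    if (a->goto_chain < 0)      /* models the goto action validation */
        return -1;
    registry_insert(a);         /* last step: nothing may fail after this */
    return 0;
}

static bool registry_has(int index)
{
    return index >= 0 && index < REGISTRY_SLOTS && registry[index] != NULL;
}
```

With this ordering, a rejected action never appears in the registry, so there is no "undo the insert" path that could race with another user of the published object.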
     

19 Aug, 2020

1 commit


04 Jul, 2020

1 commit

  • There are a couple of places in net/sched/ that check skb->protocol and act
    on the value there. However, in the presence of VLAN tags, the value stored
    in skb->protocol can be inconsistent based on whether VLAN acceleration is
    enabled. The commit quoted in the Fixes tag below fixed the users of
    skb->protocol to use a helper that will always see the VLAN ethertype.

    However, most of the callers don't actually handle the VLAN ethertype, but
    expect to find the IP header type in the protocol field. This means that
    things like changing the ECN field, or parsing diffserv values, stop
    working if there's a VLAN tag, or if there are multiple nested VLAN
    tags (QinQ).

    To fix this, change the helper to take an argument that indicates whether
    the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
    make sure to skip all of them, so behaviour is consistent even in QinQ
    mode.

    To make the helper usable from the ECN code, move it to if_vlan.h instead
    of pkt_sched.h.

    v3:
    - Remove empty lines
    - Move vlan variable definitions inside loop in skb_protocol()
    - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
    bpf_skb_ecn_set_ce()

    v2:
    - Use eth_type_vlan() helper in skb_protocol()
    - Also fix code that reads skb->protocol directly
    - Change a couple of 'if/else if' statements to switch constructs to avoid
    calling the helper twice

    Reported-by: Ilya Ponetayev
    Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
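
The "skip all VLAN tags" behaviour described in the commit above can be sketched on a raw Ethernet frame. This is a userspace approximation under stated assumptions: `frame_protocol()` is a hypothetical name, the tags are assumed to be present in the frame data (the non-accelerated case), and only the 802.1Q and 802.1ad tag types are recognised. It is not the kernel's skb_protocol() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4  0x0800
#define ETHERTYPE_VLAN  0x8100  /* 802.1Q */
#define ETHERTYPE_QINQ  0x88A8  /* 802.1ad */

static bool is_vlan_ethertype(uint16_t proto)
{
    return proto == ETHERTYPE_VLAN || proto == ETHERTYPE_QINQ;
}

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the EtherType of the frame. With skip_vlan set, walk past every
 * VLAN tag and return the encapsulated protocol. Returns 0 if the frame
 * is too short.
 */
static uint16_t frame_protocol(const uint8_t *frame, size_t len, bool skip_vlan)
{
    size_t off = 12;            /* EtherType follows the two MAC addresses */
    uint16_t proto;

    assert(frame != NULL);
    if (len < off + 2)
        return 0;
    proto = read_be16(frame + off);
    if (!skip_vlan)
        return proto;

    /* A tag is 4 bytes: the type just read plus a 2-byte TCI. */
    while (is_vlan_ethertype(proto)) {
        off += 4;
        if (len < off + 2)
            return 0;
        proto = read_be16(frame + off);
    }
    return proto;
}
```

For a QinQ frame carrying IPv4, a caller that passes `skip_vlan = false` sees 0x88A8, while a caller that passes `true` sees 0x0800, which is what ECN and diffserv code needs.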
     

19 Jan, 2020

1 commit

  • Implement a cleanup method to properly free ci->params

    BUG: memory leak
    unreferenced object 0xffff88811746e2c0 (size 64):
    comm "syz-executor617", pid 7106, jiffies 4294943055 (age 14.250s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    c0 34 60 84 ff ff ff ff 00 00 00 00 00 00 00 00 .4`.............
    backtrace:
    kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    slab_post_alloc_hook mm/slab.h:586 [inline]
    slab_alloc mm/slab.c:3320 [inline]
    kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
    kmalloc include/linux/slab.h:556 [inline]
    kzalloc include/linux/slab.h:670 [inline]
    tcf_ctinfo_init+0x21a/0x530 net/sched/act_ctinfo.c:236
    tcf_action_init_1+0x400/0x5b0 net/sched/act_api.c:944
    tcf_action_init+0x135/0x1c0 net/sched/act_api.c:1000
    tcf_action_add+0x9a/0x200 net/sched/act_api.c:1410
    tc_ctl_action+0x14d/0x1bb net/sched/act_api.c:1465
    rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
    netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
    rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
    netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
    netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:639 [inline]
    sock_sendmsg+0x54/0x70 net/socket.c:659
    ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
    ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
    __sys_sendmsg+0x80/0xf0 net/socket.c:2417
    __do_sys_sendmsg net/socket.c:2426 [inline]
    __se_sys_sendmsg net/socket.c:2424 [inline]
    __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424

    Fixes: 24ec483cec98 ("net: sched: Introduce act_ctinfo action")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Kevin 'ldir' Darbyshire-Bryant
    Cc: Cong Wang
    Cc: Toke Høiland-Jørgensen
    Acked-by: Kevin 'ldir' Darbyshire-Bryant
    Signed-off-by: David S. Miller

    Eric Dumazet
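
The leak fixed in the commit above follows a common shape: an init path allocates a parameter block, and nothing releases it when the owner goes away. A minimal userspace sketch of pairing the allocation with a cleanup hook follows. All names here (`struct ctinfo_like`, `ctinfo_like_init()`, `ctinfo_like_cleanup()`, `live_allocations`) are hypothetical, and the sketch ignores the RCU readers that an in-kernel cleanup would also have to respect.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the commit refers to the real field as ci->params. */
struct params {
    unsigned int dscpmask;
    unsigned int statemask;
};

struct ctinfo_like {
    struct params *params;
};

static int live_allocations;    /* counts outstanding params blocks */

static int ctinfo_like_init(struct ctinfo_like *ci)
{
    ci->params = calloc(1, sizeof(*ci->params));
    if (!ci->params)
        return -1;
    live_allocations++;
    return 0;
}

/* Cleanup hook: without it the params block outlives its owner and leaks. */
static void ctinfo_like_cleanup(struct ctinfo_like *ci)
{
    if (ci->params) {
        free(ci->params);
        ci->params = NULL;
        live_allocations--;
    }
    assert(live_allocations >= 0);
}
```

Clearing the pointer after freeing makes the hook safe to call more than once, which is a design choice of this sketch rather than something the commit states.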
     

27 Nov, 2019

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Dynamic tick (nohz) updates, perhaps most notably changes to force
    the tick on when needed due to lengthy in-kernel execution on CPUs
    on which RCU is waiting.

    - Linux-kernel memory consistency model updates.

    - Replace rcu_swap_protected() with rcu_replace_pointer().

    - Torture-test updates.

    - Documentation updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
    security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
    bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
    fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
    drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
    drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
    x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
    rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
    rcu: Suppress levelspread uninitialized messages
    rcu: Fix uninitialized variable in nocb_gp_wait()
    rcu: Update descriptions for rcu_future_grace_period tracepoint
    rcu: Update descriptions for rcu_nocb_wake tracepoint
    rcu: Remove obsolete descriptions for rcu_barrier tracepoint
    rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
    workqueue: Convert for_each_wq to use built-in list check
    rcu: Several rcu_segcblist functions can be static
    rcu: Remove unused function hlist_bl_del_init_rcu()
    Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
    ...

    Linus Torvalds
     

31 Oct, 2019

2 commits

  • Extend struct tc_action with new "tcfa_flags" field. Set the field in
    tcf_idr_create() function and provide new helper
    tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
    value. Update individual hardware-offloaded actions init() to pass their
    "flags" argument to new helper in order to skip percpu stats allocation
    when user requested it through flags.

    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • Extend TCA_ACT space with nla_bitfield32 flags. Add
    TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
    tcf_action_init_1() and pass resulting value as additional argument to
    a_o->init().

    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     

30 Oct, 2019

1 commit

  • This commit replaces the use of rcu_swap_protected() with the more
    intuitively appealing rcu_replace_pointer() as a step towards removing
    rcu_swap_protected().

    Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
    Reported-by: Linus Torvalds
    [ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
    Signed-off-by: Paul E. McKenney
    Cc: Jamal Hadi Salim
    Cc: Cong Wang
    Cc: Jiri Pirko
    Cc: "David S. Miller"

    Paul E. McKenney
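
The calling convention the commit above moves to is "store the new pointer, get the old one back as the return value". A userspace sketch of that shape follows. It uses a C11 atomic exchange as a stand-in, which is not how the kernel primitive is implemented, and it does not model grace periods or update-side locking. `struct config`, `replace_config()` and `read_config_value()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int value;
};

/* Slot that readers load from; stands in for an RCU-protected pointer. */
static _Atomic(struct config *) current_config;

/*
 * Publish new_cfg and return the previous pointer to the caller.
 * Only the swap is modelled: deferring the free of the old object
 * until readers are done is outside this sketch.
 */
static struct config *replace_config(struct config *new_cfg)
{
    return atomic_exchange_explicit(&current_config, new_cfg,
                                    memory_order_acq_rel);
}

static int read_config_value(void)
{
    struct config *cfg = atomic_load_explicit(&current_config,
                                              memory_order_acquire);

    assert(cfg != NULL);    /* sketch assumes a config was published first */
    return cfg->value;
}
```

Returning the old pointer lets the caller write `old = replace_config(new);` and then dispose of `old`, instead of swapping through a second variable.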
     

28 Aug, 2019

1 commit

  • The net pointer in struct xt_tgdtor_param is not explicitly
    initialized and is therefore still NULL when it is dereferenced.
    So we have to find a way to pass the correct net pointer to
    ipt_destroy_target().

    The best way I have found is to save the net pointer inside the
    per-netns struct tcf_idrinfo, which keeps this patch small.

    Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
    Reported-and-tested-by: itugrok@yahoo.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

06 Aug, 2019

1 commit

  • Currently the init call of every action (except ipt) sets up its 'parm'
    structure as a direct pointer to nla data in the skb. This leads to a race
    condition when some of the filter actions were initialized successfully
    (and were assigned an idr action index that was written directly
    into the nla data), but were then deleted and retried (due to a following
    action module missing or a classifier-initiated retry). In that case the
    action init code tries to insert the action into the idr with the index
    that was assigned on the previous iteration. During the retry the index
    can be reused by another action that was inserted concurrently, which
    causes unintended action sharing between filters.

    To fix the described race condition, save the action idr index in a
    temporary stack-allocated variable instead of in the nla data.

    Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
    Signed-off-by: Dmytro Linkin
    Signed-off-by: Vlad Buslov
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dmytro Linkin
     

18 Jun, 2019

2 commits

  • Fix the nla_policy definition by specifying an exact-length attribute type
    for the CTINFO action parameter block structure. Without this change,
    netlink parsing will fail validation and the action will not be
    instantiated.

    8cb081746c03 ("netlink: make validation more configurable for future")
    introduced much stricter checking to attributes being passed via
    netlink. Existing actions were updated to use less restrictive
    deprecated versions of nla_parse_nested.

    As a new module, act_ctinfo should be designed to use the strict
    checking model otherwise, well, what was the point of implementing it.

    Confession time: Until very recently, development of this module has
    been done on 'net-next' tree to 'clean compile' level with run-time
    testing on backports to 4.14 & 4.19 kernels under openwrt. This is how
    I managed to miss the run-time impacts of the new strict
    nla_parse_nested function. I hopefully have learned something from this
    (glances toward laptop running a net-next kernel)

    There is however a still outstanding implication on iproute2 user space
    in that it needs to be told to pass nested netlink messages with the
    nested attribute actually set. So even with this kernel fix to do
    things correctly you still cannot instantiate a new 'strict'
    nla_parse_nested based action such as act_ctinfo with iproute2's tc.

    Signed-off-by: Kevin Darbyshire-Bryant
    Signed-off-by: David S. Miller

    Kevin Darbyshire-Bryant
     
  • Use correct return value on action creation: ACT_P_CREATED.

    The use of incorrect return value could result in a situation where the
    system thought a ctinfo module was listening but actually wasn't
    instantiated correctly leading to an OOPS in tcf_generic_walker().

    Confession time: Until very recently, development of this module has
    been done on 'net-next' tree to 'clean compile' level with run-time
    testing on backports to 4.14 & 4.19 kernels under openwrt. During the
    back & forward porting during development & testing, the critical
    ACT_P_CREATED return code got missed despite being in the 4.14 & 4.19
    backports. I have now gone through the init functions, using act_csum
    as reference with a fine toothed comb. Bonus, no more OOPSes. I
    managed to also miss this issue till now due to the new strict
    nla_parse_nested function failing validation before action creation.

    As an inexperienced developer I've learned that
    copy/pasting/backporting/forward porting code correctly is hard. If I
    ever get to a developer conference I shall don the cone of shame.

    Signed-off-by: Kevin Darbyshire-Bryant
    Signed-off-by: David S. Miller

    Kevin Darbyshire-Bryant
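
For the nla_policy fix described in the first commit of this group, the following sketch shows what an exact-length rule checks compared with a minimum-length rule. It is a simplified model: `struct act_params`, `enum len_rule`, `struct attr_policy` and `attr_len_ok()` are hypothetical, and the sketch does not reproduce the kernel's strict-validation code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block; the real layout comes from the UAPI header. */
struct act_params {
    uint32_t index;
    uint32_t capab;
    int32_t  action;
    int32_t  refcnt;
    int32_t  bindcnt;
};

enum len_rule { LEN_MIN, LEN_EXACT };

/* Hypothetical policy entry: how one attribute's payload length is checked. */
struct attr_policy {
    enum len_rule rule;
    size_t len;
};

static bool attr_len_ok(const struct attr_policy *pol, size_t payload_len)
{
    assert(pol != NULL);
    if (pol->rule == LEN_EXACT)
        return payload_len == pol->len;
    return payload_len >= pol->len;
}
```

An exact-length rule rejects both short and oversized payloads, which suits a fixed-size binary structure passed as a single attribute.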
     

14 Jun, 2019

1 commit


31 May, 2019

1 commit


30 May, 2019

1 commit

  • ctinfo is a new tc filter action module. It is designed to restore
    information contained in firewall conntrack marks to other packet fields
    and is typically used on packet ingress paths. At present it has two
    independent sub-functions or operating modes, DSCP restoration mode &
    skb mark restoration mode.

    The DSCP restore mode:

    This mode copies DSCP values that have been placed in the firewall
    conntrack mark back into the IPv4/v6 diffserv fields of relevant
    packets.

    The DSCP restoration is intended for use and has been found useful for
    restoring ingress classifications based on egress classifications across
    links that bleach or otherwise change DSCP, typically home ISP Internet
    links. Restoring DSCP on ingress on the WAN link allows qdiscs such as
    but by no means limited to CAKE to shape inbound packets according to
    policies that are easier to set & mark on egress.

    Ingress classification is traditionally a challenging task since
    iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
    lookups, hence are unable to see internal IPv4 addresses as used on the
    typical home masquerading gateway. Thus marking the connection in some
    manner on egress for later restoration of classification on ingress is
    easier to implement.

    Parameters related to DSCP restore mode:

    dscpmask - a 32 bit mask of 6 contiguous bits that indicates which bits of
    the conntrack mark field contain the DSCP value to be restored.

    statemask - a 32 bit mask of (usually) 1 bit length, outside the area
    specified by dscpmask. This represents a conditional operation flag
    whereby the DSCP is only restored if the flag is set. This is useful to
    implement a 'one shot' iptables based classification where the
    'complicated' iptables rules are only run once to classify the
    connection on initial (egress) packet and subsequent packets are all
    marked/restored with the same DSCP. A mask of zero disables the
    conditional behaviour, i.e. the conntrack mark DSCP bits are always
    restored to the IP diffserv field (assuming the conntrack entry is found
    & the skb is an IPv4/IPv6 type).

    e.g. dscpmask 0xfc000000 statemask 0x01000000

    |----0xFC----conntrack mark----000000---|
    | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
    | DSCP       | unused | flag  |unused   |
    |-----------------------0x01---000000---|
          |                   |
          |                   |
          ---|             Conditional flag
             v             only restore if set
    |-ip diffserv-|
    | 6 bits      |
    |-------------|
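
The DSCP restore arithmetic described above (extract six bits from the conntrack mark, optionally gated by a flag bit, and write them into the upper six bits of the diffserv byte) can be sketched as follows. This is an illustration, not the module's code: `restore_dsfield()`, `dscpmask_valid()` and `lowest_set_bit()` are hypothetical names, and preserving the two ECN bits of the existing byte is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ECN_MASK 0x03u  /* low two bits of the diffserv byte carry ECN */

/* Position of the lowest set bit; the mask must be non-zero. */
static unsigned int lowest_set_bit(uint32_t mask)
{
    unsigned int shift = 0;

    assert(mask != 0);
    while (!(mask & 1u)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* True if the mask is exactly six contiguous bits. */
static bool dscpmask_valid(uint32_t dscpmask)
{
    return dscpmask != 0 &&
           (dscpmask >> lowest_set_bit(dscpmask)) == 0x3fu;
}

/*
 * Return the diffserv byte after restoration. The DSCP bits are replaced
 * only if statemask is zero or the flag selected by statemask is set in
 * the mark; otherwise old_ds is returned unchanged.
 */
static uint8_t restore_dsfield(uint8_t old_ds, uint32_t ct_mark,
                               uint32_t dscpmask, uint32_t statemask)
{
    uint8_t dscp;

    if (!dscpmask_valid(dscpmask))
        return old_ds;
    if (statemask && !(ct_mark & statemask))
        return old_ds;

    dscp = (uint8_t)((ct_mark & dscpmask) >> lowest_set_bit(dscpmask));
    return (uint8_t)((dscp << 2) | (old_ds & ECN_MASK));
}
```

With dscpmask 0xfc000000 and statemask 0x01000000, a mark of 0xB9000000 carries DSCP 46 in bits 31-26 with the flag in bit 24 set, so the DSCP bits of the diffserv byte become 46; the same mark without bit 24 leaves the byte untouched.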

    The skb mark restore mode (cpmark):

    This mode copies the firewall conntrack mark to the skb's mark field.
    It is functionally equivalent to the existing act_connmark action, with
    the additional feature of being able to apply a mask to the restored
    value.

    Parameters related to skb mark restore mode:

    mask - a 32 bit mask applied to the firewall conntrack mark to mask out
    bits unwanted for restoration. This can be useful where the conntrack
    mark is being used for different purposes by different applications. If
    not specified, the whole mark field is copied (i.e. a default mask of
    0xffffffff).

    e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
    aforementioned DSCP restore mode.

    |----0x00----conntrack mark----ffffff---|
    | Bits 31-24 |                          |
    | DSCP & flag|      some value here     |
    |---------------------------------------|
                            |
                            |
                            v
    |------------skb mark-------------------|
    |            |                          |
    |  zeroed    |                          |
    |---------------------------------------|
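
The skb mark restore mode reduces to a single masked copy. A minimal sketch follows, with a hypothetical `struct packet` standing in for the skb and a hypothetical `restore_skb_mark()` helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPMARK_DEFAULT_MASK 0xffffffffu  /* copy the whole mark */

/* Hypothetical stand-in for the packet; only the mark field is modelled. */
struct packet {
    uint32_t mark;
};

/* Copy the conntrack mark into the packet mark, keeping only masked bits. */
static void restore_skb_mark(struct packet *pkt, uint32_t ct_mark,
                             uint32_t mask)
{
    assert(pkt != NULL);
    pkt->mark = ct_mark & mask;
}
```

With mask 0x00ffffff, a conntrack mark of 0xB9123456 yields a packet mark of 0x00123456: the top 8 bits used by DSCP restore mode are zeroed and the rest is copied unchanged.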

    Overall parameters:

    zone - conntrack zone

    control - action related control (reclassify | pipe | drop | continue |
    ok | goto chain )

    Signed-off-by: Kevin Darbyshire-Bryant
    Reviewed-by: Toke Høiland-Jørgensen
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Kevin 'ldir' Darbyshire-Bryant