12 Sep, 2013

1 commit

  • I found the following pattern that leads to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    The __put_user() calls appear in compat_ioctl.c, ptrace compat, and signal
    compat. Since those are in compat code, we can probably expect the kernel
    addresses not to be reachable in the lower 32-bit range, so I think they
    might not be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worst
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired from x86. This could
    lead to information leak on alpha. I also noticed that many architectures
    map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
    wonder if the latter performs the access checks on every
    architecture.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

11 Sep, 2013

1 commit

  • Pull nfsd updates from Bruce Fields:
    "This was a very quiet cycle! Just a few bugfixes and some cleanup"

    * 'nfsd-next' of git://linux-nfs.org/~bfields/linux:
    rpc: let xdr layer allocate gssproxy receieve pages
    rpc: fix huge kmalloc's in gss-proxy
    rpc: comment on linux_cred encoding, treat all as unsigned
    rpc: clean up decoding of gssproxy linux creds
    svcrpc: remove unused rq_resused
    nfsd4: nfsd4_create_clid_dir prints uninitialized data
    nfsd4: fix leak of inode reference on delegation failure
    Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR"
    sunrpc: prepare NFS for 2038
    nfsd4: fix setlease error return
    nfsd: nfs4_file_get_access: need to be more careful with O_RDWR

    Linus Torvalds
     

10 Sep, 2013

2 commits

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    - Fix NFSv4 recovery so that it doesn't recover lost locks in cases
    such as lease loss due to a network partition, where doing so may
    result in data corruption. Add a kernel parameter to control
    choice of legacy behaviour or not.
    - Performance improvements when 2 processes are writing to the same
    file.
    - Flush data to disk when an RPCSEC_GSS session timeout is imminent.
    - Implement NFSv4.1 SP4_MACH_CRED state protection to prevent other
    NFS clients from being able to manipulate our lease and file
    locking state.
    - Allow sharing of RPCSEC_GSS caches between different rpc clients.
    - Fix the broken NFSv4 security auto-negotiation between client and
    server.
    - Fix rmdir() to wait for outstanding sillyrename unlinks to complete
    - Add a tracepoint framework for debugging NFSv4 state recovery
    issues.
    - Add tracing to the generic NFS layer.
    - Add tracing for the SUNRPC socket connection state.
    - Clean up the rpc_pipefs mount/umount event management.
    - Merge more patches from Chuck in preparation for NFSv4 migration
    support"

    * tag 'nfs-for-3.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (107 commits)
    NFSv4: use mach cred for SECINFO_NO_NAME w/ integrity
    NFS: nfs_compare_super shouldn't check the auth flavour unless 'sec=' was set
    NFSv4: Allow security autonegotiation for submounts
    NFSv4: Disallow security negotiation for lookups when 'sec=' is specified
    NFSv4: Fix security auto-negotiation
    NFS: Clean up nfs_parse_security_flavors()
    NFS: Clean up the auth flavour array mess
    NFSv4.1 Use MDS auth flavor for data server connection
    NFS: Don't check lock owner compatability unless file is locked (part 2)
    NFS: Don't check lock owner compatibility in writes unless file is locked
    nfs4: Map NFS4ERR_WRONG_CRED to EPERM
    nfs4.1: Add SP4_MACH_CRED write and commit support
    nfs4.1: Add SP4_MACH_CRED stateid support
    nfs4.1: Add SP4_MACH_CRED secinfo support
    nfs4.1: Add SP4_MACH_CRED cleanup support
    nfs4.1: Add state protection handler
    nfs4.1: Minimal SP4_MACH_CRED implementation
    SUNRPC: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid
    SUNRPC: Add an identifier for struct rpc_clnt
    SUNRPC: Ensure rpc_task->tk_pid is available for tracepoints
    ...

    Linus Torvalds
     
  • Pull ceph updates from Sage Weil:
    "This includes both the first pile of Ceph patches (which I sent to
    torvalds@vger, sigh) and a few new patches that add support for
    fscache for Ceph. That includes a few fscache core fixes that David
    Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski
    for putting this feature together)

    This first batch of patches (included here) had (has) several
    important RBD bug fixes, hole punch support, several different
    cleanups in the page cache interactions, improvements in the truncate
    code (new truncate mutex to avoid shenanigans with i_mutex), and a
    series of fixes in the synchronous striping read/write code.

    On top of that is a random collection of small fixes all across the
    tree (error code checks and error path cleanup, obsolete wq flags,
    etc)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
    ceph: use d_invalidate() to invalidate aliases
    ceph: remove ceph_lookup_inode()
    ceph: trivial buildbot warnings fix
    ceph: Do not do invalidate if the filesystem is mounted nofsc
    ceph: page still marked private_2
    ceph: ceph_readpage_to_fscache didn't check if marked
    ceph: clean PgPrivate2 on returning from readpages
    ceph: use fscache as a local presisent cache
    fscache: Netfs function for cleanup post readpages
    FS-Cache: Fix heading in documentation
    CacheFiles: Implement interface to check cache consistency
    FS-Cache: Add interface to check consistency of a cached object
    rbd: fix null dereference in dout
    rbd: fix buffer size for writes to images with snapshots
    libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
    rbd: fix I/O error propagation for reads
    ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
    ceph: allow sync_read/write return partial successed size of read/write.
    ceph: fix bugs about handling short-read for sync read mode.
    ceph: remove useless variable revoked_rdcache
    ...

    Linus Torvalds
     

08 Sep, 2013

2 commits

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A quick set of fixes, some to deal with fallout from yesterday's
    net-next merge.

    1) Fix compilation of bnx2x driver with CONFIG_BNX2X_SRIOV disabled,
    from Dmitry Kravkov.

    2) Fix a bnx2x regression caused by one of Dave Jones's mistaken
    braces changes, from Eilon Greenstein.

    3) Add some protective filtering in the netlink tap code, from Daniel
    Borkmann.

    4) Fix TCP congestion window growth regression after timeouts, from
    Yuchung Cheng.

    5) Correctly adjust TCP's rcv_ssthresh for out of order packets, from
    Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    tcp: properly increase rcv_ssthresh for ofo packets
    net: add documentation for BQL helpers
    mlx5: remove unused MLX5_DEBUG param in Kconfig
    bnx2x: Restore a call to config_init
    bnx2x: fix broken compilation with CONFIG_BNX2X_SRIOV is not set
    tcp: fix no cwnd growth after timeout
    net: netlink: filter particular protocols from analyzers

    Linus Torvalds
     

07 Sep, 2013

6 commits

  • TCP receive window handling is multi-staged.

    A socket has a memory budget, static or dynamic, in sk_rcvbuf.

    Because we do not really know how this memory budget translates to
    a TCP window (payload), TCP announces a small initial window
    (about 20 MSS).

    When a packet is received, we increase TCP rcv_win depending
    on the payload/truesize ratio of this packet. Good citizen
    packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2

    This heuristic takes place in tcp_grow_window().

    The problem is that we currently call tcp_grow_window() only for
    in-order packets.

    This means that reorders or packet losses stop rcv_win from growing
    properly, and senders are unable to benefit from fast recovery
    or proper reordering level detection.

    Really, a packet being stored in the OFO queue is not a bad citizen.
    It should be part of the game, just like in-order packets.

    In our traces, we very often see that the sender is limited by Linux's
    small receive windows, even if Linux hosts use autotuning (DRS) and
    should allow rcv_win to grow to ~3MB.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit 0f7cc9a3 "tcp: increase throughput when reordering is high"
    only allows cwnd to increase in Open state. This mistakenly disables
    slow start after timeout (CA_Loss). Moreover, cwnd won't grow if the
    state moves from Disorder to Open later in tcp_fastretrans_alert().

    Therefore the correct logic should be to allow cwnd to grow as long
    as the data is received in order in Open, Loss, or even Disorder state.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Fix finer-grained control and let only a whitelist of allowed netlink
    protocols pass, in our case those related to networking. If later on
    other subsystems decide they want to add their protocol to the list
    of allowed protocols as well, they shall simply add it. While at it, we
    also need to tell what protocol is in use, otherwise BPF_S_ANC_PROTOCOL
    cannot pick it up (as it's not filled out).

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Patches for Ceph FS-Cache support

    Milosz Tanski
     
  • Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree -- mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
    doc: Documentation/cputopology.txt fix typo
    treewide: Convert retrun typos to return
    Fix comment typo for init_cma_reserved_pageblock
    Documentation/trace: Correcting and extending tracepoint documentation
    mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
    power: Documentation: Update s2ram link
    doc: fix a typo in Documentation/00-INDEX
    Documentation/printk-formats.txt: No casts needed for u64/s64
    doc: Fix typo "is is" in Documentations
    treewide: Fix printks with 0x%#
    zram: doc fixes
    Documentation/kmemcheck: update kmemcheck documentation
    doc: documentation/hwspinlock.txt fix typo
    PM / Hibernate: add section for resume options
    doc: filesystems : Fix typo in Documentations/filesystems
    scsi/megaraid fixed several typos in comments
    ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
    treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
    page_isolation: Fix a comment typo in test_pages_isolated()
    doc: fix a typo about irq affinity
    ...

    Linus Torvalds
     
  • Pull HID updates from Jiri Kosina:
    "Highlights:

    - conversion of HID subsystem to use devm-based resource management,
    from Benjamin Tissoires

    - i2c-hid support for DT bindings, from Benjamin Tissoires

    - much improved support for Win8-multitouch devices, from Benjamin
    Tissoires

    - cleanup of core code using common hidinput_input_event(), from
    David Herrmann

    - fix for bug in implement() access to the bit stream (causing oops)
    that has been present in the code for ages, but devices that are
    able to trigger it have started to appear only now, from Jiri
    Kosina

    - fixes for CVE-2013-2899, CVE-2013-2898, CVE-2013-2896,
    CVE-2013-2892, CVE-2013-2888 (all triggerable only by specially
    crafted malicious HW devices plugged into the system), from Kees
    Cook

    - hidraw oops fix, from Manoj Chourasia

    - various smaller fixes here and there, support for a bunch of new
    devices by various contributors"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (53 commits)
    HID: MAINTAINERS: add roccat drivers
    HID: hid-sensor-hub: change kmalloc + memcpy by kmemdup
    HID: hid-sensor-hub: move to devm_kzalloc
    HID: hid-sensor-hub: fix indentation accross the code
    HID: move HID_REPORT_TYPES closer to the report-definitions
    HID: check for NULL field when setting values
    HID: picolcd_core: validate output report details
    HID: sensor-hub: validate feature report details
    HID: ntrig: validate feature report details
    HID: pantherlord: validate output report details
    HID: hid-wiimote: print small buffers via %*phC
    HID: uhid: improve uhid example client
    HID: Correct the USB IDs for the new Macbook Air 6
    HID: wiimote: add support for Guitar-Hero guitars
    HID: wiimote: add support for Guitar-Hero drums
    Input: introduce BTN/ABS bits for drums and guitars
    HID: battery: don't do DMA from stack
    HID: roccat: add support for KonePureOptical v2
    HID: picolcd: Prevent NULL pointer dereference on _remove()
    HID: usbhid: quirk for N-Trig DuoSense Touch Screen
    ...

    Linus Torvalds
     

06 Sep, 2013

13 commits

  • In theory the linux cred in a gssproxy reply can include up to
    NGROUPS_MAX data, 256K of data. In the common case we expect it to be
    shorter. So do as the nfsv3 ACL code does and let the xdr code allocate
    the pages as they come in, instead of allocating a lot of pages that
    won't typically be used.

    Tested-by: Simo Sorce
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • The reply to a gssproxy can include up to NGROUPS_MAX gid's, which will
    take up more than a page. We therefore need to allocate an array of
    pages to hold the reply instead of trying to allocate a single huge
    buffer.

    Tested-by: Simo Sorce
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • The encoding of linux creds is a bit confusing.

    Also: I think in practice it doesn't really matter whether we treat any
    of these things as signed or unsigned, but unsigned seems more
    straightforward: uid_t/gid_t are unsigned and it simplifies the ngroups
    overflow check.

    Tested-by: Simo Sorce
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • We can use the normal coding infrastructure here.

    Two minor behavior changes:

    - we're assuming no wasted space at the end of the linux cred.
    That seems to match gss-proxy's behavior, and I can't see why
    it would need to do differently in the future.

    - NGROUPS_MAX check added: note groups_alloc doesn't do this,
    this is the caller's responsibility.

    Tested-by: Simo Sorce
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Pull networking changes from David Miller:
    "Noteworthy changes this time around:

    1) Multicast rejoin support for team driver, from Jiri Pirko.

    2) Centralize and simplify TCP RTT measurement handling in order to
    reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
    both timestamps and local RTT measurements are available, prefer
    the latter because there are broken middleware devices which
    scramble the timestamp.

    From Yuchung Cheng.

    3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
    memory consumed to queue up unsent user data. From Eric Dumazet.

    4) Add a "physical port ID" abstraction for network devices, from
    Jiri Pirko.

    5) Add a "suppress" operation to influence fib_rules lookups, from
    Stefan Tomanek.

    6) Add a networking development FAQ, from Paul Gortmaker.

    7) Extend the information provided by tcp_probe and add ipv6 support,
    from Daniel Borkmann.

    8) Use RCU locking more extensively in openvswitch data paths, from
    Pravin B Shelar.

    9) Add SCTP support to openvswitch, from Joe Stringer.

    10) Add EF10 chip support to SFC driver, from Ben Hutchings.

    11) Add new SYNPROXY netfilter target, from Patrick McHardy.

    12) Compute a rate approximation for sending in TCP sockets, and use
    this to more intelligently coalesce TSO frames. Furthermore, add
    a new packet scheduler which takes advantage of this estimate when
    available. From Eric Dumazet.

    13) Allow AF_PACKET fanouts with random selection, from Daniel
    Borkmann.

    14) Add ipv6 support to vxlan driver, from Cong Wang"

    Resolved conflicts as per discussion.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
    openvswitch: Fix alignment of struct sw_flow_key.
    netfilter: Fix build errors with xt_socket.c
    tcp: Add missing braces to do_tcp_setsockopt
    caif: Add missing braces to multiline if in cfctrl_linkup_request
    bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
    vxlan: Fix kernel panic on device delete.
    net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
    net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
    icplus: Use netif_running to determine device state
    ethernet/arc/arc_emac: Fix huge delays in large file copies
    tuntap: orphan frags before trying to set tx timestamp
    tuntap: purge socket error queue on detach
    qlcnic: use standard NAPI weights
    ipv6:introduce function to find route for redirect
    bnx2x: VF RSS support - VF side
    bnx2x: VF RSS support - PF side
    vxlan: Notify drivers for listening UDP port changes
    net: usbnet: update addr_assign_type if appropriate
    driver/net: enic: update enic maintainers and driver
    driver/net: enic: Exposing symbols for Cisco's low latency driver
    ...

    Linus Torvalds
     
  • sw_flow_key alignment was declared as "__aligned(__alignof__(long))".
    However, this breaks on the m68k architecture, where long is 32 bits in
    size but only 16-bit aligned by default. This aligns to the size of a
    long to ensure that we can always do comparisons in full long-sized
    chunks. It also adds a build check to catch any reduction in alignment.

    CC: Andy Zhou
    Reported-by: Fengguang Wu
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • Conflicts:
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    net/bridge/br_multicast.c
    net/ipv6/sit.c

    The conflicts were minor:

    1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

    2) br_multicast.c had an overlap between computing max_delay using
    msecs_to_jiffies and turning MLDV2_MRC() into an inline function
    with a name using lowercase instead of uppercase letters.

    3) stmmac had two overlapping changes, one which conditionally allocated
    and hooked up a dma_cfg based upon the presence of the pbl OF property,
    and another one handling store-and-forward DMA mode. The latter
    should not go into the new of_find_property() basic block.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • As reported by Randy Dunlap:

    ====================
    when CONFIG_IPV6=m
    and CONFIG_NETFILTER_XT_MATCH_SOCKET=y:

    net/built-in.o: In function `socket_mt6_v1_v2':
    xt_socket.c:(.text+0x51b55): undefined reference to `udp6_lib_lookup'
    net/built-in.o: In function `socket_mt_init':
    xt_socket.c:(.init.text+0x1ef8): undefined reference to `nf_defrag_ipv6_enable'
    ====================

    Like several other modules under net/netfilter/ we have to
    have a dependency "IPV6 disabled or set compatibly with this
    module" clause.

    Reported-by: Randy Dunlap
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Dave Jones
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Dave Jones
     
  • The indentation here implies this was meant to be a multi-line if.

    Introduced several years back in commit c85c2951d4da1236e32f1858db418221e624aba5
    ("caif: Handle dev_queue_xmit errors.")

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     
  • RFC 4861 says that the IP source address of the Redirect is the
    same as the current first-hop router for the specified ICMP
    Destination Address, so the gateway should be taken into
    consideration when we find the route for redirect.

    There was once a check in commit
    a6279458c534d01ccc39498aba61c93083ee0372 ("NDISC: Search over
    all possible rules on receipt of redirect.") and the check
    went away in commit b94f1c0904da9b8bf031667afc48080ba7c3e8c9
    ("ipv6: Use icmpv6_notify() to propagate redirect, instead of
    rt6_redirect()").

    The bug is only "exploitable" on layer-2 because the source
    address of the redirect is checked to be a valid link-local
    address but it makes spoofing a lot easier in the same L2
    domain nonetheless.

    Thanks very much for Hannes's help.

    Signed-off-by: Duan Jiong
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Duan Jiong
     
  • The multicast snooping code should have matured enough to be safely
    applicable to IPv6 link-local multicast addresses (excluding the
    link-local all nodes address, ff02::1), too.

    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller

    Linus Lüssing
     
  • Currently if there is no listener for a certain group then IPv6 packets
    for that group are flooded on all ports, even though there might be no
    host and router interested in it on a port.

    With this commit they are only forwarded to ports with a multicast
    router.

    Just like commit bd4265fe36 ("bridge: Only flood unregistered groups
    to routers") did for IPv4, let's do the same for IPv6 with the same
    reasoning.

    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller

    Linus Lüssing
     

05 Sep, 2013

15 commits

  • Add an identifier in order to aid debugging.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull PTR_RET() removal patches from Rusty Russell:
    "PTR_RET() is a weird name, and led to some confusing usage. We ended
    up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.

    This has been sitting in linux-next for a whole cycle"

    [ There are still some PTR_RET users scattered about, with some of them
    possibly being new, but most of them existing in Rusty's tree too. We
    have that

    #define PTR_RET(p) PTR_ERR_OR_ZERO(p)

    thing in <linux/err.h>, so they continue to work for now - Linus ]

    * tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
    Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
    drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
    sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
    dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
    drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
    mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
    staging/zcache: don't use PTR_RET().
    remoteproc: don't use PTR_RET().
    pinctrl: don't use PTR_RET().
    acpi: Replace weird use of PTR_RET.
    s390: Replace weird use of PTR_RET.
    PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
    PTR_RET is now PTR_ERR_OR_ZERO

    Linus Torvalds
     
  • We already have mld_{gq,ifc,dad}_start_timer() functions, so introduce
    mld_{gq,ifc,dad}_stop_timer() functions to reduce code size and make it
    more readable.

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Make igmp6_event_query() a bit easier to read by refactoring code
    parts into mld_process_v1() and mld_process_v2().

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As we do for MLDv2 queries, set a forged MLDv1 query with a
    0 ms mld_maxdelay to the minimum timer shot time of 1 jiffy. This is
    eventually done in igmp6_group_queried() anyway, so we can simplify
    a check there.

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • RFC3810, 10. Security Considerations says under subsection 10.1.
    Query Message:

    A forged Version 1 Query message will put MLDv2 listeners on that
    link in MLDv1 Host Compatibility Mode. This scenario can be avoided
    by providing MLDv2 hosts with a configuration option to ignore
    Version 1 messages completely.

    Hence, implement a MLDv2-only mode that will ignore MLDv1 traffic:

    echo 2 > /proc/sys/net/ipv6/conf/ethX/force_mld_version or
    echo 2 > /proc/sys/net/ipv6/conf/all/force_mld_version

    Note that the device setting has higher precedence, as was previously
    also the case in the macro MLD_V1_SEEN(), whose if condition would
    "short-circuit" in that case.

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Get rid of MLDV2_MRC and use our new macros for mantissa and
    exponent to calculate Maximum Response Delay out of the Maximum
    Response Code.

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Replace the macro with a function to make it more readable. GCC will
    eventually decide whether to inline this or not (also, that's not
    fast-path anyway).

    Signed-off-by: Daniel Borkmann
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • i) RFC3810, 9.2. Query Interval [QI] says:

    The Query Interval variable denotes the interval between General
    Queries sent by the Querier. Default value: 125 seconds. [...]

    ii) RFC3810, 9.3. Query Response Interval [QRI] says:

    The Maximum Response Delay used to calculate the Maximum Response
    Code inserted into the periodic General Queries. Default value:
    10000 (10 seconds) [...] The number of seconds represented by the
    [Query Response Interval] must be less than the [Query Interval].

    iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says:

    The Older Version Querier Present Timeout is the time-out for
    transitioning a host back to MLDv2 Host Compatibility Mode. When an
    MLDv1 query is received, MLDv2 hosts set their Older Version Querier
    Present Timer to [Older Version Querier Present Timeout].

    This value MUST be ([Robustness Variable] times (the [Query Interval]
    in the last Query received)) plus ([Query Response Interval]).

    Hence, by *default* the timeout results in:

    [RV] = 2, [QI] = 125sec, [QRI] = 10sec
    [OVQPT] = [RV] * [QI] + [QRI] = 260sec

    Having that said, we currently calculate [OVQPT] (here given as 'switchback'
    variable) as ...

    switchback = (idev->mc_qrv + 1) * max_delay

    RFC3810, 9.12. says "the [Query Interval] in the last Query received". In
    section "9.14. Configuring timers", it is said:

    This section is meant to provide advice to network administrators on
    how to tune these settings to their network. Ambitious router
    implementations might tune these settings dynamically based upon
    changing characteristics of the network. [...]

    iv) RFC3810, 9.14.2. Query Interval:

    The overall level of periodic MLD traffic is inversely proportional
    to the Query Interval. A longer Query Interval results in a lower
    overall level of MLD traffic. The value of the Query Interval MUST
    be equal to or greater than the Maximum Response Delay used to
    calculate the Maximum Response Code inserted in General Query
    messages.

    I assume that was why switchback is calculated as is (3 * max_delay), although
    this setting seems to be meant for routers only to configure their [QI]
    interval for non-default intervals. So usage here like this is clearly wrong.

    In conclusion, the current behaviour in IPv6's multicast code does not
    conform to the RFC, as switchback is calculated wrongly. That is, its value
    is too small, so MLDv2 hosts switch back to MLDv2 way too early, i.e. after
    ~30secs instead of ~260secs by default.

    Hence, introduce necessary helper functions and fix this up properly as it
    should be.

    Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes
    Frederic Sowa, who also had a hand in this. Also thanks to Hangbin Liu,
    who did initial testing.

    Signed-off-by: Daniel Borkmann
    Cc: David Stevens
    Cc: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • In tcp_v6_do_rcv() code, when processing pkt options, we solely work
    on our skb clone opt_skb that we've created earlier before entering
    tcp_rcv_established() on our way. However, only in condition ...

    if (np->rxopt.bits.rxtclass)
    np->rcv_tclass = ipv6_get_dsfield(ipv6_hdr(skb));

    ... we work on skb itself. As we extract every other information out
    of opt_skb in ipv6_pktoptions path, this seems wrong, since skb can
    already be released by tcp_rcv_established() earlier on. When we try
    to access it in ipv6_hdr(), we will dereference freed skb.

    [ Bug added by commit 4c507d2897bd9b ("net: implement IP_RECVTOS for
    IP_PKTOPTIONS") ]

    Signed-off-by: Daniel Borkmann
    Cc: Eric Dumazet
    Acked-by: Eric Dumazet
    Acked-by: Jiri Benc
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit 1b7fdd2ab585 ("tcp: do not use cached RTT for RTT estimation")
    removed important comments on how RTO is initialized and updated.
    Hopefully this patch puts that information back.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Allocating skbs when sending out neighbour discovery messages
    currently uses sock_alloc_send_skb() based on a per net namespace
    socket and thus shares a socket wmem buffer space.

    If a netdevice is temporarily unable to transmit due to carrier
    loss or for other reasons, the queued up ndisc messages will consume
    all of the wmem space and will thus prevent any more skbs from
    being allocated, even for netdevices that are able to transmit packets.

    The number of neighbour discovery messages sent is very limited.
    Use of alloc_skb() bypasses the socket wmem buffer size enforcement,
    while the manual call to skb_set_owner_w() maintains the socket
    reference needed for the IPv6 output path.
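
    The difference between the two allocation paths can be sketched with a
    toy accounting model. Everything here is hypothetical (struct sock_model
    and both functions); it only assumes that the checked path refuses new
    buffers once the charged bytes reach the send buffer limit, while the
    unchecked path still charges the socket but never refuses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a socket's write-memory accounting. */
struct sock_model {
    size_t wmem_alloc;  /* bytes currently charged to the socket */
    size_t sndbuf;      /* limit enforced by the checked path only */
};

/* Models sock_alloc_send_skb(): refuses once the budget is used up. */
bool alloc_checked(struct sock_model *sk, size_t truesize)
{
    assert(sk != NULL);
    if (sk->wmem_alloc >= sk->sndbuf)
        return false;
    sk->wmem_alloc += truesize;
    return true;
}

/* Models alloc_skb() + skb_set_owner_w(): charges, but never refuses. */
bool alloc_unchecked(struct sock_model *sk, size_t truesize)
{
    assert(sk != NULL);
    sk->wmem_alloc += truesize;
    return true;
}
```

    In this model, messages queued behind one stalled device exhaust the
    shared budget for alloc_checked(), whereas alloc_unchecked() keeps
    succeeding for every other device.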

    This patch was originally posted by Eric Dumazet in a modified
    form.

    Signed-off-by: Thomas Graf
    Cc: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Cc: Stephen Warren
    Cc: Fabio Estevam
    Tested-by: Fabio Estevam
    Tested-by: Stephen Warren
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Commit b67bfe0d42cac56c512dd5da4b1b347a23f4b70a ("hlist: drop
    the node parameter from iterators") changed the behavior of
    hlist_for_each_entry_safe to leave the p argument NULL.

    Fix this up by tracking the last argument.
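
    The pattern can be shown on a plain singly linked list. This is a
    minimal userspace sketch, not the kernel code: struct node and
    list_append() are hypothetical, and an ordinary for loop stands in for
    hlist_for_each_entry_safe(). The point is only that the cursor is NULL
    once the loop has run to completion, so the last visited entry has to
    be remembered separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list entry. */
struct node {
    int val;
    struct node *next;
};

/* Append newp at the tail of the list starting at *head. */
void list_append(struct node **head, struct node *newp)
{
    struct node *p, *last = NULL;

    assert(head != NULL && newp != NULL);

    /* After this loop p is always NULL, so it cannot name the tail. */
    for (p = *head; p != NULL; p = p->next)
        last = p;

    newp->next = NULL;
    if (last)
        last->next = newp;  /* non-empty list: link after the tail */
    else
        *head = newp;       /* empty list: newp becomes the head */
}
```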

    Reported-by: Michele Baldessari
    Cc: Hideaki YOSHIFUJI
    Cc: Sasha Levin
    Signed-off-by: Hannes Frederic Sowa
    Tested-by: Michele Baldessari
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • net: sctp: Fix data chunk fragmentation for MTU values which are not multiple of 4

    Initially the problem was observed with ipsec, but later it became clear that
    the SCTP data chunk fragmentation algorithm has problems with MTU values which
    are not a multiple of 4. A test program that just transmits 2000-byte packets
    to another host was used, and tcpdump was used to observe re-fragmentation in
    the IP layer after SCTP had already fragmented the data chunks.

    With MTU 1500:
    12:54:34.082904 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1500)
    10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 2366088589] [SID: 0] [SSEQ 1] [PPID 0x0]
    12:54:34.082933 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 596)
    10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 2366088590] [SID: 0] [SSEQ 1] [PPID 0x0]
    12:54:34.090576 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
    10.151.24.91.54321 > 10.151.38.153.39303: sctp (1) [SACK] [cum ack 2366088590] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]

    With MTU 1499:
    13:02:49.955220 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 0, flags [+], proto SCTP (132), length 1492)
    10.151.38.153.39084 > 10.151.24.91.54321: sctp[|sctp]
    13:02:49.955249 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 1472, flags [none], proto SCTP (132), length 28)
    10.151.38.153 > 10.151.24.91: ip-proto-132
    13:02:49.955262 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
    10.151.38.153.39084 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 404355346] [SID: 0] [SSEQ 1] [PPID 0x0]
    13:02:49.956770 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
    10.151.24.91.54321 > 10.151.38.153.39084: sctp (1) [SACK] [cum ack 404355346] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]

    Here, a problem in the data portion limit calculation leads to re-fragmentation
    in IP, which is sub-optimal. The problem is the initial value of max_data, which
    doesn't take into account the fact that a data chunk must be padded to a 4-byte
    boundary. It's enough to correct max_data, because all later adjustments are
    correctly aligned to a 4-byte boundary.
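
    The packet lengths in the traces can be reproduced with a little
    arithmetic. The sketch below is an illustration under assumptions, not
    the patched kernel code: it assumes a 20-byte IPv4 header without
    options, a 12-byte SCTP common header and a 16-byte DATA chunk header,
    and the function names are hypothetical.

```c
#include <assert.h>

enum {
    IPV4_HDR_LEN        = 20,  /* assumed: IPv4 header without options */
    SCTP_COMMON_HDR_LEN = 12,  /* assumed: SCTP common header */
    SCTP_DATA_HDR_LEN   = 16,  /* assumed: DATA chunk header */
    SCTP_OVERHEAD = IPV4_HDR_LEN + SCTP_COMMON_HDR_LEN + SCTP_DATA_HDR_LEN
};

/* Largest user data size per chunk, rounded down to a multiple of 4. */
unsigned int sctp_max_data(unsigned int pathmtu)
{
    assert(pathmtu > SCTP_OVERHEAD);
    return (pathmtu - SCTP_OVERHEAD) & ~3u;
}

/* On-wire IP length of a packet carrying one padded DATA chunk. */
unsigned int sctp_packet_len(unsigned int data_len)
{
    unsigned int padded = (data_len + 3) & ~3u;

    return SCTP_OVERHEAD + padded;
}
```

    For MTU 1499 this gives a limit of 1448 data bytes, i.e. packets of
    1496 and 600 bytes for a 2000-byte message, matching the last trace.
    Without the rounding the limit would be 1451 bytes, which pads to a
    1500-byte packet and exceeds the MTU.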

    After the fix is applied, everything is fragmented correctly for uneven MTUs:
    15:16:27.083881 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1496)
    10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 3077098183] [SID: 0] [SSEQ 1] [PPID 0x0]
    15:16:27.083907 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
    10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 3077098184] [SID: 0] [SSEQ 1] [PPID 0x0]
    15:16:27.085640 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
    10.151.24.91.54321 > 10.151.38.153.53417: sctp (1) [SACK] [cum ack 3077098184] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]

    The bug has been there for years already, but it
    - is a performance issue; the packets are still transmitted
    - doesn't show up with the default MTU of 1500, but possibly with ipsec (MTU 1438)

    Signed-off-by: Alexander Sverdlin
    Acked-by: Vlad Yasevich
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Alexander Sverdlin