19 Dec, 2014

1 commit


17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • As it is, default ->i_fop has NULL ->open() (along with all other methods).
    The only case where it matters is reopening (via procfs symlink) a file that
    didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
    to something sane (default would fail on read/write/ioctl/etc.).

    Unfortunately, such case exists - alloc_file() users, especially
    anon_get_file() ones. There we have tons of opened files of very different
    kinds sharing the same inode. As the result, attempt to reopen those via
    procfs succeeds and you get a descriptor you can't do anything with.

    Moreover, in case of sockets we set ->i_fop that will only be used
    on such reopen attempts - and put a failing ->open() into it to make sure
    those do not succeed.

    It would be simpler to put such ->open() into default ->i_fop and leave
    it unchanged both for anon inode (as we do anyway) and for socket ones. Result:
    * everything going through do_dentry_open() works as it used to
    * sock_no_open() kludge is gone
    * attempts to reopen anon-inode files fail as they really ought to
    * ditto for aio_private_file()
    * ditto for perfmon - this one actually tried to imitate sock_no_open()
    trick, but failed to set ->i_fop, so in the current tree reopens succeed and
    yield completely useless descriptor. Intent clearly had been to fail with
    -ENXIO on such reopens; now it actually does.
    * everything else that used alloc_file() keeps working - it has ->i_fop
    set for its inodes anyway

    Signed-off-by: Al Viro

    Al Viro
     

10 Dec, 2014

2 commits


20 Nov, 2014

3 commits

  • ... and do the same on the compat side of things.

    Signed-off-by: Al Viro

    Al Viro
     
  • use {compat_,}rw_copy_check_uvector(). As the result, we are
    guaranteed that all iovecs seen in ->msg_iov by ->sendmsg()
    and ->recvmsg() will pass access_ok().

    Signed-off-by: Al Viro

    Al Viro
     
  • Kernel-side struct msghdr is (currently) using the same layout as
    userland one, but it's not a one-to-one copy - even without considering
    32bit compat issues, we have msg_iov, msg_name and msg_control copied
    to kernel[1]. It's fairly localized, so we get away with a few functions
    where that knowledge is needed (and we could shrink that set even
    more). Pretty much everything deals with the kernel-side variant and
    the few places that want userland one just use a bunch of force-casts
    to paper over the differences.

    The thing is, kernel-side definition of struct msghdr is *not* exposed
    in include/uapi - libc doesn't see it, etc. So we can add struct user_msghdr,
    with proper annotations and let the few places that ever deal with those
    beasts use it for userland pointers. Saner typechecking aside, that will
    allow to change the layout of kernel-side msghdr - e.g. replace
    msg_iov/msg_iovlen there with struct iov_iter, getting rid of the need
    to modify the iovec as we copy data to/from it, etc.

    We could introduce kernel_msghdr instead, but that would create much more
    noise - the absolute majority of the instances would need to have the
    type switched to kernel_msghdr and definition of struct msghdr in
    include/linux/socket.h is not going to be seen by userland anyway.

    This commit just introduces user_msghdr and switches the few places that
    are dealing with userland-side msghdr to it.

    [1] actually, it's even trickier than that - we copy msg_control for
    sendmsg, but keep the userland address on recvmsg.

    Signed-off-by: Al Viro

    Al Viro
     

12 Oct, 2014

1 commit

  • Pull file locking related changes from Jeff Layton:
    "This release is a little more busy for file locking changes than the
    last:

    - a set of patches from Kinglong Mee to fix the lockowner handling in
    knfsd
    - a pile of cleanups to the internal file lease API. This should get
    us a bit closer to allowing for setlease methods that can block.

    There are some dependencies between mine and Bruce's trees this cycle,
    and I based my tree on top of the requisite patches in Bruce's tree"

    * tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
    locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
    locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
    locks: set fl_owner for leases to filp instead of current->files
    locks: give lm_break a return value
    locks: __break_lease cleanup in preparation of allowing direct removal of leases
    locks: remove i_have_this_lease check from __break_lease
    locks: move freeing of leases outside of i_lock
    locks: move i_lock acquisition into generic_*_lease handlers
    locks: define a lm_setup handler for leases
    locks: plumb a "priv" pointer into the setlease routines
    nfsd: don't keep a pointer to the lease in nfs4_file
    locks: clean up vfs_setlease kerneldoc comments
    locks: generic_delete_lease doesn't need a file_lock at all
    nfsd: fix potential lease memory leak in nfs4_setlease
    locks: close potential race in lease_get_mtime
    security: make security_file_set_fowner, f_setown and __f_setown void return
    locks: consolidate "nolease" routines
    locks: remove lock_may_read and lock_may_write
    lockd: rip out deferred lock handling from testlock codepath
    NFSD: Get reference of lockowner when coping file_lock
    ...

    Linus Torvalds
     

24 Sep, 2014

1 commit


10 Sep, 2014

3 commits


06 Sep, 2014

2 commits

  • This patch fix spelling typo found in DocBook/networking.xml.
    It is because the neworking.xml is generated from comments
    in the source, I have to fix typo in comments within the source.

    Signed-off-by: Masanari Iida
    Acked-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Masanari Iida
     
  • The timestamping API has separate bits for generating and reporting
    timestamps. A software timestamp should only be reported for a packet
    when the packet has the relevant generation flag (SKBTX_..) set
    and the socket has reporting bit SOF_TIMESTAMPING_SOFTWARE set.

    The second check was accidentally removed. Reinstitute the original
    behavior.

    Tested:
    Without this patch, Documentation/networking/txtimestamp reports
    timestamps regardless of whether SOF_TIMESTAMPING_SOFTWARE is set.
    After the patch, it only reports them when the flag is set.

    Fixes: f24b9be5957b ("net-timestamp: extend SCM_TIMESTAMPING ancillary data struct")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

07 Aug, 2014

1 commit

  • sock_tx_timestamp() should not ignore initial *tx_flags value, as TCP
    stack can store SKBTX_SHARED_FRAG in it.

    Also first argument (struct sock *) can be const.

    Signed-off-by: Eric Dumazet
    Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
    Cc: Willem de Bruijn
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Aug, 2014

4 commits

  • Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
    in the send() call is acknowledged. It implements the feature for TCP.

    The timestamp is generated when the TCP socket cumulative ACK is moved
    beyond the tracked seqno for the first time. The feature ignores SACK
    and FACK, because those acknowledge the specific byte, but not
    necessarily the entire contents of the buffer up to that byte.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Kernel transmit latency is often incurred in the packet scheduler.
    Introduce a new timestamp on transmission just before entering the
    scheduler. When data travels through multiple devices (bonding,
    tunneling, ...) each device will export an individual timestamp.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • sk_flags is reaching its limit. New timestamping options will not fit.
    Move all of them into a new field sk->sk_tsflags.

    Added benefit is that this removes boilerplate code to convert between
    SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

    SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
    timestamp logic (netstamp_needed). That can be simplified and this
    last key removed, but will leave that for a separate patch.

    Signed-off-by: Willem de Bruijn

    ----

    The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
    though that scatters tstamp fields throughout the struct.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Applications that request kernel tx timestamps with SO_TIMESTAMPING
    read timestamps as recvmsg() ancillary data. The response is defined
    implicitly as timespec[3].

    1) define struct scm_timestamping explicitly and

    2) add support for new tstamp types. On tx, scm_timestamping always
    accompanies a sock_extended_err. Define previously unused field
    ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
    define the existing behavior.

    The reception path is not modified. On rx, no struct similar to
    sock_extended_err is passed along with SCM_TIMESTAMPING.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Jul, 2014

1 commit

  • The SO_TIMESTAMPING API defines three types of timestamps: software,
    hardware in raw format (hwtstamp) and hardware converted to system
    format (syststamp). The last has been deprecated in favor of combining
    hwtstamp with a PTP clock driver. There are no active users in the
    kernel.

    The option was device driver dependent. If set, but without hardware
    support, the correct behavior is to return zero in the relevant field
    in the SCM_TIMESTAMPING ancillary message. Without device drivers
    implementing the option, this field is effectively always zero.

    Remove the internal plumbing to dissuage new drivers from implementing
    the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
    avoid breaking existing applications that request the timestamp.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

17 Apr, 2014

1 commit

  • Make sys_recv a first class citizen by using the SYSCALL_DEFINEx
    macro. Besides being cleaner this will also generate meta data
    for the system call so tracing tools like ftrace or LTTng can
    resolve this system call.

    Signed-off-by: Jan Glauber
    Signed-off-by: David S. Miller

    Jan Glauber
     

02 Apr, 2014

1 commit

  • This commit fixes a build error reported by Fengguang, that is
    triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

    ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

    The fix is to introduce its own file for the PTP BPF classifier,
    so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
    it independently from each other. IXP4xx driver on ARM needs to
    select it as well since it does not seem to select PTP_1588_CLOCK
    or similar that would pull it in automatically.

    This also allows for hiding all of the internals of the BPF PTP
    program inside that file, and only exporting relevant API bits
    to drivers.

    This patch also adds a kdoc documentation of ptp_classify_raw()
    API to make it clear that it can return PTP_CLASS_* defines. Also,
    the BPF program has been translated into bpf_asm code, so that it
    can be more easily read and altered (extensively documented in [1]).

    In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
    tools, so the commented program can simply be translated via
    `./bpf_asm -c prog` where prog is a file that contains the
    commented code. This makes it easily readable/verifiable and when
    there's a need to change something, jump offsets etc do not need
    to be replaced manually which can be very error prone. Instead,
    a newly translated version via bpf_asm can simply replace the old
    code. I have checked opcode diffs before/after and it's the very
    same filter.

    [1] Documentation/networking/filter.txt

    Fixes: 164d8c666521 ("net: ptp: do not reimplement PTP/BPF classifier")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

15 Mar, 2014

1 commit


14 Mar, 2014

1 commit

  • Pull networking fixes from David Miller:
    "I know this is a bit more than you want to see, and I've told the
    wireless folks under no uncertain terms that they must severely scale
    back the extent of the fixes they are submitting this late in the
    game.

    Anyways:

    1) vmxnet3's netpoll doesn't perform the equivalent of an ISR, which
    is the correct implementation, like it should. Instead it does
    something like a NAPI poll operation. This leads to crashes.

    From Neil Horman and Arnd Bergmann.

    2) Segmentation of SKBs requires proper socket orphaning of the
    fragments, otherwise we might access stale state released by the
    release callbacks.

    This is a 5 patch fix, but the initial patches are giving
    variables and such significantly clearer names such that the
    actual fix itself at the end looks trivial.

    From Michael S. Tsirkin.

    3) TCP control block release can deadlock if invoked from a timer on
    an already "owned" socket. Fix from Eric Dumazet.

    4) In the bridge multicast code, we must validate that the
    destination address of general queries is the link local all-nodes
    multicast address. From Linus Lüssing.

    5) The x86 BPF JIT support for negative offsets puts the parameter
    for the helper function call in the wrong register. Fix from
    Alexei Starovoitov.

    6) The descriptor type used for RTL_GIGA_MAC_VER_17 chips in the
    r8169 driver is incorrect. Fix from Hayes Wang.

    7) The xen-netback driver tests skb_shinfo(skb)->gso_type bits to see
    if a packet is a GSO frame, but that's not the correct test. It
    should use skb_is_gso(skb) instead. Fix from Wei Liu.

    8) Negative msg->msg_namelen values should generate an error, from
    Matthew Leach.

    9) at86rf230 can deadlock because it takes the same lock from it's
    ISR and it's hard_start_xmit method, without disabling interrupts
    in the latter. Fix from Alexander Aring.

    10) The FEC driver's restart doesn't perform operations in the correct
    order, so promiscuous settings can get lost. Fix from Stefan
    Wahren.

    11) Fix SKB leak in SCTP cookie handling, from Daniel Borkmann.

    12) Reference count and memory leak fixes in TIPC from Ying Xue and
    Erik Hugne.

    13) Forced eviction in inet_frag_evictor() must strictly make sure all
    frags are deleted, otherwise module unload (f.e. 6lowpan) can
    crash. Fix from Florian Westphal.

    14) Remove assumptions in AF_UNIX's use of csum_partial() (which it
    uses as a hash function), which breaks on PowerPC. From Anton
    Blanchard.

    The main gist of the issue is that csum_partial() is defined only
    as a value that, once folded (f.e. via csum_fold()) produces a
    correct 16-bit checksum. It is legitimate, therefore, for
    csum_partial() to produce two different 32-bit values over the
    same data if their respective alignments are different.

    15) Fix endiannes bug in MAC address handling of ibmveth driver, also
    from Anton Blanchard.

    16) Error checks for ipv6 exthdrs offload registration are reversed,
    from Anton Nayshtut.

    17) Externally triggered ipv6 addrconf routes should count against the
    garbage collection threshold. Fix from Sabrina Dubroca.

    18) The PCI shutdown handler added to the bnx2 driver can wedge the
    chip if it was not brought up earlier already, which in particular
    causes the firmware to shut down the PHY. Fix from Michael Chan.

    19) Adjust the sanity WARN_ON_ONCE() in qdisc_list_add() because as
    currently coded it can and does trigger in legitimate situations.
    From Eric Dumazet.

    20) BNA driver fails to build on ARM because of a too large udelay()
    call, fix from Ben Hutchings.

    21) Fair-Queue qdisc holds locks during GFP_KERNEL allocations, fix
    from Eric Dumazet.

    22) The vlan passthrough ops added in the previous release causes a
    regression in source MAC address setting of outgoing headers in
    some circumstances. Fix from Peter Boström"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (70 commits)
    ipv6: Avoid unnecessary temporary addresses being generated
    eth: fec: Fix lost promiscuous mode after reconnecting cable
    bonding: set correct vlan id for alb xmit path
    at86rf230: fix lockdep splats
    net/mlx4_en: Deregister multicast vxlan steering rules when going down
    vmxnet3: fix building without CONFIG_PCI_MSI
    MAINTAINERS: add networking selftests to NETWORKING
    net: socket: error on a negative msg_namelen
    MAINTAINERS: Add tools/net to NETWORKING [GENERAL]
    packet: doc: Spelling s/than/that/
    net/mlx4_core: Load the IB driver when the device supports IBoE
    net/mlx4_en: Handle vxlan steering rules for mac address changes
    net/mlx4_core: Fix wrong dump of the vxlan offloads device capability
    xen-netback: use skb_is_gso in xenvif_start_xmit
    r8169: fix the incorrect tx descriptor version
    tools/net/Makefile: Define PACKAGE to fix build problems
    x86: bpf_jit: support negative offsets
    bridge: multicast: enable snooping on general queries only
    bridge: multicast: add sanity check for general query destination
    tcp: tcp_release_cb() should release socket ownership
    ...

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • When copying in a struct msghdr from the user, if the user has set the
    msg_namelen parameter to a negative value it gets clamped to a valid
    size due to a comparison between signed and unsigned values.

    Ensure the syscall errors when the user passes in a negative value.

    Signed-off-by: Matthew Leach
    Signed-off-by: David S. Miller

    Matthew Leach
     

10 Mar, 2014

1 commit


14 Feb, 2014

1 commit


11 Dec, 2013

1 commit

  • This patch makes socketpair() use error paths which do not
    rely on heavy-weight call to sys_close(): it's better to try
    to push the file descriptor to userspace before installing
    the socket file to the file descriptor, so that errors are
    catched earlier and being easier to handle.

    Using sys_close() seems to be the exception, while writing the
    file descriptor before installing it look like it's more or less
    the norm: eg. except for code used in init/, error handling
    involve fput() and put_unused_fd(), but not sys_close().

    This make socketpair() usage of sys_close() quite unusual.
    So it deserves to be replaced by the common pattern relying on
    fput() and put_unused_fd() just like, for example, the one used
    in pipe(2) or recvmsg(2).

    Three distinct error paths are still needed since calling
    fput() on file structure returned by sock_alloc_file() will
    implicitly call sock_release() on the associated socket
    structure.

    Cc: David S. Miller
    Cc: Al Viro
    Signed-off-by: Yann Droneaud
    Link: http://marc.info/?i=1385979146-13825-1-git-send-email-ydroneaud@opteya.com
    Signed-off-by: David S. Miller

    Yann Droneaud
     

06 Dec, 2013

1 commit


30 Nov, 2013

1 commit

  • If kmsg->msg_namelen > sizeof(struct sockaddr_storage) then in the
    original code that would lead to memory corruption in the kernel if you
    had audit configured. If you didn't have audit configured it was
    harmless.

    There are some programs such as beta versions of Ruby which use too
    large of a buffer and returning an error code breaks them. We should
    clamp the ->msg_namelen value instead.

    Fixes: 1661bf364ae9 ("net: heap overflow in __audit_sockaddr()")
    Reported-by: Eric Wong
    Signed-off-by: Dan Carpenter
    Tested-by: Eric Wong
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Dan Carpenter
     

21 Nov, 2013

2 commits


20 Nov, 2013

1 commit


19 Nov, 2013

2 commits


04 Oct, 2013

1 commit

  • We need to cap ->msg_namelen or it leads to a buffer overflow when we
    to the memcpy() in __audit_sockaddr(). It requires CAP_AUDIT_CONTROL to
    exploit this bug.

    The call tree is:
    ___sys_recvmsg()
    move_addr_to_user()
    audit_sockaddr()
    __audit_sockaddr()

    Reported-by: Jüri Aedla
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • I found the following pattern that leads in to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
    since those appear in compat code, we could probably expect the kernel
    addresses not to be reachable in the lower 32-bit range, so I think they
    might not be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worse
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired from x86. This could
    lead to information leak on alpha. I also noticed that many architectures
    map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
    wonder if the latter is performing the access checks on every
    architectures.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

02 Aug, 2013

1 commit