17 Jul, 2012

1 commit

  • Before this patch sock_diag works for init_net only and dumps
    information about sockets from all namespaces.

    This patch expands sock_diag for all name-spaces.
    It creates a netlink kernel socket for each netns and filters
    data during dumping.

    v2: filter according to netns in all places
    remove an unused variable.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Pavel Emelyanov
    CC: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Andrew Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrey Vagin
     

28 Jun, 2012

1 commit


27 Jun, 2012

1 commit


10 Jun, 2012

1 commit

  • As pointed out by Michael Tokarev, struct unix_iter_state is no longer
    needed.

    Suggested-by: Michael Tokarev
    Signed-off-by: Eric Dumazet
    Cc: Steven Whitehouse
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2012

1 commit

  • /proc/net/unix has quadratic behavior, and can hold unix_table_lock for
    a while if a high number of unix sockets are alive. (90 ms for 200k
    sockets...)

    We already have a hash table, so it's quite easy to use it.

    Problem is unbound sockets are still hashed in a single hash slot
    (unix_socket_table[UNIX_HASH_TABLE])

    This patch also spreads unbound sockets to 256 hash slots, to speed up
    both /proc/net/unix and unix_diag.

    Time to read /proc/net/unix with 200k unix sockets :
    (time dd if=/proc/net/unix of=/dev/null bs=4k)

    before : 520 secs
    after : 2 secs

    Signed-off-by: Eric Dumazet
    Cc: Steven Whitehouse
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Apr, 2012

1 commit


21 Apr, 2012

2 commits

  • This results in code with less boilerplate that is a bit easier
    to read.

    It also stops us from using compatibility code in the sysctl
    core, hastening the day when the compatibility code can be removed.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This makes it clearer which sysctls are relative to your current network
    namespace.

    This makes it a little less error prone by not exposing sysctls for the
    initial network namespace in other namespaces.

    This is the same way we handle all of our other network interfaces to
    userspace and I can't honestly remember why we didn't do this for
    sysctls right from the start.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

16 Apr, 2012

1 commit


04 Apr, 2012

1 commit

  • unix_dgram_sendmsg() currently builds linear skbs, and this can stress
    the page allocator with high-order page allocations. When memory gets
    fragmented, these allocations can eventually fail.

    We can try to use order-2 allocations for skb head (SKB_MAX_ALLOC) plus
    up to 16 page fragments to lower pressure on the buddy allocator.

    This patch has no effect on messages of less than 16064 bytes.
    (on 64bit arches with PAGE_SIZE=4096)

    For bigger messages (from 16065 to 81600 bytes), this patch brings
    reliability at the expense of a performance penalty because of extra
    page allocations.

    netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
    ->4086040 Messages / 10s

    netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
    ->3901747 Messages / 10s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Mar, 2012

1 commit

  • In some cases the poll() implementation in a driver has to do different
    things depending on the events the caller wants to poll for. An example
    is when a driver needs to start a DMA engine if the caller polls for
    POLLIN, but doesn't want to do that if POLLIN is not requested but instead
    only POLLOUT or POLLPRI is requested. This is something that can happen
    in the video4linux subsystem among others.

    Unfortunately, the current epoll/poll/select implementation doesn't
    provide that information reliably. The poll_table_struct does have it: it
    has a key field with the event mask. But once a poll() call matches one
    or more bits of that mask any following poll() calls are passed a NULL
    poll_table pointer.

    Also, the eventpoll implementation always left the key field at ~0 instead
    of using the requested events mask.

    This was changed in eventpoll.c so the key field now contains the actual
    events that should be polled for as set by the caller.

    The solution to the NULL poll_table pointer is to set the qproc field to
    NULL in poll_table once poll() matches the events, not the poll_table
    pointer itself. That way drivers can obtain the mask through a new
    poll_requested_events inline.

    The poll_table_struct can still be NULL since some kernel code calls it
    internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
    that case poll_requested_events() returns ~0 (i.e. all events).

    Very rarely drivers might want to know whether poll_wait will actually
    wait. If another earlier file descriptor in the set already matched the
    events the caller wanted to wait for, then the kernel will return from the
    select() call without waiting. This might be useful information in order
    to avoid doing expensive work.

    A new helper function poll_does_not_wait() is added that drivers can use
    to detect this situation. This is now used in sock_poll_wait() in
    include/net/sock.h. This was the only place in the kernel that needed
    this information.

    Drivers should no longer access any of the poll_table internals, but use
    the poll_requested_events() and poll_does_not_wait() access functions
    instead. In order to enforce that the poll_table fields are now prepended
    with an underscore and a comment was added warning against using them
    directly.

    This required a change in unix_dgram_poll() in unix/af_unix.c which used
    the key field to get the requested events. It's been replaced by a call
    to poll_requested_events().

    For qproc it was especially important to change its name, because the
    behavior of that field changes with this patch: this function pointer
    can now be NULL, which wasn't possible in the past.

    Any driver accessing the qproc or key fields directly will now fail to compile.

    Some notes regarding the correctness of this patch: the driver's poll()
    function is called with a 'struct poll_table_struct *wait' argument. This
    pointer may or may not be NULL, drivers can never rely on it being one or
    the other as that depends on whether or not an earlier file descriptor in
    the select()'s fdset matched the requested events.

    There are only three things a driver can do with the wait argument:

    1) obtain the key field:

    events = wait ? wait->key : ~0;

    This will still work although it should be replaced with the new
    poll_requested_events() function (which does exactly the same).
    This will now even work better, since wait is no longer set to NULL
    unnecessarily.

    2) use the qproc callback. This could be deadly since qproc can now be
    NULL. Renaming qproc should prevent this from happening. There are no
    kernel drivers that actually access this callback directly, BTW.

    3) test whether wait == NULL to determine whether poll would return without
    waiting. This is no longer sufficient as the correct test is now
    wait == NULL || wait->_qproc == NULL.

    However, the worst that can happen here is a slight performance hit in
    the case where wait != NULL and wait->_qproc == NULL. In that case the
    driver will assume that poll_wait() will actually add the fd to the set
    of waiting file descriptors. Of course, poll_wait() will not do that
    since it tests for wait->_qproc. This will not break anything, though.

    There is only one place in the whole kernel where this happens
    (sock_poll_wait() in include/net/sock.h) and that code will be replaced
    by a call to poll_does_not_wait() in the next patch.

    Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
    actually waiting. The next file descriptor from the set might match the
    event mask and thus any possible waits will never happen.

    Signed-off-by: Hans Verkuil
    Reviewed-by: Jonathan Corbet
    Reviewed-by: Al Viro
    Cc: Davide Libenzi
    Signed-off-by: Hans de Goede
    Cc: Mauro Carvalho Chehab
    Cc: David Miller
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Verkuil
     

22 Mar, 2012

1 commit

  • Pull vfs pile 1 from Al Viro:
    "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
    yet."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
    ext4: initialization of ext4_li_mtx needs to be done earlier
    debugfs-related mode_t whack-a-mole
    hfsplus: add an ioctl to bless files
    hfsplus: change finder_info to u32
    hfsplus: initialise userflags
    qnx4: new helper - try_extent()
    qnx4: get rid of qnx4_bread/qnx4_getblk
    take removal of PF_FORKNOEXEC to flush_old_exec()
    trim includes in inode.c
    um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
    um: embed ->stub_pages[] into mmu_context
    gadgetfs: list_for_each_safe() misuse
    ocfs2: fix leaks on failure exits in module_init
    ecryptfs: make register_filesystem() the last potential failure exit
    ntfs: forgets to unregister sysctls on register_filesystem() failure
    logfs: missing cleanup on register_filesystem() failure
    jfs: mising cleanup on register_filesystem() failure
    make configfs_pin_fs() return root dentry on success
    configfs: configfs_create_dir() has parent dentry in dentry->d_parent
    configfs: sanitize configfs_create()
    ...

    Linus Torvalds
     

21 Mar, 2012

2 commits


27 Feb, 2012

1 commit


23 Feb, 2012

1 commit

  • Piergiorgio Beruto expressed the need to fetch size of first datagram in
    queue for AF_UNIX sockets and suggested a patch against SIOCINQ ioctl.

    I suggested instead to implement MSG_TRUNC support as a recv() input
    flag, as already done for RAW, UDP & NETLINK sockets.

    len = recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);

    MSG_TRUNC asks recv() to return the real length of the packet, even when
    it was longer than the passed buffer.

    There is a risk that a userland application used MSG_TRUNC by accident
    (since it had no effect on af_unix sockets) and this might break after
    this patch.

    Signed-off-by: Eric Dumazet
    Tested-by: Piergiorgio Beruto
    CC: Michael Kerrisk
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Feb, 2012

2 commits

  • The same here -- we can protect the sk_peek_off manipulations with
    the unix_sk->readlock mutex.

    The peeking of data from a stream socket is done in the datagram style,
    i.e. even if there's enough room for more data in the user buffer, only
    the head skb's data is copied in there. This feature is preserved when
    peeking data from a given offset -- the data is read till the nearest
    skb's boundary.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The sk_peek_off manipulations are protected with the unix_sk->readlock mutex.
    This mutex is enough since all we need is to synchronize setting the offset
    vs reading the queue head. The latter is fully covered with the mentioned lock.

    The recently added __skb_recv_datagram's offset is used to pick the skb to
    read the data from.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

31 Jan, 2012

1 commit

  • Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
    a stream socket) added a regression for epoll() in Edge Triggered mode
    (EPOLLET)

    The appropriate fix is to use skb_peek()/skb_unlink() instead of
    skb_dequeue(), and only call skb_unlink() when skb is fully consumed.

    This removes the need to requeue a partial skb into sk_receive_queue head
    and the extra sk->sk_data_ready() calls that added the regression.

    This is safe because once skb is given to sk_receive_queue, it is not
    modified by a writer, and readers are serialized by u->readlock mutex.

    This also reduces the number of spinlock acquisitions for small reads or
    MSG_PEEK users, so it should improve overall performance.

    Reported-by: Nick Mathewson
    Signed-off-by: Eric Dumazet
    Cc: Alexey Moiseytsev
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jan, 2012

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    igmp: Avoid zero delay when receiving odd mixture of IGMP queries
    netdev: make net_device_ops const
    bcm63xx: make ethtool_ops const
    usbnet: make ethtool_ops const
    net: Fix build with INET disabled.
    net: introduce netif_addr_lock_nested() and call if when appropriate
    net: correct lock name in dev_[uc/mc]_sync documentations.
    net: sk_update_clone is only used in net/core/sock.c
    8139cp: fix missing napi_gro_flush.
    pktgen: set correct max and min in pktgen_setup_inject()
    smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
    asix: fix infinite loop in rx_fixup()
    net: Default UDP and UNIX diag to 'n'.
    r6040: fix typo in use of MCR0 register bits
    net: fix sock_clone reference mismatch with tcp memcontrol

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

08 Jan, 2012

1 commit


04 Jan, 2012

1 commit


31 Dec, 2011

3 commits


27 Dec, 2011

2 commits


21 Dec, 2011

1 commit

  • Otherwise getting

    | net/unix/diag.c:312:16: error: expected declaration specifiers or ‘...’ before string constant
    | net/unix/diag.c:313:1: error: expected declaration specifiers or ‘...’ before string constant

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
     

17 Dec, 2011

10 commits


27 Nov, 2011

1 commit