24 Apr, 2008

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
    tun: Multicast handling in tun_chr_ioctl() needs proper locking.
    [NET]: Fix heavy stack usage in seq_file output routines.
    [AF_UNIX] Initialise UNIX sockets before general device initcalls
    [RTNETLINK]: Fix bogus ASSERT_RTNL warning
    iwlwifi: Fix built-in compilation of iwlcore (part 2)
    tun: Fix minor race in TUNSETLINK ioctl handling.
    ppp_generic: use stats from net_device structure
    iwlwifi: Don't unlock priv->mutex if it isn't locked
    wireless: rndis_wlan: modparam_workaround_interval is never below 0.
    prism54: prism54_get_encode() test below 0 on unsigned index
    mac80211: update mesh EID values
    b43: Workaround DMA quirks
    mac80211: fix use before check of Qdisc length
    net/mac80211/rx.c: fix off-by-one
    mac80211: Fix race between ieee80211_rx_bss_put and lookup routines.
    ath5k: Fix radio identification on AR5424/2424
    ssb: Fix all-ones boardflags
    b43: Add more btcoexist workarounds
    b43: Fix HostFlags data types
    b43: Workaround invalid bluetooth settings
    ...

    Linus Torvalds
     
  • When drivers call request_module(), it tries to do something with UNIX
    sockets and triggers a 'runaway loop modprobe net-pf-1' warning. Avoid
    this by initialising AF_UNIX support earlier.

    Signed-off-by: David Woodhouse
    Signed-off-by: David S. Miller

    David Woodhouse
     

19 Apr, 2008

1 commit

  • This takes care of all of the direct callers of vfs_mknod().
    Since a few of these cases also handle normal file creation
    as well, this also covers some calls to vfs_create().

    So that we don't have to make three mnt_want/drop_write()
    calls inside of the switch statement, we move some of its
    logic outside of the switch and into a helper function
    suggested by Christoph.

    This also encapsulates a fix for mknod(S_IFREG) that Miklos
    found.

    [AV: merged mkdir handling, added missing nfsd pieces]

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     

13 Apr, 2008

1 commit


26 Mar, 2008

3 commits


06 Mar, 2008

1 commit


15 Feb, 2008

2 commits

  • * Add path_put() functions for releasing a reference to the dentry and
    vfsmount of a struct path in the right order

    * Switch from path_release(nd) to path_put(&nd->path)

    * Rename dput_path() to path_put_conditional()

    [akpm@linux-foundation.org: fix cifs]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc:
    Cc: Al Viro
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • This is the central patch of a cleanup series. In most cases there is no good
    reason why someone would want to use a dentry for itself. This series reflects
    that fact and embeds a struct path into nameidata.

    Together with the other patches of this series
    - it enforces the correct order of getting/releasing the reference count on
    dentry/vfsmount pairs
    - it prepares the VFS for stacking support since it is essential to have a
    struct path in every place where the stack can be traversed
    - it reduces the overall code size:

    without patch series:
    text data bss dec hex filename
    5321639 858418 715768 6895825 6938d1 vmlinux

    with patch series:
    text data bss dec hex filename
    5320026 858418 715768 6894212 693284 vmlinux

    This patch:

    Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix cifs]
    [akpm@linux-foundation.org: fix smack]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
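
The two entries above describe embedding a struct path in nameidata and releasing its dentry and vfsmount references in a fixed order. The following is a minimal userspace sketch of that idea, not the kernel code: the types only model a reference count, sketch_path_get() and the release_seq bookkeeping are hypothetical additions for illustration, and the dentry-before-mount release order is an assumption about what "the right order" means here.

```c
#include <assert.h>

/* Hypothetical stand-ins: only a reference count is modelled. */
struct dentry   { int refcount; };
struct vfsmount { int refcount; };

/* The pair that always travels together. */
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

/* After the series, nameidata embeds the pair: nd->path.{dentry,mnt}. */
struct nameidata { struct path path; };

/* Bookkeeping so the release order can be observed (illustration only). */
static int release_seq;
static int dentry_released_at;
static int mnt_released_at;

static void dput(struct dentry *d)
{
    assert(d->refcount > 0);
    d->refcount--;
    dentry_released_at = ++release_seq;
}

static void mntput(struct vfsmount *m)
{
    assert(m->refcount > 0);
    m->refcount--;
    mnt_released_at = ++release_seq;
}

/* Hypothetical helper: take a reference on both halves of the pair. */
static void sketch_path_get(struct path *p)
{
    p->mnt->refcount++;
    p->dentry->refcount++;
}

/* Drop both references in one place, dentry first and mount second. */
static void path_put(struct path *p)
{
    dput(p->dentry);
    mntput(p->mnt);
}
```

With a single helper, a caller writes path_put(&nd->path) and cannot get the order wrong or forget one of the two references.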
     

29 Jan, 2008

10 commits

  • Add __acquires() and __releases() annotations to suppress some sparse
    warnings.

    Example warnings:

    net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
    count at exit
    net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
    unexpected unlock

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    Recently David Miller and Herbert Xu pointed out that struct net is
    becoming bloated and unmaintainable. There are two solutions:
    - provide a pointer to a network subsystem definition from struct net.
    This costs an additional dereference
    - place the subsystem definition into the structure itself. This speeds up
    run-time access at the cost of recompilation time

    The second approach looks better for us. Other subsystems will follow.

    Signed-off-by: Denis V. Lunev
    Acked-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • This is the core.

    * add the ctl_table_header to struct net;
    * make unix_sysctl_register and _unregister clone the table;
    * move the calls to them into the per-net init and exit callbacks;
    * move the .data pointer to the proper place.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    Unlike the previous ones, this patch is useful on its own,
    as it decreases the vmlinux size :)

    It will also be used later, when the per-namespace sysctl
    is added.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This will make all the sub-namespaces always use the
    default value (10) and leave the tuning via sysctl
    to the init namespace only.

    Per-namespace tuning is coming.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Add the struct net * argument to both of them to use in
    the future. Also make the register one return an error code.

    It is useless right now, but will make the future patches
    much simpler.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    sock_wake_async() performs slightly different actions
    depending on its "how" argument. Unfortunately, this argument
    only takes magic numerical values.

    I propose naming these constants, so that people reading
    the callers of this function can understand what's going on
    without looking into the function all the time.

    I suppose this is 2.6.25 material, but if it's not (or the
    naming seems poor/bad/awful), I can rework it against the
    current net-2.6 tree.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    first_unix_socket() and next_unix_sockets() are now used only
    in the proc file and in the forall_unix_sockets macro.

    The forall_unix_sockets macro is not used in this file at all, so
    remove it. After that, move the helpers to where they really
    belong, i.e. closer to the proc code under the #ifdef CONFIG_PROC_FS
    option.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Signed-off-by: Denis V. Lunev
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
    Because of the global nature of garbage collection, and because of the
    cost of per-namespace hash tables, unix_socket_table has been kept
    global, with a filter added on lookups so we don't see sockets from
    the wrong namespace.

    Currently I don't fold the namespace into the hash, so multiple
    namespaces using the same socket name will be guaranteed a hash
    collision.

    Changes from v1:
    - fixed unix_seq_open

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Denis V. Lunev
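
The last entry above keeps one global socket table and filters lookups by namespace, without folding the namespace into the hash. A minimal userspace sketch of that layout follows. The struct names, the hash function and the table size are all assumptions made for illustration and are not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 16

struct net { int id; };              /* hypothetical namespace handle */

struct usock {
    const char *name;                /* bound socket name */
    const struct net *net;           /* owning namespace */
    struct usock *next;              /* hash chain */
};

/* One table shared by every namespace. */
static struct usock *table[TABLE_SIZE];

/* The namespace is deliberately not mixed into the hash. */
static unsigned int hash_name(const char *name)
{
    unsigned int h = 0;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % TABLE_SIZE;
}

static void table_insert(struct usock *s)
{
    unsigned int h;

    assert(s->name != NULL && s->net != NULL);
    h = hash_name(s->name);
    s->next = table[h];
    table[h] = s;
}

/* The filter: entries from other namespaces are skipped. */
static struct usock *table_lookup(const struct net *net, const char *name)
{
    struct usock *s;

    for (s = table[hash_name(name)]; s; s = s->next)
        if (s->net == net && strcmp(s->name, name) == 0)
            return s;
    return NULL;
}
```

Two namespaces binding the same name land on the same chain, which is the guaranteed collision the message mentions. The lookup still returns the right socket because of the namespace comparison.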
     

29 Nov, 2007

1 commit

  • I am not absolutely sure whether this actually is a bug (as in: I've got
    no clue what the standards say or what other implementations do), but at
    least I was pretty surprised when I noticed that a recv() on a
    non-blocking unix domain socket of type SOCK_SEQPACKET (which is connection
    oriented, after all) where the remote end has closed the connection
    returned -1 (EAGAIN) rather than 0 to indicate end of file.

    This is a test case:

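    | #include <sys/types.h>
    | #include <sys/socket.h>
    | #include <sys/un.h>
    | #include <unistd.h>
    | #include <fcntl.h>
    | #include <string.h>
    | #include <stdlib.h>
    |
    | int main(){
    |     int sock;
    |     struct sockaddr_un addr;
    |     char buf[4096];
    |     int pfds[2];
    |
    |     pipe(pfds);
    |     sock=socket(PF_UNIX,SOCK_SEQPACKET,0);
    |     addr.sun_family=AF_UNIX;
    |     strcpy(addr.sun_path,"/tmp/foobar_testsock");
    |     bind(sock,(struct sockaddr *)&addr,sizeof(addr));
    |     listen(sock,1);
    |     if(fork()){
    |         close(sock);
    |         sock=socket(PF_UNIX,SOCK_SEQPACKET,0);
    |         connect(sock,(struct sockaddr *)&addr,sizeof(addr));
    |         fcntl(sock,F_SETFL,fcntl(sock,F_GETFL)|O_NONBLOCK);
    |         close(pfds[1]);
    |         read(pfds[0],buf,sizeof(buf));
    |         recv(sock,buf,sizeof(buf),0); // the recv() discussed above
    |     }else{
    |         /* The listing is cut off at this point. This ending is an
    |          * assumed minimal completion: accept the connection, then
    |          * exit so the connection and the pipe's write end close. */
    |         accept(sock,NULL,NULL);
    |     }
    |     exit(0);
    | }

    Signed-off-by: Herbert Xu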

    Florian Zumbiehl
     

11 Nov, 2007

3 commits

    The unix_nr_socks value is limited to 2 * get_max_files(),
    as seen in unix_create1(). However, the check and the actual
    increment are separated by the GFP_KERNEL allocation, so this limit
    can be exceeded under memory pressure: a task may go to sleep freeing
    pages, and some other task will be allowed to allocate a new sock,
    and so on.

    So do the increment before the check (a similar thing is done in
    sock_kmalloc) and only then go on to kmalloc.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The scan_inflight() routine scans through the unix sockets and calls
    some passed callback. The fact is that all these callbacks work with
    the unix_sock objects, not the sock ones, so make this conversion in
    the scan_inflight() before calling the callbacks.

    This removes one unneeded variable from the inc_inflight_move_tail().

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
    This counter is _always_ modified under the unix_gc_lock spinlock,
    so its atomicity is guaranteed without additional effort.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
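
The unix_nr_socks entry above fixes a limit check that could be bypassed because the counter was incremented only after a sleeping allocation. A minimal userspace sketch of the "increment first, then check, then allocate" pattern follows. The function names and the malloc() stand-in for the GFP_KERNEL allocation are assumptions for illustration, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Number of live objects, including reservations in progress. */
static atomic_long nr_socks;

/*
 * Reserve a slot before doing anything that may block. If the limit is
 * exceeded or the allocation fails, the reservation is rolled back.
 */
static void *sock_alloc_limited(long limit, size_t size)
{
    void *p;

    assert(size > 0);

    /* Increment first: concurrent callers all see this reservation. */
    if (atomic_fetch_add(&nr_socks, 1) + 1 > limit)
        goto out_unreserve;

    p = malloc(size);   /* stand-in for the allocation that may sleep */
    if (!p)
        goto out_unreserve;
    return p;

out_unreserve:
    atomic_fetch_sub(&nr_socks, 1);
    return NULL;
}

static void sock_free_limited(void *p)
{
    free(p);
    atomic_fetch_sub(&nr_socks, 1);
}
```

Because the slot is claimed before the allocation, a task that sleeps inside the allocator already counts against the limit, so other tasks cannot slip past the check in the meantime.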
     

01 Nov, 2007

1 commit

  • Finally, the zero_it argument can be completely removed from
    the callers and from the function prototype.

    Also fix the checkpatch.pl warnings about
    assignments inside if conditions.

    This patch is rather big, and it is a part of the previous one.
    I split it to make the patches more readable. I hope
    this particular split helped.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

20 Oct, 2007

1 commit

    This is the largest patch in the set. Make all (I hope) the places where
    a pid is shown to or read from the user operate on virtual pids.

    The idea is:
    - all in-kernel data structures must store either the struct pid itself
    or the pid's global nr, obtained with the pid_nr() call;
    - when looking up a task from kernel code by the stored id, one
    should use the find_task_by_pid() call, which works with global pids;
    - when showing a pid's numerical value to the user, the virtual one
    should be used; however, when a task's pid is shown outside that
    task's namespace, the global one is to be used;
    - when getting a pid from userspace, one needs to treat it as
    the virtual one and use the appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

15 Oct, 2007

1 commit

  • make sync wakeups affine for cache-cold tasks: if a cache-cold task
    is woken up by a sync wakeup then use the opportunity to migrate it
    straight away. (the two tasks are 'related' because they communicate)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Oct, 2007

3 commits

  • This concerns the ipv4 and ipv6 code mostly, but also the netlink
    and unix sockets.

    The netlink code is an example of how to use the __seq_open_private()
    call: it saves the net namespace in this private data.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This patch passes in the namespace a new socket should be created in
    and has the socket code do the appropriate reference counting. By
    virtue of this all socket create methods are touched. In addition
    the socket create methods are modified so that they will fail if
    you attempt to create a socket in a non-default network namespace.

    Failing if we attempt to create a socket outside of the default
    network namespace ensures that as we incrementally make the network stack
    network namespace aware we will not export functionality that someone
    has not audited and made certain is network namespace safe.
    This allows us to partially enable network namespaces before all of the
    exotic protocols are supported.

    Any protocol layers I have missed will fail to compile because I now
    pass an extra parameter into the socket creation code.

    [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
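
The socket-creation entry above makes every create method fail outside the default network namespace until the protocol has been audited. A minimal sketch of that guard follows. The struct net stand-in, the function name proto_create() and the choice of -EAFNOSUPPORT as the error are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct net { int id; };                 /* hypothetical namespace object */

static struct net init_net = { 0 };     /* the default namespace */

/*
 * Hypothetical per-protocol create method. The namespace is passed in
 * explicitly; anything other than the default namespace is refused.
 */
static int proto_create(struct net *net, int protocol)
{
    assert(net != NULL);

    if (net != &init_net)
        return -EAFNOSUPPORT;   /* assumed error code */

    (void)protocol;             /* real socket setup would go here */
    return 0;
}
```

A protocol opts in to namespaces later by dropping the check once its code has been reviewed, which matches the incremental approach the message describes.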
     

31 Jul, 2007

1 commit


12 Jul, 2007

1 commit

  • Throw out the old mark & sweep garbage collector and put in a
    refcounting cycle detecting one.

    The old one had a race with recvmsg, which resulted in false positives
    and hence data loss. The old algorithm operated on all unix sockets
    in the system, so any additional locking would have meant performance
    problems for all users of these.

    The new algorithm instead only operates on "in flight" sockets, which
    are very rare, and the additional locking for these doesn't negatively
    impact the vast majority of users.

    In fact, it's probable that there weren't *any* heavy senders of
    sockets over sockets, otherwise the above race would have been
    discovered long ago.

    The patch works OK with the app that exposed the race with the old
    code. The garbage collection has also been verified to work in a few
    simple cases.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

11 Jul, 2007

1 commit


08 Jun, 2007

1 commit

  • A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
    send()+close() on the peer, causing recv() to return zero, even though
    the sent data should be received.

    This happens if the send() and the close() are performed between
    skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():

    process A skb_dequeue() returns NULL, there's no data in the socket queue
    process B new data is inserted onto the queue by unix_stream_sendmsg()
    process B sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
    process A sk->sk_shutdown is checked, unix_stream_recvmsg() returns zero

    I'm surprised nobody noticed this, it's not hard to trigger. Maybe
    it's just (un)luck with the timing.

    It's possible to work around this bug in userspace, by retrying the
    recv() once in case of a zero return value.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
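
The message above mentions a userspace workaround: retry the recv() once when it returns zero. A minimal sketch of such a wrapper follows. The name recv_retry_once() is hypothetical, and the sketch assumes a blocking AF_UNIX SOCK_STREAM socket.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Call recv() and, if it reports end of file, call it exactly once more.
 * On a kernel with the race, the first zero may be spurious and the
 * second call picks up the data that was queued before the close.
 */
static ssize_t recv_retry_once(int fd, void *buf, size_t len, int flags)
{
    ssize_t n;

    assert(buf != NULL);

    n = recv(fd, buf, len, flags);
    if (n == 0 && len > 0)
        n = recv(fd, buf, len, flags);
    return n;
}
```

On a real end of file the second call returns zero again, so the wrapper costs one extra system call per connection close and never changes the result on a fixed kernel.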
     

04 Jun, 2007

2 commits

  • Based upon an excellent bug report and initial patch by
    Frederik Deweerdt.

    The UNIX datagram connect code blindly dereferences other->sk_socket
    via the call down to the security_unix_may_send() function.

    Without locking 'other' that pointer can go NULL via unix_release_sock()
    which does sock_orphan() which also marks the socket SOCK_DEAD.

    So we have to lock both 'sk' and 'other' yet avoid all kinds of
    potential deadlocks (connect to self is OK for datagram sockets and it
    is possible for two datagram sockets to perform a simultaneous connect
    to each other). So what we do is have a "double lock" function similar
    to how we handle this situation in other areas of the kernel. We take
    the lock of the socket pointer with the smallest address first in
    order to avoid ABBA style deadlocks.

    Once we have them both locked, we check to see if SOCK_DEAD is set
    for 'other' and if so, drop everything and retry the lookup.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The unix_state_*() locking macros imply that there is some
    rwlock kind of thing going on, but the implementation is
    actually a spinlock which makes the code more confusing than
    it needs to be.

    So use plain unix_state_lock and unix_state_unlock.

    Signed-off-by: David S. Miller

    David S. Miller
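
The datagram connect entry above takes two socket locks in address order to avoid ABBA deadlocks, including the connect-to-self case. A minimal userspace sketch with pthread mutexes follows. The struct and function names are hypothetical, and the SOCK_DEAD recheck and retry step is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct usock { pthread_mutex_t lock; };   /* hypothetical socket object */

/* Whichever object has the lower address is always locked first. */
static struct usock *first_to_lock(struct usock *a, struct usock *b)
{
    return (uintptr_t)a < (uintptr_t)b ? a : b;
}

static void double_lock(struct usock *a, struct usock *b)
{
    struct usock *first, *second;

    assert(a != NULL && b != NULL);

    if (a == b) {                 /* connect to self: one lock only */
        pthread_mutex_lock(&a->lock);
        return;
    }
    first = first_to_lock(a, b);
    second = (first == a) ? b : a;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
}

static void double_unlock(struct usock *a, struct usock *b)
{
    if (a == b) {
        pthread_mutex_unlock(&a->lock);
        return;
    }
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}
```

Two threads calling double_lock(x, y) and double_lock(y, x) at the same time both start with the same mutex, so neither can hold one lock while waiting for the other in the opposite order.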
     

09 May, 2007

1 commit


26 Apr, 2007

1 commit

    For the common, open-coded 'skb->h.raw = skb->data' operation, so that we can
    later turn skb->h.raw into an offset, reducing the size of struct sk_buff in
    64bit land while possibly keeping it as a pointer on 32bit.

    This one touches just the most simple cases:

    skb->h.raw = skb->data;
    skb->h.raw = {skb_push|[__]skb_pull}()

    The next ones will handle the slightly more "complex" cases.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
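
The entry above wraps the open-coded header assignment so the header position can later be stored as an offset instead of a pointer. A minimal userspace sketch of the offset representation follows. The struct buf type and both helper names are hypothetical and do not reproduce the sk_buff layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical buffer: the header position is an offset from head. */
struct buf {
    unsigned char *head;        /* start of the allocated area */
    unsigned char *data;        /* start of the current payload */
    uint32_t transport_off;     /* 4 bytes instead of an 8-byte pointer */
};

/* Replaces the open-coded "header = data" assignment. */
static void buf_reset_transport_header(struct buf *b)
{
    assert(b->data >= b->head);
    b->transport_off = (uint32_t)(b->data - b->head);
}

/* Callers that need a pointer recompute it from the offset. */
static unsigned char *buf_transport_header(const struct buf *b)
{
    return b->head + b->transport_off;
}
```

Once every caller goes through helpers like these, the field's representation can change in one place without touching the callers again.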
     

07 Mar, 2007

1 commit

  • This reverts two changes:

    8488df894d05d6fa41c2bd298c335f944bb0e401
    248f06726e866942b3d8ca8f411f9067713b7ff8

    A backlog value of N really does mean allow "N + 1" connections
    to queue to a listening socket. This allows one to specify
    "0" as the backlog and still get 1 connection.

    Noticed by Gerrit Renker and Rick Jones.

    Signed-off-by: David S. Miller

    David S. Miller
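
The revert above restores the rule that a backlog of N lets N + 1 connections queue. A minimal sketch of a queue-full test with that property follows. The function names are hypothetical, and the strict "greater than" comparison is an assumption about how the rule is expressed.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* The queue counts as full only once it holds MORE than backlog entries. */
static bool accept_queue_is_full(unsigned int queued, unsigned int backlog)
{
    return queued > backlog;
}

/* How many connections can be queued before the test reports "full". */
static unsigned int max_queued(unsigned int backlog)
{
    unsigned int queued = 0;

    assert(backlog < UINT_MAX);   /* keeps the loop below finite */

    while (!accept_queue_is_full(queued, backlog))
        queued++;
    return queued;
}
```

With this comparison, max_queued(0) is 1, so a listener that passes a backlog of 0 can still receive one pending connection.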
     

03 Mar, 2007

1 commit