23 Feb, 2013

1 commit


19 Feb, 2013

2 commits

  • proc_net_remove is only used to remove proc entries
    that are under /proc/net; it's not a general function for
    removing proc entries of a netns. If we want to remove
    some proc entries which are under /proc/net/stat/, we still
    need to call remove_proc_entry.

    This patch uses remove_proc_entry to replace proc_net_remove.
    We can remove proc_net_remove after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     
  • Right now, some modules such as bonding use proc_create
    to create proc entries under /proc/net/, and other modules
    such as ipv4 use proc_net_fops_create.

    It looks a little chaotic. This patch changes all uses of
    proc_net_fops_create to proc_create. We can remove
    proc_net_fops_create after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

05 Feb, 2013

1 commit


24 Jan, 2013

1 commit


17 Jan, 2013

1 commit

  • While a privileged program can open a raw socket, attach some
    restrictive filter and drop its privileges (or send the socket to an
    unprivileged program through some Unix socket), the filter can still
    be removed or modified by the unprivileged program. This commit adds a
    socket option to lock the filter (SO_LOCK_FILTER) preventing any
    modification of a socket filter program.

    This is similar to the OpenBSD BIOCLOCK ioctl on bpf sockets, except
    that even root is not allowed to change/drop the filter.

    The state of the lock can be read with getsockopt(). No error is
    triggered if the state is not changed. -EPERM is returned when a user
    tries to remove the lock or to change/remove the filter while the lock
    is active. The check is done directly in sk_attach_filter() and
    sk_detach_filter(), so it is not limited to the setsockopt() syscall.

    Signed-off-by: Vincent Bernat
    Signed-off-by: David S. Miller

    Vincent Bernat
     

22 Dec, 2012

1 commit

  • Using a seqlock for devnet_rename_seq is not a good idea,
    as device_rename() can sleep.

    As we hold RTNL, we don't need protection for writers,
    and only need a seqcount so that readers can catch a change done
    by a writer.

    Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of
    SO_BINDTODEVICE to return an interface name)

    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Cc: Brian Haley
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Nov, 2012

1 commit

  • Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
    will then require another call like if_indextoname() to get the actual interface
    name, have it return the name directly.

    This also matches the existing man page description on socket(7) which mentions
    the argument being an interface name.

    If the value has not been set, zero is returned and optlen will be set to zero
    to indicate there is no interface name present.

    Added a seqlock to protect this code path, and dev_ifname(), from someone
    changing the device name via dev_change_name().

    v2: Added seqlock protection while copying device name.

    v3: Fixed word wrap in patch.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace, to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
    CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device, where
    restrictions make no difference, or the network device is a hardware
    NIC that has been explicitly moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.

    Allow ethtool ioctls.

    Allow binding to network devices.
    Allow setting the socket mark.
    Allow setting the socket priority.

    Allow setting the network device alias via sysfs.
    Allow setting the mtu via sysfs.
    Allow changing the network device flags via sysfs.
    Allow setting the network device group via sysfs.

    Allow the following network device ioctls.
    SIOCGMIIPHY
    SIOCGMIIREG
    SIOCSIFNAME
    SIOCSIFFLAGS
    SIOCSIFMETRIC
    SIOCSIFMTU
    SIOCSIFHWADDR
    SIOCSIFSLAVE
    SIOCADDMULTI
    SIOCDELMULTI
    SIOCSIFHWBROADCAST
    SIOCSMIIREG
    SIOCBONDENSLAVE
    SIOCBONDRELEASE
    SIOCBONDSETHWADDR
    SIOCBONDCHANGEACTIVE
    SIOCBRADDIF
    SIOCBRDELIF
    SIOCSHWTSTAMP

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

01 Nov, 2012

1 commit

  • The SO_ATTACH_FILTER option is currently set-only. I propose to add
    the get ability by using SO_ATTACH_FILTER in getsockopt. To be easier
    on the eyes, the SO_GET_FILTER alias to it is declared. This ability
    is required by the checkpoint-restore project to be able to save the
    full state of a socket.

    There are two issues with getting filter back.

    First, kernel modifies the sock_filter->code on filter load, thus in
    order to return the filter element back to user we have to decode it
    into user-visible constants. Fortunately the modification in question
    is interconvertible.

    Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
    speed up the run-time division by doing kernel_k = reciprocal(user_k).
    Bad news is that different user_k may result in the same kernel_k, so
    we can't get the original user_k back. Good news is that we don't
    have to. What we need to do is calculate a user2_k such that

    reciprocal(user2_k) == reciprocal(user_k) == kernel_k

    i.e. if it's re-loaded back, the recompiled value will be exactly
    the same as it was. That said, user2_k can be calculated like this:

    user2_k = reciprocal(kernel_k)

    with an exception, that if kernel_k == 0, then user2_k == 1.

    The optlen argument is treated like this -- when zero, the kernel
    returns the number of instructions in the filter; otherwise it should
    be large enough for the whole instruction array.

    changes since v1:
    * Declared SO_GET_FILTER in all arch headers
    * Added decode of vlan-tag codes

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

26 Oct, 2012

2 commits

  • sock_update_classid() assumes that the update operation is always
    applied to the current task. sock_update_classid() needs to know
    which task to work on in order to be able to migrate tasks between
    cgroups using the struct cgroup_subsys attach() callback.

    Signed-off-by: Daniel Wagner
    Cc: "David S. Miller"
    Cc: "Michael S. Tsirkin"
    Cc: Eric Dumazet
    Cc: Glauber Costa
    Cc: Joe Perches
    Cc: Neil Horman
    Cc: Stanislav Kinsbursky
    Cc: Tejun Heo
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Wagner
     
  • As Eric pointed out:
    "Hey task_cls_classid() has its own rcu protection since commit
    3fb5a991916091a908d (cls_cgroup: Fix rcu lockdep warning)

    So we can safely revert Paul's commit (1144182a8757f2a1)
    (We no longer need rcu_read_lock/unlock here)"

    Signed-off-by: Daniel Wagner
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Glauber Costa
    Cc: Li Zefan
    Cc: Neil Horman
    Cc: Paul E. McKenney
    Cc: Tejun Heo
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Wagner
     

22 Oct, 2012

1 commit

  • The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set,
    but cannot be read back via the sockopt API. The only way we can find
    the device a socket is bound to is via the sock-diag interface. But
    diag works only on hashed sockets, while the option in question can
    be set for a yet-unhashed one.

    That said, in order to know what device a socket is bound to (we do want
    to know this in checkpoint-restore project) I propose to make this option
    getsockopt-able and report the respective device index.

    Another solution to the problem might be to teach the sock-diag reporting
    info on unhashed sockets. Should I go this way instead?

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

03 Oct, 2012

3 commits

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) fewer segments for the driver to process b) fewer calls to the page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable, leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting the type-incompatibility compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe, allowing
    root in a user namespace to do those things that today we only forbid
    to non-root users because it would confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netpioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Sep, 2012

1 commit

  • It seems sk_init() has no value today, and even does strange things:

    # grep . /proc/sys/net/core/?mem_*
    /proc/sys/net/core/rmem_default:212992
    /proc/sys/net/core/rmem_max:131071
    /proc/sys/net/core/wmem_default:212992
    /proc/sys/net/core/wmem_max:131071

    We can remove it completely.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Shan Wei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Sep, 2012

2 commits

  • It's possible to use RAW sockets to get a crash in
    tcp_set_keepalive() / sk_reset_timer().

    The fix is to make sure the socket is a SOCK_STREAM one.

    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits on other network devices, since 8x less frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE = 65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2012

4 commits

  • Conflicts:
    net/netfilter/nfnetlink_log.c
    net/netfilter/xt_LOG.c

    Rather easy conflict resolution, the 'net' tree had bug fixes to make
    sure we checked if a socket is a time-wait one or not and elide the
    logging code if so.

    Whereas on the 'net-next' side we are calculating the UID and GID from
    the creds using different interfaces due to the user namespace changes
    from Eric Biederman.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • WARNING: With this change it is impossible to load externally built
    controllers anymore.

    In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
    set, corresponding subsys_id should also be a constant. Up to now,
    net_prio_subsys_id and net_cls_subsys_id would be of the type int and
    the value would be assigned during runtime.

    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant value. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary at the RCU part, which was introduced by the
    following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    This code was added to init_cgroup_cls()

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

    respectively to exit_cgroup_cls()

    net_cls_subsys_id = -1;
    synchronize_rcu();

    and in module version of task_cls_classid()

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
    classid = container_of(task_subsys_state(p, id),
    struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

    Without an explicit explanation of why the RCU part is needed. (The
    rcu_dereference was fixed by exchanging it for
    rcu_dereference_index_check() in a later commit, but that is a minor
    detail.)

    So here is my pondering on why it was introduced and why it is safe to
    remove it now. Note that this code was copied over to net_prio, so the
    reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    is just blindly accessing the subsys array and returning the
    pointer. Obviously, passing in -1 as id into task_subsys_state()
    returns an invalid value (out of lower bound).

    So this code makes sure that the id is assigned only after the module
    is loaded and the subsystem registered.

    Before unregistering the module all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing a
    synchronize_rcu(). Any new readers won't call task_subsys_state()
    anymore and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer returned by task_subsys_state() (remember the id is constant
    and therefore we always have a valid index into the subsys
    array).

    No precautions need to be taken during module loading.
    Eventually, all CPUs will get a valid pointer back from
    task_subsys_state() because rebind_subsystems(), which is called after
    the module init() function, will assign subsys[net_cls_subsys_id] the
    newly loaded module's subsystem pointer.

    When the subsystem is about to be removed, rebind_subsystems() will be
    called before the module exit() function. In this case,
    rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL
    pointer and then call synchronize_rcu(). All old readers will have
    left the critical section by then, and any new reader won't access the
    subsystem anymore. At this point we are safe to unregister the
    subsystem. No additional synchronize_rcu() call is needed.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     
  • task_netprioidx() should not be defined in case the configuration is
    CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
    net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
    When net_prio is not built at all any callee should only get an empty
    task_netprioidx() without any references to net_prio_subsys_id.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     
  • task_cls_classid() should not be defined in case the configuration is
    CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
    net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
    When net_cls is not built at all a callee should only get an empty
    task_cls_classid() without any references to net_cls_subsys_id.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

11 Sep, 2012

1 commit


04 Sep, 2012

1 commit


25 Aug, 2012

2 commits


15 Aug, 2012

2 commits

  • Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • With the existence of kuid_t and kgid_t we can take this further
    and remove the usage of struct cred altogether, ensuring we
    don't get cache line misses from reference counts. For now,
    however, start simply and do a straightforward conversion
    I can be certain is correct.

    In cred_to_ucred use from_kuid_munged and from_kgid_munged
    as these values are going directly to userspace and we want to use
    the userspace safe values not -1 when reporting a value that does not
    map. The earlier conversion that used from_kuid was buggy in that
    respect. Oops.

    Cc: Eric Dumazet
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

02 Aug, 2012

1 commit


01 Aug, 2012

5 commits

  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

    When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the
    form factor or maintenance costs do not allow the use of disks, and
    thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 is bolting
    filesystem-specific-swapfile-support onto the side and that
    the default handlers have different information to what
    is available to the filesystem. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great with a swap stress test
    taking roughly twice as long to complete than if the swap device was
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens reclaimed to avoid
    accounting errors until the bug is fixed.

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to make sure pfmemalloc packets receive all memory needed to
    proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
    This is limited to a subset of protocols that are expected to be used for
    writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
    expected to communicate with userspace processes which could be paged out.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [jslaby@suse.cz: Lock imbalance fix]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
    for their allocations. These sockets will be able to go below watermarks
    and allocate from the emergency reserve. Such sockets are to be used to
    service the VM (iow. to swap over). They must be handled kernel
    side; exposing such a socket to user-space is a bug.

    There is a risk that the reserves will be depleted, so for now the
    administrator is responsible for increasing min_free_kbytes as
    necessary to prevent deadlock for their workloads.

    [a.p.zijlstra@chello.nl: Original patches]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

23 Jul, 2012

1 commit

  • Instead of updating the sk_cgrp_prioidx struct field on every send,
    this only updates the field when a task is moved via the cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread to
    be managed. For example, in the iscsi case today a user can put
    iscsid in a netprio cgroup and control traffic will be sent with
    the correct sk_cgrp_prioidx value set, but as soon as data is sent
    the kernel worker thread issues a send and sk_cgrp_prioidx is
    updated with the kernel worker thread's value, which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via the control group infrastructure. This
    allows users to manage sockets that may be used by other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    TSQ's goal is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per TCP socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2: it can help reduce
    latencies of high-prio packets by using smaller TSO packets.

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have a reduction of
    buffering per bulk sender:
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4 MBytes.

    As the skb destructor cannot restart xmit itself (the qdisc lock
    might be held at this point), we delegate the work to a tasklet.
    We use one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Jun, 2012

1 commit