27 Apr, 2015

2 commits

  • Pull NFS client updates from Trond Myklebust:
    "Another set of mainly bugfixes and a couple of cleanups. No new
    functionality in this round.

    Highlights include:

    Stable patches:
    - Fix a regression in /proc/self/mountstats
    - Fix the pNFS flexfiles O_DIRECT support
    - Fix high load average due to callback thread sleeping

    Bugfixes:
    - Various patches to fix the pNFS layoutcommit support
    - Do not cache pNFS deviceids unless server notifications are enabled
    - Fix a SUNRPC transport reconnection regression
    - make debugfs file creation failure non-fatal in SUNRPC
    - Another fix for circular directory warnings on NFSv4 "junctioned"
    mountpoints
    - Fix locking around NFSv4.2 fallocate() support
    - Truncating NFSv4 file opens should also sync O_DIRECT writes
    - Prevent infinite loop in rpcrdma_ep_create()

    Features:
    - Various improvements to the RDMA transport code's handling of
    memory registration
    - Various code cleanups"

    * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
    fs/nfs: fix new compiler warning about boolean in switch
    nfs: Remove unneeded casts in nfs
    NFS: Don't attempt to decode missing directory entries
    Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
    NFS: Rename idmap.c to nfs4idmap.c
    NFS: Move nfs_idmap.h into fs/nfs/
    NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
    NFS: Add a stub for GETDEVICELIST
    nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
    nfs: fix DIO good bytes calculation
    nfs: Fetch MOUNTED_ON_FILEID when updating an inode
    sunrpc: make debugfs file creation failure non-fatal
    nfs: fix high load average due to callback thread sleeping
    NFS: Reduce time spent holding the i_mutex during fallocate()
    NFS: Don't zap caches on fallocate()
    xprtrdma: Make rpcrdma_{un}map_one() into inline functions
    xprtrdma: Handle non-SEND completions via a callout
    xprtrdma: Add "open" memreg op
    xprtrdma: Add "destroy MRs" memreg op
    xprtrdma: Add "reset MRs" memreg op
    ...

    Linus Torvalds
     
  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

24 Apr, 2015

3 commits

  • NFS: NFSoRDMA Client Changes

    This patch series creates an operation vector for each of the different
    memory registration modes. This should make it easier to one day increase
    credit limit, rsize, and wsize.

    Signed-off-by: Anna Schumaker

    Trond Myklebust
     
  • * bugfixes:
    NFSv4: Return delegations synchronously in evict_inode
    SUNRPC: Fix a regression when reconnecting
    NFS: remount with security change should return EINVAL
    nfs: do not export discarded symbols
    NFSv4.1: don't export static symbol

    Trond Myklebust
     
  • v2: gracefully handle the case where some dentry pointers end up NULL
    and be more diligent about zeroing out dentry pointers

    We currently have a problem that SELinux policy is being enforced when
    creating debugfs files. If a debugfs file is created as a side effect of
    doing some syscall, then that creation can fail if the SELinux policy
    for that process prevents it.

    This seems wrong. We don't do that for files under /proc, for instance,
    so Bruce has proposed a patch to fix that.

    While discussing that patch however, Greg K.H. stated:

    "No kernel code should care / fail if a debugfs function fails, so
    please fix up the sunrpc code first."

    This patch converts all of the sunrpc debugfs setup code to be void
    return functions, and the callers to not look for errors from those
    functions.

    This should allow rpc_clnt and rpc_xprt creation to work, even if the
    kernel fails to create debugfs files for some reason.

    Cc: Greg Kroah-Hartman
    Acked-by: "J. Bruce Fields"
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

23 Apr, 2015

1 commit

  • Pull Ceph updates from Sage Weil:
    "This time around we have a collection of CephFS fixes from Zheng
    around MDS failure handling and snapshots, support for a new CRUSH
    straw2 algorithm (to sync up with userspace) and several RBD cleanups
    and fixes from Ilya, an error path leak fix from Taesoo, and then an
    assorted collection of cleanups from others"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits)
    rbd: rbd_wq comment is obsolete
    libceph: announce support for straw2 buckets
    crush: straw2 bucket type with an efficient 64-bit crush_ln()
    crush: ensuring at most num-rep osds are selected
    crush: drop unnecessary include from mapper.c
    ceph: fix uninline data function
    ceph: rename snapshot support
    ceph: fix null pointer dereference in send_mds_reconnect()
    ceph: hold on to exclusive caps on complete directories
    libceph: simplify our debugfs attr macro
    ceph: show non-default options only
    libceph: expose client options through debugfs
    libceph, ceph: split ceph_show_options()
    rbd: mark block queue as non-rotational
    libceph: don't overwrite specific con error msgs
    ceph: cleanup unsafe requests when reconnecting is denied
    ceph: don't zero i_wrbuffer_ref when reconnecting is denied
    ceph: don't mark dirty caps when there is no auth cap
    ceph: keep i_snap_realm while there are writers
    libceph: osdmap.h: Add missing format newlines
    ...

    Linus Torvalds
     

22 Apr, 2015

5 commits

  • This is an improved straw bucket that correctly avoids any data movement
    between items A and B when neither A nor B's weights are changed. Said
    differently, if we adjust the weight of item C (including adding it anew
    or removing it completely), we will only see inputs move to or from C,
    never between other items in the bucket.

    Notably, there is no intermediate scaling factor that needs to be
    calculated. The mapping function is a simple function of the item weights.
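
The property described above can be illustrated with a small userspace model. This is only a sketch, not the kernel's implementation: it uses floating-point `math.log` and a SHA-256 based hash where the kernel uses the fixed-point crush_ln() and its own hash, and the names `straw2_choose`, `moved_inputs` and `_unit_hash` are hypothetical. Each item draws ln(u) / w, where u is a per-(input, item) value in (0, 1] and w is the item weight (the "draw = ln / w" mentioned further down), and the item with the largest draw wins.

```python
import hashlib
import math


def _unit_hash(x, item):
    """Map (input, item) to a deterministic pseudo-random value in (0, 1]."""
    digest = hashlib.sha256(f"{x}:{item}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2.0**64


def straw2_choose(x, weights):
    """Return the item with the largest draw ln(u) / w for input x.

    weights maps item name -> weight; items with weight <= 0 never win.
    """
    best_item, best_draw = None, -math.inf
    for item, weight in sorted(weights.items()):
        if weight <= 0:
            continue
        # An item's draw depends only on its own hash and its own weight.
        draw = math.log(_unit_hash(x, item)) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


def moved_inputs(before, after, inputs):
    """List (input, old_item, new_item) for inputs whose placement changed."""
    moves = []
    for x in inputs:
        old, new = straw2_choose(x, before), straw2_choose(x, after)
        if old != new:
            moves.append((x, old, new))
    return moves
```

Because each draw depends only on that item's own weight, calling `moved_inputs()` with two weight maps that differ only in item C can only report moves whose old or new item is C; the relative order of the draws for A and B is unchanged.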

    The below commits were squashed together into this one (mostly to avoid
    adding and then yanking ~6000 lines' worth of crush_ln_table):

    - crush: add a straw2 bucket type
    - crush: add crush_ln to calculate nature log efficently
    - crush: improve straw2 adjustment slightly
    - crush: change crush_ln to provide 32 more digits
    - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
    - crush/mapper: fix divide-by-0 in straw2
    (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
    to create a proper compat.h in ceph.git)

    Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
    32a1ead92efcd351822d22a5fc37d159c65c1338,
    6289912418c4a3597a11778bcf29ed5415117ad9,
    35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
    6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
    b5921d55d16796e12d66ad2c4add7305f9ce2353.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Crush temporary buffers are allocated according to the replica size
    configured by the user. When the rule selects more final osds than
    there are replicas, the buffers overlap and cause a crash. Ensure that
    at most num-rep osds are selected even if the rule allows more.

    Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
    234b066ba04976783d15ff2abc3e81b6cc06fb10.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Pull networking fixes from David Miller:
    "Just a few fixes trickling in at this point.

    1) If we see an attached socket on an skb in the ipv4 forwarding path,
    bail. This can happen due to races with FIB rule addition, and
    deletion, and we should just drop such frames. From Sebastian
    Pöhn.

    2) pppoe receive should only accept packets destined for this host's
    MAC address. From Joakim Tjernlund.

    3) Handle checksum unwrapping properly in ppp receive when
    it's encapsulated in UDP in some way, fix from Tom Herbert.

    4) Fix some bugs in mv88e6xxx DSA driver resulting from the conversion
    from register offset constants to mnemonic macros. From Vivien
    Didelot.

    5) Fix handling of HCA max message size in mlx4 adapters, from Eran
    Ben Elisha"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP
    tcp: add memory barriers to write space paths
    altera tse: Error-Bit on tx-avalon-stream always set.
    net: dsa: mv88e6xxx: use PORT_DEFAULT_VLAN
    net: dsa: mv88e6xxx: fix setup of port control 1
    ppp: call skb_checksum_complete_unset in ppp_receive_frame
    net: add skb_checksum_complete_unset
    pppoe: Lacks DST MAC address check
    ip_forward: Drop frames with attached skb->sk

    Linus Torvalds
     
  • Ensure that we either see that the buffer has write space
    in tcp_poll() or that we perform a wakeup from the input
    side. Did not run into any actual problem here, but thought
    that we should make things explicit.

    Signed-off-by: Jason Baron
    Signed-off-by: David S. Miller

    jbaron@akamai.com
     

21 Apr, 2015

1 commit

  • Initial discussion was:
    [FYI] xfrm: Don't lookup sk_policy for timewait sockets

    Forwarded frames should not have a socket attached. Timewait (tw)
    sockets in particular will lead to panics later on in the stack.

    This was observed with TPROXY assigning a tw socket and broken
    (misconfigured) policy routing. As a result, the frame enters the
    forwarding path instead of the input path. We cannot solve this in
    TPROXY as it cannot know that policy routing is broken.

    v2:
    Remove useless comment

    Signed-off-by: Sebastian Poehn
    Signed-off-by: David S. Miller

    Sebastian Pöhn
     

20 Apr, 2015

3 commits


19 Apr, 2015

1 commit

  • Pull 9pfs updates from Eric Van Hensbergen:
    "Some accumulated cleanup patches for kerneldoc and unused variables as
    well as some lock bug fixes and adding privateport option for RDMA"

    * tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    net/9p: add a privport option for RDMA transport.
    fs/9p: Initialize status in v9fs_file_do_lock.
    net/9p: Initialize opts->privport as it should be.
    net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show()
    9p: use unsigned integers for nwqid/count
    9p: do not crash on unknown lock status code
    9p: fix error handling in v9fs_file_do_lock
    9p: remove unused variable in p9_fd_create()
    9p: kerneldoc warning fixes

    Linus Torvalds
     

18 Apr, 2015

7 commits

  • While it is not used by newer userspace anymore, the older userspace was
    utilizing HIDP_VIRTUAL_CABLE_UNPLUG and HIDP_BOOT_PROTOCOL_MODE flags
    when adding a new HIDP connection.

    The flags validation is important, but we cannot break older userspace,
    so allow these flags to be provided even though newer userspace no
    longer uses them.

    Reported-and-tested-by: Jörg Otte
    Signed-off-by: Marcel Holtmann
    Signed-off-by: Linus Torvalds

    Marcel Holtmann
     
  • Pull networking fixes from David Miller:

    1) Fix verifier memory corruption and other bugs in BPF layer, from
    Alexei Starovoitov.

    2) Add a conservative fix for doing BPF properly in the BPF classifier
    of the packet scheduler on ingress. Also from Alexei.

    3) The SKB scrubber should not clear out the packet MARK and security
    label, from Herbert Xu.

    4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.

    5) Pause handling is not correct in the stmmac driver because it
    doesn't take into consideration the RX and TX fifo sizes. From
    Vince Bridgers.

    6) Failure path missing unlock in FOU driver, from Wang Cong.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
    net: dsa: use DEVICE_ATTR_RW to declare temp1_max
    netns: remove BUG_ONs from net_generic()
    IB/ipoib: Fix ndo_get_iflink
    sfc: Fix memcpy() with const destination compiler warning.
    altera tse: Fix network-delays and -retransmissions after high throughput.
    net: remove unused 'dev' argument from netif_needs_gso()
    act_mirred: Fix bogus header when redirecting from VLAN
    inet_diag: fix access to tcp cc information
    tcp: tcp_get_info() should fetch socket fields once
    net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
    skbuff: Do not scrub skb mark within the same name space
    Revert "net: Reset secmark when scrubbing packet"
    bpf: fix two bugs in verification logic when accessing 'ctx' pointer
    bpf: fix bpf helpers to use skb->mac_header relative offsets
    stmmac: Configure Flow Control to work correctly based on rxfifo size
    stmmac: Enable unicast pause frame detect in GMAC Register 6
    stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
    stmmac: Add defines and documentation for enabling flow control
    stmmac: Add properties for transmit and receive fifo sizes
    stmmac: fix oops on rmmod after assigning ip addr
    ...

    Linus Torvalds
     
  • Since commit da4759c (sysfs: Use only return value from is_visible for
    the file mode), it is possible to reduce the permissions of a file.

    So declare temp1_max with the DEVICE_ATTR_RW macro and remove the write
    permission in dsa_hwmon_attrs_visible if set_temp_limit isn't provided.

    Signed-off-by: Vivien Didelot
    Reviewed-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • In commit 04ffcb255f22 ("net: Add ndo_gso_check") Tom originally
    added the 'dev' argument to be able to call ndo_gso_check().

    Then later, when generalizing this in commit 5f35227ea34b
    ("net: Generalize ndo_gso_check to ndo_features_check")
    Jesse removed the call to ndo_gso_check() in netif_needs_gso()
    by calling the new ndo_features_check() in a different place.
    This made the 'dev' argument unused.

    Remove the unused argument and go back to the code as before.

    Cc: Tom Herbert
    Cc: Jesse Gross
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • When you redirect a VLAN device to any device, you end up with
    crap in af_packet on the xmit path because hard_header_len is
    not equal to skb->mac_len. So the redirected packet contains
    four extra bytes at the start, which then get interpreted as
    part of the MAC address.

    This patch fixes this by only pushing skb->mac_len. We also
    need to fix ifb because it tries to undo the pushing done by
    act_mirred.

    Signed-off-by: Herbert Xu
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Two different problems are fixed here:

    1) inet_sk_diag_fill() might be called without socket lock held.
    icsk->icsk_ca_ops can change under us and the module can be unloaded.
    -> Access to freed memory.
    Fix this using rcu_read_lock() to prevent module unload.

    2) Some TCP Congestion Control modules provide information
    but again this is not safe against icsk->icsk_ca_ops
    change and nla_put() errors were ignored. Some sockets
    could not get the additional info if skb was almost full.

    Fix this by returning a status from get_info() handlers and
    using rcu protection as well.

    Signed-off-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • tcp_get_info() can be called without holding socket lock,
    so any socket fields can change under us.

    Use READ_ONCE() to fetch sk_pacing_rate and sk_max_pacing_rate.

    Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Apr, 2015

5 commits

  • On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
    > Le 15/04/2015 15:57, Herbert Xu a écrit :
    > >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
    > [snip]
    > >Subject: skbuff: Do not scrub skb mark within the same name space
    > >
    > >The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
    > Maybe add a Fixes tag?
    > Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
    >
    > >harmonize cleanup done on skb on rx path") broke anyone trying to
    > >use netfilter marking across IPv4 tunnels. While most of the
    > >fields that are cleared by skb_scrub_packet don't matter, the
    > >netfilter mark must be preserved.
    > >
    > >This patch rearranges skb_scurb_packet to preserve the mark field.
    > nit: s/scurb/scrub
    >
    > Else it's fine for me.

    Sure.

    PS I used the wrong email for James the first time around. So
    let me repeat the question here. Should secmark be preserved
    or cleared across tunnels within the same name space? In fact,
    do our security models even support name spaces?

    ---8<---
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch reverts commit b8fb4e0648a2ab3734140342002f68fb0c7d1602
    because the secmark must be preserved even when a packet crosses
    namespace boundaries. The reason is that security labels apply to
    the system as a whole and are not per-namespace.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • As a short-term solution, let's fix the bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data, so that the
    same eBPF programs work with cls_bpf and act_bpf on both the ingress
    and egress qdisc paths. We need to ensure that mac_header is set
    before calling into programs. This is effectively the first option
    from the discussion referenced below.

    A longer-term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this, and will be implemented
    later, as it's too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb() as has been suggested
    as second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    For dealing with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwriteable
    headers.

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Remove duplicated include.

    Signed-off-by: Wei Yongjun
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • Fixes: 7a6c8c34e5b7 ("fou: implement FOU_CMD_GET")
    Reported-by: Dan Carpenter
    Cc: Dan Carpenter
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

16 Apr, 2015

8 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • The current semantics of string_escape_mem are inadequate for one of its
    current users, vsnprintf(). If that is to honour its contract, it must
    know how much space would be needed for the entire escaped buffer, and
    string_escape_mem provides no way of obtaining that (short of allocating a
    large enough buffer (~4 times input string) to let it play with, and
    that's definitely a big no-no inside vsnprintf).

    So change the semantics for string_escape_mem to be more snprintf-like:
    Return the size of the output that would be generated if the destination
    buffer was big enough, but of course still only write to the part of dst
    it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
    It is then up to the caller to detect whether output was truncated and to
    append a '\0' if desired. Also, we must output partial escape sequences,
    otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
    printf to write a \0 to buf[2] while leaving buf[0] and buf[1] with
    whatever they previously contained.
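
The snprintf-like contract described above can be modeled in userspace as follows. This is a sketch only: `escape_mem` is a hypothetical Python stand-in, not the kernel's string_escape_mem(), and it simply escapes every non-printable byte as a backslash followed by three octal digits, ignoring the kernel's flag handling.

```python
def escape_mem(src, dst, osz):
    """Escape src (bytes) into dst (bytearray), writing at most osz bytes.

    Returns the number of bytes the fully escaped output needs, whether or
    not it fit. No terminating NUL is written; that is left to the caller.
    """
    needed = 0
    for byte in src:
        if 0x20 <= byte < 0x7F:
            chunk = bytes([byte])        # printable ASCII passes through
        else:
            chunk = b"\\%03o" % byte     # e.g. 0x01 becomes \001
        for out in chunk:
            # Partial escape sequences are written as far as space allows.
            if needed < osz:
                dst[needed] = out
            needed += 1
    return needed
```

A caller detects truncation by comparing the return value with `osz`, and can size a buffer by first calling with `osz` set to 0, in which case `dst` is never touched (the model accepts `None` there, mirroring the NULL case mentioned below).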

    This also fixes a bug in the escaped_string() helper function, which used
    to unconditionally pass a length of "end-buf" to string_escape_mem();
    since the latter doesn't check osz for being insanely large, it would
    happily write to dst. For example, kasprintf(GFP_KERNEL, "something and
    then %pE", ...); is an easy way to trigger an oops.

    In test-string_helpers.c, the -ENOMEM test is replaced with testing for
    getting the expected return value even if the buffer is too small. We
    also ensure that nothing is written (by relying on a NULL pointer deref)
    if the output size is 0 by passing NULL - this has to work for
    kasprintf("%pE") to work.

    In net/sunrpc/cache.c, I think qword_add still has the same semantics.
    Someone should definitely double-check this.

    In fs/proc/array.c, I made the minimum possible change, but longer-term it
    should stop poking around in seq_file internals.

    [andriy.shevchenko@linux.intel.com: simplify qword_add]
    [andriy.shevchenko@linux.intel.com: add missed curly braces]
    Signed-off-by: Rasmus Villemoes
    Acked-by: Andy Shevchenko
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     
  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • socket inodes and sunrpc filesystems - inodes owned by that code

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • places where we are dealing with S_ISSOCK file creation/lookups.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • AF_UNIX sockets should call mknod on the top layer only and should not attempt
    to modify the lower layer in a layered filesystem such as overlayfs.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

15 Apr, 2015

4 commits

  • Merge first patchbomb from Andrew Morton:

    - arch/sh updates

    - ocfs2 updates

    - kernel/watchdog feature

    - about half of mm/

    * emailed patches from Andrew Morton: (122 commits)
    Documentation: update arch list in the 'memtest' entry
    Kconfig: memtest: update number of test patterns up to 17
    arm: add support for memtest
    arm64: add support for memtest
    memtest: use phys_addr_t for physical addresses
    mm: move memtest under mm
    mm, hugetlb: abort __get_user_pages if current has been oom killed
    mm, mempool: do not allow atomic resizing
    memcg: print cgroup information when system panics due to panic_on_oom
    mm: numa: remove migrate_ratelimited
    mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    mm: split ET_DYN ASLR from mmap ASLR
    s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    mm: expose arch_mmap_rnd when available
    s390: standardize mmap_rnd() usage
    powerpc: standardize mmap_rnd() usage
    mips: extract logic for mmap_rnd()
    arm64: standardize mmap_rnd() usage
    x86: standardize mmap_rnd() usage
    arm: factor out mmap ASLR into mmap_rnd
    ...

    Linus Torvalds
     
  • NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

    GFP_THISNODE is a secret combination of gfp bits that behaves
    differently than expected. It is a combination of __GFP_THISNODE,
    __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
    allocator slowpath to fail without trying reclaim even though it may be
    used in combination with __GFP_WAIT.

    An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
    GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
    that really just wanted __GFP_THISNODE. The problem doesn't end there,
    however, because even that conversion was a no-op for alloc_misplaced_dst_page(),
    which also sets __GFP_NORETRY and __GFP_NOWARN, and
    migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWARN
    are set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a
    no-op in these cases since the page allocator special-cases
    __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

    It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE
    to restrict an allocation to a local node, but remove GFP_THISNODE and
    its obscurity. Instead, we require that a caller clear __GFP_WAIT if it
    wants to avoid reclaim.

    This allows the aforementioned functions to actually reclaim as they
    should. It also enables any future callers that want to do
    __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
    rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

    Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
    it is unchanged.
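
    As a rough illustration of the rule described above, the following
    userspace sketch models the reclaim decision before and after the change.
    The DEMO_* bit values and the may_reclaim_old(), may_reclaim_new() and
    gfp_no_reclaim() helpers are assumptions made up for this example; they are
    not the kernel's definitions or the page allocator's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. The real flags are
 * defined in include/linux/gfp.h and use different values. */
#define DEMO_GFP_WAIT     0x01u /* stands in for __GFP_WAIT */
#define DEMO_GFP_NORETRY  0x02u /* stands in for __GFP_NORETRY */
#define DEMO_GFP_NOWARN   0x04u /* stands in for __GFP_NOWARN */
#define DEMO_GFP_THISNODE 0x08u /* stands in for __GFP_THISNODE */

/* The combination that used to be spelled GFP_THISNODE. */
#define DEMO_OLD_GFP_THISNODE \
    (DEMO_GFP_THISNODE | DEMO_GFP_NORETRY | DEMO_GFP_NOWARN)

/* Old behaviour as described in the text: if all three bits are present,
 * fail without reclaim, even when the wait bit is also set. */
bool may_reclaim_old(unsigned int gfp)
{
    if ((gfp & DEMO_OLD_GFP_THISNODE) == DEMO_OLD_GFP_THISNODE)
        return false;
    return (gfp & DEMO_GFP_WAIT) != 0;
}

/* New rule: only the wait bit decides whether reclaim may happen. */
bool may_reclaim_new(unsigned int gfp)
{
    return (gfp & DEMO_GFP_WAIT) != 0;
}

/* What a caller does under the new rule when it wants to avoid reclaim. */
unsigned int gfp_no_reclaim(unsigned int gfp)
{
    unsigned int out = gfp & ~DEMO_GFP_WAIT;

    assert(!may_reclaim_new(out));
    return out;
}
```

    In this model, a mask with the wait bit plus all three special-cased bits
    is refused reclaim by may_reclaim_old() but allowed by may_reclaim_new(),
    which is the behaviour change the message describes.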

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    A final pull request, I know it's very late but this time I think it's worth a
    bit of rush.

    The following patchset contains Netfilter/nf_tables updates for net-next, more
    specifically concatenation support and dynamic stateful expression
    instantiation.

    This also comes with a couple of small patches. One to fix the ebtables.h
    userspace header and another to get rid of an obsolete example file in tree
    that describes a nf_tables expression.

    This time, I decided to paste the original descriptions. This will result in a
    rather large commit description, but I think these bytes are worth keeping.

    Patrick McHardy says:

    ====================
    netfilter: nf_tables: concatenation support

    The following patches add support for concatenations, which allow
    multi-dimensional exact matches in O(1).

    The basic idea is to split the data registers, currently consisting of
    4 registers of 16 bytes each, into smaller units, 16 registers of 4
    bytes each, and making sure each register store always leaves the
    full 32 bit in a well defined state, meaning smaller stores will
    zero the remaining bits.

    Based on that, we can load multiple adjacent registers with different
    values, thereby building a concatenated bigger value, and use that
    value for set lookups.
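
    The register layout described above can be sketched in userspace C roughly
    as follows. This is a minimal model under stated assumptions: struct
    regs_sketch, reg_store(), reg_concat_key() and keys_equal() are
    hypothetical names invented for the example and are not the nf_tables
    implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NREGS 16 /* 16 registers of 4 bytes each, as in the text */

struct regs_sketch {
    uint32_t data[SKETCH_NREGS];
};

/* Store len bytes starting at register reg. Every register touched is
 * zeroed first, so a store shorter than 4 bytes leaves the remaining
 * bits of its 32-bit register in a well-defined (zero) state. */
void reg_store(struct regs_sketch *r, unsigned int reg,
               const void *src, size_t len)
{
    size_t nregs = (len + 3) / 4;

    assert(len > 0 && reg < SKETCH_NREGS && nregs <= SKETCH_NREGS - reg);
    memset(&r->data[reg], 0, nregs * sizeof(uint32_t));
    memcpy(&r->data[reg], src, len);
}

/* A concatenated key is simply the run of adjacent registers
 * [reg, reg + nregs), usable directly as a lookup key. */
const uint32_t *reg_concat_key(const struct regs_sketch *r,
                               unsigned int reg, unsigned int nregs)
{
    assert(reg < SKETCH_NREGS && nregs <= SKETCH_NREGS - reg);
    return &r->data[reg];
}

/* Exact match over the concatenation, compared as 32-bit words. */
int keys_equal(const uint32_t *a, const uint32_t *b, unsigned int nregs)
{
    return memcmp(a, b, nregs * sizeof(uint32_t)) == 0;
}
```

    Loading a 4-byte address into register 0 and a 2-byte port into register 1
    then yields an 8-byte key in registers 0 and 1. Because the short store is
    zero-padded, two such keys compare equal regardless of what the registers
    held before.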

    Sets are changed to use variable sized extensions for their key and
    data values, removing the fixed limit of 16 bytes while saving memory
    if less space is needed.

    As a side effect, these patches will allow some nice optimizations in
    the future, like using jhash2 in nft_hash, removing the masking in
    nft_cmp_fast, optimized data comparison using 32 bit word size etc.
    These are not done so far however.

    The patches are split up as follows:

    * the first five patches add length validation to register loads and
    stores to make sure we stay within bounds and prepare the validation
    functions for the new addressing mode

    * the next patches prepare for changing to 32 bit addressing by
    introducing a struct nft_regs, which holds the verdict register as
    well as the data registers. The verdict members are moved to a new
    struct nft_verdict to allow pulling struct nft_data out of the stack.

    * the next patches contain preparatory conversions of expressions and
    sets to use 32 bit addressing

    * the next patch introduces so far unused register conversion helpers
    for parsing and dumping register numbers over netlink

    * following is the real conversion to 32 bit addressing, consisting of
    replacing struct nft_data in struct nft_regs by an array of u32s and
    actually translating and validating the new register numbers.

    * the final two patches add support for variable sized data items and
    variable sized keys / data in set elements

    The patches have been verified to work correctly with nft binaries using
    both old and new addressing.
    ====================

    Patrick McHardy says:

    ====================
    netfilter: nf_tables: dynamic stateful expression instantiation

    The following patches are the grand finale of my nf_tables set work,
    using all the building blocks put in place by the previous patches
    to support something like iptables hashlimit, but a lot more powerful.

    Sets are extended to allow attaching expressions to set elements.
    The dynset expression dynamically instantiates these expressions
    based on a template when creating new set elements and evaluates
    them for all new or updated set members.

    In combination with concatenations this effectively creates state
    tables for arbitrary combinations of keys, using the existing
    expression types to maintain that state. Regular set GC takes care
    of purging expired states.

    We currently support two different stateful expressions, counter
    and limit. Using limit as a template we can express the functionality
    of hashlimit, but completely unrestricted in the combination of keys.
    Using counter we can perform accounting for arbitrary flows.

    The following examples from patch 5/5 show some possibilities.
    Userspace syntax is still WIP, especially the listing of state
    tables will most likely be separated from normal set listings
    and use a more structured format:

    1. Limit the rate of new SSH connections per host, similar to iptables
    hashlimit:

    flow ip saddr timeout 60s \
    limit 10/second \
    accept

    2. Account network traffic between each set of /24 networks:

    flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
    counter

    3. Account traffic to each host per user:

    flow skuid . ip daddr \
    counter

    4. Account traffic for each combination of source address and TCP flags:

    flow ip saddr . tcp flags \
    counter

    The resulting set content after a Xmas-scan looks like this:

    {
    192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
    192.168.122.1 . ack : counter packets 74 bytes 3848,
    192.168.122.1 . psh | ack : counter packets 35 bytes 3144
    }
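
    Conceptually, a state table like the one listed above maps a concatenated
    key to per-element state that is created from a template on first sight
    and updated on every later match. The sketch below models that idea in
    userspace C with a counter as the state. It is only an illustration: the
    struct and function names are hypothetical, the hash and fixed-size table
    are arbitrary choices, and timeouts and garbage collection are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_WORDS   2  /* e.g. source address . TCP flags */
#define TABLE_SLOTS 64

struct counter_state {
    uint64_t packets;
    uint64_t bytes;
};

struct flow_elem {
    int used;
    uint32_t key[KEY_WORDS];
    struct counter_state ctr;
};

struct flow_table {
    struct counter_state tmpl; /* template copied into each new element */
    struct flow_elem slot[TABLE_SLOTS];
};

/* Simple FNV-style hash over the key words (an arbitrary choice here). */
static uint32_t hash_key(const uint32_t *key)
{
    uint32_t h = 2166136261u;

    for (int i = 0; i < KEY_WORDS; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}

/* Find the element for key, creating it from the template if it does not
 * exist yet, then evaluate the counter for it. Returns NULL if the table
 * is full. */
struct flow_elem *flow_update(struct flow_table *t, const uint32_t *key,
                              uint32_t pkt_len)
{
    assert(t != NULL && key != NULL);

    uint32_t start = hash_key(key) % TABLE_SLOTS;

    for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
        struct flow_elem *e = &t->slot[(start + i) % TABLE_SLOTS];

        if (!e->used) {
            /* New element: instantiate its state from the template. */
            e->used = 1;
            memcpy(e->key, key, sizeof(e->key));
            e->ctr = t->tmpl;
        } else if (memcmp(e->key, key, sizeof(e->key)) != 0) {
            continue; /* slot belongs to another key, keep probing */
        }
        e->ctr.packets++;
        e->ctr.bytes += pkt_len;
        return e;
    }
    return NULL;
}
```

    Calling flow_update() once per packet with the words of the concatenated
    key accumulates one packets/bytes pair per distinct key, which is the kind
    of content shown in the listing.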

    In the future the "expressions attached to elements" will be extended
    to also support user-created non-stateful expressions, allowing efficient
    selection between a set of parameter sets, e.g. a set of log
    statements with different prefixes based on the interface, which currently
    require one rule each. This will most likely have to wait until the next
    kernel version though.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds