23 Nov, 2013

3 commits

  • Pull networking fixes from David Miller:

    1) Fix memory leaks and other issues in mwifiex driver, from Amitkumar
    Karwar.

    2) skb_segment() can choke on packets using frag lists, fix from
    Herbert Xu with help from Eric Dumazet and others.

    3) IPv4 output cached route instantiation properly handles races
    involving two threads trying to install the same route, but we
    forgot to propagate this logic to input routes as well. Fix from
    Alexei Starovoitov.

    4) Put protections in place to make sure that recvmsg() paths never
    accidentally copy uninitialized memory back into userspace and also
    make sure that we never try to use more than sockaddr_storage for
    building the on-kernel-stack copy of a sockaddr. Fixes from Hannes
    Frederic Sowa.

    5) R8152 driver transmit flow bug fixes from Hayes Wang.

    6) Fix some minor fallouts from genetlink changes, from Johannes Berg
    and Michael Opdenacker.

    7) AF_PACKET sendmsg path can race with netdevice unregister notifier,
    fix by using RCU to make sure the network device doesn't go away
    from under us. Fix from Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
    gso: handle new frag_list of frags GRO packets
    genetlink: fix genl_set_err() group ID
    genetlink: fix genlmsg_multicast() bug
    packet: fix use after free race in send path when dev is released
    xen-netback: stop the VIF thread before unbinding IRQs
    wimax: remove dead code
    net/phy: Add the autocross feature for forced links on VSC82x4
    net/phy: Add VSC8662 support
    net/phy: Add VSC8574 support
    net/phy: Add VSC8234 support
    net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
    net: rework recvmsg handler msg_name and msg_namelen logic
    bridge: flush br's address entry in fdb when remove the
    net: core: Always propagate flag changes to interfaces
    ipv4: fix race in concurrent ip_route_input_slow()
    r8152: fix incorrect type in assignment
    r8152: support stopping/waking tx queue
    r8152: modify the tx flow
    r8152: fix tx/rx memory overflow
    netfilter: ebt_ip6: fix source and destination matching
    ...

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "Almost all of these are bug fixes. Dave Sterba's documentation update
    is the big exception because he removed our promises to set any
    machine running Btrfs on fire"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Documentation: filesystems: update btrfs tools section
    Documentation: filesystems: add new btrfs mount options
    btrfs: update kconfig help text
    btrfs: fix bio_size_ok() for max_sectors > 0xffff
    btrfs: Use trace condition for get_extent tracepoint
    btrfs: fix typo in the log message
    Btrfs: fix list delete warning when removing ordered root from the list
    Btrfs: print bytenr instead of page pointer in check-int
    Btrfs: remove dead codes from ctree.h
    Btrfs: don't wait for ordered data outside desired range
    Btrfs: fix lockdep error in async commit
    Btrfs: avoid heavy operations in btrfs_commit_super
    Btrfs: fix __btrfs_start_workers retval
    Btrfs: disable online raid-repair on ro mounts
    Btrfs: do not inc uncorrectable_errors counter on ro scrubs
    Btrfs: only drop modified extents if we logged the whole inode
    Btrfs: make sure to copy everything if we rename
    Btrfs: don't BUG_ON() if we get an error walking backrefs

    Linus Torvalds
     
  • Pull SLAB changes from Pekka Enberg:
    "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
    slab internals similar to mm/slub.c. This reduces memory usage and
    improves performance:

    https://lkml.org/lkml/2013/10/16/155

    Rest of the changes are bug fixes from various people"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
    mm, slub: fix the typo in mm/slub.c
    mm, slub: fix the typo in include/linux/slub_def.h
    slub: Handle NULL parameter in kmem_cache_flags
    slab: replace non-existing 'struct freelist *' with 'void *'
    slab: fix to calm down kmemleak warning
    slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
    slab: rename slab_bufctl to slab_freelist
    slab: remove useless statement for checking pfmemalloc
    slab: use struct page for slab management
    slab: replace free and inuse in struct slab with newly introduced active
    slab: remove SLAB_LIMIT
    slab: remove kmem_bufctl_t
    slab: change the management method of free objects of the slab
    slab: use __GFP_COMP flag for allocating slab pages
    slab: use well-defined macro, virt_to_slab()
    slab: overloading the RCU head over the LRU for RCU free
    slab: remove cachep in struct slab_rcu
    slab: remove nodeid in struct slab
    slab: remove colouroff in struct slab
    slab: change return type of kmem_getpages() to struct page
    ...

    Linus Torvalds
     

22 Nov, 2013

8 commits

  • Merge patches from Andrew Morton:
    "13 fixes"

    * emailed patches from Andrew Morton:
    mm: place page->pmd_huge_pte to right union
    MAINTAINERS: add keyboard driver to Hyper-V file list
    x86, mm: do not leak page->ptl for pmd page tables
    ipc,shm: correct error return value in shmctl (SHM_UNLOCK)
    mm, mempolicy: silence gcc warning
    block/partitions/efi.c: fix bound check
    ARM: drivers/rtc/rtc-at91rm9200.c: disable interrupts at shutdown
    mm: hugetlbfs: fix hugetlbfs optimization
    kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
    ipc,shm: fix shm_file deletion races
    mm: thp: give transparent hugepage code a separate copy_page
    checkpatch: fix "Use of uninitialized value" warnings
    configfs: fix race between dentry put and lookup

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "In this patchset, we finally get an SELinux update, with Paul Moore
    taking over as maintainer of that code.

    Also a significant update for the Keys subsystem, as well as
    maintenance updates to Smack, IMA, TPM, and Apparmor"

    and since I wanted to know more about the updates to key handling,
    here's the explanation from David Howells on that:

    "Okay. There are a number of separate bits. I'll go over the big bits
    and the odd important other bit, most of the smaller bits are just
    fixes and cleanups. If you want the small bits accounting for, I can
    do that too.

    (1) Keyring capacity expansion.

    KEYS: Consolidate the concept of an 'index key' for key access
    KEYS: Introduce a search context structure
    KEYS: Search for auth-key by name rather than target key ID
    Add a generic associative array implementation.
    KEYS: Expand the capacity of a keyring

    Several of the patches are providing an expansion of the capacity of a
    keyring. Currently, the maximum size of a keyring payload is one page.
    Subtract a small header and then divide up into pointers, that only gives
    you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses
    a keyring to store ID mapping data, that has proven to be insufficient to
    the cause.
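
    The capacity figure can be reproduced with a small calculation. This is
    only a sketch: the 4096-byte page and the 16-byte header are assumptions
    (the message says just "one page" and "a small header"), and
    keyring_ptr_capacity() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

/* Number of key pointers that fit in a single-page keyring payload.
 * Both sizes are in bytes. header_size is an assumed value, since the
 * message only calls the header "small". */
static size_t keyring_ptr_capacity(size_t page_size, size_t header_size)
{
    if (header_size >= page_size)
        return 0;
    return (page_size - header_size) / sizeof(void *);
}
```

    With 8-byte pointers, keyring_ptr_capacity(4096, 16) gives 510, which is
    in line with the "~500 pointers" quoted for x86_64.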

    Whatever data structure I use to handle the keyring payload, it can only
    store pointers to keys, not the keys themselves because several keyrings
    may point to a single key. This precludes inserting, say, an rb_node
    struct into the key struct for this purpose.

    I could make an rbtree of records such that each record has an rb_node
    and a key pointer, but that would use four words of space per key stored
    in the keyring. It would, however, be able to use much existing code.

    I selected instead a non-rebalancing radix-tree type approach as that
    could have a better space-used/key-pointer ratio. I could have used the
    radix tree implementation that we already have and insert keys into it by
    their serial numbers, but that means any sort of search must iterate over
    the whole radix tree. Further, its nodes are a bit on the capacious side
    for what I want - especially given that key serial numbers are randomly
    allocated, thus leaving a lot of empty space in the tree.

    So what I have is an associative array that internally is a radix-tree
    with 16 pointers per node where the index key is constructed from the key
    type pointer and the key description. This means that an exact lookup by
    type+description is very fast as this tells us how to navigate directly to
    the target key.

    I made the data structure general in lib/assoc_array.c: as far as it is
    concerned, its index key is just a sequence of bits that leads to a
    pointer. It's possible that someone else will be able to make use of it
    also. FS-Cache might, for example.
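
    To illustrate the idea of a 16-way tree navigated by successive bits of
    an index key, here is a deliberately simplified userspace sketch. It is
    not the lib/assoc_array.c implementation: it assumes a fixed 32-bit index
    key consumed four bits per level, and every name in it (trie_node,
    trie_insert, trie_lookup) is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define FAN_OUT     16   /* 16 pointers per node, as described in the text */
#define KEY_NIBBLES 8    /* assumption: a fixed 32-bit index key */

struct trie_node {
    void *slot[FAN_OUT]; /* child node, or the stored pointer at the last level */
};

/* Extract the 4-bit chunk consumed at a given level, most significant first. */
static unsigned nibble_at(uint32_t index_key, int level)
{
    return (index_key >> (4 * (KEY_NIBBLES - 1 - level))) & 0xf;
}

/* Store leaf under index_key. Returns 0 on success, -1 if allocation fails. */
static int trie_insert(struct trie_node *root, uint32_t index_key, void *leaf)
{
    struct trie_node *node = root;

    for (int level = 0; level < KEY_NIBBLES - 1; level++) {
        unsigned n = nibble_at(index_key, level);

        if (!node->slot[n]) {
            node->slot[n] = calloc(1, sizeof(struct trie_node));
            if (!node->slot[n])
                return -1;
        }
        node = node->slot[n];
    }
    node->slot[nibble_at(index_key, KEY_NIBBLES - 1)] = leaf;
    return 0;
}

/* Exact lookup: the key itself says which slot to follow at each level. */
static void *trie_lookup(const struct trie_node *root, uint32_t index_key)
{
    const struct trie_node *node = root;

    for (int level = 0; level < KEY_NIBBLES - 1; level++) {
        node = node->slot[nibble_at(index_key, level)];
        if (!node)
            return NULL;
    }
    return node->slot[nibble_at(index_key, KEY_NIBBLES - 1)];
}
```

    The point of the sketch is the lookup path: an exact match never scans
    the tree, it follows one slot per level. Only pointers are stored, so the
    same object can be referenced from several such trees.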

    (2) Mark keys as 'trusted' and keyrings as 'trusted only'.

    KEYS: verify a certificate is signed by a 'trusted' key
    KEYS: Make the system 'trusted' keyring viewable by userspace
    KEYS: Add a 'trusted' flag and a 'trusted only' flag
    KEYS: Separate the kernel signature checking keyring from module signing

    These patches allow keys carrying asymmetric public keys to be marked as
    being 'trusted' and allow keyrings to be marked as only permitting the
    addition or linkage of trusted keys.

    Keys loaded from hardware during kernel boot or compiled into the kernel
    during build are marked as being trusted automatically. New keys can be
    loaded at runtime with add_key(). They are checked against the system
    keyring contents and if their signatures can be validated with keys that
    are already marked trusted, then they are marked trusted also and can
    thus be added into the master keyring.

    Patches from Mimi Zohar make this usable with the IMA keyrings also.

    (3) Remove the date checks on the key used to validate a module signature.

    X.509: Remove certificate date checks

    It's not reasonable to reject a signature just because the key that it was
    generated with is no longer valid datewise - especially if the kernel
    hasn't yet managed to set the system clock when the first module is
    loaded - so just remove those checks.

    (4) Make it simpler to deal with additional X.509 being loaded into the kernel.

    KEYS: Load *.x509 files into kernel keyring
    KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate

    The builder of the kernel now just places files with the extension ".x509"
    into the kernel source or build trees and they're concatenated by the
    kernel build and stuffed into the appropriate section.

    (5) Add support for userspace kerberos to use keyrings.

    KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
    KEYS: Implement a big key type that can save to tmpfs

    Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
    We looked at storing it in keyrings instead as that confers certain
    advantages such as tickets being automatically deleted after a certain
    amount of time and the ability for the kernel to get at these tokens more
    easily.

    To make this work, two things were needed:

    (a) A way for the tickets to persist beyond the lifetime of all a user's
    sessions so that cron-driven processes can still use them.

    The problem is that a user's session keyrings are deleted when the
    session that spawned them logs out and the user's user keyring is
    deleted when the UID is deleted (typically when the last log out
    happens), so neither of these places is suitable.

    I've added a system keyring into which a 'persistent' keyring is
    created for each UID on request. Each time a user requests their
    persistent keyring, the expiry time on it is set anew. If the user
    doesn't ask for it for, say, three days, the keyring is automatically
    expired and garbage collected using the existing gc. All the kerberos
    tokens it held are then also gc'd.
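
    The expiry behaviour described in (a) can be modelled in a few lines.
    This is a sketch only: the struct and function names are hypothetical,
    and the three-day lifetime is the "say, three days" example from the
    text rather than a confirmed default.

```c
#include <stdbool.h>
#include <time.h>

#define PERSISTENT_TTL (3 * 24 * 60 * 60) /* "say, three days", in seconds */

struct persistent_keyring {
    time_t expiry; /* absolute time after which a gc pass may reap it */
};

/* Called each time the user requests their persistent keyring:
 * the expiry is set anew, counted from the current time. */
static void persistent_keyring_touch(struct persistent_keyring *kr, time_t now)
{
    kr->expiry = now + PERSISTENT_TTL;
}

/* The test a garbage collector pass would apply. */
static bool persistent_keyring_expired(const struct persistent_keyring *kr,
                                       time_t now)
{
    return now >= kr->expiry;
}
```

    A keyring that keeps being requested never expires. One left alone for
    the full lifetime becomes eligible for collection, together with
    whatever it holds.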

    (b) A key type that can hold really big tickets (up to 1MB in size).

    The problem is that Active Directory can return huge tickets with lots
    of auxiliary data attached. We don't, however, want to eat up huge
    tracts of unswappable kernel space for this, so if the ticket is
    greater than a certain size, we create a swappable shmem file and dump
    the contents in there and just live with the fact we then have an
    inode and a dentry overhead. If the ticket is smaller than that, we
    slap it in a kmalloc()'d buffer"
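
    The size-based choice described in (b) can be sketched in userspace,
    with malloc() standing in for kmalloc() and a tmpfile() standing in for
    the swappable shmem file. The 4096-byte threshold and all names here are
    assumptions; the message only says "a certain size".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_THRESHOLD 4096 /* assumed cut-off, not taken from the source */

struct big_payload {
    size_t len;
    void  *inline_buf; /* used when len <= INLINE_THRESHOLD */
    FILE  *backing;    /* used otherwise; stands in for the shmem file */
};

/* Store data in a heap buffer or in a temporary file. Returns 0 on success. */
static int big_payload_store(struct big_payload *p, const void *data, size_t len)
{
    memset(p, 0, sizeof(*p));
    p->len = len;

    if (len <= INLINE_THRESHOLD) {
        p->inline_buf = malloc(len ? len : 1);
        if (!p->inline_buf)
            return -1;
        memcpy(p->inline_buf, data, len);
        return 0;
    }

    p->backing = tmpfile();
    if (!p->backing)
        return -1;
    if (fwrite(data, 1, len, p->backing) != len) {
        fclose(p->backing);
        p->backing = NULL;
        return -1;
    }
    return 0;
}

/* Copy the payload into out, which must hold at least p->len bytes. */
static int big_payload_read(struct big_payload *p, void *out)
{
    if (p->inline_buf) {
        memcpy(out, p->inline_buf, p->len);
        return 0;
    }
    if (!p->backing)
        return -1;
    rewind(p->backing);
    return fread(out, 1, p->len, p->backing) == p->len ? 0 : -1;
}

/* Release whichever backing store was used. */
static void big_payload_free(struct big_payload *p)
{
    free(p->inline_buf);
    if (p->backing)
        fclose(p->backing);
    memset(p, 0, sizeof(*p));
}
```

    Small payloads pay no file overhead, and large ones do not pin memory
    that cannot be paged out, which is the trade-off the message describes.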

    * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
    KEYS: Fix keyring content gc scanner
    KEYS: Fix error handling in big_key instantiation
    KEYS: Fix UID check in keyctl_get_persistent()
    KEYS: The RSA public key algorithm needs to select MPILIB
    ima: define '_ima' as a builtin 'trusted' keyring
    ima: extend the measurement list to include the file signature
    kernel/system_certificate.S: use real contents instead of macro GLOBAL()
    KEYS: fix error return code in big_key_instantiate()
    KEYS: Fix keyring quota misaccounting on key replacement and unlink
    KEYS: Fix a race between negating a key and reading the error set
    KEYS: Make BIG_KEYS boolean
    apparmor: remove the "task" arg from may_change_ptraced_domain()
    apparmor: remove parent task info from audit logging
    apparmor: remove tsk field from the apparmor_audit_struct
    apparmor: fix capability to not use the current task, during reporting
    Smack: Ptrace access check mode
    ima: provide hash algo info in the xattr
    ima: enable support for larger default filedata hash algorithms
    ima: define kernel parameter 'ima_template=' to change configured default
    ima: add Kconfig default measurement list template
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     
  • I don't know what went wrong, mis-merge or something, but ->pmd_huge_pte
    is placed in the wrong union within struct page.

    In original patch[1] it's placed to union with ->lru and ->slab, but in
    commit e009bb30c8df ("mm: implement split page table lock for PMD
    level") it's in union with ->index and ->freelist.

    That union seems also unused for pages with page tables and safe to
    re-use, but it's not what I've tested.

    Let's move it to original place. It fixes indentation at least. :)

    [1] https://lkml.org/lkml/2013/10/7/288

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause dereference of a dangling pointer if
    split_huge_page runs during PageHuge() if there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);
            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
            ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.
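
    The failure mode and the shape of the fix can be modelled outside the
    kernel. The sketch below is an assumption-laden stand-in: struct hstate
    is reduced to an order, only a 1GB hstate is registered (order 18 with
    4KB pages), and hstate_for_order() and huge_copy_nr_pages() are
    hypothetical names rather than the functions touched by the patch.

```c
#include <stddef.h>

struct hstate { unsigned int order; }; /* simplified stand-in */

/* Assumed configuration: default_hugepagesz=1G and no 2MB pool. */
static struct hstate hstates[] = { { 18 } };

#define THP_ORDER 9 /* 2MB / 4KB */

/* Order-based lookup: returns NULL when no hstate of that order exists. */
static struct hstate *hstate_for_order(unsigned int order)
{
    for (size_t i = 0; i < sizeof(hstates) / sizeof(hstates[0]); i++)
        if (hstates[i].order == order)
            return &hstates[i];
    return NULL;
}

/* Number of base pages to copy. THP is handled explicitly, so the hstate
 * is consulted only for hugetlbfs pages. Returns 0 if no hstate matches. */
static unsigned long huge_copy_nr_pages(unsigned int order, int is_hugetlbfs)
{
    if (!is_hugetlbfs)
        return 1UL << order; /* THP: size comes from the page itself */

    struct hstate *h = hstate_for_order(order);
    return h ? 1UL << h->order : 0;
}
```

    In this model hstate_for_order(THP_ORDER) is NULL, which corresponds to
    the pointer the unpatched code would go on to dereference.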

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Fix another really stupid bug - I introduced genl_set_err()
    precisely to be able to adjust the group and reject invalid
    ones, but then forgot to do so.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Unfortunately, I introduced a tremendously stupid bug into
    genlmsg_multicast() when doing all those multicast group
    changes: it adjusts the group number, but then passes it
    to genlmsg_multicast_netns() which does that again.

    Somehow, my tests failed to catch this, so add a warning
    into genlmsg_multicast_netns() and remove the offending
    group ID adjustment.

    Also add a warning to the similar code in other functions
    so people who misuse them are more loudly warned.
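
    A reduced model of the double adjustment, with a range check in the
    inner helper playing the role of the added warning. The struct and
    function names are illustrative and do not reproduce the genetlink API.

```c
#include <stdio.h>

struct family {
    unsigned int mcgrp_offset; /* first global group ID owned by the family */
    unsigned int n_mcgrps;     /* number of groups the family registered */
};

/* Inner helper: translates a family-relative group into a global ID.
 * Warns and returns -1 for an out-of-range group, which is what an
 * already-adjusted ID usually looks like. */
static int multicast_netns(const struct family *f, unsigned int group)
{
    if (group >= f->n_mcgrps) {
        fprintf(stderr, "warning: group %u out of range\n", group);
        return -1;
    }
    return (int)(f->mcgrp_offset + group);
}

/* Buggy wrapper: adjusts the group, then the helper adjusts it again. */
static int multicast_buggy(const struct family *f, unsigned int group)
{
    return multicast_netns(f, f->mcgrp_offset + group);
}

/* Fixed wrapper: leaves the adjustment to the helper. */
static int multicast_fixed(const struct family *f, unsigned int group)
{
    return multicast_netns(f, group);
}
```

    In this model the two wrappers agree whenever mcgrp_offset is 0, so the
    double adjustment only shows up for a family whose groups start at a
    non-zero offset.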

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

21 Nov, 2013

9 commits

  • Add auto-MDI/MDI-X capability for forced (autonegotiation disabled)
    10/100 Mbps speeds on Vitesse VSC82x4 PHYs. Exported previously static
    function genphy_setup_forced() required by the new config_aneg handler
    in the Vitesse PHY module.

    Signed-off-by: Madalin Bucur
    Signed-off-by: Shruti Kanetkar
    Signed-off-by: David S. Miller

    Madalin Bucur
     
  • This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
    set msg_namelen to the proper size

    Suggested-by: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
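
    The contract in this commit (the length starts at 0, the handler sets
    it, and nothing larger than struct sockaddr_storage is acceptable) can be
    sketched in userspace as follows. copy_name_out() is a hypothetical
    helper, and returning -1 stands in for the BUG_ON added by the companion
    commit.

```c
#include <string.h>
#include <sys/socket.h>

/* Copy a handler-supplied address into the caller's buffer.
 * namelen is whatever the handler set (0 if it set nothing). A length
 * larger than sockaddr_storage is treated as a bug and never copied.
 * Returns the number of bytes copied, or -1 for an impossible length. */
static int copy_name_out(void *dst, size_t dst_len,
                         const struct sockaddr_storage *name, size_t namelen)
{
    if (namelen > sizeof(struct sockaddr_storage))
        return -1;
    if (namelen > dst_len)
        namelen = dst_len; /* truncate to what the caller can hold */
    memcpy(dst, name, namelen);
    return (int)namelen;
}
```

    Because the default length is 0, a handler that forgets to fill in the
    name results in nothing being copied rather than stale stack contents.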

    Hannes Frederic Sowa
     
  • Doing an if statement to test some condition to know if we should
    trigger a tracepoint is pointless when tracing is disabled. This just
    adds overhead and wastes a branch prediction. This is why the
    TRACE_EVENT_CONDITION() was created. It places the check inside the jump
    label so that the branch does not happen unless tracing is enabled.

    That is, instead of doing:

    if (em)
            trace_btrfs_get_extent(root, em);

    Which is basically this:

    if (em)
            if (static_key(trace_btrfs_get_extent)) {

    Using a TRACE_EVENT_CONDITION() we can just do:

    trace_btrfs_get_extent(root, em);

    And the condition trace event will do:

    if (static_key(trace_btrfs_get_extent)) {
            if (em) {
                    ...

    The static key is a non conditional jump (or nop) that is faster than
    having to check if em is NULL or not.
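
    The ordering of the two checks can be demonstrated with a small
    userspace model. A plain bool stands in for the static key here, which
    is an important simplification: a real static key is a patched jump or
    nop, not a memory load. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

static bool tracing_enabled;  /* stands in for the static key / jump label */
static int  condition_checks; /* how often the condition was evaluated */
static int  events_emitted;   /* how often an event was actually recorded */

/* Conditional trace event: the "is tracing on?" test comes first, and the
 * condition on em is only evaluated behind it. */
static void trace_get_extent(const void *em)
{
    if (!tracing_enabled)
        return;
    condition_checks++;
    if (em)
        events_emitted++;
}
```

    With tracing disabled, calling trace_get_extent() leaves both counters
    at zero, so the caller no longer pays for the NULL test on the common
    path.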

    Signed-off-by: Steven Rostedt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Steven Rostedt
     
  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull vfs bits and pieces from Al Viro:
    "Assorted bits that got missed in the first pull request + fixes for a
    couple of coredump regressions"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fold try_to_ascend() into the sole remaining caller
    dcache.c: get rid of pointless macros
    take read_seqbegin_or_lock() and friends to seqlock.h
    consolidate simple ->d_delete() instances
    gfs2: endianness misannotations
    dump_emit(): use __kernel_write(), not vfs_write()
    dump_align(): fix the dumb braino

    Linus Torvalds
     
  • Pull more ACPI and power management updates from Rafael Wysocki:

    - ACPI-based device hotplug fixes for issues introduced recently and a
    fix for an older error code path bug in the ACPI PCI host bridge
    driver

    - Fix for recently broken OMAP cpufreq build from Viresh Kumar

    - Fix for a recent hibernation regression related to s2disk

    - Fix for a locking-related regression in the ACPI EC driver from
    Puneet Kumar

    - System suspend error code path fix related to runtime PM and runtime
    PM documentation update from Ulf Hansson

    - cpufreq's conservative governor fix from Xiaoguang Chen

    - New processor IDs for intel_idle and turbostat and removal of an
    obsolete Kconfig option from Len Brown

    - New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
    ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg

    - Removal of several ACPI video DMI blacklist entries that are not
    necessary any more from Aaron Lu

    - Rework of the ACPI companion representation in struct device and code
    cleanup related to that change from Rafael J Wysocki, Lan Tianyu and
    Jarkko Nikula

    - Fixes for assigning names to ACPI-enumerated I2C and SPI devices from
    Jarkko Nikula

    * tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits)
    PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
    ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
    ACPI / PCI root: Clear driver_data before failing enumeration
    ACPI / hotplug: Fix PCI host bridge hot removal
    ACPI / hotplug: Fix acpi_bus_get_device() return value check
    cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
    ACPI / video: clean up DMI table for initial black screen problem
    ACPI / EC: Ensure lock is acquired before accessing ec struct members
    PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
    ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac
    spi: Use stable dev_name for ACPI enumerated SPI slaves
    i2c: Use stable dev_name for ACPI enumerated I2C slaves
    ACPI: Provide acpi_dev_name accessor for struct acpi_device device name
    ACPI / bind: Use (put|get)_device() on ACPI device objects too
    ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro
    ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node
    cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
    PM / Runtime: Fix error path for prepare
    PM / Runtime: Update documentation around probe|remove|suspend
    cpufreq: conservative: set requested_freq to policy max when it is over policy max
    ...

    Linus Torvalds
     
  • Pull slave-dmaengine changes from Vinod Koul:
    "This brings for slave dmaengine:

    - Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as
    dmaengine can only transfer and not verify validity of dma
    transfers

    - Bunch of fixes across drivers:

    - cppi41 driver fixes from Daniel

    - 8 channel freescale dma engine support and updated bindings from
    Hongbo

    - mxs-dma fixes and cleanup by Markus

    - DMAengine updates from Dan:

    - Bartlomiej and Dan finalized a rework of the dma address unmap
    implementation.

    - In the course of testing the unmap rework, a collection of
    enhancements to dmatest fell out. Notably basic performance
    statistics, and fixed / enhanced test control through new module
    parameters 'run', 'wait', 'noverify', and 'verbose'. Thanks to
    Andriy and Linus [Walleij] for their review.

    - Testing the raid related corner cases of the unmap rework triggered
    bugs in the recently added 16-source operation support in the
    ioatdma driver.

    - Some minor fixes / cleanups to mv_xor and ioatdma"

    * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits)
    dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers
    dma: mv_xor: Remove unneeded NULL address check
    ioat: fix ioat3_irq_reinit
    ioat: kill msix_single_vector support
    raid6test: add new corner case for ioatdma driver
    ioatdma: clean up sed pool kmem_cache
    ioatdma: fix selection of 16 vs 8 source path
    ioatdma: fix sed pool selection
    ioatdma: Fix bug in selftest after removal of DMA_MEMSET.
    dmatest: verbose mode
    dmatest: convert to dmaengine_unmap_data
    dmatest: add a 'wait' parameter
    dmatest: add basic performance metrics
    dmatest: add support for skipping verification and random data setup
    dmatest: use pseudo random numbers
    dmatest: support xor-only, or pq-only channels in tests
    dmatest: restore ability to start test at module load and init
    dmatest: cleanup redundant "dmatest: " prefixes
    dmatest: replace stored results mechanism, with uniform messages
    Revert "dmatest: append verify result to results"
    ...

    Linus Torvalds
     
  • Pull block IO fixes from Jens Axboe:
    "Normally I'd defer my initial for-linus pull request until after the
    merge window, but a race was uncovered in the virtio-blk conversion to
    blk-mq that could cause hangs. So here's a small collection of fixes
    for you to pull:

    - The fix for the virtio-blk IO hang reported by Dave Chinner, from
    Shaohua and myself.

    - Add the Insert blktrace event for blk-mq. This makes 'btt' happy
    when it is doing its state transition analysis.

    - Ensure that blk-mq has disk/partition stats enabled by default,
    instead of making it opt-in.

    - A fix for __bio_add_page() and large sector counts"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: add blktrace insert event trace
    virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
    blk-mq: ensure that we set REQ_IO_STAT so diskstats work
    bio: fix argument of __bio_add_page() for max_sectors > 0xffff

    Linus Torvalds
     
  • Pull md update from Neil Brown:
    "Mostly optimisations and obscure bug fixes.
    - raid5 gets less lock contention
    - raid1 gets less contention between normal-io and resync-io during
    resync"

    * tag 'md/3.13' of git://neil.brown.name/md:
    md/raid5: Use conf->device_lock protect changing of multi-thread resources.
    md/raid5: Before freeing old multi-thread worker, it should flush them.
    md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
    UAPI: include in linux/raid/md_p.h
    raid1: Rewrite the implementation of iobarrier.
    raid1: Add some macros to make code clearly.
    raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
    raid1: Add a field array_frozen to indicate whether raid in freeze state.
    md: Convert use of typedef ctl_table to struct ctl_table
    md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
    md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
    md: fix some places where mddev_lock return value is not checked.
    raid5: Retry R5_ReadNoMerge flag when hit a read error.
    raid5: relieve lock contention in get_active_stripe()
    raid5: relieve lock contention in get_active_stripe()
    wait: add wait_event_cmd()
    md/raid5.c: add proper locking to error path of raid5_start_reshape.
    md: fix calculation of stacking limits on level change.
    raid5: Use slow_path to release stripe when mddev->thread is null

    Linus Torvalds
     

20 Nov, 2013

10 commits

  • Pull networking fixes from David Miller:
    "Mostly these are fixes for fallout due to merge window changes, as
    well as cures for problems that have been with us for a much longer
    period of time"

    1) Johannes Berg noticed two major deficiencies in our genetlink
    registration. Some genetlink protocols were passing in constant
    counts for their ops array rather than something like
    ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
    using fixed IDs for their multicast groups.

    We have to retain these fixed IDs to keep existing userland tools
    working, but reserve them so that other multicast groups used by
    other protocols can not possibly conflict.

    In dealing with these two problems, we actually now use less state
    management for genetlink operations and multicast groups.

    2) When configuring interface hardware timestamping, fix several
    drivers that simply do not validate that the hwtstamp_config value
    is one the driver actually supports. From Ben Hutchings.

    3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.

    4) In dev_forward_skb(), set the skb->protocol in the right order
    relative to skb_scrub_packet(). From Alexei Starovoitov.

    5) Bridge erroneously fails to use the proper wrapper functions to make
    calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
    Makita.

    6) When detaching a bridge port, make sure to flush all VLAN IDs to
    prevent them from leaking, also from Toshiaki Makita.

    7) Put in a compromise for TCP Small Queues so that deep queued devices
    that delay TX reclaim non-trivially don't have such a performance
    decrease. One particularly problematic area is 802.11 AMPDU in
    wireless. From Eric Dumazet.

    8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
    here. Fix from Eric Dumazet, reported by Dave Jones.

    9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.

    10) When computing mergeable buffer sizes, virtio-net fails to take the
    virtio-net header into account. From Michael Dalton.

    11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
    bumping, this one has been with us for a while. From Eric Dumazet.

    12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
    Hugne.

    13) 6lowpan bit used for traffic classification was wrong, from Jukka
    Rissanen.

    14) macvlan has the same issue as normal vlans did wrt. propagating LRO
    disabling down to the real device, fix it the same way. From Michal
    Kubecek.

    15) CPSW driver needs to soft reset all slaves during suspend, from
    Daniel Mack.

    16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.

    17) The xen-netfront RX buffer refill timer isn't properly scheduled on
    partial RX allocation success, from Ma JieYue.

    18) When ipv6 ping protocol support was added, the AF_INET6 protocol
    initialization cleanup path on failure was borked a little. Fix
    from Vlad Yasevich.

    19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
    blocks we can do the wrong thing with the msg_name we write back to
    userspace. From Hannes Frederic Sowa. There is another fix in the
    works from Hannes which will prevent future problems of this nature.

    20) Fix route leak in VTI tunnel transmit, from Fan Du.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
    genetlink: make multicast groups const, prevent abuse
    genetlink: pass family to functions using groups
    genetlink: add and use genl_set_err()
    genetlink: remove family pointer from genl_multicast_group
    genetlink: remove genl_unregister_mc_group()
    hsr: don't call genl_unregister_mc_group()
    quota/genetlink: use proper genetlink multicast APIs
    drop_monitor/genetlink: use proper genetlink multicast APIs
    genetlink: only pass array to genl_register_family_with_ops()
    tcp: don't update snd_nxt, when a socket is switched from repair mode
    atm: idt77252: fix dev refcnt leak
    xfrm: Release dst if this dst is improper for vti tunnel
    netlink: fix documentation typo in netlink_set_err()
    be2net: Delete secondary unicast MAC addresses during be_close
    be2net: Fix unconditional enabling of Rx interface options
    net, virtio_net: replace the magic value
    ping: prevent NULL pointer dereference on write to msg_name
    bnx2x: Prevent "timeout waiting for state X"
    bnx2x: prevent CFC attention
    bnx2x: Prevent panic during DMAE timeout
    ...

    Linus Torvalds
     
  • Register generic netlink multicast groups as an array with
    the family and give them contiguous group IDs. Then instead
    of passing the global group ID to the various functions that
    send messages, pass the ID relative to the family - for most
    families that's just 0 because they only have one group.

    This avoids the list_head and ID in each group, adding a new
    field for the mcast group ID offset to the family.

    At the same time, this allows us to prevent abusing groups
    again like the quota and dropmon code did, since we can now
    check that a family only uses a group it owns.
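
    A minimal sketch of the family-relative group ID scheme described
    above, assuming a simplified family structure. The names demo_family,
    mcgrp_offset, n_mcgrps and demo_global_group() are hypothetical and
    chosen for illustration; this is not the kernel's genetlink code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a generic netlink family. */
struct demo_family {
    const char *name;
    unsigned int mcgrp_offset; /* global ID of the family's first group */
    unsigned int n_mcgrps;     /* number of groups the family registered */
};

/* True if the family-relative group index belongs to this family. */
static bool demo_owns_group(const struct demo_family *fam, unsigned int group)
{
    return group < fam->n_mcgrps;
}

/*
 * Translate a family-relative group index into a global group ID.
 * Returns 0 on success and -1 if the family does not own the group.
 */
static int demo_global_group(const struct demo_family *fam,
                             unsigned int group, unsigned int *global)
{
    assert(fam != NULL && global != NULL);
    if (!demo_owns_group(fam, group))
        return -1;
    *global = fam->mcgrp_offset + group;
    return 0;
}
```

    Because groups are contiguous, one offset per family is enough to
    both compute the global ID and reject a group the family never
    registered.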

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • This doesn't really change anything, but prepares for the
    next patch that will change the APIs to pass the group ID
    within the family, rather than the global group ID.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Add a static inline to generic netlink to wrap netlink_set_err()
    to make it easier to use here - use it in openvswitch (the only
    generic netlink user of netlink_set_err()).

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • There's no reason to have the family pointer there since it
    can just be passed internally where needed, so remove it.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • There are no users of this API remaining, and we'll soon
    change group registration to be static (like ops are now).

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • The quota code is abusing the genetlink API and is using
    its family ID as the multicast group ID, which is invalid
    and may belong to somebody else (and likely will.)

    Make the quota code use the correct API, but since this
    is already used as-is by userspace, reserve a family ID
    for this code and also reserve that group ID to not break
    userspace assumptions.

    Acked-by: Jan Kara
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • As suggested by David Miller, make genl_register_family_with_ops()
    a macro and pass only the array, evaluating ARRAY_SIZE() in the
    macro, this is a little safer.

    The openvswitch code has some indirection, so assign ops/n_ops directly in
    that code. This might ultimately just assign the pointers in the
    family initializations, saving the struct genl_family_and_ops and
    code (once mcast groups are handled differently.)
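
    The idea of passing only the array and letting a macro evaluate the
    element count can be sketched as follows. All demo_* names and
    DEMO_ARRAY_SIZE are hypothetical stand-ins, not the real genetlink
    definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a true array (not valid for pointers). */
#define DEMO_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct demo_op {
    int cmd;
};

struct demo_family {
    const struct demo_op *ops;
    size_t n_ops;
};

/* Function form: the caller must supply a count that matches the array. */
static int demo_register_with_count(struct demo_family *fam,
                                    const struct demo_op *ops, size_t n_ops)
{
    assert(fam != NULL && ops != NULL);
    fam->ops = ops;
    fam->n_ops = n_ops;
    return 0;
}

/* Macro form: takes the array itself, so the count cannot drift. */
#define demo_register_with_ops(fam, ops) \
    demo_register_with_count((fam), (ops), DEMO_ARRAY_SIZE(ops))

/* Example ops table; adding an entry updates the count automatically. */
static const struct demo_op demo_ops[] = { { 1 }, { 2 }, { 3 } };
```

    With this shape, a call such as demo_register_with_ops(&fam, demo_ops)
    records three ops without the caller ever writing the number 3.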

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Pull irq cleanups from Ingo Molnar:
    "This is a multi-arch cleanup series from Thomas Gleixner, which we
    kept to near the end of the merge window, to not interfere with
    architecture updates.

    This series (motivated by the -rt kernel) unifies more aspects of IRQ
    handling and generalizes PREEMPT_ACTIVE"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    preempt: Make PREEMPT_ACTIVE generic
    sparc: Use preempt_schedule_irq
    ia64: Use preempt_schedule_irq
    m32r: Use preempt_schedule_irq
    hardirq: Make hardirq bits generic
    m68k: Simplify low level interrupt handling code
    genirq: Prevent spurious detection for unconditionally polled interrupts

    Linus Torvalds
     
  • If disk stats are enabled on the queue, a request needs to
    be marked with REQ_IO_STAT for accounting to be active on
    that request. This fixes an issue with virtio-blk not
    showing up in /proc/diskstats after the conversion to
    blk-mq.

    Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group
    completion on by default.
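
    As a rough illustration of the rule above (a request is only
    accounted when it carries the stat flag), here is a sketch with
    hypothetical flag values and structures. The demo_* names are
    assumptions and do not reproduce the block layer's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real kernel values are not reproduced here. */
#define DEMO_QUEUE_FLAG_IO_STAT (1u << 0) /* stats enabled on the queue */
#define DEMO_REQ_IO_STAT        (1u << 0) /* account this request */

struct demo_queue   { unsigned int flags; };
struct demo_request { unsigned int cmd_flags; };

/* Set up a new request, inheriting the accounting flag from its queue. */
static void demo_init_request(const struct demo_queue *q,
                              struct demo_request *rq)
{
    assert(q != NULL && rq != NULL);
    rq->cmd_flags = 0;
    if (q->flags & DEMO_QUEUE_FLAG_IO_STAT)
        rq->cmd_flags |= DEMO_REQ_IO_STAT;
}

/* Accounting code would only count requests that carry the flag. */
static bool demo_request_is_accounted(const struct demo_request *rq)
{
    return (rq->cmd_flags & DEMO_REQ_IO_STAT) != 0;
}
```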

    Reported-by: Dave Chinner
    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Nov, 2013

6 commits

  • linux/raid/md_p.h uses conditionals that depend on endianness and fails
    with an error if none of __BIG_ENDIAN, __LITTLE_ENDIAN or
    __BYTE_ORDER is defined, but it doesn't include any header that
    defines these constants. This makes the header unusable on its own.

    This patch adds a #include at the beginning of this
    header to make it usable alone. This is needed to compile klibc on MIPS.

    Signed-off-by: Aurelien Jarno
    Signed-off-by: NeilBrown

    Aurelien Jarno
     
  • Pull i2c changes from Wolfram Sang:
    - new drivers for exynos5, bcm kona, and st micro
    - bigger overhauls for drivers mxs and rcar
    - typical driver bugfixes, cleanups, improvements
    - got rid of the superfluous 'driver' member in the i2c_client struct.
    This touches a few drivers in other subsystems. All acked.

    * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (38 commits)
    i2c: bcm-kona: fix error return code in bcm_kona_i2c_probe()
    i2c: i2c-eg20t: do not print error message in syslog if no ACK received
    i2c: bcm-kona: Introduce Broadcom I2C Driver
    i2c: cbus-gpio: Fix device tree binding
    i2c: wmt: add missing clk_disable_unprepare() on error
    i2c: designware: add new ACPI IDs
    i2c: i801: Add Device IDs for Intel Wildcat Point-LP PCH
    i2c: exynos5: Remove incorrect clk_disable_unprepare
    i2c: i2c-st: Add ST I2C controller
    i2c: exynos5: add High Speed I2C controller driver
    i2c: rcar: fixup rcar type naming
    i2c: scmi: remove some bogus NULL checks
    i2c: sh_mobile & rcar: Enable the driver on all ARM platforms
    i2c: sh_mobile: Convert to clk_prepare/unprepare
    i2c: mux: gpio: use reg value for i2c_add_mux_adapter
    i2c: mux: gpio: use gpio_set_value_cansleep()
    i2c: Include linux/of.h header
    i2c: mxs: Fix PIO mode on i.MX23
    i2c: mxs: Rework the PIO mode operation
    i2c: mxs: distinguish i.MX23 and i.MX28 based I2C controller
    ...

    Linus Torvalds
     
  • Pull infiniband/rdma updates from Roland Dreier:
    - Re-enable flow steering verbs with new improved userspace ABI
    - Fixes for slow connection due to GID lookup scalability
    - IPoIB fixes
    - Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
    - Further improvements to SRP error handling
    - Add new transport type for Cisco usNIC

    * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
    IB/core: Re-enable create_flow/destroy_flow uverbs
    IB/core: extended command: an improved infrastructure for uverbs commands
    IB/core: Remove ib_uverbs_flow_spec structure from userspace
    IB/core: Use a common header for uverbs flow_specs
    IB/core: Make uverbs flow structure use names like verbs ones
    IB/core: Rename 'flow' structs to match other uverbs structs
    IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
    IB/ucma: Convert use of typedef ctl_table to struct ctl_table
    IB/cm: Convert to using idr_alloc_cyclic()
    IB/mlx5: Fix page shift in create CQ for userspace
    IB/mlx4: Fix device max capabilities check
    IB/mlx5: Fix list_del of empty list
    IB/mlx5: Remove dead code
    IB/core: Encorce MR access rights rules on kernel consumers
    IB/mlx4: Fix endless loop in resize CQ
    RDMA/cma: Remove unused argument and minor dead code
    RDMA/ucma: Discard events for IDs not yet claimed by user space
    IB/core: Add Cisco usNIC rdma node and transport types
    RDMA/nes: Remove self-assignment from nes_query_qp()
    IB/srp: Report receive errors correctly
    ...

    Linus Torvalds
     
  • Pull battery updates from Anton Vorontsov:
    "Highlights:
    - A new driver for TI BQ24735 Battery Chargers, courtesy of NVidia.
    - Device tree bindings for TWL4030 chips.
    - Random fixes and cleanups"

    * tag 'for-v3.13' of git://git.infradead.org/battery-2.6:
    pm2301-charger: Remove unneeded NULL checks
    twl4030_charger: Add devicetree support
    power_supply: Fix documentation for TEMP_*ALERT* properties
    max17042_battery: Support regmap to access device's registers
    max17042_battery: Use SIMPLE_DEV_PM_OPS
    charger-manager : Replace kzalloc to devm_kzalloc and remove uneccessary code
    bq2415x_charger: Fix max battery regulation voltage
    tps65090-charger: Use "IS_ENABLED(CONFIG_OF)" for DT code
    tps65090-charger: Drop devm_free_irq of devm_ allocated irq
    power_supply: Add support for bq24735 charger
    pm2301-charger: Staticize pm2xxx_charger_die_therm_mngt
    pm2301-charger: Check return value of regulator_enable
    ab8500-charger: Remove redundant break
    ab8500-charger: Check return value of regulator_enable
    isp1704_charger: Fix driver to work with changes introduced in v3.5

    Linus Torvalds
     
  • Pull media updates from Mauro Carvalho Chehab:
    "This series includes:
    - a new Remote Controller driver for ST SoC with the corresponding DT
    bindings
    - a new frontend (cx24117)
    - a new I2C camera flash driver (lm3560)
    - a new mem2mem driver for TI SoC (ti-vpe)
    - support for Raphael r828d added to r820t driver
    - some improvements on buffer allocation at VB2 core
    - usual driver fixes and improvements

    PS: this time, we have a smaller number of patches. While it is hard
    to pinpoint the reasons, I believe that it is mainly due to:

    1) there are several patch series ready, but depending on DT review.
    I decided to grant some extra time for DT maintainers to look at
    them, as they're expecting to have more time with the changes agreed
    during ARM mini-summit and KS. If they can't review in time for
    3.14, I'll review myself and apply for the next merge window.

    2) I suspect that having both LinuxCon EU and LinuxCon NA happening
    during the same merge window affected the development
    productivity, as several core media developers participated in
    both events"

    * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (151 commits)
    [media] media: st-rc: Add ST remote control driver
    [media] gpio-ir-recv: Include linux/of.h header
    [media] tvp7002: Include linux/of.h header
    [media] tvp514x: Include linux/of.h header
    [media] ths8200: Include linux/of.h header
    [media] adv7343: Include linux/of.h header
    [media] v4l: Fix typo in v4l2_subdev_get_try_crop()
    [media] media: i2c: add driver for dual LED Flash, lm3560
    [media] rtl28xxu: add 15f4:0131 Astrometa DVB-T2
    [media] rtl28xxu: add RTL2832P + R828D support
    [media] rtl2832: add new tuner R828D
    [media] r820t: add support for R828D
    [media] media/i2c: ths8200: fix build failure with gcc 4.5.4
    [media] Add support for KWorld UB435-Q V2
    [media] staging/media: fix msi3101 build errors
    [media] ddbridge: Remove casting the return value which is a void pointer
    [media] ngene: Remove casting the return value which is a void pointer
    [media] dm1105: remove unneeded not-null test
    [media] sh_mobile_ceu_camera: remove deprecated IRQF_DISABLED
    [media] media: rcar_vin: Add preliminary r8a7790 support
    ...

    Linus Torvalds
     
  • Pull MMC updates from Chris Ball:
    "MMC highlights for 3.13:

    Core:
    - Improve runtime PM support, remove mmc_{suspend,resume}_host().
    - Add MMC_CAP_RUNTIME_RESUME, for delaying MMC resume until we're
    outside of the resume sequence (in runtime_resume) to decrease
    system resume time.

    Drivers:
    - dw_mmc: Support HS200 mode.
    - sdhci-esdhc-imx: Support SD3.0 SDR clock tuning, DDR on IMX6.
    - sdhci-pci: Add support for Intel Clovertrail and Merrifield"

    * tag 'mmc-updates-for-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (108 commits)
    mmc: wbsd: Silence compiler warning
    mmc: core: Silence compiler warning in __mmc_switch
    mmc: sh_mmcif: Convert to clk_prepare|unprepare
    mmc: sh_mmcif: Convert to PM macros when defining dev_pm_ops
    mmc: dw_mmc: exynos: Revert the sdr_timing assignment
    mmc: sdhci: Avoid needless loop while handling SDIO interrupts in sdhci_irq
    mmc: core: Add MMC_CAP_RUNTIME_RESUME to resume at runtime_resume
    mmc: core: Improve runtime PM support during suspend/resume for sd/mmc
    mmc: core: Remove redundant mmc_power_up|off at runtime callbacks
    mmc: Don't force card to active state when entering suspend/shutdown
    MIPS: db1235: Don't use MMC_CLKGATE
    mmc: core: Remove deprecated mmc_suspend|resume_host APIs
    mmc: mmci: Move away from using deprecated APIs
    mmc: via-sdmmc: Move away from using deprecated APIs
    mmc: tmio: Move away from using deprecated APIs
    mmc: sh_mmcif: Move away from using deprecated APIs
    mmc: sdricoh_cs: Move away from using deprecated APIs
    mmc: rtsx: Remove redundant suspend and resume callbacks
    mmc: wbsd: Move away from using deprecated APIs
    mmc: pxamci: Remove redundant suspend and resume callbacks
    ...

    Linus Torvalds
     

18 Nov, 2013

4 commits

  • …s', 'ocrdma', 'qib' and 'srp' into for-next

    Roland Dreier
     
  • This commit reverts commit 7afbddfae993 ("IB/core: Temporarily disable
    create_flow/destroy_flow uverbs"). Since the uverbs extensions
    functionality was experimental for v3.12, this patch re-enables the
    support for them and flow-steering for v3.13.

    Signed-off-by: Matan Barak
    Signed-off-by: Roland Dreier

    Matan Barak
     
  • Commit 400dbc96583f ("IB/core: Infrastructure for extensible uverbs
    commands") added an infrastructure for extensible uverbs commands
    while later commit 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow
    through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
    using this new infrastructure.

    According to the commit 400dbc96583f, the purpose of this
    infrastructure is to support passing around provider (eg. hardware)
    specific buffers when userspace issues commands to the kernel, so that
    it would be possible to extend uverbs (eg. core) buffers independently
    from the provider buffers.

    But the new kernel command function prototypes were not modified to
    take advantage of this extension. This issue was exposed by Roland
    Dreier in a previous review[1].

    So the following patch is an attempt at a revised extensible command
    infrastructure.

    This improved extensible command infrastructure distinguishes the
    core (eg. legacy) command/response buffers from the provider
    (eg. hardware) command/response buffers: each extended command
    implementing function is given a struct ib_udata to hold core
    (eg. uverbs) input and output buffers, and another struct ib_udata to
    hold the hw (eg. provider) input and output buffers.

    Having those buffers identified separately makes it easier to increase
    one buffer to support extension without having to add some code to
    guess the exact size of each command/response part. This should make
    the extended functions more reliable.

    Additionally, instead of relying on the command identifier being greater
    than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure relies on
    unused bits in the command field: of the 32 bits provided by the command
    field, only 6 bits are really needed to encode the identifier of
    commands currently supported by the kernel. (Even using only 6 bits
    leaves room for about 23 new commands).

    So this patch makes use of some high order bits in command field to
    store flags, leaving enough room for more command identifiers than one
    will ever need (eg. 256).

    The new flags are used to specify if the command should be processed
    as an extended one or a legacy one. While designing the new command
    format, care was taken to make usage of flags itself extensible.

    Using high order bits of the command field ensures that newer
    libibverbs on older kernel will properly fail when trying to call
    extended commands. On the other hand, older libibverbs on newer kernel
    will never be able to issue calls to extended commands.
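
    The split of the 32-bit command field into flags and identifier can
    be sketched like this. The masks, the shift and the DEMO_* and demo_*
    names are assumptions made for illustration (8 high-order bits of
    flags, 8 low-order bits of identifier, the rest reserved as zero);
    they are not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only; these masks are assumptions. */
#define DEMO_CMD_FLAGS_MASK    0xff000000u
#define DEMO_CMD_FLAGS_SHIFT   24
#define DEMO_CMD_ID_MASK       0x000000ffu
#define DEMO_CMD_FLAG_EXTENDED 0x80u

struct demo_cmd {
    uint32_t id;    /* command identifier (low-order bits) */
    uint32_t flags; /* flags taken from the high-order bits */
    bool extended;  /* true if the extended flag is set */
};

/* Returns 0 on success, -1 if reserved bits or unknown flags are set. */
static int demo_decode_command(uint32_t raw, struct demo_cmd *out)
{
    assert(out != NULL);

    /* Bits between the identifier and the flags must be zero. */
    if (raw & ~(DEMO_CMD_FLAGS_MASK | DEMO_CMD_ID_MASK))
        return -1;

    out->flags = (raw & DEMO_CMD_FLAGS_MASK) >> DEMO_CMD_FLAGS_SHIFT;

    /* Reject flags this sketch does not know about. */
    if (out->flags & ~DEMO_CMD_FLAG_EXTENDED)
        return -1;

    out->id = raw & DEMO_CMD_ID_MASK;
    out->extended = (out->flags & DEMO_CMD_FLAG_EXTENDED) != 0;
    return 0;
}
```

    A decoder that predates the flags would treat the whole 32-bit value
    as the identifier, so a value with a high-order flag set would look
    like an unknown command, which is consistent with the failure
    behaviour described above.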

    The extended command header includes the optional response pointer so
    that output buffer length and output buffer pointer are located
    together in the command, allowing proper parameters checking. This
    should make implementing functions easier and safer.

    Additionally, the extended header ensures 64-bit alignment, while making
    all sizes multiples of 8 bytes, extending the maximum buffer size:

                              legacy      extended

    Maximum command buffer:   256KBytes   1024KBytes (512KBytes + 512KBytes)
    Maximum response buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)

    For the purpose of doing proper buffer size accounting, the header
    sizes are no longer taken into account in "in_words".
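
    The maximum sizes quoted above can be reproduced with a little
    arithmetic, assuming each length field is a 16-bit word count, that
    legacy commands count 4-byte words, and that extended commands count
    8-byte words for the core and provider parts separately. Those
    assumptions are inferred from the figures in this message; the
    DEMO_* and demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each length field is a 16-bit word count. */
#define DEMO_MAX_WORDS 65535u

/* Largest byte length expressible with the given word size. */
static uint32_t demo_max_bytes(uint32_t word_size)
{
    assert(word_size == 4u || word_size == 8u);
    return DEMO_MAX_WORDS * word_size;
}

/* Legacy: a single buffer counted in 4-byte words. */
static uint32_t demo_legacy_max_bytes(void)
{
    return demo_max_bytes(4u);
}

/* Extended: core part plus provider part, each in 8-byte words. */
static uint32_t demo_extended_max_bytes(void)
{
    return 2u * demo_max_bytes(8u);
}
```

    Under these assumptions the limits come out just under 256 KBytes for
    the legacy format and just under 1024 KBytes (two parts of just under
    512 KBytes) for the extended one.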

    One of the oddities of the current extensible infrastructure, reading
    the "legacy" command header twice, is fixed by removing the "legacy"
    command header from the extended command header: they are processed as
    two different parts of the command, so memory is read once and
    information is not duplicated. This makes it clear that this is an
    extended command scheme and not a different command scheme.

    The proposed scheme will format input (command) and output (response)
    buffers this way:

    - command:

    legacy header +
    extended header +
    command data (core + hw):

    +----------------------------------------+
    | flags   |      00   00      | command  |
    |     in_words      |     out_words      |
    +----------------------------------------+
    |                response                |
    |                response                |
    | provider_in_words | provider_out_words |
    |                padding                 |
    +----------------------------------------+
    |                                        |
    .                                        .
    .             (in_words * 8)             .
    |                                        |
    +----------------------------------------+
    |                                        |
    .                                        .
    .        (provider_in_words * 8)         .
    |                                        |
    +----------------------------------------+

    - response, if present:

    +----------------------------------------+
    |                                        |
    .                                        .
    .            (out_words * 8)             .
    |                                        |
    +----------------------------------------+
    |                                        |
    .                                        .
    .        (provider_out_words * 8)        .
    |                                        |
    +----------------------------------------+

    The overall design is to ensure that the extensible infrastructure is
    itself extensible while being more reliable, with more input and bounds
    checking.

    Note:

    The unused field in the extended header would be a perfect candidate to
    hold the command "comp_mask" (eg. bit field used to handle
    compatibility). This was suggested by Roland Dreier in a previous
    review[2]. But "comp_mask" field is likely to be present in the uverb
    input and/or provider input, likewise for the response, as noted by
    Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
    header.

    [1]:
    http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com

    [2]:
    http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com

    [3]:
    http://marc.info/?i=525C1149.6000701@mellanox.com

    Signed-off-by: Yann Droneaud
    Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com

    [ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]

    Signed-off-by: Roland Dreier

    Yann Droneaud
     
  • The structure holding any types of flow_spec is of no use to
    userspace. It would be wrong for userspace to do:

    struct ib_uverbs_flow_spec flow_spec;

    flow_spec.type = IB_FLOW_SPEC_TCP;
    flow_spec.size = sizeof(flow_spec);

    Instead, userspace should use the dedicated flow_spec structure for
    - Ethernet : struct ib_uverbs_flow_spec_eth,
    - IPv4 : struct ib_uverbs_flow_spec_ipv4,
    - TCP/UDP : struct ib_uverbs_flow_spec_tcp_udp.

    In other words, struct ib_uverbs_flow_spec is a "virtual" data
    structure that can only be used by the kernel as an alias to the others.

    Signed-off-by: Yann Droneaud
    Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
    Signed-off-by: Roland Dreier

    Yann Droneaud