18 Dec, 2014

8 commits

  • Pull ceph updates from Sage Weil:
    "The big item here is support for inline data for CephFS and for
    message signatures from Zheng. There are also several bug fixes,
    including interrupted flock request handling, 0-length xattrs, mksnap,
    cached readdir results, and a message version compat field. Finally
    there are several cleanups from Ilya, Dan, and Markus.

    Note that there is another series coming soon that fixes some bugs in
    the RBD 'lingering' requests, but it isn't quite ready yet"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
    ceph: fix setting empty extended attribute
    ceph: fix mksnap crash
    ceph: do_sync is never initialized
    libceph: fixup includes in pagelist.h
    ceph: support inline data feature
    ceph: flush inline version
    ceph: convert inline data to normal data before data write
    ceph: sync read inline data
    ceph: fetch inline data when getting Fcr cap refs
    ceph: use getattr request to fetch inline data
    ceph: add inline data to pagecache
    ceph: parse inline data in MClientReply and MClientCaps
    libceph: specify position of extent operation
    libceph: add CREATE osd operation support
    libceph: add SETXATTR/CMPXATTR osd operations support
    rbd: don't treat CEPH_OSD_OP_DELETE as extent op
    ceph: remove unused stringification macros
    libceph: require cephx message signature by default
    ceph: introduce global empty snap context
    ceph: message versioning fixes
    ...

    Linus Torvalds
     
  • allow specifying position of extent operation in multi-operations
    osd request. This is required for cephfs to convert inline data to
    normal data (compare xattr, then write object).

    Signed-off-by: Yan, Zheng
    Reviewed-by: Ilya Dryomov

    Yan, Zheng
     
  • Add CEPH_OSD_OP_CREATE support. Also change libceph to not treat
    CEPH_OSD_OP_DELETE as an extent op and add an assert to that end.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: Yan, Zheng
    Reviewed-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: Yan, Zheng
    Reviewed-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • The session key is required when calculating a message signature. Save the
    session key in the authorizer; this avoids looking up the ticket handler for
    each message.

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • Use kvfree() from linux/mm.h instead, which is identical. Also fix the
    ceph_buffer comment: we will allocate with kmalloc() up to 32k - the
    value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an
    implementation detail so don't mention it at all.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

17 Dec, 2014

2 commits

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     
  • Pull nfsd updates from Bruce Fields:
    "A comparatively quieter cycle for nfsd this time, but still with two
    larger changes:

    - RPC server scalability improvements from Jeff Layton (using RCU
    instead of a spinlock to find idle threads).

    - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
    Schumaker, enabling fallocate on new clients"

    * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
    nfsd4: fix xdr4 count of server in fs_location4
    nfsd4: fix xdr4 inclusion of escaped char
    sunrpc/cache: convert to use string_escape_str()
    sunrpc: only call test_bit once in svc_xprt_received
    fs: nfsd: Fix signedness bug in compare_blob
    sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
    sunrpc: convert to lockless lookup of queued server threads
    sunrpc: fix potential races in pool_stats collection
    sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
    sunrpc: require svc_create callers to pass in meaningful shutdown routine
    sunrpc: have svc_wake_up only deal with pool 0
    sunrpc: convert sp_task_pending flag to use atomic bitops
    sunrpc: move rq_cachetype field to better optimize space
    sunrpc: move rq_splice_ok flag into rq_flags
    sunrpc: move rq_dropme flag into rq_flags
    sunrpc: move rq_usedeferral flag to rq_flags
    sunrpc: move rq_local field to rq_flags
    sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
    nfsd: minor off by one checks in __write_versions()
    sunrpc: release svc_pool_map reference when serv allocation fails
    ...

    Linus Torvalds
     

15 Dec, 2014

1 commit

  • Pull driver core update from Greg KH:
    "Here's the set of driver core patches for 3.19-rc1.

    They are dominated by the removal of the .owner field in platform
    drivers. They touch a lot of files, but they are "simple" changes,
    just removing a line in a structure.

    Other than that, a few minor driver core and debugfs changes. There
    are some ath9k patches coming in through this tree that have been
    acked by the wireless maintainers as they relied on the debugfs
    changes.

    Everything has been in linux-next for a while"

    * tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
    Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
    fs: debugfs: add forward declaration for struct device type
    firmware class: Deletion of an unnecessary check before the function call "vunmap"
    firmware loader: fix hung task warning dump
    devcoredump: provide a one-way disable function
    device: Add dev_<level>_once variants
    ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
    ath: use seq_file api for ath9k debugfs files
    debugfs: add helper function to create device related seq_file
    drivers/base: cacheinfo: remove noisy error boot message
    Revert "core: platform: add warning if driver has no owner"
    drivers: base: support cpu cache information interface to userspace via sysfs
    drivers: base: add cpu_device_create to support per-cpu devices
    topology: replace custom attribute macros with standard DEVICE_ATTR*
    cpumask: factor out show_cpumap into separate helper function
    driver core: Fix unbalanced device reference in drivers_probe
    driver core: fix race with userland in device_add()
    sysfs/kernfs: make read requests on pre-alloc files use the buffer.
    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
    fs: sysfs: return EGBIG on write if offset is larger than file size
    ...

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • Pull crypto update from Herbert Xu:
    - The crypto API is now documented :)
    - Disallow arbitrary module loading through crypto API.
    - Allow get request with empty driver name through crypto_user.
    - Allow speed testing of arbitrary hash functions.
    - Add caam support for ctr(aes), gcm(aes) and their derivatives.
    - nx now supports concurrent hashing properly.
    - Add sahara support for SHA1/256.
    - Add ARM64 version of CRC32.
    - Misc fixes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
    crypto: tcrypt - Allow speed testing of arbitrary hash functions
    crypto: af_alg - add user space interface for AEAD
    crypto: qat - fix problem with coalescing enable logic
    crypto: sahara - add support for SHA1/256
    crypto: sahara - replace tasklets with kthread
    crypto: sahara - add support for i.MX53
    crypto: sahara - fix spinlock initialization
    crypto: arm - replace memset by memzero_explicit
    crypto: powerpc - replace memset by memzero_explicit
    crypto: sha - replace memset by memzero_explicit
    crypto: sparc - replace memset by memzero_explicit
    crypto: algif_skcipher - initialize upon init request
    crypto: algif_skcipher - removed unneeded code
    crypto: algif_skcipher - Fixed blocking recvmsg
    crypto: drbg - use memzero_explicit() for clearing sensitive data
    crypto: drbg - use MODULE_ALIAS_CRYPTO
    crypto: include crypto- module prefix in template
    crypto: user - add MODULE_ALIAS
    crypto: sha-mb - remove a bogus NULL check
    crytpo: qat - Fix 64 bytes requests
    ...

    Linus Torvalds
     

12 Dec, 2014

6 commits

  • This patch addresses an issue with the level compression of the fib_trie.
    Specifically in the case of adding a new leaf that triggers a new node to
    be added that takes the place of the old node. The result is a trie where
    the 1 child tnode is on one side and one leaf is on the other which gives
    you a very deep trie. Below is the script I used to generate a trie on
    dummy0 with a 10.X.X.X family of addresses.

    ip link add type dummy
    ipval=184549374
    bit=2
    for i in `seq 1 23`
    do
    ifconfig dummy0:$bit $ipval/8
    ipval=`expr $ipval - $bit`
    bit=`expr $bit \* 2`
    done
    cat /proc/net/fib_triestat

    Running the script before the patch:

    Local:
    Aver depth: 10.82
    Max depth: 23
    Leaves: 29
    Prefixes: 30
    Internal nodes: 27
    1: 26 2: 1
    Pointers: 56
    Null ptrs: 1
    Total size: 5 kB

    After applying the patch and repeating:

    Local:
    Aver depth: 4.72
    Max depth: 9
    Leaves: 29
    Prefixes: 30
    Internal nodes: 12
    1: 3 2: 2 3: 7
    Pointers: 70
    Null ptrs: 30
    Total size: 4 kB

    What this fix does is start the rebalance at the newly created tnode
    instead of at the parent tnode. This way if there is a gap between the
    parent and the new node it doesn't prevent the new tnode from being
    coalesced with any pre-existing nodes that may have been pushed into one
    of the new node's child branches.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Since the real device can segment packets by software, a vlan device
    can set TSO/UFO even when the real device doesn't have those features.
    Unlike GSO, this allows packets to be segmented after Qdisc.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • In case we cannot attach to our slave netdevice PHY, error out and
    propagate that error up to the caller: dsa_slave_create().

    Fixes: 0d8bcdd383b8 ("net: dsa: allow for more complex PHY setups")
    Signed-off-by: Andrey Volkov
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • In case there is no PHY at the designated address on the internal
    switch, we would basically de-reference a null pointer here:

    dsa_slave_phy_setup(...)
    {
    p->phy = ds->slave_mii_bus->phy_map[p->port];
    phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
    ^------

    This can be triggered when the platform configuration (platform_data or
    Device Tree) indicates there should be a PHY device at this address, but
    the HW is non-responsive, such that we cannot attach a PHY device at
    this specific location.

    Fix this by checking the return value prior to calling
    phy_connect_direct().

    CC: Andrew Lunn
    Fixes: b31f65fb4383 ("net: dsa: slave: Fix autoneg for phys on switch MDIO bus")
    Reported-by: Brian Norris
    Signed-off-by: Andrey Volkov
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
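
    The fix described in the commit above can be pictured with a small
    user-space sketch. Everything prefixed with mock_ or MOCK_ is a
    hypothetical stand-in, not the DSA or phylib API, and the choice of
    -ENODEV as the error code is an assumption made for illustration. The
    point is only that the looked-up pointer is checked before it is handed to
    a function that dereferences it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_PHY_MAX_ADDR 32

/* Hypothetical stand-ins for the kernel's PHY device and MDIO bus. */
struct mock_phy {
    int attached;
};

struct mock_mii_bus {
    struct mock_phy *phy_map[MOCK_PHY_MAX_ADDR];
};

/* Stand-in for the connect call: it dereferences phy unconditionally,
 * so the caller must never pass NULL. */
static int mock_phy_connect(struct mock_phy *phy)
{
    phy->attached = 1;
    return 0;
}

/* Look up the PHY for a port and bail out with an error if the bus has
 * no device at that address, instead of passing NULL further down. */
static int mock_slave_phy_setup(struct mock_mii_bus *bus, int port)
{
    struct mock_phy *phy;

    assert(bus != NULL);
    if (port < 0 || port >= MOCK_PHY_MAX_ADDR)
        return -EINVAL;

    phy = bus->phy_map[port];
    if (!phy)
        return -ENODEV; /* assumed error code, for illustration only */

    return mock_phy_connect(phy);
}
```

    A caller in this sketch would propagate the negative return value
    upwards, which is the behaviour the neighbouring commit adds for
    dsa_slave_create().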
     
  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

    2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to resolve the conflicting goals between the
    desired handling of bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU on which a packet was
    received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...

    Linus Torvalds
     
  • Pull virtio updates from Michael Tsirkin:
    "virtio: virtio 1.0 support, misc patches

    This adds a lot of infrastructure for virtio 1.0 support. Notable
    missing pieces: virtio pci, virtio balloon (needs spec extension),
    vhost scsi.

    Plus, there are some minor fixes in a couple of places.

    Note: some net drivers are affected by these patches. David said he's
    fine with merging these patches through my tree.

    Rusty's on vacation, he acked using my tree for these, too"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (70 commits)
    virtio_ccw: finalize_features error handling
    virtio_ccw: future-proof finalize_features
    virtio_pci: rename virtio_pci -> virtio_pci_common
    virtio_pci: update file descriptions and copyright
    virtio_pci: split out legacy device support
    virtio_pci: setup config vector indirectly
    virtio_pci: setup vqs indirectly
    virtio_pci: delete vqs indirectly
    virtio_pci: use priv for vq notification
    virtio_pci: free up vq->priv
    virtio_pci: fix coding style for structs
    virtio_pci: add isr field
    virtio: drop legacy_only driver flag
    virtio_balloon: drop legacy_only driver flag
    virtio_ccw: rev 1 devices set VIRTIO_F_VERSION_1
    virtio: allow finalize_features to fail
    virtio_ccw: legacy: don't negotiate rev 1/features
    virtio: add API to detect legacy devices
    virtio_console: fix sparse warnings
    vhost: remove unnecessary forward declarations in vhost.h
    ...

    Linus Torvalds
     

11 Dec, 2014

20 commits

  • Pull ACPI and power management updates from Rafael Wysocki:
    "This time we have some more new material than we used to have during
    the last couple of development cycles.

    The most important part of it to me is the introduction of a unified
    interface for accessing device properties provided by platform
    firmware. It works with Device Trees and ACPI in a uniform way and
    drivers using it need not worry about where the properties come from
    as long as the platform firmware (either DT or ACPI) makes them
    available. It covers both devices and "bare" device node objects
    without struct device representation as that turns out to be necessary
    in some cases. This has been in the works for quite a few months (and
    development cycles) and has been approved by all of the relevant
    maintainers.

    On top of that, some drivers are switched over to the new interface
    (at25, leds-gpio, gpio_keys_polled) and some additional changes are
    made to the core GPIO subsystem to allow device drivers to manipulate
    GPIOs in the "canonical" way on platforms that provide GPIO
    information in their ACPI tables, but don't assign names to GPIO lines
    (in which case the driver needs to do that on the basis of what it
    knows about the device in question). That also has been approved by
    the GPIO core maintainers and the rfkill driver is now going to use
    it.

    Second is support for hardware P-states in the intel_pstate driver.
    It uses CPUID to detect whether or not the feature is supported by the
    processor in which case it will be enabled by default. However, it
    can be disabled entirely from the kernel command line if necessary.

    Next is support for a platform firmware interface based on ACPI
    operation regions used by the PMIC (Power Management Integrated
    Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
    That interface is used for manipulating power resources and for
    thermal management: sensor temperature reporting, trip point setting
    and so on.

    Also the ACPI core is now going to support the _DEP configuration
    information in a limited way. Basically, _DEP is supposed to reflect
    off-the-hierarchy dependencies between devices which may be very
    indirect, like when AML for one device accesses locations in an
    operation region handled by another device's driver (usually, the
    device depended on this way is a serial bus or GPIO controller). The
    support added this time is sufficient to make the ACPI battery driver
    work on Asus T100A, but it is general enough to be able to cover some
    other use cases in the future.

    Finally, we have a new cpufreq driver for the Loongson1B processor.

    In addition to the above, there are fixes and cleanups all over the
    place as usual and a traditional ACPICA update to a recent upstream
    release.

    As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for
    Intel platforms should be able to handle power management of the DMA
    engine correctly, the cpufreq-dt driver should interact with the
    thermal subsystem in a better way and the ACPI backlight driver should
    handle some more corner cases, among other things.

    On top of the ACPICA update there are fixes for race conditions in the
    ACPICA's interrupt handling code which might lead to some random and
    strange looking failures on some systems.

    In the cleanups department the most visible part is the series of
    commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration
    option. That was triggered by a discussion regarding the generic
    power domains code during which we realized that trying to support
    certain combinations of PM config options was painful and not really
    worth it, because nobody would use them in production anyway. For
    this reason, we decided to make CONFIG_PM_SLEEP select
    CONFIG_PM_RUNTIME and that led to the conclusion that the latter
    became redundant and CONFIG_PM could be used instead of it. The
    material here makes that replacement in a major part of the tree, but
    there will be at least one more batch of that in the second part of
    the merge window.

    Specifics:

    - Support for retrieving device properties information from ACPI _DSD
    device configuration objects and a unified device properties
    interface for device drivers (and subsystems) on top of that. As
    stated above, this works with Device Trees and ACPI and allows
    device drivers to be written in a platform firmware (DT or ACPI)
    agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are
    now going to use this new interface and the GPIO subsystem is
    additionally modified to allow device drivers to assign names to
    GPIO resources returned by ACPI _CRS objects (in case _DSD is not
    present or does not provide the expected data). The changes in
    this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron
    Lu, and Darren Hart with some fixes from others (Fabio Estevam,
    Geert Uytterhoeven).

    - Support for Hardware Managed Performance States (HWP) as described
    in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
    driver. CPUID is used to detect whether or not the feature is
    supported by the processor. If supported, it will be enabled
    automatically unless the intel_pstate=no_hwp switch is present in
    the kernel command line. From Dirk Brandewie.

    - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).

    - Support for firmware interface based on ACPI operation regions used
    by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
    platforms for power resource control and thermal management (Aaron
    Lu).

    - Limited support for retrieving off-the-hierarchy dependencies
    between devices from ACPI _DEP device configuration objects and
    deferred probing support for the ACPI battery driver based on the
    _DEP information to make that driver work on Asus T100A (Lan
    Tianyu).

    - New cpufreq driver for the Loongson1B processor (Kelvin Cheung).

    - ACPICA update to upstream revision 20141107 which only affects
    tools (Bob Moore).

    - Fixes for race conditions in the ACPICA's interrupt handling code
    and in the ACPI code related to system suspend and resume (Lv Zheng
    and Rafael J Wysocki).

    - ACPI core fix for an RCU-related issue in the ioremap() regions
    management code that slowed down significantly after CPUs had been
    allowed to enter idle states even if they'd had RCU callbacks
    queued and triggered some problems in certain proprietary graphics
    driver (and elsewhere). The fix replaces synchronize_rcu() in that
    code with synchronize_rcu_expedited() which makes the issue go
    away. From Konstantin Khlebnikov.

    - ACPI LPSS (Low-Power Subsystem) driver fix to handle power
    management of the DMA engine included into the LPSS correctly. The
    problem is that the DMA engine doesn't have ACPI PM support of its
    own and it simply is turned off when the last LPSS device having
    ACPI PM support goes into D3cold. To work around that, the PM
    domain used by the ACPI LPSS driver is redesigned so at least one
    device with ACPI PM support will be on as long as the DMA engine is
    in use. From Andy Shevchenko.

    - ACPI backlight driver fix to avoid using it on "Win8-compatible"
    systems where it doesn't work and where it was used by default by
    mistake (Aaron Lu).

    - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
    Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin
    Chaugule (mostly related to the upcoming ARM64 support).

    - Intel RAPL (Running Average Power Limit) power capping driver fixes
    and improvements including new processor IDs (Jacob Pan).

    - Generic power domains modification to power up domains after
    attaching devices to them to meet the expectations of device
    drivers and bus types assuming devices to be accessible at probe
    time (Ulf Hansson).

    - Preliminary support for controlling device clocks from the generic
    power domains core code and modifications of the ARM/shmobile
    platform to use that feature (Ulf Hansson).

    - Assorted minor fixes and cleanups of the generic power domains core
    code (Ulf Hansson, Geert Uytterhoeven).

    - Assorted minor fixes and cleanups of the device clocks control code
    in the PM core (Geert Uytterhoeven, Grygorii Strashko).

    - Consolidation of device power management Kconfig options by making
    CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
    which is now redundant (Rafael J Wysocki and Kevin Hilman). That
    is the first batch of the changes needed for this purpose.

    - Core device runtime power management support code cleanup related
    to the execution of callbacks (Andrzej Hajda).

    - cpuidle ARM support improvements (Lorenzo Pieralisi).

    - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a
    new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
    Bartlomiej Zolnierkiewicz).

    - New cpufreq driver callback (->ready) to be executed when the
    cpufreq core is ready to use a given policy object and cpufreq-dt
    driver modification to use that callback for cooling device
    registration (Viresh Kumar).

    - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James
    Geboski, Tomeu Vizoso).

    - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
    cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
    Stefan Wahren, Petr Cvek).

    - OPP (Operating Performance Points) framework modification to allow
    OPPs to be removed too and update of a few cpufreq drivers
    (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
    during initialization) on driver removal (Viresh Kumar).

    - Hibernation core fixes and cleanups (Tina Ruchandani and Markus
    Elfring).

    - PM Kconfig fix related to CPU power management (Pankaj Dubey).

    - cpupower tool fix (Prarit Bhargava)"

    * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits)
    i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
    dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    tools: cpupower: fix return checks for sysfs_get_idlestate_count()
    drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
    MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    leds: leds-gpio: Fix multiple instances registration without 'label' property
    iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
    block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
    PM: Merge the SET*_RUNTIME_PM_OPS() macros
    ...

    Linus Torvalds
     
  • 0day robot reported the following crash:
    [ 21.233581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007
    [ 21.234709] IP: [] sk_attach_bpf+0x39/0xc2

    It's due to bpf_prog_get() returning ERR_PTR.
    Check it properly.

    Reported-by: Fengguang Wu
    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
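
    As background for the commit above, here is a minimal user-space sketch
    of the error-pointer idiom. The three helpers mirror the names used in
    the kernel's include/linux/err.h, but they are re-implemented here for
    illustration; mock_prog_get() and mock_attach() are hypothetical and the
    -EBADF value is an assumption. An error pointer is not NULL, so a plain
    NULL test lets it through and the later dereference faults.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO addresses, i.e. encodes an error. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct mock_prog {
    int id;
};

static struct mock_prog the_prog = { .id = 42 };

/* Hypothetical lookup: returns a valid object or an error pointer,
 * never NULL. Only fd 3 is "valid" in this sketch. */
static struct mock_prog *mock_prog_get(int fd)
{
    if (fd != 3)
        return ERR_PTR(-EBADF);
    return &the_prog;
}

/* Hypothetical attach path with the check done properly. */
static int mock_attach(int fd, int *attached_id)
{
    struct mock_prog *prog = mock_prog_get(fd);

    assert(attached_id != NULL);
    /* "if (!prog)" would not catch the failure: error pointers are non-NULL. */
    if (IS_ERR(prog))
        return (int)PTR_ERR(prog);

    *attached_id = prog->id;
    return 0;
}
```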
     
  • Introduce the helper macro for_each_cmsghdr as a wrapper for enumerating
    the cmsghdr entries of a msghdr. This is purely a cleanup.

    Signed-off-by: Gu Zheng
    Signed-off-by: David S. Miller

    Gu Zheng
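
    A user-space analogue of such a helper might look like the sketch below.
    It is built on the standard CMSG_FIRSTHDR()/CMSG_NXTHDR() macros from
    <sys/socket.h>; whether the in-kernel macro has exactly this shape is an
    assumption, and count_cmsgs() is a hypothetical caller added only to show
    the loop in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Iterate over every control message header attached to msg. */
#define for_each_cmsghdr(cmsg, msg)             \
    for ((cmsg) = CMSG_FIRSTHDR(msg);           \
         (cmsg) != NULL;                        \
         (cmsg) = CMSG_NXTHDR((msg), (cmsg)))

/* Hypothetical caller: count the control messages in msg. */
static int count_cmsgs(struct msghdr *msg)
{
    struct cmsghdr *cmsg;
    int n = 0;

    assert(msg != NULL);
    for_each_cmsghdr(cmsg, msg)
        n++;
    return n;
}
```

    The macro replaces the open-coded three-part for loop at each call site,
    which is why the commit describes itself as a cleanup.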
     
  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton: (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • As it is, default ->i_fop has NULL ->open() (along with all other methods).
    The only case where it matters is reopening (via procfs symlink) a file that
    didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
    to something sane (default would fail on read/write/ioctl/etc.).

    Unfortunately, such case exists - alloc_file() users, especially
    anon_get_file() ones. There we have tons of opened files of very different
    kinds sharing the same inode. As the result, attempt to reopen those via
    procfs succeeds and you get a descriptor you can't do anything with.

    Moreover, in case of sockets we set ->i_fop that will only be used
    on such reopen attempts - and put a failing ->open() into it to make sure
    those do not succeed.

    It would be simpler to put such ->open() into default ->i_fop and leave
    it unchanged both for anon inode (as we do anyway) and for socket ones. Result:
    * everything going through do_dentry_open() works as it used to
    * sock_no_open() kludge is gone
    * attempts to reopen anon-inode files fail as they really ought to
    * ditto for aio_private_file()
    * ditto for perfmon - this one actually tried to imitate sock_no_open()
    trick, but failed to set ->i_fop, so in the current tree reopens succeed and
    yield completely useless descriptor. Intent clearly had been to fail with
    -ENXIO on such reopens; now it actually does.
    * everything else that used alloc_file() keeps working - it has ->i_fop
    set for its inodes anyway

    Signed-off-by: Al Viro

    Al Viro
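
    The idea above can be modelled in plain C as a toy sketch: a default
    operations table whose open() always fails, which specific file types
    replace with something useful. All toy_* names and the two demo helpers
    are hypothetical; this is not kernel code, only an illustration of the
    "failing ->open() in the default ->i_fop" arrangement.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;
struct toy_file;

struct toy_file_operations {
    int (*open)(struct toy_inode *inode, struct toy_file *file);
};

struct toy_inode {
    const struct toy_file_operations *i_fop;
};

struct toy_file {
    const struct toy_file_operations *f_op;
};

/* Default open: every reopen attempt through it fails with -ENXIO. */
static int toy_no_open(struct toy_inode *inode, struct toy_file *file)
{
    (void)inode;
    (void)file;
    return -ENXIO;
}

static const struct toy_file_operations toy_default_fops = {
    .open = toy_no_open,
};

/* An ordinary file type that supplies its own working open. */
static int toy_regular_open(struct toy_inode *inode, struct toy_file *file)
{
    (void)inode;
    (void)file;
    return 0;
}

static const struct toy_file_operations toy_regular_fops = {
    .open = toy_regular_open,
};

/* Model of the reopen path: take the methods from the inode, then open. */
static int toy_reopen(struct toy_inode *inode, struct toy_file *file)
{
    assert(inode->i_fop != NULL);
    file->f_op = inode->i_fop;
    if (file->f_op->open)
        return file->f_op->open(inode, file);
    return 0; /* a NULL open would silently succeed */
}

/* Shared anonymous inode left with the default table: reopen must fail. */
static int demo_anon_reopen(void)
{
    struct toy_inode inode = { .i_fop = &toy_default_fops };
    struct toy_file file = { .f_op = NULL };

    return toy_reopen(&inode, &file);
}

/* Inode with its own table: reopen keeps working. */
static int demo_regular_reopen(void)
{
    struct toy_inode inode = { .i_fop = &toy_regular_fops };
    struct toy_file file = { .f_op = NULL };

    return toy_reopen(&inode, &file);
}
```

    In this model, demo_anon_reopen() yields -ENXIO while
    demo_regular_reopen() yields 0, which mirrors the split described in the
    result list.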
     
  • Al Viro
     
  • Memory is internally accounted in bytes, using spinlock-protected 64-bit
    counters, even though the smallest accounting delta is a page. The
    counter interface is also convoluted and does too many things.

    Introduce a new lockless word-sized page counter API, then change all
    memory accounting over to it. The translation from and to bytes then only
    happens when interfacing with userspace.

    The removed locking overhead is noticeable when scaling beyond the per-cpu
    charge caches - on a 4-socket machine with 144 threads, the following test
    shows the performance differences of 288 memcgs concurrently running a
    page fault benchmark:

    vanilla:

    18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
    1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
    24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
    1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
    50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
    stalled-cycles-frontend
    stalled-cycles-backend
    8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
    1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
    1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

    132.474343877 seconds time elapsed ( +- 0.21% )

    lockless:

    12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
    832,850 context-switches # 0.068 K/sec ( +- 0.54% )
    15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
    1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
    32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
    stalled-cycles-frontend
    stalled-cycles-backend
    9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
    2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
    1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

    91.369330729 seconds time elapsed ( +- 0.45% )

    On top of improved scalability, this also gets rid of the icky long long
    types in the very heart of memcg, which is great for 32 bit and also makes
    the code a lot more readable.

    Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become
    page_counter_try_charge() and page_counter_charge() resp. to match
    the more common kernel naming scheme of try_do()/do()

    - res_counter_uncharge_until() is only ever used to cancel a local
    counter and never to uncharge bigger segments of a hierarchy, so
    it's replaced by the simpler page_counter_cancel()

    - res_counter_set_limit() is replaced by page_counter_limit(), which
    expects its callers to serialize against themselves

    - res_counter_memparse_write_strategy() is replaced by
    page_counter_limit(), which rounds down to the nearest page size -
    rather than up. This is more reasonable for explicitly requested
    hard upper limits.

    - to keep charging light-weight, page_counter_try_charge() charges
    speculatively, only to roll back if the result exceeds the limit.
    Because of this, a failing bigger charge can temporarily lock out
    smaller charges that would otherwise succeed. The error is bounded
    to the difference between the smallest and the biggest possible
    charge size, so for memcg, this means that a failing THP charge can
    send base page charges into reclaim up to 2MB (4MB) before the limit
    would have been reached. This should be acceptable.

    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
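
    The speculative charge-then-roll-back scheme described above can be
    sketched in userspace with C11 atomics. This is a simplified model under
    stated assumptions, not the kernel's page_counter code: the pc_sketch
    type, the pc_* functions and pc_demo are hypothetical names, and memory
    ordering is left at the sequentially consistent default.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Word-sized counter in units of pages, with an optional parent. */
struct pc_sketch {
    atomic_long count;
    long limit;
    struct pc_sketch *parent;
};

static void pc_init(struct pc_sketch *c, long limit, struct pc_sketch *parent)
{
    atomic_init(&c->count, 0);
    c->limit = limit;
    c->parent = parent;
}

/* Undo a charge on one level only. */
static void pc_cancel(struct pc_sketch *c, long nr_pages)
{
    long updated = atomic_fetch_sub(&c->count, nr_pages) - nr_pages;

    assert(updated >= 0);
    (void)updated;
}

/*
 * Charge every level speculatively; if one level ends up over its limit,
 * roll back that level and all levels charged before it.
 */
static bool pc_try_charge(struct pc_sketch *counter, long nr_pages,
                          struct pc_sketch **fail)
{
    struct pc_sketch *c;

    for (c = counter; c; c = c->parent) {
        long updated = atomic_fetch_add(&c->count, nr_pages) + nr_pages;

        if (updated > c->limit) {
            atomic_fetch_sub(&c->count, nr_pages);
            *fail = c;
            for (c = counter; c != *fail; c = c->parent)
                pc_cancel(c, nr_pages);
            return false;
        }
    }
    return true;
}

/* Byte limits from userspace are rounded down to whole pages. */
static unsigned long pc_bytes_to_pages(unsigned long long bytes,
                                       unsigned long page_size)
{
    assert(page_size > 0);
    return (unsigned long)(bytes / page_size);
}

/* Child limit 8, parent limit 4: charge 3 pages, then 2 more (must fail). */
static bool pc_demo(long *child_pages, long *parent_pages)
{
    struct pc_sketch parent, child, *fail = NULL;
    bool ok;

    pc_init(&parent, 4, NULL);
    pc_init(&child, 8, &parent);
    ok = pc_try_charge(&child, 3, &fail);
    assert(ok);
    ok = pc_try_charge(&child, 2, &fail);
    *child_pages = atomic_load(&child.count);
    *parent_pages = atomic_load(&parent.count);
    return ok;
}
```

    Between the speculative add and the rollback, a concurrent smaller charge
    can observe the inflated count and fail as well, which is the temporary
    lock-out the last bullet describes.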
     
  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Features:
    - NFSv4.2 client support for hole punching and preallocation.
    - Further RPC/RDMA client improvements.
    - Add more RPC transport debugging tracepoints.
    - Add RPC debugging tools in debugfs.

    Bugfixes:
    - Stable fix for layoutget error handling
    - Fix a change in COMMIT behaviour resulting from the recent io code
    updates"

    * tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
    sunrpc: add a debugfs rpc_xprt directory with an info file in it
    sunrpc: add debugfs file for displaying client rpc_task queue
    nfs: Add DEALLOCATE support
    nfs: Add ALLOCATE support
    NFS: Clean up nfs4_init_callback()
    NFS: SETCLIENTID XDR buffer sizes are incorrect
    SUNRPC: serialize iostats updates
    xprtrdma: Display async errors
    xprtrdma: Enable pad optimization
    xprtrdma: Re-write rpcrdma_flush_cqs()
    xprtrdma: Refactor tasklet scheduling
    xprtrdma: unmap all FMRs during transport disconnect
    xprtrdma: Cap req_cqinit
    xprtrdma: Return an errno from rpcrdma_register_external()
    nfs: define nfs_inc_fscache_stats and using it as possible
    nfs: replace nfs_add_stats with nfs_inc_stats when add one
    NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
    sunrpc: eliminate RPC_TRACEPOINTS
    sunrpc: eliminate RPC_DEBUG
    lockd: eliminate LOCKD_DEBUG
    ...

    Linus Torvalds
     
  • Conflicts:
    drivers/net/ethernet/amd/xgbe/xgbe-desc.c
    drivers/net/ethernet/renesas/sh_eth.c

    Overlapping changes in both conflict cases.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Making things const is a good thing.

    (x86-64 defconfig with all irda)
    $ size net/irda/built-in.o*
    text data bss dec hex filename
    109276 1868 244 111388 1b31c net/irda/built-in.o.new
    108828 2316 244 111388 1b31c net/irda/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • It's better when function pointer arrays aren't modifiable.

    Net change:

    $ size net/llc/built-in.o.*
    text data bss dec hex filename
    61193 12758 1344 75295 1261f net/llc/built-in.o.new
    47113 27030 1344 75487 126df net/llc/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
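
    A small illustration of the pattern, using hypothetical names (handler_fn,
    handlers, dispatch). Placing const after the pointer declarator makes the
    array elements themselves read-only, which lets the toolchain put the
    table in read-only data. The Berkeley-format size output counts that under
    text, which would explain the shift shown above: text grows by 14080 bytes
    while data shrinks by 14272.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

static int handle_inc(int x)
{
    return x + 1;
}

static int handle_dbl(int x)
{
    return x * 2;
}

/*
 * "handler_fn const" means the pointers stored in the array cannot be
 * reassigned at run time, so the table does not need a writable section.
 */
static handler_fn const handlers[] = {
    handle_inc,
    handle_dbl,
};

/* Look up a handler by index and call it. */
static int dispatch(size_t idx, int arg)
{
    assert(idx < sizeof(handlers) / sizeof(handlers[0]));
    return handlers[idx](arg);
}
```

    An assignment such as handlers[0] = handle_dbl is rejected at compile
    time once the array is declared this way.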
     
  • It's better when function pointer arrays aren't modifiable.

    Net change from original:

    $ size net/llc/built-in.o.*
    text data bss dec hex filename
    61065 12886 1344 75295 1261f net/llc/built-in.o.new
    47113 27030 1344 75487 126df net/llc/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • It's better when function pointer arrays aren't modifiable.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • This patch effectively reverts commit 500f80872645 ("net: ovs: use CRC32
    accelerated flow hash if available"), and other remaining arch_fast_hash()
    users such as from nfsd via commit 6282cd565553 ("NFSD: Don't hand out
    delegations for 30 seconds after recalling them.") where it has been used
    as a hash function for bloom filtering.

    While we think that these users are actually not much of concern, it has
    been requested to remove the arch_fast_hash() library bits that arose
    from [1] entirely as per recent discussion [2]. The main argument is that
    using it as a hash may introduce bias due to its linearity (see avalanche
    criterion) and thus makes it less clear (though we tried to document that)
    when this security/performance trade-off is actually acceptable for a
    general purpose library function.

    Let's therefore avoid any further confusion on this matter and remove it to
    prevent any future accidental misuse of it. For the time being, this is
    going to make hashing of flow keys a bit more expensive in the ovs case,
    but future work could reevaluate a different hashing discipline.

    [1] https://patchwork.ozlabs.org/patch/299369/
    [2] https://patchwork.ozlabs.org/patch/418756/

    Cc: Neil Brown
    Cc: Francesco Fusco
    Cc: Jesse Gross
    Cc: Thomas Graf
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
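
    The linearity concern can be demonstrated with a bitwise CRC. The sketch
    below assumes the CRC-32C (Castagnoli) polynomial with a zero initial
    value and no final XOR; the function names are hypothetical and this is
    not the removed arch_fast_hash() code. For equal-length inputs, the CRC
    of a XOR b equals the XOR of the two CRCs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw bitwise CRC-32C: reflected polynomial 0x82F63B78, init 0, no xor-out. */
static uint32_t crc32c_raw(const uint8_t *data, size_t len)
{
    uint32_t crc = 0;
    size_t i;
    int bit;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Returns 1 if crc(a ^ b) == crc(a) ^ crc(b) for two len-byte inputs. */
static int crc_is_linear(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t x[64];
    size_t i;

    assert(len <= sizeof(x));
    for (i = 0; i < len; i++)
        x[i] = a[i] ^ b[i];
    return crc32c_raw(x, len) == (crc32c_raw(a, len) ^ crc32c_raw(b, len));
}
```

    Because of this property, the hash of a modified key depends only on the
    hash of the original and the hash of the difference, so differences that
    hash to zero yield collisions regardless of the key.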
     
  • For netlink, we shouldn't be using arch_fast_hash() as a hashing
    discipline, but rather jhash() instead.

    Since netlink sockets can be opened by any user, a local attacker
    would be able to easily create collisions with the DPDK-derived
    arch_fast_hash(), which trades security for performance by
    using crc32 CPU instructions on x86_64.

    While it might have a legitimate use case in other places, it should
    be avoided in the netlink context. As rhashtable's API is very
    flexible, we could later on still decide on other hashing disciplines,
    if legitimate.

    Reference: http://thread.gmane.org/gmane.linux.kernel/1844123
    Fixes: e341694e3eb5 ("netlink: Convert netlink_lookup() to use RCU protected hash table")
    Cc: Herbert Xu
    Signed-off-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • commit 908344cdda80 ("tipc: fix bug in multicast congestion handling")
    introduced a race in the broadcast link wakeup functionality.

    This patch eliminates this broadcast link wakeup race caused by
    operation on the wakeup list without proper locking. If this race
    hit and corrupted the list all subsequent wakeup messages would be
    lost, resulting in a considerable memory leak.

    Signed-off-by: Richard Alpe
    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Richard Alpe
     
  • This change pulls the core functionality out of __netdev_alloc_skb and
    places it in a new function named __alloc_rx_skb. The reason for doing
    this is to make these bits accessible to a new function __napi_alloc_skb.
    In addition __alloc_rx_skb now has a new flags value that is used to
    determine which page frag pool to allocate from. If the SKB_ALLOC_NAPI
    flag is set then the NAPI pool is used. The advantage of this is that we
    do not have to use local_irq_save/restore when accessing the NAPI pool from
    NAPI context.

    In my test setup I saw at least 11ns of savings using the napi_alloc_skb
    function versus the netdev_alloc_skb function, most of this being due to
    the fact that we didn't have to call local_irq_save/restore.

    The main use case for napi_alloc_skb would be for things such as copybreak
    or page fragment based receive paths where an skb is allocated after the
    data has been received instead of before.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch splits the netdev_alloc_frag function up so that it can be used
    on one of two page frag pools instead of being fixed on the
    netdev_alloc_cache. By doing this we can add a NAPI specific function
    __napi_alloc_frag that accesses a pool that is only used from softirq
    context. The advantage to this is that we do not need to call
    local_irq_save/restore which can be a significant savings.

    I also took the opportunity to refactor the core bits that were placed in
    __alloc_page_frag. First I updated the allocation to do either a 32K
    allocation or an order 0 page. This is based on the changes in commit
    d9b2938aa where it was found that latencies could be reduced in case of
    failures. Then I also rewrote the logic to work from the end of the page to
    the start. By doing this the size value doesn't have to be used unless we
    have run out of space for page fragments. Finally I cleaned up the atomic
    bits so that we just do an atomic_sub_and_test and if that returns true then
    we set the page->_count via an atomic_set. This way we can remove the extra
    conditional for the atomic_read since it would have led to an atomic_inc in
    the case of success anyway.

    Signed-off-by: Alexander Duyck
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexander Duyck
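
    The "work from the end of the page to the start" logic can be sketched in
    userspace as follows. This is a simplified model with hypothetical names
    (frag_pool_sketch, frag_pool_refill, frag_alloc_sketch); it leaves out
    the page reference counting and the two per-context pools, and it asks
    the caller to refill instead of allocating a new page itself.

```c
#include <assert.h>
#include <stddef.h>

struct frag_pool_sketch {
    unsigned char *base; /* start of the backing buffer */
    long offset;         /* bytes still free; the next fragment ends here */
};

/* Point the pool at a fresh buffer; allocation starts from its end. */
static void frag_pool_refill(struct frag_pool_sketch *pool, void *buf,
                             size_t size)
{
    assert(buf != NULL);
    pool->base = buf;
    pool->offset = (long)size;
}

/*
 * Carve fragsz bytes off the end of the remaining space. The buffer size
 * is not consulted here: a negative offset alone signals exhaustion.
 */
static void *frag_alloc_sketch(struct frag_pool_sketch *pool, size_t fragsz)
{
    long offset = pool->offset - (long)fragsz;

    if (offset < 0)
        return NULL; /* out of space, caller has to refill */
    pool->offset = offset;
    return pool->base + offset;
}
```

    With a 64-byte buffer and 24-byte fragments, the first two allocations
    land at offsets 40 and 16, and the third returns NULL because only 16
    bytes remain.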
     
  • More iov_iter work for the networking from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2014

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

    This instruments might_sleep() checks to catch places that nest
    blocking primitives - such as mutex usage in a wait loop. Such
    bugs can result in hard to debug races/hangs.

    Another category of invalid nesting that this facility will detect
    is the calling of blocking functions from within schedule() ->
    sched_submit_work() -> blk_schedule_flush_plug().

    There's some potential for false positives (if secondary blocking
    primitives themselves are not ready yet for this facility), but the
    kernel will warn once about such bugs per bootup, so the warning
    isn't much of a nuisance.

    This feature comes with a number of fixes, for problems uncovered
    with it, so no messages are expected normally.

    - Another round of sched/numa optimizations and refinements, for
    CONFIG_NUMA_BALANCING=y.

    - Another round of sched/dl fixes and refinements.

    Plus various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched: Add missing rcu protection to wake_up_all_idle_cpus
    sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
    sched/numa: Init numa balancing fields of init_task
    sched/deadline: Remove unnecessary definitions in cpudeadline.h
    sched/cpupri: Remove unnecessary definitions in cpupri.h
    sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
    sched/fair: Fix stale overloaded status in the busiest group finding logic
    sched: Move p->nr_cpus_allowed check to select_task_rq()
    sched/completion: Document when to use wait_for_completion_io_*()
    sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
    sched/fair: Kill task_struct::numa_entry and numa_group::task_list
    sched: Refactor task_struct to use numa_faults instead of numa_* pointers
    sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
    sched/deadline: Reschedule from switched_from_dl() after a successful pull
    sched/deadline: Push task away if the deadline is equal to curr during wakeup
    sched/deadline: Add deadline rq status print
    sched/deadline: Fix artificial overrun introduced by yield_task_dl()
    sched/rt: Clean up check_preempt_equal_prio()
    sched/core: Use dl_bw_of() under rcu_read_lock_sched()
    sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
    ...

    Linus Torvalds
     
  • To cancel nesting, this function is more convenient.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko