03 Aug, 2016

1 commit

  • Change task_work_cancel() to use lockless_dereference(), this is what
    the code really wants but we didn't have this helper when it was
    written.

    Also add the fast-path task->task_works == NULL check, in the likely
    case this task has no pending works and we can avoid
    spin_lock(task->pi_lock).

    While at it, change other users of ACCESS_ONCE() to use READ_ONCE().

    Link: http://lkml.kernel.org/r/20160610150042.GA13868@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Andrea Parri
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 Jul, 2016

1 commit

  • Pull misc fixes from Thomas Gleixner:
    "This update contains:

    - a fix for stomp-machine so the nmi_watchdog wont trigger on the cpu
    waiting for the others to execute the callback

    - various fixes and updates to objtool including an resync of the
    instruction decoder to match the kernel's decoder"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Un-capitalize "Warning" for out-of-sync instruction decoder
    objtool: Resync x86 instruction decoder with the kernel's
    objtool: Support new GCC 6 switch jump table pattern
    stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE
    objtool: Add 'fixdep' to objtool/.gitignore

    Linus Torvalds
     

30 Jul, 2016

5 commits

  • Pull audit updates from Paul Moore:
    "Six audit patches for 4.8.

    There are a couple of style and minor whitespace tweaks for the logs,
    as well as a minor fixup to catch errors on user filter rules, however
    the major improvements are a fix to the s390 syscall argument masking
    code (reviewed by the nice s390 folks), some consolidation around the
    exclude filtering (less code, always a win), and a double-fetch fix
    for recording the execve arguments"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix a double fetch in audit_log_single_execve_arg()
    audit: fix whitespace in CWD record
    audit: add fields to exclude filter by reusing user filter
    s390: ensure that syscall arguments are properly masked on s390
    audit: fix some horrible switch statement style crimes
    audit: fixup: log on errors from filter user rules

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - TPM core and driver updates/fixes
    - IPv6 security labeling (CALIPSO)
    - Lots of Apparmor fixes
    - Seccomp: remove 2-phase API, close hole where ptrace can change
    syscall #"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
    apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
    tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
    tpm: Factor out common startup code
    tpm: use devm_add_action_or_reset
    tpm2_i2c_nuvoton: add irq validity check
    tpm: read burstcount from TPM_STS in one 32-bit transaction
    tpm: fix byte-order for the value read by tpm2_get_tpm_pt
    tpm_tis_core: convert max timeouts from msec to jiffies
    apparmor: fix arg_size computation for when setprocattr is null terminated
    apparmor: fix oops, validate buffer size in apparmor_setprocattr()
    apparmor: do not expose kernel stack
    apparmor: fix module parameters can be changed after policy is locked
    apparmor: fix oops in profile_unpack() when policy_db is not present
    apparmor: don't check for vmalloc_addr if kvzalloc() failed
    apparmor: add missing id bounds check on dfa verification
    apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
    apparmor: use list_next_entry instead of list_entry_next
    apparmor: fix refcount race when finding a child profile
    apparmor: fix ref count leak when profile sha1 hash is read
    apparmor: check that xindex is in trans_table bounds
    ...

    Linus Torvalds
     
  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level what this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    to kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner casees this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permirted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it make sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     
  • Pull more cgroup updates from Tejun Heo:
    "I forgot to include the patches which got applied to for-4.7-fixes
    late during last cycle.

    Eric's three patches fix bugs introduced with the namespace support"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
    cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
    cgroupns: Fix the locking in copy_cgroup_ns

    Linus Torvalds
     
  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 hundred line of unpenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

29 Jul, 2016

11 commits

  • Pull tracing updates from Steven Rostedt:
    "This is mostly clean ups and small fixes. Some of the more visible
    changes are:

    - The function pid code uses the event pid filtering logic
    - [ku]probe events have access to current->comm
    - trace_printk now has sample code
    - PCI devices now trace physical addresses
    - stack tracing has less unnessary functions traced"

    * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk, tracing: Avoiding unneeded blank lines
    tracing: Use __get_str() when manipulating strings
    tracing, RAS: Cleanup on __get_str() usage
    tracing: Use outer () on __get_str() definition
    ftrace: Reduce size of function graph entries
    tracing: Have HIST_TRIGGERS select TRACING
    tracing: Using for_each_set_bit() to simplify trace_pid_write()
    ftrace: Move toplevel init out of ftrace_init_tracefs()
    tracing/function_graph: Fix filters for function_graph threshold
    tracing: Skip more functions when doing stack tracing of events
    tracing: Expose CPU physical addresses (resource values) for PCI devices
    tracing: Show the preempt count of when the event was called
    tracing: Add trace_printk sample code
    tracing: Choose static tp_printk buffer by explicit nesting count
    tracing: expose current->comm to [ku]probe events
    ftrace: Have set_ftrace_pid use the bitmap like events do
    tracing: Move pid_list write processing into its own function
    tracing: Move the pid_list seq_file functions to be global
    tracing: Move filtered_pid helper functions into trace.c
    tracing: Make the pid filtering helper functions global

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that when written and fenced assures
    that all previous posted writes targeting a given dimm have been
    flushed to media.

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     
  • We currently show:

    task: ti: task.ti: "

    "ti" and "task.ti" are redundant, and neither is actually what we want
    to show, which the the base of the thread stack. Change the display to
    show the stack pointer explicitly.

    Link: http://lkml.kernel.org/r/543ac5bd66ff94000a57a02e11af7239571a3055.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and Move it into account_kernel_stack().

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
    ifdef guards to just ZONE_DEVICE.

    Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Cc: Eric Sandeen
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace is the same from a configuration perspective and will will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
    it is checking the flag on the current task rather than the given one.

    This doesn't make much sense and it is actually wrong. If the current
    task which updates the nodemask of a cpuset got killed by the OOM killer
    then a part of the cpuset cgroup processes would have incompatible
    nodemask which is surprising to say the least.

    The comment suggests the intention was to skip oom victim or an exiting
    task so we should be checking the given task. But even then it would be
    layering violation because it is the memory allocator to interpret the
    TIF_MEMDIE meaning. Simply drop both checks. All tasks in the cpuset
    should simply follow the same mask.

    Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
    It is, however, checking the flag on the current task rather than the
    given one. This is really confusing because freezing() can be called
    also on !current tasks. It would end up working correctly for its main
    purpose because __refrigerator will be always called on the current task
    so the oom victim will never get frozen. But it could lead to
    surprising results when a task which is freezing a cgroup got oom killed
    because only part of the cgroup would get frozen. This is highly
    unlikely but worth fixing as the resulting code would be more clear
    anyway.

    Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • On the tear-down path, the dead CPU callback for the timers was
    misplaced within the 'cpuhp_state' enumeration. There is a hidden
    dependency between the timers and block multiqueue. The timers
    callback must happen before the block multiqueue callback otherwise a
    RCU stall occurs.

    Move the timers callback to the proper place in the state machine.

    Reported-and-tested-by: Jon Hunter
    Reported-by: kbuild test robot
    Fixes: 24f73b99716a ("timers/core: Convert to hotplug state machine")
    Signed-off-by: Richard Cochran
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Rasmus Villemoes
    Cc: John Stultz
    Cc: rt@linutronix.de
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1469610498-25914-1-git-send-email-rcochran@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Richard Cochran
     

28 Jul, 2016

2 commits

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     
  • Pull xen updates from David Vrabel:
    "Features and fixes for 4.8-rc0:

    - ACPI support for guests on ARM platforms.
    - Generic steal time support for arm and x86.
    - Support cases where kernel cpu is not Xen VCPU number (e.g., if
    in-guest kexec is used).
    - Use the system workqueue instead of a custom workqueue in various
    places"

    * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
    xen: add static initialization of steal_clock op to xen_time_ops
    xen/pvhvm: run xen_vcpu_setup() for the boot CPU
    xen/evtchn: use xen_vcpu_id mapping
    xen/events: fifo: use xen_vcpu_id mapping
    xen/events: use xen_vcpu_id mapping in events_base
    x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
    x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
    xen: introduce xen_vcpu_id mapping
    x86/acpi: store ACPI ids from MADT for future usage
    x86/xen: update cpuid.h from Xen-4.7
    xen/evtchn: add IOCTL_EVTCHN_RESTRICT
    xen-blkback: really don't leak mode property
    xen-blkback: constify instance of "struct attribute_group"
    xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
    xen-blkback: prefer xenbus_scanf() over xenbus_gather()
    xen: support runqueue steal time on xen
    arm/xen: add support for vm_assist hypercall
    xen: update xen headers
    xen-pciback: drop superfluous variables
    xen-pciback: short-circuit read path used for merging write values
    ...

    Linus Torvalds
     

27 Jul, 2016

9 commits

  • Suppose that stop_machine(fn) hangs because fn() hangs. In this case NMI
    hard-lockup can be triggered on another CPU which does nothing wrong and
    the trace from nmi_panic() won't help to investigate the problem.

    And this change "fixes" the problem we (seem to) hit in practice.

    - stop_two_cpus(0, 1) races with show_state_filter() running on CPU_0.

    - CPU_1 already spins in MULTI_STOP_PREPARE state, it detects the soft
    lockup and tries to report the problem.

    - show_state_filter() enables preemption, CPU_0 calls multi_cpu_stop()
    which goes to MULTI_STOP_DISABLE_IRQ state and disables interrupts.

    - CPU_1 spends more than 10 seconds trying to flush the log buffer to
    the slow serial console.

    - NMI interrupt on CPU_0 (which now waits for CPU_1) calls nmi_panic().

    Reported-by: Wang Shu
    Signed-off-by: Oleg Nesterov
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Dave Anderson
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20160726185736.GB4088@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "Again, the majority of changes go into the cpufreq subsystem, but
    there are no big features this time. The cpufreq changes that stand
    out somewhat are the governor interface rework and improvements
    related to the handling of frequency tables. Apart from those, there
    are fixes and new device/CPU IDs in drivers, cleanups and an
    improvement of the new schedutil governor.

    Next, there are some changes in the hibernation core, including a fix
    for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
    during resume from hibernation, a few core improvements related to
    memory management during resume, a couple of additional debug features
    and cleanups.

    Finally, we have some fixes and cleanups in the devfreq subsystem,
    generic power domains framework improvements related to system
    suspend/resume, support for some new chips in intel_idle and in the
    power capping RAPL driver, a new version of the AnalyzeSuspend utility
    and some assorted fixes and cleanups.

    Specifics:

    - Rework the cpufreq governor interface to make it more
    straightforward and modify the conservative governor to avoid using
    transition notifications (Rafael Wysocki).

    - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).

    - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).

    - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).

    - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).

    - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).

    - Make intel_pstate update the cpu_frequency tracepoint even if the
    frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).

    - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).

    - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).

    - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).

    - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).

    - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan, Jan
    Beulich, Paul Gortmaker).

    - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handing of
    MSRs that may not be present (Jacob Pan, Xiaolong Wang).

    - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and a
    page fault leading to a kernel crash from it (Rafael Wysocki).

    - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).

    - Add debug features that should help to detect problems related to
    hibernation and resume from it (Rafael Wysocki, Chen Yu).

    - Clean up hibernation core somewhat (Rafael Wysocki).

    - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).

    - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).

    - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).

    - Update the AnalyzeSuspend script in the kernel sources to version
    4.2 (Todd Brandt).

    - Modify the generic power domains framework to make it handle system
    suspend/resume better (Ulf Hansson).

    - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).

    - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
    exynos-bus) and in the core, make some devfreq code explicitly
    non-modular and change some of it into tristate (Bartlomiej
    Zolnierkiewicz, Peter Chen, Paul Gortmaker).

    - Add DT support to the generic PM clocks management code and make it
    export some more symbols (Jon Hunter, Paul Gortmaker).

    - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).

    - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
    Shevchenko)"

    * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    PM / hibernate: Introduce test_resume mode for hibernation
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    PCI / PM: check all fields in pci_set_platform_pm()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    PM / tools: scripts: AnalyzeSuspend v4.2
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    ...

    Linus Torvalds
     
  • css_idr allocation starts at 1, so index 0 will never point to an item.
    css_from_id() currently filters that before asking idr_find(), but
    idr_find() would also just return NULL, so this is not needed.

    Link: http://lkml.kernel.org/r/20160617162427.GC19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The valid cgroup hierarchy ID range includes 0, so we can't filter for
    positive numbers when freeing it, or it'll leak the first ID. No big
    deal, just disruptive when reading the code.

    The ID is freed during error handling and when the reference count hits
    zero, so the double-free test is not necessary; remove it.

    Link: http://lkml.kernel.org/r/20160617162359.GB19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "Nothing too exciting.

    - updates to the pids controller so that pid limit breaches can be
    noticed and monitored from userland.

    - cleanups and non-critical bug fixes"

    * 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: remove duplicated include from cgroup.c
    cgroup: Use lld instead of ld when printing pids controller events_limit
    cgroup: Add pids controller event when fork fails because of pid limit
    cgroup: allow NULL return from ss->css_alloc()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root

    Linus Torvalds
     

26 Jul, 2016

10 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - new core infrastructure to allow better management of multi-queue
    devices (interrupt spreading, node aware descriptor allocation ...)

    - a new interrupt flow handler to support the new fangled Intel VMD
    devices.

    - yet another new interrupt controller driver.

    - a series of fixes which addresses sparse warnings, missing
    includes, missing static declarations etc from Ben Dooks.

    - a fix for the error handling in the hierarchical domain allocation
    code.

    - the usual pile of small updates to core and driver code"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    genirq: Fix missing irq allocation affinity hint
    irqdomain: Fix irq_domain_alloc_irqs_recursive() error handling
    irq/Documentation: Correct result of echnoing 5 to smp_affinity
    MAINTAINERS: Remove Jiang Liu from irq domains
    genirq/msi: Fix broken debug output
    genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
    genirq/msi: Make use of affinity aware allocations
    genirq: Use affinity hint in irqdesc allocation
    genirq: Add affinity hint to irq allocation
    genirq: Introduce IRQD_AFFINITY_MANAGED flag
    genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
    irqchip/s3c24xx: Fixup IO accessors for big endian
    irqchip/exynos-combiner: Fix usage of __raw IO
    irqdomain: Fix disposal of mappings for interrupt hierarchies
    irqchip/aspeed-vic: Add irq controller for Aspeed
    doc/devicetree: Add Aspeed VIC bindings
    x86/PCI/VMD: Use untracked irq handler
    genirq: Add untracked irq handler
    irqchip/mips-gic: Populate irq_domain names
    irqchip/gicv3-its: Implement two-level(indirect) device table support
    ...

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for next expiring timer,
    etc). That's the first major change of the wheel in almost 20
    years since Finn implemted it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     
  • This allows user memory to be written to during the course of a kprobe.
    It shouldn't be used to implement any kind of security mechanism
    because of TOC-TOU attacks, but rather to debug, divert, and
    manipulate execution of semi-cooperative processes.

    Although it uses probe_kernel_write, we limit the address space
    the probe can write into by checking the space with access_ok.
    We do this as opposed to calling copy_to_user directly, in order
    to avoid sleeping. In addition we ensure the threads's current fs
    / segment is USER_DS and the thread isn't exiting nor a kernel thread.

    Given this feature is meant for experiments, and it has a risk of
    crashing the system, and running programs, we print a warning on
    when a proglet that attempts to use this helper is installed,
    along with the pid and process name.

    Signed-off-by: Sargun Dhillon
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Sargun Dhillon
     
  • Pull x86 boot updates from Ingo Molnar:
    "The main changes:

    - add initial commits to randomize kernel memory section virtual
    addresses, enabled via a new kernel option: RANDOMIZE_MEMORY
    (Thomas Garnier, Kees Cook, Baoquan He, Yinghai Lu)

    - enhance KASLR (RANDOMIZE_BASE) physical memory randomization (Kees
    Cook)

    - EBDA/BIOS region boot quirk cleanups (Andy Lutomirski, Ingo Molnar)

    - misc cleanups/fixes"

    * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/boot: Simplify EBDA-vs-BIOS reservation logic
    x86/boot: Clarify what x86_legacy_features.reserve_bios_regions does
    x86/boot: Reorganize and clean up the BIOS area reservation code
    x86/mm: Do not reference phys addr beyond kernel
    x86/mm: Add memory hotplug support for KASLR memory randomization
    x86/mm: Enable KASLR for vmalloc memory regions
    x86/mm: Enable KASLR for physical mapping memory regions
    x86/mm: Implement ASLR for kernel memory regions
    x86/mm: Separate variable for trampoline PGD
    x86/mm: Add PUD VA support for physical mapping
    x86/mm: Update physical mapping variable names
    x86/mm: Refactor KASLR entropy functions
    x86/KASLR: Fix boot crash with certain memory configurations
    x86/boot/64: Add forgotten end of function marker
    x86/KASLR: Allow randomization below the load address
    x86/KASLR: Extend kernel image physical address randomization to addresses larger than 4G
    x86/KASLR: Randomize virtual address separately
    x86/KASLR: Clarify identity map interface
    x86/boot: Refuse to build with data relocations
    x86/KASLR, x86/power: Remove x86 hibernation restrictions

    Linus Torvalds
     
  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime leaked on cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime account on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
    and generalize task_struct handling (Oleg Nesterov)

    - do various per entity load tracking (PELT) fixes and optimizations
    (Peter Zijlstra)

    - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

    - introduce consolidated cputime output file cpuacct.usage_all and
    related refactorings (Zhao Lei)

    - ... plus misc fixes and enhancements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
    sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
    sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
    sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
    sched/fair: Rework throttle_count sync
    sched/core: Fix sched_getaffinity() return value kerneldoc comment
    sched/fair: Reorder cgroup creation code
    sched/fair: Apply more PELT fixes
    sched/fair: Fix PELT integrity for new tasks
    sched/cgroup: Fix cpu_cgroup_fork() handling
    sched/fair: Fix PELT integrity for new groups
    sched/fair: Fix and optimize the fork() path
    sched/cputime: Add steal time support to full dynticks CPU time accounting
    sched/cputime: Fix prev steal time accouting during CPU hotplug
    KVM: Fix steal clock warp during guest CPU hotplug
    sched/debug: Always show 'nr_migrations'
    sched/fair: Use task_rcu_dereference()
    sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
    sched/idle: Optimize the generic idle loop
    sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "With over 300 commits it's been a busy cycle - with most of the work
    concentrated on the tooling side (as it should).

    The main kernel side enhancements were:

    - Add per event callchain limit: Recently we introduced a sysctl to
    tune the max-stack for all events for which callchains were
    requested:

    $ sysctl kernel.perf_event_max_stack
    kernel.perf_event_max_stack = 127

    Now this patch introduces a way to configure this per event, i.e.
    this becomes possible:

    $ perf record -e sched:*/max-stack=2/ -e block:*/max-stack=10/ -a

    allowing finer tuning of how much buffer space callchains use.

    This uses an u16 from the reserved space at the end, leaving
    another u16 for future use.

    There has been interest in even finer tuning, namely to control the
    max stack for kernel and userspace callchains separately. Further
    discussion is needed, we may for instance use the remaining u16 for
    that and when it is present, assume that the sample_max_stack
    introduced in this patch applies for the kernel, and the u16 left
    is used for limiting the userspace callchain (Arnaldo Carvalho de
    Melo)

    - Optimize AUX event (hardware assisted side-band event) delivery
    (Kan Liang)

    - Rework Intel family name macro usage (this is partially x86 arch
    work) (Dave Hansen)

    - Refine and fix Intel LBR support (David Carrillo-Cisneros)

    - Add support for Intel 'TopDown' events (Andi Kleen)

    - Intel uncore PMU driver fixes and enhancements (Kan Liang)

    - ... other misc changes.

    Here's an incomplete list of the tooling enhancements (but there's
    much more, see the shortlog and the git log for details):

    - Support cross unwinding, i.e. collecting '--call-graph dwarf'
    perf.data files in one machine and then doing analysis in another
    machine of a different hardware architecture. This enables, for
    instance, to do:

    $ perf record -a --call-graph dwarf

    on a x86-32 or aarch64 system and then do 'perf report' on it on a
    x86_64 workstation (He Kuang)

    - Allow reading from a backward ring buffer (one setup via
    sys_perf_event_open() with perf_event_attr.write_backward = 1)
    (Wang Nan)

    - Finish merging initial SDT (Statically Defined Traces) support, see
    cset comments for details about how it all works (Masami Hiramatsu)

    - Support attaching eBPF programs to tracepoints (Wang Nan)

    - Add demangling of symbols in programs written in the Rust language
    (David Tolnay)

    - Add support for tracepoints in the python binding, including an
    example, that sets up and parses sched:sched_switch events,
    tools/perf/python/tracepoint.py (Jiri Olsa)

    - Introduce --stdio-color to set up the color output mode selection
    in 'annotate' and 'report', allowing emit color escape sequences
    when redirecting the output of these tools (Arnaldo Carvalho de
    Melo)

    - Add 'callindent' option to 'perf script -F', to indent the Intel PT
    call stack, making this output more ftrace-like (Adrian Hunter,
    Andi Kleen)

    - Allow dumping the object files generated by llvm when processing
    eBPF scriptlet events (Wang Nan)

    - Add stackcollapse.py script to help generating flame graphs (Paolo
    Bonzini)

    - Add --ldlat option to 'perf mem' to specify load latency for loads
    event (e.g. cpu/mem-loads/ ) (Jiri Olsa)

    - Tooling support for Intel TopDown counters, recently added to the
    kernel (Andi Kleen)"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (303 commits)
    perf tests: Add is_printable_array test
    perf tools: Make is_printable_array global
    perf script python: Fix string vs byte array resolving
    perf probe: Warn unmatched function filter correctly
    perf cpu_map: Add more helpers
    perf stat: Balance opening and reading events
    tools: Copy linux/{hash,poison}.h and check for drift
    perf tools: Remove include/linux/list.h from perf's MANIFEST
    tools: Copy the bitops files accessed from the kernel and check for drift
    Remove: kernel unistd*h files from perf's MANIFEST, not used
    perf tools: Remove tools/perf/util/include/linux/const.h
    perf tools: Remove tools/perf/util/include/asm/byteorder.h
    perf tools: Add missing linux/compiler.h include to perf-sys.h
    perf jit: Remove some no-op error handling
    perf jit: Add missing curly braces
    objtool: Initialize variable to silence old compiler
    objtool: Add -I$(srctree)/tools/arch/$(ARCH)/include/uapi
    perf record: Add --tail-synthesize option
    perf session: Don't warn about out of order event if write_backward is used
    perf tools: Enable overwrite settings
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - documentation updates

    - miscellaneous fixes

    - minor reorganization of code

    - torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
    rcu: Correctly handle sparse possible cpus
    rcu: sysctl: Panic on RCU Stall
    rcu: Fix a typo in a comment
    rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
    rcu: Disable TASKS_RCU for usermode Linux
    rcu: No ordering for rcu_assign_pointer() of NULL
    rcutorture: Fix error return code in rcu_perf_init()
    torture: Inflict default jitter
    rcuperf: Don't treat gp_exp mis-setting as a WARN
    rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
    rcutorture: Don't specify the cpu type of QEMU on PPC
    rcutorture: Make -soundhw a x86 specific option
    rcutorture: Use vmlinux as the fallback kernel image
    rcutorture/doc: Create initrd using dracut
    torture: Stop onoff task if there is only one cpu
    torture: Add starvation events to error summary
    torture: Break online and offline functions out of torture_onoff()
    torture: Forgive lengthy trace dumps and preemption
    torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
    torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
    ...

    Linus Torvalds
     
  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances destination but also source buffer by the
    written amount of bytes. When we have __output_custom(), this is actually
    wrong since in that case the source buffer points to a non-linear object,
    in our case an skb, which the copy_func helper is supposed to walk.
    Therefore, since this is non-linear we thus need to pass the offset into
    the helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
    i) to pass in whether we should advance source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

25 Jul, 2016

1 commit

  • * pm-cpufreq: (41 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    cpufreq: Reuse new freq-table helpers
    cpufreq: Handle sorted frequency tables more efficiently
    cpufreq: Drop redundant check from cpufreq_update_current_freq()
    intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly
    intel_pstate: add __init/__initdata marker to some functions/variables
    intel_pstate: Fix incorrect placement of __initdata
    cpufreq: mvebu: fix integer to pointer cast
    cpufreq: intel_pstate: Broxton support
    cpufreq: conservative: Do not use transition notifications
    ...

    Rafael J. Wysocki