28 Apr, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull more KVM updates from Paolo Bonzini:
    "Second batch of KVM updates. Some minor x86 fixes, two s390 guest
    features that need some handling in the host, and all the PPC changes.

    The PPC changes include support for little-endian guests and
    enablement for new POWER8 features"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits)
    x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101
    x86, kvm: cache the base of the KVM cpuid leaves
    kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
    KVM: PPC: Book3S PR: Cope with doorbell interrupts
    KVM: PPC: Book3S HV: Add software abort codes for transactional memory
    KVM: PPC: Book3S HV: Add new state for transactional memory
    powerpc/Kconfig: Make TM select VSX and VMX
    KVM: PPC: Book3S HV: Basic little-endian guest support
    KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
    KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
    KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
    KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
    KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
    KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
    KVM: PPC: Book3S HV: Add handler for HV facility unavailable
    KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
    KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
    KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
    KVM: PPC: Book3S HV: Don't set DABR on POWER8
    kvm/ppc: IRQ disabling cleanup
    ...

    Linus Torvalds
     

29 Jan, 2014

1 commit


27 Jan, 2014

3 commits

  • This adds the software abort code defines for transactional memory (TM).
    These values are from PAPR.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Michael Neuling
     
  • The DABRX (DABR extension) register on POWER7 processors provides finer
    control over which accesses cause a data breakpoint interrupt. It
    contains 3 bits which indicate whether to enable accesses in user,
    kernel and hypervisor modes respectively to cause data breakpoint
    interrupts, plus one bit that enables both real mode and virtual mode
    accesses to cause interrupts. Currently, KVM sets DABRX to allow
    both kernel and user accesses to cause interrupts while in the guest.

    This adds support for the guest to specify other values for DABRX.
    PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
    and DABRX with one call. This adds a real-mode implementation of
    H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
    implementation. To support this, we add a per-vcpu field to store the
    DABRX value plus code to get and set it via the ONE_REG interface.

    For Linux guests to use this new hcall, userspace needs to add
    "hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
    property in the device tree. If userspace does this and then migrates
    the guest to a host where the kernel doesn't include this patch, then
    userspace will need to implement H_SET_XDABR by writing the specified
    DABR value to the DABR using the ONE_REG interface. In that case, the
    old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should
    still work correctly, at least for Linux guests, since Linux guests
    cope with getting data breakpoint interrupts in modes that weren't
    requested by just ignoring the interrupt, and Linux guests never set
    DABRX_BTI.

    The other thing this does is to make H_SET_DABR and H_SET_XDABR work
    on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that
    know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
    guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
    For them, this adds the logic to convert DABR/X values into DAWR/X values
    on POWER8.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
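
As a rough illustration of the DABRX layout described in the entry above, the following Python sketch names the four control bits and decodes a value. The bit positions are an assumption made for illustration (they are not stated in the commit message) and should be checked against the powerpc register definitions in your kernel tree. `describe_dabrx` is a hypothetical helper, not a kernel function.

```python
# Assumed bit positions, for illustration only; verify against the
# powerpc register header of the kernel you are working with.
DABRX_USER = 1 << 0    # accesses in user mode may trigger the breakpoint
DABRX_KERNEL = 1 << 1  # accesses in kernel mode may trigger the breakpoint
DABRX_HYP = 1 << 2     # accesses in hypervisor mode may trigger the breakpoint
DABRX_BTI = 1 << 3     # both real-mode and virtual-mode accesses match

# The value the commit message says an older kernel applies to guests.
LEGACY_DEFAULT = DABRX_USER | DABRX_KERNEL


def describe_dabrx(value):
    """Return the names of the DABRX control bits set in `value`."""
    names = [
        ("user", DABRX_USER),
        ("kernel", DABRX_KERNEL),
        ("hypervisor", DABRX_HYP),
        ("bti", DABRX_BTI),
    ]
    return [name for name, bit in names if value & bit]


if __name__ == "__main__":
    print(describe_dabrx(LEGACY_DEFAULT))
```

Under these assumed positions, a guest that asked only for user-mode breakpoints would still see kernel-mode hits on an old host, which is the case the commit message says Linux guests tolerate by ignoring the interrupt.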
     
  • This adds fields to the struct kvm_vcpu_arch to store the new
    guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
    functions to allow userspace to access this state, and adds code to
    the guest entry and exit to context-switch these SPRs between host
    and guest.

    Note that DPDES (Directed Privileged Doorbell Exception State) is
    shared between threads on a core; hence we store it in struct
    kvmppc_vcore and have the master thread save and restore it.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Michael Neuling
     

19 Jan, 2014

1 commit

  • For user space packet capturing libraries such as libpcap, there's
    currently only one way to check which BPF extensions are supported
    by the kernel, that is, commit aa1113d9f85d ("net: filter: return
    -EINVAL if BPF_S_ANC* operation is not supported"). For querying all
    extensions at once this might be rather inconvenient.

    Therefore, this patch introduces a new option which can be used as
    an argument for getsockopt(), and allows one to obtain information
    about which BPF extensions are supported by the current kernel.

    As David Miller suggests, we do not need to define any bits right
    now and status quo can just return 0 in order to state that this
    version supports SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET. Later
    additions to BPF extensions need to add their bits to the
    bpf_tell_extensions() function, as documented in the comment.

    Signed-off-by: Michal Sekletar
    Cc: David Miller
    Reviewed-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Michal Sekletar
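
A minimal sketch of how user space might query the option introduced by the entry above, written in Python for brevity. The option name SO_BPF_EXTENSIONS and its numeric value 48 are assumptions here: the commit message does not name the option, and the number is the generic one, which differs on some architectures. `query_bpf_extensions` is a hypothetical helper. The sketch only makes sense on Linux.

```python
import socket

# Assumption: fall back to the generic option number when the Python
# socket module does not export the constant. Some architectures use
# a different numbering for socket options.
SO_BPF_EXTENSIONS = getattr(socket, "SO_BPF_EXTENSIONS", 48)


def query_bpf_extensions():
    """Return the kernel's BPF extension value, or None if the query fails."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # Without a buffer length, getsockopt() decodes the result as an int.
            return sock.getsockopt(socket.SOL_SOCKET, SO_BPF_EXTENSIONS)
    except OSError:
        # Older kernels (or other platforms) do not know this option.
        return None


if __name__ == "__main__":
    value = query_bpf_extensions()
    if value is None:
        print("BPF extension query not supported")
    else:
        print("BPF extensions value:", value)
```

A library such as libpcap could treat a failed query as "fall back to probing individual extensions", which is the older method the commit message refers to.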
     

15 Nov, 2013

1 commit

  • Pull KVM changes from Paolo Bonzini:
    "Here are the 3.13 KVM changes. There was a lot of work on the PPC
    side; that the HV and emulation flavors can now coexist in a single
    kernel is probably the most interesting change from a user point of view.

    On the x86 side there are nested virtualization improvements and a few
    bugfixes.

    ARM got transparent huge page support, improved overcommit, and
    support for big endian guests.

    Finally, there is a new interface to connect KVM with VFIO. This
    helps with devices that use NoSnoop PCI transactions, letting the
    driver in the guest execute WBINVD instructions. This includes some
    nVidia cards on Windows, that fail to start without these patches and
    the corresponding userspace changes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
    kvm, vmx: Fix lazy FPU on nested guest
    arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
    arm/arm64: KVM: MMIO support for BE guest
    kvm, cpuid: Fix sparse warning
    kvm: Delete prototype for non-existent function kvm_check_iopl
    kvm: Delete prototype for non-existent function complete_pio
    hung_task: add method to reset detector
    pvclock: detect watchdog reset at pvclock read
    kvm: optimize out smp_mb after srcu_read_unlock
    srcu: API for barrier after srcu read unlock
    KVM: remove vm mmap method
    KVM: IOMMU: hva align mapping page size
    KVM: x86: trace cpuid emulation when called from emulator
    KVM: emulator: cleanup decode_register_operand() a bit
    KVM: emulator: check rex prefix inside decode_register()
    KVM: x86: fix emulation of "movzbl %bpl, %eax"
    kvm_host: typo fix
    KVM: x86: emulate SAHF instruction
    MAINTAINERS: add tree for kvm.git
    Documentation/kvm: add a 00-INDEX file
    ...

    Linus Torvalds
     

13 Nov, 2013

1 commit

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various datastructures as
    fundamental operations. For example, sets are supported, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfilter, amongst other things.

    In particular the taus88 generator is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. The main goals are to be able
    to use RCU lookups even on request sockets, and to eliminate the
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in its fast path, and an RCU
    conversion was started back in 3.11; several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     

17 Oct, 2013

8 commits

  • This patch adds debug stub support on booke/bookehv.
    Now the QEMU debug stub can use hardware breakpoints, watchpoints and
    software breakpoints to debug the guest.

    This is how we save/restore debug register context when switching
    between guest, userspace and kernel user-process:

    When QEMU is running
    -> thread->debug_reg == QEMU debug register context.
    -> Kernel will handle switching the debug register on context switch.
    -> no vcpu_load() called

    QEMU makes ioctls (except RUN)
    -> This will call vcpu_load()
    -> should not change context.
    -> Some ioctls can change vcpu debug register, context saved in vcpu->debug_regs

    QEMU Makes RUN ioctl
    -> Save thread->debug_reg on STACK
    -> Store thread->debug_reg == vcpu->debug_reg
    -> load thread->debug_reg
    -> RUN VCPU ( So thread points to vcpu context )

    Context switch happens when VCPU is running
    -> vcpu_load() is called but should not load any context
    -> kernel loads the vcpu context as thread->debug_regs points to vcpu context.

    On heavyweight_exit
    -> Load the context saved on stack in thread->debug_reg

    Currently we do not support debug resource emulation for the guest.
    On a debug exception, we always exit to user space, irrespective of
    whether user space is expecting the debug exception or not. If this
    is an unexpected exception (breakpoint/watchpoint event not set by
    userspace), then we leave the action to user space. This is similar
    to what it was before; the only difference is that now we have
    proper exit state available to user space.

    Signed-off-by: Bharat Bhushan
    Signed-off-by: Alexander Graf

    Bharat Bhushan
     
  • The "ehpriv 1" instruction is used for setting software breakpoints
    by user space. This patch adds support for exiting to user space
    with "run->debug" holding the relevant information.

    As this is the first place we use run->debug, this also defines
    the run->debug structure.

    Signed-off-by: Bharat Bhushan
    Signed-off-by: Alexander Graf

    Bharat Bhushan
     
  • This enables us to use the Processor Compatibility Register (PCR) on
    POWER7 to put the processor into architecture 2.05 compatibility mode
    when running a guest. In this mode the new instructions and registers
    that were introduced on POWER7 are disabled in user mode. This
    includes all the VSX facilities plus several other instructions such
    as ldbrx, stdbrx, popcntw, popcntd, etc.

    To select this mode, we have a new register accessible through the
    set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting
    this to zero gives the full set of capabilities of the processor.
    Setting it to one of the "logical" PVR values defined in PAPR puts
    the vcpu into the compatibility mode for the corresponding
    architecture level. The supported values are:

    0x0f000002 Architecture 2.05 (POWER6)
    0x0f000003 Architecture 2.06 (POWER7)
    0x0f100003 Architecture 2.06+ (POWER7+)

    Since the PCR is per-core, the architecture compatibility level and
    the corresponding PCR value are stored in the struct kvmppc_vcore, and
    are therefore shared between all vcpus in a virtual core.

    Signed-off-by: Paul Mackerras
    [agraf: squash in fix to add missing break statements and documentation]
    Signed-off-by: Alexander Graf

    Paul Mackerras
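
The table of supported KVM_REG_PPC_ARCH_COMPAT values in the entry above can be sketched as a small lookup, written in Python for brevity. The three logical PVR values and the meaning of zero come from the commit message. The helper name `describe_arch_compat` is hypothetical, and raising an error for unlisted values is an assumption about how an unsupported value would be treated.

```python
# Logical PVR values listed in the commit message.
ARCH_COMPAT_LEVELS = {
    0x0F000002: "Architecture 2.05 (POWER6)",
    0x0F000003: "Architecture 2.06 (POWER7)",
    0x0F100003: "Architecture 2.06+ (POWER7+)",
}


def describe_arch_compat(value):
    """Describe the compatibility mode selected by a register value."""
    if value == 0:
        # Zero means no compatibility restriction at all.
        return "native (full capabilities of the processor)"
    try:
        return ARCH_COMPAT_LEVELS[value]
    except KeyError:
        # Assumption: values outside the table are rejected.
        raise ValueError(f"unsupported logical PVR {value:#010x}") from None


if __name__ == "__main__":
    for pvr in (0, 0x0F000002, 0x0F100003):
        print(f"{pvr:#010x}: {describe_arch_compat(pvr)}")
```

Because the PCR is per-core, a real implementation would store the chosen level once per virtual core rather than per vcpu, as the commit message notes.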
     
  • POWER7 and later IBM server processors have a register called the
    Program Priority Register (PPR), which controls the priority of
    each hardware CPU SMT thread, and affects how fast it runs compared
    to other SMT threads. This priority can be controlled by writing to
    the PPR or by use of a set of instructions of the form or rN,rN,rN
    which are otherwise no-ops but have been defined to set the priority
    to particular levels.

    This adds code to context switch the PPR when entering and exiting
    guests and to make the PPR value accessible through the SET/GET_ONE_REG
    interface. When entering the guest, we set the PPR as late as
    possible, because if we are setting a low thread priority it will
    make the code run slowly from that point on. Similarly, the
    first-level interrupt handlers save the PPR value in the PACA very
    early on, and set the thread priority to the medium level, so that
    the interrupt handling code runs at a reasonable speed.

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds the ability to have a separate LPCR (Logical Partitioning
    Control Register) value relating to a guest for each virtual core,
    rather than only having a single value for the whole VM. This
    corresponds to what real POWER hardware does, where there is a LPCR
    per CPU thread but most of the fields are required to have the same
    value on all active threads in a core.

    The per-virtual-core LPCR can be read and written using the
    GET/SET_ONE_REG interface. Userspace can only modify the
    following fields of the LPCR value:

    DPFD Default prefetch depth
    ILE Interrupt little-endian
    TC Translation control (secondary HPT hash group search disable)

    We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
    contains bits relating to memory management, i.e. the Virtualized
    Partition Memory (VPM) bits and the bits relating to guest real mode.
    When this default value is updated, the update needs to be propagated
    to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
    that.

    Signed-off-by: Paul Mackerras
    [agraf: fix whitespace]
    Signed-off-by: Alexander Graf

    Paul Mackerras
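
The rule in the entry above, that userspace may change only the DPFD, ILE and TC fields of the per-virtual-core LPCR, amounts to a masked update. The Python sketch below shows that technique. The bit positions are assumptions made for illustration and are not given in the commit message; `apply_userspace_lpcr` is a hypothetical helper, not the kernel's function.

```python
U64 = (1 << 64) - 1

# Assumed field positions, for illustration only; check them against
# the powerpc register definitions in your kernel tree.
LPCR_DPFD = 7 << 52      # default prefetch depth (3-bit field)
LPCR_ILE = 0x02000000    # interrupt little-endian
LPCR_TC = 0x00000200     # translation control

# Fields the commit message says userspace is allowed to modify.
USER_MASK = LPCR_DPFD | LPCR_ILE | LPCR_TC


def apply_userspace_lpcr(current, requested):
    """Merge a userspace-supplied LPCR into the current per-vcore value.

    Bits inside USER_MASK are taken from `requested`; every other bit
    keeps the value it has in `current`.
    """
    return ((current & ~USER_MASK) | (requested & USER_MASK)) & U64


if __name__ == "__main__":
    current = 1 << 63                  # some bit outside the user mask
    requested = LPCR_ILE | (1 << 62)   # tries to set ILE and a protected bit
    print(hex(apply_userspace_lpcr(current, requested)))
```

The protected bit in the request is silently dropped while ILE is applied, so the per-VM memory-management bits stay under kernel control.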
     
  • The VRSAVE register value for a vcpu is accessible through the
    GET/SET_SREGS interface for Book E processors, but not for Book 3S
    processors. In order to make this accessible for Book 3S processors,
    this adds a new register identifier for GET/SET_ONE_REG, and adds
    the code to implement it.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This allows guests to have a different timebase origin from the host.
    This is needed for migration, where a guest can migrate from one host
    to another and the two hosts might have a different timebase origin.
    However, the timebase seen by the guest must not go backwards, and
    should go forwards only by a small amount corresponding to the time
    taken for the migration.

    Therefore this provides a new per-vcpu value accessed via the one_reg
    interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value
    defaults to 0 and is not modified by KVM. On entering the guest, this
    value is added onto the timebase, and on exiting the guest, it is
    subtracted from the timebase.

    This is only supported for recent POWER hardware which has the TBU40
    (timebase upper 40 bits) register. Writing to the TBU40 register only
    alters the upper 40 bits of the timebase, leaving the lower 24 bits
    unchanged. This provides a way to modify the timebase for guest
    migration without disturbing the synchronization of the timebase
    registers across CPU cores. The kernel rounds up the value given
    to a multiple of 2^24.

    Timebase values stored in KVM structures (struct kvm_vcpu, struct
    kvmppc_vcore, etc.) are stored as host timebase values. The timebase
    values in the dispatch trace log need to be guest timebase values,
    however, since that is read directly by the guest. This moves the
    setting of vcpu->arch.dec_expires on guest exit to a point after we
    have restored the host timebase so that vcpu->arch.dec_expires is a
    host timebase value.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
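
The timebase arithmetic in the entry above can be sketched as follows, in Python for brevity. The rounding to a multiple of 2^24 and the add-on-entry behaviour come from the commit message. The function names are hypothetical, and the 64-bit wraparound is an assumption about register width.

```python
TB_GRANULE = 1 << 24   # TBU40 only changes bits above the low 24
U64 = (1 << 64) - 1    # assumption: timebase values wrap at 64 bits


def round_tb_offset(offset):
    """Round an offset up to the next multiple of 2^24."""
    return ((offset + TB_GRANULE - 1) & ~(TB_GRANULE - 1)) & U64


def guest_timebase(host_tb, offset):
    """Timebase the guest sees: host timebase plus the rounded offset."""
    return (host_tb + round_tb_offset(offset)) & U64


if __name__ == "__main__":
    host_tb = 0x0000_1234_56AB_CDEF
    guest_tb = guest_timebase(host_tb, 1)
    # The low 24 bits are identical on both sides of the switch.
    print(hex(host_tb & 0xFFFFFF), hex(guest_tb & 0xFFFFFF))
```

Because the applied offset is always a multiple of 2^24, the low 24 bits of the guest and host timebases match, which is why the switch does not disturb synchronization across cores.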
     
  • This reserves space in get/set_one_reg ioctl for the extra guest state
    needed for POWER8. It doesn't implement these at all, it just reserves
    them so that the ABI is defined now.

    A few things to note here:

    - This adds *a lot* of state for transactional memory. With TM suspend
    mode, this is unavoidable: you can't simply roll back all transactions and
    store only the checkpointed state. I've added this all to
    get/set_one_reg (including GPRs) rather than creating a new ioctl
    which returns a struct kvm_regs like KVM_GET_REGS does. This means
    that if we need to extract the TM state, we are going to need a bucket
    load of IOCTLs. Hopefully most of the time this will not be needed as we
    can look at the MSR to see if TM is active and only grab them when
    needed. If this becomes a bottleneck in future we can add another
    ioctl to grab all this state in one go.

    - The TM state is offset by 0x80000000.

    - For TM, I've done away with VMX and FP and created a single 64x128 bit
    VSX register space.

    - I've left a space of 1 (at 0x9c) since Paulus needs to add a value
    which applies to POWER7 as well.

    Signed-off-by: Michael Neuling
    Signed-off-by: Alexander Graf

    Michael Neuling
     

11 Oct, 2013

1 commit


29 Sep, 2013

1 commit

  • As mentioned in commit afe4fd062416b ("pkt_sched: fq: Fair Queue packet
    scheduler"), this patch adds a new socket option.

    SO_MAX_PACING_RATE offers the application the ability to cap the
    rate computed by transport layer. Value is in bytes per second.

    u32 val = 1000000;
    setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));

    To be effectively paced, a flow must use FQ packet scheduler.

    Note that a packet scheduler takes into account the headers for its
    computations. The effective payload rate depends on MSS and retransmits
    if any.

    I chose to make this pacing rate a SOL_SOCKET option instead of a
    TCP one because this can be used by other protocols.

    Signed-off-by: Eric Dumazet
    Cc: Steinar H. Gunderson
    Cc: Michael Kerrisk
    Signed-off-by: David S. Miller

    Eric Dumazet
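
For comparison with the C fragment in the entry above, here is a hedged Python sketch that sets the cap and reads it back. The numeric fallback 47 is an assumption (the generic option number, used when the Python socket module does not export SO_MAX_PACING_RATE), and both helper names are hypothetical. The sketch targets Linux only, and the cap only shapes traffic when the flow runs over the FQ packet scheduler, as the commit message says.

```python
import socket

# Assumption: generic option number, used only if Python lacks the name.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)


def cap_pacing_rate(sock, bytes_per_second):
    """Set the pacing cap on `sock` and return the value read back.

    Returns None when the kernel does not support the option.
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, bytes_per_second)
        return sock.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)
    except OSError:
        return None


def probe_pacing_cap(bytes_per_second=1_000_000):
    """Open a throwaway TCP socket and try to apply the cap to it."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            return cap_pacing_rate(sock, bytes_per_second)
    except OSError:
        return None


if __name__ == "__main__":
    print("pacing cap read back:", probe_pacing_cap())
```

The value is in bytes per second and, per the commit message, the scheduler counts headers too, so the achievable payload rate is somewhat lower than the cap.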
     

14 Aug, 2013

2 commits


01 Aug, 2013

1 commit


11 Jul, 2013

1 commit

  • Rename LL_SO to BUSY_POLL_SO
    Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
    Fix up users of these variables.
    Fix documentation for sysctl.

    A patch for the socket.7 man page will follow separately,
    because of limitations of my mail setup.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

18 Jun, 2013

1 commit


01 Jun, 2013

1 commit


06 May, 2013

2 commits

  • Also, make HTM's presence dependent on the .config option.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Benjamin Herrenschmidt

    Nishanth Aravamudan
     
  • Pull kvm updates from Gleb Natapov:
    "Highlights of the updates are:

    general:
    - new emulated device API
    - legacy device assignment is now optional
    - irqfd interface is more generic and can be shared between arches

    x86:
    - VMCS shadow support and other nested VMX improvements
    - APIC virtualization and Posted Interrupt hardware support
    - Optimize mmio spte zapping

    ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

    ARM:
    - reworking of Hyp idmaps

    s390:
    - ioeventfd for virtio-ccw

    And many other bug fixes, cleanups and improvements"

    * tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
    kvm: Add compat_ioctl for device control API
    KVM: x86: Account for failing enable_irq_window for NMI window request
    KVM: PPC: Book3S: Add API for in-kernel XICS emulation
    kvm/ppc/mpic: fix missing unlock in set_base_addr()
    kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
    kvm/ppc/mpic: remove users
    kvm/ppc/mpic: fix mmio region lists when multiple guests used
    kvm/ppc/mpic: remove default routes from documentation
    kvm: KVM_CAP_IOMMU only available with device assignment
    ARM: KVM: iterate over all CPUs for CPU compatibility check
    KVM: ARM: Fix spelling in error message
    ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
    KVM: ARM: Fix API documentation for ONE_REG encoding
    ARM: KVM: promote vfp_host pointer to generic host cpu context
    ARM: KVM: add architecture specific hook for capabilities
    ARM: KVM: perform HYP initilization for hotplugged CPUs
    ARM: KVM: switch to a dual-step HYP init code
    ARM: KVM: rework HYP page table freeing
    ARM: KVM: enforce maximum size for identity mapped code
    ARM: KVM: move to a KVM provided HYP idmap
    ...

    Linus Torvalds
     

03 May, 2013

1 commit

  • Pull powerpc update from Benjamin Herrenschmidt:
    "The main highlights this time around are:

    - A pile of additional POWER8 bits and nits, such as updated
    performance counter support (Michael Ellerman), new branch history
    buffer support (Anshuman Khandual), base support for the new PCI
    host bridge when not using the hypervisor (Gavin Shan) and other
    random related bits and fixes from various contributors.

    - Some rework of our page table format by Aneesh Kumar which fixes a
    thing or two and paves the way for THP support. THP itself will
    not make it this time around however.

    - More Freescale updates, including Altivec support on the new e6500
    cores, new PCI controller support, and a pile of new boards support
    and updates.

    - The usual batch of trivial cleanups & fixes"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
    powerpc: Fix build error for book3e
    powerpc: Context switch the new EBB SPRs
    powerpc: Turn on the EBB H/FSCR bits
    powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
    powerpc: Setup BHRB instructions facility in HFSCR for POWER8
    powerpc: Fix interrupt range check on debug exception
    powerpc: Update tlbie/tlbiel as per ISA doc
    powerpc: Print page size info during boot
    powerpc: print both base and actual page size on hash failure
    powerpc: Fix hpte_decode to use the correct decoding for page sizes
    powerpc: Decode the pte-lp-encoding bits correctly.
    powerpc: Use encode avpn where we need only avpn values
    powerpc: Reduce PTE table memory wastage
    powerpc: Move the pte free routines from common header
    powerpc: Reduce the PTE_INDEX_SIZE
    powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
    powerpc: New hugepage directory format
    powerpc: Don't truncate pgd_index wrongly
    powerpc: Don't hard code the size of pte page
    powerpc: Save DAR and DSISR in pt_regs on MCE
    ...

    Linus Torvalds
     

02 May, 2013

2 commits

  • This adds the API for userspace to instantiate an XICS device in a VM
    and connect VCPUs to it. The API consists of a new device type for
    the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
    functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
    which is used to assert and deassert interrupt inputs of the XICS.

    The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
    Each attribute within this group corresponds to the state of one
    interrupt source. The attribute number is the same as the interrupt
    source number.

    This does not support irq routing or irqfd yet.

    Signed-off-by: Paul Mackerras
    Acked-by: David Gibson
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Pull networking updates from David Miller:
    "Highlights (1721 non-merge commits, this has to be a record of some
    sort):

    1) Add 'random' mode to team driver, from Jiri Pirko and Eric
    Dumazet.

    2) Make it so that any driver that supports configuration of multiple
    MAC addresses can provide the forwarding database add and del
    calls by providing a default implementation and hooking that up if
    the driver doesn't have an explicit set of handlers. From Vlad
    Yasevich.

    3) Support GSO segmentation over tunnels and other encapsulating
    devices such as VXLAN, from Pravin B Shelar.

    4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

    5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
    Dukkipati.

    6) In the PHY layer, allow supporting wake-on-lan in situations where
    the PHY registers have to be written for it to be configured.

    Use it to support wake-on-lan in mv643xx_eth.

    From Michael Stapelberg.

    7) Significantly improve firewire IPV6 support, from YOSHIFUJI
    Hideaki.

    8) Allow multiple packets to be sent in a single transmission using
    network coding in batman-adv, from Martin Hundebøll.

    9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

    10) Generalize the VXLAN forwarding tables so that there is more
    flexibility in configuring various aspects of the endpoints.
    From David Stevens.

    11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver,
    from Dmitry Kravkov.

    12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
    Neira Ayuso.

    13) Start adding networking selftests.

    14) In situations of overload on the same AF_PACKET fanout socket, or
    per-cpu packet receive queue, minimize drop by distributing the
    load to other cpus/fanouts. From Willem de Bruijn and Eric
    Dumazet.

    15) Add support for new payload offset BPF instruction, from Daniel
    Borkmann.

    16) Convert several drivers over to module_platform_driver(), from
    Sachin Kamat.

    17) Provide a minimal BPF JIT image disassembler userspace tool, from
    Daniel Borkmann.

    18) Rewrite F-RTO implementation in TCP to match the final
    specification of it in RFC4138 and RFC5682. From Yuchung Cheng.

    19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
    you like netlink, so I implemented netlink dumping of netlink
    sockets.") From Andrey Vagin.

    20) Remove ugly passing of rtnetlink attributes into rtnl_doit
    functions, from Thomas Graf.

    21) Allow userspace to be able to see if a configuration change occurs
    in the middle of an address or device list dump, from Nicolas
    Dichtel.

    22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
    Frederic Sowa.

    23) Increase accuracy of packet length used by packet scheduler, from
    Jason Wang.

    24) Beginning set of changes to make ipv4/ipv6 fragment handling more
    scalable and less susceptible to overload and locking contention,
    from Jesper Dangaard Brouer.

    25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
    instead. From Hong Zhiguo.

    26) Optimize route usage in IPVS by avoiding reference counting where
    possible, from Julian Anastasov.

    27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

    28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
    Eitzenberger.

    29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
    nfnetlink_log, and nfnetlink_queue. From Gao feng.

    30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

    31) Support several new r8169 chips, from Hayes Wang.

    32) Support tokenized interface identifiers in ipv6, from Daniel
    Borkmann.

    33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

    34) Add 802.1ad vlan offload support, from Patrick McHardy.

    35) Support mmap() based netlink communication, also from Patrick
    McHardy.

    36) Support HW timestamping in mlx4 driver, from Amir Vadai.

    37) Rationalize AF_PACKET packet timestamping when transmitting, from
    Willem de Bruijn and Daniel Borkmann.

    38) Bring parity to what's provided by /proc/net/packet socket dumping
    and the info provided by netlink socket dumping of AF_PACKET
    sockets. From Nicolas Dichtel.

    39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
    Poirier"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    filter: fix va_list build error
    af_unix: fix a fatal race with bit fields
    bnx2x: Prevent memory leak when cnic is absent
    bnx2x: correct reading of speed capabilities
    net: sctp: attribute printl with __printf for gcc fmt checks
    netlink: kconfig: move mmap i/o into netlink kconfig
    netpoll: convert mutex into a semaphore
    netlink: Fix skb ref counting.
    net_sched: act_ipt forward compat with xtables
    mlx4_en: fix a build error on 32bit arches
    Revert "bnx2x: allow nvram test to run when device is down"
    bridge: avoid OOPS if root port not found
    drivers: net: cpsw: fix kernel warn on cpsw irq enable
    sh_eth: use random MAC address if no valid one supplied
    3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
    tg3: fix to append hardware time stamping flags
    unix/stream: fix peeking with an offset larger than data in queue
    unix/dgram: fix peeking with an offset larger than data in queue
    unix/dgram: peek beyond 0-sized skbs
    openvswitch: Remove unneeded ovs_netdev_get_ifindex()
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

27 Apr, 2013

9 commits

  • This adds the ability for userspace to save and restore the state
    of the XICS interrupt presentation controllers (ICPs) via the
    KVM_GET/SET_ONE_REG interface. Since there is one ICP per vcpu, we
    simply define a new 64-bit register in the ONE_REG space for the ICP
    state. The state includes the CPU priority setting, the pending IPI
    priority, and the priority and source number of any pending external
    interrupt.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
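
To illustrate the commit above, here is a minimal sketch of how the four pieces of ICP state it lists could be packed into a single 64-bit ONE_REG value. The field names, widths, and bit positions are assumptions made for this example; the authoritative layout is the one defined in the kernel's KVM headers, which this log does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define ICP_CPPR_SHIFT 56          /* CPU priority setting, 8 bits */
#define ICP_XISR_SHIFT 32          /* source number of pending external irq, 24 bits */
#define ICP_MFRR_SHIFT 24          /* pending IPI priority, 8 bits */
#define ICP_PPRI_SHIFT 16          /* priority of pending external irq, 8 bits */
#define ICP_XISR_MASK  0xffffffu

/* Unpacked view of one vcpu's ICP state (hypothetical struct). */
struct icp_state {
    uint8_t  cppr;
    uint32_t xisr;
    uint8_t  mfrr;
    uint8_t  ppri;
};

/* Pack the state into one 64-bit register value. */
static uint64_t icp_pack(struct icp_state s)
{
    assert(s.xisr <= ICP_XISR_MASK);   /* source number must fit in 24 bits */
    return ((uint64_t)s.cppr << ICP_CPPR_SHIFT) |
           ((uint64_t)s.xisr << ICP_XISR_SHIFT) |
           ((uint64_t)s.mfrr << ICP_MFRR_SHIFT) |
           ((uint64_t)s.ppri << ICP_PPRI_SHIFT);
}

/* Recover the individual fields from a packed register value. */
static struct icp_state icp_unpack(uint64_t reg)
{
    struct icp_state s;
    s.cppr = (uint8_t)(reg >> ICP_CPPR_SHIFT);
    s.xisr = (uint32_t)(reg >> ICP_XISR_SHIFT) & ICP_XISR_MASK;
    s.mfrr = (uint8_t)(reg >> ICP_MFRR_SHIFT);
    s.ppri = (uint8_t)(reg >> ICP_PPRI_SHIFT);
    return s;
}
```

Because there is one ICP per vcpu, userspace saving and restoring a guest would read or write one such value per vcpu.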
     
  • For pseries machine emulation, in order to move the interrupt
    controller code to the kernel, we need to intercept some RTAS
    calls in the kernel itself. This adds an infrastructure to allow
    in-kernel handlers to be registered for RTAS services by name.
    A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
    associate token values with those service names. Then, when the
    guest requests an RTAS service with one of those token values, it
    will be handled by the relevant in-kernel handler rather than being
    passed up to userspace as at present.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Paul Mackerras
    [agraf: fix warning]
    Signed-off-by: Alexander Graf

    Michael Ellerman
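
The mechanism in the commit above can be sketched roughly as follows: handlers are registered by service name, userspace binds a token to a name, and a guest call is dispatched by token. This is a simplified userspace model, not the kernel code; every type, function, and the fixed-size table are assumptions made for the example, and the handlers are stubs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical handler signature: operates on the guest's argument buffer. */
typedef void (*rtas_handler_fn)(uint32_t *args);

struct rtas_handler {
    const char *name;          /* RTAS service name */
    rtas_handler_fn fn;
};

/* Stub handlers: they only write a "success" status into args[0]. */
static void stub_set_xive(uint32_t *args) { args[0] = 0; }
static void stub_get_xive(uint32_t *args) { args[0] = 0; }

/* Services that have an in-kernel handler, looked up by name. */
static const struct rtas_handler handlers[] = {
    { "ibm,set-xive", stub_set_xive },
    { "ibm,get-xive", stub_get_xive },
};

#define MAX_TOKENS 8
static struct {
    uint32_t token;
    const struct rtas_handler *handler;
} tokens[MAX_TOKENS];
static size_t ntokens;

/* Models the token-definition ioctl: bind a token value to a service name.
 * Returns 0 on success, -1 if the name is unknown or the table is full. */
static int rtas_define_token(const char *name, uint32_t token)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
        if (strcmp(handlers[i].name, name) == 0) {
            if (ntokens == MAX_TOKENS)
                return -1;
            tokens[ntokens].token = token;
            tokens[ntokens].handler = &handlers[i];
            ntokens++;
            return 0;
        }
    }
    return -1;
}

/* Models a guest RTAS call. Returns 1 if handled in-kernel,
 * 0 if the call should be passed up to userspace instead. */
static int rtas_dispatch(uint32_t token, uint32_t *args)
{
    for (size_t i = 0; i < ntokens; i++) {
        if (tokens[i].token == token) {
            tokens[i].handler->fn(args);
            return 1;
        }
    }
    return 0;
}
```

Tokens with no binding fall through to the existing userspace path, so defining no tokens leaves behaviour as it was before.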
     
  • Now that all pieces are in place for reusing generic irq infrastructure,
    we can copy x86's implementation of KVM_IRQ_LINE irq injection and simply
    reuse it for PPC, as it will work there just as well.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • Now that all the irq routing and irqfd pieces are generic, we can expose
    real irqchip support to all of KVM's internal helpers.

    This allows us to use irqfd with the in-kernel MPIC.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • Hook the MPIC code up to the KVM interfaces, add locking, etc.

    Signed-off-by: Scott Wood
    [agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
    Signed-off-by: Alexander Graf

    Scott Wood
     
  • EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
    in the presence of MAV 2.0. Emulate it now.

    Signed-off-by: Mihai Caraman
    Signed-off-by: Alexander Graf

    Mihai Caraman
     
  • Add support for TLBnPS registers available in MMU Architecture Version
    (MAV) 2.0.

    Signed-off-by: Mihai Caraman
    Signed-off-by: Alexander Graf

    Mihai Caraman
     
  • MMU registers were exposed to user-space using sregs interface. Add them
    to ONE_REG interface using kvmppc_get_one_reg/kvmppc_set_one_reg delegation
    mechanism.

    Signed-off-by: Mihai Caraman
    Signed-off-by: Alexander Graf

    Mihai Caraman
     
  • This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
    ioctl support. Follow-up patches will use this for setting up
    hardware breakpoints, watchpoints and software breakpoints.

    Also, kvm_arch_vcpu_ioctl_set_guest_debug() is moved one level down.
    This is because I am not sure what is required for book3s, so this
    ioctl's behaviour will not change for book3s.

    Signed-off-by: Bharat Bhushan
    Signed-off-by: Alexander Graf

    Bharat Bhushan