31 Oct, 2014

40 commits

  • Greg Kroah-Hartman
     
  • [ Upstream commit 06090e8ed89ea2113a236befb41f71d51f100e60 ]

    It is not sufficient to only implement get_user_pages_fast(), you
    must also implement the atomic version __get_user_pages_fast()
    otherwise you end up using the weak symbol fallback implementation
    which simply returns zero.

    This is dangerous, because it causes the futex code to loop forever
    if transparent hugepages are supported (see get_futex_key()).

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
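
The weak-symbol hazard described above can be illustrated with a minimal user-space sketch. The names `pin_pages_atomic` and `pin_one_page_with_retry` are hypothetical stand-ins, not kernel functions, and the retry bound exists only so the sketch terminates; `__attribute__((weak))` assumes GCC or Clang.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a generic weak fallback: it pins nothing and
 * reports zero pages. An architecture would have to supply a strong
 * definition with the same name to override it.
 */
__attribute__((weak)) int pin_pages_atomic(unsigned long start, int nr_pages)
{
    (void)start;
    (void)nr_pages;
    return 0;
}

/*
 * A caller that retries until exactly one page is pinned. The retry bound
 * is an assumption added so the sketch terminates; without it, a fallback
 * that always returns zero would make this loop spin forever.
 * Returns the attempt number that succeeded, or -1 if the bound was hit.
 */
int pin_one_page_with_retry(unsigned long addr, int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (pin_pages_atomic(addr, 1) == 1)
            return attempt;
    }
    return -1;
}
```

With only the weak definition linked in, every attempt fails, which is the shape of the endless retry the commit message warns about.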
     
  • [ Upstream commit ef3e035c3a9b81da8a778bc333d10637acf6c199 ]

    Meelis Roos reported that kernels built with gcc-4.9 do not boot, we
    eventually narrowed this down to only impacting machines using
    UltraSPARC-III and derivative cpus.

    The crash happens right when the first user process is spawned:

    [ 54.451346] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
    [ 54.451346]
    [ 54.571516] CPU: 1 PID: 1 Comm: init Not tainted 3.16.0-rc2-00211-gd7933ab #96
    [ 54.666431] Call Trace:
    [ 54.698453] [0000000000762f8c] panic+0xb0/0x224
    [ 54.759071] [000000000045cf68] do_exit+0x948/0x960
    [ 54.823123] [000000000042cbc0] fault_in_user_windows+0xe0/0x100
    [ 54.902036] [0000000000404ad0] __handle_user_windows+0x0/0x10
    [ 54.978662] Press Stop-A (L1-A) to return to the boot prom
    [ 55.050713] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004

    Further investigation showed that compiling only per_cpu_patch() with
    an older compiler fixes the boot.

    Detailed analysis showed that the function is not being miscompiled by
    gcc-4.9, but it is using a different register allocation ordering.

    With the gcc-4.9 compiled function, something during the code patching
    causes some of the %i* input registers to get corrupted. Perhaps
    we have a TLB miss path into the firmware that is deep enough to
    cause a register window spill and subsequent restore when we get
    back from the TLB miss trap.

    Let's plug this up by doing two things:

    1) Stop using the firmware stack for client interface calls into
    the firmware. Just use the kernel's stack.

    2) As soon as we can, call into a new function "start_early_boot()"
    to put a one-register-window buffer between the firmware's
    deepest stack frame and the top-most initial kernel one.

    Reported-by: Meelis Roos
    Tested-by: Meelis Roos
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 1cef94c36bd4d79b5ae3a3df99ee0d76d6a4a6dc ]

    This is the longest boot string that silo supports.

    Signed-off-by: Dave Kleikamp
    Cc: Bob Picco
    Cc: David S. Miller
    Cc: sparclinux@vger.kernel.org
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • [ Upstream commit d195b71bad4347d2df51072a537f922546a904f1 ]

    swapper_low_pmd_dir and swapper_pud_dir are actually completely
    useless and unnecessary.

    We just need swapper_pg_dir[]. Naturally the other page table chunks
    will be allocated on an as-needed basis. Since the kernel actually
    accesses these tables in the PAGE_OFFSET view, there is not even a TLB
    locality advantage of placing them in the kernel image.

    Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is
    naturally page aligned.

    Increase MAX_BANKS to 1024 in order to handle heavily fragmented
    virtual guests.

    Even with this MAX_BANKS increase, the kernel is 20K+ smaller.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit ee6a9333fa58e11577c1b531b8e0f5ffc0fd6f50 ]

    This patch attempts to do a few things. The highlights are: 1) enable
    SPARSE_IRQ unconditionally, 2) kills off !SPARSE_IRQ code 3) allocates
    ivector_table at boot time and 4) default to cookie only VIRQ mechanism
    for supported firmware. The first firmware with cookie only support for
    me appears on T5. You can optionally force the HV firmware into
    non-cookie-only mode, which is the sysino support.

    The sysino is a deprecated HV mechanism according to the most recent
    SPARC Virtual Machine Specification. HV_GRP_INTR is what controls the
    cookie/sysino firmware versioning.

    The history of this interface is:

    1) Major version 1.0 only supported sysino based interrupt interfaces.

    2) Major version 2.0 added cookie based VIRQs, however due to the fact
    that OSs were using the VIRQs without negotiating major version
    2.0 (Linux and Solaris are both guilty), the VIRQs calls were
    allowed even with major version 1.0

    To complicate things even further, the VIRQ interfaces were only
    actually hooked up in the hypervisor for LDC interrupt sources.
    VIRQ calls on other device types would result in HV_EINVAL errors.

    So effectively, major version 2.0 is unusable.

    3) Major version 3.0 was created to signal use of VIRQs and the fact
    that the hypervisor has these calls hooked up for all interrupt
    sources, not just those for LDC devices.

    A new boot option is provided should cookie only HV support have issues.
    hvirq - this is the version for HV_GRP_INTR. This is related to HV API
    versioning. The code attempts major=3 first by default. The option can
    be used to override this default.

    I've tested with SPARSE_IRQ on T5-8, M7-4 and T4-X and Jalapeño.

    Signed-off-by: Bob Picco
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    bob picco
     
  • [ Upstream commit bb4e6e85daa52a9f6210fa06a5ec6269598a202b ]

    In order to accommodate embedded per-cpu allocation with large numbers
    of cpus and numa nodes, we have to use as much virtual address space
    as possible for the vmalloc region. Otherwise we can get things like:

    PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000

    So, once we select a value for PAGE_OFFSET, derive the size of the
    vmalloc region based upon that.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
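
The PERCPU message quoted above comes down to comparing the span of the per-cpu areas against the available vmalloc space. A minimal sketch of that comparison follows; the function name is hypothetical, and the 75% threshold is an assumption modeled on the generic per-cpu setup code rather than something stated in this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the distance between the lowest and highest per-cpu
 * areas fits inside the vmalloc region.
 * Assumption: the limit is three quarters of the vmalloc size.
 */
bool percpu_distance_fits(uint64_t max_distance, uint64_t vmalloc_total)
{
    assert(vmalloc_total > 0);
    /* Divide before multiplying so the intermediate value cannot overflow. */
    return max_distance <= vmalloc_total / 4 * 3;
}
```

Plugging in the two values from the message (`0x380001c10000` and `0xff00000000`) shows the distance exceeding the whole region, which is why the region size has to grow with PAGE_OFFSET.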
     
  • Make sure, at compile time, that the kernel can properly support
    whatever MAX_PHYS_ADDRESS_BITS is defined to.

    On M7 chips, use a max_phys_bits value of 49.

    Based upon a patch by Bob Picco.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit c06240c7f5c39c83dfd7849c0770775562441b96 ]

    For sparse memory configurations, the vmemmap array behaves terribly
    and it takes up an inordinate amount of space in the BSS section of
    the kernel image unconditionally.

    Just build huge PMDs and look them up just like we do for TLB misses
    in the vmalloc area.

    Kernel BSS shrinks by about 2MB.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 0dd5b7b09e13dae32869371e08e1048349fd040c ]

    If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like
    DEBUG_PAGEALLOC stop working because the 3-level page tables only
    can cover up to 43 bits.

    Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
    47, several statically allocated tables became enormous.

    Compounding this is that we will need to support up to 49 bits of
    physical addressing for M7 chips.

    The two tables in question are sparc64_valid_addr_bitmap and
    kpte_linear_bitmap.

    The first holds a bitmap, with 1 bit for each 4MB chunk of physical
    memory, indicating whether that chunk actually exists in the machine
    and is valid.

    The second table is a set of 2-bit values which tell how large of a
    mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
    chunk of ram in the system.

    These tables are huge and take up an enormous amount of the BSS
    section of the sparc64 kernel image. Specifically, the
    sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.

    So let's solve the space wastage and the DEBUG_PAGEALLOC problem
    at the same time, by using the kernel page tables (as designed) to
    manage this information.

    We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
    and we do this by encoding huge PMDs and PUDs.

    On a T4-2 with 256GB of ram the kernel page table takes up 16K with
    DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this
    memory is dynamically allocated at run time rather than coded
    statically into the kernel image.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
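
The table sizes quoted above follow directly from the chunk sizes and the number of physical address bits. A small sketch of the arithmetic is below; the function name is hypothetical, and the chunk shifts (22 for 4MB, 28 for 256MB) are derived from the sizes given in the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a statically sized table that stores bits_per_chunk
 * bits for every chunk of 2^chunk_shift bytes across a physical address
 * space of 2^phys_bits bytes.
 */
uint64_t static_table_bytes(unsigned phys_bits, unsigned chunk_shift,
                            unsigned bits_per_chunk)
{
    assert(phys_bits > chunk_shift && phys_bits < 64);
    uint64_t chunks = UINT64_C(1) << (phys_bits - chunk_shift);
    return chunks * bits_per_chunk / 8;
}
```

With 47 physical address bits this gives 4MB for the 1-bit-per-4MB bitmap and 128K for the 2-bits-per-256MB table, matching the figures above. At 49 bits the same formula yields 16MB and 512K, which illustrates why static allocation stops scaling.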
     
  • [ Upstream commit 8c82dc0e883821c098c8b0b130ffebabf9aab5df ]

    As currently coded the KTSB accesses in the kernel only support up to
    47 bits of physical addressing.

    Adjust the instruction and patching sequence in order to support
    arbitrary 64 bits addresses.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 4397bed080598001e88f612deb8b080bb1cc2322 ]

    Now that we use 4-level page tables, we can provide up to 53-bits of
    virtual address space to the user.

    Adjust the VA hole based upon the capabilities of the cpu type probed.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit ac55c768143aa34cc3789c4820cbb0809a76fd9c ]

    This has become necessary with chips that support more than 43-bits
    of physical addressing.

    Based almost entirely upon a patch by Bob Picco.

    Signed-off-by: David S. Miller
    Acked-by: Bob Picco
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • The T5 (niagara5) has different PCR related HV fast trap values and a new
    HV API Group. This patch utilizes these and shares when possible with niagara4.

    We use the same sparc_pmu niagara4_pmu. Should there be new effort to
    obtain the MCU perf statistics then this would have to be changed.

    Cc: sparclinux@vger.kernel.org
    Signed-off-by: Bob Picco
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    bob picco
     
  • Signed-off-by: Allen Pais
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Allen Pais
     
  • Add M6 and M7 chip type in cpumap.c to correctly build CPU distribution map that spans all online CPUs.

    Signed-off-by: Allen Pais
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Allen Pais
     
  • The following patch adds support for correctly
    recognising M6 and M7 cpu type.

    Signed-off-by: Allen Pais
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Allen Pais
     
  • We changed PAGE_OFFSET to be a variable rather than a constant,
    but this reference here in the hibernate assembler got missed.

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit e2653143d7d79a49f1a961aeae1d82612838b12c ]

    This breaks the stack end corruption detection facility.

    What that facility does is write a magic value to "end_of_stack()"
    and check to see if it gets overwritten.

    "end_of_stack()" is "task_thread_info(p) + 1", which for sparc64 is
    the beginning of the FPU register save area.

    So once the user uses the FPU, the magic value is overwritten and the
    debug checks trigger.

    Fix this by making the size explicit.

    Due to the size we use for the fpsaved[], gsr[], and xfsr[] arrays we
    are limited to 7 levels of FPU state saves. So each FPU register set
    is 256 bytes, allocate 256 * 7 for the fpregs area.

    Reported-by: Meelis Roos
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
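
The layout issue above can be sketched with a simplified structure. Everything here is hypothetical (`thread_info_sketch` and its fields are not the real sparc64 definitions); only the 256-byte register set and the 7 save levels come from the text. The point is that once the save area has an explicit size, "structure pointer + 1" lands past it rather than on top of it.

```c
#include <assert.h>
#include <stddef.h>

#define FPU_REGSET_BYTES 256 /* one FPU register set, per the text */
#define FPU_SAVE_LEVELS  7   /* limit imposed by the per-level arrays */

/* Hypothetical, heavily simplified stand-in for the per-thread structure. */
struct thread_info_sketch {
    unsigned long flags;
    unsigned char fpsaved[FPU_SAVE_LEVELS];
    unsigned long gsr[FPU_SAVE_LEVELS];
    unsigned long xfsr[FPU_SAVE_LEVELS];
    /* Explicitly sized save area: 256 * 7 bytes. */
    unsigned long fpregs[FPU_REGSET_BYTES * FPU_SAVE_LEVELS /
                         sizeof(unsigned long)];
};

/* Mirrors the "task_thread_info(p) + 1" idea: first address past the struct. */
unsigned long *end_of_stack_sketch(struct thread_info_sketch *ti)
{
    assert(ti != NULL);
    return (unsigned long *)(ti + 1);
}

/* Size of the FPU save area in bytes. */
size_t fpregs_area_bytes(void)
{
    return sizeof(((struct thread_info_sketch *)0)->fpregs);
}
```

In this sketch the magic value written at `end_of_stack_sketch()` sits after the 1792-byte save area, so FPU saves no longer overwrite it.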
     
  • [ Upstream commit f4da3628dc7c32a59d1fb7116bb042e6f436d611 ]

    The AES loops in arch/sparc/crypto/aes_glue.c use a scheme where the
    key material is preloaded into the FPU registers, and then we loop
    over and over doing the crypt operation, reusing those pre-cooked key
    registers.

    There are intervening blkcipher*() calls between the crypt operation
    calls. And those might perform memcpy() and thus also try to use the
    FPU.

    The sparc64 kernel FPU usage mechanism is designed to allow such
    recursive uses, but with a catch.

    There has to be a trap between the two FPU using threads of control.

    The mechanism works by, when the FPU is already in use by the kernel,
    allocating a slot for FPU saving at trap time. Then if, within the
    trap handler, we try to use the FPU registers, the pre-trap FPU
    register state is saved into the slot. Then at trap return time we
    notice this and restore the pre-trap FPU state.

    Over the long term there are various more involved ways we can make
    this work, but for a quick fix let's take advantage of the fact that
    the situation where this happens is very limited.

    All sparc64 chips that support the crypto instructions also are using
    the Niagara4 memcpy routine, and that routine only uses the FPU for
    large copies where we can't get the source aligned properly to a
    multiple of 8 bytes.

    We look to see if the FPU is already in use in this context, and if so
    we use the non-large copy path which only uses integer registers.

    Furthermore, we also limit this special logic to when we are doing
    kernel copy, rather than a user copy.

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
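
The path selection described above can be summarised as a small decision function. This is a sketch only: the real logic lives in the assembler memcpy routine, and the function name, the enum, and the `large_threshold` parameter are hypothetical illustrations, since the text does not give the size cutoff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum copy_path { COPY_INTEGER, COPY_FPU };

/*
 * Decide which copy path to take.
 * - The FPU path is only interesting for large copies whose source cannot
 *   be aligned to a multiple of 8 bytes.
 * - If the FPU is already in use in this context and this is a kernel
 *   copy, fall back to the integer-only path.
 */
enum copy_path choose_copy_path(size_t len, bool src_8byte_aligned,
                                bool fpu_in_use, bool kernel_copy,
                                size_t large_threshold)
{
    assert(large_threshold > 0);
    bool wants_fpu = len >= large_threshold && !src_8byte_aligned;

    if (wants_fpu && kernel_copy && fpu_in_use)
        return COPY_INTEGER;
    return wants_fpu ? COPY_FPU : COPY_INTEGER;
}
```

User copies are left unchanged in this sketch, matching the statement that the special logic is limited to kernel copies.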
     
  • [ Upstream commit bdcf81b658ebc4c2640c3c2c55c8b31c601b6996 ]

    Inconsistently, the raw_* IRQ routines do not interact with and update
    the irqflags tracing and lockdep state, whereas the raw_* spinlock
    interfaces do.

    This causes problems in p1275_cmd_direct() because we disable hardirqs
    by hand using raw_local_irq_restore() and then do a raw_spin_lock()
    which triggers a lockdep trace because the CPU's hw IRQ state doesn't
    match IRQ tracing's internal software copy of that state.

    The CPU's irqs are disabled, yet current->hardirqs_enabled is true.

    ====================
    reboot: Restarting system
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3536 check_flags+0x7c/0x240()
    DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
    Modules linked in: openpromfs
    CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: G W 3.17.0-dirty #145
    Call Trace:
    [000000000045919c] warn_slowpath_common+0x5c/0xa0
    [0000000000459210] warn_slowpath_fmt+0x30/0x40
    [000000000048f41c] check_flags+0x7c/0x240
    [0000000000493280] lock_acquire+0x20/0x1c0
    [0000000000832b70] _raw_spin_lock+0x30/0x60
    [000000000068f2fc] p1275_cmd_direct+0x1c/0x60
    [000000000068ed28] prom_reboot+0x28/0x40
    [000000000043610c] machine_restart+0x4c/0x80
    [000000000047d2d4] kernel_restart+0x54/0x80
    [000000000047d618] SyS_reboot+0x138/0x200
    [00000000004060b4] linux_sparc_syscall32+0x34/0x60
    ---[ end trace 5c439fe81c05a100 ]---
    possible reason: unannotated irqs-off.
    irq event stamp: 2010267
    hardirqs last enabled at (2010267): [] vprintk_emit+0x4b8/0x580
    hardirqs last disabled at (2010266): [] vprintk_emit+0x68/0x580
    softirqs last enabled at (2010046): [] __do_softirq+0x378/0x4a0
    softirqs last disabled at (2010039): [] do_softirq_own_stack+0x28/0x40
    Resetting ...
    ====================

    Use local_* variants of the hw IRQ interfaces so that IRQ tracing sees
    all of our changes.

    Reported-by: Meelis Roos
    Tested-by: Meelis Roos
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 473ad7f4fb005d1bb727e4ef27d370d28703a062 ]

    When we have to split up a flush request into multiple pieces
    (in order to avoid the firmware range) we don't specify the
    arguments in the right order for the second piece.

    Fix the order, or else we get hangs as the code tries to
    flush "a lot" of entries and we get lockups like this:

    [ 4422.981276] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [expect:117032]
    [ 4422.996130] Modules linked in: ipv6 loop usb_storage igb ptp sg sr_mod ehci_pci ehci_hcd pps_core n2_rng rng_core
    [ 4423.016617] CPU: 12 PID: 117032 Comm: expect Not tainted 3.17.0-rc4+ #1608
    [ 4423.030331] task: fff8003cc730e220 ti: fff8003d99d54000 task.ti: fff8003d99d54000
    [ 4423.045282] TSTATE: 0000000011001602 TPC: 00000000004521e8 TNPC: 00000000004521ec Y: 00000000 Not tainted
    [ 4423.064905] TPC:
    [ 4423.074964] g0: 000000000052fd10 g1: 00000001295a8000 g2: ffffff7176ffc000 g3: 0000000000002000
    [ 4423.092324] g4: fff8003cc730e220 g5: fff8003dfedcc000 g6: fff8003d99d54000 g7: 0000000000000006
    [ 4423.109687] o0: 0000000000000000 o1: 0000000000000000 o2: 0000000000000003 o3: 00000000f0000000
    [ 4423.127058] o4: 0000000000000080 o5: 00000001295a8000 sp: fff8003d99d56d01 ret_pc: 000000000052ff54
    [ 4423.145121] RPC:
    [ 4423.155185] l0: 0000000000000000 l1: 0000000000000000 l2: 0000000000a38040 l3: 0000000000000000
    [ 4423.172559] l4: fff8003dae8965e0 l5: ffffffffffffffff l6: 0000000000000000 l7: 00000000f7e2b138
    [ 4423.189913] i0: fff8003d99d576a0 i1: fff8003d99d576a8 i2: fff8003d99d575e8 i3: 0000000000000000
    [ 4423.207284] i4: 0000000000008008 i5: fff8003d99d575c8 i6: fff8003d99d56df1 i7: 0000000000530c24
    [ 4423.224640] I7:
    [ 4423.234193] Call Trace:
    [ 4423.239051] [0000000000530c24] free_vmap_area_noflush+0x64/0x80
    [ 4423.251029] [0000000000531a7c] remove_vm_area+0x5c/0x80
    [ 4423.261628] [0000000000531b80] __vunmap+0x20/0x120
    [ 4423.271352] [000000000071cf18] n_tty_close+0x18/0x40
    [ 4423.281423] [00000000007222b0] tty_ldisc_close+0x30/0x60
    [ 4423.292183] [00000000007225a4] tty_ldisc_reinit+0x24/0xa0
    [ 4423.303120] [0000000000722ab4] tty_ldisc_hangup+0xd4/0x1e0
    [ 4423.314232] [0000000000719aa0] __tty_hangup+0x280/0x3c0
    [ 4423.324835] [0000000000724cb4] pty_close+0x134/0x1a0
    [ 4423.334905] [000000000071aa24] tty_release+0x104/0x500
    [ 4423.345316] [00000000005511d0] __fput+0x90/0x1e0
    [ 4423.354701] [000000000047fa54] task_work_run+0x94/0xe0
    [ 4423.365126] [0000000000404b44] __handle_signal+0xc/0x2c

    Fixes: 4ca9a23765da ("sparc64: Guard against flushing openfirmware mappings.")
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
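
The splitting of a flush request around a protected range can be sketched as follows. The range bounds `FW_LOW` and `FW_HI`, the log structure, and the function names are hypothetical and chosen for illustration; the sketch records the pieces instead of flushing anything, and an assertion catches the swapped-argument mistake the commit describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bounds of the firmware-owned range that must not be flushed. */
#define FW_LOW UINT64_C(0x00000000f0000000)
#define FW_HI  UINT64_C(0x0000000100000000)

struct flush_log {
    int count;
    uint64_t start[2];
    uint64_t end[2];
};

/* Record one [start, end) piece; reversed arguments trip the assertion. */
static void record_flush(struct flush_log *log, uint64_t start, uint64_t end)
{
    assert(start < end);
    assert(log->count < 2);
    log->start[log->count] = start;
    log->end[log->count] = end;
    log->count++;
}

/* Flush [start, end), skipping whatever overlaps [FW_LOW, FW_HI). */
void flush_range_avoiding_firmware(struct flush_log *log,
                                   uint64_t start, uint64_t end)
{
    if (start < FW_HI && end > FW_LOW) {
        if (start < FW_LOW)
            record_flush(log, start, FW_LOW);
        if (end > FW_HI)
            record_flush(log, FW_HI, end); /* second piece: (FW_HI, end) */
    } else {
        record_flush(log, start, end);
    }
}
```

Passing the second piece as `(end, FW_HI)` instead would describe a range whose start is above its end, which a length computed as an unsigned difference would turn into a very large flush.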
     
  • [ Upstream commit 35607b02dbef304fa5037236a3b43c1d8ab2aa52 ]

    - fix BPF_LD|ABS|IND from negative offsets:
    make sure to sign extend lower 32 bits in 64-bit register
    before calling C helpers from JITed code, otherwise 'int k'
    argument of bpf_internal_load_pointer_neg_helper() function
    will be added as large unsigned integer, causing packet size
    check to trigger and abort the program.

    It's worth noting that JITed code for 'A = A op K' will affect
    upper 32 bits differently depending whether K is simm13 or not.
    Since small constants are sign extended, whereas large constants
    are stored in temp register and zero extended.
    That is ok and we don't have to pay a penalty of sign extension
    for every sethi, since all classic BPF instructions have 32-bit
    semantics and we only need to set correct upper bits when
    transitioning from JITed code into C.

    - though instructions 'A &= 0' and 'A *= 0' are odd, JIT compiler
    should not optimize them out

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
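
The sign-extension problem with negative offsets can be shown in isolation. The two helper names below are hypothetical; they model what a C helper taking an `int` would see depending on how the upper 32 bits of a 64-bit register were prepared. The narrowing cast assumes the usual two's-complement behaviour of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Offset as seen when the low 32 bits are sign extended before the call. */
int64_t offset_sign_extended(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* Offset as seen when the low 32 bits are merely zero extended. */
int64_t offset_zero_extended(uint64_t reg)
{
    return (int64_t)(uint32_t)reg;
}
```

For a register holding `0xfffffffc` in its low half, the sign-extended view is -4 while the zero-extended view is 4294967292, large enough to fail a packet size check.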
     
  • [ Upstream commit f6f2332dce0efeea8c5653b6e9d1e8c379ace65c ]

    fix several issues in sparc BPF JIT compiler.

    ldx/stx related:
    . classic BPF instructions that access mem[] slots were not setting
    SEEN_MEM flag, so stack wasn't allocated. Fix that by advertising
    correct flags

    . LDX/STX instructions were missing SEEN_XREG, so register value
    could have leaked to user space. Fix it.

    . since stack for mem[] slots is allocated with 'sub %sp' instead
    of 'save %sp', use %sp as base register instead of %fp.

    . ldx mem[0] means first slot in classic BPF which should have
    -4 offset instead of 0.

    . sparc64 needs 2047 stack bias as per ABI to access stack

    . emit_stmem() was using LD32I macro instead of ST32I

    SKF_AD_VLAN_TAG* related:
    . SKF_AD_VLAN_TAG_PRESENT must return 1 or 0 instead of '> 0' or 0
    as per classic BPF de facto standard

    . SKF_AD_VLAN_TAG needs to mask the field correctly

    Fixes: 2809a2087cc4 ("net: filter: Just In Time compiler for sparc")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
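
The two SKF_AD_VLAN_TAG* points above amount to normalising a flag to 0 or 1 and masking it out of the tag value. A minimal sketch follows; the bit position `0x1000` and both function names are assumptions made for illustration, not values taken from this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the "tag present" marker occupies this bit of the stored TCI. */
#define SKETCH_VLAN_TAG_PRESENT 0x1000u

/* Returns exactly 1 or 0, never some other non-zero value. */
uint32_t vlan_tag_present_sketch(uint16_t tci)
{
    return !!(tci & SKETCH_VLAN_TAG_PRESENT);
}

/* Returns the tag with the internal "present" marker masked out. */
uint32_t vlan_tag_sketch(uint16_t tci)
{
    return tci & ~SKETCH_VLAN_TAG_PRESENT & 0xffffu;
}
```

Returning the raw masked bit (here `0x1000`) instead of 1 would still be "true" in C, but a filter comparing the result against 1 would not match.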
     
  • [ Upstream commit 74cad25c076a2f5253312c2fe82d1a4daecc1323 ]

    This makes memset follow the standard (instead of returning 0 on success). This
    is needed when certain versions of gcc optimize around memset calls and assume
    that the address argument is preserved in %o0.

    Signed-off-by: Andreas Larsson
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andreas Larsson
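
The standard contract referred to above is that memset returns its destination pointer. A byte-at-a-time sketch of a conforming implementation is shown below; `memset_sketch` is a hypothetical name, and the real routine is optimised assembler.

```c
#include <assert.h>
#include <stddef.h>

/* Fill count bytes at dest with value and return dest, as the C standard requires. */
void *memset_sketch(void *dest, int value, size_t count)
{
    assert(dest != NULL || count == 0);
    unsigned char *p = dest;

    while (count--)
        *p++ = (unsigned char)value;
    return dest;
}
```

A compiler is entitled to reuse the return value in place of the original pointer, so an implementation that returns 0 can turn later accesses into NULL dereferences.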
     
  • [ Upstream commit c21c4ab0d6921f7160a43216fa6973b5924de561 ]

    The request_irq() needs to be done from ldc_alloc()
    to avoid the following (caught by lockdep)

    [00000000004a0738] __might_sleep+0xf8/0x120
    [000000000058bea4] kmem_cache_alloc_trace+0x184/0x2c0
    [00000000004faf80] request_threaded_irq+0x80/0x160
    [000000000044f71c] ldc_bind+0x7c/0x220
    [0000000000452454] vio_port_up+0x54/0xe0
    [00000000101f6778] probe_disk+0x38/0x220 [sunvdc]
    [00000000101f6b8c] vdc_port_probe+0x22c/0x300 [sunvdc]
    [0000000000451a88] vio_device_probe+0x48/0x60
    [000000000074c56c] really_probe+0x6c/0x300
    [000000000074c83c] driver_probe_device+0x3c/0xa0
    [000000000074c92c] __driver_attach+0x8c/0xa0
    [000000000074a6ec] bus_for_each_dev+0x6c/0xa0
    [000000000074c1dc] driver_attach+0x1c/0x40
    [000000000074b0fc] bus_add_driver+0xbc/0x280

    Signed-off-by: Sowmini Varadhan
    Acked-by: Dwight Engen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sowmini Varadhan
     
  • [ Upstream commit 3dee9df54836d5f844f3d58281d3f3e6331b467f ]

    We have seen an issue with guest boot into LDOM that causes early boot failures
    because of no matching rules for node identity of the memory. I analyzed this
    on my T4 and concluded there might not be a solution. I saw the issue in
    mainline too when booting into the control/primary domain - with guests
    configured. Note, this could be a firmware bug on some older machines.

    I'll provide a full explanation of the issues below. Should we not find a
    matching BEST latency group for a real address (RA) then we will assume node 0.
    On the T4-2 here with the information provided I can't see an alternative.

    Technically the LDOM shown below should match the MBLOCK to the
    favorable latency group. However other factors must be considered too. Were
    the memory controllers configured "fine" grained interleave or "coarse"
    grain interleaved - T4. Also should a "group" MD node be considered a NUMA
    node?

    There has to be at least one Machine Description (MD) "group" and hence one
    NUMA node. The group can have one or more latency groups (lg) - more than one
    memory controller. The current code chooses the smallest latency as the most
    favorable per group. The latency and lg information is in MLGROUP below.
    MBLOCK is the base and size of the RAs for the machine as fetched from OBP
    /memory "available" property. My machine has one MBLOCK but more would be
    possible - with holes?

    For a T4-2 the following information has been gathered:
    with LDOM guest
    MEMBLOCK configuration:
    memory size = 0x27f870000
    memory.cnt = 0x3
    memory[0x0] [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes
    memory[0x1] [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes
    memory[0x2] [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes
    reserved.cnt = 0x2
    reserved[0x0] [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes
    reserved[0x1] [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes
    MBLOCK[0]: base[20000000] size[280000000] offset[0]
    (note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values)
    (note: (RA + offset) & mask = val is the formula to detect a match for the
    memory controller. should there be no match for find_node node, a return
    value of -1 resulted for the node - BAD)

    There is one group. It has these forward links
    MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000]
    MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000]
    NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8])
    (note: "val" is the best lg's (smallest latency) "match")

    no LDOM guest - bare metal
    MEMBLOCK configuration:
    memory size = 0xfdf2d0000
    memory.cnt = 0x3
    memory[0x0] [0x00000020400000-0x00000fff6adfff], 0xfdf2ae000 bytes
    memory[0x1] [0x00000fff6d2000-0x00000fff6e7fff], 0x16000 bytes
    memory[0x2] [0x00000fff766000-0x00000fff771fff], 0xc000 bytes
    reserved.cnt = 0x2
    reserved[0x0] [0x00000020800000-0x00000021a04580], 0x1204581 bytes
    reserved[0x1] [0x00000024800000-0x0000002c7d29fc], 0x7fd29fd bytes
    MBLOCK[0]: base[20000000] size[fe0000000] offset[0]

    there are two groups
    group node[16d5]
    MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000]
    MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000]
    NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8])
    group node[171d]
    MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000]
    MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000]
    NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8])
    (note: for this two "group" bare metal machine, 1/2 memory is in group one's
    lg and 1/2 memory is in group two's lg).

    Cc: sparclinux@vger.kernel.org
    Signed-off-by: Bob Picco
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    bob picco
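
The matching rule quoted above, `(RA + offset) & mask = val`, and the new fallback to node 0 can be sketched as follows. The structure and function names are hypothetical; the mask, val, and RA values in the usage note are the ones listed in the T4-2 dumps.

```c
#include <assert.h>
#include <stdint.h>

/* One NUMA node's best latency group, reduced to its match rule. */
struct node_match {
    uint64_t mask;
    uint64_t val;
};

/* Returns the index of the first matching node, or -1 when none matches. */
int find_node_raw(const struct node_match *nodes, int nr_nodes,
                  uint64_t ra, uint64_t offset)
{
    assert(nodes != 0 && nr_nodes > 0);
    for (int i = 0; i < nr_nodes; i++) {
        if (((ra + offset) & nodes[i].mask) == nodes[i].val)
            return i;
    }
    return -1;
}

/* Same lookup, but assume node 0 when no rule matches. */
int find_node_with_fallback(const struct node_match *nodes, int nr_nodes,
                            uint64_t ra, uint64_t offset)
{
    int nid = find_node_raw(nodes, nr_nodes, ra, offset);
    return nid < 0 ? 0 : nid;
}
```

In the LDOM guest dump there is a single node with mask and val both `0x200000000`, so the start of memory at RA `0x20400000` matches nothing and previously yielded -1. On bare metal the two nodes (val `0` and val `0x200000000`) cover both cases.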
     
  • [ Upstream commit 84bd6d8b9c0f06b3f188efb479c77e20f05e9a8a ]

    Every path that ends up at do_sparc64_fault() must install a valid
    FAULT_CODE_* bitmask in the per-thread fault code byte.

    Two paths leading to the label winfix_trampoline (which expects the
    FAULT_CODE_* mask in register %g4) were not doing so:

    1) For pre-hypervisor TLB protection violation traps, if we took
    the 'winfix_trampoline' path we wouldn't have %g4 initialized
    with the FAULT_CODE_* value yet. Resulting in using the
    TLB_TAG_ACCESS register address value instead.

    2) In the TSB miss path, when we notice that we are going to use a
    hugepage mapping, but we haven't allocated the hugepage TSB yet, we
    still have to take the window fixup case into consideration and
    in that particular path we leave %g4 not setup properly.

    Errors of this sort were largely invisible previously, but after
    commit 4ccb9272892c33ef1c19a783cfa87103b30c2784 ("sparc64: sun4v TLB
    error power off events") we now have a fault_code mask bit
    (FAULT_CODE_BAD_RA) that triggers due to this bug.

    FAULT_CODE_BAD_RA triggers because this bit is set in TLB_TAG_ACCESS
    (see #1 above) and thus we get seemingly random bus errors triggered
    for user processes.

    Fixes: 4ccb9272892c ("sparc64: sun4v TLB error power off events")
    Reported-by: Meelis Roos
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 4ccb9272892c33ef1c19a783cfa87103b30c2784 ]

    We've witnessed a few TLB events causing the machine to power off because
    of prom_halt. In one case it was some nfs related area during rmmod. Another
    was an mmapper of /dev/mem. A more recent one is an ITLB issue with
    a bad pagesize which could be a hardware bug. Bugs happen but we should
    attempt to not power off the machine and/or hang it when possible.

    This is a DTLB error from an mmapper of /dev/mem:
    [root@sparcie ~]# SUN4V-DTLB: Error at TPC[fffff80100903e6c], tl 1
    SUN4V-DTLB: TPC
    SUN4V-DTLB: O7[fffff801081979d0]
    SUN4V-DTLB: O7
    SUN4V-DTLB: vaddr[fffff80100000000] ctx[1250] pte[98000000000f0610] error[2]
    .

    This is recent mainline for ITLB:
    [ 3708.179864] SUN4V-ITLB: TPC
    [ 3708.188866] SUN4V-ITLB: O7[fffffc010071cee8]
    [ 3708.197377] SUN4V-ITLB: O7
    [ 3708.206539] SUN4V-ITLB: vaddr[e0003] ctx[1a3c] pte[2900000dcc800eeb] error[4]
    .

    Normally sun4v_itlb_error_report() and sun4v_dtlb_error_report() would call
    prom_halt() and drop us to OF command prompt "ok". This isn't the case for
    LDOMs and the machine powers off.

    For the HV reported error of HV_ENORADDR for HV HV_MMU_MAP_ADDR_TRAP we cause
    a SIGBUS error by qualifying it within do_sparc64_fault() for fault code mask
    of FAULT_CODE_BAD_RA. This is done when trap level (%tl) is less than or
    equal to one ("1"). Otherwise, for %tl > 1, we proceed eventually to die_if_kernel().

    The logic of this patch was partially inspired by David Miller's feedback.

    Power off of large sparc64 machines is painful. Plus die_if_kernel provides
    more context. A reset sequence isn't a brief period on large sparc64 but
    better than power-off/power-on sequence.

    Cc: sparclinux@vger.kernel.org
    Signed-off-by: Bob Picco
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    bob picco
     
  • [ Upstream commit d1105287aabe88dbb3af825140badaa05cf0442c ]

    dma_zalloc_coherent() calls dma_alloc_coherent(__GFP_ZERO)
    but the sparc32 implementations sbus_alloc_coherent() and
    pci32_alloc_coherent() don't take the gfp flags into
    account.

    Tested on the SPARC32/LEON GRETH Ethernet driver which fails
    due to dma_alloc_coherent(__GFP_ZERO) returning non-zeroed
    pages.

    Signed-off-by: Daniel Hellstrom
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Hellstrom
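
What it means for an allocator to honour a zeroing flag can be sketched in user space. The flag `SKETCH_GFP_ZERO` and the function name are hypothetical, `malloc` stands in for the page allocator, and the `0xAA` fill simulates stale page contents so the difference is observable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag standing in for a "zero the memory" request. */
#define SKETCH_GFP_ZERO 0x1u

/*
 * Allocate size bytes. The 0xAA fill simulates whatever stale data the
 * underlying pages may contain; the memory is cleared only if the caller
 * asked for it.
 */
void *alloc_coherent_sketch(size_t size, unsigned flags)
{
    assert(size > 0);
    void *p = malloc(size);

    if (!p)
        return NULL;
    memset(p, 0xAA, size);          /* simulated stale contents */
    if (flags & SKETCH_GFP_ZERO)
        memset(p, 0, size);         /* honour the zeroing request */
    return p;
}
```

An implementation that ignores the flag returns the stale bytes, which is the failure a driver relying on zeroed descriptors would run into.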
     
  • [ Upstream commit 8bccf5b313180faefce38e0d1140f76e0f327d28 ]

    Christopher reports that perf_event_print_debug() can crash in uniprocessor
    builds. The crash is due to pcr_ops being NULL.

    This happens because pcr_arch_init() is only invoked by smp_cpus_done() which
    only executes in SMP builds.

    init_hw_perf_events() is closely intertwined with pcr_ops being setup properly,
    therefore:

    1) Call pcr_arch_init() early on from init_hw_perf_events(), instead of
    from smp_cpus_done().

    2) Do not hook up a PMU type if pcr_ops is NULL after pcr_arch_init().

    3) Move init_hw_perf_events to a later initcall so that we will be
    sure to invoke pcr_arch_init() after all cpus are brought up.

    Finally, guard the one naked sequence of pcr_ops dereferences in
    __global_pmu_self() with an appropriate NULL check.

    Reported-by: Christopher Alexander Tobias Schulze
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
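
The NULL guard mentioned above follows a common pattern for optional operation tables. The sketch below uses hypothetical names throughout (`pcr_ops_sketch`, `read_pcr_checked`, `fake_read_pcr`) and is not the kernel's actual structure; it only shows the check-before-dereference shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operations table; it may legitimately be absent (NULL). */
struct pcr_ops_sketch {
    uint64_t (*read_pcr)(unsigned idx);
};

/* Example backend used for illustration only. */
uint64_t fake_read_pcr(unsigned idx)
{
    return 0x100u + idx;
}

/*
 * Read a counter control register through the ops table.
 * Returns 0 on success, -1 if no backend has been registered.
 */
int read_pcr_checked(const struct pcr_ops_sketch *ops, unsigned idx,
                     uint64_t *out)
{
    assert(out != NULL);
    if (ops == NULL || ops->read_pcr == NULL)
        return -1;
    *out = ops->read_pcr(idx);
    return 0;
}
```

Callers that see -1 can skip registering or printing PMU state instead of crashing on a NULL table.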
     
  • [ Upstream commit 58556104e9cd0107a7a8d2692cf04ef31669f6e4 ]

    nmi_cpu_busy() is a SMP function call that just makes sure that all of the
    cpus are spinning using cpu cycles while the NMI test runs.

    It does not need to disable IRQs because we just care about NMIs executing,
    which they will even with 'normal' IRQs disabled.

    It is not legal to enable hard IRQs in a SMP cross call, in fact this bug
    triggers the BUG check in irq_work_run_list():

    BUG_ON(!irqs_disabled());

    Because now irq_work_run() is invoked from the tail of
    generic_smp_call_function_single_interrupt().

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • commit 0d085a529b427d97710e6a41f8a4f23e1757cd12 upstream.

    XFS has been having trouble with stray delayed allocation extents
    beyond EOF for a long time. Recent changes to the collapse range
    code have triggered erroneous EBUSY errors on page invalidation for
    block size smaller than page size filesystems. These
    have been caused by dirty buffers beyond EOF on a partial page which
    do not get written to disk during a sync.

    The issue is that write-ahead in xfs_cluster_write() finds such a
    partial page and handles it by leaving the page dirty but pushing it
    into a writeback state. This used to work just fine, as the
    write_cache_pages() code would then find the dirty partial page in
    the next mapping tree lookup as the dirty tag is still set.

    Unfortunately, when we moved to a mark and sweep approach to
    writeback to fix other writeback sync issues, we broke this. The
    act of marking the page as under writeback now clears the TOWRITE
    tag in the radix tree, even though the page is still dirty. This
    causes the TOWRITE tag to be cleared, and hence the next lookup on
    the mapping tree does not find the dirty partial page and so doesn't
    try to write it again.

    This same writeback bug was found recently in ext4 and fixed in
    commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
    without communication to the wider filesystem community. We can use
    exactly the same fix here so the TOWRITE flag is not cleared on
    partial page writes.
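
    The tag interaction can be modelled with a small userspace sketch. The
    tag bits, struct sketch_page and the keep_write parameter are
    hypothetical illustrations of the mark and sweep behaviour described
    above, not the kernel's page cache API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page tag bits, modelling the mapping tree tags. */
enum {
	SKETCH_TAG_DIRTY     = 1 << 0,
	SKETCH_TAG_TOWRITE   = 1 << 1,
	SKETCH_TAG_WRITEBACK = 1 << 2,
};

struct sketch_page {
	unsigned int tags;
};

/* Mark phase: every dirty page is tagged for this writeback pass. */
static void sketch_tag_for_writeback(struct sketch_page *pages, int n)
{
	for (int i = 0; i < n; i++)
		if (pages[i].tags & SKETCH_TAG_DIRTY)
			pages[i].tags |= SKETCH_TAG_TOWRITE;
}

/*
 * Put a page under writeback. Without keep_write the TOWRITE tag is
 * dropped, so a page that is left dirty becomes invisible to the next
 * tagged lookup.
 */
static void sketch_set_writeback(struct sketch_page *page, bool keep_write)
{
	page->tags |= SKETCH_TAG_WRITEBACK;
	if (!keep_write)
		page->tags &= ~(unsigned int)SKETCH_TAG_TOWRITE;
}

/* Sweep lookup: first index >= start still tagged TOWRITE, or -1. */
static int sketch_find_towrite(const struct sketch_page *pages, int n,
			       int start)
{
	for (int i = start; i < n; i++)
		if (pages[i].tags & SKETCH_TAG_TOWRITE)
			return i;
	return -1;
}
```

    With keep_write false, a page that stays dirty is no longer found by
    sketch_find_towrite(); with keep_write true it is found again, which
    is the effect the fix relies on for partial pages.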

    cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
    Root-cause-found-by: Brian Foster
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 35425ea2492175fd39f6116481fe98b2b3ddd4ca upstream.

    Christopher Head 2014-06-28 05:26:20 UTC described:
    "I tried to reproduce this on 3.12.21. Instead, when I do "echo hello > foo"
    in an ecryptfs mount with ecryptfs_xattr specified, I get a kernel crash:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] fsstack_copy_attr_all+0x2/0x61
    PGD d7840067 PUD b2c3c067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in: nvidia(PO)
    CPU: 3 PID: 3566 Comm: bash Tainted: P O 3.12.21-gentoo-r1 #2
    Hardware name: ASUSTek Computer Inc. G60JX/G60JX, BIOS 206 03/15/2010
    task: ffff8801948944c0 ti: ffff8800bad70000 task.ti: ffff8800bad70000
    RIP: 0010:[] [] fsstack_copy_attr_all+0x2/0x61
    RSP: 0018:ffff8800bad71c10 EFLAGS: 00010246
    RAX: 00000000000181a4 RBX: ffff880198648480 RCX: 0000000000000000
    RDX: 0000000000000004 RSI: ffff880172010450 RDI: 0000000000000000
    RBP: ffff880198490e40 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff880172010450 R11: ffffea0002c51e80 R12: 0000000000002000
    R13: 000000000000001a R14: 0000000000000000 R15: ffff880198490e40
    FS: 00007ff224caa700(0000) GS:ffff88019fcc0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000000bb07f000 CR4: 00000000000007e0
    Stack:
    ffffffff811826e8 ffff8800a39d8000 0000000000000000 000000000000001a
    ffff8800a01d0000 ffff8800a39d8000 ffffffff81185fd5 ffffffff81082c2c
    00000001a39d8000 53d0abbc98490e40 0000000000000037 ffff8800a39d8220
    Call Trace:
    [] ? ecryptfs_setxattr+0x40/0x52
    [] ? ecryptfs_write_metadata+0x1b3/0x223
    [] ? should_resched+0x5/0x23
    [] ? ecryptfs_initialize_file+0xaf/0xd4
    [] ? ecryptfs_create+0xf4/0x142
    [] ? vfs_create+0x48/0x71
    [] ? do_last.isra.68+0x559/0x952
    [] ? link_path_walk+0xbd/0x458
    [] ? path_openat+0x224/0x472
    [] ? do_filp_open+0x2b/0x6f
    [] ? __alloc_fd+0xd6/0xe7
    [] ? do_sys_open+0x65/0xe9
    [] ? system_call_fastpath+0x16/0x1b
    RIP [] fsstack_copy_attr_all+0x2/0x61
    RSP
    CR2: 0000000000000000
    ---[ end trace df9dba5f1ddb8565 ]---"

    If we create a file when we mount with ecryptfs_xattr_metadata option, we will
    encounter a crash in this path:
    ->ecryptfs_create
    ->ecryptfs_initialize_file
    ->ecryptfs_write_metadata
    ->ecryptfs_write_metadata_to_xattr
    ->ecryptfs_setxattr
    ->fsstack_copy_attr_all
    It's because our dentry->d_inode used in fsstack_copy_attr_all is NULL, and it
    will be initialized when ecryptfs_initialize_file finishes.

    So we should skip copying attr from lower inode when the value of ->d_inode is
    invalid.
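
    A minimal sketch of such a guard, using hypothetical sketch_* types
    that only mimic the shape of a dentry and an inode. It is not the
    ecryptfs code, and the return value is an assumption added so the
    behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for inode and dentry. */
struct sketch_inode {
	unsigned int mode;
};

struct sketch_dentry {
	struct sketch_inode *d_inode; /* NULL until the file is initialized */
};

/* Copies attributes; dereferences dst, so dst must not be NULL. */
static void sketch_copy_attr_all(struct sketch_inode *dst,
				 const struct sketch_inode *src)
{
	dst->mode = src->mode;
}

/*
 * Tail of a hypothetical setxattr path: skip the attribute copy while
 * the upper inode does not exist yet. Returns 1 if copied, 0 if skipped.
 */
static int sketch_setxattr_tail(struct sketch_dentry *dentry,
				const struct sketch_inode *lower)
{
	if (!dentry->d_inode)
		return 0;
	sketch_copy_attr_all(dentry->d_inode, lower);
	return 1;
}
```

    During file creation the dentry has no inode yet, so the copy is
    skipped; once the inode is attached the same call copies as before.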

    Signed-off-by: Chao Yu
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • commit d1e61eb443dc7512885dfe89ee2f2a1c29fcb1da upstream.

    Commit 78b81f4666fb ("ARM: dts: imx28-evk: Run I2C0 at 400kHz") caused issues
    when doing the following sequence in loop:

    - Boot the kernel
    - Perform audio playback
    - Reboot the system via 'reboot' command

    Many times the audio card cannot be probed, which causes playback to fail.

    After restoring to the original i2c0 frequency of 100kHz there is no such
    problem anymore.

    This reverts commit 78b81f4666fbb22a20b1e63e5baf197ad2e90e88.

    Signed-off-by: Fabio Estevam
    Signed-off-by: Shawn Guo
    Signed-off-by: Greg Kroah-Hartman

    Fabio Estevam
     
  • commit ace8578182dc347b043c0825b9873f62fdaa5b77 upstream.

    The bootloader on the Netgear ReadyNAS RN102 uses Hardware BCH ECC
    (strength = 4), while the pxa3xx NAND driver by default uses
    Hamming ECC (strength = 1).

    This patch changes the ECC mode on these machines to match that
    of the bootloader and of the stock firmware. That way, it is
    now possible to update the kernel from userland (e.g. using
    standard tools from mtd-utils package); u-boot will happily
    load and boot it.

    Fixes: 92beaccd8b49 ("ARM: mvebu: Enable NAND controller in ReadyNAS 102 .dts file")
    Signed-off-by: Ben Peddell
    Acked-by: Ezequiel Garcia
    Tested-by: Arnaud Ebalard
    Link: https://lkml.kernel.org/r/1410339341-3372-1-git-send-email-klightspeed@killerwolves.net
    Signed-off-by: Jason Cooper
    Signed-off-by: Greg Kroah-Hartman

    klightspeed@killerwolves.net
     
  • commit 500abb6ccb9e3f8d638a7f422443a8549245ef90 upstream.

    The bootloader on the Netgear ReadyNAS RN2120 uses Hardware BCH
    ECC (strength = 4), while the pxa3xx NAND driver by default uses
    Hamming ECC (strength = 1).

    This patch changes the ECC mode on these machines to match that
    of the bootloader and of the stock firmware. That way, it is
    now possible to update the kernel from userland (e.g. using
    standard tools from mtd-utils package); u-boot will happily
    load and boot it.

    The issue was initially reported and fixed by Ben Peddell for
    RN102. The RN2120 shares the same Hynix H27U1G8F2BTR NAND
    flash and setup. This patch is based on Ben's fix for RN102.

    Fixes: ad51eddd95ad ("ARM: mvebu: Enable NAND controller in ReadyNAS 2120 .dts file")
    Signed-off-by: Arnaud Ebalard
    Link: https://lkml.kernel.org/r/61f6a1b7ad0adc57a0e201b9680bc2e5f214a317.1410035142.git.arno@natisbad.org
    Signed-off-by: Jason Cooper
    Signed-off-by: Greg Kroah-Hartman

    Arnaud Ebalard
     
  • commit 225b94cdf719d0bc522a354bdafc18e5da5ff83b upstream.

    The bootloader on the Netgear ReadyNAS RN104 uses Hardware BCH
    ECC (strength = 4), while the pxa3xx NAND driver by default uses
    Hamming ECC (strength = 1).

    This patch changes the ECC mode on these machines to match that
    of the bootloader and of the stock firmware. That way, it is
    now possible to update the kernel from userland (e.g. using
    standard tools from mtd-utils package); u-boot will happily
    load and boot it.

    The issue was initially reported and fixed by Ben Peddell for
    RN102. The RN104 shares the same Hynix H27U1G8F2BTR NAND
    flash and setup. This patch is based on Ben's fix for RN102.

    Fixes: 0373a558bd79 ("ARM: mvebu: Enable NAND controller in ReadyNAS 104 .dts file")
    Signed-off-by: Arnaud Ebalard
    Link: https://lkml.kernel.org/r/920c7e7169dc6aaaa3eb4bced2336d38e77b8864.1410035142.git.arno@natisbad.org
    Signed-off-by: Jason Cooper
    Signed-off-by: Greg Kroah-Hartman

    Arnaud Ebalard
     
  • commit 4f5e01e96d424b54f5f0e89ee1ba9ccca03a3941 upstream.

    During the conversion of boards to use DT to instantiate Distributed
    Switch Architecture, nobody volunteered to test. As to be expected,
    the conversion was flawed. Testers and access to hardware have now
    become available, and this patch hopefully fixes the problems.

    dsa,mii-bus must be a phandle to the top level mdio node, not the port
    specific subnode of the mdio device.

    dsa,ethernet must be a phandle to the port subnode within the ethernet
    DT node, not the ethernet node.

    Don't pinctrl hog the card detect gpio for mvsdio.

    Rename the .dts files to make it clearer which file is for the Z0
    stepping and which for the A0 or later stepping.

    Signed-off-by: Andrew Lunn
    Cc: seugene@marvell.com
    Tested-by: Eugene Sanivsky
    Fixes: e2eaa339af44: ("ARM: Kirkwood: convert rd88f6281-setup.c to DT.")
    Fixes: e7c8f3808be8: ("ARM: kirkwood: Convert mv88f6281gtw_ge switch setup to DT")
    Link: https://lkml.kernel.org/r/1409592941-22244-1-git-send-email-andrew@lunn.ch
    Signed-off-by: Jason Cooper
    Signed-off-by: Greg Kroah-Hartman

    Andrew Lunn
     
  • commit cfa1950e6c6b72251e80adc736af3c3d2907ab0e upstream.

    When introducing support for sama5d3, the write to PMC_PCDR register has
    been accidentally removed.

    Reported-by: Nathalie Cyrille
    Signed-off-by: Ludovic Desroches
    Signed-off-by: Nicolas Ferre
    Signed-off-by: Greg Kroah-Hartman

    Ludovic Desroches