11 Jan, 2006

3 commits

  • Currently arch_remove_kprobe() is only implemented/required for x86_64 and
    powerpc. All other architectures, such as IA64, i386 and sparc64, implement
    a dummy function that is called from the arch-independent kprobes.c file.

    This patch removes the dummy functions and replaces them with
    #define arch_remove_kprobe(p, s) do { } while(0)

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil S Keshavamurthy
     
  • Now that all these entries in the arch ioctl32.c files are gone [1], we can
    build fs/compat_ioctl.c as a normal object and kill tons of cruft. We need a
    special do_ioctl32_pointer handler for s390 so the compat_ptr call is done.
    This is not needed but harmless on all other architectures. Also remove some
    superfluous includes in fs/compat_ioctl.c

    Tested on ppc64.

    [1] parisc still had its PPP handler left, which is not fully correct
    for ppp and besides that ppp uses the generic SIOCPRIV ioctl so it'd
    kick in for all netdevice users. We can introduce a proper handler
    in one of the next patch series by adding a compat_ioctl method to
    struct net_device but for now let's just kill it - parisc doesn't
    compile in mainline anyway and I don't want this to block this
    patchset.

    Signed-off-by: Christoph Hellwig
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The comment in compat.c is wrong: every architecture already provides a
    get_compat_sigevent() for the IPC compat code.

    This basically moves the x86_64 version to common code and removes all the
    others.

    Signed-off-by: Christoph Hellwig
    Acked-by: Paul Mackerras
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
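
The no-op macro from the arch_remove_kprobe() commit in this group relies on the `do { } while (0)` idiom, which makes an empty macro behave like a single statement at every call site. Below is a minimal user-space sketch of that idea, not the kernel code: `struct kprobe_stub` and `unregister_probe()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kprobe; not the kernel definition. */
struct kprobe_stub {
    int armed;
};

/*
 * An architecture with nothing to clean up defines the hook as an empty
 * statement. The do/while(0) wrapper lets it be followed by ';' and used
 * inside an unbraced if/else, exactly like a real function call.
 */
#define arch_remove_kprobe(p, s) do { } while (0)

/* Hypothetical arch-independent caller; the call site never changes. */
static void unregister_probe(struct kprobe_stub *p, int slot)
{
    assert(p != NULL);
    if (p->armed)
        arch_remove_kprobe(p, slot); /* expands to: do { } while (0); */
    else
        return;
    p->armed = 0;
}
```

Because the macro discards its arguments, an architecture that does need cleanup can supply a real function of the same name without touching the caller.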
     

09 Jan, 2006

4 commits

  • Add a hook so architectures can validate /dev/mem mmap requests.

    This is analogous to validation we already perform in the read/write
    paths.

    The identity mapping scheme used on ia64 requires that each 16MB or
    64MB granule be accessed with exactly one attribute (write-back or
    uncacheable). This avoids "attribute aliasing", which can cause a
    machine check.

    Sample problem scenario:
    - Machine supports VGA, so it has uncacheable (UC) MMIO at 640K-768K
    - efi_memmap_init() discards any write-back (WB) memory in the first granule
    - Application (e.g., "hwinfo") mmaps /dev/mem, offset 0
    - hwinfo receives UC mapping (the default, since memmap says "no WB here")
    - Machine check abort (on chipsets that don't support UC access to WB
    memory, e.g., sx1000)

    In the scenario above, the only choices are
    - Use WB for hwinfo mmap. Can't do this because it causes attribute
    aliasing with the UC mapping for the VGA MMIO space.
    - Use UC for hwinfo mmap. Can't do this because the chipset may not
    support UC for that region.
    - Disallow the hwinfo mmap with -EINVAL. That's what this patch does.

    Signed-off-by: Bjorn Helgaas
    Cc: Hugh Dickins
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Helgaas
     
  • Remove various things which were checking for gcc-1.x and gcc-2.x compilers.

    From: Adrian Bunk

    Updates some documentation and removes some code paths for gcc < 3.2.

    Acked-by: Russell King
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The ptrace_get_task_struct() helper that I added as part of the ptrace
    consolidation is useful in a variety of places that currently open-code it.
    Switch them to the common helpers.

    Add a ptrace_traceme() helper that needs to be explicitly called, and simplify
    the ptrace_get_task_struct() interface. We don't need the request argument
    now, and we return the task_struct directly, using ERR_PTR() for error
    returns. It's a bit more code in the callers, but we have two sane routines
    that do one thing well now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • sys_migrate_pages implementation using swap based page migration

    This is the original API proposed by Ray Bryant in his posts during the first
    half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

    The intent of sys_migrate_pages is to migrate the memory of a process. A
    process may have migrated to another node, while its memory was allocated
    optimally for the prior context. sys_migrate_pages allows the memory to be
    shifted to the new node.

    sys_migrate_pages is also useful for manually moving a process's memory if
    the process's available memory nodes have changed through cpuset
    operations. Paul Jackson is working on a mechanism that will migrate
    memory automatically if the cpuset of a process is changed. However, a
    user may decide to manually control the migration.

    This implementation is put into the policy layer since it uses concepts and
    functions that are also needed for mbind and friends. The patch also provides
    a do_migrate_pages function that may be useful for cpusets to automatically
    move memory. sys_migrate_pages does not modify policies in contrast to Ray's
    implementation.

    The current code here is based on the swap based page migration capability and
    thus is not able to preserve the physical layout relative to its containing
    nodeset (which may be a cpuset). When direct page migration becomes available,
    the implementation needs to be changed to do an isomorphic move of pages
    between different nodesets. The current implementation simply evicts all
    pages in the source nodeset that are not in the target nodeset.

    Patch supports ia64, i386 and x86_64.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
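
The /dev/mem mmap validation commit in this group rejects a mapping when no memory attribute is safe for the requested range. The sketch below illustrates that decision in user-space C under simplifying assumptions: the `phys_region` table, the `ATTR_*` flags, and `validate_mmap_range()` are hypothetical and do not reflect the actual hook signature or the EFI memory map format. Address overflow is not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { ATTR_WB = 1, ATTR_UC = 2 };

/* Hypothetical description of one physical range. */
struct phys_region {
    unsigned long start, end;  /* covers [start, end) */
    unsigned supported;        /* attributes the hardware accepts here */
};

/*
 * Return 0 if [addr, addr + size) may be mapped with granule_attr, the
 * single attribute already in use for the enclosing granule. Return
 * -EINVAL if any overlapping region does not support that attribute.
 */
static int validate_mmap_range(const struct phys_region *map, size_t n,
                               unsigned long addr, unsigned long size,
                               unsigned granule_attr)
{
    unsigned long end = addr + size;

    assert(map != NULL && size > 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].end <= addr || map[i].start >= end)
            continue;                      /* no overlap with the request */
        if (!(map[i].supported & granule_attr))
            return -EINVAL;                /* would need a second attribute */
    }
    return 0;
}
```

With WB-only memory at 0-640K and UC MMIO at 640K-768K in a granule that is accessed UC, a request at offset 0 fails while a request inside the MMIO window succeeds.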
     

06 Jan, 2006

2 commits


05 Jan, 2006

3 commits


04 Jan, 2006

1 commit

  • The function ia64_pci_legacy_write() returns 0 for everything
    except errors. This return value gets sent back to the user from
    pci_write_legacy_io(), making it look like every write fails. The trivial
    patch below copies the behavior of the SGI sn machvec and does what
    would be expected from something implementing a write() function.

    Signed-off-by: Alex Williamson
    Signed-off-by: Tony Luck

    Alex Williamson
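
The return-value convention discussed in the commit above can be illustrated with a small user-space sketch: a write-style handler reports the number of bytes written on success and a negative errno on failure, so a caller never mistakes success for "0 bytes written". The function `legacy_write()`, its byte-array port space, and the little-endian store are assumptions made for this example, not the ia64 implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical legacy I/O write into a byte array standing in for port
 * space. Returns the number of bytes written, or a negative errno.
 */
static long legacy_write(unsigned char *space, size_t space_len,
                         unsigned long port, unsigned long val, size_t size)
{
    assert(space != NULL);
    if (size != 1 && size != 2 && size != 4)
        return -EINVAL;                    /* unsupported access width */
    if (port > space_len || size > space_len - port)
        return -EINVAL;                    /* access runs past the end */
    for (size_t i = 0; i < size; i++)
        space[port + i] = (unsigned char)(val >> (8 * i));
    return (long)size;                     /* success: bytes written, not 0 */
}
```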
     

17 Dec, 2005

5 commits

  • sparc64, i386 and x86_64 have support for a special data section dedicated
    to rarely updated data that is frequently read. The section was created to
    avoid false sharing of that rarely written data with frequently written
    kernel data.

    This patch creates such a data section for ia64 and will group rarely written
    data into this section.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tony Luck

    Christoph Lameter
     
  • Change the NR_CPUS default for ia64/sn up to 1024.

    Signed-off-by: John Hawkes
    Signed-off-by: John Hesterberg
    Signed-off-by: Tony Luck

    hawkes@sgi.com
     
  • I see why the problem exists only on SN. SN uses a different hardware
    mechanism to purge TLB entries across nodes.

    It looks like there is a bug in the SN TLB flushing code. During context switch,
    kernel threads inherit the mm of the task that was previously running on the
    cpu. This confuses the code in sn2_global_tlb_purge().

    The result is a missed TLB purge for the task that owns the "borrowed" mm.

    (I hit the problem running heavy stress where kswapd was purging code pages of
    a user task that woke kswapd. The user task took a SIGILL fault trying to
    execute code in the page that had been ripped out from underneath it).

    Signed-off-by: Jack Steiner
    Signed-off-by: Tony Luck

    Jack Steiner
     
  • Use raw_smp_processor_id() instead of get_cpu() as we don't need the
    extra features of get_cpu().

    Signed-off-by: Jes Sorensen
    Signed-off-by: Tony Luck

    Jes Sorensen
     
  • The udelay() inline for ia64 uses the ITC. If CONFIG_PREEMPT is enabled
    and the platform has unsynchronized ITCs and the calling task migrates
    to another CPU while doing the udelay loop, then the effective delay may
    be too short or very, very long.

    This patch disables preemption around 100 usec chunks of the overall
    desired udelay time. This minimizes preemption-holdoffs.

    udelay() is now too big to be inline, move it out of line and export it.

    Signed-off-by: John Hawkes
    Signed-off-by: Tony Luck

    John Hawkes
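
The chunked delay described in the udelay() commit of this group can be sketched as follows. This is a user-space illustration only: the `*_stub` functions and the constant name are hypothetical, and the real code spins on the ITC rather than adding to a counter. The 100 usec chunk size comes from the commit message.

```c
#include <assert.h>

#define MAX_USECS_NOT_PREEMPTIBLE 100UL  /* chunk size; name is hypothetical */

static int preempt_depth;            /* > 0 means preemption is disabled */
static unsigned long chunks_run;     /* number of non-preemptible chunks */
static unsigned long usecs_waited;   /* total delay requested from stubs */

static void preempt_disable_stub(void) { preempt_depth++; }
static void preempt_enable_stub(void)  { preempt_depth--; }

/* Stand-in for the timer spin loop; must run with preemption disabled. */
static void spin_for_usecs_stub(unsigned long usecs)
{
    assert(preempt_depth > 0);
    usecs_waited += usecs;
    chunks_run++;
}

static void udelay_sketch(unsigned long usecs)
{
    while (usecs > 0) {
        unsigned long chunk = usecs < MAX_USECS_NOT_PREEMPTIBLE
                                  ? usecs : MAX_USECS_NOT_PREEMPTIBLE;

        preempt_disable_stub();      /* no migration while timing a chunk */
        spin_for_usecs_stub(chunk);
        preempt_enable_stub();       /* the task may be preempted here */
        usecs -= chunk;
    }
}
```

Each chunk is timed on a single CPU, so unsynchronized counters cannot distort it, and preemption is never held off for more than one chunk.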
     

16 Dec, 2005

1 commit


15 Dec, 2005

1 commit


13 Dec, 2005

3 commits


07 Dec, 2005

5 commits

  • What is the value shown in "cpu MHz" of /proc/cpuinfo when CPUs are capable of
    changing frequency?

    Today the answer is: It depends.
    On i386:
    SMP kernel - It is always the boot frequency
    UP kernel - Scales with the frequency change and shows the value that was last set.

    On x86_64:
    There is one single variable cpu_khz that gets written by all the CPUs. So,
    the frequency set by last CPU will be seen on /proc/cpuinfo of all the
    CPUs in the system. What you see also depends on whether you have constant_tsc
    capable CPU or not.

    On ia64:
    It is always boot time frequency of a particular CPU that gets displayed.

    The patch below changes this to:
    Show the last known frequency of the particular CPU, when cpufreq is present. If
    the CPU does not support changing frequency through cpufreq, then the boot
    frequency will be shown. The patch affects i386, x86_64 and ia64 architectures.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Dave Jones

    Venkatesh Pallipadi
     
  • The patch that added support for a new platform chipset (shub2) broke
    PTC deadlock recovery on older versions of the chipset. (PTCs are the
    SN platform-specific method for doing a global TLB purge). This
    patch fixes deadlock recovery so that it works on both the old & new
    chipsets.

    Signed-off-by: Jack Steiner
    Signed-off-by: Tony Luck

    Jack Steiner
     
  • We have a customer application which trips a bug. The problem arises
    when a driver attempts to call do_munmap on an area which is mapped, but
    because current->thread.task_size has been set to 0xC0000000, the call
    to do_munmap fails thinking it is an unmap beyond the user's address
    space.

    The comment in fs/binfmt_elf.c in load_elf_library() before the call
    to SET_PERSONALITY() indicates that task_size must not be changed for
    the running application until flush_thread, but on ia64 it is changed
    earlier when executing ia32 binaries.

    This patch moves the setting of task_size from SET_PERSONALITY() to
    flush_thread() as indicated. The customer application no longer is able
    to trip the bug.

    Signed-off-by: Robin Holt
    Signed-off-by: Tony Luck

    Robin Holt
     
  • The per-node data structures are allocated with strided offsets that are a
    function of the node number. This prevents excessive cache-aliasing from
    occurring.

    On systems with a large number of nodes, the strided offset becomes
    too large. This patch restricts the maximum offset to 32MB. This is far larger
    than the size of any current L3 cache.

    Signed-off-by: Jack Steiner
    Signed-off-by: Tony Luck

    Jack Steiner
     
  • Altix only patch to add fixup code that sets up
    pci_controller->window. This code is a temporary
    fix until ACPI support on Altix is added.

    Also, corrects the usage of pci_dev->sysdata,
    which had previously been used to reference
    platform specific device info, to now point to
    a pci_controller struct.

    Signed-off-by: John Keller
    Signed-off-by: Tony Luck

    John Keller
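
The bounded per-node offset from the strided-offset commit in this group can be sketched with a few lines of arithmetic. This is an illustration under assumptions: the 64KB stride and the names are hypothetical, and wrapping with a mask is only one way to keep the offset below 32MB. The source states only that the maximum offset is restricted.

```c
#include <assert.h>

#define MAX_NODE_OFFSET (32UL * 1024 * 1024)  /* 32MB bound from the commit */
#define NODE_STRIDE     (64UL * 1024)         /* assumed per-node stride */

/* Offset added to a node's data area so nodes do not alias in the cache. */
static unsigned long node_offset(unsigned long node)
{
    /* The mask below is only valid for a power-of-two bound. */
    assert((MAX_NODE_OFFSET & (MAX_NODE_OFFSET - 1)) == 0);
    return (node * NODE_STRIDE) & (MAX_NODE_OFFSET - 1);
}
```

With these values, nodes 0 through 511 get distinct offsets and node 512 wraps back to 0.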
     

06 Dec, 2005

1 commit


03 Dec, 2005

2 commits


30 Nov, 2005

2 commits

  • break.b always sets cr.iim to 0 and the current code tries to
    get the break_num by decoding the instruction. However, there
    seems to be a race condition while reading regs->cr_iip:
    on another cpu the break.b at regs->cr_iip might have been
    replaced with the original instruction as a result of
    unregister_kprobe(), and hence decoding the instruction to
    obtain break_num will result in a wrong value in this case.

    Also includes changes to kprobes.c which now has to handle
    break number zero.

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Tony Luck

    Keshavamurthy Anil S
     
  • A single SGI Altix system can be divided into multiple partitions,
    each running its own instance of the Linux kernel. pfn_valid()
    is currently not optimal for any but the first partition, since it
    does not compare the pfn with min_low_pfn before calling the more
    costly ia64_pfn_valid().

    Signed-off-by: Dean Roe
    Signed-off-by: Tony Luck

    Dean Roe
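
The pfn_valid() change in this group puts a cheap bounds check in front of an expensive lookup. A minimal user-space sketch of that ordering follows. The `*_stub` names are hypothetical, the slow check is a placeholder with arbitrary behaviour, and the upper-bound comparison is an assumption, since the commit message mentions only min_low_pfn.

```c
#include <assert.h>

static unsigned long min_low_pfn_stub = 1000;  /* first pfn of this partition */
static unsigned long max_low_pfn_stub = 2000;  /* assumed upper bound */
static unsigned long slow_calls;               /* counts expensive lookups */

/* Placeholder for the costly check; treats every pfn as valid. */
static int slow_pfn_valid_stub(unsigned long pfn)
{
    (void)pfn;
    slow_calls++;
    return 1;
}

static int pfn_valid_sketch(unsigned long pfn)
{
    assert(min_low_pfn_stub <= max_low_pfn_stub);
    /* && short-circuits, so out-of-range pfns never reach the slow path. */
    return pfn >= min_low_pfn_stub &&
           pfn < max_low_pfn_stub &&
           slow_pfn_valid_stub(pfn);
}
```

On a partition whose memory starts well above pfn 0, every pfn below the partition is rejected without the costly call.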
     

24 Nov, 2005

1 commit

  • Fix a bug in kprobes that can cause an Oops or even a crash when a return
    probe is installed on one of the following functions: sys_execve,
    do_execve, load_*_binary, flush_old_exec, or flush_thread. The fix is to
    remove the call to kprobe_flush_task() in flush_thread(). This fix has
    been tested on all architectures for which the return-probes feature has
    been implemented (i386, x86_64, ppc64, ia64). Please apply.

    BACKGROUND

    Up to now, we have called kprobe_flush_task() under two situations: when a
    task exits, and when it execs. Flushing kretprobe_instances on exit is
    correct because (a) do_exit() doesn't return, and (b) one or more
    return-probed functions may be active when a task calls do_exit(). Neither
    is the case for sys_execve() and its callees.

    Initially, the mistaken call to kprobe_flush_task() on exec was harmless
    because we put the "real" return address of each active probed function
    back in the stack, just to be safe, when we recycled its
    kretprobe_instance. When support for ppc64 and ia64 was added, this safety
    measure couldn't be employed, and was eventually dropped even for i386 and
    x86_64. sys_execve() and its callees were informally blacklisted for
    return probes until this fix was developed.

    Acked-by: Prasanna S Panchamukhi
    Signed-off-by: Jim Keniston
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Keniston
     

22 Nov, 2005

3 commits


18 Nov, 2005

2 commits

  • Polish the comments, specifically in the vhpt_miss and nested_dtlb_miss
    handlers. I think it's better to refer to each page table level
    by name instead of by number, i.e., use pgd, pud, pmd, and pte
    instead of L1, L2, L3 etc.
    Along the way, remove some magic numbers in the comments like:
    "PTA + (((IFA(61,63) << 7) | IFA(33,39))*8)". No code change at
    all, pure comment update. Feel free to shoot anything you have,
    darts or tomahawk cruise missile. I will duck behind a bunker ;-)

    Signed-off-by: Ken Chen
    Acked-by: Robin Holt
    Signed-off-by: Tony Luck

    Chen, Kenneth W
     
  • From source code inspection, I think there is a bug in the vhpt_miss
    handler with 4 level page tables. In the code path that rechecks the
    page table entry against the previously read value after tlb insertion,
    the *pte value in register r18 was overwritten with the value newly read
    from the pud pointer, rendering the check of the new *pte against the
    previous *pte completely wrong. The bug is non-fatal and the penalty
    is to purge the entry and retry, but for functional correctness it
    should be fixed. The fix is to use a different register so the new
    *pud doesn't trash *pte. (btw, the comment on the cmp statement is
    wrong as well, which I will address in the next patch).

    Signed-off-by: Ken Chen
    Signed-off-by: Tony Luck

    Chen, Kenneth W
     

16 Nov, 2005

1 commit

  • Our performance validation on 2.6.15-rc1 caught a disastrous performance
    regression on ia64 with netperf (-98%) and volanomark (-58%) compared to
    the previous kernel version 2.6.14-git7. See the following chart (result
    group 1 & 2).

    http://kernel-perf.sourceforge.net/results.machine_id=26.html

    We have root caused it to commit 64c7c8f88559624abdbe12b5da6502e8879f8d28

    This changeset broke the ia64 task resched notification. In
    sched.c:resched_task(), a reschedule IPI is conditioned upon
    TIF_POLLING_NRFLAG. However, the above changeset unconditionally sets
    the polling thread flag for idle tasks regardless of whether
    pal_halt_light is in use or not. As a result, the resched IPI is not sent
    from resched_task(). And since the default behavior on ia64 is to use
    pal_halt_light, we end up delaying the rescheduling of the task until the
    next timer tick, which causes the performance regression.

    This fixes the performance bug. I'm glad our performance suite is
    turning up bad performance bugs like this in time.

    Signed-off-by: Ken Chen
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
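
The interaction between the polling flag and the reschedule IPI described above can be modelled in a few lines. This is a user-space sketch under assumptions: `struct cpu_stub`, `idle_enter()`, and `resched_task_sketch()` are hypothetical, and setting the flag only when the idle loop really polls is one plausible shape of the fix, which the commit message does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU state for this sketch. */
struct cpu_stub {
    int polling_flag;   /* models TIF_POLLING_NRFLAG on the idle task */
    int need_resched;   /* models the need-resched flag */
    int ipis_sent;      /* counts reschedule IPIs delivered to this CPU */
};

/* Advertise polling only if the idle loop really spins on need_resched. */
static void idle_enter(struct cpu_stub *cpu, int uses_halt)
{
    assert(cpu != NULL);
    cpu->polling_flag = !uses_halt;
}

/* A polling CPU notices need_resched by itself; a halted one needs an IPI. */
static void resched_task_sketch(struct cpu_stub *cpu)
{
    assert(cpu != NULL);
    cpu->need_resched = 1;
    if (!cpu->polling_flag)
        cpu->ipis_sent++;   /* stand-in for sending the reschedule IPI */
}
```

If a halted CPU wrongly advertises polling, `ipis_sent` stays at 0 and the wakeup waits for the next timer tick, which matches the regression described in the commit.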