17 Oct, 2007

1 commit

  • x86_64 uses 2M page table entries to map its 1-1 kernel space. We also
    implement the virtual memmap using 2M page table entries. So there is no
    additional runtime overhead over FLATMEM; only initialisation is slightly more
    complex. As FLATMEM still references memory to obtain the mem_map pointer and
    SPARSEMEM_VMEMMAP uses a compile time constant, SPARSEMEM_VMEMMAP should be
    superior.

    With this SPARSEMEM becomes the most efficient way of handling virt_to_page,
    pfn_to_page and friends for UP, SMP and NUMA on x86_64.
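
    The benefit of a compile time constant can be seen in a small model of the
    lookup arithmetic. This is only an illustrative sketch, not kernel code:
    VMEMMAP_BASE and STRUCT_PAGE_SIZE are assumed values picked for the example,
    and the helper names merely mirror the kernel's.

```python
# Illustrative model of SPARSEMEM_VMEMMAP address arithmetic (not kernel code).
VMEMMAP_BASE = 0xFFFFE20000000000   # assumed base address of the virtual memmap
STRUCT_PAGE_SIZE = 64               # assumed sizeof(struct page), in bytes
PAGE_SIZE = 4096                    # 4K base pages
PMD_SIZE = 2 * 1024 * 1024          # one 2M page table entry


def pfn_to_page(pfn):
    """Address of the struct page for a pfn: constant base plus scaled index."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


def page_to_pfn(page_addr):
    """Inverse lookup: subtract the constant base and divide by the entry size."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


# Number of struct page entries held by one 2M mapping of the memmap,
# and the amount of RAM those entries describe.
PAGES_PER_PMD = PMD_SIZE // STRUCT_PAGE_SIZE
RAM_PER_PMD = PAGES_PER_PMD * PAGE_SIZE
```

    Under these assumptions no memory load is needed to find the base, and a
    single 2M entry of the memmap describes 128M of RAM.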

    [apw@shadowen.org: code resplit, style fixups]
    [apw@shadowen.org: vmemmap x86_64: ensure end of section memmap is initialised]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Jul, 2007

2 commits

  • Background:
    The MCE handler has several paths that it can take, depending on various
    conditions of the MCE status and the value of the 'tolerant' knob. The
    exact semantics are not well defined and the code is a bit twisty.

    Description:
    This patch makes the MCE handler's behavior more clear by documenting the
    behavior for various 'tolerant' levels. It also fixes or enhances
    several small things in the handler. Specifically:
    * If RIPV is not set it is not safe to restart, so set the 'no way out'
    flag rather than the 'kill it' flag.
    * Don't panic() on correctable MCEs.
    * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
    dropped uncorrected errors), set the 'no way out' flag.
    * Use EIPV for testing whether an app can be killed (SIGBUS) rather
    than RIPV. According to docs, EIPV indicates that the error is
    related to the IP, while RIPV simply means the IP is valid to
    restart from.
    * Don't clear the MCi_STATUS registers until after the panic() path.
    This leaves the status bits set after the panic() so clever BIOSes
    can find them (and dumb BIOSes can do nothing).

    This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

    Result:
    Tolerant levels behave almost identically to how they always have, but
    now it's well defined. There's a slightly higher chance of panic()ing
    when multiple errors happen (a good thing, IMHO). If you take an MBE and
    panic(), the error status bits are not cleared.

    Alternatives:
    None.

    Testing:
    I used software to inject correctable and uncorrectable errors. With
    tolerant = 3, the system usually survives. With tolerant = 2, the system
    usually panic()s (PCC) but not always. With tolerant = 1, the system
    always panic()s. When the system panic()s, the BIOS is able to detect
    that the cause of death was an MC4. I was not able to reproduce the
    case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
    That will be rare at best.

    Signed-off-by: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Tim Hockin
     
  • .. and adjust documentation to properly reflect options that are
    x86-64 specific.

    Signed-off-by: Jan Beulich
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

03 May, 2007

5 commits

  • Background:
    We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
    especially when we are trying really hard to stress the system out. The
    current MCE poller uses a static interval which does not care whether it
    has or has not found MCEs recently.

    Description:
    This patch makes the MCE poller adjust the polling interval dynamically.
    If we find an MCE, poll 2x faster (down to 10 ms). When we stop finding
    MCEs, poll 2x slower (up to check_interval seconds). The check_interval
    tunable becomes the max polling interval. The "Machine check events
    logged" printk() is rate limited to the check_interval, which should be
    identical behavior to the old functionality.
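
    The interval adjustment described above can be sketched as follows. This is
    a minimal model of the halving/doubling rule, not the kernel code; the
    function and parameter names are made up for the example.

```python
MIN_INTERVAL_MS = 10  # fastest polling interval named in the description


def next_interval_ms(current_ms, found_mce, check_interval_s):
    """Return the next polling interval in milliseconds.

    found_mce        -- whether the last poll found a machine check event
    check_interval_s -- the check_interval tunable, acting as the maximum
    """
    max_ms = check_interval_s * 1000
    if found_mce:
        # Poll 2x faster, but never faster than every 10 ms.
        return max(current_ms // 2, MIN_INTERVAL_MS)
    # Poll 2x slower, but never slower than check_interval.
    return min(current_ms * 2, max_ms)
```

    Starting from the default 5 minute check_interval, a steady stream of
    errors halves the interval on every poll until it reaches the 10 ms floor.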

    Result:
    If you start to take a lot of correctable errors (not exceptions), you
    log them faster and more accurately (less chance of overflowing the MCA
    registers). If you don't take a lot of errors, you will see no change.

    Alternatives:
    I considered simply reducing the polling interval to 10 ms immediately
    and keeping it there as long as we continue to find errors. This felt a
    bit heavy handed, but does perform significantly better for the default
    check_interval of 5 minutes (we're using a few seconds when testing for
    DRAM errors). I could be convinced to go with this, if anyone felt it
    was not too aggressive.

    Testing:
    I used an error-injecting DIMM to create lots of correctable DRAM errors
    and verified that the polling interval accelerates. The printk() only
    happens once per check_interval seconds.

    Patch:
    This patch is against 2.6.21-rc7.

    Signed-Off-By: Tim Hockin
    Signed-off-by: Andi Kleen

    Tim Hockin
     
  • Create a document to explain how to use numa=fake in conjunction with cpusets
    for coarse memory resource management.

    An attempt to get more awareness and testing for this feature.

    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton

    David Rientjes
     
  • Extends the numa=fake x86_64 command-line option to split the remaining system
    memory into nodes of fixed size. Any leftover memory is allocated to a final
    node unless the command-line ends with a comma.

    For example:
    numa=fake=2*512,*128 gives two 512M nodes and the remaining system
    memory is split into nodes of 128M each.
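
    A rough sketch of how the "*128" part divides the remaining memory, working
    in whole megabytes. The helper name and its trailing_comma flag are
    hypothetical, and the real code works on available pages rather than on
    plain sizes.

```python
def split_fixed(remaining_mb, node_mb, trailing_comma=False):
    """Split remaining_mb into nodes of node_mb each.

    Any leftover goes to a final, smaller node unless the command line
    ended with a comma (modelled here by trailing_comma=True).
    """
    count, leftover = divmod(remaining_mb, node_mb)
    sizes = [node_mb] * count
    if leftover and not trailing_comma:
        sizes.append(leftover)
    return sizes
```

    For instance, with 476M left after the two 512M nodes, split_fixed(476, 128)
    yields three 128M nodes plus a final 92M node.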

    This is beneficial for systems where the exact size of RAM is unknown or not
    necessarily relevant, but the size of the remaining nodes to be allocated is
    known based on their capacity for resource management.

    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton

    David Rientjes
     
  • Extends the numa=fake x86_64 command-line option to split the remaining
    system memory into equal-sized nodes.

    For example:
    numa=fake=2*512,4* gives two 512M nodes and the remaining system
    memory is split into four approximately equal
    chunks.
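
    A rough sketch of the "4*" split, working in whole megabytes. The helper
    name is hypothetical, and the real code balances available pages rather
    than plain sizes, which is why the chunks are only approximately equal.

```python
def split_equal(remaining_mb, count):
    """Split remaining_mb into count nodes whose sizes differ by at most 1."""
    base, extra = divmod(remaining_mb, count)
    # The first `extra` nodes absorb the remainder, one megabyte each.
    return [base + 1 if i < extra else base for i in range(count)]
```

    For instance, split_equal(1026, 4) gives two nodes of 257M and two of 256M.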

    This is beneficial for systems where the exact size of RAM is unknown or not
    necessarily relevant, but the granularity with which nodes shall be allocated
    is known.

    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton

    David Rientjes
     
  • Extends the numa=fake x86_64 command-line option to allow for configurable
    node sizes. These nodes can be used in conjunction with cpusets for coarse
    memory resource management.

    The old command-line option is still supported:
    numa=fake=32 gives 32 fake NUMA nodes, ignoring the NUMA setup of the
    actual machine.

    But now you may configure your system for the node sizes of your choice:
    numa=fake=2*512,1024,2*256
    gives two 512M nodes, one 1024M node, two 256M nodes, and
    the rest of system memory to a sixth node.
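
    The size list syntax can be modelled with a few lines of parsing. This is a
    sketch under simplifying assumptions: the function name is made up, sizes
    are whole megabytes, and the old plain node count form (numa=fake=32) is not
    handled.

```python
def fake_node_sizes(spec, total_mb):
    """Expand a spec such as "2*512,1024,2*256" into a list of node sizes.

    Each field is either SIZE or COUNT*SIZE. Whatever memory is left over
    after the listed nodes becomes one final node.
    """
    sizes = []
    for field in spec.split(","):
        if "*" in field:
            count, size = field.split("*")
            sizes.extend([int(size)] * int(count))
        else:
            sizes.append(int(field))
    rest = total_mb - sum(sizes)
    if rest > 0:
        sizes.append(rest)
    return sizes
```

    On a hypothetical 4096M machine, fake_node_sizes("2*512,1024,2*256", 4096)
    produces the five listed nodes followed by a sixth node of 1536M.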

    The existing hash function is maintained to support the various node sizes
    that are possible with this implementation.

    Each node of the same size receives roughly the same amount of available
    pages, regardless of any reserved memory within its address range. The total
    available pages on the system is calculated and divided by the number of equal
    nodes to allocate. These nodes are then dynamically allocated and their
    borders extended until such time as their number of available pages reaches
    the required size.

    Configurable node sizes are recommended when used in conjunction with cpusets
    for memory control because it eliminates the overhead associated with scanning
    the zonelists of many smaller full nodes on page_alloc().

    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton

    David Rientjes
     

24 Apr, 2007

1 commit

  • noreplacement is dangerous on modern systems because it will not replace the
    context switch FNSAVE with the SSE-aware FXSAVE. But other places in the kernel
    still assume SSE and do FXSAVE, so the CPU will then access FXSAVE information
    with FNSAVE and cause corruption.

    Easiest way to avoid this is to remove the option. It was mostly for paranoia
    reasons anyways and alternative()s have been stable for some time.

    Thanks to Jeremy F. for reporting and helping debug it.

    Signed-off-by: Andi Kleen

    Andi Kleen
     

13 Feb, 2007

3 commits

  • When a machine check event is detected (including an AMD RevF threshold
    overflow event), allow a "trigger" program to be run. This allows user space
    to react to such events sooner.

    The trigger is configured using a new trigger entry in the
    machinecheck sysfs interface. It is currently shared between
    all CPUs.

    I also fixed the AMD threshold handler to run the machine
    check polling code immediately to actually log any events
    that might have caused the threshold interrupt.

    Also added some documentation for the mce sysfs interface.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Fix typos.
    Lots of whitespace changes for readability and consistency.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andi Kleen

    Randy Dunlap
     
  • - add SWIOTLB config help text
    - mention Documentation/x86_64/boot-options.txt in
      Documentation/kernel-parameters.txt
    - remove the duplication of the iommu kernel parameter documentation.
    - Better explanation of some of the iommu kernel parameter options.
    - "32MB<

    Signed-off-by: Andi Kleen
    Acked-by: Muli Ben-Yehuda
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton

    Karsten Weiss
     

09 Jan, 2007

1 commit

  • This reverts commit b026872601976f666bae77b609dc490d1834bf77, which has
    been linked to several problem reports with IO-APIC and the timer.
    Machines either don't boot because the timer doesn't happen, or we get
    double timer interrupts because we end up double-routing the timer irq
    through multiple interfaces.

    See for example

    http://lkml.org/lkml/2006/12/16/101
    http://lkml.org/lkml/2007/1/3/9
    http://bugzilla.kernel.org/show_bug.cgi?id=7789

    about some of the discussion.

    Patches to fix this cleanup exist (and have been confirmed to work fine
    at least for some of the affected cases) and we'll revisit it for
    2.6.21, but this late in the -rc series we're better off just reverting
    the incomplete commit that caused the problems.

    Suggested-by: Adrian Bunk
    Cc: Eric W. Biederman
    Cc: Yinghai Lu
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2006

2 commits

  • This patch makes it possible to compile Calgary in but not use it by
    default. In this mode, use 'iommu=calgary' to activate it.

    Signed-off-by: Muli Ben-Yehuda
    Signed-off-by: Jon Mason
    Signed-off-by: Andi Kleen

    Muli Ben-Yehuda
     
  • Instead of adding yet more quirks, try various timer
    routing variants in check_timer.

    In particular this tries to handle quirks from:
    - Nvidia NF2-4 reference BIOS: wrong timer override
    - Asus: Wrong timer override but no HPET table
    - ATI: require timer disabled in 8259
    - Some boards: require timer enabled in 8259

    We just try many of the known variants in the hopefully right order
    in check_timer.

    Trying pin 0/2 on Nvidia suggested by Tim Hockin.

    TBD Experimental. Needs a lot of testing

    Signed-off-by: Andi Kleen

    Andi Kleen
     

04 Oct, 2006

1 commit


30 Sep, 2006

2 commits


26 Sep, 2006

2 commits


29 Jul, 2006

1 commit


27 Jun, 2006

1 commit

  • This patch hooks Calgary into the build, the x86-64 IOMMU
    initialization paths, and introduces the Calgary specific bits. The
    implementation draws inspiration from both PPC (which has support for
    the same chip but requires firmware support which we don't have on
    x86-64) and gart. Calgary is different from gart in that it supports a
    translation table per PHB, as opposed to the single gart aperture.

    Changes from previous version:
    * Addition of boot-time disablement for bus-level translation/isolation
    (e.g., enable userspace DMA for things like X)
    * Usage of newer IOMMU abstraction functions

    Signed-off-by: Muli Ben-Yehuda
    Signed-off-by: Jon Mason
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jon Mason
     

10 Apr, 2006

1 commit

  • From: Keith Mannthey, Andi Kleen

    Implement memory hotadd without sparsemem. The memory in the SRAT
    hotadd area is just preserved instead and can be activated later.

    There are a few restrictions:
    - Only one continuous hotadd area allowed per node

    The main problem is dealing with the many buggy SRAT tables
    that are out there. The strategy here is to reject anything
    suspicious.

    Originally from Keith Mannthey, with several hacks and changes by AK
    and also contributions from Andrew Morton

    [ TBD: Problems pointed out by KAMEZAWA Hiroyuki :

    1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.

    Rebuilding the zonelist is necessary when the system has only memory <
    4G at boot and hot-adds memory > 4G. Because x86_64 has DMA32,
    ZONE_NORMAL is not included in the zonelist at boot time if the system
    doesn't have memory > 4G at boot.

    [AK: should just force the higher zones at boot time when SRAT tells us]

    2) zone and node's spanned_pages and present_pages are not incremented.
    They should be.

    For example, our server (ia64/Fujitsu PrimeQuest) can be equipped with memory
    from 4G to 1T (maybe 2T in future), and SRAT will *always* say we have
    possible 1T memory. (Microsoft requires "write all possible memory
    in SRAT") When we reserve memmap for possible 1T memory, Linux will
    not work well in a minimum 4G configuration ;)

    [AK: needs limiting to 5-10% of max memory]
    ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

27 Feb, 2006

1 commit

  • The previous experiment for using apicmaintimer on ATI systems didn't
    work out very well. In particular laptops with C2/C3 support often
    don't let it tick during idle, which makes it useless. There were also
    some other bugs that made the apicmaintimer often not used at all.

    I tried some other experiments - running timer over RTC and some other
    things but they didn't really work well either.

    I rechecked the specs now and it turns out this simple change is
    actually enough to avoid the double ticks on the ATI systems. We just
    turn off IRQ 0 in the 8254 and only route it directly using the IO-APIC.

    I tested it on a few ATI systems and it worked there. In fact it worked
    on all chipsets (NVidia, Intel, AMD, ATI) I tried it on.

    According to the ACPI spec routing should always work through the
    IO-APIC so I think it's the correct thing to do anyways (and most of the
    old gunk in check_timer should be thrown away for x86-64).

    But for 2.6.16 it's best to do a fairly minimal change:
    - Use the known to be working everywhere-but-ATI IRQ0 both over 8254
    and IO-APIC setup everywhere
    - Except on ATI disable IRQ0 in the 8254
    - Remove the code to select apicmaintimer on ATI chipsets
    - Add some boot options to allow overriding this (just paranoia)

    In 2.6.17 I hope to switch the default over to this for everybody.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Feb, 2006

2 commits

  • On some broken motherboards (at least one NForce3 based AMD64 laptop)
    the PIT timer runs at an incorrect frequency. This patch adds a new
    option "apicpmtimer" that allows using the APIC timer and calibrating it
    using the PMTimer. It requires the earlier patch that allows running the
    main timer from the APIC.

    Specifying apicpmtimer implies apicmaintimer.

    The option defaults to off for now.

    I tested it on a few systems and the resulting APIC timer frequencies
    were usually a bit off, but always

    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Another piece from the no-idle-tick patch.

    This can be enabled with the "apicmaintimer" option.

    This is mainly useful when the PIT/HPET interrupt is unreliable.
    Note there are some systems that are known to stop the APIC
    timer in C3. For those it will never work, but this case
    should be automatically detected.

    It also only works with PM timer right now. When HPET is used
    the way the main timer handler computes the delay doesn't work.

    It should be a bit more efficient because there is one less
    regular interrupt to process on the boot processor.

    Requires earlier bugfix from Venkatesh

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

15 Jan, 2006

1 commit


12 Jan, 2006

2 commits


15 Nov, 2005

4 commits

  • CONFIG_CHECKING covered some debugging code used in the early times
    of the port. But it wasn't even SMP safe for quite some time
    and the bugs it checked for seem to be gone.

    This patch removes all the code to verify GS at kernel entry. There
    haven't been any new bugs in this area for a long time.

    Previously it also covered the sysctl for the page fault tracing.
    That didn't make much sense because that code was unconditionally
    compiled in. I made that a boot option now because it is typically
    only useful at boot.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The logging for boot errors was turned off because it was broken
    on some AMD systems. But give Intel EM64T systems a chance because they are
    supposed to be correct there.

    The advantage is that there is a chance to actually log uncorrected
    machine checks after the reset.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • With a NR_CPUS==128 kernel with CPU hotplug enabled we would waste 4MB
    on per CPU data of all possible CPUs. The reason was that HOTPLUG
    always set up possible map to NR_CPUS cpus and then we need to allocate
    that much (each CPU's per CPU data is roughly 32k now).

    The underlying problem is that ACPI didn't tell us how many hotplug CPUs
    the platform supports. So the old code just assumed all, which would
    lead to this memory wastage.

    This implements some new heuristics:

    - If the BIOS specified disabled CPUs in the ACPI/mptables, assume they
    can be enabled later (this is bending the ACPI specification a bit,
    but seems like an obvious extension)
    - The user can override it with a new additional_cpus=NUM option
    - Otherwise use half of the available CPUs or 2, whichever is more.
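
    The heuristics above can be sketched roughly as follows. This is only a
    model: the function and parameter names are hypothetical, and giving the
    user option priority over the BIOS information is an assumption about how
    the three rules combine.

```python
PER_CPU_KB = 32  # rough per CPU data size quoted in this message


def hotplug_cpu_estimate(available, bios_disabled=0, user_override=None):
    """Estimate how many CPUs might be hot-added later.

    available     -- CPUs present at boot
    bios_disabled -- CPUs the BIOS listed as disabled in ACPI/mptables
    user_override -- value of the new boot option, if the user gave one
    """
    if user_override is not None:
        return user_override
    if bios_disabled > 0:
        return bios_disabled
    # Fallback: half of the available CPUs or 2, whichever is more.
    return max(available // 2, 2)


def per_cpu_data_kb(possible_cpus):
    """Memory spent on per CPU data for all possible CPUs, in kilobytes."""
    return possible_cpus * PER_CPU_KB
```

    With the old behaviour, per_cpu_data_kb(128) is 4096, the 4MB mentioned at
    the top. A 4 CPU machine with no BIOS hints would instead allow for 2
    additional CPUs, so per_cpu_data_kb(4 + 2) is 192.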

    Cc: ashok.raj@intel.com
    Cc: len.brown@intel.com

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • I got some questions on this, so just fix up the documentation.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

13 Sep, 2005

1 commit


08 Aug, 2005

1 commit

  • Don't log machine check events left over from boot. Too many BIOSes leave
    bogus events in there.

    This unfortunately also makes it impossible to log events that caused a
    reboot. For people with a non-broken BIOS there is mce=bootlog.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

29 Jul, 2005

1 commit


21 May, 2005

1 commit

  • This works around the too fast timer seen on some ATI boards.

    I don't feel confident enough about it yet to enable it by default, but give
    users the option.

    Patch and debugging from Christopher Allen Wing, with
    minor tweaks (renamed the option and documented it)

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds