11 Oct, 2008

13 commits

  • Conflicts:
    arch/x86/kernel/pci-gart_64.c
    include/asm-x86/dma-mapping.h

    Ingo Molnar
     
  • Conflicts:
    arch/x86/mm/init_64.c

    Ingo Molnar
     
  • Jeremy Fitzhardinge wrote:

    > I'd noticed that current tip/master hasn't been booting under Xen, and I
    > just got around to bisecting it down to this change.
    >
    > commit 065ae73c5462d42e9761afb76f2b52965ff45bd6
    > Author: Suresh Siddha
    >
    > x86, cpa: make the kernel physical mapping initialization a two pass sequence
    >
    > This patch is causing Xen to fail various pagetable updates because it
    > ends up remapping pagetables to RW, which Xen explicitly prohibits (as
    > that would allow guests to make arbitrary changes to pagetables, rather
    > than have them mediated by the hypervisor).

    Instead of making init a two-pass sequence to satisfy Intel's TLB
    application note (developer.intel.com/design/processor/applnots/317080.pdf,
    Section 6, page 26), we preserve the original page permissions
    when fragmenting the large mappings and don't touch the existing memory
    mapping (which satisfies Xen's requirements).

    The only open issue is that on a native Linux kernel we go back to
    mapping the first 0-1GB of the kernel identity mapping as executable
    (because of the static mapping set up in head_64.S). We can fix this
    in a separate patch if needed.

    Signed-off-by: Suresh Siddha
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Clean up recently added code to be more consistent with other x86 code.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Fix the _end alignment check, which can trigger a crash if _end happens
    to fall on a page boundary.

    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Track the memtype for RAM pages in page struct instead of using the
    memtype list. This avoids the explosion in the number of entries in
    memtype list (of the order of 20,000 with AGP) and makes the PAT
    tracking simpler.

    We use the PG_arch_1 bit in page->flags for this.

    We still use the memtype list for non-RAM pages.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Ingo Molnar

    Suresh Siddha
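The per-page flag idea above can be sketched as a tiny userspace model. This is illustrative only: the bit position, struct, and helper names below are invented for the sketch and are not the kernel's `page->flags` machinery.

```c
#include <assert.h>

/* Illustrative model of tracking memtype state per page in a flags
 * word (like PG_arch_1 in page->flags) instead of keeping one entry
 * per RAM page in a global memtype list. Bit position is made up. */
#define PG_ARCH_1 (1UL << 9)

struct model_page {
    unsigned long flags;
};

void set_page_memtype_tracked(struct model_page *p)
{
    p->flags |= PG_ARCH_1;      /* O(1), no list entry allocated */
}

void clear_page_memtype_tracked(struct model_page *p)
{
    p->flags &= ~PG_ARCH_1;
}

int page_memtype_tracked(const struct model_page *p)
{
    return !!(p->flags & PG_ARCH_1);
}
```

This is why the change avoids the memtype-list explosion: the state lives in a bit the page structure already carries, so 20,000 AGP pages cost zero extra entries.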
     
  • Do a global TLB flush after splitting the large page and before we do
    the actual change of the page attribute in the PTE.

    Without this, we violate the TLB application note, which says:
    "The TLBs may contain both ordinary and large-page translations for
    a 4-KByte range of linear addresses. This may occur if software
    modifies the paging structures so that the page size used for the
    address range changes. If the two translations differ with respect
    to page frame or attributes (e.g., permissions), processor behavior
    is undefined and may be implementation-specific."

    Also serialize cpa() (for !DEBUG_PAGEALLOC, which uses large identity
    mappings) using cpa_lock, so that no CPU with stale large TLB entries
    can change a page attribute in parallel with another CPU that is
    splitting a large page entry and changing the attribute.

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
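The ordering rule can be modeled as follows. This is a hypothetical sketch, not the kernel's cpa code: the enum and function are invented to make the split → flush → set-attribute sequence explicit.

```c
/* Hypothetical model of the rule above: after splitting a large
 * mapping into 4K entries, do a global TLB flush before writing the
 * new attribute, so no CPU can hold a stale large-page translation
 * alongside a differing 4K translation (undefined per the app note). */
enum cpa_step { STEP_SPLIT, STEP_FLUSH_TLB, STEP_SET_ATTR };

int change_page_attr_model(enum cpa_step log[3])
{
    int n = 0;
    log[n++] = STEP_SPLIT;     /* break the large entry into 4K PTEs   */
    log[n++] = STEP_FLUSH_TLB; /* drop stale large-page translations   */
    log[n++] = STEP_SET_ATTR;  /* now safe to change the PTE attribute */
    return n;
}
```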
     
  • Interrupt context no longer splits large pages in cpa(), so we can do
    away with the cpa memory pool code.

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • No alias checking is needed when setting a present/not-present mapping.
    Otherwise, we may need to break large pages for 64-bit kernel text
    mappings (which adds complexity, especially if we want to do this from
    atomic context, for example with CONFIG_DEBUG_PAGEALLOC). Let's keep
    it simple!

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Don't use large pages for kernel identity mapping with DEBUG_PAGEALLOC.
    This will remove the need to split the large page for the
    allocated kernel page in the interrupt context.

    This simplifies the cpa code, as we no longer do the split from
    interrupt context; the cpa code is simplified further in the
    subsequent patches.

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • In the first pass, the kernel physical mapping is set up using large or
    small pages, but with the same PTE attributes as the early PTE
    attributes set up by the early boot code in head_[32|64].S.

    After flushing the TLBs, we go through the second pass, which sets up
    the direct-mapped PTEs with the appropriate attributes (like NX,
    GLOBAL, etc.) that are runtime detectable.

    This two-pass mechanism conforms to the TLB application note, which says:

    "Software should not write to a paging-structure entry in a way that would
    change, for any linear address, both the page size and either the page frame
    or attributes."

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Remove USER from the PTE/PDE attributes for the very early identity
    mapping. We overwrite these mappings with KERNEL attributes later in
    the boot. Just being paranoid here, as there is no need for the USER
    bit to be set.

    If this breaks something (we don't know the history), then we can
    simply drop this change.

    Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Signed-off-by: Suresh Siddha
    Cc: Suresh Siddha
    Cc: arjan@linux.intel.com
    Cc: venkatesh.pallipadi@intel.com
    Cc: jeremy@goop.org
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     

10 Oct, 2008

9 commits

  • This merges phase 1 of the x86 tree, which is a collection of branches:

    x86/alternatives, x86/cleanups, x86/commandline, x86/crashdump,
    x86/debug, x86/defconfig, x86/doc, x86/exports, x86/fpu, x86/gart,
    x86/idle, x86/mm, x86/mtrr, x86/nmi-watchdog, x86/oprofile,
    x86/paravirt, x86/reboot, x86/sparse-fixes, x86/tsc, x86/urgent and
    x86/vmalloc

    and as Ingo says: "these are the easiest, purely independent x86 topics
    with no conflicts, in one nice Octopus merge".

    * 'x86-v28-for-linus-phase1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (147 commits)
    x86: mtrr_cleanup: treat WRPROT as UNCACHEABLE
    x86: mtrr_cleanup: first 1M may be covered in var mtrrs
    x86: mtrr_cleanup: print out correct type v2
    x86: trivial printk fix in efi.c
    x86, debug: mtrr_cleanup print out var mtrr before change it
    x86: mtrr_cleanup try gran_size to less than 1M, v3
    x86: mtrr_cleanup try gran_size to less than 1M, cleanup
    x86: change MTRR_SANITIZER to def_bool y
    x86, debug printouts: IOMMU setup failures should not be KERN_ERR
    x86: export set_memory_ro and set_memory_rw
    x86: mtrr_cleanup try gran_size to less than 1M
    x86: mtrr_cleanup prepare to make gran_size to less 1M
    x86: mtrr_cleanup safe to get more spare regs now
    x86_64: be less annoying on boot, v2
    x86: mtrr_cleanup hole size should be less than half of chunk_size, v2
    x86: add mtrr_cleanup_debug command line
    x86: mtrr_cleanup optimization, v2
    x86: don't need to go to chunksize to 4G
    x86_64: be less annoying on boot
    x86, olpc: fix endian bug in openfirmware workaround
    ...

    Linus Torvalds
     
  • We already did that a long time ago for pnp_system_init, but
    pnpacpi_init and pnpbios_init remained as subsys_initcalls, and get
    linked into the kernel before the arch-specific routines that finalize
    the PCI resources (pci_subsys_init).

    This means that the PnP routines would either register their resources
    before the PCI layer could, or would be unable to check whether a PCI
    resource had already been registered. Both are problematic.

    I wanted to do this before 2.6.27, but every time we change something
    like this, something breaks. That said, _every_ single time we trust
    some firmware (like PnP tables) more than we trust the hardware itself
    (like PCI probing), the problems have been worse.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'upstream-2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    ata_piix: IDE Mode SATA patch for Intel Ibex Peak DeviceIDs
    libata-eh: clear UNIT ATTENTION after reset
    ata_piix: add Hercules EC-900 mini-notebook to ich_laptop short cable list
    libata: reorder ata_device to remove 8 bytes of padding on 64 bits
    [libata] pata_bf54x: Add proper PM operation
    pata_sil680: convert CONFIG_PPC_MERGE to CONFIG_PPC
    libata: Implement disk shock protection support
    [libata] Introduce ata_id_has_unload()
    PATA: RPC now selects HAVE_PATA_PLATFORM for pata platform driver
    ata_piix: drop merged SCR access and use slave_link instead
    libata: implement slave_link
    libata: misc updates to prepare for slave link
    libata: reimplement link iterator
    libata: make SCR access ops per-link

    Linus Torvalds
     
  • Linus Torvalds
     
  • This is debatable, but while we're debating it, let's disallow the
    combination of splice and an O_APPEND destination.

    It's not entirely clear what the semantics of O_APPEND should be, and
    POSIX apparently expects pwrite() to ignore O_APPEND, for example. So
    we could make up any semantics we want, including the old ones.

    But Miklos convinced me that we should at least give it some thought,
    and that accepting writes at arbitrary offsets is wrong at least for
    IS_APPEND() files (which always have O_APPEND set, even if the reverse
    isn't true: you can obviously have O_APPEND set on a regular file).

    So disallow O_APPEND entirely for now. I doubt anybody cares, and this
    way we have one less gray area to worry about.

    Reported-and-argued-for-by: Miklos Szeredi
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
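The change boils down to a single flags check at the splice entry point. A minimal userspace sketch, with an invented function name standing in for the real splice path, but the O_APPEND/-EINVAL logic matching the description:

```c
#include <errno.h>
#include <fcntl.h>

/* Minimal sketch of the guard described above: refuse to splice into
 * a destination file opened with O_APPEND, since splice writes at
 * arbitrary offsets and the O_APPEND semantics are unclear. The
 * function name is a stand-in for the real splice write path. */
long splice_check_out_flags(int f_flags)
{
    if (f_flags & O_APPEND)
        return -EINVAL;   /* disallow O_APPEND destinations for now */
    return 0;
}
```

Rejecting the combination outright, rather than inventing semantics, keeps the gray area closed until someone actually needs it.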
     
  • * 'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6:
    hwmon: (abituguru3) Enable DMI probing feature on Abit AT8 32X
    hwmon: (abituguru3) Enable reading from AUX3 fan on Abit AT8 32X
    hwmon: (adt7473) Fix some bogosity in documentation file
    hwmon: Define sysfs interface for energy consumption register
    hwmon: (it87) Prevent power-off on Shuttle SN68PT
    eeepc-laptop: Fix hwmon interface

    Linus Torvalds
     
  • * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
    [CPUFREQ] correct broken links and email addresses

    Linus Torvalds
     
  • This fixes the previous fix, which was completely wrong on closer
    inspection. This version has been manually tested with a user-space
    test harness and generates sane values. A nearly identical patch has
    been boot-tested.

    The problem arose from changing how kmalloc/kfree handled alignment
    padding without updating ksize to match. This brings it in sync.

    Signed-off-by: Matt Mackall
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Replace the no-longer-working links and email addresses in the
    documentation and in the source code.

    Signed-off-by: Márton Németh
    Signed-off-by: Dave Jones

    Németh Márton
     

09 Oct, 2008

9 commits


08 Oct, 2008

6 commits

  • Because of rounding, in certain conditions (i.e., when in the
    congestion avoidance state rho is smaller than 1/128 of the current
    cwnd), TCP Hybla congestion control starves and the cwnd is kept
    constant forever.

    This patch forces an increment by one segment after snd_cwnd calls
    without increments (NewReno behavior).

    Signed-off-by: Daniele Lacamera
    Signed-off-by: David S. Miller

    Daniele Lacamera
     
  • Benjamin Thery tracked down a bug that explains many instances
    of the error

    unregister_netdevice: waiting for %s to become free. Usage count = %d

    It turns out that netdev_run_todo can deadlock with itself if
    a second instance of it is run in a thread that will then free
    a reference to the device waited on by the first instance.

    The problem is really quite silly. We were trying to create
    parallelism where none was required. As netdev_run_todo always
    follows an RTNL section, and todo tasks can only be added with
    the RTNL held, by definition you should only need to wait for
    the very ones you've added and be done with it.

    There is no need for a second mutex or spinlock.

    This is exactly what the following patch does.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
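The reasoning can be sketched with a per-caller todo list. This is an illustrative userspace model with invented names, not the real net_todo_list code:

```c
/* Model of the idea above: each "RTNL section" collects its own todo
 * entries and, after dropping the lock, drains exactly those entries.
 * Since entries can only be added while the lock is held, the section
 * that added them is the only one that needs to wait for them, so no
 * second mutex or spinlock is required. */
#define MAX_TODO 8

struct todo_list {
    int entries[MAX_TODO];
    int count;
};

void todo_add(struct todo_list *l, int dev_id)
{
    if (l->count < MAX_TODO)
        l->entries[l->count++] = dev_id;
}

int todo_run(struct todo_list *l)
{
    int done = l->count;
    l->count = 0;        /* drain only what this caller added */
    return done;
}
```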
     
  • David S. Miller
     
  • From: Ali Saidi

    When TCP receive copy offload is enabled it's possible that
    tcp_rcv_established() will cause two acks to be sent for a single
    packet. In the case that a tcp_dma_early_copy() is successful,
    copied_early is set to true which causes tcp_cleanup_rbuf() to be
    called early which can send an ack. Further along in
    tcp_rcv_established(), __tcp_ack_snd_check() is called and will
    schedule a delayed ACK. If no packets are processed before the delayed
    ack timer expires the packet will be acked twice.

    Signed-off-by: David S. Miller

    Ali Saidi
     
  • Jesper Dangaard Brouer reported a bug when setting a VLAN
    device down that is in promiscuous mode:

    When the VLAN device is set down, the promiscuous count on the real
    device is decremented by one by vlan_dev_stop(). When removing the
    promiscuous flag from the VLAN device afterwards, the promiscuous
    count on the real device is decremented a second time by the
    vlan_change_rx_flags() callback.

    The root cause for this is that the ->change_rx_flags() callback is
    invoked while the device is down. The synchronization is meant to mirror
    the behaviour of the ->set_rx_mode callbacks, meaning the ->open function
    is responsible for doing a full sync on open, the ->close() function is
    responsible for doing full cleanup on ->stop() and ->change_rx_flags()
    is meant to do incremental changes while the device is UP.

    Only invoke ->change_rx_flags() while the device is UP to provide the
    intended behaviour.

    Tested-by: Jesper Dangaard Brouer

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • SLOB's ksize calculation was braindamaged and generally harmlessly
    underreported the allocation size. But for very small buffers, it could
    in fact overreport them, leading code depending on krealloc to overrun
    the allocation and trample other data.

    Signed-off-by: Matt Mackall
    Tested-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Matt Mackall
     

07 Oct, 2008

3 commits

  • This reverts commit 135aedc38e812b922aa56096f36a3d72ffbcf2fb, as
    requested by Hans Verkuil.

    It was a patch for 2.6.28 where the BKL was pushed down from v4l core to
    the drivers, not for 2.6.27!

    Requested-by: Hans Verkuil
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Linus Torvalds
     
  • * Theodore Ts'o (tytso@mit.edu) wrote:
    >
    > I've been playing with adding some markers into ext4 to see if they
    > could be useful in solving some problems along with Systemtap. It
    > appears, though, that as of 2.6.27-rc8, markers defined in code which is
    > compiled directly into the kernel (i.e., not as modules) don't show up
    > in Module.markers:
    >
    > kvm_trace_entryexit arch/x86/kvm/kvm-intel %u %p %u %u %u %u %u %u
    > kvm_trace_handler arch/x86/kvm/kvm-intel %u %p %u %u %u %u %u %u
    > kvm_trace_entryexit arch/x86/kvm/kvm-amd %u %p %u %u %u %u %u %u
    > kvm_trace_handler arch/x86/kvm/kvm-amd %u %p %u %u %u %u %u %u
    >
    > (Note the lack of any of the kernel_sched_* markers, and the markers I
    > added for ext4_* and jbd2_* are missing as well.)
    >
    > Systemtap apparently depends on in-kernel trace_mark being recorded in
    > Module.markers, and apparently it's been claimed that it used to be
    > there. Is this a bug in systemtap, or in how Module.markers is getting
    > built? And is there a file that contains the equivalent information
    > for markers located in non-modules code?

    I think the problem comes from "markers: fix duplicate modpost entry"
    (commit d35cb360c29956510b2fe1a953bd4968536f7216)

    Especially:

    - add_marker(mod, marker, fmt);
    + if (!mod->skip)
    + add_marker(mod, marker, fmt);
    }
    return;
    fail:

    Here is a fix that should take care of this problem.

    Thanks for the bug report!

    Signed-off-by: Mathieu Desnoyers
    Tested-by: "Theodore Ts'o"
    CC: Greg KH
    CC: David Smith
    CC: Roland McGrath
    CC: Sam Ravnborg
    CC: Wenji Huang
    CC: Takashi Nishiie
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers