23 Jun, 2008

1 commit

  • This patch addresses a very sporadic pi-futex related failure in
    highly threaded Java apps on large SMP systems.

    David Holmes reported that the pi_state consistency check in
    lookup_pi_state triggered with his test application. This means that
    the kernel internal pi_state and the user space futex variable are out
    of sync. At first we assumed that this was user space data corruption,
    but deeper investigation revealed that the problem happened because the
    pi-futex code does not handle a fault in the futex_lock_pi path when
    the user space variable needs to be fixed up.

    The fault happens when a fork mapped the anon memory which contains
    the futex readonly for COW or the page got swapped out exactly between
    the unlock of the futex and the return of either the new futex owner
    or the task which was the expected owner but failed to acquire the
    kernel internal rtmutex. The current futex_lock_pi() code drops out
    with an inconsistent state in case it faults and returns -EFAULT to user
    space. User space has no way to fix up that state.

    When we wrote this code we thought that we could not drop the hash
    bucket lock at this point to handle the fault.

    After analysing the code again, that assumption turned out to be wrong
    because there are only two tasks involved which might modify the
    pi_state and the user space variable:

    - the task which acquired the rtmutex
    - the pending owner of the pi_state which did not get the rtmutex

    Both tasks drop into the fixup_pi_state() function before returning to
    user space. The first task which acquired the hash bucket lock faults
    in the fixup of the user space variable, drops the spinlock and calls
    futex_handle_fault() to fault in the page. Now the second task can
    acquire the hash bucket lock and try to fix up the user space
    variable as well. It either faults as well or it succeeds because the
    first task already faulted the page in.

    One caveat is to avoid a double fixup. After returning from the fault
    handling we reacquire the hash bucket lock and check whether the
    pi_state owner has been modified already.
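
    The recheck after the fault can be sketched in user space as follows.
    This is a minimal single-threaded model, not the kernel/futex.c code:
    futex_sim, put_user_sim(), handle_fault_sim() and fixup_owner_sim() are
    hypothetical names, and the second task is simulated by the
    racer_newowner field rather than by a real thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one PI futex; not the kernel/futex.c data types. */
struct futex_sim {
    int  uval;            /* "user space" futex word: TID of the owner     */
    int  pi_owner;        /* owner recorded in the kernel-side pi_state    */
    bool page_present;    /* false models a COW-protected or swapped page  */
    bool bucket_locked;   /* models the hash bucket spinlock               */
    int  racer_newowner;  /* nonzero: a second task fixes up to this TID   */
                          /* while the first one handles its fault         */
};

static void bucket_lock(struct futex_sim *f)
{
    assert(!f->bucket_locked);
    f->bucket_locked = true;
}

static void bucket_unlock(struct futex_sim *f)
{
    assert(f->bucket_locked);
    f->bucket_locked = false;
}

/* Models the user space write: fails (faults) if the page is not present. */
static bool put_user_sim(struct futex_sim *f, int val)
{
    if (!f->page_present)
        return false;
    f->uval = val;
    return true;
}

/* Models fault handling, which may sleep and so must run without the lock. */
static void handle_fault_sim(struct futex_sim *f)
{
    assert(!f->bucket_locked);
    f->page_present = true;

    if (f->racer_newowner != 0) {
        /* The other task gets the bucket lock first and completes the fixup. */
        bucket_lock(f);
        if (put_user_sim(f, f->racer_newowner))
            f->pi_owner = f->racer_newowner;
        bucket_unlock(f);
        f->racer_newowner = 0;
    }
}

/*
 * Called and returns with the bucket lock held.
 * Returns 1 if this caller performed the fixup, 0 if another task already did.
 */
static int fixup_owner_sim(struct futex_sim *f, int newowner)
{
    int oldowner = f->pi_owner;

    for (;;) {
        if (put_user_sim(f, newowner)) {
            f->pi_owner = newowner;
            return 1;
        }
        bucket_unlock(f);
        handle_fault_sim(f);
        bucket_lock(f);

        /* Avoid the double fixup: did the owner change while we were unlocked? */
        if (f->pi_owner != oldowner)
            return 0;
    }
}
```

    In this model the first caller retries the write after the fault unless
    the recorded owner has changed in the meantime, in which case it leaves
    the user space word alone.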

    Reported-by: David Holmes
    Signed-off-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: David Holmes
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
    1 file changed, 73 insertions(+), 20 deletions(-)

    Thomas Gleixner
     

22 Jun, 2008

4 commits


21 Jun, 2008

29 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    netns: Don't receive new packets in a dead network namespace.
    sctp: Make sure N * sizeof(union sctp_addr) does not overflow.
    pppoe: warning fix
    ipv6: Drop packets for loopback address from outside of the box.
    ipv6: Remove options header when setsockopt's optlen is 0
    mac80211: detect driver tx bugs

    Linus Torvalds
     
  • Alexey Dobriyan writes:
    > Subject: ICMP sockets destruction vs ICMP packets oops

    > After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
    > icmp_reply() wants ICMP socket.
    >
    > Steps to reproduce:
    >
    > launch shell in new netns
    > move real NIC to netns
    > setup routing
    > ping -i 0
    > exit from shell
    >
    > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    > IP: [] icmp_sk+0x17/0x30
    > PGD 17f3cd067 PUD 17f3ce067 PMD 0
    > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
    > CPU 0
    > Modules linked in: usblp usbcore
    > Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
    > RIP: 0010:[] [] icmp_sk+0x17/0x30
    > RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
    > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
    > RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
    > RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
    > R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
    > R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
    > FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
    > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    > CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
    > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
    > Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
    > ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
    > 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
    > Call Trace:
    > [] icmp_reply+0x44/0x1e0
    > [] ? ip_route_input+0x23a/0x1360
    > [] icmp_echo+0x65/0x70
    > [] icmp_rcv+0x180/0x1b0
    > [] ip_local_deliver+0xf4/0x1f0
    > [] ip_rcv+0x33b/0x650
    > [] netif_receive_skb+0x27a/0x340
    > [] process_backlog+0x9d/0x100
    > [] net_rx_action+0x18d/0x250
    > [] __do_softirq+0x75/0x100
    > [] call_softirq+0x1c/0x30
    > [] do_softirq+0x65/0xa0
    > [] irq_exit+0x97/0xa0
    > [] do_IRQ+0xa8/0x130
    > [] ? mwait_idle+0x0/0x60
    > [] ret_from_intr+0x0/0xf
    > [] ? mwait_idle+0x4c/0x60
    > [] ? mwait_idle+0x43/0x60
    > [] ? cpu_idle+0x57/0xa0
    > [] ? rest_init+0x70/0x80
    > Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
    > 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 8b 04 c3 48 83 c4 08
    > 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
    > RIP [] icmp_sk+0x17/0x30
    > RSP
    > CR2: 0000000000000000
    > ---[ end trace ea161157b76b33e8 ]---
    > Kernel panic - not syncing: Aiee, killing interrupt handler!

    Receiving packets while we are cleaning up a network namespace is a
    racy proposition. It is possible when the packet arrives that we have
    removed some but not all of the state we need to fully process it. We
    have the choice of either playing whack-a-mole with the cleanup routines
    or simply dropping packets when we don't have a network namespace to
    handle them.

    Since the check looks inexpensive in netif_receive_skb let's just
    drop the incoming packets.
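
    The idea of dropping early can be modelled in a few lines of user space
    C. This is only a sketch under assumed names (netns_sim, packet_sim,
    receive_packet_sim() and the verdict values are hypothetical), not the
    change made to netif_receive_skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rx_verdict { RX_SUCCESS = 0, RX_DROP = 1 };

/* Hypothetical model of a network namespace being torn down. */
struct netns_sim {
    bool alive;               /* cleared when namespace cleanup starts */
    unsigned long delivered;  /* packets handed to protocol handlers   */
};

struct packet_sim {
    struct netns_sim *ns;     /* namespace of the receiving device */
};

/* One cheap check at the receive entry point replaces per-protocol checks. */
static enum rx_verdict receive_packet_sim(struct packet_sim *pkt)
{
    assert(pkt != NULL && pkt->ns != NULL);

    if (!pkt->ns->alive)
        return RX_DROP;       /* never reaches ICMP or other handlers */

    pkt->ns->delivered++;
    return RX_SUCCESS;
}
```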

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • As noticed by Gabriel Campana, the kmalloc() length arg
    passed in by sctp_getsockopt_local_addrs_old() can overflow
    if ->addr_num is large enough.

    Therefore, enforce an appropriate limit.
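
    A generic way to guard an N * sizeof(element) allocation is sketched
    below in user space C. The helper names checked_array_size() and
    alloc_addr_array() are hypothetical, and the limit used here (SIZE_MAX)
    is an assumption for illustration, not necessarily the limit this
    patch enforces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stores n * elem_size in *out and returns 1, or returns 0 on overflow. */
static int checked_array_size(size_t n, size_t elem_size, size_t *out)
{
    assert(elem_size != 0);

    if (n > SIZE_MAX / elem_size)
        return 0;             /* the product would wrap around */

    *out = n * elem_size;
    return 1;
}

/* Allocates room for n elements, refusing counts that would overflow. */
static void *alloc_addr_array(size_t n, size_t elem_size)
{
    size_t bytes;

    if (!checked_array_size(n, elem_size, &bytes))
        return NULL;

    return malloc(bytes);
}
```

    Without the check, a large caller-supplied count makes the product wrap
    to a small value, so the allocation succeeds but is far too short for
    the data later copied into it.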

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Fix warning:
    drivers/net/pppoe.c: In function 'pppoe_recvmsg':
    drivers/net/pppoe.c:945: warning: comparison of distinct pointer types lacks a cast
    because skb->len is unsigned int and total_len is size_t
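
    The warning comes from the compile-time type check in the kernel's
    min() macro, which rejects operands of different types. The usual
    remedy is to bring both operands to one type first, for example with
    min_t(). A portable sketch of the same idea follows; bytes_to_copy()
    is a hypothetical helper, not the code in pppoe_recvmsg.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smaller of the caller's buffer size and the packet length. */
static size_t bytes_to_copy(size_t total_len, unsigned int pkt_len)
{
    size_t len = (size_t)pkt_len;   /* widen explicitly to a common type */
    size_t n = total_len < len ? total_len : len;

    assert(n <= total_len && n <= len);
    return n;
}
```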

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
    [IA64] SN2: security hole in sn2_ptc_proc_write

    Linus Torvalds
     
  • Which was removed in the hope that the generic legacy IDE quirk in
    drivers/pci/probe.c is sufficient for Cypress IDE.
    It isn't, as this controller has a non-standard BAR layout:
    secondary channel registers are in the BAR0-1 of the second
    PCI function - not in the BAR2-3 of the same function, as the
    generic quirk routine assumes.

    Signed-off-by: Ivan Kokshaysky
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     
  • The vast majority of these build failures are gcc-4.3 warnings
    about static functions and objects being referenced from
    non-static (read: "extern inline") functions, in conjunction
    with our -Werror.

    We cannot just convert "extern inline" to "static inline",
    as people keep suggesting all the time, because "extern inline"
    logic is crucial for the generic kernel build.
    So
    - just make sure that all callees of critical "extern inline"
    functions are also "extern inline";
    - use "static inline", wherever it's possible.

    traps.c: work around gcc-4.3 being too smart about array
    bounds-checking.

    TODO: add "gnu_inline" attribute to all our "extern inline"
    functions to ensure desired behaviour with future compilers.

    Signed-off-by: Ivan Kokshaysky
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     
  • With the built-in SCSI disk driver, the final link fails with the
    following error:
    `.exit.text' referenced in section `.rodata' of drivers/built-in.o:
    defined in discarded section `.exit.text' of drivers/built-in.o

    This happens with -Os (CONFIG_CC_OPTIMIZE_FOR_SIZE=y) with all gcc-4
    versions, and also with -O2 and gcc-4.3.

    The problem is sd.c:sd_major() being inlined into the __exit function
    exit_sd(), and the compiler generating a jump table in the .rodata
    section for the 'switch' statement in sd_major(). So we have references
    to a discarded section.

    Fixed with a big hammer in the form of -fno-jump-tables.

    Note that jump tables vs. discarded sections is a generic problem,
    other architectures are just lucky not to suffer from it. But with
    a slightly more complex switch/case statement it can be reproduced
    on x86 as well. So maybe at some point we should consider
    -fno-jump-tables as a generic compile option...

    Signed-off-by: Ivan Kokshaysky
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     
  • To calculate addresses of locally defined variables, GCC uses a 32-bit
    displacement from the GP. That doesn't work for per cpu variables in
    modules, as the offset to the kernel per cpu area is way above 4G.

    The workaround is to force allocation of a GOT entry for the per cpu
    variable using an ldq instruction with a 'literal' relocation.
    I had to use a custom asm/percpu.h, as the required argument magic doesn't
    work with the asm-generic/percpu.h macros.

    Signed-off-by: Ivan Kokshaysky
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     
  • Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6:
    BAST: Remove old IDE driver
    pcmcia ide kingston compactflash's have a new manufacturer id
    pcmcia: add another pata/ide ID
    pcmcia: add an pata/ide ID
    ide: increase timeout in wait_drive_not_busy()
    palm_bk3710: fix resource management

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
    ieee1394: Kconfig menu touch-up
    firewire: Kconfig menu touch-up
    firewire: deadline for PHY config transmission
    firewire: fw-ohci: unify printk prefixes
    firewire: fill_bus_reset_event needs lock protection
    firewire: fw-ohci: write selfIDBufferPtr before LinkControl.rcvSelfID
    firewire: fw-ohci: disable PHY packet reception into AR context
    firewire: fw-ohci: use of uninitialized data in AR handler
    firewire: don't panic on invalid AR request buffer

    Linus Torvalds
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
    ACPI: no AC status notification
    ACPI Exception (video-1721): UNKNOWN_STATUS_CODE, Cant attach device

    Linus Torvalds
     
  • * 'drm-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (21 commits)
    drm: only trust core drm ioctls - driver ioctls are a mess.
    drm/i915: add support for Intel series 4 chipsets.
    drm/radeon: add hier-z registers for r300 and r500 chipsets
    drm/radeon: use DSTCACHE_CTLSTAT rather than RB2D_DSTCACHE_CTLSTAT
    drm/radeon: switch IGP gart to use radeon_write_agp_base()
    drm/radeon: Restore sw interrupt on resume
    drm/r500: add support for AGP based cards.
    drm/radeon: fix texture uploads with large 3d textures (bug 13980)
    drm/radeon: add initial r500 support.
    drm/radeon: init pipe setup in kernel code.
    drm/radeon: fixup radeon_do_engine_reset
    drm/radeon: fix pixcache and purge/cache flushing registers
    drm/radeon: write AGP_BASE_2 on chips that support it.
    drm/radeon: merge IGP chip setup and fixup RS400 vs RS480 support
    drm/radeon: IGP clean up register and magic numbers.
    drm/rs690: set base 2 to 0.
    drm/rs690: set all of gart base address.
    radeon: add production microcode from AMD
    drm: pcigart use proper pci map interfaces.
    drm: the sg alloc ioctl should write back the handle to userspace
    ...

    Linus Torvalds
     
  • * 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
    [agp]: fixup chipset flush for new Intel G4x.
    agp: brown paper bag patch - put back the two lines it took out.

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
    rcupreempt: remove export of rcu_batches_completed_bh
    cpuset: limit the input of cpuset.sched_relax_domain_level

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue
    sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
    sched: rt-group: fix RR buglet
    sched: rt-group: heirarchy aware throttle
    sched: rt-group: fix hierarchy
    sched: NULL pointer dereference while setting sched_rt_period_us
    sched: fix defined-but-unused warning

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, geode: add a VSA2 ID for General Software
    x86: use BOOTMEM_EXCLUSIVE on 32-bit
    x86, 32-bit: fix boot failure on TSC-less processors
    x86: fix NULL pointer deref in __switch_to
    x86: set PAE PHYSICAL_MASK_SHIFT to 44 bits.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
    Blackfin Serial Driver: Use timer to poll CTS PIN instead of workqueue.
    Blackfin arch: fix typo error in bf548 serial header file

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    ahci: sis can't do PMP
    ata_piix: add TECRA M4 to broken suspend list
    LIBATA: Add HAVE_PATA_PLATFORM to select PATA_PLATFORM driver
    sata_mv: warn on PIO with multiple DRQs
    sata_mv: enable async_notify for 60x1 Rev.C0 and higher
    libata: don't check whether to use DMA or not for no data commands
    ahci: jmb361 has only one port

    Linus Torvalds
     
  • The inline assembly in drivers/watchdog/hpwdt.c was incredibly broken,
    and included all the function prologue and epilogue stuff, even though
    it was itself then inside a C function where the compiler would add its
    own prologue and epilogue on top of it all.

    This then just _happened_ to work if you had exactly the right compiler
    version and exactly the right compiler flags, so that gcc just happened
    to not create any prologue at all (the gcc-generated epilogue wouldn't
    matter, since it would never be reached).

    But the more proper way to fix it is to simply not do this. Move the
    inline asm to the top level, with no surrounding function at all (the
    better alternative would be to remove the prologue and make it actually
    use proper description of the arguments to the inline asm, but that's a
    bigger change than the one I'm willing to make right now).

    Tested-by: S.Çağlar Onur
    Acked-by: Thomas Mingarelli
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Security hole in sn2_ptc_proc_write

    It is possible to overrun a buffer with a write to this /proc file.
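
    The class of bug can be illustrated with a user space sketch of a
    write handler that copies caller-supplied data into a fixed buffer.
    The names proc_write_sim() and OPT_BUF_SIZE and the buffer size are
    hypothetical, and the exact bound used by the real fix is not shown
    here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define OPT_BUF_SIZE 64   /* hypothetical size of the handler's buffer */

/*
 * Copies count bytes from ubuf into out (OPT_BUF_SIZE bytes long) and
 * NUL-terminates the result. Rejects lengths that would not fit instead
 * of trusting the caller.
 */
static long proc_write_sim(const char *ubuf, size_t count, char *out)
{
    assert(ubuf != NULL && out != NULL);

    if (count == 0 || count >= OPT_BUF_SIZE)
        return -EINVAL;   /* leave room for the terminating NUL */

    memcpy(out, ubuf, count);
    out[count] = '\0';
    return (long)count;
}
```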

    Signed-off-by: Cliff Wickman
    Signed-off-by: Tony Luck

    Cliff Wickman
     
  • Remove the old BAST IDE driver, as we are now using the platform-pata
    support.

    Signed-off-by: Ben Dooks
    Cc: Jeff Garzik
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Ben Dooks
     
  • Up to now, Kingston CompactFlash cards (ab)used the Toshiba manufacturer's ID.
    In their new CF cards, they use a new one. Let the IDE subsystem
    recognize CF cards with the new ID.

    Signed-off-by: Christophe Niclaes
    Acked-by: Philippe De Muyter
    Cc: Alan Cox
    Cc: Dominik Brodowski
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Christophe Niclaes
     
  • Addition of Transcend 1GB 45x id so that it is properly detected.

    [bart: fix typo in ide-cs's ID spotted by Alan Cox]

    Signed-off-by: William Peters
    Signed-off-by: Kristoffer Ericson
    CC: Alan Cox
    CC: linux-ide@vger.kernel.org
    Signed-off-by: Dominik Brodowski
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Kristoffer Ericson
     
  • Add an id for:

    product info: "M-Systems", "CF300", ""
    manfid: 0x000a, 0x0000
    function: 4 (fixed disk)

    Signed-off-by: Matt Reimer
    CC: Alan Cox
    CC: linux-ide@vger.kernel.org
    Signed-off-by: Dominik Brodowski
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Matt Reimer
     
  • Some ATAPI devices take longer than the current max timeout value to
    become ready (e.g. TEAC DV-W28ECW takes 6 ms) so increase the timeout
    value to 10 ms.

    This fixes kernel.org bugzilla bug #10887:
    http://bugzilla.kernel.org/show_bug.cgi?id=10887

    Reported-by: Masanari Iida
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Bartlomiej Zolnierkiewicz
     
  • The driver expected a *virtual* address in the IDE platform device's memory
    resource and didn't request the memory region for the register block. Fix this
    taking into account the fact that DaVinci SoC devices are fixed-mapped to the
    virtual memory early and we can get their virtual addresses using IO_ADDRESS()
    macro, not having to call ioremap()...

    While at it, also do some cosmetic changes...

    Signed-off-by: Sergei Shtylyov
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Sergei Shtylyov
     
  • KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
    557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
    the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
    generally now populate the VM with real empty pages needlessly.

    We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
    since fault handling no longer uses ZERO_PAGE for new anonymous pages,
    we now need to handle that special case in follow_page() instead.

    In particular, the removal of ZERO_PAGE effectively removed the core
    file writing optimization where we would skip writing pages that had not
    been populated at all, and increased memory pressure a lot by allocating
    all those useless newly zeroed pages.

    This reinstates the optimization by making the unmapped PTE case the
    same as for a non-existent page table, which already did this correctly.

    While at it, this also fixes the XIP case for follow_page(), where the
    caller could not differentiate between the case of a page that simply
    could not be used (because it had no "struct page" associated with it)
    and a page that just wasn't mapped.

    We do that by simply returning an error pointer for pages that could not
    be turned into a "struct page *". The error is arbitrarily picked to be
    EFAULT, since that was what get_user_pages() already used for the
    equivalent IO-mapped page case.
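
    The three outcomes a caller has to tell apart can be sketched in user
    space with an error-pointer encoding modelled on the kernel's
    ERR_PTR()/IS_ERR() helpers. Everything below (the *_sim names, the
    pte_kind enum, the 4095 limit as used here) is a simplified stand-in
    for illustration, not the follow_page() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SIM 4095   /* the top 4095 addresses encode error codes */

static void *err_ptr_sim(long err)
{
    assert(err < 0 && err >= -MAX_ERRNO_SIM);
    return (void *)(intptr_t)err;
}

static int is_err_sim(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SIM;
}

static long ptr_err_sim(const void *p)
{
    return (long)(intptr_t)p;
}

struct page_sim { int id; };

enum pte_kind {
    PTE_NONE,            /* nothing mapped at this address          */
    PTE_NO_STRUCT_PAGE,  /* mapped, but no "struct page" behind it  */
    PTE_NORMAL           /* ordinary page                           */
};

/* NULL means "not mapped"; an error pointer means "mapped but unusable". */
static struct page_sim *follow_page_sim(enum pte_kind kind,
                                        struct page_sim *backing)
{
    switch (kind) {
    case PTE_NONE:
        return NULL;
    case PTE_NO_STRUCT_PAGE:
        return err_ptr_sim(-EFAULT);
    case PTE_NORMAL:
        return backing;
    }
    return NULL;
}
```

    A caller checks is_err_sim() first, then NULL, and only then uses the
    returned page.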

    [ Also removed an impossible test for pte_offset_map_lock() failing:
    that's not how that function works ]

    Acked-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 Jun, 2008

6 commits