11 Sep, 2010

6 commits


10 Sep, 2010

34 commits

  • The workqueue implementation in 2.6.36-rcX has changed, resulting
    in the workqueues no longer having dedicated threads for work
    processing. This has caused severe livelocks under heavy parallel
    create workloads because the log IO completions have been getting
    held up behind metadata IO completions. Hence log commits would
    stall, memory allocation would stall because pages could not be
    cleaned, and lock contention on the AIL during inode IO completion
    processing was being seen to slow everything down even further.

    By making the log IO completion workqueue a high priority workqueue,
    log IO completions are queued ahead of all data/metadata IO completions
    and processed before them. Hence the log never gets stalled, and
    operations needed to clean memory can continue as quickly as possible.
    This avoids the livelock conditions and allows the system to keep
    running under heavy load as per normal.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • An execve with a very large total of argument/environment strings
    can take a really long time in the execve system call. It runs
    uninterruptibly to count and copy all the strings. This change
    makes it abort the exec quickly if sent a SIGKILL.

    Note that this is the conservative change, to interrupt only for
    SIGKILL, by using fatal_signal_pending(). It would be perfectly
    correct semantics to let any signal interrupt the string-copying in
    execve, i.e. use signal_pending() instead of fatal_signal_pending().
    We'll save that change for later, since it could have user-visible
    consequences, such as having a timer set too quickly make it so that
    an execve can never complete, though it always happened to work before.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This adds a preemption point during the copying of the argument and
    environment strings for execve, in copy_strings(). There is already
    a preemption point in the count() loop, so this doesn't add any new
    points in the abstract sense.

    When the total argument+environment strings are very large, the time
    spent copying them can be much more than a normal user time slice.
    So this change improves the interactivity of the rest of the system
    when one process is doing an execve with very large arguments.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
    check the size of the argument/environment area on the stack.
    When it is unworkably large, shift_arg_pages() hits its BUG_ON.
    This is exploitable with a very large RLIMIT_STACK limit, to
    create a crash pretty easily.

    Check that the initial stack is not too large to make it possible
    to map in any executable. We're not checking that the actual
    executable (or interpreter, for binfmt_elf) will fit. So those
    mappings might clobber part of the initial stack mapping. But
    that is just userland lossage that userland made happen, not a
    kernel problem.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • * 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: Perform hardware_enable in CPU_STARTING callback
    KVM: i8259: fix migration
    KVM: fix i8259 oops when no vcpus are online
    KVM: x86 emulator: fix regression with cmpxchg8b on i386 hosts

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread
    perf symbols: Fix multiple initialization of symbol system
    perf: Fix CPU hotplug
    perf, trace: Fix module leak
    tracing/kprobe: Fix handling of C-unlike argument names
    tracing/kprobes: Fix handling of argument names
    perf probe: Fix handling of arguments names
    perf probe: Fix return probe support
    tracing/kprobe: Fix a memory leak in error case
    tracing: Do not allow llseek to set_ftrace_filter

    Linus Torvalds
     
  • Fix a bug in keyctl_session_to_parent() whereby it tries to check the ownership
    of the parent process's session keyring whether or not the parent has a session
    keyring [CVE-2010-2960].

    This results in the following oops:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
    IP: [] keyctl_session_to_parent+0x251/0x443
    ...
    Call Trace:
    [] ? keyctl_session_to_parent+0x67/0x443
    [] ? __do_fault+0x24b/0x3d0
    [] sys_keyctl+0xb4/0xb8
    [] system_call_fastpath+0x16/0x1b

    if the parent process has no session keyring.

    If the system is using pam_keyinit then it is mostly protected against this,
    as all processes derived from a login will have inherited the session keyring
    created by pam_keyinit during the login procedure.

    To test this, pam_keyinit calls need to be commented out in /etc/pam.d/.

    Reported-by: Tavis Ormandy
    Signed-off-by: David Howells
    Acked-by: Tavis Ormandy
    Signed-off-by: Linus Torvalds

    David Howells
     
  • There's an unprotected access to the parent process's credentials in the middle
    of keyctl_session_to_parent(). This results in the following RCU warning:

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    security/keys/keyctl.c:1291 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    1 lock held by keyctl-session-/2137:
    #0: (tasklist_lock){.+.+..}, at: [] keyctl_session_to_parent+0x60/0x236

    stack backtrace:
    Pid: 2137, comm: keyctl-session- Not tainted 2.6.36-rc2-cachefs+ #1
    Call Trace:
    [] lockdep_rcu_dereference+0xaa/0xb3
    [] keyctl_session_to_parent+0xed/0x236
    [] sys_keyctl+0xb4/0xb6
    [] system_call_fastpath+0x16/0x1b

    The code should take the RCU read lock to make sure the parent's credentials
    don't go away, even though it's holding a spinlock and has IRQs disabled.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Range check cpu in blk_cpu_to_group
    scatterlist: prevent invalid free when alloc fails
    writeback: Fix lost wake-up shutting down writeback thread
    writeback: do not lose wakeup events when forking bdi threads
    cciss: fix reporting of max queue depth since init
    block: switch s390 tape_block and mg_disk to elevator_change()
    block: add function call to switch the IO scheduler from a driver
    fs/bio-integrity.c: return -ENOMEM on kmalloc failure
    bio-integrity.c: remove dependency on __GFP_NOFAIL
    BLOCK: fix bio.bi_rw handling
    block: put dev->kobj in blk_register_queue fail path
    cciss: handle allocation failure
    cfq-iosched: Documentation help for new tunables
    cfq-iosched: blktrace print per slice sector stats
    cfq-iosched: Implement tunable group_idle
    cfq-iosched: Do group share accounting in IOPS when slice_idle=0
    cfq-iosched: Do not idle if slice_idle=0
    cciss: disable doorbell reset on reset_devices
    blkio: Fix return code for mkdir calls

    Linus Torvalds
     
  • * 'at91-fixes-for-linus' of git://github.com/at91linux/linux-2.6-at91:
    AT91: at91sam9261ek: remove C99 comments but keep information
    AT91: at91sam9261ek board: remove warnings related to use of SPI or SD/MMC
    AT91: dm9000 initialization update
    AT91: SAM9G45 - add a separate clock entry for every single TC block
    AT91: clock: peripheral clocks can have other parent than mck
    AT91: change dma resource index

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
    ALSA: rawmidi: fix the get next midi device ioctl
    ALSA: hda - Fix wrong HP pin detection in snd_hda_parse_pin_def_config()
    ALSA: seq/oss - Fix double-free at error path of snd_seq_oss_open()
    ALSA: msnd-classic: Fix invalid cfg parameter
    ALSA: hda - Enable PC-beep for EeePC with ALC269 codec
    ALSA: hda - Add errata initverb sequence for CS42xx codecs
    ALSA: usb - Release capture substream URBs properly
    ALSA: virtuoso: fix setting of Xonar DS line-in/mic-in controls
    ALSA: virtuoso: work around missing reset in the Xonar DS Windows driver
    ALSA: hda - Add quirk for Lenovo T400s
    ALSA: usb-audio: fix detection of vendor-specific device protocol settings
    ALSA: usb-audio: Assume first control interface is for audio
    ALSA: hda - Add a new hp-laptop model for Conexant 5066, tested on HP G60

    Linus Torvalds
     
  • We don't know how to enable it safely, especially as outputs turn on and
    off. When disabling LP1 we also need to make sure LP2 and 3 are already
    disabled.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29173
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29082
    Reported-by: Chris Lord
    Signed-off-by: Jesse Barnes
    Tested-by: Daniel Vetter
    Cc: stable@kernel.org
    Signed-off-by: Chris Wilson

    Jesse Barnes
     
  • The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
    bytes of uninitialized stack memory, because the fsxattr struct
    declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
    the 12-byte fsx_pad member before copying it back to the user. This
    patch takes care of it.

    Signed-off-by: Dan Rosenberg
    Reviewed-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Dan Rosenberg
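
A minimal user-space sketch of the pattern this fix relies on: clear the whole structure before filling in the meaningful fields, so the padding never carries leftover stack bytes. `struct demo_fsxattr`, `demo_fill_fsxattr()` and `demo_pad_is_clean()` are hypothetical stand-ins for illustration, not the kernel's code, and zeroing with `memset()` is only one way to satisfy the "alter (or zero)" requirement described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the structure described above:
 * a few real fields followed by a 12-byte pad member. */
struct demo_fsxattr {
    unsigned int  fsx_xflags;
    unsigned int  fsx_extsize;
    unsigned int  fsx_nextents;
    unsigned int  fsx_projid;
    unsigned char fsx_pad[12];
};

/* Fill the reply structure. Zeroing it first guarantees that fsx_pad
 * does not expose whatever happened to be on the stack before. */
void demo_fill_fsxattr(struct demo_fsxattr *out, unsigned int xflags,
                       unsigned int extsize, unsigned int nextents,
                       unsigned int projid)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    out->fsx_xflags = xflags;
    out->fsx_extsize = extsize;
    out->fsx_nextents = nextents;
    out->fsx_projid = projid;
}

/* Return 1 if every pad byte is zero, 0 otherwise. */
int demo_pad_is_clean(const struct demo_fsxattr *fa)
{
    for (size_t i = 0; i < sizeof(fa->fsx_pad); i++)
        if (fa->fsx_pad[i] != 0)
            return 0;
    return 1;
}
```

Pre-filling a `struct demo_fsxattr` with a junk pattern and then calling `demo_fill_fsxattr()` on it leaves the pad clean, which is the property the ioctl needs before copying the structure back to the user.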
     
  • Signed-off-by: Nicolas Ferre

    Nicolas Ferre
     
  • The sd/mmc data structure is not used if SPI is selected. The PIO
    configuration on the board prevents both interfaces from being used at the
    same time (board dependent).
    Remove the compile-time warnings by adding a preprocessor condition.

    Signed-off-by: Nicolas Ferre

    Nicolas Ferre
     
  • Add information in dm9000 mac/phy chip initialization:
    - irq resource details
    - platform data details

    Signed-off-by: Nicolas Ferre

    Nicolas Ferre
     
  • While testing CPU DLPAR, the following problem was discovered.
    We were DLPAR removing the first CPU, which in this case was
    logical CPUs 0-3. CPUs 0-2 were already marked offline and
    we were in the process of offlining CPU 3. After marking
    the CPU inactive and offline in cpu_disable, but before the
    cpu was completely idle (cpu_die), we ended up in __make_request
    on CPU 3. There we looked at the topology map to see which CPU
    to complete the I/O on and found no CPUs in the cpu_sibling_map.
    This resulted in the block layer setting the completion cpu
    to be NR_CPUS, which then caused an oops when we tried to
    complete the I/O.

    Fix this by sanity checking the value we return from blk_cpu_to_group
    to be a valid cpu value.

    Signed-off-by: Brian King
    Signed-off-by: Jens Axboe

    Brian King
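
The range check described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the block layer's code: `DEMO_NR_CPUS`, `demo_first_cpu()` and `demo_cpu_to_group()` are hypothetical names, and the sibling map is modelled as a plain bitmask.

```c
#include <assert.h>

#define DEMO_NR_CPUS 8  /* hypothetical CPU count for this sketch */

/* Lowest CPU set in the mask, or DEMO_NR_CPUS when the mask is empty.
 * This models the "no CPUs in the sibling map" case from the message. */
static int demo_first_cpu(unsigned int mask)
{
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return DEMO_NR_CPUS;
}

/* Pick the completion CPU for a request issued on 'cpu'. If the sibling
 * lookup yields an out-of-range value, fall back to the issuing CPU. */
int demo_cpu_to_group(int cpu, unsigned int sibling_mask)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);

    int group = demo_first_cpu(sibling_mask);
    if (group < DEMO_NR_CPUS)
        return group;
    return cpu;
}
```

With an empty sibling mask, `demo_cpu_to_group(3, 0)` falls back to CPU 3 instead of handing the out-of-range value to the completion path.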
     
  • Takashi Iwai
     
  • …rostedt/linux-2.6-trace into perf/urgent

    Ingo Molnar
     
  • …/linux-2.6 into perf/urgent

    Ingo Molnar
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    libata-sff: Reenable Port Multiplier after libata-sff remodeling.
    libata: skip EH autopsy and recovery during suspend
    ahci: AHCI and RAID mode SATA patch for Intel Patsburg DeviceIDs
    ata_piix: IDE Mode SATA patch for Intel Patsburg DeviceIDs
    libata,pata_via: revert ata_wait_idle() removal from ata_sff/via_tf_load()
    ahci: fix hang on failed softreset
    pata_artop: Fix device ID parity check

    Linus Torvalds
     
  • Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
    having properly started the iterator to iterate the hash. This case is
    degenerate and, as discovered by Robert Swiecki, can cause t_hash_show()
    to misuse a pointer. This causes a NULL ptr deref with possible security
    implications. Tracked as CVE-2010-3079.

    Cc: Robert Swiecki
    Cc: Eugene Teo
    Cc:
    Signed-off-by: Chris Wright
    Signed-off-by: Steven Rostedt

    Chris Wright
     
  • Keep track of the link on which the current request is in progress.
    This allows support of links behind a port multiplier.

    Not all of libata-sff is PMP compliant. The code for native BMDMA
    controllers does not take PMP into account.

    Tested on Marvell 7042 and Sil7526.

    Signed-off-by: Gwendal Grignou
    Signed-off-by: Jeff Garzik

    Gwendal Grignou
     
  • For some mysterious reason, certain hardware reacts badly to usual EH
    actions while the system is going for suspend. As the devices won't
    be needed until the system is resumed, ask EH to skip usual autopsy
    and recovery and proceed directly to suspend.

    Signed-off-by: Tejun Heo
    Tested-by: Stephan Diestelhorst
    Cc: stable@kernel.org
    Signed-off-by: Jeff Garzik

    Tejun Heo
     
  • This patch adds the Intel Patsburg (PCH) SATA AHCI and RAID Controller
    DeviceIDs.

    Signed-off-by: Seth Heasley
    Signed-off-by: Jeff Garzik

    Seth Heasley
     
  • This patch adds the Intel Patsburg (PCH) IDE mode SATA Controller DeviceIDs.

    Signed-off-by: Seth Heasley
    Signed-off-by: Jeff Garzik

    Seth Heasley
     
  • Commit 978c0666 (libata: Remove excess delay in the tf_load path)
    removed ata_wait_idle() from ata_sff_tf_load() and via_tf_load().
    This caused obscure detection problems in sata_sil.

    https://bugzilla.kernel.org/show_bug.cgi?id=16606

    The commit was pure performance optimization. Revert it for now.

    Reported-by: Dieter Plaetinck
    Reported-by: Jan Beulich
    Bisected-by: gianluca
    Cc: stable@kernel.org
    Signed-off-by: Jeff Garzik

    Tejun Heo
     
  • Commit 9eed1fb721c ("minix: replace inode uid,gid,mode init with helper")
    broke directory creation on minix filesystems.

    Fix it by passing the needed mode flag to inode init helper.

    Signed-off-by: Jorge Boncompte [DTI2]
    Cc: Dmitry Monakhov
    Cc: Al Viro
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jorge Boncompte [DTI2]
     
  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
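
The drain-and-retry idea can be illustrated with a toy model. This is a sketch only: `struct demo_allocator`, `demo_try_alloc()`, `demo_drain_all()` and `demo_alloc_after_reclaim()` are hypothetical names, and the real allocator's data structures and locking are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4  /* hypothetical CPU count for this sketch */

/* Toy model: pages on the shared free list plus pages parked on
 * per-CPU lists, which the shared allocation path cannot see. */
struct demo_allocator {
    int free_pages;
    int pcp_pages[DEMO_NR_CPUS];
};

/* Take one page from the shared free list; return 1 on success. */
static int demo_try_alloc(struct demo_allocator *a)
{
    if (a->free_pages > 0) {
        a->free_pages--;
        return 1;
    }
    return 0;
}

/* Move every page parked on a per-CPU list back to the shared list. */
static void demo_drain_all(struct demo_allocator *a)
{
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        a->free_pages += a->pcp_pages[cpu];
        a->pcp_pages[cpu] = 0;
    }
}

/* Allocation attempt made right after direct reclaim. If reclaim made
 * progress but the allocation still fails, drain the per-CPU lists once
 * and try a second time before giving up. */
int demo_alloc_after_reclaim(struct demo_allocator *a, int reclaim_progress,
                             int *drained)
{
    assert(a != NULL && drained != NULL);

    if (demo_try_alloc(a))
        return 1;
    if (reclaim_progress && !*drained) {
        demo_drain_all(a);
        *drained = 1;
        return demo_try_alloc(a);
    }
    return 0;
}
```

The `drained` flag keeps the drain to a single attempt per allocation, matching the "tries the allocation a second time before continuing" behaviour described in the message.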
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of real free pages in the buddy allocator, the VM can allocate pages below min
    watermark, at worst reducing the real number of pages to zero. Even if
    the OOM killer kills some victim for freeing memory, it may not free
    memory if the exit path requires a new page resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
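
To make the difference between the cheap estimate and the snapshot concrete, here is a small user-space sketch. `struct demo_zone_counter`, `demo_counter_estimate()` and `demo_counter_snapshot()` are hypothetical names, and the layout (one folded global value plus small per-CPU deltas) is an assumption based on the description above rather than the kernel's actual structures.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4  /* hypothetical CPU count for this sketch */

/* One counter: the value folded into the global total so far, plus
 * per-CPU deltas that have not been drained into it yet. */
struct demo_zone_counter {
    long        global;
    signed char cpu_diff[DEMO_NR_CPUS];
};

/* Cheap read: only the global value, ignoring pending per-CPU deltas. */
long demo_counter_estimate(const struct demo_zone_counter *c)
{
    assert(c != NULL);
    return c->global < 0 ? 0 : c->global;
}

/* More accurate read: add every pending per-CPU delta to the global
 * value. It costs a pass over all CPUs and is still not exact, since
 * the deltas can change while they are being summed. */
long demo_counter_snapshot(const struct demo_zone_counter *c)
{
    assert(c != NULL);

    long x = c->global;
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        x += c->cpu_diff[cpu];
    return x < 0 ? 0 : x;
}
```

With a global value of 100 and pending deltas of -30, -25, -20 and -10, the estimate still reports 100 free pages while the snapshot reports 15, which is the kind of gap that lets allocations dip below the min watermark.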
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • refresh_zone_stat_thresholds() calculates parameters based on the number of
    online cpus. It's called at cpu offlining but needs to be called at
    onlining, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • O_NONBLOCK on parisc has a dual value:

    #define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */

    It is caught by the O_* bits uniqueness check and leads to a parisc
    compile error. The fix is to take O_NONBLOCK out of that check.

    Signed-off-by: Wu Fengguang
    Signed-off-by: James Bottomley
    Cc: Jamie Lokier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
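
One way a bit-uniqueness check of this kind can work is to OR all the flags together and compare the number of set bits with the number of flags. The sketch below assumes that approach; `demo_popcount()` and `demo_flags_are_unique_bits()` are hypothetical helpers, not the kernel's check. A dual-valued flag such as 000200004 contributes two bits and therefore makes the count come out wrong.

```c
#include <assert.h>
#include <stddef.h>

/* Count the set bits in v by clearing the lowest set bit each round. */
static int demo_popcount(unsigned int v)
{
    int n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Return 1 if the flags together occupy exactly one bit per flag.
 * Overlapping flags give fewer bits than flags, and a multi-bit flag
 * gives more, so both kinds of mistake are rejected. */
int demo_flags_are_unique_bits(const unsigned int *flags, size_t count)
{
    assert(flags != NULL);

    unsigned int all = 0;
    for (size_t i = 0; i < count; i++)
        all |= flags[i];
    return demo_popcount(all) == (int)count;
}
```

Passing three single-bit flags returns 1, while replacing one of them with 000200004 returns 0, mirroring the failure this commit works around.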
     
  • Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
    show a shift since I last tested in December: in part because of firmware
    updates, in part because of the necessary move from barriers to awaiting
    completion at the block layer. While discard at swapon still shows as
    slightly beneficial on both, discarding a 1MB swap cluster when allocating
    is now disadvantageous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

    Surrender: discard as presently implemented is more hindrance than help
    for swap; but might prove useful on other devices, or with improvements.
    So continue to do the discard at swapon, but make discard while swapping
    conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
    only the lower 16 bits of int flags).

    We can add a --discard or -d to swapon(8), and a "discard" to swap in
    /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Nigel Cunningham
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
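
A sketch of how a discard flag can sit above the bits already in use in the swapon flags word. Everything named `DEMO_*` or `demo_*` is hypothetical: the message only states that the lower 16 bits were in use, so the priority and prefer layout and the exact bit chosen for the discard flag are assumptions made for illustration.

```c
#include <assert.h>

/* Assumed layout for this sketch: priority in the low 15 bits, a
 * "priority given" bit above it, and the discard flag in the first
 * bit beyond the lower 16. */
#define DEMO_SWAP_FLAG_PRIO_MASK 0x7fff
#define DEMO_SWAP_FLAG_PREFER    0x8000
#define DEMO_SWAP_FLAG_DISCARD   0x10000

struct demo_swap_opts {
    int has_prio;               /* caller supplied a priority */
    int prio;                   /* the priority, if has_prio is set */
    int discard_while_swapping; /* discard clusters when allocating */
};

/* Decode a swapon-style flags word into separate options. */
struct demo_swap_opts demo_parse_swap_flags(int flags)
{
    struct demo_swap_opts opts = { 0, 0, 0 };

    /* The new flag must not collide with the lower 16 bits. */
    assert((DEMO_SWAP_FLAG_DISCARD & 0xffff) == 0);

    if (flags & DEMO_SWAP_FLAG_PREFER) {
        opts.has_prio = 1;
        opts.prio = flags & DEMO_SWAP_FLAG_PRIO_MASK;
    }
    opts.discard_while_swapping = (flags & DEMO_SWAP_FLAG_DISCARD) != 0;
    return opts;
}
```

Callers that never set the new bit keep their old behaviour, which is what makes discard while swapping opt-in.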