23 Mar, 2011

40 commits

  • Because all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_node(), adding a 'node' parameter
    to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node if possible. (A usage sketch follows this entry.)

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
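
    A minimal usage sketch of the new interface, assuming a hypothetical
    per-CPU worker (my_thread_fn and setup_my_worker are illustrative names,
    not part of the patch):

        #include <linux/kthread.h>
        #include <linux/topology.h>
        #include <linux/err.h>

        static int my_thread_fn(void *data)
        {
                /* per-CPU worker body (illustrative) */
                return 0;
        }

        static struct task_struct *setup_my_worker(int cpu)
        {
                struct task_struct *t;

                /* stack and task_struct are allocated on the CPU's node */
                t = kthread_create_on_node(my_thread_fn, NULL,
                                           cpu_to_node(cpu), "my_worker/%d", cpu);
                if (!IS_ERR(t))
                        kthread_bind(t, cpu);
                return t;
        }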
     
  • Add a node parameter to alloc_thread_info(), and change its name to
    alloc_thread_info_node().

    This change is needed to allow a NUMA-aware kthread_create_on_cpu().
    (A sketch of the resulting allocator shape follows this entry.)

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
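
    A hedged sketch of what such a node-aware allocator can look like
    (simplified; the order value is illustrative and the real
    per-architecture implementations differ):

        #include <linux/sched.h>
        #include <linux/gfp.h>
        #include <linux/mm.h>

        #define MY_THREAD_ORDER 1   /* e.g. an 8KB stack on a 4KB-page arch */

        static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                          int node)
        {
                /* allocate the stack pages on the requested memory node */
                struct page *page = alloc_pages_node(node, GFP_KERNEL,
                                                     MY_THREAD_ORDER);

                return page ? page_address(page) : NULL;
        }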
     
  • Because all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_cpu(), adding a 'cpu' parameter
    to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node if available.

    Users of this new function are ksoftirqd, kworker, migration, pktgend...

    This patch:

    Add a node parameter to alloc_task_struct(), and change its name to
    alloc_task_struct_node().

    This change is needed to allow a NUMA-aware kthread_create_on_cpu().
    (A sketch of the node-aware slab allocation follows this entry.)

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
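
    A hedged sketch of the node-aware task_struct allocation (simplified;
    the cache variable name is illustrative):

        #include <linux/slab.h>
        #include <linux/sched.h>

        static struct kmem_cache *task_struct_cachep;

        static struct task_struct *alloc_task_struct_node(int node)
        {
                /* allocate the task_struct from the slab cache on 'node' */
                return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
        }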
     
  • Many of migrate_pages()'s callers check the return value instead of
    list_empty() since cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
    counting"). This patch makes compaction's use of migrate_pages()
    consistent with the others. It should not change the old behaviour.

    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP, which specifies __GFP_NO_KSWAPD, but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames, but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Provide a free area cache for the vmalloc virtual address allocator, based
    on the algorithm used by the user virtual memory allocator.

    This reduces the number of rbtree operations and linear traversals over
    the vmap extents in order to find a free area, by starting off at the last
    point that a free area was found.

    The free area cache is reset if areas are freed behind it, or if we are
    searching for a smaller area or alignment than last time. So allocation
    patterns are not changed (verified by corner-case and random test cases in
    userspace testing).

    This solves a regression caused by the lazy vunmap TLB purging introduced
    in db64fe02 (mm: rewrite vmap layer). That patch leaves extents in the
    vmap allocator after they are vunmapped, until a significant number
    accumulate that can be flushed in a single batch. So in a workload that
    vmallocs/vfrees frequently, a chain of extents builds up from the
    VMALLOC_START address and has to be iterated over each time (giving O(n)
    type behaviour).

    After this patch, the search starts from where it left off, giving closer
    to amortized O(1) behaviour. (A simplified sketch of the caching idea
    follows this entry.)

    This is verified to solve regressions reported by Steven in GFS2 and by
    Avi in KVM.

    Hugh's update:

    : I tried out the recent mmotm, and on one machine was fortunate to hit
    : the BUG_ON(first->va_start < addr) which seems to have been stalling
    : your vmap area cache patch ever since May.

    : I can get you addresses etc, I did dump a few out; but once I stared
    : at them, it was easier just to look at the code: and I cannot see how
    : you would be so sure that first->va_start < addr, once you've done
    : that addr = ALIGN(max(...), align) above, if align is over 0x1000
    : (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).

    : I originally got around it by just changing the
    : if (first->va_start < addr) {
    : to
    : while (first->va_start < addr) {
    : without thinking about it any further; but that seemed unsatisfactory,
    : why would we want to loop here when we've got another very similar
    : loop just below it?

    : I am never going to admit how long I've spent trying to grasp your
    : "while (n)" rbtree loop just above this, the one with the peculiar
    : if (!first && tmp->va_start < addr + size)
    : in. That's unfamiliar to me, I'm guessing it's designed to save a
    : subsequent rb_next() in a few circumstances (at risk of then setting
    : a wrong cached_hole_size?); but they did appear few to me, and I didn't
    : feel I could sign off something with that in when I don't grasp it,
    : and it seems responsible for extra code and mistaken BUG_ON below it.

    : I've reverted to the familiar rbtree loop that find_vma() does (but
    : with va_end >= addr as you had, to respect the additional guard page):
    : and then (given that cached_hole_size starts out 0) I don't see the
    : need for any complications below it. If you do want to keep that loop
    : as you had it, please add a comment to explain what it's trying to do,
    : and where addr is relative to first when you emerge from it.

    : Aren't your tests "size first->va_start" forgetting the guard page we want
    : before the next area? I've changed those.

    : I have not changed your many "addr + size - 1 < addr" overflow tests,
    : but have since come to wonder, shouldn't they be "addr + size < addr"
    : tests - won't the vend checks go wrong if addr + size is 0?

    : I have added a few comments - Wolfgang Wander's 2.6.13 description of
    : 1363c3cd8603a913a27e2995dccbd70d5312d8e6 Avoiding mmap fragmentation
    : helped me a lot, perhaps a pointer to that would be good too. And I found
    : it easier to understand when I renamed cached_start slightly and moved the
    : overflow label down.

    : This patch would go after your mm-vmap-area-cache.patch in mmotm.
    : Trivially, nobody is going to get that BUG_ON with this patch, and it
    : appears to work fine on my machines; but I have not given it anything like
    : the testing you did on your original, and may have broken all the
    : performance you were aiming for. Please take a look, test it out and
    : integrate with yours if you're satisfied - thanks.

    [akpm@linux-foundation.org: add locking comment]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Reviewed-by: Minchan Kim
    Reported-and-tested-by: Steven Whitehouse
    Reported-and-tested-by: Avi Kivity
    Tested-by: "Barry J. Marson"
    Cc: Prarit Bhargava
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
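
    A hedged, self-contained sketch of the free-area-cache idea (illustration
    only, not the mainline __alloc_vmap_area() code; VSTART stands in for
    VMALLOC_START):

        #define VSTART          0x1000UL
        #define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

        static unsigned long free_area_cache = VSTART; /* where the last search ended */
        static unsigned long cached_hole_size;         /* largest hole skipped below it */
        static unsigned long cached_align = 1;

        /* Pick the address at which to start searching for a hole of 'size'
         * bytes with 'align'ment.  A request smaller than the largest hole
         * already skipped, or with a smaller alignment than before,
         * invalidates the cache (as does freeing an area behind it, not
         * shown); otherwise we resume where the previous search finished
         * instead of walking from VSTART every time. */
        static unsigned long vmap_search_start(unsigned long size, unsigned long align)
        {
                if (size < cached_hole_size || align < cached_align) {
                        cached_hole_size = 0;
                        free_area_cache = VSTART;
                }
                cached_align = align;
                return ALIGN_UP(free_area_cache, align);
        }

        /* After a successful allocation ending at 'end', remember it so the
         * next search can start there. */
        static void vmap_note_allocation(unsigned long end)
        {
                free_area_cache = end;
        }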
     
  • In systems with multiple framebuffer devices, one of the devices might be
    blanked while another is unblanked. In order for the backlight blanking
    logic to know whether to turn off the backlight for a particular
    framebuffer's blanking notification, it needs to be able to check if a
    given framebuffer device corresponds to the backlight.

    This plumbs the check_fb hook from the core backlight code through the
    pwm_backlight helper so that platform code can plug in its own check.
    (A hedged example follows this entry.)

    Signed-off-by: Robert Morell
    Cc: Richard Purdie
    Cc: Arun Murthy
    Cc: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Morell
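
    A hedged sketch of a platform hook using this plumbing (the framebuffer
    id string and the brightness/period values are illustrative, not from
    the patch):

        #include <linux/fb.h>
        #include <linux/pwm_backlight.h>
        #include <linux/string.h>

        /* Only react to blank/unblank events from our own framebuffer. */
        static int my_check_fb(struct device *dev, struct fb_info *info)
        {
                return strcmp(info->fix.id, "my-fbdev") == 0;
        }

        static struct platform_pwm_backlight_data my_backlight_data = {
                .max_brightness = 255,
                .dft_brightness = 200,
                .pwm_period_ns  = 78770,
                .check_fb       = my_check_fb,
        };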
     
  • The following symbols are needlessly defined global: jornada_bl_init,
    jornada_bl_exit, jornada_lcd_init, jornada_lcd_exit.

    Make them static.

    Signed-off-by: Axel Lin
    Acked-by: Kristoffer Ericson
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Lin
     
  • apple_bl uses ACPI interfaces (data & code), so it should depend on ACPI.

    drivers/video/backlight/apple_bl.c:142: warning: 'struct acpi_device' declared inside parameter list
    drivers/video/backlight/apple_bl.c:142: warning: its scope is only this definition or declaration, which is probably not what you want
    drivers/video/backlight/apple_bl.c:201: warning: 'struct acpi_device' declared inside parameter list
    drivers/video/backlight/apple_bl.c:215: error: variable 'apple_bl_driver' has initializer but incomplete type
    drivers/video/backlight/apple_bl.c:216: error: unknown field 'name' specified in initializer
    ...

    Signed-off-by: Randy Dunlap
    Acked-by: Matthew Garrett
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • It works on hardware other than Macbook Pros, and it works on GPUs other
    than Nvidia. It should even work on iMacs, so change the name to match
    reality more precisely and include an alias so existing users don't get
    confused.

    Signed-off-by: Matthew Garrett
    Acked-by: Richard Purdie
    Cc: Mourad De Clerck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • The SMI-based backlight control functionality may fail to work if the
    system is running under EFI rather than BIOS. Check that the hardware
    responds as expected, and exit if it doesn't.

    Signed-off-by: Matthew Garrett
    Acked-by: Richard Purdie
    Cc: Mourad De Clerck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • This driver only has to deal with two different classes of hardware, but
    right now it needs new DMI entries for every new machine. It turns out
    that there's an ACPI device that uniquely identifies Apples with backlights,
    so this patch reworks the driver into an ACPI one, identifies the hardware
    by checking the PCI vendor of the root bridge and strips out all the DMI
    code. It also changes the config text to clarify that it works on devices
    other than Macbook Pros and GPUs other than nvidia.

    Signed-off-by: Matthew Garrett
    Acked-by: Richard Purdie
    Cc: Mourad De Clerck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • Dual-GPU machines may provide more than one ACPI backlight interface. Tie
    the backlight device to the GPU in order to allow userspace to identify
    the correct interface.

    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Cc: Alex Deucher
    Cc: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • We may eventually end up with per-connector backlights, especially with
    ddcci devices. Make sure that the parent node for the backlight device is
    the connector rather than the PCI device.

    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Cc: Alex Deucher
    Acked-by: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • Allows e.g. power management daemons to control the backlight level. Inspired
    by the corresponding code in radeonfb.

    [mjg@redhat.com: updated to add backlight type and make the connector the parent device]
    Signed-off-by: Michel Dänzer
    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Acked-by: Alex Deucher
    Cc: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Dänzer
     
  • There may be multiple ways of controlling the backlight on a given
    machine. Allow drivers to expose the type of interface they are
    providing, making it possible for userspace to make appropriate policy
    decisions. (A registration sketch follows this entry.)

    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Cc: Alex Deucher
    Cc: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
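
    A hedged sketch of how a driver might declare its interface type when
    registering (the device name, ops and brightness value are illustrative):

        #include <linux/backlight.h>
        #include <linux/string.h>

        static struct backlight_device *
        register_example_backlight(struct device *dev, void *priv,
                                   const struct backlight_ops *ops)
        {
                struct backlight_properties props;

                memset(&props, 0, sizeof(props));
                /* direct hardware register control; BACKLIGHT_PLATFORM and
                 * BACKLIGHT_FIRMWARE are the other types from this series */
                props.type = BACKLIGHT_RAW;
                props.max_brightness = 255;

                return backlight_device_register("example_bl", dev, priv,
                                                 ops, &props);
        }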
     
  • Don't allow everybody to change LED settings.

    Signed-off-by: Vasiliy Kulikov
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Don't allow everybody to change LED settings.

    Signed-off-by: Vasiliy Kulikov
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add an ld9040 AMOLED panel driver.

    Signed-off-by: Donghwa Lee
    Signed-off-by: Kyungmin Park
    Signed-off-by: Inki Dae
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Donghwa Lee
     
  • And fix a typo.

    Signed-off-by: Uwe Kleine-König
    Cc: Lars-Peter Clausen
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Simple backlight driver for the National Semiconductor LM3530. Presently
    only manual mode is supported; PWM and ALS support are to be added.

    Signed-off-by: Shreshtha Kumar Sahu
    Cc: Linus Walleij
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shreshtha Kumar Sahu
     
  • There is a move to deprecate bus-specific PM operations in favour of
    dev_pm_ops, in order to reduce the amount of boilerplate code in buses
    and facilitate updates to the PM core. Do this move for the bd2802
    driver. (A hedged sketch of the dev_pm_ops pattern follows this entry.)

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Mark Brown
    Cc: Kim Kyuwon
    Cc: Kim Kyuwon
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Brown
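
    A hedged, generic sketch of the dev_pm_ops pattern being moved to (all
    names are illustrative, not the bd2802 code):

        #include <linux/pm.h>
        #include <linux/i2c.h>

        static int example_suspend(struct device *dev)
        {
                /* save state and power the device down */
                return 0;
        }

        static int example_resume(struct device *dev)
        {
                /* power the device up and restore state */
                return 0;
        }

        static const struct dev_pm_ops example_pm_ops = {
                .suspend = example_suspend,
                .resume  = example_resume,
        };

        static struct i2c_driver example_driver = {
                .driver = {
                        .name = "example",
                        /* instead of .suspend/.resume in the i2c_driver */
                        .pm   = &example_pm_ops,
                },
        };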
     
  • list_del() leaves poison in the prev and next pointers. The next
    list_empty() will compare those poisons, and say the list isn't empty.
    Any list operations that assume the node is on a list because of such a
    check will be fooled into dereferencing poison. One needs to INIT the
    node after the del, and fortunately there's already a wrapper for that -
    list_del_init().

    Some of the dels are followed by deallocations, so they can be ignored,
    and one can be merged with an add to make a move. Apart from that, I
    erred on the side of caution in making nodes list_empty()-queriable.
    (A small illustration of the pitfall follows this entry.)

    Signed-off-by: Phil Carmody
    Reviewed-by: Paul Menage
    Cc: Li Zefan
    Acked-by: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
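
    A small, hedged illustration of the pitfall (struct item and the helper
    names are hypothetical):

        #include <linux/list.h>
        #include <linux/bug.h>

        struct item {
                struct list_head entry;
        };

        static void unlink_wrong(struct item *it)
        {
                list_del(&it->entry);
                /* prev/next now hold LIST_POISON values, so a later
                 * list_empty() claims the node is still on a list. */
                WARN_ON(!list_empty(&it->entry));       /* fires */
        }

        static void unlink_right(struct item *it)
        {
                list_del_init(&it->entry);
                /* the node points back at itself again */
                WARN_ON(!list_empty(&it->entry));       /* never fires */
        }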
     
  • The oom killer naturally defers killing anything if it finds an eligible
    task that is already exiting and has yet to detach its ->mm. This avoids
    unnecessarily killing tasks when one is already in the exit path and may
    free enough memory that the oom killer is no longer needed. This is
    detected by PF_EXITING since threads that have already detached their ->mm
    are no longer considered at all.

    The problem with always deferring when a thread is PF_EXITING, however, is
    that it may never actually exit when being traced, specifically if another
    task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want
    to defer in this case since there is no guarantee that thread will ever
    exit without intervention.

    This patch will now only defer the oom killer when a thread is PF_EXITING
    and no ptracer has stopped its progress in the exit path. It also ensures
    that a child is sacrificed for the chosen parent only if it has a
    different ->mm, as the comment implies: this ensures that the thread group
    leader is always targeted appropriately. (A conceptual sketch of the
    deferral test follows this entry.)

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrey Vagin
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
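
    A hedged, conceptual sketch of the deferral rule using the ptrace
    helpers of that era (not the mainline oom_kill.c code; the function
    name is hypothetical):

        #include <linux/sched.h>
        #include <linux/ptrace.h>

        static bool should_defer_oom(struct task_struct *p)
        {
                if (!(p->flags & PF_EXITING))
                        return false;           /* not in the exit path */
                if (task_ptrace(p) & PT_TRACE_EXIT)
                        return false;           /* a tracer may never let it exit */
                return true;                    /* give it time to free its memory */
        }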
     
  • We shouldn't defer oom killing if a thread has already detached its ->mm
    and still has TIF_MEMDIE set. Memory needs to be freed, so kill other
    threads that pin the same ->mm or find another task to kill.

    Signed-off-by: Andrey Vagin
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     
  • This patch prevents unnecessary oom kills or kernel panics by reverting
    two commits:

    495789a5 (oom: make oom_score to per-process value)
    cef1d352 (oom: multi threaded process coredump don't make deadlock)

    First, 495789a5 (oom: make oom_score to per-process value) ignores the
    fact that all threads in a thread group do not necessarily exit at the
    same time.

    It is imperative that select_bad_process() detect threads that are in the
    exit path, specifically those with PF_EXITING set, to prevent needlessly
    killing additional tasks. If a process is oom killed and the thread group
    leader exits, select_bad_process() cannot detect the other threads that
    are PF_EXITING by iterating over only processes. Thus, it currently
    chooses another task unnecessarily for oom kill or panics the machine when
    nothing else is eligible.

    By iterating over threads instead, it is possible to detect threads that
    are exiting and nominate them for oom kill so they get access to memory
    reserves.

    Second, cef1d352 (oom: multi threaded process coredump don't make
    deadlock) erroneously avoids making the oom killer a no-op when an
    eligible thread other than current is found to be exiting. We want to
    detect this situation so that we may allow that exiting thread time to
    exit and free its memory; if it is able to exit on its own, that should
    free memory so current is no longer oom. If it is not able to exit on
    its own, the oom killer will nominate it for oom kill which, in this
    case, only means it will get access to memory reserves.

    Without this change, it is easy for the oom killer to unnecessarily target
    tasks when all threads of a victim don't exit before the thread group
    leader or, in the worst case, panic the machine.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrey Vagin
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If an administrator tries to swapon a file backed by NFS, the inode mutex
    is taken (as it is for any swapfile), but the file is later identified as
    a bad swapfile due to the lack of bmap and the code tries to clean up.
    During cleanup, an attempt is made to close the file, but with
    inode->i_mutex still held. Closing an NFS file syncs it, which tries to
    acquire the inode mutex, leading to deadlock. If lockdep is enabled, the
    following appears on the console:

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.38-rc8-autobuild #1
    ---------------------------------------------
    swapon/2192 is trying to acquire lock:
    (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    other info that might help us debug this:
    1 lock held by swapon/2192:
    #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    stack backtrace:
    Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
    Call Trace:
    __lock_acquire+0x2eb/0x1623
    find_get_pages_tag+0x14a/0x174
    pagevec_lookup_tag+0x25/0x2e
    vfs_fsync_range+0x47/0x7c
    lock_acquire+0xd3/0x100
    vfs_fsync_range+0x47/0x7c
    nfs_flush_one+0x0/0xdf [nfs]
    mutex_lock_nested+0x40/0x2b1
    vfs_fsync_range+0x47/0x7c
    vfs_fsync_range+0x47/0x7c
    vfs_fsync+0x1c/0x1e
    nfs_file_flush+0x64/0x69 [nfs]
    filp_close+0x43/0x72
    sys_swapon+0xa39/0xae7
    sysret_check+0x2e/0x69
    system_call_fastpath+0x16/0x1b

    This patch releases the mutex if it is held before calling filp_close()
    so swapon fails as expected without deadlock when the swapfile is backed
    by NFS. If accepted for 2.6.39, it should also be considered a -stable
    candidate for 2.6.38 and 2.6.37. (A hedged sketch of the fix follows
    this entry.)

    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
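
    A hedged sketch of the shape of the fix, assuming sys_swapon()'s local
    variables inode and swap_file (not the exact mainline error path):

        /* On the error path, drop the inode mutex before closing the
         * rejected swapfile, because the NFS flush done by filp_close()
         * needs to take i_mutex itself. */
        if (inode && S_ISREG(inode->i_mode)) {
                mutex_unlock(&inode->i_mutex);
                inode = NULL;                   /* record that it is no longer held */
        }
        filp_close(swap_file, NULL);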
     
  • syncfs() is duplicating name_to_handle_at() due to a merging mistake.

    Cc: Sage Weil
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Add statistics for this_cmpxchg_double failures
    slub: Add missing irq restore for the OOM path

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    [net/9p]: Introduce basic flow-control for VirtIO transport.
    9p: use the updated offset given by generic_write_checks
    [net/9p] Don't re-pin pages on retrying virtqueue_add_buf().
    [net/9p] Set the condition just before waking up.
    [net/9p] unconditional wake_up to proc waiting for space on VirtIO ring
    fs/9p: Add v9fs_dentry2v9ses
    fs/9p: Attach writeback_fid on first open with WR flag
    fs/9p: Open writeback fid in O_SYNC mode
    fs/9p: Use truncate_setsize instead of vmtruncate
    net/9p: Fix compile warning
    net/9p: Convert the in the 9p rpc call path to GFP_NOFS
    fs/9p: Fix race in initializing writeback fid

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: use watch/notify for changes in rbd header
    libceph: add lingering request and watch/notify event framework
    rbd: update email address in Documentation
    ceph: rename dentry_release -> d_release, fix comment
    ceph: add request to the tail of unsafe write list
    ceph: remove request from unsafe list if it is canceled/timed out
    ceph: move readahead default to fs/ceph from libceph
    ceph: add ino32 mount option
    ceph: update common header files
    ceph: remove debugfs debug cruft
    libceph: fix osd request queuing on osdmap updates
    ceph: preserve I_COMPLETE across rename
    libceph: Fix base64-decoding when input ends in newline.

    Linus Torvalds
     
  • Using delayed-work for tty flip buffers ends up causing us to wait for
    the next tick to complete some actions. That's usually not all that
    noticeable, but for certain latency-critical workloads it ends up being
    totally unacceptable.

    As an extreme case of this, passing a token back-and-forth over a pty
    will take two ticks per iteration, so even just a thousand iterations
    will take 8 seconds assuming a common 250Hz configuration.

    Avoiding the whole delayed work issue brings that ping-pong test-case
    down to 0.009s on my machine.

    In more practical terms, this latency has been a performance problem for
    things like dive computer simulators (simulating the serial interface
    using the ptys) and for other environments (Alan mentions a CP/M emulator).

    Reported-by: Jef Driesen
    Acked-by: Greg KH
    Acked-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Recent zerocopy work in the 9P VirtIO transport maps and pins user
    buffers into kernel memory for the server to work on them. Since the
    user process can initiate this kind of pinning with a simple read/write
    call, thousands of IO threads initiated by the user process can hog the
    system resources and could result in a denial of service.

    This patch introduces flow control to avoid that extreme scenario.

    The ceiling limit to avoid denial-of-service attacks is set relatively
    high (nr_free_pagecache_pages()/4) so that it won't interfere with
    regular usage, but it can step in in extreme cases to prevent a total
    system hang. Since we don't have a global structure to accommodate this
    variable, I chose virtio_chan as its home. (A hedged sketch of the
    throttling follows this entry.)

    Signed-off-by: Venkateswararao Jujjuri
    Reviewed-by: Badari Pulavarty
    Signed-off-by: Eric Van Hensbergen

    Venkateswararao Jujjuri (JV)
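
    A hedged, generic sketch of the throttling idea (the names are
    illustrative, not those used in net/9p/trans_virtio.c):

        #include <linux/wait.h>
        #include <linux/atomic.h>
        #include <linux/swap.h>         /* nr_free_pagecache_pages() */

        static atomic_t pinned_pages = ATOMIC_INIT(0);
        static unsigned long max_pinned_pages;
        static DECLARE_WAIT_QUEUE_HEAD(pin_wq);

        static void pin_limit_init(void)
        {
                /* high ceiling: a quarter of the reclaimable page cache */
                max_pinned_pages = nr_free_pagecache_pages() / 4;
        }

        /* Block (interruptibly) until pinning nr_pages more would stay
         * under the ceiling, then account for them. */
        static int pin_throttle(int nr_pages)
        {
                int err = wait_event_interruptible(pin_wq,
                                atomic_read(&pinned_pages) + nr_pages <=
                                max_pinned_pages);
                if (err)
                        return err;
                atomic_add(nr_pages, &pinned_pages);
                return 0;
        }

        /* Release the accounting when the pages are unpinned. */
        static void pin_release(int nr_pages)
        {
                atomic_sub(nr_pages, &pinned_pages);
                wake_up(&pin_wq);
        }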
     
  • Without this fix, even if a file is opened in O_APPEND mode, data will be
    written at the current file position instead of at the end of the file.
    (A hedged sketch follows this entry.)
    Signed-off-by: M. Mohan Kumar
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Eric Van Hensbergen

    M. Mohan Kumar
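
    A hedged sketch of the pattern, using the four-argument form of
    generic_write_checks() from that era (the function and its arguments
    are illustrative; the real change is in the 9p write path):

        #include <linux/fs.h>

        /* generic_write_checks() updates *pos for O_APPEND files, moving it
         * to i_size; the write must then use the updated position rather
         * than the offset the caller passed in. */
        static ssize_t example_write(struct file *filp, const char __user *buf,
                                     size_t count, loff_t *ppos)
        {
                loff_t pos = *ppos;
                int err = generic_write_checks(filp, &pos, &count, 0);

                if (err)
                        return err;
                /* ... perform the write at 'pos', not at the original *ppos ... */
                *ppos = pos + count;
                return count;
        }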
     
  • Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Venkateswararao Jujjuri (JV)
     
  • Given that spurious wake-ups are common, we need to move the condition
    setting right next to the wake_up(). If the condition req->status =
    REQ_STATUS_RCVD is set too early, a spuriously woken waiter may put the
    virtqueue buffer back on the free list for someone else to use. This may
    result in a kernel panic while releasing the pinned pages in
    p9_release_req_pages().

    Also rearranged the while loop in req_done() for better readability.

    Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Venkateswararao Jujjuri (JV)
     
  • Process may wait to get space on VirtIO ring to send a transaction to
    VirtFS server. Current code just does a conditional wake_up() which
    means only one process will be woken up even if multiple processes
    are waiting.

    This fix makes the wake_up unconditional, so no process is left waiting
    forever.

    Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Venkateswararao Jujjuri (JV)
     
  • Add the new static inline helper and use it. (A hedged sketch of such a
    helper follows this entry.)

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Aneesh Kumar K.V
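
    A hedged sketch of the kind of helper being added (the exact body lives
    in the fs/9p headers; struct v9fs_session_info comes from fs/9p/v9fs.h):

        #include <linux/fs.h>

        static inline struct v9fs_session_info *v9fs_dentry2v9ses(struct dentry *dentry)
        {
                /* the 9p session is stashed in the superblock's private data */
                return dentry->d_sb->s_fs_info;
        }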
     
  • We don't need a writeback fid if we are only doing an O_RDONLY open.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Aneesh Kumar K.V
     
  • Older versions of the protocol don't support the tsyncfs operation, so
    for them force an O_SYNC flag on the server.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Venkateswararao Jujjuri
    Signed-off-by: Eric Van Hensbergen

    Aneesh Kumar K.V