14 Dec, 2011

12 commits

  • * cfq_cic_lookup() may be called without queue_lock, and multiple
    tasks can execute it simultaneously for the same shared ioc.
    Nothing prevents them from racing each other and trying to drop the
    same dead cic entry multiple times.

    * smp_wmb() in cfq_exit_cic() doesn't really do anything, and
    nothing prevents cfq_cic_lookup() from seeing a stale cic->key.
    This usually doesn't blow up because, by the time the cic is
    exited, all requests have been drained and new requests are
    terminated before going through the elevator. However, it can
    still be triggered by the plug merge path, which doesn't grab
    queue_lock and thus can't check the DEAD state reliably.

    This patch updates lookup locking such that,

    * Lookup is always performed under queue_lock. This doesn't add
    any more locking. The only issue is cfq_allow_merge(), which can
    be called from the plug merge path without holding any lock. For
    now, this is worked around by using the cic of the request to
    merge into, which is guaranteed to have the same ioc; see the
    sketch after this entry. For the longer term, I think it would be
    best to separate out the plug merge method from the regular one.

    * Spurious ioc->lock locking around cic lookup hint assignment
    dropped.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
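
    A minimal sketch of the resulting locking rule (RQ_CIC() is cfq's
    per-request cic accessor; this is an illustration of the rule, not
    the literal patch):

        /* normal path: the lookup is only valid under queue_lock */
        spin_lock_irq(q->queue_lock);
        cic = cfq_cic_lookup(cfqd, current->io_context);
        /* ... use cic ... */
        spin_unlock_irq(q->queue_lock);

        /* plug merge path (cfq_allow_merge): no queue_lock is held,
         * so borrow the cic of the request being merged into - it is
         * guaranteed to belong to the same ioc */
        cic = RQ_CIC(rq);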
     
  • cfq_get_io_context() would fail if multiple tasks race to insert
    cic's for the same association. This patch restructures
    cfq_get_io_context() so that the slow-path insertion race is
    handled properly, as sketched after this entry.

    Note that the restructuring also means cfq_get_io_context() is
    called under queue_lock and performs both the ioc and cfqd
    insertions while holding both the ioc and queue locks. This is
    part of ongoing locking tightening and will be used to simplify
    the synchronization rules.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
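
    A hedged sketch of the slow-path shape described above (field and
    helper names follow cfq-iosched.c conventions but are assumptions;
    radix_tree_insert() returning -EEXIST is how the loser detects the
    race):

        spin_lock_irqsave(&ioc->lock, flags);
        ret = radix_tree_insert(&ioc->radix_root, q->id, cic);
        spin_unlock_irqrestore(&ioc->lock, flags);

        if (ret == -EEXIST) {
                /* somebody else linked a cic for this (ioc, q) pair
                 * first - drop ours and use the winner's instead */
                cfq_cic_free(cic);
                cic = cfq_cic_lookup(cfqd, ioc);
        }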
     
  • ioprio/cgroup changes were handled by marking the changed state in
    the ioc and, on the following access to the ioc, performing an
    RCU-protected iteration through all cic's, grabbing the matching
    queue_lock for each.

    This patch moves the changed state to each cic. When the ioprio or
    cgroup changes, the respective bit is set on all cic's of the ioc,
    and when each of those cic's (not the ioc) is accessed, the change
    is applied for that specific ioc-queue pair (see the sketch after
    this entry).

    This also fixes the following two race conditions between setting and
    clearing of changed states.

    * A missing barrier between the assignment and load of ioprio and
    ioprio_changed allowed applying a stale ioprio.

    * Change requests could arrive between the application of a change
    and the clearing of the changed variables.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
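
    A sketch of the per-cic scheme; the flag and helper names here are
    assumptions based on the description above:

        /* on ioprio (or cgroup) change: mark every cic of the ioc */
        hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
                set_bit(CIC_IOPRIO_CHANGED, &cic->changed);

        /* on access of each cic: apply and clear atomically, so a new
         * change request can't slip in between apply and clear */
        if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
                changed_ioprio(cic);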
     
  • Make the following changes to prepare for ioc/cic management cleanup.

    * Add cic->q so that ioc can determine the associated queue without
    querying cfq. This will eventually replace ->key.

    * Factor out cfq_release_cic() from cic_free_func(). This function
    assumes that the caller handled locking.

    * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
    take only @cic.

    * Restructure cfq_cic_link() for future updates.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * blk_get_queue() is peculiar in that it returns 0 on success and 1
    on failure instead of 0 / -errno or a boolean. Update it so that
    it returns %true on success and %false on failure, as sketched
    after this entry.

    * Make sure the caller checks the return value.

    * Separate out __blk_get_queue() which doesn't check whether @q is
    dead and put it in blk.h. This will be used later.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
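
    The updated interface, roughly (a sketch; __blk_get_queue() takes
    the reference without checking for a dead queue):

        bool blk_get_queue(struct request_queue *q)
        {
                if (likely(!blk_queue_dead(q))) {
                        __blk_get_queue(q);
                        return true;
                }
                return false;
        }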
     
  • Ignoring copy_io() during fork, an io_context can be allocated from
    two places - current_io_context() and set_task_ioprio(). The
    former is always called from the local task while the latter can be
    called from a different task. The synchronization between them is
    peculiar and dubious.

    * current_io_context() doesn't grab task_lock() and assumes that if
    it saw a %NULL ->io_context, it will stay that way until the
    allocation and assignment are complete. It has an smp_wmb()
    between alloc/init and assignment.

    * set_task_ioprio() grabs task_lock() for the assignment and issues
    smp_read_barrier_depends() between "ioc = task->io_context" and "if
    (ioc)". Unfortunately, this doesn't achieve anything - the latter
    is not a dependent load of the former. That is, if ioc itself were
    being dereferenced as "ioc->xxx", the barrier would mean something
    (though it's unclear what), but as the code currently stands, the
    dependent read barrier is a noop.

    As only one of the two test-assignment sequences is task_lock()
    protected, task_lock() can't do much about the race between the
    two. Nothing prevents current_io_context() and set_task_ioprio()
    from each allocating its own ioc for the same task and overwriting
    the other's.

    Also, set_task_ioprio() can race with an exiting task and create a
    new ioc after exit_io_context() has finished.

    ioc get/put doesn't have any reason to be complex. The only hot
    path is accessing the existing ioc of %current, which is simple to
    achieve given that ->io_context is never destroyed as long as the
    task is alive. All other paths can happily go through task_lock()
    like all other task substructures without impacting anything.

    This patch updates ioc get/put so that it becomes more
    conventional; a sketch of the resulting accessor follows this
    entry.

    * alloc_io_context() is replaced with get_task_io_context(). This is
    the only interface which can acquire access to ioc of another task.
    On return, the caller has an explicit reference to the object which
    should be put using put_io_context() afterwards.

    * The functionality of current_io_context() remains the same but when
    creating a new ioc, it shares the code path with
    get_task_io_context() and always goes through task_lock().

    * get_io_context() now means incrementing ref on an ioc which the
    caller already has access to (be that an explicit refcnt or implicit
    %current one).

    * PF_EXITING inhibits creation of new io_context and once
    exit_io_context() is finished, it's guaranteed that both ioc
    acquisition functions return %NULL.

    * All users are updated. Most are trivial, but the
    smp_read_barrier_depends() removal from cfq_get_io_context() needs
    a bit of explanation. I suppose the original intention was to
    ensure ioc->ioprio is visible when set_task_ioprio() allocates a
    new io_context and installs it; however, this wouldn't have worked
    because set_task_ioprio() doesn't have a wmb between init and
    install. There are other problems with this which will be fixed in
    another patch.

    * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
    specification.

    -v2: Vivek spotted contamination from debug patch. Removed.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
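
    A simplified sketch of the conventional accessor described above;
    create_and_link_ioc() is a hypothetical stand-in for the allocation
    slow path:

        struct io_context *get_task_io_context(struct task_struct *task,
                                               gfp_t gfp_flags, int node)
        {
                struct io_context *ioc;

                task_lock(task);
                ioc = task->io_context;
                if (ioc)
                        get_io_context(ioc);    /* bump the refcount */
                else if (!(task->flags & PF_EXITING))
                        ioc = create_and_link_ioc(task, gfp_flags, node);
                task_unlock(task);

                return ioc;     /* caller must put_io_context() */
        }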
     
  • * The int return from put_io_context() wasn't used by anybody.
    Make it return void like other put functions and docbook-fy the
    function comment.

    * Reorder dummy declarations for !CONFIG_BLOCK case a bit.

    * Make alloc_io_context() use a __GFP_ZERO allocation, take the
    init out of the if block, and drop the manual zeroing.

    * Docbook-fy current_io_context() comment.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq allocates a per-queue id using ida and uses it to index the cic
    radix tree from the io_context. Move it to q->id, allocating it on
    queue init and freeing it on queue release. This simplifies cfq a
    bit and will allow further improvements to io context life-cycle
    management (see the sketch after this entry).

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
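
    A sketch of the allocation/release pairing, using the ida_simple_*
    helpers for brevity:

        /* blk_alloc_queue_node() */
        q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
        if (q->id < 0)
                goto fail_q;

        /* blk_release_queue() */
        ida_simple_remove(&blk_queue_ida, q->id);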
     
  • blk_insert_cloned_request(), blk_execute_rq_nowait() and
    blk_flush_plug_list() either didn't check whether the queue was
    dead, or did so without holding queue_lock. Update them so that
    the dead state is checked while holding queue_lock, as sketched
    after this entry.

    AFAICS, this plugs all holes (requeue doesn't matter as the request is
    transitioning atomically from in_flight to queued).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
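
    The pattern, roughly as it applies to blk_execute_rq_nowait() (a
    sketch, not the literal diff):

        spin_lock_irq(q->queue_lock);
        if (unlikely(blk_queue_dead(q))) {
                spin_unlock_irq(q->queue_lock);
                rq->errors = -ENXIO;
                if (rq->end_io)
                        rq->end_io(rq, rq->errors);
                return;
        }
        __elv_add_request(q, rq, where);
        __blk_run_queue(q);
        spin_unlock_irq(q->queue_lock);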
     
  • When trying to drain all requests, blk_drain_queue() checked only
    q->rq.count[]; however, that only tracks REQ_ALLOCED requests.
    This patch updates blk_drain_queue() so that it looks at all the
    counters and queues, ensuring the request_queue is actually empty
    on completion (see the sketch after this entry).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
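
    A sketch of the strengthened drain condition; the fields are those
    of 3.2's struct request_queue, but treat the exact set as
    illustrative:

        spin_lock_irq(q->queue_lock);
        drain  = q->rq.elvpriv;                 /* elevator-private reqs */
        drain |= q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC];
        drain |= q->in_flight[0] + q->in_flight[1];
        drain |= !list_empty(&q->queue_head);   /* queued, not issued */
        spin_unlock_irq(q->queue_lock);

        if (!drain)
                break;          /* the queue is actually empty */
        msleep(10);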
     
  • There are a number of QUEUE_FLAG_DEAD tests. Add a blk_queue_dead()
    macro (shown below) and use it.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
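
    The helper is a one-liner in the style of the other blk_queue_*()
    tests:

        #define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)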
     
  • The only user left for blk_insert_request() is sx8, and it can be
    trivially switched to use blk_execute_rq_nowait() - special
    requests aren't included in io stats and sx8 doesn't use block
    layer tagging. Switch sx8 over and kill blk_insert_request().

    This patch doesn't introduce any functional difference.

    Only compile tested.

    Signed-off-by: Tejun Heo
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Dec, 2011

9 commits


09 Dec, 2011

19 commits

  • In order to safely dereference current->real_parent inside an
    rcu_read_lock() critical section, we need an rcu_dereference(), as
    sketched after this entry.

    Signed-off-by: Mandeep Singh Baines
    Cc: Thomas Gleixner
    Cc: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
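
    The shape of the fix (a generic sketch; the field read from the
    parent is illustrative):

        rcu_read_lock();
        /* rcu_dereference() supplies the read-side annotation and
         * ordering required for an RCU-protected pointer */
        parent = rcu_dereference(current->real_parent);
        ppid = parent->pid;
        rcu_read_unlock();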
     
  • Modify initialization of the PCIe capability registers in the
    Tsi721 mport driver:
    - change the Completion Timeout value to avoid unexpected data
    transfer aborts during intensive traffic.
    - replace the hardcoded offset of the PCIe capability block with a
    lookup via the common function (sketched after this entry).

    This patch is applicable to kernel versions starting from 3.2-rc1.

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
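
    The capability-offset half of the change, sketched with the generic
    helper (priv, devctl2 and timeout_val are illustrative names, not
    the driver's):

        /* instead of a hardcoded config-space offset */
        pcie_off = pci_pcie_cap(priv->pdev);
        pci_read_config_dword(priv->pdev,
                              pcie_off + PCI_EXP_DEVCTL2, &devctl2);
        devctl2 = (devctl2 & ~0xf) | timeout_val; /* Completion Timeout */
        pci_write_config_dword(priv->pdev,
                               pcie_off + PCI_EXP_DEVCTL2, devctl2);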
     
  • Bug fix for the Tsi721 RapidIO mport driver: Tsi721 supports four
    RapidIO mailboxes (MBOX0 - MBOX3) as defined by the RapidIO
    specification. Mailbox resources have to be properly reported to
    allow use of all available mailboxes (the initial version reported
    only MBOX0).

    This patch is applicable to kernel versions starting from 3.2-rc1.

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Replace the dma_alloc_coherent()+memset() pair with the new
    dma_zalloc_coherent() added by Andrew Morton for kernel version
    3.2; the transformation is shown after this entry.

    Signed-off-by: Alexandre Bounine
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
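
    The transformation, for reference:

        /* before */
        buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
        if (buf)
                memset(buf, 0, size);

        /* after */
        buf = dma_zalloc_coherent(dev, size, &dma_handle, GFP_KERNEL);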
     
  • Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
    iowait times") we are reporting idle/io_wait time also while a CPU is
    tickless. We rely on get_{idle,iowait}_time functions to retrieve
    proper data.

    These functions, however, use usecs_to_cputime to translate
    microseconds into cputime64_t. That is just an alias for
    usecs_to_jiffies, which reduces the data type from u64 to unsigned
    int and also checks whether the given parameter overflows
    jiffies_to_usecs(MAX_JIFFY_OFFSET), returning MAX_JIFFY_OFFSET in
    that case.

    Where the overflow happens depends on CONFIG_HZ, but especially for
    CONFIG_HZ_300 the limit is quite low (1431649781), so we get
    MAX_JIFFY_OFFSET for anything beyond roughly 3000s, until the
    unsigned int itself overflows. For reference, CONFIG_HZ_100 has an
    overflow window of around 20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000
    ~2s.

    This resulted in a bug where people saw [h]top go mad and report
    100% CPU usage even though there was basically no CPU load. The
    reason was simply that /proc/stat stopped reporting idle/io_wait
    changes (it reported MAX_JIFFY_OFFSET), so the only values still
    changing were user and system time.

    Let's use nsecs_to_jiffies64 instead, which doesn't reduce the
    precision to a 32b type and is much more appropriate for cumulative
    time values (unlike usecs_to_jiffies, which is intended for timeout
    calculations). See the sketch after this entry.

    Signed-off-by: Michal Hocko
    Tested-by: Artem S. Tashkinov
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
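
    The essence of the fix, sketched: get_cpu_idle_time_us() reports
    microseconds, and the conversion now stays in 64 bits throughout:

        u64 idle_us = get_cpu_idle_time_us(cpu, NULL);

        /* old: usecs_to_cputime(idle_us) - truncated to unsigned int
         * and clamped at MAX_JIFFY_OFFSET */
        idle = nsecs_to_jiffies64(idle_us * NSEC_PER_USEC);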
     
  • Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist
    after they are fully initialised. Unfortunately, it did not check
    that __vmalloc_area_node() successfully populated the area. In the
    event of allocation failure, the vmalloc area is freed but the
    pointer to the freed memory is inserted into the vmlist, leading to
    a crash later in get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the
    previous error path because a warning was already displayed by
    __vmalloc_area_node() before it called vfree in its failure path.
    The shape of the fix is sketched after this entry.

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
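
    The shape of the fix in __vmalloc_node_range() (a sketch):

        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL;    /* the area was already freed (and a
                                 * warning printed) - do not insert the
                                 * stale pointer into the vmlist */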
     
  • setup_zone_migrate_reserve() expects zone->start_pfn to start at a
    pageblock_nr_pages-aligned pfn; otherwise we could access beyond an
    existing memblock, resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check
    pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages so
    that pfn_valid is always called at the start of each pageblock (see
    the sketch after this entry). Architectures with holes in
    pageblocks will be correctly handled by pfn_valid_within in
    pageblock_is_reserved.

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
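
    The fix boils down to aligning the starting pfn before walking the
    pageblocks (a sketch):

        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        /* guarantee pfn_valid() is checked against the first page of
         * every pageblock visited */
        start_pfn = roundup(start_pfn, pageblock_nr_pages);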
     
  • Avoid unlocking an unlocked page if we failed to lock it.

    Signed-off-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps
    all page_tail->_count zero at all times. But the current kernel
    does not set page_tail->_count to zero if a 1GB page is utilized.
    So when an IOMMU 1GB page is used by KVM, it will result in a
    kernel oops because a tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • With the 3.2-rc kernel, IOMMU 2M pages in KVM work. But when I
    tried to use IOMMU 1GB pages in KVM, I encountered an oops and the
    1GB page failed to be used.

    The root cause is that 1GB page allocation calls gup_huge_pud()
    while 2M page allocation calls gup_huge_pmd(). If compound pages
    are used and the page is a tail page, gup_huge_pmd() increases
    _mapcount to record that the tail page is mapped, while
    gup_huge_pud() does not do that.

    So when the mapped page is released, it will result in a kernel
    oops because the page is not marked as mapped.

    This patch adds tail-page processing for compound pages to the 1GB
    huge page path, keeping it in line with the 2M page path (see the
    sketch after this entry).

    Reproduce as follows:
    1. Add grub boot option: hugepagesz=1G hugepages=8
    2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
    3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
    -net none -device pci-assign,host=07:00.1

    kernel BUG at mm/swap.c:114!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    put_page+0x15/0x37
    kvm_release_pfn_clean+0x31/0x36
    kvm_iommu_put_pages+0x94/0xb1
    kvm_iommu_unmap_memslots+0x80/0xb6
    kvm_assign_device+0xba/0x117
    kvm_vm_ioctl_assigned_device+0x301/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP put_compound_page+0xd4/0x168

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
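
    The 2M path's tail handling, mirrored into gup_huge_pud() (a
    sketch; get_huge_page_tail() is the helper introduced by the tail
    page refcounting fix mentioned above):

        do {
                VM_BUG_ON(compound_head(page) != head);
                pages[*nr] = page;
                if (PageTail(page))
                        get_huge_page_tail(page);  /* bump _mapcount */
                (*nr)++;
                page++;
        } while (addr += PAGE_SIZE, addr != end);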
     
  • Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race")
    introduced another silly bug where we would want to acquire an already
    held lock. Avoid this.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • More people have joined memory cgroup development, and Johannes'
    great work has changed the internal design of memory cgroup
    dramatically; he will do more. Michal Hocko has fixed many bugs
    and knows memory cgroup very well. Daisuke Nishimura has helped us
    very much but seems busy now. Thanks for his work.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • If an error occurs after the clock is enabled, the enable/disable
    state can become unbalanced; see the sketch after this entry.

    Signed-off-by: Jonghwan Choi
    Cc: Alessandro Zummo
    Acked-by: Kukjin Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonghwan Choi
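
    The general error-path pattern being fixed (a hypothetical sketch;
    rtc_clk and do_setup() are illustrative, not the driver's names):

        clk_enable(rtc_clk);

        ret = do_setup(dev);
        if (ret) {
                clk_disable(rtc_clk);   /* keep enable count balanced */
                return ret;
        }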
     
  • Small clean-up for my CREDITS entry; the GPG fingerprint was not up to
    date, so I fixed other details at the same time too.

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups;
    see the sketch after this entry. A try_to_freeze() would also have
    been needed in the khugepaged_alloc_hugepage tight loop in case the
    allocation failed repeatedly, and wait_event_freezable_timeout
    provides that too.

    khugepaged would still freeze just fine by trying again the next
    minute, but it's better if it freezes immediately.

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
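
    The substitution, roughly (the condition is false, so this is
    purely a freezable timed sleep):

        /* before: can miss a freezer wakeup */
        schedule_timeout_interruptible(
                msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));

        /* after: sleeps just as long but cooperates with the freezer */
        wait_event_freezable_timeout(khugepaged_wait, false,
                msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));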
     
  • Fix the build error "directives may not be used inside a macro
    argument" which appears when the kernel is compiled for the cris
    architecture; an illustration follows this entry.

    Signed-off-by: Claudio Scordino
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
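
    The offending pattern and its usual fix, in miniature (illustrative
    code, not the actual cris source):

        /* broken: a preprocessor directive inside a macro argument */
        seq_printf(m, "flags:"
        #ifdef CONFIG_FOO
                      " foo"
        #endif
                      "\n");

        /* fixed: keep the directive outside the macro invocation */
        #ifdef CONFIG_FOO
        seq_printf(m, "flags: foo\n");
        #else
        seq_printf(m, "flags:\n");
        #endif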
     
  • Use atomic-long operations instead of looping around cmpxchg(); the
    transformation is sketched after this entry.

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
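
    The transformation in miniature:

        /* before: an open-coded loop around cmpxchg() */
        do {
                old = atomic_long_read(&counter);
        } while (atomic_long_cmpxchg(&counter, old, old + nr) != old);

        /* after: a single atomic-long operation */
        atomic_long_add(nr, &counter);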
     
  • A shrinker function can return -1, meaning that it cannot do
    anything without a risk of deadlock. For example, prune_super()
    does this if it cannot grab a superblock reference, even if
    nr_to_scan=0. Currently we interpret this -1 as a ULONG_MAX-sized
    shrinker and evaluate `total_scan' accordingly, so the next time
    around this shrinker can cause really big pressure. Let's skip
    such shrinkers instead (see the sketch after this entry).

    Also make total_scan signed; otherwise the check (total_scan < 0)
    below never works.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
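
    The two changes, sketched against shrink_slab()'s loop
    (do_shrinker_shrink() is the existing wrapper around
    shrinker->shrink()):

        long total_scan;    /* signed, so (total_scan < 0) can trigger */
        long max_pass;

        max_pass = do_shrinker_shrink(shrinker, shrink, 0);
        if (max_pass <= 0)
                continue;   /* -1 means "would risk deadlock" - skip */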
     
  • Commit 29b68415e335 ("x86: amd_iommu: move to drivers/iommu/")
    moved the files, update the patterns.

    CC: Ohad Ben-Cohen
    CC: Joerg Roedel

    Signed-off-by: Joe Perches
    Signed-off-by: Joerg Roedel

    Joe Perches