26 Dec, 2011

1 commit

  • * pm-sleep: (51 commits)
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
    PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
    PM / Sleep: Make [un]lock_system_sleep() generic
    PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
    PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
    PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
    Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
    PM / Sleep: Unify diagnostic messages from device suspend/resume
    ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
    PM / Hibernate: Remove deprecated hibernation test modes
    PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
    ...

    Conflicts:
    kernel/kmod.c

    Rafael J. Wysocki
     

23 Dec, 2011

1 commit


22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki
     

16 Dec, 2011

2 commits

  • …/btrfs-work into integration

    Conflicts:
    fs/btrfs/inode.c

    Signed-off-by: Chris Mason <chris.mason@oracle.com>

    Chris Mason
     
  • Al pointed out that we have some random problems with the way we account
    for num_workers_starting in the async thread code. First of all, we need
    to make sure to decrement num_workers_starting if we fail to start the
    worker, so make __btrfs_start_workers do this. Also fix
    __btrfs_start_workers so that it doesn't call btrfs_stop_workers(); there
    is no point in stopping everybody if we failed to create a worker.
    Finally, check_pending_worker_creates needs to call __btrfs_start_workers
    in its work function, since it already increments num_workers_starting.
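
    A minimal sketch of the error path described above, with field names
    assumed rather than taken verbatim from fs/btrfs/async-thread.c:

    static int __btrfs_start_workers(struct btrfs_workers *workers)
    {
        struct btrfs_worker_thread *worker;
        int ret;

        worker = kzalloc(sizeof(*worker), GFP_NOFS);
        if (!worker) {
            ret = -ENOMEM;
            goto fail;
        }

        worker->task = kthread_run(worker_loop, worker, "btrfs-worker");
        if (IS_ERR(worker->task)) {
            ret = PTR_ERR(worker->task);
            kfree(worker);
            goto fail;
        }
        return 0;

    fail:
        /* undo the caller's increment instead of tearing down every
         * worker with btrfs_stop_workers() */
        spin_lock_irq(&workers->lock);
        workers->num_workers_starting--;
        spin_unlock_irq(&workers->lock);
        return ret;
    }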

    People only start one worker at a time, so get rid of the num_workers
    argument everywhere, and make btrfs_queue_worker return void since it will
    always succeed.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

15 Dec, 2011

1 commit


22 Nov, 2011

1 commit

  • There is no reason to export two functions for entering the
    refrigerator. Calling refrigerator() instead of try_to_freeze()
    doesn't save anything noticeable or remove any race condition.
    A sketch of the resulting API follows the list of changes below.

    * Rename refrigerator() to __refrigerator() and make it return bool
    indicating whether it scheduled out for freezing.

    * Update try_to_freeze() to return bool and relay the return value of
    __refrigerator() if freezing().

    * Convert all refrigerator() users to try_to_freeze().

    * Update documentation accordingly.

    * While at it, add might_sleep() to try_to_freeze().
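
    A minimal sketch of the result, assuming simplified signatures (the
    kernel's actual __refrigerator() may take arguments not shown here):

    static inline bool try_to_freeze(void)
    {
        might_sleep();              /* entering the refrigerator can schedule */
        if (!freezing(current))
            return false;
        /* returns true if we actually scheduled out for freezing */
        return __refrigerator();
    }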

    Signed-off-by: Tejun Heo
    Cc: Samuel Ortiz
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Steven Whitehouse
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: Christoph Hellwig

    Tejun Heo
     

25 May, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h. (A representative example follows
    this list.)

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
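
    As a representative (hypothetical) example, a file that previously
    relied on percpu.h pulling in slab.h gains explicit includes for what
    it actually uses:

    #include <linux/gfp.h>      /* gfp flags such as GFP_KERNEL */
    #include <linux/slab.h>     /* kmalloc()/kfree() */

    static int *alloc_counter(void)
    {
        return kmalloc(sizeof(int), GFP_KERNEL);
    }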

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed,
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs, requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 7, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

05 Oct, 2009

1 commit

  • The btrfs async worker threads are used for a wide variety of things,
    including processing bio end_io functions. This means that when
    the endio threads aren't running, the rest of the FS isn't
    able to do the final processing required to clear PageWriteback.

    The endio threads also try to exit as they become idle and
    start more as the work piles up. The problem is that starting more
    threads means kthreadd may need to allocate ram, and that allocation
    may wait until the global number of writeback pages on the system is
    below a certain limit.

    The result of that throttling is that end IO threads wait on
    kthreadd, who is waiting on IO to end, which will never happen.

    This commit fixes the deadlock by handing off thread startup to a
    dedicated thread. It also fixes a bug where the on-demand thread
    creation was creating far too many threads because it didn't take into
    account threads being started by other procs.
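
    A hedged sketch of the hand-off, with names simplified rather than
    copied from the patch: end_io context only records that a thread is
    wanted, and a dedicated starter running in normal process context
    does the allocation-prone kthread_run():

    /* called from end_io: must not allocate or sleep */
    static void request_more_workers(struct btrfs_workers *workers)
    {
        workers->atomic_start_pending = 1;
    }

    /* runs in the dedicated starter thread */
    static void check_pending_worker_creates(struct btrfs_workers *workers)
    {
        if (!workers->atomic_start_pending)
            return;
        workers->atomic_start_pending = 0;
        btrfs_start_workers(workers, 1);    /* may block in kthreadd */
    }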

    Signed-off-by: Chris Mason

    Chris Mason
     

16 Sep, 2009

3 commits

  • It was possible for an async worker thread to be selected to
    receive a new work item, but exit before the work item was
    actually placed into that thread's work list.

    This commit fixes the race by incrementing the num_pending
    counter earlier, and making sure to check the number of pending
    work items before a thread exits.
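
    Roughly, as a sketch (names assumed, locking simplified), the queueing
    side bumps the counter before the hand-off and the exit path honors it:

    static void hand_off_work(struct btrfs_worker_thread *worker,
                              struct btrfs_work *work)
    {
        atomic_inc(&worker->num_pending);   /* before listing the work */
        spin_lock_irq(&worker->lock);
        list_add_tail(&work->list, &worker->pending);
        spin_unlock_irq(&worker->lock);
        wake_up_process(worker->task);
    }

    static bool worker_may_exit(struct btrfs_worker_thread *worker)
    {
        /* zero means nothing queued and nothing in flight toward us */
        return atomic_read(&worker->num_pending) == 0;
    }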

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The exit-on-idle code for async worker threads was incorrectly
    calling spin_lock_irq with interrupts already off.
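
    A minimal illustration with a hypothetical helper; the point is the
    plain spin_lock, since spin_lock_irq()/spin_unlock_irq() here would
    re-enable interrupts that the caller still expects to be off:

    /* caller already runs with interrupts disabled */
    static void try_exit_on_idle(struct btrfs_worker_thread *worker)
    {
        spin_lock(&worker->workers->lock);      /* irqs stay off */
        if (list_empty(&worker->pending))
            list_del_init(&worker->worker_list);
        spin_unlock(&worker->workers->lock);
    }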

    Signed-off-by: Chris Mason

    Chris Mason
     
  • After a new worker thread starts, it is placed into the
    list of idle threads. But, this may race with a
    check for idle done by the worker thread itself, resulting
    in a double list_add operation.

    This fix adds a check to make sure the idle thread addition
    is done properly.
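
    A sketch of the guard (condition illustrative, not the literal patch);
    a second list_add on an already-linked node would corrupt the list:

    static void mark_worker_idle(struct btrfs_workers *workers,
                                 struct btrfs_worker_thread *worker)
    {
        spin_lock_irq(&workers->lock);
        /* the thread may have raced and re-added itself already */
        if (list_empty(&worker->worker_list))
            list_add_tail(&worker->worker_list, &workers->idle_list);
        spin_unlock_irq(&workers->lock);
    }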

    Signed-off-by: Chris Mason

    Chris Mason
     

12 Sep, 2009

3 commits

  • This changes the btrfs worker threads to batch work items
    into a local list. It allows us to pull work items in
    large chunks and significantly reduces the number of times we
    need to take the worker thread spinlock.
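
    The core of the batching, sketched with simplified names and locking:
    one lock round-trip moves the pending items to a private list, which
    is then processed without touching the shared lock:

    static void grab_pending_batch(struct btrfs_worker_thread *worker,
                                   struct list_head *local)
    {
        spin_lock_irq(&worker->lock);
        list_splice_init(&worker->pending, local);
        spin_unlock_irq(&worker->lock);
    }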

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The btrfs worker thread spinlock was being used both for the
    queueing of IO and for the processing of ordered events.

    The ordered events never happen from end_io handlers, and so they
    don't need to use the _irq version of spinlocks. This adds a
    dedicated lock to the ordered lists so they don't have to run
    with irqs off.
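
    Sketched layout (field names assumed):

    struct btrfs_workers {
        spinlock_t lock;            /* IO queueing; taken from end_io,
                                     * so irq-safe variants required */
        spinlock_t order_lock;      /* ordered lists; never taken from
                                     * irq context, plain spin_lock is
                                     * enough */
        struct list_head order_list;
    };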

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The Btrfs worker threads don't currently die off after they have
    been idle for a while, leading to a lot of threads sitting around
    doing nothing for each mount.

    Also, they are unable to start atomically (from end_io hanlders).

    This commit reworks the worker threads so they can be started
    from end_io handlers (just setting a flag that asks for a thread
    to be added at a later date) and so they can exit if they
    have been idle for a long time.

    Signed-off-by: Chris Mason

    Chris Mason
     

23 Jul, 2009

1 commit

  • If spin_lock_irqsave is called twice in a row with the same second
    argument, the interrupt state at the point of the second call overwrites
    the value saved by the first call. Indeed, the second call does not need
    to save the interrupt state, so it is changed to a simple spin_lock.
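
    A minimal illustration with hypothetical locks:

    static void nested_lock_example(spinlock_t *a, spinlock_t *b)
    {
        unsigned long flags;

        spin_lock_irqsave(a, flags);    /* flags := irq state before a */
        /* BUG (pre-patch): spin_lock_irqsave(b, flags) would overwrite
         * flags with "interrupts disabled", so the restore below would
         * never re-enable interrupts */
        spin_lock(b);                   /* irqs already off; no save needed */

        spin_unlock(b);
        spin_unlock_irqrestore(a, flags);
    }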

    Signed-off-by: Julia Lawall
    Signed-off-by: Chris Mason

    Julia Lawall
     

03 Jul, 2009

1 commit


11 Jun, 2009

1 commit

  • This patch fixes a bug which may result in a race condition
    between btrfs_start_workers() and worker_loop().

    btrfs_start_workers(), executed in a parent thread, writes
    workers->worker, and worker_loop(), in a child thread,
    reads workers->worker. However, there is no synchronization
    enforcing the order of the two operations.

    This patch makes btrfs_start_workers() fill workers->worker
    before it starts a child thread running worker_loop().
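
    A sketch of the ordering fix, with simplified names; the key is that
    everything the child reads is written before kthread_run(), which
    itself orders prior writes ahead of the new thread's execution:

    static int start_one_worker(struct btrfs_workers *workers, int i)
    {
        struct btrfs_worker_thread *worker = &workers->worker[i];

        worker->workers = workers;  /* published before the child runs */
        worker->task = kthread_run(worker_loop, worker, "btrfs-%d", i);
        if (IS_ERR(worker->task))
            return PTR_ERR(worker->task);
        return 0;
    }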

    Signed-off-by: Chris Mason

    Shin Hong
     

21 Apr, 2009

1 commit

  • Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
    higher priority. But, the checksumming helper threads prevent it
    from being fully effective.

    There are two problems. First, a big queue of pending checksumming
    will delay the synchronous IO behind other lower priority writes. Second,
    the checksumming uses an ordered async work queue. The ordering makes sure
    that IOs are sent to the block layer in the same order they are sent
    to the checksumming threads. Usually this gives us less seeky IO.

    But, when we start mixing IO priorities, the lower priority IO can delay
    the higher priority IO.

    This patch solves both problems by adding a high priority list to the async
    helper threads, and a new btrfs_set_work_high_prio(), which is used
    to put a new async work item onto the higher priority list.

    The ordering is still done on high priority IO, but all of the high
    priority bios are ordered separately from the low priority bios. This
    ordering is purely an IO optimization, it is not involved in data
    or metadata integrity.
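
    The new hook itself is tiny; sketched here with the bit name assumed:

    void btrfs_set_work_high_prio(struct btrfs_work *work)
    {
        set_bit(WORK_HIGH_PRIO_BIT, &work->flags);
    }

    btrfs_queue_worker() then checks that bit and appends the item to the
    thread's high priority list instead of the normal pending list.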

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Apr, 2009

2 commits


04 Feb, 2009

2 commits


21 Jan, 2009

1 commit


06 Jan, 2009

1 commit


13 Nov, 2008

1 commit

  • In worker_loop(), the function should check whether it has been requested
    to stop before it decides to schedule out.

    Otherwise, if the stop request (also the last wake_up()) sent by
    btrfs_stop_workers() arrives while worker_loop() is running, after the
    "while" check and before schedule(), worker_loop() will schedule away and
    never be woken up, which will also cause btrfs_stop_workers() to wait
    forever.
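
    The classic race-free pattern, sketched as a simplified worker_loop():
    arming the task state before re-checking the stop flag means a
    wake_up() issued after the check simply sets us back to TASK_RUNNING,
    so schedule() returns promptly instead of sleeping forever:

    static int worker_loop(void *arg)
    {
        while (1) {
            /* ... process pending work ... */
            set_current_state(TASK_INTERRUPTIBLE);
            if (kthread_should_stop()) {    /* re-check before sleeping */
                __set_current_state(TASK_RUNNING);
                break;
            }
            schedule();
        }
        return 0;
    }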

    Signed-off-by: Chris Mason

    yanhai zhu
     

07 Nov, 2008

1 commit

  • Btrfs uses kernel threads to create async work queues for cpu intensive
    operations such as checksumming and decompression. These work well,
    but they make it difficult to keep IO order intact.

    A single writepages call from pdflush or fsync will turn into a number
    of bios, and each bio is checksummed in parallel. Once the checksum is
    computed, the bio is sent down to the disk, and since we don't control
    the order in which the parallel operations happen, they might go down to
    the disk in almost any order.

    The code deals with this somewhat by having deep work queues for a single
    kernel thread, making it very likely that a single thread will process all
    the bios for a single inode.

    This patch introduces an explicitly ordered work queue. As work structs
    are placed into the queue they are put onto the tail of a list. They have
    three callbacks:

    ->func (cpu intensive processing here)
    ->ordered_func (order sensitive processing here)
    ->ordered_free (free the work struct, all processing is done)

    The func callback does the cpu intensive
    work, and when it completes the work struct is marked as done.

    Every time a work struct completes, the list is checked to see if the head
    is marked as done. If so the ordered_func callback is used to do the
    order sensitive processing and the ordered_free callback is used to do
    any cleanup. Then we loop back and check the head of the list again.
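
    A hedged sketch of the mechanism, with field and helper names assumed
    and locking elided:

    struct btrfs_work {
        void (*func)(struct btrfs_work *);          /* cpu intensive */
        void (*ordered_func)(struct btrfs_work *);  /* order sensitive */
        void (*ordered_free)(struct btrfs_work *);  /* final cleanup */
        unsigned long flags;                        /* done bit lives here */
        struct list_head order_list;                /* FIFO position */
    };

    static void run_ordered_work(struct list_head *order_list)
    {
        struct btrfs_work *work;

        while (!list_empty(order_list)) {
            work = list_first_entry(order_list, struct btrfs_work,
                                    order_list);
            if (!test_bit(WORK_DONE_BIT, &work->flags))
                break;                  /* head not finished: stop */
            work->ordered_func(work);   /* order sensitive processing */
            list_del(&work->order_list);
            work->ordered_free(work);   /* work may be freed here */
        }
    }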

    This patch also changes the checksumming code to use the ordered workqueues.
    On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.

    Signed-off-by: Chris Mason

    Chris Mason
     

01 Oct, 2008

1 commit

  • When reading in block groups, a global mask of the available raid policies
    should be adjusted based on the types of block groups found on disk. This
    global mask is then used to decide which raid policy to use for new
    block groups.

    The recent allocator changes dropped the call that updated the global
    mask, making all the block groups allocated at run time single striped
    onto a single drive.

    This also fixes the async worker threads by marking any thread that uses
    the requeue mechanism as busy. This allows us to avoid blocking
    on get_request_wait for the async bio submission threads.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Sep, 2008

1 commit

  • This improves the comments at the top of many functions. It didn't
    dive into the guts of functions because I was trying to
    avoid merging problems with the new allocator and back reference work.

    extent-tree.c and volumes.c were both skipped, and there is definitely
    more work to do in cleaning and commenting the code.

    Signed-off-by: Chris Mason

    Chris Mason
     

26 Sep, 2008

1 commit


25 Sep, 2008

8 commits

  • This takes the csum mutex deeper in the call chain and releases it
    more often.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Before this change, btrfs would use a bdi congestion function to make
    sure there weren't too many pending async checksum work items.

    This change makes the process creating async work items wait instead,
    leading to fewer congestion returns from the bdi. This improves
    pdflush background_writeout scanning.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • Large streaming reads make for large bios, which means each entry on
    the async work queue lists represents a large amount of data. IO
    congestion throttling on the device was kicking in before the async
    worker threads decided a single thread was busy and needed some help.

    The end result was that a streaming read would result in a single CPU
    running at 100% instead of balancing the work off to other CPUs.

    This patch also changes the pre-IO checksum lookup done by reads to
    work on a per-bio basis instead of per-page; the per-page lookups were
    causing many extra btree searches on large streaming reads. Doing the
    checksum lookup right before bio submit allows us to reuse searches
    while processing adjacent offsets.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • When kthread_run() returns failure, this worker hasn't been
    added to the list, so btrfs_stop_workers() won't free it.
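
    Sketched error path (simplified): the failed worker was never linked
    into the pool's list, so it must be freed on the spot:

    static int start_new_worker(struct btrfs_workers *workers)
    {
        struct btrfs_worker_thread *worker;

        worker = kzalloc(sizeof(*worker), GFP_NOFS);
        if (!worker)
            return -ENOMEM;

        worker->task = kthread_run(worker_loop, worker, "btrfs");
        if (IS_ERR(worker->task)) {
            int ret = PTR_ERR(worker->task);

            kfree(worker);  /* btrfs_stop_workers() would never find it */
            return ret;
        }

        list_add_tail(&worker->worker_list, &workers->idle_list);
        return 0;
    }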

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • This changes the worker thread pool to maintain a list of idle threads,
    avoiding a complex search for a good thread to wake up.

    Threads have two states:

    idle - we try to reuse the last thread used in hopes of improving the batching
    ratios

    busy - each time a new work item is added to a busy task, the task is
    rotated to the end of the line.
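
    A sketch of the selection logic implied above (names assumed):

    static struct btrfs_worker_thread *
    next_worker(struct btrfs_workers *workers)
    {
        struct btrfs_worker_thread *worker;

        /* idle: reuse the most recently used thread to keep batching up */
        if (!list_empty(&workers->idle_list))
            return list_first_entry(&workers->idle_list,
                                    struct btrfs_worker_thread,
                                    worker_list);

        /* all busy: take the head, then rotate it to the tail so new
         * work round-robins across the busy threads */
        worker = list_first_entry(&workers->worker_list,
                                  struct btrfs_worker_thread, worker_list);
        list_move_tail(&worker->worker_list, &workers->worker_list);
        return worker;
    }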

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • Btrfs has been using workqueues to spread the checksumming load across
    other CPUs in the system. But, workqueues only schedule work on the
    same CPU that queued the work, giving them a limited benefit for systems with
    higher CPU counts.

    This code adds a generic facility to schedule work with pools of kthreads,
    and changes the bio submission code to queue bios up. The queueing is
    important to make sure large numbers of procs on the system don't
    turn streaming workloads into random workloads by sending IO down
    concurrently.

    The end result of all of this is much higher performance (and CPU usage) when
    doing checksumming on large machines. Two worker pools are created,
    one for writes and one for endio processing. The two could deadlock if
    we tried to service both from a single pool.
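
    Sketched setup, with the init signature assumed rather than quoted:

    static void setup_worker_pools(struct btrfs_fs_info *fs_info, int size)
    {
        /* two pools: if submission and completion shared one, every
         * thread could be stuck submitting while the completions they
         * depend on starve behind them */
        btrfs_init_workers(&fs_info->workers, size);
        btrfs_init_workers(&fs_info->endio_workers, size);
    }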

    Signed-off-by: Chris Mason

    Chris Mason