25 Oct, 2010

3 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    Revert "block: fix accounting bug on cross partition merges"

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Update broken web addresses in arch directory.
    Update broken web addresses in the kernel.
    Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget
    Revert "Fix typo: configuation => configuration" partially
    ida: document IDA_BITMAP_LONGS calculation
    ext2: fix a typo on comment in ext2/inode.c
    drivers/scsi: Remove unnecessary casts of private_data
    drivers/s390: Remove unnecessary casts of private_data
    net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data
    drivers/infiniband: Remove unnecessary casts of private_data
    drivers/gpu/drm: Remove unnecessary casts of private_data
    kernel/pm_qos_params.c: Remove unnecessary casts of private_data
    fs/ecryptfs: Remove unnecessary casts of private_data
    fs/seq_file.c: Remove unnecessary casts of private_data
    arm: uengine.c: remove C99 comments
    arm: scoop.c: remove C99 comments
    Fix typo configue => configure in comments
    Fix typo: configuation => configuration
    Fix typo interrest[ing|ed] => interest[ing|ed]
    Fix various typos of valid in comments
    ...

    Fix up trivial conflicts in:
    drivers/char/ipmi/ipmi_si_intf.c
    drivers/usb/gadget/rndis.c
    net/irda/irnet/irnet_ppp.c

    Linus Torvalds
     
  • This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327.

    Conflicts:

    include/linux/genhd.h

    It has numerous issues with the cleanup path and non-elevator
    devices. Revert it for now so we can come up with a clean
    version without rushing things.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Oct, 2010

1 commit

  • blk_throtl_exit() frees the throttle data hanging off the queue
    in blk_cleanup_queue(), but blk_put_queue() will indirectly
    dereference this data when calling blk_sync_queue(), which in
    turn calls throtl_shutdown_timer_wq().

    Fix this by moving the freeing of the throttle data to when
    the queue is truly being released, after the call to
    blk_sync_queue().

    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Oct, 2010

6 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits)
    driver core: Display error codes when class suspend fails
    Driver core: Add section count to memory_block struct
    Driver core: Add mutex for adding/removing memory blocks
    Driver core: Move find_memory_block routine
    hpilo: Despecificate driver from iLO generation
    driver core: Convert link_mem_sections to use find_memory_block_hinted.
    driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted.
    kobject: Introduce kset_find_obj_hinted.
    driver core: fix build for CONFIG_BLOCK not enabled
    driver-core: base: change to new flag variable
    sysfs: only access bin file vm_ops with the active lock
    sysfs: Fail bin file mmap if vma close is implemented.
    FW_LOADER: fix kconfig dependency warning on HOTPLUG
    uio: Statically allocate uio_class and use class .dev_attrs.
    uio: Support 2^MINOR_BITS minors
    uio: Cleanup irq handling.
    uio: Don't clear driver data
    uio: Fix lack of locking in init_uio_class
    SYSFS: Allow boot time switching between deprecated and modern sysfs layout
    driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices
    ...

    Linus Torvalds
     
  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     
  • * 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
    cfq-iosched: Fix a gcc 4.5 warning and put some comments
    block: Turn bvec_k{un,}map_irq() into static inline functions
    block: fix accounting bug on cross partition merges
    block: Make the integrity mapped property a bio flag
    block: Fix double free in blk_integrity_unregister
    block: Ensure physical block size is unsigned int
    blkio-throttle: Fix possible multiplication overflow in iops calculations
    blkio-throttle: limit max iops value to UINT_MAX
    blkio-throttle: There is no need to convert jiffies to milli seconds
    blkio-throttle: Fix link failure failure on i386
    blkio: Recalculate the throttled bio dispatch time upon throttle limit change
    blkio: Add root group to td->tg_list
    blkio: deletion of a cgroup was causes oops
    blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: revert bad fix for memory hotplug causing bounces
    Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: Prevent hang_check firing during long I/O
    cfq: improve fsync performance for small files
    ...

    Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h

    Linus Torvalds
     
  • * 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
    vfs: make no_llseek the default
    vfs: don't use BKL in default_llseek
    llseek: automatically add .llseek fop
    libfs: use generic_file_llseek for simple_attr
    mac80211: disallow seeks in minstrel debug code
    lirc: make chardev nonseekable
    viotape: use noop_llseek
    raw: use explicit llseek file operations
    ibmasmfs: use generic_file_llseek
    spufs: use llseek in all file operations
    arm/omap: use generic_file_llseek in iommu_debug
    lkdtm: use generic_file_llseek in debugfs
    net/wireless: use generic_file_llseek in debugfs
    drm: use noop_llseek

    Linus Torvalds
     
  • * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
    block: autoconvert trivial BKL users to private mutex
    drivers: autoconvert trivial BKL users to private mutex
    ipmi: autoconvert trivial BKL users to private mutex
    mac: autoconvert trivial BKL users to private mutex
    mtd: autoconvert trivial BKL users to private mutex
    scsi: autoconvert trivial BKL users to private mutex

    Fix up trivial conflicts (due to addition of private mutex right next to
    deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c

    Linus Torvalds
     
  • I have some systems which need legacy sysfs due to old tools that
    assume that a directory can never be a symlink to another
    directory, and it's a big hassle to compile separate kernels for them.

    This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
    that can be switched on/off on the kernel command line. This way
    the same binary can be used in both cases with just an option
    on the command line.

    The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
    the default. I kept the weird name to not break existing
    config files.

    Also the compat code can still be completely disabled by undefining
    CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
    care of this now instead of lots of ifdefs. This makes the code
    look nicer.

    v2: This is an updated version on top of Kay's patch to only
    handle the block devices. I tested it on my old systems
    and that seems to work.

    Cc: axboe@kernel.dk
    Signed-off-by: Andi Kleen
    Cc: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

22 Oct, 2010

1 commit

  • - Andi encountered the following warning with gcc 4.5

    linux/block/cfq-iosched.c: In function ‘cfq_dispatch_requests’:
    linux/block/cfq-iosched.c:2156:3: warning: array subscript is above array
    bounds

    - The warning is caused by the following code.

    slice = group_slice * count /
            max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
                  cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

    gcc is complaining about cfqg->busy_queues_avg[] being indexed by CFQ
    prio classes (RT, BE and IDLE) while the array size is only 2.

    - At run time, we never access cfqg->busy_queues_avg[IDLE]; the function
    returns before this code is reached.

    - To fix the warning, increase the array size, even though the extra
    slot remains unused. This patch also adds comments to clarify some
    points of confusion.

    - I have taken Jens's patch and modified it a bit.

    - Compile tested with gcc 4.4 and boot tested. I don't have gcc 4.5
    running; Andi, can you please test it with gcc 4.5 to make sure it
    works?
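
The shape of the fix can be shown outside the kernel. The following is a minimal userspace sketch, not the cfq code: the enum, struct, and function names are hypothetical stand-ins, and only the idea (size the array by the number of priority classes so every enum value is a valid index) comes from the message above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CFQ priority classes named above. */
enum prio_class_sketch {
	SKETCH_RT,
	SKETCH_BE,
	SKETCH_IDLE,
	SKETCH_NR_CLASSES
};

struct group_sketch {
	/*
	 * Sized for every class, including IDLE, so that any value of the
	 * enum is a valid index. The IDLE slot is never read at run time.
	 */
	unsigned int busy_queues_avg[SKETCH_NR_CLASSES];
};

/* Returns the larger of the stored average and the current busy count. */
static unsigned int busy_queues_for(const struct group_sketch *g,
				    enum prio_class_sketch prio,
				    unsigned int busy_now)
{
	unsigned int avg = g->busy_queues_avg[prio];

	return avg > busy_now ? avg : busy_now;
}
```

With the array sized this way the compiler can see that indexing by any class stays in bounds, even for the class that is filtered out earlier at run time.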

    Reported-by: Andi Kleen
    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

21 Oct, 2010

1 commit


19 Oct, 2010

2 commits

  • Conflicts:
    block/blk-core.c
    drivers/block/loop.c
    mm/swapfile.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The cause is incorrect accounting of hd_struct->in_flight when a bio is
    merged by ELEVATOR_FRONT_MERGE into a request that belongs to a different
    partition.

    The detailed root cause is as follows.

    Assume that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in the request_queue. Hence sda1's
    hd_struct->in_flight is 0 and sda2's is 1.

         | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio that belongs to sda1 is issued and is merged into the request
    from step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request
    moves from the sda2 region to the sda1 region. However, neither
    partition's hd_struct->in_flight is changed.

         | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request finishes and blk_account_io_done() is called. The
    partition is looked up again from the request's first sector, so
    sda1's hd_struct->in_flight, not sda2's, is decremented.

         | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    When reloading partition tables, quiesce IO to ensure that no
    request references to the partition struct exist. When it is safe
    to free the partition table, the IO for that device is restarted
    again.
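
The caching idea can be sketched in userspace. This is an illustration under stated assumptions, not the kernel implementation: `part_sketch`, `req_sketch`, and all function names are hypothetical, and the structures are reduced to the fields needed to show that the increment and the decrement hit the same partition.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for hd_struct and struct request. */
struct part_sketch {
	unsigned long start;	/* first sector of the partition */
	unsigned long nr_sects;	/* partition length in sectors */
	int in_flight;
};

struct req_sketch {
	unsigned long sector;		/* first sector of the request */
	struct part_sketch *part;	/* partition cached at start time */
};

/* Map a sector to the partition containing it, or NULL if none does. */
static struct part_sketch *lookup_part(struct part_sketch *tbl, size_t n,
				       unsigned long sector)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sector >= tbl[i].start &&
		    sector < tbl[i].start + tbl[i].nr_sects)
			return &tbl[i];
	return NULL;
}

/* Account the start of a request and remember which partition was charged. */
static void account_start(struct req_sketch *rq, struct part_sketch *tbl,
			  size_t n)
{
	rq->part = lookup_part(tbl, n, rq->sector);
	if (rq->part)
		rq->part->in_flight++;
}

/* A front merge moves the first sector; the cached partition is kept. */
static void front_merge(struct req_sketch *rq, unsigned long new_sector)
{
	rq->sector = new_sector;
}

/* Completion decrements the cached partition, with no second lookup. */
static void account_done(struct req_sketch *rq)
{
	if (rq->part)
		rq->part->in_flight--;
}
```

After a front merge a fresh lookup on the request's first sector would return the other partition, which is the source of the negative counter; using the cached pointer brings both counters back to zero.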

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     

15 Oct, 2010

3 commits

  • bsg incorrectly returns sg's masked_status value for device_status.

    [jejb: fix up expression logic]
    Reported-by: Douglas Gilbert
    Signed-off-by: FUJITA Tomonori
    Cc: Stable Tree
    Signed-off-by: James Bottomley

    FUJITA Tomonori
     
  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.
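
The selection rule can be summarised as a small decision function. This is a sketch of one reading of the rules, assuming the checks are applied in the order shown; the struct, enum, and function names are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical summary of what is known about one file_operations instance. */
struct fops_facts {
	bool open_is_nonseekable;	/* the open path calls nonseekable_open */
	bool read_is_seq_read;		/* .read is seq_read */
	bool touches_f_pos;		/* read, write or readdir uses the offset */
};

enum llseek_choice {
	USE_NO_LLSEEK,
	USE_SEQ_LSEEK,
	USE_DEFAULT_LLSEEK,
	USE_NOOP_LLSEEK
};

/* Pick the .llseek variant for a file_operations that has none yet. */
static enum llseek_choice pick_llseek(const struct fops_facts *f)
{
	if (f->open_is_nonseekable)
		return USE_NO_LLSEEK;
	if (f->read_is_seq_read)
		return USE_SEQ_LSEEK;
	if (f->touches_f_pos)
		return USE_DEFAULT_LLSEEK;
	/* The offset is provably ignored: keep seeks succeeding. */
	return USE_NOOP_LLSEEK;
}
```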

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     
  • Commit 3839e4b introduced a kobject_put but failed to remove the
    kmem_cache_free beneath it, leading to a double free.
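
The pattern behind this bug can be shown with a userspace reference count. The sketch below assumes a simplified object whose last put releases the memory; `obj_sketch` and its helpers are hypothetical names, with malloc/free standing in for the slab cache.

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for a kobject-style reference count. */
struct obj_sketch {
	int refcount;
	int *freed_counter;	/* incremented by release, for demonstration */
};

static struct obj_sketch *obj_alloc(int *freed_counter)
{
	struct obj_sketch *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcount = 1;
	o->freed_counter = freed_counter;
	return o;
}

/* Drop a reference; the last put releases the memory. */
static void obj_put(struct obj_sketch *o)
{
	if (--o->refcount == 0) {
		++*o->freed_counter;
		free(o);
	}
}

/* Teardown: the put is the only free. No explicit free may follow it. */
static void obj_unregister(struct obj_sketch *o)
{
	obj_put(o);
	/* Calling free(o) here would be the double free described above. */
}
```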

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

14 Oct, 2010

1 commit


07 Oct, 2010

1 commit

  • 2.6.36 introduces an API for drivers to switch the IO scheduler
    instead of manually calling the elevator exit and init functions.
    This API was added since q->elevator must be cleared in between
    those two calls. And since we already have this functionality
    directly from use by the sysfs interface to switch schedulers
    online, it was prudent to reuse it internally too.

    But this API needs the queue to be in a fully initialized state
    before it is called, or it will attempt to unregister elevator
    kobjects before they have been added. This results in an oops
    like this:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
    IP: [] sysfs_create_dir+0x2e/0xc0
    PGD 47ddfc067 PUD 47c6a1067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
    CPU 2
    Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

    Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
    RIP: 0010:[] [] sysfs_create_dir+0x2e/0xc0
    RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
    RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
    RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
    RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
    R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
    FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
    Stack:
    ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
    ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
    ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
    Call Trace:
    [] kobject_add_internal+0xe7/0x1f0
    [] kobject_add_varg+0x38/0x60
    [] kobject_add+0x69/0x90
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sub_preempt_count+0x9d/0xe0
    [] ? _raw_spin_unlock+0x30/0x50
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sysfs_remove_dir+0x34/0xa0
    [] elv_register_queue+0x34/0xa0
    [] elevator_change+0xfd/0x250
    [] ? t_init+0x0/0x361 [t]
    [] ? t_init+0x0/0x361 [t]
    [] t_init+0xa8/0x361 [t]
    [] do_one_initcall+0x3e/0x170
    [] sys_init_module+0xbd/0x220
    [] system_call_fastpath+0x16/0x1b
    Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
    RIP [] sysfs_create_dir+0x2e/0xc0
    RSP
    CR2: 0000000000000051
    ---[ end trace a6541d3bf07945df ]---

    Fix this by adding a registered bit to the elevator queue, which is
    set when the sysfs kobjects have been registered.
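
A registered flag of this kind might be used as follows. This is a userspace sketch under assumptions: the struct, its counters, and the function names are hypothetical, and the real switch path does considerably more than what is modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the elevator queue; fields are illustrative. */
struct elv_sketch {
	bool registered;	/* set once the sysfs objects have been added */
	int sysfs_adds;		/* counts simulated kobject additions */
	int sysfs_dels;		/* counts simulated kobject removals */
};

static void elv_register_sketch(struct elv_sketch *e)
{
	e->sysfs_adds++;
	e->registered = true;
}

static void elv_unregister_sketch(struct elv_sketch *e)
{
	e->sysfs_dels++;
	e->registered = false;
}

/*
 * Switch from the old to the next scheduler. Sysfs work is done only if
 * the old elevator had been registered, so a switch on a queue that is
 * not yet fully initialized touches no kobjects.
 */
static void elv_switch_sketch(struct elv_sketch *old, struct elv_sketch *next)
{
	if (old->registered) {
		elv_unregister_sketch(old);
		elv_register_sketch(next);
	}
}
```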

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Oct, 2010

1 commit

  • The block device drivers have all gained new lock_kernel
    calls from a recent pushdown, and some of the drivers
    were already using the BKL before.

    This turns the BKL into a set of per-driver mutexes.
    Still need to check whether this is safe to do.

    file=$1
    name=$2
    if grep -q lock_kernel ${file} ; then
        if grep -q 'include.*linux.mutex.h' ${file} ; then
            sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
        else
            sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
        fi
        sed -i ${file} \
            -e "/^#include.*linux.mutex.h/,$ {
                    1,/^\(static\|int\|long\)/ {
                         /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

    } }" \
            -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
            -e '/[ ]*cycle_kernel_lock();/d'
    else
        sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
               -e '/cycle_kernel_lock()/d'
    fi
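
The effect of the conversion on one driver can be illustrated in userspace with pthreads. This is an analogy, not kernel code: the driver name "foo" and the function are hypothetical, and `pthread_mutex_t` stands in for a mutex declared with `DEFINE_MUTEX(foo_mutex)`.

```c
#include <pthread.h>

/*
 * Hypothetical driver "foo". In the kernel the script would emit
 * static DEFINE_MUTEX(foo_mutex); here a pthread mutex plays that role.
 */
static pthread_mutex_t foo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int foo_open_count;

/*
 * Before the conversion this section sat between lock_kernel() and
 * unlock_kernel(); afterwards it takes only the driver's own mutex,
 * so it no longer serializes against unrelated drivers.
 */
static int foo_open_sketch(void)
{
	int count;

	pthread_mutex_lock(&foo_mutex);
	count = ++foo_open_count;
	pthread_mutex_unlock(&foo_mutex);
	return count;
}
```

A private mutex is not a drop-in replacement in every case, since the BKL could be taken recursively and was released on sleep, which is one reason the message says the result still needs checking.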

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

02 Oct, 2010

3 commits


01 Oct, 2010

7 commits

  • o Randy Dunlap reported the following linux-next failure. This patch fixes it.

    on i386:

    blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
    blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'

    o The bytes_per_second interface is 64-bit, and I was still doing 64-bit
    division on 32-bit platforms without the help of the special
    macros/functions, hence the failure.
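
On 32-bit targets gcc lowers a plain 64-bit division to a call into libgcc's __udivdi3, which the kernel does not link against; kernel code goes through helpers such as do_div() instead. The sketch below imitates only the calling convention of do_div() in userspace. Both function names are hypothetical, and the byte calculation is an assumed example, not the blk-throttle code.

```c
#include <stdint.h>

/*
 * Userspace imitation of the calling convention of the kernel's do_div():
 * divide *n by a 32-bit base in place and return the remainder.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Bytes allowed in elapsed_ms at a 64-bit bytes-per-second limit. */
static uint64_t bytes_allowed_sketch(uint64_t bps, uint32_t elapsed_ms)
{
	uint64_t tmp = bps * elapsed_ms;	/* can overflow for very large bps */

	do_div_sketch(&tmp, 1000);
	return tmp;
}
```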

    Signed-off-by: Vivek Goyal
    Reported-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Currently any cgroup throttle limit changes are processed asynchronously and
    the change does not take effect until a new bio is dispatched from the same group.

    o It might happen that a user sets a ridiculously low limit on throttling,
    say 1 byte per second on reads. In such cases simple operations like mounting
    a disk can wait for a very long time.

    o Once a bio is throttled, there is no easy way to come out of that wait even if
    the user increases the read limit later.

    o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
    the bio dispatch time according to the new limits.

    o We can't take the queue lock under blkcg_lock, hence after the change I wake
    up the dispatch thread again, which recalculates the time. So there are some
    variables being synchronized across two threads without a lock, and I had to
    make use of barriers. Hoping I have used barriers correctly. Any review of
    the memory barrier code especially will help.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Currently all the dynamically allocated groups, except the root group, are added
    to td->tg_list. This was not a problem so far, but in the next patch I will
    walk td->tg_list to process any updates of limits on the groups.
    If the root group is not in tg_list, then the root group's updates are not
    processed.

    o It is better to add the root group to tg_list as well instead of doing special
    processing for it during limit updates.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Now a cgroup list of blkg elements can contain blkgs from multiple policies.
    Before sending an unlink event, make sure the blkg belongs to the policy. If
    the policy does not own the blkg, do not send an update for this blkg.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, throttling-related files are visible even if the user has disabled
    throttling using config options. The option switched off background throttling
    of bios but not the cgroup files. This patch fixes it.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • The bounce_pfn of the request queue in 64 bit systems is set to the
    current max_low_pfn. Adding more memory later makes this incorrect.
    Memory allocated beyond this boot time max_low_pfn appears to require
    bounce buffers (bounce buffers are actually not allocated but used in
    calculating segments that may result in "over max segments limit"
    errors).

    Signed-off-by: Malahal Naineni
    Signed-off-by: Jens Axboe

    Malahal Naineni
     
  • Revert "block: set the bounce_pfn to the actual DMA limit rather than to max memory"

    This reverts commit c49825facfd4969585224a896a5e717f88450cad.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Sep, 2010

2 commits

  • Add logic to prevent two I/O requests being merged if
    only one of them is a discard. Ditto secure discard.

    Without this fix, it is possible for write requests
    to transform into discard requests. For example:

    Submit bio 1 to discard 8 sectors from sector n
    Submit bio 2 to write 8 sectors from sector n + 16
    Submit bio 3 to write 8 sectors from sector n + 8

    Bio 1 becomes request 1. Bio 2 becomes request 2.
    Bio 3 is merged with request 2, and then subsequently
    request 2 is merged with request 1 resulting in just
    one I/O request which discards all 24 sectors.
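
The added check amounts to comparing the discard-related flags of the two sides before looking at sector positions. The sketch below is a userspace illustration; the flag macros and the function name are hypothetical and do not match the kernel's real flag names or values.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's real flag names and values differ. */
#define SKETCH_DISCARD		(1u << 0)
#define SKETCH_SECURE_DISCARD	(1u << 1)

/*
 * Two I/Os may merge only if they agree on both discard flags. This is
 * meant to run before any position test.
 */
static bool flags_allow_merge(unsigned int a_flags, unsigned int b_flags)
{
	unsigned int mask = SKETCH_DISCARD | SKETCH_SECURE_DISCARD;

	return ((a_flags ^ b_flags) & mask) == 0;
}
```

In the example above, request 1 (discard) and request 2 (write) differ in the discard bit, so the final merge that turned 24 sectors into one discard is refused.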

    Signed-off-by: Adrian Hunter

    (Moved the checks above the position checks /Jens)

    Signed-off-by: Jens Axboe

    Adrian Hunter
     
  • The bounce_pfn of the request queue in 64 bit systems is set to the
    current max_low_pfn. Adding more memory later makes this incorrect.
    Memory allocated beyond this boot time max_low_pfn appears to require
    bounce buffers (bounce buffers are actually not allocated but used in
    calculating segments that may result in "over max segments limit"
    errors).

    Signed-off-by: Malahal Naineni
    Signed-off-by: Jens Axboe

    Malahal Naineni
     

24 Sep, 2010

1 commit

  • During long I/O operations, the hang_check timer may fire,
    triggering stack dumps that unnecessarily alarm the user.

    Eg. hdparm --security-erase NULL /dev/sdb ## can take *hours* to complete

    So, if hang_check is armed, we should wake up periodically
    to prevent it from triggering. This patch uses a wake-up interval
    equal to half the hang_check timer period, which keeps overhead low enough.
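
The arithmetic behind the wake-up interval can be sketched as follows. Both function names are hypothetical, time is counted in abstract ticks with `hz` ticks per second, and an interval of 0 is assumed to mean that hang checking is disabled and a single untimed wait is used.

```c
/*
 * Interval, in ticks, between wake-ups: half the hang-check period.
 * Returns 0 when hang checking is disabled (hang_check_secs == 0).
 */
static unsigned long wake_interval_ticks(unsigned long hang_check_secs,
					 unsigned long hz)
{
	return hang_check_secs * (hz / 2);
}

/* Number of timed waits needed to sit out an I/O lasting io_ticks. */
static unsigned long wakeups_needed(unsigned long io_ticks,
				    unsigned long interval)
{
	if (interval == 0)
		return 1;	/* single wait without a timeout */
	return (io_ticks + interval - 1) / interval;
}
```

With an assumed 120-second hang-check period, an I/O lasting one hour costs 60 wake-ups, which illustrates why the overhead stays low.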

    Signed-off-by: Mark Lord
    Signed-off-by: Jens Axboe

    Mark Lord
     

21 Sep, 2010

2 commits

  • Mike reported a kernel crash when a usb key hotplug is performed while all
    kernel threads are outside the root cgroup and are running in one of the child
    cgroups of the blkio controller.

    BUG: unable to handle kernel NULL pointer dereference at 0000002c
    IP: [] cfq_get_queue+0x232/0x412
    *pde = 00000000
    Oops: 0000 [#1] PREEMPT
    last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb2/2-1/2-1:1.0/host3/scsi_host/host3/uevent

    [..]
    Pid: 30039, comm: scsi_scan_3 Not tainted 2.6.35.2-fg.roam #1 Volvi2 /Aspire 4315
    EIP: 0060:[] EFLAGS: 00010086 CPU: 0
    EIP is at cfq_get_queue+0x232/0x412
    EAX: f705f9c0 EBX: e977abac ECX: 00000000 EDX: 00000000
    ESI: f00da400 EDI: f00da4ec EBP: e977a800 ESP: dff8fd00
    DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
    Process scsi_scan_3 (pid: 30039, ti=dff8e000 task=f6b6c9a0 task.ti=dff8e000)
    Stack:
    00000000 00000000 00000001 01ff0000 f00da508 00000000 f00da524 f00da540
    e7994940 dd631750 f705f9c0 e977a820 e977ac44 f00da4d0 00000001 f6b6c9a0
    00000010 00008010 0000000b 00000000 00000001 e977a800 dd76fac0 00000246
    Call Trace:
    [] ? cfq_set_request+0x228/0x34c
    [] ? cfq_set_request+0x0/0x34c
    [] ? elv_set_request+0xf/0x1c
    [] ? get_request+0x1ad/0x22f
    [] ? get_request_wait+0x1f/0x11a
    [] ? kvasprintf+0x33/0x3b
    [] ? scsi_execute+0x1d/0x103
    [] ? scsi_execute_req+0x58/0x83
    [] ? scsi_probe_and_add_lun+0x188/0x7c2
    [] ? attribute_container_add_device+0x15/0xfa
    [] ? kobject_get+0xf/0x13
    [] ? get_device+0x10/0x14
    [] ? scsi_alloc_target+0x217/0x24d
    [] ? __scsi_scan_target+0x95/0x480
    [] ? dequeue_entity+0x14/0x1fe
    [] ? update_curr+0x165/0x1ab
    [] ? update_curr+0x165/0x1ab
    [] ? scsi_scan_channel+0x4a/0x76
    [] ? scsi_scan_host_selected+0x77/0xad
    [] ? do_scan_async+0x0/0x11a
    [] ? do_scsi_scan_host+0x51/0x56
    [] ? do_scan_async+0x0/0x11a
    [] ? do_scan_async+0xe/0x11a
    [] ? do_scan_async+0x0/0x11a
    [] ? kthread+0x5e/0x63
    [] ? kthread+0x0/0x63
    [] ? kernel_thread_helper+0x6/0x10
    Code: 44 24 1c 54 83 44 24 18 54 83 fa 03 75 94 8b 06 c7 86 64 02 00 00 01 00 00 00 83 e0 03 09 f0 89 06 8b 44 24 28 8b 90 58 01 00 00 42 2c 85 c0 75 03 8b 42 08 8d 54 24 48 52 8d 4c 24 50 51 68
    EIP: [] cfq_get_queue+0x232/0x412 SS:ESP 0068:dff8fd00
    CR2: 000000000000002c
    ---[ end trace 9a88306573f69b12 ]---

    The problem here is that we don't have bdi->dev information available when
    the thread does some IO. Hence when dev_name() tries to access bdi->dev, it
    crashes.

    This problem does not happen if kernel threads are in root group as root
    group is statically allocated at device initialization time and we don't
    hit this piece of code.

    Fix it by delaying the filling of major and minor number information of
    device in blk_group. Initially a blk_group is created with 0 as device
    information and this information is filled later once some more IO comes
    in from same group.

    Reported-by: Mike Kazantsev
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • This bug was introduced in 7b6d91daee5cac6402186ff224c3af39d79f4a0e
    "block: unify flags for struct bio and struct request"

    Cc: Boaz Harrosh
    Signed-off-by: Benny Halevy
    Signed-off-by: Jens Axboe

    Benny Halevy
     

20 Sep, 2010

1 commit

  • Fsync performance for small files achieved by cfq on high-end disks is
    lower than what deadline can achieve, due to idling introduced between
    the sync write happening in process context and the journal commit.

    Moreover, when competing with a sequential reader, a process writing
    small files and fsync-ing them is starved.

    This patch fixes the two problems by:
    - marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
    flag set,
    - forcing all queues that have REQ_NOIDLE requests to be put in the noidle
    tree.

    Having the queue associated to the fsync-ing process and the one associated
    to journal commits in the noidle tree allows:
    - switching between them without idling,
    - fairness vs. competing idling queues, since they will be serviced only
    after the noidle tree expires its slice.

    Acked-by: Vivek Goyal
    Reviewed-by: Jeff Moyer
    Tested-by: Jeff Moyer
    Signed-off-by: Corrado Zoccolo
    Signed-off-by: Jens Axboe

    Corrado Zoccolo
     

17 Sep, 2010

2 commits

  • All the blkdev_issue_* helpers can only sanely be used by synchronous
    callers. To issue cache flushes or barriers asynchronously the caller needs
    to set up a bio by itself with a completion callback to move the asynchronous
    state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • When a new disk is being discovered, add_disk() first ties the bdev to gendisk
    (via register_disk()->blkdev_get()) and only after that calls
    bdi_register_bdev(). Because register_disk() also creates disk's kobject, it
    can happen that userspace manages to open and modify the device's data (or
    inode) before its BDI is properly initialized leading to a warning in
    __mark_inode_dirty().

    Fix the problem by registering BDI early enough.

    This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312

    Cc: stable@kernel.org
    Reported-by: Larry Finger
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Signed-off-by: Jan Kara
     

16 Sep, 2010

2 commits