28 Sep, 2011

1 commit

  • There are numerous broken references to Documentation files (in other
    Documentation files, in comments, etc.). These broken references are
    caused by typos in the references, and by renames or removals of the
    Documentation files. Some broken references are simply odd.

    Fix these broken references, sometimes by dropping the irrelevant text
    they were part of.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jiri Kosina

    Paul Bolle
     

15 Sep, 2011

5 commits

  • Fast-forward merge with Linus to be able to merge patches
    based on a more recent version of the tree.

    Jiri Kosina
     
  • It was pointed out by 'make versioncheck' that some includes of
    linux/version.h are not needed in include/.
    This patch removes them.

    When I last posted the patch, the ceph bit was ACK'ed by Sage Weil, so
    I've added that below.

    The pwc-ioctl change generated quite a bit of discussion about V4L version
    numbers in general, but as far as I can tell, no consensus was reached on
    what the long term solution should be, so in the meantime I think we
    could start by just removing the unneeded include, which is why I'm
    resending the patch with that hunk still included.

    Signed-off-by: Jesper Juhl
    Acked-by: Sage Weil
    Signed-off-by: Jiri Kosina

    Jesper Juhl
     
  • Use the normal include style.

    Signed-off-by: Joe Perches
    Signed-off-by: Jiri Kosina

    Joe Perches
     
  • Building a kernel with hotplug disabled results in a link failure:

    `bgpio_remove' referenced in section `___ksymtab_gpl+bgpio_remove' of drivers/built-in.o: defined in discarded section `.devexit.text' of drivers/built-in.o

    This is because bgpio_remove() is exported. It is illegal to export
    symbols which are discarded either at link time or as part of an
    init/exit section.

    Fix this by dropping the __devexit annotation from bgpio_remove().
    Also drop the __devinit annotation from bgpio_init().

    Signed-off-by: Russell King
    Cc: Grant Likely
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now;
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
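
The "perfectly recursive" property described in the revert message above can be illustrated with a small model. This is only a sketch: the `Group` class, its field names, and the sample numbers are hypothetical and are not taken from the memcg code.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for one level of a cgroup hierarchy."""
    name: str
    local_scanned: int = 0                      # events charged to this level only
    children: list = field(default_factory=list)

    def total_scanned(self) -> int:
        # Recursive rule: a level reports itself plus all of its descendants,
        # so no intermediate level is skipped.
        return self.local_scanned + sum(c.total_scanned() for c in self.children)


# Three-level hierarchy: root -> middle -> leaf (made-up numbers).
leaf = Group("leaf", local_scanned=30)
middle = Group("middle", local_scanned=5, children=[leaf])
root = Group("root", local_scanned=1, children=[middle])

for group in (leaf, middle, root):
    print(group.name, group.total_scanned())
```

In this model the middle level always includes the leaf's count, which is the behaviour the message says the reverted statistics did not provide for levels between the child and the limit-hitting parent.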
     

09 Sep, 2011

2 commits


08 Sep, 2011

1 commit


06 Sep, 2011

2 commits


30 Aug, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (42 commits)
    netpoll: fix incorrect access to skb data in __netpoll_rx
    cassini: init before use in cas_interruptN.
    can: ti_hecc: Fix uninitialized spinlock in probe
    can: ti_hecc: Fix unintialized variable
    net: sh_eth: fix the compile error
    net/phy: fix DP83865 phy interrupt handler
    sendmmsg/sendmsg: fix unsafe user pointer access
    ibmveth: Fix leak when recycling skb and hypervisor returns error
    arp: fix rcu lockdep splat in arp_process()
    bridge: fix a possible use after free
    bridge: Pseudo-header required for the checksum of ICMPv6
    mcast: Fix source address selection for multicast listener report
    MAINTAINERS: Update GIT trees for network development
    ath9k: Fix PS wrappers in ath9k_set_coverage_class
    carl9170: Fix mismatch in carl9170_op_set_key mutex lock-unlock
    wl12xx: add max_sched_scan_ssids value to the hw description
    wl12xx: Fix validation of pm_runtime_get_sync return value
    wl12xx: Remove obsolete testmode NVS push command
    bcma: add uevent to the bus, to autoload drivers
    ath9k_hw: Fix STA (AR9485) bringup issue due to incorrect MAC address
    ...

    Linus Torvalds
     

29 Aug, 2011

1 commit

  • The current cgroup context switch code was incorrect, leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for
    the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
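
The penalty figures quoted in the message above follow directly from the context-switch rates it reports. A minimal sketch of that arithmetic, where the helper name `ctxsw_penalty` is an assumption made for illustration:

```python
def ctxsw_penalty(baseline: float, measured: float) -> float:
    """Return the slowdown of `measured` relative to `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0


# Rates (ctxsw/s) copied from the commit message.
before_patch = ctxsw_penalty(10684.51, 6674.61)   # cgroup event active, unpatched
after_patch = ctxsw_penalty(10775.30, 10687.80)   # cgroup event active, patched

print(f"before the patch: {before_patch:.1f}%")
print(f"after the patch:  {after_patch:.1f}%")
```

The first value lands between 37% and 38%, and the second stays well under 2%, consistent with the numbers stated in the message.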
     

27 Aug, 2011

3 commits

  • The nfsservctl system call is now gone, so we should remove all
    linkage for it.

    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • * 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
    omap-serial: Allow IXON and IXOFF to be disabled.
    TTY: serial, document ignoring of uart->ops->startup error
    TTY: pty, fix pty counting
    8250: Fix race condition in serial8250_backup_timeout().
    serial/8250_pci: delete duplicate data definition
    8250_pci: add support for Rosewill RC-305 4x serial port card
    tty: Add "spi:" prefix for spi modalias
    atmel_serial: fix atmel_default_console_device
    serial: 8250_pnp: add Intermec CV60 touchscreen device
    drivers/serial/ucc_uart.c: Fix compiler warning
    pch_uart: Set PCIe bus number using probe parameter
    serial: samsung: Fix build error

    Linus Torvalds
     
  • …t/gregkh/driver-core-2.6

    * 'driver-core-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    drivers:misc: ti-st: fix unexpected UART close
    drivers:misc: ti-st: free skb on firmware download
    drivers:misc: ti-st: wait for completion at fail
    drivers:misc: ti-st: reinit completion before send
    drivers:misc: ti-st: fail-safe on wrong pkt type
    drivers:misc: ti-st: reinit completion on ver read
    drivers:misc:ti-st: platform hooks for chip states
    drivers:misc: ti-st: avoid a misleading dbg msg
    base/devres.c: quiet sparse noise about context imbalance
    pti: add missing CONFIG_PCI dependency
    drivers/base/devtmpfs.c: correct annotation of `setup_done'
    driver core: fix kernel-doc warning in platform.c
    firmware: fix google/gsmi.c build warning

    Linus Torvalds
     

26 Aug, 2011

8 commits

  • …wireless into for-davem

    John W. Linville
     
  • We need a callback to do some things after pwm_enable, pwm_disable
    and pwm_config.

    Signed-off-by: Dilan Lee
    Reviewed-by: Robert Morell
    Reviewed-by: Arun Murthy
    Cc: Richard Purdie
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dilan Lee
     
  • Replace/remove use of RIO v.1.2 registers/bits that are not
    forward-compatible with newer versions of RapidIO specification.

    RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR,
    Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR.

    Use of removed (since RIO v.1.3) register bits affects users of
    currently available 1.3 and 2.x compliant devices who may use not so
    recent kernel versions.

    Removing checks for unsupported bits makes corresponding routines
    compatible with all versions of RapidIO specification. Therefore,
    backporting makes stable kernel versions compliant with RIO v.1.3 and
    later as well.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Chul Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Purely in-memory filesystems do not use the inode hash as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_sem.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
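
Why a single lock class produces the false positive can be shown with a toy lock-order graph. This sketch is not how lockdep is implemented; the class names (`i_mutex_dir`, `i_mutex_file`) and the cycle check are assumptions chosen to mirror the two orderings described in the message.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed lock-order graph contains a cycle."""
    graph = defaultdict(list)
    for held, acquired in edges:
        graph[held].append(acquired)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)          # every node starts WHITE (unvisited)

    def visit(node):
        colour[node] = GREY            # node is on the current DFS path
        for nxt in graph[node]:
            if colour[nxt] == GREY:    # back edge: ordering loops on itself
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK           # fully explored, no cycle through here
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


# One class for every inode: readdir and mmap orderings collide.
one_class = [
    ("i_mutex", "mmap_sem"),   # readdir: i_mutex held, page fault takes mmap_sem
    ("mmap_sem", "i_mutex"),   # mmap: mmap_sem held, then i_mutex
]

# Separate classes for directory and regular inodes: no loop.
two_classes = [
    ("i_mutex_dir", "mmap_sem"),
    ("mmap_sem", "i_mutex_file"),
]

print(has_cycle(one_class), has_cycle(two_classes))
```

With one shared class the graph contains a loop even though the two orderings never apply to the same inode; splitting the class removes the loop.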
     
  • * 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback:
    squeeze max-pause area and drop pass-good area

    Linus Torvalds
     
  • * '3.1-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (21 commits)
    target: Convert acl_node_lock to be IRQ-disabling
    target: Make locking in transport_deregister_session() IRQ safe
    tcm_fc: init/exit functions should not be protected by "#ifdef MODULE"
    target: Print subpage too for unhandled MODE SENSE pages
    iscsi-target: Fix iscsit_allocate_se_cmd_for_tmr failure path bugs
    iscsi-target: Implement iSCSI target IPv6 address printing.
    target: Fix task SGL chaining breakage with transport_allocate_data_tasks
    target: Fix task count > 1 handling breakage and use max_sector page alignment
    target: Add missing DATA_SG_IO transport_cmd_get_valid_sectors check
    target: Fix SYNCHRONIZE_CACHE zero LBA + range breakage
    target: Remove duplicate task completions in transport_emulate_control_cdb
    target: Fix WRITE_SAME usage with transport_get_size
    target: Add WRITE_SAME (10) parsing and refactor passthrough checks
    target: Fix write payload exception handling with ->new_cmd_map
    iscsi-target: forever loop bug in iscsit_attach_ooo_cmdsn()
    iscsi-target: remove duplicate return
    target: Convert target_core_rd.c to use use BUG_ON
    iscsi-target: Fix leak on failure in iscsi_copy_param_list()
    target: Use ERR_CAST inlined function
    target: Make standard INQUIRY return 'not connected' for tpg_virt_lun0
    ...

    Linus Torvalds
     
  • I ran into a couple of programs which broke with the new Linux 3.0
    version. Some of those were binary only. I tried to use LD_PRELOAD to
    work around it, but it was quite difficult and in one case impossible
    because of a mix of 32bit and 64bit executables.

    For example, all kinds of management software from HP doesn't work, unless
    we pretend to run a 2.6 kernel.

    $ uname -a
    Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

    $ hpacucli ctrl all show

    Error: No controllers detected.

    $ rpm -qf /usr/sbin/hpacucli
    hpacucli-8.75-12.0

    Another notable case is that Python now reports "linux3" from
    sys.platform, which in turn can break things that were checking
    sys.platform == "linux2":

    https://bugzilla.mozilla.org/show_bug.cgi?id=664564

    It seems pretty clear to me though it's a bug in the apps that are using
    '==' instead of .startswith(), but this allows us to unbreak broken
    programs.

    This patch adds a UNAME26 personality that makes the kernel report a
    2.6.40+x version number instead. The x is the x in 3.x.

    I know this is somewhat ugly, but I didn't find a better workaround, and
    compatibility to existing programs is important.

    Some programs also read /proc/sys/kernel/osrelease. This can be worked
    around in user space with mount --bind (and a mount namespace).

    To use:

    wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
    gcc -o uname26 uname26.c
    ./uname26 program

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
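
Two points from the message above lend themselves to a short sketch: the 3.x to 2.6.(40+x) mapping, and the prefix test that avoids the `== "linux2"` breakage. Both functions are hypothetical helpers written for illustration; in particular, dropping everything after the minor number is a simplifying assumption, not a description of what the kernel reports.

```python
import re


def uname26_version(release: str) -> str:
    """Map a 3.x release string to the 2.6.(40+x) form described above.

    Assumption: only the major and minor numbers are considered; any
    suffix such as "-rc1" is dropped in this simplified model.
    """
    match = re.match(r"3\.(\d+)", release)
    if match is None:
        raise ValueError(f"not a 3.x release: {release!r}")
    return f"2.6.{40 + int(match.group(1))}"


def is_linux(platform_name: str) -> bool:
    """Prefix test that accepts "linux2" and "linux3" alike."""
    return platform_name.startswith("linux")


print(uname26_version("3.0.0-08107-g97cd98f"))
print(is_linux("linux3"), "linux3" == "linux2")
```

The prefix test keeps working when the suffix changes, which is the fix the message suggests applications should have used in the first place.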
     

25 Aug, 2011

1 commit


24 Aug, 2011

1 commit

  • tty_operations->remove is normally called like:
    queue_release_one_tty
    ->tty_shutdown
    ->tty_driver_remove_tty
    ->tty_operations->remove

    However tty_shutdown() is called from queue_release_one_tty() only if
    tty_operations->shutdown is NULL. But for pty, it is not.
    pty_unix98_shutdown() is used there as ->shutdown.

    So tty_operations->remove of pty (i.e. pty_unix98_remove()) is never
    called. This results in invalid pty_count. I.e. what can be seen in
    /proc/sys/kernel/pty/nr.

    I see this was already reported at:
    https://lkml.org/lkml/2009/11/5/370
    But it was not fixed since then.

    This patch is a somewhat hackish fix. The problem lies in ->install. We
    allocate there another tty (so-called tty->link). So ->install is
    called once, but ->remove twice, for both tty and tty->link. The fix
    here is to count both tty and tty->link and divide the count by 2 for
    user.

    And to have ->remove called, let's make tty_driver_remove_tty() global
    and call that from pty_unix98_shutdown() (tty_operations->shutdown).

    While at it, let's document that when ->shutdown is defined,
    tty_shutdown() is not called.

    Signed-off-by: Jiri Slaby
    Cc: Alan Cox
    Cc: "H. Peter Anvin"
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    Jiri Slaby
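
The counting scheme described above (count both ends, halve for the user) can be modelled in a few lines. The `PtyCounter` class and its method names are hypothetical; this is a sketch of the bookkeeping idea, not the driver code.

```python
class PtyCounter:
    """Toy model of the pty count: raw units are pty ends, not pairs."""

    def __init__(self):
        self.raw = 0

    def install(self):
        # One ->install call sets up both ends (tty and tty->link).
        self.raw += 2

    def remove(self):
        # ->remove runs once per end, so each call drops a single unit.
        self.raw -= 1

    def reported(self):
        # Value shown to the user: number of pairs.
        return self.raw // 2


counter = PtyCounter()
counter.install()
counter.install()
print(counter.reported())   # two pairs open

counter.remove()
counter.remove()
print(counter.reported())   # one pair left after both of its ends are removed
```

Counting ends rather than pairs keeps the install and remove paths symmetric even though they are called a different number of times.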
     

23 Aug, 2011

3 commits

  • Certain platform specific or Host-WiLink Interface specific actions would be
    required to be taken when the chip is being enabled and after the chip is
    disabled such as configuration of the mux modes for the GPIO of host connected
    to the nshutdown of the chip or relinquishing UART after the chip is disabled.

    Similar actions can also be taken when the chip is in deep sleep or when the
    chip is awake. Performance enhancements such as configuring the host to run
    faster when chip is awake and slower when chip is asleep can also be made
    here.

    Signed-off-by: Pavan Savoy
    Signed-off-by: Greg Kroah-Hartman

    Pavan Savoy
     
  • This patch changes target_emulate_inquiry_std() to set the 'not connected'
    (0x35) bit in standard INQUIRY response data when we are processing a
    request to a virtual LUN=0 mapping from struct se_device *g_lun0_dev that
    has been set up for us in transport_lookup_cmd_lun().

    This addresses an issue where qla2xxx FC clients need to be able
    to create demo-mode I_T FC Nexuses by default, but should not be
    exposing the default set of TPG LUNs to all FC clients. This includes
    adding a new optional target_core_fabric_ops->tpg_check_demo_mode_login_only()
    caller to allow demo_mode nexuses to skip the old default of building
    a demo-mode MappedLUNs list via core_tpg_add_node_to_devs().

    (roland: Add missing tpg_check_demo_mode_login_only check in core_dev_add_lun)

    Reported-by: Roland Dreier
    Cc: Andrew Vasquez
    Signed-off-by: Nicholas Bellinger

    Nicholas Bellinger
     
  • Do not call ->suspend, ->resume methods after we unregister wiphy. Also
    delete sta_cleanup timer after we finish wiphy unregister to avoid this:

    WARNING: at lib/debugobjects.c:262 debug_print_object+0x85/0xa0()
    Hardware name: 6369CTO
    ODEBUG: free active (active state 0) object type: timer_list hint: sta_info_cleanup+0x0/0x180 [mac80211]
    Modules linked in: aes_i586 aes_generic fuse bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq mperf ext2 dm_mod uinput thinkpad_acpi hwmon sg arc4 rt2800usb rt2800lib crc_ccitt rt2x00usb rt2x00lib mac80211 cfg80211 i2c_i801 iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom yenta_socket ahci libahci pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit video [last unloaded: microcode]
    Pid: 5663, comm: pm-hibernate Not tainted 3.1.0-rc1-wl+ #19
    Call Trace:
    [] warn_slowpath_common+0x6d/0xa0
    [] ? debug_print_object+0x85/0xa0
    [] ? debug_print_object+0x85/0xa0
    [] warn_slowpath_fmt+0x2e/0x30
    [] debug_print_object+0x85/0xa0
    [] ? sta_info_alloc+0x1a0/0x1a0 [mac80211]
    [] debug_check_no_obj_freed+0xe2/0x180
    [] kfree+0x8b/0x150
    [] cfg80211_dev_free+0x7e/0x90 [cfg80211]
    [] wiphy_dev_release+0xd/0x10 [cfg80211]
    [] device_release+0x19/0x80
    [] kobject_release+0x7a/0x1c0
    [] ? rtnl_unlock+0x8/0x10
    [] ? wiphy_resume+0x6b/0x80 [cfg80211]
    [] ? kobject_del+0x30/0x30
    [] kref_put+0x2d/0x60
    [] kobject_put+0x1d/0x50
    [] ? mutex_lock+0x14/0x40
    [] put_device+0xf/0x20
    [] dpm_resume+0xca/0x160
    [] hibernation_snapshot+0xcd/0x260
    [] ? freeze_processes+0x3f/0x90
    [] hibernate+0xcb/0x1e0
    [] ? pm_async_store+0x40/0x40
    [] state_store+0xa0/0xb0
    [] ? pm_async_store+0x40/0x40
    [] kobj_attr_store+0x20/0x30
    [] sysfs_write_file+0x94/0xf0
    [] vfs_write+0x9a/0x160
    [] ? sysfs_open_file+0x200/0x200
    [] sys_write+0x3d/0x70
    [] sysenter_do_call+0x12/0x28

    Cc: stable@kernel.org
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: John W. Linville

    Stanislaw Gruszka
     

20 Aug, 2011

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
    Revert "cfq: Remove special treatment for metadata rqs."
    block: fix flush machinery for stacking drivers with differring flush flags
    block: improve rq_affinity placement
    blktrace: add FLUSH/FUA support
    Move some REQ flags to the common bio/request area
    allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
    xen/blkback: Make description more obvious.
    cfq-iosched: Add documentation about idling
    block: Make rq_affinity = 1 work as expected
    block: swim3: fix unterminated of_device_id table
    block/genhd.c: remove useless cast in diskstats_show()
    drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
    drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
    bsg-lib: add module.h include
    cfq-iosched: Reduce linked group count upon group destruction
    blk-throttle: correctly determine sync bio
    loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
    loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
    loop: add management interface for on-demand device allocation
    loop: replace linked list of allocated devices with an idr index
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
    PCI: OF: Don't crash when bridge parent is NULL.
    PCI: export pcie_bus_configure_settings symbol
    PCI: code and comments cleanup
    PCI: make cardbus-bridge resources optional
    PCI: make SRIOV resources optional
    PCI : ability to relocate assigned pci-resources
    PCI: honor child buses add_size in hot plug configuration
    PCI: Set PCI-E Max Payload Size on fabric

    Linus Torvalds
     

19 Aug, 2011

1 commit

  • Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
    introduce max-pause and pass-good dirty limits") and make the max-pause
    area smaller and safe.

    This fixes ~30% performance regression in the ext3 data=writeback
    fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
    12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

    Using the deadline scheduler also shows a regression, but not as big as
    with CFQ, so this suggests we have some write starvation.

    The test logs show that

    - the disks are sometimes underutilized

    - global dirty pages sometimes rush high to the pass-good area for
    several hundred seconds, while in the meantime some bdi dirty pages
    drop to a very low value (bdi_dirty << bdi_thresh). Then suddenly the
    global dirty pages drop under the global dirty threshold and bdi_dirty
    rushes very high (for example, 2 times higher than bdi_thresh), during
    which time balance_dirty_pages() is not called at all.

    So the problems are

    1) The random writes progress so slowly that they break the assumption of
    the max-pause logic that "8 pages per 200ms is typically more than
    enough to curb heavy dirtiers".

    2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility
    for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
    and then (bdi_thresh >> bdi_dirty) for others.

    3) The higher max-pause/pass-good thresholds somehow lead to the bad
    swing of dirty pages.

    The fix is to allow the task to slightly dirty over task_bdi_thresh, but
    no way to exceed bdi_dirty and/or global dirty_thresh.

    Tests show that it fixed the JBOD regression completely (both behavior
    and performance), while still being able to cut down large pause times
    in balance_dirty_pages() for single-disk cases.

    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

18 Aug, 2011

5 commits


16 Aug, 2011

1 commit

  • Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
    FLUSH/FUA to support merge, introduced a performance regression when
    running any sort of fsyncing workload using dm-multipath and certain
    storage (in our case, an HP EVA). The test I ran was fs_mark, and it
    dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
    that dm-multipath always advertised flush+fua support, and passed
    commands on down the stack, where those flags used to get stripped off.
    The above commit changed that behavior:

     static inline struct request *__elv_next_request(struct request_queue *q)
     {
             struct request *rq;

             while (1) {
    -                while (!list_empty(&q->queue_head)) {
    +                if (!list_empty(&q->queue_head)) {
                             rq = list_entry_rq(q->queue_head.next);
    -                        if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
    -                            (rq->cmd_flags & REQ_FLUSH_SEQ))
    -                                return rq;
    -                        rq = blk_do_flush(q, rq);
    -                        if (rq)
    -                                return rq;
    +                        return rq;
                     }

    Note that previously, a command would come in here, have
    REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

    struct request *blk_do_flush(struct request_queue *q, struct request *rq)
    {
            unsigned int fflags = q->flush_flags; /* may change, cache it */
            bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
            bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
            bool do_postflush = has_flush && !has_fua &&
                    (rq->cmd_flags & REQ_FUA);
            unsigned skip = 0;
            ...
            if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                    rq->cmd_flags &= ~REQ_FLUSH;
                    if (!has_fua)
                            rq->cmd_flags &= ~REQ_FUA;
                    return rq;
            }

    So, the flush machinery was bypassed in such cases (q->flush_flags == 0
    && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

    Now, however, we don't get into the flush machinery at all. Instead,
    __elv_next_request just hands a request with flush and fua bits set to
    the scsi_request_fn, even if the underlying request_queue does not
    support flush or fua.

    The agreed upon approach is to fix the flush machinery to allow
    stacking. While this isn't used in practice (since there is only one
    request-based dm target, and that target will now reflect the flush
    flags of the underlying device), it does future-proof the solution, and
    make it function as designed.

    In order to make this work, I had to add a field to the struct request,
    inside the flush structure (to store the original req->end_io). Shaohua
    had suggested overloading the union with rb_node and completion_data,
    but the completion data is used by device mapper and can also be used by
    other drivers. So, I didn't see a way around the additional field.

    I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
    the lost performance. Comments and other testers, as always, are
    appreciated.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jeff Moyer
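
The flag-stripping branch of the old blk_do_flush() quoted above can be modelled outside the kernel to see what was lost. This is a sketch under stated assumptions: the bit values and the function name `strip_unsupported_flush_bits` are hypothetical, and only the quoted early-return branch is modelled.

```python
# Hypothetical bit values; only their distinctness matters here.
REQ_FLUSH = 1 << 0
REQ_FUA = 1 << 1


def strip_unsupported_flush_bits(queue_flags: int, cmd_flags: int, sectors: int) -> int:
    """Mirror the quoted early-return branch: clear flush bits the queue cannot honour."""
    has_flush = bool(queue_flags & REQ_FLUSH)
    has_fua = bool(queue_flags & REQ_FUA)
    do_preflush = has_flush and bool(cmd_flags & REQ_FLUSH)
    do_postflush = has_flush and not has_fua and bool(cmd_flags & REQ_FUA)

    # A data request that needs neither a pre-flush nor a post-flush
    # bypasses the flush machinery with the unsupported bits removed.
    if sectors and not do_preflush and not do_postflush:
        cmd_flags &= ~REQ_FLUSH
        if not has_fua:
            cmd_flags &= ~REQ_FUA
    return cmd_flags


# Queue advertises nothing (flush_flags == 0): both bits are stripped.
print(strip_unsupported_flush_bits(0, REQ_FLUSH | REQ_FUA, sectors=8))
```

The case printed here is the one the message singles out: a queue with no flush support receiving a request that carries both bits. Without this step the request reaches the driver with the bits still set.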
     

15 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
    mmc: remove unused "ddr" parameter in struct mmc_ios
    mmc: dw_mmc: Fix DDR mode support.
    mmc: core: use defined R1_STATE_PRG macro for card status
    mmc: sdhci: use f_max instead of host->clock for timeouts
    mmc: sdhci: move timeout_clk calculation farther down
    mmc: sdhci: check host->clock before using it as a denominator
    mmc: Revert "mmc: sdhci: Fix SDHCI_QUIRK_TIMEOUT_USES_SDCLK"
    mmc: tmio: eliminate unused variable 'mmc' warning
    mmc: esdhc-imx: fix card interrupt loss on freescale eSDHC
    mmc: sdhci-s3c: Fix build for header change
    mmc: dw_mmc: Fix mask in IDMAC_SET_BUFFER1_SIZE macro
    mmc: cb710: fix possible pci_dev leak in cb710_pci_configure()
    mmc: core: Detect eMMC v4.5 ext_csd entries
    mmc: mmc_test: avoid stalled file in debugfs
    mmc: sdhci-s3c: add BROKEN_ADMA_ZEROLEN_DESC quirk
    mmc: sdhci: pxav3: controller needs 32 bit ADMA addressing
    mmc: sdhci: fix retuning timer wrongly deleted in sdhci_tasklet_finish

    Linus Torvalds
     

14 Aug, 2011

1 commit