15 Jan, 2012

2 commits


14 Jan, 2012

11 commits

  • * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    dma-buf: Documentation update for Kconfig select
    nouveau: Support Optimus models for vga_switcheroo
    nouveau: properly check for _DSM function support
    dma-buf: drop option text so users don't select it.
    radeon: Call pci_clear_master() instead of open-coding it.
    gma500: Discard modes that don't fit in stolen memory
    drm: bump DRM_CONNECTOR_MAX_ENCODER from 2 to 3
    drm/radeon/kms: Fix module parameter description format
    drm/radeon/kms/ni: fix packet2 handling for VM IB parser
    ttm/dma: Remove the WARN() which is not useful.

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (59 commits)
    rtc: max8925: Add function to work as wakeup source
    mfd: Add pm ops to max8925
    mfd: Convert aat2870 to dev_pm_ops
    mfd: Still check other interrupts if we get a wm831x touchscreen IRQ
    mfd: Introduce missing kfree in 88pm860x probe routine
    mfd: Add S5M series configuration
    mfd: Add s5m series irq driver
    mfd: Add S5M core driver
    mfd: Improve mc13xxx dt binding document
    mfd: Fix stmpe section mismatch
    mfd: Fix stmpe build warning
    mfd: Fix STMPE I2c build failure
    mfd: Constify aat2870-core i2c_device_id table
    gpio: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 610
    mfd: Add support for STMPE SPI interface
    mfd: Separate out STMPE controller and interface specific code
    misc: Remove max8997-muic sysfs attributes
    mfd: Remove unused wm831x_irq_data_to_mask_reg()
    ...

    Fix up trivial conflict in drivers/leds/Kconfig due to addition of
    LEDS_MAX8997 and LEDS_TCA6507 next to each other.

    Linus Torvalds
     
  • MMC highlights for 3.3:

    Core:
    * Support for the HS200 high-speed eMMC mode.
    * Support SDIO 3.0 Ultra High Speed cards.
    * Kill pending block requests immediately if card is removed.
    * Enable the eMMC feature for locking boot partitions read-only
    until next power on, exposed via sysfs.

    Drivers:
    * Runtime PM support for Intel Medfield SDIO.
    * Suspend/resume support for sdhci-spear.
    * sh-mmcif now processes requests asynchronously.

    * tag 'mmc-merge-for-3.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (58 commits)
    mmc: fix a deadlock between system suspend and MMC block IO
    mmc: sdhci: restore the enabled dma when do reset all
    mmc: dw_mmc: miscaculated the fifo-depth with wrong bit operation
    mmc: host: Adds support for eMMC 4.5 HS200 mode
    mmc: core: HS200 mode support for eMMC 4.5
    mmc: dw_mmc: fixed wrong bit operation for SDMMC_GET_FCNT()
    mmc: core: Separate the timeout value for cache-ctrl
    mmc: sdhci-spear: Fix compilation error
    mmc: sdhci: Deal with failure case in sdhci_suspend_host
    mmc: dw_mmc: Clear the DDR mode for non-DDR
    mmc: sd: Fix SDR12 timing regression
    mmc: sdhci: Fix tuning timer incorrect setting when suspending host
    mmc: core: Add option to prevent eMMC sleep command
    mmc: omap_hsmmc: use threaded irq handler for card-detect.
    mmc: sdhci-pci: enable runtime PM for Medfield SDIO
    mmc: sdhci: Always pass clock request value zero to set_clock host op
    mmc: sdhci-pci: remove SDHCI_QUIRK2_OWN_CARD_DETECTION
    mmc: sdhci-pci: get gpio numbers from platform data
    mmc: sdhci-pci: add platform data
    mmc: sdhci: prevent card detection activity for non-removable cards
    ...

    Linus Torvalds
     
  • * 'wire-accept4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    ia64: Add accept4() syscall

    Linus Torvalds
     
  • Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
    allocated in a batch while the first iocbs are processed. All iocbs in a
    batch are automatically added to the ctx->active_reqs list and accounted
    in ctx->reqs_active.

    If one of the iocbs submitted by a user fails (and it is not the last
    one), further iocbs are not processed, but they are still present in
    ctx->active_reqs and accounted in ctx->reqs_active. This causes the
    process to get stuck in D state in wait_for_all_aios() on exit, since
    ctx->reqs_active will never go down to zero. Furthermore, since
    kiocb_batch_free() frees an iocb without removing it from the
    active_reqs list, the list becomes corrupted, which may cause an oops.

    Fix this by removing iocb from ctx->active_reqs and updating
    ctx->reqs_active in kiocb_batch_free().

    Signed-off-by: Gleb Natapov
    Reviewed-by: Jeff Moyer
    Cc: stable@kernel.org # 3.2
    Signed-off-by: Linus Torvalds

    Gleb Natapov
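
A minimal user-space sketch of the accounting described in the aio fix above. The types `req` and `ctx` and the helpers `batch_add`, `batch_take` and `batch_free` are hypothetical stand-ins for the kernel's kiocb handling, not the actual patch. The sketch only illustrates why leftover batch entries have to be unlinked from the active list and subtracted from the counter when the batch is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's kiocb and kioctx. */
struct req {
    struct req *prev, *next;   /* links on the context's active list */
    struct req *batch_next;    /* link on the per-submit batch */
};

struct ctx {
    struct req active;         /* circular list head */
    int reqs_active;           /* number of requests on the active list */
};

static void ctx_init(struct ctx *c)
{
    c->active.prev = c->active.next = &c->active;
    c->active.batch_next = NULL;
    c->reqs_active = 0;
}

/* Batch allocation: the request is made active immediately. */
static void batch_add(struct ctx *c, struct req **batch, struct req *r)
{
    r->prev = c->active.prev;
    r->next = &c->active;
    c->active.prev->next = r;
    c->active.prev = r;
    c->reqs_active++;
    r->batch_next = *batch;
    *batch = r;
}

/* Take one request out of the batch for submission (NULL if empty). */
static struct req *batch_take(struct req **batch)
{
    struct req *r = *batch;

    if (r)
        *batch = r->batch_next;
    return r;
}

/* Drop unused batch entries: unlink each one and fix the counter. */
static void batch_free(struct ctx *c, struct req **batch)
{
    struct req *r;

    while ((r = batch_take(batch)) != NULL) {
        assert(c->reqs_active > 0);
        r->prev->next = r->next;
        r->next->prev = r->prev;
        c->reqs_active--;
    }
}
```

Without the unlink and decrement in `batch_free`, the counter in this model would stay above the number of requests that were really submitted, which is the condition that keeps the exit path waiting forever in the commit message.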
     
  • Commit 8a25a2fd126c ("cpu: convert 'cpu' and 'machinecheck' sysdev_class
    to a regular subsystem") changed how things are dealt with in the MCE
    subsystem. Some of the things that got broken due to this are CPU
    hotplug and suspend/hibernate.

    MCE uses per_cpu allocations of struct device. So, when a CPU goes
    offline and comes back online, in order to ensure that we start from a
    clean slate with respect to the MCE subsystem, zero out the entire
    per_cpu device structure before using it.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
    Squashfs: fix i_blocks calculation with extended regular files
    Squashfs: fix mount time sanity check for corrupted superblock
    Squashfs: optimise squashfs_cache_get entry search
    Squashfs: Update documentation to include xattrs
    Squashfs: add missing block release on error condition

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Fix nlink setting on inode creation
    GFS2: fail mount if journal recovery fails
    GFS2: let spectator mount do read only recovery
    GFS2: Fix a use-after-free that coverity spotted
    GFS2: dlm based recovery coordination

    Linus Torvalds
     
  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBIFS: fix key printing
    UBIFS: use snprintf instead of sprintf when printing keys
    UBIFS: fix debugging messages
    UBIFS: make debugging messages light again
    UBI: fix debugging messages
    UBI: make vid_hdr non-static

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     
  • I don't know how I missed this obvious mistake when I
    reviewed Al's patches, sorry.

    [ Quoting Al:

    Grr... Note to self: do git status *and* git stash show -p
    before git push. Nothing like "WTF? I'd fixed that braino"
    feeling ;-/

    Al sent the same patch - it got broken in commit d668dc56631d:
    "autofs4: deal with autofs4_write/autofs4_write races". ]

    Reported-and-tested-by: Dave Airlie
    Signed-off-by: Ian Kent
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ian Kent
     

13 Jan, 2012

27 commits

  • Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we had locking
    around all printing macros and we could use static buffers for creating
    key strings and printing them. However, now we do not have that locking and
    we cannot use static buffers. This commit removes the old DBGKEY() macros
    and introduces a few new helper macros for printing debugging messages plus
    a key at the end. Thankfully, all the messages are already structured in
    such a way that the key is printed at the end.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • Switch to 'snprintf()' which is more secure and reliable. This is also
    preparation for the subsequent key printing fixes.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
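
A small sketch of the pattern the two UBIFS key-printing messages point at: format into a buffer supplied by the caller, bounded with snprintf(), instead of a shared static buffer. The `struct key` layout, `KEY_BUF_LEN` and `key_to_str` are hypothetical names chosen for illustration and are not the UBIFS helpers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical key layout, for illustration only. */
struct key {
    unsigned long inum;
    unsigned int type;
    unsigned int hash;
};

#define KEY_BUF_LEN 48

/*
 * Format a key into a caller-supplied buffer. snprintf() never writes
 * more than len bytes and always NUL-terminates when len > 0, so a
 * buffer that is too small truncates the text instead of overflowing.
 */
static const char *key_to_str(const struct key *k, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "(%lu, %u, %u)", k->inum, k->type, k->hash);
    return buf;
}
```

Because every caller owns its buffer (typically on its own stack), two contexts printing keys at the same time cannot overwrite each other's string, which is the property the static buffers lost once the locking was removed.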
     
  • As per Linus' comment, dma-buf Kconfig entry shouldn't have an option
    text, but should be selected by the subsystems that use it.

    Add this information in the documentation as well.

    Signed-off-by: Sumit Semwal
    Signed-off-by: Dave Airlie

    Sumit Semwal
     
  • Newer nVidia cards with Optimus do not support/use the DSM switching functions.
    Instead, they require a DSM function to be called prior to bringing a device into
    D3 state. No other _DSM calls are necessary before/after enabling/disabling a
    device. Switching between discrete and integrated GPU is not supported by
    this Optimus _DSM call, so return early in the switching method.

    Signed-off-by: Peter Lekensteyn
    Signed-off-by: Dave Airlie

    Peter Lekensteyn
     
  • According to the ACPI spec version 4, section 9.14.1, _DSM functions
    must return a value with the first bit enabled if any DSM functions are
    supported for the given UUID and revision ID. For a given function index n
    to be marked supported, bit n must be enabled.

    Signed-off-by: Peter Lekensteyn
    Signed-off-by: Dave Airlie

    Peter Lekensteyn
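
The rule quoted from the ACPI spec in the message above can be sketched as a bit test. This is an illustration only: `dsm_func_supported` is a hypothetical helper, and `mask` is assumed to be the value already obtained from the _DSM query for the given UUID and revision ID. It is not the nouveau code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * mask: value returned by the _DSM support query (assumed input).
 * n:    function index to test.
 *
 * Bit 0 must be set if any function is supported at all; function n is
 * supported only if bit n is set as well.
 */
static bool dsm_func_supported(uint64_t mask, unsigned int n)
{
    assert(n < 64);
    if (!(mask & 1))
        return false;          /* nothing supported for this UUID/revision */
    return ((mask >> n) & 1) != 0;
}
```

For example, a mask of 0x5 reports function 2 as supported, while 0x4 does not, because bit 0 is clear.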
     
  • This is going to be used by other subsystems so they should select it.

    Signed-off-by: Dave Airlie

    Dave Airlie
     
  • Reported-by: Ben Hutchings
    Signed-off-by: Michel Dänzer
    Reviewed-by: Alex Deucher
    Signed-off-by: Dave Airlie

    Michel Dänzer
     
  • [This fixes a crash on boot if the system is plugged into an HDTV so it's
    probably appropriate to push even though it didn't make the window. We could
    be cleverer about this but the simple version seems to be the safe one]

    From: Patrik Jakobsson

    At the moment we cannot allocate more than stolen memory size for framebuffers.
    To get around that issue we discard modes that don't fit. This is a temporary
    solution until we can freely allocate framebuffer memory.

    [Currently the framebuffer needs to be linear in kernel space due to limits
    in the kernel fb layer - AC]

    Signed-off-by: Patrik Jakobsson
    Signed-off-by: Alan Cox
    Signed-off-by: Dave Airlie

    Alan Cox
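
A rough sketch of the kind of check the gma500 message describes: reject a mode whose framebuffer would not fit in stolen memory. The helper `mode_fits` and its parameters are hypothetical, and the size formula (width times height times bytes per pixel, with no pitch alignment) is an assumption made for illustration, not the driver's actual calculation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed model: a linear framebuffer needs
 * hdisplay * vdisplay * bytes_per_pixel bytes, and all of it must come
 * out of the stolen memory region.
 */
static bool mode_fits(uint32_t hdisplay, uint32_t vdisplay,
                      uint32_t bytes_per_pixel, uint64_t stolen_size)
{
    uint64_t needed;

    assert(bytes_per_pixel > 0);
    /* Widen before multiplying so large modes cannot overflow 32 bits. */
    needed = (uint64_t)hdisplay * vdisplay * bytes_per_pixel;
    return needed <= stolen_size;
}
```

Under these assumptions a 1920x1080 mode at 4 bytes per pixel needs 8294400 bytes, so it would be kept with 8 MiB of stolen memory and discarded with 4 MiB.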
     
  • There exists at least one NVIDIA GPU (Quadro NVS 300) that has a DMS-59
    connector which is capable of supporting DisplayPort, TMDS and VGA on
    a single connector.

    We need to bump the allowed encoder limit to support all three configs.

    Signed-off-by: Ben Skeggs
    Signed-off-by: Dave Airlie

    Ben Skeggs
     
  • Module parameter descriptions don't take a trailing \n, otherwise it
    breaks formatting of modinfo's output. Also add missing space after
    comma.

    Signed-off-by: Jean Delvare
    Cc: David Airlie
    Reviewed-by: Alex Deucher
    Cc: Jerome Glisse
    Signed-off-by: Dave Airlie

    Jean Delvare
     
  • Packet2 is only one dword.

    Signed-off-by: Alex Deucher
    Reviewed-by: Jerome Glisse
    Signed-off-by: Dave Airlie

    Alex Deucher
     
  • The WARN() was useful during development, but now on a production system
    we can get this (if the user forgot to upload the firmware):

    [drm] radeon: irq initialized.
    [drm] GART: num cpu pages 131072, num gpu pages 131072
    [drm] radeon: ib pool ready.
    [drm] Loading SUMO Microcode
    r600_cp: Failed to load firmware "radeon/SUMO_pfp.bin"
    atl1c 0000:03:00.0: version 1.0.1.0-NAPI
    [drm:evergreen_startup] *ERROR* Failed to load firmware!
    radeon 0000:00:01.0: disabling GPU acceleration
    radeon 0000:00:01.0: ffff8801bb782400 unpin not necessary
    ------------[ cut here ]------------
    WARNING: at /home/konrad/linux-linus/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:956 ttm_dma_unpopulate+0x79/0x300 [ttm]()
    Hardware name: System Product Name
    Modules linked in: e1000e atl1c radeon(+) ahci libahci libata scsi_mod fbcon tileblit font ttm bitblit softcursor drm_kms_helper wmi xen_blkfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea xenfs xen_privcmd
    Pid: 1600, comm: modprobe Not tainted 3.2.0-06100-ge343a89 #1
    Call Trace:
    [] warn_slowpath_common+0x7a/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] ttm_dma_unpopulate+0x79/0x300 [ttm]
    [] radeon_ttm_tt_unpopulate+0x120/0x130 [radeon]
    [] ttm_tt_destroy+0x2c/0x70 [ttm]
    [] ttm_bo_cleanup_memtype_use+0x3e/0x80 [ttm]
    [] ttm_bo_release+0x251/0x280 [ttm]
    [] ttm_bo_unref+0x40/0x60 [ttm]
    [] radeon_bo_unref+0x42/0x80 [radeon]
    [] radeon_sa_bo_manager_fini+0x6b/0x80 [radeon]
    [] radeon_ib_pool_fini+0x6f/0x90 [radeon]
    [] r100_ib_fini+0x19/0x20 [radeon]
    [] evergreen_init+0x1ee/0x2d0 [radeon]

    The big WARN() has nothing to do with the culprit - which is that
    the firmware was not loaded. So let's remove the WARN() from the TTM DMA code.

    Signed-off-by: Konrad Rzeszutek Wilk
    Reviewed-by: Jerome Glisse
    Signed-off-by: Dave Airlie

    Konrad Rzeszutek Wilk
     
  • Andrew explains:

    - various misc stuff

    - Most of the rest of MM: memcg, threaded hugepages, others.

    - cpumask

    - kexec

    - kdump

    - some direct-io performance tweaking

    - radix-tree optimisations

    - new selftests code

    A note on this: often people will develop a new userspace-visible
    feature and will develop userspace code to exercise/test that
    feature. Then they merge the patch and the selftest code dies.
    Sometimes we paste it into the changelog. Sometimes the code gets
    thrown into Documentation/(!).

    This saddens me. So this patch creates a bare-bones framework which
    will henceforth allow me to ask people to include their test apps in
    the kernel tree so we can keep them alive. Then when people enhance
    or fix the feature, I can ask them to update the test app too.

    The infrastructure is terribly trivial at present - let's see how it
    evolves.

    - checkpoint/restart feature work.

    A note on this: this is a project by various mad Russians to perform
    c/r mainly from userspace, with various oddball helper code added
    into the kernel where the need is demonstrated.

    So rather than some large central lump of code, what we have is
    little bits and pieces popping up in various places which either
    expose something new or which permit something which is normally
    kernel-private to be modified.

    The overall project is an ongoing thing. I've judged that the size
    and scope of the thing means that we're more likely to be successful
    with it if we integrate the support into mainline piecemeal rather
    than allowing it all to develop out-of-tree.

    However I'm less confident than the developers that it will all
    eventually work! So what I'm asking them to do is to wrap each piece
    of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
    eventually comes to tears and the project as a whole fails, it should
    be a simple matter to go through and delete all trace of it.

    This lot pretty much wraps up the -rc1 merge for me.

    * akpm: (96 commits)
    unlzo: fix input buffer free
    ramoops: update parameters only after successful init
    ramoops: fix use of rounddown_pow_of_two()
    c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
    c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
    c/r: introduce CHECKPOINT_RESTORE symbol
    selftests: new x86 breakpoints selftest
    selftests: new very basic kernel selftests directory
    radix_tree: take radix_tree_path off stack
    radix_tree: remove radix_tree_indirect_to_ptr()
    dio: optimize cache misses in the submission path
    vfs: cache request_queue in struct block_device
    fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
    drivers/parport/parport_pc.c: fix warnings
    panic: don't print redundant backtraces on oops
    sysctl: add the kernel.ns_last_pid control
    kdump: add udev events for memory online/offline
    include/linux/crash_dump.h needs elf.h
    kdump: fix crash_kexec()/smp_send_stop() race in panic()
    kdump: crashk_res init check for /sys/kernel/kexec_crash_size
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
    pptp: Accept packet with seq zero
    RDS: Remove some unused iWARP code
    net: fsl: fec: handle 10Mbps speed in RMII mode
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
    drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
    ksz884x: fix mtu for VLAN
    net_sched: sfq: add optional RED on top of SFQ
    dp83640: Fix NOHZ local_softirq_pending 08 warning
    gianfar: Fix invalid TX frames returned on error queue when time stamping
    gianfar: Fix missing sock reference when processing TX time stamps
    phylib: introduce mdiobus_alloc_size()
    net: decrement memcg jump label when limit, not usage, is changed
    net: reintroduce missing rcu_assign_pointer() calls
    inet_diag: Rename inet_diag_req_compat into inet_diag_req
    inet_diag: Rename inet_diag_req into inet_diag_req_v2
    bond_alb: don't disable softirq under bond_alb_xmit
    mac80211: fix rx->key NULL pointer dereference in promiscuous mode
    nl80211: fix old station flags compatibility
    mdio-octeon: use an unique MDIO bus name.
    mdio-gpio: use an unique MDIO bus name.
    ...

    Linus Torvalds
     
  • unlzo modifies the pointer to in_buf, so we have to free the original
    buffer, not the modified pointer.

    Signed-off-by: Sascha Hauer
    Cc: Lasse Collin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sascha Hauer
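
The unlzo bug class described above can be shown in user space: once a buffer pointer is advanced while consuming input, it no longer points at the start of the allocation, so the original pointer has to be kept for the free. The function `sum_after_header` and the name `in_buf_save` are hypothetical and only illustrate the pattern; this is not the decompressor code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy src into a heap buffer, skip header_len bytes, and sum the rest.
 * Returns -1 if the allocation fails.
 */
static long sum_after_header(const unsigned char *src, size_t len,
                             size_t header_len)
{
    unsigned char *in_buf, *in_buf_save;
    long sum = 0;

    assert(header_len <= len);
    in_buf = malloc(len ? len : 1);
    if (!in_buf)
        return -1;
    in_buf_save = in_buf;          /* remember the allocation start */
    memcpy(in_buf, src, len);

    in_buf += header_len;          /* in_buf now points inside the buffer */
    len -= header_len;
    while (len--)
        sum += *in_buf++;

    free(in_buf_save);             /* free(in_buf) here would be invalid */
    return sum;
}
```

Passing the advanced pointer to free() is undefined behaviour in user space; the saved copy avoids that without changing how the input is consumed.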
     
  • If a platform device exists on the system, but ramoops fails to attach to
    it, the module parameters are overridden before ramoops can fall back and
    try to use passed module parameters. Move update to end of init routine.

    Signed-off-by: Kees Cook
    Cc: Marco Stornelli
    Cc: Sergiu Iordache
    Cc: Seiji Aguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The return value of rounddown_pow_of_two wasn't evaluated, so the
    operation was a no-op.

    Signed-off-by: Marco Stornelli
    Reported-by: Andrew Morton
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Stornelli
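
A user-space illustration of the mistake described above. `rounddown_pow2` is a hypothetical reimplementation written for this sketch (the kernel's rounddown_pow_of_two() is not available outside the kernel), and `buggy_size` and `fixed_size` are made-up names; the ramoops variables involved are not shown in the message.

```c
#include <assert.h>

/* Largest power of two that is <= n, for n > 0. */
static unsigned long rounddown_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n != 0);
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* The value is computed and thrown away, so size is unchanged. */
static unsigned long buggy_size(unsigned long size)
{
    rounddown_pow2(size);
    return size;
}

/* The result is assigned back, which is what the call was meant to do. */
static unsigned long fixed_size(unsigned long size)
{
    size = rounddown_pow2(size);
    return size;
}
```

With an input of 5000, `buggy_size` returns 5000 and `fixed_size` returns 4096.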
     
  • When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values a task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them have a statistical nature (their values are involved
    in the calculation of /proc/$pid/statm output), the start_brk and brk values
    are used to compute the allowed size of program data segment expansion.
    This means that arbitrary changes to these values might be dangerous.
    So to restrict access, the following requirements apply to
    prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
      VMA area must exist and should fit certain VMA flags,
      such as:
      - code segment must be executable but not writable;
      - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
    otherwise these prctl calls will return -EINVAL.

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
    involved in the calculation of program text/data segment sizes (which might
    be seen in /proc/$pid/statm) and in the final address of a brk() call.

    For restore we need to know all these values. While
    mm->start_code/end_code are already present in /proc/$pid/stat, the
    remaining members are not, so this patch brings them in.

    The restore procedure of these members is addressed in another patch using
    prctl().

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
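
A sketch of how a restore tool might pull numeric fields out of a /proc/$pid/stat line. The helper `stat_field` is hypothetical. The field positions mentioned in the note after the code come from the proc(5) manual page, not from the commit message, so treat them as an assumption about the running kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return numeric field n (1-based, n >= 3) of a stat line, or 0 if the
 * line has fewer fields. Field 2 (the command name) is enclosed in
 * parentheses and may itself contain spaces or parentheses, so parsing
 * starts after the last ')'.
 */
static unsigned long stat_field(const char *line, int n)
{
    const char *p = strrchr(line, ')');
    int field = 2;

    assert(n >= 3);
    if (!p)
        return 0;
    p++;
    while (*p) {
        while (*p == ' ')
            p++;
        if (!*p)
            break;
        if (++field == n)
            return strtoul(p, NULL, 10);
        while (*p && *p != ' ')
            p++;
    }
    return 0;
}
```

According to proc(5), start_data, end_data and start_brk are fields 45, 46 and 47, so on a kernel carrying this patch `stat_field(line, 45)` applied to the contents of /proc/self/stat would be expected to yield start_data.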
     
  • For checkpoint/restore we need auxiliary features compiled into the
    kernel, such as additional prctl codes, /proc/$pid/map_files, etc.
    At the same time these features are not mandatory for a regular kernel, so
    the CHECKPOINT_RESTORE config symbol provides a way to disable them all at
    once if one wishes to get rid of the additional functionality.

    Signed-off-by: Cyrill Gorcunov
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Bring a first selftest into the relevant directory. This tests several
    combinations of breakpoints and watchpoints on x86, as well as icebp traps
    and int3 traps. Given the number of breakpoint regressions we hit
    after we merged the generic breakpoint infrastructure, such a selftest
    became necessary and can still serve today as a basis for new patches that
    touch the do_debug() path.

    Signed-off-by: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Jason Wessel
    Cc: Will Deacon
    Cc: Michal Marek
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Bring a new kernel selftests directory in tools/testing/selftests. To
    add a new selftest, create a subdirectory with the sources and a
    makefile that creates a target named "run_test", then add the
    subdirectory name to the TARGET var in tools/testing/selftests/Makefile
    and to the tools/testing/selftests/run_tests script.

    This can help centralize and maintain any useful selftest that
    developers usually tend to let rust in peace on some random server.

    Suggested-by: Andrew Morton
    Signed-off-by: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Jason Wessel
    Cc: Will Deacon
    Cc: Steven Rostedt
    Cc: Michal Marek
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Down, down in the deepest depths of GFP_NOIO page reclaim, we have
    shrink_page_list() calling __remove_mapping() calling
    __delete_from_swap_cache() or __delete_from_page_cache().

    You would not expect those to need much stack, but in fact they call
    radix_tree_delete(): which declares a 192-byte radix_tree_path array on
    its stack (to record the node,offsets it visits when descending, in case
    it needs to ascend to update them). And if any tag is still set [1],
    that calls radix_tree_tag_clear(), which declares a further such
    192-byte radix_tree_path array on the stack. (At least we have
    interrupts disabled here, so won't then be pushing registers too.)

    That was probably a good choice when most users were 32-bit (array of
    half the size), and adding fields to radix_tree_node would have bloated
    it unnecessarily. But nowadays many are 64-bit, and each
    radix_tree_node contains a struct rcu_head, which is only used when
    freeing; whereas the radix_tree_path info is only used for updating the
    tree (deleting, clearing tags or setting tags if tagged) when a lock
    must be held, of no interest when accessing the tree locklessly.

    So add a parent pointer to the radix_tree_node, in union with the
    rcu_head, and remove all uses of the radix_tree_path. There would be
    space in that union to save the offset when descending as before (we can
    argue that a lock must already be held to exclude other users), but
    recalculating it when ascending is both easy (a constant shift and a
    constant mask) and uncommon, so it seems better just to do that.

    Two little optimizations: no need to decrement height when descending,
    adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
    tag on a node and its ancestors, it need not ascend from that node
    again.

    perf on the radix tree test harness reports radix_tree_insert() as 2%
    slower (now having to set parent), but radix_tree_delete() 24% faster.
    Surely that's an exaggeration from rtth's artificially low map shift 3,
    but forcing it back to 6 still rates radix_tree_delete() 8% faster.

    [1] Can a pagecache tag (dirty, writeback or towrite) actually still be
    set at the time of radix_tree_delete()? Perhaps not if the filesystem is
    well-behaved. But although I've not tracked any stack overflow down to
    this cause, I have observed a curious case in which a dirty tag is set
    and left set on tmpfs: page migration's migrate_page_copy() happens to
    use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and
    that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a
    filesystem which doesn't use tags, except for this stack depth issue.

    Signed-off-by: Hugh Dickins
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: Nai Xia
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
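
The "constant shift and constant mask" mentioned in the radix tree message can be sketched as follows. `MAP_SHIFT`, `MAP_SIZE`, `MAP_MASK` and `slot_offset` are names invented for this sketch; the shift of 6 is the value the message refers to when it talks about forcing the test harness back to 6. This shows only the offset arithmetic, not the tree code.

```c
#include <assert.h>

#define MAP_SHIFT 6                      /* bits of the index used per level */
#define MAP_SIZE  (1UL << MAP_SHIFT)     /* slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

/*
 * Slot that `index` occupies in a node `level` levels above the bottom
 * (level 0 is the node holding the items). With a parent pointer in
 * each node, this can be recomputed while ascending instead of being
 * recorded in a path array on the stack while descending.
 */
static unsigned int slot_offset(unsigned long index, unsigned int level)
{
    assert(level * MAP_SHIFT < sizeof(index) * 8);
    return (unsigned int)((index >> (level * MAP_SHIFT)) & MAP_MASK);
}
```

For an index built as (5 << 12) | (7 << 6) | 9, the offsets at levels 0, 1 and 2 are 9, 7 and 5.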
     
  • It is not used anymore, remove it

    Signed-off-by: Xiao Guangrong
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Some investigation of a transaction processing workload showed that a
    major consumer of cycles in __blockdev_direct_IO is the cache miss while
    accessing the block size. This is because it has to walk the chain from
    block_dev to gendisk to queue.

    The block size is needed early on to check alignment and sizes. It's only
    done if the check for the inode block size fails. But the costly block
    device state is unconditionally fetched.

    - Reorganize the code to only fetch block dev state when actually
    needed.

    Then do a prefetch on the block dev early on in the direct IO path. This
    is worth it, because there is substantial code run before we actually
    touch the block dev now.

    - I also added some unlikelies to make it clear to the compiler that the
    block device fetch code is not normally executed.

    This gave a small, but measurable improvement on a large database
    benchmark (about 0.3%)

    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
    Signed-off-by: Andi Kleen
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes it possible to get from the inode to the request_queue with one
    less cache miss. Used in a follow-on optimization.

    The lifetime of the pointer is the same as the gendisk.

    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices. I think that's safe, correct?

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • In get_more_blocks(), we use dio_count to calculate fs_count and do some
    tricky things to increase fs_count if dio_count isn't aligned. But
    actually it still has some corner cases that can't be covered. See the
    following example:

    dio_write foo -s 1024 -w 4096

    (direct write 4096 bytes at offset 1024). The same goes if the offset
    isn't aligned to fs_blocksize.

    In this case, the old calculation counts fs_count to be 1, but actually we
    will write into 2 different blocks (if fs_blocksize=4096). The old code
    just works, since it will call get_block twice (and may have to allocate
    and create extents twice for filesystems like ext4). So we'd better call
    get_block just once with the proper fs_count.

    Signed-off-by: Tao Ma
    Cc: "Theodore Ts'o"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
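
The arithmetic behind the example in the last message can be worked through with a small helper. `fs_block_count` is a hypothetical function that works in bytes rather than in the units direct-io.c uses internally, so it is a sketch of the idea (count blocks from the first to the last byte touched), not the patched kernel code.

```c
#include <assert.h>

/*
 * Number of filesystem blocks touched by a write of len bytes starting
 * at byte offset, for a block size of (1 << blkbits) bytes.
 */
static unsigned long fs_block_count(unsigned long long offset,
                                    unsigned long long len,
                                    unsigned int blkbits)
{
    unsigned long long first, last;

    assert(len > 0);
    first = offset >> blkbits;               /* block of the first byte */
    last = (offset + len - 1) >> blkbits;    /* block of the last byte */
    return (unsigned long)(last - first + 1);
}
```

For the case in the message (4096 bytes at offset 1024 with 4096-byte blocks), the first byte is in block 0 and the last byte, at offset 5119, is in block 1, so the count is 2 rather than 1.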