01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    Use of /proc/pid/mem has been considered, but there are issues with
    using it:
    - Does not allow for specifying iovecs for both src and dest. Assuming
      preadv or pwritev was implemented, either the area read from or
      written to would need to be contiguous.
    - Currently mem_read allows only processes that are currently
      ptrace'ing the target and are still able to ptrace the target to read
      from the target. This check could possibly be moved to the open call,
      but it's not clear exactly what race this restriction is stopping
      (the reason appears to have been lost).
    - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
      domain socket is a bit ugly from a userspace point of view,
      especially when you may have hundreds if not (eventually) thousands
      of processes that all need to do this with each other.
    - Doesn't allow for some uses of the interface that we would like to
      consider adding in the future (see below).
    - Interestingly, reading from /proc/pid/mem currently actually
      involves two copies! (But this could be fixed pretty easily.)
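
    The point about iovecs on both sides can be illustrated with a small
    user-space model. This is only a sketch of gather/scatter semantics,
    not the syscall itself: the "address spaces" are plain bytearrays, each
    iovec is an (offset, length) pair, and the name model_vm_readv is
    hypothetical.

```python
def iov_positions(iov):
    """Yield every byte offset covered by a list of (offset, length) pairs, in order."""
    for offset, length in iov:
        yield from range(offset, offset + length)


def model_vm_readv(remote_mem, remote_iov, local_mem, local_iov):
    """Copy bytes from the remote regions into the local regions.

    Both sides may be non-contiguous. Copying stops when either side
    runs out of space. Returns the number of bytes copied.
    """
    copied = 0
    for src, dst in zip(iov_positions(remote_iov), iov_positions(local_iov)):
        local_mem[dst] = remote_mem[src]  # one copy, straight into place
        copied += 1
    return copied


# Two separate source regions land in two differently sized destination regions.
remote = bytearray(b"AAAA....BBBB")
local = bytearray(10)
count = model_vm_readv(remote, [(0, 4), (8, 4)], local, [(0, 2), (4, 6)])
```

    With a single file offset, as with preadv or pwritev on /proc/pid/mem,
    only one of the two sides could be described by a list of regions like
    this.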

    As mentioned previously, use of vmsplice instead was considered, but it
    has problems. Since the reader and writer need to work co-operatively,
    you block if the pipe is not drained, which requires some wrapping to
    do non-blocking on the send side or polling on the receive side. In
    all-to-all communication it requires ordering, otherwise you can
    deadlock. And in the example of many MPI tasks writing to one MPI task,
    vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single copy interface
    does not get us the performance gain we could achieve. For example, in an
    MPI_Reduce rather than copy the data from the source we would like to
    instead use it directly in a mathops (say the reduce is doing a sum) as
    this would save us doing a copy. We don't need to keep a copy of the data
    from the source. I haven't implemented this, but I think this interface
    could in the future do all this through the use of the flags - eg could
    specify the math operation and type and the kernel rather than just
    copying the data would apply the specified operation between the source
    and destination and store it in the destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for intra
    process messaging which is not MPI), this interface is something which
    hardware vendors are already doing for their custom drivers to implement
    fast local communication. So, in addition to this being useful for
    OpenMPI, it would mean the driver maintainers don't have to fix things up
    when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc; other architectures should
    mainly (I think) just need to add syscall numbers for the process_vm_readv
    and process_vm_writev. There are 32 bit compatibility versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Oct, 2011

2 commits

  • * 'gpio/next' of git://git.secretlab.ca/git/linux-2.6:
    h8300: Move gpio.h to gpio-internal.h
    gpio: pl061: add DT binding support
    gpio: fix build error in include/asm-generic/gpio.h
    gpiolib: Ensure struct gpio is always defined
    irq: Add EXPORT_SYMBOL_GPL to function of irq generic-chip
    gpio-ml-ioh: Use NUMA_NO_NODE not GFP_KERNEL
    gpio-pch: Use NUMA_NO_NODE not GFP_KERNEL
    gpio: langwell: ensure alternate function is cleared
    gpio-pch: Support interrupt function
    gpio-pch: Save register value in suspend()
    gpio-pch: modify gpio_nums and mask
    gpio-pch: support ML7223 IOH n-Bus
    gpio-pch: add spinlock in suspend/resume processing
    gpio-pch: Delete invalid "restore" code in suspend()
    gpio-ml-ioh: Fix suspend/resume issue
    gpio-ml-ioh: Support interrupt function
    gpio-ml-ioh: Delete unnecessary code
    gpio/mxc: add chained_irq_enter/exit() to mx3_gpio_irq_handler()
    gpio/nomadik: use genirq core to track enablement
    gpio/nomadik: disable clocks when unused

    Linus Torvalds
     
  • …git-cur/linux-2.6-arm

    * 'devel-stable' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm: (178 commits)
    ARM: 7139/1: fix compilation with CONFIG_ARM_ATAG_DTB_COMPAT and large TEXT_OFFSET
    ARM: gic, local timers: use the request_percpu_irq() interface
    ARM: gic: consolidate PPI handling
    ARM: switch from NO_MACH_MEMORY_H to NEED_MACH_MEMORY_H
    ARM: mach-s5p64x0: remove mach/memory.h
    ARM: mach-s3c64xx: remove mach/memory.h
    ARM: plat-mxc: remove mach/memory.h
    ARM: mach-prima2: remove mach/memory.h
    ARM: mach-zynq: remove mach/memory.h
    ARM: mach-bcmring: remove mach/memory.h
    ARM: mach-davinci: remove mach/memory.h
    ARM: mach-pxa: remove mach/memory.h
    ARM: mach-ixp4xx: remove mach/memory.h
    ARM: mach-h720x: remove mach/memory.h
    ARM: mach-vt8500: remove mach/memory.h
    ARM: mach-s5pc100: remove mach/memory.h
    ARM: mach-tegra: remove mach/memory.h
    ARM: plat-tcc: remove mach/memory.h
    ARM: mach-mmp: remove mach/memory.h
    ARM: mach-cns3xxx: remove mach/memory.h
    ...

    Fix up mostly pretty trivial conflicts in:
    - arch/arm/Kconfig
    - arch/arm/include/asm/localtimer.h
    - arch/arm/kernel/Makefile
    - arch/arm/mach-shmobile/board-ap4evb.c
    - arch/arm/mach-u300/core.c
    - arch/arm/mm/dma-mapping.c
    - arch/arm/mm/proc-v7.S
    - arch/arm/plat-omap/Kconfig
    largely due to some CONFIG option renaming (ie CONFIG_PM_SLEEP ->
    CONFIG_ARM_CPU_SUSPEND for the arm-specific suspend code etc) and
    addition of NEED_MACH_MEMORY_H next to HAVE_IDE.

    Linus Torvalds
     

26 Oct, 2011

9 commits

  • * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    time, s390: Get rid of compile warning
    dw_apb_timer: constify clocksource name
    time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
    time: Change jiffies_to_clock_t() argument type to unsigned long
    alarmtimers: Fix error handling
    clocksource: Make watchdog reset lockless
    posix-cpu-timers: Cure SMP accounting oddities
    s390: Use direct ktime path for s390 clockevent device
    clockevents: Add direct ktime programming function
    clockevents: Make minimum delay adjustments configurable
    nohz: Remove "Switched to NOHz mode" debugging messages
    proc: Consider NO_HZ when printing idle and iowait times
    nohz: Make idle/iowait counter update conditional
    nohz: Fix update_ts_time_stat idle accounting
    cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
    alarmtimers: Rework RTC device selection using class interface
    alarmtimers: Add try_to_cancel functionality
    alarmtimers: Add more refined alarm state tracking
    alarmtimers: Remove period from alarm structure
    alarmtimers: Remove interval cap limit hack
    ...

    Linus Torvalds
     
  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    llist: Add back llist_add_batch() and llist_del_first() prototypes
    sched: Don't use tasklist_lock for debug prints
    sched: Warn on rt throttling
    sched: Unify the ->cpus_allowed mask copy
    sched: Wrap scheduler p->cpus_allowed access
    sched: Request for idle balance during nohz idle load balance
    sched: Use resched IPI to kick off the nohz idle balance
    sched: Fix idle_cpu()
    llist: Remove cpu_relax() usage in cmpxchg loops
    sched: Convert to struct llist
    llist: Add llist_next()
    irq_work: Use llist in the struct irq_work logic
    llist: Return whether list is empty before adding in llist_add()
    llist: Move cpu_relax() to after the cmpxchg()
    llist: Remove the platform-dependent NMI checks
    llist: Make some llist functions inline
    sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
    sched: Remove redundant test in check_preempt_tick()
    sched: Add documentation for bandwidth control
    sched: Return unused runtime on group dequeue
    ...

    Linus Torvalds
     
  • * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
    perf symbols: Increase symbol KSYM_NAME_LEN size
    perf hists browser: Refuse 'a' hotkey on non symbolic views
    perf ui browser: Use libslang to read keys
    perf tools: Fix tracing info recording
    perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
    perf hists: Don't consider filtered entries when calculating column widths
    perf hists: Don't decay total_period for filtered entries
    perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
    perf hists browser: Do not exit on tab key with single event
    perf annotate browser: Don't change selection line when returning from callq
    perf tools: handle endianness of feature bitmap
    perf tools: Add prelink suggestion to dso update message
    perf script: Fix unknown feature comment
    perf hists browser: Apply the dso and thread filters when merging new batches
    perf hists: Move the dso and thread filters from hist_browser
    perf ui browser: Honour the xterm colors
    perf top tui: Give color hints just on the percentage, like on --stdio
    perf ui browser: Make the colors configurable and change the defaults
    perf tui: Remove unneeded call to newtCls on startup
    perf hists: Don't format the percentage on hist_entry__snprintf
    ...

    Fix up conflicts in arch/x86/kernel/kprobes.c manually.

    Ingo's tree did the insane "add volatile to const array", which just
    doesn't make sense ("volatile const"?). But we could remove the const
    *and* make the array volatile to make doubly sure that gcc doesn't
    optimize it away..

    Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
    reader_lock has been turned into a raw lock by the core locking merge,
    and there was a new user of it introduced in this perf core merge. Make
    sure that new use also uses the raw accessor functions.

    Linus Torvalds
     
  • * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlier
    genirq: Fix fatfinered fixup really
    genirq: percpu: allow interrupt type to be set at enable time
    genirq: Add support for per-cpu dev_id interrupts
    genirq: Add IRQCHIP_SKIP_SET_WAKE flag

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
    rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
    rcu: Wire up RCU_BOOST_PRIO for rcutree
    rcu: Make rcu_torture_boost() exit loops at end of test
    rcu: Make rcu_torture_fqs() exit loops at end of test
    rcu: Permit rt_mutex_unlock() with irqs disabled
    rcu: Avoid having just-onlined CPU resched itself when RCU is idle
    rcu: Suppress NMI backtraces when stall ends before dump
    rcu: Prohibit grace periods during early boot
    rcu: Simplify unboosting checks
    rcu: Prevent early boot set_need_resched() from __rcu_pending()
    rcu: Dump local stack if cannot dump all CPUs' stacks
    rcu: Move __rcu_read_unlock()'s barrier() within if-statement
    rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
    rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
    rcu: Make rcu_implicit_dynticks_qs() locals be correct size
    rcu: Eliminate in_irq() checks in rcu_enter_nohz()
    nohz: Remove nohz_cpu_mask
    rcu: Document interpretation of RCU-lockdep splats
    rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
    ...

    Linus Torvalds
     
  • * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
    lockdep: Comment all warnings
    lib: atomic64: Change the type of local lock to raw_spinlock_t
    locking, lib/atomic64: Annotate atomic64_lock::lock as raw
    locking, x86, iommu: Annotate qi->q_lock as raw
    locking, x86, iommu: Annotate irq_2_ir_lock as raw
    locking, x86, iommu: Annotate iommu->register_lock as raw
    locking, dma, ipu: Annotate bank_lock as raw
    locking, ARM: Annotate low level hw locks as raw
    locking, drivers/dca: Annotate dca_lock as raw
    locking, powerpc: Annotate uic->lock as raw
    locking, x86: mce: Annotate cmci_discover_lock as raw
    locking, ACPI: Annotate c3_lock as raw
    locking, oprofile: Annotate oprofilefs lock as raw
    locking, video: Annotate vga console lock as raw
    locking, latencytop: Annotate latency_lock as raw
    locking, timer_stats: Annotate table_lock as raw
    locking, rwsem: Annotate inner lock as raw
    locking, semaphores: Annotate inner lock as raw
    locking, sched: Annotate thread_group_cputimer as raw
    ...

    Fix up conflicts in kernel/posix-cpu-timers.c manually: making
    cputimer->cputime a raw lock conflicted with the ABBA fix in commit
    bcd5cff7216f ("cputimer: Cure lock inversion").

    Linus Torvalds
     
  • * git://github.com/rustyrussell/linux:
    params: make dashes and underscores in parameter names truly equal
    kmod: prevent kmod_loop_msg overflow in __request_module()

    Linus Torvalds
     
  • The user may use "foo-bar" for a kernel parameter defined as "foo_bar".
    Make sure it works the other way around too.

    Apply the equality of dashes and underscores on early_params and __setup
    params as well.

    The example given in Documentation/kernel-parameters.txt indicates that
    this is the intended behaviour.

    With the patch the kernel accepts "log-buf-len=1M" as expected.
    https://bugzilla.redhat.com/show_bug.cgi?id=744545
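
    The matching rule described above can be sketched as follows. This is
    an illustration in Python rather than the kernel's C code; the helper
    names dash_to_underscore and param_eq are chosen for this example.

```python
def dash_to_underscore(name):
    """Normalise a parameter name so that '-' and '_' compare equal."""
    return name.replace("-", "_")


def param_eq(a, b):
    """Return True if two parameter names match, treating '-' and '_' alike."""
    return dash_to_underscore(a) == dash_to_underscore(b)


def find_param(argument, known_params):
    """Look up 'name=value' against known names; return (known_name, value) or None."""
    name, _, value = argument.partition("=")
    for known in known_params:
        if param_eq(name, known):
            return known, value
    return None


# "log-buf-len=1M" is matched against a parameter registered as "log_buf_len".
match = find_param("log-buf-len=1M", ["quiet", "log_buf_len"])
```

    Normalising both sides is what makes the comparison symmetric, so a
    parameter defined with a dash also matches a command line written with
    an underscore.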

    Signed-off-by: Michal Schmidt
    Signed-off-by: Rusty Russell (neatened implementations)

    Michal Schmidt
     
    Due to the post-increment in the condition on kmod_loop_msg in
    __request_module(), the system log can be spammed by many more than 5
    instances of the 'runaway loop' message if the number of events
    triggering it causes kmod_loop_msg to overflow.

    Fix that by making sure we never increment it past the threshold.
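
    The effect can be reproduced with a small simulation. It is a sketch
    under stated assumptions: the counter is modelled as a signed 8-bit
    value that wraps around (the real counter is wider, so the wrap takes
    far more events), and the function names are made up for this example.

```python
def wrap_int8(value):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return (value + 128) % 256 - 128


def messages_with_post_increment(events, limit=5):
    """Test the counter, then always increment it (the problematic pattern)."""
    counter = 0
    printed = 0
    for _ in range(events):
        if counter < limit:
            printed += 1
        counter = wrap_int8(counter + 1)  # keeps counting and eventually wraps
    return printed


def messages_with_capped_counter(events, limit=5):
    """Only increment while below the limit, so the counter never passes it."""
    counter = 0
    printed = 0
    for _ in range(events):
        if counter < limit:
            printed += 1
            counter += 1
    return printed
```

    In this model, 300 events produce 138 messages with the post-increment
    pattern, because every negative counter value after the wrap passes the
    "< 5" test again, and exactly 5 messages with the capped counter.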

    Signed-off-by: Jiri Kosina
    Signed-off-by: Rusty Russell
    CC: stable@kernel.org

    Jiri Kosina
     

25 Oct, 2011

6 commits

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (63 commits)
    PM / Clocks: Remove redundant NULL checks before kfree()
    PM / Documentation: Update docs about suspend and CPU hotplug
    ACPI / PM: Add Sony VGN-FW21E to nonvs blacklist.
    ARM: mach-shmobile: sh7372 A4R support (v4)
    ARM: mach-shmobile: sh7372 A3SP support (v4)
    PM / Sleep: Mark devices involved in wakeup signaling during suspend
    PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image
    PM / Hibernate: Do not initialize static and extern variables to 0
    PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too
    PM / Hibernate: Add resumedelay kernel param in addition to resumewait
    MAINTAINERS: Update linux-pm list address
    PM / ACPI: Blacklist Vaio VGN-FW520F machine known to require acpi_sleep=nonvs
    PM / ACPI: Blacklist Sony Vaio known to require acpi_sleep=nonvs
    PM / Hibernate: Add resumewait param to support MMC-like devices as resume file
    PM / Hibernate: Fix typo in a kerneldoc comment
    PM / Hibernate: Freeze kernel threads after preallocating memory
    PM: Update the policy on default wakeup settings
    PM / VT: Cleanup #if defined uglyness and fix compile error
    PM / Suspend: Off by one in pm_suspend()
    PM / Hibernate: Include storage keys in hibernation image on s390
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
    dp83640: free packet queues on remove
    dp83640: use proper function to free transmit time stamping packets
    ipv6: Do not use routes from locally generated RAs
    |PATCH net-next] tg3: add tx_dropped counter
    be2net: don't create multiple RX/TX rings in multi channel mode
    be2net: don't create multiple TXQs in BE2
    be2net: refactor VF setup/teardown code into be_vf_setup/clear()
    be2net: add vlan/rx-mode/flow-control config to be_setup()
    net_sched: cls_flow: use skb_header_pointer()
    ipv4: avoid useless call of the function check_peer_pmtu
    TCP: remove TCP_DEBUG
    net: Fix driver name for mdio-gpio.c
    ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
    rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
    ipv4: fix ipsec forward performance regression
    jme: fix irq storm after suspend/resume
    route: fix ICMP redirect validation
    net: hold sock reference while processing tx timestamps
    tcp: md5: add more const attributes
    Add ethtool -g support to virtio_net
    ...

    Fix up conflicts in:
    - drivers/net/Kconfig:
    The split-up generated a trivial conflict with removal of a
    stale reference to Documentation/networking/net-modules.txt.
    Remove it from the new location instead.
    - fs/sysfs/dir.c:
    Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
    with Eric Biederman's changes for tagged directories.

    Linus Torvalds
     
  • * 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (260 commits)
    usb: renesas_usbhs: fixup inconsistent return from usbhs_pkt_push()
    usb/isp1760: Allow to optionally trigger low-level chip reset via GPIOLIB.
    USB: gadget: midi: memory leak in f_midi_bind_config()
    USB: gadget: midi: fix range check in f_midi_out_open()
    QE/FHCI: fixed the CONTROL bug
    usb: renesas_usbhs: tidyup for smatch warnings
    USB: Fix USB Kconfig dependency problem on 85xx/QoirQ platforms
    EHCI: workaround for MosChip controller bug
    usb: gadget: file_storage: fix race on unloading
    USB: ftdi_sio.c: Use ftdi async_icount structure for TIOCMIWAIT, as in other drivers
    USB: ftdi_sio.c:Fill MSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill LSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c:Fill TX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill the RX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Basic icount infrastructure for ftdi_sio
    usb/isp1760: Let OF bindings depend on general CONFIG_OF instead of PPC_OF .
    USB: ftdi_sio: Support TI/Luminary Micro Stellaris BD-ICDI Board
    USB: Fix runtime wakeup on OHCI
    xHCI/USB: Make xHCI driver have a BOS descriptor.
    usb: gadget: add new usb gadget for ACM and mass storage
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     
  • * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
    TOMOYO: Fix incomplete read after seek.
    Smack: allow to access /smack/access as normal user
    TOMOYO: Fix unused kernel config option.
    Smack: fix: invalid length set for the result of /smack/access
    Smack: compilation fix
    Smack: fix for /smack/access output, use string instead of byte
    Smack: domain transition protections (v3)
    Smack: Provide information for UDS getsockopt(SO_PEERCRED)
    Smack: Clean up comments
    Smack: Repair processing of fcntl
    Smack: Rule list lookup performance
    Smack: check permissions from user space (v2)
    TOMOYO: Fix quota and garbage collector.
    TOMOYO: Remove redundant tasklist_lock.
    TOMOYO: Fix domain transition failure warning.
    TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
    TOMOYO: Simplify garbage collector.
    TOMOYO: Fix make namespacecheck warnings.
    target: check hex2bin result
    encrypted-keys: check hex2bin result
    ...

    Linus Torvalds
     
  • David S. Miller
     

24 Oct, 2011

1 commit

  • Some functions of irq generic-chip are undefined, because
    EXPORT_SYMBOL_GPL is not set for them.

    ERROR: "irq_setup_generic_chip" [drivers/gpio/gpio-pch.ko] undefined!
    ERROR: "irq_alloc_generic_chip" [drivers/gpio/gpio-pch.ko] undefined!
    ERROR: "irq_setup_generic_chip" [drivers/gpio/gpio-ml-ioh.ko] undefined!
    ERROR: "irq_alloc_generic_chip" [drivers/gpio/gpio-ml-ioh.ko] undefined!

    Fix this by adding EXPORT_SYMBOL_GPL to these functions so that they
    can be referenced.

    Signed-off-by: Nobuhiro Iwamatsu
    Acked-by: Thomas Gleixner
    Signed-off-by: Grant Likely

    Nobuhiro Iwamatsu
     

23 Oct, 2011

1 commit


18 Oct, 2011

1 commit

  • There's a lock inversion between the cputimer->lock and rq->lock;
    notably the two callchains involved are:

    update_rlimit_cpu()
      sighand->siglock
      set_process_cpu_timer()
        cpu_timer_sample_group()
          thread_group_cputimer()
            cputimer->lock
            thread_group_cputime()
              task_sched_runtime()
                ->pi_lock
                rq->lock

    scheduler_tick()
      rq->lock
      task_tick_fair()
        update_curr()
          account_group_exec()
            cputimer->lock

    Where the first one is enabling a CLOCK_PROCESS_CPUTIME_ID timer, and
    the second one is keeping it up-to-date.

    This problem was introduced by e8abccb7193 ("posix-cpu-timers: Cure
    SMP accounting oddities").

    Cure the problem by removing the cputimer->lock and rq->lock nesting.
    This leaves concurrent enablers doing duplicate work, but the time
    wasted should be of the same order as that otherwise wasted spinning
    on the lock, and the greater-than assignment filter should ensure we
    preserve monotonicity.
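
    The inversion can be checked mechanically from the lock order of the
    two callchains. The following is a simplified sketch, not lockdep: it
    assumes every lock in a chain stays held while the later ones are
    taken, and the helper names are hypothetical.

```python
from itertools import combinations


def order_edges(chain):
    """Return all (held, acquired_later) pairs for one acquisition chain."""
    return set(combinations(chain, 2))


def inverted_pairs(chain_a, chain_b):
    """Return lock pairs taken in one order by chain_a and the opposite order by chain_b."""
    edges_b = order_edges(chain_b)
    return sorted(
        (first, second)
        for first, second in order_edges(chain_a)
        if (second, first) in edges_b
    )


# Lock order taken from the two callchains in this changelog.
enable_timer = ["sighand->siglock", "cputimer->lock", "->pi_lock", "rq->lock"]
tick = ["rq->lock", "cputimer->lock"]
conflicts = inverted_pairs(enable_timer, tick)

# Without the nesting, the enabling path takes the two locks separately.
sample_runtime = ["sighand->siglock", "->pi_lock", "rq->lock"]
update_timer = ["sighand->siglock", "cputimer->lock"]
```

    With the nesting in place, the only conflicting pair is cputimer->lock
    and rq->lock. Once the enabling path takes them separately, neither of
    its chains conflicts with the tick path.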

    Reported-by: Dave Jones
    Reported-by: Simon Kirby
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twins
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

17 Oct, 2011

14 commits

  • The size is always valid, but variable-length arrays generate worse code
    for no good reason (unless the function happens to be inlined and the
    compiler sees the length for the simple constant it is).

    Also, there seems to be some code generation problem on POWER, where
    Henrik Bakken reports that register r28 can get corrupted under some
    subtle circumstances (interrupt happening at the wrong time?). That all
    indicates some seriously broken compiler issues, but since variable
    length arrays are bad regardless, there's little point in trying to
    chase it down.

    "Just don't do that, then".

    Reported-by: Henrik Grindal Bakken
    Cc: Benjamin Herrenschmidt
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This adds a mechanism to resume selected IRQs during syscore_resume
    instead of dpm_resume_noirq.

    Under Xen we need to resume IRQs associated with IPIs early enough
    that the resched IPI is unmasked and we can therefore schedule
    ourselves out of the stop_machine where the suspend/resume takes
    place.

    This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME".

    Signed-off-by: Ian Campbell
    Cc: Rafael J. Wysocki
    Cc: Jeremy Fitzhardinge
    Cc: xen-devel
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk
    Cc: stable@kernel.org (at least to 2.6.32.y)
    Signed-off-by: Thomas Gleixner

    Ian Campbell
     
  • Use threads for LZO compression/decompression on hibernate/thaw.
    Improve buffering on hibernate/thaw.
    Calculate/verify CRC32 of the image pages on hibernate/thaw.

    In my testing, this improved write/read speed by a factor of about two.
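
    The checksum step can be sketched as a running CRC32 over the image
    pages, computed on write and recomputed on read. This illustration uses
    Python's zlib.crc32 and made-up function names. It is not the kernel's
    CRC routine and leaves out compression and threading.

```python
import zlib


def image_crc32(pages):
    """Fold every page into one running CRC32 value."""
    crc = 0
    for page in pages:
        crc = zlib.crc32(page, crc)
    return crc


def verify_image(pages, expected_crc):
    """Recompute the checksum on resume and compare it with the stored value."""
    return image_crc32(pages) == expected_crc


# Checksum is stored at hibernate time and checked again at thaw time.
pages = [bytes([n]) * 4096 for n in range(4)]
stored_crc = image_crc32(pages)
```

    Because the CRC is carried from page to page, the pages can be
    processed one at a time without holding the whole image in one buffer.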

    Signed-off-by: Bojan Smojver
    Signed-off-by: Rafael J. Wysocki

    Bojan Smojver
     
  • Static and extern variables in kernel/power/hibernate.c need not be
    initialized to 0 explicitly, so remove those initializations.

    [rjw: Modified subject, added changelog.]

    Signed-off-by: Barry Song
    Signed-off-by: Rafael J. Wysocki

    Barry Song
     
  • TASK_KILLABLE is often used to put tasks to sleep for quite some time.
    One of the most common uses is to put tasks to sleep while waiting for
    replies from a server on a networked filesystem (such as CIFS or NFS).

    Unfortunately, fake_signal_wake_up does not currently wake up tasks
    that are sleeping in TASK_KILLABLE state. This means that even if the
    code were in place to allow them to freeze while in this sleep, it
    wouldn't work anyway.

    This patch changes this function to wake tasks in this state as well.
    This should be harmless -- if the code doing the sleeping doesn't have
    handling to deal with freezer events, it should just go back to sleep.
    If it does, then this will allow that code to do the right thing.

    Signed-off-by: Jeff Layton
    Signed-off-by: Rafael J. Wysocki

    Jeff Layton
     
  • Patch "PM / Hibernate: Add resumewait param to support MMC-like
    devices as resume file" added the resumewait kernel command line
    option. The present patch adds resumedelay so that
    resumewait/delay are analogous to rootwait/delay.

    [rjw: Modified the subject and changelog slightly.]

    Signed-off-by: Barry Song
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Barry Song
     
  • Some devices, like MMC, are detected asynchronously and very slowly.
    For example, drivers/mmc/host/sdhci.c launches a 200ms delayed work
    item to detect MMC partitions and then add the disk.

    We have wait_for_device_probe() and scsi_complete_async_scans()
    before calling swsusp_check(), but it is not enough to wait for MMC.

    This patch adds resumewait kernel param just like rootwait so
    that we have enough time to wait until MMC is ready. The difference is
    that we wait for resume partition whereas rootwait waits for rootfs
    partition (which may be on a different device).
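
    The waiting behaviour can be modelled as a polling loop that retries
    until the resume device shows up. This is a sketch only: the lookup
    callable, the poll interval, and the convention that 0 means "not
    found yet" are assumptions made for the example.

```python
import time


def wait_for_resume_device(lookup, poll_interval=0.01, sleep=time.sleep):
    """Poll lookup() until it returns a non-zero device number, then return it."""
    device = lookup()
    while device == 0:
        sleep(poll_interval)  # give slow asynchronous detection time to finish
        device = lookup()
    return device


# Simulated probe: the device only appears on the fourth lookup.
answers = iter([0, 0, 0, 0xB301])
found = wait_for_resume_device(lambda: next(answers), sleep=lambda seconds: None)
```

    The loop has no timeout, so it keeps waiting for as long as the device
    is missing.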

    This patch will make hibernation support many embedded products
    without SCSI devices, but with devices like MMC.

    [rjw: Modified the changelog slightly.]

    Signed-off-by: Barry Song
    Reviewed-by: Valdis Kletnieks
    Signed-off-by: Rafael J. Wysocki

    Barry Song
     
  • Fix a typo in a function name in the kerneldoc comment next to
    resume_target_kernel().

    [rjw: Changed the subject slightly, added the changelog.]

    Signed-off-by: Barry Song
    Signed-off-by: Rafael J. Wysocki

    Barry Song
     
  • There is a problem with the current ordering of hibernate code which
    leads to deadlocks in some filesystems' memory shrinkers. Namely,
    some filesystems use freezable kernel threads that are inactive when
    the hibernate memory preallocation is carried out. Those same
    filesystems use memory shrinkers that may be triggered by the
    hibernate memory preallocation. If those memory shrinkers wait for
    the frozen kernel threads, the hibernate process deadlocks (this
    happens with XFS, for one example).

    Apparently, it is not technically viable to redesign the filesystems
    in question to avoid the situation described above, so the only
    possible solution of this issue is to defer the freezing of kernel
    threads until the hibernate memory preallocation is done, which is
    implemented by this change.

    Unfortunately, this requires the memory preallocation to be done
    before the "prepare" stage of device freeze, so after this change the
    only way drivers can allocate additional memory for their freeze
    routines in a clean way is to use PM notifiers.

    Reported-by: Christoph
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Introduce the config option CONFIG_VT_CONSOLE_SLEEP in order to clean up
    the #if defined ugliness for the vt suspend support functions. Note that
    CONFIG_VT_CONSOLE is already dependent on CONFIG_VT.

    The function pm_set_vt_switch is actually dependent on CONFIG_VT and not
    CONFIG_PM_SLEEP. This fixes a compile error when CONFIG_PM_SLEEP is
    not set:

    drivers/tty/vt/vt_ioctl.c:1794: error: redefinition of 'pm_set_vt_switch'
    include/linux/suspend.h:17: error: previous definition of 'pm_set_vt_switch' was here

    Also, remove the incorrect path from the comment in console.c.

    [rjw: Replaced #if defined() with #ifdef in suspend.h.]

    Signed-off-by: H Hartley Sweeten
    Acked-by: Arnd Bergmann
    Signed-off-by: Rafael J. Wysocki

    H Hartley Sweeten
     
  • In enter_state() we use "state" as an offset for the pm_states[]
    array. The pm_states[] array only has PM_SUSPEND_MAX elements so
    this test is off by one.
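
    A minimal model of the bounds problem, with illustrative values that
    are not copied from the kernel headers: the table has PM_SUSPEND_MAX
    entries, so valid indices stop at PM_SUSPEND_MAX - 1.

```python
PM_SUSPEND_ON = 0    # illustrative values for this sketch
PM_SUSPEND_MAX = 4
PM_STATES = [None, "standby", None, "mem"]  # exactly PM_SUSPEND_MAX entries


def is_valid_off_by_one(state):
    """Accepts state == PM_SUSPEND_MAX, which is one past the end of the table."""
    return PM_SUSPEND_ON < state <= PM_SUSPEND_MAX


def is_valid_fixed(state):
    """Rejects anything at or beyond PM_SUSPEND_MAX."""
    return PM_SUSPEND_ON < state < PM_SUSPEND_MAX


def state_name(state, is_valid):
    """Return the table entry for a state, or None if the check rejects it."""
    if not is_valid(state):
        return None
    return PM_STATES[state]
```

    In Python the bad index raises IndexError. In C the same access would
    silently read past the end of the array.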

    Signed-off-by: Dan Carpenter
    Signed-off-by: Rafael J. Wysocki
    Cc: stable@kernel.org

    Dan Carpenter
     
  • For s390 there is one additional byte associated with each page,
    the storage key. This byte contains the referenced and changed
    bits and needs to be included into the hibernation image.
    If the storage keys are not restored to their previous state all
    original pages would appear to be dirty. This can cause
    inconsistencies e.g. with read-only filesystems.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Rafael J. Wysocki

    Martin Schwidefsky
     
  • Suspend statistics should depend on CONFIG_PM_SLEEP, so make that
    happen.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Record the number of S3 failures for each reason and the names of the
    two most recently failed devices during S3.
    These can be checked through the 'suspend_stats' entry in debugfs.

    The motivation of the patch:

    We are enabling power features on Medfield. Compared with a PC or
    notebook, a mobile device enters and exits suspend-2-ram (we call it
    s3 on Medfield) far more frequently. If it can't enter suspend-2-ram
    in time, the battery may soon be drained.

    We often find that a device suspend fails and the system then retries
    s3 over and over again. As the display is off, testers and developers
    don't know what is happening.

    Some testers and developers complain that they don't know whether the
    system tries suspend-2-ram or which device fails to suspend. They need
    such info for a quick check. The patch adds suspend_stats under
    debugfs for users to check suspend to RAM statistics quickly.

    Without this patch, there are other methods to find out which device
    fails. One is to turn on CONFIG_PM_DEBUG, but users would get too much
    info and testers would need to recompile the system.

    In addition, dynamic debug is another good tool to dump debug info,
    but it still doesn't match our usage scenario closely:
    1) users need to write a user space parser to process the syslog output;
    2) our testing scenario is that we leave the mobile device for hours
    and then check its status. No serial console is available during the
    testing, first because the console would be suspended, and second
    because a serial console connected through spi or HSU devices would
    consume power. These devices are powered off at suspend-2-ram.

    Signed-off-by: ShuoX Liu
    Signed-off-by: Rafael J. Wysocki

    ShuoX Liu
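
As a rough illustration of the "quick check" described above, the following sketch reads one counter from a statistics file. It assumes debugfs is mounted at /sys/kernel/debug and that the file contains "name: value" lines; both the line format and the helper name read_counter() are assumptions, not something the commit message specifies.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the integer following "key:" in the given file, or -1 if the
 * file cannot be opened or the key is not present.
 * The "key: value" layout is an assumption about the file format.
 */
static long read_counter(const char *path, const char *key)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    long value = -1;

    while (fgets(line, sizeof line, f)) {
        const char *p = line;

        /* Skip any leading indentation. */
        while (isspace((unsigned char)*p))
            p++;

        /* Require the colon right after the key so "fail" != "failed_x". */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            value = strtol(p + klen + 1, NULL, 10);
            break;
        }
    }

    fclose(f);
    return value;
}
```

A tester could call, for example, read_counter("/sys/kernel/debug/suspend_stats", "fail") after leaving the device idle, assuming a counter with that name exists in the file.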
     

14 Oct, 2011

2 commits

  • The trace_pipe_raw handler holds a cached page from the time the file
    is opened to the time it is closed. The cached page is used to handle
    the case of the user space buffer being smaller than what was read from
    the ring buffer. The left over buffer is held in the cache so that the
    next read will continue where the data left off.

    After EOF is returned (no more data in the buffer), the index of
    the cached page is set to zero. If a user app reads the page again
    after EOF, the check in the buffer will see that the index of the
    cached page is less than the page size and will return the cached
    page again. Reading trace_pipe_raw again after EOF therefore returns
    duplicate data, making it look as if time went backwards when in
    fact the data is just repeated.

    The fix is to not reset the index right after all data is read
    from the cache, but to reset it after all data is read and more
    data exists in the ring buffer.

    Cc: stable
    Reported-by: Jeremy Eder
    Signed-off-by: Steven Rostedt

    Steven Rostedt
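
The cached-page behaviour above can be modelled with a small userspace sketch. This is a simplified model, not the kernel implementation: the tiny PAGE_LEN, the struct layouts, and the names ring_pull() and reader_read() are hypothetical, and the buggy flag exists only to contrast the two behaviours.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_LEN 4  /* tiny "page" for illustration */

/* Hypothetical ring buffer: a list of full pages consumed in order. */
struct ring {
    const char (*pages)[PAGE_LEN];
    size_t count;
    size_t next;
};

/* Reader state: pos == PAGE_LEN means the cached page is fully drained. */
struct reader {
    char cache[PAGE_LEN];
    size_t pos;
};

/* Copy the next page into dst; return 0 when the ring is empty. */
static int ring_pull(struct ring *r, char *dst)
{
    if (r->next == r->count)
        return 0;
    memcpy(dst, r->pages[r->next++], PAGE_LEN);
    return 1;
}

/* Read up to 'want' bytes; returns 0 at EOF. */
static size_t reader_read(struct reader *rd, struct ring *r,
                          char *out, size_t want, int buggy)
{
    if (rd->pos >= PAGE_LEN) {
        if (!ring_pull(r, rd->cache)) {
            if (buggy)
                rd->pos = 0;  /* reset at EOF: the stale page now looks unread */
            return 0;
        }
        rd->pos = 0;          /* reset only once new data actually exists */
    }

    size_t n = PAGE_LEN - rd->pos;
    if (n > want)
        n = want;
    memcpy(out, rd->cache + rd->pos, n);
    rd->pos += n;
    return n;
}
```

In this model, a reader initialised with pos set to PAGE_LEN that keeps reading after EOF gets 0 every time with buggy set to 0, but gets the old page back with buggy set to 1.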
     
  • The tracing_enabled option is deprecated.
    To start/stop tracing, write to /sys/kernel/debug/tracing/tracing_on
    without using tracing_enabled. This patch is based on Linux 3.1.0-rc1.

    Signed-off-by: Geunsik Lim
    Link: http://lkml.kernel.org/r/1313127022-23830-1-git-send-email-leemgs1@gmail.com
    Signed-off-by: Steven Rostedt

    Geunsik Lim
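
For reference, starting and stopping tracing through tracing_on might look like the sketch below. It assumes the file accepts a single "1" or "0" character and that the caller has permission to write to it; the helper name set_tracing_on() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" (start) or "0" (stop) to a tracing_on-style control file.
 * Returns 0 on success and -1 on any failure.
 */
static int set_tracing_on(const char *path, int enable)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, enable ? "1" : "0", 1);
    int rc = (n == 1) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would typically use set_tracing_on("/sys/kernel/debug/tracing/tracing_on", 1) before the region of interest and pass 0 afterwards.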
     

12 Oct, 2011

2 commits


11 Oct, 2011

1 commit

  • When doing intense tracing, the kmalloc inside trace_marker can
    introduce side effects to what is being traced.

    As trace_marker() is used by userspace to inject data into the
    kernel ring buffer, it needs to do so with the least amount
    of intrusion to the operations of the kernel or the user space
    application.

    As the ring buffer is designed to write directly into the buffer
    without the need to make a temporary buffer, and userspace already
    went through the hassle of knowing how big the write will be,
    we can simply pin the userspace pages and write the data directly
    into the buffer. This reduces the impact of tracing via trace_marker
    tremendously!

    Thanks to Peter Zijlstra and Thomas Gleixner for pointing out the
    use of get_user_pages_fast() and kmap_atomic().

    Suggested-by: Thomas Gleixner
    Suggested-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt
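
From the userspace side, injecting data through trace_marker is a plain write() whose length is known up front. The sketch below shows that side only and assumes the caller has already opened the marker file (commonly /sys/kernel/debug/tracing/trace_marker when debugfs is mounted there); the helper name marker_write() is hypothetical, and the kernel-side page pinning is not shown.

```c
#include <string.h>
#include <unistd.h>

/*
 * Emit one marker with a single write() call so the kernel sees the
 * whole message and its size at once. Returns 0 on success, -1 otherwise.
 */
static int marker_write(int fd, const char *msg)
{
    size_t len = strlen(msg);
    ssize_t n = write(fd, msg, len);

    return (n == (ssize_t)len) ? 0 : -1;
}
```

Keeping the file descriptor open across calls, rather than reopening the file for every marker, is one way to keep the application's own overhead low.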