17 May, 2007

4 commits

  • The sysfs files /sys/power/disk and /sys/power/state do not work as
    documented, since they allow the user to write only a few initial
    characters of the input string to trigger the option (eg. 'echo pl >
    /sys/power/disk' activates the platform mode of hibernation). Fix it.

    Special thanks to Peter Moulder for
    pointing out the problem.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core
    filename size and change it to 128, so that extensive patterns such as
    '/local/cores/%e-%h-%s-%t-%p.core' won't result in truncated filename
    generation.

    Signed-off-by: Dan Aloni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Aloni
     
  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In commit e3c7db621bed4afb8e231cb005057f2feb5db557 we fixed the resume
    ordering, so that the ACPI low-level resume code was called before the
    actual driver resume was called. However, that broke the nesting logic
    of suspend and resume, and we continued to suspend the devices _after_
    we the ACPI device suspend code was called.

    That resulted in us saving PCI state for devices that had already been
    changed by ACPI, and in some cases disabled entirely (causing the PCI
    save_state to be all-ones). Which in turn caused the wrong state to be
    written back on resume.

    This moves the ACPI device suspend to after the device model per-device
    suspend() calls. This fixes the bogus state save.

    Thanks to Lukáš Hejtmánek for testing.

    Acked-by: Lukas Hejtmanek
    Acked-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: Andrew Morton
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 May, 2007

1 commit


15 May, 2007

2 commits

  • lockdep complains about the lock nesting of clocksource and watchdog lock
    in the resume path.

    Change the resume marker to a bit operation and remove the lock from this
    path.

    Signed-off-by: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • The time keeping code move to kernel/time/timekeeping.c broke the
    clocksource resume logic patch, which got applied to the old file by a
    fuzzy application. Fix it up and move the clocksource_resume() call to
    the appropriate place.

    Signed-off-by: Thomas Gleixner
    [ tssk, tssk, everybody should use --fuzz=0 ]
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

13 May, 2007

1 commit


12 May, 2007

3 commits

  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
    [IA64] Quicklist support for IA64
    [IA64] fix Kprobes reentrancy
    [IA64] SN: validate smp_affinity mask on intr redirect
    [IA64] drivers/char/snsc_event.c:206: warning: unused variable `p'
    [IA64] mca.c:121: warning: 'cpe_poll_timer' defined but not used
    [IA64] Fix - Section mismatch: reference to .init.data:mvec_name
    [IA64] more warning cleanups
    [IA64] Wire up epoll_pwait and utimensat
    [IA64] Fix warnings resulting from type-checking in dev_dbg()
    [IA64] typo s/kenrel/kernel/

    Linus Torvalds
     
  • * 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
    [PATCH] Abnormal End of Processes
    [PATCH] match audit name data
    [PATCH] complete message queue auditing
    [PATCH] audit inode for all xattr syscalls
    [PATCH] initialize name osid
    [PATCH] audit signal recipients
    [PATCH] add SIGNAL syscall class (v3)
    [PATCH] auditing ptrace

    Linus Torvalds
     
  • On SN, only allow one bit to be set in the smp_affinty mask when
    redirecting an interrupt. Currently setting multiple bits is allowed, but
    only the first bit is used in determining the CPU to redirect to. This has
    caused confusion among some customers.

    [akpm@linux-foundation.org: fixes]
    Signed-off-by: John Keller
    Signed-off-by: Andrew Morton
    Signed-off-by: Tony Luck

    John Keller
     

11 May, 2007

18 commits

  • This is a very simple and light file descriptor, that can be used as event
    wait/dispatch by userspace (both wait and dispatch) and by the kernel
    (dispatch only). It can be used instead of pipe(2) in all cases where those
    would simply be used to signal events. Their kernel overhead is much lower
    than pipes, and they do not consume two fds. When used in the kernel, it can
    offer an fd-bridge to enable, for example, functionalities like KAIO or
    syslets/threadlets to signal to an fd the completion of certain operations.
    But more in general, an eventfd can be used by the kernel to signal readiness,
    in a POSIX poll/select way, of interfaces that would otherwise be incompatible
    with it. The API is:

    int eventfd(unsigned int count);

    The eventfd API accepts an initial "count" parameter, and returns an eventfd
    fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

    The POLLIN flag is raised when the internal counter is greater than zero.

    The POLLOUT flag is raised when at least a value of "1" can be written to the
    internal counter.

    The POLLERR flag is raised when an overflow in the counter value is detected.

    The write(2) operation can never overflow the counter, since it blocks (unless
    O_NONBLOCK is set, in which case -EAGAIN is returned).

    But the eventfd_signal() function can do it, since it's supposed to not sleep
    during its operation.

    The read(2) function reads the __u64 counter value, and reset the internal
    value to zero. If the value read is equal to (__u64) -1, an overflow happened
    on the internal counter (due to 2^64 eventfd_signal() posts that has never
    been retired - unlickely, but possible).

    The write(2) call writes an __u64 count value, and adds it to the current
    counter. The eventfd fd supports O_NONBLOCK also.

    On the kernel side, we have:

    struct file *eventfd_fget(int fd);
    int eventfd_signal(struct file *file, unsigned int n);

    The eventfd_fget() should be called to get a struct file* from an eventfd fd
    (this is an fget() + check of f_op being an eventfd fops pointer).

    The kernel can then call eventfd_signal() every time it wants to post an event
    to userspace. The eventfd_signal() function can be called from any context.
    An eventfd() simple test and bench is available here:

    http://www.xmailserver.org/eventfd-bench.c

    This is the eventfd-based version of pipetest-4 (pipe(2) based):

    http://www.xmailserver.org/pipetest-4.c

    Not that performance matters much in the eventfd case, but eventfd-bench
    shows almost as double as performance than pipetest-4.

    [akpm@linux-foundation.org: fix i386 build]
    [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch implements the necessary compat code for the timerfd system call.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch introduces a new system call for timers events delivered though
    file descriptors. This allows timer event to be used with standard POSIX
    poll(2), select(2) and read(2). As a consequence of supporting the Linux
    f_op->poll subsystem, they can be used with epoll(2) too.

    The system call is defined as:

    int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

    The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
    w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
    s new file descriptor will be created, otherwise the existing "ufd" will be
    re-programmed.

    The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
    specified in the "utmr->it_value" parameter is the expiry time for the timer.

    If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
    otherwise it's a relative time.

    If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
    tv_nsec == 0), this is the period at which the following ticks should be
    generated.

    The "utmr->it_interval" should be set to zero if only one tick is requested.
    Setting the "utmr->it_value" to zero will disable the timer, or will create a
    timerfd without the timer enabled.

    The function returns the new (or same, in case "ufd" is a valid timerfd
    descriptor) file, or -1 in case of error.

    As stated before, the timerfd file descriptor supports poll(2), select(2) and
    epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
    returned.

    The read(2) call can be used, and it will return a u32 variable holding the
    number of "ticks" that happened on the interface since the last call to
    read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
    be returned if no ticks happened.

    A quick test program, shows timerfd working correctly on my amd64 box:

    http://www.xmailserver.org/timerfd-test.c

    [akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Use task_pgrp() and task_session() in copy_process(), and avoid find_pid()
    call when attaching the task to its process group and session.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Modify copy_process() to take a struct pid * parameter instead of a pid_t.
    This simplifies the code a bit and also avoids having to call find_pid() to
    convert the pid_t to a struct pid.

    Changelog:
    - Fixed Badari Pulavarty's comments and passed in &init_struct_pid
    from fork_idle().
    - Fixed Eric Biederman's comments and simplified this patch and
    used a new patch to remove the likely(pid) check.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Statically initialize a struct pid for the swapper process (pid_t == 0) and
    attach it to init_task. This is needed so task_pid(), task_pgrp() and
    task_session() interfaces work on the swapper process also.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • attach_pid() currently takes a pid_t and then uses find_pid() to find the
    corresponding struct pid. Sometimes we already have the struct pid. We can
    then skip find_pid() if attach_pid() were to take a struct pid parameter.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Switch to the defines for these two checks, instead of hard coding the
    values.

    [akpm@linux-foundation.org: add missing include]
    Signed-off-by: Daniel Walker
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Walker
     
  • Add a call to hard_irq_disable() to stop_machine so that we make sure IRQs are
    really disabled and not only lazy-disabled on archs like powerpc as some users
    of stop_machine() may rely on that.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Rusty Russell
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
    each task.

    This patch permits reporting of values using the well known getrusage()
    syscall, filling ru_inblock and ru_oublock instead of null values.

    As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
    count doing : nr_blocks = nr_bytes / 512

    Example of use :
    ----------------------
    After patch is applied, /usr/bin/time command can now give a good
    approximation of IO that the process had to do.

    $ /usr/bin/time grep tototo /usr/include/*
    Command exited with non-zero status 1
    0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
    24288inputs+0outputs (0major+259minor)pagefaults 0swaps

    $ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
    1000+0 enregistrements lus
    1000+0 enregistrements écrits
    512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
    0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+3000outputs (0major+299minor)pagefaults 0swaps

    Signed-off-by: Eric Dumazet
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Hi,

    I have been working on some code that detects abnormal events based on audit
    system events. One kind of event that we currently have no visibility for is
    when a program terminates due to segfault - which should never happen on a
    production machine. And if it did, you'd want to investigate it. Attached is a
    patch that collects these events and sends them into the audit system.

    Signed-off-by: Steve Grubb
    Signed-off-by: Al Viro

    Steve Grubb
     
  • Make more effort to detect previously collected names, so we don't log
    multiple PATH records for a single filesystem object. Add
    audit_inc_name_count() to reduce duplicate code.

    Signed-off-by: Amy Griffis
    Signed-off-by: Al Viro

    Amy Griffis
     
  • Handle the edge cases for POSIX message queue auditing. Collect inode
    info when opening an existing mq, and for send/receive operations. Remove
    audit_inode_update() as it has really evolved into the equivalent of
    audit_inode().

    Signed-off-by: Amy Griffis
    Signed-off-by: Al Viro

    Amy Griffis
     
  • Audit contexts can be reused, so initialize a name's osid to the
    default in audit_getname(). This ensures we don't log a bogus object
    label when no inode data is collected for a name.

    Signed-off-by: Amy Griffis
    Signed-off-by: Al Viro

    Amy Griffis
     
  • When auditing syscalls that send signals, log the pid and security
    context for each target process. Optimize the data collection by
    adding a counter for signal-related rules, and avoiding allocating an
    aux struct unless we have more than one target process. For process
    groups, collect pid/context data in blocks of 16. Move the
    audit_signal_info() hook up in check_kill_permission() so we audit
    attempts where permission is denied.

    Signed-off-by: Amy Griffis
    Signed-off-by: Al Viro

    Amy Griffis
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • On 09-05-2007 21:10, Pallipadi, Venkatesh wrote:
    ...
    > On a 64 bit system, converting pointer to int causes unnecessary
    > compiler warning, and intermediate long conversion was to avoid that.
    > I will have to rephrase my comment to remove 32 bit value and use int,
    > as that is what the function returns.

    So, this patch reverts all changes done by my previous patch.

    I apologize for my wrong comment about "logical error" here.

    Cc: "Pallipadi, Venkatesh"
    Cc: Satyam Sharma
    Cc: Oleg Nesterov
    Signed-off-by: Jarek Poplawski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     

10 May, 2007

11 commits

  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
    [POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32
    [POWERPC] EEH: log all PCI-X and PCI-E AER registers
    [POWERPC] EEH: capture and log pci state on error
    [POWERPC] EEH: Split up long error msg
    [POWERPC] EEH: log error only after driver notification.
    [POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init().
    [POWERPC] Don't use SLAB/SLUB for PTE pages
    [POWERPC] Spufs support for 64K LS mappings on 4K kernels
    [POWERPC] Add ability to 4K kernel to hash in 64K pages
    [POWERPC] Introduce address space "slices"
    [POWERPC] Small fixes & cleanups in segment page size demotion
    [POWERPC] iSeries: Make HVC_ISERIES the default
    [POWERPC] iSeries: suppress build warning in lparmap.c
    [POWERPC] Mark pages that don't exist as nosave
    [POWERPC] swsusp: Introduce register_nosave_region_late

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
    sound: convert "sound" subdirectory to UTF-8
    MAINTAINERS: Add cxacru website/mailing list
    include files: convert "include" subdirectory to UTF-8
    general: convert "kernel" subdirectory to UTF-8
    documentation: convert the Documentation directory to UTF-8
    Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
    remove broken URLs from net drivers' output
    Magic number prefix consistency change to Documentation/magic-number.txt
    trivial: s/i_sem /i_mutex/
    fix file specification in comments
    drivers/base/platform.c: fix small typo in doc
    misc doc and kconfig typos
    Remove obsolete fat_cvf help text
    Fix occurrences of "the the "
    Fix minor typoes in kernel/module.c
    Kconfig: Remove reference to external mqueue library
    Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
    Correct comments in genrtc.c to refer to correct /proc file.
    Fix more "deprecated" spellos.
    Fix "deprecated" typoes.
    ...

    Fix trivial comment conflict in kernel/relay.c.

    Linus Torvalds
     
  • This finally renames the thread_info field in task structure to stack, so that
    the assumptions about this field are gone and archs have more freedom about
    placing the thread_info structure.

    Nonbroken archs which have a proper thread pointer can do the access to both
    current thread and task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • Recently a few direct accesses to the thread_info in the task structure snuck
    back, so this wraps them with the appropriate wrapper.

    Signed-off-by: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • We need to make sure that the clocksources are resumed, when timekeeping is
    resumed. The current resume logic does not guarantee this.

    Add a resume function pointer to the clocksource struct, so clocksource
    drivers which need to reinitialize the clocksource can provide a resume
    function.

    Add a resume function, which calls the maybe available clocksource resume
    functions and resets the watchdog function, so a stable TSC can be used
    accross suspend/resume.

    Signed-off-by: Thomas Gleixner
    Cc: john stultz
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Make it configurable. Code in mm makes the vm statistics intervals
    independent from the cache reaper use that opportunity to make it
    configurable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make the microcode driver use the suspend-related CPU hotplug notifications
    to handle the CPU hotplug events occuring during system-wide suspend and
    resume transitions. Remove the global variable suspend_cpu_hotplug
    previously used for this purpose.

    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Signed-off-by: Jarek Poplawski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jarek Poplawski
     
  • Analysis of current linux futex code :
    --------------------------------------

    A central hash table futex_queues[] holds all contexts (futex_q) of waiting
    threads.

    Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
    perform lookups or insert/deletion of a futex_q.

    When a futex_wait() is done, calling thread has to :

    1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
    (calling find_vma()). This validation tells us if the futex uses
    an inode based store (mapped file), or mm based store (anonymous mem)

    2) - compute a hash key

    3) - Atomic increment of reference counter on an inode or a mm_struct

    4) - lock part of futex_queues[] hash table

    5) - perform the test on value of futex.
    (rollback is value != expected_value, returns EWOULDBLOCK)
    (various loops if test triggers mm faults)

    6) queue the context into hash table, release the lock got in 4)

    7) - release the read_lock on mmap_sem

    8) Eventually unqueue the context (but rarely, as this part  may be done
    by the futex_wake())

    Futexes were designed to improve scalability but current implementation has
    various problems :

    - Central hashtable :

    This means scalability problems if many processes/threads want to use
    futexes at the same time.
    This means NUMA unbalance because this hashtable is located on one node.

    - Using mmap_sem on every futex() syscall :

    Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
    ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

    mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
    Highly threaded processes might suffer from mmap_sem contention.

    mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
    programs because of contention on the mmap_sem cache line.

    - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
    It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
    because of cache misses.

    Most of these scalability problems come from the fact that futexes are in
    one global namespace. As we use a central hash table, we must make sure
    they are all using the same reference (given by the mm subsystem). We
    chose to force all futexes be 'shared'. This has a cost.

    But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
    and optimal performance if carefuly implemented. Time has come for linux
    to have better threading performance.

    The goal is to permit new futex commands to avoid :
    - Taking the mmap_sem semaphore, conflicting with other subsystems.
    - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

    This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
    futexes, we only need to distinguish futexes by their virtual address, no
    matter the underlying mm storage is.

    If glibc wants to exploit this new infrastructure, it should use new
    _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be
    prepared to fallback on old subcommands for old kernels. Using one global
    variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

    PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

    Compatibility with old applications is preserved, they still hit the
    scalability problems, but new applications can fly :)

    Note : the same SHARED futex (mapped on a file) can be used by old binaries
    *and* new binaries, because both binaries will use the old subcommands.

    Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
    as this is the default semantic. Almost all applications should benefit
    of this changes (new kernel and updated libc)

    Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

    /* calling futex_wait(addr, value) with value != *addr */
    433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
    424 cycles per futex(FUTEX_WAIT) call (using one futex)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
    For reference :
    187 cycles per getppid() call
    188 cycles per umask() call
    181 cycles per ni_syscall() call

    Signed-off-by: Eric Dumazet
    Pierre Peiffer
    Cc: "Ulrich Drepper"
    Cc: "Nick Piggin"
    Cc: "Ingo Molnar"
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • This patch provides the futex_requeue_pi functionality, which allows some
    threads waiting on a normal futex to be requeued on the wait-queue of a
    PI-futex.

    This provides an optimization, already used for (normal) futexes, to be used
    with the PI-futexes.

    This optimization is currently used by the glibc in pthread_broadcast, when
    using "normal" mutexes. With futex_requeue_pi, it can be used with
    PRIO_INHERIT mutexes too.

    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer