25 Jul, 2008

12 commits

  • This patch is by far the most complex in the series. It adds a new syscall
    paccept. This syscall differs from accept in that it adds (at the userlevel)
    two additional parameters:

    - a signal mask
    - a flags value

    The flags parameter can be used to set flags like SOCK_CLOEXEC. This is
    implemented here as well. Some people argued that this is a property which
    should be inherited from the file descriptor for the server but this is against
    POSIX. Additionally, we really want the signal mask parameter as well
    (similar to pselect, ppoll, etc). So an interface change is inevitable.

    The flag value is the same as for socket and socketpair. I think diverging
    here will only create confusion. Similar to the filesystem interfaces where
    the use of the O_* constants differs, it is acceptable here.

    The signal mask is handled as for pselect etc. The mask is temporarily
    installed for the thread and removed before the call returns. I modeled the
    code after pselect. If there is a problem it's likely also in pselect.

    For architectures which use socketcall I maintained this interface instead of
    adding a system call. The symmetry shouldn't be broken.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
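    #include <errno.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>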

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
      ({ long args[6] = { \
           (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
         syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
      syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_CLOEXEC O_CLOEXEC

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
      pthread_barrier_wait (&b);
      int s = socket (AF_INET, SOCK_STREAM, 0);
      struct sockaddr_in sin;
      sin.sin_family = AF_INET;
      sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
      sin.sin_port = htons (PORT);
      connect (s, (const struct sockaddr *) &sin, sizeof (sin));
      close (s);

      pthread_barrier_wait (&b);
      s = socket (AF_INET, SOCK_STREAM, 0);
      sin.sin_port = htons (PORT);
      connect (s, (const struct sockaddr *) &sin, sizeof (sin));
      close (s);
      pthread_barrier_wait (&b);

      pthread_barrier_wait (&b);
      sleep (2);
      pthread_kill ((pthread_t) arg, SIGUSR1);

      return NULL;
    }

    static void
    handler (int s)
    {
    }

    int
    main (void)
    {
      pthread_barrier_init (&b, NULL, 2);

      struct sockaddr_in sin;
      pthread_t th;
      if (pthread_create (&th, NULL, tf, (void *) pthread_self ()) != 0)
        {
          puts ("pthread_create failed");
          return 1;
        }

      int s = socket (AF_INET, SOCK_STREAM, 0);
      int reuse = 1;
      setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
      sin.sin_family = AF_INET;
      sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
      sin.sin_port = htons (PORT);
      bind (s, (struct sockaddr *) &sin, sizeof (sin));
      listen (s, SOMAXCONN);

      pthread_barrier_wait (&b);

      int s2 = paccept (s, NULL, 0, NULL, 0);
      if (s2 < 0)
        {
          puts ("paccept(0) failed");
          return 1;
        }

      int coe = fcntl (s2, F_GETFD);
      if (coe & FD_CLOEXEC)
        {
          puts ("paccept(0) set close-on-exec-flag");
          return 1;
        }
      close (s2);

      pthread_barrier_wait (&b);

      s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC);
      if (s2 < 0)
        {
          puts ("paccept(SOCK_CLOEXEC) failed");
          return 1;
        }

      coe = fcntl (s2, F_GETFD);
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag");
          return 1;
        }
      close (s2);

      pthread_barrier_wait (&b);

      struct sigaction sa;
      sa.sa_handler = handler;
      sa.sa_flags = 0;
      sigemptyset (&sa.sa_mask);
      sigaction (SIGUSR1, &sa, NULL);

      sigset_t ss;
      pthread_sigmask (SIG_SETMASK, NULL, &ss);
      sigaddset (&ss, SIGUSR1);
      pthread_sigmask (SIG_SETMASK, &ss, NULL);

      sigdelset (&ss, SIGUSR1);
      alarm (4);
      pthread_barrier_wait (&b);

      errno = 0;
      s2 = paccept (s, NULL, 0, &ss, 0);
      if (s2 != -1 || errno != EINTR)
        {
          puts ("paccept did not fail with EINTR");
          return 1;
        }

      close (s);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: make it compile]
    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: "David S. Miller"
    Cc: Roland McGrath
    Cc: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • set_type returns an int indicating success or failure, but up to now
    setup_irq ignores that.

    In my case this resulted in a machine hang:

    gpio-keys requested IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, but
    arm/ns9xxx can only trigger on one direction, so set_type didn't touch
    the configuration, which happens to default to level sensitivity, and
    returned -EINVAL. setup_irq ignored that and unmasked the irq. This
    resulted in an endless triggering of the gpio-key interrupt service
    routine which effectively killed the machine.

    With this patch applied setup_irq propagates the error to the caller.

    Note that before in the case

    chip && !chip->set_type && !chip->name

    a NULL pointer was fed to printk. This is fixed, too.
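
    The error propagation described above can be sketched in userspace
    roughly as follows. All mock_* names are hypothetical, heavily
    simplified stand-ins and not the kernel's actual structures; the point
    is only that a failing set_type must stop the irq from being unmasked,
    and that a missing chip name must not reach a %s format.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for an irq chip.  */
struct mock_chip {
  const char *name;                                  /* may be NULL */
  int (*set_type) (unsigned int irq, unsigned int type);
  void (*unmask) (unsigned int irq);
};

/* Example callbacks: a chip that rejects every trigger type, and an
   unmask hook that only counts how often it was called.  */
int mock_unmask_count;

int
mock_reject_type (unsigned int irq, unsigned int type)
{
  (void) irq;
  (void) type;
  return -EINVAL;
}

void
mock_count_unmask (unsigned int irq)
{
  (void) irq;
  mock_unmask_count++;
}

/* Returns 0 on success or the negative errno reported by set_type.  */
int
mock_setup_irq (struct mock_chip *chip, unsigned int irq, unsigned int type)
{
  if (type != 0 && chip->set_type != NULL)
    {
      int ret = chip->set_type (irq, type);
      if (ret != 0)
        return ret;          /* propagate and leave the irq masked */
    }

  chip->unmask (irq);
  return 0;
}

/* Never hand a NULL name to a printf-style %s.  */
const char *
mock_chip_name (const struct mock_chip *chip)
{
  return (chip != NULL && chip->name != NULL) ? chip->name : "unknown";
}
```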

    Signed-off-by: Uwe Kleine-König
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Fix try_to_freeze_tasks()'s use of do_div() on an s64 by making
    elapsed_csecs64 a u64 instead and dividing that.

    Possibly this should be guarded lest the interval calculation turn up
    negative, but the possible negativity of the result of the division is
    cast away anyway.

    This was introduced by patch 438e2ce68dfd4af4cfcec2f873564fb921db4bb5.
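
    For illustration, a userspace sketch of the unsigned division, assuming
    the elapsed interval is available in nanoseconds. sketch_do_div and
    elapsed_csecs are hypothetical names; sketch_do_div only mimics the
    in-place-quotient, returned-remainder contract of the kernel's do_div().

```c
#include <stdint.h>

/* Userspace stand-in for do_div(): divides *n in place and returns the
   remainder.  Like the kernel macro it expects an unsigned 64-bit
   dividend.  */
static inline uint32_t
sketch_do_div (uint64_t *n, uint32_t base)
{
  uint32_t rem = (uint32_t) (*n % base);
  *n /= base;
  return rem;
}

/* Convert an elapsed interval in nanoseconds to centiseconds.  */
unsigned int
elapsed_csecs (uint64_t elapsed_ns)
{
  uint64_t csecs64 = elapsed_ns;

  sketch_do_div (&csecs64, 10000000u);   /* 10^7 ns per centisecond */
  return (unsigned int) csecs64;
}
```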

    Signed-off-by: David Howells
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • schedule sysrq poweroff on boot cpu.

    sysrq poweroff needs to disable nonboot cpus, and we need to run this on boot
    cpu to avoid any recursion. http://bugzilla.kernel.org/show_bug.cgi?id=10897

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Signed-off-by: Zhang Rui
    Tested-by: Rus
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Rui
     
  • This interface allows adding a job on a specific cpu.

    Although a work struct on a cpu will be scheduled to another cpu if the cpu
    dies, there is a recursion if a work task tries to offline the cpu it's
    running on. We need to schedule the task to a specific cpu in this case.
    http://bugzilla.kernel.org/show_bug.cgi?id=10897

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Zhang Rui
    Tested-by: Rus
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Rui
     
  • This patch simplifies the memory bitmap manipulations.

    - remove the member size in struct bm_block

    struct bm_block does not need to store the number of bit chunks, since it
    can be calculated from end_pfn and start_pfn.

    - use find_next_bit() for memory_bm_next_pfn

    No need to invent the bitmap library only for the memory bitmap.
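
    A minimal userspace sketch of both points, assuming a block that covers
    the half-open pfn range [start_pfn, end_pfn). The sketch_* names are
    hypothetical, and sketch_find_next_bit is a naive stand-in for the
    kernel's find_next_bit(), not its real implementation.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof (unsigned long) * CHAR_BIT)

/* Hypothetical simplified block: the bit count is derived from the pfn
   range instead of being stored in a separate size member.  */
struct sketch_bm_block {
  unsigned long start_pfn;   /* first pfn covered */
  unsigned long end_pfn;     /* one past the last pfn covered */
  unsigned long *data;       /* one bit per pfn */
};

static unsigned long
sketch_block_bits (const struct sketch_bm_block *bb)
{
  return bb->end_pfn - bb->start_pfn;
}

/* Index of the first set bit at or after 'offset', or 'size' if none.  */
static unsigned long
sketch_find_next_bit (const unsigned long *addr, unsigned long size,
                      unsigned long offset)
{
  for (unsigned long i = offset; i < size; i++)
    if (addr[i / SKETCH_BITS_PER_WORD] & (1UL << (i % SKETCH_BITS_PER_WORD)))
      return i;
  return size;
}

/* Next set pfn at or after 'pfn' (which must not be below start_pfn),
   or bb->end_pfn when the block is exhausted.  */
unsigned long
sketch_next_pfn (const struct sketch_bm_block *bb, unsigned long pfn)
{
  unsigned long bit = sketch_find_next_bit (bb->data, sketch_block_bits (bb),
                                            pfn - bb->start_pfn);
  return bb->start_pfn + bit;
}
```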

    Signed-off-by: Akinobu Mita
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Boot-time test for system suspend states (STR or standby). The generic
    RTC framework triggers wakeup alarms, which are used to exit those states.

    - Measures some aspects of suspend time ... this uses "jiffies" until
    someone converts it to use a timebase that works properly even while
    timer IRQs are disabled.

    - Triggered by a command line parameter. By default nothing even
    vaguely troublesome will happen, but "test_suspend=mem" will give
    you a brief STR test during system boot. (Or you may need to use
    "test_suspend=standby" instead, if your hardware needs that.)

    This isn't without problems. It fires early enough during boot that for
    example both PCMCIA and MMC stacks have misbehaved. The workaround in
    those cases was to boot without such media cards inserted.

    [matthltc@us.ibm.com: fix compile failure in boot time suspend selftest]
    Signed-off-by: David Brownell
    Cc: Ingo Molnar
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Brownell
     
  • Tell the user about the no_console_suspend option, so that we don't have to
    tell each bug reporter personally.

    [akpm@linux-foundation.org: clarify the text a little]
    Signed-off-by: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • To date, we've tried hard to confine filesystem support for capabilities
    to the security modules. This has left a lot of the code in
    kernel/capability.c in a state where it looks like it supports something
    that filesystem support for capabilities actually suppresses when the LSM
    security/commoncap.c code runs. What is left is a lot of code that uses
    sub-optimal locking in the main kernel.

    With this change we refactor the main kernel code and make it explicit
    which locks are needed and that the only remaining kernel races in this
    area are associated with non-filesystem capability code.

    Signed-off-by: Andrew G. Morgan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew G. Morgan
     
  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates
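
    The array-plus-iterator arrangement listed above might look roughly like
    this in a userspace sketch. Every sketch_* name, the two-entry limit and
    the rule that the first registered entry is the default are assumptions
    made for illustration, not the patch's actual code.

```c
#include <stddef.h>

#define SKETCH_MAX_HSTATE 2

struct sketch_hstate {
  unsigned int order;            /* huge page size is base page << order */
  unsigned long nr_huge_pages;
};

static struct sketch_hstate sketch_hstates[SKETCH_MAX_HSTATE];
static int sketch_max_hstate;    /* number of registered entries */

/* Iterate over all registered hstates.  */
#define sketch_for_each_hstate(h) \
  for ((h) = sketch_hstates; (h) < &sketch_hstates[sketch_max_hstate]; (h)++)

/* Register a new huge page size.  Returns 0, or -1 if the order is
   already registered or the array is full.  */
int
sketch_hugetlb_add_hstate (unsigned int order)
{
  struct sketch_hstate *h;

  sketch_for_each_hstate (h)
    if (h->order == order)
      return -1;
  if (sketch_max_hstate >= SKETCH_MAX_HSTATE)
    return -1;

  h = &sketch_hstates[sketch_max_hstate++];
  h->order = order;
  h->nr_huge_pages = 0;
  return 0;
}

/* In this sketch the first registered hstate plays the role of
   default_hstate.  */
struct sketch_hstate *
sketch_default_hstate (void)
{
  return sketch_max_hstate > 0 ? &sketch_hstates[0] : NULL;
}
```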

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
    a similar manner to the reservations taken for MAP_SHARED mappings. The
    reserve count is accounted both globally and on a per-VMA basis for
    private mappings. This guarantees that a process that successfully calls
    mmap() will successfully fault all pages in the future unless fork() is
    called.

    The behaviour of private mappings of hugetlbfs files after this patch is
    as follows:

    1. The process calling mmap() is guaranteed to succeed all future faults until
    it forks().
    2. On fork(), the parent may die due to SIGKILL on writes to the private
    mapping if enough pages are not available for the COW. For reasonably
    reliable behaviour in the face of a small huge page pool, children of
    hugepage-aware processes should not reference the mappings; such as
    might occur when fork()ing to exec().
    3. On fork(), the child VMAs inherit no reserves. Reads on pages already
    faulted by the parent will succeed. Successful writes will depend on enough
    huge pages being free in the pool.
    4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
    and at fault time otherwise.

    Before this patch, all reads or writes in the child potentially need page
    allocations that can later lead to the death of the parent. This applies
    to reads and writes of uninstantiated pages as well as COW. After the
    patch it is only a write to an instantiated page that causes problems.

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds proper extern declarations for five variables in
    include/linux/vmstat.h

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

24 Jul, 2008

6 commits

  • * 'x86/auditsc' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
    i386 syscall audit fast-path
    x86_64 ia32 syscall audit fast-path
    x86_64 syscall audit fast-path
    x86_64: remove bogus optimization in sysret_signal

    Linus Torvalds
     
  • * 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: hrtick_enabled() should use cpu_active()
    sched, x86: clean up hrtick implementation
    sched: fix build error, provide partition_sched_domains() unconditionally
    sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
    cpu hotplug: Make cpu_active_map synchronization dependency clear
    cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
    sched: rework of "prioritize non-migratable tasks over migratable ones"
    sched: reduce stack size in isolated_cpu_setup()
    Revert parts of "ftrace: do not trace scheduler functions"

    Fixed up conflicts in include/asm-x86/thread_info.h (due to the
    TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
    kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
    introduction).

    Linus Torvalds
     
  • * 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
    NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
    cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
    NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
    NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
    cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
    cpumask: Provide a generic set of CPUMASK_ALLOC macros
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
    cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
    cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
    cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
    Revert "cpumask: introduce new APIs"
    cpumask: make for_each_cpu_mask a bit smaller
    net: Pass reference to cpumask variable in net/sunrpc/svc.c
    ...

    Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually

    Linus Torvalds
     
  • …ernel/git/tip/linux-2.6-tip

    * 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    softlockup: fix invalid proc_handler for softlockup_panic
    softlockup: fix watchdog task wakeup frequency
    softlockup: fix watchdog task wakeup frequency
    softlockup: show irqtrace
    softlockup: print a module list on being stuck
    softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
    softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
    softlockup: fix softlockup_thresh fix
    softlockup: fix softlockup_thresh unaligned access and disable detection at runtime
    softlockup: allow panic on lockup

    Linus Torvalds
     
  • This adds a fast path for 64-bit syscall entry and exit when
    TIF_SYSCALL_AUDIT is set, but no other kind of syscall tracing.
    This path does not need to save and restore all registers as
    the general case of tracing does. Avoiding the iret return path
    when syscall audit is enabled helps performance a lot.

    Signed-off-by: Roland McGrath

    Roland McGrath
     
  • Since 15a647eba94c3da27ccc666bea72e7cca06b2d19 set_irq_wake returned -ENXIO
    if another device had it already enabled. Zero is the right value to
    return in this case. Moreover the change to desc->status was not reverted
    if desc->chip->set_wake returned an error.
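
    A userspace sketch of the intended enable path, under the assumption
    that wakeup users are reference counted per irq descriptor. The sketch_*
    names and the single status bit are hypothetical simplifications.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_IRQ_WAKEUP 0x1u

struct sketch_irq_desc {
  unsigned int status;
  unsigned int wake_depth;                               /* enable count */
  int (*set_wake) (unsigned int irq, unsigned int on);   /* may be NULL */
};

/* Example chip callbacks used to exercise both outcomes.  */
int
sketch_wake_ok (unsigned int irq, unsigned int on)
{
  (void) irq;
  (void) on;
  return 0;
}

int
sketch_wake_fail (unsigned int irq, unsigned int on)
{
  (void) irq;
  (void) on;
  return -EINVAL;
}

/* Enable wakeup for an irq.  Only the first enabler talks to the chip;
   later enablers just bump the count and get 0 back.  */
int
sketch_set_irq_wake_on (struct sketch_irq_desc *desc, unsigned int irq)
{
  int ret = 0;

  if (desc->wake_depth++ == 0)
    {
      ret = desc->set_wake != NULL ? desc->set_wake (irq, 1) : -ENXIO;
      if (ret != 0)
        desc->wake_depth = 0;                 /* undo on failure */
      else
        desc->status |= SKETCH_IRQ_WAKEUP;    /* commit only on success */
    }
  return ret;
}
```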

    Signed-off-by: Uwe Kleine-König
    Acked-by: David Brownell
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     

23 Jul, 2008

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    remove CONFIG_KMOD from core kernel code
    remove CONFIG_KMOD from lib
    remove CONFIG_KMOD from sparc64
    rework try_then_request_module to do less in non-modular kernels
    remove mention of CONFIG_KMOD from documentation
    make CONFIG_KMOD invisible
    modules: Take a shortcut for checking if an address is in a module
    module: turn longs into ints for module sizes
    Shrink struct module: CONFIG_UNUSED_SYMBOLS ifdefs
    module: reorder struct module to save space on 64 bit builds
    module: generic each_symbol iterator function
    module: don't use stop_machine for waiting rmmod

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
    arm: bus_id -> dev_name() and dev_set_name() conversions
    sparc64: fix up bus_id changes in sparc core code
    3c59x: handle pci_name() being const
    MTD: handle pci_name() being const
    HP iLO driver
    sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
    sysdev: Add utility functions for simple int/ulong variable sysdev attributes
    sysdev: Pass the attribute to the low level sysdev show/store function
    driver core: Suppress sysfs warnings for device_rename().
    kobject: Transmit return value of call_usermodehelper() to caller
    sysfs-rules.txt: reword API stability statement
    debugfs: Implement debugfs_remove_recursive()
    HOWTO: change email addresses of James in HOWTO
    always enable FW_LOADER unless EMBEDDED=y
    uio-howto.tmpl: use unique output names
    uio-howto.tmpl: use standard copyright/legal markings
    sysfs: don't call notify_change
    sysdev: fix debugging statements in registration code.
    kobject: should use kobject_put() in kset-example
    kobject: reorder kobject to save space on 64 bit builds
    ...

    Linus Torvalds
     
  • Add missing cond_syscall() entry for compat_sys_epoll_pwait.

    Signed-off-by: Atsushi Nemoto
    Cc: Davide Libenzi
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • Fix wrong domain attr updates, or we will always update the first sched
    domain attr.

    Signed-off-by: Miao Xie
    Cc: Hidetoshi Seto
    Cc: Paul Jackson
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: [2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

22 Jul, 2008

7 commits

  • Always compile request_module when the kernel allows modules.

    Signed-off-by: Johannes Berg
    Signed-off-by: Rusty Russell

    Johannes Berg
     
  • This patch keeps track of the boundaries of module allocation, in
    order to speed up module_text_address().

    Inspired by Arjan's version, which required arch-specific defines:

    Various pieces of the kernel (lockdep, latencytop, etc) tend
    to store backtraces, sometimes at a relatively high
    frequency. In itself this isn't a big performance deal (after
    all you're using diagnostics features), but there have been
    some complaints from people who have over 100 modules loaded
    that this is a tad too slow.

    This is due to the new backtracer code which looks at every
    slot on the stack to see if it's a kernel/module text address,
    so that's 1024 slots. 1024 times 100 modules... that's a lot
    of list walking.
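
    The boundary tracking can be sketched in userspace as below, assuming
    every module allocation is reported once when it is made. The sketch_*
    names are hypothetical. An address outside the recorded range cannot
    belong to any module, so the list walk can be skipped for it; an address
    inside the range still needs the full lookup.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds covering every module allocation seen so far.  */
static uintptr_t sketch_mod_addr_min = UINTPTR_MAX;
static uintptr_t sketch_mod_addr_max = 0;

/* Widen the bounds to include [base, base + size).  */
void
sketch_note_module_alloc (uintptr_t base, uintptr_t size)
{
  if (base < sketch_mod_addr_min)
    sketch_mod_addr_min = base;
  if (base + size > sketch_mod_addr_max)
    sketch_mod_addr_max = base + size;
}

/* Cheap pre-check: false means "definitely not in a module", true means
   "possibly, walk the module list to find out".  */
bool
sketch_may_be_module_address (uintptr_t addr)
{
  return addr >= sketch_mod_addr_min && addr < sketch_mod_addr_max;
}
```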

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • This shrinks module.o and each *.ko file.

    And finally, structure members which hold length of module
    code (four such members there) and count of symbols
    are converted from longs to ints.

    We cannot possibly have a module where 32 bits won't
    be enough to hold such counts.

    For one, module loading checks module size for sanity
    before loading, so such insanely big module will fail
    that test first.
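
    To put a rough number on the saving, here is a sketch comparing two
    hypothetical structures that hold four code lengths and one symbol
    count. The field names are assumptions, and the real struct module has
    many more members.

```c
#include <stddef.h>

/* Hypothetical "before" layout: sizes and counts held in longs.  */
struct sketch_sizes_long {
  unsigned long init_size, core_size, init_text_size, core_text_size;
  unsigned long num_syms;
};

/* Hypothetical "after" layout: the same members as ints.  */
struct sketch_sizes_int {
  unsigned int init_size, core_size, init_text_size, core_text_size;
  unsigned int num_syms;
};

/* Bytes saved per structure; zero on ABIs where long and int have the
   same width.  */
size_t
sketch_bytes_saved (void)
{
  return sizeof (struct sketch_sizes_long) - sizeof (struct sketch_sizes_int);
}
```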

    Signed-off-by: Denys Vlasenko
    Signed-off-by: Rusty Russell

    Denys Vlasenko
     
    module.c and module.h contain code for finding
    exported symbols which are declared with EXPORT_UNUSED_SYMBOL,
    and this code is compiled in even if CONFIG_UNUSED_SYMBOLS is not set
    and thus there can be no EXPORT_UNUSED_SYMBOLs in modules anyway
    (because EXPORT_UNUSED_SYMBOL(x) are compiled out to nothing then).

    This patch adds required #ifdefs.

    Signed-off-by: Denys Vlasenko
    Signed-off-by: Rusty Russell

    Denys Vlasenko
     
  • Introduce an each_symbol() iterator to avoid duplicating the knowledge
    about the 5 different sections containing symbols. Currently only
    used by find_symbol(), but will be used by symbol_put_addr() too.

    (Includes NULL ptr deref fix by Jiri Kosina)
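
    The shape of such an iterator can be sketched in userspace as follows.
    The sketch_* types and the callback signature are hypothetical and only
    show how one loop over a table of sections replaces per-section copies
    of the search code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_symbol {
  const char *name;
  unsigned long value;
};

/* One symbol section, e.g. plain, GPL-only or unused exports.  */
struct sketch_symsearch {
  const struct sketch_symbol *start, *stop;
  bool gpl_only;
};

typedef bool (*sketch_sym_fn) (const struct sketch_symsearch *sec,
                               const struct sketch_symbol *sym, void *data);

/* Call fn for every symbol in every section.  Stops early when fn
   returns true and reports whether that happened.  */
bool
sketch_each_symbol (const struct sketch_symsearch *secs, size_t nsecs,
                    sketch_sym_fn fn, void *data)
{
  for (size_t i = 0; i < nsecs; i++)
    for (const struct sketch_symbol *s = secs[i].start; s < secs[i].stop; s++)
      if (fn (&secs[i], s, data))
        return true;
  return false;
}

struct sketch_find_arg {
  const char *name;
  const struct sketch_symbol *found;
};

static bool
sketch_match (const struct sketch_symsearch *sec,
              const struct sketch_symbol *sym, void *data)
{
  struct sketch_find_arg *arg = data;

  (void) sec;
  if (strcmp (sym->name, arg->name) != 0)
    return false;
  arg->found = sym;
  return true;
}

/* A lookup built on top of the iterator.  */
const struct sketch_symbol *
sketch_find_symbol (const struct sketch_symsearch *secs, size_t nsecs,
                    const char *name)
{
  struct sketch_find_arg arg = { name, NULL };

  return sketch_each_symbol (secs, nsecs, sketch_match, &arg)
         ? arg.found : NULL;
}
```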

    Signed-off-by: Rusty Russell
    Cc: Jiri Kosina

    Rusty Russell
     
  • rmmod has a little-used "-w" option, meaning that instead of failing if the
    module is in use, it should block until the module becomes unused.

    In this case, we don't need to use stop_machine: Max Krasnyansky
    indicated that would be useful for SystemTap which loads/unloads new
    modules frequently.

    Cc: Max Krasnyansky
    Signed-off-by: Rusty Russell

    Rusty Russell
     
    This allows attributes to be generated dynamically and show/store
    functions to be shared between attributes. Right now most attributes are
    generated by special macros, with lots of duplicated code. With the
    attribute passed, it's instead possible to attach some data to the
    attribute and then use that in shared low level functions to do
    different things.

    I need this for the dynamically generated bank attributes in the x86
    machine check code, but it'll allow some further cleanups.
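
    A userspace sketch of the idea, assuming the per-attribute data lives in
    a wrapper structure that embeds the attribute. All sketch_* names are
    hypothetical; the show prototype here is simplified and is not the
    sysdev one.

```c
#include <stddef.h>
#include <stdio.h>

struct sketch_attribute {
  const char *name;
};

/* Wrapper carrying the data a shared show function needs.  */
struct sketch_ext_attribute {
  struct sketch_attribute attr;
  int *var;
};

/* Recover the wrapper from a pointer to its embedded member.  */
#define sketch_container_of(ptr, type, member) \
  ((type *) ((char *) (ptr) - offsetof (type, member)))

/* One show function for any number of int-backed attributes: because the
   attribute itself is passed in, the function can find its own data.  */
int
sketch_show_int (struct sketch_attribute *attr, char *buf, size_t len)
{
  struct sketch_ext_attribute *ea
    = sketch_container_of (attr, struct sketch_ext_attribute, attr);

  return snprintf (buf, len, "%d\n", *ea->var);
}
```

    Two attributes backed by different variables can then share
    sketch_show_int without any per-attribute macro expansion.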

    I converted all users in tree to the new show/store prototype. It's a single
    huge patch to avoid unbisectable sections.

    Runtime tested: x86-32, x86-64
    Compiled only: ia64, powerpc
    Not compile tested/only grep converted: sh, arm, avr32

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

20 Jul, 2008

3 commits

  • Peter pointed out that hrtick_enabled() should use cpu_active().

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Ingo Molnar
     
  • random uvesafb failures were reported against Gentoo:

    http://bugs.gentoo.org/show_bug.cgi?id=222799

    and Mihai Moldovan bisected it back to:

    > 8f4d37ec073c17e2d4aa8851df5837d798606d6f is first bad commit
    > commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
    > Author: Peter Zijlstra
    > Date: Fri Jan 25 21:08:29 2008 +0100
    >
    > sched: high-res preemption tick

    Linus suspected it to be hrtick + vm86 interaction and observed:

    > Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
    > _incorrect_ per se, but they are definitely bad.
    >
    > Why?
    >
    > Using random _TIF_WORK_MASK flags is really impolite for doing
    > "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
    > special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
    > vm86 mode unnecessarily.
    >
    > See the "work_notifysig_v86" label, and how it does that
    > "save_v86_state()" thing etc etc.

    Right, I never liked having to fiddle with those TIF flags. Initially I
    needed it because the hrtimer base lock could not nest in the rq lock.
    That however is fixed these days.

    Currently the only reason left to fiddle with the TIF flags is remote
    wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
    about using the new and improved IPI function call stuff to implement
    hrtimer_start_on().

    However that does require that smp_call_function_single(.wait=0) works
    from interrupt context - /me looks at the latest series from Jens - Yes
    that does seem to be supported, good.

    Here's a stab at cleaning this stuff up ...

    Mihai reported test success as well.

    Signed-off-by: Peter Zijlstra
    Tested-by: Mihai Moldovan
    Cc: Michal Januszewski
    Cc: Antonino Daplas
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jul, 2008

4 commits

  • * Optimize various places where a pointer to the cpumask_of_cpu value
    will result in reducing stack pressure.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * This patch replaces the dangerous lvalue version of cpumask_of_cpu
    with new cpumask_of_cpu_ptr macros. These are patterned after the
    node_to_cpumask_ptr macros.

    In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
    the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map
    is provided when there is a large NR_CPUS count, reducing
    greatly the amount of code generated and stack space used for
    cpumask_of_cpu(). The pointer to the cpumask_t value is needed for
    calling set_cpus_allowed_ptr() to reduce the amount of stack space
    needed to pass the cpumask_t value.

    If there isn't a cpumask_of_cpu_map[], then a temporary variable is
    declared and filled in with value from cpumask_of_cpu(cpu) as well as
    a pointer variable pointing to this temporary variable. Afterwards,
    the pointer is used to reference the cpumask value. The compiler
    will optimize out the extra dereference through the pointer as well
    as the stack space used for the pointer, resulting in identical code.

    A good example of the orthogonal usages is in net/sunrpc/svc.c:

        case SVC_POOL_PERCPU:
        {
            unsigned int cpu = m->pool_to[pidx];
            cpumask_of_cpu_ptr(cpumask, cpu);

            *oldmask = current->cpus_allowed;
            set_cpus_allowed_ptr(current, cpumask);
            return 1;
        }
        case SVC_POOL_PERNODE:
        {
            unsigned int node = m->pool_to[pidx];
            node_to_cpumask_ptr(nodecpumask, node);

            *oldmask = current->cpus_allowed;
            set_cpus_allowed_ptr(current, nodecpumask);
            return 1;
        }

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Conflicts:

    drivers/acpi/processor_throttling.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The type of softlockup_panic is int, but the proc_handler is
    proc_doulongvec_minmax(). This handler is for unsigned long.

    This handler should be proc_dointvec_minmax().

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     

18 Jul, 2008

4 commits

  • Fix inc_rt_tasks() to not declare variable 'rq' if it's not needed. It is
    declared if CONFIG_SMP or CONFIG_RT_GROUP_SCHED, but only used if CONFIG_SMP.

    This is a consequence of patch 1f11eb6a8bc92536d9e93ead48fa3ffbd1478571 plus
    patch 1100ac91b6af02d8639d518fad5b434b1bf44ed6.

    Signed-off-by: David Howells
    Signed-off-by: Ingo Molnar

    David Howells
     
  • This goes on top of the cpu_active_map (take 2) patch.

    Currently we depend on stop_machine to provide the necessary
    synchronization for the cpu_active_map updates.
    As Dmitry Adamushko pointed out, this is fragile and is not much clearer
    than the previous scheme. In other words, we do not want to depend on
    the internal stop_machine operation here.
    So make the synchronization rules clear by doing synchronize_sched()
    after clearing the cpu active bit.

    Tested on quad-Core2 with:

        while true; do
            for i in 1 2 3; do
                echo 0 > /sys/devices/system/cpu/cpu$i/online
            done
            for i in 1 2 3; do
                echo 1 > /sys/devices/system/cpu/cpu$i/online
            done
        done
    and
        stress -c 200

    No lockdep, preempt or other complaints.

    Signed-off-by: Max Krasnyansky
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     
  • This is based on Linus' idea of creating cpu_active_map that prevents
    scheduler load balancer from migrating tasks to the cpu that is going
    down.

    It allows us to simplify domain management code and avoid unnecessary
    domain rebuilds during cpu hotplug event handling.

    Please ignore the cpusets part for now. It needs some more work in order
    to avoid crazy lock nesting. However, I did simplify and unify the domain
    reinitialization logic. We now simply call partition_sched_domains() in
    all the cases. This means that we're using exactly the same code paths as
    in the cpusets case and hence the tests below cover cpusets too.
    Cpuset changes to make rebuild_sched_domains() callable from various
    contexts are in a separate patch (right after this one).

    This not only boots but also easily handles
        while true; do make clean; make -j 8; done
    and
        while true; do on-off-cpu 1; done
    at the same time.
    (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

    Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
    this on it right now in gnome-terminal and things are moving just fine.

    Also this is running with most of the debug features enabled (lockdep,
    mutex, etc), with no BUG_ONs or lockdep complaints so far.

    I believe I addressed all of Dmitry's comments on Linus' original
    version. I changed both the fair and rt balancers to mask out non-active
    cpus, and replaced cpu_is_offline() with !cpu_active() in the main
    scheduler code where it made sense (to me).

    Signed-off-by: Max Krasnyanskiy
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    Cc: dmitry.adamushko@gmail.com
    Cc: pj@sgi.com
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     
  • (1) handle in a generic way all cases when a newly woken-up task is
    not migratable (not just a corner case when "rt_se->nr_cpus_allowed ==
    1")

    (2) if current is to be preempted, then make sure "p" will be picked
    up by pick_next_task_rt(),
    i.e. move the task's group to the head of its list as well.

    Currently, this is not the case for group scheduling, as described
    here: http://www.ussg.iu.edu/hypermail/linux/kernel/0807.0/0134.html

    Signed-off-by: Dmitry Adamushko
    Cc: Steven Rostedt
    Cc: Gregory Haskins
    Signed-off-by: Ingo Molnar

    Dmitry Adamushko