08 Feb, 2008

9 commits

  • Change the interface to use bytes instead of pages. Page sizes can vary
    across platforms and configurations. A new strategy routine has been added
    to the resource counters infrastructure to format the data as desired.

    Suggested by David Rientjes, Andrew Morton and Herbert Poetzl

    Tested on a UML setup with the config for memory control enabled.

    [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
    Signed-off-by: Balbir Singh
    Signed-off-by: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Basic setup routines, the mm_struct has a pointer to the cgroup that
    it belongs to and the the page has a page_cgroup associated with it.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • With fixes from David Rientjes

    Introduce generic structures and routines for resource accounting.

    Each resource accounting cgroup is supposed to aggregate it,
    cgroup_subsystem_state and its resource-specific members within.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: David Rientjes
    Cc: Pavel Emelianov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • cgroup_is_releasable() and notify_on_release() should be static,
    not global inline.

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Move the calls to the cgroup subsystem destroy() methods from
    cgroup_rmdir() to cgroup_diput(). This allows control file reads and
    writes to access their subsystem state without having to be concerned with
    locking against cgroup destruction - the control file dentry will keep the
    cgroup and its subsystem state objects alive until the file is closed.

    The documentation is updated to reflect the changed semantics of destroy();
    additionally the locking comments for destroy() and some other methods were
    clarified and decrustified.

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Simplify the space stripping code in cgroup file write.

    [akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
    Signed-off-by: Paul Jackson
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Coding style fix - one line conditionals don't get braces.

    Signed-off-by: Paul Jackson
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch removes dead code spotted by the Coverity checker
    (look at the "(nbytes >= PATH_MAX)" check).

    Signed-off-by: Adrian Bunk
    Cc: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The gl_owner_pid field is used to get the lock owning task by its pid, so make
    it in a proper manner, i.e. by using the struct pid pointer and pid_task()
    function.

    The pid_task() becomes exported for the gfs2 module.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

07 Feb, 2008

19 commits

  • Provide support to add an optional user defined callback to be run at
    function entry of a kretprobe'd function. Also modify the kprobe smoke
    tests to include an entry-handler during the kretprobe sanity test.

    Signed-off-by: Abhishek Sagar
    Cc: Prasanna S Panchamukhi
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Acked-by: Jim Keniston
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Abhishek Sagar
     
  • Avoid calling do_div(x, 1) in this case.

    Cc: David Fries
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The kernel has a divide by zero crash when trying to run the system timer
    less than 100Hz. The problem is x/(HZ/USER_HZ) and related. Now
    x*(USER_HZ/HZ) will be used if HZ
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Fries
     
  • groups_sort() can be quite long if user loads a large gid table.

    This is because GROUP_AT(group_info, some_integer) uses an integer divide.
    So having to do XXX thousand divides during one syscall can lead to very
    high latencies. (NGROUPS_MAX=65536)

    In the past (25 Mar 2006), an analog problem was found in groups_search()
    (commit d74beb9f33a5f16d2965f11b275e401f225c949d ) and at that time I
    changed some variables to unsigned int.

    I believe that a more generic fix is to make sure NGROUPS_PER_BLOCK is
    unsigned.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Fix the following section mismatch with CONFIG_HOTPLUG=n,
    CONFIG_HOTPLUG_CPU=y:

    WARNING: vmlinux.o(.text+0x399a6): Section mismatch: reference to .init.text.5:idle_regs (between 'fork_idle' and 'get_task_mm')

    Signed-off-by: Adrian Bunk
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Fixing:
    CHECK kernel/params.c
    kernel/params.c:329:41: warning: incorrect type in argument 8 (different signedness)
    kernel/params.c:329:41: expected int *num
    kernel/params.c:329:41: got unsigned int *

    Signed-off-by: Richard Knutsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Knutsson
     
  • [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Daniel Walker
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Walker
     
  • This adds support to allow asm/ptrace.h to define two new macros,
    arch_ptrace_stop_needed and arch_ptrace_stop. These control special
    machine-specific actions to be done before a ptrace stop. The new code
    compiles away to nothing when the new macros are not defined. This is the
    case on all machines to begin with.

    On ia64, these macros will be defined to solve the long-standing issue of
    ptrace vs register backing store.

    Signed-off-by: Roland McGrath
    Cc: Petr Tesarik
    Cc: Tony Luck
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Convert relay from nopage to fault.
    Remove redundant vma range checks.
    Switch from OOM to SIGBUS if the resource is not available.

    Signed-off-by: Nick Piggin
    Cc: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • NR_OPEN (historically set to 1024*1024) actually forbids processes to open
    more than 1024*1024 handles.

    Unfortunatly some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because of vmalloc space potential
    exhaust.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
    1024*1024, so that admins can decide to change this limit if their workload
    needs it.

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Minor cleanup. We can remove one "else if" branch.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Stop using unsigned _longs_ for printk buffer indexes. Log buffer is way
    smaller than 2 gigabytes and unsigned ints will work too . Indeed, they do
    work nicely on all 32-bit platforms where longs and ints are the same.

    With this patch, we have following size savings on amd64:

    text data bss dec hex filename
    5997 313 17736 24046 5dee 2.6.23.1.t64/kernel/printk.o
    5858 313 17700 23871 5d3f 2.6.23.1.printk.t64/kernel/printk.o

    Signed-off-by: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • I found that there is a buffer overflow problem in the following code.

    Version: 2.6.24-rc2,
    File: kernel/time/clocksource.c:417-432
    --------------------------------------------------------------------
    static ssize_t
    sysfs_show_available_clocksources(struct sys_device *dev, char *buf)
    {
    struct clocksource *src;
    char *curr = buf;

    spin_lock_irq(&clocksource_lock);
    list_for_each_entry(src, &clocksource_list, list) {
    curr += sprintf(curr, "%s ", src->name);
    }
    spin_unlock_irq(&clocksource_lock);

    curr += sprintf(curr, "\n");

    return curr - buf;
    }
    -----------------------------------------------------------------------

    sysfs_show_current_clocksources() also has the same problem though in practice
    the size of current clocksource's name won't exceed PAGE_SIZE.

    I fix the bug by using snprintf according to the specification of the kernel
    (Version:2.6.24-rc2,File:Documentation/filesystems/sysfs.txt)

    Fix sysfs_show_available_clocksources() and sysfs_show_current_clocksources()
    buffer overflow problem with snprintf().

    Signed-off-by: Miao Xie
    Cc: WANG Cong
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Every file should include the headers containing the prototypes for its global
    functions (in this case {,un}register_reboot_notifier()).

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Make the needlessly global srcu_readers_active() static.

    Signed-off-by: Adrian Bunk
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Every file should include the headers containing the prototypes for its global
    functions (in this case sys_ptrace()).

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When passing a zero address to kallsyms_lookup(), the kernel thought it was
    a valid kernel address, even if it is not. This is because is_ksym_addr()
    called is_kernel_extratext() and checked against labels that don't exist on
    many archs (which default as zero). Since PPC was the only kernel which
    defines _extra_text, (in 2005), and no longer needs it, this patch removes
    _extra_text support.

    For some history (provided by Jon):
    http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019734.html
    http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019736.html
    http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019751.html

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Robin Getz
    Cc: David Woodhouse
    Cc: Jon Loeliger
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Getz
     
  • 1. It is much easier to grep for ->state change if __set_task_state() is used
    instead of the direct assignment.

    2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
    unneeded mb() (btw even if we use mb() it is still possible that do_wait()
    sees the new ->state but not ->exit_code, but this is ok).

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This moves the ability to scale cputime into generic code. This allows us
    to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
    add an unscaled value to the scaled utime/stime.

    This adds a cputime_to_scaled function. As before, the POWERPC version
    does the scaling based on the last SPURR/PURR ratio calculated. The
    generic and s390 (only other arch to implement asm/cputime.h) versions are
    both NOPs.

    Also moves the SPURR and PURR snapshots closer.

    Signed-off-by: Michael Neuling
    Cc: Jay Lan
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     

06 Feb, 2008

12 commits

  • Replace latency.c use with pm_qos_params use.

    Signed-off-by: mark gross
    Cc: "John W. Linville"
    Cc: Len Brown
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Gross
     
  • The following patch is a generalization of the latency.c implementation done
    by Arjan last year. It provides infrastructure for more than one parameter,
    and exposes a user mode interface for processes to register pm_qos
    expectations of processes.

    This interface provides a kernel and user mode interface for registering
    performance expectations by drivers, subsystems and user space applications on
    one of the parameters.

    Currently we have {cpu_dma_latency, network_latency, network_throughput} as
    the initial set of pm_qos parameters.

    The infrastructure exposes multiple misc device nodes one per implemented
    parameter. The set of parameters implement is defined by pm_qos_power_init()
    and pm_qos_params.h. This is done because having the available parameters
    being runtime configurable or changeable from a driver was seen as too easy to
    abuse.

    For each parameter a list of performance requirements is maintained along with
    an aggregated target value. The aggregated target value is updated with
    changes to the requirement list or elements of the list. Typically the
    aggregated target value is simply the max or min of the requirement values
    held in the parameter list elements.

    >From kernel mode the use of this interface is simple:

    pm_qos_add_requirement(param_id, name, target_value):

    Will insert a named element in the list for that identified PM_QOS
    parameter with the target value. Upon change to this list the new target is
    recomputed and any registered notifiers are called only if the target value
    is now different.

    pm_qos_update_requirement(param_id, name, new_target_value):

    Will search the list identified by the param_id for the named list element
    and then update its target value, calling the notification tree if the
    aggregated target is changed. with that name is already registered.

    pm_qos_remove_requirement(param_id, name):

    Will search the identified list for the named element and remove it, after
    removal it will update the aggregate target and call the notification tree
    if the target was changed as a result of removing the named requirement.

    >From user mode:

    Only processes can register a pm_qos requirement. To provide for
    automatic cleanup for process the interface requires the process to register
    its parameter requirements in the following way:

    To register the default pm_qos target for the specific parameter, the
    process must open one of /dev/[cpu_dma_latency, network_latency,
    network_throughput]

    As long as the device node is held open that process has a registered
    requirement on the parameter. The name of the requirement is
    "process_" derived from the current->pid from within the open system
    call.

    To change the requested target value the process needs to write a s32
    value to the open device node. This translates to a
    pm_qos_update_requirement call.

    To remove the user mode request for a target value simply close the device
    node.

    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: fix build again]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: mark gross
    Cc: "John W. Linville"
    Cc: Len Brown
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Cc: Arjan van de Ven
    Cc: Venki Pallipadi
    Cc: Adam Belay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Gross
     
  • kernel_shutdown_prepare() can now become static.

    Signed-off-by: Adrian Bunk
    Acked-by: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • resume_file[] and create_image() can become static.

    Signed-off-by: Adrian Bunk
    Acked-by: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from parent. Noone can add elements,
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regain CAP_MKNOD, you would do

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
    printf("Usage: %s get\n", me);
    printf(" %s drop \n", me);
    return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
    "cap_chown",
    "cap_dac_override",
    "cap_dac_read_search",
    "cap_fowner",
    "cap_fsetid",
    "cap_kill",
    "cap_setgid",
    "cap_setuid",
    "cap_setpcap",
    "cap_linux_immutable",
    "cap_net_bind_service",
    "cap_net_broadcast",
    "cap_net_admin",
    "cap_net_raw",
    "cap_ipc_lock",
    "cap_ipc_owner",
    "cap_sys_module",
    "cap_sys_rawio",
    "cap_sys_chroot",
    "cap_sys_ptrace",
    "cap_sys_pacct",
    "cap_sys_admin",
    "cap_sys_boot",
    "cap_sys_nice",
    "cap_sys_resource",
    "cap_sys_time",
    "cap_sys_tty_config",
    "cap_mknod",
    "cap_lease",
    "cap_audit_write",
    "cap_audit_control",
    "cap_setfcap"
    };

    int getbcap(void)
    {
    int comma=0;
    unsigned long i;
    int ret;

    printf("i know of %d capabilities\n", numcaps);
    printf("capability bounding set:");
    for (i=0; i< 0)
    perror("prctl");
    else if (ret==1)
    printf("%s%s", (comma++) ? ", " : " ", captable[i]);
    }
    printf("\n");
    return 0;
    }

    int capdrop(char *str)
    {
    unsigned long i;

    int found=0;
    for (i=0; i
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler a
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The patch supports legacy (32-bit) capability userspace, and where possible
    translates 32-bit capabilities to/from userspace and the VFS to 64-bit
    kernel space capabilities. If a capability set cannot be compressed into
    32-bits for consumption by user space, the system call fails, with -ERANGE.

    FWIW libcap-2.00 supports this change (and earlier capability formats)

    http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

    [akpm@linux-foundation.org: coding-syle fixes]
    [akpm@linux-foundation.org: use get_task_comm()]
    [ezk@cs.sunysb.edu: build fix]
    [akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
    [akpm@linux-foundation.org: unused var]
    [serue@us.ibm.com: export __cap_ symbols]
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Acked-by: Serge Hallyn
    Cc: Chris Wright
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     
  • Add vm.highmem_is_dirtyable toggle

    A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2Gb size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     
  • (with Martin Schwidefsky )

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    first argument. The free functions do not get the mm_struct argument. This
    is 1) asymmetrical and 2) to do mm related page table allocations the mm
    argument is needed on the free function as well.

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-syle fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • - Add comments explaing how drain_pages() works.

    - Eliminate useless functions

    - Rename drain_all_local_pages to drain_all_pages(). It does drain
    all pages not only those of the local processor.

    - Eliminate useless interrupt off / on sequences. drain_pages()
    disables interrupts on its own. The execution thread is
    pinned to processor by the caller. So there is no need to
    disable interrupts.

    - Put drain_all_pages() declaration in gfp.h and remove the
    declarations from suspend.h and from mm/memory_hotplug.c

    - Make software suspend call drain_all_pages(). The draining
    of processor local pages is may not the right approach if
    software suspend wants to support SMP. If they call drain_all_pages
    then we can make drain_pages() static.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Acked-by: Mel Gorman
    Cc: "Rafael J. Wysocki"
    Cc: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is the new timerfd API as it is implemented by the following patch:

    int timerfd_create(int clockid, int flags);
    int timerfd_settime(int ufd, int flags,
    const struct itimerspec *utmr,
    struct itimerspec *otmr);
    int timerfd_gettime(int ufd, struct itimerspec *otmr);

    The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
    parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.

    The timerfd_settime() API give new settings by the timerfd fd, by optionally
    retrieving the previous expiration time (in case the "otmr" parameter is not
    NULL).

    The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
    is set in the "flags" parameter. Otherwise it's a relative time.

    The timerfd_gettime() API returns the next expiration time of the timer, or
    {0, 0} if the timerfd has not been set yet.

    Like the previous timerfd API implementation, read(2) and poll(2) are
    supported (with the same interface). Here's a simple test program I used to
    exercise the new timerfd APIs:

    http://www.xmailserver.org/timerfd-test2.c

    [akpm@linux-foundation.org: coding-style cleanups]
    [akpm@linux-foundation.org: fix ia64 build]
    [akpm@linux-foundation.org: fix m68k build]
    [akpm@linux-foundation.org: fix mips build]
    [akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
    [heiko.carstens@de.ibm.com: fix s390]
    [akpm@linux-foundation.org: fix powerpc build]
    [akpm@linux-foundation.org: fix sparc64 more]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Thomas Gleixner
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • As Roland pointed out, we have the very old problem with exec. de_thread()
    sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
    clears signal->flags. All signals (even fatal ones) sent in this window
    (which is not too small) will be lost.

    With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
    the new helper, should be used to detect exit_group() or exec() in progress.
    It can have more users, but this patch does only strictly necessary changes.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Robin Holt
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Every time we set SIGNAL_GROUP_EXIT or clear SIGNAL_STOP_DEQUEUED we also
    reset ->group_stop_count.

    This means that the SIGNAL_GROUP_EXIT check in handle_group_stop() is not
    needed, and do_signal_stop() should check SIGNAL_STOP_DEQUEUED only when
    ->group_stop_count == 0. With these changes handle_group_stop() becomes the
    subset of do_signal_stop(), we can kill it and use do_signal_stop() instead.

    Also, a preparation for the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Robin Holt
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov