18 Aug, 2010

1 commit

  • Make do_execve() take a const filename pointer so that kernel_execve() compiles
    correctly on ARM:

    arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type

    This also requires the argv and envp arguments to be consted twice, once for
    the pointer array and once for the strings the array points to. This is
    because do_execve() passes a pointer to the filename (now const) to
    copy_strings_kernel(). A simpler alternative would be to cast the filename
    pointer in do_execve() when it's passed to copy_strings_kernel().

    do_execve() may not change any of the strings it is passed as part of the argv
    or envp lists, as some of them are in .rodata, so marking these strings as
    const should be fine.

    Further, kernel_execve() and sys_execve() need to be changed to match.

    This has been test built on x86_64, frv, arm and mips.

    Signed-off-by: David Howells
    Tested-by: Ralf Baechle
    Acked-by: Russell King
    Signed-off-by: Linus Torvalds

    David Howells
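The "consted twice" wording in the commit above can be illustrated in plain userspace C. This is only a sketch: count_args() and example_argv are hypothetical names, not kernel code. The type `const char *const *` promises to modify neither the pointer array nor the strings it points to, so an argument vector built from string literals in .rodata can be passed safely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function). The parameter is const
 * twice: the array slots cannot be reassigned through argv, and the
 * characters of each string cannot be modified either.
 */
static size_t count_args(const char *const *argv)
{
	size_t n = 0;

	assert(argv != NULL);
	while (argv[n] != NULL)
		n++;
	/* argv[0] = NULL;   would not compile: the array is const    */
	/* argv[0][0] = 'x'; would not compile: the strings are const */
	return n;
}

/* A read-only argument vector; the literals may live in .rodata. */
static const char *const example_argv[] = { "/sbin/init", "--verbose", NULL };
```

Calling count_args(example_argv) returns the number of entries before the terminating NULL without needing any cast.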
     

14 Aug, 2010

1 commit

  • Early 4.3 versions of gcc apparently aggressively optimize the raw
    time accumulation loop, replacing it with a divide.

    On 32bit systems, this causes the following link errors:
    undefined reference to `__umoddi3'
    undefined reference to `__udivdi3'

    The gcc issue has been fixed in 4.4 and greater.

    This patch replaces the accumulation loop with a do_div, as suggested
    by Linus.

    Signed-off-by: John Stultz
    CC: Jason Wessel
    CC: Larry Finger
    CC: Ingo Molnar
    CC: Linus Torvalds
    Signed-off-by: Linus Torvalds

    John Stultz
     

13 Aug, 2010

4 commits

  • This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the
    accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay
    the final work in fput" that was a horribly ugly hack to make it work at
    all).

    The 'struct file' approach not only causes that disgusting hack, it
    somehow breaks pulseaudio, probably due to some other subtlety with
    f_count handling.

    Fix up various conflicts due to later fsnotify work.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (22 commits)
    param: don't deref arg in __same_type() checks
    param: update drivers/acpi/debug.c to new scheme
    param: use module_param in drivers/message/fusion/mptbase.c
    ide: use module_param_named rather than module_param_call
    param: update drivers/char/ipmi/ipmi_watchdog.c to new scheme
    param: lock if_sdio's lbs_helper_name and lbs_fw_name against sysfs changes.
    param: lock myri10ge_fw_name against sysfs changes.
    param: simple locking for sysfs-writable charp parameters
    param: remove unnecessary writable charp
    param: add kerneldoc to moduleparam.h
    param: locking for kernel parameters
    param: make param sections const.
    param: use free hook for charp (fix leak of charp parameters)
    param: add a free hook to kernel_param_ops.
    param: silence .init.text references from param ops
    Add param ops struct for hvc_iucv driver.
    nfs: update for module_param_named API change
    AppArmor: update for module_param_named API change
    param: use ops in struct kernel_param, rather than get and set fns directly
    param: move the EXPORT_SYMBOL to after the definitions.
    ...

    Linus Torvalds
     
  • The tv_nsec is a long and when added to the shifted interval it can wrap
    and become negative which later causes looping problems in the
    getrawmonotonic(). The edge case occurs when the system has slept for
    a short period of time of ~2 seconds.

    A trace printk of the values in this patch illustrates the problem:

    ftrace time stamp: log
    43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa
    43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd
    43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0
    46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec e1f9ae3
    46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3

    The kernel starts looping at 46.349925 in the getrawmonotonic() due to
    the negative value from adding the raw value to tv_nsec.

    A simple solution is to accumulate into a u64, and then normalize it
    to a timespec_t.

    Signed-off-by: Jason Wessel
    [ Reworked variable names and simplified some of the code. - John ]
    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Signed-off-by: Linus Torvalds

    Jason Wessel
     
  • Add a dummy printk function for the maintenance of unused printks through gcc
    format checking, and also so that side-effect checking is maintained too.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

12 Aug, 2010

3 commits

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter
     
  • The current kfifo scatterlist implementation will not work with chained
    scatterlists. It assumes that struct scatterlist arrays are allocated
    contiguously, which is not the case when chained scatterlists (struct
    sg_table) are in use.

    Signed-off-by: Stefani Seibold
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    isofs: Fix lseek() to position beyond 4 GB
    vfs: remove unused MNT_STRICTATIME
    vfs: show unreachable paths in getcwd and proc
    vfs: only add " (deleted)" where necessary
    vfs: add prepend_path() helper
    vfs: __d_path: dont prepend the name of the root dentry
    ia64: perfmon: add d_dname method
    vfs: add helpers to get root and pwd
    cachefiles: use path_get instead of lone dget
    fs/sysv/super.c: add support for non-PDP11 v7 filesystems
    V7: Adjust sanity checks for some volumes
    Add v7 alias
    v9fs: fixup for inode_setattr being removed

    Manual merge to take Al's version of the fs/sysv/super.c file: it merged
    cleanly, but Al had removed an unnecessary header include, so his side
    was better.

    Linus Torvalds
     

11 Aug, 2010

22 commits

  • Simply replace the whole kfifo.c and kfifo.h files with the new generic
    version and fix the kerneldoc API template file.

    Signed-off-by: Stefani Seibold
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Add the new version of the kfifo API files kfifo.c and kfifo.h.

    Signed-off-by: Stefani Seibold
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • copy_to/from_user() returns the number of bytes remaining to be copied.
    It never returns a negative value. The correct return code is -EFAULT and
    not -EIO.

    All the callers check for non-zero returns so that's Ok, but the return
    code is passed to the user so we should fix this.

    Signed-off-by: Dan Carpenter
    Cc: Hidetoshi Seto
    Cc: "Paul E. McKenney"
    Cc: "Eric W. Biederman"
    Cc: Simon Kagstrom
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
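The return convention described above can be sketched in userspace. fake_copy_to_user() is a hypothetical stand-in that, like copy_to_user(), returns the number of bytes it could not copy; the `writable` parameter is an invention of this sketch to simulate a partially bad destination. The caller turns any non-zero result into -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for copy_to_user(): copies at most `writable`
 * bytes and returns how many bytes could NOT be copied (never negative).
 */
static size_t fake_copy_to_user(void *dst, const void *src, size_t n,
				size_t writable)
{
	size_t done = n < writable ? n : writable;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, done);
	return n - done;
}

/* Translate "bytes left over" into the errno-style code callers expect. */
static int export_value(void *dst, const void *src, size_t n, size_t writable)
{
	if (fake_copy_to_user(dst, src, n, writable) != 0)
		return -EFAULT;	/* a bad user address, not an I/O error */
	return 0;
}
```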
     
  • We are missing the oops end marker for the exception based WARN implementation
    in lib/bug.c. This is useful for logfile analysis tools.

    Signed-off-by: Anton Blanchard
    Cc: Ingo Molnar
    Cc: Arjan van de Ven
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
    To keep panic_timeout accurate when running under a hypervisor, the
    current implementation spins only on long (1 second) calls to mdelay().
    That works well, but the keyboard LEDs don't blink at all in that
    situation.

    This patch calls panic_blink_enter() between every mdelay() so that
    blinking continues even in long spin timer mode.

    Each mdelay() call is now 100ms. Even with this change, panic_timeout
    stays accurate enough when running under a hypervisor.

    Signed-off-by: TAMUKI Shoichi
    Cc: Ben Dooks
    Cc: Russell King
    Acked-by: Dmitry Torokhov
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    TAMUKI Shoichi
     
  • alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
    inspect the first map->page twice. This is correct, we want to find the
    unused bits < offset in this bitmap block. Add the comment.

    But it doesn't make any sense to stop the find_next_offset() loop when we
    are looking into this map->page for the second time. We have already
    checked the bits >= offset during the first attempt, but it is fine to
    do this again, no matter if we succeed this time or not.

    Remove this hard-to-understand code. It optimizes the very unlikely case
    when we are going to fail, but slows down the more likely case.

    Signed-off-by: Oleg Nesterov
    Cc: Salman Qazi
    Cc: Ingo Molnar
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A program that repeatedly forks and waits is susceptible to having the
    same pid repeated, especially when it competes with another instance of
    the same program. This is really bad for bash implementation.
    Furthermore, many shell scripts assume that pid numbers will not be used
    for some length of time.

    Race Description:

    A                                    B

    // pid == offset == n                // pid == offset == n + 1
    test_and_set_bit(offset, map->page)
                                         test_and_set_bit(offset, map->page);
                                         pid_ns->last_pid = pid;
    pid_ns->last_pid = pid;
                                         // pid == n + 1 is freed (wait())

                                         // Next fork()...
                                         last = pid_ns->last_pid; // == n
                                         pid = last + 1;

    Code to reproduce it (Running multiple instances is more effective):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // The distance mod 32768 between two pids, where the first pid is expected
    // to be smaller than the second.
    int PidDistance(pid_t first, pid_t second) {
      return (second + 32768 - first) % 32768;
    }

    int main(int argc, char* argv[]) {
      int failed = 0;
      pid_t last_pid = 0;
      int i;
      printf("%zu\n", sizeof(pid_t));
      for (i = 0; i < 10000000; ++i) {
        if (i % 32768 == 0)
          printf("Iter: %d\n", i/32768);
        int child_exit_code = i % 256;
        pid_t pid = fork();
        if (pid == -1) {
          fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
          exit(1);
        }
        if (pid == 0) {
          // Child
          exit(child_exit_code);
        } else {
          // Parent
          if (i > 0) {
            int distance = PidDistance(last_pid, pid);
            if (distance == 0 || distance > 30000) {
              fprintf(stderr,
                      "Unexpected pid sequence: previous fork: pid=%d, "
                      "current fork: pid=%d for iteration=%d.\n",
                      last_pid, pid, i);
              failed = 1;
            }
          }
          last_pid = pid;
          int status;
          int reaped = wait(&status);
          if (reaped != pid) {
            fprintf(stderr,
                    "Wait return value: expected pid=%d, "
                    "got %d, iteration %d\n",
                    pid, reaped, i);
            failed = 1;
          } else if (WEXITSTATUS(status) != child_exit_code) {
            fprintf(stderr,
                    "Unexpected exit status %x, iteration %d\n",
                    WEXITSTATUS(status), i);
            failed = 1;
          }
        }
      }
      exit(failed);
    }

    Thanks to Ted Tso for the key ideas of this implementation.

    Signed-off-by: Salman Qazi
    Cc: Ingo Molnar
    Cc: Theodore Ts'o
    Cc: Peter Zijlstra
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salman
     
  • exit_ptrace() takes tasklist_lock unconditionally. We need this lock to
    avoid the race with ptrace_traceme(), it acts as a barrier.

    Change its caller, forget_original_parent(), to call exit_ptrace() under
    tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if
    needed.

    This allows us to add the fastpath list_empty(ptraced) check. In the
    likely no-tracees case exit_ptrace() just returns and we avoid the lock()
    + unlock() sequence.

    "Zhang, Yanmin" suggested adding this check, and he reports that this
    change gives about an 11% improvement in some tests.

    Suggested-and-tested-by: "Zhang, Yanmin"
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The original code didn't leave enough space for a NULL terminator. These
    strings are copied with strcpy() into fixed length buffers in
    cgroup_root_from_opts().

    Signed-off-by: Dan Carpenter
    Acked-by: Serge E. Hallyn
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Ben Blum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
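The off-by-one described above is easy to show in isolation. In this sketch (hypothetical names and buffer size, not the cgroup code itself) the length check uses ">=" because a string of N characters needs N + 1 bytes once the terminating NUL is counted.

```c
#include <assert.h>
#include <string.h>

#define NAME_MAX_SKETCH 8	/* hypothetical buffer size */

struct root_sketch {
	char name[NAME_MAX_SKETCH];
};

/*
 * Returns 0 on success, -1 if src (plus its terminating NUL) does not
 * fit. Note ">=": a string of exactly NAME_MAX_SKETCH characters needs
 * NAME_MAX_SKETCH + 1 bytes and must be rejected.
 */
static int set_name(struct root_sketch *root, const char *src)
{
	assert(root != NULL && src != NULL);
	if (strlen(src) >= sizeof(root->name))
		return -1;
	strcpy(root->name, src);	/* safe: length checked above */
	return 0;
}
```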
     
  • There may be cases (most obviously, sysfs-writable charp parameters) where
    a module needs to prevent sysfs access to parameters.

    Rather than express this in terms of a big lock, the functions are
    expressed in terms of what they protect against. This is clearer, esp.
    if the implementation changes to a module-level or even param-level lock.

    Signed-off-by: Rusty Russell
    Reviewed-by: Takashi Iwai
    Tested-by: Phil Carmody

    Rusty Russell
     
    Since these sections can be read-only (they're in .rodata), they should
    always have been const. Minor flow-through to various functions.

    Signed-off-by: Rusty Russell
    Tested-by: Phil Carmody

    Rusty Russell
     
  • Instead of using a "I kmalloced this" flag, we keep track of the kmalloced
    strings and use that list to check if we need to kfree (in practice, the
    list is very short).

    This means that kparams can be const again, and plugs a leak. This
    is important for drivers/usb/gadget/nokia.c which gets modprobe/rmmod'ed
    frequently on the N9000.

    Signed-off-by: Rusty Russell
    Reviewed-by: Takashi Iwai
    Cc: Artem Bityutskiy
    Tested-by: Phil Carmody

    Rusty Russell
     
  • This allows us to generalize the KPARAM_KMALLOCED flag, by calling a function
    on every parameter when a module is unloaded.

    Signed-off-by: Rusty Russell
    Reviewed-by: Takashi Iwai
    Tested-by: Phil Carmody

    Rusty Russell
     
  • This is more kernel-ish, saves some space, and also allows us to
    expand the ops without breaking all the callers who are happy for the
    new members to be NULL.

    The few places which defined their own param types are changed to the
    new scheme (more which crept in recently fixed in following patches).

    Since we're touching them anyway, we change get() and set() to take a
    const struct kernel_param (which they really are). This causes some
    harmless warnings until we fix them (in following patches).

    To reduce churn, module_param_call creates the ops struct so the callers
    don't have to change (and casts the functions to reduce warnings).
    The modern version which takes an ops struct is called module_param_cb.

    Signed-off-by: Rusty Russell
    Reviewed-by: Takashi Iwai
    Tested-by: Phil Carmody
    Cc: "David S. Miller"
    Cc: Ville Syrjala
    Cc: Dmitry Torokhov
    Cc: Alessandro Rubini
    Cc: Michal Januszewski
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-input@vger.kernel.org
    Cc: linux-fbdev-devel@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: netdev@vger.kernel.org

    Rusty Russell
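The benefit of an ops structure over separate get and set pointers can be sketched as follows. All names here are hypothetical and only loosely modelled on the description above; the real kernel structures differ. The point is that a new optional member (here free) can be added without touching existing users, because designated initializers leave unnamed members NULL.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct param_sketch;	/* hypothetical, modelled loosely on kernel_param */

struct param_ops_sketch {
	int (*set)(const char *val, const struct param_sketch *p);
	int (*get)(char *buf, size_t len, const struct param_sketch *p);
	void (*free)(void *arg);	/* optional: may be NULL */
};

struct param_sketch {
	const char *name;
	const struct param_ops_sketch *ops;
	void *arg;		/* points at the variable being set */
};

static int int_set(const char *val, const struct param_sketch *p)
{
	assert(val != NULL);
	*(int *)p->arg = atoi(val);
	return 0;
}

static int int_get(char *buf, size_t len, const struct param_sketch *p)
{
	return snprintf(buf, len, "%d", *(int *)p->arg);
}

/* Members not named here (free) default to NULL. */
static const struct param_ops_sketch int_ops_sketch = {
	.set = int_set,
	.get = int_get,
};

/* Called on "unload": only invoke the hook if the ops provide one. */
static void param_destroy(const struct param_sketch *p)
{
	if (p->ops->free)
		p->ops->free(p->arg);
}
```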
     
  • This is modern style, and good to do before we start changing things.

    Signed-off-by: Rusty Russell
    Reviewed-by: Takashi Iwai
    Tested-by: Phil Carmody

    Rusty Russell
     
  • An audit by Dongdong Deng revealed that most driver-author-written param
    calls don't handle val == NULL (which happens when parameters are specified
    with no =, eg "foo" instead of "foo=1").

    The only real case to use this is boolean, so handle it specially for that
    case and remove a source of bugs for everyone else.

    Signed-off-by: Rusty Russell
    Cc: Dongdong Deng
    Cc: Américo Wang

    Rusty Russell
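A sketch of the special case described above, with a hypothetical setter name and an assumed set of accepted spellings ("1", "y", "0", "n"). A NULL val, meaning the parameter was given without "=", is treated as "enable"; with that handled centrally for booleans, other setters never have to cope with NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical boolean setter. val is NULL when the parameter was given
 * without "=" (for example "foo" rather than "foo=1"); that is treated
 * as "enable". Returns 0 on success, -1 on an unrecognised value.
 */
static int set_bool_sketch(const char *val, bool *out)
{
	assert(out != NULL);
	if (val == NULL)
		val = "1";

	if (strcmp(val, "1") == 0 || strcmp(val, "y") == 0) {
		*out = true;
		return 0;
	}
	if (strcmp(val, "0") == 0 || strcmp(val, "n") == 0) {
		*out = false;
		return 0;
	}
	return -1;
}
```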
     
  • Add three helpers that retrieve a refcounted copy of the root and cwd
    from the supplied fs_struct.

    get_fs_root()
    get_fs_pwd()
    get_fs_root_and_pwd()

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Fix kernel-doc warning, add @timer description:

    Warning(kernel/timer.c:335): No description found for parameter 'timer'

    Signed-off-by: Randy Dunlap
    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux:
    unistd: add __NR_prlimit64 syscall numbers
    rlimits: implement prlimit64 syscall
    rlimits: switch more rlimit syscalls to do_prlimit
    rlimits: redo do_setrlimit to more generic do_prlimit
    rlimits: add rlimit64 structure
    rlimits: do security check under task_lock
    rlimits: allow setrlimit to non-current tasks
    rlimits: split sys_setrlimit
    rlimits: selinux, do rlimits changes under task_lock
    rlimits: make sure ->rlim_max never grows in sys_setrlimit
    rlimits: add task_struct to update_rlimit_cpu
    rlimits: security, add task_struct to setrlimit

    Fix up various system call number conflicts. We not only added fanotify
    system calls in the meantime, but asm-generic/unistd.h added a wait4
    along with a range of reserved per-architecture system calls.

    Linus Torvalds
     
  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

8 commits

  • kmsg_dump takes care to sample the global variables
    inside a spinlock, but then goes on to use the same
    variables outside the spinlock region too.

    Use the correct variable. This will make the race
    window smaller.

    Found by gcc 4.6's new warnings.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Reorder elements in structure cpu_stopper to remove alignment padding on
    64 bit builds, this shrinks its size from 40 to 32 bytes saving 8 bytes
    per cpu.

    Signed-off-by: Richard Kennedy
    Acked-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
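The effect of field ordering on structure size can be checked with sizeof. The two layouts below are hypothetical (the field names and types are illustrative, not the real cpu_stopper definition), and the byte counts in the comments assume a typical 64-bit ABI with 8-byte pointers and a 4-byte int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layouts; field names are illustrative only. */
struct stopper_padded {
	int lock;		/* 4 bytes, then 4 bytes of padding */
	void *list[2];		/* 16 bytes */
	void *thread;		/* 8 bytes */
	bool enabled;		/* 1 byte, then 7 bytes of tail padding */
};

struct stopper_packed {
	int lock;		/* 4 bytes */
	bool enabled;		/* 1 byte, then 3 bytes of padding */
	void *list[2];		/* 16 bytes */
	void *thread;		/* 8 bytes */
};

/* Bytes saved per instance by the reordering. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct stopper_packed) <= sizeof(struct stopper_padded));
	return sizeof(struct stopper_padded) - sizeof(struct stopper_packed);
}
```

Moving the small member next to the other small member lets both share one 8-byte slot, which is where the saving comes from.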
     
  • Remove duplicate definition of ARRAY_SIZE(), which was never used anyway.

    Signed-off-by: Geert Uytterhoeven
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Cleanup, no functional changes.

    - __set_personality() always changes ->exec_domain/personality, the
    special case when ->exec_domain remains the same buys nothing but
    complicates the code. Unify both cases to simplify the code.

    - The -EINVAL check in sys_personality() was never right. If we assume
    that set_personality() can fail we should check the value it returns
    instead of verifying that task->personality was actually changed.

    Remove it. Before the previous patch it was possible to hit this case
    due to overflow problems, but this -EINVAL just indicated the kernel
    bug.

    OTOH, probably it makes sense to change lookup_exec_domain() to return
    ERR_PTR() instead of default_exec_domain if the search in exec_domains
    list fails, and report this error to the user-space. But this means
    another user-space change, and we have in-kernel users which need fixes.
    For example, PER_OSF4 falls into PER_MASK for an unknown reason and nobody
    cares to register this domain.

    Signed-off-by: Oleg Nesterov
    Cc: Wenming Zhang
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When taking a memory snapshot in hibernate_snapshot(), all (directly
    called) memory allocations use GFP_ATOMIC. Hence swap misusage during
    hibernation never occurs.

    But from a pessimistic point of view, there is no guarantee that no page
    allocation has __GFP_WAIT. It is better to have a global indication "we
    enter hibernation, don't use swap!".

    This patch tries to freeze new-swap-allocation during hibernation. (All
    user processes are frozen, so swapin is not a concern.)

    This way, no updates will happen to swap_map[] between
    hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
    is called. We can be assured that swap corruption will not occur.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Ondrej Zary
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatibility. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
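The scoring described above can be approximated by a small function. This is a simplified model rather than the kernel's badness(): the function name, the use of page counts as inputs, and the final clamping to the 0..1000 range are assumptions of the sketch. It computes the share of allowable memory in use on a 0 to 1000 scale, gives root tasks a 3% discount, and adds oom_score_adj directly.

```c
#include <assert.h>
#include <stdbool.h>

#define SCORE_MAX 1000

/*
 * Simplified model: all quantities are in pages, and the clamping to
 * [0, SCORE_MAX] is an assumption of this sketch.
 */
static long badness_sketch(unsigned long rss, unsigned long swap,
			   unsigned long allowable, bool is_root,
			   int oom_score_adj)
{
	unsigned long long used = (unsigned long long)rss + swap;
	long points;

	assert(allowable > 0);
	/* Proportion of allowable memory in use, on a 0..1000 scale. */
	points = (long)(used * SCORE_MAX / allowable);

	/* Root tasks get a 3% discount (30 points out of 1000). */
	if (is_root)
		points -= 30;

	/* The userspace tunable is added directly to the score. */
	points += oom_score_adj;

	if (points < 0)
		points = 0;
	if (points > SCORE_MAX)
		points = SCORE_MAX;
	return points;
}
```

For a task using half of the allowable memory the sketch yields 500, and an oom_score_adj of -500 brings that down to 0, matching the "discount 50%" reading in the text above.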
     
  • The three oom killer sysctl variables (sysctl_oom_dump_tasks,
    sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
    declared in include/linux/oom.h rather than kernel/sysctl.c.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We'll need the path to implement the flags field for statvfs support.
    We do have it available in all callers except:

    - ecryptfs_statfs. This one doesn't actually need vfs_statfs but just
    needs to call the lower filesystem's statfs method.
    - sys_ustat. Add a non-exported statfs_by_dentry helper for it which
    won't be able to fill out the flags field later on.

    In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
    of the misleading vfs prefix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

09 Aug, 2010

1 commit

  • Commit 6ee0578b (workqueue: mark init_workqueues as early_initcall)
    made workqueue SMP initialization depend on workqueue_cpu_callback(),
    which however was registered as hotcpu_notifier() and didn't get
    called if CONFIG_HOTPLUG_CPU is not set. This made gcwqs on non-boot
    CPUs not create their initial workers leading to boot failures. Fix
    it by making it a cpu_notifier.

    Signed-off-by: Tejun Heo
    Reported-and-bisected-by: walt
    Tested-by: Markus Trippelsdorf

    Tejun Heo