20 Apr, 2006

6 commits

  • * 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block:
    [PATCH] block/elevator.c: remove unused exports
    [PATCH] splice: fix smaller sized splice reads
    [PATCH] Don't inherit ->splice_pipe across forks
    [patch] cleanup: use blk_queue_stopped
    [PATCH] Document online io scheduler switching

    Linus Torvalds
     
  • In cases where a struct kretprobe's *_handler fields are non-NULL, it is
    possible to cause a system crash, due to the possibility of calls ending up
    in zombie functions. Documentation clearly states that unused *_handlers
    should be set to NULL, but kprobe users sometimes fail to do so.

    Fix it by setting the non-relevant fields of the struct kretprobe to NULL.
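
    A sketch of the shape of the fix (presumably in register_kretprobe();
    details hedged): explicitly NULL the handler fields the kretprobe
    machinery does not own, so a stale user-supplied pointer can never be
    called:

        rp->kp.pre_handler = pre_handler_kretprobe;
        rp->kp.post_handler = NULL;
        rp->kp.fault_handler = NULL;
        rp->kp.break_handler = NULL;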

    Signed-off-by: Ananth N Mavinakayanahalli
    Acked-by: Jim Keniston
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
     
    It's really task private, so clear that field on fork after copying the
    task structure.
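
    A minimal sketch of the fix's shape (copy_process() in kernel/fork.c
    duplicates the parent's task_struct wholesale, so the private field
    must be cleared afterwards; exact placement is an assumption):

        p->splice_pipe = NULL;  /* task-private pipe cache; never inherited */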

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    Those also break userland regs like the following.

    00000000 :
       0:   0f b7 44 24 0c          movzwl 0xc(%esp),%eax
       5:   83 ca ff                or     $0xffffffff,%edx
       8:   0f b7 4c 24 08          movzwl 0x8(%esp),%ecx
       d:   66 83 f8 ff             cmp    $0xffffffff,%ax
      11:   0f 44 c2                cmove  %edx,%eax
      14:   66 83 f9 ff             cmp    $0xffffffff,%cx
      18:   0f 45 d1                cmovne %ecx,%edx
      1b:   89 44 24 0c             mov    %eax,0xc(%esp)
      1f:   89 54 24 08             mov    %edx,0x8(%esp)
      23:   e9 fc ff ff ff          jmp    24

    where the tailcall at the end overwrites the incoming stack-frame.
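
    For reference, the workaround is an empty asm that ties the return
    value to a register, preventing gcc from compiling the call as a
    tail-jump that reuses the caller's incoming argument slots (a sketch
    of the i386 macro of that era; treat the exact definition as an
    assumption):

        #define prevent_tail_call(ret) __asm__ ("" : "=r" (ret) : "0" (ret))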

    Signed-off-by: OGAWA Hirofumi
    [ I would _really_ like to have a way to tell gcc about calling
    conventions. The "prevent_tail_call()" macro is pretty ugly ]
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • The function free_pagedir() used by swsusp for freeing its internal data
    structures clears the PG_nosave and PG_nosave_free flags for each page
    being freed.

    However, during resume PG_nosave_free set means that the page in
    question is "unsafe" (ie. it will be overwritten in the process of
    restoring the saved system state from the image), so it should not be
    used for the image data.

    Therefore free_pagedir() should not clear PG_nosave_free if it's called
    during resume (otherwise "unsafe" pages freed by it may be used for
    storing the image data and the data may get corrupted later on).
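
    A sketch of the shape of the fix (the flag name is an assumption; the
    real patch threads a similar argument through so callers can say
    whether they run during resume):

        static void free_pagedir(struct pbe *pblist, int clear_nosave_free)
        {
            struct pbe *pbe;

            while (pblist) {
                pbe = pblist->next;
                ClearPageNosave(virt_to_page(pblist));
                if (clear_nosave_free)      /* skipped during resume */
                    ClearPageNosaveFree(virt_to_page(pblist));
                free_page((unsigned long)pblist);
                pblist = pbe;
            }
        }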

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • While we can currently walk through thread groups, process groups, and
    sessions with just the rcu_read_lock, this opens the door to walking the
    entire task list.

    We already have all of the other RCU guarantees, so there is no cost in
    doing this; it should be enough so that proc can stop taking the tasklist
    lock during readdir.

    prev_task was killed because it had no users, and using it would miss new
    tasks when doing an rcu traversal.
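
    With that in place, a reader can cover every task under RCU alone,
    e.g. (the loop body is illustrative):

        struct task_struct *p;

        rcu_read_lock();
        for_each_process(p) {
            /* examine p; must not sleep, and must not use p after
             * rcu_read_unlock() without taking a reference */
        }
        rcu_read_unlock();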

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

15 Apr, 2006

2 commits

  • Somehow in the midst of dotting i's and crossing t's during
    the merge up to rc1 we wound up keeping __put_task_struct_cb
    when it should have been killed as it no longer has any users.
    Sorry I probably should have caught this while it was
    still in the -mm tree.

    Having the old code there gets confusing when reading
    through the code and trying to understand what is
    happening.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
    Since the last user was removed in -mm, we can now remove this long-deprecated
    function.

    Signed-off-by: Adrian Bunk
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Adrian Bunk
     

14 Apr, 2006

1 commit

  • This reverts most of commit 30e0fca6c1d7d26f3f2daa4dd2b12c51dadc778a.
    It broke the case of non-leader MT exec when ptraced.
    I think the bug it was intended to fix was already addressed by commit
    788e05a67c343fa22f2ae1d3ca264e7f15c25eaf.

    Signed-off-by: Roland McGrath
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

11 Apr, 2006

10 commits

  • Commit e56d090310d7625ecb43a1eeebd479f04affb48b

    [PATCH] RCU signal handling

    made this BUG_ON() unsafe. This code runs under ->siglock,
    while switch_exec_pids() takes tasklist_lock.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • * 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
    [PATCH] vfs: add splice_write and splice_read to documentation
    [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
    [PATCH] splice: warning fix
    [PATCH] another round of fs/pipe.c cleanups
    [PATCH] splice: comment styles
    [PATCH] splice: add Ingo as addition copyright holder
    [PATCH] splice: unlikely() optimizations
    [PATCH] splice: speedups and optimizations
    [PATCH] pipe.c/fifo.c code cleanups
    [PATCH] get rid of the PIPE_*() macros
    [PATCH] splice: speedup __generic_file_splice_read
    [PATCH] splice: add direct fd fd splicing support
    [PATCH] splice: add optional input and output offsets
    [PATCH] introduce a "kernel-internal pipe object" abstraction
    [PATCH] splice: be smarter about calling do_page_cache_readahead()
    [PATCH] splice: optimize the splice buffer mapping
    [PATCH] splice: cleanup __generic_file_splice_read()
    [PATCH] splice: only call wake_up_interruptible() when we really have to
    [PATCH] splice: potential !page dereference
    [PATCH] splice: mark the io page as accessed

    Linus Torvalds
     
  • Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel().
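
    The fixed loop, roughly as it reads in kernel/hrtimer.c of that era:

        int hrtimer_cancel(struct hrtimer *timer)
        {
            for (;;) {
                int ret = hrtimer_try_to_cancel(timer);

                if (ret >= 0)
                    return ret;
                cpu_relax();    /* the fix: be polite to SMT siblings while spinning */
            }
        }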

    Signed-off-by: Joe Korty
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Korty
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Implement the scheduled unexport of panic_timeout.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • We need the boot CPU's tvec_bases[] entry to be initialised super-early in
    boot, for early_serial_setup(). That runs within setup_arch(), before even
    per-cpu areas are initialised.

    The patch changes tvec_bases to use compile-time initialisation, and adds a
    separate array `tvec_base_done' to keep track of which CPU has had its
    tvec_bases[] entry initialised (because we can no longer use the zeroness of
    that tvec_bases[] entry to determine whether it has been initialised).

    Thanks to Eugene Surovegin for diagnosing this.
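
    A sketch of the bookkeeping described above (the array name comes from
    the message; the helper below is hypothetical):

        static char __devinitdata tvec_base_done[NR_CPUS];

        static int __devinit init_timers_cpu(int cpu)
        {
            if (!tvec_base_done[cpu]) {
                init_tvec_base(cpu);        /* hypothetical one-time setup */
                tvec_base_done[cpu] = 1;
            }
            return 0;
        }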

    Cc: Eugene Surovegin
    Cc: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    For some architectures, a few syscalls are not linked in noMMU mode. In
    that case, the MMU-dependent syscalls need to be defined as
    'cond_syscall'. For example, the ARM architecture selectively links
    sys_mlock depending on the mode configuration.

    In the case of FRV, this has been managed by #ifdef CONFIG_MMU macros in
    arch/frv/kernel/entry.S. However, these conditional macros become
    redundant once the syscalls are defined as cond_syscall. Compilation was
    tested with FRV toolchains for both MMU and noMMU modes.
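
    For illustration, cond_syscall() emits a weak alias to sys_ni_syscall()
    (which returns -ENOSYS), so a syscall left out of the link still
    resolves safely -- roughly:

        /* include/linux/linkage.h (shape of the era's definition) */
        #define cond_syscall(x) asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall")

        /* kernel/sys_ni.c: MMU-only syscalls declared conditional */
        cond_syscall(sys_mlock);
        cond_syscall(sys_munlock);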

    Signed-off-by: Hyok S. Choi
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hyok S. Choi
     
  • RT tasks are being awakened on the expired array when expired_starving() is
    true, whereas they really should be excluded. Fix.
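
    Conceptually (not the literal diff), the expired-array placement gains
    an rt_task() exclusion:

        if (expired_starving(rq) && !rt_task(p))
            target = rq->expired;   /* RT tasks stay on the active array */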

    Signed-off-by: Mike Galbraith
    Acked-by: Ingo Molnar
    Cc: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Galbraith
     
    Fix a starvation problem that occurs when a stream of highly interactive
    tasks delays an array switch for extended periods despite
    EXPIRED_STARVING(rq) being true. AFAICT, the only choice is to enqueue
    awakening tasks on the expired
    array in this case.

    Without this patch, it can be nearly impossible to remotely login to a busy
    server, and interactive shell commands can starve for minutes.

    Also, convert the EXPIRED_STARVING macro into an inline function which humans
    can understand.
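
    The converted helper, close to what landed (2.6.16-era types; details
    hedged):

        static inline int expired_starving(runqueue_t *rq)
        {
            if (rq->curr->static_prio > rq->best_expired_prio)
                return 1;
            if (!STARVATION_LIMIT || !rq->expired_timestamp)
                return 0;
            if (jiffies - rq->expired_timestamp >
                    STARVATION_LIMIT * rq->nr_running)
                return 1;
            return 0;
        }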

    Signed-off-by: Mike Galbraith
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Acked-by: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Galbraith
     
    It's more efficient for sendfile() emulation. Basically we cache an
    internal private pipe and just use that as the intermediate area for
    pages. Direct splicing is not available from sys_splice(); it is only
    meant to be used for sendfile() emulation.

    Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
    exit for the normal fast path.
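
    A conceptual sketch of the caching (the allocator name is an
    assumption):

        struct pipe_inode_info *pipe = current->splice_pipe;

        if (!pipe) {
            pipe = alloc_pipe_info(NULL);   /* assumed allocator */
            current->splice_pipe = pipe;    /* reused by later sendfile() calls */
        }
        /* splice input -> pipe, then pipe -> output, reusing 'pipe' */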

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2006

1 commit

  • If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
    This is due to the HPET timer not being initialized with the correct
    setting (still using PIT count).

    If HZ changes, this drift can become even more pronounced.

    This patch initializes tick_nsec with the correct setting for the HPET
    timer.

    Vojtech comments:

    "It's not entirely correct (it assumes the HPET ticks totally
    exactly), but it's significantly better than assuming the PIT error
    there."

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jordan Hargrave
     

01 Apr, 2006

17 commits

  • I was grepping through the code and some `grep ganularity -R .` didn't
    catch what I thought. Then looking closer I saw the term "granuality"
    used in only four places (in comments) and granularity in many more
    places describing the same idea. Some other facts:

    - dictionary.com does not know such a word
    - define:granuality on google is not found (and pages for granuality
      are mostly related to patches to the kernel)
    - it has not been discussed as a term on LKML, AFAICS (=Can Search)

    To be consistent, I think granularity should be used everywhere.

    Signed-off-by: Kalin KOZHUHAROV
    Signed-off-by: Adrian Bunk

    Kalin KOZHUHAROV
     
    This changes if () BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely(), and can be better optimized away.
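
    The transformation, illustrated:

        if (!ptr)
            BUG();

        /* becomes */

        BUG_ON(!ptr);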

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays
    the information that it doesn't need ACPI, too, is even more helpful.

    Signed-off-by: Adrian Bunk

    Adrian Bunk
     
  • Wrong error path in dup_fd() - it should return NULL on error,
    not an address of already freed memory :/

    Triggered by OpenVZ stress test suite.

    What is interesting is that it was causing different oopses in RCU, like
    the one below:
    Call Trace:
    [] rcu_do_batch+0x2c/0x80
    [] rcu_process_callbacks+0x3d/0x70
    [] tasklet_action+0x73/0xe0
    [] __do_softirq+0x10a/0x130
    [] do_softirq+0x4f/0x60
    =======================
    [] smp_apic_timer_interrupt+0x77/0x110
    [] apic_timer_interrupt+0x1c/0x24
    Code: Bad EIP value.
    Kernel panic - not syncing: Fatal exception in interrupt
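
    The shape of the fix in dup_fd(), sketched with illustrative cleanup
    calls:

        out_release:
            /* free the partially copied fd table ... */
            kmem_cache_free(files_cachep, newf);
            return NULL;    /* not 'newf', which was just freed */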

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Dmitry Mishin
    Signed-Off-By: Kirill Korotaev
    Signed-Off-By: Linus Torvalds

    Kirill Korotaev
     
  • Simplifies the code, reduces the need for 4 pid hash tables, and makes the
    code more capable.

    In the discussions I had with Oleg it was felt that to a large extent the
    cleanup itself justified the work. Having struct pid dynamically
    allocated means we can create the hash table entry when the pid is
    allocated and free the hash table entry when the pid is freed, instead of
    playing with the hash lists whenever a process attaches or detaches.

    For myself the fact that it gave what my previous task_ref patch gave for free
    with simpler code was a big win. The problem is that if you hold a reference
    to struct task_struct you lock in 10K of low memory. If you do that in a user
    controllable way like /proc does, with an unprivileged but hostile user space
    application with typical resource limits of 1000 fds and 100 processes, I can
    trigger the OOM killer by consuming all of low memory with task structs, on a
    machine with 1GB of low memory.

    If I instead hold a reference to struct pid which holds a pointer to my
    task_struct, I don't suffer from that problem because struct pid is 2 orders
    of magnitude smaller. In fact struct pid is small enough that most other
    kernel data structures dwarf it, so simply limiting the number of referring
    data structures is enough to prevent exhaustion of low memory.

    This splits the current struct pid into two structures, struct pid and struct
    pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
    struct pid_link is the per process linkage into the hash tables and lives in
    struct task_struct. struct pid is given an independent lifetime, and holds
    pointers to each of the pid types.

    The independent life of struct pid simplifies attach_pid and detach_pid,
    because we are always manipulating the list of pids and not the hash table.
    In addition, giving struct pid an independent life makes the concept much
    more powerful.

    Kernel data structures can now embed a struct pid * instead of a pid_t and
    not suffer from pid wrap around problems or from keeping unnecessarily
    large amounts of memory allocated.
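
    A sketch of the resulting layout (close to what landed in
    include/linux/pid.h; details hedged):

        struct pid
        {
            atomic_t count;
            int nr;                                 /* the numeric pid */
            struct hlist_node pid_chain;            /* the one remaining hash table */
            struct hlist_head tasks[PIDTYPE_MAX];   /* lists of attached tasks */
            struct rcu_head rcu;
        };

        struct pid_link
        {
            struct hlist_node node;                 /* entry in pid->tasks[type] */
            struct pid *pid;
        };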

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • A big problem with rcu protected data structures that are also reference
    counted is that you must jump through several hoops to increase the reference
    count. I think someone finally implemented atomic_inc_not_zero(&count) to
    automate the common case. Unfortunately this means you must special case the
    rcu access case.

    When data structures are only visible via rcu in a manner that is not
    determined by the reference count on the object (i.e. tasks are visible
    until their zombies are reaped), there is a much simpler technique we can
    employ: simply delay the decrement of the reference count until the rcu
    interval is over.

    What that means is that the proc code that looks up a task and later
    wants to sleep can now do:

    rcu_read_lock();
    task = find_task_by_pid(some_pid);
    if (task) {
            get_task_struct(task);
    }
    rcu_read_unlock();

    The effect on the rest of the kernel is that put_task_struct becomes
    cheaper and immediate, and in the case where the task has been reaped it
    frees the task immediately instead of unnecessarily waiting until the rcu
    interval is over.

    Cleanup of task_struct does not happen when its reference count drops to
    zero, instead cleanup happens when release_task is called. Tasks can only
    be looked up via rcu before release_task is called. All rcu protected
    members of task_struct are freed by release_task.

    Therefore we can move call_rcu from put_task_struct into release_task. And
    we can modify release_task to not immediately release the reference count
    but instead have it call put_task_struct from the function it gives to
    call_rcu.
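
    The resulting shape, roughly as it appears in kernel/exit.c:

        static void delayed_put_task_struct(struct rcu_head *rhp)
        {
            put_task_struct(container_of(rhp, struct task_struct, rcu));
        }

        /* ... and at the end of release_task(): */
        call_rcu(&p->rcu, delayed_put_task_struct);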

    The end result:

    - get_task_struct is safe in an rcu context where we have just looked
    up the task.

    - put_task_struct() simplifies into its old pre-rcu self.

    This reorganization also makes put_task_struct uncallable from modules,
    as it is not exported, but it does not appear to be called from any
    modules, so this should not be an issue and is trivially fixed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • This just got nuked in mainline. Bring it back because Eric's patches use it.

    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    The core problem: setsid fails if it is called by init. The effect in
    2.6.16 and the earlier kernels that have this problem is that if you do a
    "ps -j 1" or "ps -ej 1" you will see that init and several of its
    children have process group and session == 0, instead of process group ==
    session == 1, despite init calling setsid.

    The reason it fails is that daemonize calls set_special_pids(1,1) on
    kernel threads that are launched before /sbin/init is called.

    The only remaining effect is that current->signal->leader == 0 for init
    instead of 1, and the setsid call fails. No one has noticed because
    /sbin/init does not check the return value of setsid.

    In 2.4, where we don't have the pidhash table and daemonize doesn't
    exist, setsid actually works for init.

    I care a lot about pid == 1 not being a special case that we leave broken,
    because of the container/jail work that I am doing.

    - Carefully allow init (pid == 1) to call setsid despite the kernel using
    its session.

    - Use find_task_by_pid instead of find_pid because find_pid taking a
    pidtype is going away.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The futex timeval is not checked for correctness. The change does not
    break existing applications as the timeval is supplied by glibc (and glibc
    always passes a correct value), but the glibc-internal tests for this
    functionality fail.
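
    A sketch of the kind of check added in sys_futex() (details are an
    assumption):

        if (utime && (op == FUTEX_WAIT)) {
            if (copy_from_user(&t, utime, sizeof(t)) != 0)
                return -EFAULT;
            if (!timespec_valid(&t))
                return -EINVAL;
            timeout = timespec_to_jiffies(&t) + 1;
        }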

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • To increase the strength of SCHED_BATCH as a scheduling hint we can
    activate batch tasks on the expired array since by definition they are
    latency insensitive tasks.
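
    The change amounts to a one-liner in the activation path -- roughly:

        static void __activate_task(task_t *p, runqueue_t *rq)
        {
            prio_array_t *target = rq->active;

            if (batch_task(p))          /* SCHED_BATCH: latency insensitive */
                target = rq->expired;
            enqueue_task(p, target);
            rq->nr_running++;
        }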

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
    On-runqueue time is used to elevate priority in schedule().

    In the code it currently requeues tasks even if their priority is not
    elevated, which would end up placing them at the end of their runqueue
    array, effectively delaying them instead of improving their priority.

    Bug spotted by Mike Galbraith.

    This patch removes this requeueing.
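
    After the fix, schedule() only moves the task when its priority
    actually changed -- roughly:

        new_prio = recalc_task_prio(next, next->timestamp + delta);
        if (unlikely(next->prio != new_prio)) {
            dequeue_task(next, array);
            next->prio = new_prio;
            enqueue_task(next, array);
        }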

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
    Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority,
    so they need to be included in the idle detection code.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
    We watch for tasks that sleep for extended periods and don't allow a
    single prolonged sleep period to elevate priority to maximum bonus, to
    prevent cpu-bound tasks from getting high priority with single long
    sleeps. There is a bug in the current code that also penalises tasks that
    already have high priority. Correct that bug.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
    Alterations to the pipe code in the kernel made it possible for relative
    starvation to occur, with tasks that slept waiting on a pipe getting
    unfair priority bonuses even if they were otherwise fully cpu bound, so
    the TASK_NONINTERACTIVE flag was introduced to prevent any change to
    sleep_avg while sleeping waiting on a pipe. This change also has the
    converse effect, though, preventing any priority boost from occurring in
    truly interactive tasks that wait on pipes.

    Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE
    which will allow a linear bonus to priority based on sleep time thus allowing
    interactive tasks to get high priority if they sleep enough.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • The activated flag in task_struct is used to track different sleep types and
    its usage is somewhat obfuscated. Convert the variable to an enum with more
    descriptive names without altering the function.
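
    The resulting enum, close to what landed in include/linux/sched.h:

        enum sleep_type {
            SLEEP_NORMAL,
            SLEEP_NONINTERACTIVE,
            SLEEP_INTERACTIVE,
            SLEEP_INTERRUPTED,
        };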

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
    Currently, count_active_tasks() calls both nr_running() &
    nr_uninterruptible(). Each of these functions does a "for_each_cpu" &
    reads values from the runqueue of each cpu. Although this is not a lot of
    instructions, each runqueue may be located on a different node. Depending
    on the architecture, a unique TLB entry may be required to access each
    runqueue.

    Since there may be more runqueues than cpu TLB entries, a scan of all
    runqueues can thrash the TLB. Each memory reference incurs a TLB miss &
    refill.

    In addition, the runqueue cacheline that contains nr_running &
    nr_uninterruptible may be evicted from the cache between the two passes.
    This causes unnecessary cache misses.

    Combining nr_running() & nr_uninterruptible() into a single function
    substantially reduces the TLB & cache misses on large systems. This
    should have no measurable effect on smaller systems.

    On a 128p IA64 system running a memory stress workload, the new function
    reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.
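
    The combined helper, close to what landed as nr_active() (details
    hedged):

        unsigned long nr_active(void)
        {
            unsigned long i, running = 0, uninterruptible = 0;

            for_each_online_cpu(i) {
                running += cpu_rq(i)->nr_running;
                uninterruptible += cpu_rq(i)->nr_uninterruptible;
            }

            /* per-cpu counters can be transiently negative in aggregate */
            if (unlikely((long)uninterruptible < 0))
                uninterruptible = 0;

            return running + uninterruptible;
        }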

    Signed-off-by: Jack Steiner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • It seems that run_hrtimer_queue() is calling get_softirq_time() more
    often than it needs to.

    With this patch, it only calls get_softirq_time() if there's a
    pending timer.
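
    Conceptually, the change hoists an early return ahead of the time
    lookup (a sketch, not the literal diff):

        if (!base->first)
            return;     /* no pending timers: skip get_softirq_time() */

        if (base->get_softirq_time)
            base->softirq_time = base->get_softirq_time();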

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich