31 Oct, 2005

4 commits

  • de_thread() sends SIGKILL to all sub-threads and waits for them to die in
    'D' state. It is possible that one of the threads has already dequeued the
    coredump signal. When de_thread() unlocks ->sighand->siglock, that thread
    can enter do_coredump()->coredump_wait() and cause a deadlock.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch deletes pointless code from coredump_wait().

    1. It does a useless mm->core_waiters inc/dec under mm->mmap_sem; any
    changes to ->core_waiters have no effect until we drop ->mmap_sem anyway.

    2. It calls yield() for no apparent reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() calls del_timer_sync(->real_timer) under ->sighand->siglock.
    This can deadlock: it_real_fn() sends a signal and needs this lock too.

    Also, delete unneeded ->real_timer.data assignment.
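
    A minimal sketch of the deadlock pattern (illustrative only, not the actual
    fs/exec.c code; the oldsighand/sig names are stand-ins):

        spin_lock_irq(&oldsighand->siglock);
        ...
        del_timer_sync(&sig->real_timer);
                /* must wait for a running it_real_fn() to return, but
                 * it_real_fn() is spinning on ->siglock, which we hold:
                 * neither side can make progress. */
        ...
        spin_unlock_irq(&oldsighand->siglock);

    The usual remedy for this pattern is to call del_timer_sync() outside the
    locked region, so the handler can always run to completion.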

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial, saves one 'if' branch in de_thread().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

30 Oct, 2005

3 commits

  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.
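
    A sketch of the caller-side pattern these macros enable (hedged; variable
    names and the exact return convention are illustrative):

        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_alloc_map_lock(mm, pmd, address, &ptl);
        if (!pte)
                return VM_FAULT_OOM;    /* allocation failed */
        /* ... examine or set the pte while it is mapped and locked ... */
        pte_unmap_unlock(pte, ptl);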

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that
    approach to be inadequate as well. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.
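
    The new helpers are small enough to sketch; something along these lines
    (a hedged sketch, not the literal patch), with collectors expected to take
    the maximum themselves:

        #define update_hiwater_rss(mm) do {                     \
                if ((mm)->hiwater_rss < get_mm_rss(mm))         \
                        (mm)->hiwater_rss = get_mm_rss(mm);     \
        } while (0)

        #define update_hiwater_vm(mm) do {                      \
                if ((mm)->hiwater_vm < (mm)->total_vm)          \
                        (mm)->hiwater_vm = (mm)->total_vm;      \
        } while (0)

        /* a collector (e.g. the /proc reader) does the final max: */
        hiwater_rss = max(mm->hiwater_rss, get_mm_rss(mm));
        hiwater_vm  = max(mm->hiwater_vm,  mm->total_vm);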

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
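
    A sketch of the accounting change (hedged; the counter accessor names
    get_mm_counter/inc_mm_counter are assumed here, not quoted from the patch):

        /* two separate counters, summed only when the total is wanted: */
        #define get_mm_rss(mm)  (get_mm_counter(mm, file_rss) + \
                                 get_mm_counter(mm, anon_rss))

        /* hot paths update only the counter that actually changed,
         * e.g. when mapping an anonymous page: */
        inc_mm_counter(mm, anon_rss);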

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Sep, 2005

2 commits

  • Pavel Emelianov and Kirill Korotaev observe that fs and arch users of
    security_vm_enough_memory tend to forget to vm_unacct_memory when a
    failure occurs further down (typically in setup_arg_pages variants).

    These are all users of insert_vm_struct, and that reservation will only
    be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some
    cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't.

    So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into
    Committed_AS each time they're run. But don't add VM_ACCOUNT to them,
    it's inappropriate to reserve against the very unlikely case that gdb
    be used to COW a vDSO page - we ought to do something about that in
    do_wp_page, but there are yet other inconsistencies to be resolved.

    The safe and economical way to fix this is to let insert_vm_struct do
    the security_vm_enough_memory check when it finds VM_ACCOUNT is set.
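
    In outline, the check moves into insert_vm_struct() itself (a hedged
    sketch of the idea, not the literal diff):

        /* inside insert_vm_struct(), before linking the vma: */
        if (vma->vm_flags & VM_ACCOUNT) {
                unsigned long npages =
                        (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
                if (security_vm_enough_memory(npages))
                        return -ENOMEM;
        }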

    And the MIPS irix_brk has been calling security_vm_enough_memory before
    calling do_brk which repeats it, doubly accounting and so also leaking.
    Remove that, and all the fs and arch calls to security_vm_enough_memory:
    give it a less misleading name later on.

    Signed-off-by: Hugh Dickins
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It turns out that the BUG_ON() in fs/exec.c: de_thread() is unreliable
    and can trigger due to the test itself being racy.

    de_thread() does

        while (atomic_read(&sig->count) > count) {
        }
        .....
        .....
        BUG_ON(!thread_group_empty(current));

    but release_task does

        write_lock_irq(&tasklist_lock)
          __exit_signal
            (this is where atomic_dec(&sig->count) is run)
          __exit_sighand
          __unhash_process
            takes write lock on tasklist_lock
            remove itself out of PIDTYPE_TGID list
        write_unlock_irq(&tasklist_lock)

    so there's a clear (although small) window between the
    atomic_dec(&sig->count) and the actual PIDTYPE_TGID unhashing of the
    thread.

    And actually there is no need for all threads to have exited at this
    point, so we simply kill the BUG_ON.

    Big thanks to Marc Lehmann who provided the test-case.

    Fixes Bug 5170 (http://bugme.osdl.org/show_bug.cgi?id=5170)

    Signed-off-by: Alexander Nyberg
    Cc: Roland McGrath
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Acked-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Alexander Nyberg
     

10 Sep, 2005

1 commit

  • In order for the RCU to work, the file table array, sets and their sizes must
    be updated atomically. Instead of ensuring this through too many memory
    barriers, we put the arrays and their sizes in a separate structure. This
    patch takes the first step of putting the file table elements in a separate
    structure fdtable that is embedded within files_struct. It also changes all
    the users to refer to the file table using files_fdtable() macro. Subsequent
    application of RCU becomes easier after this.
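
    Roughly, the split looks like this (fields abridged and hedged; only the
    fdtable/files_fdtable names are taken from the description above):

        struct fdtable {
                unsigned int max_fds;
                struct file **fd;       /* current fd array */
                fd_set *close_on_exec;
                fd_set *open_fds;
                /* sizes and sets live together so that a later patch can
                 * replace the whole structure atomically under RCU */
        };

        /* users go through the accessor instead of files->fd directly: */
        struct fdtable *fdt = files_fdtable(files);
        if (fd < fdt->max_fds)
                file = fdt->fd[fd];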

    Signed-off-by: Dipankar Sarma
    Signed-Off-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     

13 Jul, 2005

1 commit

  • When a noninitial thread does exec, it becomes the new group leader. If
    there is an ITIMER_REAL timer running, it points at the old group leader, and
    when it fires it can follow a stale pointer. The timer data needs to be
    reset to point at the exec'ing thread that is becoming the group leader.
    This has to synchronize with any concurrent firing of the timer to make
    sure that it_real_fn can never run when the data points to a thread that
    might have been reaped already.
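
    In outline (a hedged sketch of the idea, not the literal patch; at the
    time, ->real_timer.data carried the task pointer that it_real_fn()
    dereferences):

        /* after the exec'ing thread has taken over as group leader: */
        del_timer_sync(&sig->real_timer);  /* wait out any in-flight it_real_fn() */
        sig->real_timer.data = (unsigned long) current;
        /* ... re-arm if the interval timer was still running ... */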

    Signed-off-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

24 Jun, 2005

1 commit

  • Add a new `suid_dumpable' sysctl:

    This value can be used to query and set the core dump mode for setuid
    or otherwise protected/tainted binaries. The modes are

    0 - (default) - traditional behaviour. Any process which has changed
    privilege levels or is execute only will not be dumped

    1 - (debug) - all processes dump core when possible. The core dump is
    owned by the current user and no security is applied. This is intended
    for system debugging situations only. Ptrace is unchecked.

    2 - (suidsafe) - any binary which normally would not be dumped is dumped
    readable by root only. This allows the end user to remove such a dump but
    not access it directly. For security reasons core dumps in this mode will
    not overwrite one another or other files. This mode is appropriate when
    administrators are attempting to debug problems in a normal environment.
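
    For illustration, mode 2 might be applied roughly like this inside
    do_coredump() (a hedged sketch; the O_EXCL/fsuid details are assumptions
    matching the behaviour described above, not quoted code):

        if (current->mm->dumpable == 2) {       /* suidsafe */
                flag = O_EXCL;          /* never overwrite an existing dump */
                current->fsuid = 0;     /* core file owned/readable by root */
        }

    The knob itself would typically be reachable as /proc/sys/fs/suid_dumpable
    (sysctl fs.suid_dumpable).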

    (akpm:

    > > +EXPORT_SYMBOL(suid_dumpable);
    >
    > EXPORT_SYMBOL_GPL?

    No problem to me.

    > > if (current->euid == current->uid && current->egid == current->gid)
    > > current->mm->dumpable = 1;
    >
    > Should this be SUID_DUMP_USER?

    Actually the feedback I had from last time was that the SUID_ defines
    should go because it's clearer to follow the numbers. They can go
    everywhere (and there are lots of places where dumpable is tested/used
    as a bool in untouched code)

    > Maybe this should be renamed to `dump_policy' or something. Doing that
    > would help us catch any code which isn't using the #defines, too.

    Fair comment. The patch was designed to be easy to maintain for Red Hat
    rather than for merging. Changing that field would create a gigantic
    diff because it is used all over the place.

    )

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds