11 Jan, 2012

1 commit

  • oom_score_adj is used for guarding processes from the OOM killer. One
    problem is that it is inherited at fork(). When a daemon sets
    oom_score_adj and creates children, it's hard to know where the value
    was set.

    This patch adds three tracepoints that are useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable some tracepoints. Filtering may be useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
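
As a rough illustration of how such output could be post-processed, the following userspace C sketch pulls the oom_score_adj field out of a single trace line. The helper name parse_oom_score_adj() is hypothetical and not part of the patch; it only assumes the "oom_score_adj=<number>" field format visible in the sample output above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the value of the "oom_score_adj=" field
 * from one line of trace output. Returns 0 and stores the value on
 * success, -1 if the field is missing or not a number.
 */
int parse_oom_score_adj(const char *line, int *value)
{
    static const char key[] = "oom_score_adj=";
    const char *p;
    char *end;
    long v;

    assert(line != NULL && value != NULL);

    /* The event name "oom_score_adj_update:" has no '=', so it is skipped. */
    p = strstr(line, key);
    if (!p)
        return -1;

    p += sizeof(key) - 1;
    v = strtol(p, &end, 10);
    if (end == p)
        return -1;

    *value = (int)v;
    return 0;
}
```

A script following one PID through task_newtask and task_rename events could use a helper like this to confirm which children inherited a non-zero value.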

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Jan, 2012

1 commit

  • some stuff in there can actually become static; some belongs to pnode.h
    as it's a private interface between namespace.c and pnode.c...

    Signed-off-by: Al Viro

    Al Viro
     

01 Nov, 2011

1 commit

  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Aug, 2011

1 commit

  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check, an unprivileged user could greatly exceed the allowed
    number of processes if the program relied on rlimit only. But the check
    created a new security threat: many poorly written programs simply
    don't check the setuid() return code and believe it cannot fail if
    executed with root privileges. So the check is removed in this patch
    because privilege escalations caused by such buggy programs are too
    frequent.
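
The failure mode described above comes down to an unchecked return value. Below is a minimal userspace sketch of the defensive pattern, assuming a hypothetical helper named drop_to_uid() (it is not part of this patch):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch to `uid` and report failure instead of
 * assuming setuid() cannot fail. Returns 0 on success, -1 on error.
 */
int drop_to_uid(uid_t uid)
{
    if (setuid(uid) != 0) {
        perror("setuid");
        return -1;      /* caller must not carry on with old privileges */
    }

    /* Extra check: real and effective IDs must both match the target. */
    if (getuid() != uid || geteuid() != uid)
        return -1;

    return 0;
}
```

A daemon using such a helper would abort the child instead of calling execve() when it returns -1.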

    The NPROC limit can still be enforced in the common code flow of daemons
    spawning user processes. Most daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With the change only
    this process would fail on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter stepped down
    under the limit, a deferred execve() failure because the NPROC limit was
    exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    A similar check was introduced in -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

27 Jul, 2011

7 commits

  • acct_arg_size() takes ->page_table_lock around add_mm_counter() if
    !SPLIT_RSS_COUNTING. This is not needed after commit 172703b08cd0 ("mm:
    delete non-atomic mm counter implementation").

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Matt Fleming
    Cc: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If CONFIG_MODULES=n, it makes no sense to retry the list of binary format
    handlers because the list will not be modified by request_module().

    Signed-off-by: Tetsuo Handa
    Cc: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Currently, search_binary_handler() tries to load a binary loader module
    using request_module() if a loader for the requested program is not yet
    loaded. But a second attempt at request_module() does not affect the
    result of search_binary_handler().

    If request_module() triggers recursion, calling request_module() twice
    causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is
    not an infinite loop, but it is long enough for users to consider it a
    hang.

    Therefore, this patch changes search_binary_handler() not to call
    request_module() twice, making 1 to the power of MAX_KMOD_CONCURRENT
    repetitions in case of recursion.

    Signed-off-by: Tetsuo Handa
    Reported-by: Richard Weinberger
    Tested-by: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between
    shift_arg_pages() and rmap_walk() during migration by not migrating
    temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
    and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile
    time one, so BUILD_BUG_ON is more appropriate.

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • do_coredump() assumes that if format_corename() fails it should return
    -ENOMEM. This is not true, for example cn_print_exe_file() can propagate
    the error from d_path. Even if it was true, this is too fragile. Change
    the code to check "ispipe < 0".

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jiri Slaby
    Reviewed-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change every occurrence of / in comm and hostname to !. If the process
    changes its name to contain /, the core is not dumped (if the directory
    tree doesn't exist like that). The same happens with a hostname like
    myhost/3. Fix this behaviour by using the escape loop used in %E.
    (We extract it to a separate function.)

    Now both with comm == myprocess/1 and hostname == myhost/1, the core is
    dumped like (kernel.core_pattern='core.%p.%e.%h'):
    core.2349.myprocess!1.myhost!1
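
The escape loop can be pictured with a small userspace sketch. The function name escape_slashes() is made up for illustration and is not claimed to match the kernel helper:

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative only: replace every '/' in s with '!' in place, so the
 * string cannot introduce a directory component into a file name.
 */
void escape_slashes(char *s)
{
    assert(s != NULL);

    for (; *s; s++)
        if (*s == '/')
            *s = '!';
}
```

Applied to the comm and hostname values from the example above, this yields the myprocess!1 and myhost!1 components of the core file name.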

    Signed-off-by: Jiri Slaby
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • If we don't know the file corresponding to the binary (i.e. exe_file is
    unknown), use "task->comm (path unknown)" instead of simple "(unknown)"
    as suggested by ak.

    The fallback is the same as %e except it will append "(path unknown)".

    Signed-off-by: Jiri Slaby
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

23 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
    vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
    isofs: Remove global fs lock
    jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
    fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
    mm/truncate.c: fix build for CONFIG_BLOCK not enabled
    fs:update the NOTE of the file_operations structure
    Remove dead code in dget_parent()
    AFS: Fix silly characters in a comment
    switch d_add_ci() to d_splice_alias() in "found negative" case as well
    simplify gfs2_lookup()
    jfs_lookup(): don't bother with . or ..
    get rid of useless dget_parent() in btrfs rename() and link()
    get rid of useless dget_parent() in fs/btrfs/ioctl.c
    fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
    drivers: fix up various ->llseek() implementations
    fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
    Ext4: handle SEEK_HOLE/SEEK_DATA generically
    Btrfs: implement our own ->llseek
    fs: add SEEK_HOLE and SEEK_DATA flags
    reiserfs: make reiserfs default to barrier=flush
    ...

    Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
    shrinker callout for the inode cache, that clashed with the xfs code to
    start the periodic workers later.

    Linus Torvalds
     
  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

22 Jul, 2011

1 commit

  • Test-case:

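    #include <assert.h>
    #include <pthread.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void *tfunc(void *arg)
    {
        /* a sub-thread execs, so de_thread() releases the old leader */
        execvp("true", NULL);
        return NULL;
    }

    int main(void)
    {
        int pid;

        if (fork()) {
            pthread_t t;

            /* parent: stop until the child has attached and continued us */
            kill(getpid(), SIGSTOP);

            pthread_create(&t, NULL, tfunc, NULL);

            for (;;)
                pause();
        }

        /* child: trace the parent */
        pid = getppid();
        assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);

        while (wait(NULL) > 0)
            ptrace(PTRACE_CONT, pid, 0,0);

        return 0;
    }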

    It is racy, exit_notify() does __wake_up_parent() too. But in the
    likely case it triggers the problem: de_thread() does release_task()
    and the old leader goes away without the notification, the tracer
    sleeps in do_wait() without children/tracees.

    Change de_thread() to do __wake_up_parent(traced_leader->parent).
    Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
    EXIT_DEAD threads do not exist from do_wait's pov.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     

20 Jul, 2011

1 commit


02 Jul, 2011

1 commit

  • When a multithreaded program execs under ptrace,
    all traced threads report WIFEXITED status, except for
    the thread group leader and the thread which execs.

    Unless the tracer tracks thread group relationships between tracees,
    which is a nontrivial task, it will not detect that
    the execed thread no longer exists.

    This patch allows tracer to figure out which thread
    performed this exec, by requesting PTRACE_GETEVENTMSG
    in PTRACE_EVENT_EXEC stop.

    Another, smaller problem which is solved by this patch
    is that the tracer can now figure out which of the several
    concurrent execs in a multithreaded program succeeded.

    Signed-off-by: Denys Vlasenko
    Signed-off-by: Oleg Nesterov

    Denys Vlasenko
     

28 Jun, 2011

1 commit

  • Change de_thread() to set old_leader->exit_signal = -1. This is
    good for the consistency, it is no longer the leader and all
    sub-threads have exit_signal = -1 set by copy_process(CLONE_THREAD).

    And this allows us to micro-optimize thread_group_leader(), it can
    simply check exit_signal >= 0. This also makes sense because we
    should move ->group_leader from task_struct to signal_struct.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     

23 Jun, 2011

2 commits

  • At this point, tracehooks aren't useful to mainline kernel and mostly
    just add an extra layer of obfuscation. Although they have comments,
    without actual in-kernel users, it is difficult to tell what their
    assumptions are and what they're actually trying to achieve. To mainline
    kernel, they just aren't worth keeping around.

    This patch kills the following clone and exec related tracehooks.

    tracehook_prepare_clone()
    tracehook_finish_clone()
    tracehook_report_clone()
    tracehook_report_clone_complete()
    tracehook_unsafe_exec()

    The changes are mostly trivial - logic is moved to the caller and
    comments are merged and adjusted appropriately.

    The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE*
    are OR'd to bprm->unsafe instead of setting it, which produces the
    same result as the field is always zero on entry. It also tests
    p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which
    also gives the same result.

    This doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
  • At this point, tracehooks aren't useful to mainline kernel and mostly
    just add an extra layer of obfuscation. Although they have comments,
    without actual in-kernel users, it is difficult to tell what their
    assumptions are and what they're actually trying to achieve. To mainline
    kernel, they just aren't worth keeping around.

    This patch kills the following trivial tracehooks.

    * Ones testing whether task is ptraced. Replace with ->ptrace test.

    tracehook_expect_breakpoints()
    tracehook_consider_ignored_signal()
    tracehook_consider_fatal_signal()

    * ptrace_event() wrappers. Call directly.

    tracehook_report_exec()
    tracehook_report_exit()
    tracehook_report_vfork_done()

    * ptrace_release_task() wrapper. Call directly.

    tracehook_finish_release_task()

    * noop

    tracehook_prepare_release_task()
    tracehook_report_death()

    This doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Cc: Martin Schwidefsky
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

18 Jun, 2011

1 commit

  • ____call_usermodehelper() now erases any credentials set by the
    subprocess_info::init() function. The problem is that commit
    17f60a7da150 ("capabilites: allow the application of capability limits
    to usermode helpers") creates and commits new credentials with
    prepare_kernel_cred() after the call to the init() function. This wipes
    all keyrings after umh_keys_init() is called.

    The best way to deal with this is to put the init() call just prior to
    the commit_creds() call, and pass the cred pointer to init(). That
    means that umh_keys_init() and suchlike can modify the credentials
    _before_ they are published and potentially in use by the rest of the
    system.

    This prevents request_key() from working as it is prevented from passing
    the session keyring it set up with the authorisation token to
    /sbin/request-key, and so the latter can't assume the authority to
    instantiate the key. This causes the in-kernel DNS resolver to fail
    with ENOKEY unconditionally.

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Tested-by: Jeff Layton
    Signed-off-by: Linus Torvalds

    David Howells
     

16 Jun, 2011

2 commits

  • This reverts commit 7f81c8890c15a10f5220bebae3b6dfae4961962a.

    It turns out that it's not actually a build-time check on x86-64 UML,
    which does some seriously crazy stuff with VM_STACK_FLAGS.

    The VM_STACK_FLAGS define depends on the arch-supplied
    VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have

    arch/um/sys-x86_64/shared/sysdep/vm-flags.h:

    #define VM_STACK_DEFAULT_FLAGS \
    (test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)

    #define VM_STACK_DEFAULT_FLAGS vm_stack_flags

    (yes, seriously: two different #define's for that thing, with the first
    one being inside an "#ifdef TIF_IA32")

    It's possible that it is UML that should just be fixed in this area, but
    for now let's just undo the (very small) optimization.

    Reported-by: Randy Dunlap
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between shift_arg_pages()
    and rmap_walk() during migration by not migrating temporary stacks")
    introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
    VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time
    one, so BUILD_BUG_ON is more appropriate.

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Jun, 2011

1 commit

  • Unconditionally changing the address limit to USER_DS and not restoring
    it to its old value in the error path is wrong because it prevents us
    using kernel memory on repeated calls to this function. This, in fact,
    breaks the fallback of hard coded paths to the init program from being
    ever successful if the first candidate fails to load.

    With this patch applied switching to USER_DS is delayed until the point
    of no return is reached which makes it possible to have a multi-arch
    rootfs with one arch specific init binary for each of the (hard coded)
    probed paths.

    Since the address limit is already set to USER_DS when start_thread()
    will be invoked, this redundancy can be safely removed.

    Signed-off-by: Mathias Krause
    Cc: Al Viro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

05 Jun, 2011

3 commits

  • JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to
    (re)transit into TRACED. task_clear_jobctl_trapping() must be called
    when either tracee enters TRACED or the transition is cancelled for
    some reason. The former is achieved by explicitly calling
    task_clear_jobctl_trapping() in ptrace_stop() and the latter by calling
    it at the end of do_signal_stop().

    Calling task_clear_jobctl_trapping() at the end of do_signal_stop()
    limits the scope TRAPPING can be used and is fragile in that seemingly
    unrelated changes to tracee's control flow can lead to stuck TRAPPING.

    We already have task_clear_jobctl_pending() calls on those cancelling
    events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by
    making those call sites use JOBCTL_PENDING_MASK instead and updating
    task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is
    called automatically if no stop/trap is pending.

    This patch makes the above changes and removes the fallback
    task_clear_jobctl_trapping() call from do_signal_stop().

    Signed-off-by: Tejun Heo
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
  • This patch introduces JOBCTL_PENDING_MASK and replaces
    task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
    which takes an extra @mask argument.

    JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
    future patches will add more bits. recalc_sigpending_tsk() is updated
    to use JOBCTL_PENDING_MASK instead.

    task_clear_jobctl_pending() takes @mask which is a subset of
    JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If
    JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All
    task_clear_jobctl_stop_pending() users are updated to call
    task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
    functionally identical to task_clear_jobctl_stop_pending().
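
The mask semantics described above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: the EX_ names and bit positions are invented for illustration and are not the kernel's definitions; only the rule "clearing STOP_PENDING also clears the other STOP bits" is taken from the text.

```c
#include <assert.h>

/* Illustrative flag values only (assumed, not the kernel's). */
#define EX_STOP_DEQUEUED (1u << 16)
#define EX_STOP_PENDING  (1u << 17)
#define EX_STOP_CONSUME  (1u << 18)
#define EX_PENDING_MASK  EX_STOP_PENDING

/* Clear the pending bits named by mask from *jobctl. */
void ex_clear_pending(unsigned int *jobctl, unsigned int mask)
{
    /* Callers may only pass bits that belong to the pending mask. */
    assert((mask & ~EX_PENDING_MASK) == 0);

    /* Clearing STOP_PENDING drags the other STOP bits along with it. */
    if (mask & EX_STOP_PENDING)
        mask |= EX_STOP_CONSUME | EX_STOP_DEQUEUED;

    *jobctl &= ~mask;
}
```

Because the pending mask holds a single bit at this point, calling the sketch with EX_STOP_PENDING behaves like the old single-purpose helper.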

    This patch doesn't cause any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
  • signal->group_stop currently hosts mostly group stop related flags;
    however, it's gonna be used for wider purposes and the GROUP_STOP_
    flag prefix becomes confusing. Rename signal->group_stop to
    signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.

    Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
    defined in terms of them to allow using bitops later.

    While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate
    future additions.
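
Defining each flag in terms of its bit position might look like the userspace sketch below. The DEMO_ names are hypothetical; bit 22 for the trapping flag comes from the text, while the position chosen for the stop-pending flag is an assumption made only for the example.

```c
#include <assert.h>

/* Bit positions first ... */
#define DEMO_STOP_PENDING_BIT 17    /* assumed position, for illustration */
#define DEMO_TRAPPING_BIT     22    /* position named in the commit text */

/* ... then the flags derived from them. */
#define DEMO_STOP_PENDING (1u << DEMO_STOP_PENDING_BIT)
#define DEMO_TRAPPING     (1u << DEMO_TRAPPING_BIT)

/* Bit-number based helpers, loosely in the spirit of kernel bitops. */
static inline void demo_set_bit(int nr, unsigned int *word)
{
    *word |= 1u << nr;
}

static inline int demo_test_bit(int nr, unsigned int word)
{
    return (int)((word >> nr) & 1u);
}
```

Keeping the bit number and the mask in sync this way lets the same flag be used both with mask arithmetic and with helpers that take a bit index.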

    This doesn't cause any functional change.

    -v2: JOBCTL_*_BIT macros added as suggested by Linus.

    Signed-off-by: Tejun Heo
    Cc: Linus Torvalds
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

27 May, 2011

2 commits

  • Now, exe_file is not proc FS dependent, so we can use it to name the core
    file. So we add a %E pattern for core file name creation, which extracts
    the path from mm_struct->exe_file. Then it converts slashes to exclamation
    marks and pastes the result into the core file name itself.

    This is useful for environments where binary names are longer than 16
    characters (the current->comm limitation), where there are binaries
    with the same name but in different paths, or where the binary itself
    changes its current->comm after exec.

    So by doing (s/$/#/ -- # is treated as git comment):

    $ sysctl kernel.core_pattern='core.%p.%e.%E'
    $ ln /bin/cat cat45678901234567890
    $ ./cat45678901234567890
    ^Z
    $ rm cat45678901234567890
    $ fg
    ^\Quit (core dumped)
    $ ls core*

    we now get:

    core.2434.cat456789012345.!root!cat45678901234567890 (deleted)

    Signed-off-by: Jiri Slaby
    Cc: Al Viro
    Cc: Alan Cox
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will need the exe_file functionality also for core dumps (so the core name
    can contain the full binary path), build this functionality always into
    the kernel.

    To achieve that, move it out of proc FS to kernel/, where in fact it
    belongs. By doing that we can make dup_mm_exe_file static. Also we
    can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

25 May, 2011

2 commits

  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that uses this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, these make part of the mm
    a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to
    mutexes which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.
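
The "allocate a page, else fall back to a small array" idea can be sketched in userspace C as below. Everything named demo_* is hypothetical, the page size and slot count are assumptions, and the fallback array is on the stack only when the caller declares the struct as a local variable, as the usage in this sketch assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_BYTES  4096   /* assumed page size for the sketch */
#define DEMO_LOCAL_SLOTS 8      /* size of the small fallback array */

struct demo_gather {
    void  **slots;                      /* where pointers are batched */
    size_t  cap;                        /* number of usable slots */
    size_t  used;                       /* slots filled so far */
    int     on_heap;                    /* 1 if slots came from the allocator */
    void   *local[DEMO_LOCAL_SLOTS];    /* fallback storage inside the struct */
};

/* Stand-in allocator that always fails, to exercise the fallback path. */
void *demo_alloc_fail(size_t n)
{
    (void)n;
    return NULL;
}

/* Try to get a whole page for batching; fall back to the local array. */
void demo_gather_init(struct demo_gather *g, void *(*alloc)(size_t))
{
    void *page = alloc(DEMO_PAGE_BYTES);

    if (page) {
        g->slots = page;
        g->cap = DEMO_PAGE_BYTES / sizeof(void *);
        g->on_heap = 1;
    } else {
        g->slots = g->local;
        g->cap = DEMO_LOCAL_SLOTS;
        g->on_heap = 0;
    }
    g->used = 0;
}

/* Queue one item; returns 0 when the batch is full and must be flushed. */
int demo_gather_add(struct demo_gather *g, void *item)
{
    if (g->used == g->cap)
        return 0;
    g->slots[g->used++] = item;
    return 1;
}

/* In this sketch a flush simply forgets the batched items. */
void demo_gather_flush(struct demo_gather *g)
{
    g->used = 0;
}

/* Release the page if one was allocated. */
void demo_gather_finish(struct demo_gather *g)
{
    if (g->on_heap)
        free(g->slots);
}
```

With the fallback the batch is small, so flushes happen more often, but forward progress is still made when no page can be allocated.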

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock, this is
    useful even without the i_mmap_lock conversion as it significantly reduces
    pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and make those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 May, 2011

1 commit

  • * 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
    serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
    tty/serial: Fix break handling for PORT_TEGRA
    tty/serial: Add explicit PORT_TEGRA type
    n_tracerouter and n_tracesink ldisc additions.
    Intel PTI implementaiton of MIPI 1149.7.
    Kernel documentation for the PTI feature.
    export kernel call get_task_comm().
    tty: Remove to support serial for S5P6442
    pch_phub: Support new device ML7223
    8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
    ASoC: Update cx20442 for TTY API change
    pch_uart: Support new device ML7223 IOH
    parport: Use request_muxed_region for IT87 probe and lock
    tty/serial: add support for Xilinx PS UART
    n_gsm: Use print_hex_dump_bytes
    drivers/tty/moxa.c: Put correct tty value
    TTY: tty_io, annotate locking functions
    TTY: serial_core, remove superfluous set_task_state
    TTY: serial_core, remove invalid test
    Char: moxa, fix locking in moxa_write
    ...

    Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
    drivers/tty/serial/Makefile.

    I did the hci_ldisc thing as an evil merge, cleaning things up.

    Linus Torvalds
     

23 May, 2011

1 commit


14 May, 2011

1 commit

  • This allows drivers that call this function to be compiled modularly.
    Otherwise, a driver that needs this type of functionality
    has to implement its own get_task_comm() call, causing code
    duplication in the Linux source tree.

    Signed-off-by: J Freyensee
    Acked-by: David Rientjes
    Signed-off-by: Greg Kroah-Hartman

    J Freyensee
     

09 Apr, 2011

4 commits

  • Add the comment to explain acct_arg_size().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro

    Oleg Nesterov
     
  • Add the appropriate members into struct user_arg_ptr and teach
    get_user_arg_ptr() to handle is_compat = T case correctly.

    This allows us to remove the compat_do_execve() code from fs/compat.c
    and reimplement compat_do_execve() as the trivial wrapper on top of
    do_execve_common(is_compat => true).
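
A userspace model of the idea, reading one entry from either a native or a compat pointer array through a single opaque type, might look like this. The demo_* names are hypothetical and plain integers stand in for user-space pointers; this is not the kernel's struct user_arg_ptr or get_user_arg_ptr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque argv/envp handle: one flag plus one of two array layouts. */
struct demo_arg_ptr {
    bool is_compat;
    union {
        const uint64_t *native;    /* array of 64-bit pointer values */
        const uint32_t *compat;    /* array of 32-bit pointer values */
    } ptr;
};

/* Fetch entry nr, widening a compat entry to the native width. */
uint64_t demo_get_arg_ptr(struct demo_arg_ptr argv, size_t nr)
{
    if (argv.is_compat)
        return argv.ptr.compat[nr];

    return argv.ptr.native[nr];
}
```

Callers such as a counting or copying loop then need no knowledge of which layout they are walking, which is what makes a shared exec path possible.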

    In fact, this fixes another (minor) bug. "compat_uptr_t str" can
    overflow after "str += len" in compat_copy_strings() if a 64bit
    application execs via sys32_execve().

    Unexport acct_arg_size() and get_arg_page(), fs/compat.c doesn't
    need them any longer.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
     
  • No functional changes, preparation.

    Introduce struct user_arg_ptr, change do_execve() paths to use it
    instead of "char __user * const __user *argv".

    This makes the argv/envp arguments opaque, we are ready to handle the
    compat case which needs argv pointing to compat_uptr_t.

    Suggested-by: Linus Torvalds
    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
     
  • Introduce get_user_arg_ptr() helper, convert count() and copy_strings()
    to use it.

    No functional changes, preparation. This helper is trivial, it just
    reads the pointer from argv/envp user-space array.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
     

23 Mar, 2011

1 commit

  • Currently task->signal->group_stop_count is used to decide whether to
    stop for group stop. However, if there is a task in the group which
    is taking a long time to stop, other tasks which are continued by
    ptrace would repeatedly stop for the same group stop until the group
    stop is complete.

    Conversely, if a ptraced task is in TASK_TRACED state, the debugger
    won't get notified of group stops which is inconsistent compared to
    the ptraced task in any other state.

    This patch introduces GROUP_STOP_PENDING which tracks whether a task
    is yet to stop for the group stop in progress. The flag is set when a
    group stop starts and cleared when the task stops the first time for
    the group stop, and consulted whenever whether the task should
    participate in a group stop needs to be determined. Note that now
    tasks in TASK_TRACED also participate in group stop.

    This results in the following behavior changes.

    * For a single group stop, a ptracer would see at most one stop
    reported.

    * A ptracee in TASK_TRACED now also participates in group stop and the
    tracer would get the notification. However, as a ptraced task could
    be in TASK_STOPPED state or any ptrace trap could consume group
    stop, the notification may still be missing. These will be
    addressed with further patches.

    * A ptracee may start a group stop while one is still in progress if
    the tracer let it continue with stop signal delivery. Group stop
    code handles this correctly.

    Oleg:

    * Spotted that a task might skip signal check even when its
    GROUP_STOP_PENDING is set. Fixed by updating
    recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
    group_stop_count.

    * Pointed out that task->group_stop should be cleared whenever
    task->signal->group_stop_count is cleared. Fixed accordingly.

    * Pointed out the behavior inconsistency between TASK_TRACED and
    RUNNING and the last behavior change.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     

21 Mar, 2011

1 commit

  • Hi,

    I was backporting the coredump over pipe feature and noticed this small typo,
    I wish I would have something bigger to contribute...

    >From 15d6080e0ed4267da103c706917a33b1015e8804 Mon Sep 17 00:00:00 2001
    From: Holger Hans Peter Freyther
    Date: Thu, 24 Feb 2011 17:42:50 +0100
    Subject: [PATCH] fs: Fix a small typo in the comment

    The function is called umh_pipe_setup not uhm_pipe_setup.

    Signed-off-by: Holger Hans Peter Freyther
    Signed-off-by: Al Viro

    Holger Hans Peter Freyther
     

14 Mar, 2011

1 commit

  • take calculation of open_flags by open(2) arguments into new helper
    in fs/open.c, move filp_open() over there, have it and do_sys_open()
    use that helper, switch exec.c callers of do_filp_open() to explicit
    (and constant) struct open_flags.

    Signed-off-by: Al Viro

    Al Viro