12 Nov, 2009

1 commit

  • In setup_arg_pages we work hard to assign a value to ret, but on exit we
    always return 0.

    Also remove a now duplicated exit path and branch to out_unlock instead.
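
    The pattern being fixed can be sketched in user space as follows. This is a minimal model, not the kernel function: `struct fake_mm`, `setup_pages_model()` and the two failure flags are hypothetical names invented for illustration. Every failure path stores its error in `ret` and branches to the single `out_unlock` label, which returns `ret` instead of a hard-coded 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real mm locking; it only tracks lock state. */
struct fake_mm {
    bool locked;
};

/*
 * Model of a function with one shared exit label. Each step stores its
 * result in ret; on failure we jump to out_unlock, which drops the lock
 * and returns ret (the bug described above was returning 0 here).
 */
int setup_pages_model(struct fake_mm *mm, bool protect_fails, bool expand_fails)
{
    int ret;

    assert(!mm->locked);
    mm->locked = true;

    ret = protect_fails ? -ENOMEM : 0;
    if (ret)
        goto out_unlock;

    ret = expand_fails ? -EFAULT : 0;

out_unlock:
    mm->locked = false;
    return ret;
}
```

    With a single exit label the unlock cannot be forgotten on any path, and the caller sees the error that was actually computed.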

    Signed-off-by: Anton Blanchard
    Acked-by: Serge Hallyn
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

24 Sep, 2009

5 commits

  • Because the binfmt does not differ between threads in the same process,
    it can be moved from task_struct to mm_struct. The binfmt module is then
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • sys_delete_module() can set MODULE_STATE_GOING after
    search_binary_handler() does try_module_get(). In this case
    set_binfmt()->try_module_get() fails but since none of the callers
    check the returned error, the task will run with the wrong old
    ->binfmt.

    The proper fix should change all ->load_binary() methods, but we can
    rely on the fact that the caller must hold a reference to binfmt->module
    and use __module_get(), which never fails.
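
    The difference between the two reference-taking calls can be modelled in a few lines. This is a simplified sketch under stated assumptions: `struct fake_module`, `try_get_model()` and `get_model()` are hypothetical user-space stand-ins, and the real kernel functions involve more state than a single counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely mirroring MODULE_STATE_LIVE / MODULE_STATE_GOING. */
enum mod_state { MOD_LIVE, MOD_GOING };

struct fake_module {
    enum mod_state state;
    int refcount;
};

/* Like try_module_get(): refuses once the module has started going away. */
bool try_get_model(struct fake_module *m)
{
    if (m->state == MOD_GOING)
        return false;
    m->refcount++;
    return true;
}

/*
 * Like __module_get(): the caller must already hold a reference, so the
 * module cannot disappear underneath us and the call cannot fail.
 */
void get_model(struct fake_module *m)
{
    assert(m->refcount > 0);
    m->refcount++;
}
```

    Because the exec path already holds a reference when it reaches set_binfmt(), the unconditional variant is safe there and removes the unchecked failure case.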

    Signed-off-by: Oleg Nesterov
    Acked-by: Rusty Russell
    Cc: Hiroshi Shimamoto
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Allow core_pattern pipes to wait for user space to complete

    One of the things that user space processes like to do is look at metadata
    for a crashing process in its /proc/<pid> directory. This is racy,
    however, since do_coredump in the kernel doesn't wait for the user space
    process to complete before it reaps the crashing process. This patch
    corrects that, allowing the kernel to wait for the user space process to
    complete before cleaning up the crashing process. This is a bit tricky to
    do for a few reasons:

    1) The user space process isn't our child, so we can't sys_wait4 on it
    2) We need to close the pipe before waiting for the user process to complete,
    since the user process may rely on an EOF condition

    I've discussed several solutions with Oleg Nesterov off-list about this,
    and this is the one we've come up with. We add ourselves as a pipe reader
    (to prevent premature cleanup of the pipe_inode_info), and remove
    ourselves as a writer (to provide an EOF condition to the reader in user
    space), then we iterate until the user space process exits (which we
    detect by pipe->readers == 1, hence the > 1 check in the loop). When we
    exit the loop, we restore the proper reader/writer values, then we return
    and let filp_close in do_coredump clean up the pipe data properly.
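
    The reader/writer bookkeeping described above can be sketched as a small user-space model. Everything here is an assumption made for illustration: `struct fake_pipe`, `wait_for_helper_model()` and the `wait_once` callback are hypothetical, and the callback stands in for sleeping on the pipe until something changes.

```c
#include <assert.h>

struct fake_pipe {
    int readers;
    int writers;
};

/* Hypothetical callback: stands in for sleeping until a pipe event occurs. */
typedef void (*wait_fn)(struct fake_pipe *pipe);

/* Example event: the user space helper exits and drops its reader count. */
void helper_exits(struct fake_pipe *pipe)
{
    pipe->readers--;
}

/* Returns how many times we had to wait before the helper went away. */
int wait_for_helper_model(struct fake_pipe *pipe, wait_fn wait_once)
{
    int iterations = 0;

    assert(pipe->writers > 0);
    pipe->readers++;    /* pin the pipe so it is not cleaned up early */
    pipe->writers--;    /* no writers left: the helper sees EOF */

    while (pipe->readers > 1) {    /* > 1 because we count as a reader */
        wait_once(pipe);
        iterations++;
    }

    pipe->readers--;    /* restore the original counts before returning */
    pipe->writers++;
    return iterations;
}
```

    Starting from one reader (the helper) and one writer (the dumping task), the loop ends as soon as the helper's reader count is gone, and the counts are restored so the normal close path can clean up.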

    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Introduce core pipe limiting sysctl.

    Since we can dump cores to a pipe, rather than directly to the filesystem,
    we create a condition in which a user can create a very high load on the
    system simply by running bad applications.

    If the pipe reader specified in core_pattern is poorly written, we can
    have lots of outstanding resources and processes in the system.

    This sysctl introduces an ability to limit that resource consumption.
    core_pipe_limit defines how many in-flight dumps may be run in parallel;
    dumps beyond this value are skipped and a note is made in the kernel log.
    A special value of 0 in core_pipe_limit means that an unlimited number of
    core dumps may be handled (this is the default value).
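
    The limiting rule can be sketched with a C11 atomic counter. This is a minimal model, not the kernel code: `core_pipe_limit_model`, `dumps_in_flight`, `dump_begin()` and `dump_end()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

unsigned int core_pipe_limit_model;    /* 0 means unlimited (the default) */
atomic_uint dumps_in_flight;           /* number of dumps currently running */

/* Returns true if this dump may proceed, false if it must be skipped. */
bool dump_begin(void)
{
    unsigned int count = atomic_fetch_add(&dumps_in_flight, 1) + 1;

    if (core_pipe_limit_model && count > core_pipe_limit_model) {
        /* Over the limit: undo our increment; the caller logs and skips. */
        atomic_fetch_sub(&dumps_in_flight, 1);
        return false;
    }
    return true;
}

/* Must be called once for every dump_begin() that returned true. */
void dump_end(void)
{
    assert(atomic_load(&dumps_in_flight) > 0);
    atomic_fetch_sub(&dumps_in_flight, 1);
}
```

    Incrementing first and checking afterwards keeps the decision race-free without a lock: two dumps arriving at the same time cannot both observe a free slot.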

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Change how we detect recursive dumps.

    Currently we have a mechanism by which we try to compare pathnames of the
    crashing process to the core_pattern path. This is broken for a dozen
    reasons, and just doesn't work in any sort of robust way.

    I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
    set RLIMIT_CORE to zero, we don't write out core files for any process
    with that particular limit set. If the core_pattern is a pipe, any
    non-zero limit is translated to RLIM_INFINITY.

    This allows complete dumps to be captured, but prevents infinite recursion
    in the event that the core_pattern process itself crashes.
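
    The decision rule can be written down as a small pure function. This is a simplified sketch: `core_limit_model()` is a hypothetical helper, and the real dump path applies further checks that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Returns false when no core should be written at all. Otherwise stores
 * the limit that the dump should run under in *effective.
 */
bool core_limit_model(rlim_t limit, bool pattern_is_pipe, rlim_t *effective)
{
    assert(effective != NULL);

    if (limit == 0)
        return false;    /* a crashing helper (RLIMIT_CORE == 0) dumps nothing */

    /* A pipe helper should receive the complete dump, whatever the limit. */
    *effective = pattern_is_pipe ? RLIM_INFINITY : limit;
    return true;
}
```

    A zero limit therefore acts as the recursion guard, while any non-zero limit still yields a full dump when the pattern is a pipe.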

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     

23 Sep, 2009

2 commits

  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/pid/status by the "VmStk" value. But you get no
    information about the consumed stack memory of the threads.

    There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack:
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "Stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
    base address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Make the ->ru_maxrss value in struct rusage be filled according to the rss
    hiwater mark. This struct is filled as a parameter to the getrusage syscall.
    The ->ru_maxrss value is set in KBs, which is the way it is done in BSD
    systems. The /usr/bin/time (GNU time) application converts ->ru_maxrss to
    KBs, which seems to be incorrect behavior. The maintainer of this utility
    was notified by me with the patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss, which we use to store the rss hiwater of the task.
    The second one is ->cmaxrss, which we use to store the highest rss hiwater
    of all the task's children. These values are used in k_getrusage() to
    actually fill ->ru_maxrss. k_getrusage() uses the current rss hiwater value
    directly if the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior: *BSD getrusage has exec() inheriting.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
        int status;

        printf("allocate 100MB\n");
        consume(100);

        printf("testcase1: fork inherit? \n");
        printf(" expect: initial.self ~= child.self\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase2: fork inherit? (cont.) \n");
        printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("child");
            _exit(0);
        }
        printf("\n");

        printf("testcase3: fork + malloc \n");
        printf(" expect: child.self ~= initial.self + 50MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            printf("allocate +50MB\n");
            consume(50);
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase4: grandchild maxrss\n");
        printf(" expect: post_wait.children ~= 300MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 0 -g 300");
            _exit(0);
        }
        printf("\n");

        printf("testcase5: zombie\n");
        printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
        printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
        show_rusage("initial");
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("pre_wait");
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 400");
            _exit(0);
        }
        printf("\n");

        printf("testcase6: SIG_IGN\n");
        printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
        show_rusage("initial");
        signal(SIGCHLD, SIG_IGN);
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("after_zombie");
        } else {
            system("./child -n 500");
            _exit(0);
        }
        printf("\n");
        signal(SIGCHLD, SIG_DFL);

        printf("testcase7: exec (without fork) \n");
        printf(" expect: initial ~= exec \n");
        show_rusage("initial");
        execl("./child", "child", "-v", NULL);

        return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
        int status;
        int c;
        long consume_size = 0;
        long grandchild_consume_size = 0;
        int show = 0;

        while ((c = getopt(argc, argv, "n:g:v")) != -1) {
            switch (c) {
            case 'n':
                consume_size = atol(optarg);
                break;
            case 'v':
                show = 1;
                break;
            case 'g':
                grandchild_consume_size = atol(optarg);
                break;
            default:
                break;
            }
        }

        if (show)
            show_rusage("exec");

        if (consume_size) {
            printf("child alloc %ldMB\n", consume_size);
            consume(consume_size);
        }

        if (grandchild_consume_size) {
            if (fork()) {
                wait(&status);
            } else {
                printf("grandchild alloc %ldMB\n", grandchild_consume_size);
                consume(grandchild_consume_size);

                exit(0);
            }
        }

        return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
        int err, err2;
        struct rusage rusage_self;
        struct rusage rusage_children;

        printf("%s: ", prefix);
        err = getrusage(RUSAGE_SELF, &rusage_self);
        if (!err)
            printf("self %ld ", rusage_self.ru_maxrss);
        err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
        if (!err2)
            printf("children %ld ", rusage_children.ru_maxrss);

        printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
        void *addr;
        int size = getpagesize();
        int i;

        for (i=0; i

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out of its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks go to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
        -e 's/PERF_EVENT_/PERF_RECORD_/g' \
        -e 's/PERF_COUNTER/PERF_EVENT/g' \
        -e 's/perf_counter/perf_event/g' \
        -e 's/nb_counters/nb_events/g' \
        -e 's/swcounter/swevent/g' \
        -e 's/tpcounter_event/tp_event/g' \
        $FILES

    for N in $(find . -name perf_counter.[ch]); do
        M=$(echo $N | sed 's/perf_counter/perf_event/g')
        mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
        -e 's/COUNTER_MASK/REG_MASK/g' \
        -e 's/COUNTER/EVENT/g' \
        -e 's/\<event\>/event_id/g' \
        -e 's/counter/event/g' \
        -e 's/Counter/Event/g' \
        $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 Sep, 2009

1 commit

  • Tom Horsley reports that his debugger hangs when it tries to read
    /proc/pid_of_tracee/maps. This has happened since the

    "mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
    04b836cbf19e885f8366bccb2e4b0474346c02d

    commit in 2.6.31.

    But the root of the problem lies in the fact that do_execve() path calls
    tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.

    The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
    remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
    another task doing PTRACE_ATTACH should not hang until it is killed or the
    tracee resumes.

    With this patch do_execve() does not use ->cred_guard_mutex directly and
    we do not hold it throughout, instead:

    - introduce prepare_bprm_creds() helper, it locks the mutex
    and calls prepare_exec_creds() to initialize bprm->cred.

    - install_exec_creds() drops the mutex after commit_creds(),
    and thus before tracehook_report_exec()->ptrace_stop().

    or, if exec fails,

    free_bprm() drops this mutex when bprm->cred != NULL which
    indicates install_exec_creds() was not called.
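
    The lock hand-off described in the list above can be modelled in user space. This is a sketch under stated assumptions: the `*_model` functions and the `fake_task` / `fake_bprm` structures are hypothetical, and a boolean stands in for ->cred_guard_mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_task {
    bool cred_guard_locked;    /* stands in for ->cred_guard_mutex */
};

struct fake_bprm {
    struct fake_task *task;
    void *cred;                /* non-NULL while new creds are pending */
};

static int dummy_cred;         /* placeholder object for prepared creds */

/* Takes the lock and prepares the new credentials. */
void prepare_creds_model(struct fake_bprm *bprm)
{
    assert(!bprm->task->cred_guard_locked);
    bprm->task->cred_guard_locked = true;
    bprm->cred = &dummy_cred;
}

/* Success path: commit the creds, then drop the lock before any ptrace stop. */
void install_creds_model(struct fake_bprm *bprm)
{
    bprm->cred = NULL;
    bprm->task->cred_guard_locked = false;
}

/* Cleanup: pending creds mean install never ran, so the lock is still held. */
void free_bprm_model(struct fake_bprm *bprm)
{
    if (bprm->cred != NULL) {
        bprm->cred = NULL;
        bprm->task->cred_guard_locked = false;
    }
}
```

    Using the pending-creds pointer as the "lock still held" indicator means cleanup needs no extra flag and cannot unlock twice.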

    Reported-by: Tom Horsley
    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Roland McGrath
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

24 Aug, 2009

1 commit

  • vfs_read() offset is defined as loff_t, but kernel_read()
    offset is only defined as unsigned long. Redefine
    kernel_read() offset as loff_t.
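
    The effect of the narrower type can be illustrated in user space, assuming a platform where unsigned long is 32 bits wide. Both helpers below are hypothetical; `uint32_t` and `int64_t` stand in for the 32-bit unsigned long and for loff_t respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Models passing a file offset through a 32-bit unsigned long parameter. */
int64_t offset_via_32bit_param(int64_t requested)
{
    assert(requested >= 0);
    uint32_t narrow = (uint32_t)requested;    /* high 32 bits are lost here */
    return (int64_t)narrow;
}

/* Models passing the same offset through a 64-bit loff_t parameter. */
int64_t offset_via_64bit_param(int64_t requested)
{
    assert(requested >= 0);
    return requested;
}
```

    An offset of 5 GiB comes out of the narrow version as 1 GiB, so a read would silently hit the wrong part of the file; small offsets are unaffected, which is why the problem is easy to miss.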

    Cc: stable@kernel.org
    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     

07 Jul, 2009

1 commit

  • do_execve() and ptrace_attach() return -EINTR if
    mutex_lock_interruptible(->cred_guard_mutex) fails.

    This is not right; change the code to return -ERESTARTNOINTR.

    Perhaps we should also change proc_pid_attr_write().

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Acked-by: Roland McGrath
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 Jun, 2009

1 commit

  • …el/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     

22 May, 2009

2 commits

  • Conflicts:
    fs/exec.c

    Removed IMA changes (the IMA checks are now performed via may_open()).

    Signed-off-by: James Morris

    James Morris
     
  • - Add support in ima_path_check() for integrity checking without
    incrementing the counts. (Required for nfsd.)
    - rename and export opencount_get to ima_counts_get
    - replace ima_shm_check calls with ima_counts_get
    - export ima_path_check

    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     

18 May, 2009

1 commit


11 May, 2009

1 commit


09 May, 2009

2 commits


03 May, 2009

1 commit

  • This fixes the problem introduced by commit 3bfacef412 (get rid of
    special-casing the /sbin/loader on alpha): an osf/1 ecoff binary segfaults
    when binfmt_aout is built as a module. That happens because the aout binary
    handler ends up on top of the binfmt list due to late registration, and
    the kernel attempts to execute the binary without the preparatory work that
    must be done by binfmt_loader.

    Fixed by changing the registration order of the default binfmt handlers
    using list_add_tail() and introducing an insert_binfmt() function which
    places a new handler on top of the binfmt list. This might be generally
    useful for installing arch-specific frontends for default handlers or just
    for overriding them.
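
    The ordering change can be sketched with a small array-backed list. This is an illustrative model only: `struct fmt_list`, `register_tail()` and `insert_head()` are hypothetical stand-ins for the binfmt list and for registration via list_add_tail() and insert_binfmt().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FORMATS 8

/* Handlers are tried in array order, index 0 first. */
struct fmt_list {
    const char *names[MAX_FORMATS];
    size_t count;
};

/* Default registration: append, so late registration cannot jump the queue. */
void register_tail(struct fmt_list *l, const char *name)
{
    assert(l->count < MAX_FORMATS);
    l->names[l->count++] = name;
}

/* Front-end registration: shift everything down and take the first slot. */
void insert_head(struct fmt_list *l, const char *name)
{
    assert(l->count < MAX_FORMATS);
    for (size_t i = l->count; i > 0; i--)
        l->names[i] = l->names[i - 1];
    l->names[0] = name;
    l->count++;
}
```

    With tail registration as the default, a handler loaded late as a module lands behind the others, while a front end that must run first asks for the head explicitly.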

    Signed-off-by: Ivan Kokshaysky
    Cc: Al Viro
    Cc: Richard Henderson
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     

29 Apr, 2009

1 commit


24 Apr, 2009

2 commits

  • write_lock(&current->fs->lock) guarantees we can't wrongly miss
    LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
    instead of ->siglock to iterate over the sub-threads. We must see
    all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
    takes fs->lock too.

    With or without this patch we can miss the freshly cloned thread
    and set LSM_UNSAFE_SHARE, we don't care.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    [ Fixed lock/unlock typo - Hugh ]
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
    unconditionally. This is wrong if we race with our sub-thread which
    also does do_execve:

    Two threads T1 and T2 and another process P, all share the same
    ->fs.

    T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
    ->fs is shared, we set LSM_UNSAFE but not ->in_exec.

    P exits and decrements fs->users.

    T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
    shared, we set fs->in_exec.

    T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
    return to the user-space.

    T1 does clone(CLONE_FS /* without CLONE_THREAD */).

    T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
    another process.

    Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
    do_execve() to clear ->in_exec depending on res.

    When do_execve() succeeds, it is safe to clear ->in_exec unconditionally.
    It can be set only if we don't share ->fs with another process, and since
    we already killed all sub-threads either ->in_exec == 0 or we are the
    only user of this ->fs.

    Also, we do not need fs->lock to clear fs->in_exec.
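
    The race and its fix can be replayed with a small model. This is a sketch, not the kernel code: `struct fake_fs`, `check_unsafe_model()` and `exec_failed_model()` are hypothetical, `n_fs` is the number of threads in our own group that share the fs, and the -EAGAIN branch for an exec already in progress is an assumption about the surrounding logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fake_fs {
    int users;       /* everyone sharing this fs, in any process */
    bool in_exec;
};

/*
 * Returns 1 if this call set ->in_exec, 0 if the fs is shared with another
 * process (unsafe share), or -EAGAIN if another exec is already in progress.
 */
int check_unsafe_model(struct fake_fs *fs, int n_fs, bool *unsafe_share)
{
    assert(n_fs >= 1 && fs->users >= n_fs);

    *unsafe_share = fs->users > n_fs;
    if (*unsafe_share)
        return 0;
    if (fs->in_exec)
        return -EAGAIN;
    fs->in_exec = true;
    return 1;
}

/* Failure path: only the thread that set ->in_exec may clear it. */
void exec_failed_model(struct fake_fs *fs, int res)
{
    if (res > 0)
        fs->in_exec = false;
}
```

    Replaying the scenario above: T1 gets res == 0 because P still shares the fs, P exits, T2 sets ->in_exec, and T1's later failure leaves T2's flag untouched.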

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Apr, 2009

1 commit

  • Similar to the mmap data stream, add one that tracks the task COMM field,
    so that the userspace reporting knows what to call a task.

    Signed-off-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Corey Ashford
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Apr, 2009

1 commit

  • Merge reason: we have gathered quite a few conflicts, need to merge upstream

    Conflicts:
    arch/powerpc/kernel/Makefile
    arch/x86/ia32/ia32entry.S
    arch/x86/include/asm/hardirq.h
    arch/x86/include/asm/unistd_32.h
    arch/x86/include/asm/unistd_64.h
    arch/x86/kernel/cpu/common.c
    arch/x86/kernel/irq.c
    arch/x86/kernel/syscall_table_32.S
    arch/x86/mm/iomap_32.c
    include/linux/sched.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

01 Apr, 2009

3 commits

  • Don't pull it in sched.h; very few files actually need it and those
    can include it directly. sched.h itself only needs a forward declaration
    of struct fs_struct.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... since we'll unshare sighand anyway

    Signed-off-by: Al Viro

    Al Viro
     
  • * all changes of current->fs are done under task_lock and write_lock of
    old fs->lock
    * refcount is not atomic anymore (same protection)
    * its decrements are done when removing reference from current; at the
    same time we decide whether to free it.
    * put_fs_struct() is gone
    * new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do
    execve() and only subthreads share fs_struct. Cleared when finishing exec
    (success and failure alike). Makes CLONE_FS fail with -EAGAIN if set.
    * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
    is in progress.

    Signed-off-by: Al Viro

    Al Viro
     

29 Mar, 2009

1 commit

  • Joe Malicki reports that setuid sometimes doesn't: very rarely,
    a setuid root program does not get root euid; and, by the way,
    they have a health check running lsof every few minutes.

    Right, check_unsafe_exec() notes whether the files_struct is being
    shared by more threads than will get killed by the exec, and if so
    sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
    But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
    use of get_files_struct(), which also raises that sharing count.

    There's a rather simple fix for this: exec's check on files->count
    has been redundant ever since 2.6.1 made it unshare_files() (except
    while compat_do_execve() omitted to do so) - just remove that check.

    [Note to -stable: this patch will not apply before 2.6.29: earlier
    releases should just remove the files->count line from unsafe_exec().]

    Reported-by: Joe Malicki
    Narrowed-down-by: Michael Itz
    Tested-by: Joe Malicki
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Mar, 2009

1 commit


12 Feb, 2009

1 commit

  • This patch allows LSM modules to determine whether the current process is
    in an execve operation or not so that they can behave differently while an
    execve operation is in progress.

    This patch is needed by TOMOYO. Please see another patch titled "LSM adapter
    functions." for background.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: David Howells
    Signed-off-by: James Morris

    Kentaro Takeda
     

11 Feb, 2009

1 commit


07 Feb, 2009

1 commit

  • The patch:

    commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
    CRED: Make execve() take advantage of copy-on-write credentials

    moved the place in which the 'safeness' of a SUID/SGID exec was performed to
    before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
    calculated incorrectly. This flag is set if any of the usage counts for
    fs_struct, files_struct and sighand_struct is greater than 1 at the time the
    determination is made, and all of them are for threads created by the
    pthread library.

    However, since we wish to make the security calculation before irrevocably
    damaging the process so that we can return it an error code in the case where
    we decide we want to reject the exec request on this basis, we have to make the
    determination before calling de_thread().

    So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
    our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
    (CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
    so can be discounted by check_unsafe_exec().

    We do have to be careful because CLONE_THREAD does not imply FS or FILES.

    We _assume_ that there will be no extra references to these structs held by the
    threads we're going to kill.
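
    The counting idea can be sketched for the fs_struct case alone. This is a simplified model: `struct fake_fs_struct`, `struct fake_thread` and `exec_is_unsafe_share()` are hypothetical, and files_struct and sighand_struct would be handled the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_fs_struct {
    int users;                    /* total references, from any process */
};

struct fake_thread {
    struct fake_fs_struct *fs;    /* differs if created without CLONE_FS */
};

/*
 * group[] holds every thread in our thread group, ourselves included.
 * Threads in the group are killed by the exec, so their references are
 * discounted; any reference beyond those means the fs is shared with an
 * outsider and the exec must be treated as unsafe.
 */
bool exec_is_unsafe_share(const struct fake_thread *group, size_t nthreads,
                          const struct fake_fs_struct *our_fs)
{
    int n_fs = 0;

    assert(group != NULL && our_fs != NULL);
    for (size_t i = 0; i < nthreads; i++) {
        /* CLONE_THREAD does not imply CLONE_FS, so compare pointers. */
        if (group[i].fs == our_fs)
            n_fs++;
    }
    return our_fs->users > n_fs;
}
```

    A plain pthread program has exactly as many users as threads, so it is no longer flagged; one extra holder outside the group is enough to flag it again.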

    This can be tested with the attached pair of programs. Build the two programs
    using the Makefile supplied, and run ./test1 as a non-root user. If
    successful, you should see something like:

    [dhowells@andromeda tmp]$ ./test1
    --TEST1--
    uid=4043, euid=4043 suid=4043
    exec ./test2
    --TEST2--
    uid=4043, euid=0 suid=0
    SUCCESS - Correct effective user ID

    and if unsuccessful, something like:

    [dhowells@andromeda tmp]$ ./test1
    --TEST1--
    uid=4043, euid=4043 suid=4043
    exec ./test2
    --TEST2--
    uid=4043, euid=4043 suid=4043
    ERROR - Incorrect effective user ID!

    The non-root user ID you see will depend on the user you run as.

    [test1.c]
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    static void *thread_func(void *arg)
    {
        while (1) {}
    }

    int main(int argc, char **argv)
    {
        pthread_t tid;
        uid_t uid, euid, suid;

        printf("--TEST1--\n");
        getresuid(&uid, &euid, &suid);
        printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

        /* pthread_create() returns a positive error number on failure */
        if (pthread_create(&tid, NULL, thread_func, NULL) != 0) {
            perror("pthread_create");
            exit(1);
        }

        printf("exec ./test2\n");
        execlp("./test2", "test2", NULL);
        perror("./test2");
        _exit(1);
    }

    [test2.c]
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        uid_t uid, euid, suid;

        getresuid(&uid, &euid, &suid);
        printf("--TEST2--\n");
        printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

        if (euid != 0) {
            fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
            exit(1);
        }
        printf("SUCCESS - Correct effective user ID\n");
        exit(0);
    }

    [Makefile]
    CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
    all: test1 test2

    test1: test1.c
    	gcc $(CFLAGS) -o test1 test1.c -lpthread

    test2: test2.c
    	gcc $(CFLAGS) -o test2 test2.c
    	sudo chown root.root test2
    	sudo chmod +s test2

    Reported-by: David Smith
    Signed-off-by: David Howells
    Acked-by: David Smith
    Signed-off-by: James Morris

    David Howells
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
    nd->flags |= LOOKUP_CONTINUE;
    err = exec_permission_lite(inode);
    if (err == -EAGAIN)
    - err = vfs_permission(nd, MAY_EXEC);
    + err = inode_permission(nd->path.dentry->d_inode,
    + MAY_EXEC);
    + if (!err)
    + err = ima_path_check(&nd->path, MAY_EXEC);
    if (err)
    break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
    flag &= ~O_TRUNC;
    }

    - error = vfs_permission(nd, acc_mode);
    + error = inode_permission(inode, acc_mode);
    if (error)
    return error;
    +
    - error = ima_path_check(&nd->path,
    ++ error = ima_path_check(path,
    + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    + if (error)
    + return error;
    /*
    * An append-only file must be opened in append mode for writing.
    */

    Signed-off-by: James Morris

    James Morris
     
  • This patch replaces the generic integrity hooks, for which IMA registered
    itself, with IMA integrity hooks in the appropriate places directly
    in the fs directory.

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

21 Jan, 2009

1 commit


14 Jan, 2009

1 commit


11 Jan, 2009

1 commit


07 Jan, 2009

2 commits