17 Jul, 2007

4 commits

  • This patch enables the unshare of user namespaces.

    It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which
    resets the current user_struct and adds a new root user (uid == 0)

    For now, unsharing the user namespace allows a process to reset its
    user_struct accounting and uid 0 in the new user namespace should be contained
    using appropriate means, for instance selinux

    The plan, when the full support is complete (all uid checks covered), is to
    keep the original user's rights in the original namespace, and let a process
    become uid 0 in the new namespace, with full capabilities to the new
    namespace.

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Cedric Le Goater
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.

    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that theses privileges
    should be controlled through some means (process capabilities ?)

    The unshare is not included in this patch.

    Changes since [try #4]:
    - Updated get_user_ns and put_user_ns to accept NULL, and
    get_user_ns to return the namespace.

    Changes since [try #3]:
    - moved struct user_namespace to files user_namespace.{c,h}

    Changes since [try #2]:
    - removed struct user_namespace* argument from find_user()

    Changes since [try #1]:
    - removed struct user_namespace* argument from find_user()
    - added a root_user per user namespace

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Add TTY input auditing, used to audit system administrator's actions. This is
    required by various security standards such as DCID 6/3 and PCI to provide
    non-repudiation of administrator's actions and to allow a review of past
    actions if the administrator seems to overstep their duties or if the system
    becomes misconfigured for unknown reasons. These requirements do not make it
    necessary to audit TTY output as well.

    Compared to an user-space keylogger, this approach records TTY input using the
    audit subsystem, correlated with other audit events, and it is completely
    transparent to the user-space application (e.g. the console ioctls still
    work).

    TTY input auditing works on a higher level than auditing all system calls
    within the session, which would produce an overwhelming amount of mostly
    useless audit events.

    Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs
    by process with the attribute is sent to the audit subsystem by the kernel.
    The audit netlink interface is extended to allow modifying the audit_tty
    attribute, and to allow sending explanatory audit events from user-space (for
    example, a shell might send an event containing the final command, after the
    interactive command-line editing and history expansion is performed, which
    might be difficult to decipher from the TTY input alone).

    Because the "audit_tty" attribute is inherited across fork (), it would be set
    e.g. for sshd restarted within an audited session. To prevent this, the
    audit_tty attribute is cleared when a process with no open TTY file
    descriptors (e.g. after daemon startup) opens a TTY.

    See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a
    more detailed rationale document for an older version of this patch.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Miloslav Trmac
    Cc: Al Viro
    Cc: Alan Cox
    Cc: Paul Fulghum
    Cc: Casey Schaufler
    Cc: Steve Grubb
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miloslav Trmac
     
  • Commit 411187fb05cd11676b0979d9fbf3291db69dbce2 caused boot time to move and
    process start times to become invalid after suspend. Using boot based time
    for those restores the old behaviour and fixes the issue.

    [akpm@linux-foundation.org: little cleanup]
    Signed-off-by: Tomas Janousek
    Cc: Tomas Smetana
    Acked-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Janousek
     

10 Jul, 2007

1 commit


24 May, 2007

1 commit

  • Currently try_to_freeze_tasks() has to wait until all of the vforked processes
    exit and for this reason every user can make it fail. To fix this problem we
    can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
    that do not want to be counted as freezable by the freezer and want to have
    TIF_FREEZE set nevertheless. Then, this flag can be set by tasks using
    sys_vfork() before they call wait_for_completion(&vfork) and cleared after
    they have woken up. After clearing it, the tasks should call try_to_freeze()
    as soon as possible.

    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Oleg Nesterov
    Cc: Pavel Machek
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

17 May, 2007

1 commit

  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 May, 2007

5 commits

  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Use task_pgrp() and task_session() in copy_process(), and avoid find_pid()
    call when attaching the task to its process group and session.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Modify copy_process() to take a struct pid * parameter instead of a pid_t.
    This simplifies the code a bit and also avoids having to call find_pid() to
    convert the pid_t to a struct pid.

    Changelog:
    - Fixed Badari Pulavarty's comments and passed in &init_struct_pid
    from fork_idle().
    - Fixed Eric Biederman's comments and simplified this patch and
    used a new patch to remove the likely(pid) check.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • attach_pid() currently takes a pid_t and then uses find_pid() to find the
    corresponding struct pid. Sometimes we already have the struct pid. We can
    then skip find_pid() if attach_pid() were to take a struct pid parameter.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
    each task.

    This patch permits reporting of values using the well known getrusage()
    syscall, filling ru_inblock and ru_oublock instead of null values.

    As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
    count doing : nr_blocks = nr_bytes / 512

    Example of use :
    ----------------------
    After patch is applied, /usr/bin/time command can now give a good
    approximation of IO that the process had to do.

    $ /usr/bin/time grep tototo /usr/include/*
    Command exited with non-zero status 1
    0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
    24288inputs+0outputs (0major+259minor)pagefaults 0swaps

    $ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
    1000+0 enregistrements lus
    1000+0 enregistrements écrits
    512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
    0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+3000outputs (0major+299minor)pagefaults 0swaps

    Signed-off-by: Eric Dumazet
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

10 May, 2007

1 commit

  • This finally renames the thread_info field in task structure to stack, so that
    the assumptions about this field are gone and archs have more freedom about
    placing the thread_info structure.

    Nonbroken archs which have a proper thread pointer can do the access to both
    current thread and task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

09 May, 2007

2 commits

  • Remove includes of where it is not used/needed.
    Suggested by Al Viro.

    Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
    sparc64, and arm (all 59 defconfigs).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
    just incase.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code object
    manipulation of the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there would be code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

03 May, 2007

1 commit

  • Add hooks to allow a paravirt implementation to track the lifetime of
    an mm. Paravirtualization requires three hooks, but only two are
    needed in common code. They are:

    arch_dup_mmap, which is called when a new mmap is created at fork

    arch_exit_mmap, which is called when the last process reference to an
    mm is dropped, which typically happens on exit and exec.

    The third hook is activate_mm, which is called from the arch-specific
    activate_mm() macro/function, and so doesn't need stub versions for
    other architectures. It's called when an mm is first used.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: linux-arch@vger.kernel.org
    Cc: James Bottomley
    Acked-by: Ingo Molnar

    Jeremy Fitzhardinge
     

17 Mar, 2007

1 commit


17 Feb, 2007

1 commit

  • - hrtimers did not use the hrtimer_restart enum and relied on the implict
    int representation. Fix the prototypes and the functions using the enums.
    - Use seperate name spaces for the enumerations
    - Convert hrtimer_restart macro to inline function
    - Add comments

    No functional changes.

    [akpm@osdl.org: fix input driver]
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: john stultz
    Cc: Roman Zippel
    Cc: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

13 Feb, 2007

1 commit

  • Of kernel subsystems that work with pids the tty layer is probably the largest
    consumer. But it has the nice virtue that the assiation with a session only
    lasts until the session leader exits. Which means that no reference counting
    is required. So using struct pid winds up being a simple optimization to
    avoid hash table lookups.

    In the long term the use of pid_nr also ensures that when we have multiple pid
    spaces mixed everything will work correctly.

    Signed-off-by: Eric W. Biederman
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

12 Feb, 2007

1 commit

  • They are fat: 4x8 bytes in task_struct.
    They are uncoditionally updated in every fork, read, write and sendfile.
    They are used only if you have some "extended acct fields feature".

    And please, please, please, read(2) knows about bytes, not characters,
    why it is called "rchar"?

    Signed-off-by: Alexey Dobriyan
    Cc: Jay Lan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

02 Feb, 2007

1 commit


31 Jan, 2007

2 commits

  • This reverts commit 7a238fcba0629b6f2edbcd37458bae56fcf36be5 in
    preparation for a better and simpler fix proposed by Eric Biederman
    (and fixed up by Serge Hallyn)

    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Fix exit race by splitting the nsproxy putting into two pieces. First
    piece reduces the nsproxy refcount. If we dropped the last reference, then
    it puts the mnt_ns, and returns the nsproxy as a hint to the caller. Else
    it returns NULL. The second piece of exiting task namespaces sets
    tsk->nsproxy to NULL, and drops the references to other namespaces and
    frees the nsproxy only if an nsproxy was passed in.

    A little awkward and should probably be reworked, but hopefully it fixes
    the NFS oops.

    Signed-off-by: Serge E. Hallyn
    Cc: Herbert Poetzl
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Cc: Daniel Hokka Zakrisson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

14 Dec, 2006

1 commit

  • Virtually index, physically tagged cache architectures can get away
    without cache flushing when forking. This patch adds a new cache
    flushing function flush_cache_dup_mm(struct mm_struct *) which for the
    moment I've implemented to do the same thing on all architectures
    except on MIPS where it's a no-op.

    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

11 Dec, 2006

4 commits

  • An fdtable can either be embedded inside a files_struct or standalone (after
    being expanded). When an fdtable is being discarded after all RCU references
    to it have expired, we must either free it directly, in the standalone case,
    or free the files_struct it is contained within, in the embedded case.

    Currently the free_files field controls this behavior, but we can get rid of
    it entirely, as all the necessary information is already recorded. We can
    distinguish embedded and standalone fdtables using max_fds, and if it is
    embedded we can divine the relevant files_struct using container_of().

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     
  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     
  • The dup_fd() function creates a new files_struct and fdtable embedded inside
    that files_struct, and then possibly expands the fdtable using expand_files().

    The out_release error path is invoked when expand_files() returns an error
    code. However, when this attempt to expand fails, the fdtable is left in its
    original embedded form, so it is pointless to try to free the associated
    fdarray and fdsets.

    Signed-off-by: Vadim Lobanov
    Cc: Dipankar Sarma
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     
  • The present per-task IO accounting isn't very useful. It simply counts the
    number of bytes passed into read() and write(). So if a process reads 1MB
    from an already-cached file, it is accused of having performed 1MB of I/O,
    which is wrong.

    (David Wright had some comments on the applicability of the present logical IO accounting:

    For billing purposes it is useless but for workload analysis it is very
    useful

    read_bytes/read_calls average read request size
    write_bytes/write_calls average write request size

    read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
    write_bytes/write_blocks ie logical/physical guess since pdflush writes can
    be missed

    I often look for logical larger than physical to see filesystem cache
    problems. And the bytes/cpusec can help find applications that are
    dominating the cache and causing slow interactive response from page cache
    contention.

    I want to find the IO intensive applications and make sure they are doing
    efficient IO. Thus the acctcms(sysV) or csacms command would give the high
    IO commands).

    This patchset adds new accounting which tries to be more accurate. We account
    for three things:

    reads:

    attempt to count the number of bytes which this process really did cause
    to be fetched from the storage layer. Done at the submit_bio() level, so it
    is accurate for block-backed filesystems. I also attempt to wire up NFS and
    CIFS.

    writes:

    attempt to count the number of bytes which this process caused to be sent
    to the storage layer. This is done at page-dirtying time.

    The big inaccuracy here is truncate. If a process writes 1MB to a file
    and then deletes the file, it will in fact perform no writeout. But it will
    have been accounted as having caused 1MB of write.

    So...

    cancelled_writes:

    account the number of bytes which this process caused to not happen, by
    truncating pagecache.

    We _could_ just subtract this from the process's `write' accounting. But
    that means that some processes would be reported to have done negative
    amounts of write IO, which is silly.

    So we just report the raw number and punt this decision up to userspace.

    Now, we _could_ account for writes at the physical I/O level. But

    - This would require that we track memory-dirtying tasks at the per-page
    level (would require a new pointer in struct page).

    - It would mean that IO statistics for a process are usually only available
    long after that process has exitted. Which means that we probably cannot
    communicate this info via taskstats.

    This patch:

    Wire up the kernel-private data structures and the accessor functions to
    manipulate them.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2006

5 commits

  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developped for the containers : pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Add an anonymous union and ((deprecated)) to catch direct usage of the
    session field.

    [akpm@osdl.org: fix various missed conversions]
    [jdike@addtoit.com: fix UML bug]
    Signed-off-by: Jeff Dike
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Replace occurences of task->signal->session by a new process_session() helper
    routine.

    It will be useful for pid namespaces to abstract the session pid number.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in
    linux/kernel/.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • sys_unshare(CLONE_SIGHAND) is broken, the code under 'if (new_sigh)' is
    never executed but very wrong. Just remove it to avoid a confusion,
    task_lock() has nothing to do with ->sighand changing.

    Also, change the comment in unshare_sighand(). Yes, CLONE_THREAD implies
    CLONE_SIGHAND, but still it looks confusing. Also, we don't need to check
    current->sighand != NULL.

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Dec, 2006

6 commits

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits)
    [PATCH] x86-64: Export smp_call_function_single
    [PATCH] i386: Clean up smp_tune_scheduling()
    [PATCH] unwinder: move .eh_frame to RODATA
    [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
    [PATCH] x86-64: don't use set_irq_regs()
    [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq
    [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM
    [PATCH] i386: replace kmalloc+memset with kzalloc
    [PATCH] x86-64: remove remaining pc98 code
    [PATCH] x86-64: remove unused variable
    [PATCH] x86-64: Fix constraints in atomic_add_return()
    [PATCH] x86-64: fix asm constraints in i386 atomic_add_return
    [PATCH] x86-64: Correct documentation for bzImage protocol v2.05
    [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code
    [PATCH] x86-64: Fix numaq build error
    [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header
    [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
    [PATCH] x86-64: Clarify error message in GART code
    [PATCH] x86-64: Fix interrupt race in idle callback (3rd try)
    [PATCH] x86-64: Remove unwind stack pointer alignment forcing again
    ...

    Fixed conflict in include/linux/uaccess.h manually

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Allocate ->signal->stats on demand in taskstats_exit(), this allows us to
    remove taskstats_tgid_alloc() (the last non-trivial inline) from taskstat's
    public interface.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Shailabh Nagar
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The CLONE_CHILD_CLEARTID flag is used by NPTL to have its threads
    communicate via memory/futex when they exit, so pthread_join can
    synchronize using a simple futex wait. The word of user memory where NPTL
    stores a thread's own TID is what it passes; this gets reset to zero at
    thread exit.

    It is not desireable to touch this user memory when threads are dying due
    to a fatal signal. A core dump is more usefully representative of the
    dying program state if the threads live at the time of the crash have their
    NPTL data structures unperturbed. The userland expectation of
    CLONE_CHILD_CLEARTID has only ever been that it works for a thread making
    an _exit system call.

    This problem was identified by Ernie Petrides .

    Signed-off-by: Roland McGrath
    Cc: Ernie Petrides
    Cc: Jakub Jelinek
    Acked-by: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The new swap token patches replace the current token traversal algo. The old
    algo had a crude timeout parameter that was used to handover the token from
    one task to another. This algo, transfers the token to the tasks that are in
    need of the token. The urgency for the token is based on the number of times
    a task is required to swap-in pages. Accordingly, the priority of a task is
    incremented if it has been badly affected due to swap-outs. To ensure that
    the token doesnt bounce around rapidly, the token holders are given a priority
    boost. The priority of tasks is also decremented, if their rate of swap-in's
    keeps reducing. This way, the condition to check whether to pre-empt the swap
    token, is a matter of comparing two task's priority fields.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Ashwin Chaugule
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashwin Chaugule