21 Jul, 2007

1 commit

  • This patch adds support for additional flags at spu_create, which relate
    to the establishment of affinity between contexts and contexts to memory.
    A fourth, optional, parameter is supported. This parameter represent
    a affinity neighbor of the context being created, and is used when defining
    SPU-SPU affinity.
    Affinity is represented as a doubly linked list of spu_contexts.

    Signed-off-by: Andre Detsch
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

18 Jul, 2007

1 commit

  • fallocate() is a new system call being proposed here which will allow
    applications to preallocate space to any file(s) in a file system.
    Each file system implementation that wants to use this feature will need
    to support an inode operation called ->fallocate().
    Applications can use this feature to avoid fragmentation to certain
    level and thus get faster access speed. With preallocation, applications
    also get a guarantee of space for particular file(s) - even if later the
    the system becomes full.

    Currently, glibc provides an interface called posix_fallocate() which
    can be used for similar cause. Though this has the advantage of working
    on all file systems, but it is quite slow (since it writes zeroes to
    each block that has to be preallocated). Without a doubt, file systems
    can do this more efficiently within the kernel, by implementing
    the proposed fallocate() system call. It is expected that
    posix_fallocate() will be modified to call this new system call first
    and incase the kernel/filesystem does not implement it, it should fall
    back to the current implementation of writing zeroes to the new blocks.
    ToDos:
    1. Implementation on other architectures (other than i386, x86_64,
    and ppc). Patches for s390(x) and ia64 are already available from
    previous posts, but it was decided that they should be added later
    once fallocate is in the mainline. Hence not including those patches
    in this take.
    2. Changes to glibc,
    a) to support fallocate() system call
    b) to make posix_fallocate() and posix_fallocate64() call fallocate()

    Signed-off-by: Amit Arora

    Amit Arora
     

29 Jun, 2007

1 commit

  • Not all the world is an i386. Many architectures need 64-bit arguments to be
    aligned in suitable pairs of registers, and the original
    sys_sync_file_range(int, loff_t, loff_t, int) was therefore wasting an
    argument register for padding after the first integer. Since we don't
    normally have more than 6 arguments for system calls, that left no room for
    the final argument on some architectures.

    Fix this by introducing sys_sync_file_range2(int, int, loff_t, loff_t) which
    all fits nicely. In fact, ARM already had that, but called it
    sys_arm_sync_file_range. Move it to fs/sync.c and rename it, then implement
    the needed compatibility routine. And stop the missing syscall check from
    bitching about the absence of sys_sync_file_range() if we've implemented
    sys_sync_file_range2() instead.

    Tested on PPC32 and with 32-bit and 64-bit userspace on PPC64.

    Signed-off-by: David Woodhouse
    Acked-by: Russell King
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

11 May, 2007

3 commits

  • This is a very simple and light file descriptor, that can be used as event
    wait/dispatch by userspace (both wait and dispatch) and by the kernel
    (dispatch only). It can be used instead of pipe(2) in all cases where those
    would simply be used to signal events. Their kernel overhead is much lower
    than pipes, and they do not consume two fds. When used in the kernel, it can
    offer an fd-bridge to enable, for example, functionalities like KAIO or
    syslets/threadlets to signal to an fd the completion of certain operations.
    But more in general, an eventfd can be used by the kernel to signal readiness,
    in a POSIX poll/select way, of interfaces that would otherwise be incompatible
    with it. The API is:

    int eventfd(unsigned int count);

    The eventfd API accepts an initial "count" parameter, and returns an eventfd
    fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

    The POLLIN flag is raised when the internal counter is greater than zero.

    The POLLOUT flag is raised when at least a value of "1" can be written to the
    internal counter.

    The POLLERR flag is raised when an overflow in the counter value is detected.

    The write(2) operation can never overflow the counter, since it blocks (unless
    O_NONBLOCK is set, in which case -EAGAIN is returned).

    But the eventfd_signal() function can do it, since it's supposed to not sleep
    during its operation.

    The read(2) function reads the __u64 counter value, and reset the internal
    value to zero. If the value read is equal to (__u64) -1, an overflow happened
    on the internal counter (due to 2^64 eventfd_signal() posts that has never
    been retired - unlickely, but possible).

    The write(2) call writes an __u64 count value, and adds it to the current
    counter. The eventfd fd supports O_NONBLOCK also.

    On the kernel side, we have:

    struct file *eventfd_fget(int fd);
    int eventfd_signal(struct file *file, unsigned int n);

    The eventfd_fget() should be called to get a struct file* from an eventfd fd
    (this is an fget() + check of f_op being an eventfd fops pointer).

    The kernel can then call eventfd_signal() every time it wants to post an event
    to userspace. The eventfd_signal() function can be called from any context.
    An eventfd() simple test and bench is available here:

    http://www.xmailserver.org/eventfd-bench.c

    This is the eventfd-based version of pipetest-4 (pipe(2) based):

    http://www.xmailserver.org/pipetest-4.c

    Not that performance matters much in the eventfd case, but eventfd-bench
    shows almost as double as performance than pipetest-4.

    [akpm@linux-foundation.org: fix i386 build]
    [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch introduces a new system call for timers events delivered though
    file descriptors. This allows timer event to be used with standard POSIX
    poll(2), select(2) and read(2). As a consequence of supporting the Linux
    f_op->poll subsystem, they can be used with epoll(2) too.

    The system call is defined as:

    int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

    The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
    w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
    s new file descriptor will be created, otherwise the existing "ufd" will be
    re-programmed.

    The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
    specified in the "utmr->it_value" parameter is the expiry time for the timer.

    If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
    otherwise it's a relative time.

    If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
    tv_nsec == 0), this is the period at which the following ticks should be
    generated.

    The "utmr->it_interval" should be set to zero if only one tick is requested.
    Setting the "utmr->it_value" to zero will disable the timer, or will create a
    timerfd without the timer enabled.

    The function returns the new (or same, in case "ufd" is a valid timerfd
    descriptor) file, or -1 in case of error.

    As stated before, the timerfd file descriptor supports poll(2), select(2) and
    epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
    returned.

    The read(2) call can be used, and it will return a u32 variable holding the
    number of "ticks" that happened on the interface since the last call to
    read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
    be returned if no ticks happened.

    A quick test program, shows timerfd working correctly on my amd64 box:

    http://www.xmailserver.org/timerfd-test.c

    [akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

10 May, 2007

1 commit


12 Oct, 2006

1 commit

  • Implement the epoll_pwait system call, that extend the event wait mechanism
    with the same logic ppoll and pselect do. The definition of epoll_pwait
    is:

    int epoll_pwait(int epfd, struct epoll_event *events, int maxevents,
    int timeout, const sigset_t *sigmask, size_t sigsetsize);

    The difference between the vanilla epoll_wait and epoll_pwait is that the
    latter allows the caller to specify a signal mask to be set while waiting
    for events. Hence epoll_pwait will wait until either one monitored event,
    or an unmasked signal happen. If sigmask is NULL, the epoll_pwait system
    call will act exactly like epoll_wait. For the POSIX definition of
    pselect, information is available here:

    http://www.opengroup.org/onlinepubs/009695399/functions/select.html

    Signed-off-by: Davide Libenzi
    Cc: David Woodhouse
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

11 Oct, 2006

1 commit


02 Oct, 2006

1 commit

  • Some architectures provide an execve function that does not set errno, but
    instead returns the result code directly. Rename these to kernel_execve to
    get the right semantics there. Moreover, there is no reasone for any of these
    architectures to still provide __KERNEL_SYSCALLS__ or _syscallN macros, so
    remove these right away.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Arnd Bergmann
    Cc: Andi Kleen
    Acked-by: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Adrian Bunk
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

30 Sep, 2006

1 commit


26 Sep, 2006

1 commit

  • For NUMA optimization and some other algorithms it is useful to have a fast
    to get the current CPU and node numbers in user space.

    x86-64 added a fast way to do this in a vsyscall. This adds a generic
    syscall for other architectures to make it a generic portable facility.

    I expect some of them will also implement it as a faster vsyscall.

    The cache is an optimization for the x86-64 vsyscall optimization. Since
    what the syscall returns is an approximation anyways and user space
    often wants very fast results it can be cached for some time. The norma
    methods to get this information in user space are relatively slow

    The vsyscall is in a better position to manage the cache because it has direct
    access to a fast time stamp (jiffies). For the generic syscall optimization
    it doesn't help much, but enforce a valid argument to keep programs
    portable

    I only added an i386 syscall entry for now. Other architectures can follow
    as needed.

    AK: Also added some cleanups from Andrew Morton

    Signed-off-by: Andi Kleen

    Andi Kleen
     

28 Jun, 2006

1 commit

  • We are pleased to announce "lightweight userspace priority inheritance" (PI)
    support for futexes. The following patchset and glibc patch implements it,
    ontop of the robust-futexes patchset which is included in 2.6.16-mm1.

    We are calling it lightweight for 3 reasons:

    - in the user-space fastpath a PI-enabled futex involves no kernel work
    (or any other PI complexity) at all. No registration, no extra kernel
    calls - just pure fast atomic ops in userspace.

    - in the slowpath (in the lock-contention case), the system call and
    scheduling pattern is in fact better than that of normal futexes, due to
    the 'integrated' nature of FUTEX_LOCK_PI. [more about that further down]

    - the in-kernel PI implementation is streamlined around the mutex
    abstraction, with strict rules that keep the implementation relatively
    simple: only a single owner may own a lock (i.e. no read-write lock
    support), only the owner may unlock a lock, no recursive locking, etc.

    Priority Inheritance - why, oh why???
    -------------------------------------

    Many of you heard the horror stories about the evil PI code circling Linux for
    years, which makes no real sense at all and is only used by buggy applications
    and which has horrible overhead. Some of you have dreaded this very moment,
    when someone actually submits working PI code ;-)

    So why would we like to see PI support for futexes?

    We'd like to see it done purely for technological reasons. We dont think it's
    a buggy concept, we think it's useful functionality to offer to applications,
    which functionality cannot be achieved in other ways. We also think it's the
    right thing to do, and we think we've got the right arguments and the right
    numbers to prove that. We also believe that we can address all the
    counter-arguments as well. For these reasons (and the reasons outlined below)
    we are submitting this patch-set for upstream kernel inclusion.

    What are the benefits of PI?

    The short reply:
    ----------------

    User-space PI helps achieving/improving determinism for user-space
    applications. In the best-case, it can help achieve determinism and
    well-bound latencies. Even in the worst-case, PI will improve the statistical
    distribution of locking related application delays.

    The longer reply:
    -----------------

    Firstly, sharing locks between multiple tasks is a common programming
    technique that often cannot be replaced with lockless algorithms. As we can
    see it in the kernel [which is a quite complex program in itself], lockless
    structures are rather the exception than the norm - the current ratio of
    lockless vs. locky code for shared data structures is somewhere between 1:10
    and 1:100. Lockless is hard, and the complexity of lockless algorithms often
    endangers to ability to do robust reviews of said code. I.e. critical RT
    apps often choose lock structures to protect critical data structures, instead
    of lockless algorithms. Furthermore, there are cases (like shared hardware,
    or other resource limits) where lockless access is mathematically impossible.

    Media players (such as Jack) are an example of reasonable application design
    with multiple tasks (with multiple priority levels) sharing short-held locks:
    for example, a highprio audio playback thread is combined with medium-prio
    construct-audio-data threads and low-prio display-colory-stuff threads. Add
    video and decoding to the mix and we've got even more priority levels.

    So once we accept that synchronization objects (locks) are an unavoidable fact
    of life, and once we accept that multi-task userspace apps have a very fair
    expectation of being able to use locks, we've got to think about how to offer
    the option of a deterministic locking implementation to user-space.

    Most of the technical counter-arguments against doing priority inheritance
    only apply to kernel-space locks. But user-space locks are different, there
    we cannot disable interrupts or make the task non-preemptible in a critical
    section, so the 'use spinlocks' argument does not apply (user-space spinlocks
    have the same priority inversion problems as other user-space locking
    constructs). Fact is, pretty much the only technique that currently enables
    good determinism for userspace locks (such as futex-based pthread mutexes) is
    priority inheritance:

    Currently (without PI), if a high-prio and a low-prio task shares a lock [this
    is a quite common scenario for most non-trivial RT applications], even if all
    critical sections are coded carefully to be deterministic (i.e. all critical
    sections are short in duration and only execute a limited number of
    instructions), the kernel cannot guarantee any deterministic execution of the
    high-prio task: any medium-priority task could preempt the low-prio task while
    it holds the shared lock and executes the critical section, and could delay it
    indefinitely.

    Implementation:
    ---------------

    As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
    involves no kernel work at all - they behave quite similarly to normal
    futex-based locks: a 0 value means unlocked, and a value==TID means locked.
    (This is the same method as used by list-based robust futexes.) Userspace uses
    atomic ops to lock/unlock these mutexes without entering the kernel.

    To handle the slowpath, we have added two new futex ops:

    FUTEX_LOCK_PI
    FUTEX_UNLOCK_PI

    If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to TID
    fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work:
    if there is no futex-queue attached to the futex address yet then the code
    looks up the task that owns the futex [it has put its own TID into the futex
    value], and attaches a 'PI state' structure to the futex-queue. The pi_state
    includes an rt-mutex, which is a PI-aware, kernel-based synchronization
    object. The 'other' task is made the owner of the rt-mutex, and the
    FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries
    to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex
    acquired, and it sets the futex value to its own TID and returns. Userspace
    has no other work to perform - it now owns the lock, and futex value contains
    FUTEX_WAITERS|TID.

    If the unlock side fastpath succeeds, [i.e. userspace manages to do a TID ->
    0 atomic transition of the futex value], then no kernel work is triggered.

    If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
    FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
    userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
    up any potential waiters.

    Note that under this approach, contrary to other PI-futex approaches, there is
    no prior 'registration' of a PI-futex. [which is not quite possible anyway,
    due to existing ABI properties of pthread mutexes.]

    Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
    of futexes, and all four combinations are possible: futex, robust-futex,
    PI-futex, robust+PI-futex.

    glibc support:
    --------------

    Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
    (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
    mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too, no
    additional kernel changes are needed for that). [NOTE: The glibc patch is
    obviously inofficial and unsupported without matching upstream kernel
    functionality.]

    the patch-queue and the glibc patch can also be downloaded from:

    http://redhat.com/~mingo/PI-futex-patches/

    Many thanks go to the people who helped us create this kernel feature: Steven
    Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
    van de Ven, Oleg Nesterov and others. Credits for related prior projects goes
    to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

    Clean up the futex code, before adding more features to it:

    - use u32 as the futex field type - that's the ABI
    - use __user and pointers to u32 instead of unsigned long
    - code style / comment style cleanups
    - rename hash-bucket name from 'bh' to 'hb'.

    I checked the pre and post futex.o object files to make sure this
    patch has no code effects.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Cc: Ulrich Drepper
    Cc: Jakub Jelinek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

23 Jun, 2006

3 commits

  • The definition of the third parameter is a pointer to an array of virtual
    addresses which give us some trouble. The existing code calculated the
    wrong address in the array since I used void to avoid having to specify a
    type.

    I now use the correct type "compat_uptr_t __user *" in the definition of
    the function in kernel/compat.c.

    However, I used __u32 in syscalls.h. Would have to include compat.h there
    in order to provide the same definition which would generate an ugly
    include situation.

    On both ia64 and x86_64 compat_uptr_t is u32. So this works although
    parameter declarations differ.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)

    Add support for move_pages() on i386 and also add the compat functions
    necessary to run 32 bit binaries on x86_64.

    Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is
    not up to date so I added the missing pieces. Not sure if this is done the
    right way.

    [akpm@osdl.org: compile fix]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • move_pages() is used to move individual pages of a process. The function can
    be used to determine the location of pages and to move them onto the desired
    node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move,
    addresses_of_pages[],
    nodes[] or NULL,
    status[],
    flags);

    The addresses of pages is an array of void * pointing to the
    pages to be moved.

    The nodes array contains the node numbers that the pages should be moved
    to. If a NULL is passed instead of an array then no pages are moved but
    the status array is updated. The status request may be used to determine
    the page state before issuing another move_pages() to move pages.

    The status array will contain the state of all individual page migration
    attempts when the function terminates. The status array is only valid if
    move_pages() completed successfullly.

    Possible page states in status[]:

    0..MAX_NUMNODES The page is now on the indicated node.

    -ENOENT Page is not present

    -EACCES Page is mapped by multiple processes and can only
    be moved if MPOL_MF_MOVE_ALL is specified.

    -EPERM The page has been mlocked by a process/driver and
    cannot be moved.

    -EBUSY Page is busy and cannot be moved. Try again later.

    -EFAULT Invalid address (no VMA or zero page).

    -ENOMEM Unable to allocate memory on target node.

    -EIO Unable to write back page. The page must be written
    back in order to move it since the page is dirty and the
    filesystem does not provide a migration function that
    would allow the moving of dirty pages.

    -EINVAL A dirty page cannot be moved. The filesystem does not provide
    a migration function and has no ability to write back pages.

    The flags parameter indicates what types of pages to move:

    MPOL_MF_MOVE Move pages that are only mapped by the process.

    MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
    Requires sufficient capabilities.

    Possible return codes from move_pages()

    -ENOENT No pages found that would require moving. All pages
    are either already on the target node, not present, had an
    invalid address or could not be moved because they were
    mapped by multiple processes.

    -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
    to migrate pages in a kernel thread.

    -EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges.
    or an attempt to move a process belonging to another user.

    -EACCES One of the target nodes is not allowed by the current cpuset.

    -ENODEV One of the target nodes is not online.

    -ESRCH Process does not exist.

    -E2BIG Too many pages to move.

    -ENOMEM Not enough memory to allocate control array.

    -EFAULT Parameters could not be accessed.

    A test program for move_pages() may be found with the patches
    on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

    From: Christoph Lameter

    Detailed results for sys_move_pages()

    Pass a pointer to an integer to get_new_page() that may be used to
    indicate where the completion status of a migration operation should be
    placed. This allows sys_move_pags() to report back exactly what happened to
    each page.

    Wish there would be a better way to do this. Looks a bit hacky.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 May, 2006

2 commits


29 Apr, 2006

1 commit


26 Apr, 2006

2 commits


11 Apr, 2006

3 commits

  • Basically an in-kernel implementation of tee, which uses splice and the
    pipe buffers as an intelligent way to pass data around by reference.

    Where the user space tee consumes the input and produces a stdout and
    file output, this syscall merely duplicates the data inside a pipe to
    another pipe. No data is copied, the output just grabs a reference to the
    input pipe data.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
    [PATCH] vfs: add splice_write and splice_read to documentation
    [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
    [PATCH] splice: warning fix
    [PATCH] another round of fs/pipe.c cleanups
    [PATCH] splice: comment styles
    [PATCH] splice: add Ingo as addition copyright holder
    [PATCH] splice: unlikely() optimizations
    [PATCH] splice: speedups and optimizations
    [PATCH] pipe.c/fifo.c code cleanups
    [PATCH] get rid of the PIPE_*() macros
    [PATCH] splice: speedup __generic_file_splice_read
    [PATCH] splice: add direct fd fd splicing support
    [PATCH] splice: add optional input and output offsets
    [PATCH] introduce a "kernel-internal pipe object" abstraction
    [PATCH] splice: be smarter about calling do_page_cache_readahead()
    [PATCH] splice: optimize the splice buffer mapping
    [PATCH] splice: cleanup __generic_file_splice_read()
    [PATCH] splice: only call wake_up_interruptible() when we really have to
    [PATCH] splice: potential !page dereference
    [PATCH] splice: mark the io page as accessed

    Linus Torvalds
     
  • Ulrich suggested that the `flags' arg to sync_file_range() become unsigned.

    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

10 Apr, 2006

1 commit

  • add optional input and output offsets to sys_splice(), for seekable file
    descriptors:

    asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
    int fd_out, loff_t __user *off_out,
    size_t len, unsigned int flags);

    semantics are straightforward: f_pos will be updated with the offset
    provided by user-space, before the splice transfer is about to begin.
    Providing a NULL offset pointer means the existing f_pos will be used
    (and updated in situ). Providing an offset for a pipe results in
    -ESPIPE. Providing an invalid offset pointer results in -EFAULT.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Jens Axboe

    Ingo Molnar
     

01 Apr, 2006

1 commit

  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The sycall is implemented in the new fs/sync.c. The intention is that we can
    move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: it's notable that we can sync an fd which wasn't opened for writing.
    Same with fsync() and fdatasync()).

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

31 Mar, 2006

1 commit

  • This adds support for the sys_splice system call. Using a pipe as a
    transport, it can connect to files or sockets (latter as output only).

    From the splice.c comments:

    "splice": joining two ropes together by interweaving their strands.

    This is the "extended pipe" functionality, where a pipe is used as
    an arbitrary in-memory buffer. Think of a pipe as a small kernel
    buffer that you can use to transfer data from one end to the other.

    The traditional unix read/write is extended with a "splice()" operation
    that transfers data buffers to or from a pipe buffer.

    Named by Larry McVoy, original implementation from Linus, extended by
    Jens to support splicing to files and fixing the initial implementation
    bugs.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Mar, 2006

1 commit


25 Feb, 2006

1 commit

  • I'm currently at the POSIX meeting and one thing covered was the
    incompatibility of Linux's link() with the POSIX definition. The name.
    Linux does not follow symlinks, POSIX requires it does.

    Even if somebody thinks this is a good default behavior we cannot change this
    because it would break the ABI. But the fact remains that some application
    might want this behavior.

    We have one chance to help implementing this without breaking the behavior.
    For this we could use the new linkat interface which would need a new
    flags parameter. If the new parameter is AT_SYMLINK_FOLLOW the new
    behavior could be invoked.

    I do not want to introduce such a patch now. But we could add the
    parameter now, just don't use it. The patch below would do this. Can we
    get this late patch applied before the release more or less fixes the
    syscall API?

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

12 Feb, 2006

1 commit

  • The *at patches introduced fstatat and, due to inusfficient research, I
    used the newfstat functions generally as the guideline. The result is that
    on 32-bit platforms we don't have all the information needed to implement
    fstatat64.

    This patch modifies the code to pass up 64-bit information if
    __ARCH_WANT_STAT64 is defined. I renamed the syscall entry point to make
    this clear. Other archs will continue to use the existing code. On x86-64
    the compat code is implemented using a new sys32_ function. this is what
    is done for the other stat syscalls as well.

    This patch might break some other archs (those which define
    __ARCH_WANT_STAT64 and which already wired up the syscall). Yet others
    might need changes to accomodate the compatibility mode. I really don't
    want to do that work because all this stat handling is a mess (more so in
    glibc, but the kernel is also affected). It should be done by the arch
    maintainers. I'll provide some stand-alone test shortly. Those who are
    eager could compile glibc and run 'make check' (no installation needed).

    The patch below has been tested on x86 and x86-64.

    Signed-off-by: Ulrich Drepper
    Cc: Christoph Hellwig
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

02 Feb, 2006

2 commits

  • Most of the 64 bit architectures will zero extend the first argument to
    compat_sys_{openat,newfstatat,futimesat} which will fail if the 32 bit
    syscall was passed AT_FDCWD (which is a small negative number). Declare
    the first argument to be an unsigned int which will force the correct
    sign extension when the internal functions are called in each case.

    Also, do some small white space cleanups in fs/compat.c.

    Signed-off-by: Stephen Rothwell
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Here's the follow-up patch which introduces the prototypes for the new
    syscalls. There was also a typo in one of the new symbols.

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

19 Jan, 2006

1 commit


10 Jan, 2006

1 commit


09 Jan, 2006

2 commits

  • sys_migrate_pages implementation using swap based page migration

    This is the original API proposed by Ray Bryant in his posts during the first
    half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

    The intent of sys_migrate is to migrate memory of a process. A process may
    have migrated to another node. Memory was allocated optimally for the prior
    context. sys_migrate_pages allows to shift the memory to the new node.

    sys_migrate_pages is also useful if the processes available memory nodes have
    changed through cpuset operations to manually move the processes memory. Paul
    Jackson is working on an automated mechanism that will allow an automatic
    migration if the cpuset of a process is changed. However, a user may decide
    to manually control the migration.

    This implementation is put into the policy layer since it uses concepts and
    functions that are also needed for mbind and friends. The patch also provides
    a do_migrate_pages function that may be useful for cpusets to automatically
    move memory. sys_migrate_pages does not modify policies in contrast to Ray's
    implementation.

    The current code here is based on the swap based page migration capability and
    thus is not able to preserve the physical layout relative to it containing
    nodeset (which may be a cpuset). When direct page migration becomes available
    then the implementation needs to be changed to do a isomorphic move of pages
    between different nodesets. The current implementation simply evicts all
    pages in source nodeset that are not in the target nodeset.

    Patch supports ia64, i386 and x86_64.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is the current version of the spu file system, used
    for driving SPEs on the Cell Broadband Engine.

    This release is almost identical to the version for the
    2.6.14 kernel posted earlier, which is available as part
    of the Cell BE Linux distribution from
    http://www.bsc.es/projects/deepcomputing/linuxoncell/.

    The first patch provides all the interfaces for running
    spu application, but does not have any support for
    debugging SPU tasks or for scheduling. Both these
    functionalities are added in the subsequent patches.

    See Documentation/filesystems/spufs.txt on how to use
    spufs.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Mackerras

    Arnd Bergmann
     

31 Oct, 2005

1 commit


22 Sep, 2005

1 commit


08 Jul, 2005

1 commit

  • - Make ioprio syscalls return long, like set/getpriority syscalls.
    - Move function prototypes into syscalls.h so we can pick them up in the
    32/64bit compat code.

    Signed-off-by: Anton Blanchard
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

26 Jun, 2005

1 commit

  • o Following patch provides purely cosmetic changes and corrects CodingStyle
    guide lines related certain issues like below in kexec related files

    o braces for one line "if" statements, "for" loops,
    o more than 80 column wide lines,
    o No space after "while", "for" and "switch" key words

    o Changes:
    o take-2: Removed the extra tab before "case" key words.
    o take-3: Put operator at the end of line and space before "*/"

    Signed-off-by: Maneesh Soni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maneesh Soni