15 Nov, 2013

2 commits

  • The basic idea is the same as with the PTE level: the lock is embedded
    in the struct page of the table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of the table's
    page for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful
    and false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct
    page -- dynamic allocation is required.
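    In userspace C, the fallible-constructor contract might be sketched as
    follows (my_page, my_page_ctor and my_page_dtor are hypothetical
    stand-ins, not the real kernel helpers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page with a dynamically allocated
 * lock, as -rt would need when spinlock_t is too large to embed. */
struct my_page {
	void *ptl;	/* allocated lock object, or NULL */
};

/* Returns false on allocation failure so callers can back out,
 * mirroring the pgtable_pmd_page_ctor() contract described above. */
static bool my_page_ctor(struct my_page *page)
{
	page->ptl = malloc(64);		/* placeholder for a large lock */
	return page->ptl != NULL;
}

static void my_page_dtor(struct my_page *page)
{
	free(page->ptl);
	page->ptl = NULL;
}
```

    A caller would treat a false return exactly like any other allocation
    failure in the fork path and unwind.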

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: "Paul E. McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
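    The conversion can be illustrated with C11 atomics in userspace
    (my_mm and the helpers are hypothetical stand-ins; the kernel uses
    atomic_long_t with its own accessors):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for mm_struct: nr_ptes becomes an atomic
 * counter, so concurrent updaters holding different split page table
 * locks no longer race on it. */
struct my_mm {
	atomic_long nr_ptes;	/* was: unsigned long under page_table_lock */
};

static void my_mm_inc_nr_ptes(struct my_mm *mm)
{
	atomic_fetch_add_explicit(&mm->nr_ptes, 1, memory_order_relaxed);
}

static long my_mm_nr_ptes(struct my_mm *mm)
{
	return atomic_load_explicit(&mm->nr_ptes, memory_order_relaxed);
}
```

    Relaxed ordering suffices here because the counter is a statistic,
    not a synchronization point.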

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W. Biederman"
    Cc: "Paul E. McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Nov, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijlstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
     

30 Oct, 2013

2 commits

  • uprobe_copy_process() does nothing if the child shares ->mm with
    the forking process, but there is a special case: CLONE_VFORK.
    In this case it would be more correct to do dup_utask() but avoid
    dup_xol(). This is not that important: the child should not unwind
    its stack too much, since that can corrupt the parent's stack, but
    at least we need this to allow ret-probing __vfork() itself.

    Note: in theory, it would be better to check task_pt_regs(p)->sp
    instead of CLONE_VFORK; we need to dup_utask() if and only if the
    child can return from the function called by the parent. But this
    needs an arch-dependent helper, and I think that nobody actually
    does clone(same_stack, CLONE_VM).

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Preparation for the next patches.

    Move the callsite of uprobe_copy_process() in copy_process() down
    to the successful return. We do not care if copy_process() fails;
    uprobe_free_utask() won't be called in this case, so the wrongly
    set ->utask != NULL doesn't matter.

    OTOH, with this change we know that copy_process() can't fail when
    uprobe_copy_process() is called; the new task will either return
    to user-mode or call do_exit(). This way uprobe_copy_process() can:

    1. setup p->utask != NULL if necessary

    2. setup uprobes_state.xol_area

    3. use task_work_add(p)

    Also, move the definition of uprobe_copy_process() down so that it
    can see get_utask().

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

09 Oct, 2013

2 commits

  • A newly spawned thread inside a process should stay on the same
    NUMA node as its parent. This prevents processes from being "torn"
    across multiple NUMA nodes every time they spawn a new thread.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • PTE scanning and NUMA hinting fault handling is expensive so commit
    5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
    on a new node") deferred the PTE scan until a task had been scheduled on
    another node. The problem is that in the purely shared memory case
    this may never happen and no NUMA hinting fault information will be
    captured. We are not ruling out the possibility that something better
    can be done here, but for now this patch needs to be reverted so we
    depend entirely on the scan_delay to avoid punishing short-lived
    processes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

12 Sep, 2013

4 commits

  • Simple cleanup. Every user of vma_set_policy() does the same work,
    which looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

    Then later copy_process() denies CLONE_SIGHAND if the new process will
    be in a different pid namespace (task_active_pid_ns() doesn't match
    current->nsproxy->pid_ns).

    This looks confusing and inconsistent. CLONE_NEWPID is very similar
    to the case when ->pid_ns was already unshared; we want the same
    restrictions, so copy_process() should also nack CLONE_PARENT.

    And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as
    well, just for consistency.

    Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
    copy_process() to do the same check along with the ->pid_ns check we
    already have.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
    unshared pid_ns. This is correct but unnecessary; copy_pid_ns() does
    the same check.

    Remove the CLONE_NEWPID check to cleanup the code and prepare for the
    next change.

    Test-case:

    static int child(void *arg)
    {
            return 0;
    }

    static char stack[16 * 1024];

    int main(void)
    {
            pid_t pid;

            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

            pid = clone(child, stack + sizeof(stack) / 2,
                        CLONE_NEWPID | SIGCHLD, NULL);
            assert(pid < 0 && errno == EINVAL);

            return 0;
    }

    clone(CLONE_NEWPID) correctly fails with or without this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
    pid_ns; this obviously breaks vfork:

    int main(void)
    {
            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
            assert(vfork() >= 0);
            _exit(0);
            return 0;
    }

    fails without this patch.

    Change this check to use CLONE_SIGHAND instead. This also forbids
    CLONE_THREAD automatically, and this is what the comment implies.

    We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but
    it would be safer not to do this. The current check denies
    CLONE_SIGHAND implicitly and there is no reason to change this.

    Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
    Having shared signal handling between two different pid namespaces is
    the case that we are fundamentally guarding against."

    Signed-off-by: Oleg Nesterov
    Reported-by: Colin Walters
    Acked-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit

  • I goofed when I made unshare(CLONE_NEWPID) only work in a
    single-threaded process. There is no need for that requirement, and
    in fact I analyzed things right for setns. The hard requirement
    is for tasks that share a VM to all be in the same pid namespace,
    and we properly prevent that in do_fork.

    Just to be certain I took a look through do_wait and
    forget_original_parent, and there are no cases that make it any
    harder for children to be in multiple pid namespaces than it is for
    children to be in the same pid namespace. I also checked whether
    there were any uses of task->nsproxy->pid_ns I was not familiar
    with, but it is only used when allocating a new pid for a new task,
    and in checks to prevent craziness from happening.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Fix inadvertent breakage in the clone syscall ABI for Microblaze that
    was introduced in commit f3268edbe6fe ("microblaze: switch to generic
    fork/vfork/clone").

    The Microblaze syscall ABI for clone takes the parent tid address in the
    4th argument; the third argument slot is used for the stack size. The
    incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
    slot.

    This commit restores the original ABI so that existing userspace libc
    code will work correctly.

    All kernel versions from v3.8-rc1 were affected.

    Signed-off-by: Michal Simek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Simek
     

31 Jul, 2013

1 commit

  • On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
    > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
    > > When using a large number of threads performing AIO operations the
    > > IOCTX list may get a significant number of entries which will cause
    > > significant overhead. For example, when running this fio script:
    > >
    > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=512; thread; loops=100
    > >
    > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
    > > 30% CPU time spent by lookup_ioctx:
    > >
    > > 32.51% [guest.kernel] [g] lookup_ioctx
    > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
    > > 4.40% [guest.kernel] [g] lock_release
    > > 4.19% [guest.kernel] [g] sched_clock_local
    > > 3.86% [guest.kernel] [g] local_clock
    > > 3.68% [guest.kernel] [g] native_sched_clock
    > > 3.08% [guest.kernel] [g] sched_clock_cpu
    > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
    > > 2.60% [guest.kernel] [g] memcpy
    > > 2.33% [guest.kernel] [g] lock_acquired
    > > 2.25% [guest.kernel] [g] lock_acquire
    > > 1.84% [guest.kernel] [g] do_io_submit
    > >
    > > This patch converts the ioctx list to a radix tree. For a performance
    > > comparison the above FIO script was run on a 2-socket, 8-core
    > > machine. These are the results (average and %rsd of 10 runs) for the
    > > original list-based implementation and for the radix-tree-based
    > > implementation:
    > >
    > > cores               1        2        4        8       16       32
    > > list (ms)      109376    69119    35682    22671    19724    16408
    > > %rsd            0.69%    1.15%    1.17%    1.21%    1.71%    1.43%
    > > radix (ms)      73651    41748    23028    16766    15232    13787
    > > %rsd            1.19%    0.98%    0.69%    1.13%    0.72%    0.75%
    > > radix/list     66.12%   65.59%   66.63%   72.31%   77.26%   83.66%
    > >
    > > To consider the impact of the patch on the typical case of having
    > > only one ctx per process the following FIO script was run:
    > >
    > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=1; thread; loops=100
    > >
    > > on the same system and the results are the following:
    > >
    > > list    58892 ms  (%rsd 0.91%)
    > > radix   59404 ms  (%rsd 0.81%)
    > > radix relative to list: 100.87%
    >
    > So, I was just doing some benchmarking/profiling to get ready to send
    > out the aio patches I've got for 3.11 - and it looks like your patch is
    > causing a ~1.5% throughput regression in my testing :/
    ...

    I've got an alternate approach for fixing this wart in lookup_ioctx()...
    Instead of using an rbtree, just use the reserved id in the ring buffer
    header to index an array pointing to the ioctx. It's not finished yet
    and needs to be tidied up, but it is most of the way there.

    -ben
    --
    "Thought is the essence of where you are now."
    --
    kmo> And, a rework of Ben's code, but this was entirely his idea
    kmo> -Kent

    bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
    free memory.
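    In miniature, the id-indexed lookup described above might look like
    this (the my_* names are hypothetical; the real code sizes the table
    dynamically and protects it with RCU):

```c
#include <assert.h>
#include <stddef.h>

#define MY_TABLE_SIZE 16

/* The id handed back in the ring buffer header doubles as an index
 * into a per-mm array, replacing the O(n) list walk in lookup_ioctx(). */
struct my_ioctx {
	unsigned id;
};

struct my_mm {
	struct my_ioctx *ioctx_table[MY_TABLE_SIZE];
};

static struct my_ioctx *my_lookup_ioctx(struct my_mm *mm, unsigned id)
{
	if (id >= MY_TABLE_SIZE)
		return NULL;		/* reject ids past the table */
	return mm->ioctx_table[id];	/* O(1), locking omitted here */
}
```

    The win over the list (or radix tree) is that the common
    one-ctx-per-process case costs a single bounds check and array load.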

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jul, 2013

4 commits

  • copy_process() does a lot of "chaotic" initializations and checks
    CLONE_THREAD twice before it takes tasklist. In particular it sets
    "p->group_leader = p" and then changes it again under tasklist if
    !thread_group_leader(p).

    This looks a bit confusing, so let's create a single "if (CLONE_THREAD)"
    block which initializes ->exit_signal, ->group_leader, and ->tgid.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid *" argument; it is always
    called after pid_link->pid has already been set.
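    The publication order the patch establishes can be sketched as
    follows (my_* names are hypothetical; the real code manipulates
    struct pid_link under tasklist_lock):

```c
#include <assert.h>
#include <stddef.h>

struct my_pid {
	int nr;
};

struct my_task {
	struct my_pid *pid_link;	/* initialized before publication */
	int published;			/* stands in for the list insertion */
};

/* attach no longer takes a struct pid argument: it requires that
 * pid_link was set beforehand, so readers that find the task via the
 * pid list always see a fully initialized pid. */
static void my_attach_pid(struct my_task *t)
{
	assert(t->pid_link != NULL);	/* caller must set the pid first */
	t->published = 1;		/* only now visible to readers */
}
```

    Initializing first and publishing second is the same pattern used
    anywhere lockless readers may observe a partially constructed object.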

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup and preparation for the next changes.

    Move the "if (clone_flags & CLONE_THREAD)" code down under "if
    (likely(p->pid))" and turn it into the "else" branch. This makes the
    process/thread initialization more symmetrical and removes one check.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When a task is attempting to violate the RLIMIT_NPROC limit we have a
    check to see if the task is sufficiently privileged. The check first
    looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then whether the task
    is uid=0.

    The result is that tasks which are allowed by the uid=0 check are
    first checked against the security subsystem. This results in the
    security subsystem auditing a denial for sys_admin and sys_resource
    even though the task then passes the uid=0 check.

    This patch rearranges the code to first check uid=0, since if we pass
    that check we shouldn't hit the security subsystem at all. We then
    check sys_resource, since it is the smallest capability which will
    solve the problem. Lastly we check the catch-all fallback,
    cap_sys_admin; we don't want to grant this capability in many places
    since it is so powerful.

    This will eliminate many of the false positive/needless denial messages we
    get when a root task tries to violate the nproc limit. (note that
    kthreads count against root, so on a sufficiently large machine we can
    actually get past the default limits before any userspace tasks are
    launched.)
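    The reordering can be sketched like this (my_nproc_exempt and the
    capability predicate are hypothetical stand-ins; the kernel tests
    the uid and calls capable()):

```c
#include <assert.h>
#include <stdbool.h>

enum my_cap { MY_CAP_SYS_RESOURCE, MY_CAP_SYS_ADMIN };

/* Check the cheapest, audit-free condition (uid == 0) first, then the
 * smallest sufficient capability, and only fall back to the powerful
 * catch-all capability last. */
static bool my_nproc_exempt(bool is_uid0, bool (*has_cap)(enum my_cap))
{
	if (is_uid0)
		return true;			/* no security-subsystem query */
	if (has_cap(MY_CAP_SYS_RESOURCE))
		return true;			/* smallest cap that suffices */
	return has_cap(MY_CAP_SYS_ADMIN);	/* last resort */
}

/* Test doubles for the capability predicate. */
static bool my_no_caps(enum my_cap c) { (void)c; return false; }
static bool my_resource_only(enum my_cap c) { return c == MY_CAP_SYS_RESOURCE; }
```

    Short-circuit ordering is what keeps the uid=0 fast path from ever
    touching (and thus auditing in) the security subsystem.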

    Signed-off-by: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     

09 May, 2013

1 commit

  • Pull block driver updates from Jens Axboe:
    "It might look big in volume, but when categorized, not a lot of
    drivers are touched. The pull request contains:

    - mtip32xx fixes from Micron.

    - A slew of drbd updates, this time in a nicer series.

    - bcache, a flash/ssd caching framework from Kent.

    - Fixes for cciss"

    * 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
    bcache: Use bd_link_disk_holder()
    bcache: Allocator cleanup/fixes
    cciss: bug fix to prevent cciss from loading in kdump crash kernel
    cciss: add cciss_allow_hpsa module parameter
    drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
    mtip32xx: Workaround for unaligned writes
    bcache: Make sure blocksize isn't smaller than device blocksize
    bcache: Fix merge_bvec_fn usage for when it modifies the bvm
    bcache: Correctly check against BIO_MAX_PAGES
    bcache: Hack around stuff that clones up to bi_max_vecs
    bcache: Set ra_pages based on backing device's ra_pages
    bcache: Take data offset from the bdev superblock.
    mtip32xx: mtip32xx: Disable TRIM support
    mtip32xx: fix a smatch warning
    bcache: Disable broken btree fuzz tester
    bcache: Fix a format string overflow
    bcache: Fix a minor memory leak on device teardown
    bcache: Documentation updates
    bcache: Use WARN_ONCE() instead of __WARN()
    bcache: Add missing #include
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

30 Apr, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     

24 Mar, 2013

1 commit

  • Does writethrough and writeback caching, handles unclean shutdown, and
    has a bunch of other nifty features motivated by real world usage.

    See the wiki at http://bcache.evilpiepirate.org for more.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

14 Mar, 2013

1 commit

  • Don't allow sharing the root directory with processes in a
    different user namespace. There doesn't seem to be any point, and to
    allow it would require the overhead of putting a user namespace
    reference in fs_struct (for permission checks) and incrementing that
    reference count on practically every call to fork.

    So just perform the inexpensive test of forbidding sharing an
    fs_struct across processes in different user namespaces. We already
    disallow other forms of threading when unsharing a user namespace,
    so this should be no real burden in practice.

    This updates setns, clone, and unshare to disallow multiple user
    namespaces sharing an fs_struct.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

08 Mar, 2013

1 commit

  • The full dynticks cputime accounting is able to account either
    using the tick or the context tracking subsystem. This way
    the housekeeping CPU can keep the low overhead tick based
    solution.

    This latter mode has a low, jiffies-resolution granularity and needs
    to be scaled against the precise CFS runtime accounting to improve
    its results. We are doing this for CONFIG_TICK_CPU_ACCOUNTING; now
    we also need to extend it to the dynamic off-case of full dynticks
    accounting as well.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Kevin Hilman
    Cc: Mats Liljegren
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Namhyung Kim
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney

    Frederic Weisbecker
     

04 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • If new_nsproxy is set we will always call switch_task_namespaces and
    then set new_nsproxy back to NULL, so the reassignment and the
    fall-through check are redundant.

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


05 Feb, 2013

1 commit

  • …x/kernel/git/frederic/linux-dynticks into sched/core

    Pull full-dynticks (user-space execution is undisturbed and
    receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready,
    from Frederic Weisbecker:

    "This implements the cputime accounting on full dynticks CPUs.

    Typical cputime stats infrastructure relies on the timer tick and
    its periodic polling on the CPU to account the amount of time
    spent by the CPUs and the tasks per high level domains such as
    userspace, kernelspace, guest, ...

    Now we are preparing to implement full dynticks capability on
    Linux for Real Time and HPC users who want full CPU isolation.
    This feature requires a cputime accounting that doesn't depend
    on the timer tick.

    To implement it, this new cputime infrastructure plugs into
    kernel/user/guest boundaries to take snapshots of cputime and
    flush these to the stats when needed. This performs pretty
    much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
    and cputime snapshots are synchronized between write and read
    side such that the latter can safely retrieve the pending tickless
    cputime of a task and add it to its latest cputime snapshot to
    return the correct result to the user."

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

28 Jan, 2013

1 commit

  • While remotely reading the cputime of a task running on a
    full dynticks CPU, the values stored in the utime/stime fields
    of struct task_struct may be stale: they may date from the
    last kernel/user transition snapshot, and we need to add the
    tickless time spent since that snapshot.

    To fix this, flush the cputime of the dynticks CPUs on
    kernel/user transitions and record the time and context
    where we did so. Then, on top of this snapshot and the current
    time, perform the fixup on the reader side in the task_times()
    accessors.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fixed kvm module related build errors]
    Signed-off-by: Sedat Dilek

    Frederic Weisbecker
     

21 Jan, 2013

1 commit

  • Pull misc syscall fixes from Al Viro:

    - compat syscall fixes (discussed back in December)

    - a couple of "make life easier for sigaltstack stuff by reducing
    inter-tree dependencies"

    - fix up compiler/asmlinkage calling convention disagreement of
    sys_clone()

    - misc

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    sys_clone() needs asmlinkage_protect
    make sure that /linuxrc has std{in,out,err}
    x32: fix sigtimedwait
    x32: fix waitid()
    switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
    switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
    CONFIG_GENERIC_SIGALTSTACK build breakage with asm-generic/syscalls.h
    Ensure that kernel_init_freeable() is not inlined into non __init code

    Linus Torvalds
     

20 Jan, 2013

1 commit


25 Dec, 2012

1 commit

  • The sequence:
    unshare(CLONE_NEWPID)
    clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

    creates a new process in the new pid namespace without setting
    pid_ns->child_reaper. After forking, this results in a NULL
    pointer dereference.

    Avoid this and other nonsense scenarios that can show up after
    creating a new pid namespace with unshare by adding a new
    check in copy_process().

    Pointed-out-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds