25 Feb, 2012

1 commit

  • This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
    helper.

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
    This makes this poll entry inconsistent, but we don't care. If you
    share epoll fd which contains our sigfd with another process you
    should blame yourself. signalfd is "really special". I simply do
    not know how we can define the "right" semantics if it is used with
    epoll.

    The main problem is, epoll calls signalfd_poll() once to establish
    the connection with the wait queue, after that signalfd_poll(NULL)
    returns the different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks, this seems to be true.

    Note:

    - we do not have wake_up_all_poll() but wake_up_poll()
    is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE,
    we need a couple of simple changes in eventpoll.c to
    make sure it can't be "lost".
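
    A minimal userspace sketch of the POLLFREE idea described above,
    assuming a toy wait queue rather than the kernel's. All toy_* names
    and the bit values are illustrative, not taken from the kernel
    headers. The callback unlinks its own entry when it sees the
    "queue is going away" event, so the waker must save the next
    pointer before calling it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLHUP  0x0010u   /* illustrative bit values only */
#define TOY_POLLFREE 0x4000u

struct toy_wait_entry {
    struct toy_wait_entry *prev, *next;
    unsigned int seen;          /* events delivered to this waiter */
};

struct toy_wait_head {
    struct toy_wait_entry list; /* sentinel node of a circular list */
};

static void toy_head_init(struct toy_wait_head *h)
{
    h->list.prev = h->list.next = &h->list;
}

static int toy_head_empty(const struct toy_wait_head *h)
{
    return h->list.next == &h->list;
}

static void toy_add(struct toy_wait_head *h, struct toy_wait_entry *e)
{
    e->seen = 0;
    e->next = h->list.next;
    e->prev = &h->list;
    h->list.next->prev = e;
    h->list.next = e;
}

/* Same idea as list_del_init(): unlink, then point the entry at itself. */
static void toy_del_init(struct toy_wait_entry *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->prev = e->next = e;
}

static void toy_callback(struct toy_wait_entry *e, unsigned int events)
{
    e->seen |= events;
    if (events & TOY_POLLFREE)
        toy_del_init(e);        /* the queue is about to be freed: detach */
}

static void toy_wake_up(struct toy_wait_head *h, unsigned int events)
{
    struct toy_wait_entry *e = h->list.next, *n;

    while (e != &h->list) {
        n = e->next;            /* saved first: the callback may unlink e */
        toy_callback(e, events);
        e = n;
    }
}
```

    After toy_wake_up(&h, TOY_POLLHUP | TOY_POLLFREE) no entry still
    points into the head, so the head can be freed safely.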

    Reported-by: Maxime Bizon
    Cc:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 Feb, 2012

1 commit

  • Assorted fixes, sat in -next for a week or so...

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
    vfs: fix compat_sys_stat() handling of overflows in st_nlink
    quota: Fix deadlock with suspend and quotas
    vfs: Provide function to get superblock and wait for it to thaw
    vfs: fix panic in __d_lookup() with high dentry hashtable counts
    autofs4 - fix lockdep splat in autofs
    vfs: fix d_inode_lookup() dentry ref leak

    Linus Torvalds
     

14 Feb, 2012

3 commits


12 Feb, 2012

1 commit

  • Says Jens:

    "Time to push off some of the pending items. I really wanted to wait
    until we had the regression nailed, but alas it's not quite there yet.
    But I'm very confident that it's "just" a missing expire on exit, so
    fix from Tejun should be fairly trivial. I'm headed out for a week on
    the slopes.

    - Killing the barrier part of mtip32xx. It doesn't really support
    barriers, and it doesn't need them (writes are fully ordered).

    - A few fixes from Dan Carpenter, preventing overflows of integer
    multiplication.

    - A fixup for loop, fixing a previous commit that didn't quite solve
    the partial read problem from Dave Young.

    - A bio integer overflow fix from Kent Overstreet.

    - Improvement/fix of the door "keep locked" part of the cdrom shared
    code from Paolo Bonzini.

    - A few cfq fixes from Shaohua Li.

    - A fix for bsg sysfs warning when removing a file it did not create
    from Stanislaw Gruszka.

    - Two fixes for floppy from Vivek, preventing a crash.

    - A few block core fixes from Tejun. One killing the over-optimized
    ioc exit path, cleaning that up nicely. Two others fixing an oops
    on elevator switch, due to calling into the scheduler merge check
    code without holding the queue lock."

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix lockdep warning on io_context release put_io_context()
    relay: prevent integer overflow in relay_open()
    loop: zero fill bio instead of return -EIO for partial read
    bio: don't overflow in bio_get_nr_vecs()
    floppy: Fix a crash during rmmod
    floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
    cdrom: move shared static to cdrom_device_info
    bsg: fix sysfs link remove warning
    block: don't call elevator callbacks for plug merges
    block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
    mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
    block: strip out locking optimization in put_io_context()
    cdrom: use copy_to_user() without the underscores
    block: fix ioc locking warning
    block: fix NULL icq_cache reference
    block,cfq: change code order

    Linus Torvalds
     

11 Feb, 2012

1 commit


10 Feb, 2012

1 commit


07 Feb, 2012

2 commits

  • The following patch fixes a bug introduced by the following
    commit:

    e050e3f0a71b ("perf: Fix broken interrupt rate throttling")

    The patch caused the following warning to pop up depending on
    the sampling frequency adjustments:

    ------------[ cut here ]------------
    WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()

    It was caused by the following call sequence:

    perf_adjust_freq_unthr_context.part() {
        stop()
        if (delta > 0) {
            perf_adjust_period() {
                if (period > 8*...) {
                    stop()
                    ...
                    start()
                }
            }
        }
        start()
    }

    Which caused a double start and a double stop, thus triggering
    the assert in x86_pmu_start().

    The patch fixes the problem by avoiding the double calls. We
    pass a new argument to perf_adjust_period() to indicate whether
    or not the event is already stopped. We can't just remove the
    start/stop from that function because it's called from
    __perf_event_overflow where the event needs to be reloaded via a
    stop/start back-to-back call.
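
    A small sketch of the "already stopped" argument described above,
    assuming a toy event model. The toy_* names and the counter of bad
    transitions are hypothetical and stand in for the assertion in
    x86_pmu_start(); this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_event {
    bool running;
    int  bad_transitions;   /* start while running, or stop while stopped */
    long period;
};

static void toy_stop(struct toy_event *e)
{
    if (!e->running)
        e->bad_transitions++;
    e->running = false;
}

static void toy_start(struct toy_event *e)
{
    if (e->running)
        e->bad_transitions++;
    e->running = true;
}

/* 'disable' tells the function whether it must stop/start the event itself. */
static void toy_adjust_period(struct toy_event *e, long new_period, bool disable)
{
    if (disable)
        toy_stop(e);
    e->period = new_period;
    if (disable)
        toy_start(e);
}

/* Tick path: the caller has already stopped the event. */
static void toy_tick_adjust(struct toy_event *e, long new_period, bool disable)
{
    toy_stop(e);
    toy_adjust_period(e, new_period, disable);
    toy_start(e);
}

/* Overflow path: the event is running, so the adjustment reloads it. */
static void toy_overflow_adjust(struct toy_event *e, long new_period)
{
    toy_adjust_period(e, new_period, true);
}
```

    Calling toy_tick_adjust() with disable set to true reproduces the
    double stop and double start; passing false avoids both.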

    The patch reintroduces the assertion in x86_pmu_start() which
    was removed by commit:

    84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")

    In this second version, we've added calls to disable/enable PMU
    during unthrottling or frequency adjustment based on bug report
    of spurious NMI interrupts from Eric Dumazet.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: markus@trippelsdorf.de
    Cc: paulus@samba.org
    Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
    [ Minor edits to the changelog and to the code ]
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • put_io_context() performed a complex trylock dancing to avoid
    deferring ioc release to workqueue. It was also broken on UP because
    trylock was always assumed to succeed which resulted in unbalanced
    preemption count.

    While there are ways to fix the UP breakage, even the most
    pathological microbench (forced ioc allocation and tight fork/exit
    loop) fails to show any appreciable performance benefit of the
    optimization. Strip it out. If there turns out to be workloads which
    are affected by this change, simpler optimization from the discussion
    thread can be applied later.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Feb, 2012

2 commits

  • Power management fixes for 3.3-rc3

    Three power management regression fixes, one for a recent regression introduced
    by the freezer changes during the 3.3 merge window and two for regressions
    in cpuidle (resulting from PM QoS changes) and in the hibernate user space
    interface, both introduced during the 3.2 development cycle.

    They include:

    * Two hibernate (s2disk) regression fixes from Srivatsa S. Bhat (for
    regressions introduced during the 3.3 merge window and during the 3.2
    development cycle).

    * A cpuidle fix from Venki Pallipadi for a regression resulting from PM QoS
    changes during the 3.2 development cycle causing cpuidle to work incorrectly
    for CONFIG_PM unset.

    * tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / QoS: CPU C-state breakage with PM Qos change
    PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails
    PM / Hibernate: Thaw kernel threads in SNAPSHOT_CREATE_IMAGE ioctl path

    Linus Torvalds
     
  • If freezing of kernel threads fails, we are expected to automatically
    thaw tasks in the error recovery path. However, at times, we encounter
    situations in which we would like the automatic error recovery path
    to thaw only the kernel threads, because we want to be able to do
    some more cleanup before we thaw userspace. Something like:

    error = freeze_kernel_threads();
    if (error) {
        /* Do some cleanup */

        /* Only then thaw userspace tasks */
        thaw_processes();
    }

    An example of such a situation is where we freeze/thaw filesystems
    during suspend/hibernation. There, if freezing of kernel threads
    fails, we would like to thaw the frozen filesystems before thawing
    the userspace tasks.

    So, modify freeze_kernel_threads() to thaw only kernel threads in
    case of freezing failure. And change suspend_freeze_processes()
    accordingly. (At the same time, let us also get rid of the rather
    cryptic usage of the conditional operator (?:) in that function.)

    [rjw: In fact, this patch fixes a regression introduced during the
    3.3 merge window, because without it thaw_processes() may be called
    before swsusp_free() in some situations and that may lead to massive
    memory allocation failures.]
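
    The ordering this change is after can be sketched in userspace as
    follows. This is a toy model under stated assumptions: the toy_*
    names, the flags and the -1 error value are hypothetical, and
    image_freed merely stands in for cleanup such as swsusp_free().

```c
#include <assert.h>
#include <stdbool.h>

struct toy_system {
    bool user_frozen;
    bool kthreads_frozen;
    bool image_freed;               /* stands in for the cleanup step */
    bool cleanup_before_user_thaw;  /* recorded when userspace is thawed */
};

static void toy_thaw_kernel_threads(struct toy_system *s)
{
    s->kthreads_frozen = false;
}

static void toy_thaw_processes(struct toy_system *s)
{
    s->cleanup_before_user_thaw = s->image_freed;
    s->kthreads_frozen = false;
    s->user_frozen = false;
}

/* Returns 0 on success; on failure it thaws only the kernel threads. */
static int toy_freeze_kernel_threads(struct toy_system *s, bool fail)
{
    s->kthreads_frozen = true;
    if (fail) {
        toy_thaw_kernel_threads(s);
        return -1;
    }
    return 0;
}

static int toy_snapshot(struct toy_system *s, bool fail)
{
    int error;

    s->user_frozen = true;
    error = toy_freeze_kernel_threads(s, fail);
    if (error) {
        s->image_freed = true;      /* cleanup while userspace is frozen */
        toy_thaw_processes(s);      /* only then thaw userspace */
    }
    return error;
}
```

    Because the failing helper leaves userspace frozen, the caller
    decides when thaw_processes() runs relative to its own cleanup.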

    Signed-off-by: Srivatsa S. Bhat
    Acked-by: Tejun Heo
    Acked-by: Nigel Cunningham
    Signed-off-by: Rafael J. Wysocki

    Srivatsa S. Bhat
     

04 Feb, 2012

1 commit

  • In function pre_handler_kretprobe(), the allocated kretprobe_instance
    object will get leaked if the entry_handler callback returns non-zero.
    This may cause all the preallocated kretprobe_instance objects to be exhausted.

    This issue can be reproduced by changing
    samples/kprobes/kretprobe_example.c to probe "mutex_unlock". And the fix
    is straightforward: just put the allocated kretprobe_instance object back
    onto the free_instances list.
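
    A userspace sketch of the leak and its fix, assuming a toy
    preallocated pool. The toy_* names and the pool size are
    hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXACTIVE 2

struct toy_instance {
    struct toy_instance *next;
};

struct toy_kretprobe {
    struct toy_instance pool[TOY_MAXACTIVE];
    struct toy_instance *free_instances;
    int (*entry_handler)(struct toy_instance *ri);
    int nmissed;
};

static void toy_init(struct toy_kretprobe *rp,
                     int (*handler)(struct toy_instance *))
{
    rp->free_instances = NULL;
    rp->entry_handler = handler;
    rp->nmissed = 0;
    for (int i = 0; i < TOY_MAXACTIVE; i++) {
        rp->pool[i].next = rp->free_instances;
        rp->free_instances = &rp->pool[i];
    }
}

static int toy_free_count(const struct toy_kretprobe *rp)
{
    int n = 0;

    for (const struct toy_instance *ri = rp->free_instances; ri; ri = ri->next)
        n++;
    return n;
}

/* An entry handler that always declines to track the call. */
static int toy_reject(struct toy_instance *ri)
{
    (void)ri;
    return 1;
}

static void toy_pre_handler(struct toy_kretprobe *rp)
{
    struct toy_instance *ri = rp->free_instances;

    if (!ri) {
        rp->nmissed++;              /* pool exhausted */
        return;
    }
    rp->free_instances = ri->next;

    if (rp->entry_handler && rp->entry_handler(ri)) {
        /* The fix: put the unused instance back on the free list. */
        ri->next = rp->free_instances;
        rp->free_instances = ri;
        return;
    }
    /* Otherwise the instance stays in use until the function returns. */
}
```

    Without the recycling branch, TOY_MAXACTIVE rejected calls would
    drain the pool and every later hit would only bump nmissed.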

    [akpm@linux-foundation.org: use raw_spin_lock/unlock]
    Signed-off-by: Jiang Liu
    Acked-by: Jim Keniston
    Acked-by: Ananth N Mavinakayanahalli
    Cc: Masami Hiramatsu
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

03 Feb, 2012

2 commits

  • This fixes the race in process_vm_core found by Oleg (see

    http://article.gmane.org/gmane.linux.kernel/1235667/

    for details).

    This has been updated since I last sent it as the creation of the new
    mm_access() function did almost exactly the same thing as parts of the
    previous version of this patch did.

    In order to use mm_access() even when /proc isn't enabled, we move it to
    kernel/fork.c where other related process mm access functions already
    are.

    Signed-off-by: Chris Yeoh
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     
  • …or-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    bugs, x86: Fix printk levels for panic, softlockups and stack dumps

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf top: Fix number of samples displayed
    perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
    perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
    x86/dumpstack: Remove unneeded check in dump_trace()
    perf: Fix broken interrupt rate throttling

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix ancient race in do_exit()
    sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
    sched/s390: Fix compile error in sched/core.c
    sched: Fix rq->nr_uninterruptible update race

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/reboot: Remove VersaLogic Menlow reboot quirk
    x86/reboot: Skip DMI checks if reboot set by user
    x86: Properly parenthesize cmpxchg() macro arguments

    Linus Torvalds
     

02 Feb, 2012

1 commit

  • In the SNAPSHOT_CREATE_IMAGE ioctl, if the call to hibernation_snapshot()
    fails, the frozen tasks are not thawed.

    And in the case of success, if we happen to exit due to a successful freezer
    test, all tasks (including those of userspace) are thawed, whereas actually
    we should have thawed only the kernel threads at that point. Fix both these
    issues.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Rafael J. Wysocki
    Cc: stable@vger.kernel.org

    Srivatsa S. Bhat
     

30 Jan, 2012

1 commit

  • Commit 2aede851ddf08666f68ffc17be446420e9d2a056

    PM / Hibernate: Freeze kernel threads after preallocating memory

    introduced a mechanism by which kernel threads were frozen after
    the preallocation of hibernate image memory to avoid problems with
    frozen kernel threads not responding to memory freeing requests.
    However, it overlooked the s2disk code path in which the
    SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
    which caused freeze_workqueues_begin() to BUG(), because it saw
    that workqueues had been already frozen.

    Although in principle this issue might be addressed by removing
    the relevant BUG_ON() from freeze_workqueues_begin(), that would
    reintroduce the very problem that commit 2aede851ddf08666f68ffc17be4
    attempted to avoid into that particular code path. For this reason,
    to fix the issue at hand, introduce thaw_kernel_threads() and make
    the SNAPSHOT_FREE ioctl execute it.

    Special thanks to Srivatsa S. Bhat for detailed analysis of the
    problem.

    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Srivatsa S. Bhat
    Cc: stable@kernel.org

    Rafael J. Wysocki
     

27 Jan, 2012

9 commits

  • This issue happens under the following conditions:

    1. preemption is off
    2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
    3. RT scheduling class
    4. SMP system

    Sequence is as follows:

    1. Suppose the current task is A. It starts schedule().
    2. Task A is enqueued as a pushable task at the entry of schedule().
        __schedule
        prev = rq->curr;
        ...
        put_prev_task
        put_prev_task_rt
        enqueue_pushable_task
    3. Task B is picked as the next task.
        next = pick_next_task(rq);
    4. rq->curr is set to task B and context_switch is started.
        rq->curr = next;
    5. At the entry of context_switch, this cpu's rq->lock is released.
        context_switch
        prepare_task_switch
        prepare_lock_switch
        raw_spin_unlock_irq(&rq->lock);
    6. Shortly after rq->lock is released, an interrupt occurs and IRQ context starts.
    7. try_to_wake_up(), which is called by the ISR, acquires rq->lock.
        try_to_wake_up
        ttwu_remote
        rq = __task_rq_lock(p)
        ttwu_do_wakeup(rq, p, wake_flags);
        task_woken_rt
    8. push_rt_task picks task A, which was enqueued before.
        task_woken_rt
        push_rt_tasks(rq)
        next_task = pick_next_pushable_task(rq)
    9. At find_lock_lowest_rq(), if double_lock_balance() returns 0,
       lowest_rq can be the remote rq.
       (But if preemption is on, double_lock_balance() always returns 1 and
       this doesn't happen.)
        push_rt_task
        find_lock_lowest_rq
        if (double_lock_balance(rq, lowest_rq))..
    10. find_lock_lowest_rq returns the available rq. Task A is migrated to
        the remote cpu/rq.
        push_rt_task
        ...
        deactivate_task(rq, next_task, 0);
        set_task_cpu(next_task, lowest_rq->cpu);
        activate_task(lowest_rq, next_task, 0);
    11. But task A is in IRQ context on this cpu.
        So task A is scheduled by two cpus at the same time until it returns from IRQ.
        Task A's stack is corrupted.

    To fix it, don't migrate an RT task if it's still running.
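
    The rule in the last sentence can be sketched as a guard in the
    push path. This is a toy model, not the scheduler code: the toy_*
    names and the on_cpu flag are hypothetical stand-ins for "this task
    is still executing on its CPU".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    int  cpu;       /* runqueue the task is accounted to */
    bool on_cpu;    /* still executing, e.g. in an IRQ taken mid-switch */
};

/* Return the first pushable task that is not executing, or NULL. */
static struct toy_task *toy_pick_pushable(struct toy_task **pushable, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!pushable[i]->on_cpu)
            return pushable[i];
    return NULL;
}

/* Migrate one eligible task to dest_cpu; report whether anything moved. */
static bool toy_push(struct toy_task **pushable, size_t n, int dest_cpu)
{
    struct toy_task *t = toy_pick_pushable(pushable, n);

    if (!t)
        return false;
    t->cpu = dest_cpu;
    return true;
}
```

    In the sequence above, task A would have on_cpu set while the
    interrupt runs on its stack, so it would be skipped.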

    Signed-off-by: Chanho Min
    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Chanho Min
     
  • This patch fixes the sampling interrupt throttling mechanism.

    It was broken in v3.2. Events were not being unthrottled. The
    unthrottling mechanism required that events be checked at each
    timer tick.

    This patch solves this problem and also separates:

    - unthrottling
    - multiplexing
    - frequency-mode period adjustments

    Not all of them need to be executed at each timer tick.

    This third version of the patch is based on my original patch +
    PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).

    At each timer tick, for each context:

    - if the current CPU has throttled events, we unthrottle events

    - if context has frequency-based events, we adjust sampling periods

    - if we have reached the jiffies interval, we multiplex (rotate)

    We decoupled rotation (multiplexing) from frequency-mode sampling
    period adjustments. They should not necessarily happen at the same
    rate. Multiplexing is subject to jiffies_interval (currently at 1
    but could be higher once the tunable is exposed via sysfs).

    We have grouped frequency-mode adjustment and unthrottling into the
    same routine to minimize code duplication. When throttled while in
    frequency mode, we scan the events only once.

    We have fixed the threshold enforcement code in __perf_event_overflow().
    There was a bug whereby it would allow more than the authorized rate
    because an increment of hwc->interrupts was not executed at the right
    place.

    The patch was tested with low sampling limit (2000) and fixed periods,
    frequency mode, overcommitted PMU.

    On a 2.1GHz AMD CPU:

    $ cat /proc/sys/kernel/perf_event_max_sample_rate
    2000

    We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):

    $ perf record -e cycles,cycles -c 700000 noploop 10
    $ perf report -D | tail -21

    Aggregated stats:
    TOTAL events: 80086
    MMAP events: 88
    COMM events: 2
    EXIT events: 4
    THROTTLE events: 19996
    UNTHROTTLE events: 19996
    SAMPLE events: 40000

    cycles stats:
    TOTAL events: 40006
    MMAP events: 5
    COMM events: 1
    EXIT events: 4
    THROTTLE events: 9998
    UNTHROTTLE events: 9998
    SAMPLE events: 20000

    cycles stats:
    TOTAL events: 39996
    THROTTLE events: 9998
    UNTHROTTLE events: 9998
    SAMPLE events: 20000

    For 10s, the cap is 2x2000x10 = 40000 samples.
    We get exactly that: 20000 samples/event.
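
    The arithmetic in this test can be written out as a short sketch.
    The toy_* helper names are hypothetical; only the numbers come from
    the measurements above.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed sampling period (in cycles) giving roughly 'rate' samples/sec. */
static uint64_t toy_period_for_rate(uint64_t cpu_hz, uint64_t rate)
{
    return cpu_hz / rate;
}

/* Upper bound on samples when each event is capped at max_rate per second. */
static uint64_t toy_sample_cap(unsigned int nr_events, uint64_t max_rate,
                               unsigned int seconds)
{
    return (uint64_t)nr_events * max_rate * seconds;
}

/* Samples expected for one event: the requested rate clamped by the cap. */
static uint64_t toy_samples_per_event(uint64_t requested_rate,
                                      uint64_t max_rate, unsigned int seconds)
{
    uint64_t rate = requested_rate < max_rate ? requested_rate : max_rate;

    return rate * seconds;
}
```

    With a 2.1GHz clock, a requested rate of 3000, a limit of 2000 and
    a 10s run, these helpers give the period 700000, the cap 40000 and
    20000 samples per event quoted above.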

    Signed-off-by: Stephane Eranian
    Cc: # v3.2+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • try_to_wake_up() has a problem which may change status from TASK_DEAD to
    TASK_RUNNING in race condition with SMI or guest environment of virtual
    machine. As a result, exited task is scheduled() again and panic occurs.

    Here is the sequence how it occurs:

    ----------------------------------+-----------------------------
                                      |
               CPU A                  |             CPU B
    ----------------------------------+-----------------------------

    TASK A calls exit()....

    do_exit()

      exit_mm()
        down_read(mm->mmap_sem);

        rwsem_down_failed_common()

          set TASK_UNINTERRUPTIBLE
          set waiter.task wait_list
               :
          raw_spin_unlock_irq()
          (I/O interruption occurred)

                                        __rwsem_do_wake(mmap_sem)

                                          list_del(&waiter->list);
                                          waiter->task = NULL
                                          wake_up_process(task A)
                                            try_to_wake_up()
                                              (task is still
                                               TASK_UNINTERRUPTIBLE,
                                               p->on_rq is still 1.)

                                              ttwu_do_wakeup()
                                                (*A)
                                                  :
          (I/O interruption handler finished)

          if (!waiter.task)
              schedule() is not called
              due to waiter.task is NULL.

          tsk->state = TASK_RUNNING

               :
                                              check_preempt_curr();
                                                  :
      task->state = TASK_DEAD
                                              (*B)
    pi_lock which is used
    in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
    waking up.

    Signed-off-by: Yasunori Goto
    Acked-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasunori Goto
     
  • * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Call perf_cgroup_event_time() directly
    perf: Don't call release_callchain_buffers() if allocation fails

    Linus Torvalds
     
  • * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Add missing __cpuinit annotation in rcutorture code
    sched: Add "const" to is_idle_task() parameter
    rcu: Make rcutorture bool parameters really bool (core code)
    memblock: Fix alloc failure due to dumb underflow protection in memblock_find_in_range_node()

    Linus Torvalds
     
  • rsyslog will display KERN_EMERG messages on a connected
    terminal. However, these messages are useless/undecipherable
    for a general user.

    For example, after a softlockup we get:

    Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
    kernel:Stack:

    Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
    kernel:Call Trace:

    Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
    kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
    d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89

    This happens because the printk levels for these messages are
    incorrect. Only an informational message should be displayed on
    a terminal.

    I modified the printk levels for various messages in the kernel
    and tested the output by using the drivers/misc/lkdtm.c kernel
    modules (ie, softlockups, panics, hard lockups, etc.) and
    confirmed that the console output was still the same and that
    the output to the terminals was correct.

    For example, in the case of a softlockup we now see the much
    more informative:

    Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
    BUG: soft lockup - CPU4 stuck for 60s!

    instead of the above confusing messages.

    AFAICT, the messages no longer have to be KERN_EMERG. In the
    most important case of a panic we set console_verbose(). As for
    the other less severe cases the correct data is output to the
    console and /var/log/messages.

    Successfully tested by me using the drivers/misc/lkdtm.c module.

    Signed-off-by: Prarit Bhargava
    Cc: dzickus@redhat.com
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
    Signed-off-by: Ingo Molnar

    Prarit Bhargava
     
  • With the recent nohz scheduler changes, rq's nohz flag
    'NOHZ_TICK_STOPPED' and its associated state doesn't get cleared
    immediately after the cpu exits idle. This gets cleared as part
    of the next tick seen on that cpu.

    For the cpu offline support, we need to clear this state
    manually. Fix it by registering a cpu notifier, which clears the
    nohz idle load balance state for this rq explicitly during the
    CPU_DYING notification.

    There won't be any nohz updates for that cpu, after the
    CPU_DYING notification. But let's be extra paranoid and skip
    updating the nohz state in the select_nohz_load_balancer() if
    the cpu is not in active state anymore.

    Reported-by: Srivatsa S. Bhat
    Reviewed-and-tested-by: Srivatsa S. Bhat
    Tested-by: Sergey Senozhatsky
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Commit 029632fbb7b7c9d85063cc9eb470de6c54873df3 ("sched: Make
    separate sched*.c translation units") removed the include of
    asm/mutex.h from sched.c.

    This breaks the combination of:

    CONFIG_MUTEX_SPIN_ON_OWNER=yes
    CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX=yes

    like s390 without mutex debugging:

    CC kernel/sched/core.o
    kernel/sched/core.c: In function ‘mutex_spin_on_owner’:
    kernel/sched/core.c:3287: error: implicit declaration of function ‘arch_mutex_cpu_relax’

    Let's re-add the include to kernel/sched/core.c.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1326268696-30904-1-git-send-email-borntraeger@de.ibm.com
    Signed-off-by: Ingo Molnar

    Christian Borntraeger
     
  • KOSAKI Motohiro noticed the following race:

    > CPU0                    CPU1
    > --------------------------------------------------------
    > deactivate_task()
    >                         task->state = TASK_UNINTERRUPTIBLE;
    > activate_task()
    >   rq->nr_uninterruptible--;
    >
    >                         schedule()
    >                           deactivate_task()
    >                             rq->nr_uninterruptible++;
    >

    Kosaki-San's scenario is possible when CPU0 runs
    __sched_setscheduler() against CPU1's current @task.

    __sched_setscheduler() does a dequeue/enqueue in order to move
    the task to its new queue (position) to reflect the newly provided
    scheduling parameters. However it should be completely invariant to
    nr_uninterruptible accounting; sched_setscheduler() doesn't affect
    readiness to run, merely policy on when to run.

    So convert the inappropriate activate/deactivate_task usage to
    enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.

    Also convert the two other sites: __migrate_task() and
    normalize_task() that still use activate/deactivate_task. These sites
    aren't really a problem since __migrate_task() will only be called on
    non-running tasks (and is therefore immune to the described problem)
    and normalize_task() isn't ever used on regular systems.

    Also remove the comments from activate/deactivate_task since they're
    misleading at best.
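
    The accounting drift can be replayed in a single-threaded sketch.
    The toy_* names are hypothetical and the interleaving is simply
    executed in the order shown in the race above, followed by one
    wakeup of the sleeping task.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_RUNNING, TOY_UNINTERRUPTIBLE };

struct toy_rq {
    long nr_uninterruptible;
    int  nr_queued;
};

struct toy_task {
    enum toy_state state;
};

/* Pure queue operations: no load accounting. */
static void toy_enqueue(struct toy_rq *rq) { rq->nr_queued++; }
static void toy_dequeue(struct toy_rq *rq) { rq->nr_queued--; }

/* Queue operations that also adjust nr_uninterruptible. */
static void toy_activate(struct toy_rq *rq, const struct toy_task *p)
{
    if (p->state == TOY_UNINTERRUPTIBLE)
        rq->nr_uninterruptible--;
    toy_enqueue(rq);
}

static void toy_deactivate(struct toy_rq *rq, const struct toy_task *p)
{
    if (p->state == TOY_UNINTERRUPTIBLE)
        rq->nr_uninterruptible++;
    toy_dequeue(rq);
}

/* Returns nr_uninterruptible after one full sleep/wake cycle. */
static long toy_replay(bool setscheduler_uses_activate)
{
    struct toy_rq rq = { 0, 1 };
    struct toy_task t = { TOY_RUNNING };

    /* CPU0: setscheduler takes the task off its queue. */
    if (setscheduler_uses_activate)
        toy_deactivate(&rq, &t);
    else
        toy_dequeue(&rq);

    /* CPU1: the task prepares to sleep. */
    t.state = TOY_UNINTERRUPTIBLE;

    /* CPU0: setscheduler puts the task back. */
    if (setscheduler_uses_activate)
        toy_activate(&rq, &t);
    else
        toy_enqueue(&rq);

    /* CPU1: schedule() takes the sleeping task off the queue. */
    toy_deactivate(&rq, &t);

    /* Later: the task is woken up. */
    toy_activate(&rq, &t);
    t.state = TOY_RUNNING;

    return rq.nr_uninterruptible;
}
```

    A balanced sleep/wake cycle should leave the counter at 0; the
    activate/deactivate variant leaves it one too low.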

    Reported-by: KOSAKI Motohiro
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Jan, 2012

1 commit

  • Davem says:

    1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

    2) tg3 header length computation correction from Eric Dumazet.

    3) More build and reference counting fixes for socket memory cgroup
    code from Glauber Costa.

    4) module.h snuck back into a core header after all the hard work we
    did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

    5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
    Alessandro Rubini.

    6) Netlink message generation fix in new team driver, should only advertise
    the entries that changed during events, from Jiri Pirko.

    7) SRIOV VF registration and unregistration fixes, and also add a
    missing PCI ID, from Roopa Prabhu.

    8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

    9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

    10) Memory leak fix in net/hyperv do_set_multicast() handling, from Wei Yongjun.

    11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

    12) TCP loss detection fix from Yuchung Cheng.

    13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

    14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

    15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
    congestion control modules, fix from Neal Cardwell.

    16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

    17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

    18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

    19) rds locking bug fix during info dumps, from yours truly.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
    rds: Make rds_sock_lock BH rather than IRQ safe.
    netprio_cgroup.h: dont include module.h from other includes
    net: flow_dissector.c missing include linux/export.h
    team: send only changed options/ports via netlink
    net/hyperv: fix possible memory leak in do_set_multicast()
    drivers/net: dsa/mv88e6xxx.c files need linux/module.h
    stmmac: added PCI identifiers
    llc: Fix race condition in llc_ui_recvmsg
    stmmac: fix phy naming inconsistency
    dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
    tg3: fix ipv6 header length computation
    skge: add byte queue limit support
    mv643xx_eth: Add Rx Discard and Rx Overrun statistics
    bnx2x: fix compilation error with SOE in fw_dump
    bnx2x: handle CHIP_REVISION during init_one
    bnx2x: allow user to change ring size in ISCSI SD mode
    bnx2x: fix Big-Endianess in ethtool -t
    bnx2x: fixed ethtool statistics for MF modes
    bnx2x: credit-leakage fixup on vlan_mac_del_all
    macvlan: fix a possible use after free
    ...

    Linus Torvalds
     

24 Jan, 2012

5 commits

  • Power management fixes for 3.3

    Two fixes for regressions introduced during the merge window, one fix for
    a long-standing obscure issue in the computation of hibernate image size
    and two small PM documentation fixes.

    * tag 'pm-fixes-for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / Sleep: Fix read_unlock_usermodehelper() call.
    PM / Hibernate: Rewrite unlock_system_sleep() to fix s2disk regression
    PM / Hibernate: Correct additional pages number calculation
    PM / Documentation: Fix minor issue in freezing_of_tasks.txt
    PM / Documentation: Fix spelling mistake in basic-pm-debugging.txt

    Linus Torvalds
     
  • The usual kernel-doc fixups from Randy. Some of them David acked as
    merged in his tree, these are the random left-overs.

    * kernel-doc:
    docbook: fix sched source file names in device-drivers book
    docbook: change iomap source filename in deviceiobook
    docbook: don't use serial_core.h in device-drivers book
    kernel-doc: fix kernel-doc warnings in sched
    kernel-doc: fix new warnings in cfg80211.h
    kernel-doc: fix new warning in usb.h
    kernel-doc: fix new warnings in device.h
    kernel-doc: fix new warnings in debugfs
    kernel-doc: fix new warning in regulator core
    kernel-doc: fix new warnings in pci
    kernel-doc: fix new warnings in driver-core
    kernel-doc: fix new warnings in auditsc.c
    scripts/kernel-doc: fix fatal error caused by cfg80211.h

    Linus Torvalds
     
  • Fix new kernel-doc notation warnings:

    Warning(include/linux/sched.h:2094): No description found for parameter 'p'
    Warning(include/linux/sched.h:2094): Excess function parameter 'tsk' description in 'is_idle_task'
    Warning(kernel/sched/cpupri.c:139): No description found for parameter 'newpri'
    Warning(kernel/sched/cpupri.c:139): Excess function parameter 'pri' description in 'cpupri_set'
    Warning(kernel/sched/cpupri.c:208): Excess function parameter 'bootmem' description in 'cpupri_init'

    Signed-off-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix new kernel-doc warnings in auditsc.c:

    Warning(kernel/auditsc.c:1875): No description found for parameter 'success'
    Warning(kernel/auditsc.c:1875): No description found for parameter 'return_code'
    Warning(kernel/auditsc.c:1875): Excess function parameter 'pt_regs' description in '__audit_syscall_exit'

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Cc: Eric Paris
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Commit ef53d9c5e ("kprobes: improve kretprobe scalability with hashed
    locking") introduced a bug where we can potentially leak
    kretprobe_instances since we initialize a hlist head after having used
    it.

    Initialize the hlist head before using it.

    Reported by: Jim Keniston
    Acked-by: Jim Keniston
    Signed-off-by: Ananth N Mavinakayanahalli
    Acked-by: Masami Hiramatsu
    Cc: Srinivasa D S
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
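
The ordering problem described in the kprobes entry above can be sketched in user space. This is only an illustration under stated assumptions: `toy_node`, `toy_head` and the helper functions are hypothetical stand-ins, not the kernel's hlist API or the actual kprobes code. It shows why initializing a list head after entries have been added drops those entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal singly linked list; not the kernel's hlist. */
struct toy_node { struct toy_node *next; };
struct toy_head { struct toy_node *first; };

void toy_init_head(struct toy_head *h)
{
    h->first = NULL;
}

void toy_add_head(struct toy_node *n, struct toy_head *h)
{
    n->next = h->first;
    h->first = n;
}

size_t toy_count(const struct toy_head *h)
{
    size_t n = 0;
    for (const struct toy_node *p = h->first; p; p = p->next)
        n++;
    return n;
}

/* Wrong order: the head is (re)initialized after nodes were added,
 * so every node that was just linked is no longer reachable. */
size_t collect_init_late(struct toy_head *head, struct toy_node *nodes, size_t count)
{
    for (size_t i = 0; i < count; i++)
        toy_add_head(&nodes[i], head);
    toy_init_head(head);
    return toy_count(head);
}

/* Right order: initialize first, then add. */
size_t collect_init_first(struct toy_head *head, struct toy_node *nodes, size_t count)
{
    toy_init_head(head);
    for (size_t i = 0; i < count; i++)
        toy_add_head(&nodes[i], head);
    return toy_count(head);
}
```

With three nodes, `collect_init_late()` reports an empty list while `collect_init_first()` reports all three, which is the kind of lost reference that shows up as a leak when the nodes are heap objects.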
     

23 Jan, 2012

1 commit

  • There is a case in __sk_mem_schedule(), where an allocation
    is beyond the maximum, but yet we are allowed to proceed.
    It happens under the following condition:

    sk->sk_wmem_queued + size >= sk->sk_sndbuf

    The network code won't revert the allocation in this case,
    meaning that at some point later it'll try to do it. Since
    this is never communicated to the underlying res_counter
    code, there is an imbalance in the res_counter uncharge operation.

    I see two ways of fixing this:

    1) storing the information about those allocations somewhere
    in memcg, and then deducting from that first, before
    we start draining the res_counter,
    2) providing a slightly different allocation function for
    the res_counter, that matches the original behavior of
    the network code more closely.

    I decided to go for #2 here, believing it to be more elegant,
    since #1 would require us to do basically that, but in a more
    obscure way.

    Signed-off-by: Glauber Costa
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    CC: Tejun Heo
    CC: Li Zefan
    CC: Laurent Chavey
    Acked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Glauber Costa
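
The second option from the entry above (a charge function that matches the network code's "proceed anyway" behavior) can be sketched as follows. This is a simplified model, not the kernel's res_counter: `toy_counter`, `toy_charge()`, `toy_charge_nofail()` and `toy_uncharge()` are hypothetical names, locking and hierarchy are left out, and overflow is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified counter; not the kernel's struct res_counter. */
struct toy_counter {
    unsigned long usage;
    unsigned long limit;
};

/* Strict charge: refuse, and record nothing, if the limit would be exceeded. */
bool toy_charge(struct toy_counter *c, unsigned long val)
{
    assert(c != NULL);
    if (c->usage + val > c->limit)
        return false;
    c->usage += val;
    return true;
}

/* "Nofail" charge: always record the usage, and only report whether the
 * counter is still within its limit afterwards. */
bool toy_charge_nofail(struct toy_counter *c, unsigned long val)
{
    assert(c != NULL);
    c->usage += val;
    return c->usage <= c->limit;
}

/* Uncharge: a caller must never give back more than was recorded. */
void toy_uncharge(struct toy_counter *c, unsigned long val)
{
    assert(c != NULL);
    assert(val <= c->usage);
    c->usage -= val;
}
```

If a caller goes ahead with an allocation after a failed strict charge, its later uncharge returns more than was ever recorded. Recording the usage unconditionally keeps each charge paired with its uncharge, while the return value still tells the caller that the limit was exceeded.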
     

21 Jan, 2012

2 commits

  • perf_event_time() will call perf_cgroup_event_time()
    if @event is a cgroup event. Just do it directly and avoid
    the extra check.

    Signed-off-by: Namhyung Kim
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1327021966-27688-2-git-send-email-namhyung.kim@lge.com
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     
  • When alloc_callchain_buffers() fails, it frees all of the
    entries before returning. Calling
    release_callchain_buffers() in addition will cause a NULL pointer
    dereference, since callchain_cpu_entries is not set.

    Signed-off-by: Namhyung Kim
    Acked-by: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1327021966-27688-1-git-send-email-namhyung.kim@lge.com
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

20 Jan, 2012

2 commits

  • …-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/accounting, proc: Fix /proc/stat interrupts sum

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
    x86/kprobes: Add arch/x86/tools/insn_sanity to .gitignore
    x86/kprobes: Fix typo transferred from Intel manual

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
    x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
    x86, opcode: ANDN and Group 17 in x86-opcode-map.txt
    x86/kconfig: Move the ZONE_DMA entry under a menu
    x86/UV2: Add accounting for BAU strong nacks
    x86/UV2: Ack BAU interrupt earlier
    x86/UV2: Remove stale no-resources test for UV2 BAU
    x86/UV2: Work around BAU bug
    x86/UV2: Fix BAU destination timeout initialization
    x86/UV2: Fix new UV2 hardware by using native UV2 broadcast mode
    x86: Get rid of dubious one-bit signed bitfield

    Linus Torvalds
     
  • The struct bm_block is allocated by chain_alloc(),
    so it should be counted in LINKED_PAGE_DATA_SIZE.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Rafael J. Wysocki

    Namhyung Kim
     

18 Jan, 2012

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
    audit: no leading space in audit_log_d_path prefix
    audit: treat s_id as an untrusted string
    audit: fix signedness bug in audit_log_execve_info()
    audit: comparison on interprocess fields
    audit: implement all object interfield comparisons
    audit: allow interfield comparison between gid and ogid
    audit: complex interfield comparison helper
    audit: allow interfield comparison in audit rules
    Kernel: Audit Support For The ARM Platform
    audit: do not call audit_getname on error
    audit: only allow tasks to set their loginuid if it is -1
    audit: remove task argument to audit_set_loginuid
    audit: allow audit matching on inode gid
    audit: allow matching on obj_uid
    audit: remove audit_finish_fork as it can't be called
    audit: reject entry,always rules
    audit: inline audit_free to simplify the look of generic code
    audit: drop audit_set_macxattr as it doesn't do anything
    audit: inline checks for not needing to collect aux records
    audit: drop some potentially inadvisable likely notations
    ...

    Use evil merge to fix up grammar mistakes in Kconfig file.

    Bad speling and horrible grammar (and copious swearing) is to be
    expected, but let's keep it to commit messages and comments, rather than
    expose it to users in config help texts or printouts.

    Linus Torvalds
     
  • audit_log_d_path() injects an additional space before the prefix,
    which serves no purpose and doesn't mix well with other audit_log*()
    functions that do not sneak extra characters into the log.

    Signed-off-by: Kees Cook
    Signed-off-by: Eric Paris

    Kees Cook
     
  • In the loop, a size_t "len" is used to hold the return value of
    audit_log_single_execve_arg(), which returns -1 on error. In that
    case the error handling (len
    Signed-off-by: Eric Paris

    Xi Wang
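
The message above is cut off, but the class of bug it names (a `size_t` holding a -1 error return) can be sketched independently. In this illustration `fake_arg_len()` is a hypothetical stand-in for a helper that returns -1 on error, and the `<= 0` check is an assumption, since the exact condition is truncated in the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a positive length, or -1 on error. */
int fake_arg_len(int fail)
{
    return fail ? -1 : 8;
}

/* Buggy pattern: the int result is stored in an unsigned size_t, so -1
 * becomes a huge positive value and the error check never fires. */
int error_detected_unsigned(int fail)
{
    size_t len = (size_t)fake_arg_len(fail);
    return len <= 0;   /* for an unsigned type, true only when len == 0 */
}

/* Fixed pattern: keep the result in a signed type so -1 stays negative. */
int error_detected_signed(int fail)
{
    int len = fake_arg_len(fail);
    return len <= 0;
}
```

On the error path `error_detected_unsigned(1)` returns 0 (the error is missed) while `error_detected_signed(1)` returns 1. Both agree on the success path.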