15 Mar, 2011

9 commits

  • I have a machine where entering deep C-states broke.
    pm_qos was a hot candidate, but I couldn't find any way to double
    check without the need of recompiling.

    While in this case it was a driver bug (ath9k):
    https://bugzilla.kernel.org/show_bug.cgi?id=27532

    powertop or other tools may want to read out cpu_dma_latency
    restrictions, which could be what prevents a machine from
    entering deeper C-states.

    Output with this patch:

    # default value of 2000 * USEC_PER_SEC (0x77359400)
    cat /dev/network_latency |hexdump
    0000000 9400 7735
    0000004

    # value of 55 us which is the reason for not entering C2
    cat /dev/cpu_dma_latency |hexdump
    0000000 0037 0000
    0000004

    There is no reason to hide this info -> make pm_qos files readable.

    Signed-off-by: Thomas Renninger
    Signed-off-by: Rafael J. Wysocki

    Thomas Renninger
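
The hexdump output in the commit above can be decoded back into a number by hand. The following is a minimal sketch, assuming the four bytes are stored least-significant first, as on the little-endian machine in the example; `decode_le32` and `example_usage` are hypothetical names and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least-significant first.
 * Assumption: this matches the byte order of the machine that produced
 * the hexdump output quoted in the commit message. */
int32_t decode_le32(const unsigned char b[4])
{
	uint32_t v = (uint32_t)b[0] |
		     ((uint32_t)b[1] << 8) |
		     ((uint32_t)b[2] << 16) |
		     ((uint32_t)b[3] << 24);

	return (int32_t)v;
}

void example_usage(void)
{
	/* hexdump printed "0037 0000": the bytes are 37 00 00 00 */
	const unsigned char dma[4] = { 0x37, 0x00, 0x00, 0x00 };
	/* hexdump printed "9400 7735": the bytes are 00 94 35 77 */
	const unsigned char net[4] = { 0x00, 0x94, 0x35, 0x77 };

	assert(decode_le32(dma) == 55);		/* 55 us */
	assert(decode_le32(net) == 2000000000);	/* 2000 * USEC_PER_SEC */
}
```

hexdump groups the output into 16-bit words by default, which is why the bytes appear swapped within each pair.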
     
  • 'n' defaults are pretty pointless and actually bogus when used with
    prompt-less config options.

    The "bool"/"default y" pair with no prompt can be expressed more
    compactly using def_bool.

    [rjw: Rebased on top of earlier patches modifying this file.]

    Signed-off-by: Jan Beulich
    Signed-off-by: Rafael J. Wysocki

    Jan Beulich
     
  • The variable pm_flags is used to prevent APM from being enabled
    along with ACPI, which would lead to problems. However, acpi_init()
    is always called before apm_init() and after acpi_init() has
    returned, it is known whether or not ACPI will be used. Namely, if
    acpi_disabled is not set after acpi_init() has returned, this means
    that ACPI is enabled. Thus, it is sufficient to check acpi_disabled
    in apm_init() to prevent APM from being enabled in parallel with
    ACPI.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Len Brown

    Rafael J. Wysocki
     
    CONFIG_PM_SLEEP_ADVANCED_DEBUG is not used any more, so drop it.
    CONFIG_CAN_PM_TRACE need not depend on EXPERIMENTAL, so remove
    that dependency.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • After redefining CONFIG_PM to depend on (CONFIG_PM_SLEEP ||
    CONFIG_PM_RUNTIME) the CONFIG_PM_OPS option is redundant and can be
    replaced with CONFIG_PM.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Reorder configuration options in kernel/power/Kconfig so that
    the options depended on are at the top of the list.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • From the users' point of view CONFIG_PM is really only used for
    making it possible to set CONFIG_SUSPEND, CONFIG_HIBERNATION,
    CONFIG_PM_RUNTIME and (surprisingly enough) CONFIG_XEN_SAVE_RESTORE
    (CONFIG_PM_OPP also depends on CONFIG_PM, but quite artificially).
    However, both CONFIG_SUSPEND and CONFIG_HIBERNATION require platform
    support (independent of CONFIG_PM) and it is not quite obvious that
    CONFIG_PM has to be set for CONFIG_XEN_SAVE_RESTORE to be available.
    Thus, from the users' point of view, it would be more logical to
    automatically select CONFIG_PM if any of the above options depending
    on it are set.

    Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME),
    which will cause it to be selected when any of CONFIG_SUSPEND,
    CONFIG_HIBERNATION, CONFIG_PM_RUNTIME, CONFIG_XEN_SAVE_RESTORE is
    set and will clarify its meaning.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • If direct references to pm_flags are removed from drivers/acpi/bus.c,
    CONFIG_ACPI will not need to depend on CONFIG_PM any more. Make that
    happen.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Len Brown

    Rafael J. Wysocki
     
  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: NFSROOT should default to "proto=udp"
    nfs4: remove duplicated #include
    NFSv4: nfs4_state_mark_reclaim_nograce() should be static
    NFSv4: Fix the setlk error handler
    NFSv4.1: Fix the handling of the SEQUENCE status bits
    NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
    NFSv4.1 reclaim complete must wait for completion
    NFSv4: remove duplicate clientid in struct nfs_client
    NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY
    sunrpc: Propagate errors from xs_bind() through xs_create_sock()
    (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid
    nfs: fix compilation warning
    nfs: add kmalloc return value check in decode_and_add_ds
    SUNRPC: Remove resource leak in svc_rdma_send_error()
    nfs: close NFSv4 COMMIT vs. CLOSE race
    SUNRPC: Close a race in __rpc_wait_for_completion_task()

    Linus Torvalds
     

11 Mar, 2011

2 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix sched rt group scheduling when hierachy is enabled

    Linus Torvalds
     
  • Although they run as rpciod background tasks, under normal operation
    (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
    and nfs4_do_close() want to be fully synchronous. This means that when we
    exit, we want all references to the rpc_task to be gone, and we want
    any dentry references etc. held by that task to be released.

    For this reason these functions call __rpc_wait_for_completion_task(),
    followed by rpc_put_task() in the expectation that the latter will be
    releasing the last reference to the rpc_task, and thus ensuring that the
    callback_ops->rpc_release() has been called synchronously.

    This patch fixes a race which exists due to the fact that
    rpciod calls rpc_complete_task() (in order to wake up the callers of
    __rpc_wait_for_completion_task()) and then subsequently calls
    rpc_put_task() without ensuring that these two steps are done atomically.

    In order to avoid adding new spin locks, the patch uses the existing
    waitqueue spin lock to order the rpc_task reference count releases between
    the waiting process and rpciod.
    The common case where nobody is waiting for completion is optimised for by
    checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
    reference count is 1: in those cases we drop trying to grab the spin lock,
    and immediately free up the rpc_task.

    Those few processes that need to put the rpc_task from inside an
    asynchronous context and that do not care about ordering are given a new
    helper: rpc_put_task_async().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

10 Mar, 2011

1 commit


08 Mar, 2011

1 commit

  • a) struct inode is not going to be freed under ->d_compare();
    however, the thing PROC_I(inode)->sysctl points to just might.
    Fortunately, it's enough to make freeing that sucker delayed,
    provided that we don't step on its ->unregistering, clear
    the pointer to it in PROC_I(inode) before dropping the reference
    and check if it's NULL in ->d_compare().

    b) I'm not sure that we *can* walk into NULL inode here (we recheck
    dentry->seq between verifying that it's still hashed / fetching
    dentry->d_inode and passing it to ->d_compare() and there's no
    negative hashed dentries in /proc/sys/*), but if we can walk into
    that, we really should not have ->d_compare() return 0 on it!
    That said, I really suspect that this check can be simply killed.
    Nick?

    Signed-off-by: Al Viro

    Al Viro
     

05 Mar, 2011

2 commits


04 Mar, 2011

1 commit

    The current sched rt code is broken when it comes to hierarchical
    scheduling. This patch fixes two problems:

    1. It adds redundant enqueuing (harmless) when it finds a queue
    has tasks enqueued, but it has no run time and it is not
    throttled.

    2. The most important change is in sched_rt_rq_enqueue/dequeue.
    The code just picks the rt_rq belonging to the current cpu
    on which the period timer runs, the patch fixes it, so that
    the correct rt_se is enqueued/dequeued.

    Tested with a simple hierarchy:

    /c/d, where c and d are assigned similar runtimes of 50,000 and a
    "while 1" loop runs within "d". Both c and d get throttled. Without
    the patch, the task just stops running and never runs again (depending
    on where the sched_rt b/w timer runs). With the patch, the
    task is throttled and runs as expected.

    [ bharata, suggestions on how to pick the rt_se belonging to the
    rt_rq and correct cpu ]

    Signed-off-by: Balbir Singh
    Acked-by: Bharata B Rao
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Balbir Singh
     

03 Mar, 2011

1 commit

    If we enable trace events to trace block actions, we use
    blk_fill_rwbs_rq to analyze the corresponding actions
    in the request's cmd_flags, but we only take the lowest 2 bits
    from it, so most of the other flags (e.g., REQ_SYNC) are missing.
    For example, with a sync write we get:
    write_test-2409 [001] 160.013869: block_rq_insert: 3,64 W 0 () 258135 + 8 [write_test]

    Since now we have integrated the flags of both bio and request,
    it is safe to pass rq->cmd_flags directly to blk_fill_rwbs and
    blk_fill_rwbs_rq isn't needed any more.

    With this patch, after a sync write we get:
    write_test-2417 [000] 226.603878: block_rq_insert: 3,64 WS 0 () 258135 + 8 [write_test]

    Signed-off-by: Tao Ma
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tao Ma
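
The effect described in the commit above (a sync write traced as "W" instead of "WS") can be reproduced in a few lines of user-space C. This is only a sketch: the flag bit positions `SK_REQ_WRITE` and `SK_REQ_SYNC` and the function `fill_rwbs_sketch` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag layout for illustration only. */
#define SK_REQ_WRITE	(1u << 0)
#define SK_REQ_SYNC	(1u << 4)

/* Build a short action string: 'W' or 'R', followed by 'S' for sync. */
void fill_rwbs_sketch(char out[3], unsigned int flags)
{
	int i = 0;

	out[i++] = (flags & SK_REQ_WRITE) ? 'W' : 'R';
	if (flags & SK_REQ_SYNC)
		out[i++] = 'S';
	out[i] = '\0';
}

void example_usage(void)
{
	unsigned int flags = SK_REQ_WRITE | SK_REQ_SYNC;
	char buf[3];

	/* Keeping only the lowest 2 bits drops the sync flag. */
	fill_rwbs_sketch(buf, flags & 0x3);
	assert(strcmp(buf, "W") == 0);

	/* Passing the full flag word preserves it. */
	fill_rwbs_sketch(buf, flags);
	assert(strcmp(buf, "WS") == 0);
}
```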
     

26 Feb, 2011

1 commit

    When the per cpu timer is marked CLOCK_EVT_FEAT_C3STOP, we can only
    switch into oneshot mode when the backup broadcast device
    supports oneshot mode as well. Otherwise we would try to switch the
    broadcast device into an unsupported mode unconditionally. This went
    unnoticed so far as the currently available broadcast devices support
    oneshot mode. Seth unearthed this problem while debugging and working
    around an HPET-related BIOS wreckage.

    Add the necessary check to tick_is_oneshot_available().

    Reported-and-tested-by: Seth Forshee
    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org # .21 ->

    Thomas Gleixner
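
The decision described in the commit above can be sketched as a small predicate. This is a user-space illustration under stated assumptions: the feature bits `SK_FEAT_ONESHOT` and `SK_FEAT_C3STOP` and the function `oneshot_available` are hypothetical, and the broadcast device's capability is passed in as a plain flag rather than queried from a real device.

```c
#include <assert.h>

/* Hypothetical feature bits for illustration only. */
#define SK_FEAT_ONESHOT	(1u << 0)
#define SK_FEAT_C3STOP	(1u << 1)

/* Return 1 if switching to oneshot mode is safe, 0 otherwise. */
int oneshot_available(unsigned int cpu_dev_features,
		      int broadcast_supports_oneshot)
{
	/* The per cpu device itself must support oneshot mode. */
	if (!(cpu_dev_features & SK_FEAT_ONESHOT))
		return 0;
	/* A device that keeps running in deep C-states needs no backup. */
	if (!(cpu_dev_features & SK_FEAT_C3STOP))
		return 1;
	/* Otherwise the broadcast device must support oneshot mode too. */
	return broadcast_supports_oneshot != 0;
}

void example_usage(void)
{
	unsigned int c3stop_dev = SK_FEAT_ONESHOT | SK_FEAT_C3STOP;

	assert(oneshot_available(c3stop_dev, 0) == 0);
	assert(oneshot_available(c3stop_dev, 1) == 1);
	assert(oneshot_available(SK_FEAT_ONESHOT, 0) == 1);
}
```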
     

23 Feb, 2011

2 commits


19 Feb, 2011

3 commits

  • With CONFIG_SHIRQ_DEBUG=y we call a newly installed interrupt handler
    in request_threaded_irq().

    The original implementation (commit a304e1b8) called the handler
    _BEFORE_ it was installed, but that caused problems with handlers
    calling disable_irq_nosync(). See commit 377bf1e4.

    It's braindead in the first place to call disable_irq_nosync in shared
    handlers, but ....

    Moving this call after we installed the handler looks innocent, but it
    is very subtly broken on SMP.

    Interrupt handlers rely on the fact that the irq core prevents
    reentrancy.

    Now this debug call violates that promise because we run the handler
    w/o the IRQ_INPROGRESS protection - which we cannot apply here because
    that would result in a possibly forever masked interrupt line.

    A concurrent real hardware interrupt on a different CPU results in
    handler reentrancy and can lead to complete wreckage, which was
    unfortunately observed in reality and took a fricking long time to
    debug.

    Leave the code here for now. We want this debug feature, but that's
    not easy to fix. We really should get rid of those
    disable_irq_nosync() abusers and remove that function completely.

    Signed-off-by: Thomas Gleixner
    Cc: Anton Vorontsov
    Cc: David Woodhouse
    Cc: Arjan van de Ven
    Cc: stable@kernel.org # .28 -> .37

    Thomas Gleixner
     
  • Lars-Peter Clausen pointed out:

    I stumbled upon this while looking through the existing archs using
    SPARSE_IRQ. Even with SPARSE_IRQ the NR_IRQS is still the upper
    limit for the number of IRQs.

    Both PXA and MMP set NR_IRQS to IRQ_BOARD_START, with
    IRQ_BOARD_START being the number of IRQs used by the core.

    In various machine files the nr_irqs field of the ARM machine
    definition struct is then set to "IRQ_BOARD_START + NR_BOARD_IRQS".

    As a result "nr_irqs" will be greater than NR_IRQS, which then again
    causes the "allocated_irqs" bitmap in the core irq code to be
    accessed beyond its size, overwriting unrelated data.

    The core code really misses a sanity check there.

    This went unnoticed so far as by chance the compiler/linker places
    data behind that bitmap which gets initialized later on those affected
    platforms.

    So the obvious fix would be to add a sanity check in early_irq_init()
    and break all affected platforms. Though that check wants to be
    backported to stable as well, which will require fixing all known
    problematic platforms and probably some not yet known ones as
    well. Lots of churn.

    A way simpler solution is to allocate a slightly larger bitmap and
    avoid the whole churn w/o breaking anything. Add a few warnings when
    an arch returns utter crap.

    Reported-by: Lars-Peter Clausen
    Signed-off-by: Thomas Gleixner
    Cc: stable@kernel.org # .37
    Cc: Haojian Zhuang
    Cc: Eric Miao
    Cc: Peter Zijlstra

    Thomas Gleixner
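
The approach in the commit above (a bitmap with some slack plus a warning when an arch asks for too much) can be sketched as follows. The constants `SKETCH_NR_IRQS` and `SKETCH_BITMAP_SLACK` and the function `sanitize_nr_irqs` are hypothetical values and names chosen for illustration, not the ones used by the patch.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_NR_IRQS		64	/* hypothetical compile-time limit */
#define SKETCH_BITMAP_SLACK	32	/* hypothetical extra room */
#define SKETCH_BITMAP_BITS	(SKETCH_NR_IRQS + SKETCH_BITMAP_SLACK)

/* Clamp a requested IRQ count to what the bitmap can actually hold,
 * warning when the request had to be reduced. */
unsigned int sanitize_nr_irqs(unsigned int requested)
{
	if (requested > SKETCH_BITMAP_BITS) {
		fprintf(stderr,
			"warning: nr_irqs %u exceeds bitmap size %u, clamping\n",
			requested, (unsigned int)SKETCH_BITMAP_BITS);
		return SKETCH_BITMAP_BITS;
	}
	return requested;
}

void example_usage(void)
{
	/* Slightly above the compile-time limit still fits in the slack. */
	assert(sanitize_nr_irqs(SKETCH_NR_IRQS + 16) == 80);
	/* A wildly wrong value is clamped instead of overrunning the bitmap. */
	assert(sanitize_nr_irqs(200) == SKETCH_BITMAP_BITS);
}
```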
     
  • * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long
    workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable'
    workqueue: wake up a worker when a rescuer is leaving a gcwq

    Linus Torvalds
     

17 Feb, 2011

3 commits

  • Currently we return 0 in swsusp_alloc() when alloc_image_page() fails.
    Fix that. Also remove unneeded "error" variable since the only
    useful value of error is -ENOMEM.

    [rjw: Fixed up the changelog and changed subject.]

    Signed-off-by: Stanislaw Gruszka
    Cc: stable@kernel.org
    Signed-off-by: Rafael J. Wysocki

    Stanislaw Gruszka
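
The bug class fixed by the commit above is an error path that cleans up but still reports success. A minimal user-space sketch of the corrected pattern follows; `alloc_page_stub` and `alloc_pages_sketch` are hypothetical stand-ins and do not reproduce the real swsusp_alloc() logic.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page allocator: it fails once `budget`
 * allocations have been handed out. */
static void *alloc_page_stub(unsigned int *budget)
{
	if (*budget == 0)
		return NULL;
	(*budget)--;
	return malloc(64);
}

/* Allocate nr_pages pages; return 0 on success, -ENOMEM on failure. */
int alloc_pages_sketch(unsigned int nr_pages, unsigned int budget)
{
	void **pages = calloc(nr_pages ? nr_pages : 1, sizeof(*pages));
	unsigned int i;

	if (!pages)
		return -ENOMEM;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_page_stub(&budget);
		if (!pages[i])
			goto err_out;
	}

	/* The sketch releases everything; real code would keep the pages. */
	while (i > 0)
		free(pages[--i]);
	free(pages);
	return 0;

err_out:
	while (i > 0)
		free(pages[--i]);
	free(pages);
	return -ENOMEM;	/* the only meaningful error here, so no variable */
}

void example_usage(void)
{
	assert(alloc_pages_sketch(4, 4) == 0);
	assert(alloc_pages_sketch(4, 2) == -ENOMEM);
}
```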
     
  • MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on
    configuration may end up 0 or 1. Even when it's 1, depending on when
    the mayday timer is added in the current jiffy interval, it may expire
    way before a jiffy has passed.

    Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at
    least a full jiffy has passed before calling rescuers.

    Signed-off-by: Tejun Heo
    Reported-by: Ray Jui
    Cc: stable@kernel.org

    Tejun Heo
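
The arithmetic behind the commit above is easy to check. The sketch below takes HZ as a parameter so several configurations can be compared; in the kernel HZ is a compile-time constant, and `mayday_initial_timeout` is a hypothetical helper name.

```c
#include <assert.h>

/* Compute HZ / 100 with a lower bound of 2 jiffies. */
unsigned long mayday_initial_timeout(unsigned long hz)
{
	unsigned long t = hz / 100;	/* integer division: 0 when hz < 100 */

	return t >= 2 ? t : 2;
}

void example_usage(void)
{
	assert(mayday_initial_timeout(50) == 2);	/* HZ / 100 == 0 */
	assert(mayday_initial_timeout(100) == 2);	/* HZ / 100 == 1 */
	assert(mayday_initial_timeout(1000) == 10);	/* unchanged */
}
```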
     
  • There are two spellings in use for 'freeze' + 'able' - 'freezable' and
    'freezeable'. The former is the more prominent one. The latter is
    mostly used by workqueue and in a few other odd places. Unify the
    spelling to 'freezable'.

    Signed-off-by: Tejun Heo
    Reported-by: Alan Stern
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Dmitry Torokhov
    Cc: David Woodhouse
    Cc: Alex Dubov
    Cc: "David S. Miller"
    Cc: Steven Whitehouse

    Tejun Heo
     

16 Feb, 2011

3 commits

    It was possible to call pmu::start() on an already running event. In
    particular this led to some wreckage as the hrtimer events would
    re-initialize active timers.

    This was due to throttled events being activated again by scheduling.
    Scheduling in a context would add and force start events, resulting in
    running events with a possible throttle status. The next tick to hit
    that task will then try to unthrottle the event and call ->start() on
    an already running event.

    Reported-by: Jeff Moyer
    Cc:
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    Revert "lockdep, timer: Fix del_timer_sync() annotation"

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timer debug: Hide kernel addresses via %pK in /proc/timer_list

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Fix text_poke_smp_batch() deadlock
    perf tools: Fix thread_map event synthesizing in top and record
    watchdog, nmi: Lower the severity of error messages
    ARM: oprofile: Fix backtraces in timer mode
    oprofile: Fix usage of CONFIG_HW_PERF_EVENTS for oprofile_perf_init and friends

    Linus Torvalds
     

14 Feb, 2011

1 commit

  • After executing the matching works, a rescuer leaves the gcwq whether
    there are more pending works or not. This may decrease the
    concurrency level to zero and stall execution until a new work item is
    queued on the gcwq.

    Make rescuer wake up a regular worker when it leaves a gcwq if there
    are more works to execute, so that execution isn't stalled.

    Signed-off-by: Tejun Heo
    Reported-by: Ray Jui
    Cc: stable@kernel.org

    Tejun Heo
     

12 Feb, 2011

3 commits

  • In the continuing effort to avoid kernel addresses leaking to
    unprivileged users, this patch switches to %pK for
    /proc/timer_list reporting.

    Signed-off-by: Kees Cook
    Cc: John Stultz
    Cc: Dan Rosenberg
    Cc: Eugene Teo
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Kees Cook
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    pci: use security_capable() when checking capablities during config space read
    security: add cred argument to security_capable()
    tpm_tis: Use timeouts returned from TPM

    Linus Torvalds
     
  • The wake_up_process() call in ptrace_detach() is spurious and not
    interlocked with the tracee state. IOW, the tracee could be running or
    sleeping in any place in the kernel by the time wake_up_process() is
    called. This can lead to the tracee waking up unexpectedly which can be
    dangerous.

    The wake_up is spurious and should be removed but for now reduce its
    toxicity by only waking up if the tracee is in TRACED or STOPPED state.

    This bug can possibly be used as an attack vector. I don't think it
    will take too much effort to come up with an attack which triggers oops
    somewhere. Most sleeps are wrapped in condition test loops and should
    be safe but we have quite a number of places where sleep and wakeup
    conditions are expected to be interlocked. Although the window of
    opportunity is tiny, ptrace can be used by non-privileged users and with
    some loading the window can definitely be extended and exploited.

    Signed-off-by: Tejun Heo
    Acked-by: Roland McGrath
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
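
The mitigation in the commit above amounts to a state check before the wakeup. A minimal sketch, assuming hypothetical state bits (`SK_*`) and a hypothetical predicate `should_wake_on_detach`; the real task state values and wakeup helpers are not reproduced here.

```c
#include <assert.h>

/* Hypothetical task state bits for illustration only. */
#define SK_RUNNING		0x0u
#define SK_INTERRUPTIBLE	0x1u
#define SK_STOPPED		0x4u
#define SK_TRACED		0x8u

/* Wake the tracee only if it is in TRACED or STOPPED state. */
int should_wake_on_detach(unsigned int state)
{
	return (state & (SK_STOPPED | SK_TRACED)) != 0;
}

void example_usage(void)
{
	assert(should_wake_on_detach(SK_TRACED) == 1);
	assert(should_wake_on_detach(SK_STOPPED) == 1);
	/* A tracee sleeping elsewhere in the kernel is left alone. */
	assert(should_wake_on_detach(SK_INTERRUPTIBLE) == 0);
	assert(should_wake_on_detach(SK_RUNNING) == 0);
}
```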
     

11 Feb, 2011

2 commits

  • Expand security_capable() to include cred, so that it can be usable in a
    wider range of call sites.

    Signed-off-by: Chris Wright
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Chris Wright
     
  • In commit ce6ada35bdf7 ("security: Define CAP_SYSLOG") Serge Hallyn
    introduced CAP_SYSLOG, but broke backwards compatibility by no longer
    accepting CAP_SYS_ADMIN as an override (it would cause a warning and
    then reject the operation).

    Re-instate CAP_SYS_ADMIN - but keeping the warning - as an acceptable
    capability until any legacy applications have been updated. There are
    apparently applications out there that drop all capabilities except for
    CAP_SYS_ADMIN in order to access the syslog.

    (This is a re-implementation of a patch by Serge, cleaning the logic up
    and making the code more readable)

    Acked-by: Serge Hallyn
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Linus Torvalds
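
The policy described in the commit above (accept CAP_SYSLOG, accept CAP_SYS_ADMIN with a warning, reject everything else) can be sketched as below. The capability bits `SK_CAP_*` and the function `check_syslog_permission` are hypothetical; the warning is modelled as an output flag rather than a kernel log message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical capability bits for illustration only. */
#define SK_CAP_SYS_ADMIN	(1u << 0)
#define SK_CAP_SYSLOG		(1u << 1)

/* Return 0 if access is allowed, -EPERM otherwise. *warned is set to 1
 * when access was granted only through the legacy capability. */
int check_syslog_permission(unsigned int caps, int *warned)
{
	*warned = 0;

	if (caps & SK_CAP_SYSLOG)
		return 0;
	if (caps & SK_CAP_SYS_ADMIN) {
		*warned = 1;	/* legacy path: allowed, but flagged */
		return 0;
	}
	return -EPERM;
}

void example_usage(void)
{
	int warned;

	assert(check_syslog_permission(SK_CAP_SYSLOG, &warned) == 0);
	assert(warned == 0);
	assert(check_syslog_permission(SK_CAP_SYS_ADMIN, &warned) == 0);
	assert(warned == 1);
	assert(check_syslog_permission(0, &warned) == -EPERM);
}
```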
     

10 Feb, 2011

2 commits

    During boot, if the hardlockup detector fails to initialize, it
    complains very loudly. Some failures should be expected under
    certain situations, i.e. no lapics or a resource in use. Tone
    those error messages down a bit. Keep the rest at a high level.

    Reported-by: Paul Bolle
    Tested-by: Paul Bolle
    Signed-off-by: Don Zickus
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cdrom: support devices that have check_events but not media_changed
    cfq-iosched: Don't wait if queue already has requests.
    blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
    cfq: rename a function to give it more appropriate name
    cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
    drivers/block/aoe/Makefile: replace the use of -objs with -y
    loop: queue_lock NULL pointer derefence in blk_throtl_exit
    drivers/block/Makefile: replace the use of -objs with -y
    blktrace: Don't output messages if NOTIFY isn't set.

    Linus Torvalds
     

08 Feb, 2011

3 commits

  • Both attempts at trying to allow softirq usage for
    del_timer_sync() failed (produced bogus warnings),
    so revert the commit for this release:

    f266a5110d45: lockdep, timer: Fix del_timer_sync() annotation

    and try again later.

    Reported-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Yong Zhang
    Cc: Thomas Gleixner
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In prepare_kernel_cred() since 2.6.29, put_cred(new) is called without
    assigning new->usage when security_prepare_creds() returned an error. As a
    result, memory for new and refcount for new->{user,group_info,tgcred} are
    leaked because put_cred(new) won't call __put_cred() unless old->usage == 1.

    Fix these leaks by assigning new->usage (and new->subscribers which was added
    in 2.6.32) before calling security_prepare_creds().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • In cred_alloc_blank() since 2.6.32, abort_creds(new) is called with
    new->security == NULL and new->magic == 0 when security_cred_alloc_blank()
    returns an error. As a result, BUG() will be triggered if SELinux is enabled
    or CONFIG_DEBUG_CREDENTIALS=y.

    If CONFIG_DEBUG_CREDENTIALS=y, BUG() is called from __invalid_creds() because
    cred->magic == 0. Failing that, BUG() is called from selinux_cred_free()
    because selinux_cred_free() is not expecting cred->security == NULL. This does
    not affect smack_cred_free(), tomoyo_cred_free() or apparmor_cred_free().

    Fix these bugs by

    (1) Set new->magic before calling security_cred_alloc_blank().

    (2) Handle null cred->security in creds_are_invalid() and selinux_cred_free().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Tetsuo Handa