13 Jul, 2015

2 commits

  • Pull timer fixes from Thomas Gleixner:
    "This update from the timer department contains:

    - A series of patches which address a shortcoming in the tick
    broadcast code.

    If the broadcast device is not available or an hrtimer emulated
    broadcast device, some of the original assumptions lead to boot
    failures. I rather plugged all of the corner cases instead of only
    addressing the issue reported, so the change got a little larger.

    Has been extensively tested on x86 and arm.

    - Get rid of the last holdouts using do_posix_clock_monotonic_gettime()

    - A regression fix for the imx clocksource driver

    - An update to the new state callbacks mechanism for clockevents.
    This is required to simplify the conversion, which will take place
    in 4.3"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick/broadcast: Prevent NULL pointer dereference
    time: Get rid of do_posix_clock_monotonic_gettime
    cris: Replace do_posix_clock_monotonic_gettime()
    tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
    tick/broadcast: Handle spurious interrupts gracefully
    tick/broadcast: Check for hrtimer broadcast active early
    tick/broadcast: Return busy when IPI is pending
    tick/broadcast: Return busy if periodic mode and hrtimer broadcast
    tick/broadcast: Move the check for periodic mode inside state handling
    tick/broadcast: Prevent deep idle if no broadcast device available
    tick/broadcast: Make idle check independent from mode and config
    tick/broadcast: Sanity check the shutdown of the local clock_event
    tick/broadcast: Prevent hrtimer recursion
    clockevents: Allow set-state callbacks to be optional
    clocksource/imx: Define clocksource for mx27

    Linus Torvalds
     
  • Pull irq fix from Thomas Gleixner:
    "A single fix for a cpu hotplug race vs. interrupt descriptors:

    Prevent irq setup/teardown across the cpu starting/dying parts of cpu
    hotplug so that the starting/dying cpu has a stable view of the
    descriptor space. This has been an issue for all architectures in the
    cpu dying phase, where interrupts are migrated away from the dying
    cpu. In the starting phase it's mostly an x86 issue vs the vector space
    update"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hotplug: Prevent alloc/free of irq descriptors during cpu up/down

    Linus Torvalds
     

11 Jul, 2015

1 commit


09 Jul, 2015

2 commits

  • The load_module() error path frees a module but forgot to take it out
    of the mod_tree, leaving a dangling entry in the tree, causing havoc.

    Cc: Mathieu Desnoyers
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Fixes: 93c2e105f6bc ("module: Optimize __module_address() using a latched RB-tree")
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Rusty Russell

    Peter Zijlstra
     
  • The "fix" in commit 0b08c5e5944 ("audit: Fix check of return value of
    strnlen_user()") didn't fix anything, it broke things. As reported by
    Steven Rostedt:

    "Yes, strnlen_user() returns 0 on fault, but if you look at what len is
    set to, then you would notice that on fault len would be -1"

    because we just subtracted one from the return value. So testing
    against 0 doesn't test for a fault condition, it tests against a
    perfectly valid empty string.

    Also fix up the usual braindamage wrt using WARN_ON() inside a
    conditional - make it part of the conditional and remove the explicit
    unlikely() (which is already part of the WARN_ON*() logic, exactly so
    that you don't have to write unreadable code).
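
    A minimal user-space sketch of the two points above, not the actual
    audit code. The helper names (sketch_strnlen_user, sketch_warn_on,
    sketch_check_arg) are hypothetical; a NULL pointer stands in for a
    faulting user address, and the warn helper only imitates the idea that
    a WARN_ON-style check can itself be the condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for strnlen_user(): returns the string length
 * INCLUDING the terminating NUL, or 0 if the access faults.
 * A NULL pointer models the fault in this sketch.
 */
static long sketch_strnlen_user(const char *s, long max)
{
	long n = 0;

	if (s == NULL)
		return 0;
	while (n < max && s[n] != '\0')
		n++;
	return n + 1;
}

/* Imitates a WARN_ON-style helper: logs and returns the condition. */
static int sketch_warn_on(int cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/* Returns 0 for a readable string (even an empty one), -1 on fault. */
static int sketch_check_arg(const char *s, long max)
{
	long len = sketch_strnlen_user(s, max) - 1;

	/* After the subtraction a fault is -1; 0 is a valid empty string. */
	if (sketch_warn_on(len < 0, "fault while reading argument"))
		return -1;
	return 0;
}
```

    With this shape, an empty string passes the check and only the fault
    case takes the warning path.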

    Reported-and-tested-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Paul Moore
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jul, 2015

10 commits

  • When a cpu goes up some architectures (e.g. x86) have to walk the irq
    space to set up the vector space for the cpu. While this needs extra
    protection at the architecture level we can avoid a few race
    conditions by preventing the concurrent allocation/free of irq
    descriptors and the associated data.

    When a cpu goes down it moves the interrupts which are targeted to
    this cpu away by reassigning the affinities. While this happens
    interrupts can be allocated and freed, which opens a can of race
    conditions in the code which reassigns the affinities because
    interrupt descriptors might be freed underneath.

    Example:

    CPU1                            CPU2
    cpu_up/down
     irq_desc = irq_to_desc(irq);
                                    remove_from_radix_tree(desc);
     raw_spin_lock(&desc->lock);
                                    free(desc);

    We could protect the irq descriptors with RCU, but that would require
    a full tree change of all accesses to interrupt descriptors. But
    fortunately these kinds of race conditions are rather limited to a few
    things like cpu hotplug. The normal setup/teardown is very well
    serialized. So the simpler and obvious solution is:

    Prevent allocation and freeing of interrupt descriptors across cpu
    hotplug.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: xiao jin
    Cc: Joerg Roedel
    Cc: Borislav Petkov
    Cc: Yanmin Zhang
    Link: http://lkml.kernel.org/r/20150705171102.063519515@linutronix.de

    Thomas Gleixner
     
  • Andriy reported that on a virtual machine the warning about negative
    expiry time in the clock events programming code triggered:

    hpet: hpet0 irq 40 for MSI
    hpet: hpet1 irq 41 for MSI
    Switching to clocksource hpet
    WARNING: at kernel/time/clockevents.c:239

    clockevents_program_event+0xdb/0xf0
    tick_handle_periodic_broadcast+0x41/0x50
    timer_interrupt+0x15/0x20

    When the second hpet is installed as a per cpu timer the broadcast
    event is no longer required and stopped, which sets the next_evt of
    the broadcast device to KTIME_MAX.

    If after that a spurious interrupt happens on the broadcast device,
    then the current code blindly handles it and tries to reprogram the
    broadcast device afterwards, which adds the period to
    next_evt. KTIME_MAX + period results in a negative expiry value
    causing the WARN_ON in the clockevents code to trigger.

    Add a proper check for the state of the broadcast device into the
    interrupt handler and return if the interrupt is spurious.

    [ Folded in pointer fix from Sudeep ]

    Reported-by: Andriy Gapon
    Signed-off-by: Thomas Gleixner
    Cc: Sudeep Holla
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Link: http://lkml.kernel.org/r/20150705205221.802094647@linutronix.de

    Thomas Gleixner
     
  • If the current cpu is the one which has the hrtimer based broadcast
    queued then we better return busy immediately instead of going through
    loops and hoops to figure that out.

    [ Split out from a larger combo patch ]

    Tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • Tell the idle code not to go deep if the broadcast IPI is about to
    arrive.

    [ Split out from a larger combo patch ]

    Tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • If the system is in periodic mode and the broadcast device is hrtimer
    based, return busy as we have no proper handling for this.

    [ Split out from a larger combo patch ]

    Tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • We need to check more than the periodic mode for proper operation in
    all runtime combinations. To avoid code duplication move the check
    into the enter state handling.

    No functional change.

    [ Split out from a larger combo patch ]

    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • Add a check for an installed broadcast device to the oneshot control
    function and return busy if not.

    [ Split out from a larger combo patch ]

    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • Currently the broadcast busy check, which prevents the idle code from
    going into deep idle, works only in one shot mode.

    If NOHZ and HIGHRES are off (config or command line) there is no
    sanity check at all, so under certain conditions cpus are allowed to
    go into deep idle, where the local timer stops, and are not woken up
    again because there is no broadcast timer installed or a hrtimer based
    broadcast device is not evaluated.

    Move tick_broadcast_oneshot_control() into the common code and provide
    proper subfunctions for the various config combinations.

    The common check in tick_broadcast_oneshot_control() is for the C3STOP
    misfeature flag of the local clock event device. If it's not set, idle
    can proceed. If set, further checks are necessary.

    Provide checks for the trivial cases:

    - If broadcast is disabled in the config, then return busy

    - If oneshot mode (NOHZ/HIGHRES) is disabled in the config, return
    busy if the broadcast device is hrtimer based.

    - If oneshot mode is enabled in the config call the original
    tick_broadcast_oneshot_control() function. That function needs
    extra checks which will be implemented in separate patches.

    [ Split out from a larger combo patch ]

    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • The broadcast code shuts down the local clock event unconditionally
    even if no broadcast device is installed or if the broadcast device is
    hrtimer based.

    Add proper sanity checks.

    [ Split out from a larger combo patch ]

    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     
  • The hrtimer based broadcast vehicle can cause a hrtimer recursion
    which went unnoticed until we changed the hrtimer expiry code to keep
    track of the currently running timer.

    local_timer_interrupt()
      local_handler()
        hrtimer_interrupt()
          expire_hrtimers()
            broadcast_hrtimer()
              send_ipis()
              local_handler()
                hrtimer_interrupt()
                  ....

    Solution is simple: Prevent the local handler call from the broadcast
    code when the broadcast 'device' is hrtimer based.

    [ Split out from a larger combo patch ]

    Tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner
     

07 Jul, 2015

1 commit

  • It's mandatory for the drivers to provide set_state_{oneshot|periodic}()
    (only if related modes are supported) and set_state_shutdown() callbacks
    today, if they are implementing the new set-state interface.

    But this leads to unnecessary noop callbacks for drivers which don't
    want to implement them. On top of that, it will lead to a full function call
    for nothing really useful.

    Let's make all set-state callbacks optional.
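
    A rough sketch of what optional callbacks mean for the caller,
    assuming a simplified device structure. The sketch_* types and
    functions are hypothetical and only mirror the callback names used
    above; the real core code performs additional checks.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_ce_state {
	SKETCH_CE_SHUTDOWN,
	SKETCH_CE_PERIODIC,
	SKETCH_CE_ONESHOT,
};

struct sketch_clockevent {
	int (*set_state_periodic)(struct sketch_clockevent *dev);
	int (*set_state_oneshot)(struct sketch_clockevent *dev);
	int (*set_state_shutdown)(struct sketch_clockevent *dev);
	int shutdown_calls;	/* bookkeeping for the example driver */
};

/* Example driver callback: the only one this driver cares about. */
static int sketch_driver_shutdown(struct sketch_clockevent *dev)
{
	dev->shutdown_calls++;
	return 0;
}

/* Invoke the callback for the requested state only if it is provided. */
static int sketch_switch_state(struct sketch_clockevent *dev,
			       enum sketch_ce_state state)
{
	int (*cb)(struct sketch_clockevent *) = NULL;

	assert(dev != NULL);

	switch (state) {
	case SKETCH_CE_SHUTDOWN:
		cb = dev->set_state_shutdown;
		break;
	case SKETCH_CE_PERIODIC:
		cb = dev->set_state_periodic;
		break;
	case SKETCH_CE_ONESHOT:
		cb = dev->set_state_oneshot;
		break;
	}

	/* A missing callback is a no-op instead of a mandatory stub. */
	return cb ? cb(dev) : 0;
}
```

    A driver that fills in only the shutdown callback no longer needs
    empty stubs for the other states.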

    Suggested-by: Daniel Lezcano
    Signed-off-by: Viresh Kumar
    Signed-off-by: Daniel Lezcano
    Link: http://lkml.kernel.org/r/1436256875-15562-1-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

06 Jul, 2015

1 commit

  • It's currently possible to drop the last refcount to the aux buffer
    from NMI context, which results in the expected fireworks.

    The refcounting needs a bigger overhaul, but to cure the immediate
    problem, delay the freeing by using an irq_work.

    Reviewed-and-tested-by: Alexander Shishkin
    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Jul, 2015

2 commits

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     
  • Pull kvm fixes from Paolo Bonzini:
    "Except for the preempt notifiers fix, these are all small bugfixes
    that could have waited for -rc2. Sending them now since I was
    taking care of Peter's patch anyway"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm: add hyper-v crash msrs values
    KVM: x86: remove data variable from kvm_get_msr_common
    KVM: s390: virtio-ccw: don't overwrite config space values
    KVM: x86: keep track of LVT0 changes under APICv
    KVM: x86: properly restore LVT0
    KVM: x86: make vapics_in_nmi_mode atomic
    sched, preempt_notifier: separate notifier registration from static_key inc/dec

    Linus Torvalds
     

04 Jul, 2015

7 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Debug info and other statistics fixes and related enhancements"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix numa balancing stats in /proc/pid/sched
    sched/numa: Show numa_group ID in /proc/sched_debug task listings
    sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
    sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
    sched/stat: Simplify the sched_info accounting dependency

    Linus Torvalds
     
  • Commit 44dba3d5d6a1 ("sched: Refactor task_struct to use
    numa_faults instead of numa_* pointers") modified the way
    tsk->numa_faults stats are accounted.

    However that commit never touched show_numa_stats() that is displayed
    in /proc/pid/sched and thus the numbers displayed in /proc/pid/sched
    don't match the actual numbers.

    Fix it by making sure that /proc/pid/sched reflects the task
    fault numbers. Also add group fault stats too.

    A couple more modifications are also added here:

    1. Format changes:

    - Previously we would list two entries per node, one for private
    and one for shared. Also the home node info was listed in each entry.

    - Now preferred node, total_faults and current node are
    displayed separately.

    - Now there is one entry per node, that lists private,shared task and
    group faults.

    2. Unit changes:

    - p->numa_pages_migrated was getting reset after every read of
    /proc/pid/sched. It's more useful to have absolute numbers since
    differential migrations between two accesses can be more easily
    calculated.

    Signed-off-by: Srikar Dronamraju
    Acked-by: Rik van Riel
    Cc: Iulia Manda
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Having the numa group ID in /proc/sched_debug helps to see how
    the numa groups have spread across the system.

    Signed-off-by: Srikar Dronamraju
    Acked-by: Rik van Riel
    Cc: Iulia Manda
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435252903-1081-3-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Currently print_cfs_rq() is declared in include/linux/sched.h.
    However it's not used outside kernel/sched. Hence move the
    declaration to kernel/sched/sched.h

    Also some functions are only available for CONFIG_SCHED_DEBUG=y.
    Hence move the declarations to within the #ifdef.

    Signed-off-by: Srikar Dronamraju
    Acked-by: Rik van Riel
    Cc: Iulia Manda
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435252903-1081-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
    sched_info, which results in ugly #if clauses.

    Simplify the code by introducing a synthetic CONFIG_SCHED_INFO
    switch, selected by both.

    Signed-off-by: Naveen N. Rao
    Cc: Balbir Singh
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: a.p.zijlstra@chello.nl
    Cc: ricklind@us.ibm.com
    Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Naveen N. Rao
     
  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     
  • Commit 1cde2930e154 ("sched/preempt: Add static_key() to preempt_notifiers")
    had two problems. First, the preempt-notifier API needs to sleep with the
    addition of the static_key; we do, however, need to hold off preemption
    while modifying the preempt notifier list, otherwise a preemption could
    observe an inconsistent list state. KVM correctly registers and
    unregisters preempt notifiers with preemption disabled, so the sleep
    caused dmesg splats.

    Second, KVM registers and unregisters preemption notifiers very often
    (in vcpu_load/vcpu_put). With a single uniprocessor guest the static key
    would move between 0 and 1 continuously, hitting the slow path on every
    userspace exit.

    To fix this, wrap the static_key inc/dec in a new API, and call it from
    KVM.
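
    The effect of separating the two steps can be modelled with plain
    counters. This is a toy model, not kernel code: the sketch_* names
    are hypothetical, an integer stands in for the static key, and a
    0 <-> 1 flip of that integer represents the expensive slow path.

```c
#include <assert.h>

static int sketch_key_count;		/* stands in for the static key refcount */
static int sketch_slow_transitions;	/* number of 0 <-> 1 flips of the key */
static int sketch_registered;		/* notifiers currently on the list */

/* May sleep in the real world; called rarely (e.g. once per VM). */
static void sketch_notifier_inc(void)
{
	if (sketch_key_count++ == 0)
		sketch_slow_transitions++;
}

static void sketch_notifier_dec(void)
{
	assert(sketch_key_count > 0);
	if (--sketch_key_count == 0)
		sketch_slow_transitions++;
}

/* Cheap list manipulation only; safe with preemption disabled. */
static void sketch_notifier_register(void)
{
	assert(sketch_key_count > 0);	/* key must already be enabled */
	sketch_registered++;
}

static void sketch_notifier_unregister(void)
{
	assert(sketch_registered > 0);
	sketch_registered--;
}

/* One VM lifetime with many vcpu load/put cycles; returns key flips. */
static int sketch_run_vm(int load_put_cycles)
{
	int i;

	sketch_slow_transitions = 0;
	sketch_notifier_inc();
	for (i = 0; i < load_put_cycles; i++) {
		sketch_notifier_register();	/* vcpu load */
		sketch_notifier_unregister();	/* vcpu put */
	}
	sketch_notifier_dec();
	return sketch_slow_transitions;
}
```

    In this model the key flips twice per VM lifetime regardless of how
    often notifiers are registered and unregistered in between.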

    Fixes: 1cde2930e154 ("sched/preempt: Add static_key() to preempt_notifiers")
    Reported-by: Pontus Fuchs
    Reported-by: Takashi Iwai
    Tested-by: Takashi Iwai
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini

    Peter Zijlstra
     

03 Jul, 2015

1 commit


02 Jul, 2015

5 commits

  • Merge third patchbomb from Andrew Morton:

    - the rest of MM

    - scripts/gdb updates

    - ipc/ updates

    - lib/ updates

    - MAINTAINERS updates

    - various other misc things

    * emailed patches from Andrew Morton: (67 commits)
    genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
    genalloc: rename dev_get_gen_pool() to gen_pool_get()
    x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
    MAINTAINERS: add zpool
    MAINTAINERS: BCACHE: Kent Overstreet has changed email address
    MAINTAINERS: move Jens Osterkamp to CREDITS
    MAINTAINERS: remove unused nbd.h pattern
    MAINTAINERS: update brcm gpio filename pattern
    MAINTAINERS: update brcm dts pattern
    MAINTAINERS: update sound soc intel patterns
    MAINTAINERS: remove website for paride
    MAINTAINERS: update Emulex ocrdma email addresses
    bcache: use kvfree() in various places
    libcxgbi: use kvfree() in cxgbi_free_big_mem()
    target: use kvfree() in session alloc and free
    IB/ehca: use kvfree() in ipz_queue_{cd}tor()
    drm/nouveau/gem: use kvfree() in u_free()
    drm: use kvfree() in drm_free_large()
    cxgb4: use kvfree() in t4_free_mem()
    cxgb3: use kvfree() in cxgb_free_mem()
    ...

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "This contains:

    - a build regression fix introduced by the timeconst move

    - a hotplug regression fix introduced by the timer wheel diet

    - a cpu hotplug bug fix for the exynos clocksource driver"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Remove development rules from Kbuild/Makefile
    timer: Fix hotplug regression
    clocksource: exynos_mct: Avoid blocking calls in the cpu hotplug notifier

    Linus Torvalds
     
  • Pull power management and ACPI fixes from Rafael Wysocki:
    "These are fixes that didn't make it to the previous PM+ACPI pull
    request or are fixing issues introduced by it.

    Specifics:

    - Fix a recently added memory leak in an error path in the ACPI
    resources management code (Dan Carpenter)

    - Fix a build warning triggered by an ACPI video header function that
    should be static inline (Borislav Petkov)

    - Change names of helper function converting struct fwnode_handle
    pointers to either struct device_node or struct acpi_device
    pointers so they don't conflict with local variable names
    (Alexander Sverdlin)

    - Make the hibernate core re-enable nonboot CPUs on failures to
    disable them as expected (Vitaly Kuznetsov)

    - Increase the default timeout of the device suspend watchdog to
    prevent it from triggering too early on some systems (Takashi Iwai)

    - Prevent the cpuidle powernv driver from registering idle states
    with CPUIDLE_FLAG_TIMER_STOP set if CONFIG_TICK_ONESHOT is unset
    which leads to boot hangs (Preeti U Murthy)"

    * tag 'pm+acpi-4.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode
    PM / sleep: Increase default DPM watchdog timeout to 60
    PM / hibernate: re-enable nonboot cpus on disable_nonboot_cpus() failure
    ACPI / OF: Rename of_node() and acpi_node() to to_of_node() and to_acpi_node()
    ACPI / video: Inline acpi_video_set_dmi_backlight_type
    ACPI / resources: free memory on error in add_region_before()

    Linus Torvalds
     
  • Pull xen updates from David Vrabel:
    "Xen features and cleanups for 4.2-rc0:

    - add "make xenconfig" to assist in generating configs for Xen guests

    - preparatory cleanups necessary for supporting 64 KiB pages in ARM
    guests

    - automatically use hvc0 as the default console in ARM guests"

    * tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    block/xen-blkback: s/nr_pages/nr_segs/
    block/xen-blkfront: Remove invalid comment
    block/xen-blkfront: Remove unused macro MAXIMUM_OUTSTANDING_BLOCK_REQS
    arm/xen: Drop duplicate define mfn_to_virt
    xen/grant-table: Remove unused macro SPP
    xen/xenbus: client: Fix call of virt_to_mfn in xenbus_grant_ring
    xen: Include xen/page.h rather than asm/xen/page.h
    kconfig: add xenconfig defconfig helper
    kconfig: clarify kvmconfig is for kvm
    xen/pcifront: Remove usage of struct timeval
    xen/tmem: use BUILD_BUG_ON() in favor of BUG_ON()
    hvc_xen: avoid uninitialized variable warning
    xenbus: avoid uninitialized variable warning
    xen/arm: allow console=hvc0 to be omitted for guests
    arm,arm64/xen: move Xen initialization earlier
    arm/xen: Correctly check if the event channel interrupt is present

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "Main excitement here is Peter Zijlstra's lockless rbtree optimization
    to speed module address lookup. He found some abusers of the module
    lock doing that too.

    A little bit of parameter work here too; including Dan Streetman's
    breaking up the big param mutex so writing a parameter can load
    another module (yeah, really). Unfortunately that broke the usual
    suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
    appended too"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
    modules: only use mod->param_lock if CONFIG_MODULES
    param: fix module param locks when !CONFIG_SYSFS.
    rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
    module: add per-module param_lock
    module: make perm const
    params: suppress unused variable error, warn once just in case code changes.
    modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
    kernel/module.c: avoid ifdefs for sig_enforce declaration
    kernel/workqueue.c: remove ifdefs over wq_power_efficient
    kernel/params.c: export param_ops_bool_enable_only
    kernel/params.c: generalize bool_enable_only
    kernel/module.c: use generic module param operaters for sig_enforce
    kernel/params: constify struct kernel_param_ops uses
    sysfs: tightened sysfs permission checks
    module: Rework module_addr_{min,max}
    module: Use __module_address() for module_address_lookup()
    module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
    module: Optimize __module_address() using a latched RB-tree
    rbtree: Implement generic latch_tree
    seqlock: Introduce raw_read_seqcount_latch()
    ...

    Linus Torvalds
     

01 Jul, 2015

8 commits

  • This allows for better documentation in the code and
    it allows for a simpler and fully correct version of
    fs_fully_visible to be written.

    The mount points converted and their filesystems are:
    /sys/hypervisor/s390/ s390_hypfs
    /sys/kernel/config/ configfs
    /sys/kernel/debug/ debugfs
    /sys/firmware/efi/efivars/ efivarfs
    /sys/fs/fuse/connections/ fusectl
    /sys/fs/pstore/ pstore
    /sys/kernel/tracing/ tracefs
    /sys/fs/cgroup/ cgroup
    /sys/kernel/security/ securityfs
    /sys/fs/selinux/ selinuxfs
    /sys/fs/smackfs/ smackfs

    Cc: stable@vger.kernel.org
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Add a magic sysctl table sysctl_mount_point that when used to
    create a directory forces that directory to be permanently empty.

    Update the code to use make_empty_dir_inode when accessing permanently
    empty directories.

    Update the code to not allow adding to permanently empty directories.

    Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • time.o gets rebuilt unconditionally due to a leftover Makefile rule
    which was placed there for development purposes.

    Remove it along with the commented-out "always" rule in the toplevel
    Kbuild file.

    Fixes: 0a227985d4a9 'time: Move timeconst.h into include/generated'
    Reported-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Cc: Nicholas Mc Guire

    Thomas Gleixner
     
  • Use kvfree() instead of open-coding it.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
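
To illustrate what "open-coding" kvfree() means in the entry above, here is a minimal userspace sketch, not the kernel implementation. It assumes the helper dispatches on whether the address lies in the vmalloc range. All `model_*` names are hypothetical stand-ins, and a static arena plays the role of the vmalloc range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model only: a static arena stands in for the vmalloc range. */
static unsigned char model_vmalloc_arena[4096];
static int model_kfree_calls;
static int model_vfree_calls;

/* Stand-in for is_vmalloc_addr(): true if p lies inside the arena. */
bool model_is_vmalloc_addr(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t start = (uintptr_t)model_vmalloc_arena;

    return addr >= start && addr < start + sizeof(model_vmalloc_arena);
}

/* Stand-in for kfree(): releases heap memory and counts the call. */
void model_kfree(void *p)
{
    model_kfree_calls++;
    free(p);
}

/* Stand-in for vfree(): the arena is static, so only count the call. */
void model_vfree(void *p)
{
    assert(model_is_vmalloc_addr(p));
    model_vfree_calls++;
}

/*
 * The pattern callers used to open-code at every call site:
 * pick the matching free routine based on where the pointer lives.
 */
void model_kvfree(void *p)
{
    if (model_is_vmalloc_addr(p))
        model_vfree(p);
    else
        model_kfree(p);
}
```

With a helper like this, a call site that previously carried its own if/else collapses to a single call such as `model_kvfree(buf)`.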
     
  • The comment about /dev/kmsg does not mention the additional values which
    may actually be exported; fix that.

    Also move up the part of the comment instructing users to ignore these
    additional values, so that the comment reads more fluently and is
    logically compact.

    Signed-off-by: Antonio Ospite
    Cc: Joe Perches
    Cc: Jonathan Corbet
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Antonio Ospite
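
As a sketch of what "ignore these additional values" can look like for a reader of /dev/kmsg, the parser below assumes the record layout is a comma-separated prefix (priority, sequence number, timestamp in microseconds, flags, then possibly more fields) terminated by ';' and followed by the message text. The struct and function names are hypothetical, and the layout is stated here as an assumption rather than quoted from the commit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for this sketch. */
struct kmsg_record {
    unsigned int prio;            /* syslog priority, including facility */
    unsigned long long seq;       /* sequence number */
    unsigned long long ts_usec;   /* timestamp in microseconds */
    const char *text;             /* points into the input line */
};

/*
 * Parse one record line. Only the first three prefix fields are read;
 * everything between them and the ';' separator (flags and any
 * additional values) is skipped on purpose.
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_kmsg_record(const char *line, struct kmsg_record *out)
{
    const char *sep;

    assert(line != NULL && out != NULL);

    sep = strchr(line, ';');
    if (sep == NULL)
        return -1;

    if (sscanf(line, "%u,%llu,%llu",
               &out->prio, &out->seq, &out->ts_usec) != 3)
        return -1;

    out->text = sep + 1;
    return 0;
}
```

Because the parser stops interpreting the prefix after the third field, a record that carries extra comma-separated values before the ';' parses the same way as one that does not.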
     
  • Fix kernel gcov support for GCC 5.1. Similar to commit a992bf836f9
    ("gcov: add support for GCC 4.9"), this patch takes into account the
    existence of a new gcov counter (see gcc's gcc/gcov-counter.def.)

    Firstly, it increments GCOV_COUNTERS (to 10), which makes the data
    structure struct gcov_info compatible with GCC 5.1.

    Secondly, a corresponding counter function __gcov_merge_icall_topn (Top N
    value tracking for indirect calls) is included in base.c with the other
    gcov counters unused for kernel profiling.

    Signed-off-by: Lorenzo Stoakes
    Cc: Andrey Ryabinin
    Cc: Yuan Pengfei
    Tested-by: Peter Oberparleiter
    Reviewed-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
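
The two changes described in the gcov entry above can be sketched in userspace as follows. This is a model, not the kernel source: the kernel selects GCOV_COUNTERS at compile time, while the hypothetical `gcov_counters_for()` expresses the same selection as a function so it can be exercised. The value 10 for GCC 5.1 comes from the commit message; 9 for GCC 4.9 follows from "increments"; 8 for older compilers is an assumption. `model_gcov_merge_icall_topn()` is a hypothetical stand-in for the new merge function.

```c
#include <assert.h>

typedef long long gcov_type;

/*
 * Number of gcov counters expected for a given GCC version.
 * Assumption: compilers older than 4.9 use 8 counters.
 */
int gcov_counters_for(int major, int minor)
{
    assert(major >= 0 && minor >= 0);

    if (major > 5 || (major == 5 && minor >= 1))
        return 10;
    if (major == 4 && minor >= 9)
        return 9;
    return 8;
}

/*
 * Stand-in for the added merge function. The compiler-generated code
 * references such a symbol, but kernel profiling never merges counters,
 * so the body is intentionally empty.
 */
void model_gcov_merge_icall_topn(gcov_type *counters, unsigned int n_counters)
{
    (void)counters;
    (void)n_counters;
}
```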
     
  • Commit f06e5153f4ae2e ("kernel/panic.c: add "crash_kexec_post_notifiers"
    option for kdump after panic_notifers") introduced the
    "crash_kexec_post_notifiers" kernel boot option, which toggles whether
    panic() calls crash_kexec() before or after running the panic notifiers
    and dumping kmsg.

    The problem is that the commit overlooked the panic_on_oops kernel boot
    option. If it is enabled, crash_kexec() is called directly in the oops
    path without going through panic().

    To fix this issue, this patch adds a check of "crash_kexec_post_notifiers"
    to the condition in kexec_should_crash().

    Also, add a comment in kexec_should_crash() to explain the non-obvious
    parts of this patch.

    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Baoquan He
    Tested-by: Hidehiro Kawai
    Reviewed-by: Masami Hiramatsu
    Cc: Vivek Goyal
    Cc: Ingo Molnar
    Cc: Hidehiro Kawai
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
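
The condition change in the entry above can be modelled in userspace roughly as follows. Only panic_on_oops and crash_kexec_post_notifiers are named in the commit message; the other fields of the hypothetical `struct oops_ctx` (interrupt context, pid 0, global init) are assumptions about what else the oops-path check considers, included only to give the model some shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the check looks at. */
struct oops_ctx {
    bool in_interrupt;                /* assumption: oops in IRQ context */
    int pid;                          /* assumption: 0 means the idle task */
    bool is_global_init;              /* assumption: oops in init */
    bool panic_on_oops;               /* boot option named in the commit */
    bool crash_kexec_post_notifiers;  /* boot option named in the commit */
};

/*
 * Model of the decision "should the oops path call crash_kexec()
 * directly?". With crash_kexec_post_notifiers set, the answer is
 * always no, so that crash_kexec() runs later, from panic(), after
 * the panic notifiers.
 */
bool model_kexec_should_crash(const struct oops_ctx *c)
{
    assert(c != NULL);

    if (c->crash_kexec_post_notifiers)
        return false;

    return c->in_interrupt || c->pid == 0 ||
           c->is_global_init || c->panic_on_oops;
}
```

Putting the new check first keeps the pre-existing conditions untouched and makes the override easy to spot.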
     
  • For compatibility with the behaviour before the commit f06e5153f4ae2e
    ("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after
    panic_notifers"), the 2nd crash_kexec() should be called only if
    crash_kexec_post_notifiers is enabled.

    Note that crash_kexec() returns immediately if no kdump crash kernel is
    loaded, so in that case this patch makes no functional change. The point
    is to make it explicit, from the caller panic() side, that the 2nd
    crash_kexec() does nothing unless crash_kexec_post_notifiers is enabled.

    Signed-off-by: HATAYAMA Daisuke
    Suggested-by: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Vivek Goyal
    Cc: Masami Hiramatsu
    Cc: Hidehiro Kawai
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
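
The ordering described in the entry above can be sketched as a small userspace model that records which steps panic() would take. It assumes the sequence is: early crash_kexec() (only when crash_kexec_post_notifiers is off), panic notifiers, kmsg dump, late crash_kexec() (only when the option is on), and that crash_kexec() does not return when a crash kernel is loaded. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum panic_step {
    STEP_CRASH_KEXEC,
    STEP_PANIC_NOTIFIERS,
    STEP_KMSG_DUMP
};

#define MODEL_PANIC_MAX_STEPS 3

/*
 * Record the steps taken, in order, into out[] and return how many
 * were recorded. With this patch applied, crash_kexec() appears at
 * most once in the sequence, at a position chosen by post_notifiers.
 */
size_t model_panic_steps(bool post_notifiers, bool crash_kernel_loaded,
                         enum panic_step out[MODEL_PANIC_MAX_STEPS])
{
    size_t n = 0;

    assert(out != NULL);

    if (!post_notifiers) {
        out[n++] = STEP_CRASH_KEXEC;
        if (crash_kernel_loaded)
            return n;   /* booting the crash kernel does not return */
    }

    out[n++] = STEP_PANIC_NOTIFIERS;
    out[n++] = STEP_KMSG_DUMP;

    if (post_notifiers)
        out[n++] = STEP_CRASH_KEXEC;   /* the "2nd" call, now conditional */

    return n;
}
```

In the model, the case with the option off and no crash kernel loaded yields the same three observable steps as before the patch, which matches the "no functional change" note in the commit message.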