03 Dec, 2013

3 commits

  • Pull irq fixes from Thomas Gleixner:
    - Correction of fuzzy and fragile IRQ_RETVAL macro
    - IRQ related resume fix affecting only XEN
    - ARM/GIC fix for chained GIC controllers

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip: Gic: fix boot for chained gics
    irq: Enable all irqs unconditionally in irq_resume
    genirq: Correct fuzzy and fragile IRQ_RETVAL() definition

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various smaller fixlets, all over the place"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/doc: Fix generation of device-drivers
    sched: Expose preempt_schedule_irq()
    sched: Fix a trivial typo in comments
    sched: Remove unused variable in 'struct sched_domain'
    sched: Avoid NULL dereference on sd_busy
    sched: Check sched_domain before computing group power
    MAINTAINERS: Update file patterns in the lockdep and scheduler entries

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc kernel and tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tools lib traceevent: Fix conversion of pointer to integer of different size
    perf/trace: Properly use u64 to hold event_id
    perf: Remove fragile swevent hlist optimization
    ftrace, perf: Avoid infinite event generation loop
    tools lib traceevent: Fix use of multiple options in processing field
    perf header: Fix possible memory leaks in process_group_desc()
    perf header: Fix bogus group name
    perf tools: Tag thread comm as overriden

    Linus Torvalds
     

30 Nov, 2013

2 commits

  • Pull workqueue fixes from Tejun Heo:
    "This contains one important fix. The NUMA support added a while back
    broke ordering guarantees on ordered workqueues. It was enforced by
    having single frontend interface with @max_active == 1 but the NUMA
    support puts multiple interfaces on unbound workqueues on NUMA
    machines thus breaking the ordered guarantee. This is fixed by
    disabling NUMA support on ordered workqueues.

    The above and a couple other patches were sitting in for-3.12-fixes
    but I forgot to push that out, so they ended up waiting a bit too
    long. My apologies.

    Other fixes are minor"

    * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues
    workqueue: fix comment typo for __queue_work()
    workqueue: fix ordered workqueues in NUMA setups
    workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Fixes for three issues.

    - cgroup destruction path could swamp system_wq possibly leading to
    deadlock. This actually seems to happen in the wild with memcg
    because memcg destruction path adds nested dependency on system_wq.

    Resolved by isolating cgroup destruction work items on its
    dedicated workqueue.

    - Possible locking context deadlock through seqcount reported by
    lockdep

    - Memory leak under certain conditions"

    * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix cgroup_subsys_state leak for seq_files
    cpuset: Fix memory allocator deadlock
    cgroup: use a dedicated workqueue for cgroup destruction

    Linus Torvalds
     

29 Nov, 2013

1 commit


28 Nov, 2013

2 commits

  • If a cgroup file implements either read_map() or read_seq_string(),
    such file is served using seq_file by overriding file->f_op to
    cgroup_seqfile_operations, which also overrides the release method to
    single_release() from cgroup_file_release().

    Because cgroup_file_open() didn't use to acquire any resources, this
    used to be fine, but since f7d58818ba42 ("cgroup: pin
    cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
    pins the css (cgroup_subsys_state) which is put by
    cgroup_file_release(). The patch forgot to update the release path
    for seq_files and each open/release cycle leaks a css reference.

    Fix it by updating cgroup_file_release() to also handle seq_files and
    using it for seq_file release path too.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v3.12

    Tejun Heo
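
The leak described above can be reproduced in a small userspace model. This is only a sketch: struct css, struct cfile and the cfile_*() helpers are hypothetical stand-ins for the cgroup structures, and only the reference count is modelled. It shows why a release path that skips the put leaks one reference per open/release cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cgroup_subsys_state: only the refcount matters. */
struct css { int refcnt; };

/* Hypothetical open file: remembers the css it pinned and how it is served. */
struct cfile {
    struct css *css;
    bool is_seq_file;   /* served through the seq_file path */
};

/* Opening pins the css for both kinds of file. */
static void cfile_open(struct cfile *f, struct css *css, bool is_seq_file)
{
    css->refcnt++;
    f->css = css;
    f->is_seq_file = is_seq_file;
}

/* Buggy release: the seq_file path tears down seq state only, no css put. */
static void cfile_release_buggy(struct cfile *f)
{
    if (f->is_seq_file)
        return;
    f->css->refcnt--;
}

/* Fixed release: one routine drops the pin for both kinds of file. */
static void cfile_release_fixed(struct cfile *f)
{
    /* seq_file-specific teardown would go here when f->is_seq_file is set */
    f->css->refcnt--;
}

/* Run n open/release cycles on a seq_file; return the references left over. */
int leaked_refs(int n, bool fixed)
{
    struct css css = { .refcnt = 0 };

    for (int i = 0; i < n; i++) {
        struct cfile f;

        cfile_open(&f, &css, true);
        if (fixed)
            cfile_release_fixed(&f);
        else
            cfile_release_buggy(&f);
    }
    return css.refcnt;
}
```

Calling leaked_refs(5, false) leaves five references behind, while leaked_refs(5, true) leaves none.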
     
  • Juri hit the below lockdep report:

    [ 4.303391] ======================================================
    [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted
    [ 4.303395] ------------------------------------------------------
    [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [] new_slab+0x6c/0x290
    [ 4.303417]
    [ 4.303417] and this task is already holding:
    [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [] blk_execute_rq_nowait+0x5b/0x100
    [ 4.303431] which would create a new lock dependency:
    [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
    [ 4.303436]

    [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
    [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
    [ 4.303922] HARDIRQ-ON-W at:
    [ 4.303923] [] __lock_acquire+0x65a/0x1ff0
    [ 4.303926] [] lock_acquire+0x93/0x140
    [ 4.303929] [] kthreadd+0x86/0x180
    [ 4.303931] [] ret_from_fork+0x7c/0xb0
    [ 4.303933] SOFTIRQ-ON-W at:
    [ 4.303933] [] __lock_acquire+0x68c/0x1ff0
    [ 4.303935] [] lock_acquire+0x93/0x140
    [ 4.303940] [] kthreadd+0x86/0x180
    [ 4.303955] [] ret_from_fork+0x7c/0xb0
    [ 4.303959] INITIAL USE at:
    [ 4.303960] [] __lock_acquire+0x344/0x1ff0
    [ 4.303963] [] lock_acquire+0x93/0x140
    [ 4.303966] [] kthreadd+0x86/0x180
    [ 4.303969] [] ret_from_fork+0x7c/0xb0
    [ 4.303972] }

    Which reports that we take mems_allowed_seq with interrupts enabled. A
    little digging found that this can only be from
    cpuset_change_task_nodemask().

    This is an actual deadlock because an interrupt doing an allocation will
    hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
    forever waiting for the write side to complete.

    Cc: John Stultz
    Cc: Mel Gorman
    Reported-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Tested-by: Juri Lelli
    Acked-by: Li Zefan
    Acked-by: Mel Gorman
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Peter Zijlstra
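
The deadlock in the report above can be pictured with a single-threaded model of a sequence counter. This is a sketch under stated assumptions: struct seqcount here is a toy, irqs_enabled and update_deadlocks() are invented for illustration, and masking interrupts around the write section is an assumed remedy, since the message above does not spell out the change that was made.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sequence counter: an odd value means a write is in progress. */
struct seqcount { unsigned seq; };

/* Models the CPU-local interrupt-enable state; purely illustrative. */
static bool irqs_enabled = true;

void write_begin(struct seqcount *s) { s->seq++; }  /* count becomes odd  */
void write_end(struct seqcount *s)   { s->seq++; }  /* count becomes even */

/* A reader has to wait for as long as the count is odd. */
bool reader_would_spin(const struct seqcount *s)
{
    return (s->seq & 1) != 0;
}

/*
 * Simulate an interrupt on the same CPU in the middle of the write section.
 * If it runs and has to wait, it waits forever: the writer it is waiting
 * for is the very context it interrupted.
 */
static bool irq_reader_deadlocks(const struct seqcount *s)
{
    if (!irqs_enabled)
        return false;   /* interrupt is deferred until the write is done */
    return reader_would_spin(s);
}

/* Do one update; report whether an interrupt-context reader would hang. */
bool update_deadlocks(bool mask_irqs)
{
    struct seqcount s = { 0 };
    bool hung;

    if (mask_irqs)
        irqs_enabled = false;
    write_begin(&s);
    hung = irq_reader_deadlocks(&s);    /* interrupt attempt mid-update */
    write_end(&s);
    if (mask_irqs)
        irqs_enabled = true;
    return hung;
}
```

In this model update_deadlocks(false) reports the hang and update_deadlocks(true) does not.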
     

27 Nov, 2013

2 commits

  • Tony reported that aa0d53260596 ("ia64: Use preempt_schedule_irq")
    broke PREEMPT=n builds on ia64.

    Ok, wrapped my brain around it. I tripped over the magic asm foo which
    has a single need_resched check and schedule point for both sys call
    return and interrupt return.

    So you need the preempt_schedule_irq() for kernel preemption from
    interrupt return while on a normal syscall preemption a schedule would
    be sufficient. But using preempt_schedule_irq() is not harmful here in
    any way. It just sets the preempt_active bit also in cases where it
    would not be required.

    Even on preempt=n kernels adding the preempt_active bit is completely
    harmless. So instead of having an extra function, moving the existing
    one out of the ifdef PREEMPT looks like the sanest thing to do.

    It would also allow getting rid of various other sti/schedule/cli asm
    magic in other archs.

    Reported-and-Tested-by: Tony Luck
    Fixes: aa0d53260596 ("ia64: Use preempt_schedule_irq")
    Signed-off-by: Thomas Gleixner
    [slightly edited Changelog]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • …it/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "This includes two fixes.

    1) is a bug fix that happens when root does the following:

    echo function_graph > current_tracer
    modprobe foo
    echo nop > current_tracer

    This causes the ftrace internal accounting to get screwed up and
    crashes ftrace, preventing the user from using the function tracer
    after that.

    2) if a TRACE_EVENT has a string field, and NULL is given for it.

    The internal trace event code does a strlen() and strcpy() on the
    source of field. If it is NULL it causes the system to oops.

    This bug has been there since 2.6.31, but no TRACE_EVENT ever passed
    in a NULL to the string field, until now"

    * tag 'trace-fixes-v3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix function graph with loading of modules
    tracing: Allow events to have NULL strings

    Linus Torvalds
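
The second fix above concerns strlen() and strcpy() being applied to a NULL source. A minimal userspace sketch of the guard follows. The helper names and the "(null)" placeholder are assumptions made for illustration; the pull message does not say what is recorded in place of a NULL string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder recorded when the caller passes NULL; the literal is assumed. */
#define NULL_STR "(null)"

/* Pick a source that is always safe to hand to strlen() and strcpy(). */
static const char *safe_src(const char *src)
{
    return src ? src : NULL_STR;
}

/* Bytes needed to store the field, including the terminating NUL. */
size_t event_str_size(const char *src)
{
    return strlen(safe_src(src)) + 1;
}

/* Copy the field into dst, which must hold event_str_size(src) bytes. */
void event_str_assign(char *dst, const char *src)
{
    strcpy(dst, safe_src(src));
}

/* Record src into a fresh buffer and compare it with the expected text. */
int recorded_equals(const char *src, const char *expect)
{
    char *buf = malloc(event_str_size(src));
    int same;

    if (!buf)
        return 0;
    event_str_assign(buf, src);
    same = strcmp(buf, expect) == 0;
    free(buf);
    return same;
}
```

Sizing and copying go through the same safe_src() helper, so the buffer length always matches what is copied into it.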
     

26 Nov, 2013

2 commits

  • Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload"
    fixed module loading and unloading with respect to function tracing, but
    it missed the function graph tracer. If you perform the following

    # cd /sys/kernel/debug/tracing
    # echo function_graph > current_tracer
    # modprobe nfsd
    # echo nop > current_tracer

    You'll get the following oops message:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
    Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
    CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
    0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
    ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] warn_slowpath_common+0x81/0x9b
    [] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
    [] warn_slowpath_null+0x1a/0x1c
    [] __ftrace_hash_rec_update.part.35+0x168/0x1b9
    [] ? __mutex_lock_slowpath+0x364/0x364
    [] ftrace_shutdown+0xd7/0x12b
    [] unregister_ftrace_graph+0x49/0x78
    [] graph_trace_reset+0xe/0x10
    [] tracing_set_tracer+0xa7/0x26a
    [] tracing_set_trace_write+0x8b/0xbd
    [] ? ftrace_return_to_handler+0xb2/0xde
    [] ? __sb_end_write+0x5e/0x5e
    [] vfs_write+0xab/0xf6
    [] ftrace_graph_caller+0x85/0x85
    [] SyS_write+0x59/0x82
    [] ftrace_graph_caller+0x85/0x85
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 940358030751eafb ]---

    The above mentioned commit didn't go far enough. Well, it covered the
    function tracer by adding checks in __register_ftrace_function(). The
    problem is that the function graph tracer circumvents that (for a slight
    efficiency gain when function graph trace is running with a function
    tracer. The gain was not worth this).

    The problem came with ftrace_startup() which should always be called after
    __register_ftrace_function(), if you want this bug to be completely fixed.

    Anyway, this solution moves __register_ftrace_function() inside of
    ftrace_startup() and removes the need to call them both.

    Reported-by: Dave Wysochanski
    Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace")
    Cc: stable@vger.kernel.org # 3.0+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
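
The shape of this fix, folding a mandatory step into the function every caller already uses, can be sketched as below. All names are hypothetical and the struct models nothing but two flags; it is not the ftrace code, only an illustration of why a caller that goes straight to the startup routine can no longer bypass the registration checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracer handle: the flags only model the two steps. */
struct tracer_ops {
    bool checked;   /* registration checks were performed */
    bool running;   /* tracer was started */
};

/* Stands in for the registration step that carries the checks. */
static void register_checks(struct tracer_ops *ops)
{
    ops->checked = true;
}

/* Old startup: assumes the caller registered first. */
static void startup_old(struct tracer_ops *ops)
{
    ops->running = true;
}

/* New startup: performs the registration itself. */
static void startup_new(struct tracer_ops *ops)
{
    register_checks(ops);
    startup_old(ops);
}

/*
 * Start a tracer and report whether the registration checks ran.
 * skips_register models a caller that goes straight to startup.
 */
bool checks_ran(bool skips_register, bool fixed)
{
    struct tracer_ops ops = { false, false };

    if (fixed) {
        startup_new(&ops);
    } else {
        if (!skips_register)
            register_checks(&ops);
        startup_old(&ops);
    }
    return ops.checked;
}
```

Only the combination checks_ran(true, false) returns false, which corresponds to the path that was missed.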
     
  • When the system enters suspend, it disables all interrupts in
    suspend_device_irqs(), including the interrupts marked EARLY_RESUME.

    On the resume side things are different. The EARLY_RESUME interrupts
    are reenabled in sys_core_ops->resume and the non EARLY_RESUME
    interrupts are reenabled in the normal system resume path.

    When suspend_noirq() fails or suspend is aborted for any other
    reason, we might omit the resume side call to sys_core_ops->resume()
    and therefore the interrupts marked EARLY_RESUME are not reenabled and
    stay disabled forever.

    To solve this, enable all irqs unconditionally in irq_resume()
    regardless of whether interrupts marked EARLY_RESUME have already
    been enabled or not.

    This might try to reenable already enabled interrupts in the non
    failure case, but the only affected platform is XEN and it has been
    confirmed that it does not cause any side effects.

    [ tglx: Massaged changelog. ]

    Signed-off-by: Laxman Dewangan
    Acked-by-and-tested-by: Konrad Rzeszutek Wilk
    Acked-by: Heiko Stuebner
    Reviewed-by: Pavel Machek
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
    Signed-off-by: Thomas Gleixner

    Laxman Dewangan
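
The failure case above can be walked through with a toy model of three interrupt lines, one of them marked for early resume. This is a sketch only: struct irq_line and the helper names are invented, and the boolean "unconditional" selects between the old and the new behaviour of the normal resume path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy interrupt line; early_resume models the EARLY_RESUME marking. */
struct irq_line {
    bool early_resume;
    bool enabled;
};

/* Suspend disables every line, early-resume or not. */
void suspend_irqs(struct irq_line *l, size_t n)
{
    for (size_t i = 0; i < n; i++)
        l[i].enabled = false;
}

/* Early resume hook: may never run when suspend is aborted. */
void resume_early(struct irq_line *l, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (l[i].early_resume)
            l[i].enabled = true;
}

/* Normal resume: old behaviour skips early lines, new one enables all. */
void resume_irqs(struct irq_line *l, size_t n, bool unconditional)
{
    for (size_t i = 0; i < n; i++)
        if (unconditional || !l[i].early_resume)
            l[i].enabled = true;    /* enabling twice is harmless here */
}

static size_t count_disabled(const struct irq_line *l, size_t n)
{
    size_t stuck = 0;

    for (size_t i = 0; i < n; i++)
        if (!l[i].enabled)
            stuck++;
    return stuck;
}

/* Lines left disabled after one suspend/resume cycle. */
size_t stuck_lines(bool early_hook_ran, bool unconditional)
{
    struct irq_line l[3] = { { true, true }, { false, true }, { false, true } };

    suspend_irqs(l, 3);
    if (early_hook_ran)
        resume_early(l, 3);
    resume_irqs(l, 3, unconditional);
    return count_disabled(l, 3);
}
```

With the early hook skipped, the old behaviour leaves one line disabled; enabling unconditionally leaves none in either case.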
     

24 Nov, 2013

1 commit

  • Pull crypto update from Herbert Xu:
    - Made x86 ablk_helper generic for ARM
    - Phase out chainiv in favour of eseqiv (affects IPsec)
    - Fixed aes-cbc IV corruption on s390
    - Added constant-time crypto_memneq which replaces memcmp
    - Fixed aes-ctr in omap-aes
    - Added OMAP3 ROM RNG support
    - Add PRNG support for MSM SoC's
    - Add and use Job Ring API in caam
    - Misc fixes

    [ NOTE! This pull request was sent within the merge window, but Herbert
    has some questionable email sending setup that makes him public enemy
    #1 as far as gmail is concerned. So most of his emails seem to be
    trapped by gmail as spam, resulting in me not seeing them. - Linus ]

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (49 commits)
    crypto: s390 - Fix aes-cbc IV corruption
    crypto: omap-aes - Fix CTR mode counter length
    crypto: omap-sham - Add missing modalias
    padata: make the sequence counter an atomic_t
    crypto: caam - Modify the interface layers to use JR API's
    crypto: caam - Add API's to allocate/free Job Rings
    crypto: caam - Add Platform driver for Job Ring
    hwrng: msm - Add PRNG support for MSM SoC's
    ARM: DT: msm: Add Qualcomm's PRNG driver binding document
    crypto: skcipher - Use eseqiv even on UP machines
    crypto: talitos - Simplify key parsing
    crypto: picoxcell - Simplify and harden key parsing
    crypto: ixp4xx - Simplify and harden key parsing
    crypto: authencesn - Simplify key parsing
    crypto: authenc - Export key parsing helper function
    crypto: mv_cesa: remove deprecated IRQF_DISABLED
    hwrng: OMAP3 ROM Random Number Generator support
    crypto: sha256_ssse3 - also test for BMI2
    crypto: mv_cesa - Remove redundant of_match_ptr
    crypto: sahara - Remove redundant of_match_ptr
    ...

    Linus Torvalds
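
The list above mentions a constant-time crypto_memneq replacing memcmp. The idea behind such a comparison can be sketched in plain C as follows. memneq_sketch() is a hypothetical name and this is not the kernel implementation; in particular, an optimizing compiler may still transform the loop, and hardened code takes extra measures against that.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return nonzero if the two buffers differ and zero if they are equal.
 * Unlike memcmp(), the loop never exits early, so the running time does
 * not reveal the position of the first mismatching byte.
 */
unsigned long memneq_sketch(const void *a, const void *b, size_t n)
{
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned long diff = 0;

    for (size_t i = 0; i < n; i++)
        diff |= (unsigned long)(pa[i] ^ pb[i]);   /* accumulate differences */
    return diff;
}
```

The result only says whether the buffers differ. It carries no ordering information, so it is not a drop-in replacement where the sign of memcmp() is used.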
     

23 Nov, 2013

5 commits

    When a work item starts execution, the high bits of the work's data
    contain the pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE.
    The pool ID is set to WORK_OFFQ_POOL_NONE when the work item is
    initialized, indicating that no pool is associated, and get_work_pool()
    uses it to check the associated pool. So if worker_pool_assign_id()
    assigns an ID greater than or equal to WORK_OFFQ_POOL_NONE to a pool,
    it triggers leakage, and it may break the non-reentrance guarantee.

    This patch fixes the issue by having worker_pool_assign_id() call
    idr_alloc() with the @end param set to WORK_OFFQ_POOL_NONE.

    Furthermore, in the current implementation, the BUILD_BUG_ON() in
    init_workqueues makes no sense. The number of worker pools needed
    cannot be determined at compile time, because the number of backing
    pools for UNBOUND workqueues is dynamic based on the assigned custom
    attributes. So remove it.

    tj: Minor comment and indentation updates.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
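
The bounded allocation described above can be illustrated with a tiny allocator whose exclusive upper bound is the reserved "no pool" value. This is a sketch: POOL_ID_BITS, POOL_NONE, struct id_alloc and the functions are invented for the example and use a 3-bit ID space so the behaviour is easy to follow; it does not use the real idr API.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values: a 3-bit ID field whose top value means "no pool". */
#define POOL_ID_BITS 3
#define POOL_ID_MAX  (1 << POOL_ID_BITS)
#define POOL_NONE    (POOL_ID_MAX - 1)      /* reserved sentinel */

struct id_alloc { bool used[POOL_ID_MAX]; };

/* Hand out the lowest free ID in [start, end), or -1 if there is none. */
int id_alloc_range(struct id_alloc *a, int start, int end)
{
    assert(start >= 0);
    for (int id = start; id < end && id < POOL_ID_MAX; id++) {
        if (!a->used[id]) {
            a->used[id] = true;
            return id;
        }
    }
    return -1;
}

/* Pool IDs must stay below the sentinel, so it is the exclusive bound. */
int pool_assign_id(struct id_alloc *a)
{
    return id_alloc_range(a, 0, POOL_NONE);
}

/* Count how many pools get an ID before the allocator refuses. */
int count_assignable(void)
{
    struct id_alloc a = { { false } };
    int n = 0;

    while (pool_assign_id(&a) >= 0)
        n++;
    return n;
}
```

The allocator refuses once every ID below POOL_NONE is taken, so a pool can never be confused with the "no pool" marker.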
     
  • It seems the "dying" should be "draining" here.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
     
  • An ordered workqueue implements execution ordering by using single
    pool_workqueue with max_active == 1. On a given pool_workqueue, work
    items are processed in FIFO order and limiting max_active to 1
    enforces the queued work items to be processed one by one.

    Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
    unbound workqueues") accidentally broke this guarantee by applying
    NUMA affinity to ordered workqueues too. On NUMA setups, an ordered
    workqueue would end up with separate pool_workqueues for different
    nodes. Each pool_workqueue still limits max_active to 1 but multiple
    work items may be executed concurrently and out of order depending on
    which node they are queued to.

    Fix it by using dedicated ordered_wq_attrs[] when creating ordered
    workqueues. The new attrs match the unbound ones except that no_numa
    is always set thus forcing all NUMA nodes to share the default
    pool_workqueue.

    While at it, add a sanity check in the workqueue creation path which
    verifies that an ordered workqueue has only the default
    pool_workqueue.

    Signed-off-by: Tejun Heo
    Reported-by: Libin
    Cc: stable@vger.kernel.org
    Cc: Lai Jiangshan

    Tejun Heo
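
The ordering breakage above can be seen in a small model where every pool_workqueue runs at most one item at a time. This is a sketch with invented types: struct pwq and struct wq only count queued and active items, and the no_numa flag decides whether all nodes share one pwq.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4

/* Toy pool_workqueue: FIFO depth plus the number of items executing. */
struct pwq { int queued; int active; int max_active; };

/* Toy workqueue: one pwq per node, or a single shared one. */
struct wq { struct pwq pwqs[MAX_NODES]; int nr_pwqs; };

/* Start queued items while the pwq is below its max_active limit. */
static void pwq_activate(struct pwq *p)
{
    while (p->queued > 0 && p->active < p->max_active) {
        p->queued--;
        p->active++;
    }
}

void wq_init(struct wq *wq, int nr_nodes, bool no_numa)
{
    assert(nr_nodes >= 1 && nr_nodes <= MAX_NODES);
    wq->nr_pwqs = no_numa ? 1 : nr_nodes;
    for (int i = 0; i < wq->nr_pwqs; i++)
        wq->pwqs[i] = (struct pwq){ .queued = 0, .active = 0, .max_active = 1 };
}

/* Queue one item from the given node onto the pwq serving that node. */
void wq_queue(struct wq *wq, int node)
{
    struct pwq *p = &wq->pwqs[node % wq->nr_pwqs];

    p->queued++;
    pwq_activate(p);
}

int wq_running(const struct wq *wq)
{
    int running = 0;

    for (int i = 0; i < wq->nr_pwqs; i++)
        running += wq->pwqs[i].active;
    return running;
}

/* Queue one item from each of two nodes; return how many run at once. */
int concurrent_items(bool no_numa)
{
    struct wq wq;

    wq_init(&wq, 2, no_numa);
    wq_queue(&wq, 0);
    wq_queue(&wq, 1);
    return wq_running(&wq);
}
```

With per-node pwqs two items execute at once even though each pwq honours max_active == 1; with a single shared pwq only one does.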
     
  • Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
    in create_worker(). Otherwise userland can change ->cpus_allowed
    in between.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     
  • Since be44562613851 ("cgroup: remove synchronize_rcu() from
    cgroup_diput()"), cgroup destruction path makes use of workqueue. css
    freeing is performed from a work item from that point on and a later
    commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
    steps"), moves css offlining to workqueue too.

    As cgroup destruction isn't depended upon for memory reclaim, the
    destruction work items were put on the system_wq; unfortunately, some
    controllers may block in the destruction path for a considerable
    duration while holding cgroup_mutex. As a large part of the destruction
    path is synchronized through cgroup_mutex, when combined with a high
    rate of cgroup removals, this has the potential to fill up system_wq's
    max_active of 256.

    Also, it turns out that memcg's css destruction path ends up queueing
    and waiting for work items on system_wq through work_on_cpu(). If
    such operation happens while system_wq is fully occupied by cgroup
    destruction work items, work_on_cpu() can't make forward progress
    because system_wq is full and other destruction work items on
    system_wq can't make forward progress because the work item waiting
    for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

    This can be fixed by queueing destruction work items on a separate
    workqueue. This patch creates a dedicated workqueue -
    cgroup_destroy_wq - for this purpose. As these work items shouldn't
    have inter-dependencies and are mostly serialized by cgroup_mutex
    anyway, giving a high concurrency level doesn't buy anything and the
    workqueue's @max_active is set to 1 so that destruction work items are
    executed one by one on each CPU.

    Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
    cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
    separate core_initcall(). In the future, we probably want to reorder
    so that workqueue init happens before cgroup_init().

    Signed-off-by: Tejun Heo
    Reported-by: Hugh Dickins
    Reported-by: Shawn Bohrer
    Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
    Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
    Cc: stable@vger.kernel.org # v3.9+

    Tejun Heo
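
The saturation scenario above reduces to slot counting, which the following sketch models. struct wq and the helpers are invented for illustration; the limit of 256 and the dedicated queue's max_active of 1 are taken from the message above, everything else is an assumption. The function asks whether a nested work item, such as one queued through work_on_cpu(), can still get a slot on the system queue.

```c
#include <assert.h>
#include <stdbool.h>

#define SYSTEM_WQ_MAX_ACTIVE 256    /* limit quoted in the message above */

/* Toy workqueue: only the number of occupied execution slots is tracked. */
struct wq { int max_active; int active; };

/* Take an execution slot if one is free; otherwise the item stays queued. */
static bool wq_try_start(struct wq *wq)
{
    if (wq->active >= wq->max_active)
        return false;
    wq->active++;
    return true;
}

/*
 * Queue nr_destroy destruction items, all of which block without finishing,
 * then check whether a nested item can start on the system queue.
 */
bool nested_work_can_start(int nr_destroy, bool dedicated)
{
    struct wq system_wq = { SYSTEM_WQ_MAX_ACTIVE, 0 };
    struct wq destroy_wq = { 1, 0 };
    struct wq *target = dedicated ? &destroy_wq : &system_wq;

    for (int i = 0; i < nr_destroy; i++)
        (void)wq_try_start(target);     /* excess items simply wait */

    return wq_try_start(&system_wq);
}
```

Once 256 blocked destruction items share the system queue, the nested item cannot start; with a dedicated queue the system queue keeps its slots however many destructions pile up.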
     

22 Nov, 2013

2 commits

  • Pull security subsystem updates from James Morris:
    "In this patchset, we finally get an SELinux update, with Paul Moore
    taking over as maintainer of that code.

    Also a significant update for the Keys subsystem, as well as
    maintenance updates to Smack, IMA, TPM, and Apparmor"

    and since I wanted to know more about the updates to key handling,
    here's the explanation from David Howells on that:

    "Okay. There are a number of separate bits. I'll go over the big bits
    and the odd important other bit, most of the smaller bits are just
    fixes and cleanups. If you want the small bits accounting for, I can
    do that too.

    (1) Keyring capacity expansion.

    KEYS: Consolidate the concept of an 'index key' for key access
    KEYS: Introduce a search context structure
    KEYS: Search for auth-key by name rather than target key ID
    Add a generic associative array implementation.
    KEYS: Expand the capacity of a keyring

    Several of the patches are providing an expansion of the capacity of a
    keyring. Currently, the maximum size of a keyring payload is one page.
    Subtract a small header and then divide up into pointers, that only gives
    you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses
    a keyring to store ID mapping data, that has proven to be insufficient to
    the cause.

    Whatever data structure I use to handle the keyring payload, it can only
    store pointers to keys, not the keys themselves because several keyrings
    may point to a single key. This precludes inserting, say, an rb_node
    struct into the key struct for this purpose.

    I could make an rbtree of records such that each record has an rb_node
    and a key pointer, but that would use four words of space per key stored
    in the keyring. It would, however, be able to use much existing code.

    I selected instead a non-rebalancing radix-tree type approach as that
    could have a better space-used/key-pointer ratio. I could have used the
    radix tree implementation that we already have and insert keys into it by
    their serial numbers, but that means any sort of search must iterate over
    the whole radix tree. Further, its nodes are a bit on the capacious side
    for what I want - especially given that key serial numbers are randomly
    allocated, thus leaving a lot of empty space in the tree.

    So what I have is an associative array that internally is a radix-tree
    with 16 pointers per node where the index key is constructed from the key
    type pointer and the key description. This means that an exact lookup by
    type+description is very fast as this tells us how to navigate directly to
    the target key.

    I made the data structure general in lib/assoc_array.c. As far as it is
    concerned, its index key is just a sequence of bits that leads to a
    pointer. It's possible that someone else will be able to make use of it
    also. FS-Cache might, for example.

    (2) Mark keys as 'trusted' and keyrings as 'trusted only'.

    KEYS: verify a certificate is signed by a 'trusted' key
    KEYS: Make the system 'trusted' keyring viewable by userspace
    KEYS: Add a 'trusted' flag and a 'trusted only' flag
    KEYS: Separate the kernel signature checking keyring from module signing

    These patches allow keys carrying asymmetric public keys to be marked as
    being 'trusted' and allow keyrings to be marked as only permitting the
    addition or linkage of trusted keys.

    Keys loaded from hardware during kernel boot or compiled into the kernel
    during build are marked as being trusted automatically. New keys can be
    loaded at runtime with add_key(). They are checked against the system
    keyring contents and if their signatures can be validated with keys that
    are already marked trusted, then they are marked trusted also and can
    thus be added into the master keyring.

    Patches from Mimi Zohar make this usable with the IMA keyrings also.

    (3) Remove the date checks on the key used to validate a module signature.

    X.509: Remove certificate date checks

    It's not reasonable to reject a signature just because the key that it was
    generated with is no longer valid datewise - especially if the kernel
    hasn't yet managed to set the system clock when the first module is
    loaded - so just remove those checks.

    (4) Make it simpler to deal with additional X.509 being loaded into the kernel.

    KEYS: Load *.x509 files into kernel keyring
    KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate

    The builder of the kernel now just places files with the extension ".x509"
    into the kernel source or build trees and they're concatenated by the
    kernel build and stuffed into the appropriate section.

    (5) Add support for userspace kerberos to use keyrings.

    KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
    KEYS: Implement a big key type that can save to tmpfs

    Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
    We looked at storing it in keyrings instead as that confers certain
    advantages such as tickets being automatically deleted after a certain
    amount of time and the ability for the kernel to get at these tokens more
    easily.

    To make this work, two things were needed:

    (a) A way for the tickets to persist beyond the lifetime of all a user's
    sessions so that cron-driven processes can still use them.

    The problem is that a user's session keyrings are deleted when the
    session that spawned them logs out and the user's user keyring is
    deleted when the UID is deleted (typically when the last log out
    happens), so neither of these places is suitable.

    I've added a system keyring into which a 'persistent' keyring is
    created for each UID on request. Each time a user requests their
    persistent keyring, the expiry time on it is set anew. If the user
    doesn't ask for it for, say, three days, the keyring is automatically
    expired and garbage collected using the existing gc. All the kerberos
    tokens it held are then also gc'd.

    (b) A key type that can hold really big tickets (up to 1MB in size).

    The problem is that Active Directory can return huge tickets with lots
    of auxiliary data attached. We don't, however, want to eat up huge
    tracts of unswappable kernel space for this, so if the ticket is
    greater than a certain size, we create a swappable shmem file and dump
    the contents in there and just live with the fact we then have an
    inode and a dentry overhead. If the ticket is smaller than that, we
    slap it in a kmalloc()'d buffer"
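
Point (1) above describes a radix tree with 16 pointers per node that is navigated by an index key treated as a sequence of bits. A minimal sketch of that navigation step follows. It is not the lib/assoc_array.c code: slot_at_level() is a hypothetical helper, and consuming the high nibble of each byte first is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define FAN_OUT        16   /* pointers per node, as described above */
#define BITS_PER_LEVEL 4    /* log2(FAN_OUT) bits consumed per level */

/*
 * Return the slot to follow at the given depth: the level-th 4-bit chunk
 * of the index key. Returns -1 once the key has no more bits to offer.
 */
int slot_at_level(const unsigned char *key, size_t key_len, size_t level)
{
    size_t byte = level * BITS_PER_LEVEL / 8;
    int slot;

    if (byte >= key_len)
        return -1;
    if (level % 2 == 0)
        slot = key[byte] >> 4;      /* high nibble first (assumed order) */
    else
        slot = key[byte] & 0x0f;    /* then the low nibble */
    assert(slot < FAN_OUT);
    return slot;
}
```

For the two-byte key 0x3a 0xf0 the walk visits slots 3, 10, 15 and 0 and then runs out of key, which is why an exact lookup needs at most one step per nibble.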

    * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
    KEYS: Fix keyring content gc scanner
    KEYS: Fix error handling in big_key instantiation
    KEYS: Fix UID check in keyctl_get_persistent()
    KEYS: The RSA public key algorithm needs to select MPILIB
    ima: define '_ima' as a builtin 'trusted' keyring
    ima: extend the measurement list to include the file signature
    kernel/system_certificate.S: use real contents instead of macro GLOBAL()
    KEYS: fix error return code in big_key_instantiate()
    KEYS: Fix keyring quota misaccounting on key replacement and unlink
    KEYS: Fix a race between negating a key and reading the error set
    KEYS: Make BIG_KEYS boolean
    apparmor: remove the "task" arg from may_change_ptraced_domain()
    apparmor: remove parent task info from audit logging
    apparmor: remove tsk field from the apparmor_audit_struct
    apparmor: fix capability to not use the current task, during reporting
    Smack: Ptrace access check mode
    ima: provide hash algo info in the xattr
    ima: enable support for larger default filedata hash algorithms
    ima: define kernel parameter 'ima_template=' to change configured default
    ima: add Kconfig default measurement list template
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     

21 Nov, 2013

2 commits

  • Pull vfs bits and pieces from Al Viro:
    "Assorted bits that got missed in the first pull request + fixes for a
    couple of coredump regressions"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fold try_to_ascend() into the sole remaining caller
    dcache.c: get rid of pointless macros
    take read_seqbegin_or_lock() and friends to seqlock.h
    consolidate simple ->d_delete() instances
    gfs2: endianness misannotations
    dump_emit(): use __kernel_write(), not vfs_write()
    dump_align(): fix the dumb braino

    Linus Torvalds
     
  • Pull more ACPI and power management updates from Rafael Wysocki:

    - ACPI-based device hotplug fixes for issues introduced recently and a
    fix for an older error code path bug in the ACPI PCI host bridge
    driver

    - Fix for recently broken OMAP cpufreq build from Viresh Kumar

    - Fix for a recent hibernation regression related to s2disk

    - Fix for a locking-related regression in the ACPI EC driver from
    Puneet Kumar

    - System suspend error code path fix related to runtime PM and runtime
    PM documentation update from Ulf Hansson

    - cpufreq's conservative governor fix from Xiaoguang Chen

    - New processor IDs for intel_idle and turbostat and removal of an
    obsolete Kconfig option from Len Brown

    - New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
    ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg

    - Removal of several ACPI video DMI blacklist entries that are not
    necessary any more from Aaron Lu

    - Rework of the ACPI companion representation in struct device and code
    cleanup related to that change from Rafael J Wysocki, Lan Tianyu and
    Jarkko Nikula

    - Fixes for assigning names to ACPI-enumerated I2C and SPI devices from
    Jarkko Nikula

    * tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits)
    PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
    ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
    ACPI / PCI root: Clear driver_data before failing enumeration
    ACPI / hotplug: Fix PCI host bridge hot removal
    ACPI / hotplug: Fix acpi_bus_get_device() return value check
    cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
    ACPI / video: clean up DMI table for initial black screen problem
    ACPI / EC: Ensure lock is acquired before accessing ec struct members
    PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
    ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac
    spi: Use stable dev_name for ACPI enumerated SPI slaves
    i2c: Use stable dev_name for ACPI enumerated I2C slaves
    ACPI: Provide acpi_dev_name accessor for struct acpi_device device name
    ACPI / bind: Use (put|get)_device() on ACPI device objects too
    ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro
    ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node
    cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
    PM / Runtime: Fix error path for prepare
    PM / Runtime: Update documentation around probe|remove|suspend
    cpufreq: conservative: set requested_freq to policy max when it is over policy max
    ...

    Linus Torvalds
     

20 Nov, 2013

7 commits

  • Pull networking fixes from David Miller:
    "Mostly these are fixes for fallout due to merge window changes, as
    well as cures for problems that have been with us for a much longer
    period of time"

    1) Johannes Berg noticed two major deficiencies in our genetlink
    registration. Some genetlink protocols were passing in constant
    counts for their ops array rather than something like
    ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
    using fixed IDs for their multicast groups.

    We have to retain these fixed IDs to keep existing userland tools
    working, but reserve them so that other multicast groups used by
    other protocols can not possibly conflict.

    In dealing with these two problems, we actually now use less state
    management for genetlink operations and multicast groups.

    2) When configuring interface hardware timestamping, fix several
    drivers that simply do not validate that the hwtstamp_config value
    is one the driver actually supports. From Ben Hutchings.

    3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.

    4) In dev_forward_skb(), set the skb->protocol in the right order
    relative to skb_scrub_packet(). From Alexei Starovoitov.

    5) Bridge erroneously fails to use the proper wrapper functions to make
    calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
    Makita.

    6) When detaching a bridge port, make sure to flush all VLAN IDs to
    prevent them from leaking, also from Toshiaki Makita.

    7) Put in a compromise for TCP Small Queues so that deep queued devices
    that delay TX reclaim non-trivially don't have such a performance
    decrease. One particularly problematic area is 802.11 AMPDU in
    wireless. From Eric Dumazet.

    8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
    here. Fix from Eric Dumazet, reported by Dave Jones.

    9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.

    10) When computing mergeable buffer sizes, virtio-net fails to take the
    virtio-net header into account. From Michael Dalton.

    11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
    bumping, this one has been with us for a while. From Eric Dumazet.

    12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
    Hugne.

    13) 6lowpan bit used for traffic classification was wrong, from Jukka
    Rissanen.

    14) macvlan has the same issue as normal vlans did wrt. propagating LRO
    disabling down to the real device, fix it the same way. From Michal
    Kubecek.

    15) CPSW driver needs to soft reset all slaves during suspend, from
    Daniel Mack.

    16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.

    17) The xen-netfront RX buffer refill timer isn't properly scheduled on
    partial RX allocation success, from Ma JieYue.

    18) When ipv6 ping protocol support was added, the AF_INET6 protocol
    initialization cleanup path on failure was borked a little. Fix
    from Vlad Yasevich.

    19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
    blocks we can do the wrong thing with the msg_name we write back to
    userspace. From Hannes Frederic Sowa. There is another fix in the
    works from Hannes which will prevent future problems of this nature.

    20) Fix route leak in VTI tunnel transmit, from Fan Du.
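
    Item 2 above (rejecting hwtstamp_config values a driver does not
    support) can be sketched in userspace C. This is a minimal sketch
    under stated assumptions, not any driver's actual code: the enums,
    struct demo_hwtstamp_config and demo_validate_hwtstamp() are
    hypothetical stand-ins, and the choice of -EINVAL and -ERANGE as
    error codes is an assumption of this example.

```c
#include <errno.h>

/* Hypothetical stand-ins for the timestamping modes a request can name. */
enum demo_tx_type { DEMO_TX_OFF, DEMO_TX_ON, DEMO_TX_ONESTEP_SYNC };
enum demo_rx_filter { DEMO_RX_NONE, DEMO_RX_ALL, DEMO_RX_PTP_V2_EVENT };

/* Hypothetical request layout, loosely modelled on a timestamping config. */
struct demo_hwtstamp_config {
    int flags;
    int tx_type;
    int rx_filter;
};

/*
 * Accept only the modes this imaginary driver implements; anything else
 * is rejected instead of being silently accepted.
 */
static int demo_validate_hwtstamp(const struct demo_hwtstamp_config *cfg)
{
    if (cfg->flags)
        return -EINVAL;         /* no flags are understood here */

    switch (cfg->tx_type) {
    case DEMO_TX_OFF:
    case DEMO_TX_ON:
        break;
    default:
        return -ERANGE;         /* e.g. one-step sync is unsupported */
    }

    switch (cfg->rx_filter) {
    case DEMO_RX_NONE:
    case DEMO_RX_ALL:
        break;
    default:
        return -ERANGE;
    }

    return 0;
}
```

    The point of the sketch is the explicit whitelist: a request naming
    a mode outside the supported set fails up front rather than leaving
    the caller to believe timestamping was configured.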

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
    genetlink: make multicast groups const, prevent abuse
    genetlink: pass family to functions using groups
    genetlink: add and use genl_set_err()
    genetlink: remove family pointer from genl_multicast_group
    genetlink: remove genl_unregister_mc_group()
    hsr: don't call genl_unregister_mc_group()
    quota/genetlink: use proper genetlink multicast APIs
    drop_monitor/genetlink: use proper genetlink multicast APIs
    genetlink: only pass array to genl_register_family_with_ops()
    tcp: don't update snd_nxt, when a socket is switched from repair mode
    atm: idt77252: fix dev refcnt leak
    xfrm: Release dst if this dst is improper for vti tunnel
    netlink: fix documentation typo in netlink_set_err()
    be2net: Delete secondary unicast MAC addresses during be_close
    be2net: Fix unconditional enabling of Rx interface options
    net, virtio_net: replace the magic value
    ping: prevent NULL pointer dereference on write to msg_name
    bnx2x: Prevent "timeout waiting for state X"
    bnx2x: prevent CFC attention
    bnx2x: Prevent panic during DMAE timeout
    ...

    Linus Torvalds
     
  • <linux/spinlock.h> has heavy dependencies on other header files.
    It triggers circular dependencies in generated headers on IA64, at
    least:

    CC kernel/bounds.s
    In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0,
    from include/linux/thread_info.h:54,
    from include/asm-generic/preempt.h:4,
    from arch/ia64/include/generated/asm/preempt.h:1,
    from include/linux/preempt.h:18,
    from include/linux/spinlock.h:50,
    from kernel/bounds.c:14:
    /home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory
    compilation terminated.

    Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, it's
    enough to find out size of spinlock_t.

    Signed-off-by: Kirill A. Shutemov
    Reported-and-Tested-by: Tony Luck
    Signed-off-by: Tony Luck

    Kirill A. Shutemov
     
  • As suggested by David Miller, make genl_register_family_with_ops()
    a macro and pass only the array, evaluating ARRAY_SIZE() in the
    macro, this is a little safer.

    The openvswitch has some indirection, assign ops/n_ops directly in
    that code. This might ultimately just assign the pointers in the
    family initializations, saving the struct genl_family_and_ops and
    code (once mcast groups are handled differently.)
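
    The macro approach described above can be sketched in plain
    userspace C. This is only an illustration under stated assumptions:
    struct demo_op, struct demo_family, demo_register_with_ops() and
    demo_register_with_ops_n() are hypothetical stand-ins, not the
    kernel's genetlink definitions.

```c
#include <stddef.h>

/* Element count of a true array; only meaningful for arrays, not pointers. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical stand-ins for an operation and the family that owns it. */
struct demo_op {
    int cmd;
};

struct demo_family {
    const struct demo_op *ops;
    size_t n_ops;
};

/* Hypothetical helper that takes the pointer and the count explicitly. */
static int demo_register_with_ops_n(struct demo_family *family,
                                    const struct demo_op *ops, size_t n_ops)
{
    family->ops = ops;
    family->n_ops = n_ops;
    return 0;
}

/*
 * Callers pass only the array; the count is computed at the call site,
 * so it cannot drift out of sync with the array definition.
 */
#define demo_register_with_ops(family, ops) \
    demo_register_with_ops_n((family), (ops), ARRAY_SIZE(ops))

/* Example operations table with three entries. */
static const struct demo_op demo_ops[] = {
    { .cmd = 1 },
    { .cmd = 2 },
    { .cmd = 3 },
};
```

    With this shape, adding a fourth entry to demo_ops changes the
    registered count automatically, which is the safety gain the
    message refers to. The macro only works when it is handed a real
    array; a pointer would make ARRAY_SIZE() compute the wrong value.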

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Pull irq cleanups from Ingo Molnar:
    "This is a multi-arch cleanup series from Thomas Gleixner, which we
    kept to near the end of the merge window, to not interfere with
    architecture updates.

    This series (motivated by the -rt kernel) unifies more aspects of IRQ
    handling and generalizes PREEMPT_ACTIVE"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    preempt: Make PREEMPT_ACTIVE generic
    sparc: Use preempt_schedule_irq
    ia64: Use preempt_schedule_irq
    m32r: Use preempt_schedule_irq
    hardirq: Make hardirq bits generic
    m68k: Simplify low level interrupt handling code
    genirq: Prevent spurious detection for unconditionally polled interrupts

    Linus Torvalds
     
  • Fix a trivial typo in rq_attach_root().

    Signed-off-by: Shigeru Yoshida
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131117.121236.1990617639803941055.shigeru.yoshida@gmail.com
    Signed-off-by: Ingo Molnar

    Shigeru Yoshida
     
  • Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched
    domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
    conditions leading to a possible NULL deref in set_cpu_sd_state_idle().

    Reported-by: Anton Blanchard
    Cc: Preeti U Murthy
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • After commit 863bffc80898 ("sched/fair: Fix group power_orig
    computation"), we can dereference rq->sd before it is set.

    Fix this by falling back to power_of() in this case and add a comment
    explaining things.
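
    The fallback described above amounts to a guard before the
    dereference. A minimal sketch, assuming hypothetical types and
    names (struct demo_rq, struct demo_sched_domain, demo_power_of(),
    demo_group_power()); it is not the scheduler's actual code:

```c
#include <stddef.h>

/* Hypothetical stand-in for a scheduling domain attached to a runqueue. */
struct demo_sched_domain {
    unsigned long power_orig;
};

/* Hypothetical runqueue: sd stays NULL until the domains are attached. */
struct demo_rq {
    struct demo_sched_domain *sd;
    unsigned long cpu_power;
};

/* Per-CPU value that is always available, even before sd is set. */
static unsigned long demo_power_of(const struct demo_rq *rq)
{
    return rq->cpu_power;
}

/* Use the domain's value when it exists, otherwise fall back. */
static unsigned long demo_group_power(const struct demo_rq *rq)
{
    if (!rq->sd)
        return demo_power_of(rq);

    return rq->sd->power_orig;
}
```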

    Signed-off-by: Srikar Dronamraju
    [ Added comment and tweaked patch. ]
    Signed-off-by: Peter Zijlstra
    Cc: mikey@neuling.org
    Link: http://lkml.kernel.org/r/20131113151718.GN21461@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

19 Nov, 2013

4 commits

  • The 64-bit attr.config value for perf trace events was being copied into
    an "int" before doing a comparison, meaning the top 32 bits were
    being truncated.

    As far as I can tell this didn't cause any errors, but it did mean
    it was possible to create valid aliases for all the tracepoint ids
    which I don't think was intended. (For example, 0xffffffff00000018
    and 0x18 both enable the same tracepoint).
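
    The aliasing can be reproduced in a few lines of userspace C. This
    is a sketch, not the kernel code: demo_match_truncated() and
    demo_match_u64() are hypothetical names, and uint32_t is used for
    the narrowing (the message says "int") so the conversion is fully
    defined in portable C.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy shape: the 64-bit config is narrowed to 32 bits before the
 * comparison, so the upper half is ignored.
 */
static bool demo_match_truncated(uint64_t config, uint64_t event_id)
{
    uint32_t narrowed = (uint32_t)config;

    return narrowed == event_id;
}

/* Fixed shape: compare the full 64-bit value. */
static bool demo_match_u64(uint64_t config, uint64_t event_id)
{
    return config == event_id;
}
```

    With the truncating version, 0xffffffff00000018 matches event id
    0x18; with the 64-bit comparison it does not.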

    Signed-off-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1311151236100.11932@vincent-weaver-1.um.maine.edu
    Signed-off-by: Ingo Molnar

    Vince Weaver
     
  • Currently we only allocate a single cpu hashtable for per-cpu
    swevents; do away with this optimization for it is fragile in the face
    of things like perf_pmu_migrate_context().

    The easiest thing is to make sure all CPUs are consistent wrt state.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130913111447.GN31370@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vince's perf-trinity fuzzer found yet another 'interesting' problem.

    When we sample the irq_work_exit tracepoint with period==1 (or
    PERF_SAMPLE_PERIOD) and we add an fasync SIGNAL handler we create an
    infinite event generation loop:

    ,->
    | irq_work_exit() ->
    | trace_irq_work_exit() ->
    | ...
    | __perf_event_overflow() -> (due to fasync)
    | irq_work_queue() -> (irq_work_list must be empty)
    '--------- arch_irq_work_raise()

    Similar things can happen due to regular poll() wakeups if we exceed
    the ring-buffer wakeup watermark, or have an event_limit.

    To avoid this, disallow sampling this particular tracepoint.

    In order to achieve this, create a special perf_perm function pointer
    for each event and call this (when set) on trying to create a
    tracepoint perf event.
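
    The optional permission hook can be sketched as follows. This is a
    userspace illustration under stated assumptions: struct
    demo_perf_event, struct demo_event_call, demo_deny_sampling() and
    demo_perf_trace_event_perm() are hypothetical names, and returning
    -EPERM for a sampling event is this example's choice.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical event: only records whether it is a sampling event. */
struct demo_perf_event {
    int is_sampling;
};

/* Hypothetical tracepoint descriptor with an optional permission hook. */
struct demo_event_call {
    const char *name;
    int (*perf_perm)(struct demo_event_call *call,
                     struct demo_perf_event *event);
};

/* Hook for a tracepoint that must not be sampled. */
static int demo_deny_sampling(struct demo_event_call *call,
                              struct demo_perf_event *event)
{
    (void)call;
    return event->is_sampling ? -EPERM : 0;
}

/* Called when a perf event is created for a tracepoint. */
static int demo_perf_trace_event_perm(struct demo_event_call *call,
                                      struct demo_perf_event *event)
{
    /* The hook is consulted only when it is set. */
    if (call->perf_perm)
        return call->perf_perm(call, event);

    return 0;
}
```

    Tracepoints without a hook behave as before; only the one that
    installs demo_deny_sampling() refuses sampling events.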

    [ roasted: use expr... to allow for ',' in your expression ]

    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Dave Jones
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20131114152304.GC5364@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • * pm-sleep:
    PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()

    Rafael J. Wysocki
     

17 Nov, 2013

1 commit

  • Pull tracing update from Steven Rostedt:
    "This batch of changes is mostly clean ups and small bug fixes. The
    only real feature that was added this release is from Namhyung Kim,
    who introduced "set_graph_notrace" filter that lets you run the
    function graph tracer and not trace particular functions and their
    call chain.

    Tom Zanussi added some updates to the ftrace multibuffer tracing that
    made it more consistent with the top level tracing.

    One of the fixes for perf function tracing required an API change in
    RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
    that change in this release too, he gave me a branch that included all
    the changes to get that working, and I pulled that into my tree in
    order to complete the perf function tracing fix"

    * tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Add rcu annotation for syscall trace descriptors
    tracing: Do not use signed enums with unsigned long long in fgragh output
    tracing: Remove unused function ftrace_off_permanent()
    tracing: Do not assign filp->private_data to freed memory
    tracing: Add helper function tracing_is_disabled()
    tracing: Open tracer when ftrace_dump_on_oops is used
    tracing: Add support for SOFT_DISABLE to syscall events
    tracing: Make register/unregister_ftrace_command __init
    tracing: Update event filters for multibuffer
    recordmcount.pl: Add support for __fentry__
    ftrace: Have control op function callback only trace when RCU is watching
    rcu: Do not trace rcu_is_watching() functions
    ftrace/x86: skip over the breakpoint for ftrace caller
    trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
    ftrace: Add set_graph_notrace filter
    ftrace: Narrow down the protected area of graph_lock
    ftrace: Introduce struct ftrace_graph_data
    ftrace: Get rid of ftrace_graph_filter_enabled
    tracing: Fix potential out-of-bounds in trace_get_user()
    tracing: Show more exact help information about snapshot

    Linus Torvalds
     

16 Nov, 2013

2 commits

  • Rename simple_delete_dentry() to always_delete_dentry() and export it.
    Export simple_dentry_operations, while we are at it, and get rid of
    their duplicates

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

4 commits

  • Pull KVM changes from Paolo Bonzini:
    "Here are the 3.13 KVM changes. There was a lot of work on the PPC
    side: that the HV and emulation flavors can now coexist in a single
    kernel is probably the most interesting change from a user point of view.

    On the x86 side there are nested virtualization improvements and a few
    bugfixes.

    ARM got transparent huge page support, improved overcommit, and
    support for big endian guests.

    Finally, there is a new interface to connect KVM with VFIO. This
    helps with devices that use NoSnoop PCI transactions, letting the
    driver in the guest execute WBINVD instructions. This includes some
    nVidia cards on Windows, that fail to start without these patches and
    the corresponding userspace changes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
    kvm, vmx: Fix lazy FPU on nested guest
    arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
    arm/arm64: KVM: MMIO support for BE guest
    kvm, cpuid: Fix sparse warning
    kvm: Delete prototype for non-existent function kvm_check_iopl
    kvm: Delete prototype for non-existent function complete_pio
    hung_task: add method to reset detector
    pvclock: detect watchdog reset at pvclock read
    kvm: optimize out smp_mb after srcu_read_unlock
    srcu: API for barrier after srcu read unlock
    KVM: remove vm mmap method
    KVM: IOMMU: hva align mapping page size
    KVM: x86: trace cpuid emulation when called from emulator
    KVM: emulator: cleanup decode_register_operand() a bit
    KVM: emulator: check rex prefix inside decode_register()
    KVM: x86: fix emulation of "movzbl %bpl, %eax"
    kvm_host: typo fix
    KVM: x86: emulate SAHF instruction
    MAINTAINERS: add tree for kvm.git
    Documentation/kvm: add a 00-INDEX file
    ...

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "Mainly boring here, too. rmmod --wait finally removed, though"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modpost: fix bogus 'exported twice' warnings.
    init: fix in-place parameter modification regression
    asmlinkage, module: Make ksymtab and kcrctab symbols and __this_module __visible
    kernel: add support for init_array constructors
    modpost: Optionally ignore secondary errors seen if a single module build fails
    module: remove rmmod --wait option.

    Linus Torvalds
     
  • Signed-off-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig