10 Oct, 2014

1 commit

  • commit 62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f upstream.

    commit 03b8c7b623c80af264c4c8d6111e5c6289933666 ("futex: Allow
    architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
    HAVE_FUTEX_CMPXCHG symbol right below FUTEX. This placed it right in
    the middle of the options for the EXPERT menu. However,
    HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
    placing items in the EXPERT menu, and displays the remaining several
    EXPERT items (starting with EPOLL) directly in the General Setup menu.

    Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
    HAVE_FUTEX_CMPXCHG itself depend on FUTEX. With this change, the
    subsequent items display as part of the EXPERT menu again; the EMBEDDED
    menu now appears as the next top-level item in the General Setup menu,
    which makes General Setup much shorter and more usable.

    Signed-off-by: Josh Triplett
    Acked-by: Randy Dunlap
    Signed-off-by: Greg Kroah-Hartman

    Josh Triplett
     

08 Aug, 2014

2 commits

  • commit 197725de65477bc8509b41388157c1a2283542bb upstream.

    Make espfix64 a hidden Kconfig option. This fixes the x86-64 UML
    build which had broken due to the non-existence of init_espfix_bsp()
    in UML: since UML uses its own Kconfig, this option does not appear in
    the UML build.

    This also makes it possible to make support for 16-bit segments a
    configuration option, for the people who want to minimize the size of
    the kernel.

    Reported-by: Ingo Molnar
    Signed-off-by: H. Peter Anvin
    Cc: Richard Weinberger
    Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     
  • commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.

    The IRET instruction, when returning to a 16-bit segment, only
    restores the bottom 16 bits of the user space stack pointer. This
    causes some 16-bit software to break, but it also leaks kernel state
    to user space. We have a software workaround for that ("espfix") for
    the 32-bit kernel, but it relies on a nonzero stack segment base which
    is not available in 64-bit mode.

    In checkin:

    b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

    we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
    the logic that 16-bit support is crippled on 64-bit kernels anyway (no
    V86 support), but it turns out that people are doing stuff like
    running old Win16 binaries under Wine and expect it to work.

    This works around this by creating percpu "ministacks", each of which
    is mapped 2^16 times 64K apart. When we detect that the return SS is
    on the LDT, we copy the IRET frame to the ministack and use the
    relevant alias to return to userspace. The ministacks are mapped
    readonly, so if IRET faults we promote #GP to #DF which is an IST
    vector and thus has its own stack; we then do the fixup in the #DF
    handler.

    (Making #GP an IST exception would make the msr_safe functions unsafe
    in NMI/MC context, and quite possibly have other effects.)

    Special thanks to:

    - Andy Lutomirski, for the suggestion of using very small stack slots
    and copy (as opposed to map) the IRET frame there, and for the
    suggestion to mark them readonly and let the fault promote to #DF.
    - Konrad Wilk for paravirt fixup and testing.
    - Borislav Petkov for testing help and useful comments.

    Reported-by: Brian Gerst
    Signed-off-by: H. Peter Anvin
    Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
    Cc: Konrad Rzeszutek Wilk
    Cc: Borislav Petkov
    Cc: Andrew Lutomriski
    Cc: Linus Torvalds
    Cc: Dirk Hohndel
    Cc: Arjan van de Ven
    Cc: comex
    Cc: Alexander van Heukelum
    Cc: Boris Ostrovsky
    Cc: # consider after upstream merge
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     

01 Jun, 2014

1 commit

  • commit 82c04ff89eba09d0e46e3f3649c6d3aa18e764a0 upstream.

    The SYSTEM_TRUSTED_KEYRING config option is not in any menu, causing it
    to show up in the toplevel of the kernel configuration. Fix this by
    moving it under the General Setup menu.

    Signed-off-by: Peter Foley
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Foley
     

14 Apr, 2014

1 commit

  • commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream.

    If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
    is no runtime check necessary, allow to skip the test within futex_init().

    This allows to get rid of some code which would always give the same result,
    and also allows the compiler to optimize a couple of if statements away.

    Signed-off-by: Heiko Carstens
    Cc: Finn Thain
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

13 Mar, 2014

1 commit

  • Commit 73f7d1ca3263 (ACPI / init: Run acpi_early_init() before
    timekeeping_init()) optimistically moved the early ACPI initialization
    before timekeeping_init(), but that didn't work, because it broke fast
    TSC calibration for Julian Wollrath on Thinkpad x121e (and most likely
    for others too). The reason is that acpi_early_init() enables the SCI
    and that interferes with the fast TSC calibration mechanism.

    Thus follow the original idea to execute acpi_early_init() before
    efi_enter_virtual_mode() to help the EFI people for now and we can
    revisit the other problem that commit 73f7d1ca3263 attempted to
    address in the future (if really necessary).

    Fixes: 73f7d1ca3263 (ACPI / init: Run acpi_early_init() before timekeeping_init())
    Reported-by: Julian Wollrath
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

06 Feb, 2014

1 commit

  • This changes 'do_execve()' to get the executable name as a 'struct
    filename', and to free it when it is done. This is what the normal
    users want, and it simplifies and streamlines their error handling.

    The controlled lifetime of the executable name also fixes a
    use-after-free problem with the trace_sched_process_exec tracepoint: the
    lifetime of the passed-in string for kernel users was not at all
    obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
    the pathname allocation lifetime with the execve() having finished,
    which in turn meant that the trace point that happened after
    mm_release() of the old process VM ended up using already free'd memory.

    To solve the kernel string lifetime issue, this simply introduces
    "getname_kernel()" that works like the normal user-space getname()
    function, except with the source coming from kernel memory.

    As Oleg points out, this also means that we could drop the tcomm[] array
    from 'struct linux_binprm', since the pathname lifetime now covers
    setup_new_exec(). That would be a separate cleanup.

    Reported-by: Igor Zhbanov
    Tested-by: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2014

1 commit


29 Jan, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
    assorted cleanups and fixes all over the place...

    There will be another pile later this week"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
    __dentry_path() fixes
    vfs: Remove second variable named error in __dentry_path
    vfs: Is mounted should be testing mnt_ns for NULL or error.
    Fix race when checking i_size on direct i/o read
    hfsplus: remove can_set_xattr
    nfsd: use get_acl and ->set_acl
    fs: remove generic_acl
    nfs: use generic posix ACL infrastructure for v3 Posix ACLs
    gfs2: use generic posix ACL infrastructure
    jfs: use generic posix ACL infrastructure
    xfs: use generic posix ACL infrastructure
    reiserfs: use generic posix ACL infrastructure
    ocfs2: use generic posix ACL infrastructure
    jffs2: use generic posix ACL infrastructure
    hfsplus: use generic posix ACL infrastructure
    f2fs: use generic posix ACL infrastructure
    ext2/3/4: use generic posix ACL infrastructure
    btrfs: use generic posix ACL infrastructure
    fs: make posix_acl_create more useful
    fs: make posix_acl_chmod more useful
    ...

    Linus Torvalds
     

28 Jan, 2014

1 commit


26 Jan, 2014

1 commit

  • Pull user namespaces work from Eric Biederman:
    "The work to convert the kernel to use kuid_t and kgid_t has been
    finished since 3.12 so it is time to remove the scaffolding that
    allowed the work to progress incrementally.

    The first patch on this branch just removes the scaffolding, ensuring
    we will always get compile errors if people accidentally try the
    userspace and the kernel uid and gid types. The second patch an
    overlooked and unused chunk of mips code that that fails to build
    after the first patch.

    The code hasn't been in linux-next for long (as I was out of it and
    could not sheppared the cold properly) but the patch has been around
    for a long time just waiting for the day when I had finished the
    uid/gid conversions. Putting the code in linux-next did find the
    compile failure on mips so I took the time to get that fix reviewed
    and included. Beyond that I am not too worried about errors because
    all these two patches do is delete a modest amount of code"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    MIPS: VPE: Remove vpe_getuid and vpe_getgid
    userns: userns: Remove UIDGID_STRICT_TYPE_CHECKS

    Linus Torvalds
     

25 Jan, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull ACPI and power management updates from Rafael Wysocki:
    "As far as the number of commits goes, the top spot belongs to ACPI
    this time with cpufreq in the second position and a handful of PM
    core, PNP and cpuidle updates. They are fixes and cleanups mostly, as
    usual, with a couple of new features in the mix.

    The most visible change is probably that we will create struct
    acpi_device objects (visible in sysfs) for all devices represented in
    the ACPI tables regardless of their status and there will be a new
    sysfs attribute under those objects allowing user space to check that
    status via _STA.

    Consequently, ACPI device eject or generally hot-removal will not
    delete those objects, unless the table containing the corresponding
    namespace nodes is unloaded, which is extremely rare. Also ACPI
    container hotplug will be handled quite a bit differently and cpufreq
    will support CPU boost ("turbo") generically and not only in the
    acpi-cpufreq driver.

    Specifics:

    - ACPI core changes to make it create a struct acpi_device object for
    every device represented in the ACPI tables during all namespace
    scans regardless of the current status of that device. In
    accordance with this, ACPI hotplug operations will not delete those
    objects, unless the underlying ACPI tables go away.

    - On top of the above, new sysfs attribute for ACPI device objects
    allowing user space to check device status by triggering the
    execution of _STA for its ACPI object. From Srinivas Pandruvada.

    - ACPI core hotplug changes reducing code duplication, integrating
    the PCI root hotplug with the core and reworking container hotplug.

    - ACPI core simplifications making it use ACPI_COMPANION() in the
    code "glueing" ACPI device objects to "physical" devices.

    - ACPICA update to upstream version 20131218. This adds support for
    the DBG2 and PCCT tables to ACPICA, fixes some bugs and improves
    debug facilities. From Bob Moore, Lv Zheng and Betty Dall.

    - Init code change to carry out the early ACPI initialization
    earlier. That should allow us to use ACPI during the timekeeping
    initialization and possibly to simplify the EFI initialization too.
    From Chun-Yi Lee.

    - Clenups of the inclusions of ACPI headers in many places all over
    from Lv Zheng and Rashika Kheria (work in progress).

    - New helper for ACPI _DSM execution and rework of the code in
    drivers that uses _DSM to execute it via the new helper. From
    Jiang Liu.

    - New Win8 OSI blacklist entries from Takashi Iwai.

    - Assorted ACPI fixes and cleanups from Al Stone, Emil Goode, Hanjun
    Guo, Lan Tianyu, Masanari Iida, Oliver Neukum, Prarit Bhargava,
    Rashika Kheria, Tang Chen, Zhang Rui.

    - intel_pstate driver updates, including proper Baytrail support,
    from Dirk Brandewie and intel_pstate documentation from Ramkumar
    Ramachandra.

    - Generic CPU boost ("turbo") support for cpufreq from Lukasz
    Majewski.

    - powernow-k6 cpufreq driver fixes from Mikulas Patocka.

    - cpufreq core fixes and cleanups from Viresh Kumar, Jane Li, Mark
    Brown.

    - Assorted cpufreq drivers fixes and cleanups from Anson Huang, John
    Tobias, Paul Bolle, Paul Walmsley, Sachin Kamat, Shawn Guo, Viresh
    Kumar.

    - cpuidle cleanups from Bartlomiej Zolnierkiewicz.

    - Support for hibernation APM events from Bin Shi.

    - Hibernation fix to avoid bringing up nonboot CPUs with ACPI EC
    disabled during thaw transitions from Bjørn Mork.

    - PM core fixes and cleanups from Ben Dooks, Leonardo Potenza, Ulf
    Hansson.

    - PNP subsystem fixes and cleanups from Dmitry Torokhov, Levente
    Kurusa, Rashika Kheria.

    - New tool for profiling system suspend from Todd E Brandt and a
    cpupower tool cleanup from One Thousand Gnomes"

    * tag 'pm+acpi-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (153 commits)
    thermal: exynos: boost: Automatic enable/disable of BOOST feature (at Exynos4412)
    cpufreq: exynos4x12: Change L0 driver data to CPUFREQ_BOOST_FREQ
    Documentation: cpufreq / boost: Update BOOST documentation
    cpufreq: exynos: Extend Exynos cpufreq driver to support boost
    cpufreq / boost: Kconfig: Support for software-managed BOOST
    acpi-cpufreq: Adjust the code to use the common boost attribute
    cpufreq: Add boost frequency support in core
    intel_pstate: Add trace point to report internal state.
    cpufreq: introduce cpufreq_generic_get() routine
    ARM: SA1100: Create dummy clk_get_rate() to avoid build failures
    cpufreq: stats: create sysfs entries when cpufreq_stats is a module
    cpufreq: stats: free table and remove sysfs entry in a single routine
    cpufreq: stats: remove hotplug notifiers
    cpufreq: stats: handle cpufreq_unregister_driver() and suspend/resume properly
    cpufreq: speedstep: remove unused speedstep_get_state
    platform: introduce OF style 'modalias' support for platform bus
    PM / tools: new tool for suspend/resume performance optimization
    ACPI: fix module autoloading for ACPI enumerated devices
    ACPI: add module autoloading support for ACPI enumerated devices
    ACPI: fix create_modalias() return value handling
    ...

    Linus Torvalds
     

24 Jan, 2014

2 commits


22 Jan, 2014

4 commits

  • Merge first patch-bomb from Andrew Morton:

    - a couple of misc things

    - inotify/fsnotify work from Jan

    - ocfs2 updates (partial)

    - about half of MM

    * emailed patches from Andrew Morton : (117 commits)
    mm/migrate: remove unused function, fail_migrate_page()
    mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
    mm/migrate: correct failure handling if !hugepage_migration_support()
    mm/migrate: add comment about permanent failure path
    mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
    mm: compaction: reset scanner positions immediately when they meet
    mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
    mm: compaction: detect when scanners meet in isolate_freepages
    mm: compaction: reset cached scanner pfn's before reading them
    mm: compaction: encapsulate defer reset logic
    mm: compaction: trace compaction begin and end
    memcg, oom: lock mem_cgroup_print_oom_info
    sched: add tracepoints related to NUMA task migration
    mm: numa: do not automatically migrate KSM pages
    mm: numa: trace tasks that fail migration due to rate limiting
    mm: numa: limit scope of lock for NUMA migrate rate limiting
    mm: numa: make NUMA-migrate related functions static
    lib/show_mem.c: show num_poisoned_pages when oom
    mm/hwpoison: add '#' to hwpoison_inject
    mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "The bulk of changes are cleanups and preparations for the upcoming
    kernfs conversion.

    - cgroup_event mechanism which is and will be used only by memcg is
    moved to memcg.

    - pidlist handling is updated so that it can be served by seq_file.

    Also, the list is not sorted if sane_behavior. cgroup
    documentation explicitly states that the file is not sorted but it
    has been for quite some time.

    - All cgroup file handling now happens on top of seq_file. This is
    to prepare for kernfs conversion. In addition, all operations are
    restructured so that they map 1-1 to kernfs operations.

    - Other cleanups and low-pri fixes"

    * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
    cgroup: trivial style updates
    cgroup: remove stray references to css_id
    doc: cgroups: Fix typo in doc/cgroups
    cgroup: fix fail path in cgroup_load_subsys()
    cgroup: fix missing unlock on error in cgroup_load_subsys()
    cgroup: remove for_each_root_subsys()
    cgroup: implement for_each_css()
    cgroup: factor out cgroup_subsys_state creation into create_css()
    cgroup: combine css handling loops in cgroup_create()
    cgroup: reorder operations in cgroup_create()
    cgroup: make for_each_subsys() useable under cgroup_root_mutex
    cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
    cgroup: unify pidlist and other file handling
    cgroup: replace cftype->read_seq_string() with cftype->seq_show()
    cgroup: attach cgroup_open_file to all cgroup files
    cgroup: generalize cgroup_pidlist_open_file
    cgroup: unify read path so that seq_file is always used
    cgroup: unify cgroup_write_X64() and cgroup_write_string()
    cgroup: remove cftype->read(), ->read_map() and ->write()
    hugetlb_cgroup: convert away from cftype->read()
    ...

    Linus Torvalds
     
  • Switch to memblock interfaces for early memory allocator instead of
    bootmem allocator. No functional change in beahvior than what it is in
    current code from bootmem users points of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers build on top of memblock. And
    the archs which still uses bootmem, these new apis just fall back to
    exiting bootmem APIs.

    Signed-off-by: Santosh Shilimkar
    Cc: Yinghai Lu
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: Grygorii Strashko
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tony Lindgren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
    so we loose 24 on each. An average system can easily allocate few tens
    thousands of page->ptl and overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    To make sure that it really works this time, some numbers from my test
    machine (just booted, no load):

    Before:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92
    After:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9
    kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0

    Note that the patch is useful not only for debug case, but also for
    PREEMPT_RT, where spinlock_t is always bloated.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2014

1 commit

  • This is a variant patch from Rafael J. Wysocki's
    ACPI / init: Run acpi_early_init() before efi_enter_virtual_mode()

    According to Matt Fleming, if acpi_early_init() was executed before
    efi_enter_virtual_mode(), the EFI initialization could benefit from
    it, so Rafael's patch makes that happen.

    And, we want accessing ACPI TAD device to set system clock, so move
    acpi_early_init() before timekeeping_init(). This final position is
    also before efi_enter_virtual_mode().

    Tested-by: Toshi Kani
    Signed-off-by: Lee, Chun-Yi
    Signed-off-by: Rafael J. Wysocki

    Lee, Chun-Yi
     

12 Jan, 2014

1 commit


11 Dec, 2013

1 commit

  • Introduce mul_u64_u32_shr() as proposed by Andy a while back; it
    allows using 64x64->128 muls on 64bit archs and recent GCC
    which defines __SIZEOF_INT128__ and __int128.

    (This new method will be used by the scheduler.)

    Signed-off-by: Peter Zijlstra
    Cc: fweisbec@gmail.com
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-hxjoeuzmrcaumR0uZwjpe2pv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Dec, 2013

1 commit


27 Nov, 2013

1 commit

  • Removing UIDGID_STRICT_TYPE_CHECKS simplifies the code and always
    generates a compile error if the uids and kuids or gids and kgids are
    mixed by accident. Now that the appropriate conversions have been
    placed throughout the kernel there is no longer a need for a mode where
    we don't detect them as compile errors.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

23 Nov, 2013

2 commits

  • Merge v3.12 based patch series to move cgroup_event implementation to
    memcg into for-3.14. The following two commits cause a conflict in
    kernel/cgroup.c

    2ff2a7d03bbe4 ("cgroup: kill css_id")
    79bd9814e5ec9 ("cgroup, memcg: move cgroup_event implementation to memcg")

    Each patch removes a struct definition from kernel/cgroup.c. As the
    two are adjacent, they cause a context conflict. Easily resolved by
    removing both structs.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_event is way over-designed and tries to build a generic
    flexible event mechanism into cgroup - fully customizable event
    specification for each user of the interface. This is utterly
    unnecessary and overboard especially in the light of the planned
    unified hierarchy as there's gonna be single agent. Simply generating
    events at fixed points, or if that's too restrictive, configureable
    cadence or single set of configureable points should be enough.

    Thankfully, memcg is the only user and gets to keep it. Replacing it
    with something simpler on sane_behavior is strongly recommended.

    This patch moves cgroup_event and "cgroup.event_control"
    implementation to mm/memcontrol.c. Clearing of events on cgroup
    destruction is moved from cgroup_destroy_locked() to
    mem_cgroup_css_offline(), which shouldn't make any noticeable
    difference.

    cgroup_css() and __file_cft() are exported to enable the move;
    however, this will soon be reverted once the event code is updated to
    be memcg specific.

    Note that "cgroup.event_control" will now exist only on the hierarchy
    with memcg attached to it. While this change is visible to userland,
    it is unlikely to be noticeable as the file has never been meaningful
    outside memcg.

    Aside from the above change, this is pure code relocation.

    v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
    poll.h inclusion moved from cgroup.c to memcontrol.c.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     

22 Nov, 2013

2 commits

  • Pull security subsystem updates from James Morris:
    "In this patchset, we finally get an SELinux update, with Paul Moore
    taking over as maintainer of that code.

    Also a significant update for the Keys subsystem, as well as
    maintenance updates to Smack, IMA, TPM, and Apparmor"

    and since I wanted to know more about the updates to key handling,
    here's the explanation from David Howells on that:

    "Okay. There are a number of separate bits. I'll go over the big bits
    and the odd important other bit, most of the smaller bits are just
    fixes and cleanups. If you want the small bits accounting for, I can
    do that too.

    (1) Keyring capacity expansion.

    KEYS: Consolidate the concept of an 'index key' for key access
    KEYS: Introduce a search context structure
    KEYS: Search for auth-key by name rather than target key ID
    Add a generic associative array implementation.
    KEYS: Expand the capacity of a keyring

    Several of the patches are providing an expansion of the capacity of a
    keyring. Currently, the maximum size of a keyring payload is one page.
    Subtract a small header and then divide up into pointers, that only gives
    you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses
    a keyring to store ID mapping data, that has proven to be insufficient to
    the cause.

    Whatever data structure I use to handle the keyring payload, it can only
    store pointers to keys, not the keys themselves because several keyrings
    may point to a single key. This precludes inserting, say, and rb_node
    struct into the key struct for this purpose.

    I could make an rbtree of records such that each record has an rb_node
    and a key pointer, but that would use four words of space per key stored
    in the keyring. It would, however, be able to use much existing code.

    I selected instead a non-rebalancing radix-tree type approach as that
    could have a better space-used/key-pointer ratio. I could have used the
    radix tree implementation that we already have and insert keys into it by
    their serial numbers, but that means any sort of search must iterate over
    the whole radix tree. Further, its nodes are a bit on the capacious side
    for what I want - especially given that key serial numbers are randomly
    allocated, thus leaving a lot of empty space in the tree.

    So what I have is an associative array that internally is a radix-tree
    with 16 pointers per node where the index key is constructed from the key
    type pointer and the key description. This means that an exact lookup by
    type+description is very fast as this tells us how to navigate directly to
    the target key.

    I made the data structure general in lib/assoc_array.c as far as it is
    concerned, its index key is just a sequence of bits that leads to a
    pointer. It's possible that someone else will be able to make use of it
    also. FS-Cache might, for example.

    (2) Mark keys as 'trusted' and keyrings as 'trusted only'.

    KEYS: verify a certificate is signed by a 'trusted' key
    KEYS: Make the system 'trusted' keyring viewable by userspace
    KEYS: Add a 'trusted' flag and a 'trusted only' flag
    KEYS: Separate the kernel signature checking keyring from module signing

    These patches allow keys carrying asymmetric public keys to be marked as
    being 'trusted' and allow keyrings to be marked as only permitting the
    addition or linkage of trusted keys.

    Keys loaded from hardware during kernel boot or compiled into the kernel
    during build are marked as being trusted automatically. New keys can be
    loaded at runtime with add_key(). They are checked against the system
    keyring contents and if their signatures can be validated with keys that
    are already marked trusted, then they are marked trusted also and can
    thus be added into the master keyring.

    Patches from Mimi Zohar make this usable with the IMA keyrings also.

    (3) Remove the date checks on the key used to validate a module signature.

    X.509: Remove certificate date checks

    It's not reasonable to reject a signature just because the key that it was
    generated with is no longer valid datewise - especially if the kernel
    hasn't yet managed to set the system clock when the first module is
    loaded - so just remove those checks.

    (4) Make it simpler to deal with additional X.509 being loaded into the kernel.

    KEYS: Load *.x509 files into kernel keyring
    KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate

    The builder of the kernel now just places files with the extension ".x509"
    into the kernel source or build trees and they're concatenated by the
    kernel build and stuffed into the appropriate section.

    (5) Add support for userspace kerberos to use keyrings.

    KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
    KEYS: Implement a big key type that can save to tmpfs

    Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
    We looked at storing it in keyrings instead as that confers certain
    advantages such as tickets being automatically deleted after a certain
    amount of time and the ability for the kernel to get at these tokens more
    easily.

    To make this work, two things were needed:

    (a) A way for the tickets to persist beyond the lifetime of all a user's
    sessions so that cron-driven processes can still use them.

    The problem is that a user's session keyrings are deleted when the
    session that spawned them logs out and the user's user keyring is
    deleted when the UID is deleted (typically when the last log out
    happens), so neither of these places is suitable.

    I've added a system keyring into which a 'persistent' keyring is
    created for each UID on request. Each time a user requests their
    persistent keyring, the expiry time on it is set anew. If the user
    doesn't ask for it for, say, three days, the keyring is automatically
    expired and garbage collected using the existing gc. All the kerberos
    tokens it held are then also gc'd.

    (b) A key type that can hold really big tickets (up to 1MB in size).

    The problem is that Active Directory can return huge tickets with lots
    of auxiliary data attached. We don't, however, want to eat up huge
    tracts of unswappable kernel space for this, so if the ticket is
    greater than a certain size, we create a swappable shmem file and dump
    the contents in there and just live with the fact we then have an
    inode and a dentry overhead. If the ticket is smaller than that, we
    slap it in a kmalloc()'d buffer"

    * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
    KEYS: Fix keyring content gc scanner
    KEYS: Fix error handling in big_key instantiation
    KEYS: Fix UID check in keyctl_get_persistent()
    KEYS: The RSA public key algorithm needs to select MPILIB
    ima: define '_ima' as a builtin 'trusted' keyring
    ima: extend the measurement list to include the file signature
    kernel/system_certificate.S: use real contents instead of macro GLOBAL()
    KEYS: fix error return code in big_key_instantiate()
    KEYS: Fix keyring quota misaccounting on key replacement and unlink
    KEYS: Fix a race between negating a key and reading the error set
    KEYS: Make BIG_KEYS boolean
    apparmor: remove the "task" arg from may_change_ptraced_domain()
    apparmor: remove parent task info from audit logging
    apparmor: remove tsk field from the apparmor_audit_struct
    apparmor: fix capability to not use the current task, during reporting
    Smack: Ptrace access check mode
    ima: provide hash algo info in the xattr
    ima: enable support for larger default filedata hash algorithms
    ima: define kernel parameter 'ima_template=' to change configured default
    ima: add Kconfig default measurement list template
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Nov, 2013

1 commit

  • This reverts commit 69f0554ec261fd686ac7fa1c598cc9eb27b83a80.

    This patch breaks randconfig on at least the x86-64 architecture, and
    most likely on others. There is work underway to support uncompressed
    kernels in a generic way, but it looks like it will amount to
    rewriting the support from scratch; see the LKML thread in the Link:
    for info.

    Therefore, revert this change and wait for the fix.

    Reported-by: Pavel Roskin
    Cc: Christian Ruppert
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20131113113418.167b8ffd@IRBT4585
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

2 commits

  • Pull module updates from Rusty Russell:
    "Mainly boring here, too. rmmod --wait finally removed, though"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modpost: fix bogus 'exported twice' warnings.
    init: fix in-place parameter modification regression
    asmlinkage, module: Make ksymtab and kcrctab symbols and __this_module __visible
    kernel: add support for init_array constructors
    modpost: Optionally ignore secondary errors seen if a single module build fails
    module: remove rmmod --wait option.

    Linus Torvalds
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
    so we loose 24 on each. An average system can easily allocate few tens
    thousands of page->ptl and overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

7 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various datastructures as
    fundamental operations. For example sets are supports, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfitler, amongst other things.

    In particular the taus88 generater is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. With the main goals being able
    to use RCU lookups on even request sockets, and eliminating the
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in it's fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     
  • Make menuconfig allows one to choose compression format of an initial
    ramdisk image. But this choice does not result in duly compressed ramdisk
    image. Because - $ make install - does not pass on the selected
    compression choice to the dracut(8) tool, which creates the initramfs
    file. dracut(8) generates the image with the default compression, ie.
    gzip(1).

    This patch exports the selected compression option to a sub-shell
    environment, so that it could be used by dracut(8) tool to generate
    appropriately compressed initramfs images.

    There isn't a straightforward way to pass on options to dracut(8) via
    positional parameters. Because it is indirectly invoked at the end of a $
    make install sequence.

    # make install
    -> arch/$arch/boot/Makefile
    -> arch/$arch/boot/install.sh
    -> /sbing/installkernel ...
    -> /sbin/new-kernel-pkg ...
    -> /sbin/dracut ...

    Signed-off-by: P J P
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    P J P
     
  • Some ARC users say they can boot faster with without kernel compression.
    This probably depends on things like the FLASH chip they use etc.

    Until now, kernel compression can only be disabled by removing "select
    HAVE_" lines from the architecture Kconfig. So add the
    Kconfig logic to permit disabling of kernel compression.

    Signed-off-by: Christian Ruppert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Ruppert
     
  • This patch proposes to make init failures more explicit.

    Before this, the "No init found" message didn't help much. It could
    sometimes be misleading and actually mean "No *working* init found".

    This message could hide many different issues:
    - no init program candidates found at all
    - some init program candidates exist but can't be executed (missing
    execute permissions, failed to load shared libraries, executable
    compiled for an unknown architecture...)

    This patch notifies the kernel user when a candidate init program is found
    but can't be executed. In each failure situation, the error code is
    displayed, to quickly find the root cause. "No init found" is also
    replaced by "No working init found", which is more correct.

    This will help embedded Linux developers (especially the newcomers),
    regularly making and debugging new root filesystems.

    Credits to Geert Uytterhoeven and Janne Karhunen for their improvement
    suggestions.

    Signed-off-by: Michael Opdenacker
    Cc: Geert Uytterhoeven
    Tested-by: Janne Karhunen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Opdenacker
     
  • Make menuconfig allows one to choose compression format of an initial
    ramdisk image. But this choice does not result in duly compressed initial
    ramdisk image. Because - $ make install - does not pass on the selected
    compression choice to the dracut(8) tool, which creates the initramfs
    file. dracut(8) generates the image with the default compression, ie.
    gzip(1).

    If a user chose any other compression instead of gzip(1), it leads to a
    crash due to NULL pointer dereference in crd_load(), caused by a NULL
    function pointer returned by the 'decompress_method()' routine. Because
    the initramfs image is gzip(1) compressed, whereas the kernel knows only
    to decompress the chosen format and not gzip(1).

    This patch replaces the crash by an explicit panic() call with an
    appropriate error message. This shall prevent the kernel from
    eventually panicking in: init/do_mounts.c: mount_block_root() with
    -> panic("VFS: Unable to mount root fs on %s", b);

    [akpm@linux-foundation.org: mention that the problem is with the ramdisk, don't print known-to-be-NULL value]
    Signed-off-by: P J P
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    P J P
     
  • It's already available in

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The name_to_dev_t function has a comment block which lists the supported
    syntaxes for the device name. Add a bullet for the :
    syntax, which is already supported in the code

    Signed-off-by: Sebastian Capella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Capella