07 Aug, 2016

1 commit

  • The introduction of pre-allocated hash elements inadvertently broke
    the behavior of bpf hash maps where users expected to call
    bpf_map_update_elem() without considering that the map can be full.
    Some programs do:
    old_value = bpf_map_lookup_elem(map, key);
    if (old_value) {
    ... prepare new_value on stack ...
    bpf_map_update_elem(map, key, new_value);
    }
    Before pre-alloc the update() for existing element would work even
    in 'map full' condition. Restore this behavior.

    The above program could have updated old_value in place instead of
    update() which would be faster and most programs use that approach,
    but sometimes the values are large and the programs use update()
    helper to do atomic replacement of the element.
    Note we cannot simply update element's value in-place like percpu
    hash map does and have to allocate extra num_possible_cpu elements
    and use this extra reserve when the map is full.

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

04 Aug, 2016

2 commits

  • Using per-register incrementing ID can lead to
    find_good_pkt_pointers() confusing registers which
    have completely different values. Consider example:

    0: (bf) r6 = r1
    1: (61) r8 = *(u32 *)(r6 +76)
    2: (61) r0 = *(u32 *)(r6 +80)
    3: (bf) r7 = r8
    4: (07) r8 += 32
    5: (2d) if r8 > r0 goto pc+9
    R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
    6: (bf) r8 = r7
    7: (bf) r9 = r7
    8: (71) r1 = *(u8 *)(r7 +0)
    9: (0f) r8 += r1
    10: (71) r1 = *(u8 *)(r7 +1)
    11: (0f) r9 += r1
    12: (07) r8 += 32
    13: (2d) if r8 > r0 goto pc+1
    R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
    14: (71) r1 = *(u8 *)(r9 +16)
    15: (b7) r7 = 0
    16: (bf) r0 = r7
    17: (95) exit

    We need to get a UNKNOWN_VALUE with imm to force id
    generation so lines 0-5 make r7 a valid packet pointer.
    We then read two different bytes from the packet and
    add them to copies of the constructed packet pointer.
    r8 (line 9) and r9 (line 11) will get the same id of 1,
    independently. When either of them is validated (line
    13) - find_good_pkt_pointers() will also mark the other
    as safe. This leads to access on line 14 being mistakenly
    considered safe.

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Pull tracing fixes from Steven Rostedt:
    "A few updates and fixes:

    - move the suppressing of the __builtin_return_address >0 warning to
    the tracing directory only.

    - metag recordmcount fix for newer glibc's

    - two tracing histogram fixes that were reported by KASAN"

    * tag 'trace-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix use-after-free in hist_register_trigger()
    tracing: Fix use-after-free in hist_unreg_all/hist_enable_unreg_all
    Makefile: Mute warning for __builtin_return_address(>0) for tracing only
    ftrace/recordmcount: Work around for addition of metag magic but not relocations

    Linus Torvalds
     

03 Aug, 2016

20 commits

  • Copy the config fragments from the AOSP common kernel android-4.4
    branch. It is becoming possible to run mainline kernels with Android,
    but the kernel defconfigs don't work as-is and debugging missing config
    options is a pain. Adding the config fragments into the kernel tree,
    makes configuring a mainline kernel as simple as:

    make ARCH=arm multi_v7_defconfig android-base.config android-recommended.config

    The following non-upstream config options were removed:

    CONFIG_NETFILTER_XT_MATCH_QTAGUID
    CONFIG_NETFILTER_XT_MATCH_QUOTA2
    CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
    CONFIG_PPPOLAC
    CONFIG_PPPOPNS
    CONFIG_SECURITY_PERF_EVENTS_RESTRICT
    CONFIG_USB_CONFIGFS_F_MTP
    CONFIG_USB_CONFIGFS_F_PTP
    CONFIG_USB_CONFIGFS_F_ACC
    CONFIG_USB_CONFIGFS_F_AUDIO_SRC
    CONFIG_USB_CONFIGFS_UEVENT
    CONFIG_INPUT_KEYCHORD
    CONFIG_INPUT_KEYRESET

    Link: http://lkml.kernel.org/r/1466708235-28593-1-git-send-email-robh@kernel.org
    Signed-off-by: Rob Herring
    Cc: Amit Pundir
    Cc: John Stultz
    Cc: Dmitry Shmidt
    Cc: Rom Lemarchand
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring
     
  • Commit 20d8b67c06fa ("relay: add buffer-only channels; useful for early
    logging") added support to use channels with no associated files.

    This is useful when the exact location of relay file is not known or the
    the parent directory of relay file is not available, while creating the
    channel and the logging has to start right from the boot.

    But there was no provision to use global mode with buffer-only channels,
    which is added by this patch, without modifying the interface where
    initially there will be a dummy invocation of create_buf_file callback
    through which kernel client can convey the need of a global buffer.

    For the use case where drivers/kernel clients want a simple interface
    for the userspace, which enables them to capture data/logs from relay
    file inorder & without any post processing, support of Global buffer
    mode is warranted.

    Modules, like i915, using relay_open() in early init would have to later
    register their buffer-only relays, once debugfs is available, by calling
    relay_late_setup_files(). Hence relay_late_setup_files() symbol also
    needs to be exported.

    Link: http://lkml.kernel.org/r/1468404563-11653-1-git-send-email-akash.goel@intel.com
    Signed-off-by: Akash Goel
    Cc: Eduard - Gabriel Munteanu
    Cc: Tom Zanussi
    Cc: Chris Wilson
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akash Goel
     
  • I hit the following issue when run trinity in my system. The kernel is
    3.4 version, but mainline has the same issue.

    The root cause is that the segment size is too large so the kerenl
    spends too long trying to allocate a page. Other cases will block until
    the test case quits. Also, OOM conditions will occur.

    Call Trace:
    __alloc_pages_nodemask+0x14c/0x8f0
    alloc_pages_current+0xaf/0x120
    kimage_alloc_pages+0x10/0x60
    kimage_alloc_control_pages+0x5d/0x270
    machine_kexec_prepare+0xe5/0x6c0
    ? kimage_free_page_list+0x52/0x70
    sys_kexec_load+0x141/0x600
    ? vfs_write+0x100/0x180
    system_call_fastpath+0x16/0x1b

    The patch changes sanity_check_segment_list() to verify that the usage by
    all segments does not exceed half of memory.

    [akpm@linux-foundation.org: fix for kexec-return-error-number-directly.patch, update comment]
    Link: http://lkml.kernel.org/r/1469625474-53904-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • Provide a wrapper function to be used by kernel code to check whether a
    crash kernel is loaded. It returns the same value that can be seen in
    /sys/kernel/kexec_crash_loaded by userspace programs.

    I'm exporting the function, because it will be used by Xen, and it is
    possible to compile Xen modules separately to enable the use of PV
    drivers with unmodified bare-metal kernels.

    Link: http://lkml.kernel.org/r/20160713121955.14969.69080.stgit@hananiah.suse.cz
    Signed-off-by: Petr Tesarik
    Cc: Juergen Gross
    Cc: Josh Triplett
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Eric Biederman
    Cc: "H. Peter Anvin"
    Cc: Boris Ostrovsky
    Cc: "Paul E. McKenney"
    Cc: Dave Young
    Cc: David Vrabel
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Tesarik
     
  • crash_kexec_post_notifiers ia a boot option which controls whether the
    1st kernel calls panic notifiers or not before booting the 2nd kernel.
    However, there is no need to limit it to being modifiable only at boot
    time. So, use core_param instead of early_param.

    Link: http://lkml.kernel.org/r/20160705113327.5864.43139.stgit@softrs
    Signed-off-by: Hidehiro Kawai
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Cc: Masami Hiramatsu
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • kexec physical addresses are the boot-time view of the system. For
    certain ARM systems (such as Keystone 2), the boot view of the system
    does not match the kernel's view of the system: the boot view uses a
    special alias in the lower 4GB of the physical address space.

    To cater for these kinds of setups, we need to translate between the
    boot view physical addresses and the normal kernel view physical
    addresses. This patch extracts the current transation points into
    linux/kexec.h, and allows an architecture to override the functions.

    Due to the translations required, we unfortunately end up with six
    translation functions, which are reduced down to four that the
    architecture can override.

    [akpm@linux-foundation.org: kexec.h needs asm/io.h for phys_to_virt()]
    Link: http://lkml.kernel.org/r/E1b8koP-0004HZ-Vf@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Cc: Keerthy
    Cc: Pratyush Anand
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • On PAE systems (eg, ARM LPAE) the vmcore note may be located above 4GB
    physical on 32-bit architectures, so we need a wider type than "unsigned
    long" here. Arrange for paddr_vmcoreinfo_note() to return a
    phys_addr_t, thereby allowing it to be located above 4GB.

    This makes no difference for kexec-tools, as they already assume a
    64-bit type when reading from this file.

    Link: http://lkml.kernel.org/r/E1b8koK-0004HS-K9@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Reviewed-by: Pratyush Anand
    Acked-by: Baoquan He
    Cc: Keerthy
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • Ensure that user memory sizes do not wrap around when validating the
    user input, which can lead to the following input validation working
    incorrectly.

    [akpm@linux-foundation.org: fix it for kexec-return-error-number-directly.patch]
    Link: http://lkml.kernel.org/r/E1b8koF-0004HM-5x@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Reviewed-by: Pratyush Anand
    Acked-by: Baoquan He
    Cc: Keerthy
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • This is a cleanup patch to make kexec more clear to return error number
    directly. The variable result is useless, because there is no other
    function's return value assignes to it. So remove it.

    Link: http://lkml.kernel.org/r/1464179273-57668-1-git-send-email-mnghuan@gmail.com
    Signed-off-by: Minfei Huang
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Xunlei Pang
    Cc: Atsushi Kumagai
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • Many targets enable CONFIG_DEBUG_STACK_USAGE, and while the information
    is useful, it isn't worthy of pr_warn(). Reduce it to pr_info().

    Link: http://lkml.kernel.org/r/1466982072-29836-1-git-send-email-anton@ozlabs.org
    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace.
    * on - unlimited logging from userspace
    * off - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting of "on" to "ratelimit" and we do
    that because we want to keep userspace spamming /dev/kmsg to sane
    levels. This is especially moot when a small kernel log buffer wraps
    around and messages get lost. So the ratelimiting setting should be a
    sane setting where kernel messages should have a bit higher chance of
    survival from all the spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • asm-generic headers are generic implementations for architecture
    specific code and should not be included by common code. Thus use the
    asm/ version of sections.h to get at the linker sections.

    Link: http://lkml.kernel.org/r/1468285008-7331-1-git-send-email-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Messages' levels and console log level are inspected when the actual
    printing occurs, which may provoke console_unlock() and
    console_cont_flush() to waste CPU cycles on every message that has
    loglevel above the current console_loglevel.

    Schematically, console_unlock() does the following:

    console_unlock()
    {
    ...
    for (;;) {
    ...
    raw_spin_lock_irqsave(&logbuf_lock, flags);
    skip:
    msg = log_from_idx(console_idx);

    if (msg->flags & LOG_NOCONS) {
    ...
    goto skip;
    }

    level = msg->level;
    len += msg_print_text(); >> sprintfs
    memcpy,
    etc.

    if (nr_ext_console_drivers) {
    ext_len = msg_print_ext_header(); >> scnprintf
    ext_len += msg_print_ext_body(); >> scnprintfs
    etc.
    }
    ...
    raw_spin_unlock(&logbuf_lock);

    call_console_drivers(level, ext_text, ext_len, text, len)
    {
    if (level >= console_loglevel && >> drop the message
    !ignore_loglevel)
    return;

    console->write(...);
    }

    local_irq_restore(flags);
    }
    ...
    }

    The thing here is this deferred `level >= console_loglevel' check. We
    are wasting CPU cycles on sprintfs/memcpy/etc. preparing the messages
    that we will eventually drop.

    This can be huge when we register a new CON_PRINTBUFFER console, for
    instance. For every such a console register_console() resets the

    console_seq, console_idx, console_prev

    and sets a `exclusive console' pointer to replay the log buffer to that
    just-registered console. And there can be a lot of messages to replay,
    in the worst case most of which can be dropped after console_loglevel
    test.

    We know messages' levels long before we call msg_print_text() and
    friends, so we can just move console_loglevel check out of
    call_console_drivers() and format a new message only if we are sure that
    it won't be dropped.

    The patch factors out loglevel check into suppress_message_printing()
    function and tests message->level and console_loglevel before formatting
    functions in console_unlock() and console_cont_flush() are getting
    executed. This improves things not only for exclusive CON_PRINTBUFFER
    consoles, but for every console_unlock() that attempts to print a
    message of level above the console_loglevel.

    Link: http://lkml.kernel.org/r/20160627135012.8229-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Petr Mladek
    Cc: Tejun Heo
    Cc: Jan Kara
    Cc: Calvin Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Using functions instead of macros can reduce overall code size by
    eliminating unnecessary "KERN_SOH" prefixes from format strings.

    defconfig x86-64:

    $ size vmlinux*
    text data bss dec hex filename
    10193570 4331464 1105920 15630954 ee826a vmlinux.new
    10192623 4335560 1105920 15634103 ee8eb7 vmlinux.old

    As the return value are unimportant and unused in the kernel tree, these
    new functions return void.

    Miscellanea:

    - change pr_ macros to call new __pr_ functions
    - change vprintk_nmi and vprintk_default to add LOGLEVEL_ argument

    [akpm@linux-foundation.org: fix LOGLEVEL_INFO, per Joe]
    Link: http://lkml.kernel.org/r/e16cc34479dfefcae37c98b481e6646f0f69efc3.1466718827.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • A trivial cosmetic change: interrupt.h header is redundant since commit
    6b898c07cb1d ("console: use might_sleep in console_lock").

    Link: http://lkml.kernel.org/r/20160620132847.21930-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • kernel.h header doesn't directly use dynamic debug, instead we can
    include it in module.c (which used it via kernel.h). printk.h only uses
    it if CONFIG_DYNAMIC_DEBUG is on, changing the inclusion to only happen
    in that case.

    Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
    [luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
    Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
    Signed-off-by: Luis de Bethencourt
    Cc: Rusty Russell
    Cc: Hidehiro Kawai
    Cc: Borislav Petkov
    Cc: Michal Nazarewicz
    Cc: Rasmus Villemoes
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis de Bethencourt
     
  • Change task_work_cancel() to use lockless_dereference(), this is what
    the code really wants but we didn't have this helper when it was
    written.

    Also add the fast-path task->task_works == NULL check, in the likely
    case this task has no pending works and we can avoid
    spin_lock(task->pi_lock).

    While at it, change other users of ACCESS_ONCE() to use READ_ONCE().

    Link: http://lkml.kernel.org/r/20160610150042.GA13868@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Andrea Parri
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This fixes a use-after-free case flagged by KASAN; make sure the test
    happens before the potential free in this case.

    Link: http://lkml.kernel.org/r/48fd74ab61bebd7dca9714386bb47d7c5ccd6a7b.1467247517.git.tom.zanussi@linux.intel.com

    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Tom Zanussi
     
  • While running tools/testing/selftests test suite with KASAN, Dmitry
    Vyukov hit the following use-after-free report:

    ==================================================================
    BUG: KASAN: use-after-free in hist_unreg_all+0x1a1/0x1d0 at addr
    ffff880031632cc0
    Read of size 8 by task ftracetest/7413
    ==================================================================
    BUG kmalloc-128 (Not tainted): kasan: bad access detected
    ------------------------------------------------------------------

    This fixes the problem, along with the same problem in
    hist_enable_unreg_all().

    Link: http://lkml.kernel.org/r/c3d05b79e42555b6e36a3a99aae0e37315ee5304.1467247517.git.tom.zanussi@linux.intel.com

    Cc: Dmitry Vyukov
    [Copied Steve's hist_enable_unreg_all() fix to hist_unreg_all()]
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • With the latest gcc compilers, they give a warning if
    __builtin_return_address() parameter is greater than 0. That is because if
    it is used by a function called by a top level function (or in the case of
    the kernel, by assembly), it can try to access stack frames outside the
    stack and crash the system.

    The tracing system uses __builtin_return_address() of up to 2! But it is
    well aware of the dangers that it may have, and has even added precautions
    to protect against it (see the thunk code in arch/x86/entry/thunk*.S)

    Linus originally added KBUILD_CFLAGS that would suppress the warning for the
    entire kernel, as simply adding KBUILD_CFLAGS to the tracing directory
    wouldn't work. The tracing directory plays a bit with the CFLAGS and
    requires a little more logic.

    This adds that special logic to only suppress the warning for the tracing
    directory. If it is used anywhere else outside of tracing, the warning will
    still be triggered.

    Link: http://lkml.kernel.org/r/20160728223043.51996267@grimm.local.home

    Tested-by: Linus Torvalds
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

31 Jul, 2016

1 commit

  • Pull misc fixes from Thomas Gleixner:
    "This update contains:

    - a fix for stomp-machine so the nmi_watchdog wont trigger on the cpu
    waiting for the others to execute the callback

    - various fixes and updates to objtool including an resync of the
    instruction decoder to match the kernel's decoder"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Un-capitalize "Warning" for out-of-sync instruction decoder
    objtool: Resync x86 instruction decoder with the kernel's
    objtool: Support new GCC 6 switch jump table pattern
    stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE
    objtool: Add 'fixdep' to objtool/.gitignore

    Linus Torvalds
     

30 Jul, 2016

5 commits

  • Pull audit updates from Paul Moore:
    "Six audit patches for 4.8.

    There are a couple of style and minor whitespace tweaks for the logs,
    as well as a minor fixup to catch errors on user filter rules, however
    the major improvements are a fix to the s390 syscall argument masking
    code (reviewed by the nice s390 folks), some consolidation around the
    exclude filtering (less code, always a win), and a double-fetch fix
    for recording the execve arguments"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix a double fetch in audit_log_single_execve_arg()
    audit: fix whitespace in CWD record
    audit: add fields to exclude filter by reusing user filter
    s390: ensure that syscall arguments are properly masked on s390
    audit: fix some horrible switch statement style crimes
    audit: fixup: log on errors from filter user rules

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - TPM core and driver updates/fixes
    - IPv6 security labeling (CALIPSO)
    - Lots of Apparmor fixes
    - Seccomp: remove 2-phase API, close hole where ptrace can change
    syscall #"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
    apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
    tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
    tpm: Factor out common startup code
    tpm: use devm_add_action_or_reset
    tpm2_i2c_nuvoton: add irq validity check
    tpm: read burstcount from TPM_STS in one 32-bit transaction
    tpm: fix byte-order for the value read by tpm2_get_tpm_pt
    tpm_tis_core: convert max timeouts from msec to jiffies
    apparmor: fix arg_size computation for when setprocattr is null terminated
    apparmor: fix oops, validate buffer size in apparmor_setprocattr()
    apparmor: do not expose kernel stack
    apparmor: fix module parameters can be changed after policy is locked
    apparmor: fix oops in profile_unpack() when policy_db is not present
    apparmor: don't check for vmalloc_addr if kvzalloc() failed
    apparmor: add missing id bounds check on dfa verification
    apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
    apparmor: use list_next_entry instead of list_entry_next
    apparmor: fix refcount race when finding a child profile
    apparmor: fix ref count leak when profile sha1 hash is read
    apparmor: check that xindex is in trans_table bounds
    ...

    Linus Torvalds
     
  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level what this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    to kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner casees this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permirted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it make sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     
  • Pull more cgroup updates from Tejun Heo:
    "I forgot to include the patches which got applied to for-4.7-fixes
    late during last cycle.

    Eric's three patches fix bugs introduced with the namespace support"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
    cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
    cgroupns: Fix the locking in copy_cgroup_ns

    Linus Torvalds
     
  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 hundred line of unpenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

29 Jul, 2016

11 commits

  • Pull tracing updates from Steven Rostedt:
    "This is mostly clean ups and small fixes. Some of the more visible
    changes are:

    - The function pid code uses the event pid filtering logic
    - [ku]probe events have access to current->comm
    - trace_printk now has sample code
    - PCI devices now trace physical addresses
    - stack tracing has less unnessary functions traced"

    * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk, tracing: Avoiding unneeded blank lines
    tracing: Use __get_str() when manipulating strings
    tracing, RAS: Cleanup on __get_str() usage
    tracing: Use outer () on __get_str() definition
    ftrace: Reduce size of function graph entries
    tracing: Have HIST_TRIGGERS select TRACING
    tracing: Using for_each_set_bit() to simplify trace_pid_write()
    ftrace: Move toplevel init out of ftrace_init_tracefs()
    tracing/function_graph: Fix filters for function_graph threshold
    tracing: Skip more functions when doing stack tracing of events
    tracing: Expose CPU physical addresses (resource values) for PCI devices
    tracing: Show the preempt count of when the event was called
    tracing: Add trace_printk sample code
    tracing: Choose static tp_printk buffer by explicit nesting count
    tracing: expose current->comm to [ku]probe events
    ftrace: Have set_ftrace_pid use the bitmap like events do
    tracing: Move pid_list write processing into its own function
    tracing: Move the pid_list seq_file functions to be global
    tracing: Move filtered_pid helper functions into trace.c
    tracing: Make the pid filtering helper functions global

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that when written and fenced assures
    that all previous posted writes targeting a given dimm have been
    flushed to media.

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     
  • We currently show:

    task: ti: task.ti: "

    "ti" and "task.ti" are redundant, and neither is actually what we want
    to show, which the the base of the thread stack. Change the display to
    show the stack pointer explicitly.

    Link: http://lkml.kernel.org/r/543ac5bd66ff94000a57a02e11af7239571a3055.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and Move it into account_kernel_stack().

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
    ifdef guards to just ZONE_DEVICE.

    Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Cc: Eric Sandeen
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace is the same from a configuration perspective and will will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
    it is checking the flag on the current task rather than the given one.

    This doesn't make much sense and it is actually wrong. If the current
    task which updates the nodemask of a cpuset got killed by the OOM killer
    then a part of the cpuset cgroup processes would have incompatible
    nodemask which is surprising to say the least.

    The comment suggests the intention was to skip oom victim or an exiting
    task so we should be checking the given task. But even then it would be
    layering violation because it is the memory allocator to interpret the
    TIF_MEMDIE meaning. Simply drop both checks. All tasks in the cpuset
    should simply follow the same mask.

    Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
    It is, however, checking the flag on the current task rather than the
    given one. This is really confusing because freezing() can be called
    also on !current tasks. It would end up working correctly for its main
    purpose because __refrigerator will be always called on the current task
    so the oom victim will never get frozen. But it could lead to
    surprising results when a task which is freezing a cgroup got oom killed
    because only part of the cgroup would get frozen. This is highly
    unlikely but worth fixing as the resulting code would be more clear
    anyway.

    Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • On the tear-down path, the dead CPU callback for the timers was
    misplaced within the 'cpuhp_state' enumeration. There is a hidden
    dependency between the timers and block multiqueue. The timers
    callback must happen before the block multiqueue callback otherwise a
    RCU stall occurs.

    Move the timers callback to the proper place in the state machine.

    Reported-and-tested-by: Jon Hunter
    Reported-by: kbuild test robot
    Fixes: 24f73b99716a ("timers/core: Convert to hotplug state machine")
    Signed-off-by: Richard Cochran
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Rasmus Villemoes
    Cc: John Stultz
    Cc: rt@linutronix.de
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1469610498-25914-1-git-send-email-rcochran@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Richard Cochran