28 May, 2016

2 commits

  • Pull vfs fixes from Al Viro:
    "Followups to the parallel lookup work:

    - update docs

    - restore killability of the places that used to take ->i_mutex
    killably now that we have down_write_killable() merged

    - Additionally, it turns out that I missed a prerequisite for
    security_d_instantiate() stuff - ->getxattr() wasn't the only thing
    that could be called before dentry is attached to inode; with smack
    we needed the same treatment applied to ->setxattr() as well"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->setxattr() to passing dentry and inode separately
    switch xattr_handler->set() to passing dentry and inode separately
    restore killability of old mutex_lock_killable(&inode->i_mutex) users
    add down_write_killable_nested()
    update D/f/directory-locking

    Linus Torvalds
     
  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.
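
    The effect of the argument type can be seen in a small user-space
    model. This is only a sketch: is_err_value64() is a hypothetical
    stand-in that uses uint64_t to mimic a 64-bit 'unsigned long', and the
    check_*() wrappers are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Hypothetical model of the check with a 64-bit 'unsigned long' argument. */
int is_err_value64(uint64_t x)
{
	return x >= (uint64_t)-MAX_ERRNO;
}

/* 'int' is sign-extended to 64 bits, so small negative errnos are detected. */
int check_int(int v)
{
	return is_err_value64((uint64_t)v);
}

/* 'unsigned int' is zero-extended, so the same bit pattern is missed. */
int check_uint(unsigned int v)
{
	return is_err_value64((uint64_t)v);
}

/* 'unsigned short' is zero-extended as well. */
int check_ushort(unsigned short v)
{
	return is_err_value64((uint64_t)v);
}

/* Self-check of the behaviour described in the commit message. */
void is_err_value_demo(void)
{
	assert(check_int(-5) == 1);
	assert(check_int(5) == 0);
	assert(check_uint((unsigned int)-5) == 0);
	assert(check_ushort((unsigned short)-5) == 0);
}
```

    In this model the same errno is recognised when it travels as an 'int'
    and silently lost when it travels as an unsigned 16-bit or 32-bit type.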

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.
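
    A user-space sketch of the value comparison in that test definition
    follows. It assumes a GNU C compiler (__typeof__), leaves out the
    pointer comparison that produces the compile-time warning, and uses
    the hypothetical names IS_ERR_VALUE_WIDE and wide_check_*().

```c
#include <assert.h>

#define MAX_ERRNO 4095

/*
 * Value comparison from the test definition quoted above. The threshold
 * is computed in the argument's own type before both sides are widened.
 */
#define IS_ERR_VALUE_WIDE(x) \
	((unsigned long long)(x) >= (unsigned long long)(__typeof__(x))-MAX_ERRNO)

int wide_check_int(int v)
{
	return IS_ERR_VALUE_WIDE(v);
}

int wide_check_uint(unsigned int v)
{
	return IS_ERR_VALUE_WIDE(v);
}

int wide_check_ushort(unsigned short v)
{
	return IS_ERR_VALUE_WIDE(v);
}

/* Self-check: 16-bit and wider types now agree on what an error is. */
void wide_check_demo(void)
{
	assert(wide_check_int(-5) == 1);
	assert(wide_check_int(-4096) == 0);
	assert(wide_check_uint((unsigned int)-5) == 1);
	assert(wide_check_uint(5u) == 0);
	assert(wide_check_ushort((unsigned short)-5) == 1);
	assert(wide_check_ushort(5) == 0);
}
```

    An 8-bit type would still misbehave here, because -MAX_ERRNO truncated
    to 8 bits is 1, which matches the "16-bit or wider" wording above.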

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

27 May, 2016

2 commits

  • Pull kbuild updates from Michal Marek:

    - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
    unexports symbols which are not used in the current config [Nicolas
    Pitre]

    - several kbuild rule cleanups [Masahiro Yamada]

    - warning option adjustments for gcov etc [Arnd Bergmann]

    - a few more small fixes

    * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
    kbuild: move -Wunused-const-variable to W=1 warning level
    kbuild: fix if_change and friends to consider argument order
    kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
    kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
    gcov: disable -Wmaybe-uninitialized warning
    gcov: disable tree-loop-im to reduce stack usage
    gcov: disable for COMPILE_TEST
    Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
    Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
    kbuild: forbid kernel directory to contain spaces and colons
    kbuild: adjust ksym_dep_filter for some cmd_* renames
    kbuild: Fix dependencies for final vmlinux link
    kbuild: better abstract vmlinux sequential prerequisites
    kbuild: fix call to adjust_autoksyms.sh when output directory specified
    kbuild: Get rid of KBUILD_STR
    kbuild: rename cmd_as_s_S to cmd_cpp_s_S
    kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
    kbuild: drop redundant "PHONY += FORCE"
    kbuild: delete unnecessary "@:"
    kbuild: mark help target as PHONY
    ...

    Linus Torvalds
     
  • mmput_async is currently used only from the oom_reaper which is defined
    only for CONFIG_MMU. We can save work_struct in mm_struct for
    !CONFIG_MMU.

    [akpm@linux-foundation.org: fix typo, per Minchan]
    Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
    Reported-by: Minchan Kim
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 May, 2016

5 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull scheduler fixes from Ingo Molnar:
    "Two fixes: one for a lost wakeup, the other to fix the compiler
    optimizing out preempt operations on ARM64 (and possibly other non-x86
    architectures)"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Fix remote wakeups
    sched/preempt: Fix preempt_count manipulations

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     
  • Pull more power management updates from Rafael Wysocki:
    "These are two stable-candidate fixes (PM core, cpuidle) and a bunch of
    cpufreq cleanups.

    Specifics:

    - Stable-candidate cpuidle fix to make it check the right variable
    when deciding whether or not to enable interrupts on the local CPU
    so as to avoid enabling interrupts too early in some cases if the
    system has both coupled and per-core idle states (Daniel Lezcano).

    - Stable-candidate PM core fix to make it handle failures at the
    "late suspend" stage of device suspend consistently for all devices
    regardless of whether or not async suspend/resume is enabled for
    them (Rafael Wysocki).

    - Cleanups in the cpufreq core, the schedutil governor and the
    intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar)"

    * tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / sleep: Handle failures in device_suspend_late() consistently
    cpufreq: schedutil: Improve prints messages with pr_fmt
    cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
    cpufreq: simplified goto out in cpufreq_register_driver()
    cpufreq: governor: CPUFREQ_GOV_STOP never fails
    cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
    intel_pstate: Simplify conditional in intel_pstate_set_policy()

    Linus Torvalds
     
  • * pm-cpufreq:
    cpufreq: schedutil: Improve prints messages with pr_fmt
    cpufreq: simplified goto out in cpufreq_register_driver()
    cpufreq: governor: CPUFREQ_GOV_STOP never fails
    cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
    intel_pstate: Simplify conditional in intel_pstate_set_policy()

    * pm-cpuidle:
    cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()

    * pm-core:
    PM / sleep: Handle failures in device_suspend_late() consistently

    Rafael J. Wysocki
     

25 May, 2016

2 commits

  • Commit:

    b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

    ... introduced a bug: Mike Galbraith found that it introduced a
    performance regression, while Paul E. McKenney reported lost
    wakeups and bisected it to this commit.

    The reason is that I mis-read ttwu_queue() such that I assumed any
    wakeup that got a remote queue must have had the task migrated.

    Since this is not so, we need to transfer this information between
    queueing the wakeup and actually doing the wakeup. Use a new
    task_struct::sched_flag for this; we already write to
    sched_contributes_to_load in the wakeup path, so this is a hot and
    modified cacheline.

    Reported-by: Paul E. McKenney
    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Hunter
    Cc: Andy Lutomirski
    Cc: Ben Segall
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Morten Rasmussen
    Cc: Oleg Nesterov
    Cc: Paul Turner
    Cc: Pavan Kondeti
    Cc: Peter Zijlstra
    Cc: Quentin Casasnovas
    Cc: Thomas Gleixner
    Cc: byungchul.park@lge.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and a maliciously corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

24 May, 2016

12 commits

  • xol_add_vma needs mmap_sem for write. If the waiting task gets killed
    by the oom killer it would block oom_reaper from asynchronous address
    space reclaim and reduce the chances of timely OOM resolving. Wait for
    the lock in the killable mode and return with EINTR if the task got
    killed while waiting.

    Do not warn in dup_xol_work if __create_xol_area failed because a fatal
    signal was pending, since such a warning is usually taken to indicate a
    kernel issue.

    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
    gets killed by the oom killer it would block oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolving. Wait for the lock in the killable mode and return with EINTR
    if the task got killed while waiting.
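
    The killable wait described above can be modelled in user space. This
    is a minimal, non-blocking sketch: struct model_sem, struct model_task
    and the model_*() functions are hypothetical stand-ins, and -EAGAIN
    marks the point where real code would keep sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_sem {
	bool write_locked;
};

struct model_task {
	bool fatal_signal_pending;
};

/*
 * Hypothetical model of a killable write lock: take the lock if it is
 * free, give up with -EINTR if the waiter has been killed, and report
 * -EAGAIN where real code would keep sleeping.
 */
int model_down_write_killable(struct model_sem *sem,
			      const struct model_task *tsk)
{
	if (!sem->write_locked) {
		sem->write_locked = true;
		return 0;
	}
	return tsk->fatal_signal_pending ? -EINTR : -EAGAIN;
}

void model_up_write(struct model_sem *sem)
{
	sem->write_locked = false;
}

/* Caller pattern: bail out early instead of waiting once killed. */
int model_set_flag(struct model_sem *sem, const struct model_task *tsk,
		   unsigned long *flags, unsigned long bit)
{
	int ret = model_down_write_killable(sem, tsk);

	if (ret)
		return ret;
	*flags |= bit;
	model_up_write(sem);
	return 0;
}

void model_killable_demo(void)
{
	struct model_sem sem = { .write_locked = false };
	struct model_task tsk = { .fatal_signal_pending = false };
	unsigned long flags = 0;

	/* Uncontended: the update happens and the lock is released. */
	assert(model_set_flag(&sem, &tsk, &flags, 1UL) == 0);
	assert(flags == 1UL && !sem.write_locked);

	/* Contended and killed: return -EINTR without touching flags. */
	sem.write_locked = true;
	tsk.fatal_signal_pending = true;
	assert(model_set_flag(&sem, &tsk, &flags, 2UL) == -EINTR);
	assert(flags == 1UL);
}
```

    The point of the pattern is that a killed waiter returns promptly
    rather than staying queued on the lock.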

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • dup_mmap needs to lock current's mm mmap_sem for write. If the waiting
    task gets killed by the oom killer it would block oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolving. Wait for the lock in the killable mode and return with EINTR
    if the task got killed while waiting.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • …unprotect)_crashkres()

    Commit 3f625002581b ("kexec: introduce a protection mechanism for the
    crashkernel reserved memory") introduced a mechanism for protecting the
    crash kernel reserved memory that is similar to the previous
    crash_map/unmap_reserved_pages() implementation; the new one is more
    generic in name and cleaner in code (besides, some architectures may
    not be allowed to unmap the pgtable).

    Therefore, this patch consolidates them, and uses the new
    arch_kexec_protect(unprotect)_crashkres() to replace the former
    crash_map/unmap_reserved_pages(), which by now was only used by
    S390.

    The consolidation work needs the crash memory to be mapped initially;
    this is done in machine_kdump_pm_init(), which runs after
    reserve_crashkernel(). Once the kdump kernel is loaded, the new
    arch_kexec_protect_crashkres() implemented for S390 will actually
    unmap the pgtable like before.

    Signed-off-by: Xunlei Pang <xlpang@redhat.com>
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Minfei Huang <mhuang@redhat.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Baoquan He <bhe@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Xunlei Pang
     
  • There is a lot of work done in the function kexec_load: not only
    allocating structs and loading the initram, but also some miscellaneous
    setup.

    To make this clearer, introduce a new function, do_kexec_load, which
    allocates the structs and loads the initram. The preparatory work stays
    in kexec_load.

    Signed-off-by: Minfei Huang
    Cc: Vivek Goyal
    Cc: "Eric W. Biederman"
    Cc: Xunlei Pang
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • On some architectures, kexec has to map the reserved pages before
    using them when we try to start the kdump service.

    kexec may return directly, without unmapping the reserved pages, if it
    fails while starting the service. To fix this, pair the map and unmap
    of the reserved pages in both the generic path and the error path.

    This patch only affects s390. Other architectures don't implement the
    crash_unmap_reserved_pages and crash_map_reserved_pages interfaces.

    This isn't an urgent patch. The kernel works without any risk even
    though the reserved pages are not unmapped before returning in the
    error path.

    Signed-off-by: Minfei Huang
    Cc: Vivek Goyal
    Cc: "Eric W. Biederman"
    Cc: Xunlei Pang
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • If some kernel (or module) code path stomps on the crash reserved
    memory (already mapped by the kernel) into which the second kernel's
    data has been loaded, the kdump kernel will probably fail to boot when
    a panic happens (or the panic may not even happen), leaving the culprit
    at large. This is unacceptable.

    The patch introduces a mechanism for detecting such cases:

    1) After each crash kexec load, it simply marks the reserved memory
    regions read-only, since we no longer access them after that. When
    someone stomps on the region, the first kernel will panic and trigger
    the kdump. The weak arch_kexec_protect_crashkres() is introduced to do
    the actual protection.

    2) To allow multiple loads, once 1) is done we also need to re-mark
    the reserved memory as read-write each time a system call related to
    kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced
    to do the actual unprotection.

    The architecture can make its specific implementation by overriding
    arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().
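
    The protect/unprotect cycle can be illustrated with a user-space
    analogue built on mmap() and mprotect(). This is only a sketch under
    that assumption: struct crash_region and the crash_region_*() helpers
    are hypothetical names, not the kernel implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct crash_region {
	void *base;
	size_t len;
};

/* Reserve 'pages' pages of anonymous memory, initially read-write. */
int crash_region_init(struct crash_region *r, size_t pages)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0 || pages == 0)
		return -1;
	r->len = pages * (size_t)page;
	r->base = mmap(NULL, r->len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (r->base == MAP_FAILED) {
		r->base = NULL;
		return -1;
	}
	return 0;
}

/* Step 1) above: make the region read-only once loading is finished. */
int crash_region_protect(const struct crash_region *r)
{
	return mprotect(r->base, r->len, PROT_READ);
}

/* Step 2) above: make the region writable again before a reload. */
int crash_region_unprotect(const struct crash_region *r)
{
	return mprotect(r->base, r->len, PROT_READ | PROT_WRITE);
}

/* Load an image: unprotect, copy, then protect again. */
int crash_region_load(const struct crash_region *r, const void *image,
		      size_t n)
{
	assert(r->base != NULL);
	if (n > r->len || crash_region_unprotect(r) != 0)
		return -1;
	memcpy(r->base, image, n);
	return crash_region_protect(r);
}

/* Two consecutive loads succeed; the region stays readable in between. */
int crash_region_demo(void)
{
	struct crash_region r;
	int ok;

	if (crash_region_init(&r, 1) != 0)
		return -1;
	ok = crash_region_load(&r, "first", 6) == 0 &&
	     crash_region_load(&r, "second", 7) == 0 &&
	     memcmp(r.base, "second", 7) == 0;
	munmap(r.base, r.len);
	return ok ? 0 : -1;
}
```

    In this analogue a stray write between loads would fault immediately,
    which mirrors the idea of catching the culprit at the time of the
    write rather than when the kdump kernel fails to boot.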

    Signed-off-by: Xunlei Pang
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Minfei Huang
    Cc: Vivek Goyal
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • Linux preallocates the task structs of the idle tasks for all possible
    CPUs. This currently means they all end up on node 0. This also
    implies that the cache line of MWAIT, which is around the flags field in
    the task struct, are all located in node 0.

    We see a noticeable performance improvement on Knights Landing CPUs when
    the cache lines used for MWAIT are located in the local nodes of the
    CPUs using them. I would expect this to give a (likely slight)
    improvement on other systems too.

    The patch implements placing the idle task in the node of its CPUs, by
    passing the right target node to copy_process()

    [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
    Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Andi Kleen
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Use pr_<level> instead of printk(KERN_<level> ).

    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • I see no reason why waitid() can't support other linux-specific flags
    allowed in sys_wait4().

    In particular this change can help if we reconsider the previous change
    ("wait/ptrace: assume __WALL if the child is traced") which adds the
    "automagical" __WALL for debugger.
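
    A user-space sketch of what the change allows: passing __WALL to
    waitid(). The helper name spawn_and_reap() is hypothetical, and the
    fallback assumes that a kernel without this change rejects the flag
    with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with 'code', reap it through waitid() with
 * __WALL, and return its exit status (or -1 on failure).
 */
int spawn_and_reap(int code)
{
	siginfo_t info;
	pid_t pid;
	int ret;

	assert(code >= 0 && code <= 255);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	memset(&info, 0, sizeof(info));
	do {
		ret = waitid(P_PID, (id_t)pid, &info, WEXITED | __WALL);
	} while (ret < 0 && errno == EINTR);

	/* Assumed fallback: retry without __WALL if the kernel rejects it. */
	if (ret < 0 && errno == EINVAL) {
		do {
			ret = waitid(P_PID, (id_t)pid, &info, WEXITED);
		} while (ret < 0 && errno == EINTR);
	}
	if (ret < 0 || info.si_code != CLD_EXITED)
		return -1;
	return info.si_status;
}
```

    A caller would use it as spawn_and_reap(7) and expect the child's exit
    status back on success.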

    Signed-off-by: Oleg Nesterov
    Cc: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The following program (a simplified version of one generated by syzkaller)

    #include <pthread.h>
    #include <unistd.h>
    #include <sys/ptrace.h>

    void *thread_func(void *arg)
    {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        return 0;
    }

    int main(void)
    {
        pthread_t thread;

        if (fork())
            return 0;

        /* wait until the parent has exited and we are reparented to init */
        while (getppid() != 1)
            ;

        pthread_create(&thread, NULL, thread_func, NULL);
        pthread_join(thread, NULL);
        return 0;
    }

    creates an unreapable zombie if /sbin/init doesn't use __WALL.

    This is not a kernel bug, at least in a sense that everything works as
    expected: debugger should reap a traced sub-thread before it can reap the
    leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.

    Unfortunately, it seems that /sbin/init in most (all?) distributions
    doesn't use it and we have to change the kernel to avoid the problem.
    Note also that most init's use sys_waitid() which doesn't allow __WALL, so
    the necessary user-space fix is not that trivial.

    This patch just adds the "ptrace" check into eligible_child(). To some
    degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
    mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
    tracer doesn't need to set this flag to wait for the stopped tracee.

    This obviously means the user-visible change: __WCLONE and __WALL no
    longer have any meaning for debugger. And I can only hope that this won't
    break something, but at least strace/gdb won't suffer.

    We could make a more conservative change. Say, we can take __WCLONE into
    account, or !thread_group_leader(). But it would be nice to not
    complicate these historical/confusing checks.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
    following linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
    binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'

    CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
    linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
    binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'

    This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
    elfcore but for these configurations elfcore will not be built.

    Fixed by making elfcore selectable by a separate config symbol which
    unlike the current mechanism can also be used from other directories
    than kernel/, then having each flavor of ELF that relies on elfcore.o,
    select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
    which fixes this issue.

    Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
    Signed-off-by: Ralf Baechle
    Reviewed-by: James Hogan
    Cc: "Maciej W. Rozycki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

23 May, 2016

2 commits

  • Pull more tracing updates from Steven Rostedt:
    "Three more changes.

    - I forgot that I had another selftest to stress test the ftrace
    instance creation. It was actually supposed to go into the 4.6
    merge window, but I never committed it. I almost forgot about it
    again, but noticed it was missing from your tree.

    - Soumya PN sent me a clean up patch to not disable interrupts when
    taking the tasklist_lock for read, as it's unnecessary because that
    lock is never taken for write in irq context.

    - Newer gcc's can cause the jump in the function_graph code to the
    global ftrace_stub label to be a short jump instead of a long one.
    As that jump is dynamically converted to jump to the trace code to
    do function graph tracing, and that conversion expects a long jump
    it can corrupt the ftrace_stub itself (it's directly after that
    call). One way to prevent gcc from using a short jump is to
    declare the ftrace_stub as a weak function, which we do here to
    keep gcc from optimizing too much"

    * tag 'trace-v4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to it
    ftrace: Don't disable irqs when taking the tasklist_lock read_lock
    ftracetest: Add instance created, delete, read and enable event test

    Linus Torvalds
     
  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly the special cases for the constant size, and it's actually almost
    never relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2016

15 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - KASAN updates

    - procfs updates

    - exit, fork updates

    - printk updates

    - lib/ updates

    - radix-tree testsuite updates

    - checkpatch updates

    - kprobes updates

    - a few other misc bits

    * emailed patches from Andrew Morton: (162 commits)
    samples/kprobes: print out the symbol name for the hooks
    samples/kprobes: add a new module parameter
    kprobes: add the "tls" argument for j_do_fork
    init/main.c: simplify initcall_blacklisted()
    fs/efs/super.c: fix return value
    checkpatch: improve --git shortcut
    checkpatch: reduce number of `git log` calls with --git
    checkpatch: add support to check already applied git commits
    checkpatch: add --list-types to show message types to show or ignore
    checkpatch: advertise the --fix and --fix-inplace options more
    checkpatch: whine about ACCESS_ONCE
    checkpatch: add test for keywords not starting on tabstops
    checkpatch: improve CONSTANT_COMPARISON test for structure members
    checkpatch: add PREFER_IS_ENABLED test
    lib/GCD.c: use binary GCD algorithm instead of Euclidean
    radix-tree: free up the bottom bit of exceptional entries for reuse
    dax: move RADIX_DAX_ definitions to dax.c
    radix-tree: make radix_tree_descend() more useful
    radix-tree: introduce radix_tree_replace_clear_tags()
    radix-tree: tidy up __radix_tree_create()
    ...

    Linus Torvalds
     
  • Pull networking fixes and more updates from David Miller:

    1) Tunneling fixes from Tom Herbert and Alexander Duyck.

    2) AF_UNIX updates some struct sock bit fields without the socket lock,
    whereas setsockopt() sets overlapping ones with locking. Separate
    out the synchronized vs. the AF_UNIX unsynchronized ones to avoid
    corruption. From Andrey Ryabinin.

    3) Mount BPF filesystem with mount_nodev rather than mount_ns, from
    Eric Biederman.

    4) A couple kmemdup conversions, from Muhammad Falak R Wani.

    5) BPF verifier fixes from Alexei Starovoitov.

    6) Don't let tunneled UDP packets get stuck in socket queues, if
    something goes wrong during the encapsulation just drop the packet
    rather than signalling an error up the call stack. From Hannes
    Frederic Sowa.

    7) SKB ref after free in batman-adv, from Florian Westphal.

    8) TCP iSCSI, ocfs2, rds, and tipc have to disable BH in their TCP
    callbacks since the TCP stack runs pre-emptibly now. From Eric
    Dumazet.

    9) Fix crash in fixed_phy_add, from Rabin Vincent.

    10) Fix length checks in xen-netback, from Paul Durrant.

    11) Fix mixup in KEY vs KEYID macsec attributes, from Sabrina Dubroca.

    12) RDS connection spamming bug fixes from Sowmini Varadhan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
    net: suppress warnings on dev_alloc_skb
    uapi glibc compat: fix compilation when !__USE_MISC in glibc
    udp: prevent skbs lingering in tunnel socket queues
    bpf: teach verifier to recognize imm += ptr pattern
    bpf: support decreasing order in direct packet access
    net: usb: ch9200: use kmemdup
    ps3_gelic: use kmemdup
    net:liquidio: use kmemdup
    bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
    net: cdc_ncm: update datagram size after changing mtu
    tuntap: correctly wake up process during uninit
    intel: Add support for IPv6 IP-in-IP offload
    ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled with FOU/GUE
    RDS: TCP: Avoid rds connection churn from rogue SYNs
    RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp
    net: sock: move ->sk_shutdown out of bitfields.
    ipv6: Don't reset inner headers in ip6_tnl_xmit
    ip4ip6: Support for GSO/GRO
    ip6ip6: Support for GSO/GRO
    ipv6: Set features for IPv6 tunnels
    ...

    Linus Torvalds
     
  • Commit e61452365372 ("radix_tree: add support for multi-order entries")
    left the impression that the support for multiorder radix tree entries
    was functional. As soon as Ross tried to use it, it became apparent
    that my testing was completely inadequate, and it didn't even work a
    little bit for orders that were not a multiple of shift.

    This series of patches is the result of about 6 weeks of redesign,
    reimplementation, testing, arguing and hair-pulling. The great news is
    that the test-suite is now far better than it was. That's reflected in
    the diffstat for the test-suite alone:

    12 files changed, 436 insertions(+), 28 deletions(-)

    The highlight for users of the tree is that the restriction on the order
    of inserted entries being >= RADIX_TREE_MAP_SHIFT is now gone; the radix
    tree now supports any order between 0 and 64.

    For those who are interested in how the tree works, patch 9 is probably
    the most interesting one as it introduces the new machinery for handling
    sibling entries.

    I've tried to be fair in attributing authorship to the person who
    contributed the majority of the code in each patch; Ross has been an
    invaluable partner in the development of this support and it's fair to
    say that each of us has code in every commit.

    I should also express my appreciation of the 0day testing. It prompted
    me that I was bloating the tinyconfig in an unacceptable way, and it
    bisected to a commit which contained a rather nasty memory-corruption
    bug.

    This patch (of 29):

    The irqdomain code was checking for 0 or 1 entries, not 0 entries as
    the comment said. Introduce a new helper that actually checks for an
    empty tree.
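
    A minimal userspace sketch of the distinction, not the kernel code. It
    assumes a tree in which a lone entry at index 0 is stored directly in
    the root with height 0, so a height test cannot tell "empty" from "one
    entry". The toy_* names and the struct layout are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a radix tree root. */
struct toy_root {
	unsigned int height;	/* 0 when the tree holds nothing or only index 0 */
	void *rnode;		/* NULL only when the tree holds no entries */
};

/* The kind of test that is true for 0 or 1 entries. */
static bool toy_height_is_zero(const struct toy_root *root)
{
	return root->height == 0;
}

/* What an emptiness helper has to check instead. */
static bool toy_tree_empty(const struct toy_root *root)
{
	return root->rnode == NULL;
}
```

    With a single item stored at index 0, toy_height_is_zero() still
    returns true while toy_tree_empty() returns false.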

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The UUID library provides the uuid_be type and the uuid_be_to_bin()
    function. Replace the open-coded variant with the generic library calls.
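
    For illustration only, the kind of open-coded conversion that such a
    library call replaces might look like the userspace sketch below. It
    is not the kernel implementation; toy_uuid_be_parse() and
    toy_hex_val() are hypothetical names, and the sketch assumes the
    canonical 36-character textual form with bytes stored in string order.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value; the caller has already validated it. */
static unsigned int toy_hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return (unsigned int)(c - '0');
	return (unsigned int)(tolower((unsigned char)c) - 'a') + 10;
}

/*
 * Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes, in the
 * order the digits appear in the string (big-endian layout).
 * Returns 0 on success and -1 on malformed input.
 */
static int toy_uuid_be_parse(const char *str, unsigned char out[16])
{
	size_t i, n = 0;

	if (strlen(str) != 36)
		return -1;

	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (str[i] != '-')
				return -1;
			continue;
		}
		if (!isxdigit((unsigned char)str[i]) ||
		    !isxdigit((unsigned char)str[i + 1]))
			return -1;
		out[n++] = (unsigned char)(toy_hex_val(str[i]) << 4 |
					   toy_hex_val(str[i + 1]));
		i++;	/* two characters were consumed */
	}
	return 0;
}
```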

    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matt Fleming
    Cc: Dmitry Kasatkin
    Cc: Mimi Zohar
    Cc: Rasmus Villemoes
    Cc: Arnd Bergmann
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • In NMI context, printk() messages are stored into per-CPU buffers to
    avoid a possible deadlock. They are normally flushed to the main ring
    buffer via an IRQ work. But the work is never called when the system
    calls panic() in the very same NMI handler.

    This patch tries to flush NMI buffers before the crash dump is
    generated. In this case it does not risk a double release and bails out
    when the logbuf_lock is already taken. The aim is to get the messages
    into the main ring buffer when possible. This makes them more
    accessible in the vmcore.

    Then the patch tries to flush the buffers a second time once the other
    CPUs are down. This attempt may be more aggressive and reset
    logbuf_lock. The aim is to make the messages available for the
    subsequent kmsg_dump() and console_flush_on_panic() calls.

    The patch causes vprintk_emit() to be called even in NMI context again.
    But it is done via printk_deferred() so that the console handling is
    skipped. Consoles use internal locks and we could not prevent a
    deadlock easily. They are explicitly called later when the crash dump
    is not generated, see console_flush_on_panic().
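
    The two flush attempts can be modelled roughly as below. This is a
    userspace sketch under stated assumptions, not the kernel code: an
    atomic_flag stands in for logbuf_lock, and toy_flush_on_panic() and
    the other toy_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock protecting the main ring buffer. */
static atomic_flag toy_logbuf_lock = ATOMIC_FLAG_INIT;
static int toy_flush_count;	/* how many times the flush actually ran */

static void toy_flush_nmi_buffers(void)
{
	toy_flush_count++;
}

/*
 * Returns true if the buffers were flushed.
 *
 * others_stopped == false: first attempt, before the crash dump. Bail
 * out if the lock is already taken rather than risk a deadlock.
 * others_stopped == true: second attempt, once no other CPU can hold
 * the lock any more, so it is forcibly reset first.
 */
static bool toy_flush_on_panic(bool others_stopped)
{
	if (others_stopped)
		atomic_flag_clear(&toy_logbuf_lock);

	if (atomic_flag_test_and_set(&toy_logbuf_lock))
		return false;	/* lock held elsewhere: give up for now */

	toy_flush_nmi_buffers();
	atomic_flag_clear(&toy_logbuf_lock);
	return true;
}
```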

    Signed-off-by: Petr Mladek
    Cc: Benjamin Herrenschmidt
    Cc: Daniel Thompson
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jiri Kosina
    Cc: Martin Schwidefsky
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Testing has shown that the backtrace sometimes does not fit into the 4kB
    temporary buffer that is used in NMI context. The warnings are gone
    when I double the temporary buffer size.

    This patch doubles the buffer size and makes it configurable.

    Note that this problem existed even in the x86-specific implementation
    that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
    stack trace on all CPUs"). Nobody noticed it because it did not print
    any warnings.

    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • We cannot resize the temporary buffer in NMI context. Let's warn if
    a message is lost.

    This is rather theoretical. printk() should not be used in NMI. The
    only sensible use is when we want to print backtrace from all CPUs. The
    current buffer should be enough for this purpose.

    [akpm@linux-foundation.org: whitespace fixlet]
    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • printk() takes some locks and cannot be used safely in NMI
    context.

    The chance of a deadlock is real especially when printing stacks from
    all CPUs. This particular problem has been addressed on x86 by the
    commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
    CPUs").

    The patchset brings two big advantages. First, it makes the NMI
    backtraces safe on all architectures for free. Second, it makes all NMI
    messages almost safe on all architectures (the temporary buffer is
    limited, so we should still keep the number of messages in NMI context
    to a minimum).

    Note that there already are several messages printed in NMI context:
    WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
    handlers. These are not easy to avoid.

    This patch reuses most of the code and makes it generic. It is useful
    for all messages and architectures that support NMI.

    The alternative printk_func is set when entering and is reset when
    leaving NMI context. It queues IRQ work to copy the messages into the
    main ring buffer in a safe context.

    __printk_nmi_flush() copies all available messages and resets the buffer.
    We can then use simple cmpxchg operations to synchronize with writers.
    A spinlock is also used to synchronize with other flushers.
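
    A rough userspace model of the writer/flusher synchronization, using
    C11 atomics in place of the kernel's cmpxchg. It is a sketch, not the
    code in printk/nmi.c: struct toy_nmi_buf, TOY_BUF_SIZE and the toy_*
    functions are hypothetical, and the spinlock that serializes flushers
    is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 32

struct toy_nmi_buf {
	atomic_size_t len;		/* bytes currently stored */
	char data[TOY_BUF_SIZE];
};

/*
 * Writer: copy the message behind the current end, then commit the new
 * length with compare-and-exchange. If len changed in the meantime,
 * start over. Returns the number of bytes stored (0 when full).
 */
static size_t toy_nmi_write(struct toy_nmi_buf *b, const char *msg, size_t n)
{
	size_t len, add;

	do {
		len = atomic_load(&b->len);
		if (len >= TOY_BUF_SIZE)
			return 0;	/* buffer full: the message is lost */
		add = n < TOY_BUF_SIZE - len ? n : TOY_BUF_SIZE - len;
		memcpy(b->data + len, msg, add);
	} while (!atomic_compare_exchange_strong(&b->len, &len, len + add));

	return add;
}

/*
 * Flusher: copy everything out, then reset len to 0 only if no writer
 * appended more data during the copy. Otherwise copy again.
 */
static size_t toy_nmi_flush(struct toy_nmi_buf *b, char out[TOY_BUF_SIZE])
{
	size_t len;

	do {
		len = atomic_load(&b->len);
		memcpy(out, b->data, len);
	} while (!atomic_compare_exchange_strong(&b->len, &len, 0));

	return len;
}
```

    The buffer length is the only shared state that is updated, which is
    why a single compare-and-exchange is enough on each side.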

    We no longer use seq_buf because it depends on an external lock. It
    would be hard to make all supported operations safe for lockless use.
    It would be confusing and error prone to make only some operations safe.

    The code is put into separate printk/nmi.c as suggested by Steven
    Rostedt. It needs a per-CPU buffer and is compiled only on
    architectures that call nmi_enter(). This is achieved by the new
    HAVE_NMI Kconfig flag.

    The exceptions are the MN10300 and Xtensa architectures. We need to
    clean up NMI handling there first. Let's do it separately.

    The patch is heavily based on the draft from Peter Zijlstra, see

    https://lkml.org/lkml/2015/6/10/327

    [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
    [akpm@linux-foundation.org: min_t->min - all types are size_t here]
    Signed-off-by: Petr Mladek
    Suggested-by: Peter Zijlstra
    Suggested-by: Steven Rostedt
    Cc: Jan Kara
    Acked-by: Russell King [arm part]
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • When using this program (as root):

    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #include <sys/io.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define ITER 1000
    #define FORKERS 15
    #define THREADS (6000/FORKERS) // 1850 is proc max

    static void fork_100_wait()
    {
            unsigned a, to_wait = 0;

            printf("\t%d forking %d\n", THREADS, getpid());

            for (a = 0; a < THREADS; a++) {
                    switch (fork()) {
                    case 0:
                            usleep(1000);
                            exit(0);
                            break;
                    case -1:
                            break;
                    default:
                            to_wait++;
                            break;
                    }
            }

            printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
                            to_wait);

            for (a = 0; a < to_wait; a++)
                    wait(NULL);

            printf("\t%d waited from %d\n", THREADS, getpid());
    }

    static void run_forkers()
    {
            pid_t forkers[FORKERS];
            unsigned a;

            for (a = 0; a < FORKERS; a++) {
                    switch ((forkers[a] = fork())) {
                    case 0:
                            fork_100_wait();
                            exit(0);
                            break;
                    case -1:
                            err(1, "DIE fork of %d'th forker", a);
                            break;
                    default:
                            break;
                    }
            }

            for (a = 0; a < FORKERS; a++)
                    waitpid(forkers[a], NULL, 0);
    }

    int main()
    {
            unsigned a;
            int ret;

            ret = ioperm(10, 20, 0);
            if (ret < 0)
                    err(1, "ioperm");

            for (a = 0; a < ITER; a++)
                    run_forkers();

            return 0;
    }

    kmemleak reports many occurrences of this leak:
    unreferenced object 0xffff8805917c8000 (size 8192):
    comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
    backtrace:
    [] kmemdup+0x25/0x50
    [] copy_thread_tls+0x6c3/0x9a0
    [] copy_process+0x1a84/0x5790
    [] wake_up_new_task+0x2d5/0x6f0
    [] _do_fork+0x12d/0x820
    ...

    This is due to the leakage of memory items which should have been freed
    in arch/x86/kernel/process.c:exit_thread().

    Make sure the memory is freed when fork fails later in copy_process.
    This is done by calling exit_thread with the thread to kill.
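
    The shape of the fix can be sketched in userspace as follows. This is
    an analogy, not the kernel code: struct toy_task, its state field, the
    toy_live_allocs counter and all toy_* functions are hypothetical, with
    malloc()/free() standing in for the kernel allocations.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature task with one per-thread allocation. */
struct toy_task {
	unsigned char *state;	/* per-thread data duplicated at fork time */
};

static int toy_live_allocs;	/* crude leak counter for the example */

/* Analogue of the per-arch copy step: duplicate the parent's state. */
static int toy_copy_thread(struct toy_task *child,
			   const struct toy_task *parent, size_t n)
{
	child->state = malloc(n);
	if (!child->state)
		return -1;
	memcpy(child->state, parent->state, n);
	toy_live_allocs++;
	return 0;
}

/* Analogue of exit_thread() taking the task to clean up as a parameter. */
static void toy_exit_thread(struct toy_task *tsk)
{
	if (tsk->state) {
		free(tsk->state);
		tsk->state = NULL;
		toy_live_allocs--;
	}
}

/* Analogue of copy_process(): a later step may fail after the copy. */
static int toy_copy_process(struct toy_task *child,
			    const struct toy_task *parent, size_t n,
			    int fail_late)
{
	int ret;

	child->state = NULL;
	ret = toy_copy_thread(child, parent, n);
	if (ret)
		return ret;

	if (fail_late) {
		ret = -1;
		goto bad_fork_cleanup_thread;
	}
	return 0;

bad_fork_cleanup_thread:
	toy_exit_thread(child);	/* without this call the copy would leak */
	return ret;
}
```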

    Signed-off-by: Jiri Slaby
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Jonas Bonn
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Steven Miao
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • We need to call exit_thread from copy_process in a fail path. So make it
    accept task_struct as a parameter.

    [v2]
    * s390: it doesn't make sense to call exit_thread_runtime_instr for
    non-current tasks.
    * arm: fix the comment in vfp_thread_copy
    * change 'me' to 'tsk' for task_struct
    * now we can change only archs that actually have exit_thread

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jiri Slaby
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Jonas Bonn
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Steven Miao
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Tetsuo has properly noted that the mmput slow path might get blocked
    waiting for another party (e.g. exit_aio waits for an IO). If that
    happens, the oom_reaper would be put out of the way and would not be able
    to process the next oom victim. We should strive to make this context as
    reliable and as independent of other subsystems as possible.

    Introduce mmput_async which will perform the slow path from an async
    (WQ) context. This will delay the operation but that shouldn't be a
    problem because the oom_reaper has reclaimed the victim's address space
    for most cases as much as possible and the remaining context shouldn't
    bind too much memory anymore. The only exception is when mmap_sem
    trylock has failed which shouldn't happen too often.

    The issue is only theoretical but not impossible.
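
    The idea of deferring the slow path can be sketched in userspace as
    below. It is a single-threaded model under stated assumptions, not the
    kernel implementation: struct toy_mm, the toy_work_list standing in
    for a workqueue, and every toy_* function are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mm {
	atomic_int users;		/* reference count */
	bool torn_down;			/* set by the slow path */
	struct toy_mm *next_work;	/* link in the deferred-work list */
};

/* Hypothetical stand-in for a workqueue (not thread safe). */
static struct toy_mm *toy_work_list;

/* Slow path: may block in a real system, here it only marks the mm. */
static void toy_mmput_slow(struct toy_mm *mm)
{
	mm->torn_down = true;
}

/*
 * Drop a reference. If it was the last one, queue the slow path for
 * the worker instead of running it in the caller's context, so the
 * caller can never block in it.
 */
static void toy_mmput_async(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) == 1) {
		mm->next_work = toy_work_list;
		toy_work_list = mm;
	}
}

/* Worker context: run every deferred slow path. */
static void toy_run_deferred_work(void)
{
	while (toy_work_list) {
		struct toy_mm *mm = toy_work_list;

		toy_work_list = mm->next_work;
		toy_mmput_slow(mm);
	}
}
```

    The caller returns right after queueing; the teardown only happens
    once toy_run_deferred_work() runs, which mirrors the delay described
    above.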

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Humans don't write C code like:
        u8 *ptr = skb->data;
        int imm = 4;
        imm += ptr;
    but from the llvm backend's point of view 'imm' and 'ptr' are registers and
    imm += ptr may be preferred over ptr += imm depending on which register value
    will be used further in the code, while the verifier can only recognize
    ptr += imm.
    That caused small unrelated changes in the C code of the bpf program to
    trigger rejection by the verifier. Therefore teach the verifier to recognize
    both ptr += imm and imm += ptr.
    For example:
        when R6=pkt(id=0,off=0,r=62) R7=imm22
        after r7 += r6 instruction
        will be R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62)
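
    The state transition in the example can be modelled with a much
    reduced register state. This is only a sketch, not the verifier's
    code; struct toy_reg, enum toy_reg_type and toy_add() are hypothetical
    and the packet id is left out.

```c
#include <stdbool.h>

enum toy_reg_type { TOY_IMM, TOY_PKT };

/* Hypothetical, much reduced register state. */
struct toy_reg {
	enum toy_reg_type type;
	int imm;	/* known constant, valid when type == TOY_IMM */
	int off;	/* offset from packet start, valid for TOY_PKT */
	int range;	/* bytes verified as accessible, valid for TOY_PKT */
};

/* Model "dst += src". Returns false for combinations the sketch rejects. */
static bool toy_add(struct toy_reg *dst, const struct toy_reg *src)
{
	if (dst->type == TOY_PKT && src->type == TOY_IMM) {
		dst->off += src->imm;	/* ptr += imm */
		return true;
	}
	if (dst->type == TOY_IMM && src->type == TOY_PKT) {
		int imm = dst->imm;

		*dst = *src;		/* imm += ptr: dst becomes a pointer */
		dst->off += imm;
		return true;
	}
	if (dst->type == TOY_IMM && src->type == TOY_IMM) {
		dst->imm += src->imm;
		return true;
	}
	return false;			/* ptr += ptr is not meaningful */
}
```

    Starting from R6=pkt(off=0,r=62) and R7=imm22, toy_add(&r7, &r6)
    leaves R6 untouched and turns R7 into pkt(off=22,r=62).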

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • When packet headers are accessed in 'decreasing' order (e.g. the TCP port
    may be fetched before the program reads the IP src), llvm may generate
    the following code:
        [...] // R7=pkt(id=0,off=22,r=70)
        r2 = *(u32 *)(r7 +0) // good access
        [...]
        r7 += 40 // R7=pkt(id=0,off=62,r=70)
        r8 = *(u32 *)(r7 +0) // good access
        [...]
        r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
                              // it's doing *(u32*)(skb->data + 42)
    Fix the verifier to recognize such code patterns.

    It also turned out that the 'off > range' condition is not a verifier bug.
    It's a buggy program that may do something like:
        if (ptr + 50 > data_end)
                return 0;
        ptr += 60;
        *(u32*)ptr;
    In such a case emit the
    "invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error message,
    so all information is available for the program author to fix the program.
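
    The bounds check implied by both examples can be written down as a
    small predicate. This is a sketch of the arithmetic only, not the
    verifier's code; toy_packet_access_ok() and its parameter names are
    hypothetical.

```c
#include <stdbool.h>

/*
 * reg_off:  the pointer's offset from the start of the packet
 * range:    number of bytes already verified as safe to access
 * insn_off: constant offset encoded in the load (may be negative)
 * size:     width of the access in bytes
 */
static bool toy_packet_access_ok(int reg_off, int range, int insn_off, int size)
{
	int start = reg_off + insn_off;

	if (start < 0 || size <= 0)
		return false;
	return start + size <= range;
}
```

    With R7=pkt(off=62,r=70) the load at r7 - 20 covers bytes 42..45 and
    passes, while R1(off=60,r=50) with off=0 size=4 would need bytes
    60..63 and is rejected.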

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
    bpf filesystem. Looking at the code I saw a broken usage of mount_ns
    with current->nsproxy->mnt_ns. As the code does not acquire a
    reference to the mount namespace it can not possibly be correct to
    store the mount namespace on the superblock as it does.

    Replace mount_ns with mount_nodev so that each mount of the bpf
    filesystem returns a distinct instance, and the code is not buggy.

    In discussion with Hannes Frederic Sowa it was reported that the use
    of mount_ns was an attempt to have one bpf instance per mount
    namespace, in an attempt to keep resources that pin resources from
    hiding. That intent simply does not work; the vfs is not built to
    allow that kind of behavior. This means that the bpf filesystem
    really is buggy both semantically and in its implementation, as it does
    not, nor can it, implement the original intent.

    This change is userspace visible, but my experience with similar
    filesystems leads me to believe nothing will break with a model in which
    each mount of the bpf filesystem is distinct from all others.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Cc: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: "Eric W. Biederman"
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Start address randomization and blinding in BPF currently use
    prandom_u32(). prandom_u32() values are not exposed to unprivileged
    user space to my knowledge, but given that other kernel facilities such as
    ASLR, stack canaries, etc. make use of the stronger get_random_int(), we
    had better make use of it here as well, since blinding successively
    requests new random values. get_random_int() has minimal entropy pool
    depletion and is not cryptographically secure, but doesn't need to be
    for our use cases here.
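
    To show why blinding asks for a fresh random value each time, here is
    a minimal sketch of XOR-based constant blinding. It is an
    illustration under stated assumptions, not the BPF JIT code: struct
    toy_blinded, toy_blind() and toy_unblind() are hypothetical, and the
    random mask is passed in by the caller rather than drawn from a kernel
    RNG.

```c
#include <stdint.h>

struct toy_blinded {
	uint32_t masked;	/* what is emitted in place of the constant */
	uint32_t rnd;		/* per-constant random mask */
};

/* Hide a constant behind a random mask. */
static struct toy_blinded toy_blind(uint32_t imm, uint32_t rnd)
{
	struct toy_blinded b = { imm ^ rnd, rnd };

	return b;
}

/* Recover the original constant at run time. */
static uint32_t toy_unblind(struct toy_blinded b)
{
	return b.masked ^ b.rnd;
}
```

    The same constant blinded with two different masks yields two
    different emitted values, so the quality of the masks decides how
    predictable the emitted code is.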

    Suggested-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann