01 Dec, 2018

1 commit

  • commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.

    Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.
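
    The open rule described in this commit can be modeled in user space as a
    small predicate. This is only a sketch of the wording above, not the
    kernel's implementation: the struct, the enum, and the function name
    sticky_open_check() are hypothetical, and the two boolean parameters
    stand in for the separate FIFO and regular-file sysctl switches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file kinds; only FIFOs and regular files are covered. */
enum file_kind { KIND_FIFO, KIND_REGULAR, KIND_OTHER };

/* Hypothetical, simplified inputs to the check (not kernel structures). */
struct open_request {
    bool dir_sticky;            /* directory has the sticky bit set */
    bool dir_world_writable;    /* directory is world writable */
    unsigned int dir_uid;       /* owner of the directory */
    unsigned int file_uid;      /* owner of the existing file */
    unsigned int opener_uid;    /* user performing the open */
    enum file_kind kind;        /* type of the existing file */
    bool o_creat;               /* open was called with O_CREAT */
};

/* Returns 0 if the open may proceed, -EACCES if the rule rejects it. */
int sticky_open_check(const struct open_request *r,
                      bool protect_fifos, bool protect_regular)
{
    assert(r != NULL);

    /* Opens without O_CREAT are never restricted by this rule. */
    if (!r->o_creat)
        return 0;
    /* Each file type has its own on/off switch. */
    if (!((r->kind == KIND_FIFO && protect_fifos) ||
          (r->kind == KIND_REGULAR && protect_regular)))
        return 0;
    /* Only world-writable sticky directories are covered. */
    if (!r->dir_sticky || !r->dir_world_writable)
        return 0;
    /* Files owned by the opener or by the directory owner are allowed. */
    if (r->file_uid == r->opener_uid || r->file_uid == r->dir_uid)
        return 0;
    return -EACCES;
}
```

    With both switches enabled, an O_CREAT open of another user's FIFO in a
    world-writable sticky directory is rejected, while the same open without
    O_CREAT goes through.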

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Loic
    Signed-off-by: Greg Kroah-Hartman

    Salvatore Mesoraca
     

22 Feb, 2018

1 commit

  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

14 Dec, 2017

1 commit

  • [ Upstream commit 98159d977f71c3b3dee898d1c34e56f520b094e7 ]

    Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

    While backporting Michael's "pipe: fix limit handling" patchset to a
    distro-kernel, Mikulas noticed that current upstream pipe limit handling
    contains a few problems:

    1 - procfs signed wrap: echo'ing a large number into
    /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
    negative value.

    2 - round_pipe_size() nr_pages overflow on 32bit: this would
    subsequently try roundup_pow_of_two(0), which is undefined.

    3 - visible non-rounded pipe-max-size value: there is no mutual
    exclusion or protection between the time pipe_max_size is assigned
    a raw value from proc_dointvec_minmax() and when it is rounded.

    4 - unsigned long -> unsigned int conversion makes for potential odd
    return errors from do_proc_douintvec_minmax_conv() and
    do_proc_dopipe_max_size_conv().
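
    Problem 2 in the list above can be illustrated with 32-bit arithmetic in
    user space. This is a sketch under assumptions, not the kernel's
    round_pipe_size(): the 4 KiB page size, the SKETCH_* macros, and both
    function names are hypothetical. It shows how the page count wraps to 0
    for a huge request and one way to reject that case before rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (UINT32_C(1) << SKETCH_PAGE_SHIFT)

/* Page count computed in 32 bits; the addition wraps for sizes near 4 GiB. */
uint32_t pages_for_size32(uint32_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Round a byte size up to a power-of-two number of pages.
 * Returns 0 when the request is empty or cannot be represented,
 * so a power-of-two round-up is never attempted on 0.
 */
uint32_t round_pipe_size_sketch(uint32_t size)
{
    uint32_t nr_pages = pages_for_size32(size);
    uint32_t pow2 = 1;

    if (nr_pages == 0)
        return 0;
    while (pow2 < nr_pages)
        pow2 <<= 1;
    assert(pow2 >= nr_pages);
    /* Shifting back to bytes must not overflow 32 bits either. */
    if (pow2 > (UINT32_MAX >> SKETCH_PAGE_SHIFT))
        return 0;
    return pow2 << SKETCH_PAGE_SHIFT;
}
```

    A caller would treat a return value of 0 as an invalid size rather than
    passing it on.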

    This version underwent the same testing as v1:
    https://marc.info/?l=linux-kernel&m=150643571406022&w=2

    This patch (of 4):

    pipe_max_size is defined as an unsigned int:

    unsigned int pipe_max_size = 1048576;

    but its procfs/sysctl representation is an integer:

    static struct ctl_table fs_table[] = {
            ...
            {
                    .procname     = "pipe-max-size",
                    .data         = &pipe_max_size,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = &pipe_proc_fn,
                    .extra1       = &pipe_min_size,
            },
            ...

    that is signed:

    int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                     size_t *lenp, loff_t *ppos)
    {
            ...
            ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);

    This leads to signed results via procfs for large values of pipe_max_size:

    % echo 2147483647 >/proc/sys/fs/pipe-max-size
    % cat /proc/sys/fs/pipe-max-size
    -2147483648

    Use unsigned operations on this variable to avoid such negative values.
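
    The negative value in the transcript above is what the bit pattern of
    2147483648 looks like when a 32-bit unsigned value is read as signed. A
    minimal sketch of that reinterpretation follows; the function name
    signed_view_of_u32() is hypothetical and the code only models the
    arithmetic, not the procfs handlers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value obtained when the 32 bits of v are interpreted as a
 * two's-complement signed integer (computed without relying on
 * implementation-defined conversions).
 */
int64_t signed_view_of_u32(uint32_t v)
{
    int64_t result = (v & UINT32_C(0x80000000))
                         ? (int64_t)v - (INT64_C(1) << 32)
                         : (int64_t)v;

    assert(result >= INT32_MIN && result <= INT32_MAX);
    return result;
}
```

    Values below 2^31, such as the default 1048576, read the same either
    way, which is why the problem only shows up for large settings.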

    Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Joe Lawrence
    Reported-by: Mikulas Patocka
    Reviewed-by: Mikulas Patocka
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joe Lawrence
     

06 Oct, 2017

1 commit

  • Pull watchdog clean-up and fixes from Thomas Gleixner:
    "The watchdog (hard/softlockup detector) code is pretty much broken in
    its current state. The patch series addresses this by removing all
    duct tape and refactoring it into a workable state.

    The reasons why I ask for inclusion that late in the cycle are:

    1) The code causes lockdep splats vs. hotplug locking which get
    reported over and over. Unfortunately there is no easy fix.

    2) The risk of breakage is minimal because it's already broken

    3) As 4.14 is a long term stable kernel, I prefer to have working
    watchdog code in that and the lockdep issues resolved. I wouldn't
    ask you to pull if 4.14 wouldn't be a LTS kernel or if the
    solution would be easy to backport.

    4) The series was around before the merge window opened, but then got
    delayed due to the UP failure caused by the for_each_cpu()
    surprise which we discussed recently.

    Changes vs. V1:

    - Addressed your review points

    - Addressed the warning in the powerpc code which was discovered late

    - Changed two function names which made sense up to a certain point
    in the series. Now they match what they do in the end.

    - Fixed an 'unused variable' warning, which was not detected by the
    intel robot. I triggered it when trying all possible related config
    combinations manually. Randconfig testing seems not random enough.

    The changes have been tested by and reviewed by Don Zickus and tested
    and acked by Michael Ellerman for powerpc"

    * 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    watchdog/core: Put softlockup_threads_initialized under ifdef guard
    watchdog/core: Rename some softlockup_* functions
    powerpc/watchdog: Make use of watchdog_nmi_probe()
    watchdog/core, powerpc: Lock cpus across reconfiguration
    watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
    watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
    watchdog/hardlockup/perf: Cure UP damage
    watchdog/hardlockup: Clean up hotplug locking mess
    watchdog/hardlockup/perf: Simplify deferred event destroy
    watchdog/hardlockup/perf: Use new perf CPU enable mechanism
    watchdog/hardlockup/perf: Implement CPU enable replacement
    watchdog/hardlockup/perf: Implement init time detection of perf
    watchdog/hardlockup/perf: Implement init time perf validation
    watchdog/core: Get rid of the racy update loop
    watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
    watchdog/sysctl: Clean up sysctl variable name space
    watchdog/sysctl: Get rid of the #ifdeffery
    watchdog/core: Clean up header mess
    watchdog/core: Further simplify sysctl handling
    watchdog/core: Get rid of the thread teardown/setup dance
    ...

    Linus Torvalds
     

04 Oct, 2017

1 commit

  • do_proc_douintvec_conv() has two UINT_MAX checks, we can remove one.
    This has no functional changes other than fixing a compiler warning:

    kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false

    Fixes: 4f2fec00afa60 ("sysctl: simplify unsigned int support")
    Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Reported-by: David Binderman
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

29 Sep, 2017

1 commit

  • The system will hang if the user sets sysctl_sched_time_avg to 0:

    [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

    Stack traceback for pid 0
    0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
    ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
    0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
    ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
    Call Trace:
    [] ? update_group_capacity+0x110/0x200
    [] ? update_sd_lb_stats+0x109/0x600
    [] ? find_busiest_group+0x47/0x530
    [] ? load_balance+0x194/0x900
    [] ? update_rq_clock.part.83+0x1a/0xe0
    [] ? rebalance_domains+0x152/0x290
    [] ? run_rebalance_domains+0xdc/0x1d0
    [] ? __do_softirq+0xfb/0x320
    [] ? irq_exit+0x125/0x130
    [] ? scheduler_ipi+0x97/0x160
    [] ? smp_reschedule_interrupt+0x29/0x30
    [] ? reschedule_interrupt+0x6e/0x80
    [] ? cpuidle_enter_state+0xcc/0x230
    [] ? cpuidle_enter_state+0x9c/0x230
    [] ? cpuidle_enter+0x17/0x20
    [] ? cpu_startup_entry+0x38c/0x420
    [] ? start_secondary+0x173/0x1e0

    This is because a divide-by-zero error happens in:

    update_group_capacity()
    update_cpu_capacity()
    scale_rt_capacity()
    {
    ...
    total = sched_avg_period() + delta;
    used = div_u64(avg, total);
    ...
    }

    To fix this issue, check user input value of sysctl_sched_time_avg, keep
    it unchanged when hitting invalid input, and set the minimum limit of
    sysctl_sched_time_avg to 1 ms.
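
    The fix described here amounts to validating the written value before
    storing it. A user-space sketch of that idea, assuming a hypothetical
    setter named set_time_avg_ms() (the kernel does this through its sysctl
    table, not through a function like this one):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Store new_ms in *cur_ms only if it is at least 1 and fits in an
 * unsigned int. On invalid input the current value is left unchanged
 * and -EINVAL is returned, so a later division by the period stays safe.
 */
int set_time_avg_ms(unsigned int *cur_ms, long long new_ms)
{
    assert(cur_ms != NULL);

    if (new_ms < 1 || new_ms > UINT_MAX)
        return -EINVAL;
    *cur_ms = (unsigned int)new_ms;
    return 0;
}
```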

    Reported-by: James Puthukattukaran
    Signed-off-by: Ethan Zhao
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: efault@gmx.de
    Cc: ethan.kernel@gmail.com
    Cc: keescook@chromium.org
    Cc: mcgrof@kernel.org
    Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
    Signed-off-by: Ingo Molnar

    Ethan Zhao
     

14 Sep, 2017

2 commits

  • Reflect that these variables are user interface related and remove the
    whitespace damage in the sysctl table while at it.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Don Zickus
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Metcalf
    Cc: Linus Torvalds
    Cc: Nicholas Piggin
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Ulrich Obergfell
    Link: http://lkml.kernel.org/r/20170912194147.783210221@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The sysctl of the nmi_watchdog file prevents writes by setting:

    min = max = 0

    if none of the users is enabled. That involves ifdeffery and is completely
    non-obvious.

    If none of the facilities is enabled, then the file can simply be made read
    only. Move the ifdeffery into the header and use a constant for file
    permissions.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Don Zickus
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Metcalf
    Cc: Linus Torvalds
    Cc: Nicholas Piggin
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Ulrich Obergfell
    Link: http://lkml.kernel.org/r/20170912194147.706073616@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 Jul, 2017

5 commits

  • Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
    HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

    LOCKUP_DETECTOR implies the general boot, sysctl, and programming
    interfaces for the lockup detectors.

    An architecture that wants to use a hard lockup detector must define
    HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

    Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
    minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
    does not implement the LOCKUP_DETECTOR interfaces.

    sparc is unusual in that it has started to implement some of the
    interfaces, but not fully yet. It should probably be converted to a full
    HAVE_HARDLOCKUP_DETECTOR_ARCH.

    [npiggin@gmail.com: fix]
    Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
    Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • To keep parity with the regular int interfaces, provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.
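
    The range behavior described here can be sketched in user space as
    follows. This is an illustration only: uint_range_store() is a
    hypothetical helper, not the kernel's proc_douintvec_minmax(), and it
    assumes that a missing bound (NULL) means that side is unchecked.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Store value in *out only if it lies within [*min, *max].
 * Either bound may be NULL, meaning that side is unbounded.
 * On failure *out is left untouched and -EINVAL is returned.
 */
int uint_range_store(unsigned int value,
                     const unsigned int *min, const unsigned int *max,
                     unsigned int *out)
{
    assert(out != NULL);

    if ((min != NULL && value < *min) || (max != NULL && value > *max))
        return -EINVAL;
    *out = value;
    return 0;
}
```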

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start adding support for
    unsigned int; this, however, was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value; this happens because
    do_proc_dointvec() is used, and it prints through proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").

    It was also never added to sysctl_check_table(). Instead of adding it
    with the current implementation, just provide proper and simplified
    unsigned int support, with no array support and no negative support
    at all.

    Historically sysctl proc helpers have supported arrays. Because of the
    complexity this adds, we've taken a step back to evaluate array users
    and determine whether array support is worth maintaining for unsigned
    int. An evaluation using Coccinelle has been done to perform a
    grammatical search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they
    are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth adding array support just for this, given that the range
    support works just as well. Unsigned int support is desirable mainly
    when you *need* more than INT_MAX, or when int min/max support does not
    suffice for your ranges.

    If you forget and by mistake register an unsigned int proc entry with
    an array, the driver will fail and you will get something like the
    following:

    sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The mode sysctl_writes_strict positional checks keep being copy and pasted
    as we add new proc handlers. Just add a helper to avoid code duplication.

    Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Document the different sysctl_writes_strict modes in code.

    Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

09 May, 2017

1 commit

  • do_proc_dointvec_jiffies_conv() uses LONG_MAX/HZ as the max value to
    avoid overflow. But actually the *valp is int type, so it still causes
    overflow.

    For example,

    echo 2147483647 > ./sys/net/ipv4/tcp_keepalive_time

    Then,

    cat ./sys/net/ipv4/tcp_keepalive_time

    The output is "-1", which is not expected.

    Now use INT_MAX/HZ as the max value instead of LONG_MAX/HZ to fix it.
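
    The bound described in this commit can be sketched as a checked
    seconds-to-jiffies conversion. The function name
    secs_to_jiffies_checked() and the hz parameter are hypothetical (in the
    kernel HZ is a build-time constant); the point is only that the limit
    has to match the int-sized destination.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Convert seconds to jiffies for an int-sized destination.
 * Rejects inputs whose product with hz would exceed INT_MAX,
 * leaving *jiffies_out untouched in that case.
 */
int secs_to_jiffies_checked(unsigned long long secs, int hz, int *jiffies_out)
{
    assert(jiffies_out != NULL);

    if (hz <= 0)
        return -EINVAL;
    if (secs > (unsigned long long)(INT_MAX / hz))
        return -EINVAL;
    *jiffies_out = (int)(secs * (unsigned long long)hz);
    return 0;
}
```

    Bounding by LONG_MAX/hz instead would let values through on 64-bit
    systems that no longer fit once stored in an int.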

    Link: http://lkml.kernel.org/r/1490109532-9228-1-git-send-email-fgao@ikuai8.com
    Signed-off-by: Gao Feng
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Eric Dumazet
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao Feng
     

02 May, 2017

1 commit

  • Pull timer updates from Thomas Gleixner:
    "The timer department delivers:

    - more year 2038 rework

    - a massive rework of the arm architected timer

    - preparatory patches to allow NTP correction of clock event devices
    to avoid early expiry

    - the usual pile of fixes and enhancements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
    timer/sysclt: Restrict timer migration sysctl values to 0 and 1
    arm64/arch_timer: Mark errata handlers as __maybe_unused
    Clocksource/mips-gic: Remove redundant non devicetree init
    MIPS/Malta: Probe gic-timer via devicetree
    clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
    acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
    clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
    acpi/arm64: Add memory-mapped timer support in GTDT driver
    clocksource: arm_arch_timer: simplify ACPI support code.
    acpi/arm64: Add GTDT table parse driver
    clocksource: arm_arch_timer: split MMIO timer probing.
    clocksource: arm_arch_timer: add structs to describe MMIO timer
    clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
    clocksource: arm_arch_timer: refactor arch_timer_needs_probing
    clocksource: arm_arch_timer: split dt-only rate handling
    x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
    unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
    um/time: Set ->min_delta_ticks and ->max_delta_ticks
    tile/time: Set ->min_delta_ticks and ->max_delta_ticks
    score/time: Set ->min_delta_ticks and ->max_delta_ticks
    ...

    Linus Torvalds
     

20 Apr, 2017

1 commit

  • timer_migration sysctl acts as a boolean switch, so the allowed values
    should be restricted to 0 and 1.

    Add the necessary extra fields to the sysctl table entry to enforce that.
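
    As a rough user-space model of what such bounds do, the sketch below
    attaches a minimum and a maximum to a table entry and checks them on
    write. The struct knob_entry type and knob_write() are hypothetical
    stand-ins; only the extra1/extra2 field names are borrowed from the
    kernel's sysctl table, where they are untyped pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a sysctl table entry. */
struct knob_entry {
    const char *procname;
    int *data;          /* variable backing the knob */
    const int *extra1;  /* minimum allowed value, or NULL */
    const int *extra2;  /* maximum allowed value, or NULL */
};

static const int zero = 0;
static const int one = 1;
int timer_migration_value = 1;

/* A boolean switch: only 0 and 1 are accepted. */
const struct knob_entry timer_migration_entry = {
    .procname = "timer_migration",
    .data = &timer_migration_value,
    .extra1 = &zero,
    .extra2 = &one,
};

/* Write value through the entry, enforcing its bounds. */
int knob_write(const struct knob_entry *e, int value)
{
    assert(e != NULL && e->data != NULL);

    if (e->extra1 != NULL && value < *e->extra1)
        return -EINVAL;
    if (e->extra2 != NULL && value > *e->extra2)
        return -EINVAL;
    *e->data = value;
    return 0;
}
```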

    [ tglx: Rewrote changelog ]

    Signed-off-by: Myungho Jung
    Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
    Signed-off-by: Thomas Gleixner

    Myungho Jung
     

09 Apr, 2017

1 commit

  • Currently, inputting the following command will succeed but actually the
    value will be truncated:

    # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

    This is not friendly to the user, so instead, we should report an error
    when the value is larger than UINT_MAX.
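
    The difference between silent truncation and the error described here
    can be shown with fixed-width types. Both function names below are
    hypothetical, and the sketch models only the arithmetic, not the
    kernel's conversion callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Old behavior in miniature: the upper bits are silently dropped. */
uint32_t truncate_to_u32(uint64_t parsed)
{
    return (uint32_t)parsed;
}

/* New behavior in miniature: out-of-range input is reported instead. */
int checked_u32_store(uint64_t parsed, uint32_t *out)
{
    assert(out != NULL);

    if (parsed > UINT32_MAX)
        return -EINVAL;
    *out = (uint32_t)parsed;
    return 0;
}
```

    For the input from the example, 0x12ffffffff, truncation keeps only
    0xffffffff, while the checked variant returns -EINVAL and leaves the
    stored value alone.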

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Cc: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

08 Apr, 2017

1 commit

  • I saw some very confusing sysctl output on my system:
    # cat /proc/sys/net/core/xfrm_aevent_rseqth
    -2
    # cat /proc/sys/net/core/xfrm_aevent_etime
    -10
    # cat /proc/sys/net/ipv4/tcp_notsent_lowat
    -4294967295

    This happens because we forget to set the *negp flag in proc_douintvec,
    so it ends up holding a garbage value.

    Since the value handled by proc_douintvec is always an unsigned integer,
    we can set *negp to false explicitly to fix this issue.
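
    A user-space sketch of the read path makes the effect of the flag
    visible. The two functions below are hypothetical models, not the
    kernel's code: uint_read_conv() plays the role of the conversion step
    and format_value() the role of the printing step that honors the sign
    flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Read-direction conversion: always clears the sign flag for unsigned data. */
void uint_read_conv(bool *negp, unsigned long *lvalp, const unsigned int *valp)
{
    assert(negp != NULL && lvalp != NULL && valp != NULL);

    *negp = false;
    *lvalp = (unsigned long)*valp;
}

/* Printing step: emits a leading '-' whenever the sign flag is set. */
int format_value(char *buf, size_t len, bool neg, unsigned long lval)
{
    return snprintf(buf, len, "%s%lu", neg ? "-" : "", lval);
}
```

    If the conversion step left the flag untouched, whatever happened to be
    in it would decide whether the printing step adds a minus sign.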

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

02 Mar, 2017

1 commit


01 Feb, 2017

1 commit

  • We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

    ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

    ... whose name suggests to users that it's in milliseconds, while in reality
    it's being set in milliseconds but the result is shown in jiffies.

    This is obviously confusing when HZ is not 1000: it makes it appear as if
    setting the value failed, for example with HZ=100:

    root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
    root# cat /proc/sys/kernel/sched_rr_timeslice_ms
    10

    Fix this to be milliseconds all around.
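
    The mismatch can be reproduced with two small conversions. These are
    simplified sketches with hypothetical names, assuming an hz value that
    divides 1000 evenly; the kernel's own helpers handle rounding and
    overflow differently.

```c
#include <assert.h>

/* Milliseconds to timer ticks at a given tick rate (simplified). */
unsigned int ms_to_jiffies_sketch(unsigned int ms, unsigned int hz)
{
    return ms * hz / 1000u;
}

/* Timer ticks back to milliseconds at a given tick rate (simplified). */
unsigned int jiffies_to_ms_sketch(unsigned int jiffies, unsigned int hz)
{
    assert(hz > 0);
    return jiffies * 1000u / hz;
}
```

    At hz = 100, writing 100 ms is stored as 10 ticks. Showing the stored
    ticks directly gives the confusing 10 from the transcript above, while
    converting back on read gives 100 again.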

    Signed-off-by: Shile Zhang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
    Signed-off-by: Ingo Molnar

    Shile Zhang
     

27 Jan, 2017

1 commit

  • We perform the conversion between kernel jiffies and ms only when
    exporting the kernel value to user space.

    We need to do the opposite operation when the value is written by the
    user.

    This only matters when HZ != 1000.

    Signed-off-by: Eric Dumazet
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

25 Dec, 2016

1 commit


16 Dec, 2016

1 commit

  • Pull tracing updates from Steven Rostedt:
    "This release has a few updates:

    - STM can hook into the function tracer
    - Function filtering now supports more advance glob matching
    - Ftrace selftests updates and added tests
    - Softirq tag in traces now show only softirqs
    - ARM nop added to non traced locations at compile time
    - New trace_marker_raw file that allows for binary input
    - Optimizations to the ring buffer
    - Removal of kmap in trace_marker
    - Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
    - Other various fixes and clean ups"

    * tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
    selftests: ftrace: Shift down default message verbosity
    kprobes/trace: Fix kprobe selftest for newer gcc
    tracing/kprobes: Add a helper method to return number of probe hits
    tracing/rb: Init the CPU mask on allocation
    tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
    tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
    fgraph: Handle a case where a tracer ignores set_graph_notrace
    tracing: Replace kmap with copy_from_user() in trace_marker writing
    ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
    tracing: Allow benchmark to be enabled at early_initcall()
    tracing: Have system enable return error if one of the events fail
    tracing: Do not start benchmark on boot up
    tracing: Have the reg function allow to fail
    ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
    ring-buffer: Froce rb_update_write_stamp() to be inlined
    ring-buffer: Force inline of hotpath helper functions
    tracing: Make __buffer_unlock_commit() always_inline
    tracing: Make tracepoint_printk a static_key
    ring-buffer: Always inline rb_event_data()
    ring-buffer: Make rb_reserve_next_event() always inlined
    ...

    Linus Torvalds
     

15 Dec, 2016

1 commit

  • I was amused to find an "unsafe core_pattern" warning with these lines in
    /etc/sysctl.conf:

    fs.suid_dumpable=2
    kernel.core_pattern=/core/core-%e-%p-%E
    kernel.core_uses_pid=0

    Turns out the kernel is formally right. The default core_pattern is just
    "core", which doesn't qualify as a secure path while fs.suid_dumpable is
    being set.

    Hint admins about solution, clarify sysctl names, delete unnecessary '\'
    characters (string literals are concatenated regardless) and reformat for
    easier grepping.

    Link: http://lkml.kernel.org/r/20161029152124.GA1258@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

13 Dec, 2016

1 commit

  • Pull x86 asm updates from Ingo Molnar:
    "The main changes in this development cycle were:

    - a large number of call stack dumping/printing improvements: higher
    robustness, better cross-context dumping, improved output, etc.
    (Josh Poimboeuf)

    - vDSO getcpu() performance improvement for future Intel CPUs with
    the RDPID instruction (Andy Lutomirski)

    - add two new Intel AVX512 features and the CPUID support
    infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
    He Chen)

    - more copy-user unification (Borislav Petkov)

    - entry code assembly macro simplifications (Alexander Kuleshov)

    - vDSO C/R support improvements (Dmitry Safonov)

    - misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    scripts/decode_stacktrace.sh: Fix address line detection on x86
    x86/boot/64: Use defines for page size
    x86/dumpstack: Make stack name tags more comprehensible
    selftests/x86: Add test_vdso to test getcpu()
    x86/vdso: Use RDPID in preference to LSL when available
    x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
    x86/cpufeatures: Enable new AVX512 cpu features
    x86/cpuid: Provide get_scattered_cpuid_leaf()
    x86/cpuid: Cleanup cpuid_regs definitions
    x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
    x86/unwind: Ensure stack grows down
    x86/vdso: Set vDSO pointer only after success
    x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
    x86/unwind: Detect bad stack return address
    x86/dumpstack: Warn on stack recursion
    x86/unwind: Warn on bad frame pointer
    x86/decoder: Use stderr if insn sanity test fails
    x86/decoder: Use stdout if insn decoder test is successful
    mm/page_alloc: Remove kernel address exposure in free_reserved_area()
    x86/dumpstack: Remove raw stack dump
    ...

    Linus Torvalds
     

24 Nov, 2016

1 commit

  • Currently, when tracepoint_printk is set (enabled by the "tp_printk" kernel
    command line), it causes trace events to print via printk(). This is a very
    dangerous operation, but is useful for debugging.

    The issue is, it's seldom used, but it is always checked even if it's not
    enabled by the kernel command line. Instead of having this feature called by
    a branch against a variable, turn that variable into a static key, and this
    will remove the test and jump.

    To simplify things, the functions output_printk() and
    trace_event_buffer_commit() were moved from trace_events.c to trace.c.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

26 Oct, 2016

1 commit

  • For mostly historical reasons, the x86 oops dump shows the raw stack
    values:

    ...
    [registers]
    Stack:
    ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
    ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
    0000000000000002 0000000000000000 0000000000000000 0000000000000000
    Call Trace:
    ...

    This seems to be an artifact from long ago, and probably isn't needed
    anymore. It generally just adds noise to the dump, and it can be
    actively harmful because it leaks kernel addresses.

    Linus says:

    "The stack dump actually goes back to forever, and it used to be
    useful back in 1992 or so. But it used to be useful mainly because
    stacks were simpler and we didn't have very good call traces anyway. I
    definitely remember having used them - I just do not remember having
    used them in the last ten+ years.

    Of course, it's still true that if you can trigger an oops, you've
    likely already lost the security game, but since the stack dump is so
    useless, let's aim to just remove it and make games like the above
    harder."

    This also removes the related 'kstack=' cmdline option and the
    'kstack_depth_to_print' sysctl.

    Suggested-by: Linus Torvalds
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

20 Oct, 2016

1 commit

  • The last user of this tunable was removed in 2012 in commit:

    82958366cfea ("sched: Replace update_shares weight distribution with per-entity computation")

    Delete it since its very existence confuses people.

    Signed-off-by: Matt Fleming
    Cc: Dietmar Eggemann
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161019141059.26408-1-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     

11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it was root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sand boxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    There is also added a limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"
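
The "per user per user namespace" accounting described above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel's code: the struct, the function names, and the idea of charging every ancestor level are hypothetical illustrations. The ENOSPC return value is taken from the commit subject in the list below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: one counter per (user, user namespace) pair. */
struct ucount_model {
    struct ucount_model *parent; /* counter in the parent user namespace, or NULL */
    long max;                    /* limit configured at this level */
    long count;                  /* namespaces currently charged at this level */
};

/*
 * Charge one namespace to `uc` and to every ancestor (assumed behaviour).
 * Returns 0 on success, or -ENOSPC if any level is already at its limit,
 * in which case the charges made so far are rolled back.
 */
static int ucount_model_inc(struct ucount_model *uc)
{
    struct ucount_model *iter, *undo;

    assert(uc != NULL);
    for (iter = uc; iter != NULL; iter = iter->parent) {
        if (iter->count >= iter->max)
            goto fail;
        iter->count++;
    }
    return 0;

fail:
    /* Undo only the levels that were incremented before the failure. */
    for (undo = uc; undo != iter; undo = undo->parent)
        undo->count--;
    return -ENOSPC;
}

/* Release one charge from `uc` and from every ancestor. */
static void ucount_model_dec(struct ucount_model *uc)
{
    for (; uc != NULL; uc = uc->parent) {
        assert(uc->count > 0);
        uc->count--;
    }
}
```

With a level whose `max` is 0, `ucount_model_inc()` fails immediately and leaves the parent untouched, which mirrors the "limit is 0" context mentioned in the message. A sibling level with a higher limit can still create namespaces until it hits its own limit or an ancestor's.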

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    Will create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts. Which should be perfect to catch misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
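
The arithmetic behind the 100,000 default can be checked with a short sketch. It assumes the simplified model implied by the example above, where each bind mount in the loop doubles the total starting from one mount. The function names are hypothetical and real mount propagation is more involved.

```c
#include <assert.h>

/* Default limit quoted in the commit message above. */
#define MOUNT_MAX_DEFAULT 100000UL

/*
 * Simplified model: starting from a single mount, every bind mount in
 * the loop doubles the number of mounts.
 */
static unsigned long mounts_after_binds(unsigned int binds)
{
    assert(binds < 8 * sizeof(unsigned long));
    return 1UL << binds;
}

/* Number of binds that fit before the total would exceed `limit`. */
static unsigned int binds_allowed(unsigned long limit)
{
    unsigned int binds = 0;

    while (binds + 1 < 8 * sizeof(unsigned long) &&
           mounts_after_binds(binds + 1) <= limit)
        binds++;
    return binds;
}
```

Under this model `mounts_after_binds(20)` gives the 1048576 figure from the message, while `binds_allowed(MOUNT_MAX_DEFAULT)` is 16, since 2^16 = 65536 fits under the limit and 2^17 = 131072 does not. The runaway loop is therefore cut off after a handful of iterations.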

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Sep, 2016

2 commits

  • After 7e8e385aaf6e ("x86/compat: Remove sys32_vm86_warning"), this
    function has become unused, so we can remove it as well.

    Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: Alexander Viro
    Cc: "Theodore Ts'o"
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Andrew Morton

    Arnd Bergmann
     
  • Propagate unsignedness for grand total of 149 bytes:

    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
    function old new delta
    set_close_on_exec 99 98 -1
    put_files_struct 201 200 -1
    get_close_on_exec 59 58 -1
    do_prlimit 498 497 -1
    do_execveat_common.isra 1662 1661 -1
    __close_fd 178 173 -5
    do_dup2 219 204 -15
    seq_show 685 660 -25
    __alloc_fd 384 357 -27
    dup_fd 718 646 -72

    It mostly comes from converting "unsigned int" to "long" for bit operations.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

27 Aug, 2016

1 commit

  • We have scripts which write to certain fields on 3.18 kernels but this
    seems to be failing on 4.4 kernels. An entry which we write to here is
    xfrm_aevent_rseqth which is u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
    {
    /* This cast causes a read beyond the size of data (u32). */
    i = (unsigned long *) data;
    /* vleft is 0 because maxlen is sizeof(u32), which is smaller than sizeof(unsigned long) on x86_64. */
    vleft = table->maxlen / sizeof(unsigned long);

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.
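
To make the range problem concrete, here is a userspace sketch of the two behaviours: a signed-int style store that rejects values above INT_MAX, and an unsigned-int style store that accepts the full u32 range. The function names are hypothetical and this is not the kernel's parsing code; it only models the accept/reject decision for non-negative decimal input.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal string; reject empty input, trailing junk and overflow. */
static int parse_ull(const char *s, unsigned long long *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    errno = 0;
    *out = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return -EINVAL;
    return 0;
}

/* Models a signed-int handler with overflow detection: above INT_MAX fails. */
static int model_store_int(const char *s, int *dst)
{
    unsigned long long v;

    if (parse_ull(s, &v) != 0 || v > INT_MAX)
        return -EINVAL;
    *dst = (int)v;
    return 0;
}

/* Models an unsigned-int handler: everything up to UINT_MAX is accepted. */
static int model_store_uint(const char *s, unsigned int *dst)
{
    unsigned long long v;

    if (parse_ull(s, &v) != 0 || v > UINT_MAX)
        return -EINVAL;
    *dst = (unsigned int)v;
    return 0;
}
```

Feeding "4294967295" to `model_store_int()` fails, which matches the failing `echo` shown earlier, while `model_store_uint()` accepts it. That gap is what a dedicated unsigned handler such as proc_douintvec is meant to close.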

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
     

03 Aug, 2016

1 commit

  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace.
    * on - unlimited logging from userspace
    * off - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting of "on" to "ratelimit" and we do
    that because we want to keep userspace spamming /dev/kmsg to sane
    levels. This is especially moot when a small kernel log buffer wraps
    around and messages get lost. So the ratelimiting setting should be a
    sane setting where kernel messages should have a bit higher chance of
    survival from all the spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).
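
The precedence between the command line and the sysctl can be sketched as a tiny state machine. This is a hypothetical model for illustration only: the type and function names are invented here, and the -1 error value stands in for whatever error the real sysctl handler reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum devkmsg_mode { DEVKMSG_RATELIMIT, DEVKMSG_ON, DEVKMSG_OFF };

/* Hypothetical state holder; the default mode is "ratelimit". */
struct devkmsg_model {
    enum devkmsg_mode mode;
    bool locked; /* set once the command line has chosen a mode */
};

/* Map one of the three documented option strings to a mode. */
static int parse_mode(const char *s, enum devkmsg_mode *out)
{
    assert(s != NULL && out != NULL);
    if (strcmp(s, "ratelimit") == 0)
        *out = DEVKMSG_RATELIMIT;
    else if (strcmp(s, "on") == 0)
        *out = DEVKMSG_ON;
    else if (strcmp(s, "off") == 0)
        *out = DEVKMSG_OFF;
    else
        return -1;
    return 0;
}

/* printk.devkmsg=<mode> on the command line: apply the mode and lock it. */
static int model_set_from_cmdline(struct devkmsg_model *m, const char *s)
{
    if (parse_mode(s, &m->mode) != 0)
        return -1;
    m->locked = true;
    return 0;
}

/* kernel.printk_devkmsg via sysctl: refused once the setting is locked. */
static int model_set_from_sysctl(struct devkmsg_model *m, const char *s)
{
    if (m->locked)
        return -1;
    return parse_mode(s, &m->mode);
}
```

A model initialised to `{ DEVKMSG_RATELIMIT, false }` accepts sysctl writes freely. After `model_set_from_cmdline()` succeeds, later sysctl writes fail and the mode stays as booted.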

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

29 Jul, 2016

1 commit

  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace is the same from a configuration perspective and will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jun, 2016

1 commit

  • It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, mainly when the problem is caused
    by the indirect starvation of rcu threads. For example, when preempt_rcu
    is not awakened due to the starvation of a timer softirq.

    We have been hard coding panic() in the RCU stall functions for
    some time while testing the kernel-rt. But this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel/tasks states, helping out to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 May, 2016

1 commit

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
    while what the user expects is that what is in the kernel.perf_event_max_stack
    sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
    honoured in terms of IP addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.
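
The difference between `nr` and the number of real addresses can be shown with a small userspace sketch that splits a callchain into instruction pointers and context markers. The marker values below are written from my reading of the perf_event uapi header and should be treated as assumptions; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Marker values as understood from the perf_event uapi header; they are
 * assumptions here, so prefer the header itself in real code.
 */
#define MODEL_PERF_CONTEXT_KERNEL ((uint64_t)-128)
#define MODEL_PERF_CONTEXT_USER   ((uint64_t)-512)
#define MODEL_PERF_CONTEXT_MAX    ((uint64_t)-4095)

struct callchain_counts {
    size_t ips;      /* real instruction pointers */
    size_t contexts; /* PERF_CONTEXT_* style markers */
};

/* Split the `nr` entries of a callchain into addresses and markers. */
static struct callchain_counts count_entries(const uint64_t *ip, size_t nr)
{
    struct callchain_counts c = { 0, 0 };
    size_t i;

    assert(ip != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        /* Markers occupy the top of the 64-bit address range. */
        if (ip[i] >= MODEL_PERF_CONTEXT_MAX)
            c.contexts++;
        else
            c.ips++;
    }
    return c;
}
```

For a chain holding a kernel marker, two kernel addresses, a user marker and one user address, `nr` is 5 but only 3 entries are addresses. A limit expressed in stack frames therefore needs the markers accounted for separately, which is the point of the extra context entries described above.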

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo