04 Nov, 2018

1 commit

  • Remove one include of .
    No functional changes.

    Link: http://lkml.kernel.org/r/20181004134223.17735-1-michael@schupikov.de
    Signed-off-by: Michael Schupikov
    Reviewed-by: Richard Weinberger
    Acked-by: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Schupikov
     

05 Sep, 2018

1 commit


24 Aug, 2018

1 commit

  • Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature; some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salvatore Mesoraca
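
The open() restriction described in the commit above can be modeled in user space roughly as follows. This is a simplified sketch built only from the wording of the commit message, not the kernel's implementation: `struct file_info`, `sticky_open_denied()` and the two `protected_*` parameters are hypothetical stand-ins for the real inode data and the two sysctl switches.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for S_ISVTX and S_IFIFO under strict modes */
#endif
#include <assert.h>
#include <fcntl.h>      /* O_CREAT */
#include <stdbool.h>
#include <stddef.h>     /* NULL */
#include <sys/stat.h>   /* S_ISFIFO, S_ISREG, S_ISVTX, S_IWOTH */
#include <sys/types.h>  /* uid_t, mode_t */

/* Hypothetical, simplified view of the inode fields the check needs. */
struct file_info {
    mode_t mode;  /* file type and permission bits */
    uid_t uid;    /* owner */
};

/*
 * Return true when the open should be refused under the rule described in
 * the commit message. protected_fifos / protected_regular model the two
 * sysctl switches (non-zero means the protection is enabled).
 */
static bool sticky_open_denied(const struct file_info *dir,
                               const struct file_info *file,
                               uid_t opener, int open_flags,
                               int protected_fifos, int protected_regular)
{
    assert(dir != NULL && file != NULL);

    /* Opens without O_CREAT are never restricted. */
    if (!(open_flags & O_CREAT))
        return false;

    /* Only FIFOs and regular files, and only if the matching knob is on. */
    if (S_ISFIFO(file->mode)) {
        if (!protected_fifos)
            return false;
    } else if (S_ISREG(file->mode)) {
        if (!protected_regular)
            return false;
    } else {
        return false;
    }

    /* Only world-writable sticky directories (such as /tmp) are affected. */
    if (!(dir->mode & S_ISVTX) || !(dir->mode & S_IWOTH))
        return false;

    /* The opener's own files and files owned by the directory owner pass. */
    if (file->uid == opener || file->uid == dir->uid)
        return false;

    return true;
}
```

With both switches enabled, a user opening another user's FIFO in a sticky, world-writable directory with O_CREAT is refused, while the same open without O_CREAT goes through.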
     

23 Aug, 2018

2 commits

  • Fix a few typos/spellos in kernel/sysctl.c.

    Link: http://lkml.kernel.org/r/bb09a8b9-f984-6dd4-b07b-3ecaf200862e@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Kees Cook
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Currently the task hung checking interval is equal to the timeout; as a
    result, a hang is detected anywhere between timeout and 2*timeout. This
    is fine for most interactive environments, but it hurts automated
    testing setups (syzbot). In an automated setup we need to strictly order
    CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so
    that an RCU stall is not detected as a task hung and a task hung is not
    detected as silent machine loss. The large variance in task hung
    detection time requires setting the silent machine loss timeout to a
    very large value (e.g. if task hung is 3 mins, then silent loss needs to
    be set to ~7 mins). The additional 3 minutes significantly reduce
    testing efficiency because we usually crash the kernel within a minute,
    and this can add hours to the bug localization process, which needs to
    run dozens of tests.

    Allow setting the checking interval separately from the timeout. This
    allows setting the timeout to, say, 3 minutes, but the checking interval
    to 10 secs.

    The interval is controlled via a new hung_task_check_interval_secs sysctl,
    similar to the existing hung_task_timeout_secs sysctl. The default value
    of 0 results in the current behavior: checking interval is equal to
    timeout.

    [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
    Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Paul E. McKenney
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
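
The effect of the new sysctl on detection latency can be worked through with a small calculation. The sketch below is an approximation derived from the commit message (detection falls between timeout and timeout plus one checking interval); the function names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>

/*
 * Effective checking interval in seconds. Per the commit message, a
 * hung_task_check_interval_secs value of 0 keeps the old behavior, where
 * the checking interval equals the timeout.
 */
static unsigned long effective_interval(unsigned long timeout_secs,
                                        unsigned long check_interval_secs)
{
    return check_interval_secs ? check_interval_secs : timeout_secs;
}

/*
 * Approximate worst-case time until a hung task is reported: the task may
 * cross the timeout just after one check and is only seen at the next one.
 */
static unsigned long worst_case_detection_secs(unsigned long timeout_secs,
                                               unsigned long check_interval_secs)
{
    assert(timeout_secs > 0);
    return timeout_secs +
           effective_interval(timeout_secs, check_interval_secs);
}
```

For a 180 second timeout this gives a worst case of about 360 seconds with the default of 0, and about 190 seconds with a 10 second checking interval, which is why the silent-loss timeout can be set much closer to the task hung timeout.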
     

16 Jul, 2018

1 commit

  • /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere,
    remove it.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:

    kcalloc(a, b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
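
The reason for preferring a two-argument allocator or array3_size() over an open-coded product is that the multiplication can wrap around and yield an allocation that is too small. A minimal user-space sketch of the same idea follows; `mul_overflow()`, `array3_size_model()` and `zalloc_3d()` are hypothetical names for illustration, and calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* calloc */

/* Store a * b in *out and return false, or return true if it would wrap. */
static bool mul_overflow(size_t a, size_t b, size_t *out)
{
    assert(out != NULL);
    if (a != 0 && b > SIZE_MAX / a)
        return true;
    *out = a * b;
    return false;
}

/*
 * Model of a three-factor size helper: the product of the factors, or
 * SIZE_MAX if any intermediate multiplication would overflow.
 */
static size_t array3_size_model(size_t a, size_t b, size_t c)
{
    size_t bytes;

    if (mul_overflow(a, b, &bytes) || mul_overflow(bytes, c, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Zeroed allocation of count * stride * size bytes; NULL on overflow. */
static void *zalloc_3d(size_t count, size_t stride, size_t size)
{
    size_t bytes = array3_size_model(count, stride, size);

    if (bytes == SIZE_MAX)
        return NULL;    /* refuse instead of allocating a wrapped size */
    return calloc(1, bytes);
}
```

Saturating to SIZE_MAX means an overflowing request fails cleanly rather than succeeding with a buffer smaller than the caller expects.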
     

12 Apr, 2018

2 commits

  • Kdoc comments are added to the do_proc_dointvec_minmax_conv_param and
    do_proc_douintvec_minmax_conv_param structures that are used internally
    for range checking.

    The error codes returned by proc_dointvec_minmax() and
    proc_douintvec_minmax() are also documented.

    Link: http://lkml.kernel.org/r/1519926220-7453-3-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Acked-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Matthew Wilcox
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Fix sizeof argument to be the same as the data variable name. Probably
    a copy/paste error.

    Mostly harmless since both variables are unsigned int.

    Fixes kernel bugzilla #197371:
    Possible access to unintended variable in "kernel/sysctl.c" line 1339
    https://bugzilla.kernel.org/show_bug.cgi?id=197371

    Link: http://lkml.kernel.org/r/e0d0531f-361e-ef5f-8499-32743ba907e1@infradead.org
    Signed-off-by: Randy Dunlap
    Reported-by: Petru Mihancea
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

20 Mar, 2018

1 commit

  • Currently one needs to test four kernel configurations to cover the
    firmware API completely:

    0)
    CONFIG_FW_LOADER=y

    1)
    o CONFIG_FW_LOADER=y
    o CONFIG_FW_LOADER_USER_HELPER=y

    2)
    o CONFIG_FW_LOADER=y
    o CONFIG_FW_LOADER_USER_HELPER=y
    o CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y

    3) When CONFIG_FW_LOADER=m the built-in stuff is disabled, we have
    no current tests for this.

    We can reduce the requirements to three kernel configurations by making
    fw_config.force_sysfs_fallback a proc knob we can flip on and off. For
    kernels that disable CONFIG_IKCONFIG_PROC this also lets one inspect
    whether CONFIG_FW_LOADER_USER_HELPER_FALLBACK was enabled at build time
    by checking the proc value at boot time.

    Acked-by: Kees Cook
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Greg Kroah-Hartman

    Luis R. Rodriguez
     

07 Feb, 2018

3 commits

  • A pipe's size is represented as an 'unsigned int'. As expected, writing a
    value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
    EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to
    32 bits, rather than failing with EINVAL as expected. (It *does* fail
    with EINVAL for values above (1 << 31) but
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • pipe_proc_fn() is no longer needed, as it only calls through to
    proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table
    entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
    stub for it.

    (The reason the ENOSYS stub isn't needed is that the pipe-max-size
    ctl_table entry is located directly in 'kern_table' rather than being
    registered separately. Therefore, the entry is already only defined when
    the kernel is built with sysctl support.)

    Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Patch series "pipe: buffer limits fixes and cleanups", v2.

    This series simplifies the sysctl handler for pipe-max-size and fixes
    another set of bugs related to the pipe buffer limits:

    - The root user wasn't allowed to exceed the limits when creating new
    pipes.

    - There was an off-by-one error when checking the limits, so a limit of
    N was actually treated as N - 1.

    - F_SETPIPE_SZ accepted values over UINT_MAX.

    - Reading the pipe buffer limits could be racy.

    This patch (of 7):

    Before validating the given value against pipe_min_size,
    do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
    value up to pipe_min_size. Therefore, the second check against
    pipe_min_size is redundant. Remove it.

    Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
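
Why the second check against pipe_min_size is redundant can be seen from a small model of the rounding step. The sketch below is an assumption-laden illustration, not the kernel's round_pipe_size(): it assumes the minimum pipe size is one page of 4096 bytes, that sizes are rounded up to a power of two, and that a size above 2^31 is reported as 0; `MODEL_PAGE_SIZE`, `roundup_pow2()` and `round_pipe_size_model()` are hypothetical names.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size and minimum pipe size */

/* Round up to the next power of two; v must be in 1..2^31. */
static unsigned long roundup_pow2(unsigned long v)
{
    unsigned long p = 1;

    assert(v >= 1 && v <= (1UL << 31));
    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Model of the rounding step: sizes below the minimum become the minimum,
 * larger sizes are rounded up to a power of two. Sizes above 2^31 yield 0,
 * which a caller would treat as invalid.
 */
static unsigned int round_pipe_size_model(unsigned long size)
{
    if (size > (1UL << 31))
        return 0;
    if (size < MODEL_PAGE_SIZE)
        return (unsigned int)MODEL_PAGE_SIZE;
    return (unsigned int)roundup_pow2(size);
}
```

Every valid result of the model is already at least the minimum, so comparing the rounded value against the minimum again can never fail.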
     

01 Feb, 2018

1 commit

  • hugepages_treat_as_movable was introduced by 396faf0303d2 ("Allow
    huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
    allocations from ZONE_MOVABLE even when hugetlb pages were not
    migrateable. The purpose of the movable zone was different at the time.
    It aimed at reducing memory fragmentation, and hugetlb pages, being long
    lived and large, were not contributing to the fragmentation, so it was
    acceptable to use the zone back then.

    Things have changed though, and the primary purpose of the zone became
    a migratability guarantee. If we allow non migrateable hugetlb pages to
    be in ZONE_MOVABLE, memory hotplug might fail to offline the memory.

    Remove the knob and only rely on hugepage_migration_supported to allow
    movable zones.

    Mel said:

    : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
    : the ability to grow it again. The use case was for batched jobs, some of
    : which needed huge pages and others that did not but didn't want the memory
    : useless pinned in the huge pages pool.
    :
    : I suspect that more users rely on THP than hugetlbfs for flexible use of
    : huge pages with fallback options so I think that removing the option
    : should be ok.

    Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Alexandru Moise
    Acked-by: Mel Gorman
    Cc: Alexandru Moise
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Nov, 2017

4 commits

  • Remove unnecessary else block, remove redundant return and call to kfree
    in if block.

    Link: http://lkml.kernel.org/r/1510238435-1655-1-git-send-email-mail@okal.no
    Signed-off-by: Ola N. Kaldestad
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ola N. Kaldestad
     
  • Mikulas noticed in the existing do_proc_douintvec_minmax_conv() and
    do_proc_dopipe_max_size_conv() introduced in this patchset, that they
    inconsistently handle overflow and min/max range inputs:

    For example:

    0 ... param->min - 1 ---> ERANGE
    param->min ... param->max ---> the value is accepted
    param->max + 1 ... 0x100000000L + param->min - 1 ---> ERANGE
    0x100000000L + param->min ... 0x100000000L + param->max ---> EINVAL
    0x100000000L + param->max + 1 ... 0x200000000L + param->min - 1 ---> ERANGE
    0x200000000L + param->min ... 0x200000000L + param->max ---> EINVAL
    0x200000000L + param->max + 1 ... 0x300000000L + param->min - 1 ---> ERANGE

    In do_proc_do*() routines which store values into unsigned int variables
    (4 bytes wide for 64-bit builds), first validate that the input unsigned
    long value (8 bytes wide for 64-bit builds) will fit inside the smaller
    unsigned int variable. Then check that the unsigned int value falls
    inside the specified parameter min, max range. Otherwise the unsigned
    long -> unsigned int conversion drops leading bits from the input value,
    leading to the inconsistent pattern Mikulas documented above.

    Link: http://lkml.kernel.org/r/1507658689-11669-5-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Joe Lawrence
    Reported-by: Mikulas Patocka
    Reviewed-by: Mikulas Patocka
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Lawrence
     
  • pipe_max_size is assigned directly via procfs sysctl:

    static struct ctl_table fs_table[] = {
    ...
    {
    .procname = "pipe-max-size",
    .data = &pipe_max_size,
    .maxlen = sizeof(int),
    .mode = 0644,
    .proc_handler = &pipe_proc_fn,
    .extra1 = &pipe_min_size,
    },
    ...

    int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
    size_t *lenp, loff_t *ppos)
    {
    ...
    ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);
    ...

    and then rounded in place a few statements later:

    ...
    pipe_max_size = round_pipe_size(pipe_max_size);
    ...

    This leaves a window of time between initial assignment and rounding
    that may be visible to other threads. (For example, one thread sets a
    non-rounded value to pipe_max_size while another reads its value.)

    Similar reads of pipe_max_size are potentially racy:

    pipe.c :: alloc_pipe_info()
    pipe.c :: pipe_set_size()

    Add a new proc_dopipe_max_size() that consolidates reading the new value
    from the user buffer, verifying bounds, and calling round_pipe_size()
    with a single assignment to pipe_max_size.

    Link: http://lkml.kernel.org/r/1507658689-11669-4-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Joe Lawrence
    Reported-by: Mikulas Patocka
    Reviewed-by: Mikulas Patocka
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Lawrence
     
  • Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

    While backporting Michael's "pipe: fix limit handling" patchset to a
    distro-kernel, Mikulas noticed that current upstream pipe limit handling
    contains a few problems:

    1 - procfs signed wrap: echo'ing a large number into
    /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
    negative value.

    2 - round_pipe_size() nr_pages overflow on 32bit: this would
    subsequently try roundup_pow_of_two(0), which is undefined.

    3 - visible non-rounded pipe-max-size value: there is no mutual
    exclusion or protection between the time pipe_max_size is assigned
    a raw value from proc_dointvec_minmax() and when it is rounded.

    4 - unsigned long -> unsigned int conversion makes for potential odd
    return errors from do_proc_douintvec_minmax_conv() and
    do_proc_dopipe_max_size_conv().

    This version underwent the same testing as v1:
    https://marc.info/?l=linux-kernel&m=150643571406022&w=2

    This patch (of 4):

    pipe_max_size is defined as an unsigned int:

    unsigned int pipe_max_size = 1048576;

    but its procfs/sysctl representation is an integer:

    static struct ctl_table fs_table[] = {
    ...
    {
    .procname = "pipe-max-size",
    .data = &pipe_max_size,
    .maxlen = sizeof(int),
    .mode = 0644,
    .proc_handler = &pipe_proc_fn,
    .extra1 = &pipe_min_size,
    },
    ...

    that is signed:

    int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
    size_t *lenp, loff_t *ppos)
    {
    ...
    ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);

    This leads to signed results via procfs for large values of pipe_max_size:

    % echo 2147483647 >/proc/sys/fs/pipe-max-size
    % cat /proc/sys/fs/pipe-max-size
    -2147483648

    Use unsigned operations on this variable to avoid such negative values.

    Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Joe Lawrence
    Reported-by: Mikulas Patocka
    Reviewed-by: Mikulas Patocka
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Lawrence
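
The two-stage validation described in the overflow-handling commit of this series (first confirm that the unsigned long input fits in an unsigned int, then apply the min/max range) can be sketched as below. `store_uint_checked()` is a hypothetical user-space model, and the choice of -EINVAL for values that do not fit and -ERANGE for values outside the range is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>   /* NULL */

/*
 * Validate a parsed unsigned long before storing it in an unsigned int.
 * Returns 0 on success, -EINVAL if the value does not fit in an unsigned
 * int, and -ERANGE if it fits but lies outside [min, max].
 */
static int store_uint_checked(unsigned long input, unsigned int min,
                              unsigned int max, unsigned int *out)
{
    unsigned int val;

    assert(out != NULL && min <= max);

    /* Step 1: reject values that would lose high bits when narrowed. */
    if (input > UINT_MAX)
        return -EINVAL;

    /* Step 2: only now compare the narrowed value against the range. */
    val = (unsigned int)input;
    if (val < min || val > max)
        return -ERANGE;

    *out = val;
    return 0;
}
```

Checking the width first means an input such as 0x100000000 + min is rejected outright instead of being truncated into the accepted range, which removes the alternating pattern listed in that commit message.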
     

16 Nov, 2017

2 commits

  • This is the second step, which introduces a tunable interface that makes
    numa stats configurable for optimizing zone_statistics(), as suggested
    by Dave Hansen and Ying Huang.

    =========================================================================

    When page allocation performance becomes a bottleneck and you can
    tolerate some possible tool breakage and decreased numa counter
    precision, you can do:

    echo 0 > /proc/sys/vm/numa_stat

    In this case, numa counter update is ignored. We can see about
    *4.8%*(185->176) drop of cpu cycles per single page allocation and
    reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
    drop of cpu cycles per single page allocation and reclaim on Jesper's
    page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
    (88 threads, 126G memory).

    Benchmark link provided by Jesper D Brouer (increase loop times to
    10000000):

    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    =========================================================================

    When page allocation performance is not a bottleneck and you want all
    tooling to work, you can do:

    echo 1 > /proc/sys/vm/numa_stat

    This is the system default setting.

    Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
    for comments to help improve the original patch.

    [keescook@chromium.org: make sure mutex is a global static]
    Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
    Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Signed-off-by: Kees Cook
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Luis R . Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Sebastian Andrzej Siewior
    Cc: Andrey Ryabinin
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

06 Oct, 2017

1 commit

  • Pull watchdog clean-up and fixes from Thomas Gleixner:
    "The watchdog (hard/softlockup detector) code is pretty much broken in
    its current state. The patch series addresses this by removing all
    duct tape and refactoring it into a workable state.

    The reasons why I ask for inclusion that late in the cycle are:

    1) The code causes lockdep splats vs. hotplug locking which get
    reported over and over. Unfortunately there is no easy fix.

    2) The risk of breakage is minimal because it's already broken

    3) As 4.14 is a long term stable kernel, I prefer to have working
    watchdog code in that and the lockdep issues resolved. I wouldn't
    ask you to pull if 4.14 wouldn't be a LTS kernel or if the
    solution would be easy to backport.

    4) The series was around before the merge window opened, but then got
    delayed due to the UP failure caused by the for_each_cpu()
    surprise which we discussed recently.

    Changes vs. V1:

    - Addressed your review points

    - Addressed the warning in the powerpc code which was discovered late

    - Changed two function names which made sense up to a certain point
    in the series. Now they match what they do in the end.

    - Fixed an 'unused variable' warning, which was not detected by the
    Intel robot. I triggered it when trying all possible related config
    combinations manually. Randconfig testing seems not random enough.

    The changes have been tested by and reviewed by Don Zickus and tested
    and acked by Michael Ellerman for powerpc"

    * 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    watchdog/core: Put softlockup_threads_initialized under ifdef guard
    watchdog/core: Rename some softlockup_* functions
    powerpc/watchdog: Make use of watchdog_nmi_probe()
    watchdog/core, powerpc: Lock cpus across reconfiguration
    watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
    watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
    watchdog/hardlockup/perf: Cure UP damage
    watchdog/hardlockup: Clean up hotplug locking mess
    watchdog/hardlockup/perf: Simplify deferred event destroy
    watchdog/hardlockup/perf: Use new perf CPU enable mechanism
    watchdog/hardlockup/perf: Implement CPU enable replacement
    watchdog/hardlockup/perf: Implement init time detection of perf
    watchdog/hardlockup/perf: Implement init time perf validation
    watchdog/core: Get rid of the racy update loop
    watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
    watchdog/sysctl: Clean up sysctl variable name space
    watchdog/sysctl: Get rid of the #ifdeffery
    watchdog/core: Clean up header mess
    watchdog/core: Further simplify sysctl handling
    watchdog/core: Get rid of the thread teardown/setup dance
    ...

    Linus Torvalds
     

05 Oct, 2017

1 commit

  • This tunable has been obsolete since 2.6.32, and writes to the
    file have been failing and complaining in dmesg since then:

    nr_pdflush_threads exported in /proc is scheduled for removal

    That was 8 years ago. Remove the file ABI obsolete notice, and
    the sysfs file.

    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Oct, 2017

1 commit

  • do_proc_douintvec_conv() has two UINT_MAX checks, we can remove one.
    This has no functional changes other than fixing a compiler warning:

    kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false

    Fixes: 4f2fec00afa60 ("sysctl: simplify unsigned int support")
    Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Reported-by: David Binderman
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

29 Sep, 2017

1 commit

  • The system will hang if the user sets sysctl_sched_time_avg to 0:

    [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

    Stack traceback for pid 0
    0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
    ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
    0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
    ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
    Call Trace:
    [] ? update_group_capacity+0x110/0x200
    [] ? update_sd_lb_stats+0x109/0x600
    [] ? find_busiest_group+0x47/0x530
    [] ? load_balance+0x194/0x900
    [] ? update_rq_clock.part.83+0x1a/0xe0
    [] ? rebalance_domains+0x152/0x290
    [] ? run_rebalance_domains+0xdc/0x1d0
    [] ? __do_softirq+0xfb/0x320
    [] ? irq_exit+0x125/0x130
    [] ? scheduler_ipi+0x97/0x160
    [] ? smp_reschedule_interrupt+0x29/0x30
    [] ? reschedule_interrupt+0x6e/0x80
    [] ? cpuidle_enter_state+0xcc/0x230
    [] ? cpuidle_enter_state+0x9c/0x230
    [] ? cpuidle_enter+0x17/0x20
    [] ? cpu_startup_entry+0x38c/0x420
    [] ? start_secondary+0x173/0x1e0

    This happens because a divide-by-zero error occurs in this call chain:

    update_group_capacity()
    update_cpu_capacity()
    scale_rt_capacity()
    {
    ...
    total = sched_avg_period() + delta;
    used = div_u64(avg, total);
    ...
    }

    To fix this issue, check user input value of sysctl_sched_time_avg, keep
    it unchanged when hitting invalid input, and set the minimum limit of
    sysctl_sched_time_avg to 1 ms.
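
    A minimal userspace sketch of the guard described above, not the actual
    kernel patch. The names sched_time_avg_ms, SCHED_TIME_AVG_MIN_MS,
    set_sched_time_avg_ms() and scaled_usage() are hypothetical, and the
    divisor is simplified to the period alone:

```c
#include <errno.h>

/* Hypothetical stand-in for sysctl_sched_time_avg, in milliseconds. */
static unsigned int sched_time_avg_ms = 1000;

/* Assumed lower bound, mirroring the 1 ms minimum described above. */
#define SCHED_TIME_AVG_MIN_MS 1u

/*
 * Accept a new value only if it is at least the minimum. On invalid
 * input the stored value is left unchanged and -EINVAL is returned.
 */
static int set_sched_time_avg_ms(unsigned int new_ms)
{
	if (new_ms < SCHED_TIME_AVG_MIN_MS)
		return -EINVAL;
	sched_time_avg_ms = new_ms;
	return 0;
}

/*
 * Simplified version of the division that faulted: the period is used
 * as the divisor (the real code also adds a delta), so it must never
 * be zero.
 */
static unsigned long long scaled_usage(unsigned long long avg)
{
	unsigned long long total = sched_time_avg_ms;

	return avg / total;
}
```

    With the check in place, a write of 0 is rejected before it can reach
    the division.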

    Reported-by: James Puthukattukaran
    Signed-off-by: Ethan Zhao
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: efault@gmx.de
    Cc: ethan.kernel@gmail.com
    Cc: keescook@chromium.org
    Cc: mcgrof@kernel.org
    Cc:
    Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
    Signed-off-by: Ingo Molnar

    Ethan Zhao
     

14 Sep, 2017

2 commits

  • Reflect that these variables are user interface related and remove the
    whitespace damage in the sysctl table while at it.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Don Zickus
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Metcalf
    Cc: Linus Torvalds
    Cc: Nicholas Piggin
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Ulrich Obergfell
    Link: http://lkml.kernel.org/r/20170912194147.783210221@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The sysctl of the nmi_watchdog file prevents writes by setting:

    min = max = 0

    if none of the users is enabled. That involves ifdeffery and is completely
    non-obvious.

    If none of the facilities is enabled, then the file can simply be made read
    only. Move the ifdeffery into the header and use a constant for file
    permissions.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Don Zickus
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Metcalf
    Cc: Linus Torvalds
    Cc: Nicholas Piggin
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Ulrich Obergfell
    Link: http://lkml.kernel.org/r/20170912194147.706073616@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 Jul, 2017

5 commits

  • Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
    HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

    LOCKUP_DETECTOR implies the general boot, sysctl, and programming
    interfaces for the lockup detectors.

    An architecture that wants to use a hard lockup detector must define
    HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

    Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
    minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
    does not implement the LOCKUP_DETECTOR interfaces.

    sparc is unusual in that it has started to implement some of the
    interfaces, but not fully yet. It should probably be converted to a full
    HAVE_HARDLOCKUP_DETECTOR_ARCH.

    [npiggin@gmail.com: fix]
    Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
    Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • To keep parity with regular int interfaces, provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start adding support for
    unsigned int; this, however, was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value; this happens because
    do_proc_dointvec() is used, and it uses proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").
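
    As an illustration of the two input problems listed above (wrap-around
    and accepted negative values), here is a userspace sketch of a strict
    parser. parse_uint_strict() is a hypothetical helper, not a kernel
    function; it assumes a 32-bit unsigned int and does not reject trailing
    characters:

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse an unsigned int from a string, rejecting negative numbers and
 * anything larger than UINT_MAX instead of letting the value wrap.
 */
static int parse_uint_strict(const char *s, unsigned int *out)
{
	char *end;
	unsigned long long v;

	/* strtoull() would silently negate "-1", so look for the sign first. */
	while (isspace((unsigned char)*s))
		s++;
	if (*s == '-')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno || end == s || v > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}
```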

    It also was never added to sysctl_check_table(). Instead of adding it
    with the current implementation, just provide proper and simplified
    unsigned int support, without any array support and with no negative
    support at all.

    Historically sysctl proc helpers have supported arrays. Due to the
    complexity this adds, though, we've taken a step back to evaluate array
    users to determine if it's worth maintaining for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c - nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c - proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth it to add array support just for this, given that the range
    support works just as well. Unsigned int support is mainly desirable
    when you *need* more than INT_MAX, or when int min/max support does
    not suffice for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array not allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The mode sysctl_writes_strict positional checks keep being copy and pasted
    as we add new proc handlers. Just add a helper to avoid code duplication.
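
    A userspace sketch of what such a helper might look like. The enum
    values, the mode variable and first_pos_non_zero_ignore() are
    hypothetical names, and the behaviour per mode (strict ignores writes
    at a non-zero position, warn only logs, legacy stays silent) is an
    assumption about the sysctl_writes_strict semantics, not text from this
    commit:

```c
#include <stdbool.h>

/* Assumed modes of sysctl_writes_strict. */
enum writes_mode {
	WRITES_LEGACY = -1,	/* old behaviour, no warning */
	WRITES_WARN   = 0,	/* old behaviour, but warn */
	WRITES_STRICT = 1,	/* only honour writes at position 0 */
};

static enum writes_mode writes_mode = WRITES_STRICT;
static unsigned int warn_count;	/* stands in for a kernel warning */

/*
 * Return true when a write that starts at a non-zero file position
 * should be ignored. Each proc handler can call this one helper
 * instead of repeating the same switch statement.
 */
static bool first_pos_non_zero_ignore(long long pos)
{
	if (pos == 0)
		return false;

	switch (writes_mode) {
	case WRITES_STRICT:
		return true;
	case WRITES_WARN:
		warn_count++;
		return false;
	default:
		return false;
	}
}
```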

    Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Document the different sysctl_writes_strict modes in code.

    Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

09 May, 2017

1 commit

  • do_proc_dointvec_jiffies_conv() uses LONG_MAX/HZ as the max value to
    avoid overflow. But *valp is actually of type int, so it can still
    overflow.

    For example,

    echo 2147483647 > ./sys/net/ipv4/tcp_keepalive_time

    Then,

    cat ./sys/net/ipv4/tcp_keepalive_time

    The output is "-1", which is not expected.

    Now use INT_MAX/HZ as the max value instead of LONG_MAX/HZ to fix it.
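
    The arithmetic can be reproduced in userspace. This is a sketch under
    stated assumptions: HZ is fixed at 1000, a 64-bit integer plays the role
    of long, inputs are non-negative, and all function names are
    hypothetical:

```c
#include <stdint.h>

#define HZ 1000	/* assumed tick rate; in the kernel this is a build option */

/* Keep the low 32 bits of x and read them as a signed value. */
static int32_t wrap_to_int32(int64_t x)
{
	uint32_t u = (uint32_t)x;	/* well-defined, modulo 2^32 */

	return (u & 0x80000000u) ? -(int32_t)(~u) - 1 : (int32_t)u;
}

/* Old bound: only protects the 64-bit product, not the int it lands in. */
static int seconds_to_jiffies_old(int64_t secs, int32_t *valp)
{
	if (secs > INT64_MAX / HZ)
		return 1;
	*valp = wrap_to_int32(secs * HZ);
	return 0;
}

/* New bound: the product is guaranteed to fit into an int. */
static int seconds_to_jiffies_new(int64_t secs, int32_t *valp)
{
	if (secs > INT32_MAX / HZ)
		return 1;
	*valp = (int32_t)(secs * HZ);
	return 0;
}
```

    With the old bound, writing 2147483647 stores -1000 jiffies, which reads
    back as -1000 / HZ = -1; with the new bound the write is rejected.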

    Link: http://lkml.kernel.org/r/1490109532-9228-1-git-send-email-fgao@ikuai8.com
    Signed-off-by: Gao Feng
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Eric Dumazet
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao Feng
     

02 May, 2017

1 commit

  • Pull timer updates from Thomas Gleixner:
    "The timer department delivers:

    - more year 2038 rework

    - a massive rework of the arm architected timer

    - preparatory patches to allow NTP correction of clock event devices
    to avoid early expiry

    - the usual pile of fixes and enhancements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
    timer/sysclt: Restrict timer migration sysctl values to 0 and 1
    arm64/arch_timer: Mark errata handlers as __maybe_unused
    Clocksource/mips-gic: Remove redundant non devicetree init
    MIPS/Malta: Probe gic-timer via devicetree
    clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
    acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
    clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
    acpi/arm64: Add memory-mapped timer support in GTDT driver
    clocksource: arm_arch_timer: simplify ACPI support code.
    acpi/arm64: Add GTDT table parse driver
    clocksource: arm_arch_timer: split MMIO timer probing.
    clocksource: arm_arch_timer: add structs to describe MMIO timer
    clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
    clocksource: arm_arch_timer: refactor arch_timer_needs_probing
    clocksource: arm_arch_timer: split dt-only rate handling
    x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
    unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
    um/time: Set ->min_delta_ticks and ->max_delta_ticks
    tile/time: Set ->min_delta_ticks and ->max_delta_ticks
    score/time: Set ->min_delta_ticks and ->max_delta_ticks
    ...

    Linus Torvalds
     

20 Apr, 2017

1 commit

  • timer_migration sysctl acts as a boolean switch, so the allowed values
    should be restricted to 0 and 1.

    Add the necessary extra fields to the sysctl table entry to enforce that.
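
    A userspace sketch of how min/max bounds attached to a table entry can
    enforce this. struct mock_ctl_table and mock_write_minmax() are
    hypothetical, cut-down stand-ins and not the kernel's struct ctl_table
    or its handlers; only the idea of two extra bound fields is taken from
    the text above:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced table entry: a value plus optional bounds. */
struct mock_ctl_table {
	int *data;
	const int *extra1;	/* lower bound, or NULL for none */
	const int *extra2;	/* upper bound, or NULL for none */
};

static const int zero = 0;
static const int one = 1;
static int timer_migration = 1;

/* Entry restricted to the boolean range [0, 1]. */
static struct mock_ctl_table timer_migration_entry = {
	.data = &timer_migration,
	.extra1 = &zero,
	.extra2 = &one,
};

/* Store val only if it lies within the entry's bounds. */
static int mock_write_minmax(struct mock_ctl_table *t, int val)
{
	if ((t->extra1 && val < *t->extra1) ||
	    (t->extra2 && val > *t->extra2))
		return -EINVAL;
	*t->data = val;
	return 0;
}
```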

    [ tglx: Rewrote changelog ]

    Signed-off-by: Myungho Jung
    Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
    Signed-off-by: Thomas Gleixner

    Myungho Jung
     

09 Apr, 2017

1 commit

  • Currently, inputting the following command will succeed but actually the
    value will be truncated:

    # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

    This is not friendly to the user, so instead, we should report an error
    when the value is larger than UINT_MAX.
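
    The truncation itself is easy to show in isolation. A small sketch,
    assuming a 32-bit unsigned int; store_truncating() and store_checked()
    are hypothetical names:

```c
#include <errno.h>
#include <limits.h>

/* Unchecked store: the conversion silently keeps only the low bits. */
static unsigned int store_truncating(unsigned long long val)
{
	return (unsigned int)val;
}

/* Checked store: values above UINT_MAX are reported as an error. */
static int store_checked(unsigned long long val, unsigned int *out)
{
	if (val > UINT_MAX)
		return -EINVAL;
	*out = (unsigned int)val;
	return 0;
}
```

    Under that assumption, 0x12ffffffff passed to the unchecked version
    becomes 0xffffffff, while the checked version returns -EINVAL.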

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Cc: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

08 Apr, 2017

1 commit

  • I saw some very confusing sysctl output on my system:
    # cat /proc/sys/net/core/xfrm_aevent_rseqth
    -2
    # cat /proc/sys/net/core/xfrm_aevent_etime
    -10
    # cat /proc/sys/net/ipv4/tcp_notsent_lowat
    -4294967295

    This happens because we forget to set the *negp flag in proc_douintvec,
    so it ends up holding a garbage value.

    Since the value handled by proc_douintvec is always an unsigned integer,
    we can set *negp to false explicitly to fix this issue.

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

02 Mar, 2017

1 commit


01 Feb, 2017

1 commit

  • We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

    ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

    ... whose name suggests to users that it's in milliseconds, while in reality
    it's being set in milliseconds but the result is shown in jiffies.

    This is obviously confusing when HZ is not 1000: it makes it appear as if
    setting the value failed, for example with HZ=100:

    root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
    root# cat /proc/sys/kernel/sched_rr_timeslice_ms
    10

    Fix this to be milliseconds all around.
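
    The mismatch comes down to a unit conversion applied in only one
    direction. A sketch assuming HZ=100, as in the example above; the helper
    names are hypothetical and overflow for very large inputs is ignored:

```c
#define HZ 100	/* assumed tick rate, matching the HZ=100 example */

/* Convert milliseconds to jiffies, rounding up. */
static unsigned int ms_to_jiffies(unsigned int ms)
{
	return (ms * HZ + 999) / 1000;
}

/* Convert jiffies back to milliseconds (HZ is assumed to divide 1000). */
static unsigned int jiffies_to_ms(unsigned int j)
{
	return j * (1000 / HZ);
}

/* Old read path: shows the raw jiffies value. */
static unsigned int show_old(unsigned int stored_jiffies)
{
	return stored_jiffies;
}

/* Fixed read path: converts back so the user sees milliseconds. */
static unsigned int show_fixed(unsigned int stored_jiffies)
{
	return jiffies_to_ms(stored_jiffies);
}
```

    Writing 100 stores 10 jiffies; the old read path prints 10, while the
    fixed one prints 100 again.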

    Signed-off-by: Shile Zhang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
    Signed-off-by: Ingo Molnar

    Shile Zhang
     

27 Jan, 2017

1 commit

  • We perform the conversion between kernel jiffies and ms only when
    exporting the kernel value to user space.

    We need to do the opposite operation when a value is written by the user.

    Only matters when HZ != 1000

    Signed-off-by: Eric Dumazet
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet