10 Mar, 2019

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "A fairly routine cycle for docs - lots of typo fixes, some new
    documents, and more translations. There's also some LICENSES
    adjustments from Thomas"

    * tag 'docs-5.1' of git://git.lwn.net/linux: (74 commits)
    docs: Bring some order to filesystem documentation
    Documentation/locking/lockdep: Drop last two chars of sample states
    doc: rcu: Suspicious RCU usage is a warning
    docs: driver-api: iio: fix errors in documentation
    Documentation/process/howto: Update for 4.x -> 5.x versioning
    docs: Explicitly state that the 'Fixes:' tag shouldn't split lines
    doc: security: Add kern-doc for lsm_hooks.h
    doc: sctp: Merge and clean up rst files
    Docs: Correct /proc/stat path
    scripts/spdxcheck.py: fix C++ comment style detection
    doc: fix typos in license-rules.rst
    Documentation: fix admin-guide/README.rst minimum gcc version requirement
    doc: process: complete removal of info about -git patches
    doc: translations: sync translations 'remove info about -git patches'
    perf-security: wrap paragraphs on 72 columns
    perf-security: elaborate on perf_events/Perf privileged users
    perf-security: document collected perf_events/Perf data categories
    perf-security: document perf_events/Perf resource control
    sysfs.txt: add note on available attribute macros
    docs: kernel-doc: typo "if ... if" -> "if ... is"
    ...

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

16 Feb, 2019

1 commit

  • The netfilter conflicts were rather simple overlapping
    changes.

    However, the cls_tcindex.c stuff was a bit more complex.

    On the 'net' side, Cong is fixing several races and memory
    leaks. Whilst on the 'net-next' side we have Vlad adding
    the rtnl-ness support.

    What I've decided to do, in order to resolve this, is revert the
    conversion over to using a workqueue that Cong did, bringing us back
    to pure RCU. I did it this way because I believe that either Cong's
    races don't apply with how Vlad did things, or Cong will have to
    implement the race fix slightly differently.

    Signed-off-by: David S. Miller

    David S. Miller
     

12 Feb, 2019

1 commit


11 Feb, 2019

1 commit


09 Feb, 2019

1 commit


05 Feb, 2019

1 commit


31 Jan, 2019

1 commit

  • The current dentry number tracking code doesn't distinguish between
    positive & negative dentries. It just reports the total number of
    dentries in the LRU lists.

    As an excessive number of negative dentries can have an impact on system
    performance, it is wise to track the number of positive and negative
    dentries separately.

    This patch adds tracking for the total number of negative dentries in
    the system LRU lists and reports it in the 5th field in the
    /proc/sys/fs/dentry-state file. The number, however, does not include
    negative dentries that are in flight but not in the LRU yet as well as
    those in the shrinker lists which are on the way out anyway.

    The number of positive dentries in the LRU lists can be roughly found by
    subtracting the number of negative dentries from the unused count.
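
The subtraction described above can be sketched as follows. This is a minimal illustration, assuming the six-field layout of /proc/sys/fs/dentry-state (nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy); the function names and the sample line are made up for the example and are not part of the patch.

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    # Assumed field order of /proc/sys/fs/dentry-state after this patch.
    nr_dentry: int
    nr_unused: int
    age_limit: int
    want_pages: int
    nr_negative: int  # the 5th field, added by this patch
    dummy: int


def parse_dentry_state(text: str) -> DentryState:
    """Parse the whitespace-separated contents of dentry-state."""
    fields = [int(f) for f in text.split()]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return DentryState(*fields)


def positive_lru_dentries(state: DentryState) -> int:
    """Rough count of positive dentries in the LRU lists (unused minus negative)."""
    return state.nr_unused - state.nr_negative


# Made-up sample line; on a real system, read /proc/sys/fs/dentry-state instead.
sample = "81876 60200 45 0 5030 0"
print(positive_lru_dentries(parse_dentry_state(sample)))
```

As the commit message notes, the result is only approximate, because negative dentries in flight or on shrinker lists are not counted.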

    Matthew Wilcox confirmed that the dummy array has been there since the
    introduction of the dentry_stat structure in 2.1.60, probably for future
    extension. Its entries were not replacements of pre-existing fields.
    So no sane application that reads the value of /proc/sys/fs/dentry-state
    will misbehave if the last 2 fields of the sysctl parameter are not
    zero. IOW, it is safe to use one of the dummy array entries for the
    negative dentry count.

    Signed-off-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Waiman Long
     

27 Jan, 2019

1 commit

  • In its current state, Energy Aware Scheduling (EAS) starts automatically
    on asymmetric platforms having an Energy Model (EM). However, there are
    users who want to have an EM (for thermal management for example), but
    don't want EAS with it.

    In order to let users disable EAS explicitly, introduce a new sysctl
    called 'sched_energy_aware'. It is enabled by default so that EAS can
    start automatically on platforms where it makes sense. Flipping it to 0
    rebuilds the scheduling domains and disables EAS.

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adharmap@codeaurora.org
    Cc: chris.redpath@arm.com
    Cc: currojerez@riseup.net
    Cc: dietmar.eggemann@arm.com
    Cc: edubezval@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: javi.merino@kernel.org
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pkondeti@codeaurora.org
    Cc: rjw@rjwysocki.net
    Cc: skannan@codeaurora.org
    Cc: smuckle@google.com
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Cc: tkjos@google.com
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Cc: viresh.kumar@linaro.org
    Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar

    Quentin Perret
     

23 Jan, 2019

1 commit

  • There have been many people complaining about the inconsistent
    behaviors of IPv4 and IPv6 devconf when creating new network
    namespaces. Currently, for IPv4, we inherit all current settings
    from init_net, but for IPv6 we reset all settings to default.

    This patch introduces a new /proc file
    /proc/sys/net/core/devconf_inherit_init_net to control whether the
    current sysctl settings are inherited from init_net.
    This file itself is only available in init_net.

    As demonstrated below:

    Initial setup in init_net:
    # cat /proc/sys/net/ipv4/conf/all/rp_filter
    2
    # cat /proc/sys/net/ipv6/conf/all/accept_dad
    1

    Default value 0 (current behavior):
    # ip netns del test
    # ip netns add test
    # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
    2
    # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
    0

    Set to 1 (inherit from init_net):
    # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
    # ip netns del test
    # ip netns add test
    # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
    2
    # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
    1

    Set to 2 (reset to default):
    # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
    # ip netns del test
    # ip netns add test
    # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
    0
    # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
    0

    Set to a value out of range (invalid):
    # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
    -bash: echo: write error: Invalid argument
    # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
    -bash: echo: write error: Invalid argument
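
The three modes demonstrated above can be summarised in a small model. This is only a sketch of the observable behavior shown in the transcript, not kernel code; the function name and dictionary layout are hypothetical, and the values are the ones from the demonstration.

```python
def new_netns_devconf(mode: int, init_net: dict, defaults: dict) -> dict:
    """Return the devconf a freshly created namespace would start with.

    mode 0: IPv4 inherits from init_net, IPv6 is reset to defaults
    mode 1: both inherit from init_net
    mode 2: both are reset to defaults
    Any other value is rejected, mirroring the "Invalid argument" error.
    """
    if mode not in (0, 1, 2):
        raise ValueError("Invalid argument")
    ipv4 = defaults["ipv4"] if mode == 2 else init_net["ipv4"]
    ipv6 = init_net["ipv6"] if mode == 1 else defaults["ipv6"]
    return {"ipv4": dict(ipv4), "ipv6": dict(ipv6)}


# Values taken from the demonstration above.
init_net = {"ipv4": {"rp_filter": 2}, "ipv6": {"accept_dad": 1}}
defaults = {"ipv4": {"rp_filter": 0}, "ipv6": {"accept_dad": 0}}

for mode in (0, 1, 2):
    print(mode, new_netns_devconf(mode, init_net, defaults))
```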

    Reported-by: Zhu Yanjun
    Reported-by: Tonghao Zhang
    Cc: Nicolas Dichtel
    Signed-off-by: Cong Wang
    Acked-by: Nicolas Dichtel
    Acked-by: Tonghao Zhang
    Signed-off-by: David S. Miller

    Cong Wang
     

15 Jan, 2019

1 commit


09 Jan, 2019

2 commits

  • Jonathan Corbet
     
  • Add a section about decoding /proc/sys/kernel/tainted, create a more
    understandable intro and hopefully better explain the tainted flags in
    bug, oops or panic messages. The only thing missing then is a table that
    quickly describes the various bits and taint flags before going into more
    detail, so add that as well.

    That table is partly based on a section from Documentation/sysctl/kernel.txt,
    but a bit more compact. To avoid confusion I added the shortened version to
    kernel.txt; the same table is used in three different places now:
    ./tools/debugging/kernel-chktaint,
    Documentation/admin-guide/tainted-kernels.rst and
    Documentation/sysctl/kernel.txt
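
Decoding the value of /proc/sys/kernel/tainted amounts to testing individual bits. The sketch below is an illustration only: the bit-to-letter table is reproduced from memory of the kernel's tainted-kernels documentation of that era with shortened descriptions, later kernels may define further bits, and the function name is hypothetical.

```python
from typing import List, Tuple

# Index in the list is the bit number. Table recalled from the kernel
# documentation of the 5.0 era; treat it as an assumption and check the
# documentation shipped with your kernel.
TAINT_FLAGS: List[Tuple[str, str]] = [
    ("P", "proprietary module was loaded"),
    ("F", "module was force loaded"),
    ("S", "SMP kernel oops on an officially SMP incapable processor"),
    ("R", "module was force unloaded"),
    ("M", "processor reported a Machine Check Exception"),
    ("B", "bad page referenced or some unexpected page flags"),
    ("U", "taint requested by userspace application"),
    ("D", "kernel died recently (OOPS or BUG)"),
    ("A", "ACPI table overridden by user"),
    ("W", "kernel issued warning"),
    ("C", "staging driver was loaded"),
    ("I", "workaround for bug in platform firmware applied"),
    ("O", "externally-built (out-of-tree) module was loaded"),
    ("E", "unsigned module was loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel has been live patched"),
    ("X", "auxiliary taint, used by distros"),
    ("T", "kernel was built with the struct randomization plugin"),
]


def decode_taint(value: int) -> List[Tuple[int, str, str]]:
    """Return (bit, letter, reason) for every bit set in the taint value."""
    return [
        (bit, letter, reason)
        for bit, (letter, reason) in enumerate(TAINT_FLAGS)
        if value & (1 << bit)
    ]


# 4609 = 2**12 + 2**9 + 2**0
for bit, letter, reason in decode_taint(4609):
    print(f"bit {bit:2d} ({letter}): {reason}")
```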

    During review of v1 (see above) a number of existing issues with the text
    were raised, like outdated usages as well as incomplete or missing
    descriptions. Address most of those as well.

    Signed-off-by: Thorsten Leemhuis
    [jc: tightened up changelog]
    Signed-off-by: Jonathan Corbet

    Thorsten Leemhuis
     

05 Jan, 2019

1 commit

  • So that we can also choose at runtime which system info to print on
    panic, rather than only by setting the kernel cmdline.

    Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com
    Signed-off-by: Feng Tang
    Suggested-by: Steven Rostedt
    Acked-by: Steven Rostedt (VMware)
    Cc: Thomas Gleixner
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Feng Tang
     

29 Dec, 2018

1 commit

  • An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

    The kernel reduces the probability of such events by increasing the
    watermark sizes by calling set_recommended_min_free_kbytes early in the
    lifetime of the system. This works reasonably well in general but if
    there are enough sparsely populated pageblocks then the problem can still
    occur as enough memory is free overall and kswapd stays asleep.

    This patch introduces a watermark_boost_factor sysctl that allows a zone
    watermark to be temporarily boosted when an external fragmentation causing
    event occurs. The boosting will stall allocations that would decrease
    free memory below the boosted low watermark and kswapd is woken if the
    calling context allows to reclaim an amount of memory relative to the size
    of the high watermark and the watermark_boost_factor until the boost is
    cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
    to clean some of the pageblocks that may have been affected by the
    fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
    from reclaim context during this operation to avoid excessive system
    disruption in the name of fragmentation avoidance. Care is taken so that
    kswapd will do normal reclaim work if the system is really low on memory.

    This was evaluated using the same workloads as "mm, page_alloc: Spread
    allocations across zones before introducing fragmentation".

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)
    4.20-rc3+patch1-4: 18421 (98% reduction)

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Amean     fault-base-1    653.58 (   0.00%)      652.71 (   0.13%)
    Amean     fault-huge-1      0.00 (   0.00%)      178.93 * -99.00%*

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Percentage huge-1           0.00 (   0.00%)        5.12 ( 100.00%)

    Note that external fragmentation causing events are massively reduced by
    this patch whether in comparison to the previous kernel or the vanilla
    kernel. The fault latency for huge pages appears to be increased but that
    is only because THP allocations were successful with the patch applied.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)
    4.20-rc3+patch1-4: 13464 (95% reduction)

    thpfioscale Fault Latencies
                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Min       fault-base-1    912.00 (   0.00%)      905.00 (   0.77%)
    Min       fault-huge-1    127.00 (   0.00%)      135.00 (  -6.30%)
    Amean     fault-base-1   1467.55 (   0.00%)     1481.67 (  -0.96%)
    Amean     fault-huge-1   1127.11 (   0.00%)     1063.88 *   5.61%*

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Percentage huge-1          77.64 (   0.00%)       83.46 (   7.49%)

    As before, massive reduction in external fragmentation events, some jitter
    on latencies and an increase in THP allocation success rates.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)
    4.20-rc3+patch1-4: 14263 (93% reduction)

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Amean     fault-base-5   1346.45 (   0.00%)     1306.87 (   2.94%)
    Amean     fault-huge-5   3418.60 (   0.00%)     1348.94 (  60.54%)

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Percentage huge-5           0.78 (   0.00%)        7.91 ( 910.64%)

    There is a 93% reduction in fragmentation causing events, there is a big
    reduction in the huge page fault latency and allocation success rate is
    higher.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)
    4.20-rc3+patch1-4: 11095 (93% reduction)

    thpfioscale Fault Latencies
                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Amean     fault-base-5   6217.43 (   0.00%)     7419.67 * -19.34%*
    Amean     fault-huge-5   3163.33 (   0.00%)     3263.80 (  -3.18%)

                                     4.20.0-rc3             4.20.0-rc3
                                   lowzone-v5r8             boost-v5r8
    Percentage huge-5          95.14 (   0.00%)       87.98 (  -7.53%)

    There is a large reduction in fragmentation events with some jitter around
    the latencies and success rates. As before, the high THP allocation
    success rate does mean the system is under a lot of pressure. However, as
    the fragmentation events are reduced, it would be expected that the
    long-term allocation success rate would be higher.
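
The reduction percentages quoted for the four configurations can be re-derived from the raw event counts. The snippet below is a small arithmetic check using only the numbers given above; the helper name and dictionary keys are made up for the illustration.

```python
def reduction_percent(baseline: int, patched: int) -> int:
    """Percentage reduction relative to the baseline, rounded to an integer."""
    return round(100 * (baseline - patched) / baseline)


# (4.20-rc3 baseline, with this patch alone, with patches 1-4) extfrag events
RESULTS = {
    "1-socket Skylake, no madvise": (804694, 408912, 18421),
    "1-socket Skylake, MADV_HUGEPAGE": (291392, 191187, 13464),
    "2-socket Haswell, no madvise": (215698, 200210, 14263),
    "2-socket Haswell, MADV_HUGEPAGE": (166352, 147463, 11095),
}

for name, (base, one_patch, series) in RESULTS.items():
    print(
        f"{name}: {reduction_percent(base, one_patch)}% with the patch, "
        f"{reduction_percent(base, series)}% with patches 1-4"
    )
```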

    Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Nov, 2018

1 commit

  • Pull stackleak gcc plugin from Kees Cook:
    "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
    was ported from grsecurity by Alexander Popov. It provides efficient
    stack content poisoning at syscall exit. This creates a defense
    against at least two classes of flaws:

    - Uninitialized stack usage. (We continue to work on improving the
    compiler to do this in other ways: e.g. unconditional zero init was
    proposed to GCC and Clang, and more plugin work has started too).

    - Stack content exposure. By greatly reducing the lifetime of valid
    stack contents, exposures via either direct read bugs or unknown
    cache side-channels become much more difficult to exploit. This
    complements the existing buddy and heap poisoning options, but
    provides the coverage for stacks.

    The x86 hooks are included in this series (which have been reviewed by
    Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
    been merged through the arm64 tree (written by Laura Abbott and
    reviewed by Mark Rutland and Will Deacon).

    With VLAs having been removed this release, there is no need for
    alloca() protection, so it has been removed from the plugin"

    * tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    arm64: Drop unneeded stackleak_check_alloca()
    stackleak: Allow runtime disabling of kernel stack erasing
    doc: self-protection: Add information about STACKLEAK feature
    fs/proc: Show STACKLEAK metrics in the /proc file system
    lkdtm: Add a test for STACKLEAK
    gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
    x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

    Linus Torvalds
     

26 Oct, 2018

1 commit

  • Rick reported that the BPF JIT could potentially fill the entire module
    space with BPF programs from unprivileged users which would prevent later
    attempts to load normal kernel modules or privileged BPF programs, for
    example. If JIT was enabled but failed to generate the image, then
    before commit 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
    we would always fall back to the BPF interpreter. Nowadays, when
    CONFIG_BPF_JIT_ALWAYS_ON is set, the load will abort with a failure
    since the BPF interpreter was compiled out.

    Add a global limit and enforce it for unprivileged users such that in case
    of BPF interpreter compiled out we fail once the limit has been reached
    or we fall back to BPF interpreter earlier w/o using module mem if latter
    was compiled in. In a next step, fair share among unprivileged users can
    be resolved in particular for the case where we would fail hard once limit
    is reached.
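
The decision described above can be modelled in a few lines. This is a deliberately simplified sketch, not the kernel implementation: the class, method and parameter names are hypothetical, sizes are abstract units, and per-user fair sharing (left as a next step in the commit message) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class JitAccounting:
    """Toy model of a global cap on module memory used by JIT images."""

    limit: int      # global limit, in arbitrary units (assumption)
    used: int = 0   # memory currently charged to JIT images

    def load(self, size: int, privileged: bool, interpreter_available: bool) -> str:
        """Return 'jit', 'interpreter' or 'fail' for one program load."""
        if not privileged and self.used + size > self.limit:
            # Limit reached for an unprivileged user: fall back to the
            # interpreter if it was compiled in, otherwise fail the load.
            return "interpreter" if interpreter_available else "fail"
        self.used += size
        return "jit"


acct = JitAccounting(limit=100)
print(acct.load(60, privileged=False, interpreter_available=True))
print(acct.load(60, privileged=False, interpreter_available=True))
print(acct.load(60, privileged=False, interpreter_available=False))
print(acct.load(60, privileged=True, interpreter_available=False))
```

In this model the limit is enforced only for unprivileged loads, which matches the wording above; whether privileged loads also count toward the total is an assumption of the sketch.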

    Fixes: 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
    Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
    Co-Developed-by: Rick Edgecombe
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: LKML
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

10 Sep, 2018

1 commit

  • This is a respin with a wider audience (all that get_maintainer returned)
    and I know this spams a *lot* of people. Not sure what would be the correct
    way, so my apologies for ruining your inbox.

    The 00-INDEX files are supposed to give a summary of all files present
    in a directory, but these files are horribly out of date and their
    usefulness is brought into question. Often a simple "ls" would reveal
    the same information as the filenames are generally quite descriptive as
    a short introduction to what the file covers (it should not surprise
    anyone what Documentation/sched/sched-design-CFS.txt covers)

    A few years back it was mentioned that these files were no longer really
    needed, and they have since then grown further out of date, so perhaps
    it is time to just throw them out.

    A short status yields the following _outdated_ 00-INDEX files, first
    counter is files listed in 00-INDEX but missing in the directory, last
    is files present but not listed in 00-INDEX.

    List of outdated 00-INDEX:
    Documentation: (4/10)
    Documentation/sysctl: (0/1)
    Documentation/timers: (1/0)
    Documentation/blockdev: (3/1)
    Documentation/w1/slaves: (0/1)
    Documentation/locking: (0/1)
    Documentation/devicetree: (0/5)
    Documentation/power: (1/1)
    Documentation/powerpc: (0/5)
    Documentation/arm: (1/0)
    Documentation/x86: (0/9)
    Documentation/x86/x86_64: (1/1)
    Documentation/scsi: (4/4)
    Documentation/filesystems: (2/9)
    Documentation/filesystems/nfs: (0/2)
    Documentation/cgroup-v1: (0/2)
    Documentation/kbuild: (0/4)
    Documentation/spi: (1/0)
    Documentation/virtual/kvm: (1/0)
    Documentation/scheduler: (0/2)
    Documentation/fb: (0/1)
    Documentation/block: (0/1)
    Documentation/networking: (6/37)
    Documentation/vm: (1/3)

    Then there are 364 subdirectories in Documentation/ with several files that
    are missing 00-INDEX altogether (and another 120 with a single file and no
    00-INDEX).

    I don't really have an opinion as to whether or not we /should/ have
    00-INDEX, but the above 00-INDEX should either be removed or be kept up to
    date. If we should keep the files, I can try to keep them updated, but I'd
    rather not if we just want to delete them anyway.

    As a starting point, remove all index-files and references to 00-INDEX and
    see where the discussion is going.

    Signed-off-by: Henrik Austad
    Acked-by: "Paul E. McKenney"
    Just-do-it-by: Steven Rostedt
    Reviewed-by: Jens Axboe
    Acked-by: Paul Moore
    Acked-by: Greg Kroah-Hartman
    Acked-by: Mark Brown
    Acked-by: Mike Rapoport
    Cc: [Almost everybody else]
    Signed-off-by: Jonathan Corbet

    Henrik Austad
     

05 Sep, 2018

1 commit


24 Aug, 2018

1 commit

  • Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.
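
The open() restriction stated at the top of this commit message can be written as a predicate. The sketch below models only the rule as worded there; it is not the kernel code, it ignores the sysctl switches and any finer-grained levels, and the function and parameter names are hypothetical.

```python
import os
import stat


def open_allowed(opener_uid: int, file_uid: int, dir_uid: int,
                 dir_mode: int, flags: int) -> bool:
    """Model of the rule for an existing FIFO or regular file.

    The open is refused only when all of the following hold:
      - the directory is world writable and sticky,
      - the file is opened with O_CREAT,
      - the file is not owned by the opener,
      - the file is not owned by the directory's owner.
    """
    sticky_world_writable = bool(dir_mode & stat.S_ISVTX) and bool(dir_mode & stat.S_IWOTH)
    if not sticky_world_writable:
        return True
    if not flags & os.O_CREAT:
        return True
    if file_uid == opener_uid or file_uid == dir_uid:
        return True
    return False


TMP_MODE = 0o1777  # typical mode of /tmp: world writable with the sticky bit

# User 1000 opens a file planted by user 1001 in a root-owned sticky directory.
print(open_allowed(1000, 1001, 0, TMP_MODE, os.O_WRONLY | os.O_CREAT))
print(open_allowed(1000, 1001, 0, TMP_MODE, os.O_WRONLY))
```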

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salvatore Mesoraca
     

23 Aug, 2018

3 commits

  • ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc()
    (within ipc_idr_alloc()).

    Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
    may see an uninitialized value.

    The patch moves the initialization of kern_ipc_perm.seq before the calls
    of idr_alloc().

    Notes:
    1) This patch has a user space visible side effect:
    If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
    if semget()/msgget()/shmget() fails in the final step of adding the id
    to the rhash tree, then .._next_id is cleared. Before the patch, it
    remained unmodified.

    There is no change of the behavior after a successful ..get() call: It
    always clears .._next_id, there is no impact to non checkpoint/restore
    code as that code does not use .._next_id.

    2) The patch correctly documents that after a call to ipc_idr_alloc(),
    the full tear-down sequence must be used. The callers of ipc_addid()
    do not fulfill that, i.e. more bugfixes are required.

    The patch is a squash of a patch from Dmitry and my own changes.

    Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com
    Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Currently the task hung checking interval is equal to the timeout; as a
    result, a hang is detected anywhere between timeout and 2*timeout. This
    is fine for most interactive environments, but it hurts automated
    testing setups (syzbot). In an automated setup we need to strictly order
    CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so
    that an RCU stall is not detected as task hung and task hung is not
    detected as silent machine loss. The large variance in task hung
    detection timeout requires setting the silent machine loss timeout to a
    very large value (e.g. if task hung is 3 mins, then silent loss needs to
    be set to ~7 mins). The additional 3 minutes significantly reduce testing
    efficiency because usually we crash the kernel within a minute, and this
    can add hours to the bug localization process as it needs to do dozens of
    tests.

    Allow setting the checking interval separately from the timeout. This
    allows setting the timeout to, say, 3 minutes, but the checking interval
    to 10 secs.

    The interval is controlled via a new hung_task_check_interval_secs sysctl,
    similar to the existing hung_task_timeout_secs sysctl. The default value
    of 0 results in the current behavior: checking interval is equal to
    timeout.
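
The effect on the detection window can be worked through numerically. The helper below is a back-of-the-envelope sketch of the reasoning in this message, not kernel code; its name is hypothetical, and it assumes a hang is reported at the first check after the timeout has elapsed.

```python
from typing import Tuple


def detection_window(timeout_secs: int, check_interval_secs: int = 0) -> Tuple[int, int]:
    """Earliest and latest time (in seconds) at which a hang is reported.

    A check interval of 0 means "same as the timeout", which is the
    default behavior described above.
    """
    interval = check_interval_secs or timeout_secs
    return timeout_secs, timeout_secs + interval


print(detection_window(180))      # default: interval equals the timeout
print(detection_window(180, 10))  # separate 10 second checking interval
```

With a 3 minute timeout, the default yields a window of 180 to 360 seconds, which is why the silent-loss timeout has to be about 7 minutes; a 10 second interval narrows the window to 180 to 190 seconds.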

    [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
    Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Paul E. McKenney
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • __vm_enough_memory has moved to mm/util.c.

    Link: http://lkml.kernel.org/r/E18EDF4A4FA4A04BBFA824B6D7699E532A7E5913@EXMBX-SZMAIL013.tencent.com
    Signed-off-by: Juvi Liu
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    juviliu
     

19 Aug, 2018

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char/misc drivers for 4.19-rc1

    There is a lot here, much more than normal, seems like everyone is
    writing new driver subsystems these days... Anyway, major things here
    are:

    - new FSI driver subsystem, yet-another-powerpc low-level hardware
    bus

    - gnss, finally an in-kernel GPS subsystem to try to tame all of the
    crazy out-of-tree drivers that have been floating around for years,
    combined with some really hacky userspace implementations. This is
    only for GNSS receivers, but you have to start somewhere, and this
    is great to see.

    Other than that, there are new slimbus drivers, new coresight drivers,
    new fpga drivers, and loads of DT bindings for all of these and
    existing drivers.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
    android: binder: Rate-limit debug and userspace triggered err msgs
    fsi: sbefifo: Bump max command length
    fsi: scom: Fix NULL dereference
    misc: mic: SCIF Fix scif_get_new_port() error handling
    misc: cxl: changed asterisk position
    genwqe: card_base: Use true and false for boolean values
    misc: eeprom: assignment outside the if statement
    uio: potential double frees if __uio_register_device() fails
    eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
    misc: ti-st: Fix memory leak in the error path of probe()
    android: binder: Show extra_buffers_size in trace
    firmware: vpd: Fix section enabled flag on vpd_section_destroy
    platform: goldfish: Retire pdev_bus
    goldfish: Use dedicated macros instead of manual bit shifting
    goldfish: Add missing includes to goldfish.h
    mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
    dt-bindings: mux: add adi,adgs1408
    Drivers: hv: vmbus: Cleanup synic memory free path
    Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
    Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit


08 Jul, 2018

1 commit

  • In the VM mode on Hyper-V, currently, when the kernel panics, an error
    code and a few register values are populated in an MSR and the Hypervisor
    is notified. This information is collected on the host. The amount of
    information currently collected is found to be limited and not very
    actionable. To gather more actionable data, such as a stack trace, the
    proposal is to write one page worth of kmsg data to an allocated page
    and notify the Hypervisor of the page address through the MSR.

    - Sysctl option to control the behavior, with ON by default.

    Cc: K. Y. Srinivasan
    Cc: Stephen Hemminger
    Signed-off-by: Sunil Muthuswamy
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Sunil Muthuswamy
     

26 Jun, 2018

1 commit

  • commit 1efff914afac8a965ad63817ecf8861a927c2ace ("fs: add
    dirtytime_expire_seconds sysctl") introduced the dirtytime_expire_seconds
    knob, but there is no description of it in
    Documentation/sysctl/vm.txt.

    Add the description for it.

    Cc: Theodore Ts'o
    Signed-off-by: Yang Shi
    Signed-off-by: Jonathan Corbet

    Yang Shi
     

07 Jun, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Add Maglev hashing scheduler to IPVS, from Inju Song.

    2) Lots of new TC subsystem tests from Roman Mashak.

    3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

    4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

    5) Add ttl inherit support to vxlan, from Hangbin Liu.

    6) Properly separate ipv6 routes into their logically independent
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

    7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

    8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

    9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

    10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

    11) Plumb extack down into fib_rules, from Roopa Prabhu.

    12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

    13) Add UDP GSO support, from Willem de Bruijn.

    14) Add documentation for eBPF helpers, from Quentin Monnet.

    15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

    16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

    17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

    18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

    19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

    20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

    21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

    22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

    23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

    24) Add TCP SACK compression, from Eric Dumazet.

    25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

    26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

    27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

    28) Add lots of forwarding selftests, from Petr Machata.

    29) Add generic network device failover driver, from Sridhar Samudrala.

    * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
    strparser: Add __strp_unpause and use it in ktls.
    rxrpc: Fix terminal retransmission connection ID to include the channel
    net: hns3: Optimize PF CMDQ interrupt switching process
    net: hns3: Fix for VF mailbox receiving unknown message
    net: hns3: Fix for VF mailbox cannot receiving PF response
    bnx2x: use the right constant
    Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
    net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
    enic: fix UDP rss bits
    netdev-FAQ: clarify DaveM's position for stable backports
    rtnetlink: validate attributes in do_setlink()
    mlxsw: Add extack messages for port_{un, }split failures
    netdevsim: Add extack error message for devlink reload
    devlink: Add extack to reload and port_{un, }split operations
    net: metrics: add proper netlink validation
    ipmr: fix error path when ipmr_new_table fails
    ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
    net: hns3: remove unused hclgevf_cfg_func_mta_filter
    netfilter: provide udp*_lib_lookup for nf_tproxy
    qed*: Utilize FW 8.37.2.0
    ...

    Linus Torvalds
     

04 May, 2018

1 commit

  • The JIT compiler emits ia32 instructions. Currently, it supports eBPF
    only. Classic BPF is supported because of the conversion by BPF core.

    Almost all instructions from the eBPF ISA are supported except the following:
    BPF_ALU64 | BPF_DIV | BPF_K
    BPF_ALU64 | BPF_DIV | BPF_X
    BPF_ALU64 | BPF_MOD | BPF_K
    BPF_ALU64 | BPF_MOD | BPF_X
    BPF_STX | BPF_XADD | BPF_W
    BPF_STX | BPF_XADD | BPF_DW

    It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.

    IA32 has few general purpose registers: EAX|EDX|ECX|EBX|ESI|EDI. I use
    EAX|EDX|ECX|EBX as temporary registers to simulate instructions in the eBPF
    ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding. All other
    eBPF registers, R0-R10, are simulated through scratch space on the stack.

    The reasons behind the hardware register allocation policy are:
    1:MUL needs EAX:EDX and shift operations need ECX, so they aren't fit
    for general eBPF 64bit register simulation.
    2:We need at least 4 registers to simulate most eBPF ISA operations
    on register operands instead of on register&memory operands.
    3:We need to put BPF_REG_AX in hardware registers, or constant blinding
    will degrade jit performance heavily.

    Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
    Testing results on i5-5200U:
    1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
    2) test_progs: Summary: 83 PASSED, 0 FAILED.
    3) test_lpm: OK
    4) test_lru_map: OK
    5) test_verifier: Summary: 828 PASSED, 0 FAILED.

    The above tests were all run under each of the following two conditions:
    1:bpf_jit_enable=1 and bpf_jit_harden=0
    2:bpf_jit_enable=1 and bpf_jit_harden=2

    Below are some numbers for this jit implementation:
    Note:
    I ran test_progs in kselftest 100 times in a row for every condition;
    the numbers are in the format total/times=avg.
    The numbers that test_bpf reports show almost the same relation.

    a:jit_enable=0 and jit_harden=0
    test_pkt_access:PASS:ipv4:15622/100=156
    test_pkt_access:PASS:ipv6:9130/100=91
    test_xdp:PASS:ipv4:240198/100=2401
    test_xdp:PASS:ipv6:137326/100=1373
    test_l4lb:PASS:ipv4:61100/100=611
    test_l4lb:PASS:ipv6:101000/100=1010

    b:jit_enable=1 and jit_harden=0
    test_pkt_access:PASS:ipv4:10674/100=106
    test_pkt_access:PASS:ipv6:4855/100=48
    test_xdp:PASS:ipv4:138912/100=1389
    test_xdp:PASS:ipv6:68542/100=685
    test_l4lb:PASS:ipv4:37302/100=373
    test_l4lb:PASS:ipv6:55030/100=550

    c:jit_enable=1 and jit_harden=2
    test_pkt_access:PASS:ipv4:10558/100=105
    test_pkt_access:PASS:ipv6:5092/100=50
    test_xdp:PASS:ipv4:131902/100=1319
    test_xdp:PASS:ipv6:77932/100=779
    test_l4lb:PASS:ipv4:38924/100=389
    test_l4lb:PASS:ipv6:57520/100=575

    The numbers show we get 30%~50% improvement.
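
    The 30%~50% figure can be checked from the averages listed above. The
    sketch below is only an illustration: the dictionary and function names
    are made up for this example, and "improvement" is taken to mean the
    fractional reduction relative to condition a (JIT disabled).

```python
# Averages (total/100) copied from the figures listed for conditions a, b, c.
JIT_OFF = {
    "test_pkt_access ipv4": 156, "test_pkt_access ipv6": 91,
    "test_xdp ipv4": 2401, "test_xdp ipv6": 1373,
    "test_l4lb ipv4": 611, "test_l4lb ipv6": 1010,
}
JIT_ON = {
    "test_pkt_access ipv4": 106, "test_pkt_access ipv6": 48,
    "test_xdp ipv4": 1389, "test_xdp ipv6": 685,
    "test_l4lb ipv4": 373, "test_l4lb ipv6": 550,
}
JIT_HARDENED = {
    "test_pkt_access ipv4": 105, "test_pkt_access ipv6": 50,
    "test_xdp ipv4": 1319, "test_xdp ipv6": 779,
    "test_l4lb ipv4": 389, "test_l4lb ipv6": 575,
}


def improvement(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def improvements(baseline, candidate):
    """Per-test improvement of `candidate` over `baseline`."""
    return {name: improvement(baseline[name], candidate[name])
            for name in baseline}


if __name__ == "__main__":
    for label, table in (("b", JIT_ON), ("c", JIT_HARDENED)):
        for name, frac in improvements(JIT_OFF, table).items():
            print(f"{label} {name}: {frac:.0%}")
```

    Every test in both JIT conditions falls between roughly 32% and 50%,
    which is consistent with the range quoted above.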

    See Documentation/networking/filter.txt for more information.

    Changelog:

    Changes v5-v6:
    1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
    consistency reasons.
    2:Clean up non-standard comments, reported by Daniel Borkmann.
    3:Fix a memory leak issue, reported by Daniel Borkmann.

    Changes v4-v5:
    1:Delete is_on_stack, BPF_REG_AX is the only one
    on real hardware registers, so just check with
    it.
    2:Apply commit 1612a981b766 ("bpf, x64: fix JIT emission
    for dead code"), suggested by Daniel Borkmann.

    Changes v3-v4:
    1:Fix changelog in commit.
    After installing llvm-6.0, test_progs no longer reports errors.
    I submitted another patch:
    "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
    to fix another problem; after that patch, test_verifier no longer reports errors either.
    2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.

    Changes v2-v3:
    1:Move BPF_REG_AX to real hardware registers for performance reason.
    3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
    4:Delete partial codes in 1c2a088a6626, suggested by Daniel Borkmann.
    5:Some bug fixes and comments improvement.

    Changes v1-v2:
    1:Fix bug in emit_ia32_neg64.
    2:Fix bug in emit_ia32_arsh_r64.
    3:Delete filename in top level comment, suggested by Thomas Gleixner.
    4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
    5:Rewrite some words in changelog.
    6:CodingStyle improvement and a little more comments.

    Signed-off-by: Wang YanQing
    Signed-off-by: Daniel Borkmann

    Wang YanQing
     

28 Apr, 2018

1 commit


17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except for a few
    spelling fixes. The relatively large diffstat stems from the indentation
    and paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I may
    have missed some places that needed markup and added some markup where
    it was not necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

12 Apr, 2018

3 commits

  • Since the randstruct plugin can intentionally produce extremely unusual
    kernel structure layouts (even performance pathological ones), some
    maintainers want to be able to trivially determine if an Oops is coming
    from a randstruct-built kernel, so as to keep their sanity when
    debugging. This adds the new flag and initializes taint_mask
    immediately when built with randstruct.

    Link: http://lkml.kernel.org/r/1519084390-43867-4-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • This consolidates the taint bit documentation into a single place with
    both numeric and letter values. Additionally adds the missing TAINT_AUX
    documentation.
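
    To illustrate what "both numeric and letter values" means in practice,
    here is a hedged sketch that decodes a taint mask such as the one exposed
    in /proc/sys/kernel/tainted. The table reflects my understanding of the
    bit layout in kernels of this period (bits 0 to 17, ending with the 'X'
    aux flag and the 'T' randstruct flag); the descriptions are paraphrased,
    the function names are hypothetical, and the kernel's own taint
    documentation remains the authoritative list.

```python
# (letter, short description) indexed by bit number; assumed layout, see
# the lead-in. Descriptions are paraphrases, not the kernel's wording.
TAINT_FLAGS = [
    ("P", "proprietary module was loaded"),
    ("F", "module was force loaded"),
    ("S", "SMP-unsafe or out-of-spec CPU"),
    ("R", "module was force unloaded"),
    ("M", "machine check exception occurred"),
    ("B", "bad page referenced"),
    ("U", "taint requested by userspace"),
    ("D", "kernel died recently (oops or BUG)"),
    ("A", "ACPI table overridden"),
    ("W", "kernel issued a warning"),
    ("C", "staging driver was loaded"),
    ("I", "workaround for firmware bug applied"),
    ("O", "out-of-tree module was loaded"),
    ("E", "unsigned module was loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel has been live patched"),
    ("X", "auxiliary taint, defined for distro use"),
    ("T", "kernel built with the randstruct plugin"),
]


def decode_taint(mask):
    """Return (bit, letter, description) for every known bit set in mask."""
    return [(bit, letter, desc)
            for bit, (letter, desc) in enumerate(TAINT_FLAGS)
            if mask & (1 << bit)]


def taint_letters(mask):
    """Return the letters of the set bits, in ascending bit order."""
    return "".join(letter for _, letter, _ in decode_taint(mask))


if __name__ == "__main__":
    for bit, letter, desc in decode_taint(4097):
        print(f"bit {bit:2d} ({letter}): {desc}")
```

    With this table, a mask of 4097 (bits 0 and 12) decodes to the letters
    P and O.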

    Link: http://lkml.kernel.org/r/1519084390-43867-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
    important to reserve. When ZONE_MOVABLE is used, this problem would
    theoretically decrease the usable memory for GFP_HIGHUSER_MOVABLE
    allocation requests, which are mainly used for page cache and anon page
    allocation. So, fix it by setting
    sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0.

    Also, defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES - 1
    entries makes the code complex. For example, on a highmem system the
    following reserve ratio is activated for *NORMAL ZONE*, which could
    easily mislead people.

    #ifdef CONFIG_HIGHMEM
    32
    #endif

    This patch also fixes this situation by defining the
    sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
    the "#ifdef" in the right place.

    Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Tested-by: Tony Lindgren
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: "Aneesh Kumar K . V"
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Russell King
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Mar, 2018

1 commit

  • Fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
    ip6tnl0, ip6gre0) are automatically created when the corresponding
    module is loaded.

    These tunnels are also automatically created when a new network
    namespace is created, at a great cost.

    In many cases, netns are used for isolation purposes, and these
    extra network devices are a waste of resources. We are using
    thousands of netns per host, and hit the netns creation/delete
    bottleneck a lot. (Many thanks to Kirill for recent work on this)

    Add a new sysctl so that we can opt-out from this automatic creation.

    Note that these tunnels are still created for the initial namespace,
    to be the least intrusive for typical setups.

    Tested:
    lpk43:~# cat add_del_unshare.sh
    for i in `seq 1 40`
    do
    (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
    done
    wait

    lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
    lpk43:~# time ./add_del_unshare.sh

    real 0m37.521s
    user 0m0.886s
    sys 7m7.084s
    lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
    lpk43:~# time ./add_del_unshare.sh

    real 0m4.761s
    user 0m0.851s
    sys 1m8.343s
    lpk43:~#
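
    The gain shown by the two runs above can be quantified with a short
    sketch. It assumes the durations use the "<minutes>m<seconds>s" format
    printed by the shell's time builtin; the function names are hypothetical.

```python
import re

# Durations copied from the two runs in the transcript above.
BEFORE = {"real": "0m37.521s", "user": "0m0.886s", "sys": "7m7.084s"}
AFTER = {"real": "0m4.761s", "user": "0m0.851s", "sys": "1m8.343s"}


def parse_duration(text):
    """Convert a duration such as '7m7.084s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedup(before, after):
    """Ratio of the two durations; above 1 means `after` is faster."""
    return parse_duration(before) / parse_duration(after)


if __name__ == "__main__":
    for key in ("real", "sys"):
        print(f"{key}: {speedup(BEFORE[key], AFTER[key]):.1f}x")
```

    For these figures the wall-clock time drops by a factor of about 7.9 and
    the system time by a factor of about 6.2.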

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Feb, 2018

1 commit

  • Fix 'documetation' to 'documentation'

    Link: http://lkml.kernel.org/r/CAKW4uUxRPZz59aWAX8ytaCB5=Qh6d_CvAnO7rYq-6NRAnQJbDA@mail.gmail.com
    Signed-off-by: Kangmin Park
    Reviewed-by: Andrew Morton
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kangmin Park
     

01 Feb, 2018

3 commits

  • Pull documentation updates from Jonathan Corbet:
    "Documentation updates for 4.16.

    New stuff includes refcount_t documentation, errseq documentation,
    kernel-doc support for nested structure definitions, the removal of
    lots of crufty kernel-doc support for unused formats, SPDX tag
    documentation, the beginnings of a manual for subsystem maintainers,
    and lots of fixes and updates.

    As usual, some of the changesets reach outside of Documentation/ to
    effect kerneldoc comment fixes. It also adds the new LICENSES
    directory, of which Thomas promises I do not need to be the
    maintainer"

    * tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
    linux-next: docs-rst: Fix typos in kfigure.py
    linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
    Documentation: Fix misconversion of #if
    docs: add index entry for networking/msg_zerocopy
    Documentation: security/credentials.rst: explain need to sort group_list
    LICENSES: Add MPL-1.1 license
    LICENSES: Add the GPL 1.0 license
    LICENSES: Add Linux syscall note exception
    LICENSES: Add the MIT license
    LICENSES: Add the BSD-3-clause "Clear" license
    LICENSES: Add the BSD 3-clause "New" or "Revised" License
    LICENSES: Add the BSD 2-clause "Simplified" license
    LICENSES: Add the LGPL-2.1 license
    LICENSES: Add the LGPL 2.0 license
    LICENSES: Add the GPL 2.0 license
    Documentation: Add license-rules.rst to describe how to properly identify file licenses
    scripts: kernel_doc: better handle show warnings logic
    fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
    doc: md: Fix a file name to md-fault.c in fault-injection.txt
    errseq: Add to documentation tree
    ...

    Linus Torvalds
     
  • Merge updates from Andrew Morton:

    - misc fixes

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (118 commits)
    mm: remove PG_highmem description
    tools, vm: new option to specify kpageflags file
    mm/swap.c: make functions and their kernel-doc agree
    mm, memory_hotplug: fix memmap initialization
    mm: correct comments regarding do_fault_around()
    mm: numa: do not trap faults on shared data section pages.
    hugetlb, mbind: fall back to default policy if vma is NULL
    hugetlb, mempolicy: fix the mbind hugetlb migration
    mm, hugetlb: further simplify hugetlb allocation API
    mm, hugetlb: get rid of surplus page accounting tricks
    mm, hugetlb: do not rely on overcommit limit during migration
    mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
    mm, hugetlb: unify core page allocation accounting and initialization
    mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
    mm/memcontrol.c: make local symbol static
    mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
    include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
    mm/compaction.c: fix comment for try_to_compact_pages()
    mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
    zsmalloc: use U suffix for negative literals being shifted
    ...

    Linus Torvalds
     
  • hugepages_treat_as_movable has been introduced by 396faf0303d2 ("Allow
    huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
    allocations from ZONE_MOVABLE even when hugetlb pages were not
    migratable. The purpose of the movable zone was different at the time.
    It aimed at reducing memory fragmentation, and hugetlb pages, being long
    lived and large, were not contributing to the fragmentation, so it was
    acceptable to use the zone back then.

    Things have changed though and the primary purpose of the zone became
    a migratability guarantee. If we allow non-migratable hugetlb pages to
    be in ZONE_MOVABLE, memory hotplug might fail to offline the memory.

    Remove the knob and only rely on hugepage_migration_supported to allow
    movable zones.

    Mel said:

    : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
    : the ability to grow it again. The use case was for batched jobs, some of
    : which needed huge pages and others that did not but didn't want the memory
    : useless pinned in the huge pages pool.
    :
    : I suspect that more users rely on THP than hugetlbfs for flexible use of
    : huge pages with fallback options so I think that removing the option
    : should be ok.

    Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Alexandru Moise
    Acked-by: Mel Gorman
    Cc: Alexandru Moise
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko