25 Jan, 2016
1 commit
-
If we detect that there is nothing to do just set the flag and do not
check if it was already set before. Races really do not matter. If the
flag is set by any code then the shepherd will start dealing with the
situation and reenable the vmstat workers when necessary again.Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
a particular cpu to be handled by vmstat_shepherd. This might trigger a
VM_BUG_ON in vmstat_update because the work item might have been
sleeping during the idle period and see the cpu_stat_off updated after
the wake up. The VM_BUG_ON is therefore misleading and no more
appropriate. Moreover it doesn't really suite any protection from real
bugs because vmstat_shepherd will simply reschedule the vmstat_work
anytime it sees a particular cpu set or vmstat_update would do the same
from the worker context directly. Even when the two would race the
result wouldn't be incorrect as the counters update is fully idempotent.Reported-by: Sasha Levin
Signed-off-by: Christoph Lameter
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Tetsuo Handa
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jan, 2016
1 commit
-
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration
23 Jan, 2016
6 commits
-
There are many locations that do
if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Acked-by: Jan Kara
Acked-by: Russell King
Reviewed-by: Andreas Dilger
Acked-by: "Rafael J. Wysocki"
Acked-by: David Rientjes
Cc: "Luck, Tony"
Cc: Oleg Drokin
Cc: Boris Petkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
To properly handle fsync/msync in an efficient way DAX needs to track
dirty pages so it is able to flush them durably to media on demand.The tracking of dirty pages is done via the radix tree in struct
address_space. This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file,
and it already has support for exceptional (non struct page*) entries.
We build upon these features to add exceptional entries to the radix
tree for DAX dirty PMD or PTE pages at fault time.[dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
Signed-off-by: Ross Zwisler
Cc: "H. Peter Anvin"
Cc: "J. Bruce Fields"
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andreas Dilger
Cc: Dave Chinner
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jeff Layton
Cc: Matthew Wilcox
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Dave Hansen
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this
function) that are marked with the PAGECACHE_TAG_TOWRITE tag.Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: "H. Peter Anvin"
Cc: "J. Bruce Fields"
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andreas Dilger
Cc: Dave Chinner
Cc: Ingo Molnar
Cc: Jeff Layton
Cc: Matthew Wilcox
Cc: Thomas Gleixner
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add support for tracking dirty DAX entries in the struct address_space
radix tree. This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages. These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third. We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage. This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes. These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.Signed-off-by: Ross Zwisler
Cc: "H. Peter Anvin"
Cc: "J. Bruce Fields"
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andreas Dilger
Cc: Dave Chinner
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jeff Layton
Cc: Matthew Wilcox
Cc: Thomas Gleixner
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: Dave Hansen
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro -
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.Signed-off-by: Al Viro
22 Jan, 2016
3 commits
-
This crash is caused by NULL pointer deference, in page_to_pfn() marco,
when page == NULL :Unable to handle kernel NULL pointer dereference at virtual address 00000000
Internal error: Oops: 94000006 [#1] SMP
Modules linked in:
CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
PC is at khugepaged+0x378/0x1af8
LR is at khugepaged+0x418/0x1af8
Process khugepaged (pid: 26, stack limit = 0xffffffc079638020)
Call trace:
khugepaged+0x378/0x1af8
kthread+0xdc/0xf4
ret_from_fork+0xc/0x40
Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0)
---[ end trace 637503d8e28ae69e ]---
Kernel panic - not syncing: Fatal exception
CPU2: stopping
CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
Hardware name: linux,dummy-virt (DT)[akpm@linux-foundation.org: fix fat-fingered merge resolution]
Signed-off-by: yalin wang
Acked-by: Vlastimil Babka
Acked-by: Kirill A. Shutemov
Acked-by: David Rientjes
Cc: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Tetsuo Handa reported underflow of NR_MLOCK on munlock.
Testcase:
#include
#include
#include#define BASE ((void *)0x400000000000)
#define SIZE (1UL << 21)int main(int argc, char *argv[])
{
void *addr;system("grep Mlocked /proc/meminfo");
addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
-1, 0);
if (addr == MAP_FAILED)
printf("mmap() failed\n"), exit(1);
munmap(addr, SIZE);
system("grep Mlocked /proc/meminfo");
return 0;
}It happens on munlock_vma_page() due to unfortunate choice of nr_pages
data type:__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
For unsigned int nr_pages, implicitly casted to long in
__mod_zone_page_state(), it becomes something around UINT_MAX.munlock_vma_page() usually called for THP as small pages go though
pagevec.Let's make nr_pages signed int.
Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
mod_zone_page_state()") used `long' type, but `int' here is OK for a
count of the number of sub-pages in a huge page.Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
Signed-off-by: Kirill A. Shutemov
Reported-by: Tetsuo Handa
Tested-by: Tetsuo Handa
Cc: Michel Lespinasse
Acked-by: Michal Hocko
Cc: [4.4+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
ptl doesn't make much sense in this case.Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.Signed-off-by: Kirill A. Shutemov
Suggested-by: Linus Torvalds
Cc: Minchan Kim
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Jan, 2016
27 commits
-
Provide statistics on how much of a cgroup's memory footprint is made up
of socket buffers from network connections owned by the group.Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Provide a cgroup2 memory.stat that provides statistics on LRU memory
and fault event counters. More consumers and breakdowns will follow.Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Changing page->mem_cgroup of a live page is tricky and fragile. In
particular, the memcg writeback code relies on that mapping being stable
and users of mem_cgroup_replace_page() not overlapping with dirtyable
inodes.Page cache replacement doesn't have to do that, though. Instead of being
clever and transferring the charge from the old page to the new,
force-charge the new page and leave the old page alone. A temporary
overcharge won't matter in practice, and the old page is going to be freed
shortly after this anyway. And this is not performance critical.Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Swap cache pages are freed aggressively if swap is nearly full (>50%
currently), because otherwise we are likely to stop scanning anonymous
when we near the swap limit even if there is plenty of freeable swap cache
pages. We should follow the same trend in case of memory cgroup, which
has its own swap limit.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We don't scan anonymous memory if we ran out of swap, neither should we do
it in case memcg swap limit is hit, because swap out is impossible anyway.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_lruvec_online() takes lruvec, but it only needs memcg. Since
get_scan_count(), which is the only user of this function, now possesses
pointer to memcg, let's pass memcg directly to mem_cgroup_online() instead
of picking it out of lruvec and rename the function accordingly.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
memcg will come in handy in get_scan_count(). It can already be used for
getting swappiness immediately in get_scan_count() instead of passing it
around. The following patches will add more memcg-related values, which
will be used there.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patchset introduces swap accounting to cgroup2.
This patch (of 7):
In the legacy hierarchy we charge memsw, which is dubious, because:
- memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.- memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.That said, we should provide a different swap limiting mechanism for
cgroup2.This patch adds mem_cgroup->swap counter, which charges the actual number
of swap entries used by a cgroup. It is only charged in the unified
hierarchy, while the legacy hierarchy memsw logic is left intact.The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.Note, to charge swap resource properly in the unified hierarchy, we have
to make swap_entry_free uncharge swap only when ->usage reaches zero, not
just ->count, i.e. when all references to a swap entry, including the one
taken by swap cache, are gone. This is necessary, because otherwise
swap-in could result in uncharging swap even if the page is still in swap
cache and hence still occupies a swap entry. At the same time, this
shouldn't break memsw counter logic, where a page is never charged twice
for using both memory and swap, because in case of legacy hierarchy we
uncharge swap on commit (see mem_cgroup_commit_charge).Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The creation and teardown of struct mem_cgroup is fairly messy and
that has attracted mistakes and subtle bugs before.The main cause for this is that there is no clear model about what
needs to happen when, and that attracts more chaos. So create one:1. mem_cgroup_alloc() should allocate struct mem_cgroup and its
auxiliary members and initialize work items, locks etc. so that the
object it returns is fully initialized and in a neutral state.2. mem_cgroup_css_alloc() will use mem_cgroup_alloc() to obtain a new
memcg object and configure it and the system according to the role
of the new memory-controlled cgroup in the hierarchy.3. mem_cgroup_css_online() is no longer needed to synchronize with
iterators, but it verifies css->id which isn't available earlier.4. mem_cgroup_css_offline() implements stuff that needs to happen upon
the user-visible destruction of a cgroup, which includes stopping
all user interfacing as well as releasing certain structures when
continued memory consumption would be unexpected at that point.5. mem_cgroup_css_free() prepares the system and the memcg object for
the object's disappearance, neutralizes its state, and then gives
it back to mem_cgroup_free().6. mem_cgroup_free() releases struct mem_cgroup and auxiliary memory.
[arnd@arndb.de: fix SLOB build regression]
Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are no more external users of struct cg_proto, flatten the
structure into struct mem_cgroup.Since using those struct members doesn't stand out as much anymore,
add cgroup2 static branches to make it clearer which code is legacy.Suggested-by: Vladimir Davydov
Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
controller code is insignificant, having these conditionals is not
worth the complication and fragility that comes with them.[akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
network subsys. Let's move it to memcontrol.c. This also allows us to
reuse generic code for handling legacy memcg files.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: "David S. Miller"
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Kmem accounting might incur overhead that some users can't put up with.
Besides, the implementation is still considered unstable. So let's
provide a way to disable it for those users who aren't happy with it.To disable kmem accounting for cgroup2, pass cgroup.memory=nokmem at
boot time.Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The original cgroup memory controller has an extension to account slab
memory (and other "kernel memory" consumers) in a separate "kmem"
counter, once the user set an explicit limit on that "kmem" pool.However, this includes various consumers whose sizes are directly linked
to userspace activity. Accounting them as an optional "kmem" extension
is problematic for several reasons:1. It leaves the main memory interface with incomplete semantics. A
user who puts their workload into a cgroup and configures a memory
limit does not expect us to leave holes in the containment as big
as the dentry and inode cache, or the kernel stack pages.2. If the limit set on this random historical subgroup of consumers is
reached, subsequent allocations will fail even when the main memory
pool available to the cgroup is not yet exhausted and/or has
reclaimable memory in it.3. Calling it 'kernel memory' is misleading. The dentry and inode
caches are no more 'kernel' (or no less 'user') memory than the
page cache itself. Treating these consumers as different classes is
a historical implementation detail that should not leak to users.So, in addition to page cache, anonymous memory, and network socket
memory, account the following memory consumers per default in the
cgroup2 memory controller:- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems.This should give us reasonable memory isolation for most common
workloads out of the box.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The cgroup2 memory controller will account important in-kernel memory
consumers per default. Move all necessary components to CONFIG_MEMCG.Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The cgroup2 memory controller will include important in-kernel memory
consumers per default, including socket memory, but it will no longer
carry the historic tcp control interface.Separate the kmem state init from the tcp control interface init in
preparation for that.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Put all the related code to setup and teardown the kmem accounting state
into the same location. No functional change intended.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On any given memcg, the kmem accounting feature has three separate
states: not initialized, structures allocated, and actively accounting
slab memory. These are represented through a combination of the
kmem_acct_activated and kmem_acct_active flags, which is confusing.Convert to a kmem_state enum with the states NONE, ALLOCATED, and
ONLINE. Then rename the functions to modify the state accordingly.
This follows the nomenclature of css object states more closely.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The kmem page_counter's limit is initialized to PAGE_COUNTER_MAX inside
mem_cgroup_css_online(). There is no need to repeat this from
memcg_propagate_kmem().Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This series adds accounting of the historical "kmem" memory consumers to
the cgroup2 memory controller.These consumers include the dentry cache, the inode cache, kernel stack
pages, and a few others that are pointed out in patch 7/8. The
footprint of these consumers is directly tied to userspace activity in
common workloads, and so they have to be part of the minimally viable
configuration in order to present a complete feature to our users.The cgroup2 interface of the memory controller is far from complete, but
this series, along with the socket memory accounting series, provides
the final semantic changes for the existing memory knobs in the cgroup2
interface, which is scheduled for initial release in the next merge
window.This patch (of 8):
Remove unused css argument frmo memcg_init_kmem()
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Only functions doing more than one read are modified. Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.Signed-off-by: Mateusz Guzik
Acked-by: Cyrill Gorcunov
Cc: Alexey Dobriyan
Cc: Jarod Wilson
Cc: Jan Stancek
Cc: Al Viro
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
UBSAN uses compile-time instrumentation to catch undefined behavior
(UB). Compiler inserts code that perform certain kinds of checks before
operations that could cause UB. If check fails (i.e. UB detected)
__ubsan_handle_* function called to print error message.So the most of the work is done by compiler. This patch just implements
ubsan handlers printing errors.GCC has this capability since 4.9.x [1] (see -fsanitize=undefined
option and its suboptions).
However GCC 5.x has more checkers implemented [2].
Article [3] has a bit more details about UBSAN in the GCC.[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
[3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/Issues which UBSAN has found thus far are:
Found bugs:
* out-of-bounds access - 97840cb67ff5 ("netfilter: nfnetlink: fix
insufficient validation in nfnetlink_bind")undefined shifts:
* d48458d4a768 ("jbd2: use a better hash function for the revoke
table")* 10632008b9e1 ("clockevents: Prevent shift out of bounds")
* 'x << -1' shift in ext4 -
http://lkml.kernel.org/r/* undefined rol32(0) -
http://lkml.kernel.org/r/* undefined dirty_ratelimit calculation -
http://lkml.kernel.org/r/* undefined roundown_pow_of_two(0) -
http://lkml.kernel.org/r/* [WONTFIX] undefined shift in __bpf_prog_run -
http://lkml.kernel.org/r/WONTFIX here because it should be fixed in bpf program, not in kernel.
signed overflows:
* 32a8df4e0b33f ("sched: Fix odd values in effective_load()
calculations")* mul overflow in ntp -
http://lkml.kernel.org/r/* incorrect conversion into rtc_time in rtc_time64_to_tm() -
http://lkml.kernel.org/r/* unvalidated timespec in io_getevents() -
http://lkml.kernel.org/r/* [NOTABUG] signed overflow in ktime_add_safe() -
http://lkml.kernel.org/r/[akpm@linux-foundation.org: fix unused local warning]
[akpm@linux-foundation.org: fix __int128 build woes]
Signed-off-by: Andrey Ryabinin
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: Randy Dunlap
Cc: Rasmus Villemoes
Cc: Jonathan Corbet
Cc: Michal Marek
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Yury Gribov
Cc: Dmitry Vyukov
Cc: Konstantin Khlebnikov
Cc: Kostya Serebryany
Cc: Johannes Berg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secretTherefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Casey Schaufler
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: Andy Shevchenko
Cc: Andy Lutomirski
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Willy Tarreau
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
record_obj() in migrate_zspage() does not preserve handle's
HANDLE_PIN_BIT, set by find_aloced_obj()->trypin_tag(), and implicitly
(accidentally) un-pins the handle, while migrate_zspage() still performs
an explicit unpin_tag() on the that handle. This additional explicit
unpin_tag() introduces a race condition with zs_free(), which can pin
that handle by this time, so the handle becomes un-pinned.Schematically, it goes like this:
CPU0 CPU1
migrate_zspage
find_alloced_obj
trypin_tag
set HANDLE_PIN_BIT zs_free()
pin_tag()
obj_malloc() -- new object, no tag
record_obj() -- remove HANDLE_PIN_BIT set HANDLE_PIN_BIT
unpin_tag() -- remove zs_free's HANDLE_PIN_BITThe race condition may result in a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
PC is at get_zspage_mapping+0x0/0x24
LR is at obj_free.isra.22+0x64/0x128
Call trace:
get_zspage_mapping+0x0/0x24
zs_free+0x88/0x114
zram_free_page+0x64/0xcc
zram_slot_free_notify+0x90/0x108
swap_entry_free+0x278/0x294
free_swap_and_cache+0x38/0x11c
unmap_single_vma+0x480/0x5c8
unmap_vmas+0x44/0x60
exit_mmap+0x50/0x110
mmput+0x58/0xe0
do_exit+0x320/0x8dc
do_group_exit+0x44/0xa8
get_signal+0x538/0x580
do_signal+0x98/0x4b8
do_notify_resume+0x14/0x5cThis patch keeps the lock bit in migration path and update value
atomically.Signed-off-by: Junil Lee
Signed-off-by: Minchan Kim
Acked-by: Vlastimil Babka
Cc: Sergey Senozhatsky
Cc: [4.1+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
split_queue_lock can be taken from interrupt context in some cases, but
I forgot to convert locking in split_huge_page() to interrupt-safe
primitives.Let's fix this.
lockdep output:
======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
4.4.0+ #259 Tainted: G W
------------------------------------------------------
syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
(split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436and this task is already holding:
(slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
(slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
which would create a new lock dependency:
(slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}but this new dependency connects a SOFTIRQ-irq-safe lock:
(slock-AF_INET){+.-...}
... which became SOFTIRQ-irq-safe at:
mark_irqflags kernel/locking/lockdep.c:2799
__lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
__raw_spin_lock include/linux/spinlock_api_smp.h:144
_raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:302
udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
flush_stack+0x50/0x330 net/ipv6/udp.c:799
__udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
__udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
NF_HOOK_THRESH include/linux/netfilter.h:226
NF_HOOK include/linux/netfilter.h:249
ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
dst_input include/net/dst.h:498
ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
NF_HOOK_THRESH include/linux/netfilter.h:226
NF_HOOK include/linux/netfilter.h:249
ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
__netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
__netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
napi_skb_finish net/core/dev.c:4542
napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
napi_poll net/core/dev.c:5074
net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
__do_softirq+0x26a/0x920 kernel/softirq.c:273
invoke_softirq kernel/softirq.c:350
irq_exit+0x18f/0x1d0 kernel/softirq.c:391
exiting_irq ./arch/x86/include/asm/apic.h:659
do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
cpuidle_idle_call kernel/sched/idle.c:156
cpu_idle_loop kernel/sched/idle.c:252
cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
rest_init+0x192/0x1a0 init/main.c:412
start_kernel+0x678/0x69e init/main.c:683
x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184to a SOFTIRQ-irq-unsafe lock:
(split_queue_lock){+.+...}
which became SOFTIRQ-irq-unsafe at:
mark_irqflags kernel/locking/lockdep.c:2817
__lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
__raw_spin_lock include/linux/spinlock_api_smp.h:144
_raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:302
split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
split_huge_page include/linux/huge_mm.h:99
queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
walk_pmd_range mm/pagewalk.c:50
walk_pud_range mm/pagewalk.c:90
walk_pgd_range mm/pagewalk.c:116
__walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
migrate_to_node mm/mempolicy.c:1004
do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
SYSC_migrate_pages mm/mempolicy.c:1453
SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(split_queue_lock);
local_irq_disable();
lock(slock-AF_INET);
lock(split_queue_lock);
lock(slock-AF_INET);Signed-off-by: Kirill A. Shutemov
Reported-by: Dmitry Vyukov
Acked-by: David Rientjes
Reviewed-by: Aneesh Kumar K.V
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A newly added tracepoint in the hugepage code uses a variable in the
error handling that is not initialized at that point:include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The result is relatively harmless, as the trace data will in rare
cases contain incorrect data.This works around the problem by adding an explicit initialization.
Signed-off-by: Arnd Bergmann
Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
Reviewed-by: Ebru Akagunduz
Acked-by: David Rientjes
Cc: Kirill A. Shutemov
Signed-off-by: Linus Torvalds
19 Jan, 2016
1 commit
-
Pull virtio barrier rework+fixes from Michael Tsirkin:
"This adds a new kind of barrier, and reworks virtio and xen to use it.Plus some fixes here and there"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
checkpatch: add virt barriers
checkpatch: check for __smp outside barrier.h
checkpatch.pl: add missing memory barriers
virtio: make find_vqs() checkpatch.pl-friendly
virtio_balloon: fix race between migration and ballooning
virtio_balloon: fix race by fill and leak
s390: more efficient smp barriers
s390: use generic memory barriers
xen/events: use virt_xxx barriers
xen/io: use virt_xxx barriers
xenbus: use virt_xxx barriers
virtio_ring: use virt_store_mb
sh: move xchg_cmpxchg to a header by itself
sh: support 1 and 2 byte xchg
virtio_ring: update weak barriers to use virt_xxx
Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
asm-generic: implement virt_xxx memory barriers
x86: define __smp_xxx
xtensa: define __smp_xxx
tile: define __smp_xxx
...
18 Jan, 2016
1 commit
-
Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
MADV_FREE syscall is called") introduced this new function, but got the
error handling for when pmd_trans_huge_lock() fails wrong. In the
failure case, the lock has not been taken, and we should not unlock on
the way out.Cc: Minchan Kim
Cc: Andrew Morton
Signed-off-by: Linus Torvalds