29 Feb, 2016
15 commits
-
It looks like all the call paths that lead to __acct_update_integrals()
already have irqs disabled, and __acct_update_integrals() does not need
to disable irqs itself.This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves 21% vs.
4.5-rc1 with both the removal of divisions from __acct_update_integrals()
and this patch, with runtime dropping from 3.7 to 2.9 seconds.With these patches applied, the highest remaining cpu user in
the trace is native_sched_clock, which is addressed in the next
patch.For testing purposes I stuck a WARN_ON(!irqs_disabled()) test
in __acct_update_integrals(). It did not trigger.Suggested-by: Peter Zijlstra
Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar -
Change the indentation in __acct_update_integrals() to make the function
a little easier to read.Suggested-by: Peter Zijlstra
Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Acked-by: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar -
When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals().This function converts cputime_t to jiffies, to a timeval, only to
convert the timeval back to microseconds before discarding it.This patch leaves __acct_update_integrals() functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar -
I've been debugging why deadline tasks can cause the RT scheduler to
throttle, even when the deadline tasks are only taking up 50% of the
CPU and RT tasks are not even using 1% of the CPU. Here's what I found.In order to keep a CPU from being hogged by RT tasks, the deadline
scheduler adds its run time (delta_exec) to the rt_time of the RT
bandwidth. That way, if the two use more than 95% of the CPU within one
second (default settings), the RT tasks are throttled to allow non RT
tasks to run.Although the deadline tasks add their run time to the RT bandwidth, it
lets the RT tasks do the accounting. This is where the problem lies. If
a deadline task runs for a bit, and no RT tasks are running, then it
will continually add to the RT rt_time that is used to calculate how
much CPU the RT tasks use. But no RT period is in play, and this
accumulation of the runtime never gets reset.When an RT task finally gets to run, and the watchdog goes off, it can
see that the RT task has used more than it should of, because the
deadline task added all this runtime to its rt_time. Then the RT task
that just woke up gets throttled for no good reason.I also noticed that when an RT task is queued, it starts the timer to
account for overload and such. But that timer goes off one period
later, which may be too late and the extra rt_time will trigger a
throttle.This is a quick work around to the problem. When a new RT task is
queued, the bandwidth timer is set to go off immediately. Then the
timer can clear out the extra time added to the rt_time while there was
no RT task running. This stops my tests from triggering the throttle,
and it will still throttle if an RT task runs too much, even while a
deadline task is running.A better solution may be to subtract the bandwidth that the deadline
task uses from the rt_runtime, and add it back when its finished. Then
there wont be a need for runtime tracking of the time used by deadline
tasks.I may play with that solution tomorrow.
Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc:
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira
Cc: John Kacur
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160216183746.349ec98b@gandalf.local.home
Signed-off-by: Ingo Molnar -
Playing with SCHED_DEADLINE and cpusets, I found that I was unable to create
new SCHED_DEADLINE tasks, with the error of EBUSY as if the bandwidth was
already used up. I then realized there wa no way to see what bandwidth is
used by the runqueues to debug the issue.By adding the dl_bw->bw and dl_bw->total_bw to the output of the deadline
info in /proc/sched_debug, this allows us to see what bandwidth has been
reserved and where a problem may exist.For example, before the issue we see the ratio of the bandwidth:
# cat /proc/sys/kernel/sched_rt_runtime_us
950000
# cat /proc/sys/kernel/sched_rt_period_us
1000000# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0Note: (950000 / 1000000) << 20 == 996147
After I played with cpusets and hit the issue, the result is now:
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[1]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[2]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[3]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 104857
dl_rq[4]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[5]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[6]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857
dl_rq[7]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -104857This shows that there is definitely a problem as we should never have a
negative total bandwidth.Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Clark Williams
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160222212825.756849091@goodmis.org
Signed-off-by: Ingo Molnar -
The sched_domain_sysctl setup is only enabled when SCHED_DEBUG is
configured. As debug.c is only compiled when SCHED_DEBUG is configured as
well, move the setup of sched_domain_sysctl into that file.Note, the (un)register_sched_domain_sysctl() functions had to be changed
from static to allow access to them from core.c.Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Clark Williams
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160222212825.599278093@goodmis.org
Signed-off-by: Ingo Molnar -
As /sys/kernel/debug/sched_features is only created when SCHED_DEBUG is enabled, and the file
debug.c is only compiled when SCHED_DEBUG is enabled, it makes sense to move
sched_feature setup into that file and get rid of the #ifdef.Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Clark Williams
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160222212825.464193063@goodmis.org
Signed-off-by: Ingo Molnar -
Andrea Parri reported:
> I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
> handled correctly:
>
> T1 (prio = 20)
> lock(rtmutex);
>
> T2 (prio = 20)
> blocks on rtmutex (rt_nr_boosted = 0 on T1's rq)
>
> T1 (prio = 20)
> sys_set_scheduler(prio = 0)
> [new_effective_prio == oldprio]
> T1 prio = 20 (rt_nr_boosted = 0 on T1's rq)
>
> The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
> in particular, if we continue with
>
> T1 (prio = 20)
> unlock(rtmutex)
> wakeup(T2)
> adjust_prio(T1)
> [prio != rt_mutex_getprio(T1)]
> dequeue(T1)
> rt_nr_boosted = (unsigned long)(-1)
> ...
> T1 prio = 0
>
> then we end up leaving rt_nr_boosted in an "inconsistent" state.
>
> The simple program attached could reproduce the previous scenario; note
> that, as a consequence of the presence of this state, the "assertion"
>
> WARN_ON(!rt_nr_running && rt_nr_boosted)
>
> from dec_rt_group() may trigger.So normally we dequeue/enqueue tasks in sched_setscheduler(), which
would ensure the accounting stays correct. However in the early PI path
we fail to do so.So this was introduced at around v3.14, by:
c365c292d059 ("sched: Consider pi boosting in setscheduler()")
which fixed another problem exactly because that dequeue/enqueue, joy.
Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
preserve runqueue location with that option. This requires decoupling
the on_rt_rq() state from being on the list.In order to allow for explicit movement during the SAVE/RESTORE,
introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
cases to preserve other invariants.Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
things like sys_nice()/sys_sched_setaffinity() also do not reorder
FIFO tasks (whereas they used to before this patch).Reported-by: Andrea Parri
Tested-by: Andrea Parri
Signed-off-by: Peter Zijlstra (Intel)
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar -
Signed-off-by: Dongsheng Yang
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar -
Lets factorize a bit of code there. We'll even have a third user soon.
While at it, standardize the idle update function name against the
others.Signed-off-by: Frederic Weisbecker
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Byungchul Park
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Linus Torvalds
Cc: Luiz Capitulino
Cc: Mike Galbraith
Cc: Paul E . McKenney
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1452700891-21807-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar -
decay_load_missed() cannot handle nagative values, so we need to prevent
using the function with a negative value.Reported-by: Dietmar Eggemann
Signed-off-by: Byungchul Park
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Luiz Capitulino
Cc: Mike Galbraith
Cc: Paul E . McKenney
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: perterz@infradead.org
Fixes: 59543275488d ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar
-
Steven noticed that occasionally a sched_yield() call would not result
in a wait for the next period edge as expected.It turns out that when we call update_curr_dl() and end up with
delta_exec < 0, which in turn means that
replenish would gift us with too much runtime.Fix both issues by not relying on the dl.runtime value for yield.
Reported-by: Steven Rostedt
Tested-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira
Cc: John Kacur
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160223122822.GP6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.The current site for doing so it unsuited because its far too late and
unordered against other cgroup removal (->css_free() will be, but we're also
in an RCU callback).Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")
Signed-off-by: Peter Zijlstra (Intel)
Cc: Christian Borntraeger
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
28 Feb, 2016
25 commits
-
Pull perf fixes from Thomas Gleixner:
"A rather largish series of 12 patches addressing a maze of race
conditions in the perf core code from Peter Zijlstra"* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Robustify task_function_call()
perf: Fix scaling vs. perf_install_in_context()
perf: Fix scaling vs. perf_event_enable()
perf: Fix scaling vs. perf_event_enable_on_exec()
perf: Fix ctx time tracking by introducing EVENT_TIME
perf: Cure event->pending_disable race
perf: Fix race between event install and jump_labels
perf: Fix cloning
perf: Only update context time when active
perf: Allow perf_release() with !event->ctx
perf: Do not double free
perf: Close install vs. exit race -
Pull x86 fixes from Thomas Gleixner:
"This update contains:- Hopefully the last ASM CLAC fixups
- A fix for the Quark family related to the IMR lock which makes
kexec work again- A off-by-one fix in the MPX code. Ironic, isn't it?
- A fix for X86_PAE which addresses once more an unsigned long vs
phys_addr_t hickup"* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mpx: Fix off-by-one comparison with nr_registers
x86/mm: Fix slow_virt_to_phys() for X86_PAE again
x86/entry/compat: Add missing CLAC to entry_INT80_32
x86/entry/32: Add an ASM_CLAC to entry_SYSENTER_32
x86/platform/intel/quark: Change the kernel's IMR lock bit to false -
Pull scheduler fixlet from Thomas Gleixner:
"A trivial printk typo fix"* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix trivial typo in printk() message -
Pull irq fixes from Thomas Gleixner:
"Four small fixes for irqchip drivers:- Add missing low level irq handler initialization on mxs, so
interrupts can acutally be delivered- Add a missing barrier to the GIC driver
- Two fixes for the GIC-V3-ITS driver, addressing a double EOI write
and a cache flush beyond the actual region"* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Add missing barrier to 32bit version of gic_read_iar()
irqchip/mxs: Add missing set_handle_irq()
irqchip/gicv3-its: Avoid cache flush beyond ITS_BASERn memory size
irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1 -
Pull staging/android fix from Greg KH:
"Here is one patch, for the android binder driver, to resolve a
reported problem. Turns out it has been around for a while (since
3.15), so it is good to finally get it resolved.It has been in linux-next for a while with no reported issues"
* tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE -
Pull USB fixes from Greg KH:
"Here are a few USB fixes for 4.5-rc6They fix a reported bug for some USB 3 devices by reverting the recent
patch, a MAINTAINERS change for some drivers, some new device ids, and
of course, the usual bunch of USB gadget driver fixes.All have been in linux-next for a while with no reported issues"
* tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
MAINTAINERS: drop OMAP USB and MUSB maintainership
usb: musb: fix DMA for host mode
usb: phy: msm: Trigger USB state detection work in DRD mode
usb: gadget: net2280: fix endpoint max packet for super speed connections
usb: gadget: gadgetfs: unregister gadget only if it got successfully registered
usb: gadget: remove driver from pending list on probe error
Revert "usb: hub: do not clear BOS field during reset device"
usb: chipidea: fix return value check in ci_hdrc_pci_probe()
usb: chipidea: error on overflow for port_test_write
USB: option: add "4G LTE usb-modem U901"
USB: cp210x: add IDs for GE B650V3 and B850V3 boards
USB: option: add support for SIM7100E
usb: musb: Fix DMA desired mode for Mentor DMA engine
usb: gadget: fsl_qe_udc: fix IS_ERR_VALUE usage
usb: dwc2: USB_DWC2 should depend on HAS_DMA
usb: dwc2: host: fix the data toggle error in full speed descriptor dma
usb: dwc2: host: fix logical omissions in dwc2_process_non_isoc_desc
usb: dwc3: Fix assignment of EP transfer resources
usb: dwc2: Add extra delay when forcing dr_mode -
Pull vfs fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_last(): ELOOP failure exit should be done after leaving RCU mode
should_follow_link(): validate ->d_seq after having decided to follow
namei: ->d_inode of a pinned dentry is stable only for positives
do_last(): don't let a bogus return value from ->open() et.al. to confuse us
fs: return -EOPNOTSUPP if clone is not supported
hpfs: don't truncate the file when delete fails -
Pull ARM SoC fixes from Olof Johansson:
"We didn't have a batch last week, so this one is slightly larger.None of them are scary though, a handful of fixes for small DT pieces,
replacing properties with newer conventions.Highlights:
- N900 fix for setting system revision
- onenand init fix to avoid filesystem corruption
- Clock fix for audio on Beaglebone-x15
- Fixes on shmobile to deal with CONFIG_DEBUG_RODATA (default y in 4.6)+ misc smaller stuff"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
MAINTAINERS: Extend info, add wiki and ml for meson arch
MAINTAINERS: alpine: add a new maintainer and update the entry
ARM: at91/dt: fix typo in sama5d2 pinmux descriptions
ARM: OMAP2+: Fix onenand initialization to avoid filesystem corruption
Revert "regulator: tps65217: remove tps65217.dtsi file"
ARM: shmobile: Remove shmobile_boot_arg
ARM: shmobile: Move shmobile_smp_{mpidr, fn, arg}[] from .text to .bss
ARM: shmobile: r8a7779: Remove remainings of removed SCU boot setup code
ARM: shmobile: Move shmobile_scu_base from .text to .bss
ARM: OMAP2+: Fix omap_device for module reload on PM runtime forbid
ARM: OMAP2+: Improve omap_device error for driver writers
ARM: DTS: am57xx-beagle-x15: Select SYS_CLK2 for audio clocks
ARM: dts: am335x/am57xx: replace gpio-key,wakeup with wakeup-source property
ARM: OMAP2+: Set system_rev from ATAGS for n900
ARM: dts: orion5x: fix the missing mtd flash on linkstation lswtgl
ARM: dts: kirkwood: use unique machine name for ds112
ARM: dts: imx6: remove bogus interrupt-parent from CAAM node -
... or we risk seeing a bogus value of d_is_symlink() there.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro -
... otherwise d_is_symlink() above might have nothing to do with
the inode value we've got.Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro -
both do_last() and walk_component() risk picking a NULL inode out
of dentry about to become positive, *then* checking its flags and
seeing that it's not negative anymore and using (already stale by
then) value they'd fetched earlier. Usually ends up oopsing soon
after that...Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro -
... into returning a positive to path_openat(), which would interpret that
as "symlink had been encountered" and proceed to corrupt memory, etc.
It can only happen due to a bug in some ->open() instance or in some LSM
hook, etc., so we report any such event *and* make sure it doesn't trick
us into further unpleasantness.Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro -
-EBADF is a rather confusing error if an operations is not supported,
and nfsd gets rather upset about it.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
The delete opration can allocate additional space on the HPFS filesystem
due to btree split. The HPFS driver checks in advance if there is
available space, so that it won't corrupt the btree if we run out of space
during splitting.If there is not enough available space, the HPFS driver attempted to
truncate the file, but this results in a deadlock since the commit
7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex
and lock it on every callback from VFS").This patch removes the code that tries to truncate the file and -ENOSPC is
returned instead. If the user hits -ENOSPC on delete, he should try to
delete other files (that are stored in a leaf btree node), so that the
delete operation will make some space for deleting the file stored in
non-leaf btree node.Reported-by: Al Viro
Signed-off-by: Mikulas Patocka
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Al Viro -
Merge fixes from Andrew Morton:
"10 fixes"* emailed patches from Andrew Morton :
dax: move writeback calls into the filesystems
dax: give DAX clearing code correct bdev
ext4: online defrag not supported with DAX
ext2, ext4: only set S_DAX for regular inodes
block: disable block device DAX by default
ocfs2: unlock inode if deleting inode from orphan fails
mm: ASLR: use get_random_long()
drivers: char: random: add get_random_long()
mm: numa: quickly fail allocations for NUMA balancing on full nodes
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED -
Pull ext2/4 DAX fix from Ted Ts'o:
"This fixes a file system corruption bug with DAX"* tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite() -
Pull PCI fixes from Bjorn Helgaas:
"Enumeration:
Revert x86 pcibios_alloc_irq() to fix regression (Bjorn Helgaas)Marvell MVEBU host bridge driver:
Restrict build to 32-bit ARM (Thierry Reding)"* tag 'pci-v4.5-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: mvebu: Restrict build to 32-bit ARM
Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"
Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed"
Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled" -
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.Reviewed-by: Jan Kara
Signed-off-by: Ross Zwisler
Signed-off-by: Theodore Ts'o -
Pull clk fix from Stephen Boyd:
"One small fix to keep OMAP platforms working across a suspend/resume
cycle"* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: ti: omap3+: dpll: use non-locking version of clk_get_rate -
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response
to sync(2).Signed-off-by: Ross Zwisler
Signed-off-by: Jan Kara
Cc: Al Viro
Cc: Dan Williams
Cc: Dave Chinner
Cc: Jens Axboe
Cc: Matthew Wilcox
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases. This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block. This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.Signed-off-by: Ross Zwisler
Suggested-by: Dan Williams
Reviewed-by: Jan Kara
Cc: Theodore Ts'o
Cc: Al Viro
Cc: Dave Chinner
Cc: Jens Axboe
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: Theodore Ts'o
Cc: Al Viro
Cc: Dan Williams
Cc: Dave Chinner
Cc: Jens Axboe
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().With the current code, though, this isn't always true. For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries. For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4. This allows us to have strict DAX vs non-DAX paths in the
writeback code.Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: Theodore Ts'o
Cc: Al Viro
Cc: Dan Williams
Cc: Dave Chinner
Cc: Jens Axboe
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.dump_stack+0x85/0xc2
dax_writeback_mapping_range+0x60/0xe0
blkdev_writepages+0x3f/0x50
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait+0x4a/0xa0
set_blocksize+0x70/0xd0
sb_set_blocksize+0x1d/0x50
ext4_fill_super+0x75b/0x3360
mount_bdev+0x180/0x1b0
ext4_mount+0x15/0x20
mount_fs+0x38/0x170Mark the support broken so its disabled by default, but otherwise still
available for testing.Signed-off-by: Dan Williams
Signed-off-by: Ross Zwisler
Reported-by: Ross Zwisler
Suggested-by: Dave Chinner
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Matthew Wilcox
Cc: Al Viro
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").Signed-off-by: Guozhonghua
Signed-off-by: Joseph Qi
Reviewed-by: Gang He
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: [4.2+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds