22 Feb, 2020
1 commit
-
This reverts commit a97955844807e327df11aa33869009d14d6b7de0.
Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
in exit_sem()") removes a lock that is needed. This leads to a process
looping infinitely in exit_sem() and can also lead to a crash. There is
a reproducer available in [1] and with the commit reverted the issue
does not reproduce anymore.Using the reproducer found in [1] is fairly easy to reach a point where
one of the child processes is looping infinitely in exit_sem between
for(;;) and if (semid == -1) block, while it's trying to free its last
sem_undo structure which has already been freed by freeary().Each sem_undo struct is on two lists: one per semaphore set (list_id)
and one per process (list_proc). The list_id list tracks undos by
semaphore set, and the list_proc by process.Undo structures are removed either by freeary() or by exit_sem(). The
freeary function is invoked when the user invokes a syscall to remove a
semaphore set. During this operation freeary() traverses the list_id
associated with the semaphore set and removes the undo structures from
both the list_id and list_proc lists.For this case, exit_sem() is called at process exit. Each process
contains a struct sem_undo_list (referred to as "ulp") which contains
the head for the list_proc list. When the process exits, exit_sem()
traverses this list to remove each sem_undo struct. As in freeary(),
whenever a sem_undo struct is removed from list_proc, it is also removed
from the list_id list.Removing elements from list_id is safe for both exit_sem() and freeary()
due to sem_lock(). Removing elements from list_proc is not safe;
freeary() locks &un->ulp->lock when it performs
list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
lock usage in exit_sem()").This can result in the following situation while executing the
reproducer [1] : Consider a child process in exit_sem() and the parent
in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).- The list_proc for the child contains the last two undo structs A and
B (the rest have been removed either by exit_sem() or freeary()).- The semid for A is 1 and semid for B is 2.
- exit_sem() removes A and at the same time freeary() removes B.
- Since A and B have different semid sem_lock() will acquire different
locks for each process and both can proceed.The bug is that they remove A and B from the same list_proc at the same
time because only freeary() acquires the ulp lock. When exit_sem()
removes A it makes ulp->list_proc.next to point at B and at the same
time freeary() removes B setting B->semid=-1.At the next iteration of for(;;) loop exit_sem() will try to remove B.
The only way to break from for(;;) is for (&un->list_proc ==
&ulp->list_proc) to be true which is not. Then exit_sem() will check if
B->semid=-1 which is and will continue looping in for(;;) until the
memory for B is reallocated and the value at B->semid is changed.At that point, exit_sem() will crash attempting to unlink B from the
lists (this can be easily triggered by running the reproducer [1] a
second time).To prove this scenario instrumentation was added to keep information
about each sem_undo (un) struct that is removed per process and per
semaphore set (sma).CPU0 CPU1
[caller holds sem_lock(sma for A)] ...
freeary() exit_sem()
... ...
... sem_lock(sma for B)
spin_lock(A->ulp->lock) ...
list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc)Undo structures A and B have different semid and sem_lock() operations
proceed. However they belong to the same list_proc list and they are
removed at the same time. This results into ulp->list_proc.next
pointing to the address of B which is already removed.After reverting commit a97955844807 ("ipc,sem: remove uneeded
sem_undo_list lock usage in exit_sem()") the issue was no longer
reproducible.[1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779
Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
Signed-off-by: Ioanna Alifieraki
Acked-by: Manfred Spraul
Acked-by: Herton R. Krzesinski
Cc: Arnd Bergmann
Cc: Catalin Marinas
Cc:
Cc: Joel Fernandes (Google)
Cc: Davidlohr Bueso
Cc: Jay Vosburgh
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Feb, 2020
6 commits
-
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.Conversion rule is:
llseek => proc_lseek
unlocked_ioctl => proc_ioctlxxx => proc_xxx
delete ".owner = THIS_MODULE" line
[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing
Signed-off-by: Nathan Chancellor
Suggested-by: Arnd Bergmann
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Cc: NeilBrown
From: Andrew Morton
Subject: drivers/block/null_blk_main.c: fix layoutEach line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.Cc: Shaohua Li
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Document and update the memory barriers in ipc/sem.c:
- Add smp_store_release() to wake_up_sem_queue_prepare() and
document why it is needed.- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
as the pair for the barrier inside wake_up_sem_queue_prepare().- Add comments to all barriers, and mention the rules in the block
regarding locking.- Switch to using wake_q_add_safe().
Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com
Signed-off-by: Manfred Spraul
Cc: Waiman Long
Cc: Davidlohr Bueso
Cc:
Cc: Peter Zijlstra
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Transfer findings from ipc/mqueue.c:
- A control barrier was missing for the lockless receive case So in
theory, not yet initialized data may have been copied to user space -
obviously only for architectures where control barriers are not NOP.- use smp_store_release(). In theory, the refount may have been
decreased to 0 already when wake_q_add() tries to get a reference.Link: http://lkml.kernel.org/r/20191020123305.14715-5-manfred@colorfullife.com
Signed-off-by: Manfred Spraul
Cc: Waiman Long
Cc: Davidlohr Bueso
Cc:
Cc: Peter Zijlstra
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.
- add smp_aquire__after_ctrl_dep() after the READ_ONCE, we need
acquire semantics if the value is STATE_READY.- use wake_q_add_safe()
- document why __set_current_state() may be used:
Reading task->state cannot happen before the wake_q_add() call,
which happens while holding info->lock. Thus the spin_unlock()
is the RELEASE, and the spin_lock() is the ACQUIRE.For completeness: there is also a 3 CPU scenario, if the to be woken
up task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRELink: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul
Reviewed-by: Davidlohr Bueso
Cc: Waiman Long
Cc:
Cc: Peter Zijlstra
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
pipelined_send() and pipelined_receive() are identical, so merge them.
[manfred@colorfullife.com: add changelog]
Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso
Signed-off-by: Manfred Spraul
Cc:
Cc: Peter Zijlstra
Cc: Waiman Long
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Dec, 2019
1 commit
-
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().This patch is generated using following script:
EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
doif [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
doneSigned-off-by: Pankaj Bharadiya
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook
Signed-off-by: Kees Cook
Acked-by: David Miller # for net
15 Nov, 2019
1 commit
-
The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.Signed-off-by: Arnd Bergmann
26 Sep, 2019
3 commits
-
CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep
expression if using srcu or locking for protection. It can only check
regular RCU protection, all other protection needs to be passed as lockdep
expression.Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google)
Cc: Arnd Bergmann
Cc: Bjorn Helgaas
Cc: Catalin Marinas
Cc: "Gustavo A. R. Silva"
Cc: Jonathan Derrick
Cc: Keith Busch
Cc: Lorenzo Pieralisi
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Null pointers were assigned to local variables in a few cases as exception
handling. The jump target “out” was used where no meaningful data
processing actions should eventually be performed by branches of an if
statement then. Use an additional jump target for calling dev_kfree_skb()
directly.Return also directly after error conditions were detected when no extra
clean-up is needed by this function implementation.Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
Signed-off-by: Markus Elfring
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dev_kfree_skb() input parameter validation, thus the test around the call
is not needed.This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
Signed-off-by: Markus Elfring
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Sep, 2019
1 commit
-
Pull vfs mount API infrastructure updates from Al Viro:
"Infrastructure bits of mount API conversions.The rest is more of per-filesystem updates and that will happen
in separate pull requests"* 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mtd: Provide fs_context-aware mount_mtd() replacement
vfs: Create fs_context-aware mount_bdev() replacement
new helper: get_tree_keyed()
vfs: set fs_context::user_ns for reconfigure
08 Sep, 2019
1 commit
-
Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
to a commit from my y2038 series in linux-5.1, as I missed the custom
sys_ipc() wrapper that sparc64 uses in place of the generic version that
I patched.The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
now do not allow being called with the IPC_64 flag any more, resulting
in a -EINVAL error when they don't recognize the command.Instead, the correct way to do this now is to call the internal
ksys_old_{sem,shm,msg}ctl() functions to select the API version.As we generally move towards these functions anyway, change all of
sparc_ipc() to consistently use those in place of the sys_*() versions,
and move the required ksys_*() declarations into linux/syscalls.hThe IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
errors when ipc is disabled.Reported-by: Matt Turner
Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
Cc: stable@vger.kernel.org
Tested-by: Matt Turner
Tested-by: Anatoly Pugachev
Signed-off-by: Arnd Bergmann
06 Sep, 2019
1 commit
-
For vfs_get_keyed_super users.
Signed-off-by: Al Viro
20 Jul, 2019
1 commit
-
Pull vfs mount updates from Al Viro:
"The first part of mount updates.Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
19 Jul, 2019
1 commit
-
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce
Signed-off-by: Arnd Bergmann
Acked-by: Kees Cook
Reviewed-by: Aaron Tomlin
Cc: Matthew Wilcox
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jul, 2019
1 commit
-
Andreas Christoforou reported:
UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771Which could be triggered by:
struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook
Reported-by: Andreas Christoforou
Acked-by: "Eric W. Biederman"
Cc: Al Viro
Cc: Arnd Bergmann
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2019
1 commit
-
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the licenseextracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 315 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Allison Randal
Reviewed-by: Armijn Hemel
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman
26 May, 2019
1 commit
-
... so that we could lift the capability checks into ->get_tree()
callerSigned-off-by: Al Viro
24 May, 2019
1 commit
-
Based on 1 normalized pattern(s):
this file is released under gnu general public licence version 2 or
at your option any later version see the file copying for more
detailsextracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Kate Stewart
Reviewed-by: Richard Fontana
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071857.941092988@linutronix.de
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
6 commits
-
For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elementsResult:
- for normal mode:
No change for 4095 active objects until the 3rd level
is added without cyclic allocation.For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<
Acked-by: Waiman Long
Cc: "Luis R. Rodriguez"
Cc: Kees Cook
Cc: Jonathan Corbet
Cc: Al Viro
Cc: Matthew Wilcox
Cc: "Eric W . Biederman"
Cc: Takashi Iwai
Cc: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Rewrite, based on the patch from Waiman Long:
The mixing in of a sequence number into the IPC IDs is probably to avoid
ID reuse in userspace as much as possible. With ipcmni_extend mode, the
number of usable sequence numbers is greatly reduced leading to higher
chance of ID reuse.To address this issue, we need to conserve the sequence number space as
much as possible. Right now, the sequence number is incremented for
every new ID created. In reality, we only need to increment the
sequence number when new allocated ID is not greater than the last one
allocated. It is in such case that the new ID may collide with an
existing one. This is being done irrespective of the ipcmni mode.In order to avoid any races, the index is first allocated and then the
pointer is replaced.Changes compared to the initial patch:
- Handle failures from idr_alloc().
- Avoid that concurrent operations can see the wrong sequence number.
(This is achieved by using idr_replace()).
- IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
ipcmni_seq_shift().
- IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
Signed-off-by: Manfred Spraul
Signed-off-by: Waiman Long
Suggested-by: Matthew Wilcox
Acked-by: Waiman Long
Cc: Al Viro
Cc: Davidlohr Bueso
Cc: "Eric W . Biederman"
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: "Luis R. Rodriguez"
Cc: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The maximum number of unique System V IPC identifiers was limited to
32k. That limit should be big enough for most use cases.However, there are some users out there requesting for more, especially
those that are migrating from Solaris which uses 24 bits for unique
identifiers. To satisfy the need of those users, a new boot time kernel
option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This
is a 512X increase which should be big enough for users out there that
need a large number of unique IPC identifier.The use of this new option will change the pattern of the IPC
identifiers returned by functions like shmget(2). An application that
depends on such pattern may not work properly. So it should only be
used if the users really need more than 32k of unique IPC numbers.This new option does have the side effect of reducing the maximum number
of unique sequence numbers from 64k down to 128. So it is a trade-off.The computation of a new IPC id is not done in the performance critical
path. So a little bit of additional overhead shouldn't have any real
performance impact.Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
Signed-off-by: Waiman Long
Acked-by: Manfred Spraul
Cc: Al Viro
Cc: Davidlohr Bueso
Cc: "Eric W . Biederman"
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: "Luis R. Rodriguez"
Cc: Matthew Wilcox
Cc: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
improve performance of send/recv"). However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course). By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is specially nice
for the rt cases. Furthermore, some callers can call msg_get() in a
loop.A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game. Passes ltp mq testcases.Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We already store the current task fo the new waiter before calling
wq_sleep() in both send and recv paths. Trivially remove the redundant
assignment.Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlockrcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbercu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbeDavidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing
Signed-off-by: Zhang Yu
Reviewed-by: Davidlohr Bueso
Cc: Manfred Spraul
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2019
1 commit
-
Pull networking updates from David Miller:
"Highlights:1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
02 May, 2019
1 commit
-
Signed-off-by: Al Viro
08 Apr, 2019
1 commit
-
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.Signed-off-by: NeilBrown
Signed-off-by: David S. Miller
13 Mar, 2019
1 commit
-
Pull vfs mount infrastructure updates from Al Viro:
"The rest of core infrastructure; no new syscalls in that pile, but the
old parts are switched to new infrastructure. At that point
conversions of individual filesystems can happen independently; some
are done here (afs, cgroup, procfs, etc.), there's also a large series
outside of that pile dealing with NFS (quite a bit of option-parsing
stuff is getting used there - it's one of the most convoluted
filesystems in terms of mount-related logics), but NFS bits are the
next cycle fodder.It got seriously simplified since the last cycle; documentation is
probably the weakest bit at the moment - I considered dropping the
commit introducing Documentation/filesystems/mount_api.txt (cutting
the size increase by quarter ;-), but decided that it would be better
to fix it up after -rc1 instead.That pile allows to do followup work in independent branches, which
should make life much easier for the next cycle. fs/super.c size
increase is unpleasant; there's a followup series that allows to
shrink it considerably, but I decided to leave that until the next
cycle"* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
afs: Use fs_context to pass parameters over automount
afs: Add fs_context support
vfs: Add some logging to the core users of the fs_context log
vfs: Implement logging through fs_context
vfs: Provide documentation for new mount API
vfs: Remove kern_mount_data()
hugetlbfs: Convert to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
cgroup: store a reference to cgroup_ns into cgroup_fs_context
cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
cgroup_do_mount(): massage calling conventions
cgroup: stash cgroup_root reference into cgroup_fs_context
cgroup2: switch to option-by-option parsing
cgroup1: switch to option-by-option parsing
cgroup: take options parsing into ->parse_monolithic()
cgroup: fold cgroup1_mount() into cgroup1_get_tree()
cgroup: start switching to fs_context
ipc: Convert mqueue fs to fs_context
proc: Add fs_context support to procfs
...
08 Mar, 2019
2 commits
-
Use kvzalloc() instead of kvmalloc() and memset().
Also, make use of the struct_size() helper instead of the open-coded
version in order to avoid any potential type mistakes.This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190131214221.GA28930@embeddedor
Signed-off-by: Gustavo A. R. Silva
Reviewed-by: Andrew Morton
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Cc: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).This commit remove the following warning:
ipc/sem.c:1683:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Link: http://lkml.kernel.org/r/20190114203608.18218-1-malat@debian.org
Signed-off-by: Mathieu Malaterre
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Feb, 2019
1 commit
-
Convert the mqueue filesystem to use the filesystem context stuff.
Notes:
(1) The relevant ipc namespace is selected in when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.Yes, there was a performance gain in not searching the superblock
hash, but it's only paid once per ipc namespace - and only if someone
uses mqueue within that namespace, so I'm not sure it's worth it,
especially as calling sget() allows avoidance of recursion.Signed-off-by: David Howells
Signed-off-by: Al Viro
07 Feb, 2019
1 commit
-
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.Acked-by: Catalin Marinas
Signed-off-by: Arnd Bergmann
26 Jan, 2019
1 commit
-
The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.Signed-off-by: Arnd Bergmann
18 Jan, 2019
1 commit
-
The sys_ipc() and compat_ksys_ipc() functions are meant to only
be used from the system call table, not called by another function.Introduce ksys_*() interfaces for this purpose, as we have done
for many other system calls.Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Reviewed-by: Heiko Carstens
[heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
Signed-off-by: Heiko Carstens
Signed-off-by: Martin Schwidefsky
31 Oct, 2018
2 commits
-
For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni and
shmmni, we can't directly use the _minmax handler. Instead, a special sem
specific handler is added to check the last argument to make sure that it
is limited to the [0, IPCMNI] range. An error will be returned if this is
not the case.Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long
Reviewed-by: Davidlohr Bueso
Cc: "Eric W. Biederman"
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Luis R. Rodriguez
Cc: Matthew Wilcox
Cc: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "ipc: IPCMNI limit check for *mni & increase that limit", v9.
The sysctl parameters msgmni, shmmni and semmni have an inherent limit of
IPC_MNI (32k). However, users may not be aware of that because they can
write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.The real IPCMNI limit is now enforced to make sure that users won't put in
an unrealistic value. The first 2 patches enforce the limits.There are also users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 8M if the users
really want the extended value.This patch (of 4):
A user can write arbitrary integer values to msgmni and shmmni sysctl
parameters without getting error, but the actual limit is really IPCMNI
(32k). This can mislead users as they think they can get a value that is
not real.The right limits are now set for msgmni and shmmni so that the users will
become aware if they set a value outside of the acceptable range.Link: http://lkml.kernel.org/r/1536352137-12003-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long
Acked-by: Luis R. Rodriguez
Reviewed-by: Davidlohr Bueso
Cc: Kees Cook
Cc: Jonathan Corbet
Cc: Matthew Wilcox
Cc: "Eric W. Biederman"
Cc: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Oct, 2018
1 commit
-
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...