Eric Lee / smarc-fsl-linux-kernel

07 Jun, 2020

1 commit

3c0ad98c2 Merge tag 'integrity-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity ... Browse Code »

Pull integrity updates from Mimi Zohar:
"The main changes are extending the TPM 2.0 PCR banks with bank
specific file hashes, calculating the "boot_aggregate" based on other
TPM PCR banks, using the default IMA hash algorithm, instead of SHA1,
as the basis for the cache hash table key, and preventing the mprotect
syscall to circumvent an IMA mmap appraise policy rule.

- In preparation for extending TPM 2.0 PCR banks with bank specific
digests, commit 0b6cf6b97b7e ("tpm: pass an array of
tpm_extend_digest structures to tpm_pcr_extend()") modified
tpm_pcr_extend(). The original SHA1 file digests were
padded/truncated, before being extended into the other TPM PCR
banks. This pull request calculates and extends the TPM PCR banks
with bank specific file hashes completing the above change.

- The "boot_aggregate", the first IMA measurement list record, is the
"trusted boot" link between the pre-boot environment and the
running OS. With TPM 2.0, the "boot_aggregate" record is not
limited to being based on the SHA1 TPM PCR bank, but can be
calculated based on any enabled bank, assuming the hash algorithm
is also enabled in the kernel.

Other changes include the following and five other bug fixes/code
clean up:

- supporting both a SHA1 and a larger "boot_aggregate" digest in a
custom template format containing both the the SHA1 ('d') and
larger digests ('d-ng') fields.

- Initial hash table key fix, but additional changes would be good"

* tag 'integrity-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: Directly free *entry in ima_alloc_init_template() if digests is NULL
ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init()
ima: Directly assign the ima_default_policy pointer to ima_rules
ima: verify mprotect change is consistent with mmap policy
evm: Fix possible memory leak in evm_calc_hmac_or_hash()
ima: Set again build_ima_appraise variable
ima: Remove redundant policy rule set in add_rules()
ima: Fix ima digest hash table key calculation
ima: Use ima_hash_algo for collision detection in the measurement list
ima: Calculate and extend PCR with digests in ima_template_entry
ima: Allocate and initialize tfm for each PCR bank
ima: Switch to dynamically allocated buffer for template digests
ima: Store template digest directly in ima_template_entry
ima: Evaluate error in init_ima()
ima: Switch to ima_hash_algo for boot aggregate

Linus Torvalds
2020-06-07 00:39:05 +0800

05 Jun, 2020

7 commits

42413b498 ima: Directly free *entry in ima_alloc_init_template() if digests is NULL ... Browse Code »

To support multiple template digests, the static array entry->digest has
been replaced with a dynamically allocated array in commit aa724fe18a8a
("ima: Switch to dynamically allocated buffer for template digests"). The
array is allocated in ima_alloc_init_template() and if the returned pointer
is NULL, ima_free_template_entry() is called.

However, (*entry)->template_desc is not yet initialized while it is used by
ima_free_template_entry(). This patch fixes the issue by directly freeing
*entry without calling ima_free_template_entry().

Fixes: aa724fe18a8a ("ima: Switch to dynamically allocated buffer for template digests")
Reported-by: syzbot+223310b454ba6b75974e@syzkaller.appspotmail.com
Signed-off-by: Roberto Sassu
Signed-off-by: Mimi Zohar

Roberto Sassu
2020-06-05 18:04:11 +0800
886d7de63 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge yet more updates from Andrew Morton:

- More MM work. 100ish more to go. Mike Rapoport's "mm: remove
__ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

- Various other little subsystems

* emailed patches from Andrew Morton : (127 commits)
lib/ubsan.c: fix gcc-10 warnings
tools/testing/selftests/vm: remove duplicate headers
selftests: vm: pkeys: fix multilib builds for x86
selftests: vm: pkeys: use the correct page size on powerpc
selftests/vm/pkeys: override access right definitions on powerpc
selftests/vm/pkeys: test correct behaviour of pkey-0
selftests/vm/pkeys: introduce a sub-page allocator
selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
selftests/vm/pkeys: associate key on a mapped page and detect write violation
selftests/vm/pkeys: associate key on a mapped page and detect access violation
selftests/vm/pkeys: improve checks to determine pkey support
selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
selftests/vm/pkeys: fix number of reserved powerpc pkeys
selftests/vm/pkeys: introduce powerpc support
selftests/vm/pkeys: introduce generic pkey abstractions
selftests: vm: pkeys: use the correct huge page size
selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
selftests/vm/pkeys: fix pkey_disable_clear()
selftests: vm: pkeys: add helpers for pkey bits
...

Linus Torvalds
2020-06-05 10:18:29 +0800
d4eaa2837 mm: add kvfree_sensitive() for freeing sensitive data objects ... Browse Code »

For kvmalloc'ed data object that contains sensitive information like
cryptographic keys, we need to make sure that the buffer is always cleared
before freeing it. Using memset() alone for buffer clearing may not
provide certainty as the compiler may compile it away. To be sure, the
special memzero_explicit() has to be used.

This patch introduces a new kvfree_sensitive() for freeing those sensitive
data objects allocated by kvmalloc(). The relevant places where
kvfree_sensitive() can be used are modified to use it.

Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read")
Suggested-by: Linus Torvalds
Signed-off-by: Waiman Long
Signed-off-by: Andrew Morton
Reviewed-by: Eric Biggers
Acked-by: David Howells
Cc: Jarkko Sakkinen
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: Joe Perches
Cc: Matthew Wilcox
Cc: David Rientjes
Cc: Uladzislau Rezki
Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
Signed-off-by: Linus Torvalds

Waiman Long
2020-06-05 10:06:22 +0800
15a2bc4db Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull execve updates from Eric Biederman:
"Last cycle for the Nth time I ran into bugs and quality of
implementation issues related to exec that could not be easily be
fixed because of the way exec is implemented. So I have been digging
into exec and cleanup up what I can.

I don't think I have exec sorted out enough to fix the issues I
started with but I have made some headway this cycle with 4 sets of
changes.

- promised cleanups after introducing exec_update_mutex

- trivial cleanups for exec

- control flow simplifications

- remove the recomputation of bprm->cred

The net result is code that is a bit easier to understand and work
with and a decrease in the number of lines of code (if you don't count
the added tests)"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
exec: Compute file based creds only once
exec: Add a per bprm->file version of per_clear
binfmt_elf_fdpic: fix execfd build regression
selftests/exec: Add binfmt_script regression test
exec: Remove recursion from search_binary_handler
exec: Generic execfd support
exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
exec: Move the call of prepare_binprm into search_binary_handler
exec: Allow load_misc_binary to call prepare_binprm unconditionally
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
exec: Teach prepare_exec_creds how exec treats uids & gids
exec: Set the point of no return sooner
exec: Move handling of the point of no return to the top level
exec: Run sync_mm_rss before taking exec_update_mutex
exec: Fix spelling of search_binary_handler in a comment
exec: Move the comment from above de_thread to above unshare_sighand
exec: Rename flush_old_exec begin_new_exec
exec: Move most of setup_new_exec into flush_old_exec
exec: In setup_new_exec cache current in the local variable me
...

Linus Torvalds
2020-06-05 05:07:08 +0800
9ff725857 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull proc updates from Eric Biederman:
"This has four sets of changes:

- modernize proc to support multiple private instances

- ensure we see the exit of each process tid exactly

- remove has_group_leader_pid

- use pids not tasks in posix-cpu-timers lookup

Alexey updated proc so each mount of proc uses a new superblock. This
allows people to actually use mount options with proc with no fear of
messing up another mount of proc. Given the kernel's internal mounts
of proc for things like uml this was a real problem, and resulted in
Android's hidepid mount options being ignored and introducing security
issues.

The rest of the changes are small cleanups and fixes that came out of
my work to allow this change to proc. In essence it is swapping the
pids in de_thread during exec which removes a special case the code
had to handle. Then updating the code to stop handling that special
case"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: proc_pid_ns takes super_block as an argument
remove the no longer needed pid_alive() check in __task_pid_nr_ns()
posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
posix-cpu-timers: Extend rcu_read_lock removing task_struct references
signal: Remove has_group_leader_pid
exec: Remove BUG_ON(has_group_leader_pid)
posix-cpu-timer: Unify the now redundant code in lookup_task
posix-cpu-timer: Tidy up group_leader logic in lookup_task
proc: Ensure we see the exit of each process tid exactly once
rculist: Add hlists_swap_heads_rcu
proc: Use PIDTYPE_TGID in next_tgid
Use proc_pid_ns() to get pid_namespace from the proc superblock
proc: use named enums for better readability
proc: use human-readable values for hidepid
docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
proc: add option to mount only a pids subset
proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
proc: allow to mount many instances of proc in one pid namespace
proc: rename struct proc_fs_info to proc_fs_opts

Linus Torvalds
2020-06-05 04:54:34 +0800
acf25aa66 Merge tag 'Smack-for-5.8' of git://github.com/cschaufler/smack-next ... Browse Code »

Pull smack updates from Casey Schaufler:
"Clean out dead code and repair an out-of-bounds warning"

* tag 'Smack-for-5.8' of git://github.com/cschaufler/smack-next:
Smack: Remove unused inline function smk_ad_setfield_u_fs_path_mnt
Smack:- Remove redundant inode_smack cache
Smack:- Remove mutex lock "smk_lock" from inode_smack
Smack: slab-out-of-bounds in vsscanf
smack: remove redundant structure variable from header.
smack: avoid unused 'sip' variable warning

Linus Torvalds
2020-06-05 01:31:05 +0800
a484a497c Merge tag 'keys-next-20200602' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs ... Browse Code »

Pull keyring updates from David Howells:

- Fix a documentation warning.

- Replace a zero-length array with a flexible one

- Make the big_key key type use ChaCha20Poly1305 and use the crypto
algorithm directly rather than going through the crypto layer.

- Implement the update op for the big_key type.

* tag 'keys-next-20200602' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
keys: Implement update for the big_key type
security/keys: rewrite big_key crypto to use library interface
KEYS: Replace zero-length array with flexible-array
Documentation: security: core.rst: add missing argument

Linus Torvalds
2020-06-05 01:27:07 +0800

04 Jun, 2020

3 commits

cb8e59cc8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.

2) Add GSO partial support to igc, from Sasha Neftin.

3) Several cleanups and improvements to r8169 from Heiner Kallweit.

4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.

5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.

6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

8) Add sriov and vf support to hinic, from Luo bin.

9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...

Linus Torvalds
2020-06-04 07:27:18 +0800
6cc7c266e ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init() ... Browse Code »

If the template field 'd' is chosen and the digest to be added to the
measurement entry was not calculated with SHA1 or MD5, it is
recalculated with SHA1, by using the passed file descriptor. However, this
cannot be done for boot_aggregate, because there is no file descriptor.

This patch adds a call to ima_calc_boot_aggregate() in
ima_eventdigest_init(), so that the digest can be recalculated also for the
boot_aggregate entry.

Cc: stable@vger.kernel.org # 3.13.x
Fixes: 3ce1217d6cd5d ("ima: define template fields library and new helpers")
Reported-by: Takashi Iwai
Signed-off-by: Roberto Sassu
Signed-off-by: Mimi Zohar

Roberto Sassu
2020-06-04 05:20:43 +0800
067a436b1 ima: Directly assign the ima_default_policy pointer to ima_rules ... Browse Code »

This patch prevents the following oops:

[ 10.771813] BUG: kernel NULL pointer dereference, address: 0000000000000
[...]
[ 10.779790] RIP: 0010:ima_match_policy+0xf7/0xb80
[...]
[ 10.798576] Call Trace:
[ 10.798993] ? ima_lsm_policy_change+0x2b0/0x2b0
[ 10.799753] ? inode_init_owner+0x1a0/0x1a0
[ 10.800484] ? _raw_spin_lock+0x7a/0xd0
[ 10.801592] ima_must_appraise.part.0+0xb6/0xf0
[ 10.802313] ? ima_fix_xattr.isra.0+0xd0/0xd0
[ 10.803167] ima_must_appraise+0x4f/0x70
[ 10.804004] ima_post_path_mknod+0x2e/0x80
[ 10.804800] do_mknodat+0x396/0x3c0

It occurs when there is a failure during IMA initialization, and
ima_init_policy() is not called. IMA hooks still call ima_match_policy()
but ima_rules is NULL. This patch prevents the crash by directly assigning
the ima_default_policy pointer to ima_rules when ima_rules is defined. This
wouldn't alter the existing behavior, as ima_rules is always set at the end
of ima_init_policy().

Cc: stable@vger.kernel.org # 3.7.x
Fixes: 07f6a79415d7d ("ima: add appraise action keywords and default rules")
Reported-by: Takashi Iwai
Signed-off-by: Roberto Sassu
Signed-off-by: Mimi Zohar

Roberto Sassu
2020-06-04 02:54:35 +0800

03 Jun, 2020

5 commits

d9afbb350 Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security ... Browse Code »

Pull lockdown update from James Morris:
"An update for the security subsystem to allow unprivileged users
to see the status of the lockdown feature. From Jeremy Cline"

Also an added comment to describe CAP_SETFCAP.

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
capabilities: add description for CAP_SETFCAP
lockdown: Allow unprivileged users to see lockdown status

Linus Torvalds
2020-06-03 08:36:24 +0800
f41030a20 Merge tag 'selinux-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux ... Browse Code »

Pull SELinux updates from Paul Moore:
"The highlights:

- A number of improvements to various SELinux internal data
structures to help improve performance. We move the role
transitions into a hash table. In the content structure we shift
from hashing the content string (aka SELinux label) to the
structure itself, when it is valid. This last change not only
offers a speedup, but it helps us simplify the code some as well.

- Add a new SELinux policy version which allows for a more space
efficient way of storing the filename transitions in the binary
policy. Given the default Fedora SELinux policy with the unconfined
module enabled, this change drops the policy size from ~7.6MB to
~3.3MB. The kernel policy load time dropped as well.

- Some fixes to the error handling code in the policy parser to
properly return error codes when things go wrong"

* tag 'selinux-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: netlabel: Remove unused inline function
selinux: do not allocate hashtabs dynamically
selinux: fix return value on error in policydb_read()
selinux: simplify range_write()
selinux: fix error return code in policydb_read()
selinux: don't produce incorrect filename_trans_count
selinux: implement new format of filename transitions
selinux: move context hashing under sidtab
selinux: hash context structure directly
selinux: store role transitions in a hash table
selinux: drop unnecessary smp_load_acquire() call
selinux: fix warning Comparison to bool

Linus Torvalds
2020-06-03 08:16:47 +0800
91681e848 Merge tag 'tomoyo-pr-20200601' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1 ... Browse Code »

Pull tomoyo update from Tetsuo Handa:
"One patch for suppressing coccicheck's warning"

* tag 'tomoyo-pr-20200601' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
tomoyo: use true for bool variable

Linus Torvalds
2020-06-03 08:12:07 +0800
b6f61c314 keys: Implement update for the big_key type ... Browse Code »

Implement the ->update op for the big_key type.

Signed-off-by: David Howells
Acked-by: Jason A. Donenfeld

David Howells
2020-06-03 00:22:31 +0800
521fd61c8 security/keys: rewrite big_key crypto to use library interface ... Browse Code »

A while back, I noticed that the crypto and crypto API usage in big_keys
were entirely broken in multiple ways, so I rewrote it. Now, I'm
rewriting it again, but this time using the simpler ChaCha20Poly1305
library function. This makes the file considerably more simple; the
diffstat alone should justify this commit. It also should be faster,
since it no longer requires a mutex around the "aead api object" (nor
allocations), allowing us to encrypt multiple items in parallel. We also
benefit from being able to pass any type of pointer, so we can get rid
of the ridiculously complex custom page allocator that big_key really
doesn't need.

[DH: Change the select CRYPTO_LIB_CHACHA20POLY1305 to a depends on as
select doesn't propagate and the build can end up with an =y dependending
on some =m pieces.

The depends on CRYPTO also had to be removed otherwise the configurator
complains about a recursive dependency.]

Cc: Andy Lutomirski
Cc: Greg KH
Cc: Linus Torvalds
Cc: kernel-hardening@lists.openwall.com
Reviewed-by: Eric Biggers
Signed-off-by: Jason A. Donenfeld
Signed-off-by: David Howells

Jason A. Donenfeld
2020-06-03 00:22:31 +0800

02 Jun, 2020

3 commits

e0cd92068 Merge branch 'uaccess.access_ok' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull uaccess/access_ok updates from Al Viro:
"Removals of trivially pointless access_ok() calls.

Note: the fiemap stuff was removed from the series, since they are
duplicates with part of ext4 series carried in Ted's tree"

* 'uaccess.access_ok' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vmci_host: get rid of pointless access_ok()
hfi1: get rid of pointless access_ok()
usb: get rid of pointless access_ok() calls
lpfc_debugfs: get rid of pointless access_ok()
efi_test: get rid of pointless access_ok()
drm_read(): get rid of pointless access_ok()
via-pmu: don't bother with access_ok()
drivers/crypto/ccp/sev-dev.c: get rid of pointless access_ok()
omapfb: get rid of pointless access_ok() calls
amifb: get rid of pointless access_ok() calls
drivers/fpga/dfl-afu-dma-region.c: get rid of pointless access_ok()
drivers/fpga/dfl-fme-pr.c: get rid of pointless access_ok()
cm4000_cs.c cmm_ioctl(): get rid of pointless access_ok()
nvram: drop useless access_ok()
n_hdlc_tty_read(): remove pointless access_ok()
tomoyo_write_control(): get rid of pointless access_ok()
btrfs_ioctl_send(): don't bother with access_ok()
fat_dir_ioctl(): hadn't needed that access_ok() for more than a decade...
dlmfs_file_write(): get rid of pointless access_ok()

Linus Torvalds
2020-06-02 07:09:43 +0800
a7092c820 Merge tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf updates from Ingo Molnar:
"Kernel side changes:

- Add AMD Fam17h RAPL support

- Introduce CAP_PERFMON to kernel and user space

- Add Zhaoxin CPU support

- Misc fixes and cleanups

Tooling changes:

- perf record:

Introduce '--switch-output-event' to use arbitrary events to be
setup and read from a side band thread and, when they take place a
signal be sent to the main 'perf record' thread, reusing the core
for '--switch-output' to take perf.data snapshots from the ring
buffer used for '--overwrite', e.g.:

# perf record --overwrite -e sched:* \
--switch-output-event syscalls:*connect* \
workload

will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
connect syscalls.

Add '--num-synthesize-threads' option to control degree of
parallelism of the synthesize_mmap() code which is scanning
/proc/PID/task/PID/maps and can be time consuming. This mimics
pre-existing behaviour in 'perf top'.

- perf bench:

Add a multi-threaded synthesize benchmark and kallsyms parsing
benchmark.

- Intel PT support:

Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.

Allow using Intel PT to synthesize callchains for regular events.

Add support for synthesizing branch stacks for regular events
(cycles, instructions, etc) from Intel PT data.

Misc changes:

- Updated perf vendor events for power9 and Coresight.

- Add flamegraph.py script via 'perf flamegraph'

- Misc other changes, fixes and cleanups - see the Git log for details

Also, since over the last couple of years perf tooling has matured and
decoupled from the kernel perf changes to a large degree, going
forward Arnaldo is going to send perf tooling changes via direct pull
requests"

* tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (163 commits)
perf/x86/rapl: Add AMD Fam17h RAPL support
perf/x86/rapl: Make perf_probe_msr() more robust and flexible
perf/x86/rapl: Flip logic on default events visibility
perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs
perf/x86/rapl: Move RAPL support to common x86 code
perf/core: Replace zero-length array with flexible-array
perf/x86: Replace zero-length array with flexible-array
perf/x86/intel: Add more available bits for OFFCORE_RESPONSE of Intel Tremont
perf/x86/rapl: Add Ice Lake RAPL support
perf flamegraph: Use /bin/bash for report and record scripts
perf cs-etm: Move definition of 'traceid_list' global variable from header file
libsymbols kallsyms: Move hex2u64 out of header
libsymbols kallsyms: Parse using io api
perf bench: Add kallsyms parsing
perf: cs-etm: Update to build with latest opencsd version.
perf symbol: Fix kernel symbol address display
perf inject: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf annotate: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf trace: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf script: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
...

Linus Torvalds
2020-06-02 04:23:59 +0800
81e8c10da Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 ... Browse Code »

Pull crypto updates from Herbert Xu:
"API:
- Introduce crypto_shash_tfm_digest() and use it wherever possible.
- Fix use-after-free and race in crypto_spawn_alg.
- Add support for parallel and batch requests to crypto_engine.

Algorithms:
- Update jitter RNG for SP800-90B compliance.
- Always use jitter RNG as seed in drbg.

Drivers:
- Add Arm CryptoCell driver cctrng.
- Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
crypto: hisilicon - fix driver compatibility issue with different versions of devices
crypto: engine - do not requeue in case of fatal error
crypto: cavium/nitrox - Fix a typo in a comment
crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
crypto: hisilicon/qm - add debugfs to the QM state machine
crypto: hisilicon/qm - add debugfs for QM
crypto: stm32/crc32 - protect from concurrent accesses
crypto: stm32/crc32 - don't sleep in runtime pm
crypto: stm32/crc32 - fix multi-instance
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: hisilicon/zip - Use temporary sqe when doing work
crypto: hisilicon - add device error report through abnormal irq
crypto: hisilicon - remove codes of directly report device errors through MSI
crypto: hisilicon - QM memory management optimization
crypto: hisilicon - unify initial value assignment into QM
...

Linus Torvalds
2020-06-02 03:00:10 +0800

01 Jun, 2020

1 commit

1806c13dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller

David S. Miller
2020-06-01 08:48:46 +0800

30 May, 2020

2 commits

56305aa9b exec: Compute file based creds only once ... Browse Code »

Move the computation of creds from prepare_binfmt into begin_new_exec
so that the creds need only be computed once. This is just code
reorganization no semantic changes of any kind are made.

Moving the computation is safe. I have looked through the kernel and
verified none of the binfmts look at bprm->cred directly, and that
there are no helpers that look at bprm->cred indirectly. Which means
that it is not a problem to compute the bprm->cred later in the
execution flow as it is not used until it becomes current->cred.

A new function bprm_creds_from_file is added to contain the work that
needs to be done. bprm_creds_from_file first computes which file
bprm->executable or most likely bprm->file that the bprm->creds
will be computed from.

The funciton bprm_fill_uid is updated to receive the file instead of
accessing bprm->file. The now unnecessary work needed to reset the
bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
A small comment to document that bprm_fill_uid now only deals with the
work to handle suid and sgid files. The default case is already
heandled by prepare_exec_creds.

The function security_bprm_repopulate_creds is renamed
security_bprm_creds_from_file and now is explicitly passed the file
from which to compute the creds. The documentation of the
bprm_creds_from_file security hook is updated to explain when the hook
is called and what it needs to do. The file is passed from
cap_bprm_creds_from_file into get_file_caps so that the caps are
computed for the appropriate file. The now unnecessary work in
cap_bprm_creds_from_file to reset the ambient capabilites has been
removed. A small comment to document that the work of
cap_bprm_creds_from_file is to read capabilities from the files
secureity attribute and derive capabilities from the fact the
user had uid 0 has been added.

Reviewed-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-30 11:00:54 +0800
a7868323c exec: Add a per bprm->file version of per_clear ... Browse Code »

There is a small bug in the code that recomputes parts of bprm->cred
for every bprm->file. The code never recomputes the part of
clear_dangerous_personality_flags it is responsible for.

Which means that in practice if someone creates a sgid script
the interpreter will not be able to use any of:
READ_IMPLIES_EXEC
ADDR_NO_RANDOMIZE
ADDR_COMPAT_LAYOUT
MMAP_PAGE_ZERO.

This accentially clearing of personality flags probably does
not matter in practice because no one has complained
but it does make the code more difficult to understand.

Further remaining bug compatible prevents the recomputation from being
removed and replaced by simply computing bprm->cred once from the
final bprm->file.

Making this change removes the last behavior difference between
computing bprm->creds from the final file and recomputing
bprm->cred several times. Which allows this behavior change
to be justified for it's own reasons, and for any but hunts
looking into why the behavior changed to wind up here instead
of in the code that will follow that computes bprm->cred
from the final bprm->file.

This small logic bug appears to have existed since the code
started clearing dangerous personality bits.

History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
Reviewed-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-30 10:06:48 +0800

29 May, 2020

1 commit

00fca6b53 tomoyo_write_control(): get rid of pointless access_ok() ... Browse Code »

address is passed only to get_user()

Signed-off-by: Al Viro

Al Viro
2020-05-29 23:02:29 +0800

28 May, 2020

4 commits

0bffedbce Merge tag 'v5.7-rc7' into perf/core, to pick up fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2020-05-28 13:58:12 +0800
e32f88790 Merge commit a4ae32c71fe9 ("exec: Always set cap_ambient in cap_bprm_set_creds") ... Browse Code »

This is a bug fix and one of two places where I have found that the
result of calling security_bprm_repopulate_creds more than once on
different bprm->files depends on all of the bprm->files not just the
file bprm->file.

I intend to fix both of those cases and then modify the code to
only call security_bprm_repopulate_creds on the final bprm file.

So merge this change in so I hopefully reduce conflicts for others
and I make it possible to build on top of this change.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-28 11:37:33 +0800
3301f6ae2 Merge branch 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup fixes from Tejun Heo:

- Reverted stricter synchronization for cgroup recursive stats which
was prepping it for event counter usage which never got merged. The
change was causing performation regressions in some cases.

- Restore bpf-based device-cgroup operation even when cgroup1 device
cgroup is disabled.

- An out-param init fix.

* 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
device_cgroup: Cleanup cgroup eBPF device filter code
xattr: fix uninitialized out-param
Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"

Linus Torvalds
2020-05-28 01:58:19 +0800
006f38a1c Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull execve fix from Eric Biederman:
"While working on my exec cleanups I found a bug in exec that winds up
miscomputing the ambient credentials during exec. Andy appears to have
to been confused as to why credentials are computed for both the
script and the interpreter

From the original patch description:

[3] Linux very confusingly processes both the script and the
interpreter if applicable, for reasons that elude me. The results
from thinking about a script's file capabilities and/or setuid
bits are mostly discarded.

The only value in struct cred that gets changed in cap_bprm_set_creds
that I could find that might persist between the script and the
interpreter was cap_ambient. Which is fixed with this trivial change"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
exec: Always set cap_ambient in cap_bprm_set_creds

Linus Torvalds
2020-05-28 00:53:25 +0800

27 May, 2020

1 commit

a4ae32c71 exec: Always set cap_ambient in cap_bprm_set_creds ... Browse Code »

An invariant of cap_bprm_set_creds is that every field in the new cred
structure that cap_bprm_set_creds might set, needs to be set every
time to ensure the fields does not get a stale value.

The field cap_ambient is not set every time cap_bprm_set_creds is
called, which means that if there is a suid or sgid script with an
interpreter that has neither the suid nor the sgid bits set the
interpreter should be able to accept ambient credentials.
Unfortuantely because cap_ambient is not reset to it's original value
the interpreter can not accept ambient credentials.

Given that the ambient capability set is expected to be controlled by
the caller, I don't think this is particularly serious. But it is
definitely worth fixing so the code works correctly.

I have tested to verify my reading of the code is correct and the
interpreter of a sgid can receive ambient capabilities with this
change and cannot receive ambient capabilities without this change.

Cc: stable@vger.kernel.org
Cc: Andy Lutomirski
Fixes: 58319057b784 ("capabilities: ambient capabilities")
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-27 02:11:00 +0800

25 May, 2020

1 commit

13209a8f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller

David S. Miller
2020-05-25 04:47:27 +0800

24 May, 2020

1 commit

caffb99b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Pull networking fixes from David Miller:

1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
Bhowmik.

2) Nexthop attributes aren't being checked properly because of
mis-initialized iterator, from David Ahern.

3) Revert iop_idents_reserve() change as it caused performance
regressions and was just working around what is really a UBSAN bug
in the compiler. From Yuqi Jin.

4) Read MAC address properly from ROM in bmac driver (double iteration
proceeds past end of address array), from Jeremy Kerr.

5) Add Microsoft Surface device IDs to r8152, from Marc Payne.

6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
Boris Sukholitko.

7) Fix ACK discard behavior in rxrpc, from David Howells.

8) Preserve flow hash across packet scrubbing in wireguard, from Jason
A. Donenfeld.

9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
Dumazet.

10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.

11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.

12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.

13) Fix use after free in mlxsw driver, from Jiri Pirko.

14) Order kTLS key destruction properly in mlx5 driver, from Tariq
Toukan.

15) Check devm_platform_ioremap_resource() return value properly in
several drivers, from Tiezhu Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
net: smsc911x: Fix runtime PM imbalance on error
net/mlx4_core: fix a memory leak bug.
net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
net: phy: mscc: fix initialization of the MACsec protocol mode
net: stmmac: don't attach interface until resume finishes
net: Fix return value about devm_platform_ioremap_resource()
net/mlx5: Fix error flow in case of function_setup failure
net/mlx5e: CT: Correctly get flow rule
net/mlx5e: Update netdev txq on completions during closure
net/mlx5: Annotate mutex destroy for root ns
net/mlx5: Don't maintain a case of del_sw_func being null
net/mlx5: Fix cleaning unmanaged flow tables
net/mlx5: Fix memory leak in mlx5_events_init
net/mlx5e: Fix inner tirs handling
net/mlx5e: kTLS, Destroy key object after destroying the TIS
net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
net/mlx5: Avoid processing commands before cmdif is ready
net/mlx5: Fix a race when moving command interface to events mode
net/mlx5: Add command entry handling completion
rxrpc: Fix a memory leak in rxkad_verify_response()
...

Linus Torvalds
2020-05-24 08:16:18 +0800

23 May, 2020

1 commit

8eb613c0b ima: verify mprotect change is consistent with mmap policy ... Browse Code »

Files can be mmap'ed read/write and later changed to execute to circumvent
IMA's mmap appraise policy rules. Due to locking issues (mmap semaphore
would be taken prior to i_mutex), files can not be measured or appraised at
this point. Eliminate this integrity gap, by denying the mprotect
PROT_EXECUTE change, if an mmap appraise policy rule exists.

On mprotect change success, return 0. On failure, return -EACESS.

Reviewed-by: Lakshmi Ramasubramanian
Signed-off-by: Mimi Zohar

Mimi Zohar
2020-05-23 02:41:04 +0800

22 May, 2020

3 commits

c54d481d7 apparmor: Fix use-after-free in aa_audit_rule_init ... Browse Code »

In the implementation of aa_audit_rule_init(), when aa_label_parse()
fails the allocated memory for rule is released using
aa_audit_rule_free(). But after this release, the return statement
tries to access the label field of the rule which results in
use-after-free. Before releasing the rule, copy errNo and return it
after release.

Fixes: 52e8c38001d8 ("apparmor: Fix memory leak of rule on error exit path")
Signed-off-by: Navid Emamdoost
Signed-off-by: John Johansen

Navid Emamdoost
2020-05-22 06:25:51 +0800
c6b39f070 apparmor: Fix aa_label refcnt leak in policy_update ... Browse Code »

policy_update() invokes begin_current_label_crit_section(), which
returns a reference of the updated aa_label object to "label" with
increased refcount.

When policy_update() returns, "label" becomes invalid, so the refcount
should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
policy_update(). When aa_may_manage_policy() returns not NULL, the
refcnt increased by begin_current_label_crit_section() is not decreased,
causing a refcnt leak.

Fix this issue by jumping to "end_section" label when
aa_may_manage_policy() returns not NULL.

Fixes: 5ac8c355ae00 ("apparmor: allow introspecting the loaded policy pre internal transform")
Signed-off-by: Xiyu Yang
Signed-off-by: Xin Tan
Signed-off-by: John Johansen

Xiyu Yang
2020-05-22 06:25:51 +0800
a0b845ffa apparmor: fix potential label refcnt leak in aa_change_profile ... Browse Code »

aa_change_profile() invokes aa_get_current_label(), which returns
a reference of the current task's label.

According to the comment of aa_get_current_label(), the returned
reference must be put with aa_put_label().
However, when the original object pointed by "label" becomes
unreachable because aa_change_profile() returns or a new object
is assigned to "label", reference count increased by
aa_get_current_label() is not decreased, causing a refcnt leak.

Fix this by calling aa_put_label() before aa_change_profile() return
and dropping unnecessary aa_get_current_label().

Fixes: 9fcf78cca198 ("apparmor: update domain transitions that are subsets of confinement at nnp")
Signed-off-by: Xiyu Yang
Signed-off-by: Xin Tan
Signed-off-by: John Johansen

Xiyu Yang
2020-05-22 06:25:51 +0800

21 May, 2020

3 commits

112b71475 exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds ... Browse Code »

Rename bprm->cap_elevated to bprm->active_secureexec and initialize it
in prepare_binprm instead of in cap_bprm_set_creds. Initializing
bprm->active_secureexec in prepare_binprm allows multiple
implementations of security_bprm_repopulate_creds to play nicely with
each other.

Rename security_bprm_set_creds to security_bprm_reopulate_creds to
emphasize that this path recomputes part of bprm->cred. This
recomputation avoids the time of check vs time of use problems that
are inherent in unix #! interpreters.

In short two renames and a move in the location of initializing
bprm->active_secureexec.

Link: https://lkml.kernel.org/r/87o8qkzrxp.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds
Reviewed-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-21 23:16:50 +0800
0550cfe8c security: Fix hook iteration for secid_to_secctx ... Browse Code »

secid_to_secctx is not stackable, and since the BPF LSM registers this
hook by default, the call_int_hook logic is not suitable which
"bails-on-fail" and casues issues when other LSMs register this hook and
eventually breaks Audit.

In order to fix this, directly iterate over the security hooks instead
of using call_int_hook as suggested in:

https: //lore.kernel.org/bpf/9d0eb6c6-803a-ff3a-5603-9ad6d9edfc00@schaufler-ca.com/#t

Fixes: 98e828a0650f ("security: Refactor declaration of LSM hooks")
Fixes: 625236ba3832 ("security: Fix the default value of secid_to_secctx hook")
Reported-by: Alexei Starovoitov
Signed-off-by: KP Singh
Signed-off-by: Alexei Starovoitov
Acked-by: James Morris
Link: https://lore.kernel.org/bpf/20200520125616.193765-1-kpsingh@chromium.org

KP Singh
2020-05-21 11:12:07 +0800
b8bff5992 exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds ... Browse Code »

Today security_bprm_set_creds has several implementations:
apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
smack_bprm_set_creds, and tomoyo_bprm_set_creds.

Except for cap_bprm_set_creds they all test bprm->called_set_creds and
return immediately if it is true. The function cap_bprm_set_creds
ignores bprm->calld_sed_creds entirely.

Create a new LSM hook security_bprm_creds_for_exec that is called just
before prepare_binprm in __do_execve_file, resulting in a LSM hook
that is called exactly once for the entire of exec. Modify the bits
of security_bprm_set_creds that only want to be called once per exec
into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
behind.

Remove bprm->called_set_creds all of it's former users have been moved
to security_bprm_creds_for_exec.

Add or upate comments a appropriate to bring them up to date and
to reflect this change.

Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds
Acked-by: Casey Schaufler # For the LSM and Smack bits
Reviewed-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-21 03:45:31 +0800

19 May, 2020

2 commits

9d78edeae proc: proc_pid_ns takes super_block as an argument ... Browse Code »

syzbot found that

touch /proc/testfile

causes NULL pointer dereference at tomoyo_get_local_path()
because inode of the dentry is NULL.

Before c59f415a7cb6, Tomoyo received pid_ns from proc's s_fs_info
directly. Since proc_pid_ns() can only work with inode, using it in
the tomoyo_get_local_path() was wrong.

To avoid creating more functions for getting proc_ns, change the
argument type of the proc_pid_ns() function. Then, Tomoyo can use
the existing super_block to get pid_ns.

Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
Signed-off-by: Alexey Gladkov
Signed-off-by: Eric W. Biederman

Alexey Gladkov
2020-05-19 20:07:50 +0800
642b151f4 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity ... Browse Code »

Pull integrity fixes from Mimi Zohar:
"A couple of miscellaneous bug fixes for the integrity subsystem:

IMA:

- Properly modify the open flags in order to calculate the file hash.

- On systems requiring the IMA policy to be signed, the policy is
loaded differently. Don't differentiate between "enforce" and
either "log" or "fix" modes how the policy is loaded.

EVM:

- Two patches to fix an EVM race condition, normally the result of
attempting to load an unsupported hash algorithm.

- Use the lockless RCU version for walking an append only list"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
evm: Fix a small race in init_desc()
evm: Fix RCU list related warnings
ima: Fix return value of ima_write_policy()
evm: Check also if *tfm is an error pointer in init_desc()
ima: Set file->f_mode instead of file->f_flags in ima_calc_file_hash()

Linus Torvalds
2020-05-19 02:29:21 +0800

15 May, 2020

1 commit

a17b53c4a bpf, capability: Introduce CAP_BPF ... Browse Code »

Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.

The end result provides simple safety model for applications that use BPF:
- to load tracing program types
BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
use CAP_BPF and CAP_PERFMON
- to load networking program types
BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
use CAP_BPF and CAP_NET_ADMIN

There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
tracing mechanism, hence this helper needs additional CAP_PERFMON
if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.

CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
hence CAP_NET_ADMIN is required at detach time.

CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.

Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.

CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.

bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.

CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.

Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.

Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.

Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com

Alexei Starovoitov
2020-05-15 23:29:41 +0800