02 Nov, 2017

1 commit

  • Many user space API headers are missing licensing information, which
    makes it hard for compliance tools to determine the correct license.

    By default are files without license information under the default
    license of the kernel, which is GPLV2. Marking them GPLV2 would exclude
    them from being included in non GPLV2 code, which is obviously not
    intended. The user space API headers fall under the syscall exception
    which is in the kernels COPYING file:

    NOTE! This copyright does *not* cover user programs that use kernel
    services by normal system calls - this is merely considered normal use
    of the kernel, and does *not* fall under the heading of "derived work".

    otherwise syscall usage would not be possible.

    Update the files which contain no license information with an SPDX
    license identifier. The chosen identifier is 'GPL-2.0 WITH
    Linux-syscall-note' which is the officially assigned identifier for the
    Linux syscall exception. SPDX license identifiers are a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne. See the previous patch in this series for the
    methodology of how this patch was researched.

    Reviewed-by: Kate Stewart
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Sep, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "Life has been busy and I have not gotten half as much done this round
    as I would have liked. I delayed it so that a minor conflict
    resolution with the mips tree could spend a little time in linux-next
    before I sent this pull request.

    This includes two long delayed user namespace changes from Kirill
    Tkhai. It also includes a very useful change from Serge Hallyn that
    allows the security capability attribute to be used inside of user
    namespaces. The practical effect of this is people can now untar
    tarballs and install rpms in user namespaces. It had been suggested to
    generalize this and encode some of the namespace information
    information in the xattr name. Upon close inspection that makes the
    things that should be hard easy and the things that should be easy
    more expensive.

    Then there is my bugfix/cleanup for signal injection that removes the
    magic encoding of the siginfo union member from the kernel internal
    si_code. The mips folks reported the case where I had used FPE_FIXME
    me is impossible so I have remove FPE_FIXME from mips, while at the
    same time including a return statement in that case to keep gcc from
    complaining about unitialized variables.

    I almost finished the work to get make copy_siginfo_to_user a trivial
    copy to user. The code is available at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

    But I did not have time/energy to get the code posted and reviewed
    before the merge window opened.

    I was able to see that the security excuse for just copying fields
    that we know are initialized doesn't work in practice there are buggy
    initializations that don't initialize the proper fields in siginfo. So
    we still sometimes copy unitialized data to userspace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Introduce v3 namespaced file capabilities
    mips/signal: In force_fcr31_sig return in the impossible case
    signal: Remove kernel interal si_code magic
    fcntl: Don't use ambiguous SIG_POLL si_codes
    prctl: Allow local CAP_SYS_ADMIN changing exe_file
    security: Use user_namespace::level to avoid redundant iterations in cap_capable()
    userns,pidns: Verify the userns for new pid namespaces
    signal/testing: Don't look for __SI_FAULT in userspace
    signal/mips: Document a conflict with SI_USER with SIGFPE
    signal/sparc: Document a conflict with SI_USER with SIGFPE
    signal/ia64: Document a conflict with SI_USER with SIGFPE
    signal/alpha: Document a conflict with SI_USER for SIGTRAP

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-inialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A non-default huge page size can be encoded in the flags argument of the
    mmap system call. The definitions for these encodings are in arch
    specific header files. However, all architectures use the same values.

    Consolidate all the definitions in the primary user header file
    (uapi/linux/mman.h). Include definitions for all known huge page sizes.
    Use the generic encoding definitions in hugetlb_encode.h as the basis
    for these definitions.

    Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Matthew Wilcox
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Patch series "Consolidate system call hugetlb page size encodings".

    These patches are the result of discussions in
    https://lkml.org/lkml/2017/3/8/548. The following changes are made in the
    patch set:

    1) Put all the log2 encoded huge page size definitions in a common
    header file. The idea is have a set of definitions that can be use as
    the basis for system call specific definitions such as MAP_HUGE_* and
    SHM_HUGE_*.

    2) Remove MAP_HUGE_* definitions in arch specific files. All these
    definitions are the same. Consolidate all definitions in the primary
    user header file (uapi/linux/mman.h).

    3) Remove SHM_HUGE_* definitions intended for user space from kernel
    header file, and add to user (uapi/linux/shm.h) header file. Add
    definitions for all known huge page size encodings as in mmap.

    This patch (of 3):

    If hugetlb pages are requested in mmap or shmget system calls, a huge
    page size other than default can be requested. This is accomplished by
    encoding the log2 of the huge page size in the upper bits of the flag
    argument. asm-generic and arch specific headers all define the same
    values for these encodings.

    Put common definitions in a single header file. The primary uapi header
    files for mmap and shm will use these definitions as a basis for
    definitions specific to those system calls.

    Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Cc: Davidlohr Bueso
    Cc: Anshuman Khandual
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

04 Aug, 2017

1 commit

  • The send call ignores unknown flags. Legacy applications may already
    unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
    socket opts in to zerocopy.

    Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
    Processes can also query this socket option to detect kernel support
    for the feature. Older kernels will return ENOPROTOOPT.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

25 Jul, 2017

2 commits

  • struct siginfo is a union and the kernel since 2.4 has been hiding a union
    tag in the high 16bits of si_code using the values:
    __SI_KILL
    __SI_TIMER
    __SI_POLL
    __SI_FAULT
    __SI_CHLD
    __SI_RT
    __SI_MESGQ
    __SI_SYS

    While this looks plausible on the surface, in practice this situation has
    not worked well.

    - Injected positive signals are not copied to user space properly
    unless they have these magic high bits set.

    - Injected positive signals are not reported properly by signalfd
    unless they have these magic high bits set.

    - These kernel internal values leaked to userspace via ptrace_peek_siginfo

    - It was possible to inject these kernel internal values and cause the
    the kernel to misbehave.

    - Kernel developers got confused and expected these kernel internal values
    in userspace in kernel self tests.

    - Kernel developers got confused and set si_code to __SI_FAULT which
    is SI_USER in userspace which causes userspace to think an ordinary user
    sent the signal and that it was not kernel generated.

    - The values make it impossible to reorganize the code to transform
    siginfo_copy_to_user into a plain copy_to_user. As si_code must
    be massaged before being passed to userspace.

    So remove these kernel internal si codes and make the kernel code simpler
    and more maintainable.

    To replace these kernel internal magic si_codes introduce the helper
    function siginfo_layout, that takes a signal number and an si_code and
    computes which union member of siginfo is being used. Have
    siginfo_layout return an enumeration so that gcc will have enough
    information to warn if a switch statement does not handle all of union
    members.

    A couple of architectures have a messed up ABI that defines signal
    specific duplications of SI_USER which causes more special cases in
    siginfo_layout than I would like. The good news is only problem
    architectures pay the cost.

    Update all of the code that used the previous magic __SI_ values to
    use the new SIL_ values and to call siginfo_layout to get those
    values. Escept where not all of the cases are handled remove the
    defaults in the switch statements so that if a new case is missed in
    the future the lack will show up at compile time.

    Modify the code that copies siginfo si_code to userspace to just copy
    the value and not cast si_code to a short first. The high bits are no
    longer used to hold a magic union member.

    Fixup the siginfo header files to stop including the __SI_ values in
    their constants and for the headers that were missing it to properly
    update the number of si_codes for each signal type.

    The fixes to copy_siginfo_from_user32 implementations has the
    interesting property that several of them perviously should never have
    worked as the __SI_ values they depended up where kernel internal.
    With that dependency gone those implementations should work much
    better.

    The idea of not passing the __SI_ values out to userspace and then
    not reinserting them has been tested with criu and criu worked without
    changes.

    Ref: 2.4.0-test1
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • We have a weird and problematic intersection of features that when
    they all come together result in ambiguous siginfo values, that
    we can not support properly.

    - Supporting fcntl(F_SETSIG,...) with arbitrary valid signals.

    - Using positive values for POLL_IN, POLL_OUT, POLL_MSG, ..., etc
    that imply they are signal specific si_codes and using the
    aforementioned arbitrary signal to deliver them.

    - Supporting injection of arbitrary siginfo values for debugging and
    checkpoint/restore.

    The result is that just looking at siginfo si_codes of 1 to 6 are
    ambigious. It could either be a signal specific si_code or it could
    be a generic si_code.

    For most of the kernel this is a non-issue but for sending signals
    with siginfo it is impossible to play back the kernel signals and
    get the same result.

    Strictly speaking when the si_code was changed from SI_SIGIO to
    POLL_IN and friends between 2.2 and 2.4 this functionality was not
    ambiguous, as only real time signals were supported. Before 2.4 was
    released the kernel began supporting siginfo with non realtime signals
    so they could give details of why the signal was sent.

    The result is that if F_SETSIG is set to one of the signals with signal
    specific si_codes then user space can not know why the signal was sent.

    I grepped through a bunch of userspace programs using debian code
    search to get a feel for how often people choose a signal that results
    in an ambiguous si_code. I only found one program doing so and it was
    using SIGCHLD to test the F_SETSIG functionality, and did not appear
    to be a real world usage.

    Therefore the ambiguity does not appears to be a real world problem in
    practice. Remove the ambiguity while introducing the smallest chance
    of breakage by changing the si_code to SI_SIGIO when signals with
    signal specific si_codes are targeted.

    Fixes: v2.3.40 -- Added support for queueing non-rt signals
    Fixes: v2.3.21 -- Changed the si_code from SI_SIGIO
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

17 Jul, 2017

1 commit

  • This ioctl does nothing to justify an _IOC_READ or _IOC_WRITE flag
    because it doesn't copy anything from/to userspace to access the
    argument.

    Fixes: 54ebbfb16034 ("tty: add TIOCGPTPEER ioctl")
    Signed-off-by: Gleb Fotengauer-Malinovskiy
    Acked-by: Aleksa Sarai
    Acked-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Gleb Fotengauer-Malinovskiy
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support mutliple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull tty/serial updates from Greg KH:
    "Here is the large tty/serial patchset for 4.13-rc1.

    A lot of tty and serial driver updates are in here, along with some
    fixups for some __get/put_user usages that were reported. Nothing
    huge, just lots of development by a number of different developers,
    full details in the shortlog.

    All of these have been in linux-next for a while"

    * tag 'tty-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (71 commits)
    tty: serial: lpuart: add a more accurate baud rate calculation method
    tty: serial: lpuart: add earlycon support for imx7ulp
    tty: serial: lpuart: add imx7ulp support
    dt-bindings: serial: fsl-lpuart: add i.MX7ULP support
    tty: serial: lpuart: add little endian 32 bit register support
    tty: serial: lpuart: refactor lpuart32_{read|write} prototype
    tty: serial: lpuart: introduce lpuart_soc_data to represent SoC property
    serial: imx-serial - move DMA buffer configuration to DT
    serial: imx: Enable RTSD only when needed
    serial: imx: Remove unused members from imx_port struct
    serial: 8250: 8250_omap: Fix race b/w dma completion and RX timeout
    serial: 8250: Fix THRE flag usage for CAP_MINI
    tty/serial: meson_uart: update to stable bindings
    dt-bindings: serial: Add bindings for the Amlogic Meson UARTs
    serial: Delete dead code for CIR serial ports
    serial: sirf: make of_device_ids const
    serial/mpsc: switch to dma_alloc_attrs
    tty: serial: Add Actions Semi Owl UART earlycon
    dt-bindings: serial: Document Actions Semi Owl UARTs
    tty/serial: atmel: make the driver DT only
    ...

    Linus Torvalds
     

21 Jun, 2017

1 commit

  • This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
    retrieve the auxiliary groups of the remote peer. It is designed to
    naturally extend SO_PEERCRED. That is, the underlying data is from the
    same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
    is, if the provided buffer is too small, ERANGE is returned and @optlen
    is updated. Otherwise, the information is copied, @optlen is set to the
    actual size, and 0 is returned.

    While SO_PEERCRED (and thus `struct ucred') already returns the primary
    group, it lacks the auxiliary group vector. However, nearly all access
    controls (including kernel side VFS and SYSVIPC, but also user-space
    polkit, DBus, ...) consider the entire set of groups, rather than just
    the primary group. But this is currently not possible with pure
    SO_PEERCRED. Instead, user-space has to work around this and query the
    system database for the auxiliary groups of a UID retrieved via
    SO_PEERCRED.

    Unfortunately, there is no race-free way to query the auxiliary groups
    of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
    solution is to use getgrouplist(3p), which itself falls back to NSS and
    whatever is configured in nsswitch.conf(3). This effectively checks
    which groups we *would* assign to the user if it logged in *now*. On
    normal systems it is as easy as reading /etc/group, but with NSS it can
    resort to quering network databases (eg., LDAP), using IPC or network
    communication.

    Long story short: Whenever we want to use auxiliary groups for access
    checks on IPC, we need further IPC to talk to the user/group databases,
    rather than just relying on SO_PEERCRED and the incoming socket. This
    is unfortunate, and might even result in dead-locks if the database
    query uses the same IPC as the original request.

    So far, those recursions / dead-locks have been avoided by using
    primitive IPC for all crucial NSS modules. However, we want to avoid
    re-inventing the wheel for each NSS module that might be involved in
    user/group queries. Hence, we would preferably make DBus (and other IPC
    that supports access-management based on groups) work without resorting
    to the user/group database. This new SO_PEERGROUPS ioctl would allow us
    to make dbus-daemon work without ever calling into NSS.

    Cc: Michal Sekletar
    Cc: Simon McVittie
    Reviewed-by: Tom Gundersen
    Signed-off-by: David Herrmann
    Signed-off-by: David S. Miller

    David Herrmann
     

09 Jun, 2017

1 commit

  • When opening the slave end of a PTY, it is not possible for userspace to
    safely ensure that /dev/pts/$num is actually a slave (in cases where the
    mount namespace in which devpts was mounted is controlled by an
    untrusted process). In addition, there are several unresolvable
    race conditions if userspace were to attempt to detect attacks through
    stat(2) and other similar methods [in addition it is not clear how
    userspace could detect attacks involving FUSE].

    Resolve this by providing an interface for userpace to safely open the
    "peer" end of a PTY file descriptor by using the dentry cached by
    devpts. Since it is not possible to have an open master PTY without
    having its slave exposed in /dev/pts this interface is safe. This
    interface currently does not provide a way to get the master pty (since
    it is not clear whether such an interface is safe or even useful).

    Cc: Christian Brauner
    Cc: Valentin Rothberg
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Greg Kroah-Hartman

    Aleksa Sarai
     

04 Jun, 2017

1 commit

  • By moving the kernel side __SI_* defintions right next to the userspace
    ones we can kill the non-uapi versions of include
    include/asm-generic/siginfo.h and untangle the unholy mess of includes.

    [ tglx: Removed uapi/asm/siginfo.h from m32r, microblaze, mn10300 and score ]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: linux-ia64@vger.kernel.org
    Cc: Arnd Bergmann
    Cc: sparclinux@vger.kernel.org
    Cc: "David S. Miller"
    Link: http://lkml.kernel.org/r/20170603190102.28866-6-hch@lst.de

    Christoph Hellwig
     

22 May, 2017

1 commit

  • Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
    for incoming packets with hardware timestamps. It contains the index of
    the real interface which received the packet and the length of the
    packet at layer 2.

    The index is useful with bonding, bridges and other interfaces, where
    IP_PKTINFO doesn't allow applications to determine which PHC made the
    timestamp. With the L2 length (and link speed) it is possible to
    transpose preamble timestamps to trailer timestamps, which are used in
    the NTP protocol.

    While this information could be provided by two new socket options
    independently from timestamping, it doesn't look like they would be very
    useful. With this option any performance impact is limited to hardware
    timestamping.

    Use dev_get_by_napi_id() to get the device and its index. On kernels
    with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero
    index will be returned in the control message.

    CC: Richard Cochran
    Acked-by: Willem de Bruijn
    Signed-off-by: Miroslav Lichvar
    Signed-off-by: David S. Miller

    Miroslav Lichvar
     

10 May, 2017

1 commit

  • Regularly, when a new header is created in include/uapi/, the developer
    forgets to add it in the corresponding Kbuild file. This error is usually
    detected after the release is out.

    In fact, all headers under uapi directories should be exported, thus it's
    useless to have an exhaustive list.

    After this patch, the following files, which were not exported, are now
    exported (with make headers_install_all):
    asm-arc/kvm_para.h
    asm-arc/ucontext.h
    asm-blackfin/shmparam.h
    asm-blackfin/ucontext.h
    asm-c6x/shmparam.h
    asm-c6x/ucontext.h
    asm-cris/kvm_para.h
    asm-h8300/shmparam.h
    asm-h8300/ucontext.h
    asm-hexagon/shmparam.h
    asm-m32r/kvm_para.h
    asm-m68k/kvm_para.h
    asm-m68k/shmparam.h
    asm-metag/kvm_para.h
    asm-metag/shmparam.h
    asm-metag/ucontext.h
    asm-mips/hwcap.h
    asm-mips/reg.h
    asm-mips/ucontext.h
    asm-nios2/kvm_para.h
    asm-nios2/ucontext.h
    asm-openrisc/shmparam.h
    asm-parisc/kvm_para.h
    asm-powerpc/perf_regs.h
    asm-sh/kvm_para.h
    asm-sh/ucontext.h
    asm-tile/shmparam.h
    asm-unicore32/shmparam.h
    asm-unicore32/ucontext.h
    asm-x86/hwcap2.h
    asm-xtensa/kvm_para.h
    drm/armada_drm.h
    drm/etnaviv_drm.h
    drm/vgem_drm.h
    linux/aspeed-lpc-ctrl.h
    linux/auto_dev-ioctl.h
    linux/bcache.h
    linux/btrfs_tree.h
    linux/can/vxcan.h
    linux/cifs/cifs_mount.h
    linux/coresight-stm.h
    linux/cryptouser.h
    linux/fsmap.h
    linux/genwqe/genwqe_card.h
    linux/hash_info.h
    linux/kcm.h
    linux/kcov.h
    linux/kfd_ioctl.h
    linux/lightnvm.h
    linux/module.h
    linux/nbd-netlink.h
    linux/nilfs2_api.h
    linux/nilfs2_ondisk.h
    linux/nsfs.h
    linux/pr.h
    linux/qrtr.h
    linux/rpmsg.h
    linux/sched/types.h
    linux/sed-opal.h
    linux/smc.h
    linux/smc_diag.h
    linux/stm.h
    linux/switchtec_ioctl.h
    linux/vfio_ccw.h
    linux/wil6210_uapi.h
    rdma/bnxt_re-abi.h

    Note that I have removed from this list the files which are generated in every
    exported directories (like .install or .install.cmd).

    Thanks to Julien Floret for the tip to get all
    subdirs with a pure makefile command.

    For the record, note that exported files for asm directories are a mix of
    files listed by:
    - include/uapi/asm-generic/Kbuild.asm;
    - arch//include/uapi/asm/Kbuild;
    - arch//include/asm/Kbuild.

    Signed-off-by: Nicolas Dichtel
    Acked-by: Daniel Vetter
    Acked-by: Russell King
    Acked-by: Mark Salter
    Acked-by: Michael Ellerman (powerpc)
    Signed-off-by: Masahiro Yamada

    Nicolas Dichtel
     

03 May, 2017

1 commit

  • Pull networking updates from David Millar:
    "Here are some highlights from the 2065 networking commits that
    happened this development cycle:

    1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

    2) Add a generic XDP driver, so that anyone can test XDP even if they
    lack a networking device whose driver has explicit XDP support
    (me).

    3) Sparc64 now has an eBPF JIT too (me)

    4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
    Starovoitov)

    5) Make netfitler network namespace teardown less expensive (Florian
    Westphal)

    6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

    7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

    8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

    9) Multiqueue support in stmmac driver (Joao Pinto)

    10) Remove TCP timewait recycling, it never really could possibly work
    well in the real world and timestamp randomization really zaps any
    hint of usability this feature had (Soheil Hassas Yeganeh)

    11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
    Aleksandrov)

    12) Add socket busy poll support to epoll (Sridhar Samudrala)

    13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
    and several others)

    14) IPSEC hw offload infrastructure (Steffen Klassert)"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
    tipc: refactor function tipc_sk_recv_stream()
    tipc: refactor function tipc_sk_recvmsg()
    net: thunderx: Optimize page recycling for XDP
    net: thunderx: Support for XDP header adjustment
    net: thunderx: Add support for XDP_TX
    net: thunderx: Add support for XDP_DROP
    net: thunderx: Add basic XDP support
    net: thunderx: Cleanup receive buffer allocation
    net: thunderx: Optimize CQE_TX handling
    net: thunderx: Optimize RBDR descriptor handling
    net: thunderx: Support for page recycling
    ipx: call ipxitf_put() in ioctl error path
    net: sched: add helpers to handle extended actions
    qed*: Fix issues in the ptp filter config implementation.
    qede: Fix concurrency issue in PTP Tx path processing.
    stmmac: Add support for SIMATIC IOT2000 platform
    net: hns: fix ethtool_get_strings overflow in hns driver
    tcp: fix wraparound issue in tcp_lp
    bpf, arm64: fix jit branch offset related to ldimm64
    bpf, arm64: implement jiting of BPF_XADD
    ...

    Linus Torvalds
     

18 Apr, 2017

1 commit

  • Unlike normal compat syscall variants, it is needed only for
    biarch architectures that have different alignement requirements for
    u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
    a store of 64bit value at 32bit-aligned address. We used to have one
    such (ia64), but its biarch support has been gone since 2010 (after
    being broken in 2008, which went unnoticed since nobody had been using
    it).

    It had escaped removal at the same time only because back in 2004
    a patch that switched several syscalls on amd64 from private wrappers to
    generic compat ones had switched to use of compat_sys_getdents64(), which
    hadn't needed (or used) a compat wrapper on amd64.

    Let's bury it - it's at least 7 years overdue.

    Signed-off-by: Al Viro

    Al Viro
     

08 Apr, 2017

1 commit

  • Introduce a new getsockopt operation to retrieve the socket cookie
    for a specific socket based on the socket fd. It returns a unique
    non-decreasing cookie for each socket.
    Tested: https://android-review.googlesource.com/#/c/358163/

    Acked-by: Willem de Bruijn
    Signed-off-by: Chenbo Feng
    Signed-off-by: David S. Miller

    Chenbo Feng
     

06 Apr, 2017

1 commit


25 Mar, 2017

1 commit

  • This socket option returns the NAPI ID associated with the queue on which
    the last frame is received. This information can be used by the apps to
    split the incoming flows among the threads based on the Rx queue on which
    they are received.

    If the NAPI ID actually represents a sender_cpu then the value is ignored
    and 0 is returned.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

23 Mar, 2017

1 commit

  • Allows reading of SK_MEMINFO_VARS via socket option. This way an
    application can get all meminfo related information in single socket
    option call instead of multiple calls.

    Adds helper function, sk_get_meminfo(), and uses that for both
    getsockopt and sock_diag_put_meminfo().

    Suggested by Eric Dumazet.

    Signed-off-by: Josh Hunt
    Reviewed-by: Jason Baron
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Josh Hunt
     

20 Mar, 2017

1 commit

  • The new syscall statx is implemented as generic code, so enable it
    for architectures like openrisc which use the generic syscall table.

    Fixes: a528d35e8bfcc ("statx: Add a system call to make enhanced file info available")
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Howells
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Signed-off-by: Stafford Horne
    Signed-off-by: Will Deacon

    Stafford Horne
     

23 Feb, 2017

1 commit

  • Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2

    These userfaultfd features are finished and are ready for larger
    exposure in -mm and upstream merging.

    1) tmpfs non present userfault
    2) hugetlbfs non present userfault
    3) non cooperative userfault for fork/madvise/mremap

    qemu development code is already exercising 2) and container postcopy
    live migration needs 3).

    1) is not currently used but there's a self test and we know some qemu
    user for various reasons uses tmpfs as backing for KVM so it'll need it
    too to use postcopy live migration with tmpfs memory.

    All review feedback from the previous submit has been handled and the
    fixes are included. There's no outstanding issue AFIK.

    Upstream code just did a s/fe/vmf/ conversion in the page faults and
    this has been converted as well incrementally.

    In addition to the previous submits, this also wakes up stuck userfaults
    during UFFDIO_UNREGISTER. The non cooperative testcase actually
    reproduced this problem by getting stuck instead of quitting clean in
    some rare case as it could call UFFDIO_UNREGISTER while some userfault
    could be still in flight. The other option would have been to keep
    leaving it up to userland to serialize itself and to patch the testcase
    instead but the wakeup during unregister I think is preferable.

    David also asked the UFFD_FEATURE_MISSING_HUGETLBFS and
    UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid
    to probe if the hugetlbfs/shmem missing support is available by calling
    UFFDIO_REGISTER. QEMU already checks HUGETLBFS_MAGIC with fstatfs so if
    UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows UFFDIO_REGISTER
    will succeed (or if it fails, it's for some other more concerning
    reason). There's no reason to worry about adding too many feature
    flags. There are 64 available and worst case we've to bump the API if
    someday we're really going to run out of them.

    The round-trip network latency of hugetlbfs userfaults during postcopy
    live migration is still of the order of dozen milliseconds on 10GBit if
    at 2MB hugepage granularity so it's working perfectly and it should
    provide for higher bandwidth or lower CPU usage (which makes it
    interesting to add an option in the future to support THP granularity
    too for anonymous memory, UFFDIO_COPY would then have to create THP if
    alignment/len allows for it). 1GB hugetlbfs granularity will require
    big changes in hugetlbfs to work so it's deferred for later.

    This patch (of 42):

    This adds proper documentation (inline) to avoid the risk of further
    misunderstandings about the semantics of _IOW/_IOR and it also reminds
    whoever will bump the UFFDIO_API in the future, to change the two ioctl
    to _IOW.

    This was found while implementing strace support for those ioctl,
    otherwise we could have never found it by just reviewing kernel code and
    testing it.

    _IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
    it's only worth fixing if the UFFDIO_API is bumped someday.

    Link: http://lkml.kernel.org/r/20161216144821.5183-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: "Dmitry V. Levin"
    Cc: Michael Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

30 Nov, 2016

1 commit

  • This patch exports the sender chronograph stats via the socket
    SO_TIMESTAMPING channel. Currently we can instrument how long a
    particular application unit of data was queued in TCP by tracking
    SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
    these sender chronograph stats exported simultaneously along with
    these timestamps allow further breaking down the various sender
    limitation. For example, a video server can tell if a particular
    chunk of video on a connection takes a long time to deliver because
    TCP was experiencing small receive window. It is not possible to
    tell before this patch without packet traces.

    To prepare these stats, the user needs to set
    SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
    while requesting other SOF_TIMESTAMPING TX timestamps. When the
    timestamps are available in the error queue, the stats are returned
    in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
    in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
    TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.

    Signed-off-by: Francis Yan
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Francis Yan
     

18 Oct, 2016

1 commit

  • pkey_set() and pkey_get() were syscalls present in older versions
    of the protection keys patches. They were fully excised from the
    x86 code, but some cruft was left in the generic syscall code. The
    C++ comments were intended to help to make it more glaring to me to
    fix them before actually submitting them. That technique worked,
    but later than I would have liked.

    I test-compiled this for arm64.

    Fixes: a60f7b69d92c0 ("generic syscalls: Wire up memory protection keys syscalls")
    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: mgorman@techsingularity.net
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

09 Sep, 2016

2 commits

  • These new syscalls are implemented as generic code, so enable them for
    architectures like arm64 which use the generic syscall table.

    According to Arnd:

    Even if the support is x86 specific for the forseeable future, it may be
    good to reserve the number just in case. The other architecture specific
    syscall lists are usually left to the individual arch maintainers, most a
    lot of the newer architectures share this table.

    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: mgorman@techsingularity.net
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163018.505A6875@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • This patch adds two new system calls:

    int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
    int pkey_free(int pkey);

    These implement an "allocator" for the protection keys
    themselves, which can be thought of as analogous to the allocator
    that the kernel has for file descriptors. The kernel tracks
    which numbers are in use, and only allows operations on keys that
    are valid. A key which was not obtained by pkey_alloc() may not,
    for instance, be passed to pkey_mprotect().

    These system calls are also very important given the kernel's use
    of pkeys to implement execute-only support. These help ensure
    that userspace can never assume that it has control of a key
    unless it first asks the kernel. The kernel does not promise to
    preserve PKRU (right register) contents except for allocated
    pkeys.

    The 'init_access_rights' argument to pkey_alloc() specifies the
    rights that will be established for the returned pkey. For
    instance:

    pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

    will allocate 'pkey', but also sets the bits in PKRU[1] such that
    writing to 'pkey' is already denied.

    The kernel does not prevent pkey_free() from successfully freeing
    in-use pkeys (those still assigned to a memory range by
    pkey_mprotect()). It would be expensive to implement the checks
    for this, so we instead say, "Just don't do it" since sane
    software will never do it anyway.

    Any piece of userspace calling pkey_alloc() needs to be prepared
    for it to fail. Why? pkey_alloc() returns the same error code
    (ENOSPC) when there are no pkeys and when pkeys are unsupported.
    They can be unsupported for a whole host of reasons, so apps must
    be prepared for this. Also, libraries or LD_PRELOADs might steal
    keys before an application gets access to them.

    This allocation mechanism could be implemented in userspace.
    Even if we did it in userspace, we would still need additional
    user/kernel interfaces to tell userspace which keys are being
    used by the kernel internally (such as for execute-only
    mappings). Having the kernel provide this facility completely
    removes the need for these additional interfaces, or having an
    implementation of this in userspace at all.

    Note that we have to make changes to all of the architectures
    that do not use mman-common.h because we use the new
    PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

    1. PKRU is the Protection Key Rights User register. It is a
    usermode-accessible register that controls whether writes
    and/or access to each individual pkey is allowed or denied.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

05 May, 2016

2 commits

  • The newer renameat2 syscall provides all the functionality provided by
    the renameat syscall and adds flags, so future architectures won't need
    to include renameat.

    Therefore drop the renameat syscall from the generic syscall list unless
    __ARCH_WANT_RENAMEAT is defined by the architecture's unistd.h prior to
    including asm-generic/unistd.h, and adjust all architectures using the
    generic syscall list to define it so that no in-tree architectures are
    affected.

    Signed-off-by: James Hogan
    Acked-by: Vineet Gupta
    Cc: linux-arch@vger.kernel.org
    Cc: linux-snps-arc@lists.infradead.org
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: Richard Kuo
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-metag@vger.kernel.org
    Cc: Jonas Bonn
    Cc: linux@lists.openrisc.net
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ley Foon Tan
    Cc: nios2-dev@lists.rocketboards.org
    Cc: Yoshinori Sato
    Cc: uclinux-h8-devel@lists.sourceforge.jp
    Signed-off-by: Arnd Bergmann

    James Hogan
     
  • Compat architectures that does not use generic unistd (mips, s390),
    declare compat version in their syscall tables for preadv2 and
    pwritev2. Generic unistd syscall table should do it as well.

    [arnd: this initially slipped through the review and an
    incorrect patch got merged. arch/tile/ is the only architecture
    that could be affected for their 32-bit compat mode, every
    other architecture we support today is fine.]

    Signed-off-by: Yury Norov
    Signed-off-by: Arnd Bergmann

    Yury Norov
     

24 Apr, 2016

1 commit


21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

05 Mar, 2016

1 commit

  • Stephen Rothwell reported this linux-next build failure:

    http://lkml.kernel.org/r/20160226164406.065a1ffc@canb.auug.org.au

    ... caused by the Memory Protection Keys patches from the tip tree triggering
    a newly introduced build-time sanity check on an ARM build, because they changed
    the ABI of siginfo in an unexpected way.

    If u64 has a natural alignment of 8 bytes (which is the case on most mainstream
    platforms, with the notable exception of x86-32), then the leadup to the
    _sifields union matters:

    typedef struct siginfo {
    int si_signo;
    int si_errno;
    int si_code;

    union {
    ...
    } _sifields;
    } __ARCH_SI_ATTRIBUTES siginfo_t;

    Note how the first 3 fields give us 12 bytes, so _sifields is not 8
    naturally bytes aligned.

    Before the _pkey field addition the largest element of _sifields (on
    32-bit platforms) was 32 bits. With the u64 added, the minimum alignment
    requirement increased to 8 bytes on those (rare) 32-bit platforms. Thus
    GCC padded the space after si_code with 4 extra bytes, and shifted all
    _sifields offsets by 4 bytes - breaking the ABI of all of those
    remaining fields.

    On 64-bit platforms this problem was hidden due to _sifields already
    having numerous fields with natural 8 bytes alignment (pointers).

    To fix this, we replace the u64 with an '__u32'. The __u32 does not
    increase the minimum alignment requirement of the union, and it is
    also large enough to store the 16-bit pkey we have today on x86.

    Reported-by: Stehen Rothwell
    Signed-off-by: Dave Hansen
    Acked-by: Stehen Rothwell
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Helge Deller
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-next@vger.kernel.org
    Fixes: cd0ea35ff551 ("signals, pkeys: Notify userspace about protection key faults")
    Link: http://lkml.kernel.org/r/20160301125451.02C7426D@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

26 Feb, 2016

1 commit

  • This patch add the SO_CNX_ADVICE socket option (setsockopt only). The
    purpose is to allow an application to give feedback to the kernel about
    the quality of the network path for a connected socket. The value
    argument indicates the type of quality report. For this initial patch
    the only supported advice is a value of 1 which indicates "bad path,
    please reroute"-- the action taken by the kernel is to call
    dst_negative_advice which will attempt to choose a different ECMP route,
    reset the TX hash for flow label and UDP source port in encapsulation,
    etc.

    This facility should be useful for connected UDP sockets where only the
    application can provide any feedback about path quality. It could also
    be useful for TCP applications that have additional knowledge about the
    path outside of the normal TCP control loop.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

18 Feb, 2016

1 commit

  • A protection key fault is very similar to any other access error.
    There must be a VMA, etc... We even want to take the same action
    (SIGSEGV) that we do with a normal access fault.

    However, we do need to let userspace know that something is
    different. We do this the same way what we did with SEGV_BNDERR
    with Memory Protection eXtensions (MPX): define a new SEGV code:
    SEGV_PKUERR.

    We add a siginfo field: si_pkey that reveals to userspace which
    protection key was set on the PTE that we faulted on. There is
    no other easy way for userspace to figure this out. They could
    parse smaps but that would be a bit cruel.

    We share space with in siginfo with _addr_bnd. #BR faults from
    MPX are completely separate from page faults (#PF) that trigger
    from protection key violations, so we never need both at the same
    time.

    Note that _pkey is a 64-bit value. The current hardware only
    supports 4-bit protection keys. We do this because there is
    _plenty_ of space in _sigfault and it is possible that future
    processors would support more than 4 bits of protection keys.

    The x86 code to actually fill in the siginfo is in the next
    patch.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Amanieu d'Antras
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Palmer Dabbelt
    Cc: Peter Zijlstra
    Cc: Richard Weinberger
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Vegard Nossum
    Cc: Vladimir Davydov
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Jan, 2016

2 commits

  • For uapi, need try to let all macros have same value, and MADV_FREE is
    added into main branch recently, so need redefine MADV_FREE for it.

    At present, '8' can be shared with all architectures, so redefine it to
    '8'.

    [sudipm.mukherjee@gmail.com: correct uniform value of MADV_FREE]
    Signed-off-by: Chen Gang
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Ralf Baechle
    Cc: Arnd Bergmann
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Roland Dreier
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Daniel Micay
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Linux doesn't have an ability to free pages lazy while other OS already
    have been supported that named by madvise(MADV_FREE).

    The gain is clear that kernel can discard freed pages rather than
    swapping out or OOM if memory pressure happens.

    Without memory pressure, freed pages would be reused by userspace
    without another additional overhead(ex, page fault + allocation +
    zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When madvise syscall is called, VM clears dirty bit of ptes of the
    range. If memory pressure happens, VM checks dirty bit of page table
    and if it found still "clean", it means it's a "lazyfree pages" so VM
    could discard the page instead of swapping out. Once there was store
    operation for the page before VM peek a page to reclaim, dirty bit is
    set so VM can swap out the page instead of discarding.

    One thing we should notice is that basically, MADV_FREE relies on dirty
    bit in page table entry to decide whether VM allows to discard the page
    or not. IOW, if page table entry includes marked dirty bit, VM
    shouldn't discard the page.

    However, as a example, if swap-in by read fault happens, page table
    entry doesn't have dirty bit so MADV_FREE could discard the page
    wrongly.

    For avoiding the problem, MADV_FREE did more checks with PageDirty and
    PageSwapCache. It worked out because swapped-in page lives on swap
    cache and since it is evicted from the swap cache, the page has PG_dirty
    flag. So both page flags check effectively prevent wrong discarding by
    MADV_FREE.

    However, a problem in above logic is that swapped-in page has PG_dirty
    still after they are removed from swap cache so VM cannot consider the
    page as freeable any more even if madvise_free is called in future.

    Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
    swapcache. Then, page table doesn't mark
    dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

    To solve the problem, this patch clears PG_dirty if only the page is
    owned exclusively by current process when madvise is called because
    PG_dirty represents ptes's dirtiness in several processes so we could
    clear it only if we own it exclusively.

    Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
    and hope glibc supports it) and jemalloc/tcmalloc already have supported
    the feature for other OS(ex, FreeBSD)

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11
    ebizzy benchmark(./ebizzy -S 10 -n 512)

    Higher avg is better.

    vanilla-jemalloc MADV_free-jemalloc

    1 thread
    records: 10 records: 10
    avg: 2961.90 avg: 12069.70
    std: 71.96(2.43%) std: 186.68(1.55%)
    max: 3070.00 max: 12385.00
    min: 2796.00 min: 11746.00

    2 thread
    records: 10 records: 10
    avg: 5020.00 avg: 17827.00
    std: 264.87(5.28%) std: 358.52(2.01%)
    max: 5244.00 max: 18760.00
    min: 4251.00 min: 17382.00

    4 thread
    records: 10 records: 10
    avg: 8988.80 avg: 27930.80
    std: 1175.33(13.08%) std: 3317.33(11.88%)
    max: 9508.00 max: 30879.00
    min: 5477.00 min: 21024.00

    8 thread
    records: 10 records: 10
    avg: 13036.50 avg: 33739.40
    std: 170.67(1.31%) std: 5146.22(15.25%)
    max: 13371.00 max: 40572.00
    min: 12785.00 min: 24088.00

    16 thread
    records: 10 records: 10
    avg: 11092.40 avg: 31424.20
    std: 710.60(6.41%) std: 3763.89(11.98%)
    max: 12446.00 max: 36635.00
    min: 9949.00 min: 25669.00

    32 thread
    records: 10 records: 10
    avg: 11067.00 avg: 34495.80
    std: 971.06(8.77%) std: 2721.36(7.89%)
    max: 12010.00 max: 38598.00
    min: 9002.00 min: 30636.00

    In summary, MADV_FREE is about much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttil
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 Jan, 2016

1 commit

  • Pull networking updates from Davic Miller:

    1) Support busy polling generically, for all NAPI drivers. From Eric
    Dumazet.

    2) Add byte/packet counter support to nft_ct, from Floriani Westphal.

    3) Add RSS/XPS support to mvneta driver, from Gregory Clement.

    4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
    Frederic Sowa.

    5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.

    6) Add support for VLAN device bridging to mlxsw switch driver, from
    Ido Schimmel.

    7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.

    8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.

    9) Reorganize wireless drivers into per-vendor directories just like we
    do for ethernet drivers. From Kalle Valo.

    10) Provide a way for administrators "destroy" connected sockets via the
    SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti.

    11) Add support to add/remove multicast routes via netlink, from Nikolay
    Aleksandrov.

    12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.

    13) Add forwarding and packet duplication facilities to nf_tables, from
    Pablo Neira Ayuso.

    14) Dead route support in MPLS, from Roopa Prabhu.

    15) TSO support for thunderx chips, from Sunil Goutham.

    16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.

    17) Rationalize, consolidate, and more completely document the checksum
    offloading facilities in the networking stack. From Tom Herbert.

    18) Support aborting an ongoing scan in mac80211/cfg80211, from
    Vidyullatha Kanchanapally.

    19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
    net: bnxt: always return values from _bnxt_get_max_rings
    net: bpf: reject invalid shifts
    phonet: properly unshare skbs in phonet_rcv()
    dwc_eth_qos: Fix dma address for multi-fragment skbs
    phy: remove an unneeded condition
    mdio: remove an unneed condition
    mdio_bus: NULL dereference on allocation error
    net: Fix typo in netdev_intersect_features
    net: freescale: mac-fec: Fix build error from phy_device API change
    net: freescale: ucc_geth: Fix build error from phy_device API change
    bonding: Prevent IPv6 link local address on enslaved devices
    IB/mlx5: Add flow steering support
    net/mlx5_core: Export flow steering API
    net/mlx5_core: Make ipv4/ipv6 location more clear
    net/mlx5_core: Enable flow steering support for the IB driver
    net/mlx5_core: Initialize namespaces only when supported by device
    net/mlx5_core: Set priority attributes
    net/mlx5_core: Connect flow tables
    net/mlx5_core: Introduce modify flow table command
    net/mlx5_core: Managing root flow table
    ...

    Linus Torvalds
     

05 Jan, 2016

1 commit

  • Expose socket options for setting a classic or extended BPF program
    for use when selecting sockets in an SO_REUSEPORT group. These options
    can be used on the first socket to belong to a group before bind or
    on any socket in the group after bind.

    This change includes refactoring of the existing sk_filter code to
    allow reuse of the existing BPF filter validation checks.

    Signed-off-by: Craig Gallek
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Craig Gallek