18 Oct, 2016

1 commit

  • pkey_set() and pkey_get() were syscalls present in older versions
    of the protection keys patches. They were fully excised from the
    x86 code, but some cruft was left behind in the generic syscall code.
    The C++ comments were intended to make the leftovers glaring enough
    for me to fix before actually submitting them. That technique worked,
    but later than I would have liked.

    I test-compiled this for arm64.

    Fixes: a60f7b69d92c0 ("generic syscalls: Wire up memory protection keys syscalls")
    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: mgorman@techsingularity.net
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

09 Sep, 2016

2 commits

  • These new syscalls are implemented as generic code, so enable them for
    architectures like arm64 which use the generic syscall table.

    According to Arnd:

    Even if the support is x86 specific for the foreseeable future, it may be
    good to reserve the number just in case. The other architecture specific
    syscall lists are usually left to the individual arch maintainers, but a
    lot of the newer architectures share this table.

    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: mgorman@techsingularity.net
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163018.505A6875@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • This patch adds two new system calls:

    int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
    int pkey_free(int pkey);

    These implement an "allocator" for the protection keys
    themselves, which can be thought of as analogous to the allocator
    that the kernel has for file descriptors. The kernel tracks
    which numbers are in use, and only allows operations on keys that
    are valid. A key which was not obtained by pkey_alloc() may not,
    for instance, be passed to pkey_mprotect().

    These system calls are also very important given the kernel's use
    of pkeys to implement execute-only support. These help ensure
    that userspace can never assume that it has control of a key
    unless it first asks the kernel. The kernel does not promise to
    preserve PKRU (rights register) contents except for allocated
    pkeys.

    The 'init_access_rights' argument to pkey_alloc() specifies the
    rights that will be established for the returned pkey. For
    instance:

    pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

    will allocate 'pkey' and also set the bits in PKRU[1] such that
    writing to 'pkey' is already denied.

    The kernel does not prevent pkey_free() from successfully freeing
    in-use pkeys (those still assigned to a memory range by
    pkey_mprotect()). It would be expensive to implement the checks
    for this, so we instead say, "Just don't do it" since sane
    software will never do it anyway.

    Any piece of userspace calling pkey_alloc() needs to be prepared
    for it to fail. Why? pkey_alloc() returns the same error code
    (ENOSPC) when there are no pkeys and when pkeys are unsupported.
    They can be unsupported for a whole host of reasons, so apps must
    be prepared for this. Also, libraries or LD_PRELOADs might steal
    keys before an application gets access to them.
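
    A minimal sketch of that defensive pattern (not part of the patch;
    'addr'/'len' are placeholders, and userspace without libc wrappers
    would invoke the syscalls via syscall(2)):

        #include <errno.h>
        #include <sys/mman.h>

        /* Try to write-protect a range with a freshly allocated pkey;
         * fall back to plain mprotect() when pkey_alloc() fails, which
         * covers both "keys exhausted" and "pkeys unsupported" (ENOSPC). */
        static int make_write_protected(void *addr, size_t len)
        {
                int pkey = pkey_alloc(0, PKEY_DENY_WRITE);

                if (pkey < 0)
                        return mprotect(addr, len, PROT_READ);
                return pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey);
        }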

    This allocation mechanism could be implemented in userspace.
    Even if we did it in userspace, we would still need additional
    user/kernel interfaces to tell userspace which keys are being
    used by the kernel internally (such as for execute-only
    mappings). Having the kernel provide this facility completely
    removes the need for these additional interfaces, or having an
    implementation of this in userspace at all.

    Note that we have to make changes to all of the architectures
    that do not use mman-common.h because we use the new
    PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

    1. PKRU is the Protection Key Rights User register. It is a
    usermode-accessible register that controls whether writes
    and/or access to each individual pkey is allowed or denied.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

05 May, 2016

2 commits

  • The newer renameat2 syscall provides all the functionality provided by
    the renameat syscall and adds flags, so future architectures won't need
    to include renameat.

    Therefore drop the renameat syscall from the generic syscall list unless
    __ARCH_WANT_RENAMEAT is defined by the architecture's unistd.h prior to
    including asm-generic/unistd.h, and adjust all architectures using the
    generic syscall list to define it so that no in-tree architectures are
    affected.
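
    The mechanism, sketched for a hypothetical architecture header:

        /* arch/xyz/include/uapi/asm/unistd.h (sketch) */
        #define __ARCH_WANT_RENAMEAT    /* keep the legacy renameat syscall */
        #include <asm-generic/unistd.h>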

    Signed-off-by: James Hogan
    Acked-by: Vineet Gupta
    Cc: linux-arch@vger.kernel.org
    Cc: linux-snps-arc@lists.infradead.org
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: Richard Kuo
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-metag@vger.kernel.org
    Cc: Jonas Bonn
    Cc: linux@lists.openrisc.net
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ley Foon Tan
    Cc: nios2-dev@lists.rocketboards.org
    Cc: Yoshinori Sato
    Cc: uclinux-h8-devel@lists.sourceforge.jp
    Signed-off-by: Arnd Bergmann

    James Hogan
     
    Compat architectures that do not use the generic unistd (mips, s390)
    declare compat versions of preadv2 and pwritev2 in their syscall
    tables. The generic unistd syscall table should do so as well.

    [arnd: this initially slipped through the review and an
    incorrect patch got merged. arch/tile/ is the only architecture
    that could be affected for their 32-bit compat mode, every
    other architecture we support today is fine.]

    Signed-off-by: Yury Norov
    Signed-off-by: Arnd Bergmann

    Yury Norov
     

24 Apr, 2016

1 commit


21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.
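
    A sketch of what such a loader might do (the path and 'text_size' are
    placeholders; assumes a pkeys-capable CPU and kernel):

        #include <fcntl.h>
        #include <sys/mman.h>

        int fd = open("/usr/lib/libfoo.so", O_RDONLY);
        /* Execute-only mapping: the kernel assigns its execute-only pkey. */
        void *text = mmap(NULL, text_size, PROT_EXEC, MAP_PRIVATE, fd, 0);
        /* Calling into 'text' works; reading it raises SIGSEGV. */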

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

05 Mar, 2016

1 commit

  • Stephen Rothwell reported this linux-next build failure:

    http://lkml.kernel.org/r/20160226164406.065a1ffc@canb.auug.org.au

    ... caused by the Memory Protection Keys patches from the tip tree triggering
    a newly introduced build-time sanity check on an ARM build, because they changed
    the ABI of siginfo in an unexpected way.

    If u64 has a natural alignment of 8 bytes (which is the case on most mainstream
    platforms, with the notable exception of x86-32), then the leadup to the
    _sifields union matters:

    typedef struct siginfo {
            int si_signo;
            int si_errno;
            int si_code;

            union {
                    ...
            } _sifields;
    } __ARCH_SI_ATTRIBUTES siginfo_t;

    Note how the first 3 fields give us 12 bytes, so _sifields is not
    naturally 8-byte aligned.

    Before the _pkey field addition the largest element of _sifields (on
    32-bit platforms) was 32 bits. With the u64 added, the minimum alignment
    requirement increased to 8 bytes on those (rare) 32-bit platforms. Thus
    GCC padded the space after si_code with 4 extra bytes, and shifted all
    _sifields offsets by 4 bytes - breaking the ABI of all of those
    remaining fields.

    On 64-bit platforms this problem was hidden due to _sifields already
    having numerous fields with natural 8 bytes alignment (pointers).

    To fix this, we replace the u64 with an '__u32'. The __u32 does not
    increase the minimum alignment requirement of the union, and it is
    also large enough to store the 16-bit pkey we have today on x86.
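
    A toy model of the layout problem (sketch; offsets assume the common
    ILP32 ABIs discussed above, not x86-32):

        #include <stddef.h>

        struct with_u64 { int a, b, c; union { long long v; } u; };
        struct with_u32 { int a, b, c; union { unsigned  v; } u; };

        /* No padding before a 4-byte-aligned union: offset stays 12. */
        _Static_assert(offsetof(struct with_u32, u) == 12, "ABI intact");
        /* With the u64 member, offsetof(struct with_u64, u) becomes 16:
         * 4 bytes of padding appear after 'c' and every union member
         * shifts - exactly the siginfo ABI break described above. */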

    Reported-by: Stephen Rothwell
    Signed-off-by: Dave Hansen
    Acked-by: Stephen Rothwell
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Helge Deller
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-next@vger.kernel.org
    Fixes: cd0ea35ff551 ("signals, pkeys: Notify userspace about protection key faults")
    Link: http://lkml.kernel.org/r/20160301125451.02C7426D@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

26 Feb, 2016

1 commit

    This patch adds the SO_CNX_ADVICE socket option (setsockopt only). The
    purpose is to allow an application to give feedback to the kernel about
    the quality of the network path for a connected socket. The value
    argument indicates the type of quality report. For this initial patch
    the only supported advice is a value of 1 which indicates "bad path,
    please reroute"-- the action taken by the kernel is to call
    dst_negative_advice which will attempt to choose a different ECMP route,
    reset the TX hash for flow label and UDP source port in encapsulation,
    etc.

    This facility should be useful for connected UDP sockets where only the
    application can provide any feedback about path quality. It could also
    be useful for TCP applications that have additional knowledge about the
    path outside of the normal TCP control loop.
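
    Usage is a single setsockopt() call ('fd' below is a placeholder for
    a connected socket):

        #include <sys/socket.h>

        int advice = 1;   /* the only defined advice: bad path, reroute */
        setsockopt(fd, SOL_SOCKET, SO_CNX_ADVICE, &advice, sizeof(advice));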

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

18 Feb, 2016

1 commit

  • A protection key fault is very similar to any other access error.
    There must be a VMA, etc... We even want to take the same action
    (SIGSEGV) that we do with a normal access fault.

    However, we do need to let userspace know that something is
    different. We do this the same way we did with SEGV_BNDERR
    with Memory Protection eXtensions (MPX): define a new SEGV code:
    SEGV_PKUERR.

    We add a siginfo field: si_pkey that reveals to userspace which
    protection key was set on the PTE that we faulted on. There is
    no other easy way for userspace to figure this out. They could
    parse smaps but that would be a bit cruel.

    We share space in siginfo with _addr_bnd. #BR faults from
    MPX are completely separate from page faults (#PF) that trigger
    from protection key violations, so we never need both at the same
    time.

    Note that _pkey is a 64-bit value. The current hardware only
    supports 4-bit protection keys. We do this because there is
    _plenty_ of space in _sigfault and it is possible that future
    processors would support more than 4 bits of protection keys.

    The x86 code to actually fill in the siginfo is in the next
    patch.
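
    A sketch of how userspace might consume the new field (assumes a
    handler installed with SA_SIGINFO; only async-signal-safe work
    belongs in a real handler):

        #include <signal.h>
        #include <unistd.h>

        static void on_segv(int sig, siginfo_t *si, void *uctx)
        {
                if (si->si_code == SEGV_PKUERR) {
                        int pkey = si->si_pkey;  /* key on the faulting PTE */
                        /* record/handle the pkey fault here */
                }
                _exit(1);
        }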

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Amanieu d'Antras
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Palmer Dabbelt
    Cc: Peter Zijlstra
    Cc: Richard Weinberger
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Vegard Nossum
    Cc: Vladimir Davydov
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Jan, 2016

2 commits

    For the uapi headers, we should try to let all macros have the same
    value across architectures. MADV_FREE was added to the main branch
    recently, so its value needs to be redefined accordingly.

    At present, the value '8' can be shared by all architectures, so
    redefine MADV_FREE to '8'.

    [sudipm.mukherjee@gmail.com: correct uniform value of MADV_FREE]
    Signed-off-by: Chen Gang
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Ralf Baechle
    Cc: Arnd Bergmann
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Roland Dreier
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Daniel Micay
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    Linux lacks the ability to free pages lazily, something other OSes
    already support via madvise(MADV_FREE).

    The gain is clear: under memory pressure, the kernel can discard freed
    pages rather than swapping them out or hitting OOM.

    Without memory pressure, freed pages can be reused by userspace
    without additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When the madvise syscall is called, the VM clears the dirty bit in the
    PTEs of the range. If memory pressure happens, the VM checks the dirty
    bit in the page table and, if it is still "clean", treats the page as
    "lazyfree", so the VM can discard it instead of swapping it out. If a
    store to the page occurs before the VM picks it for reclaim, the dirty
    bit is set, so the VM swaps the page out instead of discarding it.

    One thing to notice is that, fundamentally, MADV_FREE relies on the
    dirty bit in the page table entry to decide whether the VM is allowed
    to discard the page. IOW, if the page table entry has the dirty bit
    set, the VM must not discard the page.

    However, if, for example, a swap-in happens via a read fault, the page
    table entry does not have the dirty bit set, so MADV_FREE could
    wrongly discard the page.

    To avoid that problem, MADV_FREE performs additional checks on
    PageDirty and PageSwapCache. This works because a swapped-in page
    lives in the swap cache, and once it is evicted from the swap cache,
    the page carries the PG_dirty flag. Together, the two page-flag checks
    effectively prevent MADV_FREE from discarding pages wrongly.

    However, a problem with the above logic is that a swapped-in page
    keeps PG_dirty even after it is removed from the swap cache, so the VM
    can no longer consider the page freeable, even if madvise_free is
    called later.

    Look at the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of the pages are swapped out
    ..
    ..
    var = *ptr; -> a page is swapped in and may be removed from the
                   swapcache. The page table entry is not marked
                   dirty, but the page descriptor has PG_dirty set.
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. This time, the VM cannot discard the page because the page
    .. has *PG_dirty* set.

    To solve the problem, this patch clears PG_dirty when madvise is
    called, but only if the page is owned exclusively by the current
    process, because PG_dirty represents PTE dirtiness across several
    processes, so we may clear it only when we own the page exclusively.

    The heaviest users are expected to be general-purpose allocators (e.g.,
    jemalloc, tcmalloc, and hopefully glibc), and jemalloc/tcmalloc already
    support the feature on other OSes (e.g., FreeBSD).
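
    A sketch of the allocator-side usage ('run'/'len' are placeholders):

        #include <sys/mman.h>

        static void lazy_free(void *run, size_t len)
        {
                /* Mark pages lazily freeable instead of MADV_DONTNEED:
                 * they stay mapped; a later write cancels the lazy free,
                 * otherwise reclaim may discard them and subsequent
                 * accesses fault in zero pages. */
                madvise(run, len, MADV_FREE);
        }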

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11
    ebizzy benchmark (./ebizzy -S 10 -n 512)

    Higher avg is better.

                 vanilla-jemalloc        MADV_free-jemalloc

    1 thread
      records:   10                      10
      avg:       2961.90                 12069.70
      std:       71.96 (2.43%)           186.68 (1.55%)
      max:       3070.00                 12385.00
      min:       2796.00                 11746.00

    2 thread
      records:   10                      10
      avg:       5020.00                 17827.00
      std:       264.87 (5.28%)          358.52 (2.01%)
      max:       5244.00                 18760.00
      min:       4251.00                 17382.00

    4 thread
      records:   10                      10
      avg:       8988.80                 27930.80
      std:       1175.33 (13.08%)        3317.33 (11.88%)
      max:       9508.00                 30879.00
      min:       5477.00                 21024.00

    8 thread
      records:   10                      10
      avg:       13036.50                33739.40
      std:       170.67 (1.31%)          5146.22 (15.25%)
      max:       13371.00                40572.00
      min:       12785.00                24088.00

    16 thread
      records:   10                      10
      avg:       11092.40                31424.20
      std:       710.60 (6.41%)          3763.89 (11.98%)
      max:       12446.00                36635.00
      min:       9949.00                 25669.00

    32 thread
      records:   10                      10
      avg:       11067.00                34495.80
      std:       971.06 (8.77%)          2721.36 (7.89%)
      max:       12010.00                38598.00
      min:       9002.00                 30636.00

    In summary, MADV_FREE is much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttil
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 Jan, 2016

1 commit

    Pull networking updates from David Miller:

    1) Support busy polling generically, for all NAPI drivers. From Eric
    Dumazet.

    2) Add byte/packet counter support to nft_ct, from Florian Westphal.

    3) Add RSS/XPS support to mvneta driver, from Gregory Clement.

    4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
    Frederic Sowa.

    5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.

    6) Add support for VLAN device bridging to mlxsw switch driver, from
    Ido Schimmel.

    7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.

    8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.

    9) Reorganize wireless drivers into per-vendor directories just like we
    do for ethernet drivers. From Kalle Valo.

    10) Provide a way for administrators to "destroy" connected sockets via the
    SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti.

    11) Add support to add/remove multicast routes via netlink, from Nikolay
    Aleksandrov.

    12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.

    13) Add forwarding and packet duplication facilities to nf_tables, from
    Pablo Neira Ayuso.

    14) Dead route support in MPLS, from Roopa Prabhu.

    15) TSO support for thunderx chips, from Sunil Goutham.

    16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.

    17) Rationalize, consolidate, and more completely document the checksum
    offloading facilities in the networking stack. From Tom Herbert.

    18) Support aborting an ongoing scan in mac80211/cfg80211, from
    Vidyullatha Kanchanapally.

    19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
    net: bnxt: always return values from _bnxt_get_max_rings
    net: bpf: reject invalid shifts
    phonet: properly unshare skbs in phonet_rcv()
    dwc_eth_qos: Fix dma address for multi-fragment skbs
    phy: remove an unneeded condition
    mdio: remove an unneed condition
    mdio_bus: NULL dereference on allocation error
    net: Fix typo in netdev_intersect_features
    net: freescale: mac-fec: Fix build error from phy_device API change
    net: freescale: ucc_geth: Fix build error from phy_device API change
    bonding: Prevent IPv6 link local address on enslaved devices
    IB/mlx5: Add flow steering support
    net/mlx5_core: Export flow steering API
    net/mlx5_core: Make ipv4/ipv6 location more clear
    net/mlx5_core: Enable flow steering support for the IB driver
    net/mlx5_core: Initialize namespaces only when supported by device
    net/mlx5_core: Set priority attributes
    net/mlx5_core: Connect flow tables
    net/mlx5_core: Introduce modify flow table command
    net/mlx5_core: Managing root flow table
    ...

    Linus Torvalds
     

05 Jan, 2016

1 commit

  • Expose socket options for setting a classic or extended BPF program
    for use when selecting sockets in an SO_REUSEPORT group. These options
    can be used on the first socket to belong to a group before bind or
    on any socket in the group after bind.

    This change includes refactoring of the existing sk_filter code to
    allow reuse of the existing BPF filter validation checks.
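
    A sketch of the classic-BPF flavor ('fd' is a placeholder for the
    first socket of a group bound with SO_REUSEPORT; the program picks
    the group member by the CPU handling the packet):

        #include <linux/filter.h>
        #include <sys/socket.h>

        struct sock_filter code[] = {
                /* A = number of the CPU processing the packet */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_CPU),
                /* return A: index of the socket to deliver to */
                BPF_STMT(BPF_RET | BPF_A, 0),
        };
        struct sock_fprog prog = { .len = 2, .filter = code };

        setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                   &prog, sizeof(prog));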

    Signed-off-by: Craig Gallek
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Craig Gallek
     

02 Dec, 2015

1 commit

  • Add a copy_file_range() system call for offloading copies between
    regular files.

    This gives an interface to underlying layers of the storage stack which
    can copy without reading and writing all the data. There are a few
    candidates that should support copy offloading in the nearer term:

    - btrfs shares extent references with its clone ioctl
    - NFS has patches to add a COPY command which copies on the server
    - SCSI has a family of XCOPY commands which copy in the device

    This system call avoids the complexity of also accelerating the creation
    of the destination file by operating on an existing destination file
    descriptor, not a path.

    Currently the high level vfs entry point limits copy offloading to files
    on the same mount and super (and not in the same file). This can be
    relaxed if we get implementations which can copy between file systems
    safely.
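
    A usage sketch ('fd_in'/'fd_out' are placeholders for open regular
    files; invoked via syscall(2) here, as no libc wrapper existed yet):

        #include <sys/syscall.h>
        #include <unistd.h>

        /* Copy 1 MiB from the current offset of fd_in to fd_out,
         * letting the underlying filesystem/device offload the copy. */
        ssize_t n = syscall(__NR_copy_file_range,
                            fd_in, NULL, fd_out, NULL,
                            (size_t)1024 * 1024, 0);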

    Signed-off-by: Zach Brown
    [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
    Change flags parameter from int to unsigned int,
    Add function to include/linux/syscalls.h,
    Check copy len after file open mode,
    Don't forbid ranges inside the same file,
    Use rw_verify_area() to verify ranges,
    Use file_out rather than file_in,
    Add COPY_FR_REFLINK flag]
    Signed-off-by: Anna Schumaker
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Zach Brown
     

06 Nov, 2015

2 commits

  • The previous patch introduced a flag that specified pages in a VMA should
    be placed on the unevictable LRU, but they should not be made present when
    the area is created. This patch adds the ability to set this state via
    the new mlock system calls.

    We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
    MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
    MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
    When used with MCL_CURRENT, all current mappings will be marked with
    VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
    will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
    MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
    marked with VM_LOCKED | VM_LOCKONFAULT.

    Prior to this patch, mlockall() will unconditionally clear the
    mm->def_flags any time it is called without MCL_FUTURE. This behavior is
    maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
    followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
    new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
    in either mlockall() invocation.

    munlock() will unconditionally clear both vma flags. munlockall()
    unconditionally clears both VMA flags on all VMAs and in the mm->def_flags
    field.
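
    A usage sketch (assumes headers that define the new flags; 'addr'
    and 'len' are placeholders, and mlock2 has no libc wrapper yet):

        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Lock current and future mappings, faulting pages in lazily: */
        mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);

        /* Per-range variant via the new syscall: */
        syscall(__NR_mlock2, addr, len, MLOCK_ONFAULT);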

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow-on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

13 Oct, 2015

1 commit

    MINSIGSTKSZ and SIGSTKSZ for ARM64 are not correctly set in the latest
    kernel. This patch fixes this issue.

    This issue was reported by LTP (testcase: sigaltstack02.c).
    The testcase failed when sigaltstack() was called with a stack size of
    "MINSIGSTKSZ - 1": Glibc-2.22 sets MINSIGSTKSZ to 5120, but the kernel
    set it to 2048, so the testcase failed.

    Testcase Output:
    sigaltstack02 1 TPASS : stgaltstack() fails, Invalid Flag value,errno:22
    sigaltstack02 2 TFAIL : sigaltstack() returned 0, expected -1,errno:12

    Reported Issue in Glibc Bugzilla:
    Bugfix in Glibc-2.22: [Bug 16850]
    https://sourceware.org/bugzilla/show_bug.cgi?id=16850

    Acked-by: Arnd Bergmann
    Signed-off-by: Akhilesh Kumar
    Signed-off-by: Manjeet Pawar
    Signed-off-by: Rohit Thapliyal
    Signed-off-by: Will Deacon

    Manjeet Pawar
     

23 Sep, 2015

1 commit

    Add the userfaultfd syscalls to uapi asm-generic; this was tested with
    postcopy live migration on aarch64 with both 4k and 64k pagesize
    kernels.

    Signed-off-by: Dr. David Alan Gilbert
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Thierry Reding
    Cc: Mathieu Desnoyers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dr. David Alan Gilbert
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pair:

    Thread A                    Thread B
    previous mem accesses       previous mem accesses
    smp_mb()                    smp_mb()
    following mem accesses      following mem accesses

    After the change, these pairs become:

    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A                    Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                                prev mem accesses
                                barrier()
                                follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every one of
    the process's threads (as we currently do), we diminish the number of
    unnecessary wake-ups and only issue the memory barriers on active
    threads. Non-running threads do not need to execute such a barrier
    anyway, because it is implied by the scheduler's context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include <linux/membarrier.h>

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all
    processes running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread are guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

                         barrier()   smp_mb()   sys_membarrier()
    barrier()                X           X              O
    smp_mb()                 X           O              O
    sys_membarrier()         O           O              O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)
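
    A user-space sketch matching the man page above (probe once, then
    issue system-wide barriers; no libc wrapper existed at the time):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
        if (mask >= 0 && (mask & MEMBARRIER_CMD_SHARED))
                membarrier(MEMBARRIER_CMD_SHARED, 0);  /* full barrier */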

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

17 Apr, 2015

1 commit

  • ENOSYS is the mechanism used by user code to detect whether the running
    kernel implements a given system call. It should not be returned by
    anything except an unimplemented system call.

    Unfortunately, it is rather frequently used in the kernel to indicate that
    various new functions of existing system calls are not implemented. This
    should be discouraged.

    Improve the comment in errno.h to help clarify ENOSYS's purpose.
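
    The one legitimate pattern, sketched here with getrandom(2) as the
    probed call:

        #include <errno.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        char buf[16];

        /* Only an unimplemented syscall should ever yield ENOSYS. */
        if (syscall(__NR_getrandom, buf, sizeof(buf), 0) < 0 &&
            errno == ENOSYS) {
                /* kernel predates getrandom(2): fall back to /dev/urandom */
        }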

    Signed-off-by: Andy Lutomirski
    Cc: Pavel Machek
    Cc: Michael Kerrisk
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

09 Jan, 2015

1 commit

  • Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The
    clashing O_PATH value was added in commit 5229645bdc35 ("vfs: add
    nonconflicting values for O_PATH") but this can't be changed as it is
    user-visible.

    FMODE_NONOTIFY is only used internally in the kernel, but it is in the
    same numbering space as the other O_* flags, as indicated by the comment
    at the top of include/uapi/asm-generic/fcntl.h (and its use in
    fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash.

    All of this has happened before (commit 12ed2e36c98a: "fanotify:
    FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will
    happen again -- so update the uniqueness check in fcntl_init() to
    include __FMODE_NONOTIFY.

    Signed-off-by: David Drysdale
    Acked-by: David S. Miller
    Acked-by: Jan Kara
    Cc: Heinrich Schuchardt
    Cc: Alexander Viro
    Cc: Arnd Bergmann
    Cc: Stephen Rothwell
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
     

14 Dec, 2014

1 commit

  • This patchset adds execveat(2) for x86, and is derived from Meredydd
    Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

    The primary aim of adding an execveat syscall is to allow an
    implementation of fexecve(3) that does not rely on the /proc filesystem,
    at least for executables (rather than scripts). The current glibc version
    of fexecve(3) is implemented via /proc, which causes problems in sandboxed
    or otherwise restricted environments.

    Given the desire for a /proc-free fexecve() implementation, HPA suggested
    (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
    an appropriate generalization.

    Also, having a new syscall means that it can take a flags argument without
    back-compatibility concerns. The current implementation just defines the
    AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
    added in future -- for example, flags for new namespaces (as suggested at
    https://lkml.org/lkml/2006/7/11/474).

    Related history:
    - https://lkml.org/lkml/2006/12/27/123 is an example of someone
    realizing that fexecve() is likely to fail in a chroot environment.
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
    documenting the /proc requirement of fexecve(3) in its manpage, to
    "prevent other people from wasting their time".
    - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
    problem where a process that did setuid() could not fexecve()
    because it no longer had access to /proc/self/fd; this has since
    been fixed.

    This patch (of 4):

    Add a new execveat(2) system call. execveat() is to execve() as openat()
    is to open(): it takes a file descriptor that refers to a directory, and
    resolves the filename relative to that.

    In addition, if the filename is empty and AT_EMPTY_PATH is specified,
    execveat() executes the file to which the file descriptor refers. This
    replicates the functionality of fexecve(), which is a system call in other
    UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>"
    (and so relies on /proc being mounted).

    The filename fed to the executed program as argv[0] (or the name of the
    script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
    (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
    reflecting how the executable was found. This does however mean that
    execution of a script in a /proc-less environment won't work; also, script
    execution via an O_CLOEXEC file descriptor fails (as the file will not be
    accessible after exec).
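
    A /proc-free fexecve() sketch (assumes headers exposing __NR_execveat;
    works for binaries, and per the note above not for scripts when the
    descriptor is O_CLOEXEC):

        #include <fcntl.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int fd = open("/usr/bin/true", O_RDONLY | O_CLOEXEC);
        char *argv[] = { "true", NULL }, *envp[] = { NULL };

        /* Empty filename + AT_EMPTY_PATH: execute what fd refers to. */
        syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);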

    Based on patches by Meredydd Luff.

    Signed-off-by: David Drysdale
    Cc: Meredydd Luff
    Cc: Shuah Khan
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Alexander Viro
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Kees Cook
    Cc: Arnd Bergmann
    Cc: Rich Felker
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

    2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to resolve the conflicting goals between the
    desired handling of bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU upon which a packet was
    received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...

    Linus Torvalds
     

06 Dec, 2014

1 commit

    Introduce a new setsockopt() command:

    setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

    where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
    and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

    setsockopt() calls bpf_prog_get(), which increments the refcnt of the
    program so it doesn't get unloaded while the socket is using it.

    The same eBPF program can be attached to multiple sockets.

    When the user task exits, its sockets are closed automatically, which
    calls sk_filter_uncharge() and decrements the refcnt of the eBPF
    program.
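
    A minimal load-and-attach sketch ('sock' is a placeholder; the
    two-instruction program returns -1, i.e. "accept the whole packet"):

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        struct bpf_insn insns[] = {
                { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                  .dst_reg = BPF_REG_0, .imm = -1 },   /* r0 = -1   */
                { .code = BPF_JMP | BPF_EXIT },        /* return r0 */
        };
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
        attr.insns     = (unsigned long)insns;
        attr.insn_cnt  = 2;
        attr.license   = (unsigned long)"GPL";

        int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd));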

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

18 Nov, 2014

1 commit

    This patch adds new fields describing bound violations to the
    siginfo structure. si_lower and si_upper are, respectively, the
    lower and upper bound in effect when a bound violation occurs.

    Signed-off-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Qiaowei Ren
     

12 Nov, 2014

1 commit

    An alternative to RPS/RFS is to use hardware support for multiple
    queues.

    Then we split a set of millions of sockets into worker threads, each
    one using epoll() to manage events on its own socket pool.

    Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
    know after accept() or connect() on which queue/cpu a socket is managed.

    We normally use one cpu per RX queue (IRQ smp_affinity being properly
    set), so remembering on socket structure which cpu delivered last packet
    is enough to solve the problem.

    After accept(), connect(), or even file descriptor passing around
    processes, applications can use:

    int cpu;
    socklen_t len = sizeof(cpu);

    getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);

    And use this information to put the socket into the right silo
    for optimal performance, as the whole networking stack should run
    on the appropriate cpu, without needing to send IPIs (RPS/RFS).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2014

1 commit


19 Aug, 2014

1 commit

  • Commit 9183df25fe7b ("shm: add memfd_create() syscall") added a new
    system call (memfd_create) but didn't update the asm-generic unistd
    header.

    This patch adds the new system call to the asm-generic version of
    unistd.h so that it can be used by architectures such as arm64.
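
    A usage sketch (via syscall(2), since the libc wrapper came later):

        #include <linux/memfd.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Anonymous, file-descriptor-backed memory region: */
        int fd = syscall(__NR_memfd_create, "my-buffer", MFD_CLOEXEC);
        ftruncate(fd, 4096);   /* size the backing object */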

    Cc: Arnd Bergmann
    Reviewed-by: David Herrmann
    Signed-off-by: Will Deacon

    Will Deacon
     

06 Aug, 2014

2 commits

  • Pull randomness updates from Ted Ts'o:
    "Cleanups and bug fixes to /dev/random, add a new getrandom(2) system
    call, which is a superset of OpenBSD's getentropy(2) call, for use
    with userspace crypto libraries such as LibreSSL.

    Also add the ability to have a kernel thread to pull entropy from
    hardware rng devices into /dev/random"

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
    hwrng: Pass entropy to add_hwgenerator_randomness() in bits, not bytes
    random: limit the contribution of the hw rng to at most half
    random: introduce getrandom(2) system call
    hw_random: fix sparse warning (NULL vs 0 for pointer)
    random: use registers from interrupted code for CPU's w/o a cycle counter
    hwrng: add per-device entropy derating
    hwrng: create filler thread
    random: add_hwgenerator_randomness() for feeding entropy from devices
    random: use an improved fast_mix() function
    random: clean up interrupt entropy accounting for archs w/o cycle counters
    random: only update the last_pulled time if we actually transferred entropy
    random: remove unneeded hash of a portion of the entropy pool
    random: always update the entropy pool under the spinlock

    Linus Torvalds
     
  • The getrandom(2) system call was requested by the LibreSSL Portable
    developers. It is analogous to the getentropy(2) system call in
    OpenBSD.

    The rationale of this system call is to provide resilience against
    file descriptor exhaustion attacks, where the attacker consumes all
    available file descriptors, forcing the use of the fallback code where
    /dev/[u]random is not available. Since the fallback code is often not
    well-tested, it is better to eliminate this potential failure mode
    entirely.

    The other feature provided by this new system call is the ability to
    request randomness from the /dev/urandom entropy pool, but to block
    until at least 128 bits of entropy has been accumulated in the
    /dev/urandom entropy pool. Historically, the emphasis in the
    /dev/urandom development has been to ensure that urandom pool is
    initialized as quickly as possible after system boot, and preferably
    before the init scripts start execution.

    This is because changing /dev/urandom reads to block represents an
    interface change that could potentially break userspace which is not
    acceptable. In practice, on most x86 desktop and server systems, in
    general the entropy pool can be initialized before it is needed (and
    in modern kernels, we will printk a warning message if not). However,
    on an embedded system, this may not be the case. And so with this new
    interface, we can provide the functionality of blocking until the
    urandom pool has been initialized. Any userspace program which uses
    this new functionality must take care to assure that if it is used
    during the boot process, that it will not cause the init scripts or
    other portions of the system startup to hang indefinitely.

    SYNOPSIS
    #include

    int getrandom(void *buf, size_t buflen, unsigned int flags);

    DESCRIPTION
    The system call getrandom() fills the buffer pointed to by buf
    with up to buflen random bytes which can be used to seed user
    space random number generators (i.e., DRBG's) or for other
    cryptographic uses. It should not be used for Monte Carlo
    simulations or other programs/algorithms which are doing
    probabilistic sampling.

    If the GRND_RANDOM flags bit is set, then draw from the
    /dev/random pool instead of the /dev/urandom pool. The
    /dev/random pool is limited based on the entropy that can be
    obtained from environmental noise, so if there is insufficient
    entropy, the requested number of bytes may not be returned.
    If there is no entropy available at all, getrandom(2) will
    either block, or return an error with errno set to EAGAIN if
    the GRND_NONBLOCK bit is set in flags.

    If the GRND_RANDOM bit is not set, then the /dev/urandom pool
    will be used. Unlike using read(2) to fetch data from
    /dev/urandom, if the urandom pool has not been sufficiently
    initialized, getrandom(2) will block (or return -1 with errno
    set to EAGAIN if the GRND_NONBLOCK bit is set in flags).

    The getentropy(2) system call in OpenBSD can be emulated using
    the following function:

    int getentropy(void *buf, size_t buflen)
    {
            int ret;

            if (buflen > 256)
                    goto failure;
            ret = getrandom(buf, buflen, 0);
            if (ret < 0)
                    return ret;
            if (ret == buflen)
                    return 0;
    failure:
            errno = EIO;
            return -1;
    }

    RETURN VALUE
    On success, the number of bytes that were copied into buf is
    returned. This may be less than the number of bytes requested
    via buflen if insufficient entropy was present in the
    /dev/random pool, or if the system call was interrupted by a
    signal.

    On error, -1 is returned, and errno is set appropriately.

    ERRORS
    EINVAL An invalid flag was passed to getrandom(2)

    EFAULT buf is outside the accessible address space.

    EAGAIN The requested entropy was not available, and
    getrandom(2) would have blocked if the
    GRND_NONBLOCK flag was not set.

    EINTR While blocked waiting for entropy, the call was
    interrupted by a signal handler; see the description
    of how interrupted read(2) calls on "slow" devices
    are handled with and without the SA_RESTART flag
    in the signal(7) man page.

    NOTES
    For small requests (buflen <= 256), getrandom(2) will not
    return EINTR when reading from the urandom pool once the
    entropy pool has been initialized, and it will return all of
    the bytes that have been requested.
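
    As a minimal usage sketch (assuming the libc headers expose the
    syscall number as SYS_getrandom; older headers may require defining
    it by hand, since no libc wrapper existed at the time):

    #include <errno.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Fill buf with buflen random bytes from the urandom pool,
     * retrying if the call is interrupted by a signal. */
    static int fill_random(void *buf, size_t buflen)
    {
            char *p = buf;

            while (buflen > 0) {
                    ssize_t ret = syscall(SYS_getrandom, p, buflen, 0);

                    if (ret < 0) {
                            if (errno == EINTR)
                                    continue;
                            return -1;
                    }
                    p += ret;
                    buflen -= ret;
            }
            return 0;
    }
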
    Reviewed-by: Zach Brown

    Theodore Ts'o
     

19 Jul, 2014

1 commit

  • This adds the new "seccomp" syscall with both an "operation" and "flags"
    parameter for future expansion. The third argument is a pointer value,
    used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
    be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

    In addition to the TSYNC flag later in this patch series, there is a
    non-zero chance that this syscall could be used for configuring a fixed
    argument area for seccomp-tracer-aware processes to pass syscall arguments
    in the future. Hence the use of "seccomp" rather than simply
    "seccomp_add_filter" for this syscall. Additionally, this syscall
    uses operation, flags,
    and user pointer for arguments because strictly passing arguments via
    a user pointer would mean seccomp itself would be unable to trivially
    filter the seccomp syscall itself.
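
    A minimal sketch of installing a trivial allow-everything filter
    through the new syscall (assuming headers that define
    SECCOMP_SET_MODE_FILTER and SYS_seccomp; note that
    PR_SET_NO_NEW_PRIVS must be set first unless the caller has
    CAP_SYS_ADMIN):

    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int install_noop_filter(void)
    {
            /* One-instruction BPF program: allow every syscall. */
            struct sock_filter insns[] = {
                    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            };
            struct sock_fprog prog = {
                    .len = sizeof(insns) / sizeof(insns[0]),
                    .filter = insns,
            };

            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                    return -1;
            /* Equivalent to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER,
             * &prog), but via the new syscall; flags must be 0. */
            return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
    }
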

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

20 May, 2014

1 commit

  • Add the renameat2 syscall to the generic syscall list, which is used by the
    following architectures: arc, arm64, c6x, hexagon, metag, openrisc, score,
    tile, unicore32.

    Signed-off-by: James Hogan
    Acked-by: Arnd Bergmann
    Signed-off-by: Miklos Szeredi
    Cc: linux-arch@vger.kernel.org
    Cc: Vineet Gupta
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Richard Kuo
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-metag@vger.kernel.org
    Cc: Jonas Bonn
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao

    James Hogan
     

15 May, 2014

1 commit

  • _STK_LIM_MAX could be used to override the RLIMIT_STACK hard limit from
    an arch's include/uapi/asm-generic/resource.h file, but is no longer
    used since both parisc and metag removed the override. Therefore remove
    it entirely, setting the hard RLIMIT_STACK limit to RLIM_INFINITY
    directly in include/asm-generic/resource.h.

    Signed-off-by: James Hogan
    Cc: Arnd Bergmann
    Cc: linux-arch@vger.kernel.org
    Cc: Helge Deller
    Cc: John David Anglin

    James Hogan
     

22 Apr, 2014

1 commit

  • File-private locks have been merged into Linux for v3.15, and *now*
    people are commenting that the name and macro definitions for the new
    file-private locks suck.

    ...and I can't even disagree. The names and command macros do suck.

    We're going to have to live with these for a long time, so it's
    important that we be happy with the names before we're stuck with them.
    The consensus on the lists so far is that they should be rechristened as
    "open file description locks".

    The name isn't a big deal for the kernel, but the command macros are not
    visually distinct enough from the traditional POSIX lock macros. The
    glibc and documentation folks are recommending that we change them to
    look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
    programmer will mistype one of the commands, and also makes it easier
    to spot the difference when reading code.

    This patch makes the following changes that I think are necessary before
    v3.15 ships:

    1) rename the command macros to their new names. These end up in the uapi
    headers and so are part of the external-facing API. It turns out that
    glibc doesn't actually use the fcntl.h uapi header, but it's hard to
    be sure that something else won't. Changing it now is safest.

    2) make the /proc/locks output display these as type "OFDLCK"
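
    Sketched usage with the renamed macros (assuming a libc that
    exposes F_OFD_SETLK, which with glibc requires _GNU_SOURCE; note
    that l_pid must be zero for the OFD commands):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Take a whole-file write lock on the open file description
     * behind fd; returns the fcntl() result (-1 with errno set to
     * EAGAIN if a conflicting lock is held). */
    int lock_whole_file(int fd)
    {
            struct flock fl;

            memset(&fl, 0, sizeof(fl));
            fl.l_type = F_WRLCK;
            fl.l_whence = SEEK_SET;
            fl.l_start = 0;
            fl.l_len = 0;    /* zero length means "to end of file" */
            fl.l_pid = 0;    /* required to be 0 for OFD commands */

            return fcntl(fd, F_OFD_SETLK, &fl);
    }
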

    Cc: Michael Kerrisk
    Cc: Christoph Hellwig
    Cc: Carlos O'Donell
    Cc: Stefan Metzmacher
    Cc: Andy Lutomirski
    Cc: Frank Filz
    Cc: Theodore Ts'o
    Signed-off-by: Jeff Layton

    Jeff Layton
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and me taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c

    Linus Torvalds
     

31 Mar, 2014

2 commits

  • Due to some unfortunate history, POSIX locks have very strange and
    unhelpful semantics. The thing that usually catches people by surprise
    is that they are dropped whenever the process closes any file descriptor
    associated with the inode.

    This is extremely problematic for people developing file servers that
    need to implement byte-range locks. Developers often need a "lock
    management" facility to ensure that file descriptors are not closed
    until all of the locks associated with the inode are finished.

    Additionally, "classic" POSIX locks are owned by the process. Locks
    taken between threads within the same process won't conflict with one
    another, which renders them useless for synchronization between threads.

    This patchset adds a new type of lock that attempts to address these
    issues. These locks conflict with classic POSIX read/write locks, but
    have semantics that are more like BSD locks with respect to inheritance
    and behavior on close.

    This is implemented primarily by changing how fl_owner field is set for
    these locks. Instead of having them owned by the files_struct of the
    process, they are instead owned by the filp on which they were acquired.
    Thus, they are inherited across fork() and are only released when the
    last reference to a filp is put.

    These new semantics prevent them from being merged with classic POSIX
    locks, even if they are acquired by the same process, and they will
    conflict with classic POSIX locks even when acquired by the same
    process or on the same file descriptor.

    The new locks are managed using a new set of cmd values to the fcntl()
    syscall. The initial implementation of this converts these values to
    "classic" cmd values at a fairly high level, and the details are not
    exposed to the underlying filesystem. We may eventually want to push
    this handling out to the lower filesystem code, but for now I don't
    see any need for it.

    Also, note that with this implementation the new cmd values are only
    available via fcntl64() on 32-bit arches. There's little need to
    add support for legacy apps on a new interface like this.
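
    To make the close and ownership semantics concrete, a sketch
    (hypothetical path, and using the F_OFD_* names these commands were
    later given) in which two separate opens in one process do conflict,
    unlike classic POSIX locks:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            /* Two distinct open file descriptions of the same file. */
            int fd1 = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);
            int fd2 = open("/tmp/lock-demo", O_RDWR);
            struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

            fcntl(fd1, F_OFD_SETLK, &fl);  /* succeeds */
            fcntl(fd2, F_OFD_SETLK, &fl);  /* fails with EAGAIN even in
                                              the same process: the lock
                                              is owned by the filp */
            return 0;
    }
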

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • In the 32-bit case fcntl assigns the 64-bit f_pos and i_size to a 32-bit
    off_t.

    The existing range checks also seem to depend on signed arithmetic
    wrapping when it overflows. In practice maybe that works, but we can be
    more careful. That also allows us to make a more reliable distinction
    between -EINVAL and -EOVERFLOW.

    Note that in the 32-bit case SEEK_CUR or SEEK_END might allow the caller
    to set a lock with starting point no longer representable as a 32-bit
    value. We could return -EOVERFLOW in such cases, but the locks code is
    capable of handling such ranges, so we choose to be lenient here. The
    only problem is that subsequent GETLK calls on such a lock will fail
    with EOVERFLOW.

    While we're here, do some cleanup including consolidating code for the
    flock and flock64 cases.
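
    The kind of explicit check this enables, as a sketch (hypothetical
    helper; the kernel's actual code uses its own offset limits in
    fs/locks.c) that avoids relying on signed wraparound, which is
    undefined behavior in C:

    #include <errno.h>
    #include <limits.h>

    /* Validate a lock range [start, start + len - 1], distinguishing
     * an invalid range (-EINVAL) from one whose end is simply not
     * representable (-EOVERFLOW). */
    static int check_lock_range(long long start, long long len)
    {
            if (start < 0 || len < 0)
                    return -EINVAL;
            if (len > 0 && start > LLONG_MAX - (len - 1))
                    return -EOVERFLOW;  /* end of range would overflow */
            return 0;
    }
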

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    J. Bruce Fields
     

04 Mar, 2014

1 commit

    For architecture dependent compat syscalls in common code an
    architecture must define something like __ARCH_WANT_<WHATEVER> if it
    wants to use the code.
    This however is not true for compat_sys_getdents64, for which
    architectures must define __ARCH_OMIT_COMPAT_SYS_GETDENTS64 if they
    do not want the code.

    This leads to the situation where all architectures, except mips, get the
    compat code but only x86_64, arm64 and the generic syscall architectures
    actually use it.

    So invert the logic, so that architectures actively must do something to
    get the compat code.

    This way a couple of architectures get rid of otherwise dead code.
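
    Schematically, the inversion looks like this (a kernel-side sketch
    of the guard around the declaration; the actual sites are in the
    compat headers and fs code):

    /* Before: opt-out -- everyone gets it unless they ask not to. */
    #ifndef __ARCH_OMIT_COMPAT_SYS_GETDENTS64
    asmlinkage long compat_sys_getdents64(unsigned int fd,
                    struct linux_dirent64 __user *dirent,
                    unsigned int count);
    #endif

    /* After: opt-in -- only architectures defining the macro get it. */
    #ifdef __ARCH_WANT_COMPAT_SYS_GETDENTS64
    asmlinkage long compat_sys_getdents64(unsigned int fd,
                    struct linux_dirent64 __user *dirent,
                    unsigned int count);
    #endif
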

    Signed-off-by: Heiko Carstens

    Heiko Carstens