16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

14 Dec, 2014

1 commit

  • This patchset adds execveat(2) for x86, and is derived from Meredydd
    Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

    The primary aim of adding an execveat syscall is to allow an
    implementation of fexecve(3) that does not rely on the /proc filesystem,
    at least for executables (rather than scripts). The current glibc version
    of fexecve(3) is implemented via /proc, which causes problems in sandboxed
    or otherwise restricted environments.

    Given the desire for a /proc-free fexecve() implementation, HPA suggested
    (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
    an appropriate generalization.

    Also, having a new syscall means that it can take a flags argument without
    back-compatibility concerns. The current implementation just defines the
    AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
    added in future -- for example, flags for new namespaces (as suggested at
    https://lkml.org/lkml/2006/7/11/474).

    Related history:
    - https://lkml.org/lkml/2006/12/27/123 is an example of someone
    realizing that fexecve() is likely to fail in a chroot environment.
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
    documenting the /proc requirement of fexecve(3) in its manpage, to
    "prevent other people from wasting their time".
    - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
    problem where a process that did setuid() could not fexecve()
    because it no longer had access to /proc/self/fd; this has since
    been fixed.

    This patch (of 4):

    Add a new execveat(2) system call. execveat() is to execve() as openat()
    is to open(): it takes a file descriptor that refers to a directory, and
    resolves the filename relative to that.

    In addition, if the filename is empty and AT_EMPTY_PATH is specified,
    execveat() executes the file to which the file descriptor refers. This
    replicates the functionality of fexecve(), which is a system call in other
    UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/" (and
    so relies on /proc being mounted).

    The filename fed to the executed program as argv[0] (or the name of the
    script fed to a script interpreter) will be of the form "/dev/fd/"
    (for an empty filename) or "/dev/fd//", effectively
    reflecting how the executable was found. This does however mean that
    execution of a script in a /proc-less environment won't work; also, script
    execution via an O_CLOEXEC file descriptor fails (as the file will not be
    accessible after exec).

    Based on patches by Meredydd Luff.

    Signed-off-by: David Drysdale
    Cc: Meredydd Luff
    Cc: Shuah Khan
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Alexander Viro
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Kees Cook
    Cc: Arnd Bergmann
    Cc: Rich Felker
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
     

19 Nov, 2014

1 commit


09 Oct, 2014

1 commit

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the driver to start processing the new TX queue entires.

    skb->xmit_more means that the generic networking is guaranteed to
    call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so export more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF now can load programs via a system call and has an extensive
    testsuite. Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, From
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
     

27 Sep, 2014

1 commit


18 Aug, 2014

1 commit

  • Many embedded systems will not need these syscalls, and omitting them
    saves space. Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
    (default y) to support compiling them out.

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
    function old new delta
    sys_fadvise64 57 - -57
    sys_fadvise64_64 691 - -691
    sys_madvise 1502 - -1502

    Signed-off-by: Josh Triplett

    Josh Triplett
     

09 Aug, 2014

2 commits

  • This is the new syscall kexec_file_load() declaration/interface. I have
    reserved the syscall number only for x86_64 so far. Other architectures
    (including i386) can reserve syscall number when they enable the support
    for this new syscall.

    Signed-off-by: Vivek Goyal
    Cc: Michael Kerrisk
    Cc: Borislav Petkov
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
    that you can pass to mmap(). It can support sealing and avoids any
    connection to user-visible mount-points. Thus, it's not subject to quotas
    on mounted file-systems, but can be used like malloc()'ed memory, but with
    a file-descriptor to it.

    memfd_create() returns the raw shmem file, so calls like ftruncate() can
    be used to modify the underlying inode. Also calls like fstat() will
    return proper information and mark the file as regular file. If you want
    sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
    supported (like on all other regular files).

    Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
    subject to a filesystem size limit. It is still properly accounted to
    memcg limits, though, and to the same overcommit or no-overcommit
    accounting as all user memory.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     

19 Jul, 2014

1 commit

  • This adds the new "seccomp" syscall with both an "operation" and "flags"
    parameter for future expansion. The third argument is a pointer value,
    used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
    be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

    In addition to the TSYNC flag later in this patch series, there is a
    non-zero chance that this syscall could be used for configuring a fixed
    argument area for seccomp-tracer-aware processes to pass syscall arguments
    in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
    for this syscall. Additionally, this syscall uses operation, flags,
    and user pointer for arguments because strictly passing arguments via
    a user pointer would mean seccomp itself would be unable to trivially
    filter the seccomp syscall itself.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

05 Jun, 2014

1 commit

  • sys_sgetmask and sys_ssetmask are obsolete system calls no longer
    supported in libc.

    This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
    mode configuration.That option is enabled by default for those
    architectures.

    Signed-off-by: Fabian Frederick
    Cc: Steven Miao
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Koichi Yasutake
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Greg Ungerer
    Cc: Heiko Carstens
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

04 Apr, 2014

2 commits

  • uselib hasn't been used since libc5; glibc does not use it. Support
    turning it off.

    When disabled, also omit the load_elf_library implementation from
    binfmt_elf.c, which only uselib invokes.

    bloat-o-meter:
    add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
    function old new delta
    padzero 39 36 -3
    uselib_flags 20 - -20
    sys_uselib 168 - -168
    SyS_uselib 168 - -168
    load_elf_library 426 - -426

    The new CONFIG_USELIB defaults to `y'.

    Signed-off-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • sys_sysfs is an obsolete system call no longer supported by libc.

    - This patch adds a default CONFIG_SYSFS_SYSCALL=y

    - Option can be turned off in expert mode.

    - cond_syscall added to kernel/sys_ni.c

    [akpm@linux-foundation.org: tweak Kconfig help text]
    Signed-off-by: Fabian Frederick
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

10 May, 2013

1 commit


04 Mar, 2013

2 commits


14 Dec, 2012

1 commit

  • As part of the effort to create a stronger boundary between root and
    kernel, Chrome OS wants to be able to enforce that kernel modules are
    being loaded only from our read-only crypto-hash verified (dm_verity)
    root filesystem. Since the init_module syscall hands the kernel a module
    as a memory blob, no reasoning about the origin of the blob can be made.

    Earlier proposals for appending signatures to kernel modules would not be
    useful in Chrome OS, since it would involve adding an additional set of
    keys to our kernel and builds for no good reason: we already trust the
    contents of our root filesystem. We don't need to verify those kernel
    modules a second time. Having to do signature checking on module loading
    would slow us down and be redundant. All we need to know is where a
    module is coming from so we can say yes/no to loading it.

    If a file descriptor is used as the source of a kernel module, many more
    things can be reasoned about. In Chrome OS's case, we could enforce that
    the module lives on the filesystem we expect it to live on. In the case
    of IMA (or other LSMs), it would be possible, for example, to examine
    extended attributes that may contain signatures over the contents of
    the module.

    This introduces a new syscall (on x86), similar to init_module, that has
    only two arguments. The first argument is used as a file descriptor to
    the module and the second argument is a pointer to the NULL terminated
    string of module arguments.

    Signed-off-by: Kees Cook
    Cc: Andrew Morton
    Signed-off-by: Rusty Russell (merge fixes)

    Kees Cook
     

01 Jun, 2012

1 commit

  • While doing the checkpoint-restore in the user space one need to determine
    whether various kernel objects (like mm_struct-s of file_struct-s) are
    shared between tasks and restore this state.

    The 2nd step can be solved by using appropriate CLONE_ flags and the
    unshare syscall, while there's currently no ways for solving the 1st one.

    One of the ways for checking whether two tasks share e.g. mm_struct is to
    provide some mm_struct ID of a task to its proc file, but showing such
    info considered to be not that good for security reasons.

    Thus after some debates we end up in conclusion that using that named
    'comparison' syscall might be the best candidate. So here is it --
    __NR_kcmp.

    It takes up to 5 arguments - the pids of the two tasks (which
    characteristics should be compared), the comparison type and (in case of
    comparison of files) two file descriptors.

    Lookups for pids are done in the caller's PID namespace only.

    At moment only x86 is supported and tested.

    [akpm@linux-foundation.org: fix up selftests, warnings]
    [akpm@linux-foundation.org: include errno.h]
    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Cyrill Gorcunov
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Andrey Vagin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Glauber Costa
    Cc: Andi Kleen
    Cc: Tejun Heo
    Cc: Matt Helsley
    Cc: Pekka Enberg
    Cc: Eric Dumazet
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Valdis.Kletnieks@vt.edu
    Cc: Michal Marek
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues with
    using it:
    - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
    written to would need to be contiguous.
    - Currently mem_read allows only processes who are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but its not clear exactly what race this restriction is stopping
    (reason appears to have been lost)
    - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other
    - Doesn't allow for some future use of the interface we would like to
    consider adding in the future (see below)
    - Interestingly reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily)

    As mentioned previously use of vmsplice instead was considered, but has
    problems. Since you need the reader and writer working co-operatively if
    the pipe is not drained then you block. Which requires some wrapping to
    do non blocking on the send side or polling on the receive. In all to all
    communication it requires ordering otherwise you can deadlock. And in the
    example of many MPI tasks writing to one MPI task vmsplice serialises the
    copying.

    There are some cases of MPI collectives where even a single copy interface
    does not get us the performance gain we could. For example in an
    MPI_Reduce rather than copy the data from the source we would like to
    instead use it directly in a mathops (say the reduce is doing a sum) as
    this would save us doing a copy. We don't need to keep a copy of the data
    from the source. I haven't implemented this, but I think this interface
    could in the future do all this through the use of the flags - eg could
    specify the math operation and type and the kernel rather than just
    copying the data would apply the specified operation between the source
    and destination and store it in the destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for intra
    process messaging which is not MPI). This interface is something which
    hardware vendors are already doing for their custom drivers to implement
    fast local communication. And so in addition to this being useful for
    OpenMPI it would mean the driver maintainers don't have to fix things up
    when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc, other architecture should
    mainly (I think) just need to add syscall numbers for the process_vm_readv
    and process_vm_writev. There are 32 bit compatibility versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

27 Aug, 2011

1 commit


21 May, 2011

1 commit

  • When building with:

    CONFIG_64BIT=y
    CONFIG_MIPS32_COMPAT=y
    CONFIG_COMPAT=y
    CONFIG_MIPS32_O32=y
    CONFIG_MIPS32_N32=y
    CONFIG_SYSVIPC is not set
    (and implicitly: CONFIG_SYSVIPC_COMPAT is not set)

    the final link fails with unresolved symbols for:

    compat_sys_semctl, compat_sys_msgsnd, compat_sys_msgrcv,
    compat_sys_shmctl, compat_sys_msgctl, compat_sys_semtimedop

    The fix is to add cond_syscall declarations for all syscalls in
    ipc/compat.c

    Signed-off-by: Kevin Cernekee
    Acked-by: Ralf Baechle
    Acked-by: Arnd Bergmann
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Kevin Cernekee
     

06 May, 2011

1 commit

  • This patch adds a multiple message send syscall and is the send
    version of the existing recvmmsg syscall. This is heavily
    based on the patch by Arnaldo that added recvmmsg.

    I wrote a microbenchmark to test the performance gains of using
    this new syscall:

    http://ozlabs.org/~anton/junkcode/sendmmsg_test.c

    The test was run on a ppc64 box with a 10 Gbit network card. The
    benchmark can send both UDP and RAW ethernet packets.

    64B UDP

    batch pkts/sec
    1 804570
    2 872800 (+ 8 %)
    4 916556 (+14 %)
    8 939712 (+17 %)
    16 952688 (+18 %)
    32 956448 (+19 %)
    64 964800 (+20 %)

    64B raw socket

    batch pkts/sec
    1 1201449
    2 1350028 (+12 %)
    4 1461416 (+22 %)
    8 1513080 (+26 %)
    16 1541216 (+28 %)
    32 1553440 (+29 %)
    64 1557888 (+30 %)

    We see a 20% improvement in throughput on UDP send and 30%
    on raw socket send.

    [ Add sparc syscall entries. -DaveM ]

    Signed-off-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Anton Blanchard
     

15 Mar, 2011

2 commits


23 Sep, 2010

1 commit


28 Jul, 2010

2 commits


13 Mar, 2010

1 commit

  • Add a generic implementation of the ipc demultiplexer syscall. Except for
    s390 and sparc64 all implementations of the sys_ipc are nearly identical.

    There are slight differences in the types of the parameters, where mips
    and powerpc as the only 64-bit architectures with sys_ipc use unsigned
    long for the "third" argument as it gets casted to a pointer later, while
    it traditionally is an "int" like most other paramters. frv goes even
    further and uses unsigned long for all parameters execept for "ptr" which
    is a pointer type everywhere. The change from int to unsigned long for
    "third" and back to "int" for the others on frv should be fine due to the
    in-register calling conventions for syscalls (we already had a similar
    issue with the generic sys_ptrace), but I'd prefer to have the arch
    maintainers looks over this in details.

    Except for that h8300, m68k and m68knommu lack an impplementation of the
    semtimedop sub call which this patch adds, and various architectures have
    gets used - at least on i386 it seems superflous as the compat code on
    x86-64 and ia64 doesn't even bother to implement it.

    [akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: David Howells
    Acked-by: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

08 Dec, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
    mac80211: fix reorder buffer release
    iwmc3200wifi: Enable wimax core through module parameter
    iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
    iwmc3200wifi: Coex table command does not expect a response
    iwmc3200wifi: Update wiwi priority table
    iwlwifi: driver version track kernel version
    iwlwifi: indicate uCode type when fail dump error/event log
    iwl3945: remove duplicated event logging code
    b43: fix two warnings
    ipw2100: fix rebooting hang with driver loaded
    cfg80211: indent regulatory messages with spaces
    iwmc3200wifi: fix NULL pointer dereference in pmkid update
    mac80211: Fix TX status reporting for injected data frames
    ath9k: enable 2GHz band only if the device supports it
    airo: Fix integer overflow warning
    rt2x00: Fix padding bug on L2PAD devices.
    WE: Fix set events not propagated
    b43legacy: avoid PPC fault during resume
    b43: avoid PPC fault during resume
    tcp: fix a timewait refcnt race
    ...

    Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
    CTL_UNNUMBERED removed) in
    kernel/sysctl_check.c
    net/ipv4/sysctl_net_ipv4.c
    net/ipv6/addrconf.c
    net/sctp/sysctl.c

    Linus Torvalds
     

06 Nov, 2009

1 commit


13 Oct, 2009

1 commit

  • Meaning receive multiple messages, reducing the number of syscalls and
    net stack entry/exit operations.

    Next patches will introduce mechanisms where protocols that want to
    optimize this operation will provide an unlocked_recvmsg operation.

    This takes into account comments made by:

    . Paul Moore: sock_recvmsg is called only for the first datagram,
    sock_recvmsg_nosec is used for the rest.

    . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
    works in the same fashion as the ppoll one.

    If the underlying protocol returns a datagram with MSG_OOB set, this
    will make recvmmsg return right away with as many datagrams (+ the OOB
    one) it has received so far.

    . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
    datagrams and then recvmsg returns an error, recvmmsg will return
    the successfully received datagrams, store the error and return it
    in the next call.

    This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
    where we will be able to acquire the lock only at batch start and end, not at
    every underlying recvmsg call.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

25 Sep, 2009

1 commit


22 Sep, 2009

1 commit

  • sparc64 allnoconfig:

    arch/sparc/kernel/built-in.o(.text+0x134e0): In function `sys32_recvfrom':
    : undefined reference to `compat_sys_recvfrom'
    arch/sparc/kernel/built-in.o(.text+0x134e4): In function `sys32_recvfrom':
    : undefined reference to `compat_sys_recvfrom'

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jan, 2009

1 commit


14 Jan, 2009

1 commit


08 Dec, 2008

1 commit

  • Implement the core kernel bits of Performance Counters subsystem.

    The Linux Performance Counter subsystem provides an abstraction of
    performance counter hardware capabilities. It provides per task and per
    CPU counters, and it provides event capabilities on top of those.

    Performance counters are accessed via special file descriptors.
    There's one file descriptor per virtual counter used.

    The special file descriptor is opened via the perf_counter_open()
    system call:

    int
    perf_counter_open(u32 hw_event_type,
    u32 hw_event_period,
    u32 record_type,
    pid_t pid,
    int cpu);

    The syscall returns the new fd. The fd can be used via the normal
    VFS system calls: read() can be used to read the counter, fcntl()
    can be used to set the blocking mode, etc.

    Multiple counters can be kept open at a time, and the counters
    can be poll()ed.

    See more details in Documentation/perf-counters.txt.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

20 Nov, 2008

1 commit

  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling",
    Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional calls
    fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result.

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    printf("Calling accept4(): flags = %x", flags);
    if (flags != 0) {
    printf(" (");
    if (flags & SOCK_CLOEXEC)
    printf("SOCK_CLOEXEC");
    if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
    printf(" ");
    if (flags & SOCK_NONBLOCK)
    printf("SOCK_NONBLOCK");
    printf(")");
    }
    printf("\n");

    #if USE_SOCKETCALL
    long args[6];

    args[0] = fd;
    args[1] = (long) sockaddr;
    args[2] = (long) addrlen;
    args[3] = flags;

    return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
    return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
    int closeonexec_flag, int nonblock_flag)
    {
    int connfd, acceptfd;
    int fdf, flf, fdf_pass, flf_pass;
    struct sockaddr_in claddr;
    socklen_t addrlen;

    printf("=======================================\n");

    connfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connfd == -1)
    die("socket");
    if (connect(connfd, (struct sockaddr *) conn_addr,
    sizeof(struct sockaddr_in)) == -1)
    die("connect");

    addrlen = sizeof(struct sockaddr_in);
    acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
    closeonexec_flag | nonblock_flag);
    if (acceptfd == -1) {
    perror("accept4()");
    close(connfd);
    return 0;
    }

    fdf = fcntl(acceptfd, F_GETFD);
    if (fdf == -1)
    die("fcntl:F_GETFD");
    fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
    printf("Close-on-exec flag is %sset (%s); ",
    (fdf & FD_CLOEXEC) ? "" : "not ",
    fdf_pass ? "OK" : "failed");

    flf = fcntl(acceptfd, F_GETFL);
    if (flf == -1)
    die("fcntl:F_GETFD");
    flf_pass = ((flf & O_NONBLOCK) != 0) ==
    ((nonblock_flag & SOCK_NONBLOCK) !=0);
    printf("nonblock flag is %sset (%s)\n",
    (flf & O_NONBLOCK) ? "" : "not ",
    flf_pass ? "OK" : "failed");

    close(acceptfd);
    close(connfd);

    printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
    return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
    struct sockaddr_in svaddr;
    int lfd;
    int optval;

    memset(&svaddr, 0, sizeof(struct sockaddr_in));
    svaddr.sin_family = AF_INET;
    svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    svaddr.sin_port = htons(port_num);

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
    die("socket");

    optval = 1;
    if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
    sizeof(optval)) == -1)
    die("setsockopt");

    if (bind(lfd, (struct sockaddr *) &svaddr,
    sizeof(struct sockaddr_in)) == -1)
    die("bind");

    if (listen(lfd, 5) == -1)
    die("listen");

    return lfd;
    }

    int
    main(int argc, char *argv[])
    {
    struct sockaddr_in conn_addr;
    int lfd;
    int port_num;
    int passed;

    passed = 1;

    port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

    memset(&conn_addr, 0, sizeof(struct sockaddr_in));
    conn_addr.sin_family = AF_INET;
    conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    conn_addr.sin_port = htons(port_num);

    lfd = create_listening_socket(port_num);

    if (!do_test(lfd, &conn_addr, 0, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
    passed = 0;

    close(lfd);

    exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

17 Oct, 2008

1 commit

  • This patchs adds the CONFIG_AIO option which allows to remove support
    for asynchronous I/O operations, that are not necessarly used by
    applications, particularly on embedded devices. As this is a
    size-reduction option, it depends on CONFIG_EMBEDDED. It allows to
    save ~7 kilobytes of kernel code/data:

    text data bss dec hex filename
    1115067 119180 217088 1451335 162547 vmlinux
    1108025 119048 217088 1444161 160941 vmlinux.new
    -7042 -132 0 -7174 -1C06 +/-

    This patch has been originally written by Matt Mackall
    , and is part of the Linux Tiny project.

    [randy.dunlap@oracle.com: build fix]
    Signed-off-by: Thomas Petazzoni
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Signed-off-by: Matt Mackall
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Petazzoni
     

30 Sep, 2008

1 commit

  • This patch adds the CONFIG_FILE_LOCKING option which allows to remove
    support for advisory locks. With this patch enabled, the flock()
    system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
    and NFS support are disabled. These features are not necessarly needed
    on embedded systems. It allows to save ~11 Kb of kernel code and data:

    text data bss dec hex filename
    1125436 118764 212992 1457192 163c28 vmlinux.old
    1114299 118564 212992 1445855 160fdf vmlinux
    -11137 -200 0 -11337 -2C49 +/-

    This patch has originally been written by Matt Mackall
    , and is part of the Linux Tiny project.

    Signed-off-by: Thomas Petazzoni
    Signed-off-by: Matt Mackall
    Cc: matthew@wil.cx
    Cc: linux-fsdevel@vger.kernel.org
    Cc: mpm@selenic.com
    Cc: akpm@linux-foundation.org
    Signed-off-by: J. Bruce Fields

    Thomas Petazzoni
     

26 Jul, 2008

1 commit