28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

07 Aug, 2016

1 commit

  • Pull binfmt_misc update from James Bottomley:
    "This update is to allow architecture emulation containers to function
    such that the emulation binary can be housed outside the container
    itself. The container and fs parts both have acks from relevant
    experts.

    To use the new feature you have to add an F option to your binfmt_misc
    configuration"

    From the docs:
    "The usual behaviour of binfmt_misc is to spawn the binary lazily when
    the misc format file is invoked. However, this doesn't work very well
    in the face of mount namespaces and changeroots, so the F mode opens
    the binary as soon as the emulation is installed and uses the opened
    image to spawn the emulator, meaning it is always available once
    installed, regardless of how the environment changes"

    * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
    binfmt_misc: add F option description to documentation
    binfmt_misc: add persistent opened binary handler for containers
    fs: add filp_clone_open API

    Linus Torvalds
     

30 May, 2016

1 commit


31 Mar, 2016

1 commit

  • This patch adds a new flag 'F' to the binfmt handlers. If you pass in
    'F' the binary that runs the emulation will be opened immediately and
    in future, will be cloned from the open file.

    The net effect is that the handler survives both changeroots and mount
    namespace changes, making it easy to work with foreign architecture
    containers without contaminating the container image with the
    emulator.

    Signed-off-by: James Bottomley
    Acked-by: Serge Hallyn

    James Bottomley
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

17 Apr, 2015

1 commit

  • sprintf() reliably returns the number of characters printed, so we don't
    need to ask strlen() where we are. Also replace calling sprintf("%02x")
    in a loop with the much simpler bin2hex().

    [akpm@linux-foundation.org: it's odd to include kernel.h after everything else]
    Signed-off-by: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
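
A minimal userspace sketch of the two points made in the commit above. It assumes nothing about the kernel sources: `hex_encode()` is a hypothetical stand-in for bin2hex(), and `format_entry()` is an illustrative name. The point is that the return value of snprintf() already gives the write position, so no strlen() call is needed, and that a table lookup replaces a per-byte "%02x" loop.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's bin2hex():
 * writes 2 * len lowercase hex digits to dst and returns a pointer
 * just past the last digit. It does not NUL-terminate. */
static char *hex_encode(char *dst, const unsigned char *src, size_t len)
{
	static const char digits[] = "0123456789abcdef";

	while (len--) {
		*dst++ = digits[*src >> 4];
		*dst++ = digits[*src & 0x0f];
		src++;
	}
	return dst;
}

/* Formats "<label> <hex>\n" into buf. Returns the total length written
 * (excluding the NUL), or -1 if buf is too small. */
static int format_entry(char *buf, size_t size, const char *label,
			const unsigned char *bytes, size_t len)
{
	/* snprintf() reports how many characters the label part needs. */
	int n = snprintf(buf, size, "%s ", label);
	char *p;

	if (n < 0 || (size_t)n + 2 * len + 2 > size)
		return -1;
	p = hex_encode(buf + n, bytes, len);	/* continue at buf + n */
	*p++ = '\n';
	*p = '\0';
	return (int)(p - buf);
}
```

Calling `format_entry(buf, sizeof(buf), "magic", bytes, 4)` with the four ELF magic bytes fills `buf` with the label followed by eight hex digits and a newline.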
     

16 Apr, 2015

1 commit


17 Dec, 2014

1 commit

  • scanarg(s, del) never returns s; the empty field results in s + 1.
    Restore the correct checks, and move NUL-termination into scanarg(),
    while we are at it.

    Incidentally, mixing "coding style cleanups" (for small values of cleanup)
    with functional changes is a Bad Idea(tm)...

    Signed-off-by: Al Viro

    Al Viro
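
The contract described in the commit above can be sketched in userspace as follows. This is a simplified illustration, not the kernel's scanarg(): `scan_field()` and `field_is_empty()` are hypothetical names, and escape-sequence validation is left out. What it shows is that the scanner always returns a pointer past the delimiter, so an empty field is recognized as `start + 1`, never as `start`.

```c
#include <stddef.h>

/* Simplified, hypothetical field scanner: finds the next delimiter,
 * overwrites it with NUL, and returns a pointer just past it.
 * Returns NULL when the string ends before a delimiter is found. */
static char *scan_field(char *s, char del)
{
	while (*s != del) {
		if (*s == '\0')
			return NULL;
		s++;
	}
	*s = '\0';	/* NUL-terminate the field in place */
	return s + 1;	/* always past the delimiter, never s itself */
}

/* An empty field is detected by comparing against start + 1. */
static int field_is_empty(const char *start, const char *next)
{
	return next == start + 1;
}
```

A caller that tests `next == start` to detect an empty field would never see it trigger, which is the class of mistake the commit message describes.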
     

14 Dec, 2014

1 commit

  • This patchset adds execveat(2) for x86, and is derived from Meredydd
    Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

    The primary aim of adding an execveat syscall is to allow an
    implementation of fexecve(3) that does not rely on the /proc filesystem,
    at least for executables (rather than scripts). The current glibc version
    of fexecve(3) is implemented via /proc, which causes problems in sandboxed
    or otherwise restricted environments.

    Given the desire for a /proc-free fexecve() implementation, HPA suggested
    (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
    an appropriate generalization.

    Also, having a new syscall means that it can take a flags argument without
    back-compatibility concerns. The current implementation just defines the
    AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
    added in future -- for example, flags for new namespaces (as suggested at
    https://lkml.org/lkml/2006/7/11/474).

    Related history:
    - https://lkml.org/lkml/2006/12/27/123 is an example of someone
    realizing that fexecve() is likely to fail in a chroot environment.
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
    documenting the /proc requirement of fexecve(3) in its manpage, to
    "prevent other people from wasting their time".
    - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
    problem where a process that did setuid() could not fexecve()
    because it no longer had access to /proc/self/fd; this has since
    been fixed.

    This patch (of 4):

    Add a new execveat(2) system call. execveat() is to execve() as openat()
    is to open(): it takes a file descriptor that refers to a directory, and
    resolves the filename relative to that.

    In addition, if the filename is empty and AT_EMPTY_PATH is specified,
    execveat() executes the file to which the file descriptor refers. This
    replicates the functionality of fexecve(), which is a system call in other
    UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
    so relies on /proc being mounted).

    The filename fed to the executed program as argv[0] (or the name of the
    script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
    (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
    reflecting how the executable was found. This does however mean that
    execution of a script in a /proc-less environment won't work; also, script
    execution via an O_CLOEXEC file descriptor fails (as the file will not be
    accessible after exec).

    Based on patches by Meredydd Luff.

    Signed-off-by: David Drysdale
    Cc: Meredydd Luff
    Cc: Shuah Khan
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Alexander Viro
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Kees Cook
    Cc: Arnd Bergmann
    Cc: Rich Felker
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
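
A hedged sketch of how userspace might use the call described in the commit above to execute an already-open file without touching /proc. It assumes a Linux system whose headers define SYS_execveat and invokes the raw syscall through syscall(2). `exec_by_fd()` is a hypothetical helper name, not a libc function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: execute the file that fd refers to, using an
 * empty pathname plus AT_EMPTY_PATH. On success it does not return;
 * on failure it returns -1 with errno set. */
static int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
#ifdef SYS_execveat
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
#else
	/* Headers without execveat: report it as unsupported. */
	(void)fd;
	(void)argv;
	(void)envp;
	errno = ENOSYS;
	return -1;
#endif
}
```

As the commit message notes, this path works for binaries but not for scripts in a /proc-less environment, and a script opened with O_CLOEXEC cannot be run this way.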
     

11 Dec, 2014

4 commits

  • GFP_USER means "honour cpuset nodes-allowed beancounting". These are
    regular old kernel objects and there seems no reason to give them this
    treatment.

    Acked-by: Mike Frysinger
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Clean up various coding style issues that checkpatch complains about.
    No functional changes here.

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • When trying to develop a custom format handler, the errors returned all
    effectively get bucketed as EINVAL with no kernel messages. The other
    errors (ENOMEM/EFAULT) are internal/obvious and basic. Thus any time a
    bad handler is rejected, the developer has to walk the dense code and
    try to guess where it went wrong. Needing to dive into kernel code is
    itself a fairly high barrier for a lot of people.

    To improve this situation, let's deploy extensive pr_debug markers at
    logical parse points, and add comments to the dense parsing logic. This
    lets you see exactly where the parsing aborts, the string the kernel
    received (useful when dealing with shell code), how it translated the
    buffers to binary data, and how it will apply the mask at runtime.

    Some example output:
    $ echo ':qemu-foo:M::\x7fELF\xAD\xAD\x01\x00:\xff\xff\xff\xff\xff\x00\xff\x00:/usr/bin/qemu-foo:POC' > register
    $ dmesg
    binfmt_misc: register: received 92 bytes
    binfmt_misc: register: delim: 0x3a {:}
    binfmt_misc: register: name: {qemu-foo}
    binfmt_misc: register: type: M (magic)
    binfmt_misc: register: offset: 0x0
    binfmt_misc: register: magic[raw]: 5c 78 37 66 45 4c 46 5c 78 41 44 5c 78 41 44 5c \x7fELF\xAD\xAD\
    binfmt_misc: register: magic[raw]: 78 30 31 5c 78 30 30 00 x01\x00.
    binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 66 66 5c 78 66 66 5c 78 66 66 \xff\xff\xff\xff
    binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 30 30 5c 78 66 66 5c 78 30 30 \xff\x00\xff\x00
    binfmt_misc: register: mask[raw]: 00 .
    binfmt_misc: register: magic/mask length: 8
    binfmt_misc: register: magic[decoded]: 7f 45 4c 46 ad ad 01 00 .ELF....
    binfmt_misc: register: mask[decoded]: ff ff ff ff ff 00 ff 00 ........
    binfmt_misc: register: magic[masked]: 7f 45 4c 46 ad 00 01 00 .ELF....
    binfmt_misc: register: interpreter: {/usr/bin/qemu-foo}
    binfmt_misc: register: flag: P (preserve argv0)
    binfmt_misc: register: flag: O (open binary)
    binfmt_misc: register: flag: C (preserve creds)

    The [raw] lines show us exactly what was received from userspace. The
    lines after that show us how the kernel has decoded things.

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
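
The registration string in the example output above follows a delimiter-separated layout of name, type, offset, magic, mask, interpreter and flags. A small sketch of composing such a line in C, assuming ':' as the delimiter as in the example. `build_register_line()` is a hypothetical helper, and the caller is assumed to pass magic and mask already written as literal \x## escapes.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds
 *   :name:type:offset:magic:mask:interpreter:flags
 * into buf. Returns the length written, or -1 if buf is too small. */
static int build_register_line(char *buf, size_t size,
			       const char *name, char type,
			       const char *offset,
			       const char *magic, const char *mask,
			       const char *interp, const char *flags)
{
	int n = snprintf(buf, size, ":%s:%c:%s:%s:%s:%s:%s",
			 name, type, offset, magic, mask, interp, flags);

	/* snprintf() returns the length it wanted; >= size means truncation. */
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

The resulting string is what would be written to the `register` file. With the pr_debug markers from the commit enabled, each field then shows up on its own line in dmesg.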
     
    This patch replaces calls to get_unused_fd() with an equivalent call to
    get_unused_fd_flags(0) to preserve current behavior for existing code.

    In a further patch, get_unused_fd() will be removed so that new code starts
    using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
    by default or chosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

14 Oct, 2014

2 commits

  • gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
    parse_command function:

    fs/binfmt_misc.c: In function 'parse_command.part.3':
    fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]

    I've managed to trace this back to the ARM implementation of memset,
    which is called from copy_from_user in case of a fault and which does

    #define memset(p,v,n) \
    ({ \
            void *__p = (p); size_t __n = n; \
            if ((__n) != 0) { \
                    if (__builtin_constant_p((v)) && (v) == 0) \
                            __memzero((__p),(__n)); \
                    else \
                            memset((__p),(v),(__n)); \
            } \
            (__p); \
    })

    Apparently gcc gets confused by the check for "size != 0" and believes
    that the size might be zero when it gets to the line that does "if
    (s[count-1] == '\n')", so it would access data outside of the array.

    gcc is clearly wrong here, since this condition was already checked
    earlier in the function and the 'size' value can not change in the
    meantime.

    Fortunately, we can work around it and get rid of the warning by
    rearranging the function to check for zero size after doing the
    copy_from_user. It is still safe to pass a zero size into
    copy_from_user, so it does not cause any side effects.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
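
The rearrangement described in the commit above can be illustrated with a userspace analogue. This is a sketch under assumptions, not the kernel function: `read_command()` is a hypothetical name, and memcpy() stands in for copy_from_user(). The zero-size check sits after the copy and immediately before the `dst[count - 1]` access.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of the reordered parse step: copy first, then
 * handle count == 0, then strip one trailing newline.
 * Returns the command length, 0 for empty input, or -EINVAL if the
 * input does not fit in dst (including the terminating NUL). */
static int read_command(char *dst, size_t dst_size,
			const char *src, size_t count)
{
	if (count >= dst_size)
		return -EINVAL;
	memcpy(dst, src, count);	/* a zero-length copy is harmless */
	if (count == 0) {
		dst[0] = '\0';
		return 0;		/* checked right before dst[count - 1] */
	}
	if (dst[count - 1] == '\n')
		count--;
	dst[count] = '\0';
	return (int)count;
}
```

The behaviour is the same as checking for zero before the copy. Only the order of the two steps changes.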
     
  • The current code places a 256 byte limit on the registration format.
    This ends up being fairly limited when you try to do matching against a
    binary format like ELF:

    - the magic & mask formats cannot have any embedded NUL chars
    (string_unescape_inplace halts at the first NUL)
    - each escape sequence quadruples the size: \x00 is needed for NUL
    - trying to match bytes at the start of the file as well as further
    on leads to a lot of \x00 sequences in the mask
    - magic & mask have to be the same length (when decoded)
    - still need bytes for the other fields
    - impossible!

    Let's look at a concrete (and common) example: using QEMU to run MIPS
    ELFs. The name field uses 11 bytes "qemu-mipsel". The interp uses 20
    bytes "/usr/bin/qemu-mipsel". The type & flags take up 4 bytes. We
    need 7 bytes for the delimiter (usually ":"). We can skip offset. So
    already we're down to 107 bytes to use with the magic/mask instead of
    the real limit of 128 (BINPRM_BUF_SIZE). If people use shell code to
    register (which they do the majority of the time), they're down to ~26
    possible bytes since the escape sequence must be \x##.

    The ELF format looks like (both 32 & 64 bit):

    e_ident: 16 bytes
    e_type: 2 bytes
    e_machine: 2 bytes

    Those 20 bytes are enough for most architectures because they have so few
    formats in the first place, thus they can be uniquely identified. That
    also means for shell users, since 20 is smaller than 26, they can sanely
    register a handler.

    But for some targets (like MIPS), we need to poke further. The ELF fields
    continue on:

    e_entry: 4 or 8 bytes
    e_phoff: 4 or 8 bytes
    e_shoff: 4 or 8 bytes
    e_flags: 4 bytes

    We only care about e_flags here as that includes the bits to identify
    whether the ELF is O32/N32/N64. But now we have to consume another 16
    bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
    flags. If every byte is escaped, we send 288 more bytes to the kernel
    ((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
    {e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
    budget.

    Even if we try to be clever and do the decoding ourselves (rather than
    relying on the kernel to process \x##), we still can't hit the mark --
    string_unescape_inplace treats mask & magic as C strings so NUL cannot
    be embedded. That leaves us with having to pass \x00 for the 12/24
    entry/phoff/shoff bytes (as those will be completely random addresses),
    and that is a minimum requirement of 48/96 bytes for the mask alone.
    Add up the rest and we blow through it (this is for 64 bit ELFs):
      magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
             4 {e_flags} = 48 (see the note below on the 24)
      mask:  20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
             4 {e_flags} = 120
    Remember above we had 107 left over, and now we're at 168. This is of
    course the *best* case scenario -- you'll also want to have NUL bytes
    in the magic & mask too to match literal zeros.

    Note: the reason we can use 24 in the magic is that we can work off of the
    fact that for bytes the mask would clobber, we can stuff any value into
    magic that we want. So when mask is \x00, we don't need the magic to also
    be \x00, it can be an unescaped raw byte like '!'. This lets us handle
    more formats (barely) under the current 256 limit, but that's a pretty
    tall hoop to force people to jump through.

    With all that said, let's bump the limit from 256 bytes to 1920. This way
    we support escaping every byte of the mask & magic field (which is 1024
    bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
    fields. Like long paths to the interpreter (when you have source in your
    /really/long/homedir/qemu/foo). Since the current code stuffs more than
    one structure into the same buffer, we leave a bit of space to easily
    round up to 2k. 1920 is just as arbitrary as 256 ;).

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
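
The byte budget worked through in the commit above can be checked with a few lines of arithmetic. The helper names below are illustrative only, and every number used with them comes from the commit message itself.

```c
#include <stddef.h>

enum { ESCAPE_LEN = 4 };	/* "\x##" costs four characters per byte */

/* Characters needed to send nbytes when every byte is escaped. */
static size_t escaped_len(size_t nbytes)
{
	return nbytes * ESCAPE_LEN;
}

/* Characters left for magic + mask once the fixed fields are paid for. */
static size_t remaining_budget(size_t limit, size_t name, size_t interp,
			       size_t type_flags, size_t delims)
{
	size_t fixed = name + interp + type_flags + delims;

	return limit > fixed ? limit - fixed : 0;
}
```

With the qemu-mipsel figures, `remaining_budget(256, 11, 20, 4, 7)` leaves 214 characters, or 107 each for magic and mask. That is about 26 fully escaped bytes per field. Matching through e_flags on a 32 bit ELF needs `2 * escaped_len(20 + 12 + 4)` = 288 characters, which exceeds the old limit.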
     

04 Apr, 2014

1 commit


01 May, 2013

1 commit


04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. This allows simple, safe,
    well-understood workarounds for known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases, the most significant being autofs, which lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general, all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace, filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
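
A minimal sketch of the naming scheme described in the commit above: the module request string is the filesystem type with "fs-" prepended. `fs_module_alias()` is a hypothetical userspace helper. The `len` parameter is an assumption added for illustration so that only a leading part of the type string can be used.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "fs-" followed by the first len
 * characters of fs_name into buf. Returns the length written, or -1
 * if buf is too small. */
static int fs_module_alias(char *buf, size_t size,
			   const char *fs_name, int len)
{
	int n = snprintf(buf, size, "fs-%.*s", len, fs_name);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

Because every request carries the prefix, a user-supplied type string can only ever name a module that has declared a matching fs- alias.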
     

23 Feb, 2013

1 commit


21 Dec, 2012

1 commit

  • If a series of scripts are executed, each triggering module loading via
    unprintable bytes in the script header, kernel stack contents can leak
    into the command line.

    Normally execution of binfmt_script and binfmt_misc happens recursively.
    However, when modules are enabled, and unprintable bytes exist in the
    bprm->buf, execution will restart after attempting to load matching
    binfmt modules. Unfortunately, the logic in binfmt_script and
    binfmt_misc does not expect to get restarted. They leave bprm->interp
    pointing to their local stack. This means on restart bprm->interp is
    left pointing into unused stack memory which can then be copied into the
    userspace argv areas.

    After additional study, it seems that both recursion and restart remain
    the desirable way to handle exec with scripts, misc, and modules. As
    such, we need to protect the changes to interp.

    This changes the logic to require allocation for any changes to the
    bprm->interp. To avoid adding a new kmalloc to every exec, the default
    value is left as-is. Only when passing through binfmt_script or
    binfmt_misc does an allocation take place.

    For a proof of concept, see DoTest.sh from:

    http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
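
The ownership rule described in the commit above (leave the default alone, allocate on every change) can be sketched in userspace. The struct and function names are hypothetical, and malloc()/free() stand in for the kernel allocator. The key property is that `interp` never points at a handler's stack buffer after a change.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the relevant fields; names are illustrative only. */
struct toy_bprm {
	const char *filename;	/* owned by the caller */
	const char *interp;	/* equals filename until a handler changes it */
};

/* Replace interp with a heap copy so it outlives the caller's buffer.
 * Returns 0 on success, -1 if the allocation fails. */
static int toy_change_interp(struct toy_bprm *bprm, const char *interp)
{
	size_t len = strlen(interp) + 1;
	char *copy = malloc(len);

	if (!copy)
		return -1;
	memcpy(copy, interp, len);
	/* Free an earlier heap copy, but never the caller's filename. */
	if (bprm->interp != bprm->filename)
		free((void *)bprm->interp);
	bprm->interp = copy;
	return 0;
}

/* Drop any heap copy and fall back to the default. */
static void toy_release(struct toy_bprm *bprm)
{
	if (bprm->interp != bprm->filename)
		free((void *)bprm->interp);
	bprm->interp = bprm->filename;
}
```

An exec that never passes through a script or misc handler keeps `interp == filename` and pays for no allocation, which matches the cost argument in the commit message.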
     

18 Dec, 2012

1 commit

  • To avoid an explosion of request_module calls on a chain of abusive
    scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
    as maximum recursion depth is hit, the error will fail all the way back
    up the chain, aborting immediately.

    This also has the side-effect of stopping the user's shell from attempting
    to reexecute the top-level file as a shell script. As seen in the
    dash source:

    if (cmd != path_bshell && errno == ENOEXEC) {
            *argv-- = cmd;
            *argv = cmd = path_bshell;
            goto repeat;
    }

    The above logic was designed for running scripts automatically that lacked
    the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
    things continue to behave as the shell expects.

    Additionally, when tracking recursion, the binfmt handlers should not be
    involved. The recursion being tracked is the depth of calls through
    search_binary_handler(), so that function should be exclusively responsible
    for tracking the depth.

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
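
A toy model of the depth tracking described in the commit above. All names are hypothetical, and the limit of 4 is an assumed value chosen for illustration. Only the search function touches the depth counter, and hitting the limit returns -ELOOP, which propagates unchanged to the top of the chain.

```c
#include <errno.h>

enum { TOY_MAX_RECURSION = 4 };	/* assumed limit, for illustration only */

struct toy_exec {
	unsigned int depth;	/* modified only by toy_search_handler() */
	unsigned int hops_left;	/* interpreters still to be resolved */
};

/* Resolve one interpreter hop per call. Returns 0 once a real binary
 * is reached, or -ELOOP when the chain is too deep. */
static int toy_search_handler(struct toy_exec *e)
{
	int ret;

	if (e->depth > TOY_MAX_RECURSION)
		return -ELOOP;	/* hard failure, passed straight up */
	if (e->hops_left == 0)
		return 0;	/* reached a directly executable file */
	e->hops_left--;
	e->depth++;
	ret = toy_search_handler(e);
	e->depth--;
	return ret;
}
```

Since the error is not -ENOEXEC, a shell applying the dash logic quoted above would not retry the file as a shell script.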
     

29 Nov, 2012

2 commits


06 May, 2012

1 commit

    After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode(),
    which better describes what the function really does: set the I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

24 Mar, 2012

1 commit


21 Mar, 2012

1 commit


07 Jan, 2012

1 commit


02 Nov, 2011

1 commit


20 Jul, 2011

1 commit


29 Oct, 2010

1 commit


26 Oct, 2010

1 commit

  • Instead of always assigning an increasing inode number in new_inode
    move the call to assign it into those callers that actually need it.
    For now the set of callers that need it is estimated conservatively, that is
    the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
     

15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
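
The rule list at the top of the semantic patch above can be restated as a plain decision function. This is a simplified sketch with hypothetical type and field names: it applies the rules in the order they appear and stops at the first match. That is an assumption made for clarity, not a description of how coccinelle evaluates the patch.

```c
/* Which .llseek value a file_operations instance would receive. */
enum llseek_choice {
	KEEP_EXISTING,		/* .llseek is already set */
	USE_NO_LLSEEK,
	USE_SEQ_LSEEK,
	USE_DEFAULT_LLSEEK,
	USE_NOOP_LLSEEK
};

/* Hypothetical summary of what the patch detects about one instance. */
struct fops_traits {
	int has_llseek;
	int open_is_nonseekable;	/* open calls nonseekable_open */
	int uses_seq_read;		/* .read is seq_read */
	int has_readdir;
	int read_or_write_touches_pos;	/* read or write accesses f_pos */
};

static enum llseek_choice pick_llseek(const struct fops_traits *t)
{
	if (t->has_llseek)
		return KEEP_EXISTING;
	if (t->open_is_nonseekable)
		return USE_NO_LLSEEK;
	if (t->uses_seq_read)
		return USE_SEQ_LSEEK;
	if (t->has_readdir || t->read_or_write_touches_pos)
		return USE_DEFAULT_LLSEEK;
	/* Offset provably ignored: seeking keeps succeeding as before. */
	return USE_NOOP_LLSEEK;
}
```

An instance with no read, write, open or readdir methods falls through to noop_llseek, matching the last rule of the patch.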
     

10 Sep, 2010

1 commit

    Commit 74641f584da ("alpha: binfmt_aout fix") (May 2009) introduced a
    regression - binfmt_misc is now consulted after binfmt_elf, which will
    unfortunately break ia32el. ia32 ELF binaries on ia64 used to be matched
    using binfmt_misc and executed using a wrapper. As 32bit binaries are now
    matched by binfmt_elf before binfmt_misc kicks in, the wrapper is ignored.

    The fix increases precedence of binfmt_misc to the original state.

    Signed-off-by: Jan Sembera
    Cc: Ivan Kokshaysky
    Cc: Al Viro
    Cc: Richard Henderson [2.6.everything.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Sembera
     

18 Aug, 2010

1 commit

  • Make do_execve() take a const filename pointer so that kernel_execve() compiles
    correctly on ARM:

    arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type

    This also requires the argv and envp arguments to be consted twice, once for
    the pointer array and once for the strings the array points to. This is
    because do_execve() passes a pointer to the filename (now const) to
    copy_strings_kernel(). A simpler alternative would be to cast the filename
    pointer in do_execve() when it's passed to copy_strings_kernel().

    do_execve() may not change any of the strings it is passed as part of the argv
    or envp lists, as some of them are in .rodata, so marking these strings as
    const should be fine.

    Further kernel_execve() and sys_execve() need to be changed to match.

    This has been test built on x86_64, frv, arm and mips.

    Signed-off-by: David Howells
    Tested-by: Ralf Baechle
    Acked-by: Russell King
    Signed-off-by: Linus Torvalds

    David Howells
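
What "consted twice" means in the commit above, shown on a small userspace function. `count_args()` is a hypothetical example, not kernel code: the first `const` protects the strings, and the second protects the pointer array itself.

```c
#include <stddef.h>

/* Through this parameter, neither the array slots nor the characters
 * they point to can be modified. */
static size_t count_args(const char *const argv[])
{
	size_t n = 0;

	while (argv[n] != NULL)
		n++;
	/* argv[0] = NULL;   would not compile: the array is const */
	/* argv[0][0] = 'x'; would not compile: the strings are const */
	return n;
}
```

One consequence in C is that a plain `char **` does not convert implicitly to this parameter type, which is why callers and related prototypes have to change together.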
     

10 Aug, 2010

1 commit


07 Jan, 2009

1 commit


06 Jan, 2009

1 commit


17 Oct, 2008

1 commit

    binfmt_script and binfmt_misc disallow recursion to avoid stack overflow
    using sh_bang and misc_bang. This causes problems in some cases:

    $ echo '#!/bin/ls' > /tmp/t0
    $ echo '#!/tmp/t0' > /tmp/t1
    $ echo '#!/tmp/t1' > /tmp/t2
    $ chmod +x /tmp/t*
    $ /tmp/t2
    zsh: exec format error: /tmp/t2

    Similar problem with binfmt_misc.

    This patch introduces the field 'recursion_depth' into struct linux_binprm
    to track the recursion level in binfmt_misc and binfmt_script. If the
    recursion level exceeds BINPRM_MAX_RECURSION, it returns -ENOEXEC.

    [akpm@linux-foundation.org: make linux_binprm.recursion_depth a uint]
    Signed-off-by: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Aug, 2008

1 commit

  • In case the binfmt_misc binary handler is registered *before* the e.g.
    script one (when for example being compiled as a module) the following
    situation may occur:

    1. user launches a script, whose interpreter is a misc binary;
    2. the load_misc_binary sets the misc_bang and returns -ENOEXEC,
    since the binary is a script;
    3. the load_script_binary loads one and calls for search_binary_handler
    to run the interpreter;
    4. the load_misc_binary is called again, but refuses to load the
    binary due to misc_bang bit set.

    The fix is to move the misc_bang setting lower - prior to the actual
    call to the search_binary_handler.

    Caused by the commit 3a2e7f47 (binfmt_misc.c: avoid potential kernel
    stack overflow)

    Signed-off-by: Pavel Emelyanov
    Reported-by: Kirill A. Shutemov
    Tested-by: Kirill A. Shutemov
    Cc: [2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov