09 Apr, 2014

1 commit

  • There wasn't any check of the size passed from userspace before trying
    to allocate the memory required.

    This meant that userspace might request more space than allowed,
    triggering an OOM.

    Signed-off-by: Sasha Levin
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
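The missing check described in this commit can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the actual autofs patch: `struct req_hdr`, `MAX_REQ_SIZE`, and `alloc_request()` are hypothetical names chosen for illustration, and the 4096-byte limit is an assumed value.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed header that every request starts with. */
struct req_hdr {
    uint32_t size;   /* total size claimed by the caller, header included */
    uint32_t cmd;
};

/* Assumed upper bound; the real limit is whatever the interface allows. */
#define MAX_REQ_SIZE 4096u

/*
 * Validate the caller-supplied size before allocating.
 * Returns a zeroed buffer of `size` bytes, or NULL if the size is
 * smaller than the header or larger than the allowed maximum.
 */
void *alloc_request(uint32_t size)
{
    if (size < sizeof(struct req_hdr) || size > MAX_REQ_SIZE)
        return NULL;            /* reject before touching the allocator */
    return calloc(1, size);
}
```

The point of the ordering is that an oversized request is rejected before any memory is requested, so a hostile caller cannot drive the allocator toward an out-of-memory condition.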
     

24 Jan, 2014

5 commits

  • The autofs4 module doesn't consider symlinks for expire as it did in the
    older autofs v3 module (so it's actually a long standing regression).

    The user space daemon has focused on the use of bind mounts instead of
    symlinks for a long time now and that's why this has not been noticed.
    But with the future addition of amd map parsing to automount(8), not to
    mention amd itself (of am-utils), symlink expiry will be needed.

    The direct and offset mount types can't be symlinks and the tree mounts of
    version 4 were always real mounts so only indirect mounts need expire
    symlinks.

    Since the current users of the autofs4 module haven't reported this as a
    problem to date this patch probably isn't a candidate for backport to
    stable.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Use the helper macro !IS_ROOT to replace parent != dentry->d_parent. This
    is just a cleanup.

    Signed-off-by: Rui Xiang
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rui Xiang
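As a rough illustration of this cleanup, the sketch below uses a stripped-down `struct dentry` that carries only the parent link. The struct and the two helper functions are simplified stand-ins written for this example; only the idea that a root dentry is its own parent is taken from the kernel.

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct dentry: only the parent link. */
struct dentry {
    struct dentry *d_parent;   /* a root dentry is its own parent */
};

/* A dentry is a root if it is its own parent. */
#define IS_ROOT(x) ((x) == (x)->d_parent)

/* Open-coded form: compare the dentry with its parent by hand. */
bool has_parent_open_coded(struct dentry *dentry)
{
    struct dentry *parent = dentry->d_parent;
    return parent != dentry;
}

/* Macro form: the intent is spelled out by the helper. */
bool has_parent(struct dentry *dentry)
{
    return !IS_ROOT(dentry);
}
```

Both functions return the same result for every dentry; the macro form only makes the intent easier to read.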
     
  • When the kzalloc of sbi/ino fails, it should return -ENOMEM.

    And it should return the err value from autofs_prepare_pipe.

    Signed-off-by: Rui Xiang
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rui Xiang
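The error-handling pattern this commit describes might look like the following user-space sketch. It is not the autofs code: `struct sb_info`, `prepare_pipe()`, and `fill_super_sketch()` are hypothetical names, and `calloc()` stands in for the kernel allocator.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-mount state. */
struct sb_info {
    int pipefd;
};

/* Hypothetical setup step that can fail with its own error code. */
int prepare_pipe(int fd)
{
    return fd < 0 ? -EBADF : 0;
}

/*
 * Allocate the state, run the setup step, and hand the result back.
 * Returns 0 on success or a negative errno value on failure.
 */
int fill_super_sketch(int fd, struct sb_info **out)
{
    struct sb_info *sbi;
    int ret;

    sbi = calloc(1, sizeof(*sbi));
    if (!sbi)
        return -ENOMEM;          /* allocation failure gets a specific code */

    ret = prepare_pipe(fd);
    if (ret < 0)
        goto fail_free;          /* propagate the helper's error unchanged */

    sbi->pipefd = fd;
    *out = sbi;
    return 0;

fail_free:
    free(sbi);
    return ret;
}
```

Returning the helper's own value, rather than one catch-all code, lets the caller tell an allocation failure apart from a bad pipe.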
     
  • The PID and the TGID of the process triggering the mount are sent to the
    daemon. Currently the global pid values are sent (ones valid in the
    initial pid namespace) but this is wrong if the autofs daemon itself is
    not running in the initial pid namespace.

    So send the pid values that are valid in the namespace of the autofs
    daemon.

    The namespace to use is taken from the oz_pgrp pid pointer, which was
    set at mount time to the mounting process' pid namespace.

    If the pid translation fails (the triggering process is in an unrelated
    pid namespace) then the automount fails with ENOENT.

    Signed-off-by: Miklos Szeredi
    Acked-by: Serge Hallyn
    Cc: Eric Biederman
    Acked-by: Ian Kent
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Enable autofs4 to work in a "container". oz_pgrp is converted from
    pid_t to struct pid and this is stored at mount time based on the
    "pgrp=" option or if the option is missing then the current pgrp.

    The "pgrp=" option is interpreted in the PID namespace of the current
    process. This option is flawed in that it doesn't carry the namespace
    information, so it should be deprecated. AFAICS the autofs daemon
    always sends the current pgrp, which is the default anyway.

    The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl.
    This ioctl sets oz_pgrp to the current pgrp. It is not allowed to
    change the pid namespace.

    oz_pgrp is used mainly to determine whether the process traversing the
    autofs mount tree is the autofs daemon itself or not. This function now
    compares the pid pointers instead of the pid_t values.

    One other use of oz_pgrp is in autofs4_show_options. There it shows the
    virtual pid number (i.e. the one that is valid inside the PID namespace
    of the calling process).

    For the debugging printk, convert oz_pgrp to the value in the initial pid
    namespace.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Miklos Szeredi
    Acked-by: Serge Hallyn
    Cc: Eric Biederman
    Acked-by: Ian Kent
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

25 Oct, 2013

2 commits


17 Sep, 2013

1 commit

  • Don't drop ->wq_mutex before calling autofs4_notify_daemon() only to regain it
    there. Besides being pointless, that opens a race window where autofs4_wait_release()
    could've come and freed wq->name.name. And do the debugging printk in the "reused an
    existing wq" case before dropping ->wq_mutex - the same reason...

    Signed-off-by: Al Viro
    Acked-by: Ian Kent

    Al Viro
     

09 Sep, 2013

1 commit

  • When reconnecting to automounts at startup an autofs ioctl is used
    to find the device and inode of existing mounts so they can be used
    to open a file descriptor of possibly covered mounts.

    At this time the caller might not yet "own" the mount so it can
    trigger calling ->d_automount(). This causes automount to hang when
    trying to reconnect to direct or offset mount types.

    Consequently kern_path() can't be used but kern_path_mountpoint() can be.

    Signed-off-by: Ian Kent
    Cc: Jeff Layton
    Cc: Al Viro
    Signed-off-by: Al Viro

    Ian Kent
     

05 Jul, 2013

1 commit


29 Jun, 2013

1 commit


07 May, 2013

2 commits

  • When checking if an autofs mount point is busy it isn't sufficient to
    only check if it's a mount point.

    For example, if the mount of an offset mountpoint in a tree is denied
    for this host by its export and the dentry becomes a process working
    directory the check incorrectly returns the mount as not in use at
    expire.

    This can happen since the default when mounting within a tree is
    nostrict, which means ignore mount failures on mounts within the tree and
    continue. The nostrict option is meant to allow mounting in this case.

    Signed-off-by: David Jeffery
    Signed-off-by: Ian Kent
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    David Jeffery
     
  • Fixed the sparse warning:

    fs/autofs4/root.c:411:5: warning: symbol 'autofs4_d_manage' was not declared. Should it be static?

    [ Clearly it should be static as the function is declared static at the
    top of root.c. - imk ]

    Signed-off-by: Claudiu Ghioc
    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Claudiu Ghioc
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
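The naming scheme in this commit (prefix the filesystem type with "fs-" before asking for a module) can be sketched in plain C. The helper name `fs_module_alias()` is hypothetical and the code only builds the string; it does not load anything.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the module alias that would be requested for a filesystem type.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int fs_module_alias(const char *fstype, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "fs-%s", fstype);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

With this scheme a mount of type "autofs" asks for "fs-autofs", so a module whose file name differs from the filesystem name can still be found through an alias, and arbitrary non-filesystem module names can no longer be reached from the mount path.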
     

02 Mar, 2013

2 commits

  • smatch analysis:

    fs/autofs4/waitq.c:46 autofs4_catatonic_mode() info: redundant null check on wq->name.name calling kfree()

    Signed-off-by: Tim Gardner
    Signed-off-by: Ian Kent
    Cc: autofs@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Tim Gardner
     
  • Sparse complains:

    fs/autofs4/root.c:409:9: sparse: context imbalance in 'autofs4_d_automount' - different lock contexts for basic block

    This was introduced by commit f55fb0c24386 ("autofs4 - dont clear
    DCACHE_NEED_AUTOMOUNT on rootless mount")

    The function autofs4_d_automount can be left with the (&sbi->fs_lock)
    held if sbi->version <= 4 and simple_empty(dentry) == false so the
    warning seems valid.

    --> Add a spin_unlock in this case before we jump to done

    Unfortunately compile tested only.

    Reported-by: Fengguang Wu <fengguang.wu@intel.com>
    Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
    Acked-by: Ian Kent <raven@themaw.net>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Peter Huewe
     

26 Feb, 2013

1 commit

  • According to SUSv3:

    [EACCES] Permission denied. An attempt was made to access a file in a way
    forbidden by its file access permissions.

    [EPERM] Operation not permitted. An attempt was made to perform an operation
    limited to processes with appropriate privileges or to the owner of a file
    or other resource.

    So -EPERM should be returned if a capability check fails.

    Strictly speaking this is an API change since the error code user sees is
    altered.

    Signed-off-by: Zhao Hongjiang
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Acked-by: Ian Kent
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Zhao Hongjiang
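The distinction drawn from SUSv3 in this commit can be shown with two small functions. This is only a sketch: `struct caller` and both functions are hypothetical, with plain booleans standing in for a real capability check and a real file-mode check.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of the calling process, for illustration only. */
struct caller {
    bool has_privilege;     /* stands in for a capability check */
    bool file_mode_allows;  /* stands in for a file permission check */
};

/* An operation limited to privileged processes fails with EPERM. */
int privileged_op(const struct caller *c)
{
    if (!c->has_privilege)
        return -EPERM;
    return 0;
}

/* An access forbidden by the file's permission bits fails with EACCES. */
int open_file(const struct caller *c)
{
    if (!c->file_mode_allows)
        return -EACCES;
    return 0;
}
```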
     

23 Feb, 2013

1 commit


18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

14 Dec, 2012

2 commits

  • For direct (and offset) mounts, if an automounted mount is manually
    umounted the trigger mount dentry can appear non-empty causing it to
    not trigger mounts. This can also happen if there is a file handle
    leak in a user space automounting application.

    This happens because, when a ioctl control file handle is opened
    on the mount, a cursor dentry is created which causes list_empty()
    to see the dentry as non-empty. Since there is a case where listing
    the directory of these dentrys is needed, the use of dcache_dir_*()
    functions for .open() and .release() is needed.

    Consequently simple_empty() must be used instead of list_empty()
    when checking for an empty directory.

    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
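The difference between the two emptiness checks in this commit can be modelled with a toy directory. Everything here is a simplified assumption made for illustration: `struct child`, `struct dir`, and both functions are hypothetical and only mimic the idea that a cursor entry sits on the child list without being real content.

```c
#include <stdbool.h>
#include <stddef.h>

/* One entry on a directory's child list. */
struct child {
    bool is_cursor;        /* placeholder created when the directory is opened */
    struct child *next;
};

struct dir {
    struct child *children;
};

/* Naive check: any node on the list, cursor included, counts as content. */
bool dir_list_empty(const struct dir *d)
{
    return d->children == NULL;
}

/* Content-aware check: only real entries count, cursors are skipped. */
bool dir_simple_empty(const struct dir *d)
{
    for (const struct child *c = d->children; c; c = c->next)
        if (!c->is_cursor)
            return false;
    return true;
}
```

A directory that holds only a cursor looks non-empty to the naive check but empty to the content-aware one, which is the behaviour the commit relies on to keep triggering mounts.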
     
  • The DCACHE_NEED_AUTOMOUNT flag is cleared on mount and set on expire
    for autofs rootless multi-mount dentrys to prevent unnecessary calls
    to ->d_automount().

    Since DCACHE_MANAGE_TRANSIT is always set on autofs dentrys ->d_manage()
    is always called so the check can be done in ->d_manage() without the
    need to change the flag. This still avoids unnecessary calls to
    ->d_automount(), adds negligible overhead and eliminates a seriously
    ugly check in the expire code.

    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
     

15 Nov, 2012

1 commit

  • Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

    When creating directories and symlinks default the uid and gid of
    the mount requester to the global root uid and gid. autofs4_wait
    will update these fields when a mount is requested.

    When generating autofsv5 packets report the uid and gid of the mount
    requester in the user namespace of the process that opened the pipe,
    reporting unmapped uids and gids as overflowuid and overflowgid.

    In autofs_dev_ioctl_requester return the uid and gid of the last mount
    requester converted into the calling process's user namespace. When the
    uid or gid don't map return overflowuid and overflowgid as appropriate,
    allowing failure to find a mount requester to be distinguished from
    failure to map a mount requester.

    The uid and gid mount options specifying the user and group of the
    root autofs inode are converted into kuid and kgid as they are parsed
    defaulting to the current uid and current gid of the process that
    mounts autofs.

    Mounting of autofs for the present remains confined to processes in
    the initial user namespace.

    Cc: Ian Kent
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Oct, 2012

1 commit

  • In autofs4_d_automount(), if a mount fail occurs the AUTOFS_INF_PENDING
    mount pending flag is not cleared.

    One effect of this is when using the "browse" option, directory entry
    attributes show up with all "?"s due to the incorrect callback and
    subsequent failure return (when in fact no callback should be made).

    Signed-off-by: Ian Kent
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Ian Kent
     

27 Sep, 2012

2 commits


17 Aug, 2012

2 commits

  • In some cases when an autofs indirect mount is contained in a file
    system that is marked as shared (such as when systemd does the
    equivalent of "mount --make-rshared /" early in the boot), mounts
    stop expiring.

    When this happens the first expiry check on a mountpoint dentry in
    autofs_expire_indirect() sees a mountpoint dentry with a higher
    than minimal reference count. Consequently the dentry is considered
    busy and the actual expiry check is never done.

    This particular check was originally meant as an optimisation to
    detect a path walk in progress but with the addition of rcu-walk
    it can be ineffective anyway.

    Removing the test allows automounts to expire again since the
    actual expire check doesn't rely on the dentry reference count.

    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Following a report of a crash during an automount expire I found that
    the locking in fs/autofs4/expire.c:get_next_positive_subdir() was wrong.
    Not only is the locking wrong but the function is more complex than it
    needs to be.

    The function is meant to calculate (and dget) the next entry in the list
    of directories contained in the root of an autofs mount point (an autofs
    indirect mount to be precise). The main problem was that the d_lock of
    the owner of the list was not being taken when walking the list, which
    led to list corruption under load. The only other lock that needs to
    be taken is against the next dentry candidate so it can be checked for
    usability.

    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
     

23 Jul, 2012

1 commit


14 Jul, 2012

1 commit

  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

06 May, 2012

1 commit

  • After we moved inode_sync_wait() out of end_writeback(), it no longer
    makes sense to call the function end_writeback(). Rename it to
    clear_inode(), which describes what the function really does: set the
    I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

30 Apr, 2012

1 commit

  • The autofs packet size has had a very unfortunate size problem on x86:
    because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
    because the packet data was not 8-byte aligned, the size of the autofsv5
    packet structure differed between 32-bit and 64-bit modes despite
    looking otherwise identical (300 vs 304 bytes respectively).

    We first fixed that up by making the 64-bit compat mode know about this
    problem in commit a32744d4abae ("autofs: work around unhappy compat
    problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
    64-bit kernel because everything then worked the same way as on a 32-bit
    kernel.

    But it turned out that 'automount' had actually known and worked around
    this problem in user space, so fixing the kernel to do the proper 32-bit
    compatibility handling actually *broke* 32-bit automount on a 64-bit
    kernel, because it knew that the packet sizes were wrong and expected
    those incorrect sizes.

    As a result, we ended up reverting that compatibility mode fix, and
    thus breaking systemd again, in commit fcbf94b9dedd.

    With both automount and systemd doing a single read() system call, and
    verifying that they get *exactly* the size they expect but using
    different sizes, it seemed that fixing one of them inevitably broke the
    other. At one point, a patch I seriously considered applying
    from Michael Tokarev did a "strcmp()" to see if it was automount that
    was doing the operation. Ugly, ugly.

    However, a prettier solution exists now thanks to the packetized pipe
    mode. By marking the communication pipe as being packetized (by simply
    setting the O_DIRECT flag), we can always just write the bigger packet
    size, and if user-space does a smaller read, it will just get that
    partial end result and the extra alignment padding will simply be thrown
    away.

    This makes both automount and systemd happy, since they now get the size
    they asked for, and the kernel side of autofs simply no longer needs to
    care - it could pad out the packet arbitrarily.

    Of course, if there is some *other* user of autofs (please, please,
    please tell me it ain't so - and we haven't heard of any) that tries to
    read the packets with multiple reads, that other user will now be
    broken - the whole point of the packetized mode is that one system call
    gets exactly one packet, and you cannot read a packet in pieces.

    Tested-by: Michael Tokarev
    Cc: Alan Cox
    Cc: David Miller
    Cc: Ian Kent
    Cc: Thomas Meyer
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
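The packetized pipe behaviour this commit relies on can be observed from user space on Linux. The sketch below is an illustration, not autofs code: `packet_pipe_demo()` is a hypothetical helper, the sizes 304 and 300 are the ones quoted in the commit message, and the code assumes a kernel and C library that support `pipe2()` with `O_DIRECT`. Where that support is missing the function simply reports failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for pipe2() and O_DIRECT on glibc */
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define KERNEL_PKT 304   /* size the writer always sends */
#define USER_READ  300   /* size a reader with the smaller struct asks for */

/*
 * Write two packets into a packet-mode pipe, then read twice with the
 * smaller buffer. Stores the two read() results in got[] and the first
 * byte returned by the second read in *second_first_byte.
 * Returns 0 on success, -1 if the pipe cannot be created or a write fails.
 */
int packet_pipe_demo(ssize_t got[2], char *second_first_byte)
{
    int fds[2];
    char out[KERNEL_PKT], in[USER_READ];
    int rc = -1;

    if (pipe2(fds, O_DIRECT) < 0)
        return -1;

    memset(out, 'A', sizeof(out));
    if (write(fds[1], out, sizeof(out)) != (ssize_t)sizeof(out))
        goto out_close;
    memset(out, 'B', sizeof(out));
    if (write(fds[1], out, sizeof(out)) != (ssize_t)sizeof(out))
        goto out_close;

    got[0] = read(fds[0], in, sizeof(in));  /* short read: packet tail is dropped */
    got[1] = read(fds[0], in, sizeof(in));  /* begins at the second packet */
    if (got[1] > 0)
        *second_first_byte = in[0];
    rc = 0;

out_close:
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

On a system with packet-mode pipes, both reads return 300 bytes and the second read starts with the second packet's data, which shows that the four trailing bytes of the first packet were discarded rather than carried over.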
     

28 Apr, 2012

1 commit

  • This reverts commit a32744d4abae24572eff7269bc17895c41bd0085.

    While that commit was technically the right thing to do, and made the
    x86-64 compat mode work identically to native 32-bit mode (and thus
    fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
    turns out that the automount binaries had workarounds for this compat
    problem.

    Now, the workarounds are disgusting: doing an "uname()" to find out the
    architecture of the kernel, and then comparing it for the 64-bit cases
    and fixing up the size of the read() in automount for those. And they
    were confused: it's not actually a generic 64-bit issue at all, it's
    very much tied to just x86-64, which has different alignment for an
    'u64' in 64-bit mode than in 32-bit mode.

    But the end result is that fixing the compat layer actually breaks the
    case of a 32-bit automount on a x86-64 kernel.

    There are various approaches to fix this (including just doing a
    "strcmp()" on current->comm and comparing it to "automount"), but I
    think that I will do the one that teaches pipes about a special "packet
    mode", which will allow user space to not have to care too deeply about
    the padding at the end of the autofs packet.

    That change will make the compat workaround unnecessary, so let's revert
    it first, and get automount working again in compat mode. The
    packetized pipes will then fix autofs for systemd.

    Reported-and-requested-by: Michael Tokarev
    Cc: Ian Kent
    Cc: stable@kernel.org # for 3.3
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Mar, 2012

1 commit

  • Pull x32 support for x86-64 from Ingo Molnar:
    "This tree introduces the X32 binary format and execution mode for x86:
    32-bit data space binaries using 64-bit instructions and 64-bit kernel
    syscalls.

    This allows applications whose working set fits into a 32-bit address
    space to make use of 64-bit instructions while using a 32-bit address
    space with shorter pointers, more compressed data structures, etc."

    Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    x32: Fix alignment fail in struct compat_siginfo
    x32: Fix stupid ia32/x32 inversion in the siginfo format
    x32: Add ptrace for x32
    x32: Switch to a 64-bit clock_t
    x32: Provide separate is_ia32_task() and is_x32_task() predicates
    x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
    x86/x32: Fix the binutils auto-detect
    x32: Warn and disable rather than error if binutils too old
    x32: Only clear TIF_X32 flag once
    x32: Make sure TS_COMPAT is cleared for x32 tasks
    fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
    fs: Fix close_on_exec pointer in alloc_fdtable
    x32: Drop non-__vdso weak symbols from the x32 VDSO
    x32: Fix coding style violations in the x32 VDSO code
    x32: Add x32 VDSO support
    x32: Allow x32 to be configured
    x32: If configured, add x32 system calls to system call tables
    x32: Handle process creation
    x32: Signal-related system calls
    x86: Add #ifdef CONFIG_COMPAT to
    ...

    Linus Torvalds
     

21 Mar, 2012

2 commits


26 Feb, 2012

1 commit

  • When the autofs protocol version 5 packet type was added in commit
    5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it
    obviously tried quite hard to be word-size agnostic, and uses explicitly
    sized fields that are all correctly aligned.

    However, with the final "char name[NAME_MAX+1]" array at the end, the
    actual size of the structure ends up being not very well defined:
    because the struct isn't marked 'packed', doing a "sizeof()" on it will
    align the size of the struct up to the biggest alignment of the members
    it has.

    And despite all the members being the same, the alignment of them is
    different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
    alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
    number (256), the name[] array starts out 4-byte aligned.

    End result: the "packed" size of the structure is 300 bytes: 4-byte, but
    not 8-byte aligned.

    As a result, despite all the fields being in the same place on all
    architectures, sizeof() will round up that size to 304 bytes on
    architectures that have 8-byte alignment for u64.

    Note that this is *not* a problem for 32-bit compat mode on POWER, since
    there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
    and 64-bit alignment is different for 64-bit entities, and as a result
    the structure that has exactly the same layout has different sizes.

    So on x86-64, but no other architecture, we will just subtract 4 from
    the size of the structure when running in a compat task. That way we
    will write the properly sized packet that user mode expects.

    Not pretty. Sadly, this very subtle, and unnecessary, size difference
    has been encoded in user space that wants to read packets of *exactly*
    the right size, and will refuse to touch anything else.

    Reported-and-tested-by: Thomas Meyer
    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds

    Ian Kent
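The size arithmetic in this commit can be checked with a small struct. The layout below is an illustrative sketch: it follows the description in the message (explicitly sized fields, one 64-bit member, a final 256-byte name array), but the struct name and all field names are placeholders rather than a copy of the kernel header.

```c
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 256   /* NAME_MAX + 1, as stated in the commit message */

/* Illustrative layout only; field names are placeholders. */
struct v5_packet_sketch {
    uint32_t proto_version;
    uint32_t type;
    uint32_t token;
    uint32_t dev;
    uint64_t ino;            /* the only member that may need 8-byte alignment */
    uint32_t uid;
    uint32_t gid;
    uint32_t pid;
    uint32_t tgid;
    uint32_t len;
    char     name[NAME_LEN]; /* starts at offset 44: 4-byte aligned only */
};

/* Size with no tail padding: the offset where the name array ends. */
size_t packed_size(void)
{
    return offsetof(struct v5_packet_sketch, name) + NAME_LEN;
}

/* Size as the compiler reports it, rounded up to the struct's alignment. */
size_t padded_size(void)
{
    return sizeof(struct v5_packet_sketch);
}
```

The unpadded size is 300 bytes on any ABI, because the fixed part is 44 bytes and the name array adds 256. `sizeof` stays at 300 where a 64-bit integer is 4-byte aligned and grows to 304 where it is 8-byte aligned, which is the mismatch the commit works around.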
     

20 Feb, 2012

1 commit

  • Wrap accesses to the fd_sets in struct fdtable (for recording open files and
    close-on-exec flags) so that we can move away from using fd_sets since we
    abuse the fd_set structs by not allocating the full-sized structure under
    normal circumstances and by non-core code looking at the internals of the
    fd_sets.

    The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
    since that cannot be told about their abnormal lengths.

    This introduces six wrapper functions for setting, clearing and testing
    close-on-exec flags and fd-is-open flags:

    void __set_close_on_exec(int fd, struct fdtable *fdt);
    void __clear_close_on_exec(int fd, struct fdtable *fdt);
    bool close_on_exec(int fd, const struct fdtable *fdt);
    void __set_open_fd(int fd, struct fdtable *fdt);
    void __clear_open_fd(int fd, struct fdtable *fdt);
    bool fd_is_open(int fd, const struct fdtable *fdt);

    Note that I've prepended '__' to the names of the set/clear functions because
    they require the caller to hold a lock to use them.

    Note also that I haven't added wrappers for looking behind the scenes at the
    array. Possibly that should exist too.

    Signed-off-by: David Howells
    Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
    Signed-off-by: H. Peter Anvin
    Cc: Al Viro

    David Howells
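To make the six wrappers listed in this commit concrete, here is a user-space sketch of how such accessors could sit on top of plain bitmaps. The function names and signatures are the ones given in the commit message; the `struct fdtable` shown here is a simplified stand-in with assumed field names, and the bodies are an illustration rather than the kernel's implementation. Locking is left to the caller, as the message notes.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Simplified stand-in: two bitmaps whose length is chosen by the owner. */
struct fdtable {
    unsigned int   max_fds;
    unsigned long *open_fds;
    unsigned long *close_on_exec;
};

/* Bit helpers shared by all wrappers. */
static void bit_set(unsigned long *map, int n)
{
    map[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
}

static void bit_clear(unsigned long *map, int n)
{
    map[n / BITS_PER_WORD] &= ~(1UL << (n % BITS_PER_WORD));
}

static bool bit_test(const unsigned long *map, int n)
{
    return (map[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL;
}

/* Names follow the commit message; the leading "__" marks "caller holds the lock". */
void __set_close_on_exec(int fd, struct fdtable *fdt)   { bit_set(fdt->close_on_exec, fd); }
void __clear_close_on_exec(int fd, struct fdtable *fdt) { bit_clear(fdt->close_on_exec, fd); }
bool close_on_exec(int fd, const struct fdtable *fdt)   { return bit_test(fdt->close_on_exec, fd); }
void __set_open_fd(int fd, struct fdtable *fdt)         { bit_set(fdt->open_fds, fd); }
void __clear_open_fd(int fd, struct fdtable *fdt)       { bit_clear(fdt->open_fds, fd); }
bool fd_is_open(int fd, const struct fdtable *fdt)      { return bit_test(fdt->open_fds, fd); }
```

Because callers only go through the wrappers, the storage behind them can change later without touching the code that sets, clears, or tests the flags.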
     

14 Feb, 2012

1 commit

  • When recursing down the locks when traversing a tree/list in
    get_next_positive_dentry() or get_next_positive_subdir() a lock can
    change from being nested to being a parent which breaks lockdep. This
    patch tells lockdep about what we did.

    Signed-off-by: Steven Rostedt
    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Steven Rostedt