03 Nov, 2011

14 commits

  • * 'for-3.2' of git://linux-nfs.org/~bfields/linux:
    nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access

    Linus Torvalds
     
  • Says Andrew:

    "60 patches. That's good enough for -rc1 I guess. I have quite a lot
    of detritus to be rechecked, work through maintainers, etc.

    - most of the remains of MM
    - rtc
    - various misc
    - cgroups
    - memcg
    - cpusets
    - procfs
    - ipc
    - rapidio
    - sysctl
    - pps
    - w1
    - drivers/misc
    - aio"

    * akpm: (60 commits)
    memcg: replace ss->id_lock with a rwlock
    aio: allocate kiocbs in batches
    drivers/misc/vmw_balloon.c: fix typo in code comment
    drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
    w1: disable irqs in critical section
    drivers/w1/w1_int.c: multiple masters used same init_name
    drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
    drivers/power/ds2780_battery.c: add a nolock function to w1 interface
    drivers/power/ds2780_battery.c: create central point for calling w1 interface
    w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
    pps gpio client: add missing dependency
    pps: new client driver using GPIO
    pps: default echo function
    include/linux/dma-mapping.h: add dma_zalloc_coherent()
    sysctl: make CONFIG_SYSCTL_SYSCALL default to n
    sysctl: add support for poll()
    RapidIO: documentation update
    drivers/net/rionet.c: fix ethernet address macros for LE platforms
    RapidIO: fix potential null deref in rio_setup_device()
    RapidIO: add mport driver for Tsi721 bridge
    ...

    Linus Torvalds
     
  • In testing aio on a fast storage device, I found that the context lock
    takes up a fair amount of cpu time in the I/O submission path. The reason
    is that we take it for every I/O submitted (see __aio_get_req). Since we
    know how many I/Os are passed to io_submit, we can preallocate the kiocbs
    in batches, reducing the number of times we take and release the lock.
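
    The effect of batching on lock traffic can be illustrated with a small
    userspace model. This is only a sketch: the Context class, its method
    names, and the batch size of 32 are hypothetical stand-ins, not the
    kernel's aio code.

```python
import threading


class Context:
    """Toy stand-in for an aio context that counts lock acquisitions."""

    def __init__(self):
        self.lock = threading.Lock()
        self.lock_acquisitions = 0
        self.allocated = 0

    def alloc_one(self):
        # One lock round trip per request, as in the per-I/O path.
        with self.lock:
            self.lock_acquisitions += 1
            self.allocated += 1

    def alloc_batch(self, count):
        # One lock round trip covers `count` requests.
        with self.lock:
            self.lock_acquisitions += 1
            self.allocated += count


def submit_unbatched(ctx, nr):
    """Allocate nr requests one at a time; return total lock acquisitions."""
    for _ in range(nr):
        ctx.alloc_one()
    return ctx.lock_acquisitions


def submit_batched(ctx, nr, batch_size=32):
    """Allocate nr requests in batches; return total lock acquisitions."""
    remaining = nr
    while remaining > 0:
        take = min(batch_size, remaining)
        ctx.alloc_batch(take)
        remaining -= take
    return ctx.lock_acquisitions
```

    In this model, submitting 128 requests takes the lock 128 times
    unbatched and 4 times with a batch size of 32.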

    In my testing, I was able to reduce the amount of time spent in
    _raw_spin_lock_irq by .56% (average of 3 runs). The command I used to
    test this was:

    aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384

    I also tested the patch with various numbers of events passed to
    io_submit, and I ran the xfstests aio group of tests to ensure I didn't
    break anything.

    Signed-off-by: Jeff Moyer
    Cc: Daniel Ehrenberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Adding support for poll() in sysctl fs allows userspace to receive
    notifications of changes in sysctl entries. This adds an infrastructure to
    allow files in sysctl fs to be pollable and implements it for hostname and
    domainname.
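
    From userspace, waiting on such an entry might look like the sketch
    below. The helper name wait_for_change is hypothetical, and it is an
    assumption here that a change is reported as POLLPRI/POLLERR; check the
    kernel documentation for the exact event mask.

```python
import os
import select

# Pollable entry named in the commit message.
HOSTNAME_SYSCTL = "/proc/sys/kernel/hostname"


def wait_for_change(path=HOSTNAME_SYSCTL, timeout_ms=1000):
    """Return True if the entry signalled an event before the timeout."""
    fd = os.open(path, os.O_RDONLY)
    try:
        poller = select.poll()
        # Assumption: a change shows up as POLLPRI and/or POLLERR.
        poller.register(fd, select.POLLPRI | select.POLLERR)
        return bool(poller.poll(timeout_ms))
    finally:
        os.close(fd)
```

    A caller would typically re-read the file after wait_for_change()
    returns True and then poll again.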

    [akpm@linux-foundation.org: s/declare/define/ for definitions]
    Signed-off-by: Lucas De Marchi
    Cc: Greg KH
    Cc: Kay Sievers
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     
  • fd* files are restricted to the task's owner, and other users may not get
    direct access to them. But one may open any of these files and run any
    setuid program, keeping opened file descriptors. As there are permission
    checks on open(), but not on readdir() and read(), operations on the kept
    file descriptors will not be checked. This makes it possible to violate
    the procfs permission model.

    Reading fdinfo/* may disclose current fds' position and flags, reading
    directory contents of fdinfo/ and fd/ may disclose the number of opened
    files by the target task. This information is not sensitive per se, but it
    can reveal some private information (like length of a password stored in a
    file) under certain conditions.

    Used existing (un)lock_trace functions to check for ptrace_may_access(),
    but instead of using EPERM return code from it use EACCES to be consistent
    with existing proc_pid_follow_link()/proc_pid_readlink() return code. If
    they differ, attacker can guess what fds exist by analyzing stat() return
    code. Patched handlers: stat() for fd/*, stat() and read() for fdinfo/*,
    readdir() and lookup() for fd/ and fdinfo/.

    Signed-off-by: Vasiliy Kulikov
    Cc: Cyrill Gorcunov
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • On reading sysctl dirs we should return -EISDIR instead of -EINVAL.
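
    The errno a read() on a directory produces can be checked from
    userspace with a sketch like the following. The helper name read_errno
    is hypothetical; the behaviour for /proc/sys directories is what this
    commit describes, and the helper itself works on any path.

```python
import errno
import os


def read_errno(path):
    """Open path read-only, try one read(), and return the errno (0 on success)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 1)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

    On a kernel with this change, read_errno("/proc/sys/kernel") should
    equal errno.EISDIR, matching what ordinary directories return.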

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Clement Lecigne reports a filesystem which causes a kernel oops in
    hfs_find_init() trying to dereference sb->ext_tree which is NULL.

    This proves to be because the filesystem has a corrupted MDB extent
    record, where the extents file does not fit into the first three extents
    in the file record (the first blocks).

    In hfs_get_block() when looking up the blocks for the extent file
    (HFS_EXT_CNID), it fails the first blocks special case, and falls
    through to the extent code (which ultimately calls hfs_find_init())
    which is in the process of being initialised.

    HFS avoids this scenario by always having the extents b-tree fitting
    into the first blocks (the extents B-tree can't have overflow extents).

    The fix is to check at mount time that the B-tree fits into first
    blocks, i.e. fail if HFS_I(inode)->alloc_blocks >=
    HFS_I(inode)->first_blocks

    Note, the existing commit 47f365eb57573 ("hfs: fix oops on mount with
    corrupted btree extent records") becomes subsumed into this as a special
    case, but only for the extents B-tree (HFS_EXT_CNID); it is perfectly
    acceptable for the catalog B-Tree file to grow beyond three extents,
    with the remaining extent descriptors in the extents overflow.

    This fixes CVE-2011-2203

    Reported-by: Clement LECIGNE
    Signed-off-by: Phillip Lougher
    Cc: Jeff Mahoney
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phillip Lougher
     
  • Use mpage_readpages() instead of multiple calls to isofs_readpage() to
    reduce CPU utilization and improve performance.

    Signed-off-by: Namjae Jeon
    Cc: Al Viro
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • Since ramfs is hard-selected to "y", the module leftovers make no sense.

    Signed-off-by: Richard Weinberger
    Reviewed-by: WANG Cong
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • When address space randomization has been disabled at runtime through the
    randomize_va_space sysctl, load_elf_binary() does not handle the case
    properly, and certain PIE-linked binaries receive SIGKILL at exec() time.

    Handle the randomize_va_space == 0 case the same way as if we were not
    supporting .text randomization at all.

    Based on original patch by H.J. Lu and Josh Boyer.

    Signed-off-by: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: H.J. Lu
    Cc:
    Tested-by: Josh Boyer
    Acked-by: Nicolas Pitre
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue:
    vfs: add d_prune dentry operation
    vfs: protect i_nlink
    filesystems: add set_nlink()
    filesystems: add missing nlink wrappers
    logfs: remove unnecessary nlink setting
    ocfs2: remove unnecessary nlink setting
    jfs: remove unnecessary nlink setting
    hypfs: remove unnecessary nlink setting
    vfs: ignore error on forced remount
    readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
    vfs: fix dentry leak in simple_fill_super()

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
    jbd2: Unify log messages in jbd2 code
    jbd/jbd2: validate sb->s_first in journal_get_superblock()
    ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
    ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
    ext4: fix a typo in struct ext4_allocation_context
    ext4: Don't normalize an falloc request if it can fit in 1 extent.
    ext4: remove comments about extent mount option in ext4_new_inode()
    ext4: let ext4_discard_partial_buffers handle unaligned range correctly
    ext4: return ENOMEM if find_or_create_pages fails
    ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
    ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
    ext4: optimize locking for end_io extent conversion
    ext4: remove unnecessary call to waitqueue_active()
    ext4: Use correct locking for ext4_end_io_nolock()
    ext4: fix race in xattr block allocation path
    ext4: trace punch_hole correctly in ext4_ext_map_blocks
    ext4: clean up AGGRESSIVE_TEST code
    ext4: move variables to their scope
    ext4: fix quota accounting during migration
    ext4: migrate cleanup
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: Cleanup metadata flags handling
    udf: Skip mirror metadata FE loading when metadata FE is ok
    ext3: Allow quota file use root reservation
    udf: Remove web reference from UDF MAINTAINERS entry
    quota: Drop path reference on error exit from quotactl
    udf: Neaten udf_debug uses
    udf: Neaten logging output, use vsprintf extension %pV
    udf: Convert printks to pr_
    udf: Rename udf_warning to udf_warn
    udf: Rename udf_error to udf_err
    udf: Promote some debugging messages to udf_error
    ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY.
    udf: Add readpages support for udf.
    ext3/balloc.c: local functions should be static
    ext2: fix the outdated comment in ext2_nfs_get_inode()
    ext3: remove deprecated oldalloc
    fs/ext3/balloc.c: delete useless initialization
    fs/ext2/balloc.c: delete useless initialization
    ext3: fix message in ext3_remount for rw-remount case
    ext3: Remove i_mutex from ext3_sync_file()

    Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h

    Linus Torvalds
     
  • * 'for-linus' of git://github.com/richardweinberger/linux: (90 commits)
    um: fix ubd cow size
    um: Fix kmalloc argument order in um/vdso/vma.c
    um: switch to use of drivers/Kconfig
    UserModeLinux-HOWTO.txt: fix a typo
    UserModeLinux-HOWTO.txt: remove ^H characters
    um: we need sys/user.h only on i386
    um: merge delay_{32,64}.c
    um: distribute exports to where exported stuff is defined
    um: kill system-um.h
    um: generic ftrace.h will do...
    um: segment.h is x86-only and needed only there
    um: asm/pda.h is not needed anymore
    um: hw_irq.h can go generic as well
    um: switch to generic-y
    um: clean Kconfig up a bit
    um: a couple of missing dependencies...
    um: kill useless argument of free_chan() and free_one_chan()
    um: unify ptrace_user.h
    um: unify KSTK_...
    um: fix gcov build breakage
    ...

    Linus Torvalds
     

02 Nov, 2011

18 commits

  • everything in USER_OBJ gets it via -include user.h

    Signed-off-by: Al Viro
    Signed-off-by: Richard Weinberger

    Al Viro
     
  • This adds a d_prune dentry operation that is called by the VFS prior to
    pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
    Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
    are about to unhash and kill the dentry.

    This will be used by Ceph to maintain a flag indicating whether the
    complete contents of a directory are contained in the dcache, allowing it
    to satisfy lookups and readdir without additional server communication.

    Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
    DCACHE_OP_ bits.

    Signed-off-by: Sage Weil
    Signed-off-by: Christoph Hellwig

    Sage Weil
     
  • Prevent direct modification of i_nlink by making it const and adding a
    non-const __i_nlink alias.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • Replace remaining direct i_nlink updates with a new set_nlink()
    updater function.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • Replace direct i_nlink updates with the respective updater function
    (inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • alloc_inode() initializes i_nlink to 1. Remove unnecessary
    re-initialization.

    Signed-off-by: Miklos Szeredi
    CC: Joern Engel
    CC: Prasad Joshi
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • alloc_inode() initializes i_nlink to 1. Remove unnecessary
    re-initialization.

    Signed-off-by: Miklos Szeredi
    CC: Joel Becker
    CC: Mark Fasheh
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • alloc_inode() initializes i_nlink to 1. Remove unnecessary
    re-initialization.

    Signed-off-by: Miklos Szeredi
    Acked-by: Dave Kleikamp
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • On emergency remount we want to force MS_RDONLY on the super block
    even if ->remount_fs() failed for some reason.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     
  • Since the commit below which added O_PATH support to the *at() calls, the
    error return for readlink/readlinkat for the empty pathname has switched
    from ENOENT to EINVAL:

    commit 65cfc6722361570bfe255698d9cd4dccaf47570d
    Author: Al Viro
    Date: Sun Mar 13 15:56:26 2011 -0400

    readlinkat(), fchownat() and fstatat() with empty relative pathnames

    This is both unexpected for userspace and makes readlink/readlinkat
    inconsistent with all other interfaces, and inconsistent with our stated
    return for these pathnames.

    As the readlinkat call does not have a flags parameter we cannot use the
    AT_EMPTY_PATH approach used in the other calls. Therefore expose whether
    the original path is in fact empty via a new user_path_at_empty() path
    lookup function. Use this to determine whether to default to EINVAL or
    ENOENT for failures.
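
    The userspace-visible behaviour can be checked with a short sketch.
    The helper name readlink_errno is hypothetical; it only wraps
    os.readlink() and reports the errno.

```python
import errno
import os


def readlink_errno(path):
    """Return the errno from readlink(path), or 0 if the call succeeds."""
    try:
        os.readlink(path)
        return 0
    except OSError as exc:
        return exc.errno
```

    With this fix applied, readlink_errno("") is expected to be
    errno.ENOENT, while an existing path that is not a symlink still
    yields errno.EINVAL.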

    Addresses http://bugs.launchpad.net/bugs/817187

    [akpm@linux-foundation.org: remove unused getname_flags()]
    Signed-off-by: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Hellwig

    Andy Whitcroft
     
  • put dentry if inode allocation failed, d_genocide() cannot release it

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Christoph Hellwig

    Konstantin Khlebnikov
     
  • Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the
    same time other jbd2 code prints with "JBD: " prefix. Unify the prefix
    to "JBD2: ".

    Signed-off-by: Eryu Guan
    Signed-off-by: "Theodore Ts'o"

    Eryu Guan
     
  • I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when
    mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3
    image has s_first = 0 in journal superblock, and the 0 is passed to
    journal->j_head in journal_reset(), then to blocknr in
    cleanup_journal_tail(), in the end the J_ASSERT failed.

    So validate s_first after reading journal superblock from disk in
    journal_get_superblock() to ensure s_first is valid.

    The following script could reproduce it:

    fstype=ext3
    blocksize=1024
    img=$fstype.img
    offset=0
    found=0
    magic="c0 3b 39 98"

    dd if=/dev/zero of=$img bs=1M count=8
    mkfs -t $fstype -b $blocksize -F $img
    filesize=`stat -c %s $img`
    while [ $offset -lt $filesize ]
    do
    if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then
    echo "Found journal: $offset"
    found=1
    break
    fi
    offset=`echo "$offset+$blocksize" | bc`
    done

    if [ $found -ne 1 ];then
    echo "Magic \"$magic\" not found"
    exit 1
    fi

    dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1

    mkdir -p ./mnt
    mount -o loop $img ./mnt
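
    The check can be modelled in userspace against the same bytes the
    script above manipulates. This is a sketch, not the kernel code: the
    function name is hypothetical, the field offsets are assumptions about
    the journal superblock layout (consistent with the script clearing
    byte 23, the low byte of a big-endian s_first), and the upper bound
    against s_maxlen is likewise an assumption.

```python
import struct

JBD_MAGIC = bytes.fromhex("c03b3998")  # magic the script above searches for
S_MAXLEN_OFFSET = 16  # assumed offset of s_maxlen (big-endian u32)
S_FIRST_OFFSET = 20   # assumed offset of s_first; occupies bytes 20..23


def s_first_is_valid(sb):
    """Return True if the journal superblock bytes carry a usable s_first."""
    if bytes(sb[:4]) != JBD_MAGIC:
        return False
    (maxlen,) = struct.unpack_from(">I", sb, S_MAXLEN_OFFSET)
    (first,) = struct.unpack_from(">I", sb, S_FIRST_OFFSET)
    # s_first == 0 is the corruption described above; the upper bound
    # is an extra sanity check assumed for this sketch.
    return 0 < first < maxlen
```

    Clearing byte 23 of a superblock whose s_first is 1 turns the value
    into 0, which this check rejects.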

    Cc: Jan Kara
    Signed-off-by: Eryu Guan
    Signed-off-by: "Theodore Ts'o"

    Eryu Guan
     
  • The variable 'block' is removed by commit 750c9c47, so use the
    replacement ex_ee_block instead.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • This patch fixes a syntax error which omits a comma. Besides this,
    logical block number is unsigned 32 bits, so printk should use %u
    instead of %d.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • Signed-off-by: Benny Halevy
    Signed-off-by: J. Bruce Fields

    Benny Halevy
     
  • * 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    pstore: make pstore write function return normal success/fail value
    pstore: change mutex locking to spin_locks
    pstore: defer inserting OOPS entries into pstore

    Linus Torvalds
     
  • In sysfs_rename we need to remove the optimization of not calling
    sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent
    directory is not changing. This optimization is no longer valid now
    that sysfs dirents are stored in an rbtree sorted by name.

    Move the assignment of s_ns before the call of sysfs_link_sibling. With
    no sysfs_dirent fields changing after the call of sysfs_link_sibling
    this allows sysfs_link_sibling to take any of the directory entries into
    account when it builds the rbtrees, and s_ns looks like a prime candidate
    to be used in the rbtree in the future.
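
    Why the shortcut stops being valid can be shown with a toy container
    kept sorted by name. This is a sketch only: SortedDir and its methods
    are hypothetical, and a sorted list with bisect stands in for the
    rbtree.

```python
import bisect


class SortedDir:
    """Toy directory whose entries are kept sorted by name."""

    def __init__(self):
        self.names = []

    def link(self, name):
        bisect.insort(self.names, name)

    def unlink(self, name):
        i = bisect.bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            raise KeyError(name)
        del self.names[i]

    def lookup(self, name):
        # Binary search: only correct while the list is sorted.
        i = bisect.bisect_left(self.names, name)
        return i < len(self.names) and self.names[i] == name

    def rename(self, old, new):
        # Always unlink and relink, even within the same directory.
        self.unlink(old)
        self.link(new)

    def rename_in_place(self, old, new):
        # The invalid shortcut: change the key without re-sorting.
        i = bisect.bisect_left(self.names, old)
        self.names[i] = new
```

    After rename_in_place("a", "z") on a directory holding a, b and c,
    lookup("z") fails because the entry sits at the wrong position;
    rename() keeps the ordering intact.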

    Signed-off-by: Eric W. Biederman
    Cc: Jiri Slaby
    Cc: Greg KH
    Cc: David Miller
    Cc: Mikulas Patocka
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

01 Nov, 2011

8 commits

  • epoll can recursively acquire ep->mtx on multiple "struct
    eventpoll"s at once in the case where one epoll fd is monitoring another
    epoll fd. This is perfectly OK, since we're careful about the lock
    ordering, but it causes spurious lockdep warnings. Annotate the recursion
    using mutex_lock_nested, and add a comment explaining the nesting rules
    for good measure.
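
    The nesting in question, one epoll fd monitoring another, can be set
    up from userspace as sketched below. This is not the test program from
    the commit; the function name is hypothetical and the code is
    Linux-only.

```python
import os
import select


def nested_epoll_ready():
    """Return True if an outer epoll reports an inner epoll fd as readable."""
    r, w = os.pipe()
    inner = select.epoll()
    outer = select.epoll()
    try:
        inner.register(r, select.EPOLLIN)
        # The outer instance watches the inner epoll fd itself.
        outer.register(inner.fileno(), select.EPOLLIN)
        os.write(w, b"x")  # makes the pipe, and so the inner epoll, ready
        events = outer.poll(1.0)
        return any(fd == inner.fileno() and ev & select.EPOLLIN
                   for fd, ev in events)
    finally:
        outer.close()
        inner.close()
        os.close(r)
        os.close(w)
```

    Polling the outer instance makes the kernel look at the inner one,
    which is where more than one ep->mtx is held at once.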

    Recent versions of systemd are triggering this, and it can also be
    demonstrated with the following trivial test program:

    --------------------8
    Tested-by: Paul Bolle
    Signed-off-by: Nelson Elhage
    Acked-by: Jason Baron
    Cc: Dave Jones
    Cc: Davide Libenzi
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage
     
  • There is no functional change.

    Signed-off-by: Andy Shevchenko
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Standardize the style for compiler based printf format verification.
    Standardize the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently a statfs on a pipe's /proc//fd/ link returns -ENOSYS. Wire
    pipefs up so that the statfs succeeds.

    This is required by checkpoint-restart in the userspace to make it
    possible to distinguish pipes from fifos.

    When we dump information about task's open files we use the /proc/pid/fd
    directory's symlinks and the fact that opening any of them gives us exactly
    the same dentry->inode pair as the original process has. Now if a task
    we're dumping has opened a pipe and a fifo we need to detect this and act
    accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
    the most precise way.
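
    A userspace check along these lines might look like the sketch below.
    The helper names are hypothetical, and the code assumes Linux, where
    f_type is the first, long-sized field of struct statfs. PIPEFS_MAGIC
    is the value Linux reports for pipefs.

```python
import ctypes
import os
import stat

PIPEFS_MAGIC = 0x50495045  # "PIPE" in ASCII

_libc = ctypes.CDLL(None, use_errno=True)


def fs_magic(fd):
    """Return the statfs f_type of the filesystem that fd lives on."""
    buf = ctypes.create_string_buffer(4096)  # larger than struct statfs
    if _libc.fstatfs(fd, buf) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Assumption: f_type is the first field and has the size of a long.
    return ctypes.c_long.from_buffer(buf).value


def is_anonymous_pipe(fd):
    """True for a pipe() end, False for a named fifo or any other file."""
    is_fifo = stat.S_ISFIFO(os.fstat(fd).st_mode)
    return is_fifo and fs_magic(fd) == PIPEFS_MAGIC
```

    Both a pipe end and an opened fifo pass S_ISFIFO; only the pipe end
    reports the pipefs magic.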

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Tejun Heo
    Acked-by: Serge Hallyn
    Signed-off-by: Cyrill Gorcunov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • On the ext4 mailing list[1], we got some reports about errors in
    __find_get_block_slow(), but the information is very limited.

    If the device information is given, we can know the name of the sick
    volume. Furthermore, we can get the corresponding status of that
    block (group, inode block, etc.) by analyzing the disk layout.

    [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2

    Signed-off-by: Tao Ma
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • The callback must not return -1 when nr_to_scan is zero. Fix the bug in
    fs/super.c and add this requirement to the callback specification.
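
    The rule can be expressed as a small model of a shrinker callback.
    This is a sketch: the function and parameter names are hypothetical,
    and the logic is only loosely modelled on a callback that may fail to
    take a lock.

```python
def shrink_model(nr_to_scan, can_proceed, freeable_objects):
    """Return value of a toy shrinker callback.

    nr_to_scan == 0 is a pure query for the number of freeable objects.
    """
    if not can_proceed:
        # A query must never be answered with -1; report nothing to free.
        return 0 if nr_to_scan == 0 else -1
    if nr_to_scan == 0:
        return freeable_objects
    freed = min(nr_to_scan, freeable_objects)
    return freeable_objects - freed
```

    Only an actual scan request (nr_to_scan > 0) that cannot make
    progress returns -1 in this model.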

    Signed-off-by: Mikulas Patocka
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • memchr_inv() is mainly used to check whether the whole buffer is filled
    with just a specified byte.

    The function name and prototype are stolen from logfs and the
    implementation is from SLUB.
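
    The semantics can be sketched in a few lines. This is an illustration
    rather than the kernel implementation: it returns an index (or None)
    where the kernel function returns a pointer (or NULL).

```python
def memchr_inv(buf, c):
    """Return the index of the first byte in buf that differs from c.

    Returns None when every byte equals c, i.e. the buffer is
    filled with just that byte.
    """
    for i, b in enumerate(buf):
        if b != c:
            return i
    return None
```

    A caller checking that a buffer is all zeroes would test
    memchr_inv(buf, 0) is None.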

    Signed-off-by: Akinobu Mita
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Joern Engel
    Cc: Marcin Slusarz
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Direct reclaim should never writeback pages. Warn if an attempt is made.

    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman