30 Jul, 2011

4 commits


29 Jul, 2011

2 commits

  • Make the inode mapping bdi consistent with the superblock bdi so that
    dirty pages are flushed properly.

    Signed-off-by: Thieu Le
    Cc: [2.6.39+]
    Signed-off-by: Tyler Hicks

    Thieu Le
     
  • Fixes a regression caused by b5695d04634fa4ccca7dcbc05bb4a66522f02e0b

    Kernel keyring keys containing eCryptfs authentication tokens should not
    be write locked when calling out to ecryptfsd to wrap and unwrap file
    encryption keys. The eCryptfs kernel code cannot hold the key's write
    lock because ecryptfsd needs to request the key after receiving such a
    request from the kernel.

    Without this fix, all file opens and creates will time out and fail when
    using the eCryptfs PKI infrastructure. This is not an issue when using
    passphrase-based mount keys, which is the most widely deployed eCryptfs
    configuration.

    Signed-off-by: Tyler Hicks
    Acked-by: Roberto Sassu
    Tested-by: Roberto Sassu
    Tested-by: Alexis Hafner1
    Cc: [2.6.39+]

    Tyler Hicks
     

28 Jul, 2011

22 commits

  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits)
    tpm_nsc: Fix bug when loading multiple TPM drivers
    tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block
    tpm: Fix compilation warning when CONFIG_PNP is not defined
    TOMOYO: Update kernel-doc.
    tpm: Fix a typo
    tpm_tis: Probing function for Intel iTPM bug
    tpm_tis: Fix the probing for interrupts
    tpm_tis: Delay ACPI S3 suspend while the TPM is busy
    tpm_tis: Re-enable interrupts upon (S3) resume
    tpm: Fix display of data in pubek sysfs entry
    tpm_tis: Add timeouts sysfs entry
    tpm: Adjust interface timeouts if they are too small
    tpm: Use interface timeouts returned from the TPM
    tpm_tis: Introduce durations sysfs entry
    tpm: Adjust the durations if they are too small
    tpm: Use durations returned from TPM
    TOMOYO: Enable conditional ACL.
    TOMOYO: Allow using argv[]/envp[] of execve() as conditions.
    TOMOYO: Allow using executable's realpath and symlink's target as conditions.
    TOMOYO: Allow using owner/group etc. of file objects as conditions.
    ...

    Fix up trivial conflict in security/tomoyo/realpath.c

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
    Btrfs: use the commit_root for reading free_space_inode crcs
    Btrfs: reduce extent_state lock contention for metadata
    Btrfs: remove lockdep magic from btrfs_next_leaf
    Btrfs: make a lockdep class for each root
    Btrfs: switch the btrfs tree locks to reader/writer
    Btrfs: fix deadlock when throttling transactions
    Btrfs: stop using highmem for extent_buffers
    Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
    Btrfs: tag pages for writeback in sync
    Btrfs: fix enospc problems with delalloc
    Btrfs: don't flush delalloc arbitrarily
    Btrfs: use find_or_create_page instead of grab_cache_page
    Btrfs: use a worker thread to do caching
    Btrfs: fix how we merge extent states and deal with cached states
    Btrfs: use the normal checksumming infrastructure for free space cache
    Btrfs: serialize flushers in reserve_metadata_bytes
    Btrfs: do transaction space reservation before joining the transaction
    Btrfs: try to only do one btrfs_search_slot in do_setxattr

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: optimize the negative xattr caching
    xfs: prevent against ioend livelocks in xfs_file_fsync
    xfs: flag all buffers as metadata
    xfs: encapsulate a block of debug code

    Linus Torvalds
     
  • * 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits)
    NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation()
    nfs: don't use d_move in nfs_async_rename_done
    RDMA: Increasing RPCRDMA_MAX_DATA_SEGS
    SUNRPC: Replace xprt->resend and xprt->sending with a priority queue
    SUNRPC: Allow caller of rpc_sleep_on() to select priority levels
    SUNRPC: Support dynamic slot allocation for TCP connections
    SUNRPC: Clean up the slot table allocation
    SUNRPC: Initalise the struct xprt upon allocation
    SUNRPC: Ensure that we grab the XPRT_LOCK before calling xprt_alloc_slot
    pnfs: simplify pnfs files module autoloading
    nfs: document nfsv4 sillyrename issues
    NFS: Convert nfs4_set_ds_client to EXPORT_SYMBOL_GPL
    SUNRPC: Convert the backchannel exports to EXPORT_SYMBOL_GPL
    SUNRPC: sunrpc should not explicitly depend on NFS config options
    NFS: Clean up - simplify the switch to read/write-through-MDS
    NFS: Move the pnfs write code into pnfs.c
    NFS: Move the pnfs read code into pnfs.c
    NFS: Allow the nfs_pageio_descriptor to signal that a re-coalesce is needed
    NFS: Use the nfs_pageio_descriptor->pg_bsize in the read/write request
    NFS: Cache rpc_ops in struct nfs_pageio_descriptor
    ...

    Linus Torvalds
     
  • Chris Mason
     
  • The btrfs transaction code will return any errors that come from
    reserve_metadata_bytes. We need to make sure we don't return funny
    things like 1 or EAGAIN.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Since __proc_create() appends the name it is given to the end of the PDE
    structure that it allocates, there isn't a need to store a name pointer.
    Instead we can just replace the name pointer with a terminal char array of
    _unspecified_ length. The compiler will simply append the string to statically
    defined variables of PDE type overlapping any hole at the end of the structure
    and, unlike specifying an explicitly _zero_ length array, won't give a warning
    if you try to statically initialise it with a string of more than zero length.

    Also, whilst we're at it:

    (1) Move namelen to end just prior to name and reduce it to a single byte
    (name shouldn't be longer than NAME_MAX).

    (2) Move pde_unload_lock two places further on so that if it's four bytes in
    size on a 64-bit machine, it won't cause an unused hole in the PDE struct.

    Signed-off-by: David Howells
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    David Howells
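
The layout change described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel's structure: `struct pde_sketch`, its `low_ino` field, and `pde_sketch_create()` are hypothetical names, and `malloc()` stands in for the kernel allocator. It shows a trailing array of unspecified length holding the name, with a one-byte `namelen` placed just before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the PDE layout; not the kernel's struct. */
struct pde_sketch {
    unsigned int low_ino;
    unsigned char namelen;  /* one byte is enough for names up to 255 bytes */
    char name[];            /* flexible array member: storage follows the struct */
};

/* Allocate the entry and its name as one block, with the name appended. */
static struct pde_sketch *pde_sketch_create(const char *name)
{
    size_t len = strlen(name);
    struct pde_sketch *pde;

    if (len == 0 || len > 255)
        return NULL;        /* would not fit in the one-byte namelen */

    pde = malloc(sizeof(*pde) + len + 1);
    if (!pde)
        return NULL;

    pde->low_ino = 0;
    pde->namelen = (unsigned char)len;
    memcpy(pde->name, name, len + 1);   /* copies the terminating NUL too */
    return pde;
}
```

A caller would use `pde_sketch_create("cpuinfo")` and release the entry with a single `free()`, since the name is not a separate allocation.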
     
  • Now that we are using regular file crcs for the free space cache,
    we can deadlock if we try to read the free_space_inode while we are
    updating the crc tree.

    This commit fixes things by using the commit_root to read the crcs. This is
    safe because the free space cache file would already be loaded if
    that block group had been changed in the current transaction.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • For metadata buffers that don't straddle pages (all of them), btrfs
    can safely use the page uptodate bits and extent_buffer uptodate bit
    instead of needing to use the extent_state tree.

    This greatly reduces contention on the state tree lock.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Before the reader/writer locks, btrfs_next_leaf needed to keep
    the path blocking to avoid making lockdep upset.

    Now that btrfs_next_leaf only takes read locks, this isn't required.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This patch was originally from Tejun Heo. lockdep complains about the btrfs
    locking because we sometimes take btree locks from two different trees at the
    same time. The current classes are based only on level in the btree, which
    isn't enough information for lockdep to figure out if the lock is safe.

    This patch makes a class for each type of tree, and lumps all the FS trees that
    actually have files and directories into the same class.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The btrfs metadata btree is the source of significant
    lock contention, especially in the root node. This
    commit changes our locking to use a reader/writer
    lock.

    The lock is built on top of rw spinlocks, and it
    extends the lock tracking to remember if we have a
    read lock or a write lock when we go to blocking. Atomics
    count the number of blocking readers or writers at any
    given time.

    It removes all of the adaptive spinning from the old code
    and uses only the spinning/blocking hints inside of btrfs
    to decide when it should continue spinning.

    In read heavy workloads this is dramatically faster. In write
    heavy workloads we're still faster because of less contention
    on the root node lock.

    We suffer slightly in dbench because we schedule more often
    during write locks, but all other benchmarks so far are improved.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Hit this nice little deadlock. What happens is this

    __btrfs_end_transaction with throttle set, --use_count so it equals 0
    btrfs_commit_transaction

    btrfs_end_transaction --use_count so now it's -1

    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The extent_buffers have a very complex interface where
    we use HIGHMEM for metadata and try to cache a kmap mapping
    to access the memory.

    The next commit adds reader/writer locks, and concurrent use
    of this kmap cache would make it even more complex.

    This commit drops the ability to use HIGHMEM with extent buffers,
    and rips out all of the related code.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • When we balanced the chunks across the devices, BUG_ON() in
    __finish_chunk_alloc() was triggered.

    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/volumes.c:2568!
    [SNIP]
    Call Trace:
    [] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
    [] do_chunk_alloc+0x330/0x3a0 [btrfs]
    [] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
    [] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
    [] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
    [] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
    [] ? read_block_for_search+0x14d/0x4d0 [btrfs]
    [] btrfs_cow_block+0x10b/0x240 [btrfs]
    [] btrfs_search_slot+0x49e/0x7a0 [btrfs]
    [] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
    [] insert_with_overflow+0x43/0x110 [btrfs]
    [] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
    [] ? map_extent_buffer+0xb0/0xc0 [btrfs]
    [] ? rb_insert_color+0x9d/0x160
    [] ? inode_tree_add+0xf0/0x150 [btrfs]
    [] btrfs_add_link+0xc1/0x1c0 [btrfs]
    [] ? security_inode_init_security+0x1c/0x30
    [] ? btrfs_init_acl+0x4a/0x180 [btrfs]
    [] btrfs_add_nondir+0x2f/0x70 [btrfs]
    [] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
    [] btrfs_create+0x150/0x1d0 [btrfs]
    [] ? generic_permission+0x23/0xb0
    [] vfs_create+0xa5/0xc0
    [] do_last+0x5fe/0x880
    [] path_openat+0xcd/0x3d0
    [] do_filp_open+0x49/0xa0
    [] ? alloc_fd+0x95/0x160
    [] do_sys_open+0x107/0x1e0
    [] ? audit_syscall_entry+0x1bf/0x1f0
    [] sys_open+0x20/0x30
    [] system_call_fastpath+0x16/0x1b
    [SNIP]
    RIP [] __finish_chunk_alloc+0x20a/0x220 [btrfs]

    The reason is:
        Task1                               Space balance task
        do_chunk_alloc()
          __finish_chunk_alloc()
            update device info
            in the chunk tree
              alloc system metadata block
                                            relocate system metadata block group
                                              set system metadata block group
                                              readonly, This block group is the
                                              only one that can allocate space. So
                                              there is no free space that can be
                                              allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
                BUG_ON() in __finish_chunk_alloc()
                was triggered.

    Fix this bug by allocating a new system metadata chunk before relocating the
    old one if we find there is no free space which can be allocated after setting
    the old block group to be read-only.

    Reported-by: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Tested-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Everybody else does this, we need to do it too. If we're syncing, we need to
    tag the pages we're going to write for writeback so we don't end up writing the
    same stuff over and over again if somebody is constantly redirtying our file.
    This will keep us from having latencies with heavy sync workloads. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
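
The tagging idea in this commit can be illustrated with a small user-space model. This is a sketch under stated assumptions, not btrfs code: `struct page_sketch`, `sync_tagged()`, and the redirty hook are hypothetical names. It shows that once the currently dirty pages are tagged up front, a writer that keeps redirtying a page cannot make a single sync pass write it more than once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_sketch {
    bool dirty;
    bool towrite;   /* set only by the tagging pass */
};

/* Hypothetical hook: called after each page write, may redirty pages. */
typedef void (*redirty_fn)(struct page_sketch *pages, size_t n);

/*
 * Tag every page that is dirty right now, then write only tagged pages.
 * Pages redirtied during the walk stay dirty for the next sync, so the
 * work done in this pass is bounded by the initial dirty set.
 */
static size_t sync_tagged(struct page_sketch *pages, size_t n, redirty_fn redirty)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++)
        pages[i].towrite = pages[i].dirty;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].towrite)
            continue;
        pages[i].towrite = false;
        pages[i].dirty = false;     /* "write" the page */
        written++;
        if (redirty)
            redirty(pages, n);
    }
    return written;
}

/* Example redirtier: keeps dirtying page 0, like a busy writer. */
static void redirty_first(struct page_sketch *pages, size_t n)
{
    if (n > 0)
        pages[0].dirty = true;
}
```

With two of four pages dirty and `redirty_first` as the hook, the pass writes exactly two pages and leaves page 0 dirty for the next sync.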
     
  • So I had this brilliant idea to use atomic counters for outstanding and reserved
    extents, but this turned out to be a bad idea. Consider this where we have 1
    outstanding extent and 1 reserved extent

    Reserver                                    Releaser
                                                atomic_dec(outstanding) now 0
    atomic_read(outstanding)+1 get 1
    atomic_read(reserved) get 1
    don't actually reserve anything because
    they are the same
                                                atomic_cmpxchg(reserved, 1, 0)
    atomic_inc(outstanding)
    atomic_add(0, reserved)
                                                free reserved space for 1 extent

    Then the reserver now has no actual space reserved for it, and when it goes to
    finish the ordered IO it won't have enough space to do its allocation and you
    get those lovely warnings.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
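
The interleaving in this commit message can be replayed step by step in a single thread. The sketch below uses C11 `<stdatomic.h>` in place of the kernel's atomic helpers; `struct extent_counts` and `replay_race()` are hypothetical names. It only demonstrates the end state of the race: each counter is updated atomically, yet the pair ends up inconsistent.

```c
#include <assert.h>
#include <stdatomic.h>

struct extent_counts {
    atomic_int outstanding;
    atomic_int reserved;
};

/* Replays the Reserver/Releaser interleaving, one step at a time. */
static void replay_race(struct extent_counts *c)
{
    int want, have, to_reserve;
    int expected = 1;

    /* Start with 1 outstanding extent and 1 reserved extent. */
    atomic_init(&c->outstanding, 1);
    atomic_init(&c->reserved, 1);

    /* Releaser: atomic_dec(outstanding), now 0. */
    atomic_fetch_sub(&c->outstanding, 1);

    /* Reserver: both reads give 1, so nothing extra is reserved. */
    want = atomic_load(&c->outstanding) + 1;
    have = atomic_load(&c->reserved);
    to_reserve = want - have;

    /* Releaser: atomic_cmpxchg(reserved, 1, 0). */
    atomic_compare_exchange_strong(&c->reserved, &expected, 0);

    /* Reserver: atomic_inc(outstanding), atomic_add(0, reserved). */
    atomic_fetch_add(&c->outstanding, 1);
    atomic_fetch_add(&c->reserved, to_reserve);

    /* Releaser then frees the space that backed the one reserved extent. */
}
```

After the replay, `outstanding` is 1 and `reserved` is 0, which is the state the message describes: an outstanding extent with no reservation behind it. Updating both counters under a single lock would rule out this interleaving.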
     
  • Kill the check to see if we have 512mb of reserved space in delalloc and
    shrink_delalloc if we do. This causes unexpected latencies and we have other
    logic to see if we need to throttle. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
    GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
    need GFP_NOFS so we don't deadlock. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • A user reported a deadlock when copying a bunch of files. This is because they
    were low on memory and kthreadd got hung up trying to migrate pages for an
    allocation when starting the caching kthread. The page was locked by the person
    starting the caching kthread. To fix this we just need to use the async thread
    stuff so that the threads are already created and we don't have to worry about
    deadlocks. Thanks,

    Reported-by: Roman Mamedov
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
    jfs: clean up some compiler warnings

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
    GFS2: Fix mount hang caused by certain access pattern to sysfs files

    Linus Torvalds
     

27 Jul, 2011

12 commits

  • Since the addition of file capabilities every write needs to read xattrs to
    check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added
    a flag to cache the fact that we do not have any attributes on an inode.
    Make sure to already mark a file as not having any attributes when reading
    it from disk in case it doesn't even have an attribute fork. Based on an
    earlier patch from Andi Kleen.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We need to take some locks to prevent new ioends from coming in when we wait
    for all existing ones to go away. Up to Linux 3.0 that was done using the
    i_mutex held by the VFS fsync code, but now that we are called without
    it we need to take care of it ourselves. Use the I/O lock instead of
    i_mutex just like we do in other places.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
    Now that REQ_META bios aren't treated specially in the CFQ I/O scheduler
    anymore, we can tag all buffers as metadata to make blktrace traces more
    meaningful. Note that we use buffers also to zero out partial blocks
    in the preallocation / hole punching code, and while they operate on
    data blocks the zeros written certainly aren't data. I think this case
    is borderline metadata enough to not bother special casing it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Pull into a helper function some debug-only code that validates a
    xfs_da_blkinfo structure that's been read from disk.

    Signed-off-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Alex Elder
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    merge fchmod() and fchmodat() guts, kill ancient broken kludge
    xfs: fix misspelled S_IS...()
    xfs: get rid of open-coded S_ISREG(), etc.
    vfs: document locking requirements for d_move, __d_move and d_materialise_unique
    omfs: fix (mode & S_IFDIR) abuse
    btrfs: S_ISREG(mode) is not mode & S_IFREG...
    ima: fmode_t misspelled as mode_t...
    pci-label.c: size_t misspelled as mode_t
    jffs2: S_ISLNK(mode & S_IFMT) is pointless
    snd_msnd ->mode is fmode_t, not mode_t
    v9fs_iop_get_acl: get rid of unused variable
    vfs: dont chain pipe/anon/socket on superblock s_inodes list
    Documentation: Exporting: update description of d_splice_alias
    fs: add missing unlock in default_llseek()

    Linus Torvalds
     
    This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • acct_arg_size() takes ->page_table_lock around add_mm_counter() if
    !SPLIT_RSS_COUNTING. This is not needed after commit 172703b08cd0 ("mm:
    delete non-atomic mm counter implementation").

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Matt Fleming
    Cc: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    If CONFIG_MODULES=n, it makes no sense to retry the list of binary format
    handlers because request_module() will not modify the list.

    Signed-off-by: Tetsuo Handa
    Cc: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Currently, search_binary_handler() tries to load a binary loader module
    using request_module() if a loader for the requested program is not yet
    loaded. But a second request_module() attempt does not affect the result
    of search_binary_handler().

    If request_module() triggers recursion, calling request_module() twice
    causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is
    not an infinite loop, but it is long enough for users to perceive a hang.

    Therefore, this patch changes the code so that request_module() is not
    called twice, making 1 to the power of MAX_KMOD_CONCURRENT repetitions in
    case of recursion.

    Signed-off-by: Tetsuo Handa
    Reported-by: Richard Weinberger
    Tested-by: Richard Weinberger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
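
The arithmetic in this commit message can be checked with a tiny counting function. This is a sketch, not kernel code: `deepest_calls()` is a hypothetical helper, and small depths are used because 2 to the power of 50 is far too large to enumerate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many times the deepest level is reached when every level
 * makes `tries` module requests and each request recurses once more,
 * until `depth` nested requests are in flight.
 */
static uint64_t deepest_calls(unsigned int tries, unsigned int depth)
{
    uint64_t total = 0;

    if (depth == 0)
        return 1;
    for (unsigned int i = 0; i < tries; i++)
        total += deepest_calls(tries, depth - 1);
    return total;
}
```

With two tries per level the count doubles at each level (`deepest_calls(2, 10)` is 1024), while with one try it stays at 1 for any depth.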
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between
    shift_arg_pages() and rmap_walk() during migration by not migrating
    temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
    and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile
    time one, so BUILD_BUG_ON is more appropriate.

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
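
The difference between a run-time and a compile-time check can be sketched in plain C. The flag values and the `SKETCH_` names below are hypothetical and chosen only for illustration. The macro uses the classic negative-array-size trick, so the translation unit fails to compile if the condition is true.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for illustration. */
#define SKETCH_STACK_FLAGS             0x0100UL
#define SKETCH_STACK_INCOMPLETE_SETUP  0x0003UL

/* Negative-array-size trick: compiles only if cond is false. */
#define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

static unsigned long stack_setup_flags(void)
{
    /* Rejected at compile time if the two masks ever share a bit. */
    SKETCH_BUILD_BUG_ON(SKETCH_STACK_FLAGS & SKETCH_STACK_INCOMPLETE_SETUP);
    return SKETCH_STACK_FLAGS | SKETCH_STACK_INCOMPLETE_SETUP;
}
```

Because both operands are constants, nothing is gained by deferring the check to run time; a build failure reports the overlap earlier and costs nothing in the running program.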
     
  • If an inode's mode permits opening /proc/PID/io and the resulting file
    descriptor is kept across execve() of a setuid or similar binary, the
    ptrace_may_access() check tries to prevent using this fd against the
    task with escalated privileges.

    Unfortunately, there is a race in the check against execve(). If
    execve() is processed after the ptrace check, but before the actual io
    information gathering, io statistics will be gathered from the
    privileged process. At least in theory this might lead to gathering
    sensitive information (like ssh/ftp password length) that wouldn't be
    available otherwise.

    Holding task->signal->cred_guard_mutex while gathering the io
    information should protect against the race.

    The order of locking is similar to the one inside of ptrace_attach():
    first goes cred_guard_mutex, then lock_task_sighand().

    Signed-off-by: Vasiliy Kulikov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
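
The locking pattern in this commit, holding one lock across both the permission check and the data read, can be sketched in user space. This is an illustration under stated assumptions: `struct task_sketch` and `read_io_stats()` are hypothetical, a pthread mutex stands in for cred_guard_mutex, and a boolean stands in for the ptrace_may_access() check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model; field names are illustrative, not the kernel's. */
struct task_sketch {
    pthread_mutex_t cred_guard;   /* stands in for cred_guard_mutex */
    bool privileged;              /* flips when a setuid exec completes */
    unsigned long bytes_read;
};

/*
 * Returns 0 and fills *out when access is allowed, -1 otherwise.
 * The check and the read happen under one lock hold, so a privilege
 * change that takes the same lock cannot slip in between them.
 */
static int read_io_stats(struct task_sketch *t, unsigned long *out)
{
    int ret = -1;

    pthread_mutex_lock(&t->cred_guard);
    if (!t->privileged) {         /* stands in for ptrace_may_access() */
        *out = t->bytes_read;
        ret = 0;
    }
    pthread_mutex_unlock(&t->cred_guard);
    return ret;
}
```

If the check were done before taking the lock, the task could become privileged between the check and the read, which is the window the commit closes.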
     
    Change the return value to ENOENT. This value is then returned when
    opening a proc entry that has been removed, for example
    open("/proc/bus/pci/XX/YY") while the corresponding device is being
    hot-removed.

    Signed-off-by: Daisuke Ogino
    Cc: Jesse Barnes
    Acked-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Ogino