23 Jul, 2010

1 commit

  • Fix the security problem in the CIFS filesystem DNS lookup code in which a
    malicious redirect could be installed by a random user by simply adding a
    result record into one of their keyrings with add_key() and then invoking a
    CIFS DFS lookup [CVE-2010-2524].

    This is done by creating an internal keyring specifically for the caching of
    DNS lookups. To enforce the use of this keyring, the module init routine
    creates a set of override credentials with the keyring installed as the thread
    keyring and instructs request_key() to only install lookup result keys in that
    keyring.

    The override is then applied around the call to request_key().

    This has some additional benefits when a kernel service uses this module to
    request a key:

    (1) The result keys are owned by root, not the user that caused the lookup.

    (2) The result keys don't pop up in the user's keyrings.

    (3) The result keys don't come out of the quota of the user that caused the
    lookup.

    The keyring can be viewed as root by doing cat /proc/keys:

    2a0ca6c3 I----- 1 perm 1f030000 0 0 keyring .dns_resolver: 1/4

    It can then be listed with 'keyctl list' by root.

    # keyctl list 0x2a0ca6c3
    1 key in keyring:
    726766307: --alswrv 0 0 dns_resolver: foo.bar.com

    Signed-off-by: David Howells
    Reviewed-and-Tested-by: Jeff Layton
    Acked-by: Steve French
    Signed-off-by: Linus Torvalds

    David Howells
     

22 Jul, 2010

1 commit


21 Jul, 2010

1 commit


20 Jul, 2010

7 commits

  • * 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
    xfs: track AGs with reclaimable inodes in per-ag radix tree
    xfs: convert inode shrinker to per-filesystem contexts
    mm: add context argument to shrinker callback

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
    Btrfs: fix CLONE ioctl destination file size expansion to block boundary
    Btrfs: fix split_leaf double split corner case

    Linus Torvalds
     
  • https://bugzilla.kernel.org/show_bug.cgi?id=16348

    When the filesystem grows to a large number of allocation groups,
    the summing of reclaimable inodes gets expensive. In many cases,
    most AGs won't have any reclaimable inodes and so we are wasting CPU
    time aggregating over these AGs. This is particularly important for
    the inode shrinker that gets called frequently under memory
    pressure.

    To avoid the overhead, track AGs with reclaimable inodes in the
    per-ag radix tree so that we can find all the AGs with reclaimable
    inodes via a simple gang tag lookup. This involves setting the tag
    when the first reclaimable inode is tracked in the AG, and removing
    the tag when the last reclaimable inode is removed from the tree.
    Then the summation process becomes a loop walking the radix tree
    summing AGs with the reclaim tag set.

    This significantly reduces the overhead of scanning - a 6400 AG
    filesystem now only uses about 25% of a CPU in kswapd while slab
    reclaim progresses instead of being permanently stuck at 100% CPU
    and making little progress. Clean filesystems will see
    no overhead and the overhead only increases linearly with the number
    of dirty AGs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
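
    The tagging scheme described above can be sketched in userspace. This is
    a minimal illustration, not the XFS code: a 64-bit word stands in for the
    radix tree tag, and struct ag_tracker, ag_inode_add(), ag_inode_del() and
    ag_sum_reclaimable() are hypothetical names. The tag is set when an AG
    gets its first reclaimable inode and cleared when it loses its last one,
    so the summation only visits tagged AGs.

```c
#include <assert.h>
#include <stdint.h>

#define NR_AGS 64 /* illustrative: small enough for one 64-bit tag word */

/* Hypothetical tracker; the tag word stands in for the radix tree tag. */
struct ag_tracker {
    uint64_t reclaim_tag;             /* bit n set => AG n has reclaimable inodes */
    unsigned int reclaimable[NR_AGS]; /* per-AG count of reclaimable inodes */
};

static void ag_inode_add(struct ag_tracker *t, unsigned int ag)
{
    assert(ag < NR_AGS);
    if (t->reclaimable[ag]++ == 0)
        t->reclaim_tag |= UINT64_C(1) << ag;    /* first inode: set the tag */
}

static void ag_inode_del(struct ag_tracker *t, unsigned int ag)
{
    assert(ag < NR_AGS && t->reclaimable[ag] > 0);
    if (--t->reclaimable[ag] == 0)
        t->reclaim_tag &= ~(UINT64_C(1) << ag); /* last inode: clear the tag */
}

/* Sum reclaimable inodes, visiting only AGs whose tag bit is set. */
static unsigned int ag_sum_reclaimable(const struct ag_tracker *t,
                                       unsigned int *visited)
{
    unsigned int sum = 0;
    uint64_t tags = t->reclaim_tag;

    *visited = 0;
    while (tags) {
        unsigned int ag = 0;

        while (!(tags & (UINT64_C(1) << ag)))
            ag++;               /* index of the lowest tagged AG */
        sum += t->reclaimable[ag];
        tags &= tags - 1;       /* drop that AG from the working copy */
        (*visited)++;
    }
    return sum;
}
```

    With two tagged AGs out of 64, the summation loop runs twice rather than
    64 times, which is the effect the commit message describes for clean AGs.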
     
  • Now that the shrinker passes us a context, wire up a shrinker context per
    filesystem. This allows us to remove the global mount list and the
    locking problems that it introduced. It also means that a shrinker call
    does not need to traverse clean filesystems before finding a
    filesystem with reclaimable inodes. This significantly reduces
    scanning overhead when lots of filesystems are present.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • 1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
    whether the donor file is append-only before writing to it.

    2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
    overflow that allows a user to specify an out-of-bounds range to copy
    from the source file (if off + len wraps around). I haven't been able
    to successfully exploit this, but I'd imagine that a clever attacker
    could use this to read things he shouldn't. Even if it's not
    exploitable, it couldn't hurt to be safe.

    Signed-off-by: Dan Rosenberg
    cc: stable@kernel.org
    Signed-off-by: Chris Mason

    Dan Rosenberg
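
    The wrap-around in point 2 can be avoided by never computing off + len
    at all. The following is a minimal sketch of that kind of check, not
    the actual btrfs fix; clone_range_ok() and its parameters are
    hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the btrfs code): returns true when the byte
 * range [off, off + len) lies inside a source file of src_size bytes.
 * The sum off + len is never formed, so it cannot wrap around.
 */
static bool clone_range_ok(uint64_t off, uint64_t len, uint64_t src_size)
{
    if (off > src_size)
        return false;
    /* src_size - off cannot underflow because off <= src_size here. */
    assert(src_size - off <= src_size);
    return len <= src_size - off;
}
```

    A naive test such as off + len > src_size would accept off = 8 with
    len = UINT64_MAX, because the sum wraps to 7. The subtraction form
    rejects it.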
     
  • The CLONE and CLONE_RANGE ioctls round up the range of extents being
    cloned to the block size when the range to clone extends to the end of file
    (this is always the case with CLONE). It was then using that offset when
    extending the destination file's i_size. Fix this by not setting i_size
    beyond the originally requested ending offset.

    This bug was introduced by a22285a6 (2.6.35-rc1).

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
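
    The distinction between the rounded-up extent range and the file size
    can be sketched as follows. This is an illustration under assumptions,
    not the btrfs code: round_up_block() and dest_size_after_clone() are
    hypothetical helpers, and overflow of dest_off + len is ignored for
    brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of block_size, which must be a power of two. */
static uint64_t round_up_block(uint64_t x, uint64_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return (x + block_size - 1) & ~(block_size - 1);
}

/*
 * Hypothetical helper: size of the destination file after cloning len
 * bytes to dest_off. Extents may be copied up to the rounded-up block
 * boundary, but the size only grows to the originally requested end.
 */
static uint64_t dest_size_after_clone(uint64_t old_size, uint64_t dest_off,
                                      uint64_t len)
{
    uint64_t requested_end = dest_off + len;

    return requested_end > old_size ? requested_end : old_size;
}
```

    For a 5000-byte clone with 4096-byte blocks, the extent range ends at
    8192 but the destination size should end up as 5000.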
     
  • split_leaf was not properly balancing leaves when it was forced to
    split a leaf twice. This commit adds an extra push left and right
    before forcing the double split in hopes of getting the slot where
    we want to insert at either the start or end of the leaf.

    If the extra pushes do work, then we are able to avoid splitting twice
    and we keep the tree properly balanced.

    Signed-off-by: Chris Mason

    Chris Mason
     

19 Jul, 2010

3 commits

  • Partition boundary calculation fails for DASD FBA disks under the
    following conditions:
    - disk is formatted with CMS FORMAT with a blocksize of more than
    512 bytes
    - all of the disk is reserved to a single CMS file using CMS RESERVE
    - the disk is accessed using the DIAG mode of the DASD driver

    Under these circumstances, the partition detection code tries to
    read the CMS label block containing partition-relevant information
    from logical block offset 1, while it is in fact located at physical
    block offset 1.

    Fix this problem by using the correct CMS label block location
    depending on the device type as determined by the DASD SENSE ID
    information.

    Signed-off-by: Peter Oberparleiter
    Signed-off-by: Martin Schwidefsky

    Peter Oberparleiter
     
  • The current shrinker implementation requires the registered callback
    to have global state to work from. This makes it difficult to shrink
    caches that are not global (e.g. per-filesystem caches). Pass the shrinker
    structure to the callback so that users can embed the shrinker structure
    in the context the shrinker needs to operate on and get back to it in the
    callback via container_of().

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
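
    The embedding pattern described above can be shown in userspace. This
    is a simplified sketch: struct shrinker here has a single callback and
    is not the kernel's definition, struct fs_cache and fs_cache_shrink()
    are hypothetical, and container_of() is re-implemented with offsetof().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified shrinker: the callback is handed the shrinker itself. */
struct shrinker {
    int (*shrink)(struct shrinker *s, int nr_to_scan);
};

/* Hypothetical per-filesystem context that embeds its own shrinker. */
struct fs_cache {
    int nr_objects;
    struct shrinker shrinker;
};

static int fs_cache_shrink(struct shrinker *s, int nr_to_scan)
{
    /* Recover the enclosing context from the embedded member. */
    struct fs_cache *cache = container_of(s, struct fs_cache, shrinker);

    assert(nr_to_scan >= 0);
    if (nr_to_scan > cache->nr_objects)
        nr_to_scan = cache->nr_objects;
    cache->nr_objects -= nr_to_scan;
    return cache->nr_objects; /* objects left in this cache only */
}
```

    Each fs_cache instance registers its own embedded shrinker, so the
    callback needs no global list to find the cache it should work on.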
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2: Silence gcc warning in ocfs2_write_zero_page().
    jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
    ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
    ocfs2: Don't duplicate pages past i_size during CoW.
    ocfs2: tighten up strlen() checking
    ocfs2: Make xattr reflink work with new local alloc reservation.
    ocfs2: make xattr extension work with new local alloc reservation.
    ocfs2: Remove the redundant cpu_to_le64.
    ocfs2/dlm: don't access beyond bitmap size
    ocfs2: No need to zero pages past i_size.
    ocfs2: Zero the tail cluster when extending past i_size.
    ocfs2: When zero extending, do it by page.
    ocfs2: Limit default local alloc size within bitmap range.
    ocfs2: Move orphan scan work to ocfs2_wq.
    fs/ocfs2/dlm: Add missing spin_unlock

    Linus Torvalds
     

17 Jul, 2010

3 commits

  • ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
    doesn't know that. Set ret=0 just to make gcc happy.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • Strip the cap and dentry releases from replayed messages. They can
    cause the shared state to get out of sync because they were generated
    (with the request message) earlier, and no longer reflect the current
    client state.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Replayed rename operations (after an mds failure/recovery) were broken
    because the request paths were regenerated from the dentry names, which
    get mangled when d_move() is called.

    Instead, resend the previous request message when replaying completed
    operations. Just make sure the REPLAY flag is set and the target ino is
    filled in.

    This fixes problems with workloads doing renames when the MDS restarts,
    where the rename operation appears to succeed, but on mds restart then
    fails (leading to client confusion, app breakage, etc.).

    Signed-off-by: Sage Weil

    Sage Weil
     

16 Jul, 2010

3 commits

  • OCFS2 uses t_commit trigger to compute and store checksum of the just
    committed blocks. When a buffer has b_frozen_data, checksum is computed
    for it instead of b_data but this can result in an old checksum being
    written to the filesystem in the following scenario:

    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
    jh->b_next_transaction is set to transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
    any more journal operation
    13) Checkpointing finally happens, and it just spools the bh via normal buffer
    writeback. This will write b_data, which was never triggered on and thus
    contains a wrong (old) checksum.

    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Jan Kara
     
  • For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
    before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
    dlm_migration_can_proceed() for that purpose. However, if the node is
    down, dlm_migration_can_proceed() will also return "go ahead". In this
    rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
    the BUG_ON() that trips over this condition.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang
     
  • During CoW, the pages after i_size don't contain valid data, so there's
    no need to read and duplicate them.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

15 Jul, 2010

5 commits

  • This patch fixes a kernel Oops in the GFS2 rename code.

    The problem was in the way the gfs2 directory code was trying
    to re-use sentinel directory entries.

    In the failing case, gfs2's rename function was renaming a
    file to another name that had the same non-trivial length.
    The file being renamed happened to be the first directory
    entry on the leaf block.

    First, the rename code (gfs2_rename in ops_inode.c) found the
    original directory entry and decided it could do its job by
    simply replacing the directory entry with another. Therefore
    it determined correctly that no block allocations were needed.

    Next, the rename code deleted the old directory entry prior to
    replacing it with the new name. Therefore, the soon-to-be
    replaced directory entry was temporarily made into a directory
    entry "sentinel" or a place holder at the start of a leaf block.

    Lastly, it went to re-add the replacement directory entry in
    that leaf block. However, when gfs2_dirent_find_space was
    looking for space in the leaf block, it used the wrong value
    for the sentinel. That threw off its calculations, so it later
    decided it couldn't really re-use the sentinel and therefore
    had to allocate a new leaf block. But because it had previously decided
    to re-use the directory entry, it hadn't taken the time to
    grab a new block allocation for the inode. Therefore, the
    inode's i_alloc pointer was still NULL and it crashed trying to
    reference it.

    In the case of sentinel directory entries, the entire dirent is
    reused, not just the "free space" portion of it, and therefore
    the function gfs2_dirent_find_space should use the value 0
    rather than GFS2_DIRENT_SIZE(0) for the actual dirent size.

    Fixing this calculation enables the reproducer programs to work
    properly.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • HighMem pages on i686 do not get mapped to the buffer_heads and this was
    causing a NULL pointer dereference when we were trying to memset page buffers
    to zero.
    We now use zero_user() that kmaps the page and directly manipulates page data.
    This patch also fixes a boundary condition that was incorrect.

    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
     
  • This patch fixes a problem in an error path when looking
    up dinodes. There are two sister-functions, gfs2_inode_lookup
    and gfs2_process_unlinked_inode. Both functions acquire and
    hold the i_iopen glock for the dinode being looked up. The last
    thing they try to do is hold the i_gl glock for the dinode.
    If that glock fails for some reason, the error path was
    incorrectly calling gfs2_glock_put for the i_iopen glock twice.
    This resulted in the glock being prematurely freed. The
    "minimum hold time" usually kept the glock in memory, but the
    lock interface to dlm (aka lock_dlm) freed its memory for the
    glock. In some circumstances, it would cause dlm's dlm_astd daemon
    to try to call the bast function for the freed lock_dlm memory,
    which resulted in a NULL pointer dereference.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch fixes bugzilla bug #590878: GFS2: recovery stuck on
    transaction lock. We set the frozen flag on the glock when we receive
    a completion that cannot be delivered due to blocked locks. At that
    point we check to see whether the first waiting holder has the noexp
    flag set. If the noexp lock is queued later, then we need to unfreeze
    the glock at that point in time, namely, in the glock work function.

    This patch was originally written by Steve Whitehouse, but since
    he's on holiday, I'm submitting it. It's been well tested with a
    complex recovery test called revolver.

    Signed-off-by: Steve Whitehouse
    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • This patch replaces a statement that got dropped out by accident.
    Without the patch, truncates on stuffed (very small) files cause
    those files to have an unpredictable size.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

13 Jul, 2010

6 commits

  • This function is only called from one place and it's like this:
    dlm_register_domain(conn->cc_name, dlm_key, &fs_version);

    The "conn->cc_name" is 64 characters long. If strlen(conn->cc_name)
    were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
    strlen() doesn't count the terminating NUL character.

    In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
    64 character buffers. The only exception is nd_name from struct
    o2nm_node.

    Anyway I looked into it and in this case the domain string comes from
    osb->uuid_str in ocfs2_setup_osb_uuid(). That's 32 characters plus the
    terminating NUL, which easily fits into O2NM_MAX_NAME_LEN. This patch doesn't change how
    the code works, but I think it makes the code a little cleaner.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Joel Becker

    Dan Carpenter
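
    The off-by-one discussed above comes down to comparing strlen() with
    the buffer size using < rather than <=. A minimal sketch, assuming a
    64-byte buffer as in the text; NAME_BUF_LEN and name_fits() are
    hypothetical names, not the ocfs2 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAME_BUF_LEN 64 /* illustrative; mirrors the 64-byte buffers in the text */

/*
 * A name fits in a NAME_BUF_LEN-byte buffer only if its strlen() is
 * strictly smaller than the buffer, leaving one byte for the terminator.
 */
static bool name_fits(const char *name)
{
    assert(name != NULL);
    return strlen(name) < NAME_BUF_LEN;
}
```

    A 63-character name fits; a 64-character name does not, even though
    its length equals the buffer size.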
     
  • The new reservation code in local alloc has added the limitation
    that the caller must handle the case where the local alloc
    doesn't give us enough contiguous clusters. This broke the old
    xattr reflink code.

    So this patch updates the xattr reflink code so that it can
    handle the case where local alloc gives us one cluster at a time.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • The old ocfs2_xattr_extent_allocation is too optimistic about
    the clusters we can get. If the file system is
    too fragmented, ocfs2_add_clusters_in_btree will return
    -EAGAIN and we need to allocate clusters once again.

    So this patch changes it to a while loop so that we can allocate
    clusters until we reach clusters_to_add.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Tao Ma
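
    The retry loop can be sketched against a toy allocator. This is an
    illustration only: struct fake_alloc, alloc_some() and extend_by() are
    hypothetical, and the real code retries on an error return from
    ocfs2_add_clusters_in_btree rather than on a short count.

```c
#include <assert.h>

/* Hypothetical allocator that hands out at most max_contig clusters per call. */
struct fake_alloc {
    unsigned int free;       /* clusters still available */
    unsigned int max_contig; /* largest contiguous run it can return */
};

static unsigned int alloc_some(struct fake_alloc *a, unsigned int want)
{
    unsigned int got = want;

    if (got > a->max_contig)
        got = a->max_contig;
    if (got > a->free)
        got = a->free;
    a->free -= got;
    return got;
}

/* Keep allocating until clusters_to_add is reached or space runs out. */
static unsigned int extend_by(struct fake_alloc *a, unsigned int clusters_to_add)
{
    unsigned int done = 0;

    while (done < clusters_to_add) {
        unsigned int got = alloc_some(a, clusters_to_add - done);

        if (got == 0)
            break; /* nothing left to allocate */
        done += got;
    }
    assert(done <= clusters_to_add);
    return done; /* equals clusters_to_add on success */
}
```

    With a fragmented allocator that returns one cluster per call, a
    request for four clusters still completes, just over four iterations.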
     
  • In ocfs2_block_group_alloc, we set c_blkno by bg->bg_blkno.
    But actually bg->bg_blkno is already changed to little endian
    in ocfs2_block_group_fill. So remove the extra cpu_to_le64.

    Reported-by: Marcos Matsunaga
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • dlm->recovery_map is defined as
    unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];

    We should treat O2NM_MAX_NODES as the bit map size in bits.
    This patch fixes a bit operation that takes O2NM_MAX_NODES + 1 as bitmap size.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang
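
    The sizing rule can be illustrated in userspace. This is a sketch, not
    the dlm code: MAX_NODES is an illustrative stand-in for O2NM_MAX_NODES,
    and set_node() and find_next_set() are hypothetical helpers. The search
    takes the bitmap size in bits and returns that size when nothing is
    found, the same convention as the kernel's find_next_bit().

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 255 /* illustrative stand-in for O2NM_MAX_NODES */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static void set_node(unsigned long *map, unsigned int node)
{
    assert(node < MAX_NODES); /* valid bits are 0 .. MAX_NODES - 1 */
    map[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

/*
 * Return the first set bit at or after start, or size if there is none.
 * size is the number of valid bits, so it must be MAX_NODES here and
 * not MAX_NODES + 1.
 */
static unsigned int find_next_set(const unsigned long *map, unsigned int size,
                                  unsigned int start)
{
    for (unsigned int i = start; i < size; i++)
        if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return size;
}
```

    Passing MAX_NODES + 1 as the size would make the search examine one bit
    past the last valid node number.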
     
  • When ocfs2 fills a hole, it does so by allocating clusters. When a
    cluster is larger than the write, ocfs2 must zero the portions of the
    cluster outside of the write. If the clustersize is smaller than a
    pagecache page, this is handled by the normal pagecache mechanisms, but
    when the clustersize is larger than a page, ocfs2's write code will zero
    the pages adjacent to the write. This makes sure the entire cluster is
    zeroed correctly.

    Currently ocfs2 behaves exactly the same when writing past i_size.
    However, this means ocfs2 is writing zeroed pages for portions of a new
    cluster that are beyond i_size. The page writeback code isn't expecting
    this. It treats all pages past the one containing i_size as left behind
    due to a previous truncate operation.

    Thankfully, ocfs2 calculates the number of pages it will be working on
    up front. The rest of the write code merely honors the original
    calculation. We can simply trim the number of pages to only cover the
    actual file data.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     

10 Jul, 2010

2 commits


09 Jul, 2010

3 commits

  • The buffer was too small. Make it bigger, use snprintf(), put brackets
    around the ipv6 address to avoid mixing it up with the :port, and use the
    ever-so-handy %pI[46] formats.

    Signed-off-by: Sage Weil

    Sage Weil
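
    The bracketed address:port layout can be reproduced in userspace. The
    %pI4 and %pI6 specifiers mentioned above are kernel printk extensions,
    so this sketch substitutes POSIX inet_ntop(); format_addr() is a
    hypothetical helper, not the ceph code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Hypothetical formatter: writes "a.b.c.d:port" for IPv4 and
 * "[v6-address]:port" for IPv6, so the port separator is unambiguous.
 * Returns snprintf()'s result, or -1 if the address cannot be converted.
 */
static int format_addr(char *buf, size_t len, int family, const void *addr,
                       unsigned int port)
{
    char ip[INET6_ADDRSTRLEN]; /* large enough for either family */

    assert(buf != NULL && addr != NULL);
    if (!inet_ntop(family, addr, ip, sizeof(ip)))
        return -1;
    if (family == AF_INET6)
        return snprintf(buf, len, "[%s]:%u", ip, port);
    return snprintf(buf, len, "%s:%u", ip, port);
}
```

    Because snprintf() is bounded by len, an undersized buffer yields a
    truncated string instead of an overflow, and the return value shows
    how much space would have been needed.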
     
  • ocfs2's allocation unit is the cluster. This can be larger than a block
    or even a memory page. This means that a file may have many blocks in
    its last extent that are beyond the block containing i_size. There also
    may be more unwritten extents after that.

    When ocfs2 grows a file, it zeros the entire cluster in order to ensure
    future i_size growth will see cleared blocks. Unfortunately,
    block_write_full_page() drops the pages past i_size. This means that
    ocfs2 is actually leaking garbage data into the tail end of that last
    cluster. This is a bug.

    We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
    when a write or truncate is past i_size. They will use
    ocfs2_zero_extend() to ensure the data is properly zeroed.

    Older versions of ocfs2_zero_extend() simply zeroed every block between
    i_size and the zeroing position. This presumes three things:

    1) There is allocation for all of these blocks.
    2) The extents are not unwritten.
    3) The extents are not refcounted.

    (1) and (2) hold true for non-sparse filesystems, which used to be the
    only users of ocfs2_zero_extend(). (3) is another bug.

    Since we're now using ocfs2_zero_extend() for sparse filesystems as
    well, we teach ocfs2_zero_extend() to check every extent between
    i_size and the zeroing position. If the extent is unwritten, it is
    ignored. If it is refcounted, it is CoWed. Then it is zeroed.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     
  • ocfs2_zero_extend() does its zeroing block by block, but it calls a
    function named ocfs2_write_zero_page(). Let's have
    ocfs2_write_zero_page() handle the page level. From
    ocfs2_zero_extend()'s perspective, it is now page-at-a-time.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     

08 Jul, 2010

2 commits


07 Jul, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix crush device 'out' threshold to 1.0, not 0.1
    ceph: fix caps usage accounting for import (non-reserved) case
    ceph: only release clean, unused caps with mds requests
    ceph: fix crush CHOOSE_LEAF when type is already a leaf
    ceph: fix crush recursion
    ceph: fix caps debugfs entry
    ceph: delay umount until all mds requests drop inode+dentry refs
    ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
    ceph: fix crush map update decoding
    ceph: fix message memory leak, uninitialized variable
    ceph: fix map handler error path
    ceph: some endianity fixes

    Linus Torvalds
     

06 Jul, 2010

2 commits

  • First, remove items from work_list as soon as we start working on them. This
    means we don't have to track any pending or visited state and can get
    rid of all the RCU magic freeing the work items - we can simply free
    them once the operation has finished. Second, use a real completion for
    tracking synchronous requests - if the caller sets the completion pointer
    we complete it, otherwise use it as a boolean indicator that we can free
    the work item directly. Third, unify struct wb_writeback_args and struct
    bdi_work into a single data structure, wb_writeback_work. Previously we
    set all parameters into a struct wb_writeback_args, copied it into
    struct bdi_work, and copied it again on the stack to use it there. Instead,
    just allocate one structure dynamically or on the stack and use it
    all the way through the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The case where we have a superblock doesn't require a loop here as we scan
    over all inodes in writeback_sb_inodes. Split it out into a separate helper
    to make the code simpler. This also allows us to get rid of the sb member in
    struct writeback_control, which was rather out of place there.

    Also update the comments in writeback_sb_inodes that explain the handling
    of inodes from wrong superblocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig