02 Aug, 2010

1 commit

  • nfs_commit_inode() needs to be defined irrespective of whether or not
    we are supporting NFSv3 and NFSv4.

    Allow the compiler to optimise away code in the NFSv2-only case by
    converting it into an inlined stub function.

    Reported-and-tested-by: Ingo Molnar
    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
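
    A minimal sketch of the stub approach described in this entry, assuming the
    usual NFSv3/NFSv4 config symbols; the declaration placement and the zero
    return value are illustrative rather than the exact in-tree change:

        /* real commit support is only built for NFSv3/NFSv4 */
        #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
        extern int nfs_commit_inode(struct inode *inode, int how);
        #else
        /* NFSv2-only build: an inline stub the compiler can optimise away */
        static inline int nfs_commit_inode(struct inode *inode, int how)
        {
                return 0;       /* nothing to commit over NFSv2 */
        }
        #endif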
     

31 Jul, 2010

5 commits


30 Jul, 2010

1 commit

  • It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
    credentials by incrementing their usage count after their replacement by the
    task being accessed.

    What happens is that get_task_cred() can race with commit_creds():

    TASK_1:      -->get_task_cred(TASK_2)
    TASK_1:      rcu_read_lock()
    TASK_1:      __cred = __task_cred(TASK_2)
    TASK_2:      -->commit_creds()
    TASK_2:      old_cred = TASK_2->real_cred
    TASK_2:      TASK_2->real_cred = ...
    TASK_2:      put_cred(old_cred)
    TASK_2:        call_rcu(old_cred)
                 [__cred->usage == 0]
    TASK_1:      get_cred(__cred)
    TASK_1:      [__cred->usage == 1]
    TASK_1:      rcu_read_unlock()
    RCU_CLEANER: -->put_cred_rcu()
    RCU_CLEANER: [__cred->usage == 1]
    RCU_CLEANER: panic()

    However, since a task's credentials are generally not changed very often, we
    can reasonably make use of a loop that reads the creds pointer and uses
    atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.

    If successful, we can safely return the credentials in the knowledge that, even
    if the task we're accessing has released them, they haven't gone to the RCU
    cleanup code.

    We then change task_state() in procfs to use get_task_cred() rather than
    calling get_cred() on the result of __task_cred(), as that suffers from the
    same problem.

    Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
    tripped when it is noticed that the usage count is not zero as it ought to be,
    for example:

    kernel BUG at kernel/cred.c:168!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/kernel/mm/ksm/run
    CPU 0
    Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
    745
    RIP: 0010:[] [] __put_cred+0xc/0x45
    RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
    RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
    RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
    R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
    FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
    Stack:
    ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
    ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
    ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
    Call Trace:
    [] put_cred+0x13/0x15
    [] commit_creds+0x16b/0x175
    [] set_current_groups+0x47/0x4e
    [] sys_setgroups+0xf6/0x105
    [] system_call_fastpath+0x16/0x1b
    Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
    48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 0b eb fe 65 48 8b
    04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
    RIP [] __put_cred+0xc/0x45
    RSP
    ---[ end trace df391256a100ebdd ]---

    Signed-off-by: David Howells
    Acked-by: Jiri Olsa
    Signed-off-by: Linus Torvalds

    David Howells
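
    A minimal sketch of the retry loop described in this entry; the cast is
    needed because __task_cred() returns a const pointer, and the in-tree
    helper may differ in details such as error checking:

        #include <linux/cred.h>
        #include <linux/sched.h>

        const struct cred *get_task_cred(struct task_struct *task)
        {
                const struct cred *cred;

                rcu_read_lock();
                do {
                        cred = __task_cred(task);
                        /* retry if the cred was replaced and its usage
                         * count already dropped to zero under us */
                } while (!atomic_inc_not_zero(&((struct cred *)cred)->usage));
                rcu_read_unlock();

                return cred;
        }

    Because the increment only succeeds while the count is still non-zero, the
    credentials can never be handed back after they have been queued for RCU
    cleanup.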
     

29 Jul, 2010

3 commits

  • The function ecryptfs_uid_hash wrongly assumes that the
    second parameter to hash_long() is the number of hash
    buckets instead of the number of hash bits.
    This patch fixes that and renames the variable
    ecryptfs_hash_buckets to ecryptfs_hash_bits to make it
    clearer.

    Fixes: CVE-2010-2492

    Signed-off-by: Andre Osterhues
    Signed-off-by: Tyler Hicks
    Signed-off-by: Linus Torvalds

    Andre Osterhues
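
    A minimal sketch of the corrected hash helper, assuming hash_long() from
    <linux/hash.h>; the surrounding table setup is omitted:

        #include <linux/hash.h>

        static int ecryptfs_hash_bits;  /* was: ecryptfs_hash_buckets */

        /* hash_long() takes the number of hash *bits*, not buckets;
         * the bucket count is 1 << ecryptfs_hash_bits */
        #define ecryptfs_uid_hash(uid) \
                hash_long((unsigned long)(uid), ecryptfs_hash_bits)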
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: use complete_all and wake_up_all
    ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
    ceph: fix dentry lease release
    ceph: fix leak of dentry in ceph_init_dentry() error path
    ceph: fix pg_mapping leak on pg_temp updates
    ceph: fix d_release dop for snapdir, snapped dentries
    ceph: avoid dcache readdir for snapdir

    Linus Torvalds
     
  • If we don't need a huge amount of memory in ->readdir() then
    we can use kmalloc rather than vmalloc to allocate it. This
    should cut down on the greater overheads associated with
    vmalloc for smaller directories.

    We may be able to eliminate vmalloc entirely at some stage,
    but this is easy to do right away.

    Also use GFP_NOFS to avoid any issues with deleting inodes while under a
    glock, and, as Linus suggested, factor out the alloc/dealloc into a pair
    of helpers (sketched below).

    I've given this a test with a variety of different sized
    directories and it seems to work ok.

    Cc: Andrew Morton
    Cc: Nick Piggin
    Cc: Prarit Bhargava
    Signed-off-by: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Steven Whitehouse
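
    A minimal sketch of the factored-out helpers, assuming the era's
    three-argument __vmalloc(); the helper names and the KMALLOC_MAX_SIZE
    cutoff are illustrative:

        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *gfs2_alloc_sort_buffer(unsigned int size)
        {
                void *ptr = NULL;

                /* small directories: kmalloc, with GFP_NOFS so we never
                 * recurse into the filesystem while holding a glock */
                if (size < KMALLOC_MAX_SIZE)
                        ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
                /* large directories: fall back to vmalloc */
                if (!ptr)
                        ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
                return ptr;
        }

        static void gfs2_free_sort_buffer(void *ptr)
        {
                if (is_vmalloc_addr(ptr))
                        vfree(ptr);
                else
                        kfree(ptr);
        }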
     

28 Jul, 2010

2 commits


27 Jul, 2010

3 commits

  • Supporting symlinks from untagged to tagged directories is reasonable,
    and needed to support CONFIG_SYSFS_DEPRECATED. So don't fail a priori;
    allow that case to work.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • This happens for network devices when SYSFS_DEPRECATED is enabled.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • Recently my tagged sysfs support revealed a flaw in the device core, which
    a few rare drivers run into: we don't always put network devices in a
    class subdirectory named net/.

    Since we are not creating the class directory the network devices wind
    up in a non-tagged directory, but the symlinks to the network devices
    from /sys/class/net are in a tagged directory. All of which works
    until we go to remove or rename the symlink. When we remove or rename
    a symlink we look in the namespace of the target of the symlink.
    Since the target of the symlink is in a non-tagged sysfs directory we
    don't have a namespace to look in, and we fail to remove the symlink.

    Detect this problem up front and simply don't create symlinks we won't
    be able to remove later. This prevents symlink leakage and fails in
    a much clearer and more understandable way.

    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morton
    Cc: Rafael J. Wysocki
    Cc: Maciej W. Rozycki
    Cc: Kay Sievers
    Cc: Johannes Berg
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

25 Jul, 2010

1 commit


24 Jul, 2010

4 commits


23 Jul, 2010

2 commits

  • We should always go to the MDS for readdir on the hidden snapdir. The
    set of snapshots can change at any time; the client can't trust its cache
    for that.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Fix the security problem in the CIFS filesystem DNS lookup code in which a
    malicious redirect could be installed by a random user by simply adding a
    result record into one of their keyrings with add_key() and then invoking a
    CIFS DFS lookup [CVE-2010-2524].

    This is done by creating an internal keyring specifically for the caching of
    DNS lookups. To enforce the use of this keyring, the module init routine
    creates a set of override credentials with the keyring installed as the thread
    keyring and instructs request_key() to only install lookup result keys in that
    keyring.

    The override is then applied around the call to request_key().

    This has some additional benefits when a kernel service uses this module to
    request a key:

    (1) The result keys are owned by root, not the user that caused the lookup.

    (2) The result keys don't pop up in the user's keyrings.

    (3) The result keys don't come out of the quota of the user that caused the
    lookup.

    The keyring can be viewed as root by doing cat /proc/keys:

    2a0ca6c3 I----- 1 perm 1f030000 0 0 keyring .dns_resolver: 1/4

    It can then be listed with 'keyctl list' by root.

    # keyctl list 0x2a0ca6c3
    1 key in keyring:
    726766307: --alswrv 0 0 dns_resolver: foo.bar.com

    Signed-off-by: David Howells
    Reviewed-and-Tested-by: Jeff Layton
    Acked-by: Steve French
    Signed-off-by: Linus Torvalds

    David Howells
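
    A minimal sketch of the override pattern described in this entry;
    dns_resolver_cache stands for the override credentials set up by the
    module init routine, and error handling is trimmed:

        #include <linux/cred.h>
        #include <linux/key.h>

        /* override credentials carrying the private caching keyring as
         * their thread keyring, created in the module init routine */
        static const struct cred *dns_resolver_cache;

        static struct key *dns_lookup_cached(const char *hostname)
        {
                const struct cred *saved_cred;
                struct key *rkey;

                /* run request_key() under the override credentials so the
                 * result key lands in (and is searched in) the private
                 * keyring, never in the calling user's keyrings */
                saved_cred = override_creds(dns_resolver_cache);
                rkey = request_key(&key_type_dns_resolver, hostname, "");
                revert_creds(saved_cred);

                return rkey;
        }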
     

22 Jul, 2010

1 commit


21 Jul, 2010

1 commit


20 Jul, 2010

7 commits

  • * 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
    xfs: track AGs with reclaimable inodes in per-ag radix tree
    xfs: convert inode shrinker to per-filesystem contexts
    mm: add context argument to shrinker callback

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
    Btrfs: fix CLONE ioctl destination file size expansion to block boundary
    Btrfs: fix split_leaf double split corner case

    Linus Torvalds
     
  • https://bugzilla.kernel.org/show_bug.cgi?id=16348

    When the filesystem grows to a large number of allocation groups,
    the summing of reclaimable inodes gets expensive. In many cases,
    most AGs won't have any reclaimable inodes and so we are wasting CPU
    time aggregating over these AGs. This is particularly important for
    the inode shrinker that gets called frequently under memory
    pressure.

    To avoid the overhead, track AGs with reclaimable inodes in the
    per-ag radix tree so that we can find all the AGs with reclaimable
    inodes via a simple gang tag lookup. This involves setting the tag
    when the first reclaimable inode is tracked in the AG, and removing
    the tag when the last reclaimable inode is removed from the tree.
    Then the summation process becomes a loop walking the radix tree
    summing AGs with the reclaim tag set.

    This significantly reduces the overhead of scanning - a 6400 AG
    filesystem now only uses about 25% of a cpu in kswapd while slab
    reclaim progresses instead of being permanently stuck at 100% CPU
    and making little progress. Clean filesystems will see no overhead
    and the overhead only increases linearly with the number of dirty
    AGs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
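
    A minimal sketch of the tag-based tracking and summation; the tag index
    and the per-AG field names follow the description above and are partly
    assumptions (locking around the lookup is omitted):

        #define XFS_ICI_RECLAIM_TAG     1       /* illustrative tag index */

        /* called when the first reclaimable inode is tracked in an AG */
        static void xfs_perag_set_reclaim_tag(struct xfs_mount *mp,
                                              xfs_agnumber_t agno)
        {
                spin_lock(&mp->m_perag_lock);
                radix_tree_tag_set(&mp->m_perag_tree, agno,
                                   XFS_ICI_RECLAIM_TAG);
                spin_unlock(&mp->m_perag_lock);
        }

        /* sum reclaimable inodes by walking only the tagged AGs */
        static int xfs_reclaimable_inodes(struct xfs_mount *mp)
        {
                struct xfs_perag *pag;
                xfs_agnumber_t agno = 0;
                int count = 0;

                while (radix_tree_gang_lookup_tag(&mp->m_perag_tree,
                                (void **)&pag, agno, 1,
                                XFS_ICI_RECLAIM_TAG) == 1) {
                        count += pag->pag_ici_reclaimable;
                        agno = pag->pag_agno + 1;
                }
                return count;
        }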
     
  • Now that the shrinker passes us a context, wire up a shrinker context per
    filesystem. This allows us to remove the global mount list and the
    locking problems it introduced. It also means that a shrinker call
    does not need to traverse clean filesystems before finding a
    filesystem with reclaimable inodes. This significantly reduces
    scanning overhead when lots of filesystems are present.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
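
    A minimal sketch of the per-filesystem wiring, using the 2.6.35-era
    shrinker callback signature; the field and function names are
    illustrative and the actual scan logic is elided:

        struct xfs_mount {
                /* ... existing mount fields ... */
                struct shrinker         m_inode_shrink;
        };

        static int xfs_reclaim_inode_shrink(struct shrinker *shrink,
                                            int nr_to_scan, gfp_t gfp_mask)
        {
                /* recover this filesystem's mount from the embedded
                 * shrinker - no global mount list needed */
                struct xfs_mount *mp = container_of(shrink,
                                struct xfs_mount, m_inode_shrink);

                /* ... scan only mp's reclaimable inodes here ... */
                return 0;
        }

        void xfs_inode_shrinker_register(struct xfs_mount *mp)
        {
                mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
                mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
                register_shrinker(&mp->m_inode_shrink);
        }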
     
  • 1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
    whether the donor file is append-only before writing to it.

    2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
    overflow that allows a user to specify an out-of-bounds range to copy
    from the source file (if off + len wraps around). I haven't been able
    to successfully exploit this, but I'd imagine that a clever attacker
    could use this to read things he shouldn't. Even if it's not
    exploitable, it couldn't hurt to be safe.

    Signed-off-by: Dan Rosenberg
    cc: stable@kernel.org
    Signed-off-by: Chris Mason

    Dan Rosenberg
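
    A minimal sketch condensing the two checks into a hypothetical helper
    (the in-tree checks sit inline in the ioctl handler):

        static int clone_range_checks(struct file *dst_file,
                                      struct inode *src, u64 off, u64 len)
        {
                /* (1) the file being written to must not be append-only */
                if (dst_file->f_flags & O_APPEND)
                        return -EINVAL;

                /* (2) reject ranges where off + len wraps around or runs
                 *     past the end of the source file */
                if (off + len < off || off + len > i_size_read(src))
                        return -EINVAL;

                return 0;
        }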
     
  • The CLONE and CLONE_RANGE ioctls round up the range of extents being
    cloned to the block size when the range to clone extends to the end of file
    (this is always the case with CLONE). It was then using that offset when
    extending the destination file's i_size. Fix this by not setting i_size
    beyond the originally requested ending offset.

    This bug was introduced by a22285a6 (2.6.35-rc1).

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
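
    A minimal sketch of the fix, as a hypothetical helper; olen is the
    caller's requested length before block-size rounding:

        static void clone_update_isize(struct inode *inode, u64 destoff,
                                       u64 olen)
        {
                u64 endoff = destoff + olen;    /* requested end, pre-rounding */

                if (endoff > i_size_read(inode))
                        btrfs_i_size_write(inode, endoff);
        }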
     
  • split_leaf was not properly balancing leaves when it was forced to
    split a leaf twice. This commit adds an extra push left and right
    before forcing the double split in hopes of getting the slot where
    we want to insert at either the start or end of the leaf.

    If the extra pushes do work, then we are able to avoid splitting twice
    and we keep the tree properly balanced.

    Signed-off-by: Chris Mason

    Chris Mason
     

19 Jul, 2010

3 commits

  • Partition boundary calculation fails for DASD FBA disks under the
    following conditions:
    - disk is formatted with CMS FORMAT with a blocksize of more than
    512 bytes
    - all of the disk is reserved to a single CMS file using CMS RESERVE
    - the disk is accessed using the DIAG mode of the DASD driver

    Under these circumstances, the partition detection code tries to
    read the CMS label block containing partition-relevant information
    from logical block offset 1, while it is in fact located at physical
    block offset 1.

    Fix this problem by using the correct CMS label block location
    depending on the device type as determined by the DASD SENSE ID
    information.

    Signed-off-by: Peter Oberparleiter
    Signed-off-by: Martin Schwidefsky

    Peter Oberparleiter
     
  • The current shrinker implementation requires the registered callback
    to have global state to work from. This makes it difficult to shrink
    caches that are not global (e.g. per-filesystem caches). Pass the shrinker
    structure to the callback so that users can embed the shrinker structure
    in the context the shrinker needs to operate on and get back to it in the
    callback via container_of().

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
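
    A sketch of the changed interface as it looked in this era (internal
    bookkeeping fields omitted), with a hypothetical embedded user showing
    the container_of() pattern:

        struct shrinker {
                /* the callback now receives the shrinker itself */
                int (*shrink)(struct shrinker *, int nr_to_scan,
                              gfp_t gfp_mask);
                int seeks;              /* seeks to recreate an object */
                /* internal list/counter fields omitted */
        };

        /* a non-global cache embeds the shrinker in its own state */
        struct my_cache {
                struct shrinker shrinker;
                unsigned long   nr_objects;
        };

        static int my_cache_shrink(struct shrinker *shrink, int nr_to_scan,
                                   gfp_t gfp_mask)
        {
                struct my_cache *c = container_of(shrink, struct my_cache,
                                                  shrinker);
                /* ... free up to nr_to_scan objects from c ... */
                return c->nr_objects;   /* remaining object count */
        }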
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2: Silence gcc warning in ocfs2_write_zero_page().
    jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
    ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
    ocfs2: Don't duplicate pages past i_size during CoW.
    ocfs2: tighten up strlen() checking
    ocfs2: Make xattr reflink work with new local alloc reservation.
    ocfs2: make xattr extension work with new local alloc reservation.
    ocfs2: Remove the redundant cpu_to_le64.
    ocfs2/dlm: don't access beyond bitmap size
    ocfs2: No need to zero pages past i_size.
    ocfs2: Zero the tail cluster when extending past i_size.
    ocfs2: When zero extending, do it by page.
    ocfs2: Limit default local alloc size within bitmap range.
    ocfs2: Move orphan scan work to ocfs2_wq.
    fs/ocfs2/dlm: Add missing spin_unlock

    Linus Torvalds
     

17 Jul, 2010

3 commits

  • ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
    doesn't know that. Set ret=0 just to make gcc happy.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • Strip the cap and dentry releases from replayed messages. They can
    cause the shared state to get out of sync because they were generated
    (with the request message) earlier, and no longer reflect the current
    client state.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Replayed rename operations (after an mds failure/recovery) were broken
    because the request paths were regenerated from the dentry names, which
    get mangled when d_move() is called.

    Instead, resend the previous request message when replaying completed
    operations. Just make sure the REPLAY flag is set and the target ino is
    filled in.

    This fixes problems with workloads doing renames when the MDS restarts,
    where the rename operation appears to succeed, but on mds restart then
    fails (leading to client confusion, app breakage, etc.).

    Signed-off-by: Sage Weil

    Sage Weil
     

16 Jul, 2010

3 commits

  • OCFS2 uses the t_commit trigger to compute and store checksums of the
    just-committed blocks. When a buffer has b_frozen_data, the checksum is
    computed for it instead of b_data, but this can result in an old checksum
    being written to the filesystem in the following scenario:

    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
    jh->b_next_transaction is set to transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
    any more journal operation
    13) Checkpointing finally happens, and it just spools the bh via normal buffer
    writeback. This will write b_data, which was never triggered on and thus
    contains a wrong (old) checksum.

    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Jan Kara
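
    A minimal sketch of the renamed trigger hook; the function shape follows
    the description above and the details are assumptions:

        static void jbd2_buffer_frozen_trigger(struct journal_head *jh,
                        void *mapped_data,
                        struct jbd2_buffer_trigger_type *triggers)
        {
                struct buffer_head *bh = jh2bh(jh);

                if (!triggers || !triggers->t_frozen)
                        return;

                /* checksum the copy that will actually reach the journal,
                 * so the journalled data and its checksum stay in sync */
                triggers->t_frozen(triggers, bh, mapped_data, bh->b_size);
        }

    In this sketch, do_get_write_access() would call the hook right after
    copying b_data into b_frozen_data, and the commit path would trigger on
    b_data only when no frozen copy exists.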
     
  • For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
    before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
    dlm_migration_can_proceed() for that purpose. However, if the node is
    down, dlm_migration_can_proceed() will also return "go ahead". In this
    rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
    the BUG_ON() that trips over this condition.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang
     
  • During CoW, the pages after i_size don't contain valid data, so there's
    no need to read and duplicate them.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma