05 May, 2009

1 commit

  • By using the same test as is used for /proc/pid/maps and /proc/pid/smaps,
    only allow processes that can ptrace() a given process to see information
    that might be used to bypass address space layout randomization (ASLR).
    These include eip, esp, wchan, and start_stack in /proc/pid/stat as well
    as the non-symbolic output from /proc/pid/wchan.

    ASLR can be bypassed by sampling eip, as shown by the proof-of-concept
    code (http://code.google.com/p/fuzzyaslr/). As part of a presentation
    (http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf), esp and wchan were
    also noted as possibly usable information leaks. The start_stack
    address also leaks potentially useful information.

    Cc: Stable Team
    Signed-off-by: Jake Edge
    Acked-by: Arjan van de Ven
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds

    Jake Edge
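
    As a rough illustration of which values are affected, the sketch below
    pulls the four fields named above out of one /proc/<pid>/stat line. The
    field numbers are taken from proc(5); the function and dictionary names
    are hypothetical, and nothing here is part of the patch itself.

```python
# Field numbers as documented in proc(5); names on the left are illustrative.
ASLR_SENSITIVE_FIELDS = {
    "start_stack": 28,
    "esp": 29,    # kstkesp
    "eip": 30,    # kstkeip
    "wchan": 35,
}


def parse_sensitive_fields(stat_line):
    """Return the ASLR-relevant fields of one /proc/<pid>/stat line."""
    # comm (field 2) is parenthesised and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace alone.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return {name: int(rest[n - 3]) for name, n in ASLR_SENSITIVE_FIELDS.items()}
```

    Feeding it the contents of /proc/self/stat on a Linux system should
    show what a process is allowed to see about itself.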
     

03 May, 2009

10 commits

  • Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
    returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

    See http://bugzilla.kernel.org/show_bug.cgi?id=12913

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix getbmap vs mmap deadlock
    xfs: a couple getbmap cleanups
    xfs: add more checks to superblock validation
    xfs_file_last_byte() needs to acquire ilock

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs:
    configfs: Fix Trivial Warning in fs/configfs/symlink.c

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2: Change repository in MAINTAINERS.
    ocfs2: Fix a missing credit when deleting from indexed directories.
    ocfs2/trivial: Remove unused variable in ocfs2_rename.
    ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock()
    ocfs2: Fix some printk() warnings.
    ocfs2: Fix 2 warning during ocfs2 make.
    ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir.

    Linus Torvalds
     
  • ->real_parent is the parent. ->parent may be the tracer.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Acked-by: Roland McGrath
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The Committed_AS field can underflow in certain situations:

    > # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
    > 1 Committed_AS: 18446744073709323392 kB
    > 11 Committed_AS: 18446744073709455488 kB
    > 6 Committed_AS: 35136 kB
    > 5 Committed_AS: 18446744073709454400 kB
    > 7 Committed_AS: 35904 kB
    > 3 Committed_AS: 18446744073709453248 kB
    > 2 Committed_AS: 34752 kB
    > 9 Committed_AS: 18446744073709453248 kB
    > 8 Committed_AS: 34752 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 7 Committed_AS: 18446744073709454080 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 5 Committed_AS: 18446744073709454080 kB
    > 6 Committed_AS: 18446744073709320960 kB

    This happens because NR_CPUS can be greater than 1000 and
    meminfo_proc_show() does not check for underflow.

    Scaling by NR_CPUS is not a good calculation anyway. In general, the
    likelihood of lock contention is proportional to the number of online
    CPUs, not the theoretical maximum number of CPUs (NR_CPUS).

    The current kernel has a generic percpu-counter facility, and using it
    is the right approach. It simplifies the code, and
    percpu_counter_read_positive() does not have the underflow issue.

    Reported-by: Dave Hansen
    Signed-off-by: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: [All kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
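
    The huge values in the log above are what a small negative number looks
    like when printed as an unsigned 64-bit quantity. The sketch below
    reproduces the first value and shows the clamping idea behind
    percpu_counter_read_positive(); both helper names are hypothetical and
    this is not the kernel implementation.

```python
U64 = 2 ** 64


def as_unsigned_long(value):
    """Reinterpret a signed value as a 64-bit unsigned long."""
    return value % U64


def read_positive(value):
    """Clamp a transiently negative counter sum to zero."""
    return max(value, 0)


if __name__ == "__main__":
    # -228224 kB wraps to the first Committed_AS value shown in the log.
    print(as_unsigned_long(-228224))
    print(read_positive(-228224))
```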
     
  • This fixes the problem introduced by commit 3bfacef412 (get rid of
    special-casing the /sbin/loader on alpha): an OSF/1 ECOFF binary
    segfaults when binfmt_aout is built as a module. That happens because
    the aout binary handler ends up at the top of the binfmt list due to
    late registration, and the kernel attempts to execute the binary without
    the preparatory work that must be done by binfmt_loader.

    Fixed by changing the registration order of the default binfmt handlers
    using list_add_tail() and introducing an insert_binfmt() function, which
    places a new handler at the top of the binfmt list. This might be
    generally useful for installing arch-specific frontends for default
    handlers or just for overriding them.

    Signed-off-by: Ivan Kokshaysky
    Cc: Al Viro
    Cc: Richard Henderson
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
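
    The ordering change can be pictured with a plain list that is searched
    from the front. This is only a model: the handler tuples and
    pick_handler() are hypothetical, and the assumption that a front end
    such as the loader registers through insert_binfmt() is mine.

```python
from collections import deque

formats = deque()  # searched from the front, like the binfmt list


def register_binfmt(handler):
    """Default registration: append, so earlier handlers stay in front."""
    formats.append(handler)


def insert_binfmt(handler):
    """Front-end registration: place the handler at the head of the list."""
    formats.appendleft(handler)


def pick_handler(binary):
    """Return the name of the first handler that claims the binary."""
    for name, claims in formats:
        if claims(binary):
            return name
    return None
```

    With tail insertion as the default, a handler that is registered late
    (for example from a module) no longer jumps ahead of a front end that
    was deliberately placed at the head.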
     
  • The intention of commit aae8679b0ebcaa92f99c1c3cb0cd651594a43915
    ("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
    /proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
    multiple of 8 bytes, but it now allows a read of 0 bytes, which actually
    writes some data to the user's buffer. According to POSIX, if count is
    zero, read() should return zero and have no other results.

    Signed-off-by: Vitaly Mayatskikh
    Cc: Thomas Tuttle
    Acked-by: Matt Mackall
    Cc: Alexey Dobriyan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
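
    A minimal model of the intended behaviour, assuming misaligned lengths
    are rejected (modelled here as a ValueError) and a zero-length read
    returns before anything is copied. pagemap_read() and its arguments are
    hypothetical and do not mirror the kernel function's signature.

```python
PM_ENTRY_BYTES = 8  # one 8-byte entry per page


def pagemap_read(entries, count):
    """Return the bytes that would be copied to the caller's buffer."""
    if count % PM_ENTRY_BYTES != 0:
        raise ValueError("length must be a multiple of 8")
    if count == 0:
        return b""  # nothing transferred and no other effects
    wanted = count // PM_ENTRY_BYTES
    return b"".join(
        entry.to_bytes(PM_ENTRY_BYTES, "little") for entry in entries[:wanted]
    )
```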
     
  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid LOR conditions with page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
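
    The locking scheme described above can be sketched in a single-threaded
    model: the hook returns with the page lock held, and a writeout attempt
    that arrives inside the window cannot take the lock. The Page class and
    every function here are hypothetical stand-ins, not kernel interfaces.

```python
import threading


class Page:
    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = False
        self.pte_dirty = False
        self.fs_tracks_dirty = False  # filesystem metadata tied to dirtiness


def page_mkwrite(page):
    """Filesystem hook: set up metadata and return with the page locked."""
    page.lock.acquire()
    page.fs_tracks_dirty = True


def try_writeout(page):
    """Writeback: clean the page only if its lock can be taken."""
    if not page.lock.acquire(blocking=False):
        return False
    try:
        page.dirty = False
        page.fs_tracks_dirty = False
        return True
    finally:
        page.lock.release()


def write_fault(page, during_window=None):
    """Hold the page lock over the three critical operations."""
    page_mkwrite(page)
    try:
        if during_window:
            during_window(page)  # stand-in for a racing writeout attempt
        page.pte_dirty = True
        page.dirty = True
    finally:
        page.lock.release()
```

    In this model a writeout attempted from the during_window hook is
    refused, so the filesystem metadata and the dirty state stay in step.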
     
  • Fix an obviously incorrect return status in autofs4_mount_busy().

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     

01 May, 2009

1 commit

  • The ocfs2 directory index updates two blocks when we remove an entry -
    the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only
    accounting for the dx leaf. This shows up when ocfs2_delete_inode()
    runs out of credits in jbd2_journal_dirty_metadata() at
    "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".

    The test that caught this was running dirop_file_racer from the
    ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B
    blocksize, it forces the orphan dir index to grow large enough to
    trigger.

    Signed-off-by: Joel Becker

    Joel Becker
     

30 Apr, 2009

5 commits

  • xfs_getbmap (or rather the formatters called by it) copies out the
    getbmap structures under the ilock, which can deadlock against mmap.
    This was reported via bugzilla a while ago (#717) and has recently also
    shown up via lockdep.

    So allocate a temporary buffer to format the kernel getbmap structures
    into and then copy them out after dropping the locks.

    A little problem with this is that we limit the number of extents we
    can copy out by the maximum allocation size, but I see no real way
    around that.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
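
    The pattern is: format into a temporary buffer while the lock is held,
    then copy out only after the lock is dropped. The sketch below models
    that with an ordinary mutex; the extent data, the copy_out callback and
    the max_records limit are all made up for illustration.

```python
import threading

ilock = threading.Lock()
extents = [(0, 8), (8, 16), (32, 4)]  # (start offset, length), made-up data


def getbmap(copy_out, max_records):
    """Format records under the lock, then copy them out after dropping it."""
    with ilock:
        # Temporary buffer, bounded by how much may be allocated at once.
        staged = [{"offset": o, "length": n} for o, n in extents[:max_records]]
    # copy_out may fault and need the same lock (the mmap case), so it
    # runs only once the lock has been released.
    for record in staged:
        copy_out(record)
    return len(staged)
```

    The max_records bound corresponds to the limitation noted above: only
    as many extents as fit in the temporary buffer can be returned.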
     
  • - reshuffle various conditionals for data vs attr fork to make the code
    more readable
    - do fine-grained goto-based error handling
    - exit early from conditionals instead of keeping a long else branch around
    - allow kmem_alloc to fail

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • There have been reports of xfs filesystems being randomly
    corrupted with fsfuzzer, with xfs failing to handle the
    corruption gracefully. This patch fixes a couple of the
    reported problems by providing additional checks in the
    superblock validation routine.

    Signed-off-by: Olaf Weber
    Reviewed-by: Josef 'Jeff' Sipek
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Felix Blyakher

    Olaf Weber
     
  • We had some systems crash with this stack:

    [] ia64_leave_kernel+0x0/0x280
    [] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
    [] xfs_bmap_last_offset+0x210/0x280 [xfs]
    [] xfs_file_last_byte+0x70/0x1a0 [xfs]
    [] xfs_itruncate_start+0xc0/0x1a0 [xfs]
    [] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
    [] xfs_release+0x1b0/0x240 [xfs]
    [] xfs_file_release+0x70/0xa0 [xfs]
    [] __fput+0x1a0/0x420
    [] fput+0x40/0x60

    The problem here is that xfs_file_last_byte() does not acquire the
    inode lock and can therefore race with another thread that is modifying
    the extent list. While xfs_bmap_last_offset() is trying to look up
    the last extent, some extents are merged and the extent list
    shrinks, so the index we look up is now beyond the end of the extent
    list and potentially in a freed buffer.

    Signed-off-by: Lachlan McIlroy
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Lachlan McIlroy
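
    A single-threaded model of the race: the reader computes an index, a
    merge shrinks the list, and the stale index then points past the end.
    Python raises IndexError where the kernel would read stale memory. All
    names are hypothetical, and the racer callback stands in for the other
    thread.

```python
import threading


class Inode:
    def __init__(self, extents):
        self.ilock = threading.Lock()
        self.extents = list(extents)  # (start offset, length) pairs


def merge_extents(inode):
    """Merge the last two extents; needs the inode lock to proceed."""
    if not inode.ilock.acquire(blocking=False):
        return False  # the reader holds the lock, so the merge must wait
    try:
        if len(inode.extents) >= 2:
            start, first_len = inode.extents[-2]
            _, second_len = inode.extents[-1]
            inode.extents[-2:] = [(start, first_len + second_len)]
        return True
    finally:
        inode.ilock.release()


def last_extent(inode, locked, racer=None):
    """Look up the last extent, optionally under the inode lock."""
    if locked:
        inode.ilock.acquire()
    try:
        index = len(inode.extents) - 1
        if racer:
            racer(inode)  # another thread running between the two steps
        return inode.extents[index]
    finally:
        if locked:
            inode.ilock.release()
```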
     
  • With indexed dir enabled, now we use ocfs2_dir_lookup_result to
    wrap all the bh used for dir. So remove the 2 unused variables.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

29 Apr, 2009

1 commit


28 Apr, 2009

5 commits

  • This warning shows up on 64 bit builds:

    fs/ecryptfs/inode.c:693: warning: comparison of distinct pointer types
    lacks a cast

    Signed-off-by: Tyler Hicks

    Tyler Hicks
     
  • fs/ecryptfs/inode.c:670: warning: format '%d' expects type 'int', but argument 3 has type 'size_t'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Tyler Hicks
    Cc: Dustin Kirkland
    Signed-off-by: Andrew Morton

    Randy Dunlap
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: look for acls during btrfs_read_locked_inode
    Btrfs: fix acl caching
    Btrfs: Fix a bunch of printk() warnings.
    Btrfs: Fix a trivial warning using max() of u64 vs ULL.
    Btrfs: remove unused btrfs_bit_radix slab
    Btrfs: ratelimit IO error printks
    Btrfs: remove #if 0 code
    Btrfs: When shrinking, only update disk size on success
    Btrfs: fix deadlocks and stalls on dead root removal
    Btrfs: fix fallocate deadlock on inode extent lock
    Btrfs: kill btrfs_cache_create
    Btrfs: don't export symbols
    Btrfs: simplify makefile
    Btrfs: try to keep a healthy ratio of metadata vs data block groups

    Linus Torvalds
     
  • This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
    If it is certain a given inode has no acls, it will set the in memory acl
    fields to null to avoid acl lookups completely.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Linus noticed the btrfs code to cache acls wasn't properly caching
    a NULL acl when the inode didn't have any acls. This meant the common
    case of no acls resulted in expensive btree searches every time the
    kernel checked permissions (which is quite often).

    This is a modified version of Linus' original patch:

    Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
    This forces an acl lookup when permission checks are done.

    Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
    are set to null.

    Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
    when deciding to cache a NULL acl. It was storing a NULL acl when
    __btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
    -ENODATA for this case.

    Signed-off-by: Chris Mason

    Chris Mason
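
    The caching rule can be sketched with a sentinel that is distinct from
    None, so that "not looked up yet" and "looked up, no ACL" are different
    states. ACL_NOT_CACHED stands in for BTRFS_ACL_NOT_CACHED, KeyError
    plays the role of -ENODATA, and the Inode class is hypothetical.

```python
ACL_NOT_CACHED = object()  # stand-in for BTRFS_ACL_NOT_CACHED


class Inode:
    def __init__(self, xattrs):
        self.xattrs = xattrs
        self.acl = ACL_NOT_CACHED
        self.lookups = 0  # counts the expensive searches


def getxattr(inode, name):
    inode.lookups += 1
    if name not in inode.xattrs:
        raise KeyError(name)  # plays the role of -ENODATA
    return inode.xattrs[name]


def get_acl(inode, name="system.posix_acl_access"):
    if inode.acl is not ACL_NOT_CACHED:
        return inode.acl  # cached, possibly None meaning "no ACL"
    try:
        inode.acl = getxattr(inode, name)
    except KeyError:
        inode.acl = None  # cache the negative result as well
    return inode.acl
```

    Repeated permission checks on an inode without ACLs then cost one
    lookup in total instead of one per check.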
     

27 Apr, 2009

9 commits


25 Apr, 2009

8 commits

  • The EXTENTS_FL flag should never be set on special files, but if it
    is, don't bother trying to validate that the extents tree is valid,
    since only files, directories, and non-fast symlinks will ever have an
    extent data structure. We perhaps should flag the filesystem as being
    corrupted if we see a special file (named pipes, device nodes, Unix
    domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't
    currently check this case, so we'll just ignore this for now, since
    it's harmless.

    Without this fix, a special device with the extents flag is flagged as
    an error by the kernel, so it is impossible to access or delete the
    inode, but e2fsck doesn't see it as a problem, leading to
    confused/frustrated users.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
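
    The decision described above reduces to a check on the file type. In
    this sketch the function name and the is_fast_symlink argument are
    hypothetical; only the classification of file types comes from the
    commit message.

```python
import stat


def should_validate_extents(mode, has_extents_flag, is_fast_symlink=False):
    """Decide whether the extent tree of an inode should be validated."""
    if not has_extents_flag:
        return False
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
        return True
    if stat.S_ISLNK(mode):
        return not is_fast_symlink
    # Named pipes, device nodes, sockets: ignore a stray flag.
    return False
```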
     
  • RomFS should advance the destination buffer pointer when reading data from a
    blockdev source (the data may be split over multiple blocks, each requiring its
    own sb_read() call). Without this, all the data is copied to the beginning of
    the output buffer.

    Signed-off-by: David Howells
    Tested-by: Michal Simek
    Signed-off-by: Linus Torvalds

    David Howells
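
    A small model of a read that spans several blocks. The line that
    advances dst is the step the fix is about; without it every segment
    would land at the start of the output buffer. The block size and the
    function name are hypothetical.

```python
BLOCK_SIZE = 4  # tiny block size to keep the example readable


def read_from_blocks(blocks, pos, length):
    """Copy `length` bytes starting at byte `pos` from a list of blocks."""
    out = bytearray(length)
    dst = 0
    while length > 0:
        block, offset = divmod(pos, BLOCK_SIZE)
        segment = min(length, BLOCK_SIZE - offset)
        out[dst:dst + segment] = blocks[block][offset:offset + segment]
        dst += segment  # advance the destination along with the source
        pos += segment
        length -= segment
    return bytes(out)
```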
     
  • romfs_lookup() should be using a routine akin to strcmp() on the backing store,
    rather than one akin to strncmp(). If it uses the latter, it's liable to match
    /bin/shutdown when looking up /bin/sh.

    Signed-off-by: David Howells
    Tested-by: Michal Simek
    Signed-off-by: Linus Torvalds

    David Howells
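
    The difference between the two comparison styles, using the names from
    the example above. Both helpers are hypothetical illustrations rather
    than the romfs routines.

```python
def matches_prefix_only(stored, wanted):
    """strncmp()-like: compares only the first len(wanted) bytes."""
    return stored[:len(wanted)] == wanted


def matches_exactly(stored, wanted):
    """strcmp()-like: the lengths must agree as well."""
    return stored == wanted
```

    Looking up b"sh" against a stored name of b"shutdown" succeeds with
    the first helper and fails with the second.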
     
  • Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
    bit is set. The field is normally zero, but older versions of e2fsck
    didn't automatically check to make sure of this, so in the spirit of
    "be liberal in what you accept", don't look at i_file_acl_high unless
    we are using a 64-bit filesystem.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
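
    In outline, the high half only contributes when the feature bit is
    set. The sketch assumes the low half is 32 bits wide, so the high half
    is shifted by 32; the function and parameter names are hypothetical.

```python
def file_acl_block(acl_lo, acl_high, has_64bit):
    """Combine the ACL block number halves, honouring the feature bit."""
    block = acl_lo
    if has_64bit:
        block |= acl_high << 32
    return block
```

    A stale non-zero high half on a filesystem without the feature is
    simply ignored.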
     
  • After a transaction commit, the old roots of the subvol btrees are sent
    through snapshot removal. This is what actually frees up any blocks
    replaced by COW, and anything the old blocks pointed to.

    Snapshot deletion will pause when a transaction commit has started, which
    helps to avoid a huge amount of delayed reference count updates piling up
    as the transaction is trying to close.

    But, this pause happens after the snapshot deletion process has asked other
    procs on the system to throttle back a bit so that it can make progress.

    We don't want to throttle everyone while we're waiting for the
    transaction commit, because it leads to deadlocks in the user
    transaction ioctls used by Ceph and makes things slower in general.

    This patch changes things to avoid the throttling while we sleep.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The btrfs fallocate call takes an extent lock on the entire range
    being fallocated, and then runs through insert_reserved_extent on each
    extent as they are allocated.

    The problem with this is that btrfs_drop_extents may decide to try
    and take the same extent lock fallocate was already holding. The solution
    used here is to push down knowledge of the range that is already locked
    going into btrfs_drop_extents.

    It turns out that at least one other caller had the same bug.

    Signed-off-by: Chris Mason

    Chris Mason
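
    A sketch of the idea of pushing down knowledge of the locked range: the
    callee skips taking the lock when the caller says the range is already
    covered. A single non-reentrant mutex stands in for the extent lock,
    and locked_end and both function signatures are hypothetical.

```python
import threading

extent_lock = threading.Lock()  # stands in for the lock on one extent range


def drop_extents(start, end, locked_end=0):
    """Drop extents in [start, end), locking only if the caller has not.

    locked_end is the end of the range the caller already holds locked
    (0 means nothing is locked).
    """
    need_lock = locked_end < end
    if need_lock and not extent_lock.acquire(blocking=False):
        raise RuntimeError("would deadlock: range already locked by caller")
    try:
        return (start, end)  # placeholder for the real work
    finally:
        if need_lock:
            extent_lock.release()


def fallocate(start, end, pass_locked_range):
    """Lock the whole range, then call down into drop_extents()."""
    with extent_lock:
        return drop_extents(start, end, end if pass_locked_range else 0)
```

    Calling fallocate() without passing the locked range reproduces the
    self-deadlock as an exception; passing it lets the call complete.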
     
  • Just use kmem_cache_create directly.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig
     
  • Currently the extent_map code is only for btrfs so don't export its
    symbols.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig