12 Feb, 2007

22 commits

  • They are fat: 4x8 bytes in task_struct.
    They are unconditionally updated in every fork, read, write and sendfile.
    They are used only if you have some "extended acct fields feature".

    And please, please, please, read(2) knows about bytes, not characters,
    so why is it called "rchar"?

    Signed-off-by: Alexey Dobriyan
    Cc: Jay Lan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • jbd function called instead of fs specific one.

    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitriy Monakhov
     
  • Remove the kernel config option ZISOFS_FS, since it appears that the actual
    option is simply ZISOFS.

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • unlock_buffer(), like unlock_page(), must not clear the lock without
    ensuring that the critical section is closed.

    Mingming later sent the same patch, saying:

    We are running the SDET benchmark and saw a double-free issue for an ext3
    extended attributes block, which complains that the same xattr block has
    already been freed (in ext3_xattr_release_block()). The problem can also be
    triggered by multiple threads looping untar/rm on a kernel tree.

    The race is caused by a missing memory barrier in unlock_buffer() before the
    lock bit is cleared, resulting in a possible concurrent h_refcount update.
    That causes a reference counter leak, which later leads to the double free
    that we have seen.

    Inside unlock_buffer(), a memory barrier is placed *after* the lock bit is
    cleared; however, there is no memory barrier *before* the bit is cleared. On
    some architectures the h_refcount update instruction and the clear-bit
    instruction could be reordered, allowing the critical section to be
    re-entered.

    The race is like this: For example, if the h_refcount is initialized as 1,

    cpu 0:                                  cpu1
    --------------------------------------  -----------------------------------
    lock_buffer() /* test_and_set_bit */
    clear_buffer_locked(bh);
                                            lock_buffer() /* test_and_set_bit */
    h_refcount = h_refcount+1; /* = 2*/     h_refcount = h_refcount + 1; /*= 2 */
                                            clear_buffer_locked(bh);
    ....                                    ......

    We lose an h_refcount update here. We need a memory barrier before the buffer
    head lock bit is cleared to force the order of the two writes. Please apply.

    Signed-off-by: Nick Piggin
    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
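
    The ordering requirement described above can be sketched in userspace C11
    atomics. This is only an illustration, not the kernel's unlock_buffer():
    the struct, the SKETCH_LOCK_BIT constant and the sketch_* helpers are all
    hypothetical, and release ordering on the clearing operation stands in for
    the barrier that must come before the lock bit is cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_LOCK_BIT 1UL /* hypothetical stand-in for the buffer lock bit */

struct sketch_buffer {
    atomic_ulong state;   /* the lock bit lives here */
    int h_refcount;       /* protected by the lock bit */
};

/* Try to take the bit lock; acquire ordering keeps later accesses after it. */
static bool sketch_trylock(struct sketch_buffer *b)
{
    unsigned long old = atomic_fetch_or_explicit(&b->state, SKETCH_LOCK_BIT,
                                                 memory_order_acquire);
    return (old & SKETCH_LOCK_BIT) == 0;
}

/* Release ordering publishes the h_refcount store before the bit clears. */
static void sketch_unlock(struct sketch_buffer *b)
{
    atomic_fetch_and_explicit(&b->state, ~SKETCH_LOCK_BIT,
                              memory_order_release);
}

/* Increment the counter inside the critical section, if the lock is free. */
static bool sketch_get_ref(struct sketch_buffer *b)
{
    assert(b != NULL);
    if (!sketch_trylock(b))
        return false;
    b->h_refcount++;
    sketch_unlock(b);
    return true;
}
```

    With the clear done as a release operation, a second CPU that wins the
    lock afterwards is guaranteed to observe the incremented counter.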
     
  • Extend the set of "__attribute__" shortcut macros, and remove identical
    (and now superfluous) definitions from a couple of source files.

    Based on a page at Robert Love's blog:

    http://rlove.org/log/2005102601

    Extend the set of shortcut macros defined in compiler-gcc.h with the
    following:

    #define __packed __attribute__((packed))
    #define __weak __attribute__((weak))
    #define __naked __attribute__((naked))
    #define __noreturn __attribute__((noreturn))
    #define __pure __attribute__((pure))
    #define __aligned(x) __attribute__((aligned(x)))
    #define __printf(a,b) __attribute__((format(printf,a,b)))

    Once these are in place, it's up to subsystem maintainers to decide if they
    want to take advantage of them. There is already a strong precedent for
    using shortcuts like this in the source tree.

    The ones that might give people pause are "__aligned" and "__printf", but
    shortcuts for both of those are already in use, and in some ways very
    confusingly. Note the two very different definitions for a macro named
    "ALIGNED":

    drivers/net/sgiseeq.c:#define ALIGNED(x) ((((unsigned long)(x)) + 0xf) & ~(0xf))
    drivers/scsi/ultrastor.c:#define ALIGNED(x) __attribute__((aligned(x)))

    Also:

    include/acpi/platform/acgcc.h:
    #define ACPI_PRINTF_LIKE(c) __attribute__ ((__format__ (__printf__, c, c+1)))

    Given the precedent, then, it seems logical to at least standardize on a
    consistent set of these macros.

    Signed-off-by: Robert P. J. Day
    Acked-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
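
    A minimal userspace sketch of how three of the shortcut macros listed
    above read at a use site. The macro definitions are copied from the
    commit message (guarded, in case a system header already provides them);
    the struct names and sketch_format() are hypothetical and exist only for
    illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Guarded so the sketch does not clash with headers that define these. */
#ifndef __packed
#define __packed __attribute__((packed))
#endif
#ifndef __aligned
#define __aligned(x) __attribute__((aligned(x)))
#endif
#ifndef __printf
#define __printf(a, b) __attribute__((format(printf, a, b)))
#endif

/* Hypothetical on-disk record: no padding between the two fields. */
struct sketch_record {
    unsigned char tag;
    unsigned int  value;
} __packed;

/* Hypothetical buffer type aligned to a 16-byte boundary. */
struct sketch_aligned {
    char data[4];
} __aligned(16);

/* Compile-time check that packing removed the padding. */
static_assert(sizeof(struct sketch_record) == 1 + sizeof(unsigned int),
              "packed struct should have no padding");

/* The compiler checks the format string against the variadic arguments. */
__printf(3, 4)
static int sketch_format(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}
```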
     
  • - Naming is confusing, ext3_inc_count manipulates i_nlink not i_count
    - handle argument passed in is not used
    - ext3 and ext4 already call inc_nlink and dec_nlink directly in other places

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is
    0. Doing otherwise has the potential to corrupt the orphan inode list,
    because we'd wind up with an inode with a non-zero link count on the list,
    and it will never get properly cleaned up & removed from the orphan list
    before it is freed.

    [akpm@osdl.org: build fix]
    Signed-off-by: Eric Sandeen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
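
    The check described above amounts to refusing the link when the link
    count has already reached zero. A simplified sketch follows, assuming a
    hypothetical sketch_link_inode struct and sketch_link() helper rather
    than the real ext[34]_link code, and leaving out locking and journalling.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the one inode field the check cares about. */
struct sketch_link_inode {
    unsigned int i_nlink;
};

/*
 * Refuse to add a link to an inode whose link count already dropped to
 * zero: such an inode may be on the orphan list, and raising its link
 * count again would leave a linked inode on that list.
 */
static int sketch_link(struct sketch_link_inode *inode)
{
    assert(inode != NULL);
    if (inode->i_nlink == 0)
        return -ENOENT;
    inode->i_nlink++;
    return 0;
}
```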
     
  • Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or
    ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a
    kernel built without ACL support, then umask was ignored when creating
    inodes - though root or user has umask 022, touch creates files as 0666,
    and mkdir creates directories as 0777.

    This appears to have worked right until 2.6.11, when a fix to the default
    mode on symlinks (always 0777) assumed VFS applies umask: which it does,
    unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
    s_flags according to s_mount_opt set according to def_mount_opts.

    We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
    but other filesystems only set MS_POSIXACL when ACLs are configured. We
    could fix this at another level; but it seems most robust to avoid setting
    the s_mount_opt flag in the first place (at the expense of more ifdefs).

    Likewise don't set the XATTR_USER flag when built without XATTR support.

    Signed-off-by: Hugh Dickins
    Cc: Tigran Aivazian
    Cc:
    Cc: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Compile-tested.

    Signed-off-by: Alexey Dobriyan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • In the rare case where we have skipped orphan inode processing due to a
    readonly block device, and the block device subsequently changes back to
    read-write, disallow a remount,rw transition of the filesystem when there are
    unprocessed orphan inodes, as this would corrupt the list.

    Ideally we should process the orphan inode list during the remount, but that's
    trickier, and this plugs the hole for now.

    Signed-off-by: Eric Sandeen
    Cc: "Stephen C. Tweedie"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • In the rare case where we have skipped orphan inode processing due to a
    readonly block device, and the block device subsequently changes back to
    read-write, disallow a remount,rw transition of the filesystem when there are
    unprocessed orphan inodes, as this would corrupt the list.

    Ideally we should process the orphan inode list during the remount, but that's
    trickier, and this plugs the hole for now.

    Signed-off-by: Eric Sandeen
    Cc: "Stephen C. Tweedie"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • fs/proc/proc_misc.c: In function 'proc_misc_init':
    fs/proc/proc_misc.c:764: warning: unused variable 'entry'

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Correct the AIX magic check to let 'echo > /dev/sdb' actually work.

    Signed-off-by: Olaf Hering
    Cc: OGAWA Hirofumi
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
     
  • The patch to identify AIX disks and ignore them has caused at least one
    machine to fail to find the root partition on 2.6.19. The patch is:

    http://lkml.org/lkml/2006/7/31/117

    The problem is some disk formatters do not blow away the first 4 bytes
    of the disk. If the disk we are installing to used to have AIX on it,
    then the first 4 bytes will still have IBMA in EBCDIC.

    The install in question was Debian etch. I'm not sure what the best fix
    is, perhaps the AIX detection code could check more than the first 4
    bytes.

    The whole partition info for primary partitions is in this block:

    dd if=/dev/sdb count=$(( 4 * 16 )) bs=1 skip=$(( 0x1be ))

    All other data do not matter, besides the 0x55aa marker at the end of the
    first block.

    Signed-off-by: Olaf Hering
    Cc: OGAWA Hirofumi
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
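
    One way to "check more than the first 4 bytes" is sketched below. This is
    an assumption about how such a check could look, not the code that went
    into the kernel: the sketch_* names are hypothetical, and the rule used
    here (treat the disk as AIX only if the magic is present and the first
    sector does not also carry a plausible MS-DOS partition table) is chosen
    for illustration. The offsets and the 0x55aa marker come from the commit
    message; the byte values are "IBMA" in EBCDIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SECTOR_SIZE 512
#define SKETCH_PT_OFFSET   0x1be  /* start of the four primary entries */
#define SKETCH_PT_ENTRIES  4
#define SKETCH_PT_ENTRY_SZ 16

/* "IBMA" in EBCDIC, as left in the first four bytes by AIX. */
static const unsigned char sketch_aix_magic[4] = { 0xC9, 0xC2, 0xD4, 0xC1 };

static bool sketch_has_aix_magic(const unsigned char *sector)
{
    return memcmp(sector, sketch_aix_magic, sizeof(sketch_aix_magic)) == 0;
}

/* Plausible MS-DOS table: 0x55aa marker and boot flags of 0x00 or 0x80. */
static bool sketch_has_msdos_table(const unsigned char *sector)
{
    if (sector[510] != 0x55 || sector[511] != 0xaa)
        return false;
    for (size_t i = 0; i < SKETCH_PT_ENTRIES; i++) {
        unsigned char boot = sector[SKETCH_PT_OFFSET + i * SKETCH_PT_ENTRY_SZ];
        if (boot != 0x00 && boot != 0x80)
            return false;
    }
    return true;
}

/* Skip the disk as AIX only when no partition table contradicts the magic. */
static bool sketch_is_aix_disk(const unsigned char *sector)
{
    assert(sector != NULL);
    return sketch_has_aix_magic(sector) && !sketch_has_msdos_table(sector);
}
```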
     
  • Convert all calls to invalidate_inode_pages() into open-coded calls to
    invalidate_mapping_pages().

    Leave the invalidate_inode_pages() wrapper in place for now, marked as
    deprecated.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This one was pointed out on the MOKB site:
    http://kernelfun.blogspot.com/2006/11/mokb-09-11-2006-linux-26x-ext2checkpage.html

    If a directory's i_size is corrupted, ext2_find_entry() will keep
    processing pages until the i_size is reached, even if there are no more
    blocks associated with the directory inode. This patch puts in some
    minimal sanity-checking so that we don't keep checking pages (and issuing
    errors) if we know there can be no more data to read, based on the block
    count of the directory inode.

    This is somewhat similar in approach to the ext3 patch I sent earlier this
    year.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
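
    The sanity check can be pictured as bounding the page scan by the block
    count instead of trusting i_size alone. The sketch below is a simplified
    model, not the ext2 code: the sketch_dir struct and helpers are
    hypothetical, 4 KiB pages and 512-byte i_blocks units are assumed, and
    the bound is deliberately loose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT  12  /* assumes 4 KiB pages */
#define SKETCH_BLOCK_SHIFT 9   /* assumes i_blocks counts 512-byte units */

struct sketch_dir {
    unsigned long long i_size;   /* possibly corrupted */
    unsigned long i_blocks;
};

/* Number of pages implied by i_size, rounding up. */
static unsigned long sketch_size_pages(const struct sketch_dir *dir)
{
    return (unsigned long)((dir->i_size + (1ULL << SKETCH_PAGE_SHIFT) - 1)
                           >> SKETCH_PAGE_SHIFT);
}

/* True when page index n lies beyond anything the block count could back. */
static bool sketch_page_past_blocks(const struct sketch_dir *dir,
                                    unsigned long n)
{
    return n > (dir->i_blocks >> (SKETCH_PAGE_SHIFT - SKETCH_BLOCK_SHIFT));
}

/* Count how many pages a scan visits before the sanity check stops it. */
static unsigned long sketch_pages_scanned(const struct sketch_dir *dir)
{
    unsigned long n, npages;

    assert(dir != NULL);
    npages = sketch_size_pages(dir);
    for (n = 0; n < npages; n++)
        if (sketch_page_past_blocks(dir, n))
            break;
    return n;
}
```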
     
  • Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
    corresponding "kmem_cache_zalloc()" call.

    Signed-off-by: Robert P. J. Day
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: Roland McGrath
    Cc: James Bottomley
    Cc: Greg KH
    Acked-by: Joel Becker
    Cc: Steven Whitehouse
    Cc: Jan Kara
    Cc: Michael Halcrow
    Cc: "David S. Miller"
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • When igrab() is calling __iget() on an inode it should check if
    clear_inode() has been called on the inode already. Otherwise there is a
    race window between clear_inode() and destroy_inode() where igrab() calls
    __iget(), which leads to already freed inodes on the inode lists.

    Signed-off-by: Vandana Rungta
    Signed-off-by: Jan Blunck
    Cc: Al Viro
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
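
    The extra check can be sketched as one more state bit tested before the
    reference is taken. The flag names and values, the sketch_grab_inode
    struct and sketch_igrab() are hypothetical, and the locking that the real
    igrab() relies on is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state bits; the values are illustrative only. */
#define SKETCH_I_FREEING   0x1u
#define SKETCH_I_CLEAR     0x2u
#define SKETCH_I_WILL_FREE 0x4u

struct sketch_grab_inode {
    unsigned int i_state;
    unsigned int i_count;
};

/* Take a reference unless teardown has started or already cleared the inode. */
static struct sketch_grab_inode *sketch_igrab(struct sketch_grab_inode *inode)
{
    assert(inode != NULL);
    if (inode->i_state &
        (SKETCH_I_FREEING | SKETCH_I_CLEAR | SKETCH_I_WILL_FREE))
        return NULL;
    inode->i_count++;
    return inode;
}
```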
     
  • I added IS_NOATIME(inode) macro definition in include/linux/fs.h, true if
    the inode superblock is marked readonly or noatime.

    This new macro is then used in touch_atime() instead of separately testing
    MS_RDONLY and MS_NOATIME.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
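
    A userspace sketch of the idea: one macro that tests both superblock
    flags at once. The struct layout, the SKETCH_ prefixed names and the flag
    values are assumptions made for this example and are not taken from
    include/linux/fs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this sketch. */
#define SKETCH_MS_RDONLY  0x001UL
#define SKETCH_MS_NOATIME 0x400UL

struct sketch_super { unsigned long s_flags; };
struct sketch_atime_inode { struct sketch_super *i_sb; };

/* True if the inode's superblock is read-only or mounted noatime. */
#define SKETCH_IS_NOATIME(inode) \
    (((inode)->i_sb->s_flags & (SKETCH_MS_RDONLY | SKETCH_MS_NOATIME)) != 0)

/* One combined test replaces two separate flag checks. */
static bool sketch_should_touch_atime(const struct sketch_atime_inode *inode)
{
    assert(inode != NULL && inode->i_sb != NULL);
    return !SKETCH_IS_NOATIME(inode);
}
```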
     
  • As pointed out by Hugh, ramfs would also benefit from using the new
    set_page_dirty aop method for memory backed file systems.

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • Values are available via ZVC sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
    flag: use "MAP_ANONYMOUS" instead.

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     

10 Feb, 2007

5 commits

  • remap() the region we get from mmap() to mark the fact that we are
    using all of the available slack space. Any slack space is used
    to form a simple brk region, and potentially more stack space than
    requested at load time.

    Any searches of the vma chain may well fail looking for
    stack (and especially arg) addresses if the remapping is not done.
    The simplest example is /proc/<pid>/cmdline, since the args
    are pretty much always at the top of the data/bss/stack region.

    Signed-off-by: Greg Ungerer
    Signed-off-by: Linus Torvalds

    Greg Ungerer
     
  • Fix a double free of "dfid" introduced by commit
    da977b2c7eb4d6312f063a7b486f2aad99809710 and spotted by the Coverity
    checker.

    Signed-off-by: Adrian Bunk
    Cc: Eric Van Hensbergen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of huge_pte when unmapping a hugepage range. This causes data
    corruption if the sysadmin uses drop_caches. For example, an application
    creates a hugetlb file, modifies pages, then unmaps it. While the hugetlb
    file is still alive, the sysadmin comes along and does an "echo 3 >
    /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty if
    there are no active mappings. Later, when the application maps the hugetlb
    file back in, all the data are gone, triggering a catastrophic failure in
    the application.

    Not only that, the internal resv_huge_pages count will also get all messed
    up. Fix it up by marking page dirty appropriately.

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • This fixes a regression introduced around 2.6.16.

    The patch named ufs-directory-and-page-cache-from-blocks-to-pages.patch, in
    addition to converting from the block to the page cache mechanism, added new
    directory integrity checks, one of which is that a directory entry must not
    cross directory chunks.

    But some kinds of UFS, namely OpenStep UFS and Apple UFS (these appear to be
    the same filesystem), have a different directory chunk size than the common
    UFSes (BSD and Solaris UFS).

    So this patch adds the ability to work with a variable directory chunk size,
    and sets it to the right size for ufstype=openstep.

    Tested on darwin ufs.

    Signed-off-by: Evgeniy Dushistov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
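
    The integrity check in question reduces to simple arithmetic once the
    chunk size is a parameter. A minimal sketch, assuming a power-of-two
    chunk size; sketch_entry_in_chunk() is a hypothetical helper and the
    sizes used with it are examples rather than values read from the UFS
    code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if a directory entry starting at `offset` with length `rec_len`
 * stays inside a single directory chunk. `chunk_size` must be a power
 * of two so that masking yields the offset within the chunk.
 */
static bool sketch_entry_in_chunk(unsigned int offset, unsigned int rec_len,
                                  unsigned int chunk_size)
{
    assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);
    return (offset & (chunk_size - 1)) + rec_len <= chunk_size;
}
```

    An entry that fails the check under one chunk size can pass under a
    larger one, which is why a hard-coded size rejects valid directories on
    filesystems that use a different chunk size.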
     
  • Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

09 Feb, 2007

2 commits

  • Conflicts:

    crypto/Kconfig

    David S. Miller
     
  • * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/mfasheh/ocfs2: (22 commits)
    configfs: Zero terminate data in configfs attribute writes.
    [PATCH] ocfs2 heartbeat: clean up bio submission code
    ocfs2: introduce sc->sc_send_lock to protect outbound outbound messages
    [PATCH] ocfs2: drop INET from Kconfig, not needed
    ocfs2_dlm: Add timeout to dlm join domain
    ocfs2_dlm: Silence some messages during join domain
    ocfs2_dlm: disallow a domain join if node maps mismatch
    ocfs2_dlm: Ensure correct ordering of set/clear refmap bit on lockres
    ocfs2: Binds listener to the configured ip address
    ocfs2_dlm: Calling post handler function in assert master handler
    ocfs2: Added post handler callable function in o2net message handler
    ocfs2_dlm: Cookies in locks not being printed correctly in error messages
    ocfs2_dlm: Silence a failed convert
    ocfs2_dlm: wake up sleepers on the lockres waitqueue
    ocfs2_dlm: Dlm dispatch was stopping too early
    ocfs2_dlm: Drop inflight refmap even if no locks found on the lockres
    ocfs2_dlm: Flush dlm workqueue before starting to migrate
    ocfs2_dlm: Fix migrate lockres handler queue scanning
    ocfs2_dlm: Make dlmunlock() wait for migration to complete
    ocfs2_dlm: Fixes race between migrate and dirty
    ...

    Linus Torvalds
     

08 Feb, 2007

11 commits

  • Attributes in configfs are text files. As such, most handlers expect to be
    able to call functions like simple_strtoul() without checking the bounds
    of the buffer. Change the call to zero terminate the buffer before calling
    the client's ->store() method. This does reduce the attribute size from
    PAGE_SIZE to PAGE_SIZE-1.

    Also, change get_zeroed_page() to alloc_page(), as we are handling the
    termination.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
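
    The termination step can be sketched in userspace as follows. The buffer
    size, sketch_fill_write_buffer() and sketch_store() are hypothetical
    stand-ins; the real code copies from user space and calls the client's
    ->store() method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size */

/*
 * Copy at most SKETCH_PAGE_SIZE - 1 bytes into a page-sized buffer and
 * terminate it, so a store-style handler can use string functions
 * safely. The buffer does not need to be zeroed beforehand. Returns the
 * number of bytes copied.
 */
static size_t sketch_fill_write_buffer(char *page, const char *src,
                                       size_t count)
{
    assert(page != NULL && src != NULL);
    if (count >= SKETCH_PAGE_SIZE)
        count = SKETCH_PAGE_SIZE - 1;
    memcpy(page, src, count);
    page[count] = '\0';
    return count;
}

/* Hypothetical handler that parses without checking buffer bounds itself. */
static unsigned long sketch_store(const char *page)
{
    return strtoul(page, NULL, 10);
}
```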
     
  • As Mathieu Avila already pointed out on Thu, 07 Sep 2006 03:15:25 -0700,
    OCFS2 expects bio_add_page() to add pages to BIOs in an easily
    predictable manner.

    That is not true, especially for devices with their own merge_bvec_fn().

    Therefore OCFS2's heartbeat code is very likely to fail on such devices.

    Move the bio_put() call into the bio's bi_end_io() function. This makes the
    whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
    Removed compute_max_sectors() and o2hb_compute_request_limits().

    Signed-off-by: Philipp Reisner
    Signed-off-by: Mark Fasheh

    Philipp Reisner
     
  • When there is a lot of multithreaded I/O usage, two threads can collide
    while sending out a message to the other nodes. This is due to the lack of
    locking between threads while sending out the messages.

    When a connected TCP send(), sendto(), or sendmsg() arrives in the Linux
    kernel, it eventually comes through tcp_sendmsg(). tcp_sendmsg() protects
    itself by acquiring a lock at invocation by calling lock_sock().
    tcp_sendmsg() then loops over the buffers in the iovec, allocating
    associated sk_buff's and cache pages for use in the actual send. As it does
    so, it pushes the data out to tcp for actual transmission. However, if one
    of those allocations fails (because a large number of large sends is being
    processed, for example), it must wait for memory to become available. It
    does so by jumping to wait_for_sndbuf or wait_for_memory, both of which
    eventually cause a call to sk_stream_wait_memory(). sk_stream_wait_memory()
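    contains a code path that calls sk_wait_event(). Finally, sk_wait_event()
    contains the call to release_sock(). Once the socket lock is released,
    another thread can enter tcp_sendmsg() and interleave its data with the
    partially sent message.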

    The following patch adds a lock to the socket container in order to
    properly serialize outbound requests.

    From: Zhen Wei
    Acked-by: Jeff Mahoney
    Signed-off-by: Mark Fasheh

    Zhen Wei
     
  • OCFS2: drop 'depends on INET' since local mounts are now allowed.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Mark Fasheh

    Randy Dunlap
     
  • Currently the ocfs2 dlm has no timeout during dlm join domain. While this is
    not a problem in normal operation, this does become an issue if, say, the
    other node is refusing to let the node join the domain because of a stuck
    recovery. This patch adds a 90 sec timeout.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • These messages can easily be activated using the mlog infrastructure
    and don't need to be enabled by default.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • There is a small window where a joining node may not see the node(s) that
    just died but are still part of the domain. To fix this, we must disallow
    join requests if the joining node has a different node map.

    A new field node_map is added to dlm_query_join_request to send the current
    node's node map along with the join request. On the receiving end, the nodes
    that are part of the cluster verify that the new node sees all the nodes
    that are still part of the cluster. They disallow the join if the maps
    mismatch.

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Mark Fasheh

    Srinivas Eeda
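
    The comparison on the receiving end can be sketched as a bitmap check:
    every node the cluster member still considers part of the domain must
    also be set in the joiner's map. The map size, the bit helpers and
    sketch_join_allowed() are hypothetical and do not mirror the dlm data
    structures.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 255   /* illustrative limit */
#define SKETCH_MAP_BYTES ((SKETCH_MAX_NODES + CHAR_BIT - 1) / CHAR_BIT)

static bool sketch_test_bit(const unsigned char *map, size_t nr)
{
    return (map[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1u;
}

static void sketch_set_bit(unsigned char *map, size_t nr)
{
    map[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

/* Allow the join only if the joiner sees every node this member sees. */
static bool sketch_join_allowed(const unsigned char *local_map,
                                const unsigned char *joiner_map)
{
    assert(local_map != NULL && joiner_map != NULL);
    for (size_t nr = 0; nr < SKETCH_MAX_NODES; nr++)
        if (sketch_test_bit(local_map, nr) && !sketch_test_bit(joiner_map, nr))
            return false;
    return true;
}
```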
     
  • Even though the set refmap bit message is sent before the clear refmap
    message, currently there is no guarantee that the set message will be
    handled before the clear. This patch prevents the clear refmap from being
    processed while the node is sending assert master messages to other
    nodes. (The set refmap message is sent as a response to the assert
    master request.)

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch binds the o2net listener to the configured ip address
    instead of INADDR_ANY for security. Fixes oss.oracle.com bugzilla#814.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch prevents the dlm from sending the clear refmap message
    before the set refmap. We use the newly created post function handler
    routine to accomplish the task.

    Signed-off-by: Kurt Hackel
    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Kurt Hackel
     
  • Currently o2net allows one handler function per message type. This
    patch adds the ability to register another function that is called after
    the handler's reply has been returned to the other node.

    Handlers are now given the option of returning a context (in the form of a
    void **) which will be passed back into the post message handler function.

    Signed-off-by: Kurt Hackel
    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Kurt Hackel
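
    The handler/post-handler flow can be sketched as a small dispatch
    routine. All names and signatures here are hypothetical and simplified
    (the real o2net prototypes take more arguments); sketch_send_reply()
    merely records the status where the real code would send it over the
    network.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler signatures for this sketch. */
typedef int  (*sketch_handler_fn)(const void *msg, void **ret_data);
typedef void (*sketch_post_fn)(int status, void *ret_data);

struct sketch_msg_handler {
    sketch_handler_fn handler;
    sketch_post_fn    post;      /* optional, may be NULL */
};

/* Stand-in for sending the handler's status back to the other node. */
static int sketch_last_reply;
static void sketch_send_reply(int status) { sketch_last_reply = status; }

/* Run the handler, reply, then run the post handler with its context. */
static int sketch_dispatch(const struct sketch_msg_handler *h, const void *msg)
{
    void *ret_data = NULL;
    int status;

    assert(h != NULL && h->handler != NULL);
    status = h->handler(msg, &ret_data);
    sketch_send_reply(status);
    if (h->post != NULL)
        h->post(status, ret_data);
    return status;
}

/* Example handler: hands a counter to the post handler as its context. */
static int sketch_post_calls;
static int sketch_example_handler(const void *msg, void **ret_data)
{
    (void)msg;
    *ret_data = &sketch_post_calls;
    return 0;
}

/* Example post handler: bumps the counter it received as context. */
static void sketch_example_post(int status, void *ret_data)
{
    if (status == 0 && ret_data != NULL)
        ++*(int *)ret_data;
}
```

    Because the post handler runs only after the reply has gone out, work
    placed there is ordered after the response, which is the property the
    refmap fix above depends on.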