09 Jan, 2006

25 commits

  • We've had two instances recently of overflows when doing

    64_bit_value = (32_bit_value << PAGE_CACHE_SHIFT)

    I did a tree-wide grep of `<page_base)
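
    The overflow pattern described above can be reproduced in userspace. The
    sketch below is only an illustration: EX_PAGE_SHIFT (assumed 4 KiB pages)
    and both function names are hypothetical, not kernel identifiers. It shows
    why the 32-bit operand has to be widened before the shift, not after.

```c
#include <stdint.h>

/* Hypothetical stand-in for PAGE_CACHE_SHIFT, assuming 4 KiB pages. */
#define EX_PAGE_SHIFT 12

/* Buggy pattern: the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened to 64 bits. */
uint64_t ex_offset_truncated(uint32_t index)
{
    return (uint32_t)(index << EX_PAGE_SHIFT);
}

/* Fixed pattern: widen the operand first, then shift in 64 bits. */
uint64_t ex_offset_widened(uint32_t index)
{
    return (uint64_t)index << EX_PAGE_SHIFT;
}
```

    With a page index of 0x100000 (the page at the 4 GiB boundary) the first
    variant wraps to 0, while the second yields 4294967296.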

    Cc: Oleg Drokin
    Cc: David Howells
    Cc: David Woodhouse
    Cc: Christoph Hellwig
    Cc: Anton Altaparmakov
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Roman Zippel
    Cc: Miklos Szeredi
    Cc: Russell King
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When making an fcntl locking call through compat_sys_fcntl64 (i.e. a 32bit
    app on a 64bit kernel), the syscall can return a locking range that is in
    conflict with the queried lock.

    If some aspect of this range does not fit in the 32bit structure, something
    needs to be done.

    The current code is wrong in several respects:

    - It returns data to userspace even if no conflict was found,
      i.e. it should check l_type for F_UNLCK.
    - It returns -EOVERFLOW too aggressively. A lock range covering
      the last possible byte of the file (start = COMPAT_OFF_T_MAX,
      len = 1) should be possible, but is rejected by the current test.
    - An extra-long 'len' should not be a problem. Only that part
      of the conflicting lock that would be visible to the 32bit
      app needs to be reported to the 32bit app anyway.
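
    The three rules above can be sketched as a small userspace model. This is
    not the kernel code: the struct layouts, the EX_ names and
    ex_report_conflict() are hypothetical, EX_OFF_T_MAX stands in for
    COMPAT_OFF_T_MAX, and negative lengths are ignored for brevity.

```c
#include <errno.h>
#include <stdint.h>

#define EX_OFF_T_MAX INT32_MAX   /* stand-in for COMPAT_OFF_T_MAX */

enum ex_lock_type { EX_RDLCK, EX_WRLCK, EX_UNLCK };

struct ex_lock64 { enum ex_lock_type type; int64_t start; int64_t len; };
struct ex_lock32 { enum ex_lock_type type; int32_t start; int32_t len; };

/* Copy the result of a lock query into the 32-bit structure. */
int ex_report_conflict(const struct ex_lock64 *found, struct ex_lock32 *out)
{
    /* No conflict: only the type is reported, no range data. */
    if (found->type == EX_UNLCK) {
        out->type = EX_UNLCK;
        return 0;
    }
    /* Only a start offset that cannot be represented is an error. */
    if (found->start > EX_OFF_T_MAX)
        return -EOVERFLOW;

    out->type = found->type;
    out->start = (int32_t)found->start;
    /* An over-long length is clamped instead of rejected. */
    out->len = found->len > EX_OFF_T_MAX ? EX_OFF_T_MAX
                                         : (int32_t)found->len;
    return 0;
}
```

    In this model a conflicting lock at start = EX_OFF_T_MAX with len = 1 is
    reported unchanged, and a lock with a huge length is reported with the
    length clamped.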

    This patch addresses those three issues and adds a comment to (hopefully)
    record it for posterity.

    Note: this patch mainly affects test-cases. Real applications rarely if
    ever see the problems.

    This patch has been tested (LSB test suite), and works.

    Signed-off-by: Neil Brown
    Cc: Arnd Bergmann
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • SUS requires that when truncating a file to the size that it currently
    is:
    truncate and ftruncate should NOT modify ctime or mtime
    O_TRUNC SHOULD modify ctime and mtime.

    Currently mtime and ctime are always modified on most local
    filesystems (side effect of ->truncate) or never modified (on NFS).

    With this patch:
    - ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when
      an update of these times is required, whether the size changes or not
      (via a new argument to do_truncate). This allows NFS to do
      the right thing for O_TRUNC.
    - inode_setattr no longer forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE
      sets the size to its current value. This allows local filesystems
      to do the right thing for f?truncate.
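
    As a rough model of the resulting behaviour (an illustrative sketch, not
    the kernel code; the EX_ATTR_ flags and both helpers are hypothetical
    names), the caller decides whether the time flags accompany the size, and
    the timestamps change either because they were requested or because the
    size really changed:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for ATTR_SIZE, ATTR_MTIME and ATTR_CTIME. */
#define EX_ATTR_SIZE  (1u << 0)
#define EX_ATTR_MTIME (1u << 1)
#define EX_ATTR_CTIME (1u << 2)

/* Caller side: O_TRUNC asks for a time update, f?truncate does not. */
unsigned int ex_truncate_attrs(bool via_o_trunc)
{
    unsigned int valid = EX_ATTR_SIZE;

    if (via_o_trunc)
        valid |= EX_ATTR_MTIME | EX_ATTR_CTIME;
    return valid;
}

/* Filesystem side: times change when explicitly requested, or as a side
 * effect of a truncate that really changes the size. */
bool ex_times_updated(unsigned int valid, long long old_size,
                      long long new_size)
{
    if (valid & (EX_ATTR_MTIME | EX_ATTR_CTIME))
        return true;
    return (valid & EX_ATTR_SIZE) && new_size != old_size;
}
```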

    Also, the logic in inode_setattr is changed a bit so there are two return
    points. One returns the error from vmtruncate if it failed, the other
    returns 0 (there can be no other failure).

    Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change
    requested, we now fall-through and mark_inode_dirty. If a filesystem did
    not have a ->truncate function, then vmtruncate will have changed i_size,
    without marking the inode as 'dirty', and I think this is wrong.

    Signed-off-by: Neil Brown
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • inode can never be NULL when calling this function.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • This patch renames relayfs_file_operations to relay_file_operations, and the
    file operations themselves from relayfs_XXX to relay_file_XXX, to make it more
    clear that they refer to relay files.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • This patch adds the optional is_global outparam to the create_buf_file()
    callback. This can be used by clients to create a single global relayfs
    buffer instead of the default per-cpu buffers. This was suggested as being
    useful for certain debugging applications where it's more convenient to be
    able to get all the data from a single channel without having to go to the
    bother of dealing with per-cpu files.
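
    A userspace mock may make the is_global outparam easier to picture. None
    of the names below are the relayfs API; EX_NR_CPUS, struct ex_chan and
    the two callbacks are assumptions made up for this sketch. The open loop
    asks the callback for a file per CPU and, once the callback sets
    is_global, reuses the first buffer for every remaining CPU.

```c
#include <stddef.h>

#define EX_NR_CPUS 4            /* hypothetical CPU count */

struct ex_file { int cpu; };

/* Hypothetical analogue of the create_buf_file() callback. */
typedef struct ex_file *(*ex_create_buf_file_t)(int cpu, int *is_global);

struct ex_chan {
    struct ex_file *buf[EX_NR_CPUS];
    int nfiles;                 /* number of files actually created */
};

static struct ex_file ex_files[EX_NR_CPUS];

/* Default behaviour: one file per CPU, is_global left untouched. */
struct ex_file *ex_create_percpu(int cpu, int *is_global)
{
    (void)is_global;
    ex_files[cpu].cpu = cpu;
    return &ex_files[cpu];
}

/* Client that wants a single buffer shared by all CPUs. */
struct ex_file *ex_create_global(int cpu, int *is_global)
{
    *is_global = 1;
    ex_files[cpu].cpu = cpu;
    return &ex_files[cpu];
}

int ex_chan_open(struct ex_chan *chan, ex_create_buf_file_t create)
{
    int is_global = 0;

    chan->nfiles = 0;
    for (int cpu = 0; cpu < EX_NR_CPUS; cpu++) {
        if (is_global) {
            /* Global mode: every CPU shares the first buffer. */
            chan->buf[cpu] = chan->buf[0];
            continue;
        }
        chan->buf[cpu] = create(cpu, &is_global);
        if (!chan->buf[cpu])
            return -1;
        chan->nfiles++;
    }
    return 0;
}
```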

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • This patch adds a couple of callback functions that allow a client to hook
    into relay_open()/close() and supply the files that will be used to represent
    the channel buffers; the default implementation if no callbacks are defined is
    to create the files in relayfs. This is to support the creation and use of
    relay files in other filesystems such as debugfs, as implied by the fact that
    relayfs_file_operations are exported.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • Since we're no longer using relayfs_inode_info, remove relayfs_alloc_inode()
    and relayfs_destroy_inode() along with the relayfs inode cache.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • Use inode->u.generic_ip instead of relayfs_inode_info to store the pointer to
    user data. Clients using relayfs_create_file() to create their own files would
    more likely expect their data to be stored in generic_ip; we also intend in
    the next set of patches to get rid of relayfs-specific stuff in the file
    operations, so we might as well do it here.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • This patch adds and exports relayfs_remove_file(), for API symmetry (with
    relayfs_create_file()).

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • This patch adds a mandatory fileops param to relayfs_create_file() and exports
    that function so that clients can use it to create files defined by their own
    set of file operations, in relayfs. The purpose is to allow relayfs
    applications to create their own set of 'control' files alongside their relay
    files in relayfs rather than having to create them in /proc or debugfs for
    instance. relayfs_create_file() is also used by relay_open_buf() to create
    the relay files for a channel. In this case, a pointer to
    relayfs_file_operations is passed in, along with a pointer to the buffer
    associated with the file.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • The patch series implements or fixes 3 things that were specifically requested
    or suggested by relayfs users:

    - support for non-relay files (patches 1-6)

    Currently, the relayfs API only supports the creation of directories
    (relayfs_create_dir()) and relay files (relay_open()). These patches add
    support for non-relay files (relayfs_create_file()). This is so relayfs
    applications can create 'control files' in relayfs itself rather than in /proc
    or via a netlink channel, as is currently done in the relay-app examples.
    Basically what this amounts to is exporting relayfs_create_file() with an
    additional file_ops param that clients can use to supply file operations for
    their own special-purpose files in relayfs.

    - make exported relay file ops useful (patches 7-8)

    The relayfs relay_file_operations have always been exported, the intent being
    to make it possible to create relay files in other filesystems such as
    debugfs. The problem, though, is that currently the file operations are too
    tightly coupled to relayfs to actually be used for this purpose. This patch
    fixes that by adding a couple of callback functions that allow a client to
    hook into relay_open()/close() and supply the files that will be used to
    represent the channel buffers; the default implementation if no callbacks are
    defined is to create the files in relayfs.

    - add an option to create global relay buffer (patches 9-10)

    The file creation callback also supplies an optional param, is_global, that
    can be used by clients to create a single global relayfs buffer instead of
    the default per-cpu buffers. This was suggested as being useful for certain
    debugging applications where it's more convenient to be able to get all the
    data from a single channel without having to go to the bother of dealing
    with per-cpu files.

    - cleanup, some renaming and Documentation updates (patches 11-12)

    There were several comments that the use of netlink in the example code was
    non-intuitive and in fact the whole relay-app business was needlessly
    confusing. Based on that feedback, the example code has been completely
    converted over to relayfs control files as supported by this patch, and has
    also been made completely self-contained.

    The converted examples along with a couple of new examples that demonstrate
    using exported relay files can be found in the relay-apps tarball:
    http://prdownloads.sourceforge.net/relayfs/relay-apps-0.9.tar.gz?download

    This patch:

    Separate buffer create/destroy from inode create/destroy. We want to be able
    to associate other data and not just relay buffers with inodes. Buffer
    create/destroy is moved out of inode.c and into relayfs core code.

    Signed-off-by: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     
  • Use atomic_inc_not_zero for rcu files instead of special case rcuref.
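
    For readers unfamiliar with the primitive, the semantics of an
    "increment unless zero" operation can be sketched with C11 atomics. This
    is a userspace illustration of the idea, not the kernel's
    atomic_inc_not_zero() implementation; ex_inc_not_zero() is a hypothetical
    name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero.
 * Returns true when the reference was taken. */
bool ex_inc_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;   /* object is already on its way to being freed */
}
```

    A lock-free reader that finds an object whose count is already zero must
    treat the lookup as failed, which is what the false return models.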

    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.

    See mm/filemap.c:

    It also changes filemap_write_and_wait() and filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an
    error, it may have submitted some of the data pages to the device
    (e.g. in the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch waits even after an error, except when
    filemap_fdatawrite() returns -EIO.
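
    The error handling described above can be modelled in userspace as
    follows. This is a sketch under assumptions: struct ex_mapping and the
    ex_ helpers are hypothetical stand-ins that return canned results, not
    the functions in mm/filemap.c.

```c
#include <errno.h>

/* Hypothetical mapping whose write and wait results are preset. */
struct ex_mapping {
    int write_result;   /* what the write-out step returns */
    int wait_result;    /* what the wait step returns */
    int waited;         /* how many times we waited */
};

static int ex_fdatawrite(struct ex_mapping *m)
{
    return m->write_result;
}

static int ex_fdatawait(struct ex_mapping *m)
{
    m->waited++;
    return m->wait_result;
}

int ex_write_and_wait(struct ex_mapping *m)
{
    int err = ex_fdatawrite(m);

    /* Pages may have been submitted even on error (e.g. -ENOSPC), so
     * wait for them. -EIO is the exception, since waiting might hang. */
    if (err != -EIO) {
        int err2 = ex_fdatawait(m);

        if (!err)
            err = err2;   /* keep the first error seen */
    }
    return err;
}
```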

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • This patch changes generic_cont_expand(), in order to share the code
    with fatfs.

    - Use vmtruncate() if ->prepare_write() returns an error.

    Even if ->prepare_write() returns an error, it may already have added some
    blocks. So, this truncates blocks outside of ->i_size by vmtruncate().

    - Add generic_cont_expand_simple().

    The generic_cont_expand_simple() assumes that ->prepare_write() can handle
    the block boundary. With this, we don't need to care about the extra byte.

    And for expanding a file size by truncate(), fatfs uses the
    added generic_cont_expand_simple().

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • This patch adds support for ->direct_IO(), mostly for reads.

    The users of this seem to want it for streaming reads. For now, direct
    I/O has a limitation: it can only overwrite. (For write operations, we
    mainly need to handle holes etc.)

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • All EXPORT_SYMBOLs of fatfs are only for vfat/msdos. _GPL would be proper.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • We don't need to allocate a buffer to check whether the buffer is uptodate.
    This uses sb_find_get_block() instead, and if it returns NULL it's not uptodate.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • It is overkill to update the FS_INFO whenever modifying
    prev_free/free_clusters, because those are just a hint.

    So, this patch uses ->write_super() for updating FS_INFO instead.
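
    The idea is the usual "mark dirty now, write back later" pattern. A
    minimal sketch, with a hypothetical struct ex_fat_sb and helper names
    that are not taken from the fat code:

```c
/* Hypothetical in-memory copy of the hint fields kept in FS_INFO. */
struct ex_fat_sb {
    unsigned int free_clusters;
    unsigned int prev_free;
    int dirty;                  /* hints changed since the last write */
    unsigned int fsinfo_writes; /* counts simulated FS_INFO writes */
};

/* Allocation path: update the hints in memory only. */
void ex_note_alloc(struct ex_fat_sb *sb, unsigned int cluster)
{
    sb->prev_free = cluster;
    if (sb->free_clusters)
        sb->free_clusters--;
    sb->dirty = 1;
}

/* Periodic write-back: one FS_INFO write covers all earlier updates. */
void ex_write_super(struct ex_fat_sb *sb)
{
    if (!sb->dirty)
        return;
    sb->fsinfo_writes++;
    sb->dirty = 0;
}
```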

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • configurable replacement for slab allocator

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. It's
    significantly smaller code and is more memory efficient. But like all
    similar allocators, it scales poorly and suffers from fragmentation more
    than SLAB, so it's only appropriate for small systems.
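
    For readers who have not met the K&R style of allocator, here is a
    minimal first-fit free-list sketch over a static arena. It is not the
    code in mm/slob.c; every ex_ name and the arena size are assumptions made
    for illustration, and it ignores locking and strict alignment.

```c
#include <stddef.h>

/* Block header; all sizes are counted in header-sized units. */
typedef struct ex_block {
    struct ex_block *next;   /* next free block, in address order */
    size_t units;            /* block size, header included */
} ex_block;

#define EX_ARENA_UNITS 256
static ex_block ex_arena[EX_ARENA_UNITS];
static ex_block *ex_free_list;
static int ex_ready;

static void ex_init(void)
{
    ex_arena[0].next = NULL;
    ex_arena[0].units = EX_ARENA_UNITS;
    ex_free_list = &ex_arena[0];
    ex_ready = 1;
}

void *ex_alloc(size_t nbytes)
{
    /* Round up to whole units and add one unit for the header. */
    size_t units = (nbytes + sizeof(ex_block) - 1) / sizeof(ex_block) + 1;

    if (!ex_ready)
        ex_init();
    for (ex_block **link = &ex_free_list; *link; link = &(*link)->next) {
        ex_block *b = *link;

        if (b->units < units)
            continue;               /* first fit: skip small blocks */
        if (b->units == units) {
            *link = b->next;        /* exact fit: unlink the block */
        } else {
            b->units -= units;      /* split: hand out the tail */
            b += b->units;
            b->units = units;
        }
        return b + 1;               /* payload follows the header */
    }
    return NULL;
}

void ex_free(void *p)
{
    ex_block *b, *prev = NULL, *cur = ex_free_list;

    if (!p)
        return;
    b = (ex_block *)p - 1;
    while (cur && cur < b) {        /* find the address-ordered slot */
        prev = cur;
        cur = cur->next;
    }
    b->next = cur;
    if (cur && b + b->units == cur) {       /* merge with next block */
        b->units += cur->units;
        b->next = cur->next;
    }
    if (!prev) {
        ex_free_list = b;
    } else if (prev + prev->units == b) {   /* merge with previous */
        prev->units += b->units;
        prev->next = b->next;
    } else {
        prev->next = b;
    }
}

/* Total free space in units, for inspection. */
size_t ex_free_units(void)
{
    size_t total = 0;

    for (ex_block *b = ex_free_list; b; b = b->next)
        total += b->units;
    return total;
}
```

    The linear walk over a single free list is what keeps the code tiny, and
    it is also why this family of allocators scales poorly and fragments
    more easily, as noted above.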

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
       text    data     bss     dec     hex filename
    3336372  529360  190812 4056544  3de5e0 vmlinux-slab
    3323208  527948  190684 4041840  3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
       text    data     bss     dec     hex filename
      13221     752      48   14021    36c5 mm/slab.o
       1896      52       8    1956     7a4 mm/slob.o

    /proc/meminfo:
                         SLAB         SLOB      delta
    MemTotal:        27964 kB     27980 kB     +16 kB
    MemFree:         24596 kB     25092 kB    +496 kB
    Buffers:            36 kB        36 kB       0 kB
    Cached:           1188 kB      1188 kB       0 kB
    SwapCached:          0 kB         0 kB       0 kB
    Active:            608 kB       600 kB      -8 kB
    Inactive:          808 kB       812 kB      +4 kB
    HighTotal:           0 kB         0 kB       0 kB
    HighFree:            0 kB         0 kB       0 kB
    LowTotal:        27964 kB     27980 kB     +16 kB
    LowFree:         24596 kB     25092 kB    +496 kB
    SwapTotal:           0 kB         0 kB       0 kB
    SwapFree:            0 kB         0 kB       0 kB
    Dirty:               4 kB        12 kB      +8 kB
    Writeback:           0 kB         0 kB       0 kB
    Mapped:            560 kB       556 kB      -4 kB
    Slab:             1756 kB         0 kB   -1756 kB
    CommitLimit:     13980 kB     13988 kB      +8 kB
    Committed_AS:     4208 kB      4208 kB       0 kB
    PageTables:         28 kB        28 kB       0 kB
    VmallocTotal:  1007312 kB   1007312 kB       0 kB
    VmallocUsed:        48 kB        48 kB       0 kB
    VmallocChunk:  1007264 kB   1007264 kB       0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
    instead of tasklist_lock read-locked. This is a scalability improvement on
    SMP and a preemption-latency improvement under PREEMPT_RCU.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar
    Acked-by: William Irwin
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Suppress configuration of certain features for the FRV arch as they can't be
    built for FRV at the moment:

    (*) RTC

    (*) HISAX_*

    (*) PARPORT_PC

    (*) VGA_CONSOLE

    (*) BINFMT_ELF

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2

    - Use the check_range() in mempolicy.c to gather statistics.

    - Improve the numa_maps code in general and fix some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
    discard as much pagecache and/or reclaimable slab objects as it can. This
    operation requires root permissions.

    It won't drop dirty data, so the user should run `sync' first.
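
    A small helper shows how a benchmark harness might drive this. It is a
    sketch, not part of the patch: ex_drop_caches() is a hypothetical name,
    and the path is a parameter so that the helper can be tried against an
    ordinary file. The values assumed here are 1 for pagecache, 2 for
    reclaimable slab objects (dentries and inodes) and 3 for both.

```c
#include <stdio.h>
#include <unistd.h>

/* Write 'what' (1, 2 or 3) to the given control file after syncing.
 * On a real system the path would be /proc/sys/vm/drop_caches and the
 * caller would need root. Returns 0 on success, -1 on failure. */
int ex_drop_caches(const char *path, int what)
{
    FILE *f;
    int ok;

    if (what < 1 || what > 3)
        return -1;

    sync();     /* dirty data is not dropped, so flush it first */

    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", what) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

    Calling it between benchmark runs gives each run a cold cache.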

    Caveats:

    a) Holds inode_lock for exorbitant amounts of time.

    b) Needs to be taught about NUMA nodes: propagate these all the way through
    so the discarding can be controlled on a per-node basis.

    This is a debugging feature: useful for getting consistent results between
    filesystem benchmarks. We could possibly put it under a config option, but
    it's less than 300 bytes.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

07 Jan, 2006

15 commits

  • Linus Torvalds
     
  • This patch should fix compilation failure of fs/ufs/dir.c with defined UFS_DIR_DEBUG

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Linus Torvalds

    Evgeniy
     
  • If the loop errors, we need to exit.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If someone changes the uid/gid mapping in userland, then we do eventually
    want those changes to be propagated to the kernel. Currently the kernel
    assumes that it may cache entries forever.

    Add an expiration time + garbage collector for idmap entries.
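
    The expiry idea can be sketched as follows. This is an illustration
    only: struct ex_idmap_entry, the helpers and the explicit 'now' argument
    are assumptions, not the nfs idmap code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cached id <-> name mapping with an expiry time. */
struct ex_idmap_entry {
    uint32_t id;
    char name[32];
    uint64_t expires;   /* time after which the entry is stale */
    bool valid;
};

void ex_idmap_store(struct ex_idmap_entry *e, uint32_t id,
                    const char *name, uint64_t now, uint64_t ttl)
{
    e->id = id;
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->expires = now + ttl;
    e->valid = true;
}

/* A lookup may use the entry only while it has not expired. */
bool ex_idmap_fresh(const struct ex_idmap_entry *e, uint64_t now)
{
    return e->valid && now < e->expires;
}

/* Garbage collector: invalidate expired entries, return how many. */
int ex_idmap_gc(struct ex_idmap_entry *entries, int n, uint64_t now)
{
    int reclaimed = 0;

    for (int i = 0; i < n; i++) {
        if (entries[i].valid && now >= entries[i].expires) {
            entries[i].valid = false;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

    An expired entry forces a fresh lookup in userland, which is how a
    changed uid/gid mapping eventually reaches the kernel.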

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • inode->i_mode contains a lot more than just the mode bits. Make sure that
    we mask away this extra stuff in SETATTR calls to the server.
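
    A short sketch of the masking, using the Linux numeric values for the
    file-type and permission bits. The EX_ constants and ex_setattr_mode()
    are hypothetical names for this illustration; the kernel's own mask for
    these bits is S_IALLUGO.

```c
/* Linux values: permission, setuid, setgid and sticky bits. */
#define EX_S_IALLUGO 07777u
/* Linux value of the "regular file" type bits. */
#define EX_S_IFREG   0100000u

/* Keep only the bits a SETATTR mode field is meant to carry. */
unsigned int ex_setattr_mode(unsigned int i_mode)
{
    return i_mode & EX_S_IALLUGO;
}
```

    For a regular file with mode 0644, i_mode is 0100644 and the value sent
    to the server becomes 0644.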

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
    client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
    just get rid of cl_chatty and always be verbose.

    Test-plan:
    Compile with CONFIG_NFS enabled.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
    Start by creating an API to force RPC rebind, replacing logic that simply
    sets cl_port to zero.

    Test-plan:
    Destructive testing (unplugging the network temporarily). Connectathon
    with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
    Probably need to rig a server where certain services aren't running, or
    that returns an error for some typical operation.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
    the top-level Solaris share. We could not mount deeper into the tree.
    Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
    top level of the share. This is a problem because we share/export our home
    directories one level higher than we mount them. I.e. we share the partition
    and not the individual home directories. This prevented access to home
    directories."

    We still may need to try auth_sys for the case where the client doesn't have
    appropriate credentials.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • The procedure that decodes statd sm_notify call seems to be skipping a
    few arguments. How did this ever work?

    From folks at Polyserve.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • If the server receives an NLM cancel call and finds no waiting lock to
    cancel, then chances are the lock has already been applied, and the client
    just hadn't yet processed the NLM granted callback before it sent the
    cancel.

    The Open Group text, for example, permits a server to return either success
    (LCK_GRANTED) or failure (LCK_DENIED) in this case. But returning an error
    seems more helpful; the client may be able to use it to recognize that a
    race has occurred and to recover from the race.

    So, modify the relevant functions to return an error in this case.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • The fl_next check here is superfluous (and possibly a layering violation).

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • Currently when lockd gets an NLM_CANCEL request, it also does an unlock for
    the same range. This is incorrect.

    The Open Group documentation says that "This procedure cancels an
    *outstanding* blocked lock request." (Emphasis mine.)

    Also, consider a client that holds a lock on the first byte of a file, and
    requests a lock on the entire file. If the client cancels that request
    (perhaps because the requesting process is signalled), the server shouldn't
    perform an unlock on the entire file, since that will also remove the
    previous lock that the client was already granted.

    Or consider a lock request that actually *downgraded* an exclusive lock to
    a shared lock.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • Slightly simpler logic here makes it easier to verify that the up's
    and down's are balanced. Break out an assignment from a conditional
    while we're at it.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust