20 Apr, 2008

1 commit

  • Requiring userspace to close and re-open sysfs attributes has been the
    policy since before 2.6.12. It allows userspace to get a consistent
    snapshot of kernel state and consume it with incremental reads and seeks.

    Now, if the file position is zero, the kernel assumes userspace wants to see
    the new value. The application for this change is to allow a userspace
    RAID metadata handler to check the state of an array without causing any
    memory allocations, and thus without causing writeback to a RAID array that
    might be blocked waiting for userspace to take action.
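
    A minimal userspace sketch of the access pattern this enables: keep the
    attribute open and re-read it at offset zero into a caller-supplied buffer.
    The helper name reread_attr() is hypothetical, and the sketch assumes that
    pread() at offset 0 is treated like a read at file position zero.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Re-read an attribute through an already-open descriptor.
 * Reading at offset 0 asks for the current value; the caller supplies
 * the buffer, so this helper performs no allocation of its own.
 * Returns the number of bytes stored (excluding the NUL), or -1 on error.
 */
ssize_t reread_attr(int fd, char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    n = pread(fd, buf, len - 1, 0);   /* offset 0: request a fresh value */
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

    A monitor would open the attribute once at startup and then call
    reread_attr() with a stack or static buffer each time it needs the state.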

    Cc: Neil Brown
    Acked-by: Tejun Heo
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

18 Apr, 2008

2 commits


16 Apr, 2008

1 commit


12 Apr, 2008

2 commits


31 Mar, 2008

2 commits

  • On Friday 2008-03-28 19:20, Jonathan Corbet wrote:
    >commit 9756ccfda31b4c4544aa010aacf71b6672d668e8
    >Date: Fri Mar 28 11:19:56 2008 -0600
    >
    > Add the seq_file documentation

    patch on top:

    - add const qualifiers
    - remove void* casts
    - use proper specifier (%Ld is not valid)

    Signed-off-by: Jonathan Corbet
    Signed-off-by: Jan Engelhardt

    Jan Engelhardt
     
  • This is an updated version of the document describing the seq_file
    interface.

    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

12 Mar, 2008

1 commit


09 Feb, 2008

2 commits

  • This series addresses the problem of showing mount options in
    /proc/mounts.

    Several filesystems which use mount options have not implemented a
    .show_options superblock operation. Several others have implemented
    this callback, but have not kept it fully up to date with the parsed
    options.

    Q: Why do we need correct option showing in /proc/mounts?
    A: We want /proc/mounts to fully replace /etc/mtab. The reasons for
    this are:
    - unprivileged mounters won't be able to update /etc/mtab
    - /etc/mtab doesn't work with private mount namespaces
    - /etc/mtab can become out-of-sync with reality

    Q: Can't this be done so that filesystems need not bother with
    implementing a .show_options callback and keeping it up to date?
    A: Only in some cases. Certain filesystems allow modification of a
    subset of options in their remount_fs method. It is not possible
    to take this into account without knowing exactly how the
    filesystem handles options.

    For the simple case (no remount or remount resets all options) the
    patchset introduces two helpers:

    generic_show_options()
    save_mount_options()

    These can also be used to emulate the old /etc/mtab behavior, until
    proper support is added. Even if this is not 100% correct, it's still
    better than showing no options at all.
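
    The idea behind the two helpers can be sketched in userspace as follows.
    This is only a model under stated assumptions: struct model_sb and the
    model_* functions are hypothetical stand-ins, not the kernel
    implementation of save_mount_options() or generic_show_options().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for per-superblock state; not a kernel type. */
struct model_sb {
    char *saved_opts;   /* copy of the option string given at mount time */
};

/* Mount (or remount) time: remember the raw option string. */
int model_save_options(struct model_sb *sb, const char *opts)
{
    char *copy = NULL;

    if (opts && *opts) {
        copy = malloc(strlen(opts) + 1);
        if (!copy)
            return -1;
        strcpy(copy, opts);
    }
    free(sb->saved_opts);       /* a remount replaces the old string */
    sb->saved_opts = copy;
    return 0;
}

/*
 * Show time: append ",<options>" to the NUL-terminated text already in buf.
 * Returns 0 on success, -1 if the result would not fit in len bytes.
 */
int model_show_options(const struct model_sb *sb, char *buf, size_t len)
{
    size_t used = strlen(buf);
    int n;

    if (!sb->saved_opts)
        return 0;
    n = snprintf(buf + used, len - used, ",%s", sb->saved_opts);
    return (n < 0 || (size_t)n >= len - used) ? -1 : 0;
}
```

    Because the string is replayed verbatim, this only matches reality when a
    remount either is not supported or resets all options, which is the simple
    case described above.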

    The following patches fix up most in-tree filesystems. Some have been
    compile tested only; some have been reviewed and acked by the
    maintainer.

    Table displaying status of all in-kernel filesystems:
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    legend:

    none - fs has options, but doesn't define ->show_options()
    some - fs defines ->show_options(), but only some options are shown
    good - fs shows all options
    noopt - fs does not have options
    patch - a patch will be posted
    merged - a patch has been merged by subsystem maintainer

    9p good
    adfs patch
    affs patch
    afs patch
    autofs patch
    autofs4 patch
    befs patch
    bfs noopt
    cifs some
    coda noopt
    configfs noopt
    cramfs noopt
    debugfs noopt
    devpts patch
    ecryptfs good
    efs noopt
    ext2 patch
    ext3 good
    ext4 merged
    fat patch
    freevxfs noopt
    fuse patch
    fusectl noopt
    gfs2 good
    gfs2meta noopt
    hfs good
    hfsplus good
    hostfs patch
    hpfs patch
    hppfs noopt
    hugetlbfs patch
    isofs patch
    jffs2 noopt
    jfs merged
    minix noopt
    msdos ->fat
    ncpfs patch
    nfs some
    nfsd noopt
    ntfs good
    ocfs2 good
    ocfs2/dlmfs noopt
    openpromfs noopt
    proc noopt
    qnx4 noopt
    ramfs noopt
    reiserfs patch
    romfs noopt
    smbfs good
    sysfs noopt
    sysv noopt
    udf patch
    ufs good
    vfat ->fat
    xfs good

    mm/shmem.c patch
    drivers/oprofile/oprofilefs.c noopt
    drivers/infiniband/hw/ipath/ipath_fs.c noopt
    drivers/misc/ibmasm/ibmasmfs.c noopt
    drivers/usb/core (usbfs) merged
    drivers/usb/gadget (gadgetfs) noopt
    drivers/isdn/capi/capifs.c patch
    kernel/cpuset.c noopt
    fs/binfmt_misc.c noopt
    net/sunrpc/rpc_pipe.c noopt
    arch/powerpc/platforms/cell/spufs patch
    arch/s390/hypfs good
    ipc/mqueue.c noopt
    security (securityfs) noopt
    security/selinux/selinuxfs.c noopt
    kernel/cgroup.c good
    security/smack/smackfs.c noopt

    in -mm:

    reiser4 some
    unionfs good
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    This patch:

    Document the rules for handling mount options in the .show_options
    super operation.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Implement dmode option for iso9660 filesystem to allow setting of access
    rights for directories on the filesystem.

    Signed-off-by: Jan Kara
    Cc: "Ilya N. Golubev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Feb, 2008

4 commits

  • Remove the old iget() call and the read_inode() superblock operation it uses
    as these are really obsolete, and the use of read_inode() does not produce
    proper error handling (no distinction between ENOMEM and EIO when marking an
    inode bad).

    Furthermore, this removes the temptation to use iget() to find an inode by
    number in a filesystem from code outside that filesystem.

    iget_locked() should be used instead. A new function is added in an earlier
    patch (iget_failed) that is to be called to mark an inode as bad, unlock it
    and release it should the get routine fail. Mark iget() and read_inode() as
    being obsolete and remove references to them from the documentation.

    Typically a filesystem will be modified such that the read_inode function
    becomes an internal iget function, for example the following:

    void thingyfs_read_inode(struct inode *inode)
    {
            ...
    }

    would be changed into something like:

    struct inode *thingyfs_iget(struct super_block *sb, unsigned long ino)
    {
            struct inode *inode;
            int ret;

            inode = iget_locked(sb, ino);
            if (!inode)
                    return ERR_PTR(-ENOMEM);
            if (!(inode->i_state & I_NEW))
                    return inode;

            ...
            unlock_new_inode(inode);
            return inode;
    error:
            iget_failed(inode);
            return ERR_PTR(ret);
    }

    and then thingyfs_iget() would be called rather than iget(), for example:

    ret = -EINVAL;
    inode = iget(sb, ino);
    if (!inode || is_bad_inode(inode))
            goto error;

    becomes:

    inode = thingyfs_iget(sb, ino);
    if (IS_ERR(inode)) {
            ret = PTR_ERR(inode);
            goto error;
    }

    Note that is_bad_inode() does not need to be called. The error returned by
    thingyfs_iget() should render it unnecessary.

    Signed-off-by: David Howells
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Introduce a function to register failure in an inode construction path. This
    includes marking the inode under construction as bad, unlocking it and
    releasing it.

    Signed-off-by: David Howells
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This documentation is also vfs-related.

    Signed-off-by: J. Bruce Fields
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    J. Bruce Fields
     
  • I'm inclined to think dnotify belongs in filesystems/.

    Signed-off-by: J. Bruce Fields
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    J. Bruce Fields
     

07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) actually forbids processes from
    opening more than 1024*1024 handles.

    Unfortunately, some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because it could exhaust vmalloc
    space.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
    1024*1024, so that admins can decide to change this limit if their workload
    needs it.
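
    A small sketch of how a userspace program might read the current limit.
    The helper name read_sysctl_ulong() is hypothetical; the sketch assumes
    the file holds a single decimal number.

```c
#include <stdio.h>

/*
 * Read one unsigned integer from a sysctl-style file.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
int read_sysctl_ulong(const char *path, unsigned long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

    Typical use would be read_sysctl_ulong("/proc/sys/fs/nr_open", &limit),
    treating a failure as "sysctl not available on this kernel".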

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

06 Feb, 2008

2 commits

  • Though lower_zone_protection was changed to lowmem_reserve_ratio, the
    documentation was not updated. A suitable lowmem_reserve_ratio is quite
    hard to estimate, yet there is no guidance. This patch updates the
    documentation accordingly.

    Signed-off-by: Yasunori Goto
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Add vm.highmem_is_dirtyable toggle

    A 32-bit machine with HIGHMEM64 enabled running DCC has an mmapped file of
    approximately 2GB size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.
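
    The effect of the toggle can be illustrated with a deliberately simplified
    model. The function names and the percentage-based limit are assumptions
    made for illustration; the kernel's actual accounting is more involved.

```c
/*
 * Simplified model of the memory counted when computing dirty limits.
 * With the toggle at 0 only lowmem counts; with 1, highmem is added back.
 */
unsigned long model_dirtyable_pages(unsigned long lowmem_pages,
                                    unsigned long highmem_pages,
                                    int highmem_is_dirtyable)
{
    return lowmem_pages + (highmem_is_dirtyable ? highmem_pages : 0);
}

/* Dirty threshold, in pages, for a given percentage ratio. */
unsigned long model_dirty_limit(unsigned long dirtyable_pages,
                                unsigned int ratio_pct)
{
    return dirtyable_pages * ratio_pct / 100;
}
```

    With a small lowmem-only total, a large randomly written mapping exceeds
    the threshold almost immediately, which is consistent with the constant
    disk IO described above.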

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     

03 Feb, 2008

5 commits


02 Feb, 2008

1 commit

  • execve arguments can be quite large. There is no limit on the number of
    arguments and a 4G limit on the size of an argument.

    This patch prints those arguments in bite-sized pieces. A userspace size
    limitation of 8k was discovered, so this keeps messages around 7.5k.

    Single arguments larger than 7.5k in length are split into multiple records
    and can be identified as aX[Y]=
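
    A sketch of the splitting arithmetic, under stated assumptions: the chunk
    size of 7500 bytes stands in for "around 7.5k", the plain "aX=" label for
    unsplit arguments is assumed, and the model_* names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MODEL_CHUNK 7500   /* assumed per-record payload limit, in bytes */

/* Number of records needed for an argument of the given length. */
size_t model_record_count(size_t arg_len)
{
    if (arg_len <= MODEL_CHUNK)
        return 1;
    return (arg_len + MODEL_CHUNK - 1) / MODEL_CHUNK;
}

/*
 * Write the field label for piece `piece` of argument `argno` into buf.
 * Arguments that fit in one record use "aX="; split ones use "aX[Y]=".
 */
int model_field_label(char *buf, size_t len, unsigned int argno,
                      size_t piece, size_t arg_len)
{
    if (model_record_count(arg_len) == 1)
        return snprintf(buf, len, "a%u=", argno);
    return snprintf(buf, len, "a%u[%zu]=", argno, piece);
}
```

    A consumer reassembles a split argument by concatenating the aX[0]=,
    aX[1]=, ... pieces in index order.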

    Signed-off-by: Eric Paris

    Eric Paris
     

01 Feb, 2008

1 commit

  • The current IP route cache implementation is not suited to large caches.

    We can consume a lot of CPU when the cache must be invalidated, since we
    currently need to evict all cache entries, and this eviction is
    sometimes asynchronous. min_delay & max_delay can somewhat control this
    asynchronous behavior, but the whole thing is a kludge, regularly triggering
    infamous soft lockup messages. When entries are still in use, this also
    consumes a lot of RAM, filling dst_garbage.list.

    A better scheme is to use a generation identifier on each entry,
    so that cache invalidation can be performed by changing the table
    identifier, without having to scan all entries.
    No more delayed flushing, no more stalling when secret_interval expires.

    Invalidated entries will then be freed at GC time (controlled by
    ip_rt_gc_timeout or stress), or when an invalidated entry is found
    in a chain when an insert is done.
    Thus we keep a normal equilibrium.
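
    The generation scheme can be sketched with a toy cache. Everything here
    (struct names, the direct-mapped table, MODEL_SLOTS) is hypothetical and
    far simpler than the route cache; it only shows how bumping one counter
    invalidates all entries without scanning them.

```c
#define MODEL_SLOTS 8   /* illustrative table size */

struct model_entry {
    int          in_use;
    unsigned int key;
    unsigned int value;
    unsigned int genid;   /* table generation when the entry was inserted */
};

struct model_cache {
    unsigned int       genid;              /* current table generation */
    struct model_entry slot[MODEL_SLOTS];
};

/* Invalidate every entry in O(1): just move to a new generation. */
void model_cache_flush(struct model_cache *c)
{
    c->genid++;
}

/* Insert, reclaiming whatever (possibly stale) entry occupies the slot. */
void model_cache_insert(struct model_cache *c, unsigned int key,
                        unsigned int value)
{
    struct model_entry *e = &c->slot[key % MODEL_SLOTS];

    e->in_use = 1;
    e->key = key;
    e->value = value;
    e->genid = c->genid;
}

/* Lookup ignores entries from an older generation. Returns 1 on a hit. */
int model_cache_lookup(const struct model_cache *c, unsigned int key,
                       unsigned int *value)
{
    const struct model_entry *e = &c->slot[key % MODEL_SLOTS];

    if (!e->in_use || e->key != key || e->genid != c->genid)
        return 0;
    *value = e->value;
    return 1;
}
```

    Stale entries cost nothing at flush time; they are simply overwritten on a
    later insert, mirroring the "freed when found in a chain" behavior above.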

    This patch :
    - renames rt_hash_rnd to rt_genid (and makes it an atomic_t)
    - Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit)
    - Checks entry->rt_genid at appropriate places :

    Eric Dumazet
     

29 Jan, 2008

2 commits

  • Signed-off-by: Alex Tomas
    Signed-off-by: Andreas Dilger
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Alex Tomas
     
  • The journal checksum feature adds two new flags, i.e.
    JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

    The JBD2_FEATURE_COMPAT_CHECKSUM flag indicates that the commit block
    contains the checksum for the blocks described by the descriptor blocks.
    Due to checksums, writing of the commit record no longer needs to be
    synchronous. The commit record can now be sent to disk without waiting for
    descriptor blocks to be written to disk. This behavior is controlled
    using the JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT flag. Older kernels/e2fsck
    should not be able to recover the journal with _ASYNC_COMMIT, hence it is
    made incompat.
    The commit header has been extended to hold the checksum along with the
    type of the checksum.

    During the scan pass of recovery, checksums are verified to ensure the
    sanity and completeness (in case of _ASYNC_COMMIT) of every transaction.
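
    The recovery-side check can be modeled in a few lines. This is a sketch
    under assumptions: the bitwise CRC-32 below is a stand-in for whatever
    checksum type the commit header records, and struct model_commit and the
    model_* functions are hypothetical, not the on-disk jbd2 format.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); start with crc = 0. */
uint32_t model_crc32(uint32_t crc, const unsigned char *data, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical commit record: only the field needed for this sketch. */
struct model_commit {
    uint32_t checksum;   /* checksum over all blocks of the transaction */
};

/* Checksum every block of a transaction, chaining the running value. */
uint32_t model_txn_checksum(const unsigned char *const *blocks,
                            size_t nblocks, size_t block_size)
{
    uint32_t crc = 0;

    for (size_t i = 0; i < nblocks; i++)
        crc = model_crc32(crc, blocks[i], block_size);
    return crc;
}

/* Scan-time check: 1 if the blocks match the commit record, else 0. */
int model_txn_valid(const struct model_commit *commit,
                    const unsigned char *const *blocks,
                    size_t nblocks, size_t block_size)
{
    return model_txn_checksum(blocks, nblocks, block_size) == commit->checksum;
}
```

    If a commit record reached disk before some of its blocks did, the
    recomputed value would differ and the transaction would be treated as
    incomplete, which is what makes the asynchronous commit safe to replay.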

    Signed-off-by: Andreas Dilger
    Signed-off-by: Girish Shilamkar
    Signed-off-by: Dave Kleikamp
    Signed-off-by: Mingming Cao

    Girish Shilamkar
     

26 Jan, 2008

4 commits

  • Hook up ocfs2_flock(), using the new flock lock type in dlmglue.c. A new
    mount option, "localflocks", is added so that users can revert to the old
    functionality as need be.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Local alloc is a performance optimization in ocfs2 in which a node
    takes a window of bits from the global bitmap and then uses that for
    all small local allocations. This window size is currently fixed at 8MB.
    This patch allows users to specify the window size in MB, or to disable
    local alloc by passing in 0. If the number specified is too large,
    the fs will use the default value of 8MB.

    mount -o localalloc=X /dev/sdX /mntpoint
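
    The selection rule can be written out as a small function. The max_mb
    parameter is a hypothetical ceiling introduced for illustration, since the
    text does not say what counts as "too large".

```c
#define MODEL_DEFAULT_LOCALALLOC_MB 8   /* default window size from the text */

/*
 * Pick the window size for "localalloc=X".
 * 0 disables local alloc; a value above max_mb (a hypothetical
 * per-filesystem ceiling) falls back to the default.
 */
unsigned int model_localalloc_mb(unsigned int requested_mb,
                                 unsigned int max_mb)
{
    if (requested_mb == 0)
        return 0;
    if (requested_mb > max_mb)
        return MODEL_DEFAULT_LOCALALLOC_MB;
    return requested_mb;
}
```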

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • Mostly taken from ext3. This allows the user to set the jbd commit interval,
    in seconds. The default of 5 seconds stays the same, but now users can
    easily increase the commit interval. Typically, this would be increased in
    order to benefit performance at the expense of data-safety.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Remove 'readpages' from the list in ocfs2.txt. Instead of having two
    identical lists, I just removed the list in the OCFS2 section of fs/Kconfig
    and added a pointer to Documentation/filesystems/ocfs2.txt.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

24 Oct, 2007

1 commit


22 Oct, 2007

1 commit

  • Update documentation to the current state of affairs. Remove duplicated
    method descriptions in exportfs.h and point to
    Documentation/filesystems/Exporting instead. Add a little file header
    comment in expfs.c describing what's going on and mentioning Neil's and my
    copyright [1].

    Signed-off-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc:
    Cc: Dave Kleikamp
    Cc: Anton Altaparmakov
    Cc: David Chinner
    Cc: Timothy Shimmin
    Cc: OGAWA Hirofumi
    Cc: Hugh Dickins
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

20 Oct, 2007

7 commits