04 Apr, 2009

6 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (41 commits)
    NFS: Add mount options to enable local caching on NFS
    NFS: Display local caching state
    NFS: Store pages from an NFS inode into a local cache
    NFS: Read pages from FS-Cache into an NFS inode
    NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
    NFS: Add read context retention for FS-Cache to call back with
    NFS: FS-Cache page management
    NFS: Add some new I/O counters for FS-Cache doing things for NFS
    NFS: Invalidate FsCache page flags when cache removed
    NFS: Use local disk inode cache
    NFS: Define and create inode-level cache objects
    NFS: Define and create superblock-level objects
    NFS: Define and create server-level objects
    NFS: Register NFS for caching and retrieve the top-level index
    NFS: Permit local filesystem caching to be enabled for NFS
    NFS: Add FS-Cache option bit and debug bit
    NFS: Add comment banners to some NFS functions
    FS-Cache: Make kAFS use FS-Cache
    CacheFiles: A cache that backs onto a mounted filesystem
    CacheFiles: Export things for CacheFiles
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (36 commits)
    dm: set queue ordered mode
    dm: move wait queue declaration
    dm: merge pushback and deferred bio lists
    dm: allow uninterruptible wait for pending io
    dm: merge __flush_deferred_io into caller
    dm: move bio_io_error into __split_and_process_bio
    dm: rename __split_bio
    dm: remove unnecessary struct dm_wq_req
    dm: remove unnecessary work queue context field
    dm: remove unnecessary work queue type field
    dm: bio list add bio_list_add_head
    dm snapshot: persistent fix dtr cleanup
    dm snapshot: move status to exception store
    dm snapshot: move ctr parsing to exception store
    dm snapshot: use DMEMIT macro for status
    dm snapshot: remove dm_snap header
    dm snapshot: remove dm_snap header use
    dm exception store: move cow pointer
    dm exception store: move chunk_fields
    dm exception store: move dm_target pointer
    ...

    Linus Torvalds
     
  • Commit f4112de6b679d84bd9b9681c7504be7bdfb7c7d5 ("mm: introduce
    debug_kmap_atomic") broke PPC builds with CONFIG_HIGHMEM=y:

    CC init/main.o
    In file included from include/linux/highmem.h:25,
    from include/linux/pagemap.h:11,
    from include/linux/mempolicy.h:63,
    from init/main.c:53:
    arch/powerpc/include/asm/highmem.h: In function 'kmap_atomic_prot':
    arch/powerpc/include/asm/highmem.h:98: error: implicit declaration of function 'debug_kmap_atomic'
    In file included from include/linux/pagemap.h:11,
    from include/linux/mempolicy.h:63,
    from init/main.c:53:
    include/linux/highmem.h: At top level:
    include/linux/highmem.h:196: warning: conflicting types for 'debug_kmap_atomic'
    include/linux/highmem.h:196: error: static declaration of 'debug_kmap_atomic' follows non-static declaration
    include/asm/highmem.h:98: error: previous implicit declaration of 'debug_kmap_atomic' was here
    make[1]: *** [init/main.o] Error 1
    make: *** [init] Error 2

    Signed-off-by: Kumar Gala
    Acked-by: Akinobu Mita
    Signed-off-by: Linus Torvalds

    Kumar Gala
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: ixp4xx - Fix handling of chained sg buffers
    crypto: shash - Fix unaligned calculation with short length
    hwrng: timeriomem - Use phys address rather than virt

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md: (53 commits)
    md/raid5 revise rules for when to update metadata during reshape
    md/raid5: minor code cleanups in make_request.
    md: remove CONFIG_MD_RAID_RESHAPE config option.
    md/raid5: be more careful about write ordering when reshaping.
    md: don't display meaningless values in sysfs files resync_start and sync_speed
    md/raid5: allow layout and chunksize to be changed on active array.
    md/raid5: reshape using largest of old and new chunk size
    md/raid5: prepare for allowing reshape to change layout
    md/raid5: prepare for allowing reshape to change chunksize.
    md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
    Documentation/md.txt update
    md: allow number of drives in raid5 to be reduced
    md/raid5: change reshape-progress measurement to cope with reshaping backwards.
    md: add explicit method to signal the end of a reshape.
    md/raid5: enhance raid5_size to work correctly with negative delta_disks
    md/raid5: drop qd_idx from r6_state
    md/raid6: move raid6 data processing to raid6_pq.ko
    md: raid5 run(): Fix max_degraded for raid level 4.
    md: 'array_size' sysfs attribute
    md: centralize ->array_sectors modifications
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bart/linux-hdreg-h-cleanup:
    remove include from
    include/linux/hdreg.h: remove unused defines
    isd200: use ATA_* defines instead of *_STAT and *_ERR ones
    include/linux/hdreg.h: cover WIN_* and friends with #ifndef/#endif __KERNEL__
    aoe: WIN_* -> ATA_CMD_*
    isd200: WIN_* -> ATA_CMD_*
    include/linux/hdreg.h: cover struct hd_driveid with #ifndef/#endif __KERNEL__
    xsysace: make it 'struct hd_driveid'-free
    ubd_kern: make it 'struct hd_driveid'-free
    isd200: make it 'struct hd_driveid'-free

    Linus Torvalds
     

03 Apr, 2009

34 commits

  • nfs_readpage_async() needs to be non-static so that it can be used as a
    fallback for the local on-disk caching should an EIO crop up when reading the
    cache.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
    Add some new NFS I/O counters for FS-Cache doing things for NFS. If caching
    is enabled, a new line that looks like this is emitted into
    /proc/pid/mountstats:

    fsc:

    The counters that follow the tag are, in order: the number of pages read
    successfully from the cache, the number of failed page reads against the
    cache, the number of successful page writes to the cache, the number of
    failed page writes to the cache, and the number of NFS pages that have been
    disconnected from the cache.
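    A userspace tool could pick these counters off the `fsc:` line with
    `sscanf()`. This is a hedged sketch: the struct and field names are
    hypothetical, and it assumes the line is the `fsc:` tag followed by exactly
    five whitespace-separated unsigned decimal counters in the order above.

```c
#include <stdio.h>

/* Hypothetical holder for the five FS-Cache counters, in the documented order. */
struct nfs_fsc_counters {
	unsigned long long read_ok;     /* pages read successfully from the cache */
	unsigned long long read_fail;   /* failed page reads against the cache */
	unsigned long long write_ok;    /* successful page writes to the cache */
	unsigned long long write_fail;  /* failed page writes to the cache */
	unsigned long long uncached;    /* NFS pages disconnected from the cache */
};

/* Parse one "fsc:" line (leading whitespace allowed). Returns 0 on success,
 * -1 if the line does not match the assumed layout. */
int parse_fsc_line(const char *line, struct nfs_fsc_counters *c)
{
	int n = sscanf(line, " fsc: %llu %llu %llu %llu %llu",
		       &c->read_ok, &c->read_fail, &c->write_ok,
		       &c->write_fail, &c->uncached);
	return n == 5 ? 0 : -1;
}
```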

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Bind data storage objects in the local cache to NFS inodes.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Define and create superblock-level cache index objects (as managed by
    nfs_server structs).

    Each superblock object is created in a server level index object and is itself
    an index into which inode-level objects are inserted.

    Ideally there would be one superblock-level object per server, and the former
    would be folded into the latter; however, since the "nosharecache" option
    exists this isn't possible.

    The superblock object key is a sequence consisting of:

    (1) Certain superblock s_flags.

    (2) Various connection parameters that serve to distinguish superblocks for
    sget().

    (3) The volume FSID.

    (4) The security flavour.

    (5) The uniquifier length.

    (6) The uniquifier text. This is normally an empty string, unless the fsc=xyz
    mount option was used to explicitly specify a uniquifier.

    The key blob is of variable length, depending on the length of (6).
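    The key layout can be pictured as a fixed header holding items (1) to (5),
    followed by the uniquifier bytes of item (6). The sketch below is a
    userspace illustration only: `struct sb_key_hdr`, its field widths and
    `build_sb_key()` are hypothetical and do not reproduce the real NFS key
    encoding.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed part of the key: items (1) to (5). Widths are assumed. */
struct sb_key_hdr {
	uint32_t s_flags;      /* (1) selected superblock s_flags */
	uint32_t conn_params;  /* (2) connection parameters, condensed */
	uint64_t fsid_major;   /* (3) volume FSID */
	uint64_t fsid_minor;
	uint32_t auth_flavor;  /* (4) security flavour */
	uint32_t uniq_len;     /* (5) uniquifier length */
};

/* Build header + (6) uniquifier text (no NUL). Caller frees; *len gets size. */
unsigned char *build_sb_key(const struct sb_key_hdr *hdr_in,
			    const char *uniq, size_t *len)
{
	struct sb_key_hdr hdr = *hdr_in;
	size_t ulen = uniq ? strlen(uniq) : 0;  /* empty unless fsc=xyz given */
	unsigned char *blob;

	hdr.uniq_len = (uint32_t)ulen;
	*len = sizeof(hdr) + ulen;              /* total size varies with (6) */
	blob = malloc(*len);
	if (!blob)
		return NULL;
	memcpy(blob, &hdr, sizeof(hdr));
	if (ulen)
		memcpy(blob + sizeof(hdr), uniq, ulen);
	return blob;
}
```

    Two superblocks that differ only in their uniquifier therefore produce
    different keys, which is what keeps them from sharing a cache object.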

    The superblock object is given no coherency data to carry in the auxiliary data
    permitted by the cache. It is assumed that the superblock is always coherent.

    This patch also adds uniquification handling such that two otherwise identical
    superblocks, at least one of which is marked "nosharecache", won't end up
    trying to share the on-disk cache. It will be possible to manually provide a
    uniquifier through a mount option with a later patch to avoid the error
    otherwise produced.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Define and create server-level cache index objects (as managed by nfs_client
    structs).

    Each server object is created in the NFS top-level index object and is itself
    an index into which superblock-level objects are inserted.

    Ideally there would be one superblock-level object per server, and the former
    would be folded into the latter; however, since the "nosharecache" option
    exists this isn't possible.

    The server object key is a sequence consisting of:

    (1) NFS version.

    (2) Server address family (e.g. AF_INET or AF_INET6).

    (3) Server port.

    (4) Server IP address.

    The key blob is of variable length, depending on the length of (4).
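    The sketch below shows where the server key's variable length comes from:
    only as many address bytes as the family needs are appended.
    `struct srv_key_hdr` and `build_srv_key()` are hypothetical; `AF_INET`,
    `AF_INET6`, `struct in_addr` and `struct in6_addr` are the standard socket
    definitions.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical fixed part of the server key: items (1) to (3). */
struct srv_key_hdr {
	uint16_t nfsversion;  /* (1) NFS version */
	uint16_t family;      /* (2) AF_INET or AF_INET6 */
	uint16_t port;        /* (3) server port, network byte order */
};

/* Write the key into buf (room for sizeof(hdr) + 16 bytes assumed) and
 * return its length, or 0 for an unsupported address family. */
size_t build_srv_key(unsigned char *buf, uint16_t vers, int family,
		     uint16_t port_be, const void *addr)
{
	struct srv_key_hdr hdr = { vers, (uint16_t)family, port_be };
	size_t alen;

	if (family == AF_INET)
		alen = sizeof(struct in_addr);   /* 4 bytes */
	else if (family == AF_INET6)
		alen = sizeof(struct in6_addr);  /* 16 bytes */
	else
		return 0;

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), addr, alen);  /* (4) server IP address */
	return sizeof(hdr) + alen;
}
```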

    The server object is given no coherency data to carry in the auxiliary data
    permitted by the cache.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Add FS-Cache option bit to nfs_server struct. This is set to indicate local
    on-disk caching is enabled for a particular superblock.

    Also add debug bit for local caching operations.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Add a function to install a monitor on the page lock waitqueue for a particular
    page, thus allowing the page being unlocked to be detected.

    This is used by CacheFiles to detect read completion on a page in the backing
    filesystem so that it can then copy the data to the waiting netfs page.
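    The idea can be modelled in userspace as callbacks attached to a page and
    fired when the page is unlocked. Everything below (`struct sim_page`,
    `sim_add_unlock_monitor()`, `sim_unlock_page()`, `sim_count_wakeup()`) is
    hypothetical; the kernel function described above hooks an entry onto the
    page's lock wait queue rather than keeping its own array.

```c
#include <stdbool.h>

#define SIM_MAX_MONITORS 4

struct sim_page;
typedef void (*sim_monitor_fn)(struct sim_page *page, void *data);

/* Hypothetical stand-in for a page with a lock bit and a wait queue. */
struct sim_page {
	bool locked;
	int nr_monitors;
	sim_monitor_fn fn[SIM_MAX_MONITORS];
	void *data[SIM_MAX_MONITORS];
};

/* Install a monitor; returns 0, or -1 if the fixed-size queue is full. */
int sim_add_unlock_monitor(struct sim_page *page, sim_monitor_fn fn, void *data)
{
	if (page->nr_monitors == SIM_MAX_MONITORS)
		return -1;
	page->fn[page->nr_monitors] = fn;
	page->data[page->nr_monitors] = data;
	page->nr_monitors++;
	return 0;
}

/* Clear the lock bit (read completed) and run every installed monitor once. */
void sim_unlock_page(struct sim_page *page)
{
	int i;

	page->locked = false;
	for (i = 0; i < page->nr_monitors; i++)
		page->fn[i](page, page->data[i]);
	page->nr_monitors = 0;
}

/* Example monitor: a cache would copy data to the netfs page here;
 * this one only counts wake-ups. */
void sim_count_wakeup(struct sim_page *page, void *data)
{
	(void)page;
	(*(int *)data)++;
}
```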

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Implement the data I/O part of the FS-Cache netfs API. The documentation and
    API header file were added in a previous patch.

    This patch implements the following functions for the netfs to call:

    (*) fscache_attr_changed().

    Indicate that the object has changed its attributes. The only attribute
    currently recorded is the file size. Only pages within the set file size
    will be stored in the cache.

    This operation is submitted for asynchronous processing, and will return
    immediately. It will return -ENOMEM if an out of memory error is
    encountered, -ENOBUFS if the object is not actually cached, or 0 if the
    operation is successfully queued.

    (*) fscache_read_or_alloc_page().
    (*) fscache_read_or_alloc_pages().

    Request data be fetched from the disk, and allocate internal metadata to
    track the netfs pages and reserve disk space for unknown pages.

    These operations perform semi-asynchronous data reads. Upon returning
    they will indicate which pages they think can be retrieved from disk, and
    will have set in progress attempts to retrieve those pages.

    These will return, in order of preference, -ENOMEM on memory allocation
    error, -ERESTARTSYS if a signal interrupted proceedings, -ENODATA if one
    or more requested pages are not yet cached, -ENOBUFS if the object is not
    actually cached or if there isn't space for future pages to be cached on
    this object, or 0 if successful.

    In the case of the multipage function, the pages for which reads are set
    in progress will be removed from the list and the page count decreased
    appropriately.

    If any read operations should fail, the completion function will be given
    an error, and will also be passed contextual information to allow the
    netfs to fall back to querying the server for the absent pages.

    For each successful read, the page completion function will also be
    called.

    Any pages subsequently tracked by the cache will have PG_fscache set upon
    them on return. fscache_uncache_page() must be called for such pages.

    If supplied by the netfs, the mark_pages_cached() cookie op will be
    invoked for any pages now tracked.

    (*) fscache_alloc_page().

    Allocate internal metadata to track a netfs page and reserve disk space.

    This will return -ENOMEM on memory allocation error, -ERESTARTSYS on
    signal, -ENOBUFS if the object isn't cached, or there isn't enough space
    in the cache, or 0 if successful.

    Any pages subsequently tracked by the cache will have PG_fscache set upon
    them on return. fscache_uncache_page() must be called for such pages.

    If supplied by the netfs, the mark_pages_cached() cookie op will be
    invoked for any pages now tracked.

    (*) fscache_write_page().

    Request data be stored to disk. This may only be called on pages that
    have been read or alloc'd by the above three functions and have not yet
    been uncached.

    This will return -ENOMEM on memory allocation error, -ERESTARTSYS on
    signal, -ENOBUFS if the object isn't cached, or there isn't immediately
    enough space in the cache, or 0 if successful.

    On a successful return, this operation will have queued the page for
    asynchronous writing to the cache. The page will be returned with
    PG_fscache_write set until the write completes one way or another. The
    caller will not be notified if the write fails due to an I/O error. If
    that happens, the object will become unavailable and all pending writes will
    be aborted.

    Note that the cache may batch up page writes, and so it may take a while
    to get around to writing them out.

    The caller must assume that until PG_fscache_write is cleared the page is
    in use by the cache. Any changes made to the page may be reflected on disk.
    The page may even be under DMA.

    (*) fscache_uncache_page().

    Indicate that the cache should stop tracking a page previously read or
    alloc'd from the cache. If the page was alloc'd only, but unwritten, it
    will not appear on disk.
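    The documented order of preference for the read functions' return values,
    and the fallback a netfs takes, can be sketched as follows.
    `read_result()` and `netfs_should_read_from_server()` are hypothetical
    helpers, not FS-Cache calls. `ERESTARTSYS` is kernel-internal and absent
    from userspace `<errno.h>`, so the sketch defines it locally with the value
    512 used by the kernel's internal errno header.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512  /* kernel-internal; defined here only for the sketch */
#endif

/* Collapse the conditions seen during a read_or_alloc call into one return
 * value, honouring the order of preference listed above. */
int read_result(bool nomem, bool interrupted, bool some_uncached, bool no_space)
{
	if (nomem)
		return -ENOMEM;
	if (interrupted)
		return -ERESTARTSYS;
	if (some_uncached)
		return -ENODATA;
	if (no_space)
		return -ENOBUFS;
	return 0;
}

/* Hypothetical netfs policy: pages the cache reports as absent (-ENODATA)
 * or uncacheable (-ENOBUFS) are fetched from the server instead. */
bool netfs_should_read_from_server(int ret)
{
	return ret == -ENODATA || ret == -ENOBUFS;
}
```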

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Implement the cookie management part of the FS-Cache netfs client API. The
    documentation and API header file were added in a previous patch.

    This patch implements the following three functions:

    (1) fscache_acquire_cookie().

    Acquire a cookie to represent an object to the netfs. If the object in
    question is a non-index object, then that object and its parent indices
    will be created on disk at this point if they don't already exist. Index
    creation is deferred because an index may reside in multiple caches.

    (2) fscache_relinquish_cookie().

    Retire or release a cookie previously acquired. At this point, the
    object on disk may be destroyed.

    (3) fscache_update_cookie().

    Update the in-cache representation of a cookie. This is used to update
    the auxiliary data for coherency management purposes.

    With this patch it is possible to have a netfs instruct a cache backend to
    look up, validate and create metadata on disk and to destroy it again.
    The ability to actually store and retrieve data in the objects so created is
    added in later patches.

    Note that these functions will never return an error. _All_ errors are
    handled internally to FS-Cache.

    The worst that can happen is that fscache_acquire_cookie() may return a NULL
    pointer - which is considered a negative cookie pointer and can be passed back
    to any function that takes a cookie without harm. A negative cookie pointer
    merely suppresses caching at that level.

    The stub in linux/fscache.h will detect the negative cookie pointer inline and
    abort the operation as fast as possible. This means that the compiler doesn't
    have to set up for a call in that case.
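    The negative-cookie convention boils down to an inline wrapper that tests
    for NULL before making any out-of-line call. The names
    (`struct demo_cookie`, `demo_update_cookie()`,
    `demo_update_cookie_slow()`) are hypothetical and only mirror the pattern
    described above; they are not copied from linux/fscache.h.

```c
#include <stddef.h>

/* Hypothetical cookie; counts how often the slow path ran. */
struct demo_cookie {
	int updates;
};

/* Out-of-line slow path, only reached with a real (positive) cookie. */
void demo_update_cookie_slow(struct demo_cookie *cookie)
{
	cookie->updates++;
}

/* Inline stub: a NULL (negative) cookie suppresses caching, so the call is
 * dropped before any function-call setup is needed. */
static inline void demo_update_cookie(struct demo_cookie *cookie)
{
	if (cookie)
		demo_update_cookie_slow(cookie);
}
```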

    See the documentation in Documentation/filesystems/caching/netfs-api.txt for
    more information.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Add functions to register and unregister a network filesystem or other client
    of the FS-Cache service. This allocates and releases the cookie representing
    the top-level index for a netfs, and makes it available to the netfs.

    If the FS-Cache facility is disabled, then the calls are optimised away at
    compile time.
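    Calls that are optimised away when a facility is configured out are usually
    a pair of definitions selected by the configuration symbol. In this sketch
    `CONFIG_DEMO_FSCACHE`, `struct demo_netfs` and `demo_register_netfs()` are
    hypothetical, and the enabled branch is reduced to setting a flag.

```c
/* Hypothetical netfs descriptor. */
struct demo_netfs {
	const char *name;
	int registered;
};

#ifdef CONFIG_DEMO_FSCACHE
/* Facility built in: do the (here, token) registration work. */
static inline int demo_register_netfs(struct demo_netfs *netfs)
{
	netfs->registered = 1;
	return 0;
}
#else
/* Facility disabled: an empty inline the compiler removes entirely. */
static inline int demo_register_netfs(struct demo_netfs *netfs)
{
	(void)netfs;
	return 0;
}
#endif
```

    Built without `-DCONFIG_DEMO_FSCACHE`, the call reduces to returning 0.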

    Note that whilst this patch may appear to work with FS-Cache enabled and a
    netfs attempting to use it, it will leak the cookie it allocates for the netfs
    as fscache_relinquish_cookie() is implemented in a later patch. This will
    cause the slab code to emit a warning when the module is removed.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Implement two features of FS-Cache:

    (1) The ability to request and release cache tags - names by which a cache may
    be known to a netfs, and thus selected for use.

    (2) An internal function by which a cache is selected by consulting the netfs,
    if the netfs wishes to be consulted.
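    Tag-based selection can be sketched as a lookup in which the netfs, if it
    supplies a hook, gets the first say. All names are hypothetical. The
    sketch assumes that a netfs with no preference gets the first registered
    cache and that a tag not yet backed by any cache yields no cache at all.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cache descriptor. */
struct demo_cache {
	const char *tag;  /* name by which a netfs may select this cache */
};

/* Optional netfs hook: returns the wanted tag, or NULL for no preference. */
typedef const char *(*demo_select_fn)(void *netfs_data);

struct demo_cache *demo_select_cache(struct demo_cache *caches, size_t n,
				     demo_select_fn pick, void *netfs_data)
{
	const char *want;
	size_t i;

	if (n == 0)
		return NULL;
	want = pick ? pick(netfs_data) : NULL;
	if (!want)
		return &caches[0];          /* no preference: first cache */
	for (i = 0; i < n; i++)
		if (strcmp(caches[i].tag, want) == 0)
			return &caches[i];
	return NULL;                        /* tag not backed by a cache yet */
}

/* Example hook: the netfs data is simply the preferred tag string. */
const char *demo_prefer_tag(void *netfs_data)
{
	return (const char *)netfs_data;
}
```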

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Make FS-Cache create its /proc interface and present various statistical
    information through it. Also provide the functions for updating this
    information.

    These features are enabled by:

    CONFIG_FSCACHE_PROC
    CONFIG_FSCACHE_STATS
    CONFIG_FSCACHE_HISTOGRAM

    The /proc directory for FS-Cache is also exported so that caching modules can
    add their own statistics there too.

    The FS-Cache module is loadable at this point, and the statistics files can be
    examined by userspace:

    cat /proc/fs/fscache/stats
    cat /proc/fs/fscache/histogram
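    Statistics of this kind are commonly kept as lock-free counters that hot
    paths bump and the /proc read side samples. The following is a hedged
    userspace sketch using C11 atomics; the counter names and output line are
    hypothetical and do not match the layout of /proc/fs/fscache/stats.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical counters; the real stats file tracks many more events. */
atomic_ulong demo_n_acquires;
atomic_ulong demo_n_relinquishes;

/* Hot-path update: relaxed ordering is enough for a statistic. */
void demo_stat_inc(atomic_ulong *stat)
{
	atomic_fetch_add_explicit(stat, 1, memory_order_relaxed);
}

/* Format the counters into buf, as a /proc show routine might. */
int demo_stats_show(char *buf, size_t len)
{
	return snprintf(buf, len, "Cookies: acq=%lu rel=%lu\n",
			atomic_load(&demo_n_acquires),
			atomic_load(&demo_n_relinquishes));
}
```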

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
    Add the API for a generic facility (FS-Cache) by which caches may declare
    themselves open for business, and may obtain work to be done from network
    filesystems. The header file is included by:

    #include

    Documentation for the API is also added to:

    Documentation/filesystems/caching/backend-api.txt

    This API is not usable without the implementation of the utility functions
    which will be added in further patches.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Add the API for a generic facility (FS-Cache) by which filesystems (such as AFS
    or NFS) may call on local caching capabilities without having to know anything
    about how the cache works, or even if there is a cache:

    +---------+
    |         |                        +--------------+
    |   NFS   |--+                     |              |
    |         |  |                 +-->|   CacheFS    |
    +---------+  |   +----------+  |   |  /dev/hda5   |
                 |   |          |  |   +--------------+
    +---------+  +-->|          |  |
    |         |      |          |--+
    |   AFS   |----->| FS-Cache |
    |         |      |          |--+
    +---------+  +-->|          |  |
                 |   |          |  |   +--------------+
    +---------+  |   +----------+  |   |              |
    |         |  |                 +-->|  CacheFiles  |
    |  ISOFS  |--+                     |  /var/cache  |
    |         |                        +--------------+
    +---------+

    General documentation and documentation of the netfs specific API are provided
    in addition to the header files.

    As this patch stands, it is possible to build a filesystem against the facility
    and attempt to use it. All that will happen is that all requests will be
    immediately denied as if no cache is present.

    Further patches will implement the core of the facility. The facility will
    transfer requests from networking filesystems to appropriate caches if
    possible, or else gracefully deny them.

    If this facility is disabled in the kernel configuration, then all its
    operations will trivially reduce to nothing during compilation.

    WHY NOT I_MAPPING?
    ==================

    I have added my own API to implement caching rather than using i_mapping to do
    this for a number of reasons. These have been discussed a lot on the LKML and
    CacheFS mailing lists, but to summarise the basics:

    (1) Most filesystems don't do hole reportage. Holes in files are treated as
    blocks of zeros and can't be distinguished otherwise, making it difficult
    to distinguish blocks that have been read from the network and cached from
    those that haven't.

    (2) The backing inode must be fully populated before being exposed to
    userspace through the main inode because the VM/VFS goes directly to the
    backing inode and does not interrogate the front inode's VM ops.

    Therefore:

    (a) The backing inode must fit entirely within the cache.

    (b) All backed files currently open must fit entirely within the cache at
    the same time.

    (c) A working set of files in total larger than the cache may not be
    cached.

    (d) A file may not grow larger than the available space in the cache.

    (e) A file that's open and cached, and remotely grows larger than the
    cache is potentially stuffed.

    (3) Writes go to the backing filesystem, and can only be transferred to the
    network when the file is closed.

    (4) There's no record of what changes have been made, so the whole file must
    be written back.

    (5) The pages belong to the backing filesystem, and all metadata associated
    with that page are relevant only to the backing filesystem, and not
    anything stacked atop it.

    OVERVIEW
    ========

    FS-Cache provides (or will provide) the following facilities:

    (1) Caches can be added / removed at any time, even whilst in use.

    (2) Adds a facility by which tags can be used to refer to caches, even if
    they're not available yet.

    (3) More than one cache can be used at once. Caches can be selected
    explicitly by use of tags.

    (4) The netfs is provided with an interface that allows either party to
    withdraw caching facilities from a file (required for (1)).

    (5) A netfs may annotate cache objects that belong to it. This permits the
    storage of coherency maintenance data.

    (6) Cache objects will be pinnable and space reservations will be possible.

    (7) The interface to the netfs returns as few errors as possible, preferring
    rather to let the netfs remain oblivious.

    (8) Cookies are used to represent indices, files and other objects to the
    netfs. The simplest cookie is just a NULL pointer - indicating nothing
    cached there.

    (9) The netfs is allowed to propose - dynamically - any index hierarchy it
    desires, though it must be aware that the index search function is
    recursive, stack space is limited, and indices can only be children of
    indices.

    (10) Indices can be used to group files together to reduce key size and to make
    group invalidation easier. The use of indices may make lookup quicker,
    but that's cache dependent.

    (11) Data I/O is effectively done directly to and from the netfs's pages. The
    netfs indicates that page A is at index B of the data-file represented by
    cookie C, and that it should be read or written. The cache backend may or
    may not start I/O on that page, but if it does, a netfs callback will be
    invoked to indicate completion. The I/O may be either synchronous or
    asynchronous.

    (12) Cookies can be "retired" upon release. At this point FS-Cache will mark
    them as obsolete and the index hierarchy rooted at that point will get
    recycled.

    (13) The netfs provides a "match" function for index searches. In addition to
    saying whether a match was made or not, this can also specify that an
    entry should be updated or deleted.

    FS-Cache maintains a virtual index tree in which all indices, files, objects
    and pages are kept. Bits of this tree may actually reside in one or more
    caches.

                                           FSDEF
                                             |
                              +--------------+---------------+
                              |                              |
                             NFS                            AFS
                              |                              |
                   +----------+-----------+             +----+-----+
                   |                      |             |          |
                homedir                mirror        afs.org  redhat.com
                   |                      |                        |
             +-----+----+          +------+-------+           +----+-----+
             |          |          |              |           |          |
           00001      00002      00007          00125     vol00001   vol00002
             |          |          |              |                      |
         +---+---+   +--+--+     +-+-+     +------+------+         +-----+-----+
         |   |   |   |     |     |   |     |      |      |         |     |     |
        PG0 PG1 PG2 PG0  XATTR  PG0 PG1  DIRENT DIRENT DIRENT     R/W   R/O   Bak
                           |                                       |
                          PG0                                  +---+---+
                                                               |       |
                                                             00001   00003
                                                               |
                                                           +---+---+
                                                           |   |   |
                                                          PG0 PG1 PG2

    In the example above, two netfs's can be seen to be backed: NFS and AFS. These
    have different index hierarchies:

    (*) The NFS primary index will probably contain per-server indices. Each
    server index is indexed by NFS file handles to get data file objects.
    Each data file object can have an array of pages, but may also have
    further child objects, such as extended attributes and directory entries.
    Extended attribute objects themselves have page-array contents.

    (*) The AFS primary index contains per-cell indices. Each cell index contains
    per-logical-volume indices. Each volume index contains up to three
    indices for the read-write, read-only and backup mirrors of those volumes.
    Each of these contains vnode data file objects, each of which contains an
    array of pages.

    The very top index is the FS-Cache master index in which individual netfs's
    have entries.

    Any index object may reside in more than one cache, provided it only has index
    children. Any index with non-index object children will be assumed to only
    reside in one cache.
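    The residency rule above amounts to a check on an index's direct children.
    `struct demo_obj` and `demo_may_span_caches()` are hypothetical names used
    only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_obj_type { DEMO_INDEX, DEMO_DATAFILE };

/* Hypothetical node in the virtual index tree. */
struct demo_obj {
	enum demo_obj_type type;
	struct demo_obj *children;  /* array of child objects */
	size_t nr_children;
};

/* An index may live in several caches only if every child is itself an
 * index; any non-index child pins it to a single cache. */
bool demo_may_span_caches(const struct demo_obj *obj)
{
	size_t i;

	if (obj->type != DEMO_INDEX)
		return false;
	for (i = 0; i < obj->nr_children; i++)
		if (obj->children[i].type != DEMO_INDEX)
			return false;
	return true;
}
```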

    The FS-Cache overview can be found in:

    Documentation/filesystems/caching/fscache.txt

    The netfs API to FS-Cache can be found in:

    Documentation/filesystems/caching/netfs-api.txt

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() has been added to make the checks for both
    PG_private and PG_private_2 at the same time.
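    Checking both flags at once is a two-bit mask test. The bit numbers below
    are hypothetical; the real PG_* values come from the kernel's page-flags
    enumeration.

```c
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum { DEMO_PG_private = 11, DEMO_PG_private_2 = 12 };

#define DEMO_PG_fscache DEMO_PG_private_2  /* the alias this commit describes */

/* True if either private flag is set: the page carries filesystem-private
 * state or is pinning resources in a local cache. */
bool demo_page_has_private(unsigned long flags)
{
	const unsigned long mask = (1UL << DEMO_PG_private) |
				   (1UL << DEMO_PG_private_2);
	return (flags & mask) != 0;
}
```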

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • The attached patch causes read_cache_pages() to release page-private data on a
    page for which add_to_page_cache() fails. If the filler function fails, then
    the problematic page is left attached to the pagecache (with appropriate flags
    set, one presumes) and the remaining to-be-attached pages are invalidated and
    discarded. This permits pages with caching references associated with them to
    be cleaned up.

    The invalidatepage() address space op is called (indirectly) to do the honours.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Document the slow work thread pool.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Make the slow work pool configurable through /proc/sys/kernel/slow-work.

    (*) /proc/sys/kernel/slow-work/min-threads

    The minimum number of threads that should be in the pool as long as it is
    in use. This may be anywhere between 2 and max-threads.

    (*) /proc/sys/kernel/slow-work/max-threads

    The maximum number of threads that should be in the pool. This may be
    anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater.

    (*) /proc/sys/kernel/slow-work/vslow-percentage

    The percentage of active threads in the pool that may be used to execute
    very slow work items. This may be between 1 and 99. The resultant number
    is bounded to between 1 and one fewer than the number of active threads.
    This ensures there is always at least one thread that can process very
    slow work items, and always at least one thread that won't.
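    The vslow-percentage clamping can be worked through in a small function.
    The name is hypothetical and rounding down by integer division is an
    assumption; the bounds follow the description above.

```c
/* Hypothetical: threads allowed to run very slow work items, given the
 * number of active threads and vslow-percentage (1..99). */
unsigned demo_vslow_threads(unsigned nr_running, unsigned vslow_pct)
{
	unsigned n = nr_running * vslow_pct / 100;

	if (n < 1)
		n = 1;                  /* at least one thread can do vslow work */
	if (nr_running >= 2 && n > nr_running - 1)
		n = nr_running - 1;     /* and at least one thread never does */
	return n;
}
```

    For example, 10 active threads at 50% allow 5 very slow items at once,
    while a pool of 2 threads always allows exactly 1.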

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Create a dynamically sized pool of threads for doing very slow work items, such
    as invoking mkdir() or rmdir() - things that may take a long time and may
    sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable
    for workqueues.

    The number of threads is always at least a settable minimum, but more are
    started when there's more work to do, up to a limit. Because of the nature of
    the load, it's not suitable for a 1-thread-per-CPU type pool. A system with
    one CPU may well want several threads.

    This is used by FS-Cache to do slow caching operations in the background, such
    as looking up, creating or deleting cache objects.
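    A hedged sketch of the sizing rule: start a thread when work is queued, no
    thread is idle and the pool is below its maximum, and let idle threads exit
    only while the pool stays above its minimum. The structure and function
    names are hypothetical, and details such as idle timeouts are not covered
    by the text.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the pool's state. */
struct demo_pool {
	unsigned nr_threads;   /* threads currently in the pool */
	unsigned nr_idle;      /* threads waiting for work */
	unsigned min_threads;  /* settable floor */
	unsigned max_threads;  /* upper limit */
};

/* Grow only if work is waiting, nobody is free to take it, and there is room. */
bool demo_should_spawn(const struct demo_pool *p, unsigned queued)
{
	return queued > 0 && p->nr_idle == 0 && p->nr_threads < p->max_threads;
}

/* An idle thread may exit only while the pool stays above its floor. */
bool demo_may_retire(const struct demo_pool *p)
{
	return p->nr_idle > 0 && p->nr_threads > p->min_threads;
}
```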

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • Pass the original flags to rwlock arch-code, so that it can re-enable
    interrupts if implemented for that architecture.

    Initially, make __raw_read_lock_flags and __raw_write_lock_flags stubs
    which just do the same thing as non-flags variants.

    Signed-off-by: Petr Tesarik
    Signed-off-by: Robin Holt
    Acked-by: Peter Zijlstra
    Cc:
    Acked-by: Ingo Molnar
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • SGI has observed that on large systems, interrupts are not serviced for a
    long period of time when waiting for a rwlock. The following patch series
    re-enables irqs while waiting for the lock, resembling the code which is
    already there for spinlocks.

    I only made the ia64 version, because the patch adds some overhead to the
    fast path. I assume there is currently no demand to have this for other
    architectures, because the systems are not so large. Of course, the
    possibility to implement raw_{read|write}_lock_flags for any architecture
    is still there.

    This patch:

    The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation
    depending on the config options, so that IRQs are re-enabled when
    possible, but they remain disabled if CONFIG_LOCKDEP is set.

    Signed-off-by: Petr Tesarik
    Signed-off-by: Robin Holt
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • This patch adds preadv and pwritev system calls. These syscalls are a
    pretty straightforward combination of pread and readv (same for write).
    They are quite useful for doing vectored I/O in threaded applications.
    Using lseek+readv instead opens race windows you'll have to plug with
    locking.

    Other systems have such system calls too, for example NetBSD; see
    http://www.daemon-systems.org/man/preadv.2.html

    The application-visible interface provided by glibc should look like
    this to be compatible with the existing implementations in the *BSD family:

    ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

    This prototype has one problem, though: on 32-bit archs the (64-bit)
    offset argument is unaligned, which the syscall ABI of several archs
    doesn't allow. At least s390 needs a wrapper in glibc to handle this. As
    we'll need wrappers in glibc anyway, I've decided to push the problem to
    glibc entirely and use a syscall prototype that works without
    arch-specific wrappers inside the kernel: the offset argument is
    explicitly split into two 32-bit values.

    The patch contains the actual system call implementation and the wiring
    up in the x86 system call tables. Other archs follow as separate patches.

    Signed-off-by: Gerd Hoffmann
    Cc: Arnd Bergmann
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerd Hoffmann
     
  • It would be nice to be able to extract the dmesg log from a vmcore file
    without needing to keep the debug symbols for the running kernel handy all
    the time. We have a facility to do this in /proc/vmcore. This patch adds
    the log_buf and log_end symbols to the vmcoreinfo area so that tools (like
    makedumpfile) can easily extract the dmesg logs from a vmcore image.

    [akpm@linux-foundation.org: several fixes and cleanups]
    [akpm@linux-foundation.org: fix unused log_buf_kexec_setup()]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Neil Horman
    Cc: Simon Horman
    Acked-by: Vivek Goyal
    Cc: Neil Horman
    Cc: Simon Horman
    Cc: Vivek Goyal
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Add the PCI Device ID of the PCI Bridge Controller on AMD8111 chip.

    Signed-off-by: Harry Ciao
    Cc: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harry Ciao
     
    We are wasting 2 words in signal_struct for no good reason, just to
    implement task_pgrp_nr() and task_session_nr().

    task_session_nr() has no callers since commit
    2e2ba22ea4fd4bb85f0fa37c521066db6775cbef, so we can remove it.

    task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
    fs/coda.

    This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
    __pgrp/__session and the related helpers.

    The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
    anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox [tty parts]
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    IMHO, the safety rules for vnr/nr_ns helpers are horrible and buggy.

    task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.

    As for "special" pids, vnr/nr_ns helpers always need rcu. However, if
    task != current, they are unsafe even under rcu lock: we can't trust
    task->group_leader without the special checks.

    And almost every helper has a callsite which needs a fix.

    Also, it is a bit annoying that the implementations of, say,
    task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".

    This patch introduces the new helper, __task_pid_nr_ns(), which is always
    safe to use, and turns all other helpers into the trivial wrappers.

    After this I'll send another patch that converts task_tgid_xxx() as well;
    those are a bit special.

    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Even if task == current, it is not safe to dereference the result of
    task_pgrp/task_session. We can race with another thread which changes the
    special pid via setpgid/setsid.

    Document this. The next 2 patches give examples of the unsafe usage, and
    there are more bad users.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add support for x8 asynchronous sample rate and ability to specify base
    clock frequency.

    Signed-off-by: Paul Fulghum
    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Fulghum
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • cpuhotplug_mutex_lock() is not used, remove it.

    Signed-off-by: Lai Jiangshan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
    Now that task_detached() is exported, change tracehook_notify_death() to
    use this helper; nobody else checks ->exit_signal == -1 by hand.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: "Metzger, Markus T"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Following a discussion with Roland:

    - Rename ptrace_exit() to exit_ptrace(), and change it to do all the
    necessary work with the ->ptraced list on its own.

    - Move this code from exit.c to ptrace.c

    - Update the comment in ptrace_detach() to explain the rechecking of
    child->ptrace.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: "Metzger, Markus T"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it
    has already passed exit_notify() we can leak a zombie, because a) ptracing
    disables the auto-reaping logic, and b) ->real_parent was not notified
    about the child's death.

    ptrace_detach() should follow ptrace_exit()'s logic; change the code
    accordingly.

    Signed-off-by: Oleg Nesterov
    Cc: Jerome Marchand
    Cc: Roland McGrath
    Tested-by: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov