08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code object
    manipulation of the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there would be code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Mar, 2007

3 commits

  • shmem_truncate_range has its own truncate_inode_pages_range, to free any pages
    racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when
    this might have happened. But holepunching gets no chance to clear that flag
    at the start of vmtruncate_range, so it's always set (unless a truncate came
    just before), so holepunch almost always does this second
    truncate_inode_pages_range.

    shmem holepunch has unlikely swapfile races hereabouts whatever we do
    (without a fuller rework than is fit for this release): I was going to skip
    the second truncate in the punch_hole case, but Miklos points out that would
    make holepunch correctness more vulnerable to swapoff. So keep the second
    truncate, but follow it by an unmap_mapping_range to eliminate the
    disconnected pages (freed from pagecache while still mapped in userspace) that
    it might have left behind.

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miklos Szeredi observes that during truncation of shmem page directories,
    info->lock is released to improve latency (after lowering i_size and
    next_index to exclude races); but this is quite wrong for holepunching, which
    receives no such protection from i_size or next_index, and is left vulnerable
    to races with shmem_unuse, shmem_getpage and shmem_writepage.

    Hold info->lock throughout when holepunching? No, any user could prevent
    rescheduling for far too long. Instead take info->lock just when needed: in
    shmem_free_swp when removing the swap entries, and whenever removing a
    directory page from the level above. But so long as we remove before
    scanning, we can safely skip taking the lock at the lower levels, except at
    misaligned start and end of the hole.

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare
    circumstances, because shmem_truncate_range() erroneously removes partially
    truncated directory pages at the end of the range: later reclaim on pages
    pointing to these removed directories triggers the BUG. Indeed, and it can
    also cause data loss beyond the hole.

    Fix this as in the patch proposed by Miklos, but distinguish between "limit"
    (how far we need to search: ignore truncation's next_index optimization in the
    holepunch case - if there are races it's more consistent to act on the whole
    range specified) and "upper_limit" (how far we can free directory pages:
    generally we must be careful to keep partially punched pages, but can relax at
    end of file - i_size being held stable by i_mutex).

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Mar, 2007

1 commit


02 Mar, 2007

1 commit


13 Feb, 2007

1 commit

  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

12 Feb, 2007

1 commit

  • shmem backed file does not have page writeback, nor it participates in
    backing device's dirty or writeback accounting. So using generic
    __set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
    overkill. It unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (sevearl GBs), the
    unmapping operation becomes painfully long. Because at unmap, kernel
    transfers dirty bit in PTE into page struct and to the radix tree tag. The
    operation of tagging the radix tree is particularly expensive because it
    has to traverse the tree from the root to the leaf node on every dirty
    page. What's bothering is that radix tree tag is used for page write back.
    However, shmem is memory backed and there is no page write back for such
    file system. And in the end, we spend all that time tagging radix tree and
    none of that fancy tagging will be used. So let's simplify it by introduce
    a new aops __set_page_dirty_no_writeback and this will speed up shm unmap.

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

23 Dec, 2006

1 commit

  • Ran into BUG() while doing madvise(REMOVE) testing. If we are punching a
    hole into shared memory segment using madvise(REMOVE) and the entire hole
    is below the indirect blocks, we hit following assert.

    BUG_ON(limit
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

09 Dec, 2006

1 commit


08 Dec, 2006

3 commits


21 Oct, 2006

1 commit

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

17 Oct, 2006

1 commit

  • We need to encode a decode the 'file' part of a handle. We simply use the
    inode number and generation number to construct the filehandle.

    The generation number is the time when the file was created. As inode numbers
    cycle through the full 32 bits before being reused, there is no real chance of
    the same inum being allocated to different files in the same second so this is
    suitably unique. Using time-of-day rather than e.g. jiffies makes it less
    likely that the same filehandle can be created after a reboot.

    In order to be able to decode a filehandle we need to be able to lookup by
    inum, which means that the inode needs to be added to the inode hash table
    (tmpfs doesn't currently hash inodes as there is never a need to lookup by
    inum). To avoid overhead when not exporting, we only hash an inode when it is
    first exported. This requires a lock to ensure it isn't hashed twice.

    This code is separate from the patch posted in June06 from Atal Shargorodsky
    which provided the same functionality, but does borrow slightly from it.

    Locking comment: Most filesystems that hash their inodes do so at the point
    where the 'struct inode' is initialised, and that has suitable locking
    (I_NEW). Here in shmem, we are hashing the inode later, the first time we
    need an NFS file handle for it. We no longer have I_NEW to ensure only one
    thread tries to add it to the hash table.

    Cc: Atal Shargorodsky
    Cc: Gilad Ben-Yossef
    Signed-off-by: David M. Grimes
    Signed-off-by: Neil Brown
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David M. Grimes
     

01 Oct, 2006

2 commits

  • This is mostly included for parity with dec_nlink(), where we will have some
    more hooks. This one should stay pretty darn straightforward for now.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a filesystem decrements i_nlink to zero, it means that a write must be
    performed in order to drop the inode from the filesystem.

    We're shortly going to have keep filesystems from being remounted r/o between
    the time that this i_nlink decrement and that write occurs.

    So, add a little helper function to do the decrements. We'll tie into it in a
    bit to note when i_nlink hits zero.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

30 Sep, 2006

1 commit


27 Sep, 2006

2 commits

  • This eliminates the i_blksize field from struct inode. Filesystems that want
    to provide a per-inode st_blksize can do so by providing their own getattr
    routine instead of using the generic_fillattr() function.

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     
  • * Rougly half of callers already do it by not checking return value
    * Code in drivers/acpi/osl.c does the following to be sure:

    (void)kmem_cache_destroy(cache);

    * Those who check it printk something, however, slab_error already printed
    the name of failed cache.
    * XFS BUGs on failed kmem_cache_destroy which is not the decision
    low-level filesystem driver should make. Converted to ignore.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Sep, 2006

1 commit


01 Jul, 2006

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare, we may only loose one
    increment or so and there is no requirement (at least not in kernel) that
    the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows a
    building of linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single instruction increment instruction being emitted
    (i386, x86_64) or in the increment being hidden though instruction
    concurrency (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

30 Jun, 2006

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6: (22 commits)
    [PATCH] devfs: Remove it from the feature_removal.txt file
    [PATCH] devfs: Last little devfs cleanups throughout the kernel tree.
    [PATCH] devfs: Rename TTY_DRIVER_NO_DEVFS to TTY_DRIVER_DYNAMIC_DEV
    [PATCH] devfs: Remove the tty_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the line_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the videodevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the miscdevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree
    [PATCH] devfs: Remove devfs_remove() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_cdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_symlink() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree
    [PATCH] devfs: Remove devfs_*_tape() functions from the kernel tree
    [PATCH] devfs: Remove devfs support from the sound subsystem
    [PATCH] devfs: Remove devfs support from the ide subsystem.
    [PATCH] devfs: Remove devfs support from the serial subsystem
    [PATCH] devfs: Remove devfs from the init code
    [PATCH] devfs: Remove devfs from the partition code
    ...

    Linus Torvalds
     

29 Jun, 2006

1 commit


27 Jun, 2006

2 commits


25 Jun, 2006

1 commit


23 Jun, 2006

3 commits

  • Remove two unnecessary PageSwapCache checks. The page refcount is raised
    and therefore page migration cannot occur in both functions.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience function is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing are currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

20 Jun, 2006

1 commit


13 Jun, 2006

2 commits

  • shmem_rmdir() must undo the increment of i_nlink done in
    shmem_get_inode() for directories, otherwise at least
    IN_DELETE_SELF inotify event generation is broken.

    Signed-off-by: Sergey Vlasov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Sergey Vlasov
     
  • I noticed a strange behavior in a tmpfs file system the other day, while
    building packages - occasionally, and seemingly at random, make decided to
    rebuild a target. However, only on tmpfs.

    A file would be created, and if checked, it had a sub-second timestamp.
    However, after an utimes related call where sub-seconds should be set, they
    were zeroed instead. In the case that a file was created, and utimes(...,NULL)
    was used on it in the same second, the timestamp on the file moved backwards.

    After some digging, I found that this was being caused by tmpfs not having a
    time granularity set, thus inheriting the default 1 second granularity.

    Hugh adds: yes, we missed tmpfs when the s_time_gran mods went into 2.6.11.
    Unfortunately, the granularity of CURRENT_TIME, often used in filesystems,
    does not match the default granularity set by alloc_super. A few more such
    discrepancies have been found, but this is the most important to fix now.

    Signed-off-by: Robin H. Johnson
    Acked-by: Andi Kleen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Robin H. Johnson
     

09 Jun, 2006

1 commit


23 Apr, 2006

1 commit

  • Basic problem: pages of a shared memory segment can only be migrated once.

    In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a
    migratepage address space op. Therefore, migrate_pages() falls back to
    default processing. In this path, it will try to pageout() dirty pages.
    Once a shared memory page has been migrated it becomes dirty, so
    migrate_pages() will try to page it out. However, because the page count
    is 3 [cache + current + pte], pageout() will return PAGE_KEEP because
    is_page_cache_freeable() returns false. This will abort all subsequent
    migrations.

    This patch adds a migratepage address space op to shared memory segments to
    avoid taking the default path. We use the "migrate_page()" function
    because it knows how to migrate dirty pages. This allows shared memory
    segment pages to migrate, subject to other conditions such as # pte's
    referencing the page [page_mapcount(page)], when requested.

    I think this is safe. If we're migrating a shared memory page, then we
    found the page via a page table, so it must be in memory.

    Can be verified with memtoy and the shmem-mbind-test script, both
    available at: http://free.linux.hp.com/~lts/Tools/

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

22 Mar, 2006

2 commits

  • shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings:
    shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and
    only called from that one site, so mark it inline like its non-NUMA stub.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have struct kmem_cache now so use it instead of the old typedef.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

22 Feb, 2006

1 commit

  • I've been dissatisfied with the mpol_nodelist mount option which was
    added to tmpfs earlier in -rc. Replace it by mpol=policy:nodelist.

    And it was broken: a nodelist is a comma-separated list of numbers and
    ranges; the mount options are a comma-separated list of token=values.
    Whoops, blindly strsep'ing on commas doesn't work so well: since we've
    no numeric tokens, and unlikely to add them, use that to distinguish.

    Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject
    all its options as invalid if not NUMA. /proc shows MPOL_PREFERRED
    as "prefer", so use that name for the policy instead of "preferred".

    Enforce that mpol=default has no nodelist; that mpol=prefer has one
    node only; that mpol=bind has a nodelist; but let mpol=interleave use
    node_online_map if no nodelist given. Describe this in tmpfs.txt.

    Signed-off-by: Hugh Dickins
    Acked-by: Robin Holt
    Acked-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Hugh Dickins