21 Oct, 2006

1 commit

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
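
    A minimal sketch of the layering described above. The my_* names are
    placeholders, not the exact symbols this commit introduces; the point is
    only that the block-layer helper becomes a thin wrapper over a function
    that works on struct backing_dev_info, which is all that NFS and
    CONFIG_BLOCK=n builds need.

        #include <linux/backing-dev.h>
        #include <linux/blkdev.h>

        /* Core concept: congestion is a property of the backing_dev_info. */
        static inline int my_bdi_write_congested(struct backing_dev_info *bdi)
        {
            return bdi->state & (1 << BDI_write_congested);
        }

        /* The block layer keeps its old entry point as a thin wrapper that
         * maps the request queue to its bdi. */
        static inline int my_blk_write_congested(request_queue_t *q)
        {
            return my_bdi_write_congested(&q->backing_dev_info);
        }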
     

17 Oct, 2006

1 commit

  • We need to encode and decode the 'file' part of a handle. We simply use
    the inode number and generation number to construct the filehandle.

    The generation number is the time when the file was created. As inode numbers
    cycle through the full 32 bits before being reused, there is no real chance of
    the same inum being allocated to different files in the same second so this is
    suitably unique. Using time-of-day rather than e.g. jiffies makes it less
    likely that the same filehandle can be created after a reboot.

    In order to be able to decode a filehandle we need to be able to lookup by
    inum, which means that the inode needs to be added to the inode hash table
    (tmpfs doesn't currently hash inodes as there is never a need to lookup by
    inum). To avoid overhead when not exporting, we only hash an inode when it is
    first exported. This requires a lock to ensure it isn't hashed twice.

    This code is separate from the patch posted in June06 from Atal Shargorodsky
    which provided the same functionality, but does borrow slightly from it.

    Locking comment: Most filesystems that hash their inodes do so at the point
    where the 'struct inode' is initialised, and that has suitable locking
    (I_NEW). Here in shmem, we are hashing the inode later, the first time we
    need an NFS file handle for it. We no longer have I_NEW to ensure only one
    thread tries to add it to the hash table.

    Cc: Atal Shargorodsky
    Cc: Gilad Ben-Yossef
    Signed-off-by: David M. Grimes
    Signed-off-by: Neil Brown
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David M. Grimes
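
    For reference, a sketch of the encode side and the lazy hashing described
    above, reconstructed from the description (close to, but not necessarily
    identical with, the code that was merged):

        static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
                                   int connectable)
        {
            struct inode *inode = dentry->d_inode;

            if (*len < 3)
                return 255;

            if (hlist_unhashed(&inode->i_hash)) {
                /* Hash the inode the first time a filehandle is needed.
                 * A private lock keeps two threads from hashing it twice,
                 * since I_NEW is no longer available to serialise this. */
                static DEFINE_SPINLOCK(lock);
                spin_lock(&lock);
                if (hlist_unhashed(&inode->i_hash))
                    __insert_inode_hash(inode,
                            inode->i_ino + inode->i_generation);
                spin_unlock(&lock);
            }

            /* generation (creation time) + inode number identify the file */
            fh[0] = inode->i_generation;
            fh[1] = inode->i_ino;
            fh[2] = ((__u64)inode->i_ino) >> 32;

            *len = 3;
            return 1;
        }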
     

01 Oct, 2006

2 commits

  • This is mostly included for parity with dec_nlink(), where we will have some
    more hooks. This one should stay pretty darn straightforward for now.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a filesystem decrements i_nlink to zero, it means that a write must be
    performed in order to drop the inode from the filesystem.

    We're shortly going to have to keep filesystems from being remounted r/o
    between the time of this i_nlink decrement and the time that write occurs.

    So, add a little helper function to do the decrements. We'll tie into it in a
    bit to note when i_nlink hits zero.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
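
    A minimal sketch of what such helpers look like. The commit text calls the
    decrement helper dec_nlink(); in the tree the pair ended up in
    include/linux/fs.h as inc_nlink()/drop_nlink(), essentially:

        static inline void inc_nlink(struct inode *inode)
        {
            inode->i_nlink++;
        }

        static inline void drop_nlink(struct inode *inode)
        {
            /* the hook point: once i_nlink hits zero a write is pending,
             * so the filesystem must not be remounted read-only */
            inode->i_nlink--;
        }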
     

27 Sep, 2006

2 commits

  • This eliminates the i_blksize field from struct inode. Filesystems that want
    to provide a per-inode st_blksize can do so by providing their own getattr
    routine instead of using the generic_fillattr() function.

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
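
    A sketch of the pattern suggested above for filesystems that still want a
    per-inode st_blksize now that i_blksize is gone; foo_getattr and
    FOO_BLKSIZE are placeholders:

        #define FOO_BLKSIZE 4096        /* placeholder per-fs value */

        static int foo_getattr(struct vfsmount *mnt, struct dentry *dentry,
                               struct kstat *stat)
        {
            struct inode *inode = dentry->d_inode;

            generic_fillattr(inode, stat);
            stat->blksize = FOO_BLKSIZE;    /* override the generic value */
            return 0;
        }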
     
  • * Roughly half of the callers already treat kmem_cache_destroy() as if it
    returned void, by not checking the return value.
    * Code in drivers/acpi/osl.c does the following to be sure:

    (void)kmem_cache_destroy(cache);

    * Those who do check it printk something; however, slab_error has already
    printed the name of the failed cache.
    * XFS BUGs on a failed kmem_cache_destroy, which is not a decision a
    low-level filesystem driver should make. Converted to ignore.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Jul, 2006

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per cpu variables. In order to avoid the most
    severe races we disable preempt. Preempt does not prevent the race between
    an increment and an interrupt handler incrementing the same statistics
    counter. However, that race is exceedingly rare; we may only lose an
    increment or so, and there is no requirement (at least not in the kernel)
    that the vm event counters be accurate.

    In the non-preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64,
    and therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386,
    x86_64) or in the increment being hidden through instruction concurrency
    (EPIC architectures such as ia64 can get that done).

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
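
    An illustrative sketch of the counting scheme described above, using the
    generic per-CPU primitives; the names are simplified stand-ins, not the
    exact vmstat implementation:

        #include <linux/percpu.h>
        #include <linux/preempt.h>

        static DEFINE_PER_CPU(unsigned long, demo_event);

        static inline void count_demo_event(void)
        {
            preempt_disable();              /* avoid the worst races */
            __get_cpu_var(demo_event)++;    /* a single inc on i386/x86_64 */
            preempt_enable();
        }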
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

30 Jun, 2006

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6: (22 commits)
    [PATCH] devfs: Remove it from the feature_removal.txt file
    [PATCH] devfs: Last little devfs cleanups throughout the kernel tree.
    [PATCH] devfs: Rename TTY_DRIVER_NO_DEVFS to TTY_DRIVER_DYNAMIC_DEV
    [PATCH] devfs: Remove the tty_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the line_driver devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the videodevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the miscdevice devfs_name field as it's no longer needed
    [PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree
    [PATCH] devfs: Remove devfs_remove() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_cdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_symlink() function from the kernel tree
    [PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree
    [PATCH] devfs: Remove devfs_*_tape() functions from the kernel tree
    [PATCH] devfs: Remove devfs support from the sound subsystem
    [PATCH] devfs: Remove devfs support from the ide subsystem.
    [PATCH] devfs: Remove devfs support from the serial subsystem
    [PATCH] devfs: Remove devfs from the init code
    [PATCH] devfs: Remove devfs from the partition code
    ...

    Linus Torvalds
     

23 Jun, 2006

3 commits

  • Remove two unnecessary PageSwapCache checks. In both functions the page
    refcount is raised, and therefore page migration cannot occur.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
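
    The shape of the change, with foo_statfs and FOO_SUPER_MAGIC as
    placeholders:

        /* old:  int (*statfs)(struct super_block *sb, struct kstatfs *buf); */
        /* new:  int (*statfs)(struct dentry *dentry, struct kstatfs *buf); */

        #define FOO_SUPER_MAGIC 0x00f00f00      /* placeholder */

        static int foo_statfs(struct dentry *dentry, struct kstatfs *buf)
        {
            struct super_block *sb = dentry->d_sb;  /* sb is still reachable */

            buf->f_type = FOO_SUPER_MAGIC;
            buf->f_bsize = sb->s_blocksize;
            buf->f_namelen = NAME_MAX;
            return 0;
        }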
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees, which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional change
    of ext2_* into foo_* in the examples.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
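
    For most filesystems the conversion reduces to something like the sketch
    below (foo_* names are placeholders); filesystems that share a superblock
    across mounts skip the helper and set mnt->mnt_sb and mnt->mnt_root
    themselves:

        static int foo_fill_super(struct super_block *sb, void *data,
                                  int silent);  /* placeholder, defined elsewhere */

        static int foo_get_sb(struct file_system_type *fs_type, int flags,
                              const char *dev_name, void *data,
                              struct vfsmount *mnt)
        {
            /* the convenience helper now takes the vfsmount, returns an int,
             * and calls simple_set_mnt() on success */
            return get_sb_nodev(fs_type, flags, data, foo_fill_super, mnt);
        }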
     

13 Jun, 2006

2 commits

  • shmem_rmdir() must undo the increment of i_nlink done in
    shmem_get_inode() for directories, otherwise at least
    IN_DELETE_SELF inotify event generation is broken.

    Signed-off-by: Sergey Vlasov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Sergey Vlasov
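
    The fix boils down to one extra decrement in shmem_rmdir(), roughly (a
    sketch reconstructed from the description above, not the verbatim patch):

        static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
        {
            if (!simple_empty(dentry))
                return -ENOTEMPTY;

            dentry->d_inode->i_nlink--; /* undo shmem_get_inode()'s bump so
                                         * IN_DELETE_SELF can be generated */
            dir->i_nlink--;
            return shmem_unlink(dir, dentry);
        }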
     
  • I noticed a strange behavior in a tmpfs file system the other day, while
    building packages - occasionally, and seemingly at random, make decided to
    rebuild a target. However, only on tmpfs.

    A file would be created, and if checked, it had a sub-second timestamp.
    However, after an utimes related call where sub-seconds should be set, they
    were zeroed instead. In the case that a file was created, and utimes(...,NULL)
    was used on it in the same second, the timestamp on the file moved backwards.

    After some digging, I found that this was being caused by tmpfs not having a
    time granularity set, thus inheriting the default 1 second granularity.

    Hugh adds: yes, we missed tmpfs when the s_time_gran mods went into 2.6.11.
    Unfortunately, the granularity of CURRENT_TIME, often used in filesystems,
    does not match the default granularity set by alloc_super. A few more such
    discrepancies have been found, but this is the most important to fix now.

    Signed-off-by: Robin H. Johnson
    Acked-by: Andi Kleen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Robin H. Johnson
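
    The fix itself is essentially a one-liner in tmpfs's fill_super path,
    along these lines (a sketch, not the verbatim patch):

        /* in shmem_fill_super(), alongside the other superblock fields: */
        sb->s_magic = TMPFS_MAGIC;
        sb->s_time_gran = 1;    /* nanoseconds, matching CURRENT_TIME, instead
                                 * of the 1-second default from alloc_super() */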
     

23 Apr, 2006

1 commit

  • Basic problem: pages of a shared memory segment can only be migrated once.

    In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a
    migratepage address space op. Therefore, migrate_pages() falls back to
    default processing. In this path, it will try to pageout() dirty pages.
    Once a shared memory page has been migrated it becomes dirty, so
    migrate_pages() will try to page it out. However, because the page count
    is 3 [cache + current + pte], pageout() will return PAGE_KEEP because
    is_page_cache_freeable() returns false. This will abort all subsequent
    migrations.

    This patch adds a migratepage address space op to shared memory segments to
    avoid taking the default path. We use the "migrate_page()" function
    because it knows how to migrate dirty pages. This allows shared memory
    segment pages to migrate, subject to other conditions such as # pte's
    referencing the page [page_mapcount(page)], when requested.

    I think this is safe. If we're migrating a shared memory page, then we
    found the page via a page table, so it must be in memory.

    Can be verified with memtoy and the shmem-mbind-test script, both
    available at: http://free.linux.hp.com/~lts/Tools/

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
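
    The mechanical part of the change is small: wire the generic helper into
    tmpfs's address_space_operations. An abridged sketch (other entries and
    any config guards elided):

        static struct address_space_operations shmem_aops = {
            .writepage      = shmem_writepage,
            .set_page_dirty = __set_page_dirty_nobuffers,
            .migratepage    = migrate_page, /* knows how to move dirty pages */
        };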
     

22 Mar, 2006

2 commits

  • shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings:
    shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and
    only called from that one site, so mark it inline like its non-NUMA stub.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have struct kmem_cache now so use it instead of the old typedef.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
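
    In practice the conversion is a one-line substitution per declaration; for
    shmem.c's inode cache that is essentially:

        /* before */ static kmem_cache_t *shmem_inode_cachep;
        /* after  */ static struct kmem_cache *shmem_inode_cachep;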
     

22 Feb, 2006

1 commit

  • I've been dissatisfied with the mpol_nodelist mount option which was
    added to tmpfs earlier in -rc. Replace it by mpol=policy:nodelist.

    And it was broken: a nodelist is a comma-separated list of numbers and
    ranges; the mount options are a comma-separated list of token=values.
    Whoops, blindly strsep'ing on commas doesn't work so well: since we have
    no numeric tokens, and are unlikely to add them, use that to distinguish.

    Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject
    all its options as invalid if not NUMA. /proc shows MPOL_PREFERRED
    as "prefer", so use that name for the policy instead of "preferred".

    Enforce that mpol=default has no nodelist; that mpol=prefer has one
    node only; that mpol=bind has a nodelist; but let mpol=interleave use
    node_online_map if no nodelist given. Describe this in tmpfs.txt.

    Signed-off-by: Hugh Dickins
    Acked-by: Robin Holt
    Acked-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Hugh Dickins
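
    Seen from userspace, the resulting syntax looks like this; the mount
    point, size and nodelist are arbitrary examples, and a NUMA-enabled kernel
    plus CAP_SYS_ADMIN are assumed:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* interleave tmpfs pages across nodes 0-3 */
            if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0,
                      "size=64m,mpol=interleave:0-3") != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }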
     

02 Feb, 2006

1 commit

  • Check for PageSwapCache after looking up and locking a swap page.

    The page migration code may change a swap pte to point to a different page
    under lock_page().

    If that happens then the vm must retry the lookup operation in the swap space
    to find the correct page number. There are a couple of locations in the VM
    where a lock_page() is done on a swap page. In these locations we need to
    check afterwards if the page was migrated. If the page was migrated then the
    old page that was looked up before was freed and no longer has the
    PageSwapCache bit set.

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
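
    The pattern being added looks roughly like this; the helper is
    hypothetical and simplified, not the exact shmem/do_swap_page code:

        #include <linux/pagemap.h>
        #include <linux/swap.h>

        static struct page *lookup_and_lock_swap_page(swp_entry_t entry)
        {
            struct page *page;

        repeat:
            page = lookup_swap_cache(entry);
            if (!page)
                return NULL;
            lock_page(page);
            if (!PageSwapCache(page)) {
                /* migration changed the swap pte while we waited for the
                 * lock: this page was freed from the swap cache, so drop
                 * it and look the entry up again */
                unlock_page(page);
                page_cache_release(page);
                goto repeat;
            }
            return page;    /* locked, referenced, still in the swap cache */
        }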
     

15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Jan, 2006

2 commits

  • The attached patch makes the SYSV IPC shared memory facilities use the new
    ramfs facilities on a no-MMU kernel.

    The following changes are made:

    (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to
    allow the IPC SHM facilities to commune with the tiny-shmem and shmem
    code.

    (2) ramfs files now need resizing using do_truncate() rather than by modifying
    the inode size directly (see shmem_file_setup()). This causes ramfs to
    attempt to bind a block of pages of sufficient size to the inode.

    (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
    given range of pages & its associated backing store. The current
    implementation supports only shmfs/tmpfs; other filesystems return
    -ENOSYS.

    "Some app allocates large tmpfs files, then when some task quits and some
    client disconnect, some memory can be released. However the only way to
    release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli

    Databases want to use this feature to drop a section of their bufferpool
    (shared memory segments) - without writing back to disk/swap space.

    This feature is also useful for supporting hot-plug memory on UML.

    Concerns raised by Andrew Morton:

    - "We have no plan for holepunching! If we _do_ have such a plan (or
    might in the future) then what would the API look like? I think
    sys_holepunch(fd, start, len), so we should start out with that."

    - Using madvise is very weird, because people will ask "why do I need to
    mmap my file before I can stick a hole in it?"

    - None of the other madvise operations call into the filesystem in this
    manner. A broad question is: is this capability an MM operation or a
    filesystem operation? truncate, for example, is a filesystem operation
    which sometimes has MM side-effects. madvise is an mm operation and, with
    this patch, it gains FS side-effects, only they're really, really
    significant ones.

    Comments:

    - Andrea suggested the fs operation too, but then it's more efficient to
    have it as an mm operation with fs side effects, because userspace doesn't
    immediately know the fd and physical offset of the range. It's possible to
    fix that up in userland and use the fs operation, but it's more expensive;
    the vmas are already in the kernel and we can use them.

    Short term plan & Future Direction:

    - We seem to need this interface only for shmfs/tmpfs files in the short
    term. We have to add hooks into the filesystem for correctness and
    completeness. This is what this patch does.

    - In the future, the plan is to support both fs and mmap APIs. This
    also requires (other) filesystem-specific functions to be implemented.

    - The current patch doesn't support VM_NONLINEAR, which can be addressed
    in the future.

    Signed-off-by: Badari Pulavarty
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
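
    A userspace sketch of the new interface; the file path and sizes are
    arbitrary, and MADV_REMOVE is defined locally in case the libc headers
    predate it:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MADV_REMOVE
        #define MADV_REMOVE 9   /* value used by the patch */
        #endif

        int main(void)
        {
            size_t len = 16 * 1024 * 1024;
            int fd = open("/dev/shm/bigfile", O_RDWR | O_CREAT, 0600);
            void *p;

            if (fd < 0 || ftruncate(fd, len) != 0) {
                perror("open/ftruncate");
                return 1;
            }
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }

            /* drop the pages and backing store for the first 4MB; with this
             * patch, non-tmpfs filesystems fail here with ENOSYS */
            if (madvise(p, 4 * 1024 * 1024, MADV_REMOVE) != 0)
                perror("madvise(MADV_REMOVE)");

            munmap(p, len);
            close(fd);
            return 0;
        }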
     

04 Jan, 2006

1 commit

  • readpage(), prepare_write(), and commit_write() callers are updated to
    understand the special return code AOP_TRUNCATED_PAGE in the style of
    writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that
    the callee has unlocked the page and that the operation should be tried again
    with a new page. OCFS2 uses this to detect and work around a lock inversion in
    its aop methods. There should be no change in behaviour for methods that don't
    return AOP_TRUNCATED_PAGE.

    WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
    made enums so that kerneldoc can be used to document their semantics.

    Signed-off-by: Zach Brown

    Zach Brown
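
    Caller-side handling follows a retry pattern like the one below; the
    helper is hypothetical, and the real call sites are spread across mm/ and
    the filesystems:

        static int read_one_page(struct file *file,
                                 struct address_space *mapping, pgoff_t index)
        {
            struct page *page;
            int error;

        retry:
            page = grab_cache_page(mapping, index); /* locked page, or NULL */
            if (!page)
                return -ENOMEM;

            error = mapping->a_ops->readpage(file, page);
            if (error == AOP_TRUNCATED_PAGE) {
                /* the aop already unlocked the page; drop our reference
                 * and try again with a fresh page */
                page_cache_release(page);
                goto retry;
            }
            page_cache_release(page);
            return error;
        }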
     

30 Oct, 2005

3 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
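
    The lock selection ends up as a compile-time either/or, roughly along
    these lines (macro shapes approximate the ones in include/linux/mm.h, not
    copied verbatim):

        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        /* one spinlock per page-table page, tucked into its struct page */
        #define pte_lockptr(mm, pmd)  ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        /* small configurations keep the single per-mm page_table_lock */
        #define pte_lockptr(mm, pmd)  ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif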
     
  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any refcount
    based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and it can be
    reintroduced in the future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and count towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Impose a little more consistency on the page fault handlers do_wp_page,
    do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass
    their arguments in the same order and call them by the same names?

    break_cow is all very well, but what it did was inlined elsewhere: easier to
    compare if it's brought back into do_wp_page.

    do_file_page's fallback to do_no_page dates from a time when we were testing
    pte_file by using it wherever possible: currently it's peculiar to nonlinear
    vmas, so just check that. BUG_ON if not? Better not, it's probably page
    table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
    use that for do_wp_page's invalid pfn too.

    Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
    restored (and say "pud" not "pmd" in its pud_ERROR).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins