05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
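
    The naming change above matters because the old test-and-set returned the
    previous bit value (hence the leading "!"), while a trylock returns true on
    success. A minimal user-space sketch using C11 atomics follows; demo_page,
    DEMO_PG_LOCKED and the demo_* helpers are hypothetical stand-ins, not the
    kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the flags word matters here. */
struct demo_page {
    atomic_ulong flags;
};

#define DEMO_PG_LOCKED 0UL /* assumed bit number of the lock flag */

/* Returns true if the caller took the lock, in the style of trylock_page(). */
static bool demo_trylock_page(struct demo_page *page)
{
    unsigned long mask = 1UL << DEMO_PG_LOCKED;
    /* Set the bit with acquire ordering and look at its previous value. */
    unsigned long old = atomic_fetch_or_explicit(&page->flags, mask,
                                                 memory_order_acquire);
    return (old & mask) == 0;
}

/* Clears the lock bit with release ordering. */
static void demo_unlock_page(struct demo_page *page)
{
    unsigned long mask = 1UL << DEMO_PG_LOCKED;
    atomic_fetch_and_explicit(&page->flags, ~mask, memory_order_release);
}
```

    A second demo_trylock_page() on an already locked page returns false
    instead of spinning, which is the behaviour the new name advertises.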
     

31 Jul, 2008

1 commit


27 Jul, 2008

2 commits

  • This patch makes the following needlessly global code static:
    - swap_lock
    - nr_swapfiles
    - struct swap_list

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • mapping->tree_lock has no read lockers. Convert the lock from an rwlock
    to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

25 Jul, 2008

1 commit

  • Vegard Nossum has noticed the ever-decreasing negative priority in a
    swapon/swapoff loop, which eventually would misprioritize when int wraps
    positive. Not worth spending much code on, but probably better fixed.

    It's easy to handle the swapping on and off of just one area, but there's
    not much point if a pair or more still misbehave. To handle the general
    case, swapoff should compact negative priorities, keeping them always from
    -1 to -MAX_SWAPFILES. That's a change, but should cause no regression,
    since these negative (unspecified) priorities are disjoint from the
    positive specified priorities 0 to 32767.

    One small functional difference, which seems appropriate: when swapoff
    fails to free all swap from a negative priority area, that area is now
    reinserted at lowest priority, rather than at its original priority.

    In moving down swapon's setting of priority, I notice that an area is
    visible to /proc/swaps when it has swap_map set, yet that was being set
    before all the visible fields were properly filled in: corrected.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
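
    The compaction of negative priorities described in this entry can be
    sketched in user space as follows. This is an illustration under stated
    assumptions, not the kernel code: demo_swap_state, demo_swapon and
    demo_swapoff are hypothetical names, and areas live in a small fixed
    array rather than the kernel's priority-ordered list.

```c
#include <stddef.h>

#define DEMO_MAX_AREAS 8 /* hypothetical stand-in for MAX_SWAPFILES */

struct demo_swap_state {
    int prio[DEMO_MAX_AREAS];   /* priority of each slot */
    int in_use[DEMO_MAX_AREAS]; /* nonzero while the slot is an active area */
    int least_priority;         /* lowest negative priority in use, 0 if none */
};

/* swapon without an explicit priority: take the next lower negative value. */
static int demo_swapon(struct demo_swap_state *s, size_t slot)
{
    s->in_use[slot] = 1;
    s->prio[slot] = --s->least_priority;
    return s->prio[slot];
}

/* swapoff: close the gap so negative priorities stay dense from -1 down. */
static void demo_swapoff(struct demo_swap_state *s, size_t slot)
{
    int removed = s->prio[slot];

    s->in_use[slot] = 0;
    if (removed >= 0)
        return; /* explicitly specified priorities are left alone */

    /* Every area that ranked below the removed one moves up by one. */
    for (size_t i = 0; i < DEMO_MAX_AREAS; i++) {
        if (s->in_use[i] && s->prio[i] < removed)
            s->prio[i]++;
    }
    s->least_priority++;
}
```

    With this bookkeeping a swapon/swapoff loop on one slot keeps reusing the
    same priority instead of counting down towards integer wraparound.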
     

29 Apr, 2008

1 commit


28 Apr, 2008

1 commit

  • When checking for the swap header, try byteswapping the endianness-dependent
    fields to allow the swap partition to be shared between big- and little-endian
    systems.

    Signed-off-by: Chris Dearman
    Signed-off-by: Ralf Baechle
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Dearman
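
    A minimal sketch of the byteswapping check, assuming a header with 32-bit
    version, last_page and nr_badpages fields and a supported version of 1.
    The struct layout and the demo_* names are illustrative assumptions; the
    real header also carries a bad-page list that would need the same
    treatment.

```c
#include <stdint.h>

/* Hypothetical subset of the endianness-dependent header fields. */
struct demo_swap_info {
    uint32_t version;
    uint32_t last_page;
    uint32_t nr_badpages;
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t demo_swab32(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
           ((x & 0x00ff0000u) >> 8)  | ((x & 0xff000000u) >> 24);
}

/*
 * Returns 0 if the header is usable, byteswapping it in place when it was
 * written by a machine of the opposite endianness; returns -1 otherwise.
 */
static int demo_fixup_header(struct demo_swap_info *info)
{
    if (info->version == 1)
        return 0; /* native byte order, nothing to do */
    if (demo_swab32(info->version) != 1)
        return -1; /* not a version we recognise in either byte order */

    info->version = demo_swab32(info->version);
    info->last_page = demo_swab32(info->last_page);
    info->nr_badpages = demo_swab32(info->nr_badpages);
    return 0;
}
```

    The version field doubles as the endianness probe here: it is the one
    field whose expected value is known before anything else is trusted.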
     

15 Feb, 2008

1 commit

  • seq_path() is always called with a dentry and a vfsmount from a struct path.
    Make seq_path() take it directly as an argument.

    Signed-off-by: Jan Blunck
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

08 Feb, 2008

4 commits

  • This patch reinstates the "swapoff: scan ptes preemptibly" mod we started
    with: in due course it should be rendered down into the earlier patches,
    leaving us with a more straightforward mem_cgroup_charge mod to unuse_pte,
    allocating with GFP_KERNEL while holding no spinlock and no atomic kmap.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a PowerPC box. I am still looking for a way to
    test the path through which allocations happen in non GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    RSS is accounted at page_add_*_rmap() and page_remove_rmap() time. Page
    cache is accounted at add_to_page_cache() and
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer; this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credit goes to Vaidyanathan Srinivasan for helping with the reference
    counting work of the page cgroup. Almost all of the page cache accounting
    code was written with help from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
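
    Protecting a pointer with its own last bit, as this entry describes for
    page_cgroup, relies on the pointed-to object being aligned so that bit 0
    is always zero in a valid address. A user-space sketch with C11 atomics
    follows; demo_page and the demo_* helpers are hypothetical and only
    illustrate the bit-lock idea, not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_LOCK_BIT ((uintptr_t)1) /* low bit of the pointer word */

/* Hypothetical page with one word holding pointer and lock bit together. */
struct demo_page {
    atomic_uintptr_t page_cgroup;
};

/* Try to take the bit lock; returns true on success. */
static bool demo_try_lock_page_cgroup(struct demo_page *page)
{
    uintptr_t old = atomic_fetch_or_explicit(&page->page_cgroup,
                                             DEMO_LOCK_BIT,
                                             memory_order_acquire);
    return (old & DEMO_LOCK_BIT) == 0;
}

/* Drop the bit lock, leaving the pointer bits untouched. */
static void demo_unlock_page_cgroup(struct demo_page *page)
{
    atomic_fetch_and_explicit(&page->page_cgroup, ~DEMO_LOCK_BIT,
                              memory_order_release);
}

/* Store a new pointer; the caller must hold the lock, so keep the bit set. */
static void demo_assign_page_cgroup(struct demo_page *page, void *pc)
{
    atomic_store_explicit(&page->page_cgroup,
                          (uintptr_t)pc | DEMO_LOCK_BIT,
                          memory_order_relaxed);
}

/* Read the pointer with the lock bit masked off. */
static void *demo_get_page_cgroup(struct demo_page *page)
{
    uintptr_t v = atomic_load_explicit(&page->page_cgroup,
                                       memory_order_relaxed);
    return (void *)(v & ~DEMO_LOCK_BIT);
}
```

    The design saves a separate lock field per page at the cost of masking
    the bit on every pointer read.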
     
  • This patch precisely reverts the "swapoff: scan ptes preemptibly" patch
    just presented. It's a temporary measure to allow existing memory
    controller patches to apply without rejects: in due course they should be
    rendered down into one sensible patch, and this reversion disappear.

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Feb, 2008

4 commits

  • There are a couple of reasons (patches follow) why it would be good to open a
    window for sleep in shmem_unuse_inode, between its search for a matching swap
    entry, and its handling of the entry found.

    shmem_unuse_inode must then use igrab to hold the inode against deletion in
    that window, and its corresponding iput might result in deletion: so it had
    better unlock_page before the iput, and might as well release the page too.

    Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
    leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
    into shmem_unuse_inode, in the case when it finds a match.

    Let try_to_unuse break on error in the shmem_unuse case, as it does in the
    unuse_mm case: though at this point in the series, no error to break on.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency
    in swapoff by scanning the page table preemptibly: so long as unuse_pte is
    careful to recheck that entry under pte lock.

    (To tell the truth, this patch was not inspired by any cries for lower
    latency here: rather, this restructuring permits a future memory controller
    patch to allocate with GFP_KERNEL in unuse_pte, where before it could not.
    But it would be wrong to tuck this change away inside a memcgroup patch.)

    Signed-off-by: Hugh Dickins
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • valid_swaphandles is supposed to do a quick pass over the swap map entries
    neighbouring the entry which swapin_readahead is targeting, to determine for
    it a range worth reading all together. But since it always starts its search
    from the beginning of the swap "cluster", a reject (free entry) there
    immediately curtails the readaround, and every swapin_readahead from that
    cluster is for just a single page. Instead scan forwards and backwards around
    the target entry.

    Use better names for some variables: a swap_info pointer is usually called
    "si" not "swapdev". And at the end, if only the target page should be read,
    return count of 0 to disable readaround, to avoid the unnecessarily repeated
    call to read_swap_cache_async.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
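
    The forwards-and-backwards scan described in this entry can be sketched
    as below. This is a simplified user-space illustration: demo_readaround
    is a hypothetical name, a zero map entry means "free", bad-page markers
    and locking are left out, and the caller is assumed to pass a target
    with 1 <= target < max.

```c
#include <stddef.h>

/*
 * Find the run of in-use swap map entries around `target`, limited to the
 * aligned cluster of 1 << cluster_shift slots that contains it.
 * Stores the first offset of the run in *start and returns the number of
 * pages to read, or 0 when only the target itself qualifies (no readaround).
 */
static size_t demo_readaround(const unsigned short *map, size_t max,
                              size_t target, unsigned cluster_shift,
                              size_t *start)
{
    size_t base = (target >> cluster_shift) << cluster_shift;
    size_t end = base + ((size_t)1 << cluster_shift);
    size_t lo = target, hi = target;

    if (base == 0)
        base = 1; /* slot 0 holds the swap header */
    if (end > max)
        end = max; /* do not run past the end of the map */

    /* Extend upwards over contiguous in-use slots. */
    while (hi + 1 < end && map[hi + 1] != 0)
        hi++;
    /* Extend downwards over contiguous in-use slots. */
    while (lo > base && map[lo - 1] != 0)
        lo--;

    *start = lo;
    return (hi == lo) ? 0 : hi - lo + 1;
}
```

    Starting from the target rather than from the cluster base means a free
    slot at the base no longer cuts the readaround down to a single page.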
     
  • Building in a filesystem on a loop device on a tmpfs file can hang when
    swapping, the loop thread caught in that infamous throttle_vm_writeout.

    In theory this is a long standing problem, which I've either never seen in
    practice, or long ago suppressed the recollection, after discounting my load
    and my tmpfs size as unrealistically high. But now, with the new aops, it has
    become easy to hang on one machine.

    Loop used to grab_cache_page before the old prepare_write to tmpfs, which
    seems to have been enough to free up some memory for any swapin needed; but
    the new write_begin lets tmpfs find or allocate the page (much nicer, since
    grab_cache_page missed tmpfs pages in swapcache).

    When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
    has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
    break out when __GFP_IO or __GFP_FS is unset; but when tmpfs swaps in,
    read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
    mapping_gfp_mask - hence the hang.

    So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
    swapin_readahead to read_swap_cache_async to add_to_swap_cache.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Jul, 2007

1 commit


17 Jul, 2007

1 commit


08 May, 2007

1 commit

  • Ensure pages are uptodate after returning from read_cache_page, which allows
    us to cut out most of the filesystem-internal PageUptodate calls.

    I didn't have a great look down the call chains, but this appears to fix 7
    possible use-before-uptodate cases in hfs, 2 in hfsplus, 1 in jfs, a few in
    ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in
    block2mtd. All depending on whether the filler is async and/or can return
    with a !uptodate page.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Jan, 2007

1 commit

  • In the kernels later than 2.6.19 there is a regression that makes swsusp
    fail if the resume device is not explicitly specified.

    It can be fixed by adding an additional parameter to
    mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
    *) corresponding to the first available swap back to the caller.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

09 Dec, 2006

1 commit


08 Dec, 2006

5 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, e.g. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it.

    From swsusp's point of view (1) is not a problem, because it is already
    taken care of by the swap-handling code, but (2) has to be taken into
    consideration.

    In principle the location of a swap file's header may be determined with the
    help of appropriate filesystem driver. Unfortunately, however, it requires
    the filesystem holding the swap file to be mounted, and if this filesystem is
    journaled, it cannot be mounted during a resume from disk. For this reason we
    need some other means by which swap areas can be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The fsfuzzer found this; with a corrupt small swapfile that claims to have
    many pages:

    [root]# file swap.741.img
    swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
    [root]# ls -l swap.741.img
    -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img

    sys_swapon() will try to vmalloc all those pages, and -then- check to see if
    the file is actually that large:

    if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {

    if (swapfilesize && maxpages > swapfilesize) {
            printk(KERN_WARNING
                   "Swap area shorter than signature indicates\n");

    It seems to me that it would make more sense to move this test up before
    the vmalloc, with the other checks, to avoid the OOM-killer in this
    situation...

    Signed-off-by: Eric Sandeen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
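
    The reordering suggested in this entry, checking the claimed size against
    the real size before allocating, can be sketched as follows. The function
    name and the use of calloc in place of vmalloc are assumptions made so
    the example runs in user space.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate a swap map of `maxpages` entries, but only after checking that
 * the header's claim fits inside the file. `swapfilesize` is the real size
 * in pages, or 0 when it is unknown. Returns NULL on any failure.
 */
static unsigned short *demo_alloc_swap_map(size_t maxpages,
                                           size_t swapfilesize)
{
    if (maxpages == 0)
        return NULL;

    /* Reject a corrupt header before it can trigger a huge allocation. */
    if (swapfilesize && maxpages > swapfilesize)
        return NULL;

    return calloc(maxpages, sizeof(unsigned short));
}
```

    With the numbers from the report, a 16777216-byte file holds 4096 pages
    of 4K, so a header claiming 1040191487 pages is rejected without any
    allocation being attempted.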
     
  • It would be possible for /proc/swaps to not always print out the header:

    swapon /dev/hdc2
    swapon /dev/hde2
    swapoff /dev/hdc2

    At this point /proc/swaps would not have a header.

    Signed-off-by: Suleiman Souhlal
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     

30 Sep, 2006

1 commit

  • akpm draws my attention to the fact that sysctl(VM_PAGE_CLUSTER) might
    conceivably change page_cluster to 0 while valid_swaphandles() is in the
    middle of using it, leading to an embarrassingly long loop: take a local
    snapshot of page_cluster and work with that.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
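
    Taking a local snapshot of a tunable that may change concurrently can be
    sketched like this. demo_page_cluster and demo_cluster_end are
    hypothetical names, and the C11 atomic load stands in for a plain read of
    the sysctl variable; the point is only that the value is read once and
    every later step uses the same copy.

```c
#include <stdatomic.h>

/* Hypothetical tunable that another thread may rewrite at any time. */
static atomic_int demo_page_cluster = 3;

/* Returns the exclusive end offset of the cluster containing `target`. */
static unsigned long demo_cluster_end(unsigned long target)
{
    /* Read the tunable exactly once and work with the local copy. */
    int shift = atomic_load_explicit(&demo_page_cluster,
                                     memory_order_relaxed);
    unsigned long base = (target >> shift) << shift;

    return base + (1UL << shift);
}
```

    Because base and the cluster size are derived from the same snapshot, a
    concurrent change can make the result stale but never inconsistent.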
     

28 Aug, 2006

1 commit

  • There is a bug in mm/swapfile.c#swap_type_of() that makes swsusp only be
    able to use the first active swap partition as the resume device. Fix it.

    Signed-off-by: Rafael J. Wysocki
    Cc: Hugh Dickins
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

01 Jul, 2006

1 commit


23 Jun, 2006

5 commits

  • Add read_mapping_page() which is used for callers that pass
    mapping->a_ops->readpage as the filler for read_cache_page. This removes
    some duplication from filesystem code.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to
    use that on mm_users: doesn't actually matter at present, but safer to be
    sure that once mm_users has gone to 0, nothing raises it for an instant.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
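
    The semantics of an increment-unless-zero operation can be sketched in
    user space with a C11 compare-and-swap loop. demo_inc_not_zero is a
    hypothetical name; this illustrates the idea rather than reproducing the
    kernel's atomic_inc_not_zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. Returns true if the increment happened.
 * Once the counter has reached zero it is never raised again by this call.
 */
static bool demo_inc_not_zero(atomic_int *v)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    while (old != 0) {
        /* On failure the CAS reloads `old`, so the loop rechecks for zero. */
        if (atomic_compare_exchange_weak_explicit(v, &old, old + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}
```

    Compared with an unconditional increment followed by a check, the counter
    never leaves zero even for an instant, which is the safety property the
    entry is after for mm_users.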
     
  • Rip the page migration logic out.

    Remove all code that has to do with swapping during page migration.

    This also guts the ability to migrate pages to swap. No one used that so let's
    let it go for good.

    Page migration should be a bit broken after this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. Migration entries can
    only be encountered when the page they are pointing to is locked. This limits
    the number of places one has to fix. We also check in copy_pte_range and in
    mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_cache and call a function that will
    then wait on the page lock. This allows us to effectively stop all accesses
    to a page.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
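
    The write-protection bug described in this entry, passing a swap type
    where a boolean was expected, can be reproduced in a small sketch.
    SWP_MIGRATION_READ being 30 comes from the message above; the value 31
    for the write type, the demo_entry struct and all demo_* helpers are
    assumptions for illustration only.

```c
#include <stdbool.h>

#define DEMO_SWP_MIGRATION_READ  30u /* value stated in the message */
#define DEMO_SWP_MIGRATION_WRITE 31u /* assumed: the other upper swap type */

/* Hypothetical decoded form of a migration entry. */
struct demo_entry {
    unsigned type;
    unsigned long offset;
};

/* The second argument is a boolean "write", not a swap type. */
static struct demo_entry demo_make_migration_entry(unsigned long pfn,
                                                   int write)
{
    struct demo_entry e = {
        write ? DEMO_SWP_MIGRATION_WRITE : DEMO_SWP_MIGRATION_READ,
        pfn
    };
    return e;
}

static bool demo_is_write_migration_entry(struct demo_entry e)
{
    return e.type == DEMO_SWP_MIGRATION_WRITE;
}

/* Downgrade an entry to read-only while keeping its offset. */
static void demo_make_migration_entry_read(struct demo_entry *e)
{
    e->type = DEMO_SWP_MIGRATION_READ;
}
```

    Passing DEMO_SWP_MIGRATION_READ as the second argument yields a write
    entry, because 30 is nonzero. Testing with the is_write helper and then
    downgrading with the make_read helper avoids the trap.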
     
  • Remove two unnecessary PageSwapCache checks. The page refcount is raised
    and therefore page migration cannot occur in both functions.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Apr, 2006

1 commit

  • find_trylock_page() is an odd interface in that it doesn't take a reference
    like the others. Now that XFS no longer uses it, and its last remaining
    caller actually wants an elevated refcount, opencode that callsite and
    schedule find_trylock_page() for removal.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

23 Mar, 2006

2 commits

  • This patch introduces a user space interface for swsusp.

    The interface is based on a special character device, called the snapshot
    device, that allows user space processes to perform suspend and resume-related
    operations with the help of some ioctls and the read()/write() functions.
    Additionally it allows these processes to allocate free swap pages from a
    selected swap partition, called the resume partition, so that they know which
    sectors of the resume partition are available to them.

    The interface uses the same low-level system memory snapshot-handling
    functions that are used by the built-in swap-writing/reading code of swsusp.

    The interface documentation is included in the patch.

    The patch assumes that the major and minor numbers of the snapshot device will
    be 10 (i.e. misc device) and 231, the registration of which has already been
    requested.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Introduce the low level interface that can be used for handling the
    snapshot of the system memory by the in-kernel swap-writing/reading code of
    swsusp and the userland interface code (to be introduced shortly).

    Also change the way in which swsusp records the allocated swap pages and,
    consequently, simplifies the in-kernel swap-writing/reading code (this is
    necessary for the userland interface too). To this end, it introduces two
    helper functions in mm/swapfile.c, so that the swsusp code does not refer
    directly to the swap internals.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

22 Mar, 2006

1 commit

  • When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the
    first index of the swap cluster. But the current code probably sets it to
    the wrong offset.

    Signed-off-by: Akinobu Mita
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
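
    The index arithmetic at stake can be sketched as follows: when a scan
    notices that it has just seen `cluster` consecutive free slots ending at
    offset `off`, the first slot of that run is off - cluster + 1. The
    function below is a hypothetical user-space illustration of that
    calculation, not the kernel's scan loop.

```c
#include <stddef.h>

/*
 * Find the first run of `cluster` consecutive free (zero) slots in a swap
 * map and return the index of its first slot. Slot 0 is the header and is
 * never returned, so 0 doubles as "no such run".
 */
static size_t demo_find_free_cluster(const unsigned short *map, size_t max,
                                     size_t cluster)
{
    size_t run = 0;

    for (size_t off = 1; off < max; off++) {
        if (map[off] != 0) {
            run = 0; /* an in-use slot breaks the run */
            continue;
        }
        if (++run == cluster)
            return off - cluster + 1; /* first index, not the last */
    }
    return 0;
}
```

    An off-by-two in this expression still points near the run, which may be
    why such a mistake could go unnoticed for a while.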
     

02 Feb, 2006

2 commits

  • Add remove_from_swap

    remove_from_swap() allows the restoration of the pte entries that existed
    before page migration occurred for anonymous pages by walking the reverse
    maps. This reduces swap use and establishes regular ptes without the need
    for page faults.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Check for PageSwapCache after looking up and locking a swap page.

    The page migration code may change a swap pte to point to a different page
    under lock_page().

    If that happens then the VM must retry the lookup operation in the swap space
    to find the correct page number. There are a couple of locations in the VM
    where a lock_page() is done on a swap page. In these locations we need to
    check afterwards if the page was migrated. If the page was migrated then the
    old page that was looked up before was freed and no longer has the
    PageSwapCache bit set.

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter