29 Oct, 2006

6 commits

  • If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
    area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my
    preliminary i_size check before attempting to allocate the page (though this
    only fixes the most obvious case: more work will be needed here).

    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: David Gibson
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If __vmalloc is called to allocate memory with GFP_ATOMIC in atomic
    context, the chain of calls results in __get_vm_area_node allocating memory
    for vm_struct with GFP_KERNEL, causing the 'sleeping from invalid context'
    warning. This patch fixes it by passing the gfp flags along so
    __get_vm_area_node allocates memory for vm_struct with the same flags.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Giridhar Pemmasani
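
The idea of passing the caller's flags down to every allocation in the chain can be illustrated with a small userspace sketch. Everything here (`SK_WAIT`, `sk_alloc`, `sk_vmalloc`, `sk_vfree`, the `in_atomic_context` flag) is hypothetical and only models the behaviour described above; it is not the kernel's vmalloc code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SK_WAIT 0x1u	/* hypothetical: caller allows the allocator to sleep */

static bool in_atomic_context;		/* set by the caller in this sketch */
static int invalid_sleep_warnings;	/* counts "sleeping from invalid context" */

/* Stand-in for the slab allocator: complains if asked to sleep while atomic. */
static void *sk_alloc(size_t size, unsigned int flags)
{
	if ((flags & SK_WAIT) && in_atomic_context)
		invalid_sleep_warnings++;
	return malloc(size);
}

struct sk_area {
	void *pages;
	size_t size;
};

/* The caller's flags are used for the descriptor as well as for the data. */
static struct sk_area *sk_vmalloc(size_t size, unsigned int flags)
{
	struct sk_area *area;

	assert(size > 0);
	area = sk_alloc(sizeof(*area), flags);	/* not a hard-coded SK_WAIT */
	if (!area)
		return NULL;
	area->pages = sk_alloc(size, flags);
	if (!area->pages) {
		free(area);
		return NULL;
	}
	area->size = size;
	return area;
}

static void sk_vfree(struct sk_area *area)
{
	if (!area)
		return;
	free(area->pages);
	free(area);
}
```

With `in_atomic_context` set and `flags` of 0, neither allocation asks to sleep, so the warning counter stays at zero. Hard-coding `SK_WAIT` for the descriptor would trip it.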
     
  • Add the __GFP_NOWARN flag to the __alloc_pages() call in
    __kmalloc_section_memmap(). This reduces noisy failure messages.

    On ia64 the section size is 1 GB, which means that order-8 pages are
    necessary for each section's memmap. This is often a very hard requirement
    to meet under heavy memory pressure. So __alloc_pages() gives up on the
    allocation and prints a noisy stack trace for each section it cannot find
    pages for. (My current environment shows 32 stack traces.)

    But __kmalloc_section_memmap() calls vmalloc() after that failure, and
    vmalloc() can succeed in allocating the memmap. So the stack trace warning
    is just noise. I suppose it shouldn't be shown.

    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
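
The order-8 figure can be checked with a little arithmetic. The sketch below assumes a 16 KB page size and a 56-byte struct page, neither of which is stated in the message, and `order_for_bytes` and `memmap_order` are hypothetical helpers rather than kernel functions.

```c
#include <assert.h>

/* Smallest order n such that (1 << n) pages can hold `bytes` bytes. */
static unsigned int order_for_bytes(unsigned long long bytes,
				    unsigned long long page_size)
{
	unsigned long long pages = (bytes + page_size - 1) / page_size;
	unsigned int order = 0;

	while ((1ULL << order) < pages)
		order++;
	return order;
}

/* Order of the contiguous allocation needed for one section's memmap. */
static unsigned int memmap_order(unsigned long long section_size,
				 unsigned long long page_size,
				 unsigned long long struct_page_size)
{
	assert(page_size != 0 && section_size % page_size == 0);
	return order_for_bytes((section_size / page_size) * struct_page_size,
			       page_size);
}
```

Under those assumptions a 1 GB section holds 65536 pages, the memmap takes 3670016 bytes (224 pages), and the next power of two is 256 pages, so `memmap_order(1ULL << 30, 16384, 56)` evaluates to 8.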
     
  • If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
    GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and
    then reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
    value).

    However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
    may still be struggling to reclaim pages. The concurrent overwrite of
    zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
    deactivating mapped pages, thus causing reclaim difficulties.

    The fix is to key the distress calculation not only off
    zone->prev_priority, but also to take into account the local caller's
    priority, by using min(zone->prev_priority, sc->priority).

    Signed-off-by: Martin J. Bligh
    Cc: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
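
A minimal sketch of the effect, assuming distress is derived as `100 >> priority` (that formula is an assumption here, not stated in the message). The function names are illustrative only.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* least urgent scanning priority */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* distress: 0 means no trouble reclaiming, 100 means great trouble. */
static int distress_before(int zone_prev_priority)
{
	return 100 >> zone_prev_priority;
}

/* The caller's own priority can only make the result more urgent. */
static int distress_after(int zone_prev_priority, int caller_priority)
{
	assert(zone_prev_priority >= 0 && caller_priority >= 0);
	return 100 >> min_int(zone_prev_priority, caller_priority);
}
```

If another reclaimer has just reset the zone to `DEF_PRIORITY`, `distress_before(DEF_PRIORITY)` is 0, while a struggling caller at priority 0 gets 100 from `distress_after(DEF_PRIORITY, 0)`.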
     
  • The temp_priority field in zone is racy, as we can walk through a reclaim
    path, and just before we copy it into prev_priority, it can be overwritten
    (say with DEF_PRIORITY) by another reclaimer.

    The same bug is contained in both try_to_free_pages and balance_pgdat, but
    it is fixed slightly differently. In balance_pgdat, we keep a separate
    priority record per zone in a local array. In try_to_free_pages there is
    no need to do this, as the priority level is the same for all zones that we
    reclaim from.

    Impact of this bug is that temp_priority is copied into prev_priority, and
    setting this artificially high causes reclaimers to set distress
    artificially low. They then fail to reclaim mapped pages, when they are,
    in fact, under severe memory pressure (their priority may be as low as 0).
    This causes the OOM killer to fire incorrectly.

    From: Andrew Morton

    __zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
    is used in the decision whether or not to bring mapped pages onto the inactive
    list. Hence there's a risk here that __zone_reclaim() will fail because
    zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
    stuck on the active list.

    Fix that up by decreasing (ie making more urgent) zone->prev_priority as
    __zone_reclaim() scans the zone's pages.

    This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
    possible to remove that now, and to just start out at DEF_PRIORITY?

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     
  • - Consolidate page_cache_alloc

    - Fix splice: only the pagecache pages and filesystem data need to use
    mapping_gfp_mask.

    - Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.

    Signed-off-by: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

22 Oct, 2006

3 commits

  • The zonelist may contain zones of nodes that have not been bootstrapped and
    we will oops if we try to allocate from those zones. So check if the node
    information for the slab and the node have been setup before attempting an
    allocation. If it has not been setup then skip that zone.

    Usually we will not encounter this situation since the slab bootstrap code
    avoids falling back before we have setup the respective nodes but we seem
    to have special needs for ppc.

    Signed-off-by: Christoph Lameter
    Acked-by: Andy Whitcroft
    Cc: Paul Mackerras
    Cc: Mike Kravetz
    Cc: Benjamin Herrenschmidt
    Acked-by: Mel Gorman
    Acked-by: Will Schmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
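
The skip-if-not-set-up check can be sketched as a walk over a fallback order. The types and names below (`sk_cache`, `sk_nodelist`, `sk_fallback_node`, `SK_MAX_NODES`) are hypothetical stand-ins, not the slab allocator's own structures.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_NODES 4	/* hypothetical node limit for this sketch */

struct sk_nodelist {
	int free_objects;
};

/* Per-node state of a cache; NULL means the node is not set up yet. */
struct sk_cache {
	struct sk_nodelist *nodelists[SK_MAX_NODES];
};

/* Walk the fallback order and return the first usable node, or -1. */
static int sk_fallback_node(const struct sk_cache *cache,
			    const int *fallback_order, size_t n)
{
	assert(cache != NULL && fallback_order != NULL);
	for (size_t i = 0; i < n; i++) {
		int node = fallback_order[i];

		if (node < 0 || node >= SK_MAX_NODES)
			continue;
		if (cache->nodelists[node] == NULL)
			continue;	/* not bootstrapped: skip it */
		return node;
	}
	return -1;
}
```

Returning -1 when no node qualifies leaves the decision about what to do next with the caller.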
     
  • Reintroduce NODES_SPAN_OTHER_NODES for powerpc

    Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
    This reverts commit f62859bb6871c5e4a8e591c60befc8caaf54db8c.
    Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
    This reverts commit a94b3ab7eab4edcc9b2cb474b188f774c331adf7.

    Also update the comments to indicate that this is still required
    and where it's used.

    Signed-off-by: Andy Whitcroft
    Cc: Paul Mackerras
    Cc: Mike Kravetz
    Cc: Benjamin Herrenschmidt
    Acked-by: Mel Gorman
    Acked-by: Will Schmidt
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • * 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
    [PATCH] Remove SUID when splicing into an inode
    [PATCH] Add lockless helpers for remove_suid()
    [PATCH] Introduce generic_file_splice_write_nolock()
    [PATCH] Take i_mutex in splice_from_pipe()

    Linus Torvalds
     

21 Oct, 2006

6 commits

  • Clarify lockorder comments now that sys_msync drops mmap_sem before
    calling do_fsync.

    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • From mm/memory.c:

    static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
    {
            /*
             * If the source page was a PFN mapping, we don't have
             * a "struct page" for it. We do a best-effort copy by
             * just copying from the original user address. If that
             * fails, we just zero-fill it. Live with it.
             */
            if (unlikely(!src)) {
                    void *kaddr = kmap_atomic(dst, KM_USER0);
                    void __user *uaddr = (void __user *)(va & PAGE_MASK);

                    /*
                     * This really shouldn't fail, because the page is there
                     * in the page tables. But it might just be unreadable,
                     * in which case we just give up and fill the result with
                     * zeroes.
                     */
                    if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                            memset(kaddr, 0, PAGE_SIZE);
                    kunmap_atomic(kaddr, KM_USER0);
    #### The D-cache has to be flushed here.
    #### It seems this was simply forgotten.
                    return;
            }
            copy_user_highpage(dst, src, va);
    #### OK here: flush_dcache_page() is called from this function if the arch needs it.
    }

    The following patch fixes this issue:

    Signed-off-by: Dmitriy Monakhov
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitriy Monakhov
     
  • Quoting Adrian:

    - net/sunrpc/svc.c uses highest_possible_node_id()

    - include/linux/nodemask.h says highest_possible_node_id() is
    out-of-line #if MAX_NUMNODES > 1

    - the out-of-line highest_possible_node_id() is in lib/cpumask.c

    - lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
    CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y

    -> highest_possible_node_id() is used in net/sunrpc/svc.c
    CONFIG_NODES_SHIFT defined and > 0

    -> include/linux/numa.h: MAX_NUMNODES > 1

    -> compile error

    The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
    depends on NUMA (but m32r isn't the only affected architecture).

    So move the function into page_alloc.c

    Cc: Adrian Bunk
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Although mm.h is not an exported header, it does contain one thing
    which is part of the userspace ABI: the value that disables the OOM killer
    for a given process. So,
    a) create and export include/linux/oom.h
    b) move OOM_DISABLE define there.
    c) turn bounding values of /proc/$PID/oom_adj into defines and export
    them too.

    Note: mass __KERNEL__ removal will be done later.

    Signed-off-by: Alexey Dobriyan
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
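
A sketch of how the exported defines might be used to validate a value written to /proc/$PID/oom_adj. The numeric values below are assumptions recalled from the kernel header and should be checked against include/linux/oom.h in your tree; `oom_adj_set` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed values; verify against include/linux/oom.h. */
#define OOM_DISABLE	(-17)	/* disables the OOM killer for the process */
#define OOM_ADJUST_MIN	(-16)	/* lower bound of the adjustment range */
#define OOM_ADJUST_MAX	15	/* upper bound of the adjustment range */

/* Store `value` if it is OOM_DISABLE or within bounds, else return -EINVAL. */
static int oom_adj_set(int *oom_adj, int value)
{
	assert(oom_adj != NULL);
	if ((value < OOM_ADJUST_MIN || value > OOM_ADJUST_MAX) &&
	    value != OOM_DISABLE)
		return -EINVAL;
	*oom_adj = value;
	return 0;
}
```

Keeping the bounds as named constants in one exported header lets userspace tools and the kernel agree on them without duplicating magic numbers.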
     
  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When direct-io falls back to buffered write, it will just leave the dirty data
    floating about in pagecache, pending regular writeback.

    But normal direct-io semantics are that IO is synchronous, and that it leaves
    no pagecache behind.

    So change the fallback-to-buffered-write code to sync the file region and to
    then strip away the pagecache, just as a regular direct-io write would do.

    Acked-by: Jeff Moyer
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

20 Oct, 2006

1 commit

  • Right now users have to grab i_mutex before calling remove_suid(), in the
    unlikely event that a call to ->setattr() may be needed. Split up the
    function in two parts:

    - One to check if we need to remove suid
    - One to actually remove it

    The first we can call lockless.

    Signed-off-by: Jens Axboe

    Jens Axboe
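
The two-part split can be sketched in userspace on a plain mode value. The names `should_kill_suid` and `kill_suid`, the `KILL_*` flags, and the `privileged` parameter are hypothetical; the group-execute rule for setgid is an assumption about the check, not something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical flag values for this sketch. */
#define KILL_SUID 0x1
#define KILL_SGID 0x2

/* Part 1: a pure check that only reads the mode, so it needs no lock. */
static int should_kill_suid(mode_t mode, bool privileged)
{
	int kill = 0;

	if (mode & S_ISUID)
		kill |= KILL_SUID;
	/* Assumed rule: setgid only counts when group-execute is also set. */
	if ((mode & S_ISGID) && (mode & S_IXGRP))
		kill |= KILL_SGID;
	return privileged ? 0 : kill;
}

/* Part 2: apply the change; the caller is expected to hold the lock. */
static mode_t kill_suid(mode_t mode, int kill)
{
	assert((kill & ~(KILL_SUID | KILL_SGID)) == 0);
	if (kill & KILL_SUID)
		mode &= ~(mode_t)S_ISUID;
	if (kill & KILL_SGID)
		mode &= ~(mode_t)S_ISGID;
	return mode;
}
```

A caller would run the check first and take the lock only when it returns non-zero, which is the unlikely case.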
     

17 Oct, 2006

3 commits

  • A recent change to the vmalloc() code accidentally resulted in us passing
    __GFP_ZERO into the slab allocator. But we only wanted __GFP_ZERO for the
    actual pages which are being vmalloc()ed, and passing __GFP_ZERO into slab is
    not a rational thing to ask for.

    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We need to encode and decode the 'file' part of a handle. We simply use the
    inode number and generation number to construct the filehandle.

    The generation number is the time when the file was created. As inode numbers
    cycle through the full 32 bits before being reused, there is no real chance of
    the same inum being allocated to different files in the same second so this is
    suitably unique. Using time-of-day rather than e.g. jiffies makes it less
    likely that the same filehandle can be created after a reboot.

    In order to be able to decode a filehandle we need to be able to lookup by
    inum, which means that the inode needs to be added to the inode hash table
    (tmpfs doesn't currently hash inodes as there is never a need to lookup by
    inum). To avoid overhead when not exporting, we only hash an inode when it is
    first exported. This requires a lock to ensure it isn't hashed twice.

    This code is separate from the patch posted in June06 from Atal Shargorodsky
    which provided the same functionality, but does borrow slightly from it.

    Locking comment: Most filesystems that hash their inodes do so at the point
    where the 'struct inode' is initialised, and that has suitable locking
    (I_NEW). Here in shmem, we are hashing the inode later, the first time we
    need an NFS file handle for it. We no longer have I_NEW to ensure only one
    thread tries to add it to the hash table.

    Cc: Atal Shargorodsky
    Cc: Gilad Ben-Yossef
    Signed-off-by: David M. Grimes
    Signed-off-by: Neil Brown
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David M. Grimes
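
A minimal sketch of packing an inode number and generation into a filehandle and reading them back. The three-word layout, the 64-bit inode type, and the names `fh_encode` and `fh_decode` are assumptions made for illustration, not the layout used by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FH_WORDS 3	/* hypothetical layout: generation, inode low, inode high */

/* Pack generation and inode number; returns words written, or -1 if too small. */
static int fh_encode(uint32_t *fh, int max_words, uint64_t ino,
		     uint32_t generation)
{
	assert(fh != NULL);
	if (max_words < FH_WORDS)
		return -1;
	fh[0] = generation;
	fh[1] = (uint32_t)ino;
	fh[2] = (uint32_t)(ino >> 32);
	return FH_WORDS;
}

/* Recover the inode number and generation; returns 0, or -1 if too short. */
static int fh_decode(const uint32_t *fh, int words, uint64_t *ino,
		     uint32_t *generation)
{
	assert(fh != NULL && ino != NULL && generation != NULL);
	if (words < FH_WORDS)
		return -1;
	*generation = fh[0];
	*ino = ((uint64_t)fh[2] << 32) | fh[1];
	return 0;
}
```

After decoding, a real implementation would look the inode up by number and reject the handle if the stored generation does not match, which is why the inode has to be hashed.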
     
  • If remove_mapping() failed to remove the page from its mapping, don't go and
    mark it not uptodate! Makes kernel go dead.

    (Actually, I don't think the ClearPageUptodate is needed there at all).

    Says Nick Piggin:

    "Right, it isn't needed because at this point the page is guaranteed
    by remove_mapping to have no references (except us) and cannot pick
    up any new ones because it is removed from pagecache.

    We can delete it."

    Signed-off-by: Andrew Morton
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

16 Oct, 2006

1 commit

  • .. and clean up the file mapping code while at it. No point in having a
    "if (file)" repeated twice, and generally doing similar checks in two
    different sections of the same code

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Oct, 2006

9 commits

  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar
     
  • If try_to_release_page() is called with a zero gfp mask, then the
    filesystem is effectively denied the possibility of sleeping while
    attempting to release the page. There doesn't appear to be any valid
    reason why this should be banned, given that we're not calling this from a
    memory allocation context.

    For this reason, change the gfp_mask argument of the call to GFP_KERNEL.

    Signed-off-by: Trond Myklebust
    Cc: Steve Dickson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • A failure in invalidate_inode_pages2_range() can result in unpleasant things
    happening in NFS (at least). Stick a WARN_ON_ONCE() in there so we can find
    out if it happens, and maybe why.

    (akpm: might be a -mm-only patch, we'll see..)

    Cc: Chuck Lever
    Cc: Trond Myklebust
    Cc: Steve Dickson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move the lock debug checks below the page reserved checks. Also, having
    debug_check_no_locks_freed in kernel_map_pages is wrong.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • After the PG_reserved check was added, arch_free_page was being called in the
    wrong place (it could be called for a page we don't actually want to free).
    Fix that.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • With CONFIG_MIGRATION=n

    mm/mempolicy.c: In function 'do_mbind':
    mm/mempolicy.c:796: warning: passing argument 2 of 'migrate_pages' from incompatible pointer type

    Signed-off-by: Keith Owens
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Owens
     
  • We have a persistent dribble of reports of this BUG triggering. Its extended
    diagnostics were recently made conditional on CONFIG_DEBUG_VM, which was a bad
    idea - we want to know about it.

    Signed-off-by: Dave Jones
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes kernel to oops with
    libhugetlbfs test suite. The problem is that hugetlb pages can be shared
    by multiple mappings. Multiple threads can fight over page->lru in the
    unmap path and bad things happen. We now serialize __unmap_hugepage_range
    to avoid concurrent linked list manipulation. Such serialization is also
    needed for shared page table pages in hugetlb areas. This patch fixes
    the bug and also serves as a prepatch for shared page tables.

    Signed-off-by: Ken Chen
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • memmap_zone_idx() is not used anymore. It was required by an earlier
    version of
    account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
    any more.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Oct, 2006

1 commit

  • Init list is called with a list parameter that is not equal to the
    cachep->nodelists entry under NUMA if more than one node exists. This is
    fully legitimate. One may want to populate the list fields before
    switching nodelist pointers.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Oct, 2006

2 commits

  • Add a way for a no_page() handler to request a retry of the faulting
    instruction. It goes back to userland on page faults and just tries again
    in get_user_pages(). I added a cond_resched() in the loop in that latter
    case.

    The problem I have with signal and spufs is an actual bug affecting apps and I
    don't see other ways of fixing it.

    In addition, we are having issues with infiniband and 64k pages (related to
    the way the hypervisor deals with some HV cards) that will require us to muck
    around with the MMU from within the IB driver's no_page() (it's a pSeries
    specific driver) and return to the caller the same way using NOPAGE_REFAULT.

    And to add to this, the graphics folks have been following a new approach of
    memory management that involves transparently swapping objects between video
    ram and main memory. To do that, they need to install PTEs from a no_page()
    handler as well and that also requires returning with NOPAGE_REFAULT.

    (For the latter, they are currently using io_remap_pfn_range to install one PTE
    from no_page() which is a bit racy, we need to add a check for the PTE having
    already been installed after taking the lock, but that's ok, they are only at
    the proof-of-concept stage. I'll send a patch adding a "clean" function to do
    that, we can use that from spufs too and get rid of the sparsemem hacks we do
    to create struct page for SPEs. Basically, that provides a generic solution
    for being able to have no_page() map hardware devices, which is something that
    I think sound driver folks have been asking for some time too).

    All of these things depend on having the NOPAGE_REFAULT exit path from
    no_page() handlers.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
    variable to store the result of numa_node_id().

    text data bss dec hex filename
    16858 2584 16 19458 4c02 mm/slab.o (before)
    16804 2584 16 19404 4bcc mm/slab.o (after)

    [akpm@osdl.org: use better names]
    [pbadari@us.ibm.com: fix that]
    Cc: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

05 Oct, 2006

2 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
    Remove all inclusions of

    Manually resolved trivial path conflicts due to removed files in
    the sound/oss/ subdirectory.

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6: (292 commits)
    [GFS2] Fix endian bug for de_type
    [GFS2] Initialize SELinux extended attributes at inode creation time.
    [GFS2] Move logging code into log.c (mostly)
    [GFS2] Mark nlink cleared so VFS sees it happen
    [GFS2] Two redundant casts removed
    [GFS2] Remove uneeded endian conversion
    [GFS2] Remove duplicate sb reading code
    [GFS2] Mark metadata reads for blktrace
    [GFS2] Remove iflags.h, use FS_
    [GFS2] Fix code style/indent in ops_file.c
    [GFS2] streamline-generic_file_-interfaces-and-filemap gfs fix
    [GFS2] Remove readv/writev methods and use aio_read/aio_write instead (gfs bits)
    [GFS2] inode-diet: Eliminate i_blksize from the inode structure
    [GFS2] inode_diet: Replace inode.u.generic_ip with inode.i_private (gfs)
    [GFS2] Fix typo in last patch
    [GFS2] Fix direct i/o logic in filemap.c
    [GFS2] Fix bug in Makefiles for lock modules
    [GFS2] Remove (extra) fs_subsys declaration
    [GFS2/DLM] Fix trailing whitespace
    [GFS2] Tidy up meta_io code
    ...

    Linus Torvalds
     

04 Oct, 2006

6 commits

  • - rename ____kmalloc to kmalloc_track_caller so that people have a chance
    to guess what it does just from its name. Add a comment describing it
    for those who don't. Also move it after kmalloc in slab.h so people get
    less confused when they are just looking for kmalloc

    - move things around in slab.c a little to reduce the ifdef mess.

    [penberg@cs.helsinki.fi: Fix up reversed #ifdef]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Fix kernel-doc and function declaration (missing "void") in
    mm/page_alloc.c.

    Add mm/page_alloc.c to kernel-api.tmpl in DocBook.

    mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'

    Signed-off-by: Randy Dunlap
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Hugh spotted that a hugetlb page is freed back to the global pool before
    any TLB flush is performed in unmap_hugepage_range(). This potentially
    allows threads to abuse a free-alloc race condition.

    The generic tlb gather code is unsuitable for use by hugetlb, so I just
    open-coded a page gathering list and delayed put_page until the tlb flush
    is performed.

    Cc: Hugh Dickins
    Signed-off-by: Ken Chen
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
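
The gather-then-release ordering can be sketched as follows. All names here (`sk_page`, `sk_gather`, `sk_gather_add`, `sk_flush_tlb`, `sk_gather_finish`) are hypothetical, and the "flush" is only a flag so that the ordering can be checked.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_GATHER 8	/* hypothetical batch limit */

struct sk_page {
	int refcount;
};

struct sk_gather {
	struct sk_page *pages[SK_MAX_GATHER];
	size_t nr;
	int tlb_flushed;
};

/* Remember a page instead of dropping its reference right away. */
static int sk_gather_add(struct sk_gather *g, struct sk_page *page)
{
	assert(g != NULL && page != NULL);
	if (g->nr == SK_MAX_GATHER)
		return -1;
	g->pages[g->nr++] = page;
	return 0;
}

/* Stand-in for the real TLB flush. */
static void sk_flush_tlb(struct sk_gather *g)
{
	g->tlb_flushed = 1;
}

/* Flush first, then drop the references on everything gathered. */
static void sk_gather_finish(struct sk_gather *g)
{
	sk_flush_tlb(g);
	for (size_t i = 0; i < g->nr; i++) {
		assert(g->tlb_flushed);
		g->pages[i]->refcount--;
	}
	g->nr = 0;
}
```

Until `sk_gather_finish` runs, the gathered pages keep their references, so none of them can be handed out again while stale translations may still exist.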
     
  • Having min be a signed quantity means gcc can't turn high latency divides
    into shifts. There happen to be two such divides for GFP_ATOMIC (ie.
    networking, ie. important) allocations, one of which depends on the other.
    Fixing this makes code smaller as a bonus.

    Shame on somebody (probably me).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
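
A small sketch of the two dependent divides, assuming they take the form "subtract a half, then subtract a quarter of what is left". The function and parameter names are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower a watermark for high-priority callers; names are illustrative. */
static unsigned long lowered_min(unsigned long min, bool high, bool harder)
{
	if (high)
		min -= min / 2;	/* unsigned, so the compiler may use a shift */
	if (harder)
		min -= min / 4;	/* operates on the result of the line above */
	return min;
}

/* Why signedness matters: signed division truncates toward zero. */
static void signed_division_example(void)
{
	long v = -3;

	assert(v / 2 == -1);	/* a plain arithmetic shift would give -2 */
}
```

Because a signed divide by two has to round negative values toward zero, the compiler cannot replace it with a bare shift. With an unsigned `min` the two operations are identical.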
     
  • Fixes a kerneldoc error.

    Signed-off-by: Henrik Kretzschmar
    Cc: "Randy.Dunlap"
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Kretzschmar
     
  • kbuild explicitly includes this at build time.

    Signed-off-by: Dave Jones

    Dave Jones