01 Apr, 2006

5 commits

  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions and implement the functionality in a new
    sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is that we can
    move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: we can sync an fd which wasn't opened for writing. (The same is true
    of fsync() and fdatasync().)

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
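
    A minimal userspace sketch of the new syscall described in this entry. The
    file name and the 1 MiB range are hypothetical, and it assumes a libc that
    provides a sync_file_range() wrapper; on older libcs the same call can be
    made via syscall(__NR_sync_file_range, ...).

        /* Sketch: start asynchronous writeout of a dirty range, then wait
         * for that range to reach disk.  Hypothetical file and range. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("data.log", O_RDWR);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* Kick off writeback of the first 1 MiB without waiting. */
                if (sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE) < 0)
                        perror("sync_file_range (async write)");

                /* Later: wait until that range has actually hit the disk. */
                if (sync_file_range(fd, 0, 1 << 20,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                        perror("sync_file_range (wait)");

                close(fd);
                return 0;
        }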
     
  • The boot command line is parsed in parse_early_param() and in
    parse_args() with unknown_bootoption(). __setup() handlers are invoked
    from obsolete_checksetup():

    start_kernel()
      -> parse_args()
        -> unknown_bootoption()
          -> obsolete_checksetup()

    If a __setup() callback (->setup_func()) returns 1,
    obsolete_checksetup() treats the parameter as handled.

    If ->setup_func() returns 0, obsolete_checksetup() tries the other
    matching ->setup_func() handlers. If every ->setup_func() that matched
    the parameter returns 0, the parameter is added to argv_init[].

    Then, when /sbin/init (or the program given by init=) is run,
    argv_init[] is passed to it. If the program doesn't ignore those
    arguments, it may warn and exit.

    This patch fixes incorrect uses of the return value, but only the
    obvious ones.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
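
    A minimal sketch of an old-style __setup() handler; the "foo=" parameter
    and its variable are hypothetical. Returning 1 tells obsolete_checksetup()
    that the parameter was consumed; returning 0 lets it fall through to
    argv_init[] as described above.

        static int foo_value __initdata;

        /* Handler for a hypothetical "foo=<n>" boot parameter. */
        static int __init foo_setup(char *str)
        {
                get_option(&str, &foo_value);
                return 1;       /* handled: don't pass "foo=..." on to init */
        }
        __setup("foo=", foo_setup);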
     
  • With strict page reservation, the kernel should enforce that the number of
    free hugetlb pages never falls below the reserved count. Currently that can
    happen via the sysctl path. Add a proper check to the sysctl handler to
    disallow it.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
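
    A rough illustration of the kind of check the sysctl path gains; the helper
    below is made up for this note, not the actual patch.

        /* Illustrative only: when shrinking the hugetlb pool via sysctl,
         * never allow the pool to drop below the pages already reserved
         * for existing mappings. */
        static unsigned long clamp_huge_pages(unsigned long requested,
                                              unsigned long reserved)
        {
                return requested < reserved ? reserved : requested;
        }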
     
  • git commit d5d4b0aa4e1430d73050babba999365593bdb9d2
    ("[PATCH] optimize follow_hugetlb_page") breaks mlock on hugepage areas.

    I misinterpreted the pages argument and made get_page() unconditional. It
    should only take a reference when the "pages" argument is non-NULL.

    Credit goes to Adam Litke who spotted the bug.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
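
    The shape of the fix, sketched from the loop in follow_hugetlb_page()
    (variable names abbreviated): a reference is taken only when the caller
    asked for the pages array to be filled in.

        if (pages) {
                get_page(page);        /* ref only when pages[] is returned */
                pages[i] = page;
        }
        if (vmas)
                vmas[i] = vma;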
     
  • find_trylock_page() is an odd interface in that it doesn't take a reference
    like the others do. Now that XFS no longer uses it, and its last remaining
    caller actually wants an elevated refcount, open-code that callsite and
    schedule find_trylock_page() for removal.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
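
    A hedged sketch of what the open-coded replacement looks like when the
    caller wants a reference as well as the lock; trylock_page() is used here
    as the later spelling of the 2.6.16-era TestSetPageLocked().

        /* Take a reference first, then try for the lock; drop the
         * reference again if the page is already locked. */
        page = find_get_page(mapping, index);
        if (page && !trylock_page(page)) {
                page_cache_release(page);
                page = NULL;
        }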
     

29 Mar, 2006

2 commits


28 Mar, 2006

6 commits

  • The helper functions for for_each_online_pgdat()/for_each_zone() look too
    big to be inlined. The speed of these helpers is not very important (the
    loop bodies tend to do far more work than the iteration itself).

    This patch makes the helper functions out-of-line.

              inline      out-of-line
    .text     005c0680    005bf6a0

    005c0680 - 005bf6a0 = 0xFE0, i.e. about 4 Kbytes of text saved.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
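
    A simplified sketch of the now out-of-line helpers (the real versions live
    in mm/mmzone.c):

        struct pglist_data *first_online_pgdat(void)
        {
                return NODE_DATA(first_online_node);
        }

        struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
        {
                int nid = next_online_node(pgdat->node_id);

                if (nid == MAX_NUMNODES)
                        return NULL;
                return NODE_DATA(nid);
        }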
     
  • With for_each_online_pgdat() in use, pgdat_list is no longer necessary.
    This patch removes it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace for_each_pgdat() with for_each_online_pgdat().

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add a list_head to bootmem_data_t and link the bootmem descriptors through
    it; the list is kept sorted by node_boot_start.

    Only nodes for which init_bootmem() is called are linked into the list.
    (i386 allocates bootmem only from node 0, not from all online nodes.)

    A summary:
    1. for_each_online_pgdat() traverses all *online* nodes.
    2. alloc_bootmem() allocates memory only from initialized-for-bootmem nodes.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
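
    A hedged sketch of the sorted insertion that keeps the new list ordered by
    node_boot_start (names approximate the bootmem code of the time):

        /* Link a node's bootmem_data_t into the global list, keeping the
         * list sorted by node_boot_start. */
        static void link_bootmem(bootmem_data_t *bdata)
        {
                bootmem_data_t *ent;

                list_for_each_entry(ent, &bdata_list, list) {
                        if (bdata->node_boot_start < ent->node_boot_start) {
                                list_add_tail(&bdata->list, &ent->list);
                                return;
                        }
                }
                list_add_tail(&bdata->list, &bdata_list);
        }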
     
  • This patch removes zone_mem_map.

    pfn_to_page() uses the pgdat, while page_to_pfn() uses the zone and is the
    only remaining user of zone_mem_map. page_to_pfn() can use the pgdat
    instead, and with that change zone_mem_map can be removed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
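
    For the node-based case the conversion is plain pointer arithmetic against
    the node's mem_map rather than the zone's; a rough sketch of the idea, not
    the literal patch:

        /* page -> pfn via the pgdat instead of the zone. */
        #define page_to_pfn(pg)                                             \
        ({                                                                  \
                struct page *__pg = (pg);                                   \
                struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
                (unsigned long)(__pg - __pgdat->node_mem_map) +             \
                        __pgdat->node_start_pfn;                            \
        })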
     
  • There are three memory models: FLATMEM, DISCONTIGMEM and SPARSEMEM. Each
    arch has its own page_to_pfn() and pfn_to_page() for each model, but most
    of them can use the same arithmetic.

    This patch adds asm-generic/memory_model.h, which contains generic
    page_to_pfn() and pfn_to_page() definitions for each memory model.

    When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
    used instead of macros. This is enabled by some archs and reduces
    text size.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
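
    A hedged sketch of the FLATMEM case of the generic definitions; the macro
    names mirror asm-generic/memory_model.h, but this is a simplification (the
    real header also covers DISCONTIGMEM and SPARSEMEM and the
    CONFIG_OUT_OF_LINE_PFN_TO_PAGE variant, and accounts for the arch's first
    valid PFN):

        /* FLATMEM: a single global mem_map[], so pfn <-> page is plain
         * pointer arithmetic. */
        #define __pfn_to_page(pfn)   (mem_map + (pfn))
        #define __page_to_pfn(page)  ((unsigned long)((page) - mem_map))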
     

27 Mar, 2006

8 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
    Kconfig help: MTD_JEDECPROBE already supports Intel
    Remove ugly debugging stuff
    do_mounts.c: Minor ROOT_DEV comment cleanup
    BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
    BUG_ON() Conversion in mm/mempool.c
    BUG_ON() Conversion in mm/memory.c
    BUG_ON() Conversion in kernel/fork.c
    BUG_ON() Conversion in ipc/sem.c
    BUG_ON() Conversion in fs/ext2/
    BUG_ON() Conversion in fs/hfs/
    BUG_ON() Conversion in fs/dcache.c
    BUG_ON() Conversion in fs/buffer.c
    BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
    BUG_ON() Conversion in md/dm-table.c
    BUG_ON() Conversion in md/dm-path-selector.c
    BUG_ON() Conversion in drivers/isdn
    BUG_ON() Conversion in drivers/char
    BUG_ON() Conversion in drivers/mtd/

    Linus Torvalds
     
  • Add another allocator to the common mempool code: a kzalloc/kfree allocator

    This will be used by the next patch in the series to replace a mempool-backed
    kzalloc allocator. It is also very likely that there will be more users in
    the future.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
     
  • Add another allocator to the common mempool code: a kmalloc/kfree allocator

    This will be used by the next patch in the series to replace duplicate
    mempool-backed kmalloc allocators in several places in the kernel. It is also
    very likely that there will be more users in the future.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
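
    A minimal usage sketch of the kmalloc-backed mempool; the element size and
    pool depth below are arbitrary.

        /* A pool guaranteeing at least 4 preallocated 256-byte elements. */
        mempool_t *pool = mempool_create_kmalloc_pool(4, 256);
        void *elem;

        if (!pool)
                return -ENOMEM;

        elem = mempool_alloc(pool, GFP_KERNEL);
        /* ... use elem ... */
        mempool_free(elem, pool);
        mempool_destroy(pool);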
     
  • Convert two mempool users that currently use their own mempool-backed page
    allocators to use the generic mempool page allocator.

    Also included are 2 trivial whitespace fixes.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
     
  • This will be used by the next patch in the series to replace duplicate
    mempool-backed page allocators in 2 places in the kernel. It is also likely
    that there will be more users in the future.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
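
    And a matching sketch for the page-backed variant added here (order 0 and
    a two-page reserve chosen arbitrarily; the helper name follows the mempool
    API as it exists in the kernel):

        /* A pool that hands out single (order-0) pages. */
        mempool_t *page_pool = mempool_create_page_pool(2, 0);
        struct page *page;

        if (!page_pool)
                return -ENOMEM;

        page = mempool_alloc(page_pool, GFP_NOIO);
        /* ... use the page for I/O ... */
        mempool_free(page, page_pool);
        mempool_destroy(page_pool);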
     
  • Currently, get_user_pages() returns fully coherent pages to the kernel for
    anything other than anonymous pages. This is a problem for things like
    fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA
    to anonymous pages passed in by users.

    The fix is to add a new memory management API: flush_anon_page() which
    is used in get_user_pages() to make anonymous pages coherent.

    Signed-off-by: James Bottomley
    Cc: Russell King
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
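
    On architectures without cache aliasing the new hook is a no-op; a hedged
    sketch of the default definition and of the call site in get_user_pages()
    (the exact argument list changed over time; early versions took the page
    and the user virtual address):

        /* Default definition: nothing to do on non-aliasing caches. */
        #ifndef ARCH_HAS_FLUSH_ANON_PAGE
        static inline void flush_anon_page(struct page *page,
                                           unsigned long vmaddr)
        {
        }
        #endif

        /* In get_user_pages(), roughly: */
        if (pages) {
                pages[i] = page;
                flush_anon_page(page, start);  /* make anon pages coherent */
                flush_dcache_page(page);
        }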
     
  • This changes if () BUG(); constructs to BUG_ON(), which is
    cleaner, contains unlikely() and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
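
    The conversion itself is mechanical; a before/after sketch:

        /* Before: open-coded check. */
        if (entry == NULL)
                BUG();

        /* After: one line, with the condition wrapped in unlikely()
         * inside the BUG_ON() macro. */
        BUG_ON(entry == NULL);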
     
  • This changes if () BUG(); constructs to BUG_ON(), which is
    cleaner, contains unlikely() and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     

26 Mar, 2006

15 commits

  • This fixes problems with very large nodes (over 128GB) filling up all of
    the first 4GB with their mem_map and not leaving enough space for the
    swiotlb.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Hugh is rightly concerned that the CONFIG_DEBUG_VM coverage has gone too
    far in vm_normal_page, considering that we expect production kernels to be
    shipped with the option turned off, and that the code has been under some
    large changes recently.

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits)
    BUG_ON() Conversion in drivers/video/
    BUG_ON() Conversion in drivers/parisc/
    BUG_ON() Conversion in drivers/block/
    BUG_ON() Conversion in sound/sparc/cs4231.c
    BUG_ON() Conversion in drivers/s390/block/dasd.c
    BUG_ON() Conversion in lib/swiotlb.c
    BUG_ON() Conversion in kernel/cpu.c
    BUG_ON() Conversion in ipc/msg.c
    BUG_ON() Conversion in block/elevator.c
    BUG_ON() Conversion in fs/coda/
    BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
    BUG_ON() Conversion in input/serio/hil_mlc.c
    BUG_ON() Conversion in md/dm-hw-handler.c
    BUG_ON() Conversion in md/bitmap.c
    The comment describing how MS_ASYNC works in msync.c is confusing
    rcu: undeclared variable used in documentation
    fix typos "wich" -> "which"
    typo patch for fs/ufs/super.c
    Fix simple typos
    tabify drivers/char/Makefile
    ...

    Linus Torvalds
     
  • The "rounded up to nearest power of 2 in size" algorithm in
    alloc_large_system_hash is not correct. As coded, it takes an otherwise
    acceptable power-of-2 value and doubles it. For example, we see the error
    if we boot with thash_entries=2097152 which produces a hash table with
    4194304 entries.

    Signed-off-by: John Hawkes
    Cc: Roland Dreier
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hawkes
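
    A small illustration of the off-by-one: rounding up should stop as soon as
    the size already covers the request. The helper below is made up for this
    note, not the kernel's code.

        /* Round up to the nearest power of two without doubling a value
         * that is already a power of two. */
        static unsigned long round_up_pow2(unsigned long n)
        {
                unsigned long size = 1;

                while (size < n)        /* '<', not '<=' */
                        size <<= 1;
                return size;            /* round_up_pow2(2097152) == 2097152 */
        }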
     
  • A couple of places are forgetting to take it.

    The kswapd case is probably unimportant. keventd_create_kthread() was racy.

    The whole thing is a bit flaky: you start a kernel thread, get its pid from
    kernel_thread() then look up its task_struct.

    a) It assumes that pid recycling takes a "long" time.

    b) We get a task_struct but no reference was taken on it. The owner of the
    kswapd and kthread task_struct*'s must assume that the new thread won't
    exit unexpectedly. Because if it does, they're left holding dead memory
    and any attempt to control or stop that task will crash.

    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In zone_pcp_init we print out all zones even if they are empty:

    On node 0 totalpages: 245760
      DMA zone: 245760 pages, LIFO batch:31
      DMA32 zone: 0 pages, LIFO batch:0
      Normal zone: 0 pages, LIFO batch:0
      HighMem zone: 0 pages, LIFO batch:0

    To conserve dmesg space, print only the non-empty zones.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
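
    The change boils down to making the printk conditional; roughly:

        /* Only report zones that actually contain pages. */
        if (zone->present_pages)
                printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
                       zone->name, zone->present_pages, batch);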
     
  • The page migration code could function without NUMA but we currently have
    no users for the non-NUMA case.

    Signed-off-by: Christoph Lameter
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have had this memory leak for a while now. The situation is complicated
    by the use of alloc_kmemlist() as a function to resize various caches by
    do_tune_cpucache().

    What we do here is first of all make sure that we deallocate properly in
    the loop over all the nodes.

    If we are just resizing caches then we can simply return with -ENOMEM if an
    allocation fails.

    If the cache is new then we need to rollback and remove all earlier
    allocations.

    We detect that a cache is new by checking if the link to the global cache
    chain has been setup. This is a bit hackish ....

    (Also fix up the overly long lines that I added in the last patch.)

    Signed-off-by: Christoph Lameter
    Cc: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Inspired by Jesper Juhl's patch from today

    1. Get rid of err
    We do not set it to anything else but zero.

    2. Drop the CONFIG_NUMA stuff.
    There are definitions for alloc_alien_cache() and free_alien_cache()
    that do the right thing for the non-NUMA case.

    3. Better naming of variables.

    4. Remove redundant cachep->nodelists[node] expressions.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • __drain_alien_cache() currently drains objects by freeing them to the
    (remote) freelists of the original node. However, each node also has a
    shared list containing objects to be used on any processor of that node.
    We can avoid a number of remote node accesses by copying the pointers to
    the free objects directly into the remote shared array.

    And while we are at it: Skip alien draining if the alien cache spinlock is
    already taken.

    Kiran reported that this is a performance benefit.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • transfer_objects() can be used to transfer objects between the various
    object caches of the slab allocator. It is currently only used during
    __cache_alloc() to retrieve elements from the shared array. We will be
    using it soon to transfer elements from the alien caches to the remote
    shared array.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Convert mm/ to use the new kmem_cache_zalloc allocator.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • As suggested by Eric Dumazet, optimize kzalloc() calls that pass a
    compile-time constant size. Please note that the patch increases kernel
    text slightly (~200 bytes for defconfig on x86).

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Introduce a memory-zeroing variant of kmem_cache_alloc(). The allocator
    already exists in XFS, and there are potential users for it elsewhere, so
    this patch makes the allocator generally available.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
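
    A minimal usage sketch of the new allocator (the cache and struct names
    are hypothetical):

        /* Allocate a zeroed object from a slab cache. */
        struct foo *obj = kmem_cache_zalloc(foo_cachep, GFP_KERNEL);

        if (!obj)
                return -ENOMEM;
        /* every field of *obj starts out as zero */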
     
  • Implement /proc/slab_allocators. It produces output like:

    idr_layer_cache: 80 idr_pre_get+0x33/0x4e
    buffer_head: 2555 alloc_buffer_head+0x20/0x75
    mm_struct: 9 mm_alloc+0x1e/0x42
    mm_struct: 20 dup_mm+0x36/0x370
    vm_area_struct: 384 dup_mm+0x18f/0x370
    vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3
    vm_area_struct: 1 split_vma+0x5a/0x10e
    vm_area_struct: 11 do_brk+0x206/0x2e2
    vm_area_struct: 2 copy_vma+0xda/0x142
    vm_area_struct: 9 setup_arg_pages+0x99/0x214
    fs_cache: 8 copy_fs_struct+0x21/0x133
    fs_cache: 29 copy_process+0xf38/0x10e3
    files_cache: 30 alloc_files+0x1b/0xcf
    signal_cache: 81 copy_process+0xbaa/0x10e3
    sighand_cache: 77 copy_process+0xe65/0x10e3
    sighand_cache: 1 de_thread+0x4d/0x5f8
    anon_vma: 241 anon_vma_prepare+0xd9/0xf3
    size-2048: 1 add_sect_attrs+0x5f/0x145
    size-2048: 2 journal_init_revoke+0x99/0x302
    size-2048: 2 journal_init_revoke+0x137/0x302
    size-2048: 2 journal_init_inode+0xf9/0x1c4

    Cc: Manfred Spraul
    Cc: Alexander Nyberg
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Ravikiran Thirumalai
    Signed-off-by: Al Viro
    DESC
    slab-leaks3-locking-fix
    EDESC
    From: Andrew Morton

    Update for slab-remove-cachep-spinlock.patch

    Cc: Al Viro
    Cc: Manfred Spraul
    Cc: Alexander Nyberg
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     

25 Mar, 2006

1 commit


24 Mar, 2006

3 commits

  • This patch series creates a strndup_user() function to ease copying C
    strings from userspace. It also avoids common pitfalls such as userspace
    modifying the final '\0' after the strlen_user() check.

    Signed-off-by: Davi Arnaut
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davi Arnaut
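
    A minimal usage sketch (the buffer pointer and length cap are
    hypothetical); strndup_user() returns an ERR_PTR() on failure, so the
    result is checked with IS_ERR():

        /* Duplicate a NUL-terminated string handed in from userspace. */
        char *name = strndup_user(user_buf, PAGE_SIZE);

        if (IS_ERR(name))
                return PTR_ERR(name);
        /* ... use name ... */
        kfree(name);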
     
  • No need to duplicate all that code.

    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • msync() does a strange thing. Essentially:

    vma = find_vma();
    for ( ; ; ) {
            if (!vma)
                    return -ENOMEM;
            ...
            vma = vma->vm_next;
    }

    so an msync() request which starts within or before a valid VMA and which ends
    within or beyond the final VMA will incorrectly return -ENOMEM.

    Fix.

    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton