20 Jul, 2007

6 commits

  • Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).

    Here is a short excerpt of the semantic patch performing
    this transformation:

    @@
    type T2;
    expression x;
    identifier f,fld;
    expression E;
    expression E1,E2;
    expression e1,e2,e3,y;
    statement S;
    @@

    x =
    - kmalloc
    + kzalloc
    (E1,E2)
    ... when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
    - memset((T2)x,0,E1);

    @@
    expression E1,E2,E3;
    @@

    - kzalloc(E1 * E2,E3)
    + kcalloc(E1,E2,E3)
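
    The effect of the two rules can be sketched in userspace C. This is only an
    illustration under stated assumptions: malloc/calloc stand in for the kernel
    allocators, the GFP flags argument is dropped, and every function name below
    is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear by hand (the kmalloc + memset pattern). */
void *alloc_then_clear(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);
    return p;
}

/* After: a single zeroing allocation (userspace stand-in for kzalloc). */
void *zalloc_sketch(size_t size)
{
    return calloc(1, size);
}

/* Array form (stand-in for kcalloc): element count first, element size second. */
void *calloc_sketch(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;                /* n * size would overflow */
    return calloc(n, size);
}

/* Returns 1 if every byte of the buffer is zero, 0 otherwise. */
int all_zero(const void *p, size_t len)
{
    const unsigned char *b = p;
    size_t i;

    assert(p != NULL);
    for (i = 0; i < len; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

    All three allocators hand back zeroed memory; the array form additionally
    refuses a count/size pair whose product would overflow.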

    [akpm@linux-foundation.org: get kcalloc args the right way around]
    Signed-off-by: Yoann Padioleau
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Acked-by: Russell King
    Cc: Bryan Wu
    Acked-by: Jiri Slaby
    Cc: Dave Airlie
    Acked-by: Roland Dreier
    Cc: Jiri Kosina
    Acked-by: Dmitry Torokhov
    Cc: Benjamin Herrenschmidt
    Acked-by: Mauro Carvalho Chehab
    Acked-by: Pierre Ossman
    Cc: Jeff Garzik
    Cc: "David S. Miller"
    Acked-by: Greg KH
    Cc: James Bottomley
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoann Padioleau
     
  • This patch adds the documentation for /proc/<pid>/coredump_filter.

    Signed-off-by: Hidehiro Kawai
    Cc: Alan Cox
    Cc: David Howells
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kawai, Hidehiro
     
  • The purpose of audit_bprm() is to log the argv array to a userspace daemon at
    the end of the execve system call. Since user-space hasn't had time to run,
    this array is still in pristine state on the process' stack; so no need to
    copy it, we can just grab it from there.

    In order to minimize the damage to audit_log_*() copy each string into a
    temporary kernel buffer first.

    Currently the audit code requires that the full argument vector fits in a
    single packet. So currently it does clip the argv size to a (sysctl) limit,
    but only when execve auditing is enabled.

    If the audit protocol gets extended to allow for multiple packets this check
    can be removed.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ollie Wild
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Change ->fault prototype. We now return an int, which contains
    VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
    FAULT_RET_ code tells the VM whether a page was found, whether it has been
    locked, and potentially other things. This is not quite the way he wanted
    it yet, but that's changed in the next patch (which requires changes to
    arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
    that a page is locked which requires filemap_nopage to go away (because we
    can no longer remain backward compatible without that flag), but we were
    going to do that anyway.

    struct fault_data is renamed to struct vm_fault as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to use
    without really good reason.

    The page is now returned via a page pointer in the vm_fault struct.
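
    The two-byte layout of the return value can be sketched as follows. The
    constant values and helper names are hypothetical and chosen only for
    illustration; the real VM_FAULT_xxx and FAULT_RET_xxx values are defined in
    the kernel headers and are not reproduced here.

```c
#include <assert.h>

/* Hypothetical values, for illustration only. */
enum { SKETCH_VM_FAULT_MINOR = 0x01, SKETCH_VM_FAULT_OOM = 0x02 };   /* low byte  */
enum { SKETCH_RET_NOPAGE = 0x01, SKETCH_RET_LOCKED = 0x02 };         /* next byte */

/* Combine a fault code (low byte) and a return code (next byte) into one int. */
int pack_fault(int vm_fault_code, int fault_ret_code)
{
    assert((vm_fault_code & ~0xff) == 0);
    assert((fault_ret_code & ~0xff) == 0);
    return (fault_ret_code << 8) | vm_fault_code;
}

/* Extract the low byte: the VM_FAULT_xxx part. */
int fault_code(int ret)
{
    return ret & 0xff;
}

/* Extract the next byte: the FAULT_RET_xxx part. */
int fault_ret(int ret)
{
    return (ret >> 8) & 0xff;
}
```

    A handler would return the packed value, and the caller would split it
    again with the two accessors.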

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There seems to be very little documentation about this callback in general.
    The locking in particular is a bit tricky, so it's worth having this in
    writing.

    Signed-off-by: Mark Fasheh
    Cc: Nick Piggin
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should not need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place,
    so there is a lot of duplication for now; nice cleanups become possible once
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.
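
    For a linear mapping, the pgoff that the VM passes down can be derived from
    the faulting address alone. A minimal sketch, assuming 4 KiB pages and using
    a simplified stand-in for the vma structure (the struct and function names
    are hypothetical):

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/* Simplified stand-in for the fields of a vma that matter here. */
struct vma_sketch {
    unsigned long vm_start;     /* first virtual address of the mapping */
    unsigned long vm_end;       /* first address past the mapping */
    unsigned long vm_pgoff;     /* file offset of vm_start, in pages */
};

/* Page offset into the file for a faulting address in a linear mapping. */
unsigned long linear_pgoff(const struct vma_sketch *vma, unsigned long address)
{
    assert(address >= vma->vm_start && address < vma->vm_end);
    return ((address - vma->vm_start) >> SKETCH_PAGE_SHIFT) + vma->vm_pgoff;
}
```

    A nonlinear mapping encodes the offset differently, which is why it makes
    sense for the VM, not the filesystem, to work out pgoff and hand it to the
    handler.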

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

2 commits

  • Signed-off-by: Josef 'Jeff' Sipek
    Acked-by: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • This patch adds the kernelcore= parameter for x86.

    Once all patches are applied, a new command-line parameter and a new
    sysctl exist. This patch adds the necessary documentation.

    From: Yasunori Goto

    When the "kernelcore" boot option is specified, the kernel can't boot on
    ia64 because of an infinite loop. In addition, the parsing code can be
    handled in an architecture-independent manner.

    This patch uses common code to handle the kernelcore= parameter. It is
    only available to architectures that support arch-independent zone-sizing
    (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
    ignore the boot parameter.
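
    Architecture-independent parsing of a size argument such as kernelcore=512M
    might look like the sketch below. It is not taken from the patch: the
    function names are hypothetical and the accepted suffixes (K, M, G) are an
    assumption, loosely in the spirit of the kernel's memparse() helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a size such as "512M" or "2G" into bytes; returns 0 on malformed input. */
unsigned long long parse_size_sketch(const char *s)
{
    char *end;
    unsigned long long v;

    assert(s != NULL);
    v = strtoull(s, &end, 0);
    if (end == s)
        return 0;                       /* no digits at all */

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;                       /* fall through */
    case 'M': case 'm':
        v <<= 10;                       /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;
        break;
    default:
        break;
    }
    return *end == '\0' ? v : 0;        /* reject trailing junk */
}

/* Convert a byte count into a page count for a given page shift. */
unsigned long long size_to_pages(unsigned long long bytes, unsigned int page_shift)
{
    return bytes >> page_shift;
}
```

    With 4 KiB pages, parse_size_sketch("512M") followed by size_to_pages(..., 12)
    yields 131072 pages.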

    [bunk@stusta.de: make cmdline_parse_kernelcore() static]
    Signed-off-by: Mel Gorman
    Signed-off-by: Yasunori Goto
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

4 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits)
    [PATCH] ocfs2: zero_user_page conversion
    ocfs2: Support xfs style space reservation ioctls
    ocfs2: support for removing file regions
    ocfs2: update truncate handling of partial clusters
    ocfs2: btree support for removal of arbitrary extents
    ocfs2: Support creation of unwritten extents
    ocfs2: support writing of unwritten extents
    ocfs2: small cleanup of ocfs2_write_begin_nolock()
    ocfs2: btree changes for unwritten extents
    ocfs2: abstract btree growing calls
    ocfs2: use all extent block suballocators
    ocfs2: plug truncate into cached dealloc routines
    ocfs2: simplify deallocation locking
    ocfs2: harden buffer check during mapping of page blocks
    ocfs2: shared writeable mmap
    ocfs2: factor out write aops into nolock variants
    ocfs2: rework ocfs2_buffered_write_cluster()
    ocfs2: take ip_alloc_sem during entire truncate
    ocfs2: Add "preferred slot" mount option
    [KJ PATCH] Replacing memset(,0,PAGE_SIZE) with clear_page() in fs/ocfs2/dlm/dlmrecovery.c
    ...

    Linus Torvalds
     
  • Update Documentation/filesystems/vfs.txt

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Update the description of struct file_system_type and get_sb() in
    Documentation/filesystems/vfs.txt to match the current code.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Documentation for the /proc/$pid/stat file.

    Signed-off-by: Kees Cook
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

11 Jul, 2007

3 commits

  • Sometimes other drivers depend on particular configfs items. For
    example, ocfs2 mounts depend on a heartbeat region item. If that
    region item is removed with rmdir(2), the ocfs2 mount must BUG or go
    readonly. Not happy.

    This provides two additional API calls: configfs_depend_item() and
    configfs_undepend_item(). A client driver can call
    configfs_depend_item() on an existing item to tell configfs that it is
    depended on. configfs will then return -EBUSY from rmdir(2) for that
    item. When the item is no longer depended on, the client driver calls
    configfs_undepend_item() on it.

    These APIs cannot be called underneath any configfs callbacks, as
    they will conflict. They can block and allocate. A client driver
    probably shouldn't be calling them of its own gumption. Rather it should
    be providing an API that external subsystems call.

    How does this work? Imagine the ocfs2 mount process. When it mounts,
    it asks for a heartbeat region item. This is done via a call into the
    heartbeat code. Inside the heartbeat code, the region item is looked
    up. Here, the heartbeat code calls configfs_depend_item(). If it
    succeeds, then heartbeat knows the region is safe to give to ocfs2.
    If it fails, it was being torn down anyway, and heartbeat can gracefully
    pass up an error.
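
    The depend/undepend behaviour described above can be modelled in a few lines
    of userspace C. This is a toy model, not the configfs API: the struct, the
    function names and the -ENOENT result for an already removed item are all
    assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of one configfs item. */
struct item_sketch {
    int dependents;     /* number of outstanding depend calls */
    int removed;        /* set once rmdir has succeeded */
};

/* Pin the item; fails if it has already been torn down. */
int depend_item_sketch(struct item_sketch *it)
{
    if (it->removed)
        return -ENOENT;             /* assumption: error code chosen for the sketch */
    it->dependents++;
    return 0;
}

/* Drop one pin taken by depend_item_sketch(). */
void undepend_item_sketch(struct item_sketch *it)
{
    assert(it->dependents > 0);
    it->dependents--;
}

/* rmdir(2) on the item: refused with -EBUSY while anyone depends on it. */
int rmdir_sketch(struct item_sketch *it)
{
    if (it->dependents > 0)
        return -EBUSY;
    it->removed = 1;
    return 0;
}
```

    In the ocfs2 scenario, the heartbeat code would take the pin before handing
    the region to the mount and drop it again at unmount.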

    [ Fixed some bad whitespace in configfs.txt. --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Add a notification callback, ops->disconnect_notify(). It has the same
    prototype as ->drop_item(), but it will be called just before the item
    linkage is broken. This way, configfs users who want to do work while
    the object is still in the hierarchy have a chance.

    Client drivers will still need to config_item_put() in their
    ->drop_item(), if they implement it. They need do nothing in
    ->disconnect_notify(). They don't have to provide it if they don't
    care. But someone who wants to be notified before ci_parent is set to
    NULL can now be notified.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Convert the su_sem member of struct configfs_subsystem to a struct
    mutex, as that's what it is. Also convert all the users and update
    Documentation/configfs.txt and Documentation/configfs_example.c
    accordingly.

    [ Conflict in fs/dlm/config.c with commit
    3168b0780d06ace875696f8a648d04d6089654e5 manually resolved. --Mark ]

    Inspired-by: Satyam Sharma
    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     

09 Jun, 2007

1 commit

  • Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
    offline node, crashes as soon as data is allocated upon it. Now restrict mpol=
    to online nodes, where before it was restricted only to MAX_NUMNODES.

    Signed-off-by: Hugh Dickins
    Cc: Robin Holt
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Tested-and-acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2007

1 commit


09 May, 2007

7 commits

  • This patch substitutes i_sem by i_mutex in
    Documentation/filesystems/Locking.
    The patch also removes a couple of trailing white-spaces.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Adrian Bunk

    Artem Bityutskiy
     
  • Fix various typos in kernel docs and Kconfigs, 2.6.21-rc4.

    Signed-off-by: Matt LaPlante
    Signed-off-by: Adrian Bunk

    Matt LaPlante
     
  • Signed-off-by: Randy Dunlap
    Signed-off-by: Adrian Bunk

    Randy Dunlap
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
    JFS: Fix race waking up jfsIO kernel thread
    JFS: use __set_current_state()
    Copy i_flags to jfs inode flags on write
    JFS: document uid, gid, and umask mount options in jfs.txt

    Linus Torvalds
     
  • It seems that recent versions of Windows changed the specification, and the
    change is undocumented. Windows doesn't update ->free_clusters correctly.

    This patch stops using ->free_clusters by default. (The "usefree" mount
    option forces its use instead.)

    Signed-off-by: OGAWA Hirofumi
    Cc: Juergen Beisert
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • 1) Introduces a new method in 'struct dentry_operations'. This method
    called d_dname() might be called from d_path() to build a pathname for
    special filesystems. It is called without locks.

    Future patches (if we succeed in having one common dentry for all
    pipes/sockets) may need to change prototype of this method, but we now
    use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);

    2) Adds a dynamic_dname() helper function that eases d_dname() implementations

    3) Defines d_dname method for sockets : No more sprintf() at socket
    creation. This is delayed up to the moment someone does an access to
    /proc/pid/fd/...

    4) Defines d_dname method for pipes : No more sprintf() at pipe
    creation. This is delayed up to the moment someone does an access to
    /proc/pid/fd/...

    A benchmark consisting of 1,000,000 calls to pipe()/close()/close() gives a
    *nice* speedup on my Pentium(M) 1.6 GHz:

    3.090 s instead of 3.450 s
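
    The idea of building the name only when someone asks for it can be sketched
    in userspace C. The helper below is a hypothetical analogue of a
    dynamic_dname()-style function, not the kernel code: it formats the name
    into a temporary buffer and copies it to the tail of the caller's buffer,
    returning NULL (an assumption of this sketch) when the name does not fit.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a name on demand and place it at the end of the caller's buffer.
 * Returns a pointer into buffer, or NULL if the name does not fit.
 */
char *dname_sketch(char *buffer, int buflen, const char *fmt, ...)
{
    char temp[64];
    va_list args;
    int sz;

    assert(buffer != NULL && buflen > 0);

    va_start(args, fmt);
    sz = vsnprintf(temp, sizeof(temp), fmt, args) + 1;  /* count the NUL too */
    va_end(args);

    if (sz <= 0 || sz > (int)sizeof(temp) || sz > buflen)
        return NULL;

    return memcpy(buffer + buflen - sz, temp, (size_t)sz);
}
```

    Nothing is formatted at creation time; the cost is paid only by a reader
    that actually wants the name, which is where the benchmark gain comes from.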

    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Hellwig
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
    information about the memory location and usage of processes. Issues:

    - maps should not be world-readable, especially if programs expect any
    kind of ASLR protection from local attackers.
    - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
    check the maps when %n is in a *printf call, and a setuid(getuid())
    process wouldn't be able to read its own maps file. (For reference
    see http://lkml.org/lkml/2006/1/22/150)
    - a system-wide toggle is needed to allow prior behavior in the case of
    non-root applications that depend on access to the maps contents.

    This change implements a check using "ptrace_may_attach" before allowing
    access to read the maps contents. To control this protection, the new knob
    /proc/sys/kernel/maps_protect has been added, with corresponding updates to
    the procfs documentation.

    [akpm@linux-foundation.org: build fixes]
    [akpm@linux-foundation.org: New sysctl numbers are old hat]
    Signed-off-by: Kees Cook
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

08 May, 2007

1 commit

  • Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
    pte_mkold() and ClearPageReferenced() are called for each pte and its
    corresponding page, respectively, in that task's VMAs. This file is only
    writable by the user who owns the task.

    It is now possible to measure _approximately_ how much memory a task is using
    by clearing the reference bits with

    echo 1 > /proc/pid/clear_refs

    and checking the reference count for each VMA from the /proc/pid/smaps output
    at a measured time interval. For example, to observe the approximate change
    in memory footprint for a task, write a script that clears the references
    (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
    extracts the size in kB. Add the sizes for each VMA together for the total
    referenced footprint. Moments later, repeat the process and observe the
    difference.

    For example, using an efficient Mozilla:

    accumulated time    referenced memory
    ----------------    -----------------
                 0 s               408 kB
                 1 s               408 kB
                 2 s               556 kB
                 3 s              1028 kB
                 4 s               872 kB
                 5 s              1956 kB
                 6 s               416 kB
                 7 s              1560 kB
                 8 s              2336 kB
                 9 s              1044 kB
                10 s               416 kB

    This is a valuable tool to get an approximate measurement of the memory
    footprint for a task.
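
    The "add the sizes for each VMA together" step can be sketched in C as a
    small parser over smaps-style text. The function name is hypothetical, and
    the field label is passed in as a parameter because the exact label in the
    smaps output is treated here as an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sum the kB values of every line in smaps-style text that starts with key. */
unsigned long sum_kb_sketch(const char *text, const char *key)
{
    unsigned long total = 0;
    size_t klen;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (*text) {
        unsigned long kb;

        /* Only lines that begin with the key contribute to the total. */
        if (strncmp(text, key, klen) == 0 &&
            sscanf(text + klen, " %lu kB", &kb) == 1)
            total += kb;

        text = strchr(text, '\n');      /* advance to the next line */
        if (!text)
            break;
        text++;
    }
    return total;
}
```

    A measurement loop would write 1 to /proc/pid/clear_refs, sleep, read
    /proc/pid/smaps into a buffer, and pass that buffer to the function above.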

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    [akpm@linux-foundation.org: build fixes]
    [mpm@selenic.com: rename for_each_pmd]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Apr, 2007

1 commit


26 Apr, 2007

1 commit


10 Mar, 2007

1 commit


05 Mar, 2007

1 commit


21 Feb, 2007

1 commit

  • simple_prepare_write leaks uninitialised kernel data. This happens because
    it leaves an uninitialised "hole" over the part of the page that the
    write is expected to go to. This is fine, but it then marks the page
    uptodate, which means a concurrent read can come in and copy the
    uninitialised memory into userspace before it is written to.

    Fix it by simply marking it uptodate in simple_commit_write instead, after
    the hole has been filled in. This could theoretically break an fs that
    uses simple_prepare_write and not simple_commit_write, and that relies on
    the incorrect simple_prepare_write behaviour. Luckily, none of those
    exists in the tree.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Feb, 2007

1 commit


19 Feb, 2007

1 commit

  • While caching is generally frowned upon in the 9p world, it has its
    place -- particularly in situations where the remote file system is
    exclusive and/or read-only. The vacfs views of venti content addressable
    store are a real-world instance of such a situation. To facilitate higher
    performance for these workloads (and eventually use the fscache patches),
    we have enabled a "loose" cache mode which does not attempt to maintain
    any form of consistency on the page-cache or dcache. This results in over
    two orders of magnitude performance improvement for cacheable block reads
    in the Bonnie benchmark. The more aggressive use of the dcache also seems
    to improve metadata operational performance.

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     

18 Feb, 2007

1 commit


13 Feb, 2007

1 commit

  • This series of patches adds UFS2 write support. UFS2 is the default file
    system for recent versions of FreeBSD.

    From the point of view of write support, the main differences from UFS1
    are:
    1) Not all inodes are allocated when the disk is formatted.
    2) All metadata (pointers to data blocks) is 64-bit (in UFS1 it is
    32-bit).

    So the patch series consists of:
    1) making it possible to mount UFS2 in read-write mode
    2) code to write UFS2 inodes and code to initialize inode chunks
    3) support for 64-bit metadata

    I did simple testing such as creating/deleting/writing/reading/truncating; I
    also ran fsx-linux and untarred and built a kernel on UFS1 and UFS2, after
    which FreeBSD fsck did not find any errors in the file system.

    This patch makes it possible to mount UFS2 "rw" and updates the UFS2
    documentation: it removes the note about the bug (fixed by the "reallocate
    blocks on the fly" patch) and adds me to the list of people who want to
    receive bug reports.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     

12 Feb, 2007

1 commit

  • Mathieu originally needed to add this for tracing Xen, but it's something
    that's needed for any application that can be tracing while cpus are added.

    unplug isn't supported by this patch. The thought was that at minimum a new
    buffer needs to be added when a cpu comes up, but it wasn't worth the effort
    to remove buffers on cpu down since they'd be freed soon anyway when the
    channel was closed.

    [zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock]
    Signed-off-by: Mathieu Desnoyers
    Cc: Tom Zanussi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

27 Jan, 2007

1 commit


18 Jan, 2007

1 commit


12 Jan, 2007

1 commit

  • NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however, leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.

    Signed-off-by: Trond Myklebust

    Also, the bare SetPageDirty() can skew all sorts of accounting, leading to
    other nasties.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Peter Zijlstra
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

31 Dec, 2006

1 commit


14 Dec, 2006

1 commit