30 Nov, 2009

2 commits


24 Sep, 2009

3 commits

  • * git://git.infradead.org/mtd-2.6: (58 commits)
    mtd: jedec_probe: add PSD4256G6V id
    mtd: OneNand support for Nomadik 8815 SoC (on NHK8815 board)
    mtd: nand: driver for Nomadik 8815 SoC (on NHK8815 board)
    m25p80: Add Spansion S25FL129P serial flashes
    jffs2: Use SLAB_HWCACHE_ALIGN for jffs2_raw_{dirent,inode} slabs
    mtd: sh_flctl: register sh_flctl using platform_driver_probe()
    mtd: nand: txx9ndfmc: transfer 512 byte at a time if possible
    mtd: nand: fix tmio_nand ecc correction
    mtd: nand: add __nand_correct_data helper function
    mtd: cfi_cmdset_0002: add 0xFF intolerance for M29W128G
    mtd: inftl: fix fold chain block number
    mtd: jedec: fix compilation problem with I28F640C3B definition
    mtd: nand: fix ECC Correction bug for SMC ordering for NDFC driver
    mtd: ofpart: Check availability of reg property instead of name property
    driver/Makefile: Initialize "mtd" and "spi" before "net"
    mtd: omap: adding DMA mode support in nand prefetch/post-write
    mtd: omap: add support for nand prefetch-read and post-write
    mtd: add nand support for w90p910 (v2)
    mtd: maps: add mtd-ram support to physmap_of
    mtd: pxa3xx_nand: add single-bit error corrections reporting
    ...

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (85 commits)
    ocfs2: Use buffer IO if we are appending a file.
    ocfs2: add spinlock protection when dealing with lockres->purge.
    dlmglue.c: add missed mlog lines
    ocfs2: __ocfs2_abort() should not enable panic for local mounts
    ocfs2: Add ioctl for reflink.
    ocfs2: Enable refcount tree support.
    ocfs2: Implement ocfs2_reflink.
    ocfs2: Add preserve to reflink.
    ocfs2: Create reflinked file in orphan dir.
    ocfs2: Use proper parameter for some inode operation.
    ocfs2: Make transaction extend more efficient.
    ocfs2: Don't merge in 1st refcount ops of reflink.
    ocfs2: Modify removing xattr process for refcount.
    ocfs2: Add reflink support for xattr.
    ocfs2: Create an xattr indexed block if needed.
    ocfs2: Call refcount tree remove process properly.
    ocfs2: Attach xattr clusters to refcount tree.
    ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
    ocfs2: Abstract the creation of xattr block.
    ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
    ...

    Linus Torvalds
     
  • For this system call user space passes a signed long length parameter,
    while the kernel side takes an unsigned long parameter and converts it
    later to signed long again.

    This has led to bugs in compat wrappers; see e.g. dd90bbd5 "powerpc: Add
    compat_sys_truncate". The s390 compat wrapper for this function is
    broken as well since it also performs zero extension instead of sign
    extension for the length parameter.

    In addition, if hpa comes up with an automated way of generating
    compat wrappers it would generate a wrong one here.

    So change the length parameter from unsigned long to long.

    Cc: "H. Peter Anvin"
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Heiko Carstens
    Signed-off-by: Linus Torvalds

    Heiko Carstens
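The difference between zero extension and sign extension described in the commit above can be shown with a minimal userspace sketch. The helper names `widen_zero_extend()` and `widen_sign_extend()` are hypothetical and are not the kernel's compat wrappers; they only show what happens to a negative 32-bit length when it is widened to 64 bits.

```c
#include <stdint.h>

/* Hypothetical helpers: widen a 32-bit compat length to 64 bits. */

/* Zero extension: the upper 32 bits are cleared, so -1 becomes 4294967295. */
int64_t widen_zero_extend(int32_t len)
{
	return (int64_t)(uint32_t)len;
}

/* Sign extension: the sign bit is copied upward, so -1 stays -1. */
int64_t widen_sign_extend(int32_t len)
{
	return (int64_t)len;
}
```

For non-negative lengths both helpers agree. Only a negative length distinguishes them: zero extension turns it into a large positive value, so it no longer looks negative to a check such as `length < 0`.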
     

23 Sep, 2009

35 commits

  • Unlike on most other architectures ino_t is an unsigned int on s390. So
    add an explicit cast to avoid this compile warning:

    fs/ext2/namei.c: In function 'ext2_lookup':
    fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t'

    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
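A minimal userspace sketch of the cast described above. `format_ino()` is a hypothetical helper, not the ext2 code; it only shows the pattern of casting `ino_t` to `unsigned long` so that it matches a `%lu` conversion regardless of how the architecture defines `ino_t`.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format an inode number portably. ino_t may be unsigned int (as on
 * s390, per the commit above) or a wider type, so cast it explicitly
 * to the type that "%lu" expects. Hypothetical helper for illustration.
 */
int format_ino(char *buf, size_t len, ino_t ino)
{
	return snprintf(buf, len, "%lu", (unsigned long)ino);
}
```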
     
  • There are a few places in the Minix FS code where the "inode" field of a
    minix_dir_entry is used without checking first to see if the dirent is
    really a minix3_dir_entry. The inode number in a V1/V2 dirent is 16 bits,
    whereas that in a V3 dirent is 32 bits.

    Accessing it as a 16 bit field when it really should be accessed as a 32
    bit field probably kinda sorta works on a little-endian machine, but leads
    to some rather odd behaviour on big-endian machines.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Doug Graham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Graham
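Why the 16-bit access "kinda sorta works" on little-endian machines can be sketched in userspace. The functions below are illustrative assumptions, not Minix FS code: they read a 16-bit or 32-bit value from a byte buffer in an explicitly chosen byte order, so the effect of reading only the first two bytes of a 32-bit inode number can be compared for both layouts.

```c
#include <stdint.h>

/* Read a 16-bit value from two bytes in the given byte order. */
uint16_t read16(const unsigned char *p, int big_endian)
{
	if (big_endian)
		return (uint16_t)((p[0] << 8) | p[1]);
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit value from four bytes in the given byte order. */
uint32_t read32(const unsigned char *p, int big_endian)
{
	if (big_endian)
		return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With inode number 5 stored as four bytes, the little-endian layout is `05 00 00 00`, so a 16-bit read of the first two bytes still yields 5. The big-endian layout is `00 00 00 05`, and the same 16-bit read yields 0.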
     
  • We want to check for s_inode's existence, not inode's one (inode is always
    valid in this function).

    This takes care of the following entry from Dan's list:

    fs/ncpfs/ioctl.c +445 __ncp_ioctl(180) warning: variable derefenced before check 'inode'

    Reported-by: Dan Carpenter
    Cc: Julia Lawall
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Cc: Petr Vandrovec
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • This function uses signed integers for the unix_date and local variables -
    if a negative number is supplied and the leap-year condition is not met,
    month will be 0, leading to a later read of day_n[-1]

    Signed-off-by: Roel Kluin
    Cc: Petr Vandrovec
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
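The out-of-bounds read described above can be sketched as follows. This is a simplified reconstruction under stated assumptions: the `day_n` table and the search loop are modelled on the description in the commit message and are not copied from the kernel source, and `month_index_is_safe()` is a hypothetical helper.

```c
/* Cumulative day counts at the start of each month (non-leap year). */
static const int day_n[] = {
	0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 0, 0, 0, 0
};

/*
 * Simplified search: return the first month index whose start day
 * exceeds nl_day. A caller would then index day_n[month - 1].
 */
int find_month(int nl_day)
{
	int month;

	for (month = 0; month < 12; month++)
		if (day_n[month] > nl_day)
			break;
	return month;
}

/* Hypothetical guard: would day_n[month - 1] be in bounds? */
int month_index_is_safe(int nl_day)
{
	return find_month(nl_day) >= 1;
}
```

With a signed, negative `nl_day` the very first comparison (`0 > nl_day`) succeeds, so the loop stops at month 0 and the later `day_n[month - 1]` access reads `day_n[-1]`.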
     
  • initramfs userspace likes to use this magic number.

    Cc: Hugh Dickins
    Signed-off-by: maximilian attems
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    maximilian attems
     
  • After memory hotplug (or other events in future), kcore size can be
    modified.

    To update inode->i_size, we have to know inode/dentry but we can't get it
    from inside /proc directly. But considering memory hotplug, kcore image
    is updated only when it's opened. Then, updating inode->i_size at open()
    is enough.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently the size of /proc/kcore which can be read by 'ls -l' is 0. But
    it's not the correct value.

    On x86-64, ls -l shows
    ... root root 140737486266368 2009-09-17 10:29 /proc/kcore
    That is 0x7FFFFFE02000 in hex. This comes from vmalloc area's size.
    (*) This shows "core" size, not memory size.

    This patch shows the size by updating "size" field in struct
    proc_dir_entry. Later, lookup routine will create inode and fill
    inode->i_size based on this value. Then, this has a problem.

    - Once inode is cached, inode->i_size will never be updated.

    Then, this patch is not memory-hotplug-aware.

    To update inode->i_size, we have to know dentry or inode.
    But there is no way to look them up from inside the kernel. Hmmm....
    The next patch will try it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • proc_kcore_init() doesn't check the NULL case. Fix it and remove
    unnecessary comments.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Some archs define MODULES_VADDR/MODULES_END which is not in VMALLOC area.
    This is handled only in x86-64. This patch makes it more generic. And we
    can use vread/vwrite to access the area. Fix it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Benjamin Herrenschmidt pointed out that vmemmap
    range is not included in KCORE_RAM, KCORE_VMALLOC ....

    This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. By this, vmemmap
    can be readable via /proc/kcore

    Because it's not vmalloc area, vread/vwrite cannot be used. But since the
    range is static with respect to the memory layout, this patch handles the
    vmemmap area by the same scheme as physical memory.

    This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range. It's
    correct now.

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Benjamin Herrenschmidt
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For /proc/kcore, each arch registers its memory range by kclist_add().
    Usually,

    - range of physical memory
    - range of vmalloc area
    - text, etc...

    are registered, but "range of physical memory" has some troubles. It
    isn't updated at memory hotplug and it tends to include unnecessary
    memory holes. Now, /proc/iomem (kernel/resource.c) includes the required
    physical memory range information and it's properly updated at memory
    hotplug. So it's good to avoid using its own code (duplicating
    information) and to rebuild kclist for physical memory based on
    /proc/iomem.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Some 64-bit archs have a special segment for mapping kernel text. It
    should be added to /proc/kcore in addition to the direct-linear-map and
    vmalloc areas. This patch unifies the KCORE_TEXT entries scattered under
    x86 and ia64.

    I'm not familiar with other archs (mips has its own even after this patch)
    but if the range [_stext ..._end) is a valid area of text and it's not in
    the direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is the only
    necessary thing to do.

    Note: I left mips as it is now.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For /proc/kcore, vmalloc areas are registered per arch. But all of them
    register the same range of [VMALLOC_START...VMALLOC_END). This patch
    unifies them. With this, archs which have no kclist_add() hooks can see
    the vmalloc area correctly.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, kclist_add() only takes a start address and size as its
    arguments. To make kclist dynamically reconfigurable, it's necessary to
    know which kclists are for System RAM and which are not.

    This patch adds kclist types as
    KCORE_RAM
    KCORE_VMALLOC
    KCORE_TEXT
    KCORE_OTHER

    This "type" is used in a patch following this for detecting KCORE_RAM.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
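A minimal userspace sketch of how a per-entry type lets later code pick out the System RAM ranges. Only the four type names come from the commit message above. The enum values, the simplified `struct kcore_entry`, and `count_ram_entries()` are assumptions for illustration and do not reproduce the kernel's kclist structure.

```c
#include <stddef.h>

/* Type names from the commit message; the numeric values are assumed. */
enum kcore_type {
	KCORE_RAM,
	KCORE_VMALLOC,
	KCORE_TEXT,
	KCORE_OTHER
};

/* Hypothetical, simplified list entry (the kernel structure differs). */
struct kcore_entry {
	unsigned long addr;
	size_t size;
	enum kcore_type type;
};

/* Count the entries that describe System RAM. */
size_t count_ram_entries(const struct kcore_entry *list, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (list[i].type == KCORE_RAM)
			count++;
	return count;
}
```

Tagging each range at registration time means a later rebuild can drop and recreate only the KCORE_RAM entries while leaving the other types untouched.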
     
  • This patchset is for /proc/kcore. With this,

    - many per-arch hooks are removed.

    - /proc/kcore will know really valid physical memory area.

    - /proc/kcore will be aware of memory hotplug.

    - /proc/kcore will be architecture independent i.e.
    if an arch supports CONFIG_MMU, it can use /proc/kcore.
    (if the arch uses usual memory layout.)

    This patch:

    /proc/kcore uses its own list handling code. It's better to use
    generic list code.

    No changes in logic. Just clean up.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A patch to give a better overview of the userland application stack usage,
    especially for embedded linux.

    Currently you are only able to dump the main process/thread stack usage
    which is shown in /proc/pid/status by the "VmStk" value. But you get no
    information about the consumed stack memory of the threads.

    There is an enhancement in the /proc//{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc//task//maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc//{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed stack base address in /proc//{task/*,}/stat to the base
    address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • Remove obfuscated zero-length input check and return -EINVAL instead of
    -EIO error to make the error message clear to user. Add whitespace
    stripping. No functionality changes.

    The old code:

    echo 1 > /proc/pid/make-it-fail (ok)
    echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error)

    The new code:

    echo 1 > /proc/pid/make-it-fail (ok)
    echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument)

    This patch is conservative in its changes so as not to break existing
    scripts/applications.

    Signed-off-by: Vincent Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
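The input handling described above (strip whitespace, reject empty or trailing-garbage input with -EINVAL) can be sketched in userspace. `parse_strict_long()` is a hypothetical helper built on the standard `strtol()`. It is not the kernel's parser, and the in-kernel code uses its own helpers.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal integer from user input, ignoring leading and
 * trailing whitespace. Returns 0 on success, -EINVAL on empty or
 * malformed input. Hypothetical userspace helper for illustration.
 */
int parse_strict_long(const char *input, long *out)
{
	char *end;
	long val;

	/* Skip leading whitespace and reject empty input. */
	while (isspace((unsigned char)*input))
		input++;
	if (*input == '\0')
		return -EINVAL;

	errno = 0;
	val = strtol(input, &end, 10);
	if (end == input || errno == ERANGE)
		return -EINVAL;

	/* Only whitespace may follow the number. */
	while (isspace((unsigned char)*end))
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = val;
	return 0;
}
```

With this helper, `"1"` and `" 1\n"` are accepted, while `"1foo"` and an empty string are rejected with -EINVAL, matching the "Invalid argument" behaviour shown in the new-code example above.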
     
  • Andrew Morton pointed out similar string hacking and an obfuscated check
    for zero-length input at the end of the function. David Rientjes
    suggested using strict_strtol to replace simple_strtol. This patch covers
    the above suggestions and adds removal of leading and trailing whitespace
    from user input. It does not change the function's behaviour.

    Signed-off-by: Vincent Li
    Acked-by: David Rientjes
    Cc: Matt Mackall
    Cc: Amerigo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • In 9063c61fd5cbd ("x86, 64-bit: Clean up user address masking") Linus
    fixed the wrong size of /proc/kcore problem.

    But its size still looks insane, since it never equals the size of
    physical memory.

    Signed-off-by: WANG Cong
    Cc: "Eric W. Biederman"
    Cc: Tao Ma
    Cc:
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • The exiting sub-thread flushes /proc/pid only, but this doesn't buy too
    much: ps and friends mostly use /proc/tid/task/pid.

    Remove "if (thread_group_leader())" checks from proc_flush_task() path,
    this means we always remove /proc/tid/task/pid dentry on exit, and this
    actually matches the comment above proc_flush_task().

    The test-case:

    static void* tfunc(void *arg)
    {
    char name[256];

    sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid());
    close(open(name, O_RDONLY));

    return NULL;
    }

    int main(void)
    {
    pthread_t t;

    for (;;) {
    if (!pthread_create(&t, NULL, &tfunc, NULL))
    pthread_join(t, NULL);
    }
    }

    slabtop shows that pid/proc_inode_cache/etc grow quickly and
    "indefinitely" until the task is killed or shrink_slab() is called, not
    good. And the main thread needs a lot of time to exit.

    The same can happen if something like "ps -efL" runs continuously, while
    some application spawns short-living threads.

    Reported-by: "James M. Leddy"
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Dominic Duval
    Cc: Frank Hirtz
    Cc: "Fuller, Johnray"
    Cc: Larry Woodman
    Cc: Paul Batkowski
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • /proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit
    used in kernel/posix-cpu-timers.c:

    unsigned long psecs = cputime_to_secs(ptime);
    ...
    if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) {
    ...
    __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);

    Signed-off-by: Kees Cook
    Acked-by: WANG Cong
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
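From userspace, the same limit can be read with `getrlimit(2)`, which also expresses RLIMIT_CPU in seconds. The sketch below is illustrative only: `describe_cpu_limit()` and `current_cpu_limit()` are hypothetical helpers and are unrelated to the /proc/$pid/limits implementation.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Describe a CPU-time limit in seconds, the unit RLIMIT_CPU uses.
 * Hypothetical helper for illustration.
 */
int describe_cpu_limit(rlim_t limit, char *buf, size_t len)
{
	if (limit == RLIM_INFINITY)
		return snprintf(buf, len, "unlimited");
	return snprintf(buf, len, "%llu seconds", (unsigned long long)limit);
}

/* Read the calling process's soft RLIMIT_CPU; returns 0 on success. */
int current_cpu_limit(rlim_t *soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_CPU, &rl) != 0)
		return -1;
	*soft = rl.rlim_cur;
	return 0;
}
```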
     
  • Make ->ru_maxrss value in struct rusage filled according to the rss
    hiwater mark. This struct is filled as a parameter to getrusage syscall.
    ->ru_maxrss value is set to KBs which is the way it is done in BSD
    systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
    which seems to be incorrect behavior. The maintainer of this util was
    notified by me, with the patch which corrects it, and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss which we use to store rss hiwater of the task. The
    second one is ->cmaxrss which we use to store the highest rss hiwater of
    all the task's children. These values are used in k_getrusage() to
    actually fill ->ru_maxrss. k_getrusage() uses the current rss hiwater
    value directly if the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior. *BSD getrusage has exec() inheriting.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
    int status;

    printf("allocate 100MB\n");
    consume(100);

    printf("testcase1: fork inherit? \n");
    printf(" expect: initial.self ~= child.self\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase2: fork inherit? (cont.) \n");
    printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("child");
    _exit(0);
    }
    printf("\n");

    printf("testcase3: fork + malloc \n");
    printf(" expect: child.self ~= initial.self + 50MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    printf("allocate +50MB\n");
    consume(50);
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase4: grandchild maxrss\n");
    printf(" expect: post_wait.children ~= 300MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 0 -g 300");
    _exit(0);
    }
    printf("\n");

    printf("testcase5: zombie\n");
    printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
    printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
    show_rusage("initial");
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("pre_wait");
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 400");
    _exit(0);
    }
    printf("\n");

    printf("testcase6: SIG_IGN\n");
    printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
    show_rusage("initial");
    signal(SIGCHLD, SIG_IGN);
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("after_zombie");
    } else {
    system("./child -n 500");
    _exit(0);
    }
    printf("\n");
    signal(SIGCHLD, SIG_DFL);

    printf("testcase7: exec (without fork) \n");
    printf(" expect: initial ~= exec \n");
    show_rusage("initial");
    execl("./child", "child", "-v", NULL);

    return 0;
    }

    child.c
    =======
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    int main(int argc, char** argv)
    {
    int status;
    int c;
    long consume_size = 0;
    long grandchild_consume_size = 0;
    int show = 0;

    while ((c = getopt(argc, argv, "n:g:v")) != -1) {
    switch (c) {
    case 'n':
    consume_size = atol(optarg);
    break;
    case 'v':
    show = 1;
    break;
    case 'g':

    grandchild_consume_size = atol(optarg);
    break;
    default:
    break;
    }
    }

    if (show)
    show_rusage("exec");

    if (consume_size) {
    printf("child alloc %ldMB\n", consume_size);
    consume(consume_size);
    }

    if (grandchild_consume_size) {
    if (fork()) {
    wait(&status);
    } else {
    printf("grandchild alloc %ldMB\n", grandchild_consume_size);
    consume(grandchild_consume_size);

    exit(0);
    }
    }

    return 0;
    }

    common.c
    ========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
    int err, err2;
    struct rusage rusage_self;
    struct rusage rusage_children;

    printf("%s: ", prefix);
    err = getrusage(RUSAGE_SELF, &rusage_self);
    if (!err)
    printf("self %ld ", rusage_self.ru_maxrss);
    err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
    if (!err2)
    printf("children %ld ", rusage_children.ru_maxrss);

    printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
    void *addr;
    int size = getpagesize();
    int i;

    for (i=0; i
    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
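A minimal sketch of reading the value this commit fills in, using only the standard `getrusage(2)` call. `peak_rss_kb()` is a hypothetical wrapper; the unit (kilobytes) follows the commit message above.

```c
#include <sys/resource.h>

/*
 * Return the peak resident set size from getrusage(2) for the given
 * target (RUSAGE_SELF or RUSAGE_CHILDREN), or -1 on error. Per the
 * commit message above, the value is reported in kilobytes.
 */
long peak_rss_kb(int who)
{
	struct rusage ru;

	if (getrusage(who, &ru) != 0)
		return -1;
	return ru.ru_maxrss;
}
```

Calling `peak_rss_kb(RUSAGE_SELF)` reports the caller's own high-water mark. `peak_rss_kb(RUSAGE_CHILDREN)` reports the largest value among children that have been waited for, which is what the zombie and SIG_IGN test cases above exercise.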
     
  • Compat utimensat() returns EINVAL when the tv_nsec is one of UTIME_OMIT or
    UTIME_NOW and the tv_sec is set to non-zero. As per man pages, the tv_sec
    field should be ignored.

    sys_utimensat() works fine in this case.

    Test case:

    #define _GNU_SOURCE
    #define _ATFILE_SOURCE
    #include
    #include
    #include
    #include
    #include

    main(int argc, char *argv[])
    {
    struct timespec ts[2];
    struct timespec *tsp;

    if (argc < 2) {
    fprintf(stderr, "Usage : %s filename\n", argv[0]);
    exit (-1);
    }

    ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW;
    ts[0].tv_sec = ts[1].tv_sec = 1;

    tsp = ts;

    if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1)
    perror("utimensat");
    else
    fprintf(stdout, "utimensat success\n");
    return 0;
    }
    mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64
    mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32
    mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
    utimensat: Invalid argument
    mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
    utimensat success
    mjs22lp5:~ # uname -r
    2.6.31-rc8

    With the patch :

    mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
    utimensat success
    mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
    utimensat success
    mjs22lp5:~ # uname -r
    2.6.31-rc8utimensat

    Signed-off-by: Suzuki K P
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki Poulose
     
  • qnx4 write support has never been fully implemented, has been broken
    since the dawn of time and hasn't been actively developed since before
    git history started.

    Instead of letting it bitrot further and complicate API transitions (like
    the new truncate code), remove it.

    Signed-off-by: Christoph Hellwig
    Cc: Anders Larsen
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • do_sync_write() does the right thing for turning the aio_writev method
    into a normal non-vectored synchronous write, no need to duplicate it in
    ntfs.

    Signed-off-by: Christoph Hellwig
    Acked-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Split the anonfd interface into a bare file pointer creation one, and a
    file pointer creation plus install one.

    There are cases, like the usage of eventfds inside other kernel
    interfaces, where the file pointer created by anonfd needs to be used
    inside the initialization of other structures.

    As it is right now, as soon as anon_inode_getfd() returns, the kernel can
    race with userspace closing the newly installed file descriptor.

    This patch, while keeping the old anon_inode_getfd(), introduces a new
    anon_inode_getfile() (whose services are reused in anon_inode_getfd())
    that allows splitting the file creation phase from the fd install one.

    Once all the kernel structures are initialized, the code can call the
    proper fd_install().

    Gregory manifested the need for something like this inside KVM.

    Signed-off-by: Davide Libenzi
    Cc: Alexander Viro
    Cc: James Morris
    Cc: Peter Zijlstra
    Cc: Gregory Haskins
    Acked-by: Serge Hallyn
    Acked-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • As mentioned in Documentation/CodingStyle, move EXPORT* macros
    to the line immediately after the closing function brace line.

    Also, move the __initcall() similarly.

    Signed-off-by: H Hartley Sweeten
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • According to Documentation/CodingStyle the EXPORT* macro should follow
    immediately after the closing function brace line.

    Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
    elsewhere so they should be marked as static.

    In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
    that file.

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • We have had a report of bad memory allocation latency during DVD-RAM (UDF)
    writing. This is causing the user's desktop session to become unusable.

    Jan tracked the cause of this down to UDF inode reclaim blocking:

    gnome-screens D ffff810006d1d598 0 20686 1
    ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
    ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
    ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
    Call Trace:
    [] io_schedule+0x63/0xa5
    [] sync_buffer+0x3b/0x3f
    [] __wait_on_bit+0x47/0x79
    [] out_of_line_wait_on_bit+0x6a/0x77
    [] __wait_on_buffer+0x1f/0x21
    [] __bread+0x70/0x86
    [] :udf:udf_tread+0x38/0x3a
    [] :udf:udf_update_inode+0x4d/0x68c
    [] :udf:udf_write_inode+0x1d/0x2b
    [] __writeback_single_inode+0x1c0/0x394
    [] write_inode_now+0x7d/0xc4
    [] :udf:udf_clear_inode+0x3d/0x53
    [] clear_inode+0xc2/0x11b
    [] dispose_list+0x5b/0x102
    [] shrink_icache_memory+0x1dd/0x213
    [] shrink_slab+0xe3/0x158
    [] try_to_free_pages+0x177/0x232
    [] __alloc_pages+0x1fa/0x392
    [] alloc_page_vma+0x176/0x189
    [] __do_fault+0x10c/0x417
    [] handle_mm_fault+0x466/0x940
    [] do_page_fault+0x676/0xabf

    This blocks with iprune_mutex held, which then blocks other reclaimers:

    X D ffff81009d47c400 0 17285 14831
    ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
    ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
    ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
    Call Trace:
    [] __mutex_lock_slowpath+0x72/0xa9
    [] mutex_lock+0x1e/0x22
    [] shrink_icache_memory+0x49/0x213
    [] shrink_slab+0xe3/0x158
    [] try_to_free_pages+0x177/0x232
    [] __alloc_pages+0x1fa/0x392
    [] alloc_pages_current+0xd1/0xd6
    [] __get_free_pages+0xe/0x4d
    [] __pollwait+0x5e/0xdf
    [] :nvidia:nv_kern_poll+0x2e/0x73
    [] do_select+0x308/0x506
    [] core_sys_select+0x1a6/0x254
    [] sys_select+0xb5/0x157

    Now I think the main problem is having the filesystem block (and do IO) in
    inode reclaim. The problem is that this doesn't get accounted well and
    penalizes a random allocator with a big latency spike caused by work
    generated from elsewhere.

    I think the best idea would be to avoid this. By design if possible, or
    by deferring the hard work to an asynchronous context. If the latter,
    then the fs would probably want to throttle creation of new work with
    queue size of the deferred work, but let's not get into those details.

    Anyway, the other obvious thing we looked at is the iprune_mutex which is
    causing the cascading blocking. We could turn this into an rwsem to
    improve concurrency. It is unreasonable to totally ban all potentially
    slow or blocking operations in inode reclaim, so I think this is a cheap
    way to get a small improvement.

    This doesn't solve the whole problem of course. The process doing inode
    reclaim will still take the latency hit, and concurrent processes may end
    up contending on filesystem locks. So fs developers should keep these
    problems in mind.

    Signed-off-by: Nick Piggin
    Cc: Jan Kara
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Make all seq_operations structs const, to help mitigate against
    revectoring user-triggerable function pointers.

    This is derived from the grsecurity patch, although generated from scratch
    because it's simpler than extracting the changes from there.

    Signed-off-by: James Morris
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Move various magic-number definitions into magic.h.

    Signed-off-by: Nick Black
    Acked-by: Pekka Enberg
    Cc: Al Viro
    Cc: "David S. Miller"
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Black
     
  • __estimate_accuracy() was prone to integer overflow, for example if *tv ==
    {2147, 483648000} on a 32 bit computer (or even for delays as small as
    {429, 500000000} if the task is niced).

    Because the result was already forced between 0 and 100ms, the effect of
    the overflow was not too problematic, but the use of the hrtimer range
    feature was not optimal in overflow cases.

    This patch ensures that there can not be an integer overflow in this
    function.

    Signed-off-by: Guillaume Knispel
    Cc: Alexander Viro
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Heiko Carstens
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillaume Knispel
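The overflow values quoted above can be checked with a small worked example. The formula below is an assumption: it supposes the slack was computed as `tv_nsec / divfactor + tv_sec * (NSEC_PER_SEC / divfactor)`, with an assumed divisor of 1000 normally and 200 for a niced task. These divisors are consistent with the two examples in the commit message but are not taken from it. `slack_wide()` and `fits_in_long32()` are hypothetical helpers.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Assumed slack formula, evaluated in 64 bits so the true value is
 * visible: tv_nsec / divfactor + tv_sec * (NSEC_PER_SEC / divfactor).
 */
int64_t slack_wide(int64_t tv_sec, int64_t tv_nsec, int64_t divfactor)
{
	return tv_nsec / divfactor + tv_sec * (NSEC_PER_SEC / divfactor);
}

/* Does the value fit in a 32-bit signed long? */
int fits_in_long32(int64_t value)
{
	return value >= INT32_MIN && value <= INT32_MAX;
}
```

Under these assumptions, {2147, 483648000} with a divisor of 1000 gives 2147483648, which is exactly one more than the largest 32-bit signed value. {429, 500000000} with a divisor of 200 gives 2147500000, which also does not fit.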
     
  • This function uses signed integers for the unix_date and local variables -
    if a negative number is supplied and the leap-year condition is not met,
    month will be 0, leading to a read of day_n[-1]

    Signed-off-by: Roel Kluin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • In ocfs2_file_aio_write, we prevent direct io if
    we find that we are appending (changing i_size) and call
    generic_file_aio_write_nolock. But the O_DIRECT flag
    is still set and this function will eventually call
    generic_file_direct_write, which will update i_size and leave
    di->i_size alone. The bug is
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.

    So this patch lets ocfs2_direct_IO return 0 directly if we
    are appending so that buffered write will be called and
    di->i_size gets updated successfully. And this is also
    what we want in ocfs2_file_aio_write.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • When we check/modify lockres->purge, we should hold lockres->spinlock.
    In dlm_purge_lockres(), the checking/modifying is not done under that
    protection. This patch fixes it.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang