01 Nov, 2011

2 commits

  • Some kernel components (infiniband and perf) pin user space memory by
    increasing the page count, and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockall'ed process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some memory
    was accounted for twice: once when the page was mlocked and once when
    the Infiniband layer increased the refcount because it needed to pin
    the RDMA memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.
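
    A hedged sketch of the shape of the change (field placement and call
    site simplified from the actual patch):

    struct mm_struct {
            /* ... */
            unsigned long locked_vm;  /* mlock()ed pages, PG_mlocked */
            unsigned long pinned_vm;  /* new: pages pinned via elevated refcount */
            /* ... */
    };

    /* A pinning site such as the Infiniband umem code then accounts
     * against the new counter instead of locked_vm: */
    down_write(&current->mm->mmap_sem);
    current->mm->pinned_vm += npages;       /* was: locked_vm += npages */
    up_write(&current->mm->mmap_sem);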

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
    use walk_page_range() instead of custom page table walking code").

    Reported-by: Stephen Hemminger
    Tested-by: Stephen Hemminger
    Reviewed-by: Stephen Wilson
    Cc: KOSAKI Motohiro
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

22 Sep, 2011

3 commits

  • This is modeled after the smaps code.

    It detects transparent hugepages and then does a single gather_stats()
    for the page as a whole. This has two benefits:
    1. It is more efficient since it does many pages in a single shot.
    2. It does not have to break down the huge page.
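
    A rough sketch of the pmd-level handling (locking and the
    splitting-in-progress case omitted; can_gather_numa_stats() is the
    helper broken out in the next commit below):

    /* in gather_pte_stats(), before the per-pte loop: */
    if (pmd_trans_huge(*pmd)) {
            pte_t huge_pte = *(pte_t *)pmd;
            struct page *page;

            page = can_gather_numa_stats(huge_pte, md->vma, addr);
            if (page)
                    gather_stats(page, md, pte_dirty(huge_pte),
                                 HPAGE_PMD_SIZE / PAGE_SIZE);
            return 0;       /* whole huge page accounted in one shot */
    }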

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • gather_pte_stats() does a number of checks on a target page to see
    whether it should even be considered for statistics. This breaks that
    code out into a separate function so that we can use it in the
    transparent hugepage case in the next patch.
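
    Roughly, the checks move into a helper of this shape (a sketch, not
    the verbatim hunk):

    static struct page *can_gather_numa_stats(pte_t pte,
                    struct vm_area_struct *vma, unsigned long addr)
    {
            struct page *page;

            if (!pte_present(pte))
                    return NULL;
            page = vm_normal_page(vma, addr, pte);
            if (!page || PageReserved(page))
                    return NULL;
            if (!node_isset(page_to_nid(page), node_states[N_HIGH_MEMORY]))
                    return NULL;
            return page;            /* page is worth counting */
    }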

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We need to teach the numa_maps code about transparent huge pages. The
    first step is to teach gather_stats() that the pte it is dealing with
    might represent more than one page.

    Note that we will use this in a moment for transparent huge pages,
    since they use a single pmd_t which _acts_ as a "surrogate" for a
    bunch of smaller pte_t's.

    I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
    for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
    figure out how many _bytes_ "dirty=1" means, you must first know the
    hugetlbfs page size. That's easier said than done, especially if you
    don't have visibility into the mount.

    But, that's probably a discussion for another day especially since it
    would change behavior to fix it. But, just in case anyone wonders why
    this patch only passes a '1' in the hugetlb case...
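
    An abridged sketch of the widened interface: each counter bump scales
    by the number of base pages behind the entry:

    static void gather_stats(struct page *page, struct numa_maps *md,
                             int pte_dirty, unsigned long nr_pages)
    {
            md->pages += nr_pages;
            if (pte_dirty || PageDirty(page))
                    md->dirty += nr_pages;  /* '1' per hugetlbfs page, as noted */
            if (PageSwapCache(page))
                    md->swapcache += nr_pages;
            /* ... the remaining counters scale the same way ... */
            md->node[page_to_nid(page)] += nr_pages;
    }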

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

27 May, 2011

3 commits

  • Currently, pagemap_read() has three error and/or corner case handling
    mistakes:

    (1) If the ppos parameter is wrong, the mm refcount will leak.
    (2) If the count parameter is 0, the mm refcount will leak too.
    (3) If the current task is sleeping in kmalloc() and the system is
    out of memory, and the oom-killer kills the associated task, the mm
    refcount prevents that task from freeing its memory, and the system
    may hang.
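
    The fix amounts to doing the parameter checks and the buffer
    allocation before taking a reference on the target mm; a condensed
    sketch of the reordered function:

    /* validate before pinning anything */
    ret = -EINVAL;
    if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
            goto out_task;

    ret = 0;
    if (!count)                     /* nothing to do, nothing leaked */
            goto out_task;

    pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
    ret = -ENOMEM;
    if (!pm.buffer)
            goto out_task;

    mm = get_task_mm(task);         /* only now take the mm reference */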

    Cc: Hugh Dickins
    Cc: Jovi Zhang
    Acked-by: Hugh Dickins
    Cc: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Convert fs/proc/ from strict_strto*() to kstrto*() functions.
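
    An illustrative call site (not a specific hunk from the patch):

    unsigned long val;
    -       err = strict_strtoul(buffer, 0, &val);
    +       err = kstrtoul(buffer, 0, &val);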

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The type of vma->vm_flags is 'unsigned long', not 'int' or
    'unsigned int'. This patch fixes such misuse.
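
    The typedef from Linus's note below, plus the kind of signature it
    cleans up (the helper shown is hypothetical):

    typedef unsigned long vm_flags_t;

    /* passing vm_flags as a plain int silently truncates the upper
     * 32 bits on 64-bit kernels: */
    static void show_vma_flags(vm_flags_t flags);   /* was: int flags */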

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

2 commits

  • In show_numa_map() we collect statistics into a numa_maps structure.
    Since the number of NUMA nodes can be very large, this structure is not a
    candidate for stack allocation.

    Instead of going through a kmalloc()+kfree() cycle each time
    show_numa_map() is invoked, perform the allocation just once when
    /proc/pid/numa_maps is opened.

    Performing the allocation when numa_maps is opened, and thus before a
    reference to the target task's mm is taken, eliminates a potential
    stalemate condition in the oom-killer as originally described by Hugh
    Dickins:

    ... imagine what happens if the system is out of memory, and the mm
    we're looking at is selected for killing by the OOM killer: while
    we wait in __get_free_page for more memory, no memory is freed
    from the selected mm because it cannot reach exit_mmap while we hold
    that reference.
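
    A sketch of the open-time allocation (error handling abridged):

    struct numa_maps_private {
            struct proc_maps_private proc_maps;
            struct numa_maps md;            /* large: one counter per node */
    };

    static int numa_maps_open(struct inode *inode, struct file *file)
    {
            struct numa_maps_private *priv = kzalloc(sizeof(*priv), GFP_KERNEL);
            int ret = -ENOMEM;

            if (priv) {
                    priv->proc_maps.pid = proc_pid(inode);
                    ret = seq_open(file, &proc_pid_numa_maps_op);
                    if (!ret) {
                            struct seq_file *m = file->private_data;
                            m->private = priv;
                    } else {
                            kfree(priv);
                    }
            }
            return ret;
    }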

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
    issues.

    - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

    - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

    - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     

10 May, 2011

1 commit

  • The Linux kernel excludes the guard page when performing mlock on a
    VMA with a down-growing stack. However, some architectures have an
    up-growing stack and locking the guard page should be excluded in this
    case too.

    This patch fixes lvm2 on PA-RISC (and possibly other architectures
    with an up-growing stack). lvm2 calculates the number of used pages
    when locking and when unlocking and reports an internal error if the
    numbers mismatch.

    [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
    grows-up case, and to move things around a bit to clean it all up and
    share the infrastructure with the /proc bits.

    Tested on ia64 that has both grow-up and grow-down segments - Linus ]
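
    A sketch of the now-symmetric treatment, using the helper names from
    kernels of that era:

    /* when sizing an mlock()/munlock() range or a /proc/<pid>/maps entry: */
    if (stack_guard_page_start(vma, start)) /* grows-down: guard below */
            start += PAGE_SIZE;
    if (stack_guard_page_end(vma, end))     /* grows-up: guard above */
            end -= PAGE_SIZE;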

    Signed-off-by: Mikulas Patocka
    Tested-by: Tony Luck
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

28 Mar, 2011

1 commit

  • When m_start returns an error, the seq_file logic will still call m_stop
    with that error entry, so we'd better make sure that we check it before
    using it as a vma.

    Introduced by commit ec6fd8a4355c ("report errors in /proc/*/*map*
    sanely"), which replaced NULL with various ERR_PTR() cases.

    (On ia64, you happen to get an unaligned fault instead of a page
    fault, since the address used is generally some random error code
    like -EPERM.)
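
    The fix, roughly:

    static void m_stop(struct seq_file *m, void *v)
    {
            struct proc_maps_private *priv = m->private;
            struct vm_area_struct *vma = v;

            if (!IS_ERR(vma))       /* m_start may hand us ERR_PTR(...) */
                    vma_stop(priv, vma);
            if (priv->task)
                    put_task_struct(priv->task);
    }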

    Reported-by: Anca Emanuel
    Reported-by: Tony Luck
    Cc: Al Viro
    Cc: Américo Wang
    Cc: Stephen Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Mar, 2011

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    deal with races in /proc/*/{syscall,stack,personality}
    proc: enable writing to /proc/pid/mem
    proc: make check_mem_permission() return an mm_struct on success
    proc: hold cred_guard_mutex in check_mem_permission()
    proc: disable mem_write after exec
    mm: implement access_remote_vm
    mm: factor out main logic of access_process_vm
    mm: use mm_struct to resolve gate vma's in __get_user_pages
    mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
    mm: arch: make in_gate_area take an mm_struct instead of a task_struct
    mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
    x86: mark associated mm when running a task in 32 bit compatibility mode
    x86: add context tag to mark mm when running a task in 32-bit compatibility mode
    auxv: require the target to be tracable (or yourself)
    close race in /proc/*/environ
    report errors in /proc/*/*map* sanely
    pagemap: close races with suid execve
    make sessionid permissions in /proc/*/task/* match those in /proc/*
    fix leaks in path_lookupat()

    Fix up trivial conflicts in fs/proc/base.c

    Linus Torvalds
     
  • The current code fails to print the "[heap]" marking if the heap is
    split into multiple mappings.

    Fix the check so that the marking is displayed in all possible cases
    (see the sketch after this list):
    1. vma matches exactly the heap
    2. the heap vma is merged e.g. with bss
    3. the heap vma is split e.g. due to locked pages
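
    The check goes from "the vma contains the whole brk range" to "the
    vma overlaps the brk range"; roughly:

    /* before: only a vma covering all of [start_brk, brk) got the tag */
    if (vma->vm_start <= mm->start_brk && vma->vm_end >= mm->brk)
            name = "[heap]";

    /* after: any vma overlapping [start_brk, brk) is heap */
    if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk)
            name = "[heap]";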

    Test cases. In all cases, the process should have mapping(s) with
    [heap] marking:

    (1) vma matches exactly the heap

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main (void)
    {
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test1
    check /proc/553/maps
    [1] + Stopped ./test1
    # cat /proc/553/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3113640 /test1
    00010000-00011000 rw-p 00000000 01:00 3113640 /test1
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    4006f000-40070000 rw-p 00000000 00:00 0

    (2) the heap vma is merged

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    char foo[4096] = "foo";
    char bar[4096];

    int main (void)
    {
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test2
    check /proc/556/maps
    [2] + Stopped ./test2
    # cat /proc/556/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3116312 /test2
    00010000-00012000 rw-p 00000000 01:00 3116312 /test2
    00012000-00014000 rw-p 00000000 00:00 0 [heap]
    4004a000-4004b000 rw-p 00000000 00:00 0

    (3) the heap vma is split (this fails without the patch)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    int main (void)
    {
            if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
                (sbrk(4096) != (void *)-1)) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test3
    check /proc/559/maps
    [1] + Stopped ./test3
    # cat /proc/559/maps|head -4
    00008000-00009000 r-xp 00000000 01:00 3119108 /test3
    00010000-00011000 rw-p 00000000 01:00 3119108 /test3
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    00012000-00013000 rw-p 00000000 00:00 0 [heap]

    It looks like the bug has been there forever, and since it only results in
    some information missing from a procfile, it does not fulfil the -stable
    "critical issue" criteria.

    Signed-off-by: Aaro Koskinen
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaro Koskinen
     
  • Morally, the presence of a gate vma is more an attribute of a particular mm than
    a particular task. Moreover, dropping the dependency on task_struct will help
    make both existing and future operations on mm's more flexible and convenient.

    Signed-off-by: Stephen Wilson
    Reviewed-by: Michel Lespinasse
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Al Viro

    Stephen Wilson
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • just use mm_for_maps()

    Signed-off-by: Al Viro

    Al Viro
     

23 Mar, 2011

5 commits

  • Now that the mere act of _looking_ at /proc/$pid/smaps will not destroy
    transparent huge pages, tell how much of the VMA is actually mapped with
    them.

    This way, we can make sure that we're getting THPs where we
    expect to see them.
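
    With this, a mapping's smaps output grows a line like the following
    (value illustrative):

    AnonHugePages:      4096 kB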

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • This adds code to explicitly detect and handle pmd_trans_huge() pmds. It
    then passes HPAGE_SIZE units in to the smap_pte_entry() function instead
    of PAGE_SIZE.

    This means that using /proc/$pid/smaps now will no longer cause THPs
    to be broken down into small pages.
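
    A condensed sketch of the pmd-level path (page_table_lock handling
    and the splitting-in-progress case omitted):

    /* in smaps_pte_range(): */
    if (pmd_trans_huge(*pmd)) {
            smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
            mss->anonymous_thp += HPAGE_SIZE;
            return 0;       /* whole huge page accounted, never split */
    }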

    Signed-off-by: Dave Hansen
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Acked-by: Andrea Arcangeli
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Add an argument to the new smaps_pte_entry() function to let it account in
    things other than PAGE_SIZE units. I changed all of the PAGE_SIZE sites,
    even though not all of them can be reached for transparent huge pages,
    just so this will continue to work without changes as THPs are improved.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We will use smaps_pte_entry() in a moment to handle both small and
    transparent large pages. But, we must break it out of smaps_pte_range()
    first.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it
    will unconditionally split any transparent huge pages it runs into.
    In practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and
    depend on khugepaged to re-collapse it later. This is fairly
    suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.
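
    A handler that still wants 4k ptes now opts in explicitly; a sketch
    using one of the five handlers:

    static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
                                    unsigned long end, struct mm_walk *walk)
    {
            split_huge_page_pmd(walk->mm, pmd); /* was done by the walker */
            /* ... walk the individual ptes as before ... */
            return 0;
    }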

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

14 Jan, 2011

2 commits

  • Currently there is no way to find out whether a process has locked its
    pages in memory, nor which of its memory regions are locked.

    Add a new field "Locked" to export this information via the smaps file.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Balbir Singh
    Acked-by: Wu Fengguang
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
  • /proc/*/statm code needlessly truncates data from unsigned long to int.
    One needs only 8+ TB of RAM to make the truncation visible: at 4 KiB
    per page, 8 TiB is 2^31 pages, which already overflows a 32-bit int.

    Signed-off-by: Alexey Dobriyan
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

25 Nov, 2010

1 commit

  • Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
    pages.) But there is a corner case where walk_pmd_range() accidentally
    runs over a VMA associated with a hugetlbfs file.

    For example, when a process has mappings to VMAs as shown below:

    # cat /proc/<pid>/maps
    ...
    3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
    7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
    7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
    7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614 /hugepages/test

    then pagemap_read() goes into walk_pmd_range() path and walks in the range
    0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
    walk_hugetlb_range(). Otherwise PMD for the hugepage is considered bad
    and cleared, which causes undesirable results.

    This patch fixes it by separating pagemap walk range into one PMD.
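
    The walk chunk is clamped to a PMD boundary, so one walk_page_range()
    call can no longer straddle into a hugetlbfs VMA; roughly:

    #define PAGEMAP_WALK_SIZE       (PMD_SIZE)
    #define PAGEMAP_WALK_MASK       (PMD_MASK)

    end = (start_vaddr + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
    if (end < start_vaddr || end > end_vaddr)       /* overflow? */
            end = end_vaddr;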

    Signed-off-by: Naoya Horiguchi
    Cc: Jun'ichi Nomura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

28 Oct, 2010

1 commit

  • Export the number of anonymous pages in a mapping via smaps.

    Even the private pages in a mapping backed by a file are marked as
    anonymous when they are modified. Export this information to
    user-space via smaps.

    Exporting this count will help gdb to make a better decision on which
    areas need to be dumped in its coredump; and should be useful to others
    studying the memory usage of a process.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     

23 Oct, 2010

1 commit

  • * 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
    vfs: make no_llseek the default
    vfs: don't use BKL in default_llseek
    llseek: automatically add .llseek fop
    libfs: use generic_file_llseek for simple_attr
    mac80211: disallow seeks in minstrel debug code
    lirc: make chardev nonseekable
    viotape: use noop_llseek
    raw: use explicit llseek file operations
    ibmasmfs: use generic_file_llseek
    spufs: use llseek in all file operations
    arm/omap: use generic_file_llseek in iommu_debug
    lkdtm: use generic_file_llseek in debugfs
    net/wireless: use generic_file_llseek in debugfs
    drm: use noop_llseek

    Linus Torvalds
     

15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek,
    seq_lseek and default_llseek. For cases where we can automatically
    prove that the file offset is always ignored, we use noop_llseek,
    which maintains the current behavior of not returning an error from
    a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {
    <+... nonseekable_open(...) ...+>
    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {
    <+... nested_open(...) ...+>
    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    <+... ( *off = E
    |       *off += E
    |       func(..., off, ...)
    |       E = *off ) ...+>
    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    <+... ( *off = E
    |       *off += E
    |       func(..., off, ...)
    |       E = *off ) ...+>
    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     

23 Sep, 2010

1 commit

  • Currently, /proc/<pid>/smaps has wrong dirty-page accounting:
    Shared_Dirty and Private_Dirty count only pte-dirty pages and ignore
    the PG_dirty page flag. That differs from the documentation, and is
    also inconsistent with the Referenced field, which checks both the
    pte and the page flags.

    This patch fixes it.
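
    The accounting now checks the page flag as well as the pte bit; the
    shape of the fix:

    /* in smaps_pte_range(), for each mapped page: */
    if (pte_dirty(ptent) || PageDirty(page)) {
            if (page_mapcount(page) >= 2)
                    mss->shared_dirty += PAGE_SIZE;
            else
                    mss->private_dirty += PAGE_SIZE;
    }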

    Test program:

    large-array.c
    ---------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    char array[1*1024*1024*1024L];

    int main(void)
    {
            memset(array, 1, sizeof(array));
            pause();

            return 0;
    }
    ---------------------------------------------------

    Test case:
    1. run ./large-array
    2. cat /proc/`pidof large-array`/smaps
    3. swapoff -a
    4. cat /proc/`pidof large-array`/smaps again

    Test result:

    00601000-40601000 rw-p 00000000 00:00 0
    Size: 1048576 kB
    Rss: 1048576 kB
    Pss: 1048576 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 218992 kB
    Private_Dirty: 829584 kB

    00601000-40601000 rw-p 00000000 00:00 0
    Size: 1048576 kB
    Rss: 1048576 kB
    Pss: 1048576 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 1048576 kB
    Acked-by: Hugh Dickins
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

10 Sep, 2010

1 commit


16 Aug, 2010

1 commit

  • This commit makes the stack guard page somewhat less visible to user
    space. It does this by:

    - not showing the guard page in /proc/<pid>/maps

    It looks like lvm-tools will actually read /proc/self/maps to figure
    out where all its mappings are, and effectively do a specialized
    "mlockall()" in user space. By not showing the guard page as part of
    the mapping (by just adding PAGE_SIZE to the start for grows-down
    pages), lvm-tools ends up not being aware of it.

    - by also teaching the _real_ mlock() functionality not to try to lock
    the guard page.

    That would just expand the mapping down to create a new guard page,
    so there really is no point in trying to lock it in place.
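
    Both points reduce to skipping one page at the low end of a
    grows-down vma (this commit predates the grows-up handling added
    later for PA-RISC); a sketch:

    /* show_map_vma(): hide the guard page from the printed range;
     * the mlock() path skips the same page instead of locking it */
    start = vma->vm_start;
    if (vma->vm_flags & VM_GROWSDOWN)
            start += PAGE_SIZE;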

    It would perhaps be nice to show the guard page specially in
    /proc/<pid>/maps (or at least mark grow-down segments some way), but
    let's not open ourselves up to more breakage by user space from
    programs that depend on the exact details of the 'maps' file.

    Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
    source code to see what was going on with the whole new warning.

    Reported-and-tested-by: François Valenduc
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 May, 2010

1 commit


12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    thread stack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for
    kernel threads") was applied to fix a bug which had kernel threads
    printing a userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is it provides an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack
    usage on NOMMU"). If I had completely reverted it, I would have had
    to change mm/Makefile to only build pagewalk.o when
    CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the
    builds without significant effort, I decided not to change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit"). I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Apr, 2010

1 commit

  • When we look into pagemap using page-types with option -p, the value of
    pfn for hugepages looks wrong (see below.) This is because pte was
    evaluated only once for one vma although it should be updated for each
    hugepage. This patch fixes it.

    $ page-types -p 3277 -Nl -b huge
    voffset offset len flags
    7f21e8a00 11e400 1 ___U___________H_G________________
    7f21e8a01 11e401 1ff ________________TG________________
    ^^^
    7f21e8c00 11e400 1 ___U___________H_G________________
    7f21e8c01 11e401 1ff ________________TG________________
    ^^^

    One hugepage contains 1 head page and 511 tail pages on x86_64, and
    each pair of lines represents one hugepage. Voffset and offset mean
    the virtual address and the physical address in page units,
    respectively. Different hugepages should not have the same offset
    value.

    With this patch applied:

    $ page-types -p 3386 -Nl -b huge
    voffset offset len flags
    7fec7a600 112c00 1 ___UD__________H_G________________
    7fec7a601 112c01 1ff ________________TG________________
    ^^^
    7fec7a800 113200 1 ___UD__________H_G________________
    7fec7a801 113201 1ff ________________TG________________
    ^^^
    OK

    More info:

    - This patch modifies walk_page_range()'s hugepage walker. But the
    change only affects pagemap_read(), which is the only caller of the
    hugepage callback.

    - Without this patch, the hugetlb_entry() callback is called per vma,
    which doesn't match the natural expectation from its name.

    - With this patch, hugetlb_entry() is called per hugepte entry and the
    callback can become much simpler.
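
    A condensed sketch of the per-entry hugetlb walker, with the pte
    recomputed on every iteration:

    static int walk_hugetlb_range(struct vm_area_struct *vma,
                    unsigned long addr, unsigned long end,
                    struct mm_walk *walk)
    {
            struct hstate *h = hstate_vma(vma);
            unsigned long hmask = huge_page_mask(h);
            unsigned long next;
            pte_t *pte;
            int err = 0;

            do {
                    next = hugetlb_entry_end(h, addr, end);
                    pte = huge_pte_offset(walk->mm, addr & hmask);
                    if (pte && walk->hugetlb_entry)
                            err = walk->hugetlb_entry(pte, hmask, addr,
                                                      next, walk);
                    if (err)
                            return err;
            } while (addr = next, addr != end);

            return 0;
    }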

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Apr, 2010

1 commit


05 Apr, 2010

2 commits

  • Tejun Heo
     
  • In the initial design, walk_page_range() was designed just for walking
    page tables and it didn't require mmap_sem. Now, find_vma() etc. are
    used in walk_page_range() and we need mmap_sem around it.

    This patch adds mmap_sem around walk_page_range().

    Because /proc/<pid>/pagemap's callback routine uses put_user(), we
    have to get rid of it to do a sane fix.
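
    A sketch of the resulting structure: the callback fills a kernel
    buffer, and mmap_sem is held only across the walk:

    static int add_to_pagemap(unsigned long addr, u64 pfn,
                              struct pagemapread *pm)
    {
            pm->buffer[pm->pos++] = pfn;    /* no put_user() under mmap_sem */
            if (pm->pos >= pm->len)
                    return PM_END_OF_BUFFER;
            return 0;
    }

    /* in pagemap_read(), per chunk: */
    down_read(&mm->mmap_sem);
    ret = walk_page_range(start_vaddr, end, &pagemap_walk);
    up_read(&mm->mmap_sem);

    if (copy_to_user(buf, pm.buffer, pm.pos * PM_ENTRY_BYTES))
            ret = -EFAULT;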

    Changelog: 2010/Apr/2
    - fixed start_vaddr and end overflow
    Changelog: 2010/Apr/1
    - fixed start_vaddr calculation
    - removed unnecessary cast.
    - removed unnecessary change in smaps.
    - use GFP_TEMPORARY instead of GFP_KERNEL

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Matt Mackall
    Cc: KOSAKI Motohiro
    Cc: San Mehat
    Cc: Brian Swetland
    Cc: Dave Hansen
    Cc: Andrew Morton
    [ Fixed kmalloc failure return code as per Matt ]
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

30 Mar, 2010

1 commit

  • include cleanup: Update gfp.h and slab.h includes to prepare for
    breaking implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 7, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

2 commits

  • A frequent question from users about memory management is how many
    swap entries are used by each process, and this information would
    give some hints to the oom-killer.

    Although we can count the number of swap entries per process by
    scanning /proc/<pid>/smaps, this is very slow and not good for a
    usual process-information handler which works like 'ps' or 'top'
    (ps and top are already slow enough..).

    This patch adds a counter of swap entries to mm_counter and updates
    it at each swap event. The information is exported via the
    /proc/<pid>/status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
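
    A sketch of the accounting: the new MM_SWAPENTS per-mm counter is
    bumped on swap-out, dropped on swap-in or swap-entry free, and
    printed by task_mem() (helper names per the mm_counter cleanup
    below):

    inc_mm_counter(mm, MM_SWAPENTS);        /* unmap-to-swap path */
    dec_mm_counter(mm, MM_SWAPENTS);        /* swap-in / entry freed */

    /* fs/proc/task_mmu.c, task_mem(): */
    seq_printf(m, "VmSwap:\t%8lu kB\n",
               get_mm_counter(mm, MM_SWAPENTS) << (PAGE_SHIFT - 10));
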
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies that to:
    - define them in mm.h as inline functions
    - use an array instead of macro-based name creation

    This patch is for reducing patch size in future patches that modify
    the implementation of the per-mm counters.
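
    An abridged sketch of the array-based form (the counters are
    atomic_long_t in the split-ptlock configuration):

    enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
    };

    static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
    {
            return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
    }

    static inline void inc_mm_counter(struct mm_struct *mm, int member)
    {
            atomic_long_inc(&mm->rss_stat.count[member]);
    }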

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki