03 Dec, 2010

1 commit


29 Oct, 2010

1 commit

  • When a node contains only HighMem memory, slab_node(MPOL_BIND)
    dereferences a NULL pointer.

    [ This code seems to go back all the way to commit 19770b32609b: "mm:
    filter based on a nodemask as well as a gfp_mask". Which was back in
    April 2008, and it got merged into 2.6.26. - Linus ]
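
    A hedged sketch of the kind of NULL-zone guard described, using slab_node()'s
    MPOL_BIND case from this era (variable names are taken from the surrounding
    mempolicy code; this is not necessarily the exact patch):

        struct zonelist *zonelist;
        struct zone *zone;
        enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);

        zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
        (void)first_zones_zonelist(zonelist, highest_zoneidx,
                                   &policy->v.nodes, &zone);
        /* A HighMem-only node yields no zone usable for GFP_KERNEL, so
         * 'zone' may be NULL; fall back to the local node instead of
         * dereferencing it. */
        return zone ? zone->node : numa_node_id();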

    Signed-off-by: Eric Dumazet
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

27 Oct, 2010

2 commits

  • Function check_range may return ERR_PTR(...). Check for it.
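
    A minimal sketch of the error-pointer check being added (the call site,
    labels and surrounding variables are assumptions, not the exact patch):

        vma = check_range(mm, start, end, nmask,
                          flags | MPOL_MF_INVERT, &pagelist);
        if (IS_ERR(vma)) {              /* check_range() may return ERR_PTR() */
            err = PTR_ERR(vma);
            goto mpol_out;
        }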

    Signed-off-by: Vasiliy Kulikov
    Acked-by: David Rientjes
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Presently update_nr_listpages() doesn't have a role, because the list
    passed in is always empty just after calling migrate_pages(): since
    commit aaa994b3, migrate_pages() cleans up the pages which have failed
    to migrate before returning.

    [PATCH] page migration: handle freeing of pages in migrate_pages()

    Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    At that time, we thought we didn't need any postprocessing of pages.
    But the situation has changed: compaction needs to know the number of
    pages that failed to migrate for the COMPACTPAGEFAILED stat.

    This patch makes a new rule for callers of migrate_pages(): the caller
    must call putback_lru_pages(). So the caller needs to clean up the
    lists itself, which gives it a chance to postprocess the pages.
    [suggested by Christoph Lameter]
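
    A hedged sketch of what a caller looks like under the new rule (the
    migrate_pages() signature is era-specific and alloc_target_page is a
    hypothetical allocation callback):

        nr_failed = migrate_pages(&pagelist, alloc_target_page, private, 0);
        /* Failed pages now remain on 'pagelist' rather than being freed,
         * so the caller can count them (e.g. for COMPACTPAGEFAILED) and
         * must put them back on the LRU itself. */
        putback_lru_pages(&pagelist);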

    Signed-off-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Reviewed-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 Aug, 2010

2 commits

  • migrate_pages() is using >500 bytes stack. Reduce it.

    mm/mempolicy.c: In function 'sys_migrate_pages':
    mm/mempolicy.c:1344: warning: the frame size of 528 bytes is larger than 512 bytes

    [akpm@linux-foundation.org: don't play with a might-be-NULL pointer]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The oom killer presently kills current whenever there is no more memory
    free or reclaimable on its mempolicy's nodes. There is no guarantee that
    current is a memory-hogging task or that killing it will free any
    substantial amount of memory, however.

    In such situations, it is better to scan the tasklist for tasks that
    are allowed to allocate on current's set of nodes and kill the one
    with the highest badness() score. This ensures that the most
    memory-hogging task, or the one configured by the user with
    /proc/pid/oom_adj, is always selected in such scenarios.
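
    A hedged sketch of the selection described; mempolicy_nodemask_intersects()
    is the mempolicy helper from this era (an assumption about this exact
    patch), and task_badness_score() is a hypothetical stand-in for the
    kernel's badness() scoring:

        struct task_struct *p, *chosen = NULL;
        unsigned long points, chosen_points = 0;

        for_each_process(p) {
            /* skip tasks that cannot allocate on current's nodes */
            if (!mempolicy_nodemask_intersects(p, nodemask))
                continue;
            points = task_badness_score(p);   /* hypothetical scoring wrapper */
            if (points > chosen_points) {
                chosen_points = points;
                chosen = p;
            }
        }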

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Jun, 2010

1 commit

  • My patch to "Factor out duplicate put/frees in mpol_shared_policy_init()
    to a common return path" and Dan Carpenter's fix thereto both left a
    dangling reference to the incoming tmpfs superblock mempolicy structure.
    A similar leak was introduced earlier when the nodemask was moved offstack
    to the scratch area, despite the note in the comment block regarding the
    incoming ref.

    Move the remaining 'put' of the incoming "mpol" to the common exit path
    to drop the reference.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Dan Carpenter
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

26 May, 2010

1 commit


25 May, 2010

10 commits

  • Use mm->task_size instead of TASK_SIZE to ensure that the entire user
    address space is migrated. mm->task_size is independent of the calling
    task's context, whereas TASK_SIZE may depend on the address space size
    of the calling process. Usage of TASK_SIZE can lead to partial address
    space migration if the calling process is 32 bit and the migrating
    process is 64 bit.
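
    A hedged sketch of the kind of change described (the check_range() call
    site and its other arguments are assumptions):

        /* scan the *target* mm's whole address space, not the caller's */
        check_range(mm, mm->mmap->vm_start, mm->task_size,    /* was TASK_SIZE */
                    &nmask, flags | MPOL_MF_DISCONTIG_OK, &pagelist);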

    Here is the test script used on a 64 bit system with a 32 bit echo
    process:

    mount -t cgroup none /cgroup -o cpuset
    cd /cgroup

    mkdir 0
    echo 1 > 0/cpuset.cpus
    echo 0 > 0/cpuset.mems
    echo 1 > 0/cpuset.memory_migrate

    mkdir 1
    echo 1 > 1/cpuset.cpus
    echo 1 > 1/cpuset.mems
    echo 1 > 1/cpuset.memory_migrate

    echo $$ > 0/tasks
    64_bit_process &
    pid=$!

    echo $pid > 1/tasks # This does not migrate all process pages without
    # this patch. If 64 bit echo is used or this patch is
    # applied, then the full address space of $pid is
    # migrated.

    To check memory migration, I watched:
    grep MemUsed /sys/devices/system/node/node*/meminfo

    Signed-off-by: Greg Thelen
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old unallowed bits later. But along the way, the allocator may find
    that there is no node from which it can allocate memory.

    The reason is that while cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                        task1's mpol    task2
    alloc page                   1
      alloc on node0? NO         1
                                 1               change mems from 1 to 0
                                 1               rebind task1's mpol
                                 0-1               set new bits
                                 0                 clear disallowed bits
      alloc on node1? NO         0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the nodes range first
    (setting the newly allowed bits) and shrinking it lazily (clearing the
    newly disallowed bits). So we use a variable to tell the write-side
    task that a read-side task is reading the nodemask, and the write-side
    task clears the newly disallowed nodes after the read-side task ends
    its current memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores. The problem is that the allocator can't find a node
    to allocate a page from when changing cpuset's mems, even though there
    is a lot of free memory. The reason is like this:

    (mpol: mempolicy)
    task1                        task1's mpol    task2
    alloc page                   1
      alloc on node0? NO         1
                                 1               change mems from 1 to 0
                                 1               rebind task1's mpol
                                 0-1               set new bits
                                 0                 clear disallowed bits
      alloc on node1? NO         0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    several hours later, oom will happen though there is a lot of free memory.

    This patchset fixes this problem by expanding the nodes range first
    (setting the newly allowed bits) and shrinking it lazily (clearing the
    newly disallowed bits). So we use a variable to tell the write-side
    task that a read-side task is reading the nodemask, and the write-side
    task clears the newly disallowed nodes after the read-side task ends
    its current memory allocation.

    This patch:

    In order to fix the "no node to alloc memory" problem, when we want to
    update mempolicy and mems_allowed, we expand the set of nodes first
    (set all the newly allowed nodes) and shrink the set of nodes lazily
    (clear the disallowed nodes). But the mempolicy's rebind functions may
    break the expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems: the 1st
    step expands the set of the mempolicy's nodes, and the 2nd step shrinks
    it. The two-step form is used when there is no real lock to protect the
    mempolicy on the read side; otherwise we can do the rebind work at
    once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,
            MPOL_REBIND_STEP1,
            MPOL_REBIND_STEP2,
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated in two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Otherwise we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and
    MPOL_REBIND_STEP2 to do the second step.

    Besides that, it may be a long time between these two steps and we have
    to release the lock that protects mempolicy and mems_allowed. When we
    take the lock again, we must check whether the current mempolicy is in
    the middle of rebinding (the first step has been done) or not, because
    the task may have allocated a new mempolicy while we didn't hold the
    lock. So we define the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.
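
    A hedged sketch of how a writer (e.g. cpuset) would drive the two steps;
    mpol_rebind_task() matches the interface introduced here, but the
    surrounding synchronization is only outlined:

        /* step 1: expand - the policy now allows both old and new nodes */
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);

        /* ... wait until no read-side allocation is using the old mask ... */

        /* step 2: shrink - clear the nodes that are no longer allowed */
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);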

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Factor out duplicate put/frees in mpol_shared_policy_init() to a common
    return path.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rename 'policy_types[]' to 'policy_modes[]' to better match the array
    contents.

    Use designated initializer syntax for policy_modes[].
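
    A minimal sketch of the designated-initializer style described, using the
    standard MPOL_* mode values from linux/mempolicy.h (the exact array
    contents are an assumption):

        static const char * const policy_modes[] = {
            [MPOL_DEFAULT]    = "default",
            [MPOL_PREFERRED]  = "prefer",
            [MPOL_BIND]       = "bind",
            [MPOL_INTERLEAVE] = "interleave",
        };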

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • We don't really need the extra variable 'i' in mpol_parse_str(). Its
    only use is as the loop variable, and then it's assigned to 'mode'.
    Just use 'mode' directly, and lose the 'uninitialized_var()' macro.
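
    A hedged sketch of the loop after the cleanup (policy_modes[] and
    MPOL_MAX are from the surrounding code; the exact error handling is an
    assumption):

        for (mode = 0; mode < MPOL_MAX; mode++) {
            if (!strcmp(str, policy_modes[mode]))
                break;
        }
        if (mode >= MPOL_MAX)
            goto out;       /* unrecognized mode string */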

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • No need to call mpol_set_nodemask() when we have no context for the
    mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option.
    Just save the raw nodemask in the mempolicy's w.user_nodemask member for
    use when a tmpfs/shmem file is created. mpol_shared_policy_init() will
    "contextualize" the policy for the new file based on the creating task's
    context.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default policy"
    has made MPOL_DEFAULT used only in the memory policy APIs, so there is
    no need to check for it in __mpol_equal() either. Also get rid of
    mpol_match_intent() and move its logic directly into __mpol_equal().

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: Andi Kleen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • In policy_zonelist() the MPOL_INTERLEAVE mode shouldn't happen, so fall
    through to BUG() instead of breaking out to return. I also fixed the
    comment.
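
    A minimal sketch of the switch-case change described (the surrounding
    policy_zonelist() body is assumed):

        case MPOL_INTERLEAVE:   /* should not be dereferenced here */
        default:
            BUG();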

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: Andi Kleen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • 1. In function is_valid_nodemask(), the variable k will be initialized
    to 0 in the following loop, so it needn't be initialized to policy_zone
    anymore.

    2. (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) has already been
    defined as MPOL_MODE_FLAGS in mempolicy.h.

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h, making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h, and if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree - or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

5 commits

  • Discovered while testing other mempolicy changes:

    get_mempolicy() does not handle static/relative mode flags correctly.
    Return the value that the user specified so that it can be restored
    via set_mempolicy() if desired.
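
    A hedged sketch of the kind of fix described for do_get_mempolicy()
    (exact placement in the function is an assumption):

        /* report the mode flags the user originally specified, so the
         * policy can be restored verbatim via set_mempolicy() */
        *policy |= (pol->flags & MPOL_MODE_FLAGS);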

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mpol_parse_str() has caused many 'err' variable related bugs, because
    it is ugly and unfriendly to review.

    This patch simplifies it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer in
    shmem_sb_info) added the mpol=local mount option, but the feature has
    been broken since it was born, because the parsing code always returns
    1 (i.e. mount failure).

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, the following mount operation causes a mount error.

    % mount -t tmpfs -ompol=bind:0 none /tmp

    This is because commit 71fe804b6d5 (mempolicy: use struct mempolicy
    pointer in shmem_sb_info) corrupted the MPOL_BIND parse code.

    This patch restores the needed parsing.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Fix an 'oops' when a tmpfs mount point is mounted with the mpol=default
    mempolicy.

    Upon remounting a tmpfs mount point with the 'mpol=default' option, the
    mount code crashed with a null pointer dereference. The initial problem
    report was on 2.6.27, but the problem exists in mainline 2.6.34-rc as
    well. On examining the code, we see that mpol_new() returns NULL if the
    default mempolicy was requested. This NULL mempolicy is then accessed
    to store the node mask, resulting in the oops.

    The following patch fixes it.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     

14 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

07 Mar, 2010

2 commits

  • Currently, do_migrate_pages() has a very long comment and it is not
    indented properly. I often mistake it for the function's opening
    comment and get confused by it.

    This patch fixes it.

    Note: this patch doesn't break the 80 column rule. I guess the original
    author intended this indentation, but an accident corrupted it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Strangely, the current mbind() doesn't merge a vma with its neighbor
    vmas although it's possible. Unfortunately, having many vmas can reduce
    performance...

    This patch fixes it.

    Reproducer program
    ----------------------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>
    #include <numaif.h>

    static unsigned long pagesize;

    int main(int argc, char** argv)
    {
        void* addr;
        int ch;
        int node;
        struct bitmask *nmask = numa_allocate_nodemask();
        int err;
        int node_set = 0;
        char buf[128];

        while ((ch = getopt(argc, argv, "n:")) != -1) {
            switch (ch) {
            case 'n':
                node = strtol(optarg, NULL, 0);
                numa_bitmask_setbit(nmask, node);
                node_set = 1;
                break;
            default:
                ;
            }
        }
        argc -= optind;
        argv += optind;

        if (!node_set)
            numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE, 0, 0);
        if (addr == MAP_FAILED)
            perror("mmap "), exit(1);

        fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);

        /* make page populate */
        memset(addr, 0, pagesize*3);

        /* first mbind */
        err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
                    nmask->size, MPOL_MF_MOVE_ALL);
        if (err)
            perror("mbind1 ");

        /* second mbind */
        err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
        if (err)
            perror("mbind2 ");

        sprintf(buf, "cat /proc/%d/maps", getpid());
        system(buf);

        return 0;
    }
    ----------------------------------------------------------------

    result without this patch

    addr = 0x7fe26ef09000
    [snip]
    7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
    7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
    7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
    7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0

    => 0x7fe26ef09000-0x7fe26ef0c000 have three vmas.

    result with this patch

    addr = 0x7fc9ebc76000
    [snip]
    7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
    7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack]

    => 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma.

    [minchan.kim@gmail.com: fix file offset passed to vma_merge()]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Mar, 2010

1 commit

  • Common code is used during task creation and after the task has
    started running. RCU protection is not needed during task
    creation because no other CPU has access to the
    under-construction task. Provide the RCU protection anyway to
    suppress the false positive, as there does not appear to be a
    good way for the common code to recognize that the task is only
    accessible to the CPU creating it.

    Signed-off-by: Paul E. McKenney
    Cc: Paul Menage
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

16 Dec, 2009

3 commits

  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove safe.
    ksm_memory_callback() takes ksm_thread_mutex when MEM_GOING_OFFLINE and
    releases it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
    are freed before migration reaches them, stable_nodes may be left still
    pointing to struct pages which have been removed from the system: the
    stable_node needs to identify a page by pfn rather than page pointer, then
    it can safely prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Christoph pointed out that inc_zone_page_state(NR_ISOLATED) should be
    placed right after isolate_page().

    This patch does it.
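
    A hedged sketch of the pattern described, e.g. in mempolicy's
    migrate_page_add() (the exact context is an assumption):

        if (!isolate_lru_page(page)) {
            list_add_tail(&page->lru, pagelist);
            /* account the page as isolated immediately after isolation */
            inc_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
        }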

    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

29 Oct, 2009

2 commits

  • If migrate_prep() fails, the 'new' mempolicy variable is leaked. This
    patch fixes it.
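
    A hedged sketch of the fix in do_mbind() (the label name and surrounding
    structure are assumptions):

        if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
            err = migrate_prep();
            if (err)
                goto mpol_out;  /* was a bare 'return err', leaking 'new' */
        }
        /* ... */
    mpol_out:
        mpol_put(new);          /* common exit: drop the new mempolicy */
        return err;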

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If mbind() receives an invalid address, do_mbind leaks a page. The
    following test program detects this leak.

    This patch fixes it.

    migrate_efault.c
    =======================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>
    #include <numaif.h>

    static unsigned long pagesize;

    static void* make_hole_mapping(void)
    {
        void* addr;

        addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE, 0, 0);
        if (addr == MAP_FAILED)
            return NULL;

        /* make page populate */
        memset(addr, 0, pagesize*3);

        /* make memory hole */
        munmap(addr+pagesize, pagesize);

        return addr;
    }

    int main(int argc, char** argv)
    {
        void* addr;
        int ch;
        int node;
        struct bitmask *nmask = numa_allocate_nodemask();
        int err;
        int node_set = 0;

        while ((ch = getopt(argc, argv, "n:")) != -1) {
            switch (ch) {
            case 'n':
                node = strtol(optarg, NULL, 0);
                numa_bitmask_setbit(nmask, node);
                node_set = 1;
                break;
            default:
                ;
            }
        }
        argc -= optind;
        argv += optind;

        if (!node_set)
            numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        addr = make_hole_mapping();

        err = mbind(addr, pagesize*3, MPOL_BIND, nmask->maskp, nmask->size, MPOL_MF_MOVE_ALL);
        if (err)
            perror("mbind ");

        return 0;
    }
    =======================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

08 Aug, 2009

1 commit

  • At first, init_task's mems_allowed is initialized as this.
    init_task->mems_allowed == node_state[N_POSSIBLE]

    And cpuset's top_cpuset mask is initialized as this
    top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

    Before 2.6.29:
    policy's mems_allowed is initialized as this.

    1. update tasks->mems_allowed by its cpuset->mems_allowed.
    2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)

    Updating task's mems_allowed in reference to top_cpuset's one.
    cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.

    In 2.6.30: After commit 58568d2a8215cb6f55caf2332017d7bdff954e1c
    ("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
    is initialized as this.

    1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

    Here, if the task is in top_cpuset, task->mems_allowed is not updated
    from init's one. Assume the user executes a command such as
    # numactl --interleave=all ...

    policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)

    Then, the policy's mems_allowed can include a possible node which has no
    pgdat.

    MPOL_INTERLEAVE just scans the nodemask of task->mems_allowed and
    accesses

    NODE_DATA(nid)->zonelist

    directly, even if NODE_DATA(nid) == NULL.

    Then, what we need is to make policy->mems_allowed aware of
    N_HIGH_MEMORY. This patch does that. But to do so, extra nodemasks
    would be needed on the stack. Because I know cpumask has a new
    interface, CPUMASK_ALLOC(), I added the node equivalent.

    This patch keeps the old behavior. But I feel this fix itself is just a
    Band-Aid. To do a fundamental fix, we have to take care of memory
    hotplug and that takes time. (task->mems_allowed should be
    N_HIGH_MEMORY, I think.)

    mpol_set_nodemask() should be aware of N_HIGH_MEMORY and the policy's
    nodemask should include only online nodes.

    In the old behavior, this was guaranteed by frequent references into
    cpuset's code. Now, most of them are removed and mempolicy has to check
    it by itself.

    To do the check, a few nodemask_t's will be used for calculating the
    nodemask. But the size of nodemask_t can be big and it's not good to
    allocate them on the stack.

    Now, cpumask_t has CPUMASK_ALLOC/FREE, easy helpers for getting a
    scratch area. NODEMASK_ALLOC/FREE should be there, too.
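
    A hedged sketch of the off-stack scratch nodemask pattern this
    introduces (NODEMASK_ALLOC/NODEMASK_FREE mirror CPUMASK_ALLOC; their
    exact arguments vary between kernel versions and the call site is an
    assumption):

        NODEMASK_ALLOC(nodemask_t, scratch, GFP_KERNEL);
        if (!scratch)
            return -ENOMEM;
        /* ... compute the contextualized nodemask in *scratch ... */
        NODEMASK_FREE(scratch);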

    [akpm@linux-foundation.org: cleanups & tweaks]
    Tested-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Jun, 2009

2 commits

  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
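
    A hedged sketch of the new helper as described (the body follows this
    era's page allocator interfaces; the exact implementation may differ):

        static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                          unsigned int order)
        {
            /* callers guarantee nid is a valid node id, so only VM_BUG_ON() */
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }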

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fix allocating a page cache/slab object on an unallowed node when
    memory spread is set, by updating tasks' mems_allowed right after their
    cpuset's mems is changed.

    In order to update tasks' mems_allowed in time, we must modify the
    memory policy code, because the memory policy was originally applied in
    the process's own context. After applying this patch, one task directly
    manipulates another's mems_allowed, and we use alloc_lock in the
    task_struct to protect mems_allowed and the memory policy of the task.

    But in the fast path, we don't use a lock to protect them, because
    adding a lock may lead to performance regression. If we don't add a
    lock, the task might see no nodes when changing cpuset's mems_allowed
    to some non-overlapping set. In order to avoid that, we set all newly
    allowed nodes first, then clear the newly disallowed ones.

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

14 Jan, 2009

1 commit


14 Nov, 2008

3 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.
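
    A hedged sketch of the access pattern this enables (__task_cred() is part
    of the credentials API introduced around this series; the field read is
    just an example):

        const struct cred *cred;
        uid_t uid;

        rcu_read_lock();
        cred = __task_cred(task);       /* RCU-protected, no full lock needed */
        uid = cred->uid;
        rcu_read_unlock();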

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells