17 Jan, 2011

2 commits


15 Jan, 2011

1 commit

  • LIO target is a full featured in-kernel target framework with the
    following feature set:

    High-performance, non-blocking, multithreaded architecture with SIMD
    support.

    Advanced SCSI feature set:

    * Persistent Reservations (PRs)
    * Asymmetric Logical Unit Assignment (ALUA)
    * Protocol and intra-nexus multiplexing, load-balancing and failover (MC/S)
    * Full Error Recovery (ERL=0,1,2)
    * Active/active task migration and session continuation (ERL=2)
    * Thin LUN provisioning (UNMAP and WRITE_SAMExx)

    Multiprotocol target plugins

    Storage media independence:

    * Virtualization of all storage media; transparent mapping of IO to LUNs
    * No hard limits on number of LUNs per Target; maximum LUN size ~750 TB
    * Backstores: SATA, SAS, SCSI, BluRay, DVD, FLASH, USB, ramdisk, etc.

    Standards compliance:

    * Full compliance with IETF (RFC 3720)
    * Full implementation of SPC-4 PRs and ALUA

    Significant code cleanups done by Christoph Hellwig.

    [jejb: fix up for new block bdev exclusive interface. Minor fixes from
    Randy Dunlap and Dan Carpenter.]
    Signed-off-by: Nicholas A. Bellinger
    Signed-off-by: James Bottomley

    Nicholas Bellinger
     

14 Jan, 2011

37 commits

  • SDEV_MEDIA_CHANGE event was first added by commit a341cd0f (SCSI: add
    asynchronous event notification API) for SATA AN support and then
    extended to cover generic media change events by commit 285e9670
    ([SCSI] sr,sd: send media state change modification events).

    This event was mapped to the block device in userland, with all properties
    stripped, to simulate a CHANGE event on the block device, which in turn was
    used to trigger further userspace action on media change.

    The recent addition of the disk event framework kept this event for
    backward compatibility, but it turns out to be unnecessary and causes
    erratic and inefficient behavior. The new disk event framework generates
    proper events on the block devices, and the compat events are mapped to
    the block device with all properties stripped, so the block device ends up
    generating multiple duplicate events for a single actual event.

    This patch removes the compat event generation from both sr and sd, as
    suggested by Kay Sievers. Both existing and newer versions of udev and the
    associated tools behave better with these events removed, as they have
    expected events on the block devices from the beginning.

    Signed-off-by: Tejun Heo
    Acked-by: Kay Sievers
    Signed-off-by: James Bottomley

    Tejun Heo
     
  • Replace sd_media_change() with sd_check_events().

    * Move media removed logic into set_media_not_present() and
    media_not_present() and set sdev->changed iff an existing media is
    removed or the device indicates UNIT_ATTENTION.

    * Make sd_check_events() set sdev->changed if previously missing
    media becomes present.

    * Event is reported only if sdev->changed is set.

    This makes the media presence event be reported only if
    scsi_disk->media_present actually changed or the device indicated
    UNIT_ATTENTION. For backward compatibility, SDEV_EVT_MEDIA_CHANGE is
    generated each time sd_check_events() detects a media change event.
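
    A minimal sketch of the check_events-style callback shape this conversion
    targets, for orientation only; the struct name and helper below are
    hypothetical, while DISK_EVENT_MEDIA_CHANGE and the callback signature come
    from the block layer's disk event framework:

      /* Polled by the block layer instead of the old media_changed() hook. */
      static unsigned int example_check_events(struct gendisk *disk,
                                               unsigned int clearing)
      {
              struct example_dev *dev = disk->private_data;  /* hypothetical */
              unsigned int events = 0;

              /* Probe the hardware, updating "changed" state as described above. */
              if (example_media_changed(dev))                /* hypothetical helper */
                      events |= DISK_EVENT_MEDIA_CHANGE;

              return events;  /* report only when a change was actually seen */
      }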

    [jejb: fix boot failure]
    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Tejun Heo
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
    ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
    ACPI: fix resource check message
    ACPI / Battery: Update information on info notification and resume
    ACPI: Drop device flag wake_capable
    ACPI: Always check if _PRW is present before trying to evaluate it
    ACPI / PM: Check status of power resources under mutexes
    ACPI / PM: Rename acpi_power_off_device()
    ACPI / PM: Drop acpi_power_nocheck
    ACPI / PM: Drop acpi_bus_get_power()
    Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
    ACPI / Fan: Rework the handling of power resources
    ACPI / PM: Register power resource devices as soon as they are needed
    ACPI / PM: Register acpi_power_driver early
    ACPI / PM: Add function for updating device power state consistently
    ACPI / PM: Add function for device power state initialization
    ACPI / PM: Introduce __acpi_bus_get_power()
    ACPI / PM: Introduce function for refcounting device power resources
    ACPI / PM: Add functions for manipulating lists of power resources
    ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
    ACPICA: Update version to 20101209
    ...

    Linus Torvalds
     
  • * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
    cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer
    intel_idle: open broadcast clock event
    cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific
    cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle
    cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions
    SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW
    cpuidle: delete NOP CPUIDLE_FLAG_POLL
    ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs
    cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL
    ACPI, intel_idle: Cleanup idle= internal variables
    cpuidle: Make cpuidle_enable_device() call poll_idle_init()
    intel_idle: update Sandy Bridge core C-state residency targets

    Linus Torvalds
     
  • * 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6:
    SFI: use ioremap_cache() instead of ioremap()

    Linus Torvalds
     
  • …t/npiggin/linux-npiggin

    * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
    fs: fix do_last error case when need_reval_dot
    nfs: add missing rcu-walk check
    fs: hlist UP debug fixup
    fs: fix dropping of rcu-walk from force_reval_path
    fs: force_reval_path drop rcu-walk before d_invalidate
    fs: small rcu-walk documentation fixes

    Fixed up trivial conflicts in Documentation/filesystems/porting

    Linus Torvalds
     
  • When open(2) without O_DIRECTORY opens an existing dir, it should return
    EISDIR. In do_last(), the variable 'error' is initialized to EISDIR, but it
    is changed by d_revalidate(), which returns any positive value to mean
    'the target dir is valid.'

    We should keep and return the initialized 'error' in this case.
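
    A hedged sketch of the pattern being fixed, not the literal fs/namei.c
    code; the revalidate callback here is a hypothetical stand-in:

      /* Illustrative only: a positive "valid" status must not replace the
       * prepared -EISDIR result. */
      static int open_existing_dir_example(struct dentry *dentry,
                                           int (*revalidate)(struct dentry *))
      {
              int error = -EISDIR;             /* prepared return value */
              int status = revalidate(dentry); /* positive means "dentry is valid" */

              if (status < 0)
                      return status;           /* real failure: propagate it */

              /* Keep the prepared error rather than the positive status. */
              return error;
      }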

    Signed-off-by: Nick Piggin

    J. R. Okajima
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/p2m: Fix module linking error.
    xen p2m: clear the old pte when adding a page to m2p_override
    xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
    xen: introduce gnttab_map_refs and gnttab_unmap_refs
    xen p2m: transparently change the p2m mappings in the m2p override
    xen/gntdev: Fix circular locking dependency
    xen/gntdev: stop using "token" argument
    xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
    xen: add m2p override mechanism
    xen: move p2m handling to separate file
    xen/gntdev: add VM_PFNMAP to vma
    xen/gntdev: allow usermode to map granted pages
    xen: define gnttab_set_map_op/unmap_op

    Fix up trivial conflict in drivers/xen/Kconfig

    Linus Torvalds
     
  • * 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
    xen: rename platform-pci module to xen-platform-pci.
    xen-platform: use PCI interfaces to request IO and MEM resources.

    Linus Torvalds
     
  • Po-Yu Chuang noticed that hlist_bl_set_first could
    crash on a UP system when LIST_BL_LOCKMASK is 0, because

    LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));

    always evaluates to true.

    Fix the expression, and also avoid a dependency between bit spinlock
    implementation and list bl code (list code shouldn't know anything
    except that bit 0 is set when adding and removing elements). Eventually
    if a good use case comes up, we might use this list to store 1 or more
    arbitrary bits of data, so it really shouldn't be tied to locking either,
    but for now they are helpful for debugging.
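
    A sketch of why the assertion misfires on UP and the corrected shape of the
    check, assuming the LIST_BL_LOCKMASK semantics described above:

      /* On UP (no debug spinlocks) LIST_BL_LOCKMASK is 0, so
       *     (unsigned long)h->first & LIST_BL_LOCKMASK  ==  0
       * and !0 is always true: the old assertion fires unconditionally.
       * Only demand the low bit when a lock bit actually exists:
       */
      LIST_BL_BUG_ON(((unsigned long)h->first & LIST_BL_LOCKMASK) !=
                     LIST_BL_LOCKMASK);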

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • As J. R. Okajima noted, force_reval_path passes in the same dentry to
    d_revalidate as the one in the nameidata structure (other callers pass in a
    child), so the locking breaks. This can oops with a chrooted nfs mount, for
    example. Similarly there can be other problems with revalidating a dentry
    which is already in nameidata of the path walk.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
    d_revalidate can return in rcu-walk mode even when it returns 0. We can't
    just call any old dcache function on an rcu-walk dentry (the dentry is
    unstable, so even though d_lock can safely be taken, the result may no
    longer be what we expect -- careful re-checks would be required). So just
    drop rcu in this case.

    (I missed this conversion when switching to the rcu-walk convention that Linus
    suggested)

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are trying to migrate a shmem swapcache, its page->mapping is
    NULL from the beginning, so the check would be invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even if it
    hasn't, so "newpage" would be freed while it's not uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.
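
    A hedged sketch of the resulting interface change; the parameter name is
    illustrative, the point being that the caller now passes the migration
    result instead of it being inferred from oldpage->mapping:

      /* After: success/failure is passed in explicitly. */
      void mem_cgroup_end_migration(struct mem_cgroup *memcg,
                                    struct page *oldpage,
                                    struct page *newpage,
                                    bool migration_ok);  /* e.g. rc == 0 from migration */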

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
    followed by memset() to zero the memory. This can be more efficiently
    achieved by using kzalloc() and vzalloc(). There's also one situation
    where we can use kzalloc_node() - this is what's new in this version of
    the patch.
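
    A minimal sketch of the substitution, with the size threshold and function
    name illustrative rather than copied from mem_cgroup_alloc():

      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static void *alloc_zeroed_example(size_t size)
      {
              if (size <= PAGE_SIZE)
                      return kzalloc(size, GFP_KERNEL); /* was kmalloc() + memset(0) */
              return vzalloc(size);                     /* was vmalloc() + memset(0) */
      }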

    Signed-off-by: Jesper Juhl
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Commit b1dd693e ("memcg: avoid deadlock between move charge and
    try_charge()") can cause another deadlock on mmap_sem during task migration
    if cpuset and memcg are mounted onto the same mount point.

    After the commit, cgroup_attach_task() has a sequence like:

    cgroup_attach_task()
      ss->can_attach()
        cpuset_can_attach()
        mem_cgroup_can_attach()
          down_read(&mmap_sem)       (1)
      ss->attach()
        cpuset_attach()
          mpol_rebind_mm()
            down_write(&mmap_sem)    (2)
            up_write(&mmap_sem)
          cpuset_migrate_mm()
            do_migrate_pages()
              down_read(&mmap_sem)
              up_read(&mmap_sem)
        mem_cgroup_move_task()
          mem_cgroup_clear_mc()
            up_read(&mmap_sem)

    We can cause a deadlock at (2) because we've already acquired the mmap_sem
    at (1).

    But the commit itself is necessary to fix deadlocks which have existed
    before the commit like:

    Ex.1)
              move charge               |        try charge
    ------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot acquire the lock    |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

    Ex.2)
              move charge               |        try charge
    ------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot acquire the lock      |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

    This patch fixes all of these problems by:
    1. reverting the commit.
    2. to fix Ex.1, setting mc.moving_task only after mem_cgroup_count_precharge()
       has released the mmap_sem.
    3. to fix Ex.2, using down_read_trylock() instead of down_read() in
       mem_cgroup_move_charge() and, if the lock cannot be acquired, cancelling
       all extra charges, waking up all waiters, and retrying the trylock.
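
    A hedged sketch of the trylock-and-retry shape from fix 3 (control flow
    condensed; the cancel-and-wake helper is a hypothetical stand-in for the
    behaviour described above, not the in-tree function name):

      static void move_charge_example(struct mm_struct *mm)
      {
      retry:
              if (!down_read_trylock(&mm->mmap_sem)) {
                      /* Cancel extra charges and wake anyone sleeping in
                       * __mem_cgroup_try_charge(), then try again. */
                      cancel_precharge_and_wake_waiters();  /* hypothetical */
                      cond_resched();
                      goto retry;
              }
              /* ... walk the address space and move charges ... */
              up_read(&mm->mmap_sem);
      }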

    Signed-off-by: Daisuke Nishimura
    Reported-by: Ben Blum
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Paul Menage
    Cc: Hiroyuki Kamezawa
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Adding the number of swap pages to the byte limit of a memory control
    group makes no sense. Convert the pages to bytes before adding them.

    The only user of this code is the OOM killer, and the way it is used means
    that the error results in a higher OOM badness value. Since the cgroup
    limit is the same for all tasks in the cgroup, the error should have no
    practical impact at the moment.

    But let's not wait for future or changing users to trip over it.
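
    The fix amounts to shifting the swap page count by PAGE_SHIFT before the
    addition; a hedged sketch, with the limit variable name illustrative:

      /* memsw_limit_bytes stands in for the memcg's memory+swap limit. */
      static u64 oom_limit_example(u64 memsw_limit_bytes)
      {
              /* Before: pages were added directly to a byte count:
               *     return memsw_limit_bytes + total_swap_pages;
               * After: convert swap pages to bytes first. */
              return memsw_limit_bytes + ((u64)total_swap_pages << PAGE_SHIFT);
      }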

    Signed-off-by: Johannes Weiner
    Cc: Greg Thelen
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
    accounting and migration code. This reworks the locking scheme of
    _update_stat() and _move_account() by adding a new lock bit, PCG_MOVE_LOCK,
    which is always taken with IRQs disabled.

    1. If pages are being migrated from a memcg, then updates to that
    memcg's page statistics are protected by grabbing PCG_MOVE_LOCK using
    move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
    accounting will be updating memcg page accounting (specifically: num
    writeback pages) from IRQ context (softirq). Avoid a deadlocking
    nested spin lock attempt by disabling irq on the local processor when
    grabbing the PCG_MOVE_LOCK.

    2. The lock for update_page_stat is used only to avoid races with
    move_account(). So, IRQ awareness of lock_page_cgroup() itself is not
    a problem. The problem is between mem_cgroup_update_page_stat() and
    mem_cgroup_move_account_page().
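
    A hedged sketch of the lock pair this adds (close to, but not guaranteed to
    be identical with, the helpers introduced by the patch):

      /* PCG_MOVE_LOCK is a new bit in page_cgroup->flags; it is always taken
       * with local IRQs off so a softirq statistics update cannot deadlock
       * against a move in progress on the same CPU. */
      static inline void move_lock_page_cgroup(struct page_cgroup *pc,
                                               unsigned long *flags)
      {
              local_irq_save(*flags);
              bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
      }

      static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
                                                 unsigned long *flags)
      {
              bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
              local_irq_restore(*flags);
      }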

    Trade-off:
    * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) would have some impact on performance, and I think
    it's bad to disable IRQs when it's not necessary.
    * Adding a new lock makes move_account() slower. Scores are
    below.

    Performance Impact: moving an 8G anon process.

    Before:
    real 0m0.792s
    user 0m0.000s
    sys 0m0.780s

    After:
    real 0m0.854s
    user 0m0.000s
    sys 0m0.842s

    This result is not great, but planned optimization patches can reduce
    the impact.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Thelen
    Reviewed-by: Minchan Kim
    Acked-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Balbir Singh
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace usage of the mem_cgroup_update_file_mapped() memcg
    statistic update routine with two new routines:
    * mem_cgroup_inc_page_stat()
    * mem_cgroup_dec_page_stat()

    As before, only the file_mapped statistic is managed. However, these more
    general interfaces allow for new statistics to be more easily added. New
    statistics are added with memcg dirty page accounting.
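
    A hedged sketch of how a caller changes; the statistic enum value follows
    the naming in this series, but exact identifiers should be treated as
    illustrative:

      /* Illustrative caller (rmap-style accounting), after the conversion. */
      static void account_file_mapped_example(struct page *page, bool mapped)
      {
              if (mapped)
                      mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
              else
                      mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
      }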

    Signed-off-by: Greg Thelen
    Signed-off-by: Andrea Righi
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Document cgroup dirty memory interfaces and statistics.

    [akpm@linux-foundation.org: fix use_hierarchy description]
    Signed-off-by: Andrea Righi
    Signed-off-by: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • This patchset provides the ability for each cgroup to have independent
    dirty page limits.

    Limiting dirty memory is like fixing the max amount of dirty (hard to
    reclaim) page cache used by a cgroup. So, in case of multiple cgroup
    writers, they will not be able to consume more than their designated share
    of dirty pages and will be forced to perform write-out if they cross that
    limit.

    The patches are based on a series proposed by Andrea Righi in Mar 2010.

    Overview:

    - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
    unstable.

    - Extend mem_cgroup to record the total number of pages in each of the
    interesting dirty states (dirty, writeback, unstable_nfs).

    - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_*
    limits to mem_cgroup. The mem_cgroup dirty parameters are accessible
    via cgroupfs control files.

    - Consider both system and per-memcg dirty limits in page writeback when
    deciding to queue background writeback or block for foreground writeback.

    Known shortcomings:

    - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
    write back dirty inodes. Bdi writeback considers inodes from any cgroup, not
    just inodes contributing dirty pages to the cgroup exceeding its limit.

    - When memory.use_hierarchy is set, dirty limits are disabled. This is an
    implementation detail. An enhanced implementation would need to check the
    chain of parents to ensure that no dirty limit is exceeded.

    Performance data:
    - A page fault microbenchmark workload was used to measure performance, which
    can be called in read or write mode:
        f = open(foo.$cpu)
        truncate(f, 4096)
        alarm(60)
        while (1) {
            p = mmap(f, 4096)
            if (write)
                *p = 1
            else
                x = *p
            munmap(p)
        }

    - The workload was called for several points in the patch series in different
    modes:
    - s_read is a single threaded reader
    - s_write is a single threaded writer
    - p_read is a 16 thread reader, each operating on a different file
    - p_write is a 16 thread writer, each operating on a different file

    - Measurements were collected on a 16 core non-numa system using "perf stat
    --repeat 3". The -a option was used for parallel (p_*) runs.

    - All numbers are page fault rate (M/sec). Higher is better.

    - To compare the performance against a kernel without memcg, compare the
    first and last rows; neither has memcg configured. The first row does not
    include any of these memcg patches.

    - To compare the performance of using memcg dirty limits, compare the
    baseline (2nd row, titled "w/ memcg") with the code and memcg enabled
    (2nd to last row, titled "all patches").

                           root_cgroup                        child_cgroup
                     s_read s_write p_read p_write   s_read s_write p_read p_write
    mmotm w/o memcg  0.428  0.390   0.429  0.388
    mmotm w/ memcg   0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
    all patches      0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
    all patches      0.431  0.402   0.427  0.395
      w/o memcg

    This patch:

    Add additional flags to page_cgroup to track dirty pages within a
    mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Righi
    Signed-off-by: Greg Thelen
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We could batch activate_page() calls to reduce the lock
    contention. The batched pages will be added to the zone list when the pool
    is full or when page reclaim is trying to drain them.
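
    A hedged sketch of the batching shape (the pagevec helpers are real; the
    per-cpu pool, the callback name and the exact checks are condensed from the
    description above rather than copied from the patch):

      /* One small per-CPU buffer; flushed to the zone LRU in a single
       * zone->lru_lock acquisition when it fills up (or when reclaim drains it). */
      static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);

      void activate_page_batched_example(struct page *page)
      {
              if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
                      struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

                      get_page(page);
                      if (!pagevec_add(pvec, page))   /* buffer full: drain it */
                              pagevec_lru_move_fn(pvec, __activate_page, NULL);
                      put_cpu_var(activate_page_pvecs);
              }
      }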

    For example, on a 4 socket, 64 CPU system, create a sparse file and 64
    processes that share a mapping of the file. Each process reads the whole
    file and then exits. The process exit does unmap_vmas() and causes a lot of
    activate_page() calls. In such a workload, we saw about a 58% total time
    reduction with the patch below. Other workloads with a lot of
    activate_page() calls benefit a lot too.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. The others are basically run-to-run
    variation, since each run differs slightly.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Clean up and remove duplicate code. The next patch will also use the
    pagevec_lru_move_fn() introduced here.

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
    text data bss dec hex filename
    39884 1241317 18808 1300009 13d629 mm/page_alloc.o (before)
    39838 1241317 18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.
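
    A hedged sketch of the locking shape described, with the actual unmap/move
    steps elided (the helpers named here are the ones mentioned above):

      static struct anon_vma *grab_anon_vma_example(struct page *page)
      {
              struct anon_vma *anon_vma = NULL;

              if (PageAnon(page)) {
                      /* page_lock_anon_vma() does the hard thinking for us. */
                      anon_vma = page_lock_anon_vma(page);
                      if (anon_vma) {
                              get_anon_vma(anon_vma); /* keep it across the move */
                              page_unlock_anon_vma(anon_vma);
                      }
              }
              return anon_vma;  /* caller pairs with drop_anon_vma() when done */
      }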

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When parsing changes to the huge page pool sizes made from userspace via
    the sysfs interface, bogus input values are being covered up by
    nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
    when strict_strtoul returns an error. This can cause an infinite loop in
    the nr_hugepages_store code. This patch changes the return value for
    these functions to -EINVAL when strict_strtoul returns an error.
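
    A hedged sketch of the store-function fix (names and surrounding code
    condensed from the changelog, not copied from mm/hugetlb.c):

      static ssize_t nr_hugepages_store_example(const char *buf, size_t len)
      {
              unsigned long count;
              int err;

              err = strict_strtoul(buf, 10, &count);
              if (err)
                      return -EINVAL;  /* was: return 0, which made sysfs loop forever */

              /* ... adjust the pool to 'count' pages ... */
              return len;
      }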

    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Eric B Munson
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Huge pages with order >= MAX_ORDER must be allocated at boot via the
    kernel command line, they cannot be allocated or freed once the kernel is
    up and running. Currently we allow values to be written to the sysfs and
    sysctl files controlling pool size for these huge page sizes. This patch
    makes the store functions for nr_hugepages and nr_overcommit_hugepages
    return -EINVAL when the pool for a page size >= MAX_ORDER is changed.
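
    A hedged sketch of the guard the store functions gain (field name per the
    hugetlb code of that era; placement illustrative):

      static bool pool_resize_allowed_example(struct hstate *h)
      {
              /* Gigantic pages (order >= MAX_ORDER) are boot-time only:
               * their pool cannot be resized via sysfs/sysctl. */
              return h->order < MAX_ORDER;
      }

    The store functions return -EINVAL when this check fails.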

    [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
    [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • proc_doulongvec_minmax may fail if the given buffer doesn't represent a
    valid number. If we provide something invalid we will initialize the
    resulting value (nr_overcommit_huge_pages in this case) to a random value
    from the stack.

    The issue was introduced by a3d0c6aa when the default handler was
    replaced by a helper function that does not check the return value.

    Reproducer:
    echo "" > /proc/sys/vm/nr_overcommit_hugepages

    [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
    Signed-off-by: Michal Hocko
    Cc: CAI Qian
    Cc: Nishanth Aravamudan
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The sync_inodes_sb() function does not have a return value. Remove the
    outdated documentation comment.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Hajnoczi
     
  • As it stands this code will degenerate into a busy-wait if the calling task
    has signal_pending().

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dma_pool_free() scans for the page to free in the pool list while holding
    the pool lock, then releases the lock only to acquire it again immediately.
    Modify the code to take the lock only once.

    This does some additional loops and computations with the lock held if
    memory debugging is activated. If it is not activated, the only new
    operations under this lock are one if and one subtraction.

    Signed-off-by: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • The previous approach for calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly as well as enhancing readability.
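
    A short worked check of why the two expressions agree:

      /* buddy_idx = page_idx ^ (1 << order) differs from page_idx only in bit
       * 'order', so ANDing the two clears exactly that bit:
       *
       *     page_idx & buddy_idx  ==  page_idx & ~(1 << order)
       *
       * Worked example with order = 2, page_idx = 13 (0b1101):
       *     buddy_idx            = 13 ^ 4  = 9  (0b1001)
       *     page_idx & buddy_idx = 13 & 9  = 9  (0b1001)
       *     page_idx & ~(1 << 2) = 13 & ~4 = 9  (0b1001)
       */
      unsigned long combined_idx = page_idx & buddy_idx;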

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-unintialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
    be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in the sys_brk() implementation
    is wrong, as it solely takes the COMPAT_BRK setting into account, assuming
    that the brk start is not randomized. But that might not be the case if the
    randomize_va_space sysctl has been set to '2' at the time the binary was
    loaded from disk.

    In such a case, the check has to be done in the same way as in the
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
    but mm->end_data instead, as that's where legacy applications expect the
    brk section to start (i.e. immediately after the last global variable).
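
    A hedged sketch of the corrected lower-bound selection in sys_brk(); the
    "was the brk start randomized" condition is written out descriptively here,
    and the in-tree test may differ:

      unsigned long min_brk;

      #ifdef CONFIG_COMPAT_BRK
              /* CONFIG_COMPAT_BRK can still be overridden at run time by
               * randomize_va_space = 2, in which case the brk start was
               * randomized and must be honoured just as in the
               * !CONFIG_COMPAT_BRK case. */
              if (brk_start_was_randomized)           /* illustrative condition */
                      min_brk = mm->start_brk;
              else
                      min_brk = mm->end_data;         /* brk follows the data segment */
      #else
              min_brk = mm->start_brk;
      #endif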

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina