12 Sep, 2013

6 commits

  • swap cluster allocation exists to get better request merging and thus
    better performance. But the cluster is shared globally: if multiple
    tasks are doing swap, this causes interleaved disk access. Swapping
    from multiple tasks is quite common; for example, each NUMA node has a
    kswapd thread doing swap, and multiple threads/processes can be doing
    direct page reclaim.

    The I/O scheduler can't help much here, because tasks don't send
    swapout I/O down to the block layer at the same time. The block layer
    does merge some I/Os, but many are not merged, depending on how many
    tasks are doing swapout concurrently. In practice, I've seen a lot of
    small-size I/O in swapout workloads.

    We make the cluster allocation per-cpu here. The interleaved disk
    access issue goes away: each task swaps out to its own cluster, so
    swapout becomes sequential and can easily be merged into big I/Os. If
    a CPU can't get its per-cpu cluster (for example, there are no free
    clusters left in the swap area), it falls back to scanning swap_map,
    so the CPU can still continue to swap. We don't need to recycle free
    swap entries of other CPUs.
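
    As a rough illustration of the idea (a user-space sketch, not the
    kernel code; all names and sizes here are made up), each "CPU" hands
    out consecutive slots from its own 256-page cluster, refills from a
    shared pool of free clusters, and falls back to a linear scan of the
    allocation map when no free cluster is left:

    #include <stdio.h>
    #include <string.h>

    #define NCPUS         4
    #define CLUSTER_PAGES 256
    #define NCLUSTERS     16
    #define NPAGES        (NCLUSTERS * CLUSTER_PAGES)

    static unsigned char swap_map[NPAGES];   /* 0 = free, 1 = in use */
    static int next_free_cluster;            /* shared pool, grows toward NCLUSTERS */
    static int percpu_cluster[NCPUS];        /* current cluster per CPU, -1 = none */
    static int percpu_offset[NCPUS];         /* next slot within that cluster */

    /* Allocate one swap slot for @cpu; returns a page index or -1 if full. */
    static int alloc_swap_slot(int cpu)
    {
        if (percpu_cluster[cpu] < 0 || percpu_offset[cpu] == CLUSTER_PAGES) {
            if (next_free_cluster < NCLUSTERS) {
                percpu_cluster[cpu] = next_free_cluster++;
                percpu_offset[cpu] = 0;
            } else {
                /* Fallback: scan the map for any free slot, like scan_swap_map(). */
                for (int i = 0; i < NPAGES; i++)
                    if (!swap_map[i]) { swap_map[i] = 1; return i; }
                return -1;
            }
        }
        int page = percpu_cluster[cpu] * CLUSTER_PAGES + percpu_offset[cpu]++;
        swap_map[page] = 1;
        return page;
    }

    int main(void)
    {
        memset(percpu_cluster, -1, sizeof(percpu_cluster));
        /* Two "CPUs" swapping concurrently still fill disjoint, sequential ranges. */
        for (int i = 0; i < 4; i++) {
            int a = alloc_swap_slot(0), b = alloc_swap_slot(1);
            printf("cpu0 -> page %d, cpu1 -> page %d\n", a, b);
        }
        return 0;
    }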

    In my test (swap to a 2-disk raid0 partition), this improves swapout
    throughput by around 10%, and request size is increased significantly.

    How this impacts swap readahead is uncertain, though. On one hand,
    page reclaim always isolates and swaps several adjacent pages, which
    makes page reclaim write the pages sequentially and benefits
    readahead. On the other hand, several CPUs writing pages interleaved
    means the pages don't live _sequentially_ but relatively _near_ each
    other. In the per-cpu allocation case, if adjacent pages are written
    by different CPUs, they will live relatively _far_ apart. So how this
    impacts swap readahead depends on how many pages page reclaim isolates
    and swaps at a time. If the number is big, this patch will benefit
    swap readahead. Of course, this is about the sequential access
    pattern. The patch has no impact on random access patterns, because
    the new cluster allocation algorithm is only used for SSD.

    An alternative solution is to organize the swap layout per-mm instead
    of per-cpu. In the per-mm layout, we allocate a disk range for each
    mm, so pages of one mm live adjacently on the swap disk. The per-mm
    layout has potential lock contention issues if multiple reclaimers are
    swapping pages from one mm. For a sequential workload, the per-mm
    layout makes it easier to implement swap readahead, because pages from
    the mm are adjacent on disk. But the per-cpu layout isn't very bad in
    this workload either: page reclaim always isolates and swaps several
    pages at a time, so such pages still live sequentially on disk and
    readahead can utilize this. For a random workload, the per-mm layout
    doesn't help request merging, because it's quite possible that pages
    from different mms are swapped out at the same time and their I/O
    can't be merged in the per-mm layout, while with the per-cpu layout we
    can merge requests from any mm. Considering that random workloads are
    more common among workloads that swap (and the per-cpu approach isn't
    too bad for sequential workloads either), I'm choosing the per-cpu
    layout.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • The previous patch can expose races, according to Hugh:

    swapoff was sometimes failing with "Cannot allocate memory", coming from
    try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
    on a free entry temporarily SWAP_MAP_BAD while being discarded.

    We should use ACCESS_ONCE() there, and whenever accessing swap_map
    locklessly; but rather than peppering it throughout try_to_unuse(), just
    declare *swap_map with volatile.

    try_to_unuse() is accustomed to *swap_map going down racily, but not
    necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
    prevent that transition once SWP_WRITEOK is switched off, when it's a
    waste of time to issue discards anyway (swapon can do a whole discard).

    Another issue is:

    In swapin_readahead(), read_swap_cache_async() can read a bad swap
    entry, because we don't check whether the readahead swap entry is bad.
    This doesn't break anything, but such a swapped-in page is wasteful
    and can only be freed at page reclaim, so we should avoid reading such
    swap entries. And during discard, we mark a swap entry SWAP_MAP_BAD
    and switch it back to normal when the discard is finished. If
    readahead reads such a swap entry, we have the same issue, so we must
    check whether the swap entry is bad there too.

    Thanks to Hugh for pointing out that swapin_readahead could use a bad
    swap entry.

    [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
    Signed-off-by: Shaohua Li
    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • swap can do cluster discard for SSD, which is good, but there are some
    problems here:

    1. swap does the discard just before page reclaim gets a swap entry
    and writes the disk sectors. This is useless for high end SSDs,
    because an overwrite of a sector implies a discard of the original
    sector too: a discard + overwrite == overwrite.

    2. the purpose of doing discard is to improve SSD firmware garbage
    collection. Ideally we should send discard as early as possible, so
    the firmware can do something smart. Sending discard just after a swap
    entry is freed is early compared to sending discard just before the
    write. Of course, if the workload is already bound to gc speed,
    sending discard earlier or later doesn't make much difference.

    3. block discard is a sync API, which will delay scan_swap_map()
    significantly.

    4. write and discard commands can be executed in parallel in a PCIe
    SSD. Making swap discard async lets them execute more efficiently.

    This patch makes swap discard async and moves the discard to where the
    swap entry is freed. Discard and write have no dependency now, so the
    above issues are avoided. Ideally we should do discard for any freed
    sectors, but discard on some SSDs is very slow, so this patch still
    only discards whole clusters.
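
    The decoupling can be sketched roughly like this (a user-space toy
    with invented names): freeing an entry only drops a per-cluster usage
    count, and a cluster whose count reaches zero is queued for a worker
    to discard later, instead of discarding synchronously on the hot path:

    #include <stdio.h>

    #define CLUSTER_PAGES 256
    #define NCLUSTERS     8

    static int cluster_in_use[NCLUSTERS];    /* pages still allocated per cluster */
    static int discard_queue[NCLUSTERS], qhead, qtail;

    /* Called when a swap entry is freed: no blocking discard here. */
    static void free_swap_entry(int page)
    {
        int c = page / CLUSTER_PAGES;
        if (--cluster_in_use[c] == 0)
            discard_queue[qtail++] = c;      /* defer the discard to the worker */
    }

    /* Stand-in for the async worker that issues the actual block discards. */
    static void discard_worker(void)
    {
        while (qhead != qtail) {
            int c = discard_queue[qhead++];
            printf("discard cluster %d (pages %d-%d)\n",
                   c, c * CLUSTER_PAGES, (c + 1) * CLUSTER_PAGES - 1);
        }
    }

    int main(void)
    {
        cluster_in_use[3] = 2;                    /* two entries of cluster 3 in use */
        free_swap_entry(3 * CLUSTER_PAGES);
        free_swap_entry(3 * CLUSTER_PAGES + 1);   /* count hits 0 -> cluster queued */
        discard_worker();                         /* runs later, off the free path */
        return 0;
    }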

    My test does several rounds of 'mmap, write, unmap', which triggers a
    lot of swap discard. On a FusionIO card, with this patch the test
    runtime is reduced to 18% of the time without it, so around 5.5x
    faster.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to
    20~30% CPU time (when a cluster is hard to find, the CPU time can be
    up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte
    array to search for a 256-page cluster, which is very slow.

    Here I introduce a simple algorithm to search for a cluster. Since we
    only care about 256-page clusters, we can just use a counter to track
    whether a cluster is free: every 256 pages use one int to store the
    count, and if the counter of a cluster is 0, the cluster is free. All
    free clusters are added to a list, so searching for a cluster is very
    efficient. With this, the scan_swap_map() overhead disappears.
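
    The counter-plus-list idea, sketched in plain C (names are
    illustrative; the kernel's version appends free clusters to the list
    tail, while this toy uses a simple stack): one usage counter per
    256-page cluster plus a list of free clusters makes finding a free
    cluster O(1) instead of a byte-array scan:

    #include <stdio.h>

    #define CLUSTER_PAGES 256
    #define NCLUSTERS     1024

    static int cluster_count[NCLUSTERS];  /* pages in use; 0 means the cluster is free */
    static int free_list[NCLUSTERS];      /* indices of free clusters */
    static int free_top;

    static void init_clusters(void)
    {
        for (int c = 0; c < NCLUSTERS; c++)
            free_list[free_top++] = c;
    }

    /* O(1): pop a free cluster instead of scanning a 256-byte window of swap_map. */
    static int alloc_free_cluster(void)
    {
        return free_top ? free_list[--free_top] : -1;
    }

    static void use_page_in(int c) { cluster_count[c]++; }

    static void free_page_in(int c)
    {
        if (--cluster_count[c] == 0)      /* cluster became entirely free again */
            free_list[free_top++] = c;
    }

    int main(void)
    {
        init_clusters();
        int c = alloc_free_cluster();
        use_page_in(c);
        printf("allocated cluster %d, pages in use %d\n", c, cluster_count[c]);
        free_page_in(c);
        printf("cluster %d is back on the free list (free_top=%d)\n", c, free_top);
        return 0;
    }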

    This might help low end SD card swap too, because if the cluster is
    aligned, SD firmware can do flash erase more efficiently.

    We only enable the algorithm for SSD. Hard disk swap isn't fast
    enough, and the algorithm has a downside for it which might introduce
    a regression (see below).

    The patch slightly changes which cluster is chosen. It always adds a
    free cluster to the list tail, which can help wear leveling for low
    end SSDs too. And if no free cluster is found, scan_swap_map() will
    search from the end of the last free cluster, which is random. For
    SSD, this isn't a problem at all.

    Another downside is that a cluster must be aligned to 256 pages, which
    reduces the chance of finding a cluster. I would expect this isn't a
    big problem for SSD because of the lack of a seek penalty. (And this
    is the reason I only enable the algorithm for SSD.)

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • A few 80-col gymnastics were cleaned up as a result.

    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It is possible to swapon a swap area that is too big for the pte width
    to handle.

    Presently this failure happens silently.

    Instead, emit a diagnostic to warn the user.

    Testing results, root prompt commands and kernel log messages:

    # lvresize /dev/system/swap --size 16G
    # mkswap /dev/system/swap
    # swapon /dev/system/swap

    Jul 7 04:27:22 warfang kernel: Adding 16777212k swap
    on /dev/mapper/system-swap. Priority:-1 extents:1 across:16777212k

    # lvresize /dev/system/swap --size 64G
    # mkswap /dev/system/swap
    # swapon /dev/system/swap

    Jul 7 04:27:22 warfang kernel: Truncating oversized swap area, only
    using 33554432k out of 67108860k
    Jul 7 04:27:22 warfang kernel: Adding 33554428k swap
    on /dev/mapper/system-swap. Priority:-1 extents:1 across:33554428k

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Raymond Jennings
    Acked-by: Valdis Kletnieks
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raymond Jennings
     

14 Aug, 2013

1 commit

  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
    set gets swapped out, the bit is lost and no longer available when the
    pte is read back.

    To resolve this we introduce a _PTE_SWP_SOFT_DIRTY bit which is saved
    in the pte entry for the page being swapped out. When such a page is
    read back from the swap cache we check for the bit's presence, and if
    it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit.

    One of the problems was to find a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. The
    _PAGE_PSE bit was chosen for that; it doesn't intersect with the swap
    entry format stored in the pte.
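
    A toy model of the save/restore (bit positions invented for the
    example; on x86 the swap-pte bit actually reuses the _PAGE_PSE
    position as described above):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SOFT_DIRTY (1ULL << 7)   /* soft-dirty in a present pte (made up) */
    #define SWP_SOFT_DIRTY  (1ULL << 8)   /* where it is parked in a swap pte (made up) */

    /* Swap-out: build the swap pte, carrying the soft-dirty bit along. */
    static uint64_t pte_to_swp(uint64_t pte, uint64_t swp_entry)
    {
        return swp_entry | ((pte & PAGE_SOFT_DIRTY) ? SWP_SOFT_DIRTY : 0);
    }

    /* Swap-in: restore the former soft-dirty bit into the new present pte. */
    static uint64_t swp_to_pte(uint64_t swp_pte, uint64_t new_pte)
    {
        return new_pte | ((swp_pte & SWP_SOFT_DIRTY) ? PAGE_SOFT_DIRTY : 0);
    }

    int main(void)
    {
        uint64_t pte  = 0x1000 | PAGE_SOFT_DIRTY;  /* present and soft-dirty */
        uint64_t swp  = pte_to_swp(pte, 0xA000);   /* page goes out to swap  */
        uint64_t back = swp_to_pte(swp, 0x2000);   /* page is read back in   */
        printf("soft-dirty survived swap: %s\n",
               (back & PAGE_SOFT_DIRTY) ? "yes" : "no");
        return 0;
    }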

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

04 Jul, 2013

1 commit

  • Considering the use cases where the swap device supports discard:
    a) and can do it quickly;
    b) but it's slow to do in small granularities (or concurrent with other
    I/O);
    c) but the implementation is so horrendous that you don't even want to
    send one down;

    And assuming that the sysadmin considers it useful to send the discards down
    at all, we would (probably) want the following solutions:

    i. do the fine-grained discards for freed swap pages, if device is
    capable of doing so optimally;
    ii. do single-time (batched) swap area discards, either at swapon
    or via something like fstrim (not implemented yet);
    iii. allow doing both single-time and fine-grained discards; or
    iv. turn it off completely (default behavior)

    As implemented today, one can only enable/disable discards for swap;
    one cannot select, for instance, solution (ii) on a swap device like
    (b), even though the single-time discard is regarded as interesting or
    necessary to the workload, because enabling discard would also imply
    (i), which the device is not capable of performing optimally.

    This patch addresses the scenario depicted above by introducing a way
    to ensure the (probably) wanted solutions (i, ii, iii and iv) can be
    flexibly flagged through swapon(8), allowing a sysadmin to select the
    swap discard policy best suited to the system's constraints.

    This patch introduces the new flags SWAP_FLAG_DISCARD_PAGES and
    SWAP_FLAG_DISCARD_ONCE to allow more flexible swap discard policies to
    be flagged through swapon(8). The default behavior is to keep both
    single-time (batched) area discards (SWAP_FLAG_DISCARD_ONCE) and
    fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES)
    enabled, in order to keep consistency with older kernel behavior and
    maintain compatibility with older swapon(8). However, through the
    newly introduced flags the most suitable discard policy can be
    selected according to any given swap device's constraints.
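
    For example, a newer swapon(8) (or any program) could request the
    swapon-time batched discard only, roughly as below; the flag values
    mirror the kernel's uapi definitions of the time and are defined as a
    fallback in case the libc headers predate them, and the device path is
    just a placeholder:

    #include <stdio.h>
    #include <sys/swap.h>

    #ifndef SWAP_FLAG_DISCARD
    #define SWAP_FLAG_DISCARD       0x10000   /* enable discard for swap */
    #endif
    #ifndef SWAP_FLAG_DISCARD_ONCE
    #define SWAP_FLAG_DISCARD_ONCE  0x20000   /* discard swap area at swapon time */
    #endif
    #ifndef SWAP_FLAG_DISCARD_PAGES
    #define SWAP_FLAG_DISCARD_PAGES 0x40000   /* discard freed page-clusters */
    #endif

    int main(void)
    {
        /* Policy (ii): one batched discard at swapon, no per-page discards later. */
        if (swapon("/dev/mapper/example-swap",
                   SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE) != 0) {
            perror("swapon");
            return 1;
        }
        return 0;
    }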

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Rafael Aquini
    Acked-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Karel Zak
    Cc: Jeff Moyer
    Cc: Rik van Riel
    Cc: Larry Woodman
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

13 Jun, 2013

1 commit

  • The bitmap accessed by bitops must have enough size to hold the required
    numbers of bits rounded up to a multiple of BITS_PER_LONG. And the
    bitmap must not be zeroed by memset() if the number of bits cleared is
    not a multiple of BITS_PER_LONG.

    This fixes incorrect zeroing and an incorrect allocation size for
    frontswap_map. The incorrect zeroing doesn't cause any problem,
    because frontswap_map is freed just after being zeroed, but the
    wrongly calculated allocation size may cause problems.

    For 32-bit systems, the allocation size of frontswap_map is about
    twice as large as the required size. For 64-bit systems, the
    allocation size is smaller than required if the number of bits is not
    a multiple of BITS_PER_LONG.
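
    In plain C, the safe pattern looks roughly like this: size the bitmap
    in whole longs and clear all of those bytes, rather than working from
    the "exact" number of bits:

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BITS_PER_LONG    (CHAR_BIT * sizeof(long))
    #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

    int main(void)
    {
        unsigned long maxpages = 1000003;  /* deliberately not a multiple of BITS_PER_LONG */

        /* Correct: round the allocation up to whole longs, as bitops expect. */
        size_t nlongs = BITS_TO_LONGS(maxpages);
        unsigned long *map = calloc(nlongs, sizeof(unsigned long));
        if (!map)
            return 1;

        printf("%lu bits need %zu longs = %zu bytes (naive maxpages/8 = %lu bytes)\n",
               maxpages, nlongs, nlongs * sizeof(unsigned long), maxpages / 8);

        /* Clearing nlongs * sizeof(long) bytes covers every bit the bitops may
         * touch; clearing only maxpages/8 bytes would not. */
        memset(map, 0, nlongs * sizeof(unsigned long));
        free(map);
        return 0;
    }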

    Signed-off-by: Akinobu Mita
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

01 May, 2013

1 commit

  • The frontswap initialization routine depends on swap_lock, which wants
    to be atomic about frontswap's first appearance. IOW, either frontswap
    is not present and will fail all calls, OR frontswap is fully
    functional; but until the new swap_info_struct is registered by
    enable_swap_info, the swap subsystem doesn't start I/O, so there is no
    race between the init procedure and page I/O working on frontswap.

    So let's remove the unnecessary swap_lock dependency.

    Cc: Dan Magenheimer
    Signed-off-by: Minchan Kim
    [v1: Rebased on my branch, reworked to work with backends loading late]
    [v2: Added a check for !map]
    [v3: Made the invalidate path follow the init path]
    [v4: Address comments by Wanpeng Li ]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Florian Schmaus
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

30 Apr, 2013

1 commit


27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

3 commits

  • Before establishing that KSM page migration was the cause of my
    WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
    lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
    many respects is equivalent to faulting in a page.

    In fact I've never caught that as the cause: but in theory it does at
    least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
    avoid bringing a KSM page back in when it's not supposed to be.

    I intended to copy how it's done in do_swap_page(), but have a strong
    aversion to how "swapcache" ends up being used there: rework it with
    "page != swapcache".

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • swap_lock is heavily contended when I test swap to 3 fast SSDs (it is
    even slightly slower than swap to 2 such SSDs). The main contention
    comes from swap_info_get(). This patch tries to close the gap by
    adding a new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is atomic now, so it can be changed without swap_lock.
    In theory it's possible that get_swap_page() finds no swap pages even
    though there actually are free swap pages, but that doesn't sound like
    a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading
    the flags is ok with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always
    take the former first to avoid deadlock.
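
    The ordering rule can be illustrated with a tiny pthread sketch
    (invented names, not the kernel code): any path that needs both locks
    takes the global one first, so no two paths ever hold them in opposite
    order, while per-partition work takes only the partition's own lock:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;   /* global */

    struct swap_info {                  /* one per swap partition */
        pthread_mutex_t lock;
        unsigned int flags;
    };

    /* Changing flags: both locks, always global first, then per-partition. */
    static void set_flags(struct swap_info *si, unsigned int flags)
    {
        pthread_mutex_lock(&swap_lock);
        pthread_mutex_lock(&si->lock);
        si->flags = flags;
        pthread_mutex_unlock(&si->lock);
        pthread_mutex_unlock(&swap_lock);
    }

    /* Scanning one partition: only its own lock, no global contention. */
    static void scan_partition(struct swap_info *si)
    {
        pthread_mutex_lock(&si->lock);
        printf("scanning partition, flags=%u\n", si->flags);
        pthread_mutex_unlock(&si->lock);
    }

    int main(void)
    {
        struct swap_info si = { PTHREAD_MUTEX_INITIALIZER, 0 };
        set_flags(&si, 1);
        scan_partition(&si);
        return 0;
    }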

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in
    practice, swap_lock isn't heavily contended in my test with this patch
    (or I can say there are other much heavier bottlenecks, like TLB
    flush). And BTW, it looks like get_swap_page() doesn't really need the
    lock: we never free swap_info[] and we check the SWP_WRITEOK flag. The
    only risk without the lock is that we could swap out to some low
    priority swap device, but we quickly recover after several rounds of
    swap, so it doesn't sound like a big deal to me. But I'd prefer to fix
    this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch gives each swap partition its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases swapout throughput by 20%.
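
    Schematically (a user-space model with an invented entry encoding, not
    the kernel code): the swap entry's type selects which address_space,
    and hence which tree_lock, a swap cache page lives under, so different
    swap files no longer contend on one lock:

    #include <pthread.h>
    #include <stdio.h>

    #define MAX_SWAPFILES 32
    #define TYPE_SHIFT    58           /* invented layout: type in the high bits */

    struct address_space {             /* one radix tree + tree_lock per swap file */
        pthread_mutex_t tree_lock;
        unsigned long nrpages;
    };

    static struct address_space swapper_spaces[MAX_SWAPFILES];

    static unsigned long long swp_entry(int type, unsigned long long offset)
    {
        return ((unsigned long long)type << TYPE_SHIFT) | offset;
    }

    static int swp_type(unsigned long long entry) { return (int)(entry >> TYPE_SHIFT); }

    static struct address_space *swap_address_space(unsigned long long entry)
    {
        return &swapper_spaces[swp_type(entry)];
    }

    int main(void)
    {
        unsigned long long e0 = swp_entry(0, 123), e2 = swp_entry(2, 123);
        printf("entry on swapfile 0 -> space %p\n", (void *)swap_address_space(e0));
        printf("entry on swapfile 2 -> space %p\n", (void *)swap_address_space(e2));
        return 0;
    }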

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

23 Feb, 2013

1 commit


12 Dec, 2012

4 commits

  • test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
    specify that current should be killed first if an oom condition occurs in
    between the two calls.

    The usage is

    short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
    ...
    compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);

    to store the thread's oom_score_adj, temporarily change it to the maximum
    score possible, and then restore the old value if it is still the same.

    This happens to still be racy, however, if the user writes
    OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
    The compare_swap_oom_score_adj() will then incorrectly reset the old value
    prior to the write of OOM_SCORE_ADJ_MAX.

    To fix this, introduce a new oom_flags_t member in struct signal_struct
    that will be used for per-thread oom killer flags. KSM and swapoff can
    now use a bit in this member to specify that threads should be killed
    first in oom conditions without playing around with oom_score_adj.

    This also allows the correct oom_score_adj to always be shown when reading
    /proc/pid/oom_score.
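
    Conceptually, the new scheme replaces "save, overwrite, then
    compare-and-restore oom_score_adj" with a dedicated flag bit; a rough
    user-space model (the names below are hypothetical, not necessarily
    the kernel's):

    #include <stdio.h>

    #define OOM_FLAG_ORIGIN (1U << 0)    /* hypothetical flag bit for the example */

    struct task {
        short oom_score_adj;             /* user-controlled, never touched below */
        unsigned int oom_flags;
    };

    /* swapoff/KSM mark themselves as the preferred OOM victim... */
    static void set_oom_origin(struct task *t)   { t->oom_flags |= OOM_FLAG_ORIGIN; }
    /* ...and clear the mark when done; oom_score_adj is left alone throughout. */
    static void clear_oom_origin(struct task *t) { t->oom_flags &= ~OOM_FLAG_ORIGIN; }

    static int oom_task_origin(const struct task *t)
    {
        return (t->oom_flags & OOM_FLAG_ORIGIN) != 0;
    }

    int main(void)
    {
        struct task t = { .oom_score_adj = -500, .oom_flags = 0 };
        set_oom_origin(&t);
        printf("prefer as victim: %d, oom_score_adj still %d\n",
               oom_task_origin(&t), t.oom_score_adj);
        clear_oom_origin(&t);
        return 0;
    }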

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The call to frontswap_init() was added within enable_swap_info(), which
    was called not only during sys_swapon, but also to reinsert the swap_info
    into the swap_list in case of failure of try_to_unuse() within
    sys_swapoff. This means that frontswap_init() might be called more than
    once for the same swap area.

    While as far as I could see no frontswap implementation has any problem
    with it (and in fact, all the ones I found ignore the parameter passed to
    frontswap_init), this could change in the future.

    To prevent future problems, move the call to frontswap_init() to outside
    the code shared between sys_swapon and sys_swapoff.

    Signed-off-by: Cesar Eduardo Barros
    Cc: Konrad Rzeszutek Wilk
    Acked-by: Dan Magenheimer
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block within sys_swapoff() which re-inserts the swap_info into the
    swap_list in case of failure of try_to_unuse() reads a few values outside
    the swap_lock. While this is safe at that point, it is subtle code.

    Simplify the code by moving the reading of these values to a separate
    function, refactoring it a bit so they are read from within the swap_lock.
    This is easier to understand, and matches better the way it worked before
    I unified the insertion of the swap_info from both sys_swapon and
    sys_swapoff.

    This change should make no functional difference. The only real change is
    moving the read of two or three structure fields to within the lock
    (frontswap_map_get() is nothing more than a read of p->frontswap_map).

    Signed-off-by: Cesar Eduardo Barros
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     

17 Nov, 2012

1 commit

  • There's a name leak introduced by commit 91a27b2a7567 ("vfs: define
    struct filename and have getname() return it"). Add the missing
    putname.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Xiaotian Feng
    Reviewed-by: Jeff Layton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

13 Oct, 2012

2 commits

  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.
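
    Roughly, the returned struct pairs the kernel-space copy with the
    userland pointer it came from; a simplified user-space model of the
    idea (the real kernel struct carries more bookkeeping):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct filename {
        const char *name;   /* copied pathname, kernel side */
        const char *uptr;   /* original userland pointer */
    };

    static struct filename *getname(const char *user_ptr)
    {
        struct filename *fn = malloc(sizeof(*fn));
        if (!fn)
            return NULL;
        fn->name = strdup(user_ptr);   /* stands in for the copy from userspace */
        fn->uptr = user_ptr;
        return fn;
    }

    static void putname(struct filename *fn)
    {
        free((char *)fn->name);
        free(fn);
    }

    int main(void)
    {
        struct filename *fn = getname("/etc/fstab");
        if (!fn)
            return 1;
        printf("name=%s uptr=%p\n", fn->name, (const void *)fn->uptr);
        putname(fn);
        return 0;
    }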

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

01 Aug, 2012

5 commits

  • The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
    the function would continue to reestablish the page even after
    mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt
    handling at swapoff", the condition is always true when this code is
    reached.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit b3a27d ("swap: Add swap slot free callback to
    block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
    dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the
    swap_info_struct before dereferencing.

    With reference to this callback, Christoph Hellwig stated "Please just
    remove the callback entirely. It has no user outside the staging tree and
    was added clearly against the rules for that staging tree". This would
    also be my preference but there was not an obvious way of keeping zram in
    staging/ happy.

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The version of swap_activate introduced is sufficient for swap-over-NFS
    but would not provide enough information to implement a generic handler.
    This patch shuffles things slightly to ensure the same information is
    available for aops->swap_activate() as is available to the core.

    No functionality change.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate
    the blocks, it is critical that information such as block groups,
    block bitmaps and the block descriptor table that map the swap file be
    resident in memory. This patch adds address_space_operations that the
    VM can call when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful swapfile ->swap_activate, the swapfile is marked
    SWP_FILE and swapper_space.a_ops will proxy to
    sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
    pages and ->readpage to read them.

    It is perfectly possible that ->direct_IO be used to read the swap
    pages, but it is an unnecessary complication. Similarly, it is
    possible that ->writepage be used instead of ->direct_IO to write the
    pages, but filesystem developers have stated that calling writepage
    from the VM is undesirable for a variety of reasons, and using
    ->direct_IO opens up the possibility of writing back batches of swap
    pages in the future.

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this
    function also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.
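
    A schematic model of the three helpers (a user-space toy, not the
    kernel implementation): for an ordinary page they fall through to
    page->mapping and page->index, while for a swap cache page they answer
    from the swap entry and the swap file's mapping instead:

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_CACHE_SIZE 4096UL

    struct address_space { const char *name; };

    struct page {
        bool swapcache;            /* models PG_swapcache */
        unsigned long index;       /* file index for mapped pages */
        unsigned long swp_offset;  /* models the swap entry kept in page private data */
        struct address_space *mapping;
    };

    static struct address_space file_mapping = { "file->f_mapping" };
    static struct address_space swapfile_mapping = { "swap_file->f_mapping" };

    static unsigned long page_file_index(const struct page *p)
    {
        return p->swapcache ? p->swp_offset : p->index;
    }

    static unsigned long long page_file_offset(const struct page *p)
    {
        return (unsigned long long)page_file_index(p) * PAGE_CACHE_SIZE;
    }

    static struct address_space *page_file_mapping(const struct page *p)
    {
        return p->swapcache ? &swapfile_mapping : p->mapping;
    }

    int main(void)
    {
        struct page normal  = { false, 7, 0, &file_mapping };
        struct page swapped = { true, 0, 42, NULL };
        printf("normal:  index=%lu offset=%llu mapping=%s\n",
               page_file_index(&normal), page_file_offset(&normal),
               page_file_mapping(&normal)->name);
        printf("swapped: index=%lu offset=%llu mapping=%s\n",
               page_file_index(&swapped), page_file_offset(&swapped),
               page_file_mapping(&swapped)->name);
        return 0;
    }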

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jun, 2012

1 commit

  • Minchan Kim reports that when a system has many swap areas, and tmpfs
    swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
    back the page cannot locate it, and the read fails with -ENOMEM.

    Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
    swp_entry_to_pte()) technique for determining maximum usable swap
    offset, without stopping to realize that that actually depends upon the
    pte swap encoding shifting swap offset to the higher bits and truncating
    it there. Whereas our radix_tree swap encoding leaves offset in the
    lower bits: it's swap "type" (that is, index of swap area) that was
    truncated.

    Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
    broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().

    This does not reduce the usable size of a swap area any further, it
    leaves it as claimed when making the original commit: no change from 3.0
    on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
    per swapfile on i386 with PAE. It's not a change I would have risked
    five years ago, but with x86_64 supported for ten years, I believe it's
    appropriate now.

    Hmm, and what if some architecture implements its swap pte with offset
    encoded below type? That would equally break the maximum usable swap
    offset check. Happily, they all follow the same tradition of encoding
    offset above type, but I'll prepare a check on that for next.
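
    The two encodings can be contrasted with a small toy (bit widths
    invented): the pte-style encoding keeps the offset in the high bits,
    so round-tripping an all-ones entry reveals the maximum offset that
    really fits, while the radix-tree-style encoding keeps the type in the
    high bits, so the same trick truncates the type instead and the ninth
    swap area comes back as the wrong one:

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRY_BITS 16   /* total bits the container can hold (toy value) */
    #define TYPE_BITS   3   /* bits left for the swap type (toy value) */

    /* pte-style: offset in the high bits, type in the low bits. */
    static uint16_t to_pte(unsigned type, unsigned offset)
    { return (uint16_t)((offset << TYPE_BITS) | type); }
    static unsigned pte_offset(uint16_t pte) { return pte >> TYPE_BITS; }

    /* radix-style: type in the high bits, offset in the low bits. */
    static uint16_t to_radix(unsigned type, unsigned offset)
    { return (uint16_t)((type << (ENTRY_BITS - TYPE_BITS)) | offset); }
    static unsigned radix_type(uint16_t e) { return e >> (ENTRY_BITS - TYPE_BITS); }

    int main(void)
    {
        /* Probing the maximum offset works with the pte-style encoding:
         * the high offset bits simply fall off in the round trip. */
        printf("pte-style max offset = %u\n", pte_offset(to_pte(0, ~0u)));

        /* The radix-style encoding truncates the *type* instead: swap area
         * 8 (the ninth) silently comes back as type 0. */
        printf("radix-style type 8 comes back as %u\n", radix_type(to_radix(8, 123)));
        return 0;
    }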

    Reported-and-Reviewed-and-Tested-by: Minchan Kim
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2012

1 commit

  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

30 May, 2012

2 commits

  • This patch changes memcg's behavior at task_move().

    At task_move(), the kernel scans a task's page table and moves the
    charges for mapped pages from the source cgroup to the target cgroup.
    There has been a bug in the handling of shared anonymous pages for a
    long time.

    Before the patch:
    - The spec says 'shared anonymous pages are not moved.'
    - The implementation was 'shared anonymous pages may be moved'.
    If page_mapcount
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Naoya Horiguchi
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
    backing RAM has to be below 4GB. Not a problem while the boards
    supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
    their GMA3600 is managed by the GMA500 driver.

    shmem/tmpfs has never pretended to support hardware restrictions on the
    backing memory, but it might have appeared to do so before v3.1, and
    even now it works fine until a page is swapped out then back in. When
    read_cache_page_gfp() supplied a freshly allocated page for copy, that
    compensated for whatever choice might have been made by earlier swapin
    readahead; but swapoff was likely to destroy the illusion.

    We'd like to continue to support GMA500, so now add a new
    shmem_should_replace_page() check on the zone when about to move a page
    from swapcache to filecache (in swapin and swapoff cases), with
    shmem_replace_page() to allocate and substitute a suitable page (given
    gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).

    This does involve a minor extension to mem_cgroup_replace_page_cache()
    (the page may or may not have already been charged); and I've removed a
    comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
    always a no-op while PageSwapCache.

    Also removed optimization of an unlikely path in shmem_getpage_gfp(),
    now that we need to check PageSwapCache more carefully (a racing caller
    might already have made the copy). And at one point shmem_unuse_inode()
    needs to use the hitherto private page_swapcount(), to guard against
    racing with inode eviction.

    It would make sense to extend shmem_should_replace_page(), to cover
    cpuset and NUMA mempolicy restrictions too, but set that aside for now:
    needs a cleanup of shmem mempolicy handling, and more testing, and ought
    to handle swap faults in do_swap_page() as well as shmem.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Alan Cox
    Cc: Stephane Marchesin
    Cc: Andi Kleen
    Cc: Dave Airlie
    Cc: Daniel Vetter
    Cc: Rob Clark
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 May, 2012

1 commit

  • This patch, 2of4, contains the changes to the core swap subsystem.
    This includes:

    (1) makes available core swap data structures (swap_lock, swap_list
    and swap_info) that are needed by frontswap.c. We don't need to expose
    them to the dozens of files that include swap.h, so we create a new
    swapfile.h just to extern-ify these and modify their declarations to
    non-static.

    (2) adds frontswap-related elements to swap_info_struct. Frontswap_map
    points to vzalloc'ed one-bit-per-swap-page metadata that indicates
    whether the swap page is in frontswap or in the device and frontswap_pages
    counts how many pages are in frontswap.

    (3) adds hooks in the swap subsystem and extends try_to_unuse so that
    frontswap_shrink can do a "partial swapoff".

    Note that a failed frontswap_map allocation is safe... failure is noted
    by lack of "FS" in the subsequent printk.

    ---

    [v14: rebase to 3.4-rc2]
    [v10: no change]
    [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
    [v9: akpm@linux-foundation.org: add clarifying comments]
    [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
    [v9: error27@gmail.com: remove superfluous check for NULL]
    [v8: rebase to 3.0-rc4]
    [v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
    [v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
    [v7: rebase to 3.0-rc3]
    [v7: JBeulich@novell.com: add new swap struct elements only if config'd]
    [v6: rebase to 3.0-rc1]
    [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
    [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
    [v5: no change from v4]
    [v4: rebase to 2.6.39]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Kamezawa Hiroyuki
    Acked-by: Jan Beulich
    Acked-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v11: Rebased, fixed mm/swapfile.c context change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

29 Mar, 2012

1 commit

  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
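
    The check itself is tiny; schematically, with the mask built from the
    flags that existed at the time (the mask name here is illustrative):

    #include <errno.h>
    #include <stdio.h>

    #define SWAP_FLAG_PREFER    0x8000
    #define SWAP_FLAG_PRIO_MASK 0x7fff
    #define SWAP_FLAG_DISCARD   0x10000
    #define SWAP_FLAGS_VALID    (SWAP_FLAG_DISCARD | SWAP_FLAG_PREFER | SWAP_FLAG_PRIO_MASK)

    /* Reject unknown bits up front, so userspace can probe for new flags. */
    static int check_swap_flags(int swap_flags)
    {
        if (swap_flags & ~SWAP_FLAGS_VALID)
            return -EINVAL;
        return 0;
    }

    int main(void)
    {
        printf("known flag:  %d\n", check_swap_flags(SWAP_FLAG_DISCARD));
        printf("unknown bit: %d\n", check_swap_flags(0x80000));
        return 0;
    }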

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

4 commits

  • When swapon() is not passed the SWAP_FLAG_DISCARD option, sys_swapon()
    will still perform a discard operation. This can cause problems if
    discard is slow or buggy.

    Reverse the order of the check so that a discard operation is performed
    only if the sys_swapon() caller is attempting to enable discard.

    Signed-off-by: Shaohua Li
    Reported-by: Holger Kiehl
    Tested-by: Holger Kiehl
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Ever since abandoning the virtual scan of processes, for scalability
    reasons, swap space has been a little more fragmented than before. This
    can lead to the situation where a large memory user is killed, swap space
    ends up full of "holes" and swapin readahead is totally ineffective.

    On my home system, after killing a leaky firefox it took over an hour to
    page just under 2GB of memory back in, slowing the virtual machines down
    to a crawl.

    This patch makes swapin readahead simply skip over holes, instead of
    stopping at them. This allows the system to swap things back in at rates
    of several MB/second, instead of a few hundred kB/second.

    The checks done in valid_swaphandles are already done in
    read_swap_cache_async as well, allowing us to remove a fair amount of
    code.

    [akpm@linux-foundation.org: fix it for page_cluster >= 32]
    Signed-off-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Adrian Drzewiecki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • In some cases it may happen that pmd_none_or_clear_bad() is called
    with the mmap_sem held in read mode. In those cases the huge page
    faults can allocate hugepmds under pmd_none_or_clear_bad() and that
    can trigger a false positive from pmd_bad() that will not like to see
    a pmd materializing as trans huge.

    It's not khugepaged causing the problem; khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code
    review it seems vm86 mode on 32bit kernels requires that too, unless
    it's restricted to 1 thread per process or UP builds). The race is
    only with the huge pagefaults, which can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be
    enough to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
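
    The essence of the fix, reduced to a user-space sketch with invented
    bit meanings: copy the possibly-changing value to a local once, put a
    compiler barrier after the read so it is not silently re-fetched, and
    make every subsequent decision against that local copy:

    #include <stdio.h>

    typedef unsigned long pmd_t;   /* toy stand-in for the real pmd type */

    #define barrier() __asm__ __volatile__("" ::: "memory")

    static int pmd_none(pmd_t v)       { return v == 0; }
    static int pmd_trans_huge(pmd_t v) { return v & 0x1; }          /* toy "huge" bit */
    static int pmd_bad(pmd_t v)        { return (v & 0x2) != 0; }   /* toy "bad" bit */

    /* Decide once, on a snapshot: the value under *pmdp may be changing
     * under us (a concurrent huge page fault), but we never look at it twice. */
    static int pmd_none_or_trans_huge_or_bad(pmd_t *pmdp)
    {
        pmd_t pmdval = *pmdp;
        barrier();                 /* don't let the compiler reload *pmdp below */
        if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
            return 1;              /* skip: nothing to zap at the pte level */
        if (pmd_bad(pmdval))
            return 1;              /* report/clear the bad pmd, then skip */
        return 0;                  /* safe to walk the pte level */
    }

    int main(void)
    {
        pmd_t pmd = 0;             /* not mapped yet */
        printf("skip pte walk: %d\n", pmd_none_or_trans_huge_or_bad(&pmd));
        pmd = 0x1000 | 0x1;        /* became a transparent huge mapping */
        printf("skip pte walk: %d\n", pmd_none_or_trans_huge_or_bad(&pmd));
        return 0;
    }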

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

       143 void pmd_clear_bad(pmd_t *pmd)
       144 {
    -> 145         pmd_ERROR(*pmd);
       146         pmd_clear(pmd);
       147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

    [ASCII diagram: the process virtual address space, showing a huge page
    with the madvise() range "A(range)" and the faulting address
    "B(fault)" both falling inside the same 2 MB range.]

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
        .--------->   // sneaks in here as shown below.
        |             //
        |             if (pmd_none_or_clear_bad(pmd))
        |             {
        |                 if (unlikely(pmd_bad(*pmd)))
        |                     pmd_clear_bad
        |                     {
        |                         pmd_ERROR
        |                         // Log "bad pmd ..." message here.
        |                         pmd_clear
        |                         // Clear the page's PMD entry.
        |                         // Thread B incremented the map count
        |                         // in page_add_new_anon_rmap(), but
        |                         // now the page is no longer mapped
        |                         // by a PMD entry (-> inconsistency).
        |                     }
        |             }
        |
        v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix a trivial conflict due to the jump_label->static_key rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     

20 Mar, 2012

1 commit