13 Jan, 2012

1 commit


11 Jan, 2012

1 commit

  • Colin Cross reported;

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as the compare and swap CAS instruction, or
    CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
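
The race and its fix can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: `oom_score_adj` here is a plain global rather than the kernel's per-process field, and `test_set_score()` and `compare_swap_score()` are hypothetical stand-ins, not the kernel functions themselves.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define OOM_SCORE_ADJ_MAX 1000

/* Hypothetical stand-in for the per-process oom_score_adj field. */
static _Atomic int oom_score_adj = 0;

/* Unconditionally install new_val and return the value it replaced. */
static int test_set_score(int new_val)
{
    return atomic_exchange(&oom_score_adj, new_val);
}

/*
 * Install new_val only if the current value still equals old_val.
 * Returns true when the swap happened, false when someone else
 * changed the value in the meantime.
 */
static bool compare_swap_score(int old_val, int new_val)
{
    return atomic_compare_exchange_strong(&oom_score_adj, &old_val, new_val);
}
```

A caller would save the result of `test_set_score(OOM_SCORE_ADJ_MAX)` and later call `compare_swap_score(OOM_SCORE_ADJ_MAX, saved)`. If userspace wrote a different value in between, the second call leaves that value untouched instead of overwriting it.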
     

31 Oct, 2011

1 commit


04 Aug, 2011

1 commit

  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier, was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
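
The 30-bit limit and the 128GB figure can be checked with a small userspace sketch. The bit layout below is an assumption made for illustration (a 2-bit tag in the low bits, 5 bits of type, 25 bits of page offset), and the helper names are hypothetical; the real conversions are swp_to_radix_entry() and radix_to_swp_entry().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 2 bits are a tag, and bit 1 marks "exceptional". */
#define EXCEPTIONAL_TAG   2u
#define EXCEPTIONAL_SHIFT 2u
#define OFFSET_BITS       25u  /* 30 payload bits minus 5 bits for 32 types */

/* Pack a swap type and page offset into a 32-bit tagged slot value. */
static uint32_t swp_to_slot(uint32_t type, uint32_t offset)
{
    uint32_t payload = (type << OFFSET_BITS) | offset;
    return (payload << EXCEPTIONAL_SHIFT) | EXCEPTIONAL_TAG;
}

/* Word-aligned page pointers have zero low bits, so the tag tells them apart. */
static bool slot_is_exceptional(uint32_t slot)
{
    return (slot & EXCEPTIONAL_TAG) != 0;
}

static uint32_t slot_type(uint32_t slot)
{
    return slot >> (EXCEPTIONAL_SHIFT + OFFSET_BITS);
}

static uint32_t slot_offset(uint32_t slot)
{
    return (slot >> EXCEPTIONAL_SHIFT) & ((1u << OFFSET_BITS) - 1);
}

/* 2^25 pages of 4kB each gives 2^37 bytes, i.e. 128GB per swapfile. */
static uint64_t max_swapfile_bytes(void)
{
    return ((uint64_t)1 << OFFSET_BITS) * 4096u;
}
```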
     

21 Jul, 2011

1 commit

  • Moving the event counter into the dynamically allocated 'struct seq_file'
    allows poll() support without the need to allocate its own tracking
    structure.

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
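
A rough userspace sketch of the idea: each open file carries its own copy of an event counter, and a poll-style check compares it with a shared counter. The struct, field, and function names below are hypothetical and only illustrate the pattern; they are not the seq_file API.

```c
/* Hypothetical per-open-file state, standing in for the seq_file. */
struct seq_state {
    int poll_event;   /* last event number this reader has seen */
};

/* Hypothetical shared counter, bumped whenever the underlying data changes. */
static int global_event;

/* Returns 1 if something changed since the last check, else 0. */
static int poll_changed(struct seq_state *m)
{
    if (m->poll_event != global_event) {
        m->poll_event = global_event;   /* remember what we reported */
        return 1;
    }
    return 0;
}
```

Because the counter lives in state that already exists per open file, a user of this pattern needs no separate tracking allocation.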
     

28 Jun, 2011

1 commit

  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2011

1 commit

  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • Remove the initialization of a variable in the callers of memory cgroup
    functions. The variable actually carries the return value of the memcg
    function, yet it is initialized in the caller.

    Some memory cgroup functions use the following style to carry the result
    of the start function to the end function, to avoid races.

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr has to be initialized to NULL by the caller. But that's
    ugly. This patch makes the _start function initialize *ptr itself.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
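
The start/end pattern with the out-parameter initialized by the start function might look like the following sketch. All names here (`struct group`, `group_start()`, `group_end()`) are hypothetical and stand in for the memcg functions only loosely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup. */
struct group {
    int charges;
};

static struct group default_group;

/* The start function gives *ptr a defined value on every path. */
static void group_start(int should_charge, struct group **ptr)
{
    *ptr = NULL;              /* callers no longer need to pre-set this */
    if (!should_charge)
        return;
    default_group.charges++;
    *ptr = &default_group;
}

/* The end function accepts whatever the start function produced. */
static void group_end(struct group *g)
{
    if (g)
        g->charges--;
}
```

With this shape a caller can declare `struct group *g;` without an initializer and pass `&g` straight to `group_start()`.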
     

23 Mar, 2011

27 commits

  • A conflict between 52c50567d8ab ("mm: swap: unlock swapfile inode mutex
    before closing file on bad swapfiles") and 83ef99befc32 ("sys_swapon:
    remove did_down variable") caused a double unlock of the inode mutex
    (once in bad_swap: before the filp_close, once at the end just before
    returning).

    The patch which added the extra unlock cleared did_down to avoid
    unlocking twice, but the other patch removed the did_down variable.

    To fix, set inode to NULL after the first unlock, since it will be used
    after that point only for the final unlock.

    While checking this patch, I found a path which could unlock without
    locking, in case the same inode was added as a swapfile twice. To fix,
    move the setting of the inode variable further down, to just before
    claim_swapfile, which will lock the inode before doing anything else.

    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Eric B Munson
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrew Morton
    Signed-off-by: Cesar Eduardo Barros
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
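
The first fix can be modelled in a few lines: after the early unlock, the local pointer is cleared so the common exit path skips its own unlock. This is a simplified sketch; `struct inode` here is a hypothetical counter-based stand-in, not the kernel structure.

```c
#include <stddef.h>

/* Hypothetical inode that only tracks lock/unlock balance. */
struct inode {
    int lock_depth;
};

static void inode_lock(struct inode *i)   { i->lock_depth++; }
static void inode_unlock(struct inode *i) { i->lock_depth--; }

/* Models an error path followed by the common exit path. */
static int swapon_error_path(struct inode *target)
{
    struct inode *inode = target;

    inode_lock(inode);

    /* Error path: unlock before closing the file ... */
    inode_unlock(inode);
    inode = NULL;   /* ... and forget the inode so it is not unlocked again */

    /* Exit path: final unlock only if the inode is still set. */
    if (inode)
        inode_unlock(inode);

    return target->lock_depth;   /* 0 means lock and unlock are balanced */
}
```

Without the `inode = NULL` assignment the exit path would unlock a second time and the balance would drop to -1.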
     
  • scan_swap_map() is a large function (224 lines), with several loops and a
    complex control flow involving several gotos.

    Given all that, it is a bit silly that it is marked as inline. The
    compiler agrees with me: on an x86-64 compile, it did not inline the
    function.

    Remove the "inline" and let the compiler decide instead.

    Signed-off-by: Cesar Eduardo Barros
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
    this code to a separate function, and use it both in sys_swapon and
    sys_swapoff.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
    for the order of the operations within the lock. Since the order should
    not matter, arbitrarily change sys_swapoff to match sys_swapon, in
    preparation to making both share the same code.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
    able to make both share the same code, move the printk() call in the
    middle of it to just after it.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • It still exists within setup_swap_map_and_extents(), but after it
    nr_good_pages == p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
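
The same idea recurs in several commits of this series. A minimal sketch, using a hypothetical `setup()` function rather than the actual sys_swapon code: return directly while nothing has been acquired, and keep the label only for paths that have something to release.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical function with an early check and a later allocation. */
static int setup(int have_permission, size_t len)
{
    char *buf;
    int error;

    if (!have_permission)
        return -EPERM;        /* nothing acquired yet: return directly */

    buf = malloc(len ? len : 1);
    if (!buf)
        return -ENOMEM;       /* still nothing to clean up */

    if (len == 0) {
        error = -EINVAL;
        goto out_free;        /* buf must be released: use the label */
    }
    error = 0;

out_free:
    free(buf);
    return error;
}
```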
     
  • Move the code which parses the bad block list and the extents to a
    separate function. Only code movement, no functional changes.

    This change uses the fact that, after the success path, nr_good_pages ==
    p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The call to swap_cgroup_swapon is in the middle of loading the swap map
    and extents. As it only does memory allocation and does not depend on
    the swapfile layout (map/extents), it can be called earlier (or later).

    Move it to just after the allocation of swap_map, since it is
    conceptually similar (allocates a map).

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which parses and checks the swapfile header (except for
    the bad block list) to a separate function. Only code movement, no
    functional changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • There is no reason I can see to read inode->i_size long before it is
    needed. Move its read to just before it is needed, to reduce the
    variable lifetime.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Jesper Juhl
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which claims the bdev (S_ISBLK) or locks the inode
    (S_ISREG) to a separate function. Only code movement, no functional
    changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • sys_swapon currently has two error labels, bad_swap and bad_swap_2.
    bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
    swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
    jumped to. With a single extra test for inode (matching the one in the
    S_ISREG case below), all the error paths in the function can go to
    bad_swap.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The only way error is 0 in the cleanup blocks is when the function is
    returning successfully. In this case, the cleanup blocks were setting
    S_SWAPFILE in the S_ISREG case. But this is not a cleanup.

    Move the setting of S_SWAPFILE to just before the "goto out;" to make
    this more clear. At this point, we do not need to test for inode because
    it will never be NULL.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
    p->bdev : NULL), as long as it being set is moved to a bit earlier. Use
    this fact to remove the bdev variable.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the setting of the error variable nearer the goto in a few places.

    Avoids calling PTR_ERR() if not IS_ERR() in two places, and makes the
    error condition more explicit in two other places.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Jesper Juhl
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
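
A userspace sketch of the pattern: the error code is extracted only inside the IS_ERR() branch, right next to the goto. The three helpers below are simplified re-creations of the kernel's error-pointer helpers written for this example, and `lookup()` and `use_lookup()` are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace re-creations of the error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int object;

/* Hypothetical lookup that returns either a valid pointer or an error. */
static void *lookup(int ok)
{
    return ok ? (void *)&object : ERR_PTR(-ENOENT);
}

static int use_lookup(int ok)
{
    void *p = lookup(ok);
    int error;

    if (IS_ERR(p)) {
        error = (int)PTR_ERR(p);   /* set the error right before the goto */
        goto out;
    }
    error = 0;
out:
    return error;
}
```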
     
  • Since mutex_lock(&inode->i_mutex) is called just after setting inode,
    did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).

    Use this fact to remove the did_down variable.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Now there is nothing which jumps to the cleanup blocks before the name
    variable is set. There is no need to set it initially to NULL anymore.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • At this point in sys_swapon, there is nothing to free. Return directly
    instead of jumping to the cleanup block at the end of the function.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the swap_info allocation to its own function. Only code movement,
    no functional changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Within sys_swapon, after the swap_info entry has been allocated, we
    always have type == p->type and swap_info[type] == p. Use this fact to
    reduce the dependency on the "type" local variable within the function,
    as a preparation to move the allocation of the swap_info entry to a
    separate function.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Changelogs belong in the git history instead of in the source code.

    Also, "The swapon system call" is redundant with
    "SYSCALL_DEFINE2(swapon, ...)".

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: Jesper Juhl
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    [ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • This patch series refactors the sys_swapon function.

    sys_swapon is currently a very large function, with 313 lines (more than
    12 25-line screens), which can make it a bit hard to read. This patch
    series reduces this size by half, by extracting large chunks of related
    code to new helper functions.

    One of these chunks of code was nearly identical to the part of
    sys_swapoff which is used in case of a failure return from
    try_to_unuse(), so this patch series also makes both share the same
    code.

    As a side effect of all this refactoring, the compiled code gets a bit
    smaller (from v1 of this patch series):

       text    data     bss     dec     hex filename
      14012     944     276   15232    3b80 mm/swapfile.o.before
      13941     944     276   15161    3b39 mm/swapfile.o.after

    This patch:

    Use vzalloc() instead of vmalloc/memset.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: Jesper Juhl
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
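
The vmalloc/memset to vzalloc() change has a close userspace analogue in replacing malloc() plus memset() with calloc(). The sketch below uses that analogue and hypothetical function names; it does not call any kernel allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear in a second step. */
static unsigned char *alloc_map_two_step(size_t n)
{
    unsigned char *map = malloc(n);

    if (map)
        memset(map, 0, n);
    return map;
}

/* After: a single call that returns already-zeroed memory. */
static unsigned char *alloc_map_zeroed(size_t n)
{
    return calloc(n, 1);
}
```

Both functions hand back the same contents; the second form removes one call and one place where the size could be passed inconsistently.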
     
  • If an administrator tries to swapon a file backed by NFS, the inode mutex is
    taken (as it is for any swapfile) but later identified to be a bad swapfile
    due to the lack of bmap and tries to cleanup. During cleanup, an attempt is
    made to close the file but with inode->i_mutex still held. Closing an NFS
    file syncs it which tries to acquire the inode mutex leading to deadlock. If
    lockdep is enabled the following appears on the console;

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.38-rc8-autobuild #1
    ---------------------------------------------
    swapon/2192 is trying to acquire lock:
    (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    other info that might help us debug this:
    1 lock held by swapon/2192:
    #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    stack backtrace:
    Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
    Call Trace:
    __lock_acquire+0x2eb/0x1623
    find_get_pages_tag+0x14a/0x174
    pagevec_lookup_tag+0x25/0x2e
    vfs_fsync_range+0x47/0x7c
    lock_acquire+0xd3/0x100
    vfs_fsync_range+0x47/0x7c
    nfs_flush_one+0x0/0xdf [nfs]
    mutex_lock_nested+0x40/0x2b1
    vfs_fsync_range+0x47/0x7c
    vfs_fsync_range+0x47/0x7c
    vfs_fsync+0x1c/0x1e
    nfs_file_flush+0x64/0x69 [nfs]
    filp_close+0x43/0x72
    sys_swapon+0xa39/0xae7
    sysret_check+0x2e/0x69
    system_call_fastpath+0x16/0x1b

    This patch releases the mutex if it's held before calling filp_close()
    so swapon fails as expected without deadlock when the swapfile is backed
    by NFS. If accepted for 2.6.39, it should also be considered a -stable
    candidate for 2.6.38 and 2.6.37.

    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
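
The deadlock and its fix can be modelled without real threads. In this sketch the lock is a boolean flag, `try_lock()` reports failure where a real non-recursive mutex would block forever, and `close_file()` stands in for a close that syncs the file and therefore needs the inode lock. All names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical inode with a non-recursive lock modelled as a flag. */
struct inode {
    bool locked;
};

/* Returns false where a real non-recursive mutex would self-deadlock. */
static bool try_lock(struct inode *i)
{
    if (i->locked)
        return false;
    i->locked = true;
    return true;
}

static void unlock(struct inode *i)
{
    i->locked = false;
}

/* Models a close that syncs the file and so takes the inode lock itself. */
static bool close_file(struct inode *i)
{
    if (!try_lock(i))
        return false;   /* the caller still holds the lock: deadlock */
    unlock(i);
    return true;
}

/* Models the failure path of swapon on a bad swapfile. */
static bool fail_swapon(struct inode *i, bool unlock_first)
{
    bool ok;

    try_lock(i);            /* swapon takes the inode lock early on */
    if (unlock_first)
        unlock(i);          /* the fix: release before closing */
    ok = close_file(i);
    if (!unlock_first)
        unlock(i);          /* old order: release only after the close */
    return ok;
}
```

Calling `fail_swapon()` with `unlock_first` set to false reproduces the failing order; with it set to true the close succeeds.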
     

10 Mar, 2011

2 commits