19 Jun, 2009

40 commits

  • In theory it is not safe to dereference ->parent/real_parent without
    tasklist or rcu lock, because we can race with re-parenting.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The forked child can have TIF_SIGPENDING if it was copied from parent's
    ti->flags. But this is harmless and actually almost never happens,
    because copy_process() can't succeed if signal_pending() == T.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no reason to call thread_group_cputime() in wait_task_zombie(),
    because there can be no other threads at that point.

    This call was previously needed to collect the per-cpu data which we do
    not have any longer.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change ptrace_getsiginfo/ptrace_setsiginfo to use lock_task_sighand()
    without tasklist_lock. Perhaps it makes sense to make a single helper
    with "bool rw" argument.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the non-traced sub-thread calls do_notify_parent_cldstop(), we send the
    notification to group_leader->real_parent and we report group_leader's
    pid.

    But, if group_leader is traced we use the wrong ->parent->nsproxy->pid_ns,
    the tracer and parent can live in different namespaces. Change the code
    to use "parent" instead of tsk->parent.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change wait_task_zombie() to use ->real_parent instead of ->parent. We
    could even use current afaics, but ->real_parent is more clean.

    We know that the child is not ptrace_reparented() and thus they are equal.
    But we should avoid using task_struct->parent, we are going to remove it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - Use rcu_read_lock() instead of tasklist_lock to find/get the task
    in ptrace_get_task_struct().

    - Make it static, it has no callers outside of ptrace.c.

    - The comment doesn't match the reality, this helper does not do
    any checks. Because it is really trivial and static, I removed the
    whole comment.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove the "Nasty, nasty" lock dance in ptrace_attach()/ptrace_traceme().
    From now on, task_lock() has nothing to do with ptrace at all.

    With the recent changes nobody uses task_lock() to serialize with ptrace,
    but in fact it was never needed and it was never used consistently.

    However ptrace_attach() calls __ptrace_may_access() and needs task_lock()
    to pin task->mm for get_dumpable(). But we can call __ptrace_may_access()
    before we take tasklist_lock, ->cred_exec_mutex protects us against
    do_execve() path which can change creds and MMF_DUMP* flags.

    (ugly, but we can't use ptrace_may_access() because it hides the error
    code, so we have to take task_lock() and use __ptrace_may_access()).

    NOTE: this change assumes that LSM hooks, security_ptrace_may_access() and
    security_ptrace_traceme(), can be called without task_lock() held.

    Signed-off-by: Oleg Nesterov
    Cc: Chris Wright
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ptrace_attach() and ptrace_traceme() are the last functions which look as
    if the untraced task can have task->ptrace != 0, this must not be
    possible. Change the code to just check ->ptrace != 0 and s/|=/=/ to set
    PT_PTRACED.

    Also, a couple of trivial whitespace cleanups in ptrace_attach().

    And move ptrace_traceme() up near ptrace_attach() to keep them close to
    each other.

    Signed-off-by: Oleg Nesterov
    Cc: Chris Wright
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - Add PF_KTHREAD check to prevent attaching to the kernel thread
    with a borrowed ->mm.

    With or without this change we can race with daemonize() which
    can set PF_KTHREAD or clear ->mm after ptrace_attach() does the
    check, but this doesn't matter because reparent_to_kthreadd()
    does ptrace_unlink().

    - Kill "!task->mm" check. We don't really care about ->mm != NULL,
    and the task can call exit_mm() right after we drop task_lock().
    What we need is to make sure we can't attach after exit_notify(),
    check task->exit_state != 0 instead.

    Also, move the "already traced" check down for cosmetic reasons.

    Signed-off-by: Oleg Nesterov
    Cc: Chris Wright
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    - Nobody except ptrace.c & co should use ptrace flags directly, we have
    task_ptrace() for that.

    - No need to specially check PT_PTRACED, we must not have other PT_ bits
    set without PT_PTRACED. And no need to know this flag exists.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tracehook_unsafe_exec() doesn't need task_lock(), remove the old comment.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "Search in the siblings" should use ->real_parent, not ->parent. If the
    task is traced then ->parent == tracer, while the task's parent is always
    ->real_parent.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m32r: PTRACE_SINGLESTEP sets PT_DTRACE, but the flag is never used except
    to be cleared after do_execve().

    Signed-off-by: Oleg Nesterov
    Acked-by: Hirokazu Takata
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m68k sets PT_DTRACE in trap_c() but never uses it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Geert Uytterhoeven
    Acked-by: Greg Ungerer
    Cc: Roman Zippel
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • avr32, mn10300, parisc, s390, sh, xtensa:

    They never set PT_DTRACE, but clear it after do_execve().

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Acked-by: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Acked-by: Martin Schwidefsky
    Cc: Heiko Carstens
    Acked-by: Paul Mundt
    Acked-by: Chris Zankel
    Acked-by: Roland McGrath
    Acked-by: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • h8300 defines PT_DTRACE for asm but never uses it.

    DEFINE(PT_PTRACED, PT_PTRACED) seems to be unused too.

    Signed-off-by: Oleg Nesterov
    Acked-by: Yoshinori Sato
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • allow_signal() checks ->mm == NULL. Not sure why. Perhaps to make sure
    current is the kernel thread. But this helper must not be used unless we
    are the kernel thread, kill this check.

    Also, document the fact that the CLONE_SIGHAND kthread must not use
    allow_signal(), unless the caller really wants to change the parent's
    ->sighand->action as well.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Try to fix memcg's lru rotation sanity: make memcg use the same logic as
    the global LRU does.

    Currently, when __isolate_lru_page() returns -EBUSY, the global LRU's
    isolate-LRU-pages path rotates the page to the tail of the LRU. But memcg
    does not handle this case. This patch makes memcg behave the same way as
    the global LRU and rotate the page in the LRU when the page is busy.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • We don't have an interface to reset mem.limit or memsw.limit now.

    This patch allows mem.limit or memsw.limit to be reset by setting them
    to -1.

    Signed-off-by: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
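
    The "-1 means reset" convention described above can be sketched in plain
    C as follows. This is only an illustration under stated assumptions:
    parse_limit() and LIMIT_UNLIMITED are hypothetical names, and the kernel's
    real limit parsing is not reproduced here.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel standing in for "no limit". */
#define LIMIT_UNLIMITED ULLONG_MAX

/*
 * Parse a limit written by the user. The literal string "-1" resets the
 * limit to "unlimited"; any other value must be a plain decimal number.
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_limit(const char *buf, unsigned long long *limit)
{
    char *end;

    if (strcmp(buf, "-1") == 0) {
        *limit = LIMIT_UNLIMITED;
        return 0;
    }

    /* Reject empty input, signs and anything else that is not a digit. */
    if (*buf < '0' || *buf > '9')
        return -EINVAL;

    errno = 0;
    *limit = strtoull(buf, &end, 10);
    if (errno != 0 || *end != '\0')
        return -EINVAL;

    return 0;
}
```

    In this sketch only the exact string "-1" is special; other negative
    values are rejected rather than silently wrapped to a huge unsigned value.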
     
  • A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the
    user just wants to limit the total size of applications, in other words,
    when the user is not very interested in memory usage itself. In this case,
    swap-out will be done only by global-LRU.

    But under the current implementation, memory.limit_in_bytes is checked
    first and try_to_free_page() may do swap-out. That swap-out is useless
    for memsw.limit_in_bytes, and the thread may hit the limit again.

    This patch tries to fix the current behavior in the memory.limit ==
    memsw.limit case. The documentation is also updated to explain the
    behavior of this special case.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    swap is completely freed. But there are several cases where swap cannot
    be freed cleanly. To handle that, this patch makes memcg uncharge the
    swap account when the swap entry has no references other than the cache.

    By this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This forward declaration seems pointless.

    Signed-off-by: Li Zefan
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • We don't need to check do_swap_account in functions that will never get
    called when do_swap_account == 0.

    Signed-off-by: Li Zefan
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • mem_cgroup_cache_charge_swapin() isn't used any more, so remove no-op
    definition of it in header file.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Add file RSS tracking per memory cgroup

    We currently don't track file RSS; the RSS we report is actually anon RSS.
    All file-mapped pages come in through the page cache and get accounted
    there. This patch adds support for accounting file RSS pages.
    It should

    1. Help improve the metrics reported by the memory resource controller
    2. Form the basis for a future shared memory accounting heuristic
    that has been proposed by Kamezawa.

    Unfortunately, we cannot rename the existing "rss" keyword used in
    memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope
    to educate the end user through documentation.

    [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • While walking through the whitelist, if the DEV_ALL item is found, no
    further checks are needed.

    Signed-off-by: Li Zefan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
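
    The early exit described above can be sketched as a standalone C
    function. Everything here is an assumption made for illustration: struct
    wl_item, whitelist_allows() and the enum values are hypothetical and are
    not the device cgroup's actual definitions. The visited counter exists
    only to make the short-circuit observable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical item types; DEV_ALL matches every device. */
enum dev_type { DEV_BLOCK = 1, DEV_CHAR = 2, DEV_ALL = 4 };

struct wl_item {
    enum dev_type type;
    unsigned int major;
    unsigned int minor;
};

/*
 * Walk the whitelist and report whether the device is allowed.
 * *visited is set to the number of items examined before returning.
 */
bool whitelist_allows(const struct wl_item *wl, size_t n,
                      enum dev_type type, unsigned int major,
                      unsigned int minor, size_t *visited)
{
    size_t i;

    *visited = 0;
    for (i = 0; i < n; i++) {
        (*visited)++;

        /* A DEV_ALL item allows everything: stop walking right away. */
        if (wl[i].type == DEV_ALL)
            return true;

        if (wl[i].type == type && wl[i].major == major &&
            wl[i].minor == minor)
            return true;
    }
    return false;
}
```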
     
  • The 'noprefix' option was introduced for backwards-compatibility of
    cpuset, but actually it can be used when mounting other subsystems.

    This results in possibility of name collision, and now the collision can
    really happen, because we have 'stat' file in both memory and cpuacct
    subsystem:

    # mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt

    Cgroup will happily mount the 2 subsystems, but only 'stat' file of memory
    subsys can be seen.

    We don't want users to use noprefix, and we also want to avoid name
    collisions, so noprefix is now allowed only when mounting just the cpuset
    subsystem.

    [akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32]
    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
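
    The "noprefix only for cpuset alone" rule can be sketched as a bitmask
    comparison. This is a simplified illustration, not the cgroup mount code:
    the subsystem ids and the names subsys_bit() and noprefix_allowed() are
    hypothetical. The shift is done on an unsigned long constant so that it
    is carried out at the width of the mask rather than at the width of int,
    which is the kind of problem the shift fix noted in the changelog is
    about.

```c
#include <stdbool.h>

/* Hypothetical subsystem ids, for illustration only. */
enum subsys_id { CPUSET_ID = 0, CPUACCT_ID = 1, MEMORY_ID = 2 };

/* Build the mask bit for one subsystem; 1UL keeps the shift in unsigned long. */
unsigned long subsys_bit(enum subsys_id id)
{
    return 1UL << id;
}

/* noprefix is acceptable only when cpuset is the sole subsystem requested. */
bool noprefix_allowed(unsigned long subsys_bits)
{
    return subsys_bits == subsys_bit(CPUSET_ID);
}
```

    With this check, a request such as "noprefix,memory,cpuacct" from the
    example above is refused, while "noprefix,cpuset" is still accepted.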
     
  • Fix some cgroup messages to read better.
    Update MAINTAINERS to include mm/*cgroup* files.

    Signed-off-by: Randy Dunlap
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Currently cn_test_want_notify() has no user.

    So add an ifdef and a comment which tells us to not remove it.

    Signed-off-by: Jaswinder Singh Rajput
    Acked-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • Perl is used by the kernel Makefile to generate documentation, firmware
    in C source form, sources, graphs, and some headers, and this fact is
    undocumented.

    [akpm@linux-foundation.org: 80-columns, please]
    Signed-off-by: Jose Luis Perez Diez
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jose Luis Perez Diez
     
  • Several code paths in reiserfs have a construct like:

    if (is_direntry_le_ih(ih = B_N_PITEM_HEAD(src, item_num))) ...

    which, in addition to being ugly, ends up causing compiler warnings with
    gcc 4.4.0. Previous compilers didn't issue a warning.

    fs/reiserfs/do_balan.c:1273: warning: operation on `aux_ih' may be undefined
    fs/reiserfs/lbalance.c:393: warning: operation on `ih' may be undefined
    fs/reiserfs/lbalance.c:421: warning: operation on `ih' may be undefined
    fs/reiserfs/lbalance.c:777: warning: operation on `ih' may be undefined

    I believe this is due to the ih being passed to macros which evaluate the
    argument more than once. This is old code and we haven't seen any
    problems with it, but this patch eliminates the warnings.

    It converts the multiple evaluation macros to static inlines and does a
    preassignment for the cases that were causing the warnings because that
    code is just ugly.

    Reported-by: Chris Mason
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
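
    The multiple-evaluation problem described above can be shown with a
    small standalone sketch. It does not use the reiserfs definitions: struct
    item_head, item_head_at() and both predicates are hypothetical stand-ins.
    A call counter takes the place of the embedded assignment so that the
    duplicated evaluation is visible without relying on undefined behavior.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an on-disk item header. */
struct item_head {
    int type;
    int version;
};

static struct item_head table[2] = { { 1, 2 }, { 3, 1 } };
static int lookups;

/* Lookup with a visible side effect: counts how often it is called. */
static struct item_head *item_head_at(int n)
{
    lookups++;
    return &table[n];
}

/* Macro form: the argument expression is expanded, and evaluated, twice. */
#define IS_DIRENTRY_MACRO(ih) ((ih)->version > 0 && (ih)->type == 1)

/* Inline form: the argument is evaluated once, at the call site. */
static inline bool is_direntry_inline(const struct item_head *ih)
{
    return ih->version > 0 && ih->type == 1;
}

int lookups_with_macro(void)
{
    lookups = 0;
    (void)IS_DIRENTRY_MACRO(item_head_at(0));
    return lookups;
}

int lookups_with_inline(void)
{
    struct item_head *ih;

    lookups = 0;
    ih = item_head_at(0);   /* pre-assignment, as the patch describes */
    (void)is_direntry_inline(ih);
    return lookups;
}
```

    When the argument is an assignment such as "ih = ..." instead of a
    function call, the macro expansion repeats the assignment, which is the
    pattern the compiler warnings quoted above point at.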
     
  • The unsigned i_block and fragment cannot be negative.

    Signed-off-by: Roel Kluin
    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
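
    A short sketch of why a "less than zero" test on an unsigned variable
    can never fire. The helper names as_unsigned() and fragment_in_range()
    are hypothetical and are not taken from the ufs code; the point is only
    that a negative value converted to unsigned becomes a very large value,
    so it is the upper bound check that catches it.

```c
#include <limits.h>
#include <stdbool.h>

/* Converting a negative int to unsigned wraps modulo UINT_MAX + 1. */
unsigned int as_unsigned(int v)
{
    return (unsigned int)v;
}

/*
 * For an unsigned value, the only meaningful range check is against the
 * upper bound; a separate "fragment < 0" test would be dead code.
 */
bool fragment_in_range(unsigned int fragment, unsigned int count)
{
    return fragment < count;
}
```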
     
  • Remove unused variables from isofs_sb_info (used to be some mount
    options), unify variables for option to use 0/1 (some options used
    'y'/'n'), use bit fields for option flags in superblock.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • isofs allows setting of default uid and gid of files but value 0 was used
    to indicate that user did not specify any uid/gid mount option. Since
    this option also overrides uid/gid set in Rock Ridge extension, it makes
    sense to allow forcing uid/gid 0. Fix option processing to allow this.

    Cc:
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
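
    The idea of no longer treating 0 as "not specified" can be sketched with
    an explicit flag stored as a bit field. The structure and function names
    below (struct iso_opts, set_uid_option(), effective_uid()) are
    hypothetical and only illustrate the approach; they are not the isofs
    option-parsing code.

```c
/* Hypothetical mount options: uid_set records whether uid= was given. */
struct iso_opts {
    unsigned int uid_set:1;
    unsigned int uid;
};

/* Record an explicit uid= option; uid 0 is a valid choice. */
void set_uid_option(struct iso_opts *opts, unsigned int uid)
{
    opts->uid = uid;
    opts->uid_set = 1;
}

/*
 * Pick the owner for a file: the mount option wins when it was given,
 * otherwise the value from the Rock Ridge extension is used.
 */
unsigned int effective_uid(const struct iso_opts *opts,
                           unsigned int rock_ridge_uid)
{
    return opts->uid_set ? opts->uid : rock_ridge_uid;
}
```

    Because the decision depends on uid_set rather than on the value of uid,
    forcing uid 0 behaves the same way as forcing any other uid.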
     
  • So far, permissions set via 'mode' and/or 'dmode' mount options were
    effective only if the medium had no rock ridge extensions (or was mounted
    without them). Add 'overriderockmode' mount option to indicate that these
    options should override permissions set in rock ridge extensions. Maybe
    this should be default but the current behavior is there since mount
    options were created so I think we should not change how they behave.

    Cc:
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • As Ted pointed out, it can happen that ext3_truncate() returns without
    removing inode from orphan list. This way we could in some rare cases
    (like when we get ENOMEM from an allocation in ext3_truncate called
    because of failed ext3_write_begin) leave the inode on orphan list and
    that triggers assertion failure on umount.

    So make ext3_truncate() always remove inode from in-memory orphan list.

    Cc: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • I am deleting the following patch:

    commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
    Author: Mingming Cao
    Date: Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction

    This patch is no longer needed: if the race between freeing a buffer and
    committing a transaction occurs and dio gets an error, dio now falls back
    to buffered IO thanks to the following patch.

    commit 6ccfa806a9cfbbf1cd43d5b6aa47ef2c0eb518fd
    Author: Hisashi Hifumi
    Date: Tue Sep 2 14:35:40 2008 -0700

    VFS: fix dio write returning EIO when try_to_release_page fails

    Signed-off-by: Hisashi Hifumi
    Cc: Theodore Tso
    Cc: Mingming Cao
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • Chain verification in ext3_get_blocks() has been hosed since it called
    verify_chain(chain, NULL) which always returns success. As a result
    readers could in theory race with truncate. On the other hand the race
    probably cannot happen with the current locking scheme, since by the
    time ext3_truncate() is called all the pages are already removed and
    hence get_block() shouldn't be called on such pages...

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • ext2.txt says that dirs can have 32,768 subdirs, but the actual value of
    EXT2_LINK_MAX is 32000.

    ext3 is the same, but the doc does not mention it. One of ext4's features
    is to "fix 32000 subdirectory limit".

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Shields