Commit ad4ecbcba72855a2b5319b96e2a3a65ed1ca3bfd

Authored by Shailabh Nagar
Committed by Linus Torvalds
1 parent 2589045466

[PATCH] delay accounting taskstats interface send tgid once

Send per-tgid data only once during exit of a thread group instead of once
with each member thread exit.

Currently, when a thread exits, besides its per-tid data, the per-tgid data
of its thread group is also sent out, if its thread group is non-empty.
The per-tgid data sent consists of the sum of per-tid stats for all
*remaining* threads of the thread group.

This patch modifies this sending in two ways:

- the per-tgid data is sent only when the last thread of a thread group
  exits.  This cuts down heavily on the overhead of sending/receiving
  per-tgid data, especially when other exploiters of the taskstats
  interface aren't interested in per-tgid stats

- the semantics of the per-tgid data sent are changed.  Instead of being
  the sum of per-tid data for remaining threads, the value now sent is the
  true total accumulated statistics for all threads that are/were part of
  the thread group.
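
The new semantics can be modelled outside the kernel with a short sketch.
This is only an illustration under simplifying assumptions, not the kernel
code: struct stats, struct group, thread_exit() and group_query() are
hypothetical names, the two counters stand in for the much larger struct
taskstats, and locking is left out. The real patch also keeps the running
total only for groups that have had more than one thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct taskstats. */
struct stats {
	uint64_t cpu_delay_total;
	uint64_t blkio_delay_total;
};

/* Hypothetical model of a thread group. */
struct group {
	struct stats dead_total;	/* sum over threads that already exited */
	int live;			/* threads that have not exited yet */
};

/* Add every counter of src into dst. */
static void stats_add(struct stats *dst, const struct stats *src)
{
	dst->cpu_delay_total += src->cpu_delay_total;
	dst->blkio_delay_total += src->blkio_delay_total;
}

/*
 * Model of one thread exiting: fold its stats into the group total.
 * Returns 1 and copies the total into *out only for the last thread,
 * which is the single point where a per-tgid record would be sent.
 */
static int thread_exit(struct group *g, const struct stats *tid,
		       struct stats *out)
{
	assert(g->live > 0);
	stats_add(&g->dead_total, tid);
	g->live--;
	if (g->live > 0)
		return 0;
	*out = g->dead_total;
	return 1;
}

/*
 * Model of an on-demand per-tgid query: start from the total of the
 * exited threads, then add the threads that are still alive.
 */
static struct stats group_query(const struct group *g,
				const struct stats *live, size_t nlive)
{
	struct stats sum = g->dead_total;
	size_t i;

	for (i = 0; i < nlive; i++)
		stats_add(&sum, &live[i]);
	return sum;
}
```

In this model a group of three threads produces no per-tgid record for the
first two exits and exactly one for the third, and that record covers all
three threads rather than only the ones still remaining.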

The patch also addresses a minor issue where the failure of one accounting
subsystem to fill in the taskstats structure prevented the taskstats data
from being sent at all.

The patch has been tested for stability by running cerberus for over 4
hours on an SMP system.

[akpm@osdl.org: bugfixes]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Showing 8 changed files with 162 additions and 81 deletions

Documentation/accounting/delay-accounting.txt
... ... @@ -48,9 +48,10 @@
48 48 experienced by the task waiting for the corresponding resource
49 49 in that interval.
50 50  
51   -When a task exits, records containing the per-task and per-process statistics
52   -are sent to userspace without requiring a command. More details are given in
53   -the taskstats interface description.
  51 +When a task exits, records containing the per-task statistics
  52 +are sent to userspace without requiring a command. If it is the last exiting
  53 +task of a thread group, the per-tgid statistics are also sent. More details
  54 +are given in the taskstats interface description.
54 55  
55 56 The getdelays.c userspace utility in this directory allows simple commands to
56 57 be run and the corresponding delay statistics to be displayed. It also serves
Documentation/accounting/taskstats.txt
... ... @@ -32,12 +32,11 @@
32 32 statistics for all tasks of the process (if tgid is specified).
33 33  
34 34 To obtain statistics for tasks which are exiting, userspace opens a multicast
35   -netlink socket. Each time a task exits, two records are sent by the kernel to
36   -each listener on the multicast socket. The first the per-pid task's statistics
37   -and the second is the sum for all tasks of the process to which the task
38   -belongs (the task does not need to be the thread group leader). The need for
39   -per-tgid stats to be sent for each exiting task is explained in the per-tgid
40   -stats section below.
  35 +netlink socket. Each time a task exits, its per-pid statistics is always sent
  36 +by the kernel to each listener on the multicast socket. In addition, if it is
  37 +the last thread exiting its thread group, an additional record containing the
  38 +per-tgid stats are also sent. The latter contains the sum of per-pid stats for
  39 +all threads in the thread group, both past and present.
41 40  
42 41 getdelays.c is a simple utility demonstrating usage of the taskstats interface
43 42 for reporting delay accounting statistics.
44 43  
... ... @@ -104,20 +103,14 @@
104 103 of atomicity).
105 104  
106 105 However, maintaining per-process, in addition to per-task stats, within the
107   -kernel has space and time overheads. Hence the taskstats implementation
108   -dynamically sums up the per-task stats for each task belonging to a process
109   -whenever per-process stats are needed.
  106 +kernel has space and time overheads. To address this, the taskstats code
  107 +accumalates each exiting task's statistics into a process-wide data structure.
  108 +When the last task of a process exits, the process level data accumalated also
  109 +gets sent to userspace (along with the per-task data).
110 110  
111   -Not maintaining per-tgid stats creates a problem when userspace is interested
112   -in getting these stats when the process dies i.e. the last thread of
113   -a process exits. It isn't possible to simply return some aggregated per-process
114   -statistic from the kernel.
115   -
116   -The approach taken by taskstats is to return the per-tgid stats *each* time
117   -a task exits, in addition to the per-pid stats for that task. Userspace can
118   -maintain task<->process mappings and use them to maintain the per-process stats
119   -in userspace, updating the aggregate appropriately as the tasks of a process
120   -exit.
  111 +When a user queries to get per-tgid data, the sum of all other live threads in
  112 +the group is added up and added to the accumalated total for previously exited
  113 +threads of the same thread group.
121 114  
122 115 Extending taskstats
123 116 -------------------
MAINTAINERS
... ... @@ -2240,6 +2240,12 @@
2240 2240 L: netdev@vger.kernel.org
2241 2241 S: Maintained
2242 2242  
  2243 +PER-TASK DELAY ACCOUNTING
  2244 +P: Shailabh Nagar
  2245 +M: nagar@watson.ibm.com
  2246 +L: linux-kernel@vger.kernel.org
  2247 +S: Maintained
  2248 +
2243 2249 PERSONALITY HANDLING
2244 2250 P: Christoph Hellwig
2245 2251 M: hch@infradead.org
... ... @@ -2765,6 +2771,12 @@
2765 2771 TI OMAP RANDOM NUMBER GENERATOR SUPPORT
2766 2772 P: Deepak Saxena
2767 2773 M: dsaxena@plexity.net
  2774 +S: Maintained
  2775 +
  2776 +TASKSTATS STATISTICS INTERFACE
  2777 +P: Shailabh Nagar
  2778 +M: nagar@watson.ibm.com
  2779 +L: linux-kernel@vger.kernel.org
2768 2780 S: Maintained
2769 2781  
2770 2782 TI PARALLEL LINK CABLE DRIVER
include/linux/sched.h
... ... @@ -463,6 +463,10 @@
463 463 #ifdef CONFIG_BSD_PROCESS_ACCT
464 464 struct pacct_struct pacct; /* per-process accounting information */
465 465 #endif
  466 +#ifdef CONFIG_TASKSTATS
  467 + spinlock_t stats_lock;
  468 + struct taskstats *stats;
  469 +#endif
466 470 };
467 471  
468 472 /* Context switch must be unlocked if interrupts are to be enabled */
include/linux/taskstats_kern.h
... ... @@ -19,36 +19,75 @@
19 19 extern kmem_cache_t *taskstats_cache;
20 20 extern struct mutex taskstats_exit_mutex;
21 21  
22   -static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
23   - struct taskstats **ptgidstats)
  22 +static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
24 23 {
25 24 *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
26   - *ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
27 25 }
28 26  
29   -static inline void taskstats_exit_free(struct taskstats *tidstats,
30   - struct taskstats *tgidstats)
  27 +static inline void taskstats_exit_free(struct taskstats *tidstats)
31 28 {
32 29 if (tidstats)
33 30 kmem_cache_free(taskstats_cache, tidstats);
34   - if (tgidstats)
35   - kmem_cache_free(taskstats_cache, tgidstats);
36 31 }
37 32  
38   -extern void taskstats_exit_send(struct task_struct *, struct taskstats *,
39   - struct taskstats *);
40   -extern void taskstats_init_early(void);
  33 +static inline void taskstats_tgid_init(struct signal_struct *sig)
  34 +{
  35 + spin_lock_init(&sig->stats_lock);
  36 + sig->stats = NULL;
  37 +}
41 38  
  39 +static inline void taskstats_tgid_alloc(struct signal_struct *sig)
  40 +{
  41 + struct taskstats *stats;
  42 + unsigned long flags;
  43 +
  44 + stats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
  45 + if (!stats)
  46 + return;
  47 +
  48 + spin_lock_irqsave(&sig->stats_lock, flags);
  49 + if (!sig->stats) {
  50 + sig->stats = stats;
  51 + stats = NULL;
  52 + }
  53 + spin_unlock_irqrestore(&sig->stats_lock, flags);
  54 +
  55 + if (stats)
  56 + kmem_cache_free(taskstats_cache, stats);
  57 +}
  58 +
  59 +static inline void taskstats_tgid_free(struct signal_struct *sig)
  60 +{
  61 + struct taskstats *stats = NULL;
  62 + unsigned long flags;
  63 +
  64 + spin_lock_irqsave(&sig->stats_lock, flags);
  65 + if (sig->stats) {
  66 + stats = sig->stats;
  67 + sig->stats = NULL;
  68 + }
  69 + spin_unlock_irqrestore(&sig->stats_lock, flags);
  70 + if (stats)
  71 + kmem_cache_free(taskstats_cache, stats);
  72 +}
  73 +
  74 +extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int);
  75 +extern void taskstats_init_early(void);
  76 +extern void taskstats_tgid_alloc(struct signal_struct *);
42 77 #else
43   -static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
44   - struct taskstats **ptgidstats)
  78 +static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
45 79 {}
46   -static inline void taskstats_exit_free(struct taskstats *ptidstats,
47   - struct taskstats *ptgidstats)
  80 +static inline void taskstats_exit_free(struct taskstats *ptidstats)
48 81 {}
49 82 static inline void taskstats_exit_send(struct task_struct *tsk,
50   - struct taskstats *tidstats,
51   - struct taskstats *tgidstats)
  83 + struct taskstats *tidstats,
  84 + int group_dead)
  85 +{}
  86 +static inline void taskstats_tgid_init(struct signal_struct *sig)
  87 +{}
  88 +static inline void taskstats_tgid_alloc(struct signal_struct *sig)
  89 +{}
  90 +static inline void taskstats_tgid_free(struct signal_struct *sig)
52 91 {}
53 92 static inline void taskstats_init_early(void)
54 93 {}
kernel/exit.c
... ... @@ -845,7 +845,7 @@
845 845 fastcall NORET_TYPE void do_exit(long code)
846 846 {
847 847 struct task_struct *tsk = current;
848   - struct taskstats *tidstats, *tgidstats;
  848 + struct taskstats *tidstats;
849 849 int group_dead;
850 850  
851 851 profile_task_exit(tsk);
... ... @@ -884,7 +884,7 @@
884 884 current->comm, current->pid,
885 885 preempt_count());
886 886  
887   - taskstats_exit_alloc(&tidstats, &tgidstats);
  887 + taskstats_exit_alloc(&tidstats);
888 888  
889 889 acct_update_integrals(tsk);
890 890 if (tsk->mm) {
... ... @@ -905,8 +905,8 @@
905 905 #endif
906 906 if (unlikely(tsk->audit_context))
907 907 audit_free(tsk);
908   - taskstats_exit_send(tsk, tidstats, tgidstats);
909   - taskstats_exit_free(tidstats, tgidstats);
  908 + taskstats_exit_send(tsk, tidstats, group_dead);
  909 + taskstats_exit_free(tidstats);
910 910 delayacct_tsk_exit(tsk);
911 911  
912 912 exit_mm(tsk);
kernel/fork.c
... ... @@ -44,6 +44,7 @@
44 44 #include <linux/acct.h>
45 45 #include <linux/cn_proc.h>
46 46 #include <linux/delayacct.h>
  47 +#include <linux/taskstats_kern.h>
47 48  
48 49 #include <asm/pgtable.h>
49 50 #include <asm/pgalloc.h>
... ... @@ -819,6 +820,7 @@
819 820 if (clone_flags & CLONE_THREAD) {
820 821 atomic_inc(&current->signal->count);
821 822 atomic_inc(&current->signal->live);
  823 + taskstats_tgid_alloc(current->signal);
822 824 return 0;
823 825 }
824 826 sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
... ... @@ -863,6 +865,7 @@
863 865 INIT_LIST_HEAD(&sig->cpu_timers[0]);
864 866 INIT_LIST_HEAD(&sig->cpu_timers[1]);
865 867 INIT_LIST_HEAD(&sig->cpu_timers[2]);
  868 + taskstats_tgid_init(sig);
866 869  
867 870 task_lock(current->group_leader);
868 871 memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
... ... @@ -884,6 +887,7 @@
884 887 void __cleanup_signal(struct signal_struct *sig)
885 888 {
886 889 exit_thread_group_keys(sig);
  890 + taskstats_tgid_free(sig);
887 891 kmem_cache_free(signal_cachep, sig);
888 892 }
889 893  
kernel/taskstats.c
... ... @@ -132,46 +132,79 @@
132 132 static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
133 133 struct taskstats *stats)
134 134 {
135   - int rc;
136 135 struct task_struct *tsk, *first;
  136 + unsigned long flags;
137 137  
  138 + /*
  139 + * Add additional stats from live tasks except zombie thread group
  140 + * leaders who are already counted with the dead tasks
  141 + */
138 142 first = tgidtsk;
139   - read_lock(&tasklist_lock);
140 143 if (!first) {
  144 + read_lock(&tasklist_lock);
141 145 first = find_task_by_pid(tgid);
142 146 if (!first) {
143 147 read_unlock(&tasklist_lock);
144 148 return -ESRCH;
145 149 }
146   - }
  150 + get_task_struct(first);
  151 + read_unlock(&tasklist_lock);
  152 + } else
  153 + get_task_struct(first);
  154 +
  155 + /* Start with stats from dead tasks */
  156 + spin_lock_irqsave(&first->signal->stats_lock, flags);
  157 + if (first->signal->stats)
  158 + memcpy(stats, first->signal->stats, sizeof(*stats));
  159 + spin_unlock_irqrestore(&first->signal->stats_lock, flags);
  160 +
147 161 tsk = first;
  162 + read_lock(&tasklist_lock);
148 163 do {
  164 + if (tsk->exit_state == EXIT_ZOMBIE && thread_group_leader(tsk))
  165 + continue;
149 166 /*
150   - * Each accounting subsystem adds calls its functions to
  167 + * Accounting subsystem can call its functions here to
151 168 * fill in relevant parts of struct taskstsats as follows
152 169 *
153   - * rc = per-task-foo(stats, tsk);
154   - * if (rc)
155   - * break;
  170 + * per-task-foo(stats, tsk);
156 171 */
  172 + delayacct_add_tsk(stats, tsk);
157 173  
158   - rc = delayacct_add_tsk(stats, tsk);
159   - if (rc)
160   - break;
161   -
162 174 } while_each_thread(first, tsk);
163 175 read_unlock(&tasklist_lock);
164 176 stats->version = TASKSTATS_VERSION;
165 177  
166   -
167 178 /*
168   - * Accounting subsytems can also add calls here if they don't
169   - * wish to aggregate statistics for per-tgid stats
  179 + * Accounting subsytems can also add calls here to modify
  180 + * fields of taskstats.
170 181 */
171 182  
172   - return rc;
  183 + return 0;
173 184 }
174 185  
  186 +
  187 +static void fill_tgid_exit(struct task_struct *tsk)
  188 +{
  189 + unsigned long flags;
  190 +
  191 + spin_lock_irqsave(&tsk->signal->stats_lock, flags);
  192 + if (!tsk->signal->stats)
  193 + goto ret;
  194 +
  195 + /*
  196 + * Each accounting subsystem calls its functions here to
  197 + * accumalate its per-task stats for tsk, into the per-tgid structure
  198 + *
  199 + * per-task-foo(tsk->signal->stats, tsk);
  200 + */
  201 + delayacct_add_tsk(tsk->signal->stats, tsk);
  202 +ret:
  203 + spin_unlock_irqrestore(&tsk->signal->stats_lock, flags);
  204 + return;
  205 +}
  206 +
  207 +
175 208 static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
176 209 {
177 210 int rc = 0;
... ... @@ -230,7 +263,7 @@
230 263  
231 264 /* Send pid data out on exit */
232 265 void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats,
233   - struct taskstats *tgidstats)
  266 + int group_dead)
234 267 {
235 268 int rc;
236 269 struct sk_buff *rep_skb;
... ... @@ -238,13 +271,16 @@
238 271 size_t size;
239 272 int is_thread_group;
240 273 struct nlattr *na;
  274 + unsigned long flags;
241 275  
242 276 if (!family_registered || !tidstats)
243 277 return;
244 278  
245   - is_thread_group = !thread_group_empty(tsk);
246   - rc = 0;
  279 + spin_lock_irqsave(&tsk->signal->stats_lock, flags);
  280 + is_thread_group = tsk->signal->stats ? 1 : 0;
  281 + spin_unlock_irqrestore(&tsk->signal->stats_lock, flags);
247 282  
  283 + rc = 0;
248 284 /*
249 285 * Size includes space for nested attributes
250 286 */
... ... @@ -268,30 +304,28 @@
268 304 *tidstats);
269 305 nla_nest_end(rep_skb, na);
270 306  
271   - if (!is_thread_group || !tgidstats) {
272   - send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
273   - goto ret;
274   - }
  307 + if (!is_thread_group)
  308 + goto send;
275 309  
276   - rc = fill_tgid(tsk->pid, tsk, tgidstats);
277 310 /*
278   - * If fill_tgid() failed then one probable reason could be that the
279   - * thread group leader has exited. fill_tgid() will fail, send out
280   - * the pid statistics collected earlier.
  311 + * tsk has/had a thread group so fill the tsk->signal->stats structure
  312 + * Doesn't matter if tsk is the leader or the last group member leaving
281 313 */
282   - if (rc < 0) {
283   - send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
284   - goto ret;
285   - }
286 314  
  315 + fill_tgid_exit(tsk);
  316 + if (!group_dead)
  317 + goto send;
  318 +
287 319 na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
288 320 NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
  321 + /* No locking needed for tsk->signal->stats since group is dead */
289 322 NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
290   - *tgidstats);
  323 + *tsk->signal->stats);
291 324 nla_nest_end(rep_skb, na);
292 325  
  326 +send:
293 327 send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
294   - goto ret;
  328 + return;
295 329  
296 330 nla_put_failure:
297 331 genlmsg_cancel(rep_skb, reply);