From 614db49c72db5ecf85ef94fe8bad7ebc88744ba9 Mon Sep 17 00:00:00 2001 From: Jiapeng Chong Date: Sat, 8 May 2021 18:37:16 +0800 Subject: tracing: Remove redundant assignment to event_var Variable event_var is set to 'ERR_PTR(-EINVAL)', but this value is never read as it is overwritten or not used later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: kernel/trace/trace_events_hist.c:2437:21: warning: Value stored to 'event_var' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Link: https://lkml.kernel.org/r/1620470236-26562-1-git-send-email-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index c1abd63f1d6c..dacd6fe0f60c 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -2434,12 +2434,12 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data, char *subsys_name, char *event_name, char *field_name) { struct trace_array *tr = target_hist_data->event_file->tr; - struct hist_field *event_var = ERR_PTR(-EINVAL); struct hist_trigger_data *hist_data; unsigned int i, n, first = true; struct field_var_hist *var_hist; struct trace_event_file *file; struct hist_field *key_field; + struct hist_field *event_var; char *saved_filter; char *cmd; int ret; -- cgit v1.2.3 From 957cdcd9bd7e035dcf0f29e4124b8021ea2ed696 Mon Sep 17 00:00:00 2001 From: Wei Ming Chen Date: Tue, 11 May 2021 22:02:46 +0800 Subject: ring-buffer: Use fallthrough pseudo-keyword Replace /* fall through */ comment with pseudo-keyword macro fallthrough[1] [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Link: https://lkml.kernel.org/r/20210511140246.18868-1-jj251510319013@gmail.com Signed-off-by: Wei Ming Chen Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 2c0ee6484990..d1463eac11a3 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3391,7 +3391,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer, case RINGBUF_TYPE_PADDING: if (event->time_delta == 1) break; - /* fall through */ + fallthrough; case RINGBUF_TYPE_DATA: ts += event->time_delta; break; -- cgit v1.2.3 From 08b0c9b4b922ccd1b7b54589942492cfa686214e Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Thu, 13 May 2021 12:55:17 +0100 Subject: tracing: Remove redundant initialization of variable ret The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Link: https://lkml.kernel.org/r/20210513115517.58178-1-colin.king@canonical.com Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9299057feb56..2e592885a167 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6145,7 +6145,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr, ssize_t tracing_resize_ring_buffer(struct trace_array *tr, unsigned long size, int cpu_id) { - int ret = size; + int ret; mutex_lock(&trace_types_lock); -- cgit v1.2.3 From 099dcc1801d981260aee9496dbeb55270dca70c1 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Sat, 15 May 2021 10:57:35 +0000 Subject: tracing: Fix set_named_trigger_data() kernel-doc comment Fix the description of the parameters. Link: https://lkml.kernel.org/r/20210515105735.52785-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_trigger.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c index b8bfa8505b7b..cf84d0f6583a 100644 --- a/kernel/trace/trace_events_trigger.c +++ b/kernel/trace/trace_events_trigger.c @@ -916,7 +916,8 @@ void unpause_named_trigger(struct event_trigger_data *data) /** * set_named_trigger_data - Associate common named trigger data - * @data: The trigger data of a named trigger to unpause + * @data: The trigger data to associate + * @named_data: The common named trigger to be associated * * Named triggers are sets of triggers that share a common set of * trigger data. The first named trigger registered with a given name -- cgit v1.2.3 From 6c610dba6e2beb1a16ac309672181d0090fb8d30 Mon Sep 17 00:00:00 2001 From: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Sat, 29 May 2021 15:14:23 +0900 Subject: tracing: Add WARN_ON_ONCE when returned value is negative ret is assigned return value of event_hist_trigger_func, but the value is unused. It is better to warn when returned value is negative, rather than just ignoring it. Link: https://lkml.kernel.org/r/20210529061423.GA103954@hyeyoo Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index dacd6fe0f60c..ba03b7d84fc2 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -5232,6 +5232,7 @@ static void unregister_field_var_hists(struct hist_trigger_data *hist_data) cmd = hist_data->field_var_hists[i]->cmd; ret = event_hist_trigger_func(&trigger_hist_cmd, file, "!hist", "hist", cmd); + WARN_ON_ONCE(ret < 0); } } -- cgit v1.2.3 From 4f99f8489950c03c792f17ca2d55cbb591286174 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Wed, 2 Jun 2021 23:33:00 +0900 Subject: tracing/boot: Add per-group/all events enablement Add ftrace.event..enable and ftrace.event.enable boot-time tracing, which enables all events under given GROUP and all events respectivly. Link: https://lkml.kernel.org/r/162264438005.302580.12019174481201855444.stgit@devnote2 Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_boot.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_boot.c b/kernel/trace/trace_boot.c index a82f03f385f8..94ef2d099e32 100644 --- a/kernel/trace/trace_boot.c +++ b/kernel/trace/trace_boot.c @@ -225,14 +225,37 @@ static void __init trace_boot_init_events(struct trace_array *tr, struct xbc_node *node) { struct xbc_node *gnode, *enode; + bool enable, enable_all = false; + const char *data; node = xbc_node_find_child(node, "event"); if (!node) return; /* per-event key starts with "event.GROUP.EVENT" */ - xbc_node_for_each_child(node, gnode) - xbc_node_for_each_child(gnode, enode) + xbc_node_for_each_child(node, gnode) { + data = xbc_node_get_data(gnode); + if (!strcmp(data, "enable")) { + enable_all = true; + continue; + } + enable = false; + xbc_node_for_each_child(gnode, enode) { + data = xbc_node_get_data(enode); + if (!strcmp(data, "enable")) { + enable = true; + continue; + } trace_boot_init_one_event(tr, gnode, enode); + } + /* Event enablement must be done after event settings */ + if (enable) { + data = xbc_node_get_data(gnode); + trace_array_set_clr_event(tr, data, NULL, true); + } + } + /* Ditto */ + if (enable_all) + trace_array_set_clr_event(tr, NULL, NULL, true); } #else #define trace_boot_enable_events(tr, node) do {} while (0) -- cgit v1.2.3 From faa76a6c289f43c88affcb292bc02870921d93bf Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 9 Jun 2021 18:04:58 -0400 Subject: tracing: Simplify the max length test when using the filtering temp buffer When filtering trace events, a temp buffer is used because the extra copy from the temp buffer into the ring buffer is still faster than the direct write into the ring buffer followed by a discard if the filter does not match. But the data that can be stored in the temp buffer is a PAGE_SIZE minus the ring buffer event header. The calculation of that header size is complex, but using the helper macro "struct_size()" can simplify it. Link: https://lore.kernel.org/stable/CAHk-=whKbJkuVmzb0hD3N6q7veprUrSpiBHRxVY=AffWZPtxmg@mail.gmail.com/ Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2e592885a167..a0a84ff46ecd 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2735,8 +2735,10 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, (trace_file->flags & (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) && (entry = this_cpu_read(trace_buffered_event))) { /* Try to use the per cpu buffer first */ + int max_len = PAGE_SIZE - struct_size(entry, array, 1); + val = this_cpu_inc_return(trace_buffered_event_cnt); - if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) { + if (val == 1 && likely(len <= max_len)) { trace_event_setup(entry, type, trace_ctx); entry->array[0] = len; return entry; -- cgit v1.2.3 From 8f0901cda14d3be38cd2196d8cf61cdf3b368e34 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 9 Jun 2021 18:04:59 -0400 Subject: tracing: Add better comments for the filtering temp buffer use case When filtering is enabled, the event is copied into a temp buffer instead of being written into the ring buffer directly, because the discarding of events from the ring buffer is very expensive, and doing the extra copy is much faster than having to discard most of the time. As that logic is subtle, add comments to explain in more detail to what is going on and how it works. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 36 +++++++++++++++++++++++++++++++++++- 1 file changed, 35 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index a0a84ff46ecd..a0d66a056e59 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2734,10 +2734,44 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, if (!tr->no_filter_buffering_ref && (trace_file->flags & (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) && (entry = this_cpu_read(trace_buffered_event))) { - /* Try to use the per cpu buffer first */ + /* + * Filtering is on, so try to use the per cpu buffer first. + * This buffer will simulate a ring_buffer_event, + * where the type_len is zero and the array[0] will + * hold the full length. + * (see include/linux/ring-buffer.h for details on + * how the ring_buffer_event is structured). + * + * Using a temp buffer during filtering and copying it + * on a matched filter is quicker than writing directly + * into the ring buffer and then discarding it when + * it doesn't match. That is because the discard + * requires several atomic operations to get right. + * Copying on match and doing nothing on a failed match + * is still quicker than no copy on match, but having + * to discard out of the ring buffer on a failed match. + */ int max_len = PAGE_SIZE - struct_size(entry, array, 1); val = this_cpu_inc_return(trace_buffered_event_cnt); + + /* + * Preemption is disabled, but interrupts and NMIs + * can still come in now. If that happens after + * the above increment, then it will have to go + * back to the old method of allocating the event + * on the ring buffer, and if the filter fails, it + * will have to call ring_buffer_discard_commit() + * to remove it. + * + * Need to also check the unlikely case that the + * length is bigger than the temp buffer size. + * If that happens, then the reserve is pretty much + * guaranteed to fail, as the ring buffer currently + * only allows events less than a page. But that may + * change in the future, so let the ring buffer reserve + * handle the failure in that case. + */ if (val == 1 && likely(len <= max_len)) { trace_event_setup(entry, type, trace_ctx); entry->array[0] = len; -- cgit v1.2.3 From f38601368f4a0c2a9f859511768dc3957e2e1769 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 17 Jun 2021 10:51:02 -0400 Subject: tracing: Add tp_printk_stop_on_boot option Add a kernel command line option that disables printing of events to console at late_initcall_sync(). This is useful when needing to see specific events written to console on boot up, but not wanting it when user space starts, as user space may make the console so noisy that the system becomes inoperable. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 40 +++++++++++++++++++++++++++++----------- 1 file changed, 29 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index a0d66a056e59..bbc63ac5b47f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -86,6 +86,7 @@ void __init disable_tracing_selftest(const char *reason) /* Pipe tracepoints to printk */ struct trace_iterator *tracepoint_print_iter; int tracepoint_printk; +static bool tracepoint_printk_stop_on_boot __initdata; static DEFINE_STATIC_KEY_FALSE(tracepoint_printk_key); /* For tracers that don't implement custom flags */ @@ -256,6 +257,13 @@ static int __init set_tracepoint_printk(char *str) } __setup("tp_printk", set_tracepoint_printk); +static int __init set_tracepoint_printk_stop(char *str) +{ + tracepoint_printk_stop_on_boot = true; + return 1; +} +__setup("tp_printk_stop_on_boot", set_tracepoint_printk_stop); + unsigned long long ns2usecs(u64 nsec) { nsec += 500; @@ -9578,6 +9586,8 @@ static __init int tracer_init_tracefs(void) return 0; } +fs_initcall(tracer_init_tracefs); + static int trace_panic_handler(struct notifier_block *this, unsigned long event, void *unused) { @@ -9998,7 +10008,7 @@ void __init trace_init(void) trace_event_init(); } -__init static int clear_boot_tracer(void) +__init static void clear_boot_tracer(void) { /* * The default tracer at boot buffer is an init section. @@ -10008,26 +10018,21 @@ __init static int clear_boot_tracer(void) * about to be freed. */ if (!default_bootup_tracer) - return 0; + return; printk(KERN_INFO "ftrace bootup tracer '%s' not registered.\n", default_bootup_tracer); default_bootup_tracer = NULL; - - return 0; } -fs_initcall(tracer_init_tracefs); -late_initcall_sync(clear_boot_tracer); - #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK -__init static int tracing_set_default_clock(void) +__init static void tracing_set_default_clock(void) { /* sched_clock_stable() is determined in late_initcall */ if (!trace_boot_clock && !sched_clock_stable()) { if (security_locked_down(LOCKDOWN_TRACEFS)) { pr_warn("Can not set tracing clock due to lockdown\n"); - return -EPERM; + return; } printk(KERN_WARNING @@ -10037,8 +10042,21 @@ __init static int tracing_set_default_clock(void) "on the kernel command line\n"); tracing_set_clock(&global_trace, "global"); } +} +#else +static inline void tracing_set_default_clock(void) { } +#endif +__init static int late_trace_init(void) +{ + if (tracepoint_printk && tracepoint_printk_stop_on_boot) { + static_key_disable(&tracepoint_printk_key.key); + tracepoint_printk = 0; + } + + tracing_set_default_clock(); + clear_boot_tracer(); return 0; } -late_initcall_sync(tracing_set_default_clock); -#endif + +late_initcall_sync(late_trace_init); -- cgit v1.2.3 From 2db7ab6b4c962e2499c86e8fe9cb1369ebaf91d1 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 17 Jun 2021 16:20:41 -0400 Subject: tracing: Have ftrace_dump_on_oops kernel parameter take numbers The kernel parameter for ftrace_dump_on_oops can take a single assignment. That is, it can be: ftrace_dump_on_oops or ftrace_dump_on_oops=orig_cpu But the content in the sysctl file is a number. 0 for disabled 1 for ftrace_dump_on_oops (all CPUs) 2 for ftrace_dump_on_oops (orig CPU) Allow the kernel command line to take a number as well to match the sysctl numbers. That is: ftrace_dump_on_oops=1 is the same as ftrace_dump_on_oops and ftrace_dump_on_oops=2 is the same as ftrace_dump_on_oops=orig_cpu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index bbc63ac5b47f..d352fb4b7709 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -197,12 +197,12 @@ __setup("ftrace=", set_cmdline_ftrace); static int __init set_ftrace_dump_on_oops(char *str) { - if (*str++ != '=' || !*str) { + if (*str++ != '=' || !*str || !strcmp("1", str)) { ftrace_dump_on_oops = DUMP_ALL; return 1; } - if (!strcmp("orig_cpu", str)) { + if (!strcmp("orig_cpu", str) || !strcmp("2", str)) { ftrace_dump_on_oops = DUMP_ORIG; return 1; } -- cgit v1.2.3 From bb1b24cf41b5b3b96a921f80f9799e7be75f167d Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:19 +0200 Subject: trace/hwlat: Fix Clark's email Clark's email is williams@redhat.com. No functional change. Link: https://lkml.kernel.org/r/6fa4b49e17ab8a1ff19c335ab7cde38d8afb0e29.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index 632ef88131a9..0a5635401125 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -34,7 +34,7 @@ * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. * - * Includes useful feedback from Clark Williams + * Includes useful feedback from Clark Williams * */ #include -- cgit v1.2.3 From 8fa826b7344d6752f5cfd72380d9fe7bd8c6b928 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:20 +0200 Subject: trace/hwlat: Implement the mode config option Provides the "mode" config to the hardware latency detector. hwlatd has two different operation modes. The default mode is the "round-robin" one, in which a single hwlatd thread runs, migrating among the allowed CPUs in a "round-robin" fashion. This is the current behavior. The "none" sets the allowed cpumask for a single hwlatd thread at the startup, but skips the round-robin, letting the scheduler handle the migration. In preparation to the per-cpu mode. Link: https://lkml.kernel.org/r/f3b1271262aa030c680e26615c1b9b2d71e55e92.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 179 ++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 167 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index 0a5635401125..43a436d85a01 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -59,6 +59,14 @@ static struct task_struct *hwlat_kthread; static struct dentry *hwlat_sample_width; /* sample width us */ static struct dentry *hwlat_sample_window; /* sample window us */ +static struct dentry *hwlat_thread_mode; /* hwlat thread mode */ + +enum { + MODE_NONE = 0, + MODE_ROUND_ROBIN, + MODE_MAX +}; +static char *thread_mode_str[] = { "none", "round-robin" }; /* Save the previous tracing_thresh value */ static unsigned long save_tracing_thresh; @@ -96,11 +104,16 @@ static struct hwlat_data { u64 sample_window; /* total sampling window (on+off) */ u64 sample_width; /* active sampling portion of window */ + int thread_mode; /* thread mode */ + } hwlat_data = { .sample_window = DEFAULT_SAMPLE_WINDOW, .sample_width = DEFAULT_SAMPLE_WIDTH, + .thread_mode = MODE_ROUND_ROBIN }; +static bool hwlat_busy; + static void trace_hwlat_sample(struct hwlat_sample *sample) { struct trace_array *tr = hwlat_trace; @@ -328,7 +341,8 @@ static int kthread_fn(void *data) while (!kthread_should_stop()) { - move_to_next_cpu(); + if (hwlat_data.thread_mode == MODE_ROUND_ROBIN) + move_to_next_cpu(); local_irq_disable(); get_sample(); @@ -351,7 +365,7 @@ static int kthread_fn(void *data) return 0; } -/** +/* * start_kthread - Kick off the hardware latency sampling/detector kthread * * This starts the kernel thread that will sit and sample the CPU timestamp @@ -366,11 +380,6 @@ static int start_kthread(struct trace_array *tr) if (hwlat_kthread) return 0; - /* Just pick the first CPU on first iteration */ - get_online_cpus(); - cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask); - put_online_cpus(); - next_cpu = cpumask_first(current_mask); kthread = kthread_create(kthread_fn, NULL, "hwlatd"); if (IS_ERR(kthread)) { @@ -378,8 +387,19 @@ static int start_kthread(struct trace_array *tr) return -ENOMEM; } - cpumask_clear(current_mask); - cpumask_set_cpu(next_cpu, current_mask); + + /* Just pick the first CPU on first iteration */ + get_online_cpus(); + cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask); + put_online_cpus(); + + if (hwlat_data.thread_mode == MODE_ROUND_ROBIN) { + next_cpu = cpumask_first(current_mask); + cpumask_clear(current_mask); + cpumask_set_cpu(next_cpu, current_mask); + + } + sched_setaffinity(kthread->pid, current_mask); hwlat_kthread = kthread; @@ -388,7 +408,7 @@ static int start_kthread(struct trace_array *tr) return 0; } -/** +/* * stop_kthread - Inform the hardware latency sampling/detector kthread to stop * * This kicks the running hardware latency sampling/detector kernel thread and @@ -511,6 +531,129 @@ hwlat_window_write(struct file *filp, const char __user *ubuf, return cnt; } +static void *s_mode_start(struct seq_file *s, loff_t *pos) +{ + int mode = *pos; + + mutex_lock(&hwlat_data.lock); + + if (mode >= MODE_MAX) + return NULL; + + return pos; +} + +static void *s_mode_next(struct seq_file *s, void *v, loff_t *pos) +{ + int mode = ++(*pos); + + if (mode >= MODE_MAX) + return NULL; + + return pos; +} + +static int s_mode_show(struct seq_file *s, void *v) +{ + loff_t *pos = v; + int mode = *pos; + + if (mode == hwlat_data.thread_mode) + seq_printf(s, "[%s]", thread_mode_str[mode]); + else + seq_printf(s, "%s", thread_mode_str[mode]); + + if (mode != MODE_MAX) + seq_puts(s, " "); + + return 0; +} + +static void s_mode_stop(struct seq_file *s, void *v) +{ + seq_puts(s, "\n"); + mutex_unlock(&hwlat_data.lock); +} + +static const struct seq_operations thread_mode_seq_ops = { + .start = s_mode_start, + .next = s_mode_next, + .show = s_mode_show, + .stop = s_mode_stop +}; + +static int hwlat_mode_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &thread_mode_seq_ops); +}; + +static void hwlat_tracer_start(struct trace_array *tr); +static void hwlat_tracer_stop(struct trace_array *tr); + +/** + * hwlat_mode_write - Write function for "mode" entry + * @filp: The active open file structure + * @ubuf: The user buffer that contains the value to write + * @cnt: The maximum number of bytes to write to "file" + * @ppos: The current position in @file + * + * This function provides a write implementation for the "mode" interface + * to the hardware latency detector. hwlatd has different operation modes. + * The "none" sets the allowed cpumask for a single hwlatd thread at the + * startup and lets the scheduler handle the migration. The default mode is + * the "round-robin" one, in which a single hwlatd thread runs, migrating + * among the allowed CPUs in a round-robin fashion. + */ +static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = hwlat_trace; + const char *mode; + char buf[64]; + int ret, i; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + mode = strstrip(buf); + + ret = -EINVAL; + + /* + * trace_types_lock is taken to avoid concurrency on start/stop + * and hwlat_busy. + */ + mutex_lock(&trace_types_lock); + if (hwlat_busy) + hwlat_tracer_stop(tr); + + mutex_lock(&hwlat_data.lock); + + for (i = 0; i < MODE_MAX; i++) { + if (strcmp(mode, thread_mode_str[i]) == 0) { + hwlat_data.thread_mode = i; + ret = cnt; + } + } + + mutex_unlock(&hwlat_data.lock); + + if (hwlat_busy) + hwlat_tracer_start(tr); + mutex_unlock(&trace_types_lock); + + *ppos += cnt; + + + + return ret; +} + static const struct file_operations width_fops = { .open = tracing_open_generic, .read = hwlat_read, @@ -523,6 +666,13 @@ static const struct file_operations window_fops = { .write = hwlat_window_write, }; +static const struct file_operations thread_mode_fops = { + .open = hwlat_mode_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, + .write = hwlat_mode_write +}; /** * init_tracefs - A function to initialize the tracefs interface files * @@ -558,6 +708,13 @@ static int init_tracefs(void) if (!hwlat_sample_width) goto err; + hwlat_thread_mode = trace_create_file("mode", 0644, + top_dir, + NULL, + &thread_mode_fops); + if (!hwlat_thread_mode) + goto err; + return 0; err: @@ -579,8 +736,6 @@ static void hwlat_tracer_stop(struct trace_array *tr) stop_kthread(); } -static bool hwlat_busy; - static int hwlat_tracer_init(struct trace_array *tr) { /* Only allow one instance to enable this */ -- cgit v1.2.3 From 7bb7d802af1d0b2608ef5afafcf968073a50acb7 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:21 +0200 Subject: trace/hwlat: Switch disable_migrate to mode none When in the round-robin mode, if the tracer detects a change in the hwlatd thread affinity by an external tool, e.g., taskset, the round-robin logic is disabled. The disable_migrate variable currently tracks this. With the addition of the "mode" config and the mode "none," the disable_migrate logic is equivalent to switch to the "none" mode. Hence, instead of using a hidden variable to track this behavior, switch the mode to none, informing the user about this change. Link: https://lkml.kernel.org/r/a679af672458d6b1f62252605905c5214030f247.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index 43a436d85a01..bae74b95cf55 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -286,7 +286,6 @@ out: } static struct cpumask save_cpumask; -static bool disable_migrate; static void move_to_next_cpu(void) { @@ -294,15 +293,13 @@ static void move_to_next_cpu(void) struct trace_array *tr = hwlat_trace; int next_cpu; - if (disable_migrate) - return; /* * If for some reason the user modifies the CPU affinity * of this thread, then stop migrating for the duration * of the current test. */ if (!cpumask_equal(current_mask, current->cpus_ptr)) - goto disable; + goto change_mode; get_online_cpus(); cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask); @@ -313,7 +310,7 @@ static void move_to_next_cpu(void) next_cpu = cpumask_first(current_mask); if (next_cpu >= nr_cpu_ids) /* Shouldn't happen! */ - goto disable; + goto change_mode; cpumask_clear(current_mask); cpumask_set_cpu(next_cpu, current_mask); @@ -321,8 +318,9 @@ static void move_to_next_cpu(void) sched_setaffinity(0, current_mask); return; - disable: - disable_migrate = true; + change_mode: + hwlat_data.thread_mode = MODE_NONE; + pr_info(BANNER "cpumask changed while in round-robin mode, switching to mode none\n"); } /* @@ -744,7 +742,6 @@ static int hwlat_tracer_init(struct trace_array *tr) hwlat_trace = tr; - disable_migrate = false; hwlat_data.count = 0; tr->max_latency = 0; save_tracing_thresh = tracing_thresh; -- cgit v1.2.3 From f46b16520a087e892a189db9c23ccf7e9bb5fa69 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:22 +0200 Subject: trace/hwlat: Implement the per-cpu mode Implements the per-cpu mode in which a sampling thread is created for each cpu in the "cpus" (and tracing_mask). The per-cpu mode has the potention to speed up the hwlat detection by running on multiple CPUs at the same time, at the cost of higher cpu usage with irqs disabled. Use with care. [ Changed get_cpu_data() to static. Reported-by: kernel test robot ] Link: https://lkml.kernel.org/r/ec06d0ab340e8460d293772faba19ad8a5c371aa.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 186 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 150 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index bae74b95cf55..3957b36826e2 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -54,9 +54,6 @@ static struct trace_array *hwlat_trace; #define DEFAULT_SAMPLE_WIDTH 500000 /* 0.5s */ #define DEFAULT_LAT_THRESHOLD 10 /* 10us */ -/* sampling thread*/ -static struct task_struct *hwlat_kthread; - static struct dentry *hwlat_sample_width; /* sample width us */ static struct dentry *hwlat_sample_window; /* sample window us */ static struct dentry *hwlat_thread_mode; /* hwlat thread mode */ @@ -64,18 +61,26 @@ static struct dentry *hwlat_thread_mode; /* hwlat thread mode */ enum { MODE_NONE = 0, MODE_ROUND_ROBIN, + MODE_PER_CPU, MODE_MAX }; -static char *thread_mode_str[] = { "none", "round-robin" }; +static char *thread_mode_str[] = { "none", "round-robin", "per-cpu" }; /* Save the previous tracing_thresh value */ static unsigned long save_tracing_thresh; -/* NMI timestamp counters */ -static u64 nmi_ts_start; -static u64 nmi_total_ts; -static int nmi_count; -static int nmi_cpu; +/* runtime kthread data */ +struct hwlat_kthread_data { + struct task_struct *kthread; + /* NMI timestamp counters */ + u64 nmi_ts_start; + u64 nmi_total_ts; + int nmi_count; + int nmi_cpu; +}; + +struct hwlat_kthread_data hwlat_single_cpu_data; +DEFINE_PER_CPU(struct hwlat_kthread_data, hwlat_per_cpu_data); /* Tells NMIs to call back to the hwlat tracer to record timestamps */ bool trace_hwlat_callback_enabled; @@ -112,6 +117,14 @@ static struct hwlat_data { .thread_mode = MODE_ROUND_ROBIN }; +static struct hwlat_kthread_data *get_cpu_data(void) +{ + if (hwlat_data.thread_mode == MODE_PER_CPU) + return this_cpu_ptr(&hwlat_per_cpu_data); + else + return &hwlat_single_cpu_data; +} + static bool hwlat_busy; static void trace_hwlat_sample(struct hwlat_sample *sample) @@ -149,7 +162,9 @@ static void trace_hwlat_sample(struct hwlat_sample *sample) void trace_hwlat_callback(bool enter) { - if (smp_processor_id() != nmi_cpu) + struct hwlat_kthread_data *kdata = get_cpu_data(); + + if (!kdata->kthread) return; /* @@ -158,13 +173,13 @@ void trace_hwlat_callback(bool enter) */ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) { if (enter) - nmi_ts_start = time_get(); + kdata->nmi_ts_start = time_get(); else - nmi_total_ts += time_get() - nmi_ts_start; + kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start; } if (enter) - nmi_count++; + kdata->nmi_count++; } /** @@ -176,6 +191,7 @@ void trace_hwlat_callback(bool enter) */ static int get_sample(void) { + struct hwlat_kthread_data *kdata = get_cpu_data(); struct trace_array *tr = hwlat_trace; struct hwlat_sample s; time_type start, t1, t2, last_t2; @@ -188,9 +204,8 @@ static int get_sample(void) do_div(thresh, NSEC_PER_USEC); /* modifies interval value */ - nmi_cpu = smp_processor_id(); - nmi_total_ts = 0; - nmi_count = 0; + kdata->nmi_total_ts = 0; + kdata->nmi_count = 0; /* Make sure NMIs see this first */ barrier(); @@ -260,15 +275,15 @@ static int get_sample(void) ret = 1; /* We read in microseconds */ - if (nmi_total_ts) - do_div(nmi_total_ts, NSEC_PER_USEC); + if (kdata->nmi_total_ts) + do_div(kdata->nmi_total_ts, NSEC_PER_USEC); hwlat_data.count++; s.seqnum = hwlat_data.count; s.duration = sample; s.outer_duration = outer_sample; - s.nmi_total_ts = nmi_total_ts; - s.nmi_count = nmi_count; + s.nmi_total_ts = kdata->nmi_total_ts; + s.nmi_count = kdata->nmi_count; s.count = count; trace_hwlat_sample(&s); @@ -364,21 +379,40 @@ static int kthread_fn(void *data) } /* - * start_kthread - Kick off the hardware latency sampling/detector kthread + * stop_stop_kthread - Inform the hardware latency sampling/detector kthread to stop + * + * This kicks the running hardware latency sampling/detector kernel thread and + * tells it to stop sampling now. Use this on unload and at system shutdown. + */ +static void stop_single_kthread(void) +{ + struct hwlat_kthread_data *kdata = get_cpu_data(); + struct task_struct *kthread = kdata->kthread; + + if (!kthread) + return; + + kthread_stop(kthread); + kdata->kthread = NULL; +} + + +/* + * start_single_kthread - Kick off the hardware latency sampling/detector kthread * * This starts the kernel thread that will sit and sample the CPU timestamp * counter (TSC or similar) and look for potential hardware latencies. */ -static int start_kthread(struct trace_array *tr) +static int start_single_kthread(struct trace_array *tr) { + struct hwlat_kthread_data *kdata = get_cpu_data(); struct cpumask *current_mask = &save_cpumask; struct task_struct *kthread; int next_cpu; - if (hwlat_kthread) + if (kdata->kthread) return 0; - kthread = kthread_create(kthread_fn, NULL, "hwlatd"); if (IS_ERR(kthread)) { pr_err(BANNER "could not start sampling thread\n"); @@ -400,24 +434,97 @@ static int start_kthread(struct trace_array *tr) sched_setaffinity(kthread->pid, current_mask); - hwlat_kthread = kthread; + kdata->kthread = kthread; wake_up_process(kthread); return 0; } /* - * stop_kthread - Inform the hardware latency sampling/detector kthread to stop + * stop_cpu_kthread - Stop a hwlat cpu kthread + */ +static void stop_cpu_kthread(unsigned int cpu) +{ + struct task_struct *kthread; + + kthread = per_cpu(hwlat_per_cpu_data, cpu).kthread; + if (kthread) + kthread_stop(kthread); +} + +/* + * stop_per_cpu_kthreads - Inform the hardware latency sampling/detector kthread to stop * - * This kicks the running hardware latency sampling/detector kernel thread and + * This kicks the running hardware latency sampling/detector kernel threads and * tells it to stop sampling now. Use this on unload and at system shutdown. */ -static void stop_kthread(void) +static void stop_per_cpu_kthreads(void) { - if (!hwlat_kthread) - return; - kthread_stop(hwlat_kthread); - hwlat_kthread = NULL; + unsigned int cpu; + + get_online_cpus(); + for_each_online_cpu(cpu) + stop_cpu_kthread(cpu); + put_online_cpus(); +} + +/* + * start_cpu_kthread - Start a hwlat cpu kthread + */ +static int start_cpu_kthread(unsigned int cpu) +{ + struct task_struct *kthread; + char comm[24]; + + snprintf(comm, 24, "hwlatd/%d", cpu); + + kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm); + if (IS_ERR(kthread)) { + pr_err(BANNER "could not start sampling thread\n"); + return -ENOMEM; + } + + per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread; + wake_up_process(kthread); + + return 0; +} + +/* + * start_per_cpu_kthreads - Kick off the hardware latency sampling/detector kthreads + * + * This starts the kernel threads that will sit on potentially all cpus and + * sample the CPU timestamp counter (TSC or similar) and look for potential + * hardware latencies. + */ +static int start_per_cpu_kthreads(struct trace_array *tr) +{ + struct cpumask *current_mask = &save_cpumask; + unsigned int cpu; + int retval; + + get_online_cpus(); + /* + * Run only on CPUs in which hwlat is allowed to run. + */ + cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask); + + for_each_online_cpu(cpu) + per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL; + + for_each_cpu(cpu, current_mask) { + retval = start_cpu_kthread(cpu); + if (retval) + goto out_error; + } + put_online_cpus(); + + return 0; + +out_error: + put_online_cpus(); + stop_per_cpu_kthreads(); + return retval; } /* @@ -600,7 +707,8 @@ static void hwlat_tracer_stop(struct trace_array *tr); * The "none" sets the allowed cpumask for a single hwlatd thread at the * startup and lets the scheduler handle the migration. The default mode is * the "round-robin" one, in which a single hwlatd thread runs, migrating - * among the allowed CPUs in a round-robin fashion. + * among the allowed CPUs in a round-robin fashion. The "per-cpu" mode + * creates one hwlatd thread per allowed CPU. */ static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) @@ -724,14 +832,20 @@ static void hwlat_tracer_start(struct trace_array *tr) { int err; - err = start_kthread(tr); + if (hwlat_data.thread_mode == MODE_PER_CPU) + err = start_per_cpu_kthreads(tr); + else + err = start_single_kthread(tr); if (err) pr_err(BANNER "Cannot start hwlat kthread\n"); } static void hwlat_tracer_stop(struct trace_array *tr) { - stop_kthread(); + if (hwlat_data.thread_mode == MODE_PER_CPU) + stop_per_cpu_kthreads(); + else + stop_single_kthread(); } static int hwlat_tracer_init(struct trace_array *tr) @@ -760,7 +874,7 @@ static int hwlat_tracer_init(struct trace_array *tr) static void hwlat_tracer_reset(struct trace_array *tr) { - stop_kthread(); + hwlat_tracer_stop(tr); /* the tracing threshold is static between runs */ last_tracing_thresh = tracing_thresh; -- cgit v1.2.3 From bc87cf0a08d437ea192b15f0918cb581a8698f15 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:23 +0200 Subject: trace: Add a generic function to read/write u64 values from tracefs The hwlat detector and (in preparation for) the osnoise/timerlat tracers have a set of u64 parameters that the user can read/write via tracefs. For instance, we have hwlat_detector's window and width. To reduce the code duplication, hwlat's window and width share the same read function. However, they do not share the write functions because they do different parameter checks. For instance, the width needs to be smaller than the window, while the window needs to be larger than the window. The same pattern repeats on osnoise/timerlat, and a large portion of the code was devoted to the write function. Despite having different checks, the write functions have the same structure: read a user-space buffer take the lock that protects the value check for minimum and maximum acceptable values save the value release the lock return success or error To reduce the code duplication also in the write functions, this patch provides a generic read and write implementation for u64 values that need to be within some minimum and/or maximum parameters, while (potentially) being protected by a lock. To use this interface, the structure trace_min_max_param needs to be filled: struct trace_min_max_param { struct mutex *lock; u64 *val; u64 *min; u64 *max; }; The desired value is stored on the variable pointed by *val. If *min points to a minimum acceptable value, it will be checked during the write operation. Likewise, if *max points to a maximum allowable value, it will be checked during the write operation. Finally, if *lock points to a mutex, it will be taken at the beginning of the operation and released at the end. The definition of a trace_min_max_param needs to passed as the (private) *data for tracefs_create_file(), and the trace_min_max_fops (added by this patch) as the *fops file_operations. Link: https://lkml.kernel.org/r/3e35760a7c8b5c55f16ae5ad5fc54a0e71cbe647.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/trace/trace.h | 18 +++++++++++ 2 files changed, 103 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d352fb4b7709..27bf203ef05a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7583,6 +7583,91 @@ static const struct file_operations snapshot_raw_fops = { #endif /* CONFIG_TRACER_SNAPSHOT */ +/* + * trace_min_max_write - Write a u64 value to a trace_min_max_param struct + * @filp: The active open file structure + * @ubuf: The userspace provided buffer to read value into + * @cnt: The maximum number of bytes to read + * @ppos: The current "file" position + * + * This function implements the write interface for a struct trace_min_max_param. + * The filp->private_data must point to a trace_min_max_param structure that + * defines where to write the value, the min and the max acceptable values, + * and a lock to protect the write. + */ +static ssize_t +trace_min_max_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) +{ + struct trace_min_max_param *param = filp->private_data; + u64 val; + int err; + + if (!param) + return -EFAULT; + + err = kstrtoull_from_user(ubuf, cnt, 10, &val); + if (err) + return err; + + if (param->lock) + mutex_lock(param->lock); + + if (param->min && val < *param->min) + err = -EINVAL; + + if (param->max && val > *param->max) + err = -EINVAL; + + if (!err) + *param->val = val; + + if (param->lock) + mutex_unlock(param->lock); + + if (err) + return err; + + return cnt; +} + +/* + * trace_min_max_read - Read a u64 value from a trace_min_max_param struct + * @filp: The active open file structure + * @ubuf: The userspace provided buffer to read value into + * @cnt: The maximum number of bytes to read + * @ppos: The current "file" position + * + * This function implements the read interface for a struct trace_min_max_param. + * The filp->private_data must point to a trace_min_max_param struct with valid + * data. + */ +static ssize_t +trace_min_max_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) +{ + struct trace_min_max_param *param = filp->private_data; + char buf[U64_STR_SIZE]; + int len; + u64 val; + + if (!param) + return -EFAULT; + + val = *param->val; + + if (cnt > sizeof(buf)) + cnt = sizeof(buf); + + len = snprintf(buf, sizeof(buf), "%llu\n", val); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, len); +} + +const struct file_operations trace_min_max_fops = { + .open = tracing_open_generic, + .read = trace_min_max_read, + .write = trace_min_max_write, +}; + #define TRACING_LOG_ERRS_MAX 8 #define TRACING_LOG_LOC_MAX 128 diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index cd80d046c7a5..22f8c652ef8b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -1952,4 +1952,22 @@ static inline bool is_good_name(const char *name) return true; } +/* + * This is a generic way to read and write a u64 value from a file in tracefs. + * + * The value is stored on the variable pointed by *val. The value needs + * to be at least *min and at most *max. The write is protected by an + * existing *lock. + */ +struct trace_min_max_param { + struct mutex *lock; + u64 *val; + u64 *min; + u64 *max; +}; + +#define U64_STR_SIZE 24 /* 20 digits max */ + +extern const struct file_operations trace_min_max_fops; + #endif /* _LINUX_KERNEL_TRACE_H */ -- cgit v1.2.3 From f27a1c9e1ba1e4f18f2c01e7bcbc400651ed821d Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:24 +0200 Subject: trace/hwlat: Use trace_min_max_param for width and window params Use the trace_min_max_param to reduce code duplication. No functional change. Link: https://lkml.kernel.org/r/b91accd5a7c6c14ea02d3379aae974ba22b47dd6.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 145 ++++++++------------------------------------- 1 file changed, 24 insertions(+), 121 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index 3957b36826e2..44f46bc1140f 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -527,115 +527,6 @@ out_error: return retval; } -/* - * hwlat_read - Wrapper read function for reading both window and width - * @filp: The active open file structure - * @ubuf: The userspace provided buffer to read value into - * @cnt: The maximum number of bytes to read - * @ppos: The current "file" position - * - * This function provides a generic read implementation for the global state - * "hwlat_data" structure filesystem entries. - */ -static ssize_t hwlat_read(struct file *filp, char __user *ubuf, - size_t cnt, loff_t *ppos) -{ - char buf[U64STR_SIZE]; - u64 *entry = filp->private_data; - u64 val; - int len; - - if (!entry) - return -EFAULT; - - if (cnt > sizeof(buf)) - cnt = sizeof(buf); - - val = *entry; - - len = snprintf(buf, sizeof(buf), "%llu\n", val); - - return simple_read_from_buffer(ubuf, cnt, ppos, buf, len); -} - -/** - * hwlat_width_write - Write function for "width" entry - * @filp: The active open file structure - * @ubuf: The user buffer that contains the value to write - * @cnt: The maximum number of bytes to write to "file" - * @ppos: The current position in @file - * - * This function provides a write implementation for the "width" interface - * to the hardware latency detector. It can be used to configure - * for how many us of the total window us we will actively sample for any - * hardware-induced latency periods. Obviously, it is not possible to - * sample constantly and have the system respond to a sample reader, or, - * worse, without having the system appear to have gone out to lunch. It - * is enforced that width is less that the total window size. - */ -static ssize_t -hwlat_width_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) -{ - u64 val; - int err; - - err = kstrtoull_from_user(ubuf, cnt, 10, &val); - if (err) - return err; - - mutex_lock(&hwlat_data.lock); - if (val < hwlat_data.sample_window) - hwlat_data.sample_width = val; - else - err = -EINVAL; - mutex_unlock(&hwlat_data.lock); - - if (err) - return err; - - return cnt; -} - -/** - * hwlat_window_write - Write function for "window" entry - * @filp: The active open file structure - * @ubuf: The user buffer that contains the value to write - * @cnt: The maximum number of bytes to write to "file" - * @ppos: The current position in @file - * - * This function provides a write implementation for the "window" interface - * to the hardware latency detector. The window is the total time - * in us that will be considered one sample period. Conceptually, windows - * occur back-to-back and contain a sample width period during which - * actual sampling occurs. Can be used to write a new total window size. It - * is enforced that any value written must be greater than the sample width - * size, or an error results. - */ -static ssize_t -hwlat_window_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) -{ - u64 val; - int err; - - err = kstrtoull_from_user(ubuf, cnt, 10, &val); - if (err) - return err; - - mutex_lock(&hwlat_data.lock); - if (hwlat_data.sample_width < val) - hwlat_data.sample_window = val; - else - err = -EINVAL; - mutex_unlock(&hwlat_data.lock); - - if (err) - return err; - - return cnt; -} - static void *s_mode_start(struct seq_file *s, loff_t *pos) { int mode = *pos; @@ -760,16 +651,28 @@ static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf, return ret; } -static const struct file_operations width_fops = { - .open = tracing_open_generic, - .read = hwlat_read, - .write = hwlat_width_write, +/* + * The width parameter is read/write using the generic trace_min_max_param + * method. The *val is protected by the hwlat_data lock and is upper + * bounded by the window parameter. + */ +static struct trace_min_max_param hwlat_width = { + .lock = &hwlat_data.lock, + .val = &hwlat_data.sample_width, + .max = &hwlat_data.sample_window, + .min = NULL, }; -static const struct file_operations window_fops = { - .open = tracing_open_generic, - .read = hwlat_read, - .write = hwlat_window_write, +/* + * The window parameter is read/write using the generic trace_min_max_param + * method. The *val is protected by the hwlat_data lock and is lower + * bounded by the width parameter. + */ +static struct trace_min_max_param hwlat_window = { + .lock = &hwlat_data.lock, + .val = &hwlat_data.sample_window, + .max = NULL, + .min = &hwlat_data.sample_width, }; static const struct file_operations thread_mode_fops = { @@ -802,15 +705,15 @@ static int init_tracefs(void) hwlat_sample_window = tracefs_create_file("window", 0640, top_dir, - &hwlat_data.sample_window, - &window_fops); + &hwlat_window, + &trace_min_max_fops); if (!hwlat_sample_window) goto err; hwlat_sample_width = tracefs_create_file("width", 0644, top_dir, - &hwlat_data.sample_width, - &width_fops); + &hwlat_width, + &trace_min_max_fops); if (!hwlat_sample_width) goto err; -- cgit v1.2.3 From aa892f8c887dd4331458d04de9425cde6664c694 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:25 +0200 Subject: trace/hwlat: Remove printk from sampling loop hwlat has some time operation checks on the sample loop, and it is currently using pr_err (printk) to report them. The problem is that this can lead the system to an unresponsible state due to an overflow of printk messages. This problem can be mitigated by writing the error message to the trace buffer. Remove the printk messages from the sampling loop, switching the to messages in the trace buffer. No functional change. Link: https://lkml.kernel.org/r/9d77c34869748aa105e965c769d24642914eea3a.1624372313.git.bristot@redhat.com Cc: Phil Auld Cc: Sebastian Andrzej Siewior Cc: Kate Carcia Cc: Jonathan Corbet Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Alexandre Chartre Cc: Clark Willaims Cc: John Kacur Cc: Juri Lelli Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_hwlat.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index 44f46bc1140f..a625bfdb844e 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -182,6 +182,15 @@ void trace_hwlat_callback(bool enter) kdata->nmi_count++; } +/* + * hwlat_err - report a hwlat error. + */ +#define hwlat_err(msg) ({ \ + struct trace_array *tr = hwlat_trace; \ + \ + trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_, msg); \ +}) + /** * get_sample - sample the CPU TSC and look for likely hardware latencies * @@ -225,7 +234,7 @@ static int get_sample(void) outer_diff = time_to_us(time_sub(t1, last_t2)); /* This shouldn't happen */ if (outer_diff < 0) { - pr_err(BANNER "time running backwards\n"); + hwlat_err(BANNER "time running backwards\n"); goto out; } if (outer_diff > outer_sample) @@ -237,7 +246,7 @@ static int get_sample(void) /* Check for possible overflows */ if (total < last_total) { - pr_err("Time total overflowed\n"); + hwlat_err("Time total overflowed\n"); break; } last_total = total; @@ -253,7 +262,7 @@ static int get_sample(void) /* This shouldn't happen */ if (diff < 0) { - pr_err(BANNER "time running backwards\n"); + hwlat_err(BANNER "time running backwards\n"); goto out; } -- cgit v1.2.3 From 6880c987e45172fdaca0b4c07b0990f5b3c74f70 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 25 Jun 2021 19:47:33 -0400 Subject: tracing: Add LATENCY_FS_NOTIFY to define if latency_fsnotify() is defined With the coming addition of the osnoise tracer, the configs needed to include the latency_fsnotify() has become more complex, and to keep the declaration in the header file the same as in the C file, just have the logic needed to define it in one place, and that defines LATENCY_FS_NOTIFY which will be used in the C code. Reported-by: kernel test robot Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 3 +-- kernel/trace/trace.h | 6 +++--- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 27bf203ef05a..60492464281e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1690,8 +1690,7 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt) unsigned long __read_mostly tracing_thresh; static const struct file_operations tracing_max_lat_fops; -#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \ - defined(CONFIG_FSNOTIFY) +#ifdef LATENCY_FS_NOTIFY static struct workqueue_struct *fsnotify_wq; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 22f8c652ef8b..87588d1e24ca 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -677,13 +677,13 @@ void update_max_tr_single(struct trace_array *tr, #if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \ defined(CONFIG_FSNOTIFY) +#define LATENCY_FS_NOTIFY +#endif +#ifdef LATENCY_FS_NOTIFY void latency_fsnotify(struct trace_array *tr); - #else - static inline void latency_fsnotify(struct trace_array *tr) { } - #endif #ifdef CONFIG_STACKTRACE -- cgit v1.2.3 From bce29ac9ce0bb0b0b146b687ab978378c21e9078 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Tue, 22 Jun 2021 16:42:27 +0200 Subject: trace: Add osnoise tracer In the context of high-performance computing (HPC), the Operating System Noise (*osnoise*) refers to the interference experienced by an application due to activities inside the operating system. In the context of Linux, NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the system. Moreover, hardware-related jobs can also cause noise, for example, via SMIs. The osnoise tracer leverages the hwlat_detector by running a similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing all the sources of *osnoise* during its execution. Using the same approach of hwlat, osnoise takes note of the entry and exit point of any source of interferences, increasing a per-cpu interference counter. The osnoise tracer also saves an interference counter for each source of interference. The interference counter for NMI, IRQs, SoftIRQs, and threads is increased anytime the tool observes these interferences' entry events. When a noise happens without any interference from the operating system level, the hardware noise counter increases, pointing to a hardware-related noise. In this way, osnoise can account for any source of interference. At the end of the period, the osnoise tracer prints the sum of all noise, the max single noise, the percentage of CPU available for the thread, and the counters for the noise sources. Usage Write the ASCII text "osnoise" into the current_tracer file of the tracing system (generally mounted at /sys/kernel/tracing). For example:: [root@f32 ~]# cd /sys/kernel/tracing/ [root@f32 tracing]# echo osnoise > current_tracer It is possible to follow the trace by reading the trace trace file:: [root@f32 tracing]# cat trace # tracer: osnoise # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth MAX # || / SINGLE Interference counters: # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD # | | | |||| | | | | | | | | | | <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1 <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3 <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21 <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0 <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41 <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2 <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1 <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19 In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the tracer prints a message at the end of each period for each CPU that is running an osnoise/CPU thread. The osnoise specific fields report: - The RUNTIME IN USE reports the amount of time in microseconds that the osnoise thread kept looping reading the time. - The NOISE IN US reports the sum of noise in microseconds observed by the osnoise tracer during the associated runtime. - The % OF CPU AVAILABLE reports the percentage of CPU available for the osnoise thread during the runtime window. - The MAX SINGLE NOISE IN US reports the maximum single noise observed during the runtime window. - The Interference counters display how many each of the respective interference happened during the runtime window. Note that the example above shows a high number of HW noise samples. The reason being is that this sample was taken on a virtual machine, and the host interference is detected as a hardware interference. Tracer options The tracer has a set of options inside the osnoise directory, they are: - osnoise/cpus: CPUs at which a osnoise thread will execute. - osnoise/period_us: the period of the osnoise thread. - osnoise/runtime_us: how long an osnoise thread will look for noise. - osnoise/stop_tracing_us: stop the system tracing if a single noise hi