path: root/kernel/time
2025-03-13posix-timers: Improve hash table performanceThomas Gleixner1-31/+68
Eric and Ben reported a significant performance bottleneck on the global hash, which is used to store posix timers for lookup.

Eric tried to do a lockless validation of a new timer ID before trying to insert the timer, but that does not solve the problem. For the non-contended case this is a pointless exercise and for the contended case this extra lookup just creates enough interleaving that all tasks can make progress.

There are actually two real solutions to the problem:

  1) Provide a per process (signal struct) xarray storage

  2) Implement a smarter hash like the one in the futex code

#1 works perfectly fine for most cases, but the fact that CRIU enforces a linearly increasing timer ID to restore timers makes this problematic. It's easy enough to create a sparse timer ID space, which very quickly amounts to a large chunk of memory consumed for the xarray. 2048 timers with an ID offset of 512 consume more than one megabyte of memory for the xarray storage.

#2 The main advantage of the futex hash is that it uses per hash bucket locks instead of a global hash lock. Aside of that it is scaled according to the number of CPUs at boot time. Experiments with artificial benchmarks have shown that a scaled hash with per bucket locks comes pretty close to the xarray performance and in some scenarios it performs better.

Test 1: A single process creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them:

                 mainline      Eric     newhash    xarray
    create         23 ms      23 ms       9 ms      8 ms
    getoverrun     14 ms      14 ms       5 ms      4 ms

Test 2: A single process creates 50000 timers and afterwards invokes timer_getoverrun(2) on each of them:

                 mainline      Eric     newhash    xarray
    create         98 ms     219 ms      20 ms     18 ms
    getoverrun     62 ms      62 ms      10 ms      9 ms

Test 3: A single process creates 100000 timers and afterwards invokes timer_getoverrun(2) on each of them:

                 mainline      Eric     newhash    xarray
    create        313 ms     750 ms      48 ms     33 ms
    getoverrun    261 ms     260 ms      20 ms     14 ms

Eric's changes create quite some overhead in the create() path due to the double list walk, as the main issue according to perf is the list walk itself. With 100k timers each hash bucket contains ~200 timers, which in the worst case all need to be inspected. The same problem applies for getoverrun(), where the lookup has to walk through the hash buckets to find the timer it is looking for.

The scaled hash obviously reduces hash collisions and lock contention significantly. This becomes more prominent with concurrency.

Test 4: A process creates 63 threads and all threads wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The threads are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per thread:

                 mainline      Eric     newhash    xarray
    create      180239 ms   38599 ms     579 ms    813 ms
    getoverrun    2645 ms    2642 ms      32 ms      7 ms

Test 5: A process forks 63 times and all forks wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The processes are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per process:

                 mainline      Eric     newhash    xarray
    create      157253 ms   40008 ms      83 ms     60 ms
    getoverrun    2611 ms    2614 ms      40 ms      4 ms

So clearly the reduction of lock contention with Eric's changes makes a significant difference for the create() loop, but it does not mitigate the problem of long list walks, which is clearly visible on the getoverrun() side because that is purely dominated by the lookup itself. Once the timer is found, the syscall just reads from the timer structure with no other locks or code paths involved and returns.

The reason for the difference between the thread and the fork case for the new hash and the xarray is that both suffer from contention on sighand::siglock, and the xarray additionally suffers from contention on the xarray lock on insertion.

The only case where the reworked hash slightly outperforms the xarray is a tight loop which creates and deletes timers.

Test 6: A process creates 63 threads and all threads wait on a barrier before each instance runs a loop which creates and deletes a timer 100000 times in a row. The threads are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per thread:

                 mainline      Eric     newhash    xarray
    loop           5917 ms    5897 ms    5473 ms   7846 ms

Test 7: A process forks 63 times and all forks wait on a barrier before each instance runs a loop which creates and deletes a timer 100000 times in a row. The processes are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per process:

                 mainline      Eric     newhash    xarray
    loop           5137 ms    7828 ms     891 ms    872 ms

In both tests there is not much contention on the hash, but the ucount accounting for the signal and, in the thread case, the sighand::siglock contention (plus the xarray locking) contribute dominantly to the overhead.

As the memory consumption of the xarray in the sparse ID case is significant, the scaled hash with per bucket locks seems to be the better overall option. While the xarray has faster lookup times for a large number of timers, the actual syscall usage, which requires the lookup, is not an extreme hotpath. Most applications utilize signal delivery, and all syscalls except timer_getoverrun(2) are anything but cheap.

So implement a scaled hash with per bucket locks, which offers the best tradeoff between performance and memory consumption.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de
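A minimal sketch of the scaled-hash idea described above. All names here (timer_hash_bucket, timer_buckets, timer_hashmask, hash_bucket(), timer_hash_add()) are illustrative, not the actual patch; the real implementation differs in sizing heuristics and lookup details:

    struct timer_hash_bucket {
        spinlock_t        lock;
        struct hlist_head head;
    };

    /* Allocated at boot; bucket count is a power of two scaled by CPUs */
    static struct timer_hash_bucket *timer_buckets;
    static unsigned long timer_hashmask;    /* nr_buckets - 1 */

    static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig, unsigned int id)
    {
        /* Hash on owner and ID so one process spreads across buckets */
        u32 h = jhash_2words((u32)(unsigned long)sig, id, 0);

        return &timer_buckets[h & timer_hashmask];
    }

    static void timer_hash_add(struct k_itimer *timer, unsigned int id)
    {
        struct timer_hash_bucket *bucket = hash_bucket(current->signal, id);

        /* Only this bucket is serialized, not the whole hash */
        spin_lock(&bucket->lock);
        hlist_add_head_rcu(&timer->t_hash, &bucket->head);
        spin_unlock(&bucket->lock);
    }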
2025-03-13posix-timers: Make signal_struct::next_posix_timer_id an atomic_tEric Dumazet1-9/+5
The global hash_lock protecting the posix timer hash table can be heavily contended, especially when there is an extensive linear search for a timer ID. Timer IDs are handed out by monotonically increasing next_posix_timer_id and then validating that there is no timer with the same ID in the hash table. Both operations happen with the global hash lock held. To reduce the hash lock contention the hash will be reworked to a scaled hash with per bucket locks, which requires handling the ID counter locklessly. Prepare for this by making next_posix_timer_id an atomic_t, which can be used locklessly with atomic_inc_return(). [ tglx: Adopted from Eric's series, massaged change log and simplified it ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de
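A sketch of the lockless ID handout this enables; the masking detail is an assumption for illustration:

    static int posix_timer_next_id(struct signal_struct *sig)
    {
        /*
         * atomic_inc_return() wraps to INT_MIN after INT_MAX; masking
         * with INT_MAX keeps the ID non-negative without any lock.
         */
        return atomic_inc_return(&sig->next_posix_timer_id) & INT_MAX;
    }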
2025-03-13posix-timers: Make lock_timer() use guard()Peter Zijlstra1-56/+36
The lookup and locking of posix timers requires the same repeating pattern at all usage sites:

    tmr = lock_timer(timer_id);
    if (!tmr)
        return -EINVAL;
    ....
    unlock_timer(tmr);

Solve this with a guard implementation, which works in most places out of the box except for those which need to unlock the timer inside the guard scope. The only places where this matters are timer_delete() and timer_settime(). In both cases the timer pointer needs to be preserved across the end of the scope, which is solved by storing the pointer in a variable outside of the scope.

timer_settime() also has to protect the timer with RCU before unlocking, which obviously can't use guard(rcu) before leaving the guard scope, as that guard is cleaned up before the unlock. Solve this by providing the RCU protection open coded.

[ tglx: Made it work and added change log ]

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net
Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de
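A sketch of what such a conversion looks like, assuming the kernel's cleanup.h class helpers; the class definition below is illustrative, not the patch verbatim:

    /* Class whose constructor locks the timer and whose destructor
     * unlocks it when the variable goes out of scope */
    DEFINE_CLASS(lock_timer, struct k_itimer *,
                 if (_T) unlock_timer(_T),    /* exit: end of scope */
                 lock_timer(id),              /* init */
                 timer_t id);

    SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
    {
        CLASS(lock_timer, timr)(timer_id);

        if (!timr)
            return -EINVAL;
        /* unlock_timer() now runs automatically on every return path;
         * timer_overrun_to_int() stands in for the existing helper */
        return timer_overrun_to_int(timr);
    }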
2025-03-13posix-timers: Rework timer removalThomas Gleixner1-112/+82
sys_timer_delete() and the do_exit() cleanup function itimer_delete() are doing the same thing, but have needlessly different implementations instead of sharing the code.

The other oddity of timer deletion is the fact that the timer is not invalidated before the actual deletion happens, which allows concurrent lookups to succeed. That's wrong because a timer which is in the process of being deleted should not be visible, and any actions like signal queueing, delivery and rearming should not happen once the task which invoked timer_delete() has the timer locked.

Rework the code so that:

  1) The signal queueing and delivery code ignores timers which are marked invalid

  2) The deletion implementation between sys_timer_delete() and itimer_delete() is shared

  3) The timer is invalidated and removed from the linked lists before the deletion callback of the relevant clock is invoked

That requires reworking timer_wait_running(), as it does a lookup of the timer when relocking it at the end. In case of deletion this lookup would fail due to the preceding invalidation and the wait loop would terminate prematurely. But due to the preceding invalidation the timer cannot be accessed by other tasks anymore, so there is no way that the timer has been freed after the timer lock has been dropped. Move the re-validation out of timer_wait_running() and handle it at the only other usage site, timer_settime().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx
2025-03-13posix-timers: Simplify lock/unlock_timer()Thomas Gleixner1-41/+29
Since the integration of sigqueue into the timer struct, lock_timer() is only used in task context. So taking the lock with irqsave() is no longer required. Convert it to use spin_[un]lock_irq(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de
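The before/after shape of that conversion, sketched:

    /* Before: interrupt safe variant, needed when the lock can also
     * be taken from hard interrupt context */
    unsigned long flags;

    spin_lock_irqsave(&timr->it_lock, flags);
    ....
    spin_unlock_irqrestore(&timr->it_lock, flags);

    /* After: task context only, so plain irq disable is sufficient */
    spin_lock_irq(&timr->it_lock);
    ....
    spin_unlock_irq(&timr->it_lock);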
2025-03-13posix-timers: Use guards in a few placesThomas Gleixner1-38/+30
Switch locking and RCU to guards where applicable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de
2025-03-13posix-timers: Remove SLAB_PANIC from kmem cacheThomas Gleixner1-4/+7
There is no need to panic when the posix-timer kmem_cache can't be created. timer_create() will fail with -ENOMEM and that's it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de
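Sketched, the resulting shape (remaining cache flags are an assumption):

    static int __init posixtimer_init(void)
    {
        /* No SLAB_PANIC: a failed cache creation is tolerated */
        posix_timers_cache = kmem_cache_create("posix_timers_cache",
                                               sizeof(struct k_itimer),
                                               0, SLAB_ACCOUNT, NULL);
        return 0;
    }

    /* In the timer_create() path the allocation failure then simply
     * propagates to userspace: */
    tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
    if (!tmr)
        return -ENOMEM;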
2025-03-13posix-timers: Remove a few paranoid warningsThomas Gleixner1-29/+8
Warnings about a non-initialized timer or non-existing callbacks are only useful when implementing new posix clocks, and there a NULL pointer dereference is expected anyway. :) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de
2025-03-13posix-timers: Cleanup includesThomas Gleixner1-16/+10
Remove pointless includes and sort the remaining ones alphabetically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de
2025-03-13posix-timers: Add cond_resched() to posix_timer_add() search loopEric Dumazet1-0/+1
With a large number of POSIX timers the search for a valid ID might cause a soft lockup on PREEMPT_NONE/VOLUNTARY kernels. Add cond_resched() to the loop to prevent that. [ tglx: Split out from Eric's series ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de
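Schematically (the loop body and bound are condensed from the real search loop):

    /* ID search loop in posix_timer_add(), condensed */
    for (i = 0; i < num_attempts; i++) {
        spin_lock(&hash_lock);
        /* probe a candidate ID, insert and return it on success */
        spin_unlock(&hash_lock);
        cond_resched();    /* new: give up the CPU between probes */
    }
    return -EAGAIN;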
2025-03-13posix-timers: Initialise timer before adding it to the hash tableEric Dumazet1-14/+42
A timer is only valid in the hashtable when both timer::it_signal and timer::it_id are set to their final values, but timers are added without those values being set. The timer ID is allocated when the timer is added to the hash in invalid state. The ID is taken from a monotonically increasing per process counter which wraps around after reaching INT_MAX. The hash insertion validates that there is no timer with the allocated ID in the hash table which belongs to the same process.

That opens a mostly theoretical race condition: if other threads of the same process manage to create/delete timers in rapid succession before the newly created timer is fully initialized and wrap around to the timer ID which was handed out, then a duplicate timer ID will be inserted into the hash table.

Prevent this by:

  1) Setting timer::it_id before inserting the timer into the hashtable

  2) Storing the signal pointer in timer::it_signal with bit 0 set before inserting it into the hashtable

Bit 0 acts as an invalid bit, which means that the regular lookup for sys_timer_*() will fail the comparison with the signal pointer. But the lookup on insertion masks out bit 0 and can therefore detect a timer which is not yet valid, but allocated in the hash table. Bit 0 in the pointer is cleared once the initialization of the timer is completed.

[ tglx: Fold ID and signal initialization into one patch and massage change log and comments. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de
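The tagging scheme, sketched (a condensed illustration, not the patch):

    /* Publish in "invalid" state: bit 0 set in the signal pointer */
    new_timer->it_id = (timer_t)new_id;
    WRITE_ONCE(new_timer->it_signal,
               (struct signal_struct *)((unsigned long)current->signal | 1UL));
    /* ... insert into the hash table ... */

    /* Regular sys_timer_*() lookup fails while bit 0 is set: */
    if (timer->it_signal == current->signal)
        return timer;

    /* The insertion-time duplicate check masks bit 0 out, so a not yet
     * valid timer still reserves its ID: */
    sig = (struct signal_struct *)((unsigned long)timer->it_signal & ~1UL);

    /* Once initialization is complete, clear bit 0 to make it valid */
    WRITE_ONCE(new_timer->it_signal, current->signal);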
2025-03-13posix-timers: Ensure that timer initialization is fully visibleThomas Gleixner1-7/+14
Frederic pointed out that the memory operations to initialize the timer are not guaranteed to be visible when __lock_timer() observes timer::it_signal valid under timer::it_lock:

    T0                                      T1
    ---------                               -----------
    do_timer_create()
      // A
      new_timer->.... = ....
      spin_lock(current->sighand)
      // B
      WRITE_ONCE(new_timer->it_signal, current->signal)
      spin_unlock(current->sighand)
                                            sys_timer_*()
                                              t = __lock_timer()
                                                    spin_lock(&timr->it_lock)
                                                    // observes B
                                                    if (timr->it_signal == current->signal)
                                                        return timr;
                                              if (!t)
                                                  return;
                                              // Is not guaranteed to observe A

Protect the write of timer::it_signal, which makes the timer valid, with timer::it_lock as well. This guarantees that T1 must observe the initialization A completely when it observes the valid signal pointer under timer::it_lock. sighand::siglock must still be taken to protect the signal::posix_timers list.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de
2025-03-13clocksource: Remove unnecessary strscpy() size argumentThorsten Blum1-1/+1
The size argument of strscpy() is only required when the destination pointer is not a fixed sized array or when the copy needs to be smaller than the size of the fixed sized destination array. For fixed sized destination arrays and full copies, strscpy() automatically determines the length of the destination buffer if the size argument is omitted. This makes the explicit sizeof() unnecessary. Remove it. [ tglx: Massaged change log ] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250311110624.495718-2-thorsten.blum@linux.dev
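The pattern in question, with an illustrative buffer:

    char name[32];    /* fixed sized destination array */

    /* Before: explicit, but redundant */
    strscpy(name, src, sizeof(name));

    /* After: the two-argument form derives the size from the array */
    strscpy(name, src);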
2025-03-13timer_list: Don't use %pK through printk()Thomas Weißschuh1-2/+2
This reverts commit f590308536db ("timer debug: Hide kernel addresses via %pK in /proc/timer_list").

The timer list helper SEQ_printf() uses either the real seq_printf() for procfs output or vprintk() to print to the kernel log, when invoked from SysRq-q. It uses %pK for printing pointers.

In the past %pK was preferred over %p as it would not leak raw pointer values into the kernel log. Since commit ad67b74d2469 ("printk: hash addresses printed with %p") the regular %p has been improved to avoid this issue.

Furthermore, restricted pointers ("%pK") were never meant to be used through printk(). They can still unintentionally leak raw pointers or acquire sleeping locks in atomic contexts.

Switch to the regular pointer formatting, which is safer, easier to reason about and sufficient here.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/
Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de
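The change amounts to this pattern (field name illustrative):

    /* Before: restricted pointers, problematic through printk() */
    SEQ_printf(m, "  .base:       %pK\n", base);

    /* After: %p, hashed by default since commit ad67b74d2469 */
    SEQ_printf(m, "  .base:       %p\n", base);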
2025-03-08vdso: Rework struct vdso_time_data and introduce struct vdso_clockAnna-Maria Behnsen2-7/+6
To support multiple PTP clocks, the VDSO data structure needs to be reworked. All clock specific data will end up in struct vdso_clock and in struct vdso_time_data there will be an array of VDSO clocks. Now that all preparatory changes are in place: Split the clock related struct members into a separate struct vdso_clock. Make sure all users are aware that vdso_time_data is no longer initialized as an array and vdso_clock is now the array inside vdso_data. Remove the vdso_clock define, which mapped it to vdso_time_data for the transition. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-19-c1b5c69a166f@linutronix.de
2025-03-08time/namespace: Prepare introduction of struct vdso_clockAnna-Maria Behnsen1-8/+10
To support multiple PTP clocks, the VDSO data structure needs to be reworked. All clock specific data will end up in struct vdso_clock and in struct vdso_time_data there will be an array of VDSO clocks. At the moment, vdso_clock is simply a define which maps vdso_clock to vdso_time_data. To prepare for the rework of the data structures, replace the struct vdso_time_data pointer with a struct vdso_clock pointer where applicable. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-14-c1b5c69a166f@linutronix.de
2025-03-08vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock structAnna-Maria Behnsen1-3/+3
To support multiple PTP clocks, the VDSO data structure needs to be reworked. All clock specific data will end up in struct vdso_clock and in struct vdso_time_data there will be an array of VDSO clocks. At the moment, vdso_clock is simply a define which maps vdso_clock to vdso_time_data. For time namespaces, vdso_time_data needs to be set up. But only the clock related part of the vdso_data requires this setup. To reflect the future struct vdso_clock, rename timens_setup_vdso_data() to timens_setup_vdso_clock_data(). No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-13-c1b5c69a166f@linutronix.de
2025-03-08vdso/vsyscall: Prepare introduction of struct vdso_clockAnna-Maria Behnsen1-19/+21
To support multiple PTP clocks, the VDSO data structure needs to be reworked. All clock specific data will end up in struct vdso_clock and in struct vdso_time_data there will be an array of VDSO clocks. At the moment, vdso_clock is simply a define which maps vdso_clock to vdso_time_data. To prepare for the rework of the data structures, replace the struct vdso_time_data pointer with a struct vdso_clock pointer where applicable. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-12-c1b5c69a166f@linutronix.de
2025-03-05ptp: Add PHC file mode checks. Allow RO adjtime() without FMODE_WRITE.Wojtek Wasko1-1/+1
Many devices implement highly accurate clocks, which the kernel manages as PTP Hardware Clocks (PHCs). Userspace applications rely on these clocks to timestamp events, trace workload execution, correlate timescales across devices, and keep various clocks in sync. The kernel's current implementation of PTP clocks does not enforce file permission checks for most device operations, except for POSIX clock operations, where the file mode is verified in the POSIX layer before forwarding the call to the PTP subsystem. Consequently, it is common practice to not give unprivileged userspace applications any access to PTP clocks whatsoever, by giving the PTP chardevs 600 permissions. An example of users running into this limitation is documented in [1]. Additionally, the POSIX layer requires WRITE permission even for read-only adjtime() calls, which are used in the PTP layer to return the current frequency offset applied to the PHC. Add permission checks for functions that modify the state of a PTP device. Continue enforcing permission checks for POSIX clock operations (settime, adjtime) in the POSIX layer. Only require WRITE access for dynamic clocks adjtime() if any flags are set in the modes field. [1] https://lists.nwtime.org/sympa/arc/linuxptp-users/2024-01/msg00036.html Changes in v4: - Require FMODE_WRITE in adjtime() only for calls modifying the clock in any way. Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
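A sketch of the adjtime() policy described above, condensed from the dynamic clock dispatch (surrounding setup and error handling omitted, local names illustrative):

    static int pc_clock_adjtime(clockid_t id, struct __kernel_timex *tx)
    {
        /* ... resolve the clock descriptor cd from the clockid ... */

        /* Write access is only required when the call modifies the
         * clock; a pure frequency-offset query (tx->modes == 0) is
         * allowed on a read-only descriptor. */
        if (tx->modes && !(cd.fp->f_mode & FMODE_WRITE))
            return -EACCES;

        /* ... forward to the clock's adjtime callback ... */
    }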
2025-03-05posix-clock: Store file pointer in struct posix_clock_contextWojtek Wasko1-0/+1
File descriptor based pc_clock_*() operations of dynamic posix clocks have access to the file pointer and implement permission checks in the generic code before invoking the relevant dynamic clock callback. Character device operations (open, read, poll, ioctl) do not implement a generic permission control and the dynamic clock callbacks have no access to the file pointer to implement them. Extend struct posix_clock_context with a struct file pointer and initialize it in posix_clock_open(), so that all dynamic clock callbacks can access it. Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-26posix-clock: Remove duplicate compat ioctl() handlerThomas Weißschuh1-23/+1
The normal and compat ioctl handlers are identical, which is fine as compat ioctls are detected and handled dynamically inside the underlying clock implementation. The duplicate definition however is unnecessary. Just reuse the regular ioctl handler also for compat ioctls. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lore.kernel.org/all/20250225-posix-clock-compat-cleanup-v2-1-30de86457a2b@weissschuh.net
2025-02-21vdso: Remove remnants of architecture-specific time storageThomas Weißschuh2-16/+15
All users of the time related parts of the vDSO are now using the generic storage implementation. Remove the therefore unnecessary compatibility accessor functions and symbols. Co-developed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-18-13a4669dfc8c@linutronix.de
2025-02-21vdso: Add generic time data storageThomas Weißschuh1-4/+4
Historically each architecture defined their own way to store the vDSO data page. Add a generic mechanism to provide storage for that page. Furthermore this generic storage will be extended to also provide uniform storage for *non*-time-related data, like the random state or architecture-specific data. These will have their own pages and data structures, so rename 'vdso_data' into 'vdso_time_data' to make that split clear from the name. Also introduce a new consistent naming scheme for the symbols related to the vDSO, which makes it clear if the symbol is accessible from userspace or kernel space and the type of data behind the symbol. The generic fault handler contains an optimization to prefault the vvar page when the timens page is accessed. This was lifted from s390 and x86. Co-developed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-5-13a4669dfc8c@linutronix.de
2025-02-18can: Switch to use hrtimer_setup()Nam Cao1-5/+0
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Most of this patch is generated by Coccinelle, except for the TX hrtimer in bcm_tx_setup(), because this timer is not used and the callback function is never set. For this particular case, set the callback to hrtimer_dummy_timeout(). Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/all/a3a6be42c818722ad41758457408a32163bfd9a0.1738746872.git.namcao@linutronix.de
2025-02-18time: Switch to hrtimer_setup()Nam Cao5-14/+8
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/170bb691a0d59917c8268a98c80b607128fc9f7f.1738746821.git.namcao@linutronix.de
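The Coccinelle-driven conversion in both patches boils down to this pattern (foo and foo_timer_fn are illustrative names):

    /* Before: two-step initialization */
    hrtimer_init(&foo->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    foo->timer.function = foo_timer_fn;

    /* After: one call sets up the timer including its callback */
    hrtimer_setup(&foo->timer, foo_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_REL);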
2025-02-18posix-timers: Invoke cond_resched() during exit_itimers()Benjamin Segall1-1/+3
exit_itimers() loops through every timer in the process to delete it. This requires taking the system-wide hash_lock for each of these timers, and contends with other processes trying to create or delete timers. When a process creates hundreds of thousands of timers, and then exits while other processes contend with it, this can trigger softlockups on CONFIG_PREEMPT=n. Add a cond_resched() invocation into the loop to allow the system to make progress. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/xm2634gg2n23.fsf@google.com
2025-02-18hrtimers: Replace hrtimer_clock_to_base_table with switch-caseAndy Shevchenko1-17/+12
Clang and GCC complain about overlapped initialisers in the hrtimer_clock_to_base_table definition. With `make W=1` and CONFIG_WERROR=y (which is default nowadays) this breaks the build:

      CC      kernel/time/hrtimer.o
    kernel/time/hrtimer.c:124:21: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
      124 |         [CLOCK_REALTIME]        = HRTIMER_BASE_REALTIME,
    kernel/time/hrtimer.c:122:27: note: previous initialization is here
      122 |         [0 ... MAX_CLOCKS - 1]  = HRTIMER_MAX_CLOCK_BASES,

    (and similar for CLOCK_MONOTONIC, CLOCK_BOOTTIME, and CLOCK_TAI)

hrtimer_clockid_to_base(), which uses the table, is only used in __hrtimer_init(), which is not a hotpath. Therefore replace the table lookup with a switch case in hrtimer_clockid_to_base() to avoid this warning.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250214134424.3367619-1-andriy.shevchenko@linux.intel.com
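The replacement has this shape, reconstructed from the description (the default-case handling is an assumption):

    static inline unsigned int hrtimer_clockid_to_base(clockid_t clock_id)
    {
        switch (clock_id) {
        case CLOCK_REALTIME:
            return HRTIMER_BASE_REALTIME;
        case CLOCK_MONOTONIC:
            return HRTIMER_BASE_MONOTONIC;
        case CLOCK_BOOTTIME:
            return HRTIMER_BASE_BOOTTIME;
        case CLOCK_TAI:
            return HRTIMER_BASE_TAI;
        default:
            WARN(1, "Invalid clockid %d\n", clock_id);
            return HRTIMER_BASE_MONOTONIC;
        }
    }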
2025-02-08Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-3/+13
Pull timer fixes from Ingo Molnar:

 "Fix a PREEMPT_RT bug in the clocksource verification code that caused false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
2025-02-07timers/migration: Fix off-by-one root mis-connectionFrederic Weisbecker1-1/+9
Before attaching a new root to the old root, the children counter of the new root is checked to verify that only the upcoming CPU's top group has been connected to it. However, since the recently added commit b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit") this check is not valid anymore, because the old root is pre-accounted as a child of the new root. Therefore, after connecting the upcoming CPU's top group to the new root, the expected children count must be 2 and not 1 anymore.

This omission results in the old root not being connected to the new root. Eventually the system may then run with more than one top level, which defeats the purpose of a single idle migrator.

Also the old root is pre-accounted but not connected upon the new root creation. But it can be connected to the new root later on. Therefore the old root may be accounted twice to the new root. The propagation of such an overcommit can end up creating a double final top-level root with a groupmask incorrectly initialized. Although harmless given that the final top level roots will never have a parent to walk up to, this oddity opportunistically reported the core issue:

    WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
    CPU: 8 UID: 0 PID: 0 Comm: swapper/8
    RIP: 0010:tmigr_requires_handle_remote
    Call Trace:
     <IRQ>
     ? tmigr_requires_handle_remote
     ? hrtimer_run_queues
     update_process_times
     tick_periodic
     tick_handle_periodic
     __sysvec_apic_timer_interrupt
     sysvec_apic_timer_interrupt
     </IRQ>

Fix the problem by taking the old root into account in the children count of the new root so the connection is not omitted.

Also warn when more than one top level group exists to better detect similar issues in the future.

Fixes: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
2025-02-03Merge tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-32/+96
Pull timer fixes from Thomas Gleixner:

 - Properly cast the input to secs_to_jiffies() to unsigned long, as otherwise the result uses the data type of the input variable, which causes result range checks to fail if the input data type is signed and smaller than unsigned long.

 - Handle late armed hrtimers gracefully on CPU hotplug. There are legitimate cases where a hrtimer is (re)armed on an outgoing CPU after the timers have been migrated away. This triggers warnings and caused people to implement horrible workarounds in RCU. But those workarounds are incomplete and do not cover e.g. the scheduler hrtimers. Stop this by force moving timers which are enqueued on the current CPU after timer migration to be queued on a remote online CPU. This allows the workarounds to be undone in a separate step.

 - Demote a warning level printk() to info level in the clocksource watchdog code, as there is no point in emitting a warning level message for a purely informational message.

 - Mark a helper function __always_inline and move it into the existing #ifdef block to avoid 'unused function' warnings from CLANG.

* tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Cast to unsigned long in secs_to_jiffies() conversion
  clocksource: Use pr_info() for "Checking clocksource synchronization" message
  hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
  hrtimers: Mark is_migration_base() with __always_inline
2025-02-03clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic contextWaiman Long1-2/+4
The following bug report happened with a PREEMPT_RT kernel:

    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
    preempt_count: 1, expected: 0
    RCU nest depth: 0, expected: 0
    get_random_u32+0x4f/0x110
    clocksource_verify_choose_cpus+0xab/0x1a0
    clocksource_verify_percpu.part.0+0x6b/0x330
    clocksource_watchdog_kthread+0x193/0x1a0

It is due to the fact that clocksource_verify_choose_cpus() is invoked with preemption disabled. This function invokes get_random_u32() to obtain random numbers for choosing CPUs. The batched_entropy_32 local lock and/or the base_crng.lock spinlock in driver/char/random.c will be acquired during the call. In a PREEMPT_RT kernel, they are both sleeping locks and so cannot be acquired in atomic context.

Fix this problem by using migrate_disable() to allow smp_processor_id() to be reliably used without introducing atomic context. preempt_disable() is then called after clocksource_verify_choose_cpus() but before the clocksource measurement is being run to avoid introducing unexpected latency.

Fixes: 7560c02bdffb ("clocksource: Check per-CPU clock synchronization when marked unstable")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/20250131173323.891943-2-longman@redhat.com
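The resulting ordering, sketched:

    migrate_disable();    /* smp_processor_id() stable, sleeping allowed */
    clocksource_verify_choose_cpus();    /* may call get_random_u32() */
    preempt_disable();    /* only around the measurement itself */
    /* ... per-CPU clocksource read-back and comparison ... */
    preempt_enable();
    migrate_enable();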
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados1-1/+1
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/infiniband dirs). These are special cases, as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of proc_handler function pointers, as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.

Created this by running an spatch followed by a sed command:

Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
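After the spatch, an affected table looks like this sketch (modeled on the kernel/time timer_migration sysctl; entry contents illustrative):

    static const struct ctl_table timer_sysctl[] = {
        {
            .procname     = "timer_migration",
            .data         = &sysctl_timer_migration,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = timer_migration_handler,
            .extra1       = SYSCTL_ZERO,
            .extra2       = SYSCTL_ONE,
        },
    };
    /* The array now resides in .rodata, so the proc_handler pointers
     * cannot be modified at runtime. */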
2025-01-27clocksource: Use pr_info() for "Checking clocksource synchronization" messageWaiman Long1-1/+2
The "Checking clocksource synchronization" message is normally printed when clocksource_verify_percpu() is called for a given clocksource if both the CLOCK_SOURCE_UNSTABLE and CLOCK_SOURCE_VERIFY_PERCPU flags are set. It is an informational message and so pr_info() is the correct choice. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250125015442.3740588-1-longman@redhat.com
2025-01-23hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYINGFrederic Weisbecker1-21/+82
hrtimers are migrated away from the dying CPU to any online target at the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers handling tasks involved in the CPU hotplug forward progress. However, wakeups can still be performed by the outgoing CPU after CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being armed. Depending on several considerations (crystal ball power management based election, earliest timer already enqueued, timer migration enabled or not), the target may eventually be the current CPU even if offline. If that happens, the timer is eventually ignored.

The most notable example is RCU, which had to deal with each and every one of those wake-ups by deferring them to an online CPU, along with related workarounds:

    _ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
    _ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
    _ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)

The problem isn't confined to RCU though, as the stop machine kthread (which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end of its work through cpu_stop_signal_done() and performs a wake up that eventually arms the deadline server timer:

    WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
    CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
    Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
    RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
    Call Trace:
    <TASK>
      start_dl_timer
      enqueue_dl_entity
      dl_server_start
      enqueue_task_fair
      enqueue_task
      ttwu_do_activate
      try_to_wake_up
      complete
      cpu_stopper_thread

Instead of providing yet another bandaid to work around the situation, fix it in the hrtimers infrastructure instead: always migrate away a timer to an online target whenever it is enqueued from an offline CPU. This will also allow reverting all the above RCU disgraceful hacks.

Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Vlad Poenaru <vlad.wing@gmail.com>
Reported-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org
Closes: 20241213203739.1519801-1-usamaarif642@gmail.com
2025-01-23hrtimers: Mark is_migration_base() with __always_inlineAndy Shevchenko1-10/+12
When is_migration_base() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

    kernel/time/hrtimer.c:156:20: error: unused function 'is_migration_base' [-Werror,-Wunused-function]
      156 | static inline bool is_migration_base(struct hrtimer_clock_base *base)
          |                    ^~~~~~~~~~~~~~~~~

Fix this by marking it with __always_inline.

[ tglx: Use __always_inline instead of __maybe_unused and move it into the usage sites conditional ]

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250116160745.243358-1-andriy.shevchenko@linux.intel.com
2025-01-21Merge tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds8-100/+32
Pull timer and timekeeping updates from Thomas Gleixner:

 - Just boring cleanups, typo and comment fixes and trivial optimizations

* tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Simplify top level detection on group setup
  timers: Optimize get_timer_[this_]cpu_base()
  timekeeping: Remove unused ktime_get_fast_timestamps()
  timer/migration: Fix kernel-doc warnings for union tmigr_state
  tick/broadcast: Add kernel-doc for function parameters
  hrtimers: Update the return type of enqueue_hrtimer()
  clocksource/wdtest: Print time values for short udelay(1)
  posix-timers: Fix typo in __lock_timer()
  vdso: Correct typo in PAGE_SHIFT comment
2025-01-16timers/migration: Simplify top level detection on group setupFrederic Weisbecker1-3/+1
Having a single group on a given level is enough to know this is the top level, because a root has to have at least two children, unless that root is the only group and the children are actual CPUs. Simplify the test in tmigr_setup_groups() accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-5-frederic@kernel.org
2025-01-16hrtimers: Handle CPU state correctly on hotplugKoichiro Den1-1/+10
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16timers/migration: Annotate accesses to ignore flagFrederic Weisbecker1-6/+15
The group's ignore flag is:

    _ read under the group's lock (idle entry, remote expiry)
    _ turned on/off under the group's lock (idle entry, remote expiry)
    _ turned on locklessly on idle exit

When idle entry or remote expiry clear the "ignore" flag of a group, the operation must be synchronized against other concurrent idle entry or remote expiry to make sure the related group timer is never missed. To enforce this synchronization, both "ignore" clear and read are performed under the group lock.

On the contrary, whether idle entry or remote expiry manage to observe the "ignore" flag turned on by a CPU exiting idle is a matter of optimization. If that flag set is missed or cleared concurrently, the worst outcome is a migrator wasting time remotely handling a "ghost" timer. This is why the ignore flag can be set locklessly. Unf