aboutsummaryrefslogtreecommitdiff
path: root/include/trace/events/ext4.h
AgeCommit message (Collapse)AuthorFilesLines
2025-11-26ext4: rename EXT4_GET_BLOCKS_PRE_IOYang Erkun1-1/+1
This flag has been generalized to split an unwritten extent when we do dio or dioread_nolock writeback, or to avoid merge new extents which was created by extents split. Update some related comments too. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: add two trace points for moving extentsZhang Yi1-0/+74
To facilitate tracking the length, type, and outcome of the move extent, add a trace point at both the entry and exit of mext_move_extent(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-13-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: introduce seq counter for the extent status entryZhang Yi1-8/+15
In the iomap_write_iter(), the iomap buffered write frame does not hold any locks between querying the inode extent mapping info and performing page cache writes. As a result, the extent mapping can be changed due to concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the write-back process faces a similar problem: concurrent changes can invalidate the extent mapping before the I/O is submitted. Therefore, both of these processes must recheck the mapping info after acquiring the folio lock. To address this, similar to XFS, we propose introducing an extent sequence number to serve as a validity cookie for the extent. After commit 24b7a2331fcd ("ext4: clairfy the rules for modifying extents"), we can ensure the extent information should always be processed through the extent status tree, and the extent status tree is always uptodate under i_rwsem or invalidate_lock or folio lock, so it's safe to introduce this sequence number. The sequence number will be increased whenever the extent status tree changes, preparing for the buffered write iomap conversion. Besides, this mechanism is also applicable for the moving extents case. In move_extent_per_page(), it also needs to reacquire data_sem and check the mapping info again under the folio lock. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-31Merge tag 'ext4_for_linus_6.17-rc1' of ↵Linus Torvalds1-11/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major ext4 changes for 6.17: - Better scalability for ext4 block allocation - Fix insufficient credits when writing back large folios Miscellaneous bug fixes, especially when handling exteded attriutes, inline data, and fast commit" * tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits) ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr ext4: implement linear-like traversal across order xarrays ext4: refactor choose group to scan group ext4: convert free groups order lists to xarrays ext4: factor out ext4_mb_scan_group() ext4: factor out ext4_mb_might_prefetch() ext4: factor out __ext4_mb_scan_group() ext4: fix largest free orders lists corruption on mb_optimize_scan switch ext4: fix zombie groups in average fragment size lists ext4: merge freed extent with existing extents before insertion ext4: convert sbi->s_mb_free_pending to atomic_t ext4: fix typo in CR_GOAL_LEN_SLOW comment ext4: get rid of some obsolete EXT4_MB_HINT flags ext4: utilize multiple global goals to reduce contention ext4: remove unnecessary s_md_lock on update s_mb_last_group ext4: remove unnecessary s_mb_last_start ext4: separate stream goal hits from s_bal_goals for better tracking ext4: add ext4_try_lock_group() to skip busy groups ext4: initialize superblock fields in the kballoc-test.c kunit tests ext4: refactor the inline directory conversion and new directory codepaths ...
2025-07-25ext4: get rid of some obsolete EXT4_MB_HINT flagsBaokun Li1-3/+0
Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13ext4: enhance tracepoints during the folios writebackZhang Yi1-5/+37
After mpage_map_and_submit_extent() supports restarting handle if credits are insufficient during allocating blocks, it is more likely to exit the current mapping iteration and continue to process the current processing partially mapped folio again. The existing tracepoints are not sufficient to track this situation, so enhance the tracepoints to track the writeback position and the return value before and after submitting the folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13ext4: process folios writeback in bytesZhang Yi1-7/+6
Since ext4 supports large folios, processing writebacks in pages is no longer appropriate, it can be modified to process writebacks in bytes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-06-23ext4: add FALLOC_FL_WRITE_ZEROES supportZhang Yi1-1/+2
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enable the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-10-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28ext4: remove tracing for FALLOC_FL_NO_HIDE_STALEChristoph Hellwig1-1/+0
FALLOC_FL_NO_HIDE_STALE can't make it past vfs_fallocate (and if the flag does what the name implies that's a good thing as it would be highly dangerous). Remove the dead tracing code for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27ext4: make ext4_da_reserve_space() reserve multi-clustersZhang Yi1-4/+6
Add 'nr_resv' parameter to ext4_da_reserve_space(), which indicates the number of clusters wants to reserve, make it reserve multiple clusters at a time. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: make ext4_es_insert_delayed_block() insert multi-blocksZhang Yi1-7/+9
Rename ext4_es_insert_delayed_block() to ext4_es_insert_delayed_extent() and pass length parameter to make it insert multiple delalloc blocks at a time. For the case of bigalloc, split the allocated parameter to lclu_allocated and end_allocated. lclu_allocated indicates the allocation state of the cluster which is containing the lblk, end_allocated indicates the allocation state of the extent end, clusters in the middle of delay allocated extent must be unallocated. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18ext4: remove 'needed' in trace_ext4_discard_preallocationsKemeng Shi1-7/+4
As 'needed' to trace_ext4_discard_preallocations is always 0 which is meaningless. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: Give symbolic names to mballoc criteriasOjaswin Mujoo1-13/+13
mballoc criterias have historically been called by numbers like CR0, CR1... however this makes it confusing to understand what each criteria is about. Change these criterias from numbers to symbolic names and add relevant comments. While we are at it, also reformat and add some comments to ext4_seq_mb_stats_show() for better readability. Additionally, define CR_FAST which signifies the criteria below which we can make quicker decisions like: * quitting early if (free block < requested len) * avoiding to scan free extents smaller than required len. * avoiding to initialize buddy cache and work with existing cache * limiting prefetches Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/a2dc6ec5aea5e5e68cf8e788c2a964ffead9c8b0.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: Add allocation criteria 1.5 (CR1_5)Ojaswin Mujoo1-0/+2
CR1_5 aims to optimize allocations which can't be satisfied in CR1. The fact that we couldn't find a group in CR1 suggests that it would be difficult to find a continuous extent to compleltely satisfy our allocations. So before falling to the slower CR2, in CR1.5 we proactively trim the the preallocations so we can find a group with (free / fragments) big enough. This speeds up our allocation at the cost of slightly reduced preallocation. The patch also adds a new sysfs tunable: * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order This controls how much CR1.5 can trim a request before falling to CR2. For example, for a request of order 7 and max trim order 2, CR1.5 can trim this upto order 5. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: Convert mballoc cr (criteria) to enumOjaswin Mujoo1-2/+14
Convert criteria to be an enum so it easier to maintain and update the tracefiles to use enum names. This change also makes it easier to insert new criterias in the future. There is no functional change in this patch. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/5d82fd467bdf70ea45bdaef810af3b146013946c.1685449706.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15ext4: Change remaining tracepoints to use folioRitesh Harjani1-13/+13
ext4_readpage() is converted to ext4_read_folio() hence change the related tracepoint from trace_ext4_readpage(page) to trace_ext4_read_folio(folio). Do the same for trace_ext4_releasepage(page) to trace_ext4_release_folio(folio) As a minor bit of optimization to avoid an extra dereferencing, since both of the above functions already were dereferencing folio->mapping->host, hence change the tracepoint argument to take (inode, folio). Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/caba2b3c0147bed4ea7706767dc1d19cd0e29ab0.1684122756.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-23ext4: Convert data=journal writeback to use ext4_writepages()Jan Kara1-7/+0
Add support for writeback of journalled data directly into ext4_writepages() instead of offloading it to write_cache_pages(). This actually significantly simplifies the code and reduces code duplication. For checkpointing of committed data we can use ext4_writepages() rightaway the same way as writeback of ordered data uses it on transaction commit. For journalling of dirty mapped pages, we need to add a special case to mpage_prepare_extent_to_map() to add all page buffers to the journal. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-8-tytso@mit.edu
2022-12-08ext4: disable fast-commit of encrypted dir operationsEric Biggers1-2/+5
fast-commit of create, link, and unlink operations in encrypted directories is completely broken because the unencrypted filenames are being written to the fast-commit journal instead of the encrypted filenames. These operations can't be replayed, as encryption keys aren't present at journal replay time. It is also an information leak. Until if/when we can get this working properly, make encrypted directory operations ineligible for fast-commit. Note that fast-commit operations on encrypted regular files continue to be allowed, as they seem to work. Fixes: aa75f4d3daae ("ext4: main fast-commit commit path") Cc: <stable@vger.kernel.org> # v5.10+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221106224841.279231-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-01ext4: split ext4_journal_start trace for debugchangfengnan1-11/+46
we might want to know why jbd2 thread using high io for detail, split ext4_journal_start trace to ext4_journal_start_sb and ext4_journal_start_inode, show ino and handle type when possible. Signed-off-by: changfengnan <changfengnan@bytedance.com> Link: https://lore.kernel.org/r/20221008120518.74870-1-changfengnan@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-08fs: Remove flags parameter from aops->write_beginMatthew Wilcox (Oracle)1-13/+8
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-15/+15
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-15ext4: fix remaining two trace events to use same printk conventionRitesh Harjani1-2/+2
All ext4 & jbd2 trace events starts with "dev Major:Minor". While we are still improving/adding the ftrace events for FC, let's fix last two remaining trace events to follow the same convention. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/8f33b163f0f29df2491c03b79f8ac96890ea5184.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-15ext4: add commit tid info in ext4_fc_commit_start/stop trace eventsRitesh Harjani1-8/+13
This adds commit_tid info in ext4_fc_commit_start/stop which is helpful in debugging fast_commit issues. For e.g. issues where due to jbd2 journal full commit, FC miss to commit updates to a file. Also improves TP_prink format string i.e. all ext4 and jbd2 trace events starts with "dev MAjOR,MINOR". Let's follow the same convention while we are still at it. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/ebcd6b9ab5b718db30f90854497886801ce38c63.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-15ext4: add transaction tid info in fc_track eventsRitesh Harjani1-45/+68
This patch adds the transaction & inode tid info in trace events for callers of ext4_fc_track_template(). This is helpful in debugging race conditions where an inode could belong to two different transaction tids. It also fixes the checkpatch warnings which says use tabs instead of spaces. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c203c09dc11bb372803c430f621f25a4b8c2c8b4.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-15ext4: add new trace event in ext4_fc_cleanupRitesh Harjani1-0/+26
This adds a new trace event in ext4_fc_cleanup() which is helpful in debugging some fast_commit issues. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/794cdb1d5d3622f3f80d30c222ff6652ea68c375.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-15ext4: Convert invalidatepage to invalidate_folioMatthew Wilcox (Oracle)1-15/+15
Extensive changes, but fairly mechanical. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-12ext4: convert ext4_fc_track_dentry type events to use event classRitesh Harjani1-25/+31
One should use DECLARE_EVENT_CLASS for similar event types instead of defining TRACE_EVENT for each event type. This is helpful in reducing the text section footprint for e.g. [1] [1]: https://lwn.net/Articles/381064/ Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/a019cb46219ef4b30e4d98d7ced7d8819a2fc61d.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-12ext4: fix ext4_fc_stats trace pointRitesh Harjani1-29/+49
ftrace's __print_symbolic() requires that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be decoded from the ftrace ring buffer by user space tooling. This patch also fixes few other problems found in this trace point. e.g. dereferencing structures in TP_printk which should not be done at any cost. Also to avoid checkpatch warnings, this patch removes those whitespaces/tab stops issues. Cc: stable@kernel.org Fixes: aa75f4d3daae ("ext4: main fast-commit commit path") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/b4b9691414c35c62e570b723e661c80674169f9a.1647057583.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: implement support for get/set fs labelLukas Czerner1-0/+23
Implement support for FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls for online reading and setting of file system label. ext4_ioctl_getlabel() is simple, just get the label from the primary superblock. This might not be the first sb on the file system if 'sb=' mount option is used. In ext4_ioctl_setlabel() we update what ext4 currently views as a primary superblock and then proceed to update backup superblocks. There are two caveats: - the primary superblock might not be the first superblock and so it might not be the one used by userspace tools if read directly off the disk. - because the primary superblock might not be the first superblock we potentialy have to update it as part of backup superblock update. However the first sb location is a bit more complicated than the rest so we have to account for that. The superblock modification is created generic enough so the infrastructure can be used for other potential superblock modification operations, such as chaning UUID. Tested with generic/492 with various configurations. I also checked the behavior with 'sb=' mount options, including very large file systems with and without sparse_super/sparse_super2. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20211213135618.43303-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-02ext4: delete some unused tracepoint definitionsEric Whitney1-176/+0
A number of tracepoint instances have been removed from ext4 by past patches but the definitions of those tracepoints have not. All instances of ext4_ext_in_cache and ext4_ext_put_in_cache were removed by commit 69eb33dc24dc ("ext4: remove single extent cache"). ext4_get_reserved_cluster_alloc was removed by commit b6bf9171ef5c ("ext4: reduce reserved cluster count by number of allocated clusters"). ext4_find_delalloc_range was removed by commit 7d1b1fbc95eb ("ext4: reimplement ext4_find_delay_alloc_range on extent status tree"). All instances of ext4_direct_IO_enter and ext4_direct_IO_exit were removed by commit 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure"). Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210216191634.20957-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: disable fast commit with data journallingHarshad Shirwadkar1-2/+4
Fast commits don't work with data journalling. This patch disables the fast commit support when data journalling is turned on. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201106035911.1942128-19-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: mark fc ineligible if inode gets evictied due to mem pressureHarshad Shirwadkar1-3/+3
If inode gets evicted due to memory pressure, we have to remove it from the fast commit list. However, that inode may have uncommitted changes that fast commits will lose. So, just fall back to full commits in this case. Also, rename the fast commit ineligiblity reason from "EXT4_FC_REASON_MEM" to "EXT4_FC_REASON_MEM_NOMEM" for better expression. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: fast commit recovery pathHarshad Shirwadkar1-4/+52
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: main fast-commit commit pathHarshad Shirwadkar1-0/+172
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19ext4: limit the length of per-inode prealloc listbrookxu1-6/+11
In the scenario of writing sparse files, the per-inode prealloc list may be very long, resulting in high overhead for ext4_mb_use_preallocated(). To circumvent this problem, we limit the maximum length of per-inode prealloc list to 512 and allow users to modify it. After patching, we observed that the sys ratio of cpu has dropped, and the system throughput has increased significantly. We created a process to write the sparse file, and the running time of the process on the fixed kernel was significantly reduced, as follows: Running time on unfixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m2.051s user 0m0.008s sys 0m2.026s Running time on fixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m0.471s user 0m0.004s sys 0m0.395s Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07ext4: add prefetch_block_bitmaps mount optionTheodore Ts'o1-0/+44
For file systems where we can afford to keep the buddy bitmaps cached, we can speed up initial writes to large file systems by starting to load the block allocation bitmaps as soon as the file system is mounted. This won't work well for _super_ large file systems, or memory constrained systems, so we only enable this when it is requested via a mount option. Addresses-Google-Bug: 159488342 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2020-08-07ext4: indicate via a block bitmap read is prefetched via a tracepointTheodore Ts'o1-4/+20
Modify the ext4_read_block_bitmap_load tracepoint so that it tells us whether a block bitmap is being prefetched. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
2020-06-03ext4: mballoc: use lock for checking free blocks while retryingRitesh Harjani1-1/+2
Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set if we are unable to allocate any blocks in the first try (since we may not have considered blocks about to be discarded from PA lists). So during retry attempt to allocate blocks we will use ext4_lock_group() for checking if the group is good or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: translate a few more map flags to strings in tracepointsEric Whitney1-1/+4
As new ext4_map_blocks() flags have been added, not all have gotten flag bit to string translations to make tracepoint output more readable. Fix that, and go one step further by adding a translation for the EXT4_EX_NOCACHE flag as well. The EXT4_EX_FORCE_CACHE flag can never be set in a tracepoint in the current code, so there's no need to bother with a translation for it right now. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-3-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flagEric Whitney1-1/+0
The eofblocks code was removed in the 5.7 release by "ext4: remove EOFBLOCKS_FL and associated code" (4337ecd1fe99). The ext4_map_blocks() flag used to trigger it can now be removed as well. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25ext4: fix extent_status trace pointsDmitry Monakhov1-4/+4
Show pblock only if it has meaningful value. # before ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 576460752303423487 H ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 576460752303423487 HR # after ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 0 H ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 0 HR Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191114200147.1073-2-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25ext4: fix symbolic enum printing in trace outputDmitry Monakhov1-1/+18
Trace's macro __print_flags() produce raw event's decraration w/o knowing actual flags value cat /sys/kernel/debug/tracing/events/ext4/ext4_ext_map_blocks_exit/format .. __print_flags(REC->mflags, "", { (1 << BH_New), For that reason we have to explicitly define it via special macro TRACE_DEFINE_ENUM() Also add missed EXTENT_STATUS_REFERENCED flag. #Before patch ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 2 flags lblk 0 pblk 4177 len 1 mflags 0x20 ret 1 ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 12 flags CREATE lblk 0 pblk 34304 len 1 mflags 0x60 ret 1 #With patch ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 2 flags lblk 0 pblk 4177 len 1 mflags M ret 1 ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 12 flags CREATE lblk 0 pblk 34816 len 1 mflags NM ret 1 Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191114200147.1073-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: Reserve revoke credits for freed blocksJan Kara1-5/+8
So far we have reserved only relatively high fixed amount of revoke credits for each transaction. We over-reserved by large amount for most cases but when freeing large directories or files with data journalling, the fixed amount is not enough. In fact the worst case estimate is inconveniently large (maximum extent size) for freeing of one extent. We fix this by doing proper estimate of the amount of blocks that need to be revoked when removing blocks from the inode due to truncate or hole punching and otherwise reserve just a small amount of revoke credits for each transaction to accommodate freeing of xattrs block or so. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-12-19ext4: force inode writes when nfsd calls commit_metadata()Theodore Ts'o1-0/+20
Some time back, nfsd switched from calling vfs_fsync() to using a new commit_metadata() hook in export_operations(). If the file system did not provide a commit_metadata() hook, it fell back to using sync_inode_metadata(). Unfortunately doesn't work on all file systems. In particular, it doesn't work on ext4 due to how the inode gets journalled --- the VFS writeback code will not always call ext4_write_inode(). So we need to provide our own ext4_nfs_commit_metdata() method which calls ext4_write_inode() directly. Google-Bug-Id: 121195940 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-10-01ext4: adjust reserved cluster count when removing extentsEric Whitney1-20/+40
Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01ext4: fix reserved cluster accounting at delayed write timeEric Whitney1-0/+35
The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting. Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01ext4: generalize extents status tree search functionsEric Whitney1-2/+2
Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock. Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-02-18ext4: add tracepoints for shutdown and file system errorsTheodore Ts'o1-0/+43
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they b