15 Sep, 2022

1 commit

  • [ Upstream commit 2f44013e39984c127c6efedf70e6b5f4e9dcf315 ]

    During stress testing with CONFIG_SMP disabled, KASAN reports as below:

    ==================================================================
    BUG: KASAN: use-after-free in __mutex_lock+0xe5/0xc30
    Read of size 8 at addr ffff8881094223f8 by task stress/7789

    CPU: 0 PID: 7789 Comm: stress Not tainted 6.0.0-rc1-00002-g0d53d2e882f9 #3
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    Call Trace:

    ..
    __mutex_lock+0xe5/0xc30
    ..
    z_erofs_do_read_page+0x8ce/0x1560
    ..
    z_erofs_readahead+0x31c/0x580
    ..
    Freed by task 7787
    kasan_save_stack+0x1e/0x40
    kasan_set_track+0x20/0x30
    kasan_set_free_info+0x20/0x40
    __kasan_slab_free+0x10c/0x190
    kmem_cache_free+0xed/0x380
    rcu_core+0x3d5/0xc90
    __do_softirq+0x12d/0x389

    Last potentially related work creation:
    kasan_save_stack+0x1e/0x40
    __kasan_record_aux_stack+0x97/0xb0
    call_rcu+0x3d/0x3f0
    erofs_shrink_workstation+0x11f/0x210
    erofs_shrink_scan+0xdc/0x170
    shrink_slab.constprop.0+0x296/0x530
    drop_slab+0x1c/0x70
    drop_caches_sysctl_handler+0x70/0x80
    proc_sys_call_handler+0x20a/0x2f0
    vfs_write+0x555/0x6c0
    ksys_write+0xbe/0x160
    do_syscall_64+0x3b/0x90

    The root cause is that the !CONFIG_SMP version of
    erofs_workgroup_unfreeze() doesn't reset the reference count back to
    orig_val, which opens a race window in which a pcluster can be reused
    unexpectedly before it's freed.

    Since UP platforms are quite rare now, such a specially-designed path
    has become unnecessary. Let's just drop it instead.
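
    For reference, a sketch of the UP-specific path removed by this patch
    (reconstructed from fs/erofs/internal.h of that era, so treat details
    as approximate); note how unfreeze ignores orig_val entirely:

        /* !CONFIG_SMP variants removed by this fix */
        static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
                                                         int val)
        {
                preempt_disable();
                /* no need to spin on UP platforms, just disable preemption */
                if (val != atomic_read(&grp->refcount)) {
                        preempt_enable();
                        return false;
                }
                return true;
        }

        static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
                                                    int orig_val)
        {
                /* BUG: orig_val is never written back to grp->refcount */
                preempt_enable();
        }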

    Fixes: 73f5c66df3e2 ("staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'")
    Reviewed-by: Yue Hu
    Reviewed-by: Chao Yu
    Link: https://lore.kernel.org/r/20220902045710.109530-1-hsiangkao@linux.alibaba.com
    Signed-off-by: Gao Xiang
    Signed-off-by: Sasha Levin

    Gao Xiang
     

17 Aug, 2022

1 commit

  • [ Upstream commit 448b5a1548d87c246c3d0c3df8480d3c6eb6c11a ]

    Currently, vmap()s are avoided if the physical addresses of
    decompressed buffers are consecutive.

    I observed that this is very common for 4KiB pclusters since the
    number of decompressed pages is almost always 2 or 3.

    However, such detection doesn't work for highmem pages on 32-bit
    machines; let's fix it now.
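
    A minimal sketch of the underlying idea (a hypothetical helper, not
    the upstream diff): contiguity detection based on page_address() has
    to bail out on highmem pages, which have no permanent kernel mapping:

        static void *try_direct_map(struct page **pages, unsigned int nrpages)
        {
                void *kaddr = NULL;
                unsigned int i;

                for (i = 0; i < nrpages; ++i) {
                        /* highmem pages have no stable kernel address */
                        if (PageHighMem(pages[i]))
                                return NULL;
                        if (!i)
                                kaddr = page_address(pages[i]);
                        else if (kaddr + i * PAGE_SIZE != page_address(pages[i]))
                                return NULL;    /* not virtually consecutive */
                }
                return kaddr;   /* safe to use directly without vmap() */
        }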

    Reported-by: Liu Jinbao
    Fixes: 7fc45dbc938a ("staging: erofs: introduce generic decompression backend")
    Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com
    Signed-off-by: Gao Xiang
    Signed-off-by: Sasha Levin

    Gao Xiang
     

01 May, 2022

1 commit

  • commit 4fdccaa0d184c202f98d73b24e3ec8eeee88ab8d upstream

    Add a done_before argument to iomap_dio_rw that indicates how much of
    the request has already been transferred. When the request succeeds, we
    report that done_before additional bytes were transferred. This is
    useful for finishing a request asynchronously when part of the request
    has already been completed synchronously.

    We'll use that to allow iomap_dio_rw to be used with page faults
    disabled: when a page fault occurs while submitting a request, we
    synchronously complete the part of the request that has already been
    submitted. The caller can then take care of the page fault and call
    iomap_dio_rw again for the rest of the request, passing in the number of
    bytes already transferred.
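
    A hedged sketch of the calling pattern this enables, loosely modeled
    on the gfs2 caller this series targets (fs_iomap_ops/fs_dio_ops and
    the exact retry condition are placeholders):

        ssize_t ret, done = 0;

        retry:
        ret = iomap_dio_rw(iocb, iter, &fs_iomap_ops, &fs_dio_ops,
                           0 /* dio_flags */, done);
        if (ret > 0)
                done = ret;     /* already includes done_before */
        if (ret == -EFAULT && iov_iter_count(iter)) {
                /* handle the fault outside the dio path, then resume */
                fault_in_iov_iter_readable(iter, iov_iter_count(iter));
                goto retry;
        }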

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Anand Jain
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     

01 Dec, 2021

1 commit

  • [ Upstream commit 57bbeacdbee72a54eb97d56b876cf9c94059fc34 ]

    We observed the following deadlock in a stress test under a low
    memory scenario:

    Thread A                       Thread B
    - erofs_shrink_scan
     - erofs_try_to_release_workgroup
      - erofs_workgroup_try_to_freeze  -- A
                                   - z_erofs_do_read_page
                                    - z_erofs_collection_begin
                                     - z_erofs_register_collection
                                      - erofs_insert_workgroup
                                       - xa_lock(&sbi->managed_pslots)  -- B
                                       - erofs_workgroup_get
                                        - erofs_wait_on_workgroup_freezed  -- A
      - xa_erase
       - xa_lock(&sbi->managed_pslots)  -- B

    To fix this, we need to hold xa_lock before freezing the workgroup,
    since the XArray will be touched afterwards anyway. So let's hold the
    lock before accessing each workgroup, just like we did with the radix
    tree before.
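
    A sketch of the reordered locking in erofs_try_to_release_workgroup()
    after this fix (reconstructed; surrounding details are approximate):

        xa_lock(&sbi->managed_pslots);          /* take the XArray lock first */
        if (!erofs_workgroup_try_to_freeze(grp, 1)) {
                xa_unlock(&sbi->managed_pslots);
                return false;                   /* busy, retry later */
        }
        /* ... drop cached pages attached to this workgroup ... */

        /* the lock is already held, so erasing can't race with insertion */
        DBG_BUGON(__xa_erase(&sbi->managed_pslots, grp->index) != grp);
        xa_unlock(&sbi->managed_pslots);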

    [ Gao Xiang: Jianhua Hao also reports this issue at
    https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]

    Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
    Fixes: 64094a04414f ("erofs: convert workstn to XArray")
    Reviewed-by: Chao Yu
    Reviewed-by: Gao Xiang
    Signed-off-by: Huang Jianan
    Reported-by: Jianhua Hao
    Signed-off-by: Gao Xiang
    Signed-off-by: Sasha Levin

    Huang Jianan
     

19 Nov, 2021

2 commits

  • commit 86432a6dca9bed79111990851df5756d3eb5f57c upstream.

    There are pclusters at runtime marked with Z_EROFS_PCLUSTER_TAIL
    before actual I/O submission. Thus, the decompression chain can be
    extended if the following pcluster chain hooks up such a tail pcluster.

    As the related comment mentioned, if some page is made of a hooked
    pcluster and another followed pcluster, it can be reused for in-place
    I/O (since I/O should be submitted anyway):
     ________________________________________________________________
    |  tail (partial) page  |          head (partial) page           |
    |_____PRIMARY_HOOKED____|___________PRIMARY_FOLLOWED_____________|

    However, it's by no means safe to reuse such pages for pagevec usage,
    since such PRIMARY_HOOKED pclusters may finally move into the bypass
    chain without I/O submission. It's somewhat hard to reproduce with
    LZ4, and I just hit it (a general protection fault) by ro_fsstressing
    an LZMA image for a long time.

    I'm going to actively clean up the related code together with the
    multi-page folio adaptation in the next few months. Let's address it
    directly for easier backporting for now.

    Call trace for reference:
    z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs]
    z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs]
    z_erofs_runqueue+0x5f3/0x840 [erofs]
    z_erofs_readahead+0x1e8/0x320 [erofs]
    read_pages+0x91/0x270
    page_cache_ra_unbounded+0x18b/0x240
    filemap_get_pages+0x10a/0x5f0
    filemap_read+0xa9/0x330
    new_sync_read+0x11b/0x1a0
    vfs_read+0xf1/0x190

    Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org
    Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
    Cc: stable # 4.19+
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang
    Signed-off-by: Greg Kroah-Hartman

    Gao Xiang
     
  • [ Upstream commit a0961f351d82d43ab0b845304caa235dfe249ae9 ]

    syzbot reported a WARNING [1] due to corrupted compressed data.

    As Dmitry said, "If this is not a kernel bug, then the code should
    not use WARN. WARN is for kernel bugs and is recognized as such by
    all testing systems and humans."

    [1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com

    Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com
    Cc: Dmitry Vyukov
    Reviewed-by: Chao Yu
    Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com
    Signed-off-by: Gao Xiang
    Signed-off-by: Sasha Levin

    Gao Xiang
     

23 Sep, 2021

2 commits

  • Currently, the whole indexes will only be compacted with 4B if
    compacted_4b_initial > totalidx. So the calculated compacted_2b is
    worthless in that case and merely wastes CPU resources.

    There is no need to update compacted_4b_initial the way mkfs does,
    since it's used to fulfill the alignment of the 1st compacted_2b
    pack and already handles the case above.

    We also need to clarify compacted_4b_end here. It's used for the
    last lclusters which don't fit in the previous compacted_2b packs.

    Some messages are from Xiang.
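
    A sketch of the resulting calculation (reconstructed from the compact
    index loader in fs/erofs/zmap.c of that era; treat as approximate):

        if (compacted_4b_initial == 32 / 4)
                compacted_4b_initial = 0;

        /* only bother with 2B packs when some indexes actually reach them */
        if ((vi->z_advise & Z_EROFS_ADVISE_COMPACTED_2B) &&
            compacted_4b_initial < totalidx)
                compacted_2b = rounddown(totalidx - compacted_4b_initial, 16);
        else
                compacted_2b = 0;

        /* the remaining (totalidx - compacted_4b_initial - compacted_2b)
         * lclusters form the trailing compacted_4b_end part */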

    Link: https://lore.kernel.org/r/20210914035915.1190-1-zbestahu@gmail.com
    Signed-off-by: Yue Hu
    Reviewed-by: Gao Xiang
    Reviewed-by: Chao Yu
    [ Gao Xiang: it's enough to use "compacted_4b_initial < totalidx". ]
    Signed-off-by: Gao Xiang

    Yue Hu
     
  • Unsupported chunk format should be checked with
    "if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL)"

    Found when checking with 4k-byte blockmap (although currently mkfs
    uses inode chunk indexes format by default.)
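
    A sketch of the corrected check (close to the actual code in
    fs/erofs/inode.c, reconstructed here):

        if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL) {
                erofs_err(inode->i_sb, "unsupported chunk format %x of nid %llu",
                          vi->chunkformat, vi->nid);
                return -EOPNOTSUPP;
        }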

    Link: https://lore.kernel.org/r/20210922095141.233938-1-hsiangkao@linux.alibaba.com
    Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
    Reviewed-by: Liu Bo
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     

10 Sep, 2021

1 commit

  • Pull libnvdimm updates from Dan Williams:

    - Fix a race condition in the teardown path of raw mode pmem
    namespaces.

    - Cleanup the code that filesystems use to detect filesystem-dax
    capabilities of their underlying block device.

    * tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: remove bdev_dax_supported
    xfs: factor out a xfs_buftarg_is_dax helper
    dax: stub out dax_supported for !CONFIG_FS_DAX
    dax: remove __generic_fsdax_supported
    dax: move the dax_read_lock() locking into dax_supported
    dax: mark dax_get_by_host static
    dm: use fs_dax_get_by_bdev instead of dax_get_by_host
    dax: stop using bdevname
    fsdax: improve the FS_DAX Kconfig description and help text
    libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind

    Linus Torvalds
     

03 Sep, 2021

1 commit

  • Pull overlayfs update from Miklos Szeredi:

    - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

    - Improve performance by enabling RCU lookup.

    - Misc fixes and improvements

    The reason this touches so many files is that the ->get_acl() method now
    gets a "bool rcu" argument. The ->get_acl() API was updated based on
    comments from Al and Linus:

    Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

    * tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: enable RCU'd ->get_acl()
    vfs: add rcu argument to ->get_acl() callback
    ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
    ovl: use kvalloc in xattr copy-up
    ovl: update ctime when changing fileattr
    ovl: skip checking lower file's i_writecount on truncate
    ovl: relax lookup error on mismatch origin ftype
    ovl: do not set overlay.opaque for new directories
    ovl: add ovl_allow_offline_changes() helper
    ovl: disable decoding null uuid with redirect_dir
    ovl: consistent behavior for immutable/append-only inodes
    ovl: copy up sync/noatime fileattr flags
    ovl: pass ovl_fs to ovl_check_setxattr()
    fs: add generic helper for filling statx attribute flags

    Linus Torvalds
     

25 Aug, 2021

1 commit

  • Dan reported a new smatch warning [1]
    "fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'"

    Due to new chunk-based format handling logic, the error path can be
    called after kfree(copied).

    Set "copied = NULL" after kfree(copied) to fix this.

    [1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com

    Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com
    Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
    Reported-by: kernel test robot
    Reported-by: Dan Carpenter
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     

20 Aug, 2021

2 commits

  • Add runtime support for chunk-based uncompressed files
    described in the previous patch.

    Link: https://lore.kernel.org/r/20210820100019.208490-2-hsiangkao@linux.alibaba.com
    Reviewed-by: Liu Bo
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Currently, uncompressed data except for tail-packing inline is
    consecutive on disk.

    In order to support chunk-based data deduplication, add a new
    corresponding inode data layout.

    In the future, the data source of chunks can be either (un)compressed.
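
    For reference, the on-disk bits this layout introduces look roughly
    like the following (reconstructed from erofs_fs.h; comments are
    approximate):

        #define EROFS_CHUNK_FORMAT_BLKBITS_MASK 0x001F
        #define EROFS_CHUNK_FORMAT_INDEXES      0x0020
        #define EROFS_CHUNK_FORMAT_ALL  \
                (EROFS_CHUNK_FORMAT_BLKBITS_MASK | EROFS_CHUNK_FORMAT_INDEXES)

        struct erofs_inode_chunk_info {
                __le16 format;          /* chunk blkbits, etc. */
                __le16 reserved;
        };

        /* 8-byte inode chunk index */
        struct erofs_inode_chunk_index {
                __le16 advise;          /* always 0, don't care for now */
                __le16 device_id;       /* back-end storage id, 0 for now */
                __le32 blkaddr;         /* start block address of this chunk */
        };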

    Link: https://lore.kernel.org/r/20210820100019.208490-1-hsiangkao@linux.alibaba.com
    Reviewed-by: Liu Bo
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     

19 Aug, 2021

3 commits

  • Add a rcu argument to the ->get_acl() callback to allow
    get_cached_acl_rcu() to call the ->get_acl() method in the next patch.
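
    The updated method signature, sketched (the -ECHILD convention is how
    an implementation asks the caller to retry without RCU):

        /* in struct inode_operations: */
        struct posix_acl *(*get_acl)(struct inode *inode, int type, bool rcu);

        /* when rcu is true, the implementation must not block; returning
         * ERR_PTR(-ECHILD) tells the caller to fall back to non-RCU mode. */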

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This adds fiemap support for both uncompressed files and compressed
    files by using iomap infrastructure.

    Link: https://lore.kernel.org/r/20210813052931.203280-3-hsiangkao@linux.alibaba.com
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Previously, there was no need to get the full decompressed length
    since EROFS supports partial decompression. However, for some other
    cases such as fiemap, the full decompressed length is necessary for
    iomap to work properly.

    This patch adds a way to get the full decompressed length. Note that
    it takes more metadata overhead, so it'd better be avoided if
    possible in performance-sensitive scenarios.

    Link: https://lore.kernel.org/r/20210818152231.243691-1-hsiangkao@linux.alibaba.com
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     


10 Aug, 2021

3 commits

  • Since tail-packing inline has been supported by iomap now, let's
    convert all EROFS uncompressed data I/O to iomap, which is pretty
    straightforward.

    Link: https://lore.kernel.org/r/20210805003601.183063-4-hsiangkao@linux.alibaba.com
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • DAX is quite useful for some VM use cases, since guest memory can be
    saved dramatically with a minimal, lightweight EROFS.

    In order to prepare for such use cases, add preliminary DAX support
    for non-tailpacking regular files for now.

    Tested with the DRAM-emulated PMEM and the EROFS image generated by
    "mkfs.erofs -Enoinline_data enwik9.fsdax.img enwik9"

    Link: https://lore.kernel.org/r/20210805003601.183063-3-hsiangkao@linux.alibaba.com
    Cc: nvdimm@lists.linux.dev
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Add iomap support for non-tailpacking uncompressed data in order to
    support DIO and DAX.

    Direct I/O is useful in certain scenarios for uncompressed files.
    For example, the double page cache can be avoided by using direct
    I/O when a loop device is used over uncompressed files containing an
    upper-layer compressed filesystem.

    This adds iomap DIO support for the non-tailpacking cases first;
    tail-packing inline files are handled in a follow-up patch.

    Link: https://lore.kernel.org/r/20210805003601.183063-2-hsiangkao@linux.alibaba.com
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Chao Yu
    Signed-off-by: Huang Jianan
    Signed-off-by: Gao Xiang

    Huang Jianan
     

08 Jun, 2021

3 commits

  • - Remove my outdated, misleading email address;

    - Get rid of all unnecessary trailing newlines added by accident.

    Link: https://lore.kernel.org/r/20210602160634.10757-1-xiang@kernel.org
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • The variable `occupied' has no effect in z_erofs_attach_page(),
    which is the only caller of z_erofs_pagevec_enqueue(), so remove it.

    Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com
    Signed-off-by: Yue Hu
    Reviewed-by: Gao Xiang
    Signed-off-by: Gao Xiang

    Yue Hu
     
  • 'ret' will be overwritten to 0 if erofs_sb_has_sb_chksum() returns true,
    thus 0 will be returned in some error handling cases. Fix it by
    returning the negative error code -EINVAL instead of 0.
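
    A sketch of the fix in erofs_read_superblock() (reconstructed from
    fs/erofs/super.c; surrounding checks elided):

        ret = -EINVAL;
        /* ... layout checks that "goto out" with ret == -EINVAL ... */

        if (erofs_sb_has_sb_chksum(sbi)) {
                ret = erofs_superblock_csum_verify(sb, data);
                if (ret)
                        goto out;
        }

        /* the fix: re-arm the error code, since a successful checksum
         * verification leaves ret == 0 here */
        ret = -EINVAL;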

    Link: https://lore.kernel.org/r/20210519141657.3062715-1-weiyongjun1@huawei.com
    Fixes: b858a4844cfb ("erofs: support superblock checksum")
    Cc: stable # 5.5+
    Reported-by: Hulk Robot
    Signed-off-by: Wei Yongjun
    Reviewed-by: Gao Xiang
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Wei Yongjun
     

13 May, 2021

1 commit

  • If the lcluster following a HEAD lcluster is another HEAD or PLAIN
    type rather than a CBLKCNT NONHEAD type, it means the pclustersize
    _must_ be 1 lcluster (since its uncompressed size is < 2 lclusters),
    as illustrated below:

       HEAD                HEAD / PLAIN    lcluster type
         ____________ ____________
        |_:__________|_________:__|        file data (uncompressed)
         .            .
        .____________.
        |____________|                     pcluster data (compressed)

    Such an on-disk case was explained before [1] but wasn't handled
    properly in the runtime implementation.

    It can be observed by manually generating a 1 lcluster-sized
    pcluster covering 2 lclusters (thus no CBLKCNT exists.) Let's fix
    it now.

    [1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org

    Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org
    Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes")
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     

10 Apr, 2021

9 commits

  • Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are
    all settled properly.

    Link: https://lore.kernel.org/r/20210407043927.10623-11-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Prior to big pcluster, there was only one compressed page, so it was
    easy to map. However, when big pcluster is enabled, more work needs
    to be done to handle multiple compressed pages. In detail,

    - (maptype 0) if there is only one compressed page + no need
    to copy inplace I/O, just map it directly as we did before;

    - (maptype 1) if there are more compressed pages + no need to
    copy inplace I/O, vmap such compressed pages instead;

    - (maptype 2) if inplace I/O needs to be copied, use per-CPU
    buffers for decompression then.

    Another thing is how to detect whether inplace decompression is
    feasible or not (it's still quite easy for non-big pclusters): apart
    from the inplace margin calculation, the inplace I/O page reusing
    order also needs to be considered for each compressed page.
    Currently, if the compressed page is the xth page, it shouldn't be
    reused as [0 ... nrpages_out - nrpages_in + x], otherwise a full
    copy will be triggered.

    Although there are some extra optimization ideas for this, I'd like
    to make big pcluster work correctly first; obviously it can be
    further optimized later since it has nothing to do with the on-disk
    format at all.
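
    The three maptypes above, condensed into a hedged sketch (helper
    names here are hypothetical; the real logic lives in
    fs/erofs/decompressor.c):

        if (nrpages_in == 1 && !must_copy_inplace)
                src = kmap_atomic(compressed_pages[0]);         /* maptype 0 */
        else if (!must_copy_inplace)
                src = vm_map_ram(compressed_pages, nrpages_in, -1); /* maptype 1 */
        else
                src = copy_to_percpu_buffer(compressed_pages,
                                            nrpages_in);        /* maptype 2 */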

    Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Different from non-compact indexes, several lclusters are packed
    into the compact form at once, and a unique base blkaddr is stored
    for each pack, so each lcluster index takes less space on average
    (e.g. 2 bytes for COMPACT_2B.) Btw, that is also why the
    BIG_PCLUSTER switch should be consistent for compact head0/1.

    Prior to big pcluster, the size of all pclusters was 1 lcluster.
    Therefore, when a new HEAD lcluster was scanned, blkaddr would be
    bumped by 1 lcluster. However, that way doesn't work anymore for
    big pcluster since we actually don't know the compressed size of
    pclusters in advance (before reading CBLKCNT lcluster).

    So, instead, let blkaddr of each pack be the first pcluster blkaddr
    with a valid CBLKCNT, in detail,

    1) if CBLKCNT starts at the pack, this first valid pcluster is
       itself, e.g.
         _____________________________________________________________
        |_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
         ^ = blkaddr base          ^ += CBLKCNT0           ^ += CBLKCNT1

    2) if CBLKCNT doesn't start at the pack, the first valid pcluster
       is the next pcluster, e.g.
         _________________________________________________________
        | NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
         ^ = blkaddr base        ^ += CBLKCNT0
                                 ^ += 1

    When a CBLKCNT is found, blkaddr will be increased by CBLKCNT
    lclusters; or when a new HEAD is found immediately after, blkaddr is
    bumped by 1 instead (see the picture above.)

    Also note that if a CBLKCNT is at the end of the pack, instead of
    storing delta1 (the distance to the next HEAD lcluster) as normal
    NONHEADs do, it still stores the compressed block count (delta0),
    since delta1 can be calculated indirectly but the block count can't.

    Adjust decoding logic to fit big pcluster compact indexes as well.
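
    The blkaddr advance described above, as pseudo-helpers (hypothetical
    names, for illustration only):

        /* while scanning lcluster indexes inside one compact pack */
        if (lcluster_is_cblkcnt(m))
                blkaddr += cblkcnt(m);  /* skip the whole big pcluster */
        else if (lcluster_is_head(m))
                blkaddr += 1;           /* 1-lcluster-sized pcluster */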

    Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • When the INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress
    indexes will also have the same on-disk header as compact indexes to
    keep per-file configurations instead of leaving it zeroed.

    If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for
    each pcluster in this file by parsing the 1st non-head lcluster.

    Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Adjust per-CPU buffers on demand since the big pcluster definition is
    available. Also, bail out for unsupported pcluster sizes according to
    Z_EROFS_PCLUSTER_MAX_SIZE.

    Link: https://lore.kernel.org/r/20210407043927.10623-7-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Big pcluster indicates that the size of compressed data for each
    physical pcluster is no longer fixed as the block size, but could be
    more than 1 block (more accurately, 1 logical cluster).

    When the big pcluster feature is enabled for head0/1, delta0 of the
    1st non-head lcluster index will keep the block count of this
    pcluster in lcluster units instead of 1. Otherwise, the compressed
    size of the pcluster should be 1 lcluster if the pcluster has no
    non-head lcluster index.

    Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since
    it depends on COMPR_CFGS and will be released together.

    Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • When picking up inplace I/O pages, they should be traversed in
    reverse order, in alignment with the traversal order of file-backed
    online pages. Also, the index should be updated together when
    preloading compressed pages.

    Previously, only page-sized pclustersize was supported, so there was
    no problem at all. Also rename `compressedpages' to `icpage_ptr' to
    reflect its functionality.

    Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Since multiple pcluster sizes could be used at once, the number of
    compressed pages will become a variable factor. It's necessary to
    introduce slab pools rather than a single slab cache now.

    This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE) and
    gets rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no
    use now.
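
    A sketch of the slab pool table this introduces (reconstructed from
    fs/erofs/zdata.c; the exact size ladder may differ):

        struct z_erofs_pcluster_slab {
                struct kmem_cache *slab;
                unsigned int maxpages;
                char name[48];
        };

        #define _PCLP(n) { .maxpages = n }

        static struct z_erofs_pcluster_slab pcluster_pool[] __read_mostly = {
                _PCLP(1), _PCLP(4), _PCLP(16), _PCLP(64), _PCLP(128),
                _PCLP(Z_EROFS_PCLUSTER_MAX_PAGES)
        };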

    Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • To deal with cases in which inplace decompression is infeasible for
    some inplace I/O, per-CPU buffers were introduced to get rid of page
    allocation latency and thrashing for low-latency decompression
    algorithms such as lz4.

    For the big pcluster feature, introduce multipage per-CPU buffers to
    keep such inplace I/O pclusters temporarily as well, but note that
    per-CPU pages are just consecutive virtually.

    When a new big pcluster fs is mounted, its max pclustersize will be
    read and per-CPU buffers can be grown if needed. Shrinking adjustable
    per-CPU buffers is more complex (because we don't know if such a size
    is still in use), so currently we just release them all when
    unloading.

    Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     

07 Apr, 2021

1 commit

  • The formal big pcluster design is actually more powerful / flexible
    than previously thought, whose pclustersize was fixed as power-of-2
    blocks, which was obviously inefficient and space-wasting. Instead,
    pclustersize can now be set independently for each pcluster, so
    various pcluster sizes can also be used together in one file if mkfs
    wants (for example, according to data type and/or compression ratio).

    Let's get rid of the previous physical_clusterbits[] setting (also
    note that the corresponding on-disk fields are still 0 for now).
    Therefore, head1/2 can be used for at most 2 different algorithms in
    one file, and again pclustersize is now independent of these.

    Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org
    Acked-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     


29 Mar, 2021

4 commits

  • Add a bitmap for available compression algorithms and a variable-sized
    on-disk table for compression options in preparation for the upcoming
    big pcluster and LZMA algorithm; the table follows the end of the
    super block.

    To parse the compression options, the bitmap is scanned bit by bit.
    For each available algorithm, there is a 2-byte `length' field
    followed by its option data correspondingly (that's enough for most
    cases, otherwise entire fs blocks should be used.)

    With such an available algorithm bitmap, the kernel itself can also
    refuse to mount such a filesystem if any unsupported compression
    algorithm exists.

    Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER.
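
    The parsing loop, loosely reconstructed (details approximate; the
    real code is erofs_load_compr_cfgs()):

        for (algs = sbi->available_compr_algs; algs; algs >>= 1, ++alg) {
                void *data;

                if (!(algs & 1))
                        continue;

                /* reads the 2-byte length prefix, then the payload */
                data = erofs_read_metadata(sb, &page, &offset, &len);
                if (IS_ERR(data)) {
                        ret = PTR_ERR(data);
                        break;
                }

                switch (alg) {
                case Z_EROFS_COMPRESSION_LZ4:
                        ret = z_erofs_load_lz4_config(sb, dsb, data, len);
                        break;
                default:
                        ret = -EOPNOTSUPP;      /* refuse unsupported algorithms */
                        break;
                }
                kfree(data);
                if (ret)
                        break;
        }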

    Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • Introduce z_erofs_lz4_cfgs to store all lz4 configurations.
    Currently it's only max_distance, but will be used for new
    features later.

    Link: https://lore.kernel.org/r/20210329012308.28743-4-hsiangkao@aol.com
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang
     
  • lz4 uses LZ4_DISTANCE_MAX to record history preservation. When
    using rolling decompression, a block with a higher compression
    ratio will cause a larger memory allocation (up to 64k). It may
    cause a large resource burden in extreme cases on devices with
    small memory and a large number of concurrent IOs. So appropriately
    reducing this value can improve performance.

    Decreasing this value will reduce the compression ratio (except when
    input_size < LZ4_DISTANCE_MAX).
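
    For reference, a sketch of the per-sb state this adds (the Gao Xiang
    note below mentions struct erofs_sb_lz4_info; reconstructed,
    approximate):

        struct erofs_sb_lz4_info {
                /* # of pages needed for EROFS lz4 rolling decompression */
                u16 max_distance_pages;
        };

        /* used when the on-disk value is 0, i.e. the full history window */
        #define LZ4_MAX_DISTANCE_PAGES \
                (DIV_ROUND_UP(LZ4_DISTANCE_MAX, PAGE_SIZE) + 1)
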
    Signed-off-by: Huang Jianan
    Signed-off-by: Guo Weichao
    [ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ]
    Signed-off-by: Gao Xiang

    Huang Jianan
     
  • Introduce erofs_sb_has_xxx() to make long checks short, especially
    for later big pcluster & LZMA features.
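
    A sketch of the helper-generating macro (reconstructed from
    fs/erofs/internal.h):

        #define EROFS_FEATURE_FUNCS(name, compat, feature) \
        static inline bool erofs_sb_has_##name(struct erofs_sb_info *sbi) \
        { \
                return sbi->feature_##compat & EROFS_FEATURE_##feature; \
        }

        EROFS_FEATURE_FUNCS(lz4_0padding, incompat, INCOMPAT_LZ4_0PADDING)
        EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)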

    Link: https://lore.kernel.org/r/20210329012308.28743-2-hsiangkao@aol.com
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang

    Gao Xiang