v3-alpha: Rbyds and B-trees and gcksums, oh my! #1111
Conversation
Not sure what the point of this was; I think it was copied from a d3 example svg at some point. But it forces the svg to always fit in the window, even if this makes the svg unreadable. These svgs tend to end up questionably large in order to fit in the most info, so the unreadableness ends up being a real problem for even modest window sizes.
I don't think this hyphen broke anything, but this matches the naming convention of other files in the codebase (lfs_util.h for example).
Removing the vestiges of v2 tests.
Kinda. It's actually only 3 scripts. These have been replaced with the new dbg*.py scripts:
- readblock.py -> dbgblock.py
- readmdir.py -> dbgrbyd.py
- readtree.py -> dbglfs.py
Removing the final vestiges of v2.
This one was a bit more involved. Removes utils that are no longer useful, and makes sure some of the name/API changes over time are adopted consistently:
- lfs_npw2 -> lfs_nlog2
- lfs_tole32_ -> lfs_tole32
- lfs_fromle32_ -> lfs_fromle32
Also did another pass for lfs_ prefixes on mem/str functions. The habit to use the naked variants of these is hard to break!
Like the LFS_DISK_VERSION, the intention is to mark the current driver as experimental. Just in case...
There is high risk this thread drowns in noise. So, please limit this thread to high-level v3-wide discussion. For bugs, low-level questions, concerns, etc, feel free to create issues on littlefs with "v3" in the title. I will label relevant issues with the v3 label when I can.
Fantastic! Great to see this v3 work making its way into the world. The code size increase would be the biggest concern for MicroPython. Any optimisations to get that down would be welcome. In particular we put read-only littlefs2 inside a 32k bootloader. Do you have any idea how big the read-only version of v3 is?
@dpgeorge, I don't know quite yet. The non-default builds aren't fully reimplemented, but I'll update this thread when they are. Read-only is the most interesting but also the most time consuming. The good news is that rbyds (and red-black trees) and B-trees are much more complicated on the write side than the read side, so the impact should be less than the default build, but I'll wait to stick my foot in my mouth until after I have actual numbers.
Looking forward to the simple key-value APIs in v3
@Ryan-CW-Code, it's worth noting the simple key-value APIs are stretch goals. They may or may not make it into v3 depending on how long things take and funding. Or to put it another way, stretch goals won't block the release of v3. But it would be convenient to get them in while breaking the API, to minimize future API breaks. Though if they don't make it into v3, I would like to introduce them in future releases, if possible. Note most stretch and out-of-scope features are mainly API related and unlikely to affect the on-disk format.
My English is not very good, please forgive me for using machine translation. |
This is the new readonly flag, to be consistent with LFS3_M_RDONLY and
friends.
Note this overlaps with LFS3_YES_RDONLY in a weird way, where
LFS3_YES_RDONLY is basically just an alias for LFS3_RDONLY. For most
flags, LFS3_THING enables the _option_ of using LFS3_M_THING, with
LFS3_YES_THING implying LFS3_M_THING in all mount calls. But
LFS3_RDONLY _disables_ the option of using LFS3_M_RDWR, so it's a bit
different...
Do we really need two flags for the same thing? Not sure. But most users
coming from other filesystems probably expect LFS3_RDONLY.
Worst case this can be revisited in the planned config API rework.
---
As for the readonly code size, this is just the first draft and limited
to mostly ifdefing out all prog/write logic paths. There are some TODOs in
the code that may save a bit more (rbyd.eoff, file.b.shrub_ for
example). But the results are looking ok:
code stack ctx
v2.11.0 rdonly: 6270 448 580
v3-alpha rdonly: 10776 (+71.9%) 840 (+87.5%) 524 (-9.7%)
It's interesting to note most of the additional code/stack cost comes
from filesystem traversal. In v2, the threaded linked-list made rdonly
traversal _incredibly_ cheap. But the extra rdwr baggage of turning
littlefs into a fully connected graph made it something to be avoided
in v3.
This hits v3 with the double whammy of:
1. Filesystem traversal is more complicated since we need to keep track
of which btree and where in the btree we are
2. Everything needs to be tracked explicitly due to the new inverted
state-machine driven API (no callbacks)
Note that even if we disabled the traversal APIs, lfs3_fs_usage, cksum
checking, etc, we'd still need to traverse to rebuild gstate. Otherwise
we risk showing grmed files after a powerloss.
---
This did affect the default build a little bit, due to moving things
around for nicer ifdef groupings:
code stack ctx
default before: 37300 2280 636
default after: 37304 (+0.0%) 2280 (+0.0%) 636 (+0.0%)
This function is kinda ugly in that our failed label expects the
dirty/mutated flags to be swapped, but we only swap _after_ calling
lfs3_mtree_traverse to avoid messing up lfs3_mtree_traverse's eot logic.
Long story short, this goto failed after lfs3_mtree_traverse could end
up with dirty/mutated in the wrong state.
Worst case, this can leave littlefs in a state where it thinks work was
accomplished, but only if lfs3_mtree_traverse encounters an exceptional
error (LFS3_ERR_IO? LFS3_ERR_CORRUPT?), which usually leads to emergency
actions anyways.
We probably need more testing around exceptional errors like these;
they're also the main limit to our line/branch coverage. But the work
will be tedious, so for now that's a future thing.
I at least added a comment to hopefully prevent a similar regression.
Code changes were minimal, and humorously undo the LFS3_RDONLY noise:
code stack ctx
before: 37304 2280 636
after: 37300 (-0.0%) 2280 (+0.0%) 636 (+0.0%)
Having on-disk definitions in one place is useful for referencing them later, even if they aren't relevant for most API users. .h files in C are already forced to expose a bunch of internal details anyways, in order to provide struct size/alignment. Might as well include on-disk information that would have even bigger consequences if it changed.
Moved:
- Compat flag definitions
- Tag definitions
- DSIZEs and relevant encoding comments
Note some of these were already required to define lfs3_t.
This includes the mask/rm/grow bits:
- LFS3_tag_RM
- LFS3_tag_GROW
- LFS3_tag_MASK0/2/8/12
Our in-device only handle types:
- LFS3_tag_ORPHAN
- LFS3_tag_TRV
- LFS3_tag_UNKNOWN
And in-device only tags with special behavior:
- LFS3_tag_INTERNAL
- LFS3_tag_RATTRS
- LFS3_tag_SHRUBCOMMIT
- LFS3_tag_GRMPUSH
- LFS3_tag_MOVE
- LFS3_tag_ATTRS
Usually I'm not a big fan of case-sensitive naming patterns, but this has been useful for self-documenting what compat flags are in-device only. Might as well extend the idea to our tag definitions.
I can't think of a reason this should be uint8_t. Bumping it up to uint32_t matches the type used for other flags (even though whence is arguably not flags in a strict sense). No code changes.
This more closely matches behavior of functions like mkdir and remove,
even though mkgbmap/rmgbmap operate on a special object and not files.
Besides, returning an error is more useful as users are always free to
ignore said error.
Adds what appears to be one literal to mkgbmap (curiously not rmgbmap?
snuck into alignment?):
code stack ctx
before: 35968 2280 660
after: 35968 (+0.0%) 2280 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38824 2296 772
gbmap after: 38828 (+0.0%) 2296 (+0.0%) 772 (+0.0%)
TLDR: Replaced lfs3_file_ckmeta/ckdata and lfs3_fs_ckmeta/ckdata with
flag based ck functions:
- lfs3_file_ckmeta -> lfs3_file_ck + LFS3_CK_CKMETA
- lfs3_file_ckdata -> lfs3_file_ck + LFS3_CK_CKDATA
- lfs3_fs_ckmeta -> lfs3_fs_ck + LFS3_FSCK_CKMETA
- lfs3_fs_ckdata -> lfs3_fs_ck + LFS3_FSCK_CKDATA
Note lfs3_fs_ck is equivalent to lfs3_fs_gc, but:
1. Performs the work in one call (equivalent to littlefs2's lfs2_fs_gc)
2. Takes flags at call time (like lfs3_mount) instead of cfg time (like
lfs3_fs_gc)
3. Avoids the constant RAM necessary to track incremental GC state
---
Motivation:
I've been thinking: It's a bit weird that users are able to one-shot
janitorial work in lfs3_mount, but there's no equivalent function after
the filesystem is mounted.
Originally this is what lfs3_fs_gc was for, but after adding support for
incremental GC, it made sense to hide lfs3_fs_gc behind the opt-in
LFS3_GC ifdef due to the extra (ironically non-gc-able) state.
In theory lfs3_trv_t fills a bit of the gap, but, without the internal
i_flag handling and traversal restarts, it's a bit hard to use. And it
basically requires duplicating said logic, which we need anyways for
lfs3_mount!
So ideally we'd add an explicit one-shot GC function, but now lfs3_fs_gc
is taken.
While thinking about alternative names, I realized we can just call this
lfs3_fs_ck and completely replace lfs3_fs_ckmeta/ckdata.
This has some extra benefits:
- Avoids an explosion of ckmeta/ckdata/repairmeta/repairdata functions
- Discourages redundant traversals that could accomplish more work
- Makes it less confusing that ckdata implies ckmeta
---
I also tweaked lfs3_file_ck to match, but note that lfs3_file_ck is
internally very different from lfs3_fs_ck. For one, lfs3_file_ck only
supports "actual" check flags (LFS3_CK_*) vs all gc flags (LFS3_FSCK_*):
lfs3_file_ck:
LFS3_CK_CKMETA 0x00010000 Check metadata checksums
LFS3_CK_CKDATA 0x00020000 Check metadata + data checksums
LFS3_CK_REPAIRMETA* 0x00040000 Repair metadata blocks
LFS3_CK_REPAIRDATA* 0x00080000 Repair metadata + data blocks
* Planned
lfs3_fs_ck:
LFS3_FSCK_MKCONSISTENT 0x00000800 Make the filesystem consistent
LFS3_FSCK_LOOKAHEAD 0x00001000 Repopulate lookahead buffer
LFS3_FSCK_LOOKGBMAP 0x00002000 Repopulate the gbmap
LFS3_FSCK_PREERASE* 0x00004000 Pre-erase unused blocks
LFS3_FSCK_COMPACTMETA 0x00008000 Compact metadata logs
LFS3_FSCK_CKMETA 0x00010000 Check metadata checksums
LFS3_FSCK_CKDATA 0x00020000 Check metadata + data checksums
LFS3_FSCK_REPAIRMETA* 0x00040000 Repair metadata blocks
LFS3_FSCK_REPAIRDATA* 0x00080000 Repair metadata + data blocks
* Planned
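To make the split concrete, here's a Python rendering of the values above (the flag names and hex values come straight from the tables; the file_ck_flags_ok helper is purely illustrative, not a real API):

```python
# Sketch of the planned flag split, using the hex values from the tables
# above; the real flags are C macros, Python is just for illustration

# Flags accepted by lfs3_file_ck (the "actual" check flags)
LFS3_CK_CKMETA = 0x00010000  # check metadata checksums
LFS3_CK_CKDATA = 0x00020000  # check metadata + data checksums

# Superset accepted by lfs3_fs_ck (check flags + one-shot GC work)
LFS3_FSCK_MKCONSISTENT = 0x00000800
LFS3_FSCK_LOOKAHEAD    = 0x00001000
LFS3_FSCK_CKMETA       = 0x00010000
LFS3_FSCK_CKDATA       = 0x00020000

# the file-level check flags share bit positions with the fs-level
# flags, forming a strict subset
FILE_CK_MASK = LFS3_CK_CKMETA | LFS3_CK_CKDATA

def file_ck_flags_ok(flags):
    # a file-level assert might reject fs-only flags like this
    return (flags & ~FILE_CK_MASK) == 0

print(file_ck_flags_ok(LFS3_CK_CKMETA))                           # True
print(file_ck_flags_ok(LFS3_FSCK_MKCONSISTENT | LFS3_CK_CKMETA))  # False
```

This is just the subset relationship the tables describe: any fs-only bit set makes the file-level check reject the flags.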
As a plus, this also saves a bit of code:
code stack ctx
before: 35968 2280 660
after: 35924 (-0.1%) 2280 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38828 2296 772
gbmap after: 38812 (-0.0%) 2296 (+0.0%) 772 (+0.0%)
Unintentionally arriving at the infamous "fsck" name is a bit funny. But it's probably something we don't want to conflict with if we can help it, on the off chance we want a sort of lfs3_fsck function in the future. (This is all hypothetical, but lfs3_fsck may expect an unmounted filesystem, and have a much larger scope than lfs3_fs_ck. Though typing this out now I'm realizing how confusing that might be...)
Since lfs3_file_ck and lfs3_fs_ck share a subset of flags, it's not _entirely_ unreasonable for lfs3_file_ck and lfs3_fs_ck to share the same namespace. There's a risk of confusing users around what flags lfs3_file_ck accepts, but we have asserts, and said flags (LFS3_CK_MKCONSISTENT, LFS3_CK_LOOKAHEAD, etc) just don't really make sense in lfs3_file_ck:
fs file
y       LFS3_CK_MKCONSISTENT 0x00000800 Make the filesystem consistent
y       LFS3_CK_LOOKAHEAD    0x00001000 Repopulate lookahead buffer
y       LFS3_CK_LOOKGBMAP    0x00002000 Repopulate the gbmap
y       LFS3_CK_PREERASE*    0x00004000 Pre-erase unused blocks
y       LFS3_CK_COMPACTMETA  0x00008000 Compact metadata logs
y  y    LFS3_CK_CKMETA       0x00010000 Check metadata checksums
y  y    LFS3_CK_CKDATA       0x00020000 Check metadata + data checksums
y  y    LFS3_CK_REPAIRMETA*  0x00040000 Repair metadata blocks
y  y    LFS3_CK_REPAIRDATA*  0x00080000 Repair metadata + data blocks
* Planned
Another option would be to document that lfs3_fs_ck accepts both LFS3_CK_* _and_ LFS3_GC_* flags, but I worry that would be more confusing. It would also lock us into supporting all LFS3_GC_* flags in lfs3_fs_ck, which may not always be the case. Though this is an argument for doing away with the whole LFS3_M/F/CK/GC/I_* duplication... (tbh another reason for this is to reduce the number of namespaces by at least one).
No code changes.
Though note most high-level calls (lfs3_file_close, lfs3_dir_close,
etc), still include an assertion at a higher-level.
Why make the internal APIs harder to use than they need to be? As a
plus this drops the need for a separate bool-returning
lfs3_handle_close_.
Saves a bit of code:
code stack ctx
before: 35924 2280 660
after: 35912 (-0.0%) 2280 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38812 2296 772
gbmap after: 38800 (-0.0%) 2296 (+0.0%) 772 (+0.0%)
This was motivated by a discussion with a gh user, in which it was noted
that not having a reserved suptype for internal tags risks potential
issues with long-term future tag compatibility.
I think the risk is low, but, without a reserved suptype, it _is_
possible for a future tag to conflict with an internal tag in an older
driver version, potentially and unintentionally breaking compatibility.
Note this is especially concerning during mdir compactions, where we
copy tags we may not understand otherwise.
In littlefs2 we reserved suptype=0x100, though this was mostly an
accident due to saturating the 3-bit suptype space. With the larger tag
space in littlefs3, the reserved suptype=0x100 was dropped.
---
Long story short, this reserves suptype=0 for internal tags (well, and
null, which is _mostly_ internal only, but does get written to disk as
unreachable tags).
Unfortunately, adding a new suptype _did_ require moving a bunch of
stuff around:
LFS3_TAG_NULL 0x0000 v--- ---- +--- ----
LFS3_TAG_INTERNAL 0x00tt v--- ---- +ttt tttt
LFS3_TAG_CONFIG 0x01tt v--- ---1 +ttt tttt
LFS3_TAG_MAGIC 0x0131 v--- ---1 +-11 --rr
LFS3_TAG_VERSION 0x0134 v--- ---1 +-11 -1--
LFS3_TAG_RCOMPAT 0x0135 v--- ---1 +-11 -1-1
LFS3_TAG_WCOMPAT 0x0136 v--- ---1 +-11 -11-
LFS3_TAG_OCOMPAT 0x0137 v--- ---1 +-11 -111
LFS3_TAG_GEOMETRY 0x0138 v--- ---1 +-11 1---
LFS3_TAG_NAMELIMIT 0x0139 v--- ---1 +-11 1--1
LFS3_TAG_FILELIMIT 0x013a v--- ---1 +-11 1-1-
LFS3_TAG_ATTRLIMIT? 0x013b v--- ---1 +-11 1-11
LFS3_TAG_GDELTA 0x02tt v--- --1- +ttt tttt
LFS3_TAG_GRMDELTA 0x0230 v--- --1- +-11 ----
LFS3_TAG_GBMAPDELTA 0x0234 v--- --1- +-11 -1rr
LFS3_TAG_GDDTREEDELTA* 0x0238 v--- --1- +-11 1-rr
LFS3_TAG_GPTREEDELTA* 0x023c v--- --1- +-11 11rr
LFS3_TAG_NAME 0x03tt v--- --11 +ttt tttt
LFS3_TAG_BNAME 0x0300 v--- --11 +--- ----
LFS3_TAG_REG 0x0301 v--- --11 +--- ---1
LFS3_TAG_DIR 0x0302 v--- --11 +--- --1-
LFS3_TAG_STICKYNOTE 0x0303 v--- --11 +--- --11
LFS3_TAG_BOOKMARK 0x0304 v--- --11 +--- -1--
LFS3_TAG_SYMLINK? 0x0305 v--- --11 +--- -1-1
LFS3_TAG_SNAPSHOT? 0x0306 v--- --11 +--- -11-
LFS3_TAG_MNAME 0x0330 v--- --11 +-11 ----
LFS3_TAG_DDNAME* 0x0350 v--- --11 +1-1 ----
LFS3_TAG_DDTOMB* 0x0351 v--- --11 +1-1 ---1
LFS3_TAG_STRUCT 0x04tt v--- -1-- +ttt tttt
LFS3_TAG_BRANCH 0x040r v--- -1-- +--- --rr
LFS3_TAG_DATA 0x0404 v--- -1-- +--- -1--
LFS3_TAG_BLOCK 0x0408 v--- -1-- +--- 1err
LFS3_TAG_DDKEY* 0x0410 v--- -1-- +--1 ----
LFS3_TAG_DID 0x0420 v--- -1-- +-1- ----
LFS3_TAG_BSHRUB 0x0428 v--- -1-- +-1- 1---
LFS3_TAG_BTREE 0x042c v--- -1-- +-1- 11rr
LFS3_TAG_MROOT 0x0431 v--- -1-- +-11 --rr
LFS3_TAG_MDIR 0x0435 v--- -1-- +-11 -1rr
LFS3_TAG_MSHRUB+ 0x0438 v--- -1-- +-11 1---
LFS3_TAG_MTREE 0x043c v--- -1-- +-11 11rr
LFS3_TAG_BMRANGE 0x044u v--- -1-- +1-- ++uu
LFS3_TAG_BMFREE 0x0440 v--- -1-- +1-- ----
LFS3_TAG_BMINUSE 0x0441 v--- -1-- +1-- ---1
LFS3_TAG_BMERASED 0x0442 v--- -1-- +1-- --1-
LFS3_TAG_BMBAD 0x0443 v--- -1-- +1-- --11
LFS3_TAG_DDRC* 0x0450 v--- -1-- +1-1 ----
LFS3_TAG_DDPCOEFF* 0x0451 v--- -1-- +1-1 ---1
LFS3_TAG_PCOEFFMAP* 0x0460 v--- -1-- +11- ----
LFS3_TAG_ATTR 0x06aa v--- -11a +aaa aaaa
LFS3_TAG_UATTR 0x06aa v--- -11- +aaa aaaa
LFS3_TAG_SATTR 0x07aa v--- -111 +aaa aaaa
LFS3_TAG_SHRUB 0x1kkk v--1 kkkk +kkk kkkk
LFS3_TAG_ALT 0x4kkk v1cd kkkk +kkk kkkk
LFS3_TAG_CKSUM 0x300p v-11 ---- ++++ +pqq
LFS3_TAG_NOTE 0x3100 v-11 ---1 ++++ ++++
LFS3_TAG_ECKSUM 0x3200 v-11 --1- ++++ ++++
LFS3_TAG_GCKSUMDELTA 0x3300 v-11 --11 ++++ ++++
* Planned
+ Reserved
? Hypothetical
Some additional notes:
- I was on the fence on keeping the 0x30 prefix on config tags now that
it is no longer needed to differentiate from null, but ultimately
decided to keep it because: 1. it's fun, 2. it decreases the chance
of false positives, 3. it keeps the redund bits readable in hexdumps,
and 4. it reserves some tags < config, which is useful since order
matters.
Instead, I pushed the 0x30 prefix to _more_ tags, mainly gstate.
As a coincidence, meta related tags (MNAME, MROOT, MTREE) all shifted
to also have the 0x30 prefix, which is a nice bit of unexpected
consistency.
- I also considered reserving the redund bits across the config tags
similarly to what we've done in struct/gstate tags, but decided
against it as 1. it significantly reduces the config tag space
available, and 2. makes alignment with VERSION + R/W/OCOMPAT a bit
awkward.
Instead I think we should relax the redund bit alignment in other
suptypes, though in practice the intermixing of non-redund and redund
tags makes this a bit difficult.
Maybe we should consider including redund bits as a hint for things
like DATA? DDKEY? BSHRUB? etc?
- I created a bit more space for file btree struct tags, allowing for
both the future planned DDKEY, and BLOCK with optional erased-bit. We
don't currently use this, but it may be useful for the future planned
gddtree, which in-theory can track erased-state in partially written
file blocks.
Currently tracking erased-state in file blocks is difficult due to
the potential of multiple references, and inability to prevent ecksum
conflicts in raw data blocks.
- UATTR/SATTR bumped up to 0x600/0x700 to keep the 1-bit alignment,
leaving the suptype 0x500 unused. Though this may be useful if we ever
run out of struct tags (suptype=0x400), which is likely where most new
tags will go.
---
Code changes were minimal, but with a bunch of noise:
code stack ctx
before: 35912 2280 660
after: 35920 (+0.0%) 2280 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38800 2296 772
gbmap after: 38812 (+0.0%) 2296 (+0.0%) 772 (+0.0%)
Well, kinda. At the moment we don't have any redund support (it's a TODO), so arguably redund=0 and this is just a comment tweak. Though our mdirs _are_ already redund=1... so maybe these should actually set redund=1? It's unclear, so for now I've just tweaked the comment, and we should probably revisit when _actually_ implementing meta/data redundancy.
---
Note this only really affects struct tags:
LFS3_TAG_STRUCT 0x04tt v--- -1-- +ttt tttt
LFS3_TAG_BRANCH 0x040r v--- -1-- +--- --rr
LFS3_TAG_DATA 0x0404 v--- -1-- +--- -1rr
LFS3_TAG_BLOCK 0x0408 v--- -1-- +--- 1err
LFS3_TAG_DDKEY* 0x0410 v--- -1-- +--1 --rr
LFS3_TAG_DID 0x0420 v--- -1-- +-1- ----
LFS3_TAG_BSHRUB 0x0428 v--- -1-- +-1- 1-rr
LFS3_TAG_BTREE 0x042c v--- -1-- +-1- 11rr
LFS3_TAG_MROOT 0x0431 v--- -1-- +-11 --rr
LFS3_TAG_MDIR 0x0435 v--- -1-- +-11 -1rr
LFS3_TAG_MSHRUB+ 0x0438 v--- -1-- +-11 1-rr
LFS3_TAG_MTREE 0x043c v--- -1-- +-11 11rr
LFS3_TAG_BMRANGE 0x044u v--- -1-- +1-- ++uu
LFS3_TAG_BMFREE 0x0440 v--- -1-- +1-- ----
LFS3_TAG_BMINUSE 0x0441 v--- -1-- +1-- ---1
LFS3_TAG_BMERASED 0x0442 v--- -1-- +1-- --1-
LFS3_TAG_BMBAD 0x0443 v--- -1-- +1-- --11
LFS3_TAG_DDRC* 0x0450 v--- -1-- +1-1 ----
LFS3_TAG_DDPCOEFF* 0x0451 v--- -1-- +1-1 ---1
LFS3_TAG_PCOEFFMAP* 0x0460 v--- -1-- +11- ----
This redund hint may be useful for debugging and the theoretical CKMETAREDUND feature.
This changes -w/--wait to sleep _after_ -k/--keep-open, instead of including the time spent waiting on inotifywait in the sleep time.
1. It's easier, no need to keep track of when we started waiting.
2. It's simpler to reason about.
3. It trivially avoids the multiple wakeup noise that plagued watch.py + vim (vim likes to do a bunch of renaming and stuff when saving files, including the file 4913 randomly?)
Avoiding this was previously impossible because -~/--sleep was effectively a noop when combined with -k/--keep-open.
---
Also renamed from -~/--sleep -> -w/--wait, which is a bit more intuitive and avoids possible shell issues with -~.
To make this work, dropped the -w/--block-cycles shortform flag in dbgtrace.py. It's not like this flag is ever used anyways. Though at the moment this is ignoring the possible conflict with -w/--word-bits...
I'm trying to avoid the inevitable conflict with -w/--word, which will probably become important when exploring non-32-bit filesystem configurations. Renaming this to -t/--wait still conflicts with -t/--tree and -t/--tiny, but as a debug-only flag, I think these are less important. Oh, and -t/--trace, but test.py/bench.py are already quite different in their flag naming (see -d/--disk vs -d/--diff).
---
Renamed a few other flags while tweaking things:
- -t/--tiny -> --tiny (dropped shortform)
- -w/--word-bits -> -w/--word/--word-bits
- -t/--tree -> -R/--tree/--rbyd/--tree-rbyd
- -R/--tree-rbyd -> -Y/--rbyd-all/--tree-rbyd-all
- -B/--tree-btree -> -B/--btree/--tree-btree
After tinkering with it a bit, I think the -R/-Y/-B set of flags is a decent way to organize the tree renderers. At least --tree-rbyd-all does a better job of describing the difference between --tree-rbyd and --tree-rbyd-all.
Test/bench filters have proven to be mostly non-optional, protecting against bad configuration that doesn't make any sense. It's still valid to want to override test filters sometimes, but using a more, uh, forceful verb probably makes sense here. The shortform would conflict with -f/--fail, so no shortform flag for this, but some argue --force should never have a shortform flag anyways.
Just a bit less typing than --o, and lowers risk of conflicts with actual flags we may care about. To be honest I was procrastinating because I thought this would be a lot more work! I was prepared to write a hacky secondary parser, but argparse already supports this natively with prefix_chars='-+'. Yay!
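For reference, argparse's prefix_chars support looks roughly like this (the -v/+x flags below are made up for illustration, not the actual script flags):

```python
import argparse

# argparse natively supports additional prefix characters via
# prefix_chars, no secondary parser needed (flags here are hypothetical)
parser = argparse.ArgumentParser(prefix_chars='-+')
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('+x', '++extra', action='store_true')

args = parser.parse_args(['-v', '+x'])
print(args.verbose, args.extra)  # -> True True
```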
This is still a hack to make the seek _enum_ appear somewhat readable in our dbg _flags_ script. But the previously internal SEEK_MODE was confusingly causing all seek flags to be hidden from -l/--list.
The idea is this is similar to ~ for the home directory, hopefully simplifying littlefs path parsing in scripts:
- /hi -> hi in root dir
- ./hi -> hi in current dir
- ../hi -> hi in parent dir
- ~/hi -> hi in home dir
- %/hi -> hi in littlefs root dir
Note this only works when standalone:
- ~hi != ~/hi
- %hi != %/hi
And files named % can still be referenced with a ./ prefix:
- ./% -> % in current dir
- %/% -> % in littlefs root dir
---
This is probably overkill for dbglfs3.py, as the arg ordering was already enough to disambiguate disk path vs mroot address vs littlefs path, but eventually I think the idea will be useful for more powerful scripts. A hypothetical:
$ mklfs3 cp disk -b4096 -r image_files %/image_files
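A minimal Python sketch of the standalone-% rule (split_lfs_path is a hypothetical helper, not part of the actual dbg*.py scripts):

```python
# Hypothetical helper sketching the standalone-% rule described above;
# '%' only marks the littlefs root when it stands alone as the first
# path component
def split_lfs_path(path):
    # returns (is_littlefs_path, remainder)
    if path == '%':
        return (True, '/')
    if path.startswith('%/'):
        return (True, path[1:])
    # '%hi' and './%' are NOT littlefs-rooted
    return (False, path)

print(split_lfs_path('%/hi'))  # -> (True, '/hi')
print(split_lfs_path('%hi'))   # -> (False, '%hi')
print(split_lfs_path('./%'))   # -> (False, './%')
```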
Now that LFS3_TAG_INTERNAL uses the 0x00tt prefix, this bit mask no longer works. Fortunately LFS3_TAG_INTERNAL is really only used in LFS3_ASSERTs, so a less efficient test has no effect on code cost. No code changes.
It's funny to see what originally started as a simple list of rbyd attrs
slowly morph into a full isa. But it makes sense. What we really want is
an abstract description of operations that can be played and replayed as
necessary to atomically update the mtree.
Using a fixed lfs3_rattr_t struct to represent this in C is easy, and
avoids strict-aliasing issues, but ultimately limited when it comes to
the wide-range of data we want to attach to attributes.
Unlike a computer's isa, we want to be able to include full 12-24 byte
branch pointers directly in the instruction!
---
So here's a full variable-length isa organized by words (max(uintptr_t,
uint32_t)).
The first 32-bit word extends the 16-bit tag with an extra 16-bits of
control information:
wwll llff ffcc cccc tttt tttt tttt tttt
^'-.-''-.-''--.--' : :
'--|----|-----|----:-----------------:-- compressed weight
:: '----|-----|----:-----------------:-- total len
:: '-----|----:-----------------:-- from encoder
:: '----:-----------------:-- optional count
:: rgmm kkkk -kkk kkkk
11 => w=-1 ^^ ^ '-.' '---.---'
00 => w=0 '|-|---|------|------ rm bit
01 => w=+1 '-|---|------|------ grow bit
10 => w=attached '---|------|------ mask bits
'------|------ tag suptype
'------ tag subtype
The 4-bit length field always encodes the full length of the
instruction, including the instruction itself and optional weight. The
4-bit from + 6-bit count fields operate independently and tell
lfs3_rbyd_appendrattr_ how to actually encode the data related to the
instruction.
To work around strict-aliasing issues, complex structs are expected to
be broken down into words and reconstructed in lfs3_rbyd_appendrattr_.
Most of our structs are organized into words anyways. For example:
// new child
*r++ = LFS3_RATTR(5, LFS3_TAG_BRANCH, -2, LFS3_FROM_BRANCH);
*r++ = LFS3_RATTR_WEIGHT(+child_->weight);
*r++ = LFS3_RATTR_ARG(child_->blocks[0]);
*r++ = LFS3_RATTR_ARG(child_->trunk);
*r++ = LFS3_RATTR_ARG(child_->cksum);
This also changes rattr-lists to be null-terminated, which makes a bit
more sense in a variable-length isa:
*r++ = LFS3_RATTR_NULL; // all zeros, including length
One concern with null-terminated rattr-lists is how easy it is to
forget the null-terminator, but an assert that all non-null rattrs have
non-zero length seemed to catch the many many mistakes during adoption.
Alternatively, separate LFS3_FROM_NULL/LFS3_FROM_NIL from fields could
be used if encoding space gets tight.
I'm also quite happy with the 2-bit weight field, which allows omitting
the optional weight word for -1,0,+1 weights. These should cover at
least all mdir operations.
Note the exact encoding of the rattr fields is less of a concern than
the tag fields, as it doesn't reside on-disk and can be changed on a
whim.
---
Saves a nice chunk of code and stack:
code stack ctx
before: 35920 2280 660
after: 35324 (-1.7%) 2176 (-4.6%) 660 (+0.0%)
code stack ctx
gbmap before: 38812 2296 772
gbmap after: 38156 (-1.7%) 2192 (-4.5%) 772 (+0.0%)
The stack savings are obvious, but the code savings are a bit less so. A
variable length isa _is_ more complicated, but by limiting most encoding
decisions to compile-time (2-bit weights vs 32-bit weights for example),
the savings from fewer word manipulations on the stack wins.
LFS3_tag_RATTRS, now LFS3_tag_TAIL, is a bit funny in that it only
supports tail-recursive rattrs. The whole point of littlefs is
bounded-RAM after all. And if we know LFS3_tag_TAIL will terminate an
rattr-list, why bother with an additional LFS3_tag_NULL?
Like LFS3_RATTR_NULL, LFS3_RATTR_TAIL sets length=0 to indicate the end
of the rattr-list.
Saves a bit of code:
code stack ctx
before: 35324 2176 660
after: 35316 (-0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38156 2192 772
gbmap after: 38148 (-0.0%) 2192 (+0.0%) 772 (+0.0%)
This originally started as an attempt to drop LFS3_tag_TAIL entirely,
but that didn't really go anywhere. Any attempt to work around the
double rattr-lists during mtree splits results in more mess than this
magic rattr.
But I did notice we only need to support "simple" rattrs during mtree
splits, which means we can just call lfs3_rbyd_appendrattrs to handle
these.
Maybe this will be problematic if we ever want to deduplicate
lfs3_mdir_commit___ and lfs3_rbyd_appendrattrs, but I don't see that
happening because of the different concerns (mdir-specific rattrs):
function code stack ctx
lfs3_mdir_commit___ 1052 744 396
lfs3_rbyd_appendrattrs 142 584 388
The benefit of a single-recurse LFS3_tag_RATTRS:
- Simplifies lfs3_mdir_commit__, no more awkward loop recursion.
- May have other use cases for nesting simple rattrs?
Adds a bit of code:
code stack ctx
before: 35316 2176 660
after: 35320 (+0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38148 2192 772
gbmap after: 38172 (+0.1%) 2192 (+0.0%) 772 (+0.0%)
Here's one interesting use-case for the single-recurse LFS3_tag_RATTRS:
Avoiding a copy of the name creation rattrs in lfs3_file_sync_.
There is a concern with nesting LFS3_tag_RATTRS in that it risks
conflicts across layers, but currently this is ok as long as
high-level LFS3_tag_RATTRS stick to non-negative mids (the mtree split
commit in lfs3_mdir_commit_ only needs to recurse for mroot rattrs).
Saves a bit of code:
code stack ctx
before: 35320 2176 660
after: 35316 (-0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38172 2192 772
gbmap after: 38168 (-0.0%) 2192 (+0.0%) 772 (+0.0%)
I was poking around at possibly inlining small (<=255) name lens in
lfs3_rattr_t, but realized all LFS3_FROM_NAME rattrs in our system
already use the lfs3_path_namelen pattern (terminates in either
'\0' or '/').
Well, except for our tests, but who cares about those.
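The lfs3_path_namelen pattern might look something like this in Python (a sketch of the C behavior, where a name ends at either '\0' or '/'):

```python
# Sketch of the lfs3_path_namelen pattern: in C a name in a path ends at
# either '\0' (end of string) or '/'; in Python the string just ends
def path_namelen(path):
    n = 0
    while n < len(path) and path[n] != '/':
        n += 1
    return n

print(path_namelen('hello/world'))  # -> 5
print(path_namelen('hello'))        # -> 5
```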
Adopting lfs3_path_namelen in LFS3_FROM_NAME saves a bit of code:
code stack ctx
before: 35316 2176 660
after: 35288 (-0.1%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38168 2192 772
gbmap after: 38140 (-0.1%) 2192 (+0.0%) 772 (+0.0%)
I don't think there was anything inherently wrong with this idea, but:
- The code savings (28 bytes) was surprisingly small.
- Expecting lfs3_path_namelen may be a headache for future
non-null-terminated string support.
- Even if you don't care about non-null-terminated strings, precomputing
strlen as early as possible is a good idea to minimize repeated strlen
scans.
Reverting adds a bit of code:
code stack ctx
before: 35288 2176 660
after: 35316 (+0.1%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38140 2192 772
gbmap after: 38168 (+0.1%) 2192 (+0.0%) 772 (+0.0%)
The idea here was to merge mtree split commits to try to minimize
redundant logic that only differs in whether or not we need to create
the initial mtree weight.
Surprisingly, this backfired, adding more code than it saved. I guess I
underestimated how effective the compiler is at deduplicating these two
paths of logic:
code stack ctx
before: 35316 2176 660
after: 35324 (+0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38168 2192 772
gbmap after: 38172 (+0.0%) 2192 (+0.0%) 772 (+0.0%)
See previous commit for why.
The merged commits surprisingly cost more than separate commit
functions. I guess because the compiler is smart enough to deduplicate
the two logic paths here:
code stack ctx
before: 35324 2176 660
after: 35316 (-0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38172 2192 772
gbmap after: 38168 (-0.0%) 2192 (+0.0%) 772 (+0.0%)
The good news is this is a win for readability; I think the separate
conditions are easier to understand than a merged commit muddied with a
bunch of lfs3->mtree.r.weight == 0 checks.
This was originally dropped because it's not strictly necessary;
little-leb128s (28-bits) can always be encoded with the default leb128
encoder (31-bits). But it is useful if only for the assert.
Note this matches lfs3_data_readlleb128, which was never dropped, and is
useful for decreasing decoder DSIZEs.
Maybe it makes sense to drop both of these in the future, especially if
we start running into from-field pressure. But for now, this assert is
useful for ensuring disk compatibility with little 4-byte leb128s.
---
Surprisingly no code cost, at least by default (code alignment?). Though
it did add 4 bytes to the gbmap build (so yes, probably code alignment):
code stack ctx
before: 35316 2176 660
after: 35316 (+0.0%) 2176 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38168 2192 772
gbmap after: 38172 (+0.0%) 2192 (+0.0%) 772 (+0.0%)
As much as I don't want to admit it, our 3-word lfs3_data_t struct is
just too large to be treated as pass-by-value with today's compilers.
It's a real shame, because I don't think there's a great technical
reason, just that compilers' pass-by-value optimizations generally stop
after 2 words.
If we could expect 16-bit block sizes (off and size), we could fit in
2 words, but this is already challenged by today's NAND chips
(bs>=128KiB).
---
So, as a compromise, this stops treating lfs3_data_t as pass-by-value,
with the exception of the lfs3_data_from* functions that still return
lfs3_data_t directly.
So instead of:

```c
lfs3_data_t data = lfs3_data_fromecksum(&ecksum, buffer);
data = lfs3_data_slice(data, 8, -1);
return lfs3_data_size(data);
```

Most operations now take lfs3_data_t by pointer:

```c
lfs3_data_t data = lfs3_data_fromecksum(&ecksum, buffer);
lfs3_data_slice(&data, 8, -1);
return lfs3_data_size(&data);
```
One of the main consequences is there are now several ways to slice data
(internally these all redirect to lfs3_data_slice), and LFS3_DATA_SLICE
will likely see more use since we need temporary allocations to pass the
data slice by address:
- lfs3_data_slice(data, a, b) - Slices the data in place
- lfs3_data_fromslice(data, a, b) - Returns a new data slice
- LFS3_DATA_SLICE(data, a, b) - Creates a new compound-literal slice
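As a toy illustration of the new pointer-based convention (the struct fields and names here are assumptions, not the real lfs3_data_t internals), slicing in place through a pointer sidesteps returning the 3-word struct by value:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// hypothetical 3-word data reference, loosely modeled on lfs3_data_t
typedef struct data {
    const uint8_t *buffer;
    size_t off;
    size_t size;
} data_t;

// in-place slice, in the style of lfs3_data_slice(&data, a, b);
// negative bounds mean "unbounded", matching the -1s above
static void data_slice(data_t *data, ptrdiff_t a, ptrdiff_t b) {
    size_t a_ = (a < 0) ? 0 : (size_t)a;
    size_t b_ = (b < 0) ? data->size : (size_t)b;
    // clamp to the current slice
    if (a_ > data->size) a_ = data->size;
    if (b_ < a_) b_ = a_;
    data->off += a_;
    data->size = b_ - a_;
}
```

Since the callee mutates through the pointer, the caller passes one word instead of three, which is where the code/stack savings come from.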
---
As a pragmatic compromise, this saves a nice chunk of both code and
stack:
code stack ctx
before: 35316 2176 660
after: 35188 (-0.4%) 2136 (-1.8%) 660 (+0.0%)
code stack ctx
gbmap before: 38172 2192 772
gbmap after: 38048 (-0.3%) 2152 (-1.8%) 772 (+0.0%)
May rerevert this in the future, but I'm on the fence.
It's true this only saves a small amount of code, but in theory it also
reduces stack consumption in name-related functions. Currently this
doesn't affect the stack hot-path, which is a bit surprising as this
includes lfs3_set, but it may in the future.
The arguments against this optimization are also a bit weak:
- Non-null-terminated strings - We probably shouldn't optimize for a
theoretical future feature. If anything, we want to optimize in the
opposite direction to best measure the theoretical code cost.
- Precomputing strlen early - While this is generally a good idea, our
rattrs benefit greatly from compact encodings, as rattrs sitting on
the stack are one of the bigger contributors to our stack hot-path.
So for now I'm unreverting to see how long this optimization makes
sense, but could see this being rereverted in the future.
At the very least we probably want to keep the test changes to make
future testing easier.
---
Saves a bit of code:
code stack ctx
before: 35188 2136 660
after: 35160 (-0.1%) 2136 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38048 2152 772
gbmap after: 38020 (-0.1%) 2152 (+0.0%) 772 (+0.0%)
- Adopted -2,-2 for unbounded lfs3_mdir_commit___ ranges that include
gstate.
Why not? We treat any negative upper bound as unbounded and it's a bit
easier to read.
- Prefer <=-2 when checking for rid bounds that include gstate.
- Prefer <=-2 when checking for attached rattr weights.
- Cleaned up a couple outdated comments.
No code changes:
code stack ctx
before: 35160 2136 660
after: 35160 (+0.0%) 2136 (+0.0%) 660 (+0.0%)
code stack ctx
gbmap before: 38020 2152 772
gbmap after: 38020 (+0.0%) 2152 (+0.0%) 772 (+0.0%)
Note: v3-alpha discussion (#1114)
Unfortunately GitHub made a complete mess of the PR discussion. To try to salvage things, please use #1114 for new comments. Feedback/criticism are welcome and immensely important at this stage.
Table of contents ^
Hello! ^
Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.
There's still several chunks of planned work left, but now that this branch has reached on-disk feature parity with v2, there's nothing really stopping it from being merged eventually.
So I figured it's a good time to start calling this v3, and put together a public roadmap.
NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE
Here's a quick TODO list of planned work before stabilization. More details below:
This work may continue to break the on-disk format.
That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.
To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.
Unfortunately, the API will be under heavy flux during this period.
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
Wait, a disk breaking change? ^
Yes. v3 breaks disk compatibility from v2.
I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:
Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.
Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.
Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.
So I think it's best for both littlefs as a project and long-term users to break things here.
Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.
What's new ^
Ok, with that out of the way, what does breaking everything actually get us?
Implemented: ^
Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^
v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.
This speeds up most metadata operations, including metadata lookup ($O(n) \rightarrow O(\log n)$ ), and, critically, metadata compaction ( $O(n^2) \rightarrow O(n \log n)$ ).
This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.
Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^
A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.
This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.
For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ($O(n) \rightarrow O(\log n)$ ).
Better logging: No more sync-padding issues ^
v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.
In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.
Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^
In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.
Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.
Independent file caches ^
littlefs's `pcache`, `rcache`, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.
Easier logging APIs: `lfs3_file_fruncate` ^
Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new `lfs3_file_fruncate` API.
Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:
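Roughly like the following toy in-RAM model of the pattern (this is not the real lfs3 API, just a sketch of the semantics): keep appending, and lop bytes off the front whenever the log exceeds its limit, which is what `lfs3_file_fruncate` makes cheap on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_LIMIT 8

// toy in-RAM "file"
typedef struct log {
    char data[64];
    size_t size;
} log_t;

static void log_append(log_t *log, const char *entry, size_t entry_size) {
    // append to the end of the log
    memcpy(&log->data[log->size], entry, entry_size);
    log->size += entry_size;
    // "fruncate": drop bytes from the front so only the most recent
    // LOG_LIMIT bytes survive
    if (log->size > LOG_LIMIT) {
        size_t excess = log->size - LOG_LIMIT;
        memmove(log->data, &log->data[excess], log->size - excess);
        log->size -= excess;
    }
}
```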
Sparse files ^
Another advantage of adopting B-trees: littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.
Currently this is limited to a couple operations:
- `lfs3_file_truncate`
- `lfs3_file_fruncate`
- `lfs3_file_seek` + `lfs3_file_write` past the end of the file

But more advanced hole operations may be added in the future.
Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^
littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.
A simpler/more robust metadata tree ^
As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.
I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.
In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.
A well-defined sync model ^
One interesting thing about littlefs: it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.
To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.
I think this discussion captures most of the idea, but for a high-level overview:
Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.
Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.
Files can be desynced, either explicitly via `lfs3_file_desync`, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.
Calling `lfs3_file_sync` on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.
Calling `lfs3_file_resync` on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
Stickynotes, no more 0-sized files ^
As an extension of littlefs's new sync model, v3 introduces a new file type: `LFS3_TYPE_STICKYNOTE`.
A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.
This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.
A new and improved compat flag system ^
v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.
These are probably familiar to users of Linux filesystems, though I've given them slightly different names:
- `rcompat flags` - Must understand to read the filesystem (incompat_flags)
- `wcompat flags` - Must understand to write to the filesystem (ro_compat_flags)
- `ocompat flags` - No understanding necessary (compat_flags)

This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
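To make the semantics concrete, here's a minimal sketch of the mount-time decision the three classes imply. The flag values and function are hypothetical, not the actual lfs3 on-disk encoding:

```c
#include <assert.h>
#include <stdint.h>

// hypothetical flag masks this driver understands
#define KNOWN_RCOMPAT 0x0003
#define KNOWN_WCOMPAT 0x0001

enum mount_result { MOUNT_RW, MOUNT_RDONLY, MOUNT_FAIL };

// - unknown rcompat flag -> refuse to mount
// - unknown wcompat flag -> mount, but read-only
// - ocompat flags        -> ignored entirely, no check needed
static enum mount_result check_compat(uint32_t rcompat, uint32_t wcompat) {
    if (rcompat & ~(uint32_t)KNOWN_RCOMPAT) {
        return MOUNT_FAIL;
    }
    if (wcompat & ~(uint32_t)KNOWN_WCOMPAT) {
        return MOUNT_RDONLY;
    }
    return MOUNT_RW;
}
```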
Error detection! - Global-checksums ^
v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.
With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via `lfs3_fs_cksum`. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.
Funny thing about checksums. It's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?
This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many if I'm honest):
- `LFS3_M_CKMETA/CKDATA` - Check checksums during mount
- `LFS3_O_CKMETA/CKDATA` - Check checksums during file open
- `lfs3_fs_ckmeta/ckdata` - Explicitly check all checksums in the filesystem
- `lfs3_file_ckmeta/ckdata` - Explicitly check a file's checksums
- `LFS3_T_CKMETA/CKDATA` - Check checksums incrementally during a traversal
- `LFS3_GC_CKMETA/CKDATA` - Check checksums during GC operations
- `LFS3_M_CKPROGS` - Closed checking of data during progs
- `LFS3_M_CKFETCHES` - Optimistic (not closed) checking of data during fetches
- `LFS3_M_CKREADS` (planned) - Closed checking of data during reads

Better traversal APIs ^
The traversal API has been completely reworked to be easier to use (both externally and internally).
No more callback needed, blocks can now be iterated over via the dir-like `lfs3_trv_read` function.
Traversals can also perform janitorial work and check checksums now, based on the flags provided to `lfs3_trv_open`.
Incremental GC ^
GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by `lfs3_fs_gc`, `cfg.gc_flags`, and `cfg.gc_steps`.
Internally, this just shoves one of the new traversal objects into `lfs3_t`. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.
However, this does add a significant chunk of RAM to `lfs3_t`, so GC is now an opt-in feature behind the `LFS3_GC` ifdef.
Better recovery from runtime errors ^
Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.
Most in-RAM filesystem state should now revert to the last known-good state on error.
The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with `lfs3_file_resync`, I figured this cost just isn't worth it. So file data remains undefined after an error.
In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.
Standard custom attributes ^
Breaking disk gives us a chance to reserve attributes `0x80-0xbf` for future standard custom attributes:
- `0x00-0x7f` - Free for user-attributes (uattr)
- `0x80-0xbf` - Reserved for standard-attributes (sattr)
- `0xc0-0xff` - Encouraged for system-attributes (yattr)

In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.
v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
More tests! ^
v3 comes with a couple more tests than v2 (+~6812.2%):
You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.
The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.
Simple key-value APIs ^
v3 includes a couple easy-to-use key-value APIs:
- `lfs3_get` - Get the contents of a file
- `lfs3_size` - Get the size of a file
- `lfs3_set` - Set the contents of a file
- `lfs3_remove` - Remove a file (this one already exists)

This API is limited to files that fit in RAM, but if it fits your use case, you can disable the full file API with `LFS3_KVONLY` to save some code.
If your filesystem fits in only 2 blocks, you can also define `LFS3_2BONLY` to save more code.
These can be useful for creating small key-value stores on systems that already use littlefs for other storage.
Planned: ^
Efficient block allocation, via optional on-disk block-map (bmap) ^
The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.
This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$
There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.
As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.
Bad block tracking ^
This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.
Pre-erased block tracking ^
Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.
In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.
Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mroot anchor worst case.
Error correction! - Metadata redundancy ^
Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.
But I think there's potential for cross-block error-correction in addition to the in-block error-correction.
The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.
But we already need two blocks for every mdir, one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.
Error correction! - Data redundancy ^
For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.
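As a sketch of the single-redundancy case (plain XOR parity, RAID-5 style; the actual v3 scheme is still in design, and redundancy beyond one block would need a stronger code such as Reed-Solomon):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8
#define GROUP_SIZE 3

// parity = d0 ^ d1 ^ d2, computed over a parity group of data blocks
static void parity_build(const uint8_t d[GROUP_SIZE][BLOCK_SIZE],
        uint8_t p[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        p[i] = d[0][i] ^ d[1][i] ^ d[2][i];
    }
}

// rebuild a single lost block by xoring the parity with the survivors
static void parity_rebuild(const uint8_t d[GROUP_SIZE][BLOCK_SIZE],
        const uint8_t p[BLOCK_SIZE], size_t lost, uint8_t out[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        out[i] = p[i];
        for (size_t j = 0; j < GROUP_SIZE; j++) {
            if (j != lost) {
                out[i] ^= d[j][i];
            }
        }
    }
}
```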
Transparent block deduplication ^
This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.
If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?
Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.
Stretch goals: ^
These may or may not be included in v3, depending on time and funding:
`lfs3_migrate` for v2->v3 migration ^
16-bit and 64-bit variants ^
Config API rework ^
Block device API rework ^
Custom attr API rework ^
Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^
Advanced file tree operations (`lfs3_file_punchhole`, `lfs3_file_insertrange`, `lfs3_file_collapserange`, `LFS3_SEEK_DATA`, `LFS3_SEEK_HOLE`) ^
Advanced file copy-on-write operations (shallow `lfs3_cowcopy` + opportunistic `lfs3_copy`) ^
Reserved blocks to prevent CoW lockups ^
Metadata checks to prevent metadata lockups ^
Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^
Disk-level RAID (this is just data redund + a disk aware block allocator) ^
Out-of-scope (for now): ^
If we don't stop somewhere, v3 will never be released. But these may be added in the future:
Alternative checksums (crc16, crc64, sha256, etc) ^
Feature-limited configurations for smaller code/stack sizes (`LFS3_NO_DIRS`, `LFS3_KV`, `LFS3_2BLOCK`, etc) ^
`lfs3_file_openat` for dir-relative APIs ^
`lfs3_file_openn` for non-null-terminated-string APIs ^
Transparent compression ^
Filesystem shrinking ^
High-level caches (block cache, mdir cache, btree leaf cache, etc) ^
Symbolic links ^
100% line/branch coverage ^
Code/stack size ^
littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
littlefs v2 and v3 rdonly, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
Unfortunately, v3 is a little less little than v2:
On one hand, yes, more features generally means more code.
And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.
But I think it's worth discussing some of the other reasons for the code/stack increase:
Runtime error recovery ^
Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.
But I think this is necessary for the maturing of littlefs as a filesystem.
Maybe it will make sense to add a sort of `LFS3_GLASS` mode in the future, but this is out-of-scope for now.
B-tree flexibility ^
The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.
B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.
The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.
The good news: The new B-tree files are extremely flexible.
There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.
Traversal inversion ^
Inverting the traversal, i.e. moving from a callback to incremental state machine, adds both code and stack as 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.
In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.
Benchmarks ^
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.
Simulated benchmarks ^
To test the math behind v3, I've put together some preliminary simulated benchmarks.
Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.
To find an estimate of runtime, I first measured the amount of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but WinBond fortunately provides runtime estimates in the datasheets on their website:
NOR flash - w25q64jv
NAND flash - w25n01gv
SD/eMMC - Also w25n01gv, assuming a perfect FTL
I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.
These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)
And here are the ones I think are the most interesting:
Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.
Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^
This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:
Random writes, note we start with a 1MiB file. ^
As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:
Logging, write 4 MiB, but limit the file to 1 MiB. ^
In v2 this is accomplished by renaming the file, in v3 we can leverage `lfs3_file_fruncate`.
v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:
Funding ^
If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:
I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.
If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.
As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.
Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!
Next steps ^
For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.
EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html
EDIT: Updated with rdonly code/stack sizes
EDIT: Added link to #1114
EDIT: Implemented simple key-value APIs
EDIT: Added lfs3_migrate stretch goal with link to #1120
EDIT: Adopted lfs3_traversal_t -> lfs3_trv_t rename
EDIT: Added link to #1125 to clarify "feature parity"