#+TITLE: Butter File System
#+AUTHOR: Janis Böhm
#+email: janis@nirgendwo.xyz
#+STARTUP: indent inlineimages
#+INFOJS_OPT:
#+BABEL: :session *R* :cache yes :results output graphics :exports both :tangle yes
--------------------------
* Butter?

According to the [[https://en.wikipedia.org/wiki/Btrfs][wikipedia page]] for =btrfs=, the correct way to pronounce it is =Butter FS= or =Better FS=. While =Better FS= might make a lot of sense to someone who (correctly) believes =btrfs= to be a better file system than the alternatives, I find butter so hilariously far removed from linux, file systems and computers in general that it makes for the perfect pronunciation.

* How to write a btrfs driver
[[./idk.webp]]

* How to write a (bad (read only)) btrfs driver

** Getting Started
At the very least, to get started writing any kind of software that interacts with a btrfs filesystem, we ought to have an instance of a btrfs filesystem which we don't mind damaging.

The simplest way to get a new btrfs filesystem on a modern linux install, after installing the btrfs utility programs, is to allocate a new file and then use ~mkfs.btrfs~ to write the necessary structures to it.

#+begin_src shell
# unfortunately, btrfs requires a decent amount
# of space even for an empty file system so to be
# able to have a few interesting files allocating
# 150MB is probably a good idea.
fallocate -l 150M btrfs.img

# write a new btrfs filesystem with a
# label to the allocated file.
mkfs.btrfs -L "my-btrfs-filesystem" btrfs.img
#+end_src

We can later mount this image with the help of a loopback device (or, for simplicity, with a helper tool like udiskie) and write a handful of files and directories to explore with our driver.

** Defining some structures
Btrfs contains a lot of structures, many of which are [[https://btrfs.readthedocs.io/en/latest/dev/dev-btrfs-design.html#][documented]] [[https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Data_Structures.html][online]] alongside the [[https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/On-disk_Format.html][on-disk layout]] in which these structures can be read. Most of these structures do not interest us at this point, and many of them we will never need to touch at all to implement basic functionality such as reading files and walking directories.

All structures in btrfs are little-endian. Since your system is probably also little-endian, and this "driver" shouldn't be used as anything more than a learning exercise, this is little more than a curiosity for us; but thanks to the awesome rust ecosystem of crates, defining structures as explicitly little-endian is easy with the integer types defined in [[https://docs.rs/zerocopy/latest/zerocopy/][=zerocopy=]]'s [[https://docs.rs/zerocopy/latest/zerocopy/byteorder/index.html][=byteorder=]] module, so I will be using them in any code blocks.
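
As a standalone illustration (this snippet is not part of the driver), the conversion those ~byteorder~ types perform on every access is the same one ~u64::from_le_bytes~ performs for a single field:

#+begin_src rust
fn main() {
    // eight bytes as they would appear on disk, least significant byte first
    let on_disk: [u8; 8] = [0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00];

    // interpret them as a little-endian u64, which is what zerocopy's
    // U64<LE> does internally when `.get()` is called
    let value = u64::from_le_bytes(on_disk);
    assert_eq!(value, 0x10000);
}
#+end_src
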

The entry point into a btrfs filesystem is the [[https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/On-disk_Format.html#Superblock][~Superblock~]]:

#+begin_src rust
#[repr(C, packed)]
#[derive(Derivative, Clone, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Superblock {
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Physical address of this block
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub magic: [u8; 0x8],
    pub generation: U64<LE>,
    /// Logical address of the root tree root
    pub root: U64<LE>,
    /// Logical address of the chunk tree root
    pub chunk_root: U64<LE>,
    /// Logical address of the log tree root
    pub log_root: U64<LE>,
    pub log_root_transid: U64<LE>,
    pub total_bytes: U64<LE>,
    pub bytes_used: U64<LE>,
    pub root_dir_objectid: U64<LE>,
    pub num_devices: U64<LE>,
    pub sector_size: U32<LE>,
    pub node_size: U32<LE>,
    /// Unused and must be equal to `nodesize`
    pub leafsize: U32<LE>,
    pub stripesize: U32<LE>,
    pub sys_chunk_array_size: U32<LE>,
    pub chunk_root_generation: U64<LE>,
    pub compat_flags: U64<LE>,
    pub compat_ro_flags: U64<LE>,
    pub incompat_flags: U64<LE>,
    pub csum_type: U16<LE>,
    pub root_level: u8,
    pub chunk_root_level: u8,
    pub log_root_level: u8,
    pub dev_item: DevItem,
    #[derivative(Debug(format_with = "format_u8str"))]
    pub label: [u8; 0x100],
    pub cache_generation: U64<LE>,
    pub uuid_tree_generation: U64<LE>,
    pub metadata_uuid: Uuid,
    /// Future expansion
    #[derivative(Debug = "ignore")]
    _reserved: [u64; 28],
    #[derivative(Debug = "ignore")]
    pub sys_chunk_array: [u8; 0x800],
    #[derivative(Debug = "ignore")]
    pub root_backups: [RootBackup; 4],
    #[derivative(Debug = "ignore")]
    _reserved2: [u8; 565],
}
#+end_src

The superblock can be found at physical offset ~0x10000~, with copies at physical offsets ~0x4000000~, ~0x4000000000~ and ~0x4000000000000~.

The superblock contains important data, such as the logical addresses of various btree structures, the default root directory, the label of the file system and, most importantly, a list of ~ChunkItems~ required to bootstrap the logical-to-physical address mapping.
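
The mirror locations follow a simple pattern; a small sketch (not part of the driver), where the shift of 12 bits per copy matches the kernel's ~btrfs_sb_offset~ logic:

#+begin_src rust
/// Physical offset of superblock copy `mirror` (0 = primary).
fn sb_offset(mirror: u32) -> u64 {
    if mirror == 0 {
        0x10000
    } else {
        0x4000000u64 << (12 * (mirror - 1))
    }
}

fn main() {
    assert_eq!(sb_offset(0), 0x10000);
    assert_eq!(sb_offset(1), 0x4000000);
    assert_eq!(sb_offset(2), 0x4000000000);
    assert_eq!(sb_offset(3), 0x4000000000000);
}
#+end_src
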

** Logical Addresses

Btrfs uses logical addresses as indices into a dedicated Chunk Tree containing non-overlapping ranges of logical addresses mapped to physical offsets from the start of the filesystem device.

Since everything, including the Chunk Tree, is described by logical addresses, the superblock contains the ~sys_chunk_array~ field, holding ~(Key, Chunk)~ tuples, each followed by ~chunk.num_stripes >= 1~ ~Stripes~ (any stripes past the first don't interest us for now; they are related to multi-device and raid setups). The most important fields here are ~key.offset~, the start of the logical address range the chunk describes, and ~chunk.length~, the length of the range.

Each stripe describes a physical offset into a device at which the physical range that the logical range maps to begins.

#+begin_src rust
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Stripe {
    pub devid: U64<LE>,
    pub offset: U64<LE>,
    pub dev_uuid: Uuid,
}

#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Chunk {
    /// size of this chunk in bytes
    pub length: U64<LE>,
    /// objectid of the root referencing this chunk
    pub owner: U64<LE>,
    pub stripe_len: U64<LE>,
    pub ty: U64<LE>,
    /// optimal io alignment for this chunk
    pub io_align: U32<LE>,
    /// optimal io width for this chunk
    pub io_width: U32<LE>,
    /// minimal io size for this chunk
    pub sector_size: U32<LE>,
    /// 2^16 stripes is quite a lot, a second limit is the size of a single item in the btree
    pub num_stripes: U16<LE>,
    /// sub stripes only matter for raid10
    pub sub_stripes: U16<LE>,
    pub stripe: Stripe,
    // additional stripes go here
}

fn bootstrap_chunk_tree(superblock: &Superblock) -> Result<BTreeMap<ChunkTreeKey, u64>> {
    let array_size = superblock.sys_chunk_array_size.get() as usize;
    let mut offset: usize = 0;

    let key_size = size_of::<Key>();
    let mut chunk_tree = BTreeMap::new();

    let bytes = &superblock.sys_chunk_array;

    while offset < array_size {
        if offset + key_size > array_size {
            log::error!("short key read");
            return Err(Error::InvalidOffset);
        }

        let key = bytes.gread::<Key>(&mut offset)?;
        if key.ty() != ObjectType::ChunkItem {
            log::error!("key is not of type ChunkItem");
            return Err(Error::InvalidOffset);
        }

        let chunk = bytes.gread::<Chunk>(&mut offset)?;
        let num_stripes = chunk.num_stripes.get();

        if num_stripes == 0 {
            log::error!("num_stripes cannot be 0");
            return Err(Error::InvalidOffset);
        }

        if num_stripes != 1 {
            log::warn!(
                "warning: {} stripes detected but only processing 1",
                num_stripes
            );
        }

        let key_offset = key.offset.get();
        let chunk_length = chunk.length.get();

        match chunk_tree.entry(ChunkTreeKey {
            range: key_offset..(key_offset + chunk_length),
        }) {
            Entry::Vacant(entry) => {
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(_) => {
                log::error!("overlapping stripes!");
                return Err(Error::InvalidOffset);
            }
        };

        offset += (num_stripes - 1) as usize * size_of::<Stripe>();
        if offset > array_size {
            log::error!("short chunk item + stripes read");
            return Err(Error::InvalidOffset);
        }
    }

    Ok(chunk_tree)
}
#+end_src

After parsing the Chunks in ~sys_chunk_array~, we can access and parse the chunk tree itself, which then lets us parse the rest of the btrfs structures.

#+begin_src rust
fn parse_chunk_tree(mut self) -> Result<Self> {
    log::debug!("parsing chunk tree");
    let this = Rc::new(self);

    let chunk_tree =
        Tree::from_logical_offset(this.clone(), this.superblock().chunk_root.get())?;

    let chunks = chunk_tree
        .iter()
        .filter_map(|(item, v)| {
            log::debug!("{:?}", item);
            match v {
                TreeItem::Chunk(chunk) => Some((item, chunk)),
                _ => None,
            }
        })
        .collect::<Vec<_>>();

    drop(chunk_tree);

    self = match Rc::try_unwrap(this) {
        Ok(v) => v,
        Err(_) => unreachable!(),
    };

    for (item, chunk) in chunks {
        let start = item.key.offset.get() as u64;
        let end = start + chunk.length.get();

        match self.chunk_cache.entry(ChunkTreeKey { range: start..end }) {
            Entry::Vacant(entry) => {
                log::info!("inserting chunk [{start}, {end})");
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(entry) => {
                log::warn!("overlapping stripes!");
                log::warn!(
                    "\t{:?} and {:?}",
                    entry.key(),
                    ChunkTreeKey { range: start..end }
                );
                log::warn!(
                    "\twith offsets: {} and {}",
                    entry.get(),
                    chunk.stripe.offset.get()
                );

                if *entry.get() != chunk.stripe.offset.get() {
                    log::error!("\tprobably an error?");
                }
            }
        }
    }

    Ok(self)
}
#+end_src
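
With the chunk map filled, translating a logical address to a physical offset is a range lookup plus an offset into the matching chunk. A self-contained sketch of the idea; the real driver keeps this map in ~chunk_cache~ keyed by ~ChunkTreeKey~, and the plain ~BTreeMap~ keyed by range start here is a simplification:

#+begin_src rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Translate a logical address using a map of logical ranges to
/// physical start offsets, as bootstrapped from the chunk tree.
fn offset_from_logical(map: &BTreeMap<u64, (Range<u64>, u64)>, logical: u64) -> Option<u64> {
    // find the last range starting at or before `logical` ...
    let (_, (range, physical)) = map.range(..=logical).next_back()?;
    // ... and check that it actually covers the address
    if range.contains(&logical) {
        Some(physical + (logical - range.start))
    } else {
        None
    }
}

fn main() {
    let mut map = BTreeMap::new();
    // logical [0x400000, 0x800000) maps to physical 0x1000
    map.insert(0x400000u64, (0x400000u64..0x800000u64, 0x1000u64));

    assert_eq!(offset_from_logical(&map, 0x400000), Some(0x1000));
    assert_eq!(offset_from_logical(&map, 0x400123), Some(0x1123));
    assert_eq!(offset_from_logical(&map, 0x800000), None);
}
#+end_src
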
** Trees

*** Keys and PartialKeys
Everything in btrfs is stored as records in B-trees, similar in layout to a B+ tree, with ~Key~ structures acting as the keys.

#+begin_src rust
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, Eq, FromBytes, AsBytes)]
pub struct Key {
    pub id: U64<LE>,
    pub ty: u8,
    pub offset: U64<LE>,
}
#+end_src

A ~Key~ contains an id, a type and an offset. The meaning of the ~id~ and ~offset~ fields varies greatly depending on the tree the key indexes and on the ~ty~ field: ~offset~ might store a logical address for a ChunkItem, a directory index for a DirIndex item, or the hash of a filename for a DirItem entry.

The ~ty~ field of a ~Key~ has only a handful of meaningful values:

#+begin_src rust
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
pub enum ObjectType {
    INodeItem = 0x01,
    INodeRef = 0x0C,
    InodeExtref = 0x0D,
    XattrItem = 0x18,
    OrphanInode = 0x30,
    DirItem = 0x54,
    DirIndex = 0x60,
    ExtentData = 0x6C,
    ExtentCsum = 0x80,
    RootItem = 0x84,
    TypeRootBackref = 0x90,
    RootRef = 0x9C,
    ExtentItem = 0xA8,
    MetadataItem = 0xA9,
    TreeBlockRef = 0xB0,
    ExtentDataRef = 0xB2,
    ExtentRefV0 = 0xB4,
    SharedBlockRef = 0xB6,
    SharedDataRef = 0xB8,
    BlockGroupItem = 0xC0,
    FreeSpaceInfo = 0xC6,
    FreeSpaceExtent = 0xC7,
    FreeSpaceBitmap = 0xC8,
    DevExtent = 0xCC,
    DevItem = 0xD8,
    ChunkItem = 0xE4,
    TempItem = 0xF8,
    DevStats = 0xF9,
    SubvolUuid = 0xFB,
    SubvolRecUuid = 0xFC,
    // catch any other u8 value as an invalid object type to be able to convert
    // between any u8 and this type with num_enum
    #[num_enum(catch_all)]
    Invalid(u8),
}
#+end_src

The ~id~ field has a handful of special defined values, as well as a range of values that may mean different things depending on the value of the ~ty~ field:

#+begin_src rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
#[repr(u64)]
pub enum KnownObjectId {
    RootTree = 1,
    ExtentTree,
    ChunkTree,
    DevTree,
    FsTree,
    RootTreeDir,
    CsumTree,
    QuotaTree,
    UuidTree,
    FreeSpaceTree,
    // RootINode = 0x100, // first free id, always the root inode of a fs
    // __LastFreeId = u64::MAX - 0x100, // last free id
    DataRelocTree = u64::MAX - 9,
    TreeReloc = u64::MAX - 8,
    TreeLog = u64::MAX - 7,
    Orphan = u64::MAX - 5,
    // catch all which contains the valid range of custom ids from 256 to
    // u64::MAX-256 plus some invalid custom ids, unfortunately num_enum does
    // not allow to define a range like that for a tuple variant
    #[num_enum(catch_all)]
    Custom(u64),
}
#+end_src
~Keys~ are ordered by ~id~, then ~ty~ and finally ~offset~, making it very easy to find, for example, the range of all ExtentData items for some specific inode number by searching for the first and last key which matches ~(inode_number, ObjectType::ExtentData, _)~.

#+begin_src rust
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
        match self.id().partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        match self.ty().partial_cmp(&other.ty()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        self.offset.get().partial_cmp(&other.offset.get())
    }
}
#+end_src

To search for "incomplete" keys in a btree, we can define a "partial" key which can be compared with a full ~Key~. This gives us a useful subtree containing the records we are interested in, as long as the partial key is filled from left to right: ~(_, ObjectType::DirIndex, 5)~ would not give a useful range, because between the 5th DirIndex entry of inode 256 and the 5th DirIndex entry of inode 257 may lie many other records, even other DirIndex records with different indices --- in other words, there is no way to get a consecutive range over every 5th DirIndex of any inode.

#+begin_src rust
pub struct PartialKey {
    id: KnownObjectId,
    ty: Option<ObjectType>,
    offset: Option<u64>,
}

impl PartialOrd<Key> for PartialKey {
    fn partial_cmp(&self, other: &Key) -> Option<core::cmp::Ordering> {
        match self.id.partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) | None => {
                match self.ty.and_then(|ty| ty.partial_cmp(&other.ty())) {
                    Some(core::cmp::Ordering::Equal) | None => self
                        .offset
                        .and_then(|offset| offset.partial_cmp(&other.offset.get())),
                    ord => ord,
                }
            }
            ord => ord,
        }
    }
}
#+end_src

*** Tree Nodes
Trees in btrfs are made of internal nodes containing keys in ~KeyPtrs~, and leaf nodes at the bottom containing ~Items~ describing the record values. Both internal and leaf nodes start with a common ~Header~:

#+begin_src rust
#[repr(C, packed)]
#[derive(Derivative, Clone, PartialEq, Eq, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Header {
    #[derivative(Debug = "ignore")]
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Which block this node is supposed to live in
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub chunk_tree_uuid: Uuid,
    pub generation: U64<LE>,
    pub owner: U64<LE>,
    pub nritems: U32<LE>,
    pub level: u8,
}
#+end_src

The only fields interesting to us are ~level~, the depth of this node in the tree, and ~nritems~, the number of children (either further internal nodes, or at level 0, leaf items) this node has.

Internal nodes themselves do not contain any values; instead they contain ~KeyPtrs~ acting as bounds for the values found at the leaf level when descending that branch of the tree:

#+begin_src plantuml :file internal_node.png
package InternalNode0 {
    object KeyPtr1
    object KeyPtr2
    object KeyPtr3
    object KeyPtr4

    KeyPtr1 -|> KeyPtr2 : "is greater?"
    KeyPtr2 -|> KeyPtr3 : "is greater?"
    KeyPtr3 -|> KeyPtr4 : "is greater?"

    package Children {
        package InternalNode1 {}
        package InternalNode2 {}
        package InternalNode3 {}
        package InternalNode4 {}
    }

    KeyPtr1 --|> Children.InternalNode1 : is equal?
    KeyPtr2 --|> Children.InternalNode2 : is equal?
    KeyPtr3 --|> Children.InternalNode3 : is equal?
    KeyPtr4 --|> Children.InternalNode4 : is equal or greater?
    KeyPtr2 --|> Children.InternalNode1 : is less?
    KeyPtr3 --|> Children.InternalNode2 : is less?
    KeyPtr4 --|> Children.InternalNode3 : is less?
}
#+end_src

In leaf nodes, the header is immediately followed by a list of ~nritems~ ~Items~, which in turn point to records stored relative to the memory immediately following the header. The layout of these records depends on the ~ty~ field of the ~Item~'s ~key~:

#+begin_src rust
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
/// A `BtrfsLeaf` is full of `BtrfsItem`s. `offset` and `size` (relative to start of data area)
/// tell us where to find the item in the leaf.
pub struct Item {
    pub key: Key,
    pub offset: U32<LE>,
    pub size: U32<LE>,
}
#+end_src
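
Since ~offset~ is relative to the start of the data area, which begins immediately after the ~Header~, locating a record in a leaf's raw bytes is simple slice arithmetic. A standalone sketch (the 101-byte header size follows from summing the ~Header~ fields above):

#+begin_src rust
/// Size of the on-disk node header: csum(32) + fsid(16) + bytenr(8)
/// + flags(8) + chunk_tree_uuid(16) + generation(8) + owner(8)
/// + nritems(4) + level(1) = 101 bytes.
const HEADER_SIZE: usize = 101;

/// Byte range of a leaf record within the node, given the `offset` and
/// `size` fields of its `Item`.
fn record_range(item_offset: usize, item_size: usize) -> core::ops::Range<usize> {
    let start = HEADER_SIZE + item_offset;
    start..start + item_size
}

fn main() {
    // a record 160 bytes into the data area, 32 bytes long,
    // lives at bytes [261, 293) of the node
    assert_eq!(record_range(160, 32), 261..293);
}
#+end_src
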

*** Searching a Tree
To look up a key in a tree, we iterate over the children of a node and descend according to the diagram above:

#+begin_src rust
pub fn find_key<K: PartialEq<Key> + PartialOrd<Key>>(self, key: &K) -> SearchResult {
    match &self.node.inner {
        BTreeNode::Internal(node) => {
            let idx = node
                .visit_children_keys()
                .find_map(|(i, child)| match key.partial_cmp(&child) {
                    Some(core::cmp::Ordering::Less) => {
                        Some(if i == 0 { 0 } else { i as u32 - 1 })
                    }
                    Some(core::cmp::Ordering::Equal) | None => Some(i as u32),
                    _ => None,
                })
                .unwrap_or(node.children.len() as u32 - 1);

            SearchResult::Edge(NodeHandle { idx, ..self })
        }
        BTreeNode::Leaf(node) => {
            for (i, child) in node.items.iter().enumerate() {
                if key.eq(&child.key) {
                    return SearchResult::Leaf(NodeHandle {
                        idx: i as u32,
                        ..self
                    });
                }
            }

            log::debug!("key definitely not found!");
            SearchResult::Edge(NodeHandle {
                idx: node.items.len() as u32 - 1,
                ..self
            })
        }
    }
}
#+end_src

This function operates on a node and returns a ~NodeHandle~ --- containing a pointer to a ~Node~ and an index into that node pointing at a child or value --- pointing to either an edge, if the node is an internal node, or a leaf, if the key was found in a leaf node. An important step is hidden inside the ~visit_children_keys()~ function.

#+begin_src rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeHandle {
    parents: Vec<(BoxedNode, u32)>,
    node: BoxedNode,
    idx: u32,
}
#+end_src

The node handle also keeps track of the parent nodes and edges (via the index) to allow for walking back up the tree when iterating over ranges between two nodes.

#+begin_src rust
#[derive(Derivative, Eq, Clone)]
#[derivative(Debug, PartialEq)]
pub struct Node {
    inner: BTreeNode,
    #[derivative(Debug = "ignore")]
    #[derivative(PartialEq = "ignore")]
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum BTreeNode {
    Internal(BTreeInternalNode),
    Leaf(BTreeLeafNode),
}

/// An internal node in a btrfs tree, containing `KeyPtr`s to other internal nodes or leaf nodes.
#[derive(Derivative, Clone)]
#[derivative(Debug)]
pub struct BTreeInternalNode {
    pub header: Header,
    #[derivative(Debug = "ignore")]
    children: Vec<RefCell<NodePtr>>,
}

/// A leaf node in a btrfs tree, containing different items
#[derive(Debug, Clone)]
pub struct BTreeLeafNode {
    pub header: Header,
    /// actual leaf data
    pub items: Vec<Item>,
}
#+end_src

The node type stores either an internal or a leaf node, as well as a copy of the bytes of the node on disk, from which the leaf records can be read.

*** Iterating over a range

To iterate over a range between 2 nodes, we need to keep track of the start and end *nodes*, but only yield *leaves*.

#+begin_src rust
#[derive(Debug)]
pub struct Range<'tree, R: super::Read> {
    volume: Rc<Volume<R>>,
    pub(crate) start: RootOrEdge,
    pub(crate) end: RootOrEdge,
    phantom: PhantomData<&'tree ()>,
}
#+end_src

We keep track of a reference lifetime to the tree to take advantage of the borrow checker and ensure we don't violate the aliasing rules, since we are working with raw ~NonNull~ pointers to nodes.

~RootOrEdge~ is, perhaps not optimally named, either the first node the range was defined with (the only time we want to yield the *current* node and not the *next*), or the last yielded leaf node.

To step to the next node in sequential (depth-first) order:

#+begin_src rust
/// returns Ok(next) or Err(self) if self is already the last node
pub fn into_next<R: super::Read>(
    self,
    volume: &super::volume::Volume<R>,
) -> core::result::Result<Self, Self> {
    let Self {
        mut parents,
        node,
        idx,
    } = self;

    let header = node.inner.header();

    if idx + 1 >= header.nritems.get() {
        // go up
        match parents.pop() {
            Some((node, idx)) => Self {
                idx: (idx + 1).min(node.inner.header().nritems.get()),
                node,
                parents,
            }
            .into_next(volume),
            None => Err(Self { node, idx, parents }),
        }
    } else {
        match &node.inner {
            BTreeNode::Internal(internal) => {
                let node = match internal.visit_child(idx as usize, volume) {
                    Ok(child) => {
                        parents.push((node, idx));
                        Ok(Self {
                            parents,
                            idx: 0,
                            node: child,
                        })
                    }
                    Err(_) => {
                        log::error!("failed to read child node!");
                        Err(Self { parents, node, idx })
                    }
                };

                node
            }
            BTreeNode::Leaf(_) => Ok(Self {
                idx: idx + 1,
                parents,
                node,
            }),
        }
    }
}
#+end_src
~visit_child()~ lazily reads the next node when it is needed.

#+begin_src rust
impl<'tree, R> Iterator for Range<'tree, R>
where
    R: super::Read,
{
    type Item = (Item, TreeItem);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.start, |start| {
                    match start.into_next_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });

                if self.start.node.inner.is_leaf() {
                    break Some(self.start.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
#+end_src

At the cost of duplicating some code, we can iterate forwards and backwards through the range:

#+begin_src rust
impl<'tree, R> DoubleEndedIterator for Range<'tree, R>
where
    R: super::Read,
{
    fn next_back(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.end, |end| {
                    match end.into_next_back_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });

                if self.end.node.inner.is_leaf() {
                    break Some(self.end.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
#+end_src

** Reading the file system

*** The Root Tree
The ~root~ field of the ~Superblock~ points at the root node of the Root Tree, which contains all other important trees, like the Extent Tree and the File System Trees, as well as records pointing at the root directories of subvolumes, including the default subvolume, whose inode id can be found in the ~root_dir_objectid~ field of the ~Superblock~:

#+begin_src rust
let root_tree =
    Tree::from_logical_offset(self.inner.clone(), self.inner.superblock().root.get())?;

// we are looking for the root tree directory (?)
// this is a DIR_ITEM entry in the root tree, with the name "default",
// and the crc32 of "default" as its offset
let key = Key::new(
    KnownObjectId::Custom(self.inner.superblock().root_dir_objectid.get()),
    ObjectType::DirItem,
    0x8dbfc2d2, // crc of "default"
);

let subvol_root = match root_tree.entry(&key)? {
    super::tree::entry::Entry::Occupied(entry) => Some(entry.value()?),
    super::tree::entry::Entry::Vacant(_) => None,
}
.ok_or(Error::NoDefaultSubvolRoot)?;
#+end_src

Filesystem trees contain many different kinds of records, namely ~DirIndex~, ~DirItem~, ~InodeItem~, ~INodeRef~ and ~ExtentData~, all of which are important but not all of which are meaningful by themselves.

The ~location~ this ~DirItem~ entry points at is the key at which we will find the file system tree for the default subvolume:

#+begin_src rust
let subvol_id = subvol_root
    .as_dir_item()
    .expect("dir item")
    .first()
    .expect("dir item entry")
    .item()
    .location
    .id();

let (root_item, fs_root) = self
    .roots
    .get(&subvol_id)
    .ok_or(Error::NoDefaultSubvolFsRoot)?
    .clone();
#+end_src

Using this tree we can now very easily iterate over all files and directories in the subvolume and print their filenames by looking at ~DirIndex~ entries, of which there is exactly one per file, containing a pointer to the ~INodeItem~ of just that file, and its filename:

#+begin_src rust
for (_id, entry) in fs.fs_root.iter() {
    if let Some(dir) = entry.as_dir_index() {
        log::info!("{}", dir.name_as_string_lossy());
    }
}
#+end_src

Just having a list of filenames without paths or file contents is not very helpful, but as this is the first time in the process of writing the code where we are confronted with some tangible quality we typically associate with file systems, this was still a big milestone for me.

More meaningful is finding the root directory of the subvolume and listing its immediate children.
Using the ~root_dirid~ field of the ~root_item~ we got from the root tree with our subvolume id as an inode number, we can look up the ~DirIndex~ entries of the directory:

#+begin_src rust
pub fn get_inode_children(
    &self,
    inode: &INode,
) -> Result<impl Iterator<Item = DirItemEntry> + '_> {
    let key = PartialKey::new(inode.id(), Some(ObjectType::DirIndex), None);

    let children = self.fs_root.find_range(&key)?;

    let children = children.map(|(_, v)| v.try_into_dir_index().expect("dir index"));

    Ok(children)
}
#+end_src

By looking at the file names of the ~DirIndex~ entries, we can also look up files by absolute paths, or by paths relative to some known inode:

#+begin_src rust
pub fn get_inode_by_relative_path<P>(&self, inode: INode, path: P) -> Result<INode>
where
    P: Path,
{
    if path.is_absolute() {
        self.get_inode_by_path(path)
    } else {
        self.get_inode_by_relative_normalized_path(inode, path.normalize())
    }
}

pub fn get_inode_by_path<P>(&self, path: P) -> Result<INode>
where
    P: Path,
{
    let mut normalized = path.normalize();
    if !path.is_absolute() {
        log::error!("path is not absolute!");
    } else {
        // pop root
        _ = normalized.pop_segment();
    }

    self.get_inode_by_relative_normalized_path(self.get_root_dir(), normalized)
}

pub fn get_inode_by_relative_normalized_path(
    &self,
    inode: INode,
    path: NormalizedPath,
) -> Result<INode> {
    let mut inode = inode;

    for segment in path.iter() {
        match segment {
            crate::path::Segment::ParentDir => {
                inode = self.get_inode_parent(&inode)?;
            }
            crate::path::Segment::File(child_name) => {
                let child = self
                    .get_inode_children_inodes(&inode)?
                    .find(|child| {
                        child.path.last().map(|bytes| bytes.as_slice()) == Some(child_name)
                    })
                    .ok_or(Error::INodeNotFound)?
                    .clone();
                // silly borrow checker
                inode = child;
            }
            _ => unreachable!(),
        }
    }

    Ok(inode)
}
#+end_src

To be able to actually read a file, we need to get the list of its ~ExtentData~ records:

#+begin_src rust
pub fn get_inode_extents(&self, inode_id: u64) -> Result<Vec<(u64, ExtentData)>> {
    if let Some(dir_entry) = self.get_inode_dir_index(inode_id)? {
        if dir_entry.item().ty() == DirItemType::RegFile {
            let key = PartialKey::new(inode_id.into(), Some(ObjectType::ExtentData), None);

            let extents = self.fs_root.find_range(&key)?;

            let extents = extents
                .map(|(key, item)| {
                    (
                        key.key.offset.get(),
                        item.as_extent_data().expect("extent data").clone(),
                    )
                })
                .collect::<Vec<_>>();

            Ok(extents)
        } else {
            Ok(vec![])
        }
    } else {
        Err(Error::INodeNotFound)
    }
}
#+end_src
~ExtentData~ records come in two flavors: ~Inline~, where the file contents follow immediately after the record on disk, or ~Regular~, where the record is followed immediately by more data pointing at the file contents.

To read some view into an inode, we need to check each extent for an overlap with our view's bounds:
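
The overlap check in the driver below is phrased in terms of union widths; the textbook formulation for half-open intervals is equivalent and worth keeping in mind. A standalone sketch (not the driver's code):

#+begin_src rust
/// Two half-open intervals [a.0, a.1) and [b.0, b.1) overlap
/// iff each one starts before the other ends.
fn overlaps(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

fn main() {
    // extent [0, 4096) vs. read range [1024, 2048): overlap
    assert!(overlaps((0, 4096), (1024, 2048)));
    // extent [4096, 8192) vs. read range [0, 4096): no overlap (half-open)
    assert!(!overlaps((4096, 8192), (0, 4096)));
    // one byte past the boundary: overlap
    assert!(overlaps((4096, 8192), (0, 4097)));
}
#+end_src
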
|
|
#+begin_src rust
|
|
pub fn read_inode_raw<I: RangeBounds<u64>>(&self, inode: &INode, range: I) -> Result<Vec<u8>> {
|
|
    let mut contents = Vec::new();
    let extents = self.get_inode_extents(inode.id)?;

    let start = match range.start_bound() {
        core::ops::Bound::Included(v) => *v,
        core::ops::Bound::Excluded(v) => *v + 1,
        core::ops::Bound::Unbounded => 0,
    };

    let end = match range.end_bound() {
        core::ops::Bound::Included(v) => Some(*v + 1),
        core::ops::Bound::Excluded(v) => Some(*v),
        core::ops::Bound::Unbounded => None,
    };

    log::info!("extents: {}", extents.len());
    log::info!("{:?}", extents);
    for (offset, extent) in extents.into_iter().filter(|(offset, extent)| {
        // bounds of the current extent
        let extent_start = *offset;
        let extent_end = extent_start + extent.len();
        let extent_len = extent.len();

        // entire range we want to read from the file
        let range_len = end.map(|end| end - start);

        // start of the UNION (from lowest bound to highest bound) of the
        // current extent and the entire range
        let start = start.min(extent_start);
        // end of the UNION of the current extent and the entire range
        let end = end.map(|end| end.max(extent_end));
        // width of the union of the current extent and the entire range
        let len = end.map(|end| (end - start));

        if let (Some(len), Some(range_len)) = (len, range_len) {
            // proceed if the sum of the widths of the 2 ranges (the range
            // we want to read, and the current extent) is greater than the
            // width of the union range:
            //
            // In the first example, the 2 ranges overlap, and the width of
            // the union is smaller than the sum of the widths of the ranges:
            //
            // |------range-1------|
            //           |---range-2----|
            // |-----width-of-union-----|
            // |-------sum----|-of---widths-------|
            //
            // |------------width-of-union------------|
            // |------range-1------|
            //                         |---range-2----|
            //
            // In this second example, the ranges do not overlap, and the
            // width of the union is equal to or greater than the sum of
            // the widths.
            len < extent_len + range_len
        } else {
            start < extent_end
        }
    }) {
        // translate the requested range into coordinates local to this extent
        let start = start.saturating_sub(offset);

        let end = end.map(|end| end - offset).unwrap_or(start + extent.len());
        let len = end - start;

        log::info!("reading {}..{:?} from extent.", start, end);

        let data: alloc::borrow::Cow<[u8]> = match &extent {
            ExtentData::Inline { data, .. } => (&data[start as usize..end as usize]).into(),
            ExtentData::Other(extent) => {
                let address = extent.address() + extent.offset() + start;
                let address = self
                    .volume
                    .inner
                    .offset_from_logical(address)
                    .ok_or(Error::BadLogicalAddress)?;

                let range = match extent.extent_data1().compression() {
                    // compressed extents have to be read in full
                    CompressionType::Zlib | CompressionType::Lzo | CompressionType::ZStd => {
                        address..address + extent.size()
                    }
                    _ => address + start..address + start + len,
                };

                let data = self.volume.inner.read_range(range).expect("bytes");
                data.into()
            }
        };

        // truncate inflated data if needed
        contents.extend_from_slice(&data[..len as usize]);
    }

    Ok(contents)
}
#+end_src
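As a sanity check, the union-width trick used in the filter above can be tested in isolation. A standalone sketch, with hypothetical half-open ~(start, end)~ tuples standing in for extents and read ranges:

#+begin_src rust
// Two half-open ranges overlap exactly when the width of their union
// (lowest start to highest end) is smaller than the sum of their
// individual widths. This mirrors the extent filter predicate above.
fn overlaps(a: (u64, u64), b: (u64, u64)) -> bool {
    let union_start = a.0.min(b.0);
    let union_end = a.1.max(b.1);
    let union_len = union_end - union_start;
    union_len < (a.1 - a.0) + (b.1 - b.0)
}

fn main() {
    assert!(overlaps((0, 10), (5, 15)));   // overlapping ranges
    assert!(!overlaps((0, 10), (10, 20))); // touching, but no overlap
    assert!(!overlaps((0, 10), (12, 20))); // disjoint ranges
    println!("ok");
}
#+end_src

Note that with half-open ranges, two ranges that merely touch have a union width exactly equal to the sum of their widths, so the strict ~<~ correctly excludes them.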

Btrfs also supports compression at individual extent resolution with the Zlib, Zstd and Lzo algorithms. To be able to read a compressed file we need to first apply the decompression algorithm to the entire extent and truncate afterwards:

#+begin_src rust
log::info!("reading {} bytes from file", data.len());
log::info!("compression: {:?}", extent.header().compression());

match extent.header().compression() {
    CompressionType::None => {
        contents.extend_from_slice(&data);
    }
    CompressionType::Zlib => {
        let mut state = miniz_oxide::inflate::stream::InflateState::new(
            miniz_oxide::DataFormat::Zlib,
        );
        let mut output_data = vec![0u8; extent.header().decoded_size() as usize];
        let mut output = &mut output_data[..];
        let mut data = &data[..];
        loop {
            let result = miniz_oxide::inflate::stream::inflate(
                &mut state,
                &data,
                &mut output,
                miniz_oxide::MZFlush::None,
            );

            match result.status.map_err(|_| Error::DecompressionError)? {
                miniz_oxide::MZStatus::Ok => {}
                miniz_oxide::MZStatus::StreamEnd => break,
                _ => {
                    log::error!("need dict ?!");
                    return Err(Error::DecompressionError);
                }
            }

            data = &data[result.bytes_consumed..];
            output = &mut output[result.bytes_written..];
        }
        _ = miniz_oxide::inflate::stream::inflate(
            &mut state,
            &data,
            &mut output,
            miniz_oxide::MZFlush::Finish,
        )
        .status
        .map_err(|_| Error::DecompressionError)?;

        // truncate inflated data if needed
        contents
            .extend_from_slice(&output_data[start as usize..(start + len) as usize]);
    }
    CompressionType::Lzo => {
        todo!()
    }
    CompressionType::ZStd => {
        let mut output_data = vec![0u8; extent.header().decoded_size() as usize];

        let mut zstd = zstd_safe::DCtx::create();
        zstd.init().map_err(|e| {
            log::error!("zstd init error: {}", zstd_safe::get_error_name(e));
            Error::DecompressionError
        })?;

        let mut input = zstd_safe::InBuffer::around(&data);
        let mut output = zstd_safe::OutBuffer::around(&mut output_data[..]);

        loop {
            match zstd.decompress_stream(&mut output, &mut input) {
                Ok(len) => {
                    if len == 0 {
                        break;
                    }
                }
                Err(e) => {
                    log::error!(
                        "zstd decompress stream error: {}",
                        zstd_safe::get_error_name(e)
                    );
                    return Err(Error::DecompressionError);
                }
            }

            if output.pos() == extent.header().decoded_size() as usize {
                break;
            }
        }

        contents
            .extend_from_slice(&output_data[start as usize..(start + len) as usize]);
    }
    c => {
        log::error!("invalid compression type {:?}", c);
        contents.extend_from_slice(&data);
    }
}
#+end_src
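The truncation step at the end of each branch is the same regardless of algorithm: decode the full extent, then keep only the requested window of the decoded bytes. A dependency-free sketch of just that step, where ~decode_extent~ is a hypothetical stand-in for the real zlib/zstd/lzo decoder:

#+begin_src rust
// Sketch of the post-decompression truncation: a compressed extent is
// decoded in full, and only [start, start + len) of the decoded bytes
// is appended to the file contents. `decode_extent` is a fake decoder
// (it just repeats its input) used to keep the example self-contained.
fn decode_extent(compressed: &[u8], decoded_size: usize) -> Vec<u8> {
    compressed.iter().cycle().take(decoded_size).copied().collect()
}

fn read_compressed_window(
    compressed: &[u8],
    decoded_size: usize,
    start: usize,
    len: usize,
) -> Vec<u8> {
    let decoded = decode_extent(compressed, decoded_size);
    // truncate the inflated data to the requested range
    decoded[start..start + len].to_vec()
}

fn main() {
    let window = read_compressed_window(b"abc", 12, 4, 5);
    assert_eq!(window, b"bcabc");
    println!("{:?}", window);
}
#+end_src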

** Design Flaws
Everything, of course; this is just a toy meant to help me understand how Btrfs works and is laid out :^).
But specifically:

*** Copying huge chunks of memory for each tree node
By copying the entire chunk past the address of a tree node and caching it, instead of first reading the header, then reading all the ~KeyPtr~ or ~Item~ children, and reading records directly from file/disk as needed, we waste huge amounts of RAM. Not to mention that caching an entire chunk like that probably makes exceedingly little sense if we ever want to support write access.
One possible solution might be a kind of GC-like structure that keeps track of views into chunks.

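One hypothetical shape for such a structure, using ~Arc~'s reference count as the "GC": the chunk is cached once behind an ~Arc~, tree nodes hold lightweight views into it, and the cache could evict a chunk once it holds the only remaining reference. The type and field names here are made up for illustration:

#+begin_src rust
use std::sync::Arc;

// A lightweight view into a cached chunk: instead of copying bytes per
// tree node, each node records where in the shared chunk its data lives.
struct ChunkView {
    chunk: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl ChunkView {
    fn bytes(&self) -> &[u8] {
        &self.chunk[self.start..self.start + self.len]
    }
}

fn main() {
    // one cached chunk (e.g. a 16 KiB tree node's worth of data)
    let chunk = Arc::new(vec![0u8; 16384]);
    let header = ChunkView { chunk: Arc::clone(&chunk), start: 0, len: 101 };
    let item = ChunkView { chunk: Arc::clone(&chunk), start: 101, len: 33 };

    // both views share the same backing allocation; the strong count
    // tells the cache whether the chunk is still in use anywhere
    assert_eq!(Arc::strong_count(&chunk), 3);
    assert_eq!(header.bytes().len(), 101);
    assert_eq!(item.bytes().len(), 33);
    println!("ok");
}
#+end_src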
*** Looking up files by comparing the filename and not the filename hash
Btrfs stores ~DirItemEntries~ both one-to-one in ~DirIndex~ entries, as well as in ~DirItem~ entries keyed by the filename hash, which allows us to quickly look for the file name in an ordered range of filename hashes, and to compare the actual file name only if there are any hash collisions. Doing that instead of iterating through all ~DirIndex~ entries should be significantly faster.
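For reference, the hash that keys those ~DirItem~ entries is a crc32c of the file name, seeded with ~!1u32~ (matching, as far as I can tell, the kernel's ~btrfs_name_hash()~). A minimal, unoptimized bitwise sketch:

#+begin_src rust
// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) as a
// simple reference implementation; real drivers would use a table- or
// hardware-accelerated variant.
fn crc32c(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    crc
}

// The seed of !1u32 is assumed here from the kernel sources; the
// resulting hash becomes the `offset` field of the DirItem key, so a
// lookup can binary-search the ordered key range before ever
// comparing full names.
fn btrfs_name_hash(name: &[u8]) -> u64 {
    crc32c(!1u32, name) as u64
}

fn main() {
    // sanity check against the well-known CRC-32C check value
    assert_eq!(crc32c(0xFFFF_FFFF, b"123456789") ^ 0xFFFF_FFFF, 0xE306_9283);
    println!("{:#010x}", btrfs_name_hash(b"hello.txt"));
}
#+end_src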