#+TITLE: Butter File System
Butter?
According to the wikipedia page for btrfs, the correct way to pronounce it is Butter FS or Better FS. While Better FS might make a lot of sense to someone who (correctly) believes btrfs to be a better file system than alternative options, I find butter to be so hilariously far removed from linux, file systems and computers in general that it makes for a perfect pronunciation.
How to write a btrfs driver
How to write a (bad (read only)) btrfs driver
Getting Started
At the very least, to get started writing any kind of software that interacts with a btrfs filesystem, we ought to have an instance of a btrfs filesystem which we don't care about potentially getting damaged.
The simplest way to get a new btrfs filesystem on a modern linux install, after installing the btrfs utility programs, is to allocate a new file and then use mkfs.btrfs to write the necessary structures to it.
# unfortunately, btrfs requires a decent amount
# of space even for an empty file system so to be
# able to have a few interesting files allocating
# 150MB is probably a good idea.
fallocate -l 150M btrfs.img
# write a new btrfs filesystem with a
# label to the allocated file.
mkfs.btrfs -L "my-btrfs-filesystem" btrfs.img
We can later mount this image with the help of a loopback device (or for simplicity with a helper tool like udiskie) and write a handful of files and directories to explore with our driver.
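With a loop device that might look something like this (mount point and file names are, of course, arbitrary):
# mount the image via a loop device (mount -o loop sets one up for us)
sudo mount -o loop btrfs.img /mnt
# create a few files and directories to explore later
sudo mkdir -p /mnt/some/nested/dirs
echo "hello butter" | sudo tee /mnt/some/nested/dirs/hello.txt > /dev/null
sudo umount /mnt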
Defining some structures
Btrfs contains a lot of structures, many of which are documented online alongside the on-disk layout in which they can be read. Most of these structures do not interest us at this point, and many of them we will never need to touch at all to implement basic functionality such as reading files and walking directories.
All structures in btrfs are little-endian. Since your system is probably also little-endian and this "driver" shouldn't be used as anything more than a learning exercise, this equates to little more than a curiosity for us; but thanks to the awesome rust ecosystem of crates, defining structures as explicitly little-endian is just as easy using the integer types defined in zerocopy's byteorder module, so I will be using them in any code blocks.
The entry point into a btrfs filesystem is the Superblock:
#[repr(C, packed)]
#[derive(Derivative, Clone, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Superblock {
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Physical address of this block
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub magic: [u8; 0x8],
    pub generation: U64<LE>,
    /// Logical address of the root tree root
    pub root: U64<LE>,
    /// Logical address of the chunk tree root
    pub chunk_root: U64<LE>,
    /// Logical address of the log tree root
    pub log_root: U64<LE>,
    pub log_root_transid: U64<LE>,
    pub total_bytes: U64<LE>,
    pub bytes_used: U64<LE>,
    pub root_dir_objectid: U64<LE>,
    pub num_devices: U64<LE>,
    pub sector_size: U32<LE>,
    pub node_size: U32<LE>,
    /// Unused and must be equal to `nodesize`
    pub leafsize: U32<LE>,
    pub stripesize: U32<LE>,
    pub sys_chunk_array_size: U32<LE>,
    pub chunk_root_generation: U64<LE>,
    pub compat_flags: U64<LE>,
    pub compat_ro_flags: U64<LE>,
    pub incompat_flags: U64<LE>,
    pub csum_type: U16<LE>,
    pub root_level: u8,
    pub chunk_root_level: u8,
    pub log_root_level: u8,
    pub dev_item: DevItem,
    #[derivative(Debug(format_with = "format_u8str"))]
    pub label: [u8; 0x100],
    pub cache_generation: U64<LE>,
    pub uuid_tree_generation: U64<LE>,
    pub metadata_uuid: Uuid,
    /// Future expansion
    #[derivative(Debug = "ignore")]
    _reserved: [u64; 28],
    #[derivative(Debug = "ignore")]
    pub sys_chunk_array: [u8; 0x800],
    #[derivative(Debug = "ignore")]
    pub root_backups: [RootBackup; 4],
    #[derivative(Debug = "ignore")]
    _reserved2: [u8; 565],
}
The superblock can be found at physical offset 0x10000, with copies at physical offsets 0x4000000, 0x4000000000 and 0x4000000000000.
The superblock contains important data, such as the logical addresses of various btree structures, the default root directory, the label of the file system and, most importantly, a list of ChunkItems required to bootstrap the logical-to-physical address mapping.
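Reading and sanity-checking it is the natural first step for the driver. A minimal sketch using std::fs (the superblock occupies 4096 bytes on disk, and the well-known magic bytes "_BHRfS_M" sit at offset 0x40, right after csum, fsid, bytenr and flags in the struct above):
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const SUPERBLOCK_OFFSET: u64 = 0x10000;
const BTRFS_MAGIC: &[u8; 8] = b"_BHRfS_M";

fn read_superblock_bytes(file: &mut File) -> std::io::Result<Vec<u8>> {
    // the superblock structure occupies 4096 bytes on disk
    let mut buf = vec![0u8; 4096];
    file.seek(SeekFrom::Start(SUPERBLOCK_OFFSET))?;
    file.read_exact(&mut buf)?;
    // magic lives at offset 0x40: after csum (32), fsid (16), bytenr (8) and flags (8)
    assert_eq!(&buf[0x40..0x48], BTRFS_MAGIC, "not a btrfs filesystem?");
    Ok(buf)
}
From here, zerocopy's FromBytes derive on Superblock lets us reinterpret these bytes as the structure itself without any hand-rolled unsafe code.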
Logical Addresses
Btrfs uses logical addresses as indices into a dedicated Chunk Tree containing non-overlapping ranges of logical addresses mapped to physical offsets from the start of the filesystem device.
Since everything, including the Chunk Tree itself, is described by logical addresses, the superblock contains the sys_chunk_array field, holding (Key, Chunk) tuples, each followed by chunk.num_stripes >= 1 Stripes (any stripes past the first don't interest us for now; they are related to multi-device and raid setups). The most important fields here are key.offset, which is the start of the logical address range the chunk describes, and chunk.length, the length of that range.
Each stripe describes the physical offset into a device at which the physical range backing the logical range starts.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Stripe {
    pub devid: U64<LE>,
    pub offset: U64<LE>,
    pub dev_uuid: Uuid,
}

#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Chunk {
    /// size of this chunk in bytes
    pub length: U64<LE>,
    /// objectid of the root referencing this chunk
    pub owner: U64<LE>,
    pub stripe_len: U64<LE>,
    pub ty: U64<LE>,
    /// optimal io alignment for this chunk
    pub io_align: U32<LE>,
    /// optimal io width for this chunk
    pub io_width: U32<LE>,
    /// minimal io size for this chunk
    pub sector_size: U32<LE>,
    /// 2^16 stripes is quite a lot, a second limit is the size of a single item in the btree
    pub num_stripes: U16<LE>,
    /// sub stripes only matter for raid10
    pub sub_stripes: U16<LE>,
    pub stripe: Stripe,
    // additional stripes go here
}
fn bootstrap_chunk_tree(superblock: &Superblock) -> Result<BTreeMap<ChunkTreeKey, u64>> {
    let array_size = superblock.sys_chunk_array_size.get() as usize;
    let mut offset: usize = 0;
    let key_size = size_of::<Key>();
    let mut chunk_tree = BTreeMap::new();
    let bytes = &superblock.sys_chunk_array;

    while offset < array_size {
        if offset + key_size > array_size {
            log::error!("short key read");
            return Err(Error::InvalidOffset);
        }
        let key = bytes.gread::<Key>(&mut offset)?;
        if key.ty() != ObjectType::ChunkItem {
            log::error!("key is not of type ChunkItem");
            return Err(Error::InvalidOffset);
        }
        let chunk = bytes.gread::<Chunk>(&mut offset)?;
        let num_stripes = chunk.num_stripes.get();
        if num_stripes == 0 {
            log::error!("num_stripes cannot be 0");
            return Err(Error::InvalidOffset);
        }
        if num_stripes != 1 {
            log::warn!(
                "warning: {} stripes detected but only processing 1",
                num_stripes
            );
        }
        let key_offset = key.offset.get();
        let chunk_length = chunk.length.get();
        match chunk_tree.entry(ChunkTreeKey {
            range: key_offset..(key_offset + chunk_length),
        }) {
            Entry::Vacant(entry) => {
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(_) => {
                log::error!("overlapping stripes!");
                return Err(Error::InvalidOffset);
            }
        };
        offset += (num_stripes - 1) as usize * size_of::<Stripe>();
        if offset > array_size {
            log::error!("short chunk item + stripes read");
            return Err(Error::InvalidOffset);
        }
    }
    Ok(chunk_tree)
}
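With this map, translating a logical address to a physical offset becomes a simple range lookup; a minimal sketch, assuming ChunkTreeKey wraps the logical range as a Range<u64> like the code above suggests:
fn offset_from_logical(chunk_tree: &BTreeMap<ChunkTreeKey, u64>, logical: u64) -> Option<u64> {
    // find the chunk whose logical range contains the address and rebase
    // the address onto that chunk's physical start
    chunk_tree.iter().find_map(|(key, physical_start)| {
        key.range
            .contains(&logical)
            .then(|| physical_start + (logical - key.range.start))
    })
}
A real implementation would exploit the BTreeMap ordering instead of scanning linearly, but with only a handful of chunks this hardly matters.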
After parsing the sys_chunk_array chunks, we can then access and parse the chunk tree itself before being able to parse the rest of the btrfs structures.
fn parse_chunk_tree(mut self) -> Result<Self> {
    log::debug!("parsing chunk tree");
    let this = Rc::new(self);
    let chunk_tree =
        Tree::from_logical_offset(this.clone(), this.superblock().chunk_root.get())?;
    let chunks = chunk_tree
        .iter()
        .filter_map(|(item, v)| {
            log::debug!("{:?}", item);
            match v {
                TreeItem::Chunk(chunk) => Some((item, chunk)),
                _ => None,
            }
        })
        .collect::<Vec<_>>();
    drop(chunk_tree);
    self = match Rc::try_unwrap(this) {
        Ok(v) => v,
        Err(_) => unreachable!(),
    };

    for (item, chunk) in chunks {
        let start = item.key.offset.get() as u64;
        let end = start + chunk.length.get();
        match self.chunk_cache.entry(ChunkTreeKey { range: start..end }) {
            Entry::Vacant(entry) => {
                log::info!("inserting chunk [{start}, {end})");
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(entry) => {
                log::warn!("overlapping stripes!");
                log::warn!(
                    "\t{:?} and {:?}",
                    entry.key(),
                    ChunkTreeKey { range: start..end }
                );
                log::warn!(
                    "\twith offsets: {} and {}",
                    entry.get(),
                    chunk.stripe.offset.get()
                );
                if *entry.get() != chunk.stripe.offset.get() {
                    log::error!("\tprobably an error?");
                }
            }
        }
    }
    Ok(self)
}
Trees
Keys and PartialKeys
Everything in btrfs is stored as records in B-trees, similar in layout to a B+ tree, with Key structures acting as the keys.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, Eq, FromBytes, AsBytes)]
pub struct Key {
    pub id: U64<LE>,
    pub ty: u8,
    pub offset: U64<LE>,
}
A Key contains an id, a type and an offset. The meaning of the id and offset fields varies greatly depending on the tree the key indexes and on its ty, so offset might store a logical address (for a ChunkItem), a directory index (for a DirIndex item) or the hash of a filename (for a DirItem entry).
The ty field of a Key has only a handful of meaningful values:
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
pub enum ObjectType {
    INodeItem = 0x01,
    INodeRef = 0x0C,
    InodeExtref = 0x0D,
    XattrItem = 0x18,
    OrphanInode = 0x30,
    DirItem = 0x54,
    DirIndex = 0x60,
    ExtentData = 0x6C,
    ExtentCsum = 0x80,
    RootItem = 0x84,
    TypeRootBackref = 0x90,
    RootRef = 0x9C,
    ExtentItem = 0xA8,
    MetadataItem = 0xA9,
    TreeBlockRef = 0xB0,
    ExtentDataRef = 0xB2,
    ExtentRefV0 = 0xB4,
    SharedBlockRef = 0xB6,
    SharedDataRef = 0xB8,
    BlockGroupItem = 0xC0,
    FreeSpaceInfo = 0xC6,
    FreeSpaceExtent = 0xC7,
    FreeSpaceBitmap = 0xC8,
    DevExtent = 0xCC,
    DevItem = 0xD8,
    ChunkItem = 0xE4,
    TempItem = 0xF8,
    DevStats = 0xF9,
    SubvolUuid = 0xFB,
    SubvolRecUuid = 0xFC,
    // catch any other u8 value as an invalid object type to be able to convert
    // between any u8 and this type with num_enum
    #[num_enum(catch_all)]
    Invalid(u8),
}
The id field has a handful of specially defined values, as well as a range of values that may mean different things depending on the value of the ty field:
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
#[repr(u64)]
pub enum KnownObjectId {
    RootTree = 1,
    ExtentTree,
    ChunkTree,
    DevTree,
    FsTree,
    RootTreeDir,
    CsumTree,
    QuotaTree,
    UuidTree,
    FreeSpaceTree,
    // RootINode = 0x100, // first free id, always the root inode of a fs
    // __LastFreeId = u64::MAX - 0x100, // last free id
    DataRelocTree = u64::MAX - 9,
    TreeReloc = u64::MAX - 8,
    TreeLog = u64::MAX - 7,
    Orphan = u64::MAX - 5,
    // catch all which contains the valid range of custom ids from 256 to
    // u64::MAX-256 plus some invalid custom ids, unfortunately num_enum does
    // not allow to define a range like that for a tuple variant
    #[num_enum(catch_all)]
    Custom(u64),
}
Keys are ordered by id, then ty and finally offset, making it very easy to find, for example, the range of all ExtentData items for a specific inode number by searching for the first and last keys matching (inode_number, ObjectType::ExtentData, _).
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
        match self.id().partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        match self.ty().partial_cmp(&other.ty()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        self.offset.get().partial_cmp(&other.offset.get())
    }
}
To look for "incomplete" keys in a btree, we can define a "partial" key which can be compared with a full Key and which will give us a useful subtree containing the records we are interested in, as long as the partial key is filled from left to right (i.e. (_, ObjectType::DirIndex, 5) would not give a useful range, because between the 5th DirIndex entry of inode 256 and the 5th DirIndex entry of inode 257 may lie many other records, even other DirIndex records with different indices; in other words, there is no way to get a consecutive range over every 5th DirIndex of any inode).
pub struct PartialKey {
    id: KnownObjectId,
    ty: Option<ObjectType>,
    offset: Option<u64>,
}

impl PartialOrd<Key> for PartialKey {
    fn partial_cmp(&self, other: &Key) -> Option<core::cmp::Ordering> {
        match self.id.partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) | None => {
                match self.ty.and_then(|ty| ty.partial_cmp(&other.ty())) {
                    Some(core::cmp::Ordering::Equal) | None => self
                        .offset
                        .and_then(|offset| offset.partial_cmp(&other.offset.get())),
                    ord => ord,
                }
            }
            ord => ord,
        }
    }
}
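As a quick example, a partial key matching every DirIndex record of inode 256 (the first free object id, which is always the root inode of a fs, per the comment in KnownObjectId above) can be built with the PartialKey::new constructor used later on:
// matches (256, DirIndex, *): every directory entry record of inode 256
let key = PartialKey::new(
    KnownObjectId::Custom(256),
    Some(ObjectType::DirIndex),
    None, // offset holds the directory index here, None matches any
);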
Tree Nodes
Trees in btrfs are made of internal nodes containing keys in KeyPtrs, and leaf nodes at the bottom containing Items describing the record values. Both internal and leaf nodes start with a common Header:
#[repr(C, packed)]
#[derive(Derivative, Clone, PartialEq, Eq, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Header {
    #[derivative(Debug = "ignore")]
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Which block this node is supposed to live in
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub chunk_tree_uuid: Uuid,
    pub generation: U64<LE>,
    pub owner: U64<LE>,
    pub nritems: U32<LE>,
    pub level: u8,
}
The only interesting fields for us are level, the level (or depth) of this node, and nritems, the number of children (either further InternalNodes or, at level 0, LeafNodes) this node has.
Internal nodes themselves do not contain any values; instead, they contain KeyPtrs acting as bounds for the values found at the leaf level when descending that branch of the tree:
package InternalNode0 {
    object KeyPtr1
    object KeyPtr2
    object KeyPtr3
    object KeyPtr4

    KeyPtr1 -|> KeyPtr2 : "is greater?"
    KeyPtr2 -|> KeyPtr3 : "is greater?"
    KeyPtr3 -|> KeyPtr4 : "is greater?"

    package Children {
        package InternalNode1 {}
        package InternalNode2 {}
        package InternalNode3 {}
        package InternalNode4 {}
    }

    KeyPtr1 --|> Children.InternalNode1 : is equal?
    KeyPtr2 --|> Children.InternalNode2 : is equal?
    KeyPtr3 --|> Children.InternalNode3 : is equal?
    KeyPtr4 --|> Children.InternalNode4 : is equal or greater?
    KeyPtr2 --|> Children.InternalNode1 : is less?
    KeyPtr3 --|> Children.InternalNode2 : is less?
    KeyPtr4 --|> Children.InternalNode3 : is less?
}
In leaf nodes, the header is immediately followed by a list of nritems Items, which in turn point to records relative to the memory immediately following the header. The layout of these records depends on the ty field of the Item's key:
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
/// A `BtrfsLeaf` is full of `BtrfsItem`s. `offset` and `size` (relative to start of data area)
/// tell us where to find the item in the leaf.
pub struct Item {
    pub key: Key,
    pub offset: U32<LE>,
    pub size: U32<LE>,
}
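Getting at a record is therefore just a matter of slicing the node's raw bytes. A small sketch, assuming bytes holds the whole node as read from disk (which is how the Node type further down keeps it):
/// Returns the raw record bytes described by `item`.
fn record_bytes<'a>(bytes: &'a [u8], item: &Item) -> &'a [u8] {
    // item offsets are relative to the end of the header, not the node start
    let data = &bytes[core::mem::size_of::<Header>()..];
    &data[item.offset.get() as usize..][..item.size.get() as usize]
}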
Searching a Tree
To look up a key in a tree, we iterate over the children of a node, and descend according to the diagram above:
pub fn find_key<K: PartialEq<Key> + PartialOrd<Key>>(self, key: &K) -> SearchResult {
    match &self.node.inner {
        BTreeNode::Internal(node) => {
            let idx = node
                .visit_children_keys()
                .find_map(|(i, child)| match key.partial_cmp(&child) {
                    Some(core::cmp::Ordering::Less) => {
                        Some(if i == 0 { 0 } else { i as u32 - 1 })
                    }
                    Some(core::cmp::Ordering::Equal) | None => Some(i as u32),
                    _ => None,
                })
                .unwrap_or(node.children.len() as u32 - 1);
            SearchResult::Edge(NodeHandle { idx, ..self })
        }
        BTreeNode::Leaf(node) => {
            for (i, child) in node.items.iter().enumerate() {
                if key.eq(&child.key) {
                    return SearchResult::Leaf(NodeHandle {
                        idx: i as u32,
                        ..self
                    });
                }
            }
            log::debug!("key definitely not found!");
            SearchResult::Edge(NodeHandle {
                idx: node.items.len() as u32 - 1,
                ..self
            })
        }
    }
}
This function operates on a NodeHandle and returns a NodeHandle (a pointer to a Node plus an index into that node pointing at a child or value) wrapped in a SearchResult: an Edge if the node is an internal node, or a Leaf if the key was found in a leaf node. An important step is hidden inside the visit_children_keys() function.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeHandle {
    parents: Vec<(BoxedNode, u32)>,
    node: BoxedNode,
    idx: u32,
}
The node handle also keeps track of the parent nodes and edges (via the index) to allow for walking back up the tree when iterating over ranges between two nodes.
#[derive(Derivative, Eq, Clone)]
#[derivative(Debug, PartialEq)]
pub struct Node {
    inner: BTreeNode,
    #[derivative(Debug = "ignore")]
    #[derivative(PartialEq = "ignore")]
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum BTreeNode {
    Internal(BTreeInternalNode),
    Leaf(BTreeLeafNode),
}

/// An internal node in a btrfs tree, containing `KeyPtr`s to other internal nodes or leaf nodes.
#[derive(Derivative, Clone)]
#[derivative(Debug)]
pub struct BTreeInternalNode {
    pub header: Header,
    #[derivative(Debug = "ignore")]
    children: Vec<RefCell<NodePtr>>,
}

/// A leaf node in a btrfs tree, containing different items
#[derive(Debug, Clone)]
pub struct BTreeLeafNode {
    pub header: Header,
    /// actual leaf data
    pub items: Vec<Item>,
}
The node type stores either an internal or a leaf node, as well as a copy of the bytes of the node on disk, from which the leaf records can be read.
Iterating over a range
To iterate over a range between 2 nodes, we need to keep track of the start and end nodes, but only yield leaves.
#[derive(Debug)]
pub struct Range<'tree, R: super::Read> {
    volume: Rc<Volume<R>>,
    pub(crate) start: RootOrEdge,
    pub(crate) end: RootOrEdge,
    phantom: PhantomData<&'tree ()>,
}
We keep track of a reference lifetime to the tree to take advantage of the borrow checker, ensuring we don't violate the aliasing rules, since we are working with raw NonNull pointers to nodes.
RootOrEdge is, perhaps not optimally named, either the first node the range is defined with (since this is the only time we want to yield the current node, and not the next), or the last yielded leaf node.
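A plausible shape for it, matching that description:
// a guess at the shape of RootOrEdge, based on the description above
pub(crate) enum RootOrEdge {
    /// the handle the range was created with; yields itself before stepping
    Root(NodeHandle),
    /// the last leaf position that was already yielded
    Edge(NodeHandle),
}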
To step to the next node in sequential (depth-first) order:
/// returns Ok(next) or Err(self) if self is already the last node
pub fn into_next<R: super::Read>(
    self,
    volume: &super::volume::Volume<R>,
) -> core::result::Result<Self, Self> {
    let Self {
        mut parents,
        node,
        idx,
    } = self;
    let header = node.inner.header();
    if idx + 1 >= header.nritems.get() {
        // go up
        match parents.pop() {
            Some((node, idx)) => Self {
                idx: (idx + 1).min(node.inner.header().nritems.get()),
                node,
                parents,
            }
            .into_next(volume),
            None => Err(Self { node, idx, parents }),
        }
    } else {
        match &node.inner {
            BTreeNode::Internal(internal) => {
                let node = match internal.visit_child(idx as usize, volume) {
                    Ok(child) => {
                        parents.push((node, idx));
                        Ok(Self {
                            parents,
                            idx: 0,
                            node: child,
                        })
                    }
                    Err(_) => {
                        log::error!("failed to read child node!");
                        Err(Self { parents, node, idx })
                    }
                };
                node
            }
            BTreeNode::Leaf(_) => Ok(Self {
                idx: idx + 1,
                parents,
                node,
            }),
        }
    }
}
visit_child() lazily reads the next node when it is needed.
impl<'tree, R> Iterator for Range<'tree, R>
where
    R: super::Read,
{
    type Item = (Item, TreeItem);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.start, |start| {
                    match start.into_next_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });
                if self.start.node.inner.is_leaf() {
                    break Some(self.start.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
At the cost of duplicating some code, we can iterate forwards and backwards through the range:
impl<'tree, R> DoubleEndedIterator for Range<'tree, R>
where
    R: super::Read,
{
    fn next_back(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.end, |end| {
                    match end.into_next_back_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });
                if self.end.node.inner.is_leaf() {
                    break Some(self.end.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
Reading the file system
The Root Tree
The root field of the Superblock points at the root node of the Root Tree, which contains all other important trees, like the Extent Tree and the File System Trees, as well as records pointing at the root directories of subvolumes, including the default subvolume, whose inode id can be found in the root_dir_objectid field of the Superblock:
let root_tree =
    Tree::from_logical_offset(self.inner.clone(), self.inner.superblock().root.get())?;

// we are looking for the root tree directory (?)
// this is a DIR_ITEM entry in the root tree, with the name "default",
// and the crc32 of "default" as its offset
let key = Key::new(
    KnownObjectId::Custom(self.inner.superblock().root_dir_objectid.get()),
    ObjectType::DirItem,
    0x8dbfc2d2, // crc of "default"
);
let subvol_root = match root_tree.entry(&key)? {
    super::tree::entry::Entry::Occupied(entry) => Some(entry.value()?),
    super::tree::entry::Entry::Vacant(_) => None,
}
.ok_or(Error::NoDefaultSubvolRoot)?;
Filesystem trees contain many different kinds of records, namely DirIndex, DirItem, InodeItem, INodeRef and ExtentData, which are all important, but not all of which are meaningful by themselves.
The location this DirItem entry points at is the key at which we will find the file system tree of the default subvolume:
let subvol_id = subvol_root
    .as_dir_item()
    .expect("dir item")
    .first()
    .expect("dir item entry")
    .item()
    .location
    .id();
let (root_item, fs_root) = self
    .roots
    .get(&subvol_id)
    .ok_or(Error::NoDefaultSubvolFsRoot)?
    .clone();
Using this tree we can now very easily iterate over all files and directories in the subvolume and print their filenames by looking at DirIndex entries, of which there is exactly one per file, each containing a pointer to the INodeItem of just that file, and its filename:
for (_id, entry) in fs.fs_root.iter() {
    if let Some(dir) = entry.as_dir_index() {
        log::info!("{}", dir.name_as_string_lossy());
    }
}
Just having a list of filenames without paths or file contents is not very helpful, but as this is the first time in the process of writing the code that we are confronted with some tangible quality we typically connect with file systems, this was still a big milestone for me.
More meaningful is finding the root directory of the subvolume, and listing its immediate children.
Using the root_dirid field of the root_item we got from the root tree, with our subvolume id as the inode number, we can look up the DirIndex entries of the directory:
pub fn get_inode_children(
    &self,
    inode: &INode,
) -> Result<impl Iterator<Item = DirItemEntry> + '_> {
    let key = PartialKey::new(inode.id(), Some(ObjectType::DirIndex), None);
    let children = self.fs_root.find_range(&key)?;
    let a = children.map(|(_, v)| v.try_into_dir_index().expect("dir index"));
    Ok(a)
}
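Put together, listing the children of the subvolume's root directory might look like this (assuming DirItemEntry exposes the same name_as_string_lossy helper seen above):
// print the immediate children of the root directory
let root = fs.get_root_dir();
for entry in fs.get_inode_children(&root)? {
    log::info!("{}", entry.name_as_string_lossy());
}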
By looking at the file names of the DirIndex entries, we can also look up files by absolute paths, or by paths relative to some known inode:
pub fn get_inode_by_relative_path<P>(&self, inode: INode, path: P) -> Result<INode>
where
    P: Path,
{
    if path.is_absolute() {
        self.get_inode_by_path(path)
    } else {
        self.get_inode_by_relative_normalized_path(inode, path.normalize())
    }
}

pub fn get_inode_by_path<P>(&self, path: P) -> Result<INode>
where
    P: Path,
{
    let mut normalized = path.normalize();
    if !path.is_absolute() {
        log::error!("path is not absolute!");
    } else {
        // pop root
        _ = normalized.pop_segment();
    }
    self.get_inode_by_relative_normalized_path(self.get_root_dir(), normalized)
}

pub fn get_inode_by_relative_normalized_path(
    &self,
    inode: INode,
    path: NormalizedPath,
) -> Result<INode> {
    let mut inode = inode;
    for segment in path.iter() {
        match segment {
            crate::path::Segment::ParentDir => {
                inode = self.get_inode_parent(&inode)?;
            }
            crate::path::Segment::File(child_name) => {
                let child = self
                    .get_inode_children_inodes(&inode)?
                    .find(|child| {
                        child.path.last().map(|bytes| bytes.as_slice()) == Some(child_name)
                    })
                    .ok_or(Error::INodeNotFound)?
                    .clone();
                // silly borrow checker
                inode = child;
            }
            _ => unreachable!(),
        }
    }
    Ok(inode)
}
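With these helpers in place, resolving a file is a single call (assuming the custom Path trait is implemented for plain strings):
// resolve a file relative to the subvolume root
let inode = fs.get_inode_by_path("/some/nested/dirs/hello.txt")?;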
To be able to actually read a file, we need to get a list of its ExtentData records:
pub fn get_inode_extents(&self, inode_id: u64) -> Result<Vec<(u64, ExtentData)>> {
    if let Some(dir_entry) = self.get_inode_dir_index(inode_id)? {
        if dir_entry.item().ty() == DirItemType::RegFile {
            let key = PartialKey::new(inode_id.into(), Some(ObjectType::ExtentData), None);
            let extents = self.fs_root.find_range(&key)?;
            let extents = extents
                .map(|(key, item)| {
                    (
                        key.key.offset.get(),
                        item.as_extent_data().expect("extent data").clone(),
                    )
                })
                .collect::<Vec<_>>();
            Ok(extents)
        } else {
            Ok(vec![])
        }
    } else {
        Err(Error::INodeNotFound)
    }
}
ExtentData records come in two flavors: Inline, where the file contents follow immediately after the record on disk, or Regular, where the record is instead immediately followed by more data pointing at the file contents.
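As a rough sketch (the real definitions carry more fields, and accessors like address(), offset() and size() used below read the Regular payload):
// a rough sketch of the two ExtentData flavors used below
pub enum ExtentData {
    /// file contents stored inline in the tree, right after the header
    Inline {
        header: ExtentDataHeader,
        data: Vec<u8>,
    },
    /// file contents stored elsewhere on disk, located by address/offset/size
    Other(ExtentDataRegular),
}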
To read some view into an inode, we need to check each extent for an overlap with our view's bounds:
pub fn read_inode_raw<I: RangeBounds<u64>>(&self, inode: &INode, range: I) -> Result<Vec<u8>> {
    let mut contents = Vec::new();
    let extents = self.get_inode_extents(inode.id)?;

    let start = match range.start_bound() {
        core::ops::Bound::Included(v) => *v,
        core::ops::Bound::Excluded(v) => *v + 1,
        core::ops::Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        core::ops::Bound::Included(v) => Some(*v + 1),
        core::ops::Bound::Excluded(v) => Some(*v),
        core::ops::Bound::Unbounded => None,
    };

    log::info!("extents: {}", extents.len());
    log::info!("{:?}", extents);

    for (offset, extent) in extents.into_iter().filter(|(offset, extent)| {
        // bounds of the current extent
        let extent_start = *offset;
        let extent_end = extent_start + extent.len();
        let extent_len = extent.len();
        // entire range we want to read from the file
        let range_len = end.map(|end| end - start);
        // start of the UNION (from lowest bound to highest bound) of the
        // current extent and the entire range
        let start = start.min(extent_start);
        // end of the UNION of the current extent and the entire range
        let end = end.map(|end| end.max(extent_end));
        // width of the union of the current extent and the entire range
        let len = end.map(|end| (end - start));
        if let (Some(len), Some(range_len)) = (len, range_len) {
            // proceed if the sum of the widths of the 2 ranges (the range
            // we want to read, and the current extent) is greater than the
            // width of the union range:
            //
            // In the first example, the 2 ranges overlap, and the width of
            // the union is smaller than the sum of the widths of the ranges:
            //
            // |------range-1------|
            //              |---range-2----|
            // |-----width-of-union-----|
            // |-------sum----|-of---widths-------|
            //
            // In this second example, the ranges do not overlap, and the
            // width of the union is equal to or greater than the sum of
            // the widths:
            //
            // |------range-1------|
            //                            |---range-2----|
            // |------------width-of-union------------|
            len < extent_len + range_len
        } else {
            start < extent_end
        }
    }) {
        let start = start.saturating_sub(offset);
        let end = end.map(|end| end - offset).unwrap_or(start + extent.len());
        let len = end - start;
        log::info!("reading {}..{:?} from extent.", start, end);
        let data: alloc::borrow::Cow<[u8]> = match &extent {
            ExtentData::Inline { data, .. } => (&data[start as usize..end as usize]).into(),
            ExtentData::Other(extent) => {
                let address = extent.address() + extent.offset() + start;
                let address = self
                    .volume
                    .inner
                    .offset_from_logical(address)
                    .ok_or(Error::BadLogicalAddress)?;
                let range = match extent.extent_data1().compression() {
                    // compressed extents must be read in their entirety
                    CompressionType::Zlib | CompressionType::Lzo | CompressionType::ZStd => {
                        address..address + extent.size()
                    }
                    _ => address + start..address + start + len,
                };
                let data = self.volume.inner.read_range(range).expect("bytes");
                data.into()
            }
        };
        // truncate inflated data if needed
        contents.extend_from_slice(&data[..len as usize]);
    }
    Ok(contents)
}
Btrfs also supports compression at individual extent granularity with the Zlib, Zstd and Lzo algorithms. To be able to read a compressed file, we first need to decompress the entire extent and truncate afterwards:
log::info!("reading {} bytes from file", data.len());
log::info!("compression: {:?}", extent.header().compression());
match extent.header().compression() {
CompressionType::None => {
contents.extend_from_slice(&data);
}
CompressionType::Zlib => {
let mut state = miniz_oxide::inflate::stream::InflateState::new(
miniz_oxide::DataFormat::Zlib,
);
let mut output_data = vec![0u8; extent.header().decoded_size() as usize];
let mut output = &mut output_data[..];
let mut data = &data[..];
loop {
let result = miniz_oxide::inflate::stream::inflate(
&mut state,
&data,
&mut output,
miniz_oxide::MZFlush::None,
);
match result.status.map_err(|_| Error::DecompressionError)? {
miniz_oxide::MZStatus::Ok => {}
miniz_oxide::MZStatus::StreamEnd => break,
_ => {
log::error!("need dict ?!");
return Err(Error::DecompressionError);
}
}
data = &data[result.bytes_consumed..];
output = &mut output[result.bytes_written..];
}
_ = miniz_oxide::inflate::stream::inflate(
&mut state,
&data,
&mut output,
miniz_oxide::MZFlush::Finish,
)
.status
.map_err(|_| Error::DecompressionError)?;
// truncate inflated data if needed
contents
.extend_from_slice(&output_data[start as usize..(start + len) as usize]);
}
CompressionType::Lzo => {
todo!()
}
CompressionType::ZStd => {
let mut output_data = vec![0u8; extent.header().decoded_size() as usize];
let mut zstd = zstd_safe::DCtx::create();
zstd.init().map_err(|e| {
log::error!("zstd init error: {}", zstd_safe::get_error_name(e));
Error::DecompressionError
})?;
let mut input = zstd_safe::InBuffer::around(&data);
let mut output = zstd_safe::OutBuffer::around(&mut output_data[..]);
loop {
match zstd.decompress_stream(&mut output, &mut input) {
Ok(len) => {
if len == 0 {
break;
}
}
Err(e) => {
log::error!(
"zstd decompress stream error: {}",
zstd_safe::get_error_name(e)
);
return Err(Error::DecompressionError);
}
}
if output.pos() == extent.header().decoded_size() as usize {
break;
}
}
contents
.extend_from_slice(&output_data[start as usize..(start + len) as usize]);
}
c => {
log::error!("invalid compression type {:?}", c);
contents.extend_from_slice(&data);
}
}
Design Flaws
Everything, of course; this is just a toy meant to help me understand how btrfs works and is laid out :^). But specifically:
Copying huge chunks of memory for each tree node
By copying the entire chunk past the address of a tree node and caching it, instead of first reading the header, then reading all the KeyPtr or Item children and then reading records directly from file/disk, we waste huge amounts of RAM; not to mention that caching an entire chunk like that probably makes exceedingly little sense if we ever want to support write access.
One possible solution might be a kind of GC-like structure that keeps track of views into chunks.
Looking up files by comparing filenames instead of filename hashes
Btrfs stores DirItemEntries both one-to-one in DirIndex entries, as well as in DirItems keyed by the filename hash, which allows us to quickly look for the file name in an ordered range of filename hashes and then only compare the actual file names if there are any hash collisions. Doing that instead of iterating through all DirIndex entries should be significantly faster.
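A sketch of what that could look like, with name_hash() as a hypothetical helper implementing btrfs's crc32c-based name hash, and try_into_dir_item()/name() as assumed accessors mirroring the ones used earlier:
pub fn find_child_by_name(&self, inode: &INode, name: &[u8]) -> Result<Option<DirItemEntry>> {
    // DirItem keys store the hash of the file name in their offset field
    let key = PartialKey::new(inode.id(), Some(ObjectType::DirItem), Some(name_hash(name)));
    let candidates = self.fs_root.find_range(&key)?;
    // several names can collide on the same hash, so the actual names still
    // have to be compared, but only within this tiny candidate set
    Ok(candidates
        .filter_map(|(_, v)| v.try_into_dir_item().ok())
        .flatten()
        .find(|entry| entry.name() == name))
}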