#+TITLE: Butter File System
Butter?
According to the wikipedia page for btrfs, the correct way to pronounce it is Butter FS or Better FS. While Better FS might make a lot of sense to someone who (correctly) believes btrfs to be a better file system than alternative options, I find butter to be so hilariously far removed from linux, file systems and computers in general that it makes for a perfect pronunciation.
How to write a btrfs driver
How to write a (bad (read only)) btrfs driver
Getting Started
At the very least, to get started writing any kind of software that interacts with a btrfs filesystem, we ought to have an instance of a btrfs filesystem which we don't care about potentially getting damaged.
The simplest way to get a new btrfs filesystem on a modern linux install, after installing the btrfs utility programs, is to allocate a new file and then use mkfs.btrfs to write the necessary structures to it.
# unfortunately, btrfs requires a decent amount
# of space even for an empty file system so to be
# able to have a few interesting files allocating
# 150MB is probably a good idea.
fallocate -l 150M btrfs.img
# write a new btrfs filesystem with a
# label to the allocated file.
mkfs.btrfs -L "my-btrfs-filesystem" btrfs.img
We can later mount this image with the help of a loopback device (or for simplicity with a helper tool like udiskie) and write a handful of files and directories to explore with our driver.
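With a loop device that might look something like this (mount point and file names are, of course, arbitrary):
# mount the image via a loop device (mount -o loop sets one up for us)
sudo mount -o loop btrfs.img /mnt
# create a few files and directories to explore later
sudo mkdir -p /mnt/some/nested/dirs
echo "hello butter" | sudo tee /mnt/some/nested/dirs/hello.txt > /dev/null
sudo umount /mnt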
Defining some structures
Btrfs contains a lot of structures, many of which are documented online alongside the on-disk layout in which they can be read. Most of these structures do not interest us at this point, and many of them we will never need to touch at all to implement basic functionality such as reading files and walking directories.
All structures in btrfs are little-endian. Since your system is probably also little-endian and this "driver" shouldn't be used as anything more than a learning exercise, this equates to little more than a curiosity for us; but thanks to the awesome rust ecosystem of crates, defining structures as explicitly little-endian is just as easy using the integer types defined in zerocopy's byteorder module, so I will be using them in any code blocks.
The entry point into a btrfs filesystem is the Superblock:
#[repr(C, packed)]
#[derive(Derivative, Clone, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Superblock {
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Physical address of this block
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub magic: [u8; 0x8],
    pub generation: U64<LE>,
    /// Logical address of the root tree root
    pub root: U64<LE>,
    /// Logical address of the chunk tree root
    pub chunk_root: U64<LE>,
    /// Logical address of the log tree root
    pub log_root: U64<LE>,
    pub log_root_transid: U64<LE>,
    pub total_bytes: U64<LE>,
    pub bytes_used: U64<LE>,
    pub root_dir_objectid: U64<LE>,
    pub num_devices: U64<LE>,
    pub sector_size: U32<LE>,
    pub node_size: U32<LE>,
    /// Unused and must be equal to `nodesize`
    pub leafsize: U32<LE>,
    pub stripesize: U32<LE>,
    pub sys_chunk_array_size: U32<LE>,
    pub chunk_root_generation: U64<LE>,
    pub compat_flags: U64<LE>,
    pub compat_ro_flags: U64<LE>,
    pub incompat_flags: U64<LE>,
    pub csum_type: U16<LE>,
    pub root_level: u8,
    pub chunk_root_level: u8,
    pub log_root_level: u8,
    pub dev_item: DevItem,
    #[derivative(Debug(format_with = "format_u8str"))]
    pub label: [u8; 0x100],
    pub cache_generation: U64<LE>,
    pub uuid_tree_generation: U64<LE>,
    pub metadata_uuid: Uuid,
    /// Future expansion
    #[derivative(Debug = "ignore")]
    _reserved: [u64; 28],
    #[derivative(Debug = "ignore")]
    pub sys_chunk_array: [u8; 0x800],
    #[derivative(Debug = "ignore")]
    pub root_backups: [RootBackup; 4],
    #[derivative(Debug = "ignore")]
    _reserved2: [u8; 565],
}
The superblock can be found at physical offset 0x10000, with copies at physical offsets 0x4000000, 0x4000000000 and 0x4000000000000.
The superblock contains important data, such as the logical addresses of various btree structures, the default root directory, the label of the file system and, most importantly, a list of ChunkItems required to bootstrap the logical-to-physical address mapping.
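Reading and sanity-checking it is the natural first step for the driver. A minimal sketch using std::fs (the superblock occupies 4096 bytes on disk, and the well-known magic bytes "_BHRfS_M" sit at offset 0x40, right after csum, fsid, bytenr and flags in the struct above):
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const SUPERBLOCK_OFFSET: u64 = 0x10000;
const BTRFS_MAGIC: &[u8; 8] = b"_BHRfS_M";

fn read_superblock_bytes(file: &mut File) -> std::io::Result<Vec<u8>> {
    // the superblock structure occupies 4096 bytes on disk
    let mut buf = vec![0u8; 4096];
    file.seek(SeekFrom::Start(SUPERBLOCK_OFFSET))?;
    file.read_exact(&mut buf)?;
    // magic lives at offset 0x40: after csum (32), fsid (16), bytenr (8) and flags (8)
    assert_eq!(&buf[0x40..0x48], BTRFS_MAGIC, "not a btrfs filesystem?");
    Ok(buf)
}
From here, zerocopy's FromBytes derive on Superblock lets us reinterpret these bytes as the structure itself without any hand-rolled unsafe code.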
Logical Addresses
Btrfs uses logical addresses as indices into a dedicated Chunk Tree containing non-overlapping ranges of logical addresses mapped to physical offsets from the start of the filesystem device.
Since everything, including the Chunk Tree itself, is described by logical addresses, the superblock contains the sys_chunk_array field, holding (Key, Chunk) tuples, each followed by chunk.num_stripes >= 1 Stripes (any stripes past the first don't interest us for now; they are related to multi-device and raid setups). The most important fields here are key.offset, which is the start of the logical address range the chunk describes, and chunk.length, the length of that range.
Each stripe describes the physical offset into a device at which the physical range backing the logical range starts.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Stripe {
    pub devid: U64<LE>,
    pub offset: U64<LE>,
    pub dev_uuid: Uuid,
}

#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
pub struct Chunk {
    /// size of this chunk in bytes
    pub length: U64<LE>,
    /// objectid of the root referencing this chunk
    pub owner: U64<LE>,
    pub stripe_len: U64<LE>,
    pub ty: U64<LE>,
    /// optimal io alignment for this chunk
    pub io_align: U32<LE>,
    /// optimal io width for this chunk
    pub io_width: U32<LE>,
    /// minimal io size for this chunk
    pub sector_size: U32<LE>,
    /// 2^16 stripes is quite a lot, a second limit is the size of a single item in the btree
    pub num_stripes: U16<LE>,
    /// sub stripes only matter for raid10
    pub sub_stripes: U16<LE>,
    pub stripe: Stripe,
    // additional stripes go here
}
fn bootstrap_chunk_tree(superblock: &Superblock) -> Result<BTreeMap<ChunkTreeKey, u64>> {
    let array_size = superblock.sys_chunk_array_size.get() as usize;
    let mut offset: usize = 0;
    let key_size = size_of::<Key>();
    let mut chunk_tree = BTreeMap::new();
    let bytes = &superblock.sys_chunk_array;

    while offset < array_size {
        if offset + key_size > array_size {
            log::error!("short key read");
            return Err(Error::InvalidOffset);
        }
        let key = bytes.gread::<Key>(&mut offset)?;
        if key.ty() != ObjectType::ChunkItem {
            log::error!("key is not of type ChunkItem");
            return Err(Error::InvalidOffset);
        }
        let chunk = bytes.gread::<Chunk>(&mut offset)?;
        let num_stripes = chunk.num_stripes.get();
        if num_stripes == 0 {
            log::error!("num_stripes cannot be 0");
            return Err(Error::InvalidOffset);
        }
        if num_stripes != 1 {
            log::warn!(
                "warning: {} stripes detected but only processing 1",
                num_stripes
            );
        }
        let key_offset = key.offset.get();
        let chunk_length = chunk.length.get();
        match chunk_tree.entry(ChunkTreeKey {
            range: key_offset..(key_offset + chunk_length),
        }) {
            Entry::Vacant(entry) => {
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(_) => {
                log::error!("overlapping stripes!");
                return Err(Error::InvalidOffset);
            }
        };
        offset += (num_stripes - 1) as usize * size_of::<Stripe>();
        if offset > array_size {
            log::error!("short chunk item + stripes read");
            return Err(Error::InvalidOffset);
        }
    }
    Ok(chunk_tree)
}
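With this map, translating a logical address to a physical offset becomes a simple range lookup; a minimal sketch, assuming ChunkTreeKey wraps the logical range as a Range<u64> like the code above suggests:
fn offset_from_logical(chunk_tree: &BTreeMap<ChunkTreeKey, u64>, logical: u64) -> Option<u64> {
    // find the chunk whose logical range contains the address and rebase
    // the address onto that chunk's physical start
    chunk_tree.iter().find_map(|(key, physical_start)| {
        key.range
            .contains(&logical)
            .then(|| physical_start + (logical - key.range.start))
    })
}
A real implementation would exploit the BTreeMap ordering instead of scanning linearly, but with only a handful of chunks this hardly matters.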
After parsing the sys_chunk_array chunks, we can then access and parse the chunk tree itself before being able to parse the rest of the btrfs structures.
fn parse_chunk_tree(mut self) -> Result<Self> {
    log::debug!("parsing chunk tree");
    let this = Rc::new(self);
    let chunk_tree =
        Tree::from_logical_offset(this.clone(), this.superblock().chunk_root.get())?;
    let chunks = chunk_tree
        .iter()
        .filter_map(|(item, v)| {
            log::debug!("{:?}", item);
            match v {
                TreeItem::Chunk(chunk) => Some((item, chunk)),
                _ => None,
            }
        })
        .collect::<Vec<_>>();
    drop(chunk_tree);
    self = match Rc::try_unwrap(this) {
        Ok(v) => v,
        Err(_) => unreachable!(),
    };

    for (item, chunk) in chunks {
        let start = item.key.offset.get() as u64;
        let end = start + chunk.length.get();
        match self.chunk_cache.entry(ChunkTreeKey { range: start..end }) {
            Entry::Vacant(entry) => {
                log::info!("inserting chunk [{start}, {end})");
                entry.insert(chunk.stripe.offset.get());
            }
            Entry::Occupied(entry) => {
                log::warn!("overlapping stripes!");
                log::warn!(
                    "\t{:?} and {:?}",
                    entry.key(),
                    ChunkTreeKey { range: start..end }
                );
                log::warn!(
                    "\twith offsets: {} and {}",
                    entry.get(),
                    chunk.stripe.offset.get()
                );
                if *entry.get() != chunk.stripe.offset.get() {
                    log::error!("\tprobably an error?");
                }
            }
        }
    }
    Ok(self)
}
Trees
Keys and PartialKeys
Everything in btrfs is stored as records in B-trees, similar in layout to a B+ tree, with Key structures acting as the keys.
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, Eq, FromBytes, AsBytes)]
pub struct Key {
    pub id: U64<LE>,
    pub ty: u8,
    pub offset: U64<LE>,
}
A Key contains an id, a type and an offset. The meaning of the id and offset fields varies greatly depending on the tree the key indexes and on its ty, so offset might store a logical address (for a ChunkItem), a directory index (for a DirIndex item) or the hash of a filename (for a DirItem entry).
The ty field of a Key has only a handful of meaningful values:
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
pub enum ObjectType {
    INodeItem = 0x01,
    INodeRef = 0x0C,
    InodeExtref = 0x0D,
    XattrItem = 0x18,
    OrphanInode = 0x30,
    DirItem = 0x54,
    DirIndex = 0x60,
    ExtentData = 0x6C,
    ExtentCsum = 0x80,
    RootItem = 0x84,
    TypeRootBackref = 0x90,
    RootRef = 0x9C,
    ExtentItem = 0xA8,
    MetadataItem = 0xA9,
    TreeBlockRef = 0xB0,
    ExtentDataRef = 0xB2,
    ExtentRefV0 = 0xB4,
    SharedBlockRef = 0xB6,
    SharedDataRef = 0xB8,
    BlockGroupItem = 0xC0,
    FreeSpaceInfo = 0xC6,
    FreeSpaceExtent = 0xC7,
    FreeSpaceBitmap = 0xC8,
    DevExtent = 0xCC,
    DevItem = 0xD8,
    ChunkItem = 0xE4,
    TempItem = 0xF8,
    DevStats = 0xF9,
    SubvolUuid = 0xFB,
    SubvolRecUuid = 0xFC,
    // catch any other u8 value as an invalid object type to be able to convert
    // between any u8 and this type with num_enum
    #[num_enum(catch_all)]
    Invalid(u8),
}
The id field has a handful of specially defined values, as well as a range of values that may mean different things depending on the value of the ty field:
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, FromPrimitive, IntoPrimitive)]
#[repr(u64)]
pub enum KnownObjectId {
    RootTree = 1,
    ExtentTree,
    ChunkTree,
    DevTree,
    FsTree,
    RootTreeDir,
    CsumTree,
    QuotaTree,
    UuidTree,
    FreeSpaceTree,
    // RootINode = 0x100, // first free id, always the root inode of a fs
    // __LastFreeId = u64::MAX - 0x100, // last free id
    DataRelocTree = u64::MAX - 9,
    TreeReloc = u64::MAX - 8,
    TreeLog = u64::MAX - 7,
    Orphan = u64::MAX - 5,
    // catch all which contains the valid range of custom ids from 256 to
    // u64::MAX-256 plus some invalid custom ids, unfortunately num_enum does
    // not allow to define a range like that for a tuple variant
    #[num_enum(catch_all)]
    Custom(u64),
}
Keys are ordered by id, then ty and finally offset, making it very easy to find, for example, the range of all ExtentData items for a specific inode number by searching for the first and last keys matching (inode_number, ObjectType::ExtentData, _).
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
        match self.id().partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        match self.ty().partial_cmp(&other.ty()) {
            Some(core::cmp::Ordering::Equal) => {}
            ord => return ord,
        }
        self.offset.get().partial_cmp(&other.offset.get())
    }
}
To look for "incomplete" keys in a btree, we can define a "partial" key which can be compared with a full Key and which will give us a useful subtree containing the records we are interested in, as long as the partial key is filled from left to right (i.e. (_, ObjectType::DirIndex, 5) would not give a useful range, because between the 5th DirIndex entry of inode 256 and the 5th DirIndex entry of inode 257 may lie many other records, even other DirIndex records with different indices; in other words, there is no way to get a consecutive range over every 5th DirIndex of any inode).
pub struct PartialKey {
    id: KnownObjectId,
    ty: Option<ObjectType>,
    offset: Option<u64>,
}

impl PartialOrd<Key> for PartialKey {
    fn partial_cmp(&self, other: &Key) -> Option<core::cmp::Ordering> {
        match self.id.partial_cmp(&other.id()) {
            Some(core::cmp::Ordering::Equal) | None => {
                match self.ty.and_then(|ty| ty.partial_cmp(&other.ty())) {
                    Some(core::cmp::Ordering::Equal) | None => self
                        .offset
                        .and_then(|offset| offset.partial_cmp(&other.offset.get())),
                    ord => ord,
                }
            }
            ord => ord,
        }
    }
}
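As a quick example, a partial key matching every DirIndex record of inode 256 (the first free object id, which is always the root inode of a fs, per the comment in KnownObjectId above) can be built with the PartialKey::new constructor used later on:
// matches (256, DirIndex, *): every directory entry record of inode 256
let key = PartialKey::new(
    KnownObjectId::Custom(256),
    Some(ObjectType::DirIndex),
    None, // offset holds the directory index here, None matches any
);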
Tree Nodes
Trees in btrfs are made of internal nodes containing keys in KeyPtrs, and leaf nodes at the bottom containing Items describing the record values. Both internal and leaf nodes start with a common Header:
#[repr(C, packed)]
#[derive(Derivative, Clone, PartialEq, Eq, Copy, FromBytes, AsBytes)]
#[derivative(Debug)]
pub struct Header {
    #[derivative(Debug = "ignore")]
    pub csum: [u8; 32],
    pub fsid: Uuid,
    /// Which block this node is supposed to live in
    pub bytenr: U64<LE>,
    pub flags: U64<LE>,
    pub chunk_tree_uuid: Uuid,
    pub generation: U64<LE>,
    pub owner: U64<LE>,
    pub nritems: U32<LE>,
    pub level: u8,
}
The only interesting fields for us are level, the level (or depth) of this node, and nritems, the number of children (either further InternalNodes or, at level 0, LeafNodes) this node has.
Internal nodes themselves do not contain any values; instead, they contain KeyPtrs acting as bounds for the values found at the leaf level when descending that branch of the tree:
package InternalNode0 {
    object KeyPtr1
    object KeyPtr2
    object KeyPtr3
    object KeyPtr4

    KeyPtr1 -|> KeyPtr2 : "is greater?"
    KeyPtr2 -|> KeyPtr3 : "is greater?"
    KeyPtr3 -|> KeyPtr4 : "is greater?"

    package Children {
        package InternalNode1 {}
        package InternalNode2 {}
        package InternalNode3 {}
        package InternalNode4 {}
    }

    KeyPtr1 --|> Children.InternalNode1 : is equal?
    KeyPtr2 --|> Children.InternalNode2 : is equal?
    KeyPtr3 --|> Children.InternalNode3 : is equal?
    KeyPtr4 --|> Children.InternalNode4 : is equal or greater?
    KeyPtr2 --|> Children.InternalNode1 : is less?
    KeyPtr3 --|> Children.InternalNode2 : is less?
    KeyPtr4 --|> Children.InternalNode3 : is less?
}
In leaf nodes, the header is immediately followed by a list of nritems Items, which in turn point to records relative to the memory immediately following the header. The layout of these records depends on the ty field of the Item's key:
#[repr(C, packed)]
#[derive(Debug, Clone, Copy, FromBytes, AsBytes)]
/// A `BtrfsLeaf` is full of `BtrfsItem`s. `offset` and `size` (relative to start of data area)
/// tell us where to find the item in the leaf.
pub struct Item {
    pub key: Key,
    pub offset: U32<LE>,
    pub size: U32<LE>,
}
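Getting at a record is therefore just a matter of slicing the node's raw bytes. A small sketch, assuming bytes holds the whole node as read from disk (which is how the Node type further down keeps it):
/// Returns the raw record bytes described by `item`.
fn record_bytes<'a>(bytes: &'a [u8], item: &Item) -> &'a [u8] {
    // item offsets are relative to the end of the header, not the node start
    let data = &bytes[core::mem::size_of::<Header>()..];
    &data[item.offset.get() as usize..][..item.size.get() as usize]
}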
Searching a Tree
To look up a key in a tree, we iterate over the children of a node, and descend according to the diagram above:
pub fn find_key<K: PartialEq<Key> + PartialOrd<Key>>(self, key: &K) -> SearchResult {
    match &self.node.inner {
        BTreeNode::Internal(node) => {
            let idx = node
                .visit_children_keys()
                .find_map(|(i, child)| match key.partial_cmp(&child) {
                    Some(core::cmp::Ordering::Less) => {
                        Some(if i == 0 { 0 } else { i as u32 - 1 })
                    }
                    Some(core::cmp::Ordering::Equal) | None => Some(i as u32),
                    _ => None,
                })
                .unwrap_or(node.children.len() as u32 - 1);
            SearchResult::Edge(NodeHandle { idx, ..self })
        }
        BTreeNode::Leaf(node) => {
            for (i, child) in node.items.iter().enumerate() {
                if key.eq(&child.key) {
                    return SearchResult::Leaf(NodeHandle {
                        idx: i as u32,
                        ..self
                    });
                }
            }
            log::debug!("key definitely not found!");
            SearchResult::Edge(NodeHandle {
                idx: node.items.len() as u32 - 1,
                ..self
            })
        }
    }
}
This function operates on a NodeHandle and returns a NodeHandle (a pointer to a Node plus an index into that node pointing at a child or value) wrapped in a SearchResult: an Edge if the node is an internal node, or a Leaf if the key was found in a leaf node. An important step is hidden inside the visit_children_keys() function.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeHandle {
    parents: Vec<(BoxedNode, u32)>,
    node: BoxedNode,
    idx: u32,
}
The node handle also keeps track of the parent nodes and edges (via the index) to allow for walking back up the tree when iterating over ranges between two nodes.
#[derive(Derivative, Eq, Clone)]
#[derivative(Debug, PartialEq)]
pub struct Node {
    inner: BTreeNode,
    #[derivative(Debug = "ignore")]
    #[derivative(PartialEq = "ignore")]
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum BTreeNode {
    Internal(BTreeInternalNode),
    Leaf(BTreeLeafNode),
}

/// An internal node in a btrfs tree, containing `KeyPtr`s to other internal nodes or leaf nodes.
#[derive(Derivative, Clone)]
#[derivative(Debug)]
pub struct BTreeInternalNode {
    pub header: Header,
    #[derivative(Debug = "ignore")]
    children: Vec<RefCell<NodePtr>>,
}

/// A leaf node in a btrfs tree, containing different items
#[derive(Debug, Clone)]
pub struct BTreeLeafNode {
    pub header: Header,
    /// actual leaf data
    pub items: Vec<Item>,
}
The node type stores either an internal or a leaf node, as well as a copy of the bytes of the node on disk, from which the leaf records can be read.
Iterating over a range
To iterate over a range between 2 nodes, we need to keep track of the start and end nodes, but only yield leaves.
#[derive(Debug)]
pub struct Range<'tree, R: super::Read> {
    volume: Rc<Volume<R>>,
    pub(crate) start: RootOrEdge,
    pub(crate) end: RootOrEdge,
    phantom: PhantomData<&'tree ()>,
}
We keep track of a reference lifetime to the tree to take advantage of the borrow checker, ensuring we don't violate the aliasing rules, since we are working with raw NonNull pointers to nodes.
RootOrEdge is, perhaps not optimally named, either the first node the range is defined with (since this is the only time we want to yield the current node, and not the next), or the last yielded leaf node.
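A plausible shape for it, matching that description:
// a guess at the shape of RootOrEdge, based on the description above
pub(crate) enum RootOrEdge {
    /// the handle the range was created with; yields itself before stepping
    Root(NodeHandle),
    /// the last leaf position that was already yielded
    Edge(NodeHandle),
}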
To step to the next node in sequential (depth-first) order:
/// returns Ok(next) or Err(self) if self is already the last node
pub fn into_next<R: super::Read>(
    self,
    volume: &super::volume::Volume<R>,
) -> core::result::Result<Self, Self> {
    let Self {
        mut parents,
        node,
        idx,
    } = self;
    let header = node.inner.header();
    if idx + 1 >= header.nritems.get() {
        // go up
        match parents.pop() {
            Some((node, idx)) => Self {
                idx: (idx + 1).min(node.inner.header().nritems.get()),
                node,
                parents,
            }
            .into_next(volume),
            None => Err(Self { node, idx, parents }),
        }
    } else {
        match &node.inner {
            BTreeNode::Internal(internal) => {
                let node = match internal.visit_child(idx as usize, volume) {
                    Ok(child) => {
                        parents.push((node, idx));
                        Ok(Self {
                            parents,
                            idx: 0,
                            node: child,
                        })
                    }
                    Err(_) => {
                        log::error!("failed to read child node!");
                        Err(Self { parents, node, idx })
                    }
                };
                node
            }
            BTreeNode::Leaf(_) => Ok(Self {
                idx: idx + 1,
                parents,
                node,
            }),
        }
    }
}
visit_child() lazily reads the next node when it is needed.
impl<'tree, R> Iterator for Range<'tree, R>
where
    R: super::Read,
{
    type Item = (Item, TreeItem);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.start, |start| {
                    match start.into_next_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });
                if self.start.node.inner.is_leaf() {
                    break Some(self.start.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
At the cost of duplicating some code, we can iterate forwards and backwards through the range:
impl<'tree, R> DoubleEndedIterator for Range<'tree, R>
where
    R: super::Read,
{
    fn next_back(&mut self) -> Option<Self::Item> {
        loop {
            if !self.is_empty() {
                replace_with::replace_with_or_abort(&mut self.end, |end| {
                    match end.into_next_back_node(&self.volume) {
                        Ok(next) => next,
                        Err(next) => next,
                    }
                });
                if self.end.node.inner.is_leaf() {
                    break Some(self.end.as_handle().parse_item().expect("range item"));
                }
                // else repeat
            } else {
                break None;
            }
        }
    }
}
Reading the file system
The Root Tree
The root field of the Superblock points at the root node of the Root Tree, which contains all other important trees, like the Extent Tree and the File System Trees, as well as records pointing at the root directories of subvolumes, including the default subvolume, whose inode id can be found in the root_dir_objectid field of the Superblock:
let root_tree =
    Tree::from_logical_offset(self.inner.clone(), self.inner.superblock().root.get())?;

// we are looking for the root tree directory (?)
// this is a DIR_ITEM entry in the root tree, with the name "default",
// and the crc32 of "default" as its offset
let key = Key::new(
    KnownObjectId::Custom(self.inner.superblock().root_dir_objectid.get()),
    ObjectType::DirItem,
    0x8dbfc2d2, // crc of "default"
);
let subvol_root = match root_tree.entry(&key)? {
    super::tree::entry::Entry::Occupied(entry) => Some(entry.value()?),
    super::tree::entry::Entry::Vacant(_) => None,
}
.ok_or(Error::NoDefaultSubvolRoot)?;
Filesystem trees contain many different kinds of records, namely DirIndex, DirItem, InodeItem, INodeRef and ExtentData, which are all important, but not all of which are meaningful by themselves.
The location this DirItem entry points at is the key at which we will find the file system tree of the default subvolume:
let subvol_id = subvol_root
    .as_dir_item()
    .expect("dir item")
    .first()
    .expect("dir item entry")
    .item()
    .location
    .id();
let (root_item, fs_root) = self
    .roots
    .get(&subvol_id)
    .ok_or(Error::NoDefaultSubvolFsRoot)?
    .clone();
Using this tree we can now very easily iterate over all files and directories in the subvolume and print their filenames by looking at DirIndex entries, of which there is exactly one per file, each containing a pointer to the INodeItem of just that file, and its filename:
for (_id, entry) in fs.fs_root.iter() {
    if let Some(dir) = entry.as_dir_index() {
        log::info!("{}", dir.name_as_string_lossy());
    }
}
Just having a list of filenames without paths or file contents is not very helpful, but as this is the first time in the process of writing the code that we are confronted with some tangible quality we typically connect with file systems, this was still a big milestone for me.
More meaningful is finding the root directory of the subvolume, and listing its immediate children.
Using the root_dirid field of the root_item we got from the root tree, with our subvolume id as the inode number, we can look up the DirIndex entries of the directory:
pub fn get_inode_children(
    &self,
    inode: &INode,
) -> Result<impl Iterator<Item = DirItemEntry> + '_> {
    let key = PartialKey::new(inode.id(), Some(ObjectType::DirIndex), None);
    let children = self.fs_root.find_range(&key)?;
    let a = children.map(|(_, v)| v.try_into_dir_index().expect("dir index"));
    Ok(a)
}
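Put together, listing the children of the subvolume's root directory might look like this (assuming DirItemEntry exposes the same name_as_string_lossy helper seen above):
// print the immediate children of the root directory
let root = fs.get_root_dir();
for entry in fs.get_inode_children(&root)? {
    log::info!("{}", entry.name_as_string_lossy());
}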
By looking at the file names of the DirIndex entries, we can also look up files by absolute paths, or by paths relative to some known inode:
pub fn get_inode_by_relative_path<P>(&self, inode: INode, path: P) -> Result<INode>
where
    P: Path,
{
    if path.is_absolute() {
        self.get_inode_by_path(path)
    } else {
        self.get_inode_by_relative_normalized_path(inode, path.normalize())
    }
}

pub fn get_inode_by_path<P>(&self, path: P) -> Result<INode>
where
    P: Path,
{
    let mut normalized = path.normalize();
    if !path.is_absolute() {
        log::error!("path is not absolute!");
    } else {
        // pop root
        _ = normalized.pop_segment();
    }
    self.get_inode_by_relative_normalized_path(self.get_root_dir(), normalized)
}

pub fn get_inode_by_relative_normalized_path(
    &self,
    inode: INode,
    path: NormalizedPath,
) -> Result<INode> {
    let mut inode = inode;
    for segment in path.iter() {
        match segment {
            crate::path::Segment::ParentDir => {
                inode = self.get_inode_parent(&inode)?;
            }
            crate::path::Segment::File(child_name) => {
                let child = self
                    .get_inode_children_inodes(&inode)?
                    .find(|child| {
                        child.path.last().map(|bytes| bytes.as_slice()) == Some(child_name)
                    })
                    .ok_or(Error::INodeNotFound)?
                    .clone();
                // silly borrow checker
                inode = child;
            }
            _ => unreachable!(),
        }
    }
    Ok(inode)
}
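With these helpers in place, resolving a file is a single call (assuming the custom Path trait is implemented for plain strings):
// resolve a file relative to the subvolume root
let inode = fs.get_inode_by_path("/some/nested/dirs/hello.txt")?;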
To be able to actually read a file, we need to get a list of its ExtentData records:
pub fn get_inode_extents(&self, inode_id: u64) -> Result<Vec<(u64, ExtentData)>> {
    if let Some(dir_entry) = self.get_inode_dir_index(inode_id)? {
        if dir_entry.item().ty() == DirItemType::RegFile {
            let key = PartialKey::new(inode_id.into(), Some(ObjectType::ExtentData), None);
            let extents = self.fs_root.find_range(&key)?;
            let extents = extents
                .map(|(key, item)| {
                    (
                        key.key.offset.get(),
                        item.as_extent_data().expect("extent data").clone(),
                    )
                })
                .collect::<Vec<_>>();
            Ok(extents)
        } else {
            Ok(vec![])
        }
    } else {
        Err(Error::INodeNotFound)
    }
}
ExtentData records come in two flavors: Inline, where the file contents follow immediately after the record on disk, or Regular, where the record is instead immediately followed by more data pointing at the file contents.
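As a rough sketch (the real definitions carry more fields, and accessors like address(), offset() and size() used below read the Regular payload):
// a rough sketch of the two ExtentData flavors used below
pub enum ExtentData {
    /// file contents stored inline in the tree, right after the header
    Inline {
        header: ExtentDataHeader,
        data: Vec<u8>,
    },
    /// file contents stored elsewhere on disk, located by address/offset/size
    Other(ExtentDataRegular),
}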
To read some view into an inode, we need to check each extent for an overlap with our view's bounds:
pub fn read_inode_raw<I: RangeBounds<u64>>(&self, inode: &INode, range: I) -> Result<Vec<u8>> {
    let mut contents = Vec::new();
    let extents = self.get_inode_extents(inode.id)?;

    let start = match range.start_bound() {
        core::ops::Bound::Included(v) => *v,
        core::ops::Bound::Excluded(v) => *v + 1,
        core::ops::Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        core::ops::Bound::Included(v) => Some(*v + 1),
        core::ops::Bound::Excluded(v) => Some(*v),
        core::ops::Bound::Unbounded => None,
    };

    log::info!("extents: {}", extents.len());
    log::info!("{:?}", extents);

    for (offset, extent) in extents.into_iter().filter(|(offset, extent)| {
        // bounds of the current extent
        let extent_start = *offset;
        let extent_end = extent_start + extent.len();
        let extent_len = extent.len();
        // entire range we want to read from the file
        let range_len = end.map(|end| end - start);
        // start of the UNION (from lowest bound to highest bound) of the
        // current extent and the entire range
        let start = start.min(extent_start);
        // end of the UNION of the current extent and the entire range
        let end = end.map(|end| end.max(extent_end));
        // width of the union of the current extent and the entire range
        let len = end.map(|end| (end - start));
        if let (Some(len), Some(range_len)) = (len, range_len) {
            // proceed if the sum of the widths of the 2 ranges (the range
            // we want to read, and the current extent) is greater than the
            // width of the union range:
            //
            // In the first example, the 2 ranges overlap, and the width of
            // the union is smaller than the sum of the widths of the ranges:
            //
            // |------range-1------|
            //              |---range-2----|
            // |-----width-of-union-----|
            // |-------sum----|-of---widths-------|
            //
            // In this second example, the ranges do not overlap, and the
            // width of the union is equal to or greater than the sum of
            // the widths:
            //
            // |------range-1------|
            //                            |---range-2----|
            // |------------width-of-union------------|
            len < extent_len + range_len
        } else {
            start < extent_end
        }
    }) {
        let start = start.saturating_sub(offset);
        let end = end.map(|end| end - offset).unwrap_or(start + extent.len());
        let len = end - start;
        log::info!("reading {}..{:?} from extent.", start, end);
        let data: alloc::borrow::Cow<[u8]> = match &extent {
            ExtentData::Inline { data, .. } => (&data[start as usize..end as usize]).into(),
            ExtentData::Other(extent) => {
                let address = extent.address() + extent.offset() + start;
                let address = self
                    .volume
                    .inner
                    .offset_from_logical(address)
                    .ok_or(Error::BadLogicalAddress)?;
                let range = match extent.extent_data1().compression() {
                    // compressed extents must be read in their entirety
                    CompressionType::Zlib | CompressionType::Lzo | CompressionType::ZStd => {
                        address..address + extent.size()
                    }
                    _ => address + start..address + start + len,
                };
                let data = self.volume.inner.read_range(range).expect("bytes");
                data.into()
            }
        };
        // truncate inflated data if needed
        contents.extend_from_slice(&data[..len as usize]);
    }
    Ok(contents)
}
Btrfs also supports compression at individual extent granularity with the Zlib, Zstd and Lzo algorithms. To be able to read a compressed file, we first need to decompress the entire extent and truncate afterwards:
log::info!("reading {} bytes from file", data.len());
log::info!("compression: {:?}", extent.header().compression());
match extent.header().compression() {
CompressionType::None => {
contents.extend_from_slice(&data);
}
CompressionType::Zlib => {
let mut state = miniz_oxide::inflate::stream::InflateState::new(
miniz_oxide::DataFormat::Zlib,
);
let mut output_data = vec![0u8; extent.header().decoded_size() as usize];
let mut output = &mut output_data[..];
let mut data = &data[..];
loop {
let result = miniz_oxide::inflate::stream::inflate(
&mut state,
&data,
&mut output,
miniz_oxide::MZFlush::None,
);
match result.status.map_err(|_| Error::DecompressionError)? {
miniz_oxide::MZStatus::Ok => {}
miniz_oxide::MZStatus::StreamEnd => break,
_ => {
log::error!("need dict ?!");
return Err(Error::DecompressionError);
}
}
data = &data[result.bytes_consumed..];
output = &mut output[result.bytes_written..];
}
_ = miniz_oxide::inflate::stream::inflate(
&mut state,
&data,
&mut output,
miniz_oxide::MZFlush::Finish,
)
.status
.map_err(|_| Error::DecompressionError)?;
// truncate inflated data if needed
contents
.extend_from_slice(&output_data[start as usize..(start + len) as usize]);
}
CompressionType::Lzo => {
todo!()
}
CompressionType::ZStd => {
let mut output_data = vec![0u8; extent.header().decoded_size() as usize];
let mut zstd = zstd_safe::DCtx::create();
zstd.init().map_err(|e| {
log::error!("zstd init error: {}", zstd_safe::get_error_name(e));
Error::DecompressionError
})?;
let mut input = zstd_safe::InBuffer::around(&data);
let mut output = zstd_safe::OutBuffer::around(&mut output_data[..]);
loop {
match zstd.decompress_stream(&mut output, &mut input) {
Ok(len) => {
if len == 0 {
break;
}
}
Err(e) => {
log::error!(
"zstd decompress stream error: {}",
zstd_safe::get_error_name(e)
);
return Err(Error::DecompressionError);
}
}
if output.pos() == extent.header().decoded_size() as usize {
break;
}
}
contents
.extend_from_slice(&output_data[start as usize..(start + len) as usize]);
}
c => {
log::error!("invalid compression type {:?}", c);
contents.extend_from_slice(&data);
}
}
Design Flaws
Everything, of course; this is just a toy meant to help me understand how btrfs works and is laid out :^). But specifically:
Copying huge chunks of memory for each tree node
By copying the entire chunk past the address of a tree node and caching it, instead of first reading the header, then reading all the KeyPtr or Item children and then reading records directly from file/disk, we waste huge amounts of RAM; not to mention that caching an entire chunk like that probably makes exceedingly little sense if we ever want to support write access.
One possible solution might be a kind of GC-like structure that keeps track of views into chunks.
Looking up files by comparing filenames instead of filename hashes
Btrfs stores DirItemEntries both one-to-one in DirIndex entries, as well as in DirItems keyed by the filename hash, which allows us to quickly look for the file name in an ordered range of filename hashes and then only compare the actual file names if there are any hash collisions. Doing that instead of iterating through all DirIndex entries should be significantly faster.
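A sketch of what that could look like, with name_hash() as a hypothetical helper implementing btrfs's crc32c-based name hash, and try_into_dir_item()/name() as assumed accessors mirroring the ones used earlier:
pub fn find_child_by_name(&self, inode: &INode, name: &[u8]) -> Result<Option<DirItemEntry>> {
    // DirItem keys store the hash of the file name in their offset field
    let key = PartialKey::new(inode.id(), Some(ObjectType::DirItem), Some(name_hash(name)));
    let candidates = self.fs_root.find_range(&key)?;
    // several names can collide on the same hash, so the actual names still
    // have to be compared, but only within this tiny candidate set
    Ok(candidates
        .filter_map(|(_, v)| v.try_into_dir_item().ok())
        .flatten()
        .find(|entry| entry.name() == name))
}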