Build a ZNS SSD using QEMU

Install QEMU

QEMU has supported NVMe device emulation since version 1.6; however, emulated zoned namespaces (ZNS) are only supported starting from QEMU 6.0.
Download QEMU: https://www.qemu.org/download/#source
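
A typical build from source might look like the following (a minimal sketch; version 7.0.0 matches the launch scripts below, and the configure flags shown are a common minimal choice, not the only ones):

wget https://download.qemu.org/qemu-7.0.0.tar.xz
tar xJf qemu-7.0.0.tar.xz
cd qemu-7.0.0
./configure --target-list=x86_64-softmmu --enable-kvm
make -j$(nproc)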

Download the VM image (mirror)

The user account and guest OS of the VM:

  • username: femu
  • password: femu
  • Guest OS: Ubuntu 20.04.1 server, with kernel 5.4
mkdir -p ~/images
wget http://people.cs.uchicago.edu/~huaicheng/femu/femu-vm.tar.xz
tar xJvf femu-vm.tar.xz

After completing these steps, you will have two files in the current folder: "u20s.qcow2" and "u20s.md5sum".
The integrity of the VM image can be verified with the following commands:

md5sum u20s.qcow2 > tmp.md5sum
diff tmp.md5sum u20s.md5sum

If diff reports that the two files differ, the VM image is corrupted and the steps above need to be redone.
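
The image can also be sanity-checked with qemu-img, which ships with QEMU:

qemu-img info u20s.qcow2    # should report "file format: qcow2" and the virtual disk size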

Create a shared directory between host and guest

The following QEMU startup parameters need to be added:

-fsdev local,security_model=passthrough,id=fsdev0,path=/tmp/share \
-device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare

Here, path is the shared directory on the host.
Mount the host shared directory in the guest:

mkdir /tmp/host_files
sudo mount -t 9p -o trans=virtio,version=9p2000.L hostshare /tmp/host_files
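
To mount the share automatically at boot, a line like the following can be added to the guest's /etc/fstab (a sketch; the mount tag and mount point must match those used above):

hostshare  /tmp/host_files  9p  trans=virtio,version=9p2000.L  0  0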

Reference https://blog.csdn.net/gatieme/article/details/82912921

Switch to the Linux 5.10 kernel

1. Download the linux-5.10 source in advance and place it on the host
2. Transfer the 5.10 kernel tarball from the host into the guest through the shared directory
3. Extract it

cd /usr/src/
sudo tar -zxvpf linux-5.10.tar.gz

4. Compile the kernel (this can also be done on the host side)

cd linux-5.10/
sudo apt-get install ncurses-dev
sudo apt-get install flex
sudo apt-get install bison -y

Configure kernel options

sudo make menuconfig

For details, please refer to https://zonedstorage.io/docs/linux/config
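
At a minimum, zoned block device support must be enabled in the kernel. A sketch of the relevant options to verify in .config (based on the zonedstorage.io page above):

# Core zoned block device support
CONFIG_BLK_DEV_ZONED=y
# NVMe driver, which exposes the emulated ZNS namespace
CONFIG_BLK_DEV_NVME=y
# F2FS, used later in this guide
CONFIG_F2FS_FS=y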

sudo apt-get install libssl-dev
sudo make -j32

5. Install the kernel and modules

sudo make modules_install
sudo make install
sudo poweroff
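
After restarting the VM, confirm that the guest is running the new kernel:

uname -r    # should print 5.10.x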

Create an emulated zoned namespace

To create an emulated zoned namespace, you must first have a backstore file for the namespace to use. The size of this file determines the capacity of the namespace seen by the guest OS running in the QEMU virtual machine.
For example, to create a 32 GiB zoned namespace, you must first create a 32 GiB file on the host. This can be done by creating a sparse file with the truncate command, or a fully allocated file with the dd command.
Create the zoned namespace backstore with truncate

# truncate -s 32G /var/lib/qemu/images/zns.raw

# ls -l /var/lib/qemu/images/zns.raw
-rw-r--r-- 1 root root 34359738368 Jun 21 15:13 /var/lib/qemu/images/zns.raw

Create the zoned namespace backstore with dd

# dd if=/dev/zero of=/var/lib/qemu/images/zns.raw bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 11.4072 s, 3.0 GB/s

# ls -l /var/lib/qemu/images/zns.raw
-rw-r--r-- 1 root root 34359738368 Jun 22 11:22 /var/lib/qemu/images/zns.raw

Create the ZNS device using the backstore file

Run QEMU with command-line options and arguments that create a zoned namespace backed by the file.
QEMU's NVMe device emulation and ZNS namespace emulation provide several configuration options to control the device characteristics. The options and parameters related to zoned namespaces are listed below.

Option (default) and description:

  • zoned.zasl=UINT32 (default 0): Zone Append Size Limit (ZASL). If left at the default (0), the zone append size limit equals the Maximum Data Transfer Size (MDTS). Otherwise, the limit is 2 to the power of zasl multiplied by the minimum memory page size (4096 B), and it cannot exceed the MDTS. For example, zasl=5 (as used in the launch script below) gives 2^5 × 4096 B = 128 KiB.
  • zoned.zone_size=SIZE (default 128 MiB): define the zone size (ZSZE).
  • zoned.zone_capacity=SIZE (default 0): define the zone capacity (ZCAP). If left at the default (0), the zone capacity equals the zone size.
  • zoned.descr_ext_size=UINT32 (default 0): set the Zone Descriptor Extension Size (ZDES). Must be a multiple of 64 bytes.
  • zoned.cross_read=BOOL (default off): set to "on" to allow reads to cross zone boundaries.
  • zoned.max_active=UINT32 (default 0): set the Maximum Active Resources (MAR). The default (0) allows all zones to be active.
  • zoned.max_open=UINT32 (default 0): set the Maximum Open Resources (MOR). The default (0) allows all zones to be open. If zoned.max_active is specified, this value must be less than or equal to zoned.max_active.

In the following example, the backstore file is used to emulate a zoned namespace with a zone size of 64 MiB and a zone capacity of 62 MiB. The namespace block size is 4096 B, and the namespace allows at most 16 open zones and 32 active zones.

#!/bin/bash
# Huaicheng Li <[email protected]>
# Run FEMU as a black-box SSD (FTL managed by the device)

# image directory
IMGDIR=/home/qwj/ZNS/images
# Virtual machine disk image
OSIMGF=$IMGDIR/u20s.qcow2
ZNSIMGF=$IMGDIR/zns.raw
if [[ ! -e "$OSIMGF" ]]; then
        echo ""
        echo "VM disk image couldn't be found ..."
        echo "Please prepare a usable VM image and place it as $OSIMGF"
        echo "Once VM disk image is ready, please rerun this script again"
        echo ""
        exit
fi

sudo /home/qwj/ZNS/qemu-7.0.0/bin/debug/native/x86_64-softmmu/qemu-system-x86_64 \
    -name "FEMU-BBSSD-VM" \
    -enable-kvm \
    -cpu host \
    -smp 4 \
    -m 32G \
    -fsdev local,security_model=passthrough,id=fsdev0,path=/home/qwj/ZNS/share \
    -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,drive=hd0 \
    -drive file=$OSIMGF,if=none,aio=native,cache=none,format=qcow2,id=hd0 \
    -device nvme,id=nvme0,serial=deadbeef,zoned.zasl=5 \
    -drive file=$ZNSIMGF,id=nvmezns0,format=raw,if=none \
    -device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096,zoned=true,zoned.zone_size=64M,zoned.zone_capacity=62M,zoned.max_open=16,zoned.max_active=32,uuid=5e40ec5f-eeb6-4317-bc5e-c919796a5f79 \
    -net user,hostfwd=tcp::8081-:22 \
    -net nic,model=virtio \
    -nographic \
    -qmp unix:./qmp-sock,server,nowait 2>&1 | tee log

Generally, you do not use the VM directly through this console, so leave it running and open another window to log in over ssh.

usernamexxx@hostnamexxx:~$ ssh -p8081 femu@localhost
# Password for femu: femu

Verify simulated partition namespace

# sudo nvme list
Node          SN        Model           Namespace  Usage                Format       FW Rev
------------  --------  --------------  ---------  -------------------  -----------  ------
/dev/nvme0n1  deadbeef  QEMU NVMe Ctrl  1          34.36 GB / 34.36 GB  4 KiB + 0 B  1.
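
Beyond nvme list, the zone model can be checked from sysfs, and the zones inspected with blkzone (assuming util-linux is installed in the guest):

# Should print "host-managed" for the emulated ZNS namespace
cat /sys/block/nvme0n1/queue/zoned
# List the zones and their write pointers
sudo blkzone report /dev/nvme0n1 | head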

Build rocksdb and zenfs on the ZNS SSD

https://github.com/westerndigitalcorporation/zenfs
1. Download, build, and install libzbd. See the libzbd README for instructions.
2. Download the rocksdb and zenfs projects:

$ git clone https://github.com/facebook/rocksdb.git
$ cd rocksdb
$ git clone https://github.com/westerndigitalcorporation/zenfs plugin/zenfs

3. Build and install rocksdb with zenfs enabled:
Before building, install libgflags: sudo apt-get install libgflags-dev.
https://github.com/westerndigitalcorporation/zenfs/issues/3
https://www.jianshu.com/p/4a79b95bca9a
https://blog.csdn.net/qq_41108558/article/details/118513202
To enable the debug log, build with: ROCKSDB_PLUGINS=zenfs make -j4 db_bench dbg
https://github.com/westerndigitalcorporation/zenfs/issues/240
According to that issue, in the latest version the log destination can be changed by editing the #define DEFAULT_ZENV_LOG_PATH "/tmp/" in the zenfs source.

$ DEBUG_LEVEL=0 ROCKSDB_PLUGINS=zenfs make -j4 db_bench install

4. Build the zenfs utility:

$ pushd plugin/zenfs/util
$ make
$ popd

5. Configure the IO scheduler for the zoned block device
The IO scheduler must be set to deadline to prevent writes from being reordered. This must be done every time the zoned namespace is enumerated (e.g. at boot).

echo deadline > /sys/class/block/<zoned block device>/queue/scheduler
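
To avoid repeating this on every boot, a udev rule can set the scheduler automatically. A sketch (the rule file path is an assumption; note that on blk-mq kernels the scheduler is named mq-deadline):

# /etc/udev/rules.d/60-zoned-scheduler.rules (assumed path)
ACTION=="add|change", KERNEL=="nvme*n*", ATTR{queue/zoned}=="host-managed", ATTR{queue/scheduler}="mq-deadline"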

6. Create a ZenFS file system

cd ./rocksdb/plugin/zenfs/
mkdir log
sudo ./util/zenfs mkfs --zbd=nvme0n1 --aux-path=./log --force
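
To confirm the file system was created, its UUID can be listed with the ls-uuid subcommand (also used in step 7 below):

sudo ./util/zenfs ls-uuid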

7. Testing with db_bench
To instruct db_bench to use zenfs on a specific zoned block device, use the --fs_uri parameter. The device can be specified by name with --fs_uri=zenfs://dev:<device name>, or by the unique identifier of the created file system with --fs_uri=zenfs://uuid:<UUID>. UUIDs can be listed using ./plugin/zenfs/util/zenfs ls-uuid

sudo ./db_bench --fs_uri=zenfs://dev:nvme0n1 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction

Build and mount F2FS on the ZNS SSD (two devices)

1. Create a conventional SSD

Because f2fs uses a metadata block format with fixed block locations on disk, only zoned block devices that include conventional zones are supported. Zoned devices consisting entirely of sequential zones cannot be used as standalone devices with f2fs; they require a multi-device setup that places the metadata blocks on randomly writable storage. f2fs supports multi-device setups in which the address spaces of multiple block devices are linearly concatenated to form a logically larger block device.
F2FS zoned block device support is implemented using the following principles:
  • Section alignment: in f2fs, a section is a set of fixed-size (2 MB) segments. The zone size of the zoned device is matched by setting the number of segments per section. For example, with a zone size of 256 MB, a section contains 128 segments of 2 MB each.
  • Forced LFS mode: by default, f2fs optimizes block allocation by allowing some random writes within segments (to avoid excessive append writes). LFS mode forces sequential writes to segments and sequential use of segments within sections, which fully complies with the write constraints of zoned block devices.
  • Zone reset as a discard operation: a block discard (or trim) indicates to the device that a block or group of blocks is no longer in use. When all blocks of all segments in a section are free, a zone write pointer reset command is issued, which allows the section to be reused.
Compared to solutions using the dm-zoned device mapper target, f2fs performance on zoned devices does not suffer from zone reclaim overhead, because writes are always sequential and no on-disk temporary buffers are needed. f2fs garbage collection (segment cleaning) only incurs overhead for workloads that frequently delete files or modify file data.
Zone Capacity Support
On an NVMe ZNS SSD, the capacity of each zone can be smaller than the zone size. To support ZNS devices, f2fs ensures that block allocation and accounting only consider the blocks of a zone that fall within the zone capacity. This support for NVMe ZNS zone capacity has been available since its introduction in Linux kernel 5.10.
An f2fs volume requires some randomly writable storage space to store and update its metadata blocks. Since an NVMe zoned namespace has no conventional zones, an f2fs volume cannot be self-contained in a single NVMe zoned namespace. To format an f2fs volume on an NVMe zoned namespace, a multi-device volume format must be used, providing an additional regular block device to store the volume metadata blocks. This additional regular block device can be a conventional namespace on the same NVMe device, or a conventional namespace on another NVMe device.
f2fs uses 32-bit block numbers and a block size of 4 KB, which results in a maximum volume size of 16 TB. Any device or combination of devices (for multi-device volumes) with a total capacity greater than 16 TB cannot be used with f2fs.
To overcome this limitation, the dm-linear device mapper target can be used to divide a zoned block device into smaller, usable logical devices. Such a configuration must ensure that each logical device created is allocated a sufficient number of conventional zones to store the f2fs fixed-location metadata blocks.
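
A sketch of such a split with dmsetup (the device name and sizes are hypothetical; start and length are given in 512-byte sectors and must be zone-aligned):

# Map the first 8 TiB of the zoned device to a new logical device "small-zns"
# 8 TiB = 17179869184 sectors of 512 B
echo "0 17179869184 linear /dev/nvme1n1 0" | sudo dmsetup create small-zns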
Add a conventional NVMe SSD to the QEMU launch command:

#!/bin/bash
# Huaicheng Li <[email protected]>
# Run FEMU as a black-box SSD (FTL managed by the device)

# image directory
IMGDIR=/home/qwj/ZNS/images
# Virtual machine disk image
OSIMGF=$IMGDIR/u20s.qcow2
ZNSIMGF=$IMGDIR/zns.raw
CONVIMGF=$IMGDIR/convSSD.raw
if [[ ! -e "$OSIMGF" ]]; then
        echo ""
        echo "VM disk image couldn't be found ..."
        echo "Please prepare a usable VM image and place it as $OSIMGF"
        echo "Once VM disk image is ready, please rerun this script again"
        echo ""
        exit
fi

sudo /home/qwj/ZNS/qemu-7.0.0/bin/debug/native/x86_64-softmmu/qemu-system-x86_64 \
    -name "FEMU-BBSSD-VM" \
    -enable-kvm \
    -cpu host \
    -smp 4 \
    -m 32G \
    -fsdev local,security_model=passthrough,id=fsdev0,path=/home/qwj/ZNS/share \
    -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,drive=hd0 \
    -drive file=$OSIMGF,if=none,aio=native,cache=none,format=qcow2,id=hd0 \
    -device nvme,id=nvme0,serial=deadbeef,zoned.zasl=5 \
    -drive file=$ZNSIMGF,id=nvmezns0,format=raw,if=none \
    -device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096,zoned=true,zoned.zone_size=64M,zoned.zone_capacity=62M,zoned.max_open=16,zoned. max_active=32,uuid=5e40ec5f-eeb6-4317-bc5e-c919796a5f79 \
    -drive file=$CONVIMGF,id=nvmeconv0,format=raw,if=none \
    -device nvme-ns,drive=nvmeconv0,logical_block_size=4096,physical_block_size=4096 \
    -net user,hostfwd=tcp::8081-:22 \
    -net nic,model=virtio \
    -nographic \
    -qmp unix:./qmp-sock,server,nowait 2>&1 | tee log

2. Install f2fs-tools-1.14.0

Older versions of f2fs-tools may not support multi-device F2FS volumes, so download and install version 1.14.0 or later.
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/about/
https://f2fs.wiki.kernel.org/

3. Mount F2FS

Unlike for SMR hard disks, the kernel does not select the mq-deadline block I/O scheduler by default for block devices representing NVMe zoned namespaces. To ensure that the regular write operations used by f2fs reach the device in order, the I/O scheduler of the NVMe zoned namespace block device must be set to mq-deadline. This is done with the following command:

echo mq-deadline > /sys/block/nvme1n1/queue/scheduler

In the above command, /dev/nvme1n1 is the block device file of the zoned namespace that will be used for the f2fs volume. With this namespace, a multi-device f2fs volume using an additional regular block device (/dev/nvme0n1 in the example below) can be formatted with the -c option of mkfs.f2fs, as shown below:

cd ~/f2fs-tools-1.14.0/mkfs
sudo ./mkfs.f2fs -f -m -c /dev/nvme1n1 /dev/nvme0n1

        F2FS-tools: mkfs.f2fs Ver: 1.14.0 (2021-06-23)

Info: Disable heap-based policy
Info: Debug level = 0
Info: Trim is enabled
Info: Host-managed zoned block device:
      2048 zones, 0 randomly writeable zones
      524288 blocks per zone
Info: Segments per section = 1024
Info: Sections per zone = 1
Info: sector size = 4096
Info: total sectors = 1107296256 (4325376 MB)
Info: zone aligned segment0 blkaddr: 524288
Info: format version with
  "Linux version 5.13.0-rc6 + (user1@brahmaputra) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #2 SMP Fri Jun 18 16:45 :29 IST 2021"
Info: [/dev/nvme0n1] Discarding device
Info: This device doesn't support BLKSECDISCARD
Info: This device doesn't support BLKDISCARD
Info: [/dev/nvme1n1] Discarding device
Info: Discarded 4194304 MB
Info: Overprovision ratio = 3.090%
Info: Overprovision segments = 74918 (GC reserved = 40216)
Info: format successful

To mount a volume formatted with the above command, the regular block device must be specified:

# mount -t f2fs /dev/nvme0n1 /mnt/f2fs/
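
A quick check that the multi-device volume is mounted with the expected type and combined capacity:

df -hT /mnt/f2fs    # the Type column should show f2fs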

Possible problems

0. It is best to give the VM more than 8 GB of memory (I set it to 32 GB); otherwise compiling rocksdb may crash.
libgflags needs to be installed in advance: sudo apt-get install libgflags-dev
1. The server cannot git clone
After cloning with git on a Linux PC, copy the repository to the server via a USB drive. You may then need:
sudo chmod -R +x rocksdb/
sudo git config --global --add safe.directory /home/femu/rocksdb/plugin/zenfs
2. libzbd needs to be installed in advance
https://github.com/westerndigitalcorporation/libzbd
Install autoconf, autoconf-archive, automake, libtool, and m4 first.
3. Running make on zenfs may report an error if libgflags is not installed (sudo apt-get install libgflags-dev).
https://github.com/westerndigitalcorporation/zenfs/issues/3
Modify this line in the Makefile to link against libgflags:
$(TARGET): $(TARGET).cc
	$(CXX) $(CXXFLAGS) -g -o $(TARGET) $< $(LIBS) -lgflags $(LDFLAGS)
Set the scheduling algorithm: echo deadline > /sys/class/block/nvme0n1/queue/scheduler
Format the file system: sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nvme0n1 --aux-path=./plugin/zenfs/log