
Seaweedfs Erasure-coding

Introduction

https://github.com/seaweedfs/seaweedfs/wiki/Erasure-coding-for-warm-storage

SeaweedFS implements RS(10,4): data is encoded into 10 data shards plus 4 parity shards, so any 4 of the 14 shards can be lost (for example, 4 failed hard disks) and the data can still be accessed normally.

Compared with keeping 5 full copies of the data to achieve the same robustness, this saves roughly 3.6x the disk space (5 / 1.4 ≈ 3.6).
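A back-of-the-envelope check of that claim (plain Python, not SeaweedFS code):

# RS(10,4) stores 14 shards for every 10 units of data, while surviving
# any 4 lost nodes with plain replication would require 5 full copies.
ec_overhead = (10 + 4) / 10                # 1.4x the raw data
replication_overhead = 5                   # 1 original + 4 extra copies
print(replication_overhead / ec_overhead)  # ~3.57, i.e. roughly 3.6x less disk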

Service startup

Start the master server, 4 volume servers, and a filer file management server

mkdir disk1 disk2 disk3 disk4
./weed master -mdir=$PWD/master -volumeSizeLimitMB=64
./weed volume -port=8081 -dir=$PWD/disk1 -max=100 -mserver="localhost:9333"
./weed volume -port=8082 -dir=$PWD/disk2 -max=100 -mserver="localhost:9333"
./weed volume -port=8083 -dir=$PWD/disk3 -max=100 -mserver="localhost:9333"
./weed volume -port=8084 -dir=$PWD/disk4 -max=100 -mserver="localhost:9333"
./weed filer -master=localhost:9333

Open the master server in a browser: http://192.168.1.24:9333

You can see some basic information:

  • Volume Size Limit: each volume is 64MB
  • Free: 400 free volumes remain
  • Max: at most 400 volumes
  • In other words, there is currently 64MB * 400 = 25GB of object storage capacity

Open the filer server in a browser: http://192.168.1.24:8888

Drag and drop a 395MB video (2023-09-07-20-32-34.mp4) onto the page to upload it.


Return to the master web page to see the following changes in the volumes:


  • Free dropped to 393 volumes, i.e. the master allocated 7 volumes to the filer in one batch, 7 * 64MB = 448MB in total
  • The master allocates 7 volumes at a time by default; once they are used up, it allocates another 7.
  • The 7 volumes were placed on the volume servers at ports 8081, 8082, and 8084. The placement is decided by the master on its own.
    • 8081 has 3 volumes, with volume IDs 2, 4, 7
    • 8082 has 3 volumes, with volume IDs 1, 3, 5
    • 8083 has no volumes
    • 8084 has 1 volume, with volume ID 6
    • Each volume is 64MB
  • If any of those three volume servers goes offline at this point, or any of the disks disk1, disk2, or disk4 fails, part of the data becomes inaccessible.

Therefore, erasure coding needs to be configured in Seaweedfs to improve availability.

Before configuring erasure coding, we need to understand what a data block (chunk) is, because erasure coding encodes and decodes in units of data blocks.

Data chunks

The master allocated 7 volumes with volume IDs 1~7, each 64MB, 448MB in total.

The uploaded file is 395MB, so it must have filled about 6 of those volumes, with only the last one left partially full. We can check how the filer laid out the file with the following command:

# This command will export all file information in the filer directory and download the results to the fs.json file in the current directory.
curl -H "Accept: application/json" http://192.168.1.24:8888/?pretty=y -o fs.json

This is the content of the fs.json file:

 "Path": "",
  "Entries": [
    {<!-- -->
      "FullPath": "/2023-09-07-20-32-34.mp4",
      "Mtime": "2023-10-26T00:26:30 + 08:00",
       ...
      "chunks": [
        {<!-- -->
          "file_id": "5,06ed294c0f",
          "size": 4194304,
          "modified_ts_ns": 1698251187647289116,
          "e_tag": "VVvWUbQ + 4E6gSL6yWq/Jjw==",
          "fid": {<!-- -->
            "volume_id": 5,
            "file_key": 6,
            "cookie": 3978906639
          }
        },
    ...
        {<!-- -->
          "file_id": "4,5eac18c4e1",
          "offset": 394264576,
          "size": 729504,
          "modified_ts_ns": 1698251190710407585,
          "e_tag": "uVx8HBfAD4i/FwZUvQnzOw==",
          "fid": {<!-- -->
            "volume_id": 4,
            "file_key": 94,
            "cookie": 2887304417
          }
        }
    ...

Here chunks are the data blocks; each block is 4MB ("size": 4194304). The blocks are split automatically by the filer. You can run ./weed filer -h to view the filer's default split size:

-maxMB int
    split files larger than the limit (default 4)

Since the content of fs.json is quite long, we can use a Python script to count how many data blocks are used.

import json
f = open('./fs.json')
fs = json.load(f)

chunks = fs['Entries'][0]['chunks']
print("chunks num:", len(chunks))
total_size = 0
for chunk in chunks:
    total_size += chunk['size']
print("total size:", total_size)

# output:
chunks num: 95
total size: 394994080

So the file uses 95 data blocks of 4MB each. Since the last block is not full, the total is slightly below 95 * 4 * 1024 * 1024 = 398,458,880 bytes; the actual total is 394,994,080 bytes (about 395MB).
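A quick cross-check of that arithmetic (a standalone Python snippet using the sizes reported above):

import math

size = 394994080                 # total bytes reported by the script above
chunk = 4 * 1024 * 1024          # the filer's default 4MB chunk size
print(math.ceil(size / chunk))   # 95 chunks, the last one partially filled
print(95 * chunk)                # 398458880 -> total if every chunk were full
print(95 * chunk - size)         # ~3.3 MiB of the last chunk left unused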

Splitting large files into chunks has many benefits:

  • It makes erasure encoding, decoding, and data verification easier
  • It reduces how much data has to be loaded at once, lowering memory and compute requirements
  • Grouping makes retrieval easier and more efficient
  • It increases concurrency, so large amounts of data can be processed in parallel

Erasure-coding

Erasure coding (EC) is a data protection method that divides data into fragments, encodes additional redundant parity blocks, and stores the pieces in different locations, such as different disks, storage nodes, or geographic locations.

Seaweedfs erasure coding architecture

Official introduction:

SeaweedFS implemented 10.4 Reed-Solomon Erasure Coding (EC). The large volumes are split into chunks of 1GB, and every 10 data chunks are also encoded into 4 parity chunks. So a 30 GB data volume will be encoded into 14 EC shards, each shard is of size 3 GB and has 3 EC blocks.

For smaller volumes less than 10GB, and for edge cases, the volume is split into smaller 1MB chunks.

EC Blocking Strategy

Just as the filer splits files into chunks, the EC module splits each volume into data blocks (chunks):

  • If the volume is larger than 10GB, EC splits it into 1GB data blocks.
  • If the volume is smaller than 10GB, it is split into 1MB data blocks.
  • The difference from the filer is that the filer chunks individual files, while EC chunks whole volumes.

RS(10,4) erasure coding strategy

  • Assume a volume is 30GB: EC splits it into 30 data blocks of 1GB each
  • Every 10 data blocks form a group that is erasure coded once, producing 4 parity blocks, i.e. 14 blocks per RS(10,4) group
    • Within one RS(10,4) group, any 4 lost blocks can be recovered completely
    • It does not matter whether the lost blocks are data blocks or parity blocks
  • The 30GB volume is therefore encoded as 3 RS(10,4) groups, 42 blocks in total
  • In other words, a 30GB volume becomes 14 EC shards; each shard holds either 3 data blocks or 3 parity blocks (3GB), 42GB in total (the arithmetic is sketched below)
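A minimal sketch of that arithmetic in Python (not SeaweedFS code, just the layout math for a hypothetical 30GB volume):

import math

DATA_SHARDS, PARITY_SHARDS = 10, 4              # RS(10,4)
volume_gb = 30
block_gb = 1                                    # large volumes are cut into 1GB blocks

data_blocks = volume_gb // block_gb             # 30 blocks of 1GB
groups = math.ceil(data_blocks / DATA_SHARDS)   # 3 RS(10,4) groups
parity_blocks = groups * PARITY_SHARDS          # 3 * 4 = 12 parity blocks
total_blocks = data_blocks + parity_blocks      # 42 blocks overall

shard_count = DATA_SHARDS + PARITY_SHARDS       # 14 shards, one block per group each
shard_size_gb = groups * block_gb               # 3GB per shard
total_gb = shard_count * shard_size_gb          # 42GB stored for 30GB of data

print(groups, parity_blocks, total_blocks)      # 3 12 42
print(shard_count, shard_size_gb, total_gb)     # 14 3 42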

https://note.youdao.com/s/BZbp5Iy9

The author's description above assumes a volume size of 30GB. To make testing easier, I set the volume size to 64MB. Next, let's enable EC and see what happens to the data.

Enable Erasure-coding

To enable EC, run ./weed scaffold -config=master >> master.toml to generate a master.toml file in the current directory. In that file:

[master.maintenance]
# periodically run these scripts are the same as running them from 'weed shell'
scripts = """
  lock
  ec.encode -fullPercent=50 -quietFor=1h
  ec.rebuild-force
  ec.balance-force
  volume.deleteEmpty -quietFor=24h -force
  volume.balance-force
  volume.fix.replication
  s3.clean.uploads-timeAgo=24h
  unlock
"""
sleep_minutes = 17 # sleep minutes between each script execution

scripts is the maintenance script that the master runs periodically. Since it only runs at fixed intervals, we enter the weed shell and trigger the commands manually to make the effect easier to observe:

# Enter the management background
./weed shell
# View help
help
lock
# Enable EC: encode volumes that are more than 50% full and have had no updates for 2s (-quietFor also accepts values such as 1h0m0s)
ec.encode -fullPercent=50 -quietFor=2s
ec.rebuild-force
# Evenly distribute EC fragments
ec.balance-force
volume.deleteEmpty -quietFor=24h -force
volume.balance-force
volume.fix.replication
s3.clean.uploads-timeAgo=24h
unlock


  • The Free volume count changed from the original 393 to 390
  • Volume IDs 1~7 used to exist as normal volumes; they are now all gone.
  • The EC erasure coding shard count (ErasureCodingShards) went from 0 to 98 in total:
    • 8081 was assigned 23 EC shards
    • 8082 was assigned 26 EC shards
    • 8083 was assigned 26 EC shards
    • 8084 was assigned 23 EC shards
  • RS(10,4) means that, within each volume's group of 14 shards, any 4 shards can be lost and the data can still be accessed normally.
  • Each of the 7 encoded volumes tolerates losing 4 of its 14 shards, so at most 7 * 4 = 28 of the 98 shards can be lost (no more than 4 per volume). In practice that means one failed disk, or one volume server going offline, can be tolerated (see the sketch after this list).
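A rough sketch of why one server can go offline but two cannot, assuming ec.balance spreads each volume's shards evenly (illustrative only, not SeaweedFS placement code):

DATA, PARITY = 10, 4
shards_per_volume = DATA + PARITY            # 14 shards per encoded volume

# With 4 servers, each server holds roughly 3 or 4 of a volume's 14 shards.
per_server = [4, 4, 3, 3]                    # one possible split of the 14 shards

def recoverable(lost_servers):
    lost = sum(per_server[s] for s in lost_servers)
    return shards_per_volume - lost >= DATA  # at least 10 shards must survive

print(recoverable([0]))     # True  -> losing one server costs at most 4 shards
print(recoverable([0, 1]))  # False -> losing two servers costs 7 or 8 shards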

Take any one volume server offline as a test:

The data can still be accessed normally.

However, with two volume servers offline, the data becomes inaccessible.

Edge cases

From the example above, fullPercent=50 means that volumes which are more than 50% full get encoded.

ec.encode -fullPercent=50 -quietFor=2m

My volume size is 64MB and the file is 395MB. Although the data is spread across the 7 allocated volumes, some volumes inevitably receive more data and some less.

Before EC encoding, my distribution of each volume was as follows:

  • 8081:
    • volume id 2 68M
    • volume id 4 56M
    • volume id 7 44M
  • 8082:
    • volume id 1 48M
    • volume id 3 40M
    • volume id 5 72M
  • 8083:
    • (no volumes)
  • 8084
    • volume id 6 48M

As you can see, the file was split across the volumes in different proportions. If the percentage passed to ec.encode -fullPercent were not 50 but 95 or 80, EC would not encode all of the volumes; any volume below the threshold stays un-encoded (see the sketch below).

If the server holding such an un-encoded volume then happens to go offline, that part of the data becomes inaccessible.
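A small illustration of how -fullPercent selects volumes, using the per-volume sizes listed above and the 64MB volume size limit (plain Python, thresholds only):

volume_mb = {1: 48, 2: 68, 3: 40, 4: 56, 5: 72, 6: 48, 7: 44}  # sizes observed above
limit_mb = 64                                                   # -volumeSizeLimitMB

def encoded(full_percent):
    threshold = limit_mb * full_percent / 100
    return sorted(v for v, size in volume_mb.items() if size >= threshold)

print(encoded(50))   # [1, 2, 3, 4, 5, 6, 7] -> every volume exceeds 32MB
print(encoded(95))   # [2, 5]                -> only the volumes above 60.8MB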

If you want every file to be recoverable, you can run ec.encode -collection t -quietFor 1s -fullPercent=0.001 to encode all volumes that have not been updated within the last second.

However, this is not the optimal strategy:

  • Every encode and decode consumes CPU.
  • It generates a lot of extra disk I/O.
  • Reading large numbers of files that must be reconstructed from EC shards is slower for users.

Hot and cold separation

To avoid encoding and verifying all data, the author’s storage strategy borrows some ideas from “Facebook’s Warm BLOB Storage System”:

Hot data is accessed and updated frequently, so replication improves its access speed.

Cold data is accessed rarely, so erasure coding saves storage space while still keeping the data reliable.

Use ec.encode for cold data. The default fullPercent is 95 and -quietFor is 1h0m0s, i.e. volumes that are more than 95% full and have had no updates for an hour are EC encoded.

For other volumes, use data backup.

With encoding and replication both enabled, once a volume’s data turns “cold”, the encoding runs on the source copy only; after it succeeds, the other replicas are deleted automatically.

If the volume is replicated, only one copy will be erasure encoded. All the original copies will be purged after a successful erasure encoding.

Hot data backup

Shut down all the servers, delete the test data, and enable the replication policy:

rm -rf disk1/* disk2/* disk3/* disk4/* master filerldb2
# Start the master server, -defaultReplication enables backup
./weed master -mdir=$PWD/master -volumeSizeLimitMB=64 -defaultReplication=001
# Set the volume server data center to dc1, 8081 and 8082 racks to rack1, 8083 and 8084 to rack2
./weed volume -port=8081 -dir=$PWD/disk1 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack1
./weed volume -port=8082 -dir=$PWD/disk2 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack1
./weed volume -port=8083 -dir=$PWD/disk3 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack2
./weed volume -port=8084 -dir=$PWD/disk4 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack2
./weed filer -master=localhost:9333

For configuration details of defaultReplication, see the author’s Replication documentation.

I’m using the 001 replication policy: one extra copy on another server in the same rack.

In real deployments a rack usually contains many physical servers, each with multiple disks. Since I am testing on a single machine, I use different ports to distinguish the volume servers. The meaning of the three digits in the replication setting is sketched below.
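A small helper (an illustrative sketch based on the Replication documentation, not part of weed) that decodes the three digits of -defaultReplication:

def explain_replication(code: str) -> str:
    dc, rack, server = (int(c) for c in code)
    copies = 1 + dc + rack + server
    return (f"{copies} copies in total: {dc} on other data centers, "
            f"{rack} on other racks in the same data center, "
            f"{server} on other servers in the same rack")

print(explain_replication("001"))
# 2 copies in total: 0 on other data centers, 0 on other racks in the same
# data center, 1 on other servers in the same rack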

The master topology now shows a dc1 data center with two racks (rack1 and rack2), and two volume servers on each rack.

# View the initial size of each disk
du -b disk*
4132    disk1
4132    disk2
4132    disk3
4132    disk4

Upload the 395MB video again.

Checking the disks again, total usage is about 790MB, exactly twice the size of the file.

du -b disk*
130759792 disk1 # 8081 rack1
130759792 disk2 # 8082 rack1
264052148 disk3 # 8083 rack2
264250668 disk4 # 8084 rack2

The master split the roughly 395MB file across the racks: about 130MB landed on rack1 and about 264MB on rack2. The 001 policy then requires one copy within the same rack, so disk2 holds a replica of disk1's data and disk4 a replica of disk3's (a quick check of the numbers follows below).

In this situation, if any single node goes offline, reads and writes still work normally because the replica exists.
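A quick check of the du numbers above (plain Python; disk1 and disk3 together hold one full copy of the file's data, and the four disks together hold two copies):

disk1, disk2, disk3, disk4 = 130759792, 130759792, 264052148, 264250668
print((disk1 + disk3) / 1e6)                    # ~394.8 MB, one full copy of the file
print((disk1 + disk2 + disk3 + disk4) / 1e6)    # ~789.8 MB, two copies in total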

Cold data storage

We assume that the data just now is cold data and perform EC encoding on it.

./weed shell
lock
# Enable EC: encode cold volumes that are more than 95% full and have had no updates in the last 2 seconds
ec.encode -fullPercent=95 -quietFor=2s
ec.rebuild-force
ec.balance-force
volume.deleteEmpty -quietFor=2s -force
volume.balance-force
volume.fix.replication
s3.clean.uploads-timeAgo=24h
unlock

Post-encoding architecture analysis

Before encoding (master topology and disk contents):

Although the volume size limit is set to 64MB on the master, a full volume is not capped strictly at 64MB and may overshoot it by roughly 20MB.

After encoding (master topology, disk contents, and the EC shards stored on disk1):

The total disk data is 641M, which is 150M less than the 790M before encoding.

du -b disk*
160444532    disk1
160444532    disk2
144709880    disk3
176167160    disk4

Analysis

For architectural principles, please refer to the previous chapter: Seaweedfs erasure coding architecture

  • Volumes 1, 2, 4, and 6 exceeded 64MB * 95%, so each of them was EC encoded: the volume is cut into 1MB blocks and the blocks are striped across 10 original data shards
  • Those 10 data shards are EC encoded into 4 parity shards, so each volume produces 14 shards, and the 4 volumes produce 56 shards in total (the 56 ErasureCodingShards)
  • The 56 erasure-coded shards are spread evenly across the volume servers, and the original volumes and their replicas are deleted at the same time.
  • Example:
    • Volume 1 held about 76MB before encoding, rounded up to 80MB
    • It is cut into 1MB blocks striped across 10 data shards of 8MB each
    • RS(10,4) encoding over those 10 data shards produces 4 parity shards, 14 shards in total (ErasureCodingShards), each 8MB
    • Volume 6 held about 64MB, rounded up to 70MB; RS(10,4) encoding produces 14 shards of 7MB each

Note: although I set the volume size to 64MB on the master, the volume server does not cut volumes off strictly at 64MB; a volume can overshoot 64MB by roughly 20MB. The shard-size arithmetic is sketched below.
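A minimal sketch of the shard-size rounding, under the assumption (consistent with the numbers above) that a small volume is cut into 1MB blocks striped across 10 data shards, so each shard is ceil(volume size / 10) MB:

import math

def shard_mb(volume_mb, data_shards=10, block_mb=1):
    blocks = math.ceil(volume_mb / block_mb)            # number of 1MB blocks
    return math.ceil(blocks / data_shards) * block_mb   # whole blocks per shard

print(shard_mb(76))  # 8 -> volume 1 (~76MB) yields 14 shards of 8MB each
print(shard_mb(64))  # 7 -> volume 6 (~64MB) yields 14 shards of 7MB each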


Taking volume server 8081 (disk1) as an example, it stores erasure-coded shards from the original volumes 1, 2, 4, and 6. In the same way, disk2, disk3, and disk4 store the remaining shards of the various volumes.

Volume      disk1         disk2         disk3        disk4         Shard size
volume 1    [1 5 9 13]    [2 6 10]      [3 7 11]     [0 4 8 12]    8M
volume 2    [3 7 11]      [0 2 6 10]    [4 8 12]     [1 5 9 13]    8M
volume 3    -             -             48M          48M           not encoded
volume 4    [1 3 7 11]    [2 6 10]      [5 9 13]     [0 4 8 12]    7M
volume 5    48.1M         48.1M         -            -             not encoded
volume 6    [2 6 10]      [1 5 9 13]    [3 7 11]     [0 4 8 12]    7M

Per-disk totals:
  disk1: 7*8 + 7*7 + 48.1 ≈ 154M
  disk2: 7*8 + 7*7 + 48.1 ≈ 154M
  disk3: 6*8 + 6*7 + 48   = 138M
  disk4: 8*8 + 8*7 + 48   = 168M

View the storage distribution in the shell, which is consistent with the calculation:

du -h disk*
154M disk1
154M disk2
139M disk3
169M disk4

Offline hard drive test

According to Reed-Solomon erasure code RS(10,4): N = K + M

  • N is the number of EC shards (14), K is the number of original data shards (10), and M is the number of losses tolerated (4)
  • That is, as long as at least 10 of the 14 shards survive, the data can be recovered.
  • My 395MB video used 7 volumes in total: 4 volumes were EC encoded, and the remaining 2 volumes were not encoded but each has 2 replicas.
    • There are 56 EC shards in total; by the ratio 4:14 = x:56, up to x = 16 shard losses can be tolerated (at most 4 per volume)
    • The un-encoded volumes only have 2 replicas within a single rack, so the two volume servers of the same rack must not go offline at the same time (a sketch that replays these cases follows).
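A sketch that replays the offline tests using the layout above. The per-disk shard counts are taken from the table; an encoded volume needs at least 10 surviving shards, and an un-encoded volume needs at least one surviving replica (the replica placement is inferred from the per-disk totals):

shards = {   # number of shards each disk holds for each encoded volume
    1: {"disk1": 4, "disk2": 3, "disk3": 3, "disk4": 4},
    2: {"disk1": 3, "disk2": 4, "disk3": 3, "disk4": 4},
    4: {"disk1": 4, "disk2": 3, "disk3": 3, "disk4": 4},
    6: {"disk1": 3, "disk2": 4, "disk3": 3, "disk4": 4},
}
replicas = {3: {"disk3", "disk4"}, 5: {"disk1", "disk2"}}  # un-encoded volumes

def accessible(offline):
    ec_ok = all(sum(n for d, n in per_disk.items() if d not in offline) >= 10
                for per_disk in shards.values())
    rep_ok = all(copies - offline for copies in replicas.values())
    return ec_ok and rep_ok

print(accessible({"disk1"}))           # True  -> any single disk offline is fine
print(accessible({"disk1", "disk3"}))  # False -> one disk in each rack
print(accessible({"disk1", "disk2"}))  # False -> both rack1 disks
print(accessible({"disk3", "disk4"}))  # False -> both rack2 disks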
Taking any single disk offline

Take 8081 offline on its own: access is still normal.

Take 8082, 8083, or 8084 offline individually: access is also normal.

Taking two disks offline

Take one rack1 server and one rack2 server offline: the data becomes inaccessible.

Take both rack1 servers offline: inaccessible.

Take both rack2 servers offline: likewise inaccessible, since each encoded volume loses 7 of its 14 shards and the only copies of the un-encoded volume 3 live on rack2.