Seaweedfs Erasure-coding
Introduction
https://github.com/seaweedfs/seaweedfs/wiki/Erasure-coding-for-warm-storage
SeaweedFS implements RS(10,4) erasure coding: every 10 data shards are encoded into 4 additional parity shards spread across different disks, so any 4 of the 14 shards can be lost and the data can still be read normally.
Compared with keeping 5 full replicas to reach similar robustness (5x storage overhead), RS(10,4) needs only 1.4x, saving 3.6x the disk space.
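The savings figure is simple arithmetic; a minimal sketch, assuming "robustness" here means surviving 4 lost disks:

```python
# Storage overhead comparison: RS(10,4) erasure coding vs. 5x replication.
data_shards, parity_shards = 10, 4

# EC stores 14 shards for every 10 shards' worth of original data.
ec_overhead = (data_shards + parity_shards) / data_shards  # 1.4x

# Keeping 5 full copies also survives 4 lost disks, at 5x overhead.
replication_overhead = 5.0

print(ec_overhead)                         # 1.4
print(replication_overhead - ec_overhead)  # 3.6 -> the "saves 3.6x" figure
```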
Service startup
Start the master server, 4 volume servers, and a filer file management server:

```shell
mkdir disk1 disk2 disk3 disk4
./weed master -mdir=$PWD/master -volumeSizeLimitMB=64
./weed volume -port=8081 -dir=$PWD/disk1 -max=100 -mserver="localhost:9333"
./weed volume -port=8082 -dir=$PWD/disk2 -max=100 -mserver="localhost:9333"
./weed volume -port=8083 -dir=$PWD/disk3 -max=100 -mserver="localhost:9333"
./weed volume -port=8084 -dir=$PWD/disk4 -max=100 -mserver="localhost:9333"
./weed filer -master=localhost:9333
```
Access the master server in a browser: http://192.168.1.24:9333
You can see some basic information:
- Volume Size Limit: each volume is 64MB
- Free: 400 free volumes remaining
- Max: at most 400 volumes
- That is, the cluster currently offers 64MB * 400 = 25GB of object storage
Access the filer server in a browser: http://192.168.1.24:8888
Drag and drop a 395M video (2023-09-07-20-32-34.mp4) onto the page to upload it.
Return to the master web page and observe the changes in the volumes:
- Free dropped to 393 volumes, i.e. the master allocated 7 volumes to the filer at once, for a total of 7 * 64MB = 448M
- The master allocates 7 volumes at a time by default; once they are used up, it allocates another 7
- The 7 volumes were placed on the volume servers at ports 8081, 8082, and 8084; placement is managed independently by the master
  - 8081 holds 3 volumes, with volume IDs 2, 4, 7
  - 8082 holds 3 volumes, with volume IDs 1, 3, 5
  - 8083 holds no volume
  - 8084 holds 1 volume, with volume ID 6
- Each volume is 64M
- If any of these three volume servers goes offline, or any of the disks disk1, disk2, or disk4 fails, the data becomes inaccessible.
Therefore, erasure coding needs to be configured in Seaweedfs to improve availability.
Before configuring erasure coding, we need to understand what a data block (chunk) is, because erasure coding encodes and decodes in units of data blocks.
Data chunks
The master allocated 7 volumes with volume IDs 1-7, each 64M, totaling 448M.
The file just uploaded is 395M, so it must have filled 6 volumes completely, with only the last one partially used. We can inspect how the filer laid the file out with:
```shell
# Export all file metadata under the filer directory into fs.json in the current directory
curl -H "Accept: application/json" "http://192.168.1.24:8888/?pretty=y" -o fs.json
```
This is the content of the fs.json file:
"Path": "", "Entries": [ {<!-- --> "FullPath": "/2023-09-07-20-32-34.mp4", "Mtime": "2023-10-26T00:26:30 + 08:00", ... "chunks": [ {<!-- --> "file_id": "5,06ed294c0f", "size": 4194304, "modified_ts_ns": 1698251187647289116, "e_tag": "VVvWUbQ + 4E6gSL6yWq/Jjw==", "fid": {<!-- --> "volume_id": 5, "file_key": 6, "cookie": 3978906639 } }, ... {<!-- --> "file_id": "4,5eac18c4e1", "offset": 394264576, "size": 729504, "modified_ts_ns": 1698251190710407585, "e_tag": "uVx8HBfAD4i/FwZUvQnzOw==", "fid": {<!-- --> "volume_id": 4, "file_key": 94, "cookie": 2887304417 } } ...
Each entry in chunks is one data block, and each block is 4M ("size": 4194304). Blocks are split automatically by the filer; you can check the filer's default block size with ./weed filer -h:

```
-maxMB int        split files larger than the limit (default 4)
```
Since fs.json is long, we can use a short Python script to count how many data blocks are used:

```python
import json

with open('./fs.json') as f:
    fs = json.load(f)

chunks = fs['Entries'][0]['chunks']
print("chunks num:", len(chunks))

total_size = 0
for chunk in chunks:
    total_size += chunk['size']
print("total size:", total_size)

# output:
# chunks num: 95
# total size: 394994080
```
So the file uses 95 data blocks of up to 4M each: 94 full blocks plus a partial last block, for a total of 394,994,080 bytes (~395M), slightly less than 95 * 4 * 1024 * 1024 = 398,458,880.
Splitting large files into chunks has many benefits:
- It is the basis for erasure encoding, decoding, and data verification
- It reduces how much data must be loaded at once, lowering memory and compute requirements
- Grouping makes retrieval easier and more efficient
- It increases concurrency, so large amounts of data can be processed in parallel
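The chunking described above can be sketched as fixed-size splitting; this is an illustration assuming the filer's default 4MB `-maxMB` limit, not SeaweedFS's actual implementation:

```python
# Minimal sketch of fixed-size chunking with the filer's default 4MB limit.
CHUNK_SIZE = 4 * 1024 * 1024

def split_into_chunks(total_size: int, chunk_size: int = CHUNK_SIZE):
    """Return (offset, size) pairs covering a file of total_size bytes."""
    chunks = []
    offset = 0
    while offset < total_size:
        size = min(chunk_size, total_size - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

# The 395M upload from above: 95 chunks, the last one partial.
chunks = split_into_chunks(394994080)
print(len(chunks))    # 95
print(chunks[-1])     # (394264576, 729504)
```

Note that the last pair matches the final chunk seen in fs.json ("offset": 394264576, "size": 729504).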
Erasure-coding
Erasure coding (EC) is a data protection method that divides data into fragments, encodes additional redundant parity fragments, and stores them in different locations, such as different disks, storage nodes, or even geographic sites.
Seaweedfs erasure coding architecture
Official introduction:
SeaweedFS implemented 10.4 Reed-Solomon Erasure Coding (EC). The large volumes are split into chunks of 1GB, and every 10 data chunks are also encoded into 4 parity chunks. So a 30 GB data volume will be encoded into 14 EC shards, each shard is of size 3 GB and has 3 EC blocks.
For smaller volumes less than 10GB, and for edge cases, the volume is split into smaller 1MB chunks.
EC Blocking Strategy
Just as the filer splits files into chunks, the EC module splits volumes into data blocks (chunks):
- If the volume is larger than 10G, EC splits it into 1GB data blocks.
- If the volume is smaller than 10G, it is split into 1M data blocks.
- The difference from the filer: the filer chunks individual files, while EC chunks whole volumes.
RS(10,4) erasure coding strategy
- Suppose a volume is 30G; EC splits it into 30 1G data blocks
- Every 10 of these 1G blocks form a group; one round of erasure coding extends each group with 4 parity blocks, giving 14 blocks per group of RS(10,4) data
- Within one RS(10,4) group, any 4 blocks can be lost and the data can still be fully recovered
- It does not matter whether data blocks or parity blocks are lost
- The 30G volume is thus encoded into 3 RS(10,4) groups, 42 blocks in total
- Equivalently, a 30G volume becomes 14 EC shards; each 3G shard holds 3 data blocks or 3 parity blocks, so the shards total 42G
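The shard-size arithmetic above can be sketched as follows, assuming a volume is cut into fixed-size blocks, every 10 blocks gain 4 parity blocks, and shard i collects block i of each group (a simplification of the layout, not SeaweedFS source):

```python
def ec_layout(volume_bytes: int, block_bytes: int):
    """Return (groups, shard_count, shard_bytes) for RS(10,4) over a volume."""
    data, parity = 10, 4
    groups = -(-volume_bytes // (data * block_bytes))  # ceiling division
    shard_bytes = groups * block_bytes                 # one block per group per shard
    return groups, data + parity, shard_bytes

GB = 1024 ** 3
MB = 1024 ** 2

# 30GB volume with 1GB blocks: 3 groups -> 14 shards of 3GB each (42GB total)
print(ec_layout(30 * GB, GB))   # (3, 14, 3221225472)

# A small 64MB volume with 1MB blocks rounds up to 7MB shards
print(ec_layout(64 * MB, MB))   # (7, 14, 7340032)
```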
The official description above assumes 30G volumes; for convenience of testing, I set the volume size to 64M. Next, let's enable EC and see what happens to the data.
Enable Erasure-coding
To enable EC, run ./weed scaffold -config=master >> master.toml to generate a master.toml file in the current path. In that file:

```toml
[master.maintenance]
# periodically run these scripts are the same as running them from 'weed shell'
scripts = """
  lock
  ec.encode -fullPercent=50 -quietFor=1h
  ec.rebuild -force
  ec.balance -force
  volume.deleteEmpty -quietFor=24h -force
  volume.balance -force
  volume.fix.replication
  s3.clean.uploads -timeAgo=24h
  unlock
"""
sleep_minutes = 17          # sleep minutes between each script execution
```
scripts is the maintenance script list. Since it runs on a timer, to trigger it immediately we enter the weed shell and run the commands manually:

```shell
# Enter the management shell
./weed shell
# View help
help
lock
# Enable EC for volumes more than 50% full with no updates in the last 2s (-quietFor accepts durations like 1h0m0s)
ec.encode -fullPercent=50 -quietFor=2s
ec.rebuild -force
# Spread the EC shards evenly
ec.balance -force
volume.deleteEmpty -quietFor=24h -force
volume.balance -force
volume.fix.replication
s3.clean.uploads -timeAgo=24h
unlock
```
Back on the master page:
- Free changed from the original 393 to 390 volumes
- Volume IDs 1-7 are all gone
- EC erasure coding shards (ErasureCodingShards) went from 0 to 98 in total:
  - 8081 holds 23 EC shards
  - 8082 holds 26 EC shards
  - 8083 holds 26 EC shards
  - 8084 holds 23 EC shards
- RS(10,4) means each volume's 14 shards tolerate the loss of any 4 while the data remains accessible
- Across the current cluster that is 7 volumes * 4 = 28 losable shards out of 98, but at most 4 per volume; since ec.balance spreads each volume's 14 shards evenly, any one server holds at most 4 shards of a given volume, so one disk failure or one volume server going offline can be tolerated
Take any one volume server offline as a test: the data can still be accessed normally.
However, if two volume servers go offline, the data becomes inaccessible.
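Why one server can go offline but not two can be sketched with a small check, assuming each volume's 14 shards are spread as evenly as possible over 4 servers (so each server holds 3 or 4 shards of any given volume):

```python
def recoverable(shards_per_server, servers_down):
    """RS(10,4): a volume survives if at least 10 of its 14 shards remain.

    Worst case: the downed servers are the ones holding the most shards.
    """
    lost = sum(sorted(shards_per_server, reverse=True)[:servers_down])
    return 14 - lost >= 10

spread = [4, 4, 3, 3]          # even spread of one volume's 14 shards
print(recoverable(spread, 1))  # True  -> one server offline is fine
print(recoverable(spread, 2))  # False -> two servers offline lose up to 8 shards
```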
Edge cases
From the example above, fullPercent defines which volumes get encoded: here, volumes more than 50% full.

```shell
ec.encode -fullPercent=50 -quietFor=2m
```

My volume size is 64M and the file size is 395M. The 7 volumes allocated by the filer are not filled evenly: some inevitably receive more data and some less.
Before EC encoding, the volumes were distributed as follows:
- 8081:
  - volume id 2: 68M
  - volume id 4: 56M
  - volume id 7: 44M
- 8082:
  - volume id 1: 48M
  - volume id 3: 40M
  - volume id 5: 72M
- 8083:
  - (none)
- 8084:
  - volume id 6: 48M
The file is spread across the volumes in different proportions. If the percentage in ec.encode -fullPercent=50 were 95 or 80 instead of 50, ec.encode would not encode all of the volumes.
If one of those unencoded volumes then went offline, its data would be inaccessible.
To make every file fault-tolerant, you can trigger encoding of all volumes that have been quiet for 1s with ec.encode -collection t -quietFor 1s -fullPercent=0.001.
However, this is not the optimal strategy:
- Every encode and decode consumes CPU
- It occupies a lot of disk IO
- Access to large numbers of files becomes slow for users
Hot and cold separation
To avoid encoding and verifying all data, the author's storage strategy borrows ideas from "Facebook's Warm BLOB Storage System":
Hot data is accessed and updated frequently, so replication improves its access speed.
Cold data is accessed rarely, so erasure coding saves storage space while still ensuring reliability.
Use ec.encode for cold data. The default fullPercent is 95 and the default -quietFor is 1h0m0s, i.e. volumes more than 95% full with no updates in the past hour are EC encoded.
For the other (hot) volumes, use replication.
With both encoding and replication enabled, once a volume's data turns "cold", encoding operates on the source copy only; after encoding succeeds, the other replicas are deleted automatically. As the official wiki puts it:
If the volume is replicated, only one copy will be erasure encoded. All the original copies will be purged after a successful erasure encoding.
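The hot/cold decision above can be sketched as a predicate mirroring the ec.encode defaults; the function name and signature are illustrative assumptions, not SeaweedFS API:

```python
# A sketch of the hot/cold policy: "cold" mirrors the ec.encode defaults,
# i.e. a volume more than 95% full that has been quiet for at least 1 hour.
def should_ec_encode(full_percent: float, quiet_seconds: int,
                     full_threshold: float = 95.0,
                     quiet_for: int = 3600) -> bool:
    return full_percent > full_threshold and quiet_seconds >= quiet_for

print(should_ec_encode(98.0, 2 * 3600))  # True  -> cold: erasure-code it
print(should_ec_encode(98.0, 60))        # False -> hot: keep replicas
print(should_ec_encode(50.0, 2 * 3600))  # False -> not full enough yet
```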
Hot data backup
Shut down all servers, delete the test data, and enable the replication policy:

```shell
rm -rf disk1/* disk2/* disk3/* disk4/* master filerldb2
# Start the master server; -defaultReplication enables replication
./weed master -mdir=$PWD/master -volumeSizeLimitMB=64 -defaultReplication=001
# Volume servers in data center dc1: 8081 and 8082 on rack1, 8083 and 8084 on rack2
./weed volume -port=8081 -dir=$PWD/disk1 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack1
./weed volume -port=8082 -dir=$PWD/disk2 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack1
./weed volume -port=8083 -dir=$PWD/disk3 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack2
./weed volume -port=8084 -dir=$PWD/disk4 -max=100 -mserver="localhost:9333" -dataCenter=dc1 -rack=rack2
./weed filer -master=localhost:9333
```

For the meaning of defaultReplication values, see the author's Replication document. I'm using the 001 policy: keep one extra copy on the same rack.
In practice a rack holds many physical servers, each with multiple disks; since I'm testing on a single machine, different ports stand in for different servers.
- The layout: one data center dc1 with two racks (rack1 and rack2), and two volume servers on each rack.
```shell
# View the initial size of each disk
du -b disk*
4132    disk1
4132    disk2
4132    disk3
4132    disk4
```
Upload the 395M video again.
Looking at the disks afterwards, total usage is about 790M, exactly twice the file size:

```shell
du -b disk*
130759792   disk1   # 8081 rack1
130759792   disk2   # 8082 rack1
264052148   disk3   # 8083 rack2
264250668   disk4   # 8084 rack2
```

The master placed about 130M of the file's volumes on rack1 and about 264M on rack2; the 001 policy ("replicate once within the same rack") then gives disk2 the same copies as disk1, and disk4 the same as disk3.
In this setup, if any single node goes offline, reads and writes still work because a replica exists.
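A quick sanity check of the du output above: with the 001 policy every volume has 2 copies, so total disk usage should be about twice the file size.

```python
# Disk usage in bytes, taken from the du -b output above.
disk_bytes = {
    "disk1": 130759792, "disk2": 130759792,   # rack1 pair
    "disk3": 264052148, "disk4": 264250668,   # rack2 pair
}
file_bytes = 394994080  # the 395M video, as counted from fs.json

total = sum(disk_bytes.values())
print(round(total / file_bytes, 2))  # 2.0 -> two full copies of the data
```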
Cold data storage
We now assume this data has turned cold and EC-encode it:

```shell
./weed shell
lock
# EC-encode volumes that are more than 95% full and have had no updates for 2s
ec.encode -fullPercent=95 -quietFor=2s
ec.rebuild -force
ec.balance -force
volume.deleteEmpty -quietFor=2s -force
volume.balance -force
volume.fix.replication
s3.clean.uploads -timeAgo=24h
unlock
```
Post-encoding architecture analysis
Before encoding:
Although the master caps volumes at 64M, a full volume is not stored at exactly 64M and may float up to roughly 20M over.
After encoding, disk1 holds EC shards, and the total on-disk data is 641M, about 150M less than the 790M before encoding:

```shell
du -b disk*
160444532   disk1
160444532   disk2
144709880   disk3
176167160   disk4
```
Analysis
For the architectural principles, refer to the earlier section: Seaweedfs erasure coding architecture
- Volumes 1, 2, 4, and 6 each use more than 64M * 95% of their space, so each is split into 1M data blocks for encoding
- For each volume, every 10 data blocks are EC encoded into 4 additional parity blocks, yielding 14 EC shards per volume; the 4 volumes give 56 shards in total (also called 56 ErasureCodingShards)
- The 56 erasure-coded shards are spread evenly across the volume servers, and the original volumes and their backup copies are purged at the same time
- Example:
  - Volume 1 is 76M before encoding, rounded up to 80M
  - It is split into 1M data blocks, encoded in groups of 10
  - RS(10,4) encoding turns each group of 10 data blocks into 14 blocks; shard i collects the i-th block of every group, so the volume becomes 14 erasure coding shards (ErasureCodingShards) of 8M each
  - Volume 6 rounds up to 70M and likewise encodes into 14 shards of 7M each

Note: although I set the volume size to 64M on the master, the volume server does not cut volumes at exactly 64M; a volume may float about 20M above 64M.
Taking volume server 8081 (disk1) as an example, it stores erasure-coded shards from original volumes 1, 2, 4, and 6. Likewise, disk2, disk3, and disk4 store the remaining shards of the various volumes:
| Hard disk / volume id | disk1 | disk2 | disk3 | disk4 | Shard size |
|---|---|---|---|---|---|
| volume 1 | [1 5 9 13] | [2 6 10] | [3 7 11] | [0 4 8 12] | 8M |
| volume 2 | [3 7 11] | [0 2 6 10] | [4 8 12] | [1 5 9 13] | 8M |
| volume 3 | | | 48M (copy) | 48M (copy) | not encoded |
| volume 4 | [1 3 7 11] | [2 6 10] | [5 9 13] | [0 4 8 12] | 7M |
| volume 5 | 48.1M (copy) | 48.1M (copy) | | | not encoded |
| volume 6 | [2 6 10] | [1 5 9 13] | [3 7 11] | [0 4 8 12] | 7M |
| Total storage | 7\*8 + 7\*7 + 48.1 ≈ 154M | 7\*8 + 7\*7 + 48.1 ≈ 154M | 6\*8 + 6\*7 + 48 = 138M | 8\*8 + 8\*7 + 48 = 168M | |
Check the storage distribution in the shell; it is consistent with the calculation:

```shell
du -h disk*
154M    disk1
154M    disk2
139M    disk3
169M    disk4
```
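The per-disk totals can be recomputed from the shard table above; this assumes shards of volumes 1 and 2 are 8M each, shards of volumes 4 and 6 are 7M each, and the unencoded copies of volume 5 (48.1M) sit on the rack1 disks while volume 3 (48M) sits on the rack2 disks:

```python
# Shards held per disk, counted from the table (8M shards: vols 1+2, 7M: vols 4+6).
shard_counts = {
    "disk1": {"8M": 4 + 3, "7M": 4 + 3},
    "disk2": {"8M": 3 + 4, "7M": 3 + 4},
    "disk3": {"8M": 3 + 3, "7M": 3 + 3},
    "disk4": {"8M": 4 + 4, "7M": 4 + 4},
}
# Unencoded replica sizes per disk (volume 5 on rack1, volume 3 on rack2).
unencoded = {"disk1": 48.1, "disk2": 48.1, "disk3": 48.0, "disk4": 48.0}

totals = {}
for disk, c in shard_counts.items():
    totals[disk] = round(c["8M"] * 8 + c["7M"] * 7 + unencoded[disk], 1)
    print(disk, totals[disk])
# disk1 153.1, disk2 153.1, disk3 138.0, disk4 168.0 -- matching du -h
# (154M / 154M / 139M / 169M) once per-volume metadata overhead is included.
```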
Offline hard drive test
According to the Reed-Solomon erasure code RS(10,4): N = K + M
- N is the total number of EC shards per volume (14), K the data shards (10), and M the tolerable losses (4)
- That is, as long as at least 10 of a volume's 14 shards survive, the data can be recovered
- My 395M video used 7 volumes in total; 4 volumes were EC encoded, and the remaining two were not encoded but have 2 replicas each
- The encoded volumes hold 56 EC shards in total; at 4 tolerable losses per 14 shards (4:14 = x:56), up to 16 shard losses can be tolerated, at most 4 per volume
- The replication policy, however, keeps only 2 copies on the same rack, so the two volume servers of one rack must not go offline at the same time
Take any single hard drive offline
Taking 8081 offline alone: access is normal.
Taking 8082, 8083, or 8084 offline individually: access is normal.
Take two hard drives offline
One rack1 server and one rack2 server offline: inaccessible.
Both rack1 servers offline: inaccessible.
Both rack2 servers offline: inaccessible, since both replicas of an unencoded volume live on a single rack.