ceph source code reading erasure-code

1. ceph erasure code

Erasure code is a popular data redundancy storage method. The original data is divided into k data chunks, and m coding chunks are calculated from the k data chunks. Save n=k + m data blocks in different nodes, and restore the original data through any k blocks in n. EC contains two processes: encoding and decoding.

EC encoding in ceph is provided in the form of plug-ins. EC encoding has three indicators: space utilization, data reliability and recovery efficiency. Ceph provides the following erasure coding plug-ins: clay (coupled-layer), jerasure, lrc, shec, isa.

clay: Used to save network bandwidth and disk IO when repairing failed nodes/OSD/rack.

jerasure: Open source library, currently the default encoding method of ceph.

isa: isa is an EC library provided by Intel. It uses Intel processor instructions to accelerate calculations and can only run on Intel CPUs.

lrc: Divide the check blocks into global check blocks and local check blocks to reduce the network overhead in the recovery process after a single node fails.

shec: shec(k,m,l), k is the data chunk, m is the coding chunk, and l represents the number of data chunks required to calculate the coding chunk. The maximum allowable failed data block is: ml/k. To recover a failed single data block (data chunk), only l additional data blocks need to be read.

2. erasure-code file structure

erasure-code is ceph’s erasure coding core module, which contains 5 folders and ErasureCodeInterface, ErasureCode, ErasureCodePlugin, and adopts factory mode.

  • Clay, isa, jerasure, lrc, and shec are EC plug-ins for ceph.
  • ErasureCodeInterface: Provides common interfaces for EC plug-ins, with detailed descriptions for each interface.
  • ErasureCode: Provides access to erasure code parameters and core encoding and decoding interfaces.
  • ErasureCodePlugin: Provides registration, addition, and deletion functions for erasure mode events.

3. Data striping

Storage devices all have throughput limitations, which will affect performance and scalability, so storage systems generally support striping (sharding continuous information into multiple devices) to increase throughput and performance.

  • Basic concepts

Chunk: When encoding based on erasure coding, each encoding will produce several blocks of the same size (these blocks must be in order, otherwise they cannot be decoded). Ceph stores these in different OSDs through equal numbers of PGs.

Strip: If the encoding object is too large, it can be encoded multiple times, and the part that is encoded each time is called a strip. Stripes within the same pair are ordered.

Shared: All blocks with the same sequence number in the same object are located on the same PG. They form a fragment of the object, and the number of the fragment is the sequence number of the block.

Space utilization (rate): Calculated by k/n.

Object size: Objects in Ceph storage clusters have a maximum configurable size (such as 2MB, 4MB, etc.). The object size must be large enough to accommodate many stripe units, and should be an integer of stripe units. times.

Number of stripes: The Ceph client writes a series of stripe units into a series of objects determined by the number of stripes. This series of objects is called an object set. When the client writes to the last object in the object set, it returns to the first one.

Stripe width: Stripes have a configurable unit size (e.g. 64KB). The Ceph client divides the data equally into stripe units that fit into the write object, except for the last one. The strip width should be a slice of the object’s size so that the object contains many strip units.

strip_width=chunk_size*strip_size

Assume there is EC (k=4, m=2), strip_size=4, chunk_size=1K, then strip_width=4K. In ceph, strip_width defaults to 4K.

  • Data striping process

If you’re dealing with large-sized images, large S3 or Swift objects (like videos), or large CephFS directories, you’ll see significant read/write performance gains from striping across multiple objects in an object set. When the client writes stripe units to the corresponding objects in parallel, there will be significant write performance, because the objects are mapped to different placement groups and further mapped to different OSDs, which can be written in parallel at maximum speed.

In the above figure, the client data is striped into an object set (objectset1 in the above figure), which contains 4 objects. Among them, the first stripe unit is stripe unit 0 of object0, and the fourth stripe is Stripe unit 3 of object3, after writing the fourth strip, the client needs to confirm whether the object set is full. If the object set is not full, the client writes to the strip starting from the first object (object 0 in the above figure); if the object set is full, the client has to create a new object set (object set 2 in the above figure). Then start writing the first strip (stripe unit16) starting from the first object in the new object set (object 4 in the above figure).

4. Source code analysis

  • Coding Process

ECUtil::encode divides the original data according to the stripe width, and then encodes the stripe data to obtain the data block and check block of the stripe. Link each striped data block and parity block in order to form a data block and parity block.

int ECUtil::encode(
  const stripe_info_t &sinfo,
  ErasureCodeInterfaceRef &ec_impl,
  bufferlist &in,
  const set<int> & amp;want,
  map<int, bufferlist> *out)
....

//The file is encoded according to the size of strip_width each time
  for (uint64_t i = 0; i < logical_size; i + = sinfo.get_stripe_width()) {
    map<int, bufferlist> encoded;
    bufferlist buf;
    buf.substr_of(in, i, sinfo.get_stripe_width());
    //Call the corresponding erasure coding method for encoding
    int r = ec_impl->encode(want, buf, & amp;encoded);
    ceph_assert(r == 0);
    //Append the striped data block and check block to out
    for (map<int, bufferlist>::iterator i = encoded.begin();
i != encoded.end();
+ + i) {
      ceph_assert(i->second.length() == sinfo.get_chunk_size());
      (*out)[i->first].claim_append(i->second);
    }
  }

....
  return 0;
}

Next, let’s analyze ec_impl->encode(want, buf, & encoded) in depth. ErasureCode is a subclass of ErasureCodeInterface, so ErasureCode::encode is called. In ErasureCode::encode, it mainly performs memory allocation of map*encoded (encode_prepare()) and encoding of data blocks (encode_chunks).

int ErasureCode::encode(const set<int> & amp;want_to_encode,
                        const bufferlist &in,
                        map<int, bufferlist> *encoded)
{
  unsigned int k = get_data_chunk_count();
  unsigned int m = get_chunk_count() - k;
  bufferlist out;
  //encoded memory block allocation
  int err = encode_prepare(in, *encoded);
  if(err)
    return err;
  //Carry out encoding operations
  encode_chunks(want_to_encode, encoded);
  for (unsigned int i = 0; i < k + m; i + + ) {
    if (want_to_encode.count(i) == 0)
      encoded->erase(i);
  }
  return 0;
}

ErasureCode::encode_prepare() performs parameter initialization and memory allocation.

int ErasureCode::encode_prepare(const bufferlist &raw,
                                map<int, bufferlist> & amp;encoded) const
{
  unsigned int k = get_data_chunk_count();
  unsigned int m = get_chunk_count() - k;
  //The size of each block
  unsigned blocksize = get_chunk_size(raw.length());
  //Number of blank blocks
  unsigned padded_chunks = k - raw.length() / blocksize;
  bufferlist prepared = raw;

  //Split the raw data in order according to blocksize size, and write the divided blocks to encoded in order
  for (unsigned int i = 0; i < k - padded_chunks; i + + ) {
    bufferlist & amp;chunk = encoded[chunk_index(i)];
    chunk.substr_of(prepared, i * blocksize, blocksize);
    chunk.rebuild_aligned_size_and_memory(blocksize, SIMD_ALIGN);
    ceph_assert(chunk.is_contiguous());
  }
  if (padded_chunks) {
    unsigned remainder = raw.length() - (k - padded_chunks) * blocksize;
    bufferptr buf(buffer::create_aligned(blocksize, SIMD_ALIGN));

    raw.copy((k - padded_chunks) * blocksize, remainder, buf.c_str());
    buf.zero(remainder, blocksize - remainder);
    encoded[chunk_index(k-padded_chunks)].push_back(std::move(buf));

    for (unsigned int i = k - padded_chunks + 1; i < k; i + + ) {
      bufferptr buf(buffer::create_aligned(blocksize, SIMD_ALIGN));
      buf.zero();
      encoded[chunk_index(i)].push_back(std::move(buf));
    }
  }
  for (unsigned int i = k; i < k + m; i + + ) {
    bufferlist & amp;chunk = encoded[chunk_index(i)];
    chunk.push_back(buffer::create_aligned(blocksize, SIMD_ALIGN));
  }

  return 0;
}

After the above work is completed, the formal encoding encode_chunks() can begin. It is assumed here that the encoding method is jerasure and RS code is selected.

int ErasureCodeJerasure::encode_chunks(const set<int> & amp;want_to_encode,
map<int, bufferlist> *encoded)
{
  char *chunks[k + m];
  for (int i = 0; i < k + m; i + + )
    chunks[i] = (*encoded)[i].c_str();
  jerasure_encode( & amp;chunks[0], & amp;chunks[k], (*encoded)[0].length());
  return 0;
}

jerasure_encode() is the encoding function that calls the jerasure library.

void ErasureCodeJerasureReedSolomonVandermonde::jerasure_encode(char **data,
                                                                char **coding,
                                                                int blocksize)
{
  jerasure_matrix_encode(k, m, w, matrix, data, coding, blocksize);
}
  • Decoding process

There are two ECUtil::decode functions. Let’s pick the simple one for analysis.

The decode() function below initializes the data and decodes it.

int ECUtil::decode(
  const stripe_info_t &sinfo,
  ErasureCodeInterfaceRef &ec_impl,
  map<int, bufferlist> & amp;to_decode,
  bufferlist *out) {
  ceph_assert(to_decode.size());

  uint64_t total_data_size = to_decode.begin()->second.length();
  ....

  for (uint64_t i = 0; i < total_data_size; i + = sinfo.get_chunk_size()) {
    map<int, bufferlist> chunks;
    for (map<int, bufferlist>::iterator j = to_decode.begin();
j != to_decode.end();
+ + j) {
      chunks[j->first].substr_of(j->second, i, sinfo.get_chunk_size());
    }
    bufferlist bl;
    int r = ec_impl->decode_concat(chunks, & amp;bl);
    ceph_assert(r == 0);
    ceph_assert(bl.length() == sinfo.get_stripe_width());
    out->claim_append(bl);
  }
  return 0;
}

ErasureCode::decode_concat() decodes and links the older data blocks to restore the original data.

int ErasureCode::decode_concat(const map<int, bufferlist> & amp;chunks,
bufferlist *decoded)
{
  set<int> want_to_read;

  for (unsigned int i = 0; i < get_data_chunk_count(); i + + ) {
    want_to_read.insert(chunk_index(i));
  }
  map<int, bufferlist> decoded_map;
  //Decode core part
  int r = _decode(want_to_read, chunks, & amp;decoded_map);
  if (r == 0) {
    //Link the decoded data blocks
    for (unsigned int i = 0; i < get_data_chunk_count(); i + + ) {
      decoded->claim_append(decoded_map[chunk_index(i)]);
    }
  }
  return r;
}

Similarly, it is also assumed here that the erasure coding plug-in is jerasure. jerasure_decode() is the decoding function that calls the jerasure library.

int ErasureCodeJerasure::decode_chunks(const set<int> & amp;want_to_read,
const map<int, bufferlist> & chunks,
map<int, bufferlist> *decoded)
{
  unsigned blocksize = (*chunks.begin()).second.length();
  int erasures[k + m + 1];//Record the number of the lost block
  int erasures_count = 0;//Number of lost blocks
  char *data[k];
  char *coding[m];
  for (int i = 0; i < k + m; i + + ) {
    if (chunks.find(i) == chunks.end()) {
      erasures[erasures_count] = i;
      erasures_count + + ;
    }
    if (i < k)
      data[i] = (*decoded)[i].c_str();
    else
      coding[i - k] = (*decoded)[i].c_str();
  }
  erasures[erasures_count] = -1;

  ceph_assert(erasures_count > 0);
  return jerasure_decode(erasures, data, coding, blocksize);
}

Reference materials:

1. Architecture – Ceph Documentation

2. ceph source code analysis Chang Tao

3. ceph design principle and implementation