Lab 8: Locks | Lock optimization implementation

Lab: locks

Memory allocator (moderate)

Your job is to implement per-CPU freelists, and stealing when a CPU’s free list is empty. You must give all of your locks names that start with “kmem”. That is, you should call initlock for each of your locks, and pass a name that starts with “kmem”. Run kalloctest to see if your implementation has reduced lock contention. To check that it can still allocate all of memory, run usertests sbrkmuch. Your output will look similar to that shown below, with much-reduced contention in total on kmem locks, although the specific numbers will differ. Make sure all tests in usertests pass. make grade should say that the kalloctests pass.

kalloc() has a free list protected by a single lock, which can cause contention when the amount of concurrency is high.

In the original implementation, kalloc keeps a single freelist, and each free physical page itself serves as the list node (so no extra bookkeeping space is needed). Allocation removes a page from the head of the list; freeing pushes the page back onto it.
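
For reference, the relevant parts of the original kernel/kalloc.c look roughly like this (abridged; the page-range checks and junk-filling memset calls are omitted):

// Original xv6 (abridged): one freelist, one lock, shared by all CPUs.
struct run {
  struct run *next;
};

struct {
  struct spinlock lock;
  struct run *freelist;
} kmem;

void
kfree(void *pa)
{
  struct run *r = (struct run*)pa; // the free page itself is the list node

  acquire(&kmem.lock);
  r->next = kmem.freelist;
  kmem.freelist = r;
  release(&kmem.lock);
}

void *
kalloc(void)
{
  struct run *r;

  acquire(&kmem.lock);
  r = kmem.freelist;
  if(r)
    kmem.freelist = r->next;
  release(&kmem.lock);
  return (void*)r;
}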

The idea for eliminating this hot spot is to turn the shared resource into unshared, per-CPU resources.

There are generally two strategies for reducing lock contention:

  • Share only when necessary (here: split the single shared freelist into an independent freelist per CPU)
  • When sharing is unavoidable, spend as little time as possible inside the critical section and break coarse-grained locks into finer-grained ones (see the small sketch after this list)
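
As a tiny illustration of keeping the critical section short (a hypothetical sketch, not part of the lab code; lk, shared_counter and expensive_work are made up for the example):

struct spinlock lk;
int shared_counter;

void
do_task(void)
{
  int n;
  acquire(&lk);
  n = shared_counter++; // only the shared update happens under the lock
  release(&lk);
  expensive_work(n);    // the slow part runs after the lock is released
}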

The implementation for this lab mainly consists of the following:

  1. Give each CPU its own freelist.
  2. When a CPU's freelist has no free pages left, "steal" pages from other CPUs' freelists. A CPU's freelist is therefore not accessed only by its owning CPU: other CPUs may touch it while stealing, so each per-CPU freelist still needs its own lock.
// kernel/kalloc.c
struct {
  struct spinlock lock;
  struct run *freelist;
} kmem[NCPU]; // an independent freelist for each CPU, protected by an independent lock

char *kmem_lock_names[] = {
  "kmem_cpu_0",
  "kmem_cpu_1",
  "kmem_cpu_2",
  "kmem_cpu_3",
  "kmem_cpu_4",
  "kmem_cpu_5",
  "kmem_cpu_6",
  "kmem_cpu_7",
};

void
kinit()
{
  for(int i=0;i<NCPU;i++) { // initialize all of the locks
    initlock(&kmem[i].lock, kmem_lock_names[i]);
  }
  freerange(end, (void*)PHYSTOP);
}

// kernel/kalloc.c
void
kfree(void *pa)
{
  struct run *r;

  if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
    panic("kfree");

  // Fill with junk to catch dangling refs.
  memset(pa, 1, PGSIZE);

  r = (struct run*)pa;

  push_off(); // disable interrupts so cpuid() stays valid while we use it

  int cpu = cpuid();

  acquire(&kmem[cpu].lock);
  r->next = kmem[cpu].freelist;
  kmem[cpu].freelist = r;
  release(&kmem[cpu].lock);

  pop_off();
}

void *
kalloc(void)
{
  struct run *r;

  push_off();

  int cpu = cpuid();

  acquire(&kmem[cpu].lock);

  if(!kmem[cpu].freelist) { // no page left for this cpu
    int steal_left = 64; // steal 64 pages from other cpu(s)
    for(int i=0;i<NCPU;i++) {
      if(i == cpu) continue; // no self-robbery
      acquire(&kmem[i].lock);
      struct run *rr = kmem[i].freelist;
      while(rr && steal_left) {
        kmem[i].freelist = rr->next;
        rr->next = kmem[cpu].freelist;
        kmem[cpu].freelist = rr;
        rr = kmem[i].freelist;
        steal_left--;
      }
      release(&kmem[i].lock);
      if(steal_left == 0) break; // done stealing
    }
  }

  r = kmem[cpu].freelist;
  if(r)
    kmem[cpu].freelist = r->next;
  release(&kmem[cpu].lock);

  pop_off();

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  return (void*)r;
}

However, the code above can still deadlock: CPU a holds its own kmem lock while trying to acquire CPU b's lock to steal from it, at the same time that CPU b holds its own lock and tries to steal from CPU a. See https://github.com/Miigon/blog/issues/8 for a discussion and solution.
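
One possible way to break the circular wait is sketched below (this is my own sketch of the idea, not necessarily the exact fix from the linked issue): never hold our own kmem[cpu].lock while acquiring another CPU's kmem lock. Stolen pages first go onto a private local list and are spliced into our own freelist only after all other locks have been released.

// Sketch only: a possible rework of the stealing path inside kalloc().
// Assumes kmem[cpu].lock is held when reaching this point, as in the code above.
if(!kmem[cpu].freelist) {
  release(&kmem[cpu].lock);          // drop our own lock before touching any other CPU's lock

  struct run *stolen = 0;            // private list, visible only to this code path
  int steal_left = 64;
  for(int i = 0; i < NCPU && steal_left; i++) {
    if(i == cpu) continue;
    acquire(&kmem[i].lock);          // at most one kmem lock is held at any time
    while(kmem[i].freelist && steal_left) {
      struct run *rr = kmem[i].freelist;
      kmem[i].freelist = rr->next;
      rr->next = stolen;
      stolen = rr;
      steal_left--;
    }
    release(&kmem[i].lock);
  }

  acquire(&kmem[cpu].lock);          // re-acquire our own lock and splice the stolen pages in
  while(stolen) {
    struct run *rr = stolen;
    stolen = stolen->next;
    rr->next = kmem[cpu].freelist;
    kmem[cpu].freelist = rr;
  }
}

Because no CPU ever holds two kmem locks at once, the circular wait that causes the deadlock cannot occur.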

Run kalloctest:

$ kalloctest
start test1
test1 results:
--- lock kmem/bcache stats
lock: kmem_cpu_0: #fetch-and-add 0 #acquire() 35979
lock: kmem_cpu_1: #fetch-and-add 0 #acquire() 195945
lock: kmem_cpu_2: #fetch-and-add 0 #acquire() 201094
lock: bcache: #fetch-and-add 0 #acquire() 1248
--- top 5 contended locks:
lock: proc: #fetch-and-add 22486 #acquire() 132299
lock: virtio_disk: #fetch-and-add 16002 #acquire() 114
lock: proc: #fetch-and-add 11199 #acquire() 132301
lock: proc: #fetch-and-add 5330 #acquire() 132322
lock: proc: #fetch-and-add 4874 #acquire() 132345
tot = 0
test1 OK
start test2
total free number of pages: 32499 (out of 32768)
.....
test2 OK

Buffer cache (hard)

Modify the block cache so that the number of acquire loop iterations for all locks in the bcache is close to zero when running bcachetest. Ideally the sum of the counts for all locks involved in the block cache should be zero, but it’s OK if the sum is less than 500. Modify bget and brelse so that concurrent lookups and releases for different blocks that are in the bcache are unlikely to conflict on locks (e.g., don’t all have to wait for bcache.lock). You must maintain the invariant that at most one copy of each block is cached. When you are done, your output should be similar to that shown below (though not identical). Make sure usertests still passes. make grade should pass all tests when you are done.

The block cache in bcache is shared by multiple processes (and therefore by multiple CPUs), since several processes may access the same block at the same time.

In the original xv6 design, all block caches sit on a single doubly linked list. Every bget(blockno) traverses this list: if the target block is already cached, its buf is returned directly; otherwise, the least recently used buf with a reference count of 0 is recycled to cache the block and then returned.
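
For reference, the original bget in kernel/bio.c looks roughly like this (abridged): one global bcache.lock protects both the hash lookup and the LRU scan over the doubly linked list headed by bcache.head.

static struct buf*
bget(uint dev, uint blockno)
{
  struct buf *b;

  acquire(&bcache.lock);

  // Is the block already cached?
  for(b = bcache.head.next; b != &bcache.head; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Not cached; recycle the least recently used unused buffer.
  for(b = bcache.head.prev; b != &bcache.head; b = b->prev){
    if(b->refcnt == 0) {
      b->dev = dev;
      b->blockno = blockno;
      b->valid = 0;
      b->refcnt = 1;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }
  panic("bget: no buffers");
}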

The improved design builds a hash table from (dev, blockno) to buf and gives each bucket its own lock. Lock contention then occurs only when two processes simultaneously access blocks that hash to the same bucket. When a bucket has no free buf left, one is taken from another bucket.
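
As a concrete illustration of the bucket hash used in the code below (the specific numbers are just an example):

// With NBUFMAP_BUCKET == 13 and BUFMAP_HASH(dev, blockno) == (((dev)<<27)|(blockno)) % 13:
// BUFMAP_HASH(1, 33) == (134217728 | 33) % 13 == 134217761 % 13 == 2
// BUFMAP_HASH(1, 34) == (134217728 | 34) % 13 == 134217762 % 13 == 3
// so nearby block numbers on the same device spread across different buckets.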

My own first thought was simply to give each bucket its own lock, so that the buckets never contend with one another.

However, after reading the code of several other people online, I found that this part is not as simple as I expected and can easily lead to deadlock, even though my original code passed the grader.

Miigon's blog post explains the deadlock problem in detail:
https://blog.miigon.net/posts/s081-lab8-locks/ (see the "Deadlock problem" section)

Complete code:

// kernel/buf.h
struct buf {
  int valid;    // has data been read from disk?
  int disk;     // does disk "own" buf?
  uint dev;
  uint blockno;
  struct sleeplock lock;
  uint refcnt;
  uint lastuse; // *newly added, used to keep track of the least-recently-used buf
  struct buf *next;
  uchar data[BSIZE];
};
// kernel/bio.c

// bucket number for bufmap
#define NBUFMAP_BUCKET 13
// hash function for bufmap
#define BUFMAP_HASH(dev, blockno) ((((dev)<<27)|(blockno))%NBUFMAP_BUCKET)

struct {
  struct buf buf[NBUF];
  struct spinlock eviction_lock;

  // Hash map: dev and blockno to buf
  struct buf bufmap[NBUFMAP_BUCKET];
  struct spinlock bufmap_locks[NBUFMAP_BUCKET];
} bcache;

void
binit(void)
{
  // Initialize bufmap
  for(int i=0;i<NBUFMAP_BUCKET;i++) {
    initlock(&bcache.bufmap_locks[i], "bcache_bufmap");
    bcache.bufmap[i].next = 0;
  }

  // Initialize buffers
  for(int i=0;i<NBUF;i++){
    struct buf *b = &bcache.buf[i];
    initsleeplock(&b->lock, "buffer");
    b->lastuse = 0;
    b->refcnt = 0;
    // put all the buffers into bufmap[0]
    b->next = bcache.bufmap[0].next;
    bcache.bufmap[0].next = b;
  }

  initlock(&bcache.eviction_lock, "bcache_eviction");
}

// Look through buffer cache for block on device dev.
// If not found, allocate a buffer.
// In either case, return locked buffer.
static struct buf*
bget(uint dev, uint blockno)
{
  struct buf *b;

  uint key = BUFMAP_HASH(dev, blockno);

  acquire(&bcache.bufmap_locks[key]);

  // Is the block already cached?
  for(b = bcache.bufmap[key].next; b; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      release(&bcache.bufmap_locks[key]);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Not cached.

  // To get a suitable block to reuse, we need to search for one in all the buckets,
  // which means acquiring their bucket locks.
  // But it's not safe to try to acquire every single bucket lock while holding one:
  // it can easily lead to circular wait, which produces deadlock.

  release(&bcache.bufmap_locks[key]);
  // We need to release our bucket lock so that iterating through all the buckets won't
  // lead to circular wait and deadlock. However, as a side effect of releasing our bucket
  // lock, other cpus might request the same blockno at the same time and the cache buf for
  // blockno might be created multiple times in the worst case, since multiple concurrent
  // bget requests might pass the "Is the block already cached?" test and start the
  // eviction & reuse process multiple times for the same blockno.
  //
  // So, after acquiring eviction_lock, we check "whether cache for blockno is present"
  // once more, to be sure that we don't create duplicate cache bufs.
  acquire(&bcache.eviction_lock);

  // Check again, is the block already cached?
  // No other eviction & reuse will happen while we are holding eviction_lock,
  // which means no bucket's linked-list structure can change.
  // So it's ok here to iterate through `bcache.bufmap[key]` without holding
  // its corresponding bucket lock, since we are holding a much stronger eviction_lock.
  for(b = bcache.bufmap[key].next; b; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      acquire(&bcache.bufmap_locks[key]); // must do, for `refcnt++`
      b->refcnt++;
      release(&bcache.bufmap_locks[key]);
      release(&bcache.eviction_lock);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Still not cached.
  // We are now only holding eviction_lock; none of the bucket locks are held by us.
  // So it's now safe to acquire any bucket's lock without risking circular wait and deadlock.

  // Find the least-recently-used buf among all buckets.
  // Finish with its corresponding bucket's lock held.
  struct buf *before_least = 0;
  uint holding_bucket = -1;
  for(int i = 0; i < NBUFMAP_BUCKET; i++){
    // Before acquiring, we are either holding nothing, or only holding locks of
    // buckets that are *on the left side* of the current bucket,
    // so no circular wait can ever happen here. (safe from deadlock)
    acquire(&bcache.bufmap_locks[i]);
    int newfound = 0; // new least-recently-used buf found in this bucket
    for(b = &bcache.bufmap[i]; b->next; b = b->next) {
      if(b->next->refcnt == 0 && (!before_least || b->next->lastuse < before_least->next->lastuse)) {
        before_least = b;
        newfound = 1;
      }
    }
    if(!newfound) {
      release(&bcache.bufmap_locks[i]);
    } else {
      if(holding_bucket != -1) release(&bcache.bufmap_locks[holding_bucket]);
      holding_bucket = i;
      // keep holding this bucket's lock....
    }
  }
  if(!before_least) {
    panic("bget: no buffers");
  }
  b = before_least->next;

  if(holding_bucket != key) {
    // remove the buf from its original bucket
    before_least->next = b->next;
    release(&bcache.bufmap_locks[holding_bucket]);
    // rehash and add it to the target bucket
    acquire(&bcache.bufmap_locks[key]);
    b->next = bcache.bufmap[key].next;
    bcache.bufmap[key].next = b;
  }

  b->dev = dev;
  b->blockno = blockno;
  b->refcnt = 1;
  b->valid = 0;
  release(&bcache.bufmap_locks[key]);
  release(&bcache.eviction_lock);
  acquiresleep(&b->lock);
  return b;
}

//......

// Release a locked buffer.
void
brelse(struct buf *b)
{
  if(!holdingsleep(&b->lock))
    panic("brelse");

  releasesleep(&b->lock);

  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt--;
  if (b->refcnt == 0) {
    b->lastuse = ticks;
  }
  release(&bcache.bufmap_locks[key]);
}

void
bpin(struct buf *b) {
  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt++;
  release(&bcache.bufmap_locks[key]);
}

void
bunpin(struct buf *b) {
  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt--;
  release(&bcache.bufmap_locks[key]);
}

Run bcachetest:

$ bcachetest
start test0
test0 results:
--- lock kmem/bcache stats
lock: kmem_cpu_0: #fetch-and-add 0 #acquire() 32897
lock: kmem_cpu_1: #fetch-and-add 0 #acquire() 77
lock: kmem_cpu_2: #fetch-and-add 0 #acquire() 61
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6400
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6685
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6696
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 7018
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6266
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4206
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4206
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 2193
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4202
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 2196
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4359
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4409
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6411
lock: bcache_eviction: #fetch-and-add 0 #acquire() 83
--- top 5 contended locks:
lock: proc: #fetch-and-add 397110 #acquire() 70988
lock: proc: #fetch-and-add 262715 #acquire() 70988
lock: proc: #fetch-and-add 222165 #acquire() 70987
lock: virtio_disk: #fetch-and-add 161088 #acquire() 1098
lock: proc: #fetch-and-add 45459 #acquire() 71331
tot = 0
test0: OK
start test1
test1 OK
$

In multi-threaded code, holding multiple locks can easily lead to deadlock. Solving such problems well requires a solid understanding of the underlying instructions and of how the CPU actually works; I still have a long way to go.