Usage scenarios of mutexes in ATF

bakery lock

Let’s take a look at how the bakery lock appears in ATF.

Scenario 1: Runtime service initialization

BL31 is responsible for initializing runtime services. One of them is PSCI.

As part of PSCI initialization, BL31 detects the system topology. It also initializes the data structures implementing the state machines used to track the state of the power domain nodes. The state can be one of “OFF”, “RUN”, or “RETENTION”.

All secondary CPUs are initially in the “OFF” state. The cluster that the primary CPU belongs to is “ON”; any other cluster is “OFF”. BL31 also initializes the locks that protect these data structures.

BL31 accesses the state of a CPU or cluster immediately after reset and before the data cache is enabled in the warm boot path. Spin locks based on exclusive load/store instructions cannot be used at this point (exclusive accesses are not guaranteed to work while the caches are off), so BL31 uses locks based on Lamport’s Bakery algorithm instead.
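To make the algorithm concrete, here is a minimal sketch of Lamport’s Bakery algorithm in plain C. This is illustrative only, not TF-A’s implementation: “N_CPUS”, “bakery_get()” and “bakery_release()” are hypothetical names, and real firmware must add the memory barriers and cache maintenance discussed in the scenarios below.

.. code:: c

    #define N_CPUS 4U

    /* Per-CPU fields: every CPU reads all entries, but writes only its
     * own. This property is what makes the algorithm usable without
     * exclusive accesses. */
    static volatile int choosing[N_CPUS];
    static volatile unsigned int number[N_CPUS];

    static void bakery_get(unsigned int me)
    {
        unsigned int max = 0U;

        /* Take a ticket: one more than the largest ticket in use. */
        choosing[me] = 1;
        for (unsigned int i = 0U; i < N_CPUS; i++) {
            if (number[i] > max)
                max = number[i];
        }
        number[me] = max + 1U;
        choosing[me] = 0;

        /* Wait for every CPU that holds a smaller ticket (ties are
         * broken by CPU index). */
        for (unsigned int i = 0U; i < N_CPUS; i++) {
            while (choosing[i] != 0)
                ;
            while ((number[i] != 0U) &&
                   ((number[i] < number[me]) ||
                    ((number[i] == number[me]) && (i < me))))
                ;
        }
    }

    static void bakery_release(unsigned int me)
    {
        number[me] = 0U;
    }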

The runtime service framework and its initialization are described in more detail in the “EL3 Runtime Service Architecture” section below. The “Power State Coordination Interface” section below provides details on the state of the PSCI implementation.

Scenario 2: Coherent memory usage in the PSCI implementation

The “psci_non_cpu_pd_nodes” data structure stores the power domain tree information of the platform and is used for state management of the power domains. By default, this data structure is allocated in a coherent memory region in TF-A, since it can be accessed by multiple CPUs whether caching is enabled or disabled. (Placing it in coherent memory sidesteps the cache coherency problems described next.)

A few words about coherency. Coherency problems occur when multiple copies of the same piece of data exist. The most typical situation in which a coherency problem appears is the introduction of caches.

In the case of a single master, even if a cache is introduced, the data can exist in main memory and in the cache at the same time, and the master can always use a flush or invalidate operation to update the data in main memory. In fact, in most cases no cache maintenance is needed at all.

However, once multiple masters are introduced, each with its own cache, things become complicated. Data D written by master A to location L of main memory M may still sit in A’s cache, so if B reads that location directly from main memory, it is likely to read the stale data D'.


So how do we ensure that B always gets the latest data? Disabling the cache is the most intuitive option, but it costs performance. Alternatively, each master can keep main memory and its cache consistent after every update, either with a write-through policy or by flushing after each write. Compared with disabling the cache, performance is slightly better, but every write still consumes memory bandwidth, so this is still not a good solution.
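As an illustration of the flush-after-update approach, the following sketch uses TF-A-style cache maintenance helpers. Treat the names as assumptions: “flush_dcache_range()” and “inv_dcache_range()” are modeled on TF-A’s arch helpers, and “shared_data” is a hypothetical variable shared between two masters.

.. code:: c

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed helpers, modeled on TF-A's arch_helpers.h. */
    void flush_dcache_range(uintptr_t addr, size_t size); /* clean+invalidate */
    void inv_dcache_range(uintptr_t addr, size_t size);   /* invalidate */

    /* Hypothetical data shared between master A and master B. */
    static int shared_data;

    /* Master A: update the value, then clean+invalidate the cache line
     * so the new value reaches main memory. */
    static void publish(int value)
    {
        shared_data = value;
        flush_dcache_range((uintptr_t)&shared_data, sizeof(shared_data));
    }

    /* Master B: invalidate any stale local copy first, so the next
     * load is fetched from main memory. */
    static int consume(void)
    {
        inv_dcache_range((uintptr_t)&shared_data, sizeof(shared_data));
        return shared_data;
    }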

In hardware, the masters’ caches can be coupled to each other so that they maintain data coherency among themselves and only write data back to main memory when needed. At the cost of more complex hardware design, the lost performance is recovered. For example, the shareable attribute in the Armv8 architecture is one manifestation of hardware-guaranteed coherency.

To sum up: consistency is a problem of ordering when multiple masters read and write the same memory area; the hardware can only provide the mechanisms (such as barriers), and software must use them to guarantee correctness. Coherency is data inconsistency caused by multiple copies (mainly caches) of the same memory area, and it can be guaranteed by hardware.
Returning to “psci_non_cpu_pd_nodes”, the structure is defined as follows:

.. code:: c

    typedef struct non_cpu_pwr_domain_node {
        /*
         * Index of the first CPU power domain node level 0 which has this node
         * as its parent.
         */
        unsigned int cpu_start_idx;

        /*
         * Number of CPU power domains which are siblings of the domain indexed
         * by 'cpu_start_idx' i.e. all the domains in the range 'cpu_start_idx
         * -> cpu_start_idx + ncpus' have this node as their parent.
         */
        unsigned int ncpus;

        /* Index of the parent power domain node. */
        unsigned int parent_node;

        /* Local power state of this node, managed at runtime. */
        plat_local_state_t local_state;

        /* Level of this node in the power domain tree. */
        unsigned char level;

        /* For indexing the psci_lock array. */
        unsigned char lock_index;
    } non_cpu_pd_node_t;

In order to move this data structure into normal memory, the usage of each of its fields must be analyzed. Fields such as “cpu_start_idx”, “ncpus”, “parent_node”, “level”, and “lock_index” are written only once, during cold boot.

Therefore, removing them from coherent memory only requires cleaning and invalidating the affected cache lines once, after these fields are written.
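In code, this one-time maintenance might look like the following sketch, assuming the power domain tree array “psci_non_cpu_pd_nodes” has just been populated during cold boot and a “flush_dcache_range()” helper as above.

.. code:: c

    /* After cold boot writes the write-once fields, clean and
     * invalidate the affected cache lines once, so CPUs running with
     * caches disabled read the final values from main memory. */
    flush_dcache_range((uintptr_t)psci_non_cpu_pd_nodes,
                       sizeof(psci_non_cpu_pd_nodes));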

The field “local_state” can be accessed by multiple CPUs whose caches are in different states. Lamport’s Bakery locks (“psci_locks”) are used to ensure mutually exclusive access to this field, and the cache line must be cleaned and invalidated after each write to it.
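Putting it together, an update of “local_state” might look like this sketch. The lock API is modeled on TF-A’s “bakery_lock_get()”/“bakery_lock_release()”, but treat the exact signatures and the helper “set_node_state()” as assumptions.

.. code:: c

    /* Sketch: update a node's local_state under its bakery lock, then
     * clean+invalidate the line so CPUs with caches off see the new
     * value. Types are those from the structure definitions above. */
    static void set_node_state(non_cpu_pd_node_t *node,
                               bakery_lock_t *lock,
                               plat_local_state_t state)
    {
        bakery_lock_get(lock);
        node->local_state = state;
        flush_dcache_range((uintptr_t)&node->local_state,
                           sizeof(node->local_state));
        bakery_lock_release(lock);
    }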

Scenario 3: Bakery lock data

The bakery lock data structure “bakery_lock_t” is allocated in coherent memory and is accessed by multiple CPUs with mismatched attributes. (Every CPU can read all of the per-CPU fields, but each field is written by only one CPU.) “bakery_lock_t” is defined as follows:

.. code:: c

    typedef struct bakery_lock {
        /*
         * The lock_data is a bit-field of 2 members:
         * Bit[0] : choosing. This field is set when the CPU is
         * choosing its bakery number.
         * Bits[1 - 15] : number. This is the bakery number allocated.
         */
        volatile uint16_t lock_data[BAKERY_LOCK_MAX_CPUS];
    } bakery_lock_t;
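For illustration, the two members described in the comment could be decoded with helper macros like these; the names are hypothetical, not TF-A’s own accessors.

.. code:: c

    /* Hypothetical decode helpers for the lock_data bit-field. */
    #define BAKERY_CHOOSING(_d)  ((_d) & 1U)           /* Bit[0]      */
    #define BAKERY_NUMBER(_d)    ((uint16_t)(_d) >> 1) /* Bits[1-15]  */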

A feature of Lamport’s Bakery algorithm is that per-CPU volatile fields are readable by all CPUs but only writable by the owning CPU.

Depending on the size of a data cache line, the per-CPU fields of a “bakery_lock_t” structure for several CPUs may share a single cache line.

These per-CPU fields can be read and written during lock contention by multiple CPUs with mismatched memory attributes. Since these fields are part of the lock implementation itself, no other locking primitive is available to guard against the resulting coherency problems. As a result, simple software cache maintenance is not enough to move them out of coherent memory. Consider the following example.

CPU0 updates its per-CPU field with data caching enabled. This write operation updates the local cache line, which also contains the other CPU’s copy of the field. Now CPU1 updates the per-CPU field of its “bakery_lock_t” structure with data caching disabled.

CPU1 then issues a DC IVAC operation to invalidate any stale copies of its field in any other cache in the system. This action also wipes out the update made by CPU0.

To use bakery locks when “USE_COHERENT_MEM” is disabled, the lock data structure has been reworked. The changes take advantage of the property of Lamport’s Bakery algorithm mentioned above. The “bakery_lock” structure only allocates memory for a single CPU; the macro “DEFINE_BAKERY_LOCK” places every bakery lock needed by the CPU in the “bakery_lock” section. The linker allocates memory for the other cores by taking the total size of the “bakery_lock” section and multiplying it by (PLATFORM_CORE_COUNT - 1). This enables software to perform cache maintenance on the lock data structures without running into the coherency issues caused by mismatched attributes.
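A sketch of the reworked declaration, modeled on the TF-A sources (details can differ between versions):

.. code:: c

    /* Declare a bakery lock. Only the calling CPU's copy is allocated
     * explicitly; the linker script groups everything placed in the
     * "bakery_lock" section and reserves space for the other CPUs. */
    #define DEFINE_BAKERY_LOCK(_name) \
        bakery_lock_t _name __section("bakery_lock")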

The bakery lock data structure “bakery_info_t” used when “USE_COHERENT_MEM” is disabled is defined as follows:

.. code:: c

    typedef struct bakery_info {
        /*
         * The lock_data is a bit-field of 2 members:
         * Bit[0] : choosing. This field is set when the CPU is
         * choosing its bakery number.
         * Bits[1 - 15] : number. This is the bakery number allocated.
         */
        volatile uint16_t lock_data;
    } bakery_info_t;

A “bakery_info_t” represents a single per-CPU field of a lock, and the set of corresponding “bakery_info_t” structures for all CPUs in the system forms one complete bakery lock. The memory view for a system with n bakery locks is:

::

    bakery_lock section start
    |----------------|
    | `bakery_info_t`| <-- Lock_0 per-CPU field
    |     Lock_0     |     for CPU0
    |----------------|
    | `bakery_info_t`| <-- Lock_1 per-CPU field
    |     Lock_1     |     for CPU0
    |----------------|
    |      ....      |
    |----------------|
    | `bakery_info_t`| <-- Lock_N per-CPU field
    |     Lock_N     |     for CPU0
    ------------------
    |      XXXXX     |
    |   Padding to   |
    |  next Cache WB | <--- Calculate PERCPU_BAKERY_LOCK_SIZE,
    |     Granule    |      allocate contiguous memory for the
    ------------------      remaining CPUs.
    | `bakery_info_t`| <-- Lock_0 per-CPU field
    |     Lock_0     |     for CPU1
    |----------------|
    | `bakery_info_t`| <-- Lock_1 per-CPU field
    |     Lock_1     |     for CPU1
    |----------------|
    |      ....      |
    |----------------|
    | `bakery_info_t`| <-- Lock_N per-CPU field
    |     Lock_N     |     for CPU1
    ------------------
    |      XXXXX     |
    |   Padding to   |
    |  next Cache WB |
    |     Granule    |
    ------------------

Consider a system of two CPUs with “N” bakery locks, as shown above. For an operation on Lock_N, the corresponding “bakery_info_t” in the “bakery_lock” section must be fetched for both CPU0 and CPU1, with appropriate cache maintenance performed for each access.
On Arm platforms, bakery locks are used in the PSCI implementation (“psci_locks”) and in the power controller driver (“arm_lock”).
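The per-CPU slot lookup and the per-access cache maintenance can be sketched as follows. This is modeled on the TF-A sources; the names “get_bakery_info()”, “PERCPU_BAKERY_LOCK_SIZE” and “read_peer_lock_data()” should be treated as illustrative.

.. code:: c

    /* CPU 0's copy of a lock is the bakery_lock section entry itself;
     * CPU n's copy sits n * PERCPU_BAKERY_LOCK_SIZE bytes further on
     * (see the memory view above). */
    #define get_bakery_info(_cpu_ix, _lock) \
        ((bakery_info_t *)((uintptr_t)(_lock) + \
                           ((_cpu_ix) * PERCPU_BAKERY_LOCK_SIZE)))

    /* Read a peer CPU's per-CPU field, invalidating the local copy
     * first in case the peer wrote it with its caches disabled. */
    static uint16_t read_peer_lock_data(bakery_info_t *lock,
                                        unsigned int peer_ix)
    {
        bakery_info_t *info = get_bakery_info(peer_ix, lock);

        inv_dcache_range((uintptr_t)info, sizeof(*info));
        return info->lock_data;
    }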

Scenario 4: Non-functional impact of removing coherent memory

Removing the coherent memory region incurs additional software overhead for performing cache maintenance on the affected data structures. However, since those data structures are now allocated in cacheable memory, the performance gained from caching largely offsets this overhead.

However, bakery lock performance suffers for the following reasons:
– Extra cache maintenance operations, and multiple cache-line reads per lock operation, since each CPU’s per-CPU lock fields are spread across different cache lines.
The implementation has been optimized to minimize this extra overhead.

Measurements on the Juno Arm development platform show that the minimum latency to acquire a lock averages 3-4 microseconds when bakery locks are allocated in normal memory, versus 2 microseconds when they are allocated in device memory.

As mentioned, disabling “USE_COHERENT_MEM” can save almost a page of memory. Each platform needs to weigh these tradeoffs to decide whether coherent memory should be used. If a platform disables “USE_COHERENT_MEM” and requires bakery locks in the porting layer, it can optionally define the macro “PLAT_PERCPU_BAKERY_LOCK_SIZE” (see the Porting Guide). See the reference platform code for examples.
(The bakery lock is the main consumer of coherent memory.)

Scenario 5: Allocating the coherent memory region

If the build option “USE_COHERENT_MEM” is enabled, each platform can allocate a block of identity-mapped secure memory with the Device-nGnRE attribute, aligned to a page boundary (4K), for each BL stage.

All sections that allocate coherent memory are grouped under “coherent_ram”. For example, the bakery lock is placed in a section named “bakery_lock” within “coherent_ram”, so the firmware can place variables there using the following C attribute:

::

    __section("bakery_lock")

or using the following assembler directive:

::

    .section bakery_lock

The “coherent_ram” section is the aggregation of all sections like “bakery_lock”; it is used to allocate any data structure that is accessed both when the CPU is executing with the MMU and caches enabled, and when it is running with the MMU and caches disabled. Some examples were given in the scenarios above. (Indeed, it is used at startup, in the multi-core boot scenario.)
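For example, a C variable can be placed in that section like this; “my_lock” is a hypothetical name, and “__section” is the attribute shown above.

.. code:: c

    /* Placed in the bakery_lock input section, which the linker
     * script gathers into coherent_ram when USE_COHERENT_MEM=1. */
    static bakery_lock_t my_lock __section("bakery_lock");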

Scenario 6: PLAT_PERCPU_BAKERY_LOCK_SIZE

#define : PLAT_PERCPU_BAKERY_LOCK_SIZE [optional]

When “USE_COHERENT_MEM=0”, this constant defines the total memory (in bytes), aligned to a cache line boundary, that should be allocated per CPU to accommodate all bakery locks.

If this constant is not defined when “USE_COHERENT_MEM=0”, the linker calculates the size of the “bakery_lock” input section, aligns it to the nearest “CACHE_WRITEBACK_GRANULE”, multiplies it by “PLATFORM_CORE_COUNT”, and stores the result in a linker symbol. Defining the constant removes this dependence on the linker and provides a more efficient mechanism for accessing per-CPU bakery lock information.
If this constant is defined and its value is not equal to the value calculated by the linker, a link-time assertion is raised. A compile-time assertion is raised if the constant’s value is not aligned to a cache line boundary.
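As a hypothetical example, a platform whose per-CPU bakery lock data fits within a single cache writeback granule might define the constant in its platform_def.h as:

.. code:: c

    /* Hypothetical: all per-CPU bakery lock data fits in one cache
     * writeback granule, so reserve exactly that much per CPU. */
    #define PLAT_PERCPU_BAKERY_LOCK_SIZE (1 * CACHE_WRITEBACK_GRANULE)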