StratoVirt’s vCPU topology (SMP)

CPU topology describes how CPUs are organized at the hardware level. This article focuses on the SMP (Symmetric Multi-Processing) part of the CPU topology. The topology also covers other information, such as caches, which will be covered later. Besides describing how CPUs are composed, the CPU topology is used by the kernel scheduler to make better scheduling decisions and thus improve performance. In StratoVirt, CPU topology support also lays the foundation for the later development of CPU hotplug.

A common CPU SMP structure is:

socket --> die --> cluster --> core --> thread
  • socket: corresponds to the CPU socket on the motherboard

  • die: a small rectangular piece cut from the wafer during processor manufacturing; dies are interconnected through the on-chip bus

  • cluster: a group of cores, for example a group of big cores or a group of little cores

  • core: an independent physical CPU

  • thread: a logical CPU, a concept introduced by Intel Hyper-Threading Technology
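
The hierarchy above can be modelled as a product of per-level counts. The following is a minimal sketch in Rust, assuming a uniform topology in which CPUs are numbered with the thread index varying fastest; the Topology type and its field names are illustrative and are not StratoVirt's own types.

```rust
/// Illustrative description of a uniform SMP topology
/// (hypothetical type, not taken from StratoVirt).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Topology {
    pub sockets: u32,
    pub dies: u32,
    pub clusters: u32,
    pub cores: u32,
    pub threads: u32,
}

impl Topology {
    /// Total number of logical CPUs: the product of the counts at every level.
    pub fn logical_cpus(&self) -> u32 {
        self.sockets * self.dies * self.clusters * self.cores * self.threads
    }

    /// Split a flat CPU index into (socket, die, cluster, core, thread).
    /// Returns None when the index is outside the topology.
    pub fn coordinates(&self, cpu: u32) -> Option<(u32, u32, u32, u32, u32)> {
        if cpu >= self.logical_cpus() {
            return None;
        }
        let thread = cpu % self.threads;
        let rest = cpu / self.threads;
        let core = rest % self.cores;
        let rest = rest / self.cores;
        let cluster = rest % self.clusters;
        let rest = rest / self.clusters;
        let die = rest % self.dies;
        let socket = rest / self.dies;
        Some((socket, die, cluster, core, thread))
    }
}
```

With 2 sockets, 1 die, 1 cluster, 2 cores and 2 threads, the topology holds 8 logical CPUs, and CPU 5 is thread 1 of core 0 in socket 1.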

CPU topology acquisition principle

Because x86 and ARM obtain the topology in different ways, they are described separately below.

x86

On x86, the operating system obtains the CPU topology by executing the CPUID instruction. CPUID (the name is derived from "CPU identification") is a supplementary processor instruction that lets software discover details about the processor, for example to determine the processor type.

CPUID implicitly uses the EAX register to select the main category of information returned, which is called the CPUID leaf. Two leaves relate to CPU topology: 0BH and 1FH. Leaf 1FH is an extension of 0BH that can describe more levels. Intel recommends checking first whether leaf 1FH exists and preferring it when it does. When EAX is initialized to 0BH, CPUID returns core/logical processor topology information in the EAX, EBX, ECX and EDX registers. This function (EAX=0BH) also requires ECX to be initialized to an index that selects the level being queried, for example the logical processor level or the core level. The OS calls the function with ECX=0, 1, 2, ..., n in that order. The order in which the levels are returned is significant: each level reports cumulative data, so some information depends on what was retrieved from previous levels. Leaf 0BH can report the SMT and Core levels; leaf 1FH can report the SMT, Core, Module, Tile and Die levels.
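
The enumeration order described above can be sketched as follows. This is an illustration under stated assumptions, not code from StratoVirt or from any kernel: the cpuid parameter is a caller-supplied stand-in for the CPUID instruction that returns (EAX, EBX, ECX, EDX), the Level type is hypothetical, and the existence check for leaf 1FH is simplified to "sub-leaf 0 reports a non-zero logical processor count".

```rust
/// One enumerated topology level (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct Level {
    pub level_type: u32,   // ECX[15:8]: 1 = SMT, 2 = Core, 5 = Die
    pub shift: u32,        // EAX[4:0]
    pub logical_cpus: u32, // EBX[15:0]
}

/// Walk the topology sub-leaves in the order ECX = 0, 1, 2, ...
/// `cpuid(leaf, subleaf)` stands in for the real instruction.
pub fn enumerate_levels<F>(cpuid: F) -> Vec<Level>
where
    F: Fn(u32, u32) -> (u32, u32, u32, u32),
{
    // Prefer leaf 1FH when its first sub-leaf reports any logical processors
    // (simplified existence check, see the assumptions above).
    let (_, ebx, _, _) = cpuid(0x1f, 0);
    let leaf = if ebx & 0xffff != 0 { 0x1f } else { 0x0b };

    let mut levels = Vec::new();
    for subleaf in 0.. {
        let (eax, ebx, ecx, _edx) = cpuid(leaf, subleaf);
        let level_type = (ecx >> 8) & 0xff;
        if level_type == 0 {
            // Level type 0 marks an invalid level: enumeration is complete.
            break;
        }
        levels.push(Level {
            level_type,
            shift: eax & 0x1f,
            logical_cpus: ebx & 0xffff,
        });
    }
    levels
}
```

On a real x86_64 host the closure could wrap the __cpuid_count intrinsic from std::arch::x86_64; passing a closure keeps the sketch testable on any architecture.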

The following table is a more detailed explanation:

Information provided about the processor, by initial EAX value:

Initial EAX value: 0BH
  EAX  Bits 04-00: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*. All logical processors with the same next level ID share current level.
       Bits 31-05: Reserved.
  EBX  Bits 15-00: Number of logical processors at this level type. The number reflects configuration as shipped by Intel.
       Bits 31-16: Reserved.
  ECX  Bits 07-00: Level number. Same value in ECX input.
       Bits 15-08: Level type.
       Bits 31-16: Reserved.
  EDX  Bits 31-00: x2APIC ID the current logical processor.

Initial EAX value: 1FH
  EAX  Bits 04-00: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*. All logical processors with the same next level ID share current level.
       Bits 31-05: Reserved.
  EBX  Bits 15-00: Number of logical processors at this level type. The number reflects configuration as shipped by Intel.
       Bits 31-16: Reserved.
  ECX  Bits 07-00: Level number. Same value in ECX input.
       Bits 15-08: Level type.
       Bits 31-16: Reserved.
  EDX  Bits 31-00: x2APIC ID the current logical processor.

Source: Intel 64 and IA-32 Architectures Software Developer’s Manual
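
The shift widths reported in EAX can be used to split an x2APIC ID into per-level IDs. The sketch below assumes only two levels (SMT and Core) and that the SMT shift is not larger than the Core shift; the ShiftWidths type and the function name are illustrative.

```rust
/// Shift widths as reported in EAX[4:0] for the SMT and Core levels
/// (hypothetical helper type). Assumes smt <= core < 32.
pub struct ShiftWidths {
    pub smt: u32,  // bits to shift right to drop the thread ID
    pub core: u32, // bits to shift right to drop the thread and core IDs
}

/// Split an x2APIC ID into (package, core, thread) IDs.
pub fn split_x2apic_id(x2apic_id: u32, w: &ShiftWidths) -> (u32, u32, u32) {
    // The lowest `smt` bits identify the thread within its core.
    let thread = x2apic_id & ((1u32 << w.smt) - 1);
    // The next `core - smt` bits identify the core within its package.
    let core = (x2apic_id >> w.smt) & ((1u32 << (w.core - w.smt)) - 1);
    // Everything above the Core shift identifies the package.
    let package = x2apic_id >> w.core;
    (package, core, thread)
}
```

For example, with an SMT shift of 1 and a Core shift of 3, the x2APIC ID 0b1101 belongs to thread 1 of core 2 in package 1.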

ARM

On ARM, if the operating system is booted with a Device Tree, it obtains the CPU topology from the Device Tree. If it is booted in ACPI mode, it obtains the CPU topology by parsing the ACPI PPTT table.

ACPI–PPTT

ACPI (Advanced Configuration and Power Interface) is an architecture-independent power management and configuration framework. The framework establishes a set of hardware registers to define power states. ACPI is an intermediate layer, an interface, between the operating system and the firmware. ACPI defines two kinds of data structures: data tables and definition blocks. Data tables store raw data for device drivers. Definition blocks consist of bytecode that can be executed by an interpreter.

To allow hardware vendors flexibility in choosing their implementation, ACPI uses tables to describe system information, capabilities, and methods of controlling those capabilities. These tables list devices on the system board, or devices that cannot be detected or power-managed using other hardware standards, together with their capabilities as described in ACPI concepts. They also list system features such as supported sleep power states, descriptions of the power planes and clock sources available in the system, batteries, system indicators, and more. This enables OSPM to control system devices without knowing how system control is implemented.

The PPTT (Processor Properties Topology Table) is one of these tables. It describes the topology of the processors and can also carry additional information, such as which nodes in the processor topology constitute a physical package.

The PPTT consists of a header and a body. The header differs little from that of other ACPI tables: Signature identifies the table as a PPTT, and Length is the size of the entire table. The remaining header fields are described in the ACPI specification referenced at the end of this section. The body of the table is a list of processor topology structures.

The processor hierarchy node is one of these structures. For a processor hierarchy node, Type is set to 0 and Length is the size of the node in bytes. Flags describes properties of the processor and is detailed below. Parent refers to the node one level above this one and stores that node's offset from the start of the table.

Flags is 4 bytes long. The flags discussed here are:

  • Physical package: set to 1 if this node of the processor topology represents the boundary of a physical package. Set to 0 if this instance of processor topology does not represent a physical package boundary.

  • Processor is a Thread: for leaf entries, must be set to 1 if the processing element representing this processor shares a functional unit with sibling nodes. For non-leaf entries, must be set to 0.

  • Node is a Leaf: must be set to 1 if the node is a leaf in the processor hierarchy. Otherwise it must be set to 0.
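
The node layout and flag bits described above can be sketched in Rust. This is a minimal illustration and not StratoVirt's ProcessorHierarchyNode: the constant and function names are made up, the bit positions follow the ACPI 6.4 specification (which also defines an "ACPI Processor ID valid" bit at position 1, not discussed above), and the node is assumed to carry no private resources, which makes it 20 bytes long.

```rust
// Flag bits of a processor hierarchy node (bit positions per ACPI 6.4).
pub const PHYSICAL_PACKAGE: u32 = 1 << 0;
pub const ACPI_PROCESSOR_ID_VALID: u32 = 1 << 1;
pub const PROCESSOR_IS_THREAD: u32 = 1 << 2;
pub const NODE_IS_LEAF: u32 = 1 << 3;

/// Serialize a processor hierarchy node without private resources.
/// All multi-byte fields are little-endian, as in other ACPI tables.
pub fn hierarchy_node(flags: u32, parent: u32, acpi_processor_id: u32) -> Vec<u8> {
    let mut node = Vec::with_capacity(20);
    node.push(0u8); // Type: 0 = processor hierarchy node
    node.push(20u8); // Length: size of this node in bytes
    node.extend_from_slice(&[0u8; 2]); // Reserved
    node.extend_from_slice(&flags.to_le_bytes()); // Flags
    node.extend_from_slice(&parent.to_le_bytes()); // Parent: offset of the parent node
    node.extend_from_slice(&acpi_processor_id.to_le_bytes()); // ACPI Processor ID
    node.extend_from_slice(&0u32.to_le_bytes()); // Number of private resources
    node
}
```

Under these bit positions, the value 0xE used for thread nodes in the StratoVirt code later in this article decodes to "ACPI Processor ID valid", "Processor is a Thread" and "Node is a Leaf", and 0xA decodes to "ACPI Processor ID valid" and "Node is a Leaf".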

Reference: https://uefi.org/specs/ACPI/6.4/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#processor-properties-topology-table-pptt

Device Tree

Device Tree is a data structure that describes hardware. The bootloader loads the Device Tree into memory, and the kernel then obtains the hardware details by parsing it. A Device Tree is a tree of named nodes and properties. Nodes can contain sub-nodes, and the relationships between them form the tree. Properties are name-value pairs.

A typical device tree is as follows:

ARM’s CPU topology is defined in the cpu-map node, which is a child node of the cpus node. The cpu-map node can contain three kinds of sub-nodes: cluster nodes, core nodes and thread nodes. A complete dts example is as follows:

cpus {
    #size-cells = <0>;
    #address-cells = <2>;

    cpu-map {
        cluster0 {
            cluster0 {
                core0 {
                    thread0 {
                        cpu = <&CPU0>;
                    };
                    thread1 {
                        cpu = <&CPU1>;
                    };
                };

                core1 {
                    thread0 {
                        cpu = <&CPU2>;
                    };
                    thread1 {
                        cpu = <&CPU3>;
                    };
                };
            };

            cluster1 {
                core0 {
                    thread0 {
                        cpu = <&CPU4>;
                    };
                    thread1 {
                        cpu = <&CPU5>;
                    };
                };

                core1 {
                    thread0 {
                        cpu = <&CPU6>;
                    };
                    thread1 {
                        cpu = <&CPU7>;
                    };
                };
            };
        };
    };

    //...
};

Reference: https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/topology.txt

Figure source: https://www.devicetree.org/specifications/

StratoVirt specific implementation

CPUID

First, we calculate the bit offset of each topology level within the APIC ID, and then obtain or create the corresponding CPUID entries. When the function value of an entry is 0xB or 0x1F, we set EAX, EBX and ECX according to the CPUID specification: EAX holds the number of bits to shift right to reach the topology ID of the next level, EBX holds the number of logical processors at that level, and ECX holds the level number and the level type. Leaf 0xB needs entries for index 0 and 1, and leaf 0x1F needs entries for index 0, 1 and 2. Here is the corresponding code:

// cpu/src/x86_64/mod.rs
const ECX_INVALID: u32 = 0u32 << 8;
const ECX_THREAD: u32 = 1u32 << 8;
const ECX_CORE: u32 = 2u32 << 8;
const ECX_DIE: u32 = 5u32 << 8;

impl X86CPUState {
    fn setup_cpuid(&self, vcpu_fd: &Arc<VcpuFd>) -> Result<()> {
        // calculate topology ID
        let core_offset = 32u32 - (self.nr_threads - 1).leading_zeros();
        let die_offset = (32u32 - (self.nr_cores - 1).leading_zeros()) + core_offset;
        let pkg_offset = (32u32 - (self.nr_dies - 1).leading_zeros()) + die_offset;

        // Get the fd of KVM and get the CPUID entries it supports

        for entry in entries.iter_mut() {
            match entry.function {
                //...
                0xb => {
                    // Extended Topology Enumeration Leaf
                    entry.edx = self.apic_id as u32;
                    entry.ecx = entry.index & 0xff;
                    match entry.index {
                        0 => {
                            entry.eax = core_offset;
                            entry.ebx = self.nr_threads;
                            entry.ecx |= ECX_THREAD;
                        }
                        1 => {
                            entry.eax = pkg_offset;
                            entry.ebx = self.nr_threads * self.nr_cores;
                            entry.ecx |= ECX_CORE;
                        }
                        _ => {
                            entry.eax = 0;
                            entry.ebx = 0;
                            entry.ecx |= ECX_INVALID;
                        }
                    }
                }
                // 0x1f extension, support die level
                0x1f => {
                    if self.nr_dies < 2 {
                        entry.eax = 0;
                        entry.ebx = 0;
                        entry.ecx = 0;
                        entry.edx = 0;
                        continue;
                    }

                    entry.edx = self.apic_id as u32;
                    entry.ecx = entry.index & 0xff;

                    match entry.index {
                        0 => {
                            entry.eax = core_offset;
                            entry.ebx = self.nr_threads;
                            entry.ecx |= ECX_THREAD;
                        }
                        1 => {
                            entry.eax = die_offset;
                            entry.ebx = self.nr_cores * self.nr_threads;
                            entry.ecx |= ECX_CORE;
                        }
                        2 => {
                            entry.eax = pkg_offset;
                            entry.ebx = self.nr_dies * self.nr_cores * self.nr_threads;
                            entry.ecx |= ECX_DIE;
                        }
                        _ => {
                            entry.eax = 0;
                            entry.ebx = 0;
                            entry.ecx |= ECX_INVALID;
                        }
                    }
                }
                //...
            }
        }
        //...

        Ok(())
    }
}
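
The offset calculation at the top of setup_cpuid can be checked with a small worked example. The sketch below isolates the same `32 - (n - 1).leading_zeros()` expression in standalone functions; the function names are illustrative, and every count is assumed to be at least 1.

```rust
/// Number of bits needed to encode `count` distinct IDs (0 when count is 1).
/// `count` must be at least 1.
pub fn id_bits(count: u32) -> u32 {
    32 - (count - 1).leading_zeros()
}

/// Bit offsets of the core, die and package IDs inside the APIC ID.
pub fn topology_offsets(threads: u32, cores: u32, dies: u32) -> (u32, u32, u32) {
    // Thread IDs occupy the lowest bits, so core IDs start right after them.
    let core_offset = id_bits(threads);
    // Die IDs start after the core IDs.
    let die_offset = core_offset + id_bits(cores);
    // Package IDs start after the die IDs.
    let pkg_offset = die_offset + id_bits(dies);
    (core_offset, die_offset, pkg_offset)
}
```

With 2 threads per core, 4 cores per die and 2 dies per package, the offsets are 1, 3 and 4: one bit for the thread ID, two bits for the core ID and one bit for the die ID. A count that is not a power of two is rounded up, so 3 cores also take two bits.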

PPTT

The table is built according to the ACPI PPTT specification. We record the offset of each node so that its child nodes can point to it. We also assign a uid to each leaf node: the uid starts at 0 and is incremented by one for every leaf node added. The Flags value of each node is set as the PPTT specification requires. Finally, we calculate the size of the whole table and update the Length field that was written initially.

// machine/src/standard_vm/aarch64/mod.rs
impl AcpiBuilder for StdMachine {
    fn build_pptt_table(
        &self,
        acpi_data: &Arc<Mutex<Vec<u8>>>,
        loader: &mut TableLoader,
    ) -> super::errors::Result<u64> {
        //...
        // Configure PPTT header

        // add socket node
        for socket in 0..self.cpu_topo.sockets {
            // Calculate the offset to the starting address
            let socket_offset = pptt.table_len() - pptt_start;
            let socket_hierarchy_node = ProcessorHierarchyNode::new(0, 0x2, 0, socket as u32);
            //...
            for cluster in 0..self.cpu_topo.clusters {
                let cluster_offset = pptt.table_len() - pptt_start;
                let cluster_hierarchy_node =
                    ProcessorHierarchyNode::new(0, 0x0, socket_offset as u32, cluster as u32);
                //...
                for core in 0..self.cpu_topo.cores {
                    let core_offset = pptt.table_len() - pptt_start;
                    // Determine whether a thread node needs to be added
                    if self.cpu_topo.threads > 1 {
                        let core_hierarchy_node =
                            ProcessorHierarchyNode::new(0, 0x0, cluster_offset as u32, core as u32);
                        //...
                        for _thread in 0..self.cpu_topo.threads {
                            let thread_hierarchy_node =
                                ProcessorHierarchyNode::new(0, 0xE, core_offset as u32, uid as u32);
                            //...
                            uid += 1;
                        }
                    } else {
                        let thread_hierarchy_node =
                            ProcessorHierarchyNode::new(0, 0xA, cluster_offset as u32, uid as u32);
                        //...
                        uid += 1;
                    }
                }
            }
        }
        // add PPTT table to loader
    }
}
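
The offset and uid bookkeeping in build_pptt_table can be followed with a simplified model that records node metadata instead of writing bytes. The sketch assumes the standard 36-byte ACPI table header and 20-byte hierarchy nodes without private resources; NodeInfo, push and layout are hypothetical names, and the flag values are copied from the code above.

```rust
const HEADER_LEN: u32 = 36; // standard ACPI table header (assumption)
const NODE_LEN: u32 = 20; // hierarchy node without private resources (assumption)

#[derive(Debug, PartialEq, Eq)]
pub struct NodeInfo {
    pub offset: u32, // offset of this node from the start of the table
    pub parent: u32, // offset of the parent node, 0 for a socket
    pub flags: u32,
    pub id: u32,
}

/// Append a node and return its offset so that children can refer to it.
fn push(nodes: &mut Vec<NodeInfo>, parent: u32, flags: u32, id: u32) -> u32 {
    let offset = HEADER_LEN + NODE_LEN * nodes.len() as u32;
    nodes.push(NodeInfo { offset, parent, flags, id });
    offset
}

/// Return the nodes in table order together with the total table length.
pub fn layout(sockets: u32, clusters: u32, cores: u32, threads: u32) -> (Vec<NodeInfo>, u32) {
    let mut nodes = Vec::new();
    let mut uid = 0u32;
    for socket in 0..sockets {
        let socket_offset = push(&mut nodes, 0, 0x2, socket);
        for cluster in 0..clusters {
            let cluster_offset = push(&mut nodes, socket_offset, 0x0, cluster);
            for core in 0..cores {
                if threads > 1 {
                    // A core node is only needed when it has several threads.
                    let core_offset = push(&mut nodes, cluster_offset, 0x0, core);
                    for _ in 0..threads {
                        push(&mut nodes, core_offset, 0xE, uid);
                        uid += 1;
                    }
                } else {
                    // Single-threaded core: the leaf hangs directly off the cluster.
                    push(&mut nodes, cluster_offset, 0xA, uid);
                    uid += 1;
                }
            }
        }
    }
    let total_len = HEADER_LEN + NODE_LEN * nodes.len() as u32;
    (nodes, total_len)
}
```

For 1 socket, 1 cluster, 2 cores and 1 thread this yields four nodes at offsets 36, 56, 76 and 96 and a total length of 116 bytes; both leaves point back to the cluster at offset 56.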

Device Tree

StratoVirt’s microvm boots with a device tree, so we need to configure the cpu-map node under the cpus node of the device tree so that the guest can parse the CPU topology. StratoVirt supports two levels of clusters. The tree is built with a four-level nested loop: the outermost loop creates the first-level clusters (one per socket), the second loop creates the second-level clusters, the third loop creates the cores, and the innermost loop creates the threads.

impl CompileFDTHelper for LightMachine {
    fn generate_cpu_nodes(&self, fdt: &mut FdtBuilder) -> util::errors::Result<()> {
        // create cpus node
        //...

        // Generate CPU topology
        // create cpu-map node
        let cpu_map_node_dep = fdt.begin_node("cpu-map")?;
        // Create the first layer of cluster nodes
        for socket in 0..self.cpu_topo.sockets {
            let sock_name = format!("cluster{}", socket);
            let sock_node_dep = fdt.begin_node(&sock_name)?;
            // create second level cluster nodes
            for cluster in 0..self.cpu_topo.clusters {
                let cluster_name = format!("cluster{}", cluster);
                let cluster_node_dep = fdt.begin_node(&cluster_name)?;
                // create core node
                for core in 0..self.cpu_topo.cores {
                    let core_name = format!("core{}", core);
                    let core_node_dep = fdt.begin_node(&core_name)?;
                    // create thread node
                    for thread in 0..self.cpu_topo.threads {
                        let thread_name = format!("thread{}", thread);
                        let thread_node_dep = fdt.begin_node(&thread_name)?;
                        // Calculate the id of the cpu
                        // let vcpuid = ...
                        // then add to the node
                        // Close the thread node before opening the next one
                        fdt.end_node(thread_node_dep)?;
                    }
                    fdt.end_node(core_node_dep)?;
                }
                fdt.end_node(cluster_node_dep)?;
            }
            fdt.end_node(sock_node_dep)?;
        }
        fdt.end_node(cpu_map_node_dep)?;

        Ok(())
    }
}

The structure of the device tree built by this code is essentially the same as the cpu-map example shown in the Device Tree section above.
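
To see what the nested loops produce, the same traversal can be written as a standalone function that emits dts-like text instead of calling FdtBuilder. This is a sketch for illustration only: cpu_map_dts is a hypothetical function, and the CPU labels are assumed to be numbered sequentially in traversal order.

```rust
/// Render a cpu-map node as dts-like text for a uniform topology.
pub fn cpu_map_dts(sockets: u32, clusters: u32, cores: u32, threads: u32) -> String {
    let mut out = String::from("cpu-map {\n");
    let mut cpu = 0u32;
    // First-level clusters, one per socket.
    for socket in 0..sockets {
        out.push_str(&format!("  cluster{} {{\n", socket));
        // Second-level clusters.
        for cluster in 0..clusters {
            out.push_str(&format!("    cluster{} {{\n", cluster));
            for core in 0..cores {
                out.push_str(&format!("      core{} {{\n", core));
                for thread in 0..threads {
                    // Each thread node refers to one CPU node by label.
                    out.push_str(&format!(
                        "        thread{} {{ cpu = <&CPU{}>; }};\n",
                        thread, cpu
                    ));
                    cpu += 1;
                }
                out.push_str("      };\n");
            }
            out.push_str("    };\n");
        }
        out.push_str("  };\n");
    }
    out.push_str("};\n");
    out
}
```

Calling cpu_map_dts(1, 2, 2, 2) gives one first-level cluster that contains two second-level clusters, each with two cores of two threads, that is, eight thread nodes referring to CPU0 through CPU7.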

Verification method

We can start a virtual machine with the following command. The smp parameter configures the vCPU topology:

sudo ./target/release/stratovirt \
    -machine virt \
    -kernel /home/hwy/std-vmlinux.bin.1 \
    -append "console=ttyAMA0 root=/dev/vda rw reboot=k panic=1" \
    -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,unit=0,readonly=true \
    -drive file=/home/hwy/openEuler-22.03-LTS-stratovirt-aarch64.img,id=rootfs,readonly=false \
    -device virtio-blk-pci,drive=rootfs,bus=pcie.0,addr=0x1c.0x0,id=rootfs \
    -qmp unix:/var/tmp/hwy.socket,server,nowait \
    -serial stdio \
    -m 2048 \
    -smp 4,sockets=2,clusters=1,cores=2,threads=1

Next, we can view the configured topology by observing the files under /sys/devices/system/cpu/cpu0/topology.

[root@StratoVirt topology] ll
total 0
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_id
-r--r--r-- 1 root root 64K Jul 18 09:04 core_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 core_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:01 core_id
-r--r--r-- 1 root root 64K Jul 18 09:01 core_siblings
-r--r--r-- 1 root root 64K Jul 18 09:04 core_siblings_list
-r--r--r-- 1 root root 64K Jul 18 09:04 die_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 die_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:04 die_id
-r--r--r-- 1 root root 64K Jul 18 09:04 package_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 package_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:01 physical_package_id
-r--r--r-- 1 root root 64K Jul 18 09:01 thread_siblings
-r--r--r-- 1 root root 64K Jul 18 09:04 thread_siblings_list

For example:

cat core_cpus_list

prints

0

which indicates that cpu0 is the only CPU in its core.

cat package_cpus_list

prints

0-1

which indicates that the CPUs in the same socket as cpu0 are cpu0 to cpu1.
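
When checking many CPUs, it can help to expand these list files programmatically. The following is a small sketch of a parser for the comma-separated range format shown above (for example "0-1" or "0-1,4"); parse_cpu_list is a hypothetical helper, not part of StratoVirt or of any kernel tool.

```rust
/// Expand a CPU list such as "0-1,4" into individual CPU numbers.
/// Returns None when the text is not a valid list.
pub fn parse_cpu_list(text: &str) -> Option<Vec<u32>> {
    let mut cpus = Vec::new();
    let text = text.trim();
    if text.is_empty() {
        return Some(cpus);
    }
    for part in text.split(',') {
        match part.split_once('-') {
            // A range "lo-hi" expands to every CPU from lo to hi inclusive.
            Some((lo, hi)) => {
                let lo = lo.trim().parse::<u32>().ok()?;
                let hi = hi.trim().parse::<u32>().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            // A single number names one CPU.
            None => cpus.push(part.trim().parse::<u32>().ok()?),
        }
    }
    Some(cpus)
}
```

The contents of a file such as package_cpus_list can be read with std::fs::read_to_string and passed to this function; the trailing newline is trimmed before parsing.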

Tools such as lscpu can also assist in verification.

lscpu

Running lscpu produces output similar to the following. Note that this sample reports 64 CPUs, so it does not correspond to the 4-vCPU command shown above.

Architecture: aarch64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
CPU(s): 64
  On-line CPU(s) list: 0-63
Vendor ID: ARM
  Model name: Cortex-A72
    Model: 2
    Thread(s) per core: 1
    Core(s) per cluster: 16
    Socket(s): -
    Cluster(s): 4
    Stepping: r0p2
    BogoMIPS: 100.00
    Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
NUMA:
  NUMA node(s): 4
  NUMA node0 CPU(s): 0-15
  NUMA node1 CPU(s): 16-31
  NUMA node2 CPU(s): 32-47
  NUMA node3 CPU(s): 48-63
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Spec store bypass: Vulnerable
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Vulnerable
  Srbds: Not affected
  Tsx async abort: Not affected