vCPU topology for StratoVirt (SMP)

CPU topology describes how CPUs are organized at the hardware level. This article focuses on the SMP (Symmetric Multi-Processor) part of the CPU topology. The topology also covers other information, such as caches, which will be added later. Besides describing how the CPUs are composed, the CPU topology is used by the kernel’s scheduler to make better scheduling decisions. In StratoVirt, CPU topology support also lays the foundation for the later development of CPU hot-plug.

A common CPU SMP hierarchy is:

socket --> die --> cluster --> core --> thread
  • socket: corresponds to a CPU socket on the motherboard

  • die: a small square of silicon cut from the wafer during processor manufacturing. The components within a die are interconnected through the on-chip bus.

  • cluster: a group of cores, for example a group of big cores or of little cores

  • core: an independent physical CPU

  • thread: a logical CPU, a concept introduced with Intel Hyper-Threading Technology
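As a rough illustration of the hierarchy above, the sketch below models each level as a per-parent count and splits a flat logical CPU index into per-level indices. The `SmpTopology` and `CpuPosition` types, their fields, and the numbering scheme (threads vary fastest, sockets slowest) are assumptions made for this example; they are not taken from StratoVirt.

```rust
/// Per-parent counts for each level of the hierarchy (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SmpTopology {
    pub sockets: u32,
    pub dies: u32,
    pub clusters: u32,
    pub cores: u32,
    pub threads: u32,
}

/// Position of one logical CPU inside the hierarchy (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuPosition {
    pub socket: u32,
    pub die: u32,
    pub cluster: u32,
    pub core: u32,
    pub thread: u32,
}

impl SmpTopology {
    /// Total number of logical CPUs: the product of all level counts.
    pub fn total_cpus(&self) -> u32 {
        self.sockets * self.dies * self.clusters * self.cores * self.threads
    }

    /// Split a flat CPU index into per-level indices.
    /// Returns `None` when the index is out of range.
    pub fn position(&self, cpu: u32) -> Option<CpuPosition> {
        if cpu >= self.total_cpus() {
            return None;
        }
        // Peel off one level at a time, starting with the fastest-varying one.
        let thread = cpu % self.threads;
        let rest = cpu / self.threads;
        let core = rest % self.cores;
        let rest = rest / self.cores;
        let cluster = rest % self.clusters;
        let rest = rest / self.clusters;
        let die = rest % self.dies;
        let socket = rest / self.dies;
        Some(CpuPosition { socket, die, cluster, core, thread })
    }
}
```

With 2 sockets, 1 die, 1 cluster, 2 cores and 2 threads, `total_cpus()` is 8 and logical CPU 5 lands on socket 1, core 0, thread 1.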

Principle of obtaining CPU topology

x86 and ARM obtain the CPU topology in different ways, so they are described separately below.

x86

On x86, the operating system obtains the CPU topology by executing the CPUID instruction. CPUID (the name is derived from "CPU identification") is a supplementary processor instruction that lets software discover details about the processor, for example the processor type.

CPUID implicitly uses the EAX register to select the main category of information to return, which is called the CPUID leaf. The leaves related to CPU topology are 0BH and 1FH. 1FH is an extension of 0BH that can represent more levels. Intel recommends checking whether 1FH exists first and, if it does, using it in preference to 0BH.

When EAX is initialized to 0BH, CPUID returns core/logical processor topology information in the EAX, EBX, ECX and EDX registers. This leaf also requires ECX to be initialized to an index that selects the level to query, for example the logical processor level or the core level. The OS calls the leaf with ECX = 0, 1, 2, ..., n. The order in which the levels are returned matters: each level reports cumulative data, so some information depends on values retrieved from previous levels. Under 0BH, the levels that ECX can represent are SMT and Core. Under 1FH, they are SMT, Core, Module, Tile and Die.
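The enumeration order described above can be sketched as follows. To keep the example portable, the instruction itself is replaced by a caller-supplied closure `cpuid(leaf, subleaf)`; the `CpuidRegs` and `TopoLevel` types and the function name are hypothetical. The existence check for 1FH (maximum basic leaf at least 1FH and a non-zero EBX[15:0] for sub-leaf 0) and the use of level type 0 as the end marker are my reading of the CPUID specification, so treat them as assumptions rather than a definitive implementation.

```rust
/// Register values returned by one CPUID invocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuidRegs {
    pub eax: u32,
    pub ebx: u32,
    pub ecx: u32,
    pub edx: u32,
}

/// One topology level reported by leaf 0BH or 1FH.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TopoLevel {
    pub level_number: u32,
    pub level_type: u32,
    pub shift: u32,
    pub logical_processors: u32,
}

/// Walk the topology leaf with ECX = 0, 1, 2, ... until an invalid level appears.
/// Returns the leaf that was used together with the levels it reported.
pub fn enumerate_levels<F>(cpuid: F) -> Option<(u32, Vec<TopoLevel>)>
where
    F: Fn(u32, u32) -> CpuidRegs,
{
    let max_basic_leaf = cpuid(0, 0).eax;
    if max_basic_leaf < 0x0b {
        return None; // neither topology leaf is available
    }
    // Prefer 1FH when it is implemented and reports a non-empty first level.
    let leaf = if max_basic_leaf >= 0x1f && (cpuid(0x1f, 0).ebx & 0xffff) != 0 {
        0x1f
    } else {
        0x0b
    };

    let mut levels = Vec::new();
    // The level number is an 8-bit field, so 256 sub-leaves is a safe upper bound.
    for subleaf in 0..256u32 {
        let regs = cpuid(leaf, subleaf);
        let level_type = (regs.ecx >> 8) & 0xff;
        if level_type == 0 {
            break; // level type 0 marks an invalid level
        }
        levels.push(TopoLevel {
            level_number: regs.ecx & 0xff,
            level_type,
            shift: regs.eax & 0x1f,
            logical_processors: regs.ebx & 0xffff,
        });
    }
    Some((leaf, levels))
}
```

On real x86 hardware the closure could wrap `core::arch::x86_64::__cpuid_count`; in a test it can simply return canned register values.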

The table below gives a more detailed explanation:

Initial EAX Value   Information Provided about the Processor

0BH, 1FH            (both leaves report the same register layout)
  EAX  Bits 04-00: Number of bits to shift right on the x2APIC ID to get a unique topology ID of the next level type. All logical processors with the same next level ID share the current level.
       Bits 31-05: Reserved.
  EBX  Bits 15-00: Number of logical processors at this level type. The number reflects the configuration as shipped by Intel.
       Bits 31-16: Reserved.
  ECX  Bits 07-00: Level number. Same value as the ECX input.
       Bits 15-08: Level type.
       Bits 31-16: Reserved.
  EDX  Bits 31-00: x2APIC ID of the current logical processor.

Source: Intel 64 and IA-32 Architectures Software Developer’s Manual
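To make the role of EAX bits 04-00 more concrete, here is a minimal sketch that splits an x2APIC ID into thread, core and package IDs using the shift widths reported for the SMT and Core levels. The `ApicTopoIds` type and the `split_x2apic_id` function are hypothetical names introduced for this example, and the sketch only covers the two levels that leaf 0BH can report.

```rust
/// Topology IDs extracted from an x2APIC ID (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct ApicTopoIds {
    pub thread_id: u32,
    pub core_id: u32,
    pub package_id: u32,
}

/// `smt_shift` is EAX[4:0] of the SMT level, `core_shift` is EAX[4:0] of the Core level.
pub fn split_x2apic_id(x2apic_id: u32, smt_shift: u32, core_shift: u32) -> Option<ApicTopoIds> {
    // Shifts are cumulative, so the Core-level shift can never be the smaller one,
    // and a 5-bit field cannot exceed 31.
    if core_shift < smt_shift || core_shift > 31 {
        return None;
    }
    let smt_mask = (1u32 << smt_shift) - 1;
    let core_mask = (1u32 << (core_shift - smt_shift)) - 1;
    Some(ApicTopoIds {
        // Bits below the SMT shift number the thread within its core.
        thread_id: x2apic_id & smt_mask,
        // Bits between the two shifts number the core within its package.
        core_id: (x2apic_id >> smt_shift) & core_mask,
        // Everything above the Core shift identifies the package.
        package_id: x2apic_id >> core_shift,
    })
}
```

For example, with an SMT shift of 1 and a Core shift of 3, the x2APIC ID 0b1011 decodes to thread 1, core 1, package 1.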

ARM

On ARM, an operating system that boots with a Device Tree obtains the CPU topology from the Device Tree. If it boots in ACPI mode, it obtains the CPU topology by parsing the ACPI PPTT table.

ACPI–PPTT

ACPI is the abbreviation of Advanced Configuration and Power Interface. ACPI is an architecture-independent power management and configuration framework. This framework establishes a set of hardware registers to define power states. ACPI is an intermediate layer between the operating system and firmware, and an interface between them. ACPI defines two data structures: data tables and definition blocks. Data tables are used to store raw data for use by device drivers. Definition blocks consist of bytecodes that can be executed by the interpreter.

To give hardware vendors flexibility in choosing their implementations, ACPI uses tables to describe system information, features, and the methods for controlling those features. These tables list the devices on the system board, or devices that cannot be detected or power managed using other hardware standards, together with their capabilities as described in ACPI Concepts. They also list system capabilities such as the supported sleeping power states, the power planes and clock sources available in the system, batteries, system indicators, and so on. This enables OSPM to control system devices without knowing how the system controls are implemented.

The PPTT (Processor Properties Topology Table) is one of these tables. It describes the topology of the processors and can also carry additional information, such as which nodes in the processor topology constitute a physical package.

The following table shows the structure of the PPTT, which consists of a header and a body. The header is much the same as in other ACPI tables: Signature identifies the table as a PPTT, and Length is the size of the entire table. The remaining fields are listed in the table below. The body is a series of processor topology structures.

The following table shows the processor hierarchy node structure. For a processor hierarchy node, Type is set to 0 and Length is the number of bytes in the node. Flags describes properties of the processor; see the description of Flags below. Parent refers to the parent of this node and is stored as an offset.

The following table shows the structure of Flags, which is 4 bytes long.

  • Physical package: set to 1 if this node of the processor topology represents the boundary of a physical package. Set to 0 if this instance of the processor topology does not represent a physical package boundary.

  • Processor is a Thread: for leaf entries, must be set to 1 if the processing element representing this processor shares a functional unit with a sibling node. For non-leaf entries, must be set to 0.

  • Node is a Leaf: must be set to 1 if the node is a leaf in the processor hierarchy. Otherwise it must be set to 0.

Reference: https://uefi.org/specs/ACPI/6.4/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#processor-properties-topology-table-pptt
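The node layout and the Flags bits described in this section can be sketched in a few lines. The bit positions and the 20-byte layout (Type, Length, Reserved, Flags, Parent, ACPI Processor ID, Number of Private Resources) follow my reading of the ACPI 6.4 specification referenced above; the constant and function names are hypothetical, and the sketch assumes a node with no private resources.

```rust
/// Flag bits of a processor hierarchy node (bit positions per ACPI 6.4).
pub const FLAG_PHYSICAL_PACKAGE: u32 = 1 << 0;
pub const FLAG_ACPI_PROCESSOR_ID_VALID: u32 = 1 << 1;
pub const FLAG_PROCESSOR_IS_THREAD: u32 = 1 << 2;
pub const FLAG_NODE_IS_LEAF: u32 = 1 << 3;

/// Size in bytes of a node that lists no private resources.
pub const NODE_LEN: u8 = 20;

/// Encode one processor hierarchy node as little-endian bytes.
pub fn encode_node(flags: u32, parent_offset: u32, acpi_processor_id: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(NODE_LEN as usize);
    buf.push(0); // Type: 0 = processor hierarchy node
    buf.push(NODE_LEN); // Length of this node in bytes
    buf.extend_from_slice(&[0, 0]); // Reserved
    buf.extend_from_slice(&flags.to_le_bytes()); // Flags
    buf.extend_from_slice(&parent_offset.to_le_bytes()); // Parent, 0 = no parent
    buf.extend_from_slice(&acpi_processor_id.to_le_bytes()); // ACPI Processor ID
    buf.extend_from_slice(&0u32.to_le_bytes()); // Number of private resources
    buf
}
```

With these bit positions, a leaf that is a thread with a valid ACPI Processor ID has Flags 0xE, and a leaf that is not a thread has Flags 0xA. These are the same values that appear in the StratoVirt excerpt later in this article.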

Device Tree

Device Tree is a data structure that describes hardware. The kernel’s startup program loads the device tree into memory and then parses it to obtain the hardware details. A Device Tree is a tree of named nodes and properties. Nodes can contain child nodes, and the relationships between them form the tree. Properties are name/value pairs.

A typical device tree is as follows:

ARM’s CPU topology is defined in the cpu-map node, which is a child of the cpus node. The cpu-map node can contain three types of child nodes: cluster, core and thread. A complete dts example follows:

cpus {
 #size-cells = <0>;
 #address-cells = <2>;

 cpu-map {
  cluster0 {
   cluster0 {
    core0 {
     thread0 {
      cpu = <&CPU0>;
     };
     thread1 {
      cpu = <&CPU1>;
     };
    };

    core1 {
     thread0 {
      cpu = <&CPU2>;
     };
     thread1 {
      cpu = <&CPU3>;
     };
    };
   };

   cluster1 {
    core0 {
     thread0 {
      cpu = <&CPU4>;
     };
     thread1 {
      cpu = <&CPU5>;
     };
    };

    core1 {
     thread0 {
      cpu = <&CPU6>;
     };
     thread1 {
      cpu = <&CPU7>;
     };
    };
   };
  };
 };

 //...
};

Reference: https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/topology.txt

Figure source: https://www.devicetree.org/specifications/

StratoVirt specific implementation

CPUID

First, we calculate the bit offset of each topology level within the APIC ID, and then obtain or create the corresponding CPUID entries. For entries whose function is 0xB or 0x1F, we set EAX, EBX, ECX and EDX according to the CPUID specification: EAX holds the number of bits to shift the x2APIC ID right to reach the topology ID of the next level, EBX holds the number of logical processors at that level, ECX holds the level number and level type, and EDX holds the x2APIC ID of the vCPU. Leaf 0xB is populated for index 0 and 1, and leaf 0x1F for index 0, 1 and 2. The corresponding code is:

// cpu/src/x86_64/mod.rs
const ECX_INVALID: u32 = 0u32 << 8;
const ECX_THREAD: u32 = 1u32 << 8;
const ECX_CORE: u32 = 2u32 << 8;
const ECX_DIE: u32 = 5u32 << 8;

impl X86CPUState {
    fn setup_cpuid(&self, vcpu_fd: &Arc<VcpuFd>) -> Result<()> {
        // Calculate topology ID
        let core_offset = 32u32 - (self.nr_threads - 1).leading_zeros();
        let die_offset = (32u32 - (self.nr_cores - 1).leading_zeros()) + core_offset;
        let pkg_offset = (32u32 - (self.nr_dies - 1).leading_zeros()) + die_offset;

        // Get KVM's fd and get the CPUID entries it supports

        for entry in entries.iter_mut() {
            match entry.function {
                // ...
                0xb => {
                    // Extended Topology Enumeration Leaf
                    entry.edx = self.apic_id as u32;
                    entry.ecx = entry.index & 0xff;
                    match entry.index {
                        0 => {
                            entry.eax = core_offset;
                            entry.ebx = self.nr_threads;
                            entry.ecx |= ECX_THREAD;
                        }
                        1 => {
                            entry.eax = pkg_offset;
                            entry.ebx = self.nr_threads * self.nr_cores;
                            entry.ecx |= ECX_CORE;
                        }
                        _ => {
                            entry.eax = 0;
                            entry.ebx = 0;
                            entry.ecx |= ECX_INVALID;
                        }
                    }
                }
                // 0x1f extension, supports die level
                0x1f => {
                    if self.nr_dies < 2 {
                        entry.eax = 0;
                        entry.ebx = 0;
                        entry.ecx = 0;
                        entry.edx = 0;
                        continue;
                    }

                    entry.edx = self.apic_id as u32;
                    entry.ecx = entry.index & 0xff;

                    match entry.index {
                        0 => {
                            entry.eax = core_offset;
                            entry.ebx = self.nr_threads;
                            entry.ecx |= ECX_THREAD;
                        }
                        1 => {
                            entry.eax = die_offset;
                            entry.ebx = self.nr_cores * self.nr_threads;
                            entry.ecx |= ECX_CORE;
                        }
                        2 => {
                            entry.eax = pkg_offset;
                            entry.ebx = self.nr_dies * self.nr_cores * self.nr_threads;
                            entry.ecx |= ECX_DIE;
                        }
                        _ => {
                            entry.eax = 0;
                            entry.ebx = 0;
                            entry.ecx |= ECX_INVALID;
                        }
                    }
                }
                // ...
            }
        }
        // ...
        Ok(())
    }
}
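The offset arithmetic at the top of `setup_cpuid` is easier to follow with concrete numbers. The sketch below isolates it: `32 - (n - 1).leading_zeros()` is the number of bits needed to number `n` items, and each offset is the running sum of those widths. The function names are hypothetical, and `compose_apic_id` is only an assumption about how per-level indices could be packed with these offsets; the excerpt above does not show how StratoVirt computes `apic_id`.

```rust
/// Number of ID bits needed to number `count` items (0 when `count` is 1).
pub fn field_width(count: u32) -> u32 {
    assert!(count >= 1, "a topology level needs at least one member");
    32 - (count - 1).leading_zeros()
}

/// Bit offsets at which the core, die and package fields start.
pub fn topo_offsets(nr_threads: u32, nr_cores: u32, nr_dies: u32) -> (u32, u32, u32) {
    let core_offset = field_width(nr_threads);
    let die_offset = core_offset + field_width(nr_cores);
    let pkg_offset = die_offset + field_width(nr_dies);
    (core_offset, die_offset, pkg_offset)
}

/// Pack per-level indices into one ID using those offsets (illustrative only).
pub fn compose_apic_id(
    pkg: u32,
    die: u32,
    core: u32,
    thread: u32,
    offsets: (u32, u32, u32),
) -> u32 {
    let (core_offset, die_offset, pkg_offset) = offsets;
    (pkg << pkg_offset) | (die << die_offset) | (core << core_offset) | thread
}
```

With 2 threads, 4 cores and 2 dies the offsets are (1, 3, 4). With 2 threads, 3 cores and 1 die they are (1, 3, 3): three cores still need two bits, and a single die needs none, so the die and package offsets coincide.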

PPTT

To build the table according to the ACPI PPTT specification, we need to record the offset of each node so that its child nodes can point to it. We also need to calculate the uid of each processor node: the uid starts at 0 and increases by one for each leaf node added. The Flags value must also be set according to the PPTT specification. Finally, we need to calculate the size of the whole table and update the original length value.

// machine/src/standard_vm/aarch64/mod.rs
impl AcpiBuilder for StdMachine {
    fn build_pptt_table(
        &self,
        acpi_data: &Arc<Mutex<Vec<u8>>>,
        loader: &mut TableLoader,
    ) -> super::errors::Result<u64> {
        // ...
        //Configure PPTT header

        //Add socket node
        for socket in 0..self.cpu_topo.sockets {
            // Calculate the offset to the starting address
            let socket_offset = pptt.table_len() - pptt_start;
            let socket_hierarchy_node = ProcessorHierarchyNode::new(0, 0x2, 0, socket as u32);
            // ...
            for cluster in 0..self.cpu_topo.clusters {
                let cluster_offset = pptt.table_len() - pptt_start;
                let cluster_hierarchy_node =
                    ProcessorHierarchyNode::new(0, 0x0, socket_offset as u32, cluster as u32);
                // ...
                for core in 0..self.cpu_topo.cores {
                    let core_offset = pptt.table_len() - pptt_start;
                    // Determine whether thread node needs to be added
                    if self.cpu_topo.threads > 1 {
                        let core_hierarchy_node =
                            ProcessorHierarchyNode::new(0, 0x0, cluster_offset as u32, core as u32);
                        // ...
                        for _thread in 0..self.cpu_topo.threads {
                            let thread_hierarchy_node =
                                ProcessorHierarchyNode::new(0, 0xE, core_offset as u32, uid as u32);
                            // ...
                            uid += 1;
                        }
                    } else {
                        let thread_hierarchy_node =
                            ProcessorHierarchyNode::new(0, 0xA, cluster_offset as u32, uid as u32);
                        // ...
                        uid += 1;
                    }
                }
            }
        }
        //Add the PPTT table to the loader
    }
}
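The bookkeeping in this excerpt (parent offsets, leaf uids and the final table length) can be tried out in isolation. The sketch below is not StratoVirt code: it plans the node list without writing any bytes, assumes every node is 20 bytes long and that the nodes directly follow the 36-byte standard ACPI table header, and uses the hypothetical names `PlannedNode`, `push_node` and `plan_pptt`. The flag values mirror the ones in the excerpt.

```rust
/// Bytes occupied by the standard ACPI table header that precedes the nodes.
const ACPI_HEADER_LEN: u32 = 36;
/// Bytes occupied by one hierarchy node without private resources.
const NODE_LEN: u32 = 20;

#[derive(Debug, PartialEq, Eq)]
pub struct PlannedNode {
    /// Offset of this node from the start of the table.
    pub offset: u32,
    /// Offset of the parent node, 0 for a top-level node.
    pub parent: u32,
    pub flags: u32,
    pub id: u32,
}

/// Append a node and return its offset so that children can point back to it.
fn push_node(nodes: &mut Vec<PlannedNode>, parent: u32, flags: u32, id: u32) -> u32 {
    let offset = ACPI_HEADER_LEN + NODE_LEN * nodes.len() as u32;
    nodes.push(PlannedNode { offset, parent, flags, id });
    offset
}

/// Plan all nodes and return them together with the total table length.
pub fn plan_pptt(sockets: u32, clusters: u32, cores: u32, threads: u32) -> (Vec<PlannedNode>, u32) {
    let mut nodes = Vec::new();
    let mut uid = 0u32; // incremented once per leaf node
    for socket in 0..sockets {
        let socket_offset = push_node(&mut nodes, 0, 0x2, socket);
        for cluster in 0..clusters {
            let cluster_offset = push_node(&mut nodes, socket_offset, 0x0, cluster);
            for core in 0..cores {
                if threads > 1 {
                    // The core is an inner node and its threads are the leaves.
                    let core_offset = push_node(&mut nodes, cluster_offset, 0x0, core);
                    for _ in 0..threads {
                        push_node(&mut nodes, core_offset, 0xE, uid);
                        uid += 1;
                    }
                } else {
                    // Without SMT the core itself is the leaf.
                    push_node(&mut nodes, cluster_offset, 0xA, uid);
                    uid += 1;
                }
            }
        }
    }
    let table_len = ACPI_HEADER_LEN + NODE_LEN * nodes.len() as u32;
    (nodes, table_len)
}
```

For 2 sockets, 1 cluster, 2 cores and 1 thread this plans 8 nodes (2 sockets, 2 clusters, 4 leaf cores) and a table length of 196 bytes.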

Device Tree

StratoVirt’s microvm boots with a device tree, so we need to add a cpu-map node under the cpus node of the device tree for the guest to be able to parse the CPU topology. StratoVirt supports two levels of clusters, and the tree is built with four nested loops: the outermost loop creates the first-level clusters, the second creates the second-level clusters, the third creates the cores, and the innermost creates the threads.

impl CompileFDTHelper for LightMachine {
    fn generate_cpu_nodes(&self, fdt: &mut FdtBuilder) -> util::errors::Result<()> {
        // Create cpus node
        // ...

        // Generate CPU topology
        // Create cpu-map node
        let cpu_map_node_dep = fdt.begin_node("cpu-map")?;
        // Create the first layer cluster node
        for socket in 0..self.cpu_topo.sockets {
            let sock_name = format!("cluster{}", socket);
            let sock_node_dep = fdt.begin_node(&sock_name)?;
            // Create the second layer cluster node
            for cluster in 0..self.cpu_topo.clusters {
                let cluster_name = format!("cluster{}", cluster);
                let cluster_node_dep = fdt.begin_node(&cluster_name)?;
                // Create core node
                for core in 0..self.cpu_topo.cores {
                    let core_name = format!("core{}", core);
                    let core_node_dep = fdt.begin_node(&core_name)?;
                    // Create thread node
                    for thread in 0..self.cpu_topo.threads {
                        let thread_name = format!("thread{}", thread);
                        let thread_node_dep = fdt.begin_node(&thread_name)?;
                        // Calculate the id of the cpu
                        // let vcpuid = ...
                        // Then add it to the node
                        // Close the thread node so that the next thread becomes its sibling
                        fdt.end_node(thread_node_dep)?;
                    }
                    fdt.end_node(core_node_dep)?;
                }
                fdt.end_node(cluster_node_dep)?;
            }
            fdt.end_node(sock_node_dep)?;
        }
        fdt.end_node(cpu_map_node_dep)?;

        Ok(())
    }
}

The device tree built by this code has essentially the same structure as the dts example shown in the Device Tree principle section above.
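To see what the four nested loops produce without an FDT builder, the following sketch renders the same hierarchy as DTS text. It is a standalone illustration, not StratoVirt code: `render_cpu_map` is a hypothetical helper, and the vCPU numbering (threads vary fastest, sockets slowest) is an assumption, since the excerpt above elides how `vcpuid` is calculated.

```rust
/// Render a cpu-map node as DTS text for the given per-parent counts.
pub fn render_cpu_map(sockets: u32, clusters: u32, cores: u32, threads: u32) -> String {
    let mut out = String::from("cpu-map {\n");
    for socket in 0..sockets {
        // First-level cluster, one per socket.
        out.push_str(&format!("  cluster{} {{\n", socket));
        for cluster in 0..clusters {
            // Second-level cluster.
            out.push_str(&format!("    cluster{} {{\n", cluster));
            for core in 0..cores {
                out.push_str(&format!("      core{} {{\n", core));
                for thread in 0..threads {
                    // Assumed numbering: threads vary fastest, sockets slowest.
                    let vcpu = ((socket * clusters + cluster) * cores + core) * threads + thread;
                    out.push_str(&format!(
                        "        thread{} {{ cpu = <&CPU{}>; }};\n",
                        thread, vcpu
                    ));
                }
                out.push_str("      };\n");
            }
            out.push_str("    };\n");
        }
        out.push_str("  };\n");
    }
    out.push_str("};\n");
    out
}
```

Printing `render_cpu_map(1, 2, 2, 2)` gives a tree with the same shape as the dts example earlier in this article: one outer cluster, two inner clusters, and two cores with two threads each.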

Verification method

We can start a virtual machine with the following command. The smp parameter configures the vCPU topology:

sudo ./target/release/stratovirt \
    -machine virt \
    -kernel /home/hwy/std-vmlinux.bin.1 \
    -append "console=ttyAMA0 root=/dev/vda rw reboot=k panic=1" \
    -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,unit=0,readonly=true \
    -drive file=/home/hwy/openEuler-22.03-LTS-stratovirt-aarch64.img,id=rootfs,readonly=false \
    -device virtio-blk-pci,drive=rootfs,bus=pcie.0,addr=0x1c.0x0,id=rootfs \
    -qmp unix:/var/tmp/hwy.socket,server,nowait \
    -serial stdio \
    -m 2048 \
    -smp 4,sockets=2,clusters=1,cores=2,threads=1
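The values passed to -smp have to be consistent with each other: in the command above, 2 sockets x 1 cluster x 2 cores x 1 thread gives the 4 vCPUs requested. The sketch below parses such a string and checks that relationship. It is not StratoVirt's parser: the `SmpConfig` type, the `parse_smp` function, the default of 1 for levels that are not mentioned, and the handling of unknown keys (ignored) are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Parsed form of an smp argument (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct SmpConfig {
    pub cpus: u32,
    pub sockets: u32,
    pub clusters: u32,
    pub cores: u32,
    pub threads: u32,
}

/// Parse a string such as "4,sockets=2,clusters=1,cores=2,threads=1".
pub fn parse_smp(arg: &str) -> Result<SmpConfig, String> {
    let mut parts = arg.split(',');
    // The first element is the total number of vCPUs.
    let cpus: u32 = parts
        .next()
        .unwrap_or("")
        .trim()
        .parse()
        .map_err(|_| format!("invalid CPU count in '{}'", arg))?;

    // The remaining elements are key=value pairs.
    let mut fields: HashMap<&str, u32> = HashMap::new();
    for part in parts {
        let (key, value) = part
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got '{}'", part))?;
        let value: u32 = value
            .trim()
            .parse()
            .map_err(|_| format!("invalid value for '{}'", key))?;
        fields.insert(key.trim(), value);
    }

    // Levels that are not mentioned default to 1 in this sketch.
    let get = |name: &str| fields.get(name).copied().unwrap_or(1);
    let cfg = SmpConfig {
        cpus,
        sockets: get("sockets"),
        clusters: get("clusters"),
        cores: get("cores"),
        threads: get("threads"),
    };

    // The product of all levels must match the requested vCPU count.
    let product = [cfg.sockets, cfg.clusters, cfg.cores, cfg.threads]
        .iter()
        .try_fold(1u32, |acc, &n| acc.checked_mul(n));
    if product != Some(cfg.cpus) {
        return Err(format!("topology does not multiply out to {} CPUs", cfg.cpus));
    }
    Ok(cfg)
}
```

Calling `parse_smp("4,sockets=2,clusters=1,cores=2,threads=1")` succeeds, while `parse_smp("4,sockets=2,cores=1")` is rejected because the levels multiply out to 2 rather than 4.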

Next, we can check the configured topology by looking at the files under /sys/devices/system/cpu/cpu0/topology.

[root@StratoVirt topology] ll
total 0
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:04 cluster_id
-r--r--r-- 1 root root 64K Jul 18 09:04 core_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 core_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:01 core_id
-r--r--r-- 1 root root 64K Jul 18 09:01 core_siblings
-r--r--r-- 1 root root 64K Jul 18 09:04 core_siblings_list
-r--r--r-- 1 root root 64K Jul 18 09:04 die_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 die_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:04 die_id
-r--r--r-- 1 root root 64K Jul 18 09:04 package_cpus
-r--r--r-- 1 root root 64K Jul 18 09:04 package_cpus_list
-r--r--r-- 1 root root 64K Jul 18 09:01 physical_package_id
-r--r--r-- 1 root root 64K Jul 18 09:01 thread_siblings
-r--r--r-- 1 root root 64K Jul 18 09:04 thread_siblings_list

For example:

cat core_cpus_list

outputs

0

which indicates that cpu0 is the only CPU on the same core as cpu0.

cat package_cpus_list

outputs

0-1

which indicates that the CPUs in the same socket as cpu0 are cpu0 through cpu1.
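When this check is automated, the list files have to be turned into CPU numbers. The sketch below parses the comma-separated ranges seen in the *_list files, such as "0" or "0-1". The `parse_cpu_list` function is a hypothetical helper; it only handles plain numbers and ranges, which is an assumption about the inputs a small VM like this one produces.

```rust
/// Parse a CPU list such as "0", "0-1" or "0,2-3" into individual CPU numbers.
pub fn parse_cpu_list(list: &str) -> Result<Vec<u32>, String> {
    let mut cpus = Vec::new();
    // sysfs files end with a newline, so trim before splitting.
    for part in list.trim().split(',').filter(|p| !p.is_empty()) {
        // A part is either a single number or an inclusive "first-last" range.
        let (first_str, last_str) = match part.split_once('-') {
            Some((a, b)) => (a, b),
            None => (part, part),
        };
        let first: u32 = first_str
            .trim()
            .parse()
            .map_err(|_| format!("bad CPU number '{}'", first_str))?;
        let last: u32 = last_str
            .trim()
            .parse()
            .map_err(|_| format!("bad CPU number '{}'", last_str))?;
        if last < first {
            return Err(format!("descending range '{}'", part));
        }
        cpus.extend(first..=last);
    }
    Ok(cpus)
}
```

For the VM started above, the content of package_cpus_list would parse to the CPUs 0 and 1, and the content of core_cpus_list to CPU 0 only.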

Tools such as lscpu can also help with verification.

lscpu

Running the lscpu command produces output like the following:

Architecture: aarch64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
CPU(s): 64
  On-line CPU(s) list: 0-63
Vendor ID: ARM
  Model name: Cortex-A72
    Model: 2
    Thread(s) per core: 1
    Core(s) per cluster: 16
    Socket(s): -
    Cluster(s): 4
    Stepping: r0p2
    BogoMIPS: 100.00
    Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
NUMA:
  NUMA node(s): 4
  NUMA node0 CPU(s): 0-15
  NUMA node1 CPU(s): 16-31
  NUMA node2 CPU(s): 32-47
  NUMA node3 CPU(s): 48-63
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Spec store bypass: Vulnerable
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Vulnerable
  Srbds: Not affected
  Tsx async abort: Not affected