Intel x86_64 LBR functionality

Article directory

  • Foreword
  • 1. CPUID instruction
    • 1.1 Introduction to CPUID functions
    • 1.2 Input parameter 01H and return the result
      • 1.2.1 ECX returns results
      • 1.2.2 EDX returns results
    • 1.3 CPUID instruction in Linux
      • 1.3.1 Calling the cpuid instruction from the application layer
      • 1.3.2 Calling the cpuid instruction in the Linux kernel
  • 2. MSR register
    • 2.1 Introduction to MSR register
    • 2.2 Introduction to RDMSR and WRMSR instructions
    • 2.3 IA32_DEBUGCTL MSR register
  • 3. LBR
    • 3.1 Introduction to LBR
    • 3.2 Usage of LBR
    • 3.3 Code demonstration
      • 3.3.1 User mode demo
      • 3.3.2 msr-tools
      • 3.3.3 taskset
      • 3.3.4 MSR register address
      • 3.3.5 Complete code

Foreword

This article introduces the LBR (Last Branch Record) branch-tracing feature of Intel processors. With it, information about the branches the CPU executes can be captured at the hardware level. The principle and process are roughly as follows:

  1. Use the CPUID instruction to read the processor's identification and feature information, to determine whether the hardware debugging features are supported and whether the rdmsr / wrmsr instructions are available;
  2. Set the IA32_DEBUGCTL MSR through the wrmsr instruction to enable LBR;
  3. Read the LBR stack MSRs through the rdmsr instruction to obtain the branch records.

Below we introduce each of these steps in more detail.

1. CPUID instruction

1.1 Introduction to CPUID functions

The CPUID instruction reads the processor's identification and feature information (such as the CPU model and the supported features) and returns it in the EAX, EBX, ECX, and EDX registers.
It has two sets of functions: one returns basic information, and the other returns extended information.
The instruction takes one input parameter in EAX (some functions take a second one in ECX); usually only the EAX value is needed. The information returned in EAX, EBX, ECX, and EDX depends on the input value.

The Intel manual lists the values returned for each input.

1.2 Input parameter 01H and return result

Here we focus on the results returned by ECX and EDX when EAX = 01H.

1.2.1 ECX return results

  • PDCM: Perfmon and Debug Capability. A value of 1 indicates that the processor supports the performance and debug feature indication MSR, IA32_PERF_CAPABILITIES.
  • DS-CPL: CPL Qualified Debug Store. A value of 1 indicates that the processor supports the extensions to the Debug Store feature that allow branch messages to be qualified by the current privilege level (0 is kernel mode, 3 is user mode).
  • DTES64: 64-bit DS Area. A value of 1 indicates that the processor supports a DS area that uses the 64-bit layout.

1.2.2 EDX return results


Here we mainly explain two flags that are related to LBR and BTS:

  • DS: Debug Store. A value of 1 indicates that the processor can write debugging information to a memory-resident buffer. BTS and PEBS use this feature, so the flag can be read as "does the processor support BTS and PEBS".
  • MSR: Model Specific Registers, RDMSR and WRMSR instructions. A value of 1 indicates that the CPU supports the rdmsr / wrmsr instructions for reading and writing MSRs.

1.3 CPUID instruction in Linux

1.3.1 Calling the cpuid instruction from the application layer

Here is a simple application that obtains the processor vendor ID and family:

#include <stdio.h>

#define X86_VENDOR_INTEL 0
#define X86_VENDOR_AMD 1
#define X86_VENDOR_UNKNOWN 2

/* Pack four characters into one 32-bit value, lowest byte first, as CPUID returns them. */
#define QCHAR(a, b, c, d) ((a) + ((b) << 8) + ((c) << 16) + ((d) << 24))
#define CPUID_INTEL1 QCHAR('G', 'e', 'n', 'u')
#define CPUID_INTEL2 QCHAR('i', 'n', 'e', 'I')
#define CPUID_INTEL3 QCHAR('n', 't', 'e', 'l')
#define CPUID_AMD1 QCHAR('A', 'u', 't', 'h')
#define CPUID_AMD2 QCHAR('e', 'n', 't', 'i')
#define CPUID_AMD3 QCHAR('c', 'A', 'M', 'D')

/* The vendor string is returned in EBX, EDX, ECX, in that order. */
#define CPUID_IS(a, b, c, ebx, ecx, edx) \
    (!(((ebx) ^ (a)) | ((edx) ^ (b)) | ((ecx) ^ (c))))

static inline void cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx,
                         unsigned int *ecx, unsigned int *edx)
{
    *eax = op; /* the function number is passed in EAX */
    *ecx = 0;  /* the Intel manual notes that ECX is sometimes an input as well */
    /* asm introduces inline assembly; volatile tells gcc not to optimize it away */
    asm volatile("cpuid"
                 : "=a" (*eax), /* after the first colon: output operands, marked with "=" */
                   "=b" (*ebx),
                   "=c" (*ecx),
                   "=d" (*edx)
                 : "0" (*eax), "2" (*ecx) /* after the second colon: input operands */
                 : "memory");
}

static int x86_vendor(void)
{
    unsigned eax = 0x00000000;
    unsigned ebx, ecx = 0, edx;

    cpuid(0, &eax, &ebx, &ecx, &edx);

    if (CPUID_IS(CPUID_INTEL1, CPUID_INTEL2, CPUID_INTEL3, ebx, ecx, edx)) {
        printf("GenuineIntel\n");
        return X86_VENDOR_INTEL;
    }

    if (CPUID_IS(CPUID_AMD1, CPUID_AMD2, CPUID_AMD3, ebx, ecx, edx)) {
        printf("AuthenticAMD\n");
        return X86_VENDOR_AMD;
    }

    return X86_VENDOR_UNKNOWN;
}

static int x86_family(void)
{
    unsigned eax = 0x00000001;
    unsigned ebx, ecx = 0, edx;
    int x86;

    cpuid(1, &eax, &ebx, &ecx, &edx);

    /* Family is EAX[11:8]; family 15 adds the extended family in EAX[27:20]. */
    x86 = (eax >> 8) & 0xf;
    if (x86 == 15)
        x86 += (eax >> 20) & 0xff;

    return x86;
}

int main(void)
{
    unsigned int eax = 0;
    unsigned int ebx = 0;
    unsigned int ecx = 0;
    unsigned int edx = 0;

    cpuid(0, &eax, &ebx, &ecx, &edx);
    printf("EBX ← %x (\"Genu\") EDX ← %x (\"ineI\") ECX ← %x (\"ntel\")\n",
           ebx, edx, ecx);

    int vendor = x86_vendor();
    int family = x86_family();

    printf("%d %d\n", vendor, family);

    return 0;
}

The execution results are as follows:

This result is consistent with the Intel software developer manual.

You can also see the Vendor ID and CPU family information using the lscpu command.

1.3.2 Calling the cpuid instruction in the Linux kernel

Besides executing the cpuid instruction through inline assembly at the application layer, a kernel module can call the kernel's cpuid function directly.
This interface is defined in arch/x86/include/asm/processor.h (kernel version 3.10.0):

/*
 * Generic CPUID function
 * clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx
 * resulting in stale register contents being returned.
 */
static inline void cpuid(unsigned int op,
			 unsigned int *eax, unsigned int *ebx,
			 unsigned int *ecx, unsigned int *edx)
{
	*eax = op;
	*ecx = 0;
	__cpuid(eax, ebx, ecx, edx);
}

#define __cpuid native_cpuid

static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
				unsigned int *ecx, unsigned int *edx)
{
	/* ecx is often an input as well as an output. */
	asm volatile("cpuid"
	    : "=a" (*eax),
	      "=b" (*ebx),
	      "=c" (*ecx),
	      "=d" (*edx)
	    : "0" (*eax), "2" (*ecx)
	    : "memory");
}

Test the cpuid function with a simple module:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/processor.h>

/* Kernel module initialization function */
static int __init lkm_init(void)
{
	unsigned int eax = 0;
	unsigned int ebx = 0;
	unsigned int ecx = 0;
	unsigned int edx = 0;

	cpuid(0, &eax, &ebx, &ecx, &edx);

	printk(KERN_INFO "EBX:%xh(\"Genu\") EDX:%xh(\"ineI\") ECX:%xh(\"ntel\")\n",
	       ebx, edx, ecx);

	return 0;
}

/* Kernel module exit function */
static void __exit lkm_exit(void)
{
	printk(KERN_DEBUG "exit\n");
}

module_init(lkm_init);
module_exit(lkm_exit);

MODULE_LICENSE("GPL");

Makefile:

obj-m := kcpuid.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

2. MSR register

2.1 Introduction to MSR register

MSR (Model Specific Register) is a concept in the x86 architecture. It refers to a set of registers that x86 processors use to control CPU operation, switch features on and off, support debugging, trace program execution, monitor CPU performance, and so on.
MSRs can differ between CPU models and between CPU manufacturers (Intel and AMD). They change with the specific CPU model, and each new CPU may introduce new MSRs.

2.2 Introduction to RDMSR and WRMSR instructions

As mentioned earlier, CPUID can be used to query whether the current CPU supports the RDMSR and WRMSR instructions, which read and write the MSRs.
Both instructions must be executed at privilege level 0 (kernel mode in Linux) or in real mode.

The Linux kernel provides two interfaces for them, rdmsrl and wrmsrl (used in section 3.2), in the arch/x86/include/asm/msr.h file.

2.3 IA32_DEBUGCTL MSR register

The IA32_DEBUGCTL register is at address 01D9H (some CPU families use other names for it, such as MSR_DEBUGCTLA or MSR_DEBUGCTLB). It controls debugging features such as interrupt tracing, LBR, and BTS.
Here are some of the more important bits:

  1. bit0: LBR (last branch/interrupt/exception) flag. When this bit is set, the processor records the most recent branches, interrupts, and/or exceptions it takes and stores them in the LBR stack MSRs.
  2. bit1: BTF (single-step on branches) flag. When this bit is set, the processor treats the TF flag in the EFLAGS register as single-step on branches instead of single-step on instructions. (gdb's single-step debugging on x86_64 uses the TF bit in its default single-step on instructions mode. I will write a separate article on how gdb single-stepping works on x86_64.)
  3. bit6: TR (trace message enable) flag. When this bit is set, branch trace messages are enabled: whenever the processor detects a taken branch, interrupt, or exception, it sends a record of it to the system bus as a branch trace message (BTM).
  4. bit7: BTS (branch trace store) flag. When this bit is set, BTMs are saved to the memory-resident BTS buffer in the DS save area.
  5. bit8: BTINT (branch trace interrupt) flag. When this bit is set, the BTS facility generates an interrupt when the BTS buffer is full.
  6. bit9: BTS_OFF_OS (branch trace off in privileged code) flag. When this bit is set, BTS or BTM is skipped if CPL is 0, that is, BTS is not active in kernel mode.
  7. bit10: BTS_OFF_USR (branch trace off in user code) flag. When this bit is set, BTS or BTM is skipped if CPL is greater than 0, that is, BTS is not active in user mode.

Therefore, the two bits BTS_OFF_OS and BTS_OFF_USR select whether branch records are collected for user mode, for kernel mode, or for both.
CPL: Current Privilege Level. A value of 0 is the most privileged level and 3 is the least privileged. In Linux, 0 is kernel mode and 3 is user mode.

So far we have covered how to query CPU features, what MSRs are, how to read and write them, and IA32_DEBUGCTL, the MSR responsible for debugging. Next we look more closely at the LBR feature controlled by IA32_DEBUGCTL and give a script that uses LBR to obtain branch records from the CPU.

3. LBR

3.1 Introduction to LBR

After bit 0 of the IA32_DEBUGCTL MSR is set, the processor automatically records the branches, interrupts, and exceptions it takes and stores them in the LBR stack MSRs. Two concepts matter here, the LBR stack and the TOS pointer:

  1. Last Branch Record (LBR) Stack: the LBR stack consists of N pairs of MSRs (N is the LBR stack size, as shown in the table below). Each pair stores the source address and destination address of a recent branch.
  2. Last Branch Record Top-of-Stack (TOS) Pointer: the least significant M bits of the TOS pointer MSR hold an M-bit index into the LBR stack, pointing at the MSR pair that contains the most recent branch, interrupt, or exception recorded.

While the LBR stack is recording, the TOS register indicates the current top of the stack, which is what allows the branch records to be read back in the correct order.

As can be seen from the table below, the number of MSRs in the LBR stack and the valid range of the TOS pointer value differ between processor families.

Each LBR MSR is a 64-bit register. In 64-bit mode, a last branch record stores the complete address. In 32-bit mode, the high 32 bits are set to 0 and the low 32 bits store the branch record.
MSR_LASTBRANCH_0_FROM_IP through MSR_LASTBRANCH_(N-1)_FROM_IP: store the source addresses of the branch records
MSR_LASTBRANCH_0_TO_IP through MSR_LASTBRANCH_(N-1)_TO_IP: store the destination addresses of the branch records

3.2 Usage of LBR

(1) Query the LBR stack storage format: read the IA32_PERF_CAPABILITIES MSR (call rdmsrl)

(2) Turn on LBR: set bit 0 of the IA32_DEBUGCTL MSR to 1 (call wrmsrl)

(3) Read the TOS pointer position (call rdmsrl)
Read the MSR_LASTBRANCH_TOS register; see Intel SDM Vol. 4 for its address.

(4) Read the LBR stack registers (call rdmsrl)
Read the MSR_LASTBRANCH_x_FROM_IP / MSR_LASTBRANCH_x_TO_IP registers; see Intel SDM Vol. 4 for their addresses.

Advantage of LBR: branch records are stored in registers, so there is almost no performance overhead.
Disadvantage of LBR: the number of register pairs is limited, so only a limited number of recent branch records can be kept.

3.3 Code Demonstration

Experiment platform:
Intel x86_64, CentOS 7.8
Note that the experiment was run on a physical machine, because virtual machines often do not support LBR.
To keep things simple, I use a shell script for the demonstration. It captures the execution flow of a user-mode program.

3.3.1 User mode demo

Here is the simplest possible demo, a while loop. It generates a steady stream of jmp instructions.

#include <stdio.h>

int main()
{
    int i = 0;
    while (1) {
        i++;
    }

    return 0;
}

Let's look at its disassembly with objdump -d a.out:

From the disassembly we can see that the while loop keeps executing the same jmp instruction, so the expected branch record is:

{From : 4004fc , to : 4004f8} // jmp instruction

3.3.2 msr-tools

Here I use the msr-tools package, whose rdmsr and wrmsr shell commands read and write MSR values.
There are two download addresses; I chose the second one:
https://pkgs.org/download/msr-tools
https://mirrors.edge.kernel.org/pub/linux/utils/cpu/msr-tools/

3.3.3 taskset

taskset sets or reads the CPU affinity of a running process given its PID, or starts a new process with a given CPU affinity. CPU affinity is a scheduler property that "binds" a process to a given set of CPUs on the system. The Linux scheduler respects the given CPU affinity, and the process will not run on any other CPU.

Here I mainly use it to make a given process run on a specific CPU.

(1) When running the a.out program, bind the process to CPU 1:

taskset -c 1 ./a.out

(2) Get the CPU affinity of the running a.out process:

taskset -p <PID>

3.3.4 MSR register address

Look up the LBR-related MSR addresses for the current CPU.
The addresses depend on the CPU's DF_DM (DisplayFamily_DisplayModel), which can be obtained with the cpuid instruction. Here I view it directly through the lscpu command:

Look up the Intel manual according to DF_DM:
Address of MSR_IA32_DEBUGCTL register: 0x1D9
Address of MSR_LBR_TOS register: 0x1C9
Address of MSR_LBR_SELECT register: 0x1C8

This CPU supports 32 pairs of FROM/TO records:
MSR_LASTBRANCH_0_FROM_IP – MSR_LASTBRANCH_31_FROM_IP: 0x680 – 0x69F
MSR_LASTBRANCH_0_TO_IP – MSR_LASTBRANCH_31_TO_IP: 0x6C0 – 0x6DF

3.3.5 Complete code

# Model Specific Registers address
MSR_LASTBRANCH_0_FROM_IP=680
MSR_LASTBRANCH_0_TO_IP=6C0
MSR_IA32_DEBUGCTL=1D9
MSR_LBR_TOS=1C9
MSR_LBR_SELECT=1C8

# Define the ADDR_FROM and ADDR_TO variables
ADDR_FROM=$MSR_LASTBRANCH_0_FROM_IP
ADDR_TO=$MSR_LASTBRANCH_0_TO_IP


#Configuration
CORE=1 # Run the target workload on core 1 (taskset -c 1 <program>)
N_LBR=32 # Number of LBR records

# enable MSR kernel module
sudo modprobe msr

# enable LBR
sudo ./wrmsr -p ${CORE} 0x${MSR_IA32_DEBUGCTL} 0x1

# do not capture branches in ring 0
sudo ./wrmsr -p ${CORE} 0x${MSR_LBR_SELECT} 0x1

# wait a bit for the workload to issue enough branches
sleep 0.1

# read all LBR records
for i in `seq 1 ${N_LBR}`;
#for(( i = 0; i < ${N_LBR}; i++ ))
do
    echo "LBR record : $i"
    echo -n 0x$ADDR_FROM
    echo -n ", from address: "
    sudo ./rdmsr -p ${CORE} 0x${ADDR_FROM}
    echo -n 0x$ADDR_TO
    echo -n ", to address: "
    sudo ./rdmsr -p ${CORE} 0x${ADDR_TO}

    # increment ADDR_FROM (in hex) by 1
    ADDR_FROM=`echo "obase=16; ibase=16; ${ADDR_FROM} + 1;" | bc`

    # increment ADDR_TO (in hex) by 1
    ADDR_TO=`echo "obase=16; ibase=16; ${ADDR_TO} + 1;" | bc`
done

(1) Start the demo with its CPU affinity set:

taskset -c 1 ./demo &

1: Indicates CPU 1 (the process runs on the second CPU).
22896: Indicates the PID of the process.

(2) Run the shell script and view the results:

The script captures the control flow of the user-mode program, and the result matches what we expected:

{From : 4004fc , to : 4004f8} // jmp instruction

References

  1. Intel x86_64 CPUID instruction introduction
  2. Intel x86_64 LBR & BTS features
  3. Let’s talk about Intel x86_64 LBR function again
  4. Intel® 64 and IA-32 Architectures Software Developer's Manuals