DPU network development SDK–DPDK(15)

Earlier we introduced the important functions involved when the ethtool example initializes the NIC, and the implementation principles behind them. Next, we look at the underlying implementation of the remaining DPDK functions used in this example; these functions appear frequently when writing DPDK applications.

rte_lcore_count()

In the previous article we mentioned that DPDK has two global variables. One of them is of type struct rte_config and can be obtained by calling rte_eal_get_configuration(); rte_lcore_count() simply returns its lcore_count member. This value is set in step 7 of the init() process.
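The relationship above can be sketched with a minimal mimic. This is not the real DPDK code: struct rte_config has many more members, and the names ending in _mimic are stand-ins introduced here only for illustration.

```c
#include <assert.h>

/* Hypothetical simplified stand-in for struct rte_config:
 * only the member discussed above is shown. */
struct rte_config_mimic {
    unsigned int lcore_count;   /* filled in during rte_eal_init() */
};

static struct rte_config_mimic global_config = { .lcore_count = 4 };

/* Mimic of rte_eal_get_configuration(): hands back the global config. */
static struct rte_config_mimic *
eal_get_configuration(void)
{
    return &global_config;
}

/* rte_lcore_count() is essentially this one-line accessor. */
static unsigned int
lcore_count(void)
{
    return eal_get_configuration()->lcore_count;
}
```

The takeaway is that rte_lcore_count() does no computation of its own; it only reads a value fixed earlier during EAL initialization.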

rte_lcore_id()

This function reads a variable through the macro RTE_PER_LCORE(_lcore_id), which is defined as

#define RTE_PER_LCORE(name) (per_lcore_##name)

It reads a per-lcore variable, which is defined as

#define RTE_DEFINE_PER_LCORE(type, name) \
    __thread __typeof__(type) per_lcore_##name

RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;

The per-lcore variable _lcore_id is set in eal_thread_loop(). In step 30 of the init() process we mentioned that this function is the entry point of every worker lcore, and that user-defined functions are also executed inside it. _lcore_id is set before the user-defined function runs, so that subsequent calls to rte_lcore_id() return the correct value.

rte_eal_remote_launch()

This function was already introduced in Section 9 as part of the init() process.

rte_eal_wait_lcore()

This function checks the status of lcore_config[lcore_id].state. While the status is neither WAIT nor FINISHED, it calls rte_pause(), which is implemented with C inline assembly.

For packet-processing programs, the user-defined function is usually an infinite loop that processes packets indefinitely, so in such applications rte_eal_wait_lcore() never returns.
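The wait loop described above can be sketched as follows. This is a hypothetical mimic, not the EAL code: the real lcore_config lives inside the EAL, and cpu_pause() stands in for rte_pause() (which on x86 emits the pause instruction).

```c
#include <assert.h>
#include <pthread.h>

/* Simplified mimic of the lcore state machine discussed above. */
enum lcore_state_mimic { WAIT, RUNNING, FINISHED };

static volatile enum lcore_state_mimic state = RUNNING;
static volatile int lcore_ret;

static void cpu_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause");  /* what rte_pause() emits on x86 */
#endif
}

/* Sketch of rte_eal_wait_lcore(): spin until the worker leaves RUNNING. */
static int wait_lcore(void)
{
    while (state != WAIT && state != FINISHED)
        cpu_pause();
    return lcore_ret;
}

/* Hypothetical worker: publishes its result, then flips the state. */
static void *worker(void *arg)
{
    (void)arg;
    lcore_ret = 42;          /* result of the user-defined function */
    __sync_synchronize();    /* publish the result before the state */
    state = FINISHED;
    return NULL;
}
```

If worker() never leaves its processing loop, state never becomes FINISHED and wait_lcore() spins forever, which is exactly the "never returns" behavior noted above.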

rte_spinlock_trylock() / rte_spinlock_unlock()

The spinlock is a concept from the operating-system kernel. Since DPDK runs in user mode, its spinlock is implemented in user space with C inline assembly (atomic instructions). The details are relatively involved and are not discussed here.
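For a feel of the semantics, here is a user-space sketch equivalent in spirit to rte_spinlock_t. It is not the DPDK implementation: DPDK uses inline assembly or atomic builtins internally; this mimic uses the GCC __sync builtins, which compile down to the same kind of atomic instructions. The return convention matches rte_spinlock_trylock(): 1 on success, 0 if the lock is already held.

```c
#include <assert.h>

/* Hypothetical stand-in for rte_spinlock_t. */
typedef struct { volatile int locked; } spinlock_mimic_t;

/* Like rte_spinlock_trylock(): returns 1 on success, 0 if held. */
static int trylock(spinlock_mimic_t *sl)
{
    /* atomically set locked=1; old value 0 means we won the lock */
    return __sync_lock_test_and_set(&sl->locked, 1) == 0;
}

/* Like rte_spinlock_unlock(): release the lock. */
static void unlock(spinlock_mimic_t *sl)
{
    __sync_lock_release(&sl->locked);
}
```

A full spinlock would additionally loop (spin) on trylock until it succeeds; trylock/unlock is the non-blocking subset used in this example.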

rte_eth_rx_burst()

This function is the basic entry point for receiving packets and deserves careful analysis.

Using the given port_id, it finds the corresponding NIC object in the rte_eth_devices array and calls dev->rx_pkt_burst. Which handler rx_pkt_burst points to is decided in ixgbe_dev_rx_init()->ixgbe_set_rx_function(), called from ixgbe_dev_start() as mentioned in Section 14. For ixgbe devices there are many candidate handlers, chosen according to the features recorded in the ixgbe_adapter pointed to by dev->data->dev_private. This involves many hardware features and will be introduced separately later; here we take ixgbe_recv_pkts() as the example.
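The dispatch itself boils down to one pointer indirection, which can be sketched as follows. All the _mimic types and fake_recv are hypothetical stand-ins for the DPDK structures; only the shape of the call chain is meant to match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mbuf_mimic;   /* stand-in for struct rte_mbuf */

typedef uint16_t (*rx_burst_fn)(void *rxq, struct mbuf_mimic **pkts,
                                uint16_t nb_pkts);

/* Stand-in for struct rte_eth_dev: only the fields used here. */
struct eth_dev_mimic {
    rx_burst_fn rx_pkt_burst;   /* installed by ixgbe_set_rx_function() */
    void *rx_queues[1];
};

/* A fake driver handler that pretends 3 packets were available. */
static uint16_t fake_recv(void *rxq, struct mbuf_mimic **pkts, uint16_t n)
{
    (void)rxq; (void)pkts;
    return n < 3 ? n : 3;
}

static struct eth_dev_mimic eth_devices[1] = {
    { .rx_pkt_burst = fake_recv, .rx_queues = { NULL } },
};

/* rte_eth_rx_burst() reduces to this indirection. */
static uint16_t rx_burst(uint16_t port_id, uint16_t queue_id,
                         struct mbuf_mimic **pkts, uint16_t nb_pkts)
{
    struct eth_dev_mimic *dev = &eth_devices[port_id];
    return dev->rx_pkt_burst(dev->rx_queues[queue_id], pkts, nb_pkts);
}
```

Because the handler is resolved through a per-device function pointer, the same rx_burst() entry point works for every driver and for every rx variant a driver installs.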

ixgbe_recv_pkts() can receive up to nb_pkts packets per call, but it still fetches them from the hardware one at a time. Before receiving, it obtains the hardware descriptor ring rx_queues[queue_id]->rx_ring and the software ring rx_queues[queue_id]->sw_ring. For each packet it first calls rte_mbuf_raw_alloc() to allocate a fresh mbuf nmb to hold packet data, then calls rte_ixgbe_prefetch() to read ahead; note that the read targets sw_ring[idx + 1], i.e. the next packet is prefetched first. The new nmb is then stored into sw_ring[idx], while the mbuf originally held there is taken out as rxm for further processing. The function fills in rxm's fields for the packet length, the data length, the packet type, the hash information and so on, which amounts to a basic parse of the packet. Finally rxm is returned as the result.

As this flow shows, when a given packet is processed only its metadata is parsed; the read action targets the next packet to be received, while the packet processed in this iteration was already read during the previous iteration. The sw_ring elements serve as handles to the packet buffers and make this read-ahead possible, while the rx_ring elements hold the basic hardware descriptor information of the prefetched packets. The prefetch itself is implemented with inline assembly, which pulls the data directly into the CPU cache.
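The buffer-swap pattern at the heart of this can be sketched with a hypothetical tiny mbuf type. The mbuf already sitting in the software ring slot holds a packet received earlier; a freshly allocated buffer takes its place, and the old one is handed to the application.

```c
#include <assert.h>
#include <stdlib.h>

struct mbuf_mimic { int seq; };   /* stand-in for struct rte_mbuf */

#define RING_SIZE 4
static struct mbuf_mimic *sw_ring[RING_SIZE];

/* Sketch of receiving one packet from slot idx. */
static struct mbuf_mimic *recv_one(int idx)
{
    /* rte_mbuf_raw_alloc(): fresh buffer for a future packet */
    struct mbuf_mimic *nmb = calloc(1, sizeof(*nmb));

    struct mbuf_mimic *rxm = sw_ring[idx]; /* previously filled packet */
    sw_ring[idx] = nmb;                    /* slot now owns the fresh buffer */
    /* ... the real driver parses lengths/type/hash into rxm here ... */
    return rxm;                            /* handed to the application */
}
```

The ring slot never goes empty: each receive replaces the consumed buffer immediately, so the hardware always has somewhere to write the next packet.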

rte_eth_tx_burst()

Similarly, this function is the basic entry point for transmitting packets; it actually calls dev->tx_pkt_burst. The concrete handler that tx_pkt_burst points to is set in ixgbe_dev_tx_queue_setup()->ixgbe_set_tx_function() (introduced in Section 14). Compared with rx, tx has fewer candidate handlers; we take ixgbe_xmit_pkts() as the example.

ixgbe_xmit_pkts() can also send multiple packets per call. Before sending, it obtains the transmit descriptor ring tx_queues[queue_id]->tx_ring and the software ring tx_queues[queue_id]->sw_ring. It first checks the number of free descriptors recorded in tx_queues[queue_id]; if there are not enough, it calls ixgbe_xmit_cleanup() to reclaim some.

In ixgbe_xmit_cleanup(), the function first computes desc_to_clean_to, the index of the last descriptor to be reclaimed; by default this equals the index of the last reclaimed descriptor last_desc_cleaned plus the tx RS threshold tx_rs_thresh. If desc_to_clean_to reaches or exceeds the number of descriptors nb_tx_desc in the queue, nb_tx_desc is subtracted from it (ring wrap-around). Next, it uses desc_to_clean_to to locate the descriptor in tx_ring corresponding to sw_ring[desc_to_clean_to] and checks whether the hardware has finished with it; if not, nothing can be reclaimed. It then computes nb_tx_to_clean, the number of descriptors to reclaim, from the distance between last_desc_cleaned and desc_to_clean_to, taking the relative order of the two indexes into account so that the difference cannot be negative. Finally, nb_tx_to_clean is added to the free-descriptor count nb_tx_free in tx_queues[queue_id], and last_desc_cleaned is updated. As the whole procedure shows, reclaiming descriptors is achieved purely by adjusting the start position and count of the free descriptors recorded in tx_queues[queue_id].
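The index arithmetic just described can be condensed into a small sketch. The struct and the dd_done parameter are hypothetical simplifications: dd_done stands for the hardware "descriptor done" check, and the field names mirror the ones discussed above.

```c
#include <assert.h>

/* Hypothetical simplified tx queue: only the bookkeeping fields. */
struct txq_mimic {
    unsigned int nb_tx_desc;        /* descriptors in the ring */
    unsigned int tx_rs_thresh;      /* tx RS threshold */
    unsigned int last_desc_cleaned; /* index of last reclaimed descriptor */
    unsigned int nb_tx_free;        /* free-descriptor count */
};

/* Sketch of ixgbe_xmit_cleanup(); returns -1 when nothing can be
 * reclaimed yet because the hardware is not done with the target. */
static int xmit_cleanup(struct txq_mimic *q, int dd_done)
{
    unsigned int desc_to_clean_to = q->last_desc_cleaned + q->tx_rs_thresh;
    if (desc_to_clean_to >= q->nb_tx_desc)
        desc_to_clean_to -= q->nb_tx_desc;       /* ring wrap-around */

    if (!dd_done)
        return -1;   /* hardware has not finished with that descriptor */

    unsigned int nb_tx_to_clean;
    if (q->last_desc_cleaned > desc_to_clean_to)  /* wrapped in between */
        nb_tx_to_clean = q->nb_tx_desc - q->last_desc_cleaned
                         + desc_to_clean_to;
    else
        nb_tx_to_clean = desc_to_clean_to - q->last_desc_cleaned;

    q->last_desc_cleaned = desc_to_clean_to;
    q->nb_tx_free += nb_tx_to_clean;  /* reclaim = bookkeeping only */
    return 0;
}
```

Note that no memory is freed here; "cleanup" only moves the bookkeeping indexes forward, which is exactly the observation at the end of the paragraph above.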

Next, the packets to be sent are traversed in order. For each packet tx_pkt, the flag field tx_pkt->ol_flags is checked to determine whether offload is enabled. If so, the data structure union ixgbe_tx_offload is filled in, mainly with the L2, L3 and L4 header sizes of the packet, the VLAN tag, the outer L2/L3 sizes for encapsulated packets, and so on. Based on this offload information a context object ctx is obtained; this ctx may need to be rebuilt, which is indicated by the flag new_ctx.

Then the number of descriptors nb_used needed to send this packet is computed. It is determined by the packet's segment count tx_pkt->nb_segs, plus one if a new context is required (i.e. new_ctx is true). When nb_used plus the number of already-used descriptors nb_tx_used in the queue exceeds the tx RS threshold, the RS (report status) flag must be set in the read.cmd_type_len field of the last tx_ring descriptor, txp.

The number of descriptors allocated for a packet equals its segment count, plus 1 if a new context object is needed. From this, after the descriptors in sw_ring are allocated for the packet, the index tx_last of its last descriptor is determined (note that the descriptor ring is circular, so wrap-around must be handled).
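This accounting can be written out as a short sketch. calc_tx_last is a hypothetical helper introduced only for illustration; in ixgbe_xmit_pkts() the same arithmetic is done inline.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the descriptor accounting described above. */
static uint16_t calc_tx_last(uint16_t tx_id, uint16_t nb_segs,
                             int new_ctx, uint16_t nb_tx_desc)
{
    /* one descriptor per segment, plus one for a new context object */
    uint16_t nb_used = nb_segs + (new_ctx ? 1 : 0);

    uint16_t tx_last = tx_id + nb_used - 1;
    if (tx_last >= nb_tx_desc)
        tx_last -= nb_tx_desc;   /* the descriptor ring is circular */
    return tx_last;
}
```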

When the required descriptor count nb_used is greater than nb_tx_free, ixgbe_xmit_cleanup() must be called to reclaim some descriptors. If reclaiming fails, transmission of all remaining packets is aborted and ixgbe_xmit_pkts() returns immediately. If reclaiming succeeds but nb_used is still greater than nb_tx_free, ixgbe_xmit_cleanup() is called repeatedly until nb_used no longer exceeds nb_tx_free; if any of these reclaim attempts fails, transmission of all remaining packets is likewise aborted.

Next, if offload is enabled for the packet and new_ctx is true, ixgbe_set_xmit_ctx() is called to set up the first free descriptor in tx_ring (the descriptor at index tx_queues[queue_id]->tx_tail, forcibly cast to struct ixgbe_adv_tx_context_desc); one sw_ring element is consumed here as well.

The next step loops over the segments of the packet. Each segment's mbuf is stored into an sw_ring element, and the corresponding tx_ring descriptor is filled with the DMA address of the memory the segment points to, the offload information, and so on. While processing the current sw_ring element txe, the next element txn is obtained via txe->next_id. txn is not only assigned to txe at the end of the iteration in preparation for the next segment, it is also prefetched (via rte_prefetch0); like the read-ahead on the receive side, this warms the cache line. After every segment is processed, nb_tx_used and nb_tx_free in tx_queues[queue_id] are updated. If the updated nb_tx_used reaches or exceeds the tx RS threshold, the RS flag is written into read.cmd_type_len of the last tx_ring descriptor processed in this round; otherwise that descriptor is merely recorded in txp, a variable that may be used when the next packet is processed.

The above is the analysis of the two most important flows in DPDK: transmitting and receiving packets. As the ixgbe code shows, the main processing associates mbuf resources with the descriptors in the rx/tx queues, and the final low-level operations on those mbufs rely on inline assembly. The assembly-level operations will be introduced in detail as we gain a deeper understanding of DPDK.