Project
- https://github.com/4paradigm/k8s-vgpu-scheduler/
How k8s-vgpu-scheduler implements vgpu resource allocation
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2        # requesting 2 vGPUs
          nvidia.com/gpumem: 3000  # each vGPU gets 3000M of device memory (optional, integer)
          nvidia.com/gpucores: 30  # each vGPU uses 30% of the whole GPU (optional, integer)
```
- First, create a device plugin and register the GPU resource (nvidia.com/gpu) with the kubelet.
- Create a scheduler dedicated to allocating GPU resources (I can't recall its exact name, so it is called gpu-scheduler here; the default k8s scheduler is default-scheduler). The code lives under `pkg/scheduler` in the project.
- This mainly uses the scheduler extender mechanism: an HTTP extender plugin is created to schedule GPU pods (it handles the nvidia.com/gpu resource parameter).
- A KubeSchedulerConfiguration declares the HTTP extender plugin above and has it also handle the other GPU resource parameters (nvidia.com/gpumem, nvidia.com/gpucores).
- Note that without this KubeSchedulerConfiguration declaration, k8s treats nvidia.com/gpumem and nvidia.com/gpucores as ordinary resource devices: during scheduling it looks for them, finds that no node advertises any quota for them (`kubectl describe node <nodeName>` shows each node's device resources, as illustrated just below), and the pod fails to schedule.
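Only resources registered by a device plugin show up on a node. The excerpt below is illustrative output (the counts are made up): nvidia.com/gpu is advertised, while nvidia.com/gpumem and nvidia.com/gpucores are not.

```console
$ kubectl describe node <nodeName>
...
Capacity:
  cpu:             8
  memory:          32596028Ki
  nvidia.com/gpu:  20
Allocatable:
  cpu:             8
  memory:          32493628Ki
  nvidia.com/gpu:  20
```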
- The logic here differs somewhat from Tencent's gpu-manager: Tencent registers two devices, nvidia.com/vgpu and nvidia.com/vmem, and when vgpu is allocated, vmem is counted and allocated together with it.
```yaml
# charts/vgpu/templates/scheduler/configmapnew.yaml
# A configuration file from the helm chart template
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "4pd-vgpu.scheduler" . }}-newversion
  labels:
    app.kubernetes.io/component: 4pd-scheduler
    {{- include "4pd-vgpu.labels" . | nindent 4 }}
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
      - schedulerName: {{ .Values.schedulerName }}
    extenders:
      - urlPrefix: "https://127.0.0.1:443"
        filterVerb: filter
        bindVerb: bind
        nodeCacheCapable: true
        weight: 1
        httpTimeout: 30s
        enableHTTPS: true
        tlsConfig:
          insecure: true
        managedResources:
          - name: {{ .Values.resourceName }}
            ignoredByScheduler: true
          - name: {{ .Values.resourceMem }}
            ignoredByScheduler: true
          - name: {{ .Values.resourceCores }}
            ignoredByScheduler: true
          - name: {{ .Values.resourceMemPercentage }}
            ignoredByScheduler: true
          - name: {{ .Values.resourcePriority }}
            ignoredByScheduler: true
          - name: {{ .Values.mluResourceName }}
            ignoredByScheduler: true
          - name: {{ .Values.mluResourceMem }}
            ignoredByScheduler: true
```
- Now that there is a new gpu-scheduler alongside the default k8s default-scheduler, how does a pod end up on the right one? — via the k8s mutating webhook mechanism.
- The mutating webhook takes effect before the pod reaches any scheduler. Its main job is to check whether the pod requests the nvidia.com/gpu resource; if it does, the webhook rewrites schedulerName in pod.spec to gpu-scheduler (otherwise the pod keeps default-scheduler, which has no GPU scheduling logic). A sketch of this mutation follows the webhook configuration below.
```yaml
# charts/vgpu/templates/scheduler/webhook.yaml
# A configuration file from the helm chart template
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: {{ include "4pd-vgpu.scheduler.webhook" . }}
webhooks:
  - admissionReviewVersions:
      - v1beta1
    clientConfig:
      {{- if .Values.scheduler.customWebhook.enabled }}
      url: https://{{ .Values.scheduler.customWebhook.host }}:{{ .Values.scheduler.customWebhook.port }}{{ .Values.scheduler.customWebhook.path }}
      {{- else }}
      service:
        name: {{ include "4pd-vgpu.scheduler" . }}
        namespace: {{ .Release.Namespace }}
        path: /webhook
        port: {{ .Values.scheduler.service.httpPort }}
      {{- end }}
    failurePolicy: Fail
    matchPolicy: Equivalent
    name: vgpu.4pd.io
    namespaceSelector:
      matchExpressions:
        - key: 4pd.io/webhook
          operator: NotIn
          values:
            - ignore
    objectSelector:
      matchExpressions:
        - key: 4pd.io/webhook
          operator: NotIn
          values:
            - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: '*'
    sideEffects: None
    timeoutSeconds: 10
```
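To make the mutation step concrete, here is a minimal sketch of the core check-and-patch logic such a webhook performs. The patchOperation struct, the limits check, and the JSON-patch shape are illustrative assumptions, not the project's actual handler code.

```go
package webhook

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// patchOperation is one entry of the JSON patch returned in the admission response.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// mutateSchedulerName returns a JSON patch that points the pod at gpu-scheduler
// whenever one of its containers requests nvidia.com/gpu; otherwise it returns
// nil and the pod stays with default-scheduler. (Illustrative sketch only.)
func mutateSchedulerName(pod *corev1.Pod) ([]byte, error) {
	wantsGPU := false
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			wantsGPU = true
			break
		}
	}
	if !wantsGPU {
		return nil, nil
	}
	patch := []patchOperation{
		// "add" on an existing member replaces its value per RFC 6902.
		{Op: "add", Path: "/spec/schedulerName", Value: "gpu-scheduler"},
	}
	return json.Marshal(patch)
}
```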
Understanding the logic through diagrams
1 | Device-Plugin reporting logic
- After deployment, the Device Plugin's Register method runs and registers the resource device name (ResourceName, here nvidia.com/gpu) that this Device Plugin serves with the kubelet.
```go
// pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

// Register registers the device plugin for the given resourceName with Kubelet.
func (plugin *NvidiaDevicePlugin) Register() error {
	conn, err := plugin.dial(pluginapi.KubeletSocket, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	reqt := &pluginapi.RegisterRequest{
		Version: pluginapi.Version,
		// dfy: socket used to communicate with the resource management server
		Endpoint: path.Base(plugin.socket),
		// dfy: the registered resource name
		ResourceName: string(plugin.rm.Resource()),
		Options: &pluginapi.DevicePluginOptions{
			GetPreferredAllocationAvailable: true,
		},
	}

	_, err = client.Register(context.Background(), reqt)
	if err != nil {
		return err
	}
	return nil
}
```
- The kubelet then calls the Device Plugin's ListAndWatch method to watch for changes in the resource devices and syncs them to the Node (you can view a node's device resources with `kubectl describe node <nodeName>`).
```go
// pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

// ListAndWatch lists devices and updates that list according to the health status
func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

	for {
		select {
		case <-plugin.stop:
			return nil
		case d := <-plugin.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			d.Health = pluginapi.Unhealthy
			klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
			s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
		}
	}
}
```
2 | Device-Plugin resource allocation logic
- You can think of kubelet as the client and Device Plugin as the server.
- Communication between them happens over Unix sockets placed in the /var/lib/kubelet/device-plugins directory.
- The kubelet calls the Device Plugin's Allocate function to request the device information needed for the pod's resource request; a sketch of such a handler follows this list.
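A minimal sketch of what an Allocate handler can look like, using the Kubernetes device plugin API (pluginapi). The placeholder plugin type, the environment variable name, and the overall logic are illustrative assumptions, not the project's actual implementation; the real plugin would additionally carry the per-vGPU memory/core limits, which this sketch omits.

```go
package plugin

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// vgpuDevicePlugin is a placeholder standing in for the real plugin struct.
type vgpuDevicePlugin struct{}

// Allocate answers the kubelet's allocation request: for each container request
// it echoes back the assigned device IDs and hands them to the container via an
// environment variable (the variable name is an assumption for illustration).
func (p *vgpuDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	responses := &pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		resp := &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
			},
		}
		responses.ContainerResponses = append(responses.ContainerResponses, resp)
	}
	return responses, nil
}
```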
3 | Scheduler Extender Plugin
- A Pod's request may include multiple device resources (such as nvidia.com/gpu, nvidia.com/gpumem, nvidia.com/gpupercentage, etc.), but only one device resource is registered so far (nvidia.com/gpu, corresponding to the gpu.sock file above).
- The other two resources (nvidia.com/gpumem, nvidia.com/gpupercentage) are not registered. If they are filled into the pod's request, they are not recognized during scheduling and the Pod stays in the Pending state.
- So if we want all three device resources to be handled by the same NVIDIA Device Plugin, how do we do it?
- Declare them in the scheduler's KubeSchedulerConfiguration and point a Scheduler Extender at these three resources (with ignoredByScheduler: true), so the extender takes over selecting a suitable Node for the Pod; a sketch of the extender's filter endpoint follows this list.
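To illustrate the extender side, here is a minimal sketch of an HTTP filter endpoint matching the `filterVerb: filter` declaration in the KubeSchedulerConfiguration above. The filterNode placeholder, the /filter path, and the listen address are assumptions for illustration, not the project's actual code.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filterNode is a placeholder for the real per-node check, e.g. whether the
// node has enough free GPU memory and cores for the pod's vGPU request.
func filterNode(args *extenderv1.ExtenderArgs, nodeName string) bool { return true }

// filterHandler decodes the ExtenderArgs sent by kube-scheduler, filters the
// candidate nodes, and returns an ExtenderFilterResult.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := extenderv1.ExtenderFilterResult{
		NodeNames:   &[]string{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			if filterNode(&args, name) {
				*result.NodeNames = append(*result.NodeNames, name)
			} else {
				result.FailedNodes[name] = "insufficient vGPU resources"
			}
		}
	}
	json.NewEncoder(w).Encode(&result)
}

func main() {
	http.HandleFunc("/filter", filterHandler)
	log.Fatal(http.ListenAndServe(":8080", nil)) // placeholder address
}
```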