[K8S device-plugin] Use the vgpu project to analyze the relationship between device-plugin, Scheduler Extender Plugin, and KubeSchedulerConfiguration

Project

  • https://github.com/4paradigm/k8s-vgpu-scheduler/

How k8s-vgpu-scheduler implements vgpu resource allocation

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional, Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)
  1. First, create a device-plugin and register the gpu resource name (nvidia.com/gpu) with the kubelet

  2. Create a dedicated kube-scheduler for allocating gpu resources (I don't recall its exact name; it is called gpu-scheduler here. The default k8s scheduler is default-scheduler, and the code lives under pkg/scheduler in the project)

    • This mainly uses the scheduler extender plugin mechanism: an HTTP extender plugin is created to schedule gpu pods (it watches the nvidia.com/gpu resource parameter); a minimal sketch of such an extender endpoint is given at the end of this note
  3. Use KubeSchedulerConfiguration to declare that the above HTTP extender plugin also handles the other gpu resource parameters (nvidia.com/gpumem, nvidia.com/gpucores)

    • Note that if KubeSchedulerConfiguration does not declare them, k8s treats nvidia.com/gpumem and nvidia.com/gpucores as ordinary resource devices: during scheduling it looks for them, finds that no node has any quota for them (kubectl describe node nodeName shows each node's device resources), and the pod fails to schedule.

    • The logic here differs somewhat from Tencent's gpu-manager: Tencent registers two devices, nvidia.com/vgpu and nvidia.com/vmem, and when vgpu is allocated, vmem is counted and allocated together with it.

    • # charts/vgpu/templates/scheduler/configmapnew.yaml
      # Here is a configuration file in the helm chart template
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: {{ include "4pd-vgpu.scheduler" . }}-newversion
        labels:
          app.kubernetes.io/component: 4pd-scheduler
          {{- include "4pd-vgpu.labels" . | nindent 4 }}
      data:
        config.yaml: |
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: KubeSchedulerConfiguration
          leaderElection:
            leaderElect: false
          profiles:
          - schedulerName: {{ .Values.schedulerName }}
          extenders:
          - urlPrefix: "https://127.0.0.1:443"
            filterVerb: filter
            bindVerb: bind
            nodeCacheCapable: true
            weight: 1
            httpTimeout: 30s
            enableHTTPS: true
            tlsConfig:
              insecure: true
            managedResources:
            - name: {{ .Values.resourceName }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceMem }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceCores }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceMemPercentage }}
              ignoredByScheduler: true
            - name: {{ .Values.resourcePriority }}
              ignoredByScheduler: true
            - name: {{ .Values.mluResourceName }}
              ignoredByScheduler: true
            - name: {{ .Values.mluResourceMem }}
              ignoredByScheduler: true
      
  4. Now that there is a new gpu-scheduler alongside the default k8s default-scheduler, how does a pod end up on the right one? Via the k8s mutating webhook mechanism.

    • The mutating webhook takes effect before the pod reaches a scheduler. Its main job is to check whether the pod requests nvidia.com/gpu resources; if so, it rewrites schedulerName in pod.spec to gpu-scheduler (without this change, the pod keeps the default default-scheduler, which has no gpu scheduling logic). A minimal sketch of such a handler follows the chart snippet below.

    • # charts/vgpu/templates/scheduler/webhook.yaml
      # Here is a configuration file in the helm chart template
      apiVersion: admissionregistration.k8s.io/v1
      kind: MutatingWebhookConfiguration
      metadata:
        name: {{ include "4pd-vgpu.scheduler.webhook" . }}
      webhooks:
        - admissionReviewVersions:
          - v1beta1
          clientConfig:
            {{- if .Values.scheduler.customWebhook.enabled }}
            url: https://{{ .Values.scheduler.customWebhook.host }}:{{ .Values.scheduler.customWebhook.port }}{{ .Values.scheduler.customWebhook.path }}
            {{- else }}
            service:
              name: {{ include "4pd-vgpu.scheduler" . }}
              namespace: {{ .Release.Namespace }}
              path: /webhook
              port: {{ .Values.scheduler.service.httpPort }}
            {{- end }}
          failurePolicy: Fail
          matchPolicy: Equivalent
          name: vgpu.4pd.io
          namespaceSelector:
            matchExpressions:
            - key: 4pd.io/webhook
              operator: NotIn
              values:
              - ignore
          objectSelector:
            matchExpressions:
            - key: 4pd.io/webhook
              operator: NotIn
              values:
              - ignore
          reinvocationPolicy: Never
          rules:
            - apiGroups:
                - ""
              apiVersions:
                - v1
              operations:
                - CREATE
              resources:
                - pods
              scope: '*'
          sideEffects: None
          timeoutSeconds: 10
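
    • To make the webhook's behavior concrete, here is a minimal sketch (not the project's actual code) of a handler behind the /webhook path above. It uses the admission/v1 types for brevity (the chart advertises v1beta1); the scheduler name "gpu-scheduler" and the listen address are placeholder assumptions.

      // Hypothetical mutating-webhook handler: if any container requests
      // nvidia.com/gpu, patch spec.schedulerName so the dedicated gpu
      // scheduler picks the pod up instead of default-scheduler.
      package main

      import (
          "encoding/json"
          "io"
          "net/http"

          admissionv1 "k8s.io/api/admission/v1"
          corev1 "k8s.io/api/core/v1"
      )

      func mutatePod(w http.ResponseWriter, r *http.Request) {
          body, _ := io.ReadAll(r.Body)

          var review admissionv1.AdmissionReview
          if err := json.Unmarshal(body, &review); err != nil {
              http.Error(w, err.Error(), http.StatusBadRequest)
              return
          }
          var pod corev1.Pod
          if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
              http.Error(w, err.Error(), http.StatusBadRequest)
              return
          }

          resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}

          // Does any container ask for the vGPU resource?
          wantsGPU := false
          for _, c := range pod.Spec.Containers {
              if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
                  wantsGPU = true
                  break
              }
          }

          if wantsGPU {
              // JSONPatch pointing the pod at the dedicated scheduler
              // ("gpu-scheduler" is a placeholder name here).
              patch := []map[string]string{
                  {"op": "add", "path": "/spec/schedulerName", "value": "gpu-scheduler"},
              }
              resp.Patch, _ = json.Marshal(patch)
              pt := admissionv1.PatchTypeJSONPatch
              resp.PatchType = &pt
          }

          out, _ := json.Marshal(admissionv1.AdmissionReview{TypeMeta: review.TypeMeta, Response: resp})
          w.Header().Set("Content-Type", "application/json")
          w.Write(out)
      }

      func main() {
          http.HandleFunc("/webhook", mutatePod)
          // The real deployment serves HTTPS with the certificate referenced by
          // the MutatingWebhookConfiguration; plain HTTP keeps the sketch short.
          http.ListenAndServe(":8080", nil)
      }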
      
      

Understanding the logic with diagrams

1 | Device-Plugin reporting logic

  1. After deployment, the Device Plugin calls its Register method to register the resource name (nvidia.com/gpu) it is responsible for with the kubelet.

    • // pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

      // Register registers the device plugin for the given resourceName with Kubelet.
      func (plugin *NvidiaDevicePlugin) Register() error {
          conn, err := plugin.dial(pluginapi.KubeletSocket, 5*time.Second)
          if err != nil {
              return err
          }
          defer conn.Close()

          client := pluginapi.NewRegistrationClient(conn)
          reqt := &pluginapi.RegisterRequest{
              Version: pluginapi.Version,
              // dfy: socket for communicating with the resource management server
              Endpoint: path.Base(plugin.socket),
              // dfy: register the resource name
              ResourceName: string(plugin.rm.Resource()),
              Options: &pluginapi.DevicePluginOptions{
                  GetPreferredAllocationAvailable: true,
              },
          }

          _, err = client.Register(context.Background(), reqt)
          if err != nil {
              return err
          }
          return nil
      }
      
  2. The kubelet calls the Device Plugin's ListAndWatch method to listen for changes in the resource devices and synchronizes them to the Node (you can view the device resources with kubectl describe node xx)

    • // pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

      // ListAndWatch lists devices and updates that list according to the health status
      func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
          s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

          for {
              select {
              case <-plugin.stop:
                  return nil
              case d := <-plugin.health:
                  // FIXME: there is no way to recover from the Unhealthy state.
                  d.Health = pluginapi.Unhealthy
                  klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
                  s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
              }
          }
      }
      

2 | Device-Plugin resource allocation logic

  • You can think of the kubelet as the client and the Device Plugin as the server.
  • They communicate over sockets placed in the /var/lib/kubelet/device-plugins directory.
  • Based on the pod's request, the kubelet calls the Device Plugin's Allocate function to ask for the required device information; a minimal sketch of an Allocate implementation follows this list.
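
  • For orientation, here is a minimal sketch of what an Allocate implementation can look like, written as a method on the project's NvidiaDevicePlugin type with the context, strings and pluginapi (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1) imports omitted, like the snippets above. The environment-variable hand-off is an illustrative assumption, not the vgpu project's actual allocation logic.

    // Hypothetical Allocate sketch: for each container request, hand the
    // device IDs chosen by the kubelet back as an environment variable so
    // the container runtime can expose those GPUs to the container.
    func (plugin *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
        resp := &pluginapi.AllocateResponse{}
        for _, req := range reqs.ContainerRequests {
            resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
                Envs: map[string]string{
                    // NVIDIA_VISIBLE_DEVICES is the variable conventionally read
                    // by the NVIDIA container runtime.
                    "NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
                },
            })
        }
        return resp, nil
    }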

3 | Scheduler Extender Plugin

  • A Pod's request may include several resource names (such as nvidia.com/gpu, nvidia.com/gpumem, nvidia.com/gpupercentage), but only one of them (nvidia.com/gpu, corresponding to the gpu.sock file above) is registered as a device.
    • The other two (nvidia.com/gpumem, nvidia.com/gpupercentage) are not registered; if they appear in the pod request, the scheduler does not recognize them and the Pod stays Pending.
  • So how do we get all three resource names handled by the same Nvidia Device Plugin?
    • Declare them in the Scheduler's KubeSchedulerConfiguration and point a Scheduler Extender Plugin at them, so the extender picks an appropriate Node for the Pod; see the sketch below.
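
  • To make the extender's role concrete, here is a minimal sketch (not the project's code) of an HTTP filter endpoint of the kind the urlPrefix/filterVerb settings above point at, using the extender wire types from k8s.io/kube-scheduler/extender/v1. The nodeFits check, the resource accounting, and the listen address are placeholder assumptions.

    // Hypothetical scheduler-extender filter endpoint. Because the
    // KubeSchedulerConfiguration above sets nodeCacheCapable: true, the
    // scheduler sends node names only; we return the names that can still
    // host the pod's vGPU request and report the rest as failed.
    package main

    import (
        "encoding/json"
        "net/http"

        extenderv1 "k8s.io/kube-scheduler/extender/v1"
    )

    // nodeFits stands in for the real device-aware check (does this node have
    // enough nvidia.com/gpu, gpumem and gpucores left for the pod?).
    func nodeFits(nodeName string, args *extenderv1.ExtenderArgs) bool {
        return true // assumption: real logic would consult the per-node device registry
    }

    func filter(w http.ResponseWriter, r *http.Request) {
        var args extenderv1.ExtenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        result := extenderv1.ExtenderFilterResult{
            NodeNames:   &[]string{},
            FailedNodes: extenderv1.FailedNodesMap{},
        }
        if args.NodeNames != nil {
            for _, name := range *args.NodeNames {
                if nodeFits(name, &args) {
                    *result.NodeNames = append(*result.NodeNames, name)
                } else {
                    result.FailedNodes[name] = "insufficient vGPU resources"
                }
            }
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(result)
    }

    func main() {
        http.HandleFunc("/filter", filter)
        // The chart points the scheduler at https://127.0.0.1:443 with
        // filterVerb "filter"; plain HTTP here keeps the sketch short.
        http.ListenAndServe(":8080", nil)
    }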