[K8S device-plugin] Use the vgpu project to analyze the relationship between device-plugin, Scheduler Extender Plugin, and KubeSchedulerConfiguration

Project

  • https://github.com/4paradigm/k8s-vgpu-scheduler/

How k8s-vgpu-scheduler implements vgpu resource allocation

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional, Integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)
  1. First, create a device-plugin and register the gpu resource name (nvidia.com/gpu) with the kubelet

  2. Create a dedicated kube-scheduler for allocating gpu resources (I don't recall its exact name; it is called gpu-scheduler here. The default k8s scheduler is default-scheduler, and the code lives under pkg/scheduler in the project)

    • This mainly uses the scheduler extender plugin mechanism: an HTTP extender plugin is created to schedule gpu pods (it watches the nvidia.com/gpu resource parameter); a minimal sketch of such an extender endpoint is given at the end of this note
  3. Use KubeSchedulerConfiguration to declare that the above HTTP extender plugin also handles the other gpu resource parameters (nvidia.com/gpumem, nvidia.com/gpucores)

    • Note that if KubeSchedulerConfiguration does not declare them, k8s treats nvidia.com/gpumem and nvidia.com/gpucores as ordinary resource devices: during scheduling it looks for them, finds that no node has any quota for them (kubectl describe node nodeName shows each node's device resources), and the pod fails to schedule.

    • The logic here differs somewhat from Tencent's gpu-manager: Tencent registers two devices, nvidia.com/vgpu and nvidia.com/vmem, and when vgpu is allocated, vmem is counted and allocated together with it.

    • # charts/vgpu/templates/scheduler/configmapnew.yaml
      # Here is a configuration file in the helm chart template
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: {{ include "4pd-vgpu.scheduler" . }}-newversion
        labels:
          app.kubernetes.io/component: 4pd-scheduler
          {{- include "4pd-vgpu.labels" . | nindent 4 }}
      data:
        config.yaml: |
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: KubeSchedulerConfiguration
          leaderElection:
            leaderElect: false
          profiles:
          - schedulerName: {{ .Values.schedulerName }}
          extenders:
          - urlPrefix: "https://127.0.0.1:443"
            filterVerb: filter
            bindVerb: bind
            nodeCacheCapable: true
            weight: 1
            httpTimeout: 30s
            enableHTTPS: true
            tlsConfig:
              insecure: true
            managedResources:
            - name: {{ .Values.resourceName }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceMem }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceCores }}
              ignoredByScheduler: true
            - name: {{ .Values.resourceMemPercentage }}
              ignoredByScheduler: true
            - name: {{ .Values.resourcePriority }}
              ignoredByScheduler: true
            - name: {{ .Values.mluResourceName }}
              ignoredByScheduler: true
            - name: {{ .Values.mluResourceMem }}
              ignoredByScheduler: true
      
  4. Now that there is a new gpu-scheduler alongside the default k8s default-scheduler, how does a pod end up on the right one? Via the k8s mutating webhook mechanism.

    • The mutating webhook takes effect before the pod reaches a scheduler. Its main job is to check whether the pod requests nvidia.com/gpu resources; if so, it rewrites schedulerName in pod.spec to gpu-scheduler (without this change, the pod keeps the default default-scheduler, which has no gpu scheduling logic). A minimal sketch of such a handler follows the chart snippet below.

    • # charts/vgpu/templates/scheduler/webhook.yaml
      # Here is a configuration file in the helm chart template
      apiVersion: admissionregistration.k8s.io/v1
      kind: MutatingWebhookConfiguration
      metadata:
        name: {{ include "4pd-vgpu.scheduler.webhook" . }}
      webhooks:
        - admissionReviewVersions:
          - v1beta1
          clientConfig:
            {{- if .Values.scheduler.customWebhook.enabled }}
            url: https://{{ .Values.scheduler.customWebhook.host }}:{{ .Values.scheduler.customWebhook.port }}{{ .Values.scheduler.customWebhook.path }}
            {{- else }}
            service:
              name: {{ include "4pd-vgpu.scheduler" . }}
              namespace: {{ .Release.Namespace }}
              path: /webhook
              port: {{ .Values.scheduler.service.httpPort }}
            {{- end }}
          failurePolicy: Fail
          matchPolicy: Equivalent
          name: vgpu.4pd.io
          namespaceSelector:
            matchExpressions:
            - key: 4pd.io/webhook
              operator: NotIn
              values:
              - ignore
          objectSelector:
            matchExpressions:
            - key: 4pd.io/webhook
              operator: NotIn
              values:
              - ignore
          reinvocationPolicy: Never
          rules:
            - apiGroups:
                - ""
              apiVersions:
                - v1
              operations:
                - CREATE
              resources:
                - pods
              scope: '*'
          sideEffects: None
          timeoutSeconds: 10
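
    • To make the webhook's behavior concrete, here is a minimal sketch (not the project's actual code) of a handler behind the /webhook path above. It uses the admission/v1 types for brevity (the chart advertises v1beta1); the scheduler name "gpu-scheduler" and the listen address are placeholder assumptions.

      // Hypothetical mutating-webhook handler: if any container requests
      // nvidia.com/gpu, patch spec.schedulerName so the dedicated gpu
      // scheduler picks the pod up instead of default-scheduler.
      package main

      import (
          "encoding/json"
          "io"
          "net/http"

          admissionv1 "k8s.io/api/admission/v1"
          corev1 "k8s.io/api/core/v1"
      )

      func mutatePod(w http.ResponseWriter, r *http.Request) {
          body, _ := io.ReadAll(r.Body)

          var review admissionv1.AdmissionReview
          if err := json.Unmarshal(body, &review); err != nil {
              http.Error(w, err.Error(), http.StatusBadRequest)
              return
          }
          var pod corev1.Pod
          if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
              http.Error(w, err.Error(), http.StatusBadRequest)
              return
          }

          resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}

          // Does any container ask for the vGPU resource?
          wantsGPU := false
          for _, c := range pod.Spec.Containers {
              if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
                  wantsGPU = true
                  break
              }
          }

          if wantsGPU {
              // JSONPatch pointing the pod at the dedicated scheduler
              // ("gpu-scheduler" is a placeholder name here).
              patch := []map[string]string{
                  {"op": "add", "path": "/spec/schedulerName", "value": "gpu-scheduler"},
              }
              resp.Patch, _ = json.Marshal(patch)
              pt := admissionv1.PatchTypeJSONPatch
              resp.PatchType = &pt
          }

          out, _ := json.Marshal(admissionv1.AdmissionReview{TypeMeta: review.TypeMeta, Response: resp})
          w.Header().Set("Content-Type", "application/json")
          w.Write(out)
      }

      func main() {
          http.HandleFunc("/webhook", mutatePod)
          // The real deployment serves HTTPS with the certificate referenced by
          // the MutatingWebhookConfiguration; plain HTTP keeps the sketch short.
          http.ListenAndServe(":8080", nil)
      }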
      
      

Understanding the logic with diagrams

1 | Device-Plugin reporting logic

  1. After deployment, the Device Plugin calls its Register method to register the resource name (nvidia.com/gpu) it is responsible for with the kubelet.

    • // pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

      // Register registers the device plugin for the given resourceName with Kubelet.
      func (plugin *NvidiaDevicePlugin) Register() error {
          conn, err := plugin.dial(pluginapi.KubeletSocket, 5*time.Second)
          if err != nil {
              return err
          }
          defer conn.Close()

          client := pluginapi.NewRegistrationClient(conn)
          reqt := &pluginapi.RegisterRequest{
              Version: pluginapi.Version,
              // dfy: socket for communicating with the resource management server
              Endpoint: path.Base(plugin.socket),
              // dfy: register the resource name
              ResourceName: string(plugin.rm.Resource()),
              Options: &pluginapi.DevicePluginOptions{
                  GetPreferredAllocationAvailable: true,
              },
          }

          _, err = client.Register(context.Background(), reqt)
          if err != nil {
              return err
          }
          return nil
      }
      
  2. The kubelet calls the Device Plugin's ListAndWatch method to listen for changes in the resource devices and synchronizes them to the Node (you can view the device resources with kubectl describe node xx)

    • // pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go

      // ListAndWatch lists devices and updates that list according to the health status
      func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
          s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

          for {
              select {
              case <-plugin.stop:
                  return nil
              case d := <-plugin.health:
                  // FIXME: there is no way to recover from the Unhealthy state.
                  d.Health = pluginapi.Unhealthy
                  klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
                  s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
              }
          }
      }
      

2 | Device-Plugin resource allocation logic

  • You can think of the kubelet as the client and the Device Plugin as the server.
  • They communicate over sockets placed in the /var/lib/kubelet/device-plugins directory.
  • Based on the pod's request, the kubelet calls the Device Plugin's Allocate function to ask for the required device information; a minimal sketch of an Allocate implementation follows this list.
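
  • For orientation, here is a minimal sketch of what an Allocate implementation can look like, written as a method on the project's NvidiaDevicePlugin type with the context, strings and pluginapi (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1) imports omitted, like the snippets above. The environment-variable hand-off is an illustrative assumption, not the vgpu project's actual allocation logic.

    // Hypothetical Allocate sketch: for each container request, hand the
    // device IDs chosen by the kubelet back as an environment variable so
    // the container runtime can expose those GPUs to the container.
    func (plugin *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
        resp := &pluginapi.AllocateResponse{}
        for _, req := range reqs.ContainerRequests {
            resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
                Envs: map[string]string{
                    // NVIDIA_VISIBLE_DEVICES is the variable conventionally read
                    // by the NVIDIA container runtime.
                    "NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
                },
            })
        }
        return resp, nil
    }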

3 | Scheduler Extender Plugin

  • A Pod's request may include several resource names (such as nvidia.com/gpu, nvidia.com/gpumem, nvidia.com/gpupercentage), but only one of them (nvidia.com/gpu, corresponding to the gpu.sock file above) is registered as a device.
    • The other two (nvidia.com/gpumem, nvidia.com/gpupercentage) are not registered; if they appear in the pod request, the scheduler does not recognize them and the Pod stays Pending.
  • So how do we get all three resource names handled by the same Nvidia Device Plugin?
    • Declare them in the Scheduler's KubeSchedulerConfiguration and point a Scheduler Extender Plugin at them, so the extender picks an appropriate Node for the Pod; see the sketch below.
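
  • To make the extender's role concrete, here is a minimal sketch (not the project's code) of an HTTP filter endpoint of the kind the urlPrefix/filterVerb settings above point at, using the extender wire types from k8s.io/kube-scheduler/extender/v1. The nodeFits check, the resource accounting, and the listen address are placeholder assumptions.

    // Hypothetical scheduler-extender filter endpoint. Because the
    // KubeSchedulerConfiguration above sets nodeCacheCapable: true, the
    // scheduler sends node names only; we return the names that can still
    // host the pod's vGPU request and report the rest as failed.
    package main

    import (
        "encoding/json"
        "net/http"

        extenderv1 "k8s.io/kube-scheduler/extender/v1"
    )

    // nodeFits stands in for the real device-aware check (does this node have
    // enough nvidia.com/gpu, gpumem and gpucores left for the pod?).
    func nodeFits(nodeName string, args *extenderv1.ExtenderArgs) bool {
        return true // assumption: real logic would consult the per-node device registry
    }

    func filter(w http.ResponseWriter, r *http.Request) {
        var args extenderv1.ExtenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        result := extenderv1.ExtenderFilterResult{
            NodeNames:   &[]string{},
            FailedNodes: extenderv1.FailedNodesMap{},
        }
        if args.NodeNames != nil {
            for _, name := range *args.NodeNames {
                if nodeFits(name, &args) {
                    *result.NodeNames = append(*result.NodeNames, name)
                } else {
                    result.FailedNodes[name] = "insufficient vGPU resources"
                }
            }
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(result)
    }

    func main() {
        http.HandleFunc("/filter", filter)
        // The chart points the scheduler at https://127.0.0.1:443 with
        // filterVerb "filter"; plain HTTP here keeps the sketch short.
        http.ListenAndServe(":8080", nil)
    }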