How to quickly deploy an AI inference service based on ACK Serverless

With the arrival of the AI wave, new AI applications are appearing constantly. AI applications rely heavily on GPU resources, but GPUs are expensive, so reducing the cost of GPU usage has become a primary concern for users. Combining AI with serverless technology makes it possible to use resources strictly on demand and lower resource costs.

So in the cloud-native world, is there an out-of-the-box, standard, and open solution for this? The answer is yes. ACK Serverless provides a Knative + KServe solution that helps users quickly deploy AI inference services, use resources on demand, and automatically scale GPU resources down to zero when there are no requests, greatly reducing resource costs in AI application scenarios.

About ACK Serverless

ACK Serverless (Container Service for Kubernetes, Serverless Edition) is a secure and reliable container product built on Alibaba Cloud's elastic computing infrastructure and fully compatible with the Kubernetes ecosystem. With ACK Serverless, you can quickly create Kubernetes container applications without managing or maintaining Kubernetes clusters. It supports multiple GPU resource specifications and bills on demand, based on the resources an application actually uses.

Knative and KServe

Knative is an open-source serverless application framework built on Kubernetes that provides request-driven autoscaling, scale-to-zero, and canary (grayscale) releases. Deploying serverless applications with Knative lets you focus on application logic and use resources on demand.
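As a sketch of these capabilities, a minimal Knative Service manifest can request scale-to-zero and concurrency-based autoscaling through Knative's standard autoscaling annotations. The service name and image below are placeholders, not part of this article's deployment:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: demo-app                      # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Allow the revision to scale all the way down to zero pods when idle
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "5"
        # Target 10 concurrent requests per pod before scaling out
        autoscaling.knative.dev/target: "10"
    spec:
      containers:
        - image: registry.example.com/demo-app:latest   # placeholder image
```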

KServe provides a simple Kubernetes CRD for deploying one or more trained models to model-serving runtimes such as TFServing, TorchServe, and Triton. These runtimes offer out-of-the-box model serving, and KServe's basic API primitives also let you easily build custom model-serving runtimes. After deploying an inference model on Knative with an InferenceService, you get the following serverless capabilities:

  • Scale to zero
  • Autoscaling based on RPS, concurrency, and CPU/GPU metrics
  • Multi-version management
  • Traffic management
  • Security and authentication
  • Out-of-the-box observability

The KServe model-serving control plane is centered on the KServe Controller, which reconciles InferenceService custom resources and creates the corresponding Knative Services, so that the service scales automatically with request traffic and scales to zero when no traffic is received.

Quickly deploy your first inference service with KServe

In this article, we will deploy an InferenceService that serves a scikit-learn model trained on the iris dataset. The dataset has three output classes: Iris Setosa (index 0), Iris Versicolour (index 1), and Iris Virginica (index 2). You can then send inference requests to the deployed model to predict the iris species of a given sample.
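To make the class indices concrete, here is a small illustrative mapping (the dictionary and helper name are our own, not part of KServe):

```python
# Class index -> species name for the iris dataset served by this model.
IRIS_CLASSES = {
    0: "Iris Setosa",
    1: "Iris Versicolour",
    2: "Iris Virginica",
}

def class_name(index: int) -> str:
    """Return the iris species name for a predicted class index."""
    return IRIS_CLASSES[index]

# A sample is [sepal length, sepal width, petal length, petal width] in cm.
sample = [6.8, 2.8, 4.8, 1.4]
print(class_name(1))  # Iris Versicolour
```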

Prerequisites

  • Activated ACK Serverless[1]
  • Deploy KServe[2]

Currently, Alibaba Cloud Knative supports one-click deployment of KServe and supports gateways such as ASM, ALB, MSE, and Kourier.

Create an InferenceService

kubectl apply -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

Check service status:

kubectl get inferenceservices sklearn-iris

Expected output:

NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s

Service access

1. Obtain the service access address

$ kubectl get albconfig knative-internet
NAME               ALBID                    DNSNAME                                              PORT & PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0lsdra15g0   alb-hvd8nngl0lsdra15g0.cn-beijing.alb.aliyuncs.com                              24m

2. Prepare your inference input request in a file

The iris dataset contains three species of iris, with 50 samples each. Every sample has four features: sepal length, sepal width, petal length, and petal width.

cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF

3. Access the service

INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json
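The same request can also be issued from Python using only the standard library. This is a sketch: the gateway address and service hostname below are the example values from the steps above, and `build_predict_request` is our own helper name; substitute your own values.

```python
import json
import urllib.request

def build_predict_request(ingress_domain: str, service_host: str,
                          instances: list) -> urllib.request.Request:
    """Build a KServe v1 predict request routed through the ALB gateway."""
    url = f"http://{ingress_domain}/v1/models/sklearn-iris:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Knative routes traffic by host, so the Host header must be set.
            "Host": service_host,
            "Content-Type": "application/json",
        },
    )

# Example values taken from the article's own output:
req = build_predict_request(
    "alb-hvd8nngl0lsdra15g0.cn-beijing.alb.aliyuncs.com",  # INGRESS_DOMAIN
    "sklearn-iris-predictor-default.default.example.com",  # SERVICE_HOSTNAME
    [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
)
# urllib.request.urlopen(req).read()  # would actually send the request
```

Uncommenting the `urlopen` call sends the request; it requires network access to the gateway.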

Expected output:

* Trying 39.104.203.214:80...
* Connected to 39.104.203.214 (39.104.203.214) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.84.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 21
< content-type: application/json
< date: Wed, 21 Jun 2023 03:17:23 GMT
< server: envoy
< x-envoy-upstream-service-time: 4
<
* Connection #0 to host 39.104.203.214 left intact
{"predictions":[1,1]}

You should see two predictions returned ({"predictions": [1, 1]}), meaning that both sets of data points sent for inference correspond to class index 1: the model predicts that both flowers are Iris Versicolour.
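The raw JSON response can be decoded back into species names with a short helper (the function name is illustrative; the mapping follows the dataset description above):

```python
import json

# Class index -> species name, in the dataset's index order.
SPECIES = ["Iris Setosa", "Iris Versicolour", "Iris Virginica"]

def decode_predictions(response_body: str) -> list:
    """Turn a KServe v1 predict response into iris species names."""
    predictions = json.loads(response_body)["predictions"]
    return [SPECIES[i] for i in predictions]

print(decode_predictions('{"predictions":[1,1]}'))
# ['Iris Versicolour', 'Iris Versicolour']
```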

Summary

ACK Serverless has been upgraded to meet the new requirements arising from fast-growing scenarios such as AI, helping enterprises evolve to a serverless architecture simply and smoothly in a standard, open, and flexible way. Combining ACK Serverless with KServe delivers an excellent serverless experience for AI model inference scenarios.

Related links:

[1] Enable ACK Serverless

https://help.aliyun.com/zh/ack/serverless-kubernetes/user-guide/create-an-ask-cluster-2

[2] Deploy KServe

https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/knative-support-kserve


This article is the original content of Alibaba Cloud and cannot be reproduced without permission.
