Hystrix fault-tolerant component

Introduction to Hystrix

Hystrix literally means porcupine: an animal covered with quills that looks untouchable, which hints at its role as a protective mechanism. It is a fault-tolerance component, open sourced by Netflix.


So what does Hystrix do? What exactly does it protect?
Hystrix is a latency and fault-tolerance library open sourced by Netflix. It isolates access to remote services and third-party libraries to prevent cascading failures.

Avalanche problem

In a microservice architecture, the call relationships between services are complex: a single request may have to call multiple microservice interfaces to be fulfilled, which forms a very complex call chain:

For example, a business request may need to call four services A, P, H, and I, and each of those services may call further services.
What happens if one of those services fails at this point?

For example, if microservice I throws an exception, the request blocks and the user never receives a response, so the Tomcat thread handling it is never released. As more and more user requests arrive, more and more threads become blocked.


The number of threads and the level of concurrency a server can support are limited. If requests keep blocking, server resources are gradually exhausted until all other services become unavailable as well, forming an avalanche effect.

It is like a car production line that builds different models and needs different parts. If one part becomes unavailable for some reason, the whole car cannot be assembled and the line sits waiting until the part arrives before assembly can continue. If many models need that same part, the entire factory ends up waiting and all production is paralyzed: the impact of one missing part keeps spreading.

Hystrix has two ways to solve the avalanche problem:

  • Thread isolation (thread pool isolation, semaphore isolation)
  • Service circuit breaker

Service downgrade

Introduce dependencies

First introduce the Hystrix dependency in spring-consumer’s pom.xml:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>

Add the @EnableHystrix or @EnableCircuitBreaker annotation to the startup class of the service caller to activate the circuit breaker with its default configuration. @EnableHystrix has the same semantics as @EnableCircuitBreaker; their relationship is similar to that between @Service and @Component.

Enable the Hystrix circuit breaker

@SpringBootApplication
@EnableDiscoveryClient
@EnableCircuitBreaker
public class SpringConsumerApplication {
    @Bean
    @LoadBalanced
    public RestTemplate getRestTemplate(RestTemplateBuilder builder){
        return builder.build();
    }

    public static void main(String[] args) {
        SpringApplication.run(SpringConsumerApplication.class, args);
    }
}

As you can see, more and more annotations are accumulating on our startup class. The three annotations above are introduced together so often in microservices that Spring Cloud provides a combined annotation: @SpringCloudApplication


Therefore, we can use this combined annotation to replace the previous 3 annotations.

@SpringCloudApplication
public class SpringConsumerApplication {

    @Bean
    @LoadBalanced
    public RestTemplate getRestTemplate(RestTemplateBuilder builder){
        return builder.build();
    }

    public static void main(String[] args) {
        SpringApplication.run(SpringConsumerApplication.class, args);
    }
}

Write downgrade logic

We modify spring-consumer so that when a call to the target service fails, it fails fast and gives the user a friendly message. To achieve this, we write the fallback (downgrade) logic for the failure case in advance and let @HystrixCommand invoke it:

@RestController
public class ConsumerController {

    @Autowired
    private RestTemplate restTemplate;

    @RequestMapping(value = "/consumerLoadBalanced/{id}")
    @HystrixCommand(fallbackMethod = "consumerLoadBalancedFallbackMethod")
    public String consumerLoadBalanced(@PathVariable String id){
        String url = "http://spring-provider/provider/" + id;
        String consumer = restTemplate.getForObject(url, String.class);
        return "LoadBalanced restTemplate consumer " + consumer;
    }

    public String consumerLoadBalancedFallbackMethod(String id){
        return "The system is busy, please try again later!";
    }
}

Note that the fallback method must have the same parameter list and return type as the normal method it protects. Returning a full domain object from the failure logic is not very meaningful; usually a friendly message is returned instead. That is why the handler returns String, which is just JSON data anyway, and makes it easy to return an error description from the fallback (a sketch of an extended fallback signature follows after the notes below).

Notes:

  • @HystrixCommand(fallbackMethod = "consumerLoadBalancedFallbackMethod"): declares the fallback (downgrade) method. The annotation has other attributes as well; by default the global Hystrix configuration is read and every call that meets the downgrade conditions is downgraded in the same way. You can also configure the downgrade behavior for a single business method, for example:
 @HystrixCommand(
    fallbackMethod = "fallBackMethod",
    commandProperties = {
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000"),
        @HystrixProperty(name = "...", value = "..."),
        @HystrixProperty(name = "...", value = "...")
    }
)
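
Related to the signature requirement above: with the annotation-style (javanica) @HystrixCommand, the fallback may also declare one extra trailing parameter of type Throwable to inspect why the call failed. A minimal sketch, with a hypothetical controller and endpoint used only for illustration:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FallbackSignatureController { // hypothetical class, for illustration only

    @RequestMapping(value = "/fallbackDemo/{id}")
    @HystrixCommand(fallbackMethod = "fallbackWithCause")
    public String fallbackDemo(@PathVariable String id) {
        // simulate a failing remote call so the fallback is always used
        throw new RuntimeException("simulated failure for id " + id);
    }

    // Same parameter list and return type as the command method,
    // plus an optional trailing Throwable carrying the failure cause.
    public String fallbackWithCause(String id, Throwable cause) {
        return "The system is busy (" + cause.getMessage() + "), please try again later!";
    }
}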

Default FallBack

So far we have only written a fallback for one specific business method. If many business methods need to be downgraded when the server cannot be reached, writing a separate fallback for each one becomes tedious. Instead, we can put the fallback configuration on the class to define a default fallback:

@RestController
@DefaultProperties(defaultFallback = "fallbackMethod") // Specify the class-wide default fallback method
public class ConsumerController {

    @Autowired
    private RestTemplate restTemplate;

    @GetMapping("/consumerLoadBalanced/{id}")
    @HystrixCommand // Mark this method as one that can be downgraded
    public String consumerLoadBalanced(@PathVariable String id){
        String url = "http://spring-provider/provider/" + id;
        String consumer = restTemplate.getForObject(url, String.class);
        return "LoadBalanced restTemplate consumer " + consumer;
    }

    /**
     * Fallback (downgrade) method
     * Its return type must match the return type of the downgraded method
     * The default fallback method takes no parameters
     * @return a friendly error message
     */
    public String fallbackMethod(){
        return "Global default, the system is busy, please try again later!";
    }
}

Notes:

  • @DefaultProperties(defaultFallback = "fallbackMethod"): declares a unified failure fallback method for the whole class
  • @HystrixCommand: used directly on a method; the method then falls back to the default fallback
  • defaultFallback: the default fallback method takes no parameters so that it can match more methods, but its return type must still match

Hystrix timeout configuration

Hystrix's global configuration, also called the default configuration, is set in the configuration file under hystrix.command.default.* (remember that, for Hystrix, the relevant side is the service caller, so this configuration naturally also lives on the caller).

In the previous case, the request returns an error message once it takes more than 1 second. This is because Hystrix's default timeout is 1 second. We can change this value through configuration:

The Hystrix timeout is set with hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds. Note that the IDE offers no auto-completion hint for this key.

hystrix:
  command:
    default: #default can also be replaced with a specific service name to target only that service
      execution:
        isolation:
          thread: #hystrix keeps an internal thread pool and in effect runs each http request on one of its threads
            timeoutInMilliseconds: 6000 #set the hystrix timeout to 6000ms
          strategy: THREAD #the default is thread pool isolation, so this line can be omitted

Note: to test this, the service provider needs to be modified (see below); open the browser's F12 tools and check the elapsed time.

Whether we use RestTemplate or OpenFeign, both rely on Ribbon for load balancing (and timeout retries), and Ribbon also watches for request timeouts. In theory, Hystrix's timeout should therefore be longer than the total time Ribbon can spend on its timeout retries; otherwise Ribbon may still be "trying hard" while Hystrix has already decided to "give up". That is not strictly forbidden, but it is not very sensible.

Note: in other words, Hystrix tripping its circuit breaker has nothing to do with Ribbon's retries. Ribbon will still retry whenever it is configured to, and those retries can make the called system perform useless, duplicated work.

Besides setting sensible values, you can also simply turn off Hystrix's own timeout judgment and leave it entirely to Ribbon to decide (and report to Hystrix) whether the call has timed out.
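
A minimal sketch of what that could look like on the caller, assuming the same consumer endpoint as above. execution.timeout.enabled is the standard Hystrix property for disabling the Hystrix timeout; the endpoint and fallback names simply mirror the earlier example:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@RestController
public class ConsumerController {

    @Autowired
    private RestTemplate restTemplate;

    @RequestMapping(value = "/consumerLoadBalanced/{id}")
    @HystrixCommand(
            fallbackMethod = "consumerLoadBalancedFallbackMethod",
            commandProperties = {
                    // Hystrix no longer enforces its own timeout; Ribbon's
                    // ReadTimeout/ConnectTimeout (and retries) decide when the
                    // call is treated as failed and the fallback runs.
                    @HystrixProperty(name = "execution.timeout.enabled", value = "false")
            }
    )
    public String consumerLoadBalanced(@PathVariable String id) {
        String url = "http://spring-provider/provider/" + id;
        return "LoadBalanced restTemplate consumer " + restTemplate.getForObject(url, String.class);
    }

    public String consumerLoadBalancedFallbackMethod(String id) {
        return "The system is busy, please try again later!";
    }
}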

Modify the service provider

Modify the provider interface of the service provider so that it sleeps for a while (8 seconds here) before responding:

 @RequestMapping(value = "/provider/{id}")
    public String provider(@PathVariable String id){
        try {
            Thread.sleep(8000);
        } catch (InterruptedException e) {
            return "exception:" + e.getMessage();
        }
        return "provider id = " + id + ", port = " + port;
    }

Since the provider cannot respond within the 6-second limit, the call is recorded as a failure and the fallback (downgrade) logic is executed; once enough such failures accumulate, the circuit breaker trips as well.

Service circuit breaker

Circuit breaker principle

A fuse is also called a circuit breaker; its English term is Circuit Breaker.

The circuit breaker has three states:

  • Closed: the closed state; all requests pass through normally.
  • Open: the open state; all requests are downgraded. Hystrix counts requests, and when the percentage of failures reaches the threshold within the statistics window, the breaker trips and opens fully. The default failure threshold is 50% with a minimum of 20 requests; by default, that means that if at least 20 requests arrive within the 10-second rolling window and at least 10 of them (50%) fail, requests can no longer pass through normally.
  • Half Open: the half-open state. The open state is not permanent; once the breaker opens, a sleep window starts (5 seconds by default). The breaker then automatically enters the half-open state and lets a few requests through. If those requests are healthy, the breaker closes completely; otherwise it stays open and the sleep timer starts again.


Hands-on practice

To control precisely whether a request succeeds or fails, we add a piece of logic to the provider's business method:

 @RequestMapping(value = "/provider/{id}")
    public String provider(@PathVariable String id){
        if(id.equals("1")){
            throw new RuntimeException("Exception");
        }
        return "provider id = " + id + ", port = " + port;
    }

Consumer business code

 @RequestMapping(value = "/consumerLoadBalanced/{id}")
    @HystrixCommand
    public String consumerLoadBalanced(@PathVariable String id){
        String url = "http://spring-provider/provider/" + id;
        String consumer = restTemplate.getForObject(url, String.class);
        return "LoadBalanced restTemplate consumer " + consumer;
    }
    public String fallbackMethod(){
        return "Global default, the system is busy, please try again later!";
    }

We prepare two request windows:

  • One window requests http://localhost:8280/consumerLoadBalanced/1, which is guaranteed to fail
  • The other requests http://localhost:8280/consumerLoadBalanced/2, which is guaranteed to succeed

If we hammer the request with ID 1 (more than 20 times within the statistics window), the circuit breaker trips: it opens, and all requests are downgraded.

At this point, if you access the request with ID 2, you will find that it also fails, and it only returns to normal after a while.

Circuit breaker policy configuration

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 6000
      circuitBreaker:
        requestVolumeThreshold: 20
        sleepWindowInMilliseconds: 10000
        errorThresholdPercentage: 50
        #forceOpen: true #Whether to force the circuit breaker (trip) to open, the default is false, if true, all requests will be rejected, and the fallback downgrade method will be executed directly.

Interpretation:

  • requestVolumeThreshold: the minimum number of requests needed before the breaker can trip; default 20. The breaker only considers tripping if more than 20 requests occur within the 10-second statistics window.
  • errorThresholdPercentage: the minimum percentage of failed requests that trips the breaker; default 50%
  • sleepWindowInMilliseconds: the sleep window; default 5000 milliseconds
  • forceOpen: whether to force the breaker open
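
These same circuit breaker properties can also be applied to a single command instead of globally, through commandProperties on @HystrixCommand. A hedged sketch, with a hypothetical controller and endpoint and purely illustrative values:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@RestController
public class CircuitBreakerDemoController { // hypothetical class, for illustration only

    @Autowired
    private RestTemplate restTemplate;

    @RequestMapping(value = "/circuitBreakerDemo/{id}")
    @HystrixCommand(
            fallbackMethod = "fallbackMethod",
            commandProperties = {
                    // at least 20 requests in the rolling window before the breaker may trip
                    @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "20"),
                    // stay open for 10 seconds before moving to the half-open state
                    @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "10000"),
                    // trip once at least 50% of the requests in the window have failed
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50")
            }
    )
    public String circuitBreakerDemo(@PathVariable String id) {
        String url = "http://spring-provider/provider/" + id;
        return restTemplate.getForObject(url, String.class);
    }

    public String fallbackMethod(String id) {
        return "The system is busy, please try again later!";
    }
}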

Resolving catastrophic avalanches

Thread pool isolation

As mentioned earlier, when Tomcat handles HTTP requests for different interfaces, it serves them all from one shared thread pool. Suppose the database behind one of those interfaces responds very slowly: the response time of that service grows, most threads end up blocked waiting for data to come back, and eventually the entire Tomcat thread pool is exhausted, which can even bring down Tomcat itself. If, instead, requests to different interfaces are isolated into different thread pools, the pool serving one interface cannot cause a catastrophic failure for the other services. This is exactly what thread isolation (or semaphore isolation) provides.

By default, Hystrix uses the thread pool as its isolation strategy and prepares an independent thread pool for each requested interface. Requests to the same interface share one pool (each request runs on a thread taken from that pool), while requests to different interfaces create different pools. For example, when a user calls the /provider interface, Hystrix creates a thread pool for that interface; both the number of threads and the size of the backing queue can be configured (for example, with 10 threads in the pool and a queue of size 100, the maximum concurrency is 110; if a 111th request arrives while no thread has been released and the queue is full, it is downgraded immediately).

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 6000
          strategy: THREAD ##Default value uses thread pool isolation technology
-------------------------------------------------------
hystrix:
  threadpool:
    default:
      coreSize: 200
      maxQueueSize: 1000
      queueSizeRejectionThreshold: 800
#With coreSize 200 and queueSizeRejectionThreshold 800, at most 200 requests execute concurrently while up to 800 more wait in the queue; anything beyond that is rejected and downgraded immediately. Queued requests are still bound by ...thread.timeoutInMilliseconds: 6000
Parameter descriptions:

  • coreSize: maximum number of concurrently executing threads; default 10
  • maxQueueSize: maximum size of the BlockingQueue; default -1
  • queueSizeRejectionThreshold: controls the queue threshold; once this many requests are queued, further requests are rejected even if maxQueueSize has not been reached; default 5

Note that maxQueueSize and queueSizeRejectionThreshold must be configured together; setting only one of them is not enough.
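
If needed, a dedicated thread pool can also be assigned per command rather than only through hystrix.threadpool.default.*, using the threadPoolKey and threadPoolProperties attributes of @HystrixCommand. A hedged sketch with a hypothetical service class and purely illustrative sizes:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class ProviderClient { // hypothetical class, for illustration only

    @Autowired
    private RestTemplate restTemplate;

    @HystrixCommand(
            fallbackMethod = "fallbackMethod",
            threadPoolKey = "providerPool", // all calls to this command share this dedicated pool
            threadPoolProperties = {
                    @HystrixProperty(name = "coreSize", value = "10"),                   // max concurrently executing threads
                    @HystrixProperty(name = "maxQueueSize", value = "100"),              // capacity of the backing queue
                    @HystrixProperty(name = "queueSizeRejectionThreshold", value = "80") // reject (and fall back) once 80 requests are queued
            }
    )
    public String callProvider(String id) {
        return restTemplate.getForObject("http://spring-provider/provider/" + id, String.class);
    }

    public String fallbackMethod(String id) {
        return "The system is busy, please try again later!";
    }
}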

Semaphore isolation

Under the hood, an atomic counter gives each service (interface) its own independent threshold; for example, a service interface may be called at most 50 times concurrently, and anything beyond that limit is downgraded. When the client is about to send a request to the dependent service, the counter is incremented by 1; when the request returns, the counter is decremented by 1.

Semaphore isolation mainly limits the number of concurrent requests so that request threads cannot block on a large scale, which throttles traffic and prevents avalanches.

Configuration parameters:

hystrix:
  command:
    default: #default can also be replaced with a specific service name to target only that service
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 6000
          strategy: SEMAPHORE #switch from the default thread pool isolation to the semaphore strategy
          semaphore:
            maxConcurrentRequests: 100 #maximum number of concurrent requests, default 100
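
The semaphore strategy can likewise be selected for a single command via commandProperties. A hedged sketch, again with a hypothetical service class and an illustrative limit of 50 concurrent calls:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class SemaphoreProviderClient { // hypothetical class, for illustration only

    @Autowired
    private RestTemplate restTemplate;

    @HystrixCommand(
            fallbackMethod = "fallbackMethod",
            commandProperties = {
                    // run on the caller's own thread, guarded by a semaphore instead of a thread pool
                    @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
                    // at most 50 concurrent calls; the 51st is rejected and falls back immediately
                    @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "50")
            }
    )
    public String callProvider(String id) {
        return restTemplate.getForObject("http://spring-provider/provider/" + id, String.class);
    }

    public String fallbackMethod(String id) {
        return "The system is busy, please try again later!";
    }
}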

Usage scenarios

Thread pool isolation: use it when the number of concurrent requests is large and each request takes a long time (long requests usually mean heavy computation or database reads). With a dedicated thread pool, plenty of container threads remain available instead of being blocked or kept waiting because of one slow service, and failing calls return quickly.

Semaphore isolation: use it when the number of concurrent requests is large but each request is short (typically because the computation is light or the data comes from a cache). Such calls usually return very quickly, so they do not occupy the container thread for long; skipping the extra thread also avoids some thread-switching overhead and improves the efficiency of, for example, a cache-backed service.