Demystifying the Presto plug-in mechanism: exploring the possibilities for extending data processing

Article directory

  • 1. Preface
  • 2. Presto plug-in architecture
  • 3. Plugin interface
    • 3.1 Plug-in Agreement
    • 3.2 Plug-in implementation class
  • 4. Plug-in loading process
    • 4.1 PluginManager
  • 5. Plug-in application
  • 6. Summary

Keywords: Presto Plugin

1. Preface

Source code environment for this article:
Presto: PrestoDB version 0.275

  • A plug-in mechanism is a common and powerful way to extend a software system, and Presto relies on it heavily. It makes the system more flexible and extensible, allowing users to customize and extend its functions according to their needs. In a distributed SQL query engine like Presto, the plug-in mechanism is what gives users most of their extension points.

  • Presto is a memory-based distributed query engine designed to process large-scale data quickly and efficiently. It is widely used in data analysis and processing scenarios because of its performance and flexible query capabilities. Presto's plug-in architecture is built on top of its core architecture and gives users an extensible way to enhance and customize Presto. Through the plug-in mechanism, users can load custom plug-ins to support new data sources, add custom functions and types, plug in their own access control, and so on.

2. Presto plug-in architecture

In the Presto plug-in architecture, a plug-in is an independent module that can contain one or more collections of related functions. Each plugin can have its own configuration, dependencies and lifecycle management. Plug-ins can interact with Presto’s core code and use the APIs provided by Presto to extend and customize system functionality.
Presto plugins provide the following functionality:

  1. Data source extension: Plug-ins can load different types of data source drivers, allowing Presto to query and access various data sources, such as relational databases, NoSQL databases, object storage, etc.
  2. Function library extension: Plug-ins can provide customized functions and aggregate functions to meet specific business needs. Users can load corresponding plug-ins according to their own needs and use the functions defined in the plug-ins for data processing and calculations.
  3. Authentication and authorization extensions: Plug-ins can provide customized authentication and authorization mechanisms, allowing users to perform access control and authentication on queries based on their own security requirements.
  4. Optimizer and executor extensions: Plug-ins can implement custom query optimization rules and execution plan algorithms to improve query performance and efficiency.

The core component of the Presto plug-in architecture is the PluginManager class. PluginManager is responsible for locating plug-ins, loading their classes and registering what they provide. It finds the plug-in JAR files, creates a class loader for each plug-in, discovers the Plugin implementations, and installs them into the corresponding Presto components.
Through PluginManager, Presto loads and manages plug-ins at startup, so users can extend and customize Presto without changing its core code. The loading process is the key link in this design. Reading the source code of this process helps us understand how the plug-in architecture works and provides guidance for developing and using plug-ins.

3. Plugin interface

The Presto Plugin interface lives in the presto-spi module. presto-spi is a core module of Presto. It provides a set of public interfaces, the Service Provider Interface (SPI), which other modules use to define and extend Presto's behavior. Through the SPI module, Presto implements a loosely coupled plug-in architecture: components and functions are customized and extended by implementing SPI interfaces. Such an architecture makes it easy to support different data sources and extension needs while keeping Presto's core logic intact and maintainable.

3.1 Plug-in Agreement

The methods defined by the com.facebook.presto.spi.Plugin interface are as follows:

public interface Plugin
{
    // Return the plug-in ConnectorFactory implementation -- connect to external data sources
    default Iterable<ConnectorFactory> getConnectorFactories()
    {
        return emptyList();
    }

    // Return the BlockEncoding implementation provided by the plug-in, which is used to serialize and deserialize Presto's internal Block data structures, for example when blocks are transferred over the network.
    default Iterable<BlockEncoding> getBlockEncodings()
    {
        return emptyList();
    }

    // Return the type (Type) implementation provided by the plug-in, which is used to extend Presto's built-in types to support more data of different types and formats
    default Iterable<Type> getTypes()
    {
        return emptyList();
    }

    // Return the ParametricType implementation provided by the plug-in to support types that take type parameters, such as ARRAY<T> or MAP<K, V>.
    default Iterable<ParametricType> getParametricTypes()
    {
        return emptyList();
    }

    // Return the custom function implementation provided by the plug-in, which can be a SQL function, a custom aggregate function or a scalar function, which can greatly improve the flexibility and scalability of Presto
    default Set<Class<?>> getFunctions()
    {
        return emptySet();
    }

    // Return the SystemAccessControlFactory implementation provided by the plug-in, which is used to customize Presto's system access control policies, such as authorization, resource restrictions, etc.
    default Iterable<SystemAccessControlFactory> getSystemAccessControlFactories()
    {
        return emptyList();
    }

    // Return the PasswordAuthenticatorFactory implementation provided by the plug-in to support customizing Presto's password authentication method
    default Iterable<PasswordAuthenticatorFactory> getPasswordAuthenticatorFactories()
    {
        return emptyList();
    }

    // Return the EventListenerFactory implementation provided by the plug-in, which is used to customize event listening and processing, such as adding logging before/after SQL execution.
    default Iterable<EventListenerFactory> getEventListenerFactories()
    {
        return emptyList();
    }

    // Return the ResourceGroupConfigurationManagerFactory implementation provided by the plug-in, which is used to customize Presto's resource management strategies, such as query grouping, priority, etc.
    default Iterable<ResourceGroupConfigurationManagerFactory> getResourceGroupConfigurationManagerFactories()
    {
        return emptyList();
    }

    // Return the SessionPropertyConfigurationManagerFactory implementation provided by the plug-in, which is used to customize the session property configuration of Presto; these properties are applied when SQL is executed.
    default Iterable<SessionPropertyConfigurationManagerFactory> getSessionPropertyConfigurationManagerFactories()
    {
        return emptyList();
    }

    // Return the FunctionNamespaceManagerFactory implementation provided by the plug-in, which is used to implement Presto's function namespace management, so that functions can be isolated or shared between different users, organizations and data sources.
    default Iterable<FunctionNamespaceManagerFactory> getFunctionNamespaceManagerFactories()
    {
        return emptyList();
    }

    // Return the TempStorageFactory implementation provided by the plug-in, which is used to store intermediate results in external temporary storage, avoiding excessive memory consumption or even an OutOfMemoryError.
    default Iterable<TempStorageFactory> getTempStorageFactories()
    {
        return emptyList();
    }

    // Return the QueryPrerequisitesFactory implementation provided by the plug-in, which is used to customize the preparation work done before Presto executes a query, such as data preparation or metadata loading before the plan is generated.
    default Iterable<QueryPrerequisitesFactory> getQueryPrerequisitesFactories()
    {
        return emptyList();
    }

    // Return the NodeTtlFetcherFactory implementation provided by the plug-in, which is used to fetch the remaining time-to-live (TTL) of each node in the Presto cluster, so that nodes going online and offline can be taken into account.
    default Iterable<NodeTtlFetcherFactory> getNodeTtlFetcherFactories()
    {
        return emptyList();
    }

    // Return the ClusterTtlProviderFactory implementation provided by the plug-in, which is used to derive a time-to-live (TTL) estimate for the cluster as a whole from the TTLs of its nodes.
    default Iterable<ClusterTtlProviderFactory> getClusterTtlProviderFactories()
    {
        return emptyList();
    }

    // Return the ExternalPlanStatisticsProvider implementation provided by the plug-in, which supplies statistics for the Presto execution plan from an external source in order to analyze and optimize execution performance.
    default Iterable<ExternalPlanStatisticsProvider> getExternalPlanStatisticsProviders()
    {
        return emptyList();
    }
}
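Because every method of the interface has a default implementation that returns an empty collection, a plug-in only overrides the methods for the features it actually provides. The following is a minimal, self-contained sketch of that pattern. MiniPlugin, StringFunctions and FunctionOnlyPlugin are hypothetical stand-ins invented for illustration; they are not classes from presto-spi, and MiniPlugin keeps only two simplified methods of the real interface.

```java
import java.util.Collections;
import java.util.Set;

final class DefaultMethodPluginSketch {
    // Hypothetical, reduced stand-in for com.facebook.presto.spi.Plugin.
    interface MiniPlugin {
        default Iterable<String> getConnectorFactoryNames() {
            return Collections.emptyList();
        }

        default Set<Class<?>> getFunctions() {
            return Collections.emptySet();
        }
    }

    // Hypothetical class that would hold the plug-in's function implementations.
    static final class StringFunctions {
        static String reverse(String value) {
            return new StringBuilder(value).reverse().toString();
        }
    }

    // A plug-in that only contributes functions; every other method keeps its default.
    static final class FunctionOnlyPlugin implements MiniPlugin {
        @Override
        public Set<Class<?>> getFunctions() {
            return Collections.singleton(StringFunctions.class);
        }
    }

    public static void main(String[] args) {
        MiniPlugin plugin = new FunctionOnlyPlugin();
        System.out.println("function classes: " + plugin.getFunctions().size());
        System.out.println("has connectors: " + plugin.getConnectorFactoryNames().iterator().hasNext());
    }
}
```

The benefit of this design is that new extension points can be added to the interface later without breaking existing plug-ins, since they simply inherit the new empty default.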

Common kinds of Presto plug-ins and what they do:

  1. Connector plug-in: used to connect different data sources, such as Hadoop HDFS, Amazon S3, Apache Kafka, MySQL, etc., so that Presto can query and analyze data in these different data sources.
  2. Function plug-in: Provides new built-in functions or user-defined functions to enrich Presto’s query capabilities. These functions can be used for data conversion, mathematical calculations, string processing, date processing, etc.
  3. Authentication/Authorization plug-in: used to provide authentication and authorization mechanisms to ensure that only authorized users can access the Presto engine and perform query operations. This can be accomplished by integrating with existing authentication systems (such as Kerberos) or by providing custom user authentication and authorization logic.
  4. SerDe support: provides parsers for serializing and deserializing data, allowing Presto to read and analyze different data formats, such as JSON, Avro, Parquet, etc. The Plugin interface above has no dedicated method for this; format support is normally packaged inside a connector (for example, the Hive connector).
  5. Metadata support: exposes the schemas, tables and column types of a data source to Presto's metadata system so that Presto can understand and manage these data structures. This is likewise part of a connector rather than a separate kind of plug-in.
  6. Connection management: managing and maintaining the connections to the underlying data source, including functions such as connection pooling and connection life cycle management, is also the responsibility of the connector implementation.

3.2 Plug-in implementation class


Common interfaces of Presto plug-ins and their corresponding implementation classes:

  1. Connector: This interface defines the function of connecting to external data sources. Some common implementation classes of the Connector interface include:
    ○ JdbcConnector: used to connect to relational databases that support the JDBC protocol, such as MySQL, PostgreSQL, Oracle, etc.
    ○ HiveConnector: used to connect to the Hive data warehouse and supports querying Hive tables and views.
    ○ KafkaConnector: used to connect to the Apache Kafka stream processing platform and supports reading and writing Kafka topics.
  2. Function: custom SQL functions. In Presto these are mostly declared with annotations rather than by implementing a single interface. Common forms include:
    ○ @ScalarFunction: marks a scalar function, which receives one or more input parameters and returns one result.
    ○ @AggregationFunction: marks an aggregate function, which computes over a set of rows and returns an aggregate result.
    ○ WindowFunction: the interface for window functions, which compute a result for each row over a window of rows.
  3. Type: This interface defines the functionality of custom data types. Some common implementation classes of the Type interface include:
    ○ ArrayType: Implements an array type, representing an array containing multiple elements.
    ○ MapType: Implements the mapping type and represents the data structure of key-value pair mapping.
    ○ RowType: Implements a row type that represents a set of data with named fields.
  4. Split: the ConnectorSplit interface describes one unit of work (a split) of a query, that is, the portion of the data that one task reads. Each connector supplies its own implementation, for example:
    ○ HiveSplit: describes a portion of a file that backs a Hive table.
    ○ JdbcSplit: describes a read from a table in a relational database.
    ○ KafkaSplit: describes a range of messages in a Kafka topic partition.

In presto-main, presto-hive, presto-base-jdbc and other modules, corresponding implementation classes are provided for the above interfaces in order to connect different data sources, define custom functions and types, and process query tasks. These interfaces can also be implemented by users to meet specific needs.

Common examples are JdbcPlugin (the relational database plug-in) and HivePlugin.

4. Plug-in loading process

Plug-ins are loaded when Presto starts, in the run() method of com.facebook.presto.server.PrestoServer:

// ...
// Create the Bootstrap object, passing in the list of modules.
// Bootstrap is the startup class (from the Airlift framework) used to initialize Presto's running environment and load the necessary modules.
Bootstrap app = new Bootstrap(modules.build());

try {
    // Initialize the Bootstrap object with app.initialize() and get the Injector.
    // Injector is the dependency injection container provided by the Guice framework;
    // it manages object dependencies in Presto.
    Injector injector = app.initialize();

    // Load the plug-ins: get the PluginManager instance and call loadPlugins().
    // PluginManager is Presto's plug-in manager, responsible for loading and registering plug-ins.
    injector.getInstance(PluginManager.class).loadPlugins();

    ServerConfig serverConfig = injector.getInstance(ServerConfig.class);

    if (!serverConfig.isResourceManager()) {
        injector.getInstance(StaticCatalogStore.class).loadCatalogs();
    }
    // ...
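The order of the two calls matters: a catalog refers to its connector by name (the connector.name property of the catalog file), so the plug-ins that provide the connectors have to be loaded before the catalogs. The sketch below illustrates this dependency with a plain map. StartupOrderSketch and its methods are hypothetical names for illustration only, not Presto classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class StartupOrderSketch {
    // Connector names contributed by plug-ins, mapped to a placeholder for their factory.
    private final Map<String, String> connectorFactories = new HashMap<>();
    // Catalog names mapped to the factory that created them.
    private final Map<String, String> catalogs = new HashMap<>();

    // Stands in for the effect of loadPlugins(): connector names become known.
    void registerConnectorFactory(String connectorName) {
        connectorFactories.put(connectorName, connectorName + "-factory");
    }

    // Stands in for loadCatalogs(): each catalog is resolved through connector.name.
    void loadCatalog(String catalogName, Properties properties) {
        String connectorName = properties.getProperty("connector.name");
        if (connectorName == null) {
            throw new IllegalArgumentException("Catalog " + catalogName + " has no connector.name");
        }
        String factory = connectorFactories.get(connectorName);
        if (factory == null) {
            throw new IllegalStateException("No factory for connector " + connectorName);
        }
        catalogs.put(catalogName, factory);
    }

    boolean hasCatalog(String catalogName) {
        return catalogs.containsKey(catalogName);
    }

    public static void main(String[] args) {
        StartupOrderSketch server = new StartupOrderSketch();
        Properties catalog = new Properties();
        catalog.setProperty("connector.name", "mysql");

        server.registerConnectorFactory("mysql"); // plug-ins first
        server.loadCatalog("sales", catalog);     // catalogs second
        System.out.println("sales catalog loaded: " + server.hasCatalog("sales"));
    }
}
```

If loadCatalog were called before registerConnectorFactory in this sketch, it would fail with "No factory for connector mysql", which is the situation the startup order avoids.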

4.1 PluginManager


Presto’s plug-in manager (PluginManager) is responsible for loading all plug-ins. The following is its basic workflow:

  1. When Presto starts, it will determine the plug-in directory (plugin.dir) through the configuration file (usually config.properties).
  2. The plug-in manager scans the subdirectories of this directory; each subdirectory is treated as one plug-in.
  3. For each plug-in found, the plug-in manager creates a new class loader and then uses this class loader to load the plug-in’s classes.
  4. The plug-in manager installs each plug-in by registering what it provides. For example, each ConnectorFactory returned by the plug-in is registered with Presto under its connector name.
  5. Connector instances are created from these factories afterwards, when the catalogs are loaded (the loadCatalogs() call in the startup code above).

This process happens automatically when Presto starts, so all plugins will be available immediately after Presto starts.
If you want to add a new plug-in, put the plug-in JAR file, together with the JAR files it depends on, into a new subdirectory of the plug-in directory. Then restart Presto and the new plug-in will be loaded automatically.
Note: Presto does not check the plug-in version or compatibility when loading plug-ins. Therefore, you need to ensure that your plugin is compatible with your version of Presto.
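The workflow above can be sketched with standard JDK classes. This is an illustration under stated assumptions, not Presto's PluginManager: MiniPlugin is a hypothetical stand-in for the Plugin interface, a plain URLClassLoader stands in for Presto's own plug-in class loader, and discovery through java.util.ServiceLoader assumes each plug-in JAR declares its implementation under META-INF/services.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class PluginDirectoryScanSketch {
    // Hypothetical stand-in for the Plugin SPI interface.
    public interface MiniPlugin {
        String name();
    }

    // Collects the JAR files of one plug-in subdirectory as class path URLs.
    static List<URL> jarUrls(Path pluginDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return urls;
    }

    // Treats every subdirectory of pluginRoot as one plug-in with its own class loader.
    static List<MiniPlugin> loadAll(Path pluginRoot) throws IOException {
        List<MiniPlugin> plugins = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(pluginRoot, Files::isDirectory)) {
            for (Path dir : dirs) {
                URL[] urls = jarUrls(dir).toArray(new URL[0]);
                // Not closed on purpose: a plug-in's class loader lives as long as the server.
                URLClassLoader loader = new URLClassLoader(urls, MiniPlugin.class.getClassLoader());
                for (MiniPlugin plugin : ServiceLoader.load(MiniPlugin.class, loader)) {
                    plugins.add(plugin);
                }
            }
        }
        return plugins;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("plugin");
        Files.createDirectory(root.resolve("example-plugin"));
        System.out.println("plug-ins found: " + loadAll(root).size());
    }
}
```

Giving each plug-in its own class loader keeps the dependencies of different plug-ins isolated from each other, so two plug-ins can ship different versions of the same library.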

The loadPlugins() method works from the configured plug-in locations:

The locations to load from are specified by plugin.dir and plugin.bundles in the config.properties file.
plugin.dir specifies a single directory of installed plug-ins. At startup Presto scans its subdirectories and loads the JAR files in each subdirectory as one plug-in. Unlike plugin.bundles, plugin.dir takes exactly one directory path and cannot contain multiple comma-separated paths.
plugin.bundles is a comma-separated list of additional plug-ins to load at startup. In the 0.275 source, each entry appears to be resolved as a pom.xml file, a directory, or Maven coordinates rather than as a path to a remote JAR file, which makes this option mainly useful when running Presto from a development environment.
Both settings are read once at startup; neither loads plug-ins while the server is running. Which one you use depends on how your plug-ins are built and deployed.
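As a small illustration of the two settings, the sketch below reads them from a java.util.Properties object. PluginConfigSketch is a hypothetical helper, not Presto's configuration class, and the default value "plugin" for plugin.dir is an assumption made for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class PluginConfigSketch {
    // plugin.dir is a single path; "plugin" as the default is an assumption of this sketch.
    static Path pluginDir(Properties config) {
        return Paths.get(config.getProperty("plugin.dir", "plugin"));
    }

    // plugin.bundles is a comma-separated list; blank entries are ignored.
    static List<String> pluginBundles(Properties config) {
        List<String> bundles = new ArrayList<>();
        for (String part : config.getProperty("plugin.bundles", "").split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                bundles.add(trimmed);
            }
        }
        return bundles;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("plugin.dir", "/opt/presto/plugin");
        config.setProperty("plugin.bundles", "first-entry, second-entry");
        System.out.println(pluginDir(config));
        System.out.println(pluginBundles(config));
    }
}
```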

Finally, the installPlugin() method registers everything defined in the plug-in with Presto, so that Presto gains these functions and its capabilities are extended. By installing plug-ins, users can customize and configure Presto according to their own needs to meet data processing requirements in different scenarios.
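Conceptually, installing a plug-in means walking through the getters of the Plugin interface and handing each returned item to the component that manages it. The following sketch shows the idea for two extension points. InstallPluginSketch and MiniPlugin are hypothetical simplifications, and rejecting a connector name that is already registered is a design choice of this sketch.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class InstallPluginSketch {
    // Hypothetical, reduced stand-in for the Plugin interface.
    interface MiniPlugin {
        default List<String> getConnectorFactoryNames() {
            return Collections.emptyList();
        }

        default Set<Class<?>> getFunctions() {
            return Collections.emptySet();
        }
    }

    // One registry per extension point.
    private final Map<String, MiniPlugin> connectorFactories = new LinkedHashMap<>();
    private final Set<Class<?>> functions = new LinkedHashSet<>();

    // Registers everything the plug-in provides with the matching registry.
    void installPlugin(MiniPlugin plugin) {
        for (String name : plugin.getConnectorFactoryNames()) {
            if (connectorFactories.putIfAbsent(name, plugin) != null) {
                throw new IllegalArgumentException("Connector already registered: " + name);
            }
        }
        functions.addAll(plugin.getFunctions());
    }

    Set<String> connectorNames() {
        return connectorFactories.keySet();
    }

    Set<Class<?>> functionClasses() {
        return functions;
    }

    public static void main(String[] args) {
        InstallPluginSketch registry = new InstallPluginSketch();
        registry.installPlugin(new MiniPlugin() {
            @Override
            public List<String> getConnectorFactoryNames() {
                return Collections.singletonList("example");
            }
        });
        System.out.println(registry.connectorNames());
    }
}
```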

5. Plug-in application

Due to limited space, only a function extension is analyzed here.
For example, suppose we want a machine learning function in SQL that uses a support vector machine (SVM) to train a classification model. We only need to implement the Plugin interface and return our function classes from getFunctions().

The machine learning function itself is then written in the same way as other user-defined functions (UDFs).
The LearnClassifierAggregation class defines three methods, input(), combine() and output(), which correspond to the three stages of an aggregate function: input (accepting the input rows), combine (merging multiple partial aggregation states into one) and output (generating the final result).
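To make the three stages concrete, here is a minimal sketch of an aggregation with the same input/combine/output shape, written in plain Java. It computes an average instead of training a classifier, and it does not use Presto's aggregation annotations or block builders; AverageAggregationSketch and State are hypothetical names for illustration.

```java
final class AverageAggregationSketch {
    // Intermediate state kept per group while rows are processed.
    static final class State {
        long count;
        double sum;
    }

    // Input stage: fold one input row into the state.
    static void input(State state, double value) {
        state.count++;
        state.sum += value;
    }

    // Combine stage: merge a partial state (for example from another worker) into this one.
    static void combine(State state, State other) {
        state.count += other.count;
        state.sum += other.sum;
    }

    // Output stage: produce the final result, or null when no rows were seen.
    static Double output(State state) {
        return state.count == 0 ? null : state.sum / state.count;
    }

    public static void main(String[] args) {
        State workerOne = new State();
        input(workerOne, 1.0);
        input(workerOne, 2.0);

        State workerTwo = new State();
        input(workerTwo, 3.0);

        combine(workerOne, workerTwo);
        System.out.println(output(workerOne)); // prints 2.0
    }
}
```

Splitting the work this way is what lets an aggregate function run in parallel: each worker calls the input stage on its own rows, and the partial states are merged before the output stage runs once.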

6. Summary

Presto's plug-in mechanism gives it a high degree of flexibility and extensibility. Through the Plugin interface, users can extend and customize data sources, function libraries, authentication and authorization, resource management and more. This gives users a wide range of customization options and allows Presto to adapt to diverse application scenarios.

As a distributed SQL query engine for big data, Presto can query disparate data sources, perform distributed in-memory computation, and offers flexible execution and monitoring capabilities. These characteristics give Presto an important role in distributed big data environments.

To sum up, Presto's plug-in mechanism provides the flexibility and extensibility that make it a powerful and widely applicable query engine. As big data analysis and processing continue to evolve, Presto is expected to keep playing an important role.