How to Implement a UDF for a Database: The Design and Thinking Behind NebulaGraph's UDF Feature

Hello everyone, I am Zhao Junnan from BOSS Zhipin, where I mainly work on security-related graph storage. As a loyal user from v1.x through v3.x, I have witnessed NebulaGraph's development and grown together with it.

BOSS Zhipin and NebulaGraph

For the application scenarios of NebulaGraph at BOSS Zhipin, see the earlier article by Wenzhou ("Application of the graph database NebulaGraph at BOSS Zhipin"). Since then, the behavioral graph Wenzhou built has grown into the main business graph for security scenarios, alongside algorithm-inference graphs, job-similarity graphs, and other services. It now also supports data lineage for the data warehouse team and real-time search-and-recall scenarios for the search team, and a single graph has reached a scale of hundreds of billions.

On the graph computing side, BOSS Zhipin applies graph technology widely in security production environments, including single-relation clustering, multi-relation clustering, and basic offline features based on LPA and Louvain. I believe graph technology will have an even broader stage at BOSS Zhipin in the future.

The emergence of UDF

With NebulaGraph widely applied across BOSS Zhipin's business, the demands on internal engineers keep rising. Staying at the usage level alone cannot satisfy many requirements, from functionality to performance, so reading the source code became inevitable.

Then, while migrating from Neo4j to NebulaGraph, I found that the business relied on Neo4j's UDF packages. That is where the idea of implementing UDF support in NebulaGraph came from.

UDF design and implementation principles

The figure above shows the execution of a complete nGQL statement. The UDF implementation hooks into this execution flow, which is roughly as follows:

graphd receives the statement -> Flex lexical analysis (tokenization) -> Bison syntax parsing builds a Sentence -> Validator validates it and generates an AstContext (abstract syntax tree) -> toPlan generates the execution plan (Planner) -> the Optimizer optimizes the plan -> the Executor executes it.

During lexical and syntax parsing, a Function is parsed separately. FunctionManager, the manager of the native built-in functions, is responsible for defining, loading, and calling functions, and thus manages their entire life cycle. The function that a calling statement resolves through FunctionManager is ultimately invoked and executed by the executor.

NebulaGraph's UDF support reuses this function-call execution flow and adds a FunctionUdfManager:

static std::unordered_map<std::string, Value::Type> udfFunReturnType_;
static std::unordered_map<std::string, std::vector<std::vector<nebula::Value::Type>>>
    udfFunInputType_;
std::unordered_map<std::string, FunctionManager::FunctionAttributes> udfFunctions_;

class FunctionUdfManager {
 public:
  typedef GraphFunction *(create_f)();
  typedef void(destroy_f)(GraphFunction *);

  static StatusOr<Value::Type> getUdfReturnType(const std::string functionName,
                                                const std::vector<Value::Type> &argsType);

  static StatusOr<const FunctionManager::FunctionAttributes> loadUdfFunction(
      std::string functionName, size_t arity);

  static FunctionUdfManager &instance();

  FunctionUdfManager();

 private:
  static create_f *getGraphFunctionClass(void *func_handle);
  static destroy_f *deleteGraphFunctionClass(void *func_handle);

  void addSoUdfFunction(char *funName, const char *soPath, size_t i, size_t i1, bool b);
  void initAndLoadSoFunction();
};

It mainly does the following things:

  1. It is initialized together with FunctionManager; initAndLoadSoFunction starts a periodic scan of the files under the --udf_path directory;
  2. loadUdfFunction loads each .so file, instantiates the function object, and stores it in a map keyed by the function name;
  3. With UDF enabled, when FunctionManager fails to find a function, it is looked up in the FunctionUdfManager map and called from there.

The implementation is fairly simple, arguably a bit of a hack. If needed, UDAF could be implemented in a similar way.

How to use UDF

Now for how to actually use NebulaGraph UDFs. If you are on NebulaGraph v3.5.0 or later, you can use UDFs as described below. Versions v3.4.x and below do not support UDFs yet, but you can also cherry-pick this PR and compile it yourself.

The first step is to enable UDFs in the graphd configuration file and specify the directory for the packages:

# enable udf, c++ only
--enable_udf=true
# set the directory where the .so of udf are stored
--udf_path=/home/foobar/dev/nebula/udf/

The second step is to write your custom function code, inheriting from GraphFunction. The structure of GraphFunction is as follows:

class GraphFunction;

extern "C" GraphFunction *create();
extern "C" void destroy(GraphFunction *function);

class GraphFunction {
 public:
  virtual ~GraphFunction() = default;

  virtual char *name() = 0;

  virtual std::vector<std::vector<nebula::Value::Type>> inputType() = 0;

  virtual nebula::Value::Type returnType() = 0;

  virtual size_t minArity() = 0;

  virtual size_t maxArity() = 0;

  virtual bool isPure() = 0;

  virtual nebula::Value body(
      const std::vector<std::reference_wrapper<const nebula::Value>> &args) = 0;
};

  • create and destroy: the factory functions for creating and destroying the function object;
  • name: the function name used when calling it;
  • inputType, returnType: the input and output types;
  • minArity, maxArity: the minimum and maximum number of arguments;
  • isPure: whether the function is stateless;
  • body: the implementation of the function.

The third step is to package the function into a .so file and put it in the directory configured by --udf_path. The graphd service scans the packages in this path periodically (every 5 minutes) and loads them into the function library. After that, you can call the function in your own statements.

Note: since graphd only scans function packages in its local path, if you want the UDF to take effect on multiple graphd instances, each of them must have the package in its local path.

Here I would like to mention Xia Siwei and thank him for the complete usage documentation and build environment: https://github.com/vesoft-inc/nebula/pull/4804.

Unresolved issues with UDF

Although UDF is usable today, it still has room for optimization. For example:

  1. The .so packages can only be scanned from a local path;
  2. Functions run only in the graphd layer and cannot be pushed down to storaged;
  3. It is cumbersome to use and requires the user to write code.

Of course, these problems are closely tied to the initial design: when we started developing UDF, we actually wanted to support both C++ .so packages and Java .jar packages. However, when we tested the performance of calling Java from C++ via JNI, we found it basically unusable for large-scale production.

The figure below shows the performance test from that time:

Because the performance of that implementation was genuinely worrying, the initial design was abandoned.

Of course, there are also some plans for the future, which I mainly hope to complete together with the NebulaGraph development team:

  1. A single large or deep query can easily fill up storaged's memory and affect the performance of the whole cluster. Could such queries be killed automatically, via query timeouts or memory monitoring, to release the memory? In practice, statements like these rarely return results anyway, so killing them mostly just reduces their impact.
  2. Cluster fault tolerance. Even with multiple replicas, a node going offline abnormally can affect the entire cluster, and because the environment is complex, pinpointing the cause is difficult. We hope the robustness of the cluster can be strengthened as much as possible.

Unexpected gains from developing UDF

As mentioned above, UDF is really a by-product of reading the NebulaGraph source code, so let me share how that felt: my most direct impression of the NebulaGraph codebase is its clear layering and structure, and its elegant code. Combined with the core-explanation series on the official blog, the barrier for cross-language learners like me is greatly lowered.

I hope UDF can help you solve some problems, and that my sharing brings you some inspiration.

Thank you for reading this article (///▽///)

If you want to try out the graph database NebulaGraph, remember to download it from GitHub and (з)-☆ star it -> GitHub; to exchange graph database technology and application tips with other NebulaGraph users, leave your business card in the community and let's play together~

The 2023 NebulaGraph technology community annual essay-writing event is underway. Join in for a chance to win gifts such as a Huawei Mate 60 Pro, a Switch game console, or a Xiaomi robot vacuum. Event link: https://discuss.nebula-graph.com.cn/t/topic/13970