Custom Graph Component 1.1: JiebaTokenizer Implementation Details

The JiebaTokenizer class inherits from the Tokenizer class, Tokenizer inherits from the GraphComponent class, and GraphComponent inherits from ABC (Python's abstract base class). Using the example from “Using ResponseSelector to Implement a Campus Recruitment FAQ Bot” [2], this article walks through the implementation of each method of the JiebaTokenizer class in detail.

0. List of methods in the JiebaTokenizer class
Below is an overview of the methods and attributes of the JiebaTokenizer class, including those inherited from its parent classes.

With default parameters, the methods of the JiebaTokenizer class execute in the following order:

JiebaTokenizer.supported_languages()
JiebaTokenizer.required_packages()
JiebaTokenizer.get_default_config()
JiebaTokenizer.create()
JiebaTokenizer.__init__()
JiebaTokenizer.train()
JiebaTokenizer.persist()
JiebaTokenizer.load()
JiebaTokenizer.tokenize()

With default parameters, the following two methods are never executed, because both run only when dictionary_path is set (it is None by default):

_load_custom_dictionary(*) method
_copy_files_dir_to_dir(*) method

The natural next question is: how do you customize the parameters of JiebaTokenizer in config.yml? The get_default_config() method lists all the available parameters, as shown below:

def get_default_config() -> Dict[Text, Any]:
    return {
        # default don't load custom dictionary
        "dictionary_path": None,
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens
        "token_pattern": None,
        # Symbol on which prefix should be split
        "prefix_separator_symbol": None,
    }
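
For example, overriding these defaults in config.yml might look like the following sketch (the dictionary path ./data/jieba_userdict is a hypothetical example):

language: zh

pipeline:
  - name: JiebaTokenizer
    # hypothetical directory containing jieba user dictionary files
    dictionary_path: ./data/jieba_userdict
    intent_tokenization_flag: false
    intent_split_symbol: "_"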

1. supported_languages(*) method
Analysis: The languages the component supports, namely ["zh"], as shown below:

@staticmethod
def supported_languages() -> Optional[List[Text]]:
    """Supported languages (see parent class for full docstring).""" # Supported languages (see parent class for full docstring).
    print("JiebaTokenizer.supported_languages()")

    return ["zh"]

2. get_default_config(*) method
Analysis: Returns the default configuration, as shown below:

@staticmethod
def get_default_config() -> Dict[Text, Any]:
    """Returns default config (see parent class for full docstring).""" # Returns the default configuration (see parent class for full docstring).
    print("JiebaTokenizer.get_default_config()")

    return {
        # default don't load custom dictionary # Do not load custom dictionary by default
        "dictionary_path": None,
        # Flag to check whether to split intents # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens # Regular expression for detecting tokens
        "token_pattern": None,
        # Symbol on which prefix should be split # Symbol on which prefix should be split
        "prefix_separator_symbol": None,
    }

3. __init__(*) method
Analysis: When the create() method executes cls(config, model_storage, resource), __init__() is what actually gets called. As shown below:

def __init__(
    self, config: Dict[Text, Any], model_storage: ModelStorage, resource: Resource
) -> None:
    """Initialize the tokenizer.""" # Initialize the tokenizer.
    print("JiebaTokenizer.__init__()")

    super().__init__(config)
    self._model_storage = model_storage
    self._resource = resource

4. create(*) method
Analysis: Creates a new component, as shown below:

@classmethod
def create(
    cls,
    config: Dict[Text, Any],
    model_storage: ModelStorage,
    resource: Resource,
    execution_context: ExecutionContext,
) -> JiebaTokenizer:
    """Creates a new component (see parent class for full docstring).""" # Creates a new component (see parent class for full docstring).
    print("JiebaTokenizer.create()")

    # Path to the dictionaries on the local filesystem.
    dictionary_path = config["dictionary_path"]

    if dictionary_path is not None:
        cls._load_custom_dictionary(dictionary_path)
    return cls(config, model_storage, resource)

(1) config: Dict[Text, Any]

{
    'dictionary_path': None,
    'intent_split_symbol': '_',
    'intent_tokenization_flag': False,
    'prefix_separator_symbol': None,
    'token_pattern': None
}

(2) model_storage: ModelStorage

(3) resource: Resource

{
    name = 'train_JiebaTokenizer0',
    output_fingerprint = '318d7f231c4544dc9828e1a9d7dd1851'
}

(4) execution_context: ExecutionContext

Here, cls(config, model_storage, resource) constructs the new instance, which in turn calls __init__().
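
This is standard Python behavior: inside a classmethod, calling cls(...) constructs an instance and thereby invokes __init__(). A minimal standalone sketch (the Demo class is purely illustrative):

class Demo:
    def __init__(self, value: int) -> None:
        print("Demo.__init__()")
        self.value = value

    @classmethod
    def create(cls, value: int) -> "Demo":
        # cls(value) constructs a new instance, which invokes __init__()
        return cls(value)

demo = Demo.create(42)  # prints "Demo.__init__()"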

5. required_packages(*) method
Analysis: Any extra Python dependencies required for this component to run, namely ["jieba"]. As shown below:

@staticmethod
def required_packages() -> List[Text]:
    """Any extra python dependencies required for this component to run.""" # Any extra python dependencies required for this component to run.
    print("JiebaTokenizer.required_packages()")

    return ["jieba"]

6. _load_custom_dictionary(*) method
Analysis: Loads all the custom dictionaries stored under the given path, as shown below:

@staticmethod
def _load_custom_dictionary(path: Text) -> None:
    """Load all the custom dictionaries stored in the path. # Load all custom dictionaries stored in the path.
    More information about the dictionaries file format can be found in the documentation of jieba. https://github.com/fxsjy/jieba#load-dictionary
    """
    print("JiebaTokenizer._load_custom_dictionary()")
    import jieba

    jieba_userdicts = glob.glob(f"{path}/*") # Get all files under the path.
    for jieba_userdict in jieba_userdicts: # Traverse all files.
        logger.info(f"Loading Jieba User Dictionary at {jieba_userdict}") # Loading Jieba User Dictionary.
        jieba.load_userdict(jieba_userdict) # Load user dictionary.
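
For reference, a jieba user dictionary is a plain-text file with one entry per line in the format word [frequency] [POS tag], where frequency and tag are optional (see the jieba README linked above). A minimal sketch that writes and loads such a file (my_userdict.txt is a hypothetical file name):

import jieba

# Write a minimal user dictionary: "word [freq] [POS tag]" per line.
with open("my_userdict.txt", "w", encoding="utf-8") as f:
    f.write("云计算 5\n")    # word with a frequency
    f.write("创新办 3 i\n")  # word with a frequency and a POS tag

# jieba merges these entries into its dictionary for later tokenization.
jieba.load_userdict("my_userdict.txt")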

7. train(*) method
Analysis: Copies the custom dictionary to model storage, as shown below:

def train(self, training_data: TrainingData) -> Resource:
    """Copies the dictionary to the model storage."""
    print("JiebaTokenizer.train()")

    self.persist()  # persist the custom dictionaries to model storage
    return self._resource

The returned self._resource is the Resource object passed in through __init__(), i.e. the one shown in the create() section above (name = 'train_JiebaTokenizer0').

8. tokenize(*) method (key point)
Analysis: Tokenizes the text of the provided attribute of the incoming message, as shown below:

def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    """Tokenizes the text of the provided attribute of the incoming message."""
    print("JiebaTokenizer.tokenize()")

    import jieba

    text = message.get(attribute)  # get the text stored under the given attribute

    tokenized = jieba.tokenize(text)  # yields (word, start, end) tuples
    tokens = [Token(word, start) for (word, start, end) in tokenized]  # build Token objects

    return self._apply_token_pattern(tokens)

Here, message.data contains {'intent': 'goodbye', 'text': 'goodbye'}, so message.get(attribute) returns 'goodbye' for the text attribute.
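
As a reference for what jieba.tokenize() yields, here is a standalone sketch using the example sentence from jieba's README:

import jieba

# jieba.tokenize() yields (word, start, end) tuples in default mode.
for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
    print(word, start, end)

# Expected output:
# 永和 0 2
# 服装 2 4
# 饰品 4 6
# 有限公司 6 10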

9. load(*) method
Analysis: Loads the custom dictionary from model storage, as shown below:

@classmethod
def load(
    cls,
    config: Dict[Text, Any],
    model_storage: ModelStorage,
    resource: Resource,
    execution_context: ExecutionContext,
    **kwargs: Any,
) -> JiebaTokenizer:
    """Loads a custom dictionary from model storage.""" # Load a custom dictionary from model storage.
    print("JiebaTokenizer.load()")

    dictionary_path = config["dictionary_path"]

    # If a custom dictionary path is in the config we know that it should have been saved to the model storage.
    if dictionary_path is not None:
        try:
            with model_storage.read_from(resource) as resource_directory:
                cls._load_custom_dictionary(str(resource_directory))
        except ValueError:
            logger.debug(
                f"Failed to load {cls.__name__} from model storage. "
                f"Resource '{resource.name}' doesn't exist."
            )
    return cls(config, model_storage, resource)

10. _copy_files_dir_to_dir(*) method
Analysis: This helper is called by the persist(*) method, as shown below:

@staticmethod
def _copy_files_dir_to_dir(input_dir: Text, output_dir: Text) -> None:
    print("JiebaTokenizer._copy_files_dir_to_dir()")

    # make sure the target path exists
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # copy every file found in input_dir into output_dir
    target_file_list = glob.glob(f"{input_dir}/*")
    for target_file in target_file_list:
        shutil.copy2(target_file, output_dir)  # copy2 also preserves file metadata

11. persist(*) method
Analysis: Persists the custom dictionaries into model storage so that load() can read them back later, as shown below:

def persist(self) -> None:
    """Persist the custom dictionaries."""
    print("JiebaTokenizer.persist()")

    dictionary_path = self._config["dictionary_path"]
    if dictionary_path is not None:
        with self._model_storage.write_to(self._resource) as resource_directory:
            self._copy_files_dir_to_dir(dictionary_path, str(resource_directory))
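
To see what the copy step ultimately does, here is a standalone sketch of the same logic using temporary directories (illustrative only, outside of Rasa):

import glob
import os
import shutil
import tempfile

src = tempfile.mkdtemp()  # stands in for dictionary_path
dst = tempfile.mkdtemp()  # stands in for the model-storage resource directory

with open(os.path.join(src, "userdict.txt"), "w", encoding="utf-8") as f:
    f.write("云计算 5\n")

# Same logic as _copy_files_dir_to_dir(): copy every file from src to dst.
for target_file in glob.glob(f"{src}/*"):
    shutil.copy2(target_file, dst)

print(os.listdir(dst))  # ['userdict.txt']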

12. _model_storage attribute
Analysis: An instance attribute set in __init__(); persist() uses it to write the custom dictionaries into model storage. See the constructor for details.

13. _resource attribute
Analysis: An instance attribute set in __init__(); train() returns it after persisting, and it identifies the component's data in model storage. See the constructor for details.

References:
[1] https://github.com/RasaHQ/rasa/blob/main/rasa/nlu/tokenizers/jieba_tokenizer.py
[2] Using ResponseSelector to Implement a Campus Recruitment FAQ Bot: https://mp.weixin.qq.com/s/ZG3mBPvkAfaRcjmXq7zVLA