Evaluating the BERTopic topic model with OCTIS

Recently, while learning about topic models in natural language processing, I found that many online articles about topic models only cover data processing and model training, with no model evaluation part. By consulting the official documentation of OCTIS (a commonly used topic model evaluation library), I learned how to use it; here I use it to evaluate the topic coherence and topic diversity of a BERTopic model.
   **Installation package versions:**
      BERTopic: 0.14.1 (latest version as of March)
      OCTIS: 0.12.0 (latest version)

The official documentation is linked here (OCTIS).

By going through the code in the official documentation, we can learn the following:

1. Data processing

To use OCTIS, you first need to preprocess the training data. For this you can use the Preprocessing class that comes with OCTIS:

import os
import string
from octis.preprocessing.preprocessing import Preprocessing
os.chdir(os.path.pardir)

# Initialize preprocessing
preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='english',
                             min_chars=1, min_words_docs=0)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')

# save the preprocessed dataset
dataset.save('hello_dataset')

The Preprocessing class provides several data-processing operations, including stop-word removal and lemmatization.
Calling the preprocess_dataset method of the Preprocessing class processes the documents. labels_path points to a file holding the topic label of each document; it is optional, and you can pass only the document file.
This method initializes and returns a Dataset object, which contains the corpus, the vocabulary, the topic labels, and so on.
Dataset provides a save method, which saves the data held in the Dataset object to the given folder, in the following format:

corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated to the document (optional).
vocabulary: a .txt file where each line represents a word of the vocabulary

In other words, the folder must contain a tsv file named corpus.tsv. The file contains up to three columns: the document, the data partition (train, val, or test), and an optional topic label column. After checking the source code, you can in fact store only corpus.tsv: the vocabulary is generated from the corpus inside Dataset's load_custom_dataset_from_folder method, although it is not saved back into the folder.
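
As a quick sanity check, here is a minimal sketch of inspecting the Dataset object returned by the preprocessing step (get_corpus and get_vocabulary are Dataset methods; the printed counts depend entirely on your data):

corpus = dataset.get_corpus()          # list of tokenized documents
vocabulary = dataset.get_vocabulary()  # list of vocabulary words
print(len(corpus), "documents,", len(vocabulary), "vocabulary words")
print(corpus[0][:10])                  # first 10 tokens of the first document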

2. Loading the dataset

For the data, you can either use one of the preprocessed datasets that come with OCTIS:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

or, if you are using your own dataset, call load_custom_dataset_from_folder (pass the folder path, not a single file):

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

3. Combining 1 and 2

Combining 1 and 2 means you can preprocess the dataset yourself instead of using the method provided by OCTIS. You only need to create a folder and put a corpus.tsv file in it. The documents in the file should already be preprocessed, with the tokens separated by spaces.
An example is given on the official website.

Note: the columns in corpus.tsv must be ordered from left to right as document text, data-partition flag, and label column, with at most three columns. Columns are separated by tabs (\t). An example of such a tsv file is shown below.
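
As a hypothetical illustration, the following sketch writes a three-column corpus.tsv (the documents, partitions, and labels are made up; the separators are tab characters):

rows = [
    ("the cat sat on the mat", "train", "animals"),
    ("stock prices fell sharply today", "train", "finance"),
    ("the team won the final match", "test", "sports"),
]
with open("corpus.tsv", "w", encoding="utf-8") as f:
    for doc, partition, label in rows:
        f.write(f"{doc}\t{partition}\t{label}\n")
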
Below is my own code for converting a csv file to a tsv file and doing simple preprocessing:

    # a classmethod on a helper class; assumes `import pandas as pd` at module level
    @classmethod
    def preprocess_data_to_octis(cls, csv_file: str, csv_col: str, tsv_file: str):
        df = pd.read_csv(csv_file, encoding='utf8')  # read the csv file
        data = df[csv_col].tolist()                  # keep only the chosen text column
        stopwords = cls.get_stopwords()              # helper on the same class returning a stop-word set
        results = []
        for text in data:
            result = []
            text = text.lower().split(' ')
            for word in text:
                if word not in stopwords:
                    result.append(word)
            document = ' '.join(result)
            results.append(document)
        octis_data = pd.DataFrame(results, columns=["abstract"])
        # write a single-column, tab-separated corpus.tsv with no header or index
        octis_data.to_csv(tsv_file, index=False, sep='\t', header=False, encoding='utf-8')
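
A usage sketch (the class name TopicHelper and the input file abstracts.csv are hypothetical; the my_dataset folder must exist before writing into it):

from octis.dataset.dataset import Dataset

# convert the "abstract" column of a csv file into my_dataset/corpus.tsv
TopicHelper.preprocess_data_to_octis("abstracts.csv", "abstract", "my_dataset/corpus.tsv")

# load the folder (not the single file) as an OCTIS Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("my_dataset")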

4. Model evaluation

Here we go straight to model evaluation; the metrics are topic coherence and topic diversity.

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

npmi = Coherence(texts=self.data.get_corpus(), topk=10, measure="c_npmi")
topic_diversity = TopicDiversity(topk=10)

Coherence here uses NPMI (measure="c_npmi"); TopicDiversity comes from OCTIS's diversity metrics. The topk parameter is the number of words taken from each topic; the default of 10 is used here because BERTopic also returns 10 words per topic by default.
The texts argument of Coherence is the corpus from corpus.tsv in the dataset folder; self.data here refers to a Dataset object.
Call the score method of npmi and topic_diversity to get the corresponding model scores.

def score(self, model_output):
        """
        Retrieve the score of the metric
        Parameters
        ----------
        model_output : dictionary, output of the model
                       key 'topics' required.
        returns
        -------
        score : coherence score
        """
        topics = model_output["topics"]
        if topics is None:
            return -1
        if self.topk > len(topics[0]):
            raise Exception('Words in topics are less than topk')
        else:
            npmi = CoherenceModel(
                topics=topics,
                texts=self._texts,
                dictionary=self._dictionary,
                coherence=self.measure,
                processes=self.processes,
                topn=self.topk)
            return npmi.get_coherence()

Take the score method of the official Coherence metric as an example. The parameter model_output is the output of the model: a dictionary that must contain a 'topics' field holding a nested list, where each row is a topic and each column is a topic word. The dimensions of the list are determined by the number of topics found during model training and the specified number of words per topic (topk).
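
For example, with two topics and ten words per topic, a valid model_output could look like this (the topic words below are made up for illustration):

model_output = {
    "topics": [
        ["price", "market", "stock", "trade", "bank", "rate", "fund", "share", "profit", "loss"],
        ["game", "team", "player", "score", "match", "season", "coach", "league", "goal", "win"],
    ]
}
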
The code for obtaining the topics and topic words from a BERTopic model is provided below. For details, refer to the evaluation code by the author of BERTopic (BERTopic evaluation).

 def get_model_output(self):
        # flatten the preprocessed corpus into one list of words
        all_words = [word for words in self.data.get_corpus() for word in words]
        # take the top 10 words of every topic; words missing from the corpus
        # vocabulary are replaced with a placeholder word so the metrics do not fail
        bertopic_topics = [
            [
                vals[0] if vals[0] in all_words else all_words[0]
                for vals in self.model.get_topic(i)[:10]
            ]
            # the -1 excludes BERTopic's outlier topic (-1)
            for i in range(len(set(self.model.topics_)) - 1)
        ]

        output_tm = {"topics": bertopic_topics}
        return output_tm
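
For completeness, here is a minimal sketch of how the self.model referenced above could be trained, assuming the preprocessed corpus from the Dataset object is joined back into plain strings (variable names are illustrative):

from bertopic import BERTopic

# BERTopic expects raw document strings, while the OCTIS corpus is tokenized
docs = [" ".join(words) for words in dataset.get_corpus()]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# topic_model then plays the role of self.model in get_model_output above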

5. Collecting the evaluation results

score1 = npmi.score(get_model_output())
score2 = topic_diversity.score(get_model_output())

The two scores correspond to topic coherence and topic diversity respectively; they can be printed directly or converted into a JSON object for storage.
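
For example, a small sketch of storing the two scores (the file name scores.json is arbitrary):

import json

results = {"topic_coherence_npmi": score1, "topic_diversity": score2}
print(results)
with open("scores.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)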