Modeling analysis of the “Friends” sitcom data set based on LDA topic analysis

Table of Contents

1. Project background

2. Introduction to data sets

3. Technical tools

4. Experimental process

4.1 Import data

4.2 Data preprocessing

4.3 Word cloud visualization

4.4 Prepare data for LDA model training

4.5 Determine the number of topics K

4.6 LDA model training

4.7 Topic Modeling Visualization

5. Summary

1. Project background

“Friends” is a popular American sitcom created by David Crane and Marta Kauffman that aired from 1994 to 2004. The show follows the lives of six young people in New York, along with their friendships, romances and life experiences. Friends gained a huge audience around the world thanks to its witty, humorous dialogue and realistic portrayal of emotions.

LDA (Latent Dirichlet Allocation) topic analysis is a text mining and machine learning technique for discovering topic structure in large collections of text. The method assumes that each document is a mixture of several topics, and that each topic is characterized by a distribution over words. Application areas of LDA include information retrieval, social media analysis, news topic mining, and more.
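
As a minimal illustration of these two assumptions (documents as mixtures of topics, topics as distributions over words), the sketch below trains a tiny Gensim LDA model on three made-up “documents”; the toy corpus and the choice of two topics are invented purely for demonstration and are not part of the actual experiment.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus (invented for illustration only)
toy_docs = [
    ["coffee", "apartment", "friend", "coffee"],
    ["date", "love", "breakup", "friend"],
    ["coffee", "date", "apartment", "love"],
]
toy_dictionary = Dictionary(toy_docs)
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]

toy_lda = LdaModel(corpus=toy_corpus, id2word=toy_dictionary, num_topics=2, random_state=42, passes=10)

# Each topic is a distribution over words ...
for topic_id, words in toy_lda.print_topics():
    print(topic_id, words)

# ... and each document is a mixture of topics
for bow in toy_corpus:
    print(toy_lda.get_document_topics(bow))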

Researchers’ interest in conducting thematic analysis of “Friends” may stem from the following aspects:

  1. In-depth plot analysis: Through LDA theme analysis, the script of “Friends” can be broken down into different themes, revealing the depth and diversity of the plot. This helps to understand the development of storylines, the evolution of character relationships, and the humorous elements of sitcoms.

  2. Audience feedback analysis: By analyzing audience comments, discussions, and feedback on Friends, one can see which topics attract attention and discussion among viewers. This helps producers and writers better understand audience response and can guide the direction of subsequent episodes.

  3. Application of text mining technology: Applying text mining technology such as LDA to film and television script analysis can provide new methods for the creation and improvement of television dramas. This method can also be promoted and applied on other text data sets and expanded to the wider entertainment industry or text analysis field.

  4. Cultural research perspective: Through thematic analysis, researchers can deeply explore the cultural elements, social phenomena and characteristics of the times reflected in “Friends”. This helps to understand the play’s place in the culture and the impact it had on audiences.

In this study, LDA topic analysis was applied to the script of the “Friends” sitcom, with the aim of exploring the show's thematic structure and revealing its narrative content and textual characteristics. Such analysis is expected to provide useful insights for TV drama creation, the application of text mining techniques, and cultural research.

2. Introduction to data sets

This data set comes from Kaggle. “Friends” is an American sitcom created by David Crane and Marta Kauffman that aired on NBC from September 22, 1994 to May 6, 2004, running for a total of ten seasons. Starring Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc, Matthew Perry and David Schwimmer, the series revolves around six friends in their twenties and thirties living in Manhattan, New York City. The series was produced by Bright/Kauffman/Crane Productions in association with Warner Bros. Television, and the original executive producers were Kevin S. Bright, Kauffman and Crane. The original data set has a total of 67,373 rows and 6 feature variables. The meaning of each variable is as follows:

text: Conversation as text

speaker: speaker’s name

season: season number

episode: episode number

scene: scene number

utterance: utterance number within the scene
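
As a quick sanity check of this structure, the short sketch below (an optional addition, assuming the same friends.csv file loaded in Section 4.1) confirms the row count and shows how many lines of dialogue each speaker contributes.

import pandas as pd

friends_data = pd.read_csv('friends.csv')

# Confirm the reported 67,373 rows and 6 feature variables
print(friends_data.shape)
print(friends_data.dtypes)

# Number of dialogue lines per speaker (the six main characters should dominate)
print(friends_data['speaker'].value_counts().head(10))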

3. Technical tools

Python version: 3.9

Code editor: Jupyter Notebook

4. Experimental process

4.1 Import data

import pandas as pd
friends_data = pd.read_csv('friends.csv')
friends_data.head()

4.2 Data preprocessing

# Import libraries, classes and datasets
import re
import nltk

from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

To eliminate redundant words in conversations, I used the STOPWORDS dataset provided by the Gensim library, which contains a compilation of the 337 most commonly used stop words.

# Check the stop word list in gensim
print(STOPWORDS)
print(len(STOPWORDS))

In order to focus only on the dialogue text between characters, I first removed unnecessary columns.

# Delete unused columns
friends_data = friends_data.drop(columns=['scene','utterance'], axis=1)
friends_data.head()

Subsequently, I applied a preprocessing stage to remove punctuation and stop words. After this, I converted all words to lowercase and then tokenized, stemmed and lemmatized the text.

# Initialize Porter stemmer and WordNet lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Gensim stop word list (as described above)
gensim_stop_words = set(STOPWORDS)

# Remove punctuation and stop words, convert to lowercase, tokenize, stem and lemmatize
def preprocess_text(text):
    cleaned_text = re.sub(r'[^\w\s]', '', text)  # Remove all non-word, non-whitespace characters
    words = word_tokenize(cleaned_text.lower())
    filtered_words = [word for word in words if word not in gensim_stop_words and len(word) >= 4]  # Keep only words with at least 4 characters
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
    return lemmatized_words

#Apply preprocessing to the "text" column
friends_data['processed_text'] = friends_data['text'].apply(preprocess_text)
friends_data.head()

4.3 Word cloud visualization

After completing the text preprocessing, I created an initial visualization to explore the themes of conversations between Friends characters. I used the word cloud generated by the WordCloud library. This tool automatically scales the size of words based on their frequency, highlighting popular words while reducing the size of uncommon words.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Function to generate a word cloud for a specific speaker
def generate_wordcloud_for_speaker(speaker_name, text_data):
    
    # Combine all text data for a specific speaker and concatenate the tokens into a single string
    speaker_text = ' '.join(' '.join(tokens) for tokens in text_data[text_data['speaker'] == speaker_name]['processed_text'])
    
    #Create WordCloud object
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(speaker_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'WordCloud for Speaker: {speaker_name}')
    plt.show()

# A list of specific speakers to visualize
speakers_to_visualize = ["Monica Geller", "Joey Tribbiani", "Chandler Bing", "Phoebe Buffay", "Ross Geller", "Rachel Green"]

# Draw a word cloud for each character
for speaker_name in speakers_to_visualize:
    generate_wordcloud_for_speaker(speaker_name, friends_data)

The word cloud provides a visual representation of each character’s most commonly used words, with larger words indicating higher frequency and importance in their conversations. As we can see from the visualization, they mainly talk about each other and other random topics in their daily conversations, like most friend groups!
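
To double-check what the clouds suggest, the same information can be inspected numerically. The optional sketch below reuses the processed_text column and the speakers_to_visualize list defined above to print each character's ten most frequent tokens.

from collections import Counter

# Numeric counterpart to the word clouds: top tokens per character
for speaker_name in speakers_to_visualize:
    tokens = [
        token
        for row in friends_data.loc[friends_data['speaker'] == speaker_name, 'processed_text']
        for token in row
    ]
    print(speaker_name, Counter(tokens).most_common(10))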

4.4 Prepare data for LDA model training

In this step, I prepare the data needed for LDA model training, so that we can dig deeper into the typical topics the characters engage in.

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from pprint import pprint
import numpy as np

#Create dictionary and corpus
# Create a dictionary from tokenized text data
id2word = Dictionary(friends_data['processed_text'])
# Bag-of-words corpus: term frequencies per document
corpus = [id2word.doc2bow(tokens) for tokens in friends_data['processed_text']]
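
Before training, it can be useful to inspect the dictionary and the bag-of-words representation, and optionally to prune very rare and very frequent tokens. The sketch below is an optional check; the filter_extremes thresholds shown in the comments are illustrative assumptions rather than values used in the experiment.

# Inspect the vocabulary size and one bag-of-words document
print('Vocabulary size:', len(id2word))
print('Example document (token id, count):', corpus[0])
print('Readable form:', [(id2word[token_id], count) for token_id, count in corpus[0]])

# Optional pruning (illustrative thresholds): drop tokens appearing in fewer than 5 documents
# or in more than 50% of documents, then rebuild the corpus.
# id2word.filter_extremes(no_below=5, no_above=0.5)
# corpus = [id2word.doc2bow(tokens) for tokens in friends_data['processed_text']]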

4.5 Determine the number of topics K

Use grid search to find the optimal number of topics

In order to train the LDA model effectively, a predefined number of topics is required. To determine the optimal number of topics, I performed a grid search over a range (from 3 up to a maximum of 30 topics). The model produced at each iteration is evaluated with a coherence score. This approach aims to find the number of topics that yields the most meaningful results.

from gensim.models.coherencemodel import CoherenceModel
# Specify the range of topic numbers to search in
min_topics = 3
max_topics = 30
step_size = 5
topics_range = range(min_topics, max_topics + 1, step_size)

# Perform a grid search and calculate coherence scores for different numbers of topics
coherence_scores = []
for num_topics in topics_range:
    lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=friends_data['processed_text'], dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    coherence_scores.append(coherence_lda)

# Find the optimal number of topics with the highest coherence score
optimal_num_topics = topics_range[np.argmax(coherence_scores)]
print("Optimal Number of Topics:", optimal_num_topics)
print("Coherence Score for Optimal Number of Topics:", max(coherence_scores))

# Plot the relationship between the number of topics and the coherence score
plt.figure(figsize=(10, 6))
plt.plot(topics_range, coherence_scores, marker='o', color='b', label='Coherence Score')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.title('Number of Topics vs Coherence Score')
plt.xticks(topics_range)
plt.legend()
plt.grid(True)
plt.show()

print(coherence_scores)
print(optimal_num_topics)

4.6 LDA model training

After conducting the grid search, the optimal number of topics for our dataset was determined to be 3. This number of topics produced the highest coherence score, indicating the most coherent set of topics for these conversations. I then used this optimal value as the input parameter for training the LDA model, ensuring a focused and deep exploration of the underlying topics in the Friends conversations.

# Number of topics of LDA model
num_topics = 3
#Build LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)
# Print the topics
pprint(lda_model.print_topics())

Evaluating LDA models using coherence scores

In the previous step, I determined the optimal number of topics using the coherence score to evaluate each LDA model. Let's explore this score in more detail in this section.

The topic coherence metric evaluates a single topic by scoring the semantic similarity between its top-ranked words. The Gensim library provides a class that implements four well-known coherence measures: u_mass, c_v, c_uci and c_npmi.
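
For reference, different measures can be computed on the same trained model. The sketch below (an optional comparison, not part of the original run) contrasts u_mass, which only needs the corpus, with c_v, which needs the tokenized texts.

from gensim.models.coherencemodel import CoherenceModel

# u_mass needs only the corpus and dictionary; scores are typically negative, closer to 0 is better
umass_model = CoherenceModel(model=lda_model, corpus=corpus, dictionary=id2word, coherence='u_mass')
print('u_mass coherence:', umass_model.get_coherence())

# c_v needs the tokenized texts; scores range roughly from 0 to 1, higher is better
cv_model = CoherenceModel(model=lda_model, texts=friends_data['processed_text'],
                          dictionary=id2word, coherence='c_v')
print('c_v coherence:', cv_model.get_coherence())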

In the Friends dataset, the text consists of everyday conversations between the show's characters. Given this simple, direct language, a complex model is not required to assess topic coherence. For the LDA analysis, I chose the c_v coherence measure, which evaluates word co-occurrence and topic distinctiveness. Given the casual nature of the conversations, this provides an adequate and appropriate evaluation for our dataset.

# Calculate the coherence score
coherence_model = CoherenceModel(model=lda_model, texts=friends_data['processed_text'], dictionary=id2word, coherence='c_v')
coherence_score = coherence_model.get_coherence()
# print results
print(f'Coherence Score: {coherence_score}')

Considering that the c_v coherence score ranges from 0 to 1, a score of 0.567 indicates that the topics extracted from the “Friends” data set have reasonable interpretability and coherence.

4.7 Topic Modeling Visualization

Using the pyLDAvis interactive visualization library in Python, I generated interactive visualizations for the 3 topics previously discovered in the “Friends” dataset. These visualizations have been saved as HTML files. You can manually open the HTML file in a web browser to interact with the visualization.

import pyLDAvis
import pyLDAvis.gensim
import pickle
import os

# If the 'results' directory does not exist, create it
os.makedirs('results', exist_ok=True)

# Prepare LDA visualization data
LDAvis_data_filepath = './results/ldavis_prepared_' + str(num_topics) + '.pkl'

# If the prepared data does not exist yet, create and cache it; otherwise load the cached version
if not os.path.exists(LDAvis_data_filepath):
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
else:
    with open(LDAvis_data_filepath, 'rb') as f:
        LDAvis_prepared = pickle.load(f)

# Save interactive visualization as HTML file
pyLDAvis.save_html(LDAvis_prepared, './results/ldavis_prepared_' + str(num_topics) + '.html')

# Print a message indicating where the HTML file is saved
print(f"Interactive LDA visualization saved as 'ldavis_prepared_{num_topics}.html'")

5. Summary

Through LDA topic analysis of the “Friends” sitcom script, we gained profound insights that help to more fully understand this classic TV series. The following is the main summary of the experiment:

  1. Discovery of thematic structure: LDA analysis revealed the presence of multiple themes in the “Friends” script, each composed of a group of related words. This helps to understand the variety and complexity of the plot and how skillfully the writers bring different elements together.

  2. Interpretation of plot and character relationships: Through thematic analysis, we can gain an in-depth understanding of the connection between different themes, plot development, and character relationships. This interpretation helps reveal the depth of the story, understanding the interactions between characters and how themes evolve as the plot progresses.

  3. Identification of audience focus: By analyzing keywords related to different topics in audience comments and feedback, we can identify what viewers focus on during the episode. This is valuable information for producers and writers to guide the production and improvement of subsequent episodes.

  4. Application of text mining technology: This experiment demonstrates how to successfully apply LDA topic analysis technology to TV drama scripts, providing a demonstration for the application of text mining technology in the film and television industry. This method is not only applicable to Friends, but can also be generalized to other TV series or text datasets.

  5. Cultural Studies and Social Reflection: Thematic analysis not only helps to understand the script itself, but also reveals how the drama reflects the cultural and social phenomena of the time. This is an interesting way for cultural researchers and sociologists to dig deeper into the impact and meaning of TV series at a particular time.

Overall, through LDA theme analysis, the script of “Friends” is no longer just a series of dialogues and plots, but a rich thematic network containing deep stories, emotions and cultural elements. This analysis method provides new perspectives and methods for the creation of TV series, audience interaction and the application of text mining technology.
