Information retrieval and data mining | [Experiment] Ranked retrieval model

Article directory

  • Experimental content
  • Related concepts
  • Experimental steps
    • Word segmentation preprocessing
    • Build an inverted index table
    • Calculate the similarity between query and each document
    • queries preprocessing and retrieval functions
      • Perform lexical analysis and standardization on the input text
      • Retrieval function
    • Debugging results

Experimental content

  1. Implement the most basic Ranked retrieval model based on Experiment 1
    • Input: a query (like Ron Weasley birthday)
    • Output: Return the top K (e.g., K = 100) relevant tweets.
  2. Use SMART notation: lnc.ltn (the corresponding weight formulas are written out after this list)
    • Document: logarithmic tf (l as first character), no idf and cosine normalization
    • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
  3. Improve Inverted index
    • Store the DF of each term in Dictionary
    • Store the TF of terms in each doc in the posting list with pairs (docID, tf)
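
Spelled out, the lnc.ltn scheme above corresponds to the following weights (a restatement of the SMART notation, using natural logarithms as math.log does in the code later in this post):

    w_{t,d} = \frac{1 + \log(tf_{t,d})}{\sqrt{\sum_{t' \in d} (1 + \log(tf_{t',d}))^2}} \quad \text{(document: logarithmic tf, no idf, cosine normalization)}

    w_{t,q} = (1 + \log(tf_{t,q})) \cdot \log\frac{N}{df_t} \quad \text{(query: logarithmic tf, idf, no normalization)}

    score(q, d) = \sum_{t \in q \cap d} w_{t,q} \cdot w_{t,d}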

Related concepts

  • Information retrieval and data mining | (5) Document scoring, term weight calculation and vector space model
  • Term frequency (tf_{t,d}): the number of times term t appears in document d.
  • Collection frequency: the total number of occurrences of a term in the whole document collection.
  • Document frequency (df_t): the number of documents in which term t appears.
  • Inverse document frequency: idf_t = \log(N / df_t), where N is the total number of documents.
  • tf-idf_{t,d} calculation: tf-idf_{t,d} = tf_{t,d} \times idf_t = tf_{t,d} \times \log(N / df_t) (a small numeric example follows this list)

  • Similarity calculation: the score of a document is the sum, over terms it shares with the query, of the document term weight times the query term weight (an inner product of the two weight vectors).
  • Query weighting mechanism: SMART notation lnc.ltn, i.e., documents are weighted with lnc and queries with ltn.
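
A quick numeric illustration of these quantities (hypothetical numbers; natural logarithm, matching math.log in the code):

    N = 10000,\ df_t = 100 \;\Rightarrow\; idf_t = \log(10000 / 100) = \log(100) \approx 4.61

    tf_{t,d} = 3 \;\Rightarrow\; tf\text{-}idf_{t,d} = 3 \times 4.61 \approx 13.82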

Experimental steps

Word segmentation preprocessing

  1. Convert the input tweet document to lowercase; this is done uniformly here so that subsequent queries are case-insensitive.

  2. Locate the key fields of the tweet record by their tags and extract the tweetid and the tweet text.

  3. Tokenize the extracted text and convert each token to its singular form.

  4. Lemmatize the tokenized words (mainly verb lemmatization), and filter out the field names ["text", "tweetid"].

  5. Add the remaining valid words to the result list and return it (a usage sketch follows the code below).

    #Word segmentation preprocessing
    def tokenize_tweet(document):
        # Unified processing makes the query case-insensitive
        document = document.lower()
        # Find and determine the location index of key parts of information in the Twitter document based on specific tags
        # The -1 and -3 offsets adjust for the surrounding quotation marks and commas
        a = document.index("tweetid") - 1
        b = document.index("errorcode") - 1
        c = document.index("text") - 1
        d = document.index("timestr") - 3
        # Extract the main information of tweetid and text content in the Twitter document
        document = document[a:b] + document[c:d]
        # Word segmentation processing and converting words into their singular form
        terms = TextBlob(document).words.singularize()
        # Lemmatize each token and keep only words that are not in the useless-term list
        result = []
        for word in terms:
            # Convert the current word to a Word object
            expected_str = Word(word)
            # Verb lemmatization
            expected_str = expected_str.lemmatize("v")
            if expected_str not in uselessTerm:
                # Filter out ["text", "tweetid"] and add to result
                result.append(expected_str)
        return result
    
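For concreteness, here is a hedged usage sketch of tokenize_tweet. The exact line format of Tweets.txt is not shown in this post, so the sample line below is only a guess at that format, and the printed result should be read as approximate.

    # Hypothetical input line; the real Tweets.txt format may differ in its details
    sample = ('"tweetid": "28965792812", "errorcode": "200", '
              '"text": "Ron Weasley celebrated a birthday", "timestr": "2011-01-23"')
    print(tokenize_tweet(sample))
    # Roughly: ['28965792812', 'ron', 'weasley', 'celebrate', 'a', 'birthday']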

Build an inverted index table

  • Store the TF of terms in each doc in the posting list with pairs (docID, tf).
  1. First, note that document term weights are computed with the lnc rule: logarithmic tf (l as first character), no idf, and cosine normalization. (A worked toy example follows the code snippets below.)
  2. The specific process is as follows:
    • Read the file content; each line represents one tweet. Tokenize and lemmatize each line into a word list line.
    • Use a global variable N to record the number of tweet documents read.
    • Extract the tweetid from line and then remove it from line.
    • Create an empty dictionary tf to count the occurrences of each word in the current document: traverse each word in line and update its count depending on whether it is already a key of tf.
    • Apply logarithmic tf to each count in the tf dictionary, i.e., take the logarithm of the count and add 1 (corresponds to logarithmic tf, l as first character).
    • Apply cosine normalization: traverse the keys (i.e., terms) of tf to compute the normalization factor, then traverse them again and multiply each term's weight by that factor to obtain the final tf weight.
    • Convert line to a set unique_terms and iterate over each word in it:
      • If the word already exists as a key of the postings dictionary, update its entry by adding the (tweetid, weight) pair.
      • If the word does not exist as a key of postings, create the entry and add the (tweetid, weight) pair.
  • Count term frequency
     # Count term frequency and record the number of occurrences of each word in the current document
     tf = {}
     for word in line:
         if word in tf.keys():
             tf[word] += 1
         else:
             tf[word] = 1
    
  • Logarithmic tf: 1 + \log(tf_{t,d})

     # logarithmic tf
     for word in tf.keys():
         tf[word] = 1 + math.log(tf[word])
    
  • Cosine normalization factor: \frac{1}{\sqrt{{w_1}^2 + {w_2}^2 + … + {w_m}^2}}

     # Normalization, cosine normalization
     cosine = 0
     for word in tf.keys():
         cosine = cosine + tf[word] * tf[word]
     cosine = 1.0 / math.sqrt(cosine)
     for word in tf.keys():
         tf[word] = tf[word] * cosine
    
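To make the lnc weighting concrete, here is a minimal, self-contained sketch (not part of the experiment code) of what ends up stored in postings for a toy document with a hypothetical docID "d1":

    import math

    # Toy document "d1" after preprocessing (hypothetical)
    line = ["ron", "ron", "weasley"]

    tf = {}
    for word in line:                          # raw counts: {'ron': 2, 'weasley': 1}
        tf[word] = tf.get(word, 0) + 1
    for word in tf:                            # logarithmic tf: {'ron': 1.693, 'weasley': 1.0}
        tf[word] = 1 + math.log(tf[word])
    cosine = 1.0 / math.sqrt(sum(w * w for w in tf.values()))
    for word in tf:                            # cosine normalization: {'ron': 0.861, 'weasley': 0.509}
        tf[word] = tf[word] * cosine

    print(tf)  # these values would be stored as postings[word]["d1"]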

Calculate the similarity between query and each document

  1. The process has two steps: first compute the weight of each query term, then compute the similarity (multiply the document weight and the query weight of each shared term and sum the products) and sort in descending order. Query weights use the ltn rule: logarithmic tf (l in leftmost column), idf (t in second column), no normalization. (A small numeric illustration follows the code below.)
  2. The specific process is as follows:
    • Traverse the query word list query, count the term frequency of each word, and store the results in tf.
    • Traverse the keys of tf (i.e., the query words) and compute each word's document frequency df as the length of its posting list in postings. If a word is not in postings, set df to the global variable N (the total number of documents), so its idf becomes 0.
    • Compute the weight tf[word] = (math.log(tf[word]) + 1) * math.log(N / df), corresponding to ltn (logarithmic tf, idf, no normalization).
    • For each query term that exists in postings, traverse its posting list (document ID and term weight) and accumulate each document's similarity score as the product of the document term weight and the query term weight.
    • Store the scores, sort them in descending order, and return the ranked list as the result.
    def similarity(query):
        global score_tid
        # Reset scores left over from any previous query
        score_tid.clear()
        tf = {}
        # Count term frequencies
        for word in query:
            if word in tf:
                tf[word] += 1
            else:
                tf[word] = 1
        # Count document frequency
        for word in tf.keys():
            if word in postings:
                df = len(postings[word])
            else:
                df = N
            # Corresponds to ltn: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
            tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
        # Calculate similarity
        for word in query:
            if word in postings:
                for tid in postings[word]:
                    if tid in score_tid.keys():
                        score_tid[tid] += postings[word][tid] * tf[word]
                    else:
                        score_tid[tid] = postings[word][tid] * tf[word]
        # Sort in descending order according to score (similarity)
        similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
        return similarity
    
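As a small numeric illustration tying this to the toy lnc example above (hypothetical values: N = 10000, df_{ron} = 500, df_{weasley} = 200, query "ron weasley" with each term occurring once):

    w_{ron,q} = (1 + \log 1) \cdot \log(10000 / 500) = \log(20) \approx 3.00

    w_{weasley,q} = (1 + \log 1) \cdot \log(10000 / 200) = \log(50) \approx 3.91

    score(q, d1) \approx 0.861 \times 3.00 + 0.509 \times 3.91 \approx 4.57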

queries preprocessing and retrieval functions

Perform lexical analysis and standardization on the input text

def token(doc):
    # Convert the input text to lowercase letters for uniform processing.
    doc = doc.lower()
    # Split the text into individual terms and try to convert the terms to singular form
    terms = TextBlob(doc).words.singularize()
    # Restore the word form after word segmentation and return the result list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

Retrieval function

def Union(sets):
    return reduce(set.union, [s for s in sets])

def do_search():
    query = token(input("please input search query >> "))
    result = []
    if query == []:
        sys.exit()
    # set() removes duplicates in the query word list
    unique_query = set(query)
    # Generate the set of tweet ids for each query term and union them with Union(); postings.get avoids inserting empty entries for unseen terms
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("There are total" + str(len(relevant_tweetids)) + "Relevant tweets!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        i=1
        for (id, score) in scores:
            if i <= 100: # Return the first n queried information
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")

Debugging results

Final code

import sys
from collections import defaultdict
from textblob import TextBlob
from textblob import Word
import math
from functools import reduce

uselessTerm = ["text", "tweetid"]
# Build an inverted index table and store the TF of terms in each doc with pairs (docID, tf)
postings = defaultdict(dict)
#Number of documents
N = 0
# Final weight
score_tid = defaultdict(dict)

#Word segmentation preprocessing
def tokenize_tweet(document):
    # Unified processing makes the query case-insensitive
    document = document.lower()
    # Find and determine the location index of key parts of information in the Twitter document based on specific tags
    # The -1 and -3 offsets adjust for the surrounding quotation marks and commas
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Extract the main information of tweetid and text content in the Twitter document
    document = document[a:b] + document[c:d]
    # Word segmentation processing and converting words into their singular form
    terms = TextBlob(document).words.singularize()
    # Lemmatize each token and keep only words that are not in the useless-term list
    result = []
    for word in terms:
        # Convert the current word to a Word object
        expected_str = Word(word)
        # Verb lemmatization
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Filter out ["text", "tweetid"] and add to result
            result.append(expected_str)
    return result

# Build an inverted index table and store the TF of terms in each doc with pairs (docID, tf)
# lnc: logarithmic tf, no idf and cosine normalization
def get_postings():
    global postings, N
    content = open(r"Tweets.txt")
    # Content reading, each tweet is stored as an element in lines
    lines = content.readlines()
    for line in lines:
        N += 1
        # Preprocessing
        line = tokenize_tweet(line)
        # Extract the first element in the processed word list, which is the tweetid of the Twitter document
        tweetid = line[0]
        # Delete after extraction, not as a valid word
        line.pop(0)

        # Count term frequency and record the number of occurrences of each word in the current document
        tf = {}
        for word in line:
            if word in tf.keys():
                tf[word] += 1
            else:
                tf[word] = 1
        # logarithmic tf
        for word in tf.keys():
            tf[word] = 1 + math.log(tf[word])
        # Normalization, cosine normalization
        cosine = 0
        for word in tf.keys():
            cosine = cosine + tf[word] * tf[word]
        cosine = 1.0 / math.sqrt(cosine)
        for word in tf.keys():
            tf[word] = tf[word] * cosine

        # Convert the processed word list into a set and obtain the unique words in it
        unique_terms = set(line)
        for key_word in unique_terms:
            # postings is a defaultdict(dict), so a new term gets an empty inner dict automatically;
            # either way the (tweetid, weight) pair is stored for this term
            postings[key_word][tweetid] = tf[key_word]

# Query standardization processing
def token(doc):
    # Convert the input text to lowercase letters for uniform processing.
    doc = doc.lower()
    # Split the text into individual terms and try to convert the terms to singular form
    terms = TextBlob(doc).words.singularize()
    # Restore the word form after word segmentation and return the result list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

# Calculate the similarity between query and each document
def similarity(query):
    global score_tid
    # Reset scores left over from any previous query
    score_tid.clear()
    tf = {}
    # Count term frequencies
    for word in query:
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
    # Count document frequency
    for word in tf.keys():
        if word in postings:
            df = len(postings[word])
        else:
            df = N
        # Corresponds to ltn: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
        tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
    # Calculate similarity
    for word in query:
        if word in postings:
            for tid in postings[word]:
                if tid in score_tid.keys():
                    score_tid[tid] += postings[word][tid] * tf[word]
                else:
                    score_tid[tid] = postings[word][tid] * tf[word]
    # Sort in descending order according to score (similarity)
    similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
    return similarity


def Union(sets):
    return reduce(set.union, [s for s in sets])

def do_search():
    query = token(input("please input search query >> "))
    result = []
    if query == []:
        sys.exit()
    # set() removes duplicates in the query word list
    unique_query = set(query)
    # Generate the set of tweet ids for each query term and union them with Union(); postings.get avoids inserting empty entries for unseen terms
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("There are total" + str(len(relevant_tweetids)) + "Relevant tweets!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        i=1
        for (id, score) in scores:
            if i <= 100: # Return the first n queried information
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")

def main():
    get_postings()
    while True:
        do_search()

if __name__ == "__main__":
    main()

Reference blog: Information retrieval experiment 2- Ranked retrieval model
