Information retrieval and data mining | [Experiment] Ranked retrieval model

Article directory

  • Experimental content
  • Related concepts
  • Experimental steps
    • Word segmentation preprocessing
    • Build an inverted index table
    • Calculate the similarity between query and each document
    • queries preprocessing and retrieval functions
      • Perform lexical analysis and standardization on the input text
      • Retrieval function
    • Debugging results

Experimental content

  1. Implement the most basic Ranked retrieval model based on Experiment 1
    • Input: a query (like Ron Weasley birthday)
    • Output: Return the top K (e.g., K = 100) relevant tweets.
  2. Use SMART notation: lnc.ltn (the corresponding weight formulas are written out after this list)
    • Document: logarithmic tf (l as first character), no idf and cosine normalization
    • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
  3. Improve Inverted index
    • Store the DF of each term in Dictionary
    • Store the TF of terms in each doc in the posting list with pairs (docID, tf)
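
Spelled out, the lnc.ltn scheme above corresponds to the following weights (a restatement of the SMART notation, using natural logarithms as math.log does in the code later in this post):

    w_{t,d} = \frac{1 + \log(tf_{t,d})}{\sqrt{\sum_{t' \in d} (1 + \log(tf_{t',d}))^2}} \quad \text{(document: logarithmic tf, no idf, cosine normalization)}

    w_{t,q} = (1 + \log(tf_{t,q})) \cdot \log\frac{N}{df_t} \quad \text{(query: logarithmic tf, idf, no normalization)}

    score(q, d) = \sum_{t \in q \cap d} w_{t,q} \cdot w_{t,d}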

Related concepts

  • Information retrieval and data mining | (5) Document scoring, term weight calculation and vector space model
  • Term frequency (tf_{t,d}): the number of times term t appears in document d.
  • Collection frequency: the total number of occurrences of a term in the whole document collection.
  • Document frequency (df_t): the number of documents in which term t appears.
  • Inverse document frequency: idf_t = \log(N / df_t), where N is the total number of documents.
  • tf-idf_{t,d} calculation: tf-idf_{t,d} = tf_{t,d} \times idf_t = tf_{t,d} \times \log(N / df_t) (a small numeric example follows this list)

  • Similarity calculation: the score of a document is the sum, over terms it shares with the query, of the document term weight times the query term weight (an inner product of the two weight vectors).
  • Query weighting mechanism: SMART notation lnc.ltn, i.e., documents are weighted with lnc and queries with ltn.
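
A quick numeric illustration of these quantities (hypothetical numbers; natural logarithm, matching math.log in the code):

    N = 10000,\ df_t = 100 \;\Rightarrow\; idf_t = \log(10000 / 100) = \log(100) \approx 4.61

    tf_{t,d} = 3 \;\Rightarrow\; tf\text{-}idf_{t,d} = 3 \times 4.61 \approx 13.82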

Experimental steps

Word segmentation preprocessing

  1. Convert the input tweet document to lowercase; this is done uniformly here so that subsequent queries are case-insensitive.

  2. Locate the key fields of the tweet record by their tags and extract the tweetid and the tweet text.

  3. Tokenize the extracted text and convert each token to its singular form.

  4. Lemmatize the tokenized words (mainly verb lemmatization), and filter out the field names ["text", "tweetid"].

  5. Add the remaining valid words to the result list and return it (a usage sketch follows the code below).

    #Word segmentation preprocessing
    def tokenize_tweet(document):
        # Unified processing makes the query case-insensitive
        document = document.lower()
        # Find and determine the location index of key parts of information in the Twitter document based on specific tags
        # The -1 and -3 offsets adjust for the surrounding quotation marks and commas
        a = document.index("tweetid") - 1
        b = document.index("errorcode") - 1
        c = document.index("text") - 1
        d = document.index("timestr") - 3
        # Extract the main information of tweetid and text content in the Twitter document
        document = document[a:b] + document[c:d]
        # Word segmentation processing and converting words into their singular form
        terms = TextBlob(document).words.singularize()
        # Lemmatize each token and keep only words that are not in the useless-term list
        result = []
        for word in terms:
            # Convert the current word to a Word object
            expected_str = Word(word)
            # Verb lemmatization
            expected_str = expected_str.lemmatize("v")
            if expected_str not in uselessTerm:
                # Filter out ["text", "tweetid"] and add to result
                result.append(expected_str)
        return result
    
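For concreteness, here is a hedged usage sketch of tokenize_tweet. The exact line format of Tweets.txt is not shown in this post, so the sample line below is only a guess at that format, and the printed result should be read as approximate.

    # Hypothetical input line; the real Tweets.txt format may differ in its details
    sample = ('"tweetid": "28965792812", "errorcode": "200", '
              '"text": "Ron Weasley celebrated a birthday", "timestr": "2011-01-23"')
    print(tokenize_tweet(sample))
    # Roughly: ['28965792812', 'ron', 'weasley', 'celebrate', 'a', 'birthday']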

Build an inverted index table

  • Store the TF of terms in each doc in the posting list with pairs (docID, tf).
  1. First, note that document term weights are computed with the lnc rule: logarithmic tf (l as first character), no idf, and cosine normalization. (A worked toy example follows the code snippets below.)
  2. The specific process is as follows:
    • Read the file content; each line represents one tweet. Tokenize and lemmatize each line into a word list line.
    • Use a global variable N to record the number of tweet documents read.
    • Extract the tweetid from line and then remove it from line.
    • Create an empty dictionary tf to count the occurrences of each word in the current document: traverse each word in line and update its count depending on whether it is already a key of tf.
    • Apply logarithmic tf to each count in the tf dictionary, i.e., take the logarithm of the count and add 1 (corresponds to logarithmic tf, l as first character).
    • Apply cosine normalization: traverse the keys (i.e., terms) of tf to compute the normalization factor, then traverse them again and multiply each term's weight by that factor to obtain the final tf weight.
    • Convert line to a set unique_terms and iterate over each word in it:
      • If the word already exists as a key of the postings dictionary, update its entry by adding the (tweetid, weight) pair.
      • If the word does not exist as a key of postings, create the entry and add the (tweetid, weight) pair.
  • Count term frequency
     # Count term frequency and record the number of occurrences of each word in the current document
     tf = {}
     for word in line:
         if word in tf.keys():
             tf[word] += 1
         else:
             tf[word] = 1
    
  • Logarithmic tf: 1 + \log(tf_{t,d})

     # logarithmic tf
     for word in tf.keys():
         tf[word] = 1 + math.log(tf[word])
    
  • Cosine normalization factor: \frac{1}{\sqrt{{w_1}^2 + {w_2}^2 + … + {w_m}^2}}

     # Normalization, cosine normalization
     cosine = 0
     for word in tf.keys():
         cosine = cosine + tf[word] * tf[word]
     cosine = 1.0 / math.sqrt(cosine)
     for word in tf.keys():
         tf[word] = tf[word] * cosine
    
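To make the lnc weighting concrete, here is a minimal, self-contained sketch (not part of the experiment code) of what ends up stored in postings for a toy document with a hypothetical docID "d1":

    import math

    # Toy document "d1" after preprocessing (hypothetical)
    line = ["ron", "ron", "weasley"]

    tf = {}
    for word in line:                          # raw counts: {'ron': 2, 'weasley': 1}
        tf[word] = tf.get(word, 0) + 1
    for word in tf:                            # logarithmic tf: {'ron': 1.693, 'weasley': 1.0}
        tf[word] = 1 + math.log(tf[word])
    cosine = 1.0 / math.sqrt(sum(w * w for w in tf.values()))
    for word in tf:                            # cosine normalization: {'ron': 0.861, 'weasley': 0.509}
        tf[word] = tf[word] * cosine

    print(tf)  # these values would be stored as postings[word]["d1"]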

Calculate the similarity between query and each document

  1. The process has two steps: first compute the weight of each query term, then compute the similarity (multiply the document weight and the query weight of each shared term and sum the products) and sort in descending order. Query weights use the ltn rule: logarithmic tf (l in leftmost column), idf (t in second column), no normalization. (A small numeric illustration follows the code below.)
  2. The specific process is as follows:
    • Traverse the query word list query, count the term frequency of each word, and store the results in tf.
    • Traverse the keys of tf (i.e., the query words) and compute each word's document frequency df as the length of its posting list in postings. If a word is not in postings, set df to the global variable N (the total number of documents), so its idf becomes 0.
    • Compute the weight tf[word] = (math.log(tf[word]) + 1) * math.log(N / df), corresponding to ltn (logarithmic tf, idf, no normalization).
    • For each query term that exists in postings, traverse its posting list (document ID and term weight) and accumulate each document's similarity score as the product of the document term weight and the query term weight.
    • Store the scores, sort them in descending order, and return the ranked list as the result.
    def similarity(query):
        global score_tid
        # Reset scores left over from any previous query
        score_tid.clear()
        tf = {}
        # Count term frequencies
        for word in query:
            if word in tf:
                tf[word] += 1
            else:
                tf[word] = 1
        # Count document frequency
        for word in tf.keys():
            if word in postings:
                df = len(postings[word])
            else:
                df = N
            # Corresponds to ltn: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
            tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
        # Calculate similarity
        for word in query:
            if word in postings:
                for tid in postings[word]:
                    if tid in score_tid.keys():
                        score_tid[tid] += postings[word][tid] * tf[word]
                    else:
                        score_tid[tid] = postings[word][tid] * tf[word]
        # Sort in descending order according to score (similarity)
        similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
        return similarity
    
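As a small numeric illustration tying this to the toy lnc example above (hypothetical values: N = 10000, df_{ron} = 500, df_{weasley} = 200, query "ron weasley" with each term occurring once):

    w_{ron,q} = (1 + \log 1) \cdot \log(10000 / 500) = \log(20) \approx 3.00

    w_{weasley,q} = (1 + \log 1) \cdot \log(10000 / 200) = \log(50) \approx 3.91

    score(q, d1) \approx 0.861 \times 3.00 + 0.509 \times 3.91 \approx 4.57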

queries preprocessing and retrieval functions

Perform lexical analysis and standardization on the input text

def token(doc):
    # Convert the input text to lowercase letters for uniform processing.
    doc = doc.lower()
    # Split the text into individual terms and try to convert the terms to singular form
    terms = TextBlob(doc).words.singularize()
    # Restore the word form after word segmentation and return the result list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

Retrieval function

def Union(sets):
    return reduce(set.union, [s for s in sets])

def do_search():
    query = token(input("please input search query >> "))
    result = []
    if query == []:
        sys.exit()
    # set() removes duplicates in the query word list
    unique_query = set(query)
    # Generate the set of tweet ids for each query term and union them with Union(); postings.get avoids inserting empty entries for unseen terms
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("There are total" + str(len(relevant_tweetids)) + "Relevant tweets!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        i=1
        for (id, score) in scores:
            if i <= 100: # Return the first n queried information
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")

Debugging results

Final code

import sys
from collections import defaultdict
from textblob import TextBlob
from textblob import Word
import math
from functools import reduce

uselessTerm = ["text", "tweetid"]
# Build an inverted index table and store the TF of terms in each doc with pairs (docID, tf)
postings = defaultdict(dict)
#Number of documents
N = 0
# Final weight
score_tid = defaultdict(dict)

#Word segmentation preprocessing
def tokenize_tweet(document):
    # Unified processing makes the query case-insensitive
    document = document.lower()
    # Find and determine the location index of key parts of information in the Twitter document based on specific tags
    # The -1 and -3 offsets adjust for the surrounding quotation marks and commas
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Extract the main information of tweetid and text content in the Twitter document
    document = document[a:b] + document[c:d]
    # Word segmentation processing and converting words into their singular form
    terms = TextBlob(document).words.singularize()
    # Lemmatize each token and keep only words that are not in the useless-term list
    result = []
    for word in terms:
        # Convert the current word to a Word object
        expected_str = Word(word)
        # Verb lemmatization
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Filter out ["text", "tweetid"] and add to result
            result.append(expected_str)
    return result

# Build an inverted index table and store the TF of terms in each doc with pairs (docID, tf)
# lnc: logarithmic tf, no idf and cosine normalization
def get_postings():
    global postings, N
    content = open(r"Tweets.txt")
    # Content reading, each tweet is stored as an element in lines
    lines = content.readlines()
    for line in lines:
        N += 1
        # Preprocessing
        line = tokenize_tweet(line)
        # Extract the first element in the processed word list, which is the tweetid of the Twitter document
        tweetid = line[0]
        # Delete after extraction, not as a valid word
        line.pop(0)

        # Count term frequency and record the number of occurrences of each word in the current document
        tf = {}
        for word in line:
            if word in tf.keys():
                tf[word] += 1
            else:
                tf[word] = 1
        # logarithmic tf
        for word in tf.keys():
            tf[word] = 1 + math.log(tf[word])
        # Normalization, cosine normalization
        cosine = 0
        for word in tf.keys():
            cosine = cosine + tf[word] * tf[word]
        cosine = 1.0 / math.sqrt(cosine)
        for word in tf.keys():
            tf[word] = tf[word] * cosine

        # Convert the processed word list into a set and obtain the unique words in it
        unique_terms = set(line)
        for key_word in unique_terms:
            # postings is a defaultdict(dict), so a new term gets an empty inner dict automatically;
            # either way the (tweetid, weight) pair is stored for this term
            postings[key_word][tweetid] = tf[key_word]

# Query standardization processing
def token(doc):
    # Convert the input text to lowercase letters for uniform processing.
    doc = doc.lower()
    # Split the text into individual terms and try to convert the terms to singular form
    terms = TextBlob(doc).words.singularize()
    # Restore the word form after word segmentation and return the result list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

# Calculate the similarity between query and each document
def similarity(query):
    global score_tid
    # Reset scores left over from any previous query
    score_tid.clear()
    tf = {}
    # Count term frequencies
    for word in query:
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
    # Count document frequency
    for word in tf.keys():
        if word in postings:
            df = len(postings[word])
        else:
            df = N
        # Corresponds to ltn: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
        tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
    # Calculate similarity
    for word in query:
        if word in postings:
            for tid in postings[word]:
                if tid in score_tid.keys():
                    score_tid[tid] += postings[word][tid] * tf[word]
                else:
                    score_tid[tid] = postings[word][tid] * tf[word]
    # Sort in descending order according to score (similarity)
    similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
    return similarity


def Union(sets):
    return reduce(set.union, [s for s in sets])

def do_search():
    query = token(input("please input search query >> "))
    result = []
    if query == []:
        sys.exit()
    # set() removes duplicates in the query word list
    unique_query = set(query)
    # Generate the set of tweet ids for each query term and union them with Union(); postings.get avoids inserting empty entries for unseen terms
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("There are total" + str(len(relevant_tweetids)) + "Relevant tweets!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        i=1
        for (id, score) in scores:
            if i <= 100: # Return the first n queried information
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")

def main():
    get_postings()
    while True:
        do_search()

if __name__ == "__main__":
    main()

Reference blog: Information retrieval experiment 2- Ranked retrieval model
