Python text processing, attempt 2

  • I came across a wonderful website: https://loriballen.com/ai-chatgpt-prompts/

    • I wanted to copy some of its prompts, but the site disables right-click, and F12 cannot be used to inspect the page elements
    • Ctrl + U still opens the page source, though, and all of the prompt text is in there
  • First I tried Octopus and a number of Chrome crawler extensions, such as Simple Scraper (which is really easy to use)

  • Then I tried the simplest kind of crawler (one that does not analyze the page elements)

    It reads the text of the web page straight into a text file

    import requests
    from bs4 import BeautifulSoup

    url: str = 'https://loriballen.com/ai-chatgpt-prompts/'

    # Fetch the page at https://loriballen.com/ai-chatgpt-prompts/ and save its text to text.txt
    response: requests.models.Response = requests.get(url)
    soup: BeautifulSoup = BeautifulSoup(response.text, 'html.parser')
    text: str = soup.get_text()

    # Write the text to text.txt.
    # Without encoding='utf-8', f.write(text) failed here with:
    # UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 4253: illegal multibyte sequence
    with open('text.txt', 'w', encoding='utf-8') as f:
        f.write(text)
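
The UnicodeEncodeError quoted in the comment above comes from the encoding used to write the file, not from the scraping itself. A minimal sketch, assuming the machine's default file encoding is gbk (which is what the error message suggests): the non-breaking space '\xa0' from the page has no gbk representation, while UTF-8 can encode it.

```python
# '\xa0' is the non-breaking space named in the error message.
nbsp = '\xa0'

# gbk has no byte sequence for this character, so encoding fails.
try:
    nbsp.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError as err:
    gbk_ok = False
    print(f'gbk failed: {err.reason}')

# UTF-8 can encode any Unicode character, so this succeeds.
utf8_bytes = nbsp.encode('utf-8')
print(utf8_bytes)
```

This is why passing encoding='utf-8' to open() makes the write succeed regardless of the system locale.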
    
    

    This style of writing was produced by GitHub Copilot Labs when I tried its "readable" option

    This colon is a feature in Python called variable annotation (1). It lets you attach a Python expression to a variable, after a colon, when you declare it. Python itself does nothing with these annotations beyond saving them in the __annotations__ attribute (2), but third-party tools such as type checkers can use them to analyze your code (1).

    In your example, the str after the colon is a type hint (3), which indicates that this variable should be a string. Type hints are one way of using variable annotations, and they can help you write cleaner, more robust, and more maintainable code (3).

    For example, if we wanted to add a type hint to the url variable, we could write:

    url: str = 'https://loriballen.com/ai-chatgpt-prompts/'
    

    This means that the url variable should hold a string.

    Source: Conversation with Bing, 2023/3/23
    (1) python – Use of colon in variable declaration – Stack Overflow. https://stackoverflow.com/questions/51639332/use-of-colon-in-variable-declaration Accessed 2023/3/23.
    (2) python – Function parameter with colon – Stack Overflow. https://stackoverflow.com/questions/54962869/function-parameter-with-colon Accessed 2023/3/23.
    (3) Python Print Type of Variable – How to Get Var Type – freeCodeCamp.org. https://www.freecodecamp.org/news/python-print-type-of-variable-how-to-get-var-type/ Accessed 2023/3/23.
    (4) When To Use Colon (:) in Python? – AskPython. https://www.askpython.com/python/examples/when-to-use-colon-python Accessed 2023/3/23.
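
To make the point about annotations concrete, here is a small sketch. The class PageConfig and the function fetch_text are hypothetical names invented for illustration; they are not part of the script above. It shows that annotations are stored in __annotations__ and that Python does not enforce them at runtime.

```python
class PageConfig:
    # Hypothetical class, for illustration only.
    url: str = 'https://loriballen.com/ai-chatgpt-prompts/'
    retries: int = 'three'  # wrong type on purpose: Python raises no error


def fetch_text(url: str, timeout: int = 10) -> str:
    """Hypothetical stub that only exists to carry annotations."""
    return ''


# Annotations are stored as plain dictionaries.
class_hints = PageConfig.__annotations__
func_hints = fetch_text.__annotations__
print(class_hints)
print(func_hints)

# The mismatched value is kept as-is; only a type checker would flag it.
print(type(PageConfig.retries))
```

A type checker such as mypy would report the retries line, but the interpreter runs it without complaint.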

  • Then I removed the blank lines

    # Read text.txt and remove blank lines
    # write your code here
    with open('text.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    # text.strip() removes leading and trailing whitespace
    text = text.strip()
    # text.split('\n') splits on the newline character
    text = text.split('\n')
    # Keep only the lines that are not empty
    text = [i for i in text if i != '']
    # '\n'.join(text) joins the lines back together with newlines
    text = '\n'.join(text)
    # Write to textNew.txt
    with open('textNew.txt', 'w', encoding='utf-8') as f:
        f.write(text)
        print('write textNew.txt success')
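
One limitation of comparing lines against '' is that a line holding only spaces, tabs, or a non-breaking space still counts as non-empty. A possible variant, sketched here with a hypothetical helper named remove_blank_lines, treats whitespace-only lines as blank too.

```python
def remove_blank_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    # line.strip() is '' for whitespace-only lines, which is falsy.
    kept = [line for line in text.splitlines() if line.strip()]
    return '\n'.join(kept)


# Sample input with an empty line, a spaces-only line,
# a tab-only line, and a line holding just a non-breaking space.
sample = 'Title\n\n   \nfirst line\n\t\n\xa0\nsecond line\n'
cleaned = remove_blank_lines(sample)
print(cleaned)
```

Note that the kept lines themselves are not trimmed; only whole lines are dropped.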
    
  • Then I manually deleted a lot of irrelevant text, keeping only the useful parts

    • Then I formatted it by hand

      Roughly, one blank line between each category of prompts, like this

      Business Strategy Prompts
      What are some tips for [X] What are some tips for getting more real estate listings?
      What are the best resources for [X] What are the best resources for a new Youtuber?
      How can you [achieve goal] in [time frame]? How can you monetize a Youtube channel within 1 year?
      
      Blogging Prompts
       What is the most popular [type of content] in [industry/niche] What is the most popular type of content in the digital marketing space?
      
      Email Prompts
      Write a marketing email promoting this product or service: [insert product or service]It helps to add text to guide ChatGPT to the facts. Bullet points work well. After entering the product or service, add a colon, and then add the bullet points or text.
      
      Marketing Prompts
      Check the grammar and punctuation in this document (you will need to provide the paragraph, paragraphs, or full text) Check the grammar and spelling in an article.
      
      
      
  • So what next?

    (Let the AI do the thinking for me)

    I gave the following prompt, but unfortunately the code produced by Copilot, Phind, and Cursor would not run (Cursor took my # to mean that every line in the file literally starts with #)

    # read textNew.txt
    # Give an example so you can see what I mean
    # The format in textNew.txt is probably as follows
    #category1
    #prompt1
    #prompt2
    #prompt3
    # empty line
    # category2
    #prompt1
    #prompt2
    # Now what you have to do is to read the contents of this file and turn it into a dictionary
    

    Then I revised it and wrote the prompt more clearly. This time the result was great.

    # Read textNew.txt
    # Give an example so you can see what I mean
    # The format in textNew.txt is probably as follows
    # empty line
    #category1
    #prompt1
    #prompt2
    #...
    #promptk
    # empty line
    # category2
    #prompt1
    #prompt2
    #...
    #promptk
    #category3
    #prompt1
    #prompt2
    #...
    #promptk
    # empty line
    #categoryn
    #prompt1
    #prompt2
    #...
    #promptk
    # Now what you have to do is to read the contents of this file and turn it into a dictionary
    
    # What you have to do is write this dictionary into a file called data.json
    # write your code here
    
    # First, import the json module
    import json
    # Then read textNew.txt
    with open('textNew.txt', 'r', encoding='utf-8') as f:
        # Paragraphs are separated by blank lines
        # The first line of each paragraph is the category
        # The prompts start from the second line of each paragraph
        # strip() drops a leading or trailing blank line so no empty category appears
        text = f.read().strip()  # read textNew.txt
    text = text.split('\n\n')  # split into paragraphs on blank lines
    # text is a list; each element i is a string holding one paragraph
    # i.split('\n') splits a paragraph into lines, so i.split('\n')[0] is its first line
    # extract the categories
    category = [i.split('\n')[0] for i in text]
    # extract the prompts
    prompt = [i.split('\n')[1:] for i in text]
    # store in a dictionary
    # zip(category, prompt) pairs each category with its list of prompts
    # dict(...) turns those pairs into a dictionary
    data = dict(zip(category, prompt))
    # write to data.json
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
        print('write data.json successfully')
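
Because textNew.txt was formatted by hand, it may contain double blank lines or stray leading spaces, which a plain split on '\n\n' does not handle. The following is a sketch of a more tolerant parser, not the code the AI generated; the function name parse_prompts and the sample text are assumptions made for illustration.

```python
import json
import re


def parse_prompts(text: str) -> dict[str, list[str]]:
    """Map each category (first line of a block) to its prompts."""
    data: dict[str, list[str]] = {}
    # Split on one or more blank or whitespace-only lines.
    for block in re.split(r'\n\s*\n', text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            # setdefault merges blocks that repeat a category name.
            data.setdefault(lines[0], []).extend(lines[1:])
    return data


# Sample in the same layout as textNew.txt, with messy spacing.
sample = '''
Business Strategy Prompts
What are some tips for [X]
What are the best resources for [X]


Blogging Prompts
 What is the most popular [type of content] in [industry/niche]
'''
data = parse_prompts(sample)
print(json.dumps(data, ensure_ascii=False, indent=4))
```

Using setdefault means a category that appears twice has its prompts merged, whereas dict(zip(...)) would keep only the last occurrence.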
    
    

    The result is quite satisfactory
