Python text processing, attempt 2
-
I came across an amazing website: https://loriballen.com/ai-chatgpt-prompts/
- I wanted to copy some of its prompts, but the site disables right-click, and F12 can't be used to inspect the page elements
- Ctrl+U still opens the page source, though, and all of the text is in the source
-
First I tried Octopus and a number of crawler extensions for Google Chrome, such as Simple Scraper (which is really easy to use)
-
Then I tried the simplest kind of crawler (one that does not analyze the page elements).
It reads the text on the web page straight into a text file:
```python
import requests
from bs4 import BeautifulSoup

url: str = 'https://loriballen.com/ai-chatgpt-prompts/'

# Crawl the text content of https://loriballen.com/ai-chatgpt-prompts/ and put it in text.txt
response: requests.models.Response = requests.get(url)
soup: BeautifulSoup = BeautifulSoup(response.text, 'html.parser')
text: str = soup.get_text()

# Put the text in text.txt.
# Calling f.write(text) on a file opened without an explicit encoding raised:
# UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 4253: illegal multibyte sequence
# so the file is opened with encoding='utf-8'.
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```
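The encoding error mentioned in the comments can be reproduced without any network access. The snippet below is a minimal sketch: the sample string is made up, and it assumes the original failure came from a platform default encoding of GBK (the post only shows the error message, not the system settings).

```python
# Made-up sample text containing a non-breaking space (U+00A0),
# the character named in the error message.
sample = 'ChatGPT\xa0prompts'

# GBK has no byte sequence for U+00A0, so encoding fails.
try:
    sample.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError as err:
    gbk_ok = False
    print('gbk failed:', err.reason)

# UTF-8 can represent every Unicode character; U+00A0 becomes two bytes.
utf8_bytes = sample.encode('utf-8')
print(utf8_bytes)
```

Passing `encoding='utf-8'` to `open()` makes the file use the second path regardless of the platform default.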
The type annotations in the crawler script were added for me by GitHub Copilot Labs; I tried its "readable" option. Here is how Bing explained the colon:
This colon is a Python feature called variable annotation (1). It lets you attach a Python expression to a variable, after a colon, when you declare it, to document what the variable is meant to hold. Python itself does nothing with these annotations; for module-level and class-level variables it just saves them in the __annotations__ attribute (2). Third-party tools such as type checkers can use these annotations to analyze your code (1).
In your example, the str after the colon is a type hint (3), which indicates that this variable should be a string. Type hints are one use of variable annotations, and they can help you write cleaner, more robust, and more maintainable code (3).
For example, if we wanted to add a type hint to the url variable, we could write:
url: str = 'https://loriballen.com/ai-chatgpt-prompts/'
This means that the url variable is expected to hold a string.
Source: Conversation with Bing, 2023/3/23
(1) python – Use of colon in variable declaration – Stack Overflow. https://stackoverflow.com/questions/51639332/use-of-colon-in-variable-declaration Accessed 2023/3/23.
(2) python – Function parameter with colon – Stack Overflow. https://stackoverflow.com/questions/54962869/function-parameter-with-colon Accessed 2023/3/23.
(3) Python Print Type of Variable – How to Get Var Type – freeCodeCamp.org. https://www.freecodecamp.org/news/python-print-type-of-variable-how-to-get-var-type/ Accessed 2023/3/23.
(4) When To Use Colon (:) in Python? – AskPython. https://www.askpython.com/python/examples/when-to-use-colon-python Accessed 2023/3/23.
-
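To make the point about annotations concrete, here is a small sketch showing that Python stores annotations but does not enforce them. The class name `PageInfo` and the `retries` attribute are made up for illustration; a class is used only because its `__annotations__` attribute is easy to inspect.

```python
class PageInfo:
    # Annotated the same way as the url variable in the crawler script.
    url: str = 'https://loriballen.com/ai-chatgpt-prompts/'
    # Deliberately wrong type: Python does not check the annotation at runtime.
    retries: int = 'three'


# The annotations are simply recorded on the class.
print(PageInfo.__annotations__)

# The mismatched value was accepted without any error.
print(type(PageInfo.retries))
```

A static type checker would be the tool expected to flag the `retries` line; the interpreter itself runs it without complaint.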
Then remove the blank lines:
```python
# Read text.txt and remove blank lines
# write your code here
with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# text.strip() removes the leading and trailing whitespace
text = text.strip()
# text.split('\n') splits on the newline character
text = text.split('\n')
# [i for i in text if i != ''] removes blank lines
text = [i for i in text if i != '']
# '\n'.join(text) merges the lines back together with newline characters
text = '\n'.join(text)

# Write to textNew.txt
with open('textNew.txt', 'w', encoding='utf-8') as f:
    f.write(text)
print('write textNew.txt success')
```
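One caveat with the comparison `i != ''` is that it keeps lines that look blank but contain spaces or a non-breaking space. The following is a possible variant, not what the post actually ran; the function name `remove_blank_lines` and the sample string are made up for illustration.

```python
def remove_blank_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    # line.strip() is an empty (falsy) string for whitespace-only lines,
    # including lines that hold just a non-breaking space ('\xa0').
    return '\n'.join(line for line in text.splitlines() if line.strip())


sample = 'a\n\n  \nb\n\xa0\nc\n'
cleaned = remove_blank_lines(sample)
print(repr(cleaned))
```

Using `splitlines()` also copes with Windows-style `\r\n` line endings, which a plain `split('\n')` would leave as trailing `\r` characters.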
-
Then I manually deleted a lot of irrelevant lines, keeping only the useful ones
-
Then I formatted the file by hand.
Roughly: one blank line between each category of prompts, like this
```text
Business Strategy Prompts
What are some tips for [X]
What are some tips for getting more real estate listings?
What are the best resources for [X]
What are the best resources for a new Youtuber?
How can you [achieve goal] in [time frame]?
How can you monetize a Youtube channel within 1 year?

Blogging Prompts
What is the most popular [type of content] in [industry/niche]
What is the most popular type of content in the digital marketing space?

Email Prompts
Write a marketing email promoting this product or service: [insert product or service]
It helps to add text to guide ChatGPT to the facts. Bullet points work well. After entering the product or service, add a colon, and then add the bullet points or text.

Marketing Prompts
Check the grammar and punctuation in this document (you will need to provide the paragraph, paragraphs, or full text)
Check the grammar and spelling in an article.
```
-
-
So what next?
(Let the AI do the thinking for me)
I gave it the following prompt, but unfortunately the code produced by Copilot, Phind, and Cursor would not run (Cursor took my # to mean that each line in the file literally starts with the # symbol)
```python
# read textNew.txt
# Give an example so you can see what I mean
# The format in textNew.txt is probably as follows
#category1
#prompt1
#prompt2
#prompt3
# empty line
# category2
#prompt1
#prompt2
# Now what you have to do is to read the contents of this file and turn it into a dictionary
```
Then I revised the prompt to spell things out more clearly, and this time it worked great.
```python
# Read textNew.txt
# Give an example so you can see what I mean
# The format in textNew.txt is probably as follows
# empty line
#category1
#prompt1
#prompt2
#...
#promptk
# empty line
# category2
#prompt1
#prompt2
#...
#promptk
#category3
#prompt1
#prompt2
#...
#promptk
# empty line
#categoryn
#prompt1
#prompt2
#...
#promptk
# Now what you have to do is to read the contents of this file and turn it into a dictionary
# What you have to do is write this dictionary into a file called data.json
# write your code here
```
```python
# Read textNew.txt
# Give an example so you can see what I mean
# The format in textNew.txt is probably as follows
# empty line
#category1
#prompt1
#prompt2
#...
#promptk
# empty line
# category2
#prompt1
#prompt2
#...
#promptk
#category3
#prompt1
#prompt2
#...
#promptk
# empty line
#categoryn
#prompt1
#prompt2
#...
#promptk
# Now what you have to do is to read the contents of this file and turn it into a dictionary
# What you have to do is write this dictionary into a file called data.json
# write your code here

# First, import the json module
import json

# Then read textNew.txt
with open('textNew.txt', 'r', encoding='utf-8') as f:
    # write your code here
    # Split paragraphs by blank lines
    # The first line of each paragraph is the category
    # The prompts start on the second line of each paragraph
    text = f.read()  # read textNew.txt

text = text.split('\n\n')  # Split paragraphs by blank lines

# extract category
# text is a list, each element in the list is a string
# i is one of those strings, i.e. one paragraph
# i.split('\n') splits the paragraph on the newline character,
# so i.split('\n')[0] is the first line of the paragraph
category = [i.split('\n')[0] for i in text]
# extract prompt: every line after the first
prompt = [i.split('\n')[1:] for i in text]

# store in dictionary
# zip(category, prompt) pairs each category with its list of prompts
# dict(zip(category, prompt)) turns those pairs into a dictionary
data = dict(zip(category, prompt))

# write to data.json
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
print('write data.json successfully')
```
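The `dict(zip(category, prompt))` step is easier to follow on a tiny example. The two lists below are made-up stand-ins for what the list comprehensions would produce, not the real contents of textNew.txt.

```python
# Hypothetical stand-ins for the two lists built from the paragraphs.
category = ['Blogging Prompts', 'Email Prompts']
prompt = [['prompt a', 'prompt b'], ['prompt c']]

# zip() yields (category, prompts) tuples lazily; list() makes them visible.
pairs = list(zip(category, prompt))
print(pairs)

# dict() turns each (key, value) tuple into one dictionary entry.
data = dict(pairs)
print(data['Email Prompts'])
```

Note that if two paragraphs happened to start with the same first line, the later one would overwrite the earlier one, because dictionary keys are unique.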
The resulting data.json is quite satisfactory.
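One thing to watch with splitting on exactly `'\n\n'`: if the file starts with a blank line, ends with a newline, or has two blank lines in a row, the generated script would likely produce an empty-string category or empty prompt entries. The sketch below is a more tolerant variant under those assumptions; the function name `parse_prompt_file` and the sample text are made up for illustration and are not part of the code the AI produced.

```python
import json
import re
from typing import Dict, List


def parse_prompt_file(text: str) -> Dict[str, List[str]]:
    """Map each category (first line of a block) to its prompts."""
    data: Dict[str, List[str]] = {}
    # Split on one or more blank (or whitespace-only) lines.
    for block in re.split(r'\n\s*\n', text.strip()):
        # Ignore stray whitespace-only lines inside a block.
        lines = [line.strip() for line in block.split('\n') if line.strip()]
        if lines:
            data[lines[0]] = lines[1:]
    return data


# Made-up sample: leading blank line, double blank line, trailing newline.
sample = '\nBlogging Prompts\nprompt a\nprompt b\n\n\nEmail Prompts\nprompt c\n'
data = parse_prompt_file(sample)
print(json.dumps(data, ensure_ascii=False, indent=4))
```

The result can be written to data.json with `json.dump` exactly as in the script above.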