Python multithreading and multiprocessing

One, what is a process and what is a thread?

Process: a running program. Every time we execute a program, the operating system automatically prepares the necessary resources for it (for example, it allocates memory and creates a thread that can be executed).

Thread: an execution flow inside a process that can be scheduled directly by the CPU. It is the smallest unit of scheduling the operating system performs; it is contained in a process and is the actual unit of execution within the process.

The relationship between processes and threads:

A process is a resource unit; a thread is an execution unit. Think of a company: its resources are the desks, chairs, computers and water dispensers, but for the company to actually operate, there must be people working for it. The same goes for a program: the process holds the various resources the program needs to run, but for the program to actually run, its threads must be scheduled and executed by the CPU.

Every program we run has one thread by default. Even a program at the hello-world level generates a thread when it is executed.
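
You can see this for yourself with the standard library (a minimal sketch; the printed id and name will vary per run):

import os
import threading

# even a hello-world-level script runs inside a process
# and executes on that process's default thread, the main thread
print("process id:", os.getpid())
print("thread name:", threading.current_thread().name)  # MainThread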

Two, multi-threading

As the name suggests, multithreading lets a program spawn multiple threads that execute together. Take the company as an example: if a company has only one employee, efficiency will certainly not be high. How do you improve it? Hire more people.

How do we implement multithreading? In Python, there are two options.

1. Create threads directly using Thread

Let's first look at the effect of a single thread:

def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    func()  # runs to completion before the loop below even starts
    for i in range(1000):
        print("main", i)

Now look at multithreading:

from threading import Thread


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    t = Thread(target=func)  # create a thread and hand it the task
    t.start()  # the thread becomes runnable; when it executes is up to the CPU
    for i in range(1000):
        print("main", i)

2. Inherit the Thread class

from threading import Thread


class MyThread(Thread):
    def run(self):
        for i in range(1000):
            print("func", i)


if __name__ == '__main__':
    t = MyThread()
    t.start()
    for i in range(1000):
        print("main", i)

These two are the most basic ways to create threads in Python. Python also provides a thread pool.

3. Thread pool

Python also provides a thread pool: multiple threads can be created at once, and we programmers do not need to maintain them manually; everything is managed automatically by the pool.

# Thread pool
from concurrent.futures import ThreadPoolExecutor


def fn(name):
    for i in range(1000):
        print(name, i)


if __name__ == '__main__':
    # a pool of 10 threads works through 100 submitted tasks
    with ThreadPoolExecutor(10) as t:
        for i in range(100):
            t.submit(fn, name=f"thread{i}")

What if the task has a return value?

import time
from concurrent.futures import ThreadPoolExecutor


def func(name):
    time.sleep(2)  # simulate a slow task
    return name


def do_callback(res):
    # res is a Future; result() retrieves the task's return value
    print(res.result())


if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        names = ["Thread1", "Thread2", "Thread3"]
        for name in names:
            # Option 1: attach a callback that fires when the task finishes
            t.submit(func, name).add_done_callback(do_callback)

            
if __name__ == '__main__':
    start = time.time()
    with ThreadPoolExecutor(10) as t:
        names = [5, 2, 3]
        # Option 2: use map to distribute tasks. Results come back together,
        # in the order the arguments were passed. The cost: if the first task
        # has not finished, the results behind it are held back.
        results = t.map(func, names)
        for r in results:
            print("result", r)
    print(time.time() - start)
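
If you would rather receive each result as soon as its task finishes, regardless of submission order, concurrent.futures also provides as_completed (a small sketch reusing func from above):

from concurrent.futures import ThreadPoolExecutor, as_completed

if __name__ == '__main__':
    with ThreadPoolExecutor(10) as t:
        futures = [t.submit(func, name) for name in [5, 2, 3]]
        for fut in as_completed(futures):  # yields each future as it completes
            print("result", fut.result())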

4. Application of multi-threading in crawlers

http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml

Still using the Xinfadi case:

import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def get_page_source(url):
    resp = requests.get(url)
    return resp.text


def get_total_count():
    url = "http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
    source = get_page_source(url)
    tree = etree.HTML(source)
    # the last pagination link points at the final page number
    last_href = tree.xpath("//div[@class='manu']/a[last()]/@href")[0]
    total = last_href.split("/")[-1].split(".")[0]
    return int(total)


def download_content(url):
    source = get_page_source(url)
    tree = etree.HTML(source)
    # position() > 1 skips the table's header row
    trs = tree.xpath("//table[@class='hq_table']/tr[position() > 1]")
    result = []
    for tr in trs:
        tds = tr.xpath("./td/text()")
        result.append((tds[0], tds[1], tds[2], tds[3], tds[4], tds[5], tds[6]))
    return result


def main():
    total = get_total_count()
    url_tpl = "http://www.xinfadi.com.cn/marketanalysis/0/list/{}.shtml"

    with open("data.csv", mode="w", encoding="utf-8") as f:
        with ThreadPoolExecutor(50) as t:
            data = t.map(download_content, (url_tpl.format(i) for i in range(1, total + 1)))
            # collect the return value of every task
            for item in data:
                # each task returns one page's rows
                for detail in item:
                    # write one CSV row per record
                    content = ",".join(detail) + "\n"
                    print(content)
                    f.write(content)


if __name__ == '__main__':
    main()

Three, multiple processes

After all, the value one company can create is limited. What do you do? Open a branch office. That is multi-processing. Implementing multi-processing in Python is almost the same as multithreading, and it is very simple.

1. Create a process directly using Process

from multiprocessing import Process


def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    p = Process(target=func)
    p.start()

    for i in range(1000):
        print("main", i)

2. Inherit the Process class

from multiprocessing import Process


class MyProcess(Process):
    def run(self):
        for i in range(1000):
            print("MyProcess", i)


if __name__ == '__main__':
    p = MyProcess()
    p.start()
    for i in range(1000):
        print("main", i)

3. Application of multi-processing in crawlers

We rarely use multi-processing directly. The situation best suited to it is this: multiple tasks need to run together, their data may overlap, but their functions are relatively independent. For example, suppose we build our own proxy IP pool: we need to crawl IPs from the network, and each IP must be verified before it can be used. The crawling task and the verification task are two completely independent pieces of functionality, so we can start a separate process for each. Another example is image scraping: images generally sit in a page's img tags, with the src attribute holding the download address. A multi-process design works well here too: one process is responsible for frantically collecting image download addresses, while another process is only responsible for downloading the images.

In summary: when multiple tasks need to run in parallel and the tasks are relatively independent (not necessarily completely independent), consider multiple processes.

import requests
from lxml import etree
from urllib import parse
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Process, Queue  # Queue must come from multiprocessing so both processes can share it


# Process 1: extract image download paths from the image website
def get_pic_src(q):
    print("start main page spider")
    url = "http://www.591mm.com/mntt/"
    resp = requests.get(url)
    tree = etree.HTML(resp.text)
    child_hrefs = tree.xpath("//div[@class='MeinvTuPianBox']/ul/li/a/@href")
    print("get hrefs from main page", child_hrefs)
    for href in child_hrefs:
        href = parse.urljoin(url, href)
        print("handle href", href)
        resp_child = requests.get(href)
        tree = etree.HTML(resp_child.text)
        pic_src = tree.xpath("//div[@id='picBody']//img/@src")[0]
        print(f"put {pic_src} to the queue")
        q.put(pic_src)
        # Assignment: extend this to capture paginated images
        # print("ready to another!")
        # others = tree.xpath('//ul[@class="articleV2Page"]')
        # if others:


# Process 2: download each image pulled from the queue
def download(url):
    print("start download", url)
    name = url.split("/")[-1]
    resp = requests.get(url)
    with open(name, mode="wb") as f:
        f.write(resp.content)
    resp.close()
    print("downloaded", url)


def start_download(q):
    with ThreadPoolExecutor(20) as t:
        while True:
            # q.get() blocks until the spider process puts a new url on the queue
            t.submit(download, q.get())


def main():
    q = Queue()  # a multiprocessing.Queue, shared between the two processes
    p1 = Process(target=start_download, args=(q,))
    p2 = Process(target=get_pic_src, args=(q,))
    p1.start()
    p2.start()


if __name__ == '__main__':
    main()
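
Note that this sketch never exits on its own: start_download blocks on q.get() forever once the spider has finished. A common fix (my own addition, not part of the original case) is to have the producer signal completion with a sentinel value:

def start_download(q):
    with ThreadPoolExecutor(20) as t:
        while True:
            url = q.get()
            if url is None:  # sentinel: the spider has finished
                break
            t.submit(download, url)


def main():
    q = Queue()
    p1 = Process(target=start_download, args=(q,))
    p2 = Process(target=get_pic_src, args=(q,))
    p1.start()
    p2.start()
    p2.join()    # wait for the spider to finish...
    q.put(None)  # ...then tell the downloader to stop
    p1.join()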