Crawling design images from Modao (Ink Knife) with Node.js

Background

A designer shared a Modao (Ink Knife) prototype with me, but only with read-only permission, so the assets could not be downloaded. During development I wanted to download an animated image from it, and I found the image URL in the page structure using the browser's F12 developer tools.

2023-10-21-1-HTML.jpg
However, opening that URL directly in the browser failed with a permission error: Nginx's 403 page. So I looked for another way to download the image.

2023-10-21-2-Nginx.jpg

Failed attempt: Save as image via browser request

The 403 error suggests that requests for this image need certain header information. So I first looked at the headers of the request in the Network panel (filtered to images here). Right-clicking the request offers a "Save as picture" option, and I thought that was the end of it. But the saved file was only 1 MB (1024 KB), while the browser request shows that the actual file is almost 10 MB. Presumably some restriction in the browser leaves the downloaded image incomplete.
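A truncated download like this can be confirmed by comparing the size of the saved file with the Content-Length value shown for the request. The following is only a minimal sketch: the helper name isDownloadComplete is made up for illustration, and it assumes the response was not compressed in transit, so that Content-Length equals the size of the file on disk.

```javascript
const fs = require("fs");

// Hypothetical helper: compares the size of a saved file with the
// Content-Length value reported for the request.
function isDownloadComplete(filePath, contentLength) {
    // A missing or empty header means completeness cannot be confirmed
    if (contentLength === null || contentLength === undefined || contentLength === "") {
        return false;
    }
    const expected = Number(contentLength);
    if (!Number.isInteger(expected) || expected < 0) {
        return false; // malformed header value
    }
    // statSync().size is the file size in bytes
    return fs.statSync(filePath).size === expected;
}
```

For the case described above, the check would report false: the file on disk is about 1 MB while the request reports almost 10 MB.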

2023-10-21-3-Save.jpg

Successful attempt: NodeJS sends Fetch request

Right-clicking the network request in the developer tools offers another option: "Fetch in the console". Clicking it generates code in the console that sends a request for the image, including the header information.

2023-10-21-4-Fetch.png

2023-10-21-5-Console.jpg
When I saw this code, I realized I could send the same request from Node.js and then save the image to disk. So I did. The complete code follows.

const fs = require("fs");

// Download `url` and save the response body to `path`.
// Requires Node.js 18+, where fetch is available globally.
const downloadFile = async (url, path) => {
    // Request options copied from the browser's "Fetch in the console" output
    const res = await fetch(url, {
        "credentials": "include",
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0",
            "Accept": "image/avif,image/webp,*/*",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Sec-Fetch-Dest": "image",
            "Sec-Fetch-Mode": "no-cors",
            "Sec-Fetch-Site": "same-origin",
            "Pragma": "no-cache",
            "Cache-Control": "no-cache"
        },
        "referrer": "https://modao.cc/abc/opq&from=sharing",
        "method": "GET",
        "mode": "cors"
    });
    // Write the raw bytes; no encoding argument is needed for a Buffer
    fs.writeFile(path, Buffer.from(await res.arrayBuffer()), function(err) {
        if (err) throw err;
        console.log("OK");
    });
};

downloadFile("https://modao.cc/x/y/z.gif", "./1.gif");

The code above uses the fetch method in Node.js to request the resource and the fs module to save the image. It is simple, direct, and effective.
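Because the code writes whatever the server returns, an error page (such as the 403 page seen earlier) would also be saved under an image file name. One way to guard against that is to look at the first bytes of the data before saving. This is a sketch rather than part of the original script: detectImageType and PNG_SIGNATURE are hypothetical names, and only the GIF and PNG signatures are checked.

```javascript
// The fixed 8-byte signature that starts every PNG file
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Hypothetical helper: guesses the image type from the leading "magic bytes".
// Returns "gif", "png", or null when neither signature matches.
function detectImageType(buffer) {
    if (buffer.length >= 6) {
        // GIF files start with the ASCII text "GIF87a" or "GIF89a"
        const head = buffer.subarray(0, 6).toString("latin1");
        if (head === "GIF87a" || head === "GIF89a") return "gif";
    }
    if (buffer.length >= 8 && buffer.subarray(0, 8).equals(PNG_SIGNATURE)) {
        return "png";
    }
    return null;
}
```

In the download code, the Buffer built from res.arrayBuffer() could be passed to this check before calling fs.writeFile, and the write skipped when the result is null.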

Possible problems

However, not every asset can be downloaded this way. Some image requests return the status code 304 Not Modified, which means the requested resource has not changed on the server and the server is telling the client to use its cached version. This is an optimization that reduces network traffic and improves performance.

When a browser or other client requests a resource for the first time, the server returns the complete content of the resource together with response headers, one of which is a field called "ETag". The ETag is an identifier that represents the version of the resource. When the client requests the same resource again, it includes a field called "If-None-Match" in the request headers, whose value is the ETag returned by the previous response.

If the server receives a request with the "If-None-Match" field and finds that the resource's current ETag matches the value in the request header, it returns a 304 Not Modified status code, telling the client to use its cached version. This saves bandwidth and server resources, because the client can take the resource directly from its cache without downloading it again.
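The server-side decision described above can be modeled in a few lines. This is a simplified illustration, not how any particular server is implemented: the function respond and the { etag, body } resource shape are assumptions, and real servers also handle lists of ETags, weak comparison, and If-Modified-Since.

```javascript
// Simplified model of a server answering a conditional GET.
// `requestHeaders` uses lower-case header names;
// `resource` is a hypothetical object of the form { etag, body }.
function respond(requestHeaders, resource) {
    const ifNoneMatch = requestHeaders["if-none-match"];
    if (ifNoneMatch !== undefined && ifNoneMatch === resource.etag) {
        // The client's cached copy is still current: no body is sent
        return { status: 304, body: null };
    }
    // First request, or the resource changed: send the full content and its ETag
    return { status: 200, headers: { etag: resource.etag }, body: resource.body };
}
```

A script running in Node.js has no cached copy to fall back on, so a 304 response leaves it with an empty body to save.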

Solution: remove the If-Modified-Since and If-None-Match headers from the header information generated by the browser, so that the request is no longer conditional and the server returns the full resource. You can also try adding the Cache-Control: no-cache header to the fetch request, but that header alone may not be enough, because a server can still answer 304 while the conditional headers are present. The two headers to remove look like this:

    "If-Modified-Since": "Fri, 21 Jul 2023 07:05:31 GMT",
    "If-None-Match": "\"64ba2e3b-14711\""

The request with those two headers removed:

const fs = require("fs");

// Download `url` and save the response body to `path`.
// Requires Node.js 18+, where fetch is available globally.
const downloadFile = async (url, path) => {
    // Same request as before, without If-Modified-Since and If-None-Match
    const res = await fetch(url, {
        "credentials": "include",
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0",
            "Accept": "image/avif,image/webp,*/*",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Sec-Fetch-Dest": "image",
            "Sec-Fetch-Mode": "no-cors",
            "Sec-Fetch-Site": "same-origin"
        },
        "referrer": "https://modao.cc/abc/opq&from=sharing",
        "method": "GET",
        "mode": "cors"
    });
    // Write the raw bytes; no encoding argument is needed for a Buffer
    fs.writeFile(path, Buffer.from(await res.arrayBuffer()), function(err) {
        if (err) throw err;
        console.log("OK");
    });
};

downloadFile("https://modao.cc/x/y/z.png", "./2.png");
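Instead of deleting the two headers by hand each time, the headers object copied from the browser can be filtered before it is passed to fetch. A minimal sketch, with the hypothetical names CONDITIONAL_HEADERS and stripConditionalHeaders:

```javascript
// Request headers that make a GET conditional and can lead to a 304 response
const CONDITIONAL_HEADERS = new Set(["if-modified-since", "if-none-match"]);

// Hypothetical helper: returns a copy of `headers` without the conditional
// headers. Header names are compared case-insensitively.
function stripConditionalHeaders(headers) {
    return Object.fromEntries(
        Object.entries(headers).filter(
            ([name]) => !CONDITIONAL_HEADERS.has(name.toLowerCase())
        )
    );
}
```

The result can be used directly as the "headers" value in the fetch options, and the original object is left unchanged.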

Summary

This post recorded the process of using Node.js to crawl design images from Modao (Ink Knife).

  1. When a fetch request sent from Node.js returns the status code 304 Not Modified, the requested resource has not changed on the server, so the server does not return the actual content and instead tells the client to use its cached version.

  2. This usually happens when the client sends a request with an If-Modified-Since or If-None-Match header. These headers carry information the server returned for an earlier request, and the server uses them to determine whether the resource has changed.

  3. To solve this problem, remove the If-Modified-Since and If-None-Match headers from the request. You can also try adding the Cache-Control: no-cache header, although a server may still answer 304 while the conditional headers are present.
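The two failure cases met in this post can be caught before anything is written to disk by inspecting the response status. The helper below is only an illustration: the name explainStatus and the hint texts are made up here.

```javascript
// Hypothetical helper: maps the status codes met in this post to a short hint.
function explainStatus(status) {
    if (status >= 200 && status < 300) {
        return "ok";
    }
    if (status === 304) {
        return "not modified: remove If-None-Match and If-Modified-Since";
    }
    if (status === 403) {
        return "forbidden: check the request headers copied from the browser";
    }
    return "unexpected status " + status;
}
```

In the download code, explainStatus(res.status) could be checked after the fetch call, and the file written only when the result is "ok".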