Background
While looking something up today, I happened to discover that a netizen abroad had mirrored The Old New Thing blog posts from 2003 to 2019 (https://bytepointer.com/resources/old_new_thing/index.htm). There are actually more than 5000 articles (really productive). It's worth noting that many of the links on Microsoft's official blog are no longer valid, so to guard against this mirror disappearing one day as well, I decided to download and save the content as soon as possible. Saving everything manually is obviously unrealistic; after all, there are more than 5000 articles! So I decided to write a script to automate the download, which was clearly the smart move. A few years ago I would have had to write the script by hand, but AI is now powerful enough that I don't have to. I used Cursor (https://www.cursor.so/) to generate the entire download script, made a few simple modifications, and successfully downloaded and archived all 5000+ articles, packaged by year and month. I have to say it again: AI really is powerful.
Download URL
All articles have been compressed by year; anyone interested can download them directly.
Baidu cloud disk: https://pan.baidu.com/s/1gGx7mvLsumbtzoQbatPsfw?pwd=dbqt Extraction code: dbqt
gitee: https://gitee.com/bianchengnan/my-blog/tree/master/download-the-old-new-thing-archive-blog-using-powershell-with-the-help-of-Cursor
Note: The copyright of all articles belongs to the original author. I only made a copy to keep the content from getting lost. If this infringes on anyone's rights, I will take it down.
Download
Q:
download url and save to disk and then parse inner urls and then save these urls to disk
A:
To download a URL, save it to disk, parse inner URLs, and then save these URLs to disk in PowerShell, you can use the Invoke-WebRequest cmdlet to download the webpage, Out-File cmdlet to save it to disk, and then use a regular expression to parse the inner URLs. Here is a simple example:
# Define the URL
$url = "http://example.com"
# Define the output file for the webpage
$outfile = "output.html"
# Define the output file for the inner URLs
$urlfile = "urls.txt"
# Download the webpage and save to disk
Invoke-WebRequest -Uri $url -OutFile $outfile
# Read the saved file
$content = Get-Content -Path $outfile
# Parse the inner URLs
$innerUrls = $content | Where-Object { $_ -match "(http|https)://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?" }
# Save the inner URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the webpage, save it to output.html, read the saved file, parse the inner URLs, and then save these URLs to urls.txt. You can adjust the regular expression in the Where-Object cmdlet to match the URLs you are interested in.
Voiceover: I found that the parsed internal urls were not correct, so I checked the form of the links in the original web page and gave Cursor a more specific prompt. The request was made in Chinese this time.
Q:
Parse strings matching the href="xxx" pattern.
A:
To download a URL in PowerShell, save it to disk, parse the internal URLs, and save those URLs to disk, you can use the Invoke-WebRequest cmdlet to download the webpage, the Out-File cmdlet to save it to disk, and a regular expression to parse the internal URLs. Here is a simple example:
# omit N lines
# Parse internal URLs
$innerUrls = $content | Where-Object { $_ -match 'href="([^"]*)"' }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, and save those URLs to urls.txt. You can adjust the regular expression in the Where-Object cmdlet to match the URLs you are interested in.
Voiceover: It's great that it remembers the context.
Q:
There is a problem with the statement that parses the internal urls; it does not save only the content inside href="".
A:
Sorry, the regex in my previous answer was indeed wrong. We should use a capturing group to get only the content inside href="". Here is the corrected code:
# omit N lines
# Parse internal URLs
$innerUrls = $content | Select-String -Pattern 'href="([^"]*)"' -AllMatches | % { $_.Matches } | % { $_.Groups[1].Value }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, and then save those URLs to urls.txt. The regular expression in the Select-String cmdlet has been modified so that it only matches what is inside href="".
Voiceover: After reading the generated content, it was infinitely close to the final desired result. But I hadn't expected more than 5000 articles; splicing the URLs together by hand would still be a lot of work. So I kept optimizing the script.
Q:
You need to add the prefix https://bytepointer.com/resources/old_new_thing/ to the parsed url before saving it
A:
The PowerShell code that adds a prefix to the parsed URLs before saving them is as follows:
# omit N lines
# Add the prefix
$prefix = "https://bytepointer.com/resources/old_new_thing/"
$innerUrls = $innerUrls | ForEach-Object { $prefix + $_ }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, prepend the prefix to each URL, and finally save these URLs to urls.txt.
At this point, all of the urls that need to be downloaded have been saved to urls.txt. The next step is to write a script to download them. The final version of this first script is as follows:
# Define the URL
$url = "https://bytepointer.com/resources/old_new_thing/index.htm"
# Define the output file for the web page
$outfile = "output.html"
# Define the output file for the internal URLs
$urlfile = "urls.txt"
# Download the web page and save it to disk
Invoke-WebRequest -Uri $url -OutFile $outfile
# Read the saved file
$content = Get-Content -Path $outfile
# Parse the internal URLs
$innerUrls = $content | Select-String -Pattern 'href="([^"]*)"' -AllMatches | % { $_.Matches } | % { $_.Groups[1].Value }
# Add the prefix
$prefix = "https://bytepointer.com/resources/old_new_thing/"
$innerUrls = $innerUrls | ForEach-Object { $prefix + $_ }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
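If you want to reproduce this step, the script above can simply be saved into a .ps1 file and run from a PowerShell prompt. The file name below is only an illustration, not something from the original workflow:

# Hypothetical file name; any name works.
.\collect-urls.ps1
# Spot-check the first few collected urls.
Get-Content .\urls.txt | Select-Object -First 3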
Parse urls and download
Q:
Traverse urls.txt and download each url in it. If a download fails, there should be a prompt, and the location of each download should be recorded.
A:
In PowerShell, you can use the Invoke-WebRequest cmdlet to download a URL and a try/catch statement to handle possible errors. Here is a simple example:
# Define the URL file
$urlfile = "urls.txt"
# Read the URL file
$urls = Get-Content -Path $urlfile
# Loop through each URL
foreach ($url in $urls) {
    # Define the output file
    $outfile = "downloaded_" + [System.IO.Path]::GetFileName($url)
    try {
        # Download the URL and save it to disk
        Invoke-WebRequest -Uri $url -OutFile $outfile
        # Print the location of the download
        Write-Output "Downloaded $url to $outfile"
    } catch {
        # If the download fails, print an error message
        Write-Output "Failed to download $url: $_"
    }
}
This script will read each URL in the urls.txt file, then attempt to download each one and save it to disk. If a download fails, it will print an error message. Successfully downloaded files will be saved as "downloaded_" plus the original filename.
Voiceover: The last line, Write-Output "Failed to download $url: $_", reports an error; you need to add a space between $url and the colon, otherwise PowerShell tries to parse $url: as a scoped variable reference. I also removed the downloaded_ prefix from the saved file names and saved the files in the d:/downloaded/ directory instead.
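Putting those two tweaks together, the loop ends up roughly like the sketch below. It assumes the d:/downloaded/ directory already exists and that urls.txt sits in the current directory:

# Read the URL file
$urlfile = "urls.txt"
$outdir = "d:/downloaded"
foreach ($url in (Get-Content -Path $urlfile)) {
    # Keep the original file name, without the "downloaded_" prefix
    $outfile = Join-Path -Path $outdir -ChildPath ([System.IO.Path]::GetFileName($url))
    try {
        Invoke-WebRequest -Uri $url -OutFile $outfile
        Write-Output "Downloaded $url to $outfile"
    } catch {
        # Note the space before the colon: "$url:" would be parsed as a scoped variable
        Write-Output "Failed to download $url : $_"
    }
}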
Because there are too many files, it would be messy to put them all in one directory, so I thought about grouping them by year and month.
Group by year and month
Q:
Traverse the files in the folder. The first 8 characters of each file name are the year, month and day, e.g. 20030722_001_why_do_you_have_to_click_the_start_button_to_shut_down.html. Extract the year and month, create folders based on the year, then create subfolders based on the month, and put the corresponding files into the newly created folders.
A:
Here is an example of a PowerShell script that iterates through the files in a folder, extracts the year and month from the beginning of each file name, creates a folder for the year and a subfolder for the month, and moves the corresponding files into the newly created folders.
$sourceDir = "C:\path\to\your\source\directory"      # Source folder path
$destDir = "C:\path\to\your\destination\directory"   # Destination folder path
Get-ChildItem $sourceDir -File | ForEach-Object {
    $year = $_.Name.Substring(0, 4)   # Extract the year
    $month = $_.Name.Substring(4, 2)  # Extract the month
    $newDir = Join-Path -Path $destDir -ChildPath $year -AdditionalChildPath $month   # New folder path
    if (!(Test-Path $newDir)) {
        # If the folder does not exist, create it
        New-Item -ItemType Directory -Path $newDir | Out-Null
    }
    Move-Item -Path $_.FullName -Destination $newDir   # Move the file to the new folder
}
Please replace $sourceDir and $destDir with your actual paths. This script will traverse all the files in $sourceDir, take the first 4 characters of each file name as the year and the next 2 characters as the month, create a folder named after the year in $destDir, create a subfolder named after the month under it, and move the files into the new folder. If the folder already exists, it will not be created again.
Voiceover: -AdditionalChildPath is not accepted here; it seems to exist only in newer PowerShell versions, so on my machine the AI had effectively invented the option. I modified the code by hand as follows:
$newDir = Join-Path -Path $destDir -ChildPath $year
$newDir = Join-Path -Path $newDir -ChildPath $month
Compression
Q:
Traverse the folder and compress
A:
You can use PowerShell's Compress-Archive command to traverse folders and compress them. Here is an example:
$sourceDir = "C:\path\to\your\source\directory"      # Source folder path
$destDir = "C:\path\to\your\destination\directory"   # Destination folder path
Get-ChildItem $sourceDir -Recurse | ForEach-Object {
    if ($_.PSIsContainer) {
        $zipFilePath = Join-Path -Path $destDir -ChildPath ($_.Name + ".zip")   # Compressed file path
        Compress-Archive -Path $_.FullName -DestinationPath $zipFilePath        # Compress the folder
    }
}
Please replace $sourceDir and $destDir with your actual paths. This script will traverse all files and folders in $sourceDir; every folder it finds will be compressed into a .zip file and saved to $destDir.
Voiceover: I don't actually need recursion, so I just removed the -Recurse option.
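For reference, with that option removed the traversal ends up looking roughly like this; only the top-level year folders get zipped (the variables are the same as in the script above):

# Without -Recurse, only the immediate children of $sourceDir (the year folders) are enumerated
Get-ChildItem $sourceDir | ForEach-Object {
    if ($_.PSIsContainer) {
        $zipFilePath = Join-Path -Path $destDir -ChildPath ($_.Name + ".zip")
        Compress-Archive -Path $_.FullName -DestinationPath $zipFilePath
    }
}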
Summary
The whole process was very pleasant. I basically did not write a single line of code, only made a few changes and fine-tuned some details, and the whole thing worked. It took less than 20 minutes in total.
Cursor is also very simple to use; you can find instructions by searching online. Press Ctrl + L to start the magic!