Background
While looking something up today, I happened to discover that a netizen abroad had mirrored The Old New Thing blog posts from 2003 to 2019 (https://bytepointer.com/resources/old_new_thing/index.htm). There are actually more than 5000 articles (really productive). It's worth noting that many of the links on Microsoft's official blog are no longer valid, so to guard against this mirror disappearing one day as well, I decided to download and save the content as soon as possible. Saving everything manually is obviously unrealistic; after all, there are more than 5000 articles! So I decided to write a script to automate the download, which was clearly the smart move. A few years ago I would have had to write the script by hand, but AI is now powerful enough that I don't have to. I used Cursor (https://www.cursor.so/) to generate the entire download script, made a few simple modifications, and successfully downloaded and archived all 5000+ articles, packaged by year and month. I have to say it again: AI really is powerful.
Download URL
All articles have been compressed by year; anyone interested can download them directly.
Baidu cloud disk: https://pan.baidu.com/s/1gGx7mvLsumbtzoQbatPsfw?pwd=dbqt Extraction code: dbqt
gitee: https://gitee.com/bianchengnan/my-blog/tree/master/download-the-old-new-thing-archive-blog-using-powershell-with-the-help-of-Cursor
Note: The copyright of all articles belongs to the original author. I only made a copy to keep the content from getting lost. If this infringes on anyone's rights, I will take it down.
Download
Q:
download url and save to disk and then parse inner urls and then save these urls to disk
A:
To download a URL, save it to disk, parse inner URLs, and then save these URLs to disk in PowerShell, you can use the Invoke-WebRequest cmdlet to download the webpage, Out-File cmdlet to save it to disk, and then use a regular expression to parse the inner URLs. Here is a simple example:
# Define the URL
$url = "http://example.com"
# Define the output file for the webpage
$outfile = "output.html"
# Define the output file for the inner URLs
$urlfile = "urls.txt"
# Download the webpage and save to disk
Invoke-WebRequest -Uri $url -OutFile $outfile
# Read the saved file
$content = Get-Content -Path $outfile
# Parse the inner URLs
$innerUrls = $content | Where-Object { $_ -match "(http|https)://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?" }
# Save the inner URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the webpage, save it to output.html, read the saved file, parse the inner URLs, and then save these URLs to urls.txt. You can adjust the regular expression in the Where-Object cmdlet to match the URLs you are interested in.
Voiceover: I found that the parsed internal urls were not correct, so I checked the form of the links in the original web page and gave Cursor a more specific prompt. The request was made in Chinese this time.
Q:
Parse strings matching the href="xxx" pattern.
A:
To download a URL in PowerShell, save it to disk, parse the internal URLs, and save those URLs to disk, you can use the Invoke-WebRequest cmdlet to download the webpage, the Out-File cmdlet to save it to disk, and a regular expression to parse the internal URLs. Here is a simple example:
# omit N lines
# Parse internal URLs
$innerUrls = $content | Where-Object { $_ -match 'href="([^"]*)"' }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, and save those URLs to urls.txt. You can adjust the regular expression in the Where-Object cmdlet to match the URLs you are interested in.
Voiceover: It's great that it remembers the context.
Q:
There is a problem with the statement that parses the internal urls; it does not save only the content inside href="".
A:
Sorry, the regex in my previous answer was indeed wrong. We should use a capturing group to get only the content inside href="". Here is the corrected code:
# omit N lines
# Parse internal URLs
$innerUrls = $content | Select-String -Pattern 'href="([^"]*)"' -AllMatches | % { $_.Matches } | % { $_.Groups[1].Value }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, and then save those URLs to urls.txt. The regular expression in the Select-String cmdlet has been modified so that it only matches what is inside href="".
Voiceover: After reading the generated content, it was infinitely close to the final desired result. But I hadn't expected more than 5000 articles; splicing the URLs together by hand would still be a lot of work. So I kept optimizing the script.
Q:
You need to add the prefix https://bytepointer.com/resources/old_new_thing/ to the parsed url before saving it
A:
The PowerShell code that adds a prefix to the parsed URLs before saving them is as follows:
# omit N lines
# Add the prefix
$prefix = "https://bytepointer.com/resources/old_new_thing/"
$innerUrls = $innerUrls | ForEach-Object { $prefix + $_ }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
This script will download the web page, save it to output.html, read the saved file, parse the internal URLs, prepend the prefix to each URL, and finally save these URLs to urls.txt.
At this point, all of the urls that need to be downloaded have been saved to urls.txt. The next step is to write a script to download them. The final version of this first script is as follows:
# Define the URL
$url = "https://bytepointer.com/resources/old_new_thing/index.htm"
# Define the output file for the web page
$outfile = "output.html"
# Define the output file for the internal URLs
$urlfile = "urls.txt"
# Download the web page and save it to disk
Invoke-WebRequest -Uri $url -OutFile $outfile
# Read the saved file
$content = Get-Content -Path $outfile
# Parse the internal URLs
$innerUrls = $content | Select-String -Pattern 'href="([^"]*)"' -AllMatches | % { $_.Matches } | % { $_.Groups[1].Value }
# Add the prefix
$prefix = "https://bytepointer.com/resources/old_new_thing/"
$innerUrls = $innerUrls | ForEach-Object { $prefix + $_ }
# Save the internal URLs to disk
$innerUrls | Out-File -FilePath $urlfile
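If you want to reproduce this step, the script above can simply be saved into a .ps1 file and run from a PowerShell prompt. The file name below is only an illustration, not something from the original workflow:

# Hypothetical file name; any name works.
.\collect-urls.ps1
# Spot-check the first few collected urls.
Get-Content .\urls.txt | Select-Object -First 3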
Parse urls and download
Q:
Traverse urls.txt and download each url in it. If a download fails, there should be a prompt, and the location of each download should be recorded.
A:
In PowerShell, you can use the Invoke-WebRequest cmdlet to download a URL and a try/catch statement to handle possible errors. Here is a simple example:
# Define the URL file
$urlfile = "urls.txt"
# Read the URL file
$urls = Get-Content -Path $urlfile
# Loop through each URL
foreach ($url in $urls) {
    # Define the output file
    $outfile = "downloaded_" + [System.IO.Path]::GetFileName($url)
    try {
        # Download the URL and save it to disk
        Invoke-WebRequest -Uri $url -OutFile $outfile
        # Print the location of the download
        Write-Output "Downloaded $url to $outfile"
    } catch {
        # If the download fails, print an error message
        Write-Output "Failed to download $url: $_"
    }
}
This script will read each URL in the urls.txt file, then attempt to download each one and save it to disk. If a download fails, it will print an error message. Successfully downloaded files will be saved as "downloaded_" plus the original filename.
Voiceover: The last line, Write-Output "Failed to download $url: $_", reports an error; you need to add a space between $url and the colon, otherwise PowerShell tries to parse $url: as a scoped variable reference. I also removed the downloaded_ prefix from the saved file names and saved the files in the d:/downloaded/ directory instead.
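Putting those two tweaks together, the loop ends up roughly like the sketch below. It assumes the d:/downloaded/ directory already exists and that urls.txt sits in the current directory:

# Read the URL file
$urlfile = "urls.txt"
$outdir = "d:/downloaded"
foreach ($url in (Get-Content -Path $urlfile)) {
    # Keep the original file name, without the "downloaded_" prefix
    $outfile = Join-Path -Path $outdir -ChildPath ([System.IO.Path]::GetFileName($url))
    try {
        Invoke-WebRequest -Uri $url -OutFile $outfile
        Write-Output "Downloaded $url to $outfile"
    } catch {
        # Note the space before the colon: "$url:" would be parsed as a scoped variable
        Write-Output "Failed to download $url : $_"
    }
}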
Because there are too many files, it would be messy to put them all in one directory, so I thought about grouping them by year and month.
Group by year and month
Q:
Traverse the files in the folder. The first 8 characters of each file name are the year, month and day, e.g. 20030722_001_why_do_you_have_to_click_the_start_button_to_shut_down.html. Extract the year and month, create folders based on the year, then create subfolders based on the month, and put the corresponding files into the newly created folders.
A:
Here is an example of a PowerShell script that iterates through the files in a folder, extracts the year and month from the beginning of each file name, creates a folder for the year and a subfolder for the month, and moves the corresponding files into the newly created folders.
$sourceDir = "C:\path\to\your\source\directory"      # Source folder path
$destDir = "C:\path\to\your\destination\directory"   # Destination folder path
Get-ChildItem $sourceDir -File | ForEach-Object {
    $year = $_.Name.Substring(0, 4)   # Extract the year
    $month = $_.Name.Substring(4, 2)  # Extract the month
    $newDir = Join-Path -Path $destDir -ChildPath $year -AdditionalChildPath $month   # New folder path
    if (!(Test-Path $newDir)) {
        # If the folder does not exist, create it
        New-Item -ItemType Directory -Path $newDir | Out-Null
    }
    Move-Item -Path $_.FullName -Destination $newDir   # Move the file to the new folder
}
Please replace $sourceDir and $destDir with your actual paths. This script will traverse all the files in $sourceDir, take the first 4 characters of each file name as the year and the next 2 characters as the month, create a folder named after the year in $destDir, create a subfolder named after the month under it, and move the files into the new folder. If the folder already exists, it will not be created again.
Voiceover: -AdditionalChildPath is not accepted here; it seems to exist only in newer PowerShell versions, so on my machine the AI had effectively invented the option. I modified the code by hand as follows:
$newDir = Join-Path -Path $destDir -ChildPath $year
$newDir = Join-Path -Path $newDir -ChildPath $month
Compression
Q:
Traverse the folder and compress
A:
You can use PowerShell's Compress-Archive command to traverse folders and compress them. Here is an example:
$sourceDir = "C:\path\to\your\source\directory"      # Source folder path
$destDir = "C:\path\to\your\destination\directory"   # Destination folder path
Get-ChildItem $sourceDir -Recurse | ForEach-Object {
    if ($_.PSIsContainer) {
        $zipFilePath = Join-Path -Path $destDir -ChildPath ($_.Name + ".zip")   # Compressed file path
        Compress-Archive -Path $_.FullName -DestinationPath $zipFilePath        # Compress the folder
    }
}
Please replace $sourceDir and $destDir with your actual paths. This script will traverse all files and folders in $sourceDir; every folder it finds will be compressed into a .zip file and saved to $destDir.
Voiceover: I don't actually need recursion, so I just removed the -Recurse option.
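For reference, with that option removed the traversal ends up looking roughly like this; only the top-level year folders get zipped (the variables are the same as in the script above):

# Without -Recurse, only the immediate children of $sourceDir (the year folders) are enumerated
Get-ChildItem $sourceDir | ForEach-Object {
    if ($_.PSIsContainer) {
        $zipFilePath = Join-Path -Path $destDir -ChildPath ($_.Name + ".zip")
        Compress-Archive -Path $_.FullName -DestinationPath $zipFilePath
    }
}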
Summary
The whole process was very pleasant. I basically did not write a single line of code, only made a few changes and fine-tuned some details, and the whole thing worked. It took less than 20 minutes in total.
Cursor is also very simple to use; you can find instructions by searching online. Press Ctrl + L to start the magic!