Having fun with ChatGPT4 to build an archive of IRS PDF documents. Curious about XetHub default deduplication over PDFs. 15+% feels pretty good!
Archive of IRS PDF documents - ~2700 documents
Using ChatGPT4 to write the scraper
Initial prompt
Using the following prompt:
Write python code to scrape the URL "https://www.irs.gov/downloads/irs-pdf", include support for pagination by adding "?page=" to the query string of the URL from page=0 through page=54. For each page, download any linked PDF documents.
got the following...
To scrape the URL and download PDF files, you can use the requests library to fetch the HTML content and the BeautifulSoup library to parse the HTML and extract the required information. First, install these libraries if you haven't already:

```
pip install requests beautifulsoup4
```
Here's the Python code to scrape the URL and download the linked PDF documents:
```python
import os
import requests
from bs4 import BeautifulSoup

base_url = "https://www.irs.gov/downloads/irs-pdf"
pdf_download_folder = "pdf_downloads"

if not os.path.exists(pdf_download_folder):
    os.makedirs(pdf_download_folder)

def download_pdf(pdf_url, pdf_name):
    response = requests.get(pdf_url)
    with open(os.path.join(pdf_download_folder, pdf_name), "wb") as pdf_file:
        pdf_file.write(response.content)

def scrape_page(page_number):
    page_url = f"{base_url}?page={page_number}"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    pdf_links = soup.find_all("a", class_="pdf")

    for link in pdf_links:
        pdf_name = link["href"].split("/")[-1]
        pdf_url = f"https://www.irs.gov{link['href']}"
        print(f"Downloading {pdf_name}...")
        download_pdf(pdf_url, pdf_name)

for page_number in range(55):
    print(f"Scraping page {page_number}...")
    scrape_page(page_number)

print("All PDFs have been downloaded.")
```
This script will download the linked PDF documents for each page in the range 0-54 and save them in a folder named "pdf_downloads". Please note that web scraping may be against the terms of service of some websites. Make sure to review the target website's terms of service and respect any guidelines or limitations they may have.
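Worth noting (my addition, not ChatGPT's): a small delay between listing-page requests keeps the load on irs.gov reasonable. A minimal sketch, reusing the `scrape_page` function above, with an arbitrary 1-second pause:

```python
import time

for page_number in range(55):
    print(f"Scraping page {page_number}...")
    scrape_page(page_number)
    time.sleep(1)  # pause between listing pages to keep the request rate low
```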
Fix some bugs
This wasn't right, so added:
This isn't right. The soup.find_all() needs to be fixed. It needs to not look for a class="pdf" and instead look to see if the href ends in ".pdf"
and now got...
I apologize for the confusion. Here's the updated code that checks if the href attribute ends with ".pdf":
```python
import os
import requests
from bs4 import BeautifulSoup

base_url = "https://www.irs.gov/downloads/irs-pdf"
pdf_download_folder = "pdf_downloads"

if not os.path.exists(pdf_download_folder):
    os.makedirs(pdf_download_folder)

def download_pdf(pdf_url, pdf_name):
    response = requests.get(pdf_url)
    with open(os.path.join(pdf_download_folder, pdf_name), "wb") as pdf_file:
        pdf_file.write(response.content)

def scrape_page(page_number):
    page_url = f"{base_url}?page={page_number}"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    all_links = soup.find_all("a", href=True)

    for link in all_links:
        href = link["href"]
        if href.lower().endswith(".pdf"):
            pdf_name = href.split("/")[-1]
            pdf_url = f"https://www.irs.gov{href}"
            print(f"Downloading {pdf_name}...")
            download_pdf(pdf_url, pdf_name)

for page_number in range(55):
    print(f"Scraping page {page_number}...")
    scrape_page(page_number)

print("All PDFs have been downloaded.")
```
This version of the script will now look for all the links with an href attribute, check if the href ends with ".pdf" (case-insensitive), and download the corresponding PDF files.
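One caveat (mine, not ChatGPT's): hard-coding the `https://www.irs.gov` prefix only works when the hrefs are site-relative. A more robust sketch, assuming the same `base_url`, `download_pdf`, and BeautifulSoup setup as above, resolves links with `urllib.parse.urljoin` so both relative and absolute hrefs work:

```python
from urllib.parse import urljoin

def scrape_page(page_number):
    page_url = f"{base_url}?page={page_number}"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")

    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.lower().endswith(".pdf"):
            # urljoin resolves relative hrefs against the page URL
            # and leaves absolute URLs untouched
            pdf_url = urljoin(page_url, href)
            pdf_name = pdf_url.split("/")[-1]
            print(f"Downloading {pdf_name}...")
            download_pdf(pdf_url, pdf_name)
```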
Parallelize with scrapy
Then got tired of single-threaded scraping, so asked ChatGPT to use scrapy to parallelize.
Can you rewrite this scraper using the Python scrapy library?
and got...
Certainly! First, you need to install the Scrapy library if you haven't already:

```
pip install scrapy
```
Then, create a new Scrapy project:

```
scrapy startproject irs_pdf_scraper
```

This command will create a new directory named irs_pdf_scraper. Now, navigate to the irs_pdf_scraper/spiders directory and create a new spider called irs_pdf_spider.py with the following content:
```python
import scrapy

class IrsPdfSpider(scrapy.Spider):
    name = "irs_pdf_spider"
    start_urls = [f"https://www.irs.gov/downloads/irs-pdf?page={i}" for i in range(55)]

    def parse(self, response):
        for link in response.css("a[href$='.pdf'], a[href$='.PDF']"):
            pdf_url = response.urljoin(link.attrib['href'])
            pdf_name = pdf_url.split("/")[-1]
            yield scrapy.Request(pdf_url, callback=self.save_pdf, meta={'pdf_name': pdf_name})

    def save_pdf(self, response):
        pdf_name = response.meta['pdf_name']
        with open(f"pdf_downloads/{pdf_name}", "wb") as pdf_file:
            pdf_file.write(response.body)
```
This spider uses Scrapy to scrape the URLs and download the linked PDF documents.

Finally, create a folder named pdf_downloads in the root directory of the project to store the downloaded PDF files. You can now run the spider using the following command from the root directory of the project:

```
scrapy crawl irs_pdf_spider
```

This will scrape the URLs and download the linked PDF documents as specified. Note that Scrapy is an asynchronous scraping framework, so it may be faster than the previous solution that used requests and BeautifulSoup.
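The speedup comes from Scrapy's built-in concurrency, which can be tuned per spider via `custom_settings`. The values below are illustrative (my addition, not part of ChatGPT's answer); Scrapy's defaults already fetch pages concurrently, this just makes the knobs explicit:

```python
import scrapy

class IrsPdfSpider(scrapy.Spider):
    name = "irs_pdf_spider"

    # Illustrative values; AutoThrottle backs off automatically if responses slow down.
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY": 0.25,        # seconds between requests to the same domain
        "AUTOTHROTTLE_ENABLED": True,
    }

    # start_urls, parse, and save_pdf stay exactly as in the spider above.
```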
Dataset
2694 PDF documents
Total size on disk: 1.8G
Deduplication - 15.7% smaller!
1.73 GiB added, Deduped to 1.46 GiB. Ratio: 84.3%
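A quick sanity check on those numbers (using the rounded GiB figures, which is why the percentages land a hair off the reported 84.3% / 15.7%):

```python
added = 1.73   # GiB of PDF data added to the repo
stored = 1.46  # GiB actually stored after XetHub deduplication

ratio = stored / added   # ~0.844 with these rounded inputs
savings = 1 - ratio      # ~0.156, i.e. roughly 15-16% smaller
print(f"dedupe ratio ~ {ratio:.1%}, savings ~ {savings:.1%}")
```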