
Having fun with ChatGPT4 to build an archive of IRS PDF documents. Curious about XetHub default deduplication over PDFs. 15+% feels pretty good!


Archive of IRS PDF documents

Using ChatGPT4 to write the scraper

Initial prompt

Using the following prompt:

Write python code to scrape the URL "https://www.irs.gov/downloads/irs-pdf", include support for pagination by adding "?page=" to the query string of the URL from page=0 through page=54. For each page, download any linked PDF documents.

got the following...

To scrape the URL and download PDF files, you can use the requests library to fetch the HTML content and the BeautifulSoup library to parse the HTML and extract the required information. First, install these libraries if you haven't already:

pip install requests beautifulsoup4

Here's the Python code to scrape the URL and download the linked PDF documents:

import os
import requests
from bs4 import BeautifulSoup

base_url = "https://www.irs.gov/downloads/irs-pdf"
pdf_download_folder = "pdf_downloads"

if not os.path.exists(pdf_download_folder):
    os.makedirs(pdf_download_folder)

def download_pdf(pdf_url, pdf_name):
    response = requests.get(pdf_url)
    with open(os.path.join(pdf_download_folder, pdf_name), "wb") as pdf_file:
        pdf_file.write(response.content)

def scrape_page(page_number):
    page_url = f"{base_url}?page={page_number}"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    pdf_links = soup.find_all("a", class_="pdf")

    for link in pdf_links:
        pdf_name = link["href"].split("/")[-1]
        pdf_url = f"https://www.irs.gov{link['href']}"
        print(f"Downloading {pdf_name}...")
        download_pdf(pdf_url, pdf_name)

for page_number in range(55):
    print(f"Scraping page {page_number}...")
    scrape_page(page_number)

print("All PDFs have been downloaded.")

This script will download the linked PDF documents for each page in the range 0-54 and save them in a folder named "pdf_downloads". Please note that web scraping may be against the terms of service of some websites. Make sure to review the target website's terms of service and respect any guidelines or limitations they may have.

Fix some bugs

This wasn't right (the soup.find_all() filter on class="pdf" was the problem), so I added:

This isn't right. The soup.find_all() needs to be fixed. It needs to not look for a class="pdf" and instead look to see if the href ends in ".pdf"

and now got...

I apologize for the confusion. Here's the updated code that checks if the href attribute ends with ".pdf":

import os
import requests
from bs4 import BeautifulSoup

base_url = "https://www.irs.gov/downloads/irs-pdf"
pdf_download_folder = "pdf_downloads"

if not os.path.exists(pdf_download_folder):
    os.makedirs(pdf_download_folder)

def download_pdf(pdf_url, pdf_name):
    response = requests.get(pdf_url)
    with open(os.path.join(pdf_download_folder, pdf_name), "wb") as pdf_file:
        pdf_file.write(response.content)

def scrape_page(page_number):
    page_url = f"{base_url}?page={page_number}"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    all_links = soup.find_all("a", href=True)

    for link in all_links:
        href = link["href"]
        if href.lower().endswith(".pdf"):
            pdf_name = href.split("/")[-1]
            pdf_url = f"https://www.irs.gov{href}"
            print(f"Downloading {pdf_name}...")
            download_pdf(pdf_url, pdf_name)

for page_number in range(55):
    print(f"Scraping page {page_number}...")
    scrape_page(page_number)

print("All PDFs have been downloaded.")

This version of the script will now look for all the links with an href attribute, check if the href ends with ".pdf" (case-insensitive), and download the corresponding PDF files.
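One thing the version above still assumes is that every href is site-relative, since it prepends "https://www.irs.gov" to whatever it finds. Here is a minimal sketch of the same "href ends in .pdf" filter that resolves links with urljoin instead. It swaps BeautifulSoup for the standard-library HTMLParser so it runs with no installs; PdfLinkParser, extract_pdf_urls, and the sample HTML are hypothetical names and data made up for illustration, not part of the original scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .pdf (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # urljoin resolves relative hrefs and leaves absolute ones unchanged.
        absolute = urljoin(self.page_url, href)
        # Check only the URL path, so a query string cannot hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


def extract_pdf_urls(html, page_url):
    """Return every PDF link found in html, resolved against page_url."""
    parser = PdfLinkParser(page_url)
    parser.feed(html)
    return parser.pdf_urls


# Illustrative HTML, not copied from the real IRS page.
sample = """
<a href="/pub/irs-pdf/f1040.pdf">Form 1040</a>
<a href="https://www.irs.gov/pub/irs-pdf/P17.PDF">Pub 17</a>
<a href="/forms-instructions">Not a PDF</a>
"""

for url in extract_pdf_urls(sample, "https://www.irs.gov/downloads/irs-pdf?page=0"):
    print(url)
```

In the requests script, the list returned by extract_pdf_urls(response.text, page_url) could feed download_pdf directly.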

Parallelize with scrapy

Then I got tired of single-threaded scraping, so I asked ChatGPT to use scrapy to parallelize.

Can you rewrite this scraper using the Python scrapy library?

and got...

Certainly! First, you need to install the Scrapy library if you haven't already:

pip install scrapy

Then, create a new Scrapy project:

scrapy startproject irs_pdf_scraper

This command will create a new directory named irs_pdf_scraper. Now, navigate to the irs_pdf_scraper/spiders directory and create a new spider called irs_pdf_spider.py with the following content:

import scrapy

class IrsPdfSpider(scrapy.Spider):
    name = "irs_pdf_spider"
    start_urls = [f"https://www.irs.gov/downloads/irs-pdf?page={i}" for i in range(55)]

    def parse(self, response):
        for link in response.css("a[href$='.pdf'], a[href$='.PDF']"):
            pdf_url = response.urljoin(link.attrib['href'])
            pdf_name = pdf_url.split("/")[-1]
            yield scrapy.Request(pdf_url, callback=self.save_pdf, meta={'pdf_name': pdf_name})

    def save_pdf(self, response):
        pdf_name = response.meta['pdf_name']
        with open(f"pdf_downloads/{pdf_name}", "wb") as pdf_file:
            pdf_file.write(response.body)

This spider uses Scrapy to scrape the URLs and download the linked PDF documents.

Finally, create a folder named pdf_downloads in the root directory of the project to store the downloaded PDF files. You can now run the spider using the following command from the root directory of the project:

scrapy crawl irs_pdf_spider

This will scrape the URLs and download the linked PDF documents as specified. Note that Scrapy is an asynchronous scraping framework, so it may be faster than the previous solution that used requests and BeautifulSoup.
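Both scrapers fetch every PDF again on each run, because nothing checks what is already on disk. Below is a rough sketch of a skip-if-present check that either version could call before requesting a file. destination_for and needs_download are hypothetical helpers, the URL is illustrative, and the "non-empty file means done" rule is an assumption (it will not catch a truncated download).

```python
import os
import tempfile
from urllib.parse import urlparse


def destination_for(pdf_url, folder):
    """Map a PDF URL to a local path inside folder (hypothetical helper)."""
    name = os.path.basename(urlparse(pdf_url).path)
    return os.path.join(folder, name)


def needs_download(pdf_url, folder):
    """Return True when no non-empty file exists yet for this URL."""
    path = destination_for(pdf_url, folder)
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)


# Demonstration against a temporary folder, with no network access.
with tempfile.TemporaryDirectory() as folder:
    url = "https://www.irs.gov/pub/irs-pdf/f1040.pdf"  # illustrative URL
    print(needs_download(url, folder))  # nothing saved yet
    with open(destination_for(url, folder), "wb") as pdf_file:
        pdf_file.write(b"%PDF-1.7")
    print(needs_download(url, folder))  # file is now present
```

In the spider, parse could call needs_download(pdf_url, "pdf_downloads") before yielding the request, and os.makedirs("pdf_downloads", exist_ok=True) at startup would remove the need to create the folder by hand.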

Dataset

2694 PDF documents

Total size on disk: 1.8G

Deduplication - 15.7% smaller!

1.73 GiB added, Deduped to 1.46 GiB. Ratio: 84.3%
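As a quick sanity check on those numbers, the ratio is just the deduped size divided by the added size, and the savings are one minus that. The sketch below uses only the rounded GiB figures quoted above, so it lands within about a tenth of a percentage point of the reported 84.3% and 15.7%, which were presumably computed from exact byte counts (an assumption on my part).

```python
# Rounded sizes as reported above, in GiB.
added_gib = 1.73
deduped_gib = 1.46

ratio = deduped_gib / added_gib  # fraction of the data still stored
savings = 1 - ratio              # fraction removed by deduplication

print(f"ratio: {ratio:.1%}, savings: {savings:.1%}")
```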
