# Air Quality ETL Pipeline Example
## Description
This project is an example of using XetHub to store and run an ETL (Extract, Transform, Load) pipeline. The pipeline processes air quality measurements retrieved from the OpenAQ API and displays the data interactively.
## Data Sources
- OpenAQ - an API for aggregating and sharing open air quality data from around the world. The current data set uses a JSON aggregation of the data through 2022.
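As a minimal sketch of pulling data from this source, the snippet below parses measurements in the shape returned by the public OpenAQ v2 `measurements` endpoint. The endpoint URL and field names are assumptions based on the public OpenAQ API documentation, not taken from this repo's code.

```python
import json

# Assumed public endpoint (OpenAQ v2); a live fetch would look like:
#   data = json.load(urllib.request.urlopen(f"{API_URL}?parameter=pm25&limit=100"))
API_URL = "https://api.openaq.org/v2/measurements"

def parse_measurements(payload: dict) -> list[dict]:
    """Flatten an OpenAQ-style response into simple flat records."""
    records = []
    for result in payload.get("results", []):
        records.append({
            "location": result.get("location"),
            "parameter": result.get("parameter"),
            "value": result.get("value"),
            "unit": result.get("unit"),
            "date": (result.get("date") or {}).get("utc"),
        })
    return records

# Example with a canned response in the assumed v2 shape:
sample = {"results": [{"location": "Seattle-10th", "parameter": "pm25",
                       "value": 7.1, "unit": "µg/m³",
                       "date": {"utc": "2022-06-01T00:00:00Z"}}]}
print(parse_measurements(sample))
```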
## Parameters
- pm1 - PM1 ➡️ Particulate matter less than 1 micrometer in diameter mass concentration, µg/m³
- pm10 - PM10 ➡️ Particulate matter less than 10 micrometers in diameter mass concentration, µg/m³
- pm25 - PM2.5 ➡️ Particulate matter less than 2.5 micrometers in diameter mass concentration, µg/m³
- um003 - PM0.3 ➡️ count, particles/cm³
- um005 - PM0.5 ➡️ count, particles/cm³
- um010 - PM1 ➡️ count, particles/cm³
- um025 - PM2.5 ➡️ count, particles/cm³
- um050 - PM5.0 ➡️ count, particles/cm³
- um100 - PM10 ➡️ count, particles/cm³
- pressure ➡️ Atmospheric or barometric pressure, hPa
- temperature ➡️ °C
- humidity ➡️ %
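The parameter codes above can be kept as a small lookup table for labeling columns or validating incoming measurements. This is an illustrative sketch (the `label` helper is hypothetical, not part of the repo):

```python
# Parameter metadata from the list above: code -> (description, unit).
PARAMETERS = {
    "pm1":         ("PM1 mass concentration", "µg/m³"),
    "pm10":        ("PM10 mass concentration", "µg/m³"),
    "pm25":        ("PM2.5 mass concentration", "µg/m³"),
    "um003":       ("PM0.3 particle count", "particles/cm³"),
    "um005":       ("PM0.5 particle count", "particles/cm³"),
    "um010":       ("PM1 particle count", "particles/cm³"),
    "um025":       ("PM2.5 particle count", "particles/cm³"),
    "um050":       ("PM5.0 particle count", "particles/cm³"),
    "um100":       ("PM10 particle count", "particles/cm³"),
    "pressure":    ("Atmospheric pressure", "hPa"),
    "temperature": ("Temperature", "°C"),
    "humidity":    ("Relative humidity", "%"),
}

def label(parameter: str, value: float) -> str:
    """Render a measurement with its description and unit."""
    description, unit = PARAMETERS[parameter]
    return f"{description}: {value} {unit}"

print(label("pm25", 7.1))  # PM2.5 mass concentration: 7.1 µg/m³
```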
## Pipeline
- Individual units upload the current set of measurements to `data/raw_upload/<country>/<location>/`. See `data/scripts/upload_new_data.sh` for an example of doing this.
- XetHub Actions: runs the ETL process every day (`src/etl/aggregate.py`), which scrapes all the raw uploads and creates an aggregated CSV file with all previous measurements for a site.
- Capsules: a capsule is run using Streamlit (`src/capsule/app.py`) to display a data exploration app for the repo.
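The daily aggregation step can be sketched as follows. This is a simplified stand-in for `src/etl/aggregate.py`: the raw-file layout (one JSON list of records per upload) and the output naming are assumptions, not the repo's actual implementation.

```python
import csv
import json
from pathlib import Path

def aggregate(raw_root: Path, out_dir: Path) -> None:
    """Scrape every raw JSON upload under <raw_root>/<country>/<location>/
    and write one aggregated CSV per site."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for location_dir in sorted(raw_root.glob("*/*")):  # <country>/<location>
        rows = []
        for raw_file in sorted(location_dir.glob("*.json")):
            rows.extend(json.loads(raw_file.read_text()))
        if not rows:
            continue
        site = f"{location_dir.parent.name}_{location_dir.name}"
        with open(out_dir / f"{site}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
```

Running this on a tree like `data/raw_upload/US/seattle/*.json` would produce `US_seattle.csv` containing every record uploaded for that site.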