
Update and Review

When you created your XetHub account, a Tutorial repository containing Seattle air quality data from 2022 was created for you. Follow along to integrate new data into your repository and update the charts and code that add context to the data review process.

Clone your tutorial

Begin by cloning your Tutorial repository. Click to copy the clone command from your tutorial header:

Screenshot of click to clone

Paste this command into your terminal and run it.

Create a branch and add data

Once a repository is cloned to your local machine, you can interact with it just as you would any Git repository. In the background, XetHub uses deduplication algorithms to keep your repository as lightweight as possible, while preserving all history. Each commit creates a snapshot of your full Xet repository, making it easy to see how your code and data have evolved over time.

Since we're adding data, let's check out a new branch so that it's easy to roll back the changes if the new data has any issues.

cd Tutorial
git checkout -b add-August-data

Our repository currently has Seattle air quality data from January to July 2022. Let's download and add August's air quality data, then run a script to update the aggregate annual data file:

curl -O https://xethub.com/XetHub/assets/raw/branch/main/seattle_2022_08.csv
mv seattle_2022_08.csv monthly_air_quality_data/
python aggregate_data.py

Note: Don't have Python? Install it, or manually append the data rows from monthly_air_quality_data/seattle_2022_08.csv to annual_air_quality_data/seattle_2022.csv, leaving out the header row.
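
The manual append described in the note can be sketched in a few lines of Python. This is not the tutorial's aggregate_data.py, whose contents are not shown here. The helper name append_without_header is hypothetical, and the sketch assumes both files share the same header and column order.

```python
import csv
from pathlib import Path


def append_without_header(monthly_path, annual_path):
    """Append the data rows of monthly_path to annual_path, skipping the header.

    Hypothetical helper: assumes both CSV files use the same columns
    in the same order. Returns the number of rows appended.
    """
    with open(monthly_path, newline="") as src:
        rows = list(csv.reader(src))
    data_rows = rows[1:]  # everything after the header row

    annual = Path(annual_path)
    # If the annual file does not end with a newline, add one first so the
    # first appended row does not get glued onto the last existing row.
    needs_newline = (
        annual.exists()
        and annual.stat().st_size > 0
        and not annual.read_bytes().endswith(b"\n")
    )
    with open(annual, "a", newline="") as dst:
        if needs_newline:
            dst.write("\n")
        csv.writer(dst, lineterminator="\n").writerows(data_rows)
    return len(data_rows)
```

From the Tutorial directory, calling append_without_header("monthly_air_quality_data/seattle_2022_08.csv", "annual_air_quality_data/seattle_2022.csv") would perform the append. Run it only once, since a second run would add the August rows again.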

Update visualizations

From the XetHub UI, you can see that the Tutorial repository displays two custom visualizations.

Screenshot of Tutorial custom visualizations

Each of these is generated from a custom visualization specification file. To include August's data, update both JSON specification files (aggregate_o3.vl.json and aggregate_pm25.vl.json). In each file, find the last layer:

{
  "data": {"url": "monthly_air_quality_data/seattle_2022_07.csv"},
  "mark": {"type": "line", "strokeDash": [2,2]}
}

Replace it with the following two layers, which redraw the July data as a solid line and add the August data as a dashed line:

{
  "data": {"url": "monthly_air_quality_data/seattle_2022_07.csv"},
  "mark": "line"
},
{
  "data": {"url": "monthly_air_quality_data/seattle_2022_08.csv"},
  "mark": {"type": "line", "strokeDash": [2,2]}
}
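
If you would rather script this edit than make it by hand in both files, the following is a minimal sketch. It assumes each specification keeps its layers in a top-level "layer" list, which the tutorial's mention of "the last layer" suggests but does not confirm. The function name add_month_layer is hypothetical.

```python
import json


def add_month_layer(spec, new_csv_url):
    """Turn the last (dashed) layer solid and append a dashed layer for new data.

    Assumption: `spec` is a dict with a top-level "layer" list, and the
    last entry is the most recent month's layer.
    """
    layers = spec["layer"]
    # Keep the last layer's data source, but draw it as a plain solid line.
    solid = {**layers[-1], "mark": "line"}
    # The newest month is drawn dashed, matching the style shown above.
    dashed = {
        "data": {"url": new_csv_url},
        "mark": {"type": "line", "strokeDash": [2, 2]},
    }
    spec["layer"] = layers[:-1] + [solid, dashed]
    return spec
```

To apply it, load a file with json.load, pass the result to add_month_layer along with "monthly_air_quality_data/seattle_2022_08.csv", and write it back with json.dump. Rewriting the file this way may change its whitespace, so expect a larger diff than the hand edit produces.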

Stage, commit, and push your changes

Now the new data has been added, aggregated, and charted. Stage your changes and check the Git status:

git add .
git status

Check the output to make sure that the expected files are staged.

$ git status
On branch add-August-data
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   aggregate_o3.vl.json
        modified:   aggregate_pm25.vl.json
        modified:   annual_air_quality_data/seattle_2022.csv
        new file:   monthly_air_quality_data/seattle_2022_08.csv

Commit and push your changes using familiar Git syntax.

git commit -m "Add August 2022 data"
git push -u origin add-August-data

Review changes

Navigate to the link in the output of the git push command above. The link will look something like this:

  remote:   https://xethub.com/<username>/Tutorial/compare/main...<username>:add-August-data

From this view, scroll down to see before and after views of your changes. For CSV files, XetHub provides summary statistics to help you understand how your data has changed. In this case, the histograms show at a glance that the maximum nitrogen dioxide (NO2) measurement is now 0.07, almost twice the previous 2022 high of 0.04. This could warrant some investigation.

Screenshot of CSV summary view differences
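
To follow up on a value like this locally, you could compute a column's maximum yourself. The sketch below is not how XetHub produces its summary statistics. The helper name column_max is hypothetical, and the column header you pass in (for example "NO2") is an assumption to check against the first row of your CSV files.

```python
import csv


def column_max(csv_path, column):
    """Return the largest numeric value in `column`, or None if there is none.

    Blank, missing, or non-numeric cells are skipped.
    """
    best = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                value = float(row[column])
            except (KeyError, TypeError, ValueError):
                continue  # column absent, row too short, or cell not a number
            if best is None or value > best:
                best = value
    return best
```

Calling column_max on each file in monthly_air_quality_data with the same column name shows which month introduced the new high.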

Click New Pull Request at the top of the page to start a formal review. Add any comments you want to associate with the pull request, then use the UI to merge your changes.

Summary

With this workflow, data owners can create pull requests to preview new data and add collaborators for extra review, leveraging custom visualizations to easily see if the data looks reasonable prior to merging in changes. Even better, every change in data and code is tracked and easily recoverable in case a regression is found in the future.