Hello Flickr

Flex your XetHub muscles with these quick examples of working with our version of the Flickr30k dataset, a classic 4.2GB collection of over 30,000 images.


Stream to read repositories

Clicking through a dataset in the browser can be tedious, especially when you're trying to decide whether the dataset would be useful to you. The alternative is downloading the files, but waiting for several gigabytes to transfer isn't a good use of time or disk space either.

Enter mount as the perfect tool for read-only exploration. Mount streams XetHub repositories to your local machine in seconds and loads them just-in-time, without long download delays or taking up disk space. You can access mounted files with any application on your machine.

  1. Mount the repository to a folder called Flickr:

    git xet mount xet://XetHub/Flickr30k/main Flickr

    You can find this command for any repository under its Access button.

    Note: Mount not found on your machine? Try installing an NFS client to resolve the issue.

  2. Confirm a successful mount by checking the last line for something like:

    Mount complete in 3.147795s
  3. Open Flickr with any file browser of your choice and find 131090759.jpg, which shows how we all feel when working with large files! Play around some more to see the range of images.

  4. Unmount the folder when you're done.

    umount Flickr

Mounting is useful when speed matters and editing does not, which makes it a good fit for fast data loads into your compute container.
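
While the repository is mounted (that is, before the unmount step), the Flickr folder behaves like any other directory, so standard shell tools work on it. As a minimal sketch, the hypothetical helper count_images below (not part of git xet) counts the .jpg files under a folder. It assumes the images carry a .jpg extension, as 131090759.jpg does.

```shell
# Hypothetical helper, not part of git xet: count the .jpg files under a directory.
count_images() {
  local dir="$1"
  # -type f skips directories; tr strips the padding some wc implementations add.
  find "$dir" -type f -name '*.jpg' | wc -l | tr -d ' '
}

# Example, assuming the repository is mounted at ./Flickr:
#   count_images Flickr
```

The same pattern works with any other read-only tool you already use, since mounted files look like ordinary files to applications on your machine.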

Duplicate or clone to develop

When you need to do more than just read, duplicate and/or clone a repository to work with it:

  • Duplicating a repository creates your own copy of an existing repository that is independent of the original repository. This is called a "detached fork" in Git-speak and is perfect for picking up someone else's work to build upon. Once you duplicate a repository, clone it locally to make edits. Any changes you make to a duplicated repository will not affect the original.
  • Cloning a repository makes a local copy of an existing repository. Changes you make in a clone stay local until you "push" them back to the repository you cloned from, which updates the shared codebase. XetHub supports both full clones and lazy clones: a full clone downloads every file in the repository, while a lazy clone downloads only pointers, which helps when you only want to work with a subset of files.

Our public Flickr30k repository has some problematic annotations. Let's duplicate it and fix them!

  1. Go to the Flickr30k repository and click Duplicate:

    Screenshot of Flickr30k's duplicate button

    Wait a few seconds for your repository to be created.

  2. All annotations in this dataset are in results.csv. Let's take a look by clicking on the file name from the XetHub UI. Notice that the top comment mentions broken links!

    Screenshot of results.csv summary

  3. Lazy cloning a repository keeps all large files as pointer files, only fully downloading a file when explicitly asked to. Let's lazy clone our version of Flickr30k since we mostly just want to edit a single file. Copy the command from the Access dropdown, clicking the lazy clone option:

    Screenshot of lazy clone selection

    And run it from your command line:

    git xet clone --lazy xet@xethub.com:<username>/Flickr30k.git
    cd Flickr30k
  4. Create a branch to test on:

    git checkout -b test
  5. Materialize results.csv (that is, fetch the full file in place of its pointer) so we can find instances of "The image links are broken" and fix them.

    git xet materialize results.csv
  6. Find instances of the phrase "The image links are broken" within results.csv, and materialize the relevant images so you can create and add your own annotation. The first hit is 49/4901396689.jpg. Prepend the directory name flickr30k_images and fetch the file:

    git xet materialize flickr30k_images/49/4901396689.jpg
  7. Open the image and add your own annotation to replace that instance of "The image links are broken" in results.csv. Repeat for the other instances if you'd like!

  8. Commit and push your changes.

    git commit -am "Improved annotations in results.csv"
    git push -u origin test
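
Finding the affected images in steps 6 and 7 can be partly scripted with standard tools. The sketch below is an illustration under stated assumptions, not XetHub tooling: broken_image_paths is a hypothetical helper, and its pattern assumes every image is named like the one hit shown above (digits, a slash, digits, then .jpg) and that the name appears on the same line as the phrase.

```shell
PHRASE="The image links are broken"

# Hypothetical helper: print the repository-relative path of every image
# whose annotation line in the given CSV contains PHRASE.
broken_image_paths() {
  local csv="$1"
  # 1) keep only lines containing the phrase (-F: match it literally)
  # 2) extract names shaped like 49/4901396689.jpg (assumed naming pattern)
  # 3) drop duplicates, since one image may have several annotation lines
  # 4) prepend the image directory name
  grep -F -- "$PHRASE" "$csv" \
    | grep -oE '[0-9]+/[0-9]+\.jpg' \
    | sort -u \
    | sed 's|^|flickr30k_images/|'
}

# Example, run from the root of your lazy clone after materializing results.csv:
#   broken_image_paths results.csv | while read -r path; do
#     git xet materialize "$path"
#   done
```

Writing the replacement annotations still requires opening each image, so the helper only saves the lookup and fetch steps.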

Review and release

Share your work with others, or keep a record of your personal changes, through pull requests.

  1. Start a pull request by clicking the link in the output of the push command above, which looks something like:

    https://xethub.com/<username>/Flickr30k/compare/main...test
  2. Notice that the most recent commit is the one that you just made, and check that the count of "The image links are broken" has decreased.

  3. If all looks well, merge your changes through the pull request flow in the UI or through the CLI with:

    git checkout main
    git merge test
    git push

Now you can use this cleaner set of annotations in your next training job. While you can see where the repository was originally duplicated from for provenance, there is no risk of accidentally pushing your changes back to the original repository; only your copy of the files has been changed.
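
The count check from the review step can also be done locally. As a minimal sketch, the hypothetical helper count_phrase below wraps grep to report how many lines of a file contain a phrase. It counts matching lines rather than individual occurrences, and it assumes the file has been materialized so that it holds the real CSV content rather than a pointer.

```shell
# Hypothetical helper: print how many lines of a file contain a literal phrase.
count_phrase() {
  local phrase="$1" file="$2"
  # grep -c prints the number of matching lines. It exits non-zero when
  # that number is 0, so the status is masked to keep scripts running.
  grep -c -F -- "$phrase" "$file" || true
}

# Example: run once before editing and once after, then compare the numbers.
#   count_phrase "The image links are broken" results.csv
```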


Next steps

Now that you've tried some basic XetHub functionality, choose your own adventure:

  • 🐍 Love working with Python? Check out advanced access patterns for more flexible development.
  • 🖼️ Big things can be hard to grok. Try our visualizations to bring context to your files.
  • 🔎 Dive into XetHub to explore our featured repositories and starter projects.