
Stream Files

One of XetHub's most powerful features is the ability to stream large files in seconds. This read-only access can be used with our file-system mount and Python APIs, with the following benefits:

  • Access files larger than your local disk without needing to write custom I/O code.
  • Contents are streamed and cached on demand for faster access on subsequent reads.
  • Explore huge repositories locally without the need to wait for slow downloads.
  • Mount and Python APIs can be used from both your local machine and the cloud, making it easy to move from local development to production without extra code changes.

What's the difference between using Xet mount and Python APIs?

While the performance of mount and PyXet APIs is equivalent, the interaction patterns are different. Use mount when you want to access your data with the CLI or local tools; use PyXet if you're working in Python.

Xet mount through the CLI provides a file-system view of your repository, meaning that you can access mounted files with any tools on your machine. Once the mount completes (usually in a few seconds), you can browse and work with your files as if they were local.

PyXet APIs provide an instant Pythonic way of working with your repository, with support for fsspec interaction patterns. For notebooks and Python code, this is a flexible way to use files without having to first download them.
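As a rough illustration of the fsspec-style pattern, the sketch below reads the first bytes of a file through any filesystem object that exposes an fsspec-like `open`. It assumes that `pyxet.XetFS()` is the fsspec-compatible entry point, as described in the PyXet documentation; the helper names `read_head` and `open_xet_filesystem` are hypothetical and not part of PyXet.

```python
def read_head(fs, path, num_bytes=1024):
    """Read the first num_bytes of path from any fsspec-style filesystem."""
    # Only the requested bytes are asked for; the file is not downloaded first.
    with fs.open(path, "rb") as f:
        return f.read(num_bytes)


def open_xet_filesystem():
    """Return a XetHub filesystem object (assumes PyXet is installed)."""
    import pyxet  # imported lazily so read_head works with any fsspec filesystem

    # Assumption: pyxet.XetFS() returns an fsspec-compatible filesystem.
    return pyxet.XetFS()
```

With PyXet installed, a call might look like `read_head(open_xet_filesystem(), "xet://user/repo/branch/data.csv")`, where the path is a placeholder for a file in your own repository.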

When should I stream?

If you just want to explore or read data, you should stream; it's the fastest way to access your data! To modify a repository, follow the clone workflow.

What are some specific use cases for streaming?

  • Easy access while conducting exploratory data analysis
  • Fast data access on the cloud for model training, especially for distributed training jobs
  • Read in multiple versions of a repository and use Python or local tools for detailed change analysis
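For the last use case, one possible approach is to mount two versions side by side and compare the resulting directories with the Python standard library. The sketch below assumes both versions are already mounted as ordinary directories; `summarize_changes` is a hypothetical helper, and it only inspects the top level of each directory.

```python
import filecmp


def summarize_changes(old_dir, new_dir):
    """Compare the top level of two directories, e.g. two mounted branches."""
    cmp = filecmp.dircmp(old_dir, new_dir)
    return {
        "removed": sorted(cmp.left_only),   # present only in old_dir
        "added": sorted(cmp.right_only),    # present only in new_dir
        # Files with identical stat signatures count as equal; files whose
        # sizes differ count as changed; otherwise contents are compared.
        "changed": sorted(cmp.diff_files),
    }
```

A call such as `summarize_changes("repo-main", "repo-experiment")` (placeholder directory names) returns three sorted lists of file names. Recursing into `cmp.subdirs` would extend this to nested folders.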

Stream a repository

Note: Make sure that you have installed PyXet prior to running these steps.

  1. Navigate to the repository you want to access from the XetHub UI and click the Access button. Under the Stream tab, choose either CLI or Python to find the command that works best for you.

    [Screenshot: Access dropdown in the XetHub UI]

  2. On your local or cloud machine, run the command.

Xet mount

By default, the main branch of the repository is mounted as a directory named after the repository, inside the directory where the command was run.

If the directory contains Parquet files, SQLite files, or other file types that tools can read with efficient random access, disabling prefetch is recommended for optimal performance:

git xet mount xet://user/repo/branch --prefetch 0
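To make "random access" concrete, the sketch below shows the kind of read such tools perform: jumping to an offset and requesting only a small byte range. The assumption here, which the text above does not spell out, is that prefetching would fetch neighbouring data that this kind of reader never asks for. `read_range` is a hypothetical helper that works on any local or mounted path.

```python
def read_range(path, offset, length):
    """Seek to offset and request only length bytes from the file at path."""
    with open(path, "rb") as f:
        f.seek(offset)          # jump directly to the region of interest
        return f.read(length)   # returns fewer bytes if the file ends early
```

Formats such as Parquet and SQLite are typically read this way, with many small reads at scattered offsets rather than one sequential pass.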

You can unmount from your CLI or file explorer:

  • From the CLI, use umount "path/repo"
  • From your file explorer, right-click the mounted object and select Eject

If an application is still accessing the mounted files when you try to unmount, the unmount or Eject command may fail. If this happens on macOS, you can force the unmount with diskutil umount force "path/repo".
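The unmount-then-force sequence can also be scripted. The following is a minimal sketch that uses only the two commands given above; the `unmount` function and its `run` parameter are hypothetical, and the forced fallback assumes a macOS machine where `diskutil` is available.

```python
import subprocess


def unmount(path, run=subprocess.run):
    """Try a normal unmount; fall back to a forced unmount if it fails."""
    result = run(["umount", path])
    if result.returncode == 0:
        return "unmounted"
    # Forced unmount as given in the text above; diskutil is macOS-only.
    forced = run(["diskutil", "umount", "force", path])
    return "forced" if forced.returncode == 0 else "failed"
```

The `run` parameter exists only so the command runner can be swapped out, for example in a dry run that logs commands instead of executing them.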

PyXet APIs

The default Access dropdown shows one way to access a Xet repository, but many other patterns are supported. See our PyXet documentation to learn more.