Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.

May 11, 2023

PyXet: A Python Library for Building Apps with XetHub

Yonatan Alexander


TL;DR

We have launched PyXet, a Python library for building apps with XetHub! PyXet helps developers iterate faster by storing their data, code, and ML models all in one place. Get started here (repo + tutorials). PyXet implements most of the pathlib and fsspec interfaces, works with pandas, and provides common command-line functions such as ls and cp. Join us and get involved to help shape the future of PyXet.

PyXet is at ‘alpha’ quality and is available on PyPI today. It currently supports Python 3.7+ on macOS and Linux. Get involved and join our Slack community.

Why PyXet?

Everyone is racing to develop generative AI apps, but the tools most developers use to build these applications are not optimized for large data sets — in fact, they don’t work very well at all. Developers have to flip back and forth between systems that hold code and data, introducing significant drag on efficiency, productivity, and accuracy.

A typical workflow today: many systems are involved, which drags down productivity.

Today's best practice is to split a solution's code, models, logs, and data across separate systems, and to encode each asset's environment and version in its address. Often, this means a naming convention in S3 like the following: s3://models-logs-data/<environment>/<version>/<date>/<file>.

That makes sense given what Git can handle and how blob stores are designed, but would you choose it if those constraints did not exist? What would a more natural way look like?
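To make the convention concrete, here is a minimal sketch of how such an address is typically assembled. The function name, the example values, and the ISO date format are assumptions for illustration; only the bucket name and the segment order come from the pattern above.

```python
from datetime import date


def build_s3_uri(environment: str, version: str, day: date, filename: str,
                 bucket: str = "models-logs-data") -> str:
    """Encode environment, version, and date into the object address.

    Hypothetical helper: the ISO date format is an assumption, since the
    pattern above only says <date>.
    """
    return f"s3://{bucket}/{environment}/{version}/{day.isoformat()}/{filename}"


# Every piece of metadata lives in the path itself, so each caller has to
# rebuild the address the same way to find the right file.
uri = build_s3_uri("staging", "v1.2.0", date(2023, 5, 11), "model.pkl")
print(uri)  # s3://models-logs-data/staging/v1.2.0/2023-05-11/model.pkl
```

Note that nothing enforces the convention: a typo in any segment silently produces a different, valid address.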

The same workflow with XetHub, where everyone collaborates in one place.

That seems simple enough.

Restore and audit databases the way we do with code, experiment with models the way we do with branches, and run tests and CI/CD at the project level as if the project were a local app. Much easier.

That’s why today, at PyData Conference 2023 in Seattle, we announced the launch of PyXet, an open-source Python library for building apps with XetHub!

PyXet will help developers iterate faster by letting them store their data, code, and ML models all in one place. Get started here.

Introducing PyXet

PyXet was built with developer productivity in mind. XetHub scales Git repositories to 1 TB, but we know that using the Git command line breaks your flow. With PyXet, you can work with your data the way you do today while staying in Python.

As an example, here is how you can read a file directly from a XetHub repo into a Pandas DataFrame:

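import pandas as pd
import pyxet  # importing pyxet makes the xet:// protocol available to pandas

# Read the CSV straight from the XetHub repo, with no separate download step
df = pd.read_csv('xet://xdssio/titanic/main/titanic.csv')
df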

# will produce something like:

Out[3]:
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]
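The address in the example above appears to follow a user/repo/branch/file layout. As a minimal sketch, assuming that layout holds, the pieces can be pulled apart with the standard library alone. The `XetPath` type and `parse_xet_url` function are hypothetical names for illustration and are not part of PyXet.

```python
from typing import NamedTuple


class XetPath(NamedTuple):
    """Hypothetical container for the parts of a xet:// address."""
    user: str
    repo: str
    branch: str
    path: str


def parse_xet_url(url: str) -> XetPath:
    """Split 'xet://<user>/<repo>/<branch>/<path>' into its parts.

    Assumes the branch name contains no '/' characters.
    """
    prefix = "xet://"
    if not url.startswith(prefix):
        raise ValueError(f"not a xet:// address: {url!r}")
    # Split at most three times so that nested file paths stay intact.
    segments = url[len(prefix):].split("/", 3)
    if len(segments) < 3 or not all(segments[:3]):
        raise ValueError(f"expected user/repo/branch in {url!r}")
    user, repo, branch = segments[:3]
    path = segments[3] if len(segments) == 4 else ""
    return XetPath(user, repo, branch, path)


parts = parse_xet_url("xet://xdssio/titanic/main/titanic.csv")
print(parts.branch)  # main
print(parts.path)    # titanic.csv
```

Under this reading, the branch is part of the address, so pointing the same read at another branch would only mean changing one path segment.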

PyXet implements most of pathlib and fsspec today. It’s easy to navigate, use, understand, and remember, enabling ML teams to ship better projects faster. As human time and machine time get more expensive, PyXet makes it simple to access and work with your data quickly and consistently, without wasting hours downloading and uploading data.

For more details on how to get started with PyXet, check out the documentation here.

What’s next?

We are planning to open source PyXet and extend its functionality to include writing back to repositories, mounting repositories (to allow streaming data), adding Windows support, and more. Follow the repo to stay updated and get involved!
