
Data Version Control (DVC)


One of the most annoying parts of working on a collaborative Machine Learning project is managing the data. Different people change it at different times, and it’s hard to organize a good way to do it. Many times, we ended up literally transferring files on a USB stick from one laptop to another. And that’s embarrassing.

You also can’t just put the data in your GitHub repository. That’s bad practice.

To solve that problem, DVC gives you a git-like interface for your data. You store the data in a remote source (like an S3-compatible object storage) and use DVC to download, upload, and update it.

For the data storage, I use Cloudflare R2 because it’s S3-compatible (so it works with DVC’s S3 support) and more affordable (realistically, it’s free for most projects unless you have really large amounts of data).

R2 is the GitHub and DVC is the git.

Setting Up and Configuring DVC

1. Install DVC

I use uv and you should too. What are you doing if you’re still using pip?

uv tool install "dvc[s3]"  # the s3 extra installs DVC itself plus the S3/R2 support
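You can sanity-check the install afterwards (the version you get will obviously depend on when you run this):

dvc --version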

2. Initialize and configure DVC

Navigate to your project root and run this command:

dvc init

This will create a .dvc directory in the root of your project. There’s a .dvc/config file where you tell DVC where the remote is and that kind of stuff. Mine looks like this:

[core]
    remote = r2
    autostage = true
['remote "r2"']
    url = s3://spaiced # name of the bucket
    endpointurl = https://cloudflare-account-id.r2.cloudflarestorage.com
    credentialpath = ../.aws/credentials
    profile = r2
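By the way, if you’d rather not edit the file by hand, roughly the same config can be generated from the command line (same example bucket, endpoint, and credentials path as above):

dvc remote add -d r2 s3://spaiced
dvc remote modify r2 endpointurl https://cloudflare-account-id.r2.cloudflarestorage.com
dvc remote modify r2 credentialpath ../.aws/credentials
dvc remote modify r2 profile r2
dvc config core.autostage true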

If someone else is setting up DVC for your project, this config will probably already be in the repo when you clone it (or they’ll send you these values).

The .dvc directory and the .dvc/config file are public and should be pushed to GitHub.

3. Set up your Object Storage credentials

Create a .aws/credentials file that stores the access key and secret key for your object storage bucket. This should not be committed to GitHub: add .aws/ to your .gitignore.

[r2]
aws_access_key_id = access-key
aws_secret_access_key = secret-access-key

Again, if a team member is handling this, they will send you these keys. And they will, or should, tell you to put the file in your .gitignore. Actually, they have probably added it to the .gitignore themselves.
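For reference, the relevant .gitignore entry is just that one directory (DVC adds its own entries for the tracked data later):

# .gitignore
.aws/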

Enough setting up!! You’re now good to go.

4. Tracking Data

Now, you have to tell DVC which files or folders to track (so it can watch them for changes and keep them under version control). You do this by running:

dvc add data/

I usually dedicate a data folder in my projects for this.
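Just to make the layout concrete, a project set up this way ends up looking roughly like this (the data.dvc file is explained right below; the file names under data/ are made up for illustration):

my-project/
├── .aws/
│   └── credentials      # secret keys, gitignored
├── .dvc/
│   └── config           # remote config, committed
├── data/                # the actual data, DVC-managed and gitignored
│   ├── train.csv
│   └── test.csv
└── data.dvc             # metadata pointer, committed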

The way this works is that DVC creates a data.dvc file which stores metadata about the files (hashes, sizes, paths), not the files themselves. DVC also adds the tracked folder to .gitignore automatically. So what gets stored in GitHub is never the data itself, but the data.dvc file, which records exactly which files exist (and which version of them) at that moment.
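For a tracked directory, that data.dvc file is just a small YAML snippet along these lines (the hash and sizes here are made up):

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 104857600
  nfiles: 42
  hash: md5
  path: data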

When you add, delete, or update data files, the data.dvc file changes to reflect that. So, every time you change the data, you have to commit and push the updated data.dvc to GitHub.
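That git step is nothing fancy, just the usual sequence (the commit message is whatever you like):

git add data.dvc
git commit -m "update data"
git push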

Now, let’s see how to change the actual data.

Using DVC

So, now that DVC is set up, here’s what the daily DVC workflow looks like:

If you already have data in the remote source (the R2 Object Storage), you can pull the latest version of the data by running

dvc pull

Whenever you make a change (adding a new file, updating one, deleting one, etc.), you first have to make DVC track all the new files and versions by running

dvc add data/

After that, you have to update DVC’s local cache (basically a git commit for the data) by running

dvc commit data/

Finally, you push your changes to the remote source (in this case the R2 Object Storage) by running

dvc push

As I said, the goal of DVC is to provide a git-like interface for data, so the commands and workflow are also similar to git.

At this point, your data is updated locally and in the remote storage. Because of that, the data.dvc file has changed to reflect the new state. So, to finalize everything, you have to commit and push data.dvc to your GitHub repository. This allows others to get the updated data with dvc pull. If you forget this step, the data in remote storage will be updated, but nobody else can pull the new version.

Recap

  1. dvc pull to get the latest version of the data
  2. change the data by modifying, adding, or deleting files
  3. dvc add data/ to update the tracking (start tracking new files, stop tracking deleted files, etc.)
  4. dvc commit data/ to save the changes locally (this saves the changes to DVC’s cache, not to git)
  5. dvc push to push the changes to the remote source (e.g. the R2 object storage)
  6. git add data.dvc && git commit && git push - push the updated data.dvc to GitHub