Managing data with DVC¶

For data analysis, and especially for machine learning, it is extremely valuable to be able to reproduce different versions of analyses that were performed with different data sets and parameters. To make analyses reproducible, both the data and the model (including algorithms, parameters, etc.) must be versioned. Because data sets are often large, versioning data is a harder problem than versioning models. Tools such as DVC help with data management by letting users transfer data to a remote storage location using a Git-like workflow, which simplifies retrieving a specific version of the data to reproduce an analysis.

DVC was developed to enable the sharing and traceable management of ML models and data sets. It uses its own system for storing files with support for SSH and HDFS, among others.
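
To make the idea of versioning large data alongside Git more concrete, the following is a minimal sketch, not DVC's actual implementation. It assumes a simplified scheme in which each data file is copied into a store under the hash of its content, while a small pointer file that records the hash is what would be committed to Git. The function names add and checkout, the .pointer.json suffix, the cache layout and the use of SHA-256 are all hypothetical choices made for this illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def _cache_path(cache_dir: Path, digest: str) -> Path:
    """Location of a file in the store, derived only from its content hash."""
    return cache_dir / digest[:2] / digest[2:]


def add(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the store and write a small pointer file next to it."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = _cache_path(cache_dir, digest)
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The pointer file is tiny, so it could be versioned with Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"hash": digest, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data version that the pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(_cache_path(cache_dir, meta["hash"]), target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    cache = work / "cache"
    data = work / "data.csv"

    # Version 1 of the data set
    data.write_text("x,y\n1,2\n")
    pointer_v1 = add(data, cache).read_text()

    # Version 2 of the data set; the pointer file now holds a different hash
    data.write_text("x,y\n1,2\n3,4\n")
    pointer_v2 = add(data, cache).read_text()

    # Going back to version 1 only requires the old pointer file,
    # which in a real project would come from the Git history.
    pointer_file = work / "data.csv.pointer.json"
    pointer_file.write_text(pointer_v1)
    restored = checkout(pointer_file, cache).read_text()
```

In this sketch the store is a local directory; in practice it would be a remote storage location such as the SSH or HDFS backends mentioned above, and the pointer files would be committed together with the code that uses the data.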

Tip

cusy seminar: Storing code and data in a versioned and reproducible manner

Installation¶

DVC can be installed with uv. Note, however, that you must explicitly specify the extras for the storage backends you need: [ssh], [s3], [gs], [azure], [oss], or [all] to install all of them. For SSH, the command looks like this:

$ uv add "dvc[ssh]"

Alternatively, DVC can also be installed via other package managers:

Debian/Ubuntu

$ sudo wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list
$ sudo apt update
$ sudo apt install dvc

macOS

$ brew install iterative/homebrew-dvc/dvc
