DataHub Developer's Guide

Requirements

Option 1: Using Homebrew on Mac

On macOS, these can be installed using Homebrew.

# Install Java
brew install openjdk@17

# Install Python
brew install python@<version> # e.g. python@3.11, the version mise requests in Option 2; you may need to add this to your PATH
# alternatively, you can use pyenv to manage your python versions

# Install docker and docker compose
brew install --cask docker

Option 2: Using mise

Alternatively, you can use mise (mise-en-place) to manage tool installation. The repository already contains a mise.toml file, so mise can manage the tool versions for you.

Install the mise CLI by following the instructions at https://mise.jdx.dev/getting-started.html

Fork and clone the repo if you haven't done so already; see Building the Project.

# Enter the root folder of the repo
> cd datahub

# Needed the first time to allow mise to auto activate the tools
# mentioned in mise.toml
> mise trust

# Needed once if the required tools haven't been installed via mise before
# or if a new tool is added or tool version changed since last use
> mise install

After this, the required tools are activated automatically whenever you enter the folder where the repo is cloned.

You can verify that the tools are activated correctly by running:

# Check tool versions installed
❯ mise ls --local
Tool    Version  Source                       Requested
java    17.0.2   ~/path/to/datahub/mise.toml  17
node    22.21.1  ~/path/to/datahub/mise.toml  22
python  3.11.14  ~/path/to/datahub/mise.toml  3.11
yarn    4.12.0   ~/path/to/datahub/mise.toml  latest
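
To compare a reported version against the Requested column in a script, a simple prefix check is usually enough. The sketch below is a hypothetical helper, not part of mise or the DataHub repository; the function name version_matches is an assumption made for illustration.

```shell
#!/usr/bin/env sh
# version_matches ACTUAL WANTED (hypothetical helper)
# Succeeds when ACTUAL equals WANTED or extends it with further
# dot-separated components (17.0.2 matches 17; 3.11.14 matches 3.11).
version_matches() {
  case "$1" in
    "$2" | "$2".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with values copied from the listing above.
if version_matches "3.11.14" "3.11"; then
  echo "python ok"
fi
```

A symbolic request such as latest (used for yarn above) cannot be checked this way and would need to be skipped.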

Building the Project

Fork and clone the repository if you haven't done so already

git clone https://github.com/{username}/datahub.git

Change into the repository's root directory

cd datahub

Use the Gradle wrapper to build the project

./gradlew build

Note that the command above also runs tests and a number of validations, which makes the build considerably slower.

We suggest partially compiling DataHub according to your needs:

  • Build DataHub's backend GMS (Generalized Metadata Service):

    ./gradlew :metadata-service:war:build

  • Build DataHub's frontend:

    ./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint

  • Build DataHub's command line tool:

    ./gradlew :metadata-ingestion:installDev

  • Build DataHub's documentation:

    ./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript
    # To preview the documentation
    ./gradlew :docs-website:serve
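
If you switch between these partial builds often, a small wrapper can save typing. The following is a minimal sketch: the Gradle arguments are copied from the list above, while the short component names (gms, frontend, cli, docs) and both function names are hypothetical and do not exist in the repository.

```shell
#!/usr/bin/env sh
# build_args COMPONENT (hypothetical helper)
# Prints the Gradle arguments for one component, as listed above.
build_args() {
  case "$1" in
    gms)      echo ":metadata-service:war:build" ;;
    frontend) echo ":datahub-frontend:dist -x yarnTest -x yarnLint" ;;
    cli)      echo ":metadata-ingestion:installDev" ;;
    docs)     echo ":docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript" ;;
    *)        echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# build_component COMPONENT (hypothetical helper)
# Runs the Gradle wrapper; call it from the repository root.
build_component() {
  args=$(build_args "$1") || return 1
  # Word splitting is intentional: args may hold several Gradle arguments.
  # shellcheck disable=SC2086
  ./gradlew $args
}
```

For example, build_component frontend would run the frontend build without the yarn test and lint steps.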

Dependency Management

Dependency Locking

DataHub uses Gradle's dependency locking to ensure reproducible builds across all environments. Dependency locks guarantee that the exact same dependency versions are used everywhere - in development, CI, and production - preventing unexpected behavior from transitive dependency updates.

Why Dependency Locking?

  • Reproducible Builds: Same dependency versions across all environments
  • Security: Prevents unexpected transitive dependency updates that could introduce vulnerabilities
  • Stability: Explicit control over when dependencies change
  • Audit Trail: Lock files in git provide clear history of dependency changes

How It Works

Each sub-project has a gradle.lockfile that records the exact version of every direct and transitive dependency used across all configurations (compile, runtime, test, etc.). When you build the project, Gradle uses these locked versions instead of resolving the latest versions.
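
Gradle lock files are plain text, with one entry per line in the form group:artifact:version=configuration1,configuration2, plus comment lines starting with # and an empty= line that lists configurations without dependencies. As a rough sketch, the locked version of a single artifact can be looked up without invoking Gradle. The function name locked_version is hypothetical, and the lock file path in the usage comment assumes Gradle's default location in the sub-project directory.

```shell
#!/usr/bin/env sh
# locked_version LOCKFILE GROUP:ARTIFACT (hypothetical helper)
# Prints the locked version of the given module, one line per entry.
locked_version() {
  awk -F'[:=]' -v ga="$2" '
    /^#/ || /^empty=/ { next }          # skip comments and the empty= line
    ($1 ":" $2) == ga { print $3 }      # fields: group, artifact, version
  ' "$1"
}

# Usage (path assumed, run from the repository root):
#   locked_version metadata-io/gradle.lockfile com.fasterxml.jackson.core:jackson-databind
```

This reads only what is already recorded in the lock file; use the dependencies task described below to see how Gradle actually resolves a configuration.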

Viewing Locked Dependencies

To see the locked dependencies for a project:

./gradlew :project-name:dependencies --configuration runtimeClasspath

For example:

./gradlew :metadata-io:dependencies --configuration runtimeClasspath

Updating Dependencies

When you update dependencies in build.gradle files, you must regenerate the lock files. There are several ways to do this depending on your needs:

Update All Lock Files (Complete Refresh)

After changing dependencies in any build.gradle file, run:

./gradlew resolveAndLockAll --write-locks

This resolves all configurations across all sub-projects and updates all lock files. This is the recommended approach when:

  • Adding or removing dependencies
  • Changing dependency versions
  • After pulling changes that modify build.gradle files

Update a Specific Dependency Version

To update a specific dependency to a newer version:

./gradlew dependencies --update-locks group:artifact

For example, to update Jackson:

./gradlew dependencies --update-locks com.fasterxml.jackson.core:jackson-databind

Update Lock Files for a Single Project

If you only changed dependencies in one sub-project:

./gradlew :project-name:dependencies --write-locks

Note: This only locks configurations that get resolved by the dependencies task. For complete coverage, use resolveAndLockAll instead.

Common Workflows

Adding a new dependency:

  1. Add the dependency to the appropriate build.gradle file
  2. Run ./gradlew resolveAndLockAll --write-locks
  3. Verify the build works: ./gradlew :project-name:build
  4. Commit both the build.gradle change and updated lock files

Updating an existing dependency:

  1. Change the version in build.gradle
  2. Run ./gradlew resolveAndLockAll --write-locks
  3. Review the lock file changes to see what transitive dependencies changed
  4. Test thoroughly, especially for major version upgrades
  5. Commit the changes
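
The regenerate-and-review steps in these workflows can be sketched as a small script. It assumes you run it from the repository root with git available; the function names only_lockfiles and relock_and_review are hypothetical, and git diff --name-only reports only files that are already tracked.

```shell
#!/usr/bin/env sh
# only_lockfiles (hypothetical helper)
# Reads paths on stdin and keeps those whose file name is exactly
# gradle.lockfile. "|| true" keeps an empty result from being an error.
only_lockfiles() {
  grep -E '(^|/)gradle\.lockfile$' || true
}

# relock_and_review (hypothetical helper)
# Regenerates all lock files, then lists the tracked lock files that changed.
relock_and_review() {
  ./gradlew resolveAndLockAll --write-locks || return 1
  git diff --name-only | only_lockfiles
}
```

Passing the listed paths to git diff then shows exactly which direct and transitive versions moved.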

After pulling changes: If someone else updated dependencies and you pull their changes, just build normally:

./gradlew build

Gradle will automatically use the locked versions from the updated lock files. You don't need to regenerate locks unless you're making your own dependency changes.

Troubleshooting

Build fails with dependency resolution error:

  • Ensure you've run ./gradlew resolveAndLockAll --write-locks after dependency changes
  • Check that lock files are committed and up to date
  • Try ./gradlew --refresh-dependencies build to refresh Gradle's cache

Lock file is out of sync: If you see warnings about lock state, regenerate the locks:

./gradlew resolveAndLockAll --write-locks

Deploying Local Versions

This guide explains how to set up and deploy DataHub locally for development purposes.

Initial Setup

Before you begin, set up a Python virtual environment for the local datahub CLI tool:

cd metadata-ingestion/
python3 -m venv venv
source venv/bin/activate
cd ../

Deploying the Full Stack

Deploy the entire system using docker-compose:

./gradlew quickstartDebug

Access the DataHub UI at http://localhost:9002

Refreshing the Frontend

To run and update the frontend with local changes, open a new terminal and run:

cd datahub-web-react
yarn install && yarn start

The frontend will be available at http://localhost:3000 and will automatically update as you make changes to the code.

Refreshing components of quickstart

To refresh any part of the running system started by ./gradlew quickstartDebug, run:

./gradlew debugReload

This builds any changed components and restarts the containers that had changes. There are a few other quickstart* variants, such as quickstartDebugMin and quickstartDebugConsumers.

For each of those variants, there is a corresponding reload task:

  • For ./gradlew quickstartDebugMin, the reload command is ./gradlew debugMinReload
  • For ./gradlew quickstartDebugConsumers, the reload command is ./gradlew debugConsumersReload

A full restart using ./gradlew quickstartDebug is recommended if there are significant changes and the setup/system update containers need to be run again. For incremental changes, the debugReload* variants can be used.
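
The pairing between quickstart variants and reload tasks can be captured in a small lookup so that scripts do not have to hard-code it. This is a sketch covering only the three variants named in this section; the function name reload_task_for is hypothetical.

```shell
#!/usr/bin/env sh
# reload_task_for QUICKSTART_TASK (hypothetical helper)
# Prints the reload task that matches a quickstart variant.
reload_task_for() {
  case "$1" in
    quickstartDebug)          echo debugReload ;;
    quickstartDebugMin)       echo debugMinReload ;;
    quickstartDebugConsumers) echo debugConsumersReload ;;
    *) echo "no known reload task for: $1" >&2; return 1 ;;
  esac
}

# Usage, from the repository root:
#   ./gradlew "$(reload_task_for quickstartDebugMin)"
```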

Cleaning up containers and volumes

To completely remove containers and volumes for a specific project, you can use the nuke tasks:

# Remove containers and volumes for specific projects
./gradlew quickstartDebugNuke # For debug project
./gradlew quickstartCypressNuke # For cypress project (dh-cypress)

Note: These are Gradle nuke tasks. For CLI-based cleanup, see datahub docker nuke in the quickstart guide.

Using .env to configure settings of services started by quickstart