DataHub Developer's Guide
Requirements
- Java 17 JDK
- Python 3.11
- Docker
- Node 22.x
- Docker Compose >=2.20
- Yarn >=v1.22 for documentation building
- Docker engine with at least 8GB of memory to run tests.
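Several of the requirements above are minimum versions, so it can help to compare an installed version against the required one before building. The following is a minimal sketch in POSIX shell; the function name version_ge is hypothetical and not part of the DataHub repository, and it assumes purely numeric, dot-separated versions.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: version_ge HAVE WANT
# Succeeds when dotted version HAVE is greater than or equal to WANT.
# Assumes every field is a plain number (no "-rc1" style suffixes).
version_ge() {
  have=$1
  want=$2
  while [ -n "$have" ] || [ -n "$want" ]; do
    # Take the leading field of each version; a missing field counts as 0.
    h=${have%%.*}
    w=${want%%.*}
    [ "${h:-0}" -gt "${w:-0}" ] && return 0
    [ "${h:-0}" -lt "${w:-0}" ] && return 1
    # Drop the field that was just compared.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
  done
  return 0
}
```

For example, version_ge "$(docker compose version --short)" 2.20 || echo "Docker Compose is too old" would flag an outdated Compose install, assuming your Compose version supports the --short flag.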
Option 1: Using Homebrew on Mac
On macOS, these can be installed using Homebrew.
# Install Java
brew install openjdk@17
# Install Python
brew install python@3.11 # you may need to add this to your PATH
# alternatively, you can use pyenv to manage your python versions
# Install docker and docker compose
brew install --cask docker
Option 2: Using mise
Alternatively, you can use mise (mise-en-place) to manage tool installation. The repository already includes a mise.toml file, so mise can manage the tool versions for you.
Install the mise CLI by following the instructions at https://mise.jdx.dev/getting-started.html
Fork and clone the repo if you haven't done so already (see Building the Project).
# Enter the root folder of the repo
> cd datahub
# Needed the first time to allow mise to auto activate the tools
# mentioned in mise.toml
> mise trust
# Needed once if the required tools haven't been installed via mise before
# or if a new tool is added or tool version changed since last use
> mise install
After this, the required tools should be activated automatically as soon as you enter the folder where the repo is cloned.
You can verify that the tools are activated correctly by running:
# Check tool versions installed
❯ mise ls --local
Tool    Version  Source                       Requested
java    17.0.2   ~/path/to/datahub/mise.toml  17
node    22.21.1  ~/path/to/datahub/mise.toml  22
python  3.11.14  ~/path/to/datahub/mise.toml  3.11
yarn    4.12.0   ~/path/to/datahub/mise.toml  latest
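The listing above can also be checked from a script. This is a rough sketch that reads the output of mise ls --local on standard input and prints any expected tool that does not appear in the first column. The function name missing_tools and its default tool list are assumptions drawn from the sample output, and the check only looks at tool names, not at versions or install state.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: mise ls --local | missing_tools ["tool1 tool2 ..."]
# Prints one line per expected tool that is absent from the listing.
missing_tools() {
  awk -v want="${1:-java node python yarn}" '
    { seen[$1] = 1 }               # first column is the tool name
    END {
      n = split(want, tools, " ")
      for (i = 1; i <= n; i++)
        if (!(tools[i] in seen)) print tools[i]
    }
  '
}
```

An empty result means every expected tool was listed. The header row is harmless because "Tool" is not one of the expected names.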
Building the Project
Fork and clone the repository if you haven't done so already
git clone https://github.com/{username}/datahub.git
Change into the repository's root directory
cd datahub
Use the Gradle wrapper to build the project
./gradlew build
Note that this also runs tests and a number of validations, which makes the build considerably slower.
We suggest partially compiling DataHub according to your needs:
Build DataHub's backend GMS (Generalized Metadata Service):
./gradlew :metadata-service:war:build
Build DataHub's frontend:
./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint
Build DataHub's command line tool:
./gradlew :metadata-ingestion:installDev
Build DataHub's documentation:
./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript
# To preview the documentation
./gradlew :docs-website:serve
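If you switch between these partial builds often, a small lookup function can save retyping the task lists. The sketch below only maps a short component name to the Gradle arguments listed above. The function name gradle_args_for and the short names (backend, frontend, cli, docs) are hypothetical conveniences, not something the repository provides.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: gradle_args_for COMPONENT
# Prints the Gradle arguments for one of the partial builds described above.
gradle_args_for() {
  case "$1" in
    backend)  echo ":metadata-service:war:build" ;;
    frontend) echo ":datahub-frontend:dist -x yarnTest -x yarnLint" ;;
    cli)      echo ":metadata-ingestion:installDev" ;;
    docs)     echo ":docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript" ;;
    *)        echo "unknown component: $1" >&2; return 1 ;;
  esac
}
```

From the repository root it could be used as ./gradlew $(gradle_args_for frontend). The command substitution is left unquoted on purpose so that the shell splits the result into separate arguments.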
Dependency Management
Dependency Locking
DataHub uses Gradle's dependency locking to ensure reproducible builds across all environments. Dependency locks guarantee that the exact same dependency versions are used everywhere - in development, CI, and production - preventing unexpected behavior from transitive dependency updates.
Why Dependency Locking?
- Reproducible Builds: Same dependency versions across all environments
- Security: Prevents unexpected transitive dependency updates that could introduce vulnerabilities
- Stability: Explicit control over when dependencies change
- Audit Trail: Lock files in git provide clear history of dependency changes
How It Works
Each sub-project has a gradle.lockfile that records the exact version of every direct and transitive dependency used across all configurations (compile, runtime, test, etc.). When you build the project, Gradle uses these locked versions instead of resolving the latest versions.
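Gradle lock files are plain text, with one group:artifact:version=configurations entry per line plus a few comment lines starting with #. As a rough sketch under that assumption, the function below looks up the version pinned for a given dependency. The name locked_version is hypothetical, and the lock file path in the usage note is an assumption about where a sub-project's file lives.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: locked_version LOCKFILE GROUP:ARTIFACT
# Prints the version(s) pinned for GROUP:ARTIFACT and fails if none is found.
# More than one line is printed if the lock file pins the artifact at
# different versions for different configurations.
locked_version() {
  awk -F'[:=]' -v ga="$2" '
    /^#/ { next }                                  # skip comment lines
    ($1 ":" $2) == ga { print $3; found = 1 }      # fields: group, artifact, version
    END { exit !found }
  ' "$1"
}
```

For example, locked_version metadata-io/gradle.lockfile com.fasterxml.jackson.core:jackson-databind would print the pinned Jackson version, assuming that sub-project locks it.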
Viewing Locked Dependencies
To see the locked dependencies for a project:
./gradlew :project-name:dependencies --configuration runtimeClasspath
For example:
./gradlew :metadata-io:dependencies --configuration runtimeClasspath
Updating Dependencies
When you update dependencies in build.gradle files, you must regenerate the lock files. There are several ways to do this depending on your needs:
Update All Lock Files (Complete Refresh)
After changing dependencies in any build.gradle file, run:
./gradlew resolveAndLockAll --write-locks
This resolves all configurations across all sub-projects and updates all lock files. This is the recommended approach when:
- Adding or removing dependencies
- Changing dependency versions
- After pulling changes that modify build.gradle files
Update a Specific Dependency Version
To update a specific dependency to a newer version:
./gradlew dependencies --update-locks group:artifact
For example, to update Jackson:
./gradlew dependencies --update-locks com.fasterxml.jackson.core:jackson-databind
Update Lock Files for a Single Project
If you only changed dependencies in one sub-project:
./gradlew :project-name:dependencies --write-locks
Note: This only locks configurations that get resolved by the dependencies task. For complete coverage, use resolveAndLockAll instead.
Common Workflows
Adding a new dependency:
- Add the dependency to the appropriate build.gradle file
- Run ./gradlew resolveAndLockAll --write-locks
- Verify the build works: ./gradlew :project-name:build
- Commit both the build.gradle change and updated lock files
Updating an existing dependency:
- Change the version in build.gradle
- Run ./gradlew resolveAndLockAll --write-locks
- Review the lock file changes to see what transitive dependencies changed
- Test thoroughly, especially for major version upgrades
- Commit the changes
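Both workflows above end with build.gradle changes and lock file changes being committed together, which is easy to forget. The following sketch reads a list of changed paths on standard input and fails when a build.gradle file changed but no gradle.lockfile did. The function name locks_updated_with_build_files is hypothetical. The check is only a heuristic, since a build.gradle edit that does not touch dependencies legitimately leaves the lock files unchanged.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: git diff --cached --name-only | locks_updated_with_build_files
# Fails (exit 1) when a build.gradle changed without any gradle.lockfile change.
locks_updated_with_build_files() {
  awk -F/ '
    $NF == "build.gradle"    { build = 1 }   # last path component is the file name
    $NF == "gradle.lockfile" { lock = 1 }
    END { exit (build && !lock) }
  '
}
```

It could be run before committing, for example: git diff --cached --name-only | locks_updated_with_build_files || echo "build.gradle changed; did you regenerate the lock files?"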
After pulling changes: If someone else updated dependencies and you pull their changes, just build normally:
./gradlew build
Gradle will automatically use the locked versions from the updated lock files. You don't need to regenerate locks unless you're making your own dependency changes.
Troubleshooting
Build fails with dependency resolution error:
- Ensure you've run ./gradlew resolveAndLockAll --write-locks after dependency changes
- Check that lock files are committed and up to date
- Try ./gradlew --refresh-dependencies build to refresh Gradle's cache
Lock file is out of sync: If you see warnings about lock state, regenerate the locks:
./gradlew resolveAndLockAll --write-locks
Deploying Local Versions
This guide explains how to set up and deploy DataHub locally for development purposes.
Initial Setup
Before you begin, set up and activate a Python virtual environment for the local datahub CLI tool:
cd metadata-ingestion/
python3 -m venv venv
source venv/bin/activate
cd ../
Deploying the Full Stack
Deploy the entire system using docker-compose:
./gradlew quickstartDebug
Access the DataHub UI at http://localhost:9002
Refreshing the Frontend
To run and update the frontend with local changes, open a new terminal and run:
cd datahub-web-react
yarn install && yarn start
The frontend will be available at http://localhost:3000 and will automatically update as you make changes to the code.
Refreshing components of quickstart
To refresh any running component started by ./gradlew quickstartDebug, run:
./gradlew debugReload
This builds any changed components and restarts the containers that changed. There are a few other quickstart* variants, such as quickstartDebugMin and quickstartDebugConsumers.
Each of those variants has a corresponding reload task.
For ./gradlew quickstartDebugConsumers, the reload command is ./gradlew debugConsumersReload
For ./gradlew quickstartDebugMin, the reload command is ./gradlew debugMinReload
A full restart using ./gradlew quickstartDebug is recommended if there are significant changes and the setup/system update containers need to be run again.
For incremental changes, the debug*Reload tasks can be used.
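The pairing between quickstart tasks and reload tasks described above can be captured in a small lookup. This sketch covers only the three pairs named in this guide. The function name reload_task_for is hypothetical, and other quickstart* variants are deliberately left unmapped rather than guessed.

```shell
# Hypothetical helper (not part of the DataHub repo).
# usage: reload_task_for QUICKSTART_TASK
# Prints the reload task that matches the given quickstart task.
reload_task_for() {
  case "$1" in
    quickstartDebug)          echo "debugReload" ;;
    quickstartDebugConsumers) echo "debugConsumersReload" ;;
    quickstartDebugMin)       echo "debugMinReload" ;;
    *) echo "no known reload task for: $1" >&2; return 1 ;;
  esac
}
```

For example, ./gradlew "$(reload_task_for quickstartDebugMin)" runs the reload task for a stack started with quickstartDebugMin.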
Cleaning up containers and volumes
To completely remove containers and volumes for a specific project, you can use the nuke tasks:
# Remove containers and volumes for specific projects
./gradlew quickstartDebugNuke # For debug project
./gradlew quickstartCypressNuke # For cypress project (dh-cypress)
Note: These are Gradle nuke tasks. For CLI-based cleanup, see datahub docker nuke in the quickstart guide.