Disaster Recovery of GitLab.com
This is a Controlled Document
In line with GitLab’s regulatory obligations, changes to controlled documents must be approved or merged by a code owner. All contributions are welcome and encouraged.
Purpose
GitLab continually assesses ways to improve recovery capabilities across the full platform so that, should a disaster occur, normal operations can be restored as quickly as possible and with minimal disruption.
This policy outlines the current capabilities for responding to disaster scenarios for GitLab.com, how GitLab.com backups are validated and tested for restoration, and how GitLab tests the service recovery procedures that would be used in the unlikely event of a large-scale service disruption.
Disaster Recovery
Scope
GitLab.com’s disaster recovery strategy encompasses the following components:
- Regular validation and testing of restore procedures
- Automated restore and validation of backups
- Deployment across multiple data centers in us-east1 to provide tolerance against localized disruptions
Validation and Testing of Restore Procedures
Mock disaster recovery (DR) events, called Game Days, are conducted quarterly to simulate incidents affecting one or more services. These exercises validate DR processes and assess readiness for actual incidents.
During these Game Days, recovery time objective (RTO) and recovery point objective (RPO) targets are validated by recording measurements for each procedure.
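To illustrate what such measurements capture, the sketch below derives the achieved RTO and RPO from procedure timestamps. The record structure and field names are hypothetical assumptions for illustration, not GitLab's actual Game Day tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class GameDayMeasurement:
    """Hypothetical record of one Game Day procedure (fields are illustrative)."""
    procedure: str
    outage_start: datetime            # when the simulated disruption began
    service_restored: datetime        # when the procedure restored service
    last_recoverable_write: datetime  # newest data present after the restore

    @property
    def rto(self) -> timedelta:
        """Achieved recovery time: outage start until service restoration."""
        return self.service_restored - self.outage_start

    @property
    def rpo(self) -> timedelta:
        """Achieved recovery point: data written after this window was lost."""
        return self.outage_start - self.last_recoverable_write

m = GameDayMeasurement(
    procedure="postgres-pitr-restore",
    outage_start=datetime(2024, 5, 1, 12, 0),
    service_restored=datetime(2024, 5, 1, 13, 30),
    last_recoverable_write=datetime(2024, 5, 1, 11, 55),
)
print(f"{m.procedure}: RTO={m.rto}, RPO={m.rpo}")  # RTO=1:30:00, RPO=0:05:00
```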
Regional Recovery
All GitLab.com backups are stored in multi-region object storage to ensure the capability to recover customer data in the unlikely event of a regional disaster. Recovery from regional backups is validated through automated recovery and data validation processes described below.
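As an illustration, a bucket's multi-region placement can be confirmed with the google-cloud-storage Python client; the bucket name below is a hypothetical placeholder, since GitLab's actual bucket layout is internal.

```python
from google.cloud import storage

def assert_multi_region(bucket_name: str) -> None:
    """Fail loudly if the given backup bucket is not multi-region."""
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    # location_type is "multi-region", "dual-region", or "region"
    if bucket.location_type != "multi-region":
        raise RuntimeError(
            f"{bucket_name} is {bucket.location_type} ({bucket.location}), "
            "expected multi-region"
        )
    print(f"{bucket_name}: multi-region ({bucket.location})")

# Hypothetical bucket name used purely for illustration.
assert_multi_region("example-gitlab-backups")
```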
Automated restore testing and data integrity validation of backups
GitLab employs automated mechanisms to verify the integrity of its backups.
PostgreSQL
Daily restoration testing is performed for GitLab.com application databases in CI pipelines (internal) using the PostgreSQL Database Restore Validation project. This process performs a point-in-time recovery (PITR) restore into a new instance and verifies data integrity by running queries on the restored database. The same process is used for the CustomersDOT database in scheduled pipeline runs (internal) in the postgres-prdsub project.
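To make the validation step concrete, here is a minimal sketch using psycopg2 against a freshly restored instance. The DSN, table names, and specific checks are illustrative assumptions, not the actual queries run by the internal validation pipeline.

```python
import psycopg2

def validate_restored_database(dsn: str, max_lag_hours: int = 26) -> None:
    """Run basic integrity queries against a PITR-restored instance."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # The restore should contain recent writes: check that the newest
        # row in a high-churn table falls within the expected PITR window.
        cur.execute(
            "SELECT now() - max(created_at) < make_interval(hours => %s) "
            "FROM projects",
            (max_lag_hours,),
        )
        recent_enough, = cur.fetchone()
        assert recent_enough, "restored data is older than the PITR target"

        # Spot-check referential integrity on a known relationship.
        cur.execute(
            "SELECT count(*) FROM issues i "
            "LEFT JOIN projects p ON p.id = i.project_id "
            "WHERE p.id IS NULL"
        )
        orphans, = cur.fetchone()
        assert orphans == 0, f"{orphans} orphaned issues after restore"

# Hypothetical connection string for the restored instance.
validate_restored_database("host=restored-replica dbname=gitlabhq_production")
```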
Gitaly Disk Snapshots
Hourly restoration testing is performed for Git repositories in CI pipelines (internal) via the Gitaly snapshot verification project. This process selects a random Gitaly disk snapshot, restores it to a new disk, and verifies data integrity by checking for recent commits after restoration.
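In outline, such a check can be sketched as below, shelling out to gcloud and git. The snapshot naming convention, zone, mount path, and two-hour freshness window are hypothetical assumptions, not the verification project's actual implementation.

```python
import random
import subprocess
from datetime import datetime, timedelta, timezone

def run(*cmd: str) -> str:
    """Run a command and return stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Pick a random recent Gitaly snapshot (the name filter is hypothetical).
snapshots = run(
    "gcloud", "compute", "snapshots", "list",
    "--filter=name~^gitaly-", "--format=value(name)",
).split()
snapshot = random.choice(snapshots)

# Restore the snapshot to a fresh disk in a test zone.
run(
    "gcloud", "compute", "disks", "create", f"restore-test-{snapshot}",
    f"--source-snapshot={snapshot}", "--zone=us-east1-b",
)

# After attaching and mounting the disk (omitted here), verify that a
# sampled repository contains commits newer than the snapshot interval.
cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
last_commit = run(
    "git", "--git-dir=/mnt/restore/sample/repo.git", "log", "-1", "--format=%cI",
)
assert datetime.fromisoformat(last_commit.strip()) > cutoff, "no recent commits"
```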
Object Storage
Automated restore validation is not required for Object Storage due to its inherent protections through versioning and soft delete.
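Those protections can be confirmed directly. Below is a sketch assuming the google-cloud-storage client (reading the soft delete policy requires a recent client version, roughly 2.14 or later) and a hypothetical bucket name.

```python
from google.cloud import storage

def check_object_protections(bucket_name: str) -> None:
    """Confirm versioning and soft delete are enabled on a bucket."""
    bucket = storage.Client().get_bucket(bucket_name)
    assert bucket.versioning_enabled, f"{bucket_name}: versioning is off"
    policy = bucket.soft_delete_policy  # requires google-cloud-storage >= 2.14
    assert policy.retention_duration_seconds, f"{bucket_name}: soft delete is off"
    print(
        f"{bucket_name}: versioned, soft delete retains objects for "
        f"{policy.retention_duration_seconds // 86400} days"
    )

# Hypothetical bucket name used purely for illustration.
check_object_protections("example-gitlab-artifacts")
```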
Multiple zone deployments
GitLab.com is deployed across multiple GCP availability zones in the us-east1 region.
During short-term outages affecting a single zone within us-east1, capacity in the unaffected zones is scaled up to restore service.
For the Gitaly service, backup recovery will be necessary if data loss occurs.
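A multi-zone deployment like this can be monitored for zone balance. The sketch below uses the google-cloud-compute client with a hypothetical project ID; it is an illustration of the idea, not GitLab's monitoring setup.

```python
from collections import Counter
from google.cloud import compute_v1

def instances_per_zone(project: str, region: str = "us-east1") -> Counter:
    """Count running instances per zone in a region.

    A healthy multi-zone deployment should not concentrate all
    capacity in a single zone.
    """
    client = compute_v1.InstancesClient()
    counts: Counter = Counter()
    for zone, scoped in client.aggregated_list(project=project):
        # Keys look like "zones/us-east1-b"; skip zones outside the region.
        short = zone.split("/")[-1]
        if short.startswith(region) and scoped.instances:
            counts[short] += sum(
                1 for i in scoped.instances if i.status == "RUNNING"
            )
    return counts

counts = instances_per_zone("example-project")  # hypothetical project ID
print(counts)
assert len(counts) >= 2, "instances are concentrated in a single zone"
```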
Exceptions
Exceptions to this policy will be managed in accordance with the Information Security Policy Exception Management Process.
References
- Backups
- PostgreSQL Database Restore Validation project
- PostgreSQL Database Restore Validation pipeline runs
- Gitaly snapshot verification project
- Gitaly snapshot verification pipeline runs
- Records Retention & Disposal
- Disaster Recovery runbooks
- Game Days
