Decision trees for large repo migration to Git
TL;DR A checklist of key decisions to make when migrating large repositories to Git.
Sharing field notes on the key decisions to make when migrating large repositories to GitHub. GitHub is used here as the example target Git hosting platform, but the decisions may apply to other Git hosting platforms as well.
The decision trees are built on discussions from Migrate to Git from centralized version control, Manage and store large files in Git, and GitHub Docs.
In this post, the source repository may originate from a distributed version control system (DVCS), typically Git hosted on GitHub, Azure DevOps, or GitLab, or from a centralized version control system (CVCS) such as Subversion or Perforce. Perforce in particular has historically been used for large products with many creative media files, image assets, and binary objects.
A large repository is defined here as one that is gigabytes in size and stretches the practical limits of Git.
Areas of design
flowchart LR
B[Start] --> C[1. Git branching strategy]
C --> D[2. Data Lineage]
D --> E[3. Managing binaries]
1. Git branching strategy
Git is a distributed version control system, which works very differently from a centralized version control system. Team development workflows and collaboration will change substantially with Git. Defining a well-fitting branching strategy upfront is important because it is costly and painful to change later.
flowchart TD
Start[Git Branching Strategy] --> Complexity
subgraph Complexity
BS{How do we collaborate?} --> |multiple teams, <br />separation of concerns| BS1(GitFlow)
BS --> |high affinity, <br />high proximity| BS2(GitHub Flow)
BS1 --> NC{Naming convention}
BS2 --> NC{Naming convention}
end
subgraph Lifecycle
BL{Keep branch?} --> |no| BL1(Temporary <br />periodic purge)
BL --> |yes| BL2(Long-lived)
end
subgraph Merge
CH{Keep commit history?} --> CH1(All commits)
CH --> CH2(Squash)
CH --> CH3(Rebase)
end
Complexity --> Merge
Merge --> Lifecycle
Lifecycle --> BPR(Define branch protections)
Key expected outcomes:
- Branching model for development, releases, patches, and hotfixes (e.g. GitFlow, GitHub Flow)
- Roles and responsibilities for branches
- Lifecycle management of branches
- Merge strategy for branches
- Naming convention for branches
- Inputs to define teams and role assignments, branch protection rules, repository rulesets
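As one way to make the naming convention outcome concrete, the sketch below checks branch names against a pattern. The convention itself (a `<type>/<description>` layout, the prefix list, and the long-lived branch names) is a hypothetical example loosely modelled on common GitFlow usage, not something prescribed by this post; adjust it to whatever the team agrees on.

```python
import re

# Hypothetical convention: "<type>/<short-description>", e.g. "feature/login-form".
# The prefixes and long-lived branch names are assumptions for illustration.
ALLOWED_PREFIXES = ("feature", "release", "hotfix", "bugfix")
LONG_LIVED = {"main", "develop"}

# Lowercase alphanumeric words separated by ".", "_" or "-".
BRANCH_PATTERN = re.compile(
    r"(?:%s)/[a-z0-9]+(?:[._-][a-z0-9]+)*" % "|".join(ALLOWED_PREFIXES)
)


def is_valid_branch_name(name: str) -> bool:
    """Return True for a long-lived branch or a name matching the convention."""
    return name in LONG_LIVED or BRANCH_PATTERN.fullmatch(name) is not None
```

A check like this could run in a pre-push hook or a CI job, and the agreed pattern can also feed into the branch protection rules and repository rulesets listed above.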
2. Data Lineage
Every developer can pull the full history of a Git repository locally and work offline, so whatever is migrated is carried in every clone. Identifying what does not need to be migrated reduces the complexity of the migration; candidates include archived code, orphaned objects, and stale branches and tags.
flowchart TD
Start[Data Lineage] --> Code
subgraph Code
KH{Keep history?} --> |no| KH1[Tip migration]
KH{Keep history?} --> |yes| KH2[Source code with history]
end
subgraph Branches
BR{Migrate from git?} --> |yes| BR1(Can keep)
BR --> |no| BR2(Abandon)
end
subgraph Tags
TA{Migrate from git?} --> |yes| TA1(Can keep)
TA --> |no| TA3(Abandon)
end
Code --> Branches
Branches --> Tags
Key expected outcomes:
- Depth of commit history to keep
- Branches to migrate
- Tags to migrate
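When the source is already a Git repository, stale branches can be shortlisted from the date of their last commit. The sketch below is a minimal example under that assumption: it parses the text printed by `git for-each-ref --format="%(refname:short) %(committerdate:unix)" refs/heads` and returns the branches older than a cutoff. The function name and the 365-day default are hypothetical choices, not recommendations from the sources above.

```python
from datetime import datetime, timedelta, timezone


def find_stale_branches(
    for_each_ref_output: str, now: datetime, max_age_days: int = 365
) -> list[str]:
    """Return branch names whose last commit is older than max_age_days.

    Expects one "<branch> <unix-timestamp>" pair per line, as printed by:
      git for-each-ref --format="%(refname:short) %(committerdate:unix)" refs/heads
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for line in for_each_ref_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Branch names cannot contain spaces, so the timestamp is the last field.
        name, timestamp = line.rsplit(" ", 1)
        last_commit = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
        if last_commit < cutoff:
            stale.append(name)
    return stale
```

The resulting list is only a starting point for review; a branch that looks stale may still back a supported release and need to be kept.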
3. Managing binaries
Git is optimized for versioning text files and source code, and for sharing and collaborating across distributed teams or individuals. GitHub publishes recommendations on repository limits, so Git may not be the right home for binaries. Good repository hygiene reduces the risk of file corruption, repository bloat, and performance issues. The flowchart below considers what to migrate and where to migrate it to.
flowchart TD
Start[Binaries] --> A(Analyse)
A --> Libraries
subgraph Libraries
LT{Have libraries} --> |yes| LT1(Package registries)
end
subgraph Binaries
BI{Binaries > <em>n</em>-MB?} --> |no| BF{Update frequently?}
BI --> |yes| BI1[Git LFS]
BF --> |small, infrequent| BF1[Git]
BF --> |yes| BI1
end
subgraph Build
BU{Have build output?} --> |yes| BA{Used by other workflows?}
BA --> |yes| BU1[Actions Artifacts]
BA --> |no| BU2[GitHub Packages]
end
Libraries --> Binaries
Binaries --> Build
Build --> PPR(Define push protection)
Key expected outcomes:
- List of identified orphaned libraries and binaries to purge
- Libraries to migrate to GitHub Packages
- Purge build outputs and update pipeline to use Actions Artifacts
- Binaries to migrate to Git LFS
- Binaries to migrate to other blob storage
- Inputs into push rules
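The Binaries branch of the flowchart can be expressed as a small decision function, which is handy when classifying an inventory of files exported from the source system. This is a sketch under stated assumptions: the `BinaryFile` record, the revision count as a proxy for "updated frequently", and both thresholds are hypothetical, since the post deliberately leaves the size limit as n MB.

```python
from typing import NamedTuple


class BinaryFile(NamedTuple):
    path: str
    size_bytes: int
    revisions: int  # commits or changelists that touched the file (assumed metric)


def choose_storage(f: BinaryFile, threshold_mb: float, max_revisions: int) -> str:
    """Mirror the flowchart: large or frequently updated binaries go to Git LFS."""
    if f.size_bytes > threshold_mb * 1024 * 1024:
        return "Git LFS"
    if f.revisions > max_revisions:
        return "Git LFS"
    # Small and infrequently updated binaries can stay in regular Git.
    return "Git"
```

For example, with a 50 MB threshold and at most 5 revisions, `choose_storage(BinaryFile("art/hero.psd", 200 * 1024 * 1024, 1), 50, 5)` returns `"Git LFS"`, while a 10 KB icon changed twice stays in Git.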