Decision trees for large repo migration to Git

TL;DR A checklist of key decisions to make when migrating large repositories to Git.

These are field notes on the key decisions to make when migrating large repositories to GitHub. GitHub is used as the example target Git hosting platform, but the decisions may apply to other Git hosting platforms.

The decision trees are built on discussions from "Migrate to Git from centralized version control", "Manage and store large files in Git", and GitHub Docs.

In this post, the source repository may originate from a distributed version control system (DVCS), for example a Git repository hosted on GitHub, Azure DevOps, or GitLab, or from a centralized version control system (CVCS) like Subversion or Perforce. Perforce in particular has historically been used for large products with many creative media files, image assets, and binary objects.

A large repository is defined here as one that is gigabytes in size and stretches the system limits of Git.

Areas of design

flowchart LR
	B[Start] --> C[1. Git branching strategy]
	C --> D[2. Data Lineage]
	D --> E[3. Managing binaries]

1. Git branching strategy

Git is a distributed version control system, which works very differently from a centralized version control system. Team development workflows and collaboration will differ vastly with Git. Defining a well-fitting branching strategy upfront is important because it is costly and painful to change later.

flowchart TD

	Start[Git Branching Strategy] --> Complexity

	subgraph Complexity
	BS{How do we collaborate?} --> |multiple teams, <br />separation of concerns| BS1(GitFlow)
	BS --> |high affinity, <br />high proximity| BS2(GitHub Flow)
	BS1 --> NC{Naming convention}
	BS2 --> NC{Naming convention}
	end

	subgraph Lifecycle
	BL{Keep branch?} --> |no| BL1(Temporary <br />periodic purge)
	BL --> |yes| BL2(Long-lived)
	end

	subgraph Merge
	CH{Keep commits history?} --> CH1(All commits)
	CH --> CH2(Squash)
	CH --> CH3(Rebase)
	end

	Complexity --> Merge
	Merge --> Lifecycle
	Lifecycle --> BPR(Define branch protections)

Key expected outcomes:

  • Branching model for development, releases, patches, and hotfixes (e.g. GitFlow, GitHub Flow)
  • Roles and responsibilities for branches
  • Lifecycle management of branches
  • Merge strategy for branches
  • Naming convention for branches
  • Inputs to define teams and role assignments, branch protection rules, repository rulesets
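
Once a naming convention is agreed, it can be checked mechanically before branches are migrated or created. The sketch below assumes a hypothetical GitFlow-style convention of `<type>/<short-description>` with lowercase, hyphen-separated words; the prefixes, the long-lived branch names, and the function name are illustrative assumptions, not rules from this post.

```python
import re

# Hypothetical convention: <type>/<short-description>, lowercase,
# words separated by hyphens or dots (e.g. release/1.2.0).
BRANCH_PATTERN = re.compile(
    r"(feature|bugfix|hotfix|release)/[a-z0-9]+(?:[-.][a-z0-9]+)*"
)

# Assumed long-lived branches that are exempt from the pattern.
LONG_LIVED = {"main", "develop"}


def is_valid_branch_name(name: str) -> bool:
    """Return True if the name is long-lived or matches the convention."""
    return name in LONG_LIVED or BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["main", "feature/add-login", "Feature/Add_Login"]:
        print(candidate, is_valid_branch_name(candidate))
```

A check like this could run over the branch list of the source repository to find names that need to be renamed during migration.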

2. Data Lineage

Every developer can pull the full history of a Git repository locally and work offline, so whatever is migrated ends up in every clone. Identifying what does not need to migrate reduces the complexity of the migration; candidates include archived code, orphaned objects, and stale branches and tags.

flowchart TD

	Start[Data Lineage] --> Code
	subgraph Code
	KH{Keep history?} --> |no| KH1[Tip migration]
	KH{Keep history?} --> |yes| KH2[Source code with history]
	end

	subgraph Branches
	BR{Migrate from git?} --> |yes| BR1(Can keep)
	BR --> |no| BR2(Abandon)
	end

	subgraph Tags
	TA{Migrate from git?} --> |yes| TA1(Can keep)
	TA --> |no| TA3(Abandon)
	end

	Code --> Branches
	Branches --> Tags

Key expected outcomes:

  • Depth of commit history to keep
  • Branches to migrate
  • Tags to migrate
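
One way to shortlist stale branches is to partition them by the date of their last commit. This is a minimal sketch, assuming the last-commit dates have already been exported from the source system into a dictionary; the 365-day cutoff, the function name, and the sample data are hypothetical and should be replaced with whatever the team agrees on.

```python
from datetime import datetime, timedelta, timezone


def split_branches_by_age(last_commit_by_branch, now, max_age_days=365):
    """Partition branch names into (keep, abandon) lists.

    A branch is kept when its last commit is no older than max_age_days.
    The cutoff is an assumed policy, not a recommendation from the post.
    """
    cutoff = now - timedelta(days=max_age_days)
    keep, abandon = [], []
    for branch, last_commit in sorted(last_commit_by_branch.items()):
        (keep if last_commit >= cutoff else abandon).append(branch)
    return keep, abandon


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    branches = {
        "main": datetime(2023, 12, 1, tzinfo=timezone.utc),
        "feature/old-prototype": datetime(2020, 1, 1, tzinfo=timezone.utc),
    }
    keep, abandon = split_branches_by_age(branches, now)
    print("keep:", keep)
    print("abandon:", abandon)
```

The resulting lists are only candidates; branch owners should confirm anything marked for abandonment before it is dropped from the migration.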

3. Managing binaries

Git is a filesystem optimized for versioning text files and source code, and for sharing and collaborating across distributed teams or individuals. GitHub publishes recommendations on repository limits, so Git may not be the right home for binaries. Good repository hygiene reduces the risk of file corruption, repository bloat, and performance issues. The tree below considers what to migrate and where to migrate it to.

flowchart TD

	Start[Binaries] --> A(Analyse)
	A --> Libraries

	subgraph Libraries
	LT{Have libraries?} --> |yes| LT1(Package registries)
	end

	subgraph Binaries
	BI{Binaries > <em>n</em>-MB?} --> |no| BF{Update frequently?}
	BI --> |yes| BI1[Git LFS]
	BF --> |small, infrequent| BF1[Git]
	BF --> |yes| BI1
	end

	subgraph Build
	BU{Have build output?} --> |yes| BA{Used by other workflows?}
	BA --> |yes| BU1[Actions Artifacts]
	BA --> |no| BU2[GitHub Packages]
	end

	Libraries --> Binaries
	Binaries --> Build

	Build --> PPR(Define push protection)

Key expected outcomes:

  • List of identified orphaned libraries and binaries to purge
  • Libraries to migrate to GitHub Packages
  • Purge build outputs and update pipeline to use Actions Artifacts
  • Binaries to migrate to Git LFS
  • Binaries to migrate to other blob storage
  • Inputs into push rules
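
The Binaries branch of the tree above can be sketched as a small classification function. The size threshold stands in for the unspecified *n*-MB in the diagram, and the "frequent update" cutoff, the `TrackedFile` type, and the sample files are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrackedFile:
    path: str
    size_mb: float
    changes_per_year: int


def choose_storage(f, size_threshold_mb=50.0, frequent_changes=12):
    """Follow the Binaries decision tree for a single file.

    size_threshold_mb is a placeholder for the n-MB limit in the diagram;
    frequent_changes is an assumed definition of "updated frequently".
    """
    if f.size_mb > size_threshold_mb:
        return "git-lfs"  # large binary
    if f.changes_per_year >= frequent_changes:
        return "git-lfs"  # below the limit but updated frequently
    return "git"  # small and infrequently updated


if __name__ == "__main__":
    files = [
        TrackedFile("assets/intro.mp4", 300.0, 1),
        TrackedFile("assets/icon.png", 0.2, 1),
        TrackedFile("assets/texture.psd", 20.0, 40),
    ]
    for f in files:
        print(f.path, "->", choose_storage(f))
```

The output of such a pass is a starting list of paths to track with Git LFS; libraries and build outputs follow the other branches of the tree and are not covered by this sketch.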
This post is licensed under CC BY 4.0 by the author.