Tuesday, February 21, 2012

SVN vs. DVCS

First post for 2012.

I was going to make this an SVN vs. Mercurial post, since I've recently been pushing Mercurial as a replacement for SVN at my current workplace. But such a post would miss the fact that what I'm really pushing is a migration away from SVN to a Distributed Version Control System (DVCS). Mercurial is just the DVCS the company I work for has settled on; there are plenty of other options, each better suited to a particular community or purpose. So SVN vs. DVCS it is.

I'll preface this post with a couple of disclaimers. First, yes, some of this is just my opinion. Second, SVN isn't terrible; there are still millions of software developers all over the world who use it daily as their version control system of choice (even if most of the hacker/open-source world has migrated away from it already). My argument is just that a DVCS, and the workflow that comes with it, is better.

First, let's start with the word Distributed. Distributed Version Control Systems have this word in their name because of the way they spread the repository out to all users. In SVN you check out a particular revision of a particular branch or tag, and that is all that is stored on your local machine; with a DVCS, each user clones a full copy of the repository. This means every user has all branches, all tags, and all revisions of all files.

You might think that copying every tag, branch and revision would be very slow and take up a lot of space. DVCSs mostly get around this by being smart about how they store and transmit changes: Git packs its objects into compressed, delta-encoded pack files and runs garbage collection to keep on top of the worst of the repository bloat, while Mercurial stores most file revisions as deltas against earlier revisions instead of as whole files.
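To make the delta idea concrete, here's a rough Python sketch of a store that keeps one full snapshot of a file and then only the changed lines for each later revision. This is an illustration under my own assumptions, not how Mercurial or Git actually lay out their data; `DeltaStore`, `make_delta` and `apply_delta` are made-up names, and the diffing comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Return the edits that turn the list of lines `old` into `new`."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Keep only the changed regions; unchanged lines are not stored again.
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the newer revision from `old` plus a delta."""
    result, pos = [], 0
    for i1, i2, replacement in delta:
        result.extend(old[pos:i1])   # copy the untouched lines before the edit
        result.extend(replacement)   # splice in the changed lines
        pos = i2                     # skip the lines the edit replaced
    result.extend(old[pos:])
    return result


class DeltaStore:
    """Hypothetical store: one full snapshot, then one delta per revision."""

    def __init__(self, first_revision):
        self.base = list(first_revision)
        self.deltas = []

    def add(self, lines):
        # Diff against the latest revision and keep only the difference.
        latest = self.checkout(len(self.deltas))
        self.deltas.append(make_delta(latest, lines))

    def checkout(self, rev):
        # Revision 0 is the snapshot; revision n replays the first n deltas.
        lines = self.base
        for delta in self.deltas[:rev]:
            lines = apply_delta(lines, delta)
        return list(lines)
```

Any revision can be rebuilt by replaying deltas from the snapshot, so a thousand small edits to a large file cost roughly a thousand small deltas rather than a thousand copies of the file.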

Because the repository is spread out to all users, a DVCS lets users commit to their local repository as well as push their changes to a remote host. The biggest change in workflow between SVN and a DVCS is that it is now possible to commit any change to source control without affecting other users. I feel that this is the most significant benefit of a DVCS on a day-to-day basis; the ability to commit early and often proves invaluable when it comes to experimentation and large chunks of development work. This has flow-on effects such as improving merge operations: changes are divided up into smaller chunks, and a DVCS focuses on the history of changes between revisions rather than only the absolute state of a file at the two revisions being merged.

The workflow in a DVCS is generally divided up into at least three steps:

  • Pull changes from the central server or from another user.
  • Commit changes to the local copy of the repository.
  • Push the changes in the current repository state to the central server or another user.

One of my colleagues at work pointed out a very real downside to this pull-commit-push workflow: what if people simply work in their local copy of the repository for a week and something happens to their computer, or the office burns down? The central server is backed up, but individual developer machines aren't. I mumbled something about no one being silly enough to do such a thing, but it raised an interesting point. Personally I think this would be the equivalent of someone working in an SVN working copy for a week and not committing, but a DVCS lulls the user into a false sense of security on this front, because once files are committed locally they're in version control, right? Sure, but that doesn't guarantee the data is replicated or safe. The sensible habit is to push to a server or another user at least once every day or two.
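The pull-commit-push cycle, and the backup gap my colleague spotted, can be modelled with a toy like the one below. It's only a sketch: `Repo` and its methods are names I've invented for illustration, and a real DVCS tracks changesets by hash and parentage rather than by description.

```python
class Repo:
    """Toy repository: just an ordered list of changeset descriptions."""

    def __init__(self, name):
        self.name = name
        self.changesets = []

    def commit(self, message):
        # A commit only touches this repository; nobody else sees it yet.
        self.changesets.append(message)

    def pull(self, other):
        # Copy across any changesets that `other` has and we lack.
        for cs in other.changesets:
            if cs not in self.changesets:
                self.changesets.append(cs)

    def push(self, other):
        # Pushing is just a pull seen from the other side.
        other.pull(self)

    def outgoing(self, other):
        # Work that exists only here: what would be lost along with this machine.
        return [cs for cs in self.changesets if cs not in other.changesets]
```

If a laptop repository pulls from the server and then commits twice, `laptop.outgoing(server)` lists both commits until `laptop.push(server)` is called, after which it is empty. Mercurial has a real command in the same spirit, `hg outgoing`, which shows the changesets that haven't reached the other repository yet.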

But how do we deal with large chunks of isolated development in a DVCS repository? In SVN we would create a branch, commit to the branch, and then merge the branch back into the main trunk of the repository once we are finished. This process has never been as fluid or seamless as was promised in the early days of SVN, especially on large repositories and large changes. In a DVCS, SVN-style branch and tag directories are replaced by the concept of labels. Because the repository history is encoded as a string of changes, the head/trunk of the repository is just a pointer to the most recent change in that string. A branch is just a pointer to a different string of changes, which may share its earlier changes with many other branches or tags. This means that to create a branch, you simply create a new label and assign it to the changeset you want to base the branch on, leaving the current head label where it is. If you need to merge a branch back into the trunk, the DVCS works out which changes the branch has made since it split off and applies them on top of the changes made in the trunk.
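Here's a small Python sketch of the "a branch is just a pointer" idea. It's a toy under my own assumptions (the `History` class and the changeset ids are invented), and it is closest to how Git branches and Mercurial bookmarks behave; Mercurial's named branches record the branch name inside each changeset instead.

```python
class History:
    """Toy changeset graph: each changeset records its parent(s)."""

    def __init__(self):
        self.parents = {}   # changeset id -> tuple of parent ids
        self.labels = {}    # label name -> changeset id it points at

    def commit(self, label, cs_id):
        # Add a changeset on top of whatever the label points at, then move the label.
        tip = self.labels.get(label)
        self.parents[cs_id] = (tip,) if tip is not None else ()
        self.labels[label] = cs_id

    def branch(self, new_label, from_label):
        # Creating a branch copies a pointer; no changesets are duplicated.
        self.labels[new_label] = self.labels[from_label]

    def ancestors(self, cs_id):
        # Every changeset reachable from cs_id, including cs_id itself.
        seen, stack = set(), [cs_id]
        while stack:
            cs = stack.pop()
            if cs not in seen:
                seen.add(cs)
                stack.extend(self.parents[cs])
        return seen

    def merge(self, into, other, cs_id):
        # A merge is a changeset with two parents; only the target label moves.
        self.parents[cs_id] = (self.labels[into], self.labels[other])
        self.labels[into] = cs_id
```

With two commits on a "trunk" label, a "feature" label branched from it, and one further commit on each, the two labels point at different changesets but `ancestors` shows they still share the first two. That shared history is what lets a merge work out what each side has changed since the split.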

Overall I think a DVCS has huge advantages over a standard SVN workflow. The focus on changes, rather than on the instantaneous state of a repository, more closely aligns with a software development workflow. Decentralizing version control allows finer-grained control over commit points and lets users commit code to source control even when doing so would have a negative effect on other users in an SVN-based setup. SVN chains users to a repository server, while a DVCS lets users be the drivers of information flow.

Don't take my word for it; check out Mercurial, Git, or any of the other awesome distributed version control systems out there!