Text Diff for Code Review: How Diffing Works and Why It Matters

15 min read

Every pull request, every merge conflict, every git log -p output — they all rely on text diffing. Yet most developers treat diff output as a black box: green lines are additions, red lines are deletions, move on. Understanding how diff algorithms actually work makes you faster at code review, better at resolving merge conflicts, and more effective at structuring commits that are easy for your team to review.

What a Diff Actually Computes

A diff algorithm takes two sequences of text (usually lines) and computes a minimum edit script: the smallest set of insertions and deletions that transforms the first text into the second. Finding it is equivalent to solving the Longest Common Subsequence (LCS) problem, which has been studied in computer science since the 1970s: every line outside the longest common subsequence has to be either deleted or inserted.

Consider two simple files:

# File A          # File B
function greet()  function greet()
  puts "hello"      puts "hello"
  puts "world"      puts "goodbye"
end               end

The diff identifies that line 3 changed from puts "world" to puts "goodbye". In unified diff format, this appears as a deletion of the old line followed by an insertion of the new one. But the algorithm does not actually understand "change" — it only understands insert and delete. A "changed" line is simply a delete-then-insert at the same position. This distinction matters when you are reading complex diffs where multiple adjacent lines change simultaneously.
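To make the delete-then-insert point concrete, here is a minimal Python sketch that derives an edit script from a textbook dynamic-programming LCS table. It is an illustration only, not what Git runs, and the function name lcs_edit_script is made up for this example.

```python
def lcs_edit_script(a, b):
    """Return (op, line) pairs where op is ' ' (keep), '-' (delete) or '+' (insert)."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start, emitting only keep/delete/insert steps.
    script = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            script.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            script.append(("-", a[i]))
            i += 1
        else:
            script.append(("+", b[j]))
            j += 1
    script.extend(("-", line) for line in a[i:])
    script.extend(("+", line) for line in b[j:])
    return script


file_a = ["function greet()", '  puts "hello"', '  puts "world"', "end"]
file_b = ["function greet()", '  puts "hello"', '  puts "goodbye"', "end"]
for op, line in lcs_edit_script(file_a, file_b):
    print(op + line)
```

For the two files above, the script contains no "change" operation: the third line shows up as one '-' entry followed by one '+' entry.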

The Myers Diff Algorithm

The default algorithm behind git diff, GNU diff, and most programming libraries is Eugene Myers' 1986 algorithm. It finds the shortest edit script — the minimum number of insertions and deletions — using a greedy approach that explores an edit graph.

Think of the edit graph as a grid where the x-axis represents lines in the original file and the y-axis represents lines in the modified file. Moving right means deleting a line from the original. Moving down means inserting a line from the new version. Moving diagonally means a line is unchanged (a match). The algorithm finds the path from the top-left corner to the bottom-right corner that uses the most diagonal moves — maximising matching lines and minimising edits.

The time complexity is O(N * D) where N is the total length of both files and D is the size of the minimum edit script. For files that are mostly similar (small D), this is nearly linear. For completely different files, it degrades toward O(N²). In practice, code review diffs are usually small changes to large files, so Myers performs extremely well.
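The greedy search over the edit graph can be sketched in a few lines of Python. This version only returns D, the size of the shortest edit script, and does not reconstruct the path; the function name myers_edit_distance is an assumption for illustration, and real implementations add refinements such as the linear-space variant.

```python
def myers_edit_distance(a, b):
    """Return the number of insertions plus deletions in a shortest edit script."""
    n, m = len(a), len(b)
    # furthest[k] is the largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # move down: insert a line from b
            else:
                x = furthest[k - 1] + 1    # move right: delete a line from a
            y = x - k
            # Follow the diagonal for free while lines match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


print(myers_edit_distance("ABCABBA", "CBABAC"))
```

The outer loop runs once per edit, which is where the dependence on D comes from: two nearly identical files finish after only a few rounds regardless of their length.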

When Myers Gets It Wrong: The Patience Diff Alternative

Myers optimises for the shortest edit script, but the shortest diff is not always the most readable. Consider this refactoring where a function is moved:

# Before           # After
def validate()     def process()
  check_input()      run_pipeline()
end                end

def process()      def validate()
  run_pipeline()     check_input()
end                end

Myers might match the end keywords across the wrong functions, producing a confusing diff that appears to change the bodies of both functions rather than simply showing that they swapped positions. The output would be technically correct (minimum edits) but cognitively expensive to review.

Patience diff (available in Git via git diff --patience) takes a different approach. It first identifies unique lines that appear exactly once in both files — these become anchor points. Then it builds the diff around those anchors. Because function signatures and unique comments tend to be unique lines, patience diff produces more readable output for structural changes, function reordering, and block movements.
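The anchor-finding step can be sketched as follows. This is a simplified illustration, assuming whole-file input and no recursion into the gaps between anchors (which a full patience diff would do); patience_anchors is a hypothetical name.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for lines usable as anchors."""
    count_a, count_b = Counter(a), Counter(b)
    pos_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Lines that occur exactly once in both files, in order of appearance in a.
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_in_b]

    # Longest increasing subsequence of the b-indices (patience sorting),
    # so the chosen anchors appear in the same order in both files.
    tails, tail_pair, prev = [], [], [None] * len(pairs)
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tail_pair.append(p)
        else:
            tails[k] = j
            tail_pair[k] = p
        prev[p] = tail_pair[k - 1] if k > 0 else None

    anchors = []
    p = tail_pair[-1] if tail_pair else None
    while p is not None:
        anchors.append(pairs[p])
        p = prev[p]
    return anchors[::-1]


before = ["def validate()", "  check_input()", "end", "",
          "def process()", "  run_pipeline()", "end"]
after = ["def process()", "  run_pipeline()", "end", "",
         "def validate()", "  check_input()", "end"]
print(patience_anchors(before, after))
```

Note that the repeated end lines never become anchors because they are not unique, which is exactly what stops them from being matched across the wrong functions.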

Git also offers the --histogram algorithm, which extends patience diff to handle lines that occur more than once (low-occurrence common elements) and is often faster. Git itself still defaults to Myers, but you can switch globally with git config --global diff.algorithm histogram.

Reading Unified Diff Format

The unified diff format is what you see in pull requests, git diff output, and patch files. Here is a complete example with annotations:

--- a/src/auth.js                 ← Original file path
+++ b/src/auth.js                 ← Modified file path
@@ -12,4 +12,7 @@ function login(user) {  ← Hunk header
   const token = generateToken();  ← Context line (unchanged)
   const expiry = Date.now();      ← Context line
-  return { token, expiry };       ← Deleted line
+  const refreshToken = uuid();    ← Added line
+  return {                        ← Added line
+    token, expiry, refreshToken   ← Added line
+  };                              ← Added line
   }                               ← Context line

The hunk header @@ -12,4 +12,7 @@ tells you that the hunk covers 4 lines of the original starting at line 12 (3 context lines plus 1 deletion) and 7 lines of the modified version starting at line 12 (3 context lines plus 4 additions). The function name after the second @@ is a Git convenience that shows the nearest enclosing function, which is very useful for navigating large diffs.

Context lines (prefixed with a single space rather than + or -) are unchanged lines shown for reference, typically 3 lines above and below each change. You can adjust this with git diff -U5 for 5 lines of context or -U0 for no context at all.
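If you ever need to process patches in a script, the hunk header is easy to parse. The Python sketch below is a minimal example; parse_hunk_header and the sample header are made up for illustration, and it relies on the unified-format rule that an omitted line count means 1.

```python
import re

# @@ -old_start[,old_len] +new_start[,new_len] @@ optional section heading
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")


def parse_hunk_header(line):
    """Split a unified-diff hunk header into its numeric fields and heading."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_len, new_start, new_len, section = match.groups()
    return {
        "old_start": int(old_start),
        "old_len": int(old_len) if old_len is not None else 1,
        "new_start": int(new_start),
        "new_len": int(new_len) if new_len is not None else 1,
        "section": section,
    }


# Hypothetical header, not taken from a real repository.
print(parse_hunk_header("@@ -40,6 +40,8 @@ def handler(request):"))
```

A quick sanity check when reading any hunk: the old length should equal context lines plus deleted lines, and the new length should equal context lines plus added lines.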

Diff Strategies for Better Code Reviews

Review Commit by Commit, Not the Full PR

A well-structured pull request tells a story through its commits. Reviewing the full diff shows you the destination but not the journey. Reviewing each commit separately shows the reasoning behind each change. In GitHub, click the "Commits" tab on a PR to review this way. On the command line, git log --oneline main..feature-branch lists the commits that are on the feature branch but not on main, and git show <hash> shows each one.

Use Word-Level Diffing for Prose and Config

Line-level diffs are perfect for code but terrible for prose, configuration files with long lines, or minified content. Git supports word-level diffing with git diff --word-diff, which highlights individual changed words within a line rather than flagging the entire line as modified. For JSON and YAML config files, this makes small value changes immediately visible instead of forcing you to scan long lines character by character.
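The same idea is easy to reproduce outside Git. Here is a small Python sketch using the standard library's difflib on whitespace-separated tokens; the function name word_diff is an assumption, and the [-...-] and {+...+} markers simply imitate the plain --word-diff output style.

```python
import difflib


def word_diff(old, new):
    """Mark removed words as [-...-] and added words as {+...+}."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff('"timeout": 30, "retries": 3', '"timeout": 60, "retries": 3'))
# "timeout": [-30,-] {+60,+} "retries": 3
```

Splitting on whitespace is the simplest possible tokeniser; a real tool would usually treat punctuation as separate tokens so the trailing comma is not part of the change.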

Ignore Whitespace When It Doesn't Matter

Reformatting commits — running Prettier, changing indentation, normalising line endings — produce enormous diffs that contain zero logical changes. Use git diff -w to ignore all whitespace changes, or git diff --ignore-space-change to ignore only changes in the amount of whitespace (preserving additions and deletions of whitespace). In GitHub PRs, the "Hide whitespace changes" toggle does the same thing.
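The difference between the two options is easiest to see as two normalisation rules applied before lines are compared. The Python sketch below approximates them; the function names are made up, and Git's real implementation differs in details.

```python
import re


def same_ignoring_space_change(a, b):
    """Approximates --ignore-space-change: runs of whitespace are equivalent
    and trailing whitespace is ignored, but whitespace versus none still differs."""
    def normalise(s):
        return re.sub(r"\s+", " ", s).rstrip()
    return normalise(a) == normalise(b)


def same_ignoring_all_space(a, b):
    """Approximates -w: all whitespace is dropped before comparing."""
    def strip_all(s):
        return re.sub(r"\s+", "", s)
    return strip_all(a) == strip_all(b)


print(same_ignoring_space_change("if (x)    {", "if (x) {"))  # True
print(same_ignoring_space_change("if(x){", "if (x) {"))       # False
print(same_ignoring_all_space("if(x){", "if (x) {"))          # True
```

Be careful with -w in whitespace-sensitive languages such as Python or YAML, where an indentation change can be a logic change.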

Detect Moved Code with --color-moved

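Git 2.15 introduced --color-moved, which paints code that was moved from one location to another in colours distinct from ordinary additions and deletions. This is invaluable for refactoring reviews where functions or blocks are reorganised without modification. Enable it by default with git config --global diff.colorMoved default (currently a synonym for the zebra mode); the dimmed-zebra mode additionally dims the interior of moved blocks so that only their boundaries stand out.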
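The core idea behind move detection is that a moved line shows up on both the removed side and the added side of a diff. The Python sketch below is a deliberately crude approximation using difflib, not Git's algorithm (which works on blocks and has minimum-size rules); moved_lines is a hypothetical name.

```python
import difflib


def moved_lines(old, new):
    """Return removed lines that also appear among the added lines."""
    removed, added = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    added_set = set(added)
    # Blank lines are skipped because they match almost anywhere.
    return [line for line in removed if line.strip() and line in added_set]


before = ["def validate()", "  check_input()", "end", "",
          "def process()", "  run_pipeline()", "end"]
after = ["def process()", "  run_pipeline()", "end", "",
         "def validate()", "  check_input()", "end"]
print(moved_lines(before, after))
```

For the function swap shown earlier, every non-blank removed line also appears as an added line, which tells a reviewer that nothing was actually rewritten.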

Semantic Diff: Beyond Line Matching

Traditional diff tools treat source code as plain text. They have no understanding of syntax, so a moved function looks identical to a deleted-and-recreated function. Semantic diff tools parse the abstract syntax tree (AST) of the code and compare structural elements rather than text lines.

Tools like Difftastic and GumTree understand the syntax of many languages. Depending on the tool, they can show that a function was moved without modification, that a variable was renamed consistently, or that an if block was wrapped around existing code. All of these produce noisy line-level diffs but are structurally simple changes.
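A tiny version of the same idea fits in a few lines of Python using the standard ast module. This sketch only answers "did the structure change at all?" for Python source, which is far less than a real semantic diff tool does; same_structure is a made-up name.

```python
import ast


def same_structure(src_a, src_b):
    """True when two Python sources parse to the same syntax tree.

    ast.dump omits line and column numbers by default, so formatting,
    redundant parentheses and comments do not affect the result.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


print(same_structure("total = (1 +  2)  # sum\n", "total = 1 + 2\n"))  # True
print(same_structure("total = 1 + 2\n", "total = 1 + 3\n"))            # False
```

A check like this can confirm that a formatting-only commit really contains no logic changes before anyone spends time reviewing it line by line.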

You can configure Git to use an external diff tool: git config --global diff.external difft (for Difftastic) or git difftool for one-off comparisons. The trade-off is speed — AST parsing is slower than line-level comparison, so semantic diff tools may feel sluggish on very large changesets.

Diffing in Different Contexts

Pull Request Reviews

GitHub, GitLab, and Bitbucket all render diffs with syntax highlighting, inline commenting, and file-level navigation. Three habits make PR review more effective: mark files as viewed once you have reviewed them so they collapse (GitHub remembers this across sessions), learn the keyboard shortcuts (on GitHub, press ? to list the ones available on the current page), and leave a summary comment on the PR rather than scattering feedback across individual lines when the issue is architectural.

Merge Conflict Resolution

Merge conflicts are really three-way diffs: the common ancestor (base), your version (ours), and the incoming version (theirs). Understanding this helps enormously. Use git config --global merge.conflictStyle zdiff3 (Git 2.35 or later; use diff3 on older versions) to show all three versions in conflict markers instead of just two. The base version tells you what was there before either side made changes, which often makes the correct resolution obvious.
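The reason the base version is so useful becomes clear when the three-way rule is written out. Here is a minimal Python sketch for a single region of text, assuming the regions have already been aligned; merge_region is a hypothetical helper, not part of Git.

```python
def merge_region(base, ours, theirs):
    """Resolve one aligned region of a three-way merge.

    Returns the merged text, or None when the region is a genuine conflict.
    """
    if ours == theirs:
        return ours      # both sides agree (or neither side changed)
    if ours == base:
        return theirs    # only their side changed
    if theirs == base:
        return ours      # only our side changed
    return None          # both sides changed it differently


print(merge_region("timeout = 30", "timeout = 30", "timeout = 60"))  # timeout = 60
print(merge_region("timeout = 30", "timeout = 45", "timeout = 60"))  # None
```

Without the base, the first case is indistinguishable from the second: you only see two different values and cannot tell which side actually made a change.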

Database Migrations

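Schema diffs are a specialised relative of text diffing where the output is a migration script. Tools such as SQLite's sqldiff and Alembic's autogenerate feature compare two schemas (or a schema and a model definition) and emit the changes needed to get from one to the other, ultimately ALTER TABLE and similar statements; migration runners such as Flyway then apply scripts like these in version order. The goal is the same as in text diffing, a minimal set of edits from one state to another, but the comparison is keyed on named objects (tables, columns, indexes) rather than on an ordered sequence of lines, and the "edit operations" are SQL DDL statements rather than line insertions and deletions.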
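As a toy illustration of DDL statements as edit operations, the Python sketch below compares two column maps for one table. Everything here is an assumption for demonstration purposes: the function name, the dict-based schema representation, and the generic SQL, which ignores type changes, constraints, and dialect differences.

```python
def schema_diff(table, current, desired):
    """Return DDL that adds and drops columns to turn current into desired.

    current and desired map column names to SQL type strings.
    """
    statements = []
    for column in sorted(desired.keys() - current.keys()):
        statements.append(f"ALTER TABLE {table} ADD COLUMN {column} {desired[column]};")
    for column in sorted(current.keys() - desired.keys()):
        statements.append(f"ALTER TABLE {table} DROP COLUMN {column};")
    return statements


current = {"id": "INTEGER", "email": "TEXT", "legacy_flag": "INTEGER"}
desired = {"id": "INTEGER", "email": "TEXT", "created_at": "TIMESTAMP"}
for statement in schema_diff("users", current, desired):
    print(statement)
```

Real tools also have to decide whether a dropped column plus an added column is actually a rename, which is the schema equivalent of the moved-code problem in text diffs.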

Configuration Drift Detection

Infrastructure-as-code tools (Terraform, Ansible, Puppet) use diffing to show planned changes before applying them. terraform plan is essentially a diff between your desired state (config files) and the current state (cloud resources). Reading these diffs accurately prevents catastrophic infrastructure changes.
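The desired-versus-current comparison can be sketched in Python with two dictionaries. This is a simplified model, not how any of these tools is implemented; the function name plan_changes and the sample settings are made up, and the +, - and ~ prefixes imitate the create, destroy, and update-in-place symbols that terraform plan prints.

```python
def plan_changes(current, desired):
    """List the changes needed to move from current state to desired state."""
    lines = []
    for key in sorted(current.keys() | desired.keys()):
        if key not in current:
            lines.append(f"+ {key} = {desired[key]!r}")
        elif key not in desired:
            lines.append(f"- {key} = {current[key]!r}")
        elif current[key] != desired[key]:
            lines.append(f"~ {key}: {current[key]!r} -> {desired[key]!r}")
    return lines


# Hypothetical resource attributes.
current = {"instance_type": "t3.small", "monitoring": True, "old_tag": "x"}
desired = {"instance_type": "t3.large", "monitoring": True, "backup": "daily"}
print("\n".join(plan_changes(current, desired)))
```

The lines to read most carefully in a real plan are the removals, because an unexpected - usually means a resource is about to be destroyed.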

Performance Characteristics of Diff Algorithms

Algorithm    | Time Complexity          | Best For                                   | Available In
Myers        | O(N × D)                 | Shortest edit script, typical code changes | Git (default), GNU diff
Patience     | O(N log N + D²)          | Structural changes, function reordering    | git diff --patience
Histogram    | O(N × D) optimised       | General purpose, often faster than Myers   | git diff --histogram
Hunt-McIlroy | O(N² log N) worst case   | Original Unix diff, historical interest    | Legacy systems
Semantic/AST | Varies by tool           | Language-aware structural comparison       | Difftastic, GumTree

Common Diff Mistakes in Code Review

Reviewing reformatted code as logic changes. When a commit mixes formatting changes with logic changes, the diff becomes nearly unreadable. Encourage your team to separate formatting commits from logic commits. Better yet, enforce formatting via pre-commit hooks so reformatting diffs never reach pull requests at all.

Missing context in large diffs. If a pull request touches 40 files, there is a strong temptation to skim. But the bugs hide in the files you skip. If you cannot review a PR thoroughly in 30 minutes, it is too large. Break it up. SmartBear's study of code review at Cisco found that defect detection drops sharply beyond roughly 400 lines of code per review, and it recommends reviewing 200 to 400 lines at a time in focused sessions.

Anchoring on the diff instead of the result. Diffs show you what changed, not what the code looks like now. After reviewing the diff, open the full file at the target commit and read the affected functions in context. A change that looks fine in isolation might be inconsistent with surrounding code.

Ignoring test diffs. Test code is production code — it documents expected behaviour and catches regressions. Review test diffs as carefully as implementation diffs. If the logic changed but no tests changed, that is a red flag worth commenting on.

Automating Diff Analysis

Several tools can augment human code review by analysing diffs programmatically. Linters that run on changed files only (via git diff --name-only) catch style issues before review. Coverage diff tools show whether new code is covered by tests. Complexity analysis on changed functions flags regressions in code maintainability.
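A changed-files filter is usually the first building block. The Python sketch below wraps git diff --name-only; the function name, the default base branch, and the injectable run parameter (there only to make the function testable without a repository) are all assumptions for illustration.

```python
import subprocess


def changed_files(base="main", suffixes=(".py",), run=subprocess.run):
    """Return paths changed since the merge base with `base`, filtered by suffix."""
    result = run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [path for path in result.stdout.splitlines() if path.endswith(suffixes)]
```

Called inside a repository, changed_files() returns a list you can pass straight to a linter or test runner, so only the files touched by the branch are checked.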

For quick one-off comparisons during development — checking two versions of a config file, comparing API responses, or verifying text transformations — a browser-based diff tool is faster than setting up a full Git workflow. Our text diff tool runs entirely in your browser with syntax-highlighted output and no file uploads.
