This proposal describes a way to efficiently distribute updates to installed applications using binary patches.

Last Significant Update:

2024-10-10

Status:

Draft

Comments to:

jkoshy@NetBSD.org


A number of popular application packages are now close to the 100MB mark in size.

Table 1. Examples of Large Applications
Application Category Platform Approximate Size (MB)

Brave™

Web Browser

Ubuntu/x86_64

113

clang

Program Development

NetBSD/amd64

51

Electric

Computer Aided Design

NetBSD/amd64

87

Firefox™

Web Browser

NetBSD/amd64

58

Signal™ (Desktop)

Instant Messaging

Ubuntu/x86_64

117

Many of these large applications are also updated frequently.

Because these applications are currently distributed as archives (.tgz, .deb, etc.), updating the applications installed on a system usually requires users to download hundreds of MBs of data.  This makes updating the OS costly and time-consuming for users who are on metered or low-bandwidth connections to the Internet.

The Proposed Protocol

We could exploit a few aspects of the current package distribution process to make it more bandwidth efficient:

  1. There is likely a local cache of package files that could be leveraged for our purpose.

    For example, Debian’s apt package manager keeps a cache of installed packages under /var/cache/apt/archives/.

  2. The changes between two released package versions are likely to be small relative to the size of the overall package.  For example, security updates are likely to make small changes to executables and shared libraries, leaving the rest of the package unchanged.

The Basic Idea

Diagram
  1. Prior to committing to downloading a large package, the client first looks for a binary patch that could transform an existing package version[1] to the desired version.[2]

  2. The system obtains a binary patch for transforming one of the package versions it has (version Vz) to its desired version Vp.

  3. The system applies this binary patch to its copy of package with version Vz, and checks whether the patched file has the expected checksum for version Vp.

  4. If the patch succeeds, and if the checksum of the patched file matches the expected checksum, the system can proceed with the package upgrade as if it had downloaded the full package archive from the upstream repository.  Otherwise, it can fall back to downloading the full archive for version Vp from upstream.

Assumptions

This proposal assumes that:

  1. It is feasible to generate binary patches between two package versions.

  2. That such binary patches would be significantly smaller than the packages themselves.

These assumptions need to be validated against real-world packages.

CDN-Friendly Protocol Version

It turns out that we can implement initial “query for patches” by placing these binary patches at pre-defined locations in the package archive’s namespace.

Diagram
  1. The client retrieves a list of known patches from its cached package version (Vx) to subsequent package versions by retrieving a file at a “well-known” location in the package repository.

If a binary patch from version Vx to the desired version Vp is present in this list, this patch would in turn be retrieved via the content delivery network.

Diagram

To assist clients that update their systems infrequently, the package server could keep around pre-computed patches from prior package versions Vx to later versions Vx+1, Vx+2, …, up to some configured limit.

multvers

Protocol Fallback

If the package server has no patch that transforms the cached package version Vx to the desired package version Vp, then the protocol would fall back to a full download.

Diagram

Publishing Process Changes

When a new package version is published at the package repository, we would need to generate binary patches from the previous ‘N’ package versions to the latest published version.  The CDN being used to distribute packages (e.g., cdn.netbsd.org) would have the previously published package versions needed to generate these binary patches.

Additional Improvements

Using generic binary patch formats (such as BSDiff or RFC 3284) may suffice for many packages, perhaps after first unpacking the package archive to avoid the changes introduced by the archiving process itself.

Format-Specific Binary Patches

We may be able to further improve the succinctness of the generated patches by using application-specific knowledge.  For example, if the difference between two packages is due to a small change to an ELF executable, we could generate an ELF-aware update (i.e., a file-format-specific patch) that adjusts text segment contents and/or symbol tables for the affected ELF object alone.  Similarly, other kinds of data (e.g., image files) could use their own kinds of succinct, format-specific, patches.

Depending on the changes involved, a collection of such ‘format-specific’ patches could be more concise than a ‘generic’ binary patch for the whole package.

Downloadable Patch Algorithms

The use of 'format-specific' patches however creates a problem for deployment, in that it could take a while for the support for new format-specific patch algorithms to percolate through the installed base of systems.  Till that happens, we would need to keep supporting older, less efficient, patches alongside the more efficient ones — which adds to the operational burden when releasing new application versions.  Systems that are updated infrequently would also run the risk of being ‘too old’ to process newer kinds of format-specific patches.

However, if we encode the implementation of format-specific patch algorithms using portable bytecode (say, as WASM bytecode), then we could distribute binary patch algorithms just like we do binary patches.  Both the patch algorithm bytecode and the data needed for the patch would be cacheable at the edge.

Such patch algorithm bytecode would execute in a tightly-controlled environment, with no access to the external state apart than the package data being transformed.

Diagram

With this approach a new format-specific patching method could be supported by simply directing the package manager to the appropriate ‘algorithm id’ to use.

Patch and Algorithm Signing

To reduce the risk of man-in-the-middle attacks we could consider signing patches and algorithms and have the package installer check these signatures at install time.  The signatures could be placed at a “well-known” location in the package repository, say something like: /$PKGNAME/sigs/$SOMEPATCHNAME.sig.[3]

Compatibility

Backwards compatibility

Older systems that are unaware of the existence of binary patches will continue to work with this protocol.  Such systems would go directly to the ‘full download’ step.

Forwards compatibility

Newer systems connecting to older package servers (servers not offering patches or patch-algorithms) will fall back to full downloads.

Pros and Cons

Pros

  • Could help users on metered Internet connections to save on usage costs.

  • Could help users on low-bandwidth connections to update their systems faster.

  • Could help users with intermittent network connectivity to make efficient use of their time online.

  • Could reduce bandwidth costs for package archives and their mirrors.

  • Offers users potentially faster updates, although this depends on the relative speeds of network and disk access on the system being updated.

Cons

  • The additional file fetches may be significant for high-latency connections, particularly if no binary patch is found and the fallback path ends up being taken.

  • The package publishing process becomes more complex, as binary patches would need to be generated when a new package is released.

  • Depending on the patch generation algorithm in use, the binary patch generation step could potentially be resource-intensive.  This may not be an issue if the patch generation is performed on a well-provisioned build system.

  • The package management tool on the system being updated will also need to be be changed to implement this protocol.

BSDiff

BSD-licensed tools for binary patching.

RFC 3284

The VCDIFF Generic Differencing and Compression Data Format.

XDelta

A C library and command-line tool for delta compression per RFC 3284.

Delta RPMs

The Fedora Project supports the notion of ‘deltas’ between published packages.


1. Instead of version numbers, we could also use file checksums for naming versions Vx , Vy, etc.
2. The same protocol could be used for updating local package lists too.
3. NetBSD’s pkg_admin utility is already able to sign packages, and pkg_add can be configured to check these at installation time.  We could extend this functionality to patches and algorithm bytecode too.