Synopsis

Pangalactic aspires to become a decentralized data network tailored towards using deterministic computation for revision control and build systems. The flagship application built on this network is pg, a decentralized revision control tool, similar to git, monotone, mercurial, fossil, and so forth.

Pangalactic is very much an incomplete work in progress. Both the existing functionality and the documentation in this book are spotty.

Roadmap

  • derive hostapi call
    • introduce sequence std guest
  • attestation caching
  • planset concurrency support
  • ipfs store
  • pub/sub
  • rebrand
  • API docs
  • revision control tool MVP
  • hostenv
  • content encryption
  • garbage collection?
  • self-hosted revision control
  • self-hosted compiler
  • expand across the galaxy

Building Pangalactic

Users interact with pangalactic through a set of binaries:

  • pg - The primary high-level revision control binary; users who only want typical revision control can just use this.
  • pg-revcon - A lower-level utility for revision control, useful for scripting or automation.
  • pg-store - A Store interface, for users who want to interact with the Store directly instead of doing revision control.
  • pg-derive - A tool to derive deterministic computations within the Store.
  • pg-seed - A tool for initializing the Store.

Additionally, this book is developed alongside the code for the binaries.

These binaries are built from a rust workspace.1

Build Steps

The primary build and installation approach is via nix. This handles the entire build and installation process, including building binaries, building WASM components, rendering this book, generating diagrams, and so forth. It can also be used to run the full quality assurance checks performed by the Continuous Integration system. If you prefer not to use nix and to build/test the components "by hand", head over to the developer section on Development without nix.

  1. Install nix on your system.
  2. Enable the newer flake feature of nix, e.g. by setting export NIX_CONFIG="experimental-features = nix-command flakes" in your shell.2
  3. Retrieve the source code from the pangalactic project on Github.
  4. In the source code directory run nix build.
  5. This creates a symlink called result containing the binaries, book, and other artifacts:
  • result/bin/ - A directory with the binaries.
  • result/doc/pangalactic/index.html - The rendering of this book.

  1. This workspace contains many rust crates which enable rust developers to extend and build upon pangalactic. If you are interested in extending or building rust code atop pangalactic, head over to the Let's Hack! chapter.

  2. The developers of pangalactic consider it a bug of nix that this is not enabled by default.

Vision

TODO: Sort out terminology, especially tech-concept-words.

Let me tell you a tale...

Alice invites Bob to begin collaborating on a project, "OSSM Sauce", so she sends Bob the publisher-id she uses as her main collaboration nexus for the project.

Bob subscribes to this publisher-id to download a copy of the source code, history, and built artifacts. He examines the main story Alice has arranged to get a feel for the development of the project. He likes the idea, and he's excited to be invited to collaborate, since OSSM Sauce has not yet been announced to the public and is privately shared among a small group of collaborators, even though they are all coordinating on this project and many others through the universal pangalactic network.

Bob wants to know OSSM Sauce is safe for people to use, especially himself, so he browses the current code, looking for vulnerabilities, or even worse, anything malicious. He notices some surprising functionality and reviews the main story again to understand how that functionality arose.

The main story shows months of incremental development, organized in a legible fashion, giving a feasible description of how this functionality could be developed from scratch. It reads a bit like a textbook introducing the design of a complex engineering system incrementally. It has some branching substories in its history that serve a bit like parenthetical or optional chapters in the development, giving Bob the choice of digging down into specific areas of development or continuing on with the primary sequence of events.

As Bob drills down into a particular substory, he sees a change that introduces a novel technique he had never learned about elsewhere. The diff from earlier revisions to this point seems to produce a breakthrough result, as if by magic or pure genius.

The change in question first introduces functional specifications for how this surprising feature should behave. This is part of the standard editorial policy of this branch, which constrains the changes in the substory so that they are guaranteed to pass the policy's automated deterministic criteria. Then, after defining these specifications, the next change simply produces the working functionality all at once.

Skeptical that this is how Alice actually figured this functionality out, he switches from viewing the main story or substories to analyzing the transaction ledger of the project. This shows an alleged chronological sequence of operations Alice performed to add changes to stories, describe bundles of changes as a coherent unit, rewrite or modify sequences of changes to replace them with new sequences, introduce/remove/rename/re-arrange stories with respect to each other, and so on. These operations are how Alice was able to rearrange the project's history into such a clear pedagogical main story. By reviewing this ledger, Bob is able to review previously existing stories which show earlier attempts at developing the feature and defining the specifications. Finally, with this insight, Bob has a closer understanding of how Alice had her breakthrough, and he sends her a message about what he learned from those older superseded stories.

Since Bob feels pretty confident that this project is safe to use directly, he launches the built artifacts on his system. He decides to use the downloaded artifacts rather than rebuilding them, because pangalactic verifies the proofs of computational integrity underlying the build process for the project, so he has very high confidence that they were produced by a very precisely specified toolchain. He's also confident every change in the stories he reviewed matched the editorial policy for those parts of the story, a check which relies on the same computational proving and validation system.

The artifacts launch in a popular runtime developed and built within the pangalactic system, which restrains artifacts from taking actions beyond the capabilities Bob explicitly grants to them. He passes it some testing/exploration capabilities, gets familiar with how OSSM Sauce operates, and learns how to manipulate his exploratory capabilities to his satisfaction; then he starts granting it more and more important capabilities influencing his broader life.

As he gains familiarity with OSSM Sauce, he begins introducing his own functional specifications and stories for new functionality and sends Alice his own publisher-id. Soon Alice and Bob decide to share OSSM Sauce with a wider group, although people outside that wider group cannot access any of OSSM Sauce's history, code, or artifacts without someone, somewhere inviting them. Bob begins collaborating with other acquaintances similar to how he collaborates with Alice, by sharing publisher-ids to bootstrap pangalactic connections.

Because they subscribe to each other this way, they are able to see updates from each other, which include suggestions for how Alice should modify her stories based on those Bob has produced (or vice versa), as well as issue tracking, annotation, and discussion systems to facilitate improving OSSM Sauce through a distributed, decentralized web of collaborators of unknown extent.

Finally, since pangalactic has become such a useful universal development collaboration system, the developers of pangalactic itself continue to improve it as humanity spreads to other stars, allowing pangalactic to truly live up to its name.

Plot hole: How do people afford to keep improving pangalactic?

Architecture

The pangalactic system is composed of several "infrastructure" layers for data storage & distribution, synchronization among collaborators, and deterministic computation. Atop this infrastructure is the flagship pg revision control app, although nothing prevents other applications from building on this infrastructure.

Warning: The text here is not fully reconciled with the code, including command examples!

Concepts

Before describing the pangalactic architecture, we introduce a few important concepts that help clarify the architectural design:

Signing and Verifying

Pangalactic makes heavy use of cryptographic signature schemes. However, it's important to dispel two persistent confusions about these schemes:

  • Signing and verification keys are not identified with people or connected to human "identity". Instead, they are "mere components" of the system. Code that relies on verifying signatures does this so the user can rely on beliefs about how the associated signing keys are controlled. Humans and/or code may control and manage any number of signing keys (via software) in pangalactic.
  • Signing and verification keys are cryptographic values used to sign data and verify signatures, and nothing about them is inherently public or private. A verification key may be broadcast widely, or it may be confined to a small private scope. Likewise with signing keys.
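
To make the first point concrete, here is a minimal sketch of detached signing and verification in rust, using the ed25519-dalek crate as a stand-in; pangalactic's actual signature scheme, crates, and key-management APIs may differ.

```rust
// Sketch only: detached signing/verification with ed25519-dalek 2.x.
// The keys are "mere values": nothing here ties them to a human identity.
use ed25519_dalek::{Signature, Signer, SigningKey, Verifier, VerifyingKey};
use rand::rngs::OsRng;

fn main() {
    // Any component may generate and control any number of signing keys.
    let signing_key: SigningKey = SigningKey::generate(&mut OsRng);
    let verifying_key: VerifyingKey = signing_key.verifying_key();

    let record = b"publisher update: seq=7";
    let sig: Signature = signing_key.sign(record);

    // Anyone holding the verification key can check the signature,
    // however widely (or narrowly) that key happens to be shared.
    assert!(verifying_key.verify(record, &sig).is_ok());
}
```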

Validity and Attestations

Below we often describe various kinds of data and their relation to each other. It is often the case that the relationship between data items "should be" as described here. What happens when it is not?

Since pangalactic is designed to encompass a universal network of collaborators sharing a universal data set, and any kind of participant may join, we want to protect a user from malicious behavior by other collaborators in the network.

Pangalactic software uses three methods to protect users:

  • zero-knowledge proofs of computational integrity aka ZKPs: when possible, metadata representing a relationship between two data items proves that the relationship correctly holds by including a third item, which is a ZKP for the relationship predicate.

  • signed attestations: where ZKPs are not-yet-implemented or impossible (because a relation is not mathematically provable), we rely on signatures to attest to the validity of a relationship. An example of this is when a publisher publishes a new update: that update may be arbitrary (and not a provable derivation of previous updates), so the signature serves to indicate the publisher authorized the update. (Recall that the signature scheme does not identify any human participant.)

  • sandboxing: where feasible, pangalactic sandboxes computations so that maliciously authored computations, or buggy ones, cannot cause harm beyond the sandbox.
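
The first two methods can be pictured as alternative kinds of evidence attached to a claimed relationship between data items (sandboxing is containment rather than evidence). A hypothetical rust sketch; none of these type names come from the pangalactic crates:

```rust
// Hypothetical model of validity evidence; illustrative names only.
#![allow(dead_code)]

/// A claimed relationship between two data items, plus evidence for it.
struct Claim {
    subject: Vec<u8>,
    object: Vec<u8>,
    evidence: Evidence,
}

enum Evidence {
    /// A zero-knowledge proof that the relationship predicate holds.
    Zkp(Vec<u8>),
    /// A signature attesting to validity where the relation is not
    /// (yet) provable, e.g. a publisher authorizing an arbitrary update.
    SignedAttestation { verifying_key: Vec<u8>, signature: Vec<u8> },
}
```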

With this section in mind, the rest of this chapter describes how data "should be" related to each other descriptively, without using qualifiers like "should".

Capabilities

We use the term capability as inspired by the capability security mindset. To say "X is a capability to Y" indicates that (if the network is properly available) knowing X is both sufficient and necessary to accomplish Y.

Verification can be Outsourced

By relying on signatures, encryption, and ZKPs (especially the zero-knowledge property) we can often outsource verification, so that a third party service can verify that a data relation is valid and available without the capability to read that data! This outsourcing relationship can be expressed with verification capabilities.1
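
As an illustration of verify-without-read: if blobs are stored encrypted and indexed by a hash of the ciphertext, a third party holding only the hash can confirm the data is intact and available without being able to decrypt it. A minimal sketch, assuming a blake3-style content hash (the real construction lives in the pangalactic crates):

```rust
// Sketch: outsourced verification. The verifier holds only a content
// hash and the ciphertext; it can check availability and integrity
// without any capability to read the plaintext.
fn verify_available(expected_hash: &str, ciphertext: &[u8]) -> bool {
    blake3::hash(ciphertext).to_hex().to_string() == expected_hash
}

fn main() {
    let ciphertext = b"...opaque encrypted bytes...";
    let expected = blake3::hash(ciphertext).to_hex().to_string();
    assert!(verify_available(&expected, ciphertext));
}
```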

Authorization / Distribution Layering

Pangalactic separates authorization decisions (such as which components can authorize updates to data or which may read data) from distribution of data across the network.

Authorization never relies on third party authorities, including in-network mechanisms such as routing, naming, or remote host access control. It relies only on capabilities, ZKPs, and attestations.

However, authorization and distribution are not entirely orthogonal because a good distribution design ensures that data flows to where it can be read rather than to areas where it cannot be read, and also that updates reach interested parties efficiently.

TODO: This chapter currently ignores most of the distribution design. This part of the design is underdeveloped at present.

Revision Control

The flagship application of pangalactic, pg, is a revision control tool.

Filesystem Structure

A user Alice can create a new workspace which consists of a local filesystem working directory and revision control book-keeping.

The book-keeping is stored within the .pg subdirectory of the working directory. It is further split into tracked and untracked state: the .pg/UNTRACKED directory contains untracked state, and every other child of .pg is tracked. The book-keeping metadata consists of small links into the Store; both links and the Store are described below in The Store section.

A workspace is always associated with a publisher which is a means of controlling how users can subscribe to the changes of the workspace via a pubid identifier.

The untracked book-keeping contains at least:

  • The publisher's key, which is a publication capability for producing new records of changes.2

The tracked book-keeping contains at least the most recent record of the changes in the revision history: .pg/tip.pgl.

Creating Records

A user saves changes in their working directory to create a new record. The process is as follows:

  1. The working directory, excluding .pg/UNTRACKED and any path configured to be ignored within .pg/config.toml, is inserted into the Store to produce a CID (described in The Store section).
  2. If the .pg/policy.pgx file exists, it is used to derive a revision (described below). If it is not present, a hardcoded no-op policy is used instead.
  3. A new directory is inserted into the Store with two links:
  • prev: a link to the previous record (which comes from .pg/tip.pgl in the no-op policy).
  • rev: a link to the revision from step 2.
  4. A record is created with the contents of step 3, signed by the publisher.
  5. That record is used to overwrite .pg/tip.pgl.
  6. The as-of-yet-unspecified "distribution system" ensures subscribers to this workspace receive the new record.
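
A compact sketch of those six steps as rust pseudocode; the types (Store, Cid, Publisher, ...) are illustrative stand-ins, not the real crates' APIs:

```rust
// Stand-in types for the record-creation flow; not the real crate APIs.
#![allow(dead_code)]

#[derive(Clone)]
struct Cid(String);
struct Store;
struct Publisher;
struct SignedRecord { prev: Cid, rev: Cid, signature: Vec<u8> }

impl Store {
    /// Step 1: insert the working directory (minus .pg/UNTRACKED and
    /// ignored paths) into the Store, producing a CID.
    fn insert_workdir(&mut self) -> Cid { Cid("workdir-cid".into()) }
    /// Step 2: run .pg/policy.pgx if present, else the no-op policy.
    fn derive_revision(&mut self, snapshot: &Cid) -> Cid { snapshot.clone() }
}

impl Publisher {
    /// Step 4: sign the directory containing the prev and rev links.
    fn sign(&self, prev: Cid, rev: Cid) -> SignedRecord {
        SignedRecord { prev, rev, signature: vec![] }
    }
}

fn save(store: &mut Store, publisher: &Publisher, tip: Cid) -> SignedRecord {
    let snapshot = store.insert_workdir();      // step 1
    let rev = store.derive_revision(&snapshot); // step 2
    // Step 3: a new directory with `prev` and `rev` links is inserted.
    let record = publisher.sign(tip, rev);      // step 4
    // Step 5: overwrite .pg/tip.pgl with `record`.
    // Step 6: the distribution system delivers `record` to subscribers.
    record
}
```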

Store-Space vs Filesystem-Space

The Store (described below) provides a directory & file abstraction. The complete revision history at any point is always "just" a directory structure in the Store. We refer to the layout and contents of this structure as Store-space to distinguish that it is the layout within the Store.

Meanwhile, the actual files and directories on a user's computer that make up a workspace have a different but related layout and contents, and we distinguish this as the Filesystem-space, or often we'll just say "on the fs" for short.

The no-op policy converts the fs-space layout from step 1 into a revision, where the layout of a revision isn't well specified yet...

Some things a revision needs to do:

  • Specify the revision.
  • Track public peer subscriptions?
  • Contain a snapshot of the working directory contents.

The Conventional Policy

By default, when a user creates a new workspace with pg init, a record is immediately and implicitly published that installs the Conventional Policy (aka the CoPo) as the policy for that record; this is permitted by the no-op policy.

The CoPo, among other things, provides new revision control features atop the lower host-platform-coded layer; in contrast to that layer, the CoPo itself lives at the sandboxed deterministic computation layer of derivations.

Transactions

CoPo requires every change to a repository to be a transaction. Transactions are high-level user-facing concepts that map well to the UX, such as "save the current working directory state", "annotate a recent revision with this descriptive log message", and "if the given revision passes language and application-specific deterministic checks, then append a new record attesting to that fact, else do nothing."

CoPo guarantees that the saved working directory state is the deterministic outcome of applying the associated transaction to the previous history. In this way, users can explore either snapshots of working directories or high-level transaction logs to understand the revision history, content with the knowledge that the relationship between these is preserved by the policy.
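
To make the shape of this concrete, here is a hypothetical sketch of the transaction kinds named above as a rust data type (not the real CoPo API):

```rust
// Hypothetical CoPo transaction type; variants mirror the user-facing
// operations described above. Illustrative names only.
#![allow(dead_code)]

struct RevisionId(String);

enum Transaction {
    /// "save the current working directory state"
    SaveWorkdir,
    /// "annotate a recent revision with this descriptive log message"
    Annotate { revision: RevisionId, message: String },
    /// "if the given revision passes language and application-specific
    /// deterministic checks, append a record attesting to that fact,
    /// else do nothing"
    AttestChecks { revision: RevisionId },
}
```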

Narratives and Releases

Since a workspace is associated with a single publisher, and a publisher distributes a linear sequence of records, the revision control history produced by a given workspace is a linear history.

CoPo introduces narratives to facilitate common revision control management techniques. A narrative is a name referring to a "managed" history of changes which can be modified semi-arbitrarily. CoPo revisions store a mapping of the narrative names to "narrative revisions" within the top-level revision.

When users need a complete auditable record of changes (for example while investigating the insertion of a backdoor), including changes to narrative histories, they can review the log of top-level records. OTOH, when they need to review managed, curated, or pedagogical histories, they can rely on narratives.

Narratives are vaguely akin to git branches, that is, mutable names referring to different semi-related sub-histories of the complete revision history. However, with CoPo there are many important differences:

  • Narratives are revision controlled, whereas git branches are local mutable state.
  • Narratives can be semi-arbitrarily modified to "ret-con" the presented sequence of changes (similar to git branch modifications like squashing, rebasing, etc...). Unlike git, the sequence of changes to narratives themselves is tracked in the linear revision history. This strikes a balance between the need for revision history to provide a complete auditable record and the need for a well-edited pedagogical narrative of how the latest revision could ideally / should have been developed.
  • Narratives are published to all subscribers of a workspace. (TODO: We could have a "local narrative" feature where they are stored in the untracked book-keeping area.)

Releases are narratives that live under the /release/... namespace and are more constrained by CoPo so that they cannot be overwritten and follow a constrained versioning format. (TODO: Should we allow overwrites of releases for "hot-fixes" or retractions?)

Pub/Sub

A core primitive of pangalactic is the Pub/Sub system, which has an authorization layer and (one or more) outer distribution layers.

Note: while there is low-level commandline access to direct pub/sub features via pg uth pubsub, all of the examples below use the higher level revision control commands, which manage pub/sub usage under the hood.

The Pub/Sub Capabilities Model

The Pub/Sub Capabilities Model or PSCM is the authorization layer of pub/sub: a network/IO-agnostic cryptographic data format which defines how users can publish and subscribe to update records.

Publishers

Any time Alice wishes to share a sequence of related update records with Bob and Charlie, she generates a new publisher. When using pg for revision control, this typically occurs under the hood when a user runs pg init.

A publisher is entirely controlled by a cryptographic signing key, which Alice typically stores privately. Again, when using pg for revision control, this is managed under the hood. With that in place, Alice then shares her publisher ID aka pubid with Bob and Charlie. The pubid for a revision control workspace is available with the pg info command.

Subscriptions

Bob and Charlie then take Alice's pubid and subscribe to it. Suppose the publisher Alice created is for a revision control workspace and Bob wants to acquire a new local copy: he uses the pg fork command to create a new local workspace and subscribe to Alice's pubid. Meanwhile, suppose the workspace Alice is hacking on is part of a project that Charlie is also hacking on, but the two haven't yet directly collaborated. In that case Charlie uses pg peer in his workspace to subscribe to Alice's workspace updates.

Records

Now that Bob and Charlie are subscribed to Alice's pubid, they can receive records from it. A publisher always produces a linear sequence of records. Records contain a sequence number which starts at 0 and increments each time Alice publishes a record with her publisher. Additionally, each record contains a previous CID and a current CID. (CIDs are described soon in The Store section.)

A record provides the pubid which produced it (either because it is literally stored in the record or it is derivable given the signature scheme and signature).
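
Here is a sketch of a record and the basic linearity check a subscriber can apply; field and type names are illustrative, not the real crates':

```rust
// Illustrative record shape: a sequence number plus previous/current
// CIDs lets subscribers confirm they are seeing one linear history.
#![allow(dead_code)]

#[derive(Clone, PartialEq)]
struct Cid(String);

struct Record {
    seq: u64,     // starts at 0, increments with each publication
    prev: Cid,    // CID before this update
    current: Cid, // CID after this update
}

/// Does `next` directly extend `tip` in the publisher's linear sequence?
fn extends(tip: &Record, next: &Record) -> bool {
    next.seq == tip.seq + 1 && next.prev == tip.current
}
```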

PSCM Confidentiality

An important feature of the PSCM is that an arbitrary third party, Mal, cannot track updates from Alice's workspace without subscribing to its pubid. If Alice is working on a private project, she can ensure she sends the pubid to Bob and Charlie privately, and rely on them not to share the pubid elsewhere without checking with her first.

In other words a pubid is a capability for two actions: retrieving and reading records from the publisher.

The Store

The Store is a universal immutable data storage and distribution layer, spread across a network. The term "the Store" spans both sides of the Authorization / Distribution distinction, but in this section we describe only the Store Capabilities Model.

The key primitive is hash-based indexing, along with "directory" nodes for linking together data within the Store.

Content Identifiers, or CIDs for short, are (relatively) short self-authenticating immutable data references. At an elementary level, the data they refer to is an arbitrary sequence of bytes, called a blob. CIDs provide the following properties:

  • Compactness: CIDs are relatively short with a capped length (~32 or ~64 bytes).
  • Determinism: any two devices compute the same CID given the same blobs as inputs.
  • Collision Resistance: no two distinct blobs result in the same CID value.
  • Read Capability: a blob cannot be retrieved and read from any source without the corresponding CID. (TODO: Introduce immutable VerifyCap vs ReadCap terminology and semantics.)

Rust Crate: pangalactic-cid
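
A minimal sketch of hash-based content identifiers, assuming a blake3-style hash for illustration (pangalactic-cid defines the actual format and hash choice):

```rust
// Determinism and collision resistance of CIDs follow directly from the
// underlying cryptographic hash; this sketch uses the blake3 crate.
fn cid_for(blob: &[u8]) -> String {
    blake3::hash(blob).to_hex().to_string() // 32-byte digest, hex-encoded
}

fn main() {
    let a = cid_for(b"hello");
    let b = cid_for(b"hello");
    assert_eq!(a, b); // determinism: same blob => same CID on any device
    assert_ne!(a, cid_for(b"hello!")); // distinct blobs => distinct CIDs
}
```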

Directories are blobs containing a pangalactic-specific serialization of a directory structure, which holds a set of child links.

Each child link contains a link name which is unique within a given directory. It also includes a link kind which claims the referent blob is either a directory or a file. Finally it contains a link reference.

Rust Crates: pangalactic-dir, pangalactic-linkkind, pangalactic-link

A link reference is either just a bare CID or it is a (record, path) tuple. A link reference can be resolved to a "direct CID" which is either just the bare CID in the first case, or the result of traversing the path from the current CID of the record. When a link reference is a bare CID, it's called a hard link and when it is a (record, path) it is called a splice link or splice for short. The path selects a subpath within the record which can be used to splice to previous history, specific narratives (especially releases), etc...

Note: Splices are not yet implemented.
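
Since splices are still design-stage, here is only a hypothetical sketch of link references and their resolution to a direct CID (names are stand-ins):

```rust
// Hypothetical model of link references; not the pangalactic-link API.
#![allow(dead_code)]

#[derive(Clone)]
struct Cid(String);
struct Record { current: Cid }

enum LinkRef {
    /// A bare CID: a hard link.
    Hard(Cid),
    /// A (record, path) tuple: a splice link.
    Splice { record: Record, path: Vec<String> },
}

impl LinkRef {
    /// Resolve to a "direct CID".
    fn resolve(&self) -> Cid {
        match self {
            LinkRef::Hard(cid) => cid.clone(),
            // Traverse `path` starting from the record's current CID.
            LinkRef::Splice { record, path } => traverse(&record.current, path),
        }
    }
}

fn traverse(start: &Cid, _path: &[String]) -> Cid {
    start.clone() // stand-in for directory traversal within the Store
}
```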

Splices are designed to support several important pg use cases:

  • Unified Subscriptions & Locking: An important property is that splices simultaneously provide a subscription to receive future updates and locking. Locking (also called pinning) is a common feature of dependency management systems to ensure a particular revision of software is tied to specific revisions of all of its dependencies. Transitive pinning is necessary to ensure any two hosts that build the software are building with the same inputs. One of pg's target use cases is deterministic builds, so making pinning a first-class feature meets this goal.

  • Maintaining peer subscriptions of a workspace. These are subscriptions to peer workspaces which enable users to share code patches with each other. This is somewhat akin to the git revision control tool's .git/refs/remotes tracking, except that every peer is a signed record (and revision controlled).

  • Maintaining "embedded workspaces" in a workspace. Embedded workspaces allow a user to embed the source code of other projects into their source code via a slice, which enables explicit version control across the containing and embedded projects. For example, a rust project could splice many dependencies into a "cargo vendor" directory structure so that cargo will build the primary (containing) workspace without needing network access. Meanwhile, the containing workspace also is guaranteed to have pinned subscriptions to the dependencies and so dependency management is explicitly controlled by the containing workspace author. Avoiding network access is an important requirement for deterministic builds.3

Finally, because splices are first class in the PSCM, "lock updates" are also a first class feature via the pg update command. This can be used to discover and pin newer updates to dependencies, peer repositories, or embedded workspaces. Those changes, of course, will be part of the revision history of the workspace.

The Local Store

The Local Store is a user-scoped filesystem storage backend containing a subset of the complete universal Store.

Note: Because the Local Store is user-scoped, all of a user's revision control storage is shared across all projects on the system. For example, if a user clones the same project into multiple different working directories, the revision control storage is deduplicated / cached for all working directories.

Note: There is no equivalent of .git blob storage. This means if you need to ensure the complete history of a project is copied to a new host, you must either copy the entire Local Store (which may include other projects), or use the pg tool to perform the transfer.

Derivation

The pangalactic derivation system is a first-class deterministic computation system. Purpose-built code directs how to construct new files and directories in the Store given existing files and directories as the sole input.

The current implementation is built on WASM (via wasmtime) with a special-purpose binary "host call API".

A plan is a directory that specifies both the executable and inputs. A host can derive an attestation from a plan. An attestation contains the originating plan, the generated output (or deterministic error), a log, and supporting evidence that this is the correct attestation for the plan.

The supporting evidence should eventually be a ZKP, but in the short term we will implement signature-based attestation. (Note: When Alice derives outputs as part of the standard revision control process, the resulting attestations are implicitly signed by Alice's publisher.)
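
A sketch of plans and attestations as data, with the evidence variants described above; all names are illustrative stand-ins:

```rust
// Illustrative shapes for plans and attestations; not the real crates.
#![allow(dead_code)]

struct Cid(String);

/// A directory specifying the executable and its inputs.
struct Plan { executable: Cid, inputs: Cid }

/// The result of a host deriving a plan.
struct Attestation {
    plan: Cid,
    output: Result<Cid, String>, // generated output or deterministic error
    log: Cid,
    evidence: Evidence,
}

enum Evidence {
    /// Short term: a signature (e.g. by the deriving user's publisher).
    Signature(Vec<u8>),
    /// Eventually: a ZKP of computational integrity.
    Zkp(Vec<u8>),
}
```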


  1. This possibility, as well as most of the authorization architecture of pangalactic, is directly inspired by the Tahoe-LAFS project.

  2. A hypothesized hazard is that users will relatively frequently cp -r their workspace, then continue revision control in both locations. This will create colliding records. This should be detected and the user could be guided to convert one of the copies to a fork.

  3. This enables pg repositories to meet the use cases of git subtree and/or git submodules, except as a single first-class feature that behaves the same as other uses of splice links.

Revision Control

Revision control is the "flagship app" built atop the pangalactic infrastructure, primarily accessible through the pg end-user tool.

In addition to the top-level pangalactic architecture, this chapter assumes you are familiar with decentralized revision control tools, such as git, hg, monotone, darcs, etc...

Concepts

Publications and Subscriptions

A publication refers to a self-authenticated revision control history.

Workspaces

When a user initializes a new project with pg init they convert a host directory into a newly created workspace with these components:

  • <workdir> - the working directory is the directory which was passed in the --workdir option to pg init (which defaults to the current directory) and we will refer to it as workdir for short.
  • <workdir>/.pg - the book-keeping directory contains revision control metadata. This is the only subdirectory within the workdir which pg treats specially. (TODO: Review this claim as the design evolves.)
  • <workdir>/.pg/prev.pglink - the previous transaction link is present for every

Let's Hack Pangalactic!

So you wanna hack on pangalactic, eh?

The first step is quick and easy, take the Pangalactic Oath:

I dedicate some of my time toward improving pangalactic as I see fit in service of our glorious mission to explore all of the Milky Way.

Don't fret about such an audacious Oath; we may fail, but it's worth a shot.

Prerequisites

You'll benefit from some familiarity with the User Guide (such as Building) and Design (especially the Vision).

With that in mind, check out the Implementation overview to get a feel for how it's currently implemented.

The Journey Continues

... and if you've made it this far, best of luck. You've attained True Enlightenment™️ 1!


  1. At least insofar as these incomplete docs about a work-in-progress software project can help one attain True Enlightenment™️.

Development Without nix

a.k.a. the Tedious Way

nix Training Wheels

If you want to run all of the build steps directly, but still benefit from nix to set up a complete development environment, you can run nix develop. This will drop you into a new shell process pre-configured with all the appropriate build tool releases to run everything directly without nix intermediating (beyond the environment setup).

No nix at All

Finally, if you want to build without nix present at all on your system, you will have to configure your development environment with all of the prerequisites.

In this case, the flake.nix and nix-support/ files are the best documentation for how to install and configure the prerequisite development environment, because that is the literal code exercised by continuous integration. Start with nix-support/init-system.nix for the definitions of the build targets (in packages) and the development environment (in devShells.default).

In broad strokes the prerequisites for a non-nix system are:

  • For the binaries:
    • rustup (or the specific toolchain specified in rust-toolchain.toml)
    • The cargo, clippy, rustc, and rustfmt rust toolchain "components".
    • The wasm32-unknown-unknown rust toolchain "target" along with the target for your host. 1
  • For the book:
    • mdbook, cargo-depgraph, and graphviz.

Binaries Build Process

... TODO: We want to simplify the non-nix and nix build process before writing this section.


  1. None of this book covers cross-compiling specifically, but we note that a benefit of building with nix is much easier cross-compilation support.

Implementation

There are separate commandline tools for accessing the different architectural layers of pangalactic (see Architecture).

Pangalactic is implemented primarily in rust, with nix orchestrating the build process. Deterministic computation relies on an assumption that WASM is deterministic 1, and wasmtime is currently used as the execution environment. Both "host" code and "WASM guest" code are implemented in rust. Some crates are built both for the target host architecture and the WASM architecture to share data types across that boundary.
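
For orientation, a minimal wasmtime embedding sketch: the host compiles a tiny guest module and calls an exported function. Pangalactic's actual host call API and guest toolchain are more elaborate; this shows only the wasmtime basics.

```rust
// Minimal wasmtime embedding: compile a WAT guest and call its export.
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> wasmtime::Result<()> {
    let engine = Engine::default();
    // A tiny guest defined inline in WAT, for the sake of the sketch.
    let module = Module::new(
        &engine,
        r#"(module
              (func (export "add") (param i32 i32) (result i32)
                local.get 0
                local.get 1
                i32.add))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let add = instance.get_typed_func::<(i32, i32), i32>(&mut store, "add")?;
    assert_eq!(add.call(&mut store, (2, 3))?, 5);
    Ok(())
}
```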


  1. This needs careful investigation to ensure it is the case. The WASM specifications and issue trackers are not explicit about this goal. In any case, we hope to replace WASM with a system supporting concise non-interactive proofs of computational integrity.

Crate Intra-Dependencies

Here are the current crate dependencies within this workspace (with redundant transitive dependencies omitted):

Intra-Workspace Dependencies